Linux cgroups development
* Re: [RFC PATCH v2 0/7] mm, swap: Virtual Swap Space (Swap Table Edition)
From: YoungJun Park @ 2026-06-16  1:29 UTC
  To: Yosry Ahmed
  Cc: Nhat Pham, akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	chengming.zhou, ljs, liam, vbabka, rppt, surenb, qi.zheng,
	axelrasmussen, yuanchu, weixugc, riel, gourry, haowenchao22,
	kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <CAO9r8zPj5EH8Mbpc6N+d1u2eEgoV33f+4s=v-84gaobAodPtUw@mail.gmail.com>

On Mon, Jun 15, 2026 at 12:56:26PM -0700, Yosry Ahmed wrote:
> On Sun, Jun 14, 2026 at 7:39 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Sun, Jun 14, 2026 at 4:20 AM YoungJun Park <youngjun.park@lge.com> wrote:
> > >
> > > ...
> > > > * Integration with swap.tier by Youngjun (see [12]). For now, I'm
> > > >   leaning towards opting out the vswap device from swap.tier entirely, and
> > > >   treat it as a special device. Integrating it with swap.tiers will
> > > >   benefit the cases where you want some cgroups to skip vswap for fast
> > > >   swap devices (pmem), whereas other should go through zswap first. But
> > > >   most other use cases, either the overhead of vswap will be acceptable
> > > >   (or not the bottleneck), or we can just disable CONFIG_VSWAP entirely :)
> > > >
> > > >   Youngjun, may I ask for your thoughts on this?
> > >
> > > Hi Nhat,
> > >
> > > Tier 1: VSWAP, Tier 2: ZSWAP ...
> > >
> > > I don't see any problem applying the desired functionality with the
> > > currently proposed mechanism and interface. With this, a user would be
> > > assigned the default Virtual -> RAM swap tier, and the overall picture
> > > becomes one where swap tiers are composed according to the priority
> > > setting.
> >
> > It's more - is there a strong argument to let vswap be a tier (which
> > is not supported by just turning off vswap altogether).
> >
> > Because right now I'm not exposing vswap device to userspace in any
> > manner, pretty much. It's abstract and transparent, and minimizes
> > complexity (no vswap and swap.tier interaction) and surfaces for
> > issues.
> 
> I definitely think vswap should *not* be a tier. First of all, a vswap
> entry can be backed by zswap or an actual swap device, which would be
> two different tiers. How does that work?
> 
> I also think vswap should not be exposed to userspace in any way, at
> least not now. I still think we should aim to just make the
> redirection layer always on and eliminate "vswap devices".

After following the answers and giving it some thought, I agree that
vswap should be kept user-transparent. If there is a strict need to
disable it, relying on CONFIG_VSWAP to remove it entirely seems like
the right approach.

If a strong use case for user interaction emerges in the future, we can
revisit the design and figure out how to handle it at that time.

Thanks,
Youngjun Park


* [PATCH V3] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Zizhi Wo @ 2026-06-16  1:17 UTC
  To: axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1, houtao1, yukuai, wozizhi

From: Zizhi Wo <wozizhi@huawei.com>

[BUG]
Our fuzz testing triggered a blkcg use-after-free issue:

  BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
  Call Trace:
  ...
  blkcg_deactivate_policy+0x244/0x4d0
  ioc_rqos_exit+0x44/0xe0
  rq_qos_exit+0xba/0x120
  __del_gendisk+0x50b/0x800
  del_gendisk+0xff/0x190
  ...

[CAUSE]
process1						process2
cgroup_rmdir
...
  css_killed_work_fn
    offline_css
    ...
      blkcg_destroy_blkgs
      ...
        __blkg_release
	  css_put(&blkg->blkcg->css)
          blkg_free
	    INIT_WORK(xxx, blkg_free_workfn)
	    schedule_work
    css_put
    ...
      blkcg_css_free
        kfree(blkcg)--------blkcg has been freed!!!
====================================schedule_work
              blkg_free_workfn
							__del_gendisk
							  rq_qos_exit
							    ioc_rqos_exit
							      blkcg_deactivate_policy
							        mutex_lock(&q->blkcg_mutex)
								spin_lock_irq(&q->queue_lock)
							        list_for_each_entry(blkg, xxx)
								  blkcg = blkg->blkcg
								  spin_lock(&blkcg->lock)-------UAF!!!
	        mutex_lock(&q->blkcg_mutex)
	        spin_lock_irq(&q->queue_lock)
	        /* Only then is the blkg removed from the list */
	        list_del_init(&blkg->q_node)

As a result, a blkg can still be reachable through q->blkg_list while
its ->blkcg has already been freed.

[Fix]
Fix this by deferring the blkcg css_put() until after the blkg has been
unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
blkcg outlives every blkg still reachable through q->blkg_list, so any
iterator holding q->queue_lock is guaranteed to observe a valid
blkg->blkcg.

While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
so that the css reference is owned by the alloc/free pair rather than
straddling layers:
blkg_alloc()  <-> blkg_free()
blkg_create() <-> blkg_destroy()

Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Suggested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai@fygo.io>
---
v3:
 - move css_put() after mutex_unlock() in blkg_free_workfn().

v2:
 - Move css_tryget_online() from blkg_create() into blkg_alloc() so the
   css reference follows the blkg's own lifetime, making the put in
   blkg_free_workfn() symmetric with the get in blkg_alloc().

v1: https://lore.kernel.org/all/20260518010932.633707-1-wozizhi@huaweicloud.com/
 block/blk-cgroup.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bc63bd220865..3ac41f766caf 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -136,6 +136,11 @@ static void blkg_free_workfn(struct work_struct *work)
 	spin_unlock_irq(&q->queue_lock);
 	mutex_unlock(&q->blkcg_mutex);
 
+	/*
+	 * Release blkcg css ref only after blkg is removed from q->blkg_list,
+	 * so concurrent iterators won't see a blkg with a freed blkcg.
+	 */
+	css_put(&blkg->blkcg->css);
 	blk_put_queue(q);
 	free_percpu(blkg->iostat_cpu);
 	percpu_ref_exit(&blkg->refcnt);
@@ -179,8 +184,6 @@ static void __blkg_release(struct rcu_head *rcu)
 	for_each_possible_cpu(cpu)
 		__blkcg_rstat_flush(blkcg, cpu);
 
-	/* release the blkcg and parent blkg refs this blkg has been holding */
-	css_put(&blkg->blkcg->css);
 	blkg_free(blkg);
 }
 
@@ -313,6 +316,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
 		goto out_exit_refcnt;
 	if (!blk_get_queue(disk->queue))
 		goto out_free_iostat;
+	/* blkg holds a reference to blkcg */
+	if (!css_tryget_online(&blkcg->css))
+		goto out_put_queue;
 
 	blkg->q = disk->queue;
 	INIT_LIST_HEAD(&blkg->q_node);
@@ -353,6 +359,8 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
 	while (--i >= 0)
 		if (blkg->pd[i])
 			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
+	css_put(&blkcg->css);
+out_put_queue:
 	blk_put_queue(disk->queue);
 out_free_iostat:
 	free_percpu(blkg->iostat_cpu);
@@ -381,18 +389,12 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 		goto err_free_blkg;
 	}
 
-	/* blkg holds a reference to blkcg */
-	if (!css_tryget_online(&blkcg->css)) {
-		ret = -ENODEV;
-		goto err_free_blkg;
-	}
-
 	/* allocate */
 	if (!new_blkg) {
 		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
 		if (unlikely(!new_blkg)) {
 			ret = -ENOMEM;
-			goto err_put_css;
+			goto err_free_blkg;
 		}
 	}
 	blkg = new_blkg;
@@ -402,7 +404,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
 		if (WARN_ON_ONCE(!blkg->parent)) {
 			ret = -ENODEV;
-			goto err_put_css;
+			goto err_free_blkg;
 		}
 		blkg_get(blkg->parent);
 	}
@@ -442,8 +444,6 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 	blkg_put(blkg);
 	return ERR_PTR(ret);
 
-err_put_css:
-	css_put(&blkcg->css);
 err_free_blkg:
 	if (new_blkg)
 		blkg_free(new_blkg);
-- 
2.52.0
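
The ordering argument in the [Fix] section above can be modelled in
userspace. The following is only a minimal single-threaded sketch, not
kernel code: `struct owner` and `struct node` are hypothetical stand-ins
for the blkcg and the blkg, `on_list` stands for membership in
q->blkg_list, and no locking or real concurrency is modelled. It only
shows why dropping the owner reference before unlinking leaves a window
in which a list walker can reach a node whose owner is already gone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct blkcg and struct blkcg_gq. */
struct owner {
	int refcnt;
	bool freed;	/* set instead of calling free() so the model can be inspected */
};

struct node {
	struct owner *owner;
	bool on_list;	/* models membership in q->blkg_list */
};

/* Drop one reference; mark the owner freed when the count hits zero. */
void owner_put(struct owner *o)
{
	assert(o->refcnt > 0);
	if (--o->refcnt == 0)
		o->freed = true;
}

/* What a list walker would observe for this node. */
bool walker_sees_freed_owner(const struct node *n)
{
	return n->on_list && n->owner->freed;
}

/* Old order: reference dropped first, node unlinked later. */
bool teardown_put_then_unlink(struct node *n)
{
	bool bad;

	owner_put(n->owner);
	bad = walker_sees_freed_owner(n);	/* window before the unlink */
	n->on_list = false;
	return bad;
}

/* Patched order: unlink first, then drop the reference. */
bool teardown_unlink_then_put(struct node *n)
{
	bool bad;

	n->on_list = false;
	bad = walker_sees_freed_owner(n);
	owner_put(n->owner);
	return bad || walker_sees_freed_owner(n);
}
```

In this model, teardown_put_then_unlink() reports that a walker could
observe a freed owner, while teardown_unlink_then_put() does not, since
the node is no longer reachable by the time the last reference is
dropped.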



* Re: [PATCH V2] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Hou Tao @ 2026-06-16  1:23 UTC
  To: yukuai, Zizhi Wo, axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1
In-Reply-To: <70642ddf-9ed9-45cb-bf40-891a07247c97@fnnas.com>

Hi,

On 6/16/2026 12:16 AM, Yu Kuai wrote:
> Hi,
>
> On 2026/6/15 19:55, Zizhi Wo wrote:
>> From: Zizhi Wo <wozizhi@huawei.com>
>>
>> [BUG]
>> Our fuzz testing triggered a blkcg use-after-free issue:
>>
>>    BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
>>    Call Trace:
>>    ...
>>    blkcg_deactivate_policy+0x244/0x4d0
>>    ioc_rqos_exit+0x44/0xe0
>>    rq_qos_exit+0xba/0x120
>>    __del_gendisk+0x50b/0x800
>>    del_gendisk+0xff/0x190
>>    ...
>>
>> [CAUSE]
>> process1						process2
>> cgroup_rmdir
>> ...
>>    css_killed_work_fn
>>      offline_css
>>      ...
>>        blkcg_destroy_blkgs
>>        ...
>>          __blkg_release
>> 	  css_put(&blkg->blkcg->css)
>>            blkg_free
>> 	    INIT_WORK(xxx, blkg_free_workfn)
>> 	    schedule_work
>>      css_put
>>      ...
>>        blkcg_css_free
>>          kfree(blkcg)--------blkcg has been freed!!!
>> ====================================schedule_work
>>                blkg_free_workfn
>> 							__del_gendisk
>> 							  rq_qos_exit
>> 							    ioc_rqos_exit
>> 							      blkcg_deactivate_policy
>> 							        mutex_lock(&q->blkcg_mutex)
>> 								spin_lock_irq(&q->queue_lock)
>> 							        list_for_each_entry(blkg, xxx)
>> 								  blkcg = blkg->blkcg
>> 								  spin_lock(&blkcg->lock)-------UAF!!!
>> 	        mutex_lock(&q->blkcg_mutex)
>> 	        spin_lock_irq(&q->queue_lock)
>> 	        /* Only then is the blkg removed from the list */
>> 	        list_del_init(&blkg->q_node)
>>
>> As a result, a blkg can still be reachable through q->blkg_list while
>> its ->blkcg has already been freed.
>>
>> [Fix]
>> Fix this by deferring the blkcg css_put() until after the blkg has been
>> unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
>> blkcg outlives every blkg still reachable through q->blkg_list, so any
>> iterator holding q->queue_lock is guaranteed to observe a valid
>> blkg->blkcg.
>>
>> While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
>> so that the css reference is owned by the alloc/free pair rather than
>> straddling layers:
>> blkg_alloc()  <-> blkg_free()
>> blkg_create() <-> blkg_destroy()
>>
>> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
>> Suggested-by: Hou Tao <houtao1@huawei.com>
>> Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
>> ---
>> v2:
>>   - Move css_tryget_online() from blkg_create() into blkg_alloc() so the
>>     css reference follows the blkg's own lifetime, making the put in
>>     blkg_free_workfn() symmetric with the get in blkg_alloc().
>>
>> v1: https://lore.kernel.org/all/20260518010932.633707-1-wozizhi@huaweicloud.com/
>>
>>   block/blk-cgroup.c | 24 ++++++++++++------------
>>   1 file changed, 12 insertions(+), 12 deletions(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index bc63bd220865..27414c291e49 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -132,10 +132,15 @@ static void blkg_free_workfn(struct work_struct *work)
>>   	if (blkg->parent)
>>   		blkg_put(blkg->parent);
>>   	spin_lock_irq(&q->queue_lock);
>>   	list_del_init(&blkg->q_node);
>>   	spin_unlock_irq(&q->queue_lock);
>> +	/*
>> +	 * Release blkcg css ref only after blkg is removed from q->blkg_list,
>> +	 * so concurrent iterators won't see a blkg with a freed blkcg.
>> +	 */
>> +	css_put(&blkg->blkcg->css);
>>   	mutex_unlock(&q->blkcg_mutex);
> Please move css_put after mutex_unlock, unless there is a strong reason.

I think blkcg_mutex is used here to serialize access to blkg->q_node
and blkg->blkcg. We could move the css_put after the mutex_unlock(),
but it would still depend implicitly on the mutex_lock()/mutex_unlock()
pair on blkcg_mutex. Instead of relying on that implicit dependency, we
move the css_put inside the lock to make it explicit.
>
> With above change, feel free to add:
>
> Reviewed-by: Yu Kuai <yukuai@fygo.io>
>
>>   
>>   	blk_put_queue(q);
>>   	free_percpu(blkg->iostat_cpu);
>>   	percpu_ref_exit(&blkg->refcnt);
>> @@ -177,12 +182,10 @@ static void __blkg_release(struct rcu_head *rcu)
>>   	 * blkg_stat_lock is for serializing blkg stat update
>>   	 */
>>   	for_each_possible_cpu(cpu)
>>   		__blkcg_rstat_flush(blkcg, cpu);
>>   
>> -	/* release the blkcg and parent blkg refs this blkg has been holding */
>> -	css_put(&blkg->blkcg->css);
>>   	blkg_free(blkg);
>>   }
>>   
>>   /*
>>    * A group is RCU protected, but having an rcu lock does not mean that one
>> @@ -311,10 +314,13 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>>   	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
>>   	if (!blkg->iostat_cpu)
>>   		goto out_exit_refcnt;
>>   	if (!blk_get_queue(disk->queue))
>>   		goto out_free_iostat;
>> +	/* blkg holds a reference to blkcg */
>> +	if (!css_tryget_online(&blkcg->css))
>> +		goto out_put_queue;
>>   
>>   	blkg->q = disk->queue;
>>   	INIT_LIST_HEAD(&blkg->q_node);
>>   	blkg->blkcg = blkcg;
>>   	blkg->iostat.blkg = blkg;
>> @@ -351,10 +357,12 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>>   
>>   out_free_pds:
>>   	while (--i >= 0)
>>   		if (blkg->pd[i])
>>   			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
>> +	css_put(&blkcg->css);
>> +out_put_queue:
>>   	blk_put_queue(disk->queue);
>>   out_free_iostat:
>>   	free_percpu(blkg->iostat_cpu);
>>   out_exit_refcnt:
>>   	percpu_ref_exit(&blkg->refcnt);
>> @@ -379,32 +387,26 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>>   	if (blk_queue_dying(disk->queue)) {
>>   		ret = -ENODEV;
>>   		goto err_free_blkg;
>>   	}
>>   
>> -	/* blkg holds a reference to blkcg */
>> -	if (!css_tryget_online(&blkcg->css)) {
>> -		ret = -ENODEV;
>> -		goto err_free_blkg;
>> -	}
>> -
>>   	/* allocate */
>>   	if (!new_blkg) {
>>   		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
>>   		if (unlikely(!new_blkg)) {
>>   			ret = -ENOMEM;
>> -			goto err_put_css;
>> +			goto err_free_blkg;
>>   		}
>>   	}
>>   	blkg = new_blkg;
>>   
>>   	/* link parent */
>>   	if (blkcg_parent(blkcg)) {
>>   		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
>>   		if (WARN_ON_ONCE(!blkg->parent)) {
>>   			ret = -ENODEV;
>> -			goto err_put_css;
>> +			goto err_free_blkg;
>>   		}
>>   		blkg_get(blkg->parent);
>>   	}
>>   
>>   	/* invoke per-policy init */
>> @@ -440,12 +442,10 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>>   
>>   	/* @blkg failed fully initialized, use the usual release path */
>>   	blkg_put(blkg);
>>   	return ERR_PTR(ret);
>>   
>> -err_put_css:
>> -	css_put(&blkcg->css);
>>   err_free_blkg:
>>   	if (new_blkg)
>>   		blkg_free(new_blkg);
>>   	return ERR_PTR(ret);
>>   }



* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: YoungJun Park @ 2026-06-16  1:03 UTC
  To: Yosry Ahmed
  Cc: Shakeel Butt, Hao Jia, Johannes Weiner, mhocko, tj, mkoutny,
	roman.gushchin, Nhat Pham, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he, joshua.hahnjy
In-Reply-To: <CAO9r8zOVqbJEaBqTHw=r2bYw7Lm1tO0TU9QuG+eH1rfqcTAJJQ@mail.gmail.com>

On Mon, Jun 15, 2026 at 12:55:09PM -0700, Yosry Ahmed wrote:
> > In that case, the internal logic could stay roughly the same rather
> > than counting via a page counter. Something like:
> >
> > 1. Change the interface shell: tier.*.max — allow only 0 ~ max.
> 
> What about a single interface as I suggested to remain consistent with
> memory tiering?

Hello Yosry!

I agree. While implementing the interface to check its feasibility, I
reconsidered it. Since swap tiers can be added or removed at runtime,
having static memory."tier_name".max files seems unnatural.

A single interface like `swap.tiers.max` would be better. We can use a
flat-keyed format (similar to io.weight, as you suggested):

echo ["tier_name"] ["0 or max"] > swap.tiers.max

I am now leaning towards this being a better direction than what I initially
suggested (memory.swap.tiers and memory.swap.tiers.effective).
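
To make the proposed write format concrete, here is a rough userspace
sketch of parsing one `<tier_name> <0|max>` line. It is only an
illustration of the format described above, not code from the swap tier
series: the function name parse_tier_max(), the caller-supplied name
buffer and the return convention are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Parse one "<tier_name> <0|max>" line (hypothetical helper).
 * Returns 0 on success, -1 on malformed input. On failure the
 * contents of @name and @enabled are unspecified.
 */
int parse_tier_max(const char *buf, char *name, size_t name_sz, bool *enabled)
{
	size_t len;

	assert(buf && name && enabled);

	/* First token: the tier name. */
	buf += strspn(buf, " \t");
	len = strcspn(buf, " \t\n");
	if (len == 0 || len >= name_sz)
		return -1;
	memcpy(name, buf, len);
	name[len] = '\0';

	/* Second token: only "0" or "max" is accepted. */
	buf += len;
	buf += strspn(buf, " \t");
	len = strcspn(buf, " \t\n");
	if (len == 3 && strncmp(buf, "max", 3) == 0)
		*enabled = true;
	else if (len == 1 && buf[0] == '0')
		*enabled = false;
	else
		return -1;

	/* Reject trailing junk after the value. */
	buf += len;
	buf += strspn(buf, " \t\n");
	return *buf == '\0' ? 0 : -1;
}
```

With this sketch, "ssd max" enables the tier named "ssd", "hdd 0"
disables "hdd", and anything else (a missing value, a numeric limit such
as "hdd 10", or extra tokens) is rejected.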

Considering other reviews and Shakeel's reply, I will update my swap tier
patch accordingly.

> > 2. Keep the internal logic as is: 0 disables the mask (child memcgs
> >    off too), max enables it (child memcgs on too).
> 
> I think a child should be able to disable a swap tier enabled by the
> parent, but not vice versa.

Yes, we are on the same page. I missed a part in my explanation. I meant
that the child's selected tiers should be a subset of the parent's (which
is how the current swap tier suggestion works). A child cannot enable a
tier that the parent has disabled.
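
As a rough illustration of the subset rule, assuming each cgroup keeps a
bitmask of the tiers it selects (one bit per tier), the effective set is
the intersection of the masks along the path to the root. The struct and
function names below are hypothetical and not taken from the swap tier
patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a memcg: only the fields needed for the rule. */
struct memcg_model {
	const struct memcg_model *parent;	/* NULL for the root */
	uint32_t tiers;				/* tiers selected here, one bit per tier */
};

/*
 * Effective tiers = own selection ANDed with every ancestor's selection.
 * A child can clear a bit its parent has set, but a bit the parent has
 * cleared stays cleared no matter what the child selects.
 */
uint32_t effective_tiers(const struct memcg_model *cg)
{
	uint32_t mask = UINT32_MAX;

	assert(cg != NULL);
	for (; cg; cg = cg->parent)
		mask &= cg->tiers;
	return mask;
}
```

For example, if the root selects tiers 0 and 1 (mask 0x3) and a child
selects tiers 1 and 2 (mask 0x6), the child's effective mask is 0x2:
tier 2 stays off because the parent disabled it.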

> > 3. memory.zswap.max integrates naturally (it's memory."tier_name".max).
> 
> Not really. memory.zswap.max is in terms of memory usage (compressed
> size), not swap usage (uncompressed size).

I see, memory.zswap.max needs to be maintained separately. I will look
closer into its semantics. I might have misunderstood this part!

> [..]
> > Let me clarify a part I wrote confusingly. Handling
> > memory.zswap.writeback via tiers is possible, but I don't think the
> > interface itself would be replaced even if memory.swap.tiers is adopted.
> >
> > Selecting only zswap in memory.swap.tiers would not just disable
> > writeback; it would also block regular swap entirely, which differs
> > slightly from the current semantic. (... "Per the cgroup v2 docs: a
> > zswap-only tier setting is subtly different from setting
> > memory.swap.max to 0, since it still allows pages to be written to the
> > zswap pool; this has no effect if zswap is disabled, and swapping is
> > allowed unless memory.swap.max is set to 0.")
> 
> I don't understand. How is disabling zswap writeback not equivalent to
> only enabling zswap as a tier?

Isn't there a case where zswap_store() fails and pages fall back to the
backing swap device?

- "zswap tier only": Only zswap is allowed. Fallback to other swap is
  blocked.
- "zswap writeback disabled": zswap is allowed, but if zswap_store()
  fails, pages can still fall back to other swap devices.

Because of this slight semantic difference, I thought they couldn't be
fully unified. If my understanding is correct, we could extend the zswap
tier to select the target swap device for writeback, but replacing the
writeback interface entirely might be difficult.

> Do you just mean the fact that disabling zswap writeback is a no-op if
> zswap is disabled? It's a different interface so I think a small

Yes, I think so too.

> semantic difference is okay. In practice, I doubt that zswap is being
> disabled at runtime.

I thought disabling zswap at runtime might have some use cases, but we
can discuss this further when we talk about the patch extending the
zswap tier.

Best regards,
Youngjun


* Re: [PATCH v3] security: Expand task_setscheduler LSM hook to include CPU affinity mask
From: Paul Moore @ 2026-06-15 22:03 UTC
  To: Aaron Tomlin
  Cc: tsbogend, jmorris, serge, mingo, juri.lelli, vincent.guittot,
	stephen.smalley.work, casey, longman, tj, hannes, mkoutny,
	chenridong, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	kprateek.nayak, omosnace, kees, neelx, sean, chjohnst, steve,
	mproche, nick.lange, cgroups, linux-mips, linux-fsdevel,
	linux-security-module, selinux, linux-kernel
In-Reply-To: <exlgb3dg2kwxgna6gx2qixexvwjjul7z2ya7npal2gz4jjtr7m@h4oxgd74gsbp>

On Mon, Jun 15, 2026 at 11:22 AM Aaron Tomlin <atomlin@atomlin.com> wrote:
> On Wed, May 27, 2026 at 09:19:11PM -0400, Aaron Tomlin wrote:
> > On Wed, May 27, 2026 at 09:58:58PM +0200, Peter Zijlstra wrote:
> > > On Wed, May 27, 2026 at 01:41:52PM -0400, Aaron Tomlin wrote:
> > >
> > > > > > The actual use case here is multi-tenant workload isolation and visibility.
> > > > > > Passing the evaluated cpumask to the BPF LSM allows operators to write a
> > > > > > simple eBPF program to detect spatial boundary overlaps (e.g., logging an
> > > > > > event if a requested mask intersects with platform-reserved cores).
> > >
> > > Why isn't cgroups good enough to enforce this? If you create a cgroup
> > > hierarchy per tenant, and constrain them using the cpuset controller,
> > > they should not be able to escape, rendering this event impossible.
> >
> > Hi Peter,
> >
> > You raise a very fair point. The cpuset cgroup controller is indeed the
> > kernel's primary vehicle for spatial enforcement, and under normal
> > circumstances, it successfully prevents a tenant from escaping their
> > designated cores.
> >
> > The cpuset controller does govern resource limits, but does not audit
> > intent. When __sched_setaffinity() is invoked, the kernel compares the
> > requested in_mask against the task's allowed cpuset. If there is only a
> > partial intersection, the kernel silently truncates the requested mask to
> > fit the cpuset, without raising any alarm.
> >
> > The BPF LSM hook, conversely, receives the raw, untruncated in_mask,
> > affording operators the visibility to detect, audit, and even reject these
> > violations of intent before the kernel silently sanitises the input.
> >
> > This patch does not seek to replace the cpuset controller, but rather to
> > complement it by providing auditing capabilities.
> >
> > > > We are not creating a bespoke BPF hook here; rather, we are rectifying a
> > > > historical blind spot within the API. The existing LSM hook is invoked
> > > > during sched_setaffinity(), yet it presently receives only the task_struct
> > > > pointer. Consequently, the security module is essentially asked, "Should
> > > > Process A be permitted to alter Process B's affinity?" without being
> > > > informed of the proposed affinity itself. Providing in_mask simply
> > > > furnishes the existing hook with the requisite payload to make an informed
> > > > decision.
> > >
> > > It occurs to me that this same argument would require to also pass in
> > > the new sched_attr, no? That way the LSM can inspect the new policy
> > > before it becomes effective.
> >
> > I agree, the underlying logic does indeed extend perfectly to sched_attr.
> >
> > Presently, the LSM is equally oblivious as to whether a process is
> > requesting a benign transition to SCHED_BATCH, or attempting to escalate
> > its privileges by requesting a real-time policy such as SCHED_FIFO with
> > maximum priority. Just as with the CPU mask, providing the sched_attr
> > payload would rectify this parallel blind spot, allowing BPF policies to
> > inspect and mediate scheduling attributes before they become effective.
> >
> > If you are amenable, I should be more than happy to expand the scope of the
> > forthcoming patch to include this. Alternatively, we could address the
> > sched_attr expansion in a separate, subsequent patch. Personally, I would
> > favour the latter approach, but please do let me know your preference.
> >
> > I very much look forward to hearing Paul's thoughts on whether this aligns
> > with the broader LSM vision.
>
> Hi Paul,
>
> I am writing to politely follow up on the discussion above regarding the
> proposed enhancement to the sched_setaffinity LSM hook.

Generally speaking I wait until all dependencies land in Linus' tree.
I've lost a lot of time in the past sorting out issues only to have
one of the dependencies rejected.

> As you will see from the thread, Peter Zijlstra and I have discussed the
> architectural justification for this change. While the cpuset cgroup
> controller effectively handles spatial enforcement, it silently truncates
> requested affinity masks. Passing the raw in_mask to the LSM hook enables
> security modules (such as the BPF LSM) to audit and mediate the actual
> intent of the request before the kernel sanitises the input, a capability
> that cgroups inherently lack.

The issue of resource control comes up from time to time within the
context of LSMs, and my general comment is that we likely need to see
a more comprehensive approach to what access control on resource
limits would look like from an LSM perspective.  We've seen a lot of
quick changes to solve very specific problems, but I have yet to see a
good proposal of what it would look like for a more comprehensive
approach.

There is also another issue to consider: none of the in-tree LSMs
currently use these new parameters, raising questions about their
purpose, maintainability, etc.  While this is not necessarily a deal
breaker, it does go along with my comment above about taking a more
holistic view of LSM resource controls.

To summarize, I haven't thought about this too much yet because there
are other fires/patches that don't (currently) have the dependency
issues of this patch.  I would also feel a lot better if there was an
in-tree user of this parameter and some discussion of how this might
fit into a more holistic approach to controlling resource limits in
the LSM subsystem.

-- 
paul-moore.com


* [GIT PULL] cgroup changes for v7.2
From: Tejun Heo @ 2026-06-15 22:01 UTC
  To: Linus Torvalds
  Cc: linux-kernel, cgroups, Johannes Weiner, Michal Koutný,
	Waiman Long

Hello,

The following changes since commit 4a39eda5fdd867fc39f3c039714dd432cee00268:

  cgroup/cpuset: Reset DL migration state on can_attach() failure (2026-05-10 22:14:49 -1000)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git tags/cgroup-for-7.2

for you to fetch changes up to a99ce697ea5e27b867c9ba4ee55fa5ba3b8d1188:

  cgroup: Migrate tasks to the root css when a controller is rebound (2026-06-02 08:25:29 -1000)

----------------------------------------------------------------
cgroup: Changes for v7.2

- Last cycle deferred css teardown on cgroup removal until the cgroup
  depopulated, so a css is not taken offline while tasks can still
  reference it. Disabling a controller through cgroup.subtree_control
  still had the same problem. This reworks the deferral from per-cgroup
  to per-css so that path is covered too.

- New RDMA controller monitoring files: rdma.peak for per-device peak
  usage and rdma.events / rdma.events.local for resource-limit
  exhaustion. The max-limit parser was rewritten, fixing two input
  parsing bugs.

- cpuset: fix a sched-domain leak on the domain-rebuild failure path and
  skip a redundant hardwall ancestor scan on v2.

- Misc: pair the remaining lockless cgroup.max.* reads with WRITE_ONCE,
  assorted selftest robustness fixes, and doc path corrections.

----------------------------------------------------------------
Chen Wandun (1):
      cgroup/cpuset: Skip hardwall ancestor scan in cpuset v2 in cpuset_current_node_allowed()

Costa Shulyupin (1):
      docs: cgroup: Fix stale source file paths

Guopeng Zhang (2):
      selftests/cgroup: enable memory controller in hugetlb memcg test
      cgroup/cpuset: Free sched domains on rebuild guard failure

Hongfu Li (3):
      selftests/cgroup: Fix incorrect variable check in online_cpus()
      selftests/cgroup: Add NULL check after malloc in cgroup_util.c
      selftests/cgroup: check malloc return value in alloc_anon functions

Ren Tamura (1):
      cgroup: pair max limit READ_ONCE() with WRITE_ONCE()

Tao Cui (8):
      cgroup/rdma: refactor resource parsing with match_table_t/match_token()
      selftests/cgroup: fix child process escaping to parent cleanup in test_cpucg_nice
      selftests/cgroup: fix misleading debug message in test_cgfreezer_time_child
      cgroup/rdma: add rdma.peak for per-device peak usage tracking
      cgroup/rdma: add rdma.events to track resource limit exhaustion
      cgroup/rdma: add rdma.events.local for per-cgroup allocation failure attribution
      cgroup/rdma: document rdma.peak, rdma.events and rdma.events.local
      cgroup/rdma: Drop unnecessary READ_ONCE() on event counters

Tejun Heo (7):
      Merge branch 'for-7.1-fixes' into for-7.2
      cgroup: Inline cgroup_has_tasks() in cgroup.h
      cgroup: Annotate unlocked nr_populated_* accesses with READ_ONCE/WRITE_ONCE
      cgroup: Move populated counters to cgroup_subsys_state
      cgroup: Add per-subsys-css kill_css_finish deferral
      cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
      cgroup: Migrate tasks to the root css when a controller is rebound

 Documentation/admin-guide/cgroup-v1/cgroups.rst    |   2 +-
 Documentation/admin-guide/cgroup-v1/memcg_test.rst |   2 +-
 Documentation/admin-guide/cgroup-v2.rst            |  53 ++++
 include/linux/cgroup-defs.h                        |  30 +-
 include/linux/cgroup.h                             |  27 +-
 include/linux/cgroup_rdma.h                        |   4 +
 kernel/cgroup/cgroup.c                             | 222 +++++++++------
 kernel/cgroup/cpuset-v1.c                          |   2 +-
 kernel/cgroup/cpuset.c                             |  10 +-
 kernel/cgroup/rdma.c                               | 315 ++++++++++++++++-----
 tools/testing/selftests/cgroup/lib/cgroup_util.c   |   9 +-
 tools/testing/selftests/cgroup/test_cpu.c          |   2 +-
 tools/testing/selftests/cgroup/test_cpuset_prs.sh  |   2 +-
 tools/testing/selftests/cgroup/test_freezer.c      |   2 +-
 .../testing/selftests/cgroup/test_hugetlb_memcg.c  |   8 +
 tools/testing/selftests/cgroup/test_memcontrol.c   |  53 ++--
 16 files changed, 532 insertions(+), 211 deletions(-)

--
tejun


* Re: [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server
From: Tejun Heo @ 2026-06-15 20:38 UTC
  To: Yuri Andriaccio, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Johannes Weiner,
	Michal Koutný
  Cc: cgroups, linux-kernel, Luca Abeni, Yuri Andriaccio
In-Reply-To: <20260608121546.69910-1-yurand2000@gmail.com>

Hello,

Looks great. Two things:

1. cpu.rt.internal doesn't follow the naming convention. The file is the
   cgroup's own budget (cpu.rt.max minus its children), so
   cpu.rt.max.effective.local fits better: .effective like
   cpuset.cpus.effective, .local like memory.events.local.

2. root's cpu.rt.max: sched_rt_runtime_us already caps total DL/RT
   bandwidth and rt-cgroups admit against the same pool, so what does
   reserving the cgroup share separately at root add? It's also a writable
   control on root, which we otherwise keep off root.

Thanks.

--
tejun


* Re: [RFC PATCH v2 0/7] mm, swap: Virtual Swap Space (Swap Table Edition)
From: Yosry Ahmed @ 2026-06-15 19:56 UTC
  To: Nhat Pham
  Cc: YoungJun Park, akpm, chrisl, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, david, muchun.song, shikemeng,
	baoquan.he, baohua, chengming.zhou, ljs, liam, vbabka, rppt,
	surenb, qi.zheng, axelrasmussen, yuanchu, weixugc, riel, gourry,
	haowenchao22, kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <CAKEwX=O23a4iWBZoewKVb8QqODte6r3Xijckw3_oCJNoiO9M5A@mail.gmail.com>

On Sun, Jun 14, 2026 at 7:39 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Sun, Jun 14, 2026 at 4:20 AM YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > ...
> > > * Integration with swap.tier by Youngjun (see [12]). For now, I'm
> > >   leaning towards opting out the vswap device from swap.tier entirely, and
> > >   treat it as a special device. Integrating it with swap.tiers will
> > >   benefit the cases where you want some cgroups to skip vswap for fast
> > > >   swap devices (pmem), whereas others should go through zswap first. But
> > >   most other use cases, either the overhead of vswap will be acceptable
> > >   (or not the bottleneck), or we can just disable CONFIG_VSWAP entirely :)
> > >
> > >   Youngjun, may I ask for your thoughts on this?
> >
> > Hi Nhat,
> >
> > Tier 1: VSWAP, Tier 2: ZSWAP ...
> >
> > I don't see any problem applying the desired functionality with the
> > currently proposed mechanism and interface. With this, a user would be
> > assigned the default Virtual -> RAM swap tier, and the overall picture
> > becomes one where swap tiers are composed according to the priority
> > setting.
>
> It's more - is there a strong argument to let vswap be a tier (which
> is not supported by just turning off vswap altogether).
>
> Because right now I'm not exposing vswap device to userspace in any
> manner, pretty much. It's abstract and transparent, and minimizes
> complexity (no vswap and swap.tier interaction) and surfaces for
> issues.

I definitely think vswap should *not* be a tier. First of all, a vswap
entry can be backed by zswap or an actual swap device, which would be
two different tiers. How does that work?

I also think vswap should not be exposed to userspace in any way, at
least not now. I still think we should aim to just make the
redirection layer always on and eliminate "vswap devices".

* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Yosry Ahmed @ 2026-06-15 19:55 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Shakeel Butt, Hao Jia, Johannes Weiner, mhocko, tj, mkoutny,
	roman.gushchin, Nhat Pham, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he, joshua.hahnjy
In-Reply-To: <ai5y923elCSZp41j@yjaykim-PowerEdge-T330>

> In that case, the internal logic could stay roughly the same rather
> than counting via a page counter. Something like:
>
> 1. Change the interface shell: tier.*.max — allow only 0 ~ max.

What about a single interface as I suggested to remain consistent with
memory tiering?

> 2. Keep the internal logic as is: 0 disables the mask (child memcgs
>    off too), max enables it (child memcgs on too).

I think a child should be able to disable a swap tier enabled by the
parent, but not vice versa.
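
A minimal sketch of that rule, assuming tiers are represented as a plain
bitmask. The names tier_mask_t and effective_tiers() are hypothetical and
not taken from the swap_tier series; the point is only that a child can
clear bits its parent has set, but cannot set bits its parent has cleared:

```c
#include <stdint.h>

/* Hypothetical: one bit per swap tier. */
typedef uint32_t tier_mask_t;

/*
 * Effective tiers of a child cgroup: whatever the child asks for,
 * limited to what its parent is effectively allowed to use.
 */
tier_mask_t effective_tiers(tier_mask_t parent_effective,
			    tier_mask_t child_requested)
{
	return parent_effective & child_requested;
}
```

With a parent allowing tiers 0 and 1 (0x3), a child asking for tiers 1 and
2 (0x6) ends up with tier 1 only.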

> 3. memory.zswap.max integrates naturally (it's memory."tier_name".max).

Not really. memory.zswap.max is in terms of memory usage (compressed
size), not swap usage (uncompressed size).

[..]
> Let me clarify a part I wrote confusingly. Handling
> memory.zswap.writeback via tiers is possible, but I don't think the
> interface itself would be replaced even if memory.swap.tiers is adopted.
>
> Selecting only zswap in memory.swap.tiers would not just disable
> writeback; it would also block regular swap entirely, which differs
> slightly from the current semantic. (... "Per the cgroup v2 docs: a
> zswap-only tier setting is subtly different from setting
> memory.swap.max to 0, since it still allows pages to be written to the
> zswap pool; this has no effect if zswap is disabled, and swapping is
> allowed unless memory.swap.max is set to 0.")

I don't understand. How is disabling zswap writeback not equivalent to
only enabling zswap as a tier?

Do you just mean the fact that disabling zswap writeback is a no-op if
zswap is disabled? It's a different interface so I think a small
semantic difference is okay. In practice, I doubt that zswap is being
disabled at runtime.

>
> So the interface itself needs to be retained, and it could be extended
> toward selective writeback — e.g., passing a desired tier into
> memory.zswap.writeback so writeback targets only that tier. Currently
> it only controls on/off. Other tiers probably don't need this. Demotion
> based on the selected tier should be enough.
>
> Thanks,
> Youngjun Park
>

On Sun, Jun 14, 2026 at 2:23 AM YoungJun Park <youngjun.park@lge.com> wrote:
>
> ....
> > >Based on the memcg interface currently proposed in swap_tier
> > > (memory.swap.tiers, memory.swap.tiers.effective), I think it aligns well
> > > with the current direction. It provides a foundation for selectively
> > > targeting devices in tier order.
> >
> > Here, instead of a cpuset-like interface, we may want a more zswap-like
> > interface where you can put a limit on the usage, i.e.
> > memory.swap.tier*.max. We can start by allowing only two values, 0 and
> > max, which will effectively be the same as what you need.
> >
>
> Good idea, and it's certainly feasible. When I considered this a while
> ago, the reasons I didn't take this direction were:
>
> 1. There's no real-world usage for adjusting the swap tier amount (it's
>    either 0 or MAX). That said, your suggestion to initially allow only
>    0 and max is the decisive point, and it's making me reconsider.
>
> 2. The implementation cost seems high. The current implementation
>    handles this at runtime via simple masking.
>
> 3. Relationship with swap.max:
>    - If we tie it to the current interface, wouldn't limiting the swap
>      amount within a selected tier already be possible? I wonder if
>      that alone is enough.
>    - If we add tier.max, it would need to be a subset of swap.max.
>      (Any other complexities here?)
>
> 4. vswap enable/disable: vswap doesn't seem to have an amount-control
>    aspect, so an on/off semantic would be clearer.
>    https://lore.kernel.org/linux-mm/ai5kOOmR1LPTWs1J@yjaykim-PowerEdge-T330/T/#m8831ec057bf9387978d3bd698f51920600e09a04
>
> In that case, the internal logic could stay roughly the same rather
> than counting via a page counter. Something like:
>
> 1. Change the interface shell: tier.*.max — allow only 0 ~ max.
> 2. Keep the internal logic as is: 0 disables the mask (child memcgs
>    off too), max enables it (child memcgs on too).
> 3. memory.zswap.max integrates naturally (it's memory."tier_name".max).
> 4. Extend later if use cases arise.
>
> On balance I still lean toward the current interface, but if a per-tier
> max is the better fit for memcg's direction and others feel the same,
> I'm happy to switch. I'd like to hear Shakeel's thoughts again, and I'm
> curious about others' opinions too.
>
> A few more perspectives on the points below.
>
> > I will respond to your other points later when I have time.
>
> > >
> > > To summarize the discussions so far, the following points align well.
> > >
> > > - Per-cgroup swap control, as I suggested.
> > > - Proactive zswap writeback (Hao's usecase)
> > > - Swap device target demotion (better still if it can be selective), as you mentioned:
> > >   https://lore.kernel.org/linux-mm/aicZ-5GX9De3MAU7@linux.dev/
> > > - Virtual Swap on/off in the future, as Nhat mentioned:
> > >   https://lore.kernel.org/linux-mm/20260528212955.1912856-1-nphamcs@gmail.com/
> > > - The memory.zswap.writeback alternative (no hierarchy model conflict)
> > > - zswap as the first swap tier.
> > > - Promotion (also better with selective usage).
> > > - Tier-based swap policy (e.g. round-robin ...)
> > >
> > > To accelerate this work, I believe we should reach a consensus and
> > > merge the currently proposed swap_tier interface :)
> > >
> > > If the above approach is difficult, I would like to suggest an
> > > alternative for progress with the memcg interfaces removed:
> > >
> > > 1) We could make zswap the first tier and create
> > > a use case where memory.zswap.writeback internally is handled by tier logic.
> > >
> > > 2) Or simply merge the swap_tier infrastructure itself first.
> > >
> > > This would allow the swap_tier infrastructure to be merged and discussed
> > > more easily.
> > >
> > > If adopting swap_tier takes longer anyway, doing this lets us move to the
> > > next step with it as an experimental feature.
> > >
> > > - Apply per-cgroup swap as an experimental (debugfs) feature.
> > > - Apply Hao's use case experimentally or as it is as Yosry suggested.
> > > (future migration to swap tier)
> > >
> > > What do you think?
> > >
> > > (FYI: My emails to kernel.org are failing due to internal server issues.)
> > >
> > > Thank you
> > > Youngjun Park
>
> Let me clarify a part I wrote confusingly. Handling
> memory.zswap.writeback via tiers is possible, but I don't think the
> interface itself would be replaced even if memory.swap.tiers is adopted.
>
> Selecting only zswap in memory.swap.tiers would not just disable
> writeback; it would also block regular swap entirely, which differs
> slightly from the current semantic. (... "Per the cgroup v2 docs: a
> zswap-only tier setting is subtly different from setting
> memory.swap.max to 0, since it still allows pages to be written to the
> zswap pool; this has no effect if zswap is disabled, and swapping is
> allowed unless memory.swap.max is set to 0.")
>
> So the interface itself needs to be retained, and it could be extended
> toward selective writeback — e.g., passing a desired tier into
> memory.zswap.writeback so writeback targets only that tier. Currently
> it only controls on/off. Other tiers probably don't need this. Demotion
> based on the selected tier should be enough.
>
> Thanks,
> Youngjun Park
>

* Re: [PATCH v6 0/6] [PATCH v6 0/6] Add reclaim to the dmem cgroup controller
From: Tejun Heo @ 2026-06-15 18:57 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Natalie Vock, Johannes Weiner, Michal Koutný,
	cgroups, Huang Rui, Matthew Brost, Matthew Auld,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <ajBJU-Jp2QVy14qt@slm.duckdns.org>

On Mon, Jun 15, 2026 at 08:49:55AM -1000, Tejun Heo wrote:
> The canonical behavior for cgroup2 would be not failing the write at all
> even when the usage can't be brought down below the new max. Updating the
> target configuration and tracking the current usage are separate operations.
> The former should just set max and trigger reclaim and a writer should not
> assume that a successful write indicates that the usage is below the written
> max value.

Sent too early. One of the reasons is that cgroup is hierarchical and there
can be multiple delegation layers and if you tie application of configuration
to immediate enforcement, some hierarchical control actions become racy and
awkward.

Here's an example: Imagine a system agent trying to lower usage in a subtree
which contains multiple delegated containers. If max can be set below what
reclaim can achieve immediately, it can just set the max and if the usage is
still too high, can go around and e.g. kill some of the containers. If the
max write fails, it'd have to kill and then try again, and in between someone
else might push up the usage.
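
To make the difference concrete, here is a toy userspace model, not the
dmem code. struct toy_cg, toy_reclaim() and toy_write_max() are made-up
names. The write always records the new limit and kicks a best-effort
reclaim; it does not report failure just because usage is still above max:

```c
#include <stdint.h>

/* Toy model of one cgroup's limit state; all names are hypothetical. */
struct toy_cg {
	uint64_t max;		/* configured limit */
	uint64_t usage;		/* current charge */
	uint64_t reclaimable;	/* how much reclaim can free right now */
};

/* Best-effort reclaim toward cg->max; may leave usage above max. */
void toy_reclaim(struct toy_cg *cg)
{
	uint64_t excess, freed;

	if (cg->usage <= cg->max)
		return;
	excess = cg->usage - cg->max;
	freed = excess < cg->reclaimable ? excess : cg->reclaimable;
	cg->usage -= freed;
	cg->reclaimable -= freed;
}

/*
 * Writing max updates the target and triggers reclaim. Success means
 * "the limit is set", not "usage is now below the limit".
 */
int toy_write_max(struct toy_cg *cg, uint64_t new_max)
{
	cg->max = new_max;
	toy_reclaim(cg);
	return 0;
}
```

In this model an agent that still sees usage above max after the write can
go on to kill containers, and the lowered limit is already in place while
it does so.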

Thanks.

-- 
tejun

* Re: [PATCH v6 0/6] [PATCH v6 0/6] Add reclaim to the dmem cgroup controller
From: Tejun Heo @ 2026-06-15 18:49 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Natalie Vock, Johannes Weiner, Michal Koutný,
	cgroups, Huang Rui, Matthew Brost, Matthew Auld,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <20260611173301.17473-1-thomas.hellstrom@linux.intel.com>

Hello,

On Thu, Jun 11, 2026 at 07:32:55PM +0200, Thomas Hellström wrote:
> When writing a "max" limit lower than the current usage, the
> existing code silently failed. This series aims to improve
> on that by returning -EBUSY on failure and also attempt
> to synchronously reclaim device memory to push the usage
> under the new max limit to avoid the error.

The canonical behavior for cgroup2 would be not failing the write at all
even when the usage can't be brought down below the new max. Updating the
target configuration and tracking the current usage are separate operations.
The former should just set max and trigger reclaim and a writer should not
assume that a successful write indicates that the usage is below the written
max value.

Thanks.

-- 
tejun

* Re: [PATCH V2] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Yu Kuai @ 2026-06-15 16:16 UTC (permalink / raw)
  To: Zizhi Wo, axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1, houtao1, yukuai
In-Reply-To: <20260615115556.1225472-1-wozizhi@huaweicloud.com>

Hi,

On 2026/6/15 19:55, Zizhi Wo wrote:
> From: Zizhi Wo <wozizhi@huawei.com>
>
> [BUG]
> Our fuzz testing triggered a blkcg use-after-free issue:
>
>    BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
>    Call Trace:
>    ...
>    blkcg_deactivate_policy+0x244/0x4d0
>    ioc_rqos_exit+0x44/0xe0
>    rq_qos_exit+0xba/0x120
>    __del_gendisk+0x50b/0x800
>    del_gendisk+0xff/0x190
>    ...
>
> [CAUSE]
> process1						process2
> cgroup_rmdir
> ...
>    css_killed_work_fn
>      offline_css
>      ...
>        blkcg_destroy_blkgs
>        ...
>          __blkg_release
> 	  css_put(&blkg->blkcg->css)
>            blkg_free
> 	    INIT_WORK(xxx, blkg_free_workfn)
> 	    schedule_work
>      css_put
>      ...
>        blkcg_css_free
>          kfree(blkcg)--------blkcg has been freed!!!
> ====================================schedule_work
>                blkg_free_workfn
> 							__del_gendisk
> 							  rq_qos_exit
> 							    ioc_rqos_exit
> 							      blkcg_deactivate_policy
> 							        mutex_lock(&q->blkcg_mutex)
> 								spin_lock_irq(&q->queue_lock)
> 							        list_for_each_entry(blkg, xxx)
> 								  blkcg = blkg->blkcg
> 								  spin_lock(&blkcg->lock)-------UAF!!!
> 	        mutex_lock(&q->blkcg_mutex)
> 	        spin_lock_irq(&q->queue_lock)
> 	        /* Only then is the blkg removed from the list */
> 	        list_del_init(&blkg->q_node)
>
> As a result, a blkg can still be reachable through q->blkg_list while
> its ->blkcg has already been freed.
>
> [FIX]
> Fix this by deferring the blkcg css_put() until after the blkg has been
> unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
> blkcg outlives every blkg still reachable through q->blkg_list, so any
> iterator holding q->queue_lock is guaranteed to observe a valid
> blkg->blkcg.
>
> While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
> so that the css reference is owned by the alloc/free pair rather than
> straddling layers:
> blkg_alloc()  <-> blkg_free()
> blkg_create() <-> blkg_destroy()
>
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Suggested-by: Hou Tao <houtao1@huawei.com>
> Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
> ---
> v2:
>   - Move css_tryget_online() from blkg_create() into blkg_alloc() so the
>     css reference follows the blkg's own lifetime, making the put in
>     blkg_free_workfn() symmetric with the get in blkg_alloc().
>
> v1: https://lore.kernel.org/all/20260518010932.633707-1-wozizhi@huaweicloud.com/
>
>   block/blk-cgroup.c | 24 ++++++++++++------------
>   1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index bc63bd220865..27414c291e49 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -132,10 +132,15 @@ static void blkg_free_workfn(struct work_struct *work)
>   	if (blkg->parent)
>   		blkg_put(blkg->parent);
>   	spin_lock_irq(&q->queue_lock);
>   	list_del_init(&blkg->q_node);
>   	spin_unlock_irq(&q->queue_lock);
> +	/*
> +	 * Release blkcg css ref only after blkg is removed from q->blkg_list,
> +	 * so concurrent iterators won't see a blkg with a freed blkcg.
> +	 */
> +	css_put(&blkg->blkcg->css);
>   	mutex_unlock(&q->blkcg_mutex);

Please move css_put() after mutex_unlock(), unless there is a strong reason
not to.

With the above change, feel free to add:

Reviewed-by: Yu Kuai <yukuai@fygo.io>
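
The requested ordering can be sketched with a toy, single-threaded model.
This is not kernel code: struct toy_owner, struct toy_node and struct
toy_queue are hypothetical stand-ins for blkcg, blkg and the request queue,
and the boolean "locked" merely marks where the mutex would be held. It only
shows the sequence unlink, unlock, put; it makes no claim about why the put
should sit outside the mutex:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for blkcg / blkg; none of these are kernel types. */
struct toy_owner {
	int refcnt;
	bool freed;
};

struct toy_node {
	struct toy_owner *owner;
	bool on_list;		/* reachable by list iterators */
};

struct toy_queue {
	bool locked;		/* stands in for the queue mutex */
};

void toy_owner_put(struct toy_owner *o)
{
	if (--o->refcnt == 0)
		o->freed = true;	/* a real put would free here */
}

/*
 * Release path in the suggested order: unlink under the lock, drop
 * the lock, and only then drop the owner reference.
 */
void toy_node_release(struct toy_queue *q, struct toy_node *n)
{
	q->locked = true;
	n->on_list = false;
	q->locked = false;

	/* Invariant: the owner is never put while the node is listed. */
	assert(!n->on_list && !q->locked);
	toy_owner_put(n->owner);
}
```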

>   
>   	blk_put_queue(q);
>   	free_percpu(blkg->iostat_cpu);
>   	percpu_ref_exit(&blkg->refcnt);
> @@ -177,12 +182,10 @@ static void __blkg_release(struct rcu_head *rcu)
>   	 * blkg_stat_lock is for serializing blkg stat update
>   	 */
>   	for_each_possible_cpu(cpu)
>   		__blkcg_rstat_flush(blkcg, cpu);
>   
> -	/* release the blkcg and parent blkg refs this blkg has been holding */
> -	css_put(&blkg->blkcg->css);
>   	blkg_free(blkg);
>   }
>   
>   /*
>    * A group is RCU protected, but having an rcu lock does not mean that one
> @@ -311,10 +314,13 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>   	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
>   	if (!blkg->iostat_cpu)
>   		goto out_exit_refcnt;
>   	if (!blk_get_queue(disk->queue))
>   		goto out_free_iostat;
> +	/* blkg holds a reference to blkcg */
> +	if (!css_tryget_online(&blkcg->css))
> +		goto out_put_queue;
>   
>   	blkg->q = disk->queue;
>   	INIT_LIST_HEAD(&blkg->q_node);
>   	blkg->blkcg = blkcg;
>   	blkg->iostat.blkg = blkg;
> @@ -351,10 +357,12 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
>   
>   out_free_pds:
>   	while (--i >= 0)
>   		if (blkg->pd[i])
>   			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
> +	css_put(&blkcg->css);
> +out_put_queue:
>   	blk_put_queue(disk->queue);
>   out_free_iostat:
>   	free_percpu(blkg->iostat_cpu);
>   out_exit_refcnt:
>   	percpu_ref_exit(&blkg->refcnt);
> @@ -379,32 +387,26 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>   	if (blk_queue_dying(disk->queue)) {
>   		ret = -ENODEV;
>   		goto err_free_blkg;
>   	}
>   
> -	/* blkg holds a reference to blkcg */
> -	if (!css_tryget_online(&blkcg->css)) {
> -		ret = -ENODEV;
> -		goto err_free_blkg;
> -	}
> -
>   	/* allocate */
>   	if (!new_blkg) {
>   		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
>   		if (unlikely(!new_blkg)) {
>   			ret = -ENOMEM;
> -			goto err_put_css;
> +			goto err_free_blkg;
>   		}
>   	}
>   	blkg = new_blkg;
>   
>   	/* link parent */
>   	if (blkcg_parent(blkcg)) {
>   		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
>   		if (WARN_ON_ONCE(!blkg->parent)) {
>   			ret = -ENODEV;
> -			goto err_put_css;
> +			goto err_free_blkg;
>   		}
>   		blkg_get(blkg->parent);
>   	}
>   
>   	/* invoke per-policy init */
> @@ -440,12 +442,10 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>   
>   	/* @blkg failed fully initialized, use the usual release path */
>   	blkg_put(blkg);
>   	return ERR_PTR(ret);
>   
> -err_put_css:
> -	css_put(&blkcg->css);
>   err_free_blkg:
>   	if (new_blkg)
>   		blkg_free(new_blkg);
>   	return ERR_PTR(ret);
>   }

-- 
Thanks,
Kuai

* Re: [PATCH v2 05/16] mm/slab: introduce alloc_flags and SLAB_ALLOC_TRYLOCK
From: Alexei Starovoitov @ 2026-06-15 15:49 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Suren Baghdasaryan
  Cc: Hao Li, Harry Yoo, Christoph Lameter, David Rientjes,
	Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, LKML,
	open list:CONTROL GROUP (CGROUP)
In-Reply-To: <f927f1b4-3f60-471e-b42b-8d098c1ce5dd@kernel.org>

On Mon Jun 15, 2026 at 2:02 AM PDT, Vlastimil Babka (SUSE) wrote:
> On 6/15/26 04:16, Alexei Starovoitov wrote:
>> On Sun, Jun 14, 2026 at 7:01 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>>
>>> On Thu, Jun 11, 2026 at 8:50 PM Hao Li <hao.li@linux.dev> wrote:
>>> >
>>> > On Wed, Jun 10, 2026 at 05:40:07PM +0200, Vlastimil Babka (SUSE) wrote:
>>> > > Similarly to the page allocators, introduce slab-allocator specific
>>> > > alloc flags that internally control allocation behavior in addition to
>>> > > gfp_flags, without occupying the limited gfp flags space.
>>> > >
>>> > > Introduce the first flag SLAB_ALLOC_TRYLOCK that behaves similarly to
>>> > > page allocator's ALLOC_TRYLOCK and will be used to reimplement
>>> > > kmalloc_nolock()'s "!allow_spin" behavior. That currently relies on
>>> > > gfpflags_allow_spinning() and thus the lack of both __GFP_RECLAIM flags,
>>> > > importantly __GFP_KSWAPD_RECLAIM. This can give false-positive results
>>> > > e.g. in early boot with a restricted gfp_allowed_mask.
>>> > >
>>> > > Also introduce alloc_flags_allow_spinning() to replace the usage of
>>> > > gfpflags_allow_spinning().
>>> > >
>>> > > Start using alloc_flags and the new check first in alloc_from_pcs() and
>>> > > __pcs_replace_empty_main(). This means some slab allocations that were
>>> > > falsely treated as kmalloc_nolock() due to their gfp flags will now have
>>> > > higher chances of succeed, and this will further increase with followup
>>>
>>> nit: I think it should be either "higher chances of success" or
>>> "higher chances to succeed".
>
> success it is
>
>>>
>>> > > changes.
>>> > >
>>> > > Remove a WARN_ON_ONCE() from refill_objects() as it's now legitimate to
>>> > > reach it from a slab allocation that's not _nolock() and yet lacks
>>> > > __GFP_KSWAPD_RECLAIM for other reasons.
>>> > >
>>> > > Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>>> > > ---
>>> >
>>> > Reviewed-by: Hao Li <hao.li@linux.dev>
>>>
>>> I would call SLAB_ALLOC_TRYLOCK something like SLAB_ALLOC_NOSPIN or
>>> SLAB_ALLOC_NOLOCK but naming is hard and I don't claim myself to be
>>> good at it. So, feel free to adopt my suggestion if you like it or
>>> ignore it otherwise.
>>>
>>> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>> 
>> Just noticed "trylock" in the #define SLAB_ALLOC_TRYLOCK
>> 
>> Please call it SLAB_ALLOC_NOLOCK.
>> 
>> Initial api was using 'trylock' name and it was a mistake,
>> since people assumed normal spin_trylock() like semantics.
>> "trylock" implies that it fails under contention
>> and retry is a normal next step. It's not the case.
>> No one should be retrying. That's why the final api was kmalloc_nolock().
>> So please keep this important distinction in the name.
>> SLAB_ALLOC_NOLOCK should mean that spinning locks
>> should not be taken. It should not mean "just go to trylock everywhere".
>
> Eh, ok then, will change to SLAB_ALLOC_NOLOCK. Even though it's mostly internal.
>
> So next thing we change page allocator's ALLOC_TRYLOCK to ALLOC_NOLOCK too?

yeah. Would be good to align as well.


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-06-15 15:38 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Gregory Price
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, Matthew Wilcox
In-Reply-To: <94d6c446-a8a6-485e-bb3c-ee809ebb1d3b@kernel.org>

On 6/15/26 17:27, Vlastimil Babka (SUSE) wrote:
> On 6/15/26 17:18, David Hildenbrand (Arm) wrote:
>> On 6/15/26 16:38, Vlastimil Babka (SUSE) wrote:
>>>
>>> I think the memalloc approach is dangerous due to unexpected nesting. There
>>> might be nested page allocations in page allocation itself (due to some
>>> debugging option). But also interrupts do not change what "current" points
>>> to. Suddenly those could start requesting folios and/or private nodes and be
>>> surprised, I'm afraid.
>>
>> Yeah, we'd need some way to distinguish the main allocation from these other
>> (nested) allocations.
> 
> That goes against the very principle of scopes. And I don't see how, except
> via a ... flag to the main allocation :D

Unless we teach the handful of debug callpaths to set a custom context. Have
some memalloc_context_save/restore.

I'd assume that the number of such nested allocations we can trigger from the
buddy (through some callbacks like kasan and such) should be rather limited.

I also wonder for a second whether we could make use of the _noprof vs.
!_noprof mechanism.

Essentially, calling a !_noprof variant ("external allocation interface") could
mean "save/restore folio allocation". Just a thought.
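
A rough userspace sketch of the save/restore idea, assuming a per-task
context word. struct toy_task, TOY_CTX_FOLIO and the toy_context_*()
helpers are hypothetical; memalloc_context_save/restore is only an idea
floated in this thread, not an existing API. The nested allocation clears
the scope for its own duration and puts it back afterwards. This does not
address the interrupt case raised earlier:

```c
/* Toy per-task allocation context; not the kernel's task_struct. */
struct toy_task {
	unsigned int alloc_ctx;
};

#define TOY_CTX_FOLIO	0x1u	/* hypothetical "folio allocation" scope bit */

/* Clear the scope bits for a nested allocation; return the old value. */
unsigned int toy_context_save(struct toy_task *t)
{
	unsigned int old = t->alloc_ctx;

	t->alloc_ctx = 0;
	return old;
}

/* Put back whatever the outer scope had set. */
void toy_context_restore(struct toy_task *t, unsigned int old)
{
	t->alloc_ctx = old;
}

/* A nested (e.g. debug) allocation that must not inherit the scope. */
unsigned int toy_nested_alloc_ctx(struct toy_task *t)
{
	unsigned int old = toy_context_save(t);
	unsigned int seen = t->alloc_ctx;	/* what the nested alloc observes */

	toy_context_restore(t, old);
	return seen;
}
```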

-- 
Cheers,

David

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-15 15:37 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Vlastimil Babka (SUSE), Balbir Singh, lsf-pc, linux-kernel,
	linux-cxl, cgroups, linux-mm, linux-trace-kernel, damon,
	kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	dan.j.williams, longman, akpm, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, mhiramat,
	mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj, baolin.wang,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, muchun.song,
	xu.xin16, chengming.zhou, jannh, linmiaohe, nao.horiguchi,
	pfalcato, rientjes, shakeel.butt, riel, harry.yoo, cl,
	roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
	zhengqi.arch, terry.bowman, Matthew Wilcox
In-Reply-To: <fdbdc9f7-d142-4880-b429-065d5056cabb@kernel.org>

On Mon, Jun 15, 2026 at 05:18:55PM +0200, David Hildenbrand (Arm) wrote:
> On 6/15/26 16:38, Vlastimil Babka (SUSE) wrote:
> > 
> > I think the memalloc approach is dangerous due to unexpected nesting. There
> > might be nested page allocations in page allocation itself (due to some
> > debugging option). But also interrupts do not change what "current" points
> > to. Suddenly those could start requesting folios and/or private nodes and be
> > surprised, I'm afraid.
> 
> Yeah, we'd need some way to distinguish the main allocation from these other
> (nested) allocations.
>
> 
> > 
> > The memalloc scopes only work well when they restrict the context wrt
> > reclaim, and allocations in IRQ have to be already restricted heavily
> > (atomic) so further memalloc restrictions don't do anything in practice. But
> > to make them change other aspects of the allocations like this won't work.
> 
> I was assuming that memalloc_pin_save() would already violate that, but really
> it only restricts where movable allocations land, and that doesn't matter for
> other kernel allocations.
> 
> Do you see any other way to make something like an allocation context work, and
> avoid introducing more GFP flags?
>

One thought would be a way to switch what fallback list is used, and
then have specific fallback lists for certain contexts.

Right now there is a single example of this: __GFP_THISNODE
  |= __GFP_THISNODE   =>  NOFALLBACK
  &= ~__GFP_THISNODE  =>  FALLBACK

We could add an interface that takes the desired fallback list as an
argument, and let get_page_from_freelist() prefer that over the default
global lists.

Omit all special nodes from FALLBACK/NOFALLBACK and make the special
contexts provide the fallback-base that should be used.

On my current branch I think that would include modifying, in totality:

   alloc_folio_mpol()
   alloc_demotion_folio()
   alloc_migration_target()

And I'm pretty sure that all just nests nicely.
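
The fallback-list selection described above can be sketched as a toy
userspace model. The TOY_* names and toy_pick_zonelist() are hypothetical
and stand in for __GFP_THISNODE and the zonelist choice; the "context
supplied" list is the proposed addition, not something that exists today:

```c
#include <stddef.h>

#define TOY_GFP_THISNODE	0x1u	/* stand-in for __GFP_THISNODE */

enum toy_zonelist {
	TOY_FALLBACK,		/* may fall back to other nodes */
	TOY_NOFALLBACK,		/* this node only */
	TOY_CTX_PRIVATE,	/* hypothetical context-provided list */
};

/*
 * Pick the list to walk. A caller-supplied list, if any, wins;
 * otherwise the flag decides between the two default lists.
 */
enum toy_zonelist toy_pick_zonelist(unsigned int gfp,
				    const enum toy_zonelist *ctx_list)
{
	if (ctx_list)
		return *ctx_list;
	if (gfp & TOY_GFP_THISNODE)
		return TOY_NOFALLBACK;
	return TOY_FALLBACK;
}
```

Callers that pass no list keep the current flag-driven behaviour, so only
the special contexts would need to change.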

We might not even need memalloc... hmmm

~Gregory

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Vlastimil Babka (SUSE) @ 2026-06-15 15:27 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Gregory Price
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, Matthew Wilcox
In-Reply-To: <fdbdc9f7-d142-4880-b429-065d5056cabb@kernel.org>

On 6/15/26 17:18, David Hildenbrand (Arm) wrote:
> On 6/15/26 16:38, Vlastimil Babka (SUSE) wrote:
>> On 6/12/26 17:29, Gregory Price wrote:
>>> On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
>>>> ... snip ...
>>>>
>>>> I will still probably send the next RFC version tomorrow or friday,
>>>> as I want to get some eyes on the __GFP_PRIVATE-less pattern.
>>>>
>>>> Also, I made a new `anondax` driver which enables userland testing
>>>> of this functionality without any specialty hardware.
>>>>
>>>
>>> (apologies for the length of this email: this will all be covered in
>>> the coming cover letter, but I just wanted to share a bit of a preview)
>>>
>>> ===
>>>
>>> Just another small update - I am planning to post the RFC today once i
>>> get some mild cleanup done.  It will be based on the dax atomic hotplug
>>>
>>> https://lore.kernel.org/linux-mm/20260605211911.2160954-1-gourry@gourry.net/
>>>
>>> But a couple specific details regarding the memalloc pieces that i've
>>> learned the past couple of days playing with it.
>>>
>>> 1) memalloc_folio is required to ensure non-folio allocations don't land
>>>    on the private node, even if it happens within a memalloc_private
>>>    context.  Since memalloc_folio may be useful in contexts outside of
>>>    private nodes, I kept this as a separate flag.
>>>
>>>    If we think there will *never* be additional users of memalloc_folio,
>>>    then we could fold _folio into _private to save the flag for now and
>>>    add it back when we actually need it.
>>>
>>> 2) memalloc_private is needed to unlock private nodes, but in the
>>>    original NOFALLBACK-only design, you also needed __GFP_THISNODE.
>>>
>>>    This is *highly* restrictive.  I found when playing with mbind that
>>>    MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
>>>    implies a bug). 
>>>
>>>    That leads me to #3
>> 
>> I think the memalloc approach is dangerous due to unexpected nesting. There
>> might be nested page allocations in page allocation itself (due to some
>> debugging option). But also interrupts do not change what "current" points
>> to. Suddenly those could start requesting folios and/or private nodes and be
>> surprised, I'm afraid.
> 
> Yeah, we'd need some way to distinguish the main allocation from these other
> (nested) allocations.

That goes against the very principle of scopes. And I don't see how, except
via a ... flag to the main allocation :D

>> 
>> The memalloc scopes only work well when they restrict the context wrt
>> reclaim, and allocations in IRQ have to be already restricted heavily
>> (atomic) so further memalloc restrictions don't do anything in practice. But
>> to make them change other aspects of the allocations like this won't work.
> 
> I was assuming that memalloc_pin_save() would already violate that, but really
> it only restricts where movable allocations land, and that doesn't matter for
> other kernel allocations.

Hm, yeah, it's suboptimal, as it can turn a movable allocation unmovable. But
it shouldn't cause outright bugs.

> Do you see any other way to make something like an allocation context work, and
> avoid introducing more GFP flags?

Yeah, the idea of augmenting gfp flags with alloc_flags that are no longer
strictly internal to the page allocator seems like a way to achieve what we
need.
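
To make the nesting concern concrete, here is a minimal userspace sketch, not
kernel code. Every sketch_* name is hypothetical, and the single global
"current" stands in for the task pointer that an interrupt does not change.
It contrasts a decision derived from a task-wide scope with one derived from
a flag passed to the main allocation only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names are kernel API. */
#define SKETCH_PF_PRIVATE 0x1u	/* task-wide scope: private nodes unlocked */
#define SKETCH_AF_PRIVATE 0x1u	/* per-call flag with the same meaning */

struct sketch_task {
	unsigned int flags;
};

/* Stands in for "current"; an interrupt would see the same object. */
static struct sketch_task sketch_current;

/* Open the scope and return the previous state of the scope bit. */
static unsigned int sketch_scope_save(void)
{
	unsigned int old = sketch_current.flags & SKETCH_PF_PRIVATE;

	sketch_current.flags |= SKETCH_PF_PRIVATE;
	return old;
}

/* Restore the scope bit to the state returned by sketch_scope_save(). */
static void sketch_scope_restore(unsigned int old)
{
	assert((old & ~SKETCH_PF_PRIVATE) == 0);
	sketch_current.flags = (sketch_current.flags & ~SKETCH_PF_PRIVATE) | old;
}

/* Scope-based decision: depends only on task state. */
static bool sketch_scoped_may_use_private(void)
{
	return (sketch_current.flags & SKETCH_PF_PRIVATE) != 0;
}

/* Flag-based decision: depends only on what this call was given. */
static bool sketch_flagged_may_use_private(unsigned int alloc_flags)
{
	return (alloc_flags & SKETCH_AF_PRIVATE) != 0;
}

/*
 * An allocation issued from a nested path (a debugging hook or an
 * interrupt handler) that knows nothing about private nodes.
 */
static bool sketch_nested_alloc_scoped(void)
{
	return sketch_scoped_may_use_private();
}

static bool sketch_nested_alloc_flagged(void)
{
	return sketch_flagged_may_use_private(0);
}
```

While the scope is open, sketch_nested_alloc_scoped() reports that the nested
allocation may use private nodes even though it never asked for them, whereas
sketch_nested_alloc_flagged() does not. That is the difference between a
scope and a flag on the main allocation in this simplified model.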

^ permalink raw reply

* Re: [PATCH v3] security: Expand task_setscheduler LSM hook to include CPU affinity mask
From: Aaron Tomlin @ 2026-06-15 15:22 UTC (permalink / raw)
  To: paul
  Cc: tsbogend, paul, jmorris, serge, mingo, juri.lelli,
	vincent.guittot, stephen.smalley.work, casey, longman, tj, hannes,
	mkoutny, chenridong, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak, omosnace, kees, neelx, sean, chjohnst,
	steve, mproche, nick.lange, cgroups, linux-mips, linux-fsdevel,
	linux-security-module, selinux, linux-kernel
In-Reply-To: <6hqq5oxvlcpmjvyns42dy2vtfvvixy7q4xyyjrrn46jrvsx5ar@gkmjsteqlpzd>

On Wed, May 27, 2026 at 09:19:11PM -0400, Aaron Tomlin wrote:
> On Wed, May 27, 2026 at 09:58:58PM +0200, Peter Zijlstra wrote:
> > On Wed, May 27, 2026 at 01:41:52PM -0400, Aaron Tomlin wrote:
> > 
> > > > > The actual use case here is multi-tenant workload isolation and visibility.
> > > > > Passing the evaluated cpumask to the BPF LSM allows operators to write a
> > > > > simple eBPF program to detect spatial boundary overlaps (e.g., logging an
> > > > > event if a requested mask intersects with platform-reserved cores).
> > 
> > Why isn't cgroups good enough to enforce this? If you create a cgroup
> > hierarchy per tenant, and constrain them using the cpuset controller,
> > they should not be able to escape, rendering this event impossible.
> 
> Hi Peter,
> 
> You raise a very fair point. The cpuset cgroup controller is indeed the
> kernel's primary vehicle for spatial enforcement, and under normal
> circumstances, it successfully prevents a tenant from escaping their
> designated cores.
> 
> The cpuset controller does govern resource limits, but does not audit
> intent. When __sched_setaffinity() is invoked, the kernel compares the
> requested in_mask against the task's allowed cpuset. If there is only a
> partial intersection, the kernel silently truncates the requested mask to
> fit the cpuset, without raising any alarm.
> 
> The BPF LSM hook, conversely, receives the raw, untruncated in_mask,
> affording operators the visibility to detect, audit, and even reject these
> violations of intent before the kernel silently sanitises the input.
> 
> This patch does not seek to replace the cpuset controller, but rather to
> complement it by providing auditing capabilities.
> 
> > > We are not creating a bespoke BPF hook here; rather, we are rectifying a
> > > historical blind spot within the API. The existing LSM hook is invoked
> > > during sched_setaffinity(), yet it presently receives only the task_struct
> > > pointer. Consequently, the security module is essentially asked, "Should
> > > Process A be permitted to alter Process B's affinity?" without being
> > > informed of the proposed affinity itself. Providing in_mask simply
> > > furnishes the existing hook with the requisite payload to make an informed
> > > decision.
> > 
> > It occurs to me that this same argument would require to also pass in
> > the new sched_attr, no? That way the LSM can inspect the new policy
> > before it becomes effective.
> 
> I agree, the underlying logic does indeed extend perfectly to sched_attr.
> 
> Presently, the LSM is equally oblivious as to whether a process is
> requesting a benign transition to SCHED_BATCH, or attempting to escalate
> its privileges by requesting a real-time policy such as SCHED_FIFO with
> maximum priority. Just as with the CPU mask, providing the sched_attr
> payload would rectify this parallel blind spot, allowing BPF policies to
> inspect and mediate scheduling attributes before they become effective.
> 
> If you are amenable, I should be more than happy to expand the scope of the
> forthcoming patch to include this. Alternatively, we could address the
> sched_attr expansion in a separate, subsequent patch. Personally, I would
> favour the latter approach, but please do let me know your preference.
> 
> I very much look forward to hearing Paul's thoughts on whether this aligns
> with the broader LSM vision.

Hi Paul,

I am writing to politely follow up on the discussion above regarding the
proposed enhancement to the sched_setaffinity LSM hook.

As you will see from the thread, Peter Zijlstra and I have discussed the
architectural justification for this change. While the cpuset cgroup
controller effectively handles spatial enforcement, it silently truncates
requested affinity masks. Passing the raw in_mask to the LSM hook enables
security modules (such as the BPF LSM) to audit and mediate the actual
intent of the request before the kernel sanitises the input, a capability
that cgroups inherently lack.
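
The truncation argument can be illustrated with a minimal userspace sketch.
It is an assumption-laden model, not the kernel implementation: a 64-bit word
stands in for struct cpumask, and all sketch_* names and the "reserved cores"
policy are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per CPU; a 64-bit word stands in for struct cpumask. */
typedef uint64_t sketch_mask_t;

/* Mask that takes effect after cpuset-style restriction: the intersection. */
static sketch_mask_t sketch_effective_mask(sketch_mask_t requested,
					   sketch_mask_t allowed)
{
	return requested & allowed;
}

/* CPUs that were requested but fall outside the allowed set. */
static sketch_mask_t sketch_dropped_cpus(sketch_mask_t requested,
					 sketch_mask_t allowed)
{
	return requested & ~allowed;
}

/* Hypothetical audit policy: does the mask overlap reserved cores? */
static bool sketch_request_hits_reserved(sketch_mask_t mask,
					 sketch_mask_t reserved)
{
	return (mask & reserved) != 0;
}
```

With allowed CPUs 0-3 (0x0F), reserved CPUs 6-7 (0xC0), and a request for
CPUs 2, 3, 6 and 7 (0xCC), the effective mask is 0x0C. A check run on the raw
request sees the overlap with the reserved cores; the same check run on the
effective mask cannot, because the offending bits are already gone.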

Furthermore, Peter rightly observed that this reasoning extends naturally
to sched_attr. Presently, the LSM cannot inspect whether a process is
requesting a benign scheduling policy or attempting to escalate to a
real-time priority. I am entirely amenable to addressing this parallel
blind spot, preferably in a subsequent patch.

Before I proceed any further, I would be most grateful for your perspective
as the Security sub-system maintainer. Do you feel this expansion is
acceptable?

As a brief administrative aside, please note that Thomas Bogendoerfer has
already queued the MIPS-specific changes related to this work into the
mips-next tree [1][2].

I look forward to hearing your thoughts.

[1]: https://lore.kernel.org/lkml/psb6pxogv2dlknps4p3sh6rt2h7xuuxkoif6ock5vxfz2jimec@txa6iy65crtb/
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/mips/linux.git/commit/?id=98e37db4a34d3af3fb2f4648295c25b5e40b20e3


Kind regards,
-- 
Aaron Tomlin

^ permalink raw reply

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-15 15:20 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: David Hildenbrand (Arm), Balbir Singh, lsf-pc, linux-kernel,
	linux-cxl, cgroups, linux-mm, linux-trace-kernel, damon,
	kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	dan.j.williams, longman, akpm, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, mhiramat,
	mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj, baolin.wang,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, muchun.song,
	xu.xin16, chengming.zhou, jannh, linmiaohe, nao.horiguchi,
	pfalcato, rientjes, shakeel.butt, riel, harry.yoo, cl,
	roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
	zhengqi.arch, terry.bowman, Matthew Wilcox
In-Reply-To: <9f1815b0-896b-44ab-9e6d-9316d8f11033@kernel.org>

On Mon, Jun 15, 2026 at 04:38:43PM +0200, Vlastimil Babka (SUSE) wrote:
> On 6/12/26 17:29, Gregory Price wrote:
> > 
> > 1) memalloc_folio is required to ensure non-folio allocations don't land
> >    on the private node, even if it happens within a memalloc_private
> >    context.  Since memalloc_folio may be useful in contexts outside of
> >    private nodes, I kept this as a separate flag.
> > 
> >    If we think there will *never* be additional users of memalloc_folio,
> >    then we could fold _folio into _private to save the flag for now and
> >    add it back when we actually need it.
> > 
> > 2) memalloc_private is needed to unlock private nodes, but in the
> >    original NOFALLBACK-only design, you also needed __GFP_THISNODE.
> > 
> >    This is *highly* restrictive.  I found when playing with mbind that
> >    MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
> >    implies a bug). 
> > 
> >    That leads me to #3
> 
> I think the memalloc approach is dangerous due to unexpected nesting. There
> might be nested page allocations in page allocation itself (due to some
> debugging option). But also interrupts do not change what "current" points
> to. Suddenly those could start requesting folios and/or private nodes and be
> surprised, I'm afraid.
> 
> The memalloc scopes only work well when they restrict the context wrt
> reclaim, and allocations in IRQ have to be already restricted heavily
> (atomic) so further memalloc restrictions don't do anything in practice. But
> to make them change other aspects of the allocations like this won't work.
>

In practice I have found success; however, what you are
describing could probably be resolved by re-introducing fallback list
isolation.  If private nodes are not in fallback lists, and they're not
N_MEMORY, then they're unreachable via nodemask-fallbacks, and a
specific node has to be requested.  For everything else memalloc locks
them out regardless.
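
As a rough illustration of fallback list isolation, here is a minimal
userspace sketch. It is not the zonelist code: the node table, the sketch_*
names, and the choice to let an explicit private-node request still fall back
to normal nodes are all assumptions made for the example, and the memalloc
lock-out is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_NODES 4
#define SKETCH_NO_NODE (-1)

struct sketch_node {
	bool is_private;
	int free_pages;
};

static struct sketch_node sketch_nodes[SKETCH_MAX_NODES];

/*
 * Build the fallback order for @nid: the requested node first, then every
 * other node that is not private. Returns the number of entries written.
 */
static int sketch_build_fallback(int nid, int list[SKETCH_MAX_NODES])
{
	int n = 0;

	assert(nid >= 0 && nid < SKETCH_MAX_NODES);
	list[n++] = nid;
	for (int i = 0; i < SKETCH_MAX_NODES; i++)
		if (i != nid && !sketch_nodes[i].is_private)
			list[n++] = i;
	return n;
}

/* Take one page, preferring @nid and then walking its fallback list. */
static int sketch_alloc_page(int nid)
{
	int list[SKETCH_MAX_NODES];
	int n = sketch_build_fallback(nid, list);

	for (int i = 0; i < n; i++) {
		struct sketch_node *node = &sketch_nodes[list[i]];

		if (node->free_pages > 0) {
			node->free_pages--;
			return list[i];
		}
	}
	return SKETCH_NO_NODE;
}
```

In this model a private node only ever appears at the head of its own list,
so an allocation that starts on a normal node fails rather than spilling onto
the private node, and the private node is reached only when it is named
explicitly.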

In v5 I actually stripped this all the way back to just memalloc flags
and implemented a bunch of pressure tests to try to detect leakage - and
I was not able to do so - even with all nodes in each other's fallback
lists.

We can tack on both fallback list isolation and __GFP_THISNODE
requirements on top without ABI implications if we find that is
insufficient.

The only place I think this will matter is in the reclaim / demotion
code, where we would need to rework the allocation code to handle private
nodes more explicitly.  This has no ABI implications AND the entire demotion
logic in vmscan.c is utterly broken anyway and needs a rewrite.

I'm running a mass build test at the moment, and it's looking clean, I'm
expecting to be able to test the new code today or tomorrow.

~Gregory

^ permalink raw reply

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-06-15 15:18 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Gregory Price
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, Matthew Wilcox
In-Reply-To: <9f1815b0-896b-44ab-9e6d-9316d8f11033@kernel.org>

On 6/15/26 16:38, Vlastimil Babka (SUSE) wrote:
> On 6/12/26 17:29, Gregory Price wrote:
>> On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
>>> ... snip ...
>>>
>>> I will still probably send the next RFC version tomorrow or friday,
>>> as I want to get some eyes on the __GFP_PRIVATE-less pattern.
>>>
>>> Also, I made a new `anondax` driver which enables userland testing
>>> of this functionality without any specialty hardware.
>>>
>>
>> (apologies for the length of this email: this will all be covered in
>> the coming cover letter, but I just wanted to share a bit of a preview)
>>
>> ===
>>
>> Just another small update - I am planning to post the RFC today once i
>> get some mild cleanup done.  It will be based on the dax atomic hotplug
>>
>> https://lore.kernel.org/linux-mm/20260605211911.2160954-1-gourry@gourry.net/
>>
>> But a couple specific details regarding the memalloc pieces that i've
>> learned the past couple of days playing with it.
>>
>> 1) memalloc_folio is required to ensure non-folio allocations don't land
>>    on the private node, even if it happens within a memalloc_private
>>    context.  Since memalloc_folio may be useful in contexts outside of
>>    private nodes, I kept this as a separate flag.
>>
>>    If we think there will *never* be additional users of memalloc_folio,
>>    then we could fold _folio into _private to save the flag for now and
>>    add it back when we actually need it.
>>
>> 2) memalloc_private is needed to unlock private nodes, but in the
>>    original NOFALLBACK-only design, you also needed __GFP_THISNODE.
>>
>>    This is *highly* restrictive.  I found when playing with mbind that
>>    MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
>>    implies a bug). 
>>
>>    That leads me to #3
> 
> I think the memalloc approach is dangerous due to unexpected nesting. There
> might be nested page allocations in page allocation itself (due to some
> debugging option). But also interrupts do not change what "current" points
> to. Suddenly those could start requesting folios and/or private nodes and be
> surprised, I'm afraid.

Yeah, we'd need some way to distinguish the main allocation from these other
(nested) allocations.


> 
> The memalloc scopes only work well when they restrict the context wrt
> reclaim, and allocations in IRQ have to be already restricted heavily
> (atomic) so further memalloc restrictions don't do anything in practice. But
> to make them change other aspects of the allocations like this won't work.

I was assuming that memalloc_pin_save() would already violate that, but really
it only restricts where movable allocations land, and that doesn't matter for
other kernel allocations.

Do you see any other way to make something like an allocation context work, and
avoid introducing more GFP flags?

-- 
Cheers,

David

^ permalink raw reply

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Vlastimil Babka (SUSE) @ 2026-06-15 14:38 UTC (permalink / raw)
  To: Gregory Price, David Hildenbrand (Arm)
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, Matthew Wilcox
In-Reply-To: <aiwl4kCG814dpX7L@gourry-fedora-PF4VCD3F>

On 6/12/26 17:29, Gregory Price wrote:
> On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
>> On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
>> > > 
>> > > I understand this question in two ways:
>> > > 
>> > >   1) Can we disallow PAGE allocation and limit this to FOLIO allocation
>> > 
>> > Yes. Can we only allow folios to be allocated from private memory nodes. So let
>> > me reply to that one below.
>> > 
>> ... snip ...
>> > 
>> > At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
>> > context might be better. I think there was also talk about how the memalloc_*
>> > interface might be a better way forward. Maybe we would start giving the
>> > allocator more context ("we are allocating a folio").
>> > 
>> > The following is incomplete (esp. hugetlb stuff I assume), just as some idea:
>> >
>> 
>> I will still probably send the next RFC version tomorrow or friday,
>> as I want to get some eyes on the __GFP_PRIVATE-less pattern.
>> 
>> Also, I made a new `anondax` driver which enables userland testing
>> of this functionality without any specialty hardware.
>> 
> 
> (apologies for the length of this email: this will all be covered in
> the coming cover letter, but I just wanted to share a bit of a preview)
> 
> ===
> 
> Just another small update - I am planning to post the RFC today once i
> get some mild cleanup done.  It will be based on the dax atomic hotplug
> 
> https://lore.kernel.org/linux-mm/20260605211911.2160954-1-gourry@gourry.net/
> 
> But a couple specific details regarding the memalloc pieces that i've
> learned the past couple of days playing with it.
> 
> 1) memalloc_folio is required to ensure non-folio allocations don't land
>    on the private node, even if it happens within a memalloc_private
>    context.  Since memalloc_folio may be useful in contexts outside of
>    private nodes, I kept this as a separate flag.
> 
>    If we think there will *never* be additional users of memalloc_folio,
>    then we could fold _folio into _private to save the flag for now and
>    add it back when we actually need it.
> 
> 2) memalloc_private is needed to unlock private nodes, but in the
>    original NOFALLBACK-only design, you also needed __GFP_THISNODE.
> 
>    This is *highly* restrictive.  I found when playing with mbind that
>    MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
>    implies a bug). 
> 
>    That leads me to #3

I think the memalloc approach is dangerous due to unexpected nesting. There
might be nested page allocations in page allocation itself (due to some
debugging option). But also interrupts do not change what "current" points
to. Suddenly those could start requesting folios and/or private nodes and be
surprised, I'm afraid.

The memalloc scopes only work well when they restrict the context wrt
reclaim, and allocations in IRQ have to be already restricted heavily
(atomic) so further memalloc restrictions don't do anything in practice. But
to make them change other aspects of the allocations like this won't work.

> 3) If a private node is opted into something like Demotion (the node is
>    a demotion target) or mbind(), such that normal kernel operation can
>    place memory there - it's *pseudo-private*, and should actually land
>    in its own FALLBACK list (reachable without __GFP_THISNODE, but not
>    reachable as a normal fallback allocation target).
> 
> I'm still playing with this, but I think we can even omit the
> __GFP_THISNODE requirement (my initial feeling that __GFP_THISNODE
> didn't buy us anything in particular seems to have panned out).
> 
> At the end of the day, this makes the whole memalloc_private_save()
> pattern a heck of a lot cleaner than trying to fiddle with GFP.
> 
> I think you will all enjoy how clean the code ends up, and how easily
> testable it is.
> 
> As a testbed I've implemented an anondax (we can discuss naming) that
> adds some sample NODE_PRIVATE_OPT_* flags so you can do the following.
> 
> I'm including this in the next RFC - but we can hack the entire thing
> off (including the OPT flags) if we prefer to just get the base set in
> without a new driver as a start.
> 
> echo 1 > dax0.0/reclaim   # kswapd and reclaim run normally on this node
> echo 1 > dax0.0/demotion  # it is a demotion target
> echo 1 > dax0.0/mbind     # mbind() can target this node for anon-vma's
> echo 1 > dax0.0/madvise   # allow madvise() to operate on its folios
> echo 1 > dax0.0/numa_balance  # allow numa balancing for this node
> echo 1 > dax0.0/ltpin     # allow GUP longterm pin to operate normally
> echo * > dax0.0/adistance # set the adistance for hotplug time
> echo * > dax0.0/hotplug   # same as kmem/hotplug
> 
> This also means *existing hardware* can leverage private nodes if
> they're capable of generating a dax device.
> 
> I've even gotten it such that you can put a private node above dram in
> the adistance hierarchy - which means demotion flows downward from
> device to CPU, but allocations don't default or fallback there.
> 
> This seems *immediately* useful for a variety of use cases.
> 
> ~Gregory
> _______________________________________________
> Lsf-pc mailing list
> Lsf-pc@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/lsf-pc


^ permalink raw reply

* [PATCH V2] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Zizhi Wo @ 2026-06-15 11:55 UTC (permalink / raw)
  To: axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1, yukuai, houtao1, wozizhi

From: Zizhi Wo <wozizhi@huawei.com>

[BUG]
Our fuzz testing triggered a blkcg use-after-free issue:

  BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
  Call Trace:
  ...
  blkcg_deactivate_policy+0x244/0x4d0
  ioc_rqos_exit+0x44/0xe0
  rq_qos_exit+0xba/0x120
  __del_gendisk+0x50b/0x800
  del_gendisk+0xff/0x190
  ...

[CAUSE]
process1						process2
cgroup_rmdir
...
  css_killed_work_fn
    offline_css
    ...
      blkcg_destroy_blkgs
      ...
        __blkg_release
	  css_put(&blkg->blkcg->css)
          blkg_free
	    INIT_WORK(xxx, blkg_free_workfn)
	    schedule_work
    css_put
    ...
      blkcg_css_free
        kfree(blkcg)--------blkcg has been freed!!!
====================================schedule_work
              blkg_free_workfn
							__del_gendisk
							  rq_qos_exit
							    ioc_rqos_exit
							      blkcg_deactivate_policy
							        mutex_lock(&q->blkcg_mutex)
								spin_lock_irq(&q->queue_lock)
							        list_for_each_entry(blkg, xxx)
								  blkcg = blkg->blkcg
								  spin_lock(&blkcg->lock)-------UAF!!!
	        mutex_lock(&q->blkcg_mutex)
	        spin_lock_irq(&q->queue_lock)
	        /* Only then is the blkg removed from the list */
	        list_del_init(&blkg->q_node)

As a result, a blkg can still be reachable through q->blkg_list while
its ->blkcg has already been freed.

[Fix]
Fix this by deferring the blkcg css_put() until after the blkg has been
unlinked from q->blkg_list in blkg_free_workfn(). This ensures that the
blkcg outlives every blkg still reachable through q->blkg_list, so any
iterator holding q->queue_lock is guaranteed to observe a valid
blkg->blkcg.
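
The ordering rule can be sketched in a single-threaded userspace model. This
is a deliberate simplification and not the blk-cgroup code: the sketch_*
types are hypothetical, a "freed" flag stands in for kfree(), the child holds
the last owner reference, and the window a concurrent iterator could observe
is sampled between the two steps.

```c
#include <assert.h>
#include <stdbool.h>

/* Stands in for the blkcg: refcounted, "freed" when the count hits zero. */
struct sketch_owner {
	int refs;
	bool freed;
};

/* Stands in for the blkg: points at its owner and may sit on a list. */
struct sketch_child {
	struct sketch_owner *owner;
	bool on_list;
};

static void sketch_owner_put(struct sketch_owner *o)
{
	assert(o->refs > 0);
	if (--o->refs == 0)
		o->freed = true;	/* stands in for kfree() */
}

/* What a list iterator relies on: a linked child has a live owner. */
static bool sketch_list_invariant(const struct sketch_child *c)
{
	return !c->on_list || !c->owner->freed;
}

/* Old order: drop the owner reference, unlink later. */
static bool sketch_free_put_first(struct sketch_child *c)
{
	bool ok;

	sketch_owner_put(c->owner);
	ok = sketch_list_invariant(c);	/* window an iterator could observe */
	c->on_list = false;
	return ok;
}

/* New order: unlink first, then drop the owner reference. */
static bool sketch_free_unlink_first(struct sketch_child *c)
{
	bool ok;

	c->on_list = false;
	ok = sketch_list_invariant(c);	/* same window, now safe */
	sketch_owner_put(c->owner);
	return ok;
}
```

sketch_free_put_first() returns false because the child is still linked when
its owner is released, while sketch_free_unlink_first() returns true. The
real fix additionally relies on q->queue_lock and q->blkcg_mutex to serialize
the unlink against iterators, which this model leaves out.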

While at it, move css_tryget_online() from blkg_create() into blkg_alloc()
so that the css reference is owned by the alloc/free pair rather than
straddling layers:
blkg_alloc()  <-> blkg_free()
blkg_create() <-> blkg_destroy()

Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Suggested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
v2:
 - Move css_tryget_online() from blkg_create() into blkg_alloc() so the
   css reference follows the blkg's own lifetime, making the put in
   blkg_free_workfn() symmetric with the get in blkg_alloc().

v1: https://lore.kernel.org/all/20260518010932.633707-1-wozizhi@huaweicloud.com/

 block/blk-cgroup.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bc63bd220865..27414c291e49 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -132,10 +132,15 @@ static void blkg_free_workfn(struct work_struct *work)
 	if (blkg->parent)
 		blkg_put(blkg->parent);
 	spin_lock_irq(&q->queue_lock);
 	list_del_init(&blkg->q_node);
 	spin_unlock_irq(&q->queue_lock);
+	/*
+	 * Release blkcg css ref only after blkg is removed from q->blkg_list,
+	 * so concurrent iterators won't see a blkg with a freed blkcg.
+	 */
+	css_put(&blkg->blkcg->css);
 	mutex_unlock(&q->blkcg_mutex);
 
 	blk_put_queue(q);
 	free_percpu(blkg->iostat_cpu);
 	percpu_ref_exit(&blkg->refcnt);
@@ -177,12 +182,10 @@ static void __blkg_release(struct rcu_head *rcu)
 	 * blkg_stat_lock is for serializing blkg stat update
 	 */
 	for_each_possible_cpu(cpu)
 		__blkcg_rstat_flush(blkcg, cpu);
 
-	/* release the blkcg and parent blkg refs this blkg has been holding */
-	css_put(&blkg->blkcg->css);
 	blkg_free(blkg);
 }
 
 /*
  * A group is RCU protected, but having an rcu lock does not mean that one
@@ -311,10 +314,13 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
 	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
 	if (!blkg->iostat_cpu)
 		goto out_exit_refcnt;
 	if (!blk_get_queue(disk->queue))
 		goto out_free_iostat;
+	/* blkg holds a reference to blkcg */
+	if (!css_tryget_online(&blkcg->css))
+		goto out_put_queue;
 
 	blkg->q = disk->queue;
 	INIT_LIST_HEAD(&blkg->q_node);
 	blkg->blkcg = blkcg;
 	blkg->iostat.blkg = blkg;
@@ -351,10 +357,12 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct gendisk *disk,
 
 out_free_pds:
 	while (--i >= 0)
 		if (blkg->pd[i])
 			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
+	css_put(&blkcg->css);
+out_put_queue:
 	blk_put_queue(disk->queue);
 out_free_iostat:
 	free_percpu(blkg->iostat_cpu);
 out_exit_refcnt:
 	percpu_ref_exit(&blkg->refcnt);
@@ -379,32 +387,26 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 	if (blk_queue_dying(disk->queue)) {
 		ret = -ENODEV;
 		goto err_free_blkg;
 	}
 
-	/* blkg holds a reference to blkcg */
-	if (!css_tryget_online(&blkcg->css)) {
-		ret = -ENODEV;
-		goto err_free_blkg;
-	}
-
 	/* allocate */
 	if (!new_blkg) {
 		new_blkg = blkg_alloc(blkcg, disk, GFP_NOWAIT);
 		if (unlikely(!new_blkg)) {
 			ret = -ENOMEM;
-			goto err_put_css;
+			goto err_free_blkg;
 		}
 	}
 	blkg = new_blkg;
 
 	/* link parent */
 	if (blkcg_parent(blkcg)) {
 		blkg->parent = blkg_lookup(blkcg_parent(blkcg), disk->queue);
 		if (WARN_ON_ONCE(!blkg->parent)) {
 			ret = -ENODEV;
-			goto err_put_css;
+			goto err_free_blkg;
 		}
 		blkg_get(blkg->parent);
 	}
 
 	/* invoke per-policy init */
@@ -440,12 +442,10 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
 
 	/* @blkg failed fully initialized, use the usual release path */
 	blkg_put(blkg);
 	return ERR_PTR(ret);
 
-err_put_css:
-	css_put(&blkcg->css);
 err_free_blkg:
 	if (new_blkg)
 		blkg_free(new_blkg);
 	return ERR_PTR(ret);
 }
-- 
2.52.0


^ permalink raw reply related

* [PATCH v3 15/15] mm/slab: replace __GFP_NO_OBJ_EXT with SLAB_ALLOC_NO_RECURSE for sheaves
From: Vlastimil Babka (SUSE) @ 2026-06-15 11:54 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260615-slab_alloc_flags-v3-0-ce1146d140fb@kernel.org>

Finish the switch away from __GFP_NO_OBJ_EXT by replacing it with
SLAB_ALLOC_NO_RECURSE when allocating empty sheaves. Pass alloc_flags to
[__]alloc_empty_sheaf(). Callers that can't be part of a recursive
kmalloc() chain simply pass SLAB_ALLOC_DEFAULT. Use kmalloc_flags()
instead of kzalloc() for allocating the sheaf.

With that we can finalize the removal of the __GFP_NO_OBJ_EXT handling from
obj_ext allocations as well, leaving only SLAB_ALLOC_NO_RECURSE in
place.

This leaves __GFP_NO_OBJ_EXT with no users in slab, so stop allowing the
flag in kmalloc_nolock().
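
The recursion guard pattern can be shown with a minimal userspace sketch. It
is not the slab code: the sketch_* names are hypothetical, calloc() stands in
for the allocator, and a single auxiliary object stands in for obj_ext arrays
and sheaves. The point is only that the guard travels in a separate
alloc_flags argument instead of consuming a gfp bit.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical allocator-internal flags, kept separate from gfp bits. */
#define SKETCH_ALLOC_DEFAULT    0x0u
#define SKETCH_ALLOC_NO_RECURSE 0x1u

/* Counts how many auxiliary objects were allocated. */
static int sketch_aux_allocs;

struct sketch_obj {
	struct sketch_obj *aux;	/* auxiliary structure, may be NULL */
};

/*
 * Allocate an object. Unless the caller forbids it, also allocate an
 * auxiliary object through the same allocator, passing NO_RECURSE so the
 * auxiliary allocation cannot trigger yet another one.
 */
static struct sketch_obj *sketch_alloc(unsigned int alloc_flags)
{
	struct sketch_obj *obj = calloc(1, sizeof(*obj));

	if (!obj)
		return NULL;

	if (!(alloc_flags & SKETCH_ALLOC_NO_RECURSE)) {
		obj->aux = sketch_alloc(alloc_flags | SKETCH_ALLOC_NO_RECURSE);
		if (obj->aux)
			sketch_aux_allocs++;
	}
	return obj;
}

/* Free an object together with its auxiliary object, if any. */
static void sketch_free(struct sketch_obj *obj)
{
	if (!obj)
		return;
	sketch_free(obj->aux);
	free(obj);
}
```

A call with SKETCH_ALLOC_DEFAULT yields an object with exactly one auxiliary
object, and that auxiliary object has none of its own, so the chain stops
after one level.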

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-16-7190909db118@kernel.org
Reviewed-by: Hao Li <hao.li@linux.dev>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 include/linux/slab.h |  6 +++---
 mm/slub.c            | 34 +++++++++++++++++-----------------
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index b955f3cbb732..43c3d9b51107 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -1039,9 +1039,9 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 /**
  * kmalloc_nolock - Allocate an object of given size from any context.
  * @size: size to allocate
- * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO, __GFP_NO_OBJ_EXT
- * allowed. Also __GFP_NOWARN and __GFP_NOMEMALLOC are allowed but added
- * internally thus not necessary.
+ * @gfp_flags: GFP flags. Only __GFP_ACCOUNT and __GFP_ZERO allowed.  Also
+ * __GFP_NOWARN and __GFP_NOMEMALLOC are allowed but added internally thus not
+ * necessary.
  * @node: node number of the target node.
  *
  * Return: pointer to the new object or NULL in case of error.
diff --git a/mm/slub.c b/mm/slub.c
index fc5b8c85b690..62e9cd46916f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2171,7 +2171,6 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 
 	gfp &= ~OBJCGS_CLEAR_MASK;
 	/* Prevent recursive extension vector allocation */
-	gfp |= __GFP_NO_OBJ_EXT;
 	alloc_flags |= SLAB_ALLOC_NO_RECURSE;
 
 	sz = obj_exts_alloc_size(s, slab, gfp);
@@ -2376,7 +2375,7 @@ __alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags,
 	if (s->flags & (SLAB_NO_OBJ_EXT | SLAB_NOLEAKTRACE))
 		return;
 
-	if (alloc_flags & SLAB_ALLOC_NO_RECURSE || flags & __GFP_NO_OBJ_EXT)
+	if (alloc_flags & SLAB_ALLOC_NO_RECURSE)
 		return;
 
 	slab = virt_to_slab(object);
@@ -2761,7 +2760,7 @@ static inline void *setup_object(struct kmem_cache *s, void *object)
 }
 
 static struct slab_sheaf *__alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp,
-					      unsigned int capacity)
+				unsigned int alloc_flags, unsigned int capacity)
 {
 	struct slab_sheaf *sheaf;
 	size_t sheaf_size;
@@ -2772,10 +2771,10 @@ static struct slab_sheaf *__alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp,
 	 * bucket)
 	 */
 	if (s->flags & SLAB_KMALLOC)
-		gfp |= __GFP_NO_OBJ_EXT;
+		alloc_flags |= SLAB_ALLOC_NO_RECURSE;
 
 	sheaf_size = struct_size(sheaf, objects, capacity);
-	sheaf = kzalloc(sheaf_size, gfp);
+	sheaf = kmalloc_flags(sheaf_size, gfp | __GFP_ZERO, alloc_flags, NUMA_NO_NODE);
 
 	if (unlikely(!sheaf))
 		return NULL;
@@ -2788,20 +2787,20 @@ static struct slab_sheaf *__alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp,
 }
 
 static inline struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s,
-						   gfp_t gfp)
+				gfp_t gfp, unsigned int alloc_flags)
 {
-	if (gfp & __GFP_NO_OBJ_EXT)
+	if (alloc_flags & SLAB_ALLOC_NO_RECURSE)
 		return NULL;
 
 	gfp &= ~OBJCGS_CLEAR_MASK;
 
-	return __alloc_empty_sheaf(s, gfp, s->sheaf_capacity);
+	return __alloc_empty_sheaf(s, gfp, alloc_flags, s->sheaf_capacity);
 }
 
 static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
 {
 	/*
-	 * If the sheaf was created with __GFP_NO_OBJ_EXT flag then its
+	 * If the sheaf was created with SLAB_ALLOC_NO_RECURSE flag then its
 	 * corresponding extension is NULL and alloc_tag_sub() will throw a
 	 * warning, therefore replace NULL with CODETAG_EMPTY to indicate
 	 * that the extension for this sheaf is expected to be NULL.
@@ -4693,7 +4692,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 		return NULL;
 
 	if (!empty) {
-		empty = alloc_empty_sheaf(s, gfp);
+		empty = alloc_empty_sheaf(s, gfp, alloc_flags);
 		if (!empty)
 			return NULL;
 	}
@@ -5066,7 +5065,7 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
 
 	if (unlikely(size > s->sheaf_capacity)) {
 
-		sheaf = __alloc_empty_sheaf(s, gfp, size);
+		sheaf = __alloc_empty_sheaf(s, gfp, SLAB_ALLOC_DEFAULT, size);
 		if (!sheaf)
 			return NULL;
 
@@ -5111,7 +5110,7 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
 
 
 	if (!sheaf)
-		sheaf = alloc_empty_sheaf(s, gfp);
+		sheaf = alloc_empty_sheaf(s, gfp, SLAB_ALLOC_DEFAULT);
 
 	if (sheaf) {
 		sheaf->capacity = s->sheaf_capacity;
@@ -5396,7 +5395,7 @@ static void *__kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_f
 
 	VM_WARN_ON_ONCE(alloc_flags_allow_spinning(ac->alloc_flags));
 	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO |
-			__GFP_NO_OBJ_EXT | __GFP_NOWARN | __GFP_NOMEMALLOC));
+				      __GFP_NOWARN | __GFP_NOMEMALLOC));
 
 	gfp_flags |= __GFP_NOWARN | __GFP_NOMEMALLOC;
 
@@ -5911,7 +5910,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 	if (!allow_spin)
 		return NULL;
 
-	empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+	empty = alloc_empty_sheaf(s, GFP_NOWAIT, SLAB_ALLOC_DEFAULT);
 	if (empty)
 		goto got_empty;
 
@@ -6095,7 +6094,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 
 		local_unlock(&s->cpu_sheaves->lock);
 
-		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+		empty = alloc_empty_sheaf(s, GFP_NOWAIT, SLAB_ALLOC_DEFAULT);
 
 		if (!empty)
 			goto fail;
@@ -7640,7 +7639,7 @@ static int init_percpu_sheaves(struct kmem_cache *s)
 		if (!s->sheaf_capacity)
 			pcs->main = &bootstrap_sheaf;
 		else
-			pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
+			pcs->main = alloc_empty_sheaf(s, GFP_KERNEL, SLAB_ALLOC_DEFAULT);
 
 		if (!pcs->main)
 			return -ENOMEM;
@@ -8506,7 +8505,8 @@ static void __init bootstrap_cache_sheaves(struct kmem_cache *s)
 
 		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
-		pcs->main = __alloc_empty_sheaf(s, GFP_KERNEL, capacity);
+		pcs->main = __alloc_empty_sheaf(s, GFP_KERNEL,
+				SLAB_ALLOC_DEFAULT, capacity);
 
 		if (!pcs->main) {
 			failed = true;

-- 
2.54.0



* [PATCH v3 14/15] mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
From: Vlastimil Babka (SUSE) @ 2026-06-15 11:54 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260615-slab_alloc_flags-v3-0-ce1146d140fb@kernel.org>

__GFP_NO_OBJ_EXT has limited scope within the slab allocator itself and
gfp flags are a scarce resource, unlike slab's alloc_flags.

Introduce the SLAB_ALLOC_NO_RECURSE alloc flag, which has the same intent
as __GFP_NO_OBJ_EXT but a more generic name: a kmalloc() family function
should not recurse into another kmalloc*() for the purpose of allocating
auxiliary structures (obj_ext arrays or sheaves).

First, replace the __GFP_NO_OBJ_EXT for allocating obj_ext arrays in
alloc_slab_obj_exts(). Make use of the newly added kmalloc_flags()
function, where we can pass alloc_flags with SLAB_ALLOC_NO_RECURSE
added. This will also pass through SLAB_ALLOC_NOLOCK so we don't need
to special case kmalloc_nolock() anymore.

Note that until now the kmalloc_nolock() call ignored the incoming gfp
flags and hardcoded __GFP_ZERO | __GFP_NO_OBJ_EXT. It is nevertheless
correct to pass on the incoming gfp flags (only augmented with
__GFP_ZERO), because if alloc_flags contain SLAB_ALLOC_NOLOCK, the
incoming gfp flags also have to be compatible with it. However,
__GFP_THISNODE might have been added for opportunistic slab allocation,
as pointed out by Hao Li, and __GFP_COMP by allocate_slab(), as pointed
out by Shengming Hu. Solve this by adding both flags to
OBJCGS_CLEAR_MASK, as it makes sense to strip them anyway for
non-kmalloc_nolock() allocations of sheaves or obj_ext arrays as well.

With this patch alone, sheaf -> obj_ext -> sheaf -> ... allocations
could still recurse, because sheaves are only converted to
SLAB_ALLOC_NO_RECURSE by the next patch. Until then, use both gfp and
alloc_flags for obj_ext. The next patch will remove the gfp part.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-15-7190909db118@kernel.org
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slab.h |  1 +
 mm/slub.c | 22 ++++++++++++----------
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 482b8e0fe797..281a65233795 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -21,6 +21,7 @@
 #define SLAB_ALLOC_DEFAULT	0x00 /* no flags */
 #define SLAB_ALLOC_NOLOCK	0x01 /* a kmalloc_nolock() allocation */
 #define SLAB_ALLOC_NEW_SLAB	0x02 /* a flag for alloc_slab_obj_exts() */
+#define SLAB_ALLOC_NO_RECURSE	0x04 /* prevent kmalloc() recursion */
 
 static inline bool alloc_flags_allow_spinning(const unsigned int alloc_flags)
 {
diff --git a/mm/slub.c b/mm/slub.c
index 383d39a22561..fc5b8c85b690 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2047,12 +2047,16 @@ static inline void dec_slabs_node(struct kmem_cache *s, int node,
 #endif /* CONFIG_SLUB_DEBUG */
 
 /*
- * The allocated objcg pointers array is not accounted directly.
+ * The allocated objcg pointers array or sheaf is not accounted directly.
  * Moreover, it should not come from DMA buffer and is not readily
- * reclaimable. So those GFP bits should be masked off.
+ * reclaimable. Node restriction for the parent allocation also should
+ * not apply to the slab's internal objects, as well as __GFP_COMP used
+ * for new slab allocations.
+ * So those GFP bits should be masked off.
  */
 #define OBJCGS_CLEAR_MASK	(__GFP_DMA | __GFP_RECLAIMABLE | \
-				__GFP_ACCOUNT | __GFP_NOFAIL)
+				__GFP_ACCOUNT | __GFP_NOFAIL | \
+				__GFP_THISNODE | __GFP_COMP)
 
 #ifdef CONFIG_SLAB_OBJ_EXT
 
@@ -2168,14 +2172,12 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 	gfp &= ~OBJCGS_CLEAR_MASK;
 	/* Prevent recursive extension vector allocation */
 	gfp |= __GFP_NO_OBJ_EXT;
+	alloc_flags |= SLAB_ALLOC_NO_RECURSE;
 
 	sz = obj_exts_alloc_size(s, slab, gfp);
 
-	if (unlikely(!allow_spin))
-		vec = kmalloc_nolock(sz, __GFP_ZERO | __GFP_NO_OBJ_EXT,
-				     slab_nid(slab));
-	else
-		vec = kmalloc_node(sz, gfp | __GFP_ZERO, slab_nid(slab));
+	/* This will use kmalloc_nolock() if alloc_flags say so */
+	vec = kmalloc_flags(sz, gfp | __GFP_ZERO, alloc_flags, slab_nid(slab));
 
 	if (!vec) {
 		/*
@@ -2251,7 +2253,7 @@ static inline void free_slab_obj_exts(struct slab *slab, bool allow_spin)
 	}
 
 	/*
-	 * obj_exts was created with __GFP_NO_OBJ_EXT flag, therefore its
+	 * obj_exts was created with SLAB_ALLOC_NO_RECURSE flag, therefore its
 	 * corresponding extension will be NULL. alloc_tag_sub() will throw a
 	 * warning if slab has extensions but the extension of an object is
 	 * NULL, therefore replace NULL with CODETAG_EMPTY to indicate that
@@ -2374,7 +2376,7 @@ __alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags,
 	if (s->flags & (SLAB_NO_OBJ_EXT | SLAB_NOLEAKTRACE))
 		return;
 
-	if (flags & __GFP_NO_OBJ_EXT)
+	if (alloc_flags & SLAB_ALLOC_NO_RECURSE || flags & __GFP_NO_OBJ_EXT)
 		return;
 
 	slab = virt_to_slab(object);

-- 
2.54.0
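
The flag handling that patch 14/15 describes for obj_ext array
allocations can be modelled in user space roughly as below. This is only
a sketch, not kernel code: the SLAB_ALLOC_* values are copied from the
mm/slab.h hunk, but the MODEL_GFP_* bit values, struct model_aux_request
and both helper functions are hypothetical names invented for
illustration. The sketch also leaves out the transitional
__GFP_NO_OBJ_EXT bit that this patch still sets.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the mm/slab.h hunk of this patch. */
#define SLAB_ALLOC_DEFAULT	0x00u
#define SLAB_ALLOC_NOLOCK	0x01u
#define SLAB_ALLOC_NEW_SLAB	0x02u
#define SLAB_ALLOC_NO_RECURSE	0x04u

/* ASSUMPTION: placeholder bits, NOT the kernel's real gfp encoding. */
#define MODEL_GFP_DMA		0x001u
#define MODEL_GFP_RECLAIMABLE	0x002u
#define MODEL_GFP_ACCOUNT	0x004u
#define MODEL_GFP_NOFAIL	0x008u
#define MODEL_GFP_THISNODE	0x010u
#define MODEL_GFP_COMP		0x020u
#define MODEL_GFP_ZERO		0x040u

/* Same set of flags as OBJCGS_CLEAR_MASK after this patch. */
#define MODEL_CLEAR_MASK	(MODEL_GFP_DMA | MODEL_GFP_RECLAIMABLE | \
				 MODEL_GFP_ACCOUNT | MODEL_GFP_NOFAIL | \
				 MODEL_GFP_THISNODE | MODEL_GFP_COMP)

/* Hypothetical container for the flags handed to the nested allocation. */
struct model_aux_request {
	unsigned int gfp;
	unsigned int alloc_flags;
};

/* Derive the flags for the nested obj_ext array allocation. */
static struct model_aux_request model_obj_ext_request(unsigned int gfp,
						      unsigned int alloc_flags)
{
	struct model_aux_request req;

	/* Strip bits that must not leak from the parent allocation. */
	req.gfp = (gfp & ~MODEL_CLEAR_MASK) | MODEL_GFP_ZERO;
	/* Keep the parent's alloc_flags (e.g. NOLOCK) and forbid recursion. */
	req.alloc_flags = alloc_flags | SLAB_ALLOC_NO_RECURSE;

	assert(!(req.gfp & MODEL_CLEAR_MASK));
	return req;
}

/* An allocation marked NO_RECURSE gets no obj_ext of its own. */
static bool model_wants_obj_ext(unsigned int alloc_flags)
{
	return !(alloc_flags & SLAB_ALLOC_NO_RECURSE);
}
```

In this model the nolock property travels in alloc_flags rather than in
a separate code path, which is the reason the commit message gives for
dropping the kmalloc_nolock() special case.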



* [PATCH v3 13/15] mm/slab: introduce kmalloc_flags()
From: Vlastimil Babka (SUSE) @ 2026-06-15 11:54 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260615-slab_alloc_flags-v3-0-ce1146d140fb@kernel.org>

With alloc_flags usage in slab, we can replace __GFP_NO_OBJ_EXT with an
alloc flag that prevents kmalloc recursion. For that we need a version
of kmalloc() that takes alloc_flags, and we need to use it in the places
that perform these potentially recursive kmalloc allocations (of sheaves
or obj_ext arrays).

Add this function, named kmalloc_flags(). Right now it's only useful for
these nested allocations, so it doesn't need to optimize build-time
constant sizes like kmalloc() or kmalloc_buckets.

Since it needs to support both the normal context and the non-spinning
kmalloc_nolock() context through the SLAB_ALLOC_NOLOCK flag, split out
most of the special _kmalloc_nolock_noprof() implementation into
__kmalloc_nolock_noprof(), which takes a slab_alloc_context, and make
_kmalloc_nolock_noprof() a simple tail-calling wrapper with the proper
context.

kmalloc_flags() can thus determine whether to call
__kmalloc_nolock_noprof() or __do_kmalloc_node(), based on the
given alloc_flags.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-14-7190909db118@kernel.org
Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slab.h | 13 +++++++++++++
 mm/slub.c | 55 +++++++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 56 insertions(+), 12 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index d86203131f58..482b8e0fe797 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -11,6 +11,7 @@
 #include <linux/memcontrol.h>
 #include <linux/kfence.h>
 #include <linux/kasan.h>
+#include <linux/slab.h>
 
 /*
  * Internal slab definitions
@@ -26,6 +27,18 @@ static inline bool alloc_flags_allow_spinning(const unsigned int alloc_flags)
 	return !(alloc_flags & SLAB_ALLOC_NOLOCK);
 }
 
+void *__kmalloc_flags_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t flags,
+				  unsigned int alloc_flags, int node)
+				  __assume_kmalloc_alignment __alloc_size(1);
+
+static __always_inline __alloc_size(1) void *_kmalloc_flags_noprof(size_t size,
+		gfp_t flags, unsigned int alloc_flags, int node, kmalloc_token_t token)
+{
+	return __kmalloc_flags_noprof(PASS_TOKEN_PARAMS(size, token), flags, alloc_flags, node);
+}
+#define kmalloc_flags_noprof(...)	_kmalloc_flags_noprof(__VA_ARGS__, __kmalloc_token(__VA_ARGS__))
+#define kmalloc_flags(...)		alloc_hooks(kmalloc_flags_noprof(__VA_ARGS__))
+
 #ifdef CONFIG_64BIT
 # ifdef system_has_cmpxchg128
 # define system_has_freelist_aba()	system_has_cmpxchg128()
diff --git a/mm/slub.c b/mm/slub.c
index 8769083bec81..383d39a22561 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5385,19 +5385,14 @@ void *__kmalloc_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t flags)
 }
 EXPORT_SYMBOL(__kmalloc_noprof);
 
-void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, int node)
+static void *__kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags,
+				     int node, const struct slab_alloc_context *ac)
 {
-	size_t orig_size = size;
-	unsigned int alloc_flags = SLAB_ALLOC_NOLOCK;
 	struct kmem_cache *s;
 	bool can_retry = true;
 	void *ret;
-	const struct slab_alloc_context ac = {
-		.caller_addr = _RET_IP_,
-		.orig_size = orig_size,
-		.alloc_flags = alloc_flags,
-	};
 
+	VM_WARN_ON_ONCE(alloc_flags_allow_spinning(ac->alloc_flags));
 	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO |
 			__GFP_NO_OBJ_EXT | __GFP_NOWARN | __GFP_NOMEMALLOC));
 
@@ -5434,7 +5429,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 		 */
 		return NULL;
 
-	ret = alloc_from_pcs(s, gfp_flags, alloc_flags, node);
+	ret = alloc_from_pcs(s, gfp_flags, ac->alloc_flags, node);
 	if (ret)
 		goto success;
 
@@ -5444,7 +5439,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 	 * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
 	 * and slab_post_alloc_hook() directly.
 	 */
-	ret = __slab_alloc_node(s, gfp_flags, node, &ac);
+	ret = __slab_alloc_node(s, gfp_flags, node, ac);
 
 	/*
 	 * It's possible we failed due to trylock as we preempted someone with
@@ -5467,11 +5462,23 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 
 success:
 	maybe_wipe_obj_freeptr(s, ret);
-	slab_post_alloc_hook(s, gfp_flags, 1, &ret, &ac);
+	slab_post_alloc_hook(s, gfp_flags, 1, &ret, ac);
 
-	ret = kasan_kmalloc(s, ret, orig_size, gfp_flags);
+	ret = kasan_kmalloc(s, ret, ac->orig_size, gfp_flags);
 	return ret;
 }
+
+void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, int node)
+{
+	const struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = size,
+		.alloc_flags = SLAB_ALLOC_NOLOCK,
+	};
+
+	return __kmalloc_nolock_noprof(PASS_TOKEN_PARAMS(size, token),
+				       gfp_flags, node, &ac);
+}
 EXPORT_SYMBOL_GPL(_kmalloc_nolock_noprof);
 
 void *__kmalloc_node_track_caller_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags,
@@ -5525,6 +5532,30 @@ void *__kmalloc_cache_node_noprof(struct kmem_cache *s, gfp_t gfpflags,
 }
 EXPORT_SYMBOL(__kmalloc_cache_node_noprof);
 
+/*
+ * The only version of kmalloc_node() that takes alloc_flags and thus can
+ * determine on its own whether to handle the allocation via kmalloc_nolock() or
+ * normally
+ */
+void *__kmalloc_flags_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t flags,
+			     unsigned int alloc_flags, int node)
+{
+	const struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = size,
+		.alloc_flags = alloc_flags,
+	};
+
+	if (alloc_flags_allow_spinning(alloc_flags)) {
+		return __do_kmalloc_node(NULL, flags, node,
+				PASS_TOKEN_PARAM(token), &ac);
+	} else {
+		return __kmalloc_nolock_noprof(PASS_TOKEN_PARAMS(size, token),
+					       flags, node, &ac);
+	}
+}
+
+
 static noinline void free_to_partial_list(
 	struct kmem_cache *s, struct slab *slab,
 	void *head, void *tail, int bulk_cnt,

-- 
2.54.0
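
The dispatch that patch 13/15 adds in __kmalloc_flags_noprof() can be
illustrated with a small user-space model. This is a sketch under stated
assumptions, not the kernel implementation: every model_* name is
hypothetical, the context struct omits caller_addr, and the two "paths"
only report which branch was taken instead of allocating anything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values as defined in mm/slab.h in this series. */
#define SLAB_ALLOC_DEFAULT	0x00u
#define SLAB_ALLOC_NOLOCK	0x01u
#define SLAB_ALLOC_NEW_SLAB	0x02u

/* Hypothetical, reduced stand-in for struct slab_alloc_context. */
struct model_alloc_context {
	size_t orig_size;
	unsigned int alloc_flags;
};

enum model_path { MODEL_PATH_NORMAL, MODEL_PATH_NOLOCK };

/* Same test as alloc_flags_allow_spinning() in mm/slab.h. */
static bool model_allow_spinning(unsigned int alloc_flags)
{
	return !(alloc_flags & SLAB_ALLOC_NOLOCK);
}

/* Stands in for the regular __do_kmalloc_node() path. */
static enum model_path model_do_kmalloc_node(const struct model_alloc_context *ac)
{
	(void)ac;
	return MODEL_PATH_NORMAL;
}

/* Stands in for __kmalloc_nolock_noprof(). */
static enum model_path model_kmalloc_nolock(const struct model_alloc_context *ac)
{
	/* The patch adds a VM_WARN_ON_ONCE() for this; the model asserts. */
	assert(!model_allow_spinning(ac->alloc_flags));
	return MODEL_PATH_NOLOCK;
}

/* Build one context and pick the path from alloc_flags alone. */
static enum model_path model_kmalloc_flags(size_t size, unsigned int alloc_flags)
{
	const struct model_alloc_context ac = {
		.orig_size = size,
		.alloc_flags = alloc_flags,
	};

	if (model_allow_spinning(alloc_flags))
		return model_do_kmalloc_node(&ac);
	return model_kmalloc_nolock(&ac);
}
```

The point of the model is that the caller no longer chooses between two
entry points; it passes alloc_flags through and the callee decides.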



* [PATCH v3 12/15] mm/slab: allow __GFP_NOMEMALLOC and __GFP_NOWARN for kmalloc_nolock()
From: Vlastimil Babka (SUSE) @ 2026-06-15 11:54 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260615-slab_alloc_flags-v3-0-ce1146d140fb@kernel.org>

The two flags are added internally, so there is no point in warning when
the caller passes them as well. Allow them. This will allow simplifying
obj_ext allocation under kmalloc_nolock().

Also it's not necessary to have the extra alloc_gfp variable for adding
the two flags. The original gfp_flags parameter is not used anywhere
except for the warning. So remove alloc_gfp and directly modify and use
gfp_flags everywhere.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-13-7190909db118@kernel.org
Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 include/linux/slab.h |  3 ++-
 mm/slub.c            | 19 ++++++++++---------
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index ce1c867dc0ba..b955f3cbb732 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -1040,7 +1040,8 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
  * kmalloc_nolock - Allocate an object of given size from any context.
  * @size: size to allocate
  * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO, __GFP_NO_OBJ_EXT
- * allowed.
+ * allowed. Also __GFP_NOWARN and __GFP_NOMEMALLOC are allowed but added
+ * internally thus not necessary.
  * @node: node number of the target node.
  *
  * Return: pointer to the new object or NULL in case of error.
diff --git a/mm/slub.c b/mm/slub.c
index 537ea68f417b..8769083bec81 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5387,7 +5387,6 @@ EXPORT_SYMBOL(__kmalloc_noprof);
 
 void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, int node)
 {
-	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
 	size_t orig_size = size;
 	unsigned int alloc_flags = SLAB_ALLOC_NOLOCK;
 	struct kmem_cache *s;
@@ -5400,7 +5399,9 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 	};
 
 	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO |
-				      __GFP_NO_OBJ_EXT));
+			__GFP_NO_OBJ_EXT | __GFP_NOWARN | __GFP_NOMEMALLOC));
+
+	gfp_flags |= __GFP_NOWARN | __GFP_NOMEMALLOC;
 
 	if (unlikely(!size))
 		return ZERO_SIZE_PTR;
@@ -5419,7 +5420,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 retry:
 	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
 		return NULL;
-	s = kmalloc_slab(size, NULL, alloc_gfp, PASS_TOKEN_PARAM(token));
+	s = kmalloc_slab(size, NULL, gfp_flags, PASS_TOKEN_PARAM(token));
 
 	if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
 		/*
@@ -5433,7 +5434,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 		 */
 		return NULL;
 
-	ret = alloc_from_pcs(s, alloc_gfp, alloc_flags, node);
+	ret = alloc_from_pcs(s, gfp_flags, alloc_flags, node);
 	if (ret)
 		goto success;
 
@@ -5443,7 +5444,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 	 * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
 	 * and slab_post_alloc_hook() directly.
 	 */
-	ret = __slab_alloc_node(s, alloc_gfp, node, &ac);
+	ret = __slab_alloc_node(s, gfp_flags, node, &ac);
 
 	/*
 	 * It's possible we failed due to trylock as we preempted someone with
@@ -5456,8 +5457,8 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 		size = s->object_size + 1;
 		/*
 		 * Another alternative is to
-		 * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT;
-		 * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT;
+		 * if (memcg) gfp_flags &= ~__GFP_ACCOUNT;
+		 * else if (!memcg) gfp_flags |= __GFP_ACCOUNT;
 		 * to retry from bucket of the same size.
 		 */
 		can_retry = false;
@@ -5466,9 +5467,9 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 
 success:
 	maybe_wipe_obj_freeptr(s, ret);
-	slab_post_alloc_hook(s, alloc_gfp, 1, &ret, &ac);
+	slab_post_alloc_hook(s, gfp_flags, 1, &ret, &ac);
 
-	ret = kasan_kmalloc(s, ret, orig_size, alloc_gfp);
+	ret = kasan_kmalloc(s, ret, orig_size, gfp_flags);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(_kmalloc_nolock_noprof);

-- 
2.54.0
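
The gfp check that patch 12/15 relaxes can be modelled in user space as
follows. This is a sketch only: the MODEL_GFP_* bit values are
placeholders rather than the kernel's real gfp encoding, and both
functions are hypothetical names. MODEL_GFP_DIRECT_RECLAIM merely stands
in for "some flag outside the allowed set".

```c
#include <assert.h>
#include <stdbool.h>

/* ASSUMPTION: placeholder bits, NOT the kernel's real gfp encoding. */
#define MODEL_GFP_ACCOUNT	0x01u
#define MODEL_GFP_ZERO		0x02u
#define MODEL_GFP_NO_OBJ_EXT	0x04u
#define MODEL_GFP_NOWARN	0x08u
#define MODEL_GFP_NOMEMALLOC	0x10u
#define MODEL_GFP_DIRECT_RECLAIM 0x20u	/* example of a disallowed flag */

/* Flags accepted from the caller after this patch. */
#define MODEL_NOLOCK_ALLOWED	(MODEL_GFP_ACCOUNT | MODEL_GFP_ZERO | \
				 MODEL_GFP_NO_OBJ_EXT | MODEL_GFP_NOWARN | \
				 MODEL_GFP_NOMEMALLOC)

/* Returns false where the kernel would trigger VM_WARN_ON_ONCE(). */
static bool model_nolock_gfp_ok(unsigned int gfp)
{
	return !(gfp & ~MODEL_NOLOCK_ALLOWED);
}

/* The flags actually used for the allocation: caller's plus internal ones. */
static unsigned int model_nolock_effective_gfp(unsigned int gfp)
{
	unsigned int out = gfp | MODEL_GFP_NOWARN | MODEL_GFP_NOMEMALLOC;

	/* Adding the internal flags never makes an accepted mask rejected. */
	assert(!model_nolock_gfp_ok(gfp) || model_nolock_gfp_ok(out));
	return out;
}
```

Because the internally added flags are part of the allowed set, the
effective mask can be fed back into the same function without tripping
the check, which is what lets the later patches forward gfp flags to
nested kmalloc_nolock() allocations.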



