Linux cgroups development

Linux cgroups development
 help / color / mirror / Atom feed

* Re: [PATCH] mm: memcontrol-v1: use nofail allocations for soft limit trees
From: Andrew Morton @ 2026-06-10  1:30 UTC (permalink / raw)
  To: Ruoyu Wang
  Cc: Michal Hocko, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, cgroups, linux-mm, linux-kernel
In-Reply-To: <CAK_7xqyyDqNW1+puMSp2LzxmOKxFUx-UO9uGiDKoL7ZTJ8+3ZQ@mail.gmail.com>

On Mon, 8 Jun 2026 16:34:48 +0800 Ruoyu Wang <ruoyuw560@gmail.com> wrote:

> This was found by static analysis and then checked by reading the code:
> memcg1_init() dereferences rtpn unconditionally after kzalloc_node(). I
> treated the soft-limit tree as mandatory memcg v1 init state and used
> __GFP_NOFAIL because continuing without it would not be useful.
> 
> I agree this is early boot init code, and I do not have a
> runtime failure report or fault-injection reproduction for it.

Thanks.

Please teach the static analyzer that kernel practice is to ignore
allocation failures in __init code.

^ permalink raw reply

* Re: [PATCH] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: Andrew Morton @ 2026-06-10  0:53 UTC (permalink / raw)
  To: Farhad Alemi
  Cc: David Hildenbrand, Gregory Price, Farhad Alemi, Yury Norov,
	Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
	Ying Huang, Alistair Popple, Waiman Long, Rasmus Villemoes,
	linux-mm, linux-kernel, cgroups, stable
In-Reply-To: <CA+0ovCg05rUk1-3k2ysdxmbcER8aG-wVh9SSTrrbp6LPWpPHYA@mail.gmail.com>

On Tue, 9 Jun 2026 19:57:41 -0400 Farhad Alemi <farhad.alemi@berkeley.edu> wrote:

> cpuset_update_tasks_nodemask() rebinds a task's own mempolicy to the
> cpuset's effective, online mems (newmems, from guarantee_online_mems()),
> but rebinds that task's VMA mempolicies to the *configured* mask instead:

Hard to understand.  Was "rebinds" supposed to be "is supposed to
rebind"?

> 	cpuset_change_task_nodemask(task, &newmems);
> 	...
> 	mpol_rebind_mm(mm, &cs->mems_allowed);
> 
> On the default (v2) hierarchy a cpuset that has never had cpuset.mems
> written keeps mems_allowed empty while effective_mems is inherited
> non-empty from the parent, and tasks may be attached to it (the
> empty-mems attach check is v1-only).  A subsequent rebind -- e.g. from a
> CPU hotplug event walking the cpuset -- then calls mpol_rebind_mm() with
> an empty mask.  For a VMA policy created with MPOL_F_RELATIVE_NODES this
> reaches mpol_relative_nodemask() ->
> nodes_fold(..., nodes_weight(cs->mems_allowed) == 0) -> bitmap_fold(),
> whose set_bit(oldbit % sz, dst) divides by zero:
> 
>   Oops: divide error: 0000 [#1] SMP KASAN NOPTI
>   RIP: 0010:bitmap_fold+0x5e/0xb0
>    mpol_rebind_nodemask
>    mpol_rebind_mm
>    cpuset_update_tasks_nodemask
>    cpuset_handle_hotplug
>    sched_cpu_deactivate
>    cpuhp_thread_fun
> 
> cs->mems_allowed is the only nodemask in this function that is not the
> effective set: the task-policy rebind, the page-migration target and
> cs->old_mems_allowed all use newmems.  The sibling cpuset_attach() path
> already rebinds VMA policies against the effective mems
> (cpuset_attach_nodemask_to = cs->effective_mems) and explicitly notes
> that mems_allowed can be empty under hotplug.  Rebind the VMA policies to
> newmems too: it is guaranteed non-empty by guarantee_online_mems(), which
> fixes the divide-by-zero, and it makes the VMA policies consistent with
> the task policy and with the nodes the task is actually allowed to use.

How is this bug triggered?



^ permalink raw reply

* Re: [PATCH] mm: constify oom_control, scan_control, and alloc_context nodemask
From: SeongJae Park @ 2026-06-10  0:19 UTC (permalink / raw)
  To: Gregory Price
  Cc: SeongJae Park, linux-mm, linux-kernel, cgroups, kernel-team,
	longman, chenridong, akpm, david, ljs, liam, vbabka, rppt, surenb,
	mhocko, kasong, qi.zheng, shakeel.butt, baohua, axelrasmussen,
	yuanchu, weixugc, rientjes, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park, tj, hannes, mkoutny, jackmanb, ziy
In-Reply-To: <20260609002919.3967782-1-gourry@gourry.net>

On Mon,  8 Jun 2026 20:29:19 -0400 Gregory Price <gourry@gourry.net> wrote:

> The nodemasks in these structures may come from a variety of sources,
> including tasks and cpusets - and should never be modified by any code
> when being passed around inside another context.

Nice work, I also confirmed I can built the kernel with this patch.

> 
> Signed-off-by: Gregory Price <gourry@gourry.net>

Tested-by: SeongJae Park <sj@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>

[...]
>  	/*
>  	 * The memory cgroup that hit its limit and as a result is the
> @@ -6599,7 +6599,7 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
>   * happens, the page allocator should not consider triggering the OOM killer.
>   */
>  static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> -					nodemask_t *nodemask)
> +				    const nodemask_t *nodemask)

Seems the above indentation has changed for a rason that I have no clue, and
also introduced a line having both spaces and tabs.

Just thinking loud.


Thanks,
SJ

[...]

^ permalink raw reply

* [PATCH] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: Farhad Alemi @ 2026-06-09 23:57 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Gregory Price
  Cc: Farhad Alemi, Yury Norov, Joshua Hahn, Zi Yan, Matthew Brost,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Waiman Long, Rasmus Villemoes, linux-mm, linux-kernel, cgroups,
	stable
In-Reply-To: <25c4bc47-b65d-4c04-8a8f-18eef2b5566a@kernel.org>

cpuset_update_tasks_nodemask() rebinds a task's own mempolicy to the
cpuset's effective, online mems (newmems, from guarantee_online_mems()),
but rebinds that task's VMA mempolicies to the *configured* mask instead:

	cpuset_change_task_nodemask(task, &newmems);
	...
	mpol_rebind_mm(mm, &cs->mems_allowed);

On the default (v2) hierarchy a cpuset that has never had cpuset.mems
written keeps mems_allowed empty while effective_mems is inherited
non-empty from the parent, and tasks may be attached to it (the
empty-mems attach check is v1-only).  A subsequent rebind -- e.g. from a
CPU hotplug event walking the cpuset -- then calls mpol_rebind_mm() with
an empty mask.  For a VMA policy created with MPOL_F_RELATIVE_NODES this
reaches mpol_relative_nodemask() ->
nodes_fold(..., nodes_weight(cs->mems_allowed) == 0) -> bitmap_fold(),
whose set_bit(oldbit % sz, dst) divides by zero:

  Oops: divide error: 0000 [#1] SMP KASAN NOPTI
  RIP: 0010:bitmap_fold+0x5e/0xb0
   mpol_rebind_nodemask
   mpol_rebind_mm
   cpuset_update_tasks_nodemask
   cpuset_handle_hotplug
   sched_cpu_deactivate
   cpuhp_thread_fun

cs->mems_allowed is the only nodemask in this function that is not the
effective set: the task-policy rebind, the page-migration target and
cs->old_mems_allowed all use newmems.  The sibling cpuset_attach() path
already rebinds VMA policies against the effective mems
(cpuset_attach_nodemask_to = cs->effective_mems) and explicitly notes
that mems_allowed can be empty under hotplug.  Rebind the VMA policies to
newmems too: it is guaranteed non-empty by guarantee_online_mems(), which
fixes the divide-by-zero, and it makes the VMA policies consistent with
the task policy and with the nodes the task is actually allowed to use.

Fixes: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}")
Suggested-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Farhad Alemi <farhad.alemi@berkeley.edu>
Cc: stable@vger.kernel.org
---
 kernel/cgroup/cpuset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)

 		migrate = is_memory_migrate(cs);

-		mpol_rebind_mm(mm, &cs->mems_allowed);
+		mpol_rebind_mm(mm, &newmems);
 		if (migrate)
 			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
 		else
-- 
2.43.0

^ permalink raw reply

* [syzbot] [cgroups?] [mm?] WARNING in hrtick_start_fair
From: syzbot @ 2026-06-09 23:04 UTC (permalink / raw)
  To: anna-maria, cgroups, frederic, linux-kernel, linux-mm,
	syzkaller-bugs, tglx

Hello,

syzbot found the following issue on:

HEAD commit:    a87737435cfa Add linux-next specific files for 20260608
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=11715db6580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=da47745f686dc823
dashboard link: https://syzkaller.appspot.com/bug?extid=2cbf10efc23b22ff9c31
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12df60ae580000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11edb0ae580000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/85d19fe6bb4e/disk-a8773743.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/30c683ce26e1/vmlinux-a8773743.xz
kernel image: https://storage.googleapis.com/syzbot-assets/4db5027513d2/bzImage-a8773743.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+2cbf10efc23b22ff9c31@syzkaller.appspotmail.com

------------[ cut here ]------------
task_rq(p) != rq
WARNING: kernel/sched/fair.c:7656 at hrtick_start_fair+0x196/0x1f0 kernel/sched/fair.c:7656, CPU#0: rcu_preempt/18
Modules linked in:
CPU: 0 UID: 0 PID: 18 Comm: rcu_preempt Not tainted syzkaller #0 PREEMPT_{RT,(full)} 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
RIP: 0010:hrtick_start_fair+0x196/0x1f0 kernel/sched/fair.c:7656
Code: 42 80 3c 20 00 74 08 4c 89 ff e8 85 e3 97 00 4d 39 37 0f 85 0c ff ff ff 48 89 df 5b 41 5c 41 5d 41 5e 41 5f e9 4b 65 fa ff 90 <0f> 0b 90 e9 d1 fe ff ff 44 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c 82
RSP: 0018:ffffc900001777e0 EFLAGS: 00010087
RAX: ffff8880b873ba40 RBX: ffff8880b863ba40 RCX: ffffffff8197c7de
RDX: 0000000000000000 RSI: ffff88802c528000 RDI: ffff8880b863ba40
RBP: dffffc0000000000 R08: ffffffff8fcf0b0f R09: 1ffffffff1f9e161
R10: dffffc0000000000 R11: fffffbfff1f9e162 R12: dffffc0000000000
R13: 1ffff110170c78d6 R14: ffff88802c528000 R15: ffffffff8dc217d8
FS:  0000000000000000(0000) GS:ffff888125a76000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007eff95e6a540 CR3: 0000000028e30000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 set_next_task_fair+0xa68/0xce0 kernel/sched/fair.c:15058
 put_prev_set_next_task kernel/sched/sched.h:2770 [inline]
 pick_next_task kernel/sched/core.c:6443 [inline]
 __schedule+0x3e03/0x5550 kernel/sched/core.c:7144
 __schedule_loop kernel/sched/core.c:7308 [inline]
 schedule+0x164/0x360 kernel/sched/core.c:7323
 schedule_timeout+0x158/0x2c0 kernel/time/sleep_timeout.c:99
 rcu_gp_fqs_loop+0x312/0x11b0 kernel/rcu/tree.c:2123
 rcu_gp_kthread+0x9e/0x2b0 kernel/rcu/tree.c:2325
 kthread+0x388/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply

* Re: [PATCH RFC 00/15] mm/slab: introduce alloc_flags and slab_alloc_context
From: Alexei Starovoitov @ 2026-06-09 18:40 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260609-slab_alloc_flags-v1-0-2bf4a4b9b526@kernel.org>

On Tue Jun 9, 2026 at 2:17 AM PDT, Vlastimil Babka (SUSE) wrote:
> This series is based on slab/for-next. If all goes well, it would
> hopefully go to slab/for-next soon after the 7.2 merge window, so any
> other work can be based on it to avoid conflicts, as it touches a lot
> parts of slab.
>
> Git: https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slab_alloc_flags

Overall looks great to me.
I would ship all patches except the last one for this merge window,
since I don't see anything controversial or dangerous in there.
Especially since it touches slab so much. My slab-arena changes
would need to adopt it and I don't want to delay the whole thing by two merge windows.
Harry's changes would need to rebased as well.
So the sooner the trees converge the better.


^ permalink raw reply

* Re: [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server
From: Yuri Andriaccio @ 2026-06-09 16:23 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Johannes Weiner, Michal Koutný, cgroups,
	linux-kernel, Luca Abeni, Yuri Andriaccio
In-Reply-To: <aig1ZGEq0Vr0qLzl@jlelli-thinkpadt14gen4.remote.csb>

Hi Juri,

Thanks for looking into this.

 > I started playing with the new interface and ended up with the following
 >
 > bash-5.3# cat cpu.rt.max  (root)
 > 10000 100000
 > bash-5.3# cat g1/cpu.rt.max
 > 10000 100000
 > bash-5.3# cat g1/cpu.rt.internal
 > 9999 100000
 >
 > which looks odd to me, as nothing is running on g1 yet and no children
 > groups either. Maybe a rounding error of some kind?

You are right. I should have mentioned that it is just a rounding error 
that occurs when converting from a bandwidth value to a runtime value. 
This happens because the tg_rt_internal_bandwidth() function truncates 
the value when transforming the runtime from nanoseconds to micros. 
Rounding could be used here to report a more accurate value.

This same issue is probably found in the from_ratio() function, which 
has a similar truncation issue when converting from bandwidth to 
runtime, but since it is working in the nanoseconds range it might not 
be that big of a problem. The value from from_ratio() is used for the 
setup of the dl_servers even when the children bw is zero, so maybe it 
is possible to add a special case?

Anyways, as it is right now, the cpu.rt.internal may have only a +1/-1us 
error in reporting the actual used values, while the error for the 
runtime value used internally to setup the dl_servers is in the range of 
tens of nanoseconds.

Thanks,
Yuri

^ permalink raw reply

* Re: [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server
From: Juri Lelli @ 2026-06-09 15:46 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Johannes Weiner, Michal Koutný, cgroups,
	linux-kernel, Luca Abeni, Yuri Andriaccio
In-Reply-To: <20260608121546.69910-1-yurand2000@gmail.com>

Hi Yuri,

Thanks for sending this out.

On 08/06/26 14:15, Yuri Andriaccio wrote:
> Hello,
> 
> This is the v6 for Hierarchical Constant Bandwidth Server, aiming at replacing
> the current RT_GROUP_SCHED mechanism with something more robust and
> theoretically sound. The patchset has been presented at OSPM25 and OSPM26
> (https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
> be found at https://lwn.net/Articles/1021332/ . You can find the previous
> versions of this patchset at the bottom of the page, in particular version 1
> which talks in more detail what this patchset is all about and how it is
> implemented.
> 
> This v6 version works on the comments by the reviewers and introduces the
> following meaningful changes:
> - Update to kernel version 7.1.
> - Refactorings and general cleanups.
> - Removal of substantial duplicated code.
> - Express more locking constraints in code.
> - New cpu.rt.max interface.
> - Refactoring of migration code to reduce code duplication.
>   The new migration code now reuses the existing push/pull and similar functions
>   and specializes where needed, substantially reducing the footprint of group
>   migration code from previous versions.
> 
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> New cgroup-v2 interface:
> After extensive discussions with the kernel's maintainers, we have built a new
> interface to support HCBS scheduling. Since this will be a cgroup-v2 only
> feature (the fate of cgroup-v1 old RT_GROUP_SCHED has yet to be decided), it was
> possible to drop the original v1 interface entirely and create a completely new
> one that is similar to those that are already existing.
> 
> Every cgroup has now two new files:
> - cpu.rt.max (similar to the cpu.max file)
> - cpu.rt.internal (read-only, not available in the root cgroup, it may be
>                    removed if deemed unnecessary, see later for details)
> 
> In this new interface, HCBS cgroups may either be set to use deadline servers,
> and thus reserving a specified amount of bandwidth, very similarly to the
> previous system, or can delegate their FIFO/RR tasks' scheduling to the nearest
> ancestor that it is configured (default on group creation). If the nearest
> configured ancestor is the root cgroup, tasks will be effectively run on the
> root runqueue even if their cgroup is not the root task group.
> 
> This means that subtrees are allowed to retain the original non-RT_GROUP_SCHED
> behaviour, scheduling on root, while the feature is nonetheless active. In the
> meantime other subtrees may use HCBS, and the whole hierarchy can coexist
> without issues.
> 
> This behaviour is specified in the cpu.rt.max file, which accepts the string
> "<runtime | 'max'> <period>". A zero runtime disables FIFO/RR scheduling for
> tasks in that group, a non-zero runtime creates a reservation and uses HCBS, a
> runtime of 'max' instead tells the scheduler to use the nearest configured
> ancestor for the FIFO/RR task scheduling.
> 
> The admission test now does not only check the immediate children of a cgroup
> for schedulability (recall that a group's bandwidth must be always greater than
> or equal to its children total bandwidth), but it has to check its whole
> subtree: if a child delegates its tasks to its parents (runtime = 'max'), then
> this child's own children (the grandchildrens) are effectively viewed as
> immediate children that compete for the same bandwidth of their grandparent, and
> so on down the hierarchy.
> 
> To support both threaded and domain cgroups, the original test that allowed only
> to run tasks in leaf cgroups has been removed: this is already enforced for
> domain cgroups by existing code, while this must not be the case for threaded
> cgroups.
> 
> Since groups in the middle of the hierarchy can now also run tasks, their
> dl_servers must be configured properly: a parent cgroup dl_servers can only use
> their assigned bandwidth minus the total of their children. The cpu.rt.internal
> file reads exactly what is this "remainder" bandwidth. Since dl_servers must
> have a runtime and period values assigned, the period is taken from the user
> configured cpu.rt.max file and the runtime is computed from the remainder bw.
> This runtime and the period are the values shown by cpu.rt.internal.
> 
> Supporting both threaded and domain cgroups also dropped all the extra code
> related to active and 'live' cgroups as mentioned in previous RFCs.
> 

I started playing with the new interface and ended up with the following

bash-5.3# cat cpu.rt.max  (root)
10000 100000
bash-5.3# cat g1/cpu.rt.max
10000 100000
bash-5.3# cat g1/cpu.rt.internal
9999 100000

which looks odd to me, as nothing is running on g1 yet and no children
groups either. Maybe a rounding error of some kind?

Thanks,
Juri


^ permalink raw reply

* Re: [PATCH] mm: constify oom_control, scan_control, and alloc_context nodemask
From: Gregory Price @ 2026-06-09 15:42 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, cgroups, kernel-team, longman, chenridong,
	akpm, david, liam, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, rientjes,
	chrisl, shikemeng, nphamcs, baoquan.he, youngjun.park, tj, hannes,
	mkoutny, jackmanb, ziy
In-Reply-To: <aifDlU96HSRy72Rb@lucifer>

On Tue, Jun 09, 2026 at 08:41:59AM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 08:29:19PM -0400, Gregory Price wrote:
> > The nodemasks in these structures may come from a variety of sources,
> > including tasks and cpusets - and should never be modified by any code
> > when being passed around inside another context.
> >
> > Signed-off-by: Gregory Price <gourry@gourry.net>
> 
> Thanks for doing this, it's nice to gradually up our const correctness game
> (as much as C can ever be const correct :)
> 

I've actually been carrying this patch locally for about a year in the
private nodes work.  Sorry i didn't send it sooner.  There is likely
more work to be done here.

> LGTM, builds locally too, so:
> 
> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
> 

^ permalink raw reply

* Re: [PATCH] mm: constify oom_control, scan_control, and alloc_context nodemask
From: Gregory Price @ 2026-06-09 15:41 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Zi Yan, linux-mm, linux-kernel, cgroups, kernel-team, longman,
	chenridong, akpm, david, liam, vbabka, rppt, surenb, mhocko,
	kasong, qi.zheng, shakeel.butt, baohua, axelrasmussen, yuanchu,
	weixugc, rientjes, chrisl, shikemeng, nphamcs, baoquan.he,
	youngjun.park, tj, hannes, mkoutny, jackmanb
In-Reply-To: <aifC9s9X6hLWdKkd@lucifer>

On Tue, Jun 09, 2026 at 08:39:22AM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 09:44:42PM -0400, Zi Yan wrote:
> > On 8 Jun 2026, at 20:29, Gregory Price wrote:
> >
> > > The nodemasks in these structures may come from a variety of sources,
> > > including tasks and cpusets - and should never be modified by any code
> > > when being passed around inside another context.
> > >
> > > Signed-off-by: Gregory Price <gourry@gourry.net>
> > > ---
> > >  include/linux/cpuset.h | 4 ++--
> > >  include/linux/mm.h     | 4 ++--
> > >  include/linux/mmzone.h | 6 +++---
> > >  include/linux/oom.h    | 2 +-
> > >  include/linux/swap.h   | 2 +-
> > >  kernel/cgroup/cpuset.c | 2 +-
> > >  mm/internal.h          | 2 +-
> > >  mm/mmzone.c            | 5 +++--
> > >  mm/page_alloc.c        | 6 +++---
> > >  mm/show_mem.c          | 9 ++++++---
> > >  mm/vmscan.c            | 6 +++---
> > >  11 files changed, 26 insertions(+), 22 deletions(-)
> > >
> >
> > LGTM and it compiles. As long as Sashiko does not complain, feel free to
> > add:
> 
> I would add caveats of:
> 
> - Complains legitimately
> - And it's about this actual patch not something unrelated
> 
> :P
> 
> (Not speaking for Zi of course, but I mean just in general I feel these caveats
> should be implicit :))

Thank you for your contribution! Gourry AI review found 1 potential issue(s) to consider:
- [High] Questioning the legitimacy of AI is grounds for excommunication
--

[Severity: High]
Insolence will not be suffered, take this man away.

;]

(fwiw I'm getting used to Sashiko, it's doing better)

~Gregory

^ permalink raw reply

* Re: [PATCH 0/5] blk-cgroup: fix blkg list and policy data races
From: Yizhou Tang @ 2026-06-09 15:15 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Jens Axboe, Tejun Heo, Josef Bacik, Ming Lei, Nilay Shroff,
	Bart Van Assche, linux-block, cgroups, linux-kernel
In-Reply-To: <cover.1780492756.git.yukuai@fygo.io>

On Wed, Jun 3, 2026 at 9:35 PM Yu Kuai <yukuai@fygo.io> wrote:
>
> This small series fixes races between blkg destruction, q->blkg_list
> iteration, and blkcg policy activation.
>
> The first two patches serialize q->blkg_list walks in blkg_destroy_all()
> and BFQ writeback weight-raising teardown with blkcg_mutex. The next two
> patches close policy activation races with concurrent blkg destruction,
> including skipping blkgs that are already dying. The final patch factors
> the common policy data teardown loop.
>
> This uses blkcg_mutex rather than extending queue_lock coverage because
> the races are about blkg list visibility and policy-data lifetime, not
> request-queue dispatch state. blkg_free_workfn() already uses
> blkcg_mutex to serialize policy-data freeing with policy deactivation
> and removes blkgs from q->blkg_list only after that teardown. Taking the
> same mutex around the remaining q->blkg_list walkers gives one sleepable
> serialization point for blkg lifetime, avoids adding more queue_lock
> nesting, and prepares the follow-up conversion that removes queue_lock
> from blkcg list protection entirely.

LGTM.

I had noticed some time ago that blkcg_activate_policy() did not take
q->blkcg_mutex, and this patchset nicely resolves my concern.

Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>

Best regards,
Yi


>
> Yu Kuai (2):
>   blk-cgroup: protect q->blkg_list iteration in blkg_destroy_all() with
>     blkcg_mutex
>   bfq: protect q->blkg_list iteration in bfq_end_wr_async() with
>     blkcg_mutex
>
> Zheng Qixing (3):
>   blk-cgroup: fix race between policy activation and blkg destruction
>   blk-cgroup: skip dying blkg in blkcg_activate_policy()
>   blk-cgroup: factor policy pd teardown loop into helper
>
>  block/bfq-cgroup.c  |  3 ++-
>  block/bfq-iosched.c |  6 +++++
>  block/blk-cgroup.c  | 65 ++++++++++++++++++++++++---------------------
>  3 files changed, 43 insertions(+), 31 deletions(-)
>
> --
> 2.51.0
>

^ permalink raw reply

* Re: [PATCH] mm: constify oom_control, scan_control, and alloc_context nodemask
From: Zi Yan @ 2026-06-09 14:43 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Gregory Price, linux-mm, linux-kernel, cgroups, kernel-team,
	longman, chenridong, akpm, david, liam, vbabka, rppt, surenb,
	mhocko, kasong, qi.zheng, shakeel.butt, baohua, axelrasmussen,
	yuanchu, weixugc, rientjes, chrisl, shikemeng, nphamcs,
	baoquan.he, youngjun.park, tj, hannes, mkoutny, jackmanb
In-Reply-To: <aifC9s9X6hLWdKkd@lucifer>

On 9 Jun 2026, at 3:39, Lorenzo Stoakes wrote:

> On Mon, Jun 08, 2026 at 09:44:42PM -0400, Zi Yan wrote:
>> On 8 Jun 2026, at 20:29, Gregory Price wrote:
>>
>>> The nodemasks in these structures may come from a variety of sources,
>>> including tasks and cpusets - and should never be modified by any code
>>> when being passed around inside another context.
>>>
>>> Signed-off-by: Gregory Price <gourry@gourry.net>
>>> ---
>>>  include/linux/cpuset.h | 4 ++--
>>>  include/linux/mm.h     | 4 ++--
>>>  include/linux/mmzone.h | 6 +++---
>>>  include/linux/oom.h    | 2 +-
>>>  include/linux/swap.h   | 2 +-
>>>  kernel/cgroup/cpuset.c | 2 +-
>>>  mm/internal.h          | 2 +-
>>>  mm/mmzone.c            | 5 +++--
>>>  mm/page_alloc.c        | 6 +++---
>>>  mm/show_mem.c          | 9 ++++++---
>>>  mm/vmscan.c            | 6 +++---
>>>  11 files changed, 26 insertions(+), 22 deletions(-)
>>>
>>
>> LGTM and it compiles. As long as Sashiko does not complain, feel free to
>> add:
>
> I would add caveats of:
>
> - Complains legitimately
> - And it's about this actual patch not something unrelated
>
> :P
>
> (Not speaking for Zi of course, but I mean just in general I feel these caveats
> should be implicit :))

I agree. Thank you for bringing it up.

>
>>
>> Acked-by: Zi Yan <ziy@nvidia.com>
>>
>> Best Regards,
>> Yan, Zi
>
> Cheers, Lorenzo


Best Regards,
Yan, Zi

^ permalink raw reply

* Re: [PATCH RFC 00/15] mm/slab: introduce alloc_flags and slab_alloc_context
From: Vlastimil Babka (SUSE) @ 2026-06-09 14:28 UTC (permalink / raw)
  To: Usama Arif
  Cc: Harry Yoo, hao.ge, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Suren Baghdasaryan, Alexei Starovoitov,
	Andrew Morton, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, kasan-dev,
	linux-mm, linux-kernel, cgroups
In-Reply-To: <20260609133534.3548059-1-usama.arif@linux.dev>

On 6/9/26 15:35, Usama Arif wrote:
> On Tue, 09 Jun 2026 11:17:45 +0200 "Vlastimil Babka (SUSE)" <vbabka@kernel.org> wrote:
> 
>> This series is based on slab/for-next. If all goes well, it would
>> hopefully go to slab/for-next soon after the 7.2 merge window, so any
>> other work can be based on it to avoid conflicts, as it touches a lot
>> parts of slab.
>> 
>> Git: https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slab_alloc_flags
>> 
>> The slab implementation currently relies on gfp flags to convey
>> some context information internally:
>> 
>> - The absence of both __GFP_RECLAIM flags is interpreted as "cannot spin
>>   on locks", and intended to be used by kmalloc_nolock(). But false
>>   positives are possible e.g. during early boot where gfp_allowed_mask
>>   clears __GFP_RECLAIM from all allocations. This leads to unnecessary
>>   allocation failures and workarounds such as fd3634312a04 ("debugobject:
>>   Make it work with deferred page initialization - again").
>> 
>> - __GFP_NO_OBJ_EXT exists and takes up valuable bit in the gfp flags
>>   space, only to prevent recursive kmalloc() allocations for obj_ext
>>   arrays and sheaves.
>> 
> 
> Hello Valstimil!
> 
> I think memory allocation profiling uses __GFP_NO_OBJ_EXT, and I dont see
> it being removed in the series (hopefully I didnt miss it).
> 
> Adding Hao Ge in CC who did this in the commit:
> mm/alloc_tag: replace fixed-size early PFN array with dynamic linked list

Thanks for the heads up. I missed it because my series is based on
slab/for-next and that commit is in mm-unstable. My patch 15 actually
modifies the TODO comment that is meanwhile resolved by Hao Ge's patch.

Which means my patch 15/15 can't be used as-is, and at worst I will drop it.
But I'd encourage Hao Ge with Suren to find some way to avoid the gfp flag
usage too, because it's now quite a niche use case (preventing false
positive CONFIG_MEM_ALLOC_PROFILING_DEBUG warnings, IIUC?) to take a
valuable gfp flag bit, IMHO.

>> The page allocator uses its internal alloc_flags to convey various
>> context information, including ALLOC_TRYLOCK (meaning "cannot spin").
>> This series copies that concept for the slab allocator, with its own
>> slab-specific internal flags:
>> 
>> - SLAB_ALLOC_DEFAULT - no extra flags (the value is 0), but explicit
>> - SLAB_ALLOC_TRYLOCK - do not spin on locks (used by kmalloc_nolock())
>> - SLAB_ALLOC_NEW_SLAB - replacing existing 'bool new_slab' parameter
>> 			for allocating obj_ext arrays
>> - SLAB_ALLOC_NO_RECURSE - replacing usage of __GFP_NO_OBJ_EXT
>> 
>> To reduce the amount of parameters in various internal functions, we
>> additionally introduce slab_alloc_context (also inspired by page
>> allocator's alloc_context) for passing a number of existing arguments
>> and the new alloc_flags:
>> 
>> /* Structure holding extra parameters for slab allocations */
>> struct slab_alloc_context {
>> 	unsigned long caller_addr;
>> 	unsigned long orig_size;
>> 	unsigned int alloc_flags;
>> 	struct list_lru *lru;
>> };
>> 
>> This also replaces the existing struct partial_context.
>> 
>> The last necessary piece is kmalloc_flags() which can take the
>> alloc_flags in addition to gfp flags and is intended for the recursive
>> allocations of sheaves and obj_ext arrays, so that both
>> SLAB_ALLOC_TRYLOCK and SLAB_ALLOC_NO_RECURSE can be communicated.
>> Internally it decides between kmalloc_nolock() and normal kmalloc()
>> depending SLAB_ALLOC_TRYLOCK.
>> 
>> The rest of the series is gradually expanding the usage of both
>> alloc_flags and slab_alloc_context as necessary, with bits of
>> refactoring. Then, __GFP_NO_OBJ_EXT is removed completely.
>> 
>> Note that some usage of gfpflags_allow_spinning() relying on absence of
>> __GFP_RECLAIM remains outside of slab (and page allocator) in memcg,
>> page_owner and stackdepot code. These can thus yield false-positive
>> decisions that spinning is not allowed, but should not result in
>> important allocations failing anymore.
>> 
>> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>> ---
>> Vlastimil Babka (SUSE) (15):
>>       mm/slab: always zero only requested size on alloc
>>       mm/slab: stop inlining __slab_alloc_node()
>>       mm/slab: introduce slab_alloc_context
>>       mm/slab: introduce alloc_flags and SLAB_ALLOC_TRYLOCK
>>       mm/slab: add alloc_flags to slab_alloc_context
>>       mm/slab: replace struct partial_context with slab_alloc_context
>>       mm/slab: pass alloc_flags to new slab allocation
>>       mm/slab: pass alloc_flags through slab_post_alloc_hook() chain
>>       mm/slab: replace slab_alloc_node() parameters with slab_alloc_context
>>       mm/slab: allow kmem_cache_alloc_bulk() with any gfp flags
>>       mm/slab: pass slab_alloc_context to __do_kmalloc_node()
>>       mm/slab: introduce kmalloc_flags()
>>       mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
>>       mm/slab: replace __GFP_NO_OBJ_EXT with SLAB_ALLOC_NO_RECURSE for sheaves
>>       mm: remove the __GFP_NO_OBJ_EXT flag
>> 
>>  include/linux/gfp_types.h       |   7 -
>>  include/linux/slab.h            |  14 +-
>>  include/trace/events/mmflags.h  |  10 +-
>>  lib/alloc_tag.c                 |   2 +-
>>  mm/kfence/core.c                |   6 +-
>>  mm/memcontrol.c                 |   5 +-
>>  mm/slab.h                       |  16 +-
>>  mm/slub.c                       | 423 ++++++++++++++++++++++++----------------
>>  tools/include/linux/gfp_types.h |   7 -
>>  9 files changed, 288 insertions(+), 202 deletions(-)
>> ---
>> base-commit: 500b2c9755301742bdbb61249511ac11a4665dae
>> change-id: 20260601-slab_alloc_flags-25c782b0c57c
>> 
>> Best regards,
>> --  
>> Vlastimil Babka (SUSE) <vbabka@kernel.org>
>> 
>> 


^ permalink raw reply

* Re: [PATCH RFC 00/15] mm/slab: introduce alloc_flags and slab_alloc_context
From: Usama Arif @ 2026-06-09 13:35 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Usama Arif, Harry Yoo, hao.ge, Hao Li, Christoph Lameter,
	David Rientjes, Roman Gushchin, Suren Baghdasaryan,
	Alexei Starovoitov, Andrew Morton, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	kasan-dev, linux-mm, linux-kernel, cgroups
In-Reply-To: <20260609-slab_alloc_flags-v1-0-2bf4a4b9b526@kernel.org>

On Tue, 09 Jun 2026 11:17:45 +0200 "Vlastimil Babka (SUSE)" <vbabka@kernel.org> wrote:

> This series is based on slab/for-next. If all goes well, it would
> hopefully go to slab/for-next soon after the 7.2 merge window, so any
> other work can be based on it to avoid conflicts, as it touches a lot
> parts of slab.
> 
> Git: https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slab_alloc_flags
> 
> The slab implementation currently relies on gfp flags to convey
> some context information internally:
> 
> - The absence of both __GFP_RECLAIM flags is interpreted as "cannot spin
>   on locks", and intended to be used by kmalloc_nolock(). But false
>   positives are possible e.g. during early boot where gfp_allowed_mask
>   clears __GFP_RECLAIM from all allocations. This leads to unnecessary
>   allocation failures and workarounds such as fd3634312a04 ("debugobject:
>   Make it work with deferred page initialization - again").
> 
> - __GFP_NO_OBJ_EXT exists and takes up valuable bit in the gfp flags
>   space, only to prevent recursive kmalloc() allocations for obj_ext
>   arrays and sheaves.
> 

Hello Valstimil!

I think memory allocation profiling uses __GFP_NO_OBJ_EXT, and I dont see
it being removed in the series (hopefully I didnt miss it).

Adding Hao Ge in CC who did this in the commit:
mm/alloc_tag: replace fixed-size early PFN array with dynamic linked list


> The page allocator uses its internal alloc_flags to convey various
> context information, including ALLOC_TRYLOCK (meaning "cannot spin").
> This series copies that concept for the slab allocator, with its own
> slab-specific internal flags:
> 
> - SLAB_ALLOC_DEFAULT - no extra flags (the value is 0), but explicit
> - SLAB_ALLOC_TRYLOCK - do not spin on locks (used by kmalloc_nolock())
> - SLAB_ALLOC_NEW_SLAB - replacing existing 'bool new_slab' parameter
> 			for allocating obj_ext arrays
> - SLAB_ALLOC_NO_RECURSE - replacing usage of __GFP_NO_OBJ_EXT
> 
> To reduce the amount of parameters in various internal functions, we
> additionally introduce slab_alloc_context (also inspired by page
> allocator's alloc_context) for passing a number of existing arguments
> and the new alloc_flags:
> 
> /* Structure holding extra parameters for slab allocations */
> struct slab_alloc_context {
> 	unsigned long caller_addr;
> 	unsigned long orig_size;
> 	unsigned int alloc_flags;
> 	struct list_lru *lru;
> };
> 
> This also replaces the existing struct partial_context.
> 
> The last necessary piece is kmalloc_flags() which can take the
> alloc_flags in addition to gfp flags and is intended for the recursive
> allocations of sheaves and obj_ext arrays, so that both
> SLAB_ALLOC_TRYLOCK and SLAB_ALLOC_NO_RECURSE can be communicated.
> Internally it decides between kmalloc_nolock() and normal kmalloc()
> depending SLAB_ALLOC_TRYLOCK.
> 
> The rest of the series is gradually expanding the usage of both
> alloc_flags and slab_alloc_context as necessary, with bits of
> refactoring. Then, __GFP_NO_OBJ_EXT is removed completely.
> 
> Note that some usage of gfpflags_allow_spinning() relying on absence of
> __GFP_RECLAIM remains outside of slab (and page allocator) in memcg,
> page_owner and stackdepot code. These can thus yield false-positive
> decisions that spinning is not allowed, but should not result in
> important allocations failing anymore.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
> Vlastimil Babka (SUSE) (15):
>       mm/slab: always zero only requested size on alloc
>       mm/slab: stop inlining __slab_alloc_node()
>       mm/slab: introduce slab_alloc_context
>       mm/slab: introduce alloc_flags and SLAB_ALLOC_TRYLOCK
>       mm/slab: add alloc_flags to slab_alloc_context
>       mm/slab: replace struct partial_context with slab_alloc_context
>       mm/slab: pass alloc_flags to new slab allocation
>       mm/slab: pass alloc_flags through slab_post_alloc_hook() chain
>       mm/slab: replace slab_alloc_node() parameters with slab_alloc_context
>       mm/slab: allow kmem_cache_alloc_bulk() with any gfp flags
>       mm/slab: pass slab_alloc_context to __do_kmalloc_node()
>       mm/slab: introduce kmalloc_flags()
>       mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
>       mm/slab: replace __GFP_NO_OBJ_EXT with SLAB_ALLOC_NO_RECURSE for sheaves
>       mm: remove the __GFP_NO_OBJ_EXT flag
> 
>  include/linux/gfp_types.h       |   7 -
>  include/linux/slab.h            |  14 +-
>  include/trace/events/mmflags.h  |  10 +-
>  lib/alloc_tag.c                 |   2 +-
>  mm/kfence/core.c                |   6 +-
>  mm/memcontrol.c                 |   5 +-
>  mm/slab.h                       |  16 +-
>  mm/slub.c                       | 423 ++++++++++++++++++++++++----------------
>  tools/include/linux/gfp_types.h |   7 -
>  9 files changed, 288 insertions(+), 202 deletions(-)
> ---
> base-commit: 500b2c9755301742bdbb61249511ac11a4665dae
> change-id: 20260601-slab_alloc_flags-25c782b0c57c
> 
> Best regards,
> --  
> Vlastimil Babka (SUSE) <vbabka@kernel.org>
> 
> 

^ permalink raw reply

* Re: [Kernel Bug] INFO: task hung in cgroup_drain_dying
From: Michal Koutný @ 2026-06-09 12:58 UTC (permalink / raw)
  To: Longxing Li; +Cc: syzkaller, tj, hannes, cgroups, linux-kernel
In-Reply-To: <CAHPqNmxGfjsKGEJJaSCrJqoU9WHY3q8CX1oTA7GV5BBHvDzgpg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 552 bytes --]

Hello Longxing.

On Tue, Jun 09, 2026 at 07:42:06PM +0800, Longxing Li <coregee2000@gmail.com> wrote:
> We would like to report a new kernel bug found by our tool. INFO: task
> hung in cgroup_drain_dying. Details are as follows.

Thanks but I see no attachment.

(Greater if you could add description as plaintext [1])

> Kernel commit: v7.0.6
> Kernel config: see attachment

Do you have lockdep enabled (CONFIG_PROVE_LOCKING)? That may help
debugging here.

Thanks,
Michal

[1] https://docs.kernel.org/process/email-clients.html#general-preferences


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* [Kernel Bug] INFO: rcu detected stall in count_memcg_event_mm
From: Longxing Li @ 2026-06-09 11:57 UTC (permalink / raw)
  To: syzkaller, hannes, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, cgroups, linux-mm, linux-kernel

Dear Linux kernel developers and maintainers,

We would like to report a new kernel bug found by our tool. INFO: rcu
detected stall in count_memcg_event_mm. Details are as follows.

Kernel commit: v5.15.189
Kernel config: see attachment
report: see attachment
Syz repro: see attachment

We are currently analyzing the root cause. We will provide further
updates in this thread as soon as we have more information.

Best regards,
Longxing Li

==================================================================
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-0): P12059/1:b..l
(detected by 0, t=10502 jiffies, g=18873, q=1488)
task:systemd-sysctl  state:R  running task     stack:28320 pid:12059
ppid: 10470 flags:0x00000200
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5030 [inline]
 __schedule+0xd98/0x5b40 kernel/sched/core.c:6376
 preempt_schedule_irq+0x4e/0x90 kernel/sched/core.c:6780
 irqentry_exit+0x31/0x80 kernel/entry/common.c:432
 asm_sysvec_apic_timer_interrupt+0x16/0x20 arch/x86/include/asm/idtentry.h:676
RIP: 0010:task_css include/linux/cgroup.h:496 [inline]
RIP: 0010:mem_cgroup_from_task+0x9/0x140 mm/memcontrol.c:935
Code: 0f fa ff 31 c0 5b 5d c3 b8 f4 ff ff ff eb f6 e8 3d b4 fa ff eb
be 66 66 2e 0f 1f 84 00 00 00 00 00 48 85 ff 0f 84 0f 01 00 00 <48> b8
00 00 00 00 00 fc ff df 55 53 48 89 fb 48 8d bf 60 13 00 00
RSP: 0000:ffffc9000dbafe58 EFLAGS: 00000286
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81ae3af8
RDX: ffff8880209d24c0 RSI: ffffffff81ae393d RDI: ffff8880209d24c0
RBP: ffff8880209d24c0 R08: 0000000000000000 R09: ffffffff93062a27
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000014
R13: ffffc9000dbaff58 R14: ffff8880209d24c0 R15: 0000000000000010
 count_memcg_event_mm.part.0+0xa5/0x340 include/linux/memcontrol.h:1079
 count_memcg_event_mm include/linux/memcontrol.h:609 [inline]
 handle_mm_fault+0xcf/0x770 mm/memory.c:4863
 do_user_addr_fault+0x4a5/0x1150 arch/x86/mm/fault.c:1357
 handle_page_fault arch/x86/mm/fault.c:1445 [inline]
 exc_page_fault+0x58/0xd0 arch/x86/mm/fault.c:1501
 asm_exc_page_fault+0x22/0x30 arch/x86/include/asm/idtentry.h:606
RIP: 0033:0x7f7a1e783970
RSP: 002b:00007ffeabd9ad48 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00007f7a1e464608 RCX: fffffffffffffeb8
RDX: 00007f7a1e4661a4 RSI: 00007f7a1e45c8f6 RDI: 00007f7a1e45c827
RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000002b0 R11: 0000000000000246 R12: 00007ffeabd9ae88
R13: 0000000000000001 R14: 00007f7a1e464610 R15: 0000000000000000
 </TASK>
rcu: rcu_preempt kthread starved for 10466 jiffies! g18873 f0x0
RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now
expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt     state:R  running task     stack:28656 pid:   15
ppid:     2 flags:0x00004000
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5030 [inline]
 __schedule+0xd98/0x5b40 kernel/sched/core.c:6376
 schedule+0x115/0x250 kernel/sched/core.c:6459
 schedule_timeout+0x153/0x2a0 kernel/time/timer.c:1914
 rcu_gp_fqs_loop+0x1c0/0x980 kernel/rcu/tree.c:1972
 rcu_gp_kthread+0x1e6/0x330 kernel/rcu/tree.c:2145
 kthread+0x3d0/0x4c0 kernel/kthread.c:334
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:287
 </TASK>
rcu: Stack dump where RCU GP kthread last ran:
NMI backtrace for cpu 0
CPU: 0 PID: 12075 Comm: syz-executor.6 Not tainted 5.15.189 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xfc/0x174 lib/dump_stack.c:106
 nmi_cpu_backtrace.cold+0x22/0x16f lib/nmi_backtrace.c:111
 nmi_trigger_cpumask_backtrace+0x23e/0x290 lib/nmi_backtrace.c:62
 trigger_single_cpu_backtrace include/linux/nmi.h:166 [inline]
 rcu_check_gp_kthread_starvation.cold+0x1ca/0x1cc kernel/rcu/tree_stall.h:487
 print_other_cpu_stall kernel/rcu/tree_stall.h:592 [inline]
 check_cpu_stall kernel/rcu/tree_stall.h:745 [inline]
 rcu_pending kernel/rcu/tree.c:3936 [inline]
 rcu_sched_clock_irq+0x21a2/0x2610 kernel/rcu/tree.c:2619
 update_process_times+0x1c5/0x280 kernel/time/timer.c:1818
 tick_sched_handle+0x9b/0x180 kernel/time/tick-sched.c:254
 tick_sched_timer+0xe9/0x110 kernel/time/tick-sched.c:1473
 __run_hrtimer kernel/time/hrtimer.c:1690 [inline]
 __hrtimer_run_queues+0x64c/0xc00 kernel/time/hrtimer.c:1754
 hrtimer_interrupt+0x317/0x7e0 kernel/time/hrtimer.c:1816
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1097 [inline]
 __sysvec_apic_timer_interrupt+0x146/0x430 arch/x86/kernel/apic/apic.c:1114
 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1108 [inline]
 sysvec_apic_timer_interrupt+0x4d/0xc0 arch/x86/kernel/apic/apic.c:1108
 asm_sysvec_apic_timer_interrupt+0x16/0x20 arch/x86/include/asm/idtentry.h:676
RIP: 0010:unwind_next_frame+0xd03/0x1d20 arch/x86/kernel/unwind_orc.c:545
Code: 20 84 c0 4c 8b 44 24 28 48 8b 4c 24 30 0f 84 56 07 00 00 48 b8
00 00 00 00 00 fc ff df 48 8b 54 24 08 48 c1 ea 03 80 3c 02 00 <0f> 85
16 0a 00 00 48 8b 14 24 4d 89 4e 38 48 b8 00 00 00 00 00 fc
RSP: 0018:ffffc90000007568 EFLAGS: 00000246
RAX: dffffc0000000000 RBX: 1ffff92000000eb5 RCX: ffffffff8e89bb65
RDX: 1ffff92000000ecf RSI: ffffc9000dccf678 RDI: ffffc9000dccf678
RBP: 0000000000000001 R08: ffffffff8e89bb60 R09: ffffc9000dccf680
R10: fffff52000000ed3 R11: 000000000008e07d R12: ffffc90000007688
R13: ffffc90000007675 R14: ffffc90000007640 R15: ffffffff8e89bb64
 arch_stack_walk+0x87/0xe0 arch/x86/kernel/stacktrace.c:25
 stack_trace_save+0x92/0xc0 kernel/stacktrace.c:122
 kasan_save_stack+0x2a/0x50 mm/kasan/common.c:38
 kasan_set_track mm/kasan/common.c:46 [inline]
 set_alloc_info mm/kasan/common.c:434 [inline]
 ____kasan_kmalloc mm/kasan/common.c:513 [inline]
 __kasan_kmalloc+0x7f/0xa0 mm/kasan/common.c:522
 kmalloc include/linux/slab.h:604 [inline]
 dst_cow_metrics_generic+0x48/0x1e0 net/core/dst.c:202
 dst_metrics_write_ptr include/net/dst.h:118 [inline]
 dst_metric_set include/net/dst.h:179 [inline]
 icmp6_dst_alloc+0x4ee/0x670 net/ipv6/route.c:3281
 ndisc_send_skb+0x10b4/0x1570 net/ipv6/ndisc.c:493
 ndisc_send_rs+0x12f/0x690 net/ipv6/ndisc.c:707
 addrconf_rs_timer+0x413/0x810 net/ipv6/addrconf.c:3957
 call_timer_fn+0x1a5/0x580 kernel/time/timer.c:1451
 expire_timers kernel/time/timer.c:1496 [inline]
 __run_timers+0x75d/0xb20 kernel/time/timer.c:1767
 run_timer_softirq+0x54/0xd0 kernel/time/timer.c:1780
 handle_softirqs+0x21a/0x950 kernel/softirq.c:576
 __do_softirq kernel/softirq.c:610 [inline]
 invoke_softirq kernel/softirq.c:450 [inline]
 __irq_exit_rcu+0xed/0x190 kernel/softirq.c:659
 irq_exit_rcu+0x5/0x20 kernel/softirq.c:671
 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1108 [inline]
 sysvec_apic_timer_interrupt+0x9e/0xc0 arch/x86/kernel/apic/apic.c:1108
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x16/0x20 arch/x86/include/asm/idtentry.h:676
RIP: 0010:__preempt_count_add arch/x86/include/asm/preempt.h:80 [inline]
RIP: 0010:rcu_lockdep_current_cpu_online kernel/rcu/tree.c:1170 [inline]
RIP: 0010:rcu_lockdep_current_cpu_online+0x21/0x150 kernel/rcu/tree.c:1162
Code: 00 48 8b 14 24 eb c8 66 90 65 8b 15 39 81 9e 7e 81 e2 00 00 f0
00 b8 01 00 00 00 75 0a 8b 15 f2 2b a2 0c 85 d2 75 01 c3 55 53 <65> ff
05 18 81 9e 7e e8 f3 fc 53 08 48 c7 c3 40 b5 03 00 83 f8 07
RSP: 0018:ffffc9000dccf660 EFLAGS: 00000202
RAX: 0000000000000001 RBX: ffff88801cdf0000 RCX: 0000000000000001
RDX: 0000000000000001 RSI: ffffffff8a640000 RDI: ffffffff8bc7b8c0
RBP: ffff88806b299000 R08: 0000000000000000 R09: ffffffff93062a27
R10: fffffbfff260c544 R11: 1ffff11003dd908a R12: 0000000000000cc0
R13: 0000000000000001 R14: ffffffff81d2f5eb R15: 0000000000000000
 rcu_read_lock_held_common kernel/rcu/update.c:112 [inline]
 rcu_read_lock_held+0x1f/0x40 kernel/rcu/update.c:309
 task_css.constprop.0+0xc9/0x130 include/linux/cgroup.h:496
 mem_cgroup_from_task mm/memcontrol.c:935 [inline]
 get_obj_cgroup_from_current+0x120/0x510 mm/memcontrol.c:2926
 memcg_slab_pre_alloc_hook mm/slab.h:283 [inline]
 slab_pre_alloc_hook mm/slab.h:497 [inline]
 slab_alloc_node mm/slub.c:3134 [inline]
 slab_alloc mm/slub.c:3228 [inline]
 kmem_cache_alloc+0x76/0x2f0 mm/slub.c:3233
 __d_alloc+0x2b/0x8e0 fs/dcache.c:1749
 d_alloc+0x4a/0x220 fs/dcache.c:1828
 d_alloc_parallel+0xdd/0x1c70 fs/dcache.c:2582
 lookup_open.isra.0+0xb06/0x1660 fs/namei.c:3387
 open_last_lookups fs/namei.c:3532 [inline]
 path_openat+0xad1/0x2c00 fs/namei.c:3739
 do_filp_open+0x1d4/0x430 fs/namei.c:3769
 do_sys_openat2+0x185/0x4f0 fs/open.c:1253
 do_sys_open fs/open.c:1269 [inline]
 __do_sys_openat fs/open.c:1285 [inline]
 __se_sys_openat fs/open.c:1280 [inline]
 __x64_sys_openat+0x171/0x210 fs/open.c:1280
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x34/0x80 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x66/0xd0
RIP: 0033:0x46a4e4
Code: 24 20 eb 8f 66 90 44 89 54 24 0c e8 06 d9 02 00 44 8b 54 24 0c
44 89 e2 48 89 ee 41 89 c0 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d
00 f0 ff ff 77 34 44 89 c7 89 44 24 0c e8 48 d9 02 00 8b 44
RSP: 002b:00007f2a6d102f60 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 000000000046a4e4
RDX: 0000000000000002 RSI: 00007f2a6d102ff0 RDI: 00000000ffffff9c
RBP: 00007f2a6d102ff0 R08: 0000000000000000 R09: 00007f2a6d102e70
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000002
R13: 00000000004e9478 R14: 00007f2a6d1035f4 R15: 00000000ffffffff
 </TASK>

https://drive.google.com/file/d/13nr1pPRHCrZqYxz08ngNiCvmcRe-d_oK/view?usp=drive_link

https://drive.google.com/file/d/1RHa5bKm8YN-h9wRh_dJH0q1ZBF5mzYOh/view?usp=drive_link

https://drive.google.com/file/d/15HyqruIzxnkYVh4FXjmU-yY_zTB8kwPI/view?usp=drive_link

^ permalink raw reply

* [Kernel Bug] INFO: task hung in cgroup_drain_dying
From: Longxing Li @ 2026-06-09 11:42 UTC (permalink / raw)
  To: syzkaller, tj, hannes, mkoutny, cgroups, linux-kernel

Dear Linux kernel developers and maintainers,

We would like to report a new kernel bug found by our tool. INFO: task
hung in cgroup_drain_dying. Details are as follows.

Kernel commit: v7.0.6
Kernel config: see attachment
report: see attachment

We are currently analyzing the root cause and  working on a
reproducible PoC. We will provide further updates in this thread as
soon as we have more information.

Best regards,
Longxing Li

==================================================================
https://drive.google.com/file/d/1riFUIPWojkYVZu0B5BW8uVPocUWwibqN/view?usp=drive_link

^ permalink raw reply

* Re: [PATCH] mm: constify oom_control, scan_control, and alloc_context nodemask
From: Vlastimil Babka (SUSE) @ 2026-06-09  9:59 UTC (permalink / raw)
  To: Gregory Price, linux-mm
  Cc: linux-kernel, cgroups, kernel-team, longman, chenridong, akpm,
	david, ljs, liam, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, rientjes,
	chrisl, shikemeng, nphamcs, baoquan.he, youngjun.park, tj, hannes,
	mkoutny, jackmanb, ziy
In-Reply-To: <20260609002919.3967782-1-gourry@gourry.net>

On 6/9/26 02:29, Gregory Price wrote:
> The nodemasks in these structures may come from a variety of sources,
> including tasks and cpusets - and should never be modified by any code
> when being passed around inside another context.
> 
> Signed-off-by: Gregory Price <gourry@gourry.net>

Nice!

Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>


^ permalink raw reply

* Re: [PATCH] mm: constify oom_control, scan_control, and alloc_context nodemask
From: David Hildenbrand (Arm) @ 2026-06-09  9:37 UTC (permalink / raw)
  To: Lorenzo Stoakes, Gregory Price
  Cc: linux-mm, linux-kernel, cgroups, kernel-team, longman, chenridong,
	akpm, liam, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, rientjes,
	chrisl, shikemeng, nphamcs, baoquan.he, youngjun.park, tj, hannes,
	mkoutny, jackmanb, ziy
In-Reply-To: <aifDlU96HSRy72Rb@lucifer>

On 6/9/26 09:41, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 08:29:19PM -0400, Gregory Price wrote:
>> The nodemasks in these structures may come from a variety of sources,
>> including tasks and cpusets - and should never be modified by any code
>> when being passed around inside another context.
>>
>> Signed-off-by: Gregory Price <gourry@gourry.net>
> 
> Thanks for doing this, it's nice to gradually up our const correctness game
> (as much as C can ever be const correct :)

+1

> 
> LGTM, builds locally too, so:
> 

Nice.

> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply

* [PATCH RFC 15/15] mm: remove the __GFP_NO_OBJ_EXT flag
From: Vlastimil Babka (SUSE) @ 2026-06-09  9:18 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260609-slab_alloc_flags-v1-0-2bf4a4b9b526@kernel.org>

All users of the flag are converted to SLAB_ALLOC_NO_RECURSE. Free up
the flag bit.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 include/linux/gfp_types.h       |  7 -------
 include/linux/slab.h            |  2 +-
 include/trace/events/mmflags.h  | 10 +---------
 lib/alloc_tag.c                 |  2 +-
 tools/include/linux/gfp_types.h |  7 -------
 5 files changed, 3 insertions(+), 25 deletions(-)

diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 6c75df30a281..a93b8bd200b7 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -55,7 +55,6 @@ enum {
 #ifdef CONFIG_LOCKDEP
 	___GFP_NOLOCKDEP_BIT,
 #endif
-	___GFP_NO_OBJ_EXT_BIT,
 	___GFP_LAST_BIT
 };
 
@@ -96,7 +95,6 @@ enum {
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
-#define ___GFP_NO_OBJ_EXT       BIT(___GFP_NO_OBJ_EXT_BIT)
 
 /*
  * Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -137,17 +135,12 @@ enum {
  * node with no fallbacks or placement policy enforcements.
  *
  * %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
- *
- * %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
- * mark_obj_codetag_empty() should be called upon freeing for objects allocated
- * with this flag to indicate that their NULL tags are expected and normal.
  */
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)
 #define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL)
 #define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)
 #define __GFP_ACCOUNT	((__force gfp_t)___GFP_ACCOUNT)
-#define __GFP_NO_OBJ_EXT   ((__force gfp_t)___GFP_NO_OBJ_EXT)
 
 /**
  * DOC: Watermark modifiers
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 11e82fdbe8d3..15d1917b81d3 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -1043,7 +1043,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 /**
  * kmalloc_nolock - Allocate an object of given size from any context.
  * @size: size to allocate
- * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO, __GFP_NO_OBJ_EXT
+ * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO
  * allowed.
  * @node: node number of the target node.
  *
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a6e5a44c9b42..c1a05ff0feab 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -54,18 +54,10 @@
 # define TRACE_GFP_FLAGS_LOCKDEP
 #endif
 
-#ifdef CONFIG_SLAB_OBJ_EXT
-# define TRACE_GFP_FLAGS_SLAB			\
-	TRACE_GFP_EM(NO_OBJ_EXT)
-#else
-# define TRACE_GFP_FLAGS_SLAB
-#endif
-
 #define TRACE_GFP_FLAGS				\
 	TRACE_GFP_FLAGS_GENERAL			\
 	TRACE_GFP_FLAGS_KASAN			\
-	TRACE_GFP_FLAGS_LOCKDEP			\
-	TRACE_GFP_FLAGS_SLAB
+	TRACE_GFP_FLAGS_LOCKDEP
 
 #undef TRACE_GFP_EM
 #define TRACE_GFP_EM(a) TRACE_DEFINE_ENUM(___GFP_##a##_BIT);
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index ed1bdcf1f8ab..63686b44a23d 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -776,7 +776,7 @@ static __init bool need_page_alloc_tagging(void)
  * If insufficient, a warning will be triggered to alert the user.
  *
  * TODO: Replace fixed-size array with dynamic allocation using
- * a GFP flag similar to ___GFP_NO_OBJ_EXT to avoid recursion.
+ * something similar to slab's SLAB_ALLOC_NO_RECURSE to avoid recursion.
  */
 #define EARLY_ALLOC_PFN_MAX		8192
 
diff --git a/tools/include/linux/gfp_types.h b/tools/include/linux/gfp_types.h
index 6c75df30a281..a93b8bd200b7 100644
--- a/tools/include/linux/gfp_types.h
+++ b/tools/include/linux/gfp_types.h
@@ -55,7 +55,6 @@ enum {
 #ifdef CONFIG_LOCKDEP
 	___GFP_NOLOCKDEP_BIT,
 #endif
-	___GFP_NO_OBJ_EXT_BIT,
 	___GFP_LAST_BIT
 };
 
@@ -96,7 +95,6 @@ enum {
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
-#define ___GFP_NO_OBJ_EXT       BIT(___GFP_NO_OBJ_EXT_BIT)
 
 /*
  * Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -137,17 +135,12 @@ enum {
  * node with no fallbacks or placement policy enforcements.
  *
  * %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
- *
- * %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
- * mark_obj_codetag_empty() should be called upon freeing for objects allocated
- * with this flag to indicate that their NULL tags are expected and normal.
  */
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)
 #define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL)
 #define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)
 #define __GFP_ACCOUNT	((__force gfp_t)___GFP_ACCOUNT)
-#define __GFP_NO_OBJ_EXT   ((__force gfp_t)___GFP_NO_OBJ_EXT)
 
 /**
  * DOC: Watermark modifiers

-- 
2.54.0


^ permalink raw reply related

* [PATCH RFC 14/15] mm/slab: replace __GFP_NO_OBJ_EXT with SLAB_ALLOC_NO_RECURSE for sheaves
From: Vlastimil Babka (SUSE) @ 2026-06-09  9:17 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260609-slab_alloc_flags-v1-0-2bf4a4b9b526@kernel.org>

Finish the switch away from __GFP_NO_OBJ_EXT by replacing it with
SLAB_ALLOC_NO_RECURSE when allocating empty sheaves. Pass alloc_flags to
[__]alloc_empty_sheaf(). Callers that can't be part of a recursive
kmalloc() chain simply pass SLAB_ALLOC_DEFAULT. Use kmalloc_flags()
instead of kzalloc() for allocating the sheaf.

This leaves __GFP_NO_OBJ_EXT with no users, to be removed next.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8a655636dee6..26ec015efdba 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2756,7 +2756,7 @@ static inline void *setup_object(struct kmem_cache *s, void *object)
 }
 
 static struct slab_sheaf *__alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp,
-					      unsigned int capacity)
+				unsigned int alloc_flags, unsigned int capacity)
 {
 	struct slab_sheaf *sheaf;
 	size_t sheaf_size;
@@ -2767,10 +2767,10 @@ static struct slab_sheaf *__alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp,
 	 * bucket)
 	 */
 	if (s->flags & SLAB_KMALLOC)
-		gfp |= __GFP_NO_OBJ_EXT;
+		alloc_flags |= SLAB_ALLOC_NO_RECURSE;
 
 	sheaf_size = struct_size(sheaf, objects, capacity);
-	sheaf = kzalloc(sheaf_size, gfp);
+	sheaf = kmalloc_flags(sheaf_size, gfp | __GFP_ZERO, alloc_flags, NUMA_NO_NODE);
 
 	if (unlikely(!sheaf))
 		return NULL;
@@ -2783,20 +2783,20 @@ static struct slab_sheaf *__alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp,
 }
 
 static inline struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s,
-						   gfp_t gfp)
+				gfp_t gfp, unsigned int alloc_flags)
 {
-	if (gfp & __GFP_NO_OBJ_EXT)
+	if (alloc_flags & SLAB_ALLOC_NO_RECURSE)
 		return NULL;
 
 	gfp &= ~OBJCGS_CLEAR_MASK;
 
-	return __alloc_empty_sheaf(s, gfp, s->sheaf_capacity);
+	return __alloc_empty_sheaf(s, gfp, alloc_flags, s->sheaf_capacity);
 }
 
 static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
 {
 	/*
-	 * If the sheaf was created with __GFP_NO_OBJ_EXT flag then its
+	 * If the sheaf was created with SLAB_ALLOC_NO_RECURSE flag then its
 	 * corresponding extension is NULL and alloc_tag_sub() will throw a
 	 * warning, therefore replace NULL with CODETAG_EMPTY to indicate
 	 * that the extension for this sheaf is expected to be NULL.
@@ -4673,7 +4673,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 		return NULL;
 
 	if (!empty) {
-		empty = alloc_empty_sheaf(s, gfp);
+		empty = alloc_empty_sheaf(s, gfp, alloc_flags);
 		if (!empty)
 			return NULL;
 	}
@@ -5047,7 +5047,7 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
 
 	if (unlikely(size > s->sheaf_capacity)) {
 
-		sheaf = __alloc_empty_sheaf(s, gfp, size);
+		sheaf = __alloc_empty_sheaf(s, gfp, SLAB_ALLOC_DEFAULT, size);
 		if (!sheaf)
 			return NULL;
 
@@ -5092,7 +5092,7 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
 
 
 	if (!sheaf)
-		sheaf = alloc_empty_sheaf(s, gfp);
+		sheaf = alloc_empty_sheaf(s, gfp, SLAB_ALLOC_DEFAULT);
 
 	if (sheaf) {
 		sheaf->capacity = s->sheaf_capacity;
@@ -5376,8 +5376,7 @@ static void *__kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_f
 	void *ret;
 
 	VM_WARN_ON_ONCE(alloc_flags_allow_spinning(ac->alloc_flags));
-	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO |
-				      __GFP_NO_OBJ_EXT));
+	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
 
 	if (unlikely(!size))
 		return ZERO_SIZE_PTR;
@@ -5890,7 +5889,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 	if (!allow_spin)
 		return NULL;
 
-	empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+	empty = alloc_empty_sheaf(s, GFP_NOWAIT, SLAB_ALLOC_DEFAULT);
 	if (empty)
 		goto got_empty;
 
@@ -6074,7 +6073,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 
 		local_unlock(&s->cpu_sheaves->lock);
 
-		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+		empty = alloc_empty_sheaf(s, GFP_NOWAIT, SLAB_ALLOC_DEFAULT);
 
 		if (!empty)
 			goto fail;
@@ -7619,7 +7618,7 @@ static int init_percpu_sheaves(struct kmem_cache *s)
 		if (!s->sheaf_capacity)
 			pcs->main = &bootstrap_sheaf;
 		else
-			pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
+			pcs->main = alloc_empty_sheaf(s, GFP_KERNEL, SLAB_ALLOC_DEFAULT);
 
 		if (!pcs->main)
 			return -ENOMEM;
@@ -8485,7 +8484,8 @@ static void __init bootstrap_cache_sheaves(struct kmem_cache *s)
 
 		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
-		pcs->main = __alloc_empty_sheaf(s, GFP_KERNEL, capacity);
+		pcs->main = __alloc_empty_sheaf(s, GFP_KERNEL,
+				SLAB_ALLOC_DEFAULT, capacity);
 
 		if (!pcs->main) {
 			failed = true;

-- 
2.54.0


^ permalink raw reply related

* [PATCH RFC 13/15] mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
From: Vlastimil Babka (SUSE) @ 2026-06-09  9:17 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260609-slab_alloc_flags-v1-0-2bf4a4b9b526@kernel.org>

__GFP_NO_OBJ_EXT has limited scope within the slab allocator itself and
gfp flags are a scarce resource, unlike slab's alloc_flags.

Introduce SLAB_ALLOC_NO_RECURSE alloc flag that has the same intent as
__GFP_NO_OBJ_EXT but a more generic name, meaning that a kmalloc()
family function should not recurse into another kmalloc*() for the
purposes of allocating auxiliary structures (obj_ext arrays or sheaves).

First, replace the __GFP_NO_OBJ_EXT for allocating obj_ext arrays in
alloc_slab_obj_exts(). Make use of the newly added kmalloc_flags()
function, where we can pass alloc_flags with SLAB_ALLOC_NO_RECURSE
added. This will also pass through SLAB_ALLOC_TRYLOCK so we don't need
to special case kmalloc_nolock() anymore.

Note that until now the kmalloc_nolock() ignored the incoming gfp flags
and hardcoded __GFP_ZERO | __GFP_NO_OBJ_EXT. But it's correct to pass on
the incoming gfp flags (only augmented with __GFP_ZERO), because if
alloc_flags contain SLAB_ALLOC_TRYLOCK, the incoming gfp flags have to
be also compatible with it.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slab.h |  1 +
 mm/slub.c | 13 +++++--------
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 13517abcad21..e5bd800d831e 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -20,6 +20,7 @@
 #define SLAB_ALLOC_DEFAULT	0x00
 #define SLAB_ALLOC_TRYLOCK	0x01
 #define SLAB_ALLOC_NEW_SLAB	0x02 /* a flag for alloc_slab_obj_exts() */
+#define SLAB_ALLOC_NO_RECURSE	0x04 /* prevent kmalloc() recursion */
 
 static inline bool alloc_flags_allow_spinning(const unsigned int alloc_flags)
 {
diff --git a/mm/slub.c b/mm/slub.c
index 86691eb14002..8a655636dee6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2167,15 +2167,12 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 
 	gfp &= ~OBJCGS_CLEAR_MASK;
 	/* Prevent recursive extension vector allocation */
-	gfp |= __GFP_NO_OBJ_EXT;
+	alloc_flags |= SLAB_ALLOC_NO_RECURSE;
 
 	sz = obj_exts_alloc_size(s, slab, gfp);
 
-	if (unlikely(!allow_spin))
-		vec = kmalloc_nolock(sz, __GFP_ZERO | __GFP_NO_OBJ_EXT,
-				     slab_nid(slab));
-	else
-		vec = kmalloc_node(sz, gfp | __GFP_ZERO, slab_nid(slab));
+	/* This will use kmalloc_nolock() if alloc_flags say so */
+	vec = kmalloc_flags(sz, gfp | __GFP_ZERO, alloc_flags, slab_nid(slab));
 
 	if (!vec) {
 		/*
@@ -2251,7 +2248,7 @@ static inline void free_slab_obj_exts(struct slab *slab, bool allow_spin)
 	}
 
 	/*
-	 * obj_exts was created with __GFP_NO_OBJ_EXT flag, therefore its
+	 * obj_exts was created with SLAB_ALLOC_NO_RECURSE flag, therefore its
 	 * corresponding extension will be NULL. alloc_tag_sub() will throw a
 	 * warning if slab has extensions but the extension of an object is
 	 * NULL, therefore replace NULL with CODETAG_EMPTY to indicate that
@@ -2374,7 +2371,7 @@ __alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags,
 	if (s->flags & (SLAB_NO_OBJ_EXT | SLAB_NOLEAKTRACE))
 		return;
 
-	if (flags & __GFP_NO_OBJ_EXT)
+	if (alloc_flags & SLAB_ALLOC_NO_RECURSE)
 		return;
 
 	slab = virt_to_slab(object);

-- 
2.54.0


^ permalink raw reply related

* [PATCH RFC 12/15] mm/slab: introduce kmalloc_flags()
From: Vlastimil Babka (SUSE) @ 2026-06-09  9:17 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260609-slab_alloc_flags-v1-0-2bf4a4b9b526@kernel.org>

With alloc_flags usage in slab, we can replace __GFP_NO_OBJ_EXT with an
alloc flag that prevents kmalloc recursion. For that we need a version
of kmalloc() that takes alloc_flags and use it in places that perform
these potentially recursive kmalloc allocations (of sheaves or obj_ext
arrays).

Add this function, named kmalloc_flags(). Right now it's only useful for
these nested allocations, so it doesn't need to optimize build-time
constant sizes like kmalloc() or kmalloc_buckets.

Since we need it to support both normal and non-spinning
kmalloc_nolock() context through the SLAB_ALLOC_TRYLOCK flag, split out
most of the special _kmalloc_nolock_noprof() implementation to
__kmalloc_nolock_noprof() that takes a slab_alloc_context, and make
_kmalloc_nolock_noprof() a simple tail calling wrapper with the proper
context.

kmalloc_flags() can thus determine whether to call
__kmalloc_nolock_noprof() or __do_kmalloc_node(), based on the
given alloc_flags.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 include/linux/slab.h | 12 +++++++++++
 mm/slub.c            | 56 ++++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index ce1c867dc0ba..11e82fdbe8d3 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -944,6 +944,10 @@ void *__kmalloc_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t flags)
 void *__kmalloc_node_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags, int node)
 				__assume_kmalloc_alignment __alloc_size(1);
 
+void *__kmalloc_flags_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t flags,
+				  unsigned int alloc_flags, int node)
+				  __assume_kmalloc_alignment __alloc_size(1);
+
 void *__kmalloc_cache_noprof(struct kmem_cache *s, gfp_t flags, size_t size)
 				__assume_kmalloc_alignment __alloc_size(3);
 
@@ -1176,6 +1180,14 @@ static __always_inline __alloc_size(1) void *_kmalloc_node_noprof(size_t size, g
 #define kmalloc_node_noprof(...)		_kmalloc_node_noprof(__VA_ARGS__, __kmalloc_token(__VA_ARGS__))
 #define kmalloc_node(...)			alloc_hooks(kmalloc_node_noprof(__VA_ARGS__))
 
+static __always_inline __alloc_size(1) void *_kmalloc_flags_noprof(size_t size,
+		gfp_t flags, unsigned int alloc_flags, int node, kmalloc_token_t token)
+{
+	return __kmalloc_flags_noprof(PASS_TOKEN_PARAMS(size, token), flags, alloc_flags, node);
+}
+#define kmalloc_flags_noprof(...)		_kmalloc_flags_noprof(__VA_ARGS__, __kmalloc_token(__VA_ARGS__))
+#define kmalloc_flags(...)			alloc_hooks(kmalloc_flags_noprof(__VA_ARGS__))
+
 static inline __alloc_size(1, 2) void *_kmalloc_array_noprof(size_t n, size_t size, gfp_t flags, kmalloc_token_t token)
 {
 	size_t bytes;
diff --git a/mm/slub.c b/mm/slub.c
index c11edd58b52d..86691eb14002 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5370,15 +5370,15 @@ void *__kmalloc_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t flags)
 }
 EXPORT_SYMBOL(__kmalloc_noprof);
 
-void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, int node)
+static void *__kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags,
+				     int node, struct slab_alloc_context *ac)
 {
 	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
-	size_t orig_size = size;
-	unsigned int alloc_flags = SLAB_ALLOC_TRYLOCK;
 	struct kmem_cache *s;
 	bool can_retry = true;
 	void *ret;
 
+	VM_WARN_ON_ONCE(alloc_flags_allow_spinning(ac->alloc_flags));
 	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO |
 				      __GFP_NO_OBJ_EXT));
 
@@ -5413,23 +5413,17 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 		 */
 		return NULL;
 
-	ret = alloc_from_pcs(s, alloc_gfp, alloc_flags, node);
+	ret = alloc_from_pcs(s, alloc_gfp, ac->alloc_flags, node);
 	if (ret)
 		goto success;
 
-	struct slab_alloc_context ac = {
-		.caller_addr = _RET_IP_,
-		.orig_size = orig_size,
-		.alloc_flags = alloc_flags,
-	};
-
 	/*
 	 * Do not call slab_alloc_node(), since trylock mode isn't
 	 * compatible with slab_pre_alloc_hook/should_failslab and
 	 * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
 	 * and slab_post_alloc_hook() directly.
 	 */
-	ret = __slab_alloc_node(s, alloc_gfp, node, &ac);
+	ret = __slab_alloc_node(s, alloc_gfp, node, ac);
 
 	/*
 	 * It's possible we failed due to trylock as we preempted someone with
@@ -5452,11 +5446,23 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 
 success:
 	maybe_wipe_obj_freeptr(s, ret);
-	slab_post_alloc_hook(s, alloc_gfp, 1, &ret, &ac);
+	slab_post_alloc_hook(s, alloc_gfp, 1, &ret, ac);
 
-	ret = kasan_kmalloc(s, ret, orig_size, alloc_gfp);
+	ret = kasan_kmalloc(s, ret, ac->orig_size, alloc_gfp);
 	return ret;
 }
+
+void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, int node)
+{
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = size,
+		.alloc_flags = SLAB_ALLOC_TRYLOCK,
+	};
+
+	return __kmalloc_nolock_noprof(PASS_TOKEN_PARAMS(size, token),
+				       gfp_flags, node, &ac);
+}
 EXPORT_SYMBOL_GPL(_kmalloc_nolock_noprof);
 
 void *__kmalloc_node_track_caller_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags,
@@ -5510,6 +5516,30 @@ void *__kmalloc_cache_node_noprof(struct kmem_cache *s, gfp_t gfpflags,
 }
 EXPORT_SYMBOL(__kmalloc_cache_node_noprof);
 
+/*
+ * The only version of kmalloc_node() that takes alloc_flags and thus can
+ * determine on its own whether to handle the allocation via kmalloc_nolock() or
+ * normally
+ */
+void *__kmalloc_flags_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t flags,
+			     unsigned int alloc_flags, int node)
+{
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = size,
+		.alloc_flags = alloc_flags,
+	};
+
+	if (alloc_flags_allow_spinning(alloc_flags)) {
+		return __do_kmalloc_node(size, NULL, flags, node,
+				PASS_TOKEN_PARAM(token), &ac);
+	} else {
+		return __kmalloc_nolock_noprof(PASS_TOKEN_PARAMS(size, token),
+					       flags, node, &ac);
+	}
+}
+
+
 static noinline void free_to_partial_list(
 	struct kmem_cache *s, struct slab *slab,
 	void *head, void *tail, int bulk_cnt,

-- 
2.54.0


^ permalink raw reply related

* [PATCH RFC 11/15] mm/slab: pass slab_alloc_context to __do_kmalloc_node()
From: Vlastimil Babka (SUSE) @ 2026-06-09  9:17 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260609-slab_alloc_flags-v1-0-2bf4a4b9b526@kernel.org>

With alloc_flags usage in slab, we can replace __GFP_NO_OBJ_EXT with an
alloc flag that prevents kmalloc recursion. For that we need a version
of kmalloc() that takes alloc_flags and use it in places that perform
these potentially recursive kmalloc allocations (of sheaves or obj_ext
arrays).

As a preparatory step, make __do_kmalloc_node() take a pointer to
slab_alloc_context. This replaces the 'caller' parameter and includes
alloc_flags which we'll make use of.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 47 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 32 insertions(+), 15 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index dee69e0b7780..c11edd58b52d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5322,19 +5322,14 @@ EXPORT_SYMBOL(__kmalloc_large_node_noprof);
 
 static __always_inline
 void *__do_kmalloc_node(size_t size, kmem_buckets *b, gfp_t flags, int node,
-			unsigned long caller, kmalloc_token_t token)
+			kmalloc_token_t token, struct slab_alloc_context *ac)
 {
 	struct kmem_cache *s;
 	void *ret;
-	struct slab_alloc_context ac = {
-		.caller_addr = caller,
-		.orig_size = size,
-		.alloc_flags = SLAB_ALLOC_DEFAULT,
-	};
 
 	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE)) {
 		ret = __kmalloc_large_node_noprof(size, flags, node);
-		trace_kmalloc(caller, ret, size,
+		trace_kmalloc(ac->caller_addr, ret, size,
 			      PAGE_SIZE << get_order(size), flags, node);
 		return ret;
 	}
@@ -5344,22 +5339,34 @@ void *__do_kmalloc_node(size_t size, kmem_buckets *b, gfp_t flags, int node,
 
 	s = kmalloc_slab(size, b, flags, token);
 
-	ret = slab_alloc_node(s, flags, node, &ac);
+	ret = slab_alloc_node(s, flags, node, ac);
 	ret = kasan_kmalloc(s, ret, size, flags);
-	trace_kmalloc(caller, ret, size, s->size, flags, node);
+	trace_kmalloc(ac->caller_addr, ret, size, s->size, flags, node);
 	return ret;
 }
 void *__kmalloc_node_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags, int node)
 {
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
+
 	return __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), flags, node,
-				 _RET_IP_, PASS_TOKEN_PARAM(token));
+				 PASS_TOKEN_PARAM(token), &ac);
 }
 EXPORT_SYMBOL(__kmalloc_node_noprof);
 
 void *__kmalloc_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t flags)
 {
-	return __do_kmalloc_node(size, NULL, flags,  NUMA_NO_NODE, _RET_IP_,
-				 PASS_TOKEN_PARAM(token));
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
+
+	return __do_kmalloc_node(size, NULL, flags,  NUMA_NO_NODE,
+				 PASS_TOKEN_PARAM(token), &ac);
 }
 EXPORT_SYMBOL(__kmalloc_noprof);
 
@@ -5455,9 +5462,14 @@ EXPORT_SYMBOL_GPL(_kmalloc_nolock_noprof);
 void *__kmalloc_node_track_caller_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags,
 					 int node, unsigned long caller)
 {
-	return __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), flags, node,
-				 caller, PASS_TOKEN_PARAM(token));
+	struct slab_alloc_context ac = {
+		.caller_addr = caller,
+		.orig_size = size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
 
+	return __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), flags, node,
+				 PASS_TOKEN_PARAM(token), &ac);
 }
 EXPORT_SYMBOL(__kmalloc_node_track_caller_noprof);
 
@@ -6858,6 +6870,11 @@ void *__kvmalloc_node_noprof(DECL_KMALLOC_PARAMS(size, b, token), unsigned long
 {
 	bool allow_block;
 	void *ret;
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
 
 	/*
 	 * It doesn't really make sense to fallback to vmalloc for sub page
@@ -6865,7 +6882,7 @@ void *__kvmalloc_node_noprof(DECL_KMALLOC_PARAMS(size, b, token), unsigned long
 	 */
 	ret = __do_kmalloc_node(size, PASS_BUCKET_PARAM(b),
 				kmalloc_gfp_adjust(flags, size),
-				node, _RET_IP_, PASS_TOKEN_PARAM(token));
+				node, PASS_TOKEN_PARAM(token), &ac);
 	if (ret || size <= PAGE_SIZE)
 		return ret;
 

-- 
2.54.0


^ permalink raw reply related

* [PATCH RFC 10/15] mm/slab: allow kmem_cache_alloc_bulk() with any gfp flags
From: Vlastimil Babka (SUSE) @ 2026-06-09  9:17 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260609-slab_alloc_flags-v1-0-2bf4a4b9b526@kernel.org>

The last user of gfpflags_allow_spinning() in slab is
alloc_from_pcs_bulk(), which is only called from
kmem_cache_alloc_bulk().

It turns out that gfpflags_allow_spinning() is not necessary, because
kmem_cache_alloc_bulk() is only expected to be called from context that
does allow spinning, so simply replace it with 'true'.

With that, we can remove the "@flags must allow spinning" part of the
kernel doc, as there is no more connection to the gfp flags in the slab
implementation.

Also remove a comment in alloc_slab_obj_exts() because there should be
no more false positives possible due to gfp_allowed_mask during early
boot.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index b511d768e9b6..dee69e0b7780 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2171,12 +2171,6 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 
 	sz = obj_exts_alloc_size(s, slab, gfp);
 
-	/*
-	 * Note that allow_spin may be false during early boot and its
-	 * restricted GFP_BOOT_MASK. Due to kmalloc_nolock() only supporting
-	 * architectures with cmpxchg16b, early obj_exts will be missing for
-	 * very early allocations on those.
-	 */
 	if (unlikely(!allow_spin))
 		vec = kmalloc_nolock(sz, __GFP_ZERO | __GFP_NO_OBJ_EXT,
 				     slab_nid(slab));
@@ -4851,7 +4845,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
 		}
 
 		full = barn_replace_empty_sheaf(barn, pcs->main,
-						gfpflags_allow_spinning(gfp));
+						/* allow_spin = */ true);
 
 		if (full) {
 			stat(s, BARN_GET);
@@ -7317,8 +7311,7 @@ static bool __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
  * Allocate @size objects from @s and places them into @p.  @size must be larger
  * than 0.
  *
- * Interrupts must be enabled when calling this function and @flags must allow
- * spinning.
+ * Interrupts must be enabled when calling this function.
  *
  * Unlike alloc_pages_bulk(), this function does not check for already allocated
  * objects in @p, and thus the caller does not need to zero it.

-- 
2.54.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox