Linux cgroups development


* Re: [PATCH] mm: memcontrol-v1: use nofail allocations for soft limit trees
From: Andrew Morton @ 2026-06-10  1:30 UTC (permalink / raw)
  To: Ruoyu Wang
  Cc: Michal Hocko, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, cgroups, linux-mm, linux-kernel
In-Reply-To: <CAK_7xqyyDqNW1+puMSp2LzxmOKxFUx-UO9uGiDKoL7ZTJ8+3ZQ@mail.gmail.com>

On Mon, 8 Jun 2026 16:34:48 +0800 Ruoyu Wang <ruoyuw560@gmail.com> wrote:

> This was found by static analysis and then checked by reading the code:
> memcg1_init() dereferences rtpn unconditionally after kzalloc_node(). I
> treated the soft-limit tree as mandatory memcg v1 init state and used
> __GFP_NOFAIL because continuing without it would not be useful.
> 
> I agree this is early boot init code, and I do not have a
> runtime failure report or fault-injection reproduction for it.

Thanks.

Please teach the static analyzer that kernel practice is to ignore
allocation failures in __init code.


* Re: [PATCH] mm: constify oom_control, scan_control, and alloc_context nodemask
From: Gregory Price @ 2026-06-10  1:48 UTC (permalink / raw)
  To: SeongJae Park
  Cc: linux-mm, linux-kernel, cgroups, kernel-team, longman, chenridong,
	akpm, david, ljs, liam, vbabka, rppt, surenb, mhocko, kasong,
	qi.zheng, shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc,
	rientjes, chrisl, shikemeng, nphamcs, baoquan.he, youngjun.park,
	tj, hannes, mkoutny, jackmanb, ziy
In-Reply-To: <20260610001937.77371-1-sj@kernel.org>

On Tue, Jun 09, 2026 at 05:19:36PM -0700, SeongJae Park wrote:
> On Mon,  8 Jun 2026 20:29:19 -0400 Gregory Price <gourry@gourry.net> wrote:
> >   */
> >  static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> > -					nodemask_t *nodemask)
> > +				    const nodemask_t *nodemask)
> 
> The above indentation seems to have changed for a reason I cannot see, and
> it also introduces a line having both spaces and tabs.
> 

It aligns the const with gfp_t on the line above.

This is a common alignment throughout the kernel for any function whose
prototype spans multiple lines.

Thanks
~Gregory


* Re: [Kernel Bug] INFO: task hung in cgroup_drain_dying
From: Longxing Li @ 2026-06-10  7:11 UTC (permalink / raw)
  To: Michal Koutný; +Cc: syzkaller, tj, hannes, cgroups, linux-kernel
In-Reply-To: <aigMzVNsQpz_J0oQ@localhost.localdomain>

Sorry for not including the full information in my last email. The
config [1] and report [2] are linked below. CONFIG_PROVE_LOCKING is not
enabled in our config.

[1] https://drive.google.com/file/d/1Bx2unEf-QntjVi8g6Zw7QNO6OP4cjGO_/view?usp=drive_link

[2] https://drive.google.com/file/d/1riFUIPWojkYVZu0B5BW8uVPocUWwibqN/view?usp=sharing

The report in plain text is as follows:

INFO: task systemd:1 blocked for more than 143 seconds.
      Not tainted 7.0.6 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:systemd         state:D stack:20616 pid:1     tgid:1     ppid:0
   task_flags:0x400100 flags:0x00080001
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5298 [inline]
 __schedule+0x1006/0x5f00 kernel/sched/core.c:6911
 __schedule_loop kernel/sched/core.c:6993 [inline]
 schedule+0xe7/0x3a0 kernel/sched/core.c:7008
 cgroup_drain_dying+0x1ed/0x360 kernel/cgroup/cgroup.c:6294
 cgroup_rmdir+0x38/0x300 kernel/cgroup/cgroup.c:6309
 kernfs_iop_rmdir+0x10a/0x180 fs/kernfs/dir.c:1311
 vfs_rmdir fs/namei.c:5344 [inline]
 vfs_rmdir+0x340/0x860 fs/namei.c:5317
 filename_rmdir+0x3be/0x510 fs/namei.c:5399
 __do_sys_rmdir fs/namei.c:5422 [inline]
 __se_sys_rmdir fs/namei.c:5419 [inline]
 __x64_sys_rmdir+0x47/0x90 fs/namei.c:5419
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x11b/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb6c32a61c7
RSP: 002b:00007fff90d2bc98 EFLAGS: 00000202 ORIG_RAX: 0000000000000054
RAX: ffffffffffffffda RBX: 000055c177d80fb0 RCX: 00007fb6c32a61c7
RDX: 00007fb6c3387be0 RSI: 0000000000000000 RDI: 000055c177eb1300
RBP: 00007fb6c35eb2da R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000100 R11: 0000000000000202 R12: 0000000000000000
R13: 00007fb6c2ddb6c8 R14: 0000000000000001 R15: 0000000000000000
 </TASK>

Showing all locks held in the system:
3 locks held by systemd/1:
 #0: ffff8880294f8420 (sb_writers#10){.+.+}-{0:0}, at:
filename_rmdir+0x2cc/0x510 fs/namei.c:5388
 #1: ffff888034d16e98 (&type->i_mutex_dir_key#6/1){+.+.}-{4:4}, at:
inode_lock_nested include/linux/fs.h:1073 [inline]
 #1: ffff888034d16e98 (&type->i_mutex_dir_key#6/1){+.+.}-{4:4}, at:
__start_dirop fs/namei.c:2929 [inline]
 #1: ffff888034d16e98 (&type->i_mutex_dir_key#6/1){+.+.}-{4:4}, at:
start_dirop fs/namei.c:2940 [inline]
 #1: ffff888034d16e98 (&type->i_mutex_dir_key#6/1){+.+.}-{4:4}, at:
filename_rmdir+0x318/0x510 fs/namei.c:5392
 #2: ffff8880386d7888 (&type->i_mutex_dir_key#6){++++}-{4:4}, at:
inode_lock include/linux/fs.h:1028 [inline]
 #2: ffff8880386d7888 (&type->i_mutex_dir_key#6){++++}-{4:4}, at:
vfs_rmdir fs/namei.c:5329 [inline]
 #2: ffff8880386d7888 (&type->i_mutex_dir_key#6){++++}-{4:4}, at:
vfs_rmdir+0xef/0x860 fs/namei.c:5317
6 locks held by kworker/u4:0/12:
3 locks held by kworker/u4:1/13:
1 lock held by khungtaskd/25:
 #0: ffffffff8e5e6ce0 (rcu_read_lock){....}-{1:3}, at:
rcu_lock_acquire include/linux/rcupdate.h:312 [inline]
 #0: ffffffff8e5e6ce0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock
include/linux/rcupdate.h:850 [inline]
 #0: ffffffff8e5e6ce0 (rcu_read_lock){....}-{1:3}, at:
debug_show_all_locks+0x36/0x1c0 kernel/locking/lockdep.c:6775
1 lock held by kcompactd0/28:
3 locks held by kworker/u4:3/45:
2 locks held by kworker/0:2/49:
3 locks held by kworker/u4:6/597:
3 locks held by kworker/u4:8/3491:
2 locks held by systemd-journal/5166:
2 locks held by systemd-udevd/5178:
1 lock held by in:imklog/9177:
4 locks held by sshd/9696:
2 locks held by syz-fuzzer/32911:
2 locks held by syz-executor.6/9754:
2 locks held by syz-executor.7/9774:
1 lock held by syz-executor.2/9812:
1 lock held by syz-executor.1/9902:
2 locks held by syz-executor.14/10080:
2 locks held by syz-executor.9/10842:
1 lock held by syz-executor.15/11893:
 #0: ffffffff8e5f25f8 (rcu_state.exp_mutex){+.+.}-{4:4}, at:
exp_funnel_lock+0x1a3/0x3b0 kernel/rcu/tree_exp.h:343
3 locks held by kworker/0:8/13140:
 #0: ffff88801b8a6948 ((wq_completion)events){+.+.}-{0:0}, at:
process_one_work+0x139e/0x1c60 kernel/workqueue.c:3263
 #1: ffffc9000cd37d08 (free_ipc_work){+.+.}-{0:0}, at:
process_one_work+0x938/0x1c60 kernel/workqueue.c:3264
 #2: ffffffff8e5f25f8 (rcu_state.exp_mutex){+.+.}-{4:4}, at:
exp_funnel_lock+0x1a3/0x3b0 kernel/rcu/tree_exp.h:343
2 locks held by kworker/0:10/13232:
3 locks held by kworker/u4:10/13343:
3 locks held by kworker/u4:12/14656:
1 lock held by syz-executor.13/24672:
3 locks held by kworker/u4:5/45131:
3 locks held by kworker/u4:9/46406:
3 locks held by kworker/u4:13/46990:
3 locks held by kworker/u4:16/46993:
2 locks held by syz-executor.8/48198:
3 locks held by kworker/u4:17/53143:
4 locks held by kworker/u4:18/53144:
2 locks held by systemd-rfkill/53174:
2 locks held by syz-executor.7/53471:
2 locks held by kworker/u4:20/53472:
3 locks held by kworker/u4:21/53476:
3 locks held by kworker/u4:22/53479:
3 locks held by kworker/u4:24/53484:
3 locks held by kworker/u4:25/53488:
2 locks held by kworker/0:19/53491:
2 locks held by systemd-udevd/53495:

=============================================

NMI backtrace for cpu 0
CPU: 0 UID: 0 PID: 25 Comm: khungtaskd Not tainted 7.0.6 #1 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x116/0x1b0 lib/dump_stack.c:120
 nmi_cpu_backtrace+0x2a0/0x350 lib/nmi_backtrace.c:113
 nmi_trigger_cpumask_backtrace+0x29c/0x300 lib/nmi_backtrace.c:62
 trigger_all_cpu_backtrace include/linux/nmi.h:161 [inline]
 __sys_info lib/sys_info.c:157 [inline]
 sys_info+0x133/0x180 lib/sys_info.c:165
 check_hung_uninterruptible_tasks kernel/hung_task.c:346 [inline]
 watchdog+0xeac/0x11e0 kernel/hung_task.c:515
 kthread+0x38d/0x4a0 kernel/kthread.c:436
 ret_from_fork+0x942/0xe50 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

On Tue, Jun 9, 2026 at 20:58, Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello Longxing.
>
> On Tue, Jun 09, 2026 at 07:42:06PM +0800, Longxing Li <coregee2000@gmail.com> wrote:
> > We would like to report a new kernel bug found by our tool. INFO: task
> > hung in cgroup_drain_dying. Details are as follows.
>
> Thanks but I see no attachment.
>
> (Even better if you could add the description as plaintext [1])
>
> > Kernel commit: v7.0.6
> > Kernel config: see attachment
>
> Do you have lockdep enabled (CONFIG_PROVE_LOCKING)? That may help
> debugging here.
>
> Thanks,
> Michal
>
> [1] https://docs.kernel.org/process/email-clients.html#general-preferences
>


* Re: [PATCH RFC 00/15] mm/slab: introduce alloc_flags and slab_alloc_context
From: Hao Ge @ 2026-06-10  8:29 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Usama Arif
  Cc: Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Suren Baghdasaryan, Alexei Starovoitov,
	Andrew Morton, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov, kasan-dev,
	linux-mm, linux-kernel, cgroups
In-Reply-To: <fe1bd73b-d956-442f-8c4f-f5f62587346e@kernel.org>

Hi Vlastimil and Usama

On 2026/6/9 22:28, Vlastimil Babka (SUSE) wrote:
> On 6/9/26 15:35, Usama Arif wrote:
>> On Tue, 09 Jun 2026 11:17:45 +0200 "Vlastimil Babka (SUSE)"<vbabka@kernel.org>  wrote:
>>
>>> This series is based on slab/for-next. If all goes well, it would
>>> hopefully go to slab/for-next soon after the 7.2 merge window, so any
>>> other work can be based on it to avoid conflicts, as it touches a lot of
>>> parts of slab.
>>>
>>> Git:https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slab_alloc_flags
>>>
>>> The slab implementation currently relies on gfp flags to convey
>>> some context information internally:
>>>
>>> - The absence of both __GFP_RECLAIM flags is interpreted as "cannot spin
>>>    on locks", and intended to be used by kmalloc_nolock(). But false
>>>    positives are possible e.g. during early boot where gfp_allowed_mask
>>>    clears __GFP_RECLAIM from all allocations. This leads to unnecessary
>>>    allocation failures and workarounds such as fd3634312a04 ("debugobject:
>>>    Make it work with deferred page initialization - again").
>>>
>>> - __GFP_NO_OBJ_EXT exists and takes up a valuable bit in the gfp flags
>>>    space, only to prevent recursive kmalloc() allocations for obj_ext
>>>    arrays and sheaves.
>>>
>> Hello Vlastimil!
>>
>> I think memory allocation profiling uses __GFP_NO_OBJ_EXT, and I don't see
>> it being removed in the series (hopefully I didn't miss it).
>>
>> Adding Hao Ge in CC who did this in the commit:
>> mm/alloc_tag: replace fixed-size early PFN array with dynamic linked list


Thanks for the CC. I'm now aware of this.


> Thanks for the heads up. I missed it because my series is based on
> slab/for-next and that commit is in mm-unstable. My patch 15 actually
> modifies the TODO comment that is meanwhile resolved by Hao Ge's patch.
>
> Which means my patch 15/15 can't be used as-is, and at worst I will drop it.
> But I'd encourage Hao Ge with Suren to find some way to avoid the gfp flag
> usage too, because it's now quite a niche use case (preventing false
> positive CONFIG_MEM_ALLOC_PROFILING_DEBUG warnings, IIUC?) to take a
> valuable gfp flag bit, IMHO.


I previously used __GFP_NO_OBJ_EXT because it serves the same purpose as
in slab: we use it here to prevent recursion within the page allocator.
I hadn't anticipated that __GFP_NO_OBJ_EXT would be removed so soon.

I agree with you. Since slab no longer uses it, retaining this GFP flag
solely for debugging is indeed costly.

I've also been thinking about possible solutions today. Since we are
working in the page allocation path, we need to take various race
conditions into consideration. For instance, what if an interrupt is
triggered inside page_alloc, which then invokes page_alloc again? I'm
not sure if such a scenario exists in practice, but I believe we still
need to account for it.

I would highly appreciate it if anyone could share their ideas.

I've made a note of this.

Would it make sense to hold off on merging patch 15/15 for now? We can
always include it in a later cycle once we have a proper replacement on
the memory allocation profiling side.

Thanks
Best Regards
Hao

>>> The page allocator uses its internal alloc_flags to convey various
>>> context information, including ALLOC_TRYLOCK (meaning "cannot spin").
>>> This series copies that concept for the slab allocator, with its own
>>> slab-specific internal flags:
>>>
>>> - SLAB_ALLOC_DEFAULT - no extra flags (the value is 0), but explicit
>>> - SLAB_ALLOC_TRYLOCK - do not spin on locks (used by kmalloc_nolock())
>>> - SLAB_ALLOC_NEW_SLAB - replacing existing 'bool new_slab' parameter
>>> 			for allocating obj_ext arrays
>>> - SLAB_ALLOC_NO_RECURSE - replacing usage of __GFP_NO_OBJ_EXT
>>>
>>> To reduce the amount of parameters in various internal functions, we
>>> additionally introduce slab_alloc_context (also inspired by page
>>> allocator's alloc_context) for passing a number of existing arguments
>>> and the new alloc_flags:
>>>
>>> /* Structure holding extra parameters for slab allocations */
>>> struct slab_alloc_context {
>>> 	unsigned long caller_addr;
>>> 	unsigned long orig_size;
>>> 	unsigned int alloc_flags;
>>> 	struct list_lru *lru;
>>> };
>>>
>>> This also replaces the existing struct partial_context.
>>>
>>> The last necessary piece is kmalloc_flags() which can take the
>>> alloc_flags in addition to gfp flags and is intended for the recursive
>>> allocations of sheaves and obj_ext arrays, so that both
>>> SLAB_ALLOC_TRYLOCK and SLAB_ALLOC_NO_RECURSE can be communicated.
>>> Internally it decides between kmalloc_nolock() and normal kmalloc()
>>> depending on SLAB_ALLOC_TRYLOCK.
>>>
>>> The rest of the series is gradually expanding the usage of both
>>> alloc_flags and slab_alloc_context as necessary, with bits of
>>> refactoring. Then, __GFP_NO_OBJ_EXT is removed completely.
>>>
>>> Note that some usage of gfpflags_allow_spinning() relying on absence of
>>> __GFP_RECLAIM remains outside of slab (and page allocator) in memcg,
>>> page_owner and stackdepot code. These can thus yield false-positive
>>> decisions that spinning is not allowed, but should not result in
>>> important allocations failing anymore.
>>>
>>> Signed-off-by: Vlastimil Babka (SUSE)<vbabka@kernel.org>
>>> ---
>>> Vlastimil Babka (SUSE) (15):
>>>        mm/slab: always zero only requested size on alloc
>>>        mm/slab: stop inlining __slab_alloc_node()
>>>        mm/slab: introduce slab_alloc_context
>>>        mm/slab: introduce alloc_flags and SLAB_ALLOC_TRYLOCK
>>>        mm/slab: add alloc_flags to slab_alloc_context
>>>        mm/slab: replace struct partial_context with slab_alloc_context
>>>        mm/slab: pass alloc_flags to new slab allocation
>>>        mm/slab: pass alloc_flags through slab_post_alloc_hook() chain
>>>        mm/slab: replace slab_alloc_node() parameters with slab_alloc_context
>>>        mm/slab: allow kmem_cache_alloc_bulk() with any gfp flags
>>>        mm/slab: pass slab_alloc_context to __do_kmalloc_node()
>>>        mm/slab: introduce kmalloc_flags()
>>>        mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
>>>        mm/slab: replace __GFP_NO_OBJ_EXT with SLAB_ALLOC_NO_RECURSE for sheaves
>>>        mm: remove the __GFP_NO_OBJ_EXT flag
>>>
>>>   include/linux/gfp_types.h       |   7 -
>>>   include/linux/slab.h            |  14 +-
>>>   include/trace/events/mmflags.h  |  10 +-
>>>   lib/alloc_tag.c                 |   2 +-
>>>   mm/kfence/core.c                |   6 +-
>>>   mm/memcontrol.c                 |   5 +-
>>>   mm/slab.h                       |  16 +-
>>>   mm/slub.c                       | 423 ++++++++++++++++++++++++----------------
>>>   tools/include/linux/gfp_types.h |   7 -
>>>   9 files changed, 288 insertions(+), 202 deletions(-)
>>> ---
>>> base-commit: 500b2c9755301742bdbb61249511ac11a4665dae
>>> change-id: 20260601-slab_alloc_flags-25c782b0c57c
>>>
>>> Best regards,
>>> --
>>> Vlastimil Babka (SUSE)<vbabka@kernel.org>


* Re: [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server
From: Juri Lelli @ 2026-06-10  9:21 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Johannes Weiner, Michal Koutný, cgroups,
	linux-kernel, Luca Abeni, Yuri Andriaccio
In-Reply-To: <9c0d3d19-5c14-42e0-b29d-4ea32d9e624f@santannapisa.it>

On 09/06/26 18:23, Yuri Andriaccio wrote:
> Hi Juri,
> 
> Thanks for looking into this.
> 
> > I started playing with the new interface and ended up with the following
> >
> > bash-5.3# cat cpu.rt.max  (root)
> > 10000 100000
> > bash-5.3# cat g1/cpu.rt.max
> > 10000 100000
> > bash-5.3# cat g1/cpu.rt.internal
> > 9999 100000
> >
> > which looks odd to me, as nothing is running on g1 yet and no children
> > groups either. Maybe a rounding error of some kind?
> 
> You are right. I should have mentioned that it is just a rounding error that
> occurs when converting from a bandwidth value to a runtime value. This
> happens because the tg_rt_internal_bandwidth() function truncates the value
> when transforming the runtime from nanoseconds to micros. Rounding could be
> used here to report a more accurate value.
> 
> This same issue is probably found in the from_ratio() function, which has a
> similar truncation issue when converting from bandwidth to runtime, but
> since it is working in the nanoseconds range it might not be that big of a
> problem. The value from from_ratio() is used for the setup of the dl_servers
> even when the children bw is zero, so maybe it is possible to add a special
> case?
> 
> Anyways, as it is right now, the cpu.rt.internal may have only a +1/-1us
> error in reporting the actual used values, while the error for the runtime
> value used internally to setup the dl_servers is in the range of tens of
> nanoseconds.

Not a huge problem per se, but it will raise some eyebrows (and generate
questions) if we leave things as is, I fear.

I wonder if, instead of converting to bandwidth ratios and back (losing
precision in both directions), we can compute children's runtime sum directly
in nanoseconds. For children with different periods, we can maybe normalize
(128-bit intermediate?). Parent's internal runtime is then a simple exact
subtraction: parent_runtime - children_runtime_sum. This should reduce
precision loss from double conversions. Also, as you suggest, apply
rounding when displaying to the user.
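
To make the arithmetic concrete, here is a minimal userspace sketch (not
kernel code) of the double conversion versus a direct nanosecond
subtraction. All helper names are hypothetical, BW_SHIFT = 20 is assumed
to match the scheduler's fixed-point bandwidth ratio, and the 128-bit
normalization relies on the GCC/Clang unsigned __int128 extension.

```c
#include <assert.h>
#include <stdint.h>

#define BW_SHIFT 20	/* assumed fixed-point shift for bandwidth ratios */

/* runtime/period as a fixed-point ratio; the division truncates. */
static uint64_t ratio_from_runtime(uint64_t period_ns, uint64_t runtime_ns)
{
	assert(period_ns != 0);
	return (runtime_ns << BW_SHIFT) / period_ns;
}

/* Ratio back to a runtime; the shift truncates a second time. */
static uint64_t runtime_from_ratio(uint64_t period_ns, uint64_t ratio)
{
	return (ratio * period_ns) >> BW_SHIFT;
}

/* Rescale a child's runtime to the parent's period via a 128-bit product. */
static uint64_t scale_runtime_ns(uint64_t child_runtime_ns,
				 uint64_t child_period_ns,
				 uint64_t parent_period_ns)
{
	unsigned __int128 prod;

	assert(child_period_ns != 0);
	prod = (unsigned __int128)child_runtime_ns * parent_period_ns;
	return (uint64_t)(prod / child_period_ns);
}

/* Exact alternative: subtract the children's sum directly in nanoseconds. */
static uint64_t internal_runtime_ns(uint64_t parent_runtime_ns,
				    uint64_t children_sum_ns)
{
	assert(children_sum_ns <= parent_runtime_ns);
	return parent_runtime_ns - children_sum_ns;
}

/* Nanoseconds to microseconds, rounded to nearest for display. */
static uint64_t ns_to_us_rounded(uint64_t ns)
{
	return (ns + 500) / 1000;
}
```

With a 10000us runtime over a 100000us period and no children, the round
trip through the ratio yields 9999942ns, which truncates to the 9999us
seen in cpu.rt.internal, while the direct subtraction keeps 10000000ns.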



* Re: [PATCH RFC 01/15] mm/slab: always zero only requested size on alloc
From: Vlastimil Babka (SUSE) @ 2026-06-10 10:36 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260609-slab_alloc_flags-v1-1-2bf4a4b9b526@kernel.org>

On 6/9/26 11:17, Vlastimil Babka (SUSE) wrote:
> When zeroing on alloc is requested (by __GFP_ZERO or the init_on_alloc
> parameter), we have been trying to zero the whole kmalloc bucket size
> and not just requested size, if possible.
> 
> This probably comes from the past where ksize() could be used to
> discover the bucket size and use it opportunistically beyond the
> requested size. This is now forbidden and enabling debugging such as
> KASAN or slab's red zoning would catch this misuse. Therefore, nobody
> can be relying on __GFP_ZERO zeroing beyond requested size.

Well, Sashiko says I'm wrong because krealloc() might be used later and then
the initially unused part might become used and we won't clear it because we
don't (unless slab debugging is enabled) know the original requested size
anymore. So we have to keep zeroing the full s->object_size in the cases we
currently do that.
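
A toy userspace model of the problem, as a sketch only: the names, the
64-byte bucket and the 0xAA stale pattern are all made up for
illustration and none of this is the slab code. It shows that if only
the requested size is zeroed, a later in-place grow within the same
bucket exposes stale bytes, because the original requested size is not
recorded anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUCKET_SIZE 64	/* hypothetical kmalloc bucket size */

/* One toy object: a whole bucket that may hold a previous user's data. */
struct toy_object {
	unsigned char bytes[BUCKET_SIZE];
};

/* Simulate leftover contents from an earlier allocation. */
static void toy_fill_stale(struct toy_object *obj)
{
	memset(obj->bytes, 0xAA, BUCKET_SIZE);
}

/* Zeroing allocation: clear the full bucket or only the requested size. */
static void toy_zalloc(struct toy_object *obj, size_t requested,
		       int zero_full_bucket)
{
	assert(requested <= BUCKET_SIZE);
	memset(obj->bytes, 0, zero_full_bucket ? BUCKET_SIZE : requested);
}

/*
 * In-place grow: the new size still fits the bucket, and since the
 * original requested size is unknown, nothing gets zeroed here.
 * Returns 1 if the grow can be satisfied in place.
 */
static int toy_grow_in_place(const struct toy_object *obj, size_t new_size)
{
	(void)obj;
	return new_size <= BUCKET_SIZE;
}

/* Count non-zero bytes in [from, to). */
static size_t toy_count_stale(const struct toy_object *obj,
			      size_t from, size_t to)
{
	size_t i, n = 0;

	for (i = from; i < to; i++)
		if (obj->bytes[i] != 0)
			n++;
	return n;
}
```

Requesting 40 bytes and then growing to 60 leaves 20 stale bytes visible
when only the requested size was zeroed, and none when the full bucket
was zeroed.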

> Theoretically it might still improve hardening in case of unintended
> accesses beyond requested size accessing some sensitive data from a
> previous allocation. But then, init_on_free is probably used also for
> hardening and would have cleared that.
> 
> So the usefulness of zeroing beyond requested size is practically none
> nowadays. The disadvantages for doing it are:
> 
> - Interaction with KFENCE, which performs the zeroing on its own because
>   it has its own redzone beyond requested size. As a consequence
>   slab_post_alloc_hook() has an 'init' parameter which has to be
>   evaluated in all callers (via slab_want_init_on_alloc()).
> 
>   For kfence allocations in slab_alloc_node() this evaluation is subtly
>   skipped over in order to do the right thing. Other callers (i.e.
>   kmem_cache_alloc_bulk_noprof()) evaluate it unconditionally even if
>   they do end up with a kfence allocation. This is only subtly not a
>   problem, as those are not kmalloc allocations and are using
>   s->object_size as requested size, so it doesn't interfere with kfence's
>   redzone. There's just an unnecessary double zeroing (in both kfence and
>   slab_post_alloc_hook()), but it's all very fragile and contradicts the
>   comment in kfence_guarded_alloc().
> 
> - Interaction with slab's redzoning where we have to limit the zeroing
>   to requested size.
> 
> We can make the code much simpler by always zeroing only up to the
> requested size. Move slab_want_init_on_alloc() call to
> slab_post_alloc_hook(), removing the parameter. Remove the red zone
> handling.
> 
> For kfence's zeroing code, update the comment. We could remove it
> completely, but due to possible interactions with KASAN, there are
> configurations where neither slab nor KASAN would zero the object,
> so simply do it in kfence. At worst the zeroing will happen twice, but
> kfence allocations are rare by design so the cost is negligible.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>


* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-10 10:41 UTC (permalink / raw)
  To: Balbir Singh
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <ah-0CyZurn5D1ezY@parvat>

On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > 
> >    __GFP_THISNODE cannot be overloaded to do anything useful here.
> 
> Let me clarify, I meant to say, let's use a nodemask for allocation
> and __GFP_THISNODE gets us to the node we desire, if that is the only
> node. My earlier comment might not have been clear.
> 

I've been testing a stripped-back patch set where I drop all FALLBACK
entries for private nodes (including for itself) and only keep the
NOFALLBACK entry for private nodes.

This effectively isolates the nodes for any allocation without
__GFP_THISNODE.

This also precludes these nodes from ever using non-mbind mempolicies,
which I think is a completely reasonable compromise and something I was
already expecting we would do.

Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
which causes spillage into private nodes because slub allows private
nodes in its mask.  I think this is fixable.

I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
code, etc), but it seems like fully dropping the FALLBACK entries and
requiring __GFP_THISNODE might be sufficient.

~Gregory


* Re: [PATCH] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: Gregory Price @ 2026-06-10 11:34 UTC (permalink / raw)
  To: Farhad Alemi
  Cc: Andrew Morton, David Hildenbrand, Farhad Alemi, Yury Norov,
	Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
	Ying Huang, Alistair Popple, Waiman Long, Rasmus Villemoes,
	linux-mm, linux-kernel, cgroups, stable
In-Reply-To: <CA+0ovCg05rUk1-3k2ysdxmbcER8aG-wVh9SSTrrbp6LPWpPHYA@mail.gmail.com>

On Tue, Jun 09, 2026 at 07:57:41PM -0400, Farhad Alemi wrote:
> cpuset_update_tasks_nodemask() rebinds a task's own mempolicy to the
> cpuset's effective, online mems (newmems, from guarantee_online_mems()),
> but rebinds that task's VMA mempolicies to the *configured* mask instead:
> 
> 	cpuset_change_task_nodemask(task, &newmems);
> 	...
> 	mpol_rebind_mm(mm, &cs->mems_allowed);
> 
> On the default (v2) hierarchy a cpuset that has never had cpuset.mems
> written keeps mems_allowed empty while effective_mems is inherited
> non-empty from the parent, and tasks may be attached to it (the
> empty-mems attach check is v1-only).  A subsequent rebind -- e.g. from a
> CPU hotplug event walking the cpuset -- then calls mpol_rebind_mm() with
> an empty mask.  For a VMA policy created with MPOL_F_RELATIVE_NODES this
> reaches mpol_relative_nodemask() ->
> nodes_fold(..., nodes_weight(cs->mems_allowed) == 0) -> bitmap_fold(),
> whose set_bit(oldbit % sz, dst) divides by zero:
> 
>   Oops: divide error: 0000 [#1] SMP KASAN NOPTI
>   RIP: 0010:bitmap_fold+0x5e/0xb0
>    mpol_rebind_nodemask
>    mpol_rebind_mm
>    cpuset_update_tasks_nodemask
>    cpuset_handle_hotplug
>    sched_cpu_deactivate
>    cpuhp_thread_fun
> 
> cs->mems_allowed is the only nodemask in this function that is not the
> effective set: the task-policy rebind, the page-migration target and
> cs->old_mems_allowed all use newmems.  The sibling cpuset_attach() path
> already rebinds VMA policies against the effective mems
> (cpuset_attach_nodemask_to = cs->effective_mems) and explicitly notes
> that mems_allowed can be empty under hotplug.  Rebind the VMA policies to
> newmems too: it is guaranteed non-empty by guarantee_online_mems(), which
> fixes the divide-by-zero, and it makes the VMA policies consistent with
> the task policy and with the nodes the task is actually allowed to use.
>

I think you can make this a bit more concise:

  Creating a child cpuset where cpuset.mems is never set leads to
  a div/0 when a VMA mempolicy with MPOL_F_RELATIVE_NODES rebinds
  in response to a CPU hotplug event.

  Reproduction steps:
     1) Create a cgroup w/ cpuset controls (do not set cpuset.mems)
     2) Move the task into the child cpuset
     3) Create a VMA mempolicy for that task with MPOL_F_RELATIVE_NODES
     4) unplug and hotplug a cpu
          echo 0 > /sys/devices/system/cpu/cpu1/online
          echo 1 > /sys/devices/system/cpu/cpu1/online
     5) mempolicy rebind does a div/0 in mpol_relative_nodemask on the
        call to __nodes_fold()

  The cpuset code passes cs->mems_allowed, which is not guaranteed to
  contain any nodes, to the rebind routine.  Use mems_effective - the
  value returned by guarantee_online_mems() - instead, which is
  guaranteed to be a non-empty nodemask.
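
For illustration, a small userspace sketch of why an empty mask is fatal
in the fold step. It uses a single 64-bit word as a toy nodemask, every
name in it is hypothetical, and it only models the "bit % sz" folding
rather than the full relative-nodemask remap.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 64	/* toy limit: one 64-bit word as a nodemask */

/* Number of set bits in the mask (a stand-in for a weight helper). */
static unsigned int mask_weight(uint64_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Fold every set bit of orig into [0, sz) using "bit % sz".
 * sz == 0 would be a division by zero, so the sketch asserts on it.
 */
static uint64_t mask_fold(uint64_t orig, unsigned int sz)
{
	uint64_t dst = 0;
	unsigned int bit;

	assert(sz != 0);
	for (bit = 0; bit < MAX_NODES; bit++)
		if (orig & (UINT64_C(1) << bit))
			dst |= UINT64_C(1) << (bit % sz);
	return dst;
}

/*
 * Fold user_nodes relative to rel_to. Returns -1 instead of dividing
 * by zero when rel_to is empty, 0 on success with the result in *out.
 */
static int fold_relative(uint64_t user_nodes, uint64_t rel_to, uint64_t *out)
{
	unsigned int sz = mask_weight(rel_to);

	if (sz == 0)
		return -1;
	*out = mask_fold(user_nodes, sz);
	return 0;
}
```

Passing an empty mask as rel_to is the situation the hotplug rebind hits
with a never-written cpuset.mems; passing a mask that is known to be
non-empty avoids it by construction.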

Maybe add a link to your reproducer and the original [BUG]

Link: https://lore.kernel.org/linux-mm/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/
Link: https://lore.kernel.org/all/CA+0ovCiEz6SP_sn3kN4Tb+_oC=eHMXy_Ffj=usV3wREdQrUtww@mail.gmail.com/

Does this need a Closes tag?

> Fixes: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}")
> Suggested-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Farhad Alemi <farhad.alemi@berkeley.edu>
> Cc: stable@vger.kernel.org
> ---
>  kernel/cgroup/cpuset.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
> 
>  		migrate = is_memory_migrate(cs);
> 
> -		mpol_rebind_mm(mm, &cs->mems_allowed);
> +		mpol_rebind_mm(mm, &newmems);
>  		if (migrate)
>  			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
>  		else
> -- 
> 2.43.0


* Re: [PATCH RFC 02/15] mm/slab: stop inlining __slab_alloc_node()
From: Harry Yoo @ 2026-06-10 12:06 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260609-slab_alloc_flags-v1-2-2bf4a4b9b526@kernel.org>





On 6/9/26 6:17 PM, Vlastimil Babka (SUSE) wrote:
> With sheaves, this is no longer part of the allocation fastpath.  For
> the same reason, also mark the call to it from slab_alloc_node() as
> unlikely().
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon



* Re: [Kernel Bug] INFO: task hung in cgroup_drain_dying
From: Michal Koutný @ 2026-06-10 13:27 UTC (permalink / raw)
  To: Longxing Li; +Cc: syzkaller, tj, hannes, cgroups, linux-kernel
In-Reply-To: <CAHPqNmwdh5Je=hrvEVzK90j91h2kOqXDmF1vz9UTtfcn1LUO1A@mail.gmail.com>


On Wed, Jun 10, 2026 at 03:11:41PM +0800, Longxing Li <coregee2000@gmail.com> wrote:
> sorry for not containing full information in last email. the config[1]
> and report[2] are as follows. CONFIG_PROVE_LOCKING is not enabled in
> our config.

Thanks.

> INFO: task systemd:1 blocked for more than 143 seconds.
>       Not tainted 7.0.6 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:systemd         state:D stack:20616 pid:1     tgid:1     ppid:0
>    task_flags:0x400100 flags:0x00080001
> Call Trace:
>  <TASK>
>  context_switch kernel/sched/core.c:5298 [inline]
>  __schedule+0x1006/0x5f00 kernel/sched/core.c:6911
>  __schedule_loop kernel/sched/core.c:6993 [inline]
>  schedule+0xe7/0x3a0 kernel/sched/core.c:7008
>  cgroup_drain_dying+0x1ed/0x360 kernel/cgroup/cgroup.c:6294
>  cgroup_rmdir+0x38/0x300 kernel/cgroup/cgroup.c:6309
>  kernfs_iop_rmdir+0x10a/0x180 fs/kernfs/dir.c:1311
>  vfs_rmdir fs/namei.c:5344 [inline]
>  vfs_rmdir+0x340/0x860 fs/namei.c:5317
>  filename_rmdir+0x3be/0x510 fs/namei.c:5399
>  __do_sys_rmdir fs/namei.c:5422 [inline]
>  __se_sys_rmdir fs/namei.c:5419 [inline]
>  __x64_sys_rmdir+0x47/0x90 fs/namei.c:5419
>  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
>  do_syscall_64+0x11b/0xf80 arch/x86/entry/syscall_64.c:94
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f

Hm, hm, this kind of fits 93618edf75383 ("cgroup: Defer css percpu_ref
kill on rmdir until cgroup is depopulated"),
which got into stable 7.0.9.
Can you reproduce it even with that (or a newer) kernel?

Michal


^ permalink raw reply

* Re: [PATCH RFC 00/15] mm/slab: introduce alloc_flags and slab_alloc_context
From: Harry Yoo @ 2026-06-10 13:46 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260609-slab_alloc_flags-v1-0-2bf4a4b9b526@kernel.org>


[-- Attachment #1.1: Type: text/plain, Size: 6471 bytes --]



On 6/9/26 6:17 PM, Vlastimil Babka (SUSE) wrote:
> This series is based on slab/for-next. If all goes well, it would
> hopefully go to slab/for-next soon after the 7.2 merge window, so any
> other work can be based on it to avoid conflicts, as it touches a lot
> of parts of slab.
> 
> Git: https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slab_alloc_flags
> 
> The slab implementation currently relies on gfp flags to convey
> some context information internally:
> 
> - The absence of both __GFP_RECLAIM flags is interpreted as "cannot spin
>   on locks", and intended to be used by kmalloc_nolock(). But false
>   positives are possible e.g. during early boot where gfp_allowed_mask
>   clears __GFP_RECLAIM from all allocations. This leads to unnecessary
>   allocation failures and workarounds such as fd3634312a04 ("debugobject:
>   Make it work with deferred page initialization - again").
> 
> - __GFP_NO_OBJ_EXT exists and takes up a valuable bit in the gfp flags
>   space, only to prevent recursive kmalloc() allocations for obj_ext
>   arrays and sheaves.

[ Cc'ing Vishal and Matthew as it's somewhat relevant to memdescs... ]

When the page allocator starts allocating slab objects,
we still need a way to avoid recursion for obj_ext arrays and sheaves
(by passing SLAB_ALLOC_NO_RECURSE).

Looking at kmalloc_flags(), probably we'll end up introducing a separate
gfp type for slab-specific flags?

Hmm but SLAB_ALLOC_* flags are defined in mm/slab.h and kmalloc_flags()
is defined in include/linux/slab.h. Do you intend to restrict the slab
alloc flags to MM only?

> The page allocator uses its internal alloc_flags to convey various
> context information, including ALLOC_TRYLOCK (meaning "cannot spin").
> This series copies that concept for the slab allocator, with its own
> slab-specific internal flags:
> 
> - SLAB_ALLOC_DEFAULT - no extra flags (the value is 0), but explicit
> - SLAB_ALLOC_TRYLOCK - do not spin on locks (used by kmalloc_nolock())
> - SLAB_ALLOC_NEW_SLAB - replacing existing 'bool new_slab' parameter
> 			for allocating obj_ext arrays
> - SLAB_ALLOC_NO_RECURSE - replacing usage of __GFP_NO_OBJ_EXT
> 
> To reduce the amount of parameters in various internal functions, we
> additionally introduce slab_alloc_context (also inspired by page
> allocator's alloc_context) for passing a number of existing arguments
> and the new alloc_flags:
> 
> /* Structure holding extra parameters for slab allocations */
> struct slab_alloc_context {
> 	unsigned long caller_addr;
> 	unsigned long orig_size;
> 	unsigned int alloc_flags;
> 	struct list_lru *lru;
> };

Perhaps beyond the scope of the patchset, but I wonder if we could have
something like struct slab_alloc_context but for kmalloc callers to
simplify {PASS,DECL}_KMALLOC_PARAMS().

Something like:

struct kmalloc_params {
#ifdef CONFIG_SLAB_BUCKETS
	kmem_buckets *b;
#endif
#ifdef CONFIG_KMALLOC_PARTITION_CACHES
	kmalloc_token_t token;
#endif
};

The idea is to move optional kmalloc parameters (depending on config)
into a single struct, instead of using the macros.

void *__kmalloc_node(size_t size, gfp_t flags, int node,
		     unsigned long caller,
		     struct kmalloc_params params);

void *kmalloc_node() {
    /* ... snip ...*/
    struct kmalloc_params params = KMALLOC_PARAMS(params.b, params.token);
    return __kmalloc_node(size, flags, node, _RET_IP_, params);
}

The compiler should optimize away unused fields based on the config.

Per System V AMD64 ABI, the compiler will use registers to pass the
struct, as long as the struct size does not exceed 16 bytes.
(Otherwise it will be passed on stack).
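To make the size constraint concrete, here is a minimal userspace sketch
(not kernel code). The names kmem_buckets_stub, kmalloc_token_stub,
kmalloc_params_sketch and the two helper functions are hypothetical
stand-ins of mine; the real kmem_buckets and kmalloc_token_t layouts are
not shown in this thread, and the token width is an assumption. It only
checks that a two-member params struct stays within 16 bytes and can be
passed by value:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real kernel types (assumptions). */
typedef struct kmem_buckets_stub kmem_buckets_stub;
typedef uint32_t kmalloc_token_stub;

/* Both optional members present, as if both config options were enabled. */
struct kmalloc_params_sketch {
	kmem_buckets_stub *b;		/* one pointer */
	kmalloc_token_stub token;	/* 4 bytes, plus tail padding on LP64 */
};

/* Fails the build if the struct outgrows the 16-byte limit discussed above. */
_Static_assert(sizeof(struct kmalloc_params_sketch) <= 16,
	       "params struct would no longer fit in two eightbytes");

/* Report the struct size so callers can inspect it at run time. */
size_t params_size_sketch(void)
{
	return sizeof(struct kmalloc_params_sketch);
}

/* Takes the struct by value, the way __kmalloc_node() would in the proposal. */
kmalloc_token_stub params_token_sketch(struct kmalloc_params_sketch params)
{
	return params.token;
}
```

Whether the struct really travels in registers depends on the target ABI,
so this only guards the size, not the calling convention.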

> This also replaces the existing struct partial_context.
> 
> The last necessary piece is kmalloc_flags() which can take the
> alloc_flags in addition to gfp flags and is intended for the recursive
> allocations of sheaves and obj_ext arrays, so that both
> SLAB_ALLOC_TRYLOCK and SLAB_ALLOC_NO_RECURSE can be communicated.
> Internally it decides between kmalloc_nolock() and normal kmalloc()
> depending on SLAB_ALLOC_TRYLOCK.
> 
> The rest of the series is gradually expanding the usage of both
> alloc_flags and slab_alloc_context as necessary, with bits of
> refactoring. Then, __GFP_NO_OBJ_EXT is removed completely.
> 
> Note that some usage of gfpflags_allow_spinning() relying on absence of
> __GFP_RECLAIM remains outside of slab (and page allocator) in memcg,
> page_owner and stackdepot code. These can thus yield false-positive
> decisions that spinning is not allowed, but should not result in
> important allocations failing anymore.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
> Vlastimil Babka (SUSE) (15):
>       mm/slab: always zero only requested size on alloc
>       mm/slab: stop inlining __slab_alloc_node()
>       mm/slab: introduce slab_alloc_context
>       mm/slab: introduce alloc_flags and SLAB_ALLOC_TRYLOCK
>       mm/slab: add alloc_flags to slab_alloc_context
>       mm/slab: replace struct partial_context with slab_alloc_context
>       mm/slab: pass alloc_flags to new slab allocation
>       mm/slab: pass alloc_flags through slab_post_alloc_hook() chain
>       mm/slab: replace slab_alloc_node() parameters with slab_alloc_context
>       mm/slab: allow kmem_cache_alloc_bulk() with any gfp flags
>       mm/slab: pass slab_alloc_context to __do_kmalloc_node()
>       mm/slab: introduce kmalloc_flags()
>       mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
>       mm/slab: replace __GFP_NO_OBJ_EXT with SLAB_ALLOC_NO_RECURSE for sheaves
>       mm: remove the __GFP_NO_OBJ_EXT flag
> 
>  include/linux/gfp_types.h       |   7 -
>  include/linux/slab.h            |  14 +-
>  include/trace/events/mmflags.h  |  10 +-
>  lib/alloc_tag.c                 |   2 +-
>  mm/kfence/core.c                |   6 +-
>  mm/memcontrol.c                 |   5 +-
>  mm/slab.h                       |  16 +-
>  mm/slub.c                       | 423 ++++++++++++++++++++++++----------------
>  tools/include/linux/gfp_types.h |   7 -
>  9 files changed, 288 insertions(+), 202 deletions(-)
> ---
> base-commit: 500b2c9755301742bdbb61249511ac11a4665dae
> change-id: 20260601-slab_alloc_flags-25c782b0c57c
> 
> Best regards,
> --  
> Vlastimil Babka (SUSE) <vbabka@kernel.org>

-- 
Cheers,
Harry / Hyeonggon

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH RFC 00/15] mm/slab: introduce alloc_flags and slab_alloc_context
From: Vlastimil Babka (SUSE) @ 2026-06-10 14:04 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <e650e125-5f6c-4de7-88b5-9da666bb0a4e@kernel.org>

On 6/10/26 15:46, Harry Yoo wrote:
> 
> 
> On 6/9/26 6:17 PM, Vlastimil Babka (SUSE) wrote:
>> This series is based on slab/for-next. If all goes well, it would
>> hopefully go to slab/for-next soon after the 7.2 merge window, so any
>> other work can be based on it to avoid conflicts, as it touches a lot
>> of parts of slab.
>> 
>> Git: https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slab_alloc_flags
>> 
>> The slab implementation currently relies on gfp flags to convey
>> some context information internally:
>> 
>> - The absence of both __GFP_RECLAIM flags is interpreted as "cannot spin
>>   on locks", and intended to be used by kmalloc_nolock(). But false
>>   positives are possible e.g. during early boot where gfp_allowed_mask
>>   clears __GFP_RECLAIM from all allocations. This leads to unnecessary
>>   allocation failures and workarounds such as fd3634312a04 ("debugobject:
>>   Make it work with deferred page initialization - again").
>> 
>> - __GFP_NO_OBJ_EXT exists and takes up a valuable bit in the gfp flags
>>   space, only to prevent recursive kmalloc() allocations for obj_ext
>>   arrays and sheaves.
> 
> [ Cc'ing Vishal and Matthew as it's somewhat relevant to memdescs... ]
> 
> When the page allocator starts allocating slab objects,
> we still need a way to avoid recursion for obj_ext arrays and sheaves
> (by passing SLAB_ALLOC_NO_RECURSE).
> 
> Looking at kmalloc_flags(), probably we'll end up introducing a separate
> gfp type for slab-specific flags?

What do you mean by separate gfp type?

> Hmm but SLAB_ALLOC_* flags are defined in mm/slab.h and kmalloc_flags()
> is defined in include/linux/slab.h. Do you intend to restrict the slab
> alloc flags to MM only?

Yeah I don't expect users outside MM. If a valid one appears, we can move
it. I should try moving kmalloc_flags() to mm/slab.h as well, unless there's
some header dependency issue that will prevent it.

>> The page allocator uses its internal alloc_flags to convey various
>> context information, including ALLOC_TRYLOCK (meaning "cannot spin").
>> This series copies that concept for the slab allocator, with its own
>> slab-specific internal flags:
>> 
>> - SLAB_ALLOC_DEFAULT - no extra flags (the value is 0), but explicit
>> - SLAB_ALLOC_TRYLOCK - do not spin on locks (used by kmalloc_nolock())
>> - SLAB_ALLOC_NEW_SLAB - replacing existing 'bool new_slab' parameter
>> 			for allocating obj_ext arrays
>> - SLAB_ALLOC_NO_RECURSE - replacing usage of __GFP_NO_OBJ_EXT
>> 
>> To reduce the amount of parameters in various internal functions, we
>> additionally introduce slab_alloc_context (also inspired by page
>> allocator's alloc_context) for passing a number of existing arguments
>> and the new alloc_flags:
>> 
>> /* Structure holding extra parameters for slab allocations */
>> struct slab_alloc_context {
>> 	unsigned long caller_addr;
>> 	unsigned long orig_size;
>> 	unsigned int alloc_flags;
>> 	struct list_lru *lru;
>> };
> 
> Perhaps beyond the scope of the patchset, but I wonder if we could have
> something like struct slab_alloc_context but for kmalloc callers to
> simplify {PASS,DECL}_KMALLOC_PARAMS().
> 
> Something like:
> 
> struct kmalloc_params {
> #ifdef CONFIG_SLAB_BUCKETS
> 	kmem_buckets *b;
> #endif
> #ifdef CONFIG_KMALLOC_PARTITION_CACHES
> 	kmalloc_token_t token;
> #endif
> };
> 
> The idea is to move optional kmalloc parameters (depending on config)
> into a single struct, instead of using the macros.
> 
> void *__kmalloc_node(size_t size, gfp_t flags, int node,
> 		     unsigned long caller,
> 		     struct kmalloc_params params);
> 
> void *kmalloc_node() {
>     /* ... snip ...*/
>     struct kmalloc_params params = KMALLOC_PARAMS(params.b, params.token);
>     return __kmalloc_node(size, flags, node, _RET_IP_, params);
> }
> 
> The compiler should optimize away unused fields based on the config.
> 
> Per System V AMD64 ABI, the compiler will use registers to pass the
> struct, as long as the struct size does not exceed 16 bytes.
> (Otherwise it will be passed on stack).

Hm but does this work on all architectures, and are we doing this somewhere
(for structures larger than a native word) already?
Also Marco noted earlier that gcc won't optimize away the struct if it
becomes zero-sized:

https://lore.kernel.org/all/CANpmjNO1aNm3mKphDGWasK_NUfVY8q4K9GCjyREZFqrOu9WLcw@mail.gmail.com/



^ permalink raw reply

* Re: [PATCH RFC 00/15] mm/slab: introduce alloc_flags and slab_alloc_context
From: Vlastimil Babka (SUSE) @ 2026-06-10 14:54 UTC (permalink / raw)
  To: Alexei Starovoitov, Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <DJ4QLD7TWSOU.3LEDXKMML5DMK@gmail.com>

On 6/9/26 20:40, Alexei Starovoitov wrote:
> On Tue Jun 9, 2026 at 2:17 AM PDT, Vlastimil Babka (SUSE) wrote:
>> This series is based on slab/for-next. If all goes well, it would
>> hopefully go to slab/for-next soon after the 7.2 merge window, so any
>> other work can be based on it to avoid conflicts, as it touches a lot
>> of parts of slab.
>>
>> Git: https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slab_alloc_flags
> 
> Overall looks great to me.
> I would ship all patches except the last one for this merge window,
> since I don't see anything controversial or dangerous in there.

Hmm that's ambitious :)

> Especially since it touches slab so much. My slab-arena changes
> would need to adopt it and I don't want to delay the whole thing by
> two merge windows.
> Harry's changes would need to be rebased as well.

It wouldn't be a problem if they went through the slab tree as well, and
just be applied on top of this series already in the slab tree.
In case of bpf tree there could be a shared stable branch.
So no delays by two merge windows.

> So the sooner the trees converge the better.

But yeah it would be simpler.
I can try exposing this to -next ASAP and, if no issues arise, plan to
send a second PR (separate from the series already there) in the second
week of the merge window, and see if Linus is benevolent.


^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-06-10 15:00 UTC (permalink / raw)
  To: Gregory Price, Balbir Singh
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <aik_ddHymus2DJ6D@gourry-fedora-PF4VCD3F>

On 6/10/26 12:41, Gregory Price wrote:
> On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
>>>
>>>    __GFP_THISNODE cannot be overloaded to do anything useful here.
>>
>> Let me clarify, I meant to say, let's use a nodemask for allocation
>> and __GFP_THISNODE gets us to the node we desire, if that is the only
>> node. My earlier comment might not have been clear.
>>
> 
> I've been testing a stripped-back patch set where I drop all FALLBACK
> entries for private nodes (including for itself) and only keep the
> NOFALLBACK entry for private nodes.
> 
> This effectively isolates the nodes for any allocation without
> __GFP_THISNODE.
> 
> This also precludes these nodes from ever using non-mbind mempolicies,
> which I think is a completely reasonable compromise and something I was
> already expecting we would do.
> 
> Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
> which causes spillage into private nodes because slub allows private
> nodes in its mask.  I think this is fixable.
> 
> I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
> code, etc), but it seems like fully dropping the FALLBACK entries and
> requiring __GFP_THISNODE might be sufficient.

Sorry, I haven't been able to follow up so far, and I'm not sure if that's
what you are discussing here ...

After the LSF/MM session, I was wondering whether, if we focus on allowing
only folio allocations to end up on private memory nodes for now, the
__GFP_THISNODE approach could work there?

Essentially, disallow any allocations on non-folio paths, and allow folio
allocation only with __GFP_THISNODE set.

I have to find time to read the other mails in this thread, on my todo list.

So sorry if that is precisely what is being discussed here.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH RFC 00/15] mm/slab: introduce alloc_flags and slab_alloc_context
From: Harry Yoo @ 2026-06-10 15:04 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <0b9dce0f-f2a8-4b52-9c06-600eb13d4a7c@kernel.org>


[-- Attachment #1.1: Type: text/plain, Size: 5109 bytes --]



On 6/10/26 11:04 PM, Vlastimil Babka (SUSE) wrote:
> On 6/10/26 15:46, Harry Yoo wrote:
>>
>>
>> On 6/9/26 6:17 PM, Vlastimil Babka (SUSE) wrote:
>>> This series is based on slab/for-next. If all goes well, it would
>>> hopefully go to slab/for-next soon after the 7.2 merge window, so any
>>> other work can be based on it to avoid conflicts, as it touches a lot
>>> of parts of slab.
>>>
>>> Git: https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slab_alloc_flags
>>>
>>> The slab implementation currently relies on gfp flags to convey
>>> some context information internally:
>>>
>>> - The absence of both __GFP_RECLAIM flags is interpreted as "cannot spin
>>>   on locks", and intended to be used by kmalloc_nolock(). But false
>>>   positives are possible e.g. during early boot where gfp_allowed_mask
>>>   clears __GFP_RECLAIM from all allocations. This leads to unnecessary
>>>   allocation failures and workarounds such as fd3634312a04 ("debugobject:
>>>   Make it work with deferred page initialization - again").
>>>
>>> - __GFP_NO_OBJ_EXT exists and takes up a valuable bit in the gfp flags
>>>   space, only to prevent recursive kmalloc() allocations for obj_ext
>>>   arrays and sheaves.
>>
>> [ Cc'ing Vishal and Matthew as it's somewhat relevant to memdescs... ]
>>
>> When the page allocator starts allocating slab objects,
>> we still need a way to avoid recursion for obj_ext arrays and sheaves
>> (by passing SLAB_ALLOC_NO_RECURSE).
>>
>> Looking at kmalloc_flags(), probably we'll end up introducing a separate
>> gfp type for slab-specific flags?
> 
> What do you mean by separate gfp type?

I meant the patchset is introducing a new type to specify the context
(specific to slab) other than gfp_t... which is `unsigned int
alloc_flags` now.

>> Hmm but SLAB_ALLOC_* flags are defined in mm/slab.h and kmalloc_flags()
>> is defined in include/linux/slab.h. Do you intend to restrict the slab
>> alloc flags to MM only?
> 
> Yeah I don't expect users outside MM. If a valid one appears, we can move
> it. I should try moving kmalloc_flags() to mm/slab.h as well, unless there's
> some header dependency issue that will prevent it.

Ack.

>>> The page allocator uses its internal alloc_flags to convey various
>>> context information, including ALLOC_TRYLOCK (meaning "cannot spin").
>>> This series copies that concept for the slab allocator, with its own
>>> slab-specific internal flags:
>>>
>>> - SLAB_ALLOC_DEFAULT - no extra flags (the value is 0), but explicit
>>> - SLAB_ALLOC_TRYLOCK - do not spin on locks (used by kmalloc_nolock())
>>> - SLAB_ALLOC_NEW_SLAB - replacing existing 'bool new_slab' parameter
>>> 			for allocating obj_ext arrays
>>> - SLAB_ALLOC_NO_RECURSE - replacing usage of __GFP_NO_OBJ_EXT
>>>
>>> To reduce the amount of parameters in various internal functions, we
>>> additionally introduce slab_alloc_context (also inspired by page
>>> allocator's alloc_context) for passing a number of existing arguments
>>> and the new alloc_flags:
>>>
>>> /* Structure holding extra parameters for slab allocations */
>>> struct slab_alloc_context {
>>> 	unsigned long caller_addr;
>>> 	unsigned long orig_size;
>>> 	unsigned int alloc_flags;
>>> 	struct list_lru *lru;
>>> };
>>
>> Perhaps beyond the scope of the patchset, but I wonder if we could have
>> something like struct slab_alloc_context but for kmalloc callers to
>> simplify {PASS,DECL}_KMALLOC_PARAMS().
>>
>> Something like:
>>
>> struct kmalloc_params {
>> #ifdef CONFIG_SLAB_BUCKETS
>> 	kmem_buckets *b;
>> #endif
>> #ifdef CONFIG_KMALLOC_PARTITION_CACHES
>> 	kmalloc_token_t token;
>> #endif
>> };
>>
>> The idea is to move optional kmalloc parameters (depending on config)
>> into a single struct, instead of using the macros.
>>
>> void *__kmalloc_node(size_t size, gfp_t flags, int node,
>> 		     unsigned long caller,
>> 		     struct kmalloc_params params);
>>
>> void *kmalloc_node() {
>>     /* ... snip ...*/
>>     struct kmalloc_params params = KMALLOC_PARAMS(params.b, params.token);
>>     return __kmalloc_node(size, flags, node, _RET_IP_, params);
>> }
>>
>> The compiler should optimize away unused fields based on the config.
>>
>> Per System V AMD64 ABI, the compiler will use registers to pass the
>> struct, as long as the struct size does not exceed 16 bytes.
>> (Otherwise it will be passed on stack).
> 
> Hm but does this work on all architectures,

Apparently not on s390, unfortunately.
On s390 it works only when the struct size does not exceed 8 bytes.

> and are we doing this somewhere
> (for structures larger than a native word) already?

hmm perhaps struct timespec64?

> Also Marco noted earlier that gcc won't optimize away the struct if it
> becomes zero-sized:
> 
> https://lore.kernel.org/all/CANpmjNO1aNm3mKphDGWasK_NUfVY8q4K9GCjyREZFqrOu9WLcw@mail.gmail.com/

Ouch, right. That means we still need at least one macro to define those
parameters :( Sounds less promising now...

-- 
Cheers,
Harry / Hyeonggon

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v3 3/7] sched/fair: Add cgroup_mode: max
From: Waiman Long @ 2026-06-10 15:09 UTC (permalink / raw)
  To: Peter Zijlstra, mingo
  Cc: chenridong, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef
In-Reply-To: <20260605124051.589618504@infradead.org>

On 6/5/26 8:40 AM, Peter Zijlstra wrote:
> In order to avoid the average CPU fraction avg(F_g_n) becoming tiny '1/N',
> assume each cgroup is maximally concurrent and distribute 'N*weight', such
> that:
>
> 	F_g_n' = N * F_g_n
>
> Giving:
>
> 	avg(F_g_n') = N*avg(F_g_n) ~ N * 1/N = 1
>
> And while this sounds like it solves things, remember what that ~ meant. There
> is the corner case when a cgroup is minimally loaded, eg a single runnable
> task, therefore limit the CPU fraction to that of a nice -20 task to avoid
> getting too much load.
>
> This last bit is what makes it different from a previous proposal to allow
> raising cpu.weight to '100 * N', that would not limit the minimal concurrency
> case and results in a very large F_g_n. And just like F_g_n << 1 is
> problematic, so is F_g_n >> 1 for the exact same reasons (it would drown the
> kthreads, but it also risks overflowing the load values).
>
> So while this might appear to be a better scheme than the current default
> scheme, it doesn't really handle less than maximal concurrency nicely -- it
> clips and introduces artificially large weights. So where the traditional SMP
> mode works well when nr_tasks << nr_cpus, MAX doesn't work well in that regime
> and vice-versa.
>
> The meaning of "cpu.weight" would be: weight per allowed CPU.
>
> Included for completeness (and infrastructure).
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>   include/linux/cpuset.h |    6 +++++
>   kernel/cgroup/cpuset.c |   15 ++++++++++++++
>   kernel/sched/debug.c   |    1
>   kernel/sched/fair.c    |   52 ++++++++++++++++++++++++++++++++++++++++++++-----
>   4 files changed, 69 insertions(+), 5 deletions(-)
>
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -80,6 +80,7 @@ extern void lockdep_assert_cpuset_lock_h
>   extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
>   extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
>   extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
> +extern int cpuset_num_cpus(struct cgroup *cgroup);
>   extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
>   #define cpuset_current_mems_allowed (current->mems_allowed)
>   void cpuset_init_current_mems_allowed(void);
> @@ -216,6 +217,11 @@ static inline bool cpuset_cpus_allowed_f
>   	return false;
>   }
>   
> +static inline int cpuset_num_cpus(struct cgroup *cgroup)
> +{
> +	return num_online_cpus();
> +}
> +
>   static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
>   {
>   	return node_possible_map;
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -4116,6 +4116,21 @@ bool cpuset_cpus_allowed_fallback(struct
>   	return changed;
>   }
>   
> +int cpuset_num_cpus(struct cgroup *cgrp)
> +{
> +	int nr = num_online_cpus();
> +	struct cpuset *cs;
> +
> +	if (is_in_v2_mode()) {
> +		guard(rcu)();
> +		cs = css_cs(cgroup_e_css(cgrp, &cpuset_cgrp_subsys));
> +		if (cs)
> +			nr = cpumask_weight(cs->effective_cpus);
> +	}
> +
> +	return nr;
> +}

I just have a question about cgroup v1 support. I am assuming that 
cgroup v1 without the cpuset_v2_mode mount option is not supported. To 
fully support cgroup v1, you may have to use guarantee_active_cpus() to 
return the actual set of CPUs that the task can run on. There is also a
caveat about the arm64-specific task_cpu_possible_mask() on certain
arm64 CPUs: 32-bit binaries running on a 64-bit system are allowed only
on a selected subset of cores within the CPU.

This is probably not what you want to focus on right now, but it would be
good to have a comment listing the items that are not fully supported here.
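
For readers following along, the scale-and-clamp idea from the quoted
changelog can be sketched in userspace roughly as below. This is my own
simplification, not the code from the patch: it clamps the scaled weight
directly rather than the resulting CPU fraction, and max_mode_weight_sketch()
and the macro names are hypothetical. The values 1024 and 88761 are the load
weights the scheduler uses for nice 0 and nice -20 tasks; nr_cpus stands for
whatever cpuset_num_cpus() would return.

```c
#include <stdint.h>

/* Load weights of a nice 0 and a nice -20 task in the scheduler's table. */
#define NICE_0_WEIGHT_SKETCH	1024ULL
#define NICE_M20_WEIGHT_SKETCH	88761ULL

/*
 * Sketch of "max" mode: assume the group runs on all nr_cpus at once,
 * so scale its weight by nr_cpus, then cap the result at the weight of
 * a nice -20 task so a minimally loaded group does not get too much.
 */
uint64_t max_mode_weight_sketch(uint64_t weight, unsigned int nr_cpus)
{
	uint64_t scaled = weight * nr_cpus;

	return scaled < NICE_M20_WEIGHT_SKETCH ? scaled : NICE_M20_WEIGHT_SKETCH;
}
```

With a default weight of 1024, the cap is reached once nr_cpus is 87 or
more, which is one way to see why the mode "clips" on large machines.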

Cheers,
Longman


^ permalink raw reply

* [PATCH v2 00/16] mm/slab: introduce alloc_flags and slab_alloc_context
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE), stable

This series is based on slab/for-next. As suggested by Alexei I will
try to put it there ASAP (hence the early respin) and see if it looks
stable enough to be sent in the second week of the 7.2 merge window.

Git: https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slab_alloc_flags

The slab implementation currently relies on gfp flags to convey
some context information internally:

- The absence of both __GFP_RECLAIM flags is interpreted as "cannot spin
  on locks", and intended to be used by kmalloc_nolock(). But false
  positives are possible e.g. during early boot where gfp_allowed_mask
  clears __GFP_RECLAIM from all allocations. This leads to unnecessary
  allocation failures and workarounds such as fd3634312a04 ("debugobject:
  Make it work with deferred page initialization - again").

- __GFP_NO_OBJ_EXT exists and takes up a valuable bit in the gfp flags
  space, only to prevent recursive kmalloc() allocations for obj_ext
  arrays and sheaves.
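
To illustrate the first point, here is a small userspace sketch of how the
false positive arises. All SK_* names and bit values are hypothetical and
chosen only for this example; they do not match the kernel's real gfp bit
layout. The check mirrors the "no __GFP_RECLAIM bits means cannot spin"
interpretation described above:

```c
#include <stdbool.h>

typedef unsigned int gfp_sketch_t;

/* Hypothetical bit assignments, for illustration only. */
#define SK_GFP_DIRECT_RECLAIM	0x1u
#define SK_GFP_KSWAPD_RECLAIM	0x2u
#define SK_GFP_RECLAIM		(SK_GFP_DIRECT_RECLAIM | SK_GFP_KSWAPD_RECLAIM)
#define SK_GFP_OTHER		0x4u

/* A normal allocation request: reclaim is allowed. */
#define SK_GFP_KERNEL		(SK_GFP_RECLAIM | SK_GFP_OTHER)

/* Early-boot style mask that clears the reclaim bits from every request. */
#define SK_GFP_BOOT_MASK	(~SK_GFP_RECLAIM)

/* "Cannot spin" is inferred purely from the absence of reclaim bits. */
bool sk_allow_spinning(gfp_sketch_t gfp)
{
	return (gfp & SK_GFP_RECLAIM) != 0;
}

/* Apply an allowed-mask to a request, as the boot path would. */
gfp_sketch_t sk_apply_allowed_mask(gfp_sketch_t gfp, gfp_sketch_t allowed)
{
	return gfp & allowed;
}
```

Once the boot mask is applied, an ordinary SK_GFP_KERNEL request becomes
indistinguishable from one that genuinely cannot spin, which is the
ambiguity a dedicated alloc_flags bit avoids.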

The page allocator uses its internal alloc_flags to convey various
context information, including ALLOC_TRYLOCK (meaning "cannot spin").
This series copies that concept for the slab allocator, with its own
slab-specific internal flags:

- SLAB_ALLOC_DEFAULT - no extra flags (the value is 0), but explicit
- SLAB_ALLOC_TRYLOCK - do not spin on locks (used by kmalloc_nolock())
- SLAB_ALLOC_NEW_SLAB - replacing existing 'bool new_slab' parameter
			for allocating obj_ext arrays
- SLAB_ALLOC_NO_RECURSE - replacing usage of __GFP_NO_OBJ_EXT

To reduce the number of parameters in various internal functions, we
additionally introduce slab_alloc_context (also inspired by page
allocator's alloc_context) for passing a number of existing arguments
and the new alloc_flags:

/* Structure holding extra parameters for slab allocations */
struct slab_alloc_context {
	unsigned long caller_addr;
	unsigned long orig_size;
	unsigned int alloc_flags;
	struct list_lru *lru;
};

This also replaces the existing struct partial_context.

The last necessary piece is kmalloc_flags() which can take the
alloc_flags in addition to gfp flags and is intended for the recursive
allocations of sheaves and obj_ext arrays, so that both
SLAB_ALLOC_TRYLOCK and SLAB_ALLOC_NO_RECURSE can be communicated.
Internally it decides between kmalloc_nolock() and normal kmalloc()
depending on SLAB_ALLOC_TRYLOCK.
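
As a rough userspace illustration of that dispatch (not the code in the
series): the SK_* bit values, enum sk_path and both helper functions below
are hypothetical, and the real flag values and the kmalloc_flags()
signature may differ.

```c
#include <stdbool.h>

/* Hypothetical bit values; only SLAB_ALLOC_DEFAULT == 0 is stated above. */
#define SK_SLAB_ALLOC_DEFAULT		0u
#define SK_SLAB_ALLOC_TRYLOCK		(1u << 0)
#define SK_SLAB_ALLOC_NEW_SLAB		(1u << 1)
#define SK_SLAB_ALLOC_NO_RECURSE	(1u << 2)

/* Which allocation path a kmalloc_flags()-like helper would pick. */
enum sk_path {
	SK_PATH_KMALLOC,	/* normal kmalloc() */
	SK_PATH_KMALLOC_NOLOCK,	/* kmalloc_nolock(), must not spin on locks */
};

/* Choose the path from the internal flags rather than from gfp bits. */
enum sk_path sk_kmalloc_flags_path(unsigned int alloc_flags)
{
	if (alloc_flags & SK_SLAB_ALLOC_TRYLOCK)
		return SK_PATH_KMALLOC_NOLOCK;
	return SK_PATH_KMALLOC;
}

/*
 * Assumed use of NO_RECURSE: a nested allocation made on behalf of
 * obj_ext arrays or sheaves must not trigger another such allocation.
 */
bool sk_may_recurse(unsigned int alloc_flags)
{
	return !(alloc_flags & SK_SLAB_ALLOC_NO_RECURSE);
}
```

The point of the sketch is that both properties travel together in one
alloc_flags word, so neither needs a gfp bit of its own.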

The rest of the series is gradually expanding the usage of both
alloc_flags and slab_alloc_context as necessary, with bits of
refactoring. Then, __GFP_NO_OBJ_EXT is removed completely.

Note that some usage of gfpflags_allow_spinning() relying on absence of
__GFP_RECLAIM remains outside of slab (and page allocator) in memcg,
page_owner and stackdepot code. These can thus yield false-positive
decisions that spinning is not allowed, but should not result in
important allocations failing anymore.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
Changes in v2:
- Due to Sashiko review, drop the idea of zeroing orig_size
  unconditionally, as it can break krealloc(). This also uncovered a
  pre-existing bug, now fixed by the new Patch 1. The kfence zeroing
  related cleanup is implemented differently in Patch 2.
- Prevent nested kmalloc_nolock warnings due to added gfp flags
  (Sashiko)
- Fix a pre-existing issue with opportunistic slab allocation from the
  target node only effectively dropping __GFP_NOMEMALLOC and __GFP_RECLAIM.
  (Sashiko)
- Move kmalloc_flags() definitions to mm/slab.h (per Harry).
- Link to v1: https://patch.msgid.link/20260609-slab_alloc_flags-v1-0-2bf4a4b9b526@kernel.org

---
Vlastimil Babka (SUSE) (16):
      mm/slab: do not limit zeroing to orig_size when only red zoning is enabled
      mm/slab: do not init any kfence objects on allocation
      mm/slab: stop inlining __slab_alloc_node()
      mm/slab: introduce slab_alloc_context
      mm/slab: introduce alloc_flags and SLAB_ALLOC_TRYLOCK
      mm/slab: add alloc_flags to slab_alloc_context
      mm/slab: replace struct partial_context with slab_alloc_context
      mm/slab: pass alloc_flags to new slab allocation
      mm/slab: pass alloc_flags through slab_post_alloc_hook() chain
      mm/slab: replace slab_alloc_node() parameters with slab_alloc_context
      mm/slab: allow kmem_cache_alloc_bulk() with any gfp flags
      mm/slab: pass slab_alloc_context to __do_kmalloc_node()
      mm/slab: allow __GFP_NOMEMALLOC and __GFP_NOWARN for kmalloc_nolock()
      mm/slab: introduce kmalloc_flags()
      mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
      mm/slab: replace __GFP_NO_OBJ_EXT with SLAB_ALLOC_NO_RECURSE for sheaves

 include/linux/slab.h |   5 +-
 mm/kfence/core.c     |   2 +-
 mm/memcontrol.c      |   5 +-
 mm/slab.h            |  29 +++-
 mm/slub.c            | 439 +++++++++++++++++++++++++++++++--------------------
 5 files changed, 304 insertions(+), 176 deletions(-)
---
base-commit: 500b2c9755301742bdbb61249511ac11a4665dae
change-id: 20260601-slab_alloc_flags-25c782b0c57c

Best regards,
--  
Vlastimil Babka (SUSE) <vbabka@kernel.org>


^ permalink raw reply

* [PATCH v2 01/16] mm/slab: do not limit zeroing to orig_size when only red zoning is enabled
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE), stable
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

When init (zeroing) on allocation is requested, for kmalloc() we
generally have to zero the full object size even if a smaller size is
requested, in order to provide krealloc()'s __GFP_ZERO guarantees.

But if we track the requested size, krealloc() uses that information to
do the right thing. With red zoning also enabled, any unused size
becomes part of the red zone, so it must not be zeroed.

However, the current check is imprecise and also triggers when only
SLAB_RED_ZONE is enabled without SLAB_STORE_USER. This means enabling
red zoning alone can compromise krealloc()'s __GFP_ZERO contract.

Fix this by using slub_debug_orig_size() instead, which is the exact
check for whether the requested size is tracked. We don't need to care
if red zoning is also enabled or not. Also update and expand the
comment accordingly.
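
To illustrate the difference between the two checks, here is a small
userspace model (not kernel code). The MODEL_* flag values and function
names are hypothetical, and the model assumes the requested size is
tracked only for kmalloc caches with SLAB_STORE_USER enabled, which is
what the description above implies.

```c
#include <stdbool.h>

/* Hypothetical bit values standing in for the real cache flags. */
#define MODEL_SLAB_RED_ZONE	0x1u
#define MODEL_SLAB_STORE_USER	0x2u
#define MODEL_SLAB_KMALLOC	0x4u

/* Assumption: orig_size is tracked only with STORE_USER on a kmalloc cache. */
static bool model_tracks_orig_size(unsigned int cache_flags)
{
	return (cache_flags & MODEL_SLAB_STORE_USER) &&
	       (cache_flags & MODEL_SLAB_KMALLOC);
}

/* Old check: either debug flag on a kmalloc cache limits the zeroing. */
static unsigned int model_zero_size_old(unsigned int cache_flags,
					unsigned int object_size,
					unsigned int orig_size)
{
	if ((cache_flags & (MODEL_SLAB_STORE_USER | MODEL_SLAB_RED_ZONE)) &&
	    (cache_flags & MODEL_SLAB_KMALLOC))
		return orig_size;
	return object_size;
}

/* New check: limit the zeroing only if the requested size is tracked. */
static unsigned int model_zero_size_new(unsigned int cache_flags,
					unsigned int object_size,
					unsigned int orig_size)
{
	return model_tracks_orig_size(cache_flags) ? orig_size : object_size;
}
```

With only red zoning enabled on a kmalloc cache, the old model zeroes
just orig_size bytes although nothing records that size for a later
krealloc(), while the new model zeroes the whole object.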

Fixes: 9ce67395f5a0 ("mm/slub: only zero requested size of buffer for kzalloc when debug enabled")
Cc: <stable@vger.kernel.org>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 63c1ef998dd3..e2ee8f1aaccf 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4574,15 +4574,17 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 	gfp_t init_flags = flags & gfp_allowed_mask;
 
 	/*
-	 * For kmalloc object, the allocated memory size(object_size) is likely
-	 * larger than the requested size(orig_size). If redzone check is
-	 * enabled for the extra space, don't zero it, as it will be redzoned
-	 * soon. The redzone operation for this extra space could be seen as a
-	 * replacement of current poisoning under certain debug option, and
-	 * won't break other sanity checks.
+	 * For kmalloc object, the allocated size (object_size) can be larger
+	 * than the requested size (orig_size). We however need to zero the
+	 * whole object_size to handle possible later krealloc() with
+	 *__GFP_ZERO properly.
+	 *
+	 * But if we keep track of the requested size, krealloc() uses that
+	 * information. Additionally if red zoning is enabled, the extra space
+	 * is also red zone, so we should not overwrite it. So limit zeroing to
+	 * orig_size if we track it.
 	 */
-	if (kmem_cache_debug_flags(s, SLAB_STORE_USER | SLAB_RED_ZONE) &&
-	    (s->flags & SLAB_KMALLOC))
+	if (slub_debug_orig_size(s))
 		zero_size = orig_size;
 
 	/*

-- 
2.54.0


^ permalink raw reply related

* [PATCH v2 02/16] mm/slab: do not init any kfence objects on allocation
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

When init (zeroing) on allocation is requested, for kmalloc() we
generally have to zero the full object size even if a smaller size is
requested, in order to provide krealloc()'s __GFP_ZERO guarantees.

When we end up allocating a kfence object, kfence performs the zeroing
on its own because it has its own redzone beyond the requested size. Thus
slab_post_alloc_hook() has an 'init' parameter which has to be evaluated
in all callers (via slab_want_init_on_alloc()) and should be false for
kfence allocations.

For kfence allocations in slab_alloc_node() this is achieved by subtly
skipping over the slab_want_init_on_alloc() call. Other callers (i.e.
kmem_cache_alloc_bulk_noprof()) however evaluate it unconditionally even
if they do end up with a kfence allocation. This is only subtly not a
problem, as those are not kmalloc allocations and thus the "requested
size" equals s->object_size and thus it cannot interfere with kfence's
redzone. There's just an unnecessary double zeroing (in both kfence and
slab_post_alloc_hook()), but it's all very fragile and contradicts the
comment in kfence_guarded_alloc().

Remove this subtlety and simplify the code by eliminating the init
parameter from slab_post_alloc_hook() and making it call
slab_want_init_on_alloc() itself. Instead, add an is_kfence_address()
check before performing the memset, which will start doing the right
thing for all callers of slab_post_alloc_hook().

This potentially adds overhead of the is_kfence_address() check to
allocation hotpath, but that one is designed to be as small as possible,
and it's only evaluated if zeroing is about to happen. This means (aside
from init_on_alloc hardening) only for __GFP_ZERO allocations, and the
zeroing itself comes with an overhead likely larger than the added
check.
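
The resulting condition for the memset can be written down as a pure
predicate. This is a userspace sketch for illustration only:
model_should_zero() and its boolean parameters are hypothetical
stand-ins for the values the hook computes (slab_want_init_on_alloc(),
kasan_init, kasan_has_integrated_init() and is_kfence_address()).

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Model of the decision made per object in slab_post_alloc_hook() after
 * this patch: zero only a valid object, only when init is wanted, only
 * when KASAN has not already done the init, and never a kfence object.
 */
static bool model_should_zero(const void *obj, bool init, bool kasan_init,
			      bool kasan_integrated_init, bool is_kfence)
{
	return obj != NULL && init &&
	       (!kasan_init || !kasan_integrated_init) &&
	       !is_kfence;
}
```

The kfence test comes last, so in this model it is reached only when
zeroing would otherwise happen, matching the overhead argument above.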

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/kfence/core.c |  2 +-
 mm/slub.c        | 23 ++++++++---------------
 2 files changed, 9 insertions(+), 16 deletions(-)

diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 655dc5ce3240..5e0b406924e9 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -500,7 +500,7 @@ static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t g
 
 	/*
 	 * We check slab_want_init_on_alloc() ourselves, rather than letting
-	 * SL*B do the initialization, as otherwise we might overwrite KFENCE's
+	 * slab do the initialization, as otherwise it might overwrite KFENCE's
 	 * redzone.
 	 */
 	if (unlikely(slab_want_init_on_alloc(gfp, cache)))
diff --git a/mm/slub.c b/mm/slub.c
index e2ee8f1aaccf..8e5264d3ddbf 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4565,9 +4565,10 @@ struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
 
 static __fastpath_inline
 bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
-			  gfp_t flags, size_t size, void **p, bool init,
+			  gfp_t flags, size_t size, void **p,
 			  unsigned int orig_size)
 {
+	bool init = slab_want_init_on_alloc(flags, s);
 	unsigned int zero_size = s->object_size;
 	bool kasan_init = init;
 	size_t i;
@@ -4608,7 +4609,8 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 	for (i = 0; i < size; i++) {
 		p[i] = kasan_slab_alloc(s, p[i], init_flags, kasan_init);
 		if (p[i] && init && (!kasan_init ||
-				     !kasan_has_integrated_init()))
+				     !kasan_has_integrated_init())
+				 && !is_kfence_address(p[i]))
 			memset(p[i], 0, zero_size);
 		if (gfpflags_allow_spinning(flags))
 			kmemleak_alloc_recursive(p[i], s->object_size, 1,
@@ -4910,7 +4912,6 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 		gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
 {
 	void *object;
-	bool init = false;
 
 	s = slab_pre_alloc_hook(s, gfpflags);
 	if (unlikely(!s))
@@ -4926,16 +4927,13 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
 
 	maybe_wipe_obj_freeptr(s, object);
-	init = slab_want_init_on_alloc(gfpflags, s);
 
 out:
 	/*
-	 * When init equals 'true', like for kzalloc() family, only
-	 * @orig_size bytes might be zeroed instead of s->object_size
 	 * In case this fails due to memcg_slab_post_alloc_hook(),
 	 * object is set to NULL
 	 */
-	slab_post_alloc_hook(s, lru, gfpflags, 1, &object, init, orig_size);
+	slab_post_alloc_hook(s, lru, gfpflags, 1, &object, orig_size);
 
 	return object;
 }
@@ -5230,7 +5228,6 @@ kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
 				   struct slab_sheaf *sheaf)
 {
 	void *ret = NULL;
-	bool init;
 
 	if (sheaf->size == 0)
 		goto out;
@@ -5240,10 +5237,8 @@ kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
 	if (likely(!ret))
 		ret = sheaf->objects[--sheaf->size];
 
-	init = slab_want_init_on_alloc(gfp, s);
-
 	/* add __GFP_NOFAIL to force successful memcg charging */
-	slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);
+	slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, s->object_size);
 out:
 	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
 
@@ -5423,8 +5418,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 
 success:
 	maybe_wipe_obj_freeptr(s, ret);
-	slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret,
-			     slab_want_init_on_alloc(alloc_gfp, s), orig_size);
+	slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret, orig_size);
 
 	ret = kasan_kmalloc(s, ret, orig_size, alloc_gfp);
 	return ret;
@@ -7339,8 +7333,7 @@ bool kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags,
 
 out:
 	/* memcg and kmem_cache debug support and memory initialization */
-	return likely(slab_post_alloc_hook(s, NULL, flags, size, p,
-			slab_want_init_on_alloc(flags, s), s->object_size));
+	return likely(slab_post_alloc_hook(s, NULL, flags, size, p, s->object_size));
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
 

-- 
2.54.0


^ permalink raw reply related

* [PATCH v2 03/16] mm/slab: stop inlining __slab_alloc_node()
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

With sheaves, __slab_alloc_node() is no longer part of the allocation
fastpath. For the same reason, also mark the call to it from
slab_alloc_node() as unlikely().

Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8e5264d3ddbf..7b48c0d38404 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4519,8 +4519,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	return object;
 }
 
-static __always_inline void *__slab_alloc_node(struct kmem_cache *s,
-		gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
+static void *__slab_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node,
+			       unsigned long addr, size_t orig_size)
 {
 	void *object;
 
@@ -4923,7 +4923,7 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 
 	object = alloc_from_pcs(s, gfpflags, node);
 
-	if (!object)
+	if (unlikely(!object))
 		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
 
 	maybe_wipe_obj_freeptr(s, object);

-- 
2.54.0


^ permalink raw reply related

* [PATCH v2 04/16] mm/slab: introduce slab_alloc_context
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

Similarly to page allocator's struct alloc_context, introduce a helper
struct to hold a part of the allocation arguments. This will allow
reducing the number of parameters in many functions of the
implementation, and extend them easily if needed.

For now, make it hold the caller address and the originally requested
allocation size.

Convert alloc_single_from_new_slab(), __slab_alloc_node() and
___slab_alloc(). No functional change intended.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 46 +++++++++++++++++++++++++++++++++-------------
 1 file changed, 33 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 7b48c0d38404..a3cac7281cc6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -213,6 +213,12 @@ DEFINE_STATIC_KEY_FALSE(slub_debug_enabled);
 static DEFINE_STATIC_KEY_FALSE(strict_numa);
 #endif
 
+/* Structure holding extra parameters for slab allocations */
+struct slab_alloc_context {
+	unsigned long caller_addr;
+	unsigned long orig_size;
+};
+
 /* Structure holding parameters for get_from_partial() call chain */
 struct partial_context {
 	gfp_t flags;
@@ -3687,7 +3693,8 @@ static inline void init_slab_obj_iter(struct kmem_cache *s, struct slab *slab,
  * and put the slab to the partial (or full) list.
  */
 static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
-					int orig_size, bool allow_spin)
+					struct slab_alloc_context *ac,
+					bool allow_spin)
 {
 	struct kmem_cache_node *n;
 	struct slab_obj_iter iter;
@@ -3705,7 +3712,7 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
 	/* alloc_debug_processing() always expects a valid freepointer */
 	set_freepointer(s, object, slab->freelist);
 
-	if (!alloc_debug_processing(s, slab, object, orig_size)) {
+	if (!alloc_debug_processing(s, slab, object, ac->orig_size)) {
 		/*
 		 * It's not really expected that this would fail on a
 		 * freshly allocated slab, but a concurrent memory
@@ -4443,7 +4450,7 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
  * slab.
  */
 static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			   unsigned long addr, unsigned int orig_size)
+			   struct slab_alloc_context *ac)
 {
 	bool allow_spin = gfpflags_allow_spinning(gfpflags);
 	void *object;
@@ -4476,7 +4483,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 			pc.flags = GFP_NOWAIT | __GFP_THISNODE;
 	}
 
-	pc.orig_size = orig_size;
+	pc.orig_size = ac->orig_size;
 	object = get_from_partial(s, node, &pc);
 	if (object)
 		goto success;
@@ -4496,7 +4503,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	stat(s, ALLOC_SLAB);
 
 	if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
-		object = alloc_single_from_new_slab(s, slab, orig_size, allow_spin);
+		object = alloc_single_from_new_slab(s, slab, ac, allow_spin);
 
 		if (likely(object))
 			goto success;
@@ -4514,13 +4521,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 
 success:
 	if (kmem_cache_debug_flags(s, SLAB_STORE_USER))
-		set_track(s, object, TRACK_ALLOC, addr, gfpflags);
+		set_track(s, object, TRACK_ALLOC, ac->caller_addr, gfpflags);
 
 	return object;
 }
 
 static void *__slab_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node,
-			       unsigned long addr, size_t orig_size)
+			       struct slab_alloc_context *ac)
 {
 	void *object;
 
@@ -4545,7 +4552,7 @@ static void *__slab_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node,
 	}
 #endif
 
-	object = ___slab_alloc(s, gfpflags, node, addr, orig_size);
+	object = ___slab_alloc(s, gfpflags, node, ac);
 
 	return object;
 }
@@ -4923,8 +4930,13 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 
 	object = alloc_from_pcs(s, gfpflags, node);
 
-	if (unlikely(!object))
-		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
+	if (unlikely(!object)) {
+		struct slab_alloc_context ac = {
+			.caller_addr = addr,
+			.orig_size = orig_size,
+		};
+		object = __slab_alloc_node(s, gfpflags, node, &ac);
+	}
 
 	maybe_wipe_obj_freeptr(s, object);
 
@@ -5389,13 +5401,18 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 	if (ret)
 		goto success;
 
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = orig_size,
+	};
+
 	/*
 	 * Do not call slab_alloc_node(), since trylock mode isn't
 	 * compatible with slab_pre_alloc_hook/should_failslab and
 	 * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
 	 * and slab_post_alloc_hook() directly.
 	 */
-	ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, orig_size);
+	ret = __slab_alloc_node(s, alloc_gfp, node, &ac);
 
 	/*
 	 * It's possible we failed due to trylock as we preempted someone with
@@ -7237,10 +7254,13 @@ static bool __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
 	int i;
 
 	if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
+		struct slab_alloc_context ac = {
+			.caller_addr = _RET_IP_,
+			.orig_size = s->object_size,
+		};
 		for (i = 0; i < size; i++) {
 
-			p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_,
-					     s->object_size);
+			p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, &ac);
 			if (unlikely(!p[i]))
 				goto error;
 

-- 
2.54.0


^ permalink raw reply related

* [PATCH v2 05/16] mm/slab: introduce alloc_flags and SLAB_ALLOC_TRYLOCK
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

Similarly to the page allocator, introduce slab-allocator specific
alloc flags that internally control allocation behavior in addition to
gfp_flags, without occupying the limited gfp flags space.

Introduce the first flag SLAB_ALLOC_TRYLOCK that behaves similarly to
page allocator's ALLOC_TRYLOCK and will be used to reimplement
kmalloc_nolock()'s "!allow_spin" behavior. That currently relies on
gfpflags_allow_spinning() and thus the lack of both __GFP_RECLAIM flags,
importantly __GFP_KSWAPD_RECLAIM. This can give false-positive results
e.g. in early boot with a restricted gfp_allowed_mask.
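
The false positive can be reproduced with a small userspace model (not
kernel code). All MODEL_* values and function names here are
hypothetical and do not match the kernel's real gfp bit layout; early
boot is modelled simply as a mask that strips both reclaim bits.

```c
#include <stdbool.h>

/* Model bit values, not the kernel's real gfp bit layout. */
#define MODEL_GFP_DIRECT_RECLAIM	0x1u
#define MODEL_GFP_KSWAPD_RECLAIM	0x2u
#define MODEL_GFP_RECLAIM \
	(MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_KSWAPD_RECLAIM)
#define MODEL_GFP_IO			0x4u
#define MODEL_GFP_KERNEL		(MODEL_GFP_RECLAIM | MODEL_GFP_IO)

#define MODEL_SLAB_ALLOC_DEFAULT	0x00u
#define MODEL_SLAB_ALLOC_TRYLOCK	0x01u

/* Old-style check: no reclaim bits at all is read as "cannot spin". */
static bool model_gfpflags_allow_spinning(unsigned int gfp)
{
	return (gfp & MODEL_GFP_RECLAIM) != 0;
}

/* New-style check: only an explicit TRYLOCK request forbids spinning. */
static bool model_alloc_flags_allow_spinning(unsigned int alloc_flags)
{
	return !(alloc_flags & MODEL_SLAB_ALLOC_TRYLOCK);
}

/* Early boot modelled as a gfp_allowed_mask without the reclaim bits. */
static unsigned int model_early_boot_mask(void)
{
	return ~MODEL_GFP_RECLAIM;
}
```

In this model an ordinary MODEL_GFP_KERNEL allocation masked by
model_early_boot_mask() fails the gfp-based check even though nothing
prevents it from spinning, whereas the alloc_flags-based check is
unaffected by the mask.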

Also introduce alloc_flags_allow_spinning() to replace the usage of
gfpflags_allow_spinning().

Start using alloc_flags and the new check first in alloc_from_pcs() and
__pcs_replace_empty_main(). This means some slab allocations that were
falsely treated as kmalloc_nolock() due to their gfp flags will now have
a higher chance of succeeding, and this will further improve with followup
changes.

Remove a WARN_ON_ONCE() from refill_objects() as it's now legitimate to
reach it from a slab allocation that's not _nolock() and yet lacks
__GFP_KSWAPD_RECLAIM for other reasons.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slab.h |  9 +++++++++
 mm/slub.c | 17 ++++++++---------
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 1bf9c3021ae3..96f65b625600 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -16,6 +16,15 @@
  * Internal slab definitions
  */
 
+/* slab's alloc_flags definitions */
+#define SLAB_ALLOC_DEFAULT	0x00 /* no flags */
+#define SLAB_ALLOC_TRYLOCK	0x01 /* a kmalloc_nolock() allocation */
+
+static inline bool alloc_flags_allow_spinning(const unsigned int alloc_flags)
+{
+	return !(alloc_flags & SLAB_ALLOC_TRYLOCK);
+}
+
 #ifdef CONFIG_64BIT
 # ifdef system_has_cmpxchg128
 # define system_has_freelist_aba()	system_has_cmpxchg128()
diff --git a/mm/slub.c b/mm/slub.c
index a3cac7281cc6..e79fbca11bc0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4638,7 +4638,8 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
  * unlocked.
  */
 static struct slub_percpu_sheaves *
-__pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t gfp)
+__pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
+			 gfp_t gfp, unsigned int alloc_flags)
 {
 	struct slab_sheaf *empty = NULL;
 	struct slab_sheaf *full;
@@ -4664,7 +4665,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 		return NULL;
 	}
 
-	allow_spin = gfpflags_allow_spinning(gfp);
+	allow_spin = alloc_flags_allow_spinning(alloc_flags);
 
 	full = barn_replace_empty_sheaf(barn, pcs->main, allow_spin);
 
@@ -4750,7 +4751,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 }
 
 static __fastpath_inline
-void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
+void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, unsigned int alloc_flags, int node)
 {
 	struct slub_percpu_sheaves *pcs;
 	bool node_requested;
@@ -4795,7 +4796,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (unlikely(pcs->main->size == 0)) {
-		pcs = __pcs_replace_empty_main(s, pcs, gfp);
+		pcs = __pcs_replace_empty_main(s, pcs, gfp, alloc_flags);
 		if (unlikely(!pcs))
 			return NULL;
 	}
@@ -4928,7 +4929,7 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 	if (unlikely(object))
 		goto out;
 
-	object = alloc_from_pcs(s, gfpflags, node);
+	object = alloc_from_pcs(s, gfpflags, SLAB_ALLOC_DEFAULT, node);
 
 	if (unlikely(!object)) {
 		struct slab_alloc_context ac = {
@@ -5359,6 +5360,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 {
 	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
 	size_t orig_size = size;
+	unsigned int alloc_flags = SLAB_ALLOC_TRYLOCK;
 	struct kmem_cache *s;
 	bool can_retry = true;
 	void *ret;
@@ -5397,7 +5399,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 		 */
 		return NULL;
 
-	ret = alloc_from_pcs(s, alloc_gfp, node);
+	ret = alloc_from_pcs(s, alloc_gfp, alloc_flags, node);
 	if (ret)
 		goto success;
 
@@ -7216,9 +7218,6 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
 	unsigned int refilled;
 	struct slab *slab;
 
-	if (WARN_ON_ONCE(!gfpflags_allow_spinning(gfp)))
-		return 0;
-
 	refilled = __refill_objects_node(s, p, gfp, min, max,
 					 get_node(s, local_node),
 					 /* allow_spin = */ true);

-- 
2.54.0


^ permalink raw reply related

* [PATCH v2 06/16] mm/slab: add alloc_flags to slab_alloc_context
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

Add alloc_flags as a new field to the slab_alloc_context helper struct,
so we can pass it to more functions in the slab implementation without
adding another function parameter.

Start checking them via alloc_flags_allow_spinning() in
alloc_single_from_new_slab() (where we can drop the allow_spin
parameter) and ___slab_alloc(). This further reduces false-positive
spinning-not-allowed from allocations that are not kmalloc_nolock() but
lack __GFP_RECLAIM flags.

_kmalloc_nolock_noprof() initializes ac.alloc_flags from its local
alloc_flags, which is SLAB_ALLOC_TRYLOCK. slab_alloc_node() and
__kmem_cache_alloc_bulk() are not reachable from kmalloc_nolock() and all
their callers expect spinning to be allowed, so they can use
SLAB_ALLOC_DEFAULT. This is temporary as the scope of slab_alloc_context
will further move to the callers, making the alloc_flags usage more
obvious.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index e79fbca11bc0..ef745b37d063 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -217,6 +217,7 @@ static DEFINE_STATIC_KEY_FALSE(strict_numa);
 struct slab_alloc_context {
 	unsigned long caller_addr;
 	unsigned long orig_size;
+	unsigned int alloc_flags;
 };
 
 /* Structure holding parameters for get_from_partial() call chain */
@@ -3693,9 +3694,9 @@ static inline void init_slab_obj_iter(struct kmem_cache *s, struct slab *slab,
  * and put the slab to the partial (or full) list.
  */
 static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
-					struct slab_alloc_context *ac,
-					bool allow_spin)
+					struct slab_alloc_context *ac)
 {
+	bool allow_spin = alloc_flags_allow_spinning(ac->alloc_flags);
 	struct kmem_cache_node *n;
 	struct slab_obj_iter iter;
 	bool needs_add_partial;
@@ -4452,7 +4453,7 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
 static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 			   struct slab_alloc_context *ac)
 {
-	bool allow_spin = gfpflags_allow_spinning(gfpflags);
+	bool allow_spin = alloc_flags_allow_spinning(ac->alloc_flags);
 	void *object;
 	struct slab *slab;
 	struct partial_context pc;
@@ -4503,7 +4504,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	stat(s, ALLOC_SLAB);
 
 	if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
-		object = alloc_single_from_new_slab(s, slab, ac, allow_spin);
+		object = alloc_single_from_new_slab(s, slab, ac);
 
 		if (likely(object))
 			goto success;
@@ -4919,6 +4920,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
 static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru,
 		gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
 {
+	const unsigned int alloc_flags = SLAB_ALLOC_DEFAULT;
 	void *object;
 
 	s = slab_pre_alloc_hook(s, gfpflags);
@@ -4929,12 +4931,13 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 	if (unlikely(object))
 		goto out;
 
-	object = alloc_from_pcs(s, gfpflags, SLAB_ALLOC_DEFAULT, node);
+	object = alloc_from_pcs(s, gfpflags, alloc_flags, node);
 
 	if (unlikely(!object)) {
 		struct slab_alloc_context ac = {
 			.caller_addr = addr,
 			.orig_size = orig_size,
+			.alloc_flags = alloc_flags,
 		};
 		object = __slab_alloc_node(s, gfpflags, node, &ac);
 	}
@@ -5406,6 +5409,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 	struct slab_alloc_context ac = {
 		.caller_addr = _RET_IP_,
 		.orig_size = orig_size,
+		.alloc_flags = alloc_flags,
 	};
 
 	/*
@@ -7256,6 +7260,7 @@ static bool __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
 		struct slab_alloc_context ac = {
 			.caller_addr = _RET_IP_,
 			.orig_size = s->object_size,
+			.alloc_flags = SLAB_ALLOC_DEFAULT,
 		};
 		for (i = 0; i < size; i++) {
 

-- 
2.54.0


^ permalink raw reply related

* [PATCH v2 07/16] mm/slab: replace struct partial_context with slab_alloc_context
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

Refactor get_from_partial_node(), get_from_any_partial(),
get_from_partial() and ___slab_alloc().

Remove struct partial_context, which used to be more substantial but
shrank as part of the sheaves conversion. Instead pass gfp_flags and a
pointer to the new slab_alloc_context, which together are a superset of
partial_context.

This means alloc_flags are now available and we can use them to
determine if spinning is allowed, further reducing false positive "not
allowed" in the slow path due to gfp flags lacking __GFP_RECLAIM.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 52 ++++++++++++++++++++++++----------------------------
 1 file changed, 24 insertions(+), 28 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index ef745b37d063..98b79e5e7679 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -220,12 +220,6 @@ struct slab_alloc_context {
 	unsigned int alloc_flags;
 };
 
-/* Structure holding parameters for get_from_partial() call chain */
-struct partial_context {
-	gfp_t flags;
-	unsigned int orig_size;
-};
-
 /* Structure holding parameters for get_partial_node_bulk() */
 struct partial_bulk_context {
 	gfp_t flags;
@@ -3826,7 +3820,8 @@ static bool get_partial_node_bulk(struct kmem_cache *s,
  */
 static void *get_from_partial_node(struct kmem_cache *s,
 				   struct kmem_cache_node *n,
-				   struct partial_context *pc)
+				   gfp_t gfp_flags,
+				   struct slab_alloc_context *ac)
 {
 	struct slab *slab, *slab2;
 	unsigned long flags;
@@ -3841,7 +3836,7 @@ static void *get_from_partial_node(struct kmem_cache *s,
 	if (!n || !n->nr_partial)
 		return NULL;
 
-	if (gfpflags_allow_spinning(pc->flags))
+	if (alloc_flags_allow_spinning(ac->alloc_flags))
 		spin_lock_irqsave(&n->list_lock, flags);
 	else if (!spin_trylock_irqsave(&n->list_lock, flags))
 		return NULL;
@@ -3849,12 +3844,12 @@ static void *get_from_partial_node(struct kmem_cache *s,
 
 		struct freelist_counters old, new;
 
-		if (!pfmemalloc_match(slab, pc->flags))
+		if (!pfmemalloc_match(slab, gfp_flags))
 			continue;
 
 		if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
 			object = alloc_single_from_partial(s, n, slab,
-							pc->orig_size);
+							ac->orig_size);
 			if (object)
 				break;
 			continue;
@@ -3888,15 +3883,16 @@ static void *get_from_partial_node(struct kmem_cache *s,
 /*
  * Get an object from somewhere. Search in increasing NUMA distances.
  */
-static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *pc)
+static void *get_from_any_partial(struct kmem_cache *s, gfp_t gfp_flags,
+				  struct slab_alloc_context *ac)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
 	struct zoneref *z;
 	struct zone *zone;
-	enum zone_type highest_zoneidx = gfp_zone(pc->flags);
+	enum zone_type highest_zoneidx = gfp_zone(gfp_flags);
 	unsigned int cpuset_mems_cookie;
-	bool allow_spin = gfpflags_allow_spinning(pc->flags);
+	bool allow_spin = alloc_flags_allow_spinning(ac->alloc_flags);
 
 	/*
 	 * The defrag ratio allows a configuration of the tradeoffs between
@@ -3930,16 +3926,17 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
 		if (allow_spin)
 			cpuset_mems_cookie = read_mems_allowed_begin();
 
-		zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
+		zonelist = node_zonelist(mempolicy_slab_node(), gfp_flags);
 		for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
 			struct kmem_cache_node *n;
 
 			n = get_node(s, zone_to_nid(zone));
 
-			if (n && cpuset_zone_allowed(zone, pc->flags) &&
+			if (n && cpuset_zone_allowed(zone, gfp_flags) &&
 					n->nr_partial > s->min_partial) {
 
-				void *object = get_from_partial_node(s, n, pc);
+				void *object = get_from_partial_node(s, n,
+								gfp_flags, ac);
 
 				if (object) {
 					/*
@@ -3961,8 +3958,8 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
 /*
  * Get an object from a partial slab
  */
-static void *get_from_partial(struct kmem_cache *s, int node,
-			      struct partial_context *pc)
+static void *get_from_partial(struct kmem_cache *s, int node, gfp_t flags,
+			      struct slab_alloc_context *ac)
 {
 	int searchnode = node;
 	void *object;
@@ -3970,11 +3967,11 @@ static void *get_from_partial(struct kmem_cache *s, int node,
 	if (node == NUMA_NO_NODE)
 		searchnode = numa_mem_id();
 
-	object = get_from_partial_node(s, get_node(s, searchnode), pc);
-	if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
+	object = get_from_partial_node(s, get_node(s, searchnode), flags, ac);
+	if (object || (node != NUMA_NO_NODE && (flags & __GFP_THISNODE)))
 		return object;
 
-	return get_from_any_partial(s, pc);
+	return get_from_any_partial(s, flags, ac);
 }
 
 static bool has_pcs_used(int cpu, struct kmem_cache *s)
@@ -4454,16 +4451,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 			   struct slab_alloc_context *ac)
 {
 	bool allow_spin = alloc_flags_allow_spinning(ac->alloc_flags);
+	gfp_t trynode_flags;
 	void *object;
 	struct slab *slab;
-	struct partial_context pc;
 	bool try_thisnode = true;
 
 	stat(s, ALLOC_SLOWPATH);
 
 new_objects:
 
-	pc.flags = gfpflags;
+	trynode_flags = gfpflags;
 	/*
 	 * When a preferred node is indicated but no __GFP_THISNODE
 	 *
@@ -4479,17 +4476,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		     && try_thisnode)) {
 		if (unlikely(!allow_spin))
 			/* Do not upgrade gfp to NOWAIT from more restrictive mode */
-			pc.flags = gfpflags | __GFP_THISNODE;
+			trynode_flags = gfpflags | __GFP_THISNODE;
 		else
-			pc.flags = GFP_NOWAIT | __GFP_THISNODE;
+			trynode_flags = GFP_NOWAIT | __GFP_THISNODE;
 	}
 
-	pc.orig_size = ac->orig_size;
-	object = get_from_partial(s, node, &pc);
+	object = get_from_partial(s, node, trynode_flags, ac);
 	if (object)
 		goto success;
 
-	slab = new_slab(s, pc.flags, node);
+	slab = new_slab(s, trynode_flags, node);
 
 	if (unlikely(!slab)) {
 		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)

-- 
2.54.0


* [PATCH v2 08/16] mm/slab: pass alloc_flags to new slab allocation
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

Add the alloc_flags parameter to allocate_slab() and new_slab()
so it can be used to determine if spinning is allowed, independently
from gfp flags.

refill_objects() passes SLAB_ALLOC_DEFAULT because it can only be
reached from contexts that allow spinning.

Also change how trynode_flags are constructed in ___slab_alloc() to
achieve the same "do not upgrade to GFP_NOWAIT" behavior by masking
instead of branching. It will now also not upgrade in cases where gfp is
weaker than GFP_NOWAIT (i.e. lacks __GFP_KSWAPD_RECLAIM) but doesn't come
from kmalloc_nolock(), which is more correct anyway.

While masking, also preserve any existing __GFP_NOMEMALLOC (pointed out
by Sashiko) and __GFP_ACCOUNT. Previously the hardcoded GFP_NOWAIT would
drop them, but that was not a big enough problem to need a separate fix.
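
The difference between the old branch and the new masking can be sketched
in userspace as follows. This is an illustrative model, not kernel code:
the DEMO_* bit values are stand-ins, DEMO_GFP_KERNEL omits the IO/FS
bits, and DEMO_GFP_NOWAIT is assumed to be kswapd reclaim plus nowarn.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in bit values for illustration; NOT the kernel's definitions. */
typedef unsigned int demo_gfp_t;
#define DEMO_DIRECT_RECLAIM	0x01u
#define DEMO_KSWAPD_RECLAIM	0x02u
#define DEMO_NOWARN		0x04u
#define DEMO_NOMEMALLOC		0x08u
#define DEMO_ACCOUNT		0x10u
#define DEMO_THISNODE		0x20u
#define DEMO_GFP_NOWAIT		(DEMO_KSWAPD_RECLAIM | DEMO_NOWARN)
#define DEMO_GFP_KERNEL		(DEMO_DIRECT_RECLAIM | DEMO_KSWAPD_RECLAIM)

/* Old scheme: branch on allow_spin, hardcode GFP_NOWAIT when spinning. */
demo_gfp_t demo_trynode_old(demo_gfp_t gfp, bool allow_spin)
{
	if (!allow_spin)
		return gfp | DEMO_THISNODE;
	return DEMO_GFP_NOWAIT | DEMO_THISNODE;
}

/* New scheme: mask down to at most GFP_NOWAIT, keep selected modifiers. */
demo_gfp_t demo_trynode_new(demo_gfp_t gfp)
{
	gfp &= DEMO_GFP_NOWAIT | DEMO_NOMEMALLOC | DEMO_ACCOUNT;
	gfp |= DEMO_NOWARN | DEMO_THISNODE;
	return gfp;
}

void demo_selfcheck(void)
{
	/* A GFP_KERNEL-like request is weakened identically by both. */
	assert(demo_trynode_new(DEMO_GFP_KERNEL) ==
	       demo_trynode_old(DEMO_GFP_KERNEL, true));

	/* A request without kswapd reclaim is no longer upgraded. */
	assert(demo_trynode_old(0, true) & DEMO_KSWAPD_RECLAIM);
	assert(!(demo_trynode_new(0) & DEMO_KSWAPD_RECLAIM));

	/* __GFP_ACCOUNT-like modifiers now survive the weakening. */
	assert(!(demo_trynode_old(DEMO_ACCOUNT, true) & DEMO_ACCOUNT));
	assert(demo_trynode_new(DEMO_ACCOUNT) & DEMO_ACCOUNT);
}
```

Masking can only clear reclaim bits, never add them, which is why a
separate allow_spin branch is no longer needed in this model.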

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 98b79e5e7679..8f6ca3d5fdfa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3378,9 +3378,10 @@ static __always_inline void unaccount_slab(struct slab *slab, int order,
 }
 
 /* Allocate and initialize a slab without building its freelist. */
-static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags,
+				  unsigned int alloc_flags, int node)
 {
-	bool allow_spin = gfpflags_allow_spinning(flags);
+	bool allow_spin = alloc_flags_allow_spinning(alloc_flags);
 	struct slab *slab;
 	struct kmem_cache_order_objects oo = s->oo;
 	gfp_t alloc_gfp;
@@ -3438,15 +3439,17 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	return slab;
 }
 
-static struct slab *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct slab *new_slab(struct kmem_cache *s, gfp_t flags,
+			     unsigned int alloc_flags, int node)
 {
 	if (unlikely(flags & GFP_SLAB_BUG_MASK))
 		flags = kmalloc_fix_flags(flags);
 
 	WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
 
-	return allocate_slab(s,
-		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
+	flags &= GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK;
+
+	return allocate_slab(s, flags, alloc_flags, node);
 }
 
 static void __free_slab(struct kmem_cache *s, struct slab *slab, bool allow_spin)
@@ -4467,25 +4470,22 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	 * 1) try to get a partial slab from target node only by having
 	 *    __GFP_THISNODE in pc.flags for get_from_partial()
 	 * 2) if 1) failed, try to allocate a new slab from target node with
-	 *    GPF_NOWAIT | __GFP_THISNODE opportunistically
+	 *    (at most) GFP_NOWAIT | __GFP_THISNODE opportunistically
 	 * 3) if 2) failed, retry with original gfpflags which will allow
 	 *    get_from_partial() try partial lists of other nodes before
 	 *    potentially allocating new page from other nodes
 	 */
 	if (unlikely(node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
 		     && try_thisnode)) {
-		if (unlikely(!allow_spin))
-			/* Do not upgrade gfp to NOWAIT from more restrictive mode */
-			trynode_flags = gfpflags | __GFP_THISNODE;
-		else
-			trynode_flags = GFP_NOWAIT | __GFP_THISNODE;
+		trynode_flags &= GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_ACCOUNT;
+		trynode_flags |= __GFP_NOWARN | __GFP_THISNODE;
 	}
 
 	object = get_from_partial(s, node, trynode_flags, ac);
 	if (object)
 		goto success;
 
-	slab = new_slab(s, trynode_flags, node);
+	slab = new_slab(s, trynode_flags, ac->alloc_flags, node);
 
 	if (unlikely(!slab)) {
 		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
@@ -7231,7 +7231,7 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
 
 new_slab:
 
-	slab = new_slab(s, gfp, local_node);
+	slab = new_slab(s, gfp, SLAB_ALLOC_DEFAULT, local_node);
 	if (!slab)
 		goto out;
 
@@ -7579,7 +7579,7 @@ static void early_kmem_cache_node_alloc(int node)
 
 	BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node));
 
-	slab = new_slab(kmem_cache_node, GFP_NOWAIT, node);
+	slab = new_slab(kmem_cache_node, GFP_NOWAIT, SLAB_ALLOC_DEFAULT, node);
 
 	BUG_ON(!slab);
 	if (slab_nid(slab) != node) {

-- 
2.54.0

