Linux cgroups development


* Re: [PATCH 1/1] mm/thp: clear deferred split shrinker bits when queues drain
From: Wei Yang @ 2026-06-13  2:58 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, david, ljs, shakeel.butt, mhocko, david, roman.gushchin,
	muchun.song, qi.zheng, yosry.ahmed, ziy, liam, usama.arif, kas,
	vbabka, ryncsn, zaslonko, gor, wangkefeng.wang, baolin.wang,
	baohua, dev.jain, npache, ryan.roberts, cgroups, linux-mm,
	linux-kernel
In-Reply-To: <20260602043453.67597-1-lance.yang@linux.dev>

On Tue, Jun 02, 2026 at 12:34:53PM +0800, Lance Yang wrote:
>From: Lance Yang <lance.yang@linux.dev>
>
>deferred_split_count() returns the raw list_lru count. When the per-memcg,
>per-node list is empty, that count is 0.
>
>That skips scanning, but it does not tell memcg reclaim that the shrinker
>is empty. shrink_slab_memcg() only clears the memcg shrinker bit when the
>count callback reports SHRINK_EMPTY.
>
>Return SHRINK_EMPTY for an empty deferred split list, so the bit can be
>cleared once the queue has drained.
>
>Signed-off-by: Lance Yang <lance.yang@linux.dev>
>---
> mm/huge_memory.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
>diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>index 72f6caf0fec6..62d598290c3b 100644
>--- a/mm/huge_memory.c
>+++ b/mm/huge_memory.c
>@@ -4397,7 +4397,10 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
> static unsigned long deferred_split_count(struct shrinker *shrink,
> 		struct shrink_control *sc)
> {
>-	return list_lru_shrink_count(&deferred_split_lru, sc);
>+	unsigned long count;
>+
>+	count = list_lru_shrink_count(&deferred_split_lru, sc);
>+	return count ?: SHRINK_EMPTY;
> }

Looks like you are right, thanks.

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>

> 
> static bool thp_underused(struct folio *folio)
>-- 
>2.49.0
>

-- 
Wei Yang
Help you, Help me

^ permalink raw reply
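
The effect of the change in the message above can be sketched in user space.
This is only an illustration under stated assumptions: the SHRINK_STOP and
SHRINK_EMPTY values are meant to mirror include/linux/shrinker.h, while
toy_lru_count(), count_raw(), count_patched() and bit_stays_set() are
hypothetical names that stand in for the list_lru count and for the decision
the changelog attributes to shrink_slab_memcg().

```c
#include <assert.h>
#include <stdbool.h>

/* Values intended to mirror include/linux/shrinker.h; the rest is hypothetical. */
#define SHRINK_STOP  (~0UL)
#define SHRINK_EMPTY (~0UL - 1)

/* Stand-in for the per-memcg, per-node list_lru count. */
static unsigned long toy_lru_count(unsigned long nr_items)
{
	return nr_items;
}

/* Old behaviour: report the raw count, so an empty list returns 0. */
static unsigned long count_raw(unsigned long nr_items)
{
	return toy_lru_count(nr_items);
}

/* Patched behaviour: an empty list is reported as SHRINK_EMPTY. */
static unsigned long count_patched(unsigned long nr_items)
{
	unsigned long count = toy_lru_count(nr_items);

	return count ? count : SHRINK_EMPTY;
}

/*
 * Toy version of the caller's decision: the shrinker bit stays set
 * unless the count callback explicitly reports SHRINK_EMPTY.
 */
static bool bit_stays_set(unsigned long count_result)
{
	/* SHRINK_STOP is treated as a scan-side value in this sketch. */
	assert(count_result != SHRINK_STOP);
	return count_result != SHRINK_EMPTY;
}
```

With the raw count, bit_stays_set(count_raw(0)) is true, so the bit is never
cleared for a drained queue; with the patched callback it becomes false, and a
non-empty list still reports its real count.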

* [PATCH] blk-iocost: correct CONFIG_TRACEPOINTS macro name in comments
From: Ethan Nelson-Moore @ 2026-06-13 22:54 UTC (permalink / raw)
  To: cgroups, linux-block
  Cc: Ethan Nelson-Moore, Tejun Heo, Josef Bacik, Jens Axboe

Comments in block/blk-iocost.c incorrectly refer to
CONFIG_TRACE_POINTS instead of CONFIG_TRACEPOINTS. Correct them.

Discovered while searching for CONFIG_* symbols referenced in code but
not defined in any Kconfig file.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
---
 block/blk-iocost.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 0cca88a366dc..04630c36b737 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -205,9 +205,9 @@ static char trace_iocg_path[TRACE_IOCG_PATH_LEN];
 		}								\
 	} while (0)
 
-#else	/* CONFIG_TRACE_POINTS */
+#else	/* CONFIG_TRACEPOINTS */
 #define TRACE_IOCG_PATH(type, iocg, ...)	do { } while (0)
-#endif	/* CONFIG_TRACE_POINTS */
+#endif	/* CONFIG_TRACEPOINTS */
 
 enum {
 	MILLION			= 1000000,
-- 
2.43.0


^ permalink raw reply related

* Re: [RFC PATCH v2 0/7] mm, swap: Virtual Swap Space (Swap Table Edition)
From: YoungJun Park @ 2026-06-14  8:20 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, yosry, david, muchun.song, shikemeng, baoquan.he,
	baohua, chengming.zhou, ljs, liam, vbabka, rppt, surenb, qi.zheng,
	axelrasmussen, yuanchu, weixugc, riel, gourry, haowenchao22,
	kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <20260612193738.2183968-1-nphamcs@gmail.com>

...
> * Integration with swap.tier by Youngjun (see [12]). For now, I'm
>   leaning towards opting out the vswap device from swap.tier entirely, and
>   treat it as a special device. Integrating it with swap.tiers will
>   benefit the cases where you want some cgroups to skip vswap for fast
>   swap devices (pmem), whereas other should go through zswap first. But
>   most other use cases, either the overhead of vswap will be acceptable
>   (or not the bottleneck), or we can just disable CONFIG_VSWAP entirely :)
> 
>   Youngjun, may I ask for your thoughts on this?

Hi Nhat,

Tier 1: VSWAP, Tier 2: ZSWAP ...

I don't see any problem applying the desired functionality with the
currently proposed mechanism and interface. With this, a user would be
assigned the default Virtual -> RAM swap tier, and the overall picture
becomes one where swap tiers are composed according to the priority
setting.

A few more thoughts came to mind.

Shakeel also proposed a per-tier max for the swap tier interface.

https://lore.kernel.org/linux-mm/aiw2p5ANjsQUCIHA@linux.dev/

However, for vswap, rather than treating it as a case for limiting the
amount via such a per-tier max, I think the current interface is the
better fit. (But, as Shakeel mentioned, if we only allow the limit
to be set to 0 or max, the usage could end up being the same. I'm still
thinking this part through.)

I have a few other thoughts as well, but I plan to raise those points in
the swap tier discussion thread instead. Please take a look at the
related thread, and let me know if you have any opinions. :)

And I'll share more if other thoughts come to mind.

Thanks,
Youngjun Park

^ permalink raw reply

* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: YoungJun Park @ 2026-06-14  9:23 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Yosry Ahmed, Hao Jia, Johannes Weiner, mhocko, tj, mkoutny,
	roman.gushchin, Nhat Pham, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he, joshua.hahnjy
In-Reply-To: <aiw2p5ANjsQUCIHA@linux.dev>

....
> >Based on the memcg interface currently proposed in swap_tier
> > (memory.swap.tiers, memory.swap.tiers.effective), I think it aligns well
> > with the current direction. It provides a foundation for selectively
> > targeting devices in tier order.
> 
> Here instead of cpuset like interface, we may want more zswap like interface
> where you can put limit on the usage i.e. memory.swap.tier*.max. We can start
> with allowing only two values i.e. 0 and max which effectively will be the
> same as what you need.
>

Good idea, and it's certainly feasible. When I considered this a while
ago, the reasons I didn't take this direction were:

1. There's no real-world usage for adjusting the swap tier amount (it's
   either 0 or MAX). That said, your suggestion to initially allow only
   0 and max is the compelling part, and it's making me reconsider.

2. The implementation cost seems high. The current implementation
   handles this at runtime via simple masking.

3. Relationship with swap.max:
   - If we tie it to the current interface, wouldn't limiting the swap
     amount within a selected tier already be possible? I wonder if
     that alone is enough.
   - If we add tier.max, it would need to be a subset of swap.max.
     (Any other complexities here?)

4. vswap enable/disable: vswap doesn't seem to have an amount-control
   aspect, so an on/off semantic would be clearer.
   https://lore.kernel.org/linux-mm/ai5kOOmR1LPTWs1J@yjaykim-PowerEdge-T330/T/#m8831ec057bf9387978d3bd698f51920600e09a04

In that case, the internal logic could stay roughly the same rather
than counting via a page counter. Something like:

1. Change the interface shell: tier.*.max — allow only 0 ~ max.
2. Keep the internal logic as is: 0 disables the mask (child memcgs
   off too), max enables it (child memcgs on too).
3. memory.zswap.max integrates naturally (it's memory."tier_name".max).
4. Extend later if use cases arise.

On balance I still lean toward the current interface, but if a per-tier
max is the better fit for memcg's direction and others feel the same,
I'm happy to switch. I'd like to hear Shakeel's thoughts again, and I'm
curious about others' opinions too.

A few more perspectives on the points below.

> I will respond to your other points later when I have time.

> > 
> > To summarize the discussions so far, the following points align well.
> > 
> > - Per-cgroup swap control, as I suggested.
> > - Proactive zswap writeback (Hao's usecase)
> > - Swap device target demotion(if it wants selective, then it is more better), as you mentioned:
> >   https://lore.kernel.org/linux-mm/aicZ-5GX9De3MAU7@linux.dev/
> > - Virtual Swap on/off in the future, as Nhat mentioned:
> >   https://lore.kernel.org/linux-mm/20260528212955.1912856-1-nphamcs@gmail.com/
> > - The memory.zswap.writeback alternative (no hierarchy model conflict)
> > - zswap is first swap tier.
> > - Promotion. (Also better for selectve usage)
> > - tier based swap policy (e.g round-robin...)
> > 
> > To accelerate this work, I believe we should reach a consensus and
> > merge the currently proposed swap_tier interface :)
> > 
> > If the above approach is difficult, I would like to suggest an
> > alternative for progress with the memcg interfaces removed:
> > 
> > 1) We could make zswap the first tier and create
> > a use case where memory.zswap.writeback internally is handled by tier logic.
> > 
> > 2) Or simply merge the swap_tier infrastructure itself first.
> > 
> > This would allow the swap_tier infrastructure to be merged and discussed
> > more easily.
> > 
> > If it takes longer to adopt swap_tier anyway, by doing so we progress next step
> > as a experimental feature.
> > 
> > - Apply per-cgroup swap as an experimental (debugfs) feature.
> > - Apply Hao's use case experimentally or as it is as Yosry suggested.
> > (future migration to swap tier)
> > 
> > How do you think?
> > 
> > (FYI: My emails to kernel.org are failing due to internal server issues.)
> > 
> > Thank you 
> > Youngjun Park

Let me clarify a part I wrote confusingly. Handling
memory.zswap.writeback via tiers is possible, but I don't think the
interface itself would be replaced even if memory.swap.tiers is adopted.

Selecting only zswap in memory.swap.tiers would not just disable
writeback; it would also block regular swap entirely, which differs
slightly from the current semantic. (... "Per the cgroup v2 docs: a
zswap-only tier setting is subtly different from setting
memory.swap.max to 0, since it still allows pages to be written to the
zswap pool; this has no effect if zswap is disabled, and swapping is
allowed unless memory.swap.max is set to 0.")

So the interface itself needs to be retained, and it could be extended
toward selective writeback — e.g., passing a desired tier into
memory.zswap.writeback so writeback targets only that tier. Currently
it only controls on/off. Other tiers probably don't need this; demotion
based on the selected tier should be enough.

Thanks,
Youngjun Park

^ permalink raw reply
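
The "simple masking" with hierarchical on/off semantics described in the
message above might look roughly like the following. This is a sketch under
assumptions, not the proposed swap_tier code: struct toy_memcg, its tiers
field and toy_effective_tiers() are hypothetical, and the only idea taken from
the thread is that disabling a tier in a parent also disables it for child
memcgs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cgroup state: one bit per swap tier. */
struct toy_memcg {
	const struct toy_memcg *parent; /* NULL for the root */
	unsigned int tiers;             /* bit N set: tier N allowed here */
};

/* Effective mask: a tier is usable only if every ancestor allows it. */
static unsigned int toy_effective_tiers(const struct toy_memcg *cg)
{
	unsigned int mask = ~0u;

	assert(cg != NULL);
	for (; cg; cg = cg->parent)
		mask &= cg->tiers;
	return mask;
}
```

In this model a per-tier limit restricted to 0 or max carries the same
information as one mask bit, which is why the two interfaces can share the
internal logic.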

* [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: Farhad Alemi @ 2026-06-14 13:25 UTC (permalink / raw)
  To: Andrew Morton, Waiman Long
  Cc: Farhad Alemi, David Hildenbrand, Gregory Price, Yury Norov,
	Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
	Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm,
	linux-kernel, cgroups, stable
In-Reply-To: <CA+0ovCg05rUk1-3k2ysdxmbcER8aG-wVh9SSTrrbp6LPWpPHYA@mail.gmail.com>

Creating a child cpuset where cpuset.mems is never set leads to a div/0
when a VMA mempolicy with MPOL_F_RELATIVE_NODES rebinds in response to a
CPU hotplug event.

Reproduction steps:
 1) Create a cgroup w/ cpuset controls (do not set cpuset.mems)
 2) Move the task into the child cpuset
 3) Create a VMA mempolicy for that task with MPOL_F_RELATIVE_NODES
 4) unplug and hotplug a cpu
      echo 0 > /sys/devices/system/cpu/cpu1/online
      echo 1 > /sys/devices/system/cpu/cpu1/online
 5) mempolicy rebind does a div/0 in mpol_relative_nodemask on the
    call to __nodes_fold()

The cpuset code passes (cs->mems_allowed) which is not guaranteed to have
nodes to the rebind routine.  Use cs->effective_mems instead, which is
guaranteed to have a non-empty nodemask.

Link: https://lore.kernel.org/linux-mm/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/
Link: https://lore.kernel.org/all/CA+0ovCiEz6SP_sn3kN4Tb+_oC=eHMXy_Ffj=usV3wREdQrUtww@mail.gmail.com/
Fixes: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}")
Suggested-by: Gregory Price <gourry@gourry.net>
Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Farhad Alemi <farhad.alemi@berkeley.edu>
Cc: stable@vger.kernel.org
---
v2: rebind to cs->effective_mems instead of newmems (Waiman Long);
    condense the changelog.

 kernel/cgroup/cpuset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)

 		migrate = is_memory_migrate(cs);

-		mpol_rebind_mm(mm, &cs->mems_allowed);
+		mpol_rebind_mm(mm, &cs->effective_mems);
 		if (migrate)
 			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
 		else
-- 
2.43.0

^ permalink raw reply
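
The division by zero described in the message above comes from folding a node
mask modulo the weight of the relative mask. A minimal user-space sketch,
assuming 64-bit masks in place of nodemask_t: toy_weight(), toy_fold(),
struct toy_cpuset and toy_rebind() are hypothetical names, and the sketch
leaves out the remapping step that follows the fold in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Number of set bits in a 64-bit node mask (toy nodes_weight()). */
static unsigned int toy_weight(uint64_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Toy fold: map every set bit of @orig to (bit % sz). The modulo is
 * where an empty relative mask (sz == 0) would divide by zero, so the
 * sketch refuses that input instead of evaluating it.
 */
static uint64_t toy_fold(uint64_t orig, unsigned int sz)
{
	uint64_t dst = 0;

	assert(sz != 0);
	for (unsigned int bit = 0; bit < 64; bit++)
		if (orig & (UINT64_C(1) << bit))
			dst |= UINT64_C(1) << (bit % sz);
	return dst;
}

/* Hypothetical cpuset with a configured and an effective node mask. */
struct toy_cpuset {
	uint64_t mems_allowed;   /* may be empty if cpuset.mems was never set */
	uint64_t effective_mems; /* assumed non-empty, inherited from the parent */
};

/* Rebind against the effective mask, as the patch does. */
static uint64_t toy_rebind(const struct toy_cpuset *cs, uint64_t user_nodes)
{
	return toy_fold(user_nodes, toy_weight(cs->effective_mems));
}
```

A child cpuset that never set cpuset.mems has a mems_allowed weight of 0,
which is exactly the input toy_fold() rejects; its effective_mems is assumed
to be non-empty, so the fold is well defined.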

* Re: [PATCH v2 02/16] mm/slab: do not init any kfence objects on allocation
From: Suren Baghdasaryan @ 2026-06-15  1:28 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Andrey Konovalov, Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm,
	linux-kernel, cgroups
In-Reply-To: <4cf98483-ae35-4ad0-8f77-5a46194eb65f@kernel.org>

On Thu, Jun 11, 2026 at 9:37 AM Vlastimil Babka (SUSE)
<vbabka@kernel.org> wrote:
>
> On 6/11/26 17:11, Harry Yoo wrote:
> >
> >> From 3a1c4398ce9f361a4e6f4d9946eab6237eea89c2 Mon Sep 17 00:00:00 2001
> >> From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
> >> Date: Wed, 10 Jun 2026 17:40:04 +0200
> >> Subject: [PATCH] mm/slab: do not init any kfence objects on allocation
> >>
> >> When init (zeroing) on allocation is requested, for kmalloc() we
> >> generally have to zero the full object size even if a smaller size is
> >> requested, in order to provide krealloc()'s __GFP_ZERO guarantees.
> >>
> >> When we end up allocating a kfence object, kfence perfoms the zeroing on
> >
> > nit: perfoms -> performs
>
> Fixed.
>
> >> its own because has its own redzone beyond the requested size. Thus

nit: s/because has/because it has

> >> slab_post_alloc_hook() has an 'init' parameter which has to be evaluated
> >> in all callers (via slab_want_init_on_alloc()) and should be false for
> >> kfence allocations.
> >>
> >> For kfence allocations in slab_alloc_node() this is achieved by subtly
> >> skipping over the slab_want_init_on_alloc() call. Other callers (i.e.
> >> kmem_cache_alloc_bulk_noprof()) however evaluate it unconditionally even
> >> if they do end up with a kfence allocation. This is only subtly not a
> >> problem, as those are not kmalloc allocations and thus the "requested
> >> size" equals s->object_size and thus it cannot interfere with kfence's
> >> redzone. There's just a unnecessary double zeroing (in both kfence and
> >> slab_post_alloc_hook()), but it's all very fragile and contradicts the
> >> comment in kfence_guarded_alloc().
> >>
> >> Remove this subtlety and simplify the code by eliminating the init
> >> parameter from slab_post_alloc_hook() and make it call
> >> slab_want_init_on_alloc() itself. Instead add a is_kfence_address()
> >> check before performing the memset, which will start doing the right
> >> thing for all callers of slab_post_alloc_hook().
> >>
> >> This potentially adds overhead of the is_kfence_address() check to
> >> allocation hotpath, but that one is designed to be as small as possible,
> >> and it's only evaluated if zeroing is about to happen. This means (aside
> >> from init_on_alloc hardening) only for __GFP_ZERO allocations, and the
> >> zeroing itself comes with an overhead likely larger than the added
> >> check.
> >
> >> While at it, refactor the handling of evaluating when KASAN does the
> >> init instead of SLUB, with no intended functional changes. A
> >> non-functional change is that we don't pass kasan_init as true to
> >> kasan_slab_alloc() if kasan has no integrated init, but then the value
> >> is ignored anyway, so it's theoretically more correct.
> >
> > Right.
> >
> >> Thanks to Harry Yoo for the initial refactoring attempt, and for updated
> >> comments that are used here.
> >
> > No problem ;)
> >
> >> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-2-7190909db118@kernel.org
> >> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> >> ---
> >
> > Looks good to me,
> > Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

>
> Thanks!
>
> > Thanks!
> >
>

^ permalink raw reply

* Re: [PATCH v2 03/16] mm/slab: stop inlining __slab_alloc_node()
From: Suren Baghdasaryan @ 2026-06-15  1:33 UTC (permalink / raw)
  To: Hao Li
  Cc: Vlastimil Babka (SUSE), Harry Yoo, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <aiuBW_xY6x5IBQfE@fedora>

On Thu, Jun 11, 2026 at 8:48 PM Hao Li <hao.li@linux.dev> wrote:
>
> On Wed, Jun 10, 2026 at 05:40:05PM +0200, Vlastimil Babka (SUSE) wrote:
> > With sheaves, this is no longer part of the allocation fastpath.  For
> > the same reason, also mark the call to it from slab_alloc_node() as
> > unlikely().
> >
> > Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
> > Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > ---
>
> Reviewed-by: Hao Li <hao.li@linux.dev>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>


>
> --
> Thanks,
> Hao

^ permalink raw reply

* Re: [PATCH v2 04/16] mm/slab: introduce slab_alloc_context
From: Suren Baghdasaryan @ 2026-06-15  1:41 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Harry Yoo, Christoph Lameter, David Rientjes,
	Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <3914c31f-c97e-4943-9f4f-630b05785014@kernel.org>

On Fri, Jun 12, 2026 at 2:51 AM Vlastimil Babka (SUSE)
<vbabka@kernel.org> wrote:
>
> On 6/12/26 05:10, Hao Li wrote:
> > On Wed, Jun 10, 2026 at 05:40:06PM +0200, Vlastimil Babka (SUSE) wrote:
> >> @@ -5389,13 +5401,18 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
> >>      if (ret)
> >>              goto success;
> >>
> >> +    struct slab_alloc_context ac = {
> >> +            .caller_addr = _RET_IP_,
> >> +            .orig_size = orig_size,
> >> +    };
> >
> > It might be better to move this to the beginning of the function, to avoid
> > patch09 jump to `success` before ac is initialized.
>
> Hm right, didn't compilers actually complain about goto skipping over
> declarations? But neither gcc nor clang do for me, hm. Will move, thanks.

I see it's moved in
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next,
so

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

>
> >> +
> >>      /*
> >>       * Do not call slab_alloc_node(), since trylock mode isn't
> >>       * compatible with slab_pre_alloc_hook/should_failslab and
> >>       * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
> >>       * and slab_post_alloc_hook() directly.
> >>       */
> >> -    ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, orig_size);
> >> +    ret = __slab_alloc_node(s, alloc_gfp, node, &ac);
> >>
> >>      /*
> >>       * It's possible we failed due to trylock as we preempted someone with
> >> @@ -7237,10 +7254,13 @@ static bool __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> >>      int i;
> >>
> >>      if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> >> +            struct slab_alloc_context ac = {
> >> +                    .caller_addr = _RET_IP_,
> >> +                    .orig_size = s->object_size,
> >> +            };
> >>              for (i = 0; i < size; i++) {
> >>
> >> -                    p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_,
> >> -                                         s->object_size);
> >> +                    p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, &ac);
> >>                      if (unlikely(!p[i]))
> >>                              goto error;
> >>
> >>
> >> --
> >> 2.54.0
> >>
>

^ permalink raw reply
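
On the question above about goto skipping a declaration: as far as I know, C
permits jumping past an initialized declaration as long as the type is not
variably modified; the object exists but its initializer has not run. GCC only
warns about this with -Wjump-misses-init, which is not part of -Wall for C.
The sketch below shows the safe arrangement that was agreed on, with
hypothetical names (struct toy_ctx, toy_alloc()) standing in for the real
ones.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the context struct discussed above. */
struct toy_ctx {
	unsigned long caller_addr;
	size_t orig_size;
};

/*
 * Declaring and initializing the context before the first goto means
 * every path that reaches "success" sees a fully initialized struct.
 */
static size_t toy_alloc(int fast_path_hit, size_t size)
{
	struct toy_ctx ac = {
		.caller_addr = 0,
		.orig_size = size,
	};
	size_t result;

	assert(size != 0); /* toy precondition, not a kernel rule */

	if (fast_path_hit)
		goto success;

	/* A slow path would consult &ac here. */
	ac.caller_addr = 1;

success:
	result = ac.orig_size; /* valid on both paths */
	return result;
}
```

Had the declaration of ac been placed after the goto, the fast path would
reach the label with ac.orig_size indeterminate.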

* Re: [PATCH v2 05/16] mm/slab: introduce alloc_flags and SLAB_ALLOC_TRYLOCK
From: Suren Baghdasaryan @ 2026-06-15  2:00 UTC (permalink / raw)
  To: Hao Li
  Cc: Vlastimil Babka (SUSE), Harry Yoo, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <aiuBoDbQc0N-l7e-@fedora>

On Thu, Jun 11, 2026 at 8:50 PM Hao Li <hao.li@linux.dev> wrote:
>
> On Wed, Jun 10, 2026 at 05:40:07PM +0200, Vlastimil Babka (SUSE) wrote:
> > Similarly to the page allocators, introduce slab-allocator specific
> > alloc flags that internally control allocation behavior in addition to
> > gfp_flags, without occupying the limited gfp flags space.
> >
> > Introduce the first flag SLAB_ALLOC_TRYLOCK that behaves similarly to
> > page allocator's ALLOC_TRYLOCK and will be used to reimplement
> > kmalloc_nolock()'s "!allow_spin" behavior. That currently relies on
> > gfpflags_allow_spinning() and thus the lack of both __GFP_RECLAIM flags,
> > importantly __GFP_KSWAPD_RECLAIM. This can give false-positive results
> > e.g. in early boot with a restricted gfp_allowed_mask.
> >
> > Also introduce alloc_flags_allow_spinning() to replace the usage of
> > gfpflags_allow_spinning().
> >
> > Start using alloc_flags and the new check first in alloc_from_pcs() and
> > __pcs_replace_empty_main(). This means some slab allocations that were
> > falsely treated as kmalloc_nolock() due to their gfp flags will now have
> > higher chances of succeed, and this will further increase with followup

nit: I think it should be either "higher chances of succeess" or
"higher chances to succeed".

> > changes.
> >
> > Remove a WARN_ON_ONCE() from refill_objects() as it's now legitimate to
> > reach it from a slab allocation that's not _nolock() and yet lacks
> > __GFP_KSWAPD_RECLAIM for other reasons.
> >
> > Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > ---
>
> Reviewed-by: Hao Li <hao.li@linux.dev>

I would call SLAB_ALLOC_TRYLOCK something like SLAB_ALLOC_NOSPIN or
SLAB_ALLOC_NOLOCK but naming is hard and I don't claim myself to be
good at it. So, feel free to adopt my suggestion if you like it or
ignore it otherwise.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

>
> --
> Thanks,
> Hao

^ permalink raw reply

* Re: [PATCH v2 05/16] mm/slab: introduce alloc_flags and SLAB_ALLOC_TRYLOCK
From: Suren Baghdasaryan @ 2026-06-15  2:01 UTC (permalink / raw)
  To: Hao Li
  Cc: Vlastimil Babka (SUSE), Harry Yoo, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <CAJuCfpGSHfNUvL9AzbftSg=uGRW4cJLbO6iB15keyN6A_eSWEw@mail.gmail.com>

On Sun, Jun 14, 2026 at 7:00 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Jun 11, 2026 at 8:50 PM Hao Li <hao.li@linux.dev> wrote:
> >
> > On Wed, Jun 10, 2026 at 05:40:07PM +0200, Vlastimil Babka (SUSE) wrote:
> > > Similarly to the page allocators, introduce slab-allocator specific
> > > alloc flags that internally control allocation behavior in addition to
> > > gfp_flags, without occupying the limited gfp flags space.
> > >
> > > Introduce the first flag SLAB_ALLOC_TRYLOCK that behaves similarly to
> > > page allocator's ALLOC_TRYLOCK and will be used to reimplement
> > > kmalloc_nolock()'s "!allow_spin" behavior. That currently relies on
> > > gfpflags_allow_spinning() and thus the lack of both __GFP_RECLAIM flags,
> > > importantly __GFP_KSWAPD_RECLAIM. This can give false-positive results
> > > e.g. in early boot with a restricted gfp_allowed_mask.
> > >
> > > Also introduce alloc_flags_allow_spinning() to replace the usage of
> > > gfpflags_allow_spinning().
> > >
> > > Start using alloc_flags and the new check first in alloc_from_pcs() and
> > > __pcs_replace_empty_main(). This means some slab allocations that were
> > > falsely treated as kmalloc_nolock() due to their gfp flags will now have
> > > higher chances of succeed, and this will further increase with followup
>
> nit: I think it should be either "higher chances of succeess" or
> "higher chances to succeed".

And of course I misspelled "success" :)

>
> > > changes.
> > >
> > > Remove a WARN_ON_ONCE() from refill_objects() as it's now legitimate to
> > > reach it from a slab allocation that's not _nolock() and yet lacks
> > > __GFP_KSWAPD_RECLAIM for other reasons.
> > >
> > > Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > > ---
> >
> > Reviewed-by: Hao Li <hao.li@linux.dev>
>
> I would call SLAB_ALLOC_TRYLOCK something like SLAB_ALLOC_NOSPIN or
> SLAB_ALLOC_NOLOCK but naming is hard and I don't claim myself to be
> good at it. So, feel free to adopt my suggestion if you like it or
> ignore it otherwise.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> >
> > --
> > Thanks,
> > Hao

^ permalink raw reply

* Re: [PATCH v2 05/16] mm/slab: introduce alloc_flags and SLAB_ALLOC_TRYLOCK
From: Alexei Starovoitov @ 2026-06-15  2:16 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Hao Li, Vlastimil Babka (SUSE), Harry Yoo, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, LKML,
	open list:CONTROL GROUP (CGROUP)
In-Reply-To: <CAJuCfpGSHfNUvL9AzbftSg=uGRW4cJLbO6iB15keyN6A_eSWEw@mail.gmail.com>

On Sun, Jun 14, 2026 at 7:01 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Jun 11, 2026 at 8:50 PM Hao Li <hao.li@linux.dev> wrote:
> >
> > On Wed, Jun 10, 2026 at 05:40:07PM +0200, Vlastimil Babka (SUSE) wrote:
> > > Similarly to the page allocators, introduce slab-allocator specific
> > > alloc flags that internally control allocation behavior in addition to
> > > gfp_flags, without occupying the limited gfp flags space.
> > >
> > > Introduce the first flag SLAB_ALLOC_TRYLOCK that behaves similarly to
> > > page allocator's ALLOC_TRYLOCK and will be used to reimplement
> > > kmalloc_nolock()'s "!allow_spin" behavior. That currently relies on
> > > gfpflags_allow_spinning() and thus the lack of both __GFP_RECLAIM flags,
> > > importantly __GFP_KSWAPD_RECLAIM. This can give false-positive results
> > > e.g. in early boot with a restricted gfp_allowed_mask.
> > >
> > > Also introduce alloc_flags_allow_spinning() to replace the usage of
> > > gfpflags_allow_spinning().
> > >
> > > Start using alloc_flags and the new check first in alloc_from_pcs() and
> > > __pcs_replace_empty_main(). This means some slab allocations that were
> > > falsely treated as kmalloc_nolock() due to their gfp flags will now have
> > > higher chances of succeed, and this will further increase with followup
>
> nit: I think it should be either "higher chances of succeess" or
> "higher chances to succeed".
>
> > > changes.
> > >
> > > Remove a WARN_ON_ONCE() from refill_objects() as it's now legitimate to
> > > reach it from a slab allocation that's not _nolock() and yet lacks
> > > __GFP_KSWAPD_RECLAIM for other reasons.
> > >
> > > Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > > ---
> >
> > Reviewed-by: Hao Li <hao.li@linux.dev>
>
> I would call SLAB_ALLOC_TRYLOCK something like SLAB_ALLOC_NOSPIN or
> SLAB_ALLOC_NOLOCK but naming is hard and I don't claim myself to be
> good at it. So, feel free to adopt my suggestion if you like it or
> ignore it otherwise.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>

Just noticed "trylock" in the #define SLAB_ALLOC_TRYLOCK

Please call it SLAB_ALLOC_NOLOCK.

Initial api was using 'trylock' name and it was a mistake,
since people assumed normal spin_trylock() like semantics.
"trylock" implies that it fails under contention
and retry is a normal next step. It's not the case.
No one should be retrying. That's why the final api was kmalloc_nolock().
So please keep this important distinction in the name.
SLAB_ALLOC_NOLOCK should mean that spinning locks
should not be taken. It should not mean "just go to trylock everywhere".

^ permalink raw reply
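
The false positive that the 05/16 changelog describes can be sketched as
follows. All bit values and TOY_ names are hypothetical; only the idea comes
from the thread: a gfp-based check reads "no reclaim bits" as "must not spin",
whereas an explicit allocation flag (called SLAB_ALLOC_TRYLOCK in the posted
patch, with SLAB_ALLOC_NOLOCK requested above) is set only by the caller that
really needs that behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; only the concepts come from the thread. */
#define TOY_GFP_DIRECT_RECLAIM  0x1u
#define TOY_GFP_KSWAPD_RECLAIM  0x2u
#define TOY_GFP_RECLAIM (TOY_GFP_DIRECT_RECLAIM | TOY_GFP_KSWAPD_RECLAIM)

#define TOY_SLAB_ALLOC_DEFAULT  0x0u
#define TOY_SLAB_ALLOC_TRYLOCK  0x1u

/* gfp-based check: "no reclaim bits" is read as "must not spin". */
static bool toy_gfpflags_allow_spinning(unsigned int gfp)
{
	return (gfp & TOY_GFP_RECLAIM) != 0;
}

/* alloc_flags-based check: only an explicit flag forbids spinning. */
static bool toy_alloc_flags_allow_spinning(unsigned int alloc_flags)
{
	return (alloc_flags & TOY_SLAB_ALLOC_TRYLOCK) == 0;
}

/* Walk through an early-boot style case with the reclaim bits masked off. */
static void toy_demo(void)
{
	unsigned int masked_gfp = TOY_GFP_RECLAIM & ~TOY_GFP_RECLAIM;

	/* The gfp check misreads an ordinary allocation as "nolock". */
	assert(!toy_gfpflags_allow_spinning(masked_gfp));
	/* The explicit flag does not: the caller never asked for it. */
	assert(toy_alloc_flags_allow_spinning(TOY_SLAB_ALLOC_DEFAULT));
}
```

Whatever the flag ends up being called, the sketch treats it as "spinning
locks must not be taken", not as an invitation to retry after contention.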

* Re: [PATCH v2 06/16] mm/slab: add alloc_flags to slab_alloc_context
From: Suren Baghdasaryan @ 2026-06-15  2:20 UTC (permalink / raw)
  To: Hao Li
  Cc: Vlastimil Babka (SUSE), Harry Yoo, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <aiuB47Lj0vFGyuFA@fedora>

On Thu, Jun 11, 2026 at 8:51 PM Hao Li <hao.li@linux.dev> wrote:
>
> On Wed, Jun 10, 2026 at 05:40:08PM +0200, Vlastimil Babka (SUSE) wrote:
> > Add alloc_flags as a new field to the slab_alloc_context helper struct,
> > so we can pass it to more functions in the slab implementation without
> > adding another function parameter.
> >
> > Start checking them via alloc_flags_allow_spinning() in
> > alloc_single_from_new_slab() (where we can drop the allow_spin
> > parameter) and ___slab_alloc(). This further reduces false-positive
> > spinning-not-allowed from allocations that are not kmalloc_nolock() but
> > lack __GFP_RECLAIM flags.

___slab_alloc() is now using alloc_flags_allow_spinning(alloc_flags)
while the function it uses (get_from_partial()->get_from_any_partial()) is
still using gfpflags_allow_spinning(gfpflags). I'm guessing
get_from_any_partial() will be converted later on but I wonder if that
conversion would better be done in the same patch to avoid
inconsistent behavior during possible bisection.

> >
> > _kmalloc_nolock_noprof() initializes ac.alloc_flags using its flags that
> > are SLAB_ALLOC_TRYLOCK. slab_alloc_node() and __kmem_cache_alloc_bulk()
> > are not reachable from kmalloc_nolock() and all their callers expect
> > spinning to be allowed, so they can use SLAB_ALLOC_DEFAULT. This is
> > temporary as the scope of slab_alloc_context will further move to the
> > callers, making the alloc_flags usage more obvious.
> >
> > Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > ---
>
> Reviewed-by: Hao Li <hao.li@linux.dev>
>
> --
> Thanks,
> Hao

^ permalink raw reply

* Re: [PATCH v2 07/16] mm/slab: replace struct partial_context with slab_alloc_context
From: Suren Baghdasaryan @ 2026-06-15  2:36 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Vlastimil Babka (SUSE), Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <f9b7935c-f5f0-496c-b55e-1f3feee5c87a@kernel.org>

On Wed, Jun 10, 2026 at 11:05 PM Harry Yoo <harry@kernel.org> wrote:
>
>
>
> On 6/11/26 12:40 AM, Vlastimil Babka (SUSE) wrote:
> > Refactor get_from_partial_node(), get_from_any_partial(),
> > get_from_partial() and ___slab_alloc().
> >
> > Remove struct partial_context, which used to be more substantial but
> > shrank as part of the sheaves conversion. Instead pass gfp_flags and
> > pointer to the new slab_alloc_context, which together is a superset of
> > partial_context.
> >
> > This means alloc_flags are now available and we can use them to
> > determine if spinning is allowed, further reducing false positive "not
> > allowed" in the slow path due to gfp flags lacking __GFP_RECLAIM.
> >
> > Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > ---
>
> Looks good to me,
> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

Ah, nice! The conversion I was anticipating in the previous patch...
I would do this removal of partial_context as patch 6 and then convert
___slab_alloc() and get_from_any_partial*() all together in patch 7. I
think that would keep the behavior of ___slab_alloc() more robust
throughout the patchset. But I would say it's nice to have, not a
must-have.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

>
> --
> Cheers,
> Harry / Hyeonggon

^ permalink raw reply

* Re: [RFC PATCH v2 0/7] mm, swap: Virtual Swap Space (Swap Table Edition)
From: Nhat Pham @ 2026-06-15  2:38 UTC (permalink / raw)
  To: YoungJun Park
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, yosry, david, muchun.song, shikemeng, baoquan.he,
	baohua, chengming.zhou, ljs, liam, vbabka, rppt, surenb, qi.zheng,
	axelrasmussen, yuanchu, weixugc, riel, gourry, haowenchao22,
	kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <ai5kOOmR1LPTWs1J@yjaykim-PowerEdge-T330>

On Sun, Jun 14, 2026 at 4:20 AM YoungJun Park <youngjun.park@lge.com> wrote:
>
> ...
> > * Integration with swap.tier by Youngjun (see [12]). For now, I'm
> >   leaning towards opting out the vswap device from swap.tier entirely, and
> >   treat it as a special device. Integrating it with swap.tiers will
> >   benefit the cases where you want some cgroups to skip vswap for fast
> >   swap devices (pmem), whereas other should go through zswap first. But
> >   most other use cases, either the overhead of vswap will be acceptable
> >   (or not the bottleneck), or we can just disable CONFIG_VSWAP entirely :)
> >
> >   Youngjun, may I ask for your thoughts on this?
>
> Hi Nhat,
>
> Tier 1: VSWAP, Tier 2: ZSWAP ...
>
> I don't see any problem applying the desired functionality with the
> currently proposed mechanism and interface. With this, a user would be
> assigned the default Virtual -> RAM swap tier, and the overall picture
> becomes one where swap tiers are composed according to the priority
> setting.

It's more: is there a strong argument for letting vswap be a tier (one
that isn't covered by just turning off vswap altogether)?

Because right now I'm not exposing vswap device to userspace in any
manner, pretty much. It's abstract and transparent, and minimizes
complexity (no vswap and swap.tier interaction) and surfaces for
issues.

But if you have a strong use case in mind please let me know :)

Worst case scenario if we're wrong, we can always do it as a follow-up
down the line.

>
> A few more thoughts came to mind.
>
> Shakeel also proposed a per-tier max for the swap tier interface.
>
> https://lore.kernel.org/linux-mm/aiw2p5ANjsQUCIHA@linux.dev/
>
> However, for vswap, rather than treating it as a case for limiting the
> amount via such a per-tier max, I think the current interface is the
> better fit. (But, as Shakeel mentioned, if we only allow the limit
> to be set to 0 or max, the usage could end up being the same. I'm still
> thinking this part through.)
>
> I have a few other thoughts as well, but I plan to raise those points in
> the swap tier discussion thread instead. Please take a look at the
> related thread, and let me know if you have any opinions. :)

I'm following that thread too. I'm still thinking about it - will let
you know when I have a more definitive opinion.

^ permalink raw reply

* Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg
From: Hao Jia @ 2026-06-15  2:45 UTC (permalink / raw)
  To: Yosry Ahmed, Shakeel Butt
  Cc: Nhat Pham, akpm, tj, hannes, mhocko, mkoutny, chengming.zhou,
	muchun.song, roman.gushchin, cgroups, linux-mm, linux-kernel,
	linux-doc, Hao Jia
In-Reply-To: <CAO9r8zM=CMtUfV0RX3YyztqMNcw=s8M3WX6Q0epR5YHUvwTTKw@mail.gmail.com>



On 2026/6/13 02:15, Yosry Ahmed wrote:
> On Fri, Jun 12, 2026 at 9:40 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>
>> On Thu, Jun 11, 2026 at 05:39:16PM +0000, Yosry Ahmed wrote:
>>> On Tue, Jun 09, 2026 at 11:18:26AM +0800, Hao Jia wrote:
>>>>
>>>>
>>>> On 2026/6/9 02:01, Nhat Pham wrote:
>>>>> On Mon, Jun 8, 2026 at 9:48 AM Yosry Ahmed <yosry@kernel.org> wrote:
>>>>>>
>>>>>>> But OTOH, this does seem like a recipe for inefficient reclaim. We
>>>>>>> might exhaust hotter memory of a cgroup while sparing colder memory of
>>>>>>> another cgroup... But maybe if they're all cold anyway, then who
>>>>>>> cares, and eventually you'll get to the cold stuff of other child?
>>>>>>
>>>>>> Forgot to respond to this part, the unfairness is limited to the batch
>>>>>> size per-invocation, so it should be fine as long as you don't divide
>>>>>> the amount over 100 iterations for some reason. Also yes, all memory
>>>>>> in zswap is cold, the relative coldness is not that important (e.g.
>>>>>> compared to relative coldness during reclaim).
>>>>>
>>>>> Ok then yeah, I think we should shelve per-memcg cursor for the next
>>>>> version. Down the line, if we have more data that unfairness is an
>>>>> issue, we can always fix it. One step at a time :)
>>>>
>>>> Thanks a lot to Yosry, Nhat, and Shakeel for the great suggestions!
>>>>
>>>> Let me summarize what I plan to do in the next version to make sure we are
>>>> on the same page:
>>>>
>>>>   - Drop the per-memcg cursor and keep the root cgroup cursor
>>>> (zswap_next_shrink) logic intact.
>>>>   - Stick to using the zswap_writeback_only key, and change the proactive
>>>> writeback size to use the compressed size.
>>>>   - Consolidate and reuse the logic between shrink_worker() and
>>>> shrink_memcg(). Enable batch writeback in the shrink_worker() path, while
>>>> keeping the writeback behavior in the zswap_store() path unchanged.
>>>>
>>>> Please let me know if I missed or misunderstood anything. Thanks again for
>>>> clearing things up!
>>>
>>> Sorry for the late response, yes I think this makes sense. However, I
>>> have some comment about how this interacts with swap tiering, let me
>>> reply to the other thread.
>>>
>>
>> I think the swap tiers interaction will be figured out over next cycle. However
>> Hao can/should continue to push and we may decide to let it in orthogonal to
>> swap tiers.
> 
> Yeah I think there are a lot of changes we discussed outside of the
> memcg interface, so maybe keep the interface as-is for now, work on a
> new version with the other changes, and we can finalize the interface
> at the end?

Okay, I will split the non-memcg interface parts into a few separate 
patches. These will serve as the preparation work for proactive 
writeback and enable batch writeback in the shrink_worker() path.

However, I will still send the complete patchset using the 
zswap_writeback_only key approach in the next version. This should make 
it easier to review whether the preparation logic is reasonable, and to 
decide whether it should eventually be merged independently of the swap 
tiers.

Thanks,
Hao

^ permalink raw reply

* Re: [PATCH v2 08/16] mm/slab: pass alloc_flags to new slab allocation
From: Suren Baghdasaryan @ 2026-06-15  4:10 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Harry Yoo, Christoph Lameter, David Rientjes,
	Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <0546df1f-616c-44b5-8a1c-f96d5f33d8e6@kernel.org>

On Fri, Jun 12, 2026 at 2:59 AM Vlastimil Babka (SUSE)
<vbabka@kernel.org> wrote:
>
> On 6/12/26 07:26, Hao Li wrote:
> > On Wed, Jun 10, 2026 at 05:40:10PM +0200, Vlastimil Babka (SUSE) wrote:
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -3378,9 +3378,10 @@ static __always_inline void unaccount_slab(struct slab *slab, int order,
> >>  }
> >>
> >>  /* Allocate and initialize a slab without building its freelist. */
> >> -static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> >> +static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags,
> >> +                              unsigned int alloc_flags, int node)
> >>  {
> >> -    bool allow_spin = gfpflags_allow_spinning(flags);
> >> +    bool allow_spin = alloc_flags_allow_spinning(alloc_flags);
> >
> > nit: allow_spin doesn't depend on `flags` now, so it seems we can delete the
> > comments:
> >
> > /*
> >  * __GFP_RECLAIM could be cleared on the first allocation attempt,
> >  * so pass allow_spin flag directly.
> >  */
>
> Right, deleted.
>
> > Otherwise, looks good to me.
> > Reviewed-by: Hao Li <hao.li@linux.dev>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

>
> Thanks!
>

^ permalink raw reply

* Re: [PATCH v2 09/16] mm/slab: pass alloc_flags through slab_post_alloc_hook() chain
From: Suren Baghdasaryan @ 2026-06-15  4:35 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-9-7190909db118@kernel.org>

On Wed, Jun 10, 2026 at 8:41 AM Vlastimil Babka (SUSE)
<vbabka@kernel.org> wrote:
>
> Convert the whole following call stack to pass either slab_alloc_context
> (thus including alloc_flags) or just alloc_flags as necessary:
>
> slab_post_alloc_hook()
>   alloc_tagging_slab_alloc_hook()
>     __alloc_tagging_slab_alloc_hook()
>       prepare_slab_obj_exts_hook()
>         alloc_slab_obj_exts()
>   memcg_slab_post_alloc_hook()
>     __memcg_slab_post_alloc_hook()
>       alloc_slab_obj_exts()
>
> Converting all these at once avoids unnecessary churn and is mostly
> mechanical.
>
> This ultimately allows to decide if spinning is allowed using
> alloc_flags in alloc_slab_obj_exts(), as well as slab_post_alloc_hook().
> Aside from alloc_from_pcs_bulk() (to be handled next) there is nothing
> else in slab itself relying on gfpflags_allow_spinning() which can
> be false even if not called from kmalloc_nolock().
>
> A followup change will also use the alloc_flags availability in the call
> stack above to remove the __GFP_NO_OBJ_EXT flag.
>
> For alloc_slab_obj_exts(), also replace the suboptimal "bool new_slab"
> parameter with a SLAB_ALLOC_NEW_SLAB flag with identical functionality.
>
> To further reduce the number of parameters of slab_post_alloc_hook(),
> also make 'struct list_lru *lru' (which is NULL for most callers) a new
> field of slab_alloc_context.
>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
>  mm/memcontrol.c |  5 +--
>  mm/slab.h       |  6 ++--
>  mm/slub.c       | 94 +++++++++++++++++++++++++++++++++------------------------
>  3 files changed, 62 insertions(+), 43 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c03d4787d466..29390ba13baa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3424,7 +3424,8 @@ static inline size_t obj_full_size(struct kmem_cache *s)
>  }
>
>  bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> -                                 gfp_t flags, size_t size, void **p)
> +                                 gfp_t flags, unsigned int slab_alloc_flags,
> +                                 size_t size, void **p)
>  {
>         size_t obj_size = obj_full_size(s);
>         struct obj_cgroup *objcg;
> @@ -3472,7 +3473,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>                 slab = virt_to_slab(p[i]);
>
>                 if (!slab_obj_exts(slab) &&
> -                   alloc_slab_obj_exts(slab, s, flags, false)) {
> +                   alloc_slab_obj_exts(slab, s, flags, slab_alloc_flags)) {
>                         continue;
>                 }
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 96f65b625600..4db6d8aa0ee3 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -19,6 +19,7 @@
>  /* slab's alloc_flags definitions */
>  #define SLAB_ALLOC_DEFAULT     0x00 /* no flags */
>  #define SLAB_ALLOC_TRYLOCK     0x01 /* a kmalloc_nolock() allocation */
> +#define SLAB_ALLOC_NEW_SLAB    0x02 /* a flag for alloc_slab_obj_exts() */
>
>  static inline bool alloc_flags_allow_spinning(const unsigned int alloc_flags)
>  {
> @@ -612,7 +613,7 @@ static inline struct slabobj_ext *slab_obj_ext(struct slab *slab,
>  }
>
>  int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> -                        gfp_t gfp, bool new_slab);
> +                       gfp_t gfp, unsigned int alloc_flags);
>
>  #else /* CONFIG_SLAB_OBJ_EXT */
>
> @@ -642,7 +643,8 @@ static inline enum node_stat_item cache_vmstat_idx(struct kmem_cache *s)
>
>  #ifdef CONFIG_MEMCG
>  bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> -                                 gfp_t flags, size_t size, void **p);
> +                                 gfp_t flags, unsigned int slab_alloc_flags,
> +                                 size_t size, void **p);
>  void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
>                             void **p, int objects, unsigned long obj_exts);
>  #endif
> diff --git a/mm/slub.c b/mm/slub.c
> index 8f6ca3d5fdfa..e634137b67fa 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -218,6 +218,7 @@ struct slab_alloc_context {
>         unsigned long caller_addr;
>         unsigned long orig_size;
>         unsigned int alloc_flags;
> +       struct list_lru *lru;
>  };
>
>  /* Structure holding parameters for get_partial_node_bulk() */
> @@ -2155,9 +2156,9 @@ static inline size_t obj_exts_alloc_size(struct kmem_cache *s,
>  }
>
>  int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> -                       gfp_t gfp, bool new_slab)
> +                       gfp_t gfp, unsigned int alloc_flags)
>  {
> -       bool allow_spin = gfpflags_allow_spinning(gfp);
> +       const bool allow_spin = alloc_flags_allow_spinning(alloc_flags);
>         unsigned int objects = objs_per_slab(s, slab);
>         unsigned long new_exts;
>         unsigned long old_exts;
> @@ -2206,7 +2207,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
>         old_exts = READ_ONCE(slab->obj_exts);
>         handle_failed_objexts_alloc(old_exts, vec, objects);
>
> -       if (new_slab) {
> +       if (alloc_flags & SLAB_ALLOC_NEW_SLAB) {
>                 /*
>                  * If the slab is brand new and nobody can yet access its
>                  * obj_exts, no synchronization is required and obj_exts can
> @@ -2331,7 +2332,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
>  }
>
>  static int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> -                              gfp_t gfp, bool new_slab)
> +                              gfp_t gfp, unsigned int alloc_flags)
>  {
>         return 0;
>  }
> @@ -2351,10 +2352,10 @@ static inline void alloc_slab_obj_exts_early(struct kmem_cache *s,
>
>  static inline unsigned long
>  prepare_slab_obj_exts_hook(struct kmem_cache *s, struct slab *slab,
> -                          gfp_t flags, void *p)
> +                          gfp_t flags, unsigned int alloc_flags, void *p)
>  {
>         if (!slab_obj_exts(slab) &&
> -           alloc_slab_obj_exts(slab, s, flags, false)) {
> +           alloc_slab_obj_exts(slab, s, flags, alloc_flags)) {
>                 pr_warn_once("%s, %s: Failed to create slab extension vector!\n",
>                              __func__, s->name);
>                 return 0;
> @@ -2366,7 +2367,8 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, struct slab *slab,
>
>  /* Should be called only if mem_alloc_profiling_enabled() */
>  static noinline void
> -__alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags)
> +__alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags,
> +                               unsigned int alloc_flags)
>  {
>         unsigned long obj_exts;
>         struct slabobj_ext *obj_ext;
> @@ -2382,7 +2384,7 @@ __alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags)
>                 return;
>
>         slab = virt_to_slab(object);
> -       obj_exts = prepare_slab_obj_exts_hook(s, slab, flags, object);
> +       obj_exts = prepare_slab_obj_exts_hook(s, slab, flags, alloc_flags, object);
>         /*
>          * Currently obj_exts is used only for allocation profiling.
>          * If other users appear then mem_alloc_profiling_enabled()
> @@ -2401,10 +2403,11 @@ __alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags)
>  }
>
>  static inline void
> -alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags)
> +alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags,
> +                             unsigned int alloc_flags)
>  {
>         if (mem_alloc_profiling_enabled())
> -               __alloc_tagging_slab_alloc_hook(s, object, flags);
> +               __alloc_tagging_slab_alloc_hook(s, object, flags, alloc_flags);
>  }
>
>  /* Should be called only if mem_alloc_profiling_enabled() */
> @@ -2443,7 +2446,8 @@ alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p,
>  #else /* CONFIG_MEM_ALLOC_PROFILING */
>
>  static inline void
> -alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags)
> +alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags,
> +                             unsigned int alloc_flags)
>  {
>  }
>
> @@ -2461,8 +2465,9 @@ alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p,
>  static void memcg_alloc_abort_single(struct kmem_cache *s, void *object);
>
>  static __fastpath_inline
> -bool memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> -                               gfp_t flags, size_t size, void **p)
> +bool memcg_slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
> +                               size_t size, void **p,
> +                               struct slab_alloc_context *ac)
>  {
>         if (likely(!memcg_kmem_online()))
>                 return true;
> @@ -2470,7 +2475,8 @@ bool memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>         if (likely(!(flags & __GFP_ACCOUNT) && !(s->flags & SLAB_ACCOUNT)))
>                 return true;
>
> -       if (likely(__memcg_slab_post_alloc_hook(s, lru, flags, size, p)))
> +       if (likely(__memcg_slab_post_alloc_hook(s, ac->lru, flags,
> +                                               ac->alloc_flags, size, p)))
>                 return true;
>
>         if (likely(size == 1)) {
> @@ -2558,14 +2564,15 @@ bool memcg_slab_post_charge(void *p, gfp_t flags)
>                 put_slab_obj_exts(obj_exts);
>         }
>
> -       return __memcg_slab_post_alloc_hook(s, NULL, flags, 1, &p);
> +       return __memcg_slab_post_alloc_hook(s, NULL, flags, SLAB_ALLOC_DEFAULT,
> +                                           1, &p);
>  }
>
>  #else /* CONFIG_MEMCG */
>  static inline bool memcg_slab_post_alloc_hook(struct kmem_cache *s,
> -                                             struct list_lru *lru,
> -                                             gfp_t flags, size_t size,
> -                                             void **p)
> +                                             gfp_t flags,
> +                                             size_t size, void **p,
> +                                             struct slab_alloc_context *ac)
>  {
>         return true;
>  }
> @@ -3352,12 +3359,14 @@ static inline void init_freelist_randomization(void) { }
>  #endif /* CONFIG_SLAB_FREELIST_RANDOM */
>
>  static __always_inline void account_slab(struct slab *slab, int order,
> -                                        struct kmem_cache *s, gfp_t gfp)
> +                                        struct kmem_cache *s, gfp_t gfp,
> +                                        unsigned int alloc_flags)
>  {
>         if (memcg_kmem_online() &&
>                         (s->flags & SLAB_ACCOUNT) &&
>                         !slab_obj_exts(slab))
> -               alloc_slab_obj_exts(slab, s, gfp, true);
> +               alloc_slab_obj_exts(slab, s, gfp,
> +                                   alloc_flags | SLAB_ALLOC_NEW_SLAB);
>
>         mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
>                             PAGE_SIZE << order);
> @@ -3434,7 +3443,7 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags,
>          * to prevent the array from being overwritten.
>          */
>         alloc_slab_obj_exts_early(s, slab);
> -       account_slab(slab, oo_order(oo), s, flags);
> +       account_slab(slab, oo_order(oo), s, flags, alloc_flags);
>
>         return slab;
>  }
> @@ -4568,9 +4577,8 @@ struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
>  }
>
>  static __fastpath_inline
> -bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> -                         gfp_t flags, size_t size, void **p,
> -                         unsigned int orig_size)
> +bool slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, size_t size,
> +                         void **p, struct slab_alloc_context *ac)

Would it be possible to make this last parameter a "const struct
slab_alloc_context *" (here and in other functions accepting it)? I
think these functions accept it as an input parameter only and are not
supposed to change it, right? That makes it easy to verify that
slab_alloc_context is not changed between consecutive calls reusing
it, for example inside slab_alloc_node().
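
A minimal userspace sketch of the const-qualification idea follows. It is
not the mm/slub.c code: the struct is a simplified stand-in, the helper
names post_alloc_zero_size() and allow_spinning() are hypothetical, and the
rule "spinning is allowed unless SLAB_ALLOC_TRYLOCK is set" is an assumption
based on the flag comments in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as defined in the quoted mm/slab.h hunk. */
#define SLAB_ALLOC_DEFAULT	0x00
#define SLAB_ALLOC_TRYLOCK	0x01

/* Simplified stand-in for the kernel's struct slab_alloc_context. */
struct slab_alloc_context {
	unsigned long caller_addr;
	unsigned long orig_size;
	unsigned int alloc_flags;
	void *lru;		/* struct list_lru * in the kernel */
};

/*
 * Hypothetical hook that only reads the context. Because 'ac' points to
 * const, a statement such as "ac->orig_size = 0;" would not compile here.
 */
static size_t post_alloc_zero_size(size_t object_size, int track_orig_size,
				   const struct slab_alloc_context *ac)
{
	return track_orig_size ? ac->orig_size : object_size;
}

/* Assumed semantics: only a trylock (kmalloc_nolock) context forbids spinning. */
static int allow_spinning(const struct slab_alloc_context *ac)
{
	return !(ac->alloc_flags & SLAB_ALLOC_TRYLOCK);
}
```

With this shape, a caller that builds one context and passes it to several
helpers in a row can rely on the compiler to reject any helper that tries to
write through the pointer.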

>  {
>         bool init = slab_want_init_on_alloc(flags, s);
>         unsigned int zero_size = s->object_size;
> @@ -4590,7 +4598,7 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>          * orig_size if we track it.
>          */
>         if (slub_debug_orig_size(s))
> -               zero_size = orig_size;
> +               zero_size = ac->orig_size;
>
>         /*
>          * When slab_debug is enabled, avoid memory initialization integrated
> @@ -4616,14 +4624,14 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>                                      !kasan_has_integrated_init())
>                                  && !is_kfence_address(p[i]))
>                         memset(p[i], 0, zero_size);
> -               if (gfpflags_allow_spinning(flags))
> +               if (alloc_flags_allow_spinning(ac->alloc_flags))
>                         kmemleak_alloc_recursive(p[i], s->object_size, 1,
>                                                  s->flags, init_flags);
>                 kmsan_slab_alloc(s, p[i], init_flags);
> -               alloc_tagging_slab_alloc_hook(s, p[i], flags);
> +               alloc_tagging_slab_alloc_hook(s, p[i], flags, ac->alloc_flags);
>         }
>
> -       return memcg_slab_post_alloc_hook(s, lru, flags, size, p);
> +       return memcg_slab_post_alloc_hook(s, flags, size, p, ac);
>  }
>
>  /*
> @@ -4918,6 +4926,12 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>  {
>         const unsigned int alloc_flags = SLAB_ALLOC_DEFAULT;
>         void *object;
> +       struct slab_alloc_context ac = {
> +               .caller_addr = addr,
> +               .orig_size = orig_size,
> +               .alloc_flags = alloc_flags,
> +               .lru = lru,
> +       };
>
>         s = slab_pre_alloc_hook(s, gfpflags);
>         if (unlikely(!s))
> @@ -4929,14 +4943,8 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>
>         object = alloc_from_pcs(s, gfpflags, alloc_flags, node);
>
> -       if (unlikely(!object)) {
> -               struct slab_alloc_context ac = {
> -                       .caller_addr = addr,
> -                       .orig_size = orig_size,
> -                       .alloc_flags = alloc_flags,
> -               };
> +       if (!object)

Any reason "unlikely" is removed?
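
For context, a small userspace sketch of what the hint does. The two macros
have the same shape as the kernel's likely()/unlikely() (they need GCC or
Clang for __builtin_expect); fast_path(), slow_path() and alloc_object() are
hypothetical stand-ins, not slab functions.

```c
#include <assert.h>
#include <stddef.h>

/* Branch hints: tell the compiler which outcome to lay out as the hot path. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

static int slow_calls;

/* Hypothetical fast path: returns an object on a hit, NULL on a miss. */
static void *fast_path(int hit)
{
	static int obj;

	return hit ? &obj : NULL;
}

/* Hypothetical slow path: always succeeds and counts how often it ran. */
static void *slow_path(void)
{
	static int obj;

	slow_calls++;
	return &obj;
}

static void *alloc_object(int fast_hit)
{
	void *object = fast_path(fast_hit);

	/* The hint influences code layout only; the result is unchanged. */
	if (unlikely(!object))
		object = slow_path();

	return object;
}
```

Dropping the hint therefore does not change behavior, only how the compiler
may arrange the fallback branch.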

>                 object = __slab_alloc_node(s, gfpflags, node, &ac);
> -       }
>
>         maybe_wipe_obj_freeptr(s, object);
>
> @@ -4945,7 +4953,7 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>          * In case this fails due to memcg_slab_post_alloc_hook(),
>          * object is set to NULL
>          */
> -       slab_post_alloc_hook(s, lru, gfpflags, 1, &object, orig_size);
> +       slab_post_alloc_hook(s, gfpflags, 1, &object, &ac);
>
>         return object;
>  }
> @@ -5240,6 +5248,10 @@ kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
>                                    struct slab_sheaf *sheaf)
>  {
>         void *ret = NULL;
> +       struct slab_alloc_context ac = {
> +               .orig_size = s->object_size,
> +               .alloc_flags = SLAB_ALLOC_DEFAULT,
> +       };
>
>         if (sheaf->size == 0)
>                 goto out;
> @@ -5250,7 +5262,7 @@ kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
>                 ret = sheaf->objects[--sheaf->size];
>
>         /* add __GFP_NOFAIL to force successful memcg charging */
> -       slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, s->object_size);
> +       slab_post_alloc_hook(s, gfp | __GFP_NOFAIL, 1, &ret, &ac);
>  out:
>         trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
>
> @@ -5437,7 +5449,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
>
>  success:
>         maybe_wipe_obj_freeptr(s, ret);
> -       slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret, orig_size);
> +       slab_post_alloc_hook(s, alloc_gfp, 1, &ret, &ac);
>
>         ret = kasan_kmalloc(s, ret, orig_size, alloc_gfp);
>         return ret;
> @@ -7303,6 +7315,10 @@ bool kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags,
>  {
>         unsigned int i = 0;
>         void *kfence_obj;
> +       struct slab_alloc_context ac = {
> +               .orig_size = s->object_size,
> +               .alloc_flags = SLAB_ALLOC_DEFAULT,
> +       };
>
>         if (!size)
>                 return false;
> @@ -7353,7 +7369,7 @@ bool kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags,
>
>  out:
>         /* memcg and kmem_cache debug support and memory initialization */
> -       return likely(slab_post_alloc_hook(s, NULL, flags, size, p, s->object_size));
> +       return likely(slab_post_alloc_hook(s, flags, size, p, &ac));
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
>
>
> --
> 2.54.0
>

^ permalink raw reply

* Re: [PATCH v2 10/16] mm/slab: replace slab_alloc_node() parameters with slab_alloc_context
From: Suren Baghdasaryan @ 2026-06-15  4:39 UTC (permalink / raw)
  To: Hao Li
  Cc: Vlastimil Babka (SUSE), Harry Yoo, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <aiuY1mDjpfCT94VY@fedora>

On Thu, Jun 11, 2026 at 10:29 PM Hao Li <hao.li@linux.dev> wrote:
>
> On Wed, Jun 10, 2026 at 05:40:12PM +0200, Vlastimil Babka (SUSE) wrote:
> > The function takes all the parameters that exist as fields in
> > slab_alloc_context, except alloc_flags. Replace them with a single
> > pointer.
> >
> > This moves slab_alloc_context initialization to a number of callers,
> > which is more verbose, but arguably also more clear than a long list of
> > parameters, and most do not use the 'lru' field.
> >
> > This will also allow kmalloc_nolock() to call slab_alloc_node() and
> > reduce the special open-coding it currently has.
> >
> > Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > ---
>
> Reviewed-by: Hao Li <hao.li@linux.dev>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

>
> --
> Thanks,
> Hao

^ permalink raw reply

* Re: [PATCH v2 11/16] mm/slab: allow kmem_cache_alloc_bulk() with any gfp flags
From: Suren Baghdasaryan @ 2026-06-15  4:48 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Harry Yoo, Christoph Lameter, David Rientjes,
	Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <3f53fc18-838f-44ab-acad-4323daa0fbcc@kernel.org>

On Fri, Jun 12, 2026 at 3:05 AM Vlastimil Babka (SUSE)
<vbabka@kernel.org> wrote:
>
> On 6/12/26 05:21, Hao Li wrote:
> > On Wed, Jun 10, 2026 at 05:40:13PM +0200, Vlastimil Babka (SUSE) wrote:
> >> The last user of gfpflags_allow_spinning() in slab is
> >> alloc_from_pcs_bulk(), which is only called from
> >> kmem_cache_alloc_bulk().
> >>
> >> It turns out that gfpflags_allow_spinning() is not necessary, because
> >> kmem_cache_alloc_bulk() is only expected to be called from context that
> >> does allow spinning, so simply replace it with 'true'.
> >>
> >> With that, we can remove the "@flags must allow spinning" part of the
> >> kernel doc, as there is no more connection to the gfp flags in the slab
> >> implementation.
> >>
> >> Also remove a comment in alloc_slab_obj_exts() because there should be
> >> no more false positives possible due to gfp_allowed_mask during early
> >> boot.
> >>
> >> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> >> ---
> >>  mm/slub.c | 11 ++---------
> >>  1 file changed, 2 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index 0b9974bfcb24..ef457e07db83 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -2171,12 +2171,6 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> >>
> >>      sz = obj_exts_alloc_size(s, slab, gfp);
> >>
> >> -    /*
> >> -     * Note that allow_spin may be false during early boot and its
> >> -     * restricted GFP_BOOT_MASK. Due to kmalloc_nolock() only supporting
> >> -     * architectures with cmpxchg16b, early obj_exts will be missing for
> >> -     * very early allocations on those.
> >> -     */
> >>      if (unlikely(!allow_spin))
> >>              vec = kmalloc_nolock(sz, __GFP_ZERO | __GFP_NO_OBJ_EXT,
> >>                                   slab_nid(slab));
> >> @@ -4867,7 +4861,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
> >>              }
> >>
> >>              full = barn_replace_empty_sheaf(barn, pcs->main,
> >> -                                            gfpflags_allow_spinning(gfp));
> >> +                                            /* allow_spin = */ true);
> >
> > we can remove the `gfp` arg as this function no longer use it.
>
> True, done!

I see it fixed in
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next,
so

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

>

^ permalink raw reply

* Re: [PATCH v2 12/16] mm/slab: pass slab_alloc_context to __do_kmalloc_node()
From: Suren Baghdasaryan @ 2026-06-15  4:58 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-12-7190909db118@kernel.org>

On Wed, Jun 10, 2026 at 8:41 AM Vlastimil Babka (SUSE)
<vbabka@kernel.org> wrote:
>
> With alloc_flags usage in slab, we can replace __GFP_NO_OBJ_EXT with an
> alloc flag that prevents kmalloc recursion. For that we need a version
> of kmalloc() that takes alloc_flags and use it in places that perform
> these potentially recursive kmalloc allocations (of sheaves or obj_ext
> arrays).
>
> As a preparatory step, make __do_kmalloc_node() take a pointer to
> slab_alloc_context. This replaces the 'caller' parameter and includes
> alloc_flags which we'll make use of.

I think you could also eliminate the "size" parameter of
__do_kmalloc_node(), as it's always the same as ac->orig_size.
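
Roughly the shape I have in mind, as a userspace sketch only. The struct
is a simplified stand-in for slab_alloc_context, the *_sketch names are
hypothetical, and malloc() stands in for the real allocation path:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the slab_alloc_context used in this series. */
struct slab_alloc_context {
	unsigned long caller_addr;
	size_t orig_size;
	unsigned int alloc_flags;
};

/*
 * Hypothetical helper: no separate "size" argument, the requested size
 * is read from ac->orig_size. malloc() stands in for the slab path.
 */
static void *do_kmalloc_node_sketch(int node,
				    const struct slab_alloc_context *ac)
{
	(void)node;	/* node selection is not modelled here */
	assert(ac != NULL);
	return malloc(ac->orig_size);
}

/* The wrapper fills in the context once, as the wrappers in the patch do. */
static void *kmalloc_sketch(size_t size, unsigned long caller)
{
	struct slab_alloc_context ac = {
		.caller_addr = caller,
		.orig_size = size,
		.alloc_flags = 0,	/* stands in for SLAB_ALLOC_DEFAULT */
	};

	return do_kmalloc_node_sketch(-1, &ac);
}
```

With the size carried only in the context, the wrappers cannot pass a
"size" that disagrees with ac->orig_size.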

>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
>  mm/slub.c | 47 ++++++++++++++++++++++++++++++++---------------
>  1 file changed, 32 insertions(+), 15 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index ef457e07db83..6845e15c148a 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5338,19 +5338,14 @@ EXPORT_SYMBOL(__kmalloc_large_node_noprof);
>
>  static __always_inline
>  void *__do_kmalloc_node(size_t size, kmem_buckets *b, gfp_t flags, int node,
> -                       unsigned long caller, kmalloc_token_t token)
> +                       kmalloc_token_t token, struct slab_alloc_context *ac)
>  {
>         struct kmem_cache *s;
>         void *ret;
> -       struct slab_alloc_context ac = {
> -               .caller_addr = caller,
> -               .orig_size = size,
> -               .alloc_flags = SLAB_ALLOC_DEFAULT,
> -       };
>
>         if (unlikely(size > KMALLOC_MAX_CACHE_SIZE)) {
>                 ret = __kmalloc_large_node_noprof(size, flags, node);
> -               trace_kmalloc(caller, ret, size,
> +               trace_kmalloc(ac->caller_addr, ret, size,
>                               PAGE_SIZE << get_order(size), flags, node);
>                 return ret;
>         }
> @@ -5360,22 +5355,34 @@ void *__do_kmalloc_node(size_t size, kmem_buckets *b, gfp_t flags, int node,
>
>         s = kmalloc_slab(size, b, flags, token);
>
> -       ret = slab_alloc_node(s, flags, node, &ac);
> +       ret = slab_alloc_node(s, flags, node, ac);
>         ret = kasan_kmalloc(s, ret, size, flags);
> -       trace_kmalloc(caller, ret, size, s->size, flags, node);
> +       trace_kmalloc(ac->caller_addr, ret, size, s->size, flags, node);
>         return ret;
>  }
>  void *__kmalloc_node_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags, int node)
>  {
> +       struct slab_alloc_context ac = {
> +               .caller_addr = _RET_IP_,
> +               .orig_size = size,
> +               .alloc_flags = SLAB_ALLOC_DEFAULT,
> +       };
> +
>         return __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), flags, node,
> -                                _RET_IP_, PASS_TOKEN_PARAM(token));
> +                                PASS_TOKEN_PARAM(token), &ac);
>  }
>  EXPORT_SYMBOL(__kmalloc_node_noprof);
>
>  void *__kmalloc_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t flags)
>  {
> -       return __do_kmalloc_node(size, NULL, flags,  NUMA_NO_NODE, _RET_IP_,
> -                                PASS_TOKEN_PARAM(token));
> +       struct slab_alloc_context ac = {
> +               .caller_addr = _RET_IP_,
> +               .orig_size = size,
> +               .alloc_flags = SLAB_ALLOC_DEFAULT,
> +       };
> +
> +       return __do_kmalloc_node(size, NULL, flags,  NUMA_NO_NODE,
> +                                PASS_TOKEN_PARAM(token), &ac);
>  }
>  EXPORT_SYMBOL(__kmalloc_noprof);
>
> @@ -5471,9 +5478,14 @@ EXPORT_SYMBOL_GPL(_kmalloc_nolock_noprof);
>  void *__kmalloc_node_track_caller_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags,
>                                          int node, unsigned long caller)
>  {
> -       return __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), flags, node,
> -                                caller, PASS_TOKEN_PARAM(token));
> +       struct slab_alloc_context ac = {
> +               .caller_addr = caller,
> +               .orig_size = size,
> +               .alloc_flags = SLAB_ALLOC_DEFAULT,
> +       };
>
> +       return __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), flags, node,
> +                                PASS_TOKEN_PARAM(token), &ac);
>  }
>  EXPORT_SYMBOL(__kmalloc_node_track_caller_noprof);
>
> @@ -6874,6 +6886,11 @@ void *__kvmalloc_node_noprof(DECL_KMALLOC_PARAMS(size, b, token), unsigned long
>  {
>         bool allow_block;
>         void *ret;
> +       struct slab_alloc_context ac = {
> +               .caller_addr = _RET_IP_,
> +               .orig_size = size,
> +               .alloc_flags = SLAB_ALLOC_DEFAULT,
> +       };
>
>         /*
>          * It doesn't really make sense to fallback to vmalloc for sub page
> @@ -6881,7 +6898,7 @@ void *__kvmalloc_node_noprof(DECL_KMALLOC_PARAMS(size, b, token), unsigned long
>          */
>         ret = __do_kmalloc_node(size, PASS_BUCKET_PARAM(b),
>                                 kmalloc_gfp_adjust(flags, size),
> -                               node, _RET_IP_, PASS_TOKEN_PARAM(token));
> +                               node, PASS_TOKEN_PARAM(token), &ac);
>         if (ret || size <= PAGE_SIZE)
>                 return ret;
>
>
> --
> 2.54.0
>


* Re: [PATCH v2 13/16] mm/slab: allow __GFP_NOMEMALLOC and __GFP_NOWARN for kmalloc_nolock()
From: Suren Baghdasaryan @ 2026-06-15  5:06 UTC (permalink / raw)
  To: Hao Li
  Cc: Vlastimil Babka (SUSE), Harry Yoo, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <aiuthBHdDb0CNs3n@fedora>

On Thu, Jun 11, 2026 at 11:57 PM Hao Li <hao.li@linux.dev> wrote:
>
> On Wed, Jun 10, 2026 at 05:40:15PM +0200, Vlastimil Babka (SUSE) wrote:
> > The two flags are added internally so there's no point for warning if
> > they are passed by the caller as well, so allow them. This will allow
> > simplifying obj_ext allocation under kmalloc_nolock().
> >
> > Also it's not necessary to have the extra alloc_gfp variable for adding
> > the two flags. The original gfp_flags parameter is not used anywhere
> > except for the warning. So remove alloc_gfp and directly modify and use
> > gfp_flags everywhere.
> >
> > Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > ---
>
> LGTM
> Reviewed-by: Hao Li <hao.li@linux.dev>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

>
> --
> Thanks,
> Hao


* Re: [PATCH v2 14/16] mm/slab: introduce kmalloc_flags()
From: Suren Baghdasaryan @ 2026-06-15  5:14 UTC (permalink / raw)
  To: Hao Li
  Cc: Vlastimil Babka (SUSE), Harry Yoo, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <aiu8W3EQhalCP9HW@fedora>

On Fri, Jun 12, 2026 at 1:03 AM Hao Li <hao.li@linux.dev> wrote:
>
> On Wed, Jun 10, 2026 at 05:40:16PM +0200, Vlastimil Babka (SUSE) wrote:
> > With alloc_flags usage in slab, we can replace __GFP_NO_OBJ_EXT with an
> > alloc flag that prevents kmalloc recursion. For that we need a version
> > of kmalloc() that takes alloc_flags and use it in places that perform
> > these potentially recursive kmalloc allocations (of sheaves or obj_ext
> > arrays).
> >
> > Add this function, named kmalloc_flags(). Right now it's only useful for
> > these nested allocations, so it doesn't need to optimize build-time
> > constant sizes like kmalloc() or kmalloc_buckets.
> >
> > Since we need it to support both normal and non-spinning
> > kmalloc_nolock() context through the SLAB_ALLOC_TRYLOCK flag, split out
> > most of the special _kmalloc_nolock_noprof() implementation to
> > __kmalloc_nolock_noprof() that takes a slab_alloc_context, and make
> > _kmalloc_nolock_noprof() a simple tail calling wrapper with the proper
> > context.
> >
> > kmalloc_flags() can thus determine whether to call
> > __kmalloc_nolock_noprof() or __do_kmalloc_node(), based on the
> > given alloc_flags.
> >
> > Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > ---
>
> Reviewed-by: Hao Li <hao.li@linux.dev>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

>
> --
> Thanks,
> Hao


* Re: [PATCH v2 15/16] mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
From: Suren Baghdasaryan @ 2026-06-15  5:38 UTC (permalink / raw)
  To: Hao Li
  Cc: Vlastimil Babka (SUSE), Harry Yoo, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Hao Ge
In-Reply-To: <aivpob0Zgnbc4AG4@fedora>

On Fri, Jun 12, 2026 at 4:30 AM Hao Li <hao.li@linux.dev> wrote:
>
> On Fri, Jun 12, 2026 at 12:17:45PM +0200, Vlastimil Babka (SUSE) wrote:
> > On 6/12/26 08:54, Hao Li wrote:
> > > On Wed, Jun 10, 2026 at 05:40:17PM +0200, Vlastimil Babka (SUSE) wrote:
> > >> __GFP_NO_OBJ_EXT has limited scope within the slab allocator itself and
> > >> gfp flags are a scarce resource, unlike slab's alloc_flags.
> > >>
> > >> Introduce SLAB_ALLOC_NO_RECURSE alloc flag that has the same intent as
> > >> __GFP_NO_OBJ_EXT but a more generic name, meaning that a kmalloc()
> > >> family function should not recurse into another kmalloc*() for the
> > >> purposes of allocating auxiliary structures (obj_ext arrays or sheaves).
> > >>
> > >> First, replace the __GFP_NO_OBJ_EXT for allocating obj_ext arrays in
> > >> alloc_slab_obj_exts(). Make use of the newly added kmalloc_flags()
> > >> function, where we can pass alloc_flags with SLAB_ALLOC_NO_RECURSE
> > >> added. This will also pass through SLAB_ALLOC_TRYLOCK so we don't need
> > >> to special case kmalloc_nolock() anymore.
> > >>
> > >> Note that until now the kmalloc_nolock() ignored the incoming gfp flags
> > >> and hardcoded __GFP_ZERO | __GFP_NO_OBJ_EXT. But it's correct to pass on
> > >> the incoming gfp flags (only augmented with __GFP_ZERO), because if
> > >> alloc_flags contain SLAB_ALLOC_TRYLOCK, the incoming gfp flags have to
> > >> be also compatible with it.
> > >>
> > >> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > >> ---
> > >>  mm/slab.h |  1 +
> > >>  mm/slub.c | 13 +++++--------
> > >>  2 files changed, 6 insertions(+), 8 deletions(-)
> > >>
> > >> diff --git a/mm/slab.h b/mm/slab.h
> > >> index 45bfcfb35a9c..509f330654b8 100644
> > >> --- a/mm/slab.h
> > >> +++ b/mm/slab.h
> > >> @@ -21,6 +21,7 @@
> > >>  #define SLAB_ALLOC_DEFAULT        0x00 /* no flags */
> > >>  #define SLAB_ALLOC_TRYLOCK        0x01 /* a kmalloc_nolock() allocation */
> > >>  #define SLAB_ALLOC_NEW_SLAB       0x02 /* a flag for alloc_slab_obj_exts() */
> > >> +#define SLAB_ALLOC_NO_RECURSE     0x04 /* prevent kmalloc() recursion */
> > >>
> > >>  static inline bool alloc_flags_allow_spinning(const unsigned int alloc_flags)
> > >>  {
> > >> diff --git a/mm/slub.c b/mm/slub.c
> > >> index cbb38bd01e46..7dfbd0251aa2 100644
> > >> --- a/mm/slub.c
> > >> +++ b/mm/slub.c
> > >> @@ -2167,15 +2167,12 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > >>
> > >>    gfp &= ~OBJCGS_CLEAR_MASK;
> > >>    /* Prevent recursive extension vector allocation */
> > >> -  gfp |= __GFP_NO_OBJ_EXT;
> > >> +  alloc_flags |= SLAB_ALLOC_NO_RECURSE;
> > >>
> > >>    sz = obj_exts_alloc_size(s, slab, gfp);
> > >>
> > >
> > > For the original calls to kmalloc_nolock and kmalloc_node, I notice a difference:
> > >
> > >> -  if (unlikely(!allow_spin))
> > >> -          vec = kmalloc_nolock(sz, __GFP_ZERO | __GFP_NO_OBJ_EXT,
> > >> -                               slab_nid(slab));
> > >
> > > kmalloc_nolock completely discarded `gfp` flags.
> > >
> > >> -  else
> > >> -          vec = kmalloc_node(sz, gfp | __GFP_ZERO, slab_nid(slab));
> > >
> > > while kmalloc_node preserved and passed it along.
> > >
> > >> +  /* This will use kmalloc_nolock() if alloc_flags say so */
> > >> +  vec = kmalloc_flags(sz, gfp | __GFP_ZERO, alloc_flags, slab_nid(slab));
> > >
> > > Now both paths are merged into kmalloc_flags, the gfp flags are
> > > unconditionally carried through. It seems this might carry some unwanted flags.
> > >
> > > I traced the call path and found that ___slab_alloc sets the __GFP_THISNODE
> > > for trynode_flags. If this flag propagates all the way into
> > > kmalloc_flags->...->__kmalloc_nolock_noprof, it will trigger the
> > > VM_WARN_ON_ONCE warning. Maybe we need to strip the original gfp if
> > > `!allow_spin`.
> >
> > Thanks. This should do the job in a more generic way I hope?
> >
>
> Yeah, this is more elegant.
>
> > diff --git a/mm/slub.c b/mm/slub.c
> > index f9b8dc56bb57..0bf53f70c9be 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2047,12 +2047,15 @@ static inline void dec_slabs_node(struct kmem_cache *s, int node,
> >  #endif /* CONFIG_SLUB_DEBUG */
> >
> >  /*
> > - * The allocated objcg pointers array is not accounted directly.
> > + * The allocated objcg pointers array or sheaf is not accounted directly.
> >   * Moreover, it should not come from DMA buffer and is not readily
> > - * reclaimable. So those GFP bits should be masked off.
> > + * reclaimable. Node restriction for the parent allocation also should
> > + * not apply to the slab's internal objects.
> > + * So those GFP bits should be masked off.
> >   */
> >  #define OBJCGS_CLEAR_MASK      (__GFP_DMA | __GFP_RECLAIMABLE | \
> > -                               __GFP_ACCOUNT | __GFP_NOFAIL)
> > +                               __GFP_ACCOUNT | __GFP_NOFAIL |
> > +                               __GFP_THISNODE )
>
> Good idea! Both code and comments make sense to me.

Makes sense. I see that
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next
already implements this and also keeps both __GFP_NO_OBJ_EXT and
SLAB_ALLOC_NO_RECURSE in use. That version looks good to me, so
I'll wait for v3.
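
For anyone following along, the masking idea can be sketched in
userspace like this. The SK_* bit values are made up for illustration
and the function name is hypothetical; only the set of masked flags
follows the OBJCGS_CLEAR_MASK change quoted above:

```c
#include <assert.h>

/* Made-up bit values, for illustration only. */
#define SK_GFP_DMA		0x01u
#define SK_GFP_RECLAIMABLE	0x02u
#define SK_GFP_ACCOUNT		0x04u
#define SK_GFP_NOFAIL		0x08u
#define SK_GFP_THISNODE		0x10u
#define SK_GFP_ZERO		0x20u
#define SK_GFP_NOWARN		0x40u

/* Flags of the parent allocation that must not reach the nested one. */
#define SK_CLEAR_MASK	(SK_GFP_DMA | SK_GFP_RECLAIMABLE | \
			 SK_GFP_ACCOUNT | SK_GFP_NOFAIL | \
			 SK_GFP_THISNODE)

/* Derive the flags for the nested obj_exts allocation from the parent's. */
static unsigned int obj_exts_gfp_sketch(unsigned int gfp)
{
	gfp &= ~SK_CLEAR_MASK;
	/* The node restriction of the parent must be gone by now. */
	assert((gfp & SK_GFP_THISNODE) == 0);
	return gfp | SK_GFP_ZERO;
}
```

Because the mask is applied before the call, both the spinning and the
non-spinning path see the same filtered flags, so no special case for
!allow_spin is needed.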

At the end of this series, we end up with no users of __GFP_NO_OBJ_EXT
but we still keep it defined. I'm guessing you leave it because of the
new patch [1] which aliases __GFP_NO_OBJ_EXT? I will have to make that
mechanism work without a GFP flag, possibly using a similar approach.
CC'ing Hao Ge to be in the loop of these changes. I'll work with him
on eliminating that __GFP_NO_OBJ_EXT alias.

[1] https://lore.kernel.org/all/20260604024008.46592-1-hao.ge@linux.dev/

>
> >
> >  #ifdef CONFIG_SLAB_OBJ_EXT
> >
> >
>
> --
> Thanks,
> Hao


* Re: [syzbot] [cgroups?] INFO: task hung in cgroup_subtree_control_write (2)
From: syzbot @ 2026-06-15  8:05 UTC (permalink / raw)
  To: cgroups, hannes, linux-kernel, mkoutny, syzkaller-bugs, tj
In-Reply-To: <6a23a4b4.e4db5ad2.3b7dfb.0000.GAE@google.com>

syzbot has found a reproducer for the following issue on:

HEAD commit:    ec039126b7fa Add linux-next specific files for 20260611
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=178e637a580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=d0d1fa2afcbce17c
dashboard link: https://syzkaller.appspot.com/bug?extid=bb2e19a1190a556c01b1
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10b0d4ae580000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/be809c32a471/disk-ec039126.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/13ea1053e3b5/vmlinux-ec039126.xz
kernel image: https://storage.googleapis.com/syzbot-assets/1bcdab47ddc8/bzImage-ec039126.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+bb2e19a1190a556c01b1@syzkaller.appspotmail.com

INFO: task syz.1.18:6037 blocked for more than 143 seconds.
      Not tainted syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz.1.18        state:D stack:27936 pid:6037  tgid:6035  ppid:6000   task_flags:0x400140 flags:0x00080002
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5504 [inline]
 __schedule+0x172b/0x5550 kernel/sched/core.c:7228
 __schedule_loop kernel/sched/core.c:7307 [inline]
 schedule+0x164/0x360 kernel/sched/core.c:7322
 cgroup_lock_and_drain_offline+0x516/0x650 kernel/cgroup/cgroup.c:3275
 cgroup_kn_lock_live+0x120/0x230 kernel/cgroup/cgroup.c:1715
 cgroup_subtree_control_write+0x4b3/0x10a0 kernel/cgroup/cgroup.c:3577
 cgroup_file_write+0x331/0x8f0 kernel/cgroup/cgroup.c:4316
 kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:345
 new_sync_write fs/read_write.c:595 [inline]
 vfs_write+0x629/0xba0 fs/read_write.c:687
 ksys_write+0x156/0x270 fs/read_write.c:739
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f3fb3b0ce59
RSP: 002b:00007f3fb3145028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f3fb3d86090 RCX: 00007f3fb3b0ce59
RDX: 0000000000000005 RSI: 0000200000000040 RDI: 0000000000000006
RBP: 00007f3fb3ba2d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f3fb3d86128 R14: 00007f3fb3d86090 R15: 00007ffe868740f8
 </TASK>
INFO: task syz.2.19:6081 blocked for more than 143 seconds.
      Not tainted syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz.2.19        state:D stack:28344 pid:6081  tgid:6079  ppid:6045   task_flags:0x400140 flags:0x00080002
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5504 [inline]
 __schedule+0x172b/0x5550 kernel/sched/core.c:7228
 __schedule_loop kernel/sched/core.c:7307 [inline]
 schedule+0x164/0x360 kernel/sched/core.c:7322
 cgroup_lock_and_drain_offline+0x516/0x650 kernel/cgroup/cgroup.c:3275
 cgroup_kn_lock_live+0x120/0x230 kernel/cgroup/cgroup.c:1715
 cgroup_subtree_control_write+0x4b3/0x10a0 kernel/cgroup/cgroup.c:3577
 cgroup_file_write+0x331/0x8f0 kernel/cgroup/cgroup.c:4316
 kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:345
 new_sync_write fs/read_write.c:595 [inline]
 vfs_write+0x629/0xba0 fs/read_write.c:687
 ksys_write+0x156/0x270 fs/read_write.c:739
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fcd96f6ce59
RSP: 002b:00007fcd965ad028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fcd971e6090 RCX: 00007fcd96f6ce59
RDX: 0000000000000005 RSI: 0000200000000300 RDI: 0000000000000006
RBP: 00007fcd97002d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fcd971e6128 R14: 00007fcd971e6090 R15: 00007ffc255c33a8
 </TASK>
INFO: task syz.2.19:6082 blocked for more than 143 seconds.
      Not tainted syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz.2.19        state:D stack:28920 pid:6082  tgid:6079  ppid:6045   task_flags:0x400040 flags:0x00080002
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5504 [inline]
 __schedule+0x172b/0x5550 kernel/sched/core.c:7228
 __schedule_loop kernel/sched/core.c:7307 [inline]
 rt_mutex_schedule+0x76/0xf0 kernel/sched/core.c:7603
 rt_mutex_slowlock_block+0x505/0x670 kernel/locking/rtmutex.c:1669
 __rt_mutex_slowlock kernel/locking/rtmutex.c:1746 [inline]
 __rt_mutex_slowlock_locked kernel/locking/rtmutex.c:1786 [inline]
 rt_mutex_slowlock+0x2dc/0x780 kernel/locking/rtmutex.c:1826
 __rt_mutex_lock kernel/locking/rtmutex.c:1841 [inline]
 __mutex_lock_common kernel/locking/rtmutex_api.c:560 [inline]
 mutex_lock_nested+0x168/0x1d0 kernel/locking/rtmutex_api.c:578
 fdget_pos+0x252/0x320 fs/file.c:1259
 class_fd_pos_constructor include/linux/file.h:85 [inline]
 ksys_write+0x79/0x270 fs/read_write.c:730
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fcd96f6ce59
RSP: 002b:00007fcd9658c028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fcd971e6180 RCX: 00007fcd96f6ce59
RDX: 0000000000000005 RSI: 0000200000000040 RDI: 0000000000000006
RBP: 00007fcd97002d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fcd971e6218 R14: 00007fcd971e6180 R15: 00007ffc255c33a8
 </TASK>

Showing all locks held in the system:
3 locks held by kworker/1:0/32:
 #0: ffff88813fe57938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x897/0x1630 kernel/workqueue.c:3301
 #1: ffffc90000a6fc40 (deferred_process_work){+.+.}-{0:0}, at: process_one_work+0x8be/0x1630 kernel/workqueue.c:3302
 #2: ffffffff8f7b0bb8 (rtnl_mutex){+.+.}-{4:4}, at: switchdev_deferred_process_work+0xe/0x20 net/switchdev/switchdev.c:104
1 lock held by khungtaskd/39:
 #0: ffffffff8e3cb2a0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:300 [inline]
 #0: ffffffff8e3cb2a0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:840 [inline]
 #0: ffffffff8e3cb2a0 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x2e/0x180 kernel/locking/lockdep.c:6777
2 locks held by getty/5366:
 #0: ffff88802940e0a0 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x25/0x70 drivers/tty/tty_ldisc.c:243
 #1: ffffc90003cbe2e0 (&ldata->atomic_read_lock){+.+.}-{4:4}, at: n_tty_read+0x465/0x1490 drivers/tty/n_tty.c:2211
3 locks held by kworker/u8:18/5822:
 #0: ffff888032abd938 ((wq_completion)bat_events){+.+.}-{0:0}, at: process_one_work+0x897/0x1630 kernel/workqueue.c:3301
 #1: ffffc90003db7c40 ((work_completion)(&(&bat_priv->dat.work)->work)){+.+.}-{0:0}, at: process_one_work+0x8be/0x1630 kernel/workqueue.c:3302
 #2: ffffffff8e3cb2a0 (rcu_read_lock){....}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
3 locks held by syz.1.18/6037:
 #0: ffff88803a43f128 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2728 [inline]
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: vfs_write+0x22d/0xba0 fs/read_write.c:683
 #2: ffff8880436a3c78 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1df/0x540 fs/kernfs/file.c:336
3 locks held by syz.2.19/6081:
 #0: ffff88803b865f28 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2728 [inline]
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: vfs_write+0x22d/0xba0 fs/read_write.c:683
 #2: ffff88803697fc78 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1df/0x540 fs/kernfs/file.c:336
1 lock held by syz.2.19/6082:
 #0: ffff88803b865f28 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
3 locks held by syz.3.20/6122:
 #0: ffff888037837f28 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2728 [inline]
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: vfs_write+0x22d/0xba0 fs/read_write.c:683
 #2: ffff888036d05878 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1df/0x540 fs/kernfs/file.c:336
1 lock held by syz.3.20/6123:
 #0: ffff888037837f28 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
3 locks held by syz.4.21/6167:
 #0: ffff88803bb00d28 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2728 [inline]
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: vfs_write+0x22d/0xba0 fs/read_write.c:683
 #2: ffff888033038478 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1df/0x540 fs/kernfs/file.c:336
1 lock held by syz.4.21/6168:
 #0: ffff88803bb00d28 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
3 locks held by syz.5.22/6214:
 #0: ffff88803a467328 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2728 [inline]
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: vfs_write+0x22d/0xba0 fs/read_write.c:683
 #2: ffff8880408bb078 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1df/0x540 fs/kernfs/file.c:336
1 lock held by syz.5.22/6216:
 #0: ffff88803a467328 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
3 locks held by syz.6.23/6264:
 #0: ffff888028ed0328 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2728 [inline]
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: vfs_write+0x22d/0xba0 fs/read_write.c:683
 #2: ffff88803c173c78 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1df/0x540 fs/kernfs/file.c:336
1 lock held by syz.6.23/6265:
 #0: ffff888028ed0328 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
3 locks held by syz.7.24/6310:
 #0: ffff8880296e8328 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2728 [inline]
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: vfs_write+0x22d/0xba0 fs/read_write.c:683
 #2: ffff88803e05a878 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1df/0x540 fs/kernfs/file.c:336
1 lock held by syz.7.24/6311:
 #0: ffff8880296e8328 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
3 locks held by syz.8.25/6356:
 #0: ffff8880257b2728 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2728 [inline]
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: vfs_write+0x22d/0xba0 fs/read_write.c:683
 #2: ffff888031468078 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1df/0x540 fs/kernfs/file.c:336
1 lock held by syz.8.25/6357:
 #0: ffff8880257b2728 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
3 locks held by syz.9.26/6407:
 #0: ffff888028ea4528 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2728 [inline]
 #1: ffff8880344f2500 (sb_writers#9){.+.+}-{0:0}, at: vfs_write+0x22d/0xba0 fs/read_write.c:683
 #2: ffff888032e47078 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1df/0x540 fs/kernfs/file.c:336
1 lock held by syz.9.26/6408:
 #0: ffff888028ea4528 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1259
3 locks held by syz-executor/6411:
 #0: ffffffff8f7b0bb8 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock net/core/rtnetlink.c:80 [inline]
 #0: ffffffff8f7b0bb8 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_nets_lock net/core/rtnetlink.c:341 [inline]
 #0: ffffffff8f7b0bb8 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x883/0x1bb0 net/core/rtnetlink.c:4150
 #1: ffffffff8e3cb2a0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:300 [inline]
 #1: ffffffff8e3cb2a0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:840 [inline]
 #1: ffffffff8e3cb2a0 (rcu_read_lock){....}-{1:3}, at: ib_device_get_by_netdev+0x81/0x4f0 drivers/infiniband/core/device.c:2357
 #2: ffff8880b8724540 (psi_seq){-...}-{0:0}, at: psi_task_change+0xd4/0x340 kernel/sched/psi.c:919

=============================================

NMI backtrace for cpu 0
CPU: 0 UID: 0 PID: 39 Comm: khungtaskd Not tainted syzkaller #0 PREEMPT_{RT,(full)} 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 nmi_cpu_backtrace+0x274/0x2d0 lib/nmi_backtrace.c:122
 nmi_trigger_cpumask_backtrace+0x17a/0x380 lib/nmi_backtrace.c:65
 trigger_all_cpu_backtrace include/linux/nmi.h:162 [inline]
 __sys_info lib/sys_info.c:157 [inline]
 sys_info+0x135/0x170 lib/sys_info.c:165
 check_hung_uninterruptible_tasks kernel/hung_task.c:353 [inline]
 watchdog+0xfd3/0x1030 kernel/hung_task.c:561
 kthread+0x388/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 UID: 0 PID: 29 Comm: rcuc/1 Not tainted syzkaller #0 PREEMPT_{RT,(full)} 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
RIP: 0010:mark_lock+0x9a/0x190 kernel/locking/lockdep.c:4776
Code: 0e 00 75 13 48 8d 3d e5 1a 31 0e 48 c7 c6 51 2b a9 8d 67 48 0f b9 3a 90 31 c9 4c 89 fe 4c 89 f7 b8 01 00 00 00 85 69 60 74 10 <5b> 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc 49 89 fe 49 89 f7
RSP: 0018:ffffc90000a3f970 EFLAGS: 00000006
RAX: 0000000000000001 RBX: 0000000000000009 RCX: ffffffff93640d68
RDX: 0000000000000008 RSI: ffff88801e694a30 RDI: ffff88801e693e00
RBP: 0000000000000200 R08: ffffffff81883dac R09: ffffffff8e3cb2a0
R10: dffffc0000000000 R11: fffffbfff1f9e71f R12: 0000000000000003
R13: ffff88801e694a30 R14: ffff88801e693e00 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff888125b6b000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2e59a3b6b0 CR3: 000000003ef78000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 mark_usage kernel/locking/lockdep.c:4676 [inline]
 __lock_acquire+0x6b5/0x2d10 kernel/locking/lockdep.c:5193
 lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
 rcu_lock_acquire include/linux/rcupdate.h:300 [inline]
 rcu_read_lock include/linux/rcupdate.h:840 [inline]
 __local_bh_disable_ip+0x205/0x420 kernel/softirq.c:174
 local_bh_disable include/linux/bottom_half.h:20 [inline]
 rcu_cpu_kthread+0x214/0x1470 kernel/rcu/tree.c:2978
 smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160
 kthread+0x388/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>


---
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.


* Re: [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: David Hildenbrand (Arm) @ 2026-06-15  8:08 UTC (permalink / raw)
  To: Farhad Alemi, Andrew Morton, Waiman Long
  Cc: Farhad Alemi, Gregory Price, Yury Norov, Joshua Hahn, Zi Yan,
	Matthew Brost, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Rasmus Villemoes, linux-mm, linux-kernel,
	cgroups, stable
In-Reply-To: <CA+0ovCgfHJHv5d1mzapWWvF-LhjppzDX8NPPLvCPZxPKg8RiYw@mail.gmail.com>

On 6/14/26 15:25, Farhad Alemi wrote:

Hi, thanks for your patch!

For the future, please don't submit new revisions as replies to previous submissions.

> Creating a child cpuset where cpuset.mems is never set leads to a div/0
> when a VMA mempolicy with MPOL_F_RELATIVE_NODES rebinds in response to a
> CPU hotplug event.
> 
> Reproduction steps:
>  1) Create a cgroup w/ cpuset controls (do not set cpuset.mems)
>  2) Move the task into the child cpuset
>  3) Create a VMA mempolicy for that task with MPOL_F_RELATIVE_NODES
>  4) unplug and hotplug a cpu
>       echo 0 > /sys/devices/system/cpu/cpu1/online
>       echo 1 > /sys/devices/system/cpu/cpu1/online
>  5) mempolicy rebind does a div/0 in mpol_relative_nodemask on the
>     call to __nodes_fold()
> 
> The cpuset code passes (cs->mems_allowed) which is not guaranteed to have
> nodes to the rebind routine.  Use cs->effective_mems instead, which is
> guaranteed to have a non-empty nodemask.

Probably worth mentioning here that this makes the linked reproducer happy.
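
To illustrate why an empty mask is fatal here, a userspace sketch of
the fold step on a 64-bit mask. The names are hypothetical and the
sz == 0 guard is part of the sketch, not something the kernel helper
does; the fold itself maps every set bit to "bit % sz", where sz is the
weight of the relative mask:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fold every set bit of orig into the range [0, sz) via "bit % sz".
 * Returns false instead of dividing by zero when sz == 0, which is
 * the situation an empty nodemask produces.
 */
static bool fold_mask_sketch(uint64_t orig, unsigned int sz, uint64_t *dst)
{
	assert(dst != NULL);

	if (sz == 0)
		return false;

	*dst = 0;
	for (unsigned int bit = 0; bit < 64; bit++) {
		if (orig & (UINT64_C(1) << bit))
			*dst |= UINT64_C(1) << (bit % sz);
	}
	return true;
}
```

With sz taken from a mask that is guaranteed to be non-empty, the
modulo is always well defined.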

> 
> Link: https://lore.kernel.org/linux-mm/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/

This should be a

Closes:
https://lore.kernel.org/linux-mm/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/

> Link: https://lore.kernel.org/all/CA+0ovCiEz6SP_sn3kN4Tb+_oC=eHMXy_Ffj=usV3wREdQrUtww@mail.gmail.com/
> Fixes: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}")
> Suggested-by: Gregory Price <gourry@gourry.net>
> Suggested-by: Waiman Long <longman@redhat.com>
> Signed-off-by: Farhad Alemi <farhad.alemi@berkeley.edu>
> Cc: stable@vger.kernel.org
> ---
> v2: rebind to cs->effective_mems instead of newmems (Waiman Long);
>     condense the changelog.
> 
>  kernel/cgroup/cpuset.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
> 
>  		migrate = is_memory_migrate(cs);
> 
> -		mpol_rebind_mm(mm, &cs->mems_allowed);
> +		mpol_rebind_mm(mm, &cs->effective_mems);

God this is confusing.

So, we obtain newmems from guarantee_online_mems(), which guarantees that
newmems is non-empty.

In cpuset_change_task_nodemask(), we set tsk->mems_allowed to newmems, and call
mpol_rebind_task(tsk, newmems).

So at least tsk->mems_allowed should be non-empty.

Then we call mpol_rebind_mm(mm, &cs->mems_allowed);


Naturally I wonder: Why are we not using "task->mems_allowed" (maybe cs vs. tsk
was the original bug?), which is effectively just newmems?

guarantee_online_mems() computes newmems as "cs->effective_mems &
node_states[N_MEMORY]", but walks up to the parent if it would be empty.
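
As a userspace sketch of that walk, with 64-bit masks standing in for
nodemask_t and hypothetical names throughout (this is my reading of the
logic, not the kernel code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct cpuset: one mask and a parent pointer. */
struct cpuset_sketch {
	uint64_t effective_mems;
	const struct cpuset_sketch *parent;
};

/*
 * Return effective_mems restricted to nodes that have memory. If that
 * would be empty, walk up the hierarchy until an ancestor intersects.
 */
static uint64_t online_mems_sketch(const struct cpuset_sketch *cs,
				   uint64_t nodes_with_memory)
{
	assert(cs != NULL);

	while (cs->parent && !(cs->effective_mems & nodes_with_memory))
		cs = cs->parent;

	return cs->effective_mems & nodes_with_memory;
}
```

The result is non-empty as long as some ancestor's effective_mems
intersects the nodes with memory, which is what makes newmems safe to
hand to the rebind code.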

-- 
Cheers,

David

