Linux cgroups development
* Re: [PATCH v2 15/16] mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
From: Hao Li @ 2026-06-12  6:54 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-15-7190909db118@kernel.org>

On Wed, Jun 10, 2026 at 05:40:17PM +0200, Vlastimil Babka (SUSE) wrote:
> __GFP_NO_OBJ_EXT has limited scope within the slab allocator itself and
> gfp flags are a scarce resource, unlike slab's alloc_flags.
> 
> Introduce SLAB_ALLOC_NO_RECURSE alloc flag that has the same intent as
> __GFP_NO_OBJ_EXT but a more generic name, meaning that a kmalloc()
> family function should not recurse into another kmalloc*() for the
> purposes of allocating auxiliary structures (obj_ext arrays or sheaves).
> 
> First, replace the __GFP_NO_OBJ_EXT for allocating obj_ext arrays in
> alloc_slab_obj_exts(). Make use of the newly added kmalloc_flags()
> function, where we can pass alloc_flags with SLAB_ALLOC_NO_RECURSE
> added. This will also pass through SLAB_ALLOC_TRYLOCK so we don't need
> to special case kmalloc_nolock() anymore.
> 
> Note that until now the kmalloc_nolock() ignored the incoming gfp flags
> and hardcoded __GFP_ZERO | __GFP_NO_OBJ_EXT. But it's correct to pass on
> the incoming gfp flags (only augmented with __GFP_ZERO), because if
> alloc_flags contain SLAB_ALLOC_TRYLOCK, the incoming gfp flags have to
> be also compatible with it.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
>  mm/slab.h |  1 +
>  mm/slub.c | 13 +++++--------
>  2 files changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index 45bfcfb35a9c..509f330654b8 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -21,6 +21,7 @@
>  #define SLAB_ALLOC_DEFAULT	0x00 /* no flags */
>  #define SLAB_ALLOC_TRYLOCK	0x01 /* a kmalloc_nolock() allocation */
>  #define SLAB_ALLOC_NEW_SLAB	0x02 /* a flag for alloc_slab_obj_exts() */
> +#define SLAB_ALLOC_NO_RECURSE	0x04 /* prevent kmalloc() recursion */
>  
>  static inline bool alloc_flags_allow_spinning(const unsigned int alloc_flags)
>  {
> diff --git a/mm/slub.c b/mm/slub.c
> index cbb38bd01e46..7dfbd0251aa2 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2167,15 +2167,12 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
>  
>  	gfp &= ~OBJCGS_CLEAR_MASK;
>  	/* Prevent recursive extension vector allocation */
> -	gfp |= __GFP_NO_OBJ_EXT;
> +	alloc_flags |= SLAB_ALLOC_NO_RECURSE;
>  
>  	sz = obj_exts_alloc_size(s, slab, gfp);
>  

For the original calls to kmalloc_nolock and kmalloc_node, I notice a difference:

> -	if (unlikely(!allow_spin))
> -		vec = kmalloc_nolock(sz, __GFP_ZERO | __GFP_NO_OBJ_EXT,
> -				     slab_nid(slab));

kmalloc_nolock() completely discarded the incoming `gfp` flags,

> -	else
> -		vec = kmalloc_node(sz, gfp | __GFP_ZERO, slab_nid(slab));

while kmalloc_node() preserved them and passed them along.

> +	/* This will use kmalloc_nolock() if alloc_flags say so */
> +	vec = kmalloc_flags(sz, gfp | __GFP_ZERO, alloc_flags, slab_nid(slab));

Now that both paths are merged into kmalloc_flags(), the gfp flags are
unconditionally carried through. It seems this might carry some unwanted flags.

I traced the call path and found that ___slab_alloc() sets __GFP_THISNODE
in trynode_flags. If this flag propagates all the way into
kmalloc_flags->...->__kmalloc_nolock_noprof, it will trigger the
VM_WARN_ON_ONCE warning. Maybe we need to strip the original gfp flags if
`!allow_spin`.
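
A minimal user-space sketch of the stripping idea, not a proposed patch.
The DEMO_GFP_* values and the helper name obj_exts_gfp() are hypothetical,
and deriving allow_spin from SLAB_ALLOC_TRYLOCK is my assumption about how
alloc_flags_allow_spinning() behaves:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only, not the kernel's real gfp bits. */
typedef unsigned int gfp_t;
#define DEMO_GFP_ZERO      0x01u
#define DEMO_GFP_THISNODE  0x02u
#define DEMO_GFP_NOWARN    0x04u

#define SLAB_ALLOC_TRYLOCK    0x01u /* value quoted from mm/slab.h above */
#define SLAB_ALLOC_NO_RECURSE 0x04u /* value quoted from mm/slab.h above */

/* Hypothetical helper: pick the gfp flags for the obj_exts vector. */
static gfp_t obj_exts_gfp(gfp_t gfp, unsigned int alloc_flags)
{
	/* Assumption: spinning is allowed unless TRYLOCK is set. */
	bool allow_spin = !(alloc_flags & SLAB_ALLOC_TRYLOCK);

	/* Mirror the old kmalloc_nolock() call site: drop the caller's gfp. */
	if (!allow_spin)
		gfp = 0;

	return gfp | DEMO_GFP_ZERO;
}
```

With this shape, a __GFP_THISNODE coming from trynode_flags would never
reach the nolock path, while the spinning path keeps the caller's flags.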

-- 
Thanks,
Hao


* Re: [PATCH v2 12/16] mm/slab: pass slab_alloc_context to __do_kmalloc_node()
From: Hao Li @ 2026-06-12  5:34 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-12-7190909db118@kernel.org>

On Wed, Jun 10, 2026 at 05:40:14PM +0200, Vlastimil Babka (SUSE) wrote:
> With alloc_flags usage in slab, we can replace __GFP_NO_OBJ_EXT with an
> alloc flag that prevents kmalloc recursion. For that we need a version
> of kmalloc() that takes alloc_flags and use it in places that perform
> these potentially recursive kmalloc allocations (of sheaves or obj_ext
> arrays).
> 
> As a preparatory step, make __do_kmalloc_node() take a pointer to
> slab_alloc_context. This replaces the 'caller' parameter and includes
> alloc_flags which we'll make use of.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Reviewed-by: Hao Li <hao.li@linux.dev>

-- 
Thanks,
Hao


* Re: [PATCH v2 10/16] mm/slab: replace slab_alloc_node() parameters with slab_alloc_context
From: Hao Li @ 2026-06-12  5:28 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-10-7190909db118@kernel.org>

On Wed, Jun 10, 2026 at 05:40:12PM +0200, Vlastimil Babka (SUSE) wrote:
> The function takes all the parameters that exist as fields in
> slab_alloc_context, except alloc_flags. Replace them with a single
> pointer.
> 
> This moves slab_alloc_context initialization to a number of callers,
> which is more verbose, but arguably also more clear than a long list of
> parameters, and most do not use the 'lru' field.
> 
> This will also allow kmalloc_nolock() to call slab_alloc_node() and
> reduce the special open-coding it currently has.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Reviewed-by: Hao Li <hao.li@linux.dev>

-- 
Thanks,
Hao


* Re: [PATCH v2 08/16] mm/slab: pass alloc_flags to new slab allocation
From: Hao Li @ 2026-06-12  5:26 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-8-7190909db118@kernel.org>

On Wed, Jun 10, 2026 at 05:40:10PM +0200, Vlastimil Babka (SUSE) wrote:
> Add the alloc_flags parameter to allocate_slab() and new_slab()
> so it can be used to determine if spinning is allowed, independently
> from gfp flags.
> 
> refill_objects() passes SLAB_ALLOC_DEFAULT because it can only be
> reached from contexts that allow spinning.
> 
> Also change how trynode_flags are constructed in ___slab_alloc() to
> achieve the same "do not upgrade to GFP_NOWAIT" by using masking instead
> of a branch. It will now also not upgrade in cases where gfp is weaker
> than GFP_NOWAIT (i.e. lacks __GFP_KSWAPD_RECLAIM) but doesn't come from
> kmalloc_nolock() - which is more correct anyway.
> 
> During the masking keep also existing __GFP_NOMEMALLOC (pointed out by
> Sashiko) and __GFP_ACCOUNT. Previously the hardcoded GFP_NOWAIT would
> eliminate them, but it's not a big problem that would need a separate
> fix.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
>  mm/slub.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 98b79e5e7679..8f6ca3d5fdfa 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3378,9 +3378,10 @@ static __always_inline void unaccount_slab(struct slab *slab, int order,
>  }
>  
>  /* Allocate and initialize a slab without building its freelist. */
> -static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> +static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags,
> +				  unsigned int alloc_flags, int node)
>  {
> -	bool allow_spin = gfpflags_allow_spinning(flags);
> +	bool allow_spin = alloc_flags_allow_spinning(alloc_flags);

nit: allow_spin doesn't depend on `flags` now, so it seems we can delete the
comment:

/*
 * __GFP_RECLAIM could be cleared on the first allocation attempt,
 * so pass allow_spin flag directly.
 */
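
To illustrate why that comment no longer applies, here is a minimal
user-space sketch of the two checks. The DEMO_GFP_* bit values are
illustrative, and the body of demo_alloc_flags_allow_spinning() is only my
assumption of what the new helper does:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;
/* Illustrative bit values, not the kernel's. */
#define DEMO_GFP_DIRECT_RECLAIM 0x1u
#define DEMO_GFP_KSWAPD_RECLAIM 0x2u
#define DEMO_GFP_RECLAIM (DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM)

#define SLAB_ALLOC_DEFAULT 0x00u
#define SLAB_ALLOC_TRYLOCK 0x01u

/* Old-style check: infers "no spinning" from the absence of reclaim bits. */
static bool demo_gfpflags_allow_spinning(gfp_t gfp)
{
	return (gfp & DEMO_GFP_RECLAIM) != 0;
}

/* Assumed shape of the new check: only the explicit flag matters. */
static bool demo_alloc_flags_allow_spinning(unsigned int alloc_flags)
{
	return !(alloc_flags & SLAB_ALLOC_TRYLOCK);
}
```

Clearing the reclaim bits changes the first result but not the second, so
there is nothing left for the comment to warn about.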

Otherwise, looks good to me.
Reviewed-by: Hao Li <hao.li@linux.dev>

-- 
Thanks,
Hao


* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Zenghui Yu @ 2026-06-12  5:09 UTC (permalink / raw)
  To: Gregory Price
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, guanhao.wang
In-Reply-To: <ainFROZ3WrGioyuY@gourry-fedora-PF4VCD3F>

[ trim the Cc list ]

Hi Gregory,

On 2026/6/11 4:12, Gregory Price wrote:

> I will still probably send the next RFC version tomorrow or friday,
> as I want to get some eyes on the __GFP_PRIVATE-less pattern.

Could you please Cc me on the next version? I would appreciate that and
would be happy to follow this work.

Thanks,
Zenghui


* Re: [PATCH v2 07/16] mm/slab: replace struct partial_context with slab_alloc_context
From: Hao Li @ 2026-06-12  4:04 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-7-7190909db118@kernel.org>

On Wed, Jun 10, 2026 at 05:40:09PM +0200, Vlastimil Babka (SUSE) wrote:
> Refactor get_from_partial_node(), get_from_any_partial(),
> get_from_partial() and ___slab_alloc().
> 
> Remove struct partial_context, which used to be more substantial but
> shrank as part of the sheaves conversion. Instead pass gfp_flags and
> pointer to the new slab_alloc_context, which together is a superset of
> partial_context.
> 
> This means alloc_flags are now available and we can use them to
> determine if spinning is allowed, further reducing false positive "not
> allowed" in the slow path due to gfp flags lacking __GFP_RECLAIM.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
>  mm/slub.c | 52 ++++++++++++++++++++++++----------------------------
>  1 file changed, 24 insertions(+), 28 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index ef745b37d063..98b79e5e7679 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -220,12 +220,6 @@ struct slab_alloc_context {
>  	unsigned int alloc_flags;
>  };
>  
> -/* Structure holding parameters for get_from_partial() call chain */
> -struct partial_context {
> -	gfp_t flags;
> -	unsigned int orig_size;
> -};
> -
>  /* Structure holding parameters for get_partial_node_bulk() */
>  struct partial_bulk_context {
>  	gfp_t flags;
> @@ -3826,7 +3820,8 @@ static bool get_partial_node_bulk(struct kmem_cache *s,
>   */
>  static void *get_from_partial_node(struct kmem_cache *s,
>  				   struct kmem_cache_node *n,
> -				   struct partial_context *pc)
> +				   gfp_t gfp_flags,
> +				   struct slab_alloc_context *ac)
>  {
>  	struct slab *slab, *slab2;
>  	unsigned long flags;
> @@ -3841,7 +3836,7 @@ static void *get_from_partial_node(struct kmem_cache *s,
>  	if (!n || !n->nr_partial)
>  		return NULL;
>  
> -	if (gfpflags_allow_spinning(pc->flags))
> +	if (alloc_flags_allow_spinning(ac->alloc_flags))
>  		spin_lock_irqsave(&n->list_lock, flags);
>  	else if (!spin_trylock_irqsave(&n->list_lock, flags))
>  		return NULL;
> @@ -3849,12 +3844,12 @@ static void *get_from_partial_node(struct kmem_cache *s,
>  
>  		struct freelist_counters old, new;
>  
> -		if (!pfmemalloc_match(slab, pc->flags))
> +		if (!pfmemalloc_match(slab, gfp_flags))
>  			continue;
>  
>  		if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
>  			object = alloc_single_from_partial(s, n, slab,
> -							pc->orig_size);
> +							ac->orig_size);
>  			if (object)
>  				break;
>  			continue;
> @@ -3888,15 +3883,16 @@ static void *get_from_partial_node(struct kmem_cache *s,
>  /*
>   * Get an object from somewhere. Search in increasing NUMA distances.
>   */
> -static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *pc)
> +static void *get_from_any_partial(struct kmem_cache *s, gfp_t gfp_flags,
> +				  struct slab_alloc_context *ac)
>  {
>  #ifdef CONFIG_NUMA
>  	struct zonelist *zonelist;
>  	struct zoneref *z;
>  	struct zone *zone;
> -	enum zone_type highest_zoneidx = gfp_zone(pc->flags);
> +	enum zone_type highest_zoneidx = gfp_zone(gfp_flags);
>  	unsigned int cpuset_mems_cookie;
> -	bool allow_spin = gfpflags_allow_spinning(pc->flags);
> +	bool allow_spin = alloc_flags_allow_spinning(ac->alloc_flags);
>  
>  	/*
>  	 * The defrag ratio allows a configuration of the tradeoffs between
> @@ -3930,16 +3926,17 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
>  		if (allow_spin)
>  			cpuset_mems_cookie = read_mems_allowed_begin();
>  
> -		zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
> +		zonelist = node_zonelist(mempolicy_slab_node(), gfp_flags);
>  		for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
>  			struct kmem_cache_node *n;
>  
>  			n = get_node(s, zone_to_nid(zone));
>  
> -			if (n && cpuset_zone_allowed(zone, pc->flags) &&
> +			if (n && cpuset_zone_allowed(zone, gfp_flags) &&
>  					n->nr_partial > s->min_partial) {
>  
> -				void *object = get_from_partial_node(s, n, pc);
> +				void *object = get_from_partial_node(s, n,
> +								gfp_flags, ac);
>  
>  				if (object) {
>  					/*
> @@ -3961,8 +3958,8 @@ static void *get_from_any_partial(struct kmem_cache *s, struct partial_context *
>  /*
>   * Get an object from a partial slab
>   */
> -static void *get_from_partial(struct kmem_cache *s, int node,
> -			      struct partial_context *pc)
> +static void *get_from_partial(struct kmem_cache *s, int node, gfp_t flags,
> +			      struct slab_alloc_context *ac)
>  {
>  	int searchnode = node;
>  	void *object;
> @@ -3970,11 +3967,11 @@ static void *get_from_partial(struct kmem_cache *s, int node,
>  	if (node == NUMA_NO_NODE)
>  		searchnode = numa_mem_id();
>  
> -	object = get_from_partial_node(s, get_node(s, searchnode), pc);
> -	if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
> +	object = get_from_partial_node(s, get_node(s, searchnode), flags, ac);
> +	if (object || (node != NUMA_NO_NODE && (flags & __GFP_THISNODE)))
>  		return object;
>  
> -	return get_from_any_partial(s, pc);
> +	return get_from_any_partial(s, flags, ac);
>  }
>  
>  static bool has_pcs_used(int cpu, struct kmem_cache *s)
> @@ -4454,16 +4451,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  			   struct slab_alloc_context *ac)
>  {
>  	bool allow_spin = alloc_flags_allow_spinning(ac->alloc_flags);
> +	gfp_t trynode_flags;
>  	void *object;
>  	struct slab *slab;
> -	struct partial_context pc;
>  	bool try_thisnode = true;
>  
>  	stat(s, ALLOC_SLOWPATH);
>  
>  new_objects:
>  
> -	pc.flags = gfpflags;
> +	trynode_flags = gfpflags;
>  	/*
>  	 * When a preferred node is indicated but no __GFP_THISNODE
>  	 *
> @@ -4479,17 +4476,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  		     && try_thisnode)) {
>  		if (unlikely(!allow_spin))
>  			/* Do not upgrade gfp to NOWAIT from more restrictive mode */
> -			pc.flags = gfpflags | __GFP_THISNODE;
> +			trynode_flags = gfpflags | __GFP_THISNODE;
>  		else
> -			pc.flags = GFP_NOWAIT | __GFP_THISNODE;
> +			trynode_flags = GFP_NOWAIT | __GFP_THISNODE;

nit: the comment "__GFP_THISNODE in pc.flags" also needs to be updated to "trynode_flags"
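
For reference, the quoted branch can be modeled in user space as below.
This is only a sketch: the DEMO_GFP_* bit values are illustrative, and
demo_trynode_flags() is a hypothetical stand-in for the inline code in
___slab_alloc():

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;
/* Illustrative bit values, not the kernel's. */
#define DEMO_GFP_KSWAPD_RECLAIM 0x1u
#define DEMO_GFP_NOWARN         0x2u
#define DEMO_GFP_THISNODE       0x4u
#define DEMO_GFP_NOWAIT (DEMO_GFP_KSWAPD_RECLAIM | DEMO_GFP_NOWARN)

/* Models the quoted branch that builds trynode_flags. */
static gfp_t demo_trynode_flags(gfp_t gfpflags, bool allow_spin)
{
	/* Do not upgrade gfp to NOWAIT from a more restrictive mode. */
	if (!allow_spin)
		return gfpflags | DEMO_GFP_THISNODE;

	return DEMO_GFP_NOWAIT | DEMO_GFP_THISNODE;
}
```

In both cases __GFP_THISNODE ends up in trynode_flags, which is what the
comment should describe.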

Otherwise, looks good to me.
Reviewed-by: Hao Li <hao.li@linux.dev>

-- 
Thanks,
Hao


* Re: [PATCH v2 06/16] mm/slab: add alloc_flags to slab_alloc_context
From: Hao Li @ 2026-06-12  3:50 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-6-7190909db118@kernel.org>

On Wed, Jun 10, 2026 at 05:40:08PM +0200, Vlastimil Babka (SUSE) wrote:
> Add alloc_flags as a new field to the slab_alloc_context helper struct,
> so we can pass it to more functions in the slab implementation without
> adding another function parameter.
> 
> Start checking them via alloc_flags_allow_spinning() in
> alloc_single_from_new_slab() (where we can drop the allow_spin
> parameter) and ___slab_alloc(). This further reduces false-positive
> spinning-not-allowed from allocations that are not kmalloc_nolock() but
> lack __GFP_RECLAIM flags.
> 
> _kmalloc_nolock_noprof() initializes ac.alloc_flags using its flags that
> are SLAB_ALLOC_TRYLOCK. slab_alloc_node() and __kmem_cache_alloc_bulk()
> are not reachable from kmalloc_nolock() and all their callers expect
> spinning to be allowed, so they can use SLAB_ALLOC_DEFAULT. This is
> temporary as the scope of slab_alloc_context will further move to the
> callers, making the alloc_flags usage more obvious.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Reviewed-by: Hao Li <hao.li@linux.dev>

-- 
Thanks,
Hao


* Re: [PATCH v2 05/16] mm/slab: introduce alloc_flags and SLAB_ALLOC_TRYLOCK
From: Hao Li @ 2026-06-12  3:49 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-5-7190909db118@kernel.org>

On Wed, Jun 10, 2026 at 05:40:07PM +0200, Vlastimil Babka (SUSE) wrote:
> Similarly to the page allocators, introduce slab-allocator specific
> alloc flags that internally control allocation behavior in addition to
> gfp_flags, without occupying the limited gfp flags space.
> 
> Introduce the first flag SLAB_ALLOC_TRYLOCK that behaves similarly to
> page allocator's ALLOC_TRYLOCK and will be used to reimplement
> kmalloc_nolock()'s "!allow_spin" behavior. That currently relies on
> gfpflags_allow_spinning() and thus the lack of both __GFP_RECLAIM flags,
> importantly __GFP_KSWAPD_RECLAIM. This can give false-positive results
> e.g. in early boot with a restricted gfp_allowed_mask.
> 
> Also introduce alloc_flags_allow_spinning() to replace the usage of
> gfpflags_allow_spinning().
> 
> Start using alloc_flags and the new check first in alloc_from_pcs() and
> __pcs_replace_empty_main(). This means some slab allocations that were
> falsely treated as kmalloc_nolock() due to their gfp flags will now have
> higher chances of succeeding, and this will further increase with followup
> changes.
> 
> Remove a WARN_ON_ONCE() from refill_objects() as it's now legitimate to
> reach it from a slab allocation that's not _nolock() and yet lacks
> __GFP_KSWAPD_RECLAIM for other reasons.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Reviewed-by: Hao Li <hao.li@linux.dev>

-- 
Thanks,
Hao


* Re: [PATCH v2 03/16] mm/slab: stop inlining __slab_alloc_node()
From: Hao Li @ 2026-06-12  3:48 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-3-7190909db118@kernel.org>

On Wed, Jun 10, 2026 at 05:40:05PM +0200, Vlastimil Babka (SUSE) wrote:
> With sheaves, this is no longer part of the allocation fastpath.  For
> the same reason, also mark the call to it from slab_alloc_node() as
> unlikely().
> 
> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Reviewed-by: Hao Li <hao.li@linux.dev>

-- 
Thanks,
Hao


* Re: [PATCH v2 01/16] mm/slab: do not limit zeroing to orig_size when only red zoning is enabled
From: Hao Li @ 2026-06-12  3:47 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, stable
In-Reply-To: <20260610-slab_alloc_flags-v2-1-7190909db118@kernel.org>

On Wed, Jun 10, 2026 at 05:40:03PM +0200, Vlastimil Babka (SUSE) wrote:
> When init (zeroing) on allocation is requested, for kmalloc() we
> generally have to zero the full object size even if a smaller size is
> requested, in order to provide krealloc()'s __GFP_ZERO guarantees.
> 
> But if we track the requested size, krealloc() uses that information to
> do the right thing. With red zoning also enabled, any unused size
> became part of the red zone, so it must not be zeroed.
> 
> However the check is imprecise, and will trigger also when only
> SLAB_RED_ZONE is enabled without SLAB_STORE_USER. This means enabling
> red zoning alone can compromise krealloc()'s __GFP_ZERO contract.
> 
> Fix this by using slub_debug_orig_size() instead, which is the exact
> check for whether the requested size is tracked. We don't need to care
> if red zoning is also enabled or not. Also update and expand the
> comment accordingly.
> 
> Fixes: 9ce67395f5a0 ("mm/slub: only zero requested size of buffer for kzalloc when debug enabled")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Reviewed-by: Hao Li <hao.li@linux.dev>

-- 
Thanks,
Hao


* Re: [PATCH v2 11/16] mm/slab: allow kmem_cache_alloc_bulk() with any gfp flags
From: Hao Li @ 2026-06-12  3:21 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-11-7190909db118@kernel.org>

On Wed, Jun 10, 2026 at 05:40:13PM +0200, Vlastimil Babka (SUSE) wrote:
> The last user of gfpflags_allow_spinning() in slab is
> alloc_from_pcs_bulk(), which is only called from
> kmem_cache_alloc_bulk().
> 
> It turns out that gfpflags_allow_spinning() is not necessary, because
> kmem_cache_alloc_bulk() is only expected to be called from context that
> does allow spinning, so simply replace it with 'true'.
> 
> With that, we can remove the "@flags must allow spinning" part of the
> kernel doc, as there is no more connection to the gfp flags in the slab
> implementation.
> 
> Also remove a comment in alloc_slab_obj_exts() because there should be
> no more false positives possible due to gfp_allowed_mask during early
> boot.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
>  mm/slub.c | 11 ++---------
>  1 file changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 0b9974bfcb24..ef457e07db83 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2171,12 +2171,6 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
>  
>  	sz = obj_exts_alloc_size(s, slab, gfp);
>  
> -	/*
> -	 * Note that allow_spin may be false during early boot and its
> -	 * restricted GFP_BOOT_MASK. Due to kmalloc_nolock() only supporting
> -	 * architectures with cmpxchg16b, early obj_exts will be missing for
> -	 * very early allocations on those.
> -	 */
>  	if (unlikely(!allow_spin))
>  		vec = kmalloc_nolock(sz, __GFP_ZERO | __GFP_NO_OBJ_EXT,
>  				     slab_nid(slab));
> @@ -4867,7 +4861,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
>  		}
>  
>  		full = barn_replace_empty_sheaf(barn, pcs->main,
> -						gfpflags_allow_spinning(gfp));
> +						/* allow_spin = */ true);

We can remove the `gfp` argument, as this function no longer uses it.

>  
>  		if (full) {
>  			stat(s, BARN_GET);
> @@ -7333,8 +7327,7 @@ static bool __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
>   * Allocate @size objects from @s and places them into @p.  @size must be larger
>   * than 0.
>   *
> - * Interrupts must be enabled when calling this function and @flags must allow
> - * spinning.
> + * Interrupts must be enabled when calling this function.
>   *
>   * Unlike alloc_pages_bulk(), this function does not check for already allocated
>   * objects in @p, and thus the caller does not need to zero it.
> 
> -- 
> 2.54.0
> 


* Re: [PATCH v2 04/16] mm/slab: introduce slab_alloc_context
From: Hao Li @ 2026-06-12  3:10 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-4-7190909db118@kernel.org>

On Wed, Jun 10, 2026 at 05:40:06PM +0200, Vlastimil Babka (SUSE) wrote:
> Similarly to page allocator's struct alloc_context, introduce a helper
> struct to hold a part of the allocation arguments. This will allow
> reducing the number of parameters in many functions of the
> implementation, and extend them easily if needed.
> 
> For now, make it hold the caller address and the originally requested
> allocation size.
> 
> Convert alloc_single_from_new_slab(), __slab_alloc_node() and
> ___slab_alloc(). No functional change intended.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
>  mm/slub.c | 46 +++++++++++++++++++++++++++++++++-------------
>  1 file changed, 33 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 7b48c0d38404..a3cac7281cc6 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -213,6 +213,12 @@ DEFINE_STATIC_KEY_FALSE(slub_debug_enabled);
>  static DEFINE_STATIC_KEY_FALSE(strict_numa);
>  #endif
>  
> +/* Structure holding extra parameters for slab allocations */
> +struct slab_alloc_context {
> +	unsigned long caller_addr;
> +	unsigned long orig_size;
> +};
> +
>  /* Structure holding parameters for get_from_partial() call chain */
>  struct partial_context {
>  	gfp_t flags;
> @@ -3687,7 +3693,8 @@ static inline void init_slab_obj_iter(struct kmem_cache *s, struct slab *slab,
>   * and put the slab to the partial (or full) list.
>   */
>  static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> -					int orig_size, bool allow_spin)
> +					struct slab_alloc_context *ac,
> +					bool allow_spin)
>  {
>  	struct kmem_cache_node *n;
>  	struct slab_obj_iter iter;
> @@ -3705,7 +3712,7 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
>  	/* alloc_debug_processing() always expects a valid freepointer */
>  	set_freepointer(s, object, slab->freelist);
>  
> -	if (!alloc_debug_processing(s, slab, object, orig_size)) {
> +	if (!alloc_debug_processing(s, slab, object, ac->orig_size)) {
>  		/*
>  		 * It's not really expected that this would fail on a
>  		 * freshly allocated slab, but a concurrent memory
> @@ -4443,7 +4450,7 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
>   * slab.
>   */
>  static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> -			   unsigned long addr, unsigned int orig_size)
> +			   struct slab_alloc_context *ac)
>  {
>  	bool allow_spin = gfpflags_allow_spinning(gfpflags);
>  	void *object;
> @@ -4476,7 +4483,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  			pc.flags = GFP_NOWAIT | __GFP_THISNODE;
>  	}
>  
> -	pc.orig_size = orig_size;
> +	pc.orig_size = ac->orig_size;
>  	object = get_from_partial(s, node, &pc);
>  	if (object)
>  		goto success;
> @@ -4496,7 +4503,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	stat(s, ALLOC_SLAB);
>  
>  	if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> -		object = alloc_single_from_new_slab(s, slab, orig_size, allow_spin);
> +		object = alloc_single_from_new_slab(s, slab, ac, allow_spin);
>  
>  		if (likely(object))
>  			goto success;
> @@ -4514,13 +4521,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  
>  success:
>  	if (kmem_cache_debug_flags(s, SLAB_STORE_USER))
> -		set_track(s, object, TRACK_ALLOC, addr, gfpflags);
> +		set_track(s, object, TRACK_ALLOC, ac->caller_addr, gfpflags);
>  
>  	return object;
>  }
>  
>  static void *__slab_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node,
> -			       unsigned long addr, size_t orig_size)
> +			       struct slab_alloc_context *ac)
>  {
>  	void *object;
>  
> @@ -4545,7 +4552,7 @@ static void *__slab_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	}
>  #endif
>  
> -	object = ___slab_alloc(s, gfpflags, node, addr, orig_size);
> +	object = ___slab_alloc(s, gfpflags, node, ac);
>  
>  	return object;
>  }
> @@ -4923,8 +4930,13 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>  
>  	object = alloc_from_pcs(s, gfpflags, node);
>  
> -	if (unlikely(!object))
> -		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
> +	if (unlikely(!object)) {
> +		struct slab_alloc_context ac = {
> +			.caller_addr = addr,
> +			.orig_size = orig_size,
> +		};
> +		object = __slab_alloc_node(s, gfpflags, node, &ac);
> +	}
>  
>  	maybe_wipe_obj_freeptr(s, object);
>  
> @@ -5389,13 +5401,18 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
>  	if (ret)
>  		goto success;
>  
> +	struct slab_alloc_context ac = {
> +		.caller_addr = _RET_IP_,
> +		.orig_size = orig_size,
> +	};

It might be better to move this to the beginning of the function, so that
patch 09 cannot jump to `success` before `ac` is initialized.
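
In C, a goto that jumps forward over a declaration skips its initializer,
so the object is in scope but holds an indeterminate value. A minimal
user-space sketch of the hoisted form follows; demo_alloc() and
demo_track() are hypothetical stand-ins, and 'fast_hit' plays the role of
the early "if (ret) goto success;":

```c
#include <assert.h>

/* Stand-in for the struct added by this patch. */
struct slab_alloc_context {
	unsigned long caller_addr;
	unsigned long orig_size;
};

/* Hypothetical consumer that runs on the success path. */
static unsigned long demo_track(const struct slab_alloc_context *ac)
{
	return ac->orig_size;
}

static unsigned long demo_alloc(int fast_hit, unsigned long orig_size)
{
	/* Declared before any goto, so every path sees initialized fields. */
	struct slab_alloc_context ac = {
		.caller_addr = 0,
		.orig_size = orig_size,
	};

	if (fast_hit)
		goto success;

	/* The slow path would run here and may also use &ac. */

success:
	return demo_track(&ac);
}
```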

> +
>  	/*
>  	 * Do not call slab_alloc_node(), since trylock mode isn't
>  	 * compatible with slab_pre_alloc_hook/should_failslab and
>  	 * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
>  	 * and slab_post_alloc_hook() directly.
>  	 */
> -	ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, orig_size);
> +	ret = __slab_alloc_node(s, alloc_gfp, node, &ac);
>  
>  	/*
>  	 * It's possible we failed due to trylock as we preempted someone with
> @@ -7237,10 +7254,13 @@ static bool __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
>  	int i;
>  
>  	if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> +		struct slab_alloc_context ac = {
> +			.caller_addr = _RET_IP_,
> +			.orig_size = s->object_size,
> +		};
>  		for (i = 0; i < size; i++) {
>  
> -			p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_,
> -					     s->object_size);
> +			p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, &ac);
>  			if (unlikely(!p[i]))
>  				goto error;
>  
> 
> -- 
> 2.54.0
> 


* Re: [PATCH v3 0/7] sched: Flatten the pick
From: Shubhang Kaushik @ 2026-06-12  2:29 UTC (permalink / raw)
  To: Peter Zijlstra, mingo
  Cc: longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef
In-Reply-To: <20260605105513.354837583@infradead.org>

Hello Peter,

I applied the `sched/flat` patchset from your tree on top of the
`tip/sched/core` base commit 9ebe5c3c29f62 (7.1-rc2).

The evaluation was performed on an 80-core, Ampere Altra
system running Fedora Linux 41.

Benchmark runs:

1. Hackbench (Execution time in seconds: lower is better)
     The data reveals a clear architectural pivot point at 4 tasks:
     - Low Concurrency (< 4 tasks): Regresses by +1.8% to +4.0%.
       Removing cgroup isolation boundaries expands the idle CPU search,
       adding slight overhead to the wake-up path.
       * 1 Thread:  (+1.8%)
       * 2 Threads: (+4.0%)
       * 2 Procs:   (+3.3%)
     - Tipping Point (4 tasks): Performance is completely flat.
       * 4 Threads: (+0.03%)
       * 4 Procs:   (+0.1%)
     - High Concurrency (>= 8 tasks): Improves by -0.7% to -2.3%.
       Collapsing the tree structure down to a flat layout removes
       multi-layer load tracking updates (update_load_avg), saving cycles
       under load.
       * 8 Threads:  (-0.7%)
       * 16 Threads: (-1.8%)
       * 8 Procs:    (-1.2%)
       * 16 Procs:   (-2.3%)
       * 32 Procs:   (-1.6%)

2. Schbench (Wakeup Tail Latency)
     - 16 Threads (128kb footprint): 99.9th percentile tail latency drops
       significantly by -12.21% (us). Operating on a unified runqueue layer
       prevents induced group-level throttling.
     - 32 Threads (128kb footprint): 99.9th percentile tail latency
       regresses by +5.50% (us). Eliminating nested queues increases lock
       contention during heavy simultaneous wakeups.

3. Sysbench
     - Sysbench RAM: Throughput increases by +1.55% (MiB/sec). Fewer tree
       traversals reduce cache-line bouncing, freeing up cycles.

The patchset trades minor low-load performance for better scaling and
tighter tail latencies under distributed load. However, the majority of
these deltas remain small and sit near the measurement noise floor (<=
4%).

Regards,
Shubhang Kaushik

On Fri, 5 Jun 2026, Peter Zijlstra wrote:

>
> Hi!
>
> New version, same story [1]. TL;DR:
>
> - Adds new cgroup_mode knob and implements new policies to address the
>   hierarchy level weight mismatch.
>
> - Builds upon that base to create a flat / single runqueue scheduler where the
>   cgroup hierarchy is expressed through dynamic weight management.
>
> I'm hoping to be able to merge these patches early in the next cycle (after
> 7.2-rc1).
>
> Random benchmark:
>
> Game vs 'for ((i=0; i<8; i++)) do nice ./spin.sh; done':
>
>  Lutris / GE-Proton10-34 / Steam Runtime 3 (sniper)
>  Intel Core i7-2600K
>  AMD Radeon RX 580
>
>  Shadows Awakening (GOG)
>
> 	  default slice(*)
>
>  FPS min   4.0   29.0
>      avg  47.5   59.2
>      max  83.7   83.7
>
>  FT  min   9.3   10.2
>      avg  34.0   17.0
>      max 121.2   30.0
>
>  FPS (Frames Per Second)
>  FT  (FrameTime)
>
>  [*] Command prefix: 'chrt -o --sched-runtime 100000 0'
>
>
> Changes since v2:
>
> - merged debug and prep patches
> - fixed update_entity_lag() on dequeue (Vincent)
> - fixed throttle vs tick (Prateek)
> - fixed wakeup_preempt_fair()
> - rebased on tip/sched/core
> - rewritten cgroup_mode changelogs
> - reworked cgroup_mode concur
> - added cgroup_mode tasks
> - changed default cgroup_mode
>
>
> [1] - https://lore.kernel.org/r/20260511113104.563854162@infradead.org
>
> Can also be had:
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/flat
>
> include/linux/cpuset.h |    6
> include/linux/sched.h  |    1
> kernel/cgroup/cpuset.c |   15
> kernel/sched/core.c    |    5
> kernel/sched/debug.c   |   89 ++++
> kernel/sched/fair.c    |  943 ++++++++++++++++++++++++-------------------------
> kernel/sched/pelt.c    |    6
> kernel/sched/sched.h   |   30 -
> 8 files changed, 607 insertions(+), 488 deletions(-)
>
>


* Re: [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair()
From: Aaron Lu @ 2026-06-12  2:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef,
	svens
In-Reply-To: <20260611113219.GG187714@noisy.programming.kicks-ass.net>

Hi Peter,

On Thu, Jun 11, 2026 at 01:32:19PM +0200, Peter Zijlstra wrote:
> 
> Aaron,
> 
> Sorry I failed to notice this email earlier.
> 

Never mind.

> On Wed, Jun 03, 2026 at 05:51:08PM +0800, Aaron Lu wrote:
> 
> > I applied below diff and the problem is gone:
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 5f48af700fd44..942a543af3e54 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9897,6 +9897,9 @@ static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
> >  	return p;
> >  
> >  idle:
> > +	if (sched_core_enabled(rq))
> > +		return NULL;
> > +
> >  	new_tasks = sched_balance_newidle(rq, rf);
> >  	if (new_tasks < 0)
> >  		return RETRY_TASK;
> > 
> 
> Right, this is the safe patch and restores pick_task_fair() to its
> previous status (for core-sched).
> 
> Since people are hitting this problem, I'm going to merge it as below.
> I've presumed your SoB, please let me know if that's a problem.

No problem.

> 
> I think I'm going to try and move newidle into sched_class::balance /
> balance_fair(), but I'll do that next cycle.

I'll surely test it then.

Best regards,
Aaron


* Re: [PATCH v3 3/7] sched/fair: Add cgroup_mode: max
From: Waiman Long @ 2026-06-11 20:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, chenridong, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef
In-Reply-To: <20260611134724.GK48970@noisy.programming.kicks-ass.net>

On 6/11/26 9:47 AM, Peter Zijlstra wrote:
> On Wed, Jun 10, 2026 at 11:09:59AM -0400, Waiman Long wrote:
>
>>> --- a/kernel/cgroup/cpuset.c
>>> +++ b/kernel/cgroup/cpuset.c
>>> @@ -4116,6 +4116,21 @@ bool cpuset_cpus_allowed_fallback(struct
>>>    	return changed;
>>>    }
>>> +int cpuset_num_cpus(struct cgroup *cgrp)
>>> +{
>>> +	int nr = num_online_cpus();
>>> +	struct cpuset *cs;
>>> +
>>> +	if (is_in_v2_mode()) {
>>> +		guard(rcu)();
>>> +		cs = css_cs(cgroup_e_css(cgrp, &cpuset_cgrp_subsys));
>>> +		if (cs)
>>> +			nr = cpumask_weight(cs->effective_cpus);
>>> +	}
>>> +
>>> +	return nr;
>>> +}
>> I just have a question about cgroup v1 support. I am assuming that cgroup v1
>> without the cpuset_v2_mode mount option is not supported.
> Correct.
>
>> To fully support
>> cgroup v1, you may have to use guarantee_active_cpus() to return the actual
>> set of CPUs that the task can run on.
> Except this is group based, we'd need an iteration of all tasks in the
> group and compute a union of guarantee_active_cpus(). Which all seems
> far too expensive and not worth the effort.
I thought so.
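
For illustration only, a userspace sketch of what a per-group union would
involve. It models each task's allowed CPUs as a 64-bit mask supplied by the
caller; `cpu_mask_t`, `mask_weight()` and `group_num_cpus()` are hypothetical
names and no kernel API is called:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpu_mask_t;

/* Count the set bits in a mask. */
static unsigned int mask_weight(cpu_mask_t m)
{
	unsigned int n = 0;

	while (m) {
		m &= m - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Union the per-task masks of a group and return the CPU count.
 * The cost is linear in nr_tasks, which is the expense noted above.
 */
unsigned int group_num_cpus(const cpu_mask_t *task_masks, size_t nr_tasks)
{
	cpu_mask_t all = 0;
	size_t i;

	assert(task_masks != NULL || nr_tasks == 0);
	for (i = 0; i < nr_tasks; i++)
		all |= task_masks[i];

	return mask_weight(all);
}
```

The walk would have to be redone whenever a task joins or leaves the group or
changes its affinity, whereas the v2 path reads a single effective_cpus mask.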
>
>> Also there is a caveat about the arm64 specific
>> task_cpu_possible_mask() for certain arm64 CPUs. That is for 32-bit
>> binary running on 64-bit core which are allowed only on a selected
>> subset of cores within the CPU.
>>
>> This is probably not what you want to focus on right now, but it will be
>> good to have a comment to list items that are not fully supported here.
> Will add a comment!
Thanks,
Longman



* Re: [PATCH v6 6/6] drm/amdgpu: Wire up dmem cgroup reclaim for VRAM manager
From: Thomas Hellström @ 2026-06-11 19:41 UTC (permalink / raw)
  To: intel-xe
  Cc: Natalie Vock, Johannes Weiner, Tejun Heo, Michal Koutný,
	cgroups, Huang Rui, Matthew Brost, Matthew Auld,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <20260611173301.17473-7-thomas.hellstrom@linux.intel.com>

On Thu, 2026-06-11 at 19:33 +0200, Thomas Hellström wrote:
> Register the VRAM manager with the dmem cgroup reclaim infrastructure
> so that lowering dmem.max below current VRAM usage triggers TTM
> eviction rather than failing with -EBUSY.
> 
> Guard place->flags in amdgpu_ttm_bo_eviction_valuable() against NULL,
> as the TTM reclaim path passes a NULL place in cgroup drain mode.
> 
> v3:
> - Rebased on fix for uninitialized list and buddy allocator on the
>   drmm_cgroup_register_region() error path.
> 
> v5:
> - Rebased on the introduction of struct dmem_cgroup_init.
> - Clear the reclaim callback in amdgpu_vram_mgr_fini() to prevent
>   use-after-free if cgroup reclaim is triggered after driver unbind
>   while userspace holds an open DRM file descriptor. (Sashiko-bot)
> - Switch from drmm_cgroup_register_region() to the raw
>   dmem_cgroup_register_region() and store the region in
>   amdgpu_vram_mgr.cg_region. Call dmem_cgroup_unregister_region()
>   in amdgpu_vram_mgr_fini() after ttm_resource_manager_evict_all()
>   to drain in-flight reclaim callbacks, and clear man->cg afterwards.
>   This is required because amdgpu's vram manager fini is called
>   explicitly during driver unbind, which may precede the DRM device
>   release and thus precede any drmm-based cleanup. (Sashiko-bot)
> 
> v6:
> - Fix mgr->cg_region never being assigned, so
>   dmem_cgroup_unregister_region() in fini silently no-ops on NULL
>   and leaks the region. (Sashiko-bot)
> - Reorder fini to call set_used(false) and evict_all() before
>   dmem_cgroup_unregister_region(), so ttm_resource_free() can
>   uncharge via man->cg during eviction; clear man->cg after
>   unregister. (Sashiko-bot)
> 
> Assisted-by: GitHub_Copilot:claude-sonnet-4.6
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c      |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 31 ++++++++++++++++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h |  2 ++
>  3 files changed, 28 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 2740de94e93c..8cbcd33f51a5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1488,7 +1488,7 @@ static bool amdgpu_ttm_bo_eviction_valuable(struct ttm_buffer_object *bo,
>  	dma_resv_for_each_fence(&resv_cursor, bo->base.resv,
>  				DMA_RESV_USAGE_BOOKKEEP, f) {
>  		if (amdkfd_fence_check_mm(f, current->mm) &&
> -		    !(place->flags & TTM_PL_FLAG_CONTIGUOUS))
> +		    !(place && (place->flags & TTM_PL_FLAG_CONTIGUOUS)))
>  			return false;
>  	}
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> index 08f05c3aed1d..2250bab0970d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> @@ -906,6 +906,10 @@ static const struct ttm_resource_manager_func amdgpu_vram_mgr_func = {
>  	.debug	= amdgpu_vram_mgr_debug
>  };
>  
> +static const struct dmem_cgroup_ops amdgpu_vram_mgr_dmem_ops = {
> +	.reclaim = ttm_resource_manager_dmem_reclaim,
> +};

We probably want to block reclaim after device unbind, just like xe does.
I'll look at that for v7.
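
As a rough userspace model of that idea (not the v7 code): the xe patch wraps
its reclaim callback in drm_dev_enter()/drm_dev_exit() and returns -ENODEV
once the device is gone. The sketch below reduces that to a plain flag check.
`struct dev_model`, `do_reclaim()` and `guarded_reclaim()` are hypothetical,
and the flag does not capture that the real enter/exit pair also holds off
unplug completion while the protected section runs:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model; "unplugged" stands in for the DRM unplug state. */
struct dev_model {
	bool unplugged;
	int reclaim_calls;
};

/* Hypothetical stand-in for the real reclaim work. */
static int do_reclaim(struct dev_model *dev)
{
	dev->reclaim_calls++;
	return 0;
}

/* Refuse to reclaim once the device is gone, as the xe wrapper does. */
int guarded_reclaim(struct dev_model *dev)
{
	assert(dev != NULL);
	if (dev->unplugged)
		return -ENODEV;
	return do_reclaim(dev);
}
```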

> +
>  /**
>   * amdgpu_vram_mgr_init - init VRAM manager and DRM MM
>   *
> @@ -917,6 +921,7 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
>  {
>  	struct amdgpu_vram_mgr *mgr = &adev->mman.vram_mgr;
>  	struct ttm_resource_manager *man = &mgr->manager;
> +	struct dmem_cgroup_region *cg;
>  	int err;
>  
>  	ttm_resource_manager_init(man, &adev->mman.bdev,
> @@ -933,12 +938,16 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
>  	if (err)
>  		return err;
>  
> -	man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram",
> -					      &(struct dmem_cgroup_init){
> -						.size = adev->gmc.real_vram_size,
> -					      });
> -	if (IS_ERR(man->cg))
> -		return PTR_ERR(man->cg);
> +	cg = dmem_cgroup_register_region(&(struct dmem_cgroup_init){
> +					     .size = adev->gmc.real_vram_size,
> +					     .ops = &amdgpu_vram_mgr_dmem_ops,
> +					     .reclaim_priv = man,
> +					 }, "vram");
> +	if (IS_ERR(cg))
> +		return PTR_ERR(cg);
> +
> +	mgr->cg_region = cg;
> +	ttm_resource_manager_set_dmem_region(man, cg);
>  
>  	ttm_set_driver_manager(&adev->mman.bdev, TTM_PL_VRAM, &mgr->manager);
>  	ttm_resource_manager_set_used(man, true);
> @@ -966,6 +975,16 @@ void amdgpu_vram_mgr_fini(struct amdgpu_device *adev)
>  	if (ret)
>  		return;
>  
> +	/*
> +	 * Drain any in-flight dmem cgroup reclaim callbacks and remove the
> +	 * region from the global list.  This must happen after evict_all()
> +	 * so that ttm_resource_free() can still uncharge via man->cg while
> +	 * BOs are being evicted.
> +	 */
> +	dmem_cgroup_unregister_region(mgr->cg_region);
> +	mgr->cg_region = NULL;
> +	man->cg = NULL;
> +
>  	mutex_lock(&mgr->lock);
>  	list_for_each_entry_safe(rsv, temp, &mgr->reservations_pending, blocks)
>  		kfree(rsv);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h
> index 429a21a2e9b2..07103cddb335 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h
> @@ -36,6 +36,8 @@ struct amdgpu_vram_mgr {
>  	atomic64_t vis_usage;
>  	u64 default_page_size;
>  	struct list_head allocated_vres_list;
> +	/** @cg_region: dmem cgroup region for VRAM; unregistered in fini. */
> +	struct dmem_cgroup_region *cg_region;
>  };
>  
>  struct amdgpu_vres_task {


* Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Shakeel Butt @ 2026-06-11 19:12 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: YoungJun Park, Hao Jia, Johannes Weiner, mhocko, tj, mkoutny,
	roman.gushchin, Nhat Pham, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he
In-Reply-To: <airzE7jD9UtyR17J@google.com>

On Thu, Jun 11, 2026 at 05:45:04PM +0000, Yosry Ahmed wrote:
> On Tue, Jun 09, 2026 at 01:19:13PM +0900, YoungJun Park wrote:
> > On Mon, Jun 08, 2026 at 03:27:07PM -0700, Yosry Ahmed wrote:
> > 
> > +Chris +Kairui +Baoquan
> > 
> > Hello
> > 
> > Thanks for inviting me to the discussion, Shakeel.
> > 
> > > > > > Youngjun is working on swap tiers. At the moment he is more interested in
> > > > > > allowing a specific swap device to a memcg or not. I can imagine in future there
> > > > > > will be use-cases where there will be a need to demote data on higher tier swap
> > > > > > to lower tier swap. What would be the appropriate interface?
> > 
> > Speaking of my work on swap tiers, I recently submitted a patch and am
> > currently considering memcg integration:
> > https://lore.kernel.org/linux-mm/20260527062247.3440692-1-youngjun.park@lge.com/
> > 
> > The future use-cases imagined above seem to align with this
> > direction. (BTW, I am currently waiting for reviews/feedback from the memcg
> > folks on this patch. Any reviews would be highly appreciated!)
> > 
> > We could potentially assign a target tier
> > for writeback within the existing memory.zswap.writeback interface. 
> > 
> > For instance, '0' could mean disabled, while non-zero values could represent
> > specific tiers, which would maintain backward compatibility with the current
> > version. Alternatively, if zswap is treated as the default top tier, 
> > the `memory.swap.tiers` interface could potentially replace `memory.zswap.writeback`.
> > 
> > Furthermore, this could be expanded to support user-triggered demotion
> > of data between swap tiers.
> > 
> > Based on the current patch's ideas combined with my swap tiers concept:
> > 
> > Assuming a hierarchy like:
> > zswap -> tier1 (SSD swap) -> tier2 (HDD swap) -> tier3 (Network swap)
> > 
> > We could configure the active tiers via a setting like `memory.swap.tiers`
> > (tier2 enabled, tier3 enabled).
> > 
> > For example, the concept of `echo "100M zswap_writeback_only > memory.reclaim"`
> > could be extended. A user could run `echo "100M tier2 > memory.reclaim"`
> > to explicitly trigger demotion from tier2 to tier3.
> > (BTW, if we combine these features, my personal preference for the keyword
> > format would be `<size> <demote_prefix><tier_name>`. I think it would be
> > better to explicitly indicate that it is a swap demotion by using a specific
> > prefix followed by the tier name. 
> > Or make demote prefix another key is also possible)
> 
> I am not sure if proactive demotion between swap tiers would be driven
> by memory.reclaim, I am guessing a new interface might be more suitable.
> But yes, you are right that it's very possible that
> 'zswap_writeback_only' with memory.reclaim will become obsolete once
> swap tiering matures and starts supporting things like proactive
> demotion.
> 
> Part of me wants to wait until the swap tiering interfaces are figured
> out so that we don't end up with redundant interfaces, but I also don't
> want to hold Hao's work since it doesn't directly depend on swap
> tiering.
> 
> Shakeel, how do you want to handle this? I think there's a few options:
> 
> 1. Add zswap_writeback_only now, and when we have swap tiering demotion
> it becomes a redundant interface, like memory.zswap.writeback -- or
> maybe we try to deprecate both of them at that point. It's difficult to
> remove interfaces tho, but maybe easier to stop supporting
> zswap_writeback_only.
> 
> 2. Add zswap_writeback_only behind an experimental config option, to
> unblock development but have a line of sight to dropping support once we
> have a swap tiering interface.
> 
> 3. Wait until we figure out the swap tiering interfaces and then add
> the proactive zswap writeback as part of it.
> 
> WDYT?

Is Hao's work needed for some follow-up work/development? The earliest Hao's
work can land is 7.3, so if we aim to figure out the swap tiering interfaces in
the next couple of weeks, then option 3 is the way to go. If swap tiers take
more time, then we can discuss the other options as well.

However, I would need help from the zswap folks (Yosry & Nhat) in figuring out
the swap tiers interfaces. Zswap is currently the top-tier swap in real-world
use. I want zswap users to easily (and hopefully transparently) migrate to
swap tiers.


* Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Yosry Ahmed @ 2026-06-11 17:45 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Shakeel Butt, Hao Jia, Johannes Weiner, mhocko, tj, mkoutny,
	roman.gushchin, Nhat Pham, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he
In-Reply-To: <aieUQUBHI+E3uNPW@yjaykim-PowerEdge-T330>

On Tue, Jun 09, 2026 at 01:19:13PM +0900, YoungJun Park wrote:
> On Mon, Jun 08, 2026 at 03:27:07PM -0700, Yosry Ahmed wrote:
> 
> +Chris +Kairui +Baoquan
> 
> Hello
> 
> Thanks for inviting me to the discussion, Shakeel.
> 
> > > > > Youngjun is working on swap tiers. At the moment he is more interested in
> > > > > allowing a specific swap device to a memcg or not. I can imagine in future there
> > > > > will be use-cases where there will be a need to demote data on higher tier swap
> > > > > to lower tier swap. What would be the appropriate interface?
> 
> Speaking of my work on swap tiers, I recently submitted a patch and am
> currently considering memcg integration:
> https://lore.kernel.org/linux-mm/20260527062247.3440692-1-youngjun.park@lge.com/
> 
> The future use-cases imagined above seem to align with this
> direction. (BTW, I am currently waiting for reviews/feedback from the memcg
> folks on this patch. Any reviews would be highly appreciated!)
> 
> We could potentially assign a target tier
> for writeback within the existing memory.zswap.writeback interface. 
> 
> For instance, '0' could mean disabled, while non-zero values could represent
> specific tiers, which would maintain backward compatibility with the current
> version. Alternatively, if zswap is treated as the default top tier, 
> the `memory.swap.tiers` interface could potentially replace `memory.zswap.writeback`.
> 
> Furthermore, this could be expanded to support user-triggered demotion
> of data between swap tiers.
> 
> Based on the current patch's ideas combined with my swap tiers concept:
> 
> Assuming a hierarchy like:
> zswap -> tier1 (SSD swap) -> tier2 (HDD swap) -> tier3 (Network swap)
> 
> We could configure the active tiers via a setting like `memory.swap.tiers`
> (tier2 enabled, tier3 enabled).
> 
> For example, the concept of `echo "100M zswap_writeback_only > memory.reclaim"`
> could be extended. A user could run `echo "100M tier2 > memory.reclaim"`
> to explicitly trigger demotion from tier2 to tier3.
> (BTW, if we combine these features, my personal preference for the keyword
> format would be `<size> <demote_prefix><tier_name>`. I think it would be
> better to explicitly indicate that it is a swap demotion by using a specific
> prefix followed by the tier name. 
> Or make demote prefix another key is also possible)

I am not sure if proactive demotion between swap tiers would be driven
by memory.reclaim, I am guessing a new interface might be more suitable.
But yes, you are right that it's very possible that
'zswap_writeback_only' with memory.reclaim will become obsolete once
swap tiering matures and starts supporting things like proactive
demotion.

Part of me wants to wait until the swap tiering interfaces are figured
out so that we don't end up with redundant interfaces, but I also don't
want to hold Hao's work since it doesn't directly depend on swap
tiering.

Shakeel, how do you want to handle this? I think there's a few options:

1. Add zswap_writeback_only now, and when we have swap tiering demotion
it becomes a redundant interface, like memory.zswap.writeback -- or
maybe we try to deprecate both of them at that point. It's difficult to
remove interfaces tho, but maybe easier to stop supporting
zswap_writeback_only.

2. Add zswap_writeback_only behind an experimental config option, to
unblock development but have a line of sight to dropping support once we
have a swap tiering interface.

3. Wait until we figure out the swap tiering interfaces and then add
the proactive zswap writeback as part of it.

WDYT?


* Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg
From: Yosry Ahmed @ 2026-06-11 17:39 UTC (permalink / raw)
  To: Hao Jia
  Cc: Nhat Pham, shakeel.butt, akpm, tj, hannes, mhocko, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <1c25650e-bf98-2863-d505-9b94c385668b@gmail.com>

On Tue, Jun 09, 2026 at 11:18:26AM +0800, Hao Jia wrote:
> 
> 
> On 2026/6/9 02:01, Nhat Pham wrote:
> > On Mon, Jun 8, 2026 at 9:48 AM Yosry Ahmed <yosry@kernel.org> wrote:
> > > 
> > > > But OTOH, this does seem like a recipe for inefficient reclaim. We
> > > > might exhaust hotter memory of a cgroup while sparing colder memory of
> > > > another cgroup... But maybe if they're all cold anyway, then who
> > > > cares, and eventually you'll get to the cold stuff of other child?
> > > 
> > > Forgot to respond to this part, the unfairness is limited to the batch
> > > size per-invocation, so it should be fine as long as you don't divide
> > > the amount over 100 iterations for some reason. Also yes, all memory
> > > in zswap is cold, the relative coldness is not that important (e.g.
> > > compared to relative coldness during reclaim).
> > 
> > Ok then yeah, I think we should shelve per-memcg cursor for the next
> > version. Down the line, if we have more data that unfairness is an
> > issue, we can always fix it. One step at a time :)
> 
> Thanks a lot to Yosry, Nhat, and Shakeel for the great suggestions!
> 
> Let me summarize what I plan to do in the next version to make sure we are
> on the same page:
> 
>  - Drop the per-memcg cursor and keep the root cgroup cursor
> (zswap_next_shrink) logic intact.
>  - Stick to using the zswap_writeback_only key, and change the proactive
> writeback size to use the compressed size.
>  - Consolidate and reuse the logic between shrink_worker() and
> shrink_memcg(). Enable batch writeback in the shrink_worker() path, while
> keeping the writeback behavior in the zswap_store() path unchanged.
> 
> Please let me know if I missed or misunderstood anything. Thanks again for
> clearing things up!

Sorry for the late response. Yes, I think this makes sense. However, I
have some comments about how this interacts with swap tiering; let me
reply to the other thread.

> 
> Thanks,
> Hao


* [PATCH v6 6/6] drm/amdgpu: Wire up dmem cgroup reclaim for VRAM manager
From: Thomas Hellström @ 2026-06-11 17:33 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Natalie Vock, Johannes Weiner, Tejun Heo,
	Michal Koutný, cgroups, Huang Rui, Matthew Brost,
	Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <20260611173301.17473-1-thomas.hellstrom@linux.intel.com>

Register the VRAM manager with the dmem cgroup reclaim infrastructure
so that lowering dmem.max below current VRAM usage triggers TTM
eviction rather than failing with -EBUSY.

Guard place->flags in amdgpu_ttm_bo_eviction_valuable() against NULL,
as the TTM reclaim path passes a NULL place in cgroup drain mode.
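
A small userspace sketch of the guarded test, for illustration only.
`struct place_model`, `PL_FLAG_CONTIGUOUS` (with an arbitrary value) and
`refuse_eviction()` are hypothetical stand-ins; the boolean shape mirrors the
hunk below, where a NULL place is treated like a non-contiguous request:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PL_FLAG_CONTIGUOUS (1u << 0)	/* hypothetical flag value */

/* Hypothetical stand-in for struct ttm_place. */
struct place_model {
	unsigned int flags;
};

/*
 * Returns true when eviction should be refused for a BO whose fence
 * matches the current mm. A NULL place never dereferences flags.
 */
bool refuse_eviction(bool fence_matches_mm, const struct place_model *place)
{
	return fence_matches_mm &&
	       !(place && (place->flags & PL_FLAG_CONTIGUOUS));
}
```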

v3:
- Rebased on fix for uninitialized list and buddy allocator on the
  drmm_cgroup_register_region() error path.

v5:
- Rebased on the introduction of struct dmem_cgroup_init.
- Clear the reclaim callback in amdgpu_vram_mgr_fini() to prevent
  use-after-free if cgroup reclaim is triggered after driver unbind
  while userspace holds an open DRM file descriptor. (Sashiko-bot)
- Switch from drmm_cgroup_register_region() to the raw
  dmem_cgroup_register_region() and store the region in
  amdgpu_vram_mgr.cg_region. Call dmem_cgroup_unregister_region()
  in amdgpu_vram_mgr_fini() after ttm_resource_manager_evict_all()
  to drain in-flight reclaim callbacks, and clear man->cg afterwards.
  This is required because amdgpu's vram manager fini is called
  explicitly during driver unbind, which may precede the DRM device
  release and thus precede any drmm-based cleanup. (Sashiko-bot)

v6:
- Fix mgr->cg_region never being assigned, so
  dmem_cgroup_unregister_region() in fini silently no-ops on NULL
  and leaks the region. (Sashiko-bot)
- Reorder fini to call set_used(false) and evict_all() before
  dmem_cgroup_unregister_region(), so ttm_resource_free() can
  uncharge via man->cg during eviction; clear man->cg after
  unregister. (Sashiko-bot)

Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c      |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 31 ++++++++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h |  2 ++
 3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 2740de94e93c..8cbcd33f51a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1488,7 +1488,7 @@ static bool amdgpu_ttm_bo_eviction_valuable(struct ttm_buffer_object *bo,
 	dma_resv_for_each_fence(&resv_cursor, bo->base.resv,
 				DMA_RESV_USAGE_BOOKKEEP, f) {
 		if (amdkfd_fence_check_mm(f, current->mm) &&
-		    !(place->flags & TTM_PL_FLAG_CONTIGUOUS))
+		    !(place && (place->flags & TTM_PL_FLAG_CONTIGUOUS)))
 			return false;
 	}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 08f05c3aed1d..2250bab0970d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -906,6 +906,10 @@ static const struct ttm_resource_manager_func amdgpu_vram_mgr_func = {
 	.debug	= amdgpu_vram_mgr_debug
 };
 
+static const struct dmem_cgroup_ops amdgpu_vram_mgr_dmem_ops = {
+	.reclaim = ttm_resource_manager_dmem_reclaim,
+};
+
 /**
  * amdgpu_vram_mgr_init - init VRAM manager and DRM MM
  *
@@ -917,6 +921,7 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 {
 	struct amdgpu_vram_mgr *mgr = &adev->mman.vram_mgr;
 	struct ttm_resource_manager *man = &mgr->manager;
+	struct dmem_cgroup_region *cg;
 	int err;
 
 	ttm_resource_manager_init(man, &adev->mman.bdev,
@@ -933,12 +938,16 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 	if (err)
 		return err;
 
-	man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram",
-					      &(struct dmem_cgroup_init){
-						.size = adev->gmc.real_vram_size,
-					      });
-	if (IS_ERR(man->cg))
-		return PTR_ERR(man->cg);
+	cg = dmem_cgroup_register_region(&(struct dmem_cgroup_init){
+					     .size = adev->gmc.real_vram_size,
+					     .ops = &amdgpu_vram_mgr_dmem_ops,
+					     .reclaim_priv = man,
+					 }, "vram");
+	if (IS_ERR(cg))
+		return PTR_ERR(cg);
+
+	mgr->cg_region = cg;
+	ttm_resource_manager_set_dmem_region(man, cg);
 
 	ttm_set_driver_manager(&adev->mman.bdev, TTM_PL_VRAM, &mgr->manager);
 	ttm_resource_manager_set_used(man, true);
@@ -966,6 +975,16 @@ void amdgpu_vram_mgr_fini(struct amdgpu_device *adev)
 	if (ret)
 		return;
 
+	/*
+	 * Drain any in-flight dmem cgroup reclaim callbacks and remove the
+	 * region from the global list.  This must happen after evict_all()
+	 * so that ttm_resource_free() can still uncharge via man->cg while
+	 * BOs are being evicted.
+	 */
+	dmem_cgroup_unregister_region(mgr->cg_region);
+	mgr->cg_region = NULL;
+	man->cg = NULL;
+
 	mutex_lock(&mgr->lock);
 	list_for_each_entry_safe(rsv, temp, &mgr->reservations_pending, blocks)
 		kfree(rsv);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h
index 429a21a2e9b2..07103cddb335 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h
@@ -36,6 +36,8 @@ struct amdgpu_vram_mgr {
 	atomic64_t vis_usage;
 	u64 default_page_size;
 	struct list_head allocated_vres_list;
+	/** @cg_region: dmem cgroup region for VRAM; unregistered in fini. */
+	struct dmem_cgroup_region *cg_region;
 };
 
 struct amdgpu_vres_task {
-- 
2.54.0



* [PATCH v6 5/6] drm/xe: Wire up dmem cgroup reclaim for VRAM manager
From: Thomas Hellström @ 2026-06-11 17:33 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Natalie Vock, Johannes Weiner, Tejun Heo,
	Michal Koutný, cgroups, Huang Rui, Matthew Brost,
	Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <20260611173301.17473-1-thomas.hellstrom@linux.intel.com>

Register the VRAM manager with the dmem cgroup reclaim infrastructure
so that lowering dmem.max below current VRAM usage triggers TTM
eviction rather than failing with -EBUSY.

v4:
- Rebased on drm-tip; dropped the XE_PL_STOLEN guard as stolen memory
  uses a separate TTM manager and never calls __xe_ttm_vram_mgr_init().

v5:
- Rebased on the introduction of struct dmem_cgroup_init.
- Register the fini drmm action before drmm_cgroup_register_region() so
  that devres LIFO teardown runs unregister_region() first (draining any
  in-flight reclaim callbacks via the rwsem) and xe_ttm_vram_mgr_fini()
  second, ensuring the manager is never accessed by a reclaim callback
  after teardown. (Sashiko-bot)
- Wrap the reclaim callback in xe_ttm_vram_mgr_dmem_reclaim() using
  drm_dev_enter()/drm_dev_exit() to prevent TTM reclaim from running
  after driver unbind.

Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 54 +++++++++++++++++++++++-----
 1 file changed, 45 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index 308fda4248eb..b2500344cd57 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -276,6 +276,28 @@ static const struct ttm_resource_manager_func xe_ttm_vram_mgr_func = {
 	.debug	= xe_ttm_vram_mgr_debug
 };
 
+static const struct dmem_cgroup_ops xe_ttm_vram_mgr_dmem_ops;
+
+static int xe_ttm_vram_mgr_dmem_reclaim(struct dmem_cgroup_pool_state *pool,
+					 u64 target_bytes, void *priv)
+{
+	struct ttm_resource_manager *man = priv;
+	struct xe_device *xe = ttm_to_xe_device(man->bdev);
+	int ret, idx;
+
+	if (!drm_dev_enter(&xe->drm, &idx))
+		return -ENODEV;
+
+	ret = ttm_resource_manager_dmem_reclaim(pool, target_bytes, priv);
+
+	drm_dev_exit(idx);
+	return ret;
+}
+
+static const struct dmem_cgroup_ops xe_ttm_vram_mgr_dmem_ops = {
+	.reclaim = xe_ttm_vram_mgr_dmem_reclaim,
+};
+
 static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
 {
 	struct xe_device *xe = to_xe_device(dev);
@@ -301,17 +323,10 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
 			   u64 default_page_size)
 {
 	struct ttm_resource_manager *man = &mgr->manager;
+	struct dmem_cgroup_region *cg;
 	const char *name;
 	int err;
 
-	name = mem_type == XE_PL_VRAM0 ? "vram0" : "vram1";
-	man->cg = drmm_cgroup_register_region(&xe->drm, name,
-					      &(struct dmem_cgroup_init){
-						.size = size,
-					      });
-	if (IS_ERR(man->cg))
-		return PTR_ERR(man->cg);
-
 	man->func = &xe_ttm_vram_mgr_func;
 	mgr->mem_type = mem_type;
 	err = drmm_mutex_init(&xe->drm, &mgr->lock);
@@ -330,7 +345,28 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
 	ttm_set_driver_manager(&xe->ttm, mem_type, &mgr->manager);
 	ttm_resource_manager_set_used(&mgr->manager, true);
 
-	return drmm_add_action_or_reset(&xe->drm, xe_ttm_vram_mgr_fini, mgr);
+	/*
+	 * Register the fini action before the cgroup region so that devres
+	 * LIFO teardown runs unregister_region first (draining any in-flight
+	 * reclaim callbacks) and the manager fini second.
+	 */
+	err = drmm_add_action_or_reset(&xe->drm, xe_ttm_vram_mgr_fini, mgr);
+	if (err)
+		return err;
+
+	name = mem_type == XE_PL_VRAM0 ? "vram0" : "vram1";
+	cg = drmm_cgroup_register_region(&xe->drm, name,
+					 &(struct dmem_cgroup_init){
+						.size = size,
+						.ops = &xe_ttm_vram_mgr_dmem_ops,
+						.reclaim_priv = man,
+					 });
+	if (IS_ERR(cg))
+		return PTR_ERR(cg);
+
+	ttm_resource_manager_set_dmem_region(man, cg);
+
+	return 0;
 }
 
 /**
-- 
2.54.0



* [PATCH v6 4/6] drm/ttm: Hook up a cgroup-aware reclaim callback for the dmem controller
From: Thomas Hellström @ 2026-06-11 17:32 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Natalie Vock, Johannes Weiner, Tejun Heo,
	Michal Koutný, cgroups, Huang Rui, Matthew Brost,
	Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <20260611173301.17473-1-thomas.hellstrom@linux.intel.com>

Add ttm_bo_evict_cgroup() to evict buffer objects charged to a specific
dmem cgroup pool from a resource manager's LRU until a byte target is
met.  Add ttm_resource_manager_set_dmem_region() to associate a dmem
cgroup region with a resource manager; drivers supply their own
dmem_cgroup_ops with ttm_resource_manager_dmem_reclaim as the reclaim
function and the manager pointer as reclaim_priv in the dmem_cgroup_init
to wire up TTM eviction as the reclaim callback.

The eviction context is interruptible; signals abort the operation and
propagate back through the write() syscall.

Introduce a new mode for the bo LRU walker so that sleeping locks
can be taken. This can be used when the caller doesn't hold any
previous dma_resv locks, and where it intends to hold at most
one lock at a time.
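
As a sketch of how the new mode changes the walker's decision after a failed
trylock, the condition from the ttm_bo_util.c hunk can be modelled in
isolation. struct walk_arg and its has_ticket member are hypothetical
simplifications: the real code tests arg->ticket and arg->ctx->no_wait_gpu.

```c
#include <stdbool.h>

/* Hypothetical flattened view of the fields the walker condition reads. */
struct walk_arg {
	bool has_ticket;    /* models arg->ticket != NULL */
	bool sleeping_lock; /* the new field added by this patch */
	bool trylock_only;
	bool no_wait_gpu;   /* models arg->ctx->no_wait_gpu */
};

/*
 * Returns true if a BO whose trylock failed is skipped, false if the
 * walker goes on to take a blocking lock on it.
 */
static bool skip_after_failed_trylock(const struct walk_arg *a)
{
	return (!a->has_ticket && !a->sleeping_lock) ||
	       a->no_wait_gpu || a->trylock_only;
}
```

With no ticket and sleeping_lock unset the BO is skipped, as before. Setting
sleeping_lock allows a blocking lock without a ticket, while trylock_only or
no_wait_gpu still force a skip.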

Like the rest of the TTM eviction code, this should sooner rather than
later be converted to full WW transactions.

v3:
- Fix ttm_resource_manager_set_dmem_region() storing an error pointer
  in man->cg unconditionally. (Sashiko-bot)
- Fix kernel-doc function name format for ttm_bo_evict_cgroup() and
  ttm_resource_manager_set_dmem_region().

v5:
- Rebased on the introduction of struct dmem_cgroup_init.
- Handle NULL region in ttm_resource_manager_set_dmem_region() to clear
  the reclaim callback, preventing use-after-free when the manager is
  torn down while the dmem region outlives it. (Sashiko-bot)
- Return 0 on any progress (even partial eviction), -ENOSPC only when
  nothing was freed; fixes callers that expected 0 on partial success.
- Document that the reclaim callback should return 0 if some progress
  was made, -ENOSPC if no progress at all, or another error for fatal
  failures.

Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c       | 95 +++++++++++++++++++++++++++++-
 drivers/gpu/drm/ttm/ttm_bo_util.c  |  3 +-
 drivers/gpu/drm/ttm/ttm_resource.c | 50 ++++++++++++++++
 include/drm/ttm/ttm_bo.h           | 10 ++++
 include/drm/ttm/ttm_resource.h     |  7 +++
 5 files changed, 161 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bcd76f6bb7f0..db0e38bd8a43 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -515,12 +515,20 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
 {
 	struct ttm_bo_evict_walk *evict_walk =
 		container_of(walk, typeof(*evict_walk), walk);
+	/* Capture size before eviction in case res is cleared. */
+	s64 bo_size = bo->base.size;
 	s64 lret;
 
 	if (!dmem_cgroup_state_evict_valuable(evict_walk->limit_pool, bo->resource->css,
 					      evict_walk->try_low, &evict_walk->hit_low))
 		return 0;
 
+	/*
+	 * evict_walk->place is NULL in cgroup drain mode.  Drivers'
+	 * eviction_valuable() callbacks must handle a NULL place, treating it
+	 * as "any placement": the TTM base implementation already does so via
+	 * ttm_resource_intersects().
+	 */
 	if (bo->pin_count || !bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
 		return 0;
 
@@ -536,11 +544,15 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
 		goto out;
 
 	evict_walk->evicted++;
-	if (evict_walk->res)
+	if (evict_walk->res) {
 		lret = ttm_resource_alloc(evict_walk->evictor, evict_walk->place,
 					  evict_walk->res, NULL);
-	if (lret == 0)
-		return 1;
+		if (lret == 0)
+			return 1;
+	} else {
+		/* Cgroup drain: return bytes freed for byte-denominated progress. */
+		return bo_size;
+	}
 out:
 	/* Errors that should terminate the walk. */
 	if (lret == -ENOSPC)
@@ -614,6 +626,83 @@ static int ttm_bo_evict_alloc(struct ttm_device *bdev,
 	return 0;
 }
 
+/**
+ * ttm_bo_evict_cgroup() - Evict buffer objects charged to a specific cgroup.
+ * @bdev: The TTM device.
+ * @man: The resource manager whose LRU to walk.
+ * @limit_pool: The cgroup pool state whose members should be evicted.
+ * @target_bytes: Number of bytes to free.
+ * @ctx: The TTM operation context.
+ *
+ * Walk the LRU of @man and evict buffer objects that are charged to the
+ * cgroup identified by @limit_pool, until at least @target_bytes have been
+ * freed.  Mirrors the two-pass (trylock -> sleeping-lock, low-watermark)
+ * strategy used by ttm_bo_evict_alloc().
+ *
+ * Return: >= @target_bytes on full success, 0..target_bytes-1 if partial,
+ *         negative error code on fatal error.
+ */
+s64 ttm_bo_evict_cgroup(struct ttm_device *bdev,
+			struct ttm_resource_manager *man,
+			struct dmem_cgroup_pool_state *limit_pool,
+			s64 target_bytes,
+			struct ttm_operation_ctx *ctx)
+{
+	struct ttm_bo_evict_walk evict_walk = {
+		.walk = {
+			.ops = &ttm_evict_walk_ops,
+			.arg = { .ctx = ctx },
+		},
+		.limit_pool = limit_pool,
+		/* place, evictor, res left NULL: selects cgroup drain mode */
+	};
+	s64 lret, pass;
+
+	evict_walk.walk.arg.trylock_only = true;
+	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, target_bytes);
+	if (lret < 0 || lret >= target_bytes)
+		return lret;
+
+	/* Second pass: also evict BOs at the low watermark. */
+	if (evict_walk.hit_low) {
+		evict_walk.try_low = true;
+		pass = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man,
+					      target_bytes - lret);
+		if (pass < 0)
+			return pass;
+		lret += pass;
+		if (lret >= target_bytes)
+			return lret;
+	}
+
+	/* Full sleeping-lock pass for remaining target. */
+	evict_walk.try_low = evict_walk.hit_low = false;
+	evict_walk.walk.arg.trylock_only = false;
+
+retry:
+	evict_walk.walk.arg.sleeping_lock = true;
+	do {
+		evict_walk.evicted = 0;
+		pass = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man,
+					      target_bytes - lret);
+		if (pass < 0) {
+			lret = pass;
+			goto out;
+		}
+		lret += pass;
+	} while (lret < target_bytes && evict_walk.evicted);
+
+	/* One more attempt if we hit the low limit during sleeping-lock pass. */
+	if (lret < target_bytes && evict_walk.hit_low && !evict_walk.try_low) {
+		evict_walk.try_low = true;
+		goto retry;
+	}
+
+out:
+	return lret;
+}
+EXPORT_SYMBOL(ttm_bo_evict_cgroup);
+
 /**
  * ttm_bo_pin - Pin the buffer object.
  * @bo: The buffer object to pin
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 3e3c201a0222..bd0b23ac2cc4 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -999,7 +999,8 @@ __ttm_bo_lru_cursor_next(struct ttm_bo_lru_cursor *curs)
 		bo = res->bo;
 		if (ttm_lru_walk_trylock(curs, bo))
 			bo_locked = true;
-		else if (!arg->ticket || arg->ctx->no_wait_gpu || arg->trylock_only)
+		else if ((!arg->ticket && !arg->sleeping_lock) || arg->ctx->no_wait_gpu ||
+			 arg->trylock_only)
 			continue;
 
 		if (!ttm_bo_get_unless_zero(bo)) {
diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
index 154d6739256f..ad00723e99ef 100644
--- a/drivers/gpu/drm/ttm/ttm_resource.c
+++ b/drivers/gpu/drm/ttm/ttm_resource.c
@@ -953,3 +953,53 @@ void ttm_resource_manager_create_debugfs(struct ttm_resource_manager *man,
 #endif
 }
 EXPORT_SYMBOL(ttm_resource_manager_create_debugfs);
+
+/**
+ * ttm_resource_manager_dmem_reclaim() - dmem cgroup reclaim callback for TTM
+ *                                       resource managers.
+ * @pool: The dmem cgroup pool state for the cgroup being reclaimed.
+ * @target_bytes: Number of bytes to try to free.
+ * @priv: The &ttm_resource_manager pointer, passed as @init.reclaim_priv to
+ *        dmem_cgroup_register_region().
+ *
+ * Drivers should use this as the @reclaim member of their own
+ * &struct dmem_cgroup_ops, with the &ttm_resource_manager pointer as
+ * @init.reclaim_priv.
+ *
+ * Return: 0 if some memory was freed, -ENOSPC if nothing was freed, or
+ *         another negative error code on fatal failure.
+ */
+int ttm_resource_manager_dmem_reclaim(struct dmem_cgroup_pool_state *pool,
+				      u64 target_bytes, void *priv)
+{
+	struct ttm_resource_manager *man = priv;
+	struct ttm_operation_ctx ctx = { .interruptible = true };
+	s64 freed;
+
+	freed = ttm_bo_evict_cgroup(man->bdev, man, pool, target_bytes, &ctx);
+	if (freed < 0)
+		return freed;
+
+	return freed > 0 ? 0 : -ENOSPC;
+}
+EXPORT_SYMBOL(ttm_resource_manager_dmem_reclaim);
+
+/**
+ * ttm_resource_manager_set_dmem_region() - Associate a dmem cgroup region with a
+ *                                        resource manager.
+ * @man: The resource manager.
+ * @region: The dmem cgroup region to associate, may be NULL or IS_ERR().
+ *
+ * When @region is valid, stores it in @man->cg so that TTM can look up the
+ * associated pool during charging and eviction-target selection.
+ * The reclaim callback must be wired up using ttm_resource_manager_dmem_reclaim()
+ * in the driver's own &struct dmem_cgroup_ops, with the manager pointer as
+ * @init.reclaim_priv.
+ */
+void ttm_resource_manager_set_dmem_region(struct ttm_resource_manager *man,
+					  struct dmem_cgroup_region *region)
+{
+	if (!IS_ERR_OR_NULL(region))
+		man->cg = region;
+}
+EXPORT_SYMBOL(ttm_resource_manager_set_dmem_region);
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index 8310bc3d55f9..32791c4db2a9 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -226,6 +226,11 @@ struct ttm_lru_walk_arg {
 	struct ww_acquire_ctx *ticket;
 	/** @trylock_only: Only use trylock for locking. */
 	bool trylock_only;
+	/**
+	 * @sleeping_lock: Use sleeping locks even with %NULL @ticket.
+	 * @trylock_only has precedence over this field.
+	 */
+	bool sleeping_lock;
 };
 
 /**
@@ -431,6 +436,11 @@ void ttm_bo_unpin(struct ttm_buffer_object *bo);
 int ttm_bo_evict_first(struct ttm_device *bdev,
 		       struct ttm_resource_manager *man,
 		       struct ttm_operation_ctx *ctx);
+s64 ttm_bo_evict_cgroup(struct ttm_device *bdev,
+			struct ttm_resource_manager *man,
+			struct dmem_cgroup_pool_state *limit_pool,
+			s64 target_bytes,
+			struct ttm_operation_ctx *ctx);
 int ttm_bo_access(struct ttm_buffer_object *bo, unsigned long offset,
 		  void *buf, int len, int write);
 vm_fault_t ttm_bo_vm_reserve(struct ttm_buffer_object *bo,
diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
index a5d386583fb6..32e485fdce9a 100644
--- a/include/drm/ttm/ttm_resource.h
+++ b/include/drm/ttm/ttm_resource.h
@@ -39,6 +39,7 @@
 
 struct dentry;
 struct dmem_cgroup_device;
+struct dmem_cgroup_region;
 struct drm_printer;
 struct ttm_device;
 struct ttm_resource_manager;
@@ -477,6 +478,12 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man,
 			       struct ttm_device *bdev,
 			       uint64_t size);
 
+void ttm_resource_manager_set_dmem_region(struct ttm_resource_manager *man,
+					  struct dmem_cgroup_region *region);
+
+int ttm_resource_manager_dmem_reclaim(struct dmem_cgroup_pool_state *pool,
+				      u64 target_bytes, void *priv);
+
 int ttm_resource_manager_evict_all(struct ttm_device *bdev,
 				   struct ttm_resource_manager *man);
 
-- 
2.54.0



* [PATCH v6 3/6] cgroup/dmem: Add reclaim callback for lowering max below current usage
From: Thomas Hellström @ 2026-06-11 17:32 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Natalie Vock, Johannes Weiner, Tejun Heo,
	Michal Koutný, cgroups, Huang Rui, Matthew Brost,
	Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <20260611173301.17473-1-thomas.hellstrom@linux.intel.com>

Add an optional reclaim callback to struct dmem_cgroup_region. When
dmem.max is set below the current usage of a cgroup pool, the new limit
is applied immediately (so that concurrent allocations are throttled
while reclaim is in progress) and then the driver is asked to evict
memory to bring usage back below the limit.

Reclaim is attempted up to a bounded number of times. No error is
returned to userspace if usage remains above the limit after reclaim,
and a pending signal will abort the reclaim loop early. This matches
the behavior of memory.max in the memory cgroup controller.

Also honor O_NONBLOCK so that if that flag is set during the
max value write, no reclaim is initiated. The idea is to avoid
charging the reclaim cost to the writer of the max value.
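
From userspace, the O_NONBLOCK path could be exercised roughly as below. This
is a sketch under assumptions: dmem_set_max_nonblock() is a hypothetical
helper, and the cgroup path and region name in the usage note are examples
that depend on the system. The "<region> <bytes>" line follows the keyed
format the dmem controller uses for dmem.max.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write "<region> <bytes>" to a dmem.max file opened with O_NONBLOCK.
 * Returns 0 on success or a negative errno value.
 */
static int dmem_set_max_nonblock(const char *path, const char *region,
				 unsigned long long bytes)
{
	char line[256];
	ssize_t written;
	int len, fd, err = 0;

	len = snprintf(line, sizeof(line), "%s %llu\n", region, bytes);
	if (len < 0 || (size_t)len >= sizeof(line))
		return -EINVAL;

	/* With this patch, O_NONBLOCK asks the kernel to skip reclaim. */
	fd = open(path, O_WRONLY | O_NONBLOCK);
	if (fd < 0)
		return -errno;

	written = write(fd, line, (size_t)len);
	if (written < 0)
		err = -errno;
	else if (written != len)
		err = -EIO;

	close(fd);
	return err;
}
```

A call might look like dmem_set_max_nonblock("/sys/fs/cgroup/mygroup/dmem.max",
"drm/0000:03:00.0/vram0", 1ULL << 30), where both strings are placeholders.
The new limit still takes effect immediately; only the eviction work is
skipped.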

v2:
- Write max before reclaim is attempted (Maarten)
- Let signals abort the reclaim without error (Maarten)
- If a new max value is written with the O_NONBLOCK flag,
  reclaim is not attempted (Maarten)
- Extract region from the pool parameter rather than
  passing it explicitly to set_resource_xxx().

v3:
- Use an rwsem to protect reclaim callback registration and
  region unregister against concurrent reclaim invocations,
  ensuring reclaim_priv is visible when the callback is
  invoked. (Sashiko-bot)

v5:
- Rebased on the introduction of struct dmem_cgroup_init.
- Use nonblock=true in reset_all_resource_limits() to avoid sleeping
  inside rcu_read_lock() in dmemcs_offline(). (Sashiko-bot)
- Compare usage against the truncated limit value stored in cnt.max,
  not the original u64. (Sashiko-bot)
- Use a DMEM_MAX_RECLAIM_RETRIES (16) retry budget instead of 5, matching
  the memcg controller's MAX_RECLAIM_RETRIES. Only -ENOSPC (no progress)
  counts against the retry budget; other errors terminate the loop
  immediately.

v6:
- Fix dmem_cgroup_ops->reclaim docstring: -ENOSPC does not stop reclaim
  immediately but is retried up to DMEM_MAX_RECLAIM_RETRIES times; only
  other negative errors terminate the loop. (Sashiko-bot)
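
The retry accounting described in the v5 and v6 notes can be sketched as a
userspace model. struct pool_model and the three sample callbacks are
hypothetical; the rwsem, the signal_pending() check and cond_resched() from
the kernel loop are deliberately left out. Only the return-value handling is
modelled.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_RETRIES 16 /* same value as DMEM_MAX_RECLAIM_RETRIES */

/* Hypothetical pool: a usage counter and a limit, both in bytes. */
struct pool_model {
	uint64_t usage;
	uint64_t limit;
};

typedef int (*reclaim_fn)(struct pool_model *pool, uint64_t target_bytes);

/* Returns the number of times the callback was invoked. */
static int reclaim_loop(struct pool_model *pool, reclaim_fn reclaim)
{
	int retries = MAX_RETRIES;
	int calls = 0;

	for (;;) {
		int ret;

		if (pool->usage <= pool->limit)
			break;

		ret = reclaim(pool, pool->usage - pool->limit);
		calls++;

		/* Only -ENOSPC consumes the budget; other errors are fatal. */
		if (ret && (ret != -ENOSPC || !retries--))
			break;
	}
	return calls;
}

/* Frees at most 100 bytes per call and always reports progress. */
static int reclaim_some(struct pool_model *pool, uint64_t target_bytes)
{
	uint64_t freed = target_bytes < 100 ? target_bytes : 100;

	pool->usage -= freed;
	return 0;
}

/* Never frees anything: the "no progress" case. */
static int reclaim_nothing(struct pool_model *pool, uint64_t target_bytes)
{
	(void)pool;
	(void)target_bytes;
	return -ENOSPC;
}

/* Fails with an error other than -ENOSPC: the fatal case. */
static int reclaim_fatal(struct pool_model *pool, uint64_t target_bytes)
{
	(void)pool;
	(void)target_bytes;
	return -EIO;
}
```

In this model a callback that never makes progress is invoked 17 times,
because the post-decrement test lets the budget of 16 run down to zero before
the loop breaks. A callback that returns 0 never touches the budget, and a
fatal error stops the loop after a single call.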

Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 include/linux/cgroup_dmem.h |  22 +++++++
 kernel/cgroup/dmem.c        | 119 +++++++++++++++++++++++++++++++++---
 2 files changed, 131 insertions(+), 10 deletions(-)

diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index d9eab8a2c1ee..8664321fa9f7 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -14,12 +14,34 @@ struct dmem_cgroup_pool_state;
 /* Opaque definition of a cgroup region, used internally */
 struct dmem_cgroup_region;
 
+/**
+ * struct dmem_cgroup_ops - Operations for a dmem cgroup region.
+ * @reclaim: Optional callback invoked when dmem.max is set below the current
+ *           usage of a pool. The driver should attempt to free at least
+ *           @target_bytes from @pool. May be called multiple times if usage
+ *           remains above the limit after returning.
+ *
+ *           Return: 0 if some progress was made (even if less than
+ *           @target_bytes was freed), -ENOSPC if no progress could be made
+ *           (the caller will retry up to a bounded number of times), or
+ *           another negative error code if a fatal error occurred (stops
+ *           further reclaim attempts immediately).
+ */
+struct dmem_cgroup_ops {
+	int (*reclaim)(struct dmem_cgroup_pool_state *pool,
+		       u64 target_bytes, void *priv);
+};
+
 /**
  * struct dmem_cgroup_init - Initialization parameters for a dmem cgroup region.
  * @size: Size of the region in bytes.
+ * @ops: Optional operations for this region. May be NULL.
+ * @reclaim_priv: Opaque pointer passed to @ops->reclaim. May be NULL.
  */
 struct dmem_cgroup_init {
 	u64 size;
+	const struct dmem_cgroup_ops *ops;
+	void *reclaim_priv;
 };
 
 #if IS_ENABLED(CONFIG_CGROUP_DMEM)
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index d12c8543f3fe..da99d133182c 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -18,6 +18,13 @@
 #include <linux/rculist.h>
 #include <linux/slab.h>
 
+/*
+ * Number of reclaim attempts before giving up when lowering dmem.max
+ * below current usage. Mirrors memcg's MAX_RECLAIM_RETRIES; unify the
+ * two in a follow-up instead of duplicating the constant.
+ */
+#define DMEM_MAX_RECLAIM_RETRIES 16
+
 struct dmem_cgroup_region {
 	/**
 	 * @ref: References keeping the region alive.
@@ -51,6 +58,24 @@ struct dmem_cgroup_region {
 	 * No new pools should be added to the region afterwards.
 	 */
 	bool unregistered;
+
+	/**
+	 * @ops: Optional operations, set from dmem_cgroup_init at registration.
+	 */
+	const struct dmem_cgroup_ops *ops;
+
+	/** @reclaim_priv: Private data passed to @ops->reclaim. */
+	void *reclaim_priv;
+
+	/**
+	 * @unregister_sem: Serialises reclaim callbacks against unregistration.
+	 *
+	 * Readers (reclaim) hold the read side for the duration of a callback
+	 * invocation.  dmem_cgroup_unregister_region() takes the write side to
+	 * drain any in-flight callbacks before returning, so callers may safely
+	 * free @reclaim_priv once unregister returns.
+	 */
+	struct rw_semaphore unregister_sem;
 };
 
 struct dmemcg_state {
@@ -145,21 +170,71 @@ static void free_cg_pool(struct dmem_cgroup_pool_state *pool)
 }
 
 static void
-set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val)
+set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val, bool nonblock)
 {
 	page_counter_set_min(&pool->cnt, val);
 }
 
 static void
-set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
+set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val, bool nonblock)
 {
 	page_counter_set_low(&pool->cnt, val);
 }
 
 static void
-set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
+set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val, bool nonblock)
 {
-	page_counter_set_max(&pool->cnt, val);
+	struct dmem_cgroup_region *region = pool->region;
+	unsigned long limit = (unsigned long)val;
+
+	/*
+	 * Always update the limit, even if usage currently exceeds it.
+	 * Concurrent allocations will be throttled against the new limit
+	 * while reclaim is in progress.
+	 */
+	xchg(&pool->cnt.max, limit);
+
+	if (nonblock)
+		return;
+
+	/*
+	 * Hold the read side for the duration of the reclaim loop so that
+	 * dmem_cgroup_unregister_region() cannot return (and the caller
+	 * cannot free reclaim_priv) while a callback is in progress.
+	 *
+	 * The ops check must happen inside the lock.  A caller may have
+	 * observed ops != NULL before dmem_cgroup_unregister_region()
+	 * acquired the write side; rechecking under down_read() is safe
+	 * because region->unregistered is set while the write side is
+	 * held, so any down_read() that succeeds after up_write() will
+	 * see unregistered = true and skip the loop.
+	 */
+	down_read(&region->unregister_sem);
+	if (!region->unregistered && region->ops && region->ops->reclaim) {
+		for (int retries = DMEM_MAX_RECLAIM_RETRIES; ; ) {
+			u64 usage = page_counter_read(&pool->cnt);
+			int ret;
+
+			if (usage <= limit)
+				break;
+
+			if (signal_pending(current))
+				break;
+
+			ret = region->ops->reclaim(pool, usage - limit, region->reclaim_priv);
+
+			/*
+			 * Mirror memcg's retry strategy: only count -ENOSPC (no
+			 * progress) against the retry budget; any other error is
+			 * fatal and terminates the loop immediately.
+			 */
+			if (ret && (ret != -ENOSPC || !retries--))
+				break;
+
+			cond_resched();
+		}
+	}
+	up_read(&region->unregister_sem);
 }
 
 static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
@@ -189,9 +264,14 @@ static u64 get_resource_peak(struct dmem_cgroup_pool_state *pool)
 
 static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
 {
-	set_resource_min(rpool, 0);
-	set_resource_low(rpool, 0);
-	set_resource_max(rpool, PAGE_COUNTER_MAX);
+	set_resource_min(rpool, 0, false);
+	set_resource_low(rpool, 0, false);
+	/*
+	 * Use nonblock=true: we are raising the limit to PAGE_COUNTER_MAX so
+	 * reclaim is pointless, and dmemcs_offline() holds rcu_read_lock()
+	 * which forbids sleeping.
+	 */
+	set_resource_max(rpool, PAGE_COUNTER_MAX, true);
 }
 
 static void dmemcs_offline(struct cgroup_subsys_state *css)
@@ -468,7 +548,10 @@ static void dmemcg_free_region(struct kref *ref)
  * dmem_cgroup_unregister_region() - Unregister a previously registered region.
  * @region: The region to unregister.
  *
- * This function undoes dmem_cgroup_register_region.
+ * This function undoes dmem_cgroup_register_region.  It drains any
+ * in-flight reclaim callbacks before returning, so the caller may safely
+ * free the resources pointed to by @init.reclaim_priv once this function
+ * returns.
  */
 void dmem_cgroup_unregister_region(struct dmem_cgroup_region *region)
 {
@@ -477,6 +560,15 @@ void dmem_cgroup_unregister_region(struct dmem_cgroup_region *region)
 	if (!region)
 		return;
 
+	/*
+	 * Acquire the write side to drain any in-flight reclaim callbacks.
+	 * After up_write() below, set_resource_max() will observe
+	 * region->unregistered = true under its own down_read() and skip
+	 * the reclaim loop, so reclaim_priv is safe to free once this
+	 * function returns.
+	 */
+	down_write(&region->unregister_sem);
+
 	spin_lock(&dmemcg_lock);
 
 	/* Remove from global region list */
@@ -496,6 +588,8 @@ void dmem_cgroup_unregister_region(struct dmem_cgroup_region *region)
 	region->unregistered = true;
 	spin_unlock(&dmemcg_lock);
 
+	up_write(&region->unregister_sem);
+
 	kref_put(&region->ref, dmemcg_free_region);
 }
 EXPORT_SYMBOL_GPL(dmem_cgroup_unregister_region);
@@ -537,7 +631,10 @@ dmem_cgroup_register_region(const struct dmem_cgroup_init *init,
 	INIT_LIST_HEAD(&ret->pools);
 	ret->name = region_name;
 	ret->size = init->size;
+	ret->ops = init->ops;
+	ret->reclaim_priv = init->reclaim_priv;
 	kref_init(&ret->ref);
+	init_rwsem(&ret->unregister_sem);
 
 	spin_lock(&dmemcg_lock);
 	list_add_tail_rcu(&ret->region_node, &dmem_cgroup_regions);
@@ -733,9 +830,10 @@ static int dmemcg_parse_limit(char *options, u64 *new_limit)
 
 static ssize_t dmemcg_limit_write(struct kernfs_open_file *of,
 				 char *buf, size_t nbytes, loff_t off,
-				 void (*apply)(struct dmem_cgroup_pool_state *, u64))
+				 void (*apply)(struct dmem_cgroup_pool_state *, u64, bool))
 {
 	struct dmemcg_state *dmemcs = css_to_dmemcs(of_css(of));
+	bool nonblock = of->file->f_flags & O_NONBLOCK;
 	int err = 0;
 
 	while (buf && !err) {
@@ -780,7 +878,8 @@ static ssize_t dmemcg_limit_write(struct kernfs_open_file *of,
 		}
 
 		/* And commit */
-		apply(pool, new_limit);
+		apply(pool, new_limit, nonblock);
+
 		dmemcg_pool_put(pool);
 
 out_put:
-- 
2.54.0



* [PATCH v6 2/6] cgroup/dmem: Introduce struct dmem_cgroup_init for region initialization
From: Thomas Hellström @ 2026-06-11 17:32 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Natalie Vock, Johannes Weiner, Tejun Heo,
	Michal Koutný, cgroups, Huang Rui, Matthew Brost,
	Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <20260611173301.17473-1-thomas.hellstrom@linux.intel.com>

Replace the bare u64 size argument to dmem_cgroup_register_region() and
drmm_cgroup_register_region() with a const struct dmem_cgroup_init *
pointer. The struct currently carries only the size field, but using a
struct makes the API extensible: future callers can supply additional
initialization parameters without adding more positional arguments.

Update all in-tree callers (amdgpu, xe) to use a compound-literal
initializer.
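
The extensibility argument can be illustrated with a userspace sketch. struct
region_init and register_region() are hypothetical mirrors of the kernel
names, and the ops member stands in for any field added later. The point is
that a designated-initializer compound literal leaves unnamed members
zero-initialized, so existing callers keep compiling when the struct grows.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of struct dmem_cgroup_init. */
struct region_init {
	uint64_t size;
	const void *ops; /* stands in for an optional member added later */
};

/*
 * Mirrors the validation in the patch: a NULL init or a zero size is
 * rejected. Returns the accepted size, or 0 if registration is refused.
 */
static uint64_t register_region(const struct region_init *init)
{
	if (!init || !init->size)
		return 0;
	return init->size;
}

/* Caller in the style of the converted drivers: only .size is named. */
static uint64_t example_caller(uint64_t vram_size)
{
	return register_region(&(struct region_init){
		.size = vram_size,
	});
}
```

Because .ops is not named in example_caller(), it is implicitly NULL. That
matches the "optional" semantics the later patches in the series give to the
new members.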

v5:
- Commit introduced.

Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c |  6 +++++-
 drivers/gpu/drm/drm_drv.c                    |  8 +++++---
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c         |  7 ++++++-
 include/drm/drm_drv.h                        |  4 +++-
 include/linux/cgroup_dmem.h                  | 16 +++++++++++++---
 kernel/cgroup/dmem.c                         | 10 ++++++----
 6 files changed, 38 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index ac3f71d77140..08f05c3aed1d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -23,6 +23,7 @@
  */
 
 #include <linux/dma-mapping.h>
+#include <linux/cgroup_dmem.h>
 #include <drm/ttm/ttm_range_manager.h>
 #include <drm/drm_drv.h>
 #include <drm/drm_buddy.h>
@@ -932,7 +933,10 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 	if (err)
 		return err;
 
-	man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram", adev->gmc.real_vram_size);
+	man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram",
+					      &(struct dmem_cgroup_init){
+						.size = adev->gmc.real_vram_size,
+					      });
 	if (IS_ERR(man->cg))
 		return PTR_ERR(man->cg);
 
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 1ff0bf7cba6a..3c570f9393b9 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -960,17 +960,19 @@ static void drmm_cg_unregister_region(struct drm_device *dev, void *arg)
  * drmm_cgroup_register_region - Register a region of a DRM device to cgroups
  * @dev: device for region
  * @region_name: Region name for registering
- * @size: Size of region in bytes
+ * @init: Initialization parameters for the region.
  *
  * This decreases the ref-count of @dev by one. The device is destroyed if the
  * ref-count drops to zero.
  */
-struct dmem_cgroup_region *drmm_cgroup_register_region(struct drm_device *dev, const char *region_name, u64 size)
+struct dmem_cgroup_region *
+drmm_cgroup_register_region(struct drm_device *dev, const char *region_name,
+			    const struct dmem_cgroup_init *init)
 {
 	struct dmem_cgroup_region *region;
 	int ret;
 
-	region = dmem_cgroup_register_region(size, "drm/%s/%s", dev->unique, region_name);
+	region = dmem_cgroup_register_region(init, "drm/%s/%s", dev->unique, region_name);
 	if (IS_ERR_OR_NULL(region))
 		return region;
 
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index b518f7dec680..308fda4248eb 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -4,6 +4,8 @@
  * Copyright (C) 2021-2022 Red Hat
  */
 
+#include <linux/cgroup_dmem.h>
+
 #include <drm/drm_managed.h>
 #include <drm/drm_drv.h>
 #include <drm/drm_buddy.h>
@@ -303,7 +305,10 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
 	int err;
 
 	name = mem_type == XE_PL_VRAM0 ? "vram0" : "vram1";
-	man->cg = drmm_cgroup_register_region(&xe->drm, name, size);
+	man->cg = drmm_cgroup_register_region(&xe->drm, name,
+					      &(struct dmem_cgroup_init){
+						.size = size,
+					      });
 	if (IS_ERR(man->cg))
 		return PTR_ERR(man->cg);
 
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index e09559495c5b..b23830494ed4 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -34,6 +34,7 @@
 
 #include <drm/drm_device.h>
 
+struct dmem_cgroup_init;
 struct dmem_cgroup_region;
 struct drm_fb_helper;
 struct drm_fb_helper_surface_size;
@@ -433,7 +434,8 @@ void *__devm_drm_dev_alloc(struct device *parent,
 
 struct dmem_cgroup_region *
 drmm_cgroup_register_region(struct drm_device *dev,
-			    const char *region_name, u64 size);
+			    const char *region_name,
+			    const struct dmem_cgroup_init *init);
 
 /**
  * devm_drm_dev_alloc - Resource managed allocation of a &drm_device instance
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736..d9eab8a2c1ee 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -14,8 +14,18 @@ struct dmem_cgroup_pool_state;
 /* Opaque definition of a cgroup region, used internally */
 struct dmem_cgroup_region;
 
+/**
+ * struct dmem_cgroup_init - Initialization parameters for a dmem cgroup region.
+ * @size: Size of the region in bytes.
+ */
+struct dmem_cgroup_init {
+	u64 size;
+};
+
 #if IS_ENABLED(CONFIG_CGROUP_DMEM)
-struct dmem_cgroup_region *dmem_cgroup_register_region(u64 size, const char *name_fmt, ...) __printf(2,3);
+struct dmem_cgroup_region *
+dmem_cgroup_register_region(const struct dmem_cgroup_init *init,
+			    const char *name_fmt, ...) __printf(2, 3);
 void dmem_cgroup_unregister_region(struct dmem_cgroup_region *region);
 int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
 			   struct dmem_cgroup_pool_state **ret_pool,
@@ -27,8 +37,8 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 
 void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
 #else
-static inline __printf(2,3) struct dmem_cgroup_region *
-dmem_cgroup_register_region(u64 size, const char *name_fmt, ...)
+static inline __printf(2, 3) struct dmem_cgroup_region *
+dmem_cgroup_register_region(const struct dmem_cgroup_init *init, const char *name_fmt, ...)
 {
 	return NULL;
 }
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 6430c7ce1e03..d12c8543f3fe 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -502,7 +502,7 @@ EXPORT_SYMBOL_GPL(dmem_cgroup_unregister_region);
 
 /**
  * dmem_cgroup_register_region() - Register a regions for dev cgroup.
- * @size: Size of region to register, in bytes.
+ * @init: Initialization parameters for the region.
  * @fmt: Region parameters to register
  *
  * This function registers a node in the dmem cgroup with the
@@ -511,13 +511,15 @@ EXPORT_SYMBOL_GPL(dmem_cgroup_unregister_region);
  *
  * Return: NULL or a struct on success, PTR_ERR on failure.
  */
-struct dmem_cgroup_region *dmem_cgroup_register_region(u64 size, const char *fmt, ...)
+struct dmem_cgroup_region *
+dmem_cgroup_register_region(const struct dmem_cgroup_init *init,
+			    const char *fmt, ...)
 {
 	struct dmem_cgroup_region *ret;
 	char *region_name;
 	va_list ap;
 
-	if (!size)
+	if (!init || !init->size)
 		return NULL;
 
 	va_start(ap, fmt);
@@ -534,7 +536,7 @@ struct dmem_cgroup_region *dmem_cgroup_register_region(u64 size, const char *fmt
 
 	INIT_LIST_HEAD(&ret->pools);
 	ret->name = region_name;
-	ret->size = size;
+	ret->size = init->size;
 	kref_init(&ret->ref);
 
 	spin_lock(&dmemcg_lock);
-- 
2.54.0


* [PATCH v6 1/6] drm/amdgpu: Fix init ordering in amdgpu_vram_mgr_init()
From: Thomas Hellström @ 2026-06-11 17:32 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Sashiko-bot, Friedrich Vock,
	Maarten Lankhorst, Tejun Heo, Maxime Ripard, Christian König,
	Alex Deucher, amd-gfx, dri-devel, stable, Natalie Vock,
	Johannes Weiner, Michal Koutný, cgroups, Huang Rui,
	Matthew Brost, Matthew Auld, Maarten Lankhorst, Thomas Zimmermann,
	Simona Vetter, David Airlie, Rodrigo Vivi, linux-kernel
In-Reply-To: <20260611173301.17473-1-thomas.hellstrom@linux.intel.com>

drmm_cgroup_register_region() is called before INIT_LIST_HEAD() and
gpu_buddy_init() in amdgpu_vram_mgr_init(). If it fails, the function
returns early and bypasses those initializations.

Since adev->mman.initialized is set to true before amdgpu_vram_mgr_init()
is called, a failure triggers amdgpu_ttm_fini(), which calls
amdgpu_vram_mgr_fini(), which then:

 - Calls list_for_each_entry_safe() on reservations_pending and
   reserved_pages, whose list_head::next pointers are zero-initialized
   (NULL). The loop does not recognize them as empty and dereferences NULL.

 - Calls gpu_buddy_fini(), which iterates free_trees[] unconditionally
   via for_each_free_tree(). Since mm->free_trees is NULL
   (never allocated), this dereferences NULL.

Both result in a kernel panic on the module load error path.

Fix by moving drmm_cgroup_register_region() to after the list and buddy
allocator are fully initialized, so the teardown path is safe to run.
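
The list half of the failure mode above can be reproduced outside the kernel.
What follows is a minimal userspace sketch, not the amdgpu or kernel code:
struct list_head is re-declared locally, and init_list_head(),
list_is_empty(), first_step_hits_null() and check_teardown_states() are
hypothetical stand-ins named for illustration only. It assumes the usual
circular-list convention in which an empty head points back at itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for the kernel's circular doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Same effect as INIT_LIST_HEAD(): an empty list points at itself. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Emptiness is "next points back at the head", not "next is NULL". */
static bool list_is_empty(const struct list_head *head)
{
	return head->next == head;
}

/*
 * True if an iteration starting at head would treat the list as non-empty
 * and then follow a NULL next pointer on its very first step.
 */
static bool first_step_hits_null(const struct list_head *head)
{
	return !list_is_empty(head) && head->next == NULL;
}

/* Compares the two states a teardown path could observe. Returns 0. */
static int check_teardown_states(void)
{
	struct list_head zeroed;	/* early return: never initialized */
	struct list_head ready;		/* init completed before any failure */

	memset(&zeroed, 0, sizeof(zeroed));
	init_list_head(&ready);

	/* Zeroed head: looks non-empty, and the walk starts from NULL. */
	assert(!list_is_empty(&zeroed));
	assert(first_step_hits_null(&zeroed));

	/* Initialized head: the walk terminates immediately. */
	assert(list_is_empty(&ready));
	assert(!first_step_hits_null(&ready));
	return 0;
}
```

With the head initialized before any step that can fail, a teardown loop sees
an empty list and exits without touching any entry, which is the property the
reordering relies on.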

Reported-by: Sashiko-bot <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260428073116.15687-1-thomas.hellstrom@linux.intel.com?part=4
Fixes: 2b624a2c1865 ("drm/ttm: Handle cgroup based eviction in TTM")
Cc: Friedrich Vock <friedrich.vock@gmx.de>
Cc: Maarten Lankhorst <dev@lankhorst.se>
Cc: Tejun Heo <tj@kernel.org>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v6.14+
Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 2a241a5b12c4..ac3f71d77140 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -918,9 +918,6 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 	struct ttm_resource_manager *man = &mgr->manager;
 	int err;
 
-	man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram", adev->gmc.real_vram_size);
-	if (IS_ERR(man->cg))
-		return PTR_ERR(man->cg);
 	ttm_resource_manager_init(man, &adev->mman.bdev,
 				  adev->gmc.real_vram_size);
 
@@ -935,6 +932,10 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 	if (err)
 		return err;
 
+	man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram", adev->gmc.real_vram_size);
+	if (IS_ERR(man->cg))
+		return PTR_ERR(man->cg);
+
 	ttm_set_driver_manager(&adev->mman.bdev, TTM_PL_VRAM, &mgr->manager);
 	ttm_resource_manager_set_used(man, true);
 	return 0;
-- 
2.54.0


