Linux cgroups development

* [PATCH v6 1/6] drm/amdgpu: Fix init ordering in amdgpu_vram_mgr_init()
From: Thomas Hellström @ 2026-06-11 17:32 UTC
  To: intel-xe
  Cc: Thomas Hellström, Sashiko-bot, Friedrich Vock,
	Maarten Lankhorst, Tejun Heo, Maxime Ripard, Christian König,
	Alex Deucher, amd-gfx, dri-devel, stable, Natalie Vock,
	Johannes Weiner, Michal Koutný, cgroups, Huang Rui,
	Matthew Brost, Matthew Auld, Maarten Lankhorst, Thomas Zimmermann,
	Simona Vetter, David Airlie, Rodrigo Vivi, linux-kernel
In-Reply-To: <20260611173301.17473-1-thomas.hellstrom@linux.intel.com>

drmm_cgroup_register_region() is called before INIT_LIST_HEAD() and
gpu_buddy_init() in amdgpu_vram_mgr_init(). If it fails, the function
returns early and bypasses those initializations.

Since adev->mman.initialized is set to true before amdgpu_vram_mgr_init()
is called, a failure triggers amdgpu_ttm_fini(), which calls
amdgpu_vram_mgr_fini(), which then:

 - Calls list_for_each_entry_safe() on reservations_pending and
   reserved_pages, whose list_head::next pointers are zero-initialized
   (NULL). The loop does not recognize them as empty and dereferences NULL.

 - Calls gpu_buddy_fini(), which iterates free_trees[] unconditionally
   via for_each_free_tree(). Since mm->free_trees is NULL
   (never allocated), this dereferences NULL.

Both result in a kernel panic on the module load error path.

Fix by moving drmm_cgroup_register_region() to after the list and buddy
allocator are fully initialized, so the teardown path is safe to run.

Reported-by: Sashiko-bot <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260428073116.15687-1-thomas.hellstrom@linux.intel.com?part=4
Fixes: 2b624a2c1865 ("drm/ttm: Handle cgroup based eviction in TTM")
Cc: Friedrich Vock <friedrich.vock@gmx.de>
Cc: Maarten Lankhorst <dev@lankhorst.se>
Cc: Tejun Heo <tj@kernel.org>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v6.14+
Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 2a241a5b12c4..ac3f71d77140 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -918,9 +918,6 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 	struct ttm_resource_manager *man = &mgr->manager;
 	int err;
 
-	man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram", adev->gmc.real_vram_size);
-	if (IS_ERR(man->cg))
-		return PTR_ERR(man->cg);
 	ttm_resource_manager_init(man, &adev->mman.bdev,
 				  adev->gmc.real_vram_size);
 
@@ -935,6 +932,10 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 	if (err)
 		return err;
 
+	man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram", adev->gmc.real_vram_size);
+	if (IS_ERR(man->cg))
+		return PTR_ERR(man->cg);
+
 	ttm_set_driver_manager(&adev->mman.bdev, TTM_PL_VRAM, &mgr->manager);
 	ttm_resource_manager_set_used(man, true);
 	return 0;
-- 
2.54.0
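
As an illustration of the first failure mode described in the commit
message, the following userspace sketch models struct list_head. It is
not kernel code: the struct and helpers are simplified stand-ins, and
walk_would_deref_null() is a hypothetical helper added only for this
example. It shows why a zero-filled list head is not treated as an empty
list, while one that went through INIT_LIST_HEAD()-style initialization
is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Same effect as INIT_LIST_HEAD(): an empty list points at itself. */
static void init_list_head(struct list_head *head)
{
	assert(head != NULL);
	head->next = head;
	head->prev = head;
}

/* Same test the kernel's list_empty() performs. */
static bool list_looks_empty(const struct list_head *head)
{
	assert(head != NULL);
	return head->next == head;
}

/*
 * Hypothetical helper for this sketch: a list walk starts by following
 * head->next, so a NULL next pointer means the walk dereferences NULL.
 */
static bool walk_would_deref_null(const struct list_head *head)
{
	assert(head != NULL);
	return head->next == NULL;
}
```

With a head whose pointers are both NULL, list_looks_empty() returns
false and walk_would_deref_null() returns true; after init_list_head()
both results flip, which is the state the teardown path relies on.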



* [PATCH v6 0/6] Add reclaim to the dmem cgroup controller
From: Thomas Hellström @ 2026-06-11 17:32 UTC
  To: intel-xe
  Cc: Thomas Hellström, Natalie Vock, Johannes Weiner, Tejun Heo,
	Michal Koutný, cgroups, Huang Rui, Matthew Brost,
	Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel

When writing a "max" limit lower than the current usage, the
existing code silently failed. This series improves on that by
attempting to synchronously reclaim device memory to push the
usage under the new max limit, and by returning -EBUSY if that
fails.

Patch 1 fixes a pre-existing amdgpu_vram_mgr_init() error path.
Patch 2 introduces struct dmem_cgroup_init for extensible region
      registration.
Patch 3 implements and documents a reclaim callback interface
      for the dmem controller.
Patch 4 implements a TTM reclaim callback.
Patches 5-6 hook up the reclaim callback to the dmem cgroup-aware
      drivers xe and amdgpu.

v2:
- Remove the error propagation that was in a previous series (Maarten)
- A number of updates in patch 1. See its commit message for
  details (Maarten)

v3:
- Add patch 1 fixing a pre-existing amdgpu_vram_mgr_init() error path
  bug where drmm_cgroup_register_region() was called before
  INIT_LIST_HEAD() and gpu_buddy_init(), causing a kernel panic on
  failure. (Sashiko-bot)
- Use an rwsem to protect reclaim callback registration and region
  unregister against concurrent reclaim invocations. (Sashiko-bot)
- Fix ttm_resource_manager_set_dmem_region() storing an error pointer
  in man->cg unconditionally. (Sashiko-bot)
- Fix kernel-doc function name format for ttm_bo_evict_cgroup() and
  ttm_resource_manager_set_dmem_region().

v4:
- Rebased on drm-tip; dropped the XE_PL_STOLEN guard in the xe patch
  as stolen memory uses a separate TTM manager.

v5:
- Add patch 2 introducing struct dmem_cgroup_init to make the
  dmem_cgroup_register_region() API extensible without adding positional
  arguments in the future.
- Use nonblock=true in reset_all_resource_limits() to avoid sleeping
  inside rcu_read_lock() in dmemcs_offline(). (Sashiko-bot)
- Compare usage against the truncated limit stored in cnt.max, not the
  original u64. (Sashiko-bot)
- Use DMEM_MAX_RECLAIM_RETRIES (16) retry budget instead of 5, matching
  the memcg controller; only -ENOSPC (no progress) counts against the
  budget, other errors abort immediately.
- Handle NULL region in ttm_resource_manager_set_dmem_region() to clear
  the reclaim callback, preventing use-after-free when the manager is
  torn down while the dmem region outlives it. (Sashiko-bot)
- Return 0 on any eviction progress; reserve -ENOSPC for zero progress.
- Clear the reclaim callback in xe and amdgpu fini paths to prevent
  use-after-free after driver unbind with open DRM file descriptors.
  (Sashiko-bot)
- Register xe fini devres action before drmm_cgroup_register_region()
  so LIFO teardown runs unregister first, draining callbacks before the
  manager is destroyed. (Sashiko-bot)
- Switch amdgpu to explicit dmem_cgroup_unregister_region() at the top
  of amdgpu_vram_mgr_fini() before any manager teardown, since amdgpu's
  fini is called explicitly during driver unbind before drmm cleanup.
  (Sashiko-bot)
- Wrap the xe reclaim callback with drm_dev_enter()/drm_dev_exit() to
  prevent TTM reclaim from running after driver unbind.

v6:
- Move the ops check inside down_read() in set_resource_max(), guarded
  by region->unregistered, to close a UAF race against
  dmem_cgroup_unregister_region(). (Sashiko-bot)
- Fix dmem_cgroup_ops->reclaim docstring: -ENOSPC is retried up to
  DMEM_MAX_RECLAIM_RETRIES times, not an immediate stop. (Sashiko-bot)
- Fix mgr->cg_region never being assigned in amdgpu_vram_mgr_init(),
  causing dmem_cgroup_unregister_region() in fini to silently no-op.
  (Sashiko-bot)
- Reorder amdgpu_vram_mgr_fini() to call set_used(false) and
  evict_all() before dmem_cgroup_unregister_region(), so
  ttm_resource_free() can uncharge via man->cg during eviction; clear
  man->cg after unregister. (Sashiko-bot)

User-space tests are at
https://patchwork.freedesktop.org/series/163935/

Test-with: 20260428065411.4222-1-thomas.hellstrom@linux.intel.com

Thomas Hellström (6):
  drm/amdgpu: Fix init ordering in amdgpu_vram_mgr_init()
  cgroup/dmem: Introduce struct dmem_cgroup_init for region initialization
  cgroup/dmem: Add reclaim callback for lowering max below current usage
  drm/ttm: Hook up a cgroup-aware reclaim callback for the dmem
    controller
  drm/xe: Wire up dmem cgroup reclaim for VRAM manager
  drm/amdgpu: Wire up dmem cgroup reclaim for VRAM manager

 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c      |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c |  30 ++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h |   2 +
 drivers/gpu/drm/drm_drv.c                    |   8 +-
 drivers/gpu/drm/ttm/ttm_bo.c                 |  95 +++++++++++++++++++-
 drivers/gpu/drm/ttm/ttm_bo_util.c            |   3 +-
 drivers/gpu/drm/ttm/ttm_resource.c           |  50 +++++++++++
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c         |  53 +++++++++--
 include/drm/drm_drv.h                        |   4 +-
 include/drm/ttm/ttm_bo.h                     |  10 +++
 include/drm/ttm/ttm_resource.h               |   7 ++
 include/linux/cgroup_dmem.h                  |  38 +++++++-
 kernel/cgroup/dmem.c                         | 129 ++++++++++++++++++++++++---
 13 files changed, 396 insertions(+), 35 deletions(-)

-- 
2.54.0




* Re: [PATCH 1/1] cgroup: rdma: free idle pools during cgroup teardown
From: Michal Koutný @ 2026-06-11 17:29 UTC
  To: Ren Wei
  Cc: cgroups, tj, hannes, pandit.parav, yuantan098, zcliangcn, bird,
	tr0jan, d4n.for.sec
In-Reply-To: <9eb365a37ab83f38686007f8a61a656759d39bd7.1781092143.git.d4n.for.sec@gmail.com>


On Thu, Jun 11, 2026 at 02:13:16AM +0800, Ren Wei <n05ec@lzu.edu.cn> wrote:
> From: Daming Li <d4n.for.sec@gmail.com>
> 
> rdmacg_css_offline() converts each pool to all-max limits so the
> existing reclaim path can free it after the last uncharge. However,
> zero-usage pools are already reclaimable at that point and leaving them
> linked until rdmacg_css_free() lets later device teardown hit a
> use-after-free when free_cg_rpool_locked() deletes cg_node from a freed
> cgroup list head.

That's a valid problem and a good analysis. The rpool->cg_node points to
the rdma_cgroup without bumping a refcount on the respective css, hence
the observed UAF.

> Free zero-usage pools directly from rdmacg_css_offline() while holding
> rdmacg_mutex. This keeps the existing reclaim rule, avoids new lifetime
> states, and ensures a cgroup cannot be freed with reclaimable rdmacg
> pools still attached.

I see this approach works (without explicit ref bump and complications
arising from that tracking).

The shortened availability of events/peak should be OK, as those are
meant only for online cgroups.

> 
> Fixes: 39d3e7584a68 ("rdmacg: Added rdma cgroup controller")
> Cc: stable@vger.kernel.org
> Reported-by: Yuan Tan <yuantan098@gmail.com>
> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
> Reported-by: Xin Liu <bird@lzu.edu.cn>
> Assisted-by: Codex:GPT-5.4
> Co-developed-by: Luxing Yin <tr0jan@lzu.edu.cn>
> Signed-off-by: Luxing Yin <tr0jan@lzu.edu.cn>
> Signed-off-by: Daming Li <d4n.for.sec@gmail.com>
> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
> ---
>  kernel/cgroup/rdma.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)

Reviewed-by: Michal Koutný <mkoutny@suse.com>



* Re: [Kernel Bug] INFO: rcu detected stall in count_memcg_event_mm
From: Shakeel Butt @ 2026-06-11 17:21 UTC
  To: Longxing Li
  Cc: syzkaller, hannes, mhocko, roman.gushchin, muchun.song, cgroups,
	linux-mm, linux-kernel
In-Reply-To: <CAHPqNmzgCY+sHOOG8YVrCFO-7oh6TBeL4SCHEcfVvH6J1SUVdg@mail.gmail.com>

Hi Longxing,

Thanks for the report.

On Tue, Jun 09, 2026 at 07:57:56PM +0800, Longxing Li wrote:
> Dear Linux kernel developers and maintainers,
> 
> We would like to report a new kernel bug found by our tool. INFO: rcu
> detected stall in count_memcg_event_mm. Details are as follows.
> 
> Kernel commit: v5.15.189

This is an old kernel; can you reproduce with the latest kernel? Also, at a
high level this seems like CPU starvation. Can you also describe your system and
the workload/test that is triggering this issue?



* Re: [PATCH v2 02/16] mm/slab: do not init any kfence objects on allocation
From: Vlastimil Babka (SUSE) @ 2026-06-11 16:37 UTC
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Andrey Konovalov, Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm,
	linux-kernel, cgroups
In-Reply-To: <74adf668-78c2-4989-a6c6-c6ec7bd68855@kernel.org>

On 6/11/26 17:11, Harry Yoo wrote:
> 
>> From 3a1c4398ce9f361a4e6f4d9946eab6237eea89c2 Mon Sep 17 00:00:00 2001
>> From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
>> Date: Wed, 10 Jun 2026 17:40:04 +0200
>> Subject: [PATCH] mm/slab: do not init any kfence objects on allocation
>> 
>> When init (zeroing) on allocation is requested, for kmalloc() we
>> generally have to zero the full object size even if a smaller size is
>> requested, in order to provide krealloc()'s __GFP_ZERO guarantees.
>> 
>> When we end up allocating a kfence object, kfence perfoms the zeroing on
> 
> nit: perfoms -> performs

Fixed.

>> its own because has its own redzone beyond the requested size. Thus
>> slab_post_alloc_hook() has an 'init' parameter which has to be evaluated
>> in all callers (via slab_want_init_on_alloc()) and should be false for
>> kfence allocations.
>> 
>> For kfence allocations in slab_alloc_node() this is achieved by subtly
>> skipping over the slab_want_init_on_alloc() call. Other callers (i.e.
>> kmem_cache_alloc_bulk_noprof()) however evaluate it unconditionally even
>> if they do end up with a kfence allocation. This is only subtly not a
>> problem, as those are not kmalloc allocations and thus the "requested
>> size" equals s->object_size and thus it cannot interfere with kfence's
>> redzone. There's just a unnecessary double zeroing (in both kfence and
>> slab_post_alloc_hook()), but it's all very fragile and contradicts the
>> comment in kfence_guarded_alloc().
>> 
>> Remove this subtlety and simplify the code by eliminating the init
>> parameter from slab_post_alloc_hook() and make it call
>> slab_want_init_on_alloc() itself. Instead add a is_kfence_address()
>> check before performing the memset, which will start doing the right
>> thing for all callers of slab_post_alloc_hook().
>> 
>> This potentially adds overhead of the is_kfence_address() check to
>> allocation hotpath, but that one is designed to be as small as possible,
>> and it's only evaluated if zeroing is about to happen. This means (aside
>> from init_on_alloc hardening) only for __GFP_ZERO allocations, and the
>> zeroing itself comes with an overhead likely larger than the added
>> check.
> 
>> While at it, refactor the handling of evaluating when KASAN does the
>> init instead of SLUB, with no intended functional changes. A
>> non-functional change is that we don't pass kasan_init as true to
>> kasan_slab_alloc() if kasan has no integrated init, but then the value
>> is ignored anyway, so it's theoretically more correct.
> 
> Right.
> 
>> Thanks to Harry Yoo for the initial refactoring attempt, and for updated
>> comments that are used here.
> 
> No problem ;)
> 
>> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-2-7190909db118@kernel.org
>> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>> ---
> 
> Looks good to me,
> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

Thanks!

> Thanks!
> 
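
The condition in the v2 hunk quoted in this thread can be modelled as a
pure predicate, which may make the individual cases easier to check.
This is a userspace sketch and not the final patch: struct init_inputs
and slub_should_memset() are hypothetical names, and each boolean stands
in for the kernel helper named in its comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Plain booleans standing in for values the kernel derives at runtime. */
struct init_inputs {
	bool have_object;           /* p[i] != NULL */
	bool init;                  /* slab_want_init_on_alloc() */
	bool slub_debug;            /* __slub_debug_enabled() */
	bool kasan_integrated_init; /* kasan_has_integrated_init() */
	bool kfence_object;         /* is_kfence_address(p[i]) */
};

/* Returns true when SLUB itself would zero the object with memset(). */
static bool slub_should_memset(const struct init_inputs *in)
{
	bool kasan_init;

	assert(in != NULL);

	/* With slab_debug enabled, KASAN is not asked to do the init. */
	kasan_init = in->init && !in->slub_debug;

	return in->have_object && in->init &&
	       (!kasan_init || !in->kasan_integrated_init) &&
	       !in->kfence_object;
}
```

In this model a kfence object is never zeroed by SLUB, an object that
KASAN initializes as part of tagging is skipped unless slab_debug is
enabled, and nothing is zeroed when init is false.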



* Re: [PATCH v2 15/16] mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
From: Vlastimil Babka (SUSE) @ 2026-06-11 16:28 UTC
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-15-7190909db118@kernel.org>

On 6/10/26 17:40, Vlastimil Babka (SUSE) wrote:
> __GFP_NO_OBJ_EXT has limited scope within the slab allocator itself and
> gfp flags are a scarce resource, unlike slab's alloc_flags.
> 
> Introduce SLAB_ALLOC_NO_RECURSE alloc flag that has the same intent as
> __GFP_NO_OBJ_EXT but a more generic name, meaning that a kmalloc()
> family function should not recurse into another kmalloc*() for the
> purposes of allocating auxiliary structures (obj_ext arrays or sheaves).
> 
> First, replace the __GFP_NO_OBJ_EXT for allocating obj_ext arrays in
> alloc_slab_obj_exts(). Make use of the newly added kmalloc_flags()
> function, where we can pass alloc_flags with SLAB_ALLOC_NO_RECURSE
> added. This will also pass through SLAB_ALLOC_TRYLOCK so we don't need
> to special case kmalloc_nolock() anymore.
> 
> Note that until now the kmalloc_nolock() ignored the incoming gfp flags
> and hardcoded __GFP_ZERO | __GFP_NO_OBJ_EXT. But it's correct to pass on
> the incoming gfp flags (only augmented with __GFP_ZERO), because if
> alloc_flags contain SLAB_ALLOC_TRYLOCK, the incoming gfp flags have to
> be also compatible with it.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

As pointed out by Sashiko, this piecemeal approach creates a bisection
hazard where sheaves -> obj_ext -> sheaves -> ... recursion can happen.
So I'll change this as follows to make obj_ext accept and pass both the
gfp flag and the alloc_flags flag that prevent recursion, and change the
next patch to revert that temporary change again.

diff --git a/mm/slub.c b/mm/slub.c
index a81f1f6bad67..c60f3a252ae5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2167,6 +2167,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 
 	gfp &= ~OBJCGS_CLEAR_MASK;
 	/* Prevent recursive extension vector allocation */
+	gfp |= __GFP_NO_OBJ_EXT;
 	alloc_flags |= SLAB_ALLOC_NO_RECURSE;
 
 	sz = obj_exts_alloc_size(s, slab, gfp);
@@ -2371,7 +2372,7 @@ __alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags,
 	if (s->flags & (SLAB_NO_OBJ_EXT | SLAB_NOLEAKTRACE))
 		return;
 
-	if (alloc_flags & SLAB_ALLOC_NO_RECURSE)
+	if (alloc_flags & SLAB_ALLOC_NO_RECURSE || flags & __GFP_NO_OBJ_EXT)
 		return;
 
 	slab = virt_to_slab(object);



* Re: [PATCH v2 02/16] mm/slab: do not init any kfence objects on allocation
From: Harry Yoo @ 2026-06-11 15:11 UTC
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Andrey Konovalov, Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm,
	linux-kernel, cgroups
In-Reply-To: <e71bfc13-c233-4f85-a6ec-76327d3c6510@kernel.org>



On 6/11/26 11:47 PM, Vlastimil Babka (SUSE) wrote:
> On 6/11/26 10:34, Vlastimil Babka (SUSE) wrote:
>> On 6/11/26 05:19, Harry Yoo wrote:
>>>
>>>> This potentially adds overhead of the is_kfence_address() check to
>>>> allocation hotpath, but that one is designed to be as small as possible,
>>>> and it's only evaluated if zeroing is about to happen. This means (aside
>>>> from init_on_alloc hardening) only for __GFP_ZERO allocations, and the
>>>> zeroing itself comes with an overhead likely larger than the added
>>>> check.
>>>>
>>>> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>>>> ---
>>>>  mm/kfence/core.c |  2 +-
>>>>  mm/slub.c        | 23 ++++++++---------------
>>>>  2 files changed, 9 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/mm/slub.c b/mm/slub.c
>>>> index e2ee8f1aaccf..8e5264d3ddbf 100644
>>>> --- a/mm/slub.c
>>>> +++ b/mm/slub.c
>>>> @@ -4565,9 +4565,10 @@ struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
>>>>  
>>>>  static __fastpath_inline
>>>>  bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>>>> -			  gfp_t flags, size_t size, void **p, bool init,
>>>> +			  gfp_t flags, size_t size, void **p,
>>>>  			  unsigned int orig_size)
>>>>  {
>>>> +	bool init = slab_want_init_on_alloc(flags, s);
>>>>  	unsigned int zero_size = s->object_size;
>>>>  	bool kasan_init = init;
>>>>  	size_t i;
>>>> @@ -4608,7 +4609,8 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>>>>  	for (i = 0; i < size; i++) {
>>>>  		p[i] = kasan_slab_alloc(s, p[i], init_flags, kasan_init);
>>>>  		if (p[i] && init && (!kasan_init ||
>>>> -				     !kasan_has_integrated_init()))
>>>> +				     !kasan_has_integrated_init())
>>>> +				 && !is_kfence_address(p[i]))
>>>
>>> I hope we could make it bit more verbose and straightforward,
>>> something like:
>>>
>>> diff --git a/mm/slub.c b/mm/slub.c
>>> index 5d7ea72ebebd..29cf4590f9d9 100644
>>> --- a/mm/slub.c
>>> +++ b/mm/slub.c
>>> @@ -4573,7 +4573,6 @@ bool slab_post_alloc_hook(struct kmem_cache *s,
>>> gfp_t flags, size_t size,
>>>  {
>>>  	bool init = slab_want_init_on_alloc(flags, s);
>>>  	unsigned int zero_size = s->object_size;
>>> -	bool kasan_init = init;
>>>  	size_t i;
>>>  	gfp_t init_flags = flags & gfp_allowed_mask;
>>>
>>> @@ -4591,29 +4590,37 @@ bool slab_post_alloc_hook(struct kmem_cache *s,
>>> gfp_t flags, size_t size,
>>>  	if (slub_debug_orig_size(s))
>>>  		zero_size = ac->orig_size;
>>>
>>> -	/*
>>> -	 * When slab_debug is enabled, avoid memory initialization integrated
>>> -	 * into KASAN and instead zero out the memory via the memset below with
>>> -	 * the proper size. Otherwise, KASAN might overwrite SLUB redzones and
>>> -	 * cause false-positive reports. This does not lead to a performance
>>> -	 * penalty on production builds, as slab_debug is not intended to be
>>> -	 * enabled there.
>>> -	 */
>>> -	if (__slub_debug_enabled())
>>> -		kasan_init = false;
>>> -
>>> -	/*
>>> -	 * As memory initialization might be integrated into KASAN,
>>> -	 * kasan_slab_alloc and initialization memset must be
>>> -	 * kept together to avoid discrepancies in behavior.
>>> -	 *
>>> -	 * As p[i] might get tagged, memset and kmemleak hook come after KASAN.
>>> -	 */
>>>  	for (i = 0; i < size; i++) {
>>> -		p[i] = kasan_slab_alloc(s, p[i], init_flags, kasan_init);
>>> -		if (p[i] && init && (!kasan_init ||
>>> -				     !kasan_has_integrated_init())
>>> -				 && !is_kfence_address(p[i]))
>>> +		bool skip_init = false;
>>> +
>>> +		if (is_kfence_address(p[i])) {
>>> +			/*
>>> +			 * kfence zeroes the object instead of SLUB to avoid
>>> +			 * overwriting its own redzone, and zeroing of
>>> +			 * s->object_size will corrupt it.
>>> +			 */
>>> +			skip_init = true;
>>
>> But now we perform this check even if init is false, making it more hot.
>>
>>> +		} else if (__slub_debug_enabled()) {
>>> +			/*
>>> +			 * KASAN never zeroes memory when slab_debug is enabled
>>> +			 * to avoid overwriting SLUB redzones. This does not
>>> +			 * lead to a performance penalty on production builds,
>>> +			 * as slab_debug is not intended to be enabled there.
>>> +			 */
>>> +			skip_init = false;
>>> +		} else if (kasan_has_integrated_init()) {
>>> +			/*
>>> +			 * ARM64 can set memory tags and zero the memory using
>>> +			 * a single instruction. Since HW_TAGS KASAN uses that
>>> +			 * while tagging the object, a separate zeroing is
>>> +			 * unnecessary unless slab_debug is enabled.
>>> +			 */
>>
>> (I like the new/updated comments)
>>
>>> +			skip_init = true;
>>> +		}>
>>
>> And these two are now done in every loop iteration even though they don't
>> depend on the object. Yeah it's a static key and build-time constant but still.
>>
>> But maybe there's some middle ground?
>>
>> Above the loop do (with your comments).
> 
> OK, not so simple, we still need the kasan_init variable too.

Ouch, right.

> I've ended up with this, thoughts?

Much better!

> From 3a1c4398ce9f361a4e6f4d9946eab6237eea89c2 Mon Sep 17 00:00:00 2001
> From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
> Date: Wed, 10 Jun 2026 17:40:04 +0200
> Subject: [PATCH] mm/slab: do not init any kfence objects on allocation
> 
> When init (zeroing) on allocation is requested, for kmalloc() we
> generally have to zero the full object size even if a smaller size is
> requested, in order to provide krealloc()'s __GFP_ZERO guarantees.
> 
> When we end up allocating a kfence object, kfence perfoms the zeroing on

nit: perfoms -> performs

> its own because has its own redzone beyond the requested size. Thus
> slab_post_alloc_hook() has an 'init' parameter which has to be evaluated
> in all callers (via slab_want_init_on_alloc()) and should be false for
> kfence allocations.
> 
> For kfence allocations in slab_alloc_node() this is achieved by subtly
> skipping over the slab_want_init_on_alloc() call. Other callers (i.e.
> kmem_cache_alloc_bulk_noprof()) however evaluate it unconditionally even
> if they do end up with a kfence allocation. This is only subtly not a
> problem, as those are not kmalloc allocations and thus the "requested
> size" equals s->object_size and thus it cannot interfere with kfence's
> redzone. There's just a unnecessary double zeroing (in both kfence and
> slab_post_alloc_hook()), but it's all very fragile and contradicts the
> comment in kfence_guarded_alloc().
> 
> Remove this subtlety and simplify the code by eliminating the init
> parameter from slab_post_alloc_hook() and make it call
> slab_want_init_on_alloc() itself. Instead add a is_kfence_address()
> check before performing the memset, which will start doing the right
> thing for all callers of slab_post_alloc_hook().
> 
> This potentially adds overhead of the is_kfence_address() check to
> allocation hotpath, but that one is designed to be as small as possible,
> and it's only evaluated if zeroing is about to happen. This means (aside
> from init_on_alloc hardening) only for __GFP_ZERO allocations, and the
> zeroing itself comes with an overhead likely larger than the added
> check.

> While at it, refactor the handling of evaluating when KASAN does the
> init instead of SLUB, with no intended functional changes. A
> non-functional change is that we don't pass kasan_init as true to
> kasan_slab_alloc() if kasan has no integrated init, but then the value
> is ignored anyway, so it's theoretically more correct.

Right.

> Thanks to Harry Yoo for the initial refactoring attempt, and for updated
> comments that are used here.

No problem ;)

> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-2-7190909db118@kernel.org
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Looks good to me,
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

Thanks!

-- 
Cheers,
Harry / Hyeonggon


* Re: [PATCH v2 02/16] mm/slab: do not init any kfence objects on allocation
From: Vlastimil Babka (SUSE) @ 2026-06-11 14:47 UTC
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Andrey Konovalov, Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm,
	linux-kernel, cgroups
In-Reply-To: <159d1e20-5b21-4329-ac9a-f7a5cb0fd56a@kernel.org>

On 6/11/26 10:34, Vlastimil Babka (SUSE) wrote:
> On 6/11/26 05:19, Harry Yoo wrote:
>> 
>>> This potentially adds overhead of the is_kfence_address() check to
>>> allocation hotpath, but that one is designed to be as small as possible,
>>> and it's only evaluated if zeroing is about to happen. This means (aside
>>> from init_on_alloc hardening) only for __GFP_ZERO allocations, and the
>>> zeroing itself comes with an overhead likely larger than the added
>>> check.
>>> 
>>> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>>> ---
>>>  mm/kfence/core.c |  2 +-
>>>  mm/slub.c        | 23 ++++++++---------------
>>>  2 files changed, 9 insertions(+), 16 deletions(-)
>>> 
>>> diff --git a/mm/slub.c b/mm/slub.c
>>> index e2ee8f1aaccf..8e5264d3ddbf 100644
>>> --- a/mm/slub.c
>>> +++ b/mm/slub.c
>>> @@ -4565,9 +4565,10 @@ struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
>>>  
>>>  static __fastpath_inline
>>>  bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>>> -			  gfp_t flags, size_t size, void **p, bool init,
>>> +			  gfp_t flags, size_t size, void **p,
>>>  			  unsigned int orig_size)
>>>  {
>>> +	bool init = slab_want_init_on_alloc(flags, s);
>>>  	unsigned int zero_size = s->object_size;
>>>  	bool kasan_init = init;
>>>  	size_t i;
>>> @@ -4608,7 +4609,8 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>>>  	for (i = 0; i < size; i++) {
>>>  		p[i] = kasan_slab_alloc(s, p[i], init_flags, kasan_init);
>>>  		if (p[i] && init && (!kasan_init ||
>>> -				     !kasan_has_integrated_init()))
>>> +				     !kasan_has_integrated_init())
>>> +				 && !is_kfence_address(p[i]))
>> 
>> I hope we could make it bit more verbose and straightforward,
>> something like:
>> 
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 5d7ea72ebebd..29cf4590f9d9 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -4573,7 +4573,6 @@ bool slab_post_alloc_hook(struct kmem_cache *s,
>> gfp_t flags, size_t size,
>>  {
>>  	bool init = slab_want_init_on_alloc(flags, s);
>>  	unsigned int zero_size = s->object_size;
>> -	bool kasan_init = init;
>>  	size_t i;
>>  	gfp_t init_flags = flags & gfp_allowed_mask;
>> 
>> @@ -4591,29 +4590,37 @@ bool slab_post_alloc_hook(struct kmem_cache *s,
>> gfp_t flags, size_t size,
>>  	if (slub_debug_orig_size(s))
>>  		zero_size = ac->orig_size;
>> 
>> -	/*
>> -	 * When slab_debug is enabled, avoid memory initialization integrated
>> -	 * into KASAN and instead zero out the memory via the memset below with
>> -	 * the proper size. Otherwise, KASAN might overwrite SLUB redzones and
>> -	 * cause false-positive reports. This does not lead to a performance
>> -	 * penalty on production builds, as slab_debug is not intended to be
>> -	 * enabled there.
>> -	 */
>> -	if (__slub_debug_enabled())
>> -		kasan_init = false;
>> -
>> -	/*
>> -	 * As memory initialization might be integrated into KASAN,
>> -	 * kasan_slab_alloc and initialization memset must be
>> -	 * kept together to avoid discrepancies in behavior.
>> -	 *
>> -	 * As p[i] might get tagged, memset and kmemleak hook come after KASAN.
>> -	 */
>>  	for (i = 0; i < size; i++) {
>> -		p[i] = kasan_slab_alloc(s, p[i], init_flags, kasan_init);
>> -		if (p[i] && init && (!kasan_init ||
>> -				     !kasan_has_integrated_init())
>> -				 && !is_kfence_address(p[i]))
>> +		bool skip_init = false;
>> +
>> +		if (is_kfence_address(p[i])) {
>> +			/*
>> +			 * kfence zeroes the object instead of SLUB to avoid
>> +			 * overwriting its own redzone, and zeroing of
>> +			 * s->object_size will corrupt it.
>> +			 */
>> +			skip_init = true;
> 
> But now we perform this check even if init is false, making it more hot.
> 
>> +		} else if (__slub_debug_enabled()) {
>> +			/*
>> +			 * KASAN never zeroes memory when slab_debug is enabled
>> +			 * to avoid overwriting SLUB redzones. This does not
>> +			 * lead to a performance penalty on production builds,
>> +			 * as slab_debug is not intended to be enabled there.
>> +			 */
>> +			skip_init = false;
>> +		} else if (kasan_has_integrated_init()) {
>> +			/*
>> +			 * ARM64 can set memory tags and zero the memory using
>> +			 * a single instruction. Since HW_TAGS KASAN uses that
>> +			 * while tagging the object, a separate zeroing is
>> +			 * unnecessary unless slab_debug is enabled.
>> +			 */
> 
> (I like the new/updated comments)
> 
>> +			skip_init = true;
>> +		}
> 
> And these two are now done in every loop iteration even though they don't
> depend on the object. Yeah it's a static key and build-time constant but still.
> 
> But maybe there's some middle ground?
> 
> Above the loop do (with your comments).

OK, not so simple, we still need the kasan_init variable too.
I've ended up with this, thoughts?

From 3a1c4398ce9f361a4e6f4d9946eab6237eea89c2 Mon Sep 17 00:00:00 2001
From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
Date: Wed, 10 Jun 2026 17:40:04 +0200
Subject: [PATCH] mm/slab: do not init any kfence objects on allocation

When init (zeroing) on allocation is requested, for kmalloc() we
generally have to zero the full object size even if a smaller size is
requested, in order to provide krealloc()'s __GFP_ZERO guarantees.

When we end up allocating a kfence object, kfence performs the zeroing on
its own because it has its own redzone beyond the requested size. Thus
slab_post_alloc_hook() has an 'init' parameter which has to be evaluated
in all callers (via slab_want_init_on_alloc()) and should be false for
kfence allocations.

For kfence allocations in slab_alloc_node() this is achieved by subtly
skipping over the slab_want_init_on_alloc() call. Other callers (i.e.
kmem_cache_alloc_bulk_noprof()) however evaluate it unconditionally even
if they do end up with a kfence allocation. This is only subtly not a
problem, as those are not kmalloc allocations and thus the "requested
size" equals s->object_size and thus it cannot interfere with kfence's
redzone. There's just an unnecessary double zeroing (in both kfence and
slab_post_alloc_hook()), but it's all very fragile and contradicts the
comment in kfence_guarded_alloc().

Remove this subtlety and simplify the code by eliminating the init
parameter from slab_post_alloc_hook() and making it call
slab_want_init_on_alloc() itself. Instead add an is_kfence_address()
check before performing the memset, which will start doing the right
thing for all callers of slab_post_alloc_hook().

This potentially adds overhead of the is_kfence_address() check to the
allocation hotpath, but that one is designed to be as small as possible,
and it's only evaluated if zeroing is about to happen. This means (aside
from init_on_alloc hardening) only for __GFP_ZERO allocations, and the
zeroing itself comes with an overhead likely larger than the added
check.

While at it, refactor the handling of evaluating when KASAN does the
init instead of SLUB, with no intended functional changes. A
non-functional change is that we don't pass kasan_init as true to
kasan_slab_alloc() if kasan has no integrated init, but then the value
is ignored anyway, so it's theoretically more correct.
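
The decision described above can be sketched as a small userspace
model. This is only an illustration, not the kernel code: plan_init()
and struct init_plan are hypothetical names, and the four boolean
inputs stand in for slab_want_init_on_alloc(),
kasan_has_integrated_init(), __slub_debug_enabled() and
is_kfence_address(). The NULL check on p[i] is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what the patched hook decides per object. */
struct init_plan {
	bool kasan_init;  /* value that would be passed to kasan_slab_alloc() */
	bool slub_memset; /* whether SLUB's own memset() would run */
};

/*
 * Userspace model only. The inputs are stand-ins for the kernel helpers
 * named in the commit message; none of this is kernel API.
 */
static struct init_plan plan_init(bool want_init, bool kasan_integrated,
				  bool slub_debug, bool kfence_obj)
{
	struct init_plan plan = { .kasan_init = false, .slub_memset = false };
	bool init = want_init;

	/* Decided once, before looping over the objects. */
	if (kasan_integrated && !slub_debug) {
		plan.kasan_init = init;
		init = false;
	}

	/* Per object: SLUB never zeroes a kfence object itself. */
	plan.slub_memset = init && !kfence_obj;
	return plan;
}
```

In this model a kfence object never gets the SLUB memset, no matter
which caller reached the hook, which is the property the patch is
after.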

Thanks to Harry Yoo for the initial refactoring attempt, and for updated
comments that are used here.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-2-7190909db118@kernel.org
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/kfence/core.c |  2 +-
 mm/slub.c        | 60 ++++++++++++++++++++++--------------------------
 2 files changed, 29 insertions(+), 33 deletions(-)

diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 655dc5ce3240..5e0b406924e9 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -500,7 +500,7 @@ static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t g
 
 	/*
 	 * We check slab_want_init_on_alloc() ourselves, rather than letting
-	 * SL*B do the initialization, as otherwise we might overwrite KFENCE's
+	 * slab do the initialization, as otherwise it might overwrite KFENCE's
 	 * redzone.
 	 */
 	if (unlikely(slab_want_init_on_alloc(gfp, cache)))
diff --git a/mm/slub.c b/mm/slub.c
index e2ee8f1aaccf..d762cbe5d040 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4565,13 +4565,13 @@ struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
 
 static __fastpath_inline
 bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
-			  gfp_t flags, size_t size, void **p, bool init,
+			  gfp_t flags, size_t size, void **p,
 			  unsigned int orig_size)
 {
+	bool init = slab_want_init_on_alloc(flags, s);
 	unsigned int zero_size = s->object_size;
-	bool kasan_init = init;
-	size_t i;
 	gfp_t init_flags = flags & gfp_allowed_mask;
+	bool kasan_init = false;
 
 	/*
 	 * For kmalloc object, the allocated size (object_size) can be larger
@@ -4588,28 +4588,33 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 		zero_size = orig_size;
 
 	/*
-	 * When slab_debug is enabled, avoid memory initialization integrated
-	 * into KASAN and instead zero out the memory via the memset below with
-	 * the proper size. Otherwise, KASAN might overwrite SLUB redzones and
-	 * cause false-positive reports. This does not lead to a performance
+	 * ARM64 can set memory tags and zero the memory using a single
+	 * instruction. Since HW_TAGS KASAN uses that while tagging the object,
+	 * separate zeroing is unnecessary.
+	 *
+	 * However, KASAN never zeroes memory when slab_debug is enabled to
+	 * avoid overwriting SLUB redzones. This does not lead to a performance
 	 * penalty on production builds, as slab_debug is not intended to be
 	 * enabled there.
 	 */
-	if (__slub_debug_enabled())
-		kasan_init = false;
+	if (kasan_has_integrated_init() && !__slub_debug_enabled()) {
+		kasan_init = init;
+		init = false;
+	}
 
-	/*
-	 * As memory initialization might be integrated into KASAN,
-	 * kasan_slab_alloc and initialization memset must be
-	 * kept together to avoid discrepancies in behavior.
-	 *
-	 * As p[i] might get tagged, memset and kmemleak hook come after KASAN.
-	 */
-	for (i = 0; i < size; i++) {
+	for (size_t i = 0; i < size; i++) {
 		p[i] = kasan_slab_alloc(s, p[i], init_flags, kasan_init);
-		if (p[i] && init && (!kasan_init ||
-				     !kasan_has_integrated_init()))
+
+		/*
+		 * memset and hooks come after KASAN as p[i] might get tagged
+		 *
+		 * kfence zeroes the object instead of SLUB to avoid overwriting
+		 * its own redzone starting at orig_size, which could happen
+		 * with SLUB zeroing full s->object_size
+		 */
+		if (init && p[i] && !is_kfence_address(p[i]))
 			memset(p[i], 0, zero_size);
+
 		if (gfpflags_allow_spinning(flags))
 			kmemleak_alloc_recursive(p[i], s->object_size, 1,
 						 s->flags, init_flags);
@@ -4910,7 +4915,6 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 		gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
 {
 	void *object;
-	bool init = false;
 
 	s = slab_pre_alloc_hook(s, gfpflags);
 	if (unlikely(!s))
@@ -4926,16 +4930,13 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
 
 	maybe_wipe_obj_freeptr(s, object);
-	init = slab_want_init_on_alloc(gfpflags, s);
 
 out:
 	/*
-	 * When init equals 'true', like for kzalloc() family, only
-	 * @orig_size bytes might be zeroed instead of s->object_size
 	 * In case this fails due to memcg_slab_post_alloc_hook(),
 	 * object is set to NULL
 	 */
-	slab_post_alloc_hook(s, lru, gfpflags, 1, &object, init, orig_size);
+	slab_post_alloc_hook(s, lru, gfpflags, 1, &object, orig_size);
 
 	return object;
 }
@@ -5230,7 +5231,6 @@ kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
 				   struct slab_sheaf *sheaf)
 {
 	void *ret = NULL;
-	bool init;
 
 	if (sheaf->size == 0)
 		goto out;
@@ -5240,10 +5240,8 @@ kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
 	if (likely(!ret))
 		ret = sheaf->objects[--sheaf->size];
 
-	init = slab_want_init_on_alloc(gfp, s);
-
 	/* add __GFP_NOFAIL to force successful memcg charging */
-	slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);
+	slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, s->object_size);
 out:
 	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
 
@@ -5423,8 +5421,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 
 success:
 	maybe_wipe_obj_freeptr(s, ret);
-	slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret,
-			     slab_want_init_on_alloc(alloc_gfp, s), orig_size);
+	slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret, orig_size);
 
 	ret = kasan_kmalloc(s, ret, orig_size, alloc_gfp);
 	return ret;
@@ -7339,8 +7336,7 @@ bool kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags,
 
 out:
 	/* memcg and kmem_cache debug support and memory initialization */
-	return likely(slab_post_alloc_hook(s, NULL, flags, size, p,
-			slab_want_init_on_alloc(flags, s), s->object_size));
+	return likely(slab_post_alloc_hook(s, NULL, flags, size, p, s->object_size));
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
 
-- 
2.54.0




* [PATCH v5 6/6] drm/amdgpu: Wire up dmem cgroup reclaim for VRAM manager
From: Thomas Hellström @ 2026-06-11 14:22 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Natalie Vock, Johannes Weiner, Tejun Heo,
	Michal Koutný, cgroups, Huang Rui, Matthew Brost,
	Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <20260611142242.2529-1-thomas.hellstrom@linux.intel.com>

Register the VRAM manager with the dmem cgroup reclaim infrastructure
so that lowering dmem.max below current VRAM usage triggers TTM
eviction rather than failing with -EBUSY.

Guard place->flags in amdgpu_ttm_bo_eviction_valuable() against NULL,
as the TTM reclaim path passes a NULL place in cgroup drain mode.
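
A minimal sketch of the guarded condition, assuming simplified
stand-in types: struct place and PL_FLAG_CONTIGUOUS are hypothetical
replacements for struct ttm_place and TTM_PL_FLAG_CONTIGUOUS, and
fence_matches_mm stands in for the amdkfd_fence_check_mm() result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PL_FLAG_CONTIGUOUS (1u << 0) /* stand-in for TTM_PL_FLAG_CONTIGUOUS */

struct place {                       /* stand-in for struct ttm_place */
	unsigned int flags;
};

/* A NULL place (cgroup drain mode) carries no contiguity request. */
static bool wants_contiguous(const struct place *place)
{
	return place && (place->flags & PL_FLAG_CONTIGUOUS);
}

/*
 * Mirrors the shape of the patched check: a matching fence makes the BO
 * not worth evicting unless the placement asks for contiguous memory.
 */
static bool fence_blocks_eviction(bool fence_matches_mm,
				  const struct place *place)
{
	return fence_matches_mm && !wants_contiguous(place);
}
```

With a NULL place the flags are never dereferenced, and the check
behaves as it does for a non-contiguous placement.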

v3:
- Rebased on fix for uninitialized list and buddy allocator on the
  drmm_cgroup_register_region() error path.

v5:
- Rebased on the introduction of struct dmem_cgroup_init.
- Clear the reclaim callback in amdgpu_vram_mgr_fini() to prevent
  use-after-free if cgroup reclaim is triggered after driver unbind
  while userspace holds an open DRM file descriptor. (Sashiko-bot)
- Switch from drmm_cgroup_register_region() to the raw
  dmem_cgroup_register_region() and store the region in
  amdgpu_vram_mgr.cg_region. Explicitly call
  dmem_cgroup_unregister_region() at the top of amdgpu_vram_mgr_fini()
  before any manager teardown, draining in-flight reclaim callbacks via
  the rwsem before the manager is destroyed. This is required because
  amdgpu's vram manager fini is called explicitly during driver unbind,
  which may precede the DRM device release and thus precede any
  drmm-based cleanup. (Sashiko-bot)

Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c      |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 29 ++++++++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h |  2 ++
 3 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 2740de94e93c..8cbcd33f51a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1488,7 +1488,7 @@ static bool amdgpu_ttm_bo_eviction_valuable(struct ttm_buffer_object *bo,
 	dma_resv_for_each_fence(&resv_cursor, bo->base.resv,
 				DMA_RESV_USAGE_BOOKKEEP, f) {
 		if (amdkfd_fence_check_mm(f, current->mm) &&
-		    !(place->flags & TTM_PL_FLAG_CONTIGUOUS))
+		    !(place && (place->flags & TTM_PL_FLAG_CONTIGUOUS)))
 			return false;
 	}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 08f05c3aed1d..ee98b963e84a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -906,6 +906,10 @@ static const struct ttm_resource_manager_func amdgpu_vram_mgr_func = {
 	.debug	= amdgpu_vram_mgr_debug
 };
 
+static const struct dmem_cgroup_ops amdgpu_vram_mgr_dmem_ops = {
+	.reclaim = ttm_resource_manager_dmem_reclaim,
+};
+
 /**
  * amdgpu_vram_mgr_init - init VRAM manager and DRM MM
  *
@@ -917,6 +921,7 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 {
 	struct amdgpu_vram_mgr *mgr = &adev->mman.vram_mgr;
 	struct ttm_resource_manager *man = &mgr->manager;
+	struct dmem_cgroup_region *cg;
 	int err;
 
 	ttm_resource_manager_init(man, &adev->mman.bdev,
@@ -933,12 +938,15 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 	if (err)
 		return err;
 
-	man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram",
-					      &(struct dmem_cgroup_init){
-						.size = adev->gmc.real_vram_size,
-					      });
-	if (IS_ERR(man->cg))
-		return PTR_ERR(man->cg);
+	cg = dmem_cgroup_register_region(&(struct dmem_cgroup_init){
+					     .size = adev->gmc.real_vram_size,
+					     .ops = &amdgpu_vram_mgr_dmem_ops,
+					     .reclaim_priv = man,
+					 }, "vram");
+	if (IS_ERR(cg))
+		return PTR_ERR(cg);
+
+	ttm_resource_manager_set_dmem_region(man, cg);
 
 	ttm_set_driver_manager(&adev->mman.bdev, TTM_PL_VRAM, &mgr->manager);
 	ttm_resource_manager_set_used(man, true);
@@ -960,6 +968,15 @@ void amdgpu_vram_mgr_fini(struct amdgpu_device *adev)
 	int ret;
 	struct amdgpu_vram_reservation *rsv, *temp;
 
+	/*
+	 * Drain any in-flight dmem cgroup reclaim callbacks and remove the
+	 * region from the global list before tearing down the manager.
+	 * This must happen first so no reclaim callback can access the
+	 * manager after this point.
+	 */
+	dmem_cgroup_unregister_region(mgr->cg_region);
+	mgr->cg_region = NULL;
+
 	ttm_resource_manager_set_used(man, false);
 
 	ret = ttm_resource_manager_evict_all(&adev->mman.bdev, man);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h
index 429a21a2e9b2..07103cddb335 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h
@@ -36,6 +36,8 @@ struct amdgpu_vram_mgr {
 	atomic64_t vis_usage;
 	u64 default_page_size;
 	struct list_head allocated_vres_list;
+	/** @cg_region: dmem cgroup region for VRAM; unregistered in fini. */
+	struct dmem_cgroup_region *cg_region;
 };
 
 struct amdgpu_vres_task {
-- 
2.54.0



* [PATCH v5 5/6] drm/xe: Wire up dmem cgroup reclaim for VRAM manager
From: Thomas Hellström @ 2026-06-11 14:22 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Natalie Vock, Johannes Weiner, Tejun Heo,
	Michal Koutný, cgroups, Huang Rui, Matthew Brost,
	Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <20260611142242.2529-1-thomas.hellstrom@linux.intel.com>

Register the VRAM manager with the dmem cgroup reclaim infrastructure
so that lowering dmem.max below current VRAM usage triggers TTM
eviction rather than failing with -EBUSY.

v4:
- Rebased on drm-tip; dropped the XE_PL_STOLEN guard as stolen memory
  uses a separate TTM manager and never calls __xe_ttm_vram_mgr_init().

v5:
- Rebased on the introduction of struct dmem_cgroup_init.
- Register the fini drmm action before drmm_cgroup_register_region() so
  that devres LIFO teardown runs unregister_region() first (draining any
  in-flight reclaim callbacks via the rwsem) and xe_ttm_vram_mgr_fini()
  second, ensuring the manager is never accessed by a reclaim callback
  after teardown. (Sashiko-bot)
- Wrap the reclaim callback in xe_ttm_vram_mgr_dmem_reclaim() using
  drm_dev_enter()/drm_dev_exit() to prevent TTM reclaim from running
  after driver unbind.
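
The LIFO ordering argument can be illustrated with a toy action
stack. This is a userspace sketch, not the drmm implementation:
struct action_stack, add_action() and release_all() are hypothetical
names, and the two callbacks only record the order in which they run.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ACTIONS 8

typedef void (*action_fn)(void *arg);

/* Toy stand-in for a managed-resource action list. */
struct action_stack {
	action_fn fn[MAX_ACTIONS];
	void *arg[MAX_ACTIONS];
	size_t count;
};

/* Records which teardown step ran, in order. */
struct trace {
	char order[MAX_ACTIONS + 1];
	size_t len;
};

static int add_action(struct action_stack *s, action_fn fn, void *arg)
{
	if (s->count == MAX_ACTIONS)
		return -1;
	s->fn[s->count] = fn;
	s->arg[s->count] = arg;
	s->count++;
	return 0;
}

/* Release in reverse registration order (last in, first out). */
static void release_all(struct action_stack *s)
{
	while (s->count > 0) {
		s->count--;
		s->fn[s->count](s->arg[s->count]);
	}
}

static void trace_step(struct trace *t, char step)
{
	if (t->len < MAX_ACTIONS) {
		t->order[t->len++] = step;
		t->order[t->len] = '\0';
	}
}

/* 'F' marks the manager fini, 'U' marks the region unregister. */
static void manager_fini(void *arg)      { trace_step(arg, 'F'); }
static void region_unregister(void *arg) { trace_step(arg, 'U'); }
```

Registering manager_fini first and region_unregister second makes
release_all() run the unregister step before the fini step, which is
the ordering the changelog entry relies on.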

Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 54 +++++++++++++++++++++++-----
 1 file changed, 45 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index 308fda4248eb..b2500344cd57 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -276,6 +276,28 @@ static const struct ttm_resource_manager_func xe_ttm_vram_mgr_func = {
 	.debug	= xe_ttm_vram_mgr_debug
 };
 
+static const struct dmem_cgroup_ops xe_ttm_vram_mgr_dmem_ops;
+
+static int xe_ttm_vram_mgr_dmem_reclaim(struct dmem_cgroup_pool_state *pool,
+					 u64 target_bytes, void *priv)
+{
+	struct ttm_resource_manager *man = priv;
+	struct xe_device *xe = ttm_to_xe_device(man->bdev);
+	int ret, idx;
+
+	if (!drm_dev_enter(&xe->drm, &idx))
+		return -ENODEV;
+
+	ret = ttm_resource_manager_dmem_reclaim(pool, target_bytes, priv);
+
+	drm_dev_exit(idx);
+	return ret;
+}
+
+static const struct dmem_cgroup_ops xe_ttm_vram_mgr_dmem_ops = {
+	.reclaim = xe_ttm_vram_mgr_dmem_reclaim,
+};
+
 static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
 {
 	struct xe_device *xe = to_xe_device(dev);
@@ -301,17 +323,10 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
 			   u64 default_page_size)
 {
 	struct ttm_resource_manager *man = &mgr->manager;
+	struct dmem_cgroup_region *cg;
 	const char *name;
 	int err;
 
-	name = mem_type == XE_PL_VRAM0 ? "vram0" : "vram1";
-	man->cg = drmm_cgroup_register_region(&xe->drm, name,
-					      &(struct dmem_cgroup_init){
-						.size = size,
-					      });
-	if (IS_ERR(man->cg))
-		return PTR_ERR(man->cg);
-
 	man->func = &xe_ttm_vram_mgr_func;
 	mgr->mem_type = mem_type;
 	err = drmm_mutex_init(&xe->drm, &mgr->lock);
@@ -330,7 +345,28 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
 	ttm_set_driver_manager(&xe->ttm, mem_type, &mgr->manager);
 	ttm_resource_manager_set_used(&mgr->manager, true);
 
-	return drmm_add_action_or_reset(&xe->drm, xe_ttm_vram_mgr_fini, mgr);
+	/*
+	 * Register the fini action before the cgroup region so that devres
+	 * LIFO teardown runs unregister_region first (draining any in-flight
+	 * reclaim callbacks) and the manager fini second.
+	 */
+	err = drmm_add_action_or_reset(&xe->drm, xe_ttm_vram_mgr_fini, mgr);
+	if (err)
+		return err;
+
+	name = mem_type == XE_PL_VRAM0 ? "vram0" : "vram1";
+	cg = drmm_cgroup_register_region(&xe->drm, name,
+					 &(struct dmem_cgroup_init){
+						.size = size,
+						.ops = &xe_ttm_vram_mgr_dmem_ops,
+						.reclaim_priv = man,
+					 });
+	if (IS_ERR(cg))
+		return PTR_ERR(cg);
+
+	ttm_resource_manager_set_dmem_region(man, cg);
+
+	return 0;
 }
 
 /**
-- 
2.54.0



* [PATCH v5 4/6] drm/ttm: Hook up a cgroup-aware reclaim callback for the dmem controller
From: Thomas Hellström @ 2026-06-11 14:22 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Natalie Vock, Johannes Weiner, Tejun Heo,
	Michal Koutný, cgroups, Huang Rui, Matthew Brost,
	Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <20260611142242.2529-1-thomas.hellstrom@linux.intel.com>

Add ttm_bo_evict_cgroup() to evict buffer objects charged to a specific
dmem cgroup pool from a resource manager's LRU until a byte target is
met.  Add ttm_resource_manager_set_dmem_region() to associate a dmem
cgroup region with a resource manager; drivers supply their own
dmem_cgroup_ops with ttm_resource_manager_dmem_reclaim as the reclaim
function and the manager pointer as reclaim_priv in the dmem_cgroup_init
to wire up TTM eviction as the reclaim callback.

The eviction context is interruptible; signals abort the operation and
propagate back through the write() syscall.

Introduce a new mode for the bo LRU walker so that sleeping locks
can be taken. This can be used when the caller doesn't hold any
previous dma_resv locks, and where it intends to hold at most
one lock at a time.

Like the rest of TTM eviction, this should sooner rather than later
be converted to full WW transactions.

v3:
- Fix ttm_resource_manager_set_dmem_region() storing an error pointer
  in man->cg unconditionally. (Sashiko-bot)
- Fix kernel-doc function name format for ttm_bo_evict_cgroup() and
  ttm_resource_manager_set_dmem_region().

v5:
- Rebased on the introduction of struct dmem_cgroup_init.
- Handle NULL region in ttm_resource_manager_set_dmem_region() to clear
  the reclaim callback, preventing use-after-free when the manager is
  torn down while the dmem region outlives it. (Sashiko-bot)
- Return 0 on any progress (even partial eviction), -ENOSPC only when
  nothing was freed; fixes callers that expected 0 on partial success.
- Document that the reclaim callback should return 0 if some progress
  was made, -ENOSPC if no progress at all, or another error for fatal
  failures.
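
The return convention from the v5 notes can be sketched in isolation.
reclaim_result() is a hypothetical helper name; it maps a signed
count of freed bytes (or a negative error) to the callback's return
value.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Illustrative helper: translate "bytes freed or negative error" into
 * the reclaim callback convention described in the changelog.
 */
static int reclaim_result(int64_t freed)
{
	if (freed < 0)
		return (int)freed;       /* fatal error: pass it through */

	return freed > 0 ? 0 : -ENOSPC;  /* any progress counts as success */
}
```

Partial eviction therefore reports 0, and -ENOSPC is reserved for the
case where nothing at all was freed.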

Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c       | 95 +++++++++++++++++++++++++++++-
 drivers/gpu/drm/ttm/ttm_bo_util.c  |  3 +-
 drivers/gpu/drm/ttm/ttm_resource.c | 50 ++++++++++++++++
 include/drm/ttm/ttm_bo.h           | 10 ++++
 include/drm/ttm/ttm_resource.h     |  7 +++
 5 files changed, 161 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bcd76f6bb7f0..db0e38bd8a43 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -515,12 +515,20 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
 {
 	struct ttm_bo_evict_walk *evict_walk =
 		container_of(walk, typeof(*evict_walk), walk);
+	/* Capture size before eviction in case res is cleared. */
+	s64 bo_size = bo->base.size;
 	s64 lret;
 
 	if (!dmem_cgroup_state_evict_valuable(evict_walk->limit_pool, bo->resource->css,
 					      evict_walk->try_low, &evict_walk->hit_low))
 		return 0;
 
+	/*
+	 * evict_walk->place is NULL in cgroup drain mode.  Drivers'
+	 * eviction_valuable() callbacks must handle a NULL place, treating it
+	 * as "any placement": the TTM base implementation already does so via
+	 * ttm_resource_intersects().
+	 */
 	if (bo->pin_count || !bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
 		return 0;
 
@@ -536,11 +544,15 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
 		goto out;
 
 	evict_walk->evicted++;
-	if (evict_walk->res)
+	if (evict_walk->res) {
 		lret = ttm_resource_alloc(evict_walk->evictor, evict_walk->place,
 					  evict_walk->res, NULL);
-	if (lret == 0)
-		return 1;
+		if (lret == 0)
+			return 1;
+	} else {
+		/* Cgroup drain: return bytes freed for byte-denominated progress. */
+		return bo_size;
+	}
 out:
 	/* Errors that should terminate the walk. */
 	if (lret == -ENOSPC)
@@ -614,6 +626,83 @@ static int ttm_bo_evict_alloc(struct ttm_device *bdev,
 	return 0;
 }
 
+/**
+ * ttm_bo_evict_cgroup() - Evict buffer objects charged to a specific cgroup.
+ * @bdev: The TTM device.
+ * @man: The resource manager whose LRU to walk.
+ * @limit_pool: The cgroup pool state whose members should be evicted.
+ * @target_bytes: Number of bytes to free.
+ * @ctx: The TTM operation context.
+ *
+ * Walk the LRU of @man and evict buffer objects that are charged to the
+ * cgroup identified by @limit_pool, until at least @target_bytes have been
+ * freed.  Mirrors the two-pass (trylock -> sleeping-lock, low-watermark)
+ * strategy used by ttm_bo_evict_alloc().
+ *
+ * Return: >= @target_bytes on full success, 0..target_bytes-1 if partial,
+ *         negative error code on fatal error.
+ */
+s64 ttm_bo_evict_cgroup(struct ttm_device *bdev,
+			struct ttm_resource_manager *man,
+			struct dmem_cgroup_pool_state *limit_pool,
+			s64 target_bytes,
+			struct ttm_operation_ctx *ctx)
+{
+	struct ttm_bo_evict_walk evict_walk = {
+		.walk = {
+			.ops = &ttm_evict_walk_ops,
+			.arg = { .ctx = ctx },
+		},
+		.limit_pool = limit_pool,
+		/* place, evictor, res left NULL: selects cgroup drain mode */
+	};
+	s64 lret, pass;
+
+	evict_walk.walk.arg.trylock_only = true;
+	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, target_bytes);
+	if (lret < 0 || lret >= target_bytes)
+		return lret;
+
+	/* Second pass: also evict BOs at the low watermark. */
+	if (evict_walk.hit_low) {
+		evict_walk.try_low = true;
+		pass = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man,
+					      target_bytes - lret);
+		if (pass < 0)
+			return pass;
+		lret += pass;
+		if (lret >= target_bytes)
+			return lret;
+	}
+
+	/* Full sleeping-lock pass for remaining target. */
+	evict_walk.try_low = evict_walk.hit_low = false;
+	evict_walk.walk.arg.trylock_only = false;
+
+retry:
+	evict_walk.walk.arg.sleeping_lock = true;
+	do {
+		evict_walk.evicted = 0;
+		pass = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man,
+					      target_bytes - lret);
+		if (pass < 0) {
+			lret = pass;
+			goto out;
+		}
+		lret += pass;
+	} while (lret < target_bytes && evict_walk.evicted);
+
+	/* One more attempt if we hit the low limit during sleeping-lock pass. */
+	if (lret < target_bytes && evict_walk.hit_low && !evict_walk.try_low) {
+		evict_walk.try_low = true;
+		goto retry;
+	}
+
+out:
+	return lret;
+}
+EXPORT_SYMBOL(ttm_bo_evict_cgroup);
+
 /**
  * ttm_bo_pin - Pin the buffer object.
  * @bo: The buffer object to pin
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 3e3c201a0222..bd0b23ac2cc4 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -999,7 +999,8 @@ __ttm_bo_lru_cursor_next(struct ttm_bo_lru_cursor *curs)
 		bo = res->bo;
 		if (ttm_lru_walk_trylock(curs, bo))
 			bo_locked = true;
-		else if (!arg->ticket || arg->ctx->no_wait_gpu || arg->trylock_only)
+		else if ((!arg->ticket && !arg->sleeping_lock) || arg->ctx->no_wait_gpu ||
+			 arg->trylock_only)
 			continue;
 
 		if (!ttm_bo_get_unless_zero(bo)) {
diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
index 154d6739256f..ad00723e99ef 100644
--- a/drivers/gpu/drm/ttm/ttm_resource.c
+++ b/drivers/gpu/drm/ttm/ttm_resource.c
@@ -953,3 +953,53 @@ void ttm_resource_manager_create_debugfs(struct ttm_resource_manager *man,
 #endif
 }
 EXPORT_SYMBOL(ttm_resource_manager_create_debugfs);
+
+/**
+ * ttm_resource_manager_dmem_reclaim() - dmem cgroup reclaim callback for TTM
+ *                                       resource managers.
+ * @pool: The dmem cgroup pool state for the cgroup being reclaimed.
+ * @target_bytes: Number of bytes to try to free.
+ * @priv: The &ttm_resource_manager pointer, passed as @init.reclaim_priv to
+ *        dmem_cgroup_register_region().
+ *
+ * Drivers should use this as the @reclaim member of their own
+ * &struct dmem_cgroup_ops, with the &ttm_resource_manager pointer as
+ * @init.reclaim_priv.
+ *
+ * Return: 0 if some memory was freed, -ENOSPC if nothing was freed, or
+ *         another negative error code on fatal failure.
+ */
+int ttm_resource_manager_dmem_reclaim(struct dmem_cgroup_pool_state *pool,
+				      u64 target_bytes, void *priv)
+{
+	struct ttm_resource_manager *man = priv;
+	struct ttm_operation_ctx ctx = { .interruptible = true };
+	s64 freed;
+
+	freed = ttm_bo_evict_cgroup(man->bdev, man, pool, target_bytes, &ctx);
+	if (freed < 0)
+		return freed;
+
+	return freed > 0 ? 0 : -ENOSPC;
+}
+EXPORT_SYMBOL(ttm_resource_manager_dmem_reclaim);
+
+/**
+ * ttm_resource_manager_set_dmem_region() - Associate a dmem cgroup region with a
+ *                                        resource manager.
+ * @man: The resource manager.
+ * @region: The dmem cgroup region to associate, may be NULL or IS_ERR().
+ *
+ * When @region is valid, stores it in @man->cg so that TTM can look up the
+ * associated pool during charging and eviction-target selection.
+ * The reclaim callback must be wired up using ttm_resource_manager_dmem_reclaim()
+ * in the driver's own &struct dmem_cgroup_ops, with the manager pointer as
+ * @init.reclaim_priv.
+ */
+void ttm_resource_manager_set_dmem_region(struct ttm_resource_manager *man,
+					  struct dmem_cgroup_region *region)
+{
+	if (!IS_ERR_OR_NULL(region))
+		man->cg = region;
+}
+EXPORT_SYMBOL(ttm_resource_manager_set_dmem_region);
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index 8310bc3d55f9..32791c4db2a9 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -226,6 +226,11 @@ struct ttm_lru_walk_arg {
 	struct ww_acquire_ctx *ticket;
 	/** @trylock_only: Only use trylock for locking. */
 	bool trylock_only;
+	/**
+	 * @sleeping_lock: Use sleeping locks even with %NULL @ticket.
+	 * @trylock_only has precedence over this field.
+	 */
+	bool sleeping_lock;
 };
 
 /**
@@ -431,6 +436,11 @@ void ttm_bo_unpin(struct ttm_buffer_object *bo);
 int ttm_bo_evict_first(struct ttm_device *bdev,
 		       struct ttm_resource_manager *man,
 		       struct ttm_operation_ctx *ctx);
+s64 ttm_bo_evict_cgroup(struct ttm_device *bdev,
+			struct ttm_resource_manager *man,
+			struct dmem_cgroup_pool_state *limit_pool,
+			s64 target_bytes,
+			struct ttm_operation_ctx *ctx);
 int ttm_bo_access(struct ttm_buffer_object *bo, unsigned long offset,
 		  void *buf, int len, int write);
 vm_fault_t ttm_bo_vm_reserve(struct ttm_buffer_object *bo,
diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
index a5d386583fb6..32e485fdce9a 100644
--- a/include/drm/ttm/ttm_resource.h
+++ b/include/drm/ttm/ttm_resource.h
@@ -39,6 +39,7 @@
 
 struct dentry;
 struct dmem_cgroup_device;
+struct dmem_cgroup_region;
 struct drm_printer;
 struct ttm_device;
 struct ttm_resource_manager;
@@ -477,6 +478,12 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man,
 			       struct ttm_device *bdev,
 			       uint64_t size);
 
+void ttm_resource_manager_set_dmem_region(struct ttm_resource_manager *man,
+					  struct dmem_cgroup_region *region);
+
+int ttm_resource_manager_dmem_reclaim(struct dmem_cgroup_pool_state *pool,
+				      u64 target_bytes, void *priv);
+
 int ttm_resource_manager_evict_all(struct ttm_device *bdev,
 				   struct ttm_resource_manager *man);
 
-- 
2.54.0



* [PATCH v5 3/6] cgroup/dmem: Add reclaim callback for lowering max below current usage
From: Thomas Hellström @ 2026-06-11 14:22 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Natalie Vock, Johannes Weiner, Tejun Heo,
	Michal Koutný, cgroups, Huang Rui, Matthew Brost,
	Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <20260611142242.2529-1-thomas.hellstrom@linux.intel.com>

Add an optional reclaim callback to struct dmem_cgroup_region. When
dmem.max is set below the current usage of a cgroup pool, the new limit
is applied immediately (so that concurrent allocations are throttled
while reclaim is in progress) and then the driver is asked to evict
memory to bring usage back below the limit.

Reclaim is attempted up to a bounded number of times. No error is
returned to userspace if usage remains above the limit after reclaim,
and a pending signal will abort the reclaim loop early. This matches
the behavior of memory.max in the memory cgroup controller.
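
A rough userspace sketch of such a bounded loop follows. It is not
the code in kernel/cgroup/dmem.c: struct demo_pool, reclaim_to_limit()
and the two demo callbacks are hypothetical, the budget of 16 is taken
from the v5 notes below, and the pending-signal check is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_RETRIES 16 /* budget quoted in the v5 notes; name is illustrative */

typedef int (*reclaim_fn)(uint64_t target_bytes, void *priv);

/* Hypothetical pool: current usage and the freshly written limit. */
struct demo_pool {
	uint64_t usage;
	uint64_t max;
};

/* The limit is already applied; now ask the driver to catch up. */
static void reclaim_to_limit(struct demo_pool *pool, reclaim_fn reclaim,
			     void *priv)
{
	int retries = MAX_RETRIES;

	while (pool->usage > pool->max && retries > 0) {
		int ret = reclaim(pool->usage - pool->max, priv);

		if (ret == -ENOSPC)
			retries--;  /* no progress: spend retry budget */
		else if (ret)
			break;      /* any other error ends the loop */
	}
	/* Nothing is reported even if usage is still above max. */
}

/* Demo callback: frees up to 4096 bytes per call from the pool in priv. */
static int evict_one_page(uint64_t target_bytes, void *priv)
{
	struct demo_pool *pool = priv;
	uint64_t freed = pool->usage < 4096 ? pool->usage : 4096;

	(void)target_bytes;
	if (!freed)
		return -ENOSPC;
	pool->usage -= freed;
	return 0;
}

/* Demo callback: never makes progress, only counts its invocations. */
static int always_stuck(uint64_t target_bytes, void *priv)
{
	(void)target_bytes;
	(*(int *)priv)++;
	return -ENOSPC;
}
```

With always_stuck the loop gives up after 16 calls and leaves usage
unchanged, matching the "no error is returned to userspace"
behaviour described above.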

Also honor O_NONBLOCK so that if that flag is set during the
max value write, no reclaim is initiated. The idea is to avoid
charging the reclaim cost to the writer of the max value.

v2:
- Write max before reclaim is attempted (Maarten)
- Let signals abort the reclaim without error (Maarten)
- If a new max value is written with the O_NONBLOCK flag,
  reclaim is not attempted (Maarten)
- Extract region from the pool parameter rather than
  passing it explicitly to set_resource_xxx().

v3:
- Use an rwsem to protect reclaim callback registration and
  region unregister against concurrent reclaim invocations,
  ensuring reclaim_priv is visible when the callback is
  invoked. (Sashiko-bot)

v5:
- Rebased on the introduction of struct dmem_cgroup_init.
- Use nonblock=true in reset_all_resource_limits() to avoid sleeping
  inside rcu_read_lock() in dmemcs_offline(). (Sashiko-bot)
- Compare usage against the truncated limit value stored in cnt.max,
  not the original u64. (Sashiko-bot)
- Use a DMEM_MAX_RECLAIM_RETRIES (16) retry budget instead of 5, matching
  the memcg controller's MAX_RECLAIM_RETRIES. Only -ENOSPC (no progress)
  counts against the retry budget; other errors terminate the loop
  immediately.

Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 include/linux/cgroup_dmem.h |  21 +++++++
 kernel/cgroup/dmem.c        | 119 +++++++++++++++++++++++++++++++++---
 2 files changed, 130 insertions(+), 10 deletions(-)

diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index d9eab8a2c1ee..d705e94d8784 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -14,12 +14,33 @@ struct dmem_cgroup_pool_state;
 /* Opaque definition of a cgroup region, used internally */
 struct dmem_cgroup_region;
 
+/**
+ * struct dmem_cgroup_ops - Operations for a dmem cgroup region.
+ * @reclaim: Optional callback invoked when dmem.max is set below the current
+ *           usage of a pool. The driver should attempt to free at least
+ *           @target_bytes from @pool. May be called multiple times if usage
+ *           remains above the limit after returning.
+ *
+ *           Return: 0 if some progress was made (even if less than
+ *           @target_bytes was freed), -ENOSPC if no progress could be made,
+ *           or another negative error code if a fatal error occurred.
+ *           -ENOSPC is retried up to a bounded budget; other errors abort.
+ */
+struct dmem_cgroup_ops {
+	int (*reclaim)(struct dmem_cgroup_pool_state *pool,
+		       u64 target_bytes, void *priv);
+};
+
 /**
  * struct dmem_cgroup_init - Initialization parameters for a dmem cgroup region.
  * @size: Size of the region in bytes.
+ * @ops: Optional operations for this region. May be NULL.
+ * @reclaim_priv: Opaque pointer passed to @ops->reclaim. May be NULL.
  */
 struct dmem_cgroup_init {
 	u64 size;
+	const struct dmem_cgroup_ops *ops;
+	void *reclaim_priv;
 };
 
 #if IS_ENABLED(CONFIG_CGROUP_DMEM)
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index d12c8543f3fe..da99d133182c 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -18,6 +18,13 @@
 #include <linux/rculist.h>
 #include <linux/slab.h>
 
+/*
+ * Number of reclaim attempts before giving up when lowering dmem.max
+ * below current usage. Mirrors memcg's MAX_RECLAIM_RETRIES; unify the
+ * two in a follow-up instead of duplicating the constant.
+ */
+#define DMEM_MAX_RECLAIM_RETRIES 16
+
 struct dmem_cgroup_region {
 	/**
 	 * @ref: References keeping the region alive.
@@ -51,6 +58,24 @@ struct dmem_cgroup_region {
 	 * No new pools should be added to the region afterwards.
 	 */
 	bool unregistered;
+
+	/**
+	 * @ops: Optional operations, set from dmem_cgroup_init at registration.
+	 */
+	const struct dmem_cgroup_ops *ops;
+
+	/** @reclaim_priv: Private data passed to @ops->reclaim. */
+	void *reclaim_priv;
+
+	/**
+	 * @unregister_sem: Serialises reclaim callbacks against unregistration.
+	 *
+	 * Readers (reclaim) hold the read side for the duration of a callback
+	 * invocation.  dmem_cgroup_unregister_region() takes the write side to
+	 * drain any in-flight callbacks before returning, so callers may safely
+	 * free @reclaim_priv once unregister returns.
+	 */
+	struct rw_semaphore unregister_sem;
 };
 
 struct dmemcg_state {
@@ -145,21 +170,71 @@ static void free_cg_pool(struct dmem_cgroup_pool_state *pool)
 }
 
 static void
-set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val)
+set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val, bool nonblock)
 {
 	page_counter_set_min(&pool->cnt, val);
 }
 
 static void
-set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
+set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val, bool nonblock)
 {
 	page_counter_set_low(&pool->cnt, val);
 }
 
 static void
-set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
+set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val, bool nonblock)
 {
-	page_counter_set_max(&pool->cnt, val);
+	struct dmem_cgroup_region *region = pool->region;
+	unsigned long limit = (unsigned long)val;
+
+	/*
+	 * Always update the limit, even if usage currently exceeds it.
+	 * Concurrent allocations will be throttled against the new limit
+	 * while reclaim is in progress.
+	 */
+	xchg(&pool->cnt.max, limit);
+
+	if (nonblock)
+		return;
+
+	/*
+	 * Hold the read side for the duration of the reclaim loop so that
+	 * dmem_cgroup_unregister_region() cannot return (and the caller
+	 * cannot free reclaim_priv) while a callback is in progress.
+	 *
+	 * The ops check must happen inside the lock.  A caller may have
+	 * observed ops != NULL before dmem_cgroup_unregister_region()
+	 * acquired the write side; rechecking under down_read() is safe
+	 * because region->unregistered is set while the write side is
+	 * held, so any down_read() that succeeds after up_write() will
+	 * see unregistered = true and skip the loop.
+	 */
+	down_read(&region->unregister_sem);
+	if (!region->unregistered && region->ops && region->ops->reclaim) {
+		for (int retries = DMEM_MAX_RECLAIM_RETRIES; ; ) {
+			u64 usage = page_counter_read(&pool->cnt);
+			int ret;
+
+			if (usage <= limit)
+				break;
+
+			if (signal_pending(current))
+				break;
+
+			ret = region->ops->reclaim(pool, usage - limit, region->reclaim_priv);
+
+			/*
+			 * Mirror memcg's retry strategy: only count -ENOSPC (no
+			 * progress) against the retry budget; any other error is
+			 * fatal and terminates the loop immediately.
+			 */
+			if (ret && (ret != -ENOSPC || !retries--))
+				break;
+
+			cond_resched();
+		}
+	}
+	up_read(&region->unregister_sem);
 }
 
 static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
@@ -189,9 +264,14 @@ static u64 get_resource_peak(struct dmem_cgroup_pool_state *pool)
 
 static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
 {
-	set_resource_min(rpool, 0);
-	set_resource_low(rpool, 0);
-	set_resource_max(rpool, PAGE_COUNTER_MAX);
+	set_resource_min(rpool, 0, false);
+	set_resource_low(rpool, 0, false);
+	/*
+	 * Use nonblock=true: we are raising the limit to PAGE_COUNTER_MAX so
+	 * reclaim is pointless, and dmemcs_offline() holds rcu_read_lock()
+	 * which forbids sleeping.
+	 */
+	set_resource_max(rpool, PAGE_COUNTER_MAX, true);
 }
 
 static void dmemcs_offline(struct cgroup_subsys_state *css)
@@ -468,7 +548,10 @@ static void dmemcg_free_region(struct kref *ref)
  * dmem_cgroup_unregister_region() - Unregister a previously registered region.
  * @region: The region to unregister.
  *
- * This function undoes dmem_cgroup_register_region.
+ * This function undoes dmem_cgroup_register_region.  It drains any
+ * in-flight reclaim callbacks before returning, so the caller may safely
+ * free the resources pointed to by @init.reclaim_priv once this function
+ * returns.
  */
 void dmem_cgroup_unregister_region(struct dmem_cgroup_region *region)
 {
@@ -477,6 +560,15 @@ void dmem_cgroup_unregister_region(struct dmem_cgroup_region *region)
 	if (!region)
 		return;
 
+	/*
+	 * Acquire the write side to drain any in-flight reclaim callbacks.
+	 * After up_write() below, set_resource_max() will observe
+	 * region->unregistered = true under its own down_read() and skip
+	 * the reclaim loop, so reclaim_priv is safe to free once this
+	 * function returns.
+	 */
+	down_write(&region->unregister_sem);
+
 	spin_lock(&dmemcg_lock);
 
 	/* Remove from global region list */
@@ -496,6 +588,8 @@ void dmem_cgroup_unregister_region(struct dmem_cgroup_region *region)
 	region->unregistered = true;
 	spin_unlock(&dmemcg_lock);
 
+	up_write(&region->unregister_sem);
+
 	kref_put(&region->ref, dmemcg_free_region);
 }
 EXPORT_SYMBOL_GPL(dmem_cgroup_unregister_region);
@@ -537,7 +631,10 @@ dmem_cgroup_register_region(const struct dmem_cgroup_init *init,
 	INIT_LIST_HEAD(&ret->pools);
 	ret->name = region_name;
 	ret->size = init->size;
+	ret->ops = init->ops;
+	ret->reclaim_priv = init->reclaim_priv;
 	kref_init(&ret->ref);
+	init_rwsem(&ret->unregister_sem);
 
 	spin_lock(&dmemcg_lock);
 	list_add_tail_rcu(&ret->region_node, &dmem_cgroup_regions);
@@ -733,9 +830,10 @@ static int dmemcg_parse_limit(char *options, u64 *new_limit)
 
 static ssize_t dmemcg_limit_write(struct kernfs_open_file *of,
 				 char *buf, size_t nbytes, loff_t off,
-				 void (*apply)(struct dmem_cgroup_pool_state *, u64))
+				 void (*apply)(struct dmem_cgroup_pool_state *, u64, bool))
 {
 	struct dmemcg_state *dmemcs = css_to_dmemcs(of_css(of));
+	bool nonblock = of->file->f_flags & O_NONBLOCK;
 	int err = 0;
 
 	while (buf && !err) {
@@ -780,7 +878,8 @@ static ssize_t dmemcg_limit_write(struct kernfs_open_file *of,
 		}
 
 		/* And commit */
-		apply(pool, new_limit);
+		apply(pool, new_limit, nonblock);
+
 		dmemcg_pool_put(pool);
 
 out_put:
-- 
2.54.0


^ permalink raw reply related
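
The retry policy described in patch 3/6 can be sketched in user-space C. This is a minimal model, not the kernel code: struct pool, lower_max() and the two example callbacks are hypothetical names, and the rwsem, the signal_pending() check and cond_resched() from the patch are left out. It keeps only the ordering (apply the limit first, then reclaim) and the rule that only -ENOSPC consumes the retry budget.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RECLAIM_RETRIES 16	/* same budget value the patch uses */

/* Hypothetical stand-in for a pool: a usage counter and a limit. */
struct pool {
	uint64_t usage;
	uint64_t max;
};

/* Hypothetical callback type, modeled on dmem_cgroup_ops::reclaim. */
typedef int (*reclaim_fn)(struct pool *pool, uint64_t target_bytes, void *priv);

/*
 * Apply the new limit first, then try to reclaim down to it.
 * Returns the number of callback invocations (for illustration only).
 */
static int lower_max(struct pool *pool, uint64_t limit, reclaim_fn reclaim,
		     void *priv)
{
	int retries = MAX_RECLAIM_RETRIES;
	int calls = 0;

	assert(pool != NULL);
	pool->max = limit;		/* the limit is visible before reclaim starts */
	if (!reclaim)
		return 0;

	while (pool->usage > limit) {
		int ret = reclaim(pool, pool->usage - limit, priv);

		calls++;
		/* -ENOSPC spends one retry; any other error stops at once. */
		if (ret && (ret != -ENOSPC || !retries--))
			break;
	}
	return calls;
}

/* Example callback: frees at most *priv bytes per call. */
static int reclaim_chunk(struct pool *pool, uint64_t target, void *priv)
{
	uint64_t chunk = *(uint64_t *)priv;
	uint64_t freed = target < chunk ? target : chunk;

	if (!freed)
		return -ENOSPC;
	pool->usage -= freed;
	return 0;
}

/* Example callback: never makes progress. */
static int reclaim_stuck(struct pool *pool, uint64_t target, void *priv)
{
	(void)pool;
	(void)target;
	(void)priv;
	return -ENOSPC;
}
```

In this sketch, lowering a pool from usage 100 to limit 40 with a 30-byte chunk takes two callbacks. A callback that always returns -ENOSPC is invoked 17 times (the first call plus 16 retries) before the loop gives up, and the lowered limit stays in place either way.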

* [PATCH v5 2/6] cgroup/dmem: Introduce struct dmem_cgroup_init for region initialization
From: Thomas Hellström @ 2026-06-11 14:22 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Natalie Vock, Johannes Weiner, Tejun Heo,
	Michal Koutný, cgroups, Huang Rui, Matthew Brost,
	Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <20260611142242.2529-1-thomas.hellstrom@linux.intel.com>

Replace the bare u64 size argument to dmem_cgroup_register_region() and
drmm_cgroup_register_region() with a const struct dmem_cgroup_init *
pointer. The struct currently carries only the size field, but using a
struct makes the API extensible: future callers can supply additional
initialization parameters without adding more positional arguments.

Update all in-tree callers (amdgpu, xe) to use a compound-literal
initializer.

v5:
- Commit introduced.

Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c |  6 +++++-
 drivers/gpu/drm/drm_drv.c                    |  8 +++++---
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c         |  7 ++++++-
 include/drm/drm_drv.h                        |  4 +++-
 include/linux/cgroup_dmem.h                  | 16 +++++++++++++---
 kernel/cgroup/dmem.c                         | 10 ++++++----
 6 files changed, 38 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index ac3f71d77140..08f05c3aed1d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -23,6 +23,7 @@
  */
 
 #include <linux/dma-mapping.h>
+#include <linux/cgroup_dmem.h>
 #include <drm/ttm/ttm_range_manager.h>
 #include <drm/drm_drv.h>
 #include <drm/drm_buddy.h>
@@ -932,7 +933,10 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 	if (err)
 		return err;
 
-	man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram", adev->gmc.real_vram_size);
+	man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram",
+					      &(struct dmem_cgroup_init){
+						.size = adev->gmc.real_vram_size,
+					      });
 	if (IS_ERR(man->cg))
 		return PTR_ERR(man->cg);
 
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 1ff0bf7cba6a..3c570f9393b9 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -960,17 +960,19 @@ static void drmm_cg_unregister_region(struct drm_device *dev, void *arg)
  * drmm_cgroup_register_region - Register a region of a DRM device to cgroups
  * @dev: device for region
  * @region_name: Region name for registering
- * @size: Size of region in bytes
+ * @init: Initialization parameters for the region.
  *
  * This decreases the ref-count of @dev by one. The device is destroyed if the
  * ref-count drops to zero.
  */
-struct dmem_cgroup_region *drmm_cgroup_register_region(struct drm_device *dev, const char *region_name, u64 size)
+struct dmem_cgroup_region *
+drmm_cgroup_register_region(struct drm_device *dev, const char *region_name,
+			    const struct dmem_cgroup_init *init)
 {
 	struct dmem_cgroup_region *region;
 	int ret;
 
-	region = dmem_cgroup_register_region(size, "drm/%s/%s", dev->unique, region_name);
+	region = dmem_cgroup_register_region(init, "drm/%s/%s", dev->unique, region_name);
 	if (IS_ERR_OR_NULL(region))
 		return region;
 
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index b518f7dec680..308fda4248eb 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -4,6 +4,8 @@
  * Copyright (C) 2021-2022 Red Hat
  */
 
+#include <linux/cgroup_dmem.h>
+
 #include <drm/drm_managed.h>
 #include <drm/drm_drv.h>
 #include <drm/drm_buddy.h>
@@ -303,7 +305,10 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
 	int err;
 
 	name = mem_type == XE_PL_VRAM0 ? "vram0" : "vram1";
-	man->cg = drmm_cgroup_register_region(&xe->drm, name, size);
+	man->cg = drmm_cgroup_register_region(&xe->drm, name,
+					      &(struct dmem_cgroup_init){
+						.size = size,
+					      });
 	if (IS_ERR(man->cg))
 		return PTR_ERR(man->cg);
 
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index e09559495c5b..b23830494ed4 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -34,6 +34,7 @@
 
 #include <drm/drm_device.h>
 
+struct dmem_cgroup_init;
 struct dmem_cgroup_region;
 struct drm_fb_helper;
 struct drm_fb_helper_surface_size;
@@ -433,7 +434,8 @@ void *__devm_drm_dev_alloc(struct device *parent,
 
 struct dmem_cgroup_region *
 drmm_cgroup_register_region(struct drm_device *dev,
-			    const char *region_name, u64 size);
+			    const char *region_name,
+			    const struct dmem_cgroup_init *init);
 
 /**
  * devm_drm_dev_alloc - Resource managed allocation of a &drm_device instance
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736..d9eab8a2c1ee 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -14,8 +14,18 @@ struct dmem_cgroup_pool_state;
 /* Opaque definition of a cgroup region, used internally */
 struct dmem_cgroup_region;
 
+/**
+ * struct dmem_cgroup_init - Initialization parameters for a dmem cgroup region.
+ * @size: Size of the region in bytes.
+ */
+struct dmem_cgroup_init {
+	u64 size;
+};
+
 #if IS_ENABLED(CONFIG_CGROUP_DMEM)
-struct dmem_cgroup_region *dmem_cgroup_register_region(u64 size, const char *name_fmt, ...) __printf(2,3);
+struct dmem_cgroup_region *
+dmem_cgroup_register_region(const struct dmem_cgroup_init *init,
+			    const char *name_fmt, ...) __printf(2, 3);
 void dmem_cgroup_unregister_region(struct dmem_cgroup_region *region);
 int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
 			   struct dmem_cgroup_pool_state **ret_pool,
@@ -27,8 +37,8 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 
 void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
 #else
-static inline __printf(2,3) struct dmem_cgroup_region *
-dmem_cgroup_register_region(u64 size, const char *name_fmt, ...)
+static inline __printf(2, 3) struct dmem_cgroup_region *
+dmem_cgroup_register_region(const struct dmem_cgroup_init *init, const char *name_fmt, ...)
 {
 	return NULL;
 }
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 6430c7ce1e03..d12c8543f3fe 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -502,7 +502,7 @@ EXPORT_SYMBOL_GPL(dmem_cgroup_unregister_region);
 
 /**
  * dmem_cgroup_register_region() - Register a regions for dev cgroup.
- * @size: Size of region to register, in bytes.
+ * @init: Initialization parameters for the region.
  * @fmt: Region parameters to register
  *
  * This function registers a node in the dmem cgroup with the
@@ -511,13 +511,15 @@ EXPORT_SYMBOL_GPL(dmem_cgroup_unregister_region);
  *
  * Return: NULL or a struct on success, PTR_ERR on failure.
  */
-struct dmem_cgroup_region *dmem_cgroup_register_region(u64 size, const char *fmt, ...)
+struct dmem_cgroup_region *
+dmem_cgroup_register_region(const struct dmem_cgroup_init *init,
+			    const char *fmt, ...)
 {
 	struct dmem_cgroup_region *ret;
 	char *region_name;
 	va_list ap;
 
-	if (!size)
+	if (!init || !init->size)
 		return NULL;
 
 	va_start(ap, fmt);
@@ -534,7 +536,7 @@ struct dmem_cgroup_region *dmem_cgroup_register_region(u64 size, const char *fmt
 
 	INIT_LIST_HEAD(&ret->pools);
 	ret->name = region_name;
-	ret->size = size;
+	ret->size = init->size;
 	kref_init(&ret->ref);
 
 	spin_lock(&dmemcg_lock);
-- 
2.54.0


^ permalink raw reply related
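
The extensibility argument in patch 2/6 rests on a C property: fields omitted from a designated initializer in a compound literal are zero-initialized. The sketch below is a user-space illustration under that assumption; struct region_init, struct region, register_region() and register_vram() are hypothetical analogs, not the kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ops table; only here to stand for a later-added field. */
struct region_ops {
	int (*reclaim)(uint64_t target_bytes, void *priv);
};

/* Hypothetical analog of the init struct after it has grown. */
struct region_init {
	uint64_t size;
	const struct region_ops *ops;	/* added later; old callers omit it */
	void *priv;			/* added later; old callers omit it */
};

/* Hypothetical region: copies what it needs out of the init struct. */
struct region {
	uint64_t size;
	const struct region_ops *ops;
	void *priv;
	int valid;
};

/* Reject a missing or empty init, as the patch does, then copy fields. */
static struct region register_region(const struct region_init *init)
{
	struct region r = { 0 };

	if (!init || !init->size)
		return r;	/* valid stays 0 */

	r.size = init->size;
	r.ops = init->ops;
	r.priv = init->priv;
	r.valid = 1;
	return r;
}

/* A caller written when only .size existed still compiles unchanged. */
static struct region register_vram(uint64_t vram_size)
{
	return register_region(&(struct region_init){
		.size = vram_size,
	});
}

/* Usage: fields the caller never mentioned read back as NULL. */
static void example(void)
{
	struct region r = register_vram(4096);

	assert(r.valid && r.size == 4096);
	assert(r.ops == NULL && r.priv == NULL);
}
```

With a positional u64 argument, every new parameter would have touched every caller. Here only callers that want the new fields need to change.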

* [PATCH v5 1/6] drm/amdgpu: Fix init ordering in amdgpu_vram_mgr_init()
From: Thomas Hellström @ 2026-06-11 14:22 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Sashiko-bot, Friedrich Vock,
	Maarten Lankhorst, Tejun Heo, Maxime Ripard, Christian König,
	Alex Deucher, amd-gfx, dri-devel, stable, Natalie Vock,
	Johannes Weiner, Michal Koutný, cgroups, Huang Rui,
	Matthew Brost, Matthew Auld, Maarten Lankhorst, Thomas Zimmermann,
	Simona Vetter, David Airlie, Rodrigo Vivi, linux-kernel
In-Reply-To: <20260611142242.2529-1-thomas.hellstrom@linux.intel.com>

drmm_cgroup_register_region() is called before INIT_LIST_HEAD() and
gpu_buddy_init() in amdgpu_vram_mgr_init(). If it fails, the function
returns early and bypasses those initializations.

Since adev->mman.initialized is set to true before amdgpu_vram_mgr_init()
is called, a failure triggers amdgpu_ttm_fini(), which calls
amdgpu_vram_mgr_fini(), which then:

 - Calls list_for_each_entry_safe() on reservations_pending and
   reserved_pages, whose list_head::next pointers are zero-initialized
   (NULL). The loop does not recognize them as empty and dereferences NULL.

 - Calls gpu_buddy_fini(), which iterates free_trees[] unconditionally
   via for_each_free_tree(). Since mm->free_trees is NULL
   (never allocated), this dereferences NULL.

Both result in a kernel panic on the module load error path.

Fix by moving drmm_cgroup_register_region() to after the list and buddy
allocator are fully initialized, so the teardown path is safe to run.

Reported-by: Sashiko-bot <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260428073116.15687-1-thomas.hellstrom@linux.intel.com?part=4
Fixes: 2b624a2c1865 ("drm/ttm: Handle cgroup based eviction in TTM")
Cc: Friedrich Vock <friedrich.vock@gmx.de>
Cc: Maarten Lankhorst <dev@lankhorst.se>
Cc: Tejun Heo <tj@kernel.org>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v6.14+
Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 2a241a5b12c4..ac3f71d77140 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -918,9 +918,6 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 	struct ttm_resource_manager *man = &mgr->manager;
 	int err;
 
-	man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram", adev->gmc.real_vram_size);
-	if (IS_ERR(man->cg))
-		return PTR_ERR(man->cg);
 	ttm_resource_manager_init(man, &adev->mman.bdev,
 				  adev->gmc.real_vram_size);
 
@@ -935,6 +932,10 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 	if (err)
 		return err;
 
+	man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram", adev->gmc.real_vram_size);
+	if (IS_ERR(man->cg))
+		return PTR_ERR(man->cg);
+
 	ttm_set_driver_manager(&adev->mman.bdev, TTM_PL_VRAM, &mgr->manager);
 	ttm_resource_manager_set_used(man, true);
 	return 0;
-- 
2.54.0


^ permalink raw reply related
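
The first failure mode in patch 1/6 can be reproduced outside the kernel. The sketch below assumes a minimal user-space copy of the circular list head; init_list_head() and list_empty() mirror the kernel helpers of similar name, and struct vram_mgr is a hypothetical stand-in for the manager. It shows that a zero-filled head does not read as empty, which is why teardown must not run before initialization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal user-space copy of the kernel's circular list head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Same effect as INIT_LIST_HEAD(): point the head at itself. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Same test as the kernel's list_empty(): empty means next == head. */
static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Hypothetical manager holding the two lists named in the commit message. */
struct vram_mgr {
	struct list_head reservations_pending;
	struct list_head reserved_pages;
};

static void example(void)
{
	struct vram_mgr mgr = { 0 };	/* zero-filled, like a kzalloc()ed object */

	/* Before init: next is NULL, which list_empty() does not treat as empty. */
	assert(mgr.reserved_pages.next == NULL);
	assert(!list_empty(&mgr.reserved_pages));

	/* After init: each head points at itself and reads as empty. */
	init_list_head(&mgr.reservations_pending);
	init_list_head(&mgr.reserved_pages);
	assert(list_empty(&mgr.reservations_pending));
	assert(list_empty(&mgr.reserved_pages));
}
```

An iterator that starts at head->next and stops when it reaches head again would follow the NULL pointer in the first case, which matches the dereference described above.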

* [PATCH v5 0/6] Add reclaim to the dmem cgroup controller
From: Thomas Hellström @ 2026-06-11 14:22 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Natalie Vock, Johannes Weiner, Tejun Heo,
	Michal Koutný, cgroups, Huang Rui, Matthew Brost,
	Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel

When writing a "max" limit lower than the current usage, the
existing code silently failed. This series aims to improve
on that by attempting to synchronously reclaim device memory
to push the usage under the new max limit. No error is returned
if usage remains above the limit after reclaim.

Patch 1 fixes a pre-existing amdgpu_vram_mgr_init() error path bug.
Patch 2 introduces struct dmem_cgroup_init for extensible region
      registration.
Patch 3 implements and documents a reclaim callback interface
      for the dmem controller.
Patch 4 implements a TTM reclaim callback.
Patches 5-6 hook up the reclaim callback to the dmem cgroup-aware
      drivers xe and amdgpu.

v2:
- Remove the error propagation that was in a previous series (Maarten)
- A number of updates in patch 1. See its commit message for
  details (Maarten)

v3:
- Add patch 1 fixing a pre-existing amdgpu_vram_mgr_init() error path
  bug where drmm_cgroup_register_region() was called before
  INIT_LIST_HEAD() and gpu_buddy_init(), causing a kernel panic on
  failure. (Sashiko-bot)
- Use an rwsem to protect reclaim callback registration and region
  unregister against concurrent reclaim invocations. (Sashiko-bot)
- Fix ttm_resource_manager_set_dmem_region() storing an error pointer
  in man->cg unconditionally. (Sashiko-bot)
- Fix kernel-doc function name format for ttm_bo_evict_cgroup() and
  ttm_resource_manager_set_dmem_region().

v4:
- Rebased on drm-tip; dropped the XE_PL_STOLEN guard in the xe patch
  as stolen memory uses a separate TTM manager.

v5:
- Add patch 2 introducing struct dmem_cgroup_init to make the
  dmem_cgroup_register_region() API extensible without adding positional
  arguments in the future.
- Use nonblock=true in reset_all_resource_limits() to avoid sleeping
  inside rcu_read_lock() in dmemcs_offline(). (Sashiko-bot)
- Compare usage against the truncated limit stored in cnt.max, not the
  original u64. (Sashiko-bot)
- Use DMEM_MAX_RECLAIM_RETRIES (16) retry budget instead of 5, matching
  the memcg controller; only -ENOSPC (no progress) counts against the
  budget, other errors abort immediately.
- Handle NULL region in ttm_resource_manager_set_dmem_region() to clear
  the reclaim callback, preventing use-after-free when the manager is
  torn down while the dmem region outlives it. (Sashiko-bot)
- Return 0 on any eviction progress; reserve -ENOSPC for zero progress.
- Register xe fini devres action before drmm_cgroup_register_region()
  so LIFO teardown runs unregister first, draining callbacks before the
  manager is destroyed. (Sashiko-bot)
- Switch amdgpu to explicit dmem_cgroup_unregister_region() at the top
  of amdgpu_vram_mgr_fini() before any manager teardown, since amdgpu's
  fini is called explicitly during driver unbind before drmm cleanup.
  (Sashiko-bot)
- Wrap the xe reclaim callback with drm_dev_enter()/drm_dev_exit() to
  prevent TTM reclaim from running after driver unbind.

User-space tests are at
https://patchwork.freedesktop.org/series/163935/

Test-with: 20260428065411.4222-1-thomas.hellstrom@linux.intel.com

Thomas Hellström (6):
  drm/amdgpu: Fix init ordering in amdgpu_vram_mgr_init()
  cgroup/dmem: Introduce struct dmem_cgroup_init for region
    initialization
  cgroup/dmem: Add reclaim callback for lowering max below current usage
  drm/ttm: Hook up a cgroup-aware reclaim callback for the dmem
    controller
  drm/xe: Wire up dmem cgroup reclaim for VRAM manager
  drm/amdgpu: Wire up dmem cgroup reclaim for VRAM manager

 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c      |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c |  28 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h |   2 +
 drivers/gpu/drm/drm_drv.c                    |   8 +-
 drivers/gpu/drm/ttm/ttm_bo.c                 |  95 +++++++++++++-
 drivers/gpu/drm/ttm/ttm_bo_util.c            |   3 +-
 drivers/gpu/drm/ttm/ttm_resource.c           |  50 +++++++
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c         |  53 +++++++-
 include/drm/drm_drv.h                        |   4 +-
 include/drm/ttm/ttm_bo.h                     |  10 ++
 include/drm/ttm/ttm_resource.h               |   7 +
 include/linux/cgroup_dmem.h                  |  37 +++++-
 kernel/cgroup/dmem.c                         | 129 +++++++++++++++++--
 13 files changed, 393 insertions(+), 35 deletions(-)

-- 
2.54.0


^ permalink raw reply
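
The v5 note about registering the xe fini action before drmm_cgroup_register_region() relies on managed actions being released in reverse registration order. A minimal user-space sketch of that ordering follows; struct action_stack, add_action() and release_all() are hypothetical and only model a last-in, first-out release list, not the DRM managed-resource API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ACTIONS 8

typedef void (*action_fn)(void *arg);

/* Hypothetical stand-in for a managed-resource action list. */
struct action_stack {
	action_fn fn[MAX_ACTIONS];
	void *arg[MAX_ACTIONS];
	size_t count;
};

static int add_action(struct action_stack *s, action_fn fn, void *arg)
{
	if (s->count == MAX_ACTIONS)
		return -1;
	s->fn[s->count] = fn;
	s->arg[s->count] = arg;
	s->count++;
	return 0;
}

/* Release in reverse registration order (last in, first out). */
static void release_all(struct action_stack *s)
{
	while (s->count) {
		s->count--;
		s->fn[s->count](s->arg[s->count]);
	}
}

/* Each action appends one letter to a log so the order is visible. */
static void mgr_fini(void *log)
{
	strcat(log, "M");
}

static void region_unregister(void *log)
{
	strcat(log, "U");
}

static void example(void)
{
	struct action_stack s = { .count = 0 };
	char log[8] = "";

	add_action(&s, mgr_fini, log);		/* registered first, runs last */
	add_action(&s, region_unregister, log);	/* registered last, runs first */
	release_all(&s);

	/* Unregister ran before the manager teardown. */
	assert(strcmp(log, "UM") == 0);
}
```

Because the unregister step runs first, any in-flight reclaim callback is drained before the manager it points at is torn down.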

* Re: [PATCH v3 3/7] sched/fair: Add cgroup_mode: max
From: Peter Zijlstra @ 2026-06-11 13:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: mingo, chenridong, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef
In-Reply-To: <d4ca5fe7-fd76-47c8-949a-a69916bfcbd4@redhat.com>

On Wed, Jun 10, 2026 at 11:09:59AM -0400, Waiman Long wrote:

> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -4116,6 +4116,21 @@ bool cpuset_cpus_allowed_fallback(struct
> >   	return changed;
> >   }
> > +int cpuset_num_cpus(struct cgroup *cgrp)
> > +{
> > +	int nr = num_online_cpus();
> > +	struct cpuset *cs;
> > +
> > +	if (is_in_v2_mode()) {
> > +		guard(rcu)();
> > +		cs = css_cs(cgroup_e_css(cgrp, &cpuset_cgrp_subsys));
> > +		if (cs)
> > +			nr = cpumask_weight(cs->effective_cpus);
> > +	}
> > +
> > +	return nr;
> > +}
> 
> I just have a question about cgroup v1 support. I am assuming that cgroup v1
> without the cpuset_v2_mode mount option is not supported. 

Correct.

> To fully support
> cgroup v1, you may have to use guarantee_active_cpus() to return the actual
> set of CPUs that the task can run on.

Except this is group based; we'd need to iterate over all tasks in the
group and compute the union of guarantee_active_cpus(), which seems
far too expensive and not worth the effort.

> Also there is a caveat about the arm64 specific
> task_cpu_possible_mask() for certain arm64 CPUs. That is for 32-bit
> binary running on 64-bit core which are allowed only on a selected
> subset of cores within the CPU.
> 
> This is probably not what you want to focus on right now, but it will be
> good to have a comment to list items that are not fully supported here.

Will add a comment!

^ permalink raw reply
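
The cost argument in the reply above can be illustrated with plain bitmasks. This is a user-space sketch assuming at most 64 CPUs; mask_weight(), group_num_cpus() and tasks_num_cpus() are hypothetical helpers, not cpuset functions. The group-based count is one popcount, while the task-based alternative has to visit every task.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count set bits; a user-space analog of a cpumask weight for <= 64 CPUs. */
static unsigned int mask_weight(uint64_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)	/* clear the lowest set bit */
		n++;
	return n;
}

/* Group-based estimate: one mask per group, one popcount. */
static unsigned int group_num_cpus(uint64_t effective_cpus)
{
	return mask_weight(effective_cpus);
}

/* Task-based alternative: walk every task and OR its mask in. */
static unsigned int tasks_num_cpus(const uint64_t *task_masks, size_t nr_tasks)
{
	uint64_t acc = 0;

	for (size_t i = 0; i < nr_tasks; i++)
		acc |= task_masks[i];
	return mask_weight(acc);
}

static void example(void)
{
	/* Group may use CPUs 0-3; its two tasks are pinned to {0,1} and {1,2}. */
	const uint64_t tasks[] = { 0x3, 0x6 };

	assert(group_num_cpus(0xf) == 4);
	assert(tasks_num_cpus(tasks, 2) == 3);
}
```

The two answers can differ, but the second one scales with the number of tasks in the group, which is the expense being declined.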

* Re: [PATCH v3 3/7] sched/fair: Add cgroup_mode: max
From: Peter Zijlstra @ 2026-06-11 13:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: mingo, chenridong, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef
In-Reply-To: <0cbc8a90-8e88-4227-bea5-f12fb0f293db@redhat.com>

On Wed, Jun 10, 2026 at 11:42:47AM -0400, Waiman Long wrote:

> > > --- a/kernel/cgroup/cpuset.c
> > > +++ b/kernel/cgroup/cpuset.c
> > > @@ -4116,6 +4116,21 @@ bool cpuset_cpus_allowed_fallback(struct
> > >       return changed;
> > >   }
> > >   +int cpuset_num_cpus(struct cgroup *cgrp)
> > > +{
> > > +    int nr = num_online_cpus();
> > > +    struct cpuset *cs;
> > > +
> > > +    if (is_in_v2_mode()) {
> > > +        guard(rcu)();
> > > +        cs = css_cs(cgroup_e_css(cgrp, &cpuset_cgrp_subsys));
> > > +        if (cs)
> > > +            nr = cpumask_weight(cs->effective_cpus);
> > > +    }
> > > +
> > > +    return nr;
> > > +}

> FYI, you may have to take the callback_lock to ensure the stability of the
> effective_cpus mask.

That seems pointless; the moment we drop that lock, it's changeable
again. Either way around, nr is but a snapshot.

^ permalink raw reply

* Re: [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair()
From: Peter Zijlstra @ 2026-06-11 11:32 UTC (permalink / raw)
  To: Aaron Lu
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef,
	svens
In-Reply-To: <20260603095108.GA1684319@bytedance.com>


Aaron,

Sorry I failed to notice this email earlier.

On Wed, Jun 03, 2026 at 05:51:08PM +0800, Aaron Lu wrote:

> I applied below diff and the problem is gone:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5f48af700fd44..942a543af3e54 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9897,6 +9897,9 @@ static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
>  	return p;
>  
>  idle:
> +	if (sched_core_enabled(rq))
> +		return NULL;
> +
>  	new_tasks = sched_balance_newidle(rq, rf);
>  	if (new_tasks < 0)
>  		return RETRY_TASK;
> 

Right, this is the safe patch and restores pick_task_fair() to its
previous status (for core-sched).

Since people are hitting this problem, I'm going to merge it as below.
I've presumed your SoB, please let me know if that's a problem.

I think I'm going to try and move newidle into sched_class::balance /
balance_fair(), but I'll do that next cycle.

Thanks!

---
Subject: sched/fair: Fix newidle vs core-sched
From: "Aaron Lu" <ziqianlu@bytedance.com>
Date: Wed, 3 Jun 2026 17:51:08 +0800

From: "Aaron Lu" <ziqianlu@bytedance.com>

While testing Prateek's throttle series, I noticed a panic when
coresched is enabled and bisected it to this patch.

I fed the panic log and this patch to an agent and its analysis looks
correct to me (cpu56 and cpu57 are siblings in a VM):

       cpu57 (holds core-wide lock)

     pick_next_task() [core scheduling]
     for_each_cpu_wrap(i, smt_mask, 57):
       i=57: pick_task(rq_57)
             pick_task_fair(rq_57)
             -> picks task A
       rq_57->core_pick = task A
       // task_rq(A) == rq_57

       i=56: pick_task(rq_56)
             pick_task_fair(rq_56)
             cfs_rq->nr_queued == 0
             goto idle
             sched_balance_newidle(rq_56)
             raw_spin_rq_unlock(rq_56)
             // core-wide lock released
             newidle_balance() pulls
               task A: rq_57 -> rq_56
             // task_rq(A) == rq_56 now
             raw_spin_rq_lock(rq_56)
             // core-wide lock re-acquired
             return > 0
             goto again
             pick_task_fair(rq_56)
             -> picks task A
       rq_56->core_pick = task A

     // first loop done
     // rq_57->core_pick is still task A (set before lock release)
     // but task_rq(A) == rq_56 now
     next = rq_57->core_pick  // = task A

     put_prev_set_next_task(rq_57, prev, task A)
     __set_next_task_fair(rq_57, task A)
     hrtick_start_fair(rq_57, task A)
     WARN_ON_ONCE(task_rq(task A) != rq_57)
     // task_rq(A) == rq_56

IOW: by allowing pick_task_fair() to do newidle_balance and not returning
RETRY_TASK, it can end up selecting the same task on two CPUs. Restore the
previous state by never doing newidle when core scheduling is enabled.

Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: "Aaron Lu" <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260603095108.GA1684319@bytedance.com
---
 kernel/sched/fair.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9942,6 +9942,9 @@ struct task_struct *pick_task_fair(struc
 	return p;
 
 idle:
+	if (sched_core_enabled(rq))
+		return NULL;
+
 	new_tasks = sched_balance_newidle(rq, rf);
 	if (new_tasks < 0)
 		return RETRY_TASK;
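
For readers who want to poke at the race outside the kernel, here is a minimal userspace sketch of the trace in the changelog. Everything in it is a stand-in: `task_model`, `rq_model`, `newidle_pull()` and `pick_task_model()` are hypothetical names, each runqueue holds at most one task, and there is no locking at all. It only models the ordering problem (the sibling pulls the task after the first CPU has already recorded it as its core pick), not the real scheduler.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for task_struct and rq; not kernel types. */
struct task_model {
	int cpu;			/* runqueue the task currently sits on */
};

struct rq_model {
	int cpu;
	struct task_model *queued;	/* at most one queued task in this model */
	struct task_model *core_pick;	/* what the core-wide pick selected */
};

/* Models newidle balance: move the sibling's only task to @dst. */
static bool newidle_pull(struct rq_model *dst, struct rq_model *src)
{
	if (!src->queued)
		return false;

	dst->queued = src->queued;
	src->queued = NULL;
	dst->queued->cpu = dst->cpu;	/* task_rq() changes here */
	return true;
}

/*
 * Models pick_task_fair(): with @guard set, an empty runqueue returns
 * NULL instead of balancing, like the sched_core_enabled() check.
 */
static struct task_model *pick_task_model(struct rq_model *rq,
					  struct rq_model *sibling, bool guard)
{
	if (rq->queued)
		return rq->queued;
	if (guard)
		return NULL;
	if (newidle_pull(rq, sibling))
		return rq->queued;
	return NULL;
}

/* Returns true when every core_pick still lives on the rq that picked it. */
static bool core_pick_consistent(bool guard)
{
	struct task_model a = { .cpu = 57 };
	struct rq_model rq57 = { .cpu = 57, .queued = &a };
	struct rq_model rq56 = { .cpu = 56 };

	/* Same order as the trace: 57 first, then its sibling 56. */
	rq57.core_pick = pick_task_model(&rq57, &rq56, guard);
	rq56.core_pick = pick_task_model(&rq56, &rq57, guard);

	if (rq57.core_pick && rq57.core_pick->cpu != rq57.cpu)
		return false;
	if (rq56.core_pick && rq56.core_pick->cpu != rq56.cpu)
		return false;
	return true;
}
```

Calling `core_pick_consistent(false)` reproduces the stale pick on the model's rq_57, while `core_pick_consistent(true)` leaves the sibling idle, which is the trade-off the patch accepts for now.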


* Re: [PATCH v2 05/16] mm/slab: introduce alloc_flags and SLAB_ALLOC_TRYLOCK
From: Vlastimil Babka (SUSE) @ 2026-06-11  8:51 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <7064a90e-4bf5-4be7-8b7f-a5a11dcee66f@kernel.org>

On 6/11/26 08:40, Harry Yoo wrote:
> Sashiko wrote [1]:
>> Does this bypass the caller's gfp constraints for standard allocations?
>> Looking at slab_alloc_node(), standard allocations now pass
>> SLAB_ALLOC_DEFAULT into alloc_from_pcs():
>> -	object = alloc_from_pcs(s, gfpflags, node);
>> +	object = alloc_from_pcs(s, gfpflags, SLAB_ALLOC_DEFAULT, node);
>> This default flag means alloc_flags_allow_spinning() will unconditionally
>> return true regardless of the gfp flags provided.
> 
> Yes, but that's not used in _nolock path
> as mentioned in patch 6 description :)
> 
>> If a caller allocating under a raw spinlock intentionally strips
>> __GFP_KSWAPD_RECLAIM (for example, by using __GFP_NOWARN) to prevent
>> sleeping,
> 
> That's a horrible hack (and hypothetical. Nobody should be stripping
> __GFP_KSWAPD_RECLAIM instead of using kmalloc_nolock(). That's purely
> broken).

Indeed this was never intended to work, and was just an unfortunate
side effect of the gfp flag reuse to implement kmalloc_nolock().

>> won't this allow the allocator to execute spin_lock_irqsave()
>> on barn->lock or n->list_lock?
>>
>> On systems with preempt-rt enabled, a standard spinlock maps to a sleeping
>> lock, so taking these locks in an atomic context could cause a scheduling
>> while atomic panic.
>>
>> Since there is no nolock variant available for custom caches, do callers
>> currently have any alternative mitigation?
> 
> Well, RT kernels are not supposed to allocate memory under a raw
> spinlock (at least w/ allow_spin = true)

Yep.

> [1]
> https://sashiko.dev/#/patchset/20260610-slab_alloc_flags-v2-0-7190909db118%40kernel.org
> 



* Re: [RFC PATCH v6 12/25] sched/rt: Add {alloc/unregister/free}_rt_sched_group
From: Juri Lelli @ 2026-06-11  8:42 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Johannes Weiner, Michal Koutný, cgroups,
	linux-kernel, Luca Abeni, Yuri Andriaccio
In-Reply-To: <20260608121546.69910-13-yurand2000@gmail.com>

Hello,

On 08/06/26 14:15, Yuri Andriaccio wrote:
> Add allocation and deallocation code for rt-cgroups.
> 
> Declare dl_server specific functions (only skeleton, but no
> implementation yet), needed by the deadline servers to be called when
> trying to schedule.
> 
> Initialize a cgroup's active context to that of its parent.
> 
> Co-developed-by: Alessio Balsini <a.balsini@sssup.it>
> Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
> Co-developed-by: Andrea Parri <parri.andrea@gmail.com>
> Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
> Co-developed-by: luca abeni <luca.abeni@santannapisa.it>
> Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
> Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
> ---

...

>  void free_rt_sched_group(struct task_group *tg)
>  {
> +	int i;
> +	unsigned long flags;
> +
>  	if (!rt_group_sched_enabled())
>  		return;
> +
> +	if (!tg->dl_se || !tg->rt_rq)
> +		return;
> +
> +	for_each_possible_cpu(i) {
> +		if (!tg->dl_se[i] || !tg->rt_rq[i])
> +			continue;
> +
> +		/*
> +		 * Shutdown the dl_server and free it
> +		 *
> +		 * Since the dl timer is going to be cancelled,
> +		 * we risk to never decrease the running bw...
> +		 * Fix this issue by changing the group runtime
> +		 * to 0 immediately before freeing it.
> +		 */
> +		if (tg->dl_se[i]->dl_runtime)
> +			dl_init_tg(tg->dl_se[i], 0, tg->dl_se[i]->dl_period);
> +
> +		raw_spin_rq_lock_irqsave(cpu_rq(i), flags);
> +		hrtimer_cancel(&tg->dl_se[i]->dl_timer);
> +		raw_spin_rq_unlock_irqrestore(cpu_rq(i), flags);

Why do we need to grab rq lock here? I actually fear this can deadlock
with the timer callback.

> +		kfree(tg->dl_se[i]);
> +
> +		/* Free the local per-cpu runqueue */
> +		kfree(rq_of_rt_rq(tg->rt_rq[i]));
> +	}
> +
> +	kfree(tg->rt_rq);
> +	kfree(tg->dl_se);
>  }
> 
> +static inline void __rt_rq_free(struct rt_rq **rt_rq)
> +{
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		kfree(rq_of_rt_rq(rt_rq[i]));
                            ^^
Can this result in NULL pointer deref if __alloc_rt_sched_group_data()
fails for some reason midway in the CPU loop?
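
A small userspace sketch of the concern, assuming rq_of_rt_rq() is a container_of()-style conversion here (the allocation side stores `&s_rq->rt`, which suggests so). With a non-zero member offset, a NULL slot would turn into a bogus non-NULL pointer before kfree() ever sees it, so the free loop would need its own NULL check. All names below (`rq_model`, `rt_rq_model`, `alloc_partial()`, `rt_rq_free_model()`) are hypothetical stand-ins, not kernel code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical layouts; only the non-zero offset of 'rt' matters. */
struct rt_rq_model {
	int nr_running;
};

struct rq_model {
	long pad;			/* pushes 'rt' away from offset 0 */
	struct rt_rq_model rt;
};

/* Same arithmetic as container_of(): subtract the member offset. */
#define container_of_model(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct rq_model *rq_of_rt_rq_model(struct rt_rq_model *rt_rq)
{
	return container_of_model(rt_rq, struct rq_model, rt);
}

/*
 * Emulates an allocation loop that fails midway: only the first @ok
 * slots are populated, the rest stay NULL from calloc().
 */
static struct rt_rq_model **alloc_partial(int nr, int ok)
{
	struct rt_rq_model **arr = calloc(nr, sizeof(*arr));

	if (!arr)
		return NULL;

	for (int i = 0; i < ok; i++) {
		struct rq_model *rq = calloc(1, sizeof(*rq));

		if (!rq)
			break;
		arr[i] = &rq->rt;
	}
	return arr;
}

/* Frees populated slots only; returns how many runqueues were freed. */
static int rt_rq_free_model(struct rt_rq_model **rt_rq, int nr)
{
	int freed = 0;

	for (int i = 0; i < nr; i++) {
		if (!rt_rq[i])		/* never populated: do not convert */
			continue;
		free(rq_of_rt_rq_model(rt_rq[i]));
		freed++;
	}
	free(rt_rq);
	return freed;
}
```

The `if (!rt_rq[i]) continue;` line is the point of the sketch; free_rt_sched_group() above already has an equivalent check, while __rt_rq_free() as posted does not.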

> +	}
> +
> +	kfree(rt_rq);
> +}
> +
> +DEFINE_FREE(rt_rq_free, struct rt_rq **, if (_T) __rt_rq_free(_T))
> +
> +static inline void __dl_se_free(struct sched_dl_entity **dl_se)
> +{
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		kfree(dl_se[i]);
> +	}
> +
> +	kfree(dl_se);
> +}
> +
> +DEFINE_FREE(dl_se_free, struct sched_dl_entity **, if (_T) __dl_se_free(_T))
> +
> +static int __alloc_rt_sched_group_data(struct task_group *tg) {
> +	/* Instantiate automatic cleanup in event of kalloc fail */
> +	struct rt_rq **tg_rt_rq __free(rt_rq_free) = NULL;
> +	struct sched_dl_entity **tg_dl_se __free(dl_se_free) = NULL;
> +	struct sched_dl_entity *dl_se __free(kfree) = NULL;
> +	struct rq *s_rq __free(kfree) = NULL;
> +	int i;
> +
> +	tg_rt_rq = kcalloc(nr_cpu_ids, sizeof(struct rt_rq *), GFP_KERNEL);
> +	if (!tg_rt_rq)
> +		return 0;
> +
> +	tg_dl_se = kcalloc(nr_cpu_ids,
> +			   sizeof(struct sched_dl_entity *), GFP_KERNEL);
> +	if (!tg_dl_se)
> +		return 0;
> +
> +	for_each_possible_cpu(i) {
> +		s_rq = kzalloc_node(sizeof(struct rq),
> +				    GFP_KERNEL, cpu_to_node(i));
> +		if (!s_rq)
> +			return 0;
> +
> +		dl_se = kzalloc_node(sizeof(struct sched_dl_entity),
> +				     GFP_KERNEL, cpu_to_node(i));
> +		if (!dl_se)
> +			return 0;
> +
> +		tg_rt_rq[i] = &no_free_ptr(s_rq)->rt;
> +		tg_dl_se[i] = no_free_ptr(dl_se);
> +	}
> +
> +	tg->rt_rq = no_free_ptr(tg_rt_rq);
> +	tg->dl_se = no_free_ptr(tg_dl_se);
> +
> +	return 1;
> +}

...

Thanks,
Juri



* Re: [PATCH v2 02/16] mm/slab: do not init any kfence objects on allocation
From: Vlastimil Babka (SUSE) @ 2026-06-11  8:34 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Andrey Konovalov, Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm,
	linux-kernel, cgroups
In-Reply-To: <b6530e92-d648-4028-9e77-0df8c3ab166d@kernel.org>

On 6/11/26 05:19, Harry Yoo wrote:
> 
>> This potentially adds overhead of the is_kfence_address() check to
>> allocation hotpath, but that one is designed to be as small as possible,
>> and it's only evaluated if zeroing is about to happen. This means (aside
>> from init_on_alloc hardening) only for __GFP_ZERO allocations, and the
>> zeroing itself comes with an overhead likely larger than the added
>> check.
>> 
>> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>> ---
>>  mm/kfence/core.c |  2 +-
>>  mm/slub.c        | 23 ++++++++---------------
>>  2 files changed, 9 insertions(+), 16 deletions(-)
>> 
>> diff --git a/mm/slub.c b/mm/slub.c
>> index e2ee8f1aaccf..8e5264d3ddbf 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -4565,9 +4565,10 @@ struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
>>  
>>  static __fastpath_inline
>>  bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>> -			  gfp_t flags, size_t size, void **p, bool init,
>> +			  gfp_t flags, size_t size, void **p,
>>  			  unsigned int orig_size)
>>  {
>> +	bool init = slab_want_init_on_alloc(flags, s);
>>  	unsigned int zero_size = s->object_size;
>>  	bool kasan_init = init;
>>  	size_t i;
>> @@ -4608,7 +4609,8 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>>  	for (i = 0; i < size; i++) {
>>  		p[i] = kasan_slab_alloc(s, p[i], init_flags, kasan_init);
>>  		if (p[i] && init && (!kasan_init ||
>> -				     !kasan_has_integrated_init()))
>> +				     !kasan_has_integrated_init())
>> +				 && !is_kfence_address(p[i]))
> 
> I hope we could make it a bit more verbose and straightforward,
> something like:
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 5d7ea72ebebd..29cf4590f9d9 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4573,7 +4573,6 @@ bool slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, size_t size,
>  {
>  	bool init = slab_want_init_on_alloc(flags, s);
>  	unsigned int zero_size = s->object_size;
> -	bool kasan_init = init;
>  	size_t i;
>  	gfp_t init_flags = flags & gfp_allowed_mask;
> 
> @@ -4591,29 +4590,37 @@ bool slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, size_t size,
>  	if (slub_debug_orig_size(s))
>  		zero_size = ac->orig_size;
> 
> -	/*
> -	 * When slab_debug is enabled, avoid memory initialization integrated
> -	 * into KASAN and instead zero out the memory via the memset below with
> -	 * the proper size. Otherwise, KASAN might overwrite SLUB redzones and
> -	 * cause false-positive reports. This does not lead to a performance
> -	 * penalty on production builds, as slab_debug is not intended to be
> -	 * enabled there.
> -	 */
> -	if (__slub_debug_enabled())
> -		kasan_init = false;
> -
> -	/*
> -	 * As memory initialization might be integrated into KASAN,
> -	 * kasan_slab_alloc and initialization memset must be
> -	 * kept together to avoid discrepancies in behavior.
> -	 *
> -	 * As p[i] might get tagged, memset and kmemleak hook come after KASAN.
> -	 */
>  	for (i = 0; i < size; i++) {
> -		p[i] = kasan_slab_alloc(s, p[i], init_flags, kasan_init);
> -		if (p[i] && init && (!kasan_init ||
> -				     !kasan_has_integrated_init())
> -				 && !is_kfence_address(p[i]))
> +		bool skip_init = false;
> +
> +		if (is_kfence_address(p[i])) {
> +			/*
> +			 * kfence zeroes the object instead of SLUB to avoid
> +			 * overwriting its own redzone, and zeroing of
> +			 * s->object_size will corrupt it.
> +			 */
> +			skip_init = true;

But now we perform this check even if init is false, making it more hot.

> +		} else if (__slub_debug_enabled()) {
> +			/*
> +			 * KASAN never zeroes memory when slab_debug is enabled
> +			 * to avoid overwriting SLUB redzones. This does not
> +			 * lead to a performance penalty on production builds,
> +			 * as slab_debug is not intended to be enabled there.
> +			 */
> +			skip_init = false;
> +		} else if (kasan_has_integrated_init()) {
> +			/*
> +			 * ARM64 can set memory tags and zero the memory using
> +			 * a single instruction. Since HW_TAGS KASAN uses that
> +			 * while tagging the object, a separate zeroing is
> +			 * unnecessary unless slab_debug is enabled.
> +			 */

(I like the new/updated comments)

> +			skip_init = true;
> +		}

And these two are now done in every loop iteration even though they don't
depend on the object. Yeah it's a static key and build-time constant but still.

But maybe there's some middle ground?

Above the loop do (with your comments).

bool init;

/* ARM64 can ...
 * ...
 * But KASAN never zeroes ...
 */
if (kasan_has_integrated_init() && !__slub_debug_enabled())
	init = false;
else
	init = slab_want_init_on_alloc(flags, s);

In the loop:

		if (p[i] && init && !is_kfence_address(p[i]))
			memset(p[i], 0, zero_size);
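
A userspace sketch of that middle ground, to make the shape concrete. The `init_env` booleans are hypothetical stand-ins for kasan_has_integrated_init(), __slub_debug_enabled() and slab_want_init_on_alloc(), and `from_kfence` stands in for is_kfence_address(). What gets passed to kasan_slab_alloc() is deliberately left out of the model.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical knobs replacing the kernel predicates. */
struct init_env {
	bool kasan_integrated_init;	/* kasan_has_integrated_init() */
	bool slub_debug;		/* __slub_debug_enabled() */
	bool want_init;			/* slab_want_init_on_alloc() */
};

/* Hypothetical object; the flag replaces is_kfence_address(). */
struct obj_model {
	unsigned char data[8];
	bool from_kfence;
};

/* Object-independent decision, evaluated once above the loop. */
static bool decide_init(const struct init_env *env)
{
	if (env->kasan_integrated_init && !env->slub_debug)
		return false;	/* tagging already zeroed the memory */
	return env->want_init;
}

/* Returns the number of objects zeroed by the memset path. */
static size_t post_alloc_model(const struct init_env *env,
			       struct obj_model **p, size_t size)
{
	bool init = decide_init(env);
	size_t zeroed = 0;

	for (size_t i = 0; i < size; i++) {
		/* Only the per-object kfence test stays in the loop. */
		if (p[i] && init && !p[i]->from_kfence) {
			memset(p[i]->data, 0, sizeof(p[i]->data));
			zeroed++;
		}
	}
	return zeroed;
}
```

With this shape the kfence check is only reached when `init` is true, so it stays off the path of allocations that do not zero at all.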

> +		p[i] = kasan_slab_alloc(s, p[i], init_flags, init && skip_init);
> +		/* memset and hooks come after KASAN as p[i] might get tagged */
> +		if (p[i] && init && !skip_init)
>  			memset(p[i], 0, zero_size);
>  		if (alloc_flags_allow_spinning(ac->alloc_flags))
>  			kmemleak_alloc_recursive(p[i], s->object_size, 1,
> 



* Re: [PATCH v2 08/16] mm/slab: pass alloc_flags to new slab allocation
From: Harry Yoo @ 2026-06-11  7:52 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-8-7190909db118@kernel.org>





On 6/11/26 12:40 AM, Vlastimil Babka (SUSE) wrote:
> Add the alloc_flags parameter to allocate_slab() and new_slab()
> so it can be used to determine if spinning is allowed, independently
> from gfp flags.
> 
> refill_objects() passes SLAB_ALLOC_DEFAULT because it can only be
> reached from contexts that allow spinning.
> 
> Also change how trynode_flags are constructed in ___slab_alloc() to
> achieve the same "do not upgrade to GFP_NOWAIT" by using masking instead
> of a branch. It will now also not upgrade in cases where gfp is weaker
> than GFP_NOWAIT (i.e. lacks __GFP_KSWAPD_RECLAIM) but doesn't come from
> kmalloc_nolock() - which is more correct anyway.

Wait, debugobjects intentionally avoids __GFP_KSWAPD_RECLAIM,
but we have been upgrading it to GFP_NOWAIT?

> During the masking keep also existing __GFP_NOMEMALLOC (pointed out by
> Sashiko) and __GFP_ACCOUNT. Previously the hardcoded GFP_NOWAIT would
> eliminate them, but it's not a big problem that would need a separate
> fix.

Ack.

> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
>  mm/slub.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 98b79e5e7679..8f6ca3d5fdfa 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4467,25 +4470,22 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	 * 1) try to get a partial slab from target node only by having
>  	 *    __GFP_THISNODE in pc.flags for get_from_partial()
>  	 * 2) if 1) failed, try to allocate a new slab from target node with
> -	 *    GPF_NOWAIT | __GFP_THISNODE opportunistically
> +	 *    (at most) GFP_NOWAIT | __GFP_THISNODE opportunistically
>  	 * 3) if 2) failed, retry with original gfpflags which will allow
>  	 *    get_from_partial() try partial lists of other nodes before
>  	 *    potentially allocating new page from other nodes
>  	 */
>  	if (unlikely(node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
>  		     && try_thisnode)) {
> -		if (unlikely(!allow_spin))
> -			/* Do not upgrade gfp to NOWAIT from more restrictive mode */
> -			trynode_flags = gfpflags | __GFP_THISNODE;
> -		else
> -			trynode_flags = GFP_NOWAIT | __GFP_THISNODE;
> +		trynode_flags &= GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_ACCOUNT;
> +		trynode_flags |= __GFP_NOWARN | __GFP_THISNODE;
>  	}
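
To check my reading of the masking, here is a small userspace model. The bit values are made up, `M_NOWAIT` is modelled as kswapd reclaim plus nowarn, and `trynode_new()` assumes trynode_flags starts out as a copy of gfpflags (the initialization is not in the quoted hunk). It compares the old branch with the new mask for a gfp that lacks __GFP_KSWAPD_RECLAIM.

```c
#include <stdint.h>

typedef uint32_t gfp_model_t;

/* Hypothetical bit assignments; the real ones live in the gfp headers. */
#define M_KSWAPD_RECLAIM	(1u << 0)
#define M_DIRECT_RECLAIM	(1u << 1)
#define M_NOMEMALLOC		(1u << 2)
#define M_ACCOUNT		(1u << 3)
#define M_NOWARN		(1u << 4)
#define M_THISNODE		(1u << 5)
#define M_IO			(1u << 6)

/* Modelled after GFP_NOWAIT: kswapd wakeup allowed, no warning. */
#define M_NOWAIT		(M_KSWAPD_RECLAIM | M_NOWARN)

/* Old behaviour: a branch on allow_spin. */
static gfp_model_t trynode_old(gfp_model_t gfp, int allow_spin)
{
	if (!allow_spin)
		return gfp | M_THISNODE;	/* keep the restrictive mode */
	return M_NOWAIT | M_THISNODE;		/* may upgrade a weaker gfp */
}

/* New behaviour: mask down to at most NOWAIT, keep two extra flags. */
static gfp_model_t trynode_new(gfp_model_t gfp)
{
	gfp_model_t flags = gfp;	/* assumption: starts as gfpflags */

	flags &= M_NOWAIT | M_NOMEMALLOC | M_ACCOUNT;
	flags |= M_NOWARN | M_THISNODE;
	return flags;
}
```

In this model a caller that omits the kswapd bit gets it added by the old spinning branch but not by the mask, which is the upgrade asked about above.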

-- 
Cheers,
Harry / Hyeonggon



* Re: [PATCH v2 05/16] mm/slab: introduce alloc_flags and SLAB_ALLOC_TRYLOCK
From: Harry Yoo @ 2026-06-11  6:40 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-5-7190909db118@kernel.org>





On 6/11/26 12:40 AM, Vlastimil Babka (SUSE) wrote:
> Similarly to the page allocators, introduce slab-allocator specific
> alloc flags that internally control allocation behavior in addition to
> gfp_flags, without occupying the limited gfp flags space.
> 
> Introduce the first flag SLAB_ALLOC_TRYLOCK that behaves similarly to
> page allocator's ALLOC_TRYLOCK and will be used to reimplement
> kmalloc_nolock()'s "!allow_spin" behavior. That currently relies on
> gfpflags_allow_spinning() and thus the lack of both __GFP_RECLAIM flags,
> importantly __GFP_KSWAPD_RECLAIM. This can give false-positive results
> e.g. in early boot with a restricted gfp_allowed_mask.
> 
> Also introduce alloc_flags_allow_spinning() to replace the usage of
> gfpflags_allow_spinning().
> 
> Start using alloc_flags and the new check first in alloc_from_pcs() and
> __pcs_replace_empty_main(). This means some slab allocations that were
> falsely treated as kmalloc_nolock() due to their gfp flags will now have
> a higher chance of succeeding, and this will further increase with followup
> changes.
> 
> Remove a WARN_ON_ONCE() from refill_objects() as it's now legitimate to
> reach it from a slab allocation that's not _nolock() and yet lacks
> __GFP_KSWAPD_RECLAIM for other reasons.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
>  mm/slab.h |  9 +++++++++
>  mm/slub.c | 17 ++++++++---------
>  2 files changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index 1bf9c3021ae3..96f65b625600 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -4664,7 +4665,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
>  		return NULL;
>  	}
>  
> -	allow_spin = gfpflags_allow_spinning(gfp);
> +	allow_spin = alloc_flags_allow_spinning(alloc_flags);

Sashiko wrote [1]:
> Does this bypass the caller's gfp constraints for standard allocations?
> Looking at slab_alloc_node(), standard allocations now pass
> SLAB_ALLOC_DEFAULT into alloc_from_pcs():
> -	object = alloc_from_pcs(s, gfpflags, node);
> +	object = alloc_from_pcs(s, gfpflags, SLAB_ALLOC_DEFAULT, node);
> This default flag means alloc_flags_allow_spinning() will unconditionally
> return true regardless of the gfp flags provided.

Yes, but that's not used in _nolock path
as mentioned in patch 6 description :)

> If a caller allocating under a raw spinlock intentionally strips
> __GFP_KSWAPD_RECLAIM (for example, by using __GFP_NOWARN) to prevent
> sleeping,

That's a horrible hack (and hypothetical. Nobody should be stripping
__GFP_KSWAPD_RECLAIM instead of using kmalloc_nolock(). That's purely
broken).

> won't this allow the allocator to execute spin_lock_irqsave()
> on barn->lock or n->list_lock?
>
> On systems with preempt-rt enabled, a standard spinlock maps to a sleeping
> lock, so taking these locks in an atomic context could cause a scheduling
> while atomic panic.
>
> Since there is no nolock variant available for custom caches, do callers
> currently have any alternative mitigation?

Well, RT kernels are not supposed to allocate memory under a raw
spinlock (at least w/ allow_spin = true)

[1]
https://sashiko.dev/#/patchset/20260610-slab_alloc_flags-v2-0-7190909db118%40kernel.org
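
For context, a rough userspace model of the two checks being contrasted. The bit values and the `_model` names are hypothetical, and `alloc_flags_allow_spinning_model()` is only my guess at the obvious shape (spinning is allowed unless the trylock flag is set); the actual helper is defined in the patch. The gfp-based check returns true when either reclaim flag is present, per the changelog.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values for the two __GFP_RECLAIM flags. */
#define M_GFP_DIRECT_RECLAIM	(1u << 0)
#define M_GFP_KSWAPD_RECLAIM	(1u << 1)
#define M_GFP_RECLAIM		(M_GFP_DIRECT_RECLAIM | M_GFP_KSWAPD_RECLAIM)

/* Hypothetical slab-internal alloc flags, separate from gfp space. */
#define M_SLAB_ALLOC_DEFAULT	0u
#define M_SLAB_ALLOC_TRYLOCK	(1u << 0)

/* Old check: infers "may spin" from the presence of a reclaim flag. */
static bool gfpflags_allow_spinning_model(uint32_t gfp)
{
	return (gfp & M_GFP_RECLAIM) != 0;
}

/* Assumed new check: only an explicit trylock request forbids spinning. */
static bool alloc_flags_allow_spinning_model(uint32_t alloc_flags)
{
	return !(alloc_flags & M_SLAB_ALLOC_TRYLOCK);
}
```

A gfp of 0 (for example after a restricted gfp_allowed_mask in early boot) is treated as "no spinning" by the old check, while the alloc_flags check only says so when the caller really came through the _nolock path.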

-- 
Cheers,
Harry / Hyeonggon



* Re: [PATCH] cgroup/cpuset: Support multiple source/destination cpusets using pids pattern
From: Ridong Chen @ 2026-06-11  6:17 UTC (permalink / raw)
  To: Waiman Long; +Cc: cgroups, Tejun Heo, Johannes Weiner, linux-kernel
In-Reply-To: <e7bbd0b7-6c44-4875-aaed-160791adc9b2@redhat.com>



On 6/9/2026 2:49 AM, Waiman Long wrote:
> 
> On 6/6/26 11:12 PM, Ridong Chen wrote:
>> When the cpuset controller is enabled/disabled in a parent cgroup, tasks
>> from multiple child cpusets need to be migrated. The current code only
>> handles a single source/destination pair.
>>
>> Support multiple source/destination cpusets by adopting the per-task
>> processing pattern similar to the pids controller:
>>
>> 1) Perform per-task DL bandwidth reservation (dl_bw_alloc) directly in
>>     cpuset_can_attach() instead of batching into sum_migrate_dl_bw. This
>>     eliminates the sum_migrate_dl_bw and dl_bw_cpu fields from the cpuset
>>     struct.
>>
>> 2) Track attach_in_progress per-task per-destination cpuset to properly
>>     guard all involved cpusets from having their cpus/mems zeroed.
>>
>> 3) Use a shared cpuset_undo_attach() helper for both rollback-on-error
>>     in cpuset_can_attach() and for cpuset_cancel_attach().
>>
>> 4) Detect many-source migrations and force cpus_updated/mems_updated
>>     to true so all tasks get properly updated during attach.
>>
>> 5) Defer nr_deadline_tasks updates to cpuset_attach() (after migration
>>     is committed) to avoid a race with dl_rebuild_rd_accounting() that
>>     could see inconsistent values between can_attach and attach.
>>
>> Signed-off-by: Ridong Chen <ridong.chen@linux.dev>
>> ---
>>   kernel/cgroup/cpuset-internal.h |   7 --
>>   kernel/cgroup/cpuset.c          | 167 ++++++++++++++++----------------
>>   2 files changed, 84 insertions(+), 90 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
>> index f7aaf01f7cd5..8f32cb97eb94 100644
>> --- a/kernel/cgroup/cpuset-internal.h
>> +++ b/kernel/cgroup/cpuset-internal.h
>> @@ -167,13 +167,6 @@ struct cpuset {
>>        */
>>       int nr_deadline_tasks;
>>       int nr_migrate_dl_tasks;
>> -    /* DL bandwidth that needs destination reservation for this attach. */
>> -    u64 sum_migrate_dl_bw;
>> -    /*
>> -     * CPU used for temporary DL bandwidth allocation during attach;
>> -     * -1 if no DL bandwidth was allocated in the current attach.
>> -     */
>> -    int dl_bw_cpu;
>>       /* Invalid partition error code, not lock protected */
>>       enum prs_errcode prs_err;
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index e52a5a40d607..a6d96a39cdb1 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -288,7 +288,6 @@ struct cpuset top_cpuset = {
>>       .flags = BIT(CS_CPU_EXCLUSIVE) |
>>            BIT(CS_MEM_EXCLUSIVE) | BIT(CS_SCHED_LOAD_BALANCE),
>>       .partition_root_state = PRS_ROOT,
>> -    .dl_bw_cpu = -1,
>>   };
>>   /**
>> @@ -580,8 +579,6 @@ static struct cpuset *dup_or_alloc_cpuset(struct cpuset *cs)
>>       if (!trial)
>>           return NULL;
>> -    trial->dl_bw_cpu = -1;
>> -
>>       /* Setup cpumask pointer array */
>>       cpumask_var_t *pmask[4] = {
>>           &trial->cpus_allowed,
>> @@ -3026,31 +3023,36 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
>>       return 0;
>>   }
>> -static int cpuset_reserve_dl_bw(struct cpuset *cs)
>> +/*
>> + * Undo DL bandwidth reservations and attach_in_progress increments done
>> + * in cpuset_can_attach(). Used for both rollback on error and cancel_attach.
>> + * If @stop_at is non-NULL, undo only for tasks before @stop_at in the tset.
>> + */
>> +static void cpuset_undo_attach(struct cgroup_taskset *tset,
>> +                   struct task_struct *stop_at)
>>   {
>> -    int cpu, ret;
>> -
>> -    if (!cs->sum_migrate_dl_bw)
>> -        return 0;
>> +    struct cgroup_subsys_state *css;
>> +    struct task_struct *task;
>> -    cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
>> -    if (unlikely(cpu >= nr_cpu_ids))
>> -        return -EINVAL;
>> +    cgroup_taskset_for_each(task, css, tset) {
>> +        struct cpuset *cs = css_cs(css);
>> -    ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
>> -    if (ret)
>> -        return ret;
>> +        if (task == stop_at)
>> +            break;
>> -    cs->dl_bw_cpu = cpu;
>> -    return 0;
>> +        if (dl_task(task)) {
>> +            cs->nr_migrate_dl_tasks--;
>> +            if (dl_task_needs_bw_move(task, cs->effective_cpus)) {
>> +                int cpu = cpumask_any_and(cpu_active_mask,
>> +                             cs->effective_cpus);
>> +                dl_bw_free(cpu, task->dl.dl_bw);
>> +            }
>> +        }
>> +        dec_attach_in_progress_locked(cs);
>> +    }
>>   }
>> -static void reset_migrate_dl_data(struct cpuset *cs)
>> -{
>> -    cs->nr_migrate_dl_tasks = 0;
>> -    cs->sum_migrate_dl_bw = 0;
>> -    cs->dl_bw_cpu = -1;
>> -}
>> +static bool attach_many_sources;
>>   /* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
>>   static int cpuset_can_attach(struct cgroup_taskset *tset)
>> @@ -3067,90 +3069,73 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>>       cs = css_cs(css);
>>       mutex_lock(&cpuset_mutex);
>> +    attach_many_sources = false;
>>       /* Check to see if task is allowed in the cpuset */
>>       ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
>>       if (ret)
>>           goto out_unlock;
>> -    /*
>> -     * The cpuset_attach_old_cs is used mainly by cpuset_migrate_mm() to get
>> -     * the old_mems_allowed value. There are two ways that many-to-one
>> -     * cpuset migration can happen:
>> -     * 1) A multithread application with threads in different cpusets is
>> -     *    wholely migrated to a new cpuset.
>> -     * 2) Disabling v2 cpuset controller will move all the tasks in child
>> -     *    cpusets to the parent cpuset.
>> -     *
>> -     * In the former case, it is the mm setting of the group leader that
>> -     * really matters. So cpuset_attach_old_cs should track the oldcs of the
>> -     * group leader. It falls back to the oldcs of the first task if there
>> -     * is no group leader in the taskset. In the latter case, effective_mems
>> -     * of child cpusets must always be a subset of the parent. So no real
>> -     * page migration will be necessary no matter which child cpuset is
>> -     * selected as cpuset_attach_old_cs.
>> -     */
>>       cgroup_taskset_for_each(task, css, tset) {
>> +        struct cpuset *newcs = css_cs(css);
>> +        struct cpuset *new_oldcs = task_cs(task);
>> +
>> +        if (newcs != cs || new_oldcs != oldcs) {
>> +            if (new_oldcs != oldcs)
>> +                attach_many_sources = true;
>> +            cs = newcs;
>> +            oldcs = new_oldcs;
>> +            ret = cpuset_can_attach_check(cs, oldcs,
>> +                              &setsched_check);
>> +            if (ret)
>> +                goto out_rollback;
>> +        }
>> +
>>           ret = task_can_attach(task);
>>           if (ret)
>> -            goto out_unlock;
>> +            goto out_rollback;
>> -        /* Update cpuset_attach_old_cs to the latest group leader */
>>           if (task == task->group_leader)
>>               cpuset_attach_old_cs = task_cs(task);
>>           if (setsched_check) {
>>               ret = security_task_setscheduler(task);
>>               if (ret)
>> -                goto out_unlock;
>> +                goto out_rollback;
>>           }
>>           if (dl_task(task)) {
>> -            /*
>> -             * Count all migrating DL tasks for cpuset task accounting.
>> -             * Only tasks that need a root-domain bandwidth move
>> -             * contribute to sum_migrate_dl_bw.
>> -             */
>>               cs->nr_migrate_dl_tasks++;
>> -            if (dl_task_needs_bw_move(task, cs->effective_cpus))
>> -                cs->sum_migrate_dl_bw += task->dl.dl_bw;
>> +            if (dl_task_needs_bw_move(task, cs->effective_cpus)) {
>> +                int cpu = cpumask_any_and(cpu_active_mask,
>> +                             cs->effective_cpus);
>> +                if (unlikely(cpu >= nr_cpu_ids)) {
>> +                    ret = -EINVAL;
>> +                    goto out_rollback;
>> +                }
>> +                ret = dl_bw_alloc(cpu, task->dl.dl_bw);
>> +                if (ret)
>> +                    goto out_rollback;
>> +            }
>>           }
>> -    }
>> -
>> -    ret = cpuset_reserve_dl_bw(cs);
>> -out_unlock:
>> -    if (ret) {
>> -        reset_migrate_dl_data(cs);
>> -    } else {
>> -        /*
>> -         * Mark attach is in progress.  This makes validate_change() fail
>> -         * changes which zero cpus/mems_allowed.
>> -         */
>>           cs->attach_in_progress++;
>>       }
>> +    goto out_unlock;
>> +
>> +out_rollback:
>> +    cpuset_undo_attach(tset, task);
>> +
>> +out_unlock:
>>       mutex_unlock(&cpuset_mutex);
>>       return ret;
>>   }
>>   static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>>   {
>> -    struct cgroup_subsys_state *css;
>> -    struct cpuset *cs;
>> -
>> -    cgroup_taskset_first(tset, &css);
>> -    cs = css_cs(css);
>> -
>>       mutex_lock(&cpuset_mutex);
>> -    dec_attach_in_progress_locked(cs);
>> -
>> -    if (cs->dl_bw_cpu >= 0)
>> -        dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
>> -
>> -    if (cs->nr_migrate_dl_tasks)
>> -        reset_migrate_dl_data(cs);
>> -
>> +    cpuset_undo_attach(tset, NULL);
>>       mutex_unlock(&cpuset_mutex);
>>   }
>> @@ -3232,8 +3217,15 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>>       mutex_lock(&cpuset_mutex);
>>       queue_task_work = false;
>> -    attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
>> -    attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
>> +    if (attach_many_sources) {
>> +        attach_cpus_updated = true;
>> +        attach_mems_updated = true;
>> +    } else {
>> +        attach_cpus_updated = !cpumask_equal(cs->effective_cpus,
>> +                            oldcs->effective_cpus);
>> +        attach_mems_updated = !nodes_equal(cs->effective_mems,
>> +                           oldcs->effective_mems);
>> +    }
>>       /*
>>        * In the default hierarchy, enabling cpuset in the child cgroups
>> @@ -3250,20 +3242,29 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>>       }
>>       cgroup_taskset_for_each(task, css, tset)
>> -        cpuset_attach_task(cs, task);
>> +        cpuset_attach_task(css_cs(css), task);
>>   out:
>>       if (queue_task_work)
>>           schedule_flush_migrate_mm();
>>       cs->old_mems_allowed = cpuset_attach_nodemask_to;
>> -    if (cs->nr_migrate_dl_tasks) {
>> -        cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
>> -        oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
>> -        reset_migrate_dl_data(cs);
>> -    }
>> +    /*
>> +     * Update nr_deadline_tasks now that migration is committed.
>> +     * nr_migrate_dl_tasks was accumulated per-dst in can_attach but
>> +     * nr_deadline_tasks is deferred to here to avoid a race with
>> +     * dl_rebuild_rd_accounting() between can_attach and attach.
>> +     */
>> +    cgroup_taskset_for_each(task, css, tset) {
>> +        struct cpuset *dst_cs = css_cs(css);
>> -    dec_attach_in_progress_locked(cs);
>> +        if (dst_cs->nr_migrate_dl_tasks) {
>> +            dst_cs->nr_deadline_tasks += dst_cs->nr_migrate_dl_tasks;
>> +            oldcs->nr_deadline_tasks -= dst_cs->nr_migrate_dl_tasks;
>> +            dst_cs->nr_migrate_dl_tasks = 0;
>> +        }
>> +        dec_attach_in_progress_locked(dst_cs);
>> +    }
> 
> You are assuming that there is only one source cpuset. That may not be
> true. I suppose we need not track the set of destination cpusets, since
> they will be hit during task iteration, but we will still need to track
> all the source cpusets.
> 
> Cheers,
> Longman
> 
>>       mutex_unlock(&cpuset_mutex);
>>   }
> 

Yeah, we still need per-source accounting here. I agree that we can just
track the source cpusets.
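
To make the per-source point concrete, here is a rough userspace sketch of
the accounting, not the actual patch. The names cpuset_sketch,
migrating_task, sketch_can_attach() and sketch_attach() are hypothetical
stand-ins, and locking, dl_bw handling and attach_in_progress are left out.
The idea is only that each deadline task debits the cpuset it actually came
from, instead of a single oldcs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct cpuset. */
struct cpuset_sketch {
	int nr_deadline_tasks;   /* committed count of deadline tasks */
	int nr_migrate_dl_tasks; /* incoming deadline tasks, not yet committed */
};

/* Hypothetical per-task record: each task remembers its own source. */
struct migrating_task {
	struct cpuset_sketch *src;
	struct cpuset_sketch *dst;
	int is_deadline;
};

/* can_attach stage: only accumulate pending counts on the destination. */
static void sketch_can_attach(struct migrating_task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (tasks[i].is_deadline)
			tasks[i].dst->nr_migrate_dl_tasks++;
}

/*
 * attach stage: commit one task at a time, so every source cpuset is
 * debited for exactly the deadline tasks that left it.
 */
static void sketch_attach(struct migrating_task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!tasks[i].is_deadline)
			continue;
		assert(tasks[i].src->nr_deadline_tasks > 0);
		tasks[i].src->nr_deadline_tasks--;
		tasks[i].dst->nr_deadline_tasks++;
	}

	/* Clear the pending counters once the migration is committed. */
	for (size_t i = 0; i < n; i++)
		tasks[i].dst->nr_migrate_dl_tasks = 0;
}
```

With two sources each holding one migrating deadline task, both sources are
decremented by one and the destination gains two, which a single-oldcs
update cannot express.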

-- 
Best regards
Ridong



* Re: [PATCH v2 07/16] mm/slab: replace struct partial_context with slab_alloc_context
From: Harry Yoo @ 2026-06-11  6:05 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-7-7190909db118@kernel.org>



On 6/11/26 12:40 AM, Vlastimil Babka (SUSE) wrote:
> Refactor get_from_partial_node(), get_from_any_partial(),
> get_from_partial() and ___slab_alloc().
> 
> Remove struct partial_context, which used to be more substantial but
> shrank as part of the sheaves conversion. Instead pass gfp_flags and
> pointer to the new slab_alloc_context, which together is a superset of
> partial_context.
> 
> This means alloc_flags are now available and we can use them to
> determine if spinning is allowed, further reducing false positive "not
> allowed" in the slow path due to gfp flags lacking __GFP_RECLAIM.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Looks good to me,
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon
