* [PATCH 1/5] memcg: add GPU statistic
2025-04-23 21:37 [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration Dave Airlie
@ 2025-04-23 21:37 ` Dave Airlie
2025-04-23 21:37 ` [PATCH 2/5] memcg: export stat change function Dave Airlie
` (4 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Dave Airlie @ 2025-04-23 21:37 UTC (permalink / raw)
To: dri-devel, tj, christian.koenig; +Cc: cgroups
From: Dave Airlie <airlied@redhat.com>
Discrete and Integrated GPUs can use system RAM instead of
VRAM for all or some allocations. These allocations happen
via drm/ttm subsystem and are currently not accounted for
in cgroups.
Add a gpu statistic to allow a place to visualise allocations
once they are supported.
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
Documentation/admin-guide/cgroup-v2.rst | 3 +++
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 2 ++
3 files changed, 6 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 1a16ce68a4d7..e10a1dfa6051 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1480,6 +1480,9 @@ The following nested keys are defined.
vmalloc (npn)
Amount of memory used for vmap backed memory.
+ gpu (npn)
+ Amount of memory used for GPU device system RAM.
+
shmem
Amount of cached filesystem data that is swap-backed,
such as tmpfs, shm segments, shared anonymous mmap()s
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 53364526d877..4058d4bd94ed 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -36,6 +36,7 @@ enum memcg_stat_item {
MEMCG_SOCK,
MEMCG_PERCPU_B,
MEMCG_VMALLOC,
+ MEMCG_GPU,
MEMCG_KMEM,
MEMCG_ZSWAP_B,
MEMCG_ZSWAPPED,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c96c1f2b9cf5..25471a0fd0be 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -326,6 +326,7 @@ static const unsigned int memcg_stat_items[] = {
MEMCG_SOCK,
MEMCG_PERCPU_B,
MEMCG_VMALLOC,
+ MEMCG_GPU,
MEMCG_KMEM,
MEMCG_ZSWAP_B,
MEMCG_ZSWAPPED,
@@ -1358,6 +1359,7 @@ static const struct memory_stat memory_stats[] = {
{ "percpu", MEMCG_PERCPU_B },
{ "sock", MEMCG_SOCK },
{ "vmalloc", MEMCG_VMALLOC },
+ { "gpu", MEMCG_GPU },
{ "shmem", NR_SHMEM },
#ifdef CONFIG_ZSWAP
{ "zswap", MEMCG_ZSWAP_B },
--
2.49.0
^ permalink raw reply related [flat|nested] 10+ messages in thread* [PATCH 2/5] memcg: export stat change function
2025-04-23 21:37 [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration Dave Airlie
2025-04-23 21:37 ` [PATCH 1/5] memcg: add GPU statistic Dave Airlie
@ 2025-04-23 21:37 ` Dave Airlie
2025-04-23 21:37 ` [PATCH 3/5] ttm: add initial memcg integration Dave Airlie
` (3 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Dave Airlie @ 2025-04-23 21:37 UTC (permalink / raw)
To: dri-devel, tj, christian.koenig; +Cc: cgroups
From: Dave Airlie <airlied@redhat.com>
In order for modular GPU memory mgmt TTM to adjust the GPU
statistic we need to export the stat change functionality.
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
mm/memcontrol.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 25471a0fd0be..68b23a03d2a6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -709,6 +709,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
memcg_rstat_updated(memcg, val);
trace_mod_memcg_state(memcg, idx, val);
}
+EXPORT_SYMBOL_GPL(__mod_memcg_state);
#ifdef CONFIG_MEMCG_V1
/* idx can be of type enum memcg_stat_item or node_stat_item. */
--
2.49.0
^ permalink raw reply related [flat|nested] 10+ messages in thread* [PATCH 3/5] ttm: add initial memcg integration.
2025-04-23 21:37 [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration Dave Airlie
2025-04-23 21:37 ` [PATCH 1/5] memcg: add GPU statistic Dave Airlie
2025-04-23 21:37 ` [PATCH 2/5] memcg: export stat change function Dave Airlie
@ 2025-04-23 21:37 ` Dave Airlie
2025-04-23 21:37 ` [PATCH 4/5] amdgpu: add support for " Dave Airlie
` (2 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Dave Airlie @ 2025-04-23 21:37 UTC (permalink / raw)
To: dri-devel, tj, christian.koenig; +Cc: cgroups
From: Dave Airlie <airlied@redhat.com>
Doing proper integration of TTM system memory allocations with
memcg is a difficult ask, primarily due to difficulties around
accounting for evictions properly.
However there are systems where userspace will be allocating
objects in system memory and they won't be prone to migrating
or evicting and we should start with at least accounting those.
This adds a memcg group to ttm bo and tt objects. If this object
allocates pages directly (not from a TTM pool, so cached non-dma
pages), then they will be accounted for using __GFP_ACCOUNT,
and added to the memcg GPU statistic.
Only operations which set the account_op flag in the ttm operation
context will have this accounting happen. This patch disables the
flag around object evictions, but any operation that could populate
a TTM tt object in process context should set the account_op flag.
This doesn't account for uncached pages due to the page pool,
and it doesn't account for dma allocated pages. However for a lot
of use cases this should be a sufficient first step. Adding uncached
page support would mean hooking into memcg for the uncached pool
interactions, and I need to consider dma paths further.
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
drivers/gpu/drm/ttm/ttm_bo.c | 4 ++++
drivers/gpu/drm/ttm/ttm_bo_vm.c | 3 ++-
drivers/gpu/drm/ttm/ttm_pool.c | 19 ++++++++++++++++++-
drivers/gpu/drm/ttm/ttm_tt.c | 1 +
include/drm/ttm/ttm_bo.h | 8 ++++++++
include/drm/ttm/ttm_tt.h | 6 +++++-
6 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 95b86003c50d..631984fd459b 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -744,8 +744,12 @@ static int ttm_bo_alloc_resource(struct ttm_buffer_object *bo,
continue;
}
+ /* we don't want to account evictions at this point */
+ bool old_ctx_account = ctx->account_op;
+ ctx->account_op = false;
ret = ttm_bo_evict_alloc(bdev, man, place, bo, ctx,
ticket, res, limit_pool);
+ ctx->account_op = old_ctx_account;
dmem_cgroup_pool_state_put(limit_pool);
if (ret == -EBUSY)
continue;
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index a194db83421d..163039cf40a5 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -220,7 +220,8 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
struct ttm_operation_ctx ctx = {
.interruptible = true,
.no_wait_gpu = false,
- .force_alloc = true
+ .force_alloc = true,
+ .account_op = true,
};
ttm = bo->ttm;
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 83b10706ba89..934bbf2c0ff6 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -672,6 +672,9 @@ static void ttm_pool_free_range(struct ttm_pool *pool, struct ttm_tt *tt,
tt->dma_address + i : NULL;
nr = ttm_pool_unmap_and_free(pool, p, dma_addr, caching);
+ if (tt->page_flags & TTM_TT_FLAG_ACCOUNTED) {
+ mod_memcg_state(tt->memcg, MEMCG_GPU, -nr);
+ }
}
}
}
@@ -742,9 +745,17 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
* that behaviour.
*/
if (!p) {
+ gfp_t account_flags = gfp_flags;
page_caching = ttm_cached;
allow_pools = false;
- p = ttm_pool_alloc_page(pool, gfp_flags, order);
+
+ if (!pool->use_dma_alloc && tt->memcg && ctx->account_op) {
+ account_flags |= __GFP_ACCOUNT;
+ tt->page_flags |= TTM_TT_FLAG_ACCOUNTED;
+ } else {
+ tt->page_flags &= ~TTM_TT_FLAG_ACCOUNTED;
+ }
+ p = ttm_pool_alloc_page(pool, account_flags, order);
}
/* If that fails, lower the order if possible and retry. */
if (!p) {
@@ -757,6 +768,12 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
r = -ENOMEM;
goto error_free_all;
}
+
+ /* only deal with non-pool pages for memcg for now */
+ if (tt->page_flags & TTM_TT_FLAG_ACCOUNTED) {
+ mod_memcg_state(tt->memcg, MEMCG_GPU, (1 << order));
+ }
+
r = ttm_pool_page_allocated(pool, order, p, page_caching, alloc,
restore);
if (r)
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index df0aa6c4b8b8..cf9852560d9b 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -161,6 +161,7 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
ttm->caching = caching;
ttm->restore = NULL;
ttm->backup = NULL;
+ ttm->memcg = bo->memcg;
}
int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index 903cd1030110..03e30a056427 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -135,6 +135,12 @@ struct ttm_buffer_object {
* reservation lock.
*/
struct sg_table *sg;
+
+ /**
+ * @memcg: memory cgroup to charge this to if it ends up using system memory.
+ * NULL means don't charge.
+ */
+ struct mem_cgroup *memcg;
};
#define TTM_BO_MAP_IOMEM_MASK 0x80
@@ -174,6 +180,7 @@ struct ttm_bo_kmap_obj {
* BOs share the same reservation object.
* @force_alloc: Don't check the memory account during suspend or CPU page
* faults. Should only be used by TTM internally.
+ * @account_op: use __GFP_ACCOUNT for this.
* @resv: Reservation object to allow reserved evictions with.
* @bytes_moved: Statistics on how many bytes have been moved.
*
@@ -186,6 +193,7 @@ struct ttm_operation_ctx {
bool gfp_retry_mayfail;
bool allow_res_evict;
bool force_alloc;
+ bool account_op;
struct dma_resv *resv;
uint64_t bytes_moved;
};
diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
index 13cf47f3322f..23e5bcbaec3a 100644
--- a/include/drm/ttm/ttm_tt.h
+++ b/include/drm/ttm/ttm_tt.h
@@ -101,8 +101,9 @@ struct ttm_tt {
#define TTM_TT_FLAG_EXTERNAL_MAPPABLE BIT(3)
#define TTM_TT_FLAG_DECRYPTED BIT(4)
#define TTM_TT_FLAG_BACKED_UP BIT(5)
+#define TTM_TT_FLAG_ACCOUNTED BIT(6)
-#define TTM_TT_FLAG_PRIV_POPULATED BIT(6)
+#define TTM_TT_FLAG_PRIV_POPULATED BIT(7)
uint32_t page_flags;
/** @num_pages: Number of pages in the page array. */
uint32_t num_pages;
@@ -126,6 +127,9 @@ struct ttm_tt {
enum ttm_caching caching;
/** @restore: Partial restoration from backup state. TTM private */
struct ttm_pool_tt_restore *restore;
+
+ /** @memcg: Memory cgroup to account this to */
+ struct mem_cgroup *memcg;
};
/**
--
2.49.0
^ permalink raw reply related [flat|nested] 10+ messages in thread* [PATCH 4/5] amdgpu: add support for memcg integration
2025-04-23 21:37 [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration Dave Airlie
` (2 preceding siblings ...)
2025-04-23 21:37 ` [PATCH 3/5] ttm: add initial memcg integration Dave Airlie
@ 2025-04-23 21:37 ` Dave Airlie
2025-04-23 21:37 ` [PATCH 5/5] nouveau: add " Dave Airlie
2025-04-28 10:43 ` [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration Christian König
5 siblings, 0 replies; 10+ messages in thread
From: Dave Airlie @ 2025-04-23 21:37 UTC (permalink / raw)
To: dri-devel, tj, christian.koenig; +Cc: cgroups
From: Dave Airlie <airlied@redhat.com>
This adds the memcg object for any user allocated object,
and adds account_op to necessary paths which might populate
a tt object.
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 7 ++++++-
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 14 +++++++++++---
drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 +
4 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 82df06a72ee0..1a275224b4a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -787,6 +787,7 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
struct ttm_operation_ctx ctx = {
.interruptible = true,
.no_wait_gpu = false,
+ .account_op = true,
.resv = bo->tbo.base.resv
};
uint32_t domain;
@@ -839,7 +840,11 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
union drm_amdgpu_cs *cs)
{
struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
- struct ttm_operation_ctx ctx = { true, false };
+ struct ttm_operation_ctx ctx = {
+ .interruptible = true,
+ .no_wait_gpu = false,
+ .account_op = true,
+ };
struct amdgpu_vm *vm = &fpriv->vm;
struct amdgpu_bo_list_entry *e;
struct drm_gem_object *obj;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 69429df09477..bdad9a862ed3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -89,6 +89,7 @@ static void amdgpu_gem_object_free(struct drm_gem_object *gobj)
struct amdgpu_bo *aobj = gem_to_amdgpu_bo(gobj);
amdgpu_hmm_unregister(aobj);
+ mem_cgroup_put(aobj->tbo.memcg);
ttm_bo_put(&aobj->tbo);
}
@@ -116,6 +117,7 @@ int amdgpu_gem_object_create(struct amdgpu_device *adev, unsigned long size,
bp.domain = initial_domain;
bp.bo_ptr_size = sizeof(struct amdgpu_bo);
bp.xcp_id_plus1 = xcp_id_plus1;
+ bp.memcg = get_mem_cgroup_from_mm(current->mm);
r = amdgpu_bo_create_user(adev, &bp, &ubo);
if (r)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 0b9987781f76..777cf05ebac8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -632,6 +632,7 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
struct ttm_operation_ctx ctx = {
.interruptible = (bp->type != ttm_bo_type_kernel),
.no_wait_gpu = bp->no_wait_gpu,
+ .account_op = true,
/* We opt to avoid OOM on system pages allocations */
.gfp_retry_mayfail = true,
.allow_res_evict = bp->type != ttm_bo_type_kernel,
@@ -657,16 +658,21 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
size = ALIGN(size, PAGE_SIZE);
}
- if (!amdgpu_bo_validate_size(adev, size, bp->domain))
+ if (!amdgpu_bo_validate_size(adev, size, bp->domain)) {
+ mem_cgroup_put(bp->memcg);
return -ENOMEM;
+ }
BUG_ON(bp->bo_ptr_size < sizeof(struct amdgpu_bo));
*bo_ptr = NULL;
bo = kvzalloc(bp->bo_ptr_size, GFP_KERNEL);
- if (bo == NULL)
+ if (bo == NULL) {
+ mem_cgroup_put(bp->memcg);
return -ENOMEM;
+ }
drm_gem_private_object_init(adev_to_drm(adev), &bo->tbo.base, size);
+ bo->tbo.memcg = bp->memcg;
bo->tbo.base.funcs = &amdgpu_gem_object_funcs;
bo->vm_bo = NULL;
bo->preferred_domains = bp->preferred_domain ? bp->preferred_domain :
@@ -1341,7 +1347,9 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
vm_fault_t amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
{
struct amdgpu_device *adev = amdgpu_ttm_adev(bo->bdev);
- struct ttm_operation_ctx ctx = { false, false };
+ struct ttm_operation_ctx ctx = { .interruptible = false,
+ .no_wait_gpu = false,
+ .account_op = true };
struct amdgpu_bo *abo = ttm_to_amdgpu_bo(bo);
int r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 375448627f7b..9a4c506cfb76 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -55,6 +55,7 @@ struct amdgpu_bo_param {
enum ttm_bo_type type;
bool no_wait_gpu;
struct dma_resv *resv;
+ struct mem_cgroup *memcg;
void (*destroy)(struct ttm_buffer_object *bo);
/* xcp partition number plus 1, 0 means any partition */
int8_t xcp_id_plus1;
--
2.49.0
^ permalink raw reply related [flat|nested] 10+ messages in thread* [PATCH 5/5] nouveau: add memcg integration
2025-04-23 21:37 [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration Dave Airlie
` (3 preceding siblings ...)
2025-04-23 21:37 ` [PATCH 4/5] amdgpu: add support for " Dave Airlie
@ 2025-04-23 21:37 ` Dave Airlie
2025-04-28 10:43 ` [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration Christian König
5 siblings, 0 replies; 10+ messages in thread
From: Dave Airlie @ 2025-04-23 21:37 UTC (permalink / raw)
To: dri-devel, tj, christian.koenig; +Cc: cgroups
From: Dave Airlie <airlied@redhat.com>
This just adds the memcg init and account_op support.
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
drivers/gpu/drm/nouveau/nouveau_bo.c | 1 +
drivers/gpu/drm/nouveau/nouveau_gem.c | 2 ++
2 files changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 2016c1e7242f..8e2da4d48ce3 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -350,6 +350,7 @@ nouveau_bo_init(struct nouveau_bo *nvbo, u64 size, int align, u32 domain,
struct ttm_operation_ctx ctx = {
.interruptible = false,
.no_wait_gpu = false,
+ .account_op = true,
.resv = robj,
};
diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index 67e3c99de73a..56899c89bdd8 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -87,6 +87,7 @@ nouveau_gem_object_del(struct drm_gem_object *gem)
return;
}
+ mem_cgroup_put(nvbo->bo.memcg);
ttm_bo_put(&nvbo->bo);
pm_runtime_mark_last_busy(dev);
@@ -254,6 +255,7 @@ nouveau_gem_new(struct nouveau_cli *cli, u64 size, int align, uint32_t domain,
if (IS_ERR(nvbo))
return PTR_ERR(nvbo);
+ nvbo->bo.memcg = get_mem_cgroup_from_mm(current->mm);
nvbo->bo.base.funcs = &nouveau_gem_object_funcs;
nvbo->no_share = domain & NOUVEAU_GEM_DOMAIN_NO_SHARE;
--
2.49.0
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration
2025-04-23 21:37 [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration Dave Airlie
` (4 preceding siblings ...)
2025-04-23 21:37 ` [PATCH 5/5] nouveau: add " Dave Airlie
@ 2025-04-28 10:43 ` Christian König
2025-04-28 16:00 ` Simona Vetter
2025-04-28 19:31 ` Dave Airlie
5 siblings, 2 replies; 10+ messages in thread
From: Christian König @ 2025-04-28 10:43 UTC (permalink / raw)
To: Dave Airlie, dri-devel, tj; +Cc: cgroups
On 4/23/25 23:37, Dave Airlie wrote:
> Hey,
>
> I've been tasked to look into this, and I'm going start from hopeless
> naivety and see how far I can get. This is an initial attempt to hook
> TTM system memory allocations into memcg and account for them.
Yeah, this looks mostly like what we had already discussed.
>
> It does:
> 1. Adds memcg GPU statistic,
> 2. Adds TTM memcg pointer for drivers to set on their user object
> allocation paths
> 3. Adds a singular path where we account for memory in TTM on cached
> non-pooled non-dma allocations. Cached memory allocations used to be
> pooled but we dropped that a while back which makes them the best target
> to start attacking this from.
I think that should go into the resource like the existing dmem approach instead. That allows drivers to control the accounting through the placement which is far less error prone than the context.
It would also completely avoid the pooled vs unpooled problematic.
> 4. It only accounts for memory that is allocated directly from a userspace
> TTM operation (like page faults or validation). It *doesn't* account for
> memory allocated in eviction paths due to device memory pressure.
Yeah, that's something I totally agree on.
But the major show stopper is still accounting to memcg will break existing userspace. E.g. display servers can get attacked with a deny of service with that.
The feature would need to be behind a module option or not account allocations for DRM masters or something like that.
>
> This seems to work for me here on my hacked up tests systems at least, I
> can see the GPU stats moving and they look sane.
>
> Future work:
> Account for pooled non-cached
> Account for pooled dma allocations (no idea how that looks)
> Figure out if accounting for eviction is possible, and what it might look
> like.
T.J. suggested to account but don't limit the evictions and I think that should work.
Regards,
Christian.
>
> Dave.
>
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration
2025-04-28 10:43 ` [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration Christian König
@ 2025-04-28 16:00 ` Simona Vetter
2025-04-28 19:31 ` Dave Airlie
1 sibling, 0 replies; 10+ messages in thread
From: Simona Vetter @ 2025-04-28 16:00 UTC (permalink / raw)
To: Christian König; +Cc: Dave Airlie, dri-devel, tj, cgroups
On Mon, Apr 28, 2025 at 12:43:30PM +0200, Christian König wrote:
> On 4/23/25 23:37, Dave Airlie wrote:
> > Hey,
> >
> > I've been tasked to look into this, and I'm going start from hopeless
> > naivety and see how far I can get. This is an initial attempt to hook
> > TTM system memory allocations into memcg and account for them.
>
> Yeah, this looks mostly like what we had already discussed.
>
> >
> > It does:
> > 1. Adds memcg GPU statistic,
> > 2. Adds TTM memcg pointer for drivers to set on their user object
> > allocation paths
> > 3. Adds a singular path where we account for memory in TTM on cached
> > non-pooled non-dma allocations. Cached memory allocations used to be
> > pooled but we dropped that a while back which makes them the best target
> > to start attacking this from.
>
> I think that should go into the resource like the existing dmem approach
> instead. That allows drivers to control the accounting through the
> placement which is far less error prone than the context.
>
> It would also completely avoid the pooled vs unpooled problematic.
>
>
> > 4. It only accounts for memory that is allocated directly from a userspace
> > TTM operation (like page faults or validation). It *doesn't* account for
> > memory allocated in eviction paths due to device memory pressure.
>
> Yeah, that's something I totally agree on.
>
> But the major show stopper is still accounting to memcg will break
> existing userspace. E.g. display servers can get attacked with a deny of
> service with that.
>
> The feature would need to be behind a module option or not account
> allocations for DRM masters or something like that.
The trouble is that support is very uneven, and it will be even more
uneven going forward. Especially if we then also add in SoC drivers, which
have all kinds of fun between system memory, cma, carveout and userptr all
being differently accounted for.
Which means I think we need two pieces here:
1. opt-in enabling, or things break
2. some way to figure out whether what userspace expects in term of
enforcement matches what the kernel actually does
Without two we'll never manage to get this beyond the initial demo stage I
fear, and we'll have a really hard time rolling out the various pieces to
various drivers.
But I have no idea what this should look like at all unfortuantely. Best I
can come up with is a set of flags of what kind of enforcement the kernel
does, and every time we add something new we set a new flag. And if the
flags userspace or the modoption opt-in sets don't match what the kernel
supports, you get a fallback to no enforcment.
But a module option flag approach doesn't cover at all per-driver or
per-device changes. I think for that we need the kernel to provide enough
information to userspace in sysfs, which userspace then needs to use to
set/update cgroup limits to fit whatever use-case it case. Or maybe a
per-device opt-in flag set.
Also I think the only fallback we can realistically provide is "no
enforcement", and that would technically be a regression every time we add
a new enforcement feature and hence another opt-in flag. And see below
with just the eviction example, I think there's plenty of really tricky
areas where we will just never get to the end state in one step, because
it's too much work and too hard to get right from the first attempt.
I think once we have a decent opt-in/forward-compatible strategy for
cgroups gpu features, adding not-entirely-complete solutions to get this
moving is the right thing to do.
> > This seems to work for me here on my hacked up tests systems at least, I
> > can see the GPU stats moving and they look sane.
> >
> > Future work:
> > Account for pooled non-cached
> > Account for pooled dma allocations (no idea how that looks)
> > Figure out if accounting for eviction is possible, and what it might look
> > like.
>
> T.J. suggested to account but don't limit the evictions and I think that
> should work.
I think this will need a ladder of implementations, where we slowly get to
a full featured place. Maybe something like:
1. Don't account evicted buffers. Pretty obvious gap if you're on a dgpu,
but entirely fine with an igpu without stolen memory.
2. Account, but don't enforce any limits on evictions. This could already
get funny if then system memory allocations start failing for random
reasons due to memory pressure from other processes.
3. Probably at this point we need a memcg aware shrinker in ttm drivers
that want to go further.
4. Start enforcing limits even on eviction.
I probably missed a few steps, like about enforcing dmem limits. And
memory pin limits also tie into this all in interesting ways (both for
system and device memory).
Cheers, Sima
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration
2025-04-28 10:43 ` [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration Christian König
2025-04-28 16:00 ` Simona Vetter
@ 2025-04-28 19:31 ` Dave Airlie
2025-04-29 7:29 ` Christian König
1 sibling, 1 reply; 10+ messages in thread
From: Dave Airlie @ 2025-04-28 19:31 UTC (permalink / raw)
To: Christian König; +Cc: dri-devel, tj, cgroups
On Mon, 28 Apr 2025 at 20:43, Christian König <christian.koenig@amd.com> wrote:
>
> On 4/23/25 23:37, Dave Airlie wrote:
> > Hey,
> >
> > I've been tasked to look into this, and I'm going start from hopeless
> > naivety and see how far I can get. This is an initial attempt to hook
> > TTM system memory allocations into memcg and account for them.
>
> Yeah, this looks mostly like what we had already discussed.
>
> >
> > It does:
> > 1. Adds memcg GPU statistic,
> > 2. Adds TTM memcg pointer for drivers to set on their user object
> > allocation paths
> > 3. Adds a singular path where we account for memory in TTM on cached
> > non-pooled non-dma allocations. Cached memory allocations used to be
> > pooled but we dropped that a while back which makes them the best target
> > to start attacking this from.
>
> I think that should go into the resource like the existing dmem approach instead. That allows drivers to control the accounting through the placement which is far less error prone than the context.
I'll reconsider this, but I'm not sure it'll work at that level,
because we have to handle the fact that when something gets put back
into the pool it gets removed from the cgroup kmem accounting and
taken from the pool gets added to the cgroup kmem account, but
otherwise we just use __GFP_ACCOUNT on allocations. I've added cached
pool support yesterday, which just leaves the dma paths which probably
aren't too insane, but I'll re-evaluate this and see if higher level
makes sense.
> > 4. It only accounts for memory that is allocated directly from a userspace
> > TTM operation (like page faults or validation). It *doesn't* account for
> > memory allocated in eviction paths due to device memory pressure.
>
> Yeah, that's something I totally agree on.
>
> But the major show stopper is still accounting to memcg will break existing userspace. E.g. display servers can get attacked with a deny of service with that.
The thing with modern userspace, I'm not sure this out of the box is a
major problem, we usually run the display server and the user
processes in the same cgroup, so they share limits. Most modern
distros don't run X.org servers as root in a separate cgroup, even
running X is usually in the same cgroup as the users of it, Android
might have different opinions of course, but I'd probably suggest we
Kconfig this stuff and let distros turn it on once we agree on a
baseline.
> >
> > This seems to work for me here on my hacked up tests systems at least, I
> > can see the GPU stats moving and they look sane.
> >
> > Future work:
> > Account for pooled non-cached
> > Account for pooled dma allocations (no idea how that looks)
> > Figure out if accounting for eviction is possible, and what it might look
> > like.
>
> T.J. suggested to account but don't limit the evictions and I think that should work.
>
I was going to introduce an gpu eviction stat counter as a start, I
also got the idea that might be a bit hard to pull off, but if a
process needs to evict from VRAM, but the original process has no
space in it's cgroup, we just fail the VRAM allocation for the current
process, which didn't sound insane, but I haven't considered how
implementing that in TTM might look.
Dave.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration
2025-04-28 19:31 ` Dave Airlie
@ 2025-04-29 7:29 ` Christian König
0 siblings, 0 replies; 10+ messages in thread
From: Christian König @ 2025-04-29 7:29 UTC (permalink / raw)
To: Dave Airlie; +Cc: dri-devel, tj, cgroups
On 4/28/25 21:31, Dave Airlie wrote:
> On Mon, 28 Apr 2025 at 20:43, Christian König <christian.koenig@amd.com> wrote:
>>
>> On 4/23/25 23:37, Dave Airlie wrote:
>>> Hey,
>>>
>>> I've been tasked to look into this, and I'm going start from hopeless
>>> naivety and see how far I can get. This is an initial attempt to hook
>>> TTM system memory allocations into memcg and account for them.
>>
>> Yeah, this looks mostly like what we had already discussed.
>>
>>>
>>> It does:
>>> 1. Adds memcg GPU statistic,
>>> 2. Adds TTM memcg pointer for drivers to set on their user object
>>> allocation paths
>>> 3. Adds a singular path where we account for memory in TTM on cached
>>> non-pooled non-dma allocations. Cached memory allocations used to be
>>> pooled but we dropped that a while back which makes them the best target
>>> to start attacking this from.
>>
>> I think that should go into the resource like the existing dmem approach instead. That allows drivers to control the accounting through the placement which is far less error prone than the context.
>
> I'll reconsider this, but I'm not sure it'll work at that level,
> because we have to handle the fact that when something gets put back
> into the pool it gets removed from the cgroup kmem accounting and
> taken from the pool gets added to the cgroup kmem account, but
> otherwise we just use __GFP_ACCOUNT on allocations.
Especially for the user queue case a lot of those allocations are done from a background worker were simply using __GFP_ACCOUNT doesn't work.
We need to track for each BO who created it and either switch to that group before allocations or just account to it directly.
> I've added cached
> pool support yesterday, which just leaves the dma paths which probably
> aren't too insane, but I'll re-evaluate this and see if higher level
> makes sense.
The DMA path is still used quite often, especially on laptops with APUs and limited addressing capabilities as well as basically all non-x86 architectures.
>
>>> 4. It only accounts for memory that is allocated directly from a userspace
>>> TTM operation (like page faults or validation). It *doesn't* account for
>>> memory allocated in eviction paths due to device memory pressure.
>>
>> Yeah, that's something I totally agree on.
>>
>> But the major show stopper is still accounting to memcg will break existing userspace. E.g. display servers can get attacked with a deny of service with that.
>
> The thing with modern userspace, I'm not sure this out of the box is a
> major problem, we usually run the display server and the user
> processes in the same cgroup, so they share limits. Most modern
> distros don't run X.org servers as root in a separate cgroup, even
> running X is usually in the same cgroup as the users of it, Android
> might have different opinions of course, but I'd probably suggest we
> Kconfig this stuff and let distros turn it on once we agree on a
> baseline.
>
>>>
>>> This seems to work for me here on my hacked up tests systems at least, I
>>> can see the GPU stats moving and they look sane.
>>>
>>> Future work:
>>> Account for pooled non-cached
>>> Account for pooled dma allocations (no idea how that looks)
>>> Figure out if accounting for eviction is possible, and what it might look
>>> like.
>>
>> T.J. suggested to account but don't limit the evictions and I think that should work.
>>
>
> I was going to introduce an gpu eviction stat counter as a start, I
> also got the idea that might be a bit hard to pull off, but if a
> process needs to evict from VRAM, but the original process has no
> space in it's cgroup, we just fail the VRAM allocation for the current
> process, which didn't sound insane,
That is insane.
The problem is that you can't let the allocation of one process fail because another process has reached it's limit.
That basically kills all reproducibility because userspace can't figure out why an allocation failed.
Regards,
Christian.
> but I haven't considered how
> implementing that in TTM might look.
>
> Dave.
^ permalink raw reply [flat|nested] 10+ messages in thread