* [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
@ 2025-05-02  3:35 Dave Airlie
  2025-05-02  3:36 ` [PATCH 1/5] memcg: add GPU statistic Dave Airlie
                   ` (6 more replies)
  0 siblings, 7 replies; 42+ messages in thread
From: Dave Airlie @ 2025-05-02  3:35 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Waiman Long, simona

Hey all,

This is my second attempt at adding the initial simple memcg/ttm
integration.

This varies from the first attempt in two major ways:

1. Instead of using __GFP_ACCOUNT and directly calling the kmem
charging for pool memory, and directly updating the GPU statistic,
Waiman suggested I just do what the network socket code does, which
looks simpler. So this adds two new memcg APIs that wrap the
accounting. The pages no longer get assigned to the memcg; the
charge is owned by the larger BO object, which makes more sense.

2. Christian suggested moving it up a layer to avoid the pool code.
This was a bit tricky, since I want the gfp flags, but I think it
only needs some of them and it should work. One other big difference
is that I aligned it with the dmem interaction, where it tries to get
space in the memcg before it has even allocated any pages (a sketch
of the resulting ordering follows below). I'm not 100% sure this is
how things should be done, but it was easier, so please let me know
if it is wrong.
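
For point 2, the resulting ordering is roughly (abridged from the
ttm_resource_alloc() change in patch 3; nr_pages stands in for
bo->base.size >> PAGE_SHIFT):

	/* charge the memcg before any backing pages exist */
	if (!mem_cgroup_charge_gpu(memcg, nr_pages, gfp_flags))
		return -ENOMEM;
	ret = man->func->alloc(man, bo, place, res_ptr);
	if (ret) {
		/* allocation failed, hand the charge back */
		mem_cgroup_uncharge_gpu(memcg, nr_pages);
		return ret;
	}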

This still doesn't do anything with evictions except ignore them,
and I have some follow-ups on the old thread to discuss them further.

Dave.



* [PATCH 1/5] memcg: add GPU statistic
  2025-05-02  3:35 [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2) Dave Airlie
@ 2025-05-02  3:36 ` Dave Airlie
  2025-05-02  3:36 ` [PATCH 2/5] memcg: add hooks for gpu memcg charging/uncharging Dave Airlie
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Dave Airlie @ 2025-05-02  3:36 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

Discrete and integrated GPUs can use system RAM instead of
VRAM for all or some allocations. These allocations happen
via the drm/ttm subsystem and are currently not accounted for
in cgroups.

Add a gpu statistic to provide a place to visualise these
allocations once they are accounted.
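
Once drivers start charging, the counter shows up as a new line in
memory.stat, e.g. (cgroup path and value are illustrative only):

	$ grep gpu /sys/fs/cgroup/user.slice/memory.stat
	gpu 8388608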

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 3 +++
 include/linux/memcontrol.h              | 1 +
 mm/memcontrol.c                         | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 1a16ce68a4d7..e10a1dfa6051 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1480,6 +1480,9 @@ The following nested keys are defined.
 	  vmalloc (npn)
 		Amount of memory used for vmap backed memory.
 
+	  gpu (npn)
+		Amount of memory used for GPU device system RAM.
+
 	  shmem
 		Amount of cached filesystem data that is swap-backed,
 		such as tmpfs, shm segments, shared anonymous mmap()s
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 53364526d877..4058d4bd94ed 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -36,6 +36,7 @@ enum memcg_stat_item {
 	MEMCG_SOCK,
 	MEMCG_PERCPU_B,
 	MEMCG_VMALLOC,
+	MEMCG_GPU,
 	MEMCG_KMEM,
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c96c1f2b9cf5..25471a0fd0be 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -326,6 +326,7 @@ static const unsigned int memcg_stat_items[] = {
 	MEMCG_SOCK,
 	MEMCG_PERCPU_B,
 	MEMCG_VMALLOC,
+	MEMCG_GPU,
 	MEMCG_KMEM,
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
@@ -1358,6 +1359,7 @@ static const struct memory_stat memory_stats[] = {
 	{ "percpu",			MEMCG_PERCPU_B			},
 	{ "sock",			MEMCG_SOCK			},
 	{ "vmalloc",			MEMCG_VMALLOC			},
+	{ "gpu",			MEMCG_GPU			},
 	{ "shmem",			NR_SHMEM			},
 #ifdef CONFIG_ZSWAP
 	{ "zswap",			MEMCG_ZSWAP_B			},
-- 
2.49.0



* [PATCH 2/5] memcg: add hooks for gpu memcg charging/uncharging.
  2025-05-02  3:35 [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2) Dave Airlie
  2025-05-02  3:36 ` [PATCH 1/5] memcg: add GPU statistic Dave Airlie
@ 2025-05-02  3:36 ` Dave Airlie
  2025-05-02  3:36 ` [PATCH 3/5] ttm: add initial memcg integration. (v2) Dave Airlie
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Dave Airlie @ 2025-05-02  3:36 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

Following the socket hooks, this just adds two APIs to charge GPU
pages to a memcg and to uncharge them.

Suggested by Waiman.
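
A minimal sketch of a caller (hypothetical driver code, the real
call sites are in the ttm patch; nr_pages is a placeholder):

	if (!mem_cgroup_charge_gpu(memcg, nr_pages, GFP_USER))
		return -ENOMEM;	/* over the configured limit */
	/* ... allocate the backing memory ... */
	mem_cgroup_uncharge_gpu(memcg, nr_pages);	/* on free */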

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 include/linux/memcontrol.h |  5 +++++
 mm/memcontrol.c            | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4058d4bd94ed..f74b7db89f00 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1618,6 +1618,11 @@ struct sock;
 bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
 			     gfp_t gfp_mask);
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
+
+bool mem_cgroup_charge_gpu(struct mem_cgroup *memcg, unsigned int nr_pages,
+			   gfp_t gfp_mask);
+void mem_cgroup_uncharge_gpu(struct mem_cgroup *memcg, unsigned int nr_pages);
+
 #ifdef CONFIG_MEMCG
 extern struct static_key_false memcg_sockets_enabled_key;
 #define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 25471a0fd0be..76a0ec34b7dc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4958,6 +4958,40 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 	refill_stock(memcg, nr_pages);
 }
 
+/**
+ * mem_cgroup_charge_gpu - charge GPU memory
+ * @memcg: memcg to charge
+ * @nr_pages: number of pages to charge
+ * @gfp_mask: reclaim mode
+ *
+ * Charges @nr_pages to @memcg. Returns %true if the charge fit within
+ * @memcg's configured limit, %false if it doesn't.
+ */
+bool mem_cgroup_charge_gpu(struct mem_cgroup *memcg, unsigned int nr_pages,
+			   gfp_t gfp_mask)
+{
+	if (try_charge_memcg(memcg, gfp_mask, nr_pages) == 0) {
+		mod_memcg_state(memcg, MEMCG_GPU, nr_pages);
+		return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(mem_cgroup_charge_gpu);
+
+/**
+ * mem_cgroup_uncharge_gpu - uncharge GPU memory
+ * @memcg: memcg to uncharge
+ * @nr_pages: number of pages to uncharge
+ */
+void mem_cgroup_uncharge_gpu(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	mod_memcg_state(memcg, MEMCG_GPU, -nr_pages);
+
+	refill_stock(memcg, nr_pages);
+}
+EXPORT_SYMBOL_GPL(mem_cgroup_uncharge_gpu);
+
 static int __init cgroup_memory(char *s)
 {
 	char *token;
-- 
2.49.0



* [PATCH 3/5] ttm: add initial memcg integration. (v2)
  2025-05-02  3:35 [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2) Dave Airlie
  2025-05-02  3:36 ` [PATCH 1/5] memcg: add GPU statistic Dave Airlie
  2025-05-02  3:36 ` [PATCH 2/5] memcg: add hooks for gpu memcg charging/uncharging Dave Airlie
@ 2025-05-02  3:36 ` Dave Airlie
  2025-05-02 12:01   ` Christian König
                     ` (2 more replies)
  2025-05-02  3:36 ` [PATCH 4/5] amdgpu: add support for memcg integration Dave Airlie
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 42+ messages in thread
From: Dave Airlie @ 2025-05-02  3:36 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

Doing proper integration of TTM system memory allocations with
memcg is a difficult ask, primarily due to the difficulty of
accounting for evictions properly.

However, there are systems where userspace will be allocating
objects in system memory that aren't prone to migration or
eviction, and we should start by at least accounting for those.

This adds a memcg pointer to the ttm bo and resource objects.

The memcg is charged when:
a) the resource is allocated in system/TT memory, and
b) account_op is set in the operation context

This patch clears the flag around object evictions, but any
operation that could populate a TTM tt object in process context
should set the account_op flag.

This v2 moves the charging up a level. It also no longer uses
__GFP_ACCOUNT or attaches the memcg to object pages; instead it
follows the same approach as socket memory and just charges/uncharges
at the object level, as suggested by Christian.
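
Drivers opt in per operation, e.g. (this is the pattern the amdgpu
and nouveau patches later in the series use):

	struct ttm_operation_ctx ctx = {
		.interruptible = true,
		.account_op = true,	/* charge system memory to bo->memcg */
	};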

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c       |  9 +++++++--
 drivers/gpu/drm/ttm/ttm_bo_vm.c    |  3 ++-
 drivers/gpu/drm/ttm/ttm_resource.c | 20 ++++++++++++++++++++
 include/drm/ttm/ttm_bo.h           |  8 ++++++++
 include/drm/ttm/ttm_resource.h     |  6 +++++-
 5 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 95b86003c50d..89d2df246ed2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -537,6 +537,7 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
 	evict_walk->evicted++;
 	if (evict_walk->res)
 		lret = ttm_resource_alloc(evict_walk->evictor, evict_walk->place,
+					  walk->ctx,
 					  evict_walk->res, NULL);
 	if (lret == 0)
 		return 1;
@@ -733,7 +734,7 @@ static int ttm_bo_alloc_resource(struct ttm_buffer_object *bo,
 			continue;
 
 		may_evict = (force_space && place->mem_type != TTM_PL_SYSTEM);
-		ret = ttm_resource_alloc(bo, place, res, force_space ? &limit_pool : NULL);
+		ret = ttm_resource_alloc(bo, place, ctx, res, force_space ? &limit_pool : NULL);
 		if (ret) {
 			if (ret != -ENOSPC && ret != -EAGAIN) {
 				dmem_cgroup_pool_state_put(limit_pool);
@@ -744,8 +745,12 @@ static int ttm_bo_alloc_resource(struct ttm_buffer_object *bo,
 				continue;
 			}
 
+			/* we don't want to account evictions at this point */
+			bool old_ctx_account = ctx->account_op;
+			ctx->account_op = false;
 			ret = ttm_bo_evict_alloc(bdev, man, place, bo, ctx,
 						 ticket, res, limit_pool);
+			ctx->account_op = old_ctx_account;
 			dmem_cgroup_pool_state_put(limit_pool);
 			if (ret == -EBUSY)
 				continue;
@@ -1145,7 +1150,7 @@ ttm_bo_swapout_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo)
 
 		memset(&hop, 0, sizeof(hop));
 		place.mem_type = TTM_PL_SYSTEM;
-		ret = ttm_resource_alloc(bo, &place, &evict_mem, NULL);
+		ret = ttm_resource_alloc(bo, &place, ctx, &evict_mem, NULL);
 		if (ret)
 			goto out;
 
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index a194db83421d..163039cf40a5 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -220,7 +220,8 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
 		struct ttm_operation_ctx ctx = {
 			.interruptible = true,
 			.no_wait_gpu = false,
-			.force_alloc = true
+			.force_alloc = true,
+			.account_op = true,
 		};
 
 		ttm = bo->ttm;
diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
index 7e5a60c55813..da257678a5ba 100644
--- a/drivers/gpu/drm/ttm/ttm_resource.c
+++ b/drivers/gpu/drm/ttm/ttm_resource.c
@@ -27,6 +27,7 @@
 #include <linux/iosys-map.h>
 #include <linux/scatterlist.h>
 #include <linux/cgroup_dmem.h>
+#include <linux/memcontrol.h>
 
 #include <drm/ttm/ttm_bo.h>
 #include <drm/ttm/ttm_placement.h>
@@ -373,12 +374,14 @@ EXPORT_SYMBOL(ttm_resource_fini);
 
 int ttm_resource_alloc(struct ttm_buffer_object *bo,
 		       const struct ttm_place *place,
+		       struct ttm_operation_ctx *ctx,
 		       struct ttm_resource **res_ptr,
 		       struct dmem_cgroup_pool_state **ret_limit_pool)
 {
 	struct ttm_resource_manager *man =
 		ttm_manager_type(bo->bdev, place->mem_type);
 	struct dmem_cgroup_pool_state *pool = NULL;
+	struct mem_cgroup *memcg = NULL;
 	int ret;
 
 	if (man->cg) {
@@ -387,13 +390,26 @@ int ttm_resource_alloc(struct ttm_buffer_object *bo,
 			return ret;
 	}
 
+	if ((place->mem_type == TTM_PL_SYSTEM || place->mem_type == TTM_PL_TT) &&
+	    ctx->account_op && bo->memcg) {
+		memcg = bo->memcg;
+		gfp_t gfp_flags = GFP_USER;
+		if (ctx->gfp_retry_mayfail)
+			gfp_flags |= __GFP_RETRY_MAYFAIL;
+
+		if (!mem_cgroup_charge_gpu(memcg, bo->base.size >> PAGE_SHIFT, gfp_flags))
+			return -ENOMEM;
+	}
 	ret = man->func->alloc(man, bo, place, res_ptr);
 	if (ret) {
 		if (pool)
 			dmem_cgroup_uncharge(pool, bo->base.size);
+		if (memcg)
+			mem_cgroup_uncharge_gpu(memcg, bo->base.size >> PAGE_SHIFT);
 		return ret;
 	}
 
+	(*res_ptr)->memcg = memcg;
 	(*res_ptr)->css = pool;
 
 	spin_lock(&bo->bdev->lru_lock);
@@ -407,6 +423,7 @@ void ttm_resource_free(struct ttm_buffer_object *bo, struct ttm_resource **res)
 {
 	struct ttm_resource_manager *man;
 	struct dmem_cgroup_pool_state *pool;
+	struct mem_cgroup *memcg;
 
 	if (!*res)
 		return;
@@ -416,11 +433,14 @@ void ttm_resource_free(struct ttm_buffer_object *bo, struct ttm_resource **res)
 	spin_unlock(&bo->bdev->lru_lock);
 
 	pool = (*res)->css;
+	memcg = (*res)->memcg;
 	man = ttm_manager_type(bo->bdev, (*res)->mem_type);
 	man->func->free(man, *res);
 	*res = NULL;
 	if (man->cg)
 		dmem_cgroup_uncharge(pool, bo->base.size);
+	if (memcg)
+		mem_cgroup_uncharge_gpu(memcg, bo->base.size >> PAGE_SHIFT);
 }
 EXPORT_SYMBOL(ttm_resource_free);
 
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index 903cd1030110..56a33b5f5c41 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -135,6 +135,12 @@ struct ttm_buffer_object {
 	 * reservation lock.
 	 */
 	struct sg_table *sg;
+
+	/**
+	 * @memcg: memory cgroup to charge this to if it ends up using system memory.
+	 * NULL means don't charge.
+	 */
+	struct mem_cgroup *memcg;
 };
 
 #define TTM_BO_MAP_IOMEM_MASK 0x80
@@ -174,6 +180,7 @@ struct ttm_bo_kmap_obj {
  * BOs share the same reservation object.
  * @force_alloc: Don't check the memory account during suspend or CPU page
  * faults. Should only be used by TTM internally.
 + * @account_op: account for any memory allocations by a bo with a memcg.
  * @resv: Reservation object to allow reserved evictions with.
  * @bytes_moved: Statistics on how many bytes have been moved.
  *
@@ -186,6 +193,7 @@ struct ttm_operation_ctx {
 	bool gfp_retry_mayfail;
 	bool allow_res_evict;
 	bool force_alloc;
+	bool account_op;
 	struct dma_resv *resv;
 	uint64_t bytes_moved;
 };
diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
index e52bba15012f..1ab515c6ec00 100644
--- a/include/drm/ttm/ttm_resource.h
+++ b/include/drm/ttm/ttm_resource.h
@@ -45,6 +45,7 @@ struct ttm_resource;
 struct ttm_place;
 struct ttm_buffer_object;
 struct ttm_placement;
+struct ttm_operation_ctx;
 struct iosys_map;
 struct io_mapping;
 struct sg_table;
@@ -245,7 +246,8 @@ struct ttm_bus_placement {
  * @placement: Placement flags.
  * @bus: Placement on io bus accessible to the CPU
  * @bo: weak reference to the BO, protected by ttm_device::lru_lock
- * @css: cgroup state this resource is charged to
+ * @css: cgroup state this resource is charged to for dmem
+ * @memcg: memory cgroup this resource is charged to for sysmem
  *
  * Structure indicating the placement and space resources used by a
  * buffer object.
@@ -264,6 +266,7 @@ struct ttm_resource {
 	 * @lru: Least recently used list, see &ttm_resource_manager.lru
 	 */
 	struct ttm_lru_item lru;
+	struct mem_cgroup *memcg;
 };
 
 /**
@@ -444,6 +447,7 @@ void ttm_resource_fini(struct ttm_resource_manager *man,
 
 int ttm_resource_alloc(struct ttm_buffer_object *bo,
 		       const struct ttm_place *place,
+		       struct ttm_operation_ctx *ctx,
 		       struct ttm_resource **res,
 		       struct dmem_cgroup_pool_state **ret_limit_pool);
 void ttm_resource_free(struct ttm_buffer_object *bo, struct ttm_resource **res);
-- 
2.49.0



* [PATCH 4/5] amdgpu: add support for memcg integration
  2025-05-02  3:35 [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2) Dave Airlie
                   ` (2 preceding siblings ...)
  2025-05-02  3:36 ` [PATCH 3/5] ttm: add initial memcg integration. (v2) Dave Airlie
@ 2025-05-02  3:36 ` Dave Airlie
  2025-05-02 14:01   ` Waiman Long
  2025-05-02  3:36 ` [PATCH 5/5] nouveau: add " Dave Airlie
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Dave Airlie @ 2025-05-02  3:36 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

This adds the memcg object for any user-allocated object,
and sets account_op on the paths which might populate
a tt object.
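
The memcg reference lifetime pairs GEM object create and free
(abridged from the hunks below):

	bp.memcg = get_mem_cgroup_from_mm(current->mm);	/* at create */
	...
	mem_cgroup_put(aobj->tbo.memcg);	/* at free */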

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     |  7 ++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 14 +++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
 4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 82df06a72ee0..1a275224b4a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -787,6 +787,7 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
 	struct ttm_operation_ctx ctx = {
 		.interruptible = true,
 		.no_wait_gpu = false,
+		.account_op = true,
 		.resv = bo->tbo.base.resv
 	};
 	uint32_t domain;
@@ -839,7 +840,11 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 				union drm_amdgpu_cs *cs)
 {
 	struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
-	struct ttm_operation_ctx ctx = { true, false };
+	struct ttm_operation_ctx ctx = {
+		.interruptible = true,
+		.no_wait_gpu = false,
+		.account_op = true,
+	};
 	struct amdgpu_vm *vm = &fpriv->vm;
 	struct amdgpu_bo_list_entry *e;
 	struct drm_gem_object *obj;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 69429df09477..bdad9a862ed3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -89,6 +89,7 @@ static void amdgpu_gem_object_free(struct drm_gem_object *gobj)
 	struct amdgpu_bo *aobj = gem_to_amdgpu_bo(gobj);
 
 	amdgpu_hmm_unregister(aobj);
+	mem_cgroup_put(aobj->tbo.memcg);
 	ttm_bo_put(&aobj->tbo);
 }
 
@@ -116,6 +117,7 @@ int amdgpu_gem_object_create(struct amdgpu_device *adev, unsigned long size,
 	bp.domain = initial_domain;
 	bp.bo_ptr_size = sizeof(struct amdgpu_bo);
 	bp.xcp_id_plus1 = xcp_id_plus1;
+	bp.memcg = get_mem_cgroup_from_mm(current->mm);
 
 	r = amdgpu_bo_create_user(adev, &bp, &ubo);
 	if (r)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 0b9987781f76..777cf05ebac8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -632,6 +632,7 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
 	struct ttm_operation_ctx ctx = {
 		.interruptible = (bp->type != ttm_bo_type_kernel),
 		.no_wait_gpu = bp->no_wait_gpu,
+		.account_op = true,
 		/* We opt to avoid OOM on system pages allocations */
 		.gfp_retry_mayfail = true,
 		.allow_res_evict = bp->type != ttm_bo_type_kernel,
@@ -657,16 +658,21 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
 		size = ALIGN(size, PAGE_SIZE);
 	}
 
-	if (!amdgpu_bo_validate_size(adev, size, bp->domain))
+	if (!amdgpu_bo_validate_size(adev, size, bp->domain)) {
+		mem_cgroup_put(bp->memcg);
 		return -ENOMEM;
+	}
 
 	BUG_ON(bp->bo_ptr_size < sizeof(struct amdgpu_bo));
 
 	*bo_ptr = NULL;
 	bo = kvzalloc(bp->bo_ptr_size, GFP_KERNEL);
-	if (bo == NULL)
+	if (bo == NULL) {
+		mem_cgroup_put(bp->memcg);
 		return -ENOMEM;
+	}
 	drm_gem_private_object_init(adev_to_drm(adev), &bo->tbo.base, size);
+	bo->tbo.memcg = bp->memcg;
 	bo->tbo.base.funcs = &amdgpu_gem_object_funcs;
 	bo->vm_bo = NULL;
 	bo->preferred_domains = bp->preferred_domain ? bp->preferred_domain :
@@ -1341,7 +1347,9 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
 vm_fault_t amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
 {
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->bdev);
-	struct ttm_operation_ctx ctx = { false, false };
+	struct ttm_operation_ctx ctx = { .interruptible = false,
+					 .no_wait_gpu = false,
+					 .account_op = true };
 	struct amdgpu_bo *abo = ttm_to_amdgpu_bo(bo);
 	int r;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 375448627f7b..9a4c506cfb76 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -55,6 +55,7 @@ struct amdgpu_bo_param {
 	enum ttm_bo_type		type;
 	bool				no_wait_gpu;
 	struct dma_resv			*resv;
+	struct mem_cgroup               *memcg;
 	void				(*destroy)(struct ttm_buffer_object *bo);
 	/* xcp partition number plus 1, 0 means any partition */
 	int8_t				xcp_id_plus1;
-- 
2.49.0



* [PATCH 5/5] nouveau: add memcg integration
  2025-05-02  3:35 [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2) Dave Airlie
                   ` (3 preceding siblings ...)
  2025-05-02  3:36 ` [PATCH 4/5] amdgpu: add support for memcg integration Dave Airlie
@ 2025-05-02  3:36 ` Dave Airlie
  2025-05-06  0:37 ` [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2) Shakeel Butt
  2025-05-07 17:52 ` Johannes Weiner
  6 siblings, 0 replies; 42+ messages in thread
From: Dave Airlie @ 2025-05-02  3:36 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

This just adds the memcg init and account_op support.

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/nouveau/nouveau_bo.c  | 1 +
 drivers/gpu/drm/nouveau/nouveau_gem.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 2016c1e7242f..8e2da4d48ce3 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -350,6 +350,7 @@ nouveau_bo_init(struct nouveau_bo *nvbo, u64 size, int align, u32 domain,
 	struct ttm_operation_ctx ctx = {
 		.interruptible = false,
 		.no_wait_gpu = false,
+		.account_op = true,
 		.resv = robj,
 	};
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index 67e3c99de73a..56899c89bdd8 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -87,6 +87,7 @@ nouveau_gem_object_del(struct drm_gem_object *gem)
 		return;
 	}
 
+	mem_cgroup_put(nvbo->bo.memcg);
 	ttm_bo_put(&nvbo->bo);
 
 	pm_runtime_mark_last_busy(dev);
@@ -254,6 +255,7 @@ nouveau_gem_new(struct nouveau_cli *cli, u64 size, int align, uint32_t domain,
 	if (IS_ERR(nvbo))
 		return PTR_ERR(nvbo);
 
+	nvbo->bo.memcg = get_mem_cgroup_from_mm(current->mm);
 	nvbo->bo.base.funcs = &nouveau_gem_object_funcs;
 	nvbo->no_share = domain & NOUVEAU_GEM_DOMAIN_NO_SHARE;
 
-- 
2.49.0



* Re: [PATCH 3/5] ttm: add initial memcg integration. (v2)
  2025-05-02  3:36 ` [PATCH 3/5] ttm: add initial memcg integration. (v2) Dave Airlie
@ 2025-05-02 12:01   ` Christian König
  2025-05-02 14:24   ` kernel test robot
  2025-05-03  2:09   ` kernel test robot
  2 siblings, 0 replies; 42+ messages in thread
From: Christian König @ 2025-05-02 12:01 UTC (permalink / raw)
  To: Dave Airlie, dri-devel, tj, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Waiman Long, simona

On 5/2/25 05:36, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
> 
> Doing proper integration of TTM system memory allocations with
> memcg is a difficult ask, primarily due to the difficulty of
> accounting for evictions properly.
> 
> However, there are systems where userspace will be allocating
> objects in system memory that aren't prone to migration or
> eviction, and we should start by at least accounting for those.
> 
> This adds a memcg pointer to the ttm bo and resource objects.
> 
> The memcg is charged when:
> a) the resource is allocated in system/TT memory, and
> b) account_op is set in the operation context
> 
> This patch clears the flag around object evictions, but any
> operation that could populate a TTM tt object in process context
> should set the account_op flag.
> 
> This v2 moves the charging up a level. It also no longer uses
> __GFP_ACCOUNT or attaches the memcg to object pages; instead it
> follows the same approach as socket memory and just charges/uncharges
> at the object level, as suggested by Christian.
> 
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> ---
>  drivers/gpu/drm/ttm/ttm_bo.c       |  9 +++++++--
>  drivers/gpu/drm/ttm/ttm_bo_vm.c    |  3 ++-
>  drivers/gpu/drm/ttm/ttm_resource.c | 20 ++++++++++++++++++++
>  include/drm/ttm/ttm_bo.h           |  8 ++++++++
>  include/drm/ttm/ttm_resource.h     |  6 +++++-
>  5 files changed, 42 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 95b86003c50d..89d2df246ed2 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -537,6 +537,7 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
>  	evict_walk->evicted++;
>  	if (evict_walk->res)
>  		lret = ttm_resource_alloc(evict_walk->evictor, evict_walk->place,
> +					  walk->ctx,
>  					  evict_walk->res, NULL);
>  	if (lret == 0)
>  		return 1;
> @@ -733,7 +734,7 @@ static int ttm_bo_alloc_resource(struct ttm_buffer_object *bo,
>  			continue;
>  
>  		may_evict = (force_space && place->mem_type != TTM_PL_SYSTEM);
> -		ret = ttm_resource_alloc(bo, place, res, force_space ? &limit_pool : NULL);
> +		ret = ttm_resource_alloc(bo, place, ctx, res, force_space ? &limit_pool : NULL);
>  		if (ret) {
>  			if (ret != -ENOSPC && ret != -EAGAIN) {
>  				dmem_cgroup_pool_state_put(limit_pool);
> @@ -744,8 +745,12 @@ static int ttm_bo_alloc_resource(struct ttm_buffer_object *bo,
>  				continue;
>  			}
>  
> +			/* we don't want to account evictions at this point */
> +			bool old_ctx_account = ctx->account_op;
> +			ctx->account_op = false;
>  			ret = ttm_bo_evict_alloc(bdev, man, place, bo, ctx,
>  						 ticket, res, limit_pool);
> +			ctx->account_op = old_ctx_account;
>  			dmem_cgroup_pool_state_put(limit_pool);
>  			if (ret == -EBUSY)
>  				continue;
> @@ -1145,7 +1150,7 @@ ttm_bo_swapout_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo)
>  
>  		memset(&hop, 0, sizeof(hop));
>  		place.mem_type = TTM_PL_SYSTEM;
> -		ret = ttm_resource_alloc(bo, &place, &evict_mem, NULL);
> +		ret = ttm_resource_alloc(bo, &place, ctx, &evict_mem, NULL);
>  		if (ret)
>  			goto out;
>  
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index a194db83421d..163039cf40a5 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -220,7 +220,8 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
>  		struct ttm_operation_ctx ctx = {
>  			.interruptible = true,
>  			.no_wait_gpu = false,
> -			.force_alloc = true
> +			.force_alloc = true,
> +			.account_op = true,
>  		};
>  
>  		ttm = bo->ttm;
> diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
> index 7e5a60c55813..da257678a5ba 100644
> --- a/drivers/gpu/drm/ttm/ttm_resource.c
> +++ b/drivers/gpu/drm/ttm/ttm_resource.c
> @@ -27,6 +27,7 @@
>  #include <linux/iosys-map.h>
>  #include <linux/scatterlist.h>
>  #include <linux/cgroup_dmem.h>
> +#include <linux/memcontrol.h>
>  
>  #include <drm/ttm/ttm_bo.h>
>  #include <drm/ttm/ttm_placement.h>
> @@ -373,12 +374,14 @@ EXPORT_SYMBOL(ttm_resource_fini);
>  
>  int ttm_resource_alloc(struct ttm_buffer_object *bo,
>  		       const struct ttm_place *place,
> +		       struct ttm_operation_ctx *ctx,
>  		       struct ttm_resource **res_ptr,
>  		       struct dmem_cgroup_pool_state **ret_limit_pool)
>  {
>  	struct ttm_resource_manager *man =
>  		ttm_manager_type(bo->bdev, place->mem_type);
>  	struct dmem_cgroup_pool_state *pool = NULL;
> +	struct mem_cgroup *memcg = NULL;
>  	int ret;
>  
>  	if (man->cg) {
> @@ -387,13 +390,26 @@ int ttm_resource_alloc(struct ttm_buffer_object *bo,
>  			return ret;
>  	}
>  
> +	if ((place->mem_type == TTM_PL_SYSTEM || place->mem_type == TTM_PL_TT) &&
> +	    ctx->account_op && bo->memcg) {


I suggest making that a placement flag instead of putting it into the ctx.

E.g. something like "if (place->flags & TTM_PL_FLAG_MEMCG)" here, and then just set the flag in amdgpu_bo_placement_from_domain() and clear it in amdgpu_evict_flags().

This would not only simplify the handling here but also gives a potential solution to the eviction handling.

In other words, allocations don't get charged to the memcg on eviction, but when the next CS says during validation that the BO should stay in GTT, we then charge the memory to the application.

The background is that the right application then gets the potential ENOMEM from its CS IOCTL.
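
I.e. roughly (TTM_PL_FLAG_MEMCG doesn't exist yet, just sketching the shape):

	/* ttm_resource_alloc(): charge only when the placement asks for it */
	if ((place->flags & TTM_PL_FLAG_MEMCG) && bo->memcg) {
		if (!mem_cgroup_charge_gpu(bo->memcg,
					   bo->base.size >> PAGE_SHIFT, gfp_flags))
			return -ENOMEM;
	}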

Apart from that the solution looks totally sane to me as well.

> +		memcg = bo->memcg;
> +		gfp_t gfp_flags = GFP_USER;
> +		if (ctx->gfp_retry_mayfail)
> +			gfp_flags |= __GFP_RETRY_MAYFAIL;

BTW: gfp_retry_mayfail is kind of deprecated. Sima is strictly against that.

I was about to remove it but then XE picked it up as well.

I never kicked off the discussion on what to do with that.

Regards,
Christian.


> +
> +		if (!mem_cgroup_charge_gpu(memcg, bo->base.size >> PAGE_SHIFT, gfp_flags))
> +			return -ENOMEM;
> +	}
>  	ret = man->func->alloc(man, bo, place, res_ptr);
>  	if (ret) {
>  		if (pool)
>  			dmem_cgroup_uncharge(pool, bo->base.size);
> +		if (memcg)
> +			mem_cgroup_uncharge_gpu(memcg, bo->base.size >> PAGE_SHIFT);
>  		return ret;
>  	}
>  
> +	(*res_ptr)->memcg = memcg;
>  	(*res_ptr)->css = pool;
>  
>  	spin_lock(&bo->bdev->lru_lock);
> @@ -407,6 +423,7 @@ void ttm_resource_free(struct ttm_buffer_object *bo, struct ttm_resource **res)
>  {
>  	struct ttm_resource_manager *man;
>  	struct dmem_cgroup_pool_state *pool;
> +	struct mem_cgroup *memcg;
>  
>  	if (!*res)
>  		return;
> @@ -416,11 +433,14 @@ void ttm_resource_free(struct ttm_buffer_object *bo, struct ttm_resource **res)
>  	spin_unlock(&bo->bdev->lru_lock);
>  
>  	pool = (*res)->css;
> +	memcg = (*res)->memcg;
>  	man = ttm_manager_type(bo->bdev, (*res)->mem_type);
>  	man->func->free(man, *res);
>  	*res = NULL;
>  	if (man->cg)
>  		dmem_cgroup_uncharge(pool, bo->base.size);
> +	if (memcg)
> +		mem_cgroup_uncharge_gpu(memcg, bo->base.size >> PAGE_SHIFT);
>  }
>  EXPORT_SYMBOL(ttm_resource_free);
>  
> diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
> index 903cd1030110..56a33b5f5c41 100644
> --- a/include/drm/ttm/ttm_bo.h
> +++ b/include/drm/ttm/ttm_bo.h
> @@ -135,6 +135,12 @@ struct ttm_buffer_object {
>  	 * reservation lock.
>  	 */
>  	struct sg_table *sg;
> +
> +	/**
> +	 * @memcg: memory cgroup to charge this to if it ends up using system memory.
> +	 * NULL means don't charge.
> +	 */
> +	struct mem_cgroup *memcg;
>  };
>  
>  #define TTM_BO_MAP_IOMEM_MASK 0x80
> @@ -174,6 +180,7 @@ struct ttm_bo_kmap_obj {
>   * BOs share the same reservation object.
>   * @force_alloc: Don't check the memory account during suspend or CPU page
>   * faults. Should only be used by TTM internally.
> + * @account_op: account for any memory allocations by a bo with a memcg.
>   * @resv: Reservation object to allow reserved evictions with.
>   * @bytes_moved: Statistics on how many bytes have been moved.
>   *
> @@ -186,6 +193,7 @@ struct ttm_operation_ctx {
>  	bool gfp_retry_mayfail;
>  	bool allow_res_evict;
>  	bool force_alloc;
> +	bool account_op;
>  	struct dma_resv *resv;
>  	uint64_t bytes_moved;
>  };
> diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
> index e52bba15012f..1ab515c6ec00 100644
> --- a/include/drm/ttm/ttm_resource.h
> +++ b/include/drm/ttm/ttm_resource.h
> @@ -45,6 +45,7 @@ struct ttm_resource;
>  struct ttm_place;
>  struct ttm_buffer_object;
>  struct ttm_placement;
> +struct ttm_operation_ctx;
>  struct iosys_map;
>  struct io_mapping;
>  struct sg_table;
> @@ -245,7 +246,8 @@ struct ttm_bus_placement {
>   * @placement: Placement flags.
>   * @bus: Placement on io bus accessible to the CPU
>   * @bo: weak reference to the BO, protected by ttm_device::lru_lock
> - * @css: cgroup state this resource is charged to
> + * @css: cgroup state this resource is charged to for dmem
> + * @memcg: memory cgroup this resource is charged to for sysmem
>   *
>   * Structure indicating the placement and space resources used by a
>   * buffer object.
> @@ -264,6 +266,7 @@ struct ttm_resource {
>  	 * @lru: Least recently used list, see &ttm_resource_manager.lru
>  	 */
>  	struct ttm_lru_item lru;
> +	struct mem_cgroup *memcg;
>  };
>  
>  /**
> @@ -444,6 +447,7 @@ void ttm_resource_fini(struct ttm_resource_manager *man,
>  
>  int ttm_resource_alloc(struct ttm_buffer_object *bo,
>  		       const struct ttm_place *place,
> +		       struct ttm_operation_ctx *ctx,
>  		       struct ttm_resource **res,
>  		       struct dmem_cgroup_pool_state **ret_limit_pool);
>  void ttm_resource_free(struct ttm_buffer_object *bo, struct ttm_resource **res);



* Re: [PATCH 4/5] amdgpu: add support for memcg integration
  2025-05-02  3:36 ` [PATCH 4/5] amdgpu: add support for memcg integration Dave Airlie
@ 2025-05-02 14:01   ` Waiman Long
  0 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2025-05-02 14:01 UTC (permalink / raw)
  To: Dave Airlie, dri-devel, tj, christian.koenig, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, simona

On 5/1/25 11:36 PM, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> This adds the memcg object for any user-allocated object,
> and sets account_op on the paths which might populate
> a tt object.
>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     |  7 ++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    |  2 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 14 +++++++++++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
>   4 files changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 82df06a72ee0..1a275224b4a6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -787,6 +787,7 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
>   	struct ttm_operation_ctx ctx = {
>   		.interruptible = true,
>   		.no_wait_gpu = false,
> +		.account_op = true,
>   		.resv = bo->tbo.base.resv
>   	};
>   	uint32_t domain;
> @@ -839,7 +840,11 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
>   				union drm_amdgpu_cs *cs)
>   {
>   	struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
> -	struct ttm_operation_ctx ctx = { true, false };
> +	struct ttm_operation_ctx ctx = {
> +		.interruptible = true,
> +		.no_wait_gpu = false,
> +		.account_op = true,
> +	};
>   	struct amdgpu_vm *vm = &fpriv->vm;
>   	struct amdgpu_bo_list_entry *e;
>   	struct drm_gem_object *obj;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> index 69429df09477..bdad9a862ed3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> @@ -89,6 +89,7 @@ static void amdgpu_gem_object_free(struct drm_gem_object *gobj)
>   	struct amdgpu_bo *aobj = gem_to_amdgpu_bo(gobj);
>   
>   	amdgpu_hmm_unregister(aobj);
> +	mem_cgroup_put(aobj->tbo.memcg);
>   	ttm_bo_put(&aobj->tbo);
>   }
>   
> @@ -116,6 +117,7 @@ int amdgpu_gem_object_create(struct amdgpu_device *adev, unsigned long size,
>   	bp.domain = initial_domain;
>   	bp.bo_ptr_size = sizeof(struct amdgpu_bo);
>   	bp.xcp_id_plus1 = xcp_id_plus1;
> +	bp.memcg = get_mem_cgroup_from_mm(current->mm);
>   
>   	r = amdgpu_bo_create_user(adev, &bp, &ubo);
>   	if (r)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 0b9987781f76..777cf05ebac8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -632,6 +632,7 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
>   	struct ttm_operation_ctx ctx = {
>   		.interruptible = (bp->type != ttm_bo_type_kernel),
>   		.no_wait_gpu = bp->no_wait_gpu,
> +		.account_op = true,
>   		/* We opt to avoid OOM on system pages allocations */
>   		.gfp_retry_mayfail = true,
>   		.allow_res_evict = bp->type != ttm_bo_type_kernel,
> @@ -657,16 +658,21 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
>   		size = ALIGN(size, PAGE_SIZE);
>   	}
>   
> -	if (!amdgpu_bo_validate_size(adev, size, bp->domain))
> +	if (!amdgpu_bo_validate_size(adev, size, bp->domain)) {
> +		mem_cgroup_put(bp->memcg);
You should clear bp->memcg after mem_cgroup_put() to avoid a stale
reference, as the memcg can go away after that.
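I.e. something like:

	mem_cgroup_put(bp->memcg);
	bp->memcg = NULL;	/* don't leave a stale pointer behind */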
>   		return -ENOMEM;
> +	}
>   
>   	BUG_ON(bp->bo_ptr_size < sizeof(struct amdgpu_bo));
>   
>   	*bo_ptr = NULL;
>   	bo = kvzalloc(bp->bo_ptr_size, GFP_KERNEL);
> -	if (bo == NULL)
> +	if (bo == NULL) {
> +		mem_cgroup_put(bp->memcg);

Ditto.

Cheers,
Longman

>   		return -ENOMEM;
> +	}
>   	drm_gem_private_object_init(adev_to_drm(adev), &bo->tbo.base, size);
> +	bo->tbo.memcg = bp->memcg;
>   	bo->tbo.base.funcs = &amdgpu_gem_object_funcs;
>   	bo->vm_bo = NULL;
>   	bo->preferred_domains = bp->preferred_domain ? bp->preferred_domain :
> @@ -1341,7 +1347,9 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
>   vm_fault_t amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
>   {
>   	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->bdev);
> -	struct ttm_operation_ctx ctx = { false, false };
> +	struct ttm_operation_ctx ctx = { .interruptible = false,
> +					 .no_wait_gpu = false,
> +					 .account_op = true };
>   	struct amdgpu_bo *abo = ttm_to_amdgpu_bo(bo);
>   	int r;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> index 375448627f7b..9a4c506cfb76 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> @@ -55,6 +55,7 @@ struct amdgpu_bo_param {
>   	enum ttm_bo_type		type;
>   	bool				no_wait_gpu;
>   	struct dma_resv			*resv;
> +	struct mem_cgroup               *memcg;
>   	void				(*destroy)(struct ttm_buffer_object *bo);
>   	/* xcp partition number plus 1, 0 means any partition */
>   	int8_t				xcp_id_plus1;



* Re: [PATCH 3/5] ttm: add initial memcg integration. (v2)
  2025-05-02  3:36 ` [PATCH 3/5] ttm: add initial memcg integration. (v2) Dave Airlie
  2025-05-02 12:01   ` Christian König
@ 2025-05-02 14:24   ` kernel test robot
  2025-05-03  2:09   ` kernel test robot
  2 siblings, 0 replies; 42+ messages in thread
From: kernel test robot @ 2025-05-02 14:24 UTC (permalink / raw)
  To: Dave Airlie, dri-devel, tj, christian.koenig, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: oe-kbuild-all, cgroups, Waiman Long, simona

Hi Dave,

kernel test robot noticed the following build errors:

[auto build test ERROR on tj-cgroup/for-next]
[also build test ERROR on akpm-mm/mm-everything linus/master v6.15-rc4]
[cannot apply to next-20250502]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Dave-Airlie/memcg-add-hooks-for-gpu-memcg-charging-uncharging/20250502-123650
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20250502034046.1625896-4-airlied%40gmail.com
patch subject: [PATCH 3/5] ttm: add initial memcg integration. (v2)
config: arc-randconfig-002-20250502 (https://download.01.org/0day-ci/archive/20250502/202505022223.9sTn2H1l-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 12.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250502/202505022223.9sTn2H1l-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505022223.9sTn2H1l-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/gpu/drm/ttm/tests/ttm_resource_test.c: In function 'ttm_sys_man_free_basic':
>> drivers/gpu/drm/ttm/tests/ttm_resource_test.c:305:39: error: passing argument 3 of 'ttm_resource_alloc' from incompatible pointer type [-Werror=incompatible-pointer-types]
     305 |         ttm_resource_alloc(bo, place, &res, NULL);
         |                                       ^~~~
         |                                       |
         |                                       struct ttm_resource **
   In file included from drivers/gpu/drm/ttm/tests/ttm_resource_test.c:5:
   include/drm/ttm/ttm_resource.h:450:50: note: expected 'struct ttm_operation_ctx *' but argument is of type 'struct ttm_resource **'
     450 |                        struct ttm_operation_ctx *ctx,
         |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
>> drivers/gpu/drm/ttm/tests/ttm_resource_test.c:305:9: error: too few arguments to function 'ttm_resource_alloc'
     305 |         ttm_resource_alloc(bo, place, &res, NULL);
         |         ^~~~~~~~~~~~~~~~~~
   include/drm/ttm/ttm_resource.h:448:5: note: declared here
     448 | int ttm_resource_alloc(struct ttm_buffer_object *bo,
         |     ^~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors
--
   drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c: In function 'ttm_bo_validate_no_placement_signaled':
>> drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c:545:45: error: passing argument 3 of 'ttm_resource_alloc' from incompatible pointer type [-Werror=incompatible-pointer-types]
     545 |         err = ttm_resource_alloc(bo, place, &bo->resource, NULL);
         |                                             ^~~~~~~~~~~~~
         |                                             |
         |                                             struct ttm_resource **
   In file included from drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c:8:
   include/drm/ttm/ttm_resource.h:450:50: note: expected 'struct ttm_operation_ctx *' but argument is of type 'struct ttm_resource **'
     450 |                        struct ttm_operation_ctx *ctx,
         |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
>> drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c:545:15: error: too few arguments to function 'ttm_resource_alloc'
     545 |         err = ttm_resource_alloc(bo, place, &bo->resource, NULL);
         |               ^~~~~~~~~~~~~~~~~~
   include/drm/ttm/ttm_resource.h:448:5: note: declared here
     448 | int ttm_resource_alloc(struct ttm_buffer_object *bo,
         |     ^~~~~~~~~~~~~~~~~~
   drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c: In function 'ttm_bo_validate_no_placement_not_signaled':
   drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c:606:45: error: passing argument 3 of 'ttm_resource_alloc' from incompatible pointer type [-Werror=incompatible-pointer-types]
     606 |         err = ttm_resource_alloc(bo, place, &bo->resource, NULL);
         |                                             ^~~~~~~~~~~~~
         |                                             |
         |                                             struct ttm_resource **
   include/drm/ttm/ttm_resource.h:450:50: note: expected 'struct ttm_operation_ctx *' but argument is of type 'struct ttm_resource **'
     450 |                        struct ttm_operation_ctx *ctx,
         |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
   drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c:606:15: error: too few arguments to function 'ttm_resource_alloc'
     606 |         err = ttm_resource_alloc(bo, place, &bo->resource, NULL);
         |               ^~~~~~~~~~~~~~~~~~
   include/drm/ttm/ttm_resource.h:448:5: note: declared here
     448 | int ttm_resource_alloc(struct ttm_buffer_object *bo,
         |     ^~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors
--
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c: In function 'ttm_bo_unreserve_basic':
>> drivers/gpu/drm/ttm/tests/ttm_bo_test.c:261:45: error: passing argument 3 of 'ttm_resource_alloc' from incompatible pointer type [-Werror=incompatible-pointer-types]
     261 |         err = ttm_resource_alloc(bo, place, &res1, NULL);
         |                                             ^~~~~
         |                                             |
         |                                             struct ttm_resource **
   In file included from drivers/gpu/drm/ttm/tests/ttm_bo_test.c:13:
   include/drm/ttm/ttm_resource.h:450:50: note: expected 'struct ttm_operation_ctx *' but argument is of type 'struct ttm_resource **'
     450 |                        struct ttm_operation_ctx *ctx,
         |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
>> drivers/gpu/drm/ttm/tests/ttm_bo_test.c:261:15: error: too few arguments to function 'ttm_resource_alloc'
     261 |         err = ttm_resource_alloc(bo, place, &res1, NULL);
         |               ^~~~~~~~~~~~~~~~~~
   include/drm/ttm/ttm_resource.h:448:5: note: declared here
     448 | int ttm_resource_alloc(struct ttm_buffer_object *bo,
         |     ^~~~~~~~~~~~~~~~~~
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c:267:39: error: passing argument 3 of 'ttm_resource_alloc' from incompatible pointer type [-Werror=incompatible-pointer-types]
     267 |         ttm_resource_alloc(bo, place, &res2, NULL);
         |                                       ^~~~~
         |                                       |
         |                                       struct ttm_resource **
   include/drm/ttm/ttm_resource.h:450:50: note: expected 'struct ttm_operation_ctx *' but argument is of type 'struct ttm_resource **'
     450 |                        struct ttm_operation_ctx *ctx,
         |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c:267:9: error: too few arguments to function 'ttm_resource_alloc'
     267 |         ttm_resource_alloc(bo, place, &res2, NULL);
         |         ^~~~~~~~~~~~~~~~~~
   include/drm/ttm/ttm_resource.h:448:5: note: declared here
     448 | int ttm_resource_alloc(struct ttm_buffer_object *bo,
         |     ^~~~~~~~~~~~~~~~~~
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c: In function 'ttm_bo_unreserve_pinned':
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c:303:45: error: passing argument 3 of 'ttm_resource_alloc' from incompatible pointer type [-Werror=incompatible-pointer-types]
     303 |         err = ttm_resource_alloc(bo, place, &res1, NULL);
         |                                             ^~~~~
         |                                             |
         |                                             struct ttm_resource **
   include/drm/ttm/ttm_resource.h:450:50: note: expected 'struct ttm_operation_ctx *' but argument is of type 'struct ttm_resource **'
     450 |                        struct ttm_operation_ctx *ctx,
         |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c:303:15: error: too few arguments to function 'ttm_resource_alloc'
     303 |         err = ttm_resource_alloc(bo, place, &res1, NULL);
         |               ^~~~~~~~~~~~~~~~~~
   include/drm/ttm/ttm_resource.h:448:5: note: declared here
     448 | int ttm_resource_alloc(struct ttm_buffer_object *bo,
         |     ^~~~~~~~~~~~~~~~~~
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c:308:45: error: passing argument 3 of 'ttm_resource_alloc' from incompatible pointer type [-Werror=incompatible-pointer-types]
     308 |         err = ttm_resource_alloc(bo, place, &res2, NULL);
         |                                             ^~~~~
         |                                             |
         |                                             struct ttm_resource **
   include/drm/ttm/ttm_resource.h:450:50: note: expected 'struct ttm_operation_ctx *' but argument is of type 'struct ttm_resource **'
     450 |                        struct ttm_operation_ctx *ctx,
         |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c:308:15: error: too few arguments to function 'ttm_resource_alloc'
     308 |         err = ttm_resource_alloc(bo, place, &res2, NULL);
         |               ^~~~~~~~~~~~~~~~~~
   include/drm/ttm/ttm_resource.h:448:5: note: declared here
     448 | int ttm_resource_alloc(struct ttm_buffer_object *bo,
         |     ^~~~~~~~~~~~~~~~~~
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c: In function 'ttm_bo_unreserve_bulk':
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c:358:46: error: passing argument 3 of 'ttm_resource_alloc' from incompatible pointer type [-Werror=incompatible-pointer-types]
     358 |         err = ttm_resource_alloc(bo1, place, &res1, NULL);
         |                                              ^~~~~
         |                                              |
         |                                              struct ttm_resource **
   include/drm/ttm/ttm_resource.h:450:50: note: expected 'struct ttm_operation_ctx *' but argument is of type 'struct ttm_resource **'
     450 |                        struct ttm_operation_ctx *ctx,
         |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c:358:15: error: too few arguments to function 'ttm_resource_alloc'
     358 |         err = ttm_resource_alloc(bo1, place, &res1, NULL);
         |               ^~~~~~~~~~~~~~~~~~
   include/drm/ttm/ttm_resource.h:448:5: note: declared here
     448 | int ttm_resource_alloc(struct ttm_buffer_object *bo,
         |     ^~~~~~~~~~~~~~~~~~
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c:366:46: error: passing argument 3 of 'ttm_resource_alloc' from incompatible pointer type [-Werror=incompatible-pointer-types]
     366 |         err = ttm_resource_alloc(bo2, place, &res2, NULL);
         |                                              ^~~~~
         |                                              |
         |                                              struct ttm_resource **
   include/drm/ttm/ttm_resource.h:450:50: note: expected 'struct ttm_operation_ctx *' but argument is of type 'struct ttm_resource **'
     450 |                        struct ttm_operation_ctx *ctx,
         |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c:366:15: error: too few arguments to function 'ttm_resource_alloc'
     366 |         err = ttm_resource_alloc(bo2, place, &res2, NULL);
         |               ^~~~~~~~~~~~~~~~~~
   include/drm/ttm/ttm_resource.h:448:5: note: declared here
     448 | int ttm_resource_alloc(struct ttm_buffer_object *bo,
         |     ^~~~~~~~~~~~~~~~~~
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c: In function 'ttm_bo_put_basic':
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c:404:45: error: passing argument 3 of 'ttm_resource_alloc' from incompatible pointer type [-Werror=incompatible-pointer-types]
     404 |         err = ttm_resource_alloc(bo, place, &res, NULL);
         |                                             ^~~~
         |                                             |
         |                                             struct ttm_resource **
   include/drm/ttm/ttm_resource.h:450:50: note: expected 'struct ttm_operation_ctx *' but argument is of type 'struct ttm_resource **'
     450 |                        struct ttm_operation_ctx *ctx,
         |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c:404:15: error: too few arguments to function 'ttm_resource_alloc'
     404 |         err = ttm_resource_alloc(bo, place, &res, NULL);
         |               ^~~~~~~~~~~~~~~~~~
   include/drm/ttm/ttm_resource.h:448:5: note: declared here
     448 | int ttm_resource_alloc(struct ttm_buffer_object *bo,
         |     ^~~~~~~~~~~~~~~~~~
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c: In function 'ttm_bo_pin_unpin_resource':
   drivers/gpu/drm/ttm/tests/ttm_bo_test.c:521:45: error: passing argument 3 of 'ttm_resource_alloc' from incompatible pointer type [-Werror=incompatible-pointer-types]
     521 |         err = ttm_resource_alloc(bo, place, &res, NULL);
         |                                             ^~~~
         |                                             |
         |                                             struct ttm_resource **
   include/drm/ttm/ttm_resource.h:450:50: note: expected 'struct ttm_operation_ctx *' but argument is of type 'struct ttm_resource **'
     450 |                        struct ttm_operation_ctx *ctx,


vim +/ttm_resource_alloc +305 drivers/gpu/drm/ttm/tests/ttm_resource_test.c

9afc1e0aa4851e Karolina Stolarek 2023-11-29  288  
9afc1e0aa4851e Karolina Stolarek 2023-11-29  289  static void ttm_sys_man_free_basic(struct kunit *test)
9afc1e0aa4851e Karolina Stolarek 2023-11-29  290  {
9afc1e0aa4851e Karolina Stolarek 2023-11-29  291  	struct ttm_resource_test_priv *priv = test->priv;
9afc1e0aa4851e Karolina Stolarek 2023-11-29  292  	struct ttm_resource_manager *man;
9afc1e0aa4851e Karolina Stolarek 2023-11-29  293  	struct ttm_buffer_object *bo;
9afc1e0aa4851e Karolina Stolarek 2023-11-29  294  	struct ttm_place *place;
9afc1e0aa4851e Karolina Stolarek 2023-11-29  295  	struct ttm_resource *res;
07430fa5248964 Karolina Stolarek 2024-06-12  296  	u32 mem_type = TTM_PL_SYSTEM;
9afc1e0aa4851e Karolina Stolarek 2023-11-29  297  
9afc1e0aa4851e Karolina Stolarek 2023-11-29  298  	ttm_init_test_mocks(test, priv, mem_type, 0);
9afc1e0aa4851e Karolina Stolarek 2023-11-29  299  	bo = priv->bo;
9afc1e0aa4851e Karolina Stolarek 2023-11-29  300  	place = priv->place;
9afc1e0aa4851e Karolina Stolarek 2023-11-29  301  
9afc1e0aa4851e Karolina Stolarek 2023-11-29  302  	res = kunit_kzalloc(test, sizeof(*res), GFP_KERNEL);
9afc1e0aa4851e Karolina Stolarek 2023-11-29  303  	KUNIT_ASSERT_NOT_NULL(test, res);
9afc1e0aa4851e Karolina Stolarek 2023-11-29  304  
2b624a2c18656e Maarten Lankhorst 2024-12-04 @305  	ttm_resource_alloc(bo, place, &res, NULL);
9afc1e0aa4851e Karolina Stolarek 2023-11-29  306  
9afc1e0aa4851e Karolina Stolarek 2023-11-29  307  	man = ttm_manager_type(priv->devs->ttm_dev, mem_type);
9afc1e0aa4851e Karolina Stolarek 2023-11-29  308  	man->func->free(man, res);
9afc1e0aa4851e Karolina Stolarek 2023-11-29  309  
9afc1e0aa4851e Karolina Stolarek 2023-11-29  310  	KUNIT_ASSERT_TRUE(test, list_empty(&man->lru[bo->priority]));
9afc1e0aa4851e Karolina Stolarek 2023-11-29  311  	KUNIT_ASSERT_EQ(test, man->usage, 0);
9afc1e0aa4851e Karolina Stolarek 2023-11-29  312  }
9afc1e0aa4851e Karolina Stolarek 2023-11-29  313  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH 3/5] ttm: add initial memcg integration. (v2)
  2025-05-02  3:36 ` [PATCH 3/5] ttm: add initial memcg integration. (v2) Dave Airlie
  2025-05-02 12:01   ` Christian König
  2025-05-02 14:24   ` kernel test robot
@ 2025-05-03  2:09   ` kernel test robot
  2 siblings, 0 replies; 42+ messages in thread
From: kernel test robot @ 2025-05-03  2:09 UTC (permalink / raw)
  To: Dave Airlie, dri-devel, tj, christian.koenig, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: oe-kbuild-all, cgroups, Waiman Long, simona

Hi Dave,

kernel test robot noticed the following build errors:

[auto build test ERROR on tj-cgroup/for-next]
[also build test ERROR on akpm-mm/mm-everything linus/master v6.15-rc4]
[cannot apply to drm-misc/drm-misc-next drm-tip/drm-tip next-20250502]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Dave-Airlie/memcg-add-hooks-for-gpu-memcg-charging-uncharging/20250502-123650
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20250502034046.1625896-4-airlied%40gmail.com
patch subject: [PATCH 3/5] ttm: add initial memcg integration. (v2)
config: s390-randconfig-001-20250503 (https://download.01.org/0day-ci/archive/20250503/202505030927.6cZ0SdOU-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250503/202505030927.6cZ0SdOU-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505030927.6cZ0SdOU-lkp@intel.com/

All errors (new ones prefixed by >>):

   s390-linux-ld: drivers/gpu/drm/ttm/ttm_resource.o: in function `ttm_resource_free':
>> drivers/gpu/drm/ttm/ttm_resource.c:443: undefined reference to `mem_cgroup_uncharge_gpu'
   s390-linux-ld: drivers/gpu/drm/ttm/ttm_resource.o: in function `ttm_resource_alloc':
   drivers/gpu/drm/ttm/ttm_resource.c:408: undefined reference to `mem_cgroup_uncharge_gpu'
>> s390-linux-ld: drivers/gpu/drm/ttm/ttm_resource.c:400: undefined reference to `mem_cgroup_charge_gpu'
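
The s390 randconfig presumably builds with CONFIG_MEMCG disabled, so the
new hooks have nothing to link against. A hedged sketch of the usual fix,
with the declarations assumed to live in include/linux/memcontrol.h and
the signatures inferred from the call sites below:

	#ifdef CONFIG_MEMCG
	bool mem_cgroup_charge_gpu(struct mem_cgroup *memcg, long nr_pages,
				   gfp_t gfp_mask);
	void mem_cgroup_uncharge_gpu(struct mem_cgroup *memcg, long nr_pages);
	#else
	static inline bool mem_cgroup_charge_gpu(struct mem_cgroup *memcg,
						 long nr_pages, gfp_t gfp_mask)
	{
		return true;	/* stub: charging always succeeds */
	}
	static inline void mem_cgroup_uncharge_gpu(struct mem_cgroup *memcg,
						   long nr_pages)
	{
	}
	#endif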


vim +443 drivers/gpu/drm/ttm/ttm_resource.c

   374	
   375	int ttm_resource_alloc(struct ttm_buffer_object *bo,
   376			       const struct ttm_place *place,
   377			       struct ttm_operation_ctx *ctx,
   378			       struct ttm_resource **res_ptr,
   379			       struct dmem_cgroup_pool_state **ret_limit_pool)
   380	{
   381		struct ttm_resource_manager *man =
   382			ttm_manager_type(bo->bdev, place->mem_type);
   383		struct dmem_cgroup_pool_state *pool = NULL;
   384		struct mem_cgroup *memcg = NULL;
   385		int ret;
   386	
   387		if (man->cg) {
   388			ret = dmem_cgroup_try_charge(man->cg, bo->base.size, &pool, ret_limit_pool);
   389			if (ret)
   390				return ret;
   391		}
   392	
   393		if ((place->mem_type == TTM_PL_SYSTEM || place->mem_type == TTM_PL_TT) &&
   394		    ctx->account_op && bo->memcg) {
   395			memcg = bo->memcg;
   396			gfp_t gfp_flags = GFP_USER;
   397			if (ctx->gfp_retry_mayfail)
   398				gfp_flags |= __GFP_RETRY_MAYFAIL;
   399	
 > 400			if (!mem_cgroup_charge_gpu(memcg, bo->base.size >> PAGE_SHIFT, gfp_flags))
   401				return -ENOMEM;
   402		}
   403		ret = man->func->alloc(man, bo, place, res_ptr);
   404		if (ret) {
   405			if (pool)
   406				dmem_cgroup_uncharge(pool, bo->base.size);
   407			if (memcg)
   408				mem_cgroup_uncharge_gpu(memcg, bo->base.size >> PAGE_SHIFT);
   409			return ret;
   410		}
   411	
   412		(*res_ptr)->memcg = memcg;
   413		(*res_ptr)->css = pool;
   414	
   415		spin_lock(&bo->bdev->lru_lock);
   416		ttm_resource_add_bulk_move(*res_ptr, bo);
   417		spin_unlock(&bo->bdev->lru_lock);
   418		return 0;
   419	}
   420	EXPORT_SYMBOL_FOR_TESTS_ONLY(ttm_resource_alloc);
   421	
   422	void ttm_resource_free(struct ttm_buffer_object *bo, struct ttm_resource **res)
   423	{
   424		struct ttm_resource_manager *man;
   425		struct dmem_cgroup_pool_state *pool;
   426		struct mem_cgroup *memcg;
   427	
   428		if (!*res)
   429			return;
   430	
   431		spin_lock(&bo->bdev->lru_lock);
   432		ttm_resource_del_bulk_move(*res, bo);
   433		spin_unlock(&bo->bdev->lru_lock);
   434	
   435		pool = (*res)->css;
   436		memcg = (*res)->memcg;
   437		man = ttm_manager_type(bo->bdev, (*res)->mem_type);
   438		man->func->free(man, *res);
   439		*res = NULL;
   440		if (man->cg)
   441			dmem_cgroup_uncharge(pool, bo->base.size);
   442		if (memcg)
 > 443			mem_cgroup_uncharge_gpu(memcg, bo->base.size >> PAGE_SHIFT);
   444	}
   445	EXPORT_SYMBOL(ttm_resource_free);
   446	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-02  3:35 [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2) Dave Airlie
                   ` (4 preceding siblings ...)
  2025-05-02  3:36 ` [PATCH 5/5] nouveau: add " Dave Airlie
@ 2025-05-06  0:37 ` Shakeel Butt
  2025-05-06  0:59   ` Dave Airlie
  2025-05-07 17:52 ` Johannes Weiner
  6 siblings, 1 reply; 42+ messages in thread
From: Shakeel Butt @ 2025-05-06  0:37 UTC (permalink / raw)
  To: Dave Airlie
  Cc: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, cgroups, Waiman Long, simona

On Fri, May 02, 2025 at 01:35:59PM +1000, Dave Airlie wrote:
> Hey all,
> 
> This is my second attempt at adding the initial simple memcg/ttm
> integration.
> 
> This varies from the first attempt in two major ways:
> 
> 1. Instead of using __GFP_ACCOUNT and direct calling kmem charges
> for pool memory, and directly hitting the GPU statistic,

Why was the first attempt abandoned? What was the issue with the above
approach?

> Waiman
> suggested I just do what the network socket stuff did, which looks
> simpler. So this adds two new memcg apis that wrap accounting.
> The pages no longer get assigned the memcg, it's owned by the
> larger BO object which makes more sense.

The issue with this approach is that this new stat is only exposed in
memcg. For networking, there are interfaces like /proc/net/sockstat and
/proc/net/protocols which expose system-wide network memory usage. I
think we should expose this new "memory used by GPUs" at the system
level as well, possibly through /proc/meminfo.

> 
> 2. Christian suggested moving it up a layer to avoid the pool business,
> this was a bit tricky, since I want the gfp flags, but I think it only
> needs some of them and it should work. One other big difference is that
> I aligned it with the dmem interaction, where it tries to get space in
> the memcg before it has even allocated any pages,

I don't understand the memcg reference in the above statement. Dmem is a
separate cgroup controller orthogonal to memcg.

> I'm not 100% sure
> this is how things should be done, but it was easier, so please let 
> me know if it is wrong.
> 
> This still doesn't do anything with evictions except ignore them,
> and I've some follows up on the old thread to discuss more on them.
> 
> Dave.
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-06  0:37 ` [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2) Shakeel Butt
@ 2025-05-06  0:59   ` Dave Airlie
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Airlie @ 2025-05-06  0:59 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, cgroups, Waiman Long, simona

On Tue, 6 May 2025 at 10:37, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Fri, May 02, 2025 at 01:35:59PM +1000, Dave Airlie wrote:
> > Hey all,
> >
> > This is my second attempt at adding the initial simple memcg/ttm
> > integration.
> >
> > This varies from the first attempt in two major ways:
> >
> > 1. Instead of using __GFP_ACCOUNT and direct calling kmem charges
> > for pool memory, and directly hitting the GPU statistic,
>
> Why was the first attempt abandoned? What was the issue with the above
> approach?

I had to export a bunch of memcg functionality to drivers (like the
stat modification and kmem charge interfaces), whereas the new version
keeps all of that behind a GPU-specific API.

> > Waiman
> > suggested I just do what the network socket stuff did, which looks
> > simpler. So this adds two new memcg apis that wrap accounting.
> > The pages no longer get assigned the memcg, it's owned by the
> > larger BO object which makes more sense.
>
> The issue with this approach is that this new stat is only exposed in
> memcg. For networking, there are interfaces like /proc/net/sockstat and
> /proc/net/protocols which expose system wide network memory usage. I
> think we should expose this new "memory used by gpus" at the system
> level possibly through /proc/meminfo.

We do have some places where we could expose this info via debugfs,
but maybe /proc/meminfo should grow this stat so there is a proper
system-wide view.

>
> >
> > 2. Christian suggested moving it up a layer to avoid the pool business,
> > this was a bit tricky, since I want the gfp flags, but I think it only
> > needs some of them and it should work. One other big difference is that
> > I aligned it with the dmem interaction, where it tries to get space in
> > the memcg before it has even allocated any pages,
>
> I don't understand the memcg reference in the above statement. Dmem is a
> separate cgroup controller orthogonal to memcg.
>

The dmem handling sits at a particular level in the TTM code. v1 of
this series added the memcg interactions at a lower level, but Christian
suggested that TTM could interact with both controllers at the same
level, and with the new approach that seems right.

Dave.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-02  3:35 [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2) Dave Airlie
                   ` (5 preceding siblings ...)
  2025-05-06  0:37 ` [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2) Shakeel Butt
@ 2025-05-07 17:52 ` Johannes Weiner
  2025-05-07 22:03   ` Dave Airlie
  6 siblings, 1 reply; 42+ messages in thread
From: Johannes Weiner @ 2025-05-07 17:52 UTC (permalink / raw)
  To: Dave Airlie
  Cc: dri-devel, tj, christian.koenig, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

Hello Dave,

On Fri, May 02, 2025 at 01:35:59PM +1000, Dave Airlie wrote:
> Hey all,
> 
> This is my second attempt at adding the initial simple memcg/ttm
> integration.
> 
> This varies from the first attempt in two major ways:
> 
> 1. Instead of using __GFP_ACCOUNT and direct calling kmem charges
> for pool memory, and directly hitting the GPU statistic, Waiman
> suggested I just do what the network socket stuff did, which looks
> simpler. So this adds two new memcg apis that wrap accounting.
> The pages no longer get assigned the memcg, it's owned by the
> larger BO object which makes more sense.

Unfortunately, this was bad advice :(

Naked-charging like this is quite awkward from the memcg side. It
requires consumer-specific charge paths in the memcg code, adds stat
counters that are memcg-only with no system-wide equivalent, and it's
difficult for the memcg maintainers to keep track of the link between
what's in the counter and the actual physical memory that is supposed
to be tracked.

The network and a few others like it are rather begrudging exceptions
because they do not have a suitable page context or otherwise didn't
fit the charging scheme. They're not good examples to follow if it can
at all be avoided.

__GFP_ACCOUNT and an enum node_stat_item are the much preferred way. I
have no objections to exports if you need to charge and account memory
from a module.
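
A minimal sketch of that model, with nothing driver-specific assumed:

	/* charged to the current task's memcg at allocation time */
	struct page *page = alloc_pages(GFP_USER | __GFP_ACCOUNT, 0);

	/*
	 * The charge is dropped automatically in the free path, so the
	 * charge lifetime matches the allocation lifetime.
	 */
	if (page)
		__free_pages(page, 0);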

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-07 17:52 ` Johannes Weiner
@ 2025-05-07 22:03   ` Dave Airlie
  2025-05-07 22:11     ` Dave Airlie
  2025-05-13  7:54     ` Johannes Weiner
  0 siblings, 2 replies; 42+ messages in thread
From: Dave Airlie @ 2025-05-07 22:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: dri-devel, tj, christian.koenig, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On Thu, 8 May 2025 at 03:52, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hello Dave,
>
> On Fri, May 02, 2025 at 01:35:59PM +1000, Dave Airlie wrote:
> > Hey all,
> >
> > This is my second attempt at adding the initial simple memcg/ttm
> > integration.
> >
> > This varies from the first attempt in two major ways:
> >
> > 1. Instead of using __GFP_ACCOUNT and direct calling kmem charges
> > for pool memory, and directly hitting the GPU statistic, Waiman
> > suggested I just do what the network socket stuff did, which looks
> > simpler. So this adds two new memcg apis that wrap accounting.
> > The pages no longer get assigned the memcg, it's owned by the
> > larger BO object which makes more sense.
>
> Unfortunately, this was bad advice :(
>
> Naked-charging like this is quite awkward from the memcg side. It
> requires consumer-specific charge paths in the memcg code, adds stat
> counters that are memcg-only with no system-wide equivalent, and it's
> difficult for the memcg maintainers to keep track of the link between
> what's in the counter and the actual physical memory that is supposed
> to be tracked.
>
> The network and a few others like it are rather begrudging exceptions
> because they do not have a suitable page context or otherwise didn't
> fit the charging scheme. They're not good examples to follow if it can
> at all be avoided.

I unfortunately feel GPU memory might fit in this category as well:
we aren't allocating pages in the traditional manner, so we have a
number of problems with doing the accounting at the page level.

I think the biggest barrier to doing page-level tracking is the page
pools we have to keep. When we need CPU-uncached pages, we allocate a
bunch of pages and do the work to change their CPU mappings to be
uncached. This work is costly, so when we no longer need the uncached
pages in an object, we return them to a pool rather than to the central
allocator. We then use a shrinker to empty the pool and undo the CPU
mapping change. We also keep equivalent pools for DMA-able and 32-bit
DMA-able pages if they are used.
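
To make the recycling concrete, here is a hypothetical sketch of the
free path (ttm_pool_has_room() and the struct layout are made up;
set_pages_wb() is the x86 helper that restores the cached mapping):

	static void ttm_pool_recycle_page(struct ttm_pool_type *pt,
					  struct page *p)
	{
		if (ttm_pool_has_room(pt)) {
			/* keep the costly uncached mapping for reuse */
			list_add(&p->lru, &pt->pages);
		} else {
			set_pages_wb(p, 1);	/* undo the mapping change */
			__free_pages(p, 0);
		}
	}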

This means that if userspace allocates a large uncached object and we
account it to the current memcg with __GFP_ACCOUNT, then when it gets
handed back to the pool we have to remove it from the memcg accounting.
This was where I used the memcg kmem charge/uncharge calls: uncharge on
adding to the pool and charge on reuse. However, kmem seems like the
wrong place to charge, even though it's where the initial __GFP_ACCOUNT
will put those pages. This is why the suggestion to use the network
solution worked better for me: I can do all the work outside the pool
code at a slightly higher level (the same level where we
charge/uncharge dmem), and I don't have to worry about handling the
different pool types etc. I also don't think we need page->memcg to be
set for these pages, as all pages are part of a larger object and the
object is what gets swapped or freed, not parts of it.

>
> __GFP_ACCOUNT and an enum node_stat_item is the much preferred way. I
> have no objections to exports if you need to charge and account memory
> from a module.

Now maybe it makes sense to add a node_stat_item for this as well as
the GPU memcg accounting, so we can have it accounted in both places.
Right now mm has no statistical insight into this memory at all, other
than it being allocated pages.
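
As a hypothetical sketch (NR_GPUMEM is a made-up name, everything else
is the existing vmstat machinery):

	/* mm side: a global per-node counter next to the memcg stat */
	enum node_stat_item {
		/* ... existing items ... */
		NR_GPUMEM,
		NR_VM_NODE_STAT_ITEMS,
	};

	/* driver side, paired with the memcg charge */
	mod_node_page_state(page_pgdat(page), NR_GPUMEM, nr_pages);

	/* plus the /proc/meminfo line Shakeel asked about */
	show_val_kb(m, "GpuMem:         ", global_node_page_state(NR_GPUMEM));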

Dave.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-07 22:03   ` Dave Airlie
@ 2025-05-07 22:11     ` Dave Airlie
  2025-05-13  7:54     ` Johannes Weiner
  1 sibling, 0 replies; 42+ messages in thread
From: Dave Airlie @ 2025-05-07 22:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: dri-devel, tj, christian.koenig, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On Thu, 8 May 2025 at 08:03, Dave Airlie <airlied@gmail.com> wrote:
>
> On Thu, 8 May 2025 at 03:52, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Hello Dave,
> >
> > On Fri, May 02, 2025 at 01:35:59PM +1000, Dave Airlie wrote:
> > > Hey all,
> > >
> > > This is my second attempt at adding the initial simple memcg/ttm
> > > integration.
> > >
> > > This varies from the first attempt in two major ways:
> > >
> > > 1. Instead of using __GFP_ACCOUNT and direct calling kmem charges
> > > for pool memory, and directly hitting the GPU statistic, Waiman
> > > suggested I just do what the network socket stuff did, which looks
> > > simpler. So this adds two new memcg apis that wrap accounting.
> > > The pages no longer get assigned the memcg, it's owned by the
> > > larger BO object which makes more sense.
> >
> > Unfortunately, this was bad advice :(
> >
> > Naked-charging like this is quite awkward from the memcg side. It
> > requires consumer-specific charge paths in the memcg code, adds stat
> > counters that are memcg-only with no system-wide equivalent, and it's
> > difficult for the memcg maintainers to keep track of the link between
> > what's in the counter and the actual physical memory that is supposed
> > to be tracked.
> >
> > The network and a few others like it are rather begrudging exceptions
> > because they do not have a suitable page context or otherwise didn't
> > fit the charging scheme. They're not good examples to follow if it can
> > at all be avoided.
>
> I unfortunately feel GPU might fit in this category as well, we aren't
> allocating pages in the traditional manner, so we have a number of
> problems with doing it at the page level.
>
> I think the biggest barrier to doing page level tracking is with the
> page pools we have to keep. When we need CPU uncached pages, we
> allocate a bunch of pages and do the work to fix their cpu mappings to
> be uncached, this work is costly, so when we no longer need the
> uncached pages in an object, we return them to a pool rather than to
> the central allocator. We then use a shrinker to empty the pool and
> undo the cpu mapping change. We also do equivalent pools for dma able
> and 32-bit dma able pages if they are used.
>
> This means that if userspace allocates a large uncached object, and we
> account it to the current memcg with __GFP_ACCOUNT, then when it gets
> handed back to the pool we have to remove it from the memcg
> accounting. This was where I used the memcg kmem charge/uncharges,
> uncharge on adding to pool and charge on reuse. However kmem seems
> like the wrong place to charge, but it's where the initial
> __GFP_ACCOUNT will put those pages. This is why the suggestions to use
> the network solution worked better for me, I can do all the work
> outside the pool code at a slightly higher level (the same level where
> we charge/uncharge dmem), and I don't have to worry about handling the
> different pool types etc. We also don't need page->memcg to be set for
> these pages I don't think as all pages are part of a large object and
> the object is what gets swapped or freed, not parts of it.

With all that said, we also manage moving objects to swap, and maybe
for proper accounting of swapping I need the more detailed integrated
approach at the lower levels, so that objects that have been swapped
get removed from the GPU counter and added to the normal swap counters.

Dave.

>
> >
> > __GFP_ACCOUNT and an enum node_stat_item is the much preferred way. I
> > have no objections to exports if you need to charge and account memory
> > from a module.
>
> Now maybe it makes sense to add a node_stat_item for this as well as
> the GPU memcg accounting so we can have it accounted in both places,
> right now mm has no insight statistics wise into this memory at all,
> other than it being pages allocated.
>
> Dave.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-07 22:03   ` Dave Airlie
  2025-05-07 22:11     ` Dave Airlie
@ 2025-05-13  7:54     ` Johannes Weiner
  2025-05-15  3:02       ` Dave Airlie
  2025-05-21  2:23       ` Dave Airlie
  1 sibling, 2 replies; 42+ messages in thread
From: Johannes Weiner @ 2025-05-13  7:54 UTC (permalink / raw)
  To: Dave Airlie
  Cc: dri-devel, tj, christian.koenig, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On Thu, May 08, 2025 at 08:03:49AM +1000, Dave Airlie wrote:
> On Thu, 8 May 2025 at 03:52, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Hello Dave,
> >
> > On Fri, May 02, 2025 at 01:35:59PM +1000, Dave Airlie wrote:
> > > Hey all,
> > >
> > > This is my second attempt at adding the initial simple memcg/ttm
> > > integration.
> > >
> > > This varies from the first attempt in two major ways:
> > >
> > > 1. Instead of using __GFP_ACCOUNT and direct calling kmem charges
> > > for pool memory, and directly hitting the GPU statistic, Waiman
> > > suggested I just do what the network socket stuff did, which looks
> > > simpler. So this adds two new memcg apis that wrap accounting.
> > > The pages no longer get assigned the memcg, it's owned by the
> > > larger BO object which makes more sense.
> >
> > Unfortunately, this was bad advice :(
> >
> > Naked-charging like this is quite awkward from the memcg side. It
> > requires consumer-specific charge paths in the memcg code, adds stat
> > counters that are memcg-only with no system-wide equivalent, and it's
> > difficult for the memcg maintainers to keep track of the link between
> > what's in the counter and the actual physical memory that is supposed
> > to be tracked.
> >
> > The network and a few others like it are rather begrudging exceptions
> > because they do not have a suitable page context or otherwise didn't
> > fit the charging scheme. They're not good examples to follow if it can
> > at all be avoided.
> 
> I unfortunately feel GPU might fit in this category as well, we aren't
> allocating pages in the traditional manner, so we have a number of
> problems with doing it at the page level.
> 
> I think the biggest barrier to doing page level tracking is with the
> page pools we have to keep. When we need CPU uncached pages, we
> allocate a bunch of pages and do the work to fix their cpu mappings to
> be uncached, this work is costly, so when we no longer need the
> uncached pages in an object, we return them to a pool rather than to
> the central allocator. We then use a shrinker to empty the pool and
> undo the cpu mapping change. We also do equivalent pools for dma able
> and 32-bit dma able pages if they are used.

Thanks for explaining the background, this is quite helpful. And I can
see where you're coming from with this.

But I'm actually not so sure it's a good idea to uncharge upon release
into the pool. Those pages might be unused from a GPU driver POV, but
they're not readily available for, say, a heap fault in another group.

For example, AFAIU a cgroup can allocate all of its memory allowance
in GPU memory, free it back into the pool, then fill the resulting
headroom inside the group with other allocations, like page cache or
heap. This lets a cgroup directly trigger allocations worth up to
twice its memory allowance. Do that on an ongoing basis, and you can
really tie up other containers in the shrinkers picking up after you -
just to get to the memory they've been rightfully assigned.

That's a clear containment problem.

System-wide, if I'm looking at the right code, the pool can be up to
50% of memory. That's a lot of used memory not attributed to any
cgroup, even though each allocation is directly triggered by one.

Container people are no fans of big, unattributed pools like this.
Imagine you start with a certain amount of memory in the system and
have a number of containers you would like to run. How much global
memory headroom do you need to set aside, not assigned to any
particular cgroup, to get a gpu cache that serves all cgroups well?

Caching itself is not an issue. A large part of cgroup-tracked memory
is some kind of cache. But what's usually done is to make the cache
itself per cgroup, and tie the size and the lifetime of entries to the
group's memory allowance. That gets you isolation and control.

E.g. if one cgroup in the system is more important than the others,
you can let it have a bigger cache by increasing its memory
allowance. Whereas if the pool is shared, you have no control over
that - less important groups could consume the cache and negatively
impact hit rates of the important group.

So in the GPU case, you'd charge on allocation, free objects into a
cgroup-specific pool, and shrink using a cgroup-specific LRU
list. Freed objects can be reused by this cgroup, but nobody else.
They're reclaimed through memory pressure inside the cgroup, not due
to the action of others. And all allocated memory is accounted for.

I have to admit I'm pretty clueless about the gpu driver internals and
can't really judge how feasible this is. But from a cgroup POV, if you
want proper memory isolation between groups, it seems to me that's the
direction you'd have to take this in.

> This means that if userspace allocates a large uncached object, and we
> account it to the current memcg with __GFP_ACCOUNT, then when it gets
> handed back to the pool we have to remove it from the memcg
> accounting. This was where I used the memcg kmem charge/uncharges,
> uncharge on adding to pool and charge on reuse. However kmem seems
> like the wrong place to charge, but it's where the initial
> __GFP_ACCOUNT will put those pages.

Ah, no need to worry about it. The name is just a historical memcgism,
from back when we first started charging "kernel" allocations, as
opposed to the conventional, pageable userspace memory. It's no longer
a super meaningful distinction, tbh.

You can still add a separate counter for GPU memory.

> This is why the suggestions to use the network solution worked
> better for me, I can do all the work outside the pool code at a
> slightly higher level (the same level where we charge/uncharge
> dmem), and I don't have to worry about handling the different pool
> types etc. We also don't need page->memcg to be set for these pages
> I don't think as all pages are part of a large object and the object
> is what gets swapped or freed, not parts of it.

I agree this doesn't need to be a goal in itself. It would just be a
side effect of charging through __GFP_ACCOUNT and uncharging inside
__free_pages(). What's more important is that the charge lifetime is
correlated with the actual memory allocation.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-13  7:54     ` Johannes Weiner
@ 2025-05-15  3:02       ` Dave Airlie
  2025-05-15  8:55         ` Christian König
  2025-05-16 16:12         ` Johannes Weiner
  2025-05-21  2:23       ` Dave Airlie
  1 sibling, 2 replies; 42+ messages in thread
From: Dave Airlie @ 2025-05-15  3:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: dri-devel, tj, christian.koenig, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

> I have to admit I'm pretty clueless about the gpu driver internals and
> can't really judge how feasible this is. But from a cgroup POV, if you
> want proper memory isolation between groups, it seems to me that's the
> direction you'd have to take this in.

Thanks for this insight, I think you have definitely shown me where
things need to go here, and I agree that making the pools and the
shrinker memcg-aware is the proper answer. Unfortunately I think we are
a long way from that at the moment, so I'll need to do a bit more
research. I wonder if we can agree on some compromise points in order
to move things forward from where they are now.

Right now we have 0 accounting for any system memory allocations done
via GPU APIs, never mind the case where we have pools and evictions.

I think I sort of see 3 stages:
1. Land some sort of accounting so you can at least see the active GPU
memory usage globally, per-node and per-cgroup - this series mostly
covers that, modulo any other feedback I get.
2. Work on making the ttm subsystem cgroup aware and achieve the state
where we can shrink inside the cgroup first.
3. Work on what to do with evicted memory for VRAM allocations, and
how best to integrate with dmem to possibly allow userspace to define
policy for this.

> Ah, no need to worry about it. The name is just a historical memcgism,
> from back when we first started charging "kernel" allocations, as
> opposed to the conventional, pageable userspace memory. It's no longer
> a super meaningful distinction, tbh.
>
> You can still add a separate counter for GPU memory.

Okay, that's interesting. So I guess the only question versus the
bespoke APIs is whether we use __GFP_ACCOUNT and whether there is
benefit in having page->memcg set.

>
> I agree this doesn't need to be a goal in itself. It would just be a
> side effect of charging through __GFP_ACCOUNT and uncharging inside
> __free_pages(). What's more important is that the charge lifetime is
> correlated with the actual memory allocation.

How much flexibility do we have to evolve here? Say we start with
where the latest series I posted gets us (maybe behind a CONFIG
option), then work on memcg-aware shrinkers for the pools; with that in
place it might make more sense to account across the complete memory
allocation. I'm also not sure if passing __GFP_ACCOUNT to the DMA
allocators is supported, which is also something we need to do, and
having the bespoke API makes that possible.

Dave.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-15  3:02       ` Dave Airlie
@ 2025-05-15  8:55         ` Christian König
  2025-05-15 15:04           ` Waiman Long
  2025-05-15 16:08           ` Johannes Weiner
  2025-05-16 16:12         ` Johannes Weiner
  1 sibling, 2 replies; 42+ messages in thread
From: Christian König @ 2025-05-15  8:55 UTC (permalink / raw)
  To: Dave Airlie, Johannes Weiner
  Cc: dri-devel, tj, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, cgroups, Waiman Long, simona

On 5/15/25 05:02, Dave Airlie wrote:
>> I have to admit I'm pretty clueless about the gpu driver internals and
>> can't really judge how feasible this is. But from a cgroup POV, if you
>> want proper memory isolation between groups, it seems to me that's the
>> direction you'd have to take this in.
> 
> Thanks for this insight, I think you have definitely shown me where
> things need to go here, and I agree that the goal should be to make
> the pools and the shrinker memcg aware is the proper answer,
> unfortunately I think we are long way from that at the moment, but
> I'll need to do a bit more research. I wonder if we can agree on some
> compromise points in order to move things forward from where they are
> now.
> 
> Right now we have 0 accounting for any system memory allocations done
> via GPU APIs, never mind the case where we have pools and evictions.
> 
> I think I sort of see 3 stages:
> 1. Land some sort of accounting so you can at least see the active GPU
> memory usage globally, per-node and per-cgroup - this series mostly
> covers that, modulo any other feedback I get.
> 2. Work on making the ttm subsystem cgroup aware and achieve the state
> where we can shrink inside the cgroup first.
> 3. Work on what to do with evicted memory for VRAM allocations, and
> how best to integrate with dmem to possibly allow userspace to define
> policy for this.
> 
>> Ah, no need to worry about it. The name is just a historical memcgism,
>> from back when we first started charging "kernel" allocations, as
>> opposed to the conventional, pageable userspace memory. It's no longer
>> a super meaningful distinction, tbh.
>>
>> You can still add a separate counter for GPU memory.
> 
> Okay that's interesting, so I guess the only question vs the bespoke
> ones is whether we use __GFP_ACCOUNT and whether there is benefit in
> having page->memcg set.
> 
>>
>> I agree this doesn't need to be a goal in itself. It would just be a
>> side effect of charging through __GFP_ACCOUNT and uncharging inside
>> __free_pages(). What's more important is that the charge lifetime is
>> correlated with the actual memory allocation.
> 
> How much flexibility to do we have to evolve here, like if we start
> with where the latest series I posted gets us (maybe with a CONFIG
> option), then work on memcg aware shrinkers for the pools, then with
> that in place it might make more sense to account across the complete
> memory allocation. I think I'm also not sure if passing __GFP_ACCOUNT
> to the dma allocators is supported, which is also something we need to
> do, and having the bespoke API allows that to be possible.

Stop for a second.

As far as I can see the shrinker for the TTM pool should *not* be memcg aware. Background is that pages that enter the pool are considered freed by the application.

The only reason we have the pool is to speed up allocation of uncached and write-combined pages, as well as to work around performance problems of the coherent DMA API.

The shrinker makes sure that the pages can be given back to the core memory management at any given time.

Regards,
Christian.


> 
> Dave.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-15  8:55         ` Christian König
@ 2025-05-15 15:04           ` Waiman Long
  2025-05-15 15:16             ` Christian König
  2025-05-15 16:08           ` Johannes Weiner
  1 sibling, 1 reply; 42+ messages in thread
From: Waiman Long @ 2025-05-15 15:04 UTC (permalink / raw)
  To: Christian König, Dave Airlie, Johannes Weiner
  Cc: dri-devel, tj, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, cgroups, simona

On 5/15/25 4:55 AM, Christian König wrote:
> On 5/15/25 05:02, Dave Airlie wrote:
>>> I have to admit I'm pretty clueless about the gpu driver internals and
>>> can't really judge how feasible this is. But from a cgroup POV, if you
>>> want proper memory isolation between groups, it seems to me that's the
>>> direction you'd have to take this in.
>> Thanks for this insight, I think you have definitely shown me where
>> things need to go here, and I agree that the goal should be to make
>> the pools and the shrinker memcg aware is the proper answer,
>> unfortunately I think we are long way from that at the moment, but
>> I'll need to do a bit more research. I wonder if we can agree on some
>> compromise points in order to move things forward from where they are
>> now.
>>
>> Right now we have 0 accounting for any system memory allocations done
>> via GPU APIs, never mind the case where we have pools and evictions.
>>
>> I think I sort of see 3 stages:
>> 1. Land some sort of accounting so you can at least see the active GPU
>> memory usage globally, per-node and per-cgroup - this series mostly
>> covers that, modulo any other feedback I get.
>> 2. Work on making the ttm subsystem cgroup aware and achieve the state
>> where we can shrink inside the cgroup first.
>> 3. Work on what to do with evicted memory for VRAM allocations, and
>> how best to integrate with dmem to possibly allow userspace to define
>> policy for this.
>>
>>> Ah, no need to worry about it. The name is just a historical memcgism,
>>> from back when we first started charging "kernel" allocations, as
>>> opposed to the conventional, pageable userspace memory. It's no longer
>>> a super meaningful distinction, tbh.
>>>
>>> You can still add a separate counter for GPU memory.
>> Okay that's interesting, so I guess the only question vs the bespoke
>> ones is whether we use __GFP_ACCOUNT and whether there is benefit in
>> having page->memcg set.
>>
>>> I agree this doesn't need to be a goal in itself. It would just be a
>>> side effect of charging through __GFP_ACCOUNT and uncharging inside
>>> __free_pages(). What's more important is that the charge lifetime is
>>> correlated with the actual memory allocation.
>> How much flexibility to do we have to evolve here, like if we start
>> with where the latest series I posted gets us (maybe with a CONFIG
>> option), then work on memcg aware shrinkers for the pools, then with
>> that in place it might make more sense to account across the complete
>> memory allocation. I think I'm also not sure if passing __GFP_ACCOUNT
>> to the dma allocators is supported, which is also something we need to
>> do, and having the bespoke API allows that to be possible.
> Stop for a second.
>
> As far as I can see the shrinker for the TTM pool should *not* be memcg aware. Background is that pages who enter the pool are considered freed by the application.
>
> The only reason we have the pool is to speed up allocation of uncached and write combined pages as well as work around for performance problems of the coherent DMA API.
>
> The shrinker makes sure that the pages can be given back to the core memory management at any given time.

I think what Johannes is suggesting is that we can have a separate
cgroup just for the memory pool, so that if the system is running into
global memory pressure, the shrinker can force the uncached pool to
release memory back into the system.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-15 15:04           ` Waiman Long
@ 2025-05-15 15:16             ` Christian König
  0 siblings, 0 replies; 42+ messages in thread
From: Christian König @ 2025-05-15 15:16 UTC (permalink / raw)
  To: Waiman Long, Dave Airlie, Johannes Weiner
  Cc: dri-devel, tj, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, cgroups, simona

On 5/15/25 17:04, Waiman Long wrote:
> On 5/15/25 4:55 AM, Christian König wrote:
>> On 5/15/25 05:02, Dave Airlie wrote:
>>>> I have to admit I'm pretty clueless about the gpu driver internals and
>>>> can't really judge how feasible this is. But from a cgroup POV, if you
>>>> want proper memory isolation between groups, it seems to me that's the
>>>> direction you'd have to take this in.
>>> Thanks for this insight, I think you have definitely shown me where
>>> things need to go here, and I agree that the goal should be to make
>>> the pools and the shrinker memcg aware is the proper answer,
>>> unfortunately I think we are long way from that at the moment, but
>>> I'll need to do a bit more research. I wonder if we can agree on some
>>> compromise points in order to move things forward from where they are
>>> now.
>>>
>>> Right now we have 0 accounting for any system memory allocations done
>>> via GPU APIs, never mind the case where we have pools and evictions.
>>>
>>> I think I sort of see 3 stages:
>>> 1. Land some sort of accounting so you can at least see the active GPU
>>> memory usage globally, per-node and per-cgroup - this series mostly
>>> covers that, modulo any other feedback I get.
>>> 2. Work on making the ttm subsystem cgroup aware and achieve the state
>>> where we can shrink inside the cgroup first.
>>> 3. Work on what to do with evicted memory for VRAM allocations, and
>>> how best to integrate with dmem to possibly allow userspace to define
>>> policy for this.
>>>
>>>> Ah, no need to worry about it. The name is just a historical memcgism,
>>>> from back when we first started charging "kernel" allocations, as
>>>> opposed to the conventional, pageable userspace memory. It's no longer
>>>> a super meaningful distinction, tbh.
>>>>
>>>> You can still add a separate counter for GPU memory.
>>> Okay that's interesting, so I guess the only question vs the bespoke
>>> ones is whether we use __GFP_ACCOUNT and whether there is benefit in
>>> having page->memcg set.
>>>
>>>> I agree this doesn't need to be a goal in itself. It would just be a
>>>> side effect of charging through __GFP_ACCOUNT and uncharging inside
>>>> __free_pages(). What's more important is that the charge lifetime is
>>>> correlated with the actual memory allocation.
>>> How much flexibility to do we have to evolve here, like if we start
>>> with where the latest series I posted gets us (maybe with a CONFIG
>>> option), then work on memcg aware shrinkers for the pools, then with
>>> that in place it might make more sense to account across the complete
>>> memory allocation. I think I'm also not sure if passing __GFP_ACCOUNT
>>> to the dma allocators is supported, which is also something we need to
>>> do, and having the bespoke API allows that to be possible.
>> Stop for a second.
>>
>> As far as I can see the shrinker for the TTM pool should *not* be memcg aware. Background is that pages who enter the pool are considered freed by the application.
>>
>> The only reason we have the pool is to speed up allocation of uncached and write combined pages as well as work around for performance problems of the coherent DMA API.
>>
>> The shrinker makes sure that the pages can be given back to the core memory management at any given time.
> 
> I think what Johannes is suggesting that we can have a separate cgroup just for the memory pool so that if the system is running into global memory pressure, the shrinker can force the uncached pool to release memory back into the system.

That's something we already have.

TTM already has a limit on the maximum size of the pool.

In addition to that, the shrinker makes sure that the memory is given back to the system immediately when running into global memory pressure.

Regards,
Christian.


> 
> Cheers,
> Longman
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-15  8:55         ` Christian König
  2025-05-15 15:04           ` Waiman Long
@ 2025-05-15 16:08           ` Johannes Weiner
  2025-05-16  6:53             ` Christian König
  1 sibling, 1 reply; 42+ messages in thread
From: Johannes Weiner @ 2025-05-15 16:08 UTC (permalink / raw)
  To: Christian König
  Cc: Dave Airlie, dri-devel, tj, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On Thu, May 15, 2025 at 10:55:51AM +0200, Christian König wrote:
> On 5/15/25 05:02, Dave Airlie wrote:
> >> I have to admit I'm pretty clueless about the gpu driver internals and
> >> can't really judge how feasible this is. But from a cgroup POV, if you
> >> want proper memory isolation between groups, it seems to me that's the
> >> direction you'd have to take this in.
> > 
> > Thanks for this insight, I think you have definitely shown me where
> > things need to go here, and I agree that the goal should be to make
> > the pools and the shrinker memcg aware is the proper answer,
> > unfortunately I think we are long way from that at the moment, but
> > I'll need to do a bit more research. I wonder if we can agree on some
> > compromise points in order to move things forward from where they are
> > now.
> > 
> > Right now we have 0 accounting for any system memory allocations done
> > via GPU APIs, never mind the case where we have pools and evictions.
> > 
> > I think I sort of see 3 stages:
> > 1. Land some sort of accounting so you can at least see the active GPU
> > memory usage globally, per-node and per-cgroup - this series mostly
> > covers that, modulo any other feedback I get.
> > 2. Work on making the ttm subsystem cgroup aware and achieve the state
> > where we can shrink inside the cgroup first.
> > 3. Work on what to do with evicted memory for VRAM allocations, and
> > how best to integrate with dmem to possibly allow userspace to define
> > policy for this.
> > 
> >> Ah, no need to worry about it. The name is just a historical memcgism,
> >> from back when we first started charging "kernel" allocations, as
> >> opposed to the conventional, pageable userspace memory. It's no longer
> >> a super meaningful distinction, tbh.
> >>
> >> You can still add a separate counter for GPU memory.
> > 
> > Okay that's interesting, so I guess the only question vs the bespoke
> > ones is whether we use __GFP_ACCOUNT and whether there is benefit in
> > having page->memcg set.
> > 
> >>
> >> I agree this doesn't need to be a goal in itself. It would just be a
> >> side effect of charging through __GFP_ACCOUNT and uncharging inside
> >> __free_pages(). What's more important is that the charge lifetime is
> >> correlated with the actual memory allocation.
> > 
> > How much flexibility to do we have to evolve here, like if we start
> > with where the latest series I posted gets us (maybe with a CONFIG
> > option), then work on memcg aware shrinkers for the pools, then with
> > that in place it might make more sense to account across the complete
> > memory allocation. I think I'm also not sure if passing __GFP_ACCOUNT
> > to the dma allocators is supported, which is also something we need to
> > do, and having the bespoke API allows that to be possible.
> 
> Stop for a second.
> 
> As far as I can see the shrinker for the TTM pool should *not* be
> memcg aware. Background is that pages who enter the pool are
> considered freed by the application.

They're not free from a system POV until they're back in the page
allocator.

> The only reason we have the pool is to speed up allocation of
> uncached and write combined pages as well as work around for
> performance problems of the coherent DMA API.
> 
> The shrinker makes sure that the pages can be given back to the core
> memory management at any given time.

That's work. And it's a direct result of some cgroup having allocated
this memory. Why should somebody else have to clean it up?

The shrinker also doesn't run in isolation. It's invoked in the
broader context of there being a memory shortage, along with all the
other shrinkers in the system, along with file reclaim, and
potentially even swapping.

Why should all of this be externalized to other containers?

For proper memory isolation, the cleanup cost needs to be carried by
the cgroup that is responsible for it in the first place - not some
other container that's just trying to read() a file or malloc().

This memory isn't special. The majority of memcg-tracked memory is
shrinkable/reclaimable. In every single case it stays charged until
the shrink work has been completed, and the pages are handed back to
the allocator.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-15 16:08           ` Johannes Weiner
@ 2025-05-16  6:53             ` Christian König
  2025-05-16 14:53               ` Johannes Weiner
  0 siblings, 1 reply; 42+ messages in thread
From: Christian König @ 2025-05-16  6:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Airlie, dri-devel, tj, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On 5/15/25 18:08, Johannes Weiner wrote:
>> Stop for a second.
>>
>> As far as I can see the shrinker for the TTM pool should *not* be
>> memcg aware. Background is that pages who enter the pool are
>> considered freed by the application.
> 
> They're not free from a system POV until they're back in the page
> allocator.
> 
>> The only reason we have the pool is to speed up allocation of
>> uncached and write combined pages as well as work around for
>> performance problems of the coherent DMA API.
>>
>> The shrinker makes sure that the pages can be given back to the core
>> memory management at any given time.
> 
> That's work. And it's a direct result of some cgroup having allocated
> this memory. Why should somebody else have to clean it up?

Because the cgroup that allocated the memory is long gone. As soon as the pages enter the pool they must be considered freed by this cgroup.

The cgroup that originally allocated it has no reference to the memory any more, and also no way of giving it back to the core system.

Keeping the memory accounted to the cgroup that allocated it would break the whole system.

See, the pool only exists because of missing features in the core memory management.

> The shrinker also doesn't run in isolation. It's invoked in the
> broader context of there being a memory shortage, along with all the
> other shrinkers in the system, along with file reclaim, and
> potentially even swapping.
> 
> Why should all of this be externalized to other containers?

That's the whole purpose of the pool.

The pool only exists because the core memory management can't track the difference between uncached, write-combined and cached memory. It's similar to movable or DMA/DMA32.

> For proper memory isolation, the cleanup cost needs to be carried by
> the cgroup that is responsible for it in the first place - not some
> other container that's just trying to read() a file or malloc().

That makes no sense at all.

> This memory isn't special. The majority of memcg-tracked memory is
> shrinkable/reclaimable. In every single case it stays charged until
> the shrink work has been completed, and the pages are handed back to
> the allocator.

To be honest I think that memcg is the special case, and what TTM or the network subsystem does for per-device memory allocation is the norm.

Keeping memory accounted to the cgroup that originally allocated it, after this cgroup has freed it back to a pool, makes no sense at all, because the pool is exactly there to improve performance independently of which cgroup is allocating.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-16  6:53             ` Christian König
@ 2025-05-16 14:53               ` Johannes Weiner
  2025-05-16 15:35                 ` Christian König
  0 siblings, 1 reply; 42+ messages in thread
From: Johannes Weiner @ 2025-05-16 14:53 UTC (permalink / raw)
  To: Christian König
  Cc: Dave Airlie, dri-devel, tj, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On Fri, May 16, 2025 at 08:53:07AM +0200, Christian König wrote:
> On 5/15/25 18:08, Johannes Weiner wrote:
> >> Stop for a second.
> >>
> >> As far as I can see the shrinker for the TTM pool should *not* be
> >> memcg aware. Background is that pages who enter the pool are
> >> considered freed by the application.
> > 
> > They're not free from a system POV until they're back in the page
> > allocator.
> > 
> >> The only reason we have the pool is to speed up allocation of
> >> uncached and write combined pages as well as work around for
> >> performance problems of the coherent DMA API.
> >>
> >> The shrinker makes sure that the pages can be given back to the core
> >> memory management at any given time.
> > 
> > That's work. And it's a direct result of some cgroup having allocated
> > this memory. Why should somebody else have to clean it up?
> 
> Because the cgroup who has allocated the memory is long gone. As
> soon as the pages enter the pool they must be considered freed by
> this cgroup.

Nope, not at all.

> The cgroup who originally allocated it has no reference to the
> memory any more and also no way of giving it back to the core
> system.

Of course it does, the shrinker LRU.

Listen, none of this is even remotely new. This isn't the first cache
we're tracking, and it's not the first consumer that can outlive the
controlling cgroup.

> > The shrinker also doesn't run in isolation. It's invoked in the
> > broader context of there being a memory shortage, along with all the
> > other shrinkers in the system, along with file reclaim, and
> > potentially even swapping.
> > 
> > Why should all of this be externalized to other containers?
> 
> That's the whole purpose of the pool.
> 
> The pool only exists because the core memory management can't track
> the difference between unchached, write combined and cached
> memory. It's similar to moveable or DMA/DMA32.

And because of that, in the real world, you are operating a
shrinker-managed cache for those setup costs, yes? And I explained to
you the implications and consequences of that.

> > For proper memory isolation, the cleanup cost needs to be carried by
> > the cgroup that is responsible for it in the first place - not some
> > other container that's just trying to read() a file or malloc().
> 
> That makes no sense at all.

How about we stay in our respective lanes of expertise and find a more
productive way to get alignment on this, shall we?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-16 14:53               ` Johannes Weiner
@ 2025-05-16 15:35                 ` Christian König
  2025-05-16 16:41                   ` Johannes Weiner
  0 siblings, 1 reply; 42+ messages in thread
From: Christian König @ 2025-05-16 15:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Airlie, dri-devel, tj, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On 5/16/25 16:53, Johannes Weiner wrote:
> On Fri, May 16, 2025 at 08:53:07AM +0200, Christian König wrote:
>> On 5/15/25 18:08, Johannes Weiner wrote:
>>>> Stop for a second.
>>>>
>>>> As far as I can see the shrinker for the TTM pool should *not* be
>>>> memcg aware. Background is that pages who enter the pool are
>>>> considered freed by the application.
>>>
>>> They're not free from a system POV until they're back in the page
>>> allocator.
>>>
>>>> The only reason we have the pool is to speed up allocation of
>>>> uncached and write combined pages as well as work around for
>>>> performance problems of the coherent DMA API.
>>>>
>>>> The shrinker makes sure that the pages can be given back to the core
>>>> memory management at any given time.
>>>
>>> That's work. And it's a direct result of some cgroup having allocated
>>> this memory. Why should somebody else have to clean it up?
>>
>> Because the cgroup who has allocated the memory is long gone. As
>> soon as the pages enter the pool they must be considered freed by
>> this cgroup.
> 
> Nope, not at all.

Why not? The process belonging to this cgroup has freed up its memory so that other processes, potentially belonging to other cgroups, can allocate it again.

>> The cgroup who originally allocated it has no reference to the
>> memory any more and also no way of giving it back to the core
>> system.
> 
> Of course it does, the shrinker LRU.

No, it doesn't. The LRU handling here is global and not per-cgroup.

When some cgroup asks for memory, we also give it pages which were previously allocated to other cgroups.

Because who allocated it doesn't matter at that moment; what matters are the properties of the memory.

> Listen, none of this is even remotely new. This isn't the first cache
> we're tracking, and it's not the first consumer that can outlive the
> controlling cgroup.

Yes, I knew about all of that, and I find the existing handling extremely questionable as well.

Memory pools which are only used to improve allocation performance are something the kernel handles transparently and are completely outside of any cgroup tracking whatsoever.

>>> The shrinker also doesn't run in isolation. It's invoked in the
>>> broader context of there being a memory shortage, along with all the
>>> other shrinkers in the system, along with file reclaim, and
>>> potentially even swapping.
>>>
>>> Why should all of this be externalized to other containers?
>>
>> That's the whole purpose of the pool.
>>
>> The pool only exists because the core memory management can't track
>> the difference between unchached, write combined and cached
>> memory. It's similar to moveable or DMA/DMA32.
> 
> And because of that, in the real world, you are operating a
> shrinker-managed cache for those setup costs, yes? And I explained to
> you the implications and consequences of that.

Yes, and I explained that those implications and consequences are intentional.

>>> For proper memory isolation, the cleanup cost needs to be carried by
>>> the cgroup that is responsible for it in the first place - not some
>>> other container that's just trying to read() a file or malloc().
>>
>> That makes no sense at all.
> 
> How about we stay in our respective lanes of expertise and find a more
> productive way to get alignment on this, shall we?

I completely respect your opinion, but I have to point out that I have strong evidence that this approach won't work at all for this use case.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-15  3:02       ` Dave Airlie
  2025-05-15  8:55         ` Christian König
@ 2025-05-16 16:12         ` Johannes Weiner
  1 sibling, 0 replies; 42+ messages in thread
From: Johannes Weiner @ 2025-05-16 16:12 UTC (permalink / raw)
  To: Dave Airlie
  Cc: dri-devel, tj, christian.koenig, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On Thu, May 15, 2025 at 01:02:07PM +1000, Dave Airlie wrote:
> > I have to admit I'm pretty clueless about the gpu driver internals and
> > can't really judge how feasible this is. But from a cgroup POV, if you
> > want proper memory isolation between groups, it seems to me that's the
> > direction you'd have to take this in.
> 
> Thanks for this insight, I think you have definitely shown me where
> things need to go here, and I agree that the goal should be to make
> the pools and the shrinker memcg aware is the proper answer,
> unfortunately I think we are long way from that at the moment, but
> I'll need to do a bit more research.

Per-cgroup LRUs are quite common, so we have a lib to make this easy.

Take a look at <linux/list_lru.h>.

It provides a list type as a replacement for the bare struct
list_head, along with list_lru_add(), list_lru_del() helpers.

Call memcg_list_lru_alloc() before adding objects; this makes sure the
internal per-cgroup data structures are all set up.

list_lru_add()/del() take a memcg argument, so you have to work out
how you want to plumb that down. page->memcg still sounds easiest to
me. That doesn't mean you have to use __GFP_ACCOUNT, considering the
dma allocation path; you can always memcg_kmem_charge_page() them by
hand after the allocation is done, which will set up the backptr.

For the shrinker itself, there are list_lru_shrink_count() and
list_lru_shrink_walk() helpers which are designed to slot right into
the shrinker callbacks. You only have to implement the callback to
isolate an item - this is where the LRU specific locking and reclaim
rules live, but that should be straightforward in your case.
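
A rough sketch of how that could look on the TTM side - all the
ttm_pool_* names here are invented for illustration, and the exact
list_lru/shrinker signatures vary between kernel versions:

	#include <linux/list_lru.h>
	#include <linux/memcontrol.h>
	#include <linux/shrinker.h>

	static struct list_lru ttm_pool_lru;	/* hypothetical pool LRU */

	/* After allocation, charge the page so page_memcg() has a backptr. */
	static int ttm_pool_charge(struct page *page, gfp_t gfp)
	{
		return memcg_kmem_charge_page(page, gfp, 0);
	}

	/* Freeing into the pool: park the page on the per-memcg/node LRU. */
	static void ttm_pool_stash(struct page *page)
	{
		list_lru_add(&ttm_pool_lru, &page->lru, page_to_nid(page),
			     page_memcg(page));
	}

	/* Isolate callback: detach the item, defer the actual freeing. */
	static enum lru_status ttm_pool_isolate(struct list_head *item,
						struct list_lru_one *list,
						void *cb_arg)
	{
		list_lru_isolate_move(list, item, cb_arg);
		return LRU_REMOVED;
	}

	static unsigned long ttm_pool_count(struct shrinker *shrink,
					    struct shrink_control *sc)
	{
		return list_lru_shrink_count(&ttm_pool_lru, sc);
	}

	static unsigned long ttm_pool_scan(struct shrinker *shrink,
					   struct shrink_control *sc)
	{
		LIST_HEAD(dispose);
		unsigned long freed;
		struct page *page, *next;

		freed = list_lru_shrink_walk(&ttm_pool_lru, sc,
					     ttm_pool_isolate, &dispose);
		list_for_each_entry_safe(page, next, &dispose, lru) {
			list_del(&page->lru);
			/* reset caching attrs, uncharge, back to the buddy */
			memcg_kmem_uncharge_page(page, 0);
			__free_page(page);
		}
		return freed;
	}

Plus list_lru_init_memcg() at init time, and the shrinker registered
with SHRINKER_MEMCG_AWARE (and probably SHRINKER_NUMA_AWARE) so that
sc->memcg and sc->nid are populated for the per-cgroup, per-node walks.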

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-16 15:35                 ` Christian König
@ 2025-05-16 16:41                   ` Johannes Weiner
  2025-05-16 17:42                     ` Christian König
  0 siblings, 1 reply; 42+ messages in thread
From: Johannes Weiner @ 2025-05-16 16:41 UTC (permalink / raw)
  To: Christian König
  Cc: Dave Airlie, dri-devel, tj, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On Fri, May 16, 2025 at 05:35:12PM +0200, Christian König wrote:
> On 5/16/25 16:53, Johannes Weiner wrote:
> > On Fri, May 16, 2025 at 08:53:07AM +0200, Christian König wrote:
> >> The cgroup that originally allocated it has no reference to the
> >> memory any more and also no way of giving it back to the core
> >> system.
> > 
> > Of course it does, the shrinker LRU.
> 
> No it doesn't. The LRU handling here is global and not per cgroup.

Well, the discussion at hand is that it should be.

> > Listen, none of this is even remotely new. This isn't the first cache
> > we're tracking, and it's not the first consumer that can outlive the
> > controlling cgroup.
> 
> Yes, I knew about all of that, and I find the existing handling
> extremely questionable as well.

This code handles billions of containers every day, but we'll be sure
to consult you on the next redesign.

> Memory pools which are only used to improve allocation performance
> are something the kernel handles transparently and are completely
> outside of any cgroup tracking whatsoever.

You're describing a cache. It doesn't matter whether it's caching CPU
work, IO work or network packets.

What matters is what it takes to recycle those pages for other
purposes - especially non-GPU purposes.

And more importantly, *what other memory in other cgroups they
displace in the meantime*.

It's really not that difficult to see an isolation issue here.

Anyway, it doesn't look like there is a lot of value in continuing
this conversation, so I'm going to check out of this subthread.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-16 16:41                   ` Johannes Weiner
@ 2025-05-16 17:42                     ` Christian König
  2025-05-16 20:04                       ` Johannes Weiner
  0 siblings, 1 reply; 42+ messages in thread
From: Christian König @ 2025-05-16 17:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Airlie, dri-devel, tj, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On 5/16/25 18:41, Johannes Weiner wrote:
>>> Listen, none of this is even remotely new. This isn't the first cache
>>> we're tracking, and it's not the first consumer that can outlive the
>>> controlling cgroup.
>>
>> Yes, I knew about all of that, and I find the existing handling
>> extremely questionable as well.
> 
> This code handles billions of containers every day, but we'll be sure
> to consult you on the next redesign.

Well yes, please do so. I've been working on Linux for around 30 years now, and half of that on device memory management.

And the subsystems I maintain are used by literally billions of Android devices and HPC datacenters.

One of the reasons we don't have a good integration between device memory and cgroups is because specific requirements have been ignored while designing cgroups.

That cgroups works for a lot of use cases doesn't mean that it does for all of them.

>> Memory pools which are only used to improve allocation performance
>> are something the kernel handles transparently and are completely
>> outside of any cgroup tracking whatsoever.
> 
> You're describing a cache. It doesn't matter whether it's caching CPU
> work, IO work or network packets.

A cache description doesn't really fit this pool here.

The memory properties are similar to what GFP_DMA or GFP_DMA32 provide.

The reason we haven't moved this into the core memory management is that it is completely x86 specific and only used by a rather specific group of devices.

> What matters is what it takes to recycle those pages for other
> purposes - especially non-GPU purposes.

Exactly that, yes. From the TTM pool pages can be given back to the core OS at any time. It's just a bunch of extra CPU cycles.

> And more importantly, *what other memory in other cgroups they
> displace in the meantime*.

What do you mean with that?

Other cgroups are not affected by anything the allocating cgroup does, except for the extra CPU overhead while giving pages back to the core OS, but that is negligible; we haven't even optimized this code path.

> It's really not that difficult to see an isolation issue here.
> 
> Anyway, it doesn't look like there is a lot of value in continuing
> this conversation, so I'm going to check out of this subthread.

Please don't!

I absolutely need to get this knowledge into somebody's head on the cgroup side or we will have this discussion over and over again.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-16 17:42                     ` Christian König
@ 2025-05-16 20:04                       ` Johannes Weiner
  2025-05-16 20:25                         ` Dave Airlie
  0 siblings, 1 reply; 42+ messages in thread
From: Johannes Weiner @ 2025-05-16 20:04 UTC (permalink / raw)
  To: Christian König
  Cc: Dave Airlie, dri-devel, tj, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On Fri, May 16, 2025 at 07:42:08PM +0200, Christian König wrote:
> On 5/16/25 18:41, Johannes Weiner wrote:
> >>> Listen, none of this is even remotely new. This isn't the first cache
> >>> we're tracking, and it's not the first consumer that can outlive the
> >>> controlling cgroup.
> >>
> >> Yes, I knew about all of that, and I find the existing handling
> >> extremely questionable as well.
> > 
> > This code handles billions of containers every day, but we'll be sure
> > to consult you on the next redesign.
> 
> Well yes, please do so. I've been working on Linux for around 30 years now, and half of that on device memory management.
> 
> And the subsystems I maintain are used by literally billions of Android devices and HPC datacenters.
> 
> One of the reasons we don't have a good integration between device memory and cgroups is because specific requirements have been ignored while designing cgroups.
> 
> That cgroups works for a lot of use cases doesn't mean that it does for all of them.
> 
> >> Memory pools which are only used to improve allocation performance
> >> are something the kernel handles transparently and are completely
> >> outside of any cgroup tracking whatsoever.
> > 
> > You're describing a cache. It doesn't matter whether it's caching CPU
> > work, IO work or network packets.
> 
> A cache description doesn't really fit this pool here.
> 
> The memory properties are similar to what GFP_DMA or GFP_DMA32
> provide.
>
> The reason we haven't moved this into the core memory management is
> that it is completely x86 specific and only used by a rather
> specific group of devices.

I fully understand that. It's about memory properties.

What I think you're also saying is that the best solution would be
that you could ask the core MM for pages with a specific property, and
it would hand you pages that were previously freed with those same
properties. Or, if none such pages are on the freelists, it would grab
free pages with different properties and convert them on the fly.

For all intents and purposes, this free memory would then be trivially
fungible between drm use, non-drm use, and different cgroups - except
for a few CPU cycles when converting but that's *probably* negligible?
And now you could get rid of the "hack" in drm and didn't have to hang
on to special-property pages and implement a shrinker at all.

So far so good.

But that just isn't the implementation of today. And the devil is very
much in the details with this:

Your memory attribute conversions are currently tied to a *shrinker*.

This means the conversion doesn't trivially happen in the allocator,
it happens from *reclaim context*.

Now *your* shrinker is fairly cheap to run, so I do understand when
you're saying in exasperation: We give this memory back if somebody
needs it for other purposes. What *is* the big deal?

The *reclaim context* is the big deal. The problem is *all the other
shrinkers that run at this time as well*. Because you held onto those
pages long enough that they contributed to a bonafide, general memory
shortage situation. And *that* has consequences for other cgroups.

> > What matters is what it takes to recycle those pages for other
> > purposes - especially non-GPU purposes.
> 
> Exactly that, yes. From the TTM pool pages can be given back to the
> core OS at any time. It's just a bunch of extra CPU cycles.
>
> > And more importantly, *what other memory in other cgroups they
> > displace in the meantime*.
> 
> What do you mean with that?
> 
> Other cgroups are not affected by anything the allocating cgroup
> does, except for the extra CPU overhead while giving pages back to
> the core OS, but that is negligible; we haven't even optimized this
> code path.

I hope the answer to this question is apparent now.

But to illustrate the problem better, let's consider the following
container setup.

A system has 10G of memory. You run two cgroups on it that each have
a limit of 5G:

             system (10G)
            /      \
           A (5G)   B (5G)

Let's say A is running some database and is using its full 5G.

B is first doing some buffered IO, instantiating up to 5G worth of
file cache. Since the file cache is cgroup-aware, those pages will go
onto the private LRU list of B. And they will remain there until those
cache pages are fully reclaimed.

B then malloc()s. Because it's at the cgroup limit, it's forced into
cgroup reclaim on its private LRUs, where it recycles some of its old
page cache to satisfy the heap request.

A was not affected by anything that occurred in B.

---

Now let's consider the same starting scenario, but instead B is
interacting with the gpu driver and creates 5G worth of ttm objects.

Once it's done with them, you put the pages into the pool and uncharge
the memory from B.

Now B malloc()s again. The cgroup is not maxed out - it's empty in
fact. So no cgroup reclaim happens.

However, at this point, A has 5G allocated, and there are still 5G in
the drm driver. The *system itself* is out of memory now.

So B enters *global* reclaim to find pages for its heap request.

It invokes all the shrinkers and runs reclaim on all cgroups.

In the process, it will claw back some pages from ttm; but it will
*also* reclaim all kinds of caches, maybe even swap stuff, from A!

Now *A* starts faulting due to the pages that were stolen and tries to
allocate. But memory is still full, because B backfilled it with heap.

So now *A* goes into global reclaim as well: it takes some pages from
the ttm pool, some from B, *and some from itself*.

It can take several iterations of this until the ttm pool has been
fully drained, and A has all its memory faulted back in and is running
at full speed again.

In this scenario, A was directly affected, potentially quite severely,
by B's actions.

This is the definition of containment failure.

Consider two aggravating additions to this scenario:

1. While A *could* in theory benefit from the pool pages too, let's
   say it never interacts with the GPU at all. So it's paying the cost
   for something that *only benefits B*.

2. B is malicious. It does the above sequence - interact with ttm, let
   the pool escape its cgroup, then allocate heap - rapidly over and
   over again. It effectively DoSes A.

---

So as long as the pool is implemented as it is today, it should very
much be per cgroup.

Reclaim lifetime means it can displace other memory with reclaim
lifetime.

You cannot assume other cgroups benefit from this memory.

You cannot trust other cgroups to play nice.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-16 20:04                       ` Johannes Weiner
@ 2025-05-16 20:25                         ` Dave Airlie
  2025-05-18 16:28                           ` Christian König
  2025-05-22 19:51                           ` Tejun Heo
  0 siblings, 2 replies; 42+ messages in thread
From: Dave Airlie @ 2025-05-16 20:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christian König, dri-devel, tj, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On Sat, 17 May 2025 at 06:04, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 16, 2025 at 07:42:08PM +0200, Christian König wrote:
> > On 5/16/25 18:41, Johannes Weiner wrote:
> > >>> Listen, none of this is even remotely new. This isn't the first cache
> > >>> we're tracking, and it's not the first consumer that can outlive the
> > >>> controlling cgroup.
> > >>
> > >> Yes, I knew about all of that, and I find the existing handling
> > >> extremely questionable as well.
> > >
> > > This code handles billions of containers every day, but we'll be sure
> > > to consult you on the next redesign.
> >
> > Well yes, please do so. I've been working on Linux for around 30 years now, and half of that on device memory management.
> >
> > And the subsystems I maintain are used by literally billions of Android devices and HPC datacenters.
> >
> > One of the reasons we don't have a good integration between device memory and cgroups is because specific requirements have been ignored while designing cgroups.
> >
> > That cgroups works for a lot of use cases doesn't mean that it does for all of them.
> >
> > >> Memory pools which are only used to improve allocation performance
> > >> are something the kernel handles transparently and are completely
> > >> outside of any cgroup tracking whatsoever.
> > >
> > > You're describing a cache. It doesn't matter whether it's caching CPU
> > > work, IO work or network packets.
> >
> > A cache description doesn't really fit this pool here.
> >
> > The memory properties are similar to what GFP_DMA or GFP_DMA32
> > provide.
> >
> > The reason we haven't moved this into the core memory management is
> > that it is completely x86 specific and only used by a rather
> > specific group of devices.
>
> I fully understand that. It's about memory properties.
>
> What I think you're also saying is that the best solution would be
> that you could ask the core MM for pages with a specific property, and
> it would hand you pages that were previously freed with those same
> properties. Or, if no such pages are on the freelists, it would grab
> free pages with different properties and convert them on the fly.
>
> For all intents and purposes, this free memory would then be trivially
> fungible between drm use, non-drm use, and different cgroups - except
> for a few CPU cycles when converting but that's *probably* negligible?
> And now you could get rid of the "hack" in drm and didn't have to hang
> on to special-property pages and implement a shrinker at all.
>
> So far so good.
>
> But that just isn't the implementation of today. And the devil is very
> much in the details with this:
>
> Your memory attribute conversions are currently tied to a *shrinker*.
>
> This means the conversion doesn't trivially happen in the allocator,
> it happens from *reclaim context*.
>
> Now *your* shrinker is fairly cheap to run, so I do understand when
> you're saying in exasperation: We give this memory back if somebody
> needs it for other purposes. What *is* the big deal?
>
> The *reclaim context* is the big deal. The problem is *all the other
> shrinkers that run at this time as well*. Because you held onto those
> pages long enough that they contributed to a bonafide, general memory
> shortage situation. And *that* has consequences for other cgroups.

I think this is where we have 2 options:
(a) moving this stuff into core mm and out of shrinker context
(b) fix our shrinker to be cgroup aware and solve that first.

The main question I have for Christian is: can you give me a list of
use cases that this will seriously negatively affect if we proceed
with (b)?

From my naive desktop and HPC use case scenarios, I'm not seeing a
massive hit, though maybe I'd see more consistency in application
overheads inside a cgroup.

Desktop use-case:
The user session and everything inside the user session (compositor,
apps) are all in a single cgroup, so any pool memory will be reusable
across the user's active session. If there are multiple users, they
won't have the benefit of pages from others, but their own pool will
be available.

HPC use-case:
One cgroup per application running in some sort of batch system. There
will be a downside at app launch if a bunch of other applications have
already been launched on the machine and have filled the pool, but in
the cold-start case the app won't get any worse behaviour than its
current worst case. It will get the consistent behaviour of initial
allocations being worst case in a new cgroup, vs now where it might
benefit from previously running cgroups having allocated pooled
memory, but I'm not sure the benefit outweighs the upside here, since
they really want containers contained.

Android? I've no idea.

Like what can we live with here, vs what needs to be a Kconfig option
vs what needs to be a kernel command line option?

I'm also happy to look at (a), but I think for (a) it's not just the
uncached pool that is the problem; the dma pools will be harder to
deal with.

Dave.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-16 20:25                         ` Dave Airlie
@ 2025-05-18 16:28                           ` Christian König
  2025-05-19  6:18                             ` Dave Airlie
  2025-05-22 19:51                           ` Tejun Heo
  1 sibling, 1 reply; 42+ messages in thread
From: Christian König @ 2025-05-18 16:28 UTC (permalink / raw)
  To: Dave Airlie, Johannes Weiner
  Cc: dri-devel, tj, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, cgroups, Waiman Long, simona

On 5/16/25 22:25, Dave Airlie wrote:
> On Sat, 17 May 2025 at 06:04, Johannes Weiner <hannes@cmpxchg.org> wrote:
>>> The memory properties are similar to what GFP_DMA or GFP_DMA32
>>> provide.
>>>
>>> The reason we haven't moved this into the core memory management is
>>> that it is completely x86 specific and only used by a rather
>>> specific group of devices.
>>
>> I fully understand that. It's about memory properties.
>>
>> What I think you're also saying is that the best solution would be
>> that you could ask the core MM for pages with a specific property, and
>> it would hand you pages that were previously freed with those same
>> properties. Or, if no such pages are on the freelists, it would grab
>> free pages with different properties and convert them on the fly.
>>
>> For all intents and purposes, this free memory would then be trivially
>> fungible between drm use, non-drm use, and different cgroups - except
>> for a few CPU cycles when converting but that's *probably* negligible?
>> And now you could get rid of the "hack" in drm and didn't have to hang
>> on to special-property pages and implement a shrinker at all.
>>
>> So far so good.
>>
>> But that just isn't the implementation of today. And the devil is very
>> much in the details with this:
>>
>> Your memory attribute conversions are currently tied to a *shrinker*.
>>
>> This means the conversion doesn't trivially happen in the allocator,
>> it happens from *reclaim context*.

Ah! At least I now understand your concern here.

>> Now *your* shrinker is fairly cheap to run, so I do understand when
>> you're saying in exasperation: We give this memory back if somebody
>> needs it for other purposes. What *is* the big deal?
>>
>> The *reclaim context* is the big deal. The problem is *all the other
>> shrinkers that run at this time as well*. Because you held onto those
>> pages long enough that they contributed to a bonafide, general memory
>> shortage situation. And *that* has consequences for other cgroups.

No it doesn't, or at least not as much as you think.

We have gone back and forth on this multiple times already when discussing the shrinker implementations. See the DRM mailing list about both the TTM and the GEM shared mem shrinker.

The TTM pool shrinker is basically just a nice-to-have feature which is used to avoid denial-of-service attacks and kicks in when use cases change, e.g. between installing software (gcc) and running software (Blender, ROCm etc.).

In other words the TTM shrinker is not even optimized and spends tons of extra CPU cycles because the expectation is that it never really triggers in practice.

> I think this is where we have 2 options:
> (a) moving this stuff into core mm and out of shrinker context
> (b) fix our shrinker to be cgroup aware and solve that first.

(c) give better priorities to the shrinker API.

The shrinker API, for example, assumes that its users must scan the pages to be able to clean them up.

But implementations like the TTM pool could basically just throw away as many pages as necessary.

So by being able to tell the shrinker "ask us first on reclaim" we would completely solve the problem.

That was considered before but never done because it's basically just nice to have and most likely not really important.

> The main question I have for Christian is: can you give me a list of
> use cases that this will seriously negatively affect if we proceed
> with (b)?

It would basically render the whole TTM pool useless, or at least massively limit its usefulness.

See, the main benefit is to be able to quickly allocate buffers for HW use cases which need them, e.g. scanout on APUs, PSP for secure playback etc.

The idea is that when you alt+tab or swipe between applications, the new application can just grab the memory the previous application has just released.

And yes, it is explicitly required that those applications can be in different cgroups.

> From my naive desktop and HPC use case scenarios, I'm not seeing a
> massive hit, though maybe I'd see more consistency in application
> overheads inside a cgroup.

Yeah, for HPC it is most likely completely irrelevant; for desktop it might have some minor use cases.

But the killer argument is that we do have some cloud gaming and embedded use cases where it is really important to get this right.

> Android? I've no idea.

Mainline Android currently has its own complete way of doing mostly the same thing cgroups do, but in userspace.

The problem is that this doesn't account for memory allocated in kernel space. See the discussion on DMA-buf accounting with T.J. and the older discussion with Greg (sysfs) about that.

In my opinion it would make sense to just use cgroups for a lot of that as well, but we would need to convince Google of that.

Regards,
Christian.

> Like what can we live with here, vs what needs to be a Kconfig option
> vs what needs to be a kernel command line option?
> 
> I'm also happy to look at (a), but I think for (a) it's not just the
> uncached pool that is the problem; the dma pools will be harder to
> deal with.
> 
> Dave.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-18 16:28                           ` Christian König
@ 2025-05-19  6:18                             ` Dave Airlie
  2025-05-19  8:26                               ` Christian König
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Airlie @ 2025-05-19  6:18 UTC (permalink / raw)
  To: Christian König
  Cc: Johannes Weiner, dri-devel, tj, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On Mon, 19 May 2025 at 02:28, Christian König <christian.koenig@amd.com> wrote:
>
> On 5/16/25 22:25, Dave Airlie wrote:
> > On Sat, 17 May 2025 at 06:04, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >>> The memory properties are similar to what GFP_DMA or GFP_DMA32
> >>> provide.
> >>>
> >>> The reason we haven't moved this into the core memory management is
> >>> that it is completely x86 specific and only used by a rather
> >>> specific group of devices.
> >>
> >> I fully understand that. It's about memory properties.
> >>
> >> What I think you're also saying is that the best solution would be
> >> that you could ask the core MM for pages with a specific property, and
> >> it would hand you pages that were previously freed with those same
> >> properties. Or, if no such pages are on the freelists, it would grab
> >> free pages with different properties and convert them on the fly.
> >>
> >> For all intents and purposes, this free memory would then be trivially
> >> fungible between drm use, non-drm use, and different cgroups - except
> >> for a few CPU cycles when converting but that's *probably* negligible?
> >> And now you could get rid of the "hack" in drm and didn't have to hang
> >> on to special-property pages and implement a shrinker at all.
> >>
> >> So far so good.
> >>
> >> But that just isn't the implementation of today. And the devil is very
> >> much in the details with this:
> >>
> >> Your memory attribute conversions are currently tied to a *shrinker*.
> >>
> >> This means the conversion doesn't trivially happen in the allocator,
> >> it happens from *reclaim context*.
>
> Ah! At least I now understand your concern here.
>
> >> Now *your* shrinker is fairly cheap to run, so I do understand when
> >> you're saying in exasperation: We give this memory back if somebody
> >> needs it for other purposes. What *is* the big deal?
> >>
> >> The *reclaim context* is the big deal. The problem is *all the other
> >> shrinkers that run at this time as well*. Because you held onto those
> >> pages long enough that they contributed to a bonafide, general memory
> >> shortage situation. And *that* has consequences for other cgroups.
>
> No it doesn't, or at least not as much as you think.
>
> We have gone back and forth on this multiple times already when discussing the shrinker implementations. See the DRM mailing list about both the TTM and the GEM shared mem shrinker.
>
> The TTM pool shrinker is basically just a nice-to-have feature which is used to avoid denial-of-service attacks and kicks in when use cases change, e.g. between installing software (gcc) and running software (Blender, ROCm etc.).
>
> In other words the TTM shrinker is not even optimized and spends tons of extra CPU cycles because the expectation is that it never really triggers in practice.
>
> > I think this is where we have 2 options:
> > (a) moving this stuff into core mm and out of shrinker context
> > (b) fix our shrinker to be cgroup aware and solve that first.
>
> (c) give better priorities to the shrinker API.
>
> The shrinker API, for example, assumes that its users must scan the pages to be able to clean them up.

Well, my again-naive approach is to just add simpler low-overhead
shrinkers to the start of the shrinker list; if they free up enough
memory then we win, otherwise we were in reclaim anyway.

However, this raises the question of whether just going into reclaim
and having to touch any shrinkers at all is bad; if the overheads of
just doing that aren't acceptable then we would need to come up with a
better way, I suspect.

Adding a single shrinker flag to put the ttm shrinker at the top of
the list is pretty trivial.
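
Roughly like this - SHRINKER_HIGH_PRIORITY is a name invented here
(no such flag exists upstream), and the registration path is
simplified from mm/shrinker.c:

	/* hypothetical flag: consult this shrinker before all others */
	#define SHRINKER_HIGH_PRIORITY	BIT(7)	/* whatever bit is free */

	void shrinker_register(struct shrinker *shrinker)
	{
		/* locking and flag bookkeeping elided */
		if (shrinker->flags & SHRINKER_HIGH_PRIORITY)
			list_add_rcu(&shrinker->list, &shrinker_list);
		else
			list_add_tail_rcu(&shrinker->list, &shrinker_list);
	}

so that cheap "just hand the pages back" shrinkers like the TTM pool
get asked before reclaim starts on the LRUs and the expensive
scanning shrinkers.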

Thanks for the use cases that probably matter; I can see the overhead
reduction being useful for the online gaming workloads.

There probably isn't much appetite to just migrate the ttm pools into
the core mm; I see a couple of other users like sound doing set_memory_*
calls, but I doubt they are on the radar for how much it costs.

Dave.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-19  6:18                             ` Dave Airlie
@ 2025-05-19  8:26                               ` Christian König
  0 siblings, 0 replies; 42+ messages in thread
From: Christian König @ 2025-05-19  8:26 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Johannes Weiner, dri-devel, tj, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On 5/19/25 08:18, Dave Airlie wrote:
> On Mon, 19 May 2025 at 02:28, Christian König <christian.koenig@amd.com> wrote:
>>
>> On 5/16/25 22:25, Dave Airlie wrote:
>>> On Sat, 17 May 2025 at 06:04, Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>>> The memory properties are similar to what GFP_DMA or GFP_DMA32
>>>>> provide.
>>>>>
>>>>> The reason we haven't moved this into the core memory management is
>>>>> that it is completely x86 specific and only used by a rather
>>>>> specific group of devices.
>>>>
>>>> I fully understand that. It's about memory properties.
>>>>
>>>> What I think you're also saying is that the best solution would be
>>>> that you could ask the core MM for pages with a specific property, and
>>>> it would hand you pages that were previously freed with those same
>>>> properties. Or, if no such pages are on the freelists, it would grab
>>>> free pages with different properties and convert them on the fly.
>>>>
>>>> For all intents and purposes, this free memory would then be trivially
>>>> fungible between drm use, non-drm use, and different cgroups - except
>>>> for a few CPU cycles when converting but that's *probably* negligible?
>>>> And now you could get rid of the "hack" in drm and didn't have to hang
>>>> on to special-property pages and implement a shrinker at all.
>>>>
>>>> So far so good.
>>>>
>>>> But that just isn't the implementation of today. And the devil is very
>>>> much in the details with this:
>>>>
>>>> Your memory attribute conversions are currently tied to a *shrinker*.
>>>>
>>>> This means the conversion doesn't trivially happen in the allocator,
>>>> it happens from *reclaim context*.
>>
>> Ah! At least I now understand your concern here.
>>
>>>> Now *your* shrinker is fairly cheap to run, so I do understand when
>>>> you're saying in exasperation: We give this memory back if somebody
>>>> needs it for other purposes. What *is* the big deal?
>>>>
>>>> The *reclaim context* is the big deal. The problem is *all the other
>>>> shrinkers that run at this time as well*. Because you held onto those
>>>> pages long enough that they contributed to a bonafide, general memory
>>>> shortage situation. And *that* has consequences for other cgroups.
>>
>> No it doesn't, or at least not as much as you think.
>>
>> We have gone back and forth on this multiple times already when discussing the shrinker implementations. See the DRM mailing list about both the TTM and the GEM shared mem shrinker.
>>
>> The TTM pool shrinker is basically just a nice-to-have feature which is used to avoid denial-of-service attacks and kicks in when use cases change, e.g. between installing software (gcc) and running software (Blender, ROCm etc.).
>>
>> In other words the TTM shrinker is not even optimized and spends tons of extra CPU cycles because the expectation is that it never really triggers in practice.
>>
>>> I think this is where we have 2 options:
>>> (a) moving this stuff into core mm and out of shrinker context
>>> (b) fix our shrinker to be cgroup aware and solve that first.
>>
>> (c) give better priorities to the shrinker API.
>>
>> The shrinker API, for example, assumes that its users must scan the pages to be able to clean them up.
> 
> Well, my again-naive approach is to just add simpler low-overhead
> shrinkers to the start of the shrinker list; if they free up enough
> memory then we win, otherwise we were in reclaim anyway.
> 
> However, this raises the question of whether just going into reclaim
> and having to touch any shrinkers at all is bad; if the overheads of
> just doing that aren't acceptable then we would need to come up with a
> better way, I suspect.

Yes, but in that case we would need to come up with a better way for the non-cgroup case as well.

> Adding a single shrinker flag to put the ttm shrinker at the top of
> the list is pretty trivial.

I'm not 100% sure when the shrinkers actually start doing something.

For the TTM pool it would make sense to let memory management look into it before it even looks into the page cache, in other words very early.

> Thanks for the use cases that probably matter; I can see the overhead
> reduction being useful for the online gaming workloads.
> 
> There probably isn't much appetite to just migrate the ttm pools into
> the core mm; I see a couple of other users like sound doing set_memory_*
> calls, but I doubt they are on the radar for how much it costs.

The long term plan was always to move the whole TTM pool into the DMA API.

This way we can not only potentially abstract away all this architecture-specific stuff, but also remove other cases where people implemented pools in drivers because the DMA API was just not fast enough at coming up with coherent memory, for example.

It's just that there was never a technical need to look into this, so it never hurt hard enough to actually clean that up.

The GFP_DMA and GFP_DMA32 flags are also seen as kind of historical leftovers.

So moving the logic into the core memory management is most likely a no-go from upstream, but what we could do is to move it into the DMA API and let the DMA API then interact with the core memory management in a much better way than what we can do now with the various shrinkers.

Regards,
Christian.


> 
> Dave.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-13  7:54     ` Johannes Weiner
  2025-05-15  3:02       ` Dave Airlie
@ 2025-05-21  2:23       ` Dave Airlie
  2025-05-21  7:50         ` Christian König
  2025-05-21 14:43         ` Johannes Weiner
  1 sibling, 2 replies; 42+ messages in thread
From: Dave Airlie @ 2025-05-21  2:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: dri-devel, tj, christian.koenig, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

>
> So in the GPU case, you'd charge on allocation, free objects into a
> cgroup-specific pool, and shrink using a cgroup-specific LRU
> list. Freed objects can be reused by this cgroup, but nobody else.
> They're reclaimed through memory pressure inside the cgroup, not due
> to the action of others. And all allocated memory is accounted for.
>
> I have to admit I'm pretty clueless about the gpu driver internals and
> can't really judge how feasible this is. But from a cgroup POV, if you
> want proper memory isolation between groups, it seems to me that's the
> direction you'd have to take this in.

I've been digging into this a bit today, to try and work out what
various paths forward might look like and run into a few impedance
mismatches.

1. TTM doesn't pool objects, it pools pages. TTM objects vary in
size; we don't need to keep the sort of special allocator (size
buckets etc.) that we would need if we cached sized objects. list_lru
doesn't work on pages; if we were pooling the ttm objects I can see
being able to enable list_lru. But I'm seeing increased complexity for
no major return, though I might dig a bit more into whether caching
objects might help.

2. list_lru isn't suitable for pages; AFAICS we have to stick the page
into another object to store it in the list_lru, which would mean we'd
be allocating yet another wrapper object. Currently TTM uses the page
LRU pointer to add it to the shrinker_list, which is simple and low
overhead.

If we wanted to stick with keeping pages in the pool, I do feel that
moving the pool code closer to the mm core and having some sort of
more tightly integrated reclaim would avoid the overheads. Now in an
ideal world we'd get a page flag like PG_uncached, and we could keep
an uncached inactive list per memcg/node and migrate pages off it, but
I don't think anyone is willing to give us a page flag for this, so I
think we do need to find a compromise that isn't ideal but works for
us now. I've also played a bit with the idea of MEMCG_LOWOVERHEAD,
which adds a shrinker to the start of the shrinker list instead of the
end, and registering the TTM pool shrinker as one of those.

Have I missed anything here that might make this easier?

Dave.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-21  2:23       ` Dave Airlie
@ 2025-05-21  7:50         ` Christian König
  2025-05-21 14:43         ` Johannes Weiner
  1 sibling, 0 replies; 42+ messages in thread
From: Christian König @ 2025-05-21  7:50 UTC (permalink / raw)
  To: Dave Airlie, Johannes Weiner
  Cc: dri-devel, tj, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, cgroups, Waiman Long, simona

On 5/21/25 04:23, Dave Airlie wrote:
>>
>> So in the GPU case, you'd charge on allocation, free objects into a
>> cgroup-specific pool, and shrink using a cgroup-specific LRU
>> list. Freed objects can be reused by this cgroup, but nobody else.
>> They're reclaimed through memory pressure inside the cgroup, not due
>> to the action of others. And all allocated memory is accounted for.
>>
>> I have to admit I'm pretty clueless about the gpu driver internals and
>> can't really judge how feasible this is. But from a cgroup POV, if you
>> want proper memory isolation between groups, it seems to me that's the
>> direction you'd have to take this in.
> 
> I've been digging into this a bit today, to try and work out what
> various paths forward might look like and run into a few impedance
> mismatches.
> 
> 1. TTM doesn't pool objects, it pools pages. TTM objects vary in
> size; we don't need to keep the sort of special allocator (size
> buckets etc.) that we would need if we cached sized objects. list_lru
> doesn't work on pages; if we were pooling the ttm objects I can see
> being able to enable list_lru. But I'm seeing increased complexity for
> no major return, though I might dig a bit more into whether caching
> objects might help.
> 
> 2. list_lru isn't suitable for pages; AFAICS we have to stick the page
> into another object to store it in the list_lru, which would mean we'd
> be allocating yet another wrapper object. Currently TTM uses the page
> LRU pointer to add it to the shrinker_list, which is simple and low
> overhead.
> 
> If we wanted to stick with keeping pages in the pool, I do feel that
> moving the pool code closer to the mm core and having some sort of
> more tightly integrated reclaim would avoid the overheads. Now in an
> ideal world we'd get a page flag like PG_uncached, and we could keep
> an uncached inactive list per memcg/node and migrate pages off it, but
> I don't think anyone is willing to give us a page flag for this, so I
> think we do need to find a compromise that isn't ideal but works for
> us now. I've also played a bit with the idea of MEMCG_LOWOVERHEAD,
> which adds a shrinker to the start of the shrinker list instead of the
> end, and registering the TTM pool shrinker as one of those.
> 
> Have I missed anything here that might make this easier?

Just for completeness of the picture, there is also the generic memory pool allocator (see include/linux/genalloc.h) which is an alternative to kmalloc() for uncached memory.

But we never used it because it isn't optimized for pages, but rather for objects of a certain size (e.g. like kmalloc/vmalloc).
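
For reference, the genalloc API looks roughly like this - vaddr,
region_size and nid are placeholders for wherever the backing region
would come from:

	#include <linux/genalloc.h>

	static void genalloc_example(void *vaddr, size_t region_size, int nid)
	{
		struct gen_pool *pool;
		unsigned long chunk;

		pool = gen_pool_create(6, nid);	/* 64-byte minimum allocation */
		if (!pool)
			return;
		gen_pool_add(pool, (unsigned long)vaddr, region_size, nid);

		chunk = gen_pool_alloc(pool, 4096);	/* carve out 4K */
		if (chunk)
			gen_pool_free(pool, chunk, 4096);

		gen_pool_destroy(pool);
	}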

Apart from that, the general idea of having cgroup-specific pools sounds bad to me. There is no technical need for it; it only exists because we failed to integrate those device pools into the core memory management.

Regards,
Christian.


> 
> Dave.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-21  2:23       ` Dave Airlie
  2025-05-21  7:50         ` Christian König
@ 2025-05-21 14:43         ` Johannes Weiner
  2025-05-22  7:03           ` Dave Airlie
  1 sibling, 1 reply; 42+ messages in thread
From: Johannes Weiner @ 2025-05-21 14:43 UTC (permalink / raw)
  To: Dave Airlie
  Cc: dri-devel, tj, christian.koenig, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On Wed, May 21, 2025 at 12:23:58PM +1000, Dave Airlie wrote:
> >
> > So in the GPU case, you'd charge on allocation, free objects into a
> > cgroup-specific pool, and shrink using a cgroup-specific LRU
> > list. Freed objects can be reused by this cgroup, but nobody else.
> > They're reclaimed through memory pressure inside the cgroup, not due
> > to the action of others. And all allocated memory is accounted for.
> >
> > I have to admit I'm pretty clueless about the gpu driver internals and
> > can't really judge how feasible this is. But from a cgroup POV, if you
> > want proper memory isolation between groups, it seems to me that's the
> > direction you'd have to take this in.
> 
> I've been digging into this a bit today, to try and work out what
> various paths forward might look like and run into a few impedance
> mismatches.
> 
> 1. TTM doesn't pool objects, it pools pages. TTM objects vary in
> size; we don't need to keep the sort of special allocator (size
> buckets etc.) that we would need if we cached sized objects. list_lru
> doesn't work on pages; if we were pooling the ttm objects I can see
> being able to enable list_lru. But I'm seeing increased complexity for
> no major return, though I might dig a bit more into whether caching
> objects might help.
> 
> 2. list_lru isn't suitable for pages; AFAICS we have to stick the page
> into another object to store it in the list_lru, which would mean we'd
> be allocating yet another wrapper object. Currently TTM uses the page
> LRU pointer to add it to the shrinker_list, which is simple and low
> overhead.

Why wouldn't you be able to use the page LRU list_head with list_lru?

list_lru_add(&ttm_pool_lru, &page->lru, page_to_nid(page), page_memcg(page));

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-21 14:43         ` Johannes Weiner
@ 2025-05-22  7:03           ` Dave Airlie
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Airlie @ 2025-05-22  7:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: dri-devel, tj, christian.koenig, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

On Thu, 22 May 2025 at 00:43, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, May 21, 2025 at 12:23:58PM +1000, Dave Airlie wrote:
> > >
> > > So in the GPU case, you'd charge on allocation, free objects into a
> > > cgroup-specific pool, and shrink using a cgroup-specific LRU
> > > list. Freed objects can be reused by this cgroup, but nobody else.
> > > They're reclaimed through memory pressure inside the cgroup, not due
> > > to the action of others. And all allocated memory is accounted for.
> > >
> > > I have to admit I'm pretty clueless about the gpu driver internals and
> > > can't really judge how feasible this is. But from a cgroup POV, if you
> > > want proper memory isolation between groups, it seems to me that's the
> > > direction you'd have to take this in.
> >
> > I've been digging into this a bit today, to try and work out what
> > various paths forward might look like and run into a few impedance
> > mismatches.
> >
> > 1. TTM doesn't pool objects, it pools pages. TTM objects vary in
> > size; we don't need to keep the sort of special allocator (size
> > buckets etc.) that we would need if we cached sized objects. list_lru
> > doesn't work on pages; if we were pooling the ttm objects I can see
> > being able to enable list_lru. But I'm seeing increased complexity for
> > no major return, though I might dig a bit more into whether caching
> > objects might help.
> >
> > 2. list_lru isn't suitable for pages; AFAICS we have to stick the page
> > into another object to store it in the list_lru, which would mean we'd
> > be allocating yet another wrapper object. Currently TTM uses the page
> > LRU pointer to add it to the shrinker_list, which is simple and low
> > overhead.
>
> Why wouldn't you be able to use the page LRU list_head with list_lru?
>
> list_lru_add(&ttm_pool_lru, &page->lru, page_to_nid(page), page_memcg(page));

I for some reason got it into my head that list_lru entries weren't
list_heads; not sure why. I guess I shall spend next week exploring
this possibility.

Dave.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-16 20:25                         ` Dave Airlie
  2025-05-18 16:28                           ` Christian König
@ 2025-05-22 19:51                           ` Tejun Heo
  2025-05-23  7:58                             ` Christian König
  1 sibling, 1 reply; 42+ messages in thread
From: Tejun Heo @ 2025-05-22 19:51 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Johannes Weiner, Christian König, dri-devel, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, cgroups, Waiman Long,
	simona

Hello,

On Sat, May 17, 2025 at 06:25:02AM +1000, Dave Airlie wrote:
> I think this is where we have 2 options:
> (a) moving this stuff into core mm and out of shrinker context
> (b) fix our shrinker to be cgroup aware and solve that first.
> 
> The main question I have for Christian is: can you give me a list of
> use cases that this will seriously negatively affect if we proceed
> with (b)?

This thread seems to have gone a bit haywire and we may be losing some
context. I'm not sure not doing (b) is an option for acceptable isolation. I
think Johannes already raised the issue but please consider the following
scenario:

- There's a GPU workload which uses a sizable amount of system memory for
  the pool being discussed in this thread. This GPU workload is very
  important, so we want to make sure that other activities in the system
  don't bother it. We give it plenty of isolated CPUs and protect its memory
  with high enough memory.low.

- Because most CPUs are largely idling while GPU is busy, there are plenty
  of CPU cycles which can be used without impacting the GPU workload, so we
  decide to do some data preprocessing which involves scanning large data
  set creating memory usage which is mostly streaming but still has enough
  look backs to promote them in the LRU lists.

IIUC, in the shared pool model, the GPU memory which isn't currently being
used would sit outside the cgroup, and thus outside the protection of
memory.low. Again, IIUC, you want to make this pool priority reclaimed
because reclaiming is nearly free and you don't want to create undue
pressure on other reclaimable resources.

However, what would happen in the above scenario under such an implementation
is that the GPU workload would keep losing its memory pool to the background
memory pressure created by the streaming memory usage. It's also easy to
expand on scenarios like this with other GPU workloads with differing
priorities and memory allotments and so on.

There may be some basic misunderstanding here. If a resource is worth
caching, that usually indicates that there's some significant cost
associated with un-caching the resource. It doesn't matter whether that cost
is on the creation or destruction path. Here, the alloc path is expensive
and free path is nearly free. However, this doesn't mean that we can get
free isolation while bunching them together for immediate reclaim as others
would be able to force you into alloc operations that you wouldn't need
otherwise. If someone else can make you pay for something that you otherwise
wouldn't, that resource is not isolated.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-22 19:51                           ` Tejun Heo
@ 2025-05-23  7:58                             ` Christian König
  2025-05-23 17:06                               ` Tejun Heo
  0 siblings, 1 reply; 42+ messages in thread
From: Christian König @ 2025-05-23  7:58 UTC (permalink / raw)
  To: Tejun Heo, Dave Airlie
  Cc: Johannes Weiner, dri-devel, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Waiman Long, simona

Hi Tejun,

first of all thanks to Johannes and you for the input, it took me quite some time to actually get a grip on your concern here.

On 5/22/25 21:51, Tejun Heo wrote:
> Hello,
> 
> On Sat, May 17, 2025 at 06:25:02AM +1000, Dave Airlie wrote:
>> I think this is where we have 2 options:
>> (a) moving this stuff into core mm and out of shrinker context
>> (b) fix our shrinker to be cgroup aware and solve that first.
>>
>> The main question I have for Christian is: can you give me a list of
>> use cases that this will seriously negatively affect if we proceed
>> with (b)?
> 
> This thread seems to have gone a bit haywire and we may be losing some
> context. I'm not sure not doing (b) is an option for acceptable isolation. I
> think Johannes already raised the issue but please consider the following
> scenario:
> 
> - There's a GPU workload which uses a sizable amount of system memory for
>   the pool being discussed in this thread. This GPU workload is very
>   important, so we want to make sure that other activities in the system
>   don't bother it. We give it plenty of isolated CPUs and protect its memory
>   with high enough memory.low.

That situation simply doesn't happen. See, isolation is *not* a requirement for the pool.

It's basically the opposite: we use the pool to break the isolation between processes.

> - Because most CPUs are largely idling while GPU is busy, there are plenty
>   of CPU cycles which can be used without impacting the GPU workload, so we
>   decide to do some data preprocessing which involves scanning a large
>   data set, creating memory usage which is mostly streaming but still has
>   enough look-backs to promote pages in the LRU lists.
> 
> IIUC, in the shared pool model, the GPU memory which isn't currently being
> used would sit outside the cgroup, and thus outside the protection of
> memory.low. Again, IIUC, you want to make this pool priority reclaimed
> because reclaiming is nearly free and you don't want to create undue
> pressure on other reclaimable resources.
> 
> However, what would happen in the above scenario under such an implementation
> is that the GPU workload would keep losing its memory pool to the background
> memory pressure created by the streaming memory usage. It's also easy to
> expand on scenarios like this with other GPU workloads with differing
> priorities and memory allotments and so on.

Yes, and that is absolutely intentional.

That the GPU loses its memory because of pressure on the CPU side is exactly what should happen.

There is no requirement whatsoever to avoid that.

> There may be some basic misunderstanding here. If a resource is worth
> caching, that usually indicates that there's some significant cost
> associated with un-caching the resource. It doesn't matter whether that cost
> is on the creation or destruction path. Here, the alloc path is expensive
> and free path is nearly free. However, this doesn't mean that we can get
> free isolation while bunching them together for immediate reclaim as others
> would be able to force you into alloc operations that you wouldn't need
> otherwise. If someone else can make you pay for something that you otherwise
> wouldn't, that resource is not isolated.

Correct, but exactly that is intentional.

See, the submission model of GPUs is best effort; e.g. you don't guarantee any performance isolation between processes whatsoever. If we were to start doing this we would need to start re-designing the HW.

We don't even have standard scheduling functionality; for example, we can't say process X has used N cycles because that isn't even precisely known to the kernel.

If we made the pool per cgroup we would lose the ability to quickly allocate those temporary throw-away buffers which are mainly used for special HW use cases.

The important use cases are:

1. Sharing freed up memory between otherwise unrelated applications.

	Imagine you alt+tab or swipe between applications (which can be in different cgroups), and the old application frees up its write-combined buffer and the new application re-allocates the same pages for its write-combined buffer.

2. Switching use cases.

	E.g. a developer first runs some benchmarks which use a bunch of write-combined memory and then starts gcc in the background (which of course can be in a different cgroup than the desktop), and that needs the additional memory.


That it is in theory possible for some CPU load to start stealing pages from the pool in the few milliseconds of the switch is completely irrelevant. For the second use case it is actually desirable.

If we made this pool per cgroup we would actually need to add countermeasures, e.g. a separate cgroup controller which says how much memory each cgroup can hog and how much in total, and especially timeouts after which the hogging cgroup would be forced to give its pool memory back to the core memory management.

But I clearly don't see a use case for that.

Regards,
Christian.

> 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-23  7:58                             ` Christian König
@ 2025-05-23 17:06                               ` Tejun Heo
  2025-05-26  8:19                                 ` Christian König
  0 siblings, 1 reply; 42+ messages in thread
From: Tejun Heo @ 2025-05-23 17:06 UTC (permalink / raw)
  To: Christian König
  Cc: Dave Airlie, Johannes Weiner, dri-devel, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, cgroups, Waiman Long,
	simona

Hello, Christian.

On Fri, May 23, 2025 at 09:58:58AM +0200, Christian König wrote:
...
> > - There's a GPU workload which uses a sizable amount of system memory for
> >   the pool being discussed in this thread. This GPU workload is very
> >   important, so we want to make sure that other activities in the system
> >   don't bother it. We give it plenty of isolated CPUs and protect its memory
> >   with high enough memory.low.
> 
> That situation simply doesn't happen. See isolation is *not* a requirement
> for the pool.
...
> See, the submission model of GPUs is best effort; e.g. you don't guarantee
> any performance isolation between processes whatsoever. If we wanted to do
> that we would need to start re-designing the HW.

This is a radical claim. Let's table the rest of the discussion for now. I
don't know enough to tell whether this claim is true or not, but for this to
be true, the following should be true:

 Whether the GPU memory pool is reclaimed or not doesn't have noticeable
 implications for GPU performance.

Is this true?

As for the scenario that I described above, I didn't just come up with it.
I'm only supporting from the system side, but that's based on what our ML
folks are doing right now. We have a bunch of large machines with multiple
GPUs running ML workloads. The workloads can run for a long time spread
across many machines and they synchronize frequently, so any performance
drop on one GPU lowers utilization on all involved GPUs, whose count can
run into three digits. For example, any scheduling disturbance on the
submitting thread propagates through the whole cluster and slows down all
involved GPUs.

Also, because these machines are large on the CPU and memory sides too and
aren't doing a whole lot other than managing the GPUs, people want to put a
significant amount of CPU work on them, which can easily create at least
moderate memory pressure. Is the claim that the write-combined memory pool
doesn't have any meaningful impact on the GPU workload performance?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-23 17:06                               ` Tejun Heo
@ 2025-05-26  8:19                                 ` Christian König
  2025-05-26 20:13                                   ` Dave Airlie
  0 siblings, 1 reply; 42+ messages in thread
From: Christian König @ 2025-05-26  8:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Dave Airlie, Johannes Weiner, dri-devel, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, cgroups, Waiman Long,
	simona

Hi Tejun,

On 5/23/25 19:06, Tejun Heo wrote:
> Hello, Christian.
> 
> On Fri, May 23, 2025 at 09:58:58AM +0200, Christian König wrote:
> ...
>>> - There's a GPU workload which uses a sizable amount of system memory for
>>>   the pool being discussed in this thread. This GPU workload is very
>>>   important, so we want to make sure that other activities in the system
>>>   don't bother it. We give it plenty of isolated CPUs and protect its memory
>>>   with high enough memory.low.
>>
>> That situation simply doesn't happen. See isolation is *not* a requirement
>> for the pool.
> ...
>> See, the submission model of GPUs is best effort; e.g. you don't guarantee
>> any performance isolation between processes whatsoever. If we wanted to do
>> that we would need to start re-designing the HW.
> 
> This is a radical claim. Let's table the rest of the discussion for now. I
> don't know enough to tell whether this claim is true or not, but for this to
> be true, the following should be true:
> 
>  Whether the GPU memory pool is reclaimed or not doesn't have noticeable
>  implications for GPU performance.
> 
> Is this true?

Yes, that is true. Today the GPUs need the memory for correctness, not for performance any more.

The performance improvements we saw with this approach 15 or 20 years ago are negligible by today's standards.

It's just that Windows still offers the functionality today, and when you bring up hardware on Linux you sometimes run into problems and find that the engineers who designed the hardware/firmware relied on having this.

> As for the scenario that I described above, I didn't just come up with it.
> I'm only supporting from the system side, but that's based on what our ML
> folks are doing right now. We have a bunch of large machines with multiple
> GPUs running ML workloads. The workloads can run for a long time spread
> across many machines and they synchronize frequently, so any performance
> drop on one GPU lowers utilization on all involved GPUs, whose count can
> run into three digits. For example, any scheduling disturbance on the
> submitting thread propagates through the whole cluster and slows down all
> involved GPUs.

For the HPC/ML use case this feature is completely irrelevant. ROCm, CUDA, OpenCL, OpenMP, etc. don't even expose something like this in their higher-level APIs as far as I know.

Where this matters is things like scanout on certain laptops, digital rights management in cloud gaming, and hacks for getting high-end GPUs to work on ARM boards (e.g. Raspberry Pi etc.).

> Also, because these machines are large on the CPU and memory sides too and
> aren't doing a whole lot other than managing the GPUs, people want to put a
> significant amount of CPU work on them, which can easily create at least
> moderate memory pressure. Is the claim that the write-combined memory pool
> doesn't have any meaningful impact on the GPU workload performance?

When the memory pool is active on such systems I would strongly advise questioning why it is used in the first place.

The main reason why we still need it for business today is cloud gaming. And for this particular use case you absolutely do want to share the pool between cgroups, otherwise the whole use case breaks.

Regards,
Christian.

> 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-26  8:19                                 ` Christian König
@ 2025-05-26 20:13                                   ` Dave Airlie
  2025-05-27  8:01                                     ` Christian König
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Airlie @ 2025-05-26 20:13 UTC (permalink / raw)
  To: Christian König
  Cc: Tejun Heo, Johannes Weiner, dri-devel, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, cgroups, Waiman Long,
	simona

On Mon, 26 May 2025 at 18:19, Christian König <christian.koenig@amd.com> wrote:
>
> Hi Tejun,
>
> On 5/23/25 19:06, Tejun Heo wrote:
> > Hello, Christian.
> >
> > On Fri, May 23, 2025 at 09:58:58AM +0200, Christian König wrote:
> > ...
> >>> - There's a GPU workload which uses a sizable amount of system memory for
> >>>   the pool being discussed in this thread. This GPU workload is very
> >>>   important, so we want to make sure that other activities in the system
> >>>   don't bother it. We give it plenty of isolated CPUs and protect its memory
> >>>   with high enough memory.low.
> >>
> >> That situation simply doesn't happen. See isolation is *not* a requirement
> >> for the pool.
> > ...
> >> See, the submission model of GPUs is best effort; e.g. you don't guarantee
> >> any performance isolation between processes whatsoever. If we wanted to do
> >> that we would need to start re-designing the HW.
> >
> > This is a radical claim. Let's table the rest of the discussion for now. I
> > don't know enough to tell whether this claim is true or not, but for this to
> > be true, the following should be true:
> >
> >  Whether the GPU memory pool is reclaimed or not doesn't have noticeable
> >  implications for GPU performance.
> >
> > Is this true?
>
> Yes, that is true. Today the GPUs need the memory for correctness, not for performance any more.
>
> The performance improvements we saw with this approach 15 or 20 years ago are negligible by today's standards.
>
> It's just that Windows still offers the functionality today, and when you bring up hardware on Linux you sometimes run into problems and find that the engineers who designed the hardware/firmware relied on having this.
>
> > As for the scenario that I described above, I didn't just come up with it.
> > I'm only supporting from the system side, but that's based on what our ML
> > folks are doing right now. We have a bunch of large machines with multiple
> > GPUs running ML workloads. The workloads can run for a long time spread
> > across many machines and they synchronize frequently, so any performance
> > drop on one GPU lowers utilization on all involved GPUs, whose count can
> > run into three digits. For example, any scheduling disturbance on the
> > submitting thread propagates through the whole cluster and slows down all
> > involved GPUs.
>
> For the HPC/ML use case this feature is completely irrelevant. ROCm, CUDA, OpenCL, OpenMP, etc. don't even expose something like this in their higher-level APIs as far as I know.

What do we consider higher level here, btw? HIP and CUDA both expose
something like hipHostMallocWriteCombined; there is also
hipHostMallocCoherent, which may or may not have an effect.
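
For reference, the kind of thing I mean looks roughly like this on the HIP
side (a minimal sketch, assuming a working ROCm install; whether the flag
buys anything on a given ASIC is exactly what's in question):

#include <hip/hip_runtime.h>
#include <stdio.h>

int main(void)
{
	void *wc_buf = NULL;

	/* Ask the runtime for pinned, write-combined host memory. */
	hipError_t err = hipHostMalloc(&wc_buf, 1 << 20,
				       hipHostMallocWriteCombined);
	if (err != hipSuccess) {
		fprintf(stderr, "hipHostMalloc: %s\n", hipGetErrorString(err));
		return 1;
	}

	/* ... CPU writes / GPU reads through wc_buf ... */

	hipHostFree(wc_buf);
	return 0;
}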

>
> Where this matters is things like scanout on certain laptops, digital rights management in cloud gaming, and hacks for getting high-end GPUs to work on ARM boards (e.g. Raspberry Pi etc.).
>
> > Also, because these machines are large on the CPU and memory sides too and
> > aren't doing a whole lot other than managing the GPUs, people want to put a
> > significant amount of CPU work on them, which can easily create at least
> > moderate memory pressure. Is the claim that the write-combined memory pool
> > doesn't have any meaningful impact on the GPU workload performance?
>
> When the memory pool is active on such systems I would strongly advise questioning why it is used in the first place.
>
> The main reason why we still need it for business today is cloud gaming. And for this particular use case you absolutely do want to share the pool between cgroups, otherwise the whole use case breaks.

I'm still not convinced this is totally true, which means either I'm
misunderstanding how cloud gaming works or you are underestimating how
cgroups work.

My model for cloud gaming is, you have some sort of orchestrator
service running that spawns a bunch of games in their own cgroups and
those games would want to operate as independently as possible.

Now if a top-level cgroup (or, failing that, the root cgroup) exists and
the game cgroups are all underneath it, then I think this would operate
more optimally for each game, since

a) if a game uses uncached memory continuously it will have its own
pool of uncached memory that doesn't get used by anyone else, thus
making that game more consistent.
b) if and when the game exits, the pool will be returned to the parent
cgroup to use; this memory should then be reused by other games that
are started subsequently.

The only thing I'm not sure about is how the parent pool, once it's
built up, gets used for new children; I need to spend more time reading
the list_lru code.

The list_lru change might actually be useful for us even without cgroups,
as it might be able to hide some of our per-NUMA stuff.
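
Rough sketch of the shape I'm thinking of (untested, and the list_lru
signatures have been moving around between kernel versions, so treat the
exact arguments as approximate; init would go through list_lru_init_memcg()
so the lru becomes memcg- and NUMA-aware):

#include <linux/list_lru.h>
#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/slab.h>

static struct list_lru ttm_pool_lru;

struct ttm_pool_page {
	struct list_head lru;	/* linkage into ttm_pool_lru */
	struct page *page;
};

/* Stash a no-longer-used pool page, keyed by node id and memcg. */
static void ttm_pool_stash(struct ttm_pool_page *p, struct mem_cgroup *memcg)
{
	list_lru_add(&ttm_pool_lru, &p->lru, page_to_nid(p->page), memcg);
}

/* Shrinker walk callback: give one entry back to the page allocator. */
static enum lru_status ttm_pool_isolate(struct list_head *item,
					struct list_lru_one *list,
					void *cb_arg)
{
	struct ttm_pool_page *p = container_of(item, struct ttm_pool_page, lru);

	list_lru_isolate(list, item);
	__free_page(p->page);
	kfree(p);
	return LRU_REMOVED;
}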

Dave.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
  2025-05-26 20:13                                   ` Dave Airlie
@ 2025-05-27  8:01                                     ` Christian König
  0 siblings, 0 replies; 42+ messages in thread
From: Christian König @ 2025-05-27  8:01 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Tejun Heo, Johannes Weiner, dri-devel, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, cgroups, Waiman Long,
	simona

On 5/26/25 22:13, Dave Airlie wrote:
> On Mon, 26 May 2025 at 18:19, Christian König <christian.koenig@amd.com> wrote:
>>
>> For the HPC/ML use case this feature is completely irrelevant. ROCm, CUDA, OpenCL, OpenMP, etc. don't even expose something like this in their higher-level APIs as far as I know.
> 
> What do we consider higher level here, btw? HIP and CUDA both expose
> something like hipHostMallocWriteCombined; there is also
> hipHostMallocCoherent, which may or may not have an effect.

High level in the sense of things which applications actually use.

And yes, HIP used to have flags for WC memory, but because allocating system memory through the KFD node never worked as intended we abandoned that approach a long time ago.

What could be the case is that the userspace components (like HIP) now use the render node interface with the AMDGPU_GEM_CREATE_CPU_GTT_USWC flag to allocate WC memory. That is one thing I need to double-check.
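
For illustration, through libdrm_amdgpu that path would look something like
this (just a sketch; error handling and device initialization are elided):

#include <amdgpu.h>
#include <amdgpu_drm.h>

/* Allocate a 1 MiB write-combined GTT BO on an already initialized
 * device handle. */
static int alloc_wc_bo(amdgpu_device_handle dev, amdgpu_bo_handle *bo)
{
	struct amdgpu_bo_alloc_request req = {
		.alloc_size = 1 << 20,
		.phys_alignment = 4096,
		.preferred_heap = AMDGPU_GEM_DOMAIN_GTT,
		.flags = AMDGPU_GEM_CREATE_CPU_GTT_USWC,
	};

	return amdgpu_bo_alloc(dev, &req, bo);
}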

>> The main reason why we still need it for business today is cloud gaming. And for this particular use case you absolutely do want to share the pool between cgroups, otherwise the whole use case breaks.
> 
> I'm still not convinced this is totally true, which means either I'm
> misunderstanding how cloud gaming works or you are underestimating how
> cgroups work.
> 
> My model for cloud gaming is, you have some sort of orchestrator
> service running that spawns a bunch of games in their own cgroups and
> those games would want to operate as independently as possible.
> 
> Now if a top-level cgroup (or, failing that, the root cgroup) exists and
> the game cgroups are all underneath it, then I think this would operate
> more optimally for each game, since
> 
> a) if a game uses uncached memory continuously it will have its own
> pool of uncached memory that doesn't get used by anyone else, thus
> making that game more consistent.
> b) if and when the game exits, the pool will be returned to the parent
> cgroup to use; this memory should then be reused by other games that
> are started subsequently.

Mhm, my impression was that cgroups are not hierarchical any more these days, so pages would stay allocated to the cgroup until it is fully destroyed and only then be given back to the core memory management. But I could certainly be wrong on that.

When an application continuously uses write-combined memory it would allocate it only once and then re-use it over and over again. The background is that even with the pool we don't make any allocation performance guarantees; the operation is purely best effort.

So the important advantage of the pool is to share the memory between applications which are otherwise not aware of each other. And that is exactly what you have in a cloud gaming use case.

If we wanted to make performance guarantees we would need to approach that from a different side, e.g. have per-cgroup config options on the pool size, background workers which refill the pool for each cgroup, etc. That is certainly possible, I just don't see the use case for it.

One additional side note: a big original motivation for the pool was to avoid the WBINVD CPU instruction, because that is disruptive not only to the current process but to the system as a whole. This is not an issue any more on modern CPUs because we now have the CLFLUSH instruction.

So the motivation for having the pool is really not that strong any more.

> The only thing I'm not sure about is how the parent pool, once it's
> built up, gets used for new children; I need to spend more time reading
> the list_lru code.
> 
> The list_lru change might actually be useful for us even without cgroups,
> as it might be able to hide some of our per-NUMA stuff.

Yes, having proper NUMA integration would be a really big argument in favor of this.

I also explicitly asked Waiman Long about that, but it seems like cgroups don't have the NUMA capabilities we actually need here.

Regards,
Christian.

> 
> Dave.


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2025-05-27  8:01 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-02  3:35 [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2) Dave Airlie
2025-05-02  3:36 ` [PATCH 1/5] memcg: add GPU statistic Dave Airlie
2025-05-02  3:36 ` [PATCH 2/5] memcg: add hooks for gpu memcg charging/uncharging Dave Airlie
2025-05-02  3:36 ` [PATCH 3/5] ttm: add initial memcg integration. (v2) Dave Airlie
2025-05-02 12:01   ` Christian König
2025-05-02 14:24   ` kernel test robot
2025-05-03  2:09   ` kernel test robot
2025-05-02  3:36 ` [PATCH 4/5] amdgpu: add support for memcg integration Dave Airlie
2025-05-02 14:01   ` Waiman Long
2025-05-02  3:36 ` [PATCH 5/5] nouveau: add " Dave Airlie
2025-05-06  0:37 ` [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2) Shakeel Butt
2025-05-06  0:59   ` Dave Airlie
2025-05-07 17:52 ` Johannes Weiner
2025-05-07 22:03   ` Dave Airlie
2025-05-07 22:11     ` Dave Airlie
2025-05-13  7:54     ` Johannes Weiner
2025-05-15  3:02       ` Dave Airlie
2025-05-15  8:55         ` Christian König
2025-05-15 15:04           ` Waiman Long
2025-05-15 15:16             ` Christian König
2025-05-15 16:08           ` Johannes Weiner
2025-05-16  6:53             ` Christian König
2025-05-16 14:53               ` Johannes Weiner
2025-05-16 15:35                 ` Christian König
2025-05-16 16:41                   ` Johannes Weiner
2025-05-16 17:42                     ` Christian König
2025-05-16 20:04                       ` Johannes Weiner
2025-05-16 20:25                         ` Dave Airlie
2025-05-18 16:28                           ` Christian König
2025-05-19  6:18                             ` Dave Airlie
2025-05-19  8:26                               ` Christian König
2025-05-22 19:51                           ` Tejun Heo
2025-05-23  7:58                             ` Christian König
2025-05-23 17:06                               ` Tejun Heo
2025-05-26  8:19                                 ` Christian König
2025-05-26 20:13                                   ` Dave Airlie
2025-05-27  8:01                                     ` Christian König
2025-05-16 16:12         ` Johannes Weiner
2025-05-21  2:23       ` Dave Airlie
2025-05-21  7:50         ` Christian König
2025-05-21 14:43         ` Johannes Weiner
2025-05-22  7:03           ` Dave Airlie
