* drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver
@ 2025-06-30 4:49 Dave Airlie
2025-06-30 4:49 ` [PATCH 01/17] mm: add gpu active/reclaim per-node stat counters (v2) Dave Airlie
` (17 more replies)
0 siblings, 18 replies; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song
Hi all,
tl;dr: start using list_lru/numa/memcg in GPU driver core and amdgpu driver for now.
This is the complete series of patches, some of which have been sent and reviewed before,
but I want to give others the complete picture, and try to figure out how best to land this.
There are 4 pieces to this:
01->02: add support for global gpu stat counters (previously posted, patch 2 is newer)
03->07: port ttm pools to list_lru for numa awareness
08->14: add memcg stats + gpu apis, then port ttm pools to memcg aware list_lru and shrinker
15->17: enable amdgpu to use new functionality.
The biggest difference in the memcg code from previous versions is that I discovered what
obj cgroups were designed for, and I'm reusing the page/objcg integration that
already exists, to avoid reinventing that wheel right now.
There are some igt-gpu-tools tests I've written at:
https://gitlab.freedesktop.org/airlied/igt-gpu-tools/-/tree/amdgpu-cgroups?ref_type=heads
One problem is that there is a lot of delayed action, which probably means the testing
needs a bit more robustness, but the tests validate all the basic paths.
Regards,
Dave.
* [PATCH 01/17] mm: add gpu active/reclaim per-node stat counters (v2)
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 4:49 ` [PATCH 02/17] drm/ttm: use gpu mm stats to track gpu memory allocations. (v2) Dave Airlie
` (16 subsequent siblings)
17 siblings, 0 replies; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie, Matthew Brost,
Andrew Morton, Zi Yan, Shakeel Butt
From: Dave Airlie <airlied@redhat.com>
While discussing memcg integration with GPU memory allocations,
it was pointed out that there were no NUMA/system counters for
GPU memory allocations.
With more integrated-memory GPU server systems turning up, and
more requirements for memory tracking, it seems we should start
closing the gap.
Add two counters to track per-node GPU system memory allocations.
The first tracks memory currently allocated to GPU objects, and the
second tracks memory stored in GPU page pools that can be reclaimed
by the shrinker.
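For illustration, a driver would be expected to account such allocations
roughly as follows (a minimal sketch with hypothetical gpu_alloc_pages/
gpu_free_pages helpers; the TTM conversion in the next patch does the real
plumbing):
/* Sketch: account an order-sized GPU allocation against its NUMA node. */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmstat.h>
static struct page *gpu_alloc_pages(int nid, gfp_t gfp, unsigned int order)
{
	struct page *p = alloc_pages_node(nid, gfp, order);
	if (p)
		mod_node_page_state(NODE_DATA(page_to_nid(p)),
				    NR_GPU_ACTIVE, 1 << order);
	return p;
}
static void gpu_free_pages(struct page *p, unsigned int order)
{
	mod_node_page_state(NODE_DATA(page_to_nid(p)),
			    NR_GPU_ACTIVE, -(1 << order));
	__free_pages(p, order);
}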
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
v2: add more info to the documentation on this memory.
I'd like to get acks to merge this via the drm tree, if possible,
Dave.
---
Documentation/filesystems/proc.rst | 8 ++++++++
drivers/base/node.c | 5 +++++
fs/proc/meminfo.c | 6 ++++++
include/linux/mmzone.h | 2 ++
mm/show_mem.c | 9 +++++++--
mm/vmstat.c | 2 ++
6 files changed, 30 insertions(+), 2 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2a17865dfe39..4d9c65b303cc 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1093,6 +1093,8 @@ Example output. You may not have all of these fields.
CmaFree: 0 kB
Unaccepted: 0 kB
Balloon: 0 kB
+ GPUActive: 0 kB
+ GPUReclaim: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
@@ -1271,6 +1273,12 @@ Unaccepted
Memory that has not been accepted by the guest
Balloon
Memory returned to Host by VM Balloon Drivers
+GPUActive
+ System memory allocated to active GPU objects
+GPUReclaim
+ System memory stored in GPU pools for reuse. This memory is not
+ counted in GPUActive. It is shrinker reclaimable memory kept in a reuse
+ pool because it has non-standard page table attributes, like WC or UC.
HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, Hugepagesize, Hugetlb
See Documentation/admin-guide/mm/hugetlbpage.rst.
DirectMap4k, DirectMap2M, DirectMap1G
diff --git a/drivers/base/node.c b/drivers/base/node.c
index c19094481630..64406862314b 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -463,6 +463,8 @@ static ssize_t node_read_meminfo(struct device *dev,
#ifdef CONFIG_UNACCEPTED_MEMORY
"Node %d Unaccepted: %8lu kB\n"
#endif
+ "Node %d GPUActive: %8lu kB\n"
+ "Node %d GPUReclaim: %8lu kB\n"
,
nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
nid, K(node_page_state(pgdat, NR_WRITEBACK)),
@@ -496,6 +498,9 @@ static ssize_t node_read_meminfo(struct device *dev,
,
nid, K(sum_zone_node_page_state(nid, NR_UNACCEPTED))
#endif
+ ,
+ nid, K(node_page_state(pgdat, NR_GPU_ACTIVE)),
+ nid, K(node_page_state(pgdat, NR_GPU_RECLAIM))
);
len += hugetlb_report_node_meminfo(buf, len, nid);
return len;
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index bc2bc60c36cc..334948744e55 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -164,6 +164,12 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
show_val_kb(m, "Balloon: ",
global_node_page_state(NR_BALLOON_PAGES));
+ show_val_kb(m, "GPUActive: ",
+ global_node_page_state(NR_GPU_ACTIVE));
+
+ show_val_kb(m, "GPUReclaim: ",
+ global_node_page_state(NR_GPU_RECLAIM));
+
hugetlb_report_meminfo(m);
arch_report_meminfo(m);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 283913d42d7b..81f3e368fd91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -241,6 +241,8 @@ enum node_stat_item {
NR_HUGETLB,
#endif
NR_BALLOON_PAGES,
+ NR_GPU_ACTIVE, /* Pages assigned to GPU objects */
+ NR_GPU_RECLAIM, /* Pages in shrinkable GPU pools */
NR_VM_NODE_STAT_ITEMS
};
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 0cf8bf5d832d..072d33a50148 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -255,7 +255,9 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
" sec_pagetables:%lukB"
" all_unreclaimable? %s"
" Balloon:%lukB"
- "\n",
+ " gpu_active:%lukB"
+ " gpu_reclaim:%lukB"
+ "\n",
pgdat->node_id,
K(node_page_state(pgdat, NR_ACTIVE_ANON)),
K(node_page_state(pgdat, NR_INACTIVE_ANON)),
@@ -281,7 +283,10 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
K(node_page_state(pgdat, NR_PAGETABLE)),
K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)),
str_yes_no(pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES),
- K(node_page_state(pgdat, NR_BALLOON_PAGES)));
+ K(node_page_state(pgdat, NR_BALLOON_PAGES)),
+ K(node_page_state(pgdat, NR_GPU_ACTIVE)),
+ K(node_page_state(pgdat, NR_GPU_RECLAIM)));
+
}
for_each_populated_zone(zone) {
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 429ae5339bfe..25a74cf29473 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1281,6 +1281,8 @@ const char * const vmstat_text[] = {
"nr_hugetlb",
#endif
"nr_balloon_pages",
+ "nr_gpu_active",
+ "nr_gpu_reclaim",
/* system-wide enum vm_stat_item counters */
"nr_dirty_threshold",
"nr_dirty_background_threshold",
--
2.49.0
* [PATCH 02/17] drm/ttm: use gpu mm stats to track gpu memory allocations. (v2)
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
2025-06-30 4:49 ` [PATCH 01/17] mm: add gpu active/reclaim per-node stat counters (v2) Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 10:04 ` Christian König
2025-07-02 16:08 ` Shakeel Butt
2025-06-30 4:49 ` [PATCH 03/17] mm/list_lru: export list_lru_add Dave Airlie
` (15 subsequent siblings)
17 siblings, 2 replies; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie, Matthew Brost,
Andrew Morton
From: Dave Airlie <airlied@redhat.com>
This uses the newly introduced per-node GPU tracking stats
to track GPU memory allocated via TTM and reclaimable memory in
the TTM page pools.
These stats will be useful for system information, and later
when mem cgroups are integrated.
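As a rough summary of the hunks below, flipping an allocation between the
two counters when it enters or leaves a pool looks like this (a sketch with
a hypothetical helper, not the exact TTM code):
/* Sketch: move an order-sized allocation between active and reclaim. */
static void gpu_account_pooled(struct page *p, unsigned int order, bool to_pool)
{
	long nr = 1 << order;
	pg_data_t *pgdat = NODE_DATA(page_to_nid(p));
	mod_node_page_state(pgdat, NR_GPU_ACTIVE, to_pool ? -nr : nr);
	mod_node_page_state(pgdat, NR_GPU_RECLAIM, to_pool ? nr : -nr);
}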
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
v2: add reclaim parameters and adjust the right counters.
---
drivers/gpu/drm/ttm/ttm_pool.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index baf27c70a419..11a5777b4a85 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -131,6 +131,16 @@ static struct list_head shrinker_list;
static struct shrinker *mm_shrinker;
static DECLARE_RWSEM(pool_shrink_rwsem);
+/* helper to get a current valid node id from a pool */
+static int ttm_pool_nid(struct ttm_pool *pool) {
+ int nid = NUMA_NO_NODE;
+ if (pool)
+ nid = pool->nid;
+ if (nid == NUMA_NO_NODE)
+ nid = numa_node_id();
+ return nid;
+}
+
/* Allocate pages of size 1 << order with the given gfp_flags */
static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
unsigned int order)
@@ -150,8 +160,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
if (!pool->use_dma_alloc) {
p = alloc_pages_node(pool->nid, gfp_flags, order);
- if (p)
+ if (p) {
p->private = order;
+ mod_node_page_state(NODE_DATA(ttm_pool_nid(pool)), NR_GPU_ACTIVE, (1 << order));
+ }
return p;
}
@@ -186,7 +198,7 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
/* Reset the caching and pages of size 1 << order */
static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
- unsigned int order, struct page *p)
+ unsigned int order, struct page *p, bool reclaim)
{
unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
struct ttm_pool_dma *dma;
@@ -201,6 +213,9 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
#endif
if (!pool || !pool->use_dma_alloc) {
+ mod_node_page_state(NODE_DATA(ttm_pool_nid(pool)),
+ reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE,
+ -(1 << order));
__free_pages(p, order);
return;
}
@@ -276,6 +291,7 @@ static void ttm_pool_unmap(struct ttm_pool *pool, dma_addr_t dma_addr,
static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
{
unsigned int i, num_pages = 1 << pt->order;
+ int nid = ttm_pool_nid(pt->pool);
for (i = 0; i < num_pages; ++i) {
if (PageHighMem(p))
@@ -288,17 +304,23 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
list_add(&p->lru, &pt->pages);
spin_unlock(&pt->lock);
atomic_long_add(1 << pt->order, &allocated_pages);
+
+ mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, -num_pages);
+ mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, num_pages);
}
/* Take pages from a specific pool_type, return NULL when nothing available */
static struct page *ttm_pool_type_take(struct ttm_pool_type *pt)
{
struct page *p;
+ int nid = ttm_pool_nid(pt->pool);
spin_lock(&pt->lock);
p = list_first_entry_or_null(&pt->pages, typeof(*p), lru);
if (p) {
atomic_long_sub(1 << pt->order, &allocated_pages);
+ mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, (1 << pt->order));
+ mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, -(1 << pt->order));
list_del(&p->lru);
}
spin_unlock(&pt->lock);
@@ -331,7 +353,7 @@ static void ttm_pool_type_fini(struct ttm_pool_type *pt)
spin_unlock(&shrinker_lock);
while ((p = ttm_pool_type_take(pt)))
- ttm_pool_free_page(pt->pool, pt->caching, pt->order, p);
+ ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
}
/* Return the pool_type to use for the given caching and order */
@@ -383,7 +405,7 @@ static unsigned int ttm_pool_shrink(void)
p = ttm_pool_type_take(pt);
if (p) {
- ttm_pool_free_page(pt->pool, pt->caching, pt->order, p);
+ ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
num_pages = 1 << pt->order;
} else {
num_pages = 0;
@@ -475,7 +497,7 @@ static pgoff_t ttm_pool_unmap_and_free(struct ttm_pool *pool, struct page *page,
if (pt)
ttm_pool_type_give(pt, page);
else
- ttm_pool_free_page(pool, caching, order, page);
+ ttm_pool_free_page(pool, caching, order, page, false);
return nr;
}
@@ -780,7 +802,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
return 0;
error_free_page:
- ttm_pool_free_page(pool, page_caching, order, p);
+ ttm_pool_free_page(pool, page_caching, order, p, false);
error_free_all:
if (tt->restore)
--
2.49.0
* [PATCH 03/17] mm/list_lru: export list_lru_add.
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
2025-06-30 4:49 ` [PATCH 01/17] mm: add gpu active/reclaim per-node stat counters (v2) Dave Airlie
2025-06-30 4:49 ` [PATCH 02/17] drm/ttm: use gpu mm stats to track gpu memory allocations. (v2) Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 4:49 ` [PATCH 04/17] ttm/pool: port to list_lru. (v2) Dave Airlie
` (14 subsequent siblings)
17 siblings, 0 replies; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie, Shakeel Butt
From: Dave Airlie <airlied@redhat.com>
DRM/TTM wants to use this for its page pool
LRU tracking.
This is effectively a revert of
78c0ed09131b772f062b986a2fcca6600daa6285
Author: Kairui Song <kasong@tencent.com>
Date: Tue Nov 5 01:52:53 2024 +0800
mm/list_lru: don't export list_lru_add
Cc: Kairui Song <kasong@tencent.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
mm/list_lru.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 490473af3122..315362e3df3d 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -175,6 +175,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
unlock_list_lru(l, false);
return false;
}
+EXPORT_SYMBOL_GPL(list_lru_add);
bool list_lru_add_obj(struct list_lru *lru, struct list_head *item)
{
--
2.49.0
* [PATCH 04/17] ttm/pool: port to list_lru. (v2)
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (2 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 03/17] mm/list_lru: export list_lru_add Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 10:37 ` kernel test robot
2025-06-30 4:49 ` [PATCH 05/17] ttm/pool: drop numa specific pools Dave Airlie
` (13 subsequent siblings)
17 siblings, 1 reply; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie
From: Dave Airlie <airlied@redhat.com>
This is an initial port of the TTM pools for
write-combined and uncached pages to use the list_lru.
This makes the pools more NUMA aware and avoids
needing separate NUMA pools (a later commit enables this).
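For reference, the list_lru add/isolate pattern the port relies on looks
roughly like this (a minimal sketch with a hypothetical example() helper,
not the exact TTM code; error handling omitted):
#include <linux/list_lru.h>
#include <linux/mm.h>
/* Isolate callback: remove one page from the per-node list and return it. */
static enum lru_status isolate_one(struct list_head *item,
				   struct list_lru_one *list,
				   void *cb_arg)
{
	struct page **out = cb_arg;
	list_lru_isolate(list, item);
	*out = container_of(item, struct page, lru);
	return LRU_REMOVED;
}
static void example(struct list_lru *lru, struct page *p, int nid)
{
	struct page *taken = NULL;
	unsigned long nr_to_walk = 1;
	list_lru_add(lru, &p->lru, nid, NULL);		/* give to pool */
	list_lru_walk_node(lru, nid, isolate_one,	/* take from pool */
			   &taken, &nr_to_walk);
}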
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
v2: drop the pt->lock, the lru list has its own lock which is sufficient.
rearrange list isolates to fix bad locking orders.
---
drivers/gpu/drm/ttm/tests/ttm_device_test.c | 2 +-
drivers/gpu/drm/ttm/tests/ttm_pool_test.c | 32 ++++----
drivers/gpu/drm/ttm/ttm_pool.c | 83 +++++++++++++--------
include/drm/ttm/ttm_pool.h | 6 +-
4 files changed, 72 insertions(+), 51 deletions(-)
diff --git a/drivers/gpu/drm/ttm/tests/ttm_device_test.c b/drivers/gpu/drm/ttm/tests/ttm_device_test.c
index 1621903818e5..1f207fd222bc 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_device_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_device_test.c
@@ -183,7 +183,7 @@ static void ttm_device_init_pools(struct kunit *test)
if (params->use_dma_alloc)
KUNIT_ASSERT_FALSE(test,
- list_empty(&pt.pages));
+ !list_lru_count(&pt.pages));
}
}
}
diff --git a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
index 8ade53371f72..39234a3e98c4 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
@@ -248,7 +248,7 @@ static void ttm_pool_alloc_order_caching_match(struct kunit *test)
pool = ttm_pool_pre_populated(test, size, caching);
pt = &pool->caching[caching].orders[order];
- KUNIT_ASSERT_FALSE(test, list_empty(&pt->pages));
+ KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt->pages));
tt = ttm_tt_kunit_init(test, 0, caching, size);
KUNIT_ASSERT_NOT_NULL(test, tt);
@@ -256,7 +256,7 @@ static void ttm_pool_alloc_order_caching_match(struct kunit *test)
err = ttm_pool_alloc(pool, tt, &simple_ctx);
KUNIT_ASSERT_EQ(test, err, 0);
- KUNIT_ASSERT_TRUE(test, list_empty(&pt->pages));
+ KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
ttm_pool_free(pool, tt);
ttm_tt_fini(tt);
@@ -282,8 +282,8 @@ static void ttm_pool_alloc_caching_mismatch(struct kunit *test)
tt = ttm_tt_kunit_init(test, 0, tt_caching, size);
KUNIT_ASSERT_NOT_NULL(test, tt);
- KUNIT_ASSERT_FALSE(test, list_empty(&pt_pool->pages));
- KUNIT_ASSERT_TRUE(test, list_empty(&pt_tt->pages));
+ KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
+ KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt_tt->pages));
err = ttm_pool_alloc(pool, tt, &simple_ctx);
KUNIT_ASSERT_EQ(test, err, 0);
@@ -291,8 +291,8 @@ static void ttm_pool_alloc_caching_mismatch(struct kunit *test)
ttm_pool_free(pool, tt);
ttm_tt_fini(tt);
- KUNIT_ASSERT_FALSE(test, list_empty(&pt_pool->pages));
- KUNIT_ASSERT_FALSE(test, list_empty(&pt_tt->pages));
+ KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
+ KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_tt->pages));
ttm_pool_fini(pool);
}
@@ -316,8 +316,8 @@ static void ttm_pool_alloc_order_mismatch(struct kunit *test)
tt = ttm_tt_kunit_init(test, 0, caching, snd_size);
KUNIT_ASSERT_NOT_NULL(test, tt);
- KUNIT_ASSERT_FALSE(test, list_empty(&pt_pool->pages));
- KUNIT_ASSERT_TRUE(test, list_empty(&pt_tt->pages));
+ KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
+ KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt_tt->pages));
err = ttm_pool_alloc(pool, tt, &simple_ctx);
KUNIT_ASSERT_EQ(test, err, 0);
@@ -325,8 +325,8 @@ static void ttm_pool_alloc_order_mismatch(struct kunit *test)
ttm_pool_free(pool, tt);
ttm_tt_fini(tt);
- KUNIT_ASSERT_FALSE(test, list_empty(&pt_pool->pages));
- KUNIT_ASSERT_FALSE(test, list_empty(&pt_tt->pages));
+ KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
+ KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_tt->pages));
ttm_pool_fini(pool);
}
@@ -352,12 +352,12 @@ static void ttm_pool_free_dma_alloc(struct kunit *test)
ttm_pool_alloc(pool, tt, &simple_ctx);
pt = &pool->caching[caching].orders[order];
- KUNIT_ASSERT_TRUE(test, list_empty(&pt->pages));
+ KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
ttm_pool_free(pool, tt);
ttm_tt_fini(tt);
- KUNIT_ASSERT_FALSE(test, list_empty(&pt->pages));
+ KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt->pages));
ttm_pool_fini(pool);
}
@@ -383,12 +383,12 @@ static void ttm_pool_free_no_dma_alloc(struct kunit *test)
ttm_pool_alloc(pool, tt, &simple_ctx);
pt = &pool->caching[caching].orders[order];
- KUNIT_ASSERT_TRUE(test, list_is_singular(&pt->pages));
+ KUNIT_ASSERT_TRUE(test, list_lru_count(&pt->pages) == 1);
ttm_pool_free(pool, tt);
ttm_tt_fini(tt);
- KUNIT_ASSERT_TRUE(test, list_is_singular(&pt->pages));
+ KUNIT_ASSERT_TRUE(test, list_lru_count(&pt->pages) == 1);
ttm_pool_fini(pool);
}
@@ -404,11 +404,11 @@ static void ttm_pool_fini_basic(struct kunit *test)
pool = ttm_pool_pre_populated(test, size, caching);
pt = &pool->caching[caching].orders[order];
- KUNIT_ASSERT_FALSE(test, list_empty(&pt->pages));
+ KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt->pages));
ttm_pool_fini(pool);
- KUNIT_ASSERT_TRUE(test, list_empty(&pt->pages));
+ KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
}
static struct kunit_case ttm_pool_test_cases[] = {
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 11a5777b4a85..4372f0cc4a57 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -291,7 +291,7 @@ static void ttm_pool_unmap(struct ttm_pool *pool, dma_addr_t dma_addr,
static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
{
unsigned int i, num_pages = 1 << pt->order;
- int nid = ttm_pool_nid(pt->pool);
+ int nid = page_to_nid(p);
for (i = 0; i < num_pages; ++i) {
if (PageHighMem(p))
@@ -300,31 +300,41 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
clear_page(page_address(p + i));
}
- spin_lock(&pt->lock);
- list_add(&p->lru, &pt->pages);
- spin_unlock(&pt->lock);
+ INIT_LIST_HEAD(&p->lru);
+ rcu_read_lock();
+ list_lru_add(&pt->pages, &p->lru, nid, NULL);
+ rcu_read_unlock();
atomic_long_add(1 << pt->order, &allocated_pages);
mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, -num_pages);
mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, num_pages);
}
+static enum lru_status take_one_from_lru(struct list_head *item,
+ struct list_lru_one *list,
+ void *cb_arg)
+{
+ struct page **out_page = cb_arg;
+ struct page *p = container_of(item, struct page, lru);
+ list_lru_isolate(list, item);
+
+ *out_page = p;
+ return LRU_REMOVED;
+}
+
/* Take pages from a specific pool_type, return NULL when nothing available */
-static struct page *ttm_pool_type_take(struct ttm_pool_type *pt)
+static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid)
{
- struct page *p;
- int nid = ttm_pool_nid(pt->pool);
+ int ret;
+ struct page *p = NULL;
+ unsigned long nr_to_walk = 1;
- spin_lock(&pt->lock);
- p = list_first_entry_or_null(&pt->pages, typeof(*p), lru);
- if (p) {
+ ret = list_lru_walk_node(&pt->pages, nid, take_one_from_lru, (void *)&p, &nr_to_walk);
+ if (ret == 1 && p) {
atomic_long_sub(1 << pt->order, &allocated_pages);
mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, (1 << pt->order));
mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, -(1 << pt->order));
- list_del(&p->lru);
}
- spin_unlock(&pt->lock);
-
return p;
}
@@ -335,25 +345,47 @@ static void ttm_pool_type_init(struct ttm_pool_type *pt, struct ttm_pool *pool,
pt->pool = pool;
pt->caching = caching;
pt->order = order;
- spin_lock_init(&pt->lock);
- INIT_LIST_HEAD(&pt->pages);
+ list_lru_init(&pt->pages);
spin_lock(&shrinker_lock);
list_add_tail(&pt->shrinker_list, &shrinker_list);
spin_unlock(&shrinker_lock);
}
+static enum lru_status pool_move_to_dispose_list(struct list_head *item,
+ struct list_lru_one *list,
+ void *cb_arg)
+{
+ struct list_head *dispose = cb_arg;
+
+ list_lru_isolate_move(list, item, dispose);
+
+ return LRU_REMOVED;
+}
+
+static void ttm_pool_dispose_list(struct ttm_pool_type *pt,
+ struct list_head *dispose)
+{
+ while (!list_empty(dispose)) {
+ struct page *p;
+ p = list_first_entry(dispose, struct page, lru);
+ list_del_init(&p->lru);
+ atomic_long_sub(1 << pt->order, &allocated_pages);
+ ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
+ }
+}
+
/* Remove a pool_type from the global shrinker list and free all pages */
static void ttm_pool_type_fini(struct ttm_pool_type *pt)
{
- struct page *p;
+ LIST_HEAD(dispose);
spin_lock(&shrinker_lock);
list_del(&pt->shrinker_list);
spin_unlock(&shrinker_lock);
- while ((p = ttm_pool_type_take(pt)))
- ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
+ list_lru_walk(&pt->pages, pool_move_to_dispose_list, &dispose, LONG_MAX);
+ ttm_pool_dispose_list(pt, &dispose);
}
/* Return the pool_type to use for the given caching and order */
@@ -403,7 +435,7 @@ static unsigned int ttm_pool_shrink(void)
list_move_tail(&pt->shrinker_list, &shrinker_list);
spin_unlock(&shrinker_lock);
- p = ttm_pool_type_take(pt);
+ p = ttm_pool_type_take(pt, ttm_pool_nid(pt->pool));
if (p) {
ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
num_pages = 1 << pt->order;
@@ -757,7 +789,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
p = NULL;
pt = ttm_pool_select_type(pool, page_caching, order);
if (pt && allow_pools)
- p = ttm_pool_type_take(pt);
+ p = ttm_pool_type_take(pt, ttm_pool_nid(pool));
/*
* If that fails or previously failed, allocate from system.
* Note that this also disallows additional pool allocations using
@@ -1186,16 +1218,7 @@ static unsigned long ttm_pool_shrinker_count(struct shrinker *shrink,
/* Count the number of pages available in a pool_type */
static unsigned int ttm_pool_type_count(struct ttm_pool_type *pt)
{
- unsigned int count = 0;
- struct page *p;
-
- spin_lock(&pt->lock);
- /* Only used for debugfs, the overhead doesn't matter */
- list_for_each_entry(p, &pt->pages, lru)
- ++count;
- spin_unlock(&pt->lock);
-
- return count;
+ return list_lru_count(&pt->pages);
}
/* Print a nice header for the order */
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index 54cd34a6e4c0..df56527c4853 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -45,8 +45,7 @@ struct ttm_tt;
* @order: the allocation order our pages have
* @caching: the caching type our pages have
* @shrinker_list: our place on the global shrinker list
- * @lock: protection of the page list
- * @pages: the list of pages in the pool
+ * @pages: the lru_list of pages in the pool
*/
struct ttm_pool_type {
struct ttm_pool *pool;
@@ -55,8 +54,7 @@ struct ttm_pool_type {
struct list_head shrinker_list;
- spinlock_t lock;
- struct list_head pages;
+ struct list_lru pages;
};
/**
--
2.49.0
* [PATCH 05/17] ttm/pool: drop numa specific pools
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (3 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 04/17] ttm/pool: port to list_lru. (v2) Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 10:12 ` Christian König
2025-06-30 4:49 ` [PATCH 06/17] ttm/pool: make pool shrinker NUMA aware Dave Airlie
` (12 subsequent siblings)
17 siblings, 1 reply; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie
From: Dave Airlie <airlied@redhat.com>
The list_lru will now handle NUMA for us, so there is no need to keep
separate pool types for it. Just consolidate into the global ones.
This adds a debugfs change to avoid dumping non-existent orders due
to this change.
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
drivers/gpu/drm/ttm/ttm_pool.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 4372f0cc4a57..95bbbc843b97 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -399,17 +399,11 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
#ifdef CONFIG_X86
switch (caching) {
case ttm_write_combined:
- if (pool->nid != NUMA_NO_NODE)
- return &pool->caching[caching].orders[order];
-
if (pool->use_dma32)
return &global_dma32_write_combined[order];
return &global_write_combined[order];
case ttm_uncached:
- if (pool->nid != NUMA_NO_NODE)
- return &pool->caching[caching].orders[order];
-
if (pool->use_dma32)
return &global_dma32_uncached[order];
@@ -1295,6 +1289,8 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
if (!ttm_pool_select_type(pool, i, 0))
continue;
+ if (pool->caching[i].orders[0].pool != pool)
+ continue;
if (pool->use_dma_alloc)
seq_puts(m, "DMA ");
else
--
2.49.0
* [PATCH 06/17] ttm/pool: make pool shrinker NUMA aware
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (4 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 05/17] ttm/pool: drop numa specific pools Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 10:15 ` Christian König
2025-06-30 4:49 ` [PATCH 07/17] ttm/pool: track allocated_pages per numa node Dave Airlie
` (11 subsequent siblings)
17 siblings, 1 reply; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie
From: Dave Airlie <airlied@redhat.com>
This enables NUMA awareness for the shrinker on the
TTM pools.
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
drivers/gpu/drm/ttm/ttm_pool.c | 38 +++++++++++++++++++---------------
1 file changed, 21 insertions(+), 17 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 95bbbc843b97..66cd963b24dc 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -416,12 +416,12 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
return NULL;
}
-/* Free pages using the global shrinker list */
-static unsigned int ttm_pool_shrink(void)
+/* Free pages using the per-node shrinker list */
+static unsigned int ttm_pool_shrink(int nid, unsigned long num_to_free)
{
+ LIST_HEAD(dispose);
struct ttm_pool_type *pt;
unsigned int num_pages;
- struct page *p;
down_read(&pool_shrink_rwsem);
spin_lock(&shrinker_lock);
@@ -429,13 +429,10 @@ static unsigned int ttm_pool_shrink(void)
list_move_tail(&pt->shrinker_list, &shrinker_list);
spin_unlock(&shrinker_lock);
- p = ttm_pool_type_take(pt, ttm_pool_nid(pt->pool));
- if (p) {
- ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
- num_pages = 1 << pt->order;
- } else {
- num_pages = 0;
- }
+ num_pages = list_lru_walk_node(&pt->pages, nid, pool_move_to_dispose_list, &dispose, &num_to_free);
+ num_pages *= 1 << pt->order;
+
+ ttm_pool_dispose_list(pt, &dispose);
up_read(&pool_shrink_rwsem);
return num_pages;
@@ -784,6 +781,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
pt = ttm_pool_select_type(pool, page_caching, order);
if (pt && allow_pools)
p = ttm_pool_type_take(pt, ttm_pool_nid(pool));
+
/*
* If that fails or previously failed, allocate from system.
* Note that this also disallows additional pool allocations using
@@ -932,8 +930,10 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
{
ttm_pool_free_range(pool, tt, tt->caching, 0, tt->num_pages);
- while (atomic_long_read(&allocated_pages) > page_pool_size)
- ttm_pool_shrink();
+ while (atomic_long_read(&allocated_pages) > page_pool_size) {
+ unsigned long diff = page_pool_size - atomic_long_read(&allocated_pages);
+ ttm_pool_shrink(ttm_pool_nid(pool), diff);
+ }
}
EXPORT_SYMBOL(ttm_pool_free);
@@ -1190,7 +1190,7 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
unsigned long num_freed = 0;
do
- num_freed += ttm_pool_shrink();
+ num_freed += ttm_pool_shrink(sc->nid, sc->nr_to_scan);
while (num_freed < sc->nr_to_scan &&
atomic_long_read(&allocated_pages));
@@ -1323,11 +1323,15 @@ static int ttm_pool_debugfs_shrink_show(struct seq_file *m, void *data)
.nr_to_scan = TTM_SHRINKER_BATCH,
};
unsigned long count;
+ int nid;
fs_reclaim_acquire(GFP_KERNEL);
- count = ttm_pool_shrinker_count(mm_shrinker, &sc);
- seq_printf(m, "%lu/%lu\n", count,
- ttm_pool_shrinker_scan(mm_shrinker, &sc));
+ for_each_node(nid) {
+ sc.nid = nid;
+ count = ttm_pool_shrinker_count(mm_shrinker, &sc);
+ seq_printf(m, "%d: %lu/%lu\n", nid, count,
+ ttm_pool_shrinker_scan(mm_shrinker, &sc));
+ }
fs_reclaim_release(GFP_KERNEL);
return 0;
@@ -1375,7 +1379,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
#endif
#endif
- mm_shrinker = shrinker_alloc(0, "drm-ttm_pool");
+ mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
if (!mm_shrinker)
return -ENOMEM;
--
2.49.0
* [PATCH 07/17] ttm/pool: track allocated_pages per numa node.
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (5 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 06/17] ttm/pool: make pool shrinker NUMA aware Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 4:49 ` [PATCH 08/17] memcg: add support for GPU page counters Dave Airlie
` (10 subsequent siblings)
17 siblings, 0 replies; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie
From: Dave Airlie <airlied@redhat.com>
This gets the memory sizes from the nodes and stores the limit
as 50% of each. I think eventually we should drop the limits
once we have memcg-aware shrinking, but this should be more NUMA
friendly, and seems like what people would prefer to
happen on NUMA-aware systems.
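For example, on a node with 64 GiB of managed memory and 4 KiB pages, the
default limit works out to 8388608 pool pages (32 GiB) for that node,
unless page_pool_size is set explicitly.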
Cc: Christian Koenig <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
drivers/gpu/drm/ttm/ttm_pool.c | 57 +++++++++++++++++++++++++---------
1 file changed, 43 insertions(+), 14 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 66cd963b24dc..c6192c915f0d 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -115,10 +115,11 @@ struct ttm_pool_tt_restore {
static unsigned long page_pool_size;
-MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool");
+MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool per NUMA node");
module_param(page_pool_size, ulong, 0644);
-static atomic_long_t allocated_pages;
+static unsigned long pool_node_limit[MAX_NUMNODES];
+static atomic_long_t allocated_pages[MAX_NUMNODES];
static struct ttm_pool_type global_write_combined[NR_PAGE_ORDERS];
static struct ttm_pool_type global_uncached[NR_PAGE_ORDERS];
@@ -304,8 +305,8 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
rcu_read_lock();
list_lru_add(&pt->pages, &p->lru, nid, NULL);
rcu_read_unlock();
- atomic_long_add(1 << pt->order, &allocated_pages);
+ atomic_long_add(num_pages, &allocated_pages[nid]);
mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, -num_pages);
mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, num_pages);
}
@@ -331,7 +332,7 @@ static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid)
ret = list_lru_walk_node(&pt->pages, nid, take_one_from_lru, (void *)&p, &nr_to_walk);
if (ret == 1 && p) {
- atomic_long_sub(1 << pt->order, &allocated_pages);
+ atomic_long_sub(1 << pt->order, &allocated_pages[nid]);
mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, (1 << pt->order));
mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, -(1 << pt->order));
}
@@ -370,7 +371,7 @@ static void ttm_pool_dispose_list(struct ttm_pool_type *pt,
struct page *p;
p = list_first_entry(dispose, struct page, lru);
list_del_init(&p->lru);
- atomic_long_sub(1 << pt->order, &allocated_pages);
+ atomic_long_sub(1 << pt->order, &allocated_pages[page_to_nid(p)]);
ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
}
}
@@ -928,11 +929,13 @@ int ttm_pool_restore_and_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
*/
void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
{
+ int nid = ttm_pool_nid(pool);
+
ttm_pool_free_range(pool, tt, tt->caching, 0, tt->num_pages);
- while (atomic_long_read(&allocated_pages) > page_pool_size) {
- unsigned long diff = page_pool_size - atomic_long_read(&allocated_pages);
- ttm_pool_shrink(ttm_pool_nid(pool), diff);
+ while (atomic_long_read(&allocated_pages[nid]) > pool_node_limit[nid]) {
+ unsigned long diff = pool_node_limit[nid] - atomic_long_read(&allocated_pages[nid]);
+ ttm_pool_shrink(nid, diff);
}
}
EXPORT_SYMBOL(ttm_pool_free);
@@ -1192,7 +1195,7 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
do
num_freed += ttm_pool_shrink(sc->nid, sc->nr_to_scan);
while (num_freed < sc->nr_to_scan &&
- atomic_long_read(&allocated_pages));
+ atomic_long_read(&allocated_pages[sc->nid]));
sc->nr_scanned = num_freed;
@@ -1203,7 +1206,7 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
static unsigned long ttm_pool_shrinker_count(struct shrinker *shrink,
struct shrink_control *sc)
{
- unsigned long num_pages = atomic_long_read(&allocated_pages);
+ unsigned long num_pages = atomic_long_read(&allocated_pages[sc->nid]);
return num_pages ? num_pages : SHRINK_EMPTY;
}
@@ -1240,8 +1243,12 @@ static void ttm_pool_debugfs_orders(struct ttm_pool_type *pt,
/* Dump the total amount of allocated pages */
static void ttm_pool_debugfs_footer(struct seq_file *m)
{
- seq_printf(m, "\ntotal\t: %8lu of %8lu\n",
- atomic_long_read(&allocated_pages), page_pool_size);
+ int nid;
+
+ for_each_node(nid) {
+ seq_printf(m, "\ntotal node%d\t: %8lu of %8lu\n", nid,
+ atomic_long_read(&allocated_pages[nid]), pool_node_limit[nid]);
+ }
}
/* Dump the information for the global pools */
@@ -1340,6 +1347,22 @@ DEFINE_SHOW_ATTRIBUTE(ttm_pool_debugfs_shrink);
#endif
+static inline uint64_t ttm_get_node_memory_size(int nid)
+{
+ /* This is directly using si_meminfo_node implementation as the
+ * function is not exported.
+ */
+ int zone_type;
+ uint64_t managed_pages = 0;
+
+ pg_data_t *pgdat = NODE_DATA(nid);
+
+ for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
+ managed_pages +=
+ zone_managed_pages(&pgdat->node_zones[zone_type]);
+ return managed_pages * PAGE_SIZE;
+}
+
/**
* ttm_pool_mgr_init - Initialize globals
*
@@ -1351,8 +1374,14 @@ int ttm_pool_mgr_init(unsigned long num_pages)
{
unsigned int i;
- if (!page_pool_size)
- page_pool_size = num_pages;
+ int nid;
+ for_each_node(nid) {
+ if (!page_pool_size) {
+ uint64_t node_size = ttm_get_node_memory_size(nid);
+ pool_node_limit[nid] = (node_size >> PAGE_SHIFT) / 2;
+ } else
+ pool_node_limit[nid] = page_pool_size;
+ }
spin_lock_init(&shrinker_lock);
INIT_LIST_HEAD(&shrinker_list);
--
2.49.0
* [PATCH 08/17] memcg: add support for GPU page counters.
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (6 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 07/17] ttm/pool: track allocated_pages per numa node Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-07-02 16:06 ` Shakeel Butt
2025-06-30 4:49 ` [PATCH 09/17] memcg: export memcg_list_lru_alloc Dave Airlie
` (9 subsequent siblings)
17 siblings, 1 reply; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie
From: Dave Airlie <airlied@redhat.com>
This introduces 2 new statistics and 3 new memcontrol APIs for dealing
with GPU system memory allocations.
The stats correspond to the same stats in the global vmstat:
the number of active GPU pages, and the number of pages in pools that
can be reclaimed.
The first API charges an order of pages to an objcg, sets the objcg
on the pages like kmem does, and updates the active/reclaim statistic.
The second API uncharges a page from the obj cgroup it is currently charged
to.
The third API allows moving a page to/from reclaim and between obj cgroups.
When pages are added to the pool LRU, this just updates accounting.
When pages are being removed from a pool LRU, they can be taken from
the parent objcg, so this allows them to be uncharged from there and transferred
to a new child objcg.
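As a rough idea of how the driver side would be expected to call these APIs
(a minimal sketch with hypothetical helper names; the actual TTM wiring
comes later in the series):
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/memcontrol.h>
/* Sketch: charge on allocation, park in a pool, uncharge on free. */
static int gpu_charge_alloc(struct obj_cgroup *objcg, struct page *p,
			    unsigned int order)
{
	/* charge the active counter; false means the memcg limit was hit */
	if (!mem_cgroup_charge_gpu_page(objcg, p, order, GFP_KERNEL, false))
		return -ENOMEM;
	return 0;
}
static void gpu_park_in_pool(struct page *p, unsigned int order)
{
	/* same objcg, just move the page from active to reclaim */
	mem_cgroup_move_gpu_page_reclaim(NULL, p, order, true);
}
static void gpu_free(struct page *p, unsigned int order, bool from_pool)
{
	mem_cgroup_uncharge_gpu_page(p, order, from_pool);
}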
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
Documentation/admin-guide/cgroup-v2.rst | 6 ++
include/linux/memcontrol.h | 14 +++
mm/memcontrol.c | 110 ++++++++++++++++++++++++
3 files changed, 130 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 0cc35a14afbe..d6f057c4fe2e 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1542,6 +1542,12 @@ The following nested keys are defined.
vmalloc (npn)
Amount of memory used for vmap backed memory.
+ gpu (npn)
+ Amount of system memory used for GPU devices.
+
+ gpu_reclaim (npn)
+ Amount of system memory cached for GPU devices.
+
shmem
Amount of cached filesystem data that is swap-backed,
such as tmpfs, shm segments, shared anonymous mmap()s
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87b6688f124a..ff82d603910d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -36,6 +36,8 @@ enum memcg_stat_item {
MEMCG_SOCK,
MEMCG_PERCPU_B,
MEMCG_VMALLOC,
+ MEMCG_GPU,
+ MEMCG_GPU_RECLAIM,
MEMCG_KMEM,
MEMCG_ZSWAP_B,
MEMCG_ZSWAPPED,
@@ -1597,6 +1599,18 @@ struct sock;
bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
gfp_t gfp_mask);
void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
+
+bool mem_cgroup_charge_gpu_page(struct obj_cgroup *objcg, struct page *page,
+ unsigned int order,
+ gfp_t gfp_mask, bool reclaim);
+void mem_cgroup_uncharge_gpu_page(struct page *page,
+ unsigned int order,
+ bool reclaim);
+bool mem_cgroup_move_gpu_page_reclaim(struct obj_cgroup *objcg,
+ struct page *page,
+ unsigned int order,
+ bool to_reclaim);
+
#ifdef CONFIG_MEMCG
extern struct static_key_false memcg_sockets_enabled_key;
#define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 902da8a9c643..87d75963a9ed 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -337,6 +337,8 @@ static const unsigned int memcg_stat_items[] = {
MEMCG_SOCK,
MEMCG_PERCPU_B,
MEMCG_VMALLOC,
+ MEMCG_GPU,
+ MEMCG_GPU_RECLAIM,
MEMCG_KMEM,
MEMCG_ZSWAP_B,
MEMCG_ZSWAPPED,
@@ -1345,6 +1347,8 @@ static const struct memory_stat memory_stats[] = {
{ "percpu", MEMCG_PERCPU_B },
{ "sock", MEMCG_SOCK },
{ "vmalloc", MEMCG_VMALLOC },
+ { "gpu", MEMCG_GPU },
+ { "gpu_reclaim", MEMCG_GPU_RECLAIM },
{ "shmem", NR_SHMEM },
#ifdef CONFIG_ZSWAP
{ "zswap", MEMCG_ZSWAP_B },
@@ -5132,6 +5136,112 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
refill_stock(memcg, nr_pages);
}
+/**
+ * mem_cgroup_charge_gpu_page - charge a page to GPU memory tracking
+ * @objcg: objcg to charge, NULL charges root memcg
+ * @page: page to charge
+ * @order: page allocation order
+ * @gfp_mask: gfp mode
+ * @reclaim: charge the reclaim counter instead of the active one.
+ *
+ * Charge the order-sized @page to the objcg. Returns %true if the charge fits within
+ * @objcg's configured limit, %false if it doesn't.
+ */
+bool mem_cgroup_charge_gpu_page(struct obj_cgroup *objcg, struct page *page,
+ unsigned int order, gfp_t gfp_mask, bool reclaim)
+{
+ unsigned int nr_pages = 1 << order;
+ struct mem_cgroup *memcg;
+ int ret;
+
+ if (!objcg) {
+ mod_memcg_state(root_mem_cgroup, reclaim ? MEMCG_GPU_RECLAIM : MEMCG_GPU, nr_pages);
+ return true;
+ }
+
+ memcg = get_mem_cgroup_from_objcg(objcg);
+ ret = try_charge_memcg(memcg, gfp_mask, nr_pages);
+ if (ret) {
+ css_put(&memcg->css);
+ return false;
+ }
+
+ if (page) {
+ obj_cgroup_get(objcg);
+ page_set_objcg(page, objcg);
+ }
+
+ mod_memcg_state(memcg, reclaim ? MEMCG_GPU_RECLAIM : MEMCG_GPU, nr_pages);
+ css_put(&memcg->css);
+ return true;
+}
+EXPORT_SYMBOL_GPL(mem_cgroup_charge_gpu_page);
+
+/**
+ * mem_cgroup_uncharge_gpu_page - uncharge a page from GPU memory tracking
+ * @page: page to uncharge
+ * @order: order of the page allocation
+ * @reclaim: uncharge the reclaim counter instead of the active.
+ */
+void mem_cgroup_uncharge_gpu_page(struct page *page,
+ unsigned int order, bool reclaim)
+{
+ struct obj_cgroup *objcg = page_objcg(page);
+ struct mem_cgroup *memcg;
+
+ int nr_pages = 1 << order;
+ if (!objcg) {
+ mod_memcg_state(root_mem_cgroup, reclaim ? MEMCG_GPU_RECLAIM : MEMCG_GPU, -nr_pages);
+ return;
+ }
+
+ memcg = get_mem_cgroup_from_objcg(objcg);
+
+ mod_memcg_state(memcg, reclaim ? MEMCG_GPU_RECLAIM : MEMCG_GPU, -nr_pages);
+
+ if (page) {
+ page->memcg_data = 0;
+ obj_cgroup_put(objcg);
+ }
+ if (!mem_cgroup_is_root(memcg))
+ refill_stock(memcg, nr_pages);
+ css_put(&memcg->css);
+}
+EXPORT_SYMBOL_GPL(mem_cgroup_uncharge_gpu_page);
+
+/**
+ * mem_cgroup_move_gpu_page_reclaim - move a page from gpu to gpu reclaim and back
+ * @new_objcg: objcg to move page to, NULL if just stats update.
+ * @page: page to move, @order: page allocation order
+ * @to_reclaim: true moves pages into reclaim, false moves them back
+ */
+bool mem_cgroup_move_gpu_page_reclaim(struct obj_cgroup *new_objcg,
+ struct page *page,
+ unsigned int order,
+ bool to_reclaim)
+{
+ struct obj_cgroup *objcg = page_objcg(page);
+ struct mem_cgroup *memcg;
+ int nr_pages = 1 << order;
+ bool ret = true;
+
+ if (!objcg)
+ return false;
+
+ memcg = get_mem_cgroup_from_objcg(objcg);
+ if (!new_objcg || objcg == new_objcg) {
+ mod_memcg_state(memcg, to_reclaim ? MEMCG_GPU_RECLAIM : MEMCG_GPU, nr_pages);
+ mod_memcg_state(memcg, to_reclaim ? MEMCG_GPU : MEMCG_GPU_RECLAIM, -nr_pages);
+ } else {
+ mem_cgroup_uncharge_gpu_page(page, order, true);
+
+ ret = mem_cgroup_charge_gpu_page(new_objcg, page, order, 0, false);
+ }
+ css_put(&memcg->css);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(mem_cgroup_move_gpu_page_reclaim);
+
static int __init cgroup_memory(char *s)
{
char *token;
--
2.49.0
* [PATCH 09/17] memcg: export memcg_list_lru_alloc.
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (7 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 08/17] memcg: add support for GPU page counters Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 4:49 ` [PATCH 10/17] ttm: add a memcg accounting flag to the alloc/populate APIs Dave Airlie
` (8 subsequent siblings)
17 siblings, 0 replies; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie
From: Dave Airlie <airlied@redhat.com>
This is needed to use list_lru with memcg from a module. drm/ttm
wants to use this interface.
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
mm/list_lru.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 315362e3df3d..2892c1d945dd 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -558,6 +558,7 @@ int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
return xas_error(&xas);
}
+EXPORT_SYMBOL(memcg_list_lru_alloc);
#else
static inline void memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
{
--
2.49.0
* [PATCH 10/17] ttm: add a memcg accounting flag to the alloc/populate APIs
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (8 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 09/17] memcg: export memcg_list_lru_alloc Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 9:56 ` kernel test robot
2025-06-30 10:20 ` Christian König
2025-06-30 4:49 ` [PATCH 11/17] ttm/pool: initialise the shrinker earlier Dave Airlie
` (7 subsequent siblings)
17 siblings, 2 replies; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie
From: Dave Airlie <airlied@redhat.com>
This flag does nothing yet, but this changes the APIs across all
users to accept it in the future.
This flag will eventually be used to decide when to account a tt
populate to a memcg.
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 3 ++-
drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 5 +++--
drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c | 4 ++--
drivers/gpu/drm/loongson/lsdc_ttm.c | 3 ++-
drivers/gpu/drm/nouveau/nouveau_bo.c | 6 ++++--
drivers/gpu/drm/radeon/radeon_ttm.c | 3 ++-
drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c | 2 +-
drivers/gpu/drm/ttm/tests/ttm_pool_test.c | 16 ++++++++--------
drivers/gpu/drm/ttm/tests/ttm_tt_test.c | 12 ++++++------
drivers/gpu/drm/ttm/ttm_bo.c | 5 +++--
drivers/gpu/drm/ttm/ttm_bo_util.c | 6 +++---
drivers/gpu/drm/ttm/ttm_bo_vm.c | 4 +++-
drivers/gpu/drm/ttm/ttm_pool.c | 6 ++++--
drivers/gpu/drm/ttm/ttm_tt.c | 8 +++++---
drivers/gpu/drm/vmwgfx/vmwgfx_blit.c | 4 ++--
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c | 7 ++++---
drivers/gpu/drm/xe/xe_bo.c | 3 ++-
include/drm/ttm/ttm_bo.h | 1 +
include/drm/ttm/ttm_device.h | 1 +
include/drm/ttm/ttm_pool.h | 1 +
include/drm/ttm/ttm_tt.h | 1 +
22 files changed, 61 insertions(+), 42 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 9c5df35f05b7..920b412156dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1138,6 +1138,7 @@ static struct ttm_tt *amdgpu_ttm_tt_create(struct ttm_buffer_object *bo,
*/
static int amdgpu_ttm_tt_populate(struct ttm_device *bdev,
struct ttm_tt *ttm,
+ bool memcg_account,
struct ttm_operation_ctx *ctx)
{
struct amdgpu_device *adev = amdgpu_ttm_adev(bdev);
@@ -1161,7 +1162,7 @@ static int amdgpu_ttm_tt_populate(struct ttm_device *bdev,
pool = &adev->mman.ttm_pools[gtt->pool_id];
else
pool = &adev->mman.bdev.pool;
- ret = ttm_pool_alloc(pool, ttm, ctx);
+ ret = ttm_pool_alloc(pool, ttm, memcg_account, ctx);
if (ret)
return ret;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
index 1f4814968868..6cdaf3696583 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
@@ -314,6 +314,7 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
static int i915_ttm_tt_populate(struct ttm_device *bdev,
struct ttm_tt *ttm,
+ bool memcg_account,
struct ttm_operation_ctx *ctx)
{
struct i915_ttm_tt *i915_tt = container_of(ttm, typeof(*i915_tt), ttm);
@@ -321,7 +322,7 @@ static int i915_ttm_tt_populate(struct ttm_device *bdev,
if (i915_tt->is_shmem)
return i915_ttm_tt_shmem_populate(bdev, ttm, ctx);
- return ttm_pool_alloc(&bdev->pool, ttm, ctx);
+ return ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
}
static void i915_ttm_tt_unpopulate(struct ttm_device *bdev, struct ttm_tt *ttm)
@@ -808,7 +809,7 @@ static int __i915_ttm_get_pages(struct drm_i915_gem_object *obj,
}
if (bo->ttm && !ttm_tt_is_populated(bo->ttm)) {
- ret = ttm_bo_populate(bo, &ctx);
+ ret = ttm_bo_populate(bo, false, &ctx);
if (ret)
return ret;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
index 2f6b33edb9c9..4ab1eb3e42bc 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
@@ -624,7 +624,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, bool evict,
/* Populate ttm with pages if needed. Typically system memory. */
if (ttm && (dst_man->use_tt || (ttm->page_flags & TTM_TT_FLAG_SWAPPED))) {
- ret = ttm_bo_populate(bo, ctx);
+ ret = ttm_bo_populate(bo, false, ctx);
if (ret)
return ret;
}
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c
index 61596cecce4d..0b555979d786 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c
@@ -90,7 +90,7 @@ static int i915_ttm_backup(struct i915_gem_apply_to_region *apply,
goto out_no_lock;
backup_bo = i915_gem_to_ttm(backup);
- err = ttm_bo_populate(backup_bo, &ctx);
+ err = ttm_bo_populate(backup_bo, false, &ctx);
if (err)
goto out_no_populate;
@@ -189,7 +189,7 @@ static int i915_ttm_restore(struct i915_gem_apply_to_region *apply,
if (!backup_bo->resource)
err = ttm_bo_validate(backup_bo, i915_ttm_sys_placement(), &ctx);
if (!err)
- err = ttm_bo_populate(backup_bo, &ctx);
+ err = ttm_bo_populate(backup_bo, false, &ctx);
if (!err) {
err = i915_gem_obj_copy_ttm(obj, backup, pm_apply->allow_gpu,
false);
diff --git a/drivers/gpu/drm/loongson/lsdc_ttm.c b/drivers/gpu/drm/loongson/lsdc_ttm.c
index 2e42c6970c9f..6d8781506802 100644
--- a/drivers/gpu/drm/loongson/lsdc_ttm.c
+++ b/drivers/gpu/drm/loongson/lsdc_ttm.c
@@ -110,6 +110,7 @@ lsdc_ttm_tt_create(struct ttm_buffer_object *tbo, uint32_t page_flags)
static int lsdc_ttm_tt_populate(struct ttm_device *bdev,
struct ttm_tt *ttm,
+ bool memcg_account,
struct ttm_operation_ctx *ctx)
{
bool slave = !!(ttm->page_flags & TTM_TT_FLAG_EXTERNAL);
@@ -122,7 +123,7 @@ static int lsdc_ttm_tt_populate(struct ttm_device *bdev,
return 0;
}
- return ttm_pool_alloc(&bdev->pool, ttm, ctx);
+ return ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
}
static void lsdc_ttm_tt_unpopulate(struct ttm_device *bdev,
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index b96f0555ca14..1f2b9f5f2bf8 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -1417,7 +1417,9 @@ vm_fault_t nouveau_ttm_fault_reserve_notify(struct ttm_buffer_object *bo)
static int
nouveau_ttm_tt_populate(struct ttm_device *bdev,
- struct ttm_tt *ttm, struct ttm_operation_ctx *ctx)
+ struct ttm_tt *ttm,
+ bool memcg_account,
+ struct ttm_operation_ctx *ctx)
{
struct ttm_tt *ttm_dma = (void *)ttm;
struct nouveau_drm *drm;
@@ -1434,7 +1436,7 @@ nouveau_ttm_tt_populate(struct ttm_device *bdev,
drm = nouveau_bdev(bdev);
- return ttm_pool_alloc(&drm->ttm.bdev.pool, ttm, ctx);
+ return ttm_pool_alloc(&drm->ttm.bdev.pool, ttm, memcg_account, ctx);
}
static void
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 616d25c8c2de..8c4273239d16 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -526,6 +526,7 @@ static struct radeon_ttm_tt *radeon_ttm_tt_to_gtt(struct radeon_device *rdev,
static int radeon_ttm_tt_populate(struct ttm_device *bdev,
struct ttm_tt *ttm,
+ bool memcg_account,
struct ttm_operation_ctx *ctx)
{
struct radeon_device *rdev = radeon_get_rdev(bdev);
@@ -547,7 +548,7 @@ static int radeon_ttm_tt_populate(struct ttm_device *bdev,
return 0;
}
- return ttm_pool_alloc(&rdev->mman.bdev.pool, ttm, ctx);
+ return ttm_pool_alloc(&rdev->mman.bdev.pool, ttm, memcg_account, ctx);
}
static void radeon_ttm_tt_unpopulate(struct ttm_device *bdev, struct ttm_tt *ttm)
diff --git a/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c b/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
index 3148f5d3dbd6..b52e3c1089e6 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
@@ -538,7 +538,7 @@ static void ttm_bo_validate_no_placement_signaled(struct kunit *test)
if (params->with_ttm) {
old_tt = priv->ttm_dev->funcs->ttm_tt_create(bo, 0);
- ttm_pool_alloc(&priv->ttm_dev->pool, old_tt, &ctx);
+ ttm_pool_alloc(&priv->ttm_dev->pool, old_tt, false, &ctx);
bo->ttm = old_tt;
}
diff --git a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
index 39234a3e98c4..aaf152c2383d 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
@@ -88,7 +88,7 @@ static struct ttm_pool *ttm_pool_pre_populated(struct kunit *test,
ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
- err = ttm_pool_alloc(pool, tt, &simple_ctx);
+ err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
KUNIT_ASSERT_EQ(test, err, 0);
ttm_pool_free(pool, tt);
@@ -157,7 +157,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
KUNIT_ASSERT_EQ(test, pool->nid, NUMA_NO_NODE);
KUNIT_ASSERT_EQ(test, pool->use_dma_alloc, params->use_dma_alloc);
- err = ttm_pool_alloc(pool, tt, &simple_ctx);
+ err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
KUNIT_ASSERT_EQ(test, err, 0);
KUNIT_ASSERT_EQ(test, tt->num_pages, expected_num_pages);
@@ -220,7 +220,7 @@ static void ttm_pool_alloc_basic_dma_addr(struct kunit *test)
ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
- err = ttm_pool_alloc(pool, tt, &simple_ctx);
+ err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
KUNIT_ASSERT_EQ(test, err, 0);
KUNIT_ASSERT_EQ(test, tt->num_pages, expected_num_pages);
@@ -253,7 +253,7 @@ static void ttm_pool_alloc_order_caching_match(struct kunit *test)
tt = ttm_tt_kunit_init(test, 0, caching, size);
KUNIT_ASSERT_NOT_NULL(test, tt);
- err = ttm_pool_alloc(pool, tt, &simple_ctx);
+ err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
KUNIT_ASSERT_EQ(test, err, 0);
KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
@@ -285,7 +285,7 @@ static void ttm_pool_alloc_caching_mismatch(struct kunit *test)
KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt_tt->pages));
- err = ttm_pool_alloc(pool, tt, &simple_ctx);
+ err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
KUNIT_ASSERT_EQ(test, err, 0);
ttm_pool_free(pool, tt);
@@ -319,7 +319,7 @@ static void ttm_pool_alloc_order_mismatch(struct kunit *test)
KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt_tt->pages));
- err = ttm_pool_alloc(pool, tt, &simple_ctx);
+ err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
KUNIT_ASSERT_EQ(test, err, 0);
ttm_pool_free(pool, tt);
@@ -349,7 +349,7 @@ static void ttm_pool_free_dma_alloc(struct kunit *test)
KUNIT_ASSERT_NOT_NULL(test, pool);
ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
- ttm_pool_alloc(pool, tt, &simple_ctx);
+ ttm_pool_alloc(pool, tt, false, &simple_ctx);
pt = &pool->caching[caching].orders[order];
KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
@@ -380,7 +380,7 @@ static void ttm_pool_free_no_dma_alloc(struct kunit *test)
KUNIT_ASSERT_NOT_NULL(test, pool);
ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false);
- ttm_pool_alloc(pool, tt, &simple_ctx);
+ ttm_pool_alloc(pool, tt, false, &simple_ctx);
pt = &pool->caching[caching].orders[order];
KUNIT_ASSERT_TRUE(test, list_lru_count(&pt->pages) == 1);
diff --git a/drivers/gpu/drm/ttm/tests/ttm_tt_test.c b/drivers/gpu/drm/ttm/tests/ttm_tt_test.c
index 61ec6f580b62..333c503e218b 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_tt_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_tt_test.c
@@ -262,7 +262,7 @@ static void ttm_tt_populate_null_ttm(struct kunit *test)
struct ttm_operation_ctx ctx = { };
int err;
- err = ttm_tt_populate(devs->ttm_dev, NULL, &ctx);
+ err = ttm_tt_populate(devs->ttm_dev, NULL, false, &ctx);
KUNIT_ASSERT_EQ(test, err, -EINVAL);
}
@@ -283,11 +283,11 @@ static void ttm_tt_populate_populated_ttm(struct kunit *test)
err = ttm_tt_init(tt, bo, 0, ttm_cached, 0);
KUNIT_ASSERT_EQ(test, err, 0);
- err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
+ err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
KUNIT_ASSERT_EQ(test, err, 0);
populated_page = *tt->pages;
- err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
+ err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
KUNIT_ASSERT_PTR_EQ(test, populated_page, *tt->pages);
}
@@ -307,7 +307,7 @@ static void ttm_tt_unpopulate_basic(struct kunit *test)
err = ttm_tt_init(tt, bo, 0, ttm_cached, 0);
KUNIT_ASSERT_EQ(test, err, 0);
- err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
+ err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
KUNIT_ASSERT_EQ(test, err, 0);
KUNIT_ASSERT_TRUE(test, ttm_tt_is_populated(tt));
@@ -351,7 +351,7 @@ static void ttm_tt_swapin_basic(struct kunit *test)
err = ttm_tt_init(tt, bo, 0, ttm_cached, 0);
KUNIT_ASSERT_EQ(test, err, 0);
- err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
+ err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
KUNIT_ASSERT_EQ(test, err, 0);
KUNIT_ASSERT_TRUE(test, ttm_tt_is_populated(tt));
@@ -361,7 +361,7 @@ static void ttm_tt_swapin_basic(struct kunit *test)
KUNIT_ASSERT_TRUE(test, tt->page_flags & TTM_TT_FLAG_SWAPPED);
/* Swapout depopulates TT, allocate pages and then swap them in */
- err = ttm_pool_alloc(&devs->ttm_dev->pool, tt, &ctx);
+ err = ttm_pool_alloc(&devs->ttm_dev->pool, tt, false, &ctx);
KUNIT_ASSERT_EQ(test, err, 0);
err = ttm_tt_swapin(tt);
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index f4d9e68b21e7..af04bb8e2c2a 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -142,7 +142,7 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
goto out_err;
if (mem->mem_type != TTM_PL_SYSTEM) {
- ret = ttm_bo_populate(bo, ctx);
+ ret = ttm_bo_populate(bo, false, ctx);
if (ret)
goto out_err;
}
@@ -1256,6 +1256,7 @@ void ttm_bo_tt_destroy(struct ttm_buffer_object *bo)
* is set to true.
*/
int ttm_bo_populate(struct ttm_buffer_object *bo,
+ bool memcg_account,
struct ttm_operation_ctx *ctx)
{
struct ttm_tt *tt = bo->ttm;
@@ -1268,7 +1269,7 @@ int ttm_bo_populate(struct ttm_buffer_object *bo,
return 0;
swapped = ttm_tt_is_swapped(tt);
- ret = ttm_tt_populate(bo->bdev, tt, ctx);
+ ret = ttm_tt_populate(bo->bdev, tt, memcg_account, ctx);
if (ret)
return ret;
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 250675d56b1c..764d1cf1ecbe 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -167,7 +167,7 @@ int ttm_bo_move_memcpy(struct ttm_buffer_object *bo,
src_man = ttm_manager_type(bdev, src_mem->mem_type);
if (ttm && ((ttm->page_flags & TTM_TT_FLAG_SWAPPED) ||
dst_man->use_tt)) {
- ret = ttm_bo_populate(bo, ctx);
+ ret = ttm_bo_populate(bo, false, ctx);
if (ret)
return ret;
}
@@ -354,7 +354,7 @@ static int ttm_bo_kmap_ttm(struct ttm_buffer_object *bo,
BUG_ON(!ttm);
- ret = ttm_bo_populate(bo, &ctx);
+ ret = ttm_bo_populate(bo, false, &ctx);
if (ret)
return ret;
@@ -511,7 +511,7 @@ int ttm_bo_vmap(struct ttm_buffer_object *bo, struct iosys_map *map)
pgprot_t prot;
void *vaddr;
- ret = ttm_bo_populate(bo, &ctx);
+ ret = ttm_bo_populate(bo, false, &ctx);
if (ret)
return ret;
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index b47020fca199..c5ad447debe3 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -225,7 +225,9 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
};
ttm = bo->ttm;
- err = ttm_bo_populate(bo, &ctx);
+ err = ttm_bo_populate(bo,
+ false,
+ &ctx);
if (err) {
if (err == -EINTR || err == -ERESTARTSYS ||
err == -EAGAIN)
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index c6192c915f0d..0526900366e5 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -744,6 +744,7 @@ static unsigned int ttm_pool_alloc_find_order(unsigned int highest,
}
static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
+ bool memcg_account,
const struct ttm_operation_ctx *ctx,
struct ttm_pool_alloc_state *alloc,
struct ttm_pool_tt_restore *restore)
@@ -854,6 +855,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
* Returns: 0 on successe, negative error code otherwise.
*/
int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
+ bool memcg_account,
struct ttm_operation_ctx *ctx)
{
struct ttm_pool_alloc_state alloc;
@@ -863,7 +865,7 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
ttm_pool_alloc_state_init(tt, &alloc);
- return __ttm_pool_alloc(pool, tt, ctx, &alloc, NULL);
+ return __ttm_pool_alloc(pool, tt, memcg_account, ctx, &alloc, NULL);
}
EXPORT_SYMBOL(ttm_pool_alloc);
@@ -916,7 +918,7 @@ int ttm_pool_restore_and_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
return 0;
}
- return __ttm_pool_alloc(pool, tt, ctx, &alloc, tt->restore);
+ return __ttm_pool_alloc(pool, tt, false, ctx, &alloc, tt->restore);
}
/**
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 506e257dfba8..8f38de3b2f1c 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -366,7 +366,9 @@ int ttm_tt_swapout(struct ttm_device *bdev, struct ttm_tt *ttm,
EXPORT_SYMBOL_FOR_TESTS_ONLY(ttm_tt_swapout);
int ttm_tt_populate(struct ttm_device *bdev,
- struct ttm_tt *ttm, struct ttm_operation_ctx *ctx)
+ struct ttm_tt *ttm,
+ bool memcg_account,
+ struct ttm_operation_ctx *ctx)
{
int ret;
@@ -395,9 +397,9 @@ int ttm_tt_populate(struct ttm_device *bdev,
}
if (bdev->funcs->ttm_tt_populate)
- ret = bdev->funcs->ttm_tt_populate(bdev, ttm, ctx);
+ ret = bdev->funcs->ttm_tt_populate(bdev, ttm, memcg_account, ctx);
else
- ret = ttm_pool_alloc(&bdev->pool, ttm, ctx);
+ ret = ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
if (ret)
goto error;
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c b/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
index fa5841fda659..a4d4ebf585fe 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
@@ -569,13 +569,13 @@ int vmw_bo_cpu_blit(struct vmw_bo *vmw_dst,
dma_resv_assert_held(src->base.resv);
if (!ttm_tt_is_populated(dst->ttm)) {
- ret = dst->bdev->funcs->ttm_tt_populate(dst->bdev, dst->ttm, &ctx);
+ ret = dst->bdev->funcs->ttm_tt_populate(dst->bdev, dst->ttm, false, &ctx);
if (ret)
return ret;
}
if (!ttm_tt_is_populated(src->ttm)) {
- ret = src->bdev->funcs->ttm_tt_populate(src->bdev, src->ttm, &ctx);
+ ret = src->bdev->funcs->ttm_tt_populate(src->bdev, src->ttm, false, &ctx);
if (ret)
return ret;
}
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
index 5553892d7c3e..2351dafc1c68 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
@@ -360,7 +360,8 @@ static void vmw_ttm_destroy(struct ttm_device *bdev, struct ttm_tt *ttm)
static int vmw_ttm_populate(struct ttm_device *bdev,
- struct ttm_tt *ttm, struct ttm_operation_ctx *ctx)
+ struct ttm_tt *ttm, bool memcg_account,
+ struct ttm_operation_ctx *ctx)
{
bool external = (ttm->page_flags & TTM_TT_FLAG_EXTERNAL) != 0;
@@ -372,7 +373,7 @@ static int vmw_ttm_populate(struct ttm_device *bdev,
ttm->dma_address,
ttm->num_pages);
- return ttm_pool_alloc(&bdev->pool, ttm, ctx);
+ return ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
}
static void vmw_ttm_unpopulate(struct ttm_device *bdev,
@@ -580,7 +581,7 @@ int vmw_bo_create_and_populate(struct vmw_private *dev_priv,
if (unlikely(ret != 0))
return ret;
- ret = vmw_ttm_populate(vbo->tbo.bdev, vbo->tbo.ttm, &ctx);
+ ret = vmw_ttm_populate(vbo->tbo.bdev, vbo->tbo.ttm, false, &ctx);
if (likely(ret == 0)) {
struct vmw_ttm_tt *vmw_tt =
container_of(vbo->tbo.ttm, struct vmw_ttm_tt, dma_ttm);
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 7aa2c17825da..522cbff11563 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -504,6 +504,7 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
}
static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
+ bool memcg_account,
struct ttm_operation_ctx *ctx)
{
struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
@@ -521,7 +522,7 @@ static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
err = ttm_tt_restore(ttm_dev, tt, ctx);
} else {
ttm_tt_clear_backed_up(tt);
- err = ttm_pool_alloc(&ttm_dev->pool, tt, ctx);
+ err = ttm_pool_alloc(&ttm_dev->pool, tt, memcg_account, ctx);
}
if (err)
return err;
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index 894ff7ccd68e..099dc2604baa 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -464,6 +464,7 @@ pgprot_t ttm_io_prot(struct ttm_buffer_object *bo, struct ttm_resource *res,
pgprot_t tmp);
void ttm_bo_tt_destroy(struct ttm_buffer_object *bo);
int ttm_bo_populate(struct ttm_buffer_object *bo,
+ bool memcg_account,
struct ttm_operation_ctx *ctx);
/* Driver LRU walk helpers initially targeted for shrinking. */
diff --git a/include/drm/ttm/ttm_device.h b/include/drm/ttm/ttm_device.h
index 39b8636b1845..903ca40ebf92 100644
--- a/include/drm/ttm/ttm_device.h
+++ b/include/drm/ttm/ttm_device.h
@@ -84,6 +84,7 @@ struct ttm_device_funcs {
*/
int (*ttm_tt_populate)(struct ttm_device *bdev,
struct ttm_tt *ttm,
+ bool memcg_account,
struct ttm_operation_ctx *ctx);
/**
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index df56527c4853..da5b94226203 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -79,6 +79,7 @@ struct ttm_pool {
};
int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
+ bool memcg_account,
struct ttm_operation_ctx *ctx);
void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);
diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
index 406437ad674b..15d4019685f6 100644
--- a/include/drm/ttm/ttm_tt.h
+++ b/include/drm/ttm/ttm_tt.h
@@ -250,6 +250,7 @@ int ttm_tt_swapout(struct ttm_device *bdev, struct ttm_tt *ttm,
* Calls the driver method to allocate pages for a ttm
*/
int ttm_tt_populate(struct ttm_device *bdev, struct ttm_tt *ttm,
+ bool memcg_account,
struct ttm_operation_ctx *ctx);
/**
--
2.49.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH 11/17] ttm/pool: initialise the shrinker earlier
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (9 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 10/17] ttm: add a memcg accounting flag to the alloc/populate APIs Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 4:49 ` [PATCH 12/17] ttm: add objcg pointer to bo and tt Dave Airlie
` (6 subsequent siblings)
17 siblings, 0 replies; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie
From: Dave Airlie <airlied@redhat.com>
Later memcg enablement needs the shrinker initialised before the list_lru,
so just move the initialisation earlier for now.
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
drivers/gpu/drm/ttm/ttm_pool.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 0526900366e5..210f4ac4de67 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -1388,6 +1388,17 @@ int ttm_pool_mgr_init(unsigned long num_pages)
spin_lock_init(&shrinker_lock);
INIT_LIST_HEAD(&shrinker_list);
+ mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
+ if (!mm_shrinker)
+ return -ENOMEM;
+
+ mm_shrinker->count_objects = ttm_pool_shrinker_count;
+ mm_shrinker->scan_objects = ttm_pool_shrinker_scan;
+ mm_shrinker->batch = TTM_SHRINKER_BATCH;
+ mm_shrinker->seeks = 1;
+
+ shrinker_register(mm_shrinker);
+
for (i = 0; i < NR_PAGE_ORDERS; ++i) {
ttm_pool_type_init(&global_write_combined[i], NULL,
ttm_write_combined, i);
@@ -1410,17 +1421,6 @@ int ttm_pool_mgr_init(unsigned long num_pages)
#endif
#endif
- mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
- if (!mm_shrinker)
- return -ENOMEM;
-
- mm_shrinker->count_objects = ttm_pool_shrinker_count;
- mm_shrinker->scan_objects = ttm_pool_shrinker_scan;
- mm_shrinker->batch = TTM_SHRINKER_BATCH;
- mm_shrinker->seeks = 1;
-
- shrinker_register(mm_shrinker);
-
return 0;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH 12/17] ttm: add objcg pointer to bo and tt
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (10 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 11/17] ttm/pool: initialise the shrinker earlier Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 10:24 ` Christian König
2025-06-30 4:49 ` [PATCH 13/17] ttm/pool: enable memcg tracking and shrinker Dave Airlie
` (5 subsequent siblings)
17 siblings, 1 reply; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie
From: Dave Airlie <airlied@redhat.com>
This just adds the obj cgroup pointer to the bo and tt structs,
and propagates it from the bo to the tt when the tt is initialised.
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
drivers/gpu/drm/ttm/ttm_tt.c | 1 +
include/drm/ttm/ttm_bo.h | 6 ++++++
include/drm/ttm/ttm_tt.h | 2 ++
3 files changed, 9 insertions(+)
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 8f38de3b2f1c..0c54d5e2bfdd 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -162,6 +162,7 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
ttm->caching = caching;
ttm->restore = NULL;
ttm->backup = NULL;
+ ttm->objcg = bo->objcg;
}
int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index 099dc2604baa..f26ec0a0273f 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -135,6 +135,12 @@ struct ttm_buffer_object {
* reservation lock.
*/
struct sg_table *sg;
+
+ /**
+ * @objcg: object cgroup to charge this to if it ends up using system memory.
+ * NULL means don't charge.
+ */
+ struct obj_cgroup *objcg;
};
#define TTM_BO_MAP_IOMEM_MASK 0x80
diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
index 15d4019685f6..c13fea4c2915 100644
--- a/include/drm/ttm/ttm_tt.h
+++ b/include/drm/ttm/ttm_tt.h
@@ -126,6 +126,8 @@ struct ttm_tt {
enum ttm_caching caching;
/** @restore: Partial restoration from backup state. TTM private */
struct ttm_pool_tt_restore *restore;
+ /** @objcg: Object cgroup for this TT allocation */
+ struct obj_cgroup *objcg;
};
/**
--
2.49.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH 13/17] ttm/pool: enable memcg tracking and shrinker.
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (11 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 12/17] ttm: add objcg pointer to bo and tt Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 10:23 ` Christian König
` (2 more replies)
2025-06-30 4:49 ` [PATCH 14/17] ttm: hook up memcg placement flags Dave Airlie
` (4 subsequent siblings)
17 siblings, 3 replies; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie
From: Dave Airlie <airlied@redhat.com>
This enables all the backend code to use the list_lru in memcg mode,
and sets the shrinker to be memcg aware.
It also handles the case where pooled pages have been reparented to a
higher memcg group: a newer memcg can search for them in its parents
and take them back.
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
drivers/gpu/drm/ttm/ttm_pool.c | 123 ++++++++++++++++++++++++++++-----
1 file changed, 105 insertions(+), 18 deletions(-)
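Not part of the patch, but a minimal sketch of the lookup order the commit
message describes: start at the bo's own memcg and walk up the hierarchy so
that pages reparented to a common parent can be reused. Names mirror the
diff below; charging, stats updates and error handling are omitted.
/*
 * Illustrative sketch only: take one pooled page for the given objcg,
 * falling back to parent memcgs when the current level has nothing.
 */
static struct page *sketch_pool_take(struct ttm_pool_type *pt, int nid,
				     struct obj_cgroup *objcg)
{
	struct mem_cgroup *orig = objcg ? get_mem_cgroup_from_objcg(objcg) : NULL;
	struct mem_cgroup *memcg = orig;
	struct page *p = NULL;
	unsigned long nr;

	for (;;) {
		nr = 1;
		list_lru_walk_one(&pt->pages, nid, memcg, take_one_from_lru,
				  (void *)&p, &nr);
		if (p || !memcg)
			break;
		/* nothing pooled at this level, try the parent memcg */
		memcg = parent_mem_cgroup(memcg);
		if (!memcg)
			break;
	}

	mem_cgroup_put(orig);
	return p;
}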
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 210f4ac4de67..49e92f40ab23 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -143,7 +143,9 @@ static int ttm_pool_nid(struct ttm_pool *pool) {
}
/* Allocate pages of size 1 << order with the given gfp_flags */
-static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
+static struct page *ttm_pool_alloc_page(struct ttm_pool *pool,
+ struct obj_cgroup *objcg,
+ gfp_t gfp_flags,
unsigned int order)
{
unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
@@ -163,6 +165,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
p = alloc_pages_node(pool->nid, gfp_flags, order);
if (p) {
p->private = order;
+ if (!mem_cgroup_charge_gpu_page(objcg, p, order, gfp_flags, false)) {
+ __free_pages(p, order);
+ return NULL;
+ }
mod_node_page_state(NODE_DATA(ttm_pool_nid(pool)), NR_GPU_ACTIVE, (1 << order));
}
return p;
@@ -214,6 +220,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
#endif
if (!pool || !pool->use_dma_alloc) {
+ mem_cgroup_uncharge_gpu_page(p, order, reclaim);
mod_node_page_state(NODE_DATA(ttm_pool_nid(pool)),
reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE,
-(1 << order));
@@ -303,12 +310,13 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
INIT_LIST_HEAD(&p->lru);
rcu_read_lock();
- list_lru_add(&pt->pages, &p->lru, nid, NULL);
+ list_lru_add(&pt->pages, &p->lru, nid, page_memcg_check(p));
rcu_read_unlock();
atomic_long_add(num_pages, &allocated_pages[nid]);
mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, -num_pages);
mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, num_pages);
+ mem_cgroup_move_gpu_page_reclaim(NULL, p, pt->order, true);
}
static enum lru_status take_one_from_lru(struct list_head *item,
@@ -323,20 +331,59 @@ static enum lru_status take_one_from_lru(struct list_head *item,
return LRU_REMOVED;
}
-/* Take pages from a specific pool_type, return NULL when nothing available */
-static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid)
+static int pool_lru_get_page(struct ttm_pool_type *pt, int nid,
+ struct page **page_out,
+ struct obj_cgroup *objcg,
+ struct mem_cgroup *memcg)
{
int ret;
struct page *p = NULL;
unsigned long nr_to_walk = 1;
+ unsigned int num_pages = 1 << pt->order;
- ret = list_lru_walk_node(&pt->pages, nid, take_one_from_lru, (void *)&p, &nr_to_walk);
+ ret = list_lru_walk_one(&pt->pages, nid, memcg, take_one_from_lru, (void *)&p, &nr_to_walk);
if (ret == 1 && p) {
- atomic_long_sub(1 << pt->order, &allocated_pages[nid]);
- mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, (1 << pt->order));
- mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, -(1 << pt->order));
+ atomic_long_sub(num_pages, &allocated_pages[nid]);
+ mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, -num_pages);
+
+ if (!mem_cgroup_move_gpu_page_reclaim(objcg, p, pt->order, false)) {
+ __free_pages(p, pt->order);
+ p = NULL;
+ }
+ if (p)
+ mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, num_pages);
}
- return p;
+ *page_out = p;
+ return ret;
+}
+
+/* Take pages from a specific pool_type, return NULL when nothing available */
+static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid,
+ struct obj_cgroup *orig_objcg)
+{
+ struct page *page_out = NULL;
+ int ret;
+ struct mem_cgroup *orig_memcg = orig_objcg ? get_mem_cgroup_from_objcg(orig_objcg) : NULL;
+ struct mem_cgroup *memcg = orig_memcg;
+
+ /*
+ * Attempt to get a page from the current memcg, but if it hasn't got any in it's level,
+ * go up to the parent and check there. This helps the scenario where multiple apps get
+ * started into their own cgroup from a common parent and want to reuse the pools.
+ */
+ while (!page_out) {
+ ret = pool_lru_get_page(pt, nid, &page_out, orig_objcg, memcg);
+ if (ret == 1)
+ break;
+ if (!memcg)
+ break;
+ memcg = parent_mem_cgroup(memcg);
+ if (!memcg)
+ break;
+ }
+
+ mem_cgroup_put(orig_memcg);
+ return page_out;
}
/* Initialize and add a pool type to the global shrinker list */
@@ -346,7 +393,7 @@ static void ttm_pool_type_init(struct ttm_pool_type *pt, struct ttm_pool *pool,
pt->pool = pool;
pt->caching = caching;
pt->order = order;
- list_lru_init(&pt->pages);
+ list_lru_init_memcg(&pt->pages, mm_shrinker);
spin_lock(&shrinker_lock);
list_add_tail(&pt->shrinker_list, &shrinker_list);
@@ -389,6 +436,30 @@ static void ttm_pool_type_fini(struct ttm_pool_type *pt)
ttm_pool_dispose_list(pt, &dispose);
}
+static int ttm_pool_check_objcg(struct obj_cgroup *objcg)
+{
+#ifdef CONFIG_MEMCG
+ int r = 0;
+ struct mem_cgroup *memcg;
+ if (!objcg)
+ return 0;
+
+ memcg = get_mem_cgroup_from_objcg(objcg);
+ for (unsigned i = 0; i < NR_PAGE_ORDERS; i++) {
+ r = memcg_list_lru_alloc(memcg, &global_write_combined[i].pages, GFP_KERNEL);
+ if (r) {
+ break;
+ }
+ r = memcg_list_lru_alloc(memcg, &global_uncached[i].pages, GFP_KERNEL);
+ if (r) {
+ break;
+ }
+ }
+ css_put(&memcg->css);
+#endif
+ return 0;
+}
+
/* Return the pool_type to use for the given caching and order */
static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
enum ttm_caching caching,
@@ -418,7 +489,9 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
}
/* Free pages using the per-node shrinker list */
-static unsigned int ttm_pool_shrink(int nid, unsigned long num_to_free)
+static unsigned int ttm_pool_shrink(int nid,
+ struct mem_cgroup *memcg,
+ unsigned long num_to_free)
{
LIST_HEAD(dispose);
struct ttm_pool_type *pt;
@@ -430,7 +503,11 @@ static unsigned int ttm_pool_shrink(int nid, unsigned long num_to_free)
list_move_tail(&pt->shrinker_list, &shrinker_list);
spin_unlock(&shrinker_lock);
- num_pages = list_lru_walk_node(&pt->pages, nid, pool_move_to_dispose_list, &dispose, &num_to_free);
+ if (!memcg) {
+ num_pages = list_lru_walk_node(&pt->pages, nid, pool_move_to_dispose_list, &dispose, &num_to_free);
+ } else {
+ num_pages = list_lru_walk_one(&pt->pages, nid, memcg, pool_move_to_dispose_list, &dispose, &num_to_free);
+ }
num_pages *= 1 << pt->order;
ttm_pool_dispose_list(pt, &dispose);
@@ -595,6 +672,7 @@ static int ttm_pool_restore_commit(struct ttm_pool_tt_restore *restore,
*/
ttm_pool_split_for_swap(restore->pool, p);
copy_highpage(restore->alloced_page + i, p);
+ p->memcg_data = 0;
__free_pages(p, 0);
}
@@ -756,6 +834,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
bool allow_pools;
struct page *p;
int r;
+ struct obj_cgroup *objcg = memcg_account ? tt->objcg : NULL;
WARN_ON(!alloc->remaining_pages || ttm_tt_is_populated(tt));
WARN_ON(alloc->dma_addr && !pool->dev);
@@ -773,6 +852,9 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
page_caching = tt->caching;
allow_pools = true;
+
+ ttm_pool_check_objcg(objcg);
+
for (order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
alloc->remaining_pages;
order = ttm_pool_alloc_find_order(order, alloc)) {
@@ -782,7 +864,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
p = NULL;
pt = ttm_pool_select_type(pool, page_caching, order);
if (pt && allow_pools)
- p = ttm_pool_type_take(pt, ttm_pool_nid(pool));
+ p = ttm_pool_type_take(pt, ttm_pool_nid(pool), objcg);
/*
* If that fails or previously failed, allocate from system.
@@ -793,7 +875,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
if (!p) {
page_caching = ttm_cached;
allow_pools = false;
- p = ttm_pool_alloc_page(pool, gfp_flags, order);
+ p = ttm_pool_alloc_page(pool, objcg, gfp_flags, order);
}
/* If that fails, lower the order if possible and retry. */
if (!p) {
@@ -937,7 +1019,7 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
while (atomic_long_read(&allocated_pages[nid]) > pool_node_limit[nid]) {
unsigned long diff = pool_node_limit[nid] - atomic_long_read(&allocated_pages[nid]);
- ttm_pool_shrink(nid, diff);
+ ttm_pool_shrink(nid, NULL, diff);
}
}
EXPORT_SYMBOL(ttm_pool_free);
@@ -1057,6 +1139,7 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
if (flags->purge) {
shrunken += num_pages;
page->private = 0;
+ page->memcg_data = 0;
__free_pages(page, order);
memset(tt->pages + i, 0,
num_pages * sizeof(*tt->pages));
@@ -1193,10 +1276,14 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
struct shrink_control *sc)
{
unsigned long num_freed = 0;
+ int num_pools;
+ spin_lock(&shrinker_lock);
+ num_pools = list_count_nodes(&shrinker_list);
+ spin_unlock(&shrinker_lock);
do
- num_freed += ttm_pool_shrink(sc->nid, sc->nr_to_scan);
- while (num_freed < sc->nr_to_scan &&
+ num_freed += ttm_pool_shrink(sc->nid, sc->memcg, sc->nr_to_scan);
+ while (num_pools-- >= 0 && num_freed < sc->nr_to_scan &&
atomic_long_read(&allocated_pages[sc->nid]));
sc->nr_scanned = num_freed;
@@ -1388,7 +1475,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
spin_lock_init(&shrinker_lock);
INIT_LIST_HEAD(&shrinker_list);
- mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
+ mm_shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE, "drm-ttm_pool");
if (!mm_shrinker)
return -ENOMEM;
--
2.49.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH 14/17] ttm: hook up memcg placement flags.
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (12 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 13/17] ttm/pool: enable memcg tracking and shrinker Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 4:49 ` [PATCH 15/17] memcontrol: allow objcg api when memcg is config off Dave Airlie
` (3 subsequent siblings)
17 siblings, 0 replies; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie
From: Dave Airlie <airlied@redhat.com>
This adds a placement flag requesting that any bo with the flag set
gets accounted to the memcg when it is backed by a system memory
allocation.
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
drivers/gpu/drm/ttm/ttm_bo.c | 2 +-
drivers/gpu/drm/ttm/ttm_bo_util.c | 6 +++---
drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
include/drm/ttm/ttm_placement.h | 3 +++
4 files changed, 8 insertions(+), 5 deletions(-)
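Not part of the patch: an illustrative sketch of how a driver would opt a
system-memory placement into memcg accounting with the new flag, which is
what the amdgpu patch later in the series does.
	struct ttm_place sys_place = {
		.fpfn = 0,
		.lpfn = 0,
		.mem_type = TTM_PL_TT,
		/* charge pages backing this placement to the allocating memcg */
		.flags = TTM_PL_FLAG_MEMCG,
	};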
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index af04bb8e2c2a..273757974b9f 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -142,7 +142,7 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
goto out_err;
if (mem->mem_type != TTM_PL_SYSTEM) {
- ret = ttm_bo_populate(bo, false, ctx);
+ ret = ttm_bo_populate(bo, mem->placement & TTM_PL_FLAG_MEMCG, ctx);
if (ret)
goto out_err;
}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 764d1cf1ecbe..b5521d1bd517 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -167,7 +167,7 @@ int ttm_bo_move_memcpy(struct ttm_buffer_object *bo,
src_man = ttm_manager_type(bdev, src_mem->mem_type);
if (ttm && ((ttm->page_flags & TTM_TT_FLAG_SWAPPED) ||
dst_man->use_tt)) {
- ret = ttm_bo_populate(bo, false, ctx);
+ ret = ttm_bo_populate(bo, dst_mem->placement & TTM_PL_FLAG_MEMCG, ctx);
if (ret)
return ret;
}
@@ -354,7 +354,7 @@ static int ttm_bo_kmap_ttm(struct ttm_buffer_object *bo,
BUG_ON(!ttm);
- ret = ttm_bo_populate(bo, false, &ctx);
+ ret = ttm_bo_populate(bo, mem->placement & TTM_PL_FLAG_MEMCG, &ctx);
if (ret)
return ret;
@@ -511,7 +511,7 @@ int ttm_bo_vmap(struct ttm_buffer_object *bo, struct iosys_map *map)
pgprot_t prot;
void *vaddr;
- ret = ttm_bo_populate(bo, false, &ctx);
+ ret = ttm_bo_populate(bo, mem->placement & TTM_PL_FLAG_MEMCG, &ctx);
if (ret)
return ret;
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index c5ad447debe3..dddc904f8727 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -226,7 +226,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
ttm = bo->ttm;
err = ttm_bo_populate(bo,
- false,
+ bo->resource->placement & TTM_PL_FLAG_MEMCG,
&ctx);
if (err) {
if (err == -EINTR || err == -ERESTARTSYS ||
diff --git a/include/drm/ttm/ttm_placement.h b/include/drm/ttm/ttm_placement.h
index b510a4812609..4e9f07d70483 100644
--- a/include/drm/ttm/ttm_placement.h
+++ b/include/drm/ttm/ttm_placement.h
@@ -70,6 +70,9 @@
/* Placement is only used during eviction */
#define TTM_PL_FLAG_FALLBACK (1 << 4)
+/* Placement should account mem cgroup */
+#define TTM_PL_FLAG_MEMCG (1 << 5)
+
/**
* struct ttm_place
*
--
2.49.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH 15/17] memcontrol: allow objcg api when memcg is config off.
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (13 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 14/17] ttm: hook up memcg placement flags Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 4:49 ` [PATCH 16/17] memcontrol: export current_obj_cgroup Dave Airlie
` (2 subsequent siblings)
17 siblings, 0 replies; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie
From: Dave Airlie <airlied@redhat.com>
amdgpu wants to use the objcg API without wrapping the calls in ifdefs,
so just add a dummy function for the config-off path.
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
include/linux/memcontrol.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ff82d603910d..12ba5e00a6d1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1758,6 +1758,11 @@ static inline void __memcg_kmem_uncharge_page(struct page *page, int order)
{
}
+static inline struct obj_cgroup *get_obj_cgroup_from_current(void)
+{
+ return NULL;
+}
+
static inline struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
{
return NULL;
--
2.49.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH 16/17] memcontrol: export current_obj_cgroup
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (14 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 15/17] memcontrol: allow objcg api when memcg is config off Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-06-30 4:49 ` [PATCH 17/17] amdgpu: add support for memory cgroups Dave Airlie
2025-07-01 23:26 ` drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Balbir Singh
17 siblings, 0 replies; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie
From: Dave Airlie <airlied@redhat.com>
This is needed to use get_obj_cgroup_from_current from a module.
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
mm/memcontrol.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 87d75963a9ed..1e52e43cc239 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2726,6 +2726,7 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
return objcg;
}
+EXPORT_SYMBOL_GPL(current_obj_cgroup);
struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
{
--
2.49.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH 17/17] amdgpu: add support for memory cgroups
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (15 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 16/17] memcontrol: export current_obj_cgroup Dave Airlie
@ 2025-06-30 4:49 ` Dave Airlie
2025-07-02 16:02 ` Shakeel Butt
2025-07-01 23:26 ` drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Balbir Singh
17 siblings, 1 reply; 52+ messages in thread
From: Dave Airlie @ 2025-06-30 4:49 UTC (permalink / raw)
To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
Cc: Dave Chinner, Kairui Song, Dave Airlie
From: Dave Airlie <airlied@redhat.com>
This adds support for attaching an obj cgroup to a buffer object,
and passes in the placement flags to make sure it's accounted
properly.
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 13 +++++++++----
drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 ++
4 files changed, 14 insertions(+), 4 deletions(-)
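Not part of the patch: a minimal sketch of the objcg lifetime the diff
below follows. A reference to the caller's obj cgroup is taken at GEM
object creation, handed to TTM via the bo param, and dropped when the
GEM object is freed.
/* Illustrative only: take and release the objcg reference for a bo. */
static void sketch_bo_param_init(struct amdgpu_bo_param *bp)
{
	/* charge system-memory backing of this bo to the current cgroup */
	bp->objcg = get_obj_cgroup_from_current();
}

static void sketch_bo_free(struct amdgpu_bo *aobj)
{
	/* drop the reference taken at creation */
	obj_cgroup_put(aobj->tbo.objcg);
}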
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index e5e33a68d935..d250183bab03 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -198,6 +198,7 @@ static void amdgpu_gem_object_free(struct drm_gem_object *gobj)
struct amdgpu_bo *aobj = gem_to_amdgpu_bo(gobj);
amdgpu_hmm_unregister(aobj);
+ obj_cgroup_put(aobj->tbo.objcg);
ttm_bo_put(&aobj->tbo);
}
@@ -225,6 +226,7 @@ int amdgpu_gem_object_create(struct amdgpu_device *adev, unsigned long size,
bp.domain = initial_domain;
bp.bo_ptr_size = sizeof(struct amdgpu_bo);
bp.xcp_id_plus1 = xcp_id_plus1;
+ bp.objcg = get_obj_cgroup_from_current();
r = amdgpu_bo_create_user(adev, &bp, &ubo);
if (r)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 73403744331a..6d5533703b33 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -158,7 +158,7 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
places[c].mem_type =
abo->flags & AMDGPU_GEM_CREATE_PREEMPTIBLE ?
AMDGPU_PL_PREEMPT : TTM_PL_TT;
- places[c].flags = 0;
+ places[c].flags = TTM_PL_FLAG_MEMCG;
/*
* When GTT is just an alternative to VRAM make sure that we
* only use it as fallback and still try to fill up VRAM first.
@@ -173,7 +173,7 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
places[c].fpfn = 0;
places[c].lpfn = 0;
places[c].mem_type = TTM_PL_SYSTEM;
- places[c].flags = 0;
+ places[c].flags = TTM_PL_FLAG_MEMCG;
c++;
}
@@ -657,16 +657,21 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
size = ALIGN(size, PAGE_SIZE);
}
- if (!amdgpu_bo_validate_size(adev, size, bp->domain))
+ if (!amdgpu_bo_validate_size(adev, size, bp->domain)) {
+ obj_cgroup_put(bp->objcg);
return -ENOMEM;
+ }
BUG_ON(bp->bo_ptr_size < sizeof(struct amdgpu_bo));
*bo_ptr = NULL;
bo = kvzalloc(bp->bo_ptr_size, GFP_KERNEL);
- if (bo == NULL)
+ if (bo == NULL) {
+ obj_cgroup_put(bp->objcg);
return -ENOMEM;
+ }
drm_gem_private_object_init(adev_to_drm(adev), &bo->tbo.base, size);
+ bo->tbo.objcg = bp->objcg;
bo->tbo.base.funcs = &amdgpu_gem_object_funcs;
bo->vm_bo = NULL;
bo->preferred_domains = bp->preferred_domain ? bp->preferred_domain :
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 375448627f7b..8ebaf1bc202f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -55,6 +55,7 @@ struct amdgpu_bo_param {
enum ttm_bo_type type;
bool no_wait_gpu;
struct dma_resv *resv;
+ struct obj_cgroup *objcg;
void (*destroy)(struct ttm_buffer_object *bo);
/* xcp partition number plus 1, 0 means any partition */
int8_t xcp_id_plus1;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 920b412156dd..a65e23b8c67e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -151,11 +151,13 @@ static void amdgpu_evict_flags(struct ttm_buffer_object *bo,
amdgpu_bo_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT |
AMDGPU_GEM_DOMAIN_CPU);
}
+ abo->placements[0].flags &= ~TTM_PL_FLAG_MEMCG;
break;
case TTM_PL_TT:
case AMDGPU_PL_PREEMPT:
default:
amdgpu_bo_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_CPU);
+ abo->placements[0].flags &= ~TTM_PL_FLAG_MEMCG;
break;
}
*placement = abo->placement;
--
2.49.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH 10/17] ttm: add a memcg accounting flag to the alloc/populate APIs
2025-06-30 4:49 ` [PATCH 10/17] ttm: add a memcg accounting flag to the alloc/populate APIs Dave Airlie
@ 2025-06-30 9:56 ` kernel test robot
2025-06-30 10:20 ` Christian König
1 sibling, 0 replies; 52+ messages in thread
From: kernel test robot @ 2025-06-30 9:56 UTC (permalink / raw)
To: Dave Airlie, dri-devel, linux-mm, Johannes Weiner,
Christian Koenig
Cc: oe-kbuild-all, Dave Chinner, Kairui Song, Dave Airlie
Hi Dave,
kernel test robot noticed the following build warnings:
[auto build test WARNING on drm/drm-next]
[cannot apply to akpm-mm/mm-everything linus/master v6.16-rc4 next-20250630]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Dave-Airlie/drm-ttm-use-gpu-mm-stats-to-track-gpu-memory-allocations-v2/20250630-134938
base: git://anongit.freedesktop.org/drm/drm drm-next
patch link: https://lore.kernel.org/r/20250630045005.1337339-11-airlied%40gmail.com
patch subject: [PATCH 10/17] ttm: add a memcg accounting flag to the alloc/populate APIs
config: riscv-randconfig-001-20250630 (https://download.01.org/0day-ci/archive/20250630/202506301731.dRh8GUQZ-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250630/202506301731.dRh8GUQZ-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506301731.dRh8GUQZ-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> Warning: drivers/gpu/drm/ttm/ttm_bo.c:1256 function parameter 'memcg_account' not described in 'ttm_bo_populate'
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 02/17] drm/ttm: use gpu mm stats to track gpu memory allocations. (v2)
2025-06-30 4:49 ` [PATCH 02/17] drm/ttm: use gpu mm stats to track gpu memory allocations. (v2) Dave Airlie
@ 2025-06-30 10:04 ` Christian König
2025-07-01 1:41 ` David Airlie
2025-07-02 16:08 ` Shakeel Butt
1 sibling, 1 reply; 52+ messages in thread
From: Christian König @ 2025-06-30 10:04 UTC (permalink / raw)
To: Dave Airlie, dri-devel, linux-mm, Johannes Weiner
Cc: Dave Chinner, Kairui Song, Dave Airlie, Matthew Brost,
Andrew Morton
On 30.06.25 06:49, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> This uses the newly introduced per-node gpu tracking stats,
> to track GPU memory allocated via TTM and reclaimable memory in
> the TTM page pools.
>
> These stats will be useful later for system information and
> later when mem cgroups are integrated.
>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: linux-mm@kvack.org
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
>
> ---
> v2: add reclaim parameters and adjust the right counters.
> ---
> drivers/gpu/drm/ttm/ttm_pool.c | 34 ++++++++++++++++++++++++++++------
> 1 file changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index baf27c70a419..11a5777b4a85 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -131,6 +131,16 @@ static struct list_head shrinker_list;
> static struct shrinker *mm_shrinker;
> static DECLARE_RWSEM(pool_shrink_rwsem);
>
> +/* helper to get a current valid node id from a pool */
> +static int ttm_pool_nid(struct ttm_pool *pool) {
> + int nid = NUMA_NO_NODE;
> + if (pool)
> + nid = pool->nid;
> + if (nid == NUMA_NO_NODE)
> + nid = numa_node_id();
That isn't correct.
The NUMA node in the pool is just a hint where to allocate from, but the memory can come from somewhere else as well.
You need to look at the allocated page to figure out which NUMA node it actually belongs to.
Regards,
Christian.
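A minimal sketch of what this suggestion looks like in ttm_pool_alloc_page(),
assuming page_to_nid() is used to find the node of the page the allocator
actually returned (illustrative only, not a tested change):
	p = alloc_pages_node(pool->nid, gfp_flags, order);
	if (p) {
		p->private = order;
		/* account against the node the page actually came from */
		mod_node_page_state(NODE_DATA(page_to_nid(p)),
				    NR_GPU_ACTIVE, 1 << order);
	}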
> + return nid;
> +}
> +
> /* Allocate pages of size 1 << order with the given gfp_flags */
> static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> unsigned int order)
> @@ -150,8 +160,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
>
> if (!pool->use_dma_alloc) {
> p = alloc_pages_node(pool->nid, gfp_flags, order);
> - if (p)
> + if (p) {
> p->private = order;
> + mod_node_page_state(NODE_DATA(ttm_pool_nid(pool)), NR_GPU_ACTIVE, (1 << order));
> + }
> return p;
> }
>
> @@ -186,7 +198,7 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
>
> /* Reset the caching and pages of size 1 << order */
> static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
> - unsigned int order, struct page *p)
> + unsigned int order, struct page *p, bool reclaim)
> {
> unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
> struct ttm_pool_dma *dma;
> @@ -201,6 +213,9 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
> #endif
>
> if (!pool || !pool->use_dma_alloc) {
> + mod_node_page_state(NODE_DATA(ttm_pool_nid(pool)),
> + reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE,
> + -(1 << order));
> __free_pages(p, order);
> return;
> }
> @@ -276,6 +291,7 @@ static void ttm_pool_unmap(struct ttm_pool *pool, dma_addr_t dma_addr,
> static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
> {
> unsigned int i, num_pages = 1 << pt->order;
> + int nid = ttm_pool_nid(pt->pool);
>
> for (i = 0; i < num_pages; ++i) {
> if (PageHighMem(p))
> @@ -288,17 +304,23 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
> list_add(&p->lru, &pt->pages);
> spin_unlock(&pt->lock);
> atomic_long_add(1 << pt->order, &allocated_pages);
> +
> + mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, -num_pages);
> + mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, num_pages);
> }
>
> /* Take pages from a specific pool_type, return NULL when nothing available */
> static struct page *ttm_pool_type_take(struct ttm_pool_type *pt)
> {
> struct page *p;
> + int nid = ttm_pool_nid(pt->pool);
>
> spin_lock(&pt->lock);
> p = list_first_entry_or_null(&pt->pages, typeof(*p), lru);
> if (p) {
> atomic_long_sub(1 << pt->order, &allocated_pages);
> + mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, (1 << pt->order));
> + mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, -(1 << pt->order));
> list_del(&p->lru);
> }
> spin_unlock(&pt->lock);
> @@ -331,7 +353,7 @@ static void ttm_pool_type_fini(struct ttm_pool_type *pt)
> spin_unlock(&shrinker_lock);
>
> while ((p = ttm_pool_type_take(pt)))
> - ttm_pool_free_page(pt->pool, pt->caching, pt->order, p);
> + ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
> }
>
> /* Return the pool_type to use for the given caching and order */
> @@ -383,7 +405,7 @@ static unsigned int ttm_pool_shrink(void)
>
> p = ttm_pool_type_take(pt);
> if (p) {
> - ttm_pool_free_page(pt->pool, pt->caching, pt->order, p);
> + ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
> num_pages = 1 << pt->order;
> } else {
> num_pages = 0;
> @@ -475,7 +497,7 @@ static pgoff_t ttm_pool_unmap_and_free(struct ttm_pool *pool, struct page *page,
> if (pt)
> ttm_pool_type_give(pt, page);
> else
> - ttm_pool_free_page(pool, caching, order, page);
> + ttm_pool_free_page(pool, caching, order, page, false);
>
> return nr;
> }
> @@ -780,7 +802,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> return 0;
>
> error_free_page:
> - ttm_pool_free_page(pool, page_caching, order, p);
> + ttm_pool_free_page(pool, page_caching, order, p, false);
>
> error_free_all:
> if (tt->restore)
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 05/17] ttm/pool: drop numa specific pools
2025-06-30 4:49 ` [PATCH 05/17] ttm/pool: drop numa specific pools Dave Airlie
@ 2025-06-30 10:12 ` Christian König
0 siblings, 0 replies; 52+ messages in thread
From: Christian König @ 2025-06-30 10:12 UTC (permalink / raw)
To: Dave Airlie, dri-devel, linux-mm, Johannes Weiner
Cc: Dave Chinner, Kairui Song, Dave Airlie
On 30.06.25 06:49, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> The list_lru will now handle numa for us, so need to keep
> separate pool types for it. Just consoldiate into the global ones.
>
> This adds a debugfs change to avoid dumping non-existant orders due
> to this change.
>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> ---
> drivers/gpu/drm/ttm/ttm_pool.c | 8 ++------
> 1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index 4372f0cc4a57..95bbbc843b97 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -399,17 +399,11 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
> #ifdef CONFIG_X86
> switch (caching) {
> case ttm_write_combined:
> - if (pool->nid != NUMA_NO_NODE)
> - return &pool->caching[caching].orders[order];
> -
> if (pool->use_dma32)
> return &global_dma32_write_combined[order];
>
> return &global_write_combined[order];
> case ttm_uncached:
> - if (pool->nid != NUMA_NO_NODE)
> - return &pool->caching[caching].orders[order];
> -
> if (pool->use_dma32)
> return &global_dma32_uncached[order];
>
> @@ -1295,6 +1289,8 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
> for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
> if (!ttm_pool_select_type(pool, i, 0))
> continue;
> + if (pool->caching[i].orders[0].pool != pool)
> + continue;
That no longer makes sense here. The NUMA aware allocation goes into the global LRU now, so we can completely drop the debugfs handling for this.
In other words, drop parts of the "if (!pool->use_dma_alloc && pool->nid == NUMA_NO_NODE) {" check above instead.
Regards,
Christian.
> if (pool->use_dma_alloc)
> seq_puts(m, "DMA ");
> else
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 06/17] ttm/pool: make pool shrinker NUMA aware
2025-06-30 4:49 ` [PATCH 06/17] ttm/pool: make pool shrinker NUMA aware Dave Airlie
@ 2025-06-30 10:15 ` Christian König
2025-06-30 21:30 ` David Airlie
0 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2025-06-30 10:15 UTC (permalink / raw)
To: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Simona Vetter
Cc: Dave Chinner, Kairui Song, Dave Airlie
On 30.06.25 06:49, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> This enable NUMA awareness for the shrinker on the
> ttm pools.
Looks good from my side, but the last time I suggested this Sima explicitly told me that it isn't very fruitful. Adding her so she can comment as well.
Christian.
>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> ---
> drivers/gpu/drm/ttm/ttm_pool.c | 38 +++++++++++++++++++---------------
> 1 file changed, 21 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index 95bbbc843b97..66cd963b24dc 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -416,12 +416,12 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
> return NULL;
> }
>
> -/* Free pages using the global shrinker list */
> -static unsigned int ttm_pool_shrink(void)
> +/* Free pages using the per-node shrinker list */
> +static unsigned int ttm_pool_shrink(int nid, unsigned long num_to_free)
> {
> + LIST_HEAD(dispose);
> struct ttm_pool_type *pt;
> unsigned int num_pages;
> - struct page *p;
>
> down_read(&pool_shrink_rwsem);
> spin_lock(&shrinker_lock);
> @@ -429,13 +429,10 @@ static unsigned int ttm_pool_shrink(void)
> list_move_tail(&pt->shrinker_list, &shrinker_list);
> spin_unlock(&shrinker_lock);
>
> - p = ttm_pool_type_take(pt, ttm_pool_nid(pt->pool));
> - if (p) {
> - ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
> - num_pages = 1 << pt->order;
> - } else {
> - num_pages = 0;
> - }
> + num_pages = list_lru_walk_node(&pt->pages, nid, pool_move_to_dispose_list, &dispose, &num_to_free);
> + num_pages *= 1 << pt->order;
> +
> + ttm_pool_dispose_list(pt, &dispose);
> up_read(&pool_shrink_rwsem);
>
> return num_pages;
> @@ -784,6 +781,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> pt = ttm_pool_select_type(pool, page_caching, order);
> if (pt && allow_pools)
> p = ttm_pool_type_take(pt, ttm_pool_nid(pool));
> +
> /*
> * If that fails or previously failed, allocate from system.
> * Note that this also disallows additional pool allocations using
> @@ -932,8 +930,10 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
> {
> ttm_pool_free_range(pool, tt, tt->caching, 0, tt->num_pages);
>
> - while (atomic_long_read(&allocated_pages) > page_pool_size)
> - ttm_pool_shrink();
> + while (atomic_long_read(&allocated_pages) > page_pool_size) {
> + unsigned long diff = page_pool_size - atomic_long_read(&allocated_pages);
> + ttm_pool_shrink(ttm_pool_nid(pool), diff);
> + }
> }
> EXPORT_SYMBOL(ttm_pool_free);
>
> @@ -1190,7 +1190,7 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
> unsigned long num_freed = 0;
>
> do
> - num_freed += ttm_pool_shrink();
> + num_freed += ttm_pool_shrink(sc->nid, sc->nr_to_scan);
> while (num_freed < sc->nr_to_scan &&
> atomic_long_read(&allocated_pages));
>
> @@ -1323,11 +1323,15 @@ static int ttm_pool_debugfs_shrink_show(struct seq_file *m, void *data)
> .nr_to_scan = TTM_SHRINKER_BATCH,
> };
> unsigned long count;
> + int nid;
>
> fs_reclaim_acquire(GFP_KERNEL);
> - count = ttm_pool_shrinker_count(mm_shrinker, &sc);
> - seq_printf(m, "%lu/%lu\n", count,
> - ttm_pool_shrinker_scan(mm_shrinker, &sc));
> + for_each_node(nid) {
> + sc.nid = nid;
> + count = ttm_pool_shrinker_count(mm_shrinker, &sc);
> + seq_printf(m, "%d: %lu/%lu\n", nid, count,
> + ttm_pool_shrinker_scan(mm_shrinker, &sc));
> + }
> fs_reclaim_release(GFP_KERNEL);
>
> return 0;
> @@ -1375,7 +1379,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
> #endif
> #endif
>
> - mm_shrinker = shrinker_alloc(0, "drm-ttm_pool");
> + mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
> if (!mm_shrinker)
> return -ENOMEM;
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 10/17] ttm: add a memcg accounting flag to the alloc/populate APIs
2025-06-30 4:49 ` [PATCH 10/17] ttm: add a memcg accounting flag to the alloc/populate APIs Dave Airlie
2025-06-30 9:56 ` kernel test robot
@ 2025-06-30 10:20 ` Christian König
2025-07-01 1:46 ` David Airlie
1 sibling, 1 reply; 52+ messages in thread
From: Christian König @ 2025-06-30 10:20 UTC (permalink / raw)
To: Dave Airlie, dri-devel, linux-mm, Johannes Weiner
Cc: Dave Chinner, Kairui Song, Dave Airlie
On 30.06.25 06:49, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> This flag does nothing yet, but this just changes the APIs to accept
> it in the future across all users.
>
> This flag will eventually be filled out with when to account a tt
> populate to a memcg.
I would keep the pool completely out of memcg accounting and always account at a higher level.
Accounting that low down just gives a huge surface for driver issues.
Regards,
Christian.
>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 3 ++-
> drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 5 +++--
> drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
> drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c | 4 ++--
> drivers/gpu/drm/loongson/lsdc_ttm.c | 3 ++-
> drivers/gpu/drm/nouveau/nouveau_bo.c | 6 ++++--
> drivers/gpu/drm/radeon/radeon_ttm.c | 3 ++-
> drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c | 2 +-
> drivers/gpu/drm/ttm/tests/ttm_pool_test.c | 16 ++++++++--------
> drivers/gpu/drm/ttm/tests/ttm_tt_test.c | 12 ++++++------
> drivers/gpu/drm/ttm/ttm_bo.c | 5 +++--
> drivers/gpu/drm/ttm/ttm_bo_util.c | 6 +++---
> drivers/gpu/drm/ttm/ttm_bo_vm.c | 4 +++-
> drivers/gpu/drm/ttm/ttm_pool.c | 6 ++++--
> drivers/gpu/drm/ttm/ttm_tt.c | 8 +++++---
> drivers/gpu/drm/vmwgfx/vmwgfx_blit.c | 4 ++--
> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c | 7 ++++---
> drivers/gpu/drm/xe/xe_bo.c | 3 ++-
> include/drm/ttm/ttm_bo.h | 1 +
> include/drm/ttm/ttm_device.h | 1 +
> include/drm/ttm/ttm_pool.h | 1 +
> include/drm/ttm/ttm_tt.h | 1 +
> 22 files changed, 61 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 9c5df35f05b7..920b412156dd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1138,6 +1138,7 @@ static struct ttm_tt *amdgpu_ttm_tt_create(struct ttm_buffer_object *bo,
> */
> static int amdgpu_ttm_tt_populate(struct ttm_device *bdev,
> struct ttm_tt *ttm,
> + bool memcg_account,
> struct ttm_operation_ctx *ctx)
> {
> struct amdgpu_device *adev = amdgpu_ttm_adev(bdev);
> @@ -1161,7 +1162,7 @@ static int amdgpu_ttm_tt_populate(struct ttm_device *bdev,
> pool = &adev->mman.ttm_pools[gtt->pool_id];
> else
> pool = &adev->mman.bdev.pool;
> - ret = ttm_pool_alloc(pool, ttm, ctx);
> + ret = ttm_pool_alloc(pool, ttm, memcg_account, ctx);
> if (ret)
> return ret;
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> index 1f4814968868..6cdaf3696583 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> @@ -314,6 +314,7 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>
> static int i915_ttm_tt_populate(struct ttm_device *bdev,
> struct ttm_tt *ttm,
> + bool memcg_account,
> struct ttm_operation_ctx *ctx)
> {
> struct i915_ttm_tt *i915_tt = container_of(ttm, typeof(*i915_tt), ttm);
> @@ -321,7 +322,7 @@ static int i915_ttm_tt_populate(struct ttm_device *bdev,
> if (i915_tt->is_shmem)
> return i915_ttm_tt_shmem_populate(bdev, ttm, ctx);
>
> - return ttm_pool_alloc(&bdev->pool, ttm, ctx);
> + return ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
> }
>
> static void i915_ttm_tt_unpopulate(struct ttm_device *bdev, struct ttm_tt *ttm)
> @@ -808,7 +809,7 @@ static int __i915_ttm_get_pages(struct drm_i915_gem_object *obj,
> }
>
> if (bo->ttm && !ttm_tt_is_populated(bo->ttm)) {
> - ret = ttm_bo_populate(bo, &ctx);
> + ret = ttm_bo_populate(bo, false, &ctx);
> if (ret)
> return ret;
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> index 2f6b33edb9c9..4ab1eb3e42bc 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> @@ -624,7 +624,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, bool evict,
>
> /* Populate ttm with pages if needed. Typically system memory. */
> if (ttm && (dst_man->use_tt || (ttm->page_flags & TTM_TT_FLAG_SWAPPED))) {
> - ret = ttm_bo_populate(bo, ctx);
> + ret = ttm_bo_populate(bo, false, ctx);
> if (ret)
> return ret;
> }
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c
> index 61596cecce4d..0b555979d786 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c
> @@ -90,7 +90,7 @@ static int i915_ttm_backup(struct i915_gem_apply_to_region *apply,
> goto out_no_lock;
>
> backup_bo = i915_gem_to_ttm(backup);
> - err = ttm_bo_populate(backup_bo, &ctx);
> + err = ttm_bo_populate(backup_bo, false, &ctx);
> if (err)
> goto out_no_populate;
>
> @@ -189,7 +189,7 @@ static int i915_ttm_restore(struct i915_gem_apply_to_region *apply,
> if (!backup_bo->resource)
> err = ttm_bo_validate(backup_bo, i915_ttm_sys_placement(), &ctx);
> if (!err)
> - err = ttm_bo_populate(backup_bo, &ctx);
> + err = ttm_bo_populate(backup_bo, false, &ctx);
> if (!err) {
> err = i915_gem_obj_copy_ttm(obj, backup, pm_apply->allow_gpu,
> false);
> diff --git a/drivers/gpu/drm/loongson/lsdc_ttm.c b/drivers/gpu/drm/loongson/lsdc_ttm.c
> index 2e42c6970c9f..6d8781506802 100644
> --- a/drivers/gpu/drm/loongson/lsdc_ttm.c
> +++ b/drivers/gpu/drm/loongson/lsdc_ttm.c
> @@ -110,6 +110,7 @@ lsdc_ttm_tt_create(struct ttm_buffer_object *tbo, uint32_t page_flags)
>
> static int lsdc_ttm_tt_populate(struct ttm_device *bdev,
> struct ttm_tt *ttm,
> + bool memcg_account,
> struct ttm_operation_ctx *ctx)
> {
> bool slave = !!(ttm->page_flags & TTM_TT_FLAG_EXTERNAL);
> @@ -122,7 +123,7 @@ static int lsdc_ttm_tt_populate(struct ttm_device *bdev,
> return 0;
> }
>
> - return ttm_pool_alloc(&bdev->pool, ttm, ctx);
> + return ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
> }
>
> static void lsdc_ttm_tt_unpopulate(struct ttm_device *bdev,
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index b96f0555ca14..1f2b9f5f2bf8 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -1417,7 +1417,9 @@ vm_fault_t nouveau_ttm_fault_reserve_notify(struct ttm_buffer_object *bo)
>
> static int
> nouveau_ttm_tt_populate(struct ttm_device *bdev,
> - struct ttm_tt *ttm, struct ttm_operation_ctx *ctx)
> + struct ttm_tt *ttm,
> + bool memcg_account,
> + struct ttm_operation_ctx *ctx)
> {
> struct ttm_tt *ttm_dma = (void *)ttm;
> struct nouveau_drm *drm;
> @@ -1434,7 +1436,7 @@ nouveau_ttm_tt_populate(struct ttm_device *bdev,
>
> drm = nouveau_bdev(bdev);
>
> - return ttm_pool_alloc(&drm->ttm.bdev.pool, ttm, ctx);
> + return ttm_pool_alloc(&drm->ttm.bdev.pool, ttm, memcg_account, ctx);
> }
>
> static void
> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> index 616d25c8c2de..8c4273239d16 100644
> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> @@ -526,6 +526,7 @@ static struct radeon_ttm_tt *radeon_ttm_tt_to_gtt(struct radeon_device *rdev,
>
> static int radeon_ttm_tt_populate(struct ttm_device *bdev,
> struct ttm_tt *ttm,
> + bool memcg_account,
> struct ttm_operation_ctx *ctx)
> {
> struct radeon_device *rdev = radeon_get_rdev(bdev);
> @@ -547,7 +548,7 @@ static int radeon_ttm_tt_populate(struct ttm_device *bdev,
> return 0;
> }
>
> - return ttm_pool_alloc(&rdev->mman.bdev.pool, ttm, ctx);
> + return ttm_pool_alloc(&rdev->mman.bdev.pool, ttm, memcg_account, ctx);
> }
>
> static void radeon_ttm_tt_unpopulate(struct ttm_device *bdev, struct ttm_tt *ttm)
> diff --git a/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c b/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
> index 3148f5d3dbd6..b52e3c1089e6 100644
> --- a/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
> +++ b/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
> @@ -538,7 +538,7 @@ static void ttm_bo_validate_no_placement_signaled(struct kunit *test)
>
> if (params->with_ttm) {
> old_tt = priv->ttm_dev->funcs->ttm_tt_create(bo, 0);
> - ttm_pool_alloc(&priv->ttm_dev->pool, old_tt, &ctx);
> + ttm_pool_alloc(&priv->ttm_dev->pool, old_tt, false, &ctx);
> bo->ttm = old_tt;
> }
>
> diff --git a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
> index 39234a3e98c4..aaf152c2383d 100644
> --- a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
> +++ b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
> @@ -88,7 +88,7 @@ static struct ttm_pool *ttm_pool_pre_populated(struct kunit *test,
>
> ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
>
> - err = ttm_pool_alloc(pool, tt, &simple_ctx);
> + err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
> KUNIT_ASSERT_EQ(test, err, 0);
>
> ttm_pool_free(pool, tt);
> @@ -157,7 +157,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
> KUNIT_ASSERT_EQ(test, pool->nid, NUMA_NO_NODE);
> KUNIT_ASSERT_EQ(test, pool->use_dma_alloc, params->use_dma_alloc);
>
> - err = ttm_pool_alloc(pool, tt, &simple_ctx);
> + err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
> KUNIT_ASSERT_EQ(test, err, 0);
> KUNIT_ASSERT_EQ(test, tt->num_pages, expected_num_pages);
>
> @@ -220,7 +220,7 @@ static void ttm_pool_alloc_basic_dma_addr(struct kunit *test)
>
> ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
>
> - err = ttm_pool_alloc(pool, tt, &simple_ctx);
> + err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
> KUNIT_ASSERT_EQ(test, err, 0);
> KUNIT_ASSERT_EQ(test, tt->num_pages, expected_num_pages);
>
> @@ -253,7 +253,7 @@ static void ttm_pool_alloc_order_caching_match(struct kunit *test)
> tt = ttm_tt_kunit_init(test, 0, caching, size);
> KUNIT_ASSERT_NOT_NULL(test, tt);
>
> - err = ttm_pool_alloc(pool, tt, &simple_ctx);
> + err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
> KUNIT_ASSERT_EQ(test, err, 0);
>
> KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
> @@ -285,7 +285,7 @@ static void ttm_pool_alloc_caching_mismatch(struct kunit *test)
> KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
> KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt_tt->pages));
>
> - err = ttm_pool_alloc(pool, tt, &simple_ctx);
> + err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
> KUNIT_ASSERT_EQ(test, err, 0);
>
> ttm_pool_free(pool, tt);
> @@ -319,7 +319,7 @@ static void ttm_pool_alloc_order_mismatch(struct kunit *test)
> KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
> KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt_tt->pages));
>
> - err = ttm_pool_alloc(pool, tt, &simple_ctx);
> + err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
> KUNIT_ASSERT_EQ(test, err, 0);
>
> ttm_pool_free(pool, tt);
> @@ -349,7 +349,7 @@ static void ttm_pool_free_dma_alloc(struct kunit *test)
> KUNIT_ASSERT_NOT_NULL(test, pool);
>
> ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
> - ttm_pool_alloc(pool, tt, &simple_ctx);
> + ttm_pool_alloc(pool, tt, false, &simple_ctx);
>
> pt = &pool->caching[caching].orders[order];
> KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
> @@ -380,7 +380,7 @@ static void ttm_pool_free_no_dma_alloc(struct kunit *test)
> KUNIT_ASSERT_NOT_NULL(test, pool);
>
> ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false);
> - ttm_pool_alloc(pool, tt, &simple_ctx);
> + ttm_pool_alloc(pool, tt, false, &simple_ctx);
>
> pt = &pool->caching[caching].orders[order];
> KUNIT_ASSERT_TRUE(test, list_lru_count(&pt->pages) == 1);
> diff --git a/drivers/gpu/drm/ttm/tests/ttm_tt_test.c b/drivers/gpu/drm/ttm/tests/ttm_tt_test.c
> index 61ec6f580b62..333c503e218b 100644
> --- a/drivers/gpu/drm/ttm/tests/ttm_tt_test.c
> +++ b/drivers/gpu/drm/ttm/tests/ttm_tt_test.c
> @@ -262,7 +262,7 @@ static void ttm_tt_populate_null_ttm(struct kunit *test)
> struct ttm_operation_ctx ctx = { };
> int err;
>
> - err = ttm_tt_populate(devs->ttm_dev, NULL, &ctx);
> + err = ttm_tt_populate(devs->ttm_dev, NULL, false, &ctx);
> KUNIT_ASSERT_EQ(test, err, -EINVAL);
> }
>
> @@ -283,11 +283,11 @@ static void ttm_tt_populate_populated_ttm(struct kunit *test)
> err = ttm_tt_init(tt, bo, 0, ttm_cached, 0);
> KUNIT_ASSERT_EQ(test, err, 0);
>
> - err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
> + err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
> KUNIT_ASSERT_EQ(test, err, 0);
> populated_page = *tt->pages;
>
> - err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
> + err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
> KUNIT_ASSERT_PTR_EQ(test, populated_page, *tt->pages);
> }
>
> @@ -307,7 +307,7 @@ static void ttm_tt_unpopulate_basic(struct kunit *test)
> err = ttm_tt_init(tt, bo, 0, ttm_cached, 0);
> KUNIT_ASSERT_EQ(test, err, 0);
>
> - err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
> + err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
> KUNIT_ASSERT_EQ(test, err, 0);
> KUNIT_ASSERT_TRUE(test, ttm_tt_is_populated(tt));
>
> @@ -351,7 +351,7 @@ static void ttm_tt_swapin_basic(struct kunit *test)
> err = ttm_tt_init(tt, bo, 0, ttm_cached, 0);
> KUNIT_ASSERT_EQ(test, err, 0);
>
> - err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
> + err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
> KUNIT_ASSERT_EQ(test, err, 0);
> KUNIT_ASSERT_TRUE(test, ttm_tt_is_populated(tt));
>
> @@ -361,7 +361,7 @@ static void ttm_tt_swapin_basic(struct kunit *test)
> KUNIT_ASSERT_TRUE(test, tt->page_flags & TTM_TT_FLAG_SWAPPED);
>
> /* Swapout depopulates TT, allocate pages and then swap them in */
> - err = ttm_pool_alloc(&devs->ttm_dev->pool, tt, &ctx);
> + err = ttm_pool_alloc(&devs->ttm_dev->pool, tt, false, &ctx);
> KUNIT_ASSERT_EQ(test, err, 0);
>
> err = ttm_tt_swapin(tt);
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index f4d9e68b21e7..af04bb8e2c2a 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -142,7 +142,7 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
> goto out_err;
>
> if (mem->mem_type != TTM_PL_SYSTEM) {
> - ret = ttm_bo_populate(bo, ctx);
> + ret = ttm_bo_populate(bo, false, ctx);
> if (ret)
> goto out_err;
> }
> @@ -1256,6 +1256,7 @@ void ttm_bo_tt_destroy(struct ttm_buffer_object *bo)
> * is set to true.
> */
> int ttm_bo_populate(struct ttm_buffer_object *bo,
> + bool memcg_account,
> struct ttm_operation_ctx *ctx)
> {
> struct ttm_tt *tt = bo->ttm;
> @@ -1268,7 +1269,7 @@ int ttm_bo_populate(struct ttm_buffer_object *bo,
> return 0;
>
> swapped = ttm_tt_is_swapped(tt);
> - ret = ttm_tt_populate(bo->bdev, tt, ctx);
> + ret = ttm_tt_populate(bo->bdev, tt, memcg_account, ctx);
> if (ret)
> return ret;
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
> index 250675d56b1c..764d1cf1ecbe 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> @@ -167,7 +167,7 @@ int ttm_bo_move_memcpy(struct ttm_buffer_object *bo,
> src_man = ttm_manager_type(bdev, src_mem->mem_type);
> if (ttm && ((ttm->page_flags & TTM_TT_FLAG_SWAPPED) ||
> dst_man->use_tt)) {
> - ret = ttm_bo_populate(bo, ctx);
> + ret = ttm_bo_populate(bo, false, ctx);
> if (ret)
> return ret;
> }
> @@ -354,7 +354,7 @@ static int ttm_bo_kmap_ttm(struct ttm_buffer_object *bo,
>
> BUG_ON(!ttm);
>
> - ret = ttm_bo_populate(bo, &ctx);
> + ret = ttm_bo_populate(bo, false, &ctx);
> if (ret)
> return ret;
>
> @@ -511,7 +511,7 @@ int ttm_bo_vmap(struct ttm_buffer_object *bo, struct iosys_map *map)
> pgprot_t prot;
> void *vaddr;
>
> - ret = ttm_bo_populate(bo, &ctx);
> + ret = ttm_bo_populate(bo, false, &ctx);
> if (ret)
> return ret;
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index b47020fca199..c5ad447debe3 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -225,7 +225,9 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
> };
>
> ttm = bo->ttm;
> - err = ttm_bo_populate(bo, &ctx);
> + err = ttm_bo_populate(bo,
> + false,
> + &ctx);
> if (err) {
> if (err == -EINTR || err == -ERESTARTSYS ||
> err == -EAGAIN)
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index c6192c915f0d..0526900366e5 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -744,6 +744,7 @@ static unsigned int ttm_pool_alloc_find_order(unsigned int highest,
> }
>
> static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> + bool memcg_account,
> const struct ttm_operation_ctx *ctx,
> struct ttm_pool_alloc_state *alloc,
> struct ttm_pool_tt_restore *restore)
> @@ -854,6 +855,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> * Returns: 0 on successe, negative error code otherwise.
> */
> int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> + bool memcg_account,
> struct ttm_operation_ctx *ctx)
> {
> struct ttm_pool_alloc_state alloc;
> @@ -863,7 +865,7 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>
> ttm_pool_alloc_state_init(tt, &alloc);
>
> - return __ttm_pool_alloc(pool, tt, ctx, &alloc, NULL);
> + return __ttm_pool_alloc(pool, tt, memcg_account, ctx, &alloc, NULL);
> }
> EXPORT_SYMBOL(ttm_pool_alloc);
>
> @@ -916,7 +918,7 @@ int ttm_pool_restore_and_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> return 0;
> }
>
> - return __ttm_pool_alloc(pool, tt, ctx, &alloc, tt->restore);
> + return __ttm_pool_alloc(pool, tt, false, ctx, &alloc, tt->restore);
> }
>
> /**
> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
> index 506e257dfba8..8f38de3b2f1c 100644
> --- a/drivers/gpu/drm/ttm/ttm_tt.c
> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> @@ -366,7 +366,9 @@ int ttm_tt_swapout(struct ttm_device *bdev, struct ttm_tt *ttm,
> EXPORT_SYMBOL_FOR_TESTS_ONLY(ttm_tt_swapout);
>
> int ttm_tt_populate(struct ttm_device *bdev,
> - struct ttm_tt *ttm, struct ttm_operation_ctx *ctx)
> + struct ttm_tt *ttm,
> + bool memcg_account,
> + struct ttm_operation_ctx *ctx)
> {
> int ret;
>
> @@ -395,9 +397,9 @@ int ttm_tt_populate(struct ttm_device *bdev,
> }
>
> if (bdev->funcs->ttm_tt_populate)
> - ret = bdev->funcs->ttm_tt_populate(bdev, ttm, ctx);
> + ret = bdev->funcs->ttm_tt_populate(bdev, ttm, memcg_account, ctx);
> else
> - ret = ttm_pool_alloc(&bdev->pool, ttm, ctx);
> + ret = ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
> if (ret)
> goto error;
>
> diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c b/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
> index fa5841fda659..a4d4ebf585fe 100644
> --- a/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
> +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
> @@ -569,13 +569,13 @@ int vmw_bo_cpu_blit(struct vmw_bo *vmw_dst,
> dma_resv_assert_held(src->base.resv);
>
> if (!ttm_tt_is_populated(dst->ttm)) {
> - ret = dst->bdev->funcs->ttm_tt_populate(dst->bdev, dst->ttm, &ctx);
> + ret = dst->bdev->funcs->ttm_tt_populate(dst->bdev, dst->ttm, false, &ctx);
> if (ret)
> return ret;
> }
>
> if (!ttm_tt_is_populated(src->ttm)) {
> - ret = src->bdev->funcs->ttm_tt_populate(src->bdev, src->ttm, &ctx);
> + ret = src->bdev->funcs->ttm_tt_populate(src->bdev, src->ttm, false, &ctx);
> if (ret)
> return ret;
> }
> diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
> index 5553892d7c3e..2351dafc1c68 100644
> --- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
> +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
> @@ -360,7 +360,8 @@ static void vmw_ttm_destroy(struct ttm_device *bdev, struct ttm_tt *ttm)
>
>
> static int vmw_ttm_populate(struct ttm_device *bdev,
> - struct ttm_tt *ttm, struct ttm_operation_ctx *ctx)
> + struct ttm_tt *ttm, bool memcg_account,
> + struct ttm_operation_ctx *ctx)
> {
> bool external = (ttm->page_flags & TTM_TT_FLAG_EXTERNAL) != 0;
>
> @@ -372,7 +373,7 @@ static int vmw_ttm_populate(struct ttm_device *bdev,
> ttm->dma_address,
> ttm->num_pages);
>
> - return ttm_pool_alloc(&bdev->pool, ttm, ctx);
> + return ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
> }
>
> static void vmw_ttm_unpopulate(struct ttm_device *bdev,
> @@ -580,7 +581,7 @@ int vmw_bo_create_and_populate(struct vmw_private *dev_priv,
> if (unlikely(ret != 0))
> return ret;
>
> - ret = vmw_ttm_populate(vbo->tbo.bdev, vbo->tbo.ttm, &ctx);
> + ret = vmw_ttm_populate(vbo->tbo.bdev, vbo->tbo.ttm, false, &ctx);
> if (likely(ret == 0)) {
> struct vmw_ttm_tt *vmw_tt =
> container_of(vbo->tbo.ttm, struct vmw_ttm_tt, dma_ttm);
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index 7aa2c17825da..522cbff11563 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -504,6 +504,7 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
> }
>
> static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
> + bool memcg_account,
> struct ttm_operation_ctx *ctx)
> {
> struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
> @@ -521,7 +522,7 @@ static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
> err = ttm_tt_restore(ttm_dev, tt, ctx);
> } else {
> ttm_tt_clear_backed_up(tt);
> - err = ttm_pool_alloc(&ttm_dev->pool, tt, ctx);
> + err = ttm_pool_alloc(&ttm_dev->pool, tt, memcg_account, ctx);
> }
> if (err)
> return err;
> diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
> index 894ff7ccd68e..099dc2604baa 100644
> --- a/include/drm/ttm/ttm_bo.h
> +++ b/include/drm/ttm/ttm_bo.h
> @@ -464,6 +464,7 @@ pgprot_t ttm_io_prot(struct ttm_buffer_object *bo, struct ttm_resource *res,
> pgprot_t tmp);
> void ttm_bo_tt_destroy(struct ttm_buffer_object *bo);
> int ttm_bo_populate(struct ttm_buffer_object *bo,
> + bool memcg_account,
> struct ttm_operation_ctx *ctx);
>
> /* Driver LRU walk helpers initially targeted for shrinking. */
> diff --git a/include/drm/ttm/ttm_device.h b/include/drm/ttm/ttm_device.h
> index 39b8636b1845..903ca40ebf92 100644
> --- a/include/drm/ttm/ttm_device.h
> +++ b/include/drm/ttm/ttm_device.h
> @@ -84,6 +84,7 @@ struct ttm_device_funcs {
> */
> int (*ttm_tt_populate)(struct ttm_device *bdev,
> struct ttm_tt *ttm,
> + bool memcg_account,
> struct ttm_operation_ctx *ctx);
>
> /**
> diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
> index df56527c4853..da5b94226203 100644
> --- a/include/drm/ttm/ttm_pool.h
> +++ b/include/drm/ttm/ttm_pool.h
> @@ -79,6 +79,7 @@ struct ttm_pool {
> };
>
> int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> + bool memcg_account,
> struct ttm_operation_ctx *ctx);
> void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);
>
> diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
> index 406437ad674b..15d4019685f6 100644
> --- a/include/drm/ttm/ttm_tt.h
> +++ b/include/drm/ttm/ttm_tt.h
> @@ -250,6 +250,7 @@ int ttm_tt_swapout(struct ttm_device *bdev, struct ttm_tt *ttm,
> * Calls the driver method to allocate pages for a ttm
> */
> int ttm_tt_populate(struct ttm_device *bdev, struct ttm_tt *ttm,
> + bool memcg_account,
> struct ttm_operation_ctx *ctx);
>
> /**
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 13/17] ttm/pool: enable memcg tracking and shrinker.
2025-06-30 4:49 ` [PATCH 13/17] ttm/pool: enable memcg tracking and shrinker Dave Airlie
@ 2025-06-30 10:23 ` Christian König
2025-06-30 21:23 ` David Airlie
2025-06-30 11:59 ` kernel test robot
2025-07-02 16:41 ` Shakeel Butt
2 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2025-06-30 10:23 UTC (permalink / raw)
To: Dave Airlie, dri-devel, linux-mm, Johannes Weiner
Cc: Dave Chinner, Kairui Song, Dave Airlie
On 30.06.25 06:49, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> This enables all the backend code to use the list lru in memcg mode,
> and set the shrinker to be memcg aware.
>
> It adds the loop case for when pooled pages end up being reparented
> to a higher memcg group, that newer memcg can search for them there
> and take them back.
Again, this makes no sense at all and will break existing use cases.
Regards,
Christian.
>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> ---
> drivers/gpu/drm/ttm/ttm_pool.c | 123 ++++++++++++++++++++++++++++-----
> 1 file changed, 105 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index 210f4ac4de67..49e92f40ab23 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -143,7 +143,9 @@ static int ttm_pool_nid(struct ttm_pool *pool) {
> }
>
> /* Allocate pages of size 1 << order with the given gfp_flags */
> -static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> +static struct page *ttm_pool_alloc_page(struct ttm_pool *pool,
> + struct obj_cgroup *objcg,
> + gfp_t gfp_flags,
> unsigned int order)
> {
> unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
> @@ -163,6 +165,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> p = alloc_pages_node(pool->nid, gfp_flags, order);
> if (p) {
> p->private = order;
> + if (!mem_cgroup_charge_gpu_page(objcg, p, order, gfp_flags, false)) {
> + __free_pages(p, order);
> + return NULL;
> + }
> mod_node_page_state(NODE_DATA(ttm_pool_nid(pool)), NR_GPU_ACTIVE, (1 << order));
> }
> return p;
> @@ -214,6 +220,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
> #endif
>
> if (!pool || !pool->use_dma_alloc) {
> + mem_cgroup_uncharge_gpu_page(p, order, reclaim);
> mod_node_page_state(NODE_DATA(ttm_pool_nid(pool)),
> reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE,
> -(1 << order));
> @@ -303,12 +310,13 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
>
> INIT_LIST_HEAD(&p->lru);
> rcu_read_lock();
> - list_lru_add(&pt->pages, &p->lru, nid, NULL);
> + list_lru_add(&pt->pages, &p->lru, nid, page_memcg_check(p));
> rcu_read_unlock();
>
> atomic_long_add(num_pages, &allocated_pages[nid]);
> mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, -num_pages);
> mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, num_pages);
> + mem_cgroup_move_gpu_page_reclaim(NULL, p, pt->order, true);
> }
>
> static enum lru_status take_one_from_lru(struct list_head *item,
> @@ -323,20 +331,59 @@ static enum lru_status take_one_from_lru(struct list_head *item,
> return LRU_REMOVED;
> }
>
> -/* Take pages from a specific pool_type, return NULL when nothing available */
> -static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid)
> +static int pool_lru_get_page(struct ttm_pool_type *pt, int nid,
> + struct page **page_out,
> + struct obj_cgroup *objcg,
> + struct mem_cgroup *memcg)
> {
> int ret;
> struct page *p = NULL;
> unsigned long nr_to_walk = 1;
> + unsigned int num_pages = 1 << pt->order;
>
> - ret = list_lru_walk_node(&pt->pages, nid, take_one_from_lru, (void *)&p, &nr_to_walk);
> + ret = list_lru_walk_one(&pt->pages, nid, memcg, take_one_from_lru, (void *)&p, &nr_to_walk);
> if (ret == 1 && p) {
> - atomic_long_sub(1 << pt->order, &allocated_pages[nid]);
> - mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, (1 << pt->order));
> - mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, -(1 << pt->order));
> + atomic_long_sub(num_pages, &allocated_pages[nid]);
> + mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, -num_pages);
> +
> + if (!mem_cgroup_move_gpu_page_reclaim(objcg, p, pt->order, false)) {
> + __free_pages(p, pt->order);
> + p = NULL;
> + }
> + if (p)
> + mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, num_pages);
> }
> - return p;
> + *page_out = p;
> + return ret;
> +}
> +
> +/* Take pages from a specific pool_type, return NULL when nothing available */
> +static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid,
> + struct obj_cgroup *orig_objcg)
> +{
> + struct page *page_out = NULL;
> + int ret;
> + struct mem_cgroup *orig_memcg = orig_objcg ? get_mem_cgroup_from_objcg(orig_objcg) : NULL;
> + struct mem_cgroup *memcg = orig_memcg;
> +
> + /*
> + * Attempt to get a page from the current memcg, but if it hasn't got any in it's level,
> + * go up to the parent and check there. This helps the scenario where multiple apps get
> + * started into their own cgroup from a common parent and want to reuse the pools.
> + */
> + while (!page_out) {
> + ret = pool_lru_get_page(pt, nid, &page_out, orig_objcg, memcg);
> + if (ret == 1)
> + break;
> + if (!memcg)
> + break;
> + memcg = parent_mem_cgroup(memcg);
> + if (!memcg)
> + break;
> + }
> +
> + mem_cgroup_put(orig_memcg);
> + return page_out;
> }
>
> /* Initialize and add a pool type to the global shrinker list */
> @@ -346,7 +393,7 @@ static void ttm_pool_type_init(struct ttm_pool_type *pt, struct ttm_pool *pool,
> pt->pool = pool;
> pt->caching = caching;
> pt->order = order;
> - list_lru_init(&pt->pages);
> + list_lru_init_memcg(&pt->pages, mm_shrinker);
>
> spin_lock(&shrinker_lock);
> list_add_tail(&pt->shrinker_list, &shrinker_list);
> @@ -389,6 +436,30 @@ static void ttm_pool_type_fini(struct ttm_pool_type *pt)
> ttm_pool_dispose_list(pt, &dispose);
> }
>
> +static int ttm_pool_check_objcg(struct obj_cgroup *objcg)
> +{
> +#ifdef CONFIG_MEMCG
> + int r = 0;
> + struct mem_cgroup *memcg;
> + if (!objcg)
> + return 0;
> +
> + memcg = get_mem_cgroup_from_objcg(objcg);
> + for (unsigned i = 0; i < NR_PAGE_ORDERS; i++) {
> + r = memcg_list_lru_alloc(memcg, &global_write_combined[i].pages, GFP_KERNEL);
> + if (r) {
> + break;
> + }
> + r = memcg_list_lru_alloc(memcg, &global_uncached[i].pages, GFP_KERNEL);
> + if (r) {
> + break;
> + }
> + }
> + css_put(&memcg->css);
> +#endif
> + return 0;
> +}
> +
> /* Return the pool_type to use for the given caching and order */
> static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
> enum ttm_caching caching,
> @@ -418,7 +489,9 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
> }
>
> /* Free pages using the per-node shrinker list */
> -static unsigned int ttm_pool_shrink(int nid, unsigned long num_to_free)
> +static unsigned int ttm_pool_shrink(int nid,
> + struct mem_cgroup *memcg,
> + unsigned long num_to_free)
> {
> LIST_HEAD(dispose);
> struct ttm_pool_type *pt;
> @@ -430,7 +503,11 @@ static unsigned int ttm_pool_shrink(int nid, unsigned long num_to_free)
> list_move_tail(&pt->shrinker_list, &shrinker_list);
> spin_unlock(&shrinker_lock);
>
> - num_pages = list_lru_walk_node(&pt->pages, nid, pool_move_to_dispose_list, &dispose, &num_to_free);
> + if (!memcg) {
> + num_pages = list_lru_walk_node(&pt->pages, nid, pool_move_to_dispose_list, &dispose, &num_to_free);
> + } else {
> + num_pages = list_lru_walk_one(&pt->pages, nid, memcg, pool_move_to_dispose_list, &dispose, &num_to_free);
> + }
> num_pages *= 1 << pt->order;
>
> ttm_pool_dispose_list(pt, &dispose);
> @@ -595,6 +672,7 @@ static int ttm_pool_restore_commit(struct ttm_pool_tt_restore *restore,
> */
> ttm_pool_split_for_swap(restore->pool, p);
> copy_highpage(restore->alloced_page + i, p);
> + p->memcg_data = 0;
> __free_pages(p, 0);
> }
>
> @@ -756,6 +834,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> bool allow_pools;
> struct page *p;
> int r;
> + struct obj_cgroup *objcg = memcg_account ? tt->objcg : NULL;
>
> WARN_ON(!alloc->remaining_pages || ttm_tt_is_populated(tt));
> WARN_ON(alloc->dma_addr && !pool->dev);
> @@ -773,6 +852,9 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>
> page_caching = tt->caching;
> allow_pools = true;
> +
> + ttm_pool_check_objcg(objcg);
> +
> for (order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
> alloc->remaining_pages;
> order = ttm_pool_alloc_find_order(order, alloc)) {
> @@ -782,7 +864,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> p = NULL;
> pt = ttm_pool_select_type(pool, page_caching, order);
> if (pt && allow_pools)
> - p = ttm_pool_type_take(pt, ttm_pool_nid(pool));
> + p = ttm_pool_type_take(pt, ttm_pool_nid(pool), objcg);
>
> /*
> * If that fails or previously failed, allocate from system.
> @@ -793,7 +875,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
> if (!p) {
> page_caching = ttm_cached;
> allow_pools = false;
> - p = ttm_pool_alloc_page(pool, gfp_flags, order);
> + p = ttm_pool_alloc_page(pool, objcg, gfp_flags, order);
> }
> /* If that fails, lower the order if possible and retry. */
> if (!p) {
> @@ -937,7 +1019,7 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
>
> while (atomic_long_read(&allocated_pages[nid]) > pool_node_limit[nid]) {
> unsigned long diff = pool_node_limit[nid] - atomic_long_read(&allocated_pages[nid]);
> - ttm_pool_shrink(nid, diff);
> + ttm_pool_shrink(nid, NULL, diff);
> }
> }
> EXPORT_SYMBOL(ttm_pool_free);
> @@ -1057,6 +1139,7 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> if (flags->purge) {
> shrunken += num_pages;
> page->private = 0;
> + page->memcg_data = 0;
> __free_pages(page, order);
> memset(tt->pages + i, 0,
> num_pages * sizeof(*tt->pages));
> @@ -1193,10 +1276,14 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
> struct shrink_control *sc)
> {
> unsigned long num_freed = 0;
> + int num_pools;
> + spin_lock(&shrinker_lock);
> + num_pools = list_count_nodes(&shrinker_list);
> + spin_unlock(&shrinker_lock);
>
> do
> - num_freed += ttm_pool_shrink(sc->nid, sc->nr_to_scan);
> - while (num_freed < sc->nr_to_scan &&
> + num_freed += ttm_pool_shrink(sc->nid, sc->memcg, sc->nr_to_scan);
> + while (num_pools-- >= 0 && num_freed < sc->nr_to_scan &&
> atomic_long_read(&allocated_pages[sc->nid]));
>
> sc->nr_scanned = num_freed;
> @@ -1388,7 +1475,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
> spin_lock_init(&shrinker_lock);
> INIT_LIST_HEAD(&shrinker_list);
>
> - mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
> + mm_shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE, "drm-ttm_pool");
> if (!mm_shrinker)
> return -ENOMEM;
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 12/17] ttm: add objcg pointer to bo and tt
2025-06-30 4:49 ` [PATCH 12/17] ttm: add objcg pointer to bo and tt Dave Airlie
@ 2025-06-30 10:24 ` Christian König
2025-06-30 21:33 ` David Airlie
0 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2025-06-30 10:24 UTC (permalink / raw)
To: Dave Airlie, dri-devel, linux-mm, Johannes Weiner
Cc: Dave Chinner, Kairui Song, Dave Airlie
On 30.06.25 06:49, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> This just adds the obj cgroup pointer to the bo and tt structs,
> and sets it between them.
>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> ---
> drivers/gpu/drm/ttm/ttm_tt.c | 1 +
> include/drm/ttm/ttm_bo.h | 6 ++++++
> include/drm/ttm/ttm_tt.h | 2 ++
> 3 files changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
> index 8f38de3b2f1c..0c54d5e2bfdd 100644
> --- a/drivers/gpu/drm/ttm/ttm_tt.c
> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> @@ -162,6 +162,7 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
> ttm->caching = caching;
> ttm->restore = NULL;
> ttm->backup = NULL;
> + ttm->objcg = bo->objcg;
> }
>
> int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
> diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
> index 099dc2604baa..f26ec0a0273f 100644
> --- a/include/drm/ttm/ttm_bo.h
> +++ b/include/drm/ttm/ttm_bo.h
> @@ -135,6 +135,12 @@ struct ttm_buffer_object {
> * reservation lock.
> */
> struct sg_table *sg;
> +
> + /**
> + * @objcg: object cgroup to charge this to if it ends up using system memory.
> + * NULL means don't charge.
> + */
> + struct obj_cgroup *objcg;
> };
>
> #define TTM_BO_MAP_IOMEM_MASK 0x80
> diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
> index 15d4019685f6..c13fea4c2915 100644
> --- a/include/drm/ttm/ttm_tt.h
> +++ b/include/drm/ttm/ttm_tt.h
> @@ -126,6 +126,8 @@ struct ttm_tt {
> enum ttm_caching caching;
> /** @restore: Partial restoration from backup state. TTM private */
> struct ttm_pool_tt_restore *restore;
> + /** @objcg: Object cgroup for this TT allocation */
> + struct obj_cgroup *objcg;
> };
We should probably keep that out of the pool and account the memory to the BO instead.
Regards,
Christian.
>
> /**
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 04/17] ttm/pool: port to list_lru. (v2)
2025-06-30 4:49 ` [PATCH 04/17] ttm/pool: port to list_lru. (v2) Dave Airlie
@ 2025-06-30 10:37 ` kernel test robot
0 siblings, 0 replies; 52+ messages in thread
From: kernel test robot @ 2025-06-30 10:37 UTC (permalink / raw)
To: Dave Airlie, dri-devel, linux-mm, Johannes Weiner,
Christian Koenig
Cc: llvm, oe-kbuild-all, Dave Chinner, Kairui Song, Dave Airlie
Hi Dave,
kernel test robot noticed the following build errors:
[auto build test ERROR on drm/drm-next]
[also build test ERROR on v6.16-rc4]
[cannot apply to akpm-mm/mm-everything linus/master next-20250630]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Dave-Airlie/drm-ttm-use-gpu-mm-stats-to-track-gpu-memory-allocations-v2/20250630-134938
base: git://anongit.freedesktop.org/drm/drm drm-next
patch link: https://lore.kernel.org/r/20250630045005.1337339-5-airlied%40gmail.com
patch subject: [PATCH 04/17] ttm/pool: port to list_lru. (v2)
config: i386-buildonly-randconfig-003-20250630 (https://download.01.org/0day-ci/archive/20250630/202506301808.C73SwQJr-lkp@intel.com/config)
compiler: clang version 20.1.7 (https://github.com/llvm/llvm-project 6146a88f60492b520a36f8f8f3231e15f3cc6082)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250630/202506301808.C73SwQJr-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506301808.C73SwQJr-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from drivers/gpu/drm/i915/intel_region_ttm.c:5:
In file included from include/drm/ttm/ttm_device.h:31:
>> include/drm/ttm/ttm_pool.h:57:18: error: field has incomplete type 'struct list_lru'
57 | struct list_lru pages;
| ^
include/linux/xarray.h:25:8: note: forward declaration of 'struct list_lru'
25 | struct list_lru;
| ^
In file included from drivers/gpu/drm/i915/intel_region_ttm.c:8:
In file included from drivers/gpu/drm/i915/i915_drv.h:35:
In file included from include/linux/pci.h:1656:
In file included from include/linux/dmapool.h:15:
In file included from include/linux/scatterlist.h:8:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:98:11: warning: array index 3 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
98 | return (set->sig[3] | set->sig[2] |
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from drivers/gpu/drm/i915/intel_region_ttm.c:8:
In file included from drivers/gpu/drm/i915/i915_drv.h:35:
In file included from include/linux/pci.h:1656:
In file included from include/linux/dmapool.h:15:
In file included from include/linux/scatterlist.h:8:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:98:25: warning: array index 2 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
98 | return (set->sig[3] | set->sig[2] |
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from drivers/gpu/drm/i915/intel_region_ttm.c:8:
In file included from drivers/gpu/drm/i915/i915_drv.h:35:
In file included from include/linux/pci.h:1656:
In file included from include/linux/dmapool.h:15:
In file included from include/linux/scatterlist.h:8:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:114:11: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
114 | return (set1->sig[3] == set2->sig[3]) &&
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from drivers/gpu/drm/i915/intel_region_ttm.c:8:
In file included from drivers/gpu/drm/i915/i915_drv.h:35:
In file included from include/linux/pci.h:1656:
In file included from include/linux/dmapool.h:15:
In file included from include/linux/scatterlist.h:8:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:114:27: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
114 | return (set1->sig[3] == set2->sig[3]) &&
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from drivers/gpu/drm/i915/intel_region_ttm.c:8:
In file included from drivers/gpu/drm/i915/i915_drv.h:35:
In file included from include/linux/pci.h:1656:
In file included from include/linux/dmapool.h:15:
In file included from include/linux/scatterlist.h:8:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:115:5: warning: array index 2 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
115 | (set1->sig[2] == set2->sig[2]) &&
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from drivers/gpu/drm/i915/intel_region_ttm.c:8:
In file included from drivers/gpu/drm/i915/i915_drv.h:35:
In file included from include/linux/pci.h:1656:
In file included from include/linux/dmapool.h:15:
In file included from include/linux/scatterlist.h:8:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:115:21: warning: array index 2 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
115 | (set1->sig[2] == set2->sig[2]) &&
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from drivers/gpu/drm/i915/intel_region_ttm.c:8:
In file included from drivers/gpu/drm/i915/i915_drv.h:35:
In file included from include/linux/pci.h:1656:
In file included from include/linux/dmapool.h:15:
In file included from include/linux/scatterlist.h:8:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:157:1: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
157 | _SIG_SET_BINOP(sigorsets, _sig_or)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vim +57 include/drm/ttm/ttm_pool.h
40
41 /**
42 * struct ttm_pool_type - Pool for a certain memory type
43 *
44 * @pool: the pool we belong to, might be NULL for the global ones
45 * @order: the allocation order our pages have
46 * @caching: the caching type our pages have
47 * @shrinker_list: our place on the global shrinker list
48 * @pages: the lru_list of pages in the pool
49 */
50 struct ttm_pool_type {
51 struct ttm_pool *pool;
52 unsigned int order;
53 enum ttm_caching caching;
54
55 struct list_head shrinker_list;
56
> 57 struct list_lru pages;
58 };
59
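A minimal sketch of the likely fix, assuming the error is simply that
ttm_pool.h now embeds struct list_lru by value while only the forward
declaration pulled in via xarray.h is in scope; including the full
definition in the header should resolve it:

    /* include/drm/ttm/ttm_pool.h -- sketch, not a posted patch */
    #include <linux/list_lru.h>

    struct ttm_pool_type {
            struct ttm_pool *pool;
            unsigned int order;
            enum ttm_caching caching;

            struct list_head shrinker_list;

            struct list_lru pages;  /* complete type now visible */
    };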
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 13/17] ttm/pool: enable memcg tracking and shrinker.
2025-06-30 4:49 ` [PATCH 13/17] ttm/pool: enable memcg tracking and shrinker Dave Airlie
2025-06-30 10:23 ` Christian König
@ 2025-06-30 11:59 ` kernel test robot
2025-07-02 16:41 ` Shakeel Butt
2 siblings, 0 replies; 52+ messages in thread
From: kernel test robot @ 2025-06-30 11:59 UTC (permalink / raw)
To: Dave Airlie, dri-devel, linux-mm, Johannes Weiner,
Christian Koenig
Cc: oe-kbuild-all, Dave Chinner, Kairui Song, Dave Airlie
Hi Dave,
kernel test robot noticed the following build errors:
[auto build test ERROR on drm/drm-next]
[cannot apply to akpm-mm/mm-everything linus/master v6.16-rc4 next-20250630]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Dave-Airlie/drm-ttm-use-gpu-mm-stats-to-track-gpu-memory-allocations-v2/20250630-134938
base: git://anongit.freedesktop.org/drm/drm drm-next
patch link: https://lore.kernel.org/r/20250630045005.1337339-14-airlied%40gmail.com
patch subject: [PATCH 13/17] ttm/pool: enable memcg tracking and shrinker.
config: riscv-randconfig-001-20250630 (https://download.01.org/0day-ci/archive/20250630/202506301921.i2QYb2bo-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250630/202506301921.i2QYb2bo-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506301921.i2QYb2bo-lkp@intel.com/
All errors (new ones prefixed by >>):
drivers/gpu/drm/ttm/ttm_pool.c: In function 'ttm_pool_restore_commit':
>> drivers/gpu/drm/ttm/ttm_pool.c:675:5: error: 'struct page' has no member named 'memcg_data'
p->memcg_data = 0;
^~
drivers/gpu/drm/ttm/ttm_pool.c: In function 'ttm_pool_backup':
drivers/gpu/drm/ttm/ttm_pool.c:1142:9: error: 'struct page' has no member named 'memcg_data'
page->memcg_data = 0;
^~
vim +675 drivers/gpu/drm/ttm/ttm_pool.c
627
628 /*
629 * When restoring, restore backed-up content to the newly allocated page and
630 * if successful, populate the page-table and dma-address arrays.
631 */
632 static int ttm_pool_restore_commit(struct ttm_pool_tt_restore *restore,
633 struct file *backup,
634 const struct ttm_operation_ctx *ctx,
635 struct ttm_pool_alloc_state *alloc)
636
637 {
638 pgoff_t i, nr = 1UL << restore->order;
639 struct page **first_page = alloc->pages;
640 struct page *p;
641 int ret = 0;
642
643 for (i = restore->restored_pages; i < nr; ++i) {
644 p = first_page[i];
645 if (ttm_backup_page_ptr_is_handle(p)) {
646 unsigned long handle = ttm_backup_page_ptr_to_handle(p);
647
648 if (IS_ENABLED(CONFIG_FAULT_INJECTION) && ctx->interruptible &&
649 should_fail(&backup_fault_inject, 1)) {
650 ret = -EINTR;
651 break;
652 }
653
654 if (handle == 0) {
655 restore->restored_pages++;
656 continue;
657 }
658
659 ret = ttm_backup_copy_page(backup, restore->alloced_page + i,
660 handle, ctx->interruptible);
661 if (ret)
662 break;
663
664 ttm_backup_drop(backup, handle);
665 } else if (p) {
666 /*
667 * We could probably avoid splitting the old page
668 * using clever logic, but ATM we don't care, as
669 * we prioritize releasing memory ASAP. Note that
670 * here, the old retained page is always write-back
671 * cached.
672 */
673 ttm_pool_split_for_swap(restore->pool, p);
674 copy_highpage(restore->alloced_page + i, p);
> 675 p->memcg_data = 0;
676 __free_pages(p, 0);
677 }
678
679 restore->restored_pages++;
680 first_page[i] = ttm_backup_handle_to_page_ptr(0);
681 }
682
683 if (ret) {
684 if (!restore->restored_pages) {
685 dma_addr_t *dma_addr = alloc->dma_addr ? &restore->first_dma : NULL;
686
687 ttm_pool_unmap_and_free(restore->pool, restore->alloced_page,
688 dma_addr, restore->page_caching);
689 restore->restored_pages = nr;
690 }
691 return ret;
692 }
693
694 ttm_pool_allocated_page_commit(restore->alloced_page, restore->first_dma,
695 alloc, nr);
696 if (restore->page_caching == alloc->tt_caching || PageHighMem(restore->alloced_page))
697 alloc->caching_divide = alloc->pages;
698 restore->snapshot_alloc = *alloc;
699 restore->alloced_pages += nr;
700
701 return 0;
702 }
703
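A minimal sketch of one way to keep !CONFIG_MEMCG configs building,
assuming the clear is only meaningful when memcg is enabled (struct
page has no memcg_data member otherwise, which is what this riscv
randconfig trips over); whether to open-code the guard or hide it
behind a helper is a choice for the next revision:

    /* sketch only: guard the clear so !CONFIG_MEMCG still builds */
    #ifdef CONFIG_MEMCG
            p->memcg_data = 0;
    #endif
            __free_pages(p, 0);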
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 13/17] ttm/pool: enable memcg tracking and shrinker.
2025-06-30 10:23 ` Christian König
@ 2025-06-30 21:23 ` David Airlie
0 siblings, 0 replies; 52+ messages in thread
From: David Airlie @ 2025-06-30 21:23 UTC (permalink / raw)
To: Christian König
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
Kairui Song
On Mon, Jun 30, 2025 at 8:23 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 30.06.25 06:49, Dave Airlie wrote:
> > From: Dave Airlie <airlied@redhat.com>
> >
> > This enables all the backend code to use the list lru in memcg mode,
> > and set the shrinker to be memcg aware.
> >
> > It adds the loop case for when pooled pages end up being reparented
> > to a higher memcg group, that newer memcg can search for them there
> > and take them back.
>
> Again, this makes no sense at all and will break existing use cases.
>
This is the use case you raised as needing to work before:
Allocate a bunch of uncached pages to a child cgroup, destroying the
cgroup makes those uncached pages belong to the parent cgroup.
Create another child cgroup, and when it allocates uncached pages they
come from the parent cgroup first.
What existing use case will this break?
Dave.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 06/17] ttm/pool: make pool shrinker NUMA aware
2025-06-30 10:15 ` Christian König
@ 2025-06-30 21:30 ` David Airlie
0 siblings, 0 replies; 52+ messages in thread
From: David Airlie @ 2025-06-30 21:30 UTC (permalink / raw)
To: Christian König
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Simona Vetter,
Dave Chinner, Kairui Song
On Mon, Jun 30, 2025 at 8:15 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 30.06.25 06:49, Dave Airlie wrote:
> > From: Dave Airlie <airlied@redhat.com>
> >
> > This enable NUMA awareness for the shrinker on the
> > ttm pools.
>
>
> Looks good from my side, but the last time I suggested this Sima explicitly told me that it isn't very fruitful. Adding her to comment as well.
I don't think it's going to make anything much better on its own; it
just makes adding memcg support easier if you do the lru conversion
first as a separate stage.
Dave.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 12/17] ttm: add objcg pointer to bo and tt
2025-06-30 10:24 ` Christian König
@ 2025-06-30 21:33 ` David Airlie
2025-07-01 7:22 ` Christian König
0 siblings, 1 reply; 52+ messages in thread
From: David Airlie @ 2025-06-30 21:33 UTC (permalink / raw)
To: Christian König
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
Kairui Song
On Mon, Jun 30, 2025 at 8:24 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 30.06.25 06:49, Dave Airlie wrote:
> > From: Dave Airlie <airlied@redhat.com>
> >
> > This just adds the obj cgroup pointer to the bo and tt structs,
> > and sets it between them.
> >
> > Signed-off-by: Dave Airlie <airlied@redhat.com>
> > ---
> > drivers/gpu/drm/ttm/ttm_tt.c | 1 +
> > include/drm/ttm/ttm_bo.h | 6 ++++++
> > include/drm/ttm/ttm_tt.h | 2 ++
> > 3 files changed, 9 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
> > index 8f38de3b2f1c..0c54d5e2bfdd 100644
> > --- a/drivers/gpu/drm/ttm/ttm_tt.c
> > +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> > @@ -162,6 +162,7 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
> > ttm->caching = caching;
> > ttm->restore = NULL;
> > ttm->backup = NULL;
> > + ttm->objcg = bo->objcg;
> > }
> >
> > int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
> > diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
> > index 099dc2604baa..f26ec0a0273f 100644
> > --- a/include/drm/ttm/ttm_bo.h
> > +++ b/include/drm/ttm/ttm_bo.h
> > @@ -135,6 +135,12 @@ struct ttm_buffer_object {
> > * reservation lock.
> > */
> > struct sg_table *sg;
> > +
> > + /**
> > + * @objcg: object cgroup to charge this to if it ends up using system memory.
> > + * NULL means don't charge.
> > + */
> > + struct obj_cgroup *objcg;
> > };
> >
> > #define TTM_BO_MAP_IOMEM_MASK 0x80
> > diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
> > index 15d4019685f6..c13fea4c2915 100644
> > --- a/include/drm/ttm/ttm_tt.h
> > +++ b/include/drm/ttm/ttm_tt.h
> > @@ -126,6 +126,8 @@ struct ttm_tt {
> > enum ttm_caching caching;
> > /** @restore: Partial restoration from backup state. TTM private */
> > struct ttm_pool_tt_restore *restore;
> > + /** @objcg: Object cgroup for this TT allocation */
> > + struct obj_cgroup *objcg;
> > };
>
> We should probably keep that out of the pool and account the memory to the BO instead.
>
I tried that two or three patch posting iterations ago when you
suggested it, and it didn't work. It has to be done at the pool level;
I think that was due to swap handling.
Dave.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 02/17] drm/ttm: use gpu mm stats to track gpu memory allocations. (v2)
2025-06-30 10:04 ` Christian König
@ 2025-07-01 1:41 ` David Airlie
0 siblings, 0 replies; 52+ messages in thread
From: David Airlie @ 2025-07-01 1:41 UTC (permalink / raw)
To: Christian König
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
Kairui Song, Matthew Brost, Andrew Morton
On Mon, Jun 30, 2025 at 8:04 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 30.06.25 06:49, Dave Airlie wrote:
> > From: Dave Airlie <airlied@redhat.com>
> >
> > This uses the newly introduced per-node gpu tracking stats,
> > to track GPU memory allocated via TTM and reclaimable memory in
> > the TTM page pools.
> >
> > These stats will be useful later for system information and
> > later when mem cgroups are integrated.
> >
> > Cc: Christian Koenig <christian.koenig@amd.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: linux-mm@kvack.org
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Dave Airlie <airlied@redhat.com>
> >
> > ---
> > v2: add reclaim parameters and adjust the right counters.
> > ---
> > drivers/gpu/drm/ttm/ttm_pool.c | 34 ++++++++++++++++++++++++++++------
> > 1 file changed, 28 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > index baf27c70a419..11a5777b4a85 100644
> > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > @@ -131,6 +131,16 @@ static struct list_head shrinker_list;
> > static struct shrinker *mm_shrinker;
> > static DECLARE_RWSEM(pool_shrink_rwsem);
> >
> > +/* helper to get a current valid node id from a pool */
> > +static int ttm_pool_nid(struct ttm_pool *pool) {
> > + int nid = NUMA_NO_NODE;
> > + if (pool)
> > + nid = pool->nid;
> > + if (nid == NUMA_NO_NODE)
> > + nid = numa_node_id();
>
> That isn't correct.
>
> The NUMA node in the pool is just a hint where to allocate from, but the memory can come from somewhere else as well.
>
> You need to look at the allocated page to figure out to which NUMA node that belongs.
Indeed, I've killed this from here; using page_to_nid() on the
allocated page is correct. The helper is still needed later to figure
out which nid to ask the list_lru for pages from, but I've moved it
forward into that patch locally.
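A minimal sketch of that split, assuming the NR_GPU_ACTIVE counter from
patch 1; this is not the exact patch code, just the shape of it:

    /* the pool nid is only an allocation hint */
    p = alloc_pages_node(pool->nid, gfp_flags, order);
    if (p)
            /* account the stat on the node the page actually landed on */
            mod_node_page_state(NODE_DATA(page_to_nid(p)),
                                NR_GPU_ACTIVE, 1 << order);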
Dave.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 10/17] ttm: add a memcg accounting flag to the alloc/populate APIs
2025-06-30 10:20 ` Christian König
@ 2025-07-01 1:46 ` David Airlie
0 siblings, 0 replies; 52+ messages in thread
From: David Airlie @ 2025-07-01 1:46 UTC (permalink / raw)
To: Christian König
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
Kairui Song
On Mon, Jun 30, 2025 at 8:20 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 30.06.25 06:49, Dave Airlie wrote:
> > From: Dave Airlie <airlied@redhat.com>
> >
> > This flag does nothing yet, but this just changes the APIs to accept
> > it in the future across all users.
> >
> > This flag will eventually be filled out with when to account a tt
> > populate to a memcg.
>
> I would keep the pool completely out of memcg accounting and always account at a higher level.
>
> Accounting that low just gives a huge surface for driver issues.
https://lore.kernel.org/dri-devel/20250512061913.3522902-6-airlied@gmail.com/
This is where I already tried it, and had to move it back:
"This version moves back from the resource to the tt layer,
when accounting at the resource layer, if an object is swapped
out there was no way to remove it from the accounting, whereas
the tt layer has more info for this."
Dave.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 12/17] ttm: add objcg pointer to bo and tt
2025-06-30 21:33 ` David Airlie
@ 2025-07-01 7:22 ` Christian König
2025-07-01 8:06 ` David Airlie
0 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2025-07-01 7:22 UTC (permalink / raw)
To: David Airlie
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
Kairui Song
On 30.06.25 23:33, David Airlie wrote:
> On Mon, Jun 30, 2025 at 8:24 PM Christian König
> <christian.koenig@amd.com> wrote:
>>
>> On 30.06.25 06:49, Dave Airlie wrote:
>>> From: Dave Airlie <airlied@redhat.com>
>>>
>>> This just adds the obj cgroup pointer to the bo and tt structs,
>>> and sets it between them.
>>>
>>> Signed-off-by: Dave Airlie <airlied@redhat.com>
>>> ---
>>> drivers/gpu/drm/ttm/ttm_tt.c | 1 +
>>> include/drm/ttm/ttm_bo.h | 6 ++++++
>>> include/drm/ttm/ttm_tt.h | 2 ++
>>> 3 files changed, 9 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
>>> index 8f38de3b2f1c..0c54d5e2bfdd 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_tt.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
>>> @@ -162,6 +162,7 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
>>> ttm->caching = caching;
>>> ttm->restore = NULL;
>>> ttm->backup = NULL;
>>> + ttm->objcg = bo->objcg;
>>> }
>>>
>>> int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
>>> diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
>>> index 099dc2604baa..f26ec0a0273f 100644
>>> --- a/include/drm/ttm/ttm_bo.h
>>> +++ b/include/drm/ttm/ttm_bo.h
>>> @@ -135,6 +135,12 @@ struct ttm_buffer_object {
>>> * reservation lock.
>>> */
>>> struct sg_table *sg;
>>> +
>>> + /**
>>> + * @objcg: object cgroup to charge this to if it ends up using system memory.
>>> + * NULL means don't charge.
>>> + */
>>> + struct obj_cgroup *objcg;
>>> };
>>>
>>> #define TTM_BO_MAP_IOMEM_MASK 0x80
>>> diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
>>> index 15d4019685f6..c13fea4c2915 100644
>>> --- a/include/drm/ttm/ttm_tt.h
>>> +++ b/include/drm/ttm/ttm_tt.h
>>> @@ -126,6 +126,8 @@ struct ttm_tt {
>>> enum ttm_caching caching;
>>> /** @restore: Partial restoration from backup state. TTM private */
>>> struct ttm_pool_tt_restore *restore;
>>> + /** @objcg: Object cgroup for this TT allocation */
>>> + struct obj_cgroup *objcg;
>>> };
>>
>> We should probably keep that out of the pool and account the memory to the BO instead.
>>
>
> I tried that like 2-3 patch posting iterations ago, you suggested it
> then, it didn't work. It has to be done at the pool level, I think it
> was due to swap handling.
When you do it at the pool level the swap/shrink handling is broken as well, just not for amdgpu.
See xe_bo_shrink() and drivers/gpu/drm/xe/xe_shrinker.c on how XE does it.
The problem here is that we don't have generalized handling of this across drivers.
So the best we can do is to do it at the resource level because that is common for everybody.
This doesn't take swapping on amdgpu into account, but that should not be that relevant since we wanted to remove that and switch to the XE approach anyway.
Regards,
Christian.
>
> Dave.
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 12/17] ttm: add objcg pointer to bo and tt
2025-07-01 7:22 ` Christian König
@ 2025-07-01 8:06 ` David Airlie
2025-07-01 8:15 ` Christian König
0 siblings, 1 reply; 52+ messages in thread
From: David Airlie @ 2025-07-01 8:06 UTC (permalink / raw)
To: Christian König
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
Kairui Song
On Tue, Jul 1, 2025 at 5:22 PM Christian König <christian.koenig@amd.com> wrote:
>
> On 30.06.25 23:33, David Airlie wrote:
> > On Mon, Jun 30, 2025 at 8:24 PM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >> On 30.06.25 06:49, Dave Airlie wrote:
> >>> From: Dave Airlie <airlied@redhat.com>
> >>>
> >>> This just adds the obj cgroup pointer to the bo and tt structs,
> >>> and sets it between them.
> >>>
> >>> Signed-off-by: Dave Airlie <airlied@redhat.com>
> >>> ---
> >>> drivers/gpu/drm/ttm/ttm_tt.c | 1 +
> >>> include/drm/ttm/ttm_bo.h | 6 ++++++
> >>> include/drm/ttm/ttm_tt.h | 2 ++
> >>> 3 files changed, 9 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
> >>> index 8f38de3b2f1c..0c54d5e2bfdd 100644
> >>> --- a/drivers/gpu/drm/ttm/ttm_tt.c
> >>> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> >>> @@ -162,6 +162,7 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
> >>> ttm->caching = caching;
> >>> ttm->restore = NULL;
> >>> ttm->backup = NULL;
> >>> + ttm->objcg = bo->objcg;
> >>> }
> >>>
> >>> int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
> >>> diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
> >>> index 099dc2604baa..f26ec0a0273f 100644
> >>> --- a/include/drm/ttm/ttm_bo.h
> >>> +++ b/include/drm/ttm/ttm_bo.h
> >>> @@ -135,6 +135,12 @@ struct ttm_buffer_object {
> >>> * reservation lock.
> >>> */
> >>> struct sg_table *sg;
> >>> +
> >>> + /**
> >>> + * @objcg: object cgroup to charge this to if it ends up using system memory.
> >>> + * NULL means don't charge.
> >>> + */
> >>> + struct obj_cgroup *objcg;
> >>> };
> >>>
> >>> #define TTM_BO_MAP_IOMEM_MASK 0x80
> >>> diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
> >>> index 15d4019685f6..c13fea4c2915 100644
> >>> --- a/include/drm/ttm/ttm_tt.h
> >>> +++ b/include/drm/ttm/ttm_tt.h
> >>> @@ -126,6 +126,8 @@ struct ttm_tt {
> >>> enum ttm_caching caching;
> >>> /** @restore: Partial restoration from backup state. TTM private */
> >>> struct ttm_pool_tt_restore *restore;
> >>> + /** @objcg: Object cgroup for this TT allocation */
> >>> + struct obj_cgroup *objcg;
> >>> };
> >>
> >> We should probably keep that out of the pool and account the memory to the BO instead.
> >>
> >
> > I tried that like 2-3 patch posting iterations ago, you suggested it
> > then, it didn't work. It has to be done at the pool level, I think it
> > was due to swap handling.
>
> When you do it at the pool level the swap/shrink handling is broken as well, just not for amdgpu.
>
> See xe_bo_shrink() and drivers/gpu/drm/xe/xe_shrinker.c on how XE does it.
I've read all of that, but I don't think it needs changing yet, though
I do think I probably need to do a bit more work on the ttm
backup/restore paths to account things. But again we suffer from the
question of what happens if your cgroup runs out of space on a restore
path, similar to eviction.
Blocking the problems we can solve now on the problems we've no idea
how to solve means nobody gets experience with solving anything.
> So the best we can do is to do it at the resource level because that is common for everybody.
>
> This doesn't takes swapping on amdgpu into account, but that should not be that relevant since we wanted to remove that and switch to the XE approach anyway.
I don't understand; we cannot do it at the resource level. I sent
patches to try that, and they don't fundamentally work properly, so it
isn't going to fly. We can solve it at the pool level, so we should. If
we somehow rearchitect things later to solve it at the resource level,
fine, but I feel we'd have to make swap handling operate at the
resource level instead of the tt level to have any chance.
Swapping via the backup/restore paths should be accounted properly,
since moving pages out to swap is one way cgroups can reduce memory
usage; if we can't account for that and swapped pages aren't removed
from the page count, then it isn't going to work properly.
Dave.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 12/17] ttm: add objcg pointer to bo and tt
2025-07-01 8:06 ` David Airlie
@ 2025-07-01 8:15 ` Christian König
2025-07-01 22:11 ` David Airlie
0 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2025-07-01 8:15 UTC (permalink / raw)
To: David Airlie
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
Kairui Song
On 01.07.25 10:06, David Airlie wrote:
> On Tue, Jul 1, 2025 at 5:22 PM Christian König <christian.koenig@amd.com> wrote:
>>>>> diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
>>>>> index 15d4019685f6..c13fea4c2915 100644
>>>>> --- a/include/drm/ttm/ttm_tt.h
>>>>> +++ b/include/drm/ttm/ttm_tt.h
>>>>> @@ -126,6 +126,8 @@ struct ttm_tt {
>>>>> enum ttm_caching caching;
>>>>> /** @restore: Partial restoration from backup state. TTM private */
>>>>> struct ttm_pool_tt_restore *restore;
>>>>> + /** @objcg: Object cgroup for this TT allocation */
>>>>> + struct obj_cgroup *objcg;
>>>>> };
>>>>
>>>> We should probably keep that out of the pool and account the memory to the BO instead.
>>>>
>>>
>>> I tried that like 2-3 patch posting iterations ago, you suggested it
>>> then, it didn't work. It has to be done at the pool level, I think it
>>> was due to swap handling.
>>
>> When you do it at the pool level the swap/shrink handling is broken as well, just not for amdgpu.
>>
>> See xe_bo_shrink() and drivers/gpu/drm/xe/xe_shrinker.c on how XE does it.
>
> I've read all of that, but I don't think it needs changing yet, though
> I do think I probably need to do a bit more work on the ttm
> backup/restore paths to account things, but again we suffer from the
> what happens if your cgroup runs out of space on a restore path,
> similiar to eviction.
My thinking was rather that, because of this, we do it at the resource level and keep memory accounted to whoever allocated it, even if it's backed up or swapped out.
> Blocking the problems we can solve now on the problems we've no idea
> how to solve means nobody gets experience with solving anything.
Well that's exactly the reason why I'm suggesting this. Ignoring swapping/backup for now seems to make things much easier.
>> So the best we can do is to do it at the resource level because that is common for everybody.
>>
>> This doesn't takes swapping on amdgpu into account, but that should not be that relevant since we wanted to remove that and switch to the XE approach anyway.
>
> I don't understand, we cannot do it at the resource level, I sent
> patches to try, they don't fundamentally work properly, so it isn't
> going to fly. We can solve it at the pool level, so we should, if we
> somehow rearchitect things later to solve it at the resource level,
> but I feel we'd have to make swap handling operate at the resource
> level instead of tt level to have any chance.
>
> Swapping via the backup/restore paths should be accounted properly,
> since moving pages out to swap one way cgroups can reduce the memory
> usage, if we can't account that swapped pages aren't removed from the
> page count, then it isn't going to work properly.
But that is only possible if you generalize the shrinker XE has implemented for at least amdgpu as well (and make it memcg aware).
That's exactly what I would try to avoid, because it means tons of additional driver-specific work before we can get at least a foundation for memcg awareness.
Alternatively we could implement a shrinker for the existing swapping path, but I tried that before and Sima, Thomas and I agreed that this is not the right thing to do, and we rather came up with the XE shrinker.
Christian.
>
> Dave.
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 12/17] ttm: add objcg pointer to bo and tt
2025-07-01 8:15 ` Christian König
@ 2025-07-01 22:11 ` David Airlie
2025-07-02 7:27 ` Christian König
0 siblings, 1 reply; 52+ messages in thread
From: David Airlie @ 2025-07-01 22:11 UTC (permalink / raw)
To: Christian König
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
Kairui Song
On Tue, Jul 1, 2025 at 6:16 PM Christian König <christian.koenig@amd.com> wrote:
>
> On 01.07.25 10:06, David Airlie wrote:
> > On Tue, Jul 1, 2025 at 5:22 PM Christian König <christian.koenig@amd.com> wrote:
> >>>>> diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
> >>>>> index 15d4019685f6..c13fea4c2915 100644
> >>>>> --- a/include/drm/ttm/ttm_tt.h
> >>>>> +++ b/include/drm/ttm/ttm_tt.h
> >>>>> @@ -126,6 +126,8 @@ struct ttm_tt {
> >>>>> enum ttm_caching caching;
> >>>>> /** @restore: Partial restoration from backup state. TTM private */
> >>>>> struct ttm_pool_tt_restore *restore;
> >>>>> + /** @objcg: Object cgroup for this TT allocation */
> >>>>> + struct obj_cgroup *objcg;
> >>>>> };
> >>>>
> >>>> We should probably keep that out of the pool and account the memory to the BO instead.
> >>>>
> >>>
> >>> I tried that like 2-3 patch posting iterations ago, you suggested it
> >>> then, it didn't work. It has to be done at the pool level, I think it
> >>> was due to swap handling.
> >>
> >> When you do it at the pool level the swap/shrink handling is broken as well, just not for amdgpu.
> >>
> >> See xe_bo_shrink() and drivers/gpu/drm/xe/xe_shrinker.c on how XE does it.
> >
> > I've read all of that, but I don't think it needs changing yet, though
> > I do think I probably need to do a bit more work on the ttm
> > backup/restore paths to account things, but again we suffer from the
> > what happens if your cgroup runs out of space on a restore path,
> > similiar to eviction.
>
> My thinking was rather that because of this we do it at the resource level and keep memory accounted to whoever allocated it even if it's backed up or swapped out.
>
> > Blocking the problems we can solve now on the problems we've no idea
> > how to solve means nobody gets experience with solving anything.
>
> Well that's exactly the reason why I'm suggesting this. Ignoring swapping/backup for now seems to make things much easier.
It makes it easier now, but when we have to solve swapping, step one
will be moving all this code around to what I have now, and starting
from there.
This just raises the bar to solving the next problem.
We need to find incremental approaches to getting all the pieces of
the puzzle solved, or else we will still be here in 10 years.
The steps I've formulated (none of them are perfect, but they all seem
better than the status quo) are:
1. add global counters for pages - now we can at least see things in
vmstat and per-node
2. add numa to the pool lru - we can remove our own numa code and
align with core kernel - probably doesn't help anything
3. add memcg awareness to the pool and pool shrinker (a rough sketch
follows after this list).
if you are on an APU with no swap configured - you have a lot better time.
if you are on a dGPU or APU with swap - you have a moderately
better time, but I can't see you having a worse time.
4. look into tt level swapping and seeing how to integrate that lru
with numa/memcg awareness
in theory we can do better than allocated_pages tracking (I'd
like to burn that down, since it seems at odds with memcg)
5. look into xe swapping and see if we can integrate that numa/memcg better.
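A minimal sketch of what steps 2 and 3 could look like on top of the core
list_lru and shrinker infrastructure (names here are illustrative, not
necessarily what the series ends up calling things):

#include <linux/list_lru.h>
#include <linux/shrinker.h>

static struct list_lru pool_lru;	/* replaces TTM's hand-rolled pool lists */
static struct shrinker *pool_shrinker;

static unsigned long pool_count(struct shrinker *shrink,
				struct shrink_control *sc)
{
	/* NUMA + memcg aware: sc->nid and sc->memcg select one sub-list */
	return list_lru_shrink_count(&pool_lru, sc);
}

static unsigned long pool_scan(struct shrinker *shrink,
			       struct shrink_control *sc)
{
	/* the real version walks the targeted sub-list with
	 * list_lru_shrink_walk() and frees pages belonging to exactly
	 * the cgroup/node the core MM asked about; elided here */
	return SHRINK_STOP;
}

static int pool_shrinker_init(void)
{
	int err;

	pool_shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE,
				       "drm-ttm-pool");
	if (!pool_shrinker)
		return -ENOMEM;

	/* memcg mode: one internal list per (node, memcg) pair */
	err = list_lru_init_memcg(&pool_lru, pool_shrinker);
	if (err) {
		shrinker_free(pool_shrinker);
		return err;
	}

	pool_shrinker->count_objects = pool_count;
	pool_shrinker->scan_objects = pool_scan;
	shrinker_register(pool_shrinker);
	return 0;
}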
So the question I really want answered when I'm submitting patches
isn't "what does this not fix or not make better?", but "what does this
actively make worse than the status quo, and is it heading in a
consistent direction towards solving the problem?"
Accounting at the resource level makes stuff better, but I don't
believe after implementing it that it is consistent with solving the
overall problem.
Dave.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
` (16 preceding siblings ...)
2025-06-30 4:49 ` [PATCH 17/17] amdgpu: add support for memory cgroups Dave Airlie
@ 2025-07-01 23:26 ` Balbir Singh
17 siblings, 0 replies; 52+ messages in thread
From: Balbir Singh @ 2025-07-01 23:26 UTC (permalink / raw)
To: Dave Airlie, dri-devel, linux-mm, Johannes Weiner,
Christian Koenig
Cc: Dave Chinner, Kairui Song
On 6/30/25 14:49, Dave Airlie wrote:
> Hi all,
>
> tl;dr: start using list_lru/numa/memcg in GPU driver core and amdgpu driver for now.
>
> This is a complete series of patches, some of which have been sent before and reviewed,
> but I want to get the complete picture for others, and try to figure out how best to land this.
>
> There are 3 pieces to this:
> 01->02: add support for global gpu stat counters (previously posted, patch 2 is newer)
> 03->07: port ttm pools to list_lru for numa awareness
> 08->14: add memcg stats + gpu apis, then port ttm pools to memcg aware list_lru and shrinker
> 15->17: enable amdgpu to use new functionality.
>
> The biggest difference in the memcg code from previously is I discovered what
> obj cgroups were designed for and I'm reusing the page/objcg intergration that
> already exists, to avoid reinventing that wheel right now.
>
> There are some igt-gpu-tools tests I've written at:
> https://gitlab.freedesktop.org/airlied/igt-gpu-tools/-/tree/amdgpu-cgroups?ref_type=heads
>
> One problem is there are a lot of delayed action, that probably means the testing
> needs a bit more robustness, but the tests validate all the basic paths.
>
Hi, Dave
memcg is designed to treat memory (rss and page cache) as a single entity, so that users
don't need to worry about the distinction between memory types and only need to think about
their overall memory utilization, with the ability to overcommit via swap as needed.
How does dmem fit into the picture? Is the cgroup integration designed to overcommit dmem,
limit it, or both? Is the programmer expected to know how much dmem the program will need?
Maybe this was answered, but I missed it.
Balbir Singh
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 12/17] ttm: add objcg pointer to bo and tt
2025-07-01 22:11 ` David Airlie
@ 2025-07-02 7:27 ` Christian König
2025-07-02 7:57 ` David Airlie
0 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2025-07-02 7:27 UTC (permalink / raw)
To: David Airlie
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
Kairui Song
On 02.07.25 00:11, David Airlie wrote:
> On Tue, Jul 1, 2025 at 6:16 PM Christian König <christian.koenig@amd.com> wrote:
>>
>> On 01.07.25 10:06, David Airlie wrote:
>>> On Tue, Jul 1, 2025 at 5:22 PM Christian König <christian.koenig@amd.com> wrote:
>>>>>>> diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
>>>>>>> index 15d4019685f6..c13fea4c2915 100644
>>>>>>> --- a/include/drm/ttm/ttm_tt.h
>>>>>>> +++ b/include/drm/ttm/ttm_tt.h
>>>>>>> @@ -126,6 +126,8 @@ struct ttm_tt {
>>>>>>> enum ttm_caching caching;
>>>>>>> /** @restore: Partial restoration from backup state. TTM private */
>>>>>>> struct ttm_pool_tt_restore *restore;
>>>>>>> + /** @objcg: Object cgroup for this TT allocation */
>>>>>>> + struct obj_cgroup *objcg;
>>>>>>> };
>>>>>>
>>>>>> We should probably keep that out of the pool and account the memory to the BO instead.
>>>>>>
>>>>>
>>>>> I tried that like 2-3 patch posting iterations ago, you suggested it
>>>>> then, it didn't work. It has to be done at the pool level, I think it
>>>>> was due to swap handling.
>>>>
>>>> When you do it at the pool level the swap/shrink handling is broken as well, just not for amdgpu.
>>>>
>>>> See xe_bo_shrink() and drivers/gpu/drm/xe/xe_shrinker.c on how XE does it.
>>>
>>> I've read all of that, but I don't think it needs changing yet, though
>>> I do think I probably need to do a bit more work on the ttm
>>> backup/restore paths to account things, but again we suffer from the
>>> what happens if your cgroup runs out of space on a restore path,
>>> similiar to eviction.
>>
>> My thinking was rather that because of this we do it at the resource level and keep memory accounted to whoever allocated it even if it's backed up or swapped out.
>>
>>> Blocking the problems we can solve now on the problems we've no idea
>>> how to solve means nobody gets experience with solving anything.
>>
>> Well that's exactly the reason why I'm suggesting this. Ignoring swapping/backup for now seems to make things much easier.
>
> It makes it easier now, but when we have to solve swapping, step one
> will be moving all this code around to what I have now, and starting
> from there.
>
> This just raises the bar to solving the next problem.
>
> We need to find incremental approaches to getting all the pieces of
> the puzzle solved, or else we will still be here in 10 years.
>
> The steps I've formulated (none of them are perfect, but they all seem
> better than status quo)
>
> 1. add global counters for pages - now we can at least see things in
> vmstat and per-node
> 2. add numa to the pool lru - we can remove our own numa code and
> align with core kernel - probably doesn't help anything
So far no objections from my side to that.
> 3. add memcg awareness to the pool and pool shrinker.
> if you are on a APU with no swap configured - you have a lot better time.
> if you are on a dGPU or APU with swap - you have a moderately
> better time, but I can't see you having a worse time.
Well that's what I'm strongly disagreeing on.
Adding memcg to the pool has no value at all and complicates things massively when moving forward.
What exactly should be the benefit of that?
> 4. look into tt level swapping and seeing how to integrate that lru
> with numa/memcg awareness
> in theory we can do better than allocated_pages tracking, (I'd
> like to burn that down, since it seems at odds with memcg)
> 5. look into xe swapping and see if we can integrate that numa/memcg better.
>
> So the question I really want answered when I'm submitting patches
> isn't, what does this not fix or not make better, but what does this
> actively make worse than the status quo and is it heading in a
> consistent direction to solve the problem.
>
> Accounting at the resource level makes stuff better, but I don't
> believe after implementing it that it is consistent with solving the
> overall problem.
Exactly, that's my point. See, accounting is no problem at all; that can be done at any level.
What is tricky is shrinking, e.g. either the core MM or memcg asking to reduce memory usage and moving things into swap.
And that can only be done at the resource level or the tt object level, but not at the pool level.
The whole TTM pool exists to aid a 28-year-old HW design which has no practical relevance on modern systems, and we should really not touch that in any way possible.
Christian.
>
> Dave.
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 12/17] ttm: add objcg pointer to bo and tt
2025-07-02 7:27 ` Christian König
@ 2025-07-02 7:57 ` David Airlie
2025-07-02 8:24 ` Christian König
0 siblings, 1 reply; 52+ messages in thread
From: David Airlie @ 2025-07-02 7:57 UTC (permalink / raw)
To: Christian König
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
Kairui Song
> >
> > It makes it easier now, but when we have to solve swapping, step one
> > will be moving all this code around to what I have now, and starting
> > from there.
> >
> > This just raises the bar to solving the next problem.
> >
> > We need to find incremental approaches to getting all the pieces of
> > the puzzle solved, or else we will still be here in 10 years.
> >
> > The steps I've formulated (none of them are perfect, but they all seem
> > better than status quo)
> >
> > 1. add global counters for pages - now we can at least see things in
> > vmstat and per-node
> > 2. add numa to the pool lru - we can remove our own numa code and
> > align with core kernel - probably doesn't help anything
>
> So far no objections from my side to that.
>
> > 3. add memcg awareness to the pool and pool shrinker.
> > if you are on a APU with no swap configured - you have a lot better time.
> > if you are on a dGPU or APU with swap - you have a moderately
> > better time, but I can't see you having a worse time.
>
> Well that's what I'm strongly disagreeing on.
>
> Adding memcg to the pool has no value at all and complicates things massively when moving forward.
>
> What exactly should be the benefit of that?
I'm already showing the benefit of moving the pool to memcg; we've
even talked about it multiple times on the list. It's not an
OMG-change-the-world benefit, but it definitely provides better
alignment between the pool and memcg allocations.
We expose a userspace API to allocate write-combined memory, and we do
this for all currently supported CPUs/GPUs. We might decide in the
future that we don't want to continue doing this, but we do it now. My
Fedora 42 desktop uses it, even if you say there is no need.
If I allocate 100% of my memcg budget to WC memory, free it, then
allocate 100% of my budget to non-WC memory, we break container
containment, as we can force other cgroups to run out of memory budget
and have to call the global shrinker. With this in place, the
container that allocates the WC memory also pays the price to switch
it back. Again, this is just correctness; it's not going to fix any
major workloads, but I also don't think it should cause any
regressions, since it won't be worse than the current worst-case
expectation for most workloads.
I'm not just adding memcg awareness to the pool though; that is just
completeness. I'm adding memcg awareness to all GPU system memory
allocations, and making sure swapout works (which it does); swapin
probably needs more work.
The future work is integrating ttm swap mechanisms with memcg to get it right.
> >
> > Accounting at the resource level makes stuff better, but I don't
> > believe after implementing it that it is consistent with solving the
> > overall problem.
>
> Exactly that's my point. See accounting is no problem at all, that can be done on any possible level.
>
> What is tricky is shrinking, e.g. either core MM or memcg asking to reduce the usage of memory and moving things into swap.
>
> And that can only be done either on the resource level or the tt object, but not the pool level.
I understand we have to add more code to the tt level and that's fine;
I just don't see why you think starting at the bottom level is wrong.
It clearly has a use, and it's just cleaning up and preparing the
levels, so we can move up and solve the next problem.
> The whole TTM pool is to aid a 28 year old HW design which has no practical relevance on modern systems and we should really not touch that in any way possible.
Modern systems are still using it. I'm still seeing WC allocations,
they still seem to have some cost associated with them on x86-64, and
they certainly aren't free. I don't care if they aren't practical, but
if they are a way to route around container containment, they need to
be fixed.
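For reference, the cost in question is the page attribute change itself;
a rough sketch of the x86 path only (using the existing set_pages_array_*
helpers), to show why the pool keeps freed WC pages around rather than
converting them back on every free:

#include <asm/set_memory.h>

/* flipping the kernel mapping to WC (and back to WB on release) splits
 * large mappings and flushes TLBs, which is the cost the pool amortises */
static int pages_make_wc(struct page **pages, unsigned int count)
{
	return set_pages_array_wc(pages, count);
}

static int pages_make_cached_again(struct page **pages, unsigned int count)
{
	return set_pages_array_wb(pages, count);
}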
Dave.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 12/17] ttm: add objcg pointer to bo and tt
2025-07-02 7:57 ` David Airlie
@ 2025-07-02 8:24 ` Christian König
2025-07-03 5:53 ` David Airlie
0 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2025-07-02 8:24 UTC (permalink / raw)
To: David Airlie
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
Kairui Song
On 02.07.25 09:57, David Airlie wrote:
>>>
>>> It makes it easier now, but when we have to solve swapping, step one
>>> will be moving all this code around to what I have now, and starting
>>> from there.
>>>
>>> This just raises the bar to solving the next problem.
>>>
>>> We need to find incremental approaches to getting all the pieces of
>>> the puzzle solved, or else we will still be here in 10 years.
>>>
>>> The steps I've formulated (none of them are perfect, but they all seem
>>> better than status quo)
>>>
>>> 1. add global counters for pages - now we can at least see things in
>>> vmstat and per-node
>>> 2. add numa to the pool lru - we can remove our own numa code and
>>> align with core kernel - probably doesn't help anything
>>
>> So far no objections from my side to that.
>>
>>> 3. add memcg awareness to the pool and pool shrinker.
>>> if you are on a APU with no swap configured - you have a lot better time.
>>> if you are on a dGPU or APU with swap - you have a moderately
>>> better time, but I can't see you having a worse time.
>>
>> Well that's what I'm strongly disagreeing on.
>>
>> Adding memcg to the pool has no value at all and complicates things massively when moving forward.
>>
>> What exactly should be the benefit of that?
>
> I'm already showing the benefit of the pool moving to memcg, we've
> even talked about it multiple times on the list, it's not a OMG change
> the world benefit, but it definitely provides better alignment between
> the pool and memcg allocations.
>
> We expose userspace API to allocate write combined memory, we do this
> for all currently supported CPU/GPUs. We might think in the future we
> don't want to continue to do this, but we do it now. My Fedora 42
> desktop uses it, even if you say there is no need.
>
> If I allocate 100% of my memcg budget to WC memory, free it, then
> allocate 100% of my budget to non-WC memory, we break container
> containment as we can force other cgroups to run out of memory budget
> and have to call the global shrinker.
Yeah and that is perfectly intentional.
> With this in place, the
> container that allocates the WC memory also pays the price to switch
> it back. Again this is just correctness, it's not going to fix any
> major workloads, but I also don't think it should cause any
> regressions, since it won't be worse than current worst case
> expectation for most workloads.
No, this is not correct behavior any more.
Memory which is used by your cgroup is no longer used for allocations by another cgroup, nor given back to the core memory management for the page pool. E.g. one cgroup can't steal the memory from another cgroup any more.
In other words, that is reserving the memory for the cgroup and not giving it back to the global pool as soon as you free it.
That would only be acceptable if we had a per-cgroup limit on the pool size which is *much* lower than the current global limit we have.
Maybe we could register a memcg aware shrinker, but not make the LRU memcg aware or something like that.
As far as I can see that would give us the benefit of both approaches; the only problem is that we would have to do per-cgroup counter tracking on our own.
That's why I asked if we could have TTM pool specific variables in the cgroup.
Another alternative would be to change the LRU so that we track per memcg, but allow stealing of pages between cgroups.
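A rough sketch of that first alternative (a memcg-aware shrinker over a
non-memcg-aware LRU), purely illustrative: ttm_pool_memcg_pages() is
hypothetical and stands in for the per-cgroup counter tracking TTM would
have to do on its own.

/* hypothetical: a counter TTM would maintain itself per (memcg, nid) */
static unsigned long ttm_pool_memcg_pages(struct mem_cgroup *memcg, int nid);

static unsigned long pool_count_for_memcg(struct shrinker *shrink,
					  struct shrink_control *sc)
{
	/* the shrinker is registered SHRINKER_MEMCG_AWARE, so sc->memcg
	 * tells us which cgroup to answer for, even though the pool's
	 * LRU itself stays global */
	return sc->memcg ? ttm_pool_memcg_pages(sc->memcg, sc->nid) : 0;
}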
> I'm not just adding memcg awareness to the pool though, that is just
> completeness, I'm adding memcg awareness to all GPU system memory
> allocations, and making sure swapout works (which it does), swapin
> probably needs more work.
>
> The future work is integerating ttm swap mechanisms with memcg to get it right.
>>>
>>> Accounting at the resource level makes stuff better, but I don't
>>> believe after implementing it that it is consistent with solving the
>>> overall problem.
>>
>> Exactly that's my point. See accounting is no problem at all, that can be done on any possible level.
>>
>> What is tricky is shrinking, e.g. either core MM or memcg asking to reduce the usage of memory and moving things into swap.
>>
>> And that can only be done either on the resource level or the tt object, but not the pool level.
>
> I understand we have to add more code to the tt level and that's fine,
> I just don't see why you think starting at the bottom level is wrong?
> it clearly has a use, and it's just cleaning up and preparing the
> levels, so we can move up and solve the next problem.
Because we don't have the necessary functionality to implement a memcg aware shrinker which moves BOs into swap there.
>> The whole TTM pool is to aid a 28 year old HW design which has no practical relevance on modern systems and we should really not touch that in any way possible.
>
> Modern systems are still using it, I'm still seeing WC allocations,
> they still seem to have some cost associated with them on x86-64, they
> certainly aren't free. I don't care if they aren't practical, but if
> they are a way to route around container containment, they need to be
> fixed.
Completely agree with that, but we can't regress anything existing while doing so.
And I'm pretty sure that we do have use cases which rely on that behavior, so those use cases will regress.
Christian.
>
> Dave.
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 17/17] amdgpu: add support for memory cgroups
2025-06-30 4:49 ` [PATCH 17/17] amdgpu: add support for memory cgroups Dave Airlie
@ 2025-07-02 16:02 ` Shakeel Butt
2025-07-03 2:53 ` David Airlie
0 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2025-07-02 16:02 UTC (permalink / raw)
To: Dave Airlie
Cc: dri-devel, linux-mm, Johannes Weiner, Christian Koenig,
Dave Chinner, Kairui Song, Dave Airlie
On Mon, Jun 30, 2025 at 02:49:36PM +1000, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> This adds support for adding a obj cgroup to a buffer object,
> and passing in the placement flags to make sure it's accounted
> properly.
>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 13 +++++++++----
> drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 ++
> 4 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> index e5e33a68d935..d250183bab03 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> @@ -198,6 +198,7 @@ static void amdgpu_gem_object_free(struct drm_gem_object *gobj)
> struct amdgpu_bo *aobj = gem_to_amdgpu_bo(gobj);
>
> amdgpu_hmm_unregister(aobj);
> + obj_cgroup_put(aobj->tbo.objcg);
> ttm_bo_put(&aobj->tbo);
> }
>
> @@ -225,6 +226,7 @@ int amdgpu_gem_object_create(struct amdgpu_device *adev, unsigned long size,
> bp.domain = initial_domain;
> bp.bo_ptr_size = sizeof(struct amdgpu_bo);
> bp.xcp_id_plus1 = xcp_id_plus1;
> + bp.objcg = get_obj_cgroup_from_current();
In what context is this function called? Is that the same for
ttm_pool_alloc_page()? Is remote charging happening in
ttm_pool_alloc_page()?
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 08/17] memcg: add support for GPU page counters.
2025-06-30 4:49 ` [PATCH 08/17] memcg: add support for GPU page counters Dave Airlie
@ 2025-07-02 16:06 ` Shakeel Butt
2025-07-03 5:43 ` David Airlie
0 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2025-07-02 16:06 UTC (permalink / raw)
To: Dave Airlie
Cc: dri-devel, linux-mm, Johannes Weiner, Christian Koenig,
Dave Chinner, Kairui Song, Dave Airlie
On Mon, Jun 30, 2025 at 02:49:27PM +1000, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> This introduces 2 new statistics and 3 new memcontrol APIs for dealing
> with GPU system memory allocations.
>
> The stats corresponds to the same stats in the global vmstat,
> for number of active GPU pages, and number of pages in pools that
> can be reclaimed.
>
> The first API charges a order of pages to a objcg, and sets
> the objcg on the pages like kmem does, and updates the active/reclaim
> statistic.
>
> The second API uncharges a page from the obj cgroup it is currently charged
> to.
>
> The third API allows moving a page to/from reclaim and between obj cgroups.
> When pages are added to the pool lru, this just updates accounting.
> When pages are being removed from a pool lru, they can be taken from
> the parent objcg so this allows them to be uncharged from there and transferred
> to a new child objcg.
>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> ---
> Documentation/admin-guide/cgroup-v2.rst | 6 ++
> include/linux/memcontrol.h | 14 +++
> mm/memcontrol.c | 110 ++++++++++++++++++++++++
> 3 files changed, 130 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 0cc35a14afbe..d6f057c4fe2e 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1542,6 +1542,12 @@ The following nested keys are defined.
> vmalloc (npn)
> Amount of memory used for vmap backed memory.
>
> + gpu (npn)
> + Amount of system memory used for GPU devices.
> +
> + gpu_reclaim (npn)
> + Amount of system memory cached for GPU devices.
> +
> shmem
> Amount of cached filesystem data that is swap-backed,
> such as tmpfs, shm segments, shared anonymous mmap()s
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 87b6688f124a..ff82d603910d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -36,6 +36,8 @@ enum memcg_stat_item {
> MEMCG_SOCK,
> MEMCG_PERCPU_B,
> MEMCG_VMALLOC,
> + MEMCG_GPU,
> + MEMCG_GPU_RECLAIM,
You already added node-level counters, i.e. GPUActive & GPUReclaim; just
use those instead of these. Add them to memcg_node_stat_items[].
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 02/17] drm/ttm: use gpu mm stats to track gpu memory allocations. (v2)
2025-06-30 4:49 ` [PATCH 02/17] drm/ttm: use gpu mm stats to track gpu memory allocations. (v2) Dave Airlie
2025-06-30 10:04 ` Christian König
@ 2025-07-02 16:08 ` Shakeel Butt
1 sibling, 0 replies; 52+ messages in thread
From: Shakeel Butt @ 2025-07-02 16:08 UTC (permalink / raw)
To: Dave Airlie
Cc: dri-devel, linux-mm, Johannes Weiner, Christian Koenig,
Dave Chinner, Kairui Song, Dave Airlie, Matthew Brost,
Andrew Morton
On Mon, Jun 30, 2025 at 02:49:21PM +1000, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> This uses the newly introduced per-node gpu tracking stats,
> to track GPU memory allocated via TTM and reclaimable memory in
> the TTM page pools.
>
> These stats will be useful later for system information and
> later when mem cgroups are integrated.
>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: linux-mm@kvack.org
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
>
> ---
> v2: add reclaim parameters and adjust the right counters.
> ---
> drivers/gpu/drm/ttm/ttm_pool.c | 34 ++++++++++++++++++++++++++++------
> 1 file changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index baf27c70a419..11a5777b4a85 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -131,6 +131,16 @@ static struct list_head shrinker_list;
> static struct shrinker *mm_shrinker;
> static DECLARE_RWSEM(pool_shrink_rwsem);
>
> +/* helper to get a current valid node id from a pool */
> +static int ttm_pool_nid(struct ttm_pool *pool) {
> + int nid = NUMA_NO_NODE;
> + if (pool)
> + nid = pool->nid;
> + if (nid == NUMA_NO_NODE)
> + nid = numa_node_id();
> + return nid;
> +}
> +
> /* Allocate pages of size 1 << order with the given gfp_flags */
> static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> unsigned int order)
> @@ -150,8 +160,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
>
> if (!pool->use_dma_alloc) {
> p = alloc_pages_node(pool->nid, gfp_flags, order);
> - if (p)
> + if (p) {
> p->private = order;
> + mod_node_page_state(NODE_DATA(ttm_pool_nid(pool)), NR_GPU_ACTIVE, (1 << order));
Use mod_lruvec_page_state(), it will handle both cases of charged and
uncharged pages.
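For illustration, the suggested change in ttm_pool_alloc_page() would look
roughly like this (sketch only; it also relies on NR_GPU_ACTIVE being one
of the node stats that memcg mirrors, see the separate comment on patch 08
about memcg_node_stat_items[]):

	if (p) {
		p->private = order;
		/* updates the node counter and, when the page has a
		 * memcg/objcg attached, the matching memcg counter too */
		mod_lruvec_page_state(p, NR_GPU_ACTIVE, 1 << order);
	}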
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 13/17] ttm/pool: enable memcg tracking and shrinker.
2025-06-30 4:49 ` [PATCH 13/17] ttm/pool: enable memcg tracking and shrinker Dave Airlie
2025-06-30 10:23 ` Christian König
2025-06-30 11:59 ` kernel test robot
@ 2025-07-02 16:41 ` Shakeel Butt
2 siblings, 0 replies; 52+ messages in thread
From: Shakeel Butt @ 2025-07-02 16:41 UTC (permalink / raw)
To: Dave Airlie
Cc: dri-devel, linux-mm, Johannes Weiner, Christian Koenig,
Dave Chinner, Kairui Song, Dave Airlie
On Mon, Jun 30, 2025 at 02:49:32PM +1000, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> This enables all the backend code to use the list lru in memcg mode,
> and set the shrinker to be memcg aware.
>
> It adds the loop case for when pooled pages end up being reparented
> to a higher memcg group, that newer memcg can search for them there
> and take them back.
>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> ---
> drivers/gpu/drm/ttm/ttm_pool.c | 123 ++++++++++++++++++++++++++++-----
> 1 file changed, 105 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index 210f4ac4de67..49e92f40ab23 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -143,7 +143,9 @@ static int ttm_pool_nid(struct ttm_pool *pool) {
> }
>
> /* Allocate pages of size 1 << order with the given gfp_flags */
> -static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> +static struct page *ttm_pool_alloc_page(struct ttm_pool *pool,
> + struct obj_cgroup *objcg,
> + gfp_t gfp_flags,
> unsigned int order)
> {
> unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
> @@ -163,6 +165,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> p = alloc_pages_node(pool->nid, gfp_flags, order);
Here I am wondering if we should introduce something like
__GFP_ACCOUNT_NOKMEM to avoid the kmem counters but still charge to the
memcg; along with set_active_memcg(), we could then avoid introducing
gpu specific memcg charging interfaces.
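A sketch of that idea (note: __GFP_ACCOUNT_NOKMEM does not exist today, it
is the flag being proposed here; set_active_memcg() is the existing remote
charging helper):

	struct mem_cgroup *old_memcg;

	/* charge the pages to the BO's cgroup rather than to whichever
	 * task happens to trigger the population */
	old_memcg = set_active_memcg(memcg);	/* memcg: the target cgroup */
	p = alloc_pages_node(pool->nid,
			     gfp_flags | __GFP_ACCOUNT_NOKMEM, /* proposed flag */
			     order);
	set_active_memcg(old_memcg);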
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 17/17] amdgpu: add support for memory cgroups
2025-07-02 16:02 ` Shakeel Butt
@ 2025-07-03 2:53 ` David Airlie
2025-07-03 17:58 ` Shakeel Butt
0 siblings, 1 reply; 52+ messages in thread
From: David Airlie @ 2025-07-03 2:53 UTC (permalink / raw)
To: Shakeel Butt
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner,
Christian Koenig, Dave Chinner, Kairui Song
On Thu, Jul 3, 2025 at 2:03 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Jun 30, 2025 at 02:49:36PM +1000, Dave Airlie wrote:
> > From: Dave Airlie <airlied@redhat.com>
> >
> > This adds support for adding a obj cgroup to a buffer object,
> > and passing in the placement flags to make sure it's accounted
> > properly.
> >
> > Signed-off-by: Dave Airlie <airlied@redhat.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 ++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 13 +++++++++----
> > drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 +
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 ++
> > 4 files changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> > index e5e33a68d935..d250183bab03 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> > @@ -198,6 +198,7 @@ static void amdgpu_gem_object_free(struct drm_gem_object *gobj)
> > struct amdgpu_bo *aobj = gem_to_amdgpu_bo(gobj);
> >
> > amdgpu_hmm_unregister(aobj);
> > + obj_cgroup_put(aobj->tbo.objcg);
> > ttm_bo_put(&aobj->tbo);
> > }
> >
> > @@ -225,6 +226,7 @@ int amdgpu_gem_object_create(struct amdgpu_device *adev, unsigned long size,
> > bp.domain = initial_domain;
> > bp.bo_ptr_size = sizeof(struct amdgpu_bo);
> > bp.xcp_id_plus1 = xcp_id_plus1;
> > + bp.objcg = get_obj_cgroup_from_current();
>
> In what context this function is called? Is that the same for
> ttm_pool_alloc_page()? Is remote charging happening in
> ttm_pool_alloc_page()?
>
No, this function is called from userspace ioctl paths that allocate
GPU objects (GEM objects).
The objects are lazily allocated, so this might not trigger any pages
being bound to the object until it is populated, either via mapping +
page faults or by being used in a GPU command submission, which is
when the ttm_pool_alloc_page() happens.
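To sketch the flow (the objcg field and helpers are the ones from the
series' patches; the rest is shorthand):

	/* ioctl, in the allocating task's context: */
	bp.objcg = get_obj_cgroup_from_current();	/* pinned on the BO */

	/* later, page fault or command submission, possibly another task: */
	/* ttm_tt_populate() -> ttm_pool_alloc_page(pool, tt->objcg, ...) */
	/* pages get charged to the BO's stored objcg, not to 'current' */

	/* final unref, amdgpu_gem_object_free(): */
	obj_cgroup_put(aobj->tbo.objcg);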
Dave.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 08/17] memcg: add support for GPU page counters.
2025-07-02 16:06 ` Shakeel Butt
@ 2025-07-03 5:43 ` David Airlie
0 siblings, 0 replies; 52+ messages in thread
From: David Airlie @ 2025-07-03 5:43 UTC (permalink / raw)
To: Shakeel Butt
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner,
Christian Koenig, Dave Chinner, Kairui Song
On Thu, Jul 3, 2025 at 2:06 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Jun 30, 2025 at 02:49:27PM +1000, Dave Airlie wrote:
> > From: Dave Airlie <airlied@redhat.com>
> >
> > This introduces 2 new statistics and 3 new memcontrol APIs for dealing
> > with GPU system memory allocations.
> >
> > The stats corresponds to the same stats in the global vmstat,
> > for number of active GPU pages, and number of pages in pools that
> > can be reclaimed.
> >
> > The first API charges a order of pages to a objcg, and sets
> > the objcg on the pages like kmem does, and updates the active/reclaim
> > statistic.
> >
> > The second API uncharges a page from the obj cgroup it is currently charged
> > to.
> >
> > The third API allows moving a page to/from reclaim and between obj cgroups.
> > When pages are added to the pool lru, this just updates accounting.
> > When pages are being removed from a pool lru, they can be taken from
> > the parent objcg so this allows them to be uncharged from there and transferred
> > to a new child objcg.
> >
> > Signed-off-by: Dave Airlie <airlied@redhat.com>
> > ---
> > Documentation/admin-guide/cgroup-v2.rst | 6 ++
> > include/linux/memcontrol.h | 14 +++
> > mm/memcontrol.c | 110 ++++++++++++++++++++++++
> > 3 files changed, 130 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 0cc35a14afbe..d6f057c4fe2e 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1542,6 +1542,12 @@ The following nested keys are defined.
> > vmalloc (npn)
> > Amount of memory used for vmap backed memory.
> >
> > + gpu (npn)
> > + Amount of system memory used for GPU devices.
> > +
> > + gpu_reclaim (npn)
> > + Amount of system memory cached for GPU devices.
> > +
> > shmem
> > Amount of cached filesystem data that is swap-backed,
> > such as tmpfs, shm segments, shared anonymous mmap()s
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 87b6688f124a..ff82d603910d 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -36,6 +36,8 @@ enum memcg_stat_item {
> > MEMCG_SOCK,
> > MEMCG_PERCPU_B,
> > MEMCG_VMALLOC,
> > + MEMCG_GPU,
> > + MEMCG_GPU_RECLAIM,
>
> You already added node level counters i.e. GPUActive & GPUReclaim, just
> use those instead of these. Add them to memcg_node_stat_items[].
>
Thanks for the pointer, yes I did mess that up. I've rewritten it
locally today and it makes more sense now.
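For reference, roughly what that ends up looking like (sketch; assuming
the reclaim counter is named NR_GPU_RECLAIM to match NR_GPU_ACTIVE): the
MEMCG_GPU* items go away and the node counters from patch 01 simply join
the set of node stats that memcg mirrors:

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
 static const unsigned int memcg_node_stat_items[] = {
 	/* ... existing node stats tracked per memcg ... */
+	NR_GPU_ACTIVE,
+	NR_GPU_RECLAIM,
 };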
Dave.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 12/17] ttm: add objcg pointer to bo and tt
2025-07-02 8:24 ` Christian König
@ 2025-07-03 5:53 ` David Airlie
0 siblings, 0 replies; 52+ messages in thread
From: David Airlie @ 2025-07-03 5:53 UTC (permalink / raw)
To: Christian König
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
Kairui Song
On Wed, Jul 2, 2025 at 6:24 PM Christian König <christian.koenig@amd.com> wrote:
>
> On 02.07.25 09:57, David Airlie wrote:
> >>>
> >>> It makes it easier now, but when we have to solve swapping, step one
> >>> will be moving all this code around to what I have now, and starting
> >>> from there.
> >>>
> >>> This just raises the bar to solving the next problem.
> >>>
> >>> We need to find incremental approaches to getting all the pieces of
> >>> the puzzle solved, or else we will still be here in 10 years.
> >>>
> >>> The steps I've formulated (none of them are perfect, but they all seem
> >>> better than status quo)
> >>>
> >>> 1. add global counters for pages - now we can at least see things in
> >>> vmstat and per-node
> >>> 2. add numa to the pool lru - we can remove our own numa code and
> >>> align with core kernel - probably doesn't help anything
> >>
> >> So far no objections from my side to that.
> >>
> >>> 3. add memcg awareness to the pool and pool shrinker.
> >>> if you are on a APU with no swap configured - you have a lot better time.
> >>> if you are on a dGPU or APU with swap - you have a moderately
> >>> better time, but I can't see you having a worse time.
> >>
> >> Well that's what I'm strongly disagreeing on.
> >>
> >> Adding memcg to the pool has no value at all and complicates things massively when moving forward.
> >>
> >> What exactly should be the benefit of that?
> >
> > I'm already showing the benefit of the pool moving to memcg, we've
> > even talked about it multiple times on the list, it's not a OMG change
> > the world benefit, but it definitely provides better alignment between
> > the pool and memcg allocations.
> >
> > We expose userspace API to allocate write combined memory, we do this
> > for all currently supported CPU/GPUs. We might think in the future we
> > don't want to continue to do this, but we do it now. My Fedora 42
> > desktop uses it, even if you say there is no need.
> >
> > If I allocate 100% of my memcg budget to WC memory, free it, then
> > allocate 100% of my budget to non-WC memory, we break container
> > containment as we can force other cgroups to run out of memory budget
> > and have to call the global shrinker.
>
> Yeah and that is perfectly intentional.
But it's wrong; we've been told by the mm/cgroup people that this
isn't correct behaviour and that we should fix it, and in order to move
forward with fixing our other problems, we should start with this one.
We are violating cgroup containment and we should endeavour to stop
doing so.
>
> > With this in place, the
> > container that allocates the WC memory also pays the price to switch
> > it back. Again this is just correctness, it's not going to fix any
> > major workloads, but I also don't think it should cause any
> > regressions, since it won't be worse than current worst case
> > expectation for most workloads.
>
> No, this is not correct behavior any more.
>
> Memory which is used by your cgroup is not used for allocations by another cgroup any more nor given back to the core memory managment for the page pool. E.g. one cgroup can't steal the memory from another cgroup any more.
>
> In other words that is reserving the memory for the cgroup and don't give it back to the global pool as soon as you free it.
But what is the big advantage of giving it back to the global pool
here? I'm pretty sure neither the worst-case nor steady-state behaviour
will change here, but the ability for one cgroup to help or hinder
another cgroup will be curtailed, which as far as I can see is how
cgroups are meant to behave. Each piece operates in its own container,
and can cause minimal disruption, either good or bad, to other
containers.
> That would only be acceptable if we have per cgroup limit on the pool size which is *much* lower than the current global limit we have.
That is up to whoever configures the cgroup limits. If they say this
process should only have access to 1GB of RAM, then between normal RAM
and uncached/WC RAM they get 1GB, and if they need to move RAM between
these via the ttm shrinker then it's all contained in that cgroup. This
isn't taking swapping into account, but we don't do that currently
either.
>
> Maybe we could register a memcg aware shrinker, but not make the LRU memcg aware or something like that.
>
> As far as I can see that would give us the benefit of both approaches, the only problem is that we would have to do per cgroup counter tracking on our own.
>
> That's why I asked if we could have TTM pool specific variables in the cgroup.
>
> Another alternative would be to change the LRU so that we track per memcg, but allow stealing of pages between cgroups.
I just don't get why we'd want to steal pages; if you want that, just
put all the processes in the same cgroup. We leave it up to the cgroup
administration to decide what they want to share between processes.
That policy shouldn't be in the driver/ttm layers; it should be
entirely configurable by the admin, and default to reasonably sane
behaviours.
If there is a system out there already using cgroups for containment,
but relying on this cgroup bypass to share uncached/wc pages, then
clearly it's not a great system, and we should figure out how to fix
that. If we need a backwards compat flag to turn this off, then I'm
fine with that, but we've been told by the cgroup folks that it's not
really a correct cgroup usage, and we should discourage it.
> > I understand we have to add more code to the tt level and that's fine,
> > I just don't see why you think starting at the bottom level is wrong?
> > it clearly has a use, and it's just cleaning up and preparing the
> > levels, so we can move up and solve the next problem.
>
> Because we don't have the necessary functionality to implement a memcg aware shrinker which moves BOs into swap there.
We need to have two levels of shrinker here. I'm not disputing that
the tt level shrinker like xe has needs more work, but right now we
have two shrinkers that aren't aware of numa or memcg, and I'd like to
start by reducing that to one, for the corner case that nobody really
cares about but that would be good to get correct. Then we can work on
the swap/shrinker problem, which isn't this shrinker, and if after we
do that we find a single shrinker could take care of it all, then we
move towards that.
Dave.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 17/17] amdgpu: add support for memory cgroups
2025-07-03 2:53 ` David Airlie
@ 2025-07-03 17:58 ` Shakeel Butt
2025-07-03 18:15 ` Christian König
0 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2025-07-03 17:58 UTC (permalink / raw)
To: David Airlie
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner,
Christian Koenig, Dave Chinner, Kairui Song
On Thu, Jul 03, 2025 at 12:53:44PM +1000, David Airlie wrote:
> On Thu, Jul 3, 2025 at 2:03 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Mon, Jun 30, 2025 at 02:49:36PM +1000, Dave Airlie wrote:
> > > From: Dave Airlie <airlied@redhat.com>
> > >
> > > This adds support for adding a obj cgroup to a buffer object,
> > > and passing in the placement flags to make sure it's accounted
> > > properly.
> > >
> > > Signed-off-by: Dave Airlie <airlied@redhat.com>
> > > ---
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 ++
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 13 +++++++++----
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 +
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 ++
> > > 4 files changed, 14 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> > > index e5e33a68d935..d250183bab03 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> > > @@ -198,6 +198,7 @@ static void amdgpu_gem_object_free(struct drm_gem_object *gobj)
> > > struct amdgpu_bo *aobj = gem_to_amdgpu_bo(gobj);
> > >
> > > amdgpu_hmm_unregister(aobj);
> > > + obj_cgroup_put(aobj->tbo.objcg);
> > > ttm_bo_put(&aobj->tbo);
> > > }
> > >
> > > @@ -225,6 +226,7 @@ int amdgpu_gem_object_create(struct amdgpu_device *adev, unsigned long size,
> > > bp.domain = initial_domain;
> > > bp.bo_ptr_size = sizeof(struct amdgpu_bo);
> > > bp.xcp_id_plus1 = xcp_id_plus1;
> > > + bp.objcg = get_obj_cgroup_from_current();
> >
> > In what context this function is called? Is that the same for
> > ttm_pool_alloc_page()? Is remote charging happening in
> > ttm_pool_alloc_page()?
> >
>
> No, this function is called from userspace ioctl paths that allocate
> GPU objects (GEM objects).
>
> The objects are lazily allocated, so this might not trigger any pages
> being bound to the object, until it is populated, either via mapping +
> page faults or by being used in a GPU command submission, which is
> when the ttm_pool_alloc_page happens.
>
For the mapping + page fault or GPU command submission, can there be a
case where 'current' is not in the same cgroup as the task which has
called amdgpu_gem_object_create() through ioctl? Can the allocation
happen in kthread or workqueue or irq?
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 17/17] amdgpu: add support for memory cgroups
2025-07-03 17:58 ` Shakeel Butt
@ 2025-07-03 18:15 ` Christian König
2025-07-03 20:06 ` Shakeel Butt
0 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2025-07-03 18:15 UTC (permalink / raw)
To: Shakeel Butt, David Airlie
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
Kairui Song
On 03.07.25 19:58, Shakeel Butt wrote:
> On Thu, Jul 03, 2025 at 12:53:44PM +1000, David Airlie wrote:
>> On Thu, Jul 3, 2025 at 2:03 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>
>>> On Mon, Jun 30, 2025 at 02:49:36PM +1000, Dave Airlie wrote:
>>>> From: Dave Airlie <airlied@redhat.com>
>>>>
>>>> This adds support for adding a obj cgroup to a buffer object,
>>>> and passing in the placement flags to make sure it's accounted
>>>> properly.
>>>>
>>>> Signed-off-by: Dave Airlie <airlied@redhat.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 ++
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 13 +++++++++----
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 +
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 ++
>>>> 4 files changed, 14 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> index e5e33a68d935..d250183bab03 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> @@ -198,6 +198,7 @@ static void amdgpu_gem_object_free(struct drm_gem_object *gobj)
>>>> struct amdgpu_bo *aobj = gem_to_amdgpu_bo(gobj);
>>>>
>>>> amdgpu_hmm_unregister(aobj);
>>>> + obj_cgroup_put(aobj->tbo.objcg);
>>>> ttm_bo_put(&aobj->tbo);
>>>> }
>>>>
>>>> @@ -225,6 +226,7 @@ int amdgpu_gem_object_create(struct amdgpu_device *adev, unsigned long size,
>>>> bp.domain = initial_domain;
>>>> bp.bo_ptr_size = sizeof(struct amdgpu_bo);
>>>> bp.xcp_id_plus1 = xcp_id_plus1;
>>>> + bp.objcg = get_obj_cgroup_from_current();
>>>
>>> In what context this function is called? Is that the same for
>>> ttm_pool_alloc_page()? Is remote charging happening in
>>> ttm_pool_alloc_page()?
>>>
>>
>> No, this function is called from userspace ioctl paths that allocate
>> GPU objects (GEM objects).
>>
>> The objects are lazily allocated, so this might not trigger any pages
>> being bound to the object, until it is populated, either via mapping +
>> page faults or by being used in a GPU command submission, which is
>> when the ttm_pool_alloc_page happens.
>>
>
> For the mapping + page fault or GPU command submission, can there be a
> case where 'current' is not in the same cgroup as the task which has
> called amdgpu_gem_object_create() through ioctl? Can the allocation
> happen in kthread or workqueue or irq?
Yes, in some use cases that is actually the most common way of ending up in the memory allocation.
The background is that the first one who touches it actually does the allocation.
BTW: it might be a good idea to limit not only the amount of memory you actually have allocated, but also how much you wanted to allocate.
Regards,
Christian.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 17/17] amdgpu: add support for memory cgroups
2025-07-03 18:15 ` Christian König
@ 2025-07-03 20:06 ` Shakeel Butt
2025-07-03 21:22 ` David Airlie
0 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2025-07-03 20:06 UTC (permalink / raw)
To: Christian König
Cc: David Airlie, Dave Airlie, dri-devel, linux-mm, Johannes Weiner,
Dave Chinner, Kairui Song
On Thu, Jul 03, 2025 at 08:15:09PM +0200, Christian König wrote:
> On 03.07.25 19:58, Shakeel Butt wrote:
> > On Thu, Jul 03, 2025 at 12:53:44PM +1000, David Airlie wrote:
> >> On Thu, Jul 3, 2025 at 2:03 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >>>
> >>> On Mon, Jun 30, 2025 at 02:49:36PM +1000, Dave Airlie wrote:
> >>>> From: Dave Airlie <airlied@redhat.com>
> >>>>
> >>>> This adds support for adding a obj cgroup to a buffer object,
> >>>> and passing in the placement flags to make sure it's accounted
> >>>> properly.
> >>>>
> >>>> Signed-off-by: Dave Airlie <airlied@redhat.com>
> >>>> ---
> >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 ++
> >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 13 +++++++++----
> >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 +
> >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 ++
> >>>> 4 files changed, 14 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >>>> index e5e33a68d935..d250183bab03 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >>>> @@ -198,6 +198,7 @@ static void amdgpu_gem_object_free(struct drm_gem_object *gobj)
> >>>> struct amdgpu_bo *aobj = gem_to_amdgpu_bo(gobj);
> >>>>
> >>>> amdgpu_hmm_unregister(aobj);
> >>>> + obj_cgroup_put(aobj->tbo.objcg);
> >>>> ttm_bo_put(&aobj->tbo);
> >>>> }
> >>>>
> >>>> @@ -225,6 +226,7 @@ int amdgpu_gem_object_create(struct amdgpu_device *adev, unsigned long size,
> >>>> bp.domain = initial_domain;
> >>>> bp.bo_ptr_size = sizeof(struct amdgpu_bo);
> >>>> bp.xcp_id_plus1 = xcp_id_plus1;
> >>>> + bp.objcg = get_obj_cgroup_from_current();
> >>>
> >>> In what context this function is called? Is that the same for
> >>> ttm_pool_alloc_page()? Is remote charging happening in
> >>> ttm_pool_alloc_page()?
> >>>
> >>
> >> No, this function is called from userspace ioctl paths that allocate
> >> GPU objects (GEM objects).
> >>
> >> The objects are lazily allocated, so this might not trigger any pages
> >> being bound to the object, until it is populated, either via mapping +
> >> page faults or by being used in a GPU command submission, which is
> >> when the ttm_pool_alloc_page happens.
> >>
> >
> > For the mapping + page fault or GPU command submission, can there be a
> > case where 'current' is not in the same cgroup as the task which has
> > called amdgpu_gem_object_create() through ioctl? Can the allocation
> > happen in kthread or workqueue or irq?
>
> Yes, in some use cases that is actually the most common way of ending up in the memory allocation.
>
> Background is that the first one who touches it actually does the allocation.
Do you mean a task in cgroup A does amdgpu_gem_object_create() and then
the actual allocation can happen in the task in cgroup B?
>
> BTW: It might be a good idea to not only limit the amount of memory you actually have allocated, but also how much you wanted to allocate.
Do you mean accounting and limiting the reservations? Something like
what hugetlb cgroup provides?
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 17/17] amdgpu: add support for memory cgroups
2025-07-03 20:06 ` Shakeel Butt
@ 2025-07-03 21:22 ` David Airlie
2025-07-04 9:39 ` Christian König
0 siblings, 1 reply; 52+ messages in thread
From: David Airlie @ 2025-07-03 21:22 UTC (permalink / raw)
To: Shakeel Butt
Cc: Christian König, Dave Airlie, dri-devel, linux-mm,
Johannes Weiner, Dave Chinner, Kairui Song
>
> Do you mean a task in cgroup A does amdgpu_gem_object_create() and then
> the actual allocation can happen in the task in cgroup B?
On Android and in some graphics scenarios this might happen, though I'm
not sure it always does. We have scenarios where a display server
allocates a buffer for an application to write into and then displays
it; the ownership of that buffer can be a bit rough.
But in most scenarios I think it'll be the same cgroup, and I think we
generally should account it to the original cgroup unless we
explicitly move the object to another one, (which we haven't got any
code for yet).
> >
> > BTW: It might be a good idea to not only limit the amount of memory you actually have allocated, but also how much you wanted to allocate.
>
> Do you mean accounting and limiting the reservations? Something like
> what hugetlb cgroup provides?
>
I'll let Christian answer that,
Dave.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 17/17] amdgpu: add support for memory cgroups
2025-07-03 21:22 ` David Airlie
@ 2025-07-04 9:39 ` Christian König
0 siblings, 0 replies; 52+ messages in thread
From: Christian König @ 2025-07-04 9:39 UTC (permalink / raw)
To: David Airlie, Shakeel Butt
Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
Kairui Song
On 03.07.25 23:22, David Airlie wrote:
>>
>> Do you mean a task in cgroup A does amdgpu_gem_object_create() and then
>> the actual allocation can happen in the task in cgroup B?
>
> On Android and in some graphics scenarios this might happen, though I'm
> not sure it always does. We have scenarios where a display server
> allocates a buffer for an application to write into and then displays
> it; the ownership of that buffer can be a bit rough.
Yeah, those use cases exist but are rather uncommon.
Also please keep in mind that Android has its own completely separate way of doing memory management. It would be great if we could unify that with cgroups in the future as well, but that is not the task at hand.
> But in most scenarios I think it'll be the same cgroup, and I think we
> generally should account it to the original cgroup unless we
> explicitly move the object to another one, (which we haven't got any
> code for yet).
We just need to make sure that we don't create the possibility of a denial of service.
For example, something we clearly can't allow is: a client creates a buffer but doesn't touch it in any way, and then the server receives the buffer and is forced into an OOM situation because it is the first one to touch it.
>>>
>>> BTW: It might be a good idea to not only limit the amount of memory you actually have allocated, but also how much you wanted to allocate.
>>
>> Do you mean accounting and limiting the reservations? Something like
>> what hugetlb cgroup provides?
Yes, exactly that. Basically a limit on how much a process can over-commit those buffers.
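As a very rough illustration of what I mean (everything below is hypothetical and loosely modelled on the hugetlb cgroup reservation counters; nothing like this exists in the series yet):

#include <linux/page_counter.h>

/*
 * Hypothetical per-cgroup state, loosely modelled on the hugetlb
 * cgroup "rsvd" counters. Nothing like this exists in the series.
 */
struct example_gpu_cg {
	struct page_counter reserved;	/* what was asked for at create time */
	struct page_counter allocated;	/* what is actually backed by pages */
};

/* Called from the create ioctl, before any page is allocated. */
static int example_reserve(struct example_gpu_cg *cg, unsigned long nr_pages)
{
	struct page_counter *fail;

	/* fail the ioctl up front if the over-commit limit is exceeded */
	if (!page_counter_try_charge(&cg->reserved, nr_pages, &fail))
		return -ENOMEM;
	return 0;
}

/* Called when the object is destroyed. */
static void example_unreserve(struct example_gpu_cg *cg, unsigned long nr_pages)
{
	page_counter_uncharge(&cg->reserved, nr_pages);
}

That way a client which creates huge buffers but never touches them already shows up against its own limit at creation time, which also helps with the denial of service concern above.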
Regards,
Christian.
>>
>
> I'll let Christian answer that,
> Dave.
>
^ permalink raw reply [flat|nested] 52+ messages in thread
end of thread
Thread overview: 52+ messages
2025-06-30 4:49 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Dave Airlie
2025-06-30 4:49 ` [PATCH 01/17] mm: add gpu active/reclaim per-node stat counters (v2) Dave Airlie
2025-06-30 4:49 ` [PATCH 02/17] drm/ttm: use gpu mm stats to track gpu memory allocations. (v2) Dave Airlie
2025-06-30 10:04 ` Christian König
2025-07-01 1:41 ` David Airlie
2025-07-02 16:08 ` Shakeel Butt
2025-06-30 4:49 ` [PATCH 03/17] mm/list_lru: export list_lru_add Dave Airlie
2025-06-30 4:49 ` [PATCH 04/17] ttm/pool: port to list_lru. (v2) Dave Airlie
2025-06-30 10:37 ` kernel test robot
2025-06-30 4:49 ` [PATCH 05/17] ttm/pool: drop numa specific pools Dave Airlie
2025-06-30 10:12 ` Christian König
2025-06-30 4:49 ` [PATCH 06/17] ttm/pool: make pool shrinker NUMA aware Dave Airlie
2025-06-30 10:15 ` Christian König
2025-06-30 21:30 ` David Airlie
2025-06-30 4:49 ` [PATCH 07/17] ttm/pool: track allocated_pages per numa node Dave Airlie
2025-06-30 4:49 ` [PATCH 08/17] memcg: add support for GPU page counters Dave Airlie
2025-07-02 16:06 ` Shakeel Butt
2025-07-03 5:43 ` David Airlie
2025-06-30 4:49 ` [PATCH 09/17] memcg: export memcg_list_lru_alloc Dave Airlie
2025-06-30 4:49 ` [PATCH 10/17] ttm: add a memcg accounting flag to the alloc/populate APIs Dave Airlie
2025-06-30 9:56 ` kernel test robot
2025-06-30 10:20 ` Christian König
2025-07-01 1:46 ` David Airlie
2025-06-30 4:49 ` [PATCH 11/17] ttm/pool: initialise the shrinker earlier Dave Airlie
2025-06-30 4:49 ` [PATCH 12/17] ttm: add objcg pointer to bo and tt Dave Airlie
2025-06-30 10:24 ` Christian König
2025-06-30 21:33 ` David Airlie
2025-07-01 7:22 ` Christian König
2025-07-01 8:06 ` David Airlie
2025-07-01 8:15 ` Christian König
2025-07-01 22:11 ` David Airlie
2025-07-02 7:27 ` Christian König
2025-07-02 7:57 ` David Airlie
2025-07-02 8:24 ` Christian König
2025-07-03 5:53 ` David Airlie
2025-06-30 4:49 ` [PATCH 13/17] ttm/pool: enable memcg tracking and shrinker Dave Airlie
2025-06-30 10:23 ` Christian König
2025-06-30 21:23 ` David Airlie
2025-06-30 11:59 ` kernel test robot
2025-07-02 16:41 ` Shakeel Butt
2025-06-30 4:49 ` [PATCH 14/17] ttm: hook up memcg placement flags Dave Airlie
2025-06-30 4:49 ` [PATCH 15/17] memcontrol: allow objcg api when memcg is config off Dave Airlie
2025-06-30 4:49 ` [PATCH 16/17] memcontrol: export current_obj_cgroup Dave Airlie
2025-06-30 4:49 ` [PATCH 17/17] amdgpu: add support for memory cgroups Dave Airlie
2025-07-02 16:02 ` Shakeel Butt
2025-07-03 2:53 ` David Airlie
2025-07-03 17:58 ` Shakeel Butt
2025-07-03 18:15 ` Christian König
2025-07-03 20:06 ` Shakeel Butt
2025-07-03 21:22 ` David Airlie
2025-07-04 9:39 ` Christian König
2025-07-01 23:26 ` drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver Balbir Singh