* drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2)
@ 2025-07-14  5:18 Dave Airlie
  2025-07-14  5:18 ` [PATCH 01/18] mm: add gpu active/reclaim per-node stat counters (v2) Dave Airlie
                   ` (18 more replies)
  0 siblings, 19 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song

Hi all,

This is a repost with some fixes and cleanups.

Differences since last posting:
1. Added patch 18: a module option to allow pooled pages not to be stored in the per-memcg lru
   (Requested by Christian Koenig)
2. Converged the naming and stats between vmstat and memcg (Suggested by Shakeel Butt)
3. Cleaned up the charge/uncharge code and some other bits.

Dave.

Original cover letter:
tl;dr: start using list_lru/numa/memcg in GPU driver core and amdgpu driver for now.

This is a complete series of patches, some of which have been sent before and reviewed,
but I want to get the complete picture for others, and try to figure out how best to land this.

There are 3 pieces to this:
01->02: add support for global gpu stat counters (previously posted, patch 2 is newer)
03->07: port ttm pools to list_lru for numa awareness
08->14: add memcg stats + gpu apis, then port ttm pools to memcg aware list_lru and shrinker
15->17: enable amdgpu to use new functionality.

The biggest difference in the memcg code from the previous version is that I discovered
what obj cgroups were designed for, and I'm reusing the page/objcg integration that
already exists, to avoid reinventing that wheel right now.

There are some igt-gpu-tools tests I've written at:
https://gitlab.freedesktop.org/airlied/igt-gpu-tools/-/tree/amdgpu-cgroups?ref_type=heads

One problem is that there is a lot of delayed action involved, which probably means
the testing needs a bit more robustness, but the tests validate all the basic paths.

Regards,
Dave.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 01/18] mm: add gpu active/reclaim per-node stat counters (v2)
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14  5:18 ` [PATCH 02/18] drm/ttm: use gpu mm stats to track gpu memory allocations. (v3) Dave Airlie
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie, Matthew Brost,
	Andrew Morton, Zi Yan, Shakeel Butt

From: Dave Airlie <airlied@redhat.com>

While discussing memcg integration with gpu memory allocations,
it was pointed out that there were no numa/system counters for
GPU memory allocations.

With more integrated-memory GPU server systems turning up, and
more requirements for memory tracking, it seems we should start
closing the gap.

Add two counters to track GPU per-node system memory allocations.

The first counts memory currently allocated to GPU objects; the second
counts memory stored in GPU page pools that can be reclaimed by the
shrinker.
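
For illustration only (not part of this patch; the helper names are made up,
the real TTM hook-up follows in the next patch), a driver allocating system
pages for a GPU object would update the new counters roughly like this:

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/vmstat.h>

    static struct page *example_gpu_alloc(int nid, gfp_t gfp, unsigned int order)
    {
            struct page *p = alloc_pages_node(nid, gfp, order);

            if (p)
                    /* the pages now back a GPU object: count them as active */
                    mod_node_page_state(NODE_DATA(page_to_nid(p)),
                                        NR_GPU_ACTIVE, 1 << order);
            return p;
    }

    static void example_gpu_free(struct page *p, unsigned int order)
    {
            /* undo the accounting before handing the pages back */
            mod_node_page_state(NODE_DATA(page_to_nid(p)),
                                NR_GPU_ACTIVE, -(1 << order));
            __free_pages(p, order);
    }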

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>

---

v2: add more info to the documentation on this memory.

I'd like to get acks to merge this via the drm tree, if possible,

Dave.
---
 Documentation/filesystems/proc.rst | 8 ++++++++
 drivers/base/node.c                | 5 +++++
 fs/proc/meminfo.c                  | 6 ++++++
 include/linux/mmzone.h             | 2 ++
 mm/show_mem.c                      | 9 +++++++--
 mm/vmstat.c                        | 2 ++
 6 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2a17865dfe39..4d9c65b303cc 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1093,6 +1093,8 @@ Example output. You may not have all of these fields.
     CmaFree:               0 kB
     Unaccepted:            0 kB
     Balloon:               0 kB
+    GPUActive:             0 kB
+    GPUReclaim:            0 kB
     HugePages_Total:       0
     HugePages_Free:        0
     HugePages_Rsvd:        0
@@ -1271,6 +1273,12 @@ Unaccepted
               Memory that has not been accepted by the guest
 Balloon
               Memory returned to Host by VM Balloon Drivers
+GPUActive
+              System memory allocated to active GPU objects
+GPUReclaim
+              System memory stored in GPU pools for reuse. This memory is not
+              counted in GPUActive. It is shrinker reclaimable memory kept in a reuse
+              pool because it has non-standard page table attributes, like WC or UC.
 HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, Hugepagesize, Hugetlb
               See Documentation/admin-guide/mm/hugetlbpage.rst.
 DirectMap4k, DirectMap2M, DirectMap1G
diff --git a/drivers/base/node.c b/drivers/base/node.c
index c19094481630..64406862314b 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -463,6 +463,8 @@ static ssize_t node_read_meminfo(struct device *dev,
 #ifdef CONFIG_UNACCEPTED_MEMORY
 			     "Node %d Unaccepted:     %8lu kB\n"
 #endif
+			     "Node %d GPUActive:      %8lu kB\n"
+			     "Node %d GPUReclaim:     %8lu kB\n"
 			     ,
 			     nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
 			     nid, K(node_page_state(pgdat, NR_WRITEBACK)),
@@ -496,6 +498,9 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     ,
 			     nid, K(sum_zone_node_page_state(nid, NR_UNACCEPTED))
 #endif
+			     ,
+			     nid, K(node_page_state(pgdat, NR_GPU_ACTIVE)),
+			     nid, K(node_page_state(pgdat, NR_GPU_RECLAIM))
 			    );
 	len += hugetlb_report_node_meminfo(buf, len, nid);
 	return len;
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index bc2bc60c36cc..334948744e55 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -164,6 +164,12 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "Balloon:        ",
 		    global_node_page_state(NR_BALLOON_PAGES));
 
+	show_val_kb(m, "GPUActive:      ",
+		    global_node_page_state(NR_GPU_ACTIVE));
+
+	show_val_kb(m, "GPUReclaim:     ",
+		    global_node_page_state(NR_GPU_RECLAIM));
+
 	hugetlb_report_meminfo(m);
 
 	arch_report_meminfo(m);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 283913d42d7b..81f3e368fd91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -241,6 +241,8 @@ enum node_stat_item {
 	NR_HUGETLB,
 #endif
 	NR_BALLOON_PAGES,
+	NR_GPU_ACTIVE,          /* Pages assigned to GPU objects */
+	NR_GPU_RECLAIM,         /* Pages in shrinkable GPU pools */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 0cf8bf5d832d..072d33a50148 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -255,7 +255,9 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 			" sec_pagetables:%lukB"
 			" all_unreclaimable? %s"
 			" Balloon:%lukB"
-			"\n",
+		        " gpu_active:%lukB"
+		        " gpu_reclaim:%lukB"
+		        "\n",
 			pgdat->node_id,
 			K(node_page_state(pgdat, NR_ACTIVE_ANON)),
 			K(node_page_state(pgdat, NR_INACTIVE_ANON)),
@@ -281,7 +283,10 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 			K(node_page_state(pgdat, NR_PAGETABLE)),
 			K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)),
 			str_yes_no(pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES),
-			K(node_page_state(pgdat, NR_BALLOON_PAGES)));
+		        K(node_page_state(pgdat, NR_BALLOON_PAGES)),
+		        K(node_page_state(pgdat, NR_GPU_ACTIVE)),
+			K(node_page_state(pgdat, NR_GPU_RECLAIM)));
+
 	}
 
 	for_each_populated_zone(zone) {
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 429ae5339bfe..25a74cf29473 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1281,6 +1281,8 @@ const char * const vmstat_text[] = {
 	"nr_hugetlb",
 #endif
 	"nr_balloon_pages",
+	"nr_gpu_active",
+	"nr_gpu_reclaim",
 	/* system-wide enum vm_stat_item counters */
 	"nr_dirty_threshold",
 	"nr_dirty_background_threshold",
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 02/18] drm/ttm: use gpu mm stats to track gpu memory allocations. (v3)
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
  2025-07-14  5:18 ` [PATCH 01/18] mm: add gpu active/reclaim per-node stat counters (v2) Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14 18:33   ` Shakeel Butt
  2025-07-14  5:18 ` [PATCH 03/18] mm/list_lru: export list_lru_add Dave Airlie
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie, Matthew Brost,
	Andrew Morton

From: Dave Airlie <airlied@redhat.com>

This uses the newly introduced per-node gpu tracking stats
to track GPU memory allocated via TTM and reclaimable memory in
the TTM page pools.

These stats will be useful for system information, and later
when mem cgroups are integrated.

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>

---
v2: add reclaim parameters and adjust the right counters.
v3: drop the nid helper and get it from page.
---
 drivers/gpu/drm/ttm/ttm_pool.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index baf27c70a419..ee2344089d47 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -150,8 +150,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
 
 	if (!pool->use_dma_alloc) {
 		p = alloc_pages_node(pool->nid, gfp_flags, order);
-		if (p)
+		if (p) {
 			p->private = order;
+			mod_node_page_state(NODE_DATA(page_to_nid(p)), NR_GPU_ACTIVE, (1 << order));
+		}
 		return p;
 	}
 
@@ -186,7 +188,7 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
 
 /* Reset the caching and pages of size 1 << order */
 static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
-			       unsigned int order, struct page *p)
+			       unsigned int order, struct page *p, bool reclaim)
 {
 	unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
 	struct ttm_pool_dma *dma;
@@ -201,6 +203,9 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
 #endif
 
 	if (!pool || !pool->use_dma_alloc) {
+		mod_node_page_state(NODE_DATA(page_to_nid(p)),
+				    reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE,
+				    -(1 << order));
 		__free_pages(p, order);
 		return;
 	}
@@ -276,6 +281,7 @@ static void ttm_pool_unmap(struct ttm_pool *pool, dma_addr_t dma_addr,
 static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
 {
 	unsigned int i, num_pages = 1 << pt->order;
+	int nid = page_to_nid(p);
 
 	for (i = 0; i < num_pages; ++i) {
 		if (PageHighMem(p))
@@ -288,17 +294,24 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
 	list_add(&p->lru, &pt->pages);
 	spin_unlock(&pt->lock);
 	atomic_long_add(1 << pt->order, &allocated_pages);
+
+	mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, -num_pages);
+	mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, num_pages);
 }
 
 /* Take pages from a specific pool_type, return NULL when nothing available */
 static struct page *ttm_pool_type_take(struct ttm_pool_type *pt)
 {
 	struct page *p;
+	int nid;
 
 	spin_lock(&pt->lock);
 	p = list_first_entry_or_null(&pt->pages, typeof(*p), lru);
 	if (p) {
+		nid = page_to_nid(p);
 		atomic_long_sub(1 << pt->order, &allocated_pages);
+		mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, (1 << pt->order));
+		mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, -(1 << pt->order));
 		list_del(&p->lru);
 	}
 	spin_unlock(&pt->lock);
@@ -331,7 +344,7 @@ static void ttm_pool_type_fini(struct ttm_pool_type *pt)
 	spin_unlock(&shrinker_lock);
 
 	while ((p = ttm_pool_type_take(pt)))
-		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p);
+		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
 }
 
 /* Return the pool_type to use for the given caching and order */
@@ -383,7 +396,7 @@ static unsigned int ttm_pool_shrink(void)
 
 	p = ttm_pool_type_take(pt);
 	if (p) {
-		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p);
+		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
 		num_pages = 1 << pt->order;
 	} else {
 		num_pages = 0;
@@ -475,7 +488,7 @@ static pgoff_t ttm_pool_unmap_and_free(struct ttm_pool *pool, struct page *page,
 	if (pt)
 		ttm_pool_type_give(pt, page);
 	else
-		ttm_pool_free_page(pool, caching, order, page);
+		ttm_pool_free_page(pool, caching, order, page, false);
 
 	return nr;
 }
@@ -780,7 +793,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 	return 0;
 
 error_free_page:
-	ttm_pool_free_page(pool, page_caching, order, p);
+	ttm_pool_free_page(pool, page_caching, order, p, false);
 
 error_free_all:
 	if (tt->restore)
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 03/18] mm/list_lru: export list_lru_add.
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
  2025-07-14  5:18 ` [PATCH 01/18] mm: add gpu active/reclaim per-node stat counters (v2) Dave Airlie
  2025-07-14  5:18 ` [PATCH 02/18] drm/ttm: use gpu mm stats to track gpu memory allocations. (v3) Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14 19:01   ` Shakeel Butt
  2025-07-14  5:18 ` [PATCH 04/18] ttm/pool: port to list_lru. (v2) Dave Airlie
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie, Shakeel Butt

From: Dave Airlie <airlied@redhat.com>

DRM/TTM wants to use this for its page pool
LRU tracking.

This is effectively a revert of
78c0ed09131b772f062b986a2fcca6600daa6285
Author: Kairui Song <kasong@tencent.com>
Date:   Tue Nov 5 01:52:53 2024 +0800

    mm/list_lru: don't export list_lru_add

Cc: Kairui Song <kasong@tencent.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 mm/list_lru.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 490473af3122..315362e3df3d 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -175,6 +175,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
 	unlock_list_lru(l, false);
 	return false;
 }
+EXPORT_SYMBOL_GPL(list_lru_add);
 
 bool list_lru_add_obj(struct list_lru *lru, struct list_head *item)
 {
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 04/18] ttm/pool: port to list_lru. (v2)
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (2 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 03/18] mm/list_lru: export list_lru_add Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14  5:18 ` [PATCH 05/18] ttm/pool: drop numa specific pools Dave Airlie
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

This is an initial port of the TTM pools for
write-combined and uncached pages to use the list_lru.

This makes the pools more NUMA aware and avoids
needing separate NUMA pools (a later commit drops those).
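
As a rough sketch of the pattern adopted below (simplified, error handling
and locking details omitted; the example names are made up), the list_lru
replaces the open-coded list + spinlock: insertion passes the NUMA node,
and removal goes through a walk callback that isolates one item:

    #include <linux/list_lru.h>
    #include <linux/mm.h>

    static enum lru_status isolate_one(struct list_head *item,
                                       struct list_lru_one *list, void *cb_arg)
    {
            struct page **out = cb_arg;

            list_lru_isolate(list, item);
            *out = container_of(item, struct page, lru);
            return LRU_REMOVED;
    }

    static void example_usage(struct list_lru *lru, struct page *p, int nid)
    {
            struct page *taken = NULL;
            unsigned long nr_to_walk = 1;

            if (list_lru_init(lru))                 /* replaces INIT_LIST_HEAD + spin_lock_init */
                    return;
            list_lru_add(lru, &p->lru, nid, NULL);  /* per-node insertion, NULL memcg for now */
            list_lru_walk_node(lru, nid, isolate_one, &taken, &nr_to_walk);
    }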

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

---
v2: drop the pt->lock, the lru list has its own lock which is sufficient.
rearrange list isolates to fix bad locking orders.
---
 drivers/gpu/drm/ttm/tests/ttm_device_test.c |  2 +-
 drivers/gpu/drm/ttm/tests/ttm_pool_test.c   | 32 ++++----
 drivers/gpu/drm/ttm/ttm_pool.c              | 91 ++++++++++++++-------
 include/drm/ttm/ttm_pool.h                  |  6 +-
 4 files changed, 80 insertions(+), 51 deletions(-)

diff --git a/drivers/gpu/drm/ttm/tests/ttm_device_test.c b/drivers/gpu/drm/ttm/tests/ttm_device_test.c
index 1621903818e5..1f207fd222bc 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_device_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_device_test.c
@@ -183,7 +183,7 @@ static void ttm_device_init_pools(struct kunit *test)
 
 				if (params->use_dma_alloc)
 					KUNIT_ASSERT_FALSE(test,
-							   list_empty(&pt.pages));
+							   !list_lru_count(&pt.pages));
 			}
 		}
 	}
diff --git a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
index 8ade53371f72..39234a3e98c4 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
@@ -248,7 +248,7 @@ static void ttm_pool_alloc_order_caching_match(struct kunit *test)
 	pool = ttm_pool_pre_populated(test, size, caching);
 
 	pt = &pool->caching[caching].orders[order];
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt->pages));
 
 	tt = ttm_tt_kunit_init(test, 0, caching, size);
 	KUNIT_ASSERT_NOT_NULL(test, tt);
@@ -256,7 +256,7 @@ static void ttm_pool_alloc_order_caching_match(struct kunit *test)
 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
-	KUNIT_ASSERT_TRUE(test, list_empty(&pt->pages));
+	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
 
 	ttm_pool_free(pool, tt);
 	ttm_tt_fini(tt);
@@ -282,8 +282,8 @@ static void ttm_pool_alloc_caching_mismatch(struct kunit *test)
 	tt = ttm_tt_kunit_init(test, 0, tt_caching, size);
 	KUNIT_ASSERT_NOT_NULL(test, tt);
 
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt_pool->pages));
-	KUNIT_ASSERT_TRUE(test, list_empty(&pt_tt->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
+	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt_tt->pages));
 
 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
@@ -291,8 +291,8 @@ static void ttm_pool_alloc_caching_mismatch(struct kunit *test)
 	ttm_pool_free(pool, tt);
 	ttm_tt_fini(tt);
 
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt_pool->pages));
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt_tt->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_tt->pages));
 
 	ttm_pool_fini(pool);
 }
@@ -316,8 +316,8 @@ static void ttm_pool_alloc_order_mismatch(struct kunit *test)
 	tt = ttm_tt_kunit_init(test, 0, caching, snd_size);
 	KUNIT_ASSERT_NOT_NULL(test, tt);
 
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt_pool->pages));
-	KUNIT_ASSERT_TRUE(test, list_empty(&pt_tt->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
+	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt_tt->pages));
 
 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
@@ -325,8 +325,8 @@ static void ttm_pool_alloc_order_mismatch(struct kunit *test)
 	ttm_pool_free(pool, tt);
 	ttm_tt_fini(tt);
 
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt_pool->pages));
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt_tt->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_tt->pages));
 
 	ttm_pool_fini(pool);
 }
@@ -352,12 +352,12 @@ static void ttm_pool_free_dma_alloc(struct kunit *test)
 	ttm_pool_alloc(pool, tt, &simple_ctx);
 
 	pt = &pool->caching[caching].orders[order];
-	KUNIT_ASSERT_TRUE(test, list_empty(&pt->pages));
+	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
 
 	ttm_pool_free(pool, tt);
 	ttm_tt_fini(tt);
 
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt->pages));
 
 	ttm_pool_fini(pool);
 }
@@ -383,12 +383,12 @@ static void ttm_pool_free_no_dma_alloc(struct kunit *test)
 	ttm_pool_alloc(pool, tt, &simple_ctx);
 
 	pt = &pool->caching[caching].orders[order];
-	KUNIT_ASSERT_TRUE(test, list_is_singular(&pt->pages));
+	KUNIT_ASSERT_TRUE(test, list_lru_count(&pt->pages) == 1);
 
 	ttm_pool_free(pool, tt);
 	ttm_tt_fini(tt);
 
-	KUNIT_ASSERT_TRUE(test, list_is_singular(&pt->pages));
+	KUNIT_ASSERT_TRUE(test, list_lru_count(&pt->pages) == 1);
 
 	ttm_pool_fini(pool);
 }
@@ -404,11 +404,11 @@ static void ttm_pool_fini_basic(struct kunit *test)
 	pool = ttm_pool_pre_populated(test, size, caching);
 	pt = &pool->caching[caching].orders[order];
 
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt->pages));
 
 	ttm_pool_fini(pool);
 
-	KUNIT_ASSERT_TRUE(test, list_empty(&pt->pages));
+	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
 }
 
 static struct kunit_case ttm_pool_test_cases[] = {
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index ee2344089d47..df6b81a43893 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -131,6 +131,15 @@ static struct list_head shrinker_list;
 static struct shrinker *mm_shrinker;
 static DECLARE_RWSEM(pool_shrink_rwsem);
 
+static int ttm_pool_nid(struct ttm_pool *pool) {
+	int nid = NUMA_NO_NODE;
+	if (pool)
+		nid = pool->nid;
+	if (nid == NUMA_NO_NODE)
+		nid = numa_node_id();
+	return nid;
+}
+
 /* Allocate pages of size 1 << order with the given gfp_flags */
 static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
 					unsigned int order)
@@ -290,32 +299,41 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
 			clear_page(page_address(p + i));
 	}
 
-	spin_lock(&pt->lock);
-	list_add(&p->lru, &pt->pages);
-	spin_unlock(&pt->lock);
+	INIT_LIST_HEAD(&p->lru);
+	rcu_read_lock();
+	list_lru_add(&pt->pages, &p->lru, nid, NULL);
+	rcu_read_unlock();
 	atomic_long_add(1 << pt->order, &allocated_pages);
 
 	mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, -num_pages);
 	mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, num_pages);
 }
 
+static enum lru_status take_one_from_lru(struct list_head *item,
+					 struct list_lru_one *list,
+					 void *cb_arg)
+{
+	struct page **out_page = cb_arg;
+	struct page *p = container_of(item, struct page, lru);
+	list_lru_isolate(list, item);
+
+	*out_page = p;
+	return LRU_REMOVED;
+}
+
 /* Take pages from a specific pool_type, return NULL when nothing available */
-static struct page *ttm_pool_type_take(struct ttm_pool_type *pt)
+static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid)
 {
-	struct page *p;
-	int nid;
+	int ret;
+	struct page *p = NULL;
+	unsigned long nr_to_walk = 1;
 
-	spin_lock(&pt->lock);
-	p = list_first_entry_or_null(&pt->pages, typeof(*p), lru);
-	if (p) {
-		nid = page_to_nid(p);
+	ret = list_lru_walk_node(&pt->pages, nid, take_one_from_lru, (void *)&p, &nr_to_walk);
+	if (ret == 1 && p) {
 		atomic_long_sub(1 << pt->order, &allocated_pages);
 		mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, (1 << pt->order));
 		mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, -(1 << pt->order));
-		list_del(&p->lru);
 	}
-	spin_unlock(&pt->lock);
-
 	return p;
 }
 
@@ -326,25 +344,47 @@ static void ttm_pool_type_init(struct ttm_pool_type *pt, struct ttm_pool *pool,
 	pt->pool = pool;
 	pt->caching = caching;
 	pt->order = order;
-	spin_lock_init(&pt->lock);
-	INIT_LIST_HEAD(&pt->pages);
+	list_lru_init(&pt->pages);
 
 	spin_lock(&shrinker_lock);
 	list_add_tail(&pt->shrinker_list, &shrinker_list);
 	spin_unlock(&shrinker_lock);
 }
 
+static enum lru_status pool_move_to_dispose_list(struct list_head *item,
+						 struct list_lru_one *list,
+						 void *cb_arg)
+{
+	struct list_head *dispose = cb_arg;
+
+	list_lru_isolate_move(list, item, dispose);
+
+	return LRU_REMOVED;
+}
+
+static void ttm_pool_dispose_list(struct ttm_pool_type *pt,
+				  struct list_head *dispose)
+{
+	while (!list_empty(dispose)) {
+		struct page *p;
+		p = list_first_entry(dispose, struct page, lru);
+		list_del_init(&p->lru);
+		atomic_long_sub(1 << pt->order, &allocated_pages);
+		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
+	}
+}
+
 /* Remove a pool_type from the global shrinker list and free all pages */
 static void ttm_pool_type_fini(struct ttm_pool_type *pt)
 {
-	struct page *p;
+	LIST_HEAD(dispose);
 
 	spin_lock(&shrinker_lock);
 	list_del(&pt->shrinker_list);
 	spin_unlock(&shrinker_lock);
 
-	while ((p = ttm_pool_type_take(pt)))
-		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
+	list_lru_walk(&pt->pages, pool_move_to_dispose_list, &dispose, LONG_MAX);
+	ttm_pool_dispose_list(pt, &dispose);
 }
 
 /* Return the pool_type to use for the given caching and order */
@@ -394,7 +434,7 @@ static unsigned int ttm_pool_shrink(void)
 	list_move_tail(&pt->shrinker_list, &shrinker_list);
 	spin_unlock(&shrinker_lock);
 
-	p = ttm_pool_type_take(pt);
+	p = ttm_pool_type_take(pt, ttm_pool_nid(pt->pool));
 	if (p) {
 		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
 		num_pages = 1 << pt->order;
@@ -748,7 +788,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 		p = NULL;
 		pt = ttm_pool_select_type(pool, page_caching, order);
 		if (pt && allow_pools)
-			p = ttm_pool_type_take(pt);
+			p = ttm_pool_type_take(pt, ttm_pool_nid(pool));
 		/*
 		 * If that fails or previously failed, allocate from system.
 		 * Note that this also disallows additional pool allocations using
@@ -1177,16 +1217,7 @@ static unsigned long ttm_pool_shrinker_count(struct shrinker *shrink,
 /* Count the number of pages available in a pool_type */
 static unsigned int ttm_pool_type_count(struct ttm_pool_type *pt)
 {
-	unsigned int count = 0;
-	struct page *p;
-
-	spin_lock(&pt->lock);
-	/* Only used for debugfs, the overhead doesn't matter */
-	list_for_each_entry(p, &pt->pages, lru)
-		++count;
-	spin_unlock(&pt->lock);
-
-	return count;
+	return list_lru_count(&pt->pages);
 }
 
 /* Print a nice header for the order */
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index 54cd34a6e4c0..df56527c4853 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -45,8 +45,7 @@ struct ttm_tt;
  * @order: the allocation order our pages have
  * @caching: the caching type our pages have
  * @shrinker_list: our place on the global shrinker list
- * @lock: protection of the page list
- * @pages: the list of pages in the pool
+ * @pages: the lru_list of pages in the pool
  */
 struct ttm_pool_type {
 	struct ttm_pool *pool;
@@ -55,8 +54,7 @@ struct ttm_pool_type {
 
 	struct list_head shrinker_list;
 
-	spinlock_t lock;
-	struct list_head pages;
+	struct list_lru pages;
 };
 
 /**
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 05/18] ttm/pool: drop numa specific pools
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (3 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 04/18] ttm/pool: port to list_lru. (v2) Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14  5:18 ` [PATCH 06/18] ttm/pool: make pool shrinker NUMA aware Dave Airlie
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

The list_lru will now handle NUMA for us, so there is no need to keep
separate pool types for it. Just consolidate into the global ones.

This adds a debugfs change to avoid dumping non-existent orders as a
result of this change.

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_pool.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index df6b81a43893..e26d57676fe3 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -398,17 +398,11 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
 #ifdef CONFIG_X86
 	switch (caching) {
 	case ttm_write_combined:
-		if (pool->nid != NUMA_NO_NODE)
-			return &pool->caching[caching].orders[order];
-
 		if (pool->use_dma32)
 			return &global_dma32_write_combined[order];
 
 		return &global_write_combined[order];
 	case ttm_uncached:
-		if (pool->nid != NUMA_NO_NODE)
-			return &pool->caching[caching].orders[order];
-
 		if (pool->use_dma32)
 			return &global_dma32_uncached[order];
 
@@ -1283,7 +1277,7 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
 {
 	unsigned int i;
 
-	if (!pool->use_dma_alloc && pool->nid == NUMA_NO_NODE) {
+	if (!pool->use_dma_alloc) {
 		seq_puts(m, "unused\n");
 		return 0;
 	}
@@ -1294,10 +1288,7 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
 	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
 		if (!ttm_pool_select_type(pool, i, 0))
 			continue;
-		if (pool->use_dma_alloc)
-			seq_puts(m, "DMA ");
-		else
-			seq_printf(m, "N%d ", pool->nid);
+		seq_puts(m, "DMA ");
 		switch (i) {
 		case ttm_cached:
 			seq_puts(m, "\t:");
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 06/18] ttm/pool: make pool shrinker NUMA aware
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (4 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 05/18] ttm/pool: drop numa specific pools Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14  5:18 ` [PATCH 07/18] ttm/pool: track allocated_pages per numa node Dave Airlie
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

This enables NUMA awareness for the shrinker on the
TTM pools.
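
Roughly, the NUMA-aware shrinker pattern this switches to looks like the
sketch below (simplified; everything except the shrinker API itself is a
made-up placeholder). With SHRINKER_NUMA_AWARE the core MM invokes the
count/scan callbacks per node and passes the node id in sc->nid:

    #include <linux/shrinker.h>

    /* hypothetical per-node bookkeeping helpers, not real kernel APIs */
    unsigned long example_pages_on_node(int nid);
    unsigned long example_free_pages_on_node(int nid, unsigned long nr);

    static unsigned long example_count(struct shrinker *shrink,
                                       struct shrink_control *sc)
    {
            unsigned long n = example_pages_on_node(sc->nid);

            return n ? n : SHRINK_EMPTY;
    }

    static unsigned long example_scan(struct shrinker *shrink,
                                      struct shrink_control *sc)
    {
            return example_free_pages_on_node(sc->nid, sc->nr_to_scan);
    }

    static int example_register(void)
    {
            struct shrinker *s = shrinker_alloc(SHRINKER_NUMA_AWARE, "example");

            if (!s)
                    return -ENOMEM;
            s->count_objects = example_count;
            s->scan_objects = example_scan;
            shrinker_register(s);
            return 0;
    }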

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_pool.c | 38 +++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index e26d57676fe3..c0391ccd179b 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -415,12 +415,12 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
 	return NULL;
 }
 
-/* Free pages using the global shrinker list */
-static unsigned int ttm_pool_shrink(void)
+/* Free pages using the per-node shrinker list */
+static unsigned int ttm_pool_shrink(int nid, unsigned long num_to_free)
 {
+	LIST_HEAD(dispose);
 	struct ttm_pool_type *pt;
 	unsigned int num_pages;
-	struct page *p;
 
 	down_read(&pool_shrink_rwsem);
 	spin_lock(&shrinker_lock);
@@ -428,13 +428,10 @@ static unsigned int ttm_pool_shrink(void)
 	list_move_tail(&pt->shrinker_list, &shrinker_list);
 	spin_unlock(&shrinker_lock);
 
-	p = ttm_pool_type_take(pt, ttm_pool_nid(pt->pool));
-	if (p) {
-		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
-		num_pages = 1 << pt->order;
-	} else {
-		num_pages = 0;
-	}
+	num_pages = list_lru_walk_node(&pt->pages, nid, pool_move_to_dispose_list, &dispose, &num_to_free);
+	num_pages *= 1 << pt->order;
+
+	ttm_pool_dispose_list(pt, &dispose);
 	up_read(&pool_shrink_rwsem);
 
 	return num_pages;
@@ -783,6 +780,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 		pt = ttm_pool_select_type(pool, page_caching, order);
 		if (pt && allow_pools)
 			p = ttm_pool_type_take(pt, ttm_pool_nid(pool));
+
 		/*
 		 * If that fails or previously failed, allocate from system.
 		 * Note that this also disallows additional pool allocations using
@@ -931,8 +929,10 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
 {
 	ttm_pool_free_range(pool, tt, tt->caching, 0, tt->num_pages);
 
-	while (atomic_long_read(&allocated_pages) > page_pool_size)
-		ttm_pool_shrink();
+	while (atomic_long_read(&allocated_pages) > page_pool_size) {
+		unsigned long diff = page_pool_size - atomic_long_read(&allocated_pages);
+		ttm_pool_shrink(ttm_pool_nid(pool), diff);
+	}
 }
 EXPORT_SYMBOL(ttm_pool_free);
 
@@ -1189,7 +1189,7 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
 	unsigned long num_freed = 0;
 
 	do
-		num_freed += ttm_pool_shrink();
+		num_freed += ttm_pool_shrink(sc->nid, sc->nr_to_scan);
 	while (num_freed < sc->nr_to_scan &&
 	       atomic_long_read(&allocated_pages));
 
@@ -1317,11 +1317,15 @@ static int ttm_pool_debugfs_shrink_show(struct seq_file *m, void *data)
 		.nr_to_scan = TTM_SHRINKER_BATCH,
 	};
 	unsigned long count;
+	int nid;
 
 	fs_reclaim_acquire(GFP_KERNEL);
-	count = ttm_pool_shrinker_count(mm_shrinker, &sc);
-	seq_printf(m, "%lu/%lu\n", count,
-		   ttm_pool_shrinker_scan(mm_shrinker, &sc));
+	for_each_node(nid) {
+		sc.nid = nid;
+		count = ttm_pool_shrinker_count(mm_shrinker, &sc);
+		seq_printf(m, "%d: %lu/%lu\n", nid, count,
+			   ttm_pool_shrinker_scan(mm_shrinker, &sc));
+	}
 	fs_reclaim_release(GFP_KERNEL);
 
 	return 0;
@@ -1369,7 +1373,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
 #endif
 #endif
 
-	mm_shrinker = shrinker_alloc(0, "drm-ttm_pool");
+	mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
 	if (!mm_shrinker)
 		return -ENOMEM;
 
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 07/18] ttm/pool: track allocated_pages per numa node.
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (5 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 06/18] ttm/pool: make pool shrinker NUMA aware Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14  5:18 ` [PATCH 08/18] memcg: add support for GPU page counters. (v2) Dave Airlie
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

This gets the memory sizes from the nodes and stores the per-node
limit as 50% of each. I think eventually we should drop the limits
once we have memcg-aware shrinking, but this should be more NUMA
friendly, and seems like what people would prefer to happen on
NUMA-aware systems.
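
For example (assuming 4 KiB pages), a node with 64 GiB of managed memory
gets a default pool limit of 32 GiB, i.e. 8388608 pages; setting the
page_pool_size module parameter still overrides this on every node.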

Cc: Christian Koenig <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_pool.c | 57 +++++++++++++++++++++++++---------
 1 file changed, 43 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index c0391ccd179b..6f44654ea743 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -115,10 +115,11 @@ struct ttm_pool_tt_restore {
 
 static unsigned long page_pool_size;
 
-MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool");
+MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool per NUMA node");
 module_param(page_pool_size, ulong, 0644);
 
-static atomic_long_t allocated_pages;
+static unsigned long pool_node_limit[MAX_NUMNODES];
+static atomic_long_t allocated_pages[MAX_NUMNODES];
 
 static struct ttm_pool_type global_write_combined[NR_PAGE_ORDERS];
 static struct ttm_pool_type global_uncached[NR_PAGE_ORDERS];
@@ -303,8 +304,8 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
 	rcu_read_lock();
 	list_lru_add(&pt->pages, &p->lru, nid, NULL);
 	rcu_read_unlock();
-	atomic_long_add(1 << pt->order, &allocated_pages);
 
+	atomic_long_add(num_pages, &allocated_pages[nid]);
 	mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, -num_pages);
 	mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, num_pages);
 }
@@ -330,7 +331,7 @@ static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid)
 
 	ret = list_lru_walk_node(&pt->pages, nid, take_one_from_lru, (void *)&p, &nr_to_walk);
 	if (ret == 1 && p) {
-		atomic_long_sub(1 << pt->order, &allocated_pages);
+		atomic_long_sub(1 << pt->order, &allocated_pages[nid]);
 		mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, (1 << pt->order));
 		mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, -(1 << pt->order));
 	}
@@ -369,7 +370,7 @@ static void ttm_pool_dispose_list(struct ttm_pool_type *pt,
 		struct page *p;
 		p = list_first_entry(dispose, struct page, lru);
 		list_del_init(&p->lru);
-		atomic_long_sub(1 << pt->order, &allocated_pages);
+		atomic_long_sub(1 << pt->order, &allocated_pages[page_to_nid(p)]);
 		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
 	}
 }
@@ -927,11 +928,13 @@ int ttm_pool_restore_and_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
  */
 void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
 {
+	int nid = ttm_pool_nid(pool);
+
 	ttm_pool_free_range(pool, tt, tt->caching, 0, tt->num_pages);
 
-	while (atomic_long_read(&allocated_pages) > page_pool_size) {
-		unsigned long diff = page_pool_size - atomic_long_read(&allocated_pages);
-		ttm_pool_shrink(ttm_pool_nid(pool), diff);
+	while (atomic_long_read(&allocated_pages[nid]) > pool_node_limit[nid]) {
+		unsigned long diff = pool_node_limit[nid] - atomic_long_read(&allocated_pages[nid]);
+		ttm_pool_shrink(nid, diff);
 	}
 }
 EXPORT_SYMBOL(ttm_pool_free);
@@ -1191,7 +1194,7 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
 	do
 		num_freed += ttm_pool_shrink(sc->nid, sc->nr_to_scan);
 	while (num_freed < sc->nr_to_scan &&
-	       atomic_long_read(&allocated_pages));
+	       atomic_long_read(&allocated_pages[sc->nid]));
 
 	sc->nr_scanned = num_freed;
 
@@ -1202,7 +1205,7 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
 static unsigned long ttm_pool_shrinker_count(struct shrinker *shrink,
 					     struct shrink_control *sc)
 {
-	unsigned long num_pages = atomic_long_read(&allocated_pages);
+	unsigned long num_pages = atomic_long_read(&allocated_pages[sc->nid]);
 
 	return num_pages ? num_pages : SHRINK_EMPTY;
 }
@@ -1239,8 +1242,12 @@ static void ttm_pool_debugfs_orders(struct ttm_pool_type *pt,
 /* Dump the total amount of allocated pages */
 static void ttm_pool_debugfs_footer(struct seq_file *m)
 {
-	seq_printf(m, "\ntotal\t: %8lu of %8lu\n",
-		   atomic_long_read(&allocated_pages), page_pool_size);
+	int nid;
+
+	for_each_node(nid) {
+		seq_printf(m, "\ntotal node%d\t: %8lu of %8lu\n", nid,
+			   atomic_long_read(&allocated_pages[nid]), pool_node_limit[nid]);
+	}
 }
 
 /* Dump the information for the global pools */
@@ -1334,6 +1341,22 @@ DEFINE_SHOW_ATTRIBUTE(ttm_pool_debugfs_shrink);
 
 #endif
 
+static inline uint64_t ttm_get_node_memory_size(int nid)
+{
+	/* This is directly using si_meminfo_node implementation as the
+	 * function is not exported.
+	 */
+	int zone_type;
+	uint64_t managed_pages = 0;
+
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
+		managed_pages +=
+			zone_managed_pages(&pgdat->node_zones[zone_type]);
+	return managed_pages * PAGE_SIZE;
+}
+
 /**
  * ttm_pool_mgr_init - Initialize globals
  *
@@ -1345,8 +1368,14 @@ int ttm_pool_mgr_init(unsigned long num_pages)
 {
 	unsigned int i;
 
-	if (!page_pool_size)
-		page_pool_size = num_pages;
+	int nid;
+	for_each_node(nid) {
+		if (!page_pool_size) {
+			uint64_t node_size = ttm_get_node_memory_size(nid);
+			pool_node_limit[nid] = (node_size >> PAGE_SHIFT) / 2;
+		} else
+			pool_node_limit[nid] = page_pool_size;
+	}
 
 	spin_lock_init(&shrinker_lock);
 	INIT_LIST_HEAD(&shrinker_list);
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 08/18] memcg: add support for GPU page counters. (v2)
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (6 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 07/18] ttm/pool: track allocated_pages per numa node Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14  5:18 ` [PATCH 09/18] memcg: export memcg_list_lru_alloc Dave Airlie
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

This introduces 2 new statistics and 3 new memcontrol APIs for dealing
with GPU system memory allocations.

The stats correspond to the same stats in the global vmstat:
the number of active GPU pages, and the number of pages in pools
that can be reclaimed.

The first API charges an order of pages to an objcg, sets the
objcg on the pages like kmem does, and updates the active/reclaim
statistic.

The second API uncharges a page from the obj cgroup it is currently charged
to.

The third API allows moving a page to/from reclaim and between obj cgroups.
When pages are added to the pool lru, this just updates accounting.
When pages are being removed from a pool lru, they can be taken from
the parent objcg so this allows them to be uncharged from there and transferred
to a new child objcg.
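
A rough sketch of how a caller is expected to combine these (hypothetical
driver context, error handling trimmed): charge on allocation, flip the
accounting to the reclaim counter when a page goes into a reuse pool, and
uncharge on final free:

    #include <linux/memcontrol.h>
    #include <linux/gfp.h>

    static int example_alloc(struct obj_cgroup *objcg, struct page *page,
                             unsigned int order)
    {
            /* charge as active GPU memory; returns false if over the limit */
            if (!mem_cgroup_charge_gpu_page(objcg, page, order, GFP_KERNEL, false))
                    return -ENOMEM;
            return 0;
    }

    static void example_to_pool(struct page *page, unsigned int order)
    {
            /* page kept for reuse: move accounting from active to reclaim */
            mem_cgroup_move_gpu_page_reclaim(NULL, page, order, true);
    }

    static void example_final_free(struct page *page, unsigned int order)
    {
            /* page leaves the pool for good: drop the reclaim charge */
            mem_cgroup_uncharge_gpu_page(page, order, true);
    }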

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
v2: use memcg_node_stat_items
---
 Documentation/admin-guide/cgroup-v2.rst |   6 ++
 include/linux/memcontrol.h              |  12 +++
 mm/memcontrol.c                         | 105 ++++++++++++++++++++++++
 3 files changed, 123 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 0cc35a14afbe..28088c4e52d3 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1542,6 +1542,12 @@ The following nested keys are defined.
 	  vmalloc (npn)
 		Amount of memory used for vmap backed memory.
 
+	  gpu_active (npn)
+		Amount of system memory used for GPU devices.
+
+	  gpu_reclaim (npn)
+		Amount of system memory cached for GPU devices.
+
 	  shmem
 		Amount of cached filesystem data that is swap-backed,
 		such as tmpfs, shm segments, shared anonymous mmap()s
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87b6688f124a..21328f207d38 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1597,6 +1597,18 @@ struct sock;
 bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
 			     gfp_t gfp_mask);
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
+
+bool mem_cgroup_charge_gpu_page(struct obj_cgroup *objcg, struct page *page,
+			   unsigned int nr_pages,
+			   gfp_t gfp_mask, bool reclaim);
+void mem_cgroup_uncharge_gpu_page(struct page *page,
+				  unsigned int nr_pages,
+				  bool reclaim);
+bool mem_cgroup_move_gpu_page_reclaim(struct obj_cgroup *objcg,
+				      struct page *page,
+				      unsigned int order,
+				      bool to_reclaim);
+
 #ifdef CONFIG_MEMCG
 extern struct static_key_false memcg_sockets_enabled_key;
 #define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 902da8a9c643..4c8ded9501c6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -330,6 +330,8 @@ static const unsigned int memcg_node_stat_items[] = {
 #ifdef CONFIG_HUGETLB_PAGE
 	NR_HUGETLB,
 #endif
+	NR_GPU_ACTIVE,
+	NR_GPU_RECLAIM,
 };
 
 static const unsigned int memcg_stat_items[] = {
@@ -1345,6 +1347,8 @@ static const struct memory_stat memory_stats[] = {
 	{ "percpu",			MEMCG_PERCPU_B			},
 	{ "sock",			MEMCG_SOCK			},
 	{ "vmalloc",			MEMCG_VMALLOC			},
+	{ "gpu_active",			NR_GPU_ACTIVE			},
+	{ "gpu_reclaim",		NR_GPU_RECLAIM	                },
 	{ "shmem",			NR_SHMEM			},
 #ifdef CONFIG_ZSWAP
 	{ "zswap",			MEMCG_ZSWAP_B			},
@@ -5132,6 +5136,107 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 	refill_stock(memcg, nr_pages);
 }
 
+/**
+ * mem_cgroup_charge_gpu_page - charge a page to GPU memory tracking
+ * @objcg: objcg to charge, NULL charges root memcg
+ * @page: page to charge
+ * @order: page allocation order
+ * @gfp_mask: gfp mode
+ * @reclaim: charge the reclaim counter instead of the active one.
+ *
+ * Charge the order sized @page to the objcg. Returns %true if the charge fit within
+ * @objcg's configured limit, %false if it doesn't.
+ */
+bool mem_cgroup_charge_gpu_page(struct obj_cgroup *objcg, struct page *page,
+				unsigned int order, gfp_t gfp_mask, bool reclaim)
+{
+	unsigned int nr_pages = 1 << order;
+	struct mem_cgroup *memcg = NULL;
+	struct lruvec *lruvec;
+	int ret;
+
+	if (objcg) {
+		memcg = get_mem_cgroup_from_objcg(objcg);
+
+		ret = try_charge_memcg(memcg, gfp_mask, nr_pages);
+		if (ret) {
+			css_put(&memcg->css);
+			return false;
+		}
+
+		obj_cgroup_get(objcg);
+		page_set_objcg(page, objcg);
+	}
+
+	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
+	mod_lruvec_state(lruvec, reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE, nr_pages);
+
+	mem_cgroup_put(memcg);
+	return true;
+}
+EXPORT_SYMBOL_GPL(mem_cgroup_charge_gpu_page);
+
+/**
+ * mem_cgroup_uncharge_gpu_page - uncharge a page from GPU memory tracking
+ * @page: page to uncharge
+ * @order: order of the page allocation
+ * @reclaim: uncharge the reclaim counter instead of the active.
+ */
+void mem_cgroup_uncharge_gpu_page(struct page *page,
+				  unsigned int order, bool reclaim)
+{
+	struct obj_cgroup *objcg = page_objcg(page);
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+	int nr_pages = 1 << order;
+
+	memcg = objcg ? get_mem_cgroup_from_objcg(objcg) : NULL;
+
+	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
+	mod_lruvec_state(lruvec, reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE, -nr_pages);
+
+	page->memcg_data = 0;
+	obj_cgroup_put(objcg);
+	mem_cgroup_put(memcg);
+}
+EXPORT_SYMBOL_GPL(mem_cgroup_uncharge_gpu_page);
+
+/**
+ * mem_cgroup_move_gpu_reclaim - move pages from gpu to gpu reclaim and back
+ * @new_objcg: objcg to move page to, NULL if just stats update.
+ * @nr_pages: number of pages to move
+ * @to_reclaim: true moves pages into reclaim, false moves them back
+ */
+bool mem_cgroup_move_gpu_page_reclaim(struct obj_cgroup *new_objcg,
+				      struct page *page,
+				      unsigned int order,
+				      bool to_reclaim)
+{
+	struct obj_cgroup *objcg = page_objcg(page);
+
+	if (!objcg)
+		return false;
+
+	if (!new_objcg || objcg == new_objcg) {
+		struct mem_cgroup *memcg = get_mem_cgroup_from_objcg(objcg);
+		struct lruvec *lruvec;
+		unsigned long flags;
+		int nr_pages = 1 << order;
+
+		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
+		local_irq_save(flags);
+		__mod_lruvec_state(lruvec, to_reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE, nr_pages);
+		__mod_lruvec_state(lruvec, to_reclaim ? NR_GPU_ACTIVE : NR_GPU_RECLAIM, -nr_pages);
+		local_irq_restore(flags);
+		mem_cgroup_put(memcg);
+		return true;
+	} else {
+		mem_cgroup_uncharge_gpu_page(page, order, true);
+		return mem_cgroup_charge_gpu_page(new_objcg, page, order, 0, false);
+	}
+}
+EXPORT_SYMBOL_GPL(mem_cgroup_move_gpu_page_reclaim);
+
 static int __init cgroup_memory(char *s)
 {
 	char *token;
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 09/18] memcg: export memcg_list_lru_alloc.
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (7 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 08/18] memcg: add support for GPU page counters. (v2) Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14  5:18 ` [PATCH 10/18] ttm: add a memcg accounting flag to the alloc/populate APIs Dave Airlie
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

This is needed to use a memcg-aware list_lru from a module. drm/ttm
wants to use this interface.

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 mm/list_lru.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 315362e3df3d..2892c1d945dd 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -558,6 +558,7 @@ int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
 
 	return xas_error(&xas);
 }
+EXPORT_SYMBOL(memcg_list_lru_alloc);
 #else
 static inline void memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
 {
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 10/18] ttm: add a memcg accounting flag to the alloc/populate APIs
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (8 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 09/18] memcg: export memcg_list_lru_alloc Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14  5:18 ` [PATCH 11/18] ttm/pool: initialise the shrinker earlier Dave Airlie
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

This flag does nothing yet; this just changes the APIs across all
users so they can accept it in the future.

The flag will eventually control when a tt populate is accounted
to a memcg.

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c          |  3 ++-
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c          |  5 +++--
 drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c     |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c       |  4 ++--
 drivers/gpu/drm/loongson/lsdc_ttm.c              |  3 ++-
 drivers/gpu/drm/nouveau/nouveau_bo.c             |  6 ++++--
 drivers/gpu/drm/radeon/radeon_ttm.c              |  3 ++-
 drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c |  2 +-
 drivers/gpu/drm/ttm/tests/ttm_pool_test.c        | 16 ++++++++--------
 drivers/gpu/drm/ttm/tests/ttm_tt_test.c          | 12 ++++++------
 drivers/gpu/drm/ttm/ttm_bo.c                     |  5 +++--
 drivers/gpu/drm/ttm/ttm_bo_util.c                |  6 +++---
 drivers/gpu/drm/ttm/ttm_bo_vm.c                  |  4 +++-
 drivers/gpu/drm/ttm/ttm_pool.c                   |  6 ++++--
 drivers/gpu/drm/ttm/ttm_tt.c                     |  8 +++++---
 drivers/gpu/drm/vmwgfx/vmwgfx_blit.c             |  4 ++--
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c       |  7 ++++---
 drivers/gpu/drm/xe/xe_bo.c                       |  3 ++-
 include/drm/ttm/ttm_bo.h                         |  1 +
 include/drm/ttm/ttm_device.h                     |  1 +
 include/drm/ttm/ttm_pool.h                       |  1 +
 include/drm/ttm/ttm_tt.h                         |  1 +
 22 files changed, 61 insertions(+), 42 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 9c5df35f05b7..920b412156dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1138,6 +1138,7 @@ static struct ttm_tt *amdgpu_ttm_tt_create(struct ttm_buffer_object *bo,
  */
 static int amdgpu_ttm_tt_populate(struct ttm_device *bdev,
 				  struct ttm_tt *ttm,
+				  bool memcg_account,
 				  struct ttm_operation_ctx *ctx)
 {
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bdev);
@@ -1161,7 +1162,7 @@ static int amdgpu_ttm_tt_populate(struct ttm_device *bdev,
 		pool = &adev->mman.ttm_pools[gtt->pool_id];
 	else
 		pool = &adev->mman.bdev.pool;
-	ret = ttm_pool_alloc(pool, ttm, ctx);
+	ret = ttm_pool_alloc(pool, ttm, memcg_account, ctx);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
index 1f4814968868..6cdaf3696583 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
@@ -314,6 +314,7 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
 
 static int i915_ttm_tt_populate(struct ttm_device *bdev,
 				struct ttm_tt *ttm,
+				bool memcg_account,
 				struct ttm_operation_ctx *ctx)
 {
 	struct i915_ttm_tt *i915_tt = container_of(ttm, typeof(*i915_tt), ttm);
@@ -321,7 +322,7 @@ static int i915_ttm_tt_populate(struct ttm_device *bdev,
 	if (i915_tt->is_shmem)
 		return i915_ttm_tt_shmem_populate(bdev, ttm, ctx);
 
-	return ttm_pool_alloc(&bdev->pool, ttm, ctx);
+	return ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
 }
 
 static void i915_ttm_tt_unpopulate(struct ttm_device *bdev, struct ttm_tt *ttm)
@@ -808,7 +809,7 @@ static int __i915_ttm_get_pages(struct drm_i915_gem_object *obj,
 	}
 
 	if (bo->ttm && !ttm_tt_is_populated(bo->ttm)) {
-		ret = ttm_bo_populate(bo, &ctx);
+		ret = ttm_bo_populate(bo, false, &ctx);
 		if (ret)
 			return ret;
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
index 2f6b33edb9c9..4ab1eb3e42bc 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
@@ -624,7 +624,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, bool evict,
 
 	/* Populate ttm with pages if needed. Typically system memory. */
 	if (ttm && (dst_man->use_tt || (ttm->page_flags & TTM_TT_FLAG_SWAPPED))) {
-		ret = ttm_bo_populate(bo, ctx);
+		ret = ttm_bo_populate(bo, false, ctx);
 		if (ret)
 			return ret;
 	}
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c
index 61596cecce4d..0b555979d786 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c
@@ -90,7 +90,7 @@ static int i915_ttm_backup(struct i915_gem_apply_to_region *apply,
 		goto out_no_lock;
 
 	backup_bo = i915_gem_to_ttm(backup);
-	err = ttm_bo_populate(backup_bo, &ctx);
+	err = ttm_bo_populate(backup_bo, false, &ctx);
 	if (err)
 		goto out_no_populate;
 
@@ -189,7 +189,7 @@ static int i915_ttm_restore(struct i915_gem_apply_to_region *apply,
 	if (!backup_bo->resource)
 		err = ttm_bo_validate(backup_bo, i915_ttm_sys_placement(), &ctx);
 	if (!err)
-		err = ttm_bo_populate(backup_bo, &ctx);
+		err = ttm_bo_populate(backup_bo, false, &ctx);
 	if (!err) {
 		err = i915_gem_obj_copy_ttm(obj, backup, pm_apply->allow_gpu,
 					    false);
diff --git a/drivers/gpu/drm/loongson/lsdc_ttm.c b/drivers/gpu/drm/loongson/lsdc_ttm.c
index 2e42c6970c9f..6d8781506802 100644
--- a/drivers/gpu/drm/loongson/lsdc_ttm.c
+++ b/drivers/gpu/drm/loongson/lsdc_ttm.c
@@ -110,6 +110,7 @@ lsdc_ttm_tt_create(struct ttm_buffer_object *tbo, uint32_t page_flags)
 
 static int lsdc_ttm_tt_populate(struct ttm_device *bdev,
 				struct ttm_tt *ttm,
+				bool memcg_account,
 				struct ttm_operation_ctx *ctx)
 {
 	bool slave = !!(ttm->page_flags & TTM_TT_FLAG_EXTERNAL);
@@ -122,7 +123,7 @@ static int lsdc_ttm_tt_populate(struct ttm_device *bdev,
 		return 0;
 	}
 
-	return ttm_pool_alloc(&bdev->pool, ttm, ctx);
+	return ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
 }
 
 static void lsdc_ttm_tt_unpopulate(struct ttm_device *bdev,
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index b96f0555ca14..1f2b9f5f2bf8 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -1417,7 +1417,9 @@ vm_fault_t nouveau_ttm_fault_reserve_notify(struct ttm_buffer_object *bo)
 
 static int
 nouveau_ttm_tt_populate(struct ttm_device *bdev,
-			struct ttm_tt *ttm, struct ttm_operation_ctx *ctx)
+			struct ttm_tt *ttm,
+			bool memcg_account,
+			struct ttm_operation_ctx *ctx)
 {
 	struct ttm_tt *ttm_dma = (void *)ttm;
 	struct nouveau_drm *drm;
@@ -1434,7 +1436,7 @@ nouveau_ttm_tt_populate(struct ttm_device *bdev,
 
 	drm = nouveau_bdev(bdev);
 
-	return ttm_pool_alloc(&drm->ttm.bdev.pool, ttm, ctx);
+	return ttm_pool_alloc(&drm->ttm.bdev.pool, ttm, memcg_account, ctx);
 }
 
 static void
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 616d25c8c2de..8c4273239d16 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -526,6 +526,7 @@ static struct radeon_ttm_tt *radeon_ttm_tt_to_gtt(struct radeon_device *rdev,
 
 static int radeon_ttm_tt_populate(struct ttm_device *bdev,
 				  struct ttm_tt *ttm,
+				  bool memcg_account,
 				  struct ttm_operation_ctx *ctx)
 {
 	struct radeon_device *rdev = radeon_get_rdev(bdev);
@@ -547,7 +548,7 @@ static int radeon_ttm_tt_populate(struct ttm_device *bdev,
 		return 0;
 	}
 
-	return ttm_pool_alloc(&rdev->mman.bdev.pool, ttm, ctx);
+	return ttm_pool_alloc(&rdev->mman.bdev.pool, ttm, memcg_account, ctx);
 }
 
 static void radeon_ttm_tt_unpopulate(struct ttm_device *bdev, struct ttm_tt *ttm)
diff --git a/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c b/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
index 3148f5d3dbd6..b52e3c1089e6 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
@@ -538,7 +538,7 @@ static void ttm_bo_validate_no_placement_signaled(struct kunit *test)
 
 	if (params->with_ttm) {
 		old_tt = priv->ttm_dev->funcs->ttm_tt_create(bo, 0);
-		ttm_pool_alloc(&priv->ttm_dev->pool, old_tt, &ctx);
+		ttm_pool_alloc(&priv->ttm_dev->pool, old_tt, false, &ctx);
 		bo->ttm = old_tt;
 	}
 
diff --git a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
index 39234a3e98c4..aaf152c2383d 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
@@ -88,7 +88,7 @@ static struct ttm_pool *ttm_pool_pre_populated(struct kunit *test,
 
 	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
 
-	err = ttm_pool_alloc(pool, tt, &simple_ctx);
+	err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	ttm_pool_free(pool, tt);
@@ -157,7 +157,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
 	KUNIT_ASSERT_EQ(test, pool->nid, NUMA_NO_NODE);
 	KUNIT_ASSERT_EQ(test, pool->use_dma_alloc, params->use_dma_alloc);
 
-	err = ttm_pool_alloc(pool, tt, &simple_ctx);
+	err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	KUNIT_ASSERT_EQ(test, tt->num_pages, expected_num_pages);
 
@@ -220,7 +220,7 @@ static void ttm_pool_alloc_basic_dma_addr(struct kunit *test)
 
 	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
 
-	err = ttm_pool_alloc(pool, tt, &simple_ctx);
+	err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	KUNIT_ASSERT_EQ(test, tt->num_pages, expected_num_pages);
 
@@ -253,7 +253,7 @@ static void ttm_pool_alloc_order_caching_match(struct kunit *test)
 	tt = ttm_tt_kunit_init(test, 0, caching, size);
 	KUNIT_ASSERT_NOT_NULL(test, tt);
 
-	err = ttm_pool_alloc(pool, tt, &simple_ctx);
+	err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
@@ -285,7 +285,7 @@ static void ttm_pool_alloc_caching_mismatch(struct kunit *test)
 	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
 	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt_tt->pages));
 
-	err = ttm_pool_alloc(pool, tt, &simple_ctx);
+	err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	ttm_pool_free(pool, tt);
@@ -319,7 +319,7 @@ static void ttm_pool_alloc_order_mismatch(struct kunit *test)
 	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
 	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt_tt->pages));
 
-	err = ttm_pool_alloc(pool, tt, &simple_ctx);
+	err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	ttm_pool_free(pool, tt);
@@ -349,7 +349,7 @@ static void ttm_pool_free_dma_alloc(struct kunit *test)
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 
 	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
-	ttm_pool_alloc(pool, tt, &simple_ctx);
+	ttm_pool_alloc(pool, tt, false, &simple_ctx);
 
 	pt = &pool->caching[caching].orders[order];
 	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
@@ -380,7 +380,7 @@ static void ttm_pool_free_no_dma_alloc(struct kunit *test)
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 
 	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false);
-	ttm_pool_alloc(pool, tt, &simple_ctx);
+	ttm_pool_alloc(pool, tt, false, &simple_ctx);
 
 	pt = &pool->caching[caching].orders[order];
 	KUNIT_ASSERT_TRUE(test, list_lru_count(&pt->pages) == 1);
diff --git a/drivers/gpu/drm/ttm/tests/ttm_tt_test.c b/drivers/gpu/drm/ttm/tests/ttm_tt_test.c
index 61ec6f580b62..333c503e218b 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_tt_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_tt_test.c
@@ -262,7 +262,7 @@ static void ttm_tt_populate_null_ttm(struct kunit *test)
 	struct ttm_operation_ctx ctx = { };
 	int err;
 
-	err = ttm_tt_populate(devs->ttm_dev, NULL, &ctx);
+	err = ttm_tt_populate(devs->ttm_dev, NULL, false, &ctx);
 	KUNIT_ASSERT_EQ(test, err, -EINVAL);
 }
 
@@ -283,11 +283,11 @@ static void ttm_tt_populate_populated_ttm(struct kunit *test)
 	err = ttm_tt_init(tt, bo, 0, ttm_cached, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
-	err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
+	err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	populated_page = *tt->pages;
 
-	err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
+	err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
 	KUNIT_ASSERT_PTR_EQ(test, populated_page, *tt->pages);
 }
 
@@ -307,7 +307,7 @@ static void ttm_tt_unpopulate_basic(struct kunit *test)
 	err = ttm_tt_init(tt, bo, 0, ttm_cached, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
-	err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
+	err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	KUNIT_ASSERT_TRUE(test, ttm_tt_is_populated(tt));
 
@@ -351,7 +351,7 @@ static void ttm_tt_swapin_basic(struct kunit *test)
 	err = ttm_tt_init(tt, bo, 0, ttm_cached, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
-	err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
+	err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	KUNIT_ASSERT_TRUE(test, ttm_tt_is_populated(tt));
 
@@ -361,7 +361,7 @@ static void ttm_tt_swapin_basic(struct kunit *test)
 	KUNIT_ASSERT_TRUE(test, tt->page_flags & TTM_TT_FLAG_SWAPPED);
 
 	/* Swapout depopulates TT, allocate pages and then swap them in */
-	err = ttm_pool_alloc(&devs->ttm_dev->pool, tt, &ctx);
+	err = ttm_pool_alloc(&devs->ttm_dev->pool, tt, false, &ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	err = ttm_tt_swapin(tt);
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index f4d9e68b21e7..af04bb8e2c2a 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -142,7 +142,7 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
 			goto out_err;
 
 		if (mem->mem_type != TTM_PL_SYSTEM) {
-			ret = ttm_bo_populate(bo, ctx);
+			ret = ttm_bo_populate(bo, false, ctx);
 			if (ret)
 				goto out_err;
 		}
@@ -1256,6 +1256,7 @@ void ttm_bo_tt_destroy(struct ttm_buffer_object *bo)
  * is set to true.
  */
 int ttm_bo_populate(struct ttm_buffer_object *bo,
+		    bool memcg_account,
 		    struct ttm_operation_ctx *ctx)
 {
 	struct ttm_tt *tt = bo->ttm;
@@ -1268,7 +1269,7 @@ int ttm_bo_populate(struct ttm_buffer_object *bo,
 		return 0;
 
 	swapped = ttm_tt_is_swapped(tt);
-	ret = ttm_tt_populate(bo->bdev, tt, ctx);
+	ret = ttm_tt_populate(bo->bdev, tt, memcg_account, ctx);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 250675d56b1c..764d1cf1ecbe 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -167,7 +167,7 @@ int ttm_bo_move_memcpy(struct ttm_buffer_object *bo,
 	src_man = ttm_manager_type(bdev, src_mem->mem_type);
 	if (ttm && ((ttm->page_flags & TTM_TT_FLAG_SWAPPED) ||
 		    dst_man->use_tt)) {
-		ret = ttm_bo_populate(bo, ctx);
+		ret = ttm_bo_populate(bo, false, ctx);
 		if (ret)
 			return ret;
 	}
@@ -354,7 +354,7 @@ static int ttm_bo_kmap_ttm(struct ttm_buffer_object *bo,
 
 	BUG_ON(!ttm);
 
-	ret = ttm_bo_populate(bo, &ctx);
+	ret = ttm_bo_populate(bo, false, &ctx);
 	if (ret)
 		return ret;
 
@@ -511,7 +511,7 @@ int ttm_bo_vmap(struct ttm_buffer_object *bo, struct iosys_map *map)
 		pgprot_t prot;
 		void *vaddr;
 
-		ret = ttm_bo_populate(bo, &ctx);
+		ret = ttm_bo_populate(bo, false, &ctx);
 		if (ret)
 			return ret;
 
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index b47020fca199..c5ad447debe3 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -225,7 +225,9 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
 		};
 
 		ttm = bo->ttm;
-		err = ttm_bo_populate(bo, &ctx);
+		err = ttm_bo_populate(bo,
+				      false,
+				      &ctx);
 		if (err) {
 			if (err == -EINTR || err == -ERESTARTSYS ||
 			    err == -EAGAIN)
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 6f44654ea743..63eb81b18987 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -743,6 +743,7 @@ static unsigned int ttm_pool_alloc_find_order(unsigned int highest,
 }
 
 static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
+			    bool memcg_account,
 			    const struct ttm_operation_ctx *ctx,
 			    struct ttm_pool_alloc_state *alloc,
 			    struct ttm_pool_tt_restore *restore)
@@ -853,6 +854,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
  * Returns: 0 on success, negative error code otherwise.
  */
 int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
+		   bool memcg_account,
 		   struct ttm_operation_ctx *ctx)
 {
 	struct ttm_pool_alloc_state alloc;
@@ -862,7 +864,7 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 
 	ttm_pool_alloc_state_init(tt, &alloc);
 
-	return __ttm_pool_alloc(pool, tt, ctx, &alloc, NULL);
+	return __ttm_pool_alloc(pool, tt, memcg_account, ctx, &alloc, NULL);
 }
 EXPORT_SYMBOL(ttm_pool_alloc);
 
@@ -915,7 +917,7 @@ int ttm_pool_restore_and_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 			return 0;
 	}
 
-	return __ttm_pool_alloc(pool, tt, ctx, &alloc, tt->restore);
+	return __ttm_pool_alloc(pool, tt, false, ctx, &alloc, tt->restore);
 }
 
 /**
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 506e257dfba8..8f38de3b2f1c 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -366,7 +366,9 @@ int ttm_tt_swapout(struct ttm_device *bdev, struct ttm_tt *ttm,
 EXPORT_SYMBOL_FOR_TESTS_ONLY(ttm_tt_swapout);
 
 int ttm_tt_populate(struct ttm_device *bdev,
-		    struct ttm_tt *ttm, struct ttm_operation_ctx *ctx)
+		    struct ttm_tt *ttm,
+		    bool memcg_account,
+		    struct ttm_operation_ctx *ctx)
 {
 	int ret;
 
@@ -395,9 +397,9 @@ int ttm_tt_populate(struct ttm_device *bdev,
 	}
 
 	if (bdev->funcs->ttm_tt_populate)
-		ret = bdev->funcs->ttm_tt_populate(bdev, ttm, ctx);
+		ret = bdev->funcs->ttm_tt_populate(bdev, ttm, memcg_account, ctx);
 	else
-		ret = ttm_pool_alloc(&bdev->pool, ttm, ctx);
+		ret = ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
 	if (ret)
 		goto error;
 
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c b/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
index fa5841fda659..a4d4ebf585fe 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
@@ -569,13 +569,13 @@ int vmw_bo_cpu_blit(struct vmw_bo *vmw_dst,
 		dma_resv_assert_held(src->base.resv);
 
 	if (!ttm_tt_is_populated(dst->ttm)) {
-		ret = dst->bdev->funcs->ttm_tt_populate(dst->bdev, dst->ttm, &ctx);
+		ret = dst->bdev->funcs->ttm_tt_populate(dst->bdev, dst->ttm, false, &ctx);
 		if (ret)
 			return ret;
 	}
 
 	if (!ttm_tt_is_populated(src->ttm)) {
-		ret = src->bdev->funcs->ttm_tt_populate(src->bdev, src->ttm, &ctx);
+		ret = src->bdev->funcs->ttm_tt_populate(src->bdev, src->ttm, false, &ctx);
 		if (ret)
 			return ret;
 	}
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
index 5553892d7c3e..2351dafc1c68 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
@@ -360,7 +360,8 @@ static void vmw_ttm_destroy(struct ttm_device *bdev, struct ttm_tt *ttm)
 
 
 static int vmw_ttm_populate(struct ttm_device *bdev,
-			    struct ttm_tt *ttm, struct ttm_operation_ctx *ctx)
+			    struct ttm_tt *ttm, bool memcg_account,
+			    struct ttm_operation_ctx *ctx)
 {
 	bool external = (ttm->page_flags & TTM_TT_FLAG_EXTERNAL) != 0;
 
@@ -372,7 +373,7 @@ static int vmw_ttm_populate(struct ttm_device *bdev,
 						       ttm->dma_address,
 						       ttm->num_pages);
 
-	return ttm_pool_alloc(&bdev->pool, ttm, ctx);
+	return ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
 }
 
 static void vmw_ttm_unpopulate(struct ttm_device *bdev,
@@ -580,7 +581,7 @@ int vmw_bo_create_and_populate(struct vmw_private *dev_priv,
 	if (unlikely(ret != 0))
 		return ret;
 
-	ret = vmw_ttm_populate(vbo->tbo.bdev, vbo->tbo.ttm, &ctx);
+	ret = vmw_ttm_populate(vbo->tbo.bdev, vbo->tbo.ttm, false, &ctx);
 	if (likely(ret == 0)) {
 		struct vmw_ttm_tt *vmw_tt =
 			container_of(vbo->tbo.ttm, struct vmw_ttm_tt, dma_ttm);
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 7aa2c17825da..522cbff11563 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -504,6 +504,7 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
 }
 
 static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
+			      bool memcg_account,
 			      struct ttm_operation_ctx *ctx)
 {
 	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
@@ -521,7 +522,7 @@ static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
 		err = ttm_tt_restore(ttm_dev, tt, ctx);
 	} else {
 		ttm_tt_clear_backed_up(tt);
-		err = ttm_pool_alloc(&ttm_dev->pool, tt, ctx);
+		err = ttm_pool_alloc(&ttm_dev->pool, tt, memcg_account, ctx);
 	}
 	if (err)
 		return err;
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index 894ff7ccd68e..099dc2604baa 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -464,6 +464,7 @@ pgprot_t ttm_io_prot(struct ttm_buffer_object *bo, struct ttm_resource *res,
 		     pgprot_t tmp);
 void ttm_bo_tt_destroy(struct ttm_buffer_object *bo);
 int ttm_bo_populate(struct ttm_buffer_object *bo,
+		    bool memcg_account,
 		    struct ttm_operation_ctx *ctx);
 
 /* Driver LRU walk helpers initially targeted for shrinking. */
diff --git a/include/drm/ttm/ttm_device.h b/include/drm/ttm/ttm_device.h
index 39b8636b1845..903ca40ebf92 100644
--- a/include/drm/ttm/ttm_device.h
+++ b/include/drm/ttm/ttm_device.h
@@ -84,6 +84,7 @@ struct ttm_device_funcs {
 	 */
 	int (*ttm_tt_populate)(struct ttm_device *bdev,
 			       struct ttm_tt *ttm,
+			       bool memcg_account,
 			       struct ttm_operation_ctx *ctx);
 
 	/**
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index df56527c4853..da5b94226203 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -79,6 +79,7 @@ struct ttm_pool {
 };
 
 int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
+		   bool memcg_account,
 		   struct ttm_operation_ctx *ctx);
 void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);
 
diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
index 406437ad674b..15d4019685f6 100644
--- a/include/drm/ttm/ttm_tt.h
+++ b/include/drm/ttm/ttm_tt.h
@@ -250,6 +250,7 @@ int ttm_tt_swapout(struct ttm_device *bdev, struct ttm_tt *ttm,
  * Calls the driver method to allocate pages for a ttm
  */
 int ttm_tt_populate(struct ttm_device *bdev, struct ttm_tt *ttm,
+		    bool memcg_account,
 		    struct ttm_operation_ctx *ctx);
 
 /**
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 11/18] ttm/pool: initialise the shrinker earlier
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (9 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 10/18] ttm: add a memcg accounting flag to the alloc/populate APIs Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14  5:18 ` [PATCH 12/18] ttm: add objcg pointer to bo and tt Dave Airlie
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

Later memcg enablement needs the shrinker initialised before the
list_lru, so just move it for now.
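
The ordering matters because the memcg-aware list_lru setup added later
in the series ties each pool LRU to the shrinker; a minimal sketch of
that dependency (simplified from the pool init paths) is:

	/* ttm_pool_mgr_init(): the shrinker has to exist first ... */
	mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
	if (!mm_shrinker)
		return -ENOMEM;

	/* ... because ttm_pool_type_init() then hands it to each LRU */
	list_lru_init_memcg(&pt->pages, mm_shrinker);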

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_pool.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 63eb81b18987..a4f4e27c6a1f 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -1382,6 +1382,17 @@ int ttm_pool_mgr_init(unsigned long num_pages)
 	spin_lock_init(&shrinker_lock);
 	INIT_LIST_HEAD(&shrinker_list);
 
+	mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
+	if (!mm_shrinker)
+		return -ENOMEM;
+
+	mm_shrinker->count_objects = ttm_pool_shrinker_count;
+	mm_shrinker->scan_objects = ttm_pool_shrinker_scan;
+	mm_shrinker->batch = TTM_SHRINKER_BATCH;
+	mm_shrinker->seeks = 1;
+
+	shrinker_register(mm_shrinker);
+
 	for (i = 0; i < NR_PAGE_ORDERS; ++i) {
 		ttm_pool_type_init(&global_write_combined[i], NULL,
 				   ttm_write_combined, i);
@@ -1404,17 +1415,6 @@ int ttm_pool_mgr_init(unsigned long num_pages)
 #endif
 #endif
 
-	mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
-	if (!mm_shrinker)
-		return -ENOMEM;
-
-	mm_shrinker->count_objects = ttm_pool_shrinker_count;
-	mm_shrinker->scan_objects = ttm_pool_shrinker_scan;
-	mm_shrinker->batch = TTM_SHRINKER_BATCH;
-	mm_shrinker->seeks = 1;
-
-	shrinker_register(mm_shrinker);
-
 	return 0;
 }
 
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 12/18] ttm: add objcg pointer to bo and tt
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (10 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 11/18] ttm/pool: initialise the shrinker earlier Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14  5:18 ` [PATCH 13/18] ttm/pool: enable memcg tracking and shrinker. (v2) Dave Airlie
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

This just adds the obj cgroup pointer to the bo and tt structs, and
copies it from the bo into the tt when the tt is initialised.
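
A driver that wants its system allocations charged is expected to set
the bo pointer before the tt is created, roughly (sketch of what the
amdgpu patch later in the series does):

	/* at bo creation time, before the tt exists */
	bo->objcg = get_obj_cgroup_from_current();	/* NULL means don't charge */

	/* ttm_tt_init_fields() then propagates it into the tt */
	ttm->objcg = bo->objcg;

	/* the driver drops the reference again when it frees the bo */
	obj_cgroup_put(bo->objcg);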

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_tt.c | 1 +
 include/drm/ttm/ttm_bo.h     | 6 ++++++
 include/drm/ttm/ttm_tt.h     | 2 ++
 3 files changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 8f38de3b2f1c..0c54d5e2bfdd 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -162,6 +162,7 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
 	ttm->caching = caching;
 	ttm->restore = NULL;
 	ttm->backup = NULL;
+	ttm->objcg = bo->objcg;
 }
 
 int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index 099dc2604baa..f26ec0a0273f 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -135,6 +135,12 @@ struct ttm_buffer_object {
 	 * reservation lock.
 	 */
 	struct sg_table *sg;
+
+	/**
+	 * @objcg: object cgroup to charge this to if it ends up using system memory.
+	 * NULL means don't charge.
+	 */
+	struct obj_cgroup *objcg;
 };
 
 #define TTM_BO_MAP_IOMEM_MASK 0x80
diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
index 15d4019685f6..c13fea4c2915 100644
--- a/include/drm/ttm/ttm_tt.h
+++ b/include/drm/ttm/ttm_tt.h
@@ -126,6 +126,8 @@ struct ttm_tt {
 	enum ttm_caching caching;
 	/** @restore: Partial restoration from backup state. TTM private */
 	struct ttm_pool_tt_restore *restore;
+	/** @objcg: Object cgroup for this TT allocation */
+	struct obj_cgroup *objcg;
 };
 
 /**
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 13/18] ttm/pool: enable memcg tracking and shrinker. (v2)
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (11 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 12/18] ttm: add objcg pointer to bo and tt Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-15  7:34   ` Christian König
  2025-07-14  5:18 ` [PATCH 14/18] ttm: hook up memcg placement flags Dave Airlie
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

This enables all the backend code to use the list lru in memcg mode,
and sets the shrinker to be memcg aware.

It adds a lookup loop so that when pooled pages end up being reparented
to a parent memcg, a newer child memcg can still find them there and
take them back.
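
As a concrete example of the reuse case the loop is for: two apps are
started in sibling cgroups under a common parent, the first exits and
its pooled pages get reparented to the parent memcg, and the second
app's allocation still finds them by walking upwards. A trimmed sketch
of the lookup added to ttm_pool_type_take() below:

	memcg = get_mem_cgroup_from_objcg(objcg);
	while (!page) {
		/* probe this level's per-memcg LRU */
		if (pool_lru_get_page(pt, nid, &page, objcg, memcg) == 1)
			break;
		if (!memcg)
			break;
		/* miss: reparented pages sit one level up, try there */
		memcg = parent_mem_cgroup(memcg);
		if (!memcg)
			break;
	}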

Signed-off-by: Dave Airlie <airlied@redhat.com>

---
v2: just use the proper stats.
---
 drivers/gpu/drm/ttm/ttm_pool.c | 126 ++++++++++++++++++++++++++-------
 1 file changed, 102 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index a4f4e27c6a1f..1e6da2cc1f06 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -142,7 +142,9 @@ static int ttm_pool_nid(struct ttm_pool *pool) {
 }
 
 /* Allocate pages of size 1 << order with the given gfp_flags */
-static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
+static struct page *ttm_pool_alloc_page(struct ttm_pool *pool,
+					struct obj_cgroup *objcg,
+					gfp_t gfp_flags,
 					unsigned int order)
 {
 	unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
@@ -162,7 +164,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
 		p = alloc_pages_node(pool->nid, gfp_flags, order);
 		if (p) {
 			p->private = order;
-			mod_node_page_state(NODE_DATA(page_to_nid(p)), NR_GPU_ACTIVE, (1 << order));
+			if (!mem_cgroup_charge_gpu_page(objcg, p, order, gfp_flags, false)) {
+				__free_pages(p, order);
+				return NULL;
+			}
 		}
 		return p;
 	}
@@ -213,9 +218,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
 #endif
 
 	if (!pool || !pool->use_dma_alloc) {
-		mod_node_page_state(NODE_DATA(page_to_nid(p)),
-				    reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE,
-				    -(1 << order));
+		mem_cgroup_uncharge_gpu_page(p, order, reclaim);
 		__free_pages(p, order);
 		return;
 	}
@@ -302,12 +305,11 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
 
 	INIT_LIST_HEAD(&p->lru);
 	rcu_read_lock();
-	list_lru_add(&pt->pages, &p->lru, nid, NULL);
+	list_lru_add(&pt->pages, &p->lru, nid, page_memcg_check(p));
 	rcu_read_unlock();
 
 	atomic_long_add(num_pages, &allocated_pages[nid]);
-	mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, -num_pages);
-	mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, num_pages);
+	mem_cgroup_move_gpu_page_reclaim(NULL, p, pt->order, true);
 }
 
 static enum lru_status take_one_from_lru(struct list_head *item,
@@ -322,20 +324,56 @@ static enum lru_status take_one_from_lru(struct list_head *item,
 	return LRU_REMOVED;
 }
 
-/* Take pages from a specific pool_type, return NULL when nothing available */
-static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid)
+static int pool_lru_get_page(struct ttm_pool_type *pt, int nid,
+			     struct page **page_out,
+			     struct obj_cgroup *objcg,
+			     struct mem_cgroup *memcg)
 {
 	int ret;
 	struct page *p = NULL;
 	unsigned long nr_to_walk = 1;
+	unsigned int num_pages = 1 << pt->order;
 
-	ret = list_lru_walk_node(&pt->pages, nid, take_one_from_lru, (void *)&p, &nr_to_walk);
+	ret = list_lru_walk_one(&pt->pages, nid, memcg, take_one_from_lru, (void *)&p, &nr_to_walk);
 	if (ret == 1 && p) {
-		atomic_long_sub(1 << pt->order, &allocated_pages[nid]);
-		mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, (1 << pt->order));
-		mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, -(1 << pt->order));
+		atomic_long_sub(num_pages, &allocated_pages[nid]);
+
+		if (!mem_cgroup_move_gpu_page_reclaim(objcg, p, pt->order, false)) {
+			__free_pages(p, pt->order);
+			p = NULL;
+		}
 	}
-	return p;
+	*page_out = p;
+	return ret;
+}
+
+/* Take pages from a specific pool_type, return NULL when nothing available */
+static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid,
+				       struct obj_cgroup *orig_objcg)
+{
+	struct page *page_out = NULL;
+	int ret;
+	struct mem_cgroup *orig_memcg = orig_objcg ? get_mem_cgroup_from_objcg(orig_objcg) : NULL;
+	struct mem_cgroup *memcg = orig_memcg;
+
+	/*
+	 * Attempt to get a page from the current memcg, but if it hasn't got any in its level,
+	 * go up to the parent and check there. This helps the scenario where multiple apps get
+	 * started into their own cgroup from a common parent and want to reuse the pools.
+	 */
+	while (!page_out) {
+		ret = pool_lru_get_page(pt, nid, &page_out, orig_objcg, memcg);
+		if (ret == 1)
+			break;
+		if (!memcg)
+			break;
+		memcg = parent_mem_cgroup(memcg);
+		if (!memcg)
+			break;
+	}
+
+	mem_cgroup_put(orig_memcg);
+	return page_out;
 }
 
 /* Initialize and add a pool type to the global shrinker list */
@@ -345,7 +383,7 @@ static void ttm_pool_type_init(struct ttm_pool_type *pt, struct ttm_pool *pool,
 	pt->pool = pool;
 	pt->caching = caching;
 	pt->order = order;
-	list_lru_init(&pt->pages);
+	list_lru_init_memcg(&pt->pages, mm_shrinker);
 
 	spin_lock(&shrinker_lock);
 	list_add_tail(&pt->shrinker_list, &shrinker_list);
@@ -388,6 +426,30 @@ static void ttm_pool_type_fini(struct ttm_pool_type *pt)
 	ttm_pool_dispose_list(pt, &dispose);
 }
 
+static int ttm_pool_check_objcg(struct obj_cgroup *objcg)
+{
+#ifdef CONFIG_MEMCG
+	int r = 0;
+	struct mem_cgroup *memcg;
+	if (!objcg)
+		return 0;
+
+	memcg = get_mem_cgroup_from_objcg(objcg);
+	for (unsigned i = 0; i < NR_PAGE_ORDERS; i++) {
+		r = memcg_list_lru_alloc(memcg, &global_write_combined[i].pages, GFP_KERNEL);
+		if (r) {
+			break;
+		}
+		r = memcg_list_lru_alloc(memcg, &global_uncached[i].pages, GFP_KERNEL);
+		if (r) {
+			break;
+		}
+	}
+	mem_cgroup_put(memcg);
+#endif
+	return 0;
+}
+
 /* Return the pool_type to use for the given caching and order */
 static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
 						  enum ttm_caching caching,
@@ -417,7 +479,9 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
 }
 
 /* Free pages using the per-node shrinker list */
-static unsigned int ttm_pool_shrink(int nid, unsigned long num_to_free)
+static unsigned int ttm_pool_shrink(int nid,
+				    struct mem_cgroup *memcg,
+				    unsigned long num_to_free)
 {
 	LIST_HEAD(dispose);
 	struct ttm_pool_type *pt;
@@ -429,7 +493,11 @@ static unsigned int ttm_pool_shrink(int nid, unsigned long num_to_free)
 	list_move_tail(&pt->shrinker_list, &shrinker_list);
 	spin_unlock(&shrinker_lock);
 
-	num_pages = list_lru_walk_node(&pt->pages, nid, pool_move_to_dispose_list, &dispose, &num_to_free);
+	if (!memcg) {
+		num_pages = list_lru_walk_node(&pt->pages, nid, pool_move_to_dispose_list, &dispose, &num_to_free);
+	} else {
+		num_pages = list_lru_walk_one(&pt->pages, nid, memcg, pool_move_to_dispose_list, &dispose, &num_to_free);
+	}
 	num_pages *= 1 << pt->order;
 
 	ttm_pool_dispose_list(pt, &dispose);
@@ -594,6 +662,7 @@ static int ttm_pool_restore_commit(struct ttm_pool_tt_restore *restore,
 			 */
 			ttm_pool_split_for_swap(restore->pool, p);
 			copy_highpage(restore->alloced_page + i, p);
+			p->memcg_data = 0;
 			__free_pages(p, 0);
 		}
 
@@ -755,6 +824,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 	bool allow_pools;
 	struct page *p;
 	int r;
+	struct obj_cgroup *objcg = memcg_account ? tt->objcg : NULL;
 
 	WARN_ON(!alloc->remaining_pages || ttm_tt_is_populated(tt));
 	WARN_ON(alloc->dma_addr && !pool->dev);
@@ -772,6 +842,9 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 
 	page_caching = tt->caching;
 	allow_pools = true;
+
+	ttm_pool_check_objcg(objcg);
+
 	for (order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
 	     alloc->remaining_pages;
 	     order = ttm_pool_alloc_find_order(order, alloc)) {
@@ -781,7 +854,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 		p = NULL;
 		pt = ttm_pool_select_type(pool, page_caching, order);
 		if (pt && allow_pools)
-			p = ttm_pool_type_take(pt, ttm_pool_nid(pool));
+			p = ttm_pool_type_take(pt, ttm_pool_nid(pool), objcg);
 
 		/*
 		 * If that fails or previously failed, allocate from system.
@@ -792,7 +865,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 		if (!p) {
 			page_caching = ttm_cached;
 			allow_pools = false;
-			p = ttm_pool_alloc_page(pool, gfp_flags, order);
+			p = ttm_pool_alloc_page(pool, objcg, gfp_flags, order);
 		}
 		/* If that fails, lower the order if possible and retry. */
 		if (!p) {
@@ -936,7 +1009,7 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
 
 	while (atomic_long_read(&allocated_pages[nid]) > pool_node_limit[nid]) {
 		unsigned long diff = pool_node_limit[nid] - atomic_long_read(&allocated_pages[nid]);
-		ttm_pool_shrink(nid, diff);
+		ttm_pool_shrink(nid, NULL, diff);
 	}
 }
 EXPORT_SYMBOL(ttm_pool_free);
@@ -1056,6 +1129,7 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
 			if (flags->purge) {
 				shrunken += num_pages;
 				page->private = 0;
+				page->memcg_data = 0;
 				__free_pages(page, order);
 				memset(tt->pages + i, 0,
 				       num_pages * sizeof(*tt->pages));
@@ -1192,10 +1266,14 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
 					    struct shrink_control *sc)
 {
 	unsigned long num_freed = 0;
+	int num_pools;
+	spin_lock(&shrinker_lock);
+	num_pools = list_count_nodes(&shrinker_list);
+	spin_unlock(&shrinker_lock);
 
 	do
-		num_freed += ttm_pool_shrink(sc->nid, sc->nr_to_scan);
-	while (num_freed < sc->nr_to_scan &&
+		num_freed += ttm_pool_shrink(sc->nid, sc->memcg, sc->nr_to_scan);
+	while (num_pools-- >= 0 && num_freed < sc->nr_to_scan &&
 	       atomic_long_read(&allocated_pages[sc->nid]));
 
 	sc->nr_scanned = num_freed;
@@ -1382,7 +1460,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
 	spin_lock_init(&shrinker_lock);
 	INIT_LIST_HEAD(&shrinker_list);
 
-	mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
+	mm_shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE, "drm-ttm_pool");
 	if (!mm_shrinker)
 		return -ENOMEM;
 
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 14/18] ttm: hook up memcg placement flags.
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (12 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 13/18] ttm/pool: enable memcg tracking and shrinker. (v2) Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14  5:18 ` [PATCH 15/18] memcontrol: allow objcg api when memcg is config off Dave Airlie
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

This adds a placement flag requesting that any bo using a placement
with this flag set is accounted to the memcg when it is backed by a
system memory allocation.
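
A driver opts in per placement; for example (sketch mirroring what the
amdgpu patch later in the series does for its GTT placement):

	places[c].fpfn = 0;
	places[c].lpfn = 0;
	places[c].mem_type = TTM_PL_TT;
	places[c].flags = TTM_PL_FLAG_MEMCG;	/* charge the backing system pages */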

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c      | 2 +-
 drivers/gpu/drm/ttm/ttm_bo_util.c | 6 +++---
 drivers/gpu/drm/ttm/ttm_bo_vm.c   | 2 +-
 include/drm/ttm/ttm_placement.h   | 3 +++
 4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index af04bb8e2c2a..273757974b9f 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -142,7 +142,7 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
 			goto out_err;
 
 		if (mem->mem_type != TTM_PL_SYSTEM) {
-			ret = ttm_bo_populate(bo, false, ctx);
+			ret = ttm_bo_populate(bo, mem->placement & TTM_PL_FLAG_MEMCG, ctx);
 			if (ret)
 				goto out_err;
 		}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 764d1cf1ecbe..b5521d1bd517 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -167,7 +167,7 @@ int ttm_bo_move_memcpy(struct ttm_buffer_object *bo,
 	src_man = ttm_manager_type(bdev, src_mem->mem_type);
 	if (ttm && ((ttm->page_flags & TTM_TT_FLAG_SWAPPED) ||
 		    dst_man->use_tt)) {
-		ret = ttm_bo_populate(bo, false, ctx);
+		ret = ttm_bo_populate(bo, dst_mem->placement & TTM_PL_FLAG_MEMCG, ctx);
 		if (ret)
 			return ret;
 	}
@@ -354,7 +354,7 @@ static int ttm_bo_kmap_ttm(struct ttm_buffer_object *bo,
 
 	BUG_ON(!ttm);
 
-	ret = ttm_bo_populate(bo, false, &ctx);
+	ret = ttm_bo_populate(bo, mem->placement & TTM_PL_FLAG_MEMCG, &ctx);
 	if (ret)
 		return ret;
 
@@ -511,7 +511,7 @@ int ttm_bo_vmap(struct ttm_buffer_object *bo, struct iosys_map *map)
 		pgprot_t prot;
 		void *vaddr;
 
-		ret = ttm_bo_populate(bo, false, &ctx);
+		ret = ttm_bo_populate(bo, mem->placement & TTM_PL_FLAG_MEMCG, &ctx);
 		if (ret)
 			return ret;
 
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index c5ad447debe3..dddc904f8727 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -226,7 +226,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
 
 		ttm = bo->ttm;
 		err = ttm_bo_populate(bo,
-				      false,
+				      bo->resource->placement & TTM_PL_FLAG_MEMCG,
 				      &ctx);
 		if (err) {
 			if (err == -EINTR || err == -ERESTARTSYS ||
diff --git a/include/drm/ttm/ttm_placement.h b/include/drm/ttm/ttm_placement.h
index b510a4812609..4e9f07d70483 100644
--- a/include/drm/ttm/ttm_placement.h
+++ b/include/drm/ttm/ttm_placement.h
@@ -70,6 +70,9 @@
 /* Placement is only used during eviction */
 #define TTM_PL_FLAG_FALLBACK	(1 << 4)
 
+/* Placement should account mem cgroup */
+#define TTM_PL_FLAG_MEMCG	(1 << 5)
+
 /**
  * struct ttm_place
  *
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 15/18] memcontrol: allow objcg api when memcg is config off.
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (13 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 14/18] ttm: hook up memcg placement flags Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14  5:18 ` [PATCH 16/18] memcontrol: export current_obj_cgroup Dave Airlie
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

amdgpu wants to use the objcg API without having to wrap the calls in
ifdefs, so just add a dummy function for the CONFIG_MEMCG=n path.
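
With the stub in place a caller builds unchanged with CONFIG_MEMCG=n
and simply sees NULL, e.g.:

	bp.objcg = get_obj_cgroup_from_current();	/* NULL when memcg is compiled out */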

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 include/linux/memcontrol.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 21328f207d38..55f7ab318eef 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1756,6 +1756,11 @@ static inline void __memcg_kmem_uncharge_page(struct page *page, int order)
 {
 }
 
+static inline struct obj_cgroup *get_obj_cgroup_from_current(void)
+{
+	return NULL;
+}
+
 static inline struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
 {
 	return NULL;
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 16/18] memcontrol: export current_obj_cgroup
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (14 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 15/18] memcontrol: allow objcg api when memcg is config off Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14  5:18 ` [PATCH 17/18] amdgpu: add support for memory cgroups Dave Airlie
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

get_obj_cgroup_from_current() is a static inline that ends up calling
current_obj_cgroup(), so the latter needs to be exported before a
module can use it.
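
A modular caller then works without any extra glue, e.g.:

	/* inline wrapper in memcontrol.h, expands to a current_obj_cgroup() call */
	struct obj_cgroup *objcg = get_obj_cgroup_from_current();

	/* ... charge allocations against objcg ... */

	obj_cgroup_put(objcg);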

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 mm/memcontrol.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4c8ded9501c6..4c041c5b3a15 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2726,6 +2726,7 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
 
 	return objcg;
 }
+EXPORT_SYMBOL_GPL(current_obj_cgroup);
 
 struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
 {
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 17/18] amdgpu: add support for memory cgroups
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (15 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 16/18] memcontrol: export current_obj_cgroup Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14  5:18 ` [PATCH 18/18] ttm: add support for a module option to disable memcg pool Dave Airlie
  2025-08-05 10:58 ` drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Maarten Lankhorst
  18 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

This adds support for attaching an obj cgroup to a buffer object, and
for passing in the placement flags to make sure it's accounted
properly.

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 13 +++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  2 ++
 4 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index e5e33a68d935..d250183bab03 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -198,6 +198,7 @@ static void amdgpu_gem_object_free(struct drm_gem_object *gobj)
 	struct amdgpu_bo *aobj = gem_to_amdgpu_bo(gobj);
 
 	amdgpu_hmm_unregister(aobj);
+	obj_cgroup_put(aobj->tbo.objcg);
 	ttm_bo_put(&aobj->tbo);
 }
 
@@ -225,6 +226,7 @@ int amdgpu_gem_object_create(struct amdgpu_device *adev, unsigned long size,
 	bp.domain = initial_domain;
 	bp.bo_ptr_size = sizeof(struct amdgpu_bo);
 	bp.xcp_id_plus1 = xcp_id_plus1;
+	bp.objcg = get_obj_cgroup_from_current();
 
 	r = amdgpu_bo_create_user(adev, &bp, &ubo);
 	if (r)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 73403744331a..6d5533703b33 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -158,7 +158,7 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
 		places[c].mem_type =
 			abo->flags & AMDGPU_GEM_CREATE_PREEMPTIBLE ?
 			AMDGPU_PL_PREEMPT : TTM_PL_TT;
-		places[c].flags = 0;
+		places[c].flags = TTM_PL_FLAG_MEMCG;
 		/*
 		 * When GTT is just an alternative to VRAM make sure that we
 		 * only use it as fallback and still try to fill up VRAM first.
@@ -173,7 +173,7 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
 		places[c].fpfn = 0;
 		places[c].lpfn = 0;
 		places[c].mem_type = TTM_PL_SYSTEM;
-		places[c].flags = 0;
+		places[c].flags = TTM_PL_FLAG_MEMCG;
 		c++;
 	}
 
@@ -657,16 +657,21 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
 		size = ALIGN(size, PAGE_SIZE);
 	}
 
-	if (!amdgpu_bo_validate_size(adev, size, bp->domain))
+	if (!amdgpu_bo_validate_size(adev, size, bp->domain)) {
+		obj_cgroup_put(bp->objcg);
 		return -ENOMEM;
+	}
 
 	BUG_ON(bp->bo_ptr_size < sizeof(struct amdgpu_bo));
 
 	*bo_ptr = NULL;
 	bo = kvzalloc(bp->bo_ptr_size, GFP_KERNEL);
-	if (bo == NULL)
+	if (bo == NULL) {
+		obj_cgroup_put(bp->objcg);
 		return -ENOMEM;
+	}
 	drm_gem_private_object_init(adev_to_drm(adev), &bo->tbo.base, size);
+	bo->tbo.objcg = bp->objcg;
 	bo->tbo.base.funcs = &amdgpu_gem_object_funcs;
 	bo->vm_bo = NULL;
 	bo->preferred_domains = bp->preferred_domain ? bp->preferred_domain :
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 375448627f7b..8ebaf1bc202f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -55,6 +55,7 @@ struct amdgpu_bo_param {
 	enum ttm_bo_type		type;
 	bool				no_wait_gpu;
 	struct dma_resv			*resv;
+	struct obj_cgroup               *objcg;
 	void				(*destroy)(struct ttm_buffer_object *bo);
 	/* xcp partition number plus 1, 0 means any partition */
 	int8_t				xcp_id_plus1;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 920b412156dd..a65e23b8c67e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -151,11 +151,13 @@ static void amdgpu_evict_flags(struct ttm_buffer_object *bo,
 			amdgpu_bo_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT |
 							AMDGPU_GEM_DOMAIN_CPU);
 		}
+		abo->placements[0].flags &= ~TTM_PL_FLAG_MEMCG;
 		break;
 	case TTM_PL_TT:
 	case AMDGPU_PL_PREEMPT:
 	default:
 		amdgpu_bo_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_CPU);
+		abo->placements[0].flags &= ~TTM_PL_FLAG_MEMCG;
 		break;
 	}
 	*placement = abo->placement;
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 18/18] ttm: add support for a module option to disable memcg pool
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (16 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 17/18] amdgpu: add support for memory cgroups Dave Airlie
@ 2025-07-14  5:18 ` Dave Airlie
  2025-07-14 11:49   ` Christian König
  2025-08-05 10:58 ` drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Maarten Lankhorst
  18 siblings, 1 reply; 30+ messages in thread
From: Dave Airlie @ 2025-07-14  5:18 UTC (permalink / raw)
  To: dri-devel, linux-mm, Johannes Weiner, Christian Koenig
  Cc: Dave Chinner, Kairui Song, Dave Airlie

From: Dave Airlie <airlied@redhat.com>

There is an existing workload that cgroup support might regress:
the systems are set up to allocate 1GB of uncached pages at system
startup to prime the pool, and any further users then take pages
from that pool. The current cgroup code might handle that, but
it also may regress, so add an option to ttm to avoid using
memcg for the pool pages.
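
For such systems the accounting can then be switched off at load time,
e.g. (assuming the usual module parameter syntax for ttm):

	# on the kernel command line
	ttm.pool_cgroup=0

	# or in modprobe configuration
	options ttm pool_cgroup=0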

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_pool.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 1e6da2cc1f06..9d84d9991176 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -118,6 +118,21 @@ static unsigned long page_pool_size;
 MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool per NUMA node");
 module_param(page_pool_size, ulong, 0644);
 
+/*
+ * Don't use the memcg aware lru for pooled pages.
+ *
+ * There are use-cases where for example one application in a cgroup will preallocate 1GB
+ * of uncached pages, and immediately release them into the pool, for other consumers
+ * to use. This use-case could be handled with a proper cgroup hierarchy, but to allow
+ * that use case to continue to operate as-is, add a module option.
+ *
+ * This still stores the pages in the list_lru, it just doesn't use the memcg when
+ * adding/removing them.
+ */
+static bool pool_cgroup = true;
+MODULE_PARM_DESC(pool_cgroup, "Manage pooled memory using cgroups (default: true)");
+module_param(pool_cgroup, bool, 0444);
+
 static unsigned long pool_node_limit[MAX_NUMNODES];
 static atomic_long_t allocated_pages[MAX_NUMNODES];
 
@@ -305,7 +320,7 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
 
 	INIT_LIST_HEAD(&p->lru);
 	rcu_read_lock();
-	list_lru_add(&pt->pages, &p->lru, nid, page_memcg_check(p));
+	list_lru_add(&pt->pages, &p->lru, nid, pool_cgroup ? page_memcg_check(p) : NULL);
 	rcu_read_unlock();
 
 	atomic_long_add(num_pages, &allocated_pages[nid]);
@@ -354,7 +369,7 @@ static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid,
 	struct page *page_out = NULL;
 	int ret;
 	struct mem_cgroup *orig_memcg = orig_objcg ? get_mem_cgroup_from_objcg(orig_objcg) : NULL;
-	struct mem_cgroup *memcg = orig_memcg;
+	struct mem_cgroup *memcg = pool_cgroup ? orig_memcg : NULL;
 
 	/*
 	 * Attempt to get a page from the current memcg, but if it hasn't got any in its level,
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 18/18] ttm: add support for a module option to disable memcg pool
  2025-07-14  5:18 ` [PATCH 18/18] ttm: add support for a module option to disable memcg pool Dave Airlie
@ 2025-07-14 11:49   ` Christian König
  0 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2025-07-14 11:49 UTC (permalink / raw)
  To: Dave Airlie, dri-devel, linux-mm, Johannes Weiner
  Cc: Dave Chinner, Kairui Song, Dave Airlie

On 14.07.25 07:18, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
> 
> There is an existing workload that cgroup support might regress:
> the systems are set up to allocate 1GB of uncached pages at system
> startup to prime the pool, and any further users then take pages
> from that pool. The current cgroup code might handle that, but
> it also may regress, so add an option to ttm to avoid using
> memcg for the pool pages.
> 
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> ---
>  drivers/gpu/drm/ttm/ttm_pool.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index 1e6da2cc1f06..9d84d9991176 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -118,6 +118,21 @@ static unsigned long page_pool_size;
>  MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool per NUMA node");
>  module_param(page_pool_size, ulong, 0644);

I think we need that for the whole memcg integration, and it should be off by default for now.

Regards,
Christian.

>  
> +/*
> + * Don't use the memcg aware lru for pooled pages.
> + *
> + * There are use-cases where for example one application in a cgroup will preallocate 1GB
> + * of uncached pages, and immediately release them into the pool, for other consumers
> + * to use. This use-case could be handled with a proper cgroup hierarchy, but to allow
> + * that use case to continue to operate as-is, add a module option.
> + *
> + * This still stores the pages in the list_lru, it just doesn't use the memcg when
> + * adding/removing them.
> + */
> +static bool pool_cgroup = true;
> +MODULE_PARM_DESC(pool_cgroup, "Manage pooled memory using cgroups (default: true)");
> +module_param(pool_cgroup, bool, 0444);
> +
>  static unsigned long pool_node_limit[MAX_NUMNODES];
>  static atomic_long_t allocated_pages[MAX_NUMNODES];
>  
> @@ -305,7 +320,7 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
>  
>  	INIT_LIST_HEAD(&p->lru);
>  	rcu_read_lock();
> -	list_lru_add(&pt->pages, &p->lru, nid, page_memcg_check(p));
> +	list_lru_add(&pt->pages, &p->lru, nid, pool_cgroup ? page_memcg_check(p) : NULL);
>  	rcu_read_unlock();
>  
>  	atomic_long_add(num_pages, &allocated_pages[nid]);
> @@ -354,7 +369,7 @@ static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid,
>  	struct page *page_out = NULL;
>  	int ret;
>  	struct mem_cgroup *orig_memcg = orig_objcg ? get_mem_cgroup_from_objcg(orig_objcg) : NULL;
> -	struct mem_cgroup *memcg = orig_memcg;
> +	struct mem_cgroup *memcg = pool_cgroup ? orig_memcg : NULL;
>  
>  	/*
> 	 * Attempt to get a page from the current memcg, but if it hasn't got any in its level,



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 02/18] drm/ttm: use gpu mm stats to track gpu memory allocations. (v3)
  2025-07-14  5:18 ` [PATCH 02/18] drm/ttm: use gpu mm stats to track gpu memory allocations. (v3) Dave Airlie
@ 2025-07-14 18:33   ` Shakeel Butt
  0 siblings, 0 replies; 30+ messages in thread
From: Shakeel Butt @ 2025-07-14 18:33 UTC (permalink / raw)
  To: Dave Airlie
  Cc: dri-devel, linux-mm, Johannes Weiner, Christian Koenig,
	Dave Chinner, Kairui Song, Dave Airlie, Matthew Brost,
	Andrew Morton

On Mon, Jul 14, 2025 at 03:18:17PM +1000, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
> 
> This uses the newly introduced per-node gpu tracking stats,
> to track GPU memory allocated via TTM and reclaimable memory in
> the TTM page pools.
> 
> These stats will be useful for system information, and later when
> mem cgroups are integrated.
> 
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: linux-mm@kvack.org
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> 
> ---
> v2: add reclaim parameters and adjust the right counters.
> v3: drop the nid helper and get it from page.
> ---
>  drivers/gpu/drm/ttm/ttm_pool.c | 25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index baf27c70a419..ee2344089d47 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -150,8 +150,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
>  
>  	if (!pool->use_dma_alloc) {
>  		p = alloc_pages_node(pool->nid, gfp_flags, order);
> -		if (p)
> +		if (p) {
>  			p->private = order;
> +			mod_node_page_state(NODE_DATA(page_to_nid(p)), NR_GPU_ACTIVE, (1 << order));

Please use mod_lruvec_page_state() here.
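
i.e. something like:

	mod_lruvec_page_state(p, NR_GPU_ACTIVE, 1 << order);

so that the same delta is also reflected in the owning memcg's stats.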

> +		}
>  		return p;
>  	}
>  
> @@ -186,7 +188,7 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
>  
>  /* Reset the caching and pages of size 1 << order */
>  static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
> -			       unsigned int order, struct page *p)
> +			       unsigned int order, struct page *p, bool reclaim)
>  {
>  	unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
>  	struct ttm_pool_dma *dma;
> @@ -201,6 +203,9 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
>  #endif
>  
>  	if (!pool || !pool->use_dma_alloc) {
> +		mod_node_page_state(NODE_DATA(page_to_nid(p)),
> +				    reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE,
> +				    -(1 << order));

Same here.

>  		__free_pages(p, order);
>  		return;
>  	}
> @@ -276,6 +281,7 @@ static void ttm_pool_unmap(struct ttm_pool *pool, dma_addr_t dma_addr,
>  static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
>  {
>  	unsigned int i, num_pages = 1 << pt->order;
> +	int nid = page_to_nid(p);
>  
>  	for (i = 0; i < num_pages; ++i) {
>  		if (PageHighMem(p))
> @@ -288,17 +294,24 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
>  	list_add(&p->lru, &pt->pages);
>  	spin_unlock(&pt->lock);
>  	atomic_long_add(1 << pt->order, &allocated_pages);
> +
> +	mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, -num_pages);
> +	mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, num_pages);

Same here.

>  }
>  
>  /* Take pages from a specific pool_type, return NULL when nothing available */
>  static struct page *ttm_pool_type_take(struct ttm_pool_type *pt)
>  {
>  	struct page *p;
> +	int nid;
>  
>  	spin_lock(&pt->lock);
>  	p = list_first_entry_or_null(&pt->pages, typeof(*p), lru);
>  	if (p) {
> +		nid = page_to_nid(p);
>  		atomic_long_sub(1 << pt->order, &allocated_pages);
> +		mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, (1 << pt->order));
> +		mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, -(1 << pt->order));

Same here.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 03/18] mm/list_lru: export list_lru_add.
  2025-07-14  5:18 ` [PATCH 03/18] mm/list_lru: export list_lru_add Dave Airlie
@ 2025-07-14 19:01   ` Shakeel Butt
  0 siblings, 0 replies; 30+ messages in thread
From: Shakeel Butt @ 2025-07-14 19:01 UTC (permalink / raw)
  To: Dave Airlie
  Cc: dri-devel, linux-mm, Johannes Weiner, Christian Koenig,
	Dave Chinner, Kairui Song, Dave Airlie

On Mon, Jul 14, 2025 at 03:18:18PM +1000, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
> 
> DRM/TTM wants to use this for its page pool
> LRU tracking.
> 
> This effectively is a revert of
> 78c0ed09131b772f062b986a2fcca6600daa6285
> Author: Kairui Song <kasong@tencent.com>
> Date:   Tue Nov 5 01:52:53 2024 +0800
> 
>     mm/list_lru: don't export list_lru_add
> 
> Cc: Kairui Song <kasong@tencent.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Dave Airlie <airlied@redhat.com>

Instead of a separate patch, please put it in the patch which actually
uses list_lru_add.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 13/18] ttm/pool: enable memcg tracking and shrinker. (v2)
  2025-07-14  5:18 ` [PATCH 13/18] ttm/pool: enable memcg tracking and shrinker. (v2) Dave Airlie
@ 2025-07-15  7:34   ` Christian König
  2025-07-21  5:56     ` David Airlie
  0 siblings, 1 reply; 30+ messages in thread
From: Christian König @ 2025-07-15  7:34 UTC (permalink / raw)
  To: Dave Airlie, dri-devel, linux-mm, Johannes Weiner
  Cc: Dave Chinner, Kairui Song, Dave Airlie



On 14.07.25 07:18, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
> 
> This enables all the backend code to use the list lru in memcg mode,
> and sets the shrinker to be memcg aware.
> 
> It adds a lookup loop so that when pooled pages end up being reparented
> to a parent memcg, a newer child memcg can still find them there and
> take them back.
> 
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> 
> ---
> v2: just use the proper stats.
> ---
>  drivers/gpu/drm/ttm/ttm_pool.c | 126 ++++++++++++++++++++++++++-------
>  1 file changed, 102 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index a4f4e27c6a1f..1e6da2cc1f06 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -142,7 +142,9 @@ static int ttm_pool_nid(struct ttm_pool *pool) {
>  }
>  
>  /* Allocate pages of size 1 << order with the given gfp_flags */
> -static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> +static struct page *ttm_pool_alloc_page(struct ttm_pool *pool,
> +					struct obj_cgroup *objcg,
> +					gfp_t gfp_flags,
>  					unsigned int order)
>  {
>  	unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
> @@ -162,7 +164,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
>  		p = alloc_pages_node(pool->nid, gfp_flags, order);
>  		if (p) {
>  			p->private = order;
> -			mod_node_page_state(NODE_DATA(page_to_nid(p)), NR_GPU_ACTIVE, (1 << order));
> +			if (!mem_cgroup_charge_gpu_page(objcg, p, order, gfp_flags, false)) {

Thinking more about it, that is way too late. At this point we can't fail the allocation any more.

Otherwise we either completely break suspend or don't account system allocations correctly any more after resume.

What we need is to reserve the memory on BO allocation and commit it when the TT backend is populated.

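Very roughly, and purely as a hypothetical sketch (none of these helpers
exist in the series or upstream, the names are made up only to illustrate
the split):

	/* BO creation/validation: only reserve the worst-case size against
	 * the cgroup; nothing is allocated yet, and this is the only point
	 * that is allowed to fail with -ENOMEM */
	if (ttm_memcg_reserve(bo->objcg, bo->base.size))
		return -ENOMEM;

	/* ttm_tt_populate(): convert the reservation into a real charge as
	 * the pages are actually allocated; this step must not fail
	 * because of the cgroup limit any more */
	ttm_memcg_commit(bo->objcg, tt->num_pages << PAGE_SHIFT);

	/* destroy/unpopulate: drop the charge plus whatever was still only
	 * reserved */
	ttm_memcg_release(bo->objcg, bo->base.size);
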
Regards,
Christian.

> +				__free_pages(p, order);
> +				return NULL;
> +			}
>  		}
>  		return p;
>  	}
> @@ -213,9 +218,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
>  #endif
>  
>  	if (!pool || !pool->use_dma_alloc) {
> -		mod_node_page_state(NODE_DATA(page_to_nid(p)),
> -				    reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE,
> -				    -(1 << order));
> +		mem_cgroup_uncharge_gpu_page(p, order, reclaim);
>  		__free_pages(p, order);
>  		return;
>  	}
> @@ -302,12 +305,11 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
>  
>  	INIT_LIST_HEAD(&p->lru);
>  	rcu_read_lock();
> -	list_lru_add(&pt->pages, &p->lru, nid, NULL);
> +	list_lru_add(&pt->pages, &p->lru, nid, page_memcg_check(p));
>  	rcu_read_unlock();
>  
>  	atomic_long_add(num_pages, &allocated_pages[nid]);
> -	mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, -num_pages);
> -	mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, num_pages);
> +	mem_cgroup_move_gpu_page_reclaim(NULL, p, pt->order, true);
>  }
>  
>  static enum lru_status take_one_from_lru(struct list_head *item,
> @@ -322,20 +324,56 @@ static enum lru_status take_one_from_lru(struct list_head *item,
>  	return LRU_REMOVED;
>  }
>  
> -/* Take pages from a specific pool_type, return NULL when nothing available */
> -static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid)
> +static int pool_lru_get_page(struct ttm_pool_type *pt, int nid,
> +			     struct page **page_out,
> +			     struct obj_cgroup *objcg,
> +			     struct mem_cgroup *memcg)
>  {
>  	int ret;
>  	struct page *p = NULL;
>  	unsigned long nr_to_walk = 1;
> +	unsigned int num_pages = 1 << pt->order;
>  
> -	ret = list_lru_walk_node(&pt->pages, nid, take_one_from_lru, (void *)&p, &nr_to_walk);
> +	ret = list_lru_walk_one(&pt->pages, nid, memcg, take_one_from_lru, (void *)&p, &nr_to_walk);
>  	if (ret == 1 && p) {
> -		atomic_long_sub(1 << pt->order, &allocated_pages[nid]);
> -		mod_node_page_state(NODE_DATA(nid), NR_GPU_ACTIVE, (1 << pt->order));
> -		mod_node_page_state(NODE_DATA(nid), NR_GPU_RECLAIM, -(1 << pt->order));
> +		atomic_long_sub(num_pages, &allocated_pages[nid]);
> +
> +		if (!mem_cgroup_move_gpu_page_reclaim(objcg, p, pt->order, false)) {
> +			__free_pages(p, pt->order);
> +			p = NULL;
> +		}
>  	}
> -	return p;
> +	*page_out = p;
> +	return ret;
> +}
> +
> +/* Take pages from a specific pool_type, return NULL when nothing available */
> +static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid,
> +				       struct obj_cgroup *orig_objcg)
> +{
> +	struct page *page_out = NULL;
> +	int ret;
> +	struct mem_cgroup *orig_memcg = orig_objcg ? get_mem_cgroup_from_objcg(orig_objcg) : NULL;
> +	struct mem_cgroup *memcg = orig_memcg;
> +
> +	/*
> +	 * Attempt to get a page from the current memcg, but if it hasn't got any in it's level,
> +	 * go up to the parent and check there. This helps the scenario where multiple apps get
> +	 * started into their own cgroup from a common parent and want to reuse the pools.
> +	 */
> +	while (!page_out) {
> +		ret = pool_lru_get_page(pt, nid, &page_out, orig_objcg, memcg);
> +		if (ret == 1)
> +			break;
> +		if (!memcg)
> +			break;
> +		memcg = parent_mem_cgroup(memcg);
> +		if (!memcg)
> +			break;
> +	}
> +
> +	mem_cgroup_put(orig_memcg);
> +	return page_out;
>  }
>  
>  /* Initialize and add a pool type to the global shrinker list */
> @@ -345,7 +383,7 @@ static void ttm_pool_type_init(struct ttm_pool_type *pt, struct ttm_pool *pool,
>  	pt->pool = pool;
>  	pt->caching = caching;
>  	pt->order = order;
> -	list_lru_init(&pt->pages);
> +	list_lru_init_memcg(&pt->pages, mm_shrinker);
>  
>  	spin_lock(&shrinker_lock);
>  	list_add_tail(&pt->shrinker_list, &shrinker_list);
> @@ -388,6 +426,30 @@ static void ttm_pool_type_fini(struct ttm_pool_type *pt)
>  	ttm_pool_dispose_list(pt, &dispose);
>  }
>  
> +static int ttm_pool_check_objcg(struct obj_cgroup *objcg)
> +{
> +#ifdef CONFIG_MEMCG
> +	int r = 0;
> +	struct mem_cgroup *memcg;
> +	if (!objcg)
> +		return 0;
> +
> +	memcg = get_mem_cgroup_from_objcg(objcg);
> +	for (unsigned i = 0; i < NR_PAGE_ORDERS; i++) {
> +		r = memcg_list_lru_alloc(memcg, &global_write_combined[i].pages, GFP_KERNEL);
> +		if (r) {
> +			break;
> +		}
> +		r = memcg_list_lru_alloc(memcg, &global_uncached[i].pages, GFP_KERNEL);
> +		if (r) {
> +			break;
> +		}
> +	}
> +	mem_cgroup_put(memcg);
> +#endif
> +	return 0;
> +}
> +
>  /* Return the pool_type to use for the given caching and order */
>  static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
>  						  enum ttm_caching caching,
> @@ -417,7 +479,9 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
>  }
>  
>  /* Free pages using the per-node shrinker list */
> -static unsigned int ttm_pool_shrink(int nid, unsigned long num_to_free)
> +static unsigned int ttm_pool_shrink(int nid,
> +				    struct mem_cgroup *memcg,
> +				    unsigned long num_to_free)
>  {
>  	LIST_HEAD(dispose);
>  	struct ttm_pool_type *pt;
> @@ -429,7 +493,11 @@ static unsigned int ttm_pool_shrink(int nid, unsigned long num_to_free)
>  	list_move_tail(&pt->shrinker_list, &shrinker_list);
>  	spin_unlock(&shrinker_lock);
>  
> -	num_pages = list_lru_walk_node(&pt->pages, nid, pool_move_to_dispose_list, &dispose, &num_to_free);
> +	if (!memcg) {
> +		num_pages = list_lru_walk_node(&pt->pages, nid, pool_move_to_dispose_list, &dispose, &num_to_free);
> +	} else {
> +		num_pages = list_lru_walk_one(&pt->pages, nid, memcg, pool_move_to_dispose_list, &dispose, &num_to_free);
> +	}
>  	num_pages *= 1 << pt->order;
>  
>  	ttm_pool_dispose_list(pt, &dispose);
> @@ -594,6 +662,7 @@ static int ttm_pool_restore_commit(struct ttm_pool_tt_restore *restore,
>  			 */
>  			ttm_pool_split_for_swap(restore->pool, p);
>  			copy_highpage(restore->alloced_page + i, p);
> +			p->memcg_data = 0;
>  			__free_pages(p, 0);
>  		}
>  
> @@ -755,6 +824,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>  	bool allow_pools;
>  	struct page *p;
>  	int r;
> +	struct obj_cgroup *objcg = memcg_account ? tt->objcg : NULL;
>  
>  	WARN_ON(!alloc->remaining_pages || ttm_tt_is_populated(tt));
>  	WARN_ON(alloc->dma_addr && !pool->dev);
> @@ -772,6 +842,9 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>  
>  	page_caching = tt->caching;
>  	allow_pools = true;
> +
> +	ttm_pool_check_objcg(objcg);
> +
>  	for (order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>  	     alloc->remaining_pages;
>  	     order = ttm_pool_alloc_find_order(order, alloc)) {
> @@ -781,7 +854,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>  		p = NULL;
>  		pt = ttm_pool_select_type(pool, page_caching, order);
>  		if (pt && allow_pools)
> -			p = ttm_pool_type_take(pt, ttm_pool_nid(pool));
> +			p = ttm_pool_type_take(pt, ttm_pool_nid(pool), objcg);
>  
>  		/*
>  		 * If that fails or previously failed, allocate from system.
> @@ -792,7 +865,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>  		if (!p) {
>  			page_caching = ttm_cached;
>  			allow_pools = false;
> -			p = ttm_pool_alloc_page(pool, gfp_flags, order);
> +			p = ttm_pool_alloc_page(pool, objcg, gfp_flags, order);
>  		}
>  		/* If that fails, lower the order if possible and retry. */
>  		if (!p) {
> @@ -936,7 +1009,7 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
>  
>  	while (atomic_long_read(&allocated_pages[nid]) > pool_node_limit[nid]) {
>  		unsigned long diff = pool_node_limit[nid] - atomic_long_read(&allocated_pages[nid]);
> -		ttm_pool_shrink(nid, diff);
> +		ttm_pool_shrink(nid, NULL, diff);
>  	}
>  }
>  EXPORT_SYMBOL(ttm_pool_free);
> @@ -1056,6 +1129,7 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
>  			if (flags->purge) {
>  				shrunken += num_pages;
>  				page->private = 0;
> +				page->memcg_data = 0;
>  				__free_pages(page, order);
>  				memset(tt->pages + i, 0,
>  				       num_pages * sizeof(*tt->pages));
> @@ -1192,10 +1266,14 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
>  					    struct shrink_control *sc)
>  {
>  	unsigned long num_freed = 0;
> +	int num_pools;
> +	spin_lock(&shrinker_lock);
> +	num_pools = list_count_nodes(&shrinker_list);
> +	spin_unlock(&shrinker_lock);
>  
>  	do
> -		num_freed += ttm_pool_shrink(sc->nid, sc->nr_to_scan);
> -	while (num_freed < sc->nr_to_scan &&
> +		num_freed += ttm_pool_shrink(sc->nid, sc->memcg, sc->nr_to_scan);
> +	while (num_pools-- >= 0 && num_freed < sc->nr_to_scan &&
>  	       atomic_long_read(&allocated_pages[sc->nid]));
>  
>  	sc->nr_scanned = num_freed;
> @@ -1382,7 +1460,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
>  	spin_lock_init(&shrinker_lock);
>  	INIT_LIST_HEAD(&shrinker_list);
>  
> -	mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
> +	mm_shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE, "drm-ttm_pool");
>  	if (!mm_shrinker)
>  		return -ENOMEM;
>  



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 13/18] ttm/pool: enable memcg tracking and shrinker. (v2)
  2025-07-15  7:34   ` Christian König
@ 2025-07-21  5:56     ` David Airlie
  2025-07-21 23:16       ` David Airlie
  0 siblings, 1 reply; 30+ messages in thread
From: David Airlie @ 2025-07-21  5:56 UTC (permalink / raw)
  To: Christian König
  Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
	Kairui Song

On Tue, Jul 15, 2025 at 5:34 PM Christian König
<christian.koenig@amd.com> wrote:
>
>
>
> On 14.07.25 07:18, Dave Airlie wrote:
> > From: Dave Airlie <airlied@redhat.com>
> >
> > This enables all the backend code to use the list lru in memcg mode,
> > and set the shrinker to be memcg aware.
> >
> > It adds the loop case for when pooled pages end up being reparented
> > to a higher memcg group, that newer memcg can search for them there
> > and take them back.
> >
> > Signed-off-by: Dave Airlie <airlied@redhat.com>
> >
> > ---
> > v2: just use the proper stats.
> > ---
> >  drivers/gpu/drm/ttm/ttm_pool.c | 126 ++++++++++++++++++++++++++-------
> >  1 file changed, 102 insertions(+), 24 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > index a4f4e27c6a1f..1e6da2cc1f06 100644
> > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > @@ -142,7 +142,9 @@ static int ttm_pool_nid(struct ttm_pool *pool) {
> >  }
> >
> >  /* Allocate pages of size 1 << order with the given gfp_flags */
> > -static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> > +static struct page *ttm_pool_alloc_page(struct ttm_pool *pool,
> > +                                     struct obj_cgroup *objcg,
> > +                                     gfp_t gfp_flags,
> >                                       unsigned int order)
> >  {
> >       unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
> > @@ -162,7 +164,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> >               p = alloc_pages_node(pool->nid, gfp_flags, order);
> >               if (p) {
> >                       p->private = order;
> > -                     mod_node_page_state(NODE_DATA(page_to_nid(p)), NR_GPU_ACTIVE, (1 << order));
> > +                     if (!mem_cgroup_charge_gpu_page(objcg, p, order, gfp_flags, false)) {
>
> Thinking more about it that is way to late. At this point we can't fail the allocation any more.
>

I've tested that it at least works, but there is a bit of a problem
with it: if we fail an order-10 allocation, it tries to fall back down
the order hierarchy, even though there is no point, since the cgroup
can't account the maximum size anyway.

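Something along these lines would be needed to short-circuit the retry
(a sketch only; the charge_failed signalling is hypothetical and not in
the current patch):

	p = ttm_pool_alloc_page(pool, objcg, gfp_flags, order);
	/* a lower order only helps when the page allocator was the
	 * problem; if the cgroup charge failed, the total size being
	 * charged is the same at every order, so bail out instead */
	if (!p && charge_failed)
		return -ENOMEM;
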
> Otherwise we either completely break suspend or don't account system allocations to the correctly any more after resume.

When you say suspend here, do you mean for VRAM allocations? Normal
system RAM allocations, which are what get accounted here, shouldn't
have any effect on suspend/resume since they stay where they are.
Currently it also doesn't try to account for evictions at all.

>
> What we need is to reserve the memory on BO allocation and commit it when the TT backend is populated.

I'm not sure what reserve vs commit means here; memcg really is just
reserve until you can reserve no more, i.e. a single charge/uncharge
stage. If we try to charge and we are over the limit, bad things will
happen: either the allocation fails or reclaim kicks in for the
cgroup.
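
For reference, the generic objcg helpers that already exist upstream
work exactly like that single-stage model; a minimal sketch using the
plain byte-based API (not the gpu-specific helpers this series adds):

	/* single stage: either the whole charge fits under the cgroup
	 * limit (possibly after reclaim), or we get -ENOMEM back */
	if (obj_cgroup_charge(objcg, GFP_KERNEL, size))
		return -ENOMEM;

	/* ... memory is in use, accounted to the cgroup ... */

	/* matching single-stage uncharge when it is freed */
	obj_cgroup_uncharge(objcg, size);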

Regards,
Dave.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 13/18] ttm/pool: enable memcg tracking and shrinker. (v2)
  2025-07-21  5:56     ` David Airlie
@ 2025-07-21 23:16       ` David Airlie
  2025-08-04  9:22         ` Christian König
  0 siblings, 1 reply; 30+ messages in thread
From: David Airlie @ 2025-07-21 23:16 UTC (permalink / raw)
  To: Christian König
  Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
	Kairui Song

> > On 14.07.25 07:18, Dave Airlie wrote:
> > > From: Dave Airlie <airlied@redhat.com>
> > >
> > > This enables all the backend code to use the list lru in memcg mode,
> > > and set the shrinker to be memcg aware.
> > >
> > > It adds the loop case for when pooled pages end up being reparented
> > > to a higher memcg group, that newer memcg can search for them there
> > > and take them back.
> > >
> > > Signed-off-by: Dave Airlie <airlied@redhat.com>
> > >
> > > ---
> > > v2: just use the proper stats.
> > > ---
> > >  drivers/gpu/drm/ttm/ttm_pool.c | 126 ++++++++++++++++++++++++++-------
> > >  1 file changed, 102 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > > index a4f4e27c6a1f..1e6da2cc1f06 100644
> > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > @@ -142,7 +142,9 @@ static int ttm_pool_nid(struct ttm_pool *pool) {
> > >  }
> > >
> > >  /* Allocate pages of size 1 << order with the given gfp_flags */
> > > -static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> > > +static struct page *ttm_pool_alloc_page(struct ttm_pool *pool,
> > > +                                     struct obj_cgroup *objcg,
> > > +                                     gfp_t gfp_flags,
> > >                                       unsigned int order)
> > >  {
> > >       unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
> > > @@ -162,7 +164,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> > >               p = alloc_pages_node(pool->nid, gfp_flags, order);
> > >               if (p) {
> > >                       p->private = order;
> > > -                     mod_node_page_state(NODE_DATA(page_to_nid(p)), NR_GPU_ACTIVE, (1 << order));
> > > +                     if (!mem_cgroup_charge_gpu_page(objcg, p, order, gfp_flags, false)) {
> >
> > Thinking more about it that is way to late. At this point we can't fail the allocation any more.
> >
>
> I've tested it at least works, but there is a bit of a problem with
> it, because if we fail a 10 order allocation, it tries to fallback
> down the order hierarchy, when there is no point since it can't
> account the maximum size.
>
> > Otherwise we either completely break suspend or don't account system allocations to the correctly any more after resume.
>
> When you say suspend here, do you mean for VRAM allocations, normal
> system RAM allocations which are accounted here shouldn't have any
> effect on suspend/resume since they stay where they are. Currently it
> also doesn't try account for evictions at all.

I've just traced the global swapin/out paths as well, and those seem
fine for memcg at this point, since they are called only after
populate/unpopulate. I haven't addressed the new xe swap paths yet,
because I don't have a test path (amdgpu doesn't support those). I was
thinking I'd leave that on the list for when amdgpu moves to that
path, or I can spend some time on xe.

Dave.

> >
> > What we need is to reserve the memory on BO allocation and commit it when the TT backend is populated.
>
> I'm not sure what reserve vs commit is here, mem cgroup is really just
> reserve until you can reserve no more, it's just a single
> charge/uncharge stage. If we try and charge and we are over the limit,
> bad things will happen, either fail allocation or reclaim for the
> cgroup.
>
> Regards,
> Dave.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 13/18] ttm/pool: enable memcg tracking and shrinker. (v2)
  2025-07-21 23:16       ` David Airlie
@ 2025-08-04  9:22         ` Christian König
  2025-08-06  2:43           ` Dave Airlie
  0 siblings, 1 reply; 30+ messages in thread
From: Christian König @ 2025-08-04  9:22 UTC (permalink / raw)
  To: David Airlie
  Cc: Dave Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
	Kairui Song

Sorry for the delayed response, just back from vacation.

On 22.07.25 01:16, David Airlie wrote:
>>>> @@ -162,7 +164,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
>>>>               p = alloc_pages_node(pool->nid, gfp_flags, order);
>>>>               if (p) {
>>>>                       p->private = order;
>>>> -                     mod_node_page_state(NODE_DATA(page_to_nid(p)), NR_GPU_ACTIVE, (1 << order));
>>>> +                     if (!mem_cgroup_charge_gpu_page(objcg, p, order, gfp_flags, false)) {
>>>
>>> Thinking more about it that is way to late. At this point we can't fail the allocation any more.
>>>
>>
>> I've tested it at least works, but there is a bit of a problem with
>> it, because if we fail a 10 order allocation, it tries to fallback
>> down the order hierarchy, when there is no point since it can't
>> account the maximum size.
>>
>>> Otherwise we either completely break suspend or don't account system allocations to the correctly any more after resume.
>>
>> When you say suspend here, do you mean for VRAM allocations, normal
>> system RAM allocations which are accounted here shouldn't have any
>> effect on suspend/resume since they stay where they are. Currently it
>> also doesn't try account for evictions at all.

Good point, I was not considering moves during suspend as evictions. But from the code flow that should indeed work for now.

What I meant is that after resume BOs are usually not moved back into VRAM immediately. Filling VRAM is rate limited to allow quick response of desktop applications after resume.

So at least temporarily we hopelessly overcommit system memory after resume. But that problem potentially goes into the same bucket as general eviction.

> I've just traced the global swapin/out paths as well and those seem
> fine for memcg at this point, since they are called only after
> populate/unpopulate. Now I haven't addressed the new xe swap paths,
> because I don't have a test path, since amdgpu doesn't support those,
> I was thinking I'd leave it on the list for when amdgpu goes to that
> path, or I can spend some time on xe.

I would really prefer that before we commit this we have patches for both amdgpu and XE which at least demonstrate the functionality.

We are essentially defining uAPI here, and when that goes wrong we can't change it any more once people start depending on it.

> 
> Dave.
> 
>>>
>>> What we need is to reserve the memory on BO allocation and commit it when the TT backend is populated.
>>
>> I'm not sure what reserve vs commit is here, mem cgroup is really just
>> reserve until you can reserve no more, it's just a single
>> charge/uncharge stage. If we try and charge and we are over the limit,
>> bad things will happen, either fail allocation or reclaim for the
>> cgroup.

Yeah, exactly that is what I think is highly problematic.

When the allocation of a buffer for an application fails in the display server, you basically open up the possibility of a denial of service.

E.g. imagine that an application allocates a 4GiB BO while its cgroup says it can only allocate 2GiB; that will work because the backing store is only allocated lazily. Now send that BO to the display server, and the command submission in the display server will fail with -ENOMEM because we exceed the cgroup of the application.

As far as I can see we also need to limit how much an application can overcommit by creating BOs without backing store.

Alternatively disallow creating BOs without backing store, but that is an uAPI change and will break at least some use cases.

Regards,
Christian.

>>
>> Regards,
>> Dave.
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2)
  2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
                   ` (17 preceding siblings ...)
  2025-07-14  5:18 ` [PATCH 18/18] ttm: add support for a module option to disable memcg pool Dave Airlie
@ 2025-08-05 10:58 ` Maarten Lankhorst
  2025-08-06  2:39   ` Dave Airlie
  18 siblings, 1 reply; 30+ messages in thread
From: Maarten Lankhorst @ 2025-08-05 10:58 UTC (permalink / raw)
  To: Dave Airlie, dri-devel, linux-mm, Johannes Weiner,
	Christian Koenig
  Cc: Dave Chinner, Kairui Song

Hey,

Den 2025-07-14 kl. 07:18, skrev Dave Airlie:
> Hi all,
> 
> This is a repost with some fixes and cleanups.
> 
> Differences since last posting:
> 1. Added patch 18: add a module option to allow pooled pages to not be stored in the lru per-memcg
>    (Requested by Christian Konig)
> 2. Converged the naming and stats between vmstat and memcg (Suggested by Shakeel Butt)
> 3. Cleaned up the charge/uncharge code and some other bits.
> 
> Dave.
> 
> Original cover letter:
> tl;dr: start using list_lru/numa/memcg in GPU driver core and amdgpu driver for now.
> 
> This is a complete series of patches, some of which have been sent before and reviewed,
> but I want to get the complete picture for others, and try to figure out how best to land this.
> 
> There are 3 pieces to this:
> 01->02: add support for global gpu stat counters (previously posted, patch 2 is newer)
> 03->07: port ttm pools to list_lru for numa awareness
> 08->14: add memcg stats + gpu apis, then port ttm pools to memcg aware list_lru and shrinker
> 15->17: enable amdgpu to use new functionality.
> 
> The biggest difference in the memcg code from previously is I discovered what
> obj cgroups were designed for and I'm reusing the page/objcg intergration that 
> already exists, to avoid reinventing that wheel right now.
> 
> There are some igt-gpu-tools tests I've written at:
> https://gitlab.freedesktop.org/airlied/igt-gpu-tools/-/tree/amdgpu-cgroups?ref_type=heads
> 
> One problem is there are a lot of delayed action, that probably means the testing
> needs a bit more robustness, but the tests validate all the basic paths.
> 
> Regards,
> Dave.
> 
Patch below to enable this on xe as well; I ran into some issues when testing, though.
After shutting down gdm3/sddm, I ran into a null dereference in mem_cgroup_uncharge_gpu_page()
from ttm_pool_free_page(), presumably because of the objects that were created without a
cgroup set. I tried to fix it in mem_cgroup_uncharge_gpu_page() by conditionally calling
refill_stock(), but that ran into an underflow instead.

Anyway, patch for xe below:
----->8-----------
drm/xe: Enable memcg accounting for TT/system

Create a flag to enable memcg accounting for XE as well.

Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 867087c2d1534..fd93374967c9e 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -54,6 +54,7 @@ static const struct ttm_place sys_placement_flags = {
 	.flags = 0,
 };
 
+/* TTM_PL_FLAG_MEMCG is not set, those placements are used for eviction */
 static struct ttm_placement sys_placement = {
 	.num_placement = 1,
 	.placement = &sys_placement_flags,
@@ -188,6 +189,7 @@ static void try_add_system(struct xe_device *xe, struct xe_bo *bo,
 
 		bo->placements[*c] = (struct ttm_place) {
 			.mem_type = XE_PL_TT,
+			.flags = TTM_PL_FLAG_MEMCG,
 		};
 		*c += 1;
 	}
@@ -1696,6 +1698,8 @@ static void xe_ttm_bo_destroy(struct ttm_buffer_object *ttm_bo)
 
 static void xe_gem_object_free(struct drm_gem_object *obj)
 {
+	struct xe_bo *bo = gem_to_xe_bo(obj);
+
 	/* Our BO reference counting scheme works as follows:
 	 *
 	 * The gem object kref is typically used throughout the driver,
@@ -1709,8 +1713,9 @@ static void xe_gem_object_free(struct drm_gem_object *obj)
 	 * driver ttm callbacks is allowed to use the ttm_buffer_object
 	 * refcount directly if needed.
 	 */
-	__xe_bo_vunmap(gem_to_xe_bo(obj));
-	ttm_bo_put(container_of(obj, struct ttm_buffer_object, base));
+	__xe_bo_vunmap(bo);
+	obj_cgroup_put(bo->ttm.objcg);
+	ttm_bo_put(&bo->ttm);
 }
 
 static void xe_gem_object_close(struct drm_gem_object *obj,
@@ -1951,6 +1956,9 @@ struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
 	placement = (type == ttm_bo_type_sg ||
 		     bo->flags & XE_BO_FLAG_DEFER_BACKING) ? &sys_placement :
 		&bo->placement;
+
+	if (bo->flags & XE_BO_FLAG_ACCOUNTED)
+		bo->ttm.objcg = get_obj_cgroup_from_current();
 	err = ttm_bo_init_reserved(&xe->ttm, &bo->ttm, type,
 				   placement, alignment,
 				   &ctx, NULL, resv, xe_ttm_bo_destroy);
@@ -2726,7 +2734,7 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
 	if (XE_IOCTL_DBG(xe, args->size & ~PAGE_MASK))
 		return -EINVAL;
 
-	bo_flags = 0;
+	bo_flags = XE_BO_FLAG_ACCOUNTED;
 	if (args->flags & DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING)
 		bo_flags |= XE_BO_FLAG_DEFER_BACKING;
 
diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
index 6134d82e80554..e44fc58d9a00f 100644
--- a/drivers/gpu/drm/xe/xe_bo.h
+++ b/drivers/gpu/drm/xe/xe_bo.h
@@ -48,6 +48,7 @@
 #define XE_BO_FLAG_GGTT2		BIT(22)
 #define XE_BO_FLAG_GGTT3		BIT(23)
 #define XE_BO_FLAG_CPU_ADDR_MIRROR	BIT(24)
+#define XE_BO_FLAG_ACCOUNTED		BIT(25)
 
 /* this one is trigger internally only */
 #define XE_BO_FLAG_INTERNAL_TEST	BIT(30)
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index 540f044bf4255..4db3227d65c04 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -1266,7 +1266,8 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
 	bo_flags = XE_BO_FLAG_VRAM_IF_DGFX(tile) | XE_BO_FLAG_GGTT |
 		   XE_BO_FLAG_GGTT_INVALIDATE;
 	if (vm && vm->xef) /* userspace */
-		bo_flags |= XE_BO_FLAG_PINNED_LATE_RESTORE;
+		bo_flags |= XE_BO_FLAG_PINNED_LATE_RESTORE |
+			    XE_BO_FLAG_ACCOUNTED;
 
 	lrc->bo = xe_bo_create_pin_map(xe, tile, NULL, bo_size,
 				       ttm_bo_type_kernel,
diff --git a/drivers/gpu/drm/xe/xe_oa.c b/drivers/gpu/drm/xe/xe_oa.c
index 5729e7d3e3356..569035630ffdf 100644
--- a/drivers/gpu/drm/xe/xe_oa.c
+++ b/drivers/gpu/drm/xe/xe_oa.c
@@ -885,7 +885,7 @@ static int xe_oa_alloc_oa_buffer(struct xe_oa_stream *stream, size_t size)
 
 	bo = xe_bo_create_pin_map(stream->oa->xe, stream->gt->tile, NULL,
 				  size, ttm_bo_type_kernel,
-				  XE_BO_FLAG_SYSTEM | XE_BO_FLAG_GGTT);
+				  XE_BO_FLAG_SYSTEM | XE_BO_FLAG_GGTT | XE_BO_FLAG_ACCOUNTED);
 	if (IS_ERR(bo))
 		return PTR_ERR(bo);
 
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 330cc0f54a3f4..efcd54ab75e92 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -120,7 +120,8 @@ struct xe_pt *xe_pt_create(struct xe_vm *vm, struct xe_tile *tile,
 		   XE_BO_FLAG_IGNORE_MIN_PAGE_SIZE |
 		   XE_BO_FLAG_NO_RESV_EVICT | XE_BO_FLAG_PAGETABLE;
 	if (vm->xef) /* userspace */
-		bo_flags |= XE_BO_FLAG_PINNED_LATE_RESTORE;
+		bo_flags |= XE_BO_FLAG_PINNED_LATE_RESTORE |
+			    XE_BO_FLAG_ACCOUNTED;
 
 	pt->level = level;
 	bo = xe_bo_create_pin_map(vm->xe, tile, vm, SZ_4K,
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 10c8a1bcb86e8..fdf845bb717e0 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -700,6 +700,7 @@ static int xe_drm_pagemap_populate_mm(struct drm_pagemap *dpagemap,
 	bo = xe_bo_create_locked(vr->xe, NULL, NULL, end - start,
 				 ttm_bo_type_device,
 				 (IS_DGFX(xe) ? XE_BO_FLAG_VRAM(vr) : XE_BO_FLAG_SYSTEM) |
+				 XE_BO_FLAG_ACCOUNTED |
 				 XE_BO_FLAG_CPU_ADDR_MIRROR);
 	if (IS_ERR(bo)) {
 		err = PTR_ERR(bo);



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2)
  2025-08-05 10:58 ` drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Maarten Lankhorst
@ 2025-08-06  2:39   ` Dave Airlie
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Airlie @ 2025-08-06  2:39 UTC (permalink / raw)
  To: Maarten Lankhorst
  Cc: dri-devel, linux-mm, Johannes Weiner, Christian Koenig,
	Dave Chinner, Kairui Song

On Tue, 5 Aug 2025 at 20:58, Maarten Lankhorst
<maarten.lankhorst@linux.intel.com> wrote:
>
> Hey,
>
> Den 2025-07-14 kl. 07:18, skrev Dave Airlie:
> > Hi all,
> >
> > This is a repost with some fixes and cleanups.
> >
> > Differences since last posting:
> > 1. Added patch 18: add a module option to allow pooled pages to not be stored in the lru per-memcg
> >    (Requested by Christian Konig)
> > 2. Converged the naming and stats between vmstat and memcg (Suggested by Shakeel Butt)
> > 3. Cleaned up the charge/uncharge code and some other bits.
> >
> > Dave.
> >
> > Original cover letter:
> > tl;dr: start using list_lru/numa/memcg in GPU driver core and amdgpu driver for now.
> >
> > This is a complete series of patches, some of which have been sent before and reviewed,
> > but I want to get the complete picture for others, and try to figure out how best to land this.
> >
> > There are 3 pieces to this:
> > 01->02: add support for global gpu stat counters (previously posted, patch 2 is newer)
> > 03->07: port ttm pools to list_lru for numa awareness
> > 08->14: add memcg stats + gpu apis, then port ttm pools to memcg aware list_lru and shrinker
> > 15->17: enable amdgpu to use new functionality.
> >
> > The biggest difference in the memcg code from previously is I discovered what
> > obj cgroups were designed for and I'm reusing the page/objcg intergration that
> > already exists, to avoid reinventing that wheel right now.
> >
> > There are some igt-gpu-tools tests I've written at:
> > https://gitlab.freedesktop.org/airlied/igt-gpu-tools/-/tree/amdgpu-cgroups?ref_type=heads
> >
> > One problem is there are a lot of delayed action, that probably means the testing
> > needs a bit more robustness, but the tests validate all the basic paths.
> >
> > Regards,
> > Dave.
> >
> Patch below to enable on xe as well, I ran into some issues though when testing.
> After shutting down gdm3/sddm, I ran into a null dereference in mem_cgroup_uncharge_gpu_page()
> from ttm_pool_free_page(), presumably because of the objects that were created without a
> cgroup set. I tried to fix it in mem_cgroup_uncharge_gpu_page() by conditionally calling
> refill_stock(), but that ran into an underflow instead.

There should be a check that memcg is not NULL before calling into
refill; where are you seeing the underflow?

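Roughly what I'd expect in the uncharge path (a sketch against the
series' helper, variable names illustrative):

	/* pages that were never charged (BO created without an objcg)
	 * must skip the per-memcg refill and only drop the global
	 * node counters */
	if (memcg)
		refill_stock(memcg, 1 << order);
	mod_node_page_state(NODE_DATA(nid),
			    reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE,
			    -(1 << order));
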
Thanks for the patch, I've booted one of my meteorlake systems with
this applied, forcing xe, and it seems to be working in my basic
testing so far.

Dave.

>
> Anyway, patch for xe below:
> ----->8-----------
> drm/xe: Enable memcg accounting for TT/system
>
> Create a flag to enable memcg accounting for XE as well.
>
> Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
>
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index 867087c2d1534..fd93374967c9e 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -54,6 +54,7 @@ static const struct ttm_place sys_placement_flags = {
>         .flags = 0,
>  };
>
> +/* TTM_PL_FLAG_MEMCG is not set, those placements are used for eviction */
>  static struct ttm_placement sys_placement = {
>         .num_placement = 1,
>         .placement = &sys_placement_flags,
> @@ -188,6 +189,7 @@ static void try_add_system(struct xe_device *xe, struct xe_bo *bo,
>
>                 bo->placements[*c] = (struct ttm_place) {
>                         .mem_type = XE_PL_TT,
> +                       .flags = TTM_PL_FLAG_MEMCG,
>                 };
>                 *c += 1;
>         }
> @@ -1696,6 +1698,8 @@ static void xe_ttm_bo_destroy(struct ttm_buffer_object *ttm_bo)
>
>  static void xe_gem_object_free(struct drm_gem_object *obj)
>  {
> +       struct xe_bo *bo = gem_to_xe_bo(obj);
> +
>         /* Our BO reference counting scheme works as follows:
>          *
>          * The gem object kref is typically used throughout the driver,
> @@ -1709,8 +1713,9 @@ static void xe_gem_object_free(struct drm_gem_object *obj)
>          * driver ttm callbacks is allowed to use the ttm_buffer_object
>          * refcount directly if needed.
>          */
> -       __xe_bo_vunmap(gem_to_xe_bo(obj));
> -       ttm_bo_put(container_of(obj, struct ttm_buffer_object, base));
> +       __xe_bo_vunmap(bo);
> +       obj_cgroup_put(bo->ttm.objcg);
> +       ttm_bo_put(&bo->ttm);
>  }
>
>  static void xe_gem_object_close(struct drm_gem_object *obj,
> @@ -1951,6 +1956,9 @@ struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>         placement = (type == ttm_bo_type_sg ||
>                      bo->flags & XE_BO_FLAG_DEFER_BACKING) ? &sys_placement :
>                 &bo->placement;
> +
> +       if (bo->flags & XE_BO_FLAG_ACCOUNTED)
> +               bo->ttm.objcg = get_obj_cgroup_from_current();
>         err = ttm_bo_init_reserved(&xe->ttm, &bo->ttm, type,
>                                    placement, alignment,
>                                    &ctx, NULL, resv, xe_ttm_bo_destroy);
> @@ -2726,7 +2734,7 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
>         if (XE_IOCTL_DBG(xe, args->size & ~PAGE_MASK))
>                 return -EINVAL;
>
> -       bo_flags = 0;
> +       bo_flags = XE_BO_FLAG_ACCOUNTED;
>         if (args->flags & DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING)
>                 bo_flags |= XE_BO_FLAG_DEFER_BACKING;
>
> diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
> index 6134d82e80554..e44fc58d9a00f 100644
> --- a/drivers/gpu/drm/xe/xe_bo.h
> +++ b/drivers/gpu/drm/xe/xe_bo.h
> @@ -48,6 +48,7 @@
>  #define XE_BO_FLAG_GGTT2               BIT(22)
>  #define XE_BO_FLAG_GGTT3               BIT(23)
>  #define XE_BO_FLAG_CPU_ADDR_MIRROR     BIT(24)
> +#define XE_BO_FLAG_ACCOUNTED           BIT(25)
>
>  /* this one is trigger internally only */
>  #define XE_BO_FLAG_INTERNAL_TEST       BIT(30)
> diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
> index 540f044bf4255..4db3227d65c04 100644
> --- a/drivers/gpu/drm/xe/xe_lrc.c
> +++ b/drivers/gpu/drm/xe/xe_lrc.c
> @@ -1266,7 +1266,8 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
>         bo_flags = XE_BO_FLAG_VRAM_IF_DGFX(tile) | XE_BO_FLAG_GGTT |
>                    XE_BO_FLAG_GGTT_INVALIDATE;
>         if (vm && vm->xef) /* userspace */
> -               bo_flags |= XE_BO_FLAG_PINNED_LATE_RESTORE;
> +               bo_flags |= XE_BO_FLAG_PINNED_LATE_RESTORE |
> +                           XE_BO_FLAG_ACCOUNTED;
>
>         lrc->bo = xe_bo_create_pin_map(xe, tile, NULL, bo_size,
>                                        ttm_bo_type_kernel,
> diff --git a/drivers/gpu/drm/xe/xe_oa.c b/drivers/gpu/drm/xe/xe_oa.c
> index 5729e7d3e3356..569035630ffdf 100644
> --- a/drivers/gpu/drm/xe/xe_oa.c
> +++ b/drivers/gpu/drm/xe/xe_oa.c
> @@ -885,7 +885,7 @@ static int xe_oa_alloc_oa_buffer(struct xe_oa_stream *stream, size_t size)
>
>         bo = xe_bo_create_pin_map(stream->oa->xe, stream->gt->tile, NULL,
>                                   size, ttm_bo_type_kernel,
> -                                 XE_BO_FLAG_SYSTEM | XE_BO_FLAG_GGTT);
> +                                 XE_BO_FLAG_SYSTEM | XE_BO_FLAG_GGTT | XE_BO_FLAG_ACCOUNTED);
>         if (IS_ERR(bo))
>                 return PTR_ERR(bo);
>
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index 330cc0f54a3f4..efcd54ab75e92 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -120,7 +120,8 @@ struct xe_pt *xe_pt_create(struct xe_vm *vm, struct xe_tile *tile,
>                    XE_BO_FLAG_IGNORE_MIN_PAGE_SIZE |
>                    XE_BO_FLAG_NO_RESV_EVICT | XE_BO_FLAG_PAGETABLE;
>         if (vm->xef) /* userspace */
> -               bo_flags |= XE_BO_FLAG_PINNED_LATE_RESTORE;
> +               bo_flags |= XE_BO_FLAG_PINNED_LATE_RESTORE |
> +                           XE_BO_FLAG_ACCOUNTED;
>
>         pt->level = level;
>         bo = xe_bo_create_pin_map(vm->xe, tile, vm, SZ_4K,
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 10c8a1bcb86e8..fdf845bb717e0 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -700,6 +700,7 @@ static int xe_drm_pagemap_populate_mm(struct drm_pagemap *dpagemap,
>         bo = xe_bo_create_locked(vr->xe, NULL, NULL, end - start,
>                                  ttm_bo_type_device,
>                                  (IS_DGFX(xe) ? XE_BO_FLAG_VRAM(vr) : XE_BO_FLAG_SYSTEM) |
> +                                XE_BO_FLAG_ACCOUNTED |
>                                  XE_BO_FLAG_CPU_ADDR_MIRROR);
>         if (IS_ERR(bo)) {
>                 err = PTR_ERR(bo);
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 13/18] ttm/pool: enable memcg tracking and shrinker. (v2)
  2025-08-04  9:22         ` Christian König
@ 2025-08-06  2:43           ` Dave Airlie
  2025-08-06 13:04             ` Christian König
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Airlie @ 2025-08-06  2:43 UTC (permalink / raw)
  To: Christian König
  Cc: David Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
	Kairui Song

On Mon, 4 Aug 2025 at 19:22, Christian König <christian.koenig@amd.com> wrote:
>
> Sorry for the delayed response, just back from vacation.
>
> On 22.07.25 01:16, David Airlie wrote:
> >>>> @@ -162,7 +164,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
> >>>>               p = alloc_pages_node(pool->nid, gfp_flags, order);
> >>>>               if (p) {
> >>>>                       p->private = order;
> >>>> -                     mod_node_page_state(NODE_DATA(page_to_nid(p)), NR_GPU_ACTIVE, (1 << order));
> >>>> +                     if (!mem_cgroup_charge_gpu_page(objcg, p, order, gfp_flags, false)) {
> >>>
> >>> Thinking more about it that is way to late. At this point we can't fail the allocation any more.
> >>>
> >>
> >> I've tested it at least works, but there is a bit of a problem with
> >> it, because if we fail a 10 order allocation, it tries to fallback
> >> down the order hierarchy, when there is no point since it can't
> >> account the maximum size.
> >>
> >>> Otherwise we either completely break suspend or don't account system allocations to the correctly any more after resume.
> >>
> >> When you say suspend here, do you mean for VRAM allocations, normal
> >> system RAM allocations which are accounted here shouldn't have any
> >> effect on suspend/resume since they stay where they are. Currently it
> >> also doesn't try account for evictions at all.
>
> Good point, I was not considering moves during suspend as evictions. But from the code flow that should indeed work for now.
>
> What I meant is that after resume BOs are usually not moved back into VRAM immediately. Filling VRAM is rate limited to allow quick response of desktop applications after resume.
>
> So at least temporary we hopelessly overcommit system memory after resume. But that problem potentially goes into the same bucked as general eviction.
>
> > I've just traced the global swapin/out paths as well and those seem
> > fine for memcg at this point, since they are called only after
> > populate/unpopulate. Now I haven't addressed the new xe swap paths,
> > because I don't have a test path, since amdgpu doesn't support those,
> > I was thinking I'd leave it on the list for when amdgpu goes to that
> > path, or I can spend some time on xe.
>
> I would really prefer that before we commit this that we have patches for both amdgpu and XE which at least demonstrate the functionality.
>
> We are essentially defining uAPI here and when that goes wrong we can't change it any more as soon as people start depending on it.

Maarten has supplied xe enablement patches; I'll go spend some time
looking into this there as well.

>
> >
> > Dave.
> >
> >>>
> >>> What we need is to reserve the memory on BO allocation and commit it when the TT backend is populated.
> >>
> >> I'm not sure what reserve vs commit is here, mem cgroup is really just
> >> reserve until you can reserve no more, it's just a single
> >> charge/uncharge stage. If we try and charge and we are over the limit,
> >> bad things will happen, either fail allocation or reclaim for the
> >> cgroup.
>
> Yeah, exactly that is what I think is highly problematic.
>
> When the allocation of a buffer for an application fails in the display server you basically open up the possibility for a deny of service.
>
> E.g. imaging that an application allocates a 4GiB BO while it's cgroup says it can only allocate 2GiB, that will work because the backing store is only allocated delayed. Now send that BO to the display server and the command submission in the display server will fail with an -ENOMEM because we exceed the cgroup of the application.
>
> As far as I can see we also need to limit how much an application can overcommit by creating BOs without backing store.
>
> Alternatively disallow creating BOs without backing store, but that is an uAPI change and will break at least some use cases.

This is interesting, because I think the same DoS could exist now: if
the system is low on memory, I could allocate a giant unbacked BO and
pass it to the display server, and when it goes to fill in the pages
it could fail to allocate them and get -ENOMEM?

Should we be considering making buffer sharing cause population?

Dave.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 13/18] ttm/pool: enable memcg tracking and shrinker. (v2)
  2025-08-06  2:43           ` Dave Airlie
@ 2025-08-06 13:04             ` Christian König
  0 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2025-08-06 13:04 UTC (permalink / raw)
  To: Dave Airlie
  Cc: David Airlie, dri-devel, linux-mm, Johannes Weiner, Dave Chinner,
	Kairui Song

On 06.08.25 04:43, Dave Airlie wrote:
>>>>>
>>>>> What we need is to reserve the memory on BO allocation and commit it when the TT backend is populated.
>>>>
>>>> I'm not sure what reserve vs commit is here, mem cgroup is really just
>>>> reserve until you can reserve no more, it's just a single
>>>> charge/uncharge stage. If we try and charge and we are over the limit,
>>>> bad things will happen, either fail allocation or reclaim for the
>>>> cgroup.
>>
>> Yeah, exactly that is what I think is highly problematic.
>>
>> When the allocation of a buffer for an application fails in the display server you basically open up the possibility for a deny of service.
>>
>> E.g. imaging that an application allocates a 4GiB BO while it's cgroup says it can only allocate 2GiB, that will work because the backing store is only allocated delayed. Now send that BO to the display server and the command submission in the display server will fail with an -ENOMEM because we exceed the cgroup of the application.
>>
>> As far as I can see we also need to limit how much an application can overcommit by creating BOs without backing store.
>>
>> Alternatively disallow creating BOs without backing store, but that is an uAPI change and will break at least some use cases.
> 
> This is interesting, because I think the same DOS could exist now if
> the system is low on memory, I could allocate a giant unbacked BO and
> pass it to the display server now, and when it goes to fill in the
> pages it could fail to allocate pages and get ENOMEM?

Yeah that's perfectly possible. IIRC I have already pointed out those problems when I first started to work on radeon years ago.

See the patches for improving the OOM killer I came up with ~10 years ago. They don't address exactly that problem, but they go in the general direction.

The problem is that cgroups make those issues much worse, because you suddenly have not only the general global limit of physically installed memory, but also artificial limits set by the system administrator.

> Should we be considering buffer sharing should cause population?

Good question, I haven't thought about that approach.

But we don't have a way to figure that out, do we? Except maybe when exporting the BO as DMA-buf. Mhm, let me take a look at the code.

Christian.

> 
> Dave.



^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread

Thread overview: 30+ messages:
2025-07-14  5:18 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Dave Airlie
2025-07-14  5:18 ` [PATCH 01/18] mm: add gpu active/reclaim per-node stat counters (v2) Dave Airlie
2025-07-14  5:18 ` [PATCH 02/18] drm/ttm: use gpu mm stats to track gpu memory allocations. (v3) Dave Airlie
2025-07-14 18:33   ` Shakeel Butt
2025-07-14  5:18 ` [PATCH 03/18] mm/list_lru: export list_lru_add Dave Airlie
2025-07-14 19:01   ` Shakeel Butt
2025-07-14  5:18 ` [PATCH 04/18] ttm/pool: port to list_lru. (v2) Dave Airlie
2025-07-14  5:18 ` [PATCH 05/18] ttm/pool: drop numa specific pools Dave Airlie
2025-07-14  5:18 ` [PATCH 06/18] ttm/pool: make pool shrinker NUMA aware Dave Airlie
2025-07-14  5:18 ` [PATCH 07/18] ttm/pool: track allocated_pages per numa node Dave Airlie
2025-07-14  5:18 ` [PATCH 08/18] memcg: add support for GPU page counters. (v2) Dave Airlie
2025-07-14  5:18 ` [PATCH 09/18] memcg: export memcg_list_lru_alloc Dave Airlie
2025-07-14  5:18 ` [PATCH 10/18] ttm: add a memcg accounting flag to the alloc/populate APIs Dave Airlie
2025-07-14  5:18 ` [PATCH 11/18] ttm/pool: initialise the shrinker earlier Dave Airlie
2025-07-14  5:18 ` [PATCH 12/18] ttm: add objcg pointer to bo and tt Dave Airlie
2025-07-14  5:18 ` [PATCH 13/18] ttm/pool: enable memcg tracking and shrinker. (v2) Dave Airlie
2025-07-15  7:34   ` Christian König
2025-07-21  5:56     ` David Airlie
2025-07-21 23:16       ` David Airlie
2025-08-04  9:22         ` Christian König
2025-08-06  2:43           ` Dave Airlie
2025-08-06 13:04             ` Christian König
2025-07-14  5:18 ` [PATCH 14/18] ttm: hook up memcg placement flags Dave Airlie
2025-07-14  5:18 ` [PATCH 15/18] memcontrol: allow objcg api when memcg is config off Dave Airlie
2025-07-14  5:18 ` [PATCH 16/18] memcontrol: export current_obj_cgroup Dave Airlie
2025-07-14  5:18 ` [PATCH 17/18] amdgpu: add support for memory cgroups Dave Airlie
2025-07-14  5:18 ` [PATCH 18/18] ttm: add support for a module option to disable memcg pool Dave Airlie
2025-07-14 11:49   ` Christian König
2025-08-05 10:58 ` drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v2) Maarten Lankhorst
2025-08-06  2:39   ` Dave Airlie
