public inbox for cgroups@vger.kernel.org
* [PATCH 13/16] memcontrol: allow objcg api when memcg is config off.
  2025-10-16  2:31 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v4) Dave Airlie
@ 2025-10-16  2:31 ` Dave Airlie
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2025-10-16  2:31 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

amdgpu wants to use the objcg API without having to wrap every call
in an ifdef, so just add a dummy function for the config-off path.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 include/linux/memcontrol.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 62c46c33f84f..8401b272495e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1828,6 +1828,11 @@ static inline void __memcg_kmem_uncharge_page(struct page *page, int order)
 {
 }
 
+static inline struct obj_cgroup *get_obj_cgroup_from_current(void)
+{
+	return NULL;
+}
+
 static inline struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
 {
 	return NULL;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5)
@ 2026-02-24  2:06 Dave Airlie
  2026-02-24  2:06 ` [PATCH 01/16] mm: add gpu active/reclaim per-node stat counters (v2) Dave Airlie
                   ` (15 more replies)
  0 siblings, 16 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

Hi all,

This time I really want to make forward progress on landing this. I'll likely merge the first
half into drm-next soon, but I'd like to get it all landed.

The main changes since v4: I did an AI review of the patchset, which found a bug
in the reclaim codepaths when no memcg was around, and a bug in the diff
calculations for accounted pages that I had introduced.

Christian, I think you said patch 4 got lost last time; hopefully you get it this time.

Patches still needing ack/review:
ttm/pool: drop numa specific pools
ttm/pool: track allocated_pages per numa node.
ttm: add objcg pointer to bo and tt (v2)
ttm/pool: enable memcg tracking and shrinker. (v2)
amdgpu: add support for memory cgroups

Differences since v1 posting:
1. added ttm_bo_set_cgroup wrapper - the cgroup reference is passed to the ttm object.
2. put the cgroup reference in ttm object release
3. rebase onto 6.19-rc1
4. added xe support patch from Maarten.

Differences since v2 posting:
1. Squashed exports into where they are used (Shakeel)
2. Fixed bug in uncharge path memcg
3. Fixed config bug in the module option.

Differences since 1st posting:
1. Added patch 18: add a module option to allow pooled pages to not be stored in the lru per-memcg
   (Requested by Christian König)
2. Converged the naming and stats between vmstat and memcg (Suggested by Shakeel Butt)
3. Cleaned up the charge/uncharge code and some other bits.

Dave.

Original cover letter:
tl;dr: start using list_lru/numa/memcg in GPU driver core and amdgpu driver for now.

This is a complete series of patches, some of which have been sent before and reviewed,
but I want to get the complete picture for others, and try to figure out how best to land this.

There are 3 pieces to this:
01->02: add support for global gpu stat counters (previously posted, patch 2 is newer)
03->06: port ttm pools to list_lru for numa awareness
07->13: add memcg stats + gpu apis, then port ttm pools to memcg aware list_lru and shrinker
14: enable amdgpu to use new functionality.
15: add a module option to turn it all off.

The biggest difference in the memcg code from previous versions is that I
discovered what obj cgroups were designed for, and I'm reusing the page/objcg
integration that already exists, to avoid reinventing that wheel right now.

There are some igt-gpu-tools tests I've written at:
https://gitlab.freedesktop.org/airlied/igt-gpu-tools/-/tree/amdgpu-cgroups?ref_type=heads

One problem is that there is a lot of delayed action, which probably means the
testing needs a bit more robustness, but the tests validate all the basic paths.

Regards,
Dave.



* [PATCH 01/16] mm: add gpu active/reclaim per-node stat counters (v2)
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  2:06 ` [PATCH 02/16] drm/ttm: use gpu mm stats to track gpu memory allocations. (v4) Dave Airlie
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

While discussing memcg integration with gpu memory allocations,
it was pointed out that there were no numa/system counters for
GPU memory allocations.

With more integrated-memory GPU server systems turning up, and
more requirements for memory tracking, it seems we should start
closing the gap.

Add two counters to track GPU per-node system memory allocations.

The first tracks memory currently allocated to GPU objects; the
second tracks memory stored in GPU page pools that can be reclaimed
by the shrinker.

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

---

v2: add more info to the documentation on this memory.

I'd like to get acks to merge this via the drm tree, if possible,

Dave.
---
 Documentation/filesystems/proc.rst | 8 ++++++++
 drivers/base/node.c                | 5 +++++
 fs/proc/meminfo.c                  | 6 ++++++
 include/linux/mmzone.h             | 2 ++
 mm/show_mem.c                      | 6 +++++-
 mm/vmstat.c                        | 2 ++
 6 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index b0c0d1b45b99..3acfdf785465 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1089,6 +1089,8 @@ Example output. You may not have all of these fields.
     CmaFree:               0 kB
     Unaccepted:            0 kB
     Balloon:               0 kB
+    GPUActive:             0 kB
+    GPUReclaim:            0 kB
     HugePages_Total:       0
     HugePages_Free:        0
     HugePages_Rsvd:        0
@@ -1269,6 +1271,12 @@ Unaccepted
               Memory that has not been accepted by the guest
 Balloon
               Memory returned to Host by VM Balloon Drivers
+GPUActive
+              System memory allocated to active GPU objects
+GPUReclaim
+              System memory stored in GPU pools for reuse. This memory is not
+              counted in GPUActive. It is shrinker reclaimable memory kept in a reuse
+              pool because it has non-standard page table attributes, like WC or UC.
 HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, Hugepagesize, Hugetlb
               See Documentation/admin-guide/mm/hugetlbpage.rst.
 DirectMap4k, DirectMap2M, DirectMap1G
diff --git a/drivers/base/node.c b/drivers/base/node.c
index d7647d077b66..126f66aa2c3e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -523,6 +523,8 @@ static ssize_t node_read_meminfo(struct device *dev,
 #ifdef CONFIG_UNACCEPTED_MEMORY
 			     "Node %d Unaccepted:     %8lu kB\n"
 #endif
+			     "Node %d GPUActive:      %8lu kB\n"
+			     "Node %d GPUReclaim:     %8lu kB\n"
 			     ,
 			     nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
 			     nid, K(node_page_state(pgdat, NR_WRITEBACK)),
@@ -556,6 +558,9 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     ,
 			     nid, K(sum_zone_node_page_state(nid, NR_UNACCEPTED))
 #endif
+			     ,
+			     nid, K(node_page_state(pgdat, NR_GPU_ACTIVE)),
+			     nid, K(node_page_state(pgdat, NR_GPU_RECLAIM))
 			    );
 	len += hugetlb_report_node_meminfo(buf, len, nid);
 	return len;
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index a458f1e112fd..65ba49ec3a63 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -163,6 +163,12 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "Balloon:        ",
 		    global_node_page_state(NR_BALLOON_PAGES));
 
+	show_val_kb(m, "GPUActive:      ",
+		    global_node_page_state(NR_GPU_ACTIVE));
+
+	show_val_kb(m, "GPUReclaim:     ",
+		    global_node_page_state(NR_GPU_RECLAIM));
+
 	hugetlb_report_meminfo(m);
 
 	arch_report_meminfo(m);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e51190a55e4..841b40031833 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -260,6 +260,8 @@ enum node_stat_item {
 #endif
 	NR_BALLOON_PAGES,
 	NR_KERNEL_FILE_PAGES,
+	NR_GPU_ACTIVE,	/* Pages assigned to GPU objects */
+	NR_GPU_RECLAIM,	/* Pages in shrinkable GPU pools */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 24078ac3e6bc..43aca5a2ac99 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -254,6 +254,8 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 			" sec_pagetables:%lukB"
 			" all_unreclaimable? %s"
 			" Balloon:%lukB"
+			" gpu_active:%lukB"
+			" gpu_reclaim:%lukB"
 			"\n",
 			pgdat->node_id,
 			K(node_page_state(pgdat, NR_ACTIVE_ANON)),
@@ -279,7 +281,9 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 			K(node_page_state(pgdat, NR_PAGETABLE)),
 			K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)),
 			str_yes_no(kswapd_test_hopeless(pgdat)),
-			K(node_page_state(pgdat, NR_BALLOON_PAGES)));
+			K(node_page_state(pgdat, NR_BALLOON_PAGES)),
+			K(node_page_state(pgdat, NR_GPU_ACTIVE)),
+			K(node_page_state(pgdat, NR_GPU_RECLAIM)));
 	}
 
 	for_each_populated_zone(zone) {
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 86b14b0f77b5..ac9affbe48b7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1281,6 +1281,8 @@ const char * const vmstat_text[] = {
 #endif
 	[I(NR_BALLOON_PAGES)]			= "nr_balloon_pages",
 	[I(NR_KERNEL_FILE_PAGES)]		= "nr_kernel_file_pages",
+	[I(NR_GPU_ACTIVE)]			= "nr_gpu_active",
+	[I(NR_GPU_RECLAIM)]			= "nr_gpu_reclaim",
 #undef I
 
 	/* system-wide enum vm_stat_item counters */
-- 
2.52.0



* [PATCH 02/16] drm/ttm: use gpu mm stats to track gpu memory allocations. (v4)
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
  2026-02-24  2:06 ` [PATCH 01/16] mm: add gpu active/reclaim per-node stat counters (v2) Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  2:06 ` [PATCH 03/16] ttm/pool: port to list_lru. (v2) Dave Airlie
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

This uses the newly introduced per-node gpu tracking stats
to track GPU memory allocated via TTM and reclaimable memory in
the TTM page pools.

These stats will be useful for system information and later
when mem cgroups are integrated.

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

---
v2: add reclaim parameters and adjust the right counters.
v3: drop the nid helper and get it from page.
v4: use mod_lruvec_page_state (Shakeel)
---
 drivers/gpu/drm/ttm/ttm_pool.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index c0d95559197c..6891776cad75 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -159,8 +159,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
 
 	if (!ttm_pool_uses_dma_alloc(pool)) {
 		p = alloc_pages_node(pool->nid, gfp_flags, order);
-		if (p)
+		if (p) {
 			p->private = order;
+			mod_lruvec_page_state(p, NR_GPU_ACTIVE, 1 << order);
+		}
 		return p;
 	}
 
@@ -195,7 +197,7 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
 
 /* Reset the caching and pages of size 1 << order */
 static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
-			       unsigned int order, struct page *p)
+			       unsigned int order, struct page *p, bool reclaim)
 {
 	unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS;
 	struct ttm_pool_dma *dma;
@@ -210,6 +212,8 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
 #endif
 
 	if (!pool || !ttm_pool_uses_dma_alloc(pool)) {
+		mod_lruvec_page_state(p, reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE,
+				      -(1 << order));
 		__free_pages(p, order);
 		return;
 	}
@@ -297,6 +301,9 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
 	list_add(&p->lru, &pt->pages);
 	spin_unlock(&pt->lock);
 	atomic_long_add(1 << pt->order, &allocated_pages);
+
+	mod_lruvec_page_state(p, NR_GPU_ACTIVE, -num_pages);
+	mod_lruvec_page_state(p, NR_GPU_RECLAIM, num_pages);
 }
 
 /* Take pages from a specific pool_type, return NULL when nothing available */
@@ -308,6 +315,8 @@ static struct page *ttm_pool_type_take(struct ttm_pool_type *pt)
 	p = list_first_entry_or_null(&pt->pages, typeof(*p), lru);
 	if (p) {
 		atomic_long_sub(1 << pt->order, &allocated_pages);
+		mod_lruvec_page_state(p, NR_GPU_ACTIVE, (1 << pt->order));
+		mod_lruvec_page_state(p, NR_GPU_RECLAIM, -(1 << pt->order));
 		list_del(&p->lru);
 	}
 	spin_unlock(&pt->lock);
@@ -340,7 +349,7 @@ static void ttm_pool_type_fini(struct ttm_pool_type *pt)
 	spin_unlock(&shrinker_lock);
 
 	while ((p = ttm_pool_type_take(pt)))
-		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p);
+		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
 }
 
 /* Return the pool_type to use for the given caching and order */
@@ -392,7 +401,7 @@ static unsigned int ttm_pool_shrink(void)
 
 	p = ttm_pool_type_take(pt);
 	if (p) {
-		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p);
+		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
 		num_pages = 1 << pt->order;
 	} else {
 		num_pages = 0;
@@ -484,7 +493,7 @@ static pgoff_t ttm_pool_unmap_and_free(struct ttm_pool *pool, struct page *page,
 	if (pt)
 		ttm_pool_type_give(pt, page);
 	else
-		ttm_pool_free_page(pool, caching, order, page);
+		ttm_pool_free_page(pool, caching, order, page, false);
 
 	return nr;
 }
@@ -789,7 +798,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 	return 0;
 
 error_free_page:
-	ttm_pool_free_page(pool, page_caching, order, p);
+	ttm_pool_free_page(pool, page_caching, order, p, false);
 
 error_free_all:
 	if (tt->restore)
-- 
2.52.0



* [PATCH 03/16] ttm/pool: port to list_lru. (v2)
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
  2026-02-24  2:06 ` [PATCH 01/16] mm: add gpu active/reclaim per-node stat counters (v2) Dave Airlie
  2026-02-24  2:06 ` [PATCH 02/16] drm/ttm: use gpu mm stats to track gpu memory allocations. (v4) Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  2:06 ` [PATCH 04/16] ttm/pool: drop numa specific pools Dave Airlie
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

This is an initial port of the TTM pools for
write-combined and uncached pages to use the list_lru.

This makes the pools more NUMA aware and avoids
needing separate NUMA pools (a later commit enables this).

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

---
v2: drop the pt->lock, the lru list has its own lock which is sufficient.
rearrange list isolates to fix bad locking orders.
---
 drivers/gpu/drm/ttm/tests/ttm_device_test.c |  2 +-
 drivers/gpu/drm/ttm/tests/ttm_pool_test.c   | 32 ++++----
 drivers/gpu/drm/ttm/ttm_pool.c              | 90 ++++++++++++++-------
 include/drm/ttm/ttm_pool.h                  |  7 +-
 mm/list_lru.c                               |  1 +
 5 files changed, 83 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/ttm/tests/ttm_device_test.c b/drivers/gpu/drm/ttm/tests/ttm_device_test.c
index 2d55ad34fe48..db4b4a09a73f 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_device_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_device_test.c
@@ -176,7 +176,7 @@ static void ttm_device_init_pools(struct kunit *test)
 
 				if (ttm_pool_uses_dma_alloc(pool))
 					KUNIT_ASSERT_FALSE(test,
-							   list_empty(&pt.pages));
+							   !list_lru_count(&pt.pages));
 			}
 		}
 	}
diff --git a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
index 11c92bd75779..01197014b83f 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
@@ -248,7 +248,7 @@ static void ttm_pool_alloc_order_caching_match(struct kunit *test)
 	pool = ttm_pool_pre_populated(test, size, caching);
 
 	pt = &pool->caching[caching].orders[order];
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt->pages));
 
 	tt = ttm_tt_kunit_init(test, 0, caching, size);
 	KUNIT_ASSERT_NOT_NULL(test, tt);
@@ -256,7 +256,7 @@ static void ttm_pool_alloc_order_caching_match(struct kunit *test)
 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
-	KUNIT_ASSERT_TRUE(test, list_empty(&pt->pages));
+	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
 
 	ttm_pool_free(pool, tt);
 	ttm_tt_fini(tt);
@@ -282,8 +282,8 @@ static void ttm_pool_alloc_caching_mismatch(struct kunit *test)
 	tt = ttm_tt_kunit_init(test, 0, tt_caching, size);
 	KUNIT_ASSERT_NOT_NULL(test, tt);
 
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt_pool->pages));
-	KUNIT_ASSERT_TRUE(test, list_empty(&pt_tt->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
+	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt_tt->pages));
 
 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
@@ -291,8 +291,8 @@ static void ttm_pool_alloc_caching_mismatch(struct kunit *test)
 	ttm_pool_free(pool, tt);
 	ttm_tt_fini(tt);
 
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt_pool->pages));
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt_tt->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_tt->pages));
 
 	ttm_pool_fini(pool);
 }
@@ -316,8 +316,8 @@ static void ttm_pool_alloc_order_mismatch(struct kunit *test)
 	tt = ttm_tt_kunit_init(test, 0, caching, snd_size);
 	KUNIT_ASSERT_NOT_NULL(test, tt);
 
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt_pool->pages));
-	KUNIT_ASSERT_TRUE(test, list_empty(&pt_tt->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
+	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt_tt->pages));
 
 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
@@ -325,8 +325,8 @@ static void ttm_pool_alloc_order_mismatch(struct kunit *test)
 	ttm_pool_free(pool, tt);
 	ttm_tt_fini(tt);
 
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt_pool->pages));
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt_tt->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_tt->pages));
 
 	ttm_pool_fini(pool);
 }
@@ -352,12 +352,12 @@ static void ttm_pool_free_dma_alloc(struct kunit *test)
 	ttm_pool_alloc(pool, tt, &simple_ctx);
 
 	pt = &pool->caching[caching].orders[order];
-	KUNIT_ASSERT_TRUE(test, list_empty(&pt->pages));
+	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
 
 	ttm_pool_free(pool, tt);
 	ttm_tt_fini(tt);
 
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt->pages));
 
 	ttm_pool_fini(pool);
 }
@@ -383,12 +383,12 @@ static void ttm_pool_free_no_dma_alloc(struct kunit *test)
 	ttm_pool_alloc(pool, tt, &simple_ctx);
 
 	pt = &pool->caching[caching].orders[order];
-	KUNIT_ASSERT_TRUE(test, list_is_singular(&pt->pages));
+	KUNIT_ASSERT_TRUE(test, list_lru_count(&pt->pages) == 1);
 
 	ttm_pool_free(pool, tt);
 	ttm_tt_fini(tt);
 
-	KUNIT_ASSERT_TRUE(test, list_is_singular(&pt->pages));
+	KUNIT_ASSERT_TRUE(test, list_lru_count(&pt->pages) == 1);
 
 	ttm_pool_fini(pool);
 }
@@ -404,11 +404,11 @@ static void ttm_pool_fini_basic(struct kunit *test)
 	pool = ttm_pool_pre_populated(test, size, caching);
 	pt = &pool->caching[caching].orders[order];
 
-	KUNIT_ASSERT_FALSE(test, list_empty(&pt->pages));
+	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt->pages));
 
 	ttm_pool_fini(pool);
 
-	KUNIT_ASSERT_TRUE(test, list_empty(&pt->pages));
+	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
 }
 
 static struct kunit_case ttm_pool_test_cases[] = {
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 6891776cad75..2947ffc51351 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -132,6 +132,16 @@ static struct list_head shrinker_list;
 static struct shrinker *mm_shrinker;
 static DECLARE_RWSEM(pool_shrink_rwsem);
 
+static int ttm_pool_nid(struct ttm_pool *pool)
+{
+	int nid = NUMA_NO_NODE;
+	if (pool)
+		nid = pool->nid;
+	if (nid == NUMA_NO_NODE)
+		nid = numa_node_id();
+	return nid;
+}
+
 /* Allocate pages of size 1 << order with the given gfp_flags */
 static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
 					unsigned int order)
@@ -297,30 +307,41 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
 			clear_page(page_address(p + i));
 	}
 
-	spin_lock(&pt->lock);
-	list_add(&p->lru, &pt->pages);
-	spin_unlock(&pt->lock);
+	INIT_LIST_HEAD(&p->lru);
+	rcu_read_lock();
+	list_lru_add(&pt->pages, &p->lru, page_to_nid(p), NULL);
+	rcu_read_unlock();
 	atomic_long_add(1 << pt->order, &allocated_pages);
 
 	mod_lruvec_page_state(p, NR_GPU_ACTIVE, -num_pages);
 	mod_lruvec_page_state(p, NR_GPU_RECLAIM, num_pages);
 }
 
+static enum lru_status take_one_from_lru(struct list_head *item,
+					 struct list_lru_one *list,
+					 void *cb_arg)
+{
+	struct page **out_page = cb_arg;
+	struct page *p = container_of(item, struct page, lru);
+	list_lru_isolate(list, item);
+
+	*out_page = p;
+	return LRU_REMOVED;
+}
+
 /* Take pages from a specific pool_type, return NULL when nothing available */
-static struct page *ttm_pool_type_take(struct ttm_pool_type *pt)
+static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid)
 {
-	struct page *p;
+	int ret;
+	struct page *p = NULL;
+	unsigned long nr_to_walk = 1;
 
-	spin_lock(&pt->lock);
-	p = list_first_entry_or_null(&pt->pages, typeof(*p), lru);
-	if (p) {
+	ret = list_lru_walk_node(&pt->pages, nid, take_one_from_lru, (void *)&p, &nr_to_walk);
+	if (ret == 1 && p) {
 		atomic_long_sub(1 << pt->order, &allocated_pages);
 		mod_lruvec_page_state(p, NR_GPU_ACTIVE, (1 << pt->order));
 		mod_lruvec_page_state(p, NR_GPU_RECLAIM, -(1 << pt->order));
-		list_del(&p->lru);
 	}
-	spin_unlock(&pt->lock);
-
 	return p;
 }
 
@@ -331,25 +352,47 @@ static void ttm_pool_type_init(struct ttm_pool_type *pt, struct ttm_pool *pool,
 	pt->pool = pool;
 	pt->caching = caching;
 	pt->order = order;
-	spin_lock_init(&pt->lock);
-	INIT_LIST_HEAD(&pt->pages);
+	list_lru_init(&pt->pages);
 
 	spin_lock(&shrinker_lock);
 	list_add_tail(&pt->shrinker_list, &shrinker_list);
 	spin_unlock(&shrinker_lock);
 }
 
+static enum lru_status pool_move_to_dispose_list(struct list_head *item,
+						 struct list_lru_one *list,
+						 void *cb_arg)
+{
+	struct list_head *dispose = cb_arg;
+
+	list_lru_isolate_move(list, item, dispose);
+
+	return LRU_REMOVED;
+}
+
+static void ttm_pool_dispose_list(struct ttm_pool_type *pt,
+				  struct list_head *dispose)
+{
+	while (!list_empty(dispose)) {
+		struct page *p;
+		p = list_first_entry(dispose, struct page, lru);
+		list_del_init(&p->lru);
+		atomic_long_sub(1 << pt->order, &allocated_pages);
+		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
+	}
+}
+
 /* Remove a pool_type from the global shrinker list and free all pages */
 static void ttm_pool_type_fini(struct ttm_pool_type *pt)
 {
-	struct page *p;
+	LIST_HEAD(dispose);
 
 	spin_lock(&shrinker_lock);
 	list_del(&pt->shrinker_list);
 	spin_unlock(&shrinker_lock);
 
-	while ((p = ttm_pool_type_take(pt)))
-		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
+	list_lru_walk(&pt->pages, pool_move_to_dispose_list, &dispose, LONG_MAX);
+	ttm_pool_dispose_list(pt, &dispose);
 }
 
 /* Return the pool_type to use for the given caching and order */
@@ -399,7 +442,7 @@ static unsigned int ttm_pool_shrink(void)
 	list_move_tail(&pt->shrinker_list, &shrinker_list);
 	spin_unlock(&shrinker_lock);
 
-	p = ttm_pool_type_take(pt);
+	p = ttm_pool_type_take(pt, ttm_pool_nid(pt->pool));
 	if (p) {
 		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
 		num_pages = 1 << pt->order;
@@ -753,7 +796,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 		p = NULL;
 		pt = ttm_pool_select_type(pool, page_caching, order);
 		if (pt && allow_pools)
-			p = ttm_pool_type_take(pt);
+			p = ttm_pool_type_take(pt, ttm_pool_nid(pool));
 		/*
 		 * If that fails or previously failed, allocate from system.
 		 * Note that this also disallows additional pool allocations using
@@ -1182,16 +1225,7 @@ static unsigned long ttm_pool_shrinker_count(struct shrinker *shrink,
 /* Count the number of pages available in a pool_type */
 static unsigned int ttm_pool_type_count(struct ttm_pool_type *pt)
 {
-	unsigned int count = 0;
-	struct page *p;
-
-	spin_lock(&pt->lock);
-	/* Only used for debugfs, the overhead doesn't matter */
-	list_for_each_entry(p, &pt->pages, lru)
-		++count;
-	spin_unlock(&pt->lock);
-
-	return count;
+	return list_lru_count(&pt->pages);
 }
 
 /* Print a nice header for the order */
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index 233581670e78..26ee592e1994 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -29,6 +29,7 @@
 #include <linux/mmzone.h>
 #include <linux/llist.h>
 #include <linux/spinlock.h>
+#include <linux/list_lru.h>
 #include <drm/ttm/ttm_caching.h>
 
 struct device;
@@ -45,8 +46,7 @@ struct ttm_tt;
  * @order: the allocation order our pages have
  * @caching: the caching type our pages have
  * @shrinker_list: our place on the global shrinker list
- * @lock: protection of the page list
- * @pages: the list of pages in the pool
+ * @pages: the lru_list of pages in the pool
  */
 struct ttm_pool_type {
 	struct ttm_pool *pool;
@@ -55,8 +55,7 @@ struct ttm_pool_type {
 
 	struct list_head shrinker_list;
 
-	spinlock_t lock;
-	struct list_head pages;
+	struct list_lru pages;
 };
 
 /**
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 26463ae29c64..dd29bcf8eb5f 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -179,6 +179,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
 	unlock_list_lru(l, false);
 	return false;
 }
+EXPORT_SYMBOL_GPL(list_lru_add);
 
 bool list_lru_add_obj(struct list_lru *lru, struct list_head *item)
 {
-- 
2.52.0



* [PATCH 04/16] ttm/pool: drop numa specific pools
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
                   ` (2 preceding siblings ...)
  2026-02-24  2:06 ` [PATCH 03/16] ttm/pool: port to list_lru. (v2) Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  2:06 ` [PATCH 05/16] ttm/pool: make pool shrinker NUMA aware (v2) Dave Airlie
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

The list_lru will now handle numa for us, so there is no need to keep
separate pool types for it. Just consolidate into the global ones.

This adds a debugfs change to avoid dumping non-existent orders due
to this change.

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_pool.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 2947ffc51351..3989e15ab5b0 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -406,17 +406,11 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
 #ifdef CONFIG_X86
 	switch (caching) {
 	case ttm_write_combined:
-		if (pool->nid != NUMA_NO_NODE)
-			return &pool->caching[caching].orders[order];
-
 		if (ttm_pool_uses_dma32(pool))
 			return &global_dma32_write_combined[order];
 
 		return &global_write_combined[order];
 	case ttm_uncached:
-		if (pool->nid != NUMA_NO_NODE)
-			return &pool->caching[caching].orders[order];
-
 		if (ttm_pool_uses_dma32(pool))
 			return &global_dma32_uncached[order];
 
@@ -1291,7 +1285,7 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
 {
 	unsigned int i;
 
-	if (!ttm_pool_uses_dma_alloc(pool) && pool->nid == NUMA_NO_NODE) {
+	if (!ttm_pool_uses_dma_alloc(pool)) {
 		seq_puts(m, "unused\n");
 		return 0;
 	}
@@ -1302,10 +1296,7 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
 	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
 		if (!ttm_pool_select_type(pool, i, 0))
 			continue;
-		if (ttm_pool_uses_dma_alloc(pool))
-			seq_puts(m, "DMA ");
-		else
-			seq_printf(m, "N%d ", pool->nid);
+		seq_puts(m, "DMA ");
 		switch (i) {
 		case ttm_cached:
 			seq_puts(m, "\t:");
-- 
2.52.0



* [PATCH 05/16] ttm/pool: make pool shrinker NUMA aware (v2)
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
                   ` (3 preceding siblings ...)
  2026-02-24  2:06 ` [PATCH 04/16] ttm/pool: drop numa specific pools Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  2:06 ` [PATCH 06/16] ttm/pool: track allocated_pages per numa node Dave Airlie
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

This enables NUMA awareness for the shrinker on the
ttm pools.

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

---
v2: AI review - found reverse diff calculation
---
 drivers/gpu/drm/ttm/ttm_pool.c | 38 +++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 3989e15ab5b0..880228132b91 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -423,12 +423,12 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
 	return NULL;
 }
 
-/* Free pages using the global shrinker list */
-static unsigned int ttm_pool_shrink(void)
+/* Free pages using the per-node shrinker list */
+static unsigned int ttm_pool_shrink(int nid, unsigned long num_to_free)
 {
+	LIST_HEAD(dispose);
 	struct ttm_pool_type *pt;
 	unsigned int num_pages;
-	struct page *p;
 
 	down_read(&pool_shrink_rwsem);
 	spin_lock(&shrinker_lock);
@@ -436,13 +436,10 @@ static unsigned int ttm_pool_shrink(void)
 	list_move_tail(&pt->shrinker_list, &shrinker_list);
 	spin_unlock(&shrinker_lock);
 
-	p = ttm_pool_type_take(pt, ttm_pool_nid(pt->pool));
-	if (p) {
-		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
-		num_pages = 1 << pt->order;
-	} else {
-		num_pages = 0;
-	}
+	num_pages = list_lru_walk_node(&pt->pages, nid, pool_move_to_dispose_list, &dispose, &num_to_free);
+	num_pages *= 1 << pt->order;
+
+	ttm_pool_dispose_list(pt, &dispose);
 	up_read(&pool_shrink_rwsem);
 
 	return num_pages;
@@ -791,6 +788,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 		pt = ttm_pool_select_type(pool, page_caching, order);
 		if (pt && allow_pools)
 			p = ttm_pool_type_take(pt, ttm_pool_nid(pool));
+
 		/*
 		 * If that fails or previously failed, allocate from system.
 		 * Note that this also disallows additional pool allocations using
@@ -941,8 +939,10 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
 {
 	ttm_pool_free_range(pool, tt, tt->caching, 0, tt->num_pages);
 
-	while (atomic_long_read(&allocated_pages) > page_pool_size)
-		ttm_pool_shrink();
+	while (atomic_long_read(&allocated_pages) > page_pool_size) {
+		unsigned long diff = atomic_long_read(&allocated_pages) - page_pool_size;
+		ttm_pool_shrink(ttm_pool_nid(pool), diff);
+	}
 }
 EXPORT_SYMBOL(ttm_pool_free);
 
@@ -1197,7 +1197,7 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
 	unsigned long num_freed = 0;
 
 	do
-		num_freed += ttm_pool_shrink();
+		num_freed += ttm_pool_shrink(sc->nid, sc->nr_to_scan);
 	while (num_freed < sc->nr_to_scan &&
 	       atomic_long_read(&allocated_pages));
 
@@ -1325,11 +1325,15 @@ static int ttm_pool_debugfs_shrink_show(struct seq_file *m, void *data)
 		.nr_to_scan = TTM_SHRINKER_BATCH,
 	};
 	unsigned long count;
+	int nid;
 
 	fs_reclaim_acquire(GFP_KERNEL);
-	count = ttm_pool_shrinker_count(mm_shrinker, &sc);
-	seq_printf(m, "%lu/%lu\n", count,
-		   ttm_pool_shrinker_scan(mm_shrinker, &sc));
+	for_each_node(nid) {
+		sc.nid = nid;
+		count = ttm_pool_shrinker_count(mm_shrinker, &sc);
+		seq_printf(m, "%d: %lu/%lu\n", nid, count,
+			   ttm_pool_shrinker_scan(mm_shrinker, &sc));
+	}
 	fs_reclaim_release(GFP_KERNEL);
 
 	return 0;
@@ -1377,7 +1381,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
 #endif
 #endif
 
-	mm_shrinker = shrinker_alloc(0, "drm-ttm_pool");
+	mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
 	if (!mm_shrinker)
 		return -ENOMEM;
 
-- 
2.52.0



* [PATCH 06/16] ttm/pool: track allocated_pages per numa node.
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
                   ` (4 preceding siblings ...)
  2026-02-24  2:06 ` [PATCH 05/16] ttm/pool: make pool shrinker NUMA aware (v2) Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  2:06 ` [PATCH 07/16] memcg: add support for GPU page counters. (v4) Dave Airlie
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

This gets the memory sizes from the nodes and stores the limit
as 50% of those. I think eventually we should drop the limits
once we have memcg aware shrinking, but this should be more NUMA
friendly, and I think it is what people would prefer to happen
on NUMA aware systems.

Cc: Christian Koenig <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_pool.c | 62 ++++++++++++++++++++++++++--------
 1 file changed, 47 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 880228132b91..6b288558ac3b 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -116,10 +116,11 @@ struct ttm_pool_tt_restore {
 
 static unsigned long page_pool_size;
 
-MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool");
+MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool per NUMA node");
 module_param(page_pool_size, ulong, 0644);
 
-static atomic_long_t allocated_pages;
+static unsigned long pool_node_limit[MAX_NUMNODES];
+static atomic_long_t allocated_pages[MAX_NUMNODES];
 
 static struct ttm_pool_type global_write_combined[NR_PAGE_ORDERS];
 static struct ttm_pool_type global_uncached[NR_PAGE_ORDERS];
@@ -299,6 +300,7 @@ static void ttm_pool_unmap(struct ttm_pool *pool, dma_addr_t dma_addr,
 static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
 {
 	unsigned int i, num_pages = 1 << pt->order;
+	int nid = page_to_nid(p);
 
 	for (i = 0; i < num_pages; ++i) {
 		if (PageHighMem(p))
@@ -309,10 +311,10 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
 
 	INIT_LIST_HEAD(&p->lru);
 	rcu_read_lock();
-	list_lru_add(&pt->pages, &p->lru, page_to_nid(p), NULL);
+	list_lru_add(&pt->pages, &p->lru, nid, NULL);
 	rcu_read_unlock();
-	atomic_long_add(1 << pt->order, &allocated_pages);
 
+	atomic_long_add(num_pages, &allocated_pages[nid]);
 	mod_lruvec_page_state(p, NR_GPU_ACTIVE, -num_pages);
 	mod_lruvec_page_state(p, NR_GPU_RECLAIM, num_pages);
 }
@@ -338,7 +340,7 @@ static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid)
 
 	ret = list_lru_walk_node(&pt->pages, nid, take_one_from_lru, (void *)&p, &nr_to_walk);
 	if (ret == 1 && p) {
-		atomic_long_sub(1 << pt->order, &allocated_pages);
+		atomic_long_sub(1 << pt->order, &allocated_pages[nid]);
 		mod_lruvec_page_state(p, NR_GPU_ACTIVE, (1 << pt->order));
 		mod_lruvec_page_state(p, NR_GPU_RECLAIM, -(1 << pt->order));
 	}
@@ -377,7 +379,7 @@ static void ttm_pool_dispose_list(struct ttm_pool_type *pt,
 		struct page *p;
 		p = list_first_entry(dispose, struct page, lru);
 		list_del_init(&p->lru);
-		atomic_long_sub(1 << pt->order, &allocated_pages);
+		atomic_long_sub(1 << pt->order, &allocated_pages[page_to_nid(p)]);
 		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p, true);
 	}
 }
@@ -937,11 +939,13 @@ int ttm_pool_restore_and_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
  */
 void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
 {
+	int nid = ttm_pool_nid(pool);
+
 	ttm_pool_free_range(pool, tt, tt->caching, 0, tt->num_pages);
 
-	while (atomic_long_read(&allocated_pages) > page_pool_size) {
-		unsigned long diff = atomic_long_read(&allocated_pages) - page_pool_size;
-		ttm_pool_shrink(ttm_pool_nid(pool), diff);
+	while (atomic_long_read(&allocated_pages[nid]) > pool_node_limit[nid]) {
+		unsigned long diff = atomic_long_read(&allocated_pages[nid]) - pool_node_limit[nid];
+		ttm_pool_shrink(nid, diff);
 	}
 }
 EXPORT_SYMBOL(ttm_pool_free);
@@ -1199,7 +1203,7 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
 	do
 		num_freed += ttm_pool_shrink(sc->nid, sc->nr_to_scan);
 	while (num_freed < sc->nr_to_scan &&
-	       atomic_long_read(&allocated_pages));
+	       atomic_long_read(&allocated_pages[sc->nid]));
 
 	sc->nr_scanned = num_freed;
 
@@ -1210,7 +1214,7 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
 static unsigned long ttm_pool_shrinker_count(struct shrinker *shrink,
 					     struct shrink_control *sc)
 {
-	unsigned long num_pages = atomic_long_read(&allocated_pages);
+	unsigned long num_pages = atomic_long_read(&allocated_pages[sc->nid]);
 
 	return num_pages ? num_pages : SHRINK_EMPTY;
 }
@@ -1247,8 +1251,12 @@ static void ttm_pool_debugfs_orders(struct ttm_pool_type *pt,
 /* Dump the total amount of allocated pages */
 static void ttm_pool_debugfs_footer(struct seq_file *m)
 {
-	seq_printf(m, "\ntotal\t: %8lu of %8lu\n",
-		   atomic_long_read(&allocated_pages), page_pool_size);
+	int nid;
+
+	for_each_node(nid) {
+		seq_printf(m, "\ntotal node%d\t: %8lu of %8lu\n", nid,
+			   atomic_long_read(&allocated_pages[nid]), pool_node_limit[nid]);
+	}
 }
 
 /* Dump the information for the global pools */
@@ -1342,6 +1350,23 @@ DEFINE_SHOW_ATTRIBUTE(ttm_pool_debugfs_shrink);
 
 #endif
 
+static inline u64 ttm_get_node_memory_size(int nid)
+{
+	/*
+	 * This is directly using si_meminfo_node implementation as the
+	 * function is not exported.
+	 */
+	int zone_type;
+	u64 managed_pages = 0;
+
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
+		managed_pages +=
+			zone_managed_pages(&pgdat->node_zones[zone_type]);
+	return managed_pages * PAGE_SIZE;
+}
+
 /**
  * ttm_pool_mgr_init - Initialize globals
  *
@@ -1353,8 +1378,15 @@ int ttm_pool_mgr_init(unsigned long num_pages)
 {
 	unsigned int i;
 
-	if (!page_pool_size)
-		page_pool_size = num_pages;
+	int nid;
+	for_each_node(nid) {
+		if (!page_pool_size) {
+			u64 node_size = ttm_get_node_memory_size(nid);
+			pool_node_limit[nid] = (node_size >> PAGE_SHIFT) / 2;
+		} else {
+			pool_node_limit[nid] = page_pool_size;
+		}
+	}
 
 	spin_lock_init(&shrinker_lock);
 	INIT_LIST_HEAD(&shrinker_list);
-- 
2.52.0



* [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
                   ` (5 preceding siblings ...)
  2026-02-24  2:06 ` [PATCH 06/16] ttm/pool: track allocated_pages per numa node Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  7:20   ` kernel test robot
  2026-02-24  7:50   ` Christian König
  2026-02-24  2:06 ` [PATCH 08/16] ttm: add a memcg accounting flag to the alloc/populate APIs Dave Airlie
                   ` (8 subsequent siblings)
  15 siblings, 2 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

This introduces 2 new statistics and 3 new memcontrol APIs for dealing
with GPU system memory allocations.

The stats correspond to the same stats in the global vmstat:
the number of active GPU pages, and the number of pages in pools
that can be reclaimed.

The first API charges an order of pages to an objcg, sets
the objcg on the pages like kmem does, and updates the active/reclaim
statistics.

The second API uncharges a page from the obj cgroup it is currently charged
to.

The third API allows moving a page to/from reclaim and between obj
cgroups. When pages are added to the pool lru, this just updates
accounting. When pages are removed from a pool lru, they may have been
charged to the parent objcg, so this allows them to be uncharged from
there and transferred to a new child objcg.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
v2: use memcg_node_stat_items
v3: fix null ptr dereference in uncharge
v4: AI review: fix parameter names, fix problem with reclaim moving doing wrong thing
---
 Documentation/admin-guide/cgroup-v2.rst |   6 ++
 include/linux/memcontrol.h              |  11 +++
 mm/memcontrol.c                         | 104 ++++++++++++++++++++++++
 3 files changed, 121 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 91beaa6798ce..3ea7f1a399e8 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1573,6 +1573,12 @@ The following nested keys are defined.
 	  vmalloc (npn)
 		Amount of memory used for vmap backed memory.
 
+	  gpu_active (npn)
+		Amount of system memory used for GPU devices.
+
+	  gpu_reclaim (npn)
+		Amount of system memory cached for GPU devices.
+
 	  shmem
 		Amount of cached filesystem data that is swap-backed,
 		such as tmpfs, shm segments, shared anonymous mmap()s
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 70b685a85bf4..4f75d64f5fca 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1583,6 +1583,17 @@ static inline void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 struct sock;
+bool mem_cgroup_charge_gpu_page(struct obj_cgroup *objcg, struct page *page,
+			   unsigned int order,
+			   gfp_t gfp_mask, bool reclaim);
+void mem_cgroup_uncharge_gpu_page(struct page *page,
+				  unsigned int order,
+				  bool reclaim);
+bool mem_cgroup_move_gpu_page_reclaim(struct obj_cgroup *objcg,
+				      struct page *page,
+				      unsigned int order,
+				      bool to_reclaim);
+
 #ifdef CONFIG_MEMCG
 extern struct static_key_false memcg_sockets_enabled_key;
 #define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a52da3a5e4fd..90bb3e00c258 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -333,6 +333,8 @@ static const unsigned int memcg_node_stat_items[] = {
 #ifdef CONFIG_HUGETLB_PAGE
 	NR_HUGETLB,
 #endif
+	NR_GPU_ACTIVE,
+	NR_GPU_RECLAIM,
 };
 
 static const unsigned int memcg_stat_items[] = {
@@ -1360,6 +1362,8 @@ static const struct memory_stat memory_stats[] = {
 	{ "percpu",			MEMCG_PERCPU_B			},
 	{ "sock",			MEMCG_SOCK			},
 	{ "vmalloc",			MEMCG_VMALLOC			},
+	{ "gpu_active",			NR_GPU_ACTIVE			},
+	{ "gpu_reclaim",		NR_GPU_RECLAIM			},
 	{ "shmem",			NR_SHMEM			},
 #ifdef CONFIG_ZSWAP
 	{ "zswap",			MEMCG_ZSWAP_B			},
@@ -5133,6 +5137,106 @@ void mem_cgroup_flush_workqueue(void)
 	flush_workqueue(memcg_wq);
 }
 
+/**
+ * mem_cgroup_charge_gpu_page - charge a page to GPU memory tracking
+ * @objcg: objcg to charge, NULL charges root memcg
+ * @page: page to charge
+ * @order: page allocation order
+ * @gfp_mask: gfp mode
+ * @reclaim: charge the reclaim counter instead of the active one.
+ *
+ * Charge the order sized @page to the objcg. Returns %true if the charge fit within
+ * @objcg's configured limit, %false if it doesn't.
+ */
+bool mem_cgroup_charge_gpu_page(struct obj_cgroup *objcg, struct page *page,
+				unsigned int order, gfp_t gfp_mask, bool reclaim)
+{
+	unsigned int nr_pages = 1 << order;
+	struct mem_cgroup *memcg = NULL;
+	struct lruvec *lruvec;
+	int ret;
+
+	if (objcg) {
+		memcg = get_mem_cgroup_from_objcg(objcg);
+
+		ret = try_charge_memcg(memcg, gfp_mask, nr_pages);
+		if (ret) {
+			mem_cgroup_put(memcg);
+			return false;
+		}
+
+		obj_cgroup_get(objcg);
+		page_set_objcg(page, objcg);
+	}
+
+	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
+	mod_lruvec_state(lruvec, reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE, nr_pages);
+
+	mem_cgroup_put(memcg);
+	return true;
+}
+EXPORT_SYMBOL_GPL(mem_cgroup_charge_gpu_page);
+
+/**
+ * mem_cgroup_uncharge_gpu_page - uncharge a page from GPU memory tracking
+ * @page: page to uncharge
+ * @order: order of the page allocation
+ * @reclaim: uncharge the reclaim counter instead of the active.
+ */
+void mem_cgroup_uncharge_gpu_page(struct page *page,
+				  unsigned int order, bool reclaim)
+{
+	struct obj_cgroup *objcg = page_objcg(page);
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+	int nr_pages = 1 << order;
+
+	memcg = objcg ? get_mem_cgroup_from_objcg(objcg) : NULL;
+
+	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
+	mod_lruvec_state(lruvec, reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE, -nr_pages);
+
+	if (memcg && !mem_cgroup_is_root(memcg))
+		refill_stock(memcg, nr_pages);
+	page->memcg_data = 0;
+	obj_cgroup_put(objcg);
+	mem_cgroup_put(memcg);
+}
+EXPORT_SYMBOL_GPL(mem_cgroup_uncharge_gpu_page);
+
+/**
+ * mem_cgroup_move_gpu_page_reclaim - move a gpu page to/from reclaim and between objcgs
+ * @new_objcg: objcg to move the page to, NULL if just a stats update.
+ * @page: the @order sized page allocation to move
+ * @to_reclaim: true moves the page into reclaim, false moves it back
+ */
+bool mem_cgroup_move_gpu_page_reclaim(struct obj_cgroup *new_objcg,
+				      struct page *page,
+				      unsigned int order,
+				      bool to_reclaim)
+{
+	struct obj_cgroup *objcg = page_objcg(page);
+
+	if (!objcg || !new_objcg || objcg == new_objcg) {
+		struct mem_cgroup *memcg = objcg ? get_mem_cgroup_from_objcg(objcg) : NULL;
+		struct lruvec *lruvec;
+		unsigned long flags;
+		int nr_pages = 1 << order;
+
+		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
+		local_irq_save(flags);
+		mod_lruvec_state(lruvec, to_reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE, nr_pages);
+		mod_lruvec_state(lruvec, to_reclaim ? NR_GPU_ACTIVE : NR_GPU_RECLAIM, -nr_pages);
+		local_irq_restore(flags);
+		mem_cgroup_put(memcg);
+		return true;
+	} else {
+		mem_cgroup_uncharge_gpu_page(page, order, true);
+		return mem_cgroup_charge_gpu_page(new_objcg, page, order, 0, false);
+	}
+}
+EXPORT_SYMBOL_GPL(mem_cgroup_move_gpu_page_reclaim);
+
 static int __init cgroup_memory(char *s)
 {
 	char *token;
-- 
2.52.0



* [PATCH 08/16] ttm: add a memcg accounting flag to the alloc/populate APIs
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
                   ` (6 preceding siblings ...)
  2026-02-24  2:06 ` [PATCH 07/16] memcg: add support for GPU page counters. (v4) Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  8:42   ` kernel test robot
  2026-02-24  2:06 ` [PATCH 09/16] ttm/pool: initialise the shrinker earlier Dave Airlie
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

This flag does nothing yet; this just changes the APIs across all
users to accept it.

This flag will eventually control when a tt populate is accounted
to a memcg.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c          |  3 ++-
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c          |  5 +++--
 drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c     |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c       |  4 ++--
 drivers/gpu/drm/loongson/lsdc_ttm.c              |  3 ++-
 drivers/gpu/drm/nouveau/nouveau_bo.c             |  6 ++++--
 drivers/gpu/drm/radeon/radeon_ttm.c              |  3 ++-
 drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c |  2 +-
 drivers/gpu/drm/ttm/tests/ttm_pool_test.c        | 16 ++++++++--------
 drivers/gpu/drm/ttm/tests/ttm_tt_test.c          | 12 ++++++------
 drivers/gpu/drm/ttm/ttm_bo.c                     |  7 ++++---
 drivers/gpu/drm/ttm/ttm_bo_util.c                |  6 +++---
 drivers/gpu/drm/ttm/ttm_bo_vm.c                  |  4 +++-
 drivers/gpu/drm/ttm/ttm_pool.c                   |  6 ++++--
 drivers/gpu/drm/ttm/ttm_tt.c                     |  8 +++++---
 drivers/gpu/drm/vmwgfx/vmwgfx_blit.c             |  4 ++--
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c       |  7 ++++---
 drivers/gpu/drm/xe/xe_bo.c                       |  5 +++--
 include/drm/ttm/ttm_bo.h                         |  1 +
 include/drm/ttm/ttm_device.h                     |  1 +
 include/drm/ttm/ttm_pool.h                       |  1 +
 include/drm/ttm/ttm_tt.h                         |  1 +
 22 files changed, 63 insertions(+), 44 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index eeaa56c8d129..e50c133e87df 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1203,6 +1203,7 @@ static struct ttm_tt *amdgpu_ttm_tt_create(struct ttm_buffer_object *bo,
  */
 static int amdgpu_ttm_tt_populate(struct ttm_device *bdev,
 				  struct ttm_tt *ttm,
+				  bool memcg_account,
 				  struct ttm_operation_ctx *ctx)
 {
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bdev);
@@ -1226,7 +1227,7 @@ static int amdgpu_ttm_tt_populate(struct ttm_device *bdev,
 		pool = &adev->mman.ttm_pools[gtt->pool_id];
 	else
 		pool = &adev->mman.bdev.pool;
-	ret = ttm_pool_alloc(pool, ttm, ctx);
+	ret = ttm_pool_alloc(pool, ttm, memcg_account, ctx);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
index 033eda38e4b5..da4421cf2623 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
@@ -317,6 +317,7 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
 
 static int i915_ttm_tt_populate(struct ttm_device *bdev,
 				struct ttm_tt *ttm,
+				bool memcg_account,
 				struct ttm_operation_ctx *ctx)
 {
 	struct i915_ttm_tt *i915_tt = container_of(ttm, typeof(*i915_tt), ttm);
@@ -324,7 +325,7 @@ static int i915_ttm_tt_populate(struct ttm_device *bdev,
 	if (i915_tt->is_shmem)
 		return i915_ttm_tt_shmem_populate(bdev, ttm, ctx);
 
-	return ttm_pool_alloc(&bdev->pool, ttm, ctx);
+	return ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
 }
 
 static void i915_ttm_tt_unpopulate(struct ttm_device *bdev, struct ttm_tt *ttm)
@@ -811,7 +812,7 @@ static int __i915_ttm_get_pages(struct drm_i915_gem_object *obj,
 	}
 
 	if (bo->ttm && !ttm_tt_is_populated(bo->ttm)) {
-		ret = ttm_bo_populate(bo, &ctx);
+		ret = ttm_bo_populate(bo, false, &ctx);
 		if (ret)
 			return ret;
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
index 3a7e202ae87d..ec475b311aea 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
@@ -624,7 +624,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, bool evict,
 
 	/* Populate ttm with pages if needed. Typically system memory. */
 	if (ttm && (dst_man->use_tt || (ttm->page_flags & TTM_TT_FLAG_SWAPPED))) {
-		ret = ttm_bo_populate(bo, ctx);
+		ret = ttm_bo_populate(bo, false, ctx);
 		if (ret)
 			return ret;
 	}
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c
index 4824f948daed..d951e58f78ea 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_pm.c
@@ -91,7 +91,7 @@ static int i915_ttm_backup(struct i915_gem_apply_to_region *apply,
 		goto out_no_lock;
 
 	backup_bo = i915_gem_to_ttm(backup);
-	err = ttm_bo_populate(backup_bo, &ctx);
+	err = ttm_bo_populate(backup_bo, false, &ctx);
 	if (err)
 		goto out_no_populate;
 
@@ -190,7 +190,7 @@ static int i915_ttm_restore(struct i915_gem_apply_to_region *apply,
 	if (!backup_bo->resource)
 		err = ttm_bo_validate(backup_bo, i915_ttm_sys_placement(), &ctx);
 	if (!err)
-		err = ttm_bo_populate(backup_bo, &ctx);
+		err = ttm_bo_populate(backup_bo, false, &ctx);
 	if (!err) {
 		err = i915_gem_obj_copy_ttm(obj, backup, pm_apply->allow_gpu,
 					    false);
diff --git a/drivers/gpu/drm/loongson/lsdc_ttm.c b/drivers/gpu/drm/loongson/lsdc_ttm.c
index d7441d96a0dc..bfb5f1b1ec91 100644
--- a/drivers/gpu/drm/loongson/lsdc_ttm.c
+++ b/drivers/gpu/drm/loongson/lsdc_ttm.c
@@ -111,6 +111,7 @@ lsdc_ttm_tt_create(struct ttm_buffer_object *tbo, uint32_t page_flags)
 
 static int lsdc_ttm_tt_populate(struct ttm_device *bdev,
 				struct ttm_tt *ttm,
+				bool memcg_account,
 				struct ttm_operation_ctx *ctx)
 {
 	bool slave = !!(ttm->page_flags & TTM_TT_FLAG_EXTERNAL);
@@ -123,7 +124,7 @@ static int lsdc_ttm_tt_populate(struct ttm_device *bdev,
 		return 0;
 	}
 
-	return ttm_pool_alloc(&bdev->pool, ttm, ctx);
+	return ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
 }
 
 static void lsdc_ttm_tt_unpopulate(struct ttm_device *bdev,
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 3c7d2e5b3850..e8b0692afd0c 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -1417,7 +1417,9 @@ vm_fault_t nouveau_ttm_fault_reserve_notify(struct ttm_buffer_object *bo)
 
 static int
 nouveau_ttm_tt_populate(struct ttm_device *bdev,
-			struct ttm_tt *ttm, struct ttm_operation_ctx *ctx)
+			struct ttm_tt *ttm,
+			bool memcg_account,
+			struct ttm_operation_ctx *ctx)
 {
 	struct ttm_tt *ttm_dma = (void *)ttm;
 	struct nouveau_drm *drm;
@@ -1434,7 +1436,7 @@ nouveau_ttm_tt_populate(struct ttm_device *bdev,
 
 	drm = nouveau_bdev(bdev);
 
-	return ttm_pool_alloc(&drm->ttm.bdev.pool, ttm, ctx);
+	return ttm_pool_alloc(&drm->ttm.bdev.pool, ttm, memcg_account, ctx);
 }
 
 static void
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index e7ab8162ac69..98b09463abc2 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -526,6 +526,7 @@ static struct radeon_ttm_tt *radeon_ttm_tt_to_gtt(struct radeon_device *rdev,
 
 static int radeon_ttm_tt_populate(struct ttm_device *bdev,
 				  struct ttm_tt *ttm,
+				  bool memcg_account,
 				  struct ttm_operation_ctx *ctx)
 {
 	struct radeon_device *rdev = radeon_get_rdev(bdev);
@@ -547,7 +548,7 @@ static int radeon_ttm_tt_populate(struct ttm_device *bdev,
 		return 0;
 	}
 
-	return ttm_pool_alloc(&rdev->mman.bdev.pool, ttm, ctx);
+	return ttm_pool_alloc(&rdev->mman.bdev.pool, ttm, memcg_account, ctx);
 }
 
 static void radeon_ttm_tt_unpopulate(struct ttm_device *bdev, struct ttm_tt *ttm)
diff --git a/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c b/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
index 6d95447a989d..cef326ebf25a 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_bo_validate_test.c
@@ -538,7 +538,7 @@ static void ttm_bo_validate_no_placement_signaled(struct kunit *test)
 
 	if (params->with_ttm) {
 		old_tt = priv->ttm_dev->funcs->ttm_tt_create(bo, 0);
-		ttm_pool_alloc(&priv->ttm_dev->pool, old_tt, &ctx);
+		ttm_pool_alloc(&priv->ttm_dev->pool, old_tt, false, &ctx);
 		bo->ttm = old_tt;
 	}
 
diff --git a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
index 01197014b83f..456241bdaf41 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
@@ -89,7 +89,7 @@ static struct ttm_pool *ttm_pool_pre_populated(struct kunit *test,
 
 	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, TTM_ALLOCATION_POOL_USE_DMA_ALLOC);
 
-	err = ttm_pool_alloc(pool, tt, &simple_ctx);
+	err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	ttm_pool_free(pool, tt);
@@ -157,7 +157,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
 	KUNIT_ASSERT_EQ(test, pool->nid, NUMA_NO_NODE);
 	KUNIT_ASSERT_EQ(test, pool->alloc_flags, params->alloc_flags);
 
-	err = ttm_pool_alloc(pool, tt, &simple_ctx);
+	err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	KUNIT_ASSERT_EQ(test, tt->num_pages, expected_num_pages);
 
@@ -220,7 +220,7 @@ static void ttm_pool_alloc_basic_dma_addr(struct kunit *test)
 
 	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, TTM_ALLOCATION_POOL_USE_DMA_ALLOC);
 
-	err = ttm_pool_alloc(pool, tt, &simple_ctx);
+	err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	KUNIT_ASSERT_EQ(test, tt->num_pages, expected_num_pages);
 
@@ -253,7 +253,7 @@ static void ttm_pool_alloc_order_caching_match(struct kunit *test)
 	tt = ttm_tt_kunit_init(test, 0, caching, size);
 	KUNIT_ASSERT_NOT_NULL(test, tt);
 
-	err = ttm_pool_alloc(pool, tt, &simple_ctx);
+	err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
@@ -285,7 +285,7 @@ static void ttm_pool_alloc_caching_mismatch(struct kunit *test)
 	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
 	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt_tt->pages));
 
-	err = ttm_pool_alloc(pool, tt, &simple_ctx);
+	err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	ttm_pool_free(pool, tt);
@@ -319,7 +319,7 @@ static void ttm_pool_alloc_order_mismatch(struct kunit *test)
 	KUNIT_ASSERT_FALSE(test, !list_lru_count(&pt_pool->pages));
 	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt_tt->pages));
 
-	err = ttm_pool_alloc(pool, tt, &simple_ctx);
+	err = ttm_pool_alloc(pool, tt, false, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	ttm_pool_free(pool, tt);
@@ -349,7 +349,7 @@ static void ttm_pool_free_dma_alloc(struct kunit *test)
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 
 	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, TTM_ALLOCATION_POOL_USE_DMA_ALLOC);
-	ttm_pool_alloc(pool, tt, &simple_ctx);
+	ttm_pool_alloc(pool, tt, false, &simple_ctx);
 
 	pt = &pool->caching[caching].orders[order];
 	KUNIT_ASSERT_TRUE(test, !list_lru_count(&pt->pages));
@@ -380,7 +380,7 @@ static void ttm_pool_free_no_dma_alloc(struct kunit *test)
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 
 	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, 0);
-	ttm_pool_alloc(pool, tt, &simple_ctx);
+	ttm_pool_alloc(pool, tt, false, &simple_ctx);
 
 	pt = &pool->caching[caching].orders[order];
 	KUNIT_ASSERT_TRUE(test, list_lru_count(&pt->pages) == 1);
diff --git a/drivers/gpu/drm/ttm/tests/ttm_tt_test.c b/drivers/gpu/drm/ttm/tests/ttm_tt_test.c
index bd5f7d0b9b62..dfa38bbfd829 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_tt_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_tt_test.c
@@ -262,7 +262,7 @@ static void ttm_tt_populate_null_ttm(struct kunit *test)
 	struct ttm_operation_ctx ctx = { };
 	int err;
 
-	err = ttm_tt_populate(devs->ttm_dev, NULL, &ctx);
+	err = ttm_tt_populate(devs->ttm_dev, NULL, false, &ctx);
 	KUNIT_ASSERT_EQ(test, err, -EINVAL);
 }
 
@@ -283,11 +283,11 @@ static void ttm_tt_populate_populated_ttm(struct kunit *test)
 	err = ttm_tt_init(tt, bo, 0, ttm_cached, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
-	err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
+	err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	populated_page = *tt->pages;
 
-	err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
+	err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
 	KUNIT_ASSERT_PTR_EQ(test, populated_page, *tt->pages);
 }
 
@@ -307,7 +307,7 @@ static void ttm_tt_unpopulate_basic(struct kunit *test)
 	err = ttm_tt_init(tt, bo, 0, ttm_cached, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
-	err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
+	err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	KUNIT_ASSERT_TRUE(test, ttm_tt_is_populated(tt));
 
@@ -351,7 +351,7 @@ static void ttm_tt_swapin_basic(struct kunit *test)
 	err = ttm_tt_init(tt, bo, 0, ttm_cached, 0);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
-	err = ttm_tt_populate(devs->ttm_dev, tt, &ctx);
+	err = ttm_tt_populate(devs->ttm_dev, tt, false, &ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	KUNIT_ASSERT_TRUE(test, ttm_tt_is_populated(tt));
 
@@ -361,7 +361,7 @@ static void ttm_tt_swapin_basic(struct kunit *test)
 	KUNIT_ASSERT_TRUE(test, tt->page_flags & TTM_TT_FLAG_SWAPPED);
 
 	/* Swapout depopulates TT, allocate pages and then swap them in */
-	err = ttm_pool_alloc(&devs->ttm_dev->pool, tt, &ctx);
+	err = ttm_pool_alloc(&devs->ttm_dev->pool, tt, false, &ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
 
 	err = ttm_tt_swapin(tt);
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index acb9197db879..704af3cfc47a 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -144,7 +144,7 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
 			goto out_err;
 
 		if (mem->mem_type != TTM_PL_SYSTEM) {
-			ret = ttm_bo_populate(bo, ctx);
+			ret = ttm_bo_populate(bo, false, ctx);
 			if (ret)
 				goto out_err;
 		}
@@ -1260,6 +1260,7 @@ void ttm_bo_tt_destroy(struct ttm_buffer_object *bo)
  * is set to true.
  */
 int ttm_bo_populate(struct ttm_buffer_object *bo,
+		    bool memcg_account,
 		    struct ttm_operation_ctx *ctx)
 {
 	struct ttm_device *bdev = bo->bdev;
@@ -1273,7 +1274,7 @@ int ttm_bo_populate(struct ttm_buffer_object *bo,
 		return 0;
 
 	swapped = ttm_tt_is_swapped(tt);
-	ret = ttm_tt_populate(bdev, tt, ctx);
+	ret = ttm_tt_populate(bdev, tt, memcg_account, ctx);
 	if (ret)
 		return ret;
 
@@ -1298,7 +1299,7 @@ int ttm_bo_setup_export(struct ttm_buffer_object *bo,
 	if (ret != 0)
 		return ret;
 
-	ret = ttm_bo_populate(bo, ctx);
+	ret = ttm_bo_populate(bo, false, ctx);
 	ttm_bo_unreserve(bo);
 	return ret;
 }
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index f83b7d5ec6c6..1502515043c0 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -167,7 +167,7 @@ int ttm_bo_move_memcpy(struct ttm_buffer_object *bo,
 	src_man = ttm_manager_type(bdev, src_mem->mem_type);
 	if (ttm && ((ttm->page_flags & TTM_TT_FLAG_SWAPPED) ||
 		    dst_man->use_tt)) {
-		ret = ttm_bo_populate(bo, ctx);
+		ret = ttm_bo_populate(bo, false, ctx);
 		if (ret)
 			return ret;
 	}
@@ -352,7 +352,7 @@ static int ttm_bo_kmap_ttm(struct ttm_buffer_object *bo,
 
 	BUG_ON(!ttm);
 
-	ret = ttm_bo_populate(bo, &ctx);
+	ret = ttm_bo_populate(bo, false, &ctx);
 	if (ret)
 		return ret;
 
@@ -533,7 +533,7 @@ int ttm_bo_vmap(struct ttm_buffer_object *bo, struct iosys_map *map)
 		pgprot_t prot;
 		void *vaddr;
 
-		ret = ttm_bo_populate(bo, &ctx);
+		ret = ttm_bo_populate(bo, false, &ctx);
 		if (ret)
 			return ret;
 
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index a80510489c45..2e59836b6085 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -224,7 +224,9 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
 		};
 
 		ttm = bo->ttm;
-		err = ttm_bo_populate(bo, &ctx);
+		err = ttm_bo_populate(bo,
+				      false,
+				      &ctx);
 		if (err) {
 			if (err == -EINTR || err == -ERESTARTSYS ||
 			    err == -EAGAIN)
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 6b288558ac3b..f58147a83142 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -752,6 +752,7 @@ static unsigned int ttm_pool_alloc_find_order(unsigned int highest,
 }
 
 static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
+			    bool memcg_account,
 			    const struct ttm_operation_ctx *ctx,
 			    struct ttm_pool_alloc_state *alloc,
 			    struct ttm_pool_tt_restore *restore)
@@ -862,6 +863,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
  * Returns: 0 on success, negative error code otherwise.
  */
 int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
+		   bool memcg_account,
 		   struct ttm_operation_ctx *ctx)
 {
 	struct ttm_pool_alloc_state alloc;
@@ -871,7 +873,7 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 
 	ttm_pool_alloc_state_init(tt, &alloc);
 
-	return __ttm_pool_alloc(pool, tt, ctx, &alloc, NULL);
+	return __ttm_pool_alloc(pool, tt, memcg_account, ctx, &alloc, NULL);
 }
 EXPORT_SYMBOL(ttm_pool_alloc);
 
@@ -926,7 +928,7 @@ int ttm_pool_restore_and_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 			return 0;
 	}
 
-	return __ttm_pool_alloc(pool, tt, ctx, &alloc, restore);
+	return __ttm_pool_alloc(pool, tt, false, ctx, &alloc, restore);
 }
 
 /**
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index b645a1818184..aa0f17fca770 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -368,7 +368,9 @@ int ttm_tt_swapout(struct ttm_device *bdev, struct ttm_tt *ttm,
 EXPORT_SYMBOL_FOR_TESTS_ONLY(ttm_tt_swapout);
 
 int ttm_tt_populate(struct ttm_device *bdev,
-		    struct ttm_tt *ttm, struct ttm_operation_ctx *ctx)
+		    struct ttm_tt *ttm,
+		    bool memcg_account,
+		    struct ttm_operation_ctx *ctx)
 {
 	int ret;
 
@@ -397,9 +399,9 @@ int ttm_tt_populate(struct ttm_device *bdev,
 	}
 
 	if (bdev->funcs->ttm_tt_populate)
-		ret = bdev->funcs->ttm_tt_populate(bdev, ttm, ctx);
+		ret = bdev->funcs->ttm_tt_populate(bdev, ttm, memcg_account, ctx);
 	else
-		ret = ttm_pool_alloc(&bdev->pool, ttm, ctx);
+		ret = ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
 	if (ret)
 		goto error;
 
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c b/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
index 135b75a3e013..baa1c3fdb12c 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
@@ -569,13 +569,13 @@ int vmw_bo_cpu_blit(struct vmw_bo *vmw_dst,
 		dma_resv_assert_held(src->base.resv);
 
 	if (!ttm_tt_is_populated(dst->ttm)) {
-		ret = dst->bdev->funcs->ttm_tt_populate(dst->bdev, dst->ttm, &ctx);
+		ret = dst->bdev->funcs->ttm_tt_populate(dst->bdev, dst->ttm, false, &ctx);
 		if (ret)
 			return ret;
 	}
 
 	if (!ttm_tt_is_populated(src->ttm)) {
-		ret = src->bdev->funcs->ttm_tt_populate(src->bdev, src->ttm, &ctx);
+		ret = src->bdev->funcs->ttm_tt_populate(src->bdev, src->ttm, false, &ctx);
 		if (ret)
 			return ret;
 	}
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
index dfd08ee19041..368701756119 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
@@ -360,7 +360,8 @@ static void vmw_ttm_destroy(struct ttm_device *bdev, struct ttm_tt *ttm)
 
 
 static int vmw_ttm_populate(struct ttm_device *bdev,
-			    struct ttm_tt *ttm, struct ttm_operation_ctx *ctx)
+			    struct ttm_tt *ttm, bool memcg_account,
+			    struct ttm_operation_ctx *ctx)
 {
 	bool external = (ttm->page_flags & TTM_TT_FLAG_EXTERNAL) != 0;
 
@@ -372,7 +373,7 @@ static int vmw_ttm_populate(struct ttm_device *bdev,
 						       ttm->dma_address,
 						       ttm->num_pages);
 
-	return ttm_pool_alloc(&bdev->pool, ttm, ctx);
+	return ttm_pool_alloc(&bdev->pool, ttm, memcg_account, ctx);
 }
 
 static void vmw_ttm_unpopulate(struct ttm_device *bdev,
@@ -580,7 +581,7 @@ int vmw_bo_create_and_populate(struct vmw_private *dev_priv,
 	if (unlikely(ret != 0))
 		return ret;
 
-	ret = vmw_ttm_populate(vbo->tbo.bdev, vbo->tbo.ttm, &ctx);
+	ret = vmw_ttm_populate(vbo->tbo.bdev, vbo->tbo.ttm, false, &ctx);
 	if (likely(ret == 0)) {
 		struct vmw_ttm_tt *vmw_tt =
 			container_of(vbo->tbo.ttm, struct vmw_ttm_tt, dma_ttm);
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 29fffc81f240..9cb4344c93df 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -551,6 +551,7 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
 }
 
 static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
+			      bool memcg_account,
 			      struct ttm_operation_ctx *ctx)
 {
 	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
@@ -568,7 +569,7 @@ static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
 		err = ttm_tt_restore(ttm_dev, tt, ctx);
 	} else {
 		ttm_tt_clear_backed_up(tt);
-		err = ttm_pool_alloc(&ttm_dev->pool, tt, ctx);
+		err = ttm_pool_alloc(&ttm_dev->pool, tt, memcg_account, ctx);
 	}
 	if (err)
 		return err;
@@ -1809,7 +1810,7 @@ static int xe_bo_fault_migrate(struct xe_bo *bo, struct ttm_operation_ctx *ctx,
 	if (ttm_manager_type(tbo->bdev, tbo->resource->mem_type)->use_tt) {
 		err = xe_bo_wait_usage_kernel(bo, ctx);
 		if (!err)
-			err = ttm_bo_populate(&bo->ttm, ctx);
+			err = ttm_bo_populate(&bo->ttm, false, ctx);
 	} else if (should_migrate_to_smem(bo)) {
 		xe_assert(xe_bo_device(bo), bo->flags & XE_BO_FLAG_SYSTEM);
 		err = xe_bo_migrate(bo, XE_PL_TT, ctx, exec);
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index bca3a8849d47..ea745edc2882 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -465,6 +465,7 @@ pgprot_t ttm_io_prot(struct ttm_buffer_object *bo, struct ttm_resource *res,
 		     pgprot_t tmp);
 void ttm_bo_tt_destroy(struct ttm_buffer_object *bo);
 int ttm_bo_populate(struct ttm_buffer_object *bo,
+		    bool memcg_account,
 		    struct ttm_operation_ctx *ctx);
 int ttm_bo_setup_export(struct ttm_buffer_object *bo,
 			struct ttm_operation_ctx *ctx);
diff --git a/include/drm/ttm/ttm_device.h b/include/drm/ttm/ttm_device.h
index 5618aef462f2..a4bd23988ee0 100644
--- a/include/drm/ttm/ttm_device.h
+++ b/include/drm/ttm/ttm_device.h
@@ -85,6 +85,7 @@ struct ttm_device_funcs {
 	 */
 	int (*ttm_tt_populate)(struct ttm_device *bdev,
 			       struct ttm_tt *ttm,
+			       bool memcg_account,
 			       struct ttm_operation_ctx *ctx);
 
 	/**
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index 26ee592e1994..7f3f168c536c 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -78,6 +78,7 @@ struct ttm_pool {
 };
 
 int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
+		   bool memcg_account,
 		   struct ttm_operation_ctx *ctx);
 void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);
 
diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
index 406437ad674b..15d4019685f6 100644
--- a/include/drm/ttm/ttm_tt.h
+++ b/include/drm/ttm/ttm_tt.h
@@ -250,6 +250,7 @@ int ttm_tt_swapout(struct ttm_device *bdev, struct ttm_tt *ttm,
  * Calls the driver method to allocate pages for a ttm
  */
 int ttm_tt_populate(struct ttm_device *bdev, struct ttm_tt *ttm,
+		    bool memcg_account,
 		    struct ttm_operation_ctx *ctx);
 
 /**
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 09/16] ttm/pool: initialise the shrinker earlier
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
                   ` (7 preceding siblings ...)
  2026-02-24  2:06 ` [PATCH 08/16] ttm: add a memcg accounting flag to the alloc/populate APIs Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  2:06 ` [PATCH 10/16] ttm: add objcg pointer to bo and tt (v2) Dave Airlie
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

The later memcg enablement needs the shrinker initialised before the
list_lru structures, so just move the shrinker setup earlier for now.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_pool.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index f58147a83142..401f53244f96 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -1393,6 +1393,17 @@ int ttm_pool_mgr_init(unsigned long num_pages)
 	spin_lock_init(&shrinker_lock);
 	INIT_LIST_HEAD(&shrinker_list);
 
+	mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
+	if (!mm_shrinker)
+		return -ENOMEM;
+
+	mm_shrinker->count_objects = ttm_pool_shrinker_count;
+	mm_shrinker->scan_objects = ttm_pool_shrinker_scan;
+	mm_shrinker->batch = TTM_SHRINKER_BATCH;
+	mm_shrinker->seeks = 1;
+
+	shrinker_register(mm_shrinker);
+
 	for (i = 0; i < NR_PAGE_ORDERS; ++i) {
 		ttm_pool_type_init(&global_write_combined[i], NULL,
 				   ttm_write_combined, i);
@@ -1415,17 +1426,6 @@ int ttm_pool_mgr_init(unsigned long num_pages)
 #endif
 #endif
 
-	mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
-	if (!mm_shrinker)
-		return -ENOMEM;
-
-	mm_shrinker->count_objects = ttm_pool_shrinker_count;
-	mm_shrinker->scan_objects = ttm_pool_shrinker_scan;
-	mm_shrinker->batch = TTM_SHRINKER_BATCH;
-	mm_shrinker->seeks = 1;
-
-	shrinker_register(mm_shrinker);
-
 	return 0;
 }
 
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 10/16] ttm: add objcg pointer to bo and tt (v2)
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
                   ` (8 preceding siblings ...)
  2026-02-24  2:06 ` [PATCH 09/16] ttm/pool: initialise the shrinker earlier Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  2:06 ` [PATCH 11/16] ttm/pool: enable memcg tracking and shrinker. (v3) Dave Airlie
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

This just adds the obj_cgroup pointer to the bo and tt structs,
and propagates it from the bo to the tt.

Signed-off-by: Dave Airlie <airlied@redhat.com>

v2: add the put and a setter helper
---
 drivers/gpu/drm/ttm/ttm_bo.c |  2 ++
 drivers/gpu/drm/ttm/ttm_tt.c |  1 +
 include/drm/ttm/ttm_bo.h     | 20 ++++++++++++++++++++
 include/drm/ttm/ttm_tt.h     |  2 ++
 4 files changed, 25 insertions(+)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 704af3cfc47a..4234bf7f1110 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -47,6 +47,7 @@
 #include <linux/atomic.h>
 #include <linux/cgroup_dmem.h>
 #include <linux/dma-resv.h>
+#include <linux/memcontrol.h>
 
 #include "ttm_module.h"
 #include "ttm_bo_internal.h"
@@ -316,6 +317,7 @@ static void ttm_bo_release(struct kref *kref)
 		dma_resv_unlock(bo->base.resv);
 	}
 
+	obj_cgroup_put(bo->objcg);
 	atomic_dec(&ttm_glob.bo_count);
 	bo->destroy(bo);
 }
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index aa0f17fca770..1f68a7dc41f2 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -164,6 +164,7 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
 	ttm->caching = caching;
 	ttm->restore = NULL;
 	ttm->backup = NULL;
+	ttm->objcg = bo->objcg;
 }
 
 int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index ea745edc2882..af9c672486ab 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -135,6 +135,12 @@ struct ttm_buffer_object {
 	 * reservation lock.
 	 */
 	struct sg_table *sg;
+
+	/**
+	 * @objcg: object cgroup to charge this to if it ends up using system memory.
+	 * NULL means don't charge.
+	 */
+	struct obj_cgroup *objcg;
 };
 
 #define TTM_BO_MAP_IOMEM_MASK 0x80
@@ -334,6 +340,20 @@ ttm_bo_move_to_lru_tail_unlocked(struct ttm_buffer_object *bo)
 	spin_unlock(&bo->bdev->lru_lock);
 }
 
+/**
+ * ttm_bo_set_cgroup - assign a cgroup to a buffer object.
+ * @bo: The bo to set the cgroup for
+ * @objcg: the cgroup to set.
+ *
+ * This transfers the cgroup reference to the bo. From this
+ * point on the cgroup reference is owned by the ttm bo.
+ */
+static inline void ttm_bo_set_cgroup(struct ttm_buffer_object *bo,
+				     struct obj_cgroup *objcg)
+{
+	bo->objcg = objcg;
+}
+
 static inline void ttm_bo_assign_mem(struct ttm_buffer_object *bo,
 				     struct ttm_resource *new_mem)
 {
diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
index 15d4019685f6..c13fea4c2915 100644
--- a/include/drm/ttm/ttm_tt.h
+++ b/include/drm/ttm/ttm_tt.h
@@ -126,6 +126,8 @@ struct ttm_tt {
 	enum ttm_caching caching;
 	/** @restore: Partial restoration from backup state. TTM private */
 	struct ttm_pool_tt_restore *restore;
+	/** @objcg: Object cgroup for this TT allocation */
+	struct obj_cgroup *objcg;
 };
 
 /**
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 11/16] ttm/pool: enable memcg tracking and shrinker. (v3)
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
                   ` (9 preceding siblings ...)
  2026-02-24  2:06 ` [PATCH 10/16] ttm: add objcg pointer to bo and tt (v2) Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  2:06 ` [PATCH 12/16] ttm: hook up memcg placement flags Dave Airlie
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

This enables all the backend code to use the list_lru in memcg mode,
and sets the shrinker to be memcg aware.

It also adds a loop for the case where pooled pages end up being
reparented to a higher memcg group, so that a newer memcg can search
the parent levels for them and take them back.

Signed-off-by: Dave Airlie <airlied@redhat.com>

---
v2: just use the proper stats.
v3: fix objcg check to return void
---
 drivers/gpu/drm/ttm/ttm_pool.c | 126 +++++++++++++++++++++++++++------
 mm/list_lru.c                  |   1 +
 2 files changed, 104 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 401f53244f96..4b0c7d70b60f 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -144,7 +144,9 @@ static int ttm_pool_nid(struct ttm_pool *pool)
 }
 
 /* Allocate pages of size 1 << order with the given gfp_flags */
-static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
+static struct page *ttm_pool_alloc_page(struct ttm_pool *pool,
+					struct obj_cgroup *objcg,
+					gfp_t gfp_flags,
 					unsigned int order)
 {
 	const unsigned int beneficial_order = ttm_pool_beneficial_order(pool);
@@ -172,7 +174,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
 		p = alloc_pages_node(pool->nid, gfp_flags, order);
 		if (p) {
 			p->private = order;
-			mod_lruvec_page_state(p, NR_GPU_ACTIVE, 1 << order);
+			if (!mem_cgroup_charge_gpu_page(objcg, p, order, gfp_flags, false)) {
+				__free_pages(p, order);
+				return NULL;
+			}
 		}
 		return p;
 	}
@@ -223,8 +228,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
 #endif
 
 	if (!pool || !ttm_pool_uses_dma_alloc(pool)) {
-		mod_lruvec_page_state(p, reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE,
-				      -(1 << order));
+		mem_cgroup_uncharge_gpu_page(p, order, reclaim);
 		__free_pages(p, order);
 		return;
 	}
@@ -311,12 +315,11 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
 
 	INIT_LIST_HEAD(&p->lru);
 	rcu_read_lock();
-	list_lru_add(&pt->pages, &p->lru, nid, NULL);
+	list_lru_add(&pt->pages, &p->lru, nid, page_memcg_check(p));
 	rcu_read_unlock();
 
 	atomic_long_add(num_pages, &allocated_pages[nid]);
-	mod_lruvec_page_state(p, NR_GPU_ACTIVE, -num_pages);
-	mod_lruvec_page_state(p, NR_GPU_RECLAIM, num_pages);
+	mem_cgroup_move_gpu_page_reclaim(NULL, p, pt->order, true);
 }
 
 static enum lru_status take_one_from_lru(struct list_head *item,
@@ -331,20 +334,56 @@ static enum lru_status take_one_from_lru(struct list_head *item,
 	return LRU_REMOVED;
 }
 
-/* Take pages from a specific pool_type, return NULL when nothing available */
-static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid)
+static int pool_lru_get_page(struct ttm_pool_type *pt, int nid,
+			     struct page **page_out,
+			     struct obj_cgroup *objcg,
+			     struct mem_cgroup *memcg)
 {
 	int ret;
 	struct page *p = NULL;
 	unsigned long nr_to_walk = 1;
+	unsigned int num_pages = 1 << pt->order;
 
-	ret = list_lru_walk_node(&pt->pages, nid, take_one_from_lru, (void *)&p, &nr_to_walk);
+	ret = list_lru_walk_one(&pt->pages, nid, memcg, take_one_from_lru, (void *)&p, &nr_to_walk);
 	if (ret == 1 && p) {
-		atomic_long_sub(1 << pt->order, &allocated_pages[nid]);
-		mod_lruvec_page_state(p, NR_GPU_ACTIVE, (1 << pt->order));
-		mod_lruvec_page_state(p, NR_GPU_RECLAIM, -(1 << pt->order));
+		atomic_long_sub(num_pages, &allocated_pages[nid]);
+
+		if (!mem_cgroup_move_gpu_page_reclaim(objcg, p, pt->order, false)) {
+			__free_pages(p, pt->order);
+			p = NULL;
+		}
 	}
-	return p;
+	*page_out = p;
+	return ret;
+}
+
+/* Take pages from a specific pool_type, return NULL when nothing available */
+static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid,
+				       struct obj_cgroup *orig_objcg)
+{
+	struct page *page_out = NULL;
+	int ret;
+	struct mem_cgroup *orig_memcg = orig_objcg ? get_mem_cgroup_from_objcg(orig_objcg) : NULL;
+	struct mem_cgroup *memcg = orig_memcg;
+
+	/*
+	 * Attempt to get a page from the current memcg, but if it hasn't got any at its level,
+	 * go up to the parent and check there. This helps the scenario where multiple apps get
+	 * started into their own cgroup from a common parent and want to reuse the pools.
+	 */
+	while (!page_out) {
+		ret = pool_lru_get_page(pt, nid, &page_out, orig_objcg, memcg);
+		if (ret == 1)
+			break;
+		if (!memcg)
+			break;
+		memcg = parent_mem_cgroup(memcg);
+		if (!memcg)
+			break;
+	}
+
+	mem_cgroup_put(orig_memcg);
+	return page_out;
 }
 
 /* Initialize and add a pool type to the global shrinker list */
@@ -354,7 +393,7 @@ static void ttm_pool_type_init(struct ttm_pool_type *pt, struct ttm_pool *pool,
 	pt->pool = pool;
 	pt->caching = caching;
 	pt->order = order;
-	list_lru_init(&pt->pages);
+	list_lru_init_memcg(&pt->pages, mm_shrinker);
 
 	spin_lock(&shrinker_lock);
 	list_add_tail(&pt->shrinker_list, &shrinker_list);
@@ -397,6 +436,31 @@ static void ttm_pool_type_fini(struct ttm_pool_type *pt)
 	ttm_pool_dispose_list(pt, &dispose);
 }
 
+/*
+ * This function doesn't currently check dma32, because no driver using this
+ * supports dma32. This should be added and debugged when that changes.
+ */
+static void ttm_pool_check_objcg(struct obj_cgroup *objcg)
+{
+#ifdef CONFIG_MEMCG
+	int r = 0;
+	struct mem_cgroup *memcg;
+	if (!objcg)
+		return;
+
+	memcg = get_mem_cgroup_from_objcg(objcg);
+	for (unsigned i = 0; i < NR_PAGE_ORDERS; i++) {
+		r = memcg_list_lru_alloc(memcg, &global_write_combined[i].pages, GFP_KERNEL);
+		if (r)
+			break;
+		r = memcg_list_lru_alloc(memcg, &global_uncached[i].pages, GFP_KERNEL);
+		if (r)
+			break;
+	}
+	mem_cgroup_put(memcg);
+#endif
+}
+
 /* Return the pool_type to use for the given caching and order */
 static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
 						  enum ttm_caching caching,
@@ -426,7 +490,9 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
 }
 
 /* Free pages using the per-node shrinker list */
-static unsigned int ttm_pool_shrink(int nid, unsigned long num_to_free)
+static unsigned int ttm_pool_shrink(int nid,
+				    struct mem_cgroup *memcg,
+				    unsigned long num_to_free)
 {
 	LIST_HEAD(dispose);
 	struct ttm_pool_type *pt;
@@ -438,7 +504,11 @@ static unsigned int ttm_pool_shrink(int nid, unsigned long num_to_free)
 	list_move_tail(&pt->shrinker_list, &shrinker_list);
 	spin_unlock(&shrinker_lock);
 
-	num_pages = list_lru_walk_node(&pt->pages, nid, pool_move_to_dispose_list, &dispose, &num_to_free);
+	if (!memcg) {
+		num_pages = list_lru_walk_node(&pt->pages, nid, pool_move_to_dispose_list, &dispose, &num_to_free);
+	} else {
+		num_pages = list_lru_walk_one(&pt->pages, nid, memcg, pool_move_to_dispose_list, &dispose, &num_to_free);
+	}
 	num_pages *= 1 << pt->order;
 
 	ttm_pool_dispose_list(pt, &dispose);
@@ -603,6 +673,7 @@ static int ttm_pool_restore_commit(struct ttm_pool_tt_restore *restore,
 			 */
 			ttm_pool_split_for_swap(restore->pool, p);
 			copy_highpage(restore->alloced_page + i, p);
+			p->memcg_data = 0;
 			__free_pages(p, 0);
 		}
 
@@ -764,6 +835,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 	bool allow_pools;
 	struct page *p;
 	int r;
+	struct obj_cgroup *objcg = memcg_account ? tt->objcg : NULL;
 
 	WARN_ON(!alloc->remaining_pages || ttm_tt_is_populated(tt));
 	WARN_ON(alloc->dma_addr && !pool->dev);
@@ -781,6 +853,9 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 
 	page_caching = tt->caching;
 	allow_pools = true;
+
+	ttm_pool_check_objcg(objcg);
+
 	for (order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
 	     alloc->remaining_pages;
 	     order = ttm_pool_alloc_find_order(order, alloc)) {
@@ -790,7 +865,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 		p = NULL;
 		pt = ttm_pool_select_type(pool, page_caching, order);
 		if (pt && allow_pools)
-			p = ttm_pool_type_take(pt, ttm_pool_nid(pool));
+			p = ttm_pool_type_take(pt, ttm_pool_nid(pool), objcg);
 
 		/*
 		 * If that fails or previously failed, allocate from system.
@@ -801,7 +876,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 		if (!p) {
 			page_caching = ttm_cached;
 			allow_pools = false;
-			p = ttm_pool_alloc_page(pool, gfp_flags, order);
+			p = ttm_pool_alloc_page(pool, objcg, gfp_flags, order);
 		}
 		/* If that fails, lower the order if possible and retry. */
 		if (!p) {
@@ -947,7 +1022,7 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
 
 	while (atomic_long_read(&allocated_pages[nid]) > pool_node_limit[nid]) {
 		unsigned long diff = atomic_long_read(&allocated_pages[nid]) - pool_node_limit[nid];
-		ttm_pool_shrink(nid, diff);
+		ttm_pool_shrink(nid, NULL, diff);
 	}
 }
 EXPORT_SYMBOL(ttm_pool_free);
@@ -1067,6 +1142,7 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
 			if (flags->purge) {
 				shrunken += num_pages;
 				page->private = 0;
+				page->memcg_data = 0;
 				__free_pages(page, order);
 				memset(tt->pages + i, 0,
 				       num_pages * sizeof(*tt->pages));
@@ -1201,10 +1277,14 @@ static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
 					    struct shrink_control *sc)
 {
 	unsigned long num_freed = 0;
+	int num_pools;
+	spin_lock(&shrinker_lock);
+	num_pools = list_count_nodes(&shrinker_list);
+	spin_unlock(&shrinker_lock);
 
 	do
-		num_freed += ttm_pool_shrink(sc->nid, sc->nr_to_scan);
-	while (num_freed < sc->nr_to_scan &&
+		num_freed += ttm_pool_shrink(sc->nid, sc->memcg, sc->nr_to_scan);
+	while (num_pools-- >= 0 && num_freed < sc->nr_to_scan &&
 	       atomic_long_read(&allocated_pages[sc->nid]));
 
 	sc->nr_scanned = num_freed;
@@ -1393,7 +1473,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
 	spin_lock_init(&shrinker_lock);
 	INIT_LIST_HEAD(&shrinker_list);
 
-	mm_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "drm-ttm_pool");
+	mm_shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE, "drm-ttm_pool");
 	if (!mm_shrinker)
 		return -ENOMEM;
 
diff --git a/mm/list_lru.c b/mm/list_lru.c
index dd29bcf8eb5f..7834549ac6a0 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -562,6 +562,7 @@ int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
 
 	return xas_error(&xas);
 }
+EXPORT_SYMBOL_GPL(memcg_list_lru_alloc);
 #else
 static inline void memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
 {
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 12/16] ttm: hook up memcg placement flags.
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
                   ` (10 preceding siblings ...)
  2026-02-24  2:06 ` [PATCH 11/16] ttm/pool: enable memcg tracking and shrinker. (v3) Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  2:06 ` [PATCH 13/16] memcontrol: allow objcg api when memcg is config off Dave Airlie
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

This adds a placement flag that requests that any bo with this flag
set be accounted to the memcg when it is a system memory allocation.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c      | 4 ++--
 drivers/gpu/drm/ttm/ttm_bo_util.c | 6 +++---
 drivers/gpu/drm/ttm/ttm_bo_vm.c   | 2 +-
 drivers/gpu/drm/xe/xe_bo.c        | 2 +-
 include/drm/ttm/ttm_placement.h   | 3 +++
 5 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 4234bf7f1110..52f73a2ea971 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -145,7 +145,7 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
 			goto out_err;
 
 		if (mem->mem_type != TTM_PL_SYSTEM) {
-			ret = ttm_bo_populate(bo, false, ctx);
+			ret = ttm_bo_populate(bo, mem->placement & TTM_PL_FLAG_MEMCG, ctx);
 			if (ret)
 				goto out_err;
 		}
@@ -1301,7 +1301,7 @@ int ttm_bo_setup_export(struct ttm_buffer_object *bo,
 	if (ret != 0)
 		return ret;
 
-	ret = ttm_bo_populate(bo, false, ctx);
+	ret = ttm_bo_populate(bo, bo->resource->placement & TTM_PL_FLAG_MEMCG, ctx);
 	ttm_bo_unreserve(bo);
 	return ret;
 }
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 1502515043c0..a67818c4143d 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -167,7 +167,7 @@ int ttm_bo_move_memcpy(struct ttm_buffer_object *bo,
 	src_man = ttm_manager_type(bdev, src_mem->mem_type);
 	if (ttm && ((ttm->page_flags & TTM_TT_FLAG_SWAPPED) ||
 		    dst_man->use_tt)) {
-		ret = ttm_bo_populate(bo, false, ctx);
+		ret = ttm_bo_populate(bo, dst_mem->placement & TTM_PL_FLAG_MEMCG, ctx);
 		if (ret)
 			return ret;
 	}
@@ -352,7 +352,7 @@ static int ttm_bo_kmap_ttm(struct ttm_buffer_object *bo,
 
 	BUG_ON(!ttm);
 
-	ret = ttm_bo_populate(bo, false, &ctx);
+	ret = ttm_bo_populate(bo, mem->placement & TTM_PL_FLAG_MEMCG, &ctx);
 	if (ret)
 		return ret;
 
@@ -533,7 +533,7 @@ int ttm_bo_vmap(struct ttm_buffer_object *bo, struct iosys_map *map)
 		pgprot_t prot;
 		void *vaddr;
 
-		ret = ttm_bo_populate(bo, false, &ctx);
+		ret = ttm_bo_populate(bo, mem->placement & TTM_PL_FLAG_MEMCG, &ctx);
 		if (ret)
 			return ret;
 
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 2e59836b6085..98cf8f83220f 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -225,7 +225,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
 
 		ttm = bo->ttm;
 		err = ttm_bo_populate(bo,
-				      false,
+				      bo->resource->placement & TTM_PL_FLAG_MEMCG,
 				      &ctx);
 		if (err) {
 			if (err == -EINTR || err == -ERESTARTSYS ||
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 9cb4344c93df..fae2d738ecd2 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -1810,7 +1810,7 @@ static int xe_bo_fault_migrate(struct xe_bo *bo, struct ttm_operation_ctx *ctx,
 	if (ttm_manager_type(tbo->bdev, tbo->resource->mem_type)->use_tt) {
 		err = xe_bo_wait_usage_kernel(bo, ctx);
 		if (!err)
-			err = ttm_bo_populate(&bo->ttm, false, ctx);
+			err = ttm_bo_populate(&bo->ttm, tbo->resource->placement & TTM_PL_FLAG_MEMCG, ctx);
 	} else if (should_migrate_to_smem(bo)) {
 		xe_assert(xe_bo_device(bo), bo->flags & XE_BO_FLAG_SYSTEM);
 		err = xe_bo_migrate(bo, XE_PL_TT, ctx, exec);
diff --git a/include/drm/ttm/ttm_placement.h b/include/drm/ttm/ttm_placement.h
index b510a4812609..4e9f07d70483 100644
--- a/include/drm/ttm/ttm_placement.h
+++ b/include/drm/ttm/ttm_placement.h
@@ -70,6 +70,9 @@
 /* Placement is only used during eviction */
 #define TTM_PL_FLAG_FALLBACK	(1 << 4)
 
+/* Placement should account mem cgroup */
+#define TTM_PL_FLAG_MEMCG	(1 << 5)
+
 /**
  * struct ttm_place
  *
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 13/16] memcontrol: allow objcg api when memcg is config off.
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
                   ` (11 preceding siblings ...)
  2026-02-24  2:06 ` [PATCH 12/16] ttm: hook up memcg placement flags Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  2:06 ` [PATCH 14/16] amdgpu: add support for memory cgroups Dave Airlie
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

amdgpu wants to use the objcg API without having to wrap calls in
ifdefs, so just add a stub function for the config-off path.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 include/linux/memcontrol.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4f75d64f5fca..fcc6b79d9560 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1804,6 +1804,11 @@ static inline void __memcg_kmem_uncharge_page(struct page *page, int order)
 {
 }
 
+static inline struct obj_cgroup *get_obj_cgroup_from_current(void)
+{
+	return NULL;
+}
+
 static inline struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
 {
 	return NULL;
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 14/16] amdgpu: add support for memory cgroups
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
                   ` (12 preceding siblings ...)
  2026-02-24  2:06 ` [PATCH 13/16] memcontrol: allow objcg api when memcg is config off Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  2:06 ` [PATCH 15/16] ttm: add support for a module option to disable memcg integration Dave Airlie
  2026-02-24  2:06 ` [PATCH 16/16] xe: create a flag to enable memcg accounting for XE as well Dave Airlie
  15 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

This adds support for attaching an obj cgroup to a buffer object,
and for passing in the placement flags to make sure it is accounted
properly.

Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 13 +++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  3 +++
 mm/memcontrol.c                            |  1 +
 5 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index a6107109a2b8..d067e8565400 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -184,6 +184,7 @@ int amdgpu_gem_object_create(struct amdgpu_device *adev, unsigned long size,
 	bp.domain = initial_domain;
 	bp.bo_ptr_size = sizeof(struct amdgpu_bo);
 	bp.xcp_id_plus1 = xcp_id_plus1;
+	bp.objcg = get_obj_cgroup_from_current();
 
 	r = amdgpu_bo_create_user(adev, &bp, &ubo);
 	if (r)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 1fb956400696..248e9c5554de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -159,7 +159,7 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
 		places[c].mem_type =
 			abo->flags & AMDGPU_GEM_CREATE_PREEMPTIBLE ?
 			AMDGPU_PL_PREEMPT : TTM_PL_TT;
-		places[c].flags = 0;
+		places[c].flags = TTM_PL_FLAG_MEMCG;
 		/*
 		 * When GTT is just an alternative to VRAM make sure that we
 		 * only use it as fallback and still try to fill up VRAM first.
@@ -174,7 +174,7 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
 		places[c].fpfn = 0;
 		places[c].lpfn = 0;
 		places[c].mem_type = TTM_PL_SYSTEM;
-		places[c].flags = 0;
+		places[c].flags = TTM_PL_FLAG_MEMCG;
 		c++;
 	}
 
@@ -654,16 +654,21 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
 		size = ALIGN(size, PAGE_SIZE);
 	}
 
-	if (!amdgpu_bo_validate_size(adev, size, bp->domain))
+	if (!amdgpu_bo_validate_size(adev, size, bp->domain)) {
+		obj_cgroup_put(bp->objcg);
 		return -ENOMEM;
+	}
 
 	BUG_ON(bp->bo_ptr_size < sizeof(struct amdgpu_bo));
 
 	*bo_ptr = NULL;
 	bo = kvzalloc(bp->bo_ptr_size, GFP_KERNEL);
-	if (bo == NULL)
+	if (bo == NULL) {
+		obj_cgroup_put(bp->objcg);
 		return -ENOMEM;
+	}
 	drm_gem_private_object_init(adev_to_drm(adev), &bo->tbo.base, size);
+	ttm_bo_set_cgroup(&bo->tbo, bp->objcg); /* hand the reference to the ttm bo */
 	bo->tbo.base.funcs = &amdgpu_gem_object_funcs;
 	bo->vm_bo = NULL;
 	bo->preferred_domains = bp->preferred_domain ? bp->preferred_domain :
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 912c9afaf9e1..b03c34ce7809 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -55,6 +55,7 @@ struct amdgpu_bo_param {
 	enum ttm_bo_type		type;
 	bool				no_wait_gpu;
 	struct dma_resv			*resv;
+	struct obj_cgroup               *objcg;
 	void				(*destroy)(struct ttm_buffer_object *bo);
 	/* xcp partition number plus 1, 0 means any partition */
 	int8_t				xcp_id_plus1;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index e50c133e87df..5919db47f1a2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -152,11 +152,14 @@ static void amdgpu_evict_flags(struct ttm_buffer_object *bo,
 			amdgpu_bo_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT |
 							AMDGPU_GEM_DOMAIN_CPU);
 		}
+		for (int i = 0; i < abo->placement.num_placement; i++)
+			abo->placements[i].flags &= ~TTM_PL_FLAG_MEMCG;
 		break;
 	case TTM_PL_TT:
 	case AMDGPU_PL_PREEMPT:
 	default:
 		amdgpu_bo_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_CPU);
+		abo->placements[0].flags &= ~TTM_PL_FLAG_MEMCG;
 		break;
 	}
 	*placement = abo->placement;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90bb3e00c258..25390e542dc8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2779,6 +2779,7 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
 
 	return objcg;
 }
+EXPORT_SYMBOL_GPL(current_obj_cgroup);
 
 struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
 {
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread
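The error paths added in the amdgpu patch above follow a strict reference-counting discipline: the caller takes an objcg reference, hands it to the BO on success, and must drop it on every early-failure path. A toy model (hypothetical names, not the kernel API) makes the ownership handoff explicit:

```python
class ObjCgroup:
    """Toy refcounted object standing in for struct obj_cgroup."""
    def __init__(self):
        self.refs = 1

    def put(self):
        self.refs -= 1


def bo_create(objcg, fail_validate=False, fail_alloc=False):
    """Model of amdgpu_bo_create(): consumes the objcg reference.

    On success the reference is handed to the new BO; on any error
    path it must be dropped so the caller never leaks it.
    """
    if fail_validate:
        objcg.put()          # mirrors obj_cgroup_put() before -ENOMEM
        return None
    if fail_alloc:
        objcg.put()          # same on the kvzalloc() failure path
        return None
    return {"objcg": objcg}  # ttm_bo_set_cgroup(): BO now owns the ref


# success: the single reference now lives in the BO
cg = ObjCgroup()
bo = bo_create(cg)
assert bo is not None and cg.refs == 1

# error path: the reference is released, nothing leaks
cg2 = ObjCgroup()
assert bo_create(cg2, fail_validate=True) is None and cg2.refs == 0
```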

* [PATCH 15/16] ttm: add support for a module option to disable memcg integration
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
                   ` (13 preceding siblings ...)
  2026-02-24  2:06 ` [PATCH 14/16] amdgpu: add support for memory cgroups Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  2026-02-24  2:06 ` [PATCH 16/16] xe: create a flag to enable memcg accounting for XE as well Dave Airlie
  15 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Dave Airlie <airlied@redhat.com>

This adds a Kconfig option and a module option to turn off the ttm
memcg integration completely.

When this is used, no object will ever end up using memcg aware
paths.

There is an existing workload that cgroup support might regress: the
systems are set up to allocate 1GB of uncached pages at system
startup to prime the pool, and any further users then take pages
from that pool. The current cgroup code might handle that, but it
may also regress it, so add an option to ttm to avoid using memcg
for the pool pages.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/Kconfig        |  7 +++++++
 drivers/gpu/drm/ttm/ttm_pool.c | 24 +++++++++++++++++++++---
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index d3d52310c9cc..81227ff10fff 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -239,6 +239,13 @@ config DRM_TTM_HELPER
 	help
 	  Helpers for ttm-based gem objects
 
+config DRM_TTM_MEMCG
+	bool "Enable TTM mem cgroup by default"
+	depends on DRM_TTM
+	depends on MEMCG
+	help
+	  Enable the memcg integration by default
+
 config DRM_GEM_DMA_HELPER
 	tristate
 	depends on DRM
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 4b0c7d70b60f..d23255375219 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -119,6 +119,24 @@ static unsigned long page_pool_size;
 MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool per NUMA node");
 module_param(page_pool_size, ulong, 0644);
 
+/*
+ * Don't use the memcg aware lru for pooled pages.
+ *
+ * There are use-cases where for example one application in a cgroup will preallocate 1GB
+ * of uncached pages, and immediately release them into the pool, for other consumers
+ * to use. This use-case could be handled with a proper cgroup hierarchy, but to allow
+ * that use case to continue to operate as-is, add a module option.
+ *
+ * This still stores the pages in the list_lru, it just doesn't use the memcg when
+ * adding/removing them.
+ */
+#define DEFAULT_TTM_MEMCG IS_ENABLED(CONFIG_DRM_TTM_MEMCG)
+static bool ttm_memcg = DEFAULT_TTM_MEMCG;
+
+MODULE_PARM_DESC(ttm_memcg, "Allow using cgroups with TTM "
+		 "[default=" __stringify(DEFAULT_TTM_MEMCG) "])");
+module_param(ttm_memcg, bool, 0444);
+
 static unsigned long pool_node_limit[MAX_NUMNODES];
 static atomic_long_t allocated_pages[MAX_NUMNODES];
 
@@ -315,7 +333,7 @@ static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
 
 	INIT_LIST_HEAD(&p->lru);
 	rcu_read_lock();
-	list_lru_add(&pt->pages, &p->lru, nid, page_memcg_check(p));
+	list_lru_add(&pt->pages, &p->lru, nid, ttm_memcg ? page_memcg_check(p) : NULL);
 	rcu_read_unlock();
 
 	atomic_long_add(num_pages, &allocated_pages[nid]);
@@ -364,7 +382,7 @@ static struct page *ttm_pool_type_take(struct ttm_pool_type *pt, int nid,
 	struct page *page_out = NULL;
 	int ret;
 	struct mem_cgroup *orig_memcg = orig_objcg ? get_mem_cgroup_from_objcg(orig_objcg) : NULL;
-	struct mem_cgroup *memcg = orig_memcg;
+	struct mem_cgroup *memcg = ttm_memcg ? orig_memcg : NULL;
 
 	/*
 	 * Attempt to get a page from the current memcg, but if it hasn't got any in it's level,
@@ -835,7 +853,7 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 	bool allow_pools;
 	struct page *p;
 	int r;
-	struct obj_cgroup *objcg = memcg_account ? tt->objcg : NULL;
+	struct obj_cgroup *objcg = (ttm_memcg && memcg_account) ? tt->objcg : NULL;
 
 	WARN_ON(!alloc->remaining_pages || ttm_tt_is_populated(tt));
 	WARN_ON(alloc->dma_addr && !pool->dev);
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread
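The gating in the patch above comes down to one boolean deciding whether pool pages are keyed by their memcg or all land on the NULL-memcg key of the list_lru. A rough model (toy structures, not the TTM API) of `list_lru_add(..., ttm_memcg ? memcg : NULL)`:

```python
from collections import defaultdict


class ToyPool:
    """Per-(nid, memcg) free lists, standing in for a list_lru."""
    def __init__(self, use_memcg):
        self.use_memcg = use_memcg
        self.lists = defaultdict(list)

    def key(self, nid, memcg):
        # With the option off, every page goes on the NULL-memcg
        # list, so all cgroups share one pool per NUMA node.
        return (nid, memcg if self.use_memcg else None)

    def give(self, page, nid, memcg):
        self.lists[self.key(nid, memcg)].append(page)

    def take(self, nid, memcg):
        lst = self.lists[self.key(nid, memcg)]
        return lst.pop() if lst else None


# memcg-aware: pages are only visible under their own cgroup's key
pool = ToyPool(use_memcg=True)
pool.give("p0", 0, "cgA")
assert pool.take(0, "cgB") is None
assert pool.take(0, "cgA") == "p0"

# gated off: any cgroup can take any pooled page on that node
pool = ToyPool(use_memcg=False)
pool.give("p1", 0, "cgA")
assert pool.take(0, "cgB") == "p1"
```

This mirrors why the preallocation workload keeps working with the option off: the priming application's pages stay reachable from every other consumer's cgroup.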

* [PATCH 16/16] xe: create a flag to enable memcg accounting for XE as well.
  2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
                   ` (14 preceding siblings ...)
  2026-02-24  2:06 ` [PATCH 15/16] ttm: add support for a module option to disable memcg integration Dave Airlie
@ 2026-02-24  2:06 ` Dave Airlie
  15 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-02-24  2:06 UTC (permalink / raw)
  To: dri-devel, tj, christian.koenig, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

From: Maarten Lankhorst <dev@lankhorst.se>

This adds support for memcg accounting to the ttm objects used by the xe driver.

Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
Signed-off-by: Dave Airlie <airlied@redhat.com>
---
 drivers/gpu/drm/xe/xe_bo.c  | 10 +++++++---
 drivers/gpu/drm/xe/xe_bo.h  |  1 +
 drivers/gpu/drm/xe/xe_lrc.c |  3 ++-
 drivers/gpu/drm/xe/xe_oa.c  |  3 ++-
 drivers/gpu/drm/xe/xe_pt.c  |  3 ++-
 5 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index fae2d738ecd2..44ad97b7e0c1 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -56,6 +56,7 @@ static const struct ttm_place sys_placement_flags = {
 	.flags = 0,
 };
 
+/* TTM_PL_FLAG_MEMCG is not set, those placements are used for eviction */
 static struct ttm_placement sys_placement = {
 	.num_placement = 1,
 	.placement = &sys_placement_flags,
@@ -194,8 +195,8 @@ static void try_add_system(struct xe_device *xe, struct xe_bo *bo,
 
 		bo->placements[*c] = (struct ttm_place) {
 			.mem_type = XE_PL_TT,
-			.flags = (bo_flags & XE_BO_FLAG_VRAM_MASK) ?
-			TTM_PL_FLAG_FALLBACK : 0,
+			.flags = TTM_PL_FLAG_MEMCG | ((bo_flags & XE_BO_FLAG_VRAM_MASK) ?
+			TTM_PL_FLAG_FALLBACK : 0),
 		};
 		*c += 1;
 	}
@@ -2216,6 +2217,9 @@ struct xe_bo *xe_bo_init_locked(struct xe_device *xe, struct xe_bo *bo,
 	placement = (type == ttm_bo_type_sg ||
 		     bo->flags & XE_BO_FLAG_DEFER_BACKING) ? &sys_placement :
 		&bo->placement;
+
+	if (bo->flags & XE_BO_FLAG_ACCOUNTED)
+		ttm_bo_set_cgroup(&bo->ttm, get_obj_cgroup_from_current());
 	err = ttm_bo_init_reserved(&xe->ttm, &bo->ttm, type,
 				   placement, alignment,
 				   &ctx, NULL, resv, xe_ttm_bo_destroy);
@@ -3193,7 +3197,7 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
 	if (XE_IOCTL_DBG(xe, args->size & ~PAGE_MASK))
 		return -EINVAL;
 
-	bo_flags = 0;
+	bo_flags = XE_BO_FLAG_ACCOUNTED;
 	if (args->flags & DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING)
 		bo_flags |= XE_BO_FLAG_DEFER_BACKING;
 
diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
index c914ab719f20..2c2a93d018fe 100644
--- a/drivers/gpu/drm/xe/xe_bo.h
+++ b/drivers/gpu/drm/xe/xe_bo.h
@@ -52,6 +52,7 @@
 #define XE_BO_FLAG_CPU_ADDR_MIRROR	BIT(24)
 #define XE_BO_FLAG_FORCE_USER_VRAM	BIT(25)
 #define XE_BO_FLAG_NO_COMPRESSION	BIT(26)
+#define XE_BO_FLAG_ACCOUNTED		BIT(27)
 
 /* this one is trigger internally only */
 #define XE_BO_FLAG_INTERNAL_TEST	BIT(30)
diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
index b0f037bc227f..e4f2d6db18b3 100644
--- a/drivers/gpu/drm/xe/xe_lrc.c
+++ b/drivers/gpu/drm/xe/xe_lrc.c
@@ -1466,7 +1466,8 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
 		   XE_BO_FLAG_GGTT_INVALIDATE;
 
 	if ((vm && vm->xef) || init_flags & XE_LRC_CREATE_USER_CTX) /* userspace */
-		bo_flags |= XE_BO_FLAG_PINNED_LATE_RESTORE | XE_BO_FLAG_FORCE_USER_VRAM;
+		bo_flags |= XE_BO_FLAG_PINNED_LATE_RESTORE | XE_BO_FLAG_FORCE_USER_VRAM |
+			    XE_BO_FLAG_ACCOUNTED;
 
 	lrc->bo = xe_bo_create_pin_map_novm(xe, tile,
 					    bo_size,
diff --git a/drivers/gpu/drm/xe/xe_oa.c b/drivers/gpu/drm/xe/xe_oa.c
index 4dd3f29933cf..2606395aafdd 100644
--- a/drivers/gpu/drm/xe/xe_oa.c
+++ b/drivers/gpu/drm/xe/xe_oa.c
@@ -887,7 +887,8 @@ static int xe_oa_alloc_oa_buffer(struct xe_oa_stream *stream, size_t size)
 
 	bo = xe_bo_create_pin_map_novm(stream->oa->xe, stream->gt->tile,
 				       size, ttm_bo_type_kernel,
-				       XE_BO_FLAG_SYSTEM | XE_BO_FLAG_GGTT, false);
+				       XE_BO_FLAG_SYSTEM | XE_BO_FLAG_GGTT |
+				       XE_BO_FLAG_ACCOUNTED, false);
 	if (IS_ERR(bo))
 		return PTR_ERR(bo);
 
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 13b355fadd58..c1157dd56923 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -122,7 +122,8 @@ struct xe_pt *xe_pt_create(struct xe_vm *vm, struct xe_tile *tile,
 		   XE_BO_FLAG_IGNORE_MIN_PAGE_SIZE |
 		   XE_BO_FLAG_NO_RESV_EVICT | XE_BO_FLAG_PAGETABLE;
 	if (vm->xef) /* userspace */
-		bo_flags |= XE_BO_FLAG_PINNED_LATE_RESTORE | XE_BO_FLAG_FORCE_USER_VRAM;
+		bo_flags |= XE_BO_FLAG_PINNED_LATE_RESTORE | XE_BO_FLAG_FORCE_USER_VRAM |
+			    XE_BO_FLAG_ACCOUNTED;
 
 	pt->level = level;
 
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-02-24  2:06 ` [PATCH 07/16] memcg: add support for GPU page counters. (v4) Dave Airlie
@ 2026-02-24  7:20   ` kernel test robot
  2026-02-24  7:50   ` Christian König
  1 sibling, 0 replies; 36+ messages in thread
From: kernel test robot @ 2026-02-24  7:20 UTC (permalink / raw)
  To: Dave Airlie, dri-devel, tj, christian.koenig, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: oe-kbuild-all, cgroups, Dave Chinner, Waiman Long, simona

Hi Dave,

kernel test robot noticed the following build warnings:

[auto build test WARNING on drm-misc/drm-misc-next]
[also build test WARNING on linus/master v7.0-rc1 next-20260223]
[cannot apply to akpm-mm/mm-everything drm-xe/drm-xe-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Dave-Airlie/drm-ttm-use-gpu-mm-stats-to-track-gpu-memory-allocations-v4/20260224-112019
base:   https://gitlab.freedesktop.org/drm/misc/kernel.git drm-misc-next
patch link:    https://lore.kernel.org/r/20260224020854.791201-8-airlied%40gmail.com
patch subject: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
config: parisc-allmodconfig (https://download.01.org/0day-ci/archive/20260224/202602241519.EVW5jcey-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260224/202602241519.EVW5jcey-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602241519.EVW5jcey-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Warning: mm/memcontrol.c:5216 function parameter 'page' not described in 'mem_cgroup_move_gpu_page_reclaim'
>> Warning: mm/memcontrol.c:5216 function parameter 'order' not described in 'mem_cgroup_move_gpu_page_reclaim'
>> Warning: mm/memcontrol.c:5216 expecting prototype for mem_cgroup_move_gpu_reclaim(). Prototype was for mem_cgroup_move_gpu_page_reclaim() instead

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-02-24  2:06 ` [PATCH 07/16] memcg: add support for GPU page counters. (v4) Dave Airlie
  2026-02-24  7:20   ` kernel test robot
@ 2026-02-24  7:50   ` Christian König
  2026-02-24 19:28     ` Dave Airlie
  1 sibling, 1 reply; 36+ messages in thread
From: Christian König @ 2026-02-24  7:50 UTC (permalink / raw)
  To: Dave Airlie, dri-devel, tj, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: cgroups, Dave Chinner, Waiman Long, simona

On 2/24/26 03:06, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
> 
> This introduces 2 new statistics and 3 new memcontrol APIs for dealing
> with GPU system memory allocations.
> 
> The stats corresponds to the same stats in the global vmstat,
> for number of active GPU pages, and number of pages in pools that
> can be reclaimed.
> 
> The first API charges a order of pages to a objcg, and sets
> the objcg on the pages like kmem does, and updates the active/reclaim
> statistic.
> 
> The second API uncharges a page from the obj cgroup it is currently charged
> to.
> 
> The third API allows moving a page to/from reclaim and between obj cgroups.
> When pages are added to the pool lru, this just updates accounting.
> When pages are being removed from a pool lru, they can be taken from
> the parent objcg so this allows them to be uncharged from there and transferred
> to a new child objcg.
> 
> Acked-by: Christian König <christian.koenig@amd.com>

I have to take that back.

After going over the different use cases I'm now pretty convinced that charging any GPU/TTM allocation to memcg is the wrong approach to the problem.

Instead TTM should have a dmem_cgroup_pool which can limit the amount of system memory each cgroup can use from GTT.

The use case where GTT memory should be accounted to memcg is actually only valid for an extremely small number of HPC customers, and for those use cases we have different approaches to solving the issue (udmabuf, system DMA-buf heap, etc...).

What we can do is to say that this dmem_cgroup_pool then also accounts to memcg for selected cgroups. This would not only make it superfluous to have different flags in drivers and TTM to turn this feature on/off, but also allow charging VRAM or other local memory to memcg because they use system memory as a fallback for device memory.

In other, more high-level words: memcg is actually the swapping space for dmem.

Regards,
Christian.

> Signed-off-by: Dave Airlie <airlied@redhat.com>
> ---
> v2: use memcg_node_stat_items
> v3: fix null ptr dereference in uncharge
> v4: AI review: fix parameter names, fix problem with reclaim moving doing wrong thing
> ---
>  Documentation/admin-guide/cgroup-v2.rst |   6 ++
>  include/linux/memcontrol.h              |  11 +++
>  mm/memcontrol.c                         | 104 ++++++++++++++++++++++++
>  3 files changed, 121 insertions(+)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 91beaa6798ce..3ea7f1a399e8 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1573,6 +1573,12 @@ The following nested keys are defined.
>  	  vmalloc (npn)
>  		Amount of memory used for vmap backed memory.
>  
> +	  gpu_active (npn)
> +		Amount of system memory used for GPU devices.
> +
> +	  gpu_reclaim (npn)
> +		Amount of system memory cached for GPU devices.
> +
>  	  shmem
>  		Amount of cached filesystem data that is swap-backed,
>  		such as tmpfs, shm segments, shared anonymous mmap()s
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 70b685a85bf4..4f75d64f5fca 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1583,6 +1583,17 @@ static inline void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
>  #endif	/* CONFIG_CGROUP_WRITEBACK */
>  
>  struct sock;
> +bool mem_cgroup_charge_gpu_page(struct obj_cgroup *objcg, struct page *page,
> +			   unsigned int order,
> +			   gfp_t gfp_mask, bool reclaim);
> +void mem_cgroup_uncharge_gpu_page(struct page *page,
> +				  unsigned int order,
> +				  bool reclaim);
> +bool mem_cgroup_move_gpu_page_reclaim(struct obj_cgroup *objcg,
> +				      struct page *page,
> +				      unsigned int order,
> +				      bool to_reclaim);
> +
>  #ifdef CONFIG_MEMCG
>  extern struct static_key_false memcg_sockets_enabled_key;
>  #define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a52da3a5e4fd..90bb3e00c258 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -333,6 +333,8 @@ static const unsigned int memcg_node_stat_items[] = {
>  #ifdef CONFIG_HUGETLB_PAGE
>  	NR_HUGETLB,
>  #endif
> +	NR_GPU_ACTIVE,
> +	NR_GPU_RECLAIM,
>  };
>  
>  static const unsigned int memcg_stat_items[] = {
> @@ -1360,6 +1362,8 @@ static const struct memory_stat memory_stats[] = {
>  	{ "percpu",			MEMCG_PERCPU_B			},
>  	{ "sock",			MEMCG_SOCK			},
>  	{ "vmalloc",			MEMCG_VMALLOC			},
> +	{ "gpu_active",			NR_GPU_ACTIVE			},
> +	{ "gpu_reclaim",		NR_GPU_RECLAIM	                },
>  	{ "shmem",			NR_SHMEM			},
>  #ifdef CONFIG_ZSWAP
>  	{ "zswap",			MEMCG_ZSWAP_B			},
> @@ -5133,6 +5137,106 @@ void mem_cgroup_flush_workqueue(void)
>  	flush_workqueue(memcg_wq);
>  }
>  
> +/**
> + * mem_cgroup_charge_gpu_page - charge a page to GPU memory tracking
> + * @objcg: objcg to charge, NULL charges root memcg
> + * @page: page to charge
> + * @order: page allocation order
> + * @gfp_mask: gfp mode
> + * @reclaim: charge the reclaim counter instead of the active one.
> + *
> + * Charge the order sized @page to the objcg. Returns %true if the charge fit within
> + * @objcg's configured limit, %false if it doesn't.
> + */
> +bool mem_cgroup_charge_gpu_page(struct obj_cgroup *objcg, struct page *page,
> +				unsigned int order, gfp_t gfp_mask, bool reclaim)
> +{
> +	unsigned int nr_pages = 1 << order;
> +	struct mem_cgroup *memcg = NULL;
> +	struct lruvec *lruvec;
> +	int ret;
> +
> +	if (objcg) {
> +		memcg = get_mem_cgroup_from_objcg(objcg);
> +
> +		ret = try_charge_memcg(memcg, gfp_mask, nr_pages);
> +		if (ret) {
> +			mem_cgroup_put(memcg);
> +			return false;
> +		}
> +
> +		obj_cgroup_get(objcg);
> +		page_set_objcg(page, objcg);
> +	}
> +
> +	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
> +	mod_lruvec_state(lruvec, reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE, nr_pages);
> +
> +	mem_cgroup_put(memcg);
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(mem_cgroup_charge_gpu_page);
> +
> +/**
> + * mem_cgroup_uncharge_gpu_page - uncharge a page from GPU memory tracking
> + * @page: page to uncharge
> + * @order: order of the page allocation
> + * @reclaim: uncharge the reclaim counter instead of the active.
> + */
> +void mem_cgroup_uncharge_gpu_page(struct page *page,
> +				  unsigned int order, bool reclaim)
> +{
> +	struct obj_cgroup *objcg = page_objcg(page);
> +	struct mem_cgroup *memcg;
> +	struct lruvec *lruvec;
> +	int nr_pages = 1 << order;
> +
> +	memcg = objcg ? get_mem_cgroup_from_objcg(objcg) : NULL;
> +
> +	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
> +	mod_lruvec_state(lruvec, reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE, -nr_pages);
> +
> +	if (memcg && !mem_cgroup_is_root(memcg))
> +		refill_stock(memcg, nr_pages);
> +	page->memcg_data = 0;
> +	obj_cgroup_put(objcg);
> +	mem_cgroup_put(memcg);
> +}
> +EXPORT_SYMBOL_GPL(mem_cgroup_uncharge_gpu_page);
> +
> +/**
> + * mem_cgroup_move_gpu_reclaim - move pages from gpu to gpu reclaim and back
> + * @new_objcg: objcg to move page to, NULL if just stats update.
> + * @nr_pages: number of pages to move
> + * @to_reclaim: true moves pages into reclaim, false moves them back
> + */
> +bool mem_cgroup_move_gpu_page_reclaim(struct obj_cgroup *new_objcg,
> +				      struct page *page,
> +				      unsigned int order,
> +				      bool to_reclaim)
> +{
> +	struct obj_cgroup *objcg = page_objcg(page);
> +
> +	if (!objcg || !new_objcg || objcg == new_objcg) {
> +		struct mem_cgroup *memcg = objcg ? get_mem_cgroup_from_objcg(objcg) : NULL;
> +		struct lruvec *lruvec;
> +		unsigned long flags;
> +		int nr_pages = 1 << order;
> +
> +		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
> +		local_irq_save(flags);
> +		mod_lruvec_state(lruvec, to_reclaim ? NR_GPU_RECLAIM : NR_GPU_ACTIVE, nr_pages);
> +		mod_lruvec_state(lruvec, to_reclaim ? NR_GPU_ACTIVE : NR_GPU_RECLAIM, -nr_pages);
> +		local_irq_restore(flags);
> +		mem_cgroup_put(memcg);
> +		return true;
> +	} else {
> +		mem_cgroup_uncharge_gpu_page(page, order, true);
> +		return mem_cgroup_charge_gpu_page(new_objcg, page, order, 0, false);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(mem_cgroup_move_gpu_page_reclaim);
> +
>  static int __init cgroup_memory(char *s)
>  {
>  	char *token;


^ permalink raw reply	[flat|nested] 36+ messages in thread
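The commit message quoted above describes three APIs: charge (respecting the cgroup's limit, bumping the active or reclaim stat), uncharge, and move (flipping pages between active and reclaim within one cgroup, or transferring them from a parent's reclaim to a child's active). A small model of those semantics (toy limits and names, not the kernel API):

```python
class GpuAccounting:
    """Toy model of the charge/uncharge/move GPU memcg APIs."""
    def __init__(self, limit):
        self.limit = limit     # one shared limit, purely illustrative
        self.charged = {}      # cgroup -> pages charged
        self.active = {}       # cgroup -> gpu_active stat
        self.reclaim = {}      # cgroup -> gpu_reclaim stat

    def charge(self, cg, nr, reclaim=False):
        if self.charged.get(cg, 0) + nr > self.limit:
            return False       # charge does not fit the limit
        self.charged[cg] = self.charged.get(cg, 0) + nr
        stat = self.reclaim if reclaim else self.active
        stat[cg] = stat.get(cg, 0) + nr
        return True

    def uncharge(self, cg, nr, reclaim=False):
        self.charged[cg] -= nr
        stat = self.reclaim if reclaim else self.active
        stat[cg] -= nr

    def move(self, cg, new_cg, nr, to_reclaim):
        if new_cg is None or new_cg == cg:
            # same cgroup: just flip between active and reclaim stats
            src = self.active if to_reclaim else self.reclaim
            dst = self.reclaim if to_reclaim else self.active
            src[cg] -= nr
            dst[cg] = dst.get(cg, 0) + nr
            return True
        # different cgroup: uncharge from reclaim, recharge as active
        self.uncharge(cg, nr, reclaim=True)
        return self.charge(new_cg, nr, reclaim=False)


acct = GpuAccounting(limit=8)
assert acct.charge("parent", 4)                # active pages
assert acct.move("parent", None, 4, True)      # released into the pool
assert acct.reclaim["parent"] == 4 and acct.active["parent"] == 0
assert acct.move("parent", "child", 4, False)  # taken by a child cgroup
assert acct.active["child"] == 4 and acct.charged["parent"] == 0
```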

* Re: [PATCH 08/16] ttm: add a memcg accounting flag to the alloc/populate APIs
  2026-02-24  2:06 ` [PATCH 08/16] ttm: add a memcg accounting flag to the alloc/populate APIs Dave Airlie
@ 2026-02-24  8:42   ` kernel test robot
  0 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2026-02-24  8:42 UTC (permalink / raw)
  To: Dave Airlie, dri-devel, tj, christian.koenig, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
  Cc: oe-kbuild-all, cgroups, Dave Chinner, Waiman Long, simona

Hi Dave,

kernel test robot noticed the following build warnings:

[auto build test WARNING on drm-misc/drm-misc-next]
[also build test WARNING on linus/master v7.0-rc1 next-20260223]
[cannot apply to akpm-mm/mm-everything drm-xe/drm-xe-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Dave-Airlie/drm-ttm-use-gpu-mm-stats-to-track-gpu-memory-allocations-v4/20260224-112019
base:   https://gitlab.freedesktop.org/drm/misc/kernel.git drm-misc-next
patch link:    https://lore.kernel.org/r/20260224020854.791201-9-airlied%40gmail.com
patch subject: [PATCH 08/16] ttm: add a memcg accounting flag to the alloc/populate APIs
config: parisc-allmodconfig (https://download.01.org/0day-ci/archive/20260224/202602241625.E8sX98Qb-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260224/202602241625.E8sX98Qb-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602241625.E8sX98Qb-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Warning: drivers/gpu/drm/ttm/ttm_bo.c:1264 function parameter 'memcg_account' not described in 'ttm_bo_populate'
>> Warning: drivers/gpu/drm/ttm/ttm_bo.c:1264 function parameter 'memcg_account' not described in 'ttm_bo_populate'
--
>> Warning: drivers/gpu/drm/ttm/ttm_pool.c:867 function parameter 'memcg_account' not described in 'ttm_pool_alloc'
>> Warning: drivers/gpu/drm/ttm/ttm_pool.c:867 function parameter 'memcg_account' not described in 'ttm_pool_alloc'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-02-24  7:50   ` Christian König
@ 2026-02-24 19:28     ` Dave Airlie
  2026-02-25  9:09       ` Christian König
  0 siblings, 1 reply; 36+ messages in thread
From: Dave Airlie @ 2026-02-24 19:28 UTC (permalink / raw)
  To: Christian König
  Cc: dri-devel, tj, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Dave Chinner, Waiman Long,
	simona

On Tue, 24 Feb 2026 at 17:50, Christian König <christian.koenig@amd.com> wrote:
>
> On 2/24/26 03:06, Dave Airlie wrote:
> > From: Dave Airlie <airlied@redhat.com>
> >
> > This introduces 2 new statistics and 3 new memcontrol APIs for dealing
> > with GPU system memory allocations.
> >
> > The stats corresponds to the same stats in the global vmstat,
> > for number of active GPU pages, and number of pages in pools that
> > can be reclaimed.
> >
> > The first API charges a order of pages to a objcg, and sets
> > the objcg on the pages like kmem does, and updates the active/reclaim
> > statistic.
> >
> > The second API uncharges a page from the obj cgroup it is currently charged
> > to.
> >
> > The third API allows moving a page to/from reclaim and between obj cgroups.
> > When pages are added to the pool lru, this just updates accounting.
> > When pages are being removed from a pool lru, they can be taken from
> > the parent objcg so this allows them to be uncharged from there and transferred
> > to a new child objcg.
> >
> > Acked-by: Christian König <christian.koenig@amd.com>
>
> I have to take that back.
>
> After going over the different use cases I'm now pretty convinced that charging any GPU/TTM allocation to memcg is the wrong approach to the problem.

You'll need to sell me a bit more on this idea. I don't hate it, but to
be honest it seems kinda half-baked and smells a bit of rearchitecting
without form, so please fire up your writing skills and give me
something concrete here.

>
> Instead TTM should have a dmem_cgroup_pool which can limit the amount of system memory each cgroup can use from GTT.

This sounds like a static limit though; how would we configure that in
a sane way?
>
> The use case that GTT memory should account to memcg is actually only valid for an extremely small number of HPC customers and for those use cases we have different approaches to solve this issue (udmabuf, system DMA-buf heap, etc...).

Stop, I have a major use case for this that isn't any of those:
integrated GPUs on Intel and AMD accounting their RAM usage to
somewhere useful, so that cgroup management of desktop clients
actually works. When Firefox uses GPU memory it gets accounted to
Firefox, and when the OOM killer comes along it can choose the
correct victim.

This has been a pain in the ass for desktop for years, and I'd like to
fix it; the HPC use case is purely a driver for me doing the work.

Can you give a detailed explanation of how your idea will work in an
unconfigured cgroup environment to help this case?

>
> What we can do is to say that this dmem_cgroup_pool then also accounts to memcg for selected cgroups. This would not only make it superflous to have different flags in drivers and TTM to turn this feature on/off, but also allow charging VRAM or other local memory to memcg because they use system memory as fallback for device memory.
>
> In other more high level words memcg is actually the swapping space for dmem.

This is descriptive, but still feels very static, and nothing I've
seen indicates I want this to be a 50%-type limit.

Dave.
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-02-24 19:28     ` Dave Airlie
@ 2026-02-25  9:09       ` Christian König
  2026-03-02 14:15         ` Shakeel Butt
  0 siblings, 1 reply; 36+ messages in thread
From: Christian König @ 2026-02-25  9:09 UTC (permalink / raw)
  To: Dave Airlie
  Cc: dri-devel, tj, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, cgroups, Dave Chinner, Waiman Long,
	simona

On 2/24/26 20:28, Dave Airlie wrote:
> On Tue, 24 Feb 2026 at 17:50, Christian König <christian.koenig@amd.com> wrote:
>>
>> On 2/24/26 03:06, Dave Airlie wrote:
>>> From: Dave Airlie <airlied@redhat.com>
>>>
>>> This introduces 2 new statistics and 3 new memcontrol APIs for dealing
>>> with GPU system memory allocations.
>>>
>>> The stats corresponds to the same stats in the global vmstat,
>>> for number of active GPU pages, and number of pages in pools that
>>> can be reclaimed.
>>>
>>> The first API charges a order of pages to a objcg, and sets
>>> the objcg on the pages like kmem does, and updates the active/reclaim
>>> statistic.
>>>
>>> The second API uncharges a page from the obj cgroup it is currently charged
>>> to.
>>>
>>> The third API allows moving a page to/from reclaim and between obj cgroups.
>>> When pages are added to the pool lru, this just updates accounting.
>>> When pages are being removed from a pool lru, they can be taken from
>>> the parent objcg so this allows them to be uncharged from there and transferred
>>> to a new child objcg.
>>>
>>> Acked-by: Christian König <christian.koenig@amd.com>
>>
>> I have to take that back.
>>
>> After going over the different use cases I'm now pretty convinced that charging any GPU/TTM allocation to memcg is the wrong approach to the problem.
> 
> You'll need to sell me a bit more on this idea, I don't hate it, but
> it seems to be honest kinda half baked and smells a bit of reachitect
> without form, so please start up you writing skills and give me
> something concrete here.
> 
>>
>> Instead TTM should have a dmem_cgroup_pool which can limit the amount of system memory each cgroup can use from GTT.
> 
> This sounds like a static limit though, how would we configure that in
> a sane way?

See the discussion about the dmem controller for CMA with Mathew, T.J., me and a couple of others. It's on dri-devel and I've CCed you on my latest reply.

>>
>> The use case that GTT memory should account to memcg is actually only valid for an extremely small number of HPC customers and for those use cases we have different approaches to solve this issue (udmabuf, system DMA-buf heap, etc...).
> 
> Stop, I have a major use case for this that isn't any of those.
> Integrated GPUs on Intel and AMD accounting the RAM usage to somewhere
> useful, so cgroup mgmt of desktop clients actually work, so when
> firefox uses GPU memory it gets accounted to firefox and when the OOM
> killer comes along it can choose the correct user.

Oh, yes! I have tried multiple times to fix this as well in the last decade or so.

> This has been a pain in the ass for desktop for years, and I'd like to
> fix it, the HPC use case if purely a driver for me doing the work.

Wait a second. How does accounting to cgroups help with that in any way?

The last time I looked into this problem, the OOM killer worked based on per-task_struct stats, which couldn't be influenced this way.

Both others and I have tried that approach multiple times, and so far it has never worked.

> Can you give a detailed explanation of how your idea will work in an
> unconfigured cgroup environment to help this case?

It wouldn't, but I also don't see how this patch set here would.

The accounting limits the amount of memory you can allocate per process for each cgroup, but it does not affect the OOM killer score in any way.

If we want to fix the OOM killer score, we would need to start using the proportional set size in the OOM killer instead of the resident set size. And that in turn means the changes to the OOM killer and FS layer I already proposed over a decade ago.

Otherwise you can always come up with denial-of-service attacks against centralized services like X or Wayland.

>>
>> What we can do is to say that this dmem_cgroup_pool then also accounts to memcg for selected cgroups. This would not only make it superflous to have different flags in drivers and TTM to turn this feature on/off, but also allow charging VRAM or other local memory to memcg because they use system memory as fallback for device memory.
>>
>> In other more high level words memcg is actually the swapping space for dmem.
> 
> This is descriptive, but still feels very static, and nothing I've
> seen indicated I want this to be a 50% type limit.

The initial idea was to have more like a 90% limit by default, so that at least enough memory is left to SSH into the box and kill a runaway process.

Christian.

> 
> Dave.
>>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-02-25  9:09       ` Christian König
@ 2026-03-02 14:15         ` Shakeel Butt
  2026-03-02 14:37           ` Christian König
  0 siblings, 1 reply; 36+ messages in thread
From: Shakeel Butt @ 2026-03-02 14:15 UTC (permalink / raw)
  To: Christian König
  Cc: Dave Airlie, dri-devel, tj, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, cgroups, Dave Chinner, Waiman Long,
	simona

On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
> On 2/24/26 20:28, Dave Airlie wrote:
[...]
> 
> > This has been a pain in the ass for desktop for years, and I'd like to
> > fix it, the HPC use case if purely a driver for me doing the work.
> 
> Wait a second. How does accounting to cgroups help with that in any way?
> 
> The last time I looked into this problem the OOM killer worked based on the per task_struct stats which couldn't be influenced this way.
> 

It depends on the context of the oom-killer. If the oom-killer is triggered due
to memcg limits then only the processes in the scope of the memcg will be
targeted by the oom-killer. With the specific setting, the oom-killer can kill
all the processes in the target memcg.

However, nowadays userspace oom-killers are preferred over the kernel
oom-killer due to flexibility and configurability. Userspace oom-killers like
systemd-oomd, Android's LMKD or fb-oomd are being used in containerized
environments. Such oom-killers look at memcg stats, and hiding something
from memcg (i.e. not charging it to memcg) will hide such usage from these
oom-killers.
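For the kernel oom-killer case, the memcg scoping described above is set up
through the cgroup v2 control files; a minimal sketch (the `app` cgroup name
and 512M limit are hypothetical; assumes cgroup v2 mounted at /sys/fs/cgroup
and root privileges):

```shell
# Create a scope for the app and cap its memory; on breach, the kernel
# oom-killer only considers tasks inside this memcg.
mkdir /sys/fs/cgroup/app
echo 512M > /sys/fs/cgroup/app/memory.max
# Treat the cgroup as one indivisible unit: kill all of its tasks together.
echo 1 > /sys/fs/cgroup/app/memory.oom.group
echo $APP_PID > /sys/fs/cgroup/app/cgroup.procs
```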

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-03-02 14:15         ` Shakeel Butt
@ 2026-03-02 14:37           ` Christian König
  2026-03-02 15:40             ` Shakeel Butt
  0 siblings, 1 reply; 36+ messages in thread
From: Christian König @ 2026-03-02 14:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Dave Airlie, dri-devel, tj, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, cgroups, Dave Chinner, Waiman Long,
	simona

On 3/2/26 15:15, Shakeel Butt wrote:
> On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
>> On 2/24/26 20:28, Dave Airlie wrote:
> [...]
>>
>>> This has been a pain in the ass for desktop for years, and I'd like to
>>> fix it, the HPC use case if purely a driver for me doing the work.
>>
>> Wait a second. How does accounting to cgroups help with that in any way?
>>
>> The last time I looked into this problem the OOM killer worked based on the per task_struct stats which couldn't be influenced this way.
>>
> 
> It depends on the context of the oom-killer. If the oom-killer is triggered due
> to memcg limits then only the processes in the scope of the memcg will be
> targetted by the oom-killer. With the specific setting, the oom-killer can kill
> all the processes in the target memcg.
> 
> However nowadays the userspace oom-killer is preferred over the kernel
> oom-killer due to flexibility and configurability. Userspace oom-killers like
> systmd-oomd, Android's LMKD or fb-oomd are being used in containerized
> environments. Such oom-killers looks at memcg stats and hiding something
> something from memcg i.e. not charging to memcg will hide such usage from these
> oom-killers.

Well, that's exactly the problem. Android's oom killer is *not* using memcg, exactly because of this inflexibility.

See the multiple iterations we already had on that topic, even including reverting already-upstream uAPI.

The latest incarnation is that BPF is used for this task on Android.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-03-02 14:37           ` Christian König
@ 2026-03-02 15:40             ` Shakeel Butt
  2026-03-02 15:51               ` Christian König
  0 siblings, 1 reply; 36+ messages in thread
From: Shakeel Butt @ 2026-03-02 15:40 UTC (permalink / raw)
  To: Christian König
  Cc: Dave Airlie, dri-devel, tj, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, cgroups, Dave Chinner, Waiman Long,
	simona, tjmercier

+TJ

On Mon, Mar 02, 2026 at 03:37:37PM +0100, Christian König wrote:
> On 3/2/26 15:15, Shakeel Butt wrote:
> > On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
> >> On 2/24/26 20:28, Dave Airlie wrote:
> > [...]
> >>
> >>> This has been a pain in the ass for desktop for years, and I'd like to
> >>> fix it, the HPC use case if purely a driver for me doing the work.
> >>
> >> Wait a second. How does accounting to cgroups help with that in any way?
> >>
> >> The last time I looked into this problem the OOM killer worked based on the per task_struct stats which couldn't be influenced this way.
> >>
> > 
> > It depends on the context of the oom-killer. If the oom-killer is triggered due
> > to memcg limits then only the processes in the scope of the memcg will be
> > targetted by the oom-killer. With the specific setting, the oom-killer can kill
> > all the processes in the target memcg.
> > 
> > However nowadays the userspace oom-killer is preferred over the kernel
> > oom-killer due to flexibility and configurability. Userspace oom-killers like
> > systmd-oomd, Android's LMKD or fb-oomd are being used in containerized
> > environments. Such oom-killers looks at memcg stats and hiding something
> > something from memcg i.e. not charging to memcg will hide such usage from these
> > oom-killers.
> 
> Well exactly that's the problem. Android's oom killer is *not* using memcg exactly because of this inflexibility.

Are you sure Android's oom killer is not using memcg? From what I see in the
documentation [1], it requires memcg.

[1] https://source.android.com/docs/core/perf/lmkd

> 
> See the multiple iterations we already had on that topic. Even including reverting already upstream uAPI.
> 
> The latest incarnation is that BPF is used for this task on Android.
> 
> Regards,
> Christian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-03-02 15:40             ` Shakeel Butt
@ 2026-03-02 15:51               ` Christian König
  2026-03-02 17:16                 ` Shakeel Butt
  2026-03-02 19:35                 ` T.J. Mercier
  0 siblings, 2 replies; 36+ messages in thread
From: Christian König @ 2026-03-02 15:51 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Dave Airlie, dri-devel, tj, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, cgroups, Dave Chinner, Waiman Long,
	simona, tjmercier

On 3/2/26 16:40, Shakeel Butt wrote:
> +TJ
> 
> On Mon, Mar 02, 2026 at 03:37:37PM +0100, Christian König wrote:
>> On 3/2/26 15:15, Shakeel Butt wrote:
>>> On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
>>>> On 2/24/26 20:28, Dave Airlie wrote:
>>> [...]
>>>>
>>>>> This has been a pain in the ass for desktop for years, and I'd like to
>>>>> fix it, the HPC use case if purely a driver for me doing the work.
>>>>
>>>> Wait a second. How does accounting to cgroups help with that in any way?
>>>>
>>>> The last time I looked into this problem the OOM killer worked based on the per task_struct stats which couldn't be influenced this way.
>>>>
>>>
>>> It depends on the context of the oom-killer. If the oom-killer is triggered due
>>> to memcg limits then only the processes in the scope of the memcg will be
>>> targetted by the oom-killer. With the specific setting, the oom-killer can kill
>>> all the processes in the target memcg.
>>>
>>> However nowadays the userspace oom-killer is preferred over the kernel
>>> oom-killer due to flexibility and configurability. Userspace oom-killers like
>>> systmd-oomd, Android's LMKD or fb-oomd are being used in containerized
>>> environments. Such oom-killers looks at memcg stats and hiding something
>>> something from memcg i.e. not charging to memcg will hide such usage from these
>>> oom-killers.
>>
>> Well exactly that's the problem. Android's oom killer is *not* using memcg exactly because of this inflexibility.
> 
> Are you sure Android's oom killer is not using memcg? From what I see in the
> documentation [1], it requires memcg.

My bad, I should have worded that better.

The Android OOM killer is not using memcg for tracking GPU memory allocations, because memcg doesn't have proper support for tracking shared buffers.

In other words, GPU memory allocations are shared by design, and it is the norm that the process which is using them is not the process which allocated them.

What we would need (as a start) to handle all of this with memcg would be to account the resources to the process which references them, not the one which allocated them.

I can give a full list of requirements which would be needed by cgroups to cover all the different use cases, but it basically means tons of extra complexity.

Regards,
Christian.

> 
> [1] https://source.android.com/docs/core/perf/lmkd
> 
>>
>> See the multiple iterations we already had on that topic. Even including reverting already upstream uAPI.
>>
>> The latest incarnation is that BPF is used for this task on Android.
>>
>> Regards,
>> Christian.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-03-02 15:51               ` Christian König
@ 2026-03-02 17:16                 ` Shakeel Butt
  2026-03-02 19:36                   ` Christian König
  2026-03-02 19:35                 ` T.J. Mercier
  1 sibling, 1 reply; 36+ messages in thread
From: Shakeel Butt @ 2026-03-02 17:16 UTC (permalink / raw)
  To: Christian König
  Cc: Dave Airlie, dri-devel, tj, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, cgroups, Dave Chinner, Waiman Long,
	simona, tjmercier

On Mon, Mar 02, 2026 at 04:51:12PM +0100, Christian König wrote:
> On 3/2/26 16:40, Shakeel Butt wrote:
> > +TJ
> > 
> > On Mon, Mar 02, 2026 at 03:37:37PM +0100, Christian König wrote:
> >> On 3/2/26 15:15, Shakeel Butt wrote:
> >>> On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
> >>>> On 2/24/26 20:28, Dave Airlie wrote:
> >>> [...]
> >>>>
> >>>>> This has been a pain in the ass for desktop for years, and I'd like to
> >>>>> fix it, the HPC use case if purely a driver for me doing the work.
> >>>>
> >>>> Wait a second. How does accounting to cgroups help with that in any way?
> >>>>
> >>>> The last time I looked into this problem the OOM killer worked based on the per task_struct stats which couldn't be influenced this way.
> >>>>
> >>>
> >>> It depends on the context of the oom-killer. If the oom-killer is triggered due
> >>> to memcg limits then only the processes in the scope of the memcg will be
> >>> targetted by the oom-killer. With the specific setting, the oom-killer can kill
> >>> all the processes in the target memcg.
> >>>
> >>> However nowadays the userspace oom-killer is preferred over the kernel
> >>> oom-killer due to flexibility and configurability. Userspace oom-killers like
> >>> systmd-oomd, Android's LMKD or fb-oomd are being used in containerized
> >>> environments. Such oom-killers looks at memcg stats and hiding something
> >>> something from memcg i.e. not charging to memcg will hide such usage from these
> >>> oom-killers.
> >>
> >> Well exactly that's the problem. Android's oom killer is *not* using memcg exactly because of this inflexibility.
> > 
> > Are you sure Android's oom killer is not using memcg? From what I see in the
> > documentation [1], it requires memcg.
> 
> My bad, I should have been wording that better.
> 
> The Android OOM killer is not using memcg for tracking GPU memory allocations, because memcg doesn't have proper support for tracking shared buffers.

Yes indeed memcg is bad with buffers shared between memcgs (shmem, shared
filesystems).

> 
> In other words GPU memory allocations are shared by design and it is the norm that the process which is using it is not the process which has allocated it.

Here the GPU memory can be system memory or the actual memory on GPU, right?

I think I discussed with TJ the possibility of moving the allocations into the
context of the using process through a custom fault handler in GPU drivers. I
don't remember the conclusion, but I am assuming that is not possible.

> 
> What we would need (as a start) to handle all of this with memcg would be to accounted the resources to the process which referenced it and not the one which allocated it.

Irrespective of the memcg charging decision, one of my requests would be to at
least have global counters for the GPU memory, which this series is adding.
That would be very similar to NR_KERNEL_FILE_PAGES, where we explicitly opt
out of memcg charging but keep the global counter, so the admin can identify
the reasons behind high unaccounted memory on the system.

> 
> I can give a full list of requirements which would be needed by cgroups to cover all the different use cases, but it basically means tons of extra complexity.
> 
> Regards,
> Christian.
> 
> > 
> > [1] https://source.android.com/docs/core/perf/lmkd
> > 
> >>
> >> See the multiple iterations we already had on that topic. Even including reverting already upstream uAPI.
> >>
> >> The latest incarnation is that BPF is used for this task on Android.
> >>
> >> Regards,
> >> Christian.
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-03-02 15:51               ` Christian König
  2026-03-02 17:16                 ` Shakeel Butt
@ 2026-03-02 19:35                 ` T.J. Mercier
  2026-03-03  9:29                   ` Christian König
  2026-03-05  3:19                   ` Dave Airlie
  1 sibling, 2 replies; 36+ messages in thread
From: T.J. Mercier @ 2026-03-02 19:35 UTC (permalink / raw)
  To: Christian König
  Cc: Shakeel Butt, Dave Airlie, dri-devel, tj, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, cgroups, Dave Chinner,
	Waiman Long, simona, Suren Baghdasaryan

On Mon, Mar 2, 2026 at 7:51 AM Christian König <christian.koenig@amd.com> wrote:
>
> On 3/2/26 16:40, Shakeel Butt wrote:
> > +TJ
> >
> > On Mon, Mar 02, 2026 at 03:37:37PM +0100, Christian König wrote:
> >> On 3/2/26 15:15, Shakeel Butt wrote:
> >>> On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
> >>>> On 2/24/26 20:28, Dave Airlie wrote:
> >>> [...]
> >>>>
> >>>>> This has been a pain in the ass for desktop for years, and I'd like to
> >>>>> fix it, the HPC use case if purely a driver for me doing the work.
> >>>>
> >>>> Wait a second. How does accounting to cgroups help with that in any way?
> >>>>
> >>>> The last time I looked into this problem the OOM killer worked based on the per task_struct stats which couldn't be influenced this way.
> >>>>
> >>>
> >>> It depends on the context of the oom-killer. If the oom-killer is triggered due
> >>> to memcg limits then only the processes in the scope of the memcg will be
> >>> targetted by the oom-killer. With the specific setting, the oom-killer can kill
> >>> all the processes in the target memcg.
> >>>
> >>> However nowadays the userspace oom-killer is preferred over the kernel
> >>> oom-killer due to flexibility and configurability. Userspace oom-killers like
> >>> systmd-oomd, Android's LMKD or fb-oomd are being used in containerized
> >>> environments. Such oom-killers looks at memcg stats and hiding something
> >>> something from memcg i.e. not charging to memcg will hide such usage from these
> >>> oom-killers.
> >>
> >> Well exactly that's the problem. Android's oom killer is *not* using memcg exactly because of this inflexibility.
> >
> > Are you sure Android's oom killer is not using memcg? From what I see in the
> > documentation [1], it requires memcg.

LMKD used to use memcg v1 for memory.pressure_level, but that has been
replaced by PSI which is now the default configuration. I deprecated
all configurations with memcg v1 dependencies in January. We plan to
remove the memcg v1 support from LMKD when the 5.10 and 5.15 kernels
reach EOL.

> My bad, I should have been wording that better.
>
> The Android OOM killer is not using memcg for tracking GPU memory allocations, because memcg doesn't have proper support for tracking shared buffers.
>
> In other words GPU memory allocations are shared by design and it is the norm that the process which is using it is not the process which has allocated it.
>
> What we would need (as a start) to handle all of this with memcg would be to accounted the resources to the process which referenced it and not the one which allocated it.
>
> I can give a full list of requirements which would be needed by cgroups to cover all the different use cases, but it basically means tons of extra complexity.

Yeah this is right. We usually prioritize fast kills rather than
picking the biggest offender though. Application state (foreground /
background) is the primary selector, however LMKD does have a mode
(kill_heaviest_task) where it will pick the largest task within a
group of apps sharing the same application state. For this it uses RSS
from /proc/<pid>/statm, and (prepare to avert your eyes) a new and out
of tree interface in procfs for accounting dmabufs used by a process.
It tracks FD references and map references as they come and go, and
only counts any buffer once for a process regardless of the number and
type of references a process has to the same buffer. I dislike it
greatly.
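The selection logic sketched above, boiled down (a simplification, not LMKD's
actual code; the tuple layout and the "higher oom_score_adj = lower priority"
ordering are assumptions):

```python
# Sketch of LMKD-style "kill_heaviest_task": among the processes in the
# lowest-priority application state (highest oom_score_adj), pick the
# one with the largest RSS.
def pick_victim(tasks):
    """tasks: list of (name, oom_score_adj, rss_pages) tuples."""
    worst_adj = max(adj for _, adj, _ in tasks)           # least important state
    candidates = [t for t in tasks if t[1] == worst_adj]  # same-state group
    return max(candidates, key=lambda t: t[2])[0]         # heaviest by RSS

victim = pick_victim([("launcher", 0, 50_000),
                      ("game", 900, 120_000),
                      ("radio", 900, 30_000)])
# "game": both background apps share adj 900; game has the larger RSS.
```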

My original intention was to use the dmabuf BPF iterator we added to
scan maps and FDs of a process for dmabufs on demand. Very simple and
pretty fast in BPF. This wouldn't support high watermark tracking, so
I was forced into doing something else for per-process accounting. To
be fair, the HWM tracking has detected a few application bugs where
4GB of system memory was inadvertently consumed by dmabufs.

The BPF iterator is currently used to support accounting of buffers
not visible in userspace (dmabuf_dump / libdmabufinfo) and it's a nice
improvement for that over the old sysfs interface. I hope to replace
the slow scanning of procfs for dmabufs in libdmabufinfo with BPF
programs that use the dmabuf iterator, but that's not a priority for
this year.

Independent of all of that, memcg doesn't really work well for this
because it's shared memory that can only be attributed to a single
memcg, and the most common allocator (gralloc) is in a separate
process and memcg than the processes using the buffers (camera,
YouTube, etc.). I had a few patches that transferred the ownership of
buffers to a new memcg when they were sent via Binder, but this used
the memcg v1 charge moving functionality which is now gone because it
was so complicated. But that only works if there is one user that
should be charged for the buffer anyway. What if it is shared by
multiple applications and services?

> Regards,
> Christian.
>
> >
> > [1] https://source.android.com/docs/core/perf/lmkd
> >
> >>
> >> See the multiple iterations we already had on that topic. Even including reverting already upstream uAPI.
> >>
> >> The latest incarnation is that BPF is used for this task on Android.
> >>
> >> Regards,
> >> Christian.
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-03-02 17:16                 ` Shakeel Butt
@ 2026-03-02 19:36                   ` Christian König
  2026-03-05  3:23                     ` Dave Airlie
  0 siblings, 1 reply; 36+ messages in thread
From: Christian König @ 2026-03-02 19:36 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Dave Airlie, dri-devel, tj, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, cgroups, Dave Chinner, Waiman Long,
	simona, tjmercier

On 3/2/26 18:16, Shakeel Butt wrote:
> On Mon, Mar 02, 2026 at 04:51:12PM +0100, Christian König wrote:
>> On 3/2/26 16:40, Shakeel Butt wrote:
>>> +TJ
>>>
>>> On Mon, Mar 02, 2026 at 03:37:37PM +0100, Christian König wrote:
>>>> On 3/2/26 15:15, Shakeel Butt wrote:
>>>>> On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
>>>>>> On 2/24/26 20:28, Dave Airlie wrote:
>>>>> [...]
>>>>>>
>>>>>>> This has been a pain in the ass for desktop for years, and I'd like to
>>>>>>> fix it, the HPC use case if purely a driver for me doing the work.
>>>>>>
>>>>>> Wait a second. How does accounting to cgroups help with that in any way?
>>>>>>
>>>>>> The last time I looked into this problem the OOM killer worked based on the per task_struct stats which couldn't be influenced this way.
>>>>>>
>>>>>
>>>>> It depends on the context of the oom-killer. If the oom-killer is triggered due
>>>>> to memcg limits then only the processes in the scope of the memcg will be
>>>>> targetted by the oom-killer. With the specific setting, the oom-killer can kill
>>>>> all the processes in the target memcg.
>>>>>
>>>>> However nowadays the userspace oom-killer is preferred over the kernel
>>>>> oom-killer due to flexibility and configurability. Userspace oom-killers like
>>>>> systmd-oomd, Android's LMKD or fb-oomd are being used in containerized
>>>>> environments. Such oom-killers looks at memcg stats and hiding something
>>>>> something from memcg i.e. not charging to memcg will hide such usage from these
>>>>> oom-killers.
>>>>
>>>> Well exactly that's the problem. Android's oom killer is *not* using memcg exactly because of this inflexibility.
>>>
>>> Are you sure Android's oom killer is not using memcg? From what I see in the
>>> documentation [1], it requires memcg.
>>
>> My bad, I should have been wording that better.
>>
>> The Android OOM killer is not using memcg for tracking GPU memory allocations, because memcg doesn't have proper support for tracking shared buffers.
> 
> Yes indeed memcg is bad with buffers shared between memcgs (shmem, shared
> filesystems).

My big concern is that we create uAPI which we then (again) find six months later to be a blocker to further development and have to stick with.

That has happened before, and that we could remove the initial DMA-buf sysfs uAPI (for example) was only because Greg and T.J. agreed that the interface was not something we could carry forward.

>>
>> In other words GPU memory allocations are shared by design and it is the norm that the process which is using it is not the process which has allocated it.
> 
> Here the GPU memory can be system memory or the actual memory on GPU, right?

For embedded, mobile GPUs (Android) or APUs (modern laptops) it is system memory.

For dGPUs on desktop environments it is mostly device memory on GPUs, but system memory is used as swap.

In HPC, cloud computing and automotive use cases you have a mixture of both system memory and device memory.

> I think I discussed with TJ on the possibility of moving the allocations in the
> context of process using through custom fault handler in GPU drivers. I don't
> remember the conclusion but I am assuming that is not possible.

Most HW still uses pre-allocated memory and can't do things like fault on demand. That's why we allocate everything at once on creation or at most on first use.

But allocate-on-first-use has big potential for security problems. E.g. imagine you create a 1GiB buffer and send it to your display server as window content; the display server would be charged because it is the first one touching it, but you keep a reference to the memory for yourself.

For Android the only way out which has similar functionality as the BPF approach is to change charging when the file descriptors used as references for the memory are transferred between processes.

For some automotive use cases it is even worse. To fully handle that use case multiple different cgroups in the same process would be needed, e.g. different cgroups for different threads and/or client handles in the same QEMU process.

Long story short: It is a mess.

>>
>> What we would need (as a start) to handle all of this with memcg would be to account the resources to the process which referenced it and not the one which allocated it.
> 
> Irrespective of memcg charging decision, one of my requests would be to at least
> have global counters for the GPU memory which this series is adding. That would
> be very similar to NR_KERNEL_FILE_PAGES where we explicitly opt out of memcg
> charging but keep the global counter, so the admin can identify the reasons
> behind high unaccounted memory on the system.

Sounds reasonable to me. I will try to give this set another review round.

Regards,
Christian.

> 
>>
>> I can give a full list of requirements which would be needed by cgroups to cover all the different use cases, but it basically means tons of extra complexity.
>>
>> Regards,
>> Christian.
>>
>>>
>>> [1] https://source.android.com/docs/core/perf/lmkd
>>>
>>>>
>>>> See the multiple iterations we already had on that topic. Even including reverting already upstream uAPI.
>>>>
>>>> The latest incarnation is that BPF is used for this task on Android.
>>>>
>>>> Regards,
>>>> Christian.
>>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-03-02 19:35                 ` T.J. Mercier
@ 2026-03-03  9:29                   ` Christian König
  2026-03-03 17:25                     ` T.J. Mercier
  2026-03-05  3:19                   ` Dave Airlie
  1 sibling, 1 reply; 36+ messages in thread
From: Christian König @ 2026-03-03  9:29 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Shakeel Butt, Dave Airlie, dri-devel, tj, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, cgroups, Dave Chinner,
	Waiman Long, simona, Suren Baghdasaryan

On 3/2/26 20:35, T.J. Mercier wrote:
> On Mon, Mar 2, 2026 at 7:51 AM Christian König <christian.koenig@amd.com> wrote:
>>
>> On 3/2/26 16:40, Shakeel Butt wrote:
>>> +TJ
>>>
>>> On Mon, Mar 02, 2026 at 03:37:37PM +0100, Christian König wrote:
>>>> On 3/2/26 15:15, Shakeel Butt wrote:
>>>>> On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
>>>>>> On 2/24/26 20:28, Dave Airlie wrote:
>>>>> [...]
>>>>>>
>>>>>>> This has been a pain in the ass for desktop for years, and I'd like to
>>>>>>> fix it, the HPC use case is purely a driver for me doing the work.
>>>>>>
>>>>>> Wait a second. How does accounting to cgroups help with that in any way?
>>>>>>
>>>>>> The last time I looked into this problem the OOM killer worked based on the per task_struct stats which couldn't be influenced this way.
>>>>>>
>>>>>
>>>>> It depends on the context of the oom-killer. If the oom-killer is triggered due
>>>>> to memcg limits then only the processes in the scope of the memcg will be
>>>>> targeted by the oom-killer. With the specific setting, the oom-killer can kill
>>>>> all the processes in the target memcg.
>>>>>
>>>>> However nowadays the userspace oom-killer is preferred over the kernel
>>>>> oom-killer due to flexibility and configurability. Userspace oom-killers like
>>>>> systemd-oomd, Android's LMKD or fb-oomd are being used in containerized
>>>>> environments. Such oom-killers look at memcg stats, and hiding something
>>>>> from memcg, i.e. not charging to memcg, will hide such usage from these
>>>>> oom-killers.
>>>>
>>>> Well exactly that's the problem. Android's oom killer is *not* using memcg exactly because of this inflexibility.
>>>
>>> Are you sure Android's oom killer is not using memcg? From what I see in the
>>> documentation [1], it requires memcg.
> 
> LMKD used to use memcg v1 for memory.pressure_level, but that has been
> replaced by PSI which is now the default configuration. I deprecated
> all configurations with memcg v1 dependencies in January. We plan to
> remove the memcg v1 support from LMKD when the 5.10 and 5.15 kernels
> reach EOL.
> 
>> My bad, I should have worded that better.
>>
>> The Android OOM killer is not using memcg for tracking GPU memory allocations, because memcg doesn't have proper support for tracking shared buffers.
>>
>> In other words GPU memory allocations are shared by design and it is the norm that the process which is using it is not the process which has allocated it.
>>
>> What we would need (as a start) to handle all of this with memcg would be to account the resources to the process which referenced it and not the one which allocated it.
>>
>> I can give a full list of requirements which would be needed by cgroups to cover all the different use cases, but it basically means tons of extra complexity.
> 
> Yeah this is right. We usually prioritize fast kills rather than
> picking the biggest offender though. Application state (foreground /
> background) is the primary selector, however LMKD does have a mode
> (kill_heaviest_task) where it will pick the largest task within a
> group of apps sharing the same application state. For this it uses RSS
> from /proc/<pid>/statm, and (prepare to avert your eyes) a new and out
> of tree interface in procfs for accounting dmabufs used by a process.
> It tracks FD references and map references as they come and go, and
> only counts any buffer once for a process regardless of the number and
> type of references a process has to the same buffer. I dislike it
> greatly.

*sigh* I was really hoping that we would have nailed it with the BPF support for DMA-buf and not rely on out of tree stuff any more.

We should really stop re-inventing the wheel over and over again and fix the shortcomings cgroups has instead and then use that one.

> My original intention was to use the dmabuf BPF iterator we added to
> scan maps and FDs of a process for dmabufs on demand. Very simple and
> pretty fast in BPF. This wouldn't support high watermark tracking, so
> I was forced into doing something else for per-process accounting. To
> be fair, the HWM tracking has detected a few application bugs where
> 4GB of system memory was inadvertently consumed by dmabufs.
> 
> The BPF iterator is currently used to support accounting of buffers
> not visible in userspace (dmabuf_dump / libdmabufinfo) and it's a nice
> improvement for that over the old sysfs interface. I hope to replace
> the slow scanning of procfs for dmabufs in libdmabufinfo with BPF
> programs that use the dmabuf iterator, but that's not a priority for
> this year.
> 
> Independent of all of that, memcg doesn't really work well for this
> because it's shared memory that can only be attributed to a single
> memcg, and the most common allocator (gralloc) is in a separate
> process and memcg than the processes using the buffers (camera,
> YouTube, etc.). I had a few patches that transferred the ownership of
> buffers to a new memcg when they were sent via Binder, but this used
> the memcg v1 charge moving functionality which is now gone because it
> was so complicated. But that only works if there is one user that
> should be charged for the buffer anyway. What if it is shared by
> multiple applications and services?

Well the "usual" (e.g. what you find in the literature and what other operating systems do) approach is to use a proportional set size instead of the resident set size: https://en.wikipedia.org/wiki/Proportional_set_size

The problem is that a proportional set size is usually harder to come by. So it means additional overhead, more complex interfaces etc...
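The distinction can be made concrete with a small sketch (a toy model with made-up names; in reality PSS is computed per page from its map count, e.g. as exposed in /proc/<pid>/smaps):

```python
# Toy model contrasting RSS and PSS accounting for shared buffers.
# Names are illustrative; this is not kernel code.

def rss(pages_by_proc, proc):
    """RSS: every resident page counts fully for each process mapping it."""
    return len(pages_by_proc[proc])

def pss(pages_by_proc, proc):
    """PSS: each page is split evenly among the processes sharing it."""
    total = 0.0
    for page in pages_by_proc[proc]:
        sharers = sum(1 for p in pages_by_proc.values() if page in p)
        total += 1.0 / sharers
    return total

# A 4-page buffer mapped by both an app and a compositor,
# plus 2 private pages in the app.
buffer = {"b0", "b1", "b2", "b3"}
mappings = {
    "app": buffer | {"p0", "p1"},
    "compositor": set(buffer),
}

assert rss(mappings, "app") == 6           # charged the full buffer
assert pss(mappings, "app") == 4.0         # 4/2 shared + 2 private
assert pss(mappings, "compositor") == 2.0  # half the shared buffer
# Unlike RSS, PSS sums to the real total across all sharers.
assert pss(mappings, "app") + pss(mappings, "compositor") == 6.0
```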

Regards,
Christian.

> 
>> Regards,
>> Christian.
>>
>>>
>>> [1] https://source.android.com/docs/core/perf/lmkd
>>>
>>>>
>>>> See the multiple iterations we already had on that topic. Even including reverting already upstream uAPI.
>>>>
>>>> The latest incarnation is that BPF is used for this task on Android.
>>>>
>>>> Regards,
>>>> Christian.
>>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-03-03  9:29                   ` Christian König
@ 2026-03-03 17:25                     ` T.J. Mercier
  0 siblings, 0 replies; 36+ messages in thread
From: T.J. Mercier @ 2026-03-03 17:25 UTC (permalink / raw)
  To: Christian König
  Cc: Shakeel Butt, Dave Airlie, dri-devel, tj, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, cgroups, Dave Chinner,
	Waiman Long, simona, Suren Baghdasaryan

On Tue, Mar 3, 2026 at 1:29 AM Christian König <christian.koenig@amd.com> wrote:
>
> On 3/2/26 20:35, T.J. Mercier wrote:
> > On Mon, Mar 2, 2026 at 7:51 AM Christian König <christian.koenig@amd.com> wrote:
> >>
> >> On 3/2/26 16:40, Shakeel Butt wrote:
> >>> +TJ
> >>>
> >>> On Mon, Mar 02, 2026 at 03:37:37PM +0100, Christian König wrote:
> >>>> On 3/2/26 15:15, Shakeel Butt wrote:
> >>>>> On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
> >>>>>> On 2/24/26 20:28, Dave Airlie wrote:
> >>>>> [...]
> >>>>>>
> >>>>>>> This has been a pain in the ass for desktop for years, and I'd like to
> >>>>>>> fix it, the HPC use case is purely a driver for me doing the work.
> >>>>>>
> >>>>>> Wait a second. How does accounting to cgroups help with that in any way?
> >>>>>>
> >>>>>> The last time I looked into this problem the OOM killer worked based on the per task_struct stats which couldn't be influenced this way.
> >>>>>>
> >>>>>
> >>>>> It depends on the context of the oom-killer. If the oom-killer is triggered due
> >>>>> to memcg limits then only the processes in the scope of the memcg will be
> >>>>> targeted by the oom-killer. With the specific setting, the oom-killer can kill
> >>>>> all the processes in the target memcg.
> >>>>>
> >>>>> However nowadays the userspace oom-killer is preferred over the kernel
> >>>>> oom-killer due to flexibility and configurability. Userspace oom-killers like
> >>>>> systemd-oomd, Android's LMKD or fb-oomd are being used in containerized
> >>>>> environments. Such oom-killers look at memcg stats, and hiding something
> >>>>> from memcg, i.e. not charging to memcg, will hide such usage from these
> >>>>> oom-killers.
> >>>>
> >>>> Well exactly that's the problem. Android's oom killer is *not* using memcg exactly because of this inflexibility.
> >>>
> >>> Are you sure Android's oom killer is not using memcg? From what I see in the
> >>> documentation [1], it requires memcg.
> >
> > LMKD used to use memcg v1 for memory.pressure_level, but that has been
> > replaced by PSI which is now the default configuration. I deprecated
> > all configurations with memcg v1 dependencies in January. We plan to
> > remove the memcg v1 support from LMKD when the 5.10 and 5.15 kernels
> > reach EOL.
> >
> >> My bad, I should have worded that better.
> >>
> >> The Android OOM killer is not using memcg for tracking GPU memory allocations, because memcg doesn't have proper support for tracking shared buffers.
> >>
> >> In other words GPU memory allocations are shared by design and it is the norm that the process which is using it is not the process which has allocated it.
> >>
> >> What we would need (as a start) to handle all of this with memcg would be to account the resources to the process which referenced it and not the one which allocated it.
> >>
> >> I can give a full list of requirements which would be needed by cgroups to cover all the different use cases, but it basically means tons of extra complexity.
> >
> > Yeah this is right. We usually prioritize fast kills rather than
> > picking the biggest offender though. Application state (foreground /
> > background) is the primary selector, however LMKD does have a mode
> > (kill_heaviest_task) where it will pick the largest task within a
> > group of apps sharing the same application state. For this it uses RSS
> > from /proc/<pid>/statm, and (prepare to avert your eyes) a new and out
> > of tree interface in procfs for accounting dmabufs used by a process.
> > It tracks FD references and map references as they come and go, and
> > only counts any buffer once for a process regardless of the number and
> > type of references a process has to the same buffer. I dislike it
> > greatly.
>
> *sigh* I was really hoping that we would have nailed it with the BPF support for DMA-buf and not rely on out of tree stuff any more.

The BPF support is still a win and I'm very happy to have it, but I
don't think there was ever a route to implementing cgroup limits on
top of it.

> We should really stop re-inventing the wheel over and over again and fix the shortcomings cgroups has instead and then use that one.
>
> > My original intention was to use the dmabuf BPF iterator we added to
> > scan maps and FDs of a process for dmabufs on demand. Very simple and
> > pretty fast in BPF. This wouldn't support high watermark tracking, so
> > I was forced into doing something else for per-process accounting. To
> > be fair, the HWM tracking has detected a few application bugs where
> > 4GB of system memory was inadvertently consumed by dmabufs.
> >
> > The BPF iterator is currently used to support accounting of buffers
> > not visible in userspace (dmabuf_dump / libdmabufinfo) and it's a nice
> > improvement for that over the old sysfs interface. I hope to replace
> > the slow scanning of procfs for dmabufs in libdmabufinfo with BPF
> > programs that use the dmabuf iterator, but that's not a priority for
> > this year.
> >
> > Independent of all of that, memcg doesn't really work well for this
> > because it's shared memory that can only be attributed to a single
> > memcg, and the most common allocator (gralloc) is in a separate
> > process and memcg than the processes using the buffers (camera,
> > YouTube, etc.). I had a few patches that transferred the ownership of
> > buffers to a new memcg when they were sent via Binder, but this used
> > the memcg v1 charge moving functionality which is now gone because it
> > was so complicated. But that only works if there is one user that
> > should be charged for the buffer anyway. What if it is shared by
> > multiple applications and services?
>
> Well the "usual" (e.g. what you find in the literature and what other operating systems do) approach is to use a proportional set size instead of the resident set size: https://en.wikipedia.org/wiki/Proportional_set_size
>
> The problem is that a proportional set size is usually harder to come by. So it means additional overhead, more complex interfaces etc...

I added /proc/<pid>/dmabuf_pss as well, which actually isn't a
horrible implementation if you consider that the entire buffer is
pinned as long as there is any user.

https://cs.android.com/android/kernel/superproject/+/common-android14-6.1:common/fs/proc/base.c;drc=1b269f8eb12649ec9370f4051ae049e54a31e3fe;l=3393

With page-based memcg accounting it would be much harder and more expensive.

> Regards,
> Christian.
>
> >
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> [1] https://source.android.com/docs/core/perf/lmkd
> >>>
> >>>>
> >>>> See the multiple iterations we already had on that topic. Even including reverting already upstream uAPI.
> >>>>
> >>>> The latest incarnation is that BPF is used for this task on Android.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-03-02 19:35                 ` T.J. Mercier
  2026-03-03  9:29                   ` Christian König
@ 2026-03-05  3:19                   ` Dave Airlie
  2026-03-05  9:25                     ` Christian König
  2026-03-10  1:27                     ` T.J. Mercier
  1 sibling, 2 replies; 36+ messages in thread
From: Dave Airlie @ 2026-03-05  3:19 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Christian König, Shakeel Butt, dri-devel, tj,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	cgroups, Dave Chinner, Waiman Long, simona, Suren Baghdasaryan

> Independent of all of that, memcg doesn't really work well for this
> because it's shared memory that can only be attributed to a single
> memcg, and the most common allocator (gralloc) is in a separate
> process and memcg than the processes using the buffers (camera,
> YouTube, etc.). I had a few patches that transferred the ownership of
> buffers to a new memcg when they were sent via Binder, but this used
> the memcg v1 charge moving functionality which is now gone because it
> was so complicated. But that only works if there is one user that
> should be charged for the buffer anyway. What if it is shared by
> multiple applications and services?
>

Usually there is a user intent behind the action that causes the
memory allocation, e.g. the user opens the camera app, or runs
Instagram which opens the camera and uses GPU filters.

The charge should go to the application or cgroup behind that
intent. If multiple applications/services are sharing a buffer, what
is the intent behind how that happens? Is there a limit on how many
of those shared buffers can be allocated, and what drives that?

If there are truly internal memory allocations like evictions or
underlying shared page tables then maybe we don't have to account
those to a specific application, but we really want to make sure a
user cannot intentionally cause an application to consume more memory,
so at least for Android I'd try to tie it to user actions and account
to that process.

On desktop Linux, I would say Firefox or GTK apps are the intended
users of any compositor-allocated buffers (not that I think we really
have those now).

If there are caches of allocations, those should also be tied into
cgroups so memory pressure can reclaim them.

Dave.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-03-02 19:36                   ` Christian König
@ 2026-03-05  3:23                     ` Dave Airlie
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Airlie @ 2026-03-05  3:23 UTC (permalink / raw)
  To: Christian König
  Cc: Shakeel Butt, dri-devel, tj, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, cgroups, Dave Chinner, Waiman Long,
	simona, tjmercier

> >>
> >> What we would need (as a start) to handle all of this with memcg would be to account the resources to the process which referenced it and not the one which allocated it.
> >
> > Irrespective of memcg charging decision, one of my requests would be to at least
> > have global counters for the GPU memory which this series is adding. That would
> > be very similar to NR_KERNEL_FILE_PAGES where we explicitly opt out of memcg
> > charging but keep the global counter, so the admin can identify the reasons
> > behind high unaccounted memory on the system.
>
> Sounds reasonable to me. I will try to give this set another review round.

I think I will try to land the first 6 patches in this series soon.
These add the basic counters and port the pool to use list_lru,
dropping a bunch of TTM-specific code in favour of core kernel code.

They also don't create any uAPI; it's just two counters exposed in
the VM stats showing that the GPU is using that memory.

Dave.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-03-05  3:19                   ` Dave Airlie
@ 2026-03-05  9:25                     ` Christian König
  2026-03-10  1:27                     ` T.J. Mercier
  1 sibling, 0 replies; 36+ messages in thread
From: Christian König @ 2026-03-05  9:25 UTC (permalink / raw)
  To: Dave Airlie, T.J. Mercier
  Cc: Shakeel Butt, dri-devel, tj, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, cgroups, Dave Chinner, Waiman Long,
	simona, Suren Baghdasaryan

On 3/5/26 04:19, Dave Airlie wrote:
>> Independent of all of that, memcg doesn't really work well for this
>> because it's shared memory that can only be attributed to a single
>> memcg, and the most common allocator (gralloc) is in a separate
>> process and memcg than the processes using the buffers (camera,
>> YouTube, etc.). I had a few patches that transferred the ownership of
>> buffers to a new memcg when they were sent via Binder, but this used
>> the memcg v1 charge moving functionality which is now gone because it
>> was so complicated. But that only works if there is one user that
>> should be charged for the buffer anyway. What if it is shared by
>> multiple applications and services?
>>
> 
> Usually there is a user intent behind the action that causes the
> memory allocation, i.e. user opens camera app, user runs instagram
> which opens camera, and uses GPU filters etc.
> 
> The charge should be to the application or cgroup that causes the
> intent, if multiple applications/services are sharing a buffer, what
> is the intent behind how that happens, is there a limit on how to make
> more of those shared buffers get allocated, what drives that etc.

Well the problem is that by charging to the process allocating things, the kernel only makes an educated guess at what the intent is.

As far as I can see that approach only works for containers but basically not for a huge bunch of other use cases.

Taking the Android use case T.J. describes as an example, what you would need to charge against there is the file descriptor and not the application. The application only comes into the picture when you then want to know the total of what a specific process uses.

Same thing for native context in automotive; here it is even worse since the role separation is different and you never even ask what the application uses, because that is completely uninteresting for that use case. But you still want some kind of limitation for each use case.

> If there are truly internal memory allocations like evictions or
> underlying shared pages tables then maybe we don't have to account
> those to a specific application, but we really want to make sure a
> user intentionally cannot cause an application to consume more memory,
> so at least for Android I'd try and tie it to user actions and account
> to that process.
> 
> On desktop Linux, I would say firefox or gtk apps are the intended
> users of any compositor allocated buffers (not that we really have
> those now I don't think).
> 
> if there are caches of allocations that should also be tied into
> cgroups, so memory pressure can reclaim them.

Not if the cache is global for all cgroups. Tying caches to cgroups only makes sense if that cache is beneficial to that specific cgroup.

In other words you need to distinguish between caches (which contain cgroup-specific data) and memory pools, i.e. caches which don't contain any data.

Regards,
Christian.

> 
> Dave.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
  2026-03-05  3:19                   ` Dave Airlie
  2026-03-05  9:25                     ` Christian König
@ 2026-03-10  1:27                     ` T.J. Mercier
  1 sibling, 0 replies; 36+ messages in thread
From: T.J. Mercier @ 2026-03-10  1:27 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Christian König, Shakeel Butt, dri-devel, tj,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	cgroups, Dave Chinner, Waiman Long, simona, Suren Baghdasaryan

On Wed, Mar 4, 2026 at 7:20 PM Dave Airlie <airlied@gmail.com> wrote:
>
> > Independent of all of that, memcg doesn't really work well for this
> > because it's shared memory that can only be attributed to a single
> > memcg, and the most common allocator (gralloc) is in a separate
> > process and memcg than the processes using the buffers (camera,
> > YouTube, etc.). I had a few patches that transferred the ownership of
> > buffers to a new memcg when they were sent via Binder, but this used
> > the memcg v1 charge moving functionality which is now gone because it
> > was so complicated. But that only works if there is one user that
> > should be charged for the buffer anyway. What if it is shared by
> > multiple applications and services?
> >
>
> Usually there is a user intent behind the action that causes the
> memory allocation, i.e. user opens camera app, user runs instagram
> which opens camera, and uses GPU filters etc.
>
> The charge should be to the application or cgroup that causes the
> intent, if multiple applications/services are sharing a buffer, what
> is the intent behind how that happens, is there a limit on how to make
> more of those shared buffers get allocated, what drives that etc.
>
> If there are truly internal memory allocations like evictions or
> underlying shared pages tables then maybe we don't have to account
> those to a specific application, but we really want to make sure a
> user intentionally cannot cause an application to consume more memory,
> so at least for Android I'd try and tie it to user actions and account
> to that process.

Hi Dave,

Yeah this is why I pursued charge transfer. Most of the time I think
we'd be ok just charging the initial allocating application, but a
separate allocator process in a different memcg actually performs the
allocation ioctl, so the allocator process, not the application, gets
charged for it. Without charge transfer functionality in memcg, all
the charge ends up in gralloc's memcg and that doesn't work for
applying limits to applications.
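The charge-transfer idea above can be sketched as a toy model (made-up names and numbers; the real memcg v1 "charge moving" machinery worked per page and is now gone):

```python
# Toy model of transferring a buffer's memcg charge when the buffer is
# handed to another process (e.g. over Binder). Illustrative only.

class Buffer:
    def __init__(self, pages, memcg):
        self.pages = pages   # number of pages backing the buffer
        self.memcg = memcg   # cgroup currently charged for it

charges = {"gralloc": 0, "camera-app": 0}

def allocate(pages, memcg):
    """The allocator service performs the ioctl, so it pays first."""
    charges[memcg] += pages
    return Buffer(pages, memcg)

def transfer(buf, new_memcg):
    """Move the charge to the receiving process's memcg on handoff."""
    charges[buf.memcg] -= buf.pages
    charges[new_memcg] += buf.pages
    buf.memcg = new_memcg

buf = allocate(256, "gralloc")   # gralloc's memcg is charged initially
transfer(buf, "camera-app")      # Binder handoff moves the charge

assert charges == {"gralloc": 0, "camera-app": 256}
```

As the text notes, this only resolves the simple case of a single rightful owner; a buffer shared by several applications still has no obvious single memcg to charge.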

There are some kernel-driven allocations, which I'm only aware of in
drivers. These are by far the minority, so I don't think accounting
them separately or not at all creates a significant gap.

> On desktop Linux, I would say firefox or gtk apps are the intended
> users of any compositor allocated buffers (not that we really have
> those now I don't think).
>
> if there are caches of allocations that should also be tied into
> cgroups, so memory pressure can reclaim them.
>
> Dave.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2026-03-10  1:27 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-24  2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
2026-02-24  2:06 ` [PATCH 01/16] mm: add gpu active/reclaim per-node stat counters (v2) Dave Airlie
2026-02-24  2:06 ` [PATCH 02/16] drm/ttm: use gpu mm stats to track gpu memory allocations. (v4) Dave Airlie
2026-02-24  2:06 ` [PATCH 03/16] ttm/pool: port to list_lru. (v2) Dave Airlie
2026-02-24  2:06 ` [PATCH 04/16] ttm/pool: drop numa specific pools Dave Airlie
2026-02-24  2:06 ` [PATCH 05/16] ttm/pool: make pool shrinker NUMA aware (v2) Dave Airlie
2026-02-24  2:06 ` [PATCH 06/16] ttm/pool: track allocated_pages per numa node Dave Airlie
2026-02-24  2:06 ` [PATCH 07/16] memcg: add support for GPU page counters. (v4) Dave Airlie
2026-02-24  7:20   ` kernel test robot
2026-02-24  7:50   ` Christian König
2026-02-24 19:28     ` Dave Airlie
2026-02-25  9:09       ` Christian König
2026-03-02 14:15         ` Shakeel Butt
2026-03-02 14:37           ` Christian König
2026-03-02 15:40             ` Shakeel Butt
2026-03-02 15:51               ` Christian König
2026-03-02 17:16                 ` Shakeel Butt
2026-03-02 19:36                   ` Christian König
2026-03-05  3:23                     ` Dave Airlie
2026-03-02 19:35                 ` T.J. Mercier
2026-03-03  9:29                   ` Christian König
2026-03-03 17:25                     ` T.J. Mercier
2026-03-05  3:19                   ` Dave Airlie
2026-03-05  9:25                     ` Christian König
2026-03-10  1:27                     ` T.J. Mercier
2026-02-24  2:06 ` [PATCH 08/16] ttm: add a memcg accounting flag to the alloc/populate APIs Dave Airlie
2026-02-24  8:42   ` kernel test robot
2026-02-24  2:06 ` [PATCH 09/16] ttm/pool: initialise the shrinker earlier Dave Airlie
2026-02-24  2:06 ` [PATCH 10/16] ttm: add objcg pointer to bo and tt (v2) Dave Airlie
2026-02-24  2:06 ` [PATCH 11/16] ttm/pool: enable memcg tracking and shrinker. (v3) Dave Airlie
2026-02-24  2:06 ` [PATCH 12/16] ttm: hook up memcg placement flags Dave Airlie
2026-02-24  2:06 ` [PATCH 13/16] memcontrol: allow objcg api when memcg is config off Dave Airlie
2026-02-24  2:06 ` [PATCH 14/16] amdgpu: add support for memory cgroups Dave Airlie
2026-02-24  2:06 ` [PATCH 15/16] ttm: add support for a module option to disable memcg integration Dave Airlie
2026-02-24  2:06 ` [PATCH 16/16] xe: create a flag to enable memcg accounting for XE as well Dave Airlie
  -- strict thread matches above, loose matches on Subject: below --
2025-10-16  2:31 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v4) Dave Airlie
2025-10-16  2:31 ` [PATCH 13/16] memcontrol: allow objcg api when memcg is config off Dave Airlie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox