cgroups.vger.kernel.org archive mirror
* [RFC 0/3] cgroups: Add support for pinned device memory
@ 2025-08-19 11:49 Maarten Lankhorst
  2025-08-19 11:49 ` [RFC 1/3] page_counter: Allow for pinning some amount of memory Maarten Lankhorst
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Maarten Lankhorst @ 2025-08-19 11:49 UTC (permalink / raw)
  To: Lucas De Marchi, 'Thomas Hellström', Rodrigo Vivi,
	David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Natalie Vock, Tejun Heo, Johannes Weiner,
	'Michal Koutný', Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, 'Liam R . Howlett', Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Thomas Zimmermann
  Cc: Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

When exporting dma-bufs to other devices, performance will degrade
severely when eviction happens, even in drivers that are allowed to use
move_notify.

A particular example where this can happen is a multi-card setup,
where PCI-E peer-to-peer is used to avoid going through system memory.

If the buffer is evicted to system memory, not only is the evicting GPU
where the buffer resided affected, but the GPU that is waiting on the
buffer will also stall.

It also makes sense for long-running jobs not to be preempted by having
their buffers evicted, so the ability to pin system memory as well will
be useful.

This is dependent on patches by Dave Airlie, so it's not part of this
series yet. But I'm planning on extending pinning to the memory cgroup
controller in the future to handle this case.

Implementation details:

For each cgroup up to the root cgroup, the 'min' limit is checked
against the currently effectively pinned value. If the value would go
above 'min', the pinning attempt is rejected.

Pinned memory is handled slightly differently and affects the
calculation of effective min/low values: pinned memory is subtracted
from both, and needs to be added back afterwards when calculating.

This is because as the amount of pinned memory increases, the amount of
free min/low memory decreases for all cgroups that are part of the
hierarchy.
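The hierarchical check described above can be sketched in plain C as
follows. This is a simplified, single-threaded userspace model (the
struct and function names here are made up for illustration), not the
kernel's lockless page_counter implementation:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for struct page_counter: only the fields the
 * pinning check needs. */
struct counter {
	long pinned;            /* currently pinned amount */
	long min;               /* 'min' protection limit */
	struct counter *parent; /* NULL at the root */
};

/* Charge the pin at every level up to the root; if any level would go
 * above its 'min' limit, roll back the levels already charged and
 * reject the attempt. Returns 1 on success, 0 on failure. */
static int try_pin(struct counter *c, long nr)
{
	struct counter *it, *fail = NULL;

	for (it = c; it; it = it->parent) {
		it->pinned += nr;
		if (it->pinned > it->min) {
			it->pinned -= nr; /* undo this level */
			fail = it;
			break;
		}
	}
	if (!fail)
		return 1;

	for (it = c; it != fail; it = it->parent)
		it->pinned -= nr; /* roll back the lower levels */
	return 0;
}

static void unpin(struct counter *c, long nr)
{
	for (struct counter *it = c; it; it = it->parent)
		it->pinned -= nr;
}

/* A two-level demo hierarchy for experimentation. */
static struct counter root  = { .min = 100 };
static struct counter child = { .min = 60, .parent = &root };
```

With min=60 on the child, pinning 50 succeeds at both levels, but a
further pin of 20 is rejected and leaves all counters unchanged.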

Maarten Lankhorst (3):
  page_counter: Allow for pinning some amount of memory
  cgroup/dmem: Implement pinning device memory
  drm/xe: Add DRM_XE_GEM_CREATE_FLAG_PINNED flag and implementation

 drivers/gpu/drm/xe/xe_bo.c      | 66 +++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_dma_buf.c | 10 +++-
 include/linux/cgroup_dmem.h     |  2 +
 include/linux/page_counter.h    |  8 +++
 include/uapi/drm/xe_drm.h       | 10 +++-
 kernel/cgroup/dmem.c            | 57 ++++++++++++++++++-
 mm/page_counter.c               | 98 ++++++++++++++++++++++++++++++---
 7 files changed, 237 insertions(+), 14 deletions(-)

-- 
2.50.0


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC 1/3] page_counter: Allow for pinning some amount of memory
  2025-08-19 11:49 [RFC 0/3] cgroups: Add support for pinned device memory Maarten Lankhorst
@ 2025-08-19 11:49 ` Maarten Lankhorst
  2025-08-19 11:49 ` [RFC 2/3] cgroup/dmem: Implement pinning device memory Maarten Lankhorst
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Maarten Lankhorst @ 2025-08-19 11:49 UTC (permalink / raw)
  To: Lucas De Marchi, 'Thomas Hellström', Rodrigo Vivi,
	David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Natalie Vock, Tejun Heo, Johannes Weiner,
	'Michal Koutný', Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, 'Liam R . Howlett', Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Thomas Zimmermann
  Cc: Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

Add a pinned member, and use it to implement pinning accounting.
Memory to be pinned has to already be accounted for as normally used,
and only memory up to the 'min' limit is allowed to be pinned.

This limit is chosen because cgroups already guarantees that memory
up to that limit will not be evicted.

Pinned memory affects min and low calculations, so alter those slightly.
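As a rough single-threaded sketch of how pinned memory enters these
calculations (the helper names below are hypothetical; the kernel folds
this into effective_protection() and the eviction check):

```c
#include <assert.h>

/* Mirrors the adjustment in this patch: if pinned memory already
 * exceeds the configured min/low limit, no protection remains to be
 * distributed, so the limit is treated as 0. */
static long protection_minus_pinned(long limit, long pinned)
{
	return pinned > limit ? 0 : limit;
}

/* When testing evictability, pinned memory is added back on top of
 * the effective protection, since pinned pages can never be evicted. */
static int over_protection(long used, long eprot, long pinned)
{
	return used > eprot + pinned;
}
```

So a cgroup using 120 pages with effective protection 80 and 30 pinned
is still evictable (120 > 110), while one using 100 is not.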

Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
---
 include/linux/page_counter.h |  8 +++
 mm/page_counter.c            | 98 +++++++++++++++++++++++++++++++++---
 2 files changed, 99 insertions(+), 7 deletions(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index d649b6bbbc871..5836c6dfb3c76 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -13,6 +13,7 @@ struct page_counter {
 	 * v2. The memcg->memory.usage is a hot member of struct mem_cgroup.
 	 */
 	atomic_long_t usage;
+	atomic_long_t pinned;
 	unsigned long failcnt; /* v1-only field */
 
 	CACHELINE_PADDING(_pad1_);
@@ -68,11 +69,18 @@ static inline unsigned long page_counter_read(struct page_counter *counter)
 	return atomic_long_read(&counter->usage);
 }
 
+static inline unsigned long page_counter_pinned(struct page_counter *counter)
+{
+	return atomic_long_read(&counter->pinned);
+}
+
 void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
 bool page_counter_try_charge(struct page_counter *counter,
 			     unsigned long nr_pages,
 			     struct page_counter **fail);
+bool page_counter_try_pin(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_unpin(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
diff --git a/mm/page_counter.c b/mm/page_counter.c
index 661e0f2a5127a..d29d0ed01ec18 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -184,6 +184,82 @@ void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
 		page_counter_cancel(c, nr_pages);
 }
 
+static void page_counter_unpin_one(struct page_counter *counter, unsigned long nr_pages)
+{
+	long new;
+
+	new = atomic_long_sub_return(nr_pages, &counter->pinned);
+	/* More uncharges than charges? */
+	if (WARN_ONCE(new < 0, "page_counter pinned underflow: %ld nr_pages=%lu\n",
+		      new, nr_pages))
+		atomic_long_set(&counter->pinned, 0);
+}
+
+/**
+ * page_counter_try_pin - try to hierarchically pin pages
+ * @counter: counter
+ * @nr_pages: number of pages to pin
+ *
+ * Returns %true on success, or %false if any counter would go above
+ * the 'min' limit. The failing cgroup is not returned, as pinned memory
+ * cannot be evicted.
+ */
+bool page_counter_try_pin(struct page_counter *counter,
+			     unsigned long nr_pages)
+{
+	struct page_counter *c, *fail;
+	bool track_failcnt = counter->track_failcnt;
+
+	for (c = counter; c; c = c->parent) {
+		long new;
+		/*
+		 * Pin speculatively to avoid an expensive CAS.  If
+		 * a bigger pin fails, it might falsely lock out a
+		 * racing smaller pin and make it fail early, but the
+		 * error is limited to the difference between the two
+		 * sizes, which is less than 2M/4M in case of a THP
+		 * pin locking out a regular page pin.
+		 *
+		 * The atomic_long_add_return() implies a full memory
+		 * barrier between incrementing the count and reading
+		 * the limit.  When racing with page_counter_set_min(),
+		 * we either see the new limit or the setter sees the
+		 * counter has changed and retries.
+		 */
+		new = atomic_long_add_return(nr_pages, &c->pinned);
+		if (new > READ_ONCE(c->min)) {
+			atomic_long_sub(nr_pages, &c->pinned);
+			/*
+			 * This is racy, but we can live with some
+			 * inaccuracy in the failcnt which is only used
+			 * to report stats.
+			 */
+			if (track_failcnt)
+				data_race(c->failcnt++);
+			fail = c;
+			goto failed;
+		}
+	}
+	return true;
+
+failed:
+	for (c = counter; c != fail; c = c->parent)
+		page_counter_unpin_one(c, nr_pages);
+
+	return false;
+}
+
+/**
+ * page_counter_unpin - hierarchically unpin pages
+ * @counter: counter
+ * @nr_pages: number of pages to unpin
+ */
+void page_counter_unpin(struct page_counter *counter, unsigned long nr_pages)
+{
+	for (struct page_counter *c = counter; c; c = c->parent)
+		page_counter_unpin_one(c, nr_pages);
+}
+
 /**
  * page_counter_set_max - set the maximum number of pages allowed
  * @counter: counter
@@ -425,7 +501,7 @@ void page_counter_calculate_protection(struct page_counter *root,
 				       struct page_counter *counter,
 				       bool recursive_protection)
 {
-	unsigned long usage, parent_usage;
+	unsigned long usage, parent_usage, pinned, min, low;
 	struct page_counter *parent = counter->parent;
 
 	/*
@@ -442,23 +518,31 @@ void page_counter_calculate_protection(struct page_counter *root,
 	if (!usage)
 		return;
 
+	pinned = page_counter_pinned(counter);
+
+	/* For calculation purposes, pinned memory is subtracted from min/low */
+	min = READ_ONCE(counter->min);
+	if (pinned > min)
+		min = 0;
+	low = READ_ONCE(counter->low);
+	if (pinned > low)
+		low = 0;
+
 	if (parent == root) {
-		counter->emin = READ_ONCE(counter->min);
-		counter->elow = READ_ONCE(counter->low);
+		counter->emin = min;
+		counter->elow = low;
 		return;
 	}
 
 	parent_usage = page_counter_read(parent);
 
 	WRITE_ONCE(counter->emin, effective_protection(usage, parent_usage,
-			READ_ONCE(counter->min),
-			READ_ONCE(parent->emin),
+			min, READ_ONCE(parent->emin),
 			atomic_long_read(&parent->children_min_usage),
 			recursive_protection));
 
 	WRITE_ONCE(counter->elow, effective_protection(usage, parent_usage,
-			READ_ONCE(counter->low),
-			READ_ONCE(parent->elow),
+			low, READ_ONCE(parent->elow),
 			atomic_long_read(&parent->children_low_usage),
 			recursive_protection));
 }
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC 2/3] cgroup/dmem: Implement pinning device memory
  2025-08-19 11:49 [RFC 0/3] cgroups: Add support for pinned device memory Maarten Lankhorst
  2025-08-19 11:49 ` [RFC 1/3] page_counter: Allow for pinning some amount of memory Maarten Lankhorst
@ 2025-08-19 11:49 ` Maarten Lankhorst
  2025-08-19 11:49 ` [RFC 3/3] drm/xe: Add DRM_XE_GEM_CREATE_FLAG_PINNED flag and implementation Maarten Lankhorst
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Maarten Lankhorst @ 2025-08-19 11:49 UTC (permalink / raw)
  To: Lucas De Marchi, 'Thomas Hellström', Rodrigo Vivi,
	David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Natalie Vock, Tejun Heo, Johannes Weiner,
	'Michal Koutný', Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, 'Liam R . Howlett', Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Thomas Zimmermann
  Cc: Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

Add functions to pin and unpin memory, and adjust the calculations
in dmem_cgroup_state_evict_valuable().

Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
---
 include/linux/cgroup_dmem.h |  2 ++
 kernel/cgroup/dmem.c        | 57 +++++++++++++++++++++++++++++++++++--
 2 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736e..a981bb692ba22 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -21,6 +21,8 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
 			   struct dmem_cgroup_pool_state **ret_pool,
 			   struct dmem_cgroup_pool_state **ret_limit_pool);
 void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size);
+int dmem_cgroup_try_pin(struct dmem_cgroup_pool_state *pool, u64 size);
+void dmem_cgroup_unpin(struct dmem_cgroup_pool_state *pool, u64 size);
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
 				      bool ignore_low, bool *ret_hit_low);
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 10b63433f0573..ec8b1ffec78de 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -147,6 +147,11 @@ static u64 get_resource_current(struct dmem_cgroup_pool_state *pool)
 	return pool ? page_counter_read(&pool->cnt) : 0;
 }
 
+static u64 get_resource_pinned(struct dmem_cgroup_pool_state *pool)
+{
+	return pool ? page_counter_pinned(&pool->cnt) : 0;
+}
+
 static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
 {
 	set_resource_min(rpool, 0);
@@ -270,7 +275,7 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 {
 	struct dmem_cgroup_pool_state *pool = test_pool;
 	struct page_counter *ctest;
-	u64 used, min, low;
+	u64 used, min, low, pinned;
 
 	/* Can always evict from current pool, despite limits */
 	if (limit_pool == test_pool)
@@ -296,16 +301,18 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 
 	ctest = &test_pool->cnt;
 
+	/* Protection is calculated without pinned memory */
 	dmem_cgroup_calculate_protection(limit_pool, test_pool);
 
 	used = page_counter_read(ctest);
-	min = READ_ONCE(ctest->emin);
+	pinned = page_counter_pinned(ctest);
+	min = READ_ONCE(ctest->emin) + pinned;
 
 	if (used <= min)
 		return false;
 
 	if (!ignore_low) {
-		low = READ_ONCE(ctest->elow);
+		low = READ_ONCE(ctest->elow) + pinned;
 		if (used > low)
 			return true;
 
@@ -641,6 +648,41 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
 }
 EXPORT_SYMBOL_GPL(dmem_cgroup_try_charge);
 
+/**
+ * dmem_cgroup_unpin() - Unpin from a pool.
+ * @pool: Pool to unpin.
+ * @size: Size to unpin.
+ *
+ * Undoes the effects of dmem_cgroup_try_pin().
+ * Must be called with the same @pool and @size as the matching
+ * dmem_cgroup_try_pin() call.
+ */
+void dmem_cgroup_unpin(struct dmem_cgroup_pool_state *pool, u64 size)
+{
+	if (pool)
+		page_counter_unpin(&pool->cnt, size);
+}
+EXPORT_SYMBOL_GPL(dmem_cgroup_unpin);
+
+/**
+ * dmem_cgroup_try_pin() - Try pinning an existing allocation to a region.
+ * @pool: dmem region to pin
+ * @size: Size (in bytes) to pin.
+ *
+ * This function pins in @pool for a size of @size bytes.
+ *
+ * If the function succeeds, the memory is successfully accounted as being pinned.
+ * The memory may not be uncharged before unpin is called.
+ *
+ * Return: 0 on success, -EAGAIN on hitting a limit, or a negative errno on failure.
+ */
+int dmem_cgroup_try_pin(struct dmem_cgroup_pool_state *pool, u64 size)
+{
+	return page_counter_try_pin(&pool->cnt, size) ? 0 : -EAGAIN;
+
+}
+EXPORT_SYMBOL_GPL(dmem_cgroup_try_pin);
+
 static int dmem_cgroup_region_capacity_show(struct seq_file *sf, void *v)
 {
 	struct dmem_cgroup_region *region;
@@ -756,6 +798,11 @@ static int dmem_cgroup_region_current_show(struct seq_file *sf, void *v)
 	return dmemcg_limit_show(sf, v, get_resource_current);
 }
 
+static int dmem_cgroup_region_pinned_show(struct seq_file *sf, void *v)
+{
+	return dmemcg_limit_show(sf, v, get_resource_pinned);
+}
+
 static int dmem_cgroup_region_min_show(struct seq_file *sf, void *v)
 {
 	return dmemcg_limit_show(sf, v, get_resource_min);
@@ -799,6 +846,10 @@ static struct cftype files[] = {
 		.name = "current",
 		.seq_show = dmem_cgroup_region_current_show,
 	},
+	{
+		.name = "pinned",
+		.seq_show = dmem_cgroup_region_pinned_show,
+	},
 	{
 		.name = "min",
 		.write = dmem_cgroup_region_min_write,
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC 3/3] drm/xe: Add DRM_XE_GEM_CREATE_FLAG_PINNED flag and implementation
  2025-08-19 11:49 [RFC 0/3] cgroups: Add support for pinned device memory Maarten Lankhorst
  2025-08-19 11:49 ` [RFC 1/3] page_counter: Allow for pinning some amount of memory Maarten Lankhorst
  2025-08-19 11:49 ` [RFC 2/3] cgroup/dmem: Implement pinning device memory Maarten Lankhorst
@ 2025-08-19 11:49 ` Maarten Lankhorst
  2025-08-19 16:22   ` Thomas Hellström
  2025-08-26 14:20 ` [RFC 0/3] cgroups: Add support for pinned device memory Michal Koutný
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Maarten Lankhorst @ 2025-08-19 11:49 UTC (permalink / raw)
  To: Lucas De Marchi, 'Thomas Hellström', Rodrigo Vivi,
	David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Natalie Vock, Tejun Heo, Johannes Weiner,
	'Michal Koutný', Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, 'Liam R . Howlett', Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Thomas Zimmermann
  Cc: Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

Add an option to pin memory through cgroup accounting.
A bo will be pinned for its entire lifetime, which allows buffers to be
pinned for dma-buf export without requiring the pinning to be done at
the dma-buf layer for all devices.

For now, only implement VRAM pinning. Dave Airlie has a series to
implement memcg accounting for the GPU, but that is not ready yet.
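For illustration, the argument constraints the new ioctl path enforces
for pinned BOs can be pre-checked in userspace roughly like this. The
flag values mirror the uapi additions in this series (treat them as
illustrative until the series lands), and the helper is a hypothetical
sketch, not part of any library:

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined by this series in xe_drm.h. */
#define DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING	(1u << 0)
#define DRM_XE_GEM_CREATE_FLAG_PINNED		(1u << 3)

static int popcount32(uint32_t v)
{
	int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/* Pre-check the constraints the ioctl enforces for PINNED BOs:
 * exactly one placement bit, no VM-local binding, and no deferred
 * backing. Returns 1 if the arguments would pass these checks. */
static int pinned_args_valid(uint32_t flags, uint32_t placement,
			     uint32_t vm_id)
{
	if (!(flags & DRM_XE_GEM_CREATE_FLAG_PINNED))
		return 1; /* constraints only apply when pinning */
	if (popcount32(placement) != 1)
		return 0; /* only a single placement is allowed */
	if (vm_id)
		return 0; /* meant for exporting, no VM-local BO */
	if (flags & DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING)
		return 0; /* backing must exist at creation time */
	return 1;
}
```

Note that the kernel additionally requires the matching cgroup
controller (dmem for VRAM) to be compiled in.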

Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
---
 drivers/gpu/drm/xe/xe_bo.c      | 66 ++++++++++++++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_dma_buf.c | 10 ++++-
 include/uapi/drm/xe_drm.h       | 10 ++++-
 3 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 6fea39842e1e6..4095e6bd04ea9 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -5,6 +5,7 @@
 
 #include "xe_bo.h"
 
+#include <linux/cgroup_dmem.h>
 #include <linux/dma-buf.h>
 #include <linux/nospec.h>
 
@@ -208,7 +209,8 @@ static bool force_contiguous(u32 bo_flags)
 	 * must be contiguous, also only contiguous BOs support xe_bo_vmap.
 	 */
 	return bo_flags & XE_BO_FLAG_NEEDS_CPU_ACCESS &&
-	       bo_flags & XE_BO_FLAG_PINNED;
+	       bo_flags & XE_BO_FLAG_PINNED &&
+	       !(bo_flags & XE_BO_FLAG_USER);
 }
 
 static void add_vram(struct xe_device *xe, struct xe_bo *bo,
@@ -1697,6 +1699,16 @@ static void xe_gem_object_free(struct drm_gem_object *obj)
 	ttm_bo_put(container_of(obj, struct ttm_buffer_object, base));
 }
 
+static void xe_bo_unpin_user(struct xe_bo *bo)
+{
+	xe_bo_unpin_external(bo);
+
+	if (bo->flags & XE_BO_FLAG_SYSTEM)
+		WARN_ON(1);
+	else
+		dmem_cgroup_unpin(bo->ttm.resource->css, xe_bo_size(bo));
+}
+
 static void xe_gem_object_close(struct drm_gem_object *obj,
 				struct drm_file *file_priv)
 {
@@ -1708,6 +1720,10 @@ static void xe_gem_object_close(struct drm_gem_object *obj,
 		xe_bo_lock(bo, false);
 		ttm_bo_set_bulk_move(&bo->ttm, NULL);
 		xe_bo_unlock(bo);
+	} else if (bo->flags & XE_BO_FLAG_PINNED) {
+		xe_bo_lock(bo, false);
+		xe_bo_unpin_user(bo);
+		xe_bo_unlock(bo);
 	}
 }
 
@@ -2128,8 +2144,27 @@ struct xe_bo *xe_bo_create_user(struct xe_device *xe, struct xe_tile *tile,
 	struct xe_bo *bo = __xe_bo_create_locked(xe, tile, vm, size, 0, ~0ULL,
 						 cpu_caching, ttm_bo_type_device,
 						 flags | XE_BO_FLAG_USER, 0);
-	if (!IS_ERR(bo))
+	if (!IS_ERR(bo)) {
+		int ret = 0;
+
+		if (bo->flags & XE_BO_FLAG_PINNED) {
+			if (bo->flags & XE_BO_FLAG_SYSTEM) {
+				ret = -ENOSYS; // TODO
+			} else {
+				ret = dmem_cgroup_try_pin(bo->ttm.resource->css, size);
+			}
+			if (!ret)
+				ret = xe_bo_pin_external(bo);
+			else if (ret == -EAGAIN)
+				ret = -ENOSPC;
+		}
+
 		xe_bo_unlock_vm_held(bo);
+		if (ret) {
+			xe_bo_put(bo);
+			return ERR_PTR(ret);
+		}
+	}
 
 	return bo;
 }
@@ -2745,6 +2780,28 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
 			 args->cpu_caching == DRM_XE_GEM_CPU_CACHING_WB))
 		return -EINVAL;
 
+	if (args->flags & DRM_XE_GEM_CREATE_FLAG_PINNED) {
+		bool pinned_flag = true;
+		/* Only allow a single placement for pinning */
+		if (XE_IOCTL_DBG(xe, pinned_flag && hweight32(args->placement) != 1))
+			return -EINVAL;
+
+		/* Meant for exporting, do not allow a VM-local BO */
+		if (XE_IOCTL_DBG(xe, pinned_flag && args->vm_id))
+			return -EINVAL;
+
+		/* Similarly, force fail at creation time for now. We may relax this requirement later */
+		if (XE_IOCTL_DBG(xe, pinned_flag && args->flags & DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING))
+			return -EINVAL;
+
+		/* Require the appropriate cgroups to be enabled. */
+		if (XE_IOCTL_DBG(xe, pinned_flag && !IS_ENABLED(CONFIG_CGROUP_DMEM) && bo_flags & XE_BO_FLAG_VRAM_MASK) ||
+		    XE_IOCTL_DBG(xe, pinned_flag && !IS_ENABLED(CONFIG_MEMCG) && bo_flags & XE_BO_FLAG_SYSTEM))
+			return -EINVAL;
+
+		bo_flags |= XE_BO_FLAG_PINNED;
+	}
+
 	if (args->vm_id) {
 		vm = xe_vm_lookup(xef, args->vm_id);
 		if (XE_IOCTL_DBG(xe, !vm))
@@ -2790,6 +2847,11 @@ int xe_gem_create_ioctl(struct drm_device *dev, void *data,
 		__xe_bo_unset_bulk_move(bo);
 		xe_vm_unlock(vm);
 	}
+	if (bo->flags & XE_BO_FLAG_PINNED) {
+		xe_bo_lock(bo, false);
+		xe_bo_unpin_user(bo);
+		xe_bo_unlock(bo);
+	}
 out_put:
 	xe_bo_put(bo);
 out_vm:
diff --git a/drivers/gpu/drm/xe/xe_dma_buf.c b/drivers/gpu/drm/xe/xe_dma_buf.c
index 346f857f38374..6719f4552ad37 100644
--- a/drivers/gpu/drm/xe/xe_dma_buf.c
+++ b/drivers/gpu/drm/xe/xe_dma_buf.c
@@ -53,6 +53,11 @@ static int xe_dma_buf_pin(struct dma_buf_attachment *attach)
 	struct xe_device *xe = xe_bo_device(bo);
 	int ret;
 
+	if (bo->flags & XE_BO_FLAG_PINNED) {
+		ttm_bo_pin(&bo->ttm);
+		return 0;
+	}
+
 	/*
 	 * For now only support pinning in TT memory, for two reasons:
 	 * 1) Avoid pinning in a placement not accessible to some importers.
@@ -83,7 +88,10 @@ static void xe_dma_buf_unpin(struct dma_buf_attachment *attach)
 	struct drm_gem_object *obj = attach->dmabuf->priv;
 	struct xe_bo *bo = gem_to_xe_bo(obj);
 
-	xe_bo_unpin_external(bo);
+	if (bo->flags & XE_BO_FLAG_PINNED)
+		ttm_bo_unpin(&bo->ttm);
+	else
+		xe_bo_unpin_external(bo);
 }
 
 static struct sg_table *xe_dma_buf_map(struct dma_buf_attachment *attach,
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index c721e130c1d2d..3184fa38ce17e 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -765,12 +765,15 @@ struct drm_xe_device_query {
  *    until the object is either bound to a virtual memory region via
  *    VM_BIND or accessed by the CPU. As a result, no backing memory is
  *    reserved at the time of GEM object creation.
- *  - %DRM_XE_GEM_CREATE_FLAG_SCANOUT
+ *  - %DRM_XE_GEM_CREATE_FLAG_SCANOUT - GEM object will be used as a
+ *    display framebuffer.
  *  - %DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM - When using VRAM as a
  *    possible placement, ensure that the corresponding VRAM allocation
  *    will always use the CPU accessible part of VRAM. This is important
  *    for small-bar systems (on full-bar systems this gets turned into a
  *    noop).
+ *  - %DRM_XE_GEM_CREATE_FLAG_PINNED - Pin the backing memory permanently
+ *    on allocation, if within cgroup limits.
  *    Note1: System memory can be used as an extra placement if the kernel
  *    should spill the allocation to system memory, if space can't be made
  *    available in the CPU accessible part of VRAM (giving the same
@@ -781,6 +784,10 @@ struct drm_xe_device_query {
  *    need to use VRAM for display surfaces, therefore the kernel requires
  *    setting this flag for such objects, otherwise an error is thrown on
  *    small-bar systems.
+ *    Note3: %DRM_XE_GEM_CREATE_FLAG_PINNED requires the BO to have only
+ *    a single placement and no vm_id, requires (device) memory cgroups
+ *    to be enabled, and is incompatible with the %DEFER_BACKING and
+ *    %NEEDS_VISIBLE_VRAM flags.
  *
  * @cpu_caching supports the following values:
  *  - %DRM_XE_GEM_CPU_CACHING_WB - Allocate the pages with write-back
@@ -827,6 +834,7 @@ struct drm_xe_gem_create {
 #define DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING		(1 << 0)
 #define DRM_XE_GEM_CREATE_FLAG_SCANOUT			(1 << 1)
 #define DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM	(1 << 2)
+#define DRM_XE_GEM_CREATE_FLAG_PINNED			(1 << 3)
 	/**
 	 * @flags: Flags, currently a mask of memory instances of where BO can
 	 * be placed
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC 3/3] drm/xe: Add DRM_XE_GEM_CREATE_FLAG_PINNED flag and implementation
  2025-08-19 11:49 ` [RFC 3/3] drm/xe: Add DRM_XE_GEM_CREATE_FLAG_PINNED flag and implementation Maarten Lankhorst
@ 2025-08-19 16:22   ` Thomas Hellström
  2025-08-21 11:41     ` Maarten Lankhorst
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Hellström @ 2025-08-19 16:22 UTC (permalink / raw)
  To: Maarten Lankhorst, Lucas De Marchi, Rodrigo Vivi, David Airlie,
	Simona Vetter, Maxime Ripard, Natalie Vock, Tejun Heo,
	Johannes Weiner, 'Michal Koutný', Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, 'Liam R . Howlett',
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Thomas Zimmermann
  Cc: Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

Hi, Maarten,

On Tue, 2025-08-19 at 13:49 +0200, Maarten Lankhorst wrote:
> Add an option to pin memory through cgroup accounting.
> A bo will be pinned for its entire lifetime, and this allows buffers
> that are pinned for dma-buf export without requiring the pinning to
> be
> done at the dma-buf layer for all devices.
> 
> For now only implement VRAM pinning. Dave Airlie has a series to
> implement
> memcg accounting for the GPU but that is not ready yet.

Previous discussions around this have favoured a UAPI where we pin a
gpu-vm range, with a pin at mapping time or dma-buf pin time where
required. This allows for dynamic pinning and unpinning, and would
avoid having separate pinning interfaces for bos and userptr.

In particular if we don't know at bo creation time which buffer objects
will be exported with a method requiring pinning, how would UMD deduce
what buffer objects to pin?

Thanks,
Thomas



> 
> Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
> ---
>  drivers/gpu/drm/xe/xe_bo.c      | 66
> ++++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/xe/xe_dma_buf.c | 10 ++++-
>  include/uapi/drm/xe_drm.h       | 10 ++++-
>  3 files changed, 82 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index 6fea39842e1e6..4095e6bd04ea9 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -5,6 +5,7 @@
>  
>  #include "xe_bo.h"
>  
> +#include <linux/cgroup_dmem.h>
>  #include <linux/dma-buf.h>
>  #include <linux/nospec.h>
>  
> @@ -208,7 +209,8 @@ static bool force_contiguous(u32 bo_flags)
>  	 * must be contiguous, also only contiguous BOs support
> xe_bo_vmap.
>  	 */
>  	return bo_flags & XE_BO_FLAG_NEEDS_CPU_ACCESS &&
> -	       bo_flags & XE_BO_FLAG_PINNED;
> +	       bo_flags & XE_BO_FLAG_PINNED &&
> +	       !(bo_flags & XE_BO_FLAG_USER);
>  }
>  
>  static void add_vram(struct xe_device *xe, struct xe_bo *bo,
> @@ -1697,6 +1699,16 @@ static void xe_gem_object_free(struct
> drm_gem_object *obj)
>  	ttm_bo_put(container_of(obj, struct ttm_buffer_object,
> base));
>  }
>  
> +static void xe_bo_unpin_user(struct xe_bo *bo)
> +{
> +	xe_bo_unpin_external(bo);
> +
> +	if (bo->flags & XE_BO_FLAG_SYSTEM)
> +		WARN_ON(1);
> +	else
> +		dmem_cgroup_unpin(bo->ttm.resource->css,
> xe_bo_size(bo));
> +}
> +
>  static void xe_gem_object_close(struct drm_gem_object *obj,
>  				struct drm_file *file_priv)
>  {
> @@ -1708,6 +1720,10 @@ static void xe_gem_object_close(struct
> drm_gem_object *obj,
>  		xe_bo_lock(bo, false);
>  		ttm_bo_set_bulk_move(&bo->ttm, NULL);
>  		xe_bo_unlock(bo);
> +	} else if (bo->flags & XE_BO_FLAG_PINNED) {
> +		xe_bo_lock(bo, false);
> +		xe_bo_unpin_user(bo);
> +		xe_bo_unlock(bo);
>  	}
>  }
>  
> @@ -2128,8 +2144,27 @@ struct xe_bo *xe_bo_create_user(struct
> xe_device *xe, struct xe_tile *tile,
>  	struct xe_bo *bo = __xe_bo_create_locked(xe, tile, vm, size,
> 0, ~0ULL,
>  						 cpu_caching,
> ttm_bo_type_device,
>  						 flags |
> XE_BO_FLAG_USER, 0);
> -	if (!IS_ERR(bo))
> +	if (!IS_ERR(bo)) {
> +		int ret = 0;
> +
> +		if (bo->flags & XE_BO_FLAG_PINNED) {
> +			if (bo->flags & XE_BO_FLAG_SYSTEM) {
> +				ret = -ENOSYS; // TODO
> +			} else {
> +				ret = dmem_cgroup_try_pin(bo-
> >ttm.resource->css, size);
> +			}
> +			if (!ret)
> +				ret = xe_bo_pin_external(bo);
> +			else if (ret == -EAGAIN)
> +				ret = -ENOSPC;
> +		}
> +
>  		xe_bo_unlock_vm_held(bo);
> +		if (ret) {
> +			xe_bo_put(bo);
> +			return ERR_PTR(ret);
> +		}
> +	}
>  
>  	return bo;
>  }
> @@ -2745,6 +2780,28 @@ int xe_gem_create_ioctl(struct drm_device
> *dev, void *data,
>  			 args->cpu_caching ==
> DRM_XE_GEM_CPU_CACHING_WB))
>  		return -EINVAL;
>  
> +	if (XE_IOCTL_DBG(xe, args->flags &
> DRM_XE_GEM_CREATE_FLAG_PINNED)) {
> +		bool pinned_flag = true;
> +		/* Only allow a single placement for pinning */
> +		if (XE_IOCTL_DBG(xe, pinned_flag && hweight32(args-
> >placement) != 1))
> +			return -EINVAL;
> +
> +		/* Meant for exporting, do not allow a VM-local BO
> */
> +		if (XE_IOCTL_DBG(xe, pinned_flag && args->vm_id))
> +			return -EINVAL;
> +
> +		/* Similarly, force fail at creation time for now.
> We may relax this requirement later */
> +		if (XE_IOCTL_DBG(xe, pinned_flag && args->flags &
> DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING))
> +			return -EINVAL;
> +
> +		/* Require the appropriate cgroups to be enabled. */
> +		if (XE_IOCTL_DBG(xe, pinned_flag &&
> !IS_ENABLED(CONFIG_CGROUP_DMEM) && bo_flags & XE_BO_FLAG_VRAM_MASK)
> ||
> +		    XE_IOCTL_DBG(xe, pinned_flag &&
> !IS_ENABLED(CONFIG_MEMCG) && bo_flags & XE_BO_FLAG_SYSTEM))
> +			return -EINVAL;
> +
> +		bo_flags |= XE_BO_FLAG_PINNED;
> +	}
> +
>  	if (args->vm_id) {
>  		vm = xe_vm_lookup(xef, args->vm_id);
>  		if (XE_IOCTL_DBG(xe, !vm))
> @@ -2790,6 +2847,11 @@ int xe_gem_create_ioctl(struct drm_device
> *dev, void *data,
>  		__xe_bo_unset_bulk_move(bo);
>  		xe_vm_unlock(vm);
>  	}
> +	if (bo->flags & XE_BO_FLAG_PINNED) {
> +		xe_bo_lock(bo, false);
> +		xe_bo_unpin_user(bo);
> +		xe_bo_unlock(bo);
> +	}
>  out_put:
>  	xe_bo_put(bo);
>  out_vm:
> diff --git a/drivers/gpu/drm/xe/xe_dma_buf.c
> b/drivers/gpu/drm/xe/xe_dma_buf.c
> index 346f857f38374..6719f4552ad37 100644
> --- a/drivers/gpu/drm/xe/xe_dma_buf.c
> +++ b/drivers/gpu/drm/xe/xe_dma_buf.c
> @@ -53,6 +53,11 @@ static int xe_dma_buf_pin(struct
> dma_buf_attachment *attach)
>  	struct xe_device *xe = xe_bo_device(bo);
>  	int ret;
>  
> +	if (bo->flags & XE_BO_FLAG_PINNED) {
> +		ttm_bo_pin(&bo->ttm);
> +		return 0;
> +	}
> +
>  	/*
>  	 * For now only support pinning in TT memory, for two
> reasons:
>  	 * 1) Avoid pinning in a placement not accessible to some
> importers.
> @@ -83,7 +88,10 @@ static void xe_dma_buf_unpin(struct
> dma_buf_attachment *attach)
>  	struct drm_gem_object *obj = attach->dmabuf->priv;
>  	struct xe_bo *bo = gem_to_xe_bo(obj);
>  
> -	xe_bo_unpin_external(bo);
> +	if (bo->flags & XE_BO_FLAG_PINNED)
> +		ttm_bo_unpin(&bo->ttm);
> +	else
> +		xe_bo_unpin_external(bo);
>  }
>  
>  static struct sg_table *xe_dma_buf_map(struct dma_buf_attachment
> *attach,
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index c721e130c1d2d..3184fa38ce17e 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -765,12 +765,15 @@ struct drm_xe_device_query {
>   *    until the object is either bound to a virtual memory region
> via
>   *    VM_BIND or accessed by the CPU. As a result, no backing memory
> is
>   *    reserved at the time of GEM object creation.
> - *  - %DRM_XE_GEM_CREATE_FLAG_SCANOUT
> + *  - %DRM_XE_GEM_CREATE_FLAG_SCANOUT - GEM object will be used
> + *    display framebuffer.
>   *  - %DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM - When using VRAM as a
>   *    possible placement, ensure that the corresponding VRAM allocation
>   *    will always use the CPU accessible part of VRAM. This is important
>   *    for small-bar systems (on full-bar systems this gets turned into a
>   *    noop).
> + *  - %DRM_XE_GEM_CREATE_FLAG_PINNED - Pin the backing memory permanently
> + *    on allocation, if within cgroup limits.
>   *    Note1: System memory can be used as an extra placement if the kernel
>   *    should spill the allocation to system memory, if space can't be made
>   *    available in the CPU accessible part of VRAM (giving the same
> @@ -781,6 +784,10 @@ struct drm_xe_device_query {
>   *    need to use VRAM for display surfaces, therefore the kernel requires
>   *    setting this flag for such objects, otherwise an error is thrown on
>   *    small-bar systems.
> + *    Note3: %DRM_XE_GEM_CREATE_FLAG_PINNED requires the BO to have only
> + *    a single placement, no vm_id, requires (device) memory cgroups enabled,
> + *    and is incompatible with the %DEFER_BACKING and %NEEDS_VISIBLE_VRAM
> + *    flags.
>   *
>   * @cpu_caching supports the following values:
>   *  - %DRM_XE_GEM_CPU_CACHING_WB - Allocate the pages with write-back
> @@ -827,6 +834,7 @@ struct drm_xe_gem_create {
>  #define DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING		(1 << 0)
>  #define DRM_XE_GEM_CREATE_FLAG_SCANOUT			(1 << 1)
>  #define DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM	(1 << 2)
> +#define DRM_XE_GEM_CREATE_FLAG_PINNED			(1 << 3)
>  	/**
>  	 * @flags: Flags, currently a mask of memory instances of where BO can
>  	 * be placed


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 3/3] drm/xe: Add DRM_XE_GEM_CREATE_FLAG_PINNED flag and implementation
  2025-08-19 16:22   ` Thomas Hellström
@ 2025-08-21 11:41     ` Maarten Lankhorst
  0 siblings, 0 replies; 16+ messages in thread
From: Maarten Lankhorst @ 2025-08-21 11:41 UTC (permalink / raw)
  To: Thomas Hellström, Lucas De Marchi, Rodrigo Vivi,
	David Airlie, Simona Vetter, Maxime Ripard, Natalie Vock,
	Tejun Heo, Johannes Weiner, 'Michal Koutný',
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	'Liam R . Howlett', Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Zimmermann
  Cc: Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

Hey,

Den 2025-08-19 kl. 18:22, skrev Thomas Hellström:
> Hi, Maarten,
> 
> On Tue, 2025-08-19 at 13:49 +0200, Maarten Lankhorst wrote:
>> Add an option to pin memory through the science of cgroup accounting.
>> A bo will be pinned for its entire lifetime, and this allows buffers
>> to be pinned for dma-buf export without requiring the pinning to be
>> done at the dma-buf layer for all devices.
>>
>> For now only implement VRAM pinning. Dave Airlie has a series to
>> implement memcg accounting for the GPU, but that is not ready yet.
> 
> Previous discussions around this have favoured a UAPI where we pin a
> gpu-vm range, with a pin at mapping time, or dma-buf pin time where
> required, this allows for dynamic pinning and unpinning, and would
> avoid having separate pinning interfaces for bos and userptr.
> 
> In particular if we don't know at bo creation time which buffer objects
> will be exported with a method requiring pinning, how would UMD deduce
> what buffer objects to pin?
> 
> Thanks,
> Thomas

For discussion purposes: it seemed compute preferred pinning at allocation time,
so I wanted to propose that solution, since it's easiest to implement.

I will change it for the next version, but I'm very interested in feedback
on the adjustments I made in cgroups. Those are the ones that will have the
most impact, and they change how the effective min/low values are calculated
when pinned memory is involved.

Kind regards,
~Maarten

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/3] cgroups: Add support for pinned device memory
  2025-08-19 11:49 [RFC 0/3] cgroups: Add support for pinned device memory Maarten Lankhorst
                   ` (2 preceding siblings ...)
  2025-08-19 11:49 ` [RFC 3/3] drm/xe: Add DRM_XE_GEM_CREATE_FLAG_PINNED flag and implementation Maarten Lankhorst
@ 2025-08-26 14:20 ` Michal Koutný
  2025-08-28 20:58   ` Maarten Lankhorst
  2025-09-01 12:25 ` David Hildenbrand
  2025-09-01 12:45 ` Natalie Vock
  5 siblings, 1 reply; 16+ messages in thread
From: Michal Koutný @ 2025-08-26 14:20 UTC (permalink / raw)
  To: Maarten Lankhorst
  Cc: Lucas De Marchi, 'Thomas Hellström', Rodrigo Vivi,
	David Airlie, Simona Vetter, Maxime Ripard, Natalie Vock,
	Tejun Heo, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, 'Liam R . Howlett', Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Thomas Zimmermann,
	Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 1136 bytes --]

Hello Maarten.

On Tue, Aug 19, 2025 at 01:49:33PM +0200, Maarten Lankhorst <dev@lankhorst.se> wrote:
> Implementation details:
> 
> For each cgroup up to the root cgroup, the 'min' limit is checked
> against the currently effectively pinned value. If the value would go
> above 'min', the pinning attempt is rejected.

How is pinning different from setting a 'min' limit (from a user
perspective)?

> 
> Pinned memory is handled slightly differently and affects the
> calculation of the effective min/low values. Pinned memory is
> subtracted from both, and needs to be added back afterwards when
> calculating.
> 
> This is because by increasing the amount of pinned memory, the amount
> of free min/low memory decreases for all cgroups that are part of the
> hierarchy.

What is supposed to happen with pinned memory after cgroup removal?
I find the page_counter changes a little bit complex without an
understanding of the difference between min and pinned. Should this be
conceptually similar to memory.stat:unevictable? Or rather mlock(2)? So
far, neither of those needed interaction with min/low values (in memcg).

Thanks,
Michal


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/3] cgroups: Add support for pinned device memory
  2025-08-26 14:20 ` [RFC 0/3] cgroups: Add support for pinned device memory Michal Koutný
@ 2025-08-28 20:58   ` Maarten Lankhorst
  0 siblings, 0 replies; 16+ messages in thread
From: Maarten Lankhorst @ 2025-08-28 20:58 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Lucas De Marchi, 'Thomas Hellström', Rodrigo Vivi,
	David Airlie, Simona Vetter, Maxime Ripard, Natalie Vock,
	Tejun Heo, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, 'Liam R . Howlett', Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Thomas Zimmermann,
	Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

Hey,

Den 2025-08-26 kl. 16:20, skrev Michal Koutný:
> Hello Maarten.
> 
> On Tue, Aug 19, 2025 at 01:49:33PM +0200, Maarten Lankhorst <dev@lankhorst.se> wrote:
>> Implementation details:
>>
>> For each cgroup up to the root cgroup, the 'min' limit is checked
>> against the currently effectively pinned value. If the value would go
>> above 'min', the pinning attempt is rejected.
> 
> How is pinning different from setting a 'min' limit (from a user
> perspective)?
It's related; in fact, you have to set the 'min' limit first.

The 'pinned' allows you to pick /which/ memory falls under the 'min' limit.

>>
>> Pinned memory is handled slightly differently and affects the
>> calculation of the effective min/low values. Pinned memory is
>> subtracted from both, and needs to be added back afterwards when
>> calculating.
>>
>> This is because by increasing the amount of pinned memory, the amount
>> of free min/low memory decreases for all cgroups that are part of the
>> hierarchy.
> 
> What is supposed to happen with pinned memory after cgroup removal?
I think for accounting purposes pinned memory stays pinned,
otherwise the idea of pinning is lost. However, when you kill all
processes in the cgroup, that should resolve itself eventually.

> I find the page_counter changes little bit complex without understanding
> of the difference between min and pinned. Should this be conceptually
> similar to memory.stat:unevictable? Or rather mlock(2)? So far neither
> of those needed interaction with min/low values (in memcg).
You could in theory implement mlockall using the 'min' values too.

The page counter changes implement the following:

Lets say you have this tree with 'min' values.

      / '5' A
X'6' -- '5' B
      \ '5' C

Effective min without pinned pages:
      / '2' A
X'6' -- '2' B
      \ '2' C

Now 'B' pins 3 pages:

Effective min:
         / '1' A
X'3+3p' -- '1' B (1 + 3 pinned pages makes effective min 4)
         \ '1' C

The same applies to the effective 'low' calculations.

Kind regards,
~Maarten


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/3] cgroups: Add support for pinned device memory
  2025-08-19 11:49 [RFC 0/3] cgroups: Add support for pinned device memory Maarten Lankhorst
                   ` (3 preceding siblings ...)
  2025-08-26 14:20 ` [RFC 0/3] cgroups: Add support for pinned device memory Michal Koutný
@ 2025-09-01 12:25 ` David Hildenbrand
  2025-09-01 18:16   ` Maarten Lankhorst
  2025-09-01 12:45 ` Natalie Vock
  5 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2025-09-01 12:25 UTC (permalink / raw)
  To: Maarten Lankhorst, Lucas De Marchi,
	'Thomas Hellström', Rodrigo Vivi, David Airlie,
	Simona Vetter, Maxime Ripard, Natalie Vock, Tejun Heo,
	Johannes Weiner, 'Michal Koutný', Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Lorenzo Stoakes, 'Liam R . Howlett', Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Thomas Zimmermann
  Cc: Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

On 19.08.25 13:49, Maarten Lankhorst wrote:
> When exporting dma-bufs to other devices, even when some drivers allow
> the use of move_notify, performance will degrade severely when
> eviction happens.
> 
> A particular example where this can happen is a multi-card setup,
> where PCI-E peer-to-peer is used to avoid going through system memory.
> 
> If the buffer is evicted to system memory, not only is the evicting GPU
> where the buffer resided affected, but it will also stall the GPU that
> is waiting on the buffer.
> 
> It also makes sense for long-running jobs not to be preempted by having
> their buffers evicted, so it will make sense to have the ability to pin
> system memory too.
> 
> This is dependent on patches by Dave Airlie, so it's not part of this
> series yet. But I'm planning on extending pinning to the memory cgroup
> controller in the future to handle this case.
> 
> Implementation details:
> 
> For each cgroup up until the root cgroup, the 'min' limit is checked
> against currently effectively pinned value. If the value will go above
> 'min', the pinning attempt is rejected.
> 
> Pinned memory is handled slightly different and affects calculating
> effective min/low values. Pinned memory is subtracted from both,
> and needs to be added afterwards when calculating.

The term "pinning" is overloaded, and frequently we refer to 
pin_user_pages() and friends.

So I'm wondering if there is an alternative term to describe what you 
want to achieve.

Is it something like "unevictable" ?

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/3] cgroups: Add support for pinned device memory
  2025-08-19 11:49 [RFC 0/3] cgroups: Add support for pinned device memory Maarten Lankhorst
                   ` (4 preceding siblings ...)
  2025-09-01 12:25 ` David Hildenbrand
@ 2025-09-01 12:45 ` Natalie Vock
  2025-09-01 14:37   ` Thomas Hellström
  2025-09-01 18:22   ` Maarten Lankhorst
  5 siblings, 2 replies; 16+ messages in thread
From: Natalie Vock @ 2025-09-01 12:45 UTC (permalink / raw)
  To: Maarten Lankhorst, Lucas De Marchi,
	'Thomas Hellström', Rodrigo Vivi, David Airlie,
	Simona Vetter, Maxime Ripard, Tejun Heo, Johannes Weiner,
	'Michal Koutný', Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, 'Liam R . Howlett', Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Thomas Zimmermann
  Cc: Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

Hi,

On 8/19/25 13:49, Maarten Lankhorst wrote:
> When exporting dma-bufs to other devices, even when some drivers allow
> the use of move_notify, performance will degrade severely when
> eviction happens.
> 
> A particular example where this can happen is a multi-card setup,
> where PCI-E peer-to-peer is used to avoid going through system memory.
> 
> If the buffer is evicted to system memory, not only is the evicting GPU
> where the buffer resided affected, but it will also stall the GPU that
> is waiting on the buffer.
> 
> It also makes sense for long-running jobs not to be preempted by having
> their buffers evicted, so it will make sense to have the ability to pin
> system memory too.
> 
> This is dependent on patches by Dave Airlie, so it's not part of this
> series yet. But I'm planning on extending pinning to the memory cgroup
> controller in the future to handle this case.
> 
> Implementation details:
> 
> For each cgroup up to the root cgroup, the 'min' limit is checked
> against the currently effectively pinned value. If the value would go
> above 'min', the pinning attempt is rejected.

Why do you want to reject pins in this case? What happens in desktop 
usecases (e.g. PRIME buffer sharing)? AFAIU, you kind of need to be able 
to pin buffers and export them to other devices for that whole thing to 
work, right? If the user doesn't explicitly set a min value, wouldn't 
the value being zero mean any pins will be rejected (and thus PRIME 
would break)?

If your objective is to prevent pinned buffers from being evicted, 
perhaps you could instead make TTM try to avoid evicting pinned buffers 
and prefer unpinned buffers as long as there are unpinned buffers to 
evict? As long as the total amount of pinned memory stays below min, no 
pinned buffers should get evicted with that either.

Best,
Natalie

> 
> Pinned memory is handled slightly differently and affects the
> calculation of the effective min/low values. Pinned memory is
> subtracted from both, and needs to be added back afterwards when
> calculating.
> 
> This is because by increasing the amount of pinned memory, the amount
> of free min/low memory decreases for all cgroups that are part of the
> hierarchy.
> 
> Maarten Lankhorst (3):
>    page_counter: Allow for pinning some amount of memory
>    cgroup/dmem: Implement pinning device memory
>    drm/xe: Add DRM_XE_GEM_CREATE_FLAG_PINNED flag and implementation
> 
>   drivers/gpu/drm/xe/xe_bo.c      | 66 +++++++++++++++++++++-
>   drivers/gpu/drm/xe/xe_dma_buf.c | 10 +++-
>   include/linux/cgroup_dmem.h     |  2 +
>   include/linux/page_counter.h    |  8 +++
>   include/uapi/drm/xe_drm.h       | 10 +++-
>   kernel/cgroup/dmem.c            | 57 ++++++++++++++++++-
>   mm/page_counter.c               | 98 ++++++++++++++++++++++++++++++---
>   7 files changed, 237 insertions(+), 14 deletions(-)
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/3] cgroups: Add support for pinned device memory
  2025-09-01 12:45 ` Natalie Vock
@ 2025-09-01 14:37   ` Thomas Hellström
  2025-09-01 18:22   ` Maarten Lankhorst
  1 sibling, 0 replies; 16+ messages in thread
From: Thomas Hellström @ 2025-09-01 14:37 UTC (permalink / raw)
  To: Natalie Vock, Maarten Lankhorst, Lucas De Marchi, Rodrigo Vivi,
	David Airlie, Simona Vetter, Maxime Ripard, Tejun Heo,
	Johannes Weiner, 'Michal Koutný', Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, 'Liam R . Howlett',
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Thomas Zimmermann
  Cc: Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

Hi,

On Mon, 2025-09-01 at 14:45 +0200, Natalie Vock wrote:
> Hi,
> 
> On 8/19/25 13:49, Maarten Lankhorst wrote:
> > When exporting dma-bufs to other devices, even when some drivers
> > allow the use of move_notify, performance will degrade severely when
> > eviction happens.
> > 
> > A particular example where this can happen is a multi-card setup,
> > where PCI-E peer-to-peer is used to avoid going through system
> > memory.
> > 
> > If the buffer is evicted to system memory, not only is the evicting
> > GPU where the buffer resided affected, but it will also stall the
> > GPU that is waiting on the buffer.
> > 
> > It also makes sense for long-running jobs not to be preempted by
> > having their buffers evicted, so it will make sense to have the
> > ability to pin system memory too.
> > 
> > This is dependent on patches by Dave Airlie, so it's not part of
> > this series yet. But I'm planning on extending pinning to the memory
> > cgroup controller in the future to handle this case.
> > 
> > Implementation details:
> > 
> > For each cgroup up to the root cgroup, the 'min' limit is checked
> > against the currently effectively pinned value. If the value would
> > go above 'min', the pinning attempt is rejected.
> 
> Why do you want to reject pins in this case? What happens in desktop
> usecases (e.g. PRIME buffer sharing)? AFAIU, you kind of need to be
> able to pin buffers and export them to other devices for that whole
> thing to work, right? If the user doesn't explicitly set a min value,
> wouldn't the value being zero mean any pins will be rejected (and thus
> PRIME would break)?

That's really the point. If an unprivileged malicious process is
allowed to pin arbitrary amounts of memory, that's a DoS vector.

However, drivers that allow unlimited pinning today need to take care
when implementing restrictions to avoid regressions, perhaps by
adding this behind a config option.

That said, IMO dma-buf clients should implement move_notify() whenever
possible to provide an option to avoid pinning unless necessary.

/Thomas



> 
> If your objective is to prevent pinned buffers from being evicted,
> perhaps you could instead make TTM try to avoid evicting pinned
> buffers and prefer unpinned buffers as long as there are unpinned
> buffers to evict? As long as the total amount of pinned memory stays
> below min, no pinned buffers should get evicted with that either.


> 
> Best,
> Natalie
> 
> > 
> > Pinned memory is handled slightly differently and affects the
> > calculation of the effective min/low values. Pinned memory is
> > subtracted from both, and needs to be added back afterwards when
> > calculating.
> > 
> > This is because by increasing the amount of pinned memory, the
> > amount of free min/low memory decreases for all cgroups that are
> > part of the hierarchy.
> > 
> > Maarten Lankhorst (3):
> >    page_counter: Allow for pinning some amount of memory
> >    cgroup/dmem: Implement pinning device memory
> >    drm/xe: Add DRM_XE_GEM_CREATE_FLAG_PINNED flag and
> > implementation
> > 
> >   drivers/gpu/drm/xe/xe_bo.c      | 66 +++++++++++++++++++++-
> >   drivers/gpu/drm/xe/xe_dma_buf.c | 10 +++-
> >   include/linux/cgroup_dmem.h     |  2 +
> >   include/linux/page_counter.h    |  8 +++
> >   include/uapi/drm/xe_drm.h       | 10 +++-
> >   kernel/cgroup/dmem.c            | 57 ++++++++++++++++++-
> >   mm/page_counter.c               | 98
> > ++++++++++++++++++++++++++++++---
> >   7 files changed, 237 insertions(+), 14 deletions(-)
> > 
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/3] cgroups: Add support for pinned device memory
  2025-09-01 12:25 ` David Hildenbrand
@ 2025-09-01 18:16   ` Maarten Lankhorst
  2025-09-01 18:21     ` Thomas Hellström
  0 siblings, 1 reply; 16+ messages in thread
From: Maarten Lankhorst @ 2025-09-01 18:16 UTC (permalink / raw)
  To: David Hildenbrand, Lucas De Marchi,
	'Thomas Hellström', Rodrigo Vivi, David Airlie,
	Simona Vetter, Maxime Ripard, Natalie Vock, Tejun Heo,
	Johannes Weiner, 'Michal Koutný', Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Lorenzo Stoakes, 'Liam R . Howlett', Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Thomas Zimmermann
  Cc: Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

Hello David,

Den 2025-09-01 kl. 14:25, skrev David Hildenbrand:
> On 19.08.25 13:49, Maarten Lankhorst wrote:
>> When exporting dma-bufs to other devices, even when some drivers allow
>> the use of move_notify, performance will degrade severely when
>> eviction happens.
>>
>> A particular example where this can happen is a multi-card setup,
>> where PCI-E peer-to-peer is used to avoid going through system memory.
>>
>> If the buffer is evicted to system memory, not only is the evicting GPU
>> where the buffer resided affected, but it will also stall the GPU that
>> is waiting on the buffer.
>>
>> It also makes sense for long-running jobs not to be preempted by having
>> their buffers evicted, so it will make sense to have the ability to pin
>> system memory too.
>>
>> This is dependent on patches by Dave Airlie, so it's not part of this
>> series yet. But I'm planning on extending pinning to the memory cgroup
>> controller in the future to handle this case.
>>
>> Implementation details:
>>
>> For each cgroup up to the root cgroup, the 'min' limit is checked
>> against the currently effectively pinned value. If the value would go
>> above 'min', the pinning attempt is rejected.
>>
>> Pinned memory is handled slightly differently and affects the
>> calculation of the effective min/low values. Pinned memory is
>> subtracted from both, and needs to be added back afterwards when
>> calculating.
> 
> The term "pinning" is overloaded, and frequently we refer to pin_user_pages() and friends.
> 
> So I'm wondering if there is an alternative term to describe what you want to achieve.
> 
> Is it something like "unevictable" ?
It could be required to include a call to pin_user_pages(), in case a
process wants to pin memory from a user's address space to the GPU.

It's not done yet, but it wouldn't surprise me if we want to include it
in the future. Functionally it's similar to mlock() and related
functions.

Perhaps call it mlocked instead?

Kind regards,
~Maarten Lankhorst

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/3] cgroups: Add support for pinned device memory
  2025-09-01 18:16   ` Maarten Lankhorst
@ 2025-09-01 18:21     ` Thomas Hellström
  2025-09-01 18:38       ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Hellström @ 2025-09-01 18:21 UTC (permalink / raw)
  To: Maarten Lankhorst, David Hildenbrand, Lucas De Marchi,
	Rodrigo Vivi, David Airlie, Simona Vetter, Maxime Ripard,
	Natalie Vock, Tejun Heo, Johannes Weiner,
	'Michal Koutný', Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Lorenzo Stoakes,
	'Liam R . Howlett', Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Zimmermann
  Cc: Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

Hi,

On Mon, 2025-09-01 at 20:16 +0200, Maarten Lankhorst wrote:
> Hello David,
> 
> Den 2025-09-01 kl. 14:25, skrev David Hildenbrand:
> > On 19.08.25 13:49, Maarten Lankhorst wrote:
> > > When exporting dma-bufs to other devices, even when some drivers
> > > allow the use of move_notify, performance will degrade severely
> > > when eviction happens.
> > > 
> > > A particular example where this can happen is a multi-card setup,
> > > where PCI-E peer-to-peer is used to avoid going through system
> > > memory.
> > > 
> > > If the buffer is evicted to system memory, not only is the
> > > evicting GPU where the buffer resided affected, but it will also
> > > stall the GPU that is waiting on the buffer.
> > > 
> > > It also makes sense for long-running jobs not to be preempted by
> > > having their buffers evicted, so it will make sense to have the
> > > ability to pin system memory too.
> > > 
> > > This is dependent on patches by Dave Airlie, so it's not part of
> > > this series yet. But I'm planning on extending pinning to the
> > > memory cgroup controller in the future to handle this case.
> > > 
> > > Implementation details:
> > > 
> > > For each cgroup up to the root cgroup, the 'min' limit is checked
> > > against the currently effectively pinned value. If the value would
> > > go above 'min', the pinning attempt is rejected.
> > > 
> > > Pinned memory is handled slightly differently and affects the
> > > calculation of the effective min/low values. Pinned memory is
> > > subtracted from both, and needs to be added back afterwards when
> > > calculating.
> > 
> > The term "pinning" is overloaded, and frequently we refer to
> > pin_user_pages() and friends.
> > 
> > So I'm wondering if there is an alternative term to describe what
> > you want to achieve.
> > 
> > Is it something like "unevictable" ?
> It could be required to include a call pin_user_pages(), in case a
> process wants to pin 
> from a user's address space to the gpu.
> 
> It's not done yet, but it wouldn't surprise me if we want to include
> it in the future.
> Functionally it's similar to mlock() and related functions.
> 
> Perhaps call it mlocked instead?

I was under the impression that mlocked memory can be migrated to
other physical memory but not to swap, whereas pinned memory needs to
remain the exact same physical memory.

IMO "pinned" is pretty established within GPU drivers (dma-buf, TTM)
and essentially means the same as "pin" in "pin_user_pages", so
inventing a new name would probably cause even more confusion?

Thanks,
Thomas




> 
> Kind regards,
> ~Maarten Lankhorst


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/3] cgroups: Add support for pinned device memory
  2025-09-01 12:45 ` Natalie Vock
  2025-09-01 14:37   ` Thomas Hellström
@ 2025-09-01 18:22   ` Maarten Lankhorst
  1 sibling, 0 replies; 16+ messages in thread
From: Maarten Lankhorst @ 2025-09-01 18:22 UTC (permalink / raw)
  To: Natalie Vock, Lucas De Marchi, 'Thomas Hellström',
	Rodrigo Vivi, David Airlie, Simona Vetter, Maxime Ripard,
	Tejun Heo, Johannes Weiner, 'Michal Koutný',
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	'Liam R . Howlett', Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Zimmermann
  Cc: Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

Hello Natalie,

Den 2025-09-01 kl. 14:45, skrev Natalie Vock:
> Hi,
> 
> On 8/19/25 13:49, Maarten Lankhorst wrote:
>> When exporting dma-bufs to other devices, even when some drivers allow
>> the use of move_notify, performance will degrade severely when
>> eviction happens.
>>
>> A particular example where this can happen is a multi-card setup,
>> where PCI-E peer-to-peer is used to avoid going through system memory.
>>
>> If the buffer is evicted to system memory, not only is the evicting GPU
>> where the buffer resided affected, but it will also stall the GPU that
>> is waiting on the buffer.
>>
>> It also makes sense for long-running jobs not to be preempted by having
>> their buffers evicted, so it will make sense to have the ability to pin
>> system memory too.
>>
>> This is dependent on patches by Dave Airlie, so it's not part of this
>> series yet. But I'm planning on extending pinning to the memory cgroup
>> controller in the future to handle this case.
>>
>> Implementation details:
>>
>> For each cgroup up to the root cgroup, the 'min' limit is checked
>> against the currently effectively pinned value. If the value would go
>> above 'min', the pinning attempt is rejected.
> 
> Why do you want to reject pins in this case? What happens in desktop usecases (e.g. PRIME buffer sharing)? AFAIU, you kind of need to be able to pin buffers and export them to other devices for that whole thing to work, right? If the user doesn't explicitly set a min value, wouldn't the value being zero mean any pins will be rejected (and thus PRIME would break)?
> 
> If your objective is to prevent pinned buffers from being evicted, perhaps you could instead make TTM try to avoid evicting pinned buffers and prefer unpinned buffers as long as there are unpinned buffers to evict? As long as the total amount of pinned memory stays below min, no pinned buffers should get evicted with that either.
That would be setting an eviction priority; it can be done, but it gives no guarantee that memory will not be evicted.

Kind regards,
~Maarten

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/3] cgroups: Add support for pinned device memory
  2025-09-01 18:21     ` Thomas Hellström
@ 2025-09-01 18:38       ` David Hildenbrand
  2025-09-02 13:42         ` Thomas Hellström
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2025-09-01 18:38 UTC (permalink / raw)
  To: Thomas Hellström, Maarten Lankhorst, Lucas De Marchi,
	Rodrigo Vivi, David Airlie, Simona Vetter, Maxime Ripard,
	Natalie Vock, Tejun Heo, Johannes Weiner,
	'Michal Koutný', Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Lorenzo Stoakes,
	'Liam R . Howlett', Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Zimmermann
  Cc: Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

On 01.09.25 20:21, Thomas Hellström wrote:
> Hi,
> 
> On Mon, 2025-09-01 at 20:16 +0200, Maarten Lankhorst wrote:
>> Hello David,
>>
>> Den 2025-09-01 kl. 14:25, skrev David Hildenbrand:
>>> On 19.08.25 13:49, Maarten Lankhorst wrote:
>>>> When exporting dma-bufs to other devices, even when some drivers
>>>> allow the use of move_notify, performance will degrade severely
>>>> when eviction happens.
>>>>
>>>> A particular example where this can happen is a multi-card setup,
>>>> where PCI-E peer-to-peer is used to avoid going through system
>>>> memory.
>>>>
>>>> If the buffer is evicted to system memory, not only is the evicting
>>>> GPU where the buffer resided affected, but it will also stall the
>>>> GPU that is waiting on the buffer.
>>>>
>>>> It also makes sense for long-running jobs not to be preempted by
>>>> having their buffers evicted, so it will make sense to have the
>>>> ability to pin system memory too.
>>>>
>>>> This is dependent on patches by Dave Airlie, so it's not part of
>>>> this series yet. But I'm planning on extending pinning to the
>>>> memory cgroup controller in the future to handle this case.
>>>>
>>>> Implementation details:
>>>>
>>>> For each cgroup up to the root cgroup, the 'min' limit is checked
>>>> against the currently effectively pinned value. If the value would
>>>> go above 'min', the pinning attempt is rejected.
>>>>
>>>> Pinned memory is handled slightly differently and affects the
>>>> calculation of the effective min/low values. Pinned memory is
>>>> subtracted from both, and needs to be added back afterwards when
>>>> calculating.
>>>
>>> The term "pinning" is overloaded, and frequently we refer to
>>> pin_user_pages() and friends.
>>>
>>> So I'm wondering if there is an alternative term to describe what
>>> you want to achieve.
>>>
>>> Is it something like "unevictable" ?
>> It could be required to include a call to pin_user_pages(), in case a

We'll only care about long-term pinnings (i.e., FOLL_LONGTERM). Ordinary 
short-term pinning is just fine.

(see how even "pinning" is overloaded? :) )

>> process wants to pin
>> from a user's address space to the gpu.
>>
>> It's not done yet, but it wouldn't surprise me if we want to include
>> it in the future.
>> Functionally it's similar to mlock() and related functions.

Traditionally, vfio, io_uring and rdma do exactly that: they use GUP to 
longterm pin and then account that memory towards RLIMIT_MEMLOCK.

If you grep for "rlimit(RLIMIT_MEMLOCK)", you'll see what I mean.

There are known issues with that: imagine long-term pinning the same 
folio through GUP with 2 interfaces (e.g., vfio, io_uring, rdma), or 
within the same interface.

You'd account the memory multiple times, which is horrible. And so far 
there is no easy way out.
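The double-accounting problem can be shown with a tiny self-contained model. This is a userspace illustration with made-up names (mm_model, charge_longterm_pin), not the actual vfio/io_uring/rdma accounting code:

```c
#include <stddef.h>

/* Userspace model of the RLIMIT_MEMLOCK accounting pattern;
 * all names here are invented for illustration. */
struct mm_model {
	long locked_vm;	/* pages charged against RLIMIT_MEMLOCK */
};

/* Each interface (vfio, io_uring, rdma, ...) charges its long-term
 * pins independently; nothing records that another interface has
 * already accounted the very same folio. */
static void charge_longterm_pin(struct mm_model *mm, long npages)
{
	mm->locked_vm += npages;
}

/* Two interfaces long-term pin the same single page of the same mm. */
static long pin_same_page_twice(void)
{
	struct mm_model mm = { .locked_vm = 0 };

	charge_longterm_pin(&mm, 1);	/* e.g. a vfio-style pin     */
	charge_longterm_pin(&mm, 1);	/* e.g. an io_uring-style pin */

	/* One physical page, but counted twice against the limit. */
	return mm.locked_vm;
}
```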

>>
>> Perhaps call it mlocked instead?
> 
> I was under the impression that mlocked memory can be migrated to
> other physical memory but not to swap? Whereas pinned memory needs
> to remain in the exact same physical memory.

Yes, exactly.

> 
> IMO "pinned" is pretty established within GPU drivers (dma-buf, TTM)
> and essentially means the same as "pin" in "pin_user_pages", so
> inventing a new name would probably cause even more confusion?

If it's the same thing, absolutely. But Maarten said "It's not done yet,
but it wouldn't surprise me if we want to include it in the future".

So how is the memory we are talking about in this series "pinned" ?


-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/3] cgroups: Add support for pinned device memory
  2025-09-01 18:38       ` David Hildenbrand
@ 2025-09-02 13:42         ` Thomas Hellström
  0 siblings, 0 replies; 16+ messages in thread
From: Thomas Hellström @ 2025-09-02 13:42 UTC (permalink / raw)
  To: David Hildenbrand, Maarten Lankhorst, Lucas De Marchi,
	Rodrigo Vivi, David Airlie, Simona Vetter, Maxime Ripard,
	Natalie Vock, Tejun Heo, Johannes Weiner,
	'Michal Koutný', Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Lorenzo Stoakes,
	'Liam R . Howlett', Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Thomas Zimmermann
  Cc: Michal Hocko, intel-xe, dri-devel, linux-kernel, cgroups,
	linux-mm

On Mon, 2025-09-01 at 20:38 +0200, David Hildenbrand wrote:
> On 01.09.25 20:21, Thomas Hellström wrote:
> > Hi,
> > 
> > On Mon, 2025-09-01 at 20:16 +0200, Maarten Lankhorst wrote:
> > > Hello David,
> > > 
> > > Den 2025-09-01 kl. 14:25, skrev David Hildenbrand:
> > > > On 19.08.25 13:49, Maarten Lankhorst wrote:
> > > > > When exporting dma-bufs to other devices, even when some
> > > > > drivers are allowed to use move_notify, performance will
> > > > > degrade severely when eviction happens.
> > > > > 
> > > > > A particular example where this can happen is a multi-card
> > > > > setup, where PCI-E peer-to-peer is used to avoid accessing
> > > > > system memory.
> > > > > 
> > > > > If the buffer is evicted to system memory, not only is the
> > > > > evicting GPU where the buffer resided affected, but it will
> > > > > also stall the GPU that is waiting on the buffer.
> > > > > 
> > > > > It also makes sense for long-running jobs not to be
> > > > > preempted by having their buffers evicted, so it will make
> > > > > sense to have the ability to pin from system memory too.
> > > > > 
> > > > > This is dependent on patches by Dave Airlie, so it's not
> > > > > part of this series yet. But I'm planning on extending
> > > > > pinning to the memory cgroup controller in the future to
> > > > > handle this case.
> > > > > 
> > > > > Implementation details:
> > > > > 
> > > > > For each cgroup up to the root cgroup, the 'min' limit is
> > > > > checked against the currently effectively pinned value. If
> > > > > the value would go above 'min', the pinning attempt is
> > > > > rejected.
> > > > > 
> > > > > Pinned memory is handled slightly differently and affects
> > > > > the calculation of effective min/low values. Pinned memory
> > > > > is subtracted from both, and needs to be added back
> > > > > afterwards when calculating.
> > > > 
> > > > The term "pinning" is overloaded, and frequently we refer to
> > > > pin_user_pages() and friends.
> > > > 
> > > > So I'm wondering if there is an alternative term to describe
> > > > what
> > > > you want to achieve.
> > > > 
> > > > Is it something like "unevictable" ?
> > > It could be required to include a call to pin_user_pages(), in
> > > case a
> 
> We'll only care about long-term pinnings (i.e., FOLL_LONGTERM).
> Ordinary 
> short-term pinning is just fine.
> 
> (see how even "pinning" is overloaded? :) )
> 
> > > process wants to pin
> > > from a user's address space to the GPU.
> > > 
> > > It's not done yet, but it wouldn't surprise me if we want to
> > > include
> > > it in the future.
> > > Functionally it's similar to mlock() and related functions.
> 
> Traditionally, vfio, io_uring and rdma do exactly that: they use GUP
> to 
> longterm pin and then account that memory towards RLIMIT_MEMLOCK.
> 
> If you grep for "rlimit(RLIMIT_MEMLOCK)", you'll see what I mean.
> 
> There are known issues with that: imagine long-term pinning the same 
> folio through GUP with 2 interfaces (e.g., vfio, io_uring, rdma), or 
> within the same interface.
> 
> You'd account the memory multiple times, which is horrible. And so
> far 
> there is no easy way out.
> 
> > > 
> > > Perhaps call it mlocked instead?
> > 
> > I was under the impression that mlocked memory can be migrated to
> > other physical memory but not to swap? Whereas pinned memory
> > needs to remain in the exact same physical memory.
> 
> Yes, exactly.
> 
> > 
> > IMO "pinned" is pretty established within GPU drivers (dma-buf,
> > TTM)
> > and essentially means the same as "pin" in "pin_user_pages", so
> > inventing a new name would probably cause even more confusion?
> 
> If it's the same thing, absolutely. But Maarten said "It's not done
> yet, 
> but it wouldn't surprise me if we want to include it in the future".
> 
> So how is the memory we are talking about in this series "pinned" ?

Reading the cover letter from Maarten, he only talks about pinning
affecting performance, which would be similar to user-space calling
mlock(), although I doubt that moving content to other physical pages
within the same memory type will be a near-term use-case.

However, what's more important are situations where a device (like
RDMA) needs to pin, because it can't handle the case where access is
interrupted and content is transferred to another physical location.

Perhaps Maarten could elaborate on whether this series is intended
for both of these use-cases?

/Thomas





^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-09-02 13:42 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-19 11:49 [RFC 0/3] cgroups: Add support for pinned device memory Maarten Lankhorst
2025-08-19 11:49 ` [RFC 1/3] page_counter: Allow for pinning some amount of memory Maarten Lankhorst
2025-08-19 11:49 ` [RFC 2/3] cgroup/dmem: Implement pinning device memory Maarten Lankhorst
2025-08-19 11:49 ` [RFC 3/3] drm/xe: Add DRM_XE_GEM_CREATE_FLAG_PINNED flag and implementation Maarten Lankhorst
2025-08-19 16:22   ` Thomas Hellström
2025-08-21 11:41     ` Maarten Lankhorst
2025-08-26 14:20 ` [RFC 0/3] cgroups: Add support for pinned device memory Michal Koutný
2025-08-28 20:58   ` Maarten Lankhorst
2025-09-01 12:25 ` David Hildenbrand
2025-09-01 18:16   ` Maarten Lankhorst
2025-09-01 18:21     ` Thomas Hellström
2025-09-01 18:38       ` David Hildenbrand
2025-09-02 13:42         ` Thomas Hellström
2025-09-01 12:45 ` Natalie Vock
2025-09-01 14:37   ` Thomas Hellström
2025-09-01 18:22   ` Maarten Lankhorst
