dri-devel.lists.freedesktop.org archive mirror
* [PATCH v6 00/12] TTM shrinker helpers and xe buffer object shrinker
@ 2024-07-03 15:38 Thomas Hellström
  2024-07-03 15:38 ` [PATCH v6 01/12] drm/ttm: Allow TTM LRU list nodes of different types Thomas Hellström
                   ` (11 more replies)
  0 siblings, 12 replies; 38+ messages in thread
From: Thomas Hellström @ 2024-07-03 15:38 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Somalapuram Amaranath,
	Christian König, Matthew Brost, dri-devel

This series implements TTM shrinker / eviction helpers and an xe bo
shrinker. It builds on two previous series *and obsoletes them*. First,

https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg484425.html

Second, the previous TTM shrinker series

https://lore.kernel.org/linux-mm/b7491378-defd-4f1c-31e2-29e4c77e2d67@amd.com/T/

where the comment about layering
https://lore.kernel.org/linux-mm/b7491378-defd-4f1c-31e2-29e4c77e2d67@amd.com/T/#ma918844aa8a6efe8768fdcda0c6590d5c93850c9

is now addressed. This version also implements shmem objects for backup
rather than the direct swap-cache insertions used in the previous
series. It turns out that with per-page backup / shrinking, shmem objects
appear to work just as well as direct swap-cache insertions, with the
added benefit that the machinery introduced in the previous TTM shrinker
series to avoid running out of swap entries isn't really needed.

Patches 1-4 implement restartable LRU list iteration.

Patch 5 implements an LRU walker + resv locking helper.

Patch 6 moves TTM swapping over to the walker.

Patch 7 moves TTM eviction over to the walker.

Patch 8 could in theory be skipped, but it introduces the possibility to
easily add or test multiple backup backends, like direct swap-cache
insertion or, for example, files on fast dedicated NVMe storage.

Patch 9 introduces helpers in the ttm_pool code for page-by-page shrinking
and recovery. It avoids having to temporarily allocate a huge amount of
memory to be able to shrink a buffer object. It also introduces the
possibility to immediately write back pages if needed, since that tends
to be a bit delayed when left to kswapd. (A conceptual sketch follows the
patch list below.)

Patch 10 adds a simple error injection to the above code to help increase
test coverage.

Patch 11 implements an xe bo shrinker and a common helper in TTM for
shrinking.

Patch 12 increases (effectively removes) the XE_PL_TT watermark. (The
drm_exec-based exhaustive-eviction RFC carried as extra patches in
earlier revisions has been dropped; see the v4 notes below.)
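
For illustration of the patch 9 idea only, here is a rough conceptual
sketch. The helpers example_backup_one_page() and example_free_one_page()
are hypothetical placeholders, not the interfaces the patch actually adds:

/* Conceptual sketch only; the helper names are hypothetical. */
static int example_shrink_tt(struct ttm_tt *tt, bool writeback)
{
	pgoff_t i;

	for (i = 0; i < tt->num_pages; ++i) {
		/* Hypothetical: copy one page to its shmem backup object,
		 * optionally starting writeback immediately instead of
		 * waiting for kswapd.
		 */
		int ret = example_backup_one_page(tt, i, writeback);

		if (ret)
			return ret;

		/* Hypothetical: free the backed-up page right away, so no
		 * large temporary allocation is ever needed.
		 */
		example_free_one_page(tt, i);
	}

	return 0;
}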

v2:
- Squash obsolete revision history in the patch commit messages.
- Address a couple of review comments from Christian
- Don't store the mem_type in the TTM managers but in the
  resource cursor.
- Rename introduced TTM *back_up* function names to *backup*
- Add ttm pool recovery fault injection.
- Shrinker xe kunit test
- Various bugfixes

v3:
- Address some review comments from Matthew Brost and Christian König.
- Use the restartable LRU walk for TTM swapping and eviction.
- Provide a POC drm_exec locking implementation for exhaustive
  eviction. (Christian König).

v4:
- Remove the RFC exhaustive eviction part. While the path to exhaustive
  eviction is pretty clear and demonstrated in v3, there is still some
  drm_exec work that needs to be agreed and implemented.
- Add shrinker power management. On some hw we need to wake when shrinking.
- Fix the lru walker helper for -EALREADY errors.
- Add drm/xe: Increase the XE_PL_TT watermark.

v5:
- Update also TTM kunit tests
- Handle ghost- and zombie objects in the shrinker.
- A couple of compile- and UAF fixes reported by Kernel Build Robot and
  Dan Carpenter.

v6:
- Address review comments from Matthew Brost as detailed in patches
  4/12, 5/12, 6/12, 7/12, 8/12.

Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <dri-devel@lists.freedesktop.org>

Thomas Hellström (12):
  drm/ttm: Allow TTM LRU list nodes of different types
  drm/ttm: Slightly clean up LRU list iteration
  drm/ttm: Use LRU hitches
  drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist
    moves
  drm/ttm: Provide a generic LRU walker helper
  drm/ttm: Use the LRU walker helper for swapping
  drm/ttm: Use the LRU walker for eviction
  drm/ttm: Add a virtual base class for graphics memory backup
  drm/ttm/pool: Provide a helper to shrink pages
  drm/ttm: Use fault-injection to test error paths
  drm/ttm, drm/xe: Add a shrinker for xe bos
  drm/xe: Increase the XE_PL_TT watermark

 drivers/gpu/drm/Kconfig                       |  10 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   4 +
 drivers/gpu/drm/ttm/Makefile                  |   2 +-
 drivers/gpu/drm/ttm/tests/ttm_bo_test.c       |   6 +-
 drivers/gpu/drm/ttm/tests/ttm_resource_test.c |   2 +-
 drivers/gpu/drm/ttm/ttm_backup_shmem.c        | 139 ++++++
 drivers/gpu/drm/ttm/ttm_bo.c                  | 458 ++++++++----------
 drivers/gpu/drm/ttm/ttm_bo_util.c             | 223 +++++++++
 drivers/gpu/drm/ttm/ttm_device.c              |  29 +-
 drivers/gpu/drm/ttm/ttm_pool.c                | 412 +++++++++++++++-
 drivers/gpu/drm/ttm/ttm_resource.c            | 268 ++++++++--
 drivers/gpu/drm/ttm/ttm_tt.c                  |  37 ++
 drivers/gpu/drm/xe/Makefile                   |   1 +
 drivers/gpu/drm/xe/tests/xe_bo.c              | 118 +++++
 drivers/gpu/drm/xe/tests/xe_bo_test.c         |   1 +
 drivers/gpu/drm/xe/tests/xe_bo_test.h         |   1 +
 drivers/gpu/drm/xe/xe_bo.c                    | 155 +++++-
 drivers/gpu/drm/xe/xe_bo.h                    |  26 +
 drivers/gpu/drm/xe/xe_device.c                |   8 +
 drivers/gpu/drm/xe/xe_device_types.h          |   2 +
 drivers/gpu/drm/xe/xe_shrinker.c              | 287 +++++++++++
 drivers/gpu/drm/xe/xe_shrinker.h              |  18 +
 drivers/gpu/drm/xe/xe_ttm_sys_mgr.c           |   3 +-
 drivers/gpu/drm/xe/xe_vm.c                    |   4 +
 include/drm/ttm/ttm_backup.h                  | 137 ++++++
 include/drm/ttm/ttm_bo.h                      |  51 +-
 include/drm/ttm/ttm_pool.h                    |   5 +
 include/drm/ttm/ttm_resource.h                |  99 +++-
 include/drm/ttm/ttm_tt.h                      |  20 +
 29 files changed, 2149 insertions(+), 377 deletions(-)
 create mode 100644 drivers/gpu/drm/ttm/ttm_backup_shmem.c
 create mode 100644 drivers/gpu/drm/xe/xe_shrinker.c
 create mode 100644 drivers/gpu/drm/xe/xe_shrinker.h
 create mode 100644 include/drm/ttm/ttm_backup.h

-- 
2.44.0



* [PATCH v6 01/12] drm/ttm: Allow TTM LRU list nodes of different types
  2024-07-03 15:38 [PATCH v6 00/12] TTM shrinker helpers and xe buffer object shrinker Thomas Hellström
@ 2024-07-03 15:38 ` Thomas Hellström
  2024-07-03 15:38 ` [PATCH v6 02/12] drm/ttm: Slightly clean up LRU list iteration Thomas Hellström
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Thomas Hellström @ 2024-07-03 15:38 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Christian König,
	Somalapuram Amaranath, Matthew Brost, dri-devel

To be able to handle list unlocking while traversing the LRU
list, we want the iterators not only to point to the next
position of the list traversal, but to insert themselves as
list nodes at that point to work around the fact that the
next node might otherwise disappear from the list while
the iterator is pointing to it.

These list nodes need to be easily distinguishable from other
list nodes so that others traversing the list can skip
over them.

So declare a struct ttm_lru_item, with a struct list_head member
and a type enum. This will slightly increase the size of a
struct ttm_resource.
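
As a quick illustration (not part of the patch), a minimal sketch of a
list walk that skips the non-resource nodes using the helpers introduced
below; the in-tree equivalent is ttm_lru_first_res_or_null():

/* Sketch only: count the resources on an LRU list, skipping hitch nodes. */
static unsigned int example_count_resources(struct list_head *head)
{
	struct ttm_lru_item *lru;
	unsigned int count = 0;

	list_for_each_entry(lru, head, link) {
		if (!ttm_lru_item_is_res(lru))
			continue;	/* An iterator hitch, not a resource. */

		/* ttm_lru_item_to_res(lru) gives the embedding resource. */
		count++;
	}

	return count;
}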

Changes in previous series:
- Update enum ttm_lru_item_type documentation.
v3:
- Introduce ttm_lru_first_res_or_null()
  (Christian König, Thomas Hellström)
v5:
- Update also the TTM test code (Xe CI).

Cc: Christian König <christian.koenig@amd.com>
Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/ttm/tests/ttm_bo_test.c       |  6 +-
 drivers/gpu/drm/ttm/tests/ttm_resource_test.c |  2 +-
 drivers/gpu/drm/ttm/ttm_device.c              |  4 +-
 drivers/gpu/drm/ttm/ttm_resource.c            | 89 +++++++++++++++----
 include/drm/ttm/ttm_resource.h                | 54 ++++++++++-
 5 files changed, 129 insertions(+), 26 deletions(-)

diff --git a/drivers/gpu/drm/ttm/tests/ttm_bo_test.c b/drivers/gpu/drm/ttm/tests/ttm_bo_test.c
index d1b32303d051..f0a7eb62116c 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_bo_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_bo_test.c
@@ -271,7 +271,7 @@ static void ttm_bo_unreserve_basic(struct kunit *test)
 
 	man = ttm_manager_type(priv->ttm_dev, mem_type);
 	KUNIT_ASSERT_EQ(test,
-			list_is_last(&res1->lru, &man->lru[bo->priority]), 1);
+			list_is_last(&res1->lru.link, &man->lru[bo->priority]), 1);
 
 	ttm_resource_free(bo, &res2);
 	ttm_resource_free(bo, &res1);
@@ -308,11 +308,11 @@ static void ttm_bo_unreserve_pinned(struct kunit *test)
 	err = ttm_resource_alloc(bo, place, &res2);
 	KUNIT_ASSERT_EQ(test, err, 0);
 	KUNIT_ASSERT_EQ(test,
-			list_is_last(&res2->lru, &priv->ttm_dev->pinned), 1);
+			list_is_last(&res2->lru.link, &priv->ttm_dev->pinned), 1);
 
 	ttm_bo_unreserve(bo);
 	KUNIT_ASSERT_EQ(test,
-			list_is_last(&res1->lru, &priv->ttm_dev->pinned), 1);
+			list_is_last(&res1->lru.link, &priv->ttm_dev->pinned), 1);
 
 	ttm_resource_free(bo, &res1);
 	ttm_resource_free(bo, &res2);
diff --git a/drivers/gpu/drm/ttm/tests/ttm_resource_test.c b/drivers/gpu/drm/ttm/tests/ttm_resource_test.c
index 9c2f13e53162..22260e7aea58 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_resource_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_resource_test.c
@@ -198,7 +198,7 @@ static void ttm_resource_fini_basic(struct kunit *test)
 	ttm_resource_init(bo, place, res);
 	ttm_resource_fini(man, res);
 
-	KUNIT_ASSERT_TRUE(test, list_empty(&res->lru));
+	KUNIT_ASSERT_TRUE(test, list_empty(&res->lru.link));
 	KUNIT_ASSERT_EQ(test, man->usage, 0);
 }
 
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 434cf0258000..09411978a13a 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -274,14 +274,14 @@ static void ttm_device_clear_lru_dma_mappings(struct ttm_device *bdev,
 	struct ttm_resource *res;
 
 	spin_lock(&bdev->lru_lock);
-	while ((res = list_first_entry_or_null(list, typeof(*res), lru))) {
+	while ((res = ttm_lru_first_res_or_null(list))) {
 		struct ttm_buffer_object *bo = res->bo;
 
 		/* Take ref against racing releases once lru_lock is unlocked */
 		if (!ttm_bo_get_unless_zero(bo))
 			continue;
 
-		list_del_init(&res->lru);
+		list_del_init(&bo->resource->lru.link);
 		spin_unlock(&bdev->lru_lock);
 
 		if (bo->ttm)
diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
index 4a66b851b67d..db9a7a3717c4 100644
--- a/drivers/gpu/drm/ttm/ttm_resource.c
+++ b/drivers/gpu/drm/ttm/ttm_resource.c
@@ -70,8 +70,8 @@ void ttm_lru_bulk_move_tail(struct ttm_lru_bulk_move *bulk)
 			dma_resv_assert_held(pos->last->bo->base.resv);
 
 			man = ttm_manager_type(pos->first->bo->bdev, i);
-			list_bulk_move_tail(&man->lru[j], &pos->first->lru,
-					    &pos->last->lru);
+			list_bulk_move_tail(&man->lru[j], &pos->first->lru.link,
+					    &pos->last->lru.link);
 		}
 	}
 }
@@ -84,14 +84,38 @@ ttm_lru_bulk_move_pos(struct ttm_lru_bulk_move *bulk, struct ttm_resource *res)
 	return &bulk->pos[res->mem_type][res->bo->priority];
 }
 
+/* Return the previous resource on the list (skip over non-resource list items) */
+static struct ttm_resource *ttm_lru_prev_res(struct ttm_resource *cur)
+{
+	struct ttm_lru_item *lru = &cur->lru;
+
+	do {
+		lru = list_prev_entry(lru, link);
+	} while (!ttm_lru_item_is_res(lru));
+
+	return ttm_lru_item_to_res(lru);
+}
+
+/* Return the next resource on the list (skip over non-resource list items) */
+static struct ttm_resource *ttm_lru_next_res(struct ttm_resource *cur)
+{
+	struct ttm_lru_item *lru = &cur->lru;
+
+	do {
+		lru = list_next_entry(lru, link);
+	} while (!ttm_lru_item_is_res(lru));
+
+	return ttm_lru_item_to_res(lru);
+}
+
 /* Move the resource to the tail of the bulk move range */
 static void ttm_lru_bulk_move_pos_tail(struct ttm_lru_bulk_move_pos *pos,
 				       struct ttm_resource *res)
 {
 	if (pos->last != res) {
 		if (pos->first == res)
-			pos->first = list_next_entry(res, lru);
-		list_move(&res->lru, &pos->last->lru);
+			pos->first = ttm_lru_next_res(res);
+		list_move(&res->lru.link, &pos->last->lru.link);
 		pos->last = res;
 	}
 }
@@ -122,11 +146,11 @@ static void ttm_lru_bulk_move_del(struct ttm_lru_bulk_move *bulk,
 		pos->first = NULL;
 		pos->last = NULL;
 	} else if (pos->first == res) {
-		pos->first = list_next_entry(res, lru);
+		pos->first = ttm_lru_next_res(res);
 	} else if (pos->last == res) {
-		pos->last = list_prev_entry(res, lru);
+		pos->last = ttm_lru_prev_res(res);
 	} else {
-		list_move(&res->lru, &pos->last->lru);
+		list_move(&res->lru.link, &pos->last->lru.link);
 	}
 }
 
@@ -155,7 +179,7 @@ void ttm_resource_move_to_lru_tail(struct ttm_resource *res)
 	lockdep_assert_held(&bo->bdev->lru_lock);
 
 	if (bo->pin_count) {
-		list_move_tail(&res->lru, &bdev->pinned);
+		list_move_tail(&res->lru.link, &bdev->pinned);
 
 	} else	if (bo->bulk_move) {
 		struct ttm_lru_bulk_move_pos *pos =
@@ -166,7 +190,7 @@ void ttm_resource_move_to_lru_tail(struct ttm_resource *res)
 		struct ttm_resource_manager *man;
 
 		man = ttm_manager_type(bdev, res->mem_type);
-		list_move_tail(&res->lru, &man->lru[bo->priority]);
+		list_move_tail(&res->lru.link, &man->lru[bo->priority]);
 	}
 }
 
@@ -197,9 +221,9 @@ void ttm_resource_init(struct ttm_buffer_object *bo,
 	man = ttm_manager_type(bo->bdev, place->mem_type);
 	spin_lock(&bo->bdev->lru_lock);
 	if (bo->pin_count)
-		list_add_tail(&res->lru, &bo->bdev->pinned);
+		list_add_tail(&res->lru.link, &bo->bdev->pinned);
 	else
-		list_add_tail(&res->lru, &man->lru[bo->priority]);
+		list_add_tail(&res->lru.link, &man->lru[bo->priority]);
 	man->usage += res->size;
 	spin_unlock(&bo->bdev->lru_lock);
 }
@@ -221,7 +245,7 @@ void ttm_resource_fini(struct ttm_resource_manager *man,
 	struct ttm_device *bdev = man->bdev;
 
 	spin_lock(&bdev->lru_lock);
-	list_del_init(&res->lru);
+	list_del_init(&res->lru.link);
 	man->usage -= res->size;
 	spin_unlock(&bdev->lru_lock);
 }
@@ -472,14 +496,16 @@ struct ttm_resource *
 ttm_resource_manager_first(struct ttm_resource_manager *man,
 			   struct ttm_resource_cursor *cursor)
 {
-	struct ttm_resource *res;
+	struct ttm_lru_item *lru;
 
 	lockdep_assert_held(&man->bdev->lru_lock);
 
 	for (cursor->priority = 0; cursor->priority < TTM_MAX_BO_PRIORITY;
 	     ++cursor->priority)
-		list_for_each_entry(res, &man->lru[cursor->priority], lru)
-			return res;
+		list_for_each_entry(lru, &man->lru[cursor->priority], link) {
+			if (ttm_lru_item_is_res(lru))
+				return ttm_lru_item_to_res(lru);
+		}
 
 	return NULL;
 }
@@ -498,15 +524,40 @@ ttm_resource_manager_next(struct ttm_resource_manager *man,
 			  struct ttm_resource_cursor *cursor,
 			  struct ttm_resource *res)
 {
+	struct ttm_lru_item *lru = &res->lru;
+
 	lockdep_assert_held(&man->bdev->lru_lock);
 
-	list_for_each_entry_continue(res, &man->lru[cursor->priority], lru)
-		return res;
+	list_for_each_entry_continue(lru, &man->lru[cursor->priority], link) {
+		if (ttm_lru_item_is_res(lru))
+			return ttm_lru_item_to_res(lru);
+	}
 
 	for (++cursor->priority; cursor->priority < TTM_MAX_BO_PRIORITY;
 	     ++cursor->priority)
-		list_for_each_entry(res, &man->lru[cursor->priority], lru)
-			return res;
+		list_for_each_entry(lru, &man->lru[cursor->priority], link) {
+			if (ttm_lru_item_is_res(lru))
+				return ttm_lru_item_to_res(lru);
+		}
+
+	return NULL;
+}
+
+/**
+ * ttm_lru_first_res_or_null() - Return the first resource on an lru list
+ * @head: The list head of the lru list.
+ *
+ * Return: Pointer to the first resource on the lru list or NULL if
+ * there is none.
+ */
+struct ttm_resource *ttm_lru_first_res_or_null(struct list_head *head)
+{
+	struct ttm_lru_item *lru;
+
+	list_for_each_entry(lru, head, link) {
+		if (ttm_lru_item_is_res(lru))
+			return ttm_lru_item_to_res(lru);
+	}
 
 	return NULL;
 }
diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
index 69769355139f..1511d91e290d 100644
--- a/include/drm/ttm/ttm_resource.h
+++ b/include/drm/ttm/ttm_resource.h
@@ -49,6 +49,43 @@ struct io_mapping;
 struct sg_table;
 struct scatterlist;
 
+/**
+ * enum ttm_lru_item_type - enumerate ttm_lru_item subclasses
+ */
+enum ttm_lru_item_type {
+	/** @TTM_LRU_RESOURCE: The resource subclass */
+	TTM_LRU_RESOURCE,
+	/** @TTM_LRU_HITCH: The iterator hitch subclass */
+	TTM_LRU_HITCH
+};
+
+/**
+ * struct ttm_lru_item - The TTM lru list node base class
+ * @link: The list link
+ * @type: The subclass type
+ */
+struct ttm_lru_item {
+	struct list_head link;
+	enum ttm_lru_item_type type;
+};
+
+/**
+ * ttm_lru_item_init() - initialize a struct ttm_lru_item
+ * @item: The item to initialize
+ * @type: The subclass type
+ */
+static inline void ttm_lru_item_init(struct ttm_lru_item *item,
+				     enum ttm_lru_item_type type)
+{
+	item->type = type;
+	INIT_LIST_HEAD(&item->link);
+}
+
+static inline bool ttm_lru_item_is_res(const struct ttm_lru_item *item)
+{
+	return item->type == TTM_LRU_RESOURCE;
+}
+
 struct ttm_resource_manager_func {
 	/**
 	 * struct ttm_resource_manager_func member alloc
@@ -217,9 +254,21 @@ struct ttm_resource {
 	/**
 	 * @lru: Least recently used list, see &ttm_resource_manager.lru
 	 */
-	struct list_head lru;
+	struct ttm_lru_item lru;
 };
 
+/**
+ * ttm_lru_item_to_res() - Downcast a struct ttm_lru_item to a struct ttm_resource
+ * @item: The struct ttm_lru_item to downcast
+ *
+ * Return: Pointer to the embedding struct ttm_resource
+ */
+static inline struct ttm_resource *
+ttm_lru_item_to_res(struct ttm_lru_item *item)
+{
+	return container_of(item, struct ttm_resource, lru);
+}
+
 /**
  * struct ttm_resource_cursor
  *
@@ -393,6 +442,9 @@ ttm_resource_manager_next(struct ttm_resource_manager *man,
 			  struct ttm_resource_cursor *cursor,
 			  struct ttm_resource *res);
 
+struct ttm_resource *
+ttm_lru_first_res_or_null(struct list_head *head);
+
 /**
  * ttm_resource_manager_for_each_res - iterate over all resources
  * @man: the resource manager
-- 
2.44.0



* [PATCH v6 02/12] drm/ttm: Slightly clean up LRU list iteration
  2024-07-03 15:38 [PATCH v6 00/12] TTM shrinker helpers and xe buffer object shrinker Thomas Hellström
  2024-07-03 15:38 ` [PATCH v6 01/12] drm/ttm: Allow TTM LRU list nodes of different types Thomas Hellström
@ 2024-07-03 15:38 ` Thomas Hellström
  2024-07-03 15:38 ` [PATCH v6 03/12] drm/ttm: Use LRU hitches Thomas Hellström
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Thomas Hellström @ 2024-07-03 15:38 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Christian König,
	Somalapuram Amaranath, Matthew Brost, dri-devel

To make the transition to using lru hitches easier,
simplify the ttm_resource_manager_next() interface to only take
the cursor and reuse ttm_resource_manager_next() functionality
from ttm_resource_manager_first().
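
A hedged usage sketch (not part of the patch) of the simplified
interface, with the cursor now carrying all iteration state:

/* Sketch only: iterate all resources of a manager under the LRU lock. */
static void example_dump_manager(struct ttm_device *bdev,
				 struct ttm_resource_manager *man)
{
	struct ttm_resource_cursor cursor;
	struct ttm_resource *res;

	spin_lock(&bdev->lru_lock);
	ttm_resource_manager_for_each_res(man, &cursor, res) {
		/* res is valid here; the cursor remembers the position, so
		 * _next() no longer needs the manager or resource passed in.
		 */
	}
	spin_unlock(&bdev->lru_lock);
}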

Cc: Christian König <christian.koenig@amd.com>
Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/ttm/ttm_resource.c | 48 +++++++++++++-----------------
 include/drm/ttm/ttm_resource.h     | 10 ++++---
 2 files changed, 27 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
index db9a7a3717c4..8bfbddddc0e8 100644
--- a/drivers/gpu/drm/ttm/ttm_resource.c
+++ b/drivers/gpu/drm/ttm/ttm_resource.c
@@ -496,50 +496,44 @@ struct ttm_resource *
 ttm_resource_manager_first(struct ttm_resource_manager *man,
 			   struct ttm_resource_cursor *cursor)
 {
-	struct ttm_lru_item *lru;
-
 	lockdep_assert_held(&man->bdev->lru_lock);
 
-	for (cursor->priority = 0; cursor->priority < TTM_MAX_BO_PRIORITY;
-	     ++cursor->priority)
-		list_for_each_entry(lru, &man->lru[cursor->priority], link) {
-			if (ttm_lru_item_is_res(lru))
-				return ttm_lru_item_to_res(lru);
-		}
-
-	return NULL;
+	cursor->priority = 0;
+	cursor->man = man;
+	cursor->cur = &man->lru[cursor->priority];
+	return ttm_resource_manager_next(cursor);
 }
 
 /**
  * ttm_resource_manager_next
  *
- * @man: resource manager to iterate over
  * @cursor: cursor to record the position
- * @res: the current resource pointer
  *
- * Returns the next resource from the resource manager.
+ * Return: the next resource from the resource manager.
  */
 struct ttm_resource *
-ttm_resource_manager_next(struct ttm_resource_manager *man,
-			  struct ttm_resource_cursor *cursor,
-			  struct ttm_resource *res)
+ttm_resource_manager_next(struct ttm_resource_cursor *cursor)
 {
-	struct ttm_lru_item *lru = &res->lru;
+	struct ttm_resource_manager *man = cursor->man;
+	struct ttm_lru_item *lru;
 
 	lockdep_assert_held(&man->bdev->lru_lock);
 
-	list_for_each_entry_continue(lru, &man->lru[cursor->priority], link) {
-		if (ttm_lru_item_is_res(lru))
-			return ttm_lru_item_to_res(lru);
-	}
-
-	for (++cursor->priority; cursor->priority < TTM_MAX_BO_PRIORITY;
-	     ++cursor->priority)
-		list_for_each_entry(lru, &man->lru[cursor->priority], link) {
-			if (ttm_lru_item_is_res(lru))
-				return ttm_lru_item_to_res(lru);
+	for (;;) {
+		lru = list_entry(cursor->cur, typeof(*lru), link);
+		list_for_each_entry_continue(lru, &man->lru[cursor->priority], link) {
+			if (ttm_lru_item_is_res(lru)) {
+				cursor->cur = &lru->link;
+				return ttm_lru_item_to_res(lru);
+			}
 		}
 
+		if (++cursor->priority >= TTM_MAX_BO_PRIORITY)
+			break;
+
+		cursor->cur = &man->lru[cursor->priority];
+	}
+
 	return NULL;
 }
 
diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
index 1511d91e290d..7d81fd5b5b83 100644
--- a/include/drm/ttm/ttm_resource.h
+++ b/include/drm/ttm/ttm_resource.h
@@ -272,11 +272,15 @@ ttm_lru_item_to_res(struct ttm_lru_item *item)
 /**
  * struct ttm_resource_cursor
  *
+ * @man: The resource manager currently being iterated over.
+ * @cur: The list head the cursor currently points to.
  * @priority: the current priority
  *
  * Cursor to iterate over the resources in a manager.
  */
 struct ttm_resource_cursor {
+	struct ttm_resource_manager *man;
+	struct list_head *cur;
 	unsigned int priority;
 };
 
@@ -438,9 +442,7 @@ struct ttm_resource *
 ttm_resource_manager_first(struct ttm_resource_manager *man,
 			   struct ttm_resource_cursor *cursor);
 struct ttm_resource *
-ttm_resource_manager_next(struct ttm_resource_manager *man,
-			  struct ttm_resource_cursor *cursor,
-			  struct ttm_resource *res);
+ttm_resource_manager_next(struct ttm_resource_cursor *cursor);
 
 struct ttm_resource *
 ttm_lru_first_res_or_null(struct list_head *head);
@@ -455,7 +457,7 @@ ttm_lru_first_res_or_null(struct list_head *head);
  */
 #define ttm_resource_manager_for_each_res(man, cursor, res)		\
 	for (res = ttm_resource_manager_first(man, cursor); res;	\
-	     res = ttm_resource_manager_next(man, cursor, res))
+	     res = ttm_resource_manager_next(cursor))
 
 struct ttm_kmap_iter *
 ttm_kmap_iter_iomap_init(struct ttm_kmap_iter_iomap *iter_io,
-- 
2.44.0



* [PATCH v6 03/12] drm/ttm: Use LRU hitches
  2024-07-03 15:38 [PATCH v6 00/12] TTM shrinker helpers and xe buffer object shrinker Thomas Hellström
  2024-07-03 15:38 ` [PATCH v6 01/12] drm/ttm: Allow TTM LRU list nodes of different types Thomas Hellström
  2024-07-03 15:38 ` [PATCH v6 02/12] drm/ttm: Slightly clean up LRU list iteration Thomas Hellström
@ 2024-07-03 15:38 ` Thomas Hellström
  2024-07-04  9:05   ` Christian König
  2024-07-03 15:38 ` [PATCH v6 04/12] drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist moves Thomas Hellström
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Thomas Hellström @ 2024-07-03 15:38 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Christian König,
	Somalapuram Amaranath, Matthew Brost, dri-devel

Have iterators insert themselves into the list they are iterating
over using hitch list nodes. Since only the iterator owner
can remove these list nodes from the list, it's safe to unlock
the list and when continuing, use them as a starting point. Due to
the way LRU bumping works in TTM, newly added items will not be
missed, and bumped items will be iterated over a second time before
reaching the end of the list.

The exception is lists with bulk move sublists. When bumping a
sublist, a hitch that is part of that sublist will also be moved
and we might miss items if restarting from it. This will be
addressed in a later patch.
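
A simplified sketch (not part of the patch) of the pattern the hitches
enable; it omits the buffer object locking and refcounting that a real
user, such as the walker in patch 5, needs before dropping the lock:

/* Sketch only: the LRU lock can now be dropped between iterations. */
static void example_restartable_walk(struct ttm_device *bdev,
				     struct ttm_resource_manager *man)
{
	struct ttm_resource_cursor cursor;
	struct ttm_resource *res;

	spin_lock(&bdev->lru_lock);
	ttm_resource_manager_for_each_res(man, &cursor, res) {
		spin_unlock(&bdev->lru_lock);
		/* ... potentially sleeping work; the hitch keeps our place ... */
		spin_lock(&bdev->lru_lock);
	}
	/* Mandatory: pull the hitch off whatever list it still sits on. */
	ttm_resource_cursor_fini_locked(&cursor);
	spin_unlock(&bdev->lru_lock);
}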

Changes in previous series:
- Updated ttm_resource_cursor_fini() documentation.
v2:
- Don't reorder ttm_resource_manager_first() and _next().
  (Christian König).
- Use list_add instead of list_move
  (Christian König)
v3:
- Split into two patches, one cleanup, one new functionality
  (Christian König)
- use ttm_resource_cursor_fini_locked() instead of open-coding
  (Matthew Brost)

Cc: Christian König <christian.koenig@amd.com>
Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c       |  1 +
 drivers/gpu/drm/ttm/ttm_device.c   |  9 +++--
 drivers/gpu/drm/ttm/ttm_resource.c | 56 +++++++++++++++++++++++++-----
 include/drm/ttm/ttm_resource.h     |  9 +++--
 4 files changed, 62 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 6396dece0db1..43eda720657f 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -621,6 +621,7 @@ int ttm_mem_evict_first(struct ttm_device *bdev,
 		if (locked)
 			dma_resv_unlock(res->bo->base.resv);
 	}
+	ttm_resource_cursor_fini_locked(&cursor);
 
 	if (!bo) {
 		if (busy_bo && !ttm_bo_get_unless_zero(busy_bo))
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 09411978a13a..f9e9b1ec8c8a 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -170,12 +170,17 @@ int ttm_device_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx,
 			num_pages = PFN_UP(bo->base.size);
 			ret = ttm_bo_swapout(bo, ctx, gfp_flags);
 			/* ttm_bo_swapout has dropped the lru_lock */
-			if (!ret)
+			if (!ret) {
+				ttm_resource_cursor_fini(&cursor);
 				return num_pages;
-			if (ret != -EBUSY)
+			}
+			if (ret != -EBUSY) {
+				ttm_resource_cursor_fini(&cursor);
 				return ret;
+			}
 		}
 	}
+	ttm_resource_cursor_fini_locked(&cursor);
 	spin_unlock(&bdev->lru_lock);
 	return 0;
 }
diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
index 8bfbddddc0e8..9c8b6499edfb 100644
--- a/drivers/gpu/drm/ttm/ttm_resource.c
+++ b/drivers/gpu/drm/ttm/ttm_resource.c
@@ -33,6 +33,37 @@
 
 #include <drm/drm_util.h>
 
+/**
+ * ttm_resource_cursor_fini_locked() - Finalize the LRU list cursor usage
+ * @cursor: The struct ttm_resource_cursor to finalize.
+ *
+ * The function pulls the LRU list cursor off any lists it was previously
+ * attached to. Needs to be called with the LRU lock held. The function
+ * can be called multiple times after each other.
+ */
+void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor *cursor)
+{
+	lockdep_assert_held(&cursor->man->bdev->lru_lock);
+	list_del_init(&cursor->hitch.link);
+}
+
+/**
+ * ttm_resource_cursor_fini() - Finalize the LRU list cursor usage
+ * @cursor: The struct ttm_resource_cursor to finalize.
+ *
+ * The function pulls the LRU list cursor off any lists it was previously
+ * attached to. Needs to be called without the LRU list lock held. The
+ * function can be called multiple times after each other.
+ */
+void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor)
+{
+	spinlock_t *lru_lock = &cursor->man->bdev->lru_lock;
+
+	spin_lock(lru_lock);
+	ttm_resource_cursor_fini_locked(cursor);
+	spin_unlock(lru_lock);
+}
+
 /**
  * ttm_lru_bulk_move_init - initialize a bulk move structure
  * @bulk: the structure to init
@@ -485,12 +516,15 @@ void ttm_resource_manager_debug(struct ttm_resource_manager *man,
 EXPORT_SYMBOL(ttm_resource_manager_debug);
 
 /**
- * ttm_resource_manager_first
- *
+ * ttm_resource_manager_first() - Start iterating over the resources
+ * of a resource manager
  * @man: resource manager to iterate over
  * @cursor: cursor to record the position
  *
- * Returns the first resource from the resource manager.
+ * Initializes the cursor and starts iterating. When done iterating,
+ * the caller must explicitly call ttm_resource_cursor_fini().
+ *
+ * Return: The first resource from the resource manager.
  */
 struct ttm_resource *
 ttm_resource_manager_first(struct ttm_resource_manager *man,
@@ -500,13 +534,15 @@ ttm_resource_manager_first(struct ttm_resource_manager *man,
 
 	cursor->priority = 0;
 	cursor->man = man;
-	cursor->cur = &man->lru[cursor->priority];
+	ttm_lru_item_init(&cursor->hitch, TTM_LRU_HITCH);
+	list_add(&cursor->hitch.link, &man->lru[cursor->priority]);
+
 	return ttm_resource_manager_next(cursor);
 }
 
 /**
- * ttm_resource_manager_next
- *
+ * ttm_resource_manager_next() - Continue iterating over the resource manager
+ * resources
  * @cursor: cursor to record the position
  *
  * Return: the next resource from the resource manager.
@@ -520,10 +556,10 @@ ttm_resource_manager_next(struct ttm_resource_cursor *cursor)
 	lockdep_assert_held(&man->bdev->lru_lock);
 
 	for (;;) {
-		lru = list_entry(cursor->cur, typeof(*lru), link);
+		lru = &cursor->hitch;
 		list_for_each_entry_continue(lru, &man->lru[cursor->priority], link) {
 			if (ttm_lru_item_is_res(lru)) {
-				cursor->cur = &lru->link;
+				list_move(&cursor->hitch.link, &lru->link);
 				return ttm_lru_item_to_res(lru);
 			}
 		}
@@ -531,9 +567,11 @@ ttm_resource_manager_next(struct ttm_resource_cursor *cursor)
 		if (++cursor->priority >= TTM_MAX_BO_PRIORITY)
 			break;
 
-		cursor->cur = &man->lru[cursor->priority];
+		list_move(&cursor->hitch.link, &man->lru[cursor->priority]);
 	}
 
+	ttm_resource_cursor_fini_locked(cursor);
+
 	return NULL;
 }
 
diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
index 7d81fd5b5b83..8fac781f641e 100644
--- a/include/drm/ttm/ttm_resource.h
+++ b/include/drm/ttm/ttm_resource.h
@@ -273,17 +273,22 @@ ttm_lru_item_to_res(struct ttm_lru_item *item)
  * struct ttm_resource_cursor
  *
  * @man: The resource manager currently being iterated over.
- * @cur: The list head the cursor currently points to.
+ * @hitch: A hitch list node inserted before the next resource
+ * to iterate over.
  * @priority: the current priority
  *
  * Cursor to iterate over the resources in a manager.
  */
 struct ttm_resource_cursor {
 	struct ttm_resource_manager *man;
-	struct list_head *cur;
+	struct ttm_lru_item hitch;
 	unsigned int priority;
 };
 
+void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor *cursor);
+
+void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor);
+
 /**
  * struct ttm_lru_bulk_move_pos
  *
-- 
2.44.0



* [PATCH v6 04/12] drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist moves
  2024-07-03 15:38 [PATCH v6 00/12] TTM shrinker helpers and xe buffer object shrinker Thomas Hellström
                   ` (2 preceding siblings ...)
  2024-07-03 15:38 ` [PATCH v6 03/12] drm/ttm: Use LRU hitches Thomas Hellström
@ 2024-07-03 15:38 ` Thomas Hellström
  2024-07-03 17:53   ` Matthew Brost
  2024-07-04  9:21   ` Christian König
  2024-07-03 15:38 ` [PATCH v6 05/12] drm/ttm: Provide a generic LRU walker helper Thomas Hellström
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 38+ messages in thread
From: Thomas Hellström @ 2024-07-03 15:38 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Christian König,
	Somalapuram Amaranath, Matthew Brost, dri-devel

To address the problem with hitches moving when bulk move
sublists are lru-bumped, register the list cursors with the
ttm_lru_bulk_move structure when traversing its list, and
when lru-bumping the list, move the cursor hitch to the tail.
This also means it's mandatory for drivers to call
ttm_lru_bulk_move_init() and ttm_lru_bulk_move_fini() when
initializing and finalizing the bulk move structure, so add
those calls to the amdgpu and xe drivers.

Compared to v1 this is slightly more code but less fragile
and hopefully easier to understand.
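
A sketch (not part of the patch) of what the now-mandatory init/fini
pairing looks like from a driver's point of view; struct example_vm is a
hypothetical stand-in for amdgpu_vm / xe_vm:

/* Sketch only: drivers must now pair bulk move init and fini. */
struct example_vm {
	struct ttm_lru_bulk_move lru_bulk_move;
	/* ... other driver VM state ... */
};

static void example_vm_init(struct example_vm *vm)
{
	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
}

static void example_vm_fini(struct ttm_device *bdev, struct example_vm *vm)
{
	/* Detaches any cursors still registered on the bulk sublists. */
	ttm_lru_bulk_move_fini(bdev, &vm->lru_bulk_move);
}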

Changes in previous series:
- Completely rework the functionality
- Avoid a NULL pointer dereference assigning manager->mem_type
- Remove some leftover code causing build problems
v2:
- For hitch bulk tail moves, store the mem_type in the cursor
  instead of with the manager.
v3:
- Remove leftover mem_type member from change in v2.
v6:
- Add some lockdep asserts (Matthew Brost)
- Avoid NULL pointer dereference (Matthew Brost)
- No need to check bo->resource before dereferencing
  bo->bulk_move (Matthew Brost)

Cc: Christian König <christian.koenig@amd.com>
Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c |  4 ++
 drivers/gpu/drm/ttm/ttm_resource.c     | 92 ++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_vm.c             |  4 ++
 include/drm/ttm/ttm_resource.h         | 56 ++++++++++------
 4 files changed, 135 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 3abfa66d72a2..97743993d711 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2420,6 +2420,8 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	if (r)
 		return r;
 
+	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
+
 	vm->is_compute_context = false;
 
 	vm->use_cpu_for_update = !!(adev->vm_manager.vm_update_mode &
@@ -2484,6 +2486,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 error_free_delayed:
 	dma_fence_put(vm->last_tlb_flush);
 	dma_fence_put(vm->last_unlocked);
+	ttm_lru_bulk_move_fini(&adev->mman.bdev, &vm->lru_bulk_move);
 	amdgpu_vm_fini_entities(vm);
 
 	return r;
@@ -2640,6 +2643,7 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 		}
 	}
 
+	ttm_lru_bulk_move_fini(&adev->mman.bdev, &vm->lru_bulk_move);
 }
 
 /**
diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
index 9c8b6499edfb..b6a2daac5518 100644
--- a/drivers/gpu/drm/ttm/ttm_resource.c
+++ b/drivers/gpu/drm/ttm/ttm_resource.c
@@ -33,6 +33,53 @@
 
 #include <drm/drm_util.h>
 
+/* Detach the cursor from the bulk move list */
+static void
+ttm_resource_cursor_clear_bulk(struct ttm_resource_cursor *cursor)
+{
+	lockdep_assert_held(&cursor->man->bdev->lru_lock);
+
+	cursor->bulk = NULL;
+	list_del_init(&cursor->bulk_link);
+}
+
+/* Move the cursor to the end of the bulk move list it's in */
+static void ttm_resource_cursor_move_bulk_tail(struct ttm_lru_bulk_move *bulk,
+					       struct ttm_resource_cursor *cursor)
+{
+	struct ttm_lru_bulk_move_pos *pos;
+
+	lockdep_assert_held(&cursor->man->bdev->lru_lock);
+
+	if (WARN_ON_ONCE(bulk != cursor->bulk)) {
+		list_del_init(&cursor->bulk_link);
+		return;
+	}
+
+	pos = &bulk->pos[cursor->mem_type][cursor->priority];
+	if (pos->last)
+		list_move(&cursor->hitch.link, &pos->last->lru.link);
+	ttm_resource_cursor_clear_bulk(cursor);
+}
+
+/* Move all cursors attached to a bulk move to its end */
+static void ttm_bulk_move_adjust_cursors(struct ttm_lru_bulk_move *bulk)
+{
+	struct ttm_resource_cursor *cursor, *next;
+
+	list_for_each_entry_safe(cursor, next, &bulk->cursor_list, bulk_link)
+		ttm_resource_cursor_move_bulk_tail(bulk, cursor);
+}
+
+/* Remove a cursor from an empty bulk move list */
+static void ttm_bulk_move_drop_cursors(struct ttm_lru_bulk_move *bulk)
+{
+	struct ttm_resource_cursor *cursor, *next;
+
+	list_for_each_entry_safe(cursor, next, &bulk->cursor_list, bulk_link)
+		ttm_resource_cursor_clear_bulk(cursor);
+}
+
 /**
  * ttm_resource_cursor_fini_locked() - Finalize the LRU list cursor usage
  * @cursor: The struct ttm_resource_cursor to finalize.
@@ -45,6 +92,7 @@ void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor *cursor)
 {
 	lockdep_assert_held(&cursor->man->bdev->lru_lock);
 	list_del_init(&cursor->hitch.link);
+	ttm_resource_cursor_clear_bulk(cursor);
 }
 
 /**
@@ -73,9 +121,27 @@ void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor)
 void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk)
 {
 	memset(bulk, 0, sizeof(*bulk));
+	INIT_LIST_HEAD(&bulk->cursor_list);
 }
 EXPORT_SYMBOL(ttm_lru_bulk_move_init);
 
+/**
+ * ttm_lru_bulk_move_fini - finalize a bulk move structure
+ * @bdev: The struct ttm_device
+ * @bulk: the structure to finalize
+ *
+ * Sanity checks that bulk moves don't have any
+ * resources left and hence no cursors attached.
+ */
+void ttm_lru_bulk_move_fini(struct ttm_device *bdev,
+			    struct ttm_lru_bulk_move *bulk)
+{
+	spin_lock(&bdev->lru_lock);
+	ttm_bulk_move_drop_cursors(bulk);
+	spin_unlock(&bdev->lru_lock);
+}
+EXPORT_SYMBOL(ttm_lru_bulk_move_fini);
+
 /**
  * ttm_lru_bulk_move_tail - bulk move range of resources to the LRU tail.
  *
@@ -88,6 +154,7 @@ void ttm_lru_bulk_move_tail(struct ttm_lru_bulk_move *bulk)
 {
 	unsigned i, j;
 
+	ttm_bulk_move_adjust_cursors(bulk);
 	for (i = 0; i < TTM_NUM_MEM_TYPES; ++i) {
 		for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) {
 			struct ttm_lru_bulk_move_pos *pos = &bulk->pos[i][j];
@@ -515,6 +582,28 @@ void ttm_resource_manager_debug(struct ttm_resource_manager *man,
 }
 EXPORT_SYMBOL(ttm_resource_manager_debug);
 
+static void
+ttm_resource_cursor_check_bulk(struct ttm_resource_cursor *cursor,
+			       struct ttm_lru_item *next_lru)
+{
+	struct ttm_resource *next = ttm_lru_item_to_res(next_lru);
+	struct ttm_lru_bulk_move *bulk = NULL;
+	struct ttm_buffer_object *bo = next->bo;
+
+	lockdep_assert_held(&cursor->man->bdev->lru_lock);
+	bulk = bo->bulk_move;
+
+	if (cursor->bulk != bulk) {
+		if (bulk) {
+			list_move_tail(&cursor->bulk_link, &bulk->cursor_list);
+			cursor->mem_type = next->mem_type;
+		} else {
+			list_del_init(&cursor->bulk_link);
+		}
+		cursor->bulk = bulk;
+	}
+}
+
 /**
  * ttm_resource_manager_first() - Start iterating over the resources
  * of a resource manager
@@ -535,6 +624,7 @@ ttm_resource_manager_first(struct ttm_resource_manager *man,
 	cursor->priority = 0;
 	cursor->man = man;
 	ttm_lru_item_init(&cursor->hitch, TTM_LRU_HITCH);
+	INIT_LIST_HEAD(&cursor->bulk_link);
 	list_add(&cursor->hitch.link, &man->lru[cursor->priority]);
 
 	return ttm_resource_manager_next(cursor);
@@ -559,6 +649,7 @@ ttm_resource_manager_next(struct ttm_resource_cursor *cursor)
 		lru = &cursor->hitch;
 		list_for_each_entry_continue(lru, &man->lru[cursor->priority], link) {
 			if (ttm_lru_item_is_res(lru)) {
+				ttm_resource_cursor_check_bulk(cursor, lru);
 				list_move(&cursor->hitch.link, &lru->link);
 				return ttm_lru_item_to_res(lru);
 			}
@@ -568,6 +659,7 @@ ttm_resource_manager_next(struct ttm_resource_cursor *cursor)
 			break;
 
 		list_move(&cursor->hitch.link, &man->lru[cursor->priority]);
+		ttm_resource_cursor_clear_bulk(cursor);
 	}
 
 	ttm_resource_cursor_fini_locked(cursor);
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 5b166fa03684..0c7e327bc9a2 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -1335,6 +1335,8 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 
 	INIT_WORK(&vm->destroy_work, vm_destroy_work_func);
 
+	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
+
 	INIT_LIST_HEAD(&vm->preempt.exec_queues);
 	vm->preempt.min_run_period_ms = 10;	/* FIXME: Wire up to uAPI */
 
@@ -1458,6 +1460,7 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 	mutex_destroy(&vm->snap_mutex);
 	for_each_tile(tile, xe, id)
 		xe_range_fence_tree_fini(&vm->rftree[id]);
+	ttm_lru_bulk_move_fini(&xe->ttm, &vm->lru_bulk_move);
 	kfree(vm);
 	if (flags & XE_VM_FLAG_LR_MODE)
 		xe_pm_runtime_put(xe);
@@ -1601,6 +1604,7 @@ static void vm_destroy_work_func(struct work_struct *w)
 		XE_WARN_ON(vm->pt_root[id]);
 
 	trace_xe_vm_free(vm);
+	ttm_lru_bulk_move_fini(&xe->ttm, &vm->lru_bulk_move);
 	kfree(vm);
 }
 
diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
index 8fac781f641e..571abb4861a6 100644
--- a/include/drm/ttm/ttm_resource.h
+++ b/include/drm/ttm/ttm_resource.h
@@ -269,26 +269,6 @@ ttm_lru_item_to_res(struct ttm_lru_item *item)
 	return container_of(item, struct ttm_resource, lru);
 }
 
-/**
- * struct ttm_resource_cursor
- *
- * @man: The resource manager currently being iterated over.
- * @hitch: A hitch list node inserted before the next resource
- * to iterate over.
- * @priority: the current priority
- *
- * Cursor to iterate over the resources in a manager.
- */
-struct ttm_resource_cursor {
-	struct ttm_resource_manager *man;
-	struct ttm_lru_item hitch;
-	unsigned int priority;
-};
-
-void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor *cursor);
-
-void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor);
-
 /**
  * struct ttm_lru_bulk_move_pos
  *
@@ -304,8 +284,9 @@ struct ttm_lru_bulk_move_pos {
 
 /**
  * struct ttm_lru_bulk_move
- *
  * @pos: first/last lru entry for resources in the each domain/priority
+ * @cursor_list: The list of cursors currently traversing any of
+ * the sublists of @pos. Protected by the ttm device's lru_lock.
  *
  * Container for the current bulk move state. Should be used with
  * ttm_lru_bulk_move_init() and ttm_bo_set_bulk_move().
@@ -315,8 +296,39 @@ struct ttm_lru_bulk_move_pos {
  */
 struct ttm_lru_bulk_move {
 	struct ttm_lru_bulk_move_pos pos[TTM_NUM_MEM_TYPES][TTM_MAX_BO_PRIORITY];
+	struct list_head cursor_list;
 };
 
+/**
+ * struct ttm_resource_cursor
+ * @man: The resource manager currently being iterated over
+ * @hitch: A hitch list node inserted before the next resource
+ * to iterate over.
+ * @bulk_link: A list link for the list of cursors traversing the
+ * bulk sublist of @bulk. Protected by the ttm device's lru_lock.
+ * @bulk: Pointer to struct ttm_lru_bulk_move whose subrange @hitch is
+ * inserted to. NULL if none. Never dereference this pointer since
+ * the struct ttm_lru_bulk_move object pointed to might have been
+ * freed. The pointer is only for comparison.
+ * @mem_type: The memory type of the LRU list being traversed.
+ * This field is valid iff @bulk != NULL.
+ * @priority: the current priority
+ *
+ * Cursor to iterate over the resources in a manager.
+ */
+struct ttm_resource_cursor {
+	struct ttm_resource_manager *man;
+	struct ttm_lru_item hitch;
+	struct list_head bulk_link;
+	struct ttm_lru_bulk_move *bulk;
+	unsigned int mem_type;
+	unsigned int priority;
+};
+
+void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor *cursor);
+
+void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor);
+
 /**
  * struct ttm_kmap_iter_iomap - Specialization for a struct io_mapping +
  * struct sg_table backed struct ttm_resource.
@@ -405,6 +417,8 @@ ttm_resource_manager_cleanup(struct ttm_resource_manager *man)
 
 void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk);
 void ttm_lru_bulk_move_tail(struct ttm_lru_bulk_move *bulk);
+void ttm_lru_bulk_move_fini(struct ttm_device *bdev,
+			    struct ttm_lru_bulk_move *bulk);
 
 void ttm_resource_add_bulk_move(struct ttm_resource *res,
 				struct ttm_buffer_object *bo);
-- 
2.44.0



* [PATCH v6 05/12] drm/ttm: Provide a generic LRU walker helper
  2024-07-03 15:38 [PATCH v6 00/12] TTM shrinker helpers and xe buffer object shrinker Thomas Hellström
                   ` (3 preceding siblings ...)
  2024-07-03 15:38 ` [PATCH v6 04/12] drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist moves Thomas Hellström
@ 2024-07-03 15:38 ` Thomas Hellström
  2024-07-03 15:38 ` [PATCH v6 06/12] drm/ttm: Use the LRU walker helper for swapping Thomas Hellström
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Thomas Hellström @ 2024-07-03 15:38 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Christian König,
	Somalapuram Amaranath, Matthew Brost, dri-devel

Provide a generic LRU walker in TTM, in the spirit of drm_gem_lru_scan()
but building on the restartable TTM LRU functionality.

The LRU walker optionally supports locking objects as part of
a ww mutex locking transaction, to mimic to some extent the
current functionality in ttm. However any -EDEADLK return
is converted to -ENOSPC and then to -ENOMEM before reaching
the driver, so that the driver will need to backoff and possibly retry
without being able to keep the ticket.
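
For illustration, a minimal sketch (not part of the patch) of a walker
built on the new helper; it simply counts the pages of each buffer
object it manages to lock:

/* Sketch only: a trivial process_bo() counting pages of locked objects. */
static long example_process_bo(struct ttm_lru_walk *walk,
			       struct ttm_buffer_object *bo)
{
	/* bo arrives locked and referenced; return progress, or -EBUSY to skip. */
	return PFN_UP(bo->base.size);
}

static const struct ttm_lru_walk_ops example_walk_ops = {
	.process_bo = example_process_bo,
};

static long example_walk(struct ttm_device *bdev,
			 struct ttm_resource_manager *man,
			 struct ttm_operation_ctx *ctx, long target)
{
	struct ttm_lru_walk walk = {
		.ops = &example_walk_ops,
		.ctx = ctx,
		.trylock_only = true,	/* no ww ticket: trylock behaviour */
	};

	return ttm_lru_walk_for_evict(&walk, bdev, man, target);
}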

v3:
- Move the helper to core ttm.
- Remove the drm_exec usage from it for now, it will be
  reintroduced later in the series.
v4:
- Handle the -EALREADY case if ticketlocking.
v6:
- Some cleanup and added code comments (Matthew Brost)
- Clarified the ticketlock in the commit message (Matthew Brost)

Cc: Christian König <christian.koenig@amd.com>
Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo_util.c | 156 ++++++++++++++++++++++++++++++
 include/drm/ttm/ttm_bo.h          |  35 +++++++
 2 files changed, 191 insertions(+)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 0b3f4267130c..c4f678f30fc2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -768,3 +768,159 @@ int ttm_bo_pipeline_gutting(struct ttm_buffer_object *bo)
 	ttm_tt_destroy(bo->bdev, ttm);
 	return ret;
 }
+
+static bool ttm_lru_walk_trylock(struct ttm_lru_walk *walk,
+				 struct ttm_buffer_object *bo,
+				 bool *needs_unlock)
+{
+	struct ttm_operation_ctx *ctx = walk->ctx;
+
+	*needs_unlock = false;
+
+	if (dma_resv_trylock(bo->base.resv)) {
+		*needs_unlock = true;
+		return true;
+	}
+
+	if (bo->base.resv == ctx->resv && ctx->allow_res_evict) {
+		dma_resv_assert_held(bo->base.resv);
+		return true;
+	}
+
+	return false;
+}
+
+static int ttm_lru_walk_ticketlock(struct ttm_lru_walk *walk,
+				   struct ttm_buffer_object *bo,
+				   bool *needs_unlock)
+{
+	struct dma_resv *resv = bo->base.resv;
+	int ret;
+
+	if (walk->ctx->interruptible)
+		ret = dma_resv_lock_interruptible(resv, walk->ticket);
+	else
+		ret = dma_resv_lock(resv, walk->ticket);
+
+	if (!ret) {
+		*needs_unlock = true;
+		/*
+		 * Only a single ticketlock per loop. Ticketlocks are prone
+		 * to return -EDEADLK causing the eviction to fail, so
+		 * after waiting for the ticketlock, revert back to
+		 * trylocking for this walk.
+		 */
+		walk->ticket = NULL;
+	} else if (ret == -EDEADLK) {
+		/* Caller needs to exit the ww transaction. */
+		ret = -ENOSPC;
+	}
+
+	return ret;
+}
+
+static void ttm_lru_walk_unlock(struct ttm_buffer_object *bo, bool locked)
+{
+	if (locked)
+		dma_resv_unlock(bo->base.resv);
+}
+
+/**
+ * ttm_lru_walk_for_evict() - Perform a LRU list walk, with actions taken on
+ * valid items.
+ * @walk: describe the walks and actions taken
+ * @bdev: The TTM device.
+ * @man: The struct ttm_resource manager whose LRU lists we're walking.
+ * @target: The end condition for the walk.
+ *
+ * The LRU lists of @man are walked, and for each struct ttm_resource encountered,
+ * the corresponding ttm_buffer_object is locked and taken a reference on, and
+ * the LRU lock is dropped. The LRU lock may be dropped before locking and, in
+ * that case, it's verified that the item actually remains on the LRU list after
+ * the lock, and that the buffer object didn't switch resource in between.
+ *
+ * With a locked object, the actions indicated by @walk->process_bo are
+ * performed, and after that, the bo is unlocked, the refcount dropped and the
+ * next struct ttm_resource is processed. Here, the walker relies on
+ * TTM's restartable LRU list implementation.
+ *
+ * Typically @walk->process_bo() would return the number of pages evicted,
+ * swapped or shrunken, so that when the total exceeds @target, or when the
+ * LRU list has been walked in full, iteration is terminated. It's also terminated
+ * on error. Note that the definition of @target is done by the caller, it
+ * could have a different meaning than the number of pages.
+ *
+ * Note that the way dma_resv individualization is done, locking needs to be done
+ * either with the LRU lock held (trylocking only) or with a reference on the
+ * object.
+ *
+ * Return: The progress made towards target or negative error code on error.
+ */
+long ttm_lru_walk_for_evict(struct ttm_lru_walk *walk, struct ttm_device *bdev,
+			    struct ttm_resource_manager *man, long target)
+{
+	struct ttm_resource_cursor cursor;
+	struct ttm_resource *res;
+	long progress = 0;
+	long lret;
+
+	spin_lock(&bdev->lru_lock);
+	ttm_resource_manager_for_each_res(man, &cursor, res) {
+		struct ttm_buffer_object *bo = res->bo;
+		bool bo_needs_unlock = false;
+		bool bo_locked = false;
+		int mem_type;
+
+		if (!bo || bo->resource != res)
+			continue;
+
+		/*
+		 * Attempt a trylock before taking a reference on the bo,
+		 * since if we do it the other way around, and the trylock fails,
+		 * we need to drop the lru lock to put the bo.
+		 */
+
+		if (ttm_lru_walk_trylock(walk, bo, &bo_needs_unlock))
+			bo_locked = true;
+		else if (!walk->ticket || walk->ctx->no_wait_gpu ||
+			 walk->trylock_only)
+			continue;
+
+		if (!ttm_bo_get_unless_zero(bo)) {
+			ttm_lru_walk_unlock(bo, bo_needs_unlock);
+			continue;
+		}
+
+		mem_type = res->mem_type;
+		spin_unlock(&bdev->lru_lock);
+
+		lret = 0;
+		if (!bo_locked)
+			lret = ttm_lru_walk_ticketlock(walk, bo, &bo_needs_unlock);
+
+		/*
+		 * Note that in between the release of the lru lock and the
+		 * ticketlock, the bo may have switched resource,
+		 * and also memory type, since the resource may have been
+		 * freed and allocated again with a different memory type.
+		 * In that case, just skip it.
+		 */
+		if (!lret && bo->resource == res && res->mem_type == mem_type)
+			lret = walk->ops->process_bo(walk, bo);
+
+		ttm_lru_walk_unlock(bo, bo_needs_unlock);
+		ttm_bo_put(bo);
+		if (lret == -EBUSY || lret == -EALREADY)
+			lret = 0;
+		progress = (lret < 0) ? lret : progress + lret;
+
+		cond_resched();
+		spin_lock(&bdev->lru_lock);
+		if (progress < 0 || progress >= target)
+			break;
+	}
+	ttm_resource_cursor_fini_locked(&cursor);
+	spin_unlock(&bdev->lru_lock);
+
+	return progress;
+}
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index ef0f52f56ebc..10bff3aecd5c 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -194,6 +194,41 @@ struct ttm_operation_ctx {
 	uint64_t bytes_moved;
 };
 
+struct ttm_lru_walk;
+
+/** struct ttm_lru_walk_ops - Operations for a LRU walk. */
+struct ttm_lru_walk_ops {
+	/**
+	 * process_bo - Process this bo.
+	 * @walk: struct ttm_lru_walk describing the walk.
+	 * @bo: A locked and referenced buffer object.
+	 *
+	 * Return: Negative error code on error, user-defined positive value
+	 * (typically, but not always, the number of processed pages) on success.
+	 * On success, the returned values are summed by the walk and the
+	 * walk exits when its target is met.
+	 * 0 also indicates success, -EBUSY means this bo was skipped.
+	 */
+	long (*process_bo)(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo);
+};
+
+/**
+ * struct ttm_lru_walk - Structure describing a LRU walk.
+ */
+struct ttm_lru_walk {
+	/** @ops: Pointer to the ops structure. */
+	const struct ttm_lru_walk_ops *ops;
+	/** @ctx: Pointer to the struct ttm_operation_ctx. */
+	struct ttm_operation_ctx *ctx;
+	/** @ticket: The struct ww_acquire_ctx if any. */
+	struct ww_acquire_ctx *ticket;
+	/** @trylock_only: Only use trylock for locking. */
+	bool trylock_only;
+};
+
+long ttm_lru_walk_for_evict(struct ttm_lru_walk *walk, struct ttm_device *bdev,
+			    struct ttm_resource_manager *man, long target);
+
 /**
  * ttm_bo_get - reference a struct ttm_buffer_object
  *
-- 
2.44.0



* [PATCH v6 06/12] drm/ttm: Use the LRU walker helper for swapping
  2024-07-03 15:38 [PATCH v6 00/12] TTM shrinker helpers and xe buffer object shrinker Thomas Hellström
                   ` (4 preceding siblings ...)
  2024-07-03 15:38 ` [PATCH v6 05/12] drm/ttm: Provide a generic LRU walker helper Thomas Hellström
@ 2024-07-03 15:38 ` Thomas Hellström
  2024-07-03 18:24   ` Matthew Brost
  2024-07-03 15:38 ` [PATCH v6 07/12] drm/ttm: Use the LRU walker for eviction Thomas Hellström
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Thomas Hellström @ 2024-07-03 15:38 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Christian König,
	Somalapuram Amaranath, Matthew Brost, dri-devel

Rework the TTM swapping to use the LRU walker helper.
This helps fix up the ttm_bo_swapout() interface
to be consistent about not requiring any locking.

For now mimic the current behaviour of using trylock
only. We could be using ticket-locks here but defer
that until it's deemed necessary. The TTM swapout
functionality is a bit weird anyway since it
alternates between memory types without exhausting
TTM_PL_SYSTEM first.
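
A hedged sketch (not part of the patch) of a caller of the reworked
interface, which now takes the device, manager and a page target and
returns the number of pages swapped out:

/* Sketch only: swap out up to 'target' pages from one manager. */
static long example_swapout(struct ttm_device *bdev,
			    struct ttm_resource_manager *man,
			    pgoff_t target)
{
	struct ttm_operation_ctx ctx = { .interruptible = false };
	long swapped = ttm_bo_swapout(bdev, &ctx, man, GFP_KERNEL, target);

	/* >= 0: number of pages actually swapped out; < 0: error. */
	return swapped;
}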

Cc: Christian König <christian.koenig@amd.com>
Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <dri-devel@lists.freedesktop.org>

v6:
- Improve on error code translation in the swapout callback
  (Matthew Brost).

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c     | 111 ++++++++++++++++++++-----------
 drivers/gpu/drm/ttm/ttm_device.c |  30 ++-------
 include/drm/ttm/ttm_bo.h         |   5 +-
 3 files changed, 82 insertions(+), 64 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 43eda720657f..1053cdca131e 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -1118,11 +1118,23 @@ int ttm_bo_wait_ctx(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx)
 }
 EXPORT_SYMBOL(ttm_bo_wait_ctx);
 
-int ttm_bo_swapout(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx,
-		   gfp_t gfp_flags)
+/**
+ * struct ttm_bo_swapout_walk - Parameters for the swapout walk
+ */
+struct ttm_bo_swapout_walk {
+	/** @walk: The walk base parameters. */
+	struct ttm_lru_walk walk;
+	/** @gfp_flags: The gfp flags to use for ttm_tt_swapout() */
+	gfp_t gfp_flags;
+};
+
+static long
+ttm_bo_swapout_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo)
 {
-	struct ttm_place place;
-	bool locked;
+	struct ttm_place place = {.mem_type = bo->resource->mem_type};
+	struct ttm_bo_swapout_walk *swapout_walk =
+		container_of(walk, typeof(*swapout_walk), walk);
+	struct ttm_operation_ctx *ctx = walk->ctx;
 	long ret;
 
 	/*
@@ -1131,28 +1143,29 @@ int ttm_bo_swapout(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx,
 	 * The driver may use the fact that we're moving from SYSTEM
 	 * as an indication that we're about to swap out.
 	 */
-	memset(&place, 0, sizeof(place));
-	place.mem_type = bo->resource->mem_type;
-	if (!ttm_bo_evict_swapout_allowable(bo, ctx, &place, &locked, NULL))
-		return -EBUSY;
+	if (!bo->bdev->funcs->eviction_valuable(bo, &place)) {
+		ret = -EBUSY;
+		goto out;
+	}
 
 	if (!bo->ttm || !ttm_tt_is_populated(bo->ttm) ||
 	    bo->ttm->page_flags & TTM_TT_FLAG_EXTERNAL ||
-	    bo->ttm->page_flags & TTM_TT_FLAG_SWAPPED ||
-	    !ttm_bo_get_unless_zero(bo)) {
-		if (locked)
-			dma_resv_unlock(bo->base.resv);
-		return -EBUSY;
+	    bo->ttm->page_flags & TTM_TT_FLAG_SWAPPED) {
+		ret = -EBUSY;
+		goto out;
 	}
 
 	if (bo->deleted) {
-		ret = ttm_bo_cleanup_refs(bo, false, false, locked);
-		ttm_bo_put(bo);
-		return ret == -EBUSY ? -ENOSPC : ret;
-	}
+		pgoff_t num_pages = bo->ttm->num_pages;
 
-	/* TODO: Cleanup the locking */
-	spin_unlock(&bo->bdev->lru_lock);
+		ret = ttm_bo_wait_ctx(bo, ctx);
+		if (ret)
+			goto out;
+
+		ttm_bo_cleanup_memtype_use(bo);
+		ret = num_pages;
+		goto out;
+	}
 
 	/*
 	 * Move to system cached
@@ -1164,12 +1177,13 @@ int ttm_bo_swapout(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx,
 		memset(&hop, 0, sizeof(hop));
 		place.mem_type = TTM_PL_SYSTEM;
 		ret = ttm_resource_alloc(bo, &place, &evict_mem);
-		if (unlikely(ret))
+		if (ret)
 			goto out;
 
 		ret = ttm_bo_handle_move_mem(bo, evict_mem, true, ctx, &hop);
-		if (unlikely(ret != 0)) {
-			WARN(ret == -EMULTIHOP, "Unexpected multihop in swaput - likely driver bug.\n");
+		if (ret) {
+			WARN(ret == -EMULTIHOP,
+			     "Unexpected multihop in swapout - likely driver bug.\n");
 			ttm_resource_free(bo, &evict_mem);
 			goto out;
 		}
@@ -1179,30 +1193,53 @@ int ttm_bo_swapout(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx,
 	 * Make sure BO is idle.
 	 */
 	ret = ttm_bo_wait_ctx(bo, ctx);
-	if (unlikely(ret != 0))
+	if (ret)
 		goto out;
 
 	ttm_bo_unmap_virtual(bo);
-
-	/*
-	 * Swap out. Buffer will be swapped in again as soon as
-	 * anyone tries to access a ttm page.
-	 */
 	if (bo->bdev->funcs->swap_notify)
 		bo->bdev->funcs->swap_notify(bo);
 
 	if (ttm_tt_is_populated(bo->ttm))
-		ret = ttm_tt_swapout(bo->bdev, bo->ttm, gfp_flags);
+		ret = ttm_tt_swapout(bo->bdev, bo->ttm, swapout_walk->gfp_flags);
 out:
+	/* Consider -ENOMEM and -ENOSPC non-fatal. */
+	if (ret == -ENOMEM || ret == -ENOSPC)
+		ret = -EBUSY;
 
-	/*
-	 * Unreserve without putting on LRU to avoid swapping out an
-	 * already swapped buffer.
-	 */
-	if (locked)
-		dma_resv_unlock(bo->base.resv);
-	ttm_bo_put(bo);
-	return ret == -EBUSY ? -ENOSPC : ret;
+	return ret;
+}
+
+const struct ttm_lru_walk_ops ttm_swap_ops = {
+	.process_bo = ttm_bo_swapout_cb,
+};
+
+/**
+ * ttm_bo_swapout() - Swap out buffer objects on the LRU list to shmem.
+ * @bdev: The ttm device.
+ * @ctx: The ttm_operation_ctx governing the swapout operation.
+ * @man: The resource manager whose resources / buffer objects are
+ * going to be swapped out.
+ * @gfp_flags: The gfp flags used for shmem page allocations.
+ * @target: The desired number of pages to swap out.
+ *
+ * Return: The number of pages actually swapped out, or negative error code
+ * on error.
+ */
+long ttm_bo_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx,
+		    struct ttm_resource_manager *man, gfp_t gfp_flags,
+		    pgoff_t target)
+{
+	struct ttm_bo_swapout_walk swapout_walk = {
+		.walk = {
+			.ops = &ttm_swap_ops,
+			.ctx = ctx,
+			.trylock_only = true,
+		},
+		.gfp_flags = gfp_flags,
+	};
+
+	return ttm_lru_walk_for_evict(&swapout_walk.walk, bdev, man, target);
 }
 
 void ttm_bo_tt_destroy(struct ttm_buffer_object *bo)
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index f9e9b1ec8c8a..ee575d8a54c0 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -148,40 +148,20 @@ int ttm_global_swapout(struct ttm_operation_ctx *ctx, gfp_t gfp_flags)
 int ttm_device_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx,
 		       gfp_t gfp_flags)
 {
-	struct ttm_resource_cursor cursor;
 	struct ttm_resource_manager *man;
-	struct ttm_resource *res;
 	unsigned i;
-	int ret;
+	long lret;
 
-	spin_lock(&bdev->lru_lock);
 	for (i = TTM_PL_SYSTEM; i < TTM_NUM_MEM_TYPES; ++i) {
 		man = ttm_manager_type(bdev, i);
 		if (!man || !man->use_tt)
 			continue;
 
-		ttm_resource_manager_for_each_res(man, &cursor, res) {
-			struct ttm_buffer_object *bo = res->bo;
-			uint32_t num_pages;
-
-			if (!bo || bo->resource != res)
-				continue;
-
-			num_pages = PFN_UP(bo->base.size);
-			ret = ttm_bo_swapout(bo, ctx, gfp_flags);
-			/* ttm_bo_swapout has dropped the lru_lock */
-			if (!ret) {
-				ttm_resource_cursor_fini(&cursor);
-				return num_pages;
-			}
-			if (ret != -EBUSY) {
-				ttm_resource_cursor_fini(&cursor);
-				return ret;
-			}
-		}
+		lret = ttm_bo_swapout(bdev, ctx, man, gfp_flags, 1);
+		/* Can be both positive (num_pages) and negative (error) */
+		if (lret)
+			return lret;
 	}
-	ttm_resource_cursor_fini_locked(&cursor);
-	spin_unlock(&bdev->lru_lock);
 	return 0;
 }
 EXPORT_SYMBOL(ttm_device_swapout);
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index 10bff3aecd5c..de97ea9fa75f 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -417,8 +417,9 @@ void ttm_bo_kunmap(struct ttm_bo_kmap_obj *map);
 int ttm_bo_vmap(struct ttm_buffer_object *bo, struct iosys_map *map);
 void ttm_bo_vunmap(struct ttm_buffer_object *bo, struct iosys_map *map);
 int ttm_bo_mmap_obj(struct vm_area_struct *vma, struct ttm_buffer_object *bo);
-int ttm_bo_swapout(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx,
-		   gfp_t gfp_flags);
+long ttm_bo_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx,
+		    struct ttm_resource_manager *man, gfp_t gfp_flags,
+		    pgoff_t target);
 void ttm_bo_pin(struct ttm_buffer_object *bo);
 void ttm_bo_unpin(struct ttm_buffer_object *bo);
 int ttm_mem_evict_first(struct ttm_device *bdev,
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v6 07/12] drm/ttm: Use the LRU walker for eviction
  2024-07-03 15:38 [PATCH v6 00/12] TTM shrinker helpers and xe buffer object shrinker Thomas Hellström
                   ` (5 preceding siblings ...)
  2024-07-03 15:38 ` [PATCH v6 06/12] drm/ttm: Use the LRU walker helper for swapping Thomas Hellström
@ 2024-07-03 15:38 ` Thomas Hellström
  2024-07-03 19:20   ` Matthew Brost
  2024-07-03 15:38 ` [PATCH v6 08/12] drm/ttm: Add a virtual base class for graphics memory backup Thomas Hellström
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Thomas Hellström @ 2024-07-03 15:38 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Christian König,
	Somalapuram Amaranath, Matthew Brost, dri-devel

Use the LRU walker for eviction. This helps
remove a lot of code with weird locking
semantics.

The functionality is slightly changed so that
when trylocked buffer objects are exhausted, we
continue to interleave walks with ticket-locks while
progress is still being made. The list walks are
not restarted in between evictions.

Also provide a separate ttm_bo_evict_first()
function for its single user. The context of that
user allows sleeping dma_resv locks.
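
The retry pattern (a sketch mirroring ttm_bo_evict_alloc() below, not
a separate interface) interleaves a trylock-only pass with repeated
ticket-locked passes for as long as the previous pass made progress;
here 'evicted' stands in for a counter that the process_bo callback
increments:

static long example_evict_retry(struct ttm_lru_walk *walk,
				struct ttm_device *bdev,
				struct ttm_resource_manager *man,
				struct ww_acquire_ctx *ticket,
				unsigned long *evicted)
{
	long lret;

	walk->trylock_only = true;
	lret = ttm_lru_walk_for_evict(walk, bdev, man, 1);
	if (lret || !ticket)
		return lret;

	walk->trylock_only = false;
	do {
		/* The walk may clear the ticket field; re-arm it. */
		walk->ticket = ticket;
		*evicted = 0;
		lret = ttm_lru_walk_for_evict(walk, bdev, man, 1);
	} while (!lret && *evicted);

	return lret;
}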

v6:
- Various cleanups suggested by Matthew Brost.
- Fix error return code of ttm_bo_evict_first(). (Matthew Brost)
- Fix an error check that was inverted. (Matthew Brost)

Cc: Christian König <christian.koenig@amd.com>
Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c       | 346 ++++++++++++-----------------
 drivers/gpu/drm/ttm/ttm_resource.c |  21 +-
 include/drm/ttm/ttm_bo.h           |   8 +-
 3 files changed, 144 insertions(+), 231 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 1053cdca131e..603b9353f436 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -224,80 +224,6 @@ static void ttm_bo_flush_all_fences(struct ttm_buffer_object *bo)
 	dma_resv_iter_end(&cursor);
 }
 
-/**
- * ttm_bo_cleanup_refs
- * If bo idle, remove from lru lists, and unref.
- * If not idle, block if possible.
- *
- * Must be called with lru_lock and reservation held, this function
- * will drop the lru lock and optionally the reservation lock before returning.
- *
- * @bo:                    The buffer object to clean-up
- * @interruptible:         Any sleeps should occur interruptibly.
- * @no_wait_gpu:           Never wait for gpu. Return -EBUSY instead.
- * @unlock_resv:           Unlock the reservation lock as well.
- */
-
-static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo,
-			       bool interruptible, bool no_wait_gpu,
-			       bool unlock_resv)
-{
-	struct dma_resv *resv = &bo->base._resv;
-	int ret;
-
-	if (dma_resv_test_signaled(resv, DMA_RESV_USAGE_BOOKKEEP))
-		ret = 0;
-	else
-		ret = -EBUSY;
-
-	if (ret && !no_wait_gpu) {
-		long lret;
-
-		if (unlock_resv)
-			dma_resv_unlock(bo->base.resv);
-		spin_unlock(&bo->bdev->lru_lock);
-
-		lret = dma_resv_wait_timeout(resv, DMA_RESV_USAGE_BOOKKEEP,
-					     interruptible,
-					     30 * HZ);
-
-		if (lret < 0)
-			return lret;
-		else if (lret == 0)
-			return -EBUSY;
-
-		spin_lock(&bo->bdev->lru_lock);
-		if (unlock_resv && !dma_resv_trylock(bo->base.resv)) {
-			/*
-			 * We raced, and lost, someone else holds the reservation now,
-			 * and is probably busy in ttm_bo_cleanup_memtype_use.
-			 *
-			 * Even if it's not the case, because we finished waiting any
-			 * delayed destruction would succeed, so just return success
-			 * here.
-			 */
-			spin_unlock(&bo->bdev->lru_lock);
-			return 0;
-		}
-		ret = 0;
-	}
-
-	if (ret) {
-		if (unlock_resv)
-			dma_resv_unlock(bo->base.resv);
-		spin_unlock(&bo->bdev->lru_lock);
-		return ret;
-	}
-
-	spin_unlock(&bo->bdev->lru_lock);
-	ttm_bo_cleanup_memtype_use(bo);
-
-	if (unlock_resv)
-		dma_resv_unlock(bo->base.resv);
-
-	return 0;
-}
-
 /*
  * Block for the dma_resv object to become idle, lock the buffer and clean up
  * the resource and tt object.
@@ -505,151 +431,153 @@ bool ttm_bo_eviction_valuable(struct ttm_buffer_object *bo,
 }
 EXPORT_SYMBOL(ttm_bo_eviction_valuable);
 
-/*
- * Check the target bo is allowable to be evicted or swapout, including cases:
- *
- * a. if share same reservation object with ctx->resv, have assumption
- * reservation objects should already be locked, so not lock again and
- * return true directly when either the opreation allow_reserved_eviction
- * or the target bo already is in delayed free list;
+/**
+ * ttm_bo_evict_first() - Evict the first bo on the manager's LRU list.
+ * @bdev: The ttm device.
+ * @man: The manager whose bo to evict.
+ * @ctx: The TTM operation ctx governing the eviction.
  *
- * b. Otherwise, trylock it.
+ * Return: 0 if successful or the resource disappeared. Negative error code on error.
  */
-static bool ttm_bo_evict_swapout_allowable(struct ttm_buffer_object *bo,
-					   struct ttm_operation_ctx *ctx,
-					   const struct ttm_place *place,
-					   bool *locked, bool *busy)
+int ttm_bo_evict_first(struct ttm_device *bdev, struct ttm_resource_manager *man,
+		       struct ttm_operation_ctx *ctx)
 {
-	bool ret = false;
+	struct ttm_resource_cursor cursor;
+	struct ttm_buffer_object *bo;
+	struct ttm_resource *res;
+	unsigned int mem_type;
+	int ret = 0;
 
-	if (bo->pin_count) {
-		*locked = false;
-		if (busy)
-			*busy = false;
-		return false;
+	spin_lock(&bdev->lru_lock);
+	res = ttm_resource_manager_first(man, &cursor);
+	if (!res) {
+		ret = -ENOENT;
+		goto out_no_ref;
 	}
+	bo = res->bo;
+	if (!ttm_bo_get_unless_zero(bo))
+		goto out_no_ref;
+	mem_type = res->mem_type;
+	spin_unlock(&bdev->lru_lock);
+	ret = ttm_bo_reserve(bo, ctx->interruptible, ctx->no_wait_gpu, NULL);
+	if (ret)
+		goto out_no_lock;
+	if (bo->resource != res || res->mem_type != mem_type)
+		goto out_bo_moved;
 
-	if (bo->base.resv == ctx->resv) {
-		dma_resv_assert_held(bo->base.resv);
-		if (ctx->allow_res_evict)
-			ret = true;
-		*locked = false;
-		if (busy)
-			*busy = false;
+	if (bo->deleted) {
+		ret = ttm_bo_wait_ctx(bo, ctx);
+		if (!ret)
+			ttm_bo_cleanup_memtype_use(bo);
 	} else {
-		ret = dma_resv_trylock(bo->base.resv);
-		*locked = ret;
-		if (busy)
-			*busy = !ret;
-	}
-
-	if (ret && place && (bo->resource->mem_type != place->mem_type ||
-		!bo->bdev->funcs->eviction_valuable(bo, place))) {
-		ret = false;
-		if (*locked) {
-			dma_resv_unlock(bo->base.resv);
-			*locked = false;
-		}
+		ret = ttm_bo_evict(bo, ctx);
 	}
+out_bo_moved:
+	dma_resv_unlock(bo->base.resv);
+out_no_lock:
+	ttm_bo_put(bo);
+	ttm_resource_cursor_fini(&cursor);
+	return ret;
 
+out_no_ref:
+	ttm_resource_cursor_fini_locked(&cursor);
+	spin_unlock(&bdev->lru_lock);
 	return ret;
 }
 
 /**
- * ttm_mem_evict_wait_busy - wait for a busy BO to become available
- *
- * @busy_bo: BO which couldn't be locked with trylock
- * @ctx: operation context
- * @ticket: acquire ticket
- *
- * Try to lock a busy buffer object to avoid failing eviction.
+ * struct ttm_bo_evict_walk - Parameters for the evict walk.
  */
-static int ttm_mem_evict_wait_busy(struct ttm_buffer_object *busy_bo,
-				   struct ttm_operation_ctx *ctx,
-				   struct ww_acquire_ctx *ticket)
-{
-	int r;
-
-	if (!busy_bo || !ticket)
-		return -EBUSY;
-
-	if (ctx->interruptible)
-		r = dma_resv_lock_interruptible(busy_bo->base.resv,
-							  ticket);
-	else
-		r = dma_resv_lock(busy_bo->base.resv, ticket);
-
-	/*
-	 * TODO: It would be better to keep the BO locked until allocation is at
-	 * least tried one more time, but that would mean a much larger rework
-	 * of TTM.
-	 */
-	if (!r)
-		dma_resv_unlock(busy_bo->base.resv);
-
-	return r == -EDEADLK ? -EBUSY : r;
-}
+struct ttm_bo_evict_walk {
+	/** @walk: The walk base parameters. */
+	struct ttm_lru_walk walk;
+	/** @place: The place passed to the resource allocation. */
+	const struct ttm_place *place;
+	/** @evictor: The buffer object we're trying to make room for. */
+	struct ttm_buffer_object *evictor;
+	/** @res: The allocated resource if any. */
+	struct ttm_resource **res;
+	/** @evicted: Number of successful evictions. */
+	unsigned long evicted;
+};
 
-int ttm_mem_evict_first(struct ttm_device *bdev,
-			struct ttm_resource_manager *man,
-			const struct ttm_place *place,
-			struct ttm_operation_ctx *ctx,
-			struct ww_acquire_ctx *ticket)
+static long ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo)
 {
-	struct ttm_buffer_object *bo = NULL, *busy_bo = NULL;
-	struct ttm_resource_cursor cursor;
-	struct ttm_resource *res;
-	bool locked = false;
-	int ret;
+	struct ttm_bo_evict_walk *evict_walk =
+		container_of(walk, typeof(*evict_walk), walk);
+	long lret;
 
-	spin_lock(&bdev->lru_lock);
-	ttm_resource_manager_for_each_res(man, &cursor, res) {
-		bool busy;
-
-		if (!ttm_bo_evict_swapout_allowable(res->bo, ctx, place,
-						    &locked, &busy)) {
-			if (busy && !busy_bo && ticket !=
-			    dma_resv_locking_ctx(res->bo->base.resv))
-				busy_bo = res->bo;
-			continue;
-		}
+	if (!bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
+		return 0;
 
-		if (ttm_bo_get_unless_zero(res->bo)) {
-			bo = res->bo;
-			break;
-		}
-		if (locked)
-			dma_resv_unlock(res->bo->base.resv);
+	if (bo->deleted) {
+		lret = ttm_bo_wait_ctx(bo, walk->ctx);
+		if (!lret)
+			ttm_bo_cleanup_memtype_use(bo);
+	} else {
+		lret = ttm_bo_evict(bo, walk->ctx);
 	}
-	ttm_resource_cursor_fini_locked(&cursor);
 
-	if (!bo) {
-		if (busy_bo && !ttm_bo_get_unless_zero(busy_bo))
-			busy_bo = NULL;
-		spin_unlock(&bdev->lru_lock);
-		ret = ttm_mem_evict_wait_busy(busy_bo, ctx, ticket);
-		if (busy_bo)
-			ttm_bo_put(busy_bo);
-		return ret;
-	}
+	if (lret)
+		goto out;
 
-	if (bo->deleted) {
-		ret = ttm_bo_cleanup_refs(bo, ctx->interruptible,
-					  ctx->no_wait_gpu, locked);
-		ttm_bo_put(bo);
-		return ret;
-	}
+	evict_walk->evicted++;
+	if (evict_walk->res)
+		lret = ttm_resource_alloc(evict_walk->evictor, evict_walk->place,
+					  evict_walk->res);
+	if (lret == 0)
+		return 1;
+out:
+	/* Errors that should terminate the walk. */
+	if (lret == -ENOSPC)
+		return -EBUSY;
 
-	spin_unlock(&bdev->lru_lock);
+	return lret;
+}
 
-	ret = ttm_bo_evict(bo, ctx);
-	if (locked)
-		ttm_bo_unreserve(bo);
-	else
-		ttm_bo_move_to_lru_tail_unlocked(bo);
+static const struct ttm_lru_walk_ops ttm_evict_walk_ops = {
+	.process_bo = ttm_bo_evict_cb,
+};
 
-	ttm_bo_put(bo);
-	return ret;
+static int ttm_bo_evict_alloc(struct ttm_device *bdev,
+			      struct ttm_resource_manager *man,
+			      const struct ttm_place *place,
+			      struct ttm_buffer_object *evictor,
+			      struct ttm_operation_ctx *ctx,
+			      struct ww_acquire_ctx *ticket,
+			      struct ttm_resource **res)
+{
+	struct ttm_bo_evict_walk evict_walk = {
+		.walk = {
+			.ops = &ttm_evict_walk_ops,
+			.ctx = ctx,
+			.ticket = ticket,
+		},
+		.place = place,
+		.evictor = evictor,
+		.res = res,
+	};
+	long lret;
+
+	evict_walk.walk.trylock_only = true;
+	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+	if (lret || !ticket)
+		goto out;
+
+	/* If ticket-locking, repeat while making progress. */
+	evict_walk.walk.trylock_only = false;
+	do {
+		/* The walk may clear the evict_walk.walk.ticket field */
+		evict_walk.walk.ticket = ticket;
+		evict_walk.evicted = 0;
+		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+	} while (!lret && evict_walk.evicted);
+out:
+	if (lret < 0)
+		return lret;
+	if (lret == 0)
+		return -EBUSY;
+	return 0;
 }
 
 /**
@@ -760,6 +688,7 @@ static int ttm_bo_alloc_resource(struct ttm_buffer_object *bo,
 	for (i = 0; i < placement->num_placement; ++i) {
 		const struct ttm_place *place = &placement->placement[i];
 		struct ttm_resource_manager *man;
+		bool may_evict;
 
 		man = ttm_manager_type(bdev, place->mem_type);
 		if (!man || !ttm_resource_manager_used(man))
@@ -769,22 +698,21 @@ static int ttm_bo_alloc_resource(struct ttm_buffer_object *bo,
 				    TTM_PL_FLAG_FALLBACK))
 			continue;
 
-		do {
-			ret = ttm_resource_alloc(bo, place, res);
-			if (unlikely(ret && ret != -ENOSPC))
+		may_evict = (force_space && place->mem_type != TTM_PL_SYSTEM);
+		ret = ttm_resource_alloc(bo, place, res);
+		if (ret) {
+			if (ret != -ENOSPC)
 				return ret;
-			if (likely(!ret) || !force_space)
-				break;
-
-			ret = ttm_mem_evict_first(bdev, man, place, ctx,
-						  ticket);
-			if (unlikely(ret == -EBUSY))
-				break;
-			if (unlikely(ret))
+			if (!may_evict)
+				continue;
+
+			ret = ttm_bo_evict_alloc(bdev, man, place, bo, ctx,
+						 ticket, res);
+			if (ret == -EBUSY)
+				continue;
+			if (ret)
 				return ret;
-		} while (1);
-		if (ret)
-			continue;
+		}
 
 		ret = ttm_bo_add_move_fence(bo, man, ctx->no_wait_gpu);
 		if (unlikely(ret)) {
diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
index b6a2daac5518..9dc727d416cc 100644
--- a/drivers/gpu/drm/ttm/ttm_resource.c
+++ b/drivers/gpu/drm/ttm/ttm_resource.c
@@ -512,24 +512,11 @@ int ttm_resource_manager_evict_all(struct ttm_device *bdev,
 	};
 	struct dma_fence *fence;
 	int ret;
-	unsigned i;
-
-	/*
-	 * Can't use standard list traversal since we're unlocking.
-	 */
 
-	spin_lock(&bdev->lru_lock);
-	for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {
-		while (!list_empty(&man->lru[i])) {
-			spin_unlock(&bdev->lru_lock);
-			ret = ttm_mem_evict_first(bdev, man, NULL, &ctx,
-						  NULL);
-			if (ret)
-				return ret;
-			spin_lock(&bdev->lru_lock);
-		}
-	}
-	spin_unlock(&bdev->lru_lock);
+	do {
+		ret = ttm_bo_evict_first(bdev, man, &ctx);
+		cond_resched();
+	} while (!ret);
 
 	spin_lock(&man->move_lock);
 	fence = dma_fence_get(man->move);
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index de97ea9fa75f..e577528f5dfc 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -422,11 +422,9 @@ long ttm_bo_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx,
 		    pgoff_t target);
 void ttm_bo_pin(struct ttm_buffer_object *bo);
 void ttm_bo_unpin(struct ttm_buffer_object *bo);
-int ttm_mem_evict_first(struct ttm_device *bdev,
-			struct ttm_resource_manager *man,
-			const struct ttm_place *place,
-			struct ttm_operation_ctx *ctx,
-			struct ww_acquire_ctx *ticket);
+int ttm_bo_evict_first(struct ttm_device *bdev,
+		       struct ttm_resource_manager *man,
+		       struct ttm_operation_ctx *ctx);
 vm_fault_t ttm_bo_vm_reserve(struct ttm_buffer_object *bo,
 			     struct vm_fault *vmf);
 vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v6 08/12] drm/ttm: Add a virtual base class for graphics memory backup
  2024-07-03 15:38 [PATCH v6 00/12] TTM shrinker helpers and xe buffer object shrinker Thomas Hellström
                   ` (6 preceding siblings ...)
  2024-07-03 15:38 ` [PATCH v6 07/12] drm/ttm: Use the LRU walker for eviction Thomas Hellström
@ 2024-07-03 15:38 ` Thomas Hellström
  2024-07-03 19:47   ` Matthew Brost
  2024-07-04 11:57   ` Christian König
  2024-07-03 15:38 ` [PATCH v6 09/12] drm/ttm/pool: Provide a helper to shrink pages Thomas Hellström
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 38+ messages in thread
From: Thomas Hellström @ 2024-07-03 15:38 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Christian König,
	Somalapuram Amaranath, Matthew Brost, dri-devel

Initially intended for experimenting with different backup
solutions (shmem vs direct swap cache insertion), abstract
the backup destination using a virtual base class.

Also provide a sample implementation for shmem.

While the abstraction could perhaps be skipped once a preferred
backup solution is settled on, this functionality may actually
come in handy for configurable dedicated graphics memory
backup to fast nvme files or similar, without affecting
swap-space. It could indeed be useful for VRAM backup on S4 and
other cases.
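
As a usage sketch (assuming a resident struct page *page to
round-trip; SZ_2M is just an arbitrary backup size here), a backend
created this way would be driven roughly like:

static int example_backup_roundtrip(struct page *page)
{
	struct ttm_backup *backup = ttm_backup_shmem_create(SZ_2M);
	unsigned long handle;
	int err = 0;

	if (IS_ERR(backup))
		return PTR_ERR(backup);

	/* Copy the page contents into shmem; 0 means the backup failed. */
	handle = backup->ops->backup_page(backup, page, false, 0,
					  GFP_HIGHUSER,
					  GFP_KERNEL | __GFP_NOWARN);
	if (handle) {
		/* Read the contents back and release the backed-up copy. */
		err = backup->ops->copy_backed_up_page(backup, page, handle,
						       true);
		backup->ops->drop(backup, handle);
	} else {
		err = -ENOMEM;
	}

	backup->ops->fini(backup);
	return err;
}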

v5:
- Fix a UAF. (kernel test robot, Dan Carpenter)
v6:
- Rename ttm_backup_shmem_copy_page() function argument
  (Matthew Brost)
- Add some missing documentation

Cc: Christian König <christian.koenig@amd.com>
Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/ttm/Makefile           |   2 +-
 drivers/gpu/drm/ttm/ttm_backup_shmem.c | 139 +++++++++++++++++++++++++
 include/drm/ttm/ttm_backup.h           | 137 ++++++++++++++++++++++++
 3 files changed, 277 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/ttm/ttm_backup_shmem.c
 create mode 100644 include/drm/ttm/ttm_backup.h

diff --git a/drivers/gpu/drm/ttm/Makefile b/drivers/gpu/drm/ttm/Makefile
index dad298127226..5e980dd90e41 100644
--- a/drivers/gpu/drm/ttm/Makefile
+++ b/drivers/gpu/drm/ttm/Makefile
@@ -4,7 +4,7 @@
 
 ttm-y := ttm_tt.o ttm_bo.o ttm_bo_util.o ttm_bo_vm.o ttm_module.o \
 	ttm_execbuf_util.o ttm_range_manager.o ttm_resource.o ttm_pool.o \
-	ttm_device.o ttm_sys_manager.o
+	ttm_device.o ttm_sys_manager.o ttm_backup_shmem.o
 ttm-$(CONFIG_AGP) += ttm_agp_backend.o
 
 obj-$(CONFIG_DRM_TTM) += ttm.o
diff --git a/drivers/gpu/drm/ttm/ttm_backup_shmem.c b/drivers/gpu/drm/ttm/ttm_backup_shmem.c
new file mode 100644
index 000000000000..3d23a34d9f34
--- /dev/null
+++ b/drivers/gpu/drm/ttm/ttm_backup_shmem.c
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#include <drm/ttm/ttm_backup.h>
+#include <linux/page-flags.h>
+
+/**
+ * struct ttm_backup_shmem - A shmem based ttm_backup subclass.
+ * @backup: The base struct ttm_backup
+ * @filp: The associated shmem object
+ */
+struct ttm_backup_shmem {
+	struct ttm_backup backup;
+	struct file *filp;
+};
+
+static struct ttm_backup_shmem *to_backup_shmem(struct ttm_backup *backup)
+{
+	return container_of(backup, struct ttm_backup_shmem, backup);
+}
+
+static void ttm_backup_shmem_drop(struct ttm_backup *backup, unsigned long handle)
+{
+	handle -= 1;
+	shmem_truncate_range(file_inode(to_backup_shmem(backup)->filp), handle,
+			     handle + 1);
+}
+
+static int ttm_backup_shmem_copy_page(struct ttm_backup *backup, struct page *dst,
+				      unsigned long handle, bool intr)
+{
+	struct file *filp = to_backup_shmem(backup)->filp;
+	struct address_space *mapping = filp->f_mapping;
+	struct folio *from_folio;
+
+	handle -= 1;
+	from_folio = shmem_read_folio(mapping, handle);
+	if (IS_ERR(from_folio))
+		return PTR_ERR(from_folio);
+
+	/* Note: Use drm_memcpy_from_wc? */
+	copy_highpage(dst, folio_file_page(from_folio, handle));
+	folio_put(from_folio);
+
+	return 0;
+}
+
+static unsigned long
+ttm_backup_shmem_backup_page(struct ttm_backup *backup, struct page *page,
+			     bool writeback, pgoff_t i, gfp_t page_gfp,
+			     gfp_t alloc_gfp)
+{
+	struct file *filp = to_backup_shmem(backup)->filp;
+	struct address_space *mapping = filp->f_mapping;
+	unsigned long handle = 0;
+	struct folio *to_folio;
+	int ret;
+
+	to_folio = shmem_read_folio_gfp(mapping, i, alloc_gfp);
+	if (IS_ERR(to_folio))
+		return handle;
+
+	folio_mark_accessed(to_folio);
+	folio_lock(to_folio);
+	folio_mark_dirty(to_folio);
+	copy_highpage(folio_file_page(to_folio, i), page);
+	handle = i + 1;
+
+	if (writeback && !folio_mapped(to_folio) && folio_clear_dirty_for_io(to_folio)) {
+		struct writeback_control wbc = {
+			.sync_mode = WB_SYNC_NONE,
+			.nr_to_write = SWAP_CLUSTER_MAX,
+			.range_start = 0,
+			.range_end = LLONG_MAX,
+			.for_reclaim = 1,
+		};
+		folio_set_reclaim(to_folio);
+		ret = mapping->a_ops->writepage(folio_page(to_folio, 0), &wbc);
+		if (!folio_test_writeback(to_folio))
+			folio_clear_reclaim(to_folio);
+		/* If writepage succeeds, it unlocks the folio */
+		if (ret)
+			folio_unlock(to_folio);
+	} else {
+		folio_unlock(to_folio);
+	}
+
+	folio_put(to_folio);
+
+	return handle;
+}
+
+static void ttm_backup_shmem_fini(struct ttm_backup *backup)
+{
+	struct ttm_backup_shmem *sbackup = to_backup_shmem(backup);
+
+	fput(sbackup->filp);
+	kfree(sbackup);
+}
+
+static const struct ttm_backup_ops ttm_backup_shmem_ops = {
+	.drop = ttm_backup_shmem_drop,
+	.copy_backed_up_page = ttm_backup_shmem_copy_page,
+	.backup_page = ttm_backup_shmem_backup_page,
+	.fini = ttm_backup_shmem_fini,
+};
+
+/**
+ * ttm_backup_shmem_create() - Create a shmem-based struct backup.
+ * @size: The maximum size (in bytes) to back up.
+ *
+ * Create a backup utilizing shmem objects.
+ *
+ * Return: A pointer to a struct ttm_backup on success,
+ * an error pointer on error.
+ */
+struct ttm_backup *ttm_backup_shmem_create(loff_t size)
+{
+	struct ttm_backup_shmem *sbackup =
+		kzalloc(sizeof(*sbackup), GFP_KERNEL | __GFP_ACCOUNT);
+	struct file *filp;
+
+	if (!sbackup)
+		return ERR_PTR(-ENOMEM);
+
+	filp = shmem_file_setup("ttm shmem backup", size, 0);
+	if (IS_ERR(filp)) {
+		kfree(sbackup);
+		return ERR_CAST(filp);
+	}
+
+	sbackup->filp = filp;
+	sbackup->backup.ops = &ttm_backup_shmem_ops;
+
+	return &sbackup->backup;
+}
+EXPORT_SYMBOL_GPL(ttm_backup_shmem_create);
diff --git a/include/drm/ttm/ttm_backup.h b/include/drm/ttm/ttm_backup.h
new file mode 100644
index 000000000000..5f8c7d3069ef
--- /dev/null
+++ b/include/drm/ttm/ttm_backup.h
@@ -0,0 +1,137 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#ifndef _TTM_BACKUP_H_
+#define _TTM_BACKUP_H_
+
+#include <linux/mm_types.h>
+#include <linux/shmem_fs.h>
+
+struct ttm_backup;
+
+/**
+ * ttm_backup_handle_to_page_ptr() - Convert handle to struct page pointer
+ * @handle: The handle to convert.
+ *
+ * Converts an opaque handle received from the
+ * struct ttm_backup_ops::backup_page() function to an (invalid)
+ * struct page pointer suitable for a struct page array.
+ *
+ * Return: An (invalid) struct page pointer.
+ */
+static inline struct page *
+ttm_backup_handle_to_page_ptr(unsigned long handle)
+{
+	return (struct page *)(handle << 1 | 1);
+}
+
+/**
+ * ttm_backup_page_ptr_is_handle() - Whether a struct page pointer is a handle
+ * @page: The struct page pointer to check.
+ *
+ * Return: true if the struct page pointer is a handle returned from
+ * ttm_backup_handle_to_page_ptr(). False otherwise.
+ */
+static inline bool ttm_backup_page_ptr_is_handle(const struct page *page)
+{
+	return (unsigned long)page & 1;
+}
+
+/**
+ * ttm_backup_page_ptr_to_handle() - Convert a struct page pointer to a handle
+ * @page: The struct page pointer to convert
+ *
+ * Return: The handle that was previously used in
+ * ttm_backup_handle_to_page_ptr() to obtain a struct page pointer, suitable
+ * for use as argument in the struct ttm_backup_ops drop() or
+ * copy_backed_up_page() functions.
+ */
+static inline unsigned long
+ttm_backup_page_ptr_to_handle(const struct page *page)
+{
+	WARN_ON(!ttm_backup_page_ptr_is_handle(page));
+	return (unsigned long)page >> 1;
+}
+
+/** struct ttm_backup_ops - A struct ttm_backup backend operations */
+struct ttm_backup_ops {
+	/**
+	 * drop - release memory associated with a handle
+	 * @backup: The struct backup pointer used to obtain the handle
+	 * @handle: The handle obtained from the @backup_page function.
+	 */
+	void (*drop)(struct ttm_backup *backup, unsigned long handle);
+
+	/**
+	 * copy_backed_up_page - Copy the contents of a previously backed
+	 * up page
+	 * @backup: The struct backup pointer used to back up the page.
+	 * @dst: The struct page to copy into.
+	 * @handle: The handle returned when the page was backed up.
+	 * @intr: Whether waits should be interruptible or at least killable.
+	 *
+	 * Return: 0 on success, Negative error code on failure, notably
+	 * -EINTR if @intr was set to true and a signal is pending.
+	 */
+	int (*copy_backed_up_page)(struct ttm_backup *backup, struct page *dst,
+				   unsigned long handle, bool intr);
+
+	/**
+	 * backup_page - Backup a page
+	 * @backup: The struct backup pointer to use.
+	 * @page: The page to back up.
+	 * @writeback: Whether to perform immediate writeback of the page.
+	 * This may have performance implications.
+	 * @i: A unique integer for each page and each struct backup.
+	 * This is a hint allowing the backup backend to avoid managing
+	 * its address space separately.
+	 * @page_gfp: The gfp value used when the page was allocated.
+	 * This is used for accounting purposes.
+	 * @alloc_gfp: The gfp value to be used when the backend needs to allocate
+	 * memory.
+	 *
+	 * Return: A handle on success. 0 on failure.
+	 * (This is following the swp_entry_t convention).
+	 *
+	 * Note: This function could be extended to back up a folio and
+	 * backends would then split the folio internally if needed.
+	 * Drawback is that the caller would then have to keep track of
+	 * the folio size and usage.
+	 */
+	unsigned long (*backup_page)(struct ttm_backup *backup, struct page *page,
+				     bool writeback, pgoff_t i, gfp_t page_gfp,
+				     gfp_t alloc_gfp);
+	/**
+	 * fini - Free the struct backup resources after last use.
+	 * @backup: Pointer to the struct backup whose resources to free.
+	 *
+	 * After a call to @fini, it's illegal to use the @backup pointer.
+	 */
+	void (*fini)(struct ttm_backup *backup);
+};
+
+/**
+ * struct ttm_backup - Abstract a backup backend.
+ * @ops: The operations as described above.
+ *
+ * The struct ttm_backup is intended to be subclassed by the
+ * backend implementation.
+ */
+struct ttm_backup {
+	const struct ttm_backup_ops *ops;
+};
+
+/**
+ * ttm_backup_shmem_create() - Create a shmem-based struct backup.
+ * @size: The maximum size (in bytes) to back up.
+ *
+ * Create a backup utilizing shmem objects.
+ *
+ * Return: A pointer to a struct ttm_backup on success,
+ * an error pointer on error.
+ */
+struct ttm_backup *ttm_backup_shmem_create(loff_t size);
+
+#endif
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v6 09/12] drm/ttm/pool: Provide a helper to shrink pages
  2024-07-03 15:38 [PATCH v6 00/12] TTM shrinker helpers and xe buffer object shrinker Thomas Hellström
                   ` (7 preceding siblings ...)
  2024-07-03 15:38 ` [PATCH v6 08/12] drm/ttm: Add a virtual base class for graphics memory backup Thomas Hellström
@ 2024-07-03 15:38 ` Thomas Hellström
  2024-08-07 23:38   ` Matthew Brost
  2024-07-03 15:38 ` [PATCH v6 10/12] drm/ttm: Use fault-injection to test error paths Thomas Hellström
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Thomas Hellström @ 2024-07-03 15:38 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Christian König,
	Somalapuram Amaranath, Matthew Brost, dri-devel

Provide a helper to shrink ttm_tt page-vectors on a per-page
basis. A ttm_backup backend could then in theory get away with
allocating a single temporary page for each struct ttm_tt.

This is accomplished by splitting larger pages before trying to
back them up.

In the future we could allow ttm_backup to handle backing up
large pages as well, but currently there's no benefit in
doing that, since the shmem backup backend would have to
split those anyway to avoid allocating too much temporary
memory, and if the backend instead inserts pages into the
swap-cache, those are split on reclaim by the core.

Due to potential backup and recovery errors, allow partially swapped-out
struct ttm_tt's, although mark them as swapped out, stopping them
from being swapped out a second time. More details in the ttm_pool.c
DOC section.
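
As an illustration (a sketch, not part of the patch), a consumer of a
partially backed-up ttm_tt interprets the mixed page vector using the
helpers from the previous patch; a handle of 0 marks a page whose
backup failed:

static pgoff_t example_count_backed_up(struct ttm_tt *tt)
{
	pgoff_t i, backed_up = 0;

	for (i = 0; i < tt->num_pages; ++i) {
		struct page *p = tt->pages[i];

		/* Handles encode backed-up content, pages are resident. */
		if (ttm_backup_page_ptr_is_handle(p) &&
		    ttm_backup_page_ptr_to_handle(p))
			backed_up++;
	}

	return backed_up;
}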

v2:
- A couple of cleanups and error fixes in ttm_pool_back_up_tt.
- s/back_up/backup/
- Add a writeback parameter to the exported interface.

Cc: Christian König <christian.koenig@amd.com>
Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/ttm/ttm_pool.c | 397 +++++++++++++++++++++++++++++++--
 drivers/gpu/drm/ttm/ttm_tt.c   |  37 +++
 include/drm/ttm/ttm_pool.h     |   5 +
 include/drm/ttm/ttm_tt.h       |  20 ++
 4 files changed, 446 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 6e1fd6985ffc..38e50cf81b0a 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -41,6 +41,7 @@
 #include <asm/set_memory.h>
 #endif
 
+#include <drm/ttm/ttm_backup.h>
 #include <drm/ttm/ttm_pool.h>
 #include <drm/ttm/ttm_tt.h>
 #include <drm/ttm/ttm_bo.h>
@@ -58,6 +59,32 @@ struct ttm_pool_dma {
 	unsigned long vaddr;
 };
 
+/**
+ * struct ttm_pool_tt_restore - State representing restore from backup
+ * @alloced_pages: Total number of already allocated pages for the ttm_tt.
+ * @restored_pages: Number of (sub) pages restored from swap for this
+ *		     chunk of 1 << @order pages.
+ * @first_page: The ttm page ptr representing @old_pages[0].
+ * @caching_divide: Page pointer where subsequent pages are cached.
+ * @old_pages: Backup copy of page pointers that were replaced by the new
+ *	       page allocation.
+ * @pool: The pool used for page allocation while restoring.
+ * @order: The order of the last page allocated while restoring.
+ *
+ * Recovery from backup might fail when we've recovered less than the
+ * full ttm_tt. In order not to lose any data (yet), keep information
+ * around that allows us to restart a failed ttm backup recovery.
+ */
+struct ttm_pool_tt_restore {
+	pgoff_t alloced_pages;
+	pgoff_t restored_pages;
+	struct page **first_page;
+	struct page **caching_divide;
+	struct ttm_pool *pool;
+	unsigned int order;
+	struct page *old_pages[];
+};
+
 static unsigned long page_pool_size;
 
 MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool");
@@ -354,11 +381,102 @@ static unsigned int ttm_pool_page_order(struct ttm_pool *pool, struct page *p)
 	return p->private;
 }
 
+/*
+ * To be able to insert single pages into backup directly,
+ * we need to split multi-order page allocations and make them look
+ * like single-page allocations.
+ */
+static void ttm_pool_split_for_swap(struct ttm_pool *pool, struct page *p)
+{
+	unsigned int order = ttm_pool_page_order(pool, p);
+	pgoff_t nr;
+
+	if (!order)
+		return;
+
+	split_page(p, order);
+	nr = 1UL << order;
+	while (nr--)
+		(p++)->private = 0;
+}
+
+/**
+ * DOC: Partial backup and restoration of a struct ttm_tt.
+ *
+ * Swapout using ttm_backup::ops::backup_page() and swapin using
+ * ttm_backup::ops::copy_backed_up_page() may fail.
+ * The former most likely due to lack of swap-space or memory, the latter due
+ * to lack of memory or because of signal interruption during waits.
+ *
+ * Backup failure is easily handled by using a ttm_tt pages vector that holds
+ * both swap entries and page pointers. This has to be taken into account when
+ * restoring such a ttm_tt from backup, and when freeing it while backed up.
+ * When restoring, for simplicity, new pages are actually allocated from the
+ * pool and the contents of any old pages are copied in and then the old pages
+ * are released.
+ *
+ * For restoration failures, the struct ttm_pool_tt_restore holds sufficient state
+ * to be able to resume an interrupted restore, and that structure is freed once
+ * the restoration is complete. If the struct ttm_tt is destroyed while there
+ * is a valid struct ttm_pool_tt_restore attached, that is also properly taken
+ * care of.
+ */
+
+static bool ttm_pool_restore_valid(const struct ttm_pool_tt_restore *restore)
+{
+	return restore && restore->restored_pages < (1 << restore->order);
+}
+
+static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore,
+			       struct ttm_backup *backup,
+			       struct ttm_operation_ctx *ctx)
+{
+	unsigned int i, nr = 1 << restore->order;
+	int ret = 0;
+
+	if (!ttm_pool_restore_valid(restore))
+		return 0;
+
+	for (i = restore->restored_pages; i < nr; ++i) {
+		struct page *p = restore->old_pages[i];
+
+		if (ttm_backup_page_ptr_is_handle(p)) {
+			unsigned long handle = ttm_backup_page_ptr_to_handle(p);
+
+			if (handle == 0)
+				continue;
+
+			ret = backup->ops->copy_backed_up_page
+				(backup, restore->first_page[i],
+				 handle, ctx->interruptible);
+			if (ret)
+				break;
+
+			backup->ops->drop(backup, handle);
+		} else if (p) {
+			/*
+			 * We could probably avoid splitting the old page
+			 * using clever logic, but ATM we don't care.
+			 */
+			ttm_pool_split_for_swap(restore->pool, p);
+			copy_highpage(restore->first_page[i], p);
+			__free_pages(p, 0);
+		}
+
+		restore->restored_pages++;
+		restore->old_pages[i] = NULL;
+		cond_resched();
+	}
+
+	return ret;
+}
+
 /* Called when we got a page, either from a pool or newly allocated */
 static int ttm_pool_page_allocated(struct ttm_pool *pool, unsigned int order,
 				   struct page *p, dma_addr_t **dma_addr,
 				   unsigned long *num_pages,
-				   struct page ***pages)
+				   struct page ***pages,
+				   struct ttm_pool_tt_restore *restore)
 {
 	unsigned int i;
 	int r;
@@ -369,6 +487,16 @@ static int ttm_pool_page_allocated(struct ttm_pool *pool, unsigned int order,
 			return r;
 	}
 
+	if (restore) {
+		memcpy(restore->old_pages, *pages,
+		       (1 << order) * sizeof(*restore->old_pages));
+		memset(*pages, 0, (1 << order) * sizeof(**pages));
+		restore->order = order;
+		restore->restored_pages = 0;
+		restore->first_page = *pages;
+		restore->alloced_pages += 1UL << order;
+	}
+
 	*num_pages -= 1 << order;
 	for (i = 1 << order; i; --i, ++(*pages), ++p)
 		**pages = p;
@@ -394,22 +522,39 @@ static void ttm_pool_free_range(struct ttm_pool *pool, struct ttm_tt *tt,
 				pgoff_t start_page, pgoff_t end_page)
 {
 	struct page **pages = &tt->pages[start_page];
+	struct ttm_backup *backup = tt->backup;
 	unsigned int order;
 	pgoff_t i, nr;
 
 	for (i = start_page; i < end_page; i += nr, pages += nr) {
 		struct ttm_pool_type *pt = NULL;
+		struct page *p = *pages;
+
+		if (ttm_backup_page_ptr_is_handle(p)) {
+			unsigned long handle = ttm_backup_page_ptr_to_handle(p);
+
+			nr = 1;
+			if (handle != 0)
+				backup->ops->drop(backup, handle);
+			continue;
+		}
+
+		if (pool) {
+			order = ttm_pool_page_order(pool, p);
+			nr = (1UL << order);
+			if (tt->dma_address)
+				ttm_pool_unmap(pool, tt->dma_address[i], nr);
 
-		order = ttm_pool_page_order(pool, *pages);
-		nr = (1UL << order);
-		if (tt->dma_address)
-			ttm_pool_unmap(pool, tt->dma_address[i], nr);
+			pt = ttm_pool_select_type(pool, caching, order);
+		} else {
+			order = p->private;
+			nr = (1UL << order);
+		}
 
-		pt = ttm_pool_select_type(pool, caching, order);
 		if (pt)
-			ttm_pool_type_give(pt, *pages);
+			ttm_pool_type_give(pt, p);
 		else
-			ttm_pool_free_page(pool, caching, order, *pages);
+			ttm_pool_free_page(pool, caching, order, p);
 	}
 }
 
@@ -453,9 +598,37 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 	else
 		gfp_flags |= GFP_HIGHUSER;
 
-	for (order = min_t(unsigned int, MAX_PAGE_ORDER, __fls(num_pages));
-	     num_pages;
-	     order = min_t(unsigned int, order, __fls(num_pages))) {
+	order = min_t(unsigned int, MAX_PAGE_ORDER, __fls(num_pages));
+
+	if (tt->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP) {
+		if (!tt->restore) {
+			gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
+
+			if (ctx->gfp_retry_mayfail)
+				gfp |= __GFP_RETRY_MAYFAIL;
+
+			tt->restore =
+				kvzalloc(struct_size(tt->restore, old_pages,
+						     (size_t)1 << order), gfp);
+			/* RFC: Possibly loop on -ENOMEM and reduce order. */
+			if (!tt->restore)
+				return -ENOMEM;
+		} else if (ttm_pool_restore_valid(tt->restore)) {
+			struct ttm_pool_tt_restore *restore = tt->restore;
+
+			num_pages -= restore->alloced_pages;
+			order = min_t(unsigned int, order, __fls(num_pages));
+			pages += restore->alloced_pages;
+			r = ttm_pool_restore_tt(restore, tt->backup, ctx);
+			if (r)
+				return r;
+			caching = restore->caching_divide;
+		}
+
+		tt->restore->pool = pool;
+	}
+
+	for (; num_pages; order = min_t(unsigned int, order, __fls(num_pages))) {
 		struct ttm_pool_type *pt;
 
 		page_caching = tt->caching;
@@ -472,11 +645,19 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 				r = ttm_pool_page_allocated(pool, order, p,
 							    &dma_addr,
 							    &num_pages,
-							    &pages);
+							    &pages,
+							    tt->restore);
 				if (r)
 					goto error_free_page;
 
 				caching = pages;
+				if (ttm_pool_restore_valid(tt->restore)) {
+					r = ttm_pool_restore_tt(tt->restore, tt->backup,
+								ctx);
+					if (r)
+						goto error_free_all;
+				}
+
 				if (num_pages < (1 << order))
 					break;
 
@@ -496,9 +677,17 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 				caching = pages;
 			}
 			r = ttm_pool_page_allocated(pool, order, p, &dma_addr,
-						    &num_pages, &pages);
+						    &num_pages, &pages,
+						    tt->restore);
 			if (r)
 				goto error_free_page;
+
+			if (ttm_pool_restore_valid(tt->restore)) {
+				r = ttm_pool_restore_tt(tt->restore, tt->backup, ctx);
+				if (r)
+					goto error_free_all;
+			}
+
 			if (PageHighMem(p))
 				caching = pages;
 		}
@@ -517,12 +706,26 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 	if (r)
 		goto error_free_all;
 
+	if (tt->restore) {
+		kvfree(tt->restore);
+		tt->restore = NULL;
+	}
+
+	if (tt->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP)
+		tt->page_flags &= ~(TTM_TT_FLAG_PRIV_BACKED_UP |
+				    TTM_TT_FLAG_SWAPPED);
+
 	return 0;
 
 error_free_page:
 	ttm_pool_free_page(pool, page_caching, order, p);
 
 error_free_all:
+	if (tt->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP) {
+		tt->restore->caching_divide = caching;
+		return r;
+	}
+
 	num_pages = tt->num_pages - num_pages;
 	caching_divide = caching - tt->pages;
 	ttm_pool_free_range(pool, tt, tt->caching, 0, caching_divide);
@@ -549,6 +752,174 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
 }
 EXPORT_SYMBOL(ttm_pool_free);
 
+/**
+ * ttm_pool_release_backed_up() - Release content of a swapped-out struct ttm_tt
+ * @tt: The struct ttm_tt.
+ *
+ * Release handles with associated content or any remaining pages of
+ * a backed-up struct ttm_tt.
+ */
+void ttm_pool_release_backed_up(struct ttm_tt *tt)
+{
+	struct ttm_backup *backup = tt->backup;
+	struct ttm_pool_tt_restore *restore;
+	pgoff_t i, start_page = 0;
+	unsigned long handle;
+
+	if (!(tt->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP))
+		return;
+
+	restore = tt->restore;
+
+	if (ttm_pool_restore_valid(restore)) {
+		pgoff_t nr = 1UL << restore->order;
+
+		for (i = restore->restored_pages; i < nr; ++i) {
+			struct page *p = restore->old_pages[i];
+
+			if (ttm_backup_page_ptr_is_handle(p)) {
+				handle = ttm_backup_page_ptr_to_handle(p);
+				if (handle == 0)
+					continue;
+
+				backup->ops->drop(backup, handle);
+			} else if (p) {
+				ttm_pool_split_for_swap(restore->pool, p);
+				__free_pages(p, 0);
+			}
+		}
+	}
+
+	if (restore) {
+		pgoff_t mid = restore->caching_divide - tt->pages;
+
+		start_page = restore->alloced_pages;
+		/* Pages that might be dma-mapped and non-cached */
+		ttm_pool_free_range(restore->pool, tt, tt->caching,
+				    0, mid);
+		/* Pages that might be dma-mapped but cached */
+		ttm_pool_free_range(restore->pool, tt, ttm_cached,
+				    mid, restore->alloced_pages);
+	}
+
+	/* Shrunken pages. Cached and not dma-mapped. */
+	ttm_pool_free_range(NULL, tt, ttm_cached, start_page, tt->num_pages);
+
+	if (restore) {
+		kvfree(restore);
+		tt->restore = NULL;
+	}
+
+	tt->page_flags &= ~(TTM_TT_FLAG_PRIV_BACKED_UP | TTM_TT_FLAG_SWAPPED);
+}
+
+/**
+ * ttm_pool_backup_tt() - Back up or purge a struct ttm_tt
+ * @pool: The pool used when allocating the struct ttm_tt.
+ * @ttm: The struct ttm_tt.
+ * @purge: Don't back up but release pages directly to system.
+ * @writeback: If !@purge, Try to write out directly to the
+ * underlying persistent media.
+ *
+ * Back up or purge a struct ttm_tt. If @purge is true, then
+ * all pages will be freed directly to the system rather than to the pool
+ * they were allocated from, making the function behave similarly to
+ * ttm_pool_free(). If @purge is false the pages will be backed up instead,
+ * exchanged for handles.
+ * A subsequent call to ttm_pool_alloc() will then read back the content and
+ * a subsequent call to ttm_pool_release_backed_up() will drop it.
+ * If backup of a page fails for whatever reason, @ttm will still be
+ * partially backed up, retaining those pages for which backup fails.
+ *
+ * Return: Number of pages actually backed up or freed, or negative
+ * error code on error.
+ */
+long ttm_pool_backup_tt(struct ttm_pool *pool, struct ttm_tt *ttm, bool purge,
+			bool writeback)
+{
+	struct ttm_backup *backup = ttm->backup;
+	struct page *page;
+	unsigned long handle;
+	gfp_t alloc_gfp;
+	gfp_t gfp;
+	int ret = 0;
+	pgoff_t shrunken = 0;
+	pgoff_t i, num_pages;
+
+	if ((!get_nr_swap_pages() && !purge) ||
+	    pool->use_dma_alloc ||
+	    (ttm->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP))
+		return -EBUSY;
+
+#ifdef CONFIG_X86
+	/* Anything returned to the system needs to be cached. */
+	if (ttm->caching != ttm_cached)
+		set_pages_array_wb(ttm->pages, ttm->num_pages);
+#endif
+
+	if (ttm->dma_address || purge) {
+		for (i = 0; i < ttm->num_pages; i += num_pages) {
+			unsigned int order;
+
+			page = ttm->pages[i];
+			if (unlikely(!page)) {
+				num_pages = 1;
+				continue;
+			}
+
+			order = ttm_pool_page_order(pool, page);
+			num_pages = 1UL << order;
+			if (ttm->dma_address)
+				ttm_pool_unmap(pool, ttm->dma_address[i],
+					       num_pages);
+			if (purge) {
+				shrunken += num_pages;
+				page->private = 0;
+				__free_pages(page, order);
+				memset(ttm->pages + i, 0,
+				       num_pages * sizeof(*ttm->pages));
+			}
+		}
+	}
+
+	if (purge)
+		return shrunken;
+
+	if (pool->use_dma32)
+		gfp = GFP_DMA32;
+	else
+		gfp = GFP_HIGHUSER;
+
+	alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN | __GFP_RETRY_MAYFAIL;
+
+	for (i = 0; i < ttm->num_pages; ++i) {
+		page = ttm->pages[i];
+		if (unlikely(!page))
+			continue;
+
+		ttm_pool_split_for_swap(pool, page);
+
+		handle = backup->ops->backup_page(backup, page, writeback, i,
+						  gfp, alloc_gfp);
+		if (handle) {
+			ttm->pages[i] = ttm_backup_handle_to_page_ptr(handle);
+			put_page(page);
+			shrunken++;
+		} else {
+			/* We allow partially shrunken tts */
+			ret = -ENOMEM;
+			break;
+		}
+		cond_resched();
+	}
+
+	if (shrunken)
+		ttm->page_flags |= (TTM_TT_FLAG_PRIV_BACKED_UP |
+				    TTM_TT_FLAG_SWAPPED);
+
+	return shrunken ? shrunken : ret;
+}
+
 /**
  * ttm_pool_init - Initialize a pool
  *
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 4b51b9023126..98ce25197b38 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -40,6 +40,7 @@
 #include <drm/drm_cache.h>
 #include <drm/drm_device.h>
 #include <drm/drm_util.h>
+#include <drm/ttm/ttm_backup.h>
 #include <drm/ttm/ttm_bo.h>
 #include <drm/ttm/ttm_tt.h>
 
@@ -158,6 +159,7 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
 	ttm->swap_storage = NULL;
 	ttm->sg = bo->sg;
 	ttm->caching = caching;
+	ttm->restore = NULL;
 }
 
 int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
@@ -182,6 +184,12 @@ void ttm_tt_fini(struct ttm_tt *ttm)
 		fput(ttm->swap_storage);
 	ttm->swap_storage = NULL;
 
+	ttm_pool_release_backed_up(ttm);
+	if (ttm->backup) {
+		ttm->backup->ops->fini(ttm->backup);
+		ttm->backup = NULL;
+	}
+
 	if (ttm->pages)
 		kvfree(ttm->pages);
 	else
@@ -253,6 +261,35 @@ int ttm_tt_swapin(struct ttm_tt *ttm)
 }
 EXPORT_SYMBOL_FOR_TESTS_ONLY(ttm_tt_swapin);
 
+/**
+ * ttm_tt_backup() - Helper to back up a struct ttm_tt.
+ * @bdev: The TTM device.
+ * @tt: The struct ttm_tt.
+ * @purge: Don't back up but release pages directly to system,
+ * bypassing any pooling.
+ * @writeback: If !@purge, try to write out directly to the
+ * underlying persistent media.
+ *
+ * Helper for a TTM driver to use from the bo_shrink() method to shrink
+ * a struct ttm_tt, after it has done the necessary unbinding. This function
+ * will update the page accounting and call ttm_pool_backup_tt() to free pages
+ * or move them to the swap cache.
+ *
+ * Return: Number of pages freed or swapped out, or negative error code on
+ * error.
+ */
+long ttm_tt_backup(struct ttm_device *bdev, struct ttm_tt *tt, bool purge,
+		   bool writeback)
+{
+	long ret = ttm_pool_backup_tt(&bdev->pool, tt, purge, writeback);
+
+	if (ret > 0)
+		tt->page_flags &= ~TTM_TT_FLAG_PRIV_POPULATED;
+
+	return ret;
+}
+EXPORT_SYMBOL(ttm_tt_backup);
+
 /**
  * ttm_tt_swapout - swap out tt object
  *
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index 160d954a261e..4e4db369952b 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -89,6 +89,11 @@ void ttm_pool_fini(struct ttm_pool *pool);
 
 int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);
 
+void ttm_pool_release_backed_up(struct ttm_tt *tt);
+
+long ttm_pool_backup_tt(struct ttm_pool *pool, struct ttm_tt *ttm,
+			bool purge, bool writeback);
+
 int ttm_pool_mgr_init(unsigned long num_pages);
 void ttm_pool_mgr_fini(void);
 
diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
index 2b9d856ff388..6b990f1e7dd0 100644
--- a/include/drm/ttm/ttm_tt.h
+++ b/include/drm/ttm/ttm_tt.h
@@ -32,11 +32,13 @@
 #include <drm/ttm/ttm_caching.h>
 #include <drm/ttm/ttm_kmap_iter.h>
 
+struct ttm_backup;
 struct ttm_device;
 struct ttm_tt;
 struct ttm_resource;
 struct ttm_buffer_object;
 struct ttm_operation_ctx;
+struct ttm_pool_tt_restore;
 
 /**
  * struct ttm_tt - This is a structure holding the pages, caching- and aperture
@@ -85,6 +87,9 @@ struct ttm_tt {
 	 * fault handling abuses the DMA api a bit and dma_map_attrs can't be
 	 * used to assure pgprot always matches.
 	 *
+	 * TTM_TT_FLAG_PRIV_BACKED_UP: TTM internal only. This is set if the
+	 * struct ttm_tt has been (possibly partially) backed up.
+	 *
 	 * TTM_TT_FLAG_PRIV_POPULATED: TTM internal only. DO NOT USE. This is
 	 * set by TTM after ttm_tt_populate() has successfully returned, and is
 	 * then unset when TTM calls ttm_tt_unpopulate().
@@ -96,6 +101,7 @@ struct ttm_tt {
 #define TTM_TT_FLAG_DECRYPTED		BIT(4)
 
 #define TTM_TT_FLAG_PRIV_POPULATED	BIT(5)
+#define TTM_TT_FLAG_PRIV_BACKED_UP	BIT(6)
 	uint32_t page_flags;
 	/** @num_pages: Number of pages in the page array. */
 	uint32_t num_pages;
@@ -105,11 +111,21 @@ struct ttm_tt {
 	dma_addr_t *dma_address;
 	/** @swap_storage: Pointer to shmem struct file for swap storage. */
 	struct file *swap_storage;
+	/**
+	 * @backup: Pointer to backup struct for backed up tts.
+	 * RFC: Could possibly be unified with @swap_storage.
+	 */
+	struct ttm_backup *backup;
 	/**
 	 * @caching: The current caching state of the pages, see enum
 	 * ttm_caching.
 	 */
 	enum ttm_caching caching;
+	/**
+	 * @restore: Partial restoration from backup state.
+	 * RFC: Incorporate in struct ttm_backup?
+	 */
+	struct ttm_pool_tt_restore *restore;
 };
 
 /**
@@ -230,6 +246,10 @@ void ttm_tt_mgr_init(unsigned long num_pages, unsigned long num_dma32_pages);
 struct ttm_kmap_iter *ttm_kmap_iter_tt_init(struct ttm_kmap_iter_tt *iter_tt,
 					    struct ttm_tt *tt);
 unsigned long ttm_tt_pages_limit(void);
+
+long ttm_tt_backup(struct ttm_device *bdev, struct ttm_tt *tt, bool purge,
+		   bool writeback);
+
 #if IS_ENABLED(CONFIG_AGP)
 #include <linux/agp_backend.h>
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v6 10/12] drm/ttm: Use fault-injection to test error paths
  2024-07-03 15:38 [PATCH v6 00/12] TTM shrinker helpers and xe buffer object shrinker Thomas Hellström
                   ` (8 preceding siblings ...)
  2024-07-03 15:38 ` [PATCH v6 09/12] drm/ttm/pool: Provide a helper to shrink pages Thomas Hellström
@ 2024-07-03 15:38 ` Thomas Hellström
  2024-08-07 23:43   ` Matthew Brost
  2024-07-03 15:38 ` [PATCH v6 11/12] drm/ttm, drm/xe: Add a shrinker for xe bos Thomas Hellström
  2024-07-03 15:38 ` [PATCH v6 12/12] drm/xe: Increase the XE_PL_TT watermark Thomas Hellström
  11 siblings, 1 reply; 38+ messages in thread
From: Thomas Hellström @ 2024-07-03 15:38 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Christian König,
	Somalapuram Amaranath, Matthew Brost, dri-devel

Use fault-injection to test partial TTM swapout and interrupted swapin.
Return -EINTR for swapin to test the caller's ability to handle and
restart the swapin, and on swapout perform a partial swapout to test
the swapin and release_shrunken functionality.
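
To exercise these paths in a debug build, the option added below is
simply enabled in the kernel config (every 100th interruptible swapin
then fails with -EINTR, and swapout only backs up half of the pages):

  CONFIG_DRM_TTM_BACKUP_FAULT_INJECT=y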

Cc: Christian König <christian.koenig@amd.com>
Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/Kconfig        | 10 ++++++++++
 drivers/gpu/drm/ttm/ttm_pool.c | 17 ++++++++++++++++-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index fd0749c0c630..9f27271bfab8 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -272,6 +272,16 @@ config DRM_GPUVM
 	  GPU-VM representation providing helpers to manage a GPUs virtual
 	  address space
 
+config DRM_TTM_BACKUP_FAULT_INJECT
+	bool "Enable fault injection during TTM backup"
+	depends on DRM_TTM
+	default n
+	help
+	  Inject recoverable failures during TTM backup and recovery of
+	  backed-up objects. For DRM driver developers only.
+
+	  If in doubt, choose N.
+
 config DRM_BUDDY
 	tristate
 	depends on DRM
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 38e50cf81b0a..d32a1f2e5e50 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -431,6 +431,7 @@ static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore,
 			       struct ttm_backup *backup,
 			       struct ttm_operation_ctx *ctx)
 {
+	static unsigned long __maybe_unused swappedin;
 	unsigned int i, nr = 1 << restore->order;
 	int ret = 0;
 
@@ -446,6 +447,13 @@ static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore,
 			if (handle == 0)
 				continue;
 
+			if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT) &&
+			    ctx->interruptible &&
+			    ++swappedin % 100 == 0) {
+				ret = -EINTR;
+				break;
+			}
+
 			ret = backup->ops->copy_backed_up_page
 				(backup, restore->first_page[i],
 				 handle, ctx->interruptible);
@@ -892,7 +900,14 @@ long ttm_pool_backup_tt(struct ttm_pool *pool, struct ttm_tt *ttm, bool purge,
 
 	alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN | __GFP_RETRY_MAYFAIL;
 
-	for (i = 0; i < ttm->num_pages; ++i) {
+	num_pages = ttm->num_pages;
+
+	/* Simulate fault injection by shrinking only half of the pages. */
+
+	if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT))
+		num_pages = DIV_ROUND_UP(num_pages, 2);
+
+	for (i = 0; i < num_pages; ++i) {
 		page = ttm->pages[i];
 		if (unlikely(!page))
 			continue;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v6 11/12] drm/ttm, drm/xe: Add a shrinker for xe bos
  2024-07-03 15:38 [PATCH v6 00/12] TTM shrinker helpers and xe buffer object shrinker Thomas Hellström
                   ` (9 preceding siblings ...)
  2024-07-03 15:38 ` [PATCH v6 10/12] drm/ttm: Use fault-injection to test error paths Thomas Hellström
@ 2024-07-03 15:38 ` Thomas Hellström
  2024-08-08  1:37   ` Matthew Brost
  2024-08-09 16:05   ` Matthew Auld
  2024-07-03 15:38 ` [PATCH v6 12/12] drm/xe: Increase the XE_PL_TT watermark Thomas Hellström
  11 siblings, 2 replies; 38+ messages in thread
From: Thomas Hellström @ 2024-07-03 15:38 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Christian König,
	Somalapuram Amaranath, Matthew Brost, dri-devel

Rather than relying on the TTM watermark accounting add a shrinker
for xe_bos in TT or system memory.

Leverage the newly added TTM per-page shrinking and shmem backup
support.

Although xe doesn't fully support WONTNEED (purgeable) bos yet,
introduce shrinker support for purgeable ttm_tts.
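
As a sketch of the purgeable accounting (the helper below is
hypothetical and mirrors what the added KUnit test does), marking a
ttm_tt WONTNEED moves its pages from the shrinkable to the purgeable
count:

/* Would live next to the struct xe_ttm_tt definition in xe_bo.c. */
static void xe_tt_mark_purgeable(struct xe_device *xe, struct xe_ttm_tt *xe_tt)
{
	long num_pages = xe_tt->ttm.num_pages;

	/* Caller is expected to hold the bo's dma-resv lock. */
	if (xe_tt->purgeable)
		return;

	xe_tt->purgeable = true;
	/* Pages leave the shrinkable count and enter the purgeable one. */
	xe_shrinker_mod_pages(xe->mem.shrinker, -num_pages, num_pages);
}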

v2:
- Cleanups bugfixes and a KUNIT shrinker test.
- Add writeback support, and activate it when called from kswapd.
v3:
- Move the try_shrink() helper to core TTM.
- Minor cleanups.
v4:
- Add runtime pm for the shrinker. Shrinking may require an active
  device for CCS metadata copying.
v5:
- Separately purge ghost- and zombie objects in the shrinker.
- Fix a format specifier - type inconsistency. (Kernel test robot).

Cc: Christian König <christian.koenig@amd.com>
Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo_util.c     |  67 ++++++
 drivers/gpu/drm/xe/Makefile           |   1 +
 drivers/gpu/drm/xe/tests/xe_bo.c      | 118 +++++++++++
 drivers/gpu/drm/xe/tests/xe_bo_test.c |   1 +
 drivers/gpu/drm/xe/tests/xe_bo_test.h |   1 +
 drivers/gpu/drm/xe/xe_bo.c            | 155 ++++++++++++--
 drivers/gpu/drm/xe/xe_bo.h            |  26 +++
 drivers/gpu/drm/xe/xe_device.c        |   8 +
 drivers/gpu/drm/xe/xe_device_types.h  |   2 +
 drivers/gpu/drm/xe/xe_shrinker.c      | 287 ++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_shrinker.h      |  18 ++
 include/drm/ttm/ttm_bo.h              |   3 +
 12 files changed, 671 insertions(+), 16 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_shrinker.c
 create mode 100644 drivers/gpu/drm/xe/xe_shrinker.h

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index c4f678f30fc2..563e96a4cf06 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -924,3 +924,70 @@ long ttm_lru_walk_for_evict(struct ttm_lru_walk *walk, struct ttm_device *bdev,
 
 	return progress;
 }
+EXPORT_SYMBOL(ttm_lru_walk_for_evict);
+
+/**
+ * ttm_bo_try_shrink - LRU walk helper to shrink a ttm buffer object.
+ * @walk: The struct ttm_lru_walk that describes the walk.
+ * @bo: The buffer object.
+ * @purge: Whether to attempt to purge the bo content since it's no
+ * longer needed.
+ * @writeback: If !@purge, attempt to write out to persistent storage.
+ *
+ * The function uses the ttm_tt_backup() functionality to back up or
+ * purge a struct ttm_tt. If the bo is not in system memory, it is
+ * first moved there.
+ *
+ * Return: The number of pages shrunken or purged, or
+ * negative error code on failure.
+ */
+long ttm_bo_try_shrink(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo,
+		       bool purge, bool writeback)
+{
+	static const struct ttm_place sys_placement_flags = {
+		.fpfn = 0,
+		.lpfn = 0,
+		.mem_type = TTM_PL_SYSTEM,
+		.flags = 0,
+	};
+	static struct ttm_placement sys_placement = {
+		.num_placement = 1,
+		.placement = &sys_placement_flags,
+	};
+	struct ttm_operation_ctx *ctx = walk->ctx;
+	struct ttm_tt *tt = bo->ttm;
+	long lret;
+
+	dma_resv_assert_held(bo->base.resv);
+
+	if (!tt || !ttm_tt_is_populated(tt))
+		return 0;
+
+	if (bo->resource->mem_type != TTM_PL_SYSTEM) {
+		int ret = ttm_bo_validate(bo, &sys_placement, ctx);
+
+		if (ret) {
+			if (ret == -EINTR || ret == -EDEADLK ||
+			    ret == -ERESTARTSYS)
+				return ret;
+			return 0;
+		}
+	}
+
+	lret = ttm_bo_wait_ctx(bo, ctx);
+	if (lret < 0) {
+		if (lret == -ERESTARTSYS)
+			return lret;
+		return 0;
+	}
+
+	if (bo->deleted)
+		lret = ttm_tt_backup(bo->bdev, tt, true, writeback);
+	else
+		lret = ttm_tt_backup(bo->bdev, tt, purge, writeback);
+	if (lret < 0 && lret != -EINTR)
+		return 0;
+
+	return lret;
+}
+EXPORT_SYMBOL(ttm_bo_try_shrink);
diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index b1e03bfe4a68..1eba51bdd172 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -112,6 +112,7 @@ xe-y += xe_bb.o \
 	xe_ring_ops.o \
 	xe_sa.o \
 	xe_sched_job.o \
+	xe_shrinker.o \
 	xe_step.o \
 	xe_sync.o \
 	xe_tile.o \
diff --git a/drivers/gpu/drm/xe/tests/xe_bo.c b/drivers/gpu/drm/xe/tests/xe_bo.c
index 9f3c02826464..49617f16dc76 100644
--- a/drivers/gpu/drm/xe/tests/xe_bo.c
+++ b/drivers/gpu/drm/xe/tests/xe_bo.c
@@ -6,6 +6,8 @@
 #include <kunit/test.h>
 #include <kunit/visibility.h>
 
+#include <uapi/linux/sysinfo.h>
+
 #include "tests/xe_bo_test.h"
 #include "tests/xe_pci_test.h"
 #include "tests/xe_test.h"
@@ -350,3 +352,119 @@ void xe_bo_evict_kunit(struct kunit *test)
 	xe_call_for_each_device(evict_test_run_device);
 }
 EXPORT_SYMBOL_IF_KUNIT(xe_bo_evict_kunit);
+
+struct xe_bo_link {
+	struct list_head link;
+	struct xe_bo *bo;
+};
+
+#define XE_BO_SHRINK_SIZE ((unsigned long)SZ_64M)
+
+/*
+ * Try to create system bos corresponding to twice the amount
+ * of available system memory to test shrinker functionality.
+ * If no swap space is available to accommodate the
+ * memory overcommit, mark bos purgeable.
+ */
+static int shrink_test_run_device(struct xe_device *xe)
+{
+	struct kunit *test = xe_cur_kunit();
+	LIST_HEAD(bos);
+	struct xe_bo_link *link, *next;
+	struct sysinfo si;
+	size_t total, alloced;
+	unsigned int interrupted = 0, successful = 0;
+
+	si_meminfo(&si);
+	total = si.freeram * si.mem_unit;
+
+	kunit_info(test, "Free ram is %lu bytes. Will allocate twice that amount.\n",
+		   (unsigned long) total);
+
+	total <<= 1;
+	for (alloced = 0; alloced < total ; alloced += XE_BO_SHRINK_SIZE) {
+		struct xe_bo *bo;
+		unsigned int mem_type;
+
+		link = kzalloc(sizeof(*link), GFP_KERNEL);
+		if (!link) {
+			KUNIT_FAIL(test, "Unexpected link allocation failure\n");
+			break;
+		}
+
+		INIT_LIST_HEAD(&link->link);
+
+		/* We can create bos using WC caching here. But it is slower. */
+		bo = xe_bo_create_user(xe, NULL, NULL, XE_BO_SHRINK_SIZE,
+				       DRM_XE_GEM_CPU_CACHING_WB,
+				       ttm_bo_type_device,
+				       XE_BO_FLAG_SYSTEM);
+		if (IS_ERR(bo)) {
+			if (bo != ERR_PTR(-ENOMEM) && bo != ERR_PTR(-ENOSPC) &&
+			    bo != ERR_PTR(-EINTR) && bo != ERR_PTR(-ERESTARTSYS))
+				KUNIT_FAIL(test, "Error creating bo: %pe\n", bo);
+			kfree(link);
+			break;
+		}
+		link->bo = bo;
+		list_add_tail(&link->link, &bos);
+		xe_bo_lock(bo, false);
+
+		/*
+		 * If we're low on swap entries, we can't shrink unless the bo
+		 * is marked purgeable.
+		 */
+		if (get_nr_swap_pages() < (XE_BO_SHRINK_SIZE >> PAGE_SHIFT) * 128) {
+			struct xe_ttm_tt *xe_tt =
+				container_of(bo->ttm.ttm, typeof(*xe_tt), ttm);
+			long num_pages = xe_tt->ttm.num_pages;
+
+			xe_tt->purgeable = true;
+			xe_shrinker_mod_pages(xe->mem.shrinker, -num_pages,
+					      num_pages);
+		}
+
+		mem_type = bo->ttm.resource->mem_type;
+		xe_bo_unlock(bo);
+		if (mem_type != XE_PL_TT)
+			KUNIT_FAIL(test, "Bo in incorrect memory type: %u\n",
+				   bo->ttm.resource->mem_type);
+		cond_resched();
+		if (signal_pending(current))
+			break;
+	}
+
+	/* Read back and destroy bos */
+	list_for_each_entry_safe_reverse(link, next, &bos, link) {
+		static struct ttm_operation_ctx ctx = {.interruptible = true};
+		struct xe_bo *bo = link->bo;
+		int ret;
+
+		if (!signal_pending(current)) {
+			xe_bo_lock(bo, false);
+			ret = ttm_bo_validate(&bo->ttm, &tt_placement, &ctx);
+			xe_bo_unlock(bo);
+			if (ret && ret != -EINTR)
+				KUNIT_FAIL(test, "Validation failed: %pe\n",
+					   ERR_PTR(ret));
+			else if (ret)
+				interrupted++;
+			else
+				successful++;
+		}
+		xe_bo_put(link->bo);
+		list_del(&link->link);
+		kfree(link);
+		cond_resched();
+	}
+	kunit_info(test, "Readbacks interrupted: %u successful: %u\n",
+		   interrupted, successful);
+
+	return 0;
+}
+
+void xe_bo_shrink_kunit(struct kunit *test)
+{
+	xe_call_for_each_device(shrink_test_run_device);
+}
+EXPORT_SYMBOL_IF_KUNIT(xe_bo_shrink_kunit);
diff --git a/drivers/gpu/drm/xe/tests/xe_bo_test.c b/drivers/gpu/drm/xe/tests/xe_bo_test.c
index a324cde77db8..317fa923e287 100644
--- a/drivers/gpu/drm/xe/tests/xe_bo_test.c
+++ b/drivers/gpu/drm/xe/tests/xe_bo_test.c
@@ -10,6 +10,7 @@
 static struct kunit_case xe_bo_tests[] = {
 	KUNIT_CASE(xe_ccs_migrate_kunit),
 	KUNIT_CASE(xe_bo_evict_kunit),
+	KUNIT_CASE_SLOW(xe_bo_shrink_kunit),
 	{}
 };
 
diff --git a/drivers/gpu/drm/xe/tests/xe_bo_test.h b/drivers/gpu/drm/xe/tests/xe_bo_test.h
index 0113ab45066a..7f44d14a45c5 100644
--- a/drivers/gpu/drm/xe/tests/xe_bo_test.h
+++ b/drivers/gpu/drm/xe/tests/xe_bo_test.h
@@ -10,5 +10,6 @@ struct kunit;
 
 void xe_ccs_migrate_kunit(struct kunit *test);
 void xe_bo_evict_kunit(struct kunit *test);
+void xe_bo_shrink_kunit(struct kunit *test);
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 65c696966e96..6ab63d1642ae 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -10,6 +10,7 @@
 #include <drm/drm_drv.h>
 #include <drm/drm_gem_ttm_helper.h>
 #include <drm/drm_managed.h>
+#include <drm/ttm/ttm_backup.h>
 #include <drm/ttm/ttm_device.h>
 #include <drm/ttm/ttm_placement.h>
 #include <drm/ttm/ttm_tt.h>
@@ -25,6 +26,7 @@
 #include "xe_pm.h"
 #include "xe_preempt_fence.h"
 #include "xe_res_cursor.h"
+#include "xe_shrinker.h"
 #include "xe_trace_bo.h"
 #include "xe_ttm_stolen_mgr.h"
 #include "xe_vm.h"
@@ -278,11 +280,15 @@ static void xe_evict_flags(struct ttm_buffer_object *tbo,
 	}
 }
 
+/* struct xe_ttm_tt - Subclassed ttm_tt for xe */
 struct xe_ttm_tt {
 	struct ttm_tt ttm;
-	struct device *dev;
+	/** @xe: The xe device */
+	struct xe_device *xe;
 	struct sg_table sgt;
 	struct sg_table *sg;
+	/** @purgeable: Whether the bo is purgeable (WONTNEED) */
+	bool purgeable;
 };
 
 static int xe_tt_map_sg(struct ttm_tt *tt)
@@ -291,7 +297,8 @@ static int xe_tt_map_sg(struct ttm_tt *tt)
 	unsigned long num_pages = tt->num_pages;
 	int ret;
 
-	XE_WARN_ON(tt->page_flags & TTM_TT_FLAG_EXTERNAL);
+	XE_WARN_ON((tt->page_flags & TTM_TT_FLAG_EXTERNAL) &&
+		   !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE));
 
 	if (xe_tt->sg)
 		return 0;
@@ -299,13 +306,13 @@ static int xe_tt_map_sg(struct ttm_tt *tt)
 	ret = sg_alloc_table_from_pages_segment(&xe_tt->sgt, tt->pages,
 						num_pages, 0,
 						(u64)num_pages << PAGE_SHIFT,
-						xe_sg_segment_size(xe_tt->dev),
+						xe_sg_segment_size(xe_tt->xe->drm.dev),
 						GFP_KERNEL);
 	if (ret)
 		return ret;
 
 	xe_tt->sg = &xe_tt->sgt;
-	ret = dma_map_sgtable(xe_tt->dev, xe_tt->sg, DMA_BIDIRECTIONAL,
+	ret = dma_map_sgtable(xe_tt->xe->drm.dev, xe_tt->sg, DMA_BIDIRECTIONAL,
 			      DMA_ATTR_SKIP_CPU_SYNC);
 	if (ret) {
 		sg_free_table(xe_tt->sg);
@@ -321,7 +328,7 @@ static void xe_tt_unmap_sg(struct ttm_tt *tt)
 	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
 
 	if (xe_tt->sg) {
-		dma_unmap_sgtable(xe_tt->dev, xe_tt->sg,
+		dma_unmap_sgtable(xe_tt->xe->drm.dev, xe_tt->sg,
 				  DMA_BIDIRECTIONAL, 0);
 		sg_free_table(xe_tt->sg);
 		xe_tt->sg = NULL;
@@ -336,21 +343,41 @@ struct sg_table *xe_bo_sg(struct xe_bo *bo)
 	return xe_tt->sg;
 }
 
+/*
+ * Account ttm pages against the device shrinker's shrinkable and
+ * purgeable counts.
+ */
+static void xe_ttm_tt_account(struct ttm_tt *tt, bool add)
+{
+	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
+	long num_pages = tt->num_pages;
+
+	if (!add)
+		num_pages = -num_pages;
+
+	if (xe_tt->purgeable)
+		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, 0, num_pages);
+	else
+		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, num_pages, 0);
+}
+
 static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
 				       u32 page_flags)
 {
 	struct xe_bo *bo = ttm_to_xe_bo(ttm_bo);
 	struct xe_device *xe = xe_bo_device(bo);
-	struct xe_ttm_tt *tt;
+	struct xe_ttm_tt *xe_tt;
+	struct ttm_tt *tt;
 	unsigned long extra_pages;
 	enum ttm_caching caching;
 	int err;
 
-	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
-	if (!tt)
+	xe_tt = kzalloc(sizeof(*xe_tt), GFP_KERNEL);
+	if (!xe_tt)
 		return NULL;
 
-	tt->dev = xe->drm.dev;
+	tt = &xe_tt->ttm;
+	xe_tt->xe = xe;
 
 	extra_pages = 0;
 	if (xe_bo_needs_ccs_pages(bo))
@@ -387,42 +414,128 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
 		caching = ttm_uncached;
 	}
 
-	err = ttm_tt_init(&tt->ttm, &bo->ttm, page_flags, caching, extra_pages);
+	if (ttm_bo->type != ttm_bo_type_sg)
+		page_flags |= TTM_TT_FLAG_EXTERNAL | TTM_TT_FLAG_EXTERNAL_MAPPABLE;
+
+	err = ttm_tt_init(tt, &bo->ttm, page_flags, caching, extra_pages);
 	if (err) {
-		kfree(tt);
+		kfree(xe_tt);
 		return NULL;
 	}
 
-	return &tt->ttm;
+	tt->backup = ttm_backup_shmem_create(tt->num_pages << PAGE_SHIFT);
+	if (IS_ERR(tt->backup)) {
+		ttm_tt_fini(tt);
+		kfree(xe_tt);
+		return NULL;
+	}
+
+	return tt;
 }
 
 static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
 			      struct ttm_operation_ctx *ctx)
 {
+	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
 	int err;
 
 	/*
 	 * dma-bufs are not populated with pages, and the dma-
 	 * addresses are set up when moved to XE_PL_TT.
 	 */
-	if (tt->page_flags & TTM_TT_FLAG_EXTERNAL)
+	if ((tt->page_flags & TTM_TT_FLAG_EXTERNAL) &&
+	    !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE))
 		return 0;
 
 	err = ttm_pool_alloc(&ttm_dev->pool, tt, ctx);
 	if (err)
 		return err;
 
-	return err;
+	xe_tt->purgeable = false;
+	xe_ttm_tt_account(tt, true);
+
+	return 0;
 }
 
 static void xe_ttm_tt_unpopulate(struct ttm_device *ttm_dev, struct ttm_tt *tt)
 {
-	if (tt->page_flags & TTM_TT_FLAG_EXTERNAL)
+	if ((tt->page_flags & TTM_TT_FLAG_EXTERNAL) &&
+	    !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE))
 		return;
 
 	xe_tt_unmap_sg(tt);
 
-	return ttm_pool_free(&ttm_dev->pool, tt);
+	ttm_pool_free(&ttm_dev->pool, tt);
+	xe_ttm_tt_account(tt, false);
+}
+
+/**
+ * xe_bo_shrink() - Try to shrink an xe bo.
+ * @walk: The walk parameters
+ * @bo: The TTM buffer object
+ * @purge: Only consider purgeable bos.
+ * @writeback: Try to write back to persistent storage.
+ *
+ * Try to shrink or purge a bo, and if it succeeds, unmap dma.
+ * Note that we also need to be able to handle non-xe bos
+ * (ghost bos), but only if the struct ttm_tt is embedded in
+ * a struct xe_ttm_tt.
+ *
+ * Return: The number of pages shrunken or purged, or negative error
+ * code on failure.
+ */
+long xe_bo_shrink(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo,
+		  bool purge, bool writeback)
+{
+	struct ttm_tt *tt = bo->ttm;
+	struct xe_ttm_tt *xe_tt = tt ? container_of(tt, struct xe_ttm_tt, ttm) : NULL;
+	struct ttm_place place = {.mem_type = bo->resource->mem_type};
+	struct xe_bo *xe_bo = ttm_to_xe_bo(bo);
+	struct xe_device *xe = xe_tt ? xe_tt->xe : NULL;
+	bool needs_rpm;
+	long lret = 0L;
+
+	if (!tt || !ttm_tt_is_populated(tt) ||
+	    !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE) ||
+	    (purge && !xe_tt->purgeable))
+		return 0L;
+
+	if (!ttm_bo_eviction_valuable(bo, &place))
+		return 0L;
+
+	/* Beware of zombies (GEM object refcount == 0) and ghosts. */
+	if (!xe_bo_is_xe_bo(bo) || !xe_bo_get_unless_zero(xe_bo)) {
+		struct ttm_placement null_placement = { .num_placement = 0 };
+
+		lret = ttm_bo_wait_ctx(bo, walk->ctx);
+		if (lret)
+			return lret;
+
+		/* Purge the bo content! */
+		ttm_bo_validate(bo, &null_placement, walk->ctx);
+		return tt->num_pages;
+	}
+
+	/* System CCS needs gpu copy when moving PL_TT -> PL_SYSTEM */
+	needs_rpm = (!IS_DGFX(xe) && bo->resource->mem_type != XE_PL_SYSTEM &&
+		     xe_bo && xe_bo_needs_ccs_pages(xe_bo) && !xe_tt->purgeable);
+	if (needs_rpm && !xe_pm_runtime_get_if_active(xe))
+		goto out_unref;
+
+	lret = ttm_bo_try_shrink(walk, bo, xe_tt->purgeable, writeback);
+	if (needs_rpm)
+		xe_pm_runtime_put(xe);
+
+	if (lret > 0) {
+		xe_assert(xe, !ttm_tt_is_populated(tt));
+
+		xe_ttm_tt_account(tt, false);
+	}
+
+out_unref:
+	xe_bo_put(xe_bo);
+
+	return lret;
 }
 
 static void xe_ttm_tt_destroy(struct ttm_device *ttm_dev, struct ttm_tt *tt)
@@ -1238,6 +1351,7 @@ struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
 	struct ttm_operation_ctx ctx = {
 		.interruptible = true,
 		.no_wait_gpu = false,
+		.gfp_retry_mayfail = true,
 	};
 	struct ttm_placement *placement;
 	uint32_t alignment;
@@ -1681,6 +1795,8 @@ int xe_bo_pin_external(struct xe_bo *bo)
 	}
 
 	ttm_bo_pin(&bo->ttm);
+	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
+		xe_ttm_tt_account(bo->ttm.ttm, false);
 
 	/*
 	 * FIXME: If we always use the reserve / unreserve functions for locking
@@ -1739,6 +1855,8 @@ int xe_bo_pin(struct xe_bo *bo)
 	}
 
 	ttm_bo_pin(&bo->ttm);
+	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
+		xe_ttm_tt_account(bo->ttm.ttm, false);
 
 	/*
 	 * FIXME: If we always use the reserve / unreserve functions for locking
@@ -1773,6 +1891,9 @@ void xe_bo_unpin_external(struct xe_bo *bo)
 	spin_unlock(&xe->pinned.lock);
 
 	ttm_bo_unpin(&bo->ttm);
+	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
+		xe_ttm_tt_account(bo->ttm.ttm, true);
+
 
 	/*
 	 * FIXME: If we always use the reserve / unreserve functions for locking
@@ -1801,6 +1922,8 @@ void xe_bo_unpin(struct xe_bo *bo)
 	}
 
 	ttm_bo_unpin(&bo->ttm);
+	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
+		xe_ttm_tt_account(bo->ttm.ttm, true);
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
index 6de894c728f5..8463e3f3f6f1 100644
--- a/drivers/gpu/drm/xe/xe_bo.h
+++ b/drivers/gpu/drm/xe/xe_bo.h
@@ -63,6 +63,7 @@
 #define XE_BO_PROPS_INVALID	(-1)
 
 struct sg_table;
+struct xe_ttm_lru_walk;
 
 struct xe_bo *xe_bo_alloc(void);
 void xe_bo_free(struct xe_bo *bo);
@@ -126,6 +127,28 @@ static inline struct xe_bo *xe_bo_get(struct xe_bo *bo)
 	return bo;
 }
 
+/**
+ * xe_bo_get_unless_zero() - Conditionally obtain a GEM object refcount on an
+ * xe bo
+ * @bo: The bo for which we want to obtain a refcount.
+ *
+ * There is a short window between where the bo's GEM object refcount reaches
+ * zero and where we put the final ttm_bo reference. Code in the eviction- and
+ * shrinking path should therefore attempt to grab a gem object reference before
+ * trying to use members outside of the base class ttm object. This function is
+ * intended for that purpose. On successful return, this function must be paired
+ * with an xe_bo_put().
+ *
+ * Return: @bo on success, NULL on failure.
+ */
+static inline __must_check struct xe_bo *xe_bo_get_unless_zero(struct xe_bo *bo)
+{
+	if (!bo || !kref_get_unless_zero(&bo->ttm.base.refcount))
+		return NULL;
+
+	return bo;
+}
+
 static inline void xe_bo_put(struct xe_bo *bo)
 {
 	if (bo)
@@ -315,6 +338,9 @@ static inline unsigned int xe_sg_segment_size(struct device *dev)
 
 #define i915_gem_object_flush_if_display(obj)		((void)(obj))
 
+long xe_bo_shrink(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo,
+		  bool purge, bool writeback);
+
 #if IS_ENABLED(CONFIG_DRM_XE_KUNIT_TEST)
 /**
  * xe_bo_is_mem_type - Whether the bo currently resides in the given
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index cfda7cb5df2c..58fecc4b0a18 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -47,6 +47,7 @@
 #include "xe_perf.h"
 #include "xe_pm.h"
 #include "xe_query.h"
+#include "xe_shrinker.h"
 #include "xe_sriov.h"
 #include "xe_tile.h"
 #include "xe_ttm_stolen_mgr.h"
@@ -241,6 +242,9 @@ static void xe_device_destroy(struct drm_device *dev, void *dummy)
 	if (xe->unordered_wq)
 		destroy_workqueue(xe->unordered_wq);
 
+	if (!IS_ERR_OR_NULL(xe->mem.shrinker))
+		xe_shrinker_destroy(xe->mem.shrinker);
+
 	ttm_device_fini(&xe->ttm);
 }
 
@@ -270,6 +274,10 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
 	if (err)
 		goto err;
 
+	xe->mem.shrinker = xe_shrinker_create(xe);
+	if (IS_ERR(xe->mem.shrinker))
+		return ERR_CAST(xe->mem.shrinker);
+
 	xe->info.devid = pdev->device;
 	xe->info.revid = pdev->revision;
 	xe->info.force_execlist = xe_modparam.force_execlist;
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index c37be471d11c..3d5440aba52e 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -325,6 +325,8 @@ struct xe_device {
 		struct xe_mem_region vram;
 		/** @mem.sys_mgr: system TTM manager */
 		struct ttm_resource_manager sys_mgr;
+		/** @mem.shrinker: system memory shrinker. */
+		struct xe_shrinker *shrinker;
 	} mem;
 
 	/** @sriov: device level virtualization data */
diff --git a/drivers/gpu/drm/xe/xe_shrinker.c b/drivers/gpu/drm/xe/xe_shrinker.c
new file mode 100644
index 000000000000..3f9554bdc06b
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_shrinker.c
@@ -0,0 +1,287 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#include <linux/shrinker.h>
+#include <linux/swap.h>
+
+#include <drm/ttm/ttm_bo.h>
+#include <drm/ttm/ttm_tt.h>
+
+#include "xe_bo.h"
+#include "xe_pm.h"
+#include "xe_shrinker.h"
+
+/**
+ * struct xe_shrinker - per-device shrinker
+ * @xe: Back pointer to the device.
+ * @lock: Lock protecting accounting.
+ * @shrinkable_pages: Number of pages that are currently shrinkable.
+ * @purgeable_pages: Number of pages that are currently purgeable.
+ * @shrink: Pointer to the mm shrinker.
+ * @pm_worker: Worker to wake up the device if required.
+ */
+struct xe_shrinker {
+	struct xe_device *xe;
+	rwlock_t lock;
+	long shrinkable_pages;
+	long purgeable_pages;
+	struct shrinker *shrink;
+	struct work_struct pm_worker;
+};
+
+/**
+ * struct xe_shrink_lru_walk - lru_walk subclass for shrinker
+ * @walk: The embedded base class.
+ * @xe: Pointer to the xe device.
+ * @purge: Purgeable-only request from the shrinker.
+ * @writeback: Try to write back to persistent storage.
+ */
+struct xe_shrink_lru_walk {
+	struct ttm_lru_walk walk;
+	struct xe_device *xe;
+	bool purge;
+	bool writeback;
+};
+
+static struct xe_shrinker *to_xe_shrinker(struct shrinker *shrink)
+{
+	return shrink->private_data;
+}
+
+static struct xe_shrink_lru_walk *
+to_xe_shrink_lru_walk(struct ttm_lru_walk *walk)
+{
+	return container_of(walk, struct xe_shrink_lru_walk, walk);
+}
+
+/**
+ * xe_shrinker_mod_pages() - Modify shrinker page accounting
+ * @shrinker: Pointer to the struct xe_shrinker.
+ * @shrinkable: Shrinkable pages delta. May be negative.
+ * @purgeable: Purgeable page delta. May be negative.
+ *
+ * Modifies the shrinkable and purgeable pages accounting.
+ */
+void
+xe_shrinker_mod_pages(struct xe_shrinker *shrinker, long shrinkable, long purgeable)
+{
+	write_lock(&shrinker->lock);
+	shrinker->shrinkable_pages += shrinkable;
+	shrinker->purgeable_pages += purgeable;
+	write_unlock(&shrinker->lock);
+}
+
+static long xe_shrinker_process_bo(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo)
+{
+	struct xe_shrink_lru_walk *shrink_walk = to_xe_shrink_lru_walk(walk);
+
+	return xe_bo_shrink(walk, bo, shrink_walk->purge, shrink_walk->writeback);
+}
+
+static long xe_shrinker_walk(struct xe_shrink_lru_walk *shrink_walk, long target)
+{
+	struct xe_device *xe = shrink_walk->xe;
+	struct ttm_resource_manager *man;
+	unsigned int mem_type;
+	long sofar = 0;
+	long lret;
+
+	for (mem_type = XE_PL_SYSTEM; mem_type <= XE_PL_TT; ++mem_type) {
+		man = ttm_manager_type(&xe->ttm, mem_type);
+		if (!man || !man->use_tt)
+			continue;
+
+		lret = ttm_lru_walk_for_evict(&shrink_walk->walk, &xe->ttm, man, target);
+		if (lret < 0)
+			return lret;
+
+		sofar += lret;
+		if (sofar >= target)
+			break;
+	}
+
+	return sofar;
+}
+
+static unsigned long
+xe_shrinker_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	struct xe_shrinker *shrinker = to_xe_shrinker(shrink);
+	unsigned long num_pages;
+
+	num_pages = get_nr_swap_pages();
+	read_lock(&shrinker->lock);
+	num_pages = min_t(unsigned long, num_pages, shrinker->shrinkable_pages);
+	num_pages += shrinker->purgeable_pages;
+	read_unlock(&shrinker->lock);
+
+	return num_pages ? num_pages : SHRINK_EMPTY;
+}
+
+static const struct ttm_lru_walk_ops xe_shrink_ops = {
+	.process_bo = xe_shrinker_process_bo,
+};
+
+/*
+ * Check if we need runtime pm, and if so try to grab a reference if
+ * already active. If grabbing a reference fails, queue a worker that
+ * does it for us outside of reclaim, but don't wait for it to complete.
+ * If bo shrinking needs an rpm reference and we don't have it (yet),
+ * that bo will be skipped anyway.
+ */
+static bool xe_shrinker_runtime_pm_get(struct xe_shrinker *shrinker, bool force,
+				       unsigned long nr_to_scan)
+{
+	struct xe_device *xe = shrinker->xe;
+
+	if (IS_DGFX(xe) || !xe_device_has_flat_ccs(xe) ||
+	    !get_nr_swap_pages())
+		return false;
+
+	if (!force) {
+		read_lock(&shrinker->lock);
+		force = (nr_to_scan > shrinker->purgeable_pages);
+		read_unlock(&shrinker->lock);
+		if (!force)
+			return false;
+	}
+
+	if (!xe_pm_runtime_get_if_active(xe)) {
+		queue_work(xe->unordered_wq, &shrinker->pm_worker);
+		return false;
+	}
+
+	return true;
+}
+
+static void xe_shrinker_runtime_pm_put(struct xe_shrinker *shrinker, bool runtime_pm)
+{
+	if (runtime_pm)
+		xe_pm_runtime_put(shrinker->xe);
+}
+
+static unsigned long xe_shrinker_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+	struct xe_shrinker *shrinker = to_xe_shrinker(shrink);
+	bool is_kswapd = current_is_kswapd();
+	struct ttm_operation_ctx ctx = {
+		.interruptible = false,
+		.no_wait_gpu = !is_kswapd,
+	};
+	unsigned long nr_to_scan, freed = 0;
+	struct xe_shrink_lru_walk shrink_walk = {
+		.walk = {
+			.ops = &xe_shrink_ops,
+			.ctx = &ctx,
+			.trylock_only = true,
+		},
+		.xe = shrinker->xe,
+		.purge = true,
+		.writeback = is_kswapd,
+	};
+	bool runtime_pm;
+	bool purgeable;
+	long ret;
+
+	sc->nr_scanned = 0;
+	nr_to_scan = sc->nr_to_scan;
+
+	read_lock(&shrinker->lock);
+	purgeable = !!shrinker->purgeable_pages;
+	read_unlock(&shrinker->lock);
+
+	/* Might need runtime PM. Try to wake early if it looks like it. */
+	runtime_pm = xe_shrinker_runtime_pm_get(shrinker, false, nr_to_scan);
+
+	while (purgeable && freed < nr_to_scan) {
+		ret = xe_shrinker_walk(&shrink_walk, nr_to_scan);
+		if (ret <= 0)
+			break;
+
+		freed += ret;
+	}
+
+	sc->nr_scanned = freed;
+	if (freed < nr_to_scan)
+		nr_to_scan -= freed;
+	else
+		nr_to_scan = 0;
+	if (!nr_to_scan)
+		goto out;
+
+	/* If we didn't wake before, try to do it now if needed. */
+	if (!runtime_pm)
+		runtime_pm = xe_shrinker_runtime_pm_get(shrinker, true, 0);
+
+	shrink_walk.purge = false;
+	nr_to_scan = sc->nr_to_scan;
+	while (freed < nr_to_scan) {
+		ret = xe_shrinker_walk(&shrink_walk, nr_to_scan);
+		if (ret <= 0)
+			break;
+
+		freed += ret;
+	}
+
+	sc->nr_scanned = freed;
+
+out:
+	xe_shrinker_runtime_pm_put(shrinker, runtime_pm);
+	return freed ? freed : SHRINK_STOP;
+}
+
+/* Wake up the device for shrinking. */
+static void xe_shrinker_pm(struct work_struct *work)
+{
+	struct xe_shrinker *shrinker =
+		container_of(work, typeof(*shrinker), pm_worker);
+
+	xe_pm_runtime_get(shrinker->xe);
+	xe_pm_runtime_put(shrinker->xe);
+}
+
+/**
+ * xe_shrinker_create() - Create an xe per-device shrinker
+ * @xe: Pointer to the xe device.
+ *
+ * Return: A pointer to the created shrinker on success,
+ * or a negative error code on failure.
+ */
+struct xe_shrinker *xe_shrinker_create(struct xe_device *xe)
+{
+	struct xe_shrinker *shrinker = kzalloc(sizeof(*shrinker), GFP_KERNEL);
+
+	if (!shrinker)
+		return ERR_PTR(-ENOMEM);
+
+	shrinker->shrink = shrinker_alloc(0, "xe system shrinker");
+	if (!shrinker->shrink) {
+		kfree(shrinker);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	INIT_WORK(&shrinker->pm_worker, xe_shrinker_pm);
+	shrinker->xe = xe;
+	rwlock_init(&shrinker->lock);
+	shrinker->shrink->count_objects = xe_shrinker_count;
+	shrinker->shrink->scan_objects = xe_shrinker_scan;
+	shrinker->shrink->private_data = shrinker;
+	shrinker_register(shrinker->shrink);
+
+	return shrinker;
+}
+
+/**
+ * xe_shrinker_destroy() - Destroy an xe per-device shrinker
+ * @shrinker: Pointer to the shrinker to destroy.
+ */
+void xe_shrinker_destroy(struct xe_shrinker *shrinker)
+{
+	xe_assert(shrinker->xe, !shrinker->shrinkable_pages);
+	xe_assert(shrinker->xe, !shrinker->purgeable_pages);
+	shrinker_free(shrinker->shrink);
+	flush_work(&shrinker->pm_worker);
+	kfree(shrinker);
+}
diff --git a/drivers/gpu/drm/xe/xe_shrinker.h b/drivers/gpu/drm/xe/xe_shrinker.h
new file mode 100644
index 000000000000..28a038f4fcbf
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_shrinker.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#ifndef _XE_SHRINKER_H_
+#define _XE_SHRINKER_H_
+
+struct xe_shrinker;
+struct xe_device;
+
+void xe_shrinker_mod_pages(struct xe_shrinker *shrinker, long shrinkable, long purgeable);
+
+struct xe_shrinker *xe_shrinker_create(struct xe_device *xe);
+
+void xe_shrinker_destroy(struct xe_shrinker *shrinker);
+
+#endif
diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
index e577528f5dfc..c7e81ae025d9 100644
--- a/include/drm/ttm/ttm_bo.h
+++ b/include/drm/ttm/ttm_bo.h
@@ -229,6 +229,9 @@ struct ttm_lru_walk {
 long ttm_lru_walk_for_evict(struct ttm_lru_walk *walk, struct ttm_device *bdev,
 			    struct ttm_resource_manager *man, long target);
 
+long ttm_bo_try_shrink(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo,
+		       bool purge, bool writeback);
+
 /**
  * ttm_bo_get - reference a struct ttm_buffer_object
  *
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v6 12/12] drm/xe: Increase the XE_PL_TT watermark
  2024-07-03 15:38 [PATCH v6 00/12] TTM shrinker helpers and xe buffer object shrinker Thomas Hellström
                   ` (10 preceding siblings ...)
  2024-07-03 15:38 ` [PATCH v6 11/12] drm/ttm, drm/xe: Add a shrinker for xe bos Thomas Hellström
@ 2024-07-03 15:38 ` Thomas Hellström
  2024-08-05 18:35   ` Souza, Jose
  2024-08-07 23:44   ` Matthew Brost
  11 siblings, 2 replies; 38+ messages in thread
From: Thomas Hellström @ 2024-07-03 15:38 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Matthew Brost, Somalapuram Amaranath,
	Christian König, dri-devel

The XE_PL_TT watermark was set to 50% of system memory.
The idea behind that was unclear since the net effect is that
TT memory will be evicted to TTM_PL_SYSTEM memory if that
watermark is exceeded, requiring PPGTT rebinds and dma
remapping. But there is no similar watermark for TTM_PL_SYSTEM
memory.

The TTM functionality that tries to swap out system memory to
shmem objects if a 50% limit of total system memory is reached
is orthogonal to this, and with the shrinker added, it's no
longer in effect.

Replace the 50% TTM_PL_TT limit with a 100% limit, in effect
allowing all graphics memory to be bound to the device unless it
has been swapped out by the shrinker.
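
For illustration only (the helper name is made up), the manager size
computed in xe_ttm_sys_mgr_init() before and after this patch:

#include <linux/mm.h>

static u64 xe_tt_watermark_sketch(void)
{
	struct sysinfo si;
	u64 gtt_size;

	si_meminfo(&si);
	gtt_size = (u64)si.totalram * si.mem_unit; /* 100% after this patch */
	/* gtt_size /= 2;   <- the old 50% watermark that this patch removes */

	return gtt_size;
}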

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/xe/xe_ttm_sys_mgr.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c b/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
index 9844a8edbfe1..d38b91872da3 100644
--- a/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
@@ -108,9 +108,8 @@ int xe_ttm_sys_mgr_init(struct xe_device *xe)
 	u64 gtt_size;
 
 	si_meminfo(&si);
+	/* Potentially restrict amount of TT memory here. */
 	gtt_size = (u64)si.totalram * si.mem_unit;
-	/* TTM limits allocation of all TTM devices by 50% of system memory */
-	gtt_size /= 2;
 
 	man->use_tt = true;
 	man->func = &xe_ttm_sys_mgr_func;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 04/12] drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist moves
  2024-07-03 15:38 ` [PATCH v6 04/12] drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist moves Thomas Hellström
@ 2024-07-03 17:53   ` Matthew Brost
  2024-07-04  9:21   ` Christian König
  1 sibling, 0 replies; 38+ messages in thread
From: Matthew Brost @ 2024-07-03 17:53 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Christian König, Somalapuram Amaranath, dri-devel

On Wed, Jul 03, 2024 at 05:38:05PM +0200, Thomas Hellström wrote:
> To address the problem with hitches moving when bulk move
> sublists are lru-bumped, register the list cursors with the
> ttm_lru_bulk_move structure when traversing its list, and
> when lru-bumping the list, move the cursor hitch to the tail.
> This also means it's mandatory for drivers to call
> ttm_lru_bulk_move_init() and ttm_lru_bulk_move_fini() when
> initializing and finalizing the bulk move structure, so add
> those calls to the amdgpu- and xe driver.
> 
> Compared to v1 this is slightly more code but less fragile
> and hopefully easier to understand.
> 
> Changes in previous series:
> - Completely rework the functionality
> - Avoid a NULL pointer dereference assigning manager->mem_type
> - Remove some leftover code causing build problems
> v2:
> - For hitch bulk tail moves, store the mem_type in the cursor
>   instead of with the manager.
> v3:
> - Remove leftover mem_type member from change in v2.
> v6:
> - Add some lockdep asserts (Matthew Brost)
> - Avoid NULL pointer dereference (Matthew Brost)
> - No need to check bo->resource before dereferencing
>   bo->bulk_move (Matthew Brost)
> 
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Matthew Brost <matthew.brost@intel.com>

> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c |  4 ++
>  drivers/gpu/drm/ttm/ttm_resource.c     | 92 ++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_vm.c             |  4 ++
>  include/drm/ttm/ttm_resource.h         | 56 ++++++++++------
>  4 files changed, 135 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 3abfa66d72a2..97743993d711 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2420,6 +2420,8 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>  	if (r)
>  		return r;
>  
> +	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
> +
>  	vm->is_compute_context = false;
>  
>  	vm->use_cpu_for_update = !!(adev->vm_manager.vm_update_mode &
> @@ -2484,6 +2486,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>  error_free_delayed:
>  	dma_fence_put(vm->last_tlb_flush);
>  	dma_fence_put(vm->last_unlocked);
> +	ttm_lru_bulk_move_fini(&adev->mman.bdev, &vm->lru_bulk_move);
>  	amdgpu_vm_fini_entities(vm);
>  
>  	return r;
> @@ -2640,6 +2643,7 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>  		}
>  	}
>  
> +	ttm_lru_bulk_move_fini(&adev->mman.bdev, &vm->lru_bulk_move);
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
> index 9c8b6499edfb..b6a2daac5518 100644
> --- a/drivers/gpu/drm/ttm/ttm_resource.c
> +++ b/drivers/gpu/drm/ttm/ttm_resource.c
> @@ -33,6 +33,53 @@
>  
>  #include <drm/drm_util.h>
>  
> +/* Detach the cursor from the bulk move list*/
> +static void
> +ttm_resource_cursor_clear_bulk(struct ttm_resource_cursor *cursor)
> +{
> +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> +
> +	cursor->bulk = NULL;
> +	list_del_init(&cursor->bulk_link);
> +}
> +
> +/* Move the cursor to the end of the bulk move list it's in */
> +static void ttm_resource_cursor_move_bulk_tail(struct ttm_lru_bulk_move *bulk,
> +					       struct ttm_resource_cursor *cursor)
> +{
> +	struct ttm_lru_bulk_move_pos *pos;
> +
> +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> +
> +	if (WARN_ON_ONCE(bulk != cursor->bulk)) {
> +		list_del_init(&cursor->bulk_link);
> +		return;
> +	}
> +
> +	pos = &bulk->pos[cursor->mem_type][cursor->priority];
> +	if (pos->last)
> +		list_move(&cursor->hitch.link, &pos->last->lru.link);
> +	ttm_resource_cursor_clear_bulk(cursor);
> +}
> +
> +/* Move all cursors attached to a bulk move to its end */
> +static void ttm_bulk_move_adjust_cursors(struct ttm_lru_bulk_move *bulk)
> +{
> +	struct ttm_resource_cursor *cursor, *next;
> +
> +	list_for_each_entry_safe(cursor, next, &bulk->cursor_list, bulk_link)
> +		ttm_resource_cursor_move_bulk_tail(bulk, cursor);
> +}
> +
> +/* Remove a cursor from an empty bulk move list */
> +static void ttm_bulk_move_drop_cursors(struct ttm_lru_bulk_move *bulk)
> +{
> +	struct ttm_resource_cursor *cursor, *next;
> +
> +	list_for_each_entry_safe(cursor, next, &bulk->cursor_list, bulk_link)
> +		ttm_resource_cursor_clear_bulk(cursor);
> +}
> +
>  /**
>   * ttm_resource_cursor_fini_locked() - Finalize the LRU list cursor usage
>   * @cursor: The struct ttm_resource_cursor to finalize.
> @@ -45,6 +92,7 @@ void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor *cursor)
>  {
>  	lockdep_assert_held(&cursor->man->bdev->lru_lock);
>  	list_del_init(&cursor->hitch.link);
> +	ttm_resource_cursor_clear_bulk(cursor);
>  }
>  
>  /**
> @@ -73,9 +121,27 @@ void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor)
>  void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk)
>  {
>  	memset(bulk, 0, sizeof(*bulk));
> +	INIT_LIST_HEAD(&bulk->cursor_list);
>  }
>  EXPORT_SYMBOL(ttm_lru_bulk_move_init);
>  
> +/**
> + * ttm_lru_bulk_move_fini - finalize a bulk move structure
> + * @bdev: The struct ttm_device
> + * @bulk: the structure to finalize
> + *
> + * Sanity checks that bulk moves don't have any
> + * resources left and hence no cursors attached.
> + */
> +void ttm_lru_bulk_move_fini(struct ttm_device *bdev,
> +			    struct ttm_lru_bulk_move *bulk)
> +{
> +	spin_lock(&bdev->lru_lock);
> +	ttm_bulk_move_drop_cursors(bulk);
> +	spin_unlock(&bdev->lru_lock);
> +}
> +EXPORT_SYMBOL(ttm_lru_bulk_move_fini);
> +
>  /**
>   * ttm_lru_bulk_move_tail - bulk move range of resources to the LRU tail.
>   *
> @@ -88,6 +154,7 @@ void ttm_lru_bulk_move_tail(struct ttm_lru_bulk_move *bulk)
>  {
>  	unsigned i, j;
>  
> +	ttm_bulk_move_adjust_cursors(bulk);
>  	for (i = 0; i < TTM_NUM_MEM_TYPES; ++i) {
>  		for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) {
>  			struct ttm_lru_bulk_move_pos *pos = &bulk->pos[i][j];
> @@ -515,6 +582,28 @@ void ttm_resource_manager_debug(struct ttm_resource_manager *man,
>  }
>  EXPORT_SYMBOL(ttm_resource_manager_debug);
>  
> +static void
> +ttm_resource_cursor_check_bulk(struct ttm_resource_cursor *cursor,
> +			       struct ttm_lru_item *next_lru)
> +{
> +	struct ttm_resource *next = ttm_lru_item_to_res(next_lru);
> +	struct ttm_lru_bulk_move *bulk = NULL;
> +	struct ttm_buffer_object *bo = next->bo;
> +
> +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> +	bulk = bo->bulk_move;
> +
> +	if (cursor->bulk != bulk) {
> +		if (bulk) {
> +			list_move_tail(&cursor->bulk_link, &bulk->cursor_list);
> +			cursor->mem_type = next->mem_type;
> +		} else {
> +			list_del_init(&cursor->bulk_link);
> +		}
> +		cursor->bulk = bulk;
> +	}
> +}
> +
>  /**
>   * ttm_resource_manager_first() - Start iterating over the resources
>   * of a resource manager
> @@ -535,6 +624,7 @@ ttm_resource_manager_first(struct ttm_resource_manager *man,
>  	cursor->priority = 0;
>  	cursor->man = man;
>  	ttm_lru_item_init(&cursor->hitch, TTM_LRU_HITCH);
> +	INIT_LIST_HEAD(&cursor->bulk_link);
>  	list_add(&cursor->hitch.link, &man->lru[cursor->priority]);
>  
>  	return ttm_resource_manager_next(cursor);
> @@ -559,6 +649,7 @@ ttm_resource_manager_next(struct ttm_resource_cursor *cursor)
>  		lru = &cursor->hitch;
>  		list_for_each_entry_continue(lru, &man->lru[cursor->priority], link) {
>  			if (ttm_lru_item_is_res(lru)) {
> +				ttm_resource_cursor_check_bulk(cursor, lru);
>  				list_move(&cursor->hitch.link, &lru->link);
>  				return ttm_lru_item_to_res(lru);
>  			}
> @@ -568,6 +659,7 @@ ttm_resource_manager_next(struct ttm_resource_cursor *cursor)
>  			break;
>  
>  		list_move(&cursor->hitch.link, &man->lru[cursor->priority]);
> +		ttm_resource_cursor_clear_bulk(cursor);
>  	}
>  
>  	ttm_resource_cursor_fini_locked(cursor);
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 5b166fa03684..0c7e327bc9a2 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -1335,6 +1335,8 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
>  
>  	INIT_WORK(&vm->destroy_work, vm_destroy_work_func);
>  
> +	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
> +
>  	INIT_LIST_HEAD(&vm->preempt.exec_queues);
>  	vm->preempt.min_run_period_ms = 10;	/* FIXME: Wire up to uAPI */
>  
> @@ -1458,6 +1460,7 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
>  	mutex_destroy(&vm->snap_mutex);
>  	for_each_tile(tile, xe, id)
>  		xe_range_fence_tree_fini(&vm->rftree[id]);
> +	ttm_lru_bulk_move_fini(&xe->ttm, &vm->lru_bulk_move);
>  	kfree(vm);
>  	if (flags & XE_VM_FLAG_LR_MODE)
>  		xe_pm_runtime_put(xe);
> @@ -1601,6 +1604,7 @@ static void vm_destroy_work_func(struct work_struct *w)
>  		XE_WARN_ON(vm->pt_root[id]);
>  
>  	trace_xe_vm_free(vm);
> +	ttm_lru_bulk_move_fini(&xe->ttm, &vm->lru_bulk_move);
>  	kfree(vm);
>  }
>  
> diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
> index 8fac781f641e..571abb4861a6 100644
> --- a/include/drm/ttm/ttm_resource.h
> +++ b/include/drm/ttm/ttm_resource.h
> @@ -269,26 +269,6 @@ ttm_lru_item_to_res(struct ttm_lru_item *item)
>  	return container_of(item, struct ttm_resource, lru);
>  }
>  
> -/**
> - * struct ttm_resource_cursor
> - *
> - * @man: The resource manager currently being iterated over.
> - * @hitch: A hitch list node inserted before the next resource
> - * to iterate over.
> - * @priority: the current priority
> - *
> - * Cursor to iterate over the resources in a manager.
> - */
> -struct ttm_resource_cursor {
> -	struct ttm_resource_manager *man;
> -	struct ttm_lru_item hitch;
> -	unsigned int priority;
> -};
> -
> -void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor *cursor);
> -
> -void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor);
> -
>  /**
>   * struct ttm_lru_bulk_move_pos
>   *
> @@ -304,8 +284,9 @@ struct ttm_lru_bulk_move_pos {
>  
>  /**
>   * struct ttm_lru_bulk_move
> - *
>   * @pos: first/last lru entry for resources in the each domain/priority
> + * @cursor_list: The list of cursors currently traversing any of
> + * the sublists of @pos. Protected by the ttm device's lru_lock.
>   *
>   * Container for the current bulk move state. Should be used with
>   * ttm_lru_bulk_move_init() and ttm_bo_set_bulk_move().
> @@ -315,8 +296,39 @@ struct ttm_lru_bulk_move_pos {
>   */
>  struct ttm_lru_bulk_move {
>  	struct ttm_lru_bulk_move_pos pos[TTM_NUM_MEM_TYPES][TTM_MAX_BO_PRIORITY];
> +	struct list_head cursor_list;
>  };
>  
> +/**
> + * struct ttm_resource_cursor
> + * @man: The resource manager currently being iterated over
> + * @hitch: A hitch list node inserted before the next resource
> + * to iterate over.
> + * @bulk_link: A list link for the list of cursors traversing the
> + * bulk sublist of @bulk. Protected by the ttm device's lru_lock.
> + * @bulk: Pointer to struct ttm_lru_bulk_move whose subrange @hitch is
> + * inserted to. NULL if none. Never dereference this pointer since
> + * the struct ttm_lru_bulk_move object pointed to might have been
> + * freed. The pointer is only for comparison.
> + * @mem_type: The memory type of the LRU list being traversed.
> + * This field is valid iff @bulk != NULL.
> + * @priority: the current priority
> + *
> + * Cursor to iterate over the resources in a manager.
> + */
> +struct ttm_resource_cursor {
> +	struct ttm_resource_manager *man;
> +	struct ttm_lru_item hitch;
> +	struct list_head bulk_link;
> +	struct ttm_lru_bulk_move *bulk;
> +	unsigned int mem_type;
> +	unsigned int priority;
> +};
> +
> +void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor *cursor);
> +
> +void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor);
> +
>  /**
>   * struct ttm_kmap_iter_iomap - Specialization for a struct io_mapping +
>   * struct sg_table backed struct ttm_resource.
> @@ -405,6 +417,8 @@ ttm_resource_manager_cleanup(struct ttm_resource_manager *man)
>  
>  void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk);
>  void ttm_lru_bulk_move_tail(struct ttm_lru_bulk_move *bulk);
> +void ttm_lru_bulk_move_fini(struct ttm_device *bdev,
> +			    struct ttm_lru_bulk_move *bulk);
>  
>  void ttm_resource_add_bulk_move(struct ttm_resource *res,
>  				struct ttm_buffer_object *bo);
> -- 
> 2.44.0
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 06/12] drm/ttm: Use the LRU walker helper for swapping
  2024-07-03 15:38 ` [PATCH v6 06/12] drm/ttm: Use the LRU walker helper for swapping Thomas Hellström
@ 2024-07-03 18:24   ` Matthew Brost
  0 siblings, 0 replies; 38+ messages in thread
From: Matthew Brost @ 2024-07-03 18:24 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Christian König, Somalapuram Amaranath, dri-devel

On Wed, Jul 03, 2024 at 05:38:07PM +0200, Thomas Hellström wrote:
> Rework the TTM swapping to use the LRU walker helper.
> This helps fixing up the ttm_bo_swapout() interface
> to be consistent about not requiring any locking.
> 
> For now mimic the current behaviour of using trylock
> only. We could be using ticket-locks here but defer
> that until it's deemed necessary. The TTM swapout
> functionality is a bit weird anyway since it
> alternates between memory types without exhausting
> TTM_PL_SYSTEM first.
> 
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Matthew Brost <matthew.brost@intel.com>

> Cc: <dri-devel@lists.freedesktop.org>
> 
> v6:
> - Improve on error code translation in the swapout callback
>   (Matthew Brost).
> 
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  drivers/gpu/drm/ttm/ttm_bo.c     | 111 ++++++++++++++++++++-----------
>  drivers/gpu/drm/ttm/ttm_device.c |  30 ++-------
>  include/drm/ttm/ttm_bo.h         |   5 +-
>  3 files changed, 82 insertions(+), 64 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 43eda720657f..1053cdca131e 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -1118,11 +1118,23 @@ int ttm_bo_wait_ctx(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx)
>  }
>  EXPORT_SYMBOL(ttm_bo_wait_ctx);
>  
> -int ttm_bo_swapout(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx,
> -		   gfp_t gfp_flags)
> +/**
> + * struct ttm_bo_swapout_walk - Parameters for the swapout walk
> + */
> +struct ttm_bo_swapout_walk {
> +	/** @walk: The walk base parameters. */
> +	struct ttm_lru_walk walk;
> +	/** @gfp_flags: The gfp flags to use for ttm_tt_swapout() */
> +	gfp_t gfp_flags;
> +};
> +
> +static long
> +ttm_bo_swapout_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo)
>  {
> -	struct ttm_place place;
> -	bool locked;
> +	struct ttm_place place = {.mem_type = bo->resource->mem_type};
> +	struct ttm_bo_swapout_walk *swapout_walk =
> +		container_of(walk, typeof(*swapout_walk), walk);
> +	struct ttm_operation_ctx *ctx = walk->ctx;
>  	long ret;
>  
>  	/*
> @@ -1131,28 +1143,29 @@ int ttm_bo_swapout(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx,
>  	 * The driver may use the fact that we're moving from SYSTEM
>  	 * as an indication that we're about to swap out.
>  	 */
> -	memset(&place, 0, sizeof(place));
> -	place.mem_type = bo->resource->mem_type;
> -	if (!ttm_bo_evict_swapout_allowable(bo, ctx, &place, &locked, NULL))
> -		return -EBUSY;
> +	if (!bo->bdev->funcs->eviction_valuable(bo, &place)) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
>  
>  	if (!bo->ttm || !ttm_tt_is_populated(bo->ttm) ||
>  	    bo->ttm->page_flags & TTM_TT_FLAG_EXTERNAL ||
> -	    bo->ttm->page_flags & TTM_TT_FLAG_SWAPPED ||
> -	    !ttm_bo_get_unless_zero(bo)) {
> -		if (locked)
> -			dma_resv_unlock(bo->base.resv);
> -		return -EBUSY;
> +	    bo->ttm->page_flags & TTM_TT_FLAG_SWAPPED) {
> +		ret = -EBUSY;
> +		goto out;
>  	}
>  
>  	if (bo->deleted) {
> -		ret = ttm_bo_cleanup_refs(bo, false, false, locked);
> -		ttm_bo_put(bo);
> -		return ret == -EBUSY ? -ENOSPC : ret;
> -	}
> +		pgoff_t num_pages = bo->ttm->num_pages;
>  
> -	/* TODO: Cleanup the locking */
> -	spin_unlock(&bo->bdev->lru_lock);
> +		ret = ttm_bo_wait_ctx(bo, ctx);
> +		if (ret)
> +			goto out;
> +
> +		ttm_bo_cleanup_memtype_use(bo);
> +		ret = num_pages;
> +		goto out;
> +	}
>  
>  	/*
>  	 * Move to system cached
> @@ -1164,12 +1177,13 @@ int ttm_bo_swapout(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx,
>  		memset(&hop, 0, sizeof(hop));
>  		place.mem_type = TTM_PL_SYSTEM;
>  		ret = ttm_resource_alloc(bo, &place, &evict_mem);
> -		if (unlikely(ret))
> +		if (ret)
>  			goto out;
>  
>  		ret = ttm_bo_handle_move_mem(bo, evict_mem, true, ctx, &hop);
> -		if (unlikely(ret != 0)) {
> -			WARN(ret == -EMULTIHOP, "Unexpected multihop in swaput - likely driver bug.\n");
> +		if (ret) {
> +			WARN(ret == -EMULTIHOP,
> +			     "Unexpected multihop in swapout - likely driver bug.\n");
>  			ttm_resource_free(bo, &evict_mem);
>  			goto out;
>  		}
> @@ -1179,30 +1193,53 @@ int ttm_bo_swapout(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx,
>  	 * Make sure BO is idle.
>  	 */
>  	ret = ttm_bo_wait_ctx(bo, ctx);
> -	if (unlikely(ret != 0))
> +	if (ret)
>  		goto out;
>  
>  	ttm_bo_unmap_virtual(bo);
> -
> -	/*
> -	 * Swap out. Buffer will be swapped in again as soon as
> -	 * anyone tries to access a ttm page.
> -	 */
>  	if (bo->bdev->funcs->swap_notify)
>  		bo->bdev->funcs->swap_notify(bo);
>  
>  	if (ttm_tt_is_populated(bo->ttm))
> -		ret = ttm_tt_swapout(bo->bdev, bo->ttm, gfp_flags);
> +		ret = ttm_tt_swapout(bo->bdev, bo->ttm, swapout_walk->gfp_flags);
>  out:
> +	/* Consider -ENOMEM and -ENOSPC non-fatal. */
> +	if (ret == -ENOMEM || ret == -ENOSPC)
> +		ret = -EBUSY;
>  
> -	/*
> -	 * Unreserve without putting on LRU to avoid swapping out an
> -	 * already swapped buffer.
> -	 */
> -	if (locked)
> -		dma_resv_unlock(bo->base.resv);
> -	ttm_bo_put(bo);
> -	return ret == -EBUSY ? -ENOSPC : ret;
> +	return ret;
> +}
> +
> +const struct ttm_lru_walk_ops ttm_swap_ops = {
> +	.process_bo = ttm_bo_swapout_cb,
> +};
> +
> +/**
> + * ttm_bo_swapout() - Swap out buffer objects on the LRU list to shmem.
> + * @bdev: The ttm device.
> + * @ctx: The ttm_operation_ctx governing the swapout operation.
> + * @man: The resource manager whose resources / buffer objects are
> + * going to be swapped out.
> + * @gfp_flags: The gfp flags used for shmem page allocations.
> + * @target: The desired number of pages to swap out.
> + *
> + * Return: The number of pages actually swapped out, or negative error code
> + * on error.
> + */
> +long ttm_bo_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx,
> +		    struct ttm_resource_manager *man, gfp_t gfp_flags,
> +		    pgoff_t target)
> +{
> +	struct ttm_bo_swapout_walk swapout_walk = {
> +		.walk = {
> +			.ops = &ttm_swap_ops,
> +			.ctx = ctx,
> +			.trylock_only = true,
> +		},
> +		.gfp_flags = gfp_flags,
> +	};
> +
> +	return ttm_lru_walk_for_evict(&swapout_walk.walk, bdev, man, target);
>  }
>  
>  void ttm_bo_tt_destroy(struct ttm_buffer_object *bo)
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index f9e9b1ec8c8a..ee575d8a54c0 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -148,40 +148,20 @@ int ttm_global_swapout(struct ttm_operation_ctx *ctx, gfp_t gfp_flags)
>  int ttm_device_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx,
>  		       gfp_t gfp_flags)
>  {
> -	struct ttm_resource_cursor cursor;
>  	struct ttm_resource_manager *man;
> -	struct ttm_resource *res;
>  	unsigned i;
> -	int ret;
> +	long lret;
>  
> -	spin_lock(&bdev->lru_lock);
>  	for (i = TTM_PL_SYSTEM; i < TTM_NUM_MEM_TYPES; ++i) {
>  		man = ttm_manager_type(bdev, i);
>  		if (!man || !man->use_tt)
>  			continue;
>  
> -		ttm_resource_manager_for_each_res(man, &cursor, res) {
> -			struct ttm_buffer_object *bo = res->bo;
> -			uint32_t num_pages;
> -
> -			if (!bo || bo->resource != res)
> -				continue;
> -
> -			num_pages = PFN_UP(bo->base.size);
> -			ret = ttm_bo_swapout(bo, ctx, gfp_flags);
> -			/* ttm_bo_swapout has dropped the lru_lock */
> -			if (!ret) {
> -				ttm_resource_cursor_fini(&cursor);
> -				return num_pages;
> -			}
> -			if (ret != -EBUSY) {
> -				ttm_resource_cursor_fini(&cursor);
> -				return ret;
> -			}
> -		}
> +		lret = ttm_bo_swapout(bdev, ctx, man, gfp_flags, 1);
> +		/* Can be both positive (num_pages) and negative (error) */
> +		if (lret)
> +			return lret;
>  	}
> -	ttm_resource_cursor_fini_locked(&cursor);
> -	spin_unlock(&bdev->lru_lock);
>  	return 0;
>  }
>  EXPORT_SYMBOL(ttm_device_swapout);
> diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
> index 10bff3aecd5c..de97ea9fa75f 100644
> --- a/include/drm/ttm/ttm_bo.h
> +++ b/include/drm/ttm/ttm_bo.h
> @@ -417,8 +417,9 @@ void ttm_bo_kunmap(struct ttm_bo_kmap_obj *map);
>  int ttm_bo_vmap(struct ttm_buffer_object *bo, struct iosys_map *map);
>  void ttm_bo_vunmap(struct ttm_buffer_object *bo, struct iosys_map *map);
>  int ttm_bo_mmap_obj(struct vm_area_struct *vma, struct ttm_buffer_object *bo);
> -int ttm_bo_swapout(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx,
> -		   gfp_t gfp_flags);
> +long ttm_bo_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx,
> +		    struct ttm_resource_manager *man, gfp_t gfp_flags,
> +		    pgoff_t target);
>  void ttm_bo_pin(struct ttm_buffer_object *bo);
>  void ttm_bo_unpin(struct ttm_buffer_object *bo);
>  int ttm_mem_evict_first(struct ttm_device *bdev,
> -- 
> 2.44.0
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 07/12] drm/ttm: Use the LRU walker for eviction
  2024-07-03 15:38 ` [PATCH v6 07/12] drm/ttm: Use the LRU walker for eviction Thomas Hellström
@ 2024-07-03 19:20   ` Matthew Brost
  0 siblings, 0 replies; 38+ messages in thread
From: Matthew Brost @ 2024-07-03 19:20 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Christian König, Somalapuram Amaranath, dri-devel

On Wed, Jul 03, 2024 at 05:38:08PM +0200, Thomas Hellström wrote:
> Use the LRU walker for eviction. This helps
> removing a lot of code with weird locking
> semantics.
> 
> The functionality is slightly changed so that
> when trylocked buffer objects are exhausted, we
> continue to interleave walks with ticket-locks while
> there is still progress made. The list walks are
> not restarted in-between evictions.
> 
> Also provide a separate ttm_bo_evict_first()
> function for its single user. The context of that
> user allows sleeping dma_resv locks.
> 
> v6:
> - Various cleanups suggested by Matthew Brost.
> - Fix error return code of ttm_bo_evict_first(). (Matthew Brost)
> - Fix an error check that was inverted. (Matthew Brost)
> 
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Matthew Brost <matthew.brost@intel.com>

> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  drivers/gpu/drm/ttm/ttm_bo.c       | 346 ++++++++++++-----------------
>  drivers/gpu/drm/ttm/ttm_resource.c |  21 +-
>  include/drm/ttm/ttm_bo.h           |   8 +-
>  3 files changed, 144 insertions(+), 231 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 1053cdca131e..603b9353f436 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -224,80 +224,6 @@ static void ttm_bo_flush_all_fences(struct ttm_buffer_object *bo)
>  	dma_resv_iter_end(&cursor);
>  }
>  
> -/**
> - * ttm_bo_cleanup_refs
> - * If bo idle, remove from lru lists, and unref.
> - * If not idle, block if possible.
> - *
> - * Must be called with lru_lock and reservation held, this function
> - * will drop the lru lock and optionally the reservation lock before returning.
> - *
> - * @bo:                    The buffer object to clean-up
> - * @interruptible:         Any sleeps should occur interruptibly.
> - * @no_wait_gpu:           Never wait for gpu. Return -EBUSY instead.
> - * @unlock_resv:           Unlock the reservation lock as well.
> - */
> -
> -static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo,
> -			       bool interruptible, bool no_wait_gpu,
> -			       bool unlock_resv)
> -{
> -	struct dma_resv *resv = &bo->base._resv;
> -	int ret;
> -
> -	if (dma_resv_test_signaled(resv, DMA_RESV_USAGE_BOOKKEEP))
> -		ret = 0;
> -	else
> -		ret = -EBUSY;
> -
> -	if (ret && !no_wait_gpu) {
> -		long lret;
> -
> -		if (unlock_resv)
> -			dma_resv_unlock(bo->base.resv);
> -		spin_unlock(&bo->bdev->lru_lock);
> -
> -		lret = dma_resv_wait_timeout(resv, DMA_RESV_USAGE_BOOKKEEP,
> -					     interruptible,
> -					     30 * HZ);
> -
> -		if (lret < 0)
> -			return lret;
> -		else if (lret == 0)
> -			return -EBUSY;
> -
> -		spin_lock(&bo->bdev->lru_lock);
> -		if (unlock_resv && !dma_resv_trylock(bo->base.resv)) {
> -			/*
> -			 * We raced, and lost, someone else holds the reservation now,
> -			 * and is probably busy in ttm_bo_cleanup_memtype_use.
> -			 *
> -			 * Even if it's not the case, because we finished waiting any
> -			 * delayed destruction would succeed, so just return success
> -			 * here.
> -			 */
> -			spin_unlock(&bo->bdev->lru_lock);
> -			return 0;
> -		}
> -		ret = 0;
> -	}
> -
> -	if (ret) {
> -		if (unlock_resv)
> -			dma_resv_unlock(bo->base.resv);
> -		spin_unlock(&bo->bdev->lru_lock);
> -		return ret;
> -	}
> -
> -	spin_unlock(&bo->bdev->lru_lock);
> -	ttm_bo_cleanup_memtype_use(bo);
> -
> -	if (unlock_resv)
> -		dma_resv_unlock(bo->base.resv);
> -
> -	return 0;
> -}
> -
>  /*
>   * Block for the dma_resv object to become idle, lock the buffer and clean up
>   * the resource and tt object.
> @@ -505,151 +431,153 @@ bool ttm_bo_eviction_valuable(struct ttm_buffer_object *bo,
>  }
>  EXPORT_SYMBOL(ttm_bo_eviction_valuable);
>  
> -/*
> - * Check the target bo is allowable to be evicted or swapout, including cases:
> - *
> - * a. if share same reservation object with ctx->resv, have assumption
> - * reservation objects should already be locked, so not lock again and
> - * return true directly when either the opreation allow_reserved_eviction
> - * or the target bo already is in delayed free list;
> +/**
> + * ttm_bo_evict_first() - Evict the first bo on the manager's LRU list.
> + * @bdev: The ttm device.
> + * @man: The manager whose bo to evict.
> + * @ctx: The TTM operation ctx governing the eviction.
>   *
> - * b. Otherwise, trylock it.
> + * Return: 0 if successful or the resource disappeared. Negative error code on error.
>   */
> -static bool ttm_bo_evict_swapout_allowable(struct ttm_buffer_object *bo,
> -					   struct ttm_operation_ctx *ctx,
> -					   const struct ttm_place *place,
> -					   bool *locked, bool *busy)
> +int ttm_bo_evict_first(struct ttm_device *bdev, struct ttm_resource_manager *man,
> +		       struct ttm_operation_ctx *ctx)
>  {
> -	bool ret = false;
> +	struct ttm_resource_cursor cursor;
> +	struct ttm_buffer_object *bo;
> +	struct ttm_resource *res;
> +	unsigned int mem_type;
> +	int ret = 0;
>  
> -	if (bo->pin_count) {
> -		*locked = false;
> -		if (busy)
> -			*busy = false;
> -		return false;
> +	spin_lock(&bdev->lru_lock);
> +	res = ttm_resource_manager_first(man, &cursor);
> +	if (!res) {
> +		ret = -ENOENT;
> +		goto out_no_ref;
>  	}
> +	bo = res->bo;
> +	if (!ttm_bo_get_unless_zero(bo))
> +		goto out_no_ref;
> +	mem_type = res->mem_type;
> +	spin_unlock(&bdev->lru_lock);
> +	ret = ttm_bo_reserve(bo, ctx->interruptible, ctx->no_wait_gpu, NULL);
> +	if (ret)
> +		goto out_no_lock;
> +	if (bo->resource != res || res->mem_type != mem_type)
> +		goto out_bo_moved;
>  
> -	if (bo->base.resv == ctx->resv) {
> -		dma_resv_assert_held(bo->base.resv);
> -		if (ctx->allow_res_evict)
> -			ret = true;
> -		*locked = false;
> -		if (busy)
> -			*busy = false;
> +	if (bo->deleted) {
> +		ret = ttm_bo_wait_ctx(bo, ctx);
> +		if (!ret)
> +			ttm_bo_cleanup_memtype_use(bo);
>  	} else {
> -		ret = dma_resv_trylock(bo->base.resv);
> -		*locked = ret;
> -		if (busy)
> -			*busy = !ret;
> -	}
> -
> -	if (ret && place && (bo->resource->mem_type != place->mem_type ||
> -		!bo->bdev->funcs->eviction_valuable(bo, place))) {
> -		ret = false;
> -		if (*locked) {
> -			dma_resv_unlock(bo->base.resv);
> -			*locked = false;
> -		}
> +		ret = ttm_bo_evict(bo, ctx);
>  	}
> +out_bo_moved:
> +	dma_resv_unlock(bo->base.resv);
> +out_no_lock:
> +	ttm_bo_put(bo);
> +	ttm_resource_cursor_fini(&cursor);
> +	return ret;
>  
> +out_no_ref:
> +	ttm_resource_cursor_fini_locked(&cursor);
> +	spin_unlock(&bdev->lru_lock);
>  	return ret;
>  }
>  
>  /**
> - * ttm_mem_evict_wait_busy - wait for a busy BO to become available
> - *
> - * @busy_bo: BO which couldn't be locked with trylock
> - * @ctx: operation context
> - * @ticket: acquire ticket
> - *
> - * Try to lock a busy buffer object to avoid failing eviction.
> + * struct ttm_bo_evict_walk - Parameters for the evict walk.
>   */
> -static int ttm_mem_evict_wait_busy(struct ttm_buffer_object *busy_bo,
> -				   struct ttm_operation_ctx *ctx,
> -				   struct ww_acquire_ctx *ticket)
> -{
> -	int r;
> -
> -	if (!busy_bo || !ticket)
> -		return -EBUSY;
> -
> -	if (ctx->interruptible)
> -		r = dma_resv_lock_interruptible(busy_bo->base.resv,
> -							  ticket);
> -	else
> -		r = dma_resv_lock(busy_bo->base.resv, ticket);
> -
> -	/*
> -	 * TODO: It would be better to keep the BO locked until allocation is at
> -	 * least tried one more time, but that would mean a much larger rework
> -	 * of TTM.
> -	 */
> -	if (!r)
> -		dma_resv_unlock(busy_bo->base.resv);
> -
> -	return r == -EDEADLK ? -EBUSY : r;
> -}
> +struct ttm_bo_evict_walk {
> +	/** @walk: The walk base parameters. */
> +	struct ttm_lru_walk walk;
> +	/** @place: The place passed to the resource allocation. */
> +	const struct ttm_place *place;
> +	/** @evictor: The buffer object we're trying to make room for. */
> +	struct ttm_buffer_object *evictor;
> +	/** @res: The allocated resource if any. */
> +	struct ttm_resource **res;
> +	/** @evicted: Number of successful evictions. */
> +	unsigned long evicted;
> +};
>  
> -int ttm_mem_evict_first(struct ttm_device *bdev,
> -			struct ttm_resource_manager *man,
> -			const struct ttm_place *place,
> -			struct ttm_operation_ctx *ctx,
> -			struct ww_acquire_ctx *ticket)
> +static long ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo)
>  {
> -	struct ttm_buffer_object *bo = NULL, *busy_bo = NULL;
> -	struct ttm_resource_cursor cursor;
> -	struct ttm_resource *res;
> -	bool locked = false;
> -	int ret;
> +	struct ttm_bo_evict_walk *evict_walk =
> +		container_of(walk, typeof(*evict_walk), walk);
> +	long lret;
>  
> -	spin_lock(&bdev->lru_lock);
> -	ttm_resource_manager_for_each_res(man, &cursor, res) {
> -		bool busy;
> -
> -		if (!ttm_bo_evict_swapout_allowable(res->bo, ctx, place,
> -						    &locked, &busy)) {
> -			if (busy && !busy_bo && ticket !=
> -			    dma_resv_locking_ctx(res->bo->base.resv))
> -				busy_bo = res->bo;
> -			continue;
> -		}
> +	if (!bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
> +		return 0;
>  
> -		if (ttm_bo_get_unless_zero(res->bo)) {
> -			bo = res->bo;
> -			break;
> -		}
> -		if (locked)
> -			dma_resv_unlock(res->bo->base.resv);
> +	if (bo->deleted) {
> +		lret = ttm_bo_wait_ctx(bo, walk->ctx);
> +		if (!lret)
> +			ttm_bo_cleanup_memtype_use(bo);
> +	} else {
> +		lret = ttm_bo_evict(bo, walk->ctx);
>  	}
> -	ttm_resource_cursor_fini_locked(&cursor);
>  
> -	if (!bo) {
> -		if (busy_bo && !ttm_bo_get_unless_zero(busy_bo))
> -			busy_bo = NULL;
> -		spin_unlock(&bdev->lru_lock);
> -		ret = ttm_mem_evict_wait_busy(busy_bo, ctx, ticket);
> -		if (busy_bo)
> -			ttm_bo_put(busy_bo);
> -		return ret;
> -	}
> +	if (lret)
> +		goto out;
>  
> -	if (bo->deleted) {
> -		ret = ttm_bo_cleanup_refs(bo, ctx->interruptible,
> -					  ctx->no_wait_gpu, locked);
> -		ttm_bo_put(bo);
> -		return ret;
> -	}
> +	evict_walk->evicted++;
> +	if (evict_walk->res)
> +		lret = ttm_resource_alloc(evict_walk->evictor, evict_walk->place,
> +					  evict_walk->res);
> +	if (lret == 0)
> +		return 1;
> +out:
> +	/* Errors that should terminate the walk. */
> +	if (lret == -ENOSPC)
> +		return -EBUSY;
>  
> -	spin_unlock(&bdev->lru_lock);
> +	return lret;
> +}
>  
> -	ret = ttm_bo_evict(bo, ctx);
> -	if (locked)
> -		ttm_bo_unreserve(bo);
> -	else
> -		ttm_bo_move_to_lru_tail_unlocked(bo);
> +static const struct ttm_lru_walk_ops ttm_evict_walk_ops = {
> +	.process_bo = ttm_bo_evict_cb,
> +};
>  
> -	ttm_bo_put(bo);
> -	return ret;
> +static int ttm_bo_evict_alloc(struct ttm_device *bdev,
> +			      struct ttm_resource_manager *man,
> +			      const struct ttm_place *place,
> +			      struct ttm_buffer_object *evictor,
> +			      struct ttm_operation_ctx *ctx,
> +			      struct ww_acquire_ctx *ticket,
> +			      struct ttm_resource **res)
> +{
> +	struct ttm_bo_evict_walk evict_walk = {
> +		.walk = {
> +			.ops = &ttm_evict_walk_ops,
> +			.ctx = ctx,
> +			.ticket = ticket,
> +		},
> +		.place = place,
> +		.evictor = evictor,
> +		.res = res,
> +	};
> +	long lret;
> +
> +	evict_walk.walk.trylock_only = true;
> +	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
> +	if (lret || !ticket)
> +		goto out;
> +
> +	/* If ticket-locking, repeat while making progress. */
> +	evict_walk.walk.trylock_only = false;
> +	do {
> +		/* The walk may clear the evict_walk.walk.ticket field */
> +		evict_walk.walk.ticket = ticket;
> +		evict_walk.evicted = 0;
> +		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
> +	} while (!lret && evict_walk.evicted);
> +out:
> +	if (lret < 0)
> +		return lret;
> +	if (lret == 0)
> +		return -EBUSY;
> +	return 0;
>  }
>  
>  /**
> @@ -760,6 +688,7 @@ static int ttm_bo_alloc_resource(struct ttm_buffer_object *bo,
>  	for (i = 0; i < placement->num_placement; ++i) {
>  		const struct ttm_place *place = &placement->placement[i];
>  		struct ttm_resource_manager *man;
> +		bool may_evict;
>  
>  		man = ttm_manager_type(bdev, place->mem_type);
>  		if (!man || !ttm_resource_manager_used(man))
> @@ -769,22 +698,21 @@ static int ttm_bo_alloc_resource(struct ttm_buffer_object *bo,
>  				    TTM_PL_FLAG_FALLBACK))
>  			continue;
>  
> -		do {
> -			ret = ttm_resource_alloc(bo, place, res);
> -			if (unlikely(ret && ret != -ENOSPC))
> +		may_evict = (force_space && place->mem_type != TTM_PL_SYSTEM);
> +		ret = ttm_resource_alloc(bo, place, res);
> +		if (ret) {
> +			if (ret != -ENOSPC)
>  				return ret;
> -			if (likely(!ret) || !force_space)
> -				break;
> -
> -			ret = ttm_mem_evict_first(bdev, man, place, ctx,
> -						  ticket);
> -			if (unlikely(ret == -EBUSY))
> -				break;
> -			if (unlikely(ret))
> +			if (!may_evict)
> +				continue;
> +
> +			ret = ttm_bo_evict_alloc(bdev, man, place, bo, ctx,
> +						 ticket, res);
> +			if (ret == -EBUSY)
> +				continue;
> +			if (ret)
>  				return ret;
> -		} while (1);
> -		if (ret)
> -			continue;
> +		}
>  
>  		ret = ttm_bo_add_move_fence(bo, man, ctx->no_wait_gpu);
>  		if (unlikely(ret)) {
> diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
> index b6a2daac5518..9dc727d416cc 100644
> --- a/drivers/gpu/drm/ttm/ttm_resource.c
> +++ b/drivers/gpu/drm/ttm/ttm_resource.c
> @@ -512,24 +512,11 @@ int ttm_resource_manager_evict_all(struct ttm_device *bdev,
>  	};
>  	struct dma_fence *fence;
>  	int ret;
> -	unsigned i;
> -
> -	/*
> -	 * Can't use standard list traversal since we're unlocking.
> -	 */
>  
> -	spin_lock(&bdev->lru_lock);
> -	for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {
> -		while (!list_empty(&man->lru[i])) {
> -			spin_unlock(&bdev->lru_lock);
> -			ret = ttm_mem_evict_first(bdev, man, NULL, &ctx,
> -						  NULL);
> -			if (ret)
> -				return ret;
> -			spin_lock(&bdev->lru_lock);
> -		}
> -	}
> -	spin_unlock(&bdev->lru_lock);
> +	do {
> +		ret = ttm_bo_evict_first(bdev, man, &ctx);
> +		cond_resched();
> +	} while (!ret);
>  
>  	spin_lock(&man->move_lock);
>  	fence = dma_fence_get(man->move);
> diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
> index de97ea9fa75f..e577528f5dfc 100644
> --- a/include/drm/ttm/ttm_bo.h
> +++ b/include/drm/ttm/ttm_bo.h
> @@ -422,11 +422,9 @@ long ttm_bo_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx,
>  		    pgoff_t target);
>  void ttm_bo_pin(struct ttm_buffer_object *bo);
>  void ttm_bo_unpin(struct ttm_buffer_object *bo);
> -int ttm_mem_evict_first(struct ttm_device *bdev,
> -			struct ttm_resource_manager *man,
> -			const struct ttm_place *place,
> -			struct ttm_operation_ctx *ctx,
> -			struct ww_acquire_ctx *ticket);
> +int ttm_bo_evict_first(struct ttm_device *bdev,
> +		       struct ttm_resource_manager *man,
> +		       struct ttm_operation_ctx *ctx);
>  vm_fault_t ttm_bo_vm_reserve(struct ttm_buffer_object *bo,
>  			     struct vm_fault *vmf);
>  vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
> -- 
> 2.44.0
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 08/12] drm/ttm: Add a virtual base class for graphics memory backup
  2024-07-03 15:38 ` [PATCH v6 08/12] drm/ttm: Add a virtual base class for graphics memory backup Thomas Hellström
@ 2024-07-03 19:47   ` Matthew Brost
  2024-07-04 11:57   ` Christian König
  1 sibling, 0 replies; 38+ messages in thread
From: Matthew Brost @ 2024-07-03 19:47 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Christian König, Somalapuram Amaranath, dri-devel

On Wed, Jul 03, 2024 at 05:38:09PM +0200, Thomas Hellström wrote:
> Initially intended for experimenting with different backup
> solutions (shmem vs direct swap cache insertion), abstract
> the backup destination using a virtual base class.
> 
> Also provide a sample implementation for shmem.
> 
> While, when settling on a preferred backup solution, one could
> perhaps skip the abstraction, this functionality may actually
> come in handy for configurable dedicated graphics memory
> backup to fast nvme files or similar, without affecting
> swap-space. Could indeed be useful for VRAM backup on S4 and
> other cases.
> 
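As a usage sketch only (not part of the patch; the surrounding caller, the
1 MiB backup size and the gfp choices are made up, and page/ret are assumed
to come from that caller), backing up a single page through the abstraction
and restoring it later could look like:

	struct ttm_backup *backup = ttm_backup_shmem_create(SZ_1M);
	unsigned long handle;

	if (IS_ERR(backup))
		return PTR_ERR(backup);

	/* Returns an opaque handle, 0 on failure (swp_entry_t convention). */
	handle = backup->ops->backup_page(backup, page, false, 0,
					  GFP_KERNEL, GFP_KERNEL);
	if (!handle) {
		backup->ops->fini(backup);
		return -ENOMEM;
	}

	/* ...later: copy the contents back and release the backing store. */
	ret = backup->ops->copy_backed_up_page(backup, page, handle, true);
	backup->ops->drop(backup, handle);
	backup->ops->fini(backup);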
> v5:
> - Fix a UAF. (kernel test robot, Dan Carpenter)
> v6:
> - Rename ttm_backup_shmem_copy_page() function argument
>   (Matthew Brost)
> - Add some missing documentation
> 
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Matthew Brost <matthew.brost@intel.com>

> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  drivers/gpu/drm/ttm/Makefile           |   2 +-
>  drivers/gpu/drm/ttm/ttm_backup_shmem.c | 139 +++++++++++++++++++++++++
>  include/drm/ttm/ttm_backup.h           | 137 ++++++++++++++++++++++++
>  3 files changed, 277 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/ttm/ttm_backup_shmem.c
>  create mode 100644 include/drm/ttm/ttm_backup.h
> 
> diff --git a/drivers/gpu/drm/ttm/Makefile b/drivers/gpu/drm/ttm/Makefile
> index dad298127226..5e980dd90e41 100644
> --- a/drivers/gpu/drm/ttm/Makefile
> +++ b/drivers/gpu/drm/ttm/Makefile
> @@ -4,7 +4,7 @@
>  
>  ttm-y := ttm_tt.o ttm_bo.o ttm_bo_util.o ttm_bo_vm.o ttm_module.o \
>  	ttm_execbuf_util.o ttm_range_manager.o ttm_resource.o ttm_pool.o \
> -	ttm_device.o ttm_sys_manager.o
> +	ttm_device.o ttm_sys_manager.o ttm_backup_shmem.o
>  ttm-$(CONFIG_AGP) += ttm_agp_backend.o
>  
>  obj-$(CONFIG_DRM_TTM) += ttm.o
> diff --git a/drivers/gpu/drm/ttm/ttm_backup_shmem.c b/drivers/gpu/drm/ttm/ttm_backup_shmem.c
> new file mode 100644
> index 000000000000..3d23a34d9f34
> --- /dev/null
> +++ b/drivers/gpu/drm/ttm/ttm_backup_shmem.c
> @@ -0,0 +1,139 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#include <drm/ttm/ttm_backup.h>
> +#include <linux/page-flags.h>
> +
> +/**
> + * struct ttm_backup_shmem - A shmem based ttm_backup subclass.
> + * @backup: The base struct ttm_backup
> + * @filp: The associated shmem object
> + */
> +struct ttm_backup_shmem {
> +	struct ttm_backup backup;
> +	struct file *filp;
> +};
> +
> +static struct ttm_backup_shmem *to_backup_shmem(struct ttm_backup *backup)
> +{
> +	return container_of(backup, struct ttm_backup_shmem, backup);
> +}
> +
> +static void ttm_backup_shmem_drop(struct ttm_backup *backup, unsigned long handle)
> +{
> +	handle -= 1;
> +	shmem_truncate_range(file_inode(to_backup_shmem(backup)->filp), handle,
> +			     handle + 1);
> +}
> +
> +static int ttm_backup_shmem_copy_page(struct ttm_backup *backup, struct page *dst,
> +				      unsigned long handle, bool intr)
> +{
> +	struct file *filp = to_backup_shmem(backup)->filp;
> +	struct address_space *mapping = filp->f_mapping;
> +	struct folio *from_folio;
> +
> +	handle -= 1;
> +	from_folio = shmem_read_folio(mapping, handle);
> +	if (IS_ERR(from_folio))
> +		return PTR_ERR(from_folio);
> +
> +	/* Note: Use drm_memcpy_from_wc? */
> +	copy_highpage(dst, folio_file_page(from_folio, handle));
> +	folio_put(from_folio);
> +
> +	return 0;
> +}
> +
> +static unsigned long
> +ttm_backup_shmem_backup_page(struct ttm_backup *backup, struct page *page,
> +			     bool writeback, pgoff_t i, gfp_t page_gfp,
> +			     gfp_t alloc_gfp)
> +{
> +	struct file *filp = to_backup_shmem(backup)->filp;
> +	struct address_space *mapping = filp->f_mapping;
> +	unsigned long handle = 0;
> +	struct folio *to_folio;
> +	int ret;
> +
> +	to_folio = shmem_read_folio_gfp(mapping, i, alloc_gfp);
> +	if (IS_ERR(to_folio))
> +		return handle;
> +
> +	folio_mark_accessed(to_folio);
> +	folio_lock(to_folio);
> +	folio_mark_dirty(to_folio);
> +	copy_highpage(folio_file_page(to_folio, i), page);
> +	handle = i + 1;
> +
> +	if (writeback && !folio_mapped(to_folio) && folio_clear_dirty_for_io(to_folio)) {
> +		struct writeback_control wbc = {
> +			.sync_mode = WB_SYNC_NONE,
> +			.nr_to_write = SWAP_CLUSTER_MAX,
> +			.range_start = 0,
> +			.range_end = LLONG_MAX,
> +			.for_reclaim = 1,
> +		};
> +		folio_set_reclaim(to_folio);
> +		ret = mapping->a_ops->writepage(folio_page(to_folio, 0), &wbc);
> +		if (!folio_test_writeback(to_folio))
> +			folio_clear_reclaim(to_folio);
> +		/* If writepage succeeds, it unlocks the folio */
> +		if (ret)
> +			folio_unlock(to_folio);
> +	} else {
> +		folio_unlock(to_folio);
> +	}
> +
> +	folio_put(to_folio);
> +
> +	return handle;
> +}
> +
> +static void ttm_backup_shmem_fini(struct ttm_backup *backup)
> +{
> +	struct ttm_backup_shmem *sbackup = to_backup_shmem(backup);
> +
> +	fput(sbackup->filp);
> +	kfree(sbackup);
> +}
> +
> +static const struct ttm_backup_ops ttm_backup_shmem_ops = {
> +	.drop = ttm_backup_shmem_drop,
> +	.copy_backed_up_page = ttm_backup_shmem_copy_page,
> +	.backup_page = ttm_backup_shmem_backup_page,
> +	.fini = ttm_backup_shmem_fini,
> +};
> +
> +/**
> + * ttm_backup_shmem_create() - Create a shmem-based struct backup.
> + * @size: The maximum size (in bytes) to back up.
> + *
> + * Create a backup utilizing shmem objects.
> + *
> + * Return: A pointer to a struct ttm_backup on success,
> + * an error pointer on error.
> + */
> +struct ttm_backup *ttm_backup_shmem_create(loff_t size)
> +{
> +	struct ttm_backup_shmem *sbackup =
> +		kzalloc(sizeof(*sbackup), GFP_KERNEL | __GFP_ACCOUNT);
> +	struct file *filp;
> +
> +	if (!sbackup)
> +		return ERR_PTR(-ENOMEM);
> +
> +	filp = shmem_file_setup("ttm shmem backup", size, 0);
> +	if (IS_ERR(filp)) {
> +		kfree(sbackup);
> +		return ERR_CAST(filp);
> +	}
> +
> +	sbackup->filp = filp;
> +	sbackup->backup.ops = &ttm_backup_shmem_ops;
> +
> +	return &sbackup->backup;
> +}
> +EXPORT_SYMBOL_GPL(ttm_backup_shmem_create);
> diff --git a/include/drm/ttm/ttm_backup.h b/include/drm/ttm/ttm_backup.h
> new file mode 100644
> index 000000000000..5f8c7d3069ef
> --- /dev/null
> +++ b/include/drm/ttm/ttm_backup.h
> @@ -0,0 +1,137 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#ifndef _TTM_BACKUP_H_
> +#define _TTM_BACKUP_H_
> +
> +#include <linux/mm_types.h>
> +#include <linux/shmem_fs.h>
> +
> +struct ttm_backup;
> +
> +/**
> + * ttm_backup_handle_to_page_ptr() - Convert handle to struct page pointer
> + * @handle: The handle to convert.
> + *
> + * Converts an opaque handle received from the
> + * struct ttm_backup_ops::backup_page() function to an (invalid)
> + * struct page pointer suitable for a struct page array.
> + *
> + * Return: An (invalid) struct page pointer.
> + */
> +static inline struct page *
> +ttm_backup_handle_to_page_ptr(unsigned long handle)
> +{
> +	return (struct page *)(handle << 1 | 1);
> +}
> +
> +/**
> + * ttm_backup_page_ptr_is_handle() - Whether a struct page pointer is a handle
> + * @page: The struct page pointer to check.
> + *
> + * Return: true if the struct page pointer is a handle returned from
> + * ttm_backup_handle_to_page_ptr(). False otherwise.
> + */
> +static inline bool ttm_backup_page_ptr_is_handle(const struct page *page)
> +{
> +	return (unsigned long)page & 1;
> +}
> +
> +/**
> + * ttm_backup_page_ptr_to_handle() - Convert a struct page pointer to a handle
> + * @page: The struct page pointer to convert
> + *
> + * Return: The handle that was previously used in
> + * ttm_backup_handle_to_page_ptr() to obtain a struct page pointer, suitable
> + * for use as argument in the struct ttm_backup_ops drop() or
> + * copy_backed_up_page() functions.
> + */
> +static inline unsigned long
> +ttm_backup_page_ptr_to_handle(const struct page *page)
> +{
> +	WARN_ON(!ttm_backup_page_ptr_is_handle(page));
> +	return (unsigned long)page >> 1;
> +}
> +
> +/** struct ttm_backup_ops - A struct ttm_backup backend operations */
> +struct ttm_backup_ops {
> +	/**
> +	 * drop - release memory associated with a handle
> +	 * @backup: The struct backup pointer used to obtain the handle
> +	 * @handle: The handle obtained from the @backup_page function.
> +	 */
> +	void (*drop)(struct ttm_backup *backup, unsigned long handle);
> +
> +	/**
> +	 * copy_backed_up_page - Copy the contents of a previously backed
> +	 * up page
> +	 * @backup: The struct backup pointer used to back up the page.
> +	 * @dst: The struct page to copy into.
> +	 * @handle: The handle returned when the page was backed up.
> +	 * @intr: Try to perform waits interruptible or at least killable.
> +	 *
> +	 * Return: 0 on success, Negative error code on failure, notably
> +	 * -EINTR if @intr was set to true and a signal is pending.
> +	 */
> +	int (*copy_backed_up_page)(struct ttm_backup *backup, struct page *dst,
> +				   unsigned long handle, bool intr);
> +
> +	/**
> +	 * backup_page - Backup a page
> +	 * @backup: The struct backup pointer to use.
> +	 * @page: The page to back up.
> +	 * @writeback: Whether to perform immediate writeback of the page.
> +	 * This may have performance implications.
> +	 * @i: A unique integer for each page and each struct backup.
> +	 * This is a hint allowing the backup backend to avoid managing
> +	 * its address space separately.
> +	 * @page_gfp: The gfp value used when the page was allocated.
> +	 * This is used for accounting purposes.
> +	 * @alloc_gfp: The gfp to be used when the backend needs to allocate
> +	 * memory.
> +	 *
> +	 * Return: A handle on success. 0 on failure.
> +	 * (This is following the swp_entry_t convention).
> +	 *
> +	 * Note: This function could be extended to back up a folio and
> +	 * backends would then split the folio internally if needed.
> +	 * Drawback is that the caller would then have to keep track of
> +	 * the folio size- and usage.
> +	 */
> +	unsigned long (*backup_page)(struct ttm_backup *backup, struct page *page,
> +				     bool writeback, pgoff_t i, gfp_t page_gfp,
> +				     gfp_t alloc_gfp);
> +	/**
> +	 * fini - Free the struct backup resources after last use.
> +	 * @backup: Pointer to the struct backup whose resources to free.
> +	 *
> +	 * After a call to @fini, it's illegal to use the @backup pointer.
> +	 */
> +	void (*fini)(struct ttm_backup *backup);
> +};
> +
> +/**
> + * struct ttm_backup - Abstract a backup backend.
> + * @ops: The operations as described above.
> + *
> + * The struct ttm_backup is intended to be subclassed by the
> + * backend implementation.
> + */
> +struct ttm_backup {
> +	const struct ttm_backup_ops *ops;
> +};
> +
> +/**
> + * ttm_backup_shmem_create() - Create a shmem-based struct backup.
> + * @size: The maximum size (in bytes) to back up.
> + *
> + * Create a backup utilizing shmem objects.
> + *
> + * Return: A pointer to a struct ttm_backup on success,
> + * an error pointer on error.
> + */
> +struct ttm_backup *ttm_backup_shmem_create(loff_t size);
> +
> +#endif
> -- 
> 2.44.0
> 
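The handle helpers in ttm_backup.h above suggest stashing backed-up pages
directly in a struct page array by tagging the pointer's low bit. A sketch of
that round trip; the pages[] array, the index i and new_page are illustrative
only, not taken from this series:

	/* Back the page up and replace the array entry with a tagged handle. */
	unsigned long handle = backup->ops->backup_page(backup, pages[i], false, i,
							GFP_KERNEL, GFP_KERNEL);
	if (handle)
		pages[i] = ttm_backup_handle_to_page_ptr(handle);

	/* On recovery, tell real pages and stashed handles apart. */
	if (ttm_backup_page_ptr_is_handle(pages[i])) {
		unsigned long h = ttm_backup_page_ptr_to_handle(pages[i]);

		/* Allocate new_page elsewhere, then restore its contents. */
		backup->ops->copy_backed_up_page(backup, new_page, h, true);
		backup->ops->drop(backup, h);
		pages[i] = new_page;
	}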

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 03/12] drm/ttm: Use LRU hitches
  2024-07-03 15:38 ` [PATCH v6 03/12] drm/ttm: Use LRU hitches Thomas Hellström
@ 2024-07-04  9:05   ` Christian König
  0 siblings, 0 replies; 38+ messages in thread
From: Christian König @ 2024-07-04  9:05 UTC (permalink / raw)
  To: Thomas Hellström, intel-xe
  Cc: Somalapuram Amaranath, Matthew Brost, dri-devel

Am 03.07.24 um 17:38 schrieb Thomas Hellström:
> Have iterators insert themselves into the list they are iterating
> over using hitch list nodes. Since only the iterator owner
> can remove these list nodes from the list, it's safe to unlock
> the list and when continuing, use them as a starting point. Due to
> the way LRU bumping works in TTM, newly added items will not be
> missed, and bumped items will be iterated over a second time before
> reaching the end of the list.
>
> The exception is lists with bulk move sublists. When bumping a
> sublist, a hitch that is part of that sublist will also be moved
> and we might miss items if restarting from it. This will be
> addressed in a later patch.
>
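The consumer-side pattern this enables is roughly the following sketch
(illustrative only; the in-tree users touched by this patch,
ttm_mem_evict_first() and ttm_device_swapout(), follow this shape). The hitch
stays on the LRU, so the lock may be dropped between iterations and the walk
resumed where it left off:

	struct ttm_resource_cursor cursor;
	struct ttm_resource *res;

	spin_lock(&bdev->lru_lock);
	ttm_resource_manager_for_each_res(man, &cursor, res) {
		/*
		 * Real users take a bo reference or trylock the resv here
		 * before unlocking; the hitch keeps the position either way.
		 */
		spin_unlock(&bdev->lru_lock);

		/* ...potentially sleeping work on the object... */

		spin_lock(&bdev->lru_lock);
	}
	ttm_resource_cursor_fini_locked(&cursor);
	spin_unlock(&bdev->lru_lock);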
> Changes in previous series:
> - Updated ttm_resource_cursor_fini() documentation.
> v2:
> - Don't reorder ttm_resource_manager_first() and _next().
>    (Christian König).
> - Use list_add instead of list_move
>    (Christian König)
> v3:
> - Split into two patches, one cleanup, one new functionality
>    (Christian König)
> - use ttm_resource_cursor_fini_locked() instead of open-coding
>    (Matthew Brost)
>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Reviewed-by: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/ttm/ttm_bo.c       |  1 +
>   drivers/gpu/drm/ttm/ttm_device.c   |  9 +++--
>   drivers/gpu/drm/ttm/ttm_resource.c | 56 +++++++++++++++++++++++++-----
>   include/drm/ttm/ttm_resource.h     |  9 +++--
>   4 files changed, 62 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 6396dece0db1..43eda720657f 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -621,6 +621,7 @@ int ttm_mem_evict_first(struct ttm_device *bdev,
>   		if (locked)
>   			dma_resv_unlock(res->bo->base.resv);
>   	}
> +	ttm_resource_cursor_fini_locked(&cursor);
>   
>   	if (!bo) {
>   		if (busy_bo && !ttm_bo_get_unless_zero(busy_bo))
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 09411978a13a..f9e9b1ec8c8a 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -170,12 +170,17 @@ int ttm_device_swapout(struct ttm_device *bdev, struct ttm_operation_ctx *ctx,
>   			num_pages = PFN_UP(bo->base.size);
>   			ret = ttm_bo_swapout(bo, ctx, gfp_flags);
>   			/* ttm_bo_swapout has dropped the lru_lock */
> -			if (!ret)
> +			if (!ret) {
> +				ttm_resource_cursor_fini(&cursor);
>   				return num_pages;
> -			if (ret != -EBUSY)
> +			}
> +			if (ret != -EBUSY) {
> +				ttm_resource_cursor_fini(&cursor);
>   				return ret;
> +			}
>   		}
>   	}
> +	ttm_resource_cursor_fini_locked(&cursor);
>   	spin_unlock(&bdev->lru_lock);
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
> index 8bfbddddc0e8..9c8b6499edfb 100644
> --- a/drivers/gpu/drm/ttm/ttm_resource.c
> +++ b/drivers/gpu/drm/ttm/ttm_resource.c
> @@ -33,6 +33,37 @@
>   
>   #include <drm/drm_util.h>
>   
> +/**
> + * ttm_resource_cursor_fini_locked() - Finalize the LRU list cursor usage
> + * @cursor: The struct ttm_resource_cursor to finalize.
> + *
> + * The function pulls the LRU list cursor off any lists it was previously
> + * attached to. Needs to be called with the LRU lock held. The function
> + * can be called multiple times after each other.
> + */
> +void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor *cursor)
> +{
> +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> +	list_del_init(&cursor->hitch.link);
> +}
> +
> +/**
> + * ttm_resource_cursor_fini() - Finalize the LRU list cursor usage
> + * @cursor: The struct ttm_resource_cursor to finalize.
> + *
> + * The function pulls the LRU list cursor off any lists it was previously
> + * attached to. Needs to be called without the LRU list lock held. The
> + * function can be called multiple times after each other.
> + */
> +void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor)
> +{
> +	spinlock_t *lru_lock = &cursor->man->bdev->lru_lock;
> +
> +	spin_lock(lru_lock);
> +	ttm_resource_cursor_fini_locked(cursor);
> +	spin_unlock(lru_lock);
> +}
> +
>   /**
>    * ttm_lru_bulk_move_init - initialize a bulk move structure
>    * @bulk: the structure to init
> @@ -485,12 +516,15 @@ void ttm_resource_manager_debug(struct ttm_resource_manager *man,
>   EXPORT_SYMBOL(ttm_resource_manager_debug);
>   
>   /**
> - * ttm_resource_manager_first
> - *
> + * ttm_resource_manager_first() - Start iterating over the resources
> + * of a resource manager
>    * @man: resource manager to iterate over
>    * @cursor: cursor to record the position
>    *
> - * Returns the first resource from the resource manager.
> + * Initializes the cursor and starts iterating. When done iterating,
> + * the caller must explicitly call ttm_resource_cursor_fini().
> + *
> + * Return: The first resource from the resource manager.
>    */
>   struct ttm_resource *
>   ttm_resource_manager_first(struct ttm_resource_manager *man,
> @@ -500,13 +534,15 @@ ttm_resource_manager_first(struct ttm_resource_manager *man,
>   
>   	cursor->priority = 0;
>   	cursor->man = man;
> -	cursor->cur = &man->lru[cursor->priority];
> +	ttm_lru_item_init(&cursor->hitch, TTM_LRU_HITCH);
> +	list_add(&cursor->hitch.link, &man->lru[cursor->priority]);
> +
>   	return ttm_resource_manager_next(cursor);
>   }
>   
>   /**
> - * ttm_resource_manager_next
> - *
> + * ttm_resource_manager_next() - Continue iterating over the resource manager
> + * resources
>    * @cursor: cursor to record the position
>    *
>    * Return: the next resource from the resource manager.
> @@ -520,10 +556,10 @@ ttm_resource_manager_next(struct ttm_resource_cursor *cursor)
>   	lockdep_assert_held(&man->bdev->lru_lock);
>   
>   	for (;;) {
> -		lru = list_entry(cursor->cur, typeof(*lru), link);
> +		lru = &cursor->hitch;
>   		list_for_each_entry_continue(lru, &man->lru[cursor->priority], link) {
>   			if (ttm_lru_item_is_res(lru)) {
> -				cursor->cur = &lru->link;
> +				list_move(&cursor->hitch.link, &lru->link);
>   				return ttm_lru_item_to_res(lru);
>   			}
>   		}
> @@ -531,9 +567,11 @@ ttm_resource_manager_next(struct ttm_resource_cursor *cursor)
>   		if (++cursor->priority >= TTM_MAX_BO_PRIORITY)
>   			break;
>   
> -		cursor->cur = &man->lru[cursor->priority];
> +		list_move(&cursor->hitch.link, &man->lru[cursor->priority]);
>   	}
>   
> +	ttm_resource_cursor_fini_locked(cursor);
> +
>   	return NULL;
>   }
>   
> diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
> index 7d81fd5b5b83..8fac781f641e 100644
> --- a/include/drm/ttm/ttm_resource.h
> +++ b/include/drm/ttm/ttm_resource.h
> @@ -273,17 +273,22 @@ ttm_lru_item_to_res(struct ttm_lru_item *item)
>    * struct ttm_resource_cursor
>    *
>    * @man: The resource manager currently being iterated over.
> - * @cur: The list head the cursor currently points to.
> + * @hitch: A hitch list node inserted before the next resource
> + * to iterate over.
>    * @priority: the current priority
>    *
>    * Cursor to iterate over the resources in a manager.
>    */
>   struct ttm_resource_cursor {
>   	struct ttm_resource_manager *man;
> -	struct list_head *cur;
> +	struct ttm_lru_item hitch;
>   	unsigned int priority;
>   };
>   
> +void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor *cursor);
> +
> +void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor);
> +
>   /**
>    * struct ttm_lru_bulk_move_pos
>    *


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 04/12] drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist moves
  2024-07-03 15:38 ` [PATCH v6 04/12] drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist moves Thomas Hellström
  2024-07-03 17:53   ` Matthew Brost
@ 2024-07-04  9:21   ` Christian König
  2024-07-04 12:41     ` Thomas Hellström
  1 sibling, 1 reply; 38+ messages in thread
From: Christian König @ 2024-07-04  9:21 UTC (permalink / raw)
  To: Thomas Hellström, intel-xe
  Cc: Somalapuram Amaranath, Matthew Brost, dri-devel

Am 03.07.24 um 17:38 schrieb Thomas Hellström:
> To address the problem with hitches moving when bulk move
> sublists are lru-bumped, register the list cursors with the
> ttm_lru_bulk_move structure when traversing its list, and
> when lru-bumping the list, move the cursor hitch to the tail.
> This also means it's mandatory for drivers to call
> ttm_lru_bulk_move_init() and ttm_lru_bulk_move_fini() when
> initializing and finalizing the bulk move structure, so add
> those calls to the amdgpu- and xe driver.
>
> Compared to v1 this is slightly more code but less fragile
> and hopefully easier to understand.

This is the only patch in the series which I see critical.

I think the final goal when using drm_exec in TTM's eviction path is to 
keep all evicted (or evicting) BOs locked until we have enough space.

This means that for bulk move sections on the LRU we would lock the 
first BO and would only drop that lock again if we have gone over the 
full bulk move section and know that all BOs are not valuable for eviction.

Because of this, the issue of having to consider hitches moving with a bulk 
move section on the LRU doesn't even occur, because for that a concurrent 
process would need to grab the common lock of the BOs in the bulk move 
section.

Regards,
Christian.
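A very rough sketch of the idea, purely as an assumption about what such an
implementation could look like (nothing like this is in the series); it relies
on all BOs of a bulk section sharing one reservation object:

	struct ttm_resource *first = bulk->pos[mem_type][prio].first;

	if (first && dma_resv_trylock(first->bo->base.resv)) {
		/*
		 * Walk the whole bulk section here. A concurrent bulk bump
		 * would need this same lock, so hitches could not be moved
		 * underneath the walker; evicted BOs would stay locked until
		 * enough space has been found.
		 */
		dma_resv_unlock(first->bo->base.resv);
	}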


>
> Changes in previous series:
> - Completely rework the functionality
> - Avoid a NULL pointer dereference assigning manager->mem_type
> - Remove some leftover code causing build problems
> v2:
> - For hitch bulk tail moves, store the mem_type in the cursor
>    instead of with the manager.
> v3:
> - Remove leftover mem_type member from change in v2.
> v6:
> - Add some lockdep asserts (Matthew Brost)
> - Avoid NULL pointer dereference (Matthew Brost)
> - No need to check bo->resource before dereferencing
>    bo->bulk_move (Matthew Brost)
>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c |  4 ++
>   drivers/gpu/drm/ttm/ttm_resource.c     | 92 ++++++++++++++++++++++++++
>   drivers/gpu/drm/xe/xe_vm.c             |  4 ++
>   include/drm/ttm/ttm_resource.h         | 56 ++++++++++------
>   4 files changed, 135 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 3abfa66d72a2..97743993d711 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2420,6 +2420,8 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   	if (r)
>   		return r;
>   
> +	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
> +
>   	vm->is_compute_context = false;
>   
>   	vm->use_cpu_for_update = !!(adev->vm_manager.vm_update_mode &
> @@ -2484,6 +2486,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   error_free_delayed:
>   	dma_fence_put(vm->last_tlb_flush);
>   	dma_fence_put(vm->last_unlocked);
> +	ttm_lru_bulk_move_fini(&adev->mman.bdev, &vm->lru_bulk_move);
>   	amdgpu_vm_fini_entities(vm);
>   
>   	return r;
> @@ -2640,6 +2643,7 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>   		}
>   	}
>   
> +	ttm_lru_bulk_move_fini(&adev->mman.bdev, &vm->lru_bulk_move);
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
> index 9c8b6499edfb..b6a2daac5518 100644
> --- a/drivers/gpu/drm/ttm/ttm_resource.c
> +++ b/drivers/gpu/drm/ttm/ttm_resource.c
> @@ -33,6 +33,53 @@
>   
>   #include <drm/drm_util.h>
>   
> +/* Detach the cursor from the bulk move list */
> +static void
> +ttm_resource_cursor_clear_bulk(struct ttm_resource_cursor *cursor)
> +{
> +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> +
> +	cursor->bulk = NULL;
> +	list_del_init(&cursor->bulk_link);
> +}
> +
> +/* Move the cursor to the end of the bulk move list it's in */
> +static void ttm_resource_cursor_move_bulk_tail(struct ttm_lru_bulk_move *bulk,
> +					       struct ttm_resource_cursor *cursor)
> +{
> +	struct ttm_lru_bulk_move_pos *pos;
> +
> +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> +
> +	if (WARN_ON_ONCE(bulk != cursor->bulk)) {
> +		list_del_init(&cursor->bulk_link);
> +		return;
> +	}
> +
> +	pos = &bulk->pos[cursor->mem_type][cursor->priority];
> +	if (pos->last)
> +		list_move(&cursor->hitch.link, &pos->last->lru.link);
> +	ttm_resource_cursor_clear_bulk(cursor);
> +}
> +
> +/* Move all cursors attached to a bulk move to its end */
> +static void ttm_bulk_move_adjust_cursors(struct ttm_lru_bulk_move *bulk)
> +{
> +	struct ttm_resource_cursor *cursor, *next;
> +
> +	list_for_each_entry_safe(cursor, next, &bulk->cursor_list, bulk_link)
> +		ttm_resource_cursor_move_bulk_tail(bulk, cursor);
> +}
> +
> +/* Remove all cursors from an empty bulk move list */
> +static void ttm_bulk_move_drop_cursors(struct ttm_lru_bulk_move *bulk)
> +{
> +	struct ttm_resource_cursor *cursor, *next;
> +
> +	list_for_each_entry_safe(cursor, next, &bulk->cursor_list, bulk_link)
> +		ttm_resource_cursor_clear_bulk(cursor);
> +}
> +
>   /**
>    * ttm_resource_cursor_fini_locked() - Finalize the LRU list cursor usage
>    * @cursor: The struct ttm_resource_cursor to finalize.
> @@ -45,6 +92,7 @@ void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor *cursor)
>   {
>   	lockdep_assert_held(&cursor->man->bdev->lru_lock);
>   	list_del_init(&cursor->hitch.link);
> +	ttm_resource_cursor_clear_bulk(cursor);
>   }
>   
>   /**
> @@ -73,9 +121,27 @@ void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor)
>   void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk)
>   {
>   	memset(bulk, 0, sizeof(*bulk));
> +	INIT_LIST_HEAD(&bulk->cursor_list);
>   }
>   EXPORT_SYMBOL(ttm_lru_bulk_move_init);
>   
> +/**
> + * ttm_lru_bulk_move_fini - finalize a bulk move structure
> + * @bdev: The struct ttm_device
> + * @bulk: the structure to finalize
> + *
> + * Sanity checks that bulk moves don't have any
> + * resources left and hence no cursors attached.
> + */
> +void ttm_lru_bulk_move_fini(struct ttm_device *bdev,
> +			    struct ttm_lru_bulk_move *bulk)
> +{
> +	spin_lock(&bdev->lru_lock);
> +	ttm_bulk_move_drop_cursors(bulk);
> +	spin_unlock(&bdev->lru_lock);
> +}
> +EXPORT_SYMBOL(ttm_lru_bulk_move_fini);
> +
>   /**
>    * ttm_lru_bulk_move_tail - bulk move range of resources to the LRU tail.
>    *
> @@ -88,6 +154,7 @@ void ttm_lru_bulk_move_tail(struct ttm_lru_bulk_move *bulk)
>   {
>   	unsigned i, j;
>   
> +	ttm_bulk_move_adjust_cursors(bulk);
>   	for (i = 0; i < TTM_NUM_MEM_TYPES; ++i) {
>   		for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) {
>   			struct ttm_lru_bulk_move_pos *pos = &bulk->pos[i][j];
> @@ -515,6 +582,28 @@ void ttm_resource_manager_debug(struct ttm_resource_manager *man,
>   }
>   EXPORT_SYMBOL(ttm_resource_manager_debug);
>   
> +static void
> +ttm_resource_cursor_check_bulk(struct ttm_resource_cursor *cursor,
> +			       struct ttm_lru_item *next_lru)
> +{
> +	struct ttm_resource *next = ttm_lru_item_to_res(next_lru);
> +	struct ttm_lru_bulk_move *bulk = NULL;
> +	struct ttm_buffer_object *bo = next->bo;
> +
> +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> +	bulk = bo->bulk_move;
> +
> +	if (cursor->bulk != bulk) {
> +		if (bulk) {
> +			list_move_tail(&cursor->bulk_link, &bulk->cursor_list);
> +			cursor->mem_type = next->mem_type;
> +		} else {
> +			list_del_init(&cursor->bulk_link);
> +		}
> +		cursor->bulk = bulk;
> +	}
> +}
> +
>   /**
>    * ttm_resource_manager_first() - Start iterating over the resources
>    * of a resource manager
> @@ -535,6 +624,7 @@ ttm_resource_manager_first(struct ttm_resource_manager *man,
>   	cursor->priority = 0;
>   	cursor->man = man;
>   	ttm_lru_item_init(&cursor->hitch, TTM_LRU_HITCH);
> +	INIT_LIST_HEAD(&cursor->bulk_link);
>   	list_add(&cursor->hitch.link, &man->lru[cursor->priority]);
>   
>   	return ttm_resource_manager_next(cursor);
> @@ -559,6 +649,7 @@ ttm_resource_manager_next(struct ttm_resource_cursor *cursor)
>   		lru = &cursor->hitch;
>   		list_for_each_entry_continue(lru, &man->lru[cursor->priority], link) {
>   			if (ttm_lru_item_is_res(lru)) {
> +				ttm_resource_cursor_check_bulk(cursor, lru);
>   				list_move(&cursor->hitch.link, &lru->link);
>   				return ttm_lru_item_to_res(lru);
>   			}
> @@ -568,6 +659,7 @@ ttm_resource_manager_next(struct ttm_resource_cursor *cursor)
>   			break;
>   
>   		list_move(&cursor->hitch.link, &man->lru[cursor->priority]);
> +		ttm_resource_cursor_clear_bulk(cursor);
>   	}
>   
>   	ttm_resource_cursor_fini_locked(cursor);
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 5b166fa03684..0c7e327bc9a2 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -1335,6 +1335,8 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
>   
>   	INIT_WORK(&vm->destroy_work, vm_destroy_work_func);
>   
> +	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
> +
>   	INIT_LIST_HEAD(&vm->preempt.exec_queues);
>   	vm->preempt.min_run_period_ms = 10;	/* FIXME: Wire up to uAPI */
>   
> @@ -1458,6 +1460,7 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
>   	mutex_destroy(&vm->snap_mutex);
>   	for_each_tile(tile, xe, id)
>   		xe_range_fence_tree_fini(&vm->rftree[id]);
> +	ttm_lru_bulk_move_fini(&xe->ttm, &vm->lru_bulk_move);
>   	kfree(vm);
>   	if (flags & XE_VM_FLAG_LR_MODE)
>   		xe_pm_runtime_put(xe);
> @@ -1601,6 +1604,7 @@ static void vm_destroy_work_func(struct work_struct *w)
>   		XE_WARN_ON(vm->pt_root[id]);
>   
>   	trace_xe_vm_free(vm);
> +	ttm_lru_bulk_move_fini(&xe->ttm, &vm->lru_bulk_move);
>   	kfree(vm);
>   }
>   
> diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
> index 8fac781f641e..571abb4861a6 100644
> --- a/include/drm/ttm/ttm_resource.h
> +++ b/include/drm/ttm/ttm_resource.h
> @@ -269,26 +269,6 @@ ttm_lru_item_to_res(struct ttm_lru_item *item)
>   	return container_of(item, struct ttm_resource, lru);
>   }
>   
> -/**
> - * struct ttm_resource_cursor
> - *
> - * @man: The resource manager currently being iterated over.
> - * @hitch: A hitch list node inserted before the next resource
> - * to iterate over.
> - * @priority: the current priority
> - *
> - * Cursor to iterate over the resources in a manager.
> - */
> -struct ttm_resource_cursor {
> -	struct ttm_resource_manager *man;
> -	struct ttm_lru_item hitch;
> -	unsigned int priority;
> -};
> -
> -void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor *cursor);
> -
> -void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor);
> -
>   /**
>    * struct ttm_lru_bulk_move_pos
>    *
> @@ -304,8 +284,9 @@ struct ttm_lru_bulk_move_pos {
>   
>   /**
>    * struct ttm_lru_bulk_move
> - *
>    * @pos: first/last lru entry for resources in the each domain/priority
> + * @cursor_list: The list of cursors currently traversing any of
> + * the sublists of @pos. Protected by the ttm device's lru_lock.
>    *
>    * Container for the current bulk move state. Should be used with
>    * ttm_lru_bulk_move_init() and ttm_bo_set_bulk_move().
> @@ -315,8 +296,39 @@ struct ttm_lru_bulk_move_pos {
>    */
>   struct ttm_lru_bulk_move {
>   	struct ttm_lru_bulk_move_pos pos[TTM_NUM_MEM_TYPES][TTM_MAX_BO_PRIORITY];
> +	struct list_head cursor_list;
>   };
>   
> +/**
> + * struct ttm_resource_cursor
> + * @man: The resource manager currently being iterated over
> + * @hitch: A hitch list node inserted before the next resource
> + * to iterate over.
> + * @bulk_link: A list link for the list of cursors traversing the
> + * bulk sublist of @bulk. Protected by the ttm device's lru_lock.
> + * @bulk: Pointer to struct ttm_lru_bulk_move whose subrange @hitch is
> + * inserted to. NULL if none. Never dereference this pointer since
> + * the struct ttm_lru_bulk_move object pointed to might have been
> + * freed. The pointer is only for comparison.
> + * @mem_type: The memory type of the LRU list being traversed.
> + * This field is valid iff @bulk != NULL.
> + * @priority: the current priority
> + *
> + * Cursor to iterate over the resources in a manager.
> + */
> +struct ttm_resource_cursor {
> +	struct ttm_resource_manager *man;
> +	struct ttm_lru_item hitch;
> +	struct list_head bulk_link;
> +	struct ttm_lru_bulk_move *bulk;
> +	unsigned int mem_type;
> +	unsigned int priority;
> +};
> +
> +void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor *cursor);
> +
> +void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor);
> +
>   /**
>    * struct ttm_kmap_iter_iomap - Specialization for a struct io_mapping +
>    * struct sg_table backed struct ttm_resource.
> @@ -405,6 +417,8 @@ ttm_resource_manager_cleanup(struct ttm_resource_manager *man)
>   
>   void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk);
>   void ttm_lru_bulk_move_tail(struct ttm_lru_bulk_move *bulk);
> +void ttm_lru_bulk_move_fini(struct ttm_device *bdev,
> +			    struct ttm_lru_bulk_move *bulk);
>   
>   void ttm_resource_add_bulk_move(struct ttm_resource *res,
>   				struct ttm_buffer_object *bo);
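For driver writers the new contract boils down to the pairing below; a sketch
in which my_vm and my_dev are placeholders, with the amdgpu and xe hunks above
being the real examples:

	/* At creation of the bulk-move owner (typically the VM): */
	ttm_lru_bulk_move_init(&my_vm->lru_bulk_move);

	/*
	 * At destruction, after all resources have left the bulk move:
	 * detaches any LRU cursors still registered on it.
	 */
	ttm_lru_bulk_move_fini(&my_dev->ttm_bdev, &my_vm->lru_bulk_move);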


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 08/12] drm/ttm: Add a virtual base class for graphics memory backup
  2024-07-03 15:38 ` [PATCH v6 08/12] drm/ttm: Add a virtual base class for graphics memory backup Thomas Hellström
  2024-07-03 19:47   ` Matthew Brost
@ 2024-07-04 11:57   ` Christian König
  1 sibling, 0 replies; 38+ messages in thread
From: Christian König @ 2024-07-04 11:57 UTC (permalink / raw)
  To: Thomas Hellström, intel-xe
  Cc: Somalapuram Amaranath, Matthew Brost, dri-devel

Am 03.07.24 um 17:38 schrieb Thomas Hellström:
> Initially intended for experimenting with different backup
> solutions (shmem vs direct swap cache insertion), abstract
> the backup destination using a virtual base class.
>
> Also provide a sample implementation for shmem.

Let's postpone this and all following patches and merge the LRU changes 
first.

Christian.

>
> While, when settling on a preferred backup solution, one could
> perhaps skip the abstraction, this functionality may actually
> come in handy for configurable dedicated graphics memory
> backup to fast nvme files or similar, without affecting
> swap-space. Could indeed be useful for VRAM backup on S4 and
> other cases.
>
> v5:
> - Fix a UAF. (kernel test robot, Dan Carpenter)
> v6:
> - Rename ttm_backup_shmem_copy_page() function argument
>    (Matthew Brost)
> - Add some missing documentation
>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>   drivers/gpu/drm/ttm/Makefile           |   2 +-
>   drivers/gpu/drm/ttm/ttm_backup_shmem.c | 139 +++++++++++++++++++++++++
>   include/drm/ttm/ttm_backup.h           | 137 ++++++++++++++++++++++++
>   3 files changed, 277 insertions(+), 1 deletion(-)
>   create mode 100644 drivers/gpu/drm/ttm/ttm_backup_shmem.c
>   create mode 100644 include/drm/ttm/ttm_backup.h
>
> diff --git a/drivers/gpu/drm/ttm/Makefile b/drivers/gpu/drm/ttm/Makefile
> index dad298127226..5e980dd90e41 100644
> --- a/drivers/gpu/drm/ttm/Makefile
> +++ b/drivers/gpu/drm/ttm/Makefile
> @@ -4,7 +4,7 @@
>   
>   ttm-y := ttm_tt.o ttm_bo.o ttm_bo_util.o ttm_bo_vm.o ttm_module.o \
>   	ttm_execbuf_util.o ttm_range_manager.o ttm_resource.o ttm_pool.o \
> -	ttm_device.o ttm_sys_manager.o
> +	ttm_device.o ttm_sys_manager.o ttm_backup_shmem.o
>   ttm-$(CONFIG_AGP) += ttm_agp_backend.o
>   
>   obj-$(CONFIG_DRM_TTM) += ttm.o
> diff --git a/drivers/gpu/drm/ttm/ttm_backup_shmem.c b/drivers/gpu/drm/ttm/ttm_backup_shmem.c
> new file mode 100644
> index 000000000000..3d23a34d9f34
> --- /dev/null
> +++ b/drivers/gpu/drm/ttm/ttm_backup_shmem.c
> @@ -0,0 +1,139 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#include <drm/ttm/ttm_backup.h>
> +#include <linux/page-flags.h>
> +
> +/**
> + * struct ttm_backup_shmem - A shmem based ttm_backup subclass.
> + * @backup: The base struct ttm_backup
> + * @filp: The associated shmem object
> + */
> +struct ttm_backup_shmem {
> +	struct ttm_backup backup;
> +	struct file *filp;
> +};
> +
> +static struct ttm_backup_shmem *to_backup_shmem(struct ttm_backup *backup)
> +{
> +	return container_of(backup, struct ttm_backup_shmem, backup);
> +}
> +
> +static void ttm_backup_shmem_drop(struct ttm_backup *backup, unsigned long handle)
> +{
> +	handle -= 1;
> +	shmem_truncate_range(file_inode(to_backup_shmem(backup)->filp), handle,
> +			     handle + 1);
> +}
> +
> +static int ttm_backup_shmem_copy_page(struct ttm_backup *backup, struct page *dst,
> +				      unsigned long handle, bool intr)
> +{
> +	struct file *filp = to_backup_shmem(backup)->filp;
> +	struct address_space *mapping = filp->f_mapping;
> +	struct folio *from_folio;
> +
> +	handle -= 1;
> +	from_folio = shmem_read_folio(mapping, handle);
> +	if (IS_ERR(from_folio))
> +		return PTR_ERR(from_folio);
> +
> +	/* Note: Use drm_memcpy_from_wc? */
> +	copy_highpage(dst, folio_file_page(from_folio, handle));
> +	folio_put(from_folio);
> +
> +	return 0;
> +}
> +
> +static unsigned long
> +ttm_backup_shmem_backup_page(struct ttm_backup *backup, struct page *page,
> +			     bool writeback, pgoff_t i, gfp_t page_gfp,
> +			     gfp_t alloc_gfp)
> +{
> +	struct file *filp = to_backup_shmem(backup)->filp;
> +	struct address_space *mapping = filp->f_mapping;
> +	unsigned long handle = 0;
> +	struct folio *to_folio;
> +	int ret;
> +
> +	to_folio = shmem_read_folio_gfp(mapping, i, alloc_gfp);
> +	if (IS_ERR(to_folio))
> +		return handle;
> +
> +	folio_mark_accessed(to_folio);
> +	folio_lock(to_folio);
> +	folio_mark_dirty(to_folio);
> +	copy_highpage(folio_file_page(to_folio, i), page);
> +	handle = i + 1;
> +
> +	if (writeback && !folio_mapped(to_folio) && folio_clear_dirty_for_io(to_folio)) {
> +		struct writeback_control wbc = {
> +			.sync_mode = WB_SYNC_NONE,
> +			.nr_to_write = SWAP_CLUSTER_MAX,
> +			.range_start = 0,
> +			.range_end = LLONG_MAX,
> +			.for_reclaim = 1,
> +		};
> +		folio_set_reclaim(to_folio);
> +		ret = mapping->a_ops->writepage(folio_page(to_folio, 0), &wbc);
> +		if (!folio_test_writeback(to_folio))
> +			folio_clear_reclaim(to_folio);
> +		/* If writepage succeeds, it unlocks the folio */
> +		if (ret)
> +			folio_unlock(to_folio);
> +	} else {
> +		folio_unlock(to_folio);
> +	}
> +
> +	folio_put(to_folio);
> +
> +	return handle;
> +}
> +
> +static void ttm_backup_shmem_fini(struct ttm_backup *backup)
> +{
> +	struct ttm_backup_shmem *sbackup = to_backup_shmem(backup);
> +
> +	fput(sbackup->filp);
> +	kfree(sbackup);
> +}
> +
> +static const struct ttm_backup_ops ttm_backup_shmem_ops = {
> +	.drop = ttm_backup_shmem_drop,
> +	.copy_backed_up_page = ttm_backup_shmem_copy_page,
> +	.backup_page = ttm_backup_shmem_backup_page,
> +	.fini = ttm_backup_shmem_fini,
> +};
> +
> +/**
> + * ttm_backup_shmem_create() - Create a shmem-based struct backup.
> + * @size: The maximum size (in bytes) to back up.
> + *
> + * Create a backup utilizing shmem objects.
> + *
> + * Return: A pointer to a struct ttm_backup on success,
> + * an error pointer on error.
> + */
> +struct ttm_backup *ttm_backup_shmem_create(loff_t size)
> +{
> +	struct ttm_backup_shmem *sbackup =
> +		kzalloc(sizeof(*sbackup), GFP_KERNEL | __GFP_ACCOUNT);
> +	struct file *filp;
> +
> +	if (!sbackup)
> +		return ERR_PTR(-ENOMEM);
> +
> +	filp = shmem_file_setup("ttm shmem backup", size, 0);
> +	if (IS_ERR(filp)) {
> +		kfree(sbackup);
> +		return ERR_CAST(filp);
> +	}
> +
> +	sbackup->filp = filp;
> +	sbackup->backup.ops = &ttm_backup_shmem_ops;
> +
> +	return &sbackup->backup;
> +}
> +EXPORT_SYMBOL_GPL(ttm_backup_shmem_create);
> diff --git a/include/drm/ttm/ttm_backup.h b/include/drm/ttm/ttm_backup.h
> new file mode 100644
> index 000000000000..5f8c7d3069ef
> --- /dev/null
> +++ b/include/drm/ttm/ttm_backup.h
> @@ -0,0 +1,137 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#ifndef _TTM_BACKUP_H_
> +#define _TTM_BACKUP_H_
> +
> +#include <linux/mm_types.h>
> +#include <linux/shmem_fs.h>
> +
> +struct ttm_backup;
> +
> +/**
> + * ttm_backup_handle_to_page_ptr() - Convert handle to struct page pointer
> + * @handle: The handle to convert.
> + *
> + * Converts an opaque handle received from the
> + * struct ttm_backup_ops::backup_page() function to an (invalid)
> + * struct page pointer suitable for a struct page array.
> + *
> + * Return: An (invalid) struct page pointer.
> + */
> +static inline struct page *
> +ttm_backup_handle_to_page_ptr(unsigned long handle)
> +{
> +	return (struct page *)(handle << 1 | 1);
> +}
> +
> +/**
> + * ttm_backup_page_ptr_is_handle() - Whether a struct page pointer is a handle
> + * @page: The struct page pointer to check.
> + *
> + * Return: true if the struct page pointer is a handle returned from
> + * ttm_backup_handle_to_page_ptr(). False otherwise.
> + */
> +static inline bool ttm_backup_page_ptr_is_handle(const struct page *page)
> +{
> +	return (unsigned long)page & 1;
> +}
> +
> +/**
> + * ttm_backup_page_ptr_to_handle() - Convert a struct page pointer to a handle
> + * @page: The struct page pointer to convert
> + *
> + * Return: The handle that was previously used in
> + * ttm_backup_handle_to_page_ptr() to obtain a struct page pointer, suitable
> + * for use as argument in the struct ttm_backup_ops drop() or
> + * copy_backed_up_page() functions.
> + */
> +static inline unsigned long
> +ttm_backup_page_ptr_to_handle(const struct page *page)
> +{
> +	WARN_ON(!ttm_backup_page_ptr_is_handle(page));
> +	return (unsigned long)page >> 1;
> +}
> +
> +/** struct ttm_backup_ops - A struct ttm_backup backend operations */
> +struct ttm_backup_ops {
> +	/**
> +	 * drop - release memory associated with a handle
> +	 * @backup: The struct backup pointer used to obtain the handle
> +	 * @handle: The handle obtained from the @backup_page function.
> +	 */
> +	void (*drop)(struct ttm_backup *backup, unsigned long handle);
> +
> +	/**
> +	 * copy_backed_up_page - Copy the contents of a previously backed
> +	 * up page
> +	 * @backup: The struct backup pointer used to back up the page.
> +	 * @dst: The struct page to copy into.
> +	 * @handle: The handle returned when the page was backed up.
> +	 * @intr: Try to perform waits interruptible or at least killable.
> +	 *
> +	 * Return: 0 on success, Negative error code on failure, notably
> +	 * -EINTR if @intr was set to true and a signal is pending.
> +	 */
> +	int (*copy_backed_up_page)(struct ttm_backup *backup, struct page *dst,
> +				   unsigned long handle, bool intr);
> +
> +	/**
> +	 * backup_page - Backup a page
> +	 * @backup: The struct backup pointer to use.
> +	 * @page: The page to back up.
> +	 * @writeback: Whether to perform immediate writeback of the page.
> +	 * This may have performance implications.
> +	 * @i: A unique integer for each page and each struct backup.
> +	 * This is a hint allowing the backup backend to avoid managing
> +	 * its address space separately.
> +	 * @page_gfp: The gfp value used when the page was allocated.
> +	 * This is used for accounting purposes.
> +	 * @alloc_gfp: The gfp to be used when the backend needs to allocate
> +	 * memory.
> +	 *
> +	 * Return: A handle on success. 0 on failure.
> +	 * (This is following the swp_entry_t convention).
> +	 *
> +	 * Note: This function could be extended to back up a folio and
> +	 * backends would then split the folio internally if needed.
> +	 * Drawback is that the caller would then have to keep track of
> +	 * the folio size and usage.
> +	 */
> +	unsigned long (*backup_page)(struct ttm_backup *backup, struct page *page,
> +				     bool writeback, pgoff_t i, gfp_t page_gfp,
> +				     gfp_t alloc_gfp);
> +	/**
> +	 * fini - Free the struct backup resources after last use.
> +	 * @backup: Pointer to the struct backup whose resources to free.
> +	 *
> +	 * After a call to @fini, it's illegal to use the @backup pointer.
> +	 */
> +	void (*fini)(struct ttm_backup *backup);
> +};
> +
> +/**
> + * struct ttm_backup - Abstract a backup backend.
> + * @ops: The operations as described above.
> + *
> + * The struct ttm_backup is intended to be subclassed by the
> + * backend implementation.
> + */
> +struct ttm_backup {
> +	const struct ttm_backup_ops *ops;
> +};
> +
> +/**
> + * ttm_backup_shmem_create() - Create a shmem-based struct backup.
> + * @size: The maximum size (in bytes) to back up.
> + *
> + * Create a backup utilizing shmem objects.
> + *
> + * Return: A pointer to a struct ttm_backup on success,
> + * an error pointer on error.
> + */
> +struct ttm_backup *ttm_backup_shmem_create(loff_t size);
> +
> +#endif
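
As a side note on the handle helpers above: since struct page pointers are
at least word aligned, bit 0 is free, so a backend user can keep a single
struct page *array whose entries are either real page pointers or tagged
backup handles. A minimal sketch of how a caller might use them (just an
illustration, not part of the patch; the helper names are made up):

#include <drm/ttm/ttm_backup.h>

static void example_mark_backed_up(struct page **pages, pgoff_t i,
				   unsigned long handle)
{
	/* Replace the (now backed up) page with its tagged handle. */
	pages[i] = ttm_backup_handle_to_page_ptr(handle);
}

static bool example_is_backed_up(struct page **pages, pgoff_t i)
{
	/* Bit 0 set means the entry is a handle, not a page pointer. */
	return ttm_backup_page_ptr_is_handle(pages[i]);
}

static unsigned long example_backup_handle(struct page **pages, pgoff_t i)
{
	/* Recover the handle for drop() or copy_backed_up_page(). */
	return ttm_backup_page_ptr_to_handle(pages[i]);
}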


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 04/12] drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist moves
  2024-07-04  9:21   ` Christian König
@ 2024-07-04 12:41     ` Thomas Hellström
  2024-07-04 13:13       ` Christian König
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Hellström @ 2024-07-04 12:41 UTC (permalink / raw)
  To: Christian König, intel-xe
  Cc: Somalapuram Amaranath, Matthew Brost, dri-devel

Hi, Christian,

On Thu, 2024-07-04 at 11:21 +0200, Christian König wrote:
> Am 03.07.24 um 17:38 schrieb Thomas Hellström:
> > To address the problem with hitches moving when bulk move
> > sublists are lru-bumped, register the list cursors with the
> > ttm_lru_bulk_move structure when traversing its list, and
> > when lru-bumping the list, move the cursor hitch to the tail.
> > This also means it's mandatory for drivers to call
> > ttm_lru_bulk_move_init() and ttm_lru_bulk_move_fini() when
> > initializing and finalizing the bulk move structure, so add
> > those calls to the amdgpu- and xe driver.
> > 
> > Compared to v1 this is slightly more code but less fragile
> > and hopefully easier to understand.
> 
> This is the only patch in the series which I see critical.
> 
> I think the final goal when using drm_exec in TTMs eviction path is
> to 
> keep all evicted (or evicting) BOs locked until we have enough space.
> 
> This means that for bulk move sections on the LRU we would lock the 
> first BO and would only drop that lock again if we have gone over the
> full bulk move section and know that all BOs are not valuable for
> eviction.
> 
> Because of this the issue of having to consider hitches move with a
> bulk 
> move section on the LRU doesn't even occur because for that a
> concurrent 
> process would need to grab the common lock of the BOs in the bulk
> move 
> section.

While I agree that this is something we should strive towards,
following the previous discussion I already reworked this patch
completely to remove the dual hitches and make it less fragile. 
After that you mentioned you were ok with the high level approach for
these first four patches here:

https://lists.freedesktop.org/archives/dri-devel/2024-April/450288.html

So is that no longer the case?

To recap, the concerns I'm seeing with the "kept common lock" approach
are

a) In the window where we have released the LRU lock but not yet taken
the common bulk bo lock, an LRU bump may happen and the hitch will go
with it. So to avoid that we need to place the hitch *before* the
resource currently being considered in the LRU list rather than *after*
it. But then, on the next iteration, we need some way to decide what's
really the next resource: if the next resource pointer is the same one
we already considered, should we assume it might have been freed and
re-allocated at the same virtual address? 

b) It will be up to the user of the LRU traversal to actually guarantee
that locks are held across a bulk part, to make the resource traversal
reasonably self-contained. In this case that user is the LRU walker,
since that's where the bo locking happens. 
This means that any other code that aims to walk the LRUs for various
reasons, and doesn't provide any held lock guarantees, may be subject
to unexpected results if someone bumped the LRU.
So we would basically tailor the resource iteration here for a single
use-case and not make it robust for various use-cases.

So my suggestion is we keep this until we've come up with a bullet-
proof way to sort out a) and b) above and then we can rip it out.

/Thomas












> 
> Regards,
> Christian.
> 
> 
> > 
> > Changes in previous series:
> > - Completely rework the functionality
> > - Avoid a NULL pointer dereference assigning manager->mem_type
> > - Remove some leftover code causing build problems
> > v2:
> > - For hitch bulk tail moves, store the mem_type in the cursor
> >    instead of with the manager.
> > v3:
> > - Remove leftover mem_type member from change in v2.
> > v6:
> > - Add some lockdep asserts (Matthew Brost)
> > - Avoid NULL pointer dereference (Matthew Brost)
> > - No need to check bo->resource before dereferencing
> >    bo->bulk_move (Matthew Brost)
> > 
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: <dri-devel@lists.freedesktop.org>
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c |  4 ++
> >   drivers/gpu/drm/ttm/ttm_resource.c     | 92
> > ++++++++++++++++++++++++++
> >   drivers/gpu/drm/xe/xe_vm.c             |  4 ++
> >   include/drm/ttm/ttm_resource.h         | 56 ++++++++++------
> >   4 files changed, 135 insertions(+), 21 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > index 3abfa66d72a2..97743993d711 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > @@ -2420,6 +2420,8 @@ int amdgpu_vm_init(struct amdgpu_device
> > *adev, struct amdgpu_vm *vm,
> >   	if (r)
> >   		return r;
> >   
> > +	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
> > +
> >   	vm->is_compute_context = false;
> >   
> >   	vm->use_cpu_for_update = !!(adev-
> > >vm_manager.vm_update_mode &
> > @@ -2484,6 +2486,7 @@ int amdgpu_vm_init(struct amdgpu_device
> > *adev, struct amdgpu_vm *vm,
> >   error_free_delayed:
> >   	dma_fence_put(vm->last_tlb_flush);
> >   	dma_fence_put(vm->last_unlocked);
> > +	ttm_lru_bulk_move_fini(&adev->mman.bdev, &vm-
> > >lru_bulk_move);
> >   	amdgpu_vm_fini_entities(vm);
> >   
> >   	return r;
> > @@ -2640,6 +2643,7 @@ void amdgpu_vm_fini(struct amdgpu_device
> > *adev, struct amdgpu_vm *vm)
> >   		}
> >   	}
> >   
> > +	ttm_lru_bulk_move_fini(&adev->mman.bdev, &vm-
> > >lru_bulk_move);
> >   }
> >   
> >   /**
> > diff --git a/drivers/gpu/drm/ttm/ttm_resource.c
> > b/drivers/gpu/drm/ttm/ttm_resource.c
> > index 9c8b6499edfb..b6a2daac5518 100644
> > --- a/drivers/gpu/drm/ttm/ttm_resource.c
> > +++ b/drivers/gpu/drm/ttm/ttm_resource.c
> > @@ -33,6 +33,53 @@
> >   
> >   #include <drm/drm_util.h>
> >   
> > +/* Detach the cursor from the bulk move list*/
> > +static void
> > +ttm_resource_cursor_clear_bulk(struct ttm_resource_cursor *cursor)
> > +{
> > +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> > +
> > +	cursor->bulk = NULL;
> > +	list_del_init(&cursor->bulk_link);
> > +}
> > +
> > +/* Move the cursor to the end of the bulk move list it's in */
> > +static void ttm_resource_cursor_move_bulk_tail(struct
> > ttm_lru_bulk_move *bulk,
> > +					       struct
> > ttm_resource_cursor *cursor)
> > +{
> > +	struct ttm_lru_bulk_move_pos *pos;
> > +
> > +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> > +
> > +	if (WARN_ON_ONCE(bulk != cursor->bulk)) {
> > +		list_del_init(&cursor->bulk_link);
> > +		return;
> > +	}
> > +
> > +	pos = &bulk->pos[cursor->mem_type][cursor->priority];
> > +	if (pos->last)
> > +		list_move(&cursor->hitch.link, &pos->last-
> > >lru.link);
> > +	ttm_resource_cursor_clear_bulk(cursor);
> > +}
> > +
> > +/* Move all cursors attached to a bulk move to its end */
> > +static void ttm_bulk_move_adjust_cursors(struct ttm_lru_bulk_move
> > *bulk)
> > +{
> > +	struct ttm_resource_cursor *cursor, *next;
> > +
> > +	list_for_each_entry_safe(cursor, next, &bulk->cursor_list,
> > bulk_link)
> > +		ttm_resource_cursor_move_bulk_tail(bulk, cursor);
> > +}
> > +
> > +/* Remove a cursor from an empty bulk move list */
> > +static void ttm_bulk_move_drop_cursors(struct ttm_lru_bulk_move
> > *bulk)
> > +{
> > +	struct ttm_resource_cursor *cursor, *next;
> > +
> > +	list_for_each_entry_safe(cursor, next, &bulk->cursor_list,
> > bulk_link)
> > +		ttm_resource_cursor_clear_bulk(cursor);
> > +}
> > +
> >   /**
> >    * ttm_resource_cursor_fini_locked() - Finalize the LRU list
> > cursor usage
> >    * @cursor: The struct ttm_resource_cursor to finalize.
> > @@ -45,6 +92,7 @@ void ttm_resource_cursor_fini_locked(struct
> > ttm_resource_cursor *cursor)
> >   {
> >   	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> >   	list_del_init(&cursor->hitch.link);
> > +	ttm_resource_cursor_clear_bulk(cursor);
> >   }
> >   
> >   /**
> > @@ -73,9 +121,27 @@ void ttm_resource_cursor_fini(struct
> > ttm_resource_cursor *cursor)
> >   void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk)
> >   {
> >   	memset(bulk, 0, sizeof(*bulk));
> > +	INIT_LIST_HEAD(&bulk->cursor_list);
> >   }
> >   EXPORT_SYMBOL(ttm_lru_bulk_move_init);
> >   
> > +/**
> > + * ttm_lru_bulk_move_fini - finalize a bulk move structure
> > + * @bdev: The struct ttm_device
> > + * @bulk: the structure to finalize
> > + *
> > + * Sanity checks that bulk moves don't have any
> > + * resources left and hence no cursors attached.
> > + */
> > +void ttm_lru_bulk_move_fini(struct ttm_device *bdev,
> > +			    struct ttm_lru_bulk_move *bulk)
> > +{
> > +	spin_lock(&bdev->lru_lock);
> > +	ttm_bulk_move_drop_cursors(bulk);
> > +	spin_unlock(&bdev->lru_lock);
> > +}
> > +EXPORT_SYMBOL(ttm_lru_bulk_move_fini);
> > +
> >   /**
> >    * ttm_lru_bulk_move_tail - bulk move range of resources to the
> > LRU tail.
> >    *
> > @@ -88,6 +154,7 @@ void ttm_lru_bulk_move_tail(struct
> > ttm_lru_bulk_move *bulk)
> >   {
> >   	unsigned i, j;
> >   
> > +	ttm_bulk_move_adjust_cursors(bulk);
> >   	for (i = 0; i < TTM_NUM_MEM_TYPES; ++i) {
> >   		for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) {
> >   			struct ttm_lru_bulk_move_pos *pos = &bulk-
> > >pos[i][j];
> > @@ -515,6 +582,28 @@ void ttm_resource_manager_debug(struct
> > ttm_resource_manager *man,
> >   }
> >   EXPORT_SYMBOL(ttm_resource_manager_debug);
> >   
> > +static void
> > +ttm_resource_cursor_check_bulk(struct ttm_resource_cursor *cursor,
> > +			       struct ttm_lru_item *next_lru)
> > +{
> > +	struct ttm_resource *next = ttm_lru_item_to_res(next_lru);
> > +	struct ttm_lru_bulk_move *bulk = NULL;
> > +	struct ttm_buffer_object *bo = next->bo;
> > +
> > +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> > +	bulk = bo->bulk_move;
> > +
> > +	if (cursor->bulk != bulk) {
> > +		if (bulk) {
> > +			list_move_tail(&cursor->bulk_link, &bulk-
> > >cursor_list);
> > +			cursor->mem_type = next->mem_type;
> > +		} else {
> > +			list_del_init(&cursor->bulk_link);
> > +		}
> > +		cursor->bulk = bulk;
> > +	}
> > +}
> > +
> >   /**
> >    * ttm_resource_manager_first() - Start iterating over the
> > resources
> >    * of a resource manager
> > @@ -535,6 +624,7 @@ ttm_resource_manager_first(struct
> > ttm_resource_manager *man,
> >   	cursor->priority = 0;
> >   	cursor->man = man;
> >   	ttm_lru_item_init(&cursor->hitch, TTM_LRU_HITCH);
> > +	INIT_LIST_HEAD(&cursor->bulk_link);
> >   	list_add(&cursor->hitch.link, &man->lru[cursor-
> > >priority]);
> >   
> >   	return ttm_resource_manager_next(cursor);
> > @@ -559,6 +649,7 @@ ttm_resource_manager_next(struct
> > ttm_resource_cursor *cursor)
> >   		lru = &cursor->hitch;
> >   		list_for_each_entry_continue(lru, &man-
> > >lru[cursor->priority], link) {
> >   			if (ttm_lru_item_is_res(lru)) {
> > +				ttm_resource_cursor_check_bulk(cur
> > sor, lru);
> >   				list_move(&cursor->hitch.link,
> > &lru->link);
> >   				return ttm_lru_item_to_res(lru);
> >   			}
> > @@ -568,6 +659,7 @@ ttm_resource_manager_next(struct
> > ttm_resource_cursor *cursor)
> >   			break;
> >   
> >   		list_move(&cursor->hitch.link, &man->lru[cursor-
> > >priority]);
> > +		ttm_resource_cursor_clear_bulk(cursor);
> >   	}
> >   
> >   	ttm_resource_cursor_fini_locked(cursor);
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c
> > b/drivers/gpu/drm/xe/xe_vm.c
> > index 5b166fa03684..0c7e327bc9a2 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -1335,6 +1335,8 @@ struct xe_vm *xe_vm_create(struct xe_device
> > *xe, u32 flags)
> >   
> >   	INIT_WORK(&vm->destroy_work, vm_destroy_work_func);
> >   
> > +	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
> > +
> >   	INIT_LIST_HEAD(&vm->preempt.exec_queues);
> >   	vm->preempt.min_run_period_ms = 10;	/* FIXME: Wire up
> > to uAPI */
> >   
> > @@ -1458,6 +1460,7 @@ struct xe_vm *xe_vm_create(struct xe_device
> > *xe, u32 flags)
> >   	mutex_destroy(&vm->snap_mutex);
> >   	for_each_tile(tile, xe, id)
> >   		xe_range_fence_tree_fini(&vm->rftree[id]);
> > +	ttm_lru_bulk_move_fini(&xe->ttm, &vm->lru_bulk_move);
> >   	kfree(vm);
> >   	if (flags & XE_VM_FLAG_LR_MODE)
> >   		xe_pm_runtime_put(xe);
> > @@ -1601,6 +1604,7 @@ static void vm_destroy_work_func(struct
> > work_struct *w)
> >   		XE_WARN_ON(vm->pt_root[id]);
> >   
> >   	trace_xe_vm_free(vm);
> > +	ttm_lru_bulk_move_fini(&xe->ttm, &vm->lru_bulk_move);
> >   	kfree(vm);
> >   }
> >   
> > diff --git a/include/drm/ttm/ttm_resource.h
> > b/include/drm/ttm/ttm_resource.h
> > index 8fac781f641e..571abb4861a6 100644
> > --- a/include/drm/ttm/ttm_resource.h
> > +++ b/include/drm/ttm/ttm_resource.h
> > @@ -269,26 +269,6 @@ ttm_lru_item_to_res(struct ttm_lru_item *item)
> >   	return container_of(item, struct ttm_resource, lru);
> >   }
> >   
> > -/**
> > - * struct ttm_resource_cursor
> > - *
> > - * @man: The resource manager currently being iterated over.
> > - * @hitch: A hitch list node inserted before the next resource
> > - * to iterate over.
> > - * @priority: the current priority
> > - *
> > - * Cursor to iterate over the resources in a manager.
> > - */
> > -struct ttm_resource_cursor {
> > -	struct ttm_resource_manager *man;
> > -	struct ttm_lru_item hitch;
> > -	unsigned int priority;
> > -};
> > -
> > -void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor
> > *cursor);
> > -
> > -void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor);
> > -
> >   /**
> >    * struct ttm_lru_bulk_move_pos
> >    *
> > @@ -304,8 +284,9 @@ struct ttm_lru_bulk_move_pos {
> >   
> >   /**
> >    * struct ttm_lru_bulk_move
> > - *
> >    * @pos: first/last lru entry for resources in the each
> > domain/priority
> > + * @cursor_list: The list of cursors currently traversing any of
> > + * the sublists of @pos. Protected by the ttm device's lru_lock.
> >    *
> >    * Container for the current bulk move state. Should be used with
> >    * ttm_lru_bulk_move_init() and ttm_bo_set_bulk_move().
> > @@ -315,8 +296,39 @@ struct ttm_lru_bulk_move_pos {
> >    */
> >   struct ttm_lru_bulk_move {
> >   	struct ttm_lru_bulk_move_pos
> > pos[TTM_NUM_MEM_TYPES][TTM_MAX_BO_PRIORITY];
> > +	struct list_head cursor_list;
> >   };
> >   
> > +/**
> > + * struct ttm_resource_cursor
> > + * @man: The resource manager currently being iterated over
> > + * @hitch: A hitch list node inserted before the next resource
> > + * to iterate over.
> > + * @bulk_link: A list link for the list of cursors traversing the
> > + * bulk sublist of @bulk. Protected by the ttm device's lru_lock.
> > + * @bulk: Pointer to struct ttm_lru_bulk_move whose subrange
> > @hitch is
> > + * inserted to. NULL if none. Never dereference this pointer since
> > + * the struct ttm_lru_bulk_move object pointed to might have been
> > + * freed. The pointer is only for comparison.
> > + * @mem_type: The memory type of the LRU list being traversed.
> > + * This field is valid iff @bulk != NULL.
> > + * @priority: the current priority
> > + *
> > + * Cursor to iterate over the resources in a manager.
> > + */
> > +struct ttm_resource_cursor {
> > +	struct ttm_resource_manager *man;
> > +	struct ttm_lru_item hitch;
> > +	struct list_head bulk_link;
> > +	struct ttm_lru_bulk_move *bulk;
> > +	unsigned int mem_type;
> > +	unsigned int priority;
> > +};
> > +
> > +void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor
> > *cursor);
> > +
> > +void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor);
> > +
> >   /**
> >    * struct ttm_kmap_iter_iomap - Specialization for a struct
> > io_mapping +
> >    * struct sg_table backed struct ttm_resource.
> > @@ -405,6 +417,8 @@ ttm_resource_manager_cleanup(struct
> > ttm_resource_manager *man)
> >   
> >   void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk);
> >   void ttm_lru_bulk_move_tail(struct ttm_lru_bulk_move *bulk);
> > +void ttm_lru_bulk_move_fini(struct ttm_device *bdev,
> > +			    struct ttm_lru_bulk_move *bulk);
> >   
> >   void ttm_resource_add_bulk_move(struct ttm_resource *res,
> >   				struct ttm_buffer_object *bo);
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 04/12] drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist moves
  2024-07-04 12:41     ` Thomas Hellström
@ 2024-07-04 13:13       ` Christian König
  2024-07-04 13:53         ` Thomas Hellström
  0 siblings, 1 reply; 38+ messages in thread
From: Christian König @ 2024-07-04 13:13 UTC (permalink / raw)
  To: Thomas Hellström, intel-xe
  Cc: Somalapuram Amaranath, Matthew Brost, dri-devel

Hey Thomas,

Am 04.07.24 um 14:41 schrieb Thomas Hellström:
> Hi, Christian,
>
> On Thu, 2024-07-04 at 11:21 +0200, Christian König wrote:
>> Am 03.07.24 um 17:38 schrieb Thomas Hellström:
>>> To address the problem with hitches moving when bulk move
>>> sublists are lru-bumped, register the list cursors with the
>>> ttm_lru_bulk_move structure when traversing its list, and
>>> when lru-bumping the list, move the cursor hitch to the tail.
>>> This also means it's mandatory for drivers to call
>>> ttm_lru_bulk_move_init() and ttm_lru_bulk_move_fini() when
>>> initializing and finalizing the bulk move structure, so add
>>> those calls to the amdgpu- and xe driver.
>>>
>>> Compared to v1 this is slightly more code but less fragile
>>> and hopefully easier to understand.
>> This is the only patch in the series which I see critical.
>>
>> I think the final goal when using drm_exec in TTMs eviction path is
>> to
>> keep all evicted (or evicting) BOs locked until we have enough space.
>>
>> This means that for bulk move sections on the LRU we would lock the
>> first BO and would only drop that lock again if we have gone over the
>> full bulk move section and know that all BOs are not valuable for
>> eviction.
>>
>> Because of this the issue of having to consider hitches move with a
>> bulk
>> move section on the LRU doesn't even occur because for that a
>> concurrent
>> process would need to grab the common lock of the BOs in the bulk
>> move
>> section.
> While I agree that this is something we should strive towards,
> following the previous discussion I already reworked this patch
> completely to remove the dual hitches and make it less fragile.

Yeah seen that and it indeed makes it much easier to understand what's 
going on.

> After that you mentioned you were ok with the high level approach for
> these first four patches here:
>
> https://lists.freedesktop.org/archives/dri-devel/2024-April/450288.html
>
> So is that not any longer the case?

I'm ok with having it as an intermediate step, but for that it's a bit 
much of a hammer.

On the other hand having clean ttm_lru_bulk_move_init() and 
ttm_lru_bulk_move_fini() calls is probably something we should keep 
around anyway.

> To recap, the concerns I'm seeing with the "kept common lock" approach
> are
>
> a) Since when we release the LRU lock and the common bulk bo lock is
> not yet locked, a LRU bump may happen and the hitch will go with it. So
> to avoid that we need to place the hitch *before* then considered
> resource in the LRU list rather than *after*. Now on the next iteration
> we need to come up with some way to choose what's really the next
> resource? If the next resource pointer is the same we already
> considered, should we assume it might have been freed and re-alloced
> with the same virtual address?

My idea for the general flow is this:

1. Grab the lru lock
2. Grab a reference to the BO after the hitch, eventually trylock the BO 
or just continue with a prelocked one
3. If locking wasn't successful
     4. Drop the lru lock
     5. Block on the BO lock
     6. Check that this resource/BO is still the one the cursor points 
to, if not drop the lock and restart from #1
     7. Grab the lru lock
8. Advance the cursor.
9. Drop the lru lock.
10. Try to evict or swap the BO
11. Repeat if still not able to allocate memory.

The BO could be prelocked if it's part of the currently processed bulk 
or previously contended and prelocked by drm_exec.

And instead of checking if the resource is in the right domain, we check 
if the resource/BO is still the one the cursor points to.

This way we don't care if the resource was reallocated and by coincidence 
ended up right after the cursor hitch again. As long as we still point 
to the BO we just locked, everything is fine.
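
In code, that flow could look roughly like below. Just a sketch for
illustration, not something from this series: the drm_exec bookkeeping of
prelocked/contended BOs and most error handling are left out, and with the
iterators in this series the hitch has already advanced inside
ttm_resource_manager_next(), so step 8 has no separate counterpart:

#include <linux/dma-resv.h>
#include <drm/ttm/ttm_bo.h>
#include <drm/ttm/ttm_device.h>
#include <drm/ttm/ttm_resource.h>

static struct ttm_buffer_object *
example_lock_next_bo(struct ttm_device *bdev,
		     struct ttm_resource_cursor *cursor)
{
	struct ttm_buffer_object *bo;
	struct ttm_resource *res;

restart:
	spin_lock(&bdev->lru_lock);			/* 1 */
	res = ttm_resource_manager_next(cursor);	/* resource after the hitch */
	if (!res) {
		spin_unlock(&bdev->lru_lock);
		return NULL;
	}

	bo = res->bo;
	if (!ttm_bo_get_unless_zero(bo)) {		/* 2: reference the BO */
		spin_unlock(&bdev->lru_lock);
		goto restart;
	}

	if (!dma_resv_trylock(bo->base.resv)) {		/* 2: trylock */
		spin_unlock(&bdev->lru_lock);		/* 3, 4 */
		dma_resv_lock(bo->base.resv, NULL);	/* 5: block on the BO lock */
		if (bo->resource != res) {		/* 6: still the same one? */
			dma_resv_unlock(bo->base.resv);
			ttm_bo_put(bo);
			goto restart;			/* back to 1 */
		}
		spin_lock(&bdev->lru_lock);		/* 7 */
	}

	spin_unlock(&bdev->lru_lock);			/* 8, 9 */

	return bo;	/* 10: caller evicts or swaps, unlocks, puts, repeats */
}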

> b) It will be up to the user of the lru traversal to actually guarantee
> that locks are held across a bulk part, to make the resource traversal
> reasonably self-contained. In this case the LRU walker, because there's
> where the bo locking happens.
> This means that any other code that aims to walk the LRUs for various
> reasons, and doesn't provide any held lock guarantees, may be subject
> to unexpected results if someone bumped the LRU.
> So we would basically tailor the resource iteration here for a single
> use-case and not make it robust for various use-cases.

Yeah, that's also going in a direction I was questioning. Do we have 
use cases for the resource iterator where we don't lock the BO?

If not why don't we integrate all this into the first_resource() and 
next_resource() functions instead? Obviously with some helpers in the BO 
code.

> So my suggestion is we keep this until we've come up with a bullet-
> proof way to sort out a) and b) above and then we can rip it out.

Yeah if we can't make progress otherwise that works for me as well.

Regards,
Christian.

>
> /Thomas
>
>
>
>
>
>
>
>
>
>
>
>
>> Regards,
>> Christian.
>>
>>
>>> Changes in previous series:
>>> - Completely rework the functionality
>>> - Avoid a NULL pointer dereference assigning manager->mem_type
>>> - Remove some leftover code causing build problems
>>> v2:
>>> - For hitch bulk tail moves, store the mem_type in the cursor
>>>     instead of with the manager.
>>> v3:
>>> - Remove leftover mem_type member from change in v2.
>>> v6:
>>> - Add some lockdep asserts (Matthew Brost)
>>> - Avoid NULL pointer dereference (Matthew Brost)
>>> - No need to check bo->resource before dereferencing
>>>     bo->bulk_move (Matthew Brost)
>>>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>> Cc: <dri-devel@lists.freedesktop.org>
>>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c |  4 ++
>>>    drivers/gpu/drm/ttm/ttm_resource.c     | 92
>>> ++++++++++++++++++++++++++
>>>    drivers/gpu/drm/xe/xe_vm.c             |  4 ++
>>>    include/drm/ttm/ttm_resource.h         | 56 ++++++++++------
>>>    4 files changed, 135 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 3abfa66d72a2..97743993d711 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -2420,6 +2420,8 @@ int amdgpu_vm_init(struct amdgpu_device
>>> *adev, struct amdgpu_vm *vm,
>>>    	if (r)
>>>    		return r;
>>>    
>>> +	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
>>> +
>>>    	vm->is_compute_context = false;
>>>    
>>>    	vm->use_cpu_for_update = !!(adev-
>>>> vm_manager.vm_update_mode &
>>> @@ -2484,6 +2486,7 @@ int amdgpu_vm_init(struct amdgpu_device
>>> *adev, struct amdgpu_vm *vm,
>>>    error_free_delayed:
>>>    	dma_fence_put(vm->last_tlb_flush);
>>>    	dma_fence_put(vm->last_unlocked);
>>> +	ttm_lru_bulk_move_fini(&adev->mman.bdev, &vm-
>>>> lru_bulk_move);
>>>    	amdgpu_vm_fini_entities(vm);
>>>    
>>>    	return r;
>>> @@ -2640,6 +2643,7 @@ void amdgpu_vm_fini(struct amdgpu_device
>>> *adev, struct amdgpu_vm *vm)
>>>    		}
>>>    	}
>>>    
>>> +	ttm_lru_bulk_move_fini(&adev->mman.bdev, &vm-
>>>> lru_bulk_move);
>>>    }
>>>    
>>>    /**
>>> diff --git a/drivers/gpu/drm/ttm/ttm_resource.c
>>> b/drivers/gpu/drm/ttm/ttm_resource.c
>>> index 9c8b6499edfb..b6a2daac5518 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_resource.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_resource.c
>>> @@ -33,6 +33,53 @@
>>>    
>>>    #include <drm/drm_util.h>
>>>    
>>> +/* Detach the cursor from the bulk move list*/
>>> +static void
>>> +ttm_resource_cursor_clear_bulk(struct ttm_resource_cursor *cursor)
>>> +{
>>> +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
>>> +
>>> +	cursor->bulk = NULL;
>>> +	list_del_init(&cursor->bulk_link);
>>> +}
>>> +
>>> +/* Move the cursor to the end of the bulk move list it's in */
>>> +static void ttm_resource_cursor_move_bulk_tail(struct
>>> ttm_lru_bulk_move *bulk,
>>> +					       struct
>>> ttm_resource_cursor *cursor)
>>> +{
>>> +	struct ttm_lru_bulk_move_pos *pos;
>>> +
>>> +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
>>> +
>>> +	if (WARN_ON_ONCE(bulk != cursor->bulk)) {
>>> +		list_del_init(&cursor->bulk_link);
>>> +		return;
>>> +	}
>>> +
>>> +	pos = &bulk->pos[cursor->mem_type][cursor->priority];
>>> +	if (pos->last)
>>> +		list_move(&cursor->hitch.link, &pos->last-
>>>> lru.link);
>>> +	ttm_resource_cursor_clear_bulk(cursor);
>>> +}
>>> +
>>> +/* Move all cursors attached to a bulk move to its end */
>>> +static void ttm_bulk_move_adjust_cursors(struct ttm_lru_bulk_move
>>> *bulk)
>>> +{
>>> +	struct ttm_resource_cursor *cursor, *next;
>>> +
>>> +	list_for_each_entry_safe(cursor, next, &bulk->cursor_list,
>>> bulk_link)
>>> +		ttm_resource_cursor_move_bulk_tail(bulk, cursor);
>>> +}
>>> +
>>> +/* Remove a cursor from an empty bulk move list */
>>> +static void ttm_bulk_move_drop_cursors(struct ttm_lru_bulk_move
>>> *bulk)
>>> +{
>>> +	struct ttm_resource_cursor *cursor, *next;
>>> +
>>> +	list_for_each_entry_safe(cursor, next, &bulk->cursor_list,
>>> bulk_link)
>>> +		ttm_resource_cursor_clear_bulk(cursor);
>>> +}
>>> +
>>>    /**
>>>     * ttm_resource_cursor_fini_locked() - Finalize the LRU list
>>> cursor usage
>>>     * @cursor: The struct ttm_resource_cursor to finalize.
>>> @@ -45,6 +92,7 @@ void ttm_resource_cursor_fini_locked(struct
>>> ttm_resource_cursor *cursor)
>>>    {
>>>    	lockdep_assert_held(&cursor->man->bdev->lru_lock);
>>>    	list_del_init(&cursor->hitch.link);
>>> +	ttm_resource_cursor_clear_bulk(cursor);
>>>    }
>>>    
>>>    /**
>>> @@ -73,9 +121,27 @@ void ttm_resource_cursor_fini(struct
>>> ttm_resource_cursor *cursor)
>>>    void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk)
>>>    {
>>>    	memset(bulk, 0, sizeof(*bulk));
>>> +	INIT_LIST_HEAD(&bulk->cursor_list);
>>>    }
>>>    EXPORT_SYMBOL(ttm_lru_bulk_move_init);
>>>    
>>> +/**
>>> + * ttm_lru_bulk_move_fini - finalize a bulk move structure
>>> + * @bdev: The struct ttm_device
>>> + * @bulk: the structure to finalize
>>> + *
>>> + * Sanity checks that bulk moves don't have any
>>> + * resources left and hence no cursors attached.
>>> + */
>>> +void ttm_lru_bulk_move_fini(struct ttm_device *bdev,
>>> +			    struct ttm_lru_bulk_move *bulk)
>>> +{
>>> +	spin_lock(&bdev->lru_lock);
>>> +	ttm_bulk_move_drop_cursors(bulk);
>>> +	spin_unlock(&bdev->lru_lock);
>>> +}
>>> +EXPORT_SYMBOL(ttm_lru_bulk_move_fini);
>>> +
>>>    /**
>>>     * ttm_lru_bulk_move_tail - bulk move range of resources to the
>>> LRU tail.
>>>     *
>>> @@ -88,6 +154,7 @@ void ttm_lru_bulk_move_tail(struct
>>> ttm_lru_bulk_move *bulk)
>>>    {
>>>    	unsigned i, j;
>>>    
>>> +	ttm_bulk_move_adjust_cursors(bulk);
>>>    	for (i = 0; i < TTM_NUM_MEM_TYPES; ++i) {
>>>    		for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) {
>>>    			struct ttm_lru_bulk_move_pos *pos = &bulk-
>>>> pos[i][j];
>>> @@ -515,6 +582,28 @@ void ttm_resource_manager_debug(struct
>>> ttm_resource_manager *man,
>>>    }
>>>    EXPORT_SYMBOL(ttm_resource_manager_debug);
>>>    
>>> +static void
>>> +ttm_resource_cursor_check_bulk(struct ttm_resource_cursor *cursor,
>>> +			       struct ttm_lru_item *next_lru)
>>> +{
>>> +	struct ttm_resource *next = ttm_lru_item_to_res(next_lru);
>>> +	struct ttm_lru_bulk_move *bulk = NULL;
>>> +	struct ttm_buffer_object *bo = next->bo;
>>> +
>>> +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
>>> +	bulk = bo->bulk_move;
>>> +
>>> +	if (cursor->bulk != bulk) {
>>> +		if (bulk) {
>>> +			list_move_tail(&cursor->bulk_link, &bulk-
>>>> cursor_list);
>>> +			cursor->mem_type = next->mem_type;
>>> +		} else {
>>> +			list_del_init(&cursor->bulk_link);
>>> +		}
>>> +		cursor->bulk = bulk;
>>> +	}
>>> +}
>>> +
>>>    /**
>>>     * ttm_resource_manager_first() - Start iterating over the
>>> resources
>>>     * of a resource manager
>>> @@ -535,6 +624,7 @@ ttm_resource_manager_first(struct
>>> ttm_resource_manager *man,
>>>    	cursor->priority = 0;
>>>    	cursor->man = man;
>>>    	ttm_lru_item_init(&cursor->hitch, TTM_LRU_HITCH);
>>> +	INIT_LIST_HEAD(&cursor->bulk_link);
>>>    	list_add(&cursor->hitch.link, &man->lru[cursor-
>>>> priority]);
>>>    
>>>    	return ttm_resource_manager_next(cursor);
>>> @@ -559,6 +649,7 @@ ttm_resource_manager_next(struct
>>> ttm_resource_cursor *cursor)
>>>    		lru = &cursor->hitch;
>>>    		list_for_each_entry_continue(lru, &man-
>>>> lru[cursor->priority], link) {
>>>    			if (ttm_lru_item_is_res(lru)) {
>>> +				ttm_resource_cursor_check_bulk(cur
>>> sor, lru);
>>>    				list_move(&cursor->hitch.link,
>>> &lru->link);
>>>    				return ttm_lru_item_to_res(lru);
>>>    			}
>>> @@ -568,6 +659,7 @@ ttm_resource_manager_next(struct
>>> ttm_resource_cursor *cursor)
>>>    			break;
>>>    
>>>    		list_move(&cursor->hitch.link, &man->lru[cursor-
>>>> priority]);
>>> +		ttm_resource_cursor_clear_bulk(cursor);
>>>    	}
>>>    
>>>    	ttm_resource_cursor_fini_locked(cursor);
>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
>>> b/drivers/gpu/drm/xe/xe_vm.c
>>> index 5b166fa03684..0c7e327bc9a2 100644
>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>> @@ -1335,6 +1335,8 @@ struct xe_vm *xe_vm_create(struct xe_device
>>> *xe, u32 flags)
>>>    
>>>    	INIT_WORK(&vm->destroy_work, vm_destroy_work_func);
>>>    
>>> +	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
>>> +
>>>    	INIT_LIST_HEAD(&vm->preempt.exec_queues);
>>>    	vm->preempt.min_run_period_ms = 10;	/* FIXME: Wire up
>>> to uAPI */
>>>    
>>> @@ -1458,6 +1460,7 @@ struct xe_vm *xe_vm_create(struct xe_device
>>> *xe, u32 flags)
>>>    	mutex_destroy(&vm->snap_mutex);
>>>    	for_each_tile(tile, xe, id)
>>>    		xe_range_fence_tree_fini(&vm->rftree[id]);
>>> +	ttm_lru_bulk_move_fini(&xe->ttm, &vm->lru_bulk_move);
>>>    	kfree(vm);
>>>    	if (flags & XE_VM_FLAG_LR_MODE)
>>>    		xe_pm_runtime_put(xe);
>>> @@ -1601,6 +1604,7 @@ static void vm_destroy_work_func(struct
>>> work_struct *w)
>>>    		XE_WARN_ON(vm->pt_root[id]);
>>>    
>>>    	trace_xe_vm_free(vm);
>>> +	ttm_lru_bulk_move_fini(&xe->ttm, &vm->lru_bulk_move);
>>>    	kfree(vm);
>>>    }
>>>    
>>> diff --git a/include/drm/ttm/ttm_resource.h
>>> b/include/drm/ttm/ttm_resource.h
>>> index 8fac781f641e..571abb4861a6 100644
>>> --- a/include/drm/ttm/ttm_resource.h
>>> +++ b/include/drm/ttm/ttm_resource.h
>>> @@ -269,26 +269,6 @@ ttm_lru_item_to_res(struct ttm_lru_item *item)
>>>    	return container_of(item, struct ttm_resource, lru);
>>>    }
>>>    
>>> -/**
>>> - * struct ttm_resource_cursor
>>> - *
>>> - * @man: The resource manager currently being iterated over.
>>> - * @hitch: A hitch list node inserted before the next resource
>>> - * to iterate over.
>>> - * @priority: the current priority
>>> - *
>>> - * Cursor to iterate over the resources in a manager.
>>> - */
>>> -struct ttm_resource_cursor {
>>> -	struct ttm_resource_manager *man;
>>> -	struct ttm_lru_item hitch;
>>> -	unsigned int priority;
>>> -};
>>> -
>>> -void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor
>>> *cursor);
>>> -
>>> -void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor);
>>> -
>>>    /**
>>>     * struct ttm_lru_bulk_move_pos
>>>     *
>>> @@ -304,8 +284,9 @@ struct ttm_lru_bulk_move_pos {
>>>    
>>>    /**
>>>     * struct ttm_lru_bulk_move
>>> - *
>>>     * @pos: first/last lru entry for resources in the each
>>> domain/priority
>>> + * @cursor_list: The list of cursors currently traversing any of
>>> + * the sublists of @pos. Protected by the ttm device's lru_lock.
>>>     *
>>>     * Container for the current bulk move state. Should be used with
>>>     * ttm_lru_bulk_move_init() and ttm_bo_set_bulk_move().
>>> @@ -315,8 +296,39 @@ struct ttm_lru_bulk_move_pos {
>>>     */
>>>    struct ttm_lru_bulk_move {
>>>    	struct ttm_lru_bulk_move_pos
>>> pos[TTM_NUM_MEM_TYPES][TTM_MAX_BO_PRIORITY];
>>> +	struct list_head cursor_list;
>>>    };
>>>    
>>> +/**
>>> + * struct ttm_resource_cursor
>>> + * @man: The resource manager currently being iterated over
>>> + * @hitch: A hitch list node inserted before the next resource
>>> + * to iterate over.
>>> + * @bulk_link: A list link for the list of cursors traversing the
>>> + * bulk sublist of @bulk. Protected by the ttm device's lru_lock.
>>> + * @bulk: Pointer to struct ttm_lru_bulk_move whose subrange
>>> @hitch is
>>> + * inserted to. NULL if none. Never dereference this pointer since
>>> + * the struct ttm_lru_bulk_move object pointed to might have been
>>> + * freed. The pointer is only for comparison.
>>> + * @mem_type: The memory type of the LRU list being traversed.
>>> + * This field is valid iff @bulk != NULL.
>>> + * @priority: the current priority
>>> + *
>>> + * Cursor to iterate over the resources in a manager.
>>> + */
>>> +struct ttm_resource_cursor {
>>> +	struct ttm_resource_manager *man;
>>> +	struct ttm_lru_item hitch;
>>> +	struct list_head bulk_link;
>>> +	struct ttm_lru_bulk_move *bulk;
>>> +	unsigned int mem_type;
>>> +	unsigned int priority;
>>> +};
>>> +
>>> +void ttm_resource_cursor_fini_locked(struct ttm_resource_cursor
>>> *cursor);
>>> +
>>> +void ttm_resource_cursor_fini(struct ttm_resource_cursor *cursor);
>>> +
>>>    /**
>>>     * struct ttm_kmap_iter_iomap - Specialization for a struct
>>> io_mapping +
>>>     * struct sg_table backed struct ttm_resource.
>>> @@ -405,6 +417,8 @@ ttm_resource_manager_cleanup(struct
>>> ttm_resource_manager *man)
>>>    
>>>    void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk);
>>>    void ttm_lru_bulk_move_tail(struct ttm_lru_bulk_move *bulk);
>>> +void ttm_lru_bulk_move_fini(struct ttm_device *bdev,
>>> +			    struct ttm_lru_bulk_move *bulk);
>>>    
>>>    void ttm_resource_add_bulk_move(struct ttm_resource *res,
>>>    				struct ttm_buffer_object *bo);


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 04/12] drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist moves
  2024-07-04 13:13       ` Christian König
@ 2024-07-04 13:53         ` Thomas Hellström
  2024-07-04 14:32           ` Christian König
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Hellström @ 2024-07-04 13:53 UTC (permalink / raw)
  To: Christian König, intel-xe
  Cc: Somalapuram Amaranath, Matthew Brost, dri-devel

On Thu, 2024-07-04 at 15:13 +0200, Christian König wrote:
> Hey Thomas,
> 
> Am 04.07.24 um 14:41 schrieb Thomas Hellström:
> > Hi, Christian,
> > 
> > On Thu, 2024-07-04 at 11:21 +0200, Christian König wrote:
> > > Am 03.07.24 um 17:38 schrieb Thomas Hellström:
> > > > To address the problem with hitches moving when bulk move
> > > > sublists are lru-bumped, register the list cursors with the
> > > > ttm_lru_bulk_move structure when traversing its list, and
> > > > when lru-bumping the list, move the cursor hitch to the tail.
> > > > This also means it's mandatory for drivers to call
> > > > ttm_lru_bulk_move_init() and ttm_lru_bulk_move_fini() when
> > > > initializing and finalizing the bulk move structure, so add
> > > > those calls to the amdgpu- and xe driver.
> > > > 
> > > > Compared to v1 this is slightly more code but less fragile
> > > > and hopefully easier to understand.
> > > This is the only patch in the series which I see critical.
> > > 
> > > I think the final goal when using drm_exec in TTMs eviction path
> > > is
> > > to
> > > keep all evicted (or evicting) BOs locked until we have enough
> > > space.
> > > 
> > > This means that for bulk move sections on the LRU we would lock
> > > the
> > > first BO and would only drop that lock again if we have gone over
> > > the
> > > full bulk move section and know that all BOs are not valuable for
> > > eviction.
> > > 
> > > Because of this the issue of having to consider hitches move with
> > > a
> > > bulk
> > > move section on the LRU doesn't even occur because for that a
> > > concurrent
> > > process would need to grab the common lock of the BOs in the bulk
> > > move
> > > section.
> > While I agree that this is something we should strive towards,
> > following the previous discussion I already reworked this patch
> > completely to remove the dual hitches and make it less fragile.
> 
> Yeah seen that and it indeed makes it much easier to understand
> what's 
> going on.
> 
> > After that you mentioned you were ok with the high level approach
> > for
> > these first four patches here:
> > 
> > https://lists.freedesktop.org/archives/dri-devel/2024-April/450288.html
> > 
> > So is that not any longer the case?
> 
> I'm ok with having it as intermediate step, but for that it's a bit
> much 
> of an hammer.
> 
> On the other hand having clean ttm_lru_bulk_move_init() and 
> ttm_lru_bulk_move_fini() calls is probably something we should keep 
> around anyway.
> 
> > To recap, the concerns I'm seeing with the "kept common lock"
> > approach
> > are
> > 
> > a) Since when we release the LRU lock and the common bulk bo lock
> > is
> > not yet locked, a LRU bump may happen and the hitch will go with
> > it. So
> > to avoid that we need to place the hitch *before* then considered
> > resource in the LRU list rather than *after*. Now on the next
> > iteration
> > we need to come up with some way to choose what's really the next
> > resource? If the next resource pointer is the same we already
> > considered, should we assume it might have been freed and re-
> > alloced
> > with the same virtual address?
> 
> My idea is for the general flow is this:
> 
> 1. Grab the lru lock
> 2. Grab a reference to the BO after the hitch, eventually trylock the
> BO 
> or just continue with a prelocked one
> 3. If locking wasn't successfully
>      4. Drop the lru lock
>      5. Block on the BO lock
>      6. Check that this resource/BO is still the one the cursor
> points 
> to, if not drop the lock and restart from #1
>      7. Grab the lru lock
> 8. Advance the cursor.
> 9. Drop the lru lock.
> 10. Try to evict or swap the BO
> 11. Repeat if still not able to allocate memory.
> 
> The BO could be prelocked if it's part of the currently processed
> bulk 
> or previously contended and prelocked by drm_exec.
> 
> And instead of checking if the resource is in the right domain we
> check 
> if the resource/BO is still the one where the cursor points to.
> 
> This way we don't care if the resource was reallocated and by
> coincident 
> ended up right after the cursor hitch again. As long as we still
> point 
> to the BO we just locked everything is fine.
> 
> > b) It will be up to the user of the lru traversal to actually
> > guarantee
> > that locks are held across a bulk part, to make the resource
> > traversal
> > reasonably self-contained. In this case the LRU walker, because
> > there's
> > where the bo locking happens.
> > This means that any other code that aims to walk the LRUs for
> > various
> > reasons, and doesn't provide any held lock guarantees, may be
> > subject
> > to unexpected results if someone bumped the LRU.
> > So we would basically tailor the resource iteration here for a
> > single
> > use-case and not make it robust for various use-cases.
> 
> Yeah, that's also going into a direction I was questioning. Do we
> have 
> use cases for the resource iterator were we don't lock the BO?

> 
> If not why don't we integrate all this into the first_resource() and 
> next_resource() functions instead? Obviously with some helpers in the
> BO 
> code.

That'd be if we moved this out to a drm-level layer like the work Oak
started for cross-component eviction targeting SVM.

I guess it's also my desire for keeping components separated as much as
possible, but I'm aware others may feel differently about that.

> 
> > So my suggestion is we keep this until we've come up with a bullet-
> > proof way to sort out a) and b) above and then we can rip it out.
> 
> Yeah if we can't make progress otherwise that works for me as well.

Then I'd say let's go for this and revisit. 

So what are the ARs here?
Making sure we have a clean init and fini is something I've thought of
as well.

Related to that, what's your opinion on using DEFINE_CLASS() and
scoped_guard() in TTM for automatic cleanup of the iterator when
leaving the loop scope?

https://elixir.bootlin.com/linux/v6.10-rc6/source/include/linux/cleanup.h#L168
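
FWIW, a minimal sketch of what that could look like with the cursor from
this series (illustration only; the class name is made up, nothing like it
is defined in the series, and it relies on ttm_resource_cursor_fini()
being safe to call on an already finalized cursor):

#include <linux/cleanup.h>
#include <linux/spinlock.h>
#include <drm/ttm/ttm_device.h>
#include <drm/ttm/ttm_resource.h>

/* Run ttm_resource_cursor_fini() when the cursor goes out of scope. */
DEFINE_CLASS(ttm_res_cursor, struct ttm_resource_cursor *,
	     if (_T) ttm_resource_cursor_fini(_T),
	     cursor, struct ttm_resource_cursor *cursor)

static void example_walk(struct ttm_resource_manager *man)
{
	struct ttm_resource_cursor cursor;
	struct ttm_resource *res;

	/* Cleanup runs at function exit, after the lru_lock guard below
	 * has been dropped, so the non-locked fini variant is fine.
	 */
	CLASS(ttm_res_cursor, cursor_scope)(&cursor);

	scoped_guard(spinlock, &man->bdev->lru_lock) {
		for (res = ttm_resource_manager_first(man, &cursor); res;
		     res = ttm_resource_manager_next(&cursor)) {
			/* Inspect res; an early return no longer leaks
			 * the hitch from the LRU list.
			 */
		}
	}
}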

Thanks,
Thomas

> 
> Regards,
> Christian.
> 
> > 
> > /Thomas
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > > Regards,
> > > Christian.
> > > 
> > > 
> > > > Changes in previous series:
> > > > - Completely rework the functionality
> > > > - Avoid a NULL pointer dereference assigning manager->mem_type
> > > > - Remove some leftover code causing build problems
> > > > v2:
> > > > - For hitch bulk tail moves, store the mem_type in the cursor
> > > >     instead of with the manager.
> > > > v3:
> > > > - Remove leftover mem_type member from change in v2.
> > > > v6:
> > > > - Add some lockdep asserts (Matthew Brost)
> > > > - Avoid NULL pointer dereference (Matthew Brost)
> > > > - No need to check bo->resource before dereferencing
> > > >     bo->bulk_move (Matthew Brost)
> > > > 
> > > > Cc: Christian König <christian.koenig@amd.com>
> > > > Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > Cc: <dri-devel@lists.freedesktop.org>
> > > > Signed-off-by: Thomas Hellström
> > > > <thomas.hellstrom@linux.intel.com>
> > > > ---
> > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c |  4 ++
> > > >    drivers/gpu/drm/ttm/ttm_resource.c     | 92
> > > > ++++++++++++++++++++++++++
> > > >    drivers/gpu/drm/xe/xe_vm.c             |  4 ++
> > > >    include/drm/ttm/ttm_resource.h         | 56 ++++++++++------
> > > >    4 files changed, 135 insertions(+), 21 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > index 3abfa66d72a2..97743993d711 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > @@ -2420,6 +2420,8 @@ int amdgpu_vm_init(struct amdgpu_device
> > > > *adev, struct amdgpu_vm *vm,
> > > >    	if (r)
> > > >    		return r;
> > > >    
> > > > +	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
> > > > +
> > > >    	vm->is_compute_context = false;
> > > >    
> > > >    	vm->use_cpu_for_update = !!(adev-
> > > > > vm_manager.vm_update_mode &
> > > > @@ -2484,6 +2486,7 @@ int amdgpu_vm_init(struct amdgpu_device
> > > > *adev, struct amdgpu_vm *vm,
> > > >    error_free_delayed:
> > > >    	dma_fence_put(vm->last_tlb_flush);
> > > >    	dma_fence_put(vm->last_unlocked);
> > > > +	ttm_lru_bulk_move_fini(&adev->mman.bdev, &vm-
> > > > > lru_bulk_move);
> > > >    	amdgpu_vm_fini_entities(vm);
> > > >    
> > > >    	return r;
> > > > @@ -2640,6 +2643,7 @@ void amdgpu_vm_fini(struct amdgpu_device
> > > > *adev, struct amdgpu_vm *vm)
> > > >    		}
> > > >    	}
> > > >    
> > > > +	ttm_lru_bulk_move_fini(&adev->mman.bdev, &vm-
> > > > > lru_bulk_move);
> > > >    }
> > > >    
> > > >    /**
> > > > diff --git a/drivers/gpu/drm/ttm/ttm_resource.c
> > > > b/drivers/gpu/drm/ttm/ttm_resource.c
> > > > index 9c8b6499edfb..b6a2daac5518 100644
> > > > --- a/drivers/gpu/drm/ttm/ttm_resource.c
> > > > +++ b/drivers/gpu/drm/ttm/ttm_resource.c
> > > > @@ -33,6 +33,53 @@
> > > >    
> > > >    #include <drm/drm_util.h>
> > > >    
> > > > +/* Detach the cursor from the bulk move list*/
> > > > +static void
> > > > +ttm_resource_cursor_clear_bulk(struct ttm_resource_cursor
> > > > *cursor)
> > > > +{
> > > > +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> > > > +
> > > > +	cursor->bulk = NULL;
> > > > +	list_del_init(&cursor->bulk_link);
> > > > +}
> > > > +
> > > > +/* Move the cursor to the end of the bulk move list it's in */
> > > > +static void ttm_resource_cursor_move_bulk_tail(struct
> > > > ttm_lru_bulk_move *bulk,
> > > > +					       struct
> > > > ttm_resource_cursor *cursor)
> > > > +{
> > > > +	struct ttm_lru_bulk_move_pos *pos;
> > > > +
> > > > +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> > > > +
> > > > +	if (WARN_ON_ONCE(bulk != cursor->bulk)) {
> > > > +		list_del_init(&cursor->bulk_link);
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	pos = &bulk->pos[cursor->mem_type][cursor->priority];
> > > > +	if (pos->last)
> > > > +		list_move(&cursor->hitch.link, &pos->last-
> > > > > lru.link);
> > > > +	ttm_resource_cursor_clear_bulk(cursor);
> > > > +}
> > > > +
> > > > +/* Move all cursors attached to a bulk move to its end */
> > > > +static void ttm_bulk_move_adjust_cursors(struct
> > > > ttm_lru_bulk_move
> > > > *bulk)
> > > > +{
> > > > +	struct ttm_resource_cursor *cursor, *next;
> > > > +
> > > > +	list_for_each_entry_safe(cursor, next, &bulk-
> > > > >cursor_list,
> > > > bulk_link)
> > > > +		ttm_resource_cursor_move_bulk_tail(bulk,
> > > > cursor);
> > > > +}
> > > > +
> > > > +/* Remove a cursor from an empty bulk move list */
> > > > +static void ttm_bulk_move_drop_cursors(struct
> > > > ttm_lru_bulk_move
> > > > *bulk)
> > > > +{
> > > > +	struct ttm_resource_cursor *cursor, *next;
> > > > +
> > > > +	list_for_each_entry_safe(cursor, next, &bulk-
> > > > >cursor_list,
> > > > bulk_link)
> > > > +		ttm_resource_cursor_clear_bulk(cursor);
> > > > +}
> > > > +
> > > >    /**
> > > >     * ttm_resource_cursor_fini_locked() - Finalize the LRU list
> > > > cursor usage
> > > >     * @cursor: The struct ttm_resource_cursor to finalize.
> > > > @@ -45,6 +92,7 @@ void ttm_resource_cursor_fini_locked(struct
> > > > ttm_resource_cursor *cursor)
> > > >    {
> > > >    	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> > > >    	list_del_init(&cursor->hitch.link);
> > > > +	ttm_resource_cursor_clear_bulk(cursor);
> > > >    }
> > > >    
> > > >    /**
> > > > @@ -73,9 +121,27 @@ void ttm_resource_cursor_fini(struct
> > > > ttm_resource_cursor *cursor)
> > > >    void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk)
> > > >    {
> > > >    	memset(bulk, 0, sizeof(*bulk));
> > > > +	INIT_LIST_HEAD(&bulk->cursor_list);
> > > >    }
> > > >    EXPORT_SYMBOL(ttm_lru_bulk_move_init);
> > > >    
> > > > +/**
> > > > + * ttm_lru_bulk_move_fini - finalize a bulk move structure
> > > > + * @bdev: The struct ttm_device
> > > > + * @bulk: the structure to finalize
> > > > + *
> > > > + * Sanity checks that bulk moves don't have any
> > > > + * resources left and hence no cursors attached.
> > > > + */
> > > > +void ttm_lru_bulk_move_fini(struct ttm_device *bdev,
> > > > +			    struct ttm_lru_bulk_move *bulk)
> > > > +{
> > > > +	spin_lock(&bdev->lru_lock);
> > > > +	ttm_bulk_move_drop_cursors(bulk);
> > > > +	spin_unlock(&bdev->lru_lock);
> > > > +}
> > > > +EXPORT_SYMBOL(ttm_lru_bulk_move_fini);
> > > > +
> > > >    /**
> > > >     * ttm_lru_bulk_move_tail - bulk move range of resources to
> > > > the
> > > > LRU tail.
> > > >     *
> > > > @@ -88,6 +154,7 @@ void ttm_lru_bulk_move_tail(struct
> > > > ttm_lru_bulk_move *bulk)
> > > >    {
> > > >    	unsigned i, j;
> > > >    
> > > > +	ttm_bulk_move_adjust_cursors(bulk);
> > > >    	for (i = 0; i < TTM_NUM_MEM_TYPES; ++i) {
> > > >    		for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) {
> > > >    			struct ttm_lru_bulk_move_pos *pos =
> > > > &bulk-
> > > > > pos[i][j];
> > > > @@ -515,6 +582,28 @@ void ttm_resource_manager_debug(struct
> > > > ttm_resource_manager *man,
> > > >    }
> > > >    EXPORT_SYMBOL(ttm_resource_manager_debug);
> > > >    
> > > > +static void
> > > > +ttm_resource_cursor_check_bulk(struct ttm_resource_cursor
> > > > *cursor,
> > > > +			       struct ttm_lru_item *next_lru)
> > > > +{
> > > > +	struct ttm_resource *next =
> > > > ttm_lru_item_to_res(next_lru);
> > > > +	struct ttm_lru_bulk_move *bulk = NULL;
> > > > +	struct ttm_buffer_object *bo = next->bo;
> > > > +
> > > > +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
> > > > +	bulk = bo->bulk_move;
> > > > +
> > > > +	if (cursor->bulk != bulk) {
> > > > +		if (bulk) {
> > > > +			list_move_tail(&cursor->bulk_link,
> > > > &bulk-
> > > > > cursor_list);
> > > > +			cursor->mem_type = next->mem_type;
> > > > +		} else {
> > > > +			list_del_init(&cursor->bulk_link);
> > > > +		}
> > > > +		cursor->bulk = bulk;
> > > > +	}
> > > > +}
> > > > +
> > > >    /**
> > > >     * ttm_resource_manager_first() - Start iterating over the
> > > > resources
> > > >     * of a resource manager
> > > > @@ -535,6 +624,7 @@ ttm_resource_manager_first(struct
> > > > ttm_resource_manager *man,
> > > >    	cursor->priority = 0;
> > > >    	cursor->man = man;
> > > >    	ttm_lru_item_init(&cursor->hitch, TTM_LRU_HITCH);
> > > > +	INIT_LIST_HEAD(&cursor->bulk_link);
> > > >    	list_add(&cursor->hitch.link, &man->lru[cursor-
> > > > > priority]);
> > > >    
> > > >    	return ttm_resource_manager_next(cursor);
> > > > @@ -559,6 +649,7 @@ ttm_resource_manager_next(struct
> > > > ttm_resource_cursor *cursor)
> > > >    		lru = &cursor->hitch;
> > > >    		list_for_each_entry_continue(lru, &man-
> > > > > lru[cursor->priority], link) {
> > > >    			if (ttm_lru_item_is_res(lru)) {
> > > > +				ttm_resource_cursor_check_bulk
> > > > (cur
> > > > sor, lru);
> > > >    				list_move(&cursor->hitch.link,
> > > > &lru->link);
> > > >    				return
> > > > ttm_lru_item_to_res(lru);
> > > >    			}
> > > > @@ -568,6 +659,7 @@ ttm_resource_manager_next(struct
> > > > ttm_resource_cursor *cursor)
> > > >    			break;
> > > >    
> > > >    		list_move(&cursor->hitch.link, &man-
> > > > >lru[cursor-
> > > > > priority]);
> > > > +		ttm_resource_cursor_clear_bulk(cursor);
> > > >    	}
> > > >    
> > > >    	ttm_resource_cursor_fini_locked(cursor);
> > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c
> > > > b/drivers/gpu/drm/xe/xe_vm.c
> > > > index 5b166fa03684..0c7e327bc9a2 100644
> > > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > > @@ -1335,6 +1335,8 @@ struct xe_vm *xe_vm_create(struct
> > > > xe_device
> > > > *xe, u32 flags)
> > > >    
> > > >    	INIT_WORK(&vm->destroy_work, vm_destroy_work_func);
> > > >    
> > > > +	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
> > > > +
> > > >    	INIT_LIST_HEAD(&vm->preempt.exec_queues);
> > > >    	vm->preempt.min_run_period_ms = 10;	/* FIXME: Wire
> > > > up
> > > > to uAPI */
> > > >    
> > > > @@ -1458,6 +1460,7 @@ struct xe_vm *xe_vm_create(struct
> > > > xe_device
> > > > *xe, u32 flags)
> > > >    	mutex_destroy(&vm->snap_mutex);
> > > >    	for_each_tile(tile, xe, id)
> > > >    		xe_range_fence_tree_fini(&vm->rftree[id]);
> > > > +	ttm_lru_bulk_move_fini(&xe->ttm, &vm->lru_bulk_move);
> > > >    	kfree(vm);
> > > >    	if (flags & XE_VM_FLAG_LR_MODE)
> > > >    		xe_pm_runtime_put(xe);
> > > > @@ -1601,6 +1604,7 @@ static void vm_destroy_work_func(struct
> > > > work_struct *w)
> > > >    		XE_WARN_ON(vm->pt_root[id]);
> > > >    
> > > >    	trace_xe_vm_free(vm);
> > > > +	ttm_lru_bulk_move_fini(&xe->ttm, &vm->lru_bulk_move);
> > > >    	kfree(vm);
> > > >    }
> > > >    
> > > > diff --git a/include/drm/ttm/ttm_resource.h
> > > > b/include/drm/ttm/ttm_resource.h
> > > > index 8fac781f641e..571abb4861a6 100644
> > > > --- a/include/drm/ttm/ttm_resource.h
> > > > +++ b/include/drm/ttm/ttm_resource.h
> > > > @@ -269,26 +269,6 @@ ttm_lru_item_to_res(struct ttm_lru_item
> > > > *item)
> > > >    	return container_of(item, struct ttm_resource, lru);
> > > >    }
> > > >    
> > > > -/**
> > > > - * struct ttm_resource_cursor
> > > > - *
> > > > - * @man: The resource manager currently being iterated over.
> > > > - * @hitch: A hitch list node inserted before the next resource
> > > > - * to iterate over.
> > > > - * @priority: the current priority
> > > > - *
> > > > - * Cursor to iterate over the resources in a manager.
> > > > - */
> > > > -struct ttm_resource_cursor {
> > > > -	struct ttm_resource_manager *man;
> > > > -	struct ttm_lru_item hitch;
> > > > -	unsigned int priority;
> > > > -};
> > > > -
> > > > -void ttm_resource_cursor_fini_locked(struct
> > > > ttm_resource_cursor
> > > > *cursor);
> > > > -
> > > > -void ttm_resource_cursor_fini(struct ttm_resource_cursor
> > > > *cursor);
> > > > -
> > > >    /**
> > > >     * struct ttm_lru_bulk_move_pos
> > > >     *
> > > > @@ -304,8 +284,9 @@ struct ttm_lru_bulk_move_pos {
> > > >    
> > > >    /**
> > > >     * struct ttm_lru_bulk_move
> > > > - *
> > > >     * @pos: first/last lru entry for resources in the each
> > > > domain/priority
> > > > + * @cursor_list: The list of cursors currently traversing any
> > > > of
> > > > + * the sublists of @pos. Protected by the ttm device's
> > > > lru_lock.
> > > >     *
> > > >     * Container for the current bulk move state. Should be used
> > > > with
> > > >     * ttm_lru_bulk_move_init() and ttm_bo_set_bulk_move().
> > > > @@ -315,8 +296,39 @@ struct ttm_lru_bulk_move_pos {
> > > >     */
> > > >    struct ttm_lru_bulk_move {
> > > >    	struct ttm_lru_bulk_move_pos
> > > > pos[TTM_NUM_MEM_TYPES][TTM_MAX_BO_PRIORITY];
> > > > +	struct list_head cursor_list;
> > > >    };
> > > >    
> > > > +/**
> > > > + * struct ttm_resource_cursor
> > > > + * @man: The resource manager currently being iterated over
> > > > + * @hitch: A hitch list node inserted before the next resource
> > > > + * to iterate over.
> > > > + * @bulk_link: A list link for the list of cursors traversing
> > > > the
> > > > + * bulk sublist of @bulk. Protected by the ttm device's
> > > > lru_lock.
> > > > + * @bulk: Pointer to struct ttm_lru_bulk_move whose subrange
> > > > @hitch is
> > > > + * inserted to. NULL if none. Never dereference this pointer
> > > > since
> > > > + * the struct ttm_lru_bulk_move object pointed to might have
> > > > been
> > > > + * freed. The pointer is only for comparison.
> > > > + * @mem_type: The memory type of the LRU list being traversed.
> > > > + * This field is valid iff @bulk != NULL.
> > > > + * @priority: the current priority
> > > > + *
> > > > + * Cursor to iterate over the resources in a manager.
> > > > + */
> > > > +struct ttm_resource_cursor {
> > > > +	struct ttm_resource_manager *man;
> > > > +	struct ttm_lru_item hitch;
> > > > +	struct list_head bulk_link;
> > > > +	struct ttm_lru_bulk_move *bulk;
> > > > +	unsigned int mem_type;
> > > > +	unsigned int priority;
> > > > +};
> > > > +
> > > > +void ttm_resource_cursor_fini_locked(struct
> > > > ttm_resource_cursor
> > > > *cursor);
> > > > +
> > > > +void ttm_resource_cursor_fini(struct ttm_resource_cursor
> > > > *cursor);
> > > > +
> > > >    /**
> > > >     * struct ttm_kmap_iter_iomap - Specialization for a struct
> > > > io_mapping +
> > > >     * struct sg_table backed struct ttm_resource.
> > > > @@ -405,6 +417,8 @@ ttm_resource_manager_cleanup(struct
> > > > ttm_resource_manager *man)
> > > >    
> > > >    void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk);
> > > >    void ttm_lru_bulk_move_tail(struct ttm_lru_bulk_move *bulk);
> > > > +void ttm_lru_bulk_move_fini(struct ttm_device *bdev,
> > > > +			    struct ttm_lru_bulk_move *bulk);
> > > >    
> > > >    void ttm_resource_add_bulk_move(struct ttm_resource *res,
> > > >    				struct ttm_buffer_object *bo);
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 04/12] drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist moves
  2024-07-04 13:53         ` Thomas Hellström
@ 2024-07-04 14:32           ` Christian König
  0 siblings, 0 replies; 38+ messages in thread
From: Christian König @ 2024-07-04 14:32 UTC (permalink / raw)
  To: Thomas Hellström, intel-xe
  Cc: Somalapuram Amaranath, Matthew Brost, dri-devel



On 04.07.24 at 15:53, Thomas Hellström wrote:
> On Thu, 2024-07-04 at 15:13 +0200, Christian König wrote:
>> Hey Thomas,
>>
>> On 04.07.24 at 14:41, Thomas Hellström wrote:
>>> Hi, Christian,
>>>
>>> On Thu, 2024-07-04 at 11:21 +0200, Christian König wrote:
>>>> On 03.07.24 at 17:38, Thomas Hellström wrote:
>>>>> To address the problem with hitches moving when bulk move
>>>>> sublists are lru-bumped, register the list cursors with the
>>>>> ttm_lru_bulk_move structure when traversing its list, and
>>>>> when lru-bumping the list, move the cursor hitch to the tail.
>>>>> This also means it's mandatory for drivers to call
>>>>> ttm_lru_bulk_move_init() and ttm_lru_bulk_move_fini() when
>>>>> initializing and finalizing the bulk move structure, so add
>>>>> those calls to the amdgpu and xe drivers.
>>>>>
>>>>> Compared to v1 this is slightly more code but less fragile
>>>>> and hopefully easier to understand.
>>>> This is the only patch in the series which I see as critical.
>>>>
>>>> I think the final goal when using drm_exec in TTMs eviction path
>>>> is
>>>> to
>>>> keep all evicted (or evicting) BOs locked until we have enough
>>>> space.
>>>>
>>>> This means that for bulk move sections on the LRU we would lock
>>>> the
>>>> first BO and would only drop that lock again if we have gone over
>>>> the
>>>> full bulk move section and know that all BOs are not valuable for
>>>> eviction.
>>>>
>>>> Because of this the issue of having to consider hitches move with
>>>> a
>>>> bulk
>>>> move section on the LRU doesn't even occur because for that a
>>>> concurrent
>>>> process would need to grab the common lock of the BOs in the bulk
>>>> move
>>>> section.
>>> While I agree that this is something we should strive towards,
>>> following the previous discussion I already reworked this patch
>>> completely to remove the dual hitches and make it less fragile.
>> Yeah seen that and it indeed makes it much easier to understand
>> what's
>> going on.
>>
>>> After that you mentioned you were ok with the high level approach
>>> for
>>> these first four patches here:
>>>
>>> https://lists.freedesktop.org/archives/dri-devel/2024-April/450288.html
>>>
>>> So is that not any longer the case?
>> I'm ok with having it as an intermediate step, but for that it's a
>> bit much of a hammer.
>>
>> On the other hand having clean ttm_lru_bulk_move_init() and
>> ttm_lru_bulk_move_fini() calls is probably something we should keep
>> around anyway.
>>
>>> To recap, the concerns I'm seeing with the "kept common lock"
>>> approach
>>> are
>>>
>>> a) Since, when we release the LRU lock and the common bulk bo lock is
>>> not yet taken, an LRU bump may happen and the hitch will go with it.
>>> So to avoid that we need to place the hitch *before* the considered
>>> resource in the LRU list rather than *after*. Now on the next
>>> iteration we need to come up with some way to choose what's really
>>> the next resource. If the next resource pointer is the same one we
>>> already considered, should we assume it might have been freed and
>>> re-alloced with the same virtual address?
>> My idea is for the general flow is this:
>>
>> 1. Grab the lru lock
>> 2. Grab a reference to the BO after the hitch, possibly trylock the
>> BO or just continue with a prelocked one
>> 3. If locking wasn't successful
>>       4. Drop the lru lock
>>       5. Block on the BO lock
>>       6. Check that this resource/BO is still the one the cursor
>> points to, if not drop the lock and restart from #1
>>       7. Grab the lru lock
>> 8. Advance the cursor.
>> 9. Drop the lru lock.
>> 10. Try to evict or swap the BO
>> 11. Repeat if still not able to allocate memory.
>>
>> The BO could be prelocked if it's part of the currently processed
>> bulk or previously contended and prelocked by drm_exec.
>>
>> And instead of checking if the resource is in the right domain we
>> check if the resource/BO is still the one the cursor points to.
>>
>> This way we don't care if the resource was reallocated and by
>> coincidence ended up right after the cursor hitch again. As long as
>> we still point to the BO we just locked, everything is fine.
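
To spell the above out in pseudo-C (all helper names below are made up
purely for illustration, this is not meant as the final implementation):

restart:
	spin_lock(&bdev->lru_lock);			/* 1 */
	res = cursor_peek_next(&cursor);		/* resource after the hitch */
	bo = res ? bo_get_unless_zero(res->bo) : NULL;	/* 2 */
	if (bo && !bo_is_prelocked(bo) &&
	    !dma_resv_trylock(bo->base.resv)) {		/* 3 */
		spin_unlock(&bdev->lru_lock);		/* 4 */
		dma_resv_lock(bo->base.resv, NULL);	/* 5 */
		if (bo->resource != res) {		/* 6: still the same resource/BO? */
			dma_resv_unlock(bo->base.resv);
			ttm_bo_put(bo);
			goto restart;
		}
		spin_lock(&bdev->lru_lock);		/* 7 */
	}
	cursor_advance(&cursor);			/* 8 */
	spin_unlock(&bdev->lru_lock);			/* 9 */
	if (bo)
		evict_or_swap(bo);			/* 10 */
	/* 11: the caller repeats this while the allocation still fails */
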
>>
>>> b) It will be up to the user of the lru traversal to actually
>>> guarantee that locks are held across a bulk part, to make the
>>> resource traversal reasonably self-contained. In this case that user
>>> is the LRU walker, because that's where the bo locking happens.
>>> This means that any other code that aims to walk the LRUs for
>>> various
>>> reasons, and doesn't provide any held lock guarantees, may be
>>> subject
>>> to unexpected results if someone bumped the LRU.
>>> So we would basically tailor the resource iteration here for a
>>> single
>>> use-case and not make it robust for various use-cases.
>> Yeah, that's also going in a direction I was questioning. Do we have
>> use cases for the resource iterator where we don't lock the BO?
>> If not, why don't we integrate all this into the first_resource() and
>> next_resource() functions instead? Obviously with some helpers in the
>> BO code.
> That'd be if we moved this out to a drm-level layer like the work Oak
> started for cross-component eviction targeting SVM.
>
> I guess it's also my desire for keeping components separated as much as
> possible, but I'm aware others may feel differently about that.
>
>>> So my suggestion is we keep this until we've come up with a bullet-
>>> proof way to sort out a) and b) above and then we can rip it out.
>> Yeah if we can't make progress otherwise that works for me as well.
> Then I'd say let's go for this and revisit.

Ok in that case feel free to add my Acked-by to this patch.

>
> So what are the ARs here?
> Making sure we have a clean init and fini is something I've thought of
> as well.
>
> Related to that, what's your opinion on using DEFINE_CLASS() and
> scoped_guard() in TTM for automatic cleanup of the iterator when
> leaving the loop scope?

Good question, I was wondering about that as well.

On the one hand it makes error handling easier; on the other hand I have
worked with scope-based error handling in both C and C++ before, and
there are always downsides as well.

Not sure if that's really a good idea or not.
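
If we do end up going that way, a cursor class would probably want the
plain __cleanup() attribute rather than DEFINE_CLASS(), since the cursor
is embedded by value and its list nodes mustn't be copied around.
Untested sketch only:

	static inline void
	ttm_resource_cursor_cleanup(struct ttm_resource_cursor *cursor)
	{
		ttm_resource_cursor_fini(cursor);
	}

	/* ... in a user: */
	{
		struct ttm_resource_cursor cursor
			__cleanup(ttm_resource_cursor_cleanup);
		struct ttm_resource *res;

		spin_lock(&bdev->lru_lock);
		ttm_resource_manager_for_each_res(man, &cursor, res) {
			/* may break out early without leaking the hitch */
		}
		spin_unlock(&bdev->lru_lock);
	}	/* cursor finalized automatically when leaving the scope */
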

Regards,
Christian.


>
> https://elixir.bootlin.com/linux/v6.10-rc6/source/include/linux/cleanup.h#L168
>
> Thanks,
> Thomas
>
>> Regards,
>> Christian.
>>
>>> /Thomas
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>
>>>>> Changes in previous series:
>>>>> - Completely rework the functionality
>>>>> - Avoid a NULL pointer dereference assigning manager->mem_type
>>>>> - Remove some leftover code causing build problems
>>>>> v2:
>>>>> - For hitch bulk tail moves, store the mem_type in the cursor
>>>>>      instead of with the manager.
>>>>> v3:
>>>>> - Remove leftover mem_type member from change in v2.
>>>>> v6:
>>>>> - Add some lockdep asserts (Matthew Brost)
>>>>> - Avoid NULL pointer dereference (Matthew Brost)
>>>>> - No need to check bo->resource before dereferencing
>>>>>      bo->bulk_move (Matthew Brost)
>>>>>
>>>>> Cc: Christian König <christian.koenig@amd.com>
>>>>> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>> Cc: <dri-devel@lists.freedesktop.org>
>>>>> Signed-off-by: Thomas Hellström
>>>>> <thomas.hellstrom@linux.intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c |  4 ++
>>>>>     drivers/gpu/drm/ttm/ttm_resource.c     | 92
>>>>> ++++++++++++++++++++++++++
>>>>>     drivers/gpu/drm/xe/xe_vm.c             |  4 ++
>>>>>     include/drm/ttm/ttm_resource.h         | 56 ++++++++++------
>>>>>     4 files changed, 135 insertions(+), 21 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> index 3abfa66d72a2..97743993d711 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> @@ -2420,6 +2420,8 @@ int amdgpu_vm_init(struct amdgpu_device
>>>>> *adev, struct amdgpu_vm *vm,
>>>>>     	if (r)
>>>>>     		return r;
>>>>>     
>>>>> +	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
>>>>> +
>>>>>     	vm->is_compute_context = false;
>>>>>     
>>>>>     	vm->use_cpu_for_update = !!(adev-
>>>>>> vm_manager.vm_update_mode &
>>>>> @@ -2484,6 +2486,7 @@ int amdgpu_vm_init(struct amdgpu_device
>>>>> *adev, struct amdgpu_vm *vm,
>>>>>     error_free_delayed:
>>>>>     	dma_fence_put(vm->last_tlb_flush);
>>>>>     	dma_fence_put(vm->last_unlocked);
>>>>> +	ttm_lru_bulk_move_fini(&adev->mman.bdev, &vm-
>>>>>> lru_bulk_move);
>>>>>     	amdgpu_vm_fini_entities(vm);
>>>>>     
>>>>>     	return r;
>>>>> @@ -2640,6 +2643,7 @@ void amdgpu_vm_fini(struct amdgpu_device
>>>>> *adev, struct amdgpu_vm *vm)
>>>>>     		}
>>>>>     	}
>>>>>     
>>>>> +	ttm_lru_bulk_move_fini(&adev->mman.bdev, &vm-
>>>>>> lru_bulk_move);
>>>>>     }
>>>>>     
>>>>>     /**
>>>>> diff --git a/drivers/gpu/drm/ttm/ttm_resource.c
>>>>> b/drivers/gpu/drm/ttm/ttm_resource.c
>>>>> index 9c8b6499edfb..b6a2daac5518 100644
>>>>> --- a/drivers/gpu/drm/ttm/ttm_resource.c
>>>>> +++ b/drivers/gpu/drm/ttm/ttm_resource.c
>>>>> @@ -33,6 +33,53 @@
>>>>>     
>>>>>     #include <drm/drm_util.h>
>>>>>     
>>>>> +/* Detach the cursor from the bulk move list*/
>>>>> +static void
>>>>> +ttm_resource_cursor_clear_bulk(struct ttm_resource_cursor
>>>>> *cursor)
>>>>> +{
>>>>> +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
>>>>> +
>>>>> +	cursor->bulk = NULL;
>>>>> +	list_del_init(&cursor->bulk_link);
>>>>> +}
>>>>> +
>>>>> +/* Move the cursor to the end of the bulk move list it's in */
>>>>> +static void ttm_resource_cursor_move_bulk_tail(struct
>>>>> ttm_lru_bulk_move *bulk,
>>>>> +					       struct
>>>>> ttm_resource_cursor *cursor)
>>>>> +{
>>>>> +	struct ttm_lru_bulk_move_pos *pos;
>>>>> +
>>>>> +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
>>>>> +
>>>>> +	if (WARN_ON_ONCE(bulk != cursor->bulk)) {
>>>>> +		list_del_init(&cursor->bulk_link);
>>>>> +		return;
>>>>> +	}
>>>>> +
>>>>> +	pos = &bulk->pos[cursor->mem_type][cursor->priority];
>>>>> +	if (pos->last)
>>>>> +		list_move(&cursor->hitch.link, &pos->last-
>>>>>> lru.link);
>>>>> +	ttm_resource_cursor_clear_bulk(cursor);
>>>>> +}
>>>>> +
>>>>> +/* Move all cursors attached to a bulk move to its end */
>>>>> +static void ttm_bulk_move_adjust_cursors(struct
>>>>> ttm_lru_bulk_move
>>>>> *bulk)
>>>>> +{
>>>>> +	struct ttm_resource_cursor *cursor, *next;
>>>>> +
>>>>> +	list_for_each_entry_safe(cursor, next, &bulk-
>>>>>> cursor_list,
>>>>> bulk_link)
>>>>> +		ttm_resource_cursor_move_bulk_tail(bulk,
>>>>> cursor);
>>>>> +}
>>>>> +
>>>>> +/* Remove a cursor from an empty bulk move list */
>>>>> +static void ttm_bulk_move_drop_cursors(struct
>>>>> ttm_lru_bulk_move
>>>>> *bulk)
>>>>> +{
>>>>> +	struct ttm_resource_cursor *cursor, *next;
>>>>> +
>>>>> +	list_for_each_entry_safe(cursor, next, &bulk-
>>>>>> cursor_list,
>>>>> bulk_link)
>>>>> +		ttm_resource_cursor_clear_bulk(cursor);
>>>>> +}
>>>>> +
>>>>>     /**
>>>>>      * ttm_resource_cursor_fini_locked() - Finalize the LRU list
>>>>> cursor usage
>>>>>      * @cursor: The struct ttm_resource_cursor to finalize.
>>>>> @@ -45,6 +92,7 @@ void ttm_resource_cursor_fini_locked(struct
>>>>> ttm_resource_cursor *cursor)
>>>>>     {
>>>>>     	lockdep_assert_held(&cursor->man->bdev->lru_lock);
>>>>>     	list_del_init(&cursor->hitch.link);
>>>>> +	ttm_resource_cursor_clear_bulk(cursor);
>>>>>     }
>>>>>     
>>>>>     /**
>>>>> @@ -73,9 +121,27 @@ void ttm_resource_cursor_fini(struct
>>>>> ttm_resource_cursor *cursor)
>>>>>     void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk)
>>>>>     {
>>>>>     	memset(bulk, 0, sizeof(*bulk));
>>>>> +	INIT_LIST_HEAD(&bulk->cursor_list);
>>>>>     }
>>>>>     EXPORT_SYMBOL(ttm_lru_bulk_move_init);
>>>>>     
>>>>> +/**
>>>>> + * ttm_lru_bulk_move_fini - finalize a bulk move structure
>>>>> + * @bdev: The struct ttm_device
>>>>> + * @bulk: the structure to finalize
>>>>> + *
>>>>> + * Sanity checks that bulk moves don't have any
>>>>> + * resources left and hence no cursors attached.
>>>>> + */
>>>>> +void ttm_lru_bulk_move_fini(struct ttm_device *bdev,
>>>>> +			    struct ttm_lru_bulk_move *bulk)
>>>>> +{
>>>>> +	spin_lock(&bdev->lru_lock);
>>>>> +	ttm_bulk_move_drop_cursors(bulk);
>>>>> +	spin_unlock(&bdev->lru_lock);
>>>>> +}
>>>>> +EXPORT_SYMBOL(ttm_lru_bulk_move_fini);
>>>>> +
>>>>>     /**
>>>>>      * ttm_lru_bulk_move_tail - bulk move range of resources to
>>>>> the
>>>>> LRU tail.
>>>>>      *
>>>>> @@ -88,6 +154,7 @@ void ttm_lru_bulk_move_tail(struct
>>>>> ttm_lru_bulk_move *bulk)
>>>>>     {
>>>>>     	unsigned i, j;
>>>>>     
>>>>> +	ttm_bulk_move_adjust_cursors(bulk);
>>>>>     	for (i = 0; i < TTM_NUM_MEM_TYPES; ++i) {
>>>>>     		for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) {
>>>>>     			struct ttm_lru_bulk_move_pos *pos =
>>>>> &bulk-
>>>>>> pos[i][j];
>>>>> @@ -515,6 +582,28 @@ void ttm_resource_manager_debug(struct
>>>>> ttm_resource_manager *man,
>>>>>     }
>>>>>     EXPORT_SYMBOL(ttm_resource_manager_debug);
>>>>>     
>>>>> +static void
>>>>> +ttm_resource_cursor_check_bulk(struct ttm_resource_cursor
>>>>> *cursor,
>>>>> +			       struct ttm_lru_item *next_lru)
>>>>> +{
>>>>> +	struct ttm_resource *next =
>>>>> ttm_lru_item_to_res(next_lru);
>>>>> +	struct ttm_lru_bulk_move *bulk = NULL;
>>>>> +	struct ttm_buffer_object *bo = next->bo;
>>>>> +
>>>>> +	lockdep_assert_held(&cursor->man->bdev->lru_lock);
>>>>> +	bulk = bo->bulk_move;
>>>>> +
>>>>> +	if (cursor->bulk != bulk) {
>>>>> +		if (bulk) {
>>>>> +			list_move_tail(&cursor->bulk_link,
>>>>> &bulk-
>>>>>> cursor_list);
>>>>> +			cursor->mem_type = next->mem_type;
>>>>> +		} else {
>>>>> +			list_del_init(&cursor->bulk_link);
>>>>> +		}
>>>>> +		cursor->bulk = bulk;
>>>>> +	}
>>>>> +}
>>>>> +
>>>>>     /**
>>>>>      * ttm_resource_manager_first() - Start iterating over the
>>>>> resources
>>>>>      * of a resource manager
>>>>> @@ -535,6 +624,7 @@ ttm_resource_manager_first(struct
>>>>> ttm_resource_manager *man,
>>>>>     	cursor->priority = 0;
>>>>>     	cursor->man = man;
>>>>>     	ttm_lru_item_init(&cursor->hitch, TTM_LRU_HITCH);
>>>>> +	INIT_LIST_HEAD(&cursor->bulk_link);
>>>>>     	list_add(&cursor->hitch.link, &man->lru[cursor-
>>>>>> priority]);
>>>>>     
>>>>>     	return ttm_resource_manager_next(cursor);
>>>>> @@ -559,6 +649,7 @@ ttm_resource_manager_next(struct
>>>>> ttm_resource_cursor *cursor)
>>>>>     		lru = &cursor->hitch;
>>>>>     		list_for_each_entry_continue(lru, &man-
>>>>>> lru[cursor->priority], link) {
>>>>>     			if (ttm_lru_item_is_res(lru)) {
>>>>> +				ttm_resource_cursor_check_bulk
>>>>> (cur
>>>>> sor, lru);
>>>>>     				list_move(&cursor->hitch.link,
>>>>> &lru->link);
>>>>>     				return
>>>>> ttm_lru_item_to_res(lru);
>>>>>     			}
>>>>> @@ -568,6 +659,7 @@ ttm_resource_manager_next(struct
>>>>> ttm_resource_cursor *cursor)
>>>>>     			break;
>>>>>     
>>>>>     		list_move(&cursor->hitch.link, &man-
>>>>>> lru[cursor-
>>>>>> priority]);
>>>>> +		ttm_resource_cursor_clear_bulk(cursor);
>>>>>     	}
>>>>>     
>>>>>     	ttm_resource_cursor_fini_locked(cursor);
>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
>>>>> b/drivers/gpu/drm/xe/xe_vm.c
>>>>> index 5b166fa03684..0c7e327bc9a2 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>>> @@ -1335,6 +1335,8 @@ struct xe_vm *xe_vm_create(struct
>>>>> xe_device
>>>>> *xe, u32 flags)
>>>>>     
>>>>>     	INIT_WORK(&vm->destroy_work, vm_destroy_work_func);
>>>>>     
>>>>> +	ttm_lru_bulk_move_init(&vm->lru_bulk_move);
>>>>> +
>>>>>     	INIT_LIST_HEAD(&vm->preempt.exec_queues);
>>>>>     	vm->preempt.min_run_period_ms = 10;	/* FIXME: Wire
>>>>> up
>>>>> to uAPI */
>>>>>     
>>>>> @@ -1458,6 +1460,7 @@ struct xe_vm *xe_vm_create(struct
>>>>> xe_device
>>>>> *xe, u32 flags)
>>>>>     	mutex_destroy(&vm->snap_mutex);
>>>>>     	for_each_tile(tile, xe, id)
>>>>>     		xe_range_fence_tree_fini(&vm->rftree[id]);
>>>>> +	ttm_lru_bulk_move_fini(&xe->ttm, &vm->lru_bulk_move);
>>>>>     	kfree(vm);
>>>>>     	if (flags & XE_VM_FLAG_LR_MODE)
>>>>>     		xe_pm_runtime_put(xe);
>>>>> @@ -1601,6 +1604,7 @@ static void vm_destroy_work_func(struct
>>>>> work_struct *w)
>>>>>     		XE_WARN_ON(vm->pt_root[id]);
>>>>>     
>>>>>     	trace_xe_vm_free(vm);
>>>>> +	ttm_lru_bulk_move_fini(&xe->ttm, &vm->lru_bulk_move);
>>>>>     	kfree(vm);
>>>>>     }
>>>>>     
>>>>> diff --git a/include/drm/ttm/ttm_resource.h
>>>>> b/include/drm/ttm/ttm_resource.h
>>>>> index 8fac781f641e..571abb4861a6 100644
>>>>> --- a/include/drm/ttm/ttm_resource.h
>>>>> +++ b/include/drm/ttm/ttm_resource.h
>>>>> @@ -269,26 +269,6 @@ ttm_lru_item_to_res(struct ttm_lru_item
>>>>> *item)
>>>>>     	return container_of(item, struct ttm_resource, lru);
>>>>>     }
>>>>>     
>>>>> -/**
>>>>> - * struct ttm_resource_cursor
>>>>> - *
>>>>> - * @man: The resource manager currently being iterated over.
>>>>> - * @hitch: A hitch list node inserted before the next resource
>>>>> - * to iterate over.
>>>>> - * @priority: the current priority
>>>>> - *
>>>>> - * Cursor to iterate over the resources in a manager.
>>>>> - */
>>>>> -struct ttm_resource_cursor {
>>>>> -	struct ttm_resource_manager *man;
>>>>> -	struct ttm_lru_item hitch;
>>>>> -	unsigned int priority;
>>>>> -};
>>>>> -
>>>>> -void ttm_resource_cursor_fini_locked(struct
>>>>> ttm_resource_cursor
>>>>> *cursor);
>>>>> -
>>>>> -void ttm_resource_cursor_fini(struct ttm_resource_cursor
>>>>> *cursor);
>>>>> -
>>>>>     /**
>>>>>      * struct ttm_lru_bulk_move_pos
>>>>>      *
>>>>> @@ -304,8 +284,9 @@ struct ttm_lru_bulk_move_pos {
>>>>>     
>>>>>     /**
>>>>>      * struct ttm_lru_bulk_move
>>>>> - *
>>>>>      * @pos: first/last lru entry for resources in the each
>>>>> domain/priority
>>>>> + * @cursor_list: The list of cursors currently traversing any
>>>>> of
>>>>> + * the sublists of @pos. Protected by the ttm device's
>>>>> lru_lock.
>>>>>      *
>>>>>      * Container for the current bulk move state. Should be used
>>>>> with
>>>>>      * ttm_lru_bulk_move_init() and ttm_bo_set_bulk_move().
>>>>> @@ -315,8 +296,39 @@ struct ttm_lru_bulk_move_pos {
>>>>>      */
>>>>>     struct ttm_lru_bulk_move {
>>>>>     	struct ttm_lru_bulk_move_pos
>>>>> pos[TTM_NUM_MEM_TYPES][TTM_MAX_BO_PRIORITY];
>>>>> +	struct list_head cursor_list;
>>>>>     };
>>>>>     
>>>>> +/**
>>>>> + * struct ttm_resource_cursor
>>>>> + * @man: The resource manager currently being iterated over
>>>>> + * @hitch: A hitch list node inserted before the next resource
>>>>> + * to iterate over.
>>>>> + * @bulk_link: A list link for the list of cursors traversing
>>>>> the
>>>>> + * bulk sublist of @bulk. Protected by the ttm device's
>>>>> lru_lock.
>>>>> + * @bulk: Pointer to struct ttm_lru_bulk_move whose subrange
>>>>> @hitch is
>>>>> + * inserted to. NULL if none. Never dereference this pointer
>>>>> since
>>>>> + * the struct ttm_lru_bulk_move object pointed to might have
>>>>> been
>>>>> + * freed. The pointer is only for comparison.
>>>>> + * @mem_type: The memory type of the LRU list being traversed.
>>>>> + * This field is valid iff @bulk != NULL.
>>>>> + * @priority: the current priority
>>>>> + *
>>>>> + * Cursor to iterate over the resources in a manager.
>>>>> + */
>>>>> +struct ttm_resource_cursor {
>>>>> +	struct ttm_resource_manager *man;
>>>>> +	struct ttm_lru_item hitch;
>>>>> +	struct list_head bulk_link;
>>>>> +	struct ttm_lru_bulk_move *bulk;
>>>>> +	unsigned int mem_type;
>>>>> +	unsigned int priority;
>>>>> +};
>>>>> +
>>>>> +void ttm_resource_cursor_fini_locked(struct
>>>>> ttm_resource_cursor
>>>>> *cursor);
>>>>> +
>>>>> +void ttm_resource_cursor_fini(struct ttm_resource_cursor
>>>>> *cursor);
>>>>> +
>>>>>     /**
>>>>>      * struct ttm_kmap_iter_iomap - Specialization for a struct
>>>>> io_mapping +
>>>>>      * struct sg_table backed struct ttm_resource.
>>>>> @@ -405,6 +417,8 @@ ttm_resource_manager_cleanup(struct
>>>>> ttm_resource_manager *man)
>>>>>     
>>>>>     void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk);
>>>>>     void ttm_lru_bulk_move_tail(struct ttm_lru_bulk_move *bulk);
>>>>> +void ttm_lru_bulk_move_fini(struct ttm_device *bdev,
>>>>> +			    struct ttm_lru_bulk_move *bulk);
>>>>>     
>>>>>     void ttm_resource_add_bulk_move(struct ttm_resource *res,
>>>>>     				struct ttm_buffer_object *bo);


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 12/12] drm/xe: Increase the XE_PL_TT watermark
  2024-07-03 15:38 ` [PATCH v6 12/12] drm/xe: Increase the XE_PL_TT watermark Thomas Hellström
@ 2024-08-05 18:35   ` Souza, Jose
  2024-08-07 23:13     ` Matthew Brost
  2024-08-07 23:44   ` Matthew Brost
  1 sibling, 1 reply; 38+ messages in thread
From: Souza, Jose @ 2024-08-05 18:35 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, thomas.hellstrom@linux.intel.com
  Cc: dri-devel@lists.freedesktop.org, Brost, Matthew,
	christian.koenig@amd.com, Amaranath.Somalapuram@amd.com

On Wed, 2024-07-03 at 17:38 +0200, Thomas Hellström wrote:
> The XE_PL_TT watermark was set to 50% of system memory.
> The idea behind that was unclear since the net effect is that
> TT memory will be evicted to TTM_PL_SYSTEM memory if that
> watermark is exceeded, requiring PPGTT rebinds and dma
> remapping. But there is no similar watermark for TTM_PL_SYSTEM
> memory.
> 
> The TTM functionality that tries to swap out system memory to
> shmem objects if a 50% limit of total system memory is reached
> is orthogonal to this, and with the shrinker added, it's no
> longer in effect.
> 
> Replace the 50% TTM_PL_TT limit with a 100% limit, in effect
> allowing all graphics memory to be bound to the device unless it
> has been swapped out by the shrinker.

Sorry if I missed some patch changing it, but I did not find anything in this series changing the 50% limit in ttm_global_init().
When I debugged some Vulkan tests that allocate a lot of memory, the reason the KMD was not allocating memory was this ttm_global limit, which is shared
with all devices using TTM.

> 
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  drivers/gpu/drm/xe/xe_ttm_sys_mgr.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c b/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> index 9844a8edbfe1..d38b91872da3 100644
> --- a/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> +++ b/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> @@ -108,9 +108,8 @@ int xe_ttm_sys_mgr_init(struct xe_device *xe)
>  	u64 gtt_size;
>  
>  	si_meminfo(&si);
> +	/* Potentially restrict amount of TT memory here. */
>  	gtt_size = (u64)si.totalram * si.mem_unit;
> -	/* TTM limits allocation of all TTM devices by 50% of system memory */
> -	gtt_size /= 2;
>  
>  	man->use_tt = true;
>  	man->func = &xe_ttm_sys_mgr_func;


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 12/12] drm/xe: Increase the XE_PL_TT watermark
  2024-08-05 18:35   ` Souza, Jose
@ 2024-08-07 23:13     ` Matthew Brost
  2024-08-09 12:22       ` Thomas Hellström
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Brost @ 2024-08-07 23:13 UTC (permalink / raw)
  To: Souza, Jose
  Cc: intel-xe@lists.freedesktop.org, thomas.hellstrom@linux.intel.com,
	dri-devel@lists.freedesktop.org, christian.koenig@amd.com,
	Amaranath.Somalapuram@amd.com

On Mon, Aug 05, 2024 at 12:35:34PM -0600, Souza, Jose wrote:
> On Wed, 2024-07-03 at 17:38 +0200, Thomas Hellström wrote:
> > The XE_PL_TT watermark was set to 50% of system memory.
> > The idea behind that was unclear since the net effect is that
> > TT memory will be evicted to TTM_PL_SYSTEM memory if that
> > watermark is exceeded, requiring PPGTT rebinds and dma
> > remapping. But there is no similar watermark for TTM_PL_SYSTEM
> > memory.
> > 
> > The TTM functionality that tries to swap out system memory to
> > shmem objects if a 50% limit of total system memory is reached
> > is orthogonal to this, and with the shrinker added, it's no
> > longer in effect.
> > 
> > Replace the 50% TTM_PL_TT limit with a 100% limit, in effect
> > allowing all graphics memory to be bound to the device unless it
> > has been swapped out by the shrinker.
> 
> Sorry if I missed some patch changing it, but I did not find anything in this series changing the 50% limit in ttm_global_init().
> When I debugged some Vulkan tests that allocate a lot of memory, the reason the KMD was not allocating memory was this ttm_global limit, which is shared
> with all devices using TTM.
> 

I'm reviewing this series and starting make sense of all this.

Thomas please correct me if I'm wrong here...

The limit set in ttm_global_init is the watermark for the TTM pool where,
if exceeded upon freeing a BO's pages, the pages are actually freed
rather than just returned to the TTM pool cache. The global watermark
is the reason why in issue #2438 it is observed that a bunch of memory is
still consumed when nothing is running and no BOs exist - pages are being
cached in the TTM pool. The global watermark doesn't actually limit the
amount of system memory TTM can allocate. A shrinker also exists which can
free cached pages in the TTM pool if memory pressure exists or 'echo 3 >
/proc/sys/vm/drop_caches' is done.

The watermark changed in this patch is the actual limit for the number
of pages we can allocate for BOs. With a shrinker hooked into BOs, we
can now freely allocate all of the system pages for BOs, and if memory
pressure exists, idle BOs' pages are swapped to shmem via the shrinker
and restored upon next GPU use.
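
For reference (typed from memory, so please double-check against the
actual code), the global 50% limit Jose refers to boils down to roughly
this in ttm_global_init():

	si_meminfo(&si);
	/* Limit the number of pages in the pool to ~50% of system memory */
	num_pages = ((u64)si.totalram * si.mem_unit) >> PAGE_SHIFT;
	num_pages /= 2;
	/* num_dma32 is computed similarly and capped at 2 GiB */
	ttm_pool_mgr_init(num_pages);		/* pool *cache* watermark */
	ttm_tt_mgr_init(num_pages, num_dma32);	/* old swapout heuristic */

i.e. it bounds what the pool keeps cached and the legacy swapout
heuristic, while the limit changed in this patch is the size handed to
the XE_PL_TT resource manager, which is what actually bounds how much
system memory can be allocated for BOs.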

Matt

> > 
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_ttm_sys_mgr.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c b/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> > index 9844a8edbfe1..d38b91872da3 100644
> > --- a/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> > +++ b/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> > @@ -108,9 +108,8 @@ int xe_ttm_sys_mgr_init(struct xe_device *xe)
> >  	u64 gtt_size;
> >  
> >  	si_meminfo(&si);
> > +	/* Potentially restrict amount of TT memory here. */
> >  	gtt_size = (u64)si.totalram * si.mem_unit;
> > -	/* TTM limits allocation of all TTM devices by 50% of system memory */
> > -	gtt_size /= 2;
> >  
> >  	man->use_tt = true;
> >  	man->func = &xe_ttm_sys_mgr_func;
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 09/12] drm/ttm/pool: Provide a helper to shrink pages
  2024-07-03 15:38 ` [PATCH v6 09/12] drm/ttm/pool: Provide a helper to shrink pages Thomas Hellström
@ 2024-08-07 23:38   ` Matthew Brost
  2024-08-16  9:47     ` Thomas Hellström
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Brost @ 2024-08-07 23:38 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Christian König, Somalapuram Amaranath, dri-devel

On Wed, Jul 03, 2024 at 05:38:10PM +0200, Thomas Hellström wrote:
> Provide a helper to shrink ttm_tt page-vectors on a per-page
> basis. A ttm_backup backend could then in theory get away with
> allocating a single temporary page for each struct ttm_tt.
> 
> This is accomplished by splitting larger pages before trying to
> back them up.
> 
> In the future we could allow ttm_backup to handle backing up
> large pages as well, but currently there's no benefit in
> doing that, since the shmem backup backend would have to
> split those anyway to avoid allocating too much temporary
> memory, and if the backend instead inserts pages into the
> swap-cache, those are split on reclaim by the core.
> 
> Due to potential backup- and recovery errors, allow partially swapped
> out struct ttm_tt's, although mark them as swapped out, stopping them
> from being swapped out a second time. More details in the ttm_pool.c
> DOC section.
> 
> v2:
> - A couple of cleanups and error fixes in ttm_pool_back_up_tt.
> - s/back_up/backup/
> - Add a writeback parameter to the exported interface.
> 
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  drivers/gpu/drm/ttm/ttm_pool.c | 397 +++++++++++++++++++++++++++++++--
>  drivers/gpu/drm/ttm/ttm_tt.c   |  37 +++
>  include/drm/ttm/ttm_pool.h     |   5 +
>  include/drm/ttm/ttm_tt.h       |  20 ++
>  4 files changed, 446 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index 6e1fd6985ffc..38e50cf81b0a 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -41,6 +41,7 @@
>  #include <asm/set_memory.h>
>  #endif
>  
> +#include <drm/ttm/ttm_backup.h>
>  #include <drm/ttm/ttm_pool.h>
>  #include <drm/ttm/ttm_tt.h>
>  #include <drm/ttm/ttm_bo.h>
> @@ -58,6 +59,32 @@ struct ttm_pool_dma {
>  	unsigned long vaddr;
>  };
>  
> +/**
> + * struct ttm_pool_tt_restore - State representing restore from backup
> + * @alloced_pages: Total number of already allocated pages for the ttm_tt.
> + * @restored_pages: Number of (sub) pages restored from swap for this
> + *		     chunk of 1 << @order pages.
> + * @first_page: The ttm page ptr representing for @old_pages[0].
> + * @caching_divide: Page pointer where subsequent pages are cached.
> + * @old_pages: Backup copy of page pointers that were replaced by the new
> + *	       page allocation.
> + * @pool: The pool used for page allocation while restoring.
> + * @order: The order of the last page allocated while restoring.
> + *
> + * Recovery from backup might fail when we've recovered less than the
> + * full ttm_tt. In order not to loose any data (yet), keep information
> + * around that allows us to restart a failed ttm backup recovery.
> + */
> +struct ttm_pool_tt_restore {
> +	pgoff_t alloced_pages;
> +	pgoff_t restored_pages;
> +	struct page **first_page;
> +	struct page **caching_divide;
> +	struct ttm_pool *pool;
> +	unsigned int order;
> +	struct page *old_pages[];
> +};
> +
>  static unsigned long page_pool_size;
>  
>  MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA pool");
> @@ -354,11 +381,102 @@ static unsigned int ttm_pool_page_order(struct ttm_pool *pool, struct page *p)
>  	return p->private;
>  }
>  
> +/*
> + * To be able to insert single pages into backup directly,
> + * we need to split multi-order page allocations and make them look
> + * like single-page allocations.
> + */
> +static void ttm_pool_split_for_swap(struct ttm_pool *pool, struct page *p)
> +{
> +	unsigned int order = ttm_pool_page_order(pool, p);
> +	pgoff_t nr;
> +
> +	if (!order)
> +		return;
> +
> +	split_page(p, order);
> +	nr = 1UL << order;
> +	while (nr--)
> +		(p++)->private = 0;
> +}
> +
> +/**
> + * DOC: Partial backup and restoration of a struct ttm_tt.
> + *
> + * Swapout using ttm_backup::ops::backup_page() and swapin using
> + * ttm_backup::ops::copy_backed_up_page() may fail.
> + * The former most likely due to lack of swap-space or memory, the latter due
> + * to lack of memory or because of signal interruption during waits.
> + *
> + * Backup failure is easily handled by using a ttm_tt pages vector that holds
> + * both swap entries and page pointers. This has to be taken into account when
> + * restoring such a ttm_tt from backup, and when freeing it while backed up.
> + * When restoring, for simplicity, new pages are actually allocated from the
> + * pool and the contents of any old pages are copied in and then the old pages
> + * are released.
> + *
> + * For restoration failures, the struct ttm_pool_tt_restore holds sufficient state
> + * to be able to resume an interrupted restore, and that structure is freed once
> + * the restoration is complete. If the struct ttm_tt is destroyed while there
> + * is a valid struct ttm_pool_tt_restore attached, that is also properly taken
> + * care of.
> + */
> +
> +static bool ttm_pool_restore_valid(const struct ttm_pool_tt_restore *restore)
> +{
> +	return restore && restore->restored_pages < (1 << restore->order);
> +}
> +
> +static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore,
> +			       struct ttm_backup *backup,
> +			       struct ttm_operation_ctx *ctx)
> +{
> +	unsigned int i, nr = 1 << restore->order;
> +	int ret = 0;
> +
> +	if (!ttm_pool_restore_valid(restore))
> +		return 0;
> +
> +	for (i = restore->restored_pages; i < nr; ++i) {
> +		struct page *p = restore->old_pages[i];
> +
> +		if (ttm_backup_page_ptr_is_handle(p)) {
> +			unsigned long handle = ttm_backup_page_ptr_to_handle(p);
> +
> +			if (handle == 0)
> +				continue;
> +
> +			ret = backup->ops->copy_backed_up_page
> +				(backup, restore->first_page[i],
> +				 handle, ctx->interruptible);
> +			if (ret)
> +				break;
> +
> +			backup->ops->drop(backup, handle);
> +		} else if (p) {
> +			/*
> +			 * We could probably avoid splitting the old page
> +			 * using clever logic, but ATM we don't care.
> +			 */
> +			ttm_pool_split_for_swap(restore->pool, p);
> +			copy_highpage(restore->first_page[i], p);
> +			__free_pages(p, 0);
> +		}
> +
> +		restore->restored_pages++;
> +		restore->old_pages[i] = NULL;
> +		cond_resched();
> +	}
> +
> +	return ret;
> +}
> +
>  /* Called when we got a page, either from a pool or newly allocated */
>  static int ttm_pool_page_allocated(struct ttm_pool *pool, unsigned int order,
>  				   struct page *p, dma_addr_t **dma_addr,
>  				   unsigned long *num_pages,
> -				   struct page ***pages)
> +				   struct page ***pages,
> +				   struct ttm_pool_tt_restore *restore)
>  {
>  	unsigned int i;
>  	int r;
> @@ -369,6 +487,16 @@ static int ttm_pool_page_allocated(struct ttm_pool *pool, unsigned int order,
>  			return r;
>  	}
>  
> +	if (restore) {
> +		memcpy(restore->old_pages, *pages,
> +		       (1 << order) * sizeof(*restore->old_pages));
> +		memset(*pages, 0, (1 << order) * sizeof(**pages));
> +		restore->order = order;
> +		restore->restored_pages = 0;
> +		restore->first_page = *pages;
> +		restore->alloced_pages += 1UL << order;
> +	}
> +
>  	*num_pages -= 1 << order;
>  	for (i = 1 << order; i; --i, ++(*pages), ++p)
>  		**pages = p;
> @@ -394,22 +522,39 @@ static void ttm_pool_free_range(struct ttm_pool *pool, struct ttm_tt *tt,
>  				pgoff_t start_page, pgoff_t end_page)
>  {
>  	struct page **pages = &tt->pages[start_page];
> +	struct ttm_backup *backup = tt->backup;
>  	unsigned int order;
>  	pgoff_t i, nr;
>  
>  	for (i = start_page; i < end_page; i += nr, pages += nr) {
>  		struct ttm_pool_type *pt = NULL;
> +		struct page *p = *pages;
> +
> +		if (ttm_backup_page_ptr_is_handle(p)) {
> +			unsigned long handle = ttm_backup_page_ptr_to_handle(p);
> +
> +			nr = 1;
> +			if (handle != 0)
> +				backup->ops->drop(backup, handle);
> +			continue;
> +		}
> +
> +		if (pool) {
> +			order = ttm_pool_page_order(pool, p);
> +			nr = (1UL << order);
> +			if (tt->dma_address)
> +				ttm_pool_unmap(pool, tt->dma_address[i], nr);
>  
> -		order = ttm_pool_page_order(pool, *pages);
> -		nr = (1UL << order);
> -		if (tt->dma_address)
> -			ttm_pool_unmap(pool, tt->dma_address[i], nr);
> +			pt = ttm_pool_select_type(pool, caching, order);
> +		} else {
> +			order = p->private;
> +			nr = (1UL << order);
> +		}
>  
> -		pt = ttm_pool_select_type(pool, caching, order);
>  		if (pt)
> -			ttm_pool_type_give(pt, *pages);
> +			ttm_pool_type_give(pt, p);
>  		else
> -			ttm_pool_free_page(pool, caching, order, *pages);
> +			ttm_pool_free_page(pool, caching, order, p);
>  	}
>  }
>  
> @@ -453,9 +598,37 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>  	else
>  		gfp_flags |= GFP_HIGHUSER;
>  
> -	for (order = min_t(unsigned int, MAX_PAGE_ORDER, __fls(num_pages));
> -	     num_pages;
> -	     order = min_t(unsigned int, order, __fls(num_pages))) {
> +	order = min_t(unsigned int, MAX_PAGE_ORDER, __fls(num_pages));
> +
> +	if (tt->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP) {
> +		if (!tt->restore) {
> +			gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
> +
> +			if (ctx->gfp_retry_mayfail)
> +				gfp |= __GFP_RETRY_MAYFAIL;
> +
> +			tt->restore =
> +				kvzalloc(struct_size(tt->restore, old_pages,
> +						     (size_t)1 << order), gfp);
> +			/* RFC: Possibly loop on -ENOMEM and reduce order. */

I'd say this is fine as is. If we can't allocate memory for an array of
pages here we're likely pretty much screwed, right? e.g. we likely don't
have a chance of actually allocating new pages for the backing store
anyway. Also wouldn't the restart be broken if we can't fully track the
state of the restore?

> +			if (!tt->restore)
> +				return -ENOMEM;
> +		} else if (ttm_pool_restore_valid(tt->restore)) {
> +			struct ttm_pool_tt_restore *restore = tt->restore;
> +
> +			num_pages -= restore->alloced_pages;
> +			order = min_t(unsigned int, order, __fls(num_pages));
> +			pages += restore->alloced_pages;
> +			r = ttm_pool_restore_tt(restore, tt->backup, ctx);
> +			if (r)
> +				return r;
> +			caching = restore->caching_divide;
> +		}
> +
> +		tt->restore->pool = pool;
> +	}
> +
> +	for (; num_pages; order = min_t(unsigned int, order, __fls(num_pages))) {
>  		struct ttm_pool_type *pt;
>  
>  		page_caching = tt->caching;
> @@ -472,11 +645,19 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>  				r = ttm_pool_page_allocated(pool, order, p,
>  							    &dma_addr,
>  							    &num_pages,
> -							    &pages);
> +							    &pages,
> +							    tt->restore);
>  				if (r)
>  					goto error_free_page;
>  
>  				caching = pages;
> +				if (ttm_pool_restore_valid(tt->restore)) {
> +					r = ttm_pool_restore_tt(tt->restore, tt->backup,
> +								ctx);
> +					if (r)
> +						goto error_free_all;
> +				}
> +
>  				if (num_pages < (1 << order))
>  					break;
>  
> @@ -496,9 +677,17 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>  				caching = pages;
>  			}
>  			r = ttm_pool_page_allocated(pool, order, p, &dma_addr,
> -						    &num_pages, &pages);
> +						    &num_pages, &pages,
> +						    tt->restore);
>  			if (r)
>  				goto error_free_page;
> +
> +			if (ttm_pool_restore_valid(tt->restore)) {
> +				r = ttm_pool_restore_tt(tt->restore, tt->backup, ctx);
> +				if (r)
> +					goto error_free_all;
> +			}
> +
>  			if (PageHighMem(p))
>  				caching = pages;
>  		}
> @@ -517,12 +706,26 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>  	if (r)
>  		goto error_free_all;
>  
> +	if (tt->restore) {
> +		kvfree(tt->restore);
> +		tt->restore = NULL;
> +	}
> +
> +	if (tt->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP)
> +		tt->page_flags &= ~(TTM_TT_FLAG_PRIV_BACKED_UP |
> +				    TTM_TT_FLAG_SWAPPED);
> +
>  	return 0;
>  
>  error_free_page:
>  	ttm_pool_free_page(pool, page_caching, order, p);
>  
>  error_free_all:
> +	if (tt->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP) {
> +		tt->restore->caching_divide = caching;
> +		return r;
> +	}
> +
>  	num_pages = tt->num_pages - num_pages;
>  	caching_divide = caching - tt->pages;
>  	ttm_pool_free_range(pool, tt, tt->caching, 0, caching_divide);
> @@ -549,6 +752,174 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
>  }
>  EXPORT_SYMBOL(ttm_pool_free);
>  
> +/**
> + * ttm_pool_release_backed_up() - Release content of a swapped-out struct ttm_tt
> + * @tt: The struct ttm_tt.
> + *
> + * Release handles with associated content or any remaining pages of
> + * a backed-up struct ttm_tt.
> + */
> +void ttm_pool_release_backed_up(struct ttm_tt *tt)
> +{
> +	struct ttm_backup *backup = tt->backup;
> +	struct ttm_pool_tt_restore *restore;
> +	pgoff_t i, start_page = 0;
> +	unsigned long handle;
> +
> +	if (!(tt->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP))
> +		return;
> +
> +	restore = tt->restore;
> +
> +	if (ttm_pool_restore_valid(restore)) {
> +		pgoff_t nr = 1UL << restore->order;
> +
> +		for (i = restore->restored_pages; i < nr; ++i) {
> +			struct page *p = restore->old_pages[i];
> +
> +			if (ttm_backup_page_ptr_is_handle(p)) {
> +				handle = ttm_backup_page_ptr_to_handle(p);
> +				if (handle == 0)
> +					continue;
> +
> +				backup->ops->drop(backup, handle);
> +			} else if (p) {
> +				ttm_pool_split_for_swap(restore->pool, p);
> +				__free_pages(p, 0);
> +			}
> +		}
> +	}
> +
> +	if (restore) {
> +		pgoff_t mid = restore->caching_divide - tt->pages;
> +
> +		start_page = restore->alloced_pages;
> +		/* Pages that might be dma-mapped and non-cached */
> +		ttm_pool_free_range(restore->pool, tt, tt->caching,
> +				    0, mid);
> +		/* Pages that might be dma-mapped but cached */
> +		ttm_pool_free_range(restore->pool, tt, ttm_cached,
> +				    mid, restore->alloced_pages);
> +	}
> +
> +	/* Shrunken pages. Cached and not dma-mapped. */
> +	ttm_pool_free_range(NULL, tt, ttm_cached, start_page, tt->num_pages);
> +
> +	if (restore) {
> +		kvfree(restore);
> +		tt->restore = NULL;
> +	}
> +
> +	tt->page_flags &= ~(TTM_TT_FLAG_PRIV_BACKED_UP | TTM_TT_FLAG_SWAPPED);
> +}
> +
> +/**
> + * ttm_pool_backup_tt() - Back up or purge a struct ttm_tt
> + * @pool: The pool used when allocating the struct ttm_tt.
> + * @ttm: The struct ttm_tt.
> + * @purge: Don't back up but release pages directly to system.
> + * @writeback: If !@purge, Try to write out directly to the
> + * underlying persistent media.
> + *
> + * Back up or purge a struct ttm_tt. If @purge is true, then
> + * all pages will be freed directly to the system rather than to the pool
> + * they were allocated from, making the function behave similarly to
> + * ttm_pool_free(). If @purge is false the pages will be backed up instead,
> + * exchanged for handles.
> + * A subsequent call to ttm_pool_alloc() will then read back the content and
> + * a subsequent call to ttm_pool_release_backed_up() will drop it.
> + * If backup of a page fails for whatever reason, @ttm will still be
> + * partially backed up, retaining those pages for which backup fails.
> + *
> + * Return: Number of pages actually backed up or freed, or negative
> + * error code on error.
> + */
> +long ttm_pool_backup_tt(struct ttm_pool *pool, struct ttm_tt *ttm, bool purge,
> +			bool writeback)
> +{
> +	struct ttm_backup *backup = ttm->backup;
> +	struct page *page;
> +	unsigned long handle;
> +	gfp_t alloc_gfp;
> +	gfp_t gfp;
> +	int ret = 0;
> +	pgoff_t shrunken = 0;
> +	pgoff_t i, num_pages;
> +
> +	if ((!get_nr_swap_pages() && !purge) ||
> +	    pool->use_dma_alloc ||
> +	    (ttm->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP))
> +		return -EBUSY;
> +
> +#ifdef CONFIG_X86
> +	/* Anything returned to the system needs to be cached. */
> +	if (ttm->caching != ttm_cached)
> +		set_pages_array_wb(ttm->pages, ttm->num_pages);
> +#endif
> +
> +	if (ttm->dma_address || purge) {
> +		for (i = 0; i < ttm->num_pages; i += num_pages) {
> +			unsigned int order;
> +
> +			page = ttm->pages[i];
> +			if (unlikely(!page)) {
> +				num_pages = 1;
> +				continue;
> +			}
> +
> +			order = ttm_pool_page_order(pool, page);
> +			num_pages = 1UL << order;
> +			if (ttm->dma_address)
> +				ttm_pool_unmap(pool, ttm->dma_address[i],
> +					       num_pages);
> +			if (purge) {
> +				shrunken += num_pages;
> +				page->private = 0;
> +				__free_pages(page, order);
> +				memset(ttm->pages + i, 0,
> +				       num_pages * sizeof(*ttm->pages));
> +			}
> +		}
> +	}
> +
> +	if (purge)

if (purge || !backup)?

> +		return shrunken;
> +
> +	if (pool->use_dma32)
> +		gfp = GFP_DMA32;
> +	else
> +		gfp = GFP_HIGHUSER;
> +
> +	alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN | __GFP_RETRY_MAYFAIL;
> +
> +	for (i = 0; i < ttm->num_pages; ++i) {
> +		page = ttm->pages[i];
> +		if (unlikely(!page))
> +			continue;
> +
> +		ttm_pool_split_for_swap(pool, page);
> +
> +		handle = backup->ops->backup_page(backup, page, writeback, i,
> +						  gfp, alloc_gfp);
> +		if (handle) {
> +			ttm->pages[i] = ttm_backup_handle_to_page_ptr(handle);
> +			put_page(page);
> +			shrunken++;
> +		} else {
> +			/* We allow partially shrunken tts */
> +			ret = -ENOMEM;
> +			break;
> +		}
> +		cond_resched();
> +	}
> +
> +	if (shrunken)
> +		ttm->page_flags |= (TTM_TT_FLAG_PRIV_BACKED_UP |
> +				    TTM_TT_FLAG_SWAPPED);
> +
> +	return shrunken ? shrunken : ret;
> +}
> +
>  /**
>   * ttm_pool_init - Initialize a pool
>   *
> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
> index 4b51b9023126..98ce25197b38 100644
> --- a/drivers/gpu/drm/ttm/ttm_tt.c
> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> @@ -40,6 +40,7 @@
>  #include <drm/drm_cache.h>
>  #include <drm/drm_device.h>
>  #include <drm/drm_util.h>
> +#include <drm/ttm/ttm_backup.h>
>  #include <drm/ttm/ttm_bo.h>
>  #include <drm/ttm/ttm_tt.h>
>  
> @@ -158,6 +159,7 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
>  	ttm->swap_storage = NULL;
>  	ttm->sg = bo->sg;
>  	ttm->caching = caching;
> +	ttm->restore = NULL;

Set backup to NULL? Seems problematic if it's not set to NULL and the
driver doesn't choose to set the backup.

>  }
>  
>  int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
> @@ -182,6 +184,12 @@ void ttm_tt_fini(struct ttm_tt *ttm)
>  		fput(ttm->swap_storage);
>  	ttm->swap_storage = NULL;
>  
> +	ttm_pool_release_backed_up(ttm);
> +	if (ttm->backup) {

In patch 12 you don't set this to NULL on error. You will have to set it
to NULL there or change this to:

if (ttm->backup && !IS_ERR(ttm->backup))

> +		ttm->backup->ops->fini(ttm->backup);
> +		ttm->backup = NULL;
> +	}
> +
>  	if (ttm->pages)
>  		kvfree(ttm->pages);
>  	else
> @@ -253,6 +261,35 @@ int ttm_tt_swapin(struct ttm_tt *ttm)
>  }
>  EXPORT_SYMBOL_FOR_TESTS_ONLY(ttm_tt_swapin);
>  
> +/**
> + * ttm_tt_backup() - Helper to back up a struct ttm_tt.
> + * @bdev: The TTM device.
> + * @tt: The struct ttm_tt.
> + * @purge: Don't back up but release pages directly to system,
> + * bypassing any pooling.
> + * @writeback: If !@purge, try to write out directly to the
> + * underlying persistent media.
> + *
> + * Helper for a TTM driver to use from the bo_shrink() method to shrink
> + * a struct ttm_tt, after it has done the necessary unbinding. This function
> + * will update the page accounting and call ttm_pool_shrink_tt to free pages
> + * or move them to the swap cache.
> + *
> + * Return: Number of pages freed or swapped out, or negative error code on
> + * error.
> + */
> +long ttm_tt_backup(struct ttm_device *bdev, struct ttm_tt *tt, bool purge,
> +		   bool writeback)
> +{
> +	long ret = ttm_pool_backup_tt(&bdev->pool, tt, purge, writeback);
> +
> +	if (ret > 0)
> +		tt->page_flags &= ~TTM_TT_FLAG_PRIV_POPULATED;
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(ttm_tt_backup);
> +
>  /**
>   * ttm_tt_swapout - swap out tt object
>   *
> diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
> index 160d954a261e..4e4db369952b 100644
> --- a/include/drm/ttm/ttm_pool.h
> +++ b/include/drm/ttm/ttm_pool.h
> @@ -89,6 +89,11 @@ void ttm_pool_fini(struct ttm_pool *pool);
>  
>  int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);
>  
> +void ttm_pool_release_backed_up(struct ttm_tt *tt);
> +
> +long ttm_pool_backup_tt(struct ttm_pool *pool, struct ttm_tt *ttm,
> +			bool purge, bool writeback);
> +
>  int ttm_pool_mgr_init(unsigned long num_pages);
>  void ttm_pool_mgr_fini(void);
>  
> diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
> index 2b9d856ff388..6b990f1e7dd0 100644
> --- a/include/drm/ttm/ttm_tt.h
> +++ b/include/drm/ttm/ttm_tt.h
> @@ -32,11 +32,13 @@
>  #include <drm/ttm/ttm_caching.h>
>  #include <drm/ttm/ttm_kmap_iter.h>
>  
> +struct ttm_backup;
>  struct ttm_device;
>  struct ttm_tt;
>  struct ttm_resource;
>  struct ttm_buffer_object;
>  struct ttm_operation_ctx;
> +struct ttm_pool_tt_restore;
>  
>  /**
>   * struct ttm_tt - This is a structure holding the pages, caching- and aperture
> @@ -85,6 +87,9 @@ struct ttm_tt {
>  	 * fault handling abuses the DMA api a bit and dma_map_attrs can't be
>  	 * used to assure pgprot always matches.
>  	 *
> +	 * TTM_TT_FLAG_PRIV_BACKED_UP: TTM internal only. This is set if the
> +	 * struct ttm_tt has been (possibly partially) backed up.
> +	 *
>  	 * TTM_TT_FLAG_PRIV_POPULATED: TTM internal only. DO NOT USE. This is
>  	 * set by TTM after ttm_tt_populate() has successfully returned, and is
>  	 * then unset when TTM calls ttm_tt_unpopulate().
> @@ -96,6 +101,7 @@ struct ttm_tt {
>  #define TTM_TT_FLAG_DECRYPTED		BIT(4)
>  
>  #define TTM_TT_FLAG_PRIV_POPULATED	BIT(5)
> +#define TTM_TT_FLAG_PRIV_BACKED_UP	BIT(6)
>  	uint32_t page_flags;
>  	/** @num_pages: Number of pages in the page array. */
>  	uint32_t num_pages;
> @@ -105,11 +111,21 @@ struct ttm_tt {
>  	dma_addr_t *dma_address;
>  	/** @swap_storage: Pointer to shmem struct file for swap storage. */
>  	struct file *swap_storage;
> +	/**
> +	 * @backup: Pointer to backup struct for backed up tts.
> +	 * RFC: Could possibly be unified with @swap_storage.

I think long-term unifying this with swap_storage is probably a good idea.
Kinda goofy having two backup mechanisms.

In the meantime, can you add a comment that this is a driver-owned
field? This confused me until I looked at the last patch in this series
where this field was being set up.

> +	 */
> +	struct ttm_backup *backup;
>  	/**
>  	 * @caching: The current caching state of the pages, see enum
>  	 * ttm_caching.
>  	 */
>  	enum ttm_caching caching;
> +	/**
> +	 * @restore: Partial restoration from backup state.
> +	 * RFC: Incorporate in struct ttm_backup?

I think having a standalone restore field makes sense. Also probably
mention that this is a TTM private field and drivers shouldn't touch it.

Matt

> +	 */
> +	struct ttm_pool_tt_restore *restore;
>  };
>  
>  /**
> @@ -230,6 +246,10 @@ void ttm_tt_mgr_init(unsigned long num_pages, unsigned long num_dma32_pages);
>  struct ttm_kmap_iter *ttm_kmap_iter_tt_init(struct ttm_kmap_iter_tt *iter_tt,
>  					    struct ttm_tt *tt);
>  unsigned long ttm_tt_pages_limit(void);
> +
> +long ttm_tt_backup(struct ttm_device *bdev, struct ttm_tt *tt, bool purge,
> +		   bool writeback);
> +
>  #if IS_ENABLED(CONFIG_AGP)
>  #include <linux/agp_backend.h>
>  
> -- 
> 2.44.0
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 10/12] drm/ttm: Use fault-injection to test error paths
  2024-07-03 15:38 ` [PATCH v6 10/12] drm/ttm: Use fault-injection to test error paths Thomas Hellström
@ 2024-08-07 23:43   ` Matthew Brost
  2024-08-09 13:53     ` Thomas Hellström
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Brost @ 2024-08-07 23:43 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Christian König, Somalapuram Amaranath, dri-devel

On Wed, Jul 03, 2024 at 05:38:11PM +0200, Thomas Hellström wrote:
> Use fault-injection to test partial TTM swapout and interrupted swapin.
> Return -EINTR for swapin to test the caller's ability to handle and
> restart the swapin, and on swapout perform a partial swapout to test
> the swapin and release_shrunken functionality.
> 
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  drivers/gpu/drm/Kconfig        | 10 ++++++++++
>  drivers/gpu/drm/ttm/ttm_pool.c | 17 ++++++++++++++++-
>  2 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index fd0749c0c630..9f27271bfab8 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -272,6 +272,16 @@ config DRM_GPUVM
>  	  GPU-VM representation providing helpers to manage a GPUs virtual
>  	  address space
>  
> +config DRM_TTM_BACKUP_FAULT_INJECT
> +	bool "Enable fault injection during TTM backup"
> +	depends on DRM_TTM
> +	default n
> +	help
> +	  Inject recoverable failures during TTM backup and recovery of
> +	  backed-up objects. For DRM driver developers only.
> +
> +	  If in doubt, choose N.
> +
>  config DRM_BUDDY
>  	tristate
>  	depends on DRM
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index 38e50cf81b0a..d32a1f2e5e50 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -431,6 +431,7 @@ static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore,
>  			       struct ttm_backup *backup,
>  			       struct ttm_operation_ctx *ctx)
>  {
> +	static unsigned long __maybe_unused swappedin;
>  	unsigned int i, nr = 1 << restore->order;
>  	int ret = 0;
>  
> @@ -446,6 +447,13 @@ static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore,
>  			if (handle == 0)
>  				continue;
>  
> +			if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT) &&
> +			    ctx->interruptible &&
> +			    ++swappedin % 100 == 0) {
> +				ret = -EINTR;
> +				break;
> +			}

So here this -EINTR would be kicked back to the user ioctl which
triggered the BO validate, and then retried? The restore should then be
able to successfully pick up where it left off?
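
I.e. userspace just does the usual interrupted-ioctl retry, roughly what
libdrm's drmIoctl() already does:

	do {
		ret = ioctl(fd, request, arg);
	} while (ret == -1 && (errno == EINTR || errno == EAGAIN));

and the kernel then resumes from the state kept in tt->restore on the
retried validate?
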

> +
>  			ret = backup->ops->copy_backed_up_page
>  				(backup, restore->first_page[i],
>  				 handle, ctx->interruptible);
> @@ -892,7 +900,14 @@ long ttm_pool_backup_tt(struct ttm_pool *pool, struct ttm_tt *ttm, bool purge,
>  
>  	alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN | __GFP_RETRY_MAYFAIL;
>  
> -	for (i = 0; i < ttm->num_pages; ++i) {
> +	num_pages = ttm->num_pages;
> +
> +	/* Pretend doing fault injection by shrinking only half of the pages. */
> +
> +	if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT))
> +		num_pages = DIV_ROUND_UP(num_pages, 2);

So what happens here? Half the pages are swapped out, and then upon
restore that half is swapped back in? The shrinker continues to walk
until enough pages have been swapped out?

Matt

> +
> +	for (i = 0; i < num_pages; ++i) {
>  		page = ttm->pages[i];
>  		if (unlikely(!page))
>  			continue;
> -- 
> 2.44.0
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 12/12] drm/xe: Increase the XE_PL_TT watermark
  2024-07-03 15:38 ` [PATCH v6 12/12] drm/xe: Increase the XE_PL_TT watermark Thomas Hellström
  2024-08-05 18:35   ` Souza, Jose
@ 2024-08-07 23:44   ` Matthew Brost
  2024-08-09 13:53     ` Thomas Hellström
  1 sibling, 1 reply; 38+ messages in thread
From: Matthew Brost @ 2024-08-07 23:44 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Somalapuram Amaranath, Christian König, dri-devel

On Wed, Jul 03, 2024 at 05:38:13PM +0200, Thomas Hellström wrote:
> The XE_PL_TT watermark was set to 50% of system memory.
> The idea behind that was unclear since the net effect is that
> TT memory will be evicted to TTM_PL_SYSTEM memory if that
> watermark is exceeded, requiring PPGTT rebinds and dma
> remapping. But there is no similar watermark for TTM_PL_SYSTEM
> memory.
> 
> The TTM functionality that tries to swap out system memory to
> shmem objects if a 50% limit of total system memory is reached
> is orthogonal to this, and with the shrinker added, it's no
> longer in effect.
> 
> Replace the 50% TTM_PL_TT limit with a 100% limit, in effect
> allowing all graphics memory to be bound to the device unless it
> has been swapped out by the shrinker.
> 
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

Reviewed-by: Matthew Brost <matthew.brost@intel.com>

> ---
>  drivers/gpu/drm/xe/xe_ttm_sys_mgr.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c b/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> index 9844a8edbfe1..d38b91872da3 100644
> --- a/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> +++ b/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> @@ -108,9 +108,8 @@ int xe_ttm_sys_mgr_init(struct xe_device *xe)
>  	u64 gtt_size;
>  
>  	si_meminfo(&si);
> +	/* Potentially restrict amount of TT memory here. */
>  	gtt_size = (u64)si.totalram * si.mem_unit;
> -	/* TTM limits allocation of all TTM devices by 50% of system memory */
> -	gtt_size /= 2;
>  
>  	man->use_tt = true;
>  	man->func = &xe_ttm_sys_mgr_func;
> -- 
> 2.44.0
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 11/12] drm/ttm, drm/xe: Add a shrinker for xe bos
  2024-07-03 15:38 ` [PATCH v6 11/12] drm/ttm, drm/xe: Add a shrinker for xe bos Thomas Hellström
@ 2024-08-08  1:37   ` Matthew Brost
  2024-08-09 14:31     ` Thomas Hellström
  2024-08-09 16:05   ` Matthew Auld
  1 sibling, 1 reply; 38+ messages in thread
From: Matthew Brost @ 2024-08-08  1:37 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Christian König, Somalapuram Amaranath, dri-devel

On Wed, Jul 03, 2024 at 05:38:12PM +0200, Thomas Hellström wrote:
> Rather than relying on the TTM watermark accounting add a shrinker
> for xe_bos in TT or system memory.
> 
> Leverage the newly added TTM per-page shrinking and shmem backup
> support.
> 
> Although xe doesn't fully support WONTNEED (purgeable) bos yet,
> introduce and add shrinker support for purgeable ttm_tts.
> 
> v2:
> - Cleanups bugfixes and a KUNIT shrinker test.
> - Add writeback support, and activate if kswapd.
> v3:
> - Move the try_shrink() helper to core TTM.
> - Minor cleanups.
> v4:
> - Add runtime pm for the shrinker. Shrinking may require an active
>   device for CCS metadata copying.
> v5:
> - Separately purge ghost- and zombie objects in the shrinker.
> - Fix a format specifier - type inconsistency. (Kernel test robot).
> 
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  drivers/gpu/drm/ttm/ttm_bo_util.c     |  67 ++++++
>  drivers/gpu/drm/xe/Makefile           |   1 +
>  drivers/gpu/drm/xe/tests/xe_bo.c      | 118 +++++++++++
>  drivers/gpu/drm/xe/tests/xe_bo_test.c |   1 +
>  drivers/gpu/drm/xe/tests/xe_bo_test.h |   1 +
>  drivers/gpu/drm/xe/xe_bo.c            | 155 ++++++++++++--
>  drivers/gpu/drm/xe/xe_bo.h            |  26 +++
>  drivers/gpu/drm/xe/xe_device.c        |   8 +
>  drivers/gpu/drm/xe/xe_device_types.h  |   2 +
>  drivers/gpu/drm/xe/xe_shrinker.c      | 287 ++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_shrinker.h      |  18 ++
>  include/drm/ttm/ttm_bo.h              |   3 +
>  12 files changed, 671 insertions(+), 16 deletions(-)
>  create mode 100644 drivers/gpu/drm/xe/xe_shrinker.c
>  create mode 100644 drivers/gpu/drm/xe/xe_shrinker.h
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
> index c4f678f30fc2..563e96a4cf06 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> @@ -924,3 +924,70 @@ long ttm_lru_walk_for_evict(struct ttm_lru_walk *walk, struct ttm_device *bdev,
>  
>  	return progress;
>  }
> +EXPORT_SYMBOL(ttm_lru_walk_for_evict);
> +
> +/**
> + * ttm_bo_try_shrink - LRU walk helper to shrink a ttm buffer object.
> + * @walk: The struct xe_ttm_lru_walk that describes the walk.
> + * @bo: The buffer object.
> + * @purge: Whether to attempt to purge the bo content since it's no
> + * longer needed.
> + * @writeback: If !@purge, attempt to write out to persistent storage.
> + *
> + * The function uses the ttm_tt_back_up functionality to back up or
> + * purge a struct ttm_tt. If the bo is not in system, it's first
> + * moved there.
> + *
> + * Return: The number of pages shrunken or purged, or
> + * negative error code on failure.
> + */
> +long ttm_bo_try_shrink(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo,
> +		       bool purge, bool writeback)
> +{
> +	static const struct ttm_place sys_placement_flags = {
> +		.fpfn = 0,
> +		.lpfn = 0,
> +		.mem_type = TTM_PL_SYSTEM,
> +		.flags = 0,
> +	};
> +	static struct ttm_placement sys_placement = {
> +		.num_placement = 1,
> +		.placement = &sys_placement_flags,
> +	};
> +	struct ttm_operation_ctx *ctx = walk->ctx;
> +	struct ttm_tt *tt = bo->ttm;
> +	long lret;
> +
> +	dma_resv_assert_held(bo->base.resv);
> +
> +	if (!tt || !ttm_tt_is_populated(tt))
> +		return 0;
> +
> +	if (bo->resource->mem_type != TTM_PL_SYSTEM) {
> +		int ret = ttm_bo_validate(bo, &sys_placement, ctx);
> +
> +		if (ret) {
> +			if (ret == -EINTR || ret == -EDEADLK ||
> +			    ret == -ERESTARTSYS)
> +				return ret;

Can you explain the various error code returns / suppression in this
function?

> +			return 0;
> +		}
> +	}
> +
> +	lret = ttm_bo_wait_ctx(bo, ctx);
> +	if (lret < 0) {
> +		if (lret == -ERESTARTSYS)
> +			return lret;
> +		return 0;
> +	}
> +
> +	if (bo->deleted)
> +		lret = ttm_tt_backup(bo->bdev, tt, true, writeback);
> +	else
> +		lret = ttm_tt_backup(bo->bdev, tt, purge, writeback);

Hmm, missed this in my previous review. It is frowned upon having
multiple bools as arguments. Could this be reworked with flags? Same
goes for all functions in the series with multiple bool arguments.
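
Something along these lines, perhaps (just a sketch, the flag names are
made up):

#define TTM_SHRINK_FLAG_PURGE		BIT(0)
#define TTM_SHRINK_FLAG_WRITEBACK	BIT(1)

long ttm_bo_try_shrink(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo,
		       unsigned int flags);

so that a call site reads ttm_bo_try_shrink(walk, bo, TTM_SHRINK_FLAG_WRITEBACK)
rather than ttm_bo_try_shrink(walk, bo, false, true).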

> +	if (lret < 0 && lret != -EINTR)
> +		return 0;
> +
> +	return lret;
> +}
> +EXPORT_SYMBOL(ttm_bo_try_shrink);
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index b1e03bfe4a68..1eba51bdd172 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -112,6 +112,7 @@ xe-y += xe_bb.o \
>  	xe_ring_ops.o \
>  	xe_sa.o \
>  	xe_sched_job.o \
> +	xe_shrinker.o \
>  	xe_step.o \
>  	xe_sync.o \
>  	xe_tile.o \
> diff --git a/drivers/gpu/drm/xe/tests/xe_bo.c b/drivers/gpu/drm/xe/tests/xe_bo.c
> index 9f3c02826464..49617f16dc76 100644
> --- a/drivers/gpu/drm/xe/tests/xe_bo.c
> +++ b/drivers/gpu/drm/xe/tests/xe_bo.c
> @@ -6,6 +6,8 @@
>  #include <kunit/test.h>
>  #include <kunit/visibility.h>
>  
> +#include <uapi/linux/sysinfo.h>
> +
>  #include "tests/xe_bo_test.h"
>  #include "tests/xe_pci_test.h"
>  #include "tests/xe_test.h"
> @@ -350,3 +352,119 @@ void xe_bo_evict_kunit(struct kunit *test)
>  	xe_call_for_each_device(evict_test_run_device);
>  }
>  EXPORT_SYMBOL_IF_KUNIT(xe_bo_evict_kunit);
> +
> +struct xe_bo_link {
> +	struct list_head link;
> +	struct xe_bo *bo;
> +};
> +
> +#define XE_BO_SHRINK_SIZE ((unsigned long)SZ_64M)
> +
> +/*
> + * Try to create system bos corresponding to twice the amount
> + * of available system memory to test shrinker functionality.
> + * If no swap space is available to accommodate the
> + * memory overcommit, mark bos purgeable.
> + */
> +static int shrink_test_run_device(struct xe_device *xe)
> +{
> +	struct kunit *test = xe_cur_kunit();
> +	LIST_HEAD(bos);
> +	struct xe_bo_link *link, *next;
> +	struct sysinfo si;
> +	size_t total, alloced;
> +	unsigned int interrupted = 0, successful = 0;
> +
> +	si_meminfo(&si);
> +	total = si.freeram * si.mem_unit;
> +
> +	kunit_info(test, "Free ram is %lu bytes. Will allocate twice of that.\n",
> +		   (unsigned long) total);
> +
> +	total <<= 1;
> +	for (alloced = 0; alloced < total ; alloced += XE_BO_SHRINK_SIZE) {
> +		struct xe_bo *bo;
> +		unsigned int mem_type;
> +
> +		link = kzalloc(sizeof(*link), GFP_KERNEL);
> +		if (!link) {
> +			KUNIT_FAIL(test, "Unexpeced link allocation failure\n");
> +			break;
> +		}
> +
> +		INIT_LIST_HEAD(&link->link);
> +
> +		/* We can create bos using WC caching here. But it is slower. */
> +		bo = xe_bo_create_user(xe, NULL, NULL, XE_BO_SHRINK_SIZE,
> +				       DRM_XE_GEM_CPU_CACHING_WB,
> +				       ttm_bo_type_device,
> +				       XE_BO_FLAG_SYSTEM);
> +		if (IS_ERR(bo)) {
> +			if (bo != ERR_PTR(-ENOMEM) && bo != ERR_PTR(-ENOSPC) &&
> +			    bo != ERR_PTR(-EINTR) && bo != ERR_PTR(-ERESTARTSYS))
> +				KUNIT_FAIL(test, "Error creating bo: %pe\n", bo);
> +			kfree(link);
> +			break;
> +		}
> +		link->bo = bo;
> +		list_add_tail(&link->link, &bos);
> +		xe_bo_lock(bo, false);
> +
> +		/*
> +		 * If we're low on swap entries, we can't shrink unless the bo
> +		 * is marked purgeable.
> +		 */
> +		if (get_nr_swap_pages() < (XE_BO_SHRINK_SIZE >> PAGE_SHIFT) * 128) {
> +			struct xe_ttm_tt *xe_tt =
> +				container_of(bo->ttm.ttm, typeof(*xe_tt), ttm);
> +			long num_pages = xe_tt->ttm.num_pages;
> +
> +			xe_tt->purgeable = true;
> +			xe_shrinker_mod_pages(xe->mem.shrinker, -num_pages,
> +					      num_pages);
> +		}
> +
> +		mem_type = bo->ttm.resource->mem_type;
> +		xe_bo_unlock(bo);
> +		if (mem_type != XE_PL_TT)
> +			KUNIT_FAIL(test, "Bo in incorrect memory type: %u\n",
> +				   bo->ttm.resource->mem_type);
> +		cond_resched();
> +		if (signal_pending(current))
> +			break;
> +	}
> +
> +	/* Read back and destroy bos */
> +	list_for_each_entry_safe_reverse(link, next, &bos, link) {
> +		static struct ttm_operation_ctx ctx = {.interruptible = true};
> +		struct xe_bo *bo = link->bo;
> +		int ret;
> +
> +		if (!signal_pending(current)) {
> +			xe_bo_lock(bo, NULL);
> +			ret = ttm_bo_validate(&bo->ttm, &tt_placement, &ctx);
> +			xe_bo_unlock(bo);
> +			if (ret && ret != -EINTR)
> +				KUNIT_FAIL(test, "Validation failed: %pe\n",
> +					   ERR_PTR(ret));
> +			else if (ret)
> +				interrupted++;
> +			else
> +				successful++;
> +		}
> +		xe_bo_put(link->bo);
> +		list_del(&link->link);
> +		kfree(link);
> +		cond_resched();
> +	}
> +	kunit_info(test, "Readbacks interrupted: %u successful: %u\n",
> +		   interrupted, successful);
> +
> +	return 0;
> +}
> +
> +void xe_bo_shrink_kunit(struct kunit *test)
> +{
> +	xe_call_for_each_device(shrink_test_run_device);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(xe_bo_shrink_kunit);
> diff --git a/drivers/gpu/drm/xe/tests/xe_bo_test.c b/drivers/gpu/drm/xe/tests/xe_bo_test.c
> index a324cde77db8..317fa923e287 100644
> --- a/drivers/gpu/drm/xe/tests/xe_bo_test.c
> +++ b/drivers/gpu/drm/xe/tests/xe_bo_test.c
> @@ -10,6 +10,7 @@
>  static struct kunit_case xe_bo_tests[] = {
>  	KUNIT_CASE(xe_ccs_migrate_kunit),
>  	KUNIT_CASE(xe_bo_evict_kunit),
> +	KUNIT_CASE_SLOW(xe_bo_shrink_kunit),
>  	{}
>  };
>  
> diff --git a/drivers/gpu/drm/xe/tests/xe_bo_test.h b/drivers/gpu/drm/xe/tests/xe_bo_test.h
> index 0113ab45066a..7f44d14a45c5 100644
> --- a/drivers/gpu/drm/xe/tests/xe_bo_test.h
> +++ b/drivers/gpu/drm/xe/tests/xe_bo_test.h
> @@ -10,5 +10,6 @@ struct kunit;
>  
>  void xe_ccs_migrate_kunit(struct kunit *test);
>  void xe_bo_evict_kunit(struct kunit *test);
> +void xe_bo_shrink_kunit(struct kunit *test);
>  
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index 65c696966e96..6ab63d1642ae 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -10,6 +10,7 @@
>  #include <drm/drm_drv.h>
>  #include <drm/drm_gem_ttm_helper.h>
>  #include <drm/drm_managed.h>
> +#include <drm/ttm/ttm_backup.h>
>  #include <drm/ttm/ttm_device.h>
>  #include <drm/ttm/ttm_placement.h>
>  #include <drm/ttm/ttm_tt.h>
> @@ -25,6 +26,7 @@
>  #include "xe_pm.h"
>  #include "xe_preempt_fence.h"
>  #include "xe_res_cursor.h"
> +#include "xe_shrinker.h"
>  #include "xe_trace_bo.h"
>  #include "xe_ttm_stolen_mgr.h"
>  #include "xe_vm.h"
> @@ -278,11 +280,15 @@ static void xe_evict_flags(struct ttm_buffer_object *tbo,
>  	}
>  }
>  
> +/* struct xe_ttm_tt - Subclassed ttm_tt for xe */
>  struct xe_ttm_tt {
>  	struct ttm_tt ttm;
> -	struct device *dev;
> +	/** @xe - The xe device */
> +	struct xe_device *xe;
>  	struct sg_table sgt;
>  	struct sg_table *sg;
> +	/** @purgeable - Whether the bo is purgeable (WONTNEED) */

So we need to add WONTNEED to our uAPI, right?

> +	bool purgeable;
>  };
>  
>  static int xe_tt_map_sg(struct ttm_tt *tt)
> @@ -291,7 +297,8 @@ static int xe_tt_map_sg(struct ttm_tt *tt)
>  	unsigned long num_pages = tt->num_pages;
>  	int ret;
>  
> -	XE_WARN_ON(tt->page_flags & TTM_TT_FLAG_EXTERNAL);
> +	XE_WARN_ON((tt->page_flags & TTM_TT_FLAG_EXTERNAL) &&
> +		   !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE));
>  
>  	if (xe_tt->sg)
>  		return 0;
> @@ -299,13 +306,13 @@ static int xe_tt_map_sg(struct ttm_tt *tt)
>  	ret = sg_alloc_table_from_pages_segment(&xe_tt->sgt, tt->pages,
>  						num_pages, 0,
>  						(u64)num_pages << PAGE_SHIFT,
> -						xe_sg_segment_size(xe_tt->dev),
> +						xe_sg_segment_size(xe_tt->xe->drm.dev),
>  						GFP_KERNEL);
>  	if (ret)
>  		return ret;
>  
>  	xe_tt->sg = &xe_tt->sgt;
> -	ret = dma_map_sgtable(xe_tt->dev, xe_tt->sg, DMA_BIDIRECTIONAL,
> +	ret = dma_map_sgtable(xe_tt->xe->drm.dev, xe_tt->sg, DMA_BIDIRECTIONAL,
>  			      DMA_ATTR_SKIP_CPU_SYNC);
>  	if (ret) {
>  		sg_free_table(xe_tt->sg);
> @@ -321,7 +328,7 @@ static void xe_tt_unmap_sg(struct ttm_tt *tt)
>  	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
>  
>  	if (xe_tt->sg) {
> -		dma_unmap_sgtable(xe_tt->dev, xe_tt->sg,
> +		dma_unmap_sgtable(xe_tt->xe->drm.dev, xe_tt->sg,
>  				  DMA_BIDIRECTIONAL, 0);
>  		sg_free_table(xe_tt->sg);
>  		xe_tt->sg = NULL;
> @@ -336,21 +343,41 @@ struct sg_table *xe_bo_sg(struct xe_bo *bo)
>  	return xe_tt->sg;
>  }
>  
> +/*
> + * Account ttm pages against the device shrinker's shrinkable and
> + * purgeable counts.
> + */
> +static void xe_ttm_tt_account(struct ttm_tt *tt, bool add)
> +{

Again I think bools are frowned upon as arguments. Maybe just two
functions - add / sub?
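
I.e. something like this untested sketch (function names picked arbitrarily):

static void xe_ttm_tt_account_add(struct ttm_tt *tt)
{
	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);

	if (xe_tt->purgeable)
		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, 0, tt->num_pages);
	else
		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, tt->num_pages, 0);
}

static void xe_ttm_tt_account_subtract(struct ttm_tt *tt)
{
	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);

	if (xe_tt->purgeable)
		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, 0, -(long)tt->num_pages);
	else
		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, -(long)tt->num_pages, 0);
}

with the callers picking the right one instead of passing a bool.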

> +	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
> +	long num_pages = tt->num_pages;
> +
> +	if (!add)
> +		num_pages = -num_pages;
> +
> +	if (xe_tt->purgeable)
> +		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, 0, num_pages);
> +	else
> +		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, num_pages, 0);
> +}
> +
>  static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
>  				       u32 page_flags)
>  {
>  	struct xe_bo *bo = ttm_to_xe_bo(ttm_bo);
>  	struct xe_device *xe = xe_bo_device(bo);
> -	struct xe_ttm_tt *tt;
> +	struct xe_ttm_tt *xe_tt;
> +	struct ttm_tt *tt;
>  	unsigned long extra_pages;
>  	enum ttm_caching caching;
>  	int err;
>  
> -	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
> -	if (!tt)
> +	xe_tt = kzalloc(sizeof(*xe_tt), GFP_KERNEL);
> +	if (!xe_tt)
>  		return NULL;
>  
> -	tt->dev = xe->drm.dev;
> +	tt = &xe_tt->ttm;
> +	xe_tt->xe = xe;
>  
>  	extra_pages = 0;
>  	if (xe_bo_needs_ccs_pages(bo))
> @@ -387,42 +414,128 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
>  		caching = ttm_uncached;
>  	}
>  
> -	err = ttm_tt_init(&tt->ttm, &bo->ttm, page_flags, caching, extra_pages);
> +	if (ttm_bo->type != ttm_bo_type_sg)
> +		page_flags |= TTM_TT_FLAG_EXTERNAL | TTM_TT_FLAG_EXTERNAL_MAPPABLE;
> +
> +	err = ttm_tt_init(tt, &bo->ttm, page_flags, caching, extra_pages);
>  	if (err) {
> -		kfree(tt);
> +		kfree(xe_tt);
>  		return NULL;
>  	}
>  
> -	return &tt->ttm;
> +	tt->backup = ttm_backup_shmem_create(tt->num_pages << PAGE_SHIFT);
> +	if (IS_ERR(tt->backup)) {
> +		ttm_tt_fini(tt);

Mentioned this in the previous review: I think you need to set tt->backup to
NULL here or update ttm_tt_fini() to understand IS_ERR(tt->backup).
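
I.e. something like this (sketch only):

	tt->backup = ttm_backup_shmem_create(tt->num_pages << PAGE_SHIFT);
	if (IS_ERR(tt->backup)) {
		tt->backup = NULL;	/* don't let ttm_tt_fini() see an ERR_PTR */
		ttm_tt_fini(tt);
		kfree(xe_tt);
		return NULL;
	}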

Also, maybe a dumb question, but could we just have a global backup for all
BOs? Would that be better than each BO creating its own backup?

> +		kfree(xe_tt);
> +		return NULL;
> +	}
> +
> +	return tt;
>  }
>  
>  static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
>  			      struct ttm_operation_ctx *ctx)
>  {
> +	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
>  	int err;
>  
>  	/*
>  	 * dma-bufs are not populated with pages, and the dma-
>  	 * addresses are set up when moved to XE_PL_TT.
>  	 */
> -	if (tt->page_flags & TTM_TT_FLAG_EXTERNAL)
> +	if ((tt->page_flags & TTM_TT_FLAG_EXTERNAL) &&
> +	    !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE))
>  		return 0;
>  
>  	err = ttm_pool_alloc(&ttm_dev->pool, tt, ctx);
>  	if (err)
>  		return err;
>  
> -	return err;
> +	xe_tt->purgeable = false;
> +	xe_ttm_tt_account(tt, true);
> +
> +	return 0;
>  }
>  
>  static void xe_ttm_tt_unpopulate(struct ttm_device *ttm_dev, struct ttm_tt *tt)
>  {
> -	if (tt->page_flags & TTM_TT_FLAG_EXTERNAL)
> +	if ((tt->page_flags & TTM_TT_FLAG_EXTERNAL) &&
> +	    !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE))
>  		return;
>  
>  	xe_tt_unmap_sg(tt);
>  
> -	return ttm_pool_free(&ttm_dev->pool, tt);
> +	ttm_pool_free(&ttm_dev->pool, tt);
> +	xe_ttm_tt_account(tt, false);
> +}
> +
> +/**
> + * xe_bo_shrink() - Try to shrink an xe bo.
> + * @walk:  - The walk parameters
> + * @bo: The TTM buffer object
> + * @purge: Only consider purgeable bos.
> + * @writeback: Try to write back to persistent storage.
> + *
> + * Try to shrink- or purge a bo, and if it succeeds, unmap dma.
> + * Note that we need to be able to handle also non xe bos
> + * (ghost bos), but only if the struct ttm_tt is embedded in
> + * a struct xe_ttm_tt.
> + *
> + * Return: The number of pages shrunken or purged, or negative error
> + * code on failure.
> + */
> +long xe_bo_shrink(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo,
> +		  bool purge, bool writeback)
> +{
> +	struct ttm_tt *tt = bo->ttm;
> +	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
> +	struct ttm_place place = {.mem_type = bo->resource->mem_type};
> +	struct xe_bo *xe_bo = ttm_to_xe_bo(bo);
> +	struct xe_device *xe = xe_tt->xe;
> +	bool needs_rpm;
> +	long lret = 0L;
> +
> +	if (!tt || !ttm_tt_is_populated(tt) ||
> +	    !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE) ||
> +	    (purge && !xe_tt->purgeable))
> +		return 0L;
> +
> +	if (!ttm_bo_eviction_valuable(bo, &place))
> +		return 0L;
> +
> +	/* Beware of zombies (GEM object refcount == 0) and ghosts. */
> +	if (!xe_bo_is_xe_bo(bo) || !xe_bo_get_unless_zero(xe_bo)) {
> +		struct ttm_placement null_placement = { .num_placement = 0 };
> +
> +		lret = ttm_bo_wait_ctx(bo, walk->ctx);
> +		if (lret)
> +			return lret;
> +
> +		/* Purge the bo content! */
> +		ttm_bo_validate(bo, &null_placement, walk->ctx);
> +		return tt->num_pages;
> +	}
> +
> +	/* System CCS needs gpu copy when moving PL_TT -> PL_SYSTEM */
> +	needs_rpm = (!IS_DGFX(xe) && bo->resource->mem_type != XE_PL_SYSTEM &&
> +		     xe_bo && xe_bo_needs_ccs_pages(xe_bo) && !xe_tt->purgeable);

Is xe_bo check really needed here?

> +	if (needs_rpm && !xe_pm_runtime_get_if_active(xe))
> +		goto out_unref;
> +
> +	lret = ttm_bo_try_shrink(walk, bo, xe_tt->purgeable, writeback);
> +	if (needs_rpm)
> +		xe_pm_runtime_put(xe);
> +
> +	if (lret > 0) {
> +		xe_assert(xe, !ttm_tt_is_populated(tt));
> +
> +		xe_ttm_tt_account(tt, false);
> +	}
> +
> +out_unref:
> +	xe_bo_put(xe_bo);
> +
> +	return lret;
>  }
>  
>  static void xe_ttm_tt_destroy(struct ttm_device *ttm_dev, struct ttm_tt *tt)
> @@ -1238,6 +1351,7 @@ struct xe_bo *___xe_bo_create_locked(struct xe_device *xe, struct xe_bo *bo,
>  	struct ttm_operation_ctx ctx = {
>  		.interruptible = true,
>  		.no_wait_gpu = false,
> +		.gfp_retry_mayfail = true,

Can you explain why you are setting this?

>  	};
>  	struct ttm_placement *placement;
>  	uint32_t alignment;
> @@ -1681,6 +1795,8 @@ int xe_bo_pin_external(struct xe_bo *bo)
>  	}
>  
>  	ttm_bo_pin(&bo->ttm);
> +	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
> +		xe_ttm_tt_account(bo->ttm.ttm, false);
>  
>  	/*
>  	 * FIXME: If we always use the reserve / unreserve functions for locking
> @@ -1739,6 +1855,8 @@ int xe_bo_pin(struct xe_bo *bo)
>  	}
>  
>  	ttm_bo_pin(&bo->ttm);
> +	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
> +		xe_ttm_tt_account(bo->ttm.ttm, false);
>  
>  	/*
>  	 * FIXME: If we always use the reserve / unreserve functions for locking
> @@ -1773,6 +1891,9 @@ void xe_bo_unpin_external(struct xe_bo *bo)
>  	spin_unlock(&xe->pinned.lock);
>  
>  	ttm_bo_unpin(&bo->ttm);
> +	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
> +		xe_ttm_tt_account(bo->ttm.ttm, true);
> +

Nit: Extra newline.

>  
>  	/*
>  	 * FIXME: If we always use the reserve / unreserve functions for locking
> @@ -1801,6 +1922,8 @@ void xe_bo_unpin(struct xe_bo *bo)
>  	}
>  
>  	ttm_bo_unpin(&bo->ttm);
> +	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
> +		xe_ttm_tt_account(bo->ttm.ttm, true);
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
> index 6de894c728f5..8463e3f3f6f1 100644
> --- a/drivers/gpu/drm/xe/xe_bo.h
> +++ b/drivers/gpu/drm/xe/xe_bo.h
> @@ -63,6 +63,7 @@
>  #define XE_BO_PROPS_INVALID	(-1)
>  
>  struct sg_table;
> +struct xe_ttm_lru_walk;
>  
>  struct xe_bo *xe_bo_alloc(void);
>  void xe_bo_free(struct xe_bo *bo);
> @@ -126,6 +127,28 @@ static inline struct xe_bo *xe_bo_get(struct xe_bo *bo)
>  	return bo;
>  }
>  
> +/*
> + * xe_bo_get_unless_zero() - Conditionally obtain a GEM object refcount on an
> + * xe bo
> + * @bo: The bo for which we want to obtain a refcount.
> + *
> + * There is a short window between where the bo's GEM object refcount reaches
> + * zero and where we put the final ttm_bo reference. Code in the eviction- and
> + * shrinking path should therefore attempt to grab a gem object reference before
> + * trying to use members outside of the base class ttm object. This function is
> + * intended for that purpose. On successful return, this function must be paired
> + * with an xe_bo_put().
> + *
> + * Return: @bo on success, NULL on failure.
> + */
> +static inline __must_check struct xe_bo *xe_bo_get_unless_zero(struct xe_bo *bo)
> +{
> +	if (!bo || !kref_get_unless_zero(&bo->ttm.base.refcount))
> +		return NULL;
> +
> +	return bo;
> +}
> +
>  static inline void xe_bo_put(struct xe_bo *bo)
>  {
>  	if (bo)
> @@ -315,6 +338,9 @@ static inline unsigned int xe_sg_segment_size(struct device *dev)
>  
>  #define i915_gem_object_flush_if_display(obj)		((void)(obj))
>  
> +long xe_bo_shrink(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo,
> +		  bool purge, bool writeback);
> +
>  #if IS_ENABLED(CONFIG_DRM_XE_KUNIT_TEST)
>  /**
>   * xe_bo_is_mem_type - Whether the bo currently resides in the given
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index cfda7cb5df2c..58fecc4b0a18 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -47,6 +47,7 @@
>  #include "xe_perf.h"
>  #include "xe_pm.h"
>  #include "xe_query.h"
> +#include "xe_shrinker.h"
>  #include "xe_sriov.h"
>  #include "xe_tile.h"
>  #include "xe_ttm_stolen_mgr.h"
> @@ -241,6 +242,9 @@ static void xe_device_destroy(struct drm_device *dev, void *dummy)
>  	if (xe->unordered_wq)
>  		destroy_workqueue(xe->unordered_wq);
>  
> +	if (!IS_ERR_OR_NULL(xe->mem.shrinker))
> +		xe_shrinker_destroy(xe->mem.shrinker);
> +
>  	ttm_device_fini(&xe->ttm);
>  }
>  
> @@ -270,6 +274,10 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
>  	if (err)
>  		goto err;
>  
> +	xe->mem.shrinker = xe_shrinker_create(xe);
> +	if (IS_ERR(xe->mem.shrinker))
> +		return ERR_CAST(xe->mem.shrinker);
> +
>  	xe->info.devid = pdev->device;
>  	xe->info.revid = pdev->revision;
>  	xe->info.force_execlist = xe_modparam.force_execlist;
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index c37be471d11c..3d5440aba52e 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -325,6 +325,8 @@ struct xe_device {
>  		struct xe_mem_region vram;
>  		/** @mem.sys_mgr: system TTM manager */
>  		struct ttm_resource_manager sys_mgr;
> +		/** @mem.sys_mgr: system memory shrinker. */
> +		struct xe_shrinker *shrinker;
>  	} mem;
>  
>  	/** @sriov: device level virtualization data */
> diff --git a/drivers/gpu/drm/xe/xe_shrinker.c b/drivers/gpu/drm/xe/xe_shrinker.c
> new file mode 100644
> index 000000000000..3f9554bdc06b
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_shrinker.c
> @@ -0,0 +1,287 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#include <linux/shrinker.h>
> +#include <linux/swap.h>
> +
> +#include <drm/ttm/ttm_bo.h>
> +#include <drm/ttm/ttm_tt.h>
> +
> +#include "xe_bo.h"
> +#include "xe_pm.h"
> +#include "xe_shrinker.h"
> +
> +/**
> + * struct xe_shrinker - per-device shrinker
> + * @xe: Back pointer to the device.
> + * @lock: Lock protecting accounting.
> + * @shrinkable_pages: Number of pages that are currently shrinkable.
> + * @purgeable_pages: Number of pages that are currently purgeable.
> + * @shrink: Pointer to the mm shrinker.
> + * @pm_worker: Worker to wake up the device if required.
> + */
> +struct xe_shrinker {
> +	struct xe_device *xe;
> +	rwlock_t lock;
> +	long shrinkable_pages;
> +	long purgeable_pages;
> +	struct shrinker *shrink;
> +	struct work_struct pm_worker;
> +};
> +
> +/**
> + * struct xe_shrink_lru_walk - lru_walk subclass for shrinker
> + * @walk: The embedded base class.
> + * @xe: Pointer to the xe device.
> + * @purge: Purgeable only request from the srinker.
> + * @writeback: Try to write back to persistent storage.
> + */
> +struct xe_shrink_lru_walk {
> +	struct ttm_lru_walk walk;
> +	struct xe_device *xe;
> +	bool purge;
> +	bool writeback;
> +};
> +
> +static struct xe_shrinker *to_xe_shrinker(struct shrinker *shrink)
> +{
> +	return shrink->private_data;
> +}
> +
> +static struct xe_shrink_lru_walk *
> +to_xe_shrink_lru_walk(struct ttm_lru_walk *walk)
> +{
> +	return container_of(walk, struct xe_shrink_lru_walk, walk);
> +}
> +
> +/**
> + * xe_shrinker_mod_pages() - Modify shrinker page accounting
> + * @shrinker: Pointer to the struct xe_shrinker.
> + * @shrinkable: Shrinkable pages delta. May be negative.
> + * @purgeable: Purgeable page delta. May be negative.
> + *
> + * Modifies the shrinkable and purgeable pages accounting.
> + */
> +void
> +xe_shrinker_mod_pages(struct xe_shrinker *shrinker, long shrinkable, long purgeable)
> +{
> +	write_lock(&shrinker->lock);
> +	shrinker->shrinkable_pages += shrinkable;
> +	shrinker->purgeable_pages += purgeable;
> +	write_unlock(&shrinker->lock);
> +}
> +
> +static long xe_shrinker_process_bo(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo)
> +{
> +	struct xe_shrink_lru_walk *shrink_walk = to_xe_shrink_lru_walk(walk);
> +
> +	return xe_bo_shrink(walk, bo, shrink_walk->purge, shrink_walk->writeback);
> +}
> +
> +static long xe_shrinker_walk(struct xe_shrink_lru_walk *shrink_walk, long target)
> +{
> +	struct xe_device *xe = shrink_walk->xe;
> +	struct ttm_resource_manager *man;
> +	unsigned int mem_type;
> +	long sofar = 0;
> +	long lret;
> +
> +	for (mem_type = XE_PL_SYSTEM; mem_type <= XE_PL_TT; ++mem_type) {
> +		man = ttm_manager_type(&xe->ttm, mem_type);
> +		if (!man || !man->use_tt)
> +			continue;
> +
> +		lret = ttm_lru_walk_for_evict(&shrink_walk->walk, &xe->ttm, man, target);
> +		if (lret < 0)
> +			return lret;
> +
> +		sofar += lret;
> +		if (sofar >= target)
> +			break;
> +	}
> +
> +	return sofar;
> +}
> +
> +static unsigned long
> +xe_shrinker_count(struct shrinker *shrink, struct shrink_control *sc)
> +{
> +	struct xe_shrinker *shrinker = to_xe_shrinker(shrink);
> +	unsigned long num_pages;
> +
> +	num_pages = get_nr_swap_pages();
> +	read_lock(&shrinker->lock);
> +	num_pages = min_t(unsigned long, num_pages, shrinker->shrinkable_pages);
> +	num_pages += shrinker->purgeable_pages;
> +	read_unlock(&shrinker->lock);
> +
> +	return num_pages ? num_pages : SHRINK_EMPTY;
> +}
> +
> +static const struct ttm_lru_walk_ops xe_shrink_ops = {
> +	.process_bo = xe_shrinker_process_bo,
> +};
> +
> +/*
> + * Check if we need runtime pm, and if so try to grab a reference if
> + * already active. If grabbing a reference fails, queue a worker that
> + * does it for us outside of reclaim, but don't wait for it to complete.
> + * If bo shrinking needs an rpm reference and we don't have it (yet),
> + * that bo will be skipped anyway.
> + */
> +static bool xe_shrinker_runtime_pm_get(struct xe_shrinker *shrinker, bool force,
> +				       unsigned long nr_to_scan)
> +{
> +	struct xe_device *xe = shrinker->xe;
> +
> +	if (IS_DGFX(xe) || !xe_device_has_flat_ccs(xe) ||
> +	    !get_nr_swap_pages())
> +		return false;
> +
> +	if (!force) {
> +		read_lock(&shrinker->lock);
> +		force = (nr_to_scan > shrinker->purgeable_pages);
> +		read_unlock(&shrinker->lock);
> +		if (!force)
> +			return false;
> +	}
> +
> +	if (!xe_pm_runtime_get_if_active(xe)) {
> +		queue_work(xe->unordered_wq, &shrinker->pm_worker);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +static void xe_shrinker_runtime_pm_put(struct xe_shrinker *shrinker, bool runtime_pm)
> +{
> +	if (runtime_pm)
> +		xe_pm_runtime_put(shrinker->xe);
> +}
> +
> +static unsigned long xe_shrinker_scan(struct shrinker *shrink, struct shrink_control *sc)
> +{
> +	struct xe_shrinker *shrinker = to_xe_shrinker(shrink);
> +	bool is_kswapd = current_is_kswapd();
> +	struct ttm_operation_ctx ctx = {
> +		.interruptible = false,
> +		.no_wait_gpu = !is_kswapd,
> +	};
> +	unsigned long nr_to_scan, freed = 0;
> +	struct xe_shrink_lru_walk shrink_walk = {
> +		.walk = {
> +			.ops = &xe_shrink_ops,
> +			.ctx = &ctx,
> +			.trylock_only = true,
> +		},
> +		.xe = shrinker->xe,
> +		.purge = true,
> +		.writeback = is_kswapd,
> +	};
> +	bool runtime_pm;
> +	bool purgeable;
> +	long ret;
> +
> +	sc->nr_scanned = 0;
> +	nr_to_scan = sc->nr_to_scan;
> +
> +	read_lock(&shrinker->lock);
> +	purgeable = !!shrinker->purgeable_pages;
> +	read_unlock(&shrinker->lock);
> +
> +	/* Might need runtime PM. Try to wake early if it looks like it. */
> +	runtime_pm = xe_shrinker_runtime_pm_get(shrinker, false, nr_to_scan);
> +
> +	while (purgeable && freed < nr_to_scan) {
> +		ret = xe_shrinker_walk(&shrink_walk, nr_to_scan);
> +		if (ret <= 0)
> +			break;
> +
> +		freed += ret;
> +	}
> +
> +	sc->nr_scanned = freed;
> +	if (freed < nr_to_scan)
> +		nr_to_scan -= freed;
> +	else
> +		nr_to_scan = 0;
> +	if (!nr_to_scan)
> +		goto out;
> +
> +	/* If we didn't wake before, try to do it now if needed. */
> +	if (!runtime_pm)
> +		runtime_pm = xe_shrinker_runtime_pm_get(shrinker, true, 0);
> +
> +	shrink_walk.purge = false;
> +	nr_to_scan = sc->nr_to_scan;
> +	while (freed < nr_to_scan) {
> +		ret = xe_shrinker_walk(&shrink_walk, nr_to_scan);
> +		if (ret <= 0)
> +			break;
> +
> +		freed += ret;
> +	}
> +
> +	sc->nr_scanned = freed;
> +
> +out:
> +	xe_shrinker_runtime_pm_put(shrinker, runtime_pm);
> +	return freed ? freed : SHRINK_STOP;
> +}
> +
> +/* Wake up the device for shrinking. */
> +static void xe_shrinker_pm(struct work_struct *work)
> +{
> +	struct xe_shrinker *shrinker =
> +		container_of(work, typeof(*shrinker), pm_worker);
> +
> +	xe_pm_runtime_get(shrinker->xe);
> +	xe_pm_runtime_put(shrinker->xe);

So I don't really get this. How does this help the shrinker get a PM
ref? Is the idea that in the small window between xe_pm_runtime_get / put the
shrinker grabs one via xe_pm_runtime_get_if_active? Seems fragile.

Matt

> +}
> +
> +/**
> + * xe_shrinker_create() - Create an xe per-device shrinker
> + * @xe: Pointer to the xe device.
> + *
> + * Returns: A pointer to the created shrinker on success,
> + * Negative error code on failure.
> + */
> +struct xe_shrinker *xe_shrinker_create(struct xe_device *xe)
> +{
> +	struct xe_shrinker *shrinker = kzalloc(sizeof(*shrinker), GFP_KERNEL);
> +
> +	if (!shrinker)
> +		return ERR_PTR(-ENOMEM);
> +
> +	shrinker->shrink = shrinker_alloc(0, "xe system shrinker");
> +	if (!shrinker->shrink) {
> +		kfree(shrinker);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	INIT_WORK(&shrinker->pm_worker, xe_shrinker_pm);
> +	shrinker->xe = xe;
> +	rwlock_init(&shrinker->lock);
> +	shrinker->shrink->count_objects = xe_shrinker_count;
> +	shrinker->shrink->scan_objects = xe_shrinker_scan;
> +	shrinker->shrink->private_data = shrinker;
> +	shrinker_register(shrinker->shrink);
> +
> +	return shrinker;
> +}
> +
> +/**
> + * xe_shrinker_destroy() - Destroy an xe per-device shrinker
> + * @shrinker: Pointer to the shrinker to destroy.
> + */
> +void xe_shrinker_destroy(struct xe_shrinker *shrinker)
> +{
> +	xe_assert(shrinker->xe, !shrinker->shrinkable_pages);
> +	xe_assert(shrinker->xe, !shrinker->purgeable_pages);
> +	shrinker_free(shrinker->shrink);
> +	flush_work(&shrinker->pm_worker);
> +	kfree(shrinker);
> +}
> diff --git a/drivers/gpu/drm/xe/xe_shrinker.h b/drivers/gpu/drm/xe/xe_shrinker.h
> new file mode 100644
> index 000000000000..28a038f4fcbf
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_shrinker.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#ifndef _XE_SHRINKER_H_
> +#define _XE_SHRINKER_H_
> +
> +struct xe_shrinker;
> +struct xe_device;
> +
> +void xe_shrinker_mod_pages(struct xe_shrinker *shrinker, long shrinkable, long purgeable);
> +
> +struct xe_shrinker *xe_shrinker_create(struct xe_device *xe);
> +
> +void xe_shrinker_destroy(struct xe_shrinker *shrinker);
> +
> +#endif
> diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
> index e577528f5dfc..c7e81ae025d9 100644
> --- a/include/drm/ttm/ttm_bo.h
> +++ b/include/drm/ttm/ttm_bo.h
> @@ -229,6 +229,9 @@ struct ttm_lru_walk {
>  long ttm_lru_walk_for_evict(struct ttm_lru_walk *walk, struct ttm_device *bdev,
>  			    struct ttm_resource_manager *man, long target);
>  
> +long ttm_bo_try_shrink(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo,
> +		       bool purge, bool writeback);
> +
>  /**
>   * ttm_bo_get - reference a struct ttm_buffer_object
>   *
> -- 
> 2.44.0
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 12/12] drm/xe: Increase the XE_PL_TT watermark
  2024-08-07 23:13     ` Matthew Brost
@ 2024-08-09 12:22       ` Thomas Hellström
  0 siblings, 0 replies; 38+ messages in thread
From: Thomas Hellström @ 2024-08-09 12:22 UTC (permalink / raw)
  To: Matthew Brost, Souza, Jose
  Cc: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	christian.koenig@amd.com, Amaranath.Somalapuram@amd.com

On Wed, 2024-08-07 at 23:13 +0000, Matthew Brost wrote:
> On Mon, Aug 05, 2024 at 12:35:34PM -0600, Souza, Jose wrote:
> > On Wed, 2024-07-03 at 17:38 +0200, Thomas Hellström wrote:
> > > The XE_PL_TT watermark was set to 50% of system memory.
> > > The idea behind that was unclear since the net effect is that
> > > TT memory will be evicted to TTM_PL_SYSTEM memory if that
> > > watermark is exceeded, requiring PPGTT rebinds and dma
> > > remapping. But there is no similar watermark for TTM_PL_SYSTEM
> > > memory.
> > > 
> > > The TTM functionality that tries to swap out system memory to
> > > shmem objects if a 50% limit of total system memory is reached
> > > is orthogonal to this, and with the shrinker added, it's no
> > > longer in effect.
> > > 
> > > Replace the 50% TTM_PL_TT limit with a 100% limit, in effect
> > > allowing all graphics memory to be bound to the device unless it
> > > has been swapped out by the shrinker.
> > 
> > Sorry if I missed some patch changing it but I did not find in
> > this series anything changing the 50% limit in ttm_global_init().
> > When I debugged some Vulkan tests that allocate a lot of memory, the
> > reason that the KMD was not allocating memory was this ttm_global
> > limit that is shared with all devices using TTM.
> > 
> 
> I'm reviewing this series and starting make sense of all this.
> 
> Thomas please correct me if I'm wrong here...
> 
> The limit set in ttm_global_init is the watermark for the TTM pool
> where
> if exceeded upon freeing a BO's pages the pages are actually freed
> rather than just returning to the TTM pool cache. The global
> watermark
> is reason why in issue #2438 it observed a bunch of memory is still
> consumed when nothing is running or any BOs exist - pages are being
> cached in the TTM pool. 

This is correct.



> The global watermark doesn't actually limit the
> amount of system memory TTM can allocate. A shrinker also exists which
> can
> free cached pages in the TTM pool if memory pressure exists or 'echo
> 3 >
> /proc/sys/vm/drop_caches' is done.

Yes, this is also true, except that this global watermark should rather be
called the pool watermark.

There is another global watermark: if the number of pages used for
graphics (PL_SYSTEM or PL_TT) exceeds 50% of system memory, bo swapping
starts. That means bos are pulled off the various LRU lists in a device
round-robin fashion and moved to shmem objects, in the anticipation
that these shmem objects can then be paged out to disk by the core.
However, this is done even if there is no disk swap space attached.
Also, if this shmem swapping fails, nothing happens and the allocation
is free to proceed anyway.
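
If I remember correctly that check sits in the tt population path and looks
roughly like this (from memory, so the details may be off):

	while (atomic_long_read(&ttm_pages_allocated) > ttm_pages_limit) {
		ret = ttm_global_swapout(ctx, GFP_KERNEL);
		if (ret <= 0)
			break;	/* Nothing (more) to swap; allocation proceeds anyway. */
	}

with ttm_pages_limit defaulting to 50% of totalram.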

This is what I typically refer to as the global watermark. It used to
be implemented by an opportunistic swapper process (similar to kswapd)
and a direct swapper (similar to direct reclaim) and the 50% limit was
configurable, but much of that functionality was ripped out. I didn't
follow the discussion preceding that change, though.


> 
> The watermark changed in this patch, is the actual limit for the
> number
> of pages we can allocate for BOs.

What is changed in this patch is actually the amount of memory we can
have in the TT placement and therefore also bound to the device. If
this limit is exceeded, eviction from TT to SYSTEM will take place in
addition to the above global swapping. This limit is per device. So now
we can in theory have all of system memory bound to a device.

>  With a shrinker hooked into BOs, we
> now can freely allocate all of the system pages for BOs and if memory
> pressure exists idle BOs pages are swapped to shmem via the shrinker
> and
> restored upon next GPU use.

Exactly.

> 
> Matt

/Thomas

> 
> > > 
> > > Signed-off-by: Thomas Hellström
> > > <thomas.hellstrom@linux.intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_ttm_sys_mgr.c | 3 +--
> > >  1 file changed, 1 insertion(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> > > b/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> > > index 9844a8edbfe1..d38b91872da3 100644
> > > --- a/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> > > +++ b/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> > > @@ -108,9 +108,8 @@ int xe_ttm_sys_mgr_init(struct xe_device *xe)
> > >  	u64 gtt_size;
> > >  
> > >  	si_meminfo(&si);
> > > +	/* Potentially restrict amount of TT memory here. */
> > >  	gtt_size = (u64)si.totalram * si.mem_unit;
> > > -	/* TTM limits allocation of all TTM devices by 50% of
> > > system memory */
> > > -	gtt_size /= 2;
> > >  
> > >  	man->use_tt = true;
> > >  	man->func = &xe_ttm_sys_mgr_func;
> > 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 10/12] drm/ttm: Use fault-injection to test error paths
  2024-08-07 23:43   ` Matthew Brost
@ 2024-08-09 13:53     ` Thomas Hellström
  2024-08-09 16:40       ` Matthew Brost
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Hellström @ 2024-08-09 13:53 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, Christian König, Somalapuram Amaranath, dri-devel

On Wed, 2024-08-07 at 23:43 +0000, Matthew Brost wrote:
> On Wed, Jul 03, 2024 at 05:38:11PM +0200, Thomas Hellström wrote:
> > Use fault-injection to test partial TTM swapout and interrupted
> > swapin.
> > Return -EINTR for swapin to test the callers ability to handle and
> > restart the swapin, and on swapout perform a partial swapout to
> > test that
> > the swapin and release_shrunken functionality.
> > 
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: <dri-devel@lists.freedesktop.org>
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> >  drivers/gpu/drm/Kconfig        | 10 ++++++++++
> >  drivers/gpu/drm/ttm/ttm_pool.c | 17 ++++++++++++++++-
> >  2 files changed, 26 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> > index fd0749c0c630..9f27271bfab8 100644
> > --- a/drivers/gpu/drm/Kconfig
> > +++ b/drivers/gpu/drm/Kconfig
> > @@ -272,6 +272,16 @@ config DRM_GPUVM
> >  	  GPU-VM representation providing helpers to manage a GPUs
> > virtual
> >  	  address space
> >  
> > +config DRM_TTM_BACKUP_FAULT_INJECT
> > +	bool "Enable fault injection during TTM backup"
> > +	depends on DRM_TTM
> > +	default n
> > +	help
> > +	  Inject recoverable failures during TTM backup and
> > recovery of
> > +	  backed-up objects. For DRM driver developers only.
> > +
> > +	  If in doubt, choose N.
> > +
> >  config DRM_BUDDY
> >  	tristate
> >  	depends on DRM
> > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c
> > b/drivers/gpu/drm/ttm/ttm_pool.c
> > index 38e50cf81b0a..d32a1f2e5e50 100644
> > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > @@ -431,6 +431,7 @@ static int ttm_pool_restore_tt(struct
> > ttm_pool_tt_restore *restore,
> >  			       struct ttm_backup *backup,
> >  			       struct ttm_operation_ctx *ctx)
> >  {
> > +	static unsigned long __maybe_unused swappedin;
> >  	unsigned int i, nr = 1 << restore->order;
> >  	int ret = 0;
> >  
> > @@ -446,6 +447,13 @@ static int ttm_pool_restore_tt(struct
> > ttm_pool_tt_restore *restore,
> >  			if (handle == 0)
> >  				continue;
> >  
> > +			if
> > (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT) &&
> > +			    ctx->interruptible &&
> > +			    ++swappedin % 100 == 0) {
> > +				ret = -EINTR;
> > +				break;
> > +			}
> 
> So here this -EINTR would be kicked to the user IOCTL which triggered
> the BO validate and retry? The restore then should be able to
> successfully pick up where it left off?

Yes, that's the point. For the direct swap-cache backend I initially
used (before concluding that the shmem one actually seemed to work
fine), we had an interruptible wait here.

Supporting interrupts is generally a good thing but for the pool code,
this makes the code considerably more complicated. However, this is a
good way to ensure drivers actually support -EINTR for the call chain.
If not, adding interrupt capability "later" will most likely be a PITA.
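
So on the driver side the expectation is really just to propagate the error,
roughly like below (sketch, where 'placement' is whatever placement the ioctl
asked for):

	ret = ttm_bo_validate(&bo->ttm, placement, &ctx);
	if (ret == -EINTR || ret == -ERESTARTSYS)
		return ret;	/* Restart the ioctl; the restore resumes where it stopped. */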

> 
> > +
> >  			ret = backup->ops->copy_backed_up_page
> >  				(backup, restore->first_page[i],
> >  				 handle, ctx->interruptible);
> > @@ -892,7 +900,14 @@ long ttm_pool_backup_tt(struct ttm_pool *pool,
> > struct ttm_tt *ttm, bool purge,
> >  
> >  	alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN |
> > __GFP_RETRY_MAYFAIL;
> >  
> > -	for (i = 0; i < ttm->num_pages; ++i) {
> > +	num_pages = ttm->num_pages;
> > +
> > +	/* Pretend doing fault injection by shrinking only half of
> > the pages. */
> > +
> > +	if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT))
> > +		num_pages = DIV_ROUND_UP(num_pages, 2);
> 
> So what happens here? Half the pages swapped out, then upon restore
> half
> swapped back in? The shrinker continues to walk until enough pages
> swapped out?

Yes, exactly. Ideally we'd want some intermediate state here so that a
partially swapped out bo is still eligible for further shrinking.

/Thomas


> 
> Matt
> 
> > +
> > +	for (i = 0; i < num_pages; ++i) {
> >  		page = ttm->pages[i];
> >  		if (unlikely(!page))
> >  			continue;
> > -- 
> > 2.44.0
> > 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 12/12] drm/xe: Increase the XE_PL_TT watermark
  2024-08-07 23:44   ` Matthew Brost
@ 2024-08-09 13:53     ` Thomas Hellström
  0 siblings, 0 replies; 38+ messages in thread
From: Thomas Hellström @ 2024-08-09 13:53 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, Somalapuram Amaranath, Christian König, dri-devel

On Wed, 2024-08-07 at 23:44 +0000, Matthew Brost wrote:
> On Wed, Jul 03, 2024 at 05:38:13PM +0200, Thomas Hellström wrote:
> > The XE_PL_TT watermark was set to 50% of system memory.
> > The idea behind that was unclear since the net effect is that
> > TT memory will be evicted to TTM_PL_SYSTEM memory if that
> > watermark is exceeded, requiring PPGTT rebinds and dma
> > remapping. But there is no similar watermark for TTM_PL_SYSTEM
> > memory.
> > 
> > The TTM functionality that tries to swap out system memory to
> > shmem objects if a 50% limit of total system memory is reached
> > is orthogonal to this, and with the shrinker added, it's no
> > longer in effect.
> > 
> > Replace the 50% TTM_PL_TT limit with a 100% limit, in effect
> > allowing all graphics memory to be bound to the device unless it
> > has been swapped out by the shrinker.
> > 
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> 
> Reviewed-by: Matthew Brost <matthew.brost@intel.com>

Thanks for reviewing!
/Thomas


> 
> > ---
> >  drivers/gpu/drm/xe/xe_ttm_sys_mgr.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> > b/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> > index 9844a8edbfe1..d38b91872da3 100644
> > --- a/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> > +++ b/drivers/gpu/drm/xe/xe_ttm_sys_mgr.c
> > @@ -108,9 +108,8 @@ int xe_ttm_sys_mgr_init(struct xe_device *xe)
> >  	u64 gtt_size;
> >  
> >  	si_meminfo(&si);
> > +	/* Potentially restrict amount of TT memory here. */
> >  	gtt_size = (u64)si.totalram * si.mem_unit;
> > -	/* TTM limits allocation of all TTM devices by 50% of
> > system memory */
> > -	gtt_size /= 2;
> >  
> >  	man->use_tt = true;
> >  	man->func = &xe_ttm_sys_mgr_func;
> > -- 
> > 2.44.0
> > 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 11/12] drm/ttm, drm/xe: Add a shrinker for xe bos
  2024-08-08  1:37   ` Matthew Brost
@ 2024-08-09 14:31     ` Thomas Hellström
  2024-08-09 17:22       ` Matthew Brost
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Hellström @ 2024-08-09 14:31 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, Christian König, Somalapuram Amaranath, dri-devel

On Thu, 2024-08-08 at 01:37 +0000, Matthew Brost wrote:
> On Wed, Jul 03, 2024 at 05:38:12PM +0200, Thomas Hellström wrote:
> > Rather than relying on the TTM watermark accounting add a shrinker
> > for xe_bos in TT or system memory.
> > 
> > Leverage the newly added TTM per-page shrinking and shmem backup
> > support.
> > 
> > Although xe doesn't fully support WONTNEED (purgeable) bos yet,
> > introduce and add shrinker support for purgeable ttm_tts.
> > 
> > v2:
> > - Cleanups bugfixes and a KUNIT shrinker test.
> > - Add writeback support, and activate if kswapd.
> > v3:
> > - Move the try_shrink() helper to core TTM.
> > - Minor cleanups.
> > v4:
> > - Add runtime pm for the shrinker. Shrinking may require an active
> >   device for CCS metadata copying.
> > v5:
> > - Separately purge ghost- and zombie objects in the shrinker.
> > - Fix a format specifier - type inconsistency. (Kernel test robot).
> > 
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: <dri-devel@lists.freedesktop.org>
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> >  drivers/gpu/drm/ttm/ttm_bo_util.c     |  67 ++++++
> >  drivers/gpu/drm/xe/Makefile           |   1 +
> >  drivers/gpu/drm/xe/tests/xe_bo.c      | 118 +++++++++++
> >  drivers/gpu/drm/xe/tests/xe_bo_test.c |   1 +
> >  drivers/gpu/drm/xe/tests/xe_bo_test.h |   1 +
> >  drivers/gpu/drm/xe/xe_bo.c            | 155 ++++++++++++--
> >  drivers/gpu/drm/xe/xe_bo.h            |  26 +++
> >  drivers/gpu/drm/xe/xe_device.c        |   8 +
> >  drivers/gpu/drm/xe/xe_device_types.h  |   2 +
> >  drivers/gpu/drm/xe/xe_shrinker.c      | 287
> > ++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/xe_shrinker.h      |  18 ++
> >  include/drm/ttm/ttm_bo.h              |   3 +
> >  12 files changed, 671 insertions(+), 16 deletions(-)
> >  create mode 100644 drivers/gpu/drm/xe/xe_shrinker.c
> >  create mode 100644 drivers/gpu/drm/xe/xe_shrinker.h
> > 
> > diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c
> > b/drivers/gpu/drm/ttm/ttm_bo_util.c
> > index c4f678f30fc2..563e96a4cf06 100644
> > --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> > +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> > @@ -924,3 +924,70 @@ long ttm_lru_walk_for_evict(struct
> > ttm_lru_walk *walk, struct ttm_device *bdev,
> >  
> >  	return progress;
> >  }
> > +EXPORT_SYMBOL(ttm_lru_walk_for_evict);
> > +
> > +/**
> > + * ttm_bo_try_shrink - LRU walk helper to shrink a ttm buffer
> > object.
> > + * @walk: The struct xe_ttm_lru_walk that describes the walk.
> > + * @bo: The buffer object.
> > + * @purge: Whether to attempt to purge the bo content since it's
> > no
> > + * longer needed.
> > + * @writeback: If !@purge, attempt to write out to persistent
> > storage.
> > + *
> > + * The function uses the ttm_tt_back_up functionality to back up
> > or
> > + * purge a struct ttm_tt. If the bo is not in system, it's first
> > + * moved there.
> > + *
> > + * Return: The number of pages shrunken or purged, or
> > + * negative error code on failure.
> > + */
> > +long ttm_bo_try_shrink(struct ttm_lru_walk *walk, struct
> > ttm_buffer_object *bo,
> > +		       bool purge, bool writeback)
> > +{
> > +	static const struct ttm_place sys_placement_flags = {
> > +		.fpfn = 0,
> > +		.lpfn = 0,
> > +		.mem_type = TTM_PL_SYSTEM,
> > +		.flags = 0,
> > +	};
> > +	static struct ttm_placement sys_placement = {
> > +		.num_placement = 1,
> > +		.placement = &sys_placement_flags,
> > +	};
> > +	struct ttm_operation_ctx *ctx = walk->ctx;
> > +	struct ttm_tt *tt = bo->ttm;
> > +	long lret;
> > +
> > +	dma_resv_assert_held(bo->base.resv);
> > +
> > +	if (!tt || !ttm_tt_is_populated(tt))
> > +		return 0;
> > +
> > +	if (bo->resource->mem_type != TTM_PL_SYSTEM) {
> > +		int ret = ttm_bo_validate(bo, &sys_placement,
> > ctx);
> > +
> > +		if (ret) {
> > +			if (ret == -EINTR || ret == -EDEADLK ||
> > +			    ret == -ERESTARTSYS)
> > +				return ret;
> 
> Can you explain the various error code returns / suppression in this
> function?

Want me to add a comment in the code or explain inline here? Anyway, the
error codes returned are the ones for which the caller wants to restart
(signal delivery or deadlock). For other errors we just move on to the next
bo on the LRU list.
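
So from the LRU walk's point of view it boils down to roughly (sketch):

	lret = ttm_bo_try_shrink(walk, bo, purge, writeback);
	if (lret < 0)		/* -EINTR, -EDEADLK or -ERESTARTSYS */
		return lret;	/* Abort the walk so the caller can restart. */
	/* lret == 0: this bo couldn't be shrunk; continue with the next LRU entry. */
	/* lret > 0: number of pages shrunk; counts towards the walk target. */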

> 
> > +			return 0;
> > +		}
> > +	}
> > +
> > +	lret = ttm_bo_wait_ctx(bo, ctx);
> > +	if (lret < 0) {
> > +		if (lret == -ERESTARTSYS)
> > +			return lret;
> > +		return 0;
> > +	}
> > +
> > +	if (bo->deleted)
> > +		lret = ttm_tt_backup(bo->bdev, tt, true,
> > writeback);
> > +	else
> > +		lret = ttm_tt_backup(bo->bdev, tt, purge,
> > writeback);
> 
> Hmm, missed this in my previous review. It is frowned upon having
> multiple bools as arguments. Could this be reworked with flags? Same
> goes for all functions in the series with multiple bool arguments.

I agree. I'll see if I can make this look better.

> 
> > +	if (lret < 0 && lret != -EINTR)
> > +		return 0;
> > +
> > +	return lret;
> > +}
> > +EXPORT_SYMBOL(ttm_bo_try_shrink);
> > diff --git a/drivers/gpu/drm/xe/Makefile
> > b/drivers/gpu/drm/xe/Makefile
> > index b1e03bfe4a68..1eba51bdd172 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -112,6 +112,7 @@ xe-y += xe_bb.o \
> >  	xe_ring_ops.o \
> >  	xe_sa.o \
> >  	xe_sched_job.o \
> > +	xe_shrinker.o \
> >  	xe_step.o \
> >  	xe_sync.o \
> >  	xe_tile.o \
> > diff --git a/drivers/gpu/drm/xe/tests/xe_bo.c
> > b/drivers/gpu/drm/xe/tests/xe_bo.c
> > index 9f3c02826464..49617f16dc76 100644
> > --- a/drivers/gpu/drm/xe/tests/xe_bo.c
> > +++ b/drivers/gpu/drm/xe/tests/xe_bo.c
> > @@ -6,6 +6,8 @@
> >  #include <kunit/test.h>
> >  #include <kunit/visibility.h>
> >  
> > +#include <uapi/linux/sysinfo.h>
> > +
> >  #include "tests/xe_bo_test.h"
> >  #include "tests/xe_pci_test.h"
> >  #include "tests/xe_test.h"
> > @@ -350,3 +352,119 @@ void xe_bo_evict_kunit(struct kunit *test)
> >  	xe_call_for_each_device(evict_test_run_device);
> >  }
> >  EXPORT_SYMBOL_IF_KUNIT(xe_bo_evict_kunit);
> > +
> > +struct xe_bo_link {
> > +	struct list_head link;
> > +	struct xe_bo *bo;
> > +};
> > +
> > +#define XE_BO_SHRINK_SIZE ((unsigned long)SZ_64M)
> > +
> > +/*
> > + * Try to create system bos corresponding to twice the amount
> > + * of available system memory to test shrinker functionality.
> > + * If no swap space is available to accommodate the
> > + * memory overcommit, mark bos purgeable.
> > + */
> > +static int shrink_test_run_device(struct xe_device *xe)
> > +{
> > +	struct kunit *test = xe_cur_kunit();
> > +	LIST_HEAD(bos);
> > +	struct xe_bo_link *link, *next;
> > +	struct sysinfo si;
> > +	size_t total, alloced;
> > +	unsigned int interrupted = 0, successful = 0;
> > +
> > +	si_meminfo(&si);
> > +	total = si.freeram * si.mem_unit;
> > +
> > +	kunit_info(test, "Free ram is %lu bytes. Will allocate
> > twice of that.\n",
> > +		   (unsigned long) total);
> > +
> > +	total <<= 1;
> > +	for (alloced = 0; alloced < total ; alloced +=
> > XE_BO_SHRINK_SIZE) {
> > +		struct xe_bo *bo;
> > +		unsigned int mem_type;
> > +
> > +		link = kzalloc(sizeof(*link), GFP_KERNEL);
> > +		if (!link) {
> > +			KUNIT_FAIL(test, "Unexpeced link
> > allocation failure\n");
> > +			break;
> > +		}
> > +
> > +		INIT_LIST_HEAD(&link->link);
> > +
> > +		/* We can create bos using WC caching here. But it
> > is slower. */
> > +		bo = xe_bo_create_user(xe, NULL, NULL,
> > XE_BO_SHRINK_SIZE,
> > +				       DRM_XE_GEM_CPU_CACHING_WB,
> > +				       ttm_bo_type_device,
> > +				       XE_BO_FLAG_SYSTEM);
> > +		if (IS_ERR(bo)) {
> > +			if (bo != ERR_PTR(-ENOMEM) && bo !=
> > ERR_PTR(-ENOSPC) &&
> > +			    bo != ERR_PTR(-EINTR) && bo !=
> > ERR_PTR(-ERESTARTSYS))
> > +				KUNIT_FAIL(test, "Error creating
> > bo: %pe\n", bo);
> > +			kfree(link);
> > +			break;
> > +		}
> > +		link->bo = bo;
> > +		list_add_tail(&link->link, &bos);
> > +		xe_bo_lock(bo, false);
> > +
> > +		/*
> > +		 * If we're low on swap entries, we can't shrink
> > unless the bo
> > +		 * is marked purgeable.
> > +		 */
> > +		if (get_nr_swap_pages() < (XE_BO_SHRINK_SIZE >>
> > PAGE_SHIFT) * 128) {
> > +			struct xe_ttm_tt *xe_tt =
> > +				container_of(bo->ttm.ttm,
> > typeof(*xe_tt), ttm);
> > +			long num_pages = xe_tt->ttm.num_pages;
> > +
> > +			xe_tt->purgeable = true;
> > +			xe_shrinker_mod_pages(xe->mem.shrinker, -
> > num_pages,
> > +					      num_pages);
> > +		}
> > +
> > +		mem_type = bo->ttm.resource->mem_type;
> > +		xe_bo_unlock(bo);
> > +		if (mem_type != XE_PL_TT)
> > +			KUNIT_FAIL(test, "Bo in incorrect memory
> > type: %u\n",
> > +				   bo->ttm.resource->mem_type);
> > +		cond_resched();
> > +		if (signal_pending(current))
> > +			break;
> > +	}
> > +
> > +	/* Read back and destroy bos */
> > +	list_for_each_entry_safe_reverse(link, next, &bos, link) {
> > +		static struct ttm_operation_ctx ctx =
> > {.interruptible = true};
> > +		struct xe_bo *bo = link->bo;
> > +		int ret;
> > +
> > +		if (!signal_pending(current)) {
> > +			xe_bo_lock(bo, NULL);
> > +			ret = ttm_bo_validate(&bo->ttm,
> > &tt_placement, &ctx);
> > +			xe_bo_unlock(bo);
> > +			if (ret && ret != -EINTR)
> > +				KUNIT_FAIL(test, "Validation
> > failed: %pe\n",
> > +					   ERR_PTR(ret));
> > +			else if (ret)
> > +				interrupted++;
> > +			else
> > +				successful++;
> > +		}
> > +		xe_bo_put(link->bo);
> > +		list_del(&link->link);
> > +		kfree(link);
> > +		cond_resched();
> > +	}
> > +	kunit_info(test, "Readbacks interrupted: %u successful:
> > %u\n",
> > +		   interrupted, successful);
> > +
> > +	return 0;
> > +}
> > +
> > +void xe_bo_shrink_kunit(struct kunit *test)
> > +{
> > +	xe_call_for_each_device(shrink_test_run_device);
> > +}
> > +EXPORT_SYMBOL_IF_KUNIT(xe_bo_shrink_kunit);
> > diff --git a/drivers/gpu/drm/xe/tests/xe_bo_test.c
> > b/drivers/gpu/drm/xe/tests/xe_bo_test.c
> > index a324cde77db8..317fa923e287 100644
> > --- a/drivers/gpu/drm/xe/tests/xe_bo_test.c
> > +++ b/drivers/gpu/drm/xe/tests/xe_bo_test.c
> > @@ -10,6 +10,7 @@
> >  static struct kunit_case xe_bo_tests[] = {
> >  	KUNIT_CASE(xe_ccs_migrate_kunit),
> >  	KUNIT_CASE(xe_bo_evict_kunit),
> > +	KUNIT_CASE_SLOW(xe_bo_shrink_kunit),
> >  	{}
> >  };
> >  
> > diff --git a/drivers/gpu/drm/xe/tests/xe_bo_test.h
> > b/drivers/gpu/drm/xe/tests/xe_bo_test.h
> > index 0113ab45066a..7f44d14a45c5 100644
> > --- a/drivers/gpu/drm/xe/tests/xe_bo_test.h
> > +++ b/drivers/gpu/drm/xe/tests/xe_bo_test.h
> > @@ -10,5 +10,6 @@ struct kunit;
> >  
> >  void xe_ccs_migrate_kunit(struct kunit *test);
> >  void xe_bo_evict_kunit(struct kunit *test);
> > +void xe_bo_shrink_kunit(struct kunit *test);
> >  
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_bo.c
> > b/drivers/gpu/drm/xe/xe_bo.c
> > index 65c696966e96..6ab63d1642ae 100644
> > --- a/drivers/gpu/drm/xe/xe_bo.c
> > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > @@ -10,6 +10,7 @@
> >  #include <drm/drm_drv.h>
> >  #include <drm/drm_gem_ttm_helper.h>
> >  #include <drm/drm_managed.h>
> > +#include <drm/ttm/ttm_backup.h>
> >  #include <drm/ttm/ttm_device.h>
> >  #include <drm/ttm/ttm_placement.h>
> >  #include <drm/ttm/ttm_tt.h>
> > @@ -25,6 +26,7 @@
> >  #include "xe_pm.h"
> >  #include "xe_preempt_fence.h"
> >  #include "xe_res_cursor.h"
> > +#include "xe_shrinker.h"
> >  #include "xe_trace_bo.h"
> >  #include "xe_ttm_stolen_mgr.h"
> >  #include "xe_vm.h"
> > @@ -278,11 +280,15 @@ static void xe_evict_flags(struct
> > ttm_buffer_object *tbo,
> >  	}
> >  }
> >  
> > +/* struct xe_ttm_tt - Subclassed ttm_tt for xe */
> >  struct xe_ttm_tt {
> >  	struct ttm_tt ttm;
> > -	struct device *dev;
> > +	/** @xe - The xe device */
> > +	struct xe_device *xe;
> >  	struct sg_table sgt;
> >  	struct sg_table *sg;
> > +	/** @purgeable - Whether the bo is purgeable (WONTNEED) */
> 
> So we need to add WONTNEED to our uAPI, right?

Not strictly, but we have ongoing work to implement it. Bos that UMDs
pool are typically WONTNEED, and unless we mark them as such, shrinking
them only works when swap space is available, and then only with a
completely unnecessary copy...
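
For reference, and assuming a to-be-defined uAPI for marking a bo
WONTNEED, the driver side would be small. A minimal sketch
(xe_ttm_tt_mark_purgeable() is a hypothetical helper; the functions it
calls are from this patch):

	/* Hypothetical helper, sketch only: called with the bo resv held. */
	static void xe_ttm_tt_mark_purgeable(struct xe_ttm_tt *xe_tt)
	{
		long num_pages = xe_tt->ttm.num_pages;

		if (xe_tt->purgeable)
			return;

		xe_tt->purgeable = true;

		/* Move the pages from the "needs swap" to the "can just be freed" count. */
		if (ttm_tt_is_populated(&xe_tt->ttm))
			xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker,
					      -num_pages, num_pages);
	}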

> 
> > +	bool purgeable;
> >  };
> >  
> >  static int xe_tt_map_sg(struct ttm_tt *tt)
> > @@ -291,7 +297,8 @@ static int xe_tt_map_sg(struct ttm_tt *tt)
> >  	unsigned long num_pages = tt->num_pages;
> >  	int ret;
> >  
> > -	XE_WARN_ON(tt->page_flags & TTM_TT_FLAG_EXTERNAL);
> > +	XE_WARN_ON((tt->page_flags & TTM_TT_FLAG_EXTERNAL) &&
> > +		   !(tt->page_flags &
> > TTM_TT_FLAG_EXTERNAL_MAPPABLE));
> >  
> >  	if (xe_tt->sg)
> >  		return 0;
> > @@ -299,13 +306,13 @@ static int xe_tt_map_sg(struct ttm_tt *tt)
> >  	ret = sg_alloc_table_from_pages_segment(&xe_tt->sgt, tt->pages,
> >  						num_pages, 0,
> >  						(u64)num_pages <<
> > PAGE_SHIFT,
> > -						xe_sg_segment_size(xe_tt->dev),
> > +						xe_sg_segment_size(xe_tt->xe->drm.dev),
> >  						GFP_KERNEL);
> >  	if (ret)
> >  		return ret;
> >  
> >  	xe_tt->sg = &xe_tt->sgt;
> > -	ret = dma_map_sgtable(xe_tt->dev, xe_tt->sg,
> > DMA_BIDIRECTIONAL,
> > +	ret = dma_map_sgtable(xe_tt->xe->drm.dev, xe_tt->sg,
> > DMA_BIDIRECTIONAL,
> >  			      DMA_ATTR_SKIP_CPU_SYNC);
> >  	if (ret) {
> >  		sg_free_table(xe_tt->sg);
> > @@ -321,7 +328,7 @@ static void xe_tt_unmap_sg(struct ttm_tt *tt)
> >  	struct xe_ttm_tt *xe_tt = container_of(tt, struct
> > xe_ttm_tt, ttm);
> >  
> >  	if (xe_tt->sg) {
> > -		dma_unmap_sgtable(xe_tt->dev, xe_tt->sg,
> > +		dma_unmap_sgtable(xe_tt->xe->drm.dev, xe_tt->sg,
> >  				  DMA_BIDIRECTIONAL, 0);
> >  		sg_free_table(xe_tt->sg);
> >  		xe_tt->sg = NULL;
> > @@ -336,21 +343,41 @@ struct sg_table *xe_bo_sg(struct xe_bo *bo)
> >  	return xe_tt->sg;
> >  }
> >  
> > +/*
> > + * Account ttm pages against the device shrinker's shrinkable and
> > + * purgeable counts.
> > + */
> > +static void xe_ttm_tt_account(struct ttm_tt *tt, bool add)
> > +{
> 
> Again I think bools are frowned upon as arguments. Maybe just two
> functions - add / sub?

OK,
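
Something like this, perhaps (sketch only, names not final):

	/* Sketch of the split, instead of the bool parameter. */
	static void xe_ttm_tt_account_add(struct ttm_tt *tt)
	{
		struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);

		if (xe_tt->purgeable)
			xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, 0, tt->num_pages);
		else
			xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, tt->num_pages, 0);
	}

	static void xe_ttm_tt_account_subtract(struct ttm_tt *tt)
	{
		struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);

		if (xe_tt->purgeable)
			xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, 0,
					      -(long)tt->num_pages);
		else
			xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker,
					      -(long)tt->num_pages, 0);
	}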

> 
> > +	struct xe_ttm_tt *xe_tt = container_of(tt, struct
> > xe_ttm_tt, ttm);
> > +	long num_pages = tt->num_pages;
> > +
> > +	if (!add)
> > +		num_pages = -num_pages;
> > +
> > +	if (xe_tt->purgeable)
> > +		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, 0,
> > num_pages);
> > +	else
> > +		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker,
> > num_pages, 0);
> > +}
> > +
> >  static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object
> > *ttm_bo,
> >  				       u32 page_flags)
> >  {
> >  	struct xe_bo *bo = ttm_to_xe_bo(ttm_bo);
> >  	struct xe_device *xe = xe_bo_device(bo);
> > -	struct xe_ttm_tt *tt;
> > +	struct xe_ttm_tt *xe_tt;
> > +	struct ttm_tt *tt;
> >  	unsigned long extra_pages;
> >  	enum ttm_caching caching;
> >  	int err;
> >  
> > -	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
> > -	if (!tt)
> > +	xe_tt = kzalloc(sizeof(*xe_tt), GFP_KERNEL);
> > +	if (!xe_tt)
> >  		return NULL;
> >  
> > -	tt->dev = xe->drm.dev;
> > +	tt = &xe_tt->ttm;
> > +	xe_tt->xe = xe;
> >  
> >  	extra_pages = 0;
> >  	if (xe_bo_needs_ccs_pages(bo))
> > @@ -387,42 +414,128 @@ static struct ttm_tt
> > *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
> >  		caching = ttm_uncached;
> >  	}
> >  
> > -	err = ttm_tt_init(&tt->ttm, &bo->ttm, page_flags, caching,
> > extra_pages);
> > +	if (ttm_bo->type != ttm_bo_type_sg)
> > +		page_flags |= TTM_TT_FLAG_EXTERNAL |
> > TTM_TT_FLAG_EXTERNAL_MAPPABLE;
> > +
> > +	err = ttm_tt_init(tt, &bo->ttm, page_flags, caching,
> > extra_pages);
> >  	if (err) {
> > -		kfree(tt);
> > +		kfree(xe_tt);
> >  		return NULL;
> >  	}
> >  
> > -	return &tt->ttm;
> > +	tt->backup = ttm_backup_shmem_create(tt->num_pages <<
> > PAGE_SHIFT);
> > +	if (IS_ERR(tt->backup)) {
> > +		ttm_tt_fini(tt);
> 
> Mentioned this the previous review I think you need set tt->backup to
> NULL here or update ttm_tt_fini to understand IS_ERR(tt->backup).
> 
> Also, maybe a dumb question, but could we just have a global backup for
> all BOs? Would that be better than each BO creating its own backup?

I initially made the code like that, back when we had a global backend
that inserted directly into the swap cache. But I figure shmem wants one
file per bo; otherwise we'd need one gigantic shmem object, and I'm not
sure that works...
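
So the per-bo variant boils down to roughly the below (rough sketch
only, not the actual ttm_backup_shmem_create() implementation):

	/* One shmem file per tt, sized to the tt. */
	struct file *backup;

	backup = shmem_file_setup("ttm-backup",
				  (loff_t)tt->num_pages << PAGE_SHIFT, 0);
	if (IS_ERR(backup))
		return ERR_CAST(backup);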

> 
> > +		kfree(xe_tt);
> > +		return NULL;
> > +	}
> > +
> > +	return tt;
> >  }
> >  
> >  static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct
> > ttm_tt *tt,
> >  			      struct ttm_operation_ctx *ctx)
> >  {
> > +	struct xe_ttm_tt *xe_tt = container_of(tt, struct
> > xe_ttm_tt, ttm);
> >  	int err;
> >  
> >  	/*
> >  	 * dma-bufs are not populated with pages, and the dma-
> >  	 * addresses are set up when moved to XE_PL_TT.
> >  	 */
> > -	if (tt->page_flags & TTM_TT_FLAG_EXTERNAL)
> > +	if ((tt->page_flags & TTM_TT_FLAG_EXTERNAL) &&
> > +	    !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE))
> >  		return 0;
> >  
> >  	err = ttm_pool_alloc(&ttm_dev->pool, tt, ctx);
> >  	if (err)
> >  		return err;
> >  
> > -	return err;
> > +	xe_tt->purgeable = false;
> > +	xe_ttm_tt_account(tt, true);
> > +
> > +	return 0;
> >  }
> >  
> >  static void xe_ttm_tt_unpopulate(struct ttm_device *ttm_dev,
> > struct ttm_tt *tt)
> >  {
> > -	if (tt->page_flags & TTM_TT_FLAG_EXTERNAL)
> > +	if ((tt->page_flags & TTM_TT_FLAG_EXTERNAL) &&
> > +	    !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE))
> >  		return;
> >  
> >  	xe_tt_unmap_sg(tt);
> >  
> > -	return ttm_pool_free(&ttm_dev->pool, tt);
> > +	ttm_pool_free(&ttm_dev->pool, tt);
> > +	xe_ttm_tt_account(tt, false);
> > +}
> > +
> > +/**
> > + * xe_bo_shrink() - Try to shrink an xe bo.
> > + * @walk:  - The walk parameters
> > + * @bo: The TTM buffer object
> > + * @purge: Only consider purgeable bos.
> > + * @writeback: Try to write back to persistent storage.
> > + *
> > + * Try to shrink- or purge a bo, and if it succeeds, unmap dma.
> > + * Note that we need to be able to handle also non xe bos
> > + * (ghost bos), but only if the struct ttm_tt is embedded in
> > + * a struct xe_ttm_tt.
> > + *
> > + * Return: The number of pages shrunken or purged, or negative
> > error
> > + * code on failure.
> > + */
> > +long xe_bo_shrink(struct ttm_lru_walk *walk, struct
> > ttm_buffer_object *bo,
> > +		  bool purge, bool writeback)
> > +{
> > +	struct ttm_tt *tt = bo->ttm;
> > +	struct xe_ttm_tt *xe_tt = container_of(tt, struct
> > xe_ttm_tt, ttm);
> > +	struct ttm_place place = {.mem_type = bo->resource-
> > >mem_type};
> > +	struct xe_bo *xe_bo = ttm_to_xe_bo(bo);
> > +	struct xe_device *xe = xe_tt->xe;
> > +	bool needs_rpm;
> > +	long lret = 0L;
> > +
> > +	if (!tt || !ttm_tt_is_populated(tt) ||
> > +	    !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE) ||
> > +	    (purge && !xe_tt->purgeable))
> > +		return 0L;
> > +
> > +	if (!ttm_bo_eviction_valuable(bo, &place))
> > +		return 0L;
> > +
> > +	/* Beware of zombies (GEM object refcount == 0) and
> > ghosts. */
> > +	if (!xe_bo_is_xe_bo(bo) || !xe_bo_get_unless_zero(xe_bo))
> > {
> > +		struct ttm_placement null_placement = {
> > .num_placement = 0 };
> > +
> > +		lret = ttm_bo_wait_ctx(bo, walk->ctx);
> > +		if (lret)
> > +			return lret;
> > +
> > +		/* Purge the bo content! */
> > +		ttm_bo_validate(bo, &null_placement, walk->ctx);
> > +		return tt->num_pages;
> > +	}
> > +
> > +	/* System CCS needs gpu copy when moving PL_TT ->
> > PL_SYSTEM */
> > +	needs_rpm = (!IS_DGFX(xe) && bo->resource->mem_type !=
> > XE_PL_SYSTEM &&
> > +		     xe_bo && xe_bo_needs_ccs_pages(xe_bo) &&
> > !xe_tt->purgeable);
> 
> Is xe_bo check really needed here?

Yes, I think so; otherwise xe_bo_needs_ccs_pages() would be called with
NULL for ghost objects that aren't xe bos.

> 
> > +	if (needs_rpm && !xe_pm_runtime_get_if_active(xe))
> > +		goto out_unref;
> > +
> > +	lret = ttm_bo_try_shrink(walk, bo, xe_tt->purgeable,
> > writeback);
> > +	if (needs_rpm)
> > +		xe_pm_runtime_put(xe);
> > +
> > +	if (lret > 0) {
> > +		xe_assert(xe, !ttm_tt_is_populated(tt));
> > +
> > +		xe_ttm_tt_account(tt, false);
> > +	}
> > +
> > +out_unref:
> > +	xe_bo_put(xe_bo);
> > +
> > +	return lret;
> >  }
> >  
> >  static void xe_ttm_tt_destroy(struct ttm_device *ttm_dev, struct
> > ttm_tt *tt)
> > @@ -1238,6 +1351,7 @@ struct xe_bo *___xe_bo_create_locked(struct
> > xe_device *xe, struct xe_bo *bo,
> >  	struct ttm_operation_ctx ctx = {
> >  		.interruptible = true,
> >  		.no_wait_gpu = false,
> > +		.gfp_retry_mayfail = true,
> 
> Can you explain why you are setting this?

Hm. We might want this in a separate patch. Without it, bo memory
allocation will typically never fail but instead trigger the OOM killer.
I don't think that's the behaviour we want, but yes, it deserves its own
patch and review.
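
For context, what the flag changes for callers, as I understand it
(sketch of a hypothetical caller, separate-patch material):

	struct ttm_operation_ctx ctx = {
		.interruptible = true,
		.no_wait_gpu = false,
		/* Allocate with __GFP_RETRY_MAYFAIL instead of invoking the OOM killer. */
		.gfp_retry_mayfail = true,
	};
	int err;

	err = ttm_bo_validate(&bo->ttm, placement, &ctx);
	if (err == -ENOMEM)
		return err;	/* Back off / report to userspace instead of OOM. */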

> 
> >  	};
> >  	struct ttm_placement *placement;
> >  	uint32_t alignment;
> > @@ -1681,6 +1795,8 @@ int xe_bo_pin_external(struct xe_bo *bo)
> >  	}
> >  
> >  	ttm_bo_pin(&bo->ttm);
> > +	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
> > +		xe_ttm_tt_account(bo->ttm.ttm, false);
> >  
> >  	/*
> >  	 * FIXME: If we always use the reserve / unreserve
> > functions for locking
> > @@ -1739,6 +1855,8 @@ int xe_bo_pin(struct xe_bo *bo)
> >  	}
> >  
> >  	ttm_bo_pin(&bo->ttm);
> > +	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
> > +		xe_ttm_tt_account(bo->ttm.ttm, false);
> >  
> >  	/*
> >  	 * FIXME: If we always use the reserve / unreserve
> > functions for locking
> > @@ -1773,6 +1891,9 @@ void xe_bo_unpin_external(struct xe_bo *bo)
> >  	spin_unlock(&xe->pinned.lock);
> >  
> >  	ttm_bo_unpin(&bo->ttm);
> > +	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
> > +		xe_ttm_tt_account(bo->ttm.ttm, true);
> > +
> 
> Nit: Extra newline.
Will fix.
> 
> >  
> >  	/*
> >  	 * FIXME: If we always use the reserve / unreserve
> > functions for locking
> > @@ -1801,6 +1922,8 @@ void xe_bo_unpin(struct xe_bo *bo)
> >  	}
> >  
> >  	ttm_bo_unpin(&bo->ttm);
> > +	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
> > +		xe_ttm_tt_account(bo->ttm.ttm, true);
> >  }
> >  
> >  /**
> > diff --git a/drivers/gpu/drm/xe/xe_bo.h
> > b/drivers/gpu/drm/xe/xe_bo.h
> > index 6de894c728f5..8463e3f3f6f1 100644
> > --- a/drivers/gpu/drm/xe/xe_bo.h
> > +++ b/drivers/gpu/drm/xe/xe_bo.h
> > @@ -63,6 +63,7 @@
> >  #define XE_BO_PROPS_INVALID	(-1)
> >  
> >  struct sg_table;
> > +struct xe_ttm_lru_walk;
> >  
> >  struct xe_bo *xe_bo_alloc(void);
> >  void xe_bo_free(struct xe_bo *bo);
> > @@ -126,6 +127,28 @@ static inline struct xe_bo *xe_bo_get(struct
> > xe_bo *bo)
> >  	return bo;
> >  }
> >  
> > +/*
> > + * xe_bo_get_unless_zero() - Conditionally obtain a GEM object
> > refcount on an
> > + * xe bo
> > + * @bo: The bo for which we want to obtain a refcount.
> > + *
> > + * There is a short window between where the bo's GEM object
> > refcount reaches
> > + * zero and where we put the final ttm_bo reference. Code in the
> > eviction- and
> > + * shrinking path should therefore attempt to grab a gem object
> > reference before
> > + * trying to use members outside of the base class ttm object.
> > This function is
> > + * intended for that purpose. On successful return, this function
> > must be paired
> > + * with an xe_bo_put().
> > + *
> > + * Return: @bo on success, NULL on failure.
> > + */
> > +static inline __must_check struct xe_bo
> > *xe_bo_get_unless_zero(struct xe_bo *bo)
> > +{
> > +	if (!bo || !kref_get_unless_zero(&bo->ttm.base.refcount))
> > +		return NULL;
> > +
> > +	return bo;
> > +}
> > +
> >  static inline void xe_bo_put(struct xe_bo *bo)
> >  {
> >  	if (bo)
> > @@ -315,6 +338,9 @@ static inline unsigned int
> > xe_sg_segment_size(struct device *dev)
> >  
> >  #define
> > i915_gem_object_flush_if_display(obj)		((void)(obj))
> >  
> > +long xe_bo_shrink(struct ttm_lru_walk *walk, struct
> > ttm_buffer_object *bo,
> > +		  bool purge, bool writeback);
> > +
> >  #if IS_ENABLED(CONFIG_DRM_XE_KUNIT_TEST)
> >  /**
> >   * xe_bo_is_mem_type - Whether the bo currently resides in the
> > given
> > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > b/drivers/gpu/drm/xe/xe_device.c
> > index cfda7cb5df2c..58fecc4b0a18 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -47,6 +47,7 @@
> >  #include "xe_perf.h"
> >  #include "xe_pm.h"
> >  #include "xe_query.h"
> > +#include "xe_shrinker.h"
> >  #include "xe_sriov.h"
> >  #include "xe_tile.h"
> >  #include "xe_ttm_stolen_mgr.h"
> > @@ -241,6 +242,9 @@ static void xe_device_destroy(struct drm_device
> > *dev, void *dummy)
> >  	if (xe->unordered_wq)
> >  		destroy_workqueue(xe->unordered_wq);
> >  
> > +	if (!IS_ERR_OR_NULL(xe->mem.shrinker))
> > +		xe_shrinker_destroy(xe->mem.shrinker);
> > +
> >  	ttm_device_fini(&xe->ttm);
> >  }
> >  
> > @@ -270,6 +274,10 @@ struct xe_device *xe_device_create(struct
> > pci_dev *pdev,
> >  	if (err)
> >  		goto err;
> >  
> > +	xe->mem.shrinker = xe_shrinker_create(xe);
> > +	if (IS_ERR(xe->mem.shrinker))
> > +		return ERR_CAST(xe->mem.shrinker);
> > +
> >  	xe->info.devid = pdev->device;
> >  	xe->info.revid = pdev->revision;
> >  	xe->info.force_execlist = xe_modparam.force_execlist;
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > b/drivers/gpu/drm/xe/xe_device_types.h
> > index c37be471d11c..3d5440aba52e 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -325,6 +325,8 @@ struct xe_device {
> >  		struct xe_mem_region vram;
> >  		/** @mem.sys_mgr: system TTM manager */
> >  		struct ttm_resource_manager sys_mgr;
> > +		/** @mem.sys_mgr: system memory shrinker. */
> > +		struct xe_shrinker *shrinker;
> >  	} mem;
> >  
> >  	/** @sriov: device level virtualization data */
> > diff --git a/drivers/gpu/drm/xe/xe_shrinker.c
> > b/drivers/gpu/drm/xe/xe_shrinker.c
> > new file mode 100644
> > index 000000000000..3f9554bdc06b
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_shrinker.c
> > @@ -0,0 +1,287 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#include <linux/shrinker.h>
> > +#include <linux/swap.h>
> > +
> > +#include <drm/ttm/ttm_bo.h>
> > +#include <drm/ttm/ttm_tt.h>
> > +
> > +#include "xe_bo.h"
> > +#include "xe_pm.h"
> > +#include "xe_shrinker.h"
> > +
> > +/**
> > + * struct xe_shrinker - per-device shrinker
> > + * @xe: Back pointer to the device.
> > + * @lock: Lock protecting accounting.
> > + * @shrinkable_pages: Number of pages that are currently
> > shrinkable.
> > + * @purgeable_pages: Number of pages that are currently purgeable.
> > + * @shrink: Pointer to the mm shrinker.
> > + * @pm_worker: Worker to wake up the device if required.
> > + */
> > +struct xe_shrinker {
> > +	struct xe_device *xe;
> > +	rwlock_t lock;
> > +	long shrinkable_pages;
> > +	long purgeable_pages;
> > +	struct shrinker *shrink;
> > +	struct work_struct pm_worker;
> > +};
> > +
> > +/**
> > + * struct xe_shrink_lru_walk - lru_walk subclass for shrinker
> > + * @walk: The embedded base class.
> > + * @xe: Pointer to the xe device.
> > + * @purge: Purgeable only request from the srinker.
> > + * @writeback: Try to write back to persistent storage.
> > + */
> > +struct xe_shrink_lru_walk {
> > +	struct ttm_lru_walk walk;
> > +	struct xe_device *xe;
> > +	bool purge;
> > +	bool writeback;
> > +};
> > +
> > +static struct xe_shrinker *to_xe_shrinker(struct shrinker *shrink)
> > +{
> > +	return shrink->private_data;
> > +}
> > +
> > +static struct xe_shrink_lru_walk *
> > +to_xe_shrink_lru_walk(struct ttm_lru_walk *walk)
> > +{
> > +	return container_of(walk, struct xe_shrink_lru_walk,
> > walk);
> > +}
> > +
> > +/**
> > + * xe_shrinker_mod_pages() - Modify shrinker page accounting
> > + * @shrinker: Pointer to the struct xe_shrinker.
> > + * @shrinkable: Shrinkable pages delta. May be negative.
> > + * @purgeable: Purgeable page delta. May be negative.
> > + *
> > + * Modifies the shrinkable and purgeable pages accounting.
> > + */
> > +void
> > +xe_shrinker_mod_pages(struct xe_shrinker *shrinker, long
> > shrinkable, long purgeable)
> > +{
> > +	write_lock(&shrinker->lock);
> > +	shrinker->shrinkable_pages += shrinkable;
> > +	shrinker->purgeable_pages += purgeable;
> > +	write_unlock(&shrinker->lock);
> > +}
> > +
> > +static long xe_shrinker_process_bo(struct ttm_lru_walk *walk,
> > struct ttm_buffer_object *bo)
> > +{
> > +	struct xe_shrink_lru_walk *shrink_walk =
> > to_xe_shrink_lru_walk(walk);
> > +
> > +	return xe_bo_shrink(walk, bo, shrink_walk->purge,
> > shrink_walk->writeback);
> > +}
> > +
> > +static long xe_shrinker_walk(struct xe_shrink_lru_walk
> > *shrink_walk, long target)
> > +{
> > +	struct xe_device *xe = shrink_walk->xe;
> > +	struct ttm_resource_manager *man;
> > +	unsigned int mem_type;
> > +	long sofar = 0;
> > +	long lret;
> > +
> > +	for (mem_type = XE_PL_SYSTEM; mem_type <= XE_PL_TT;
> > ++mem_type) {
> > +		man = ttm_manager_type(&xe->ttm, mem_type);
> > +		if (!man || !man->use_tt)
> > +			continue;
> > +
> > +		lret = ttm_lru_walk_for_evict(&shrink_walk->walk,
> > &xe->ttm, man, target);
> > +		if (lret < 0)
> > +			return lret;
> > +
> > +		sofar += lret;
> > +		if (sofar >= target)
> > +			break;
> > +	}
> > +
> > +	return sofar;
> > +}
> > +
> > +static unsigned long
> > +xe_shrinker_count(struct shrinker *shrink, struct shrink_control
> > *sc)
> > +{
> > +	struct xe_shrinker *shrinker = to_xe_shrinker(shrink);
> > +	unsigned long num_pages;
> > +
> > +	num_pages = get_nr_swap_pages();
> > +	read_lock(&shrinker->lock);
> > +	num_pages = min_t(unsigned long, num_pages, shrinker-
> > >shrinkable_pages);
> > +	num_pages += shrinker->purgeable_pages;
> > +	read_unlock(&shrinker->lock);
> > +
> > +	return num_pages ? num_pages : SHRINK_EMPTY;
> > +}
> > +
> > +static const struct ttm_lru_walk_ops xe_shrink_ops = {
> > +	.process_bo = xe_shrinker_process_bo,
> > +};
> > +
> > +/*
> > + * Check if we need runtime pm, and if so try to grab a reference
> > if
> > + * already active. If grabbing a reference fails, queue a worker
> > that
> > + * does it for us outside of reclaim, but don't wait for it to
> > complete.
> > + * If bo shrinking needs an rpm reference and we don't have it
> > (yet),
> > + * that bo will be skipped anyway.
> > + */
> > +static bool xe_shrinker_runtime_pm_get(struct xe_shrinker
> > *shrinker, bool force,
> > +				       unsigned long nr_to_scan)
> > +{
> > +	struct xe_device *xe = shrinker->xe;
> > +
> > +	if (IS_DGFX(xe) || !xe_device_has_flat_ccs(xe) ||
> > +	    !get_nr_swap_pages())
> > +		return false;
> > +
> > +	if (!force) {
> > +		read_lock(&shrinker->lock);
> > +		force = (nr_to_scan > shrinker->purgeable_pages);
> > +		read_unlock(&shrinker->lock);
> > +		if (!force)
> > +			return false;
> > +	}
> > +
> > +	if (!xe_pm_runtime_get_if_active(xe)) {
> > +		queue_work(xe->unordered_wq, &shrinker-
> > >pm_worker);
> > +		return false;
> > +	}
> > +
> > +	return true;
> > +}
> > +
> > +static void xe_shrinker_runtime_pm_put(struct xe_shrinker
> > *shrinker, bool runtime_pm)
> > +{
> > +	if (runtime_pm)
> > +		xe_pm_runtime_put(shrinker->xe);
> > +}
> > +
> > +static unsigned long xe_shrinker_scan(struct shrinker *shrink,
> > struct shrink_control *sc)
> > +{
> > +	struct xe_shrinker *shrinker = to_xe_shrinker(shrink);
> > +	bool is_kswapd = current_is_kswapd();
> > +	struct ttm_operation_ctx ctx = {
> > +		.interruptible = false,
> > +		.no_wait_gpu = !is_kswapd,
> > +	};
> > +	unsigned long nr_to_scan, freed = 0;
> > +	struct xe_shrink_lru_walk shrink_walk = {
> > +		.walk = {
> > +			.ops = &xe_shrink_ops,
> > +			.ctx = &ctx,
> > +			.trylock_only = true,
> > +		},
> > +		.xe = shrinker->xe,
> > +		.purge = true,
> > +		.writeback = is_kswapd,
> > +	};
> > +	bool runtime_pm;
> > +	bool purgeable;
> > +	long ret;
> > +
> > +	sc->nr_scanned = 0;
> > +	nr_to_scan = sc->nr_to_scan;
> > +
> > +	read_lock(&shrinker->lock);
> > +	purgeable = !!shrinker->purgeable_pages;
> > +	read_unlock(&shrinker->lock);
> > +
> > +	/* Might need runtime PM. Try to wake early if it looks
> > like it. */
> > +	runtime_pm = xe_shrinker_runtime_pm_get(shrinker, false,
> > nr_to_scan);
> > +
> > +	while (purgeable && freed < nr_to_scan) {
> > +		ret = xe_shrinker_walk(&shrink_walk, nr_to_scan);
> > +		if (ret <= 0)
> > +			break;
> > +
> > +		freed += ret;
> > +	}
> > +
> > +	sc->nr_scanned = freed;
> > +	if (freed < nr_to_scan)
> > +		nr_to_scan -= freed;
> > +	else
> > +		nr_to_scan = 0;
> > +	if (!nr_to_scan)
> > +		goto out;
> > +
> > +	/* If we didn't wake before, try to do it now if needed.
> > */
> > +	if (!runtime_pm)
> > +		runtime_pm = xe_shrinker_runtime_pm_get(shrinker,
> > true, 0);
> > +
> > +	shrink_walk.purge = false;
> > +	nr_to_scan = sc->nr_to_scan;
> > +	while (freed < nr_to_scan) {
> > +		ret = xe_shrinker_walk(&shrink_walk, nr_to_scan);
> > +		if (ret <= 0)
> > +			break;
> > +
> > +		freed += ret;
> > +	}
> > +
> > +	sc->nr_scanned = freed;
> > +
> > +out:
> > +	xe_shrinker_runtime_pm_put(shrinker, runtime_pm);
> > +	return freed ? freed : SHRINK_STOP;
> > +}
> > +
> > +/* Wake up the device for shrinking. */
> > +static void xe_shrinker_pm(struct work_struct *work)
> > +{
> > +	struct xe_shrinker *shrinker =
> > +		container_of(work, typeof(*shrinker), pm_worker);
> > +
> > +	xe_pm_runtime_get(shrinker->xe);
> > +	xe_pm_runtime_put(shrinker->xe);
> 
> So I don't really get this. How does this help the shrinker get a PM
> ref? The small window between xe_pm_runtime_get / put the shrinker
> grabs
> one via xe_pm_runtime_get_if_active? Seems fragile.

Runtime pm has an autosuspend delay after the last put (typically 1s?),
and if the shrinker fails to obtain a reference, the scan will be
retried. Typically it then succeeds, and when it does, a new delay
period starts.

This doesn't strictly guarantee shrinking progress, but there's no way
we can synchronously delay shrinking until we hold the pm reference.

What we could perhaps do is report back that we actually made progress,
but postpone the actual shrinking to the worker with writeback turned
on, hoping that the shrinker core doesn't give up before it sees the
released memory...

But I'd prefer to leave that out until we see that this is actually a
real problem.
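
If we ever need that, it could look roughly like the below (untested
sketch, not part of this series; assumes a dedicated shrink_worker next
to pm_worker):

	static void xe_shrinker_deferred_shrink(struct work_struct *work)
	{
		struct xe_shrinker *shrinker =
			container_of(work, typeof(*shrinker), shrink_worker);
		struct ttm_operation_ctx ctx = { .interruptible = false };
		struct xe_shrink_lru_walk shrink_walk = {
			.walk = {
				.ops = &xe_shrink_ops,
				.ctx = &ctx,
				.trylock_only = true,
			},
			.xe = shrinker->xe,
			.purge = false,
			.writeback = true,
		};

		/* Outside of reclaim we can block for the rpm reference. */
		xe_pm_runtime_get(shrinker->xe);
		(void)xe_shrinker_walk(&shrink_walk, LONG_MAX);
		xe_pm_runtime_put(shrinker->xe);
	}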

Best would be if we could access the LNL CCS memory using the CPU...

/Thomas


> 
> Matt
> 
> > +}
> > +
> > +/**
> > + * xe_shrinker_create() - Create an xe per-device shrinker
> > + * @xe: Pointer to the xe device.
> > + *
> > + * Returns: A pointer to the created shrinker on success,
> > + * Negative error code on failure.
> > + */
> > +struct xe_shrinker *xe_shrinker_create(struct xe_device *xe)
> > +{
> > +	struct xe_shrinker *shrinker = kzalloc(sizeof(*shrinker),
> > GFP_KERNEL);
> > +
> > +	if (!shrinker)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	shrinker->shrink = shrinker_alloc(0, "xe system
> > shrinker");
> > +	if (!shrinker->shrink) {
> > +		kfree(shrinker);
> > +		return ERR_PTR(-ENOMEM);
> > +	}
> > +
> > +	INIT_WORK(&shrinker->pm_worker, xe_shrinker_pm);
> > +	shrinker->xe = xe;
> > +	rwlock_init(&shrinker->lock);
> > +	shrinker->shrink->count_objects = xe_shrinker_count;
> > +	shrinker->shrink->scan_objects = xe_shrinker_scan;
> > +	shrinker->shrink->private_data = shrinker;
> > +	shrinker_register(shrinker->shrink);
> > +
> > +	return shrinker;
> > +}
> > +
> > +/**
> > + * xe_shrinker_destroy() - Destroy an xe per-device shrinker
> > + * @shrinker: Pointer to the shrinker to destroy.
> > + */
> > +void xe_shrinker_destroy(struct xe_shrinker *shrinker)
> > +{
> > +	xe_assert(shrinker->xe, !shrinker->shrinkable_pages);
> > +	xe_assert(shrinker->xe, !shrinker->purgeable_pages);
> > +	shrinker_free(shrinker->shrink);
> > +	flush_work(&shrinker->pm_worker);
> > +	kfree(shrinker);
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_shrinker.h
> > b/drivers/gpu/drm/xe/xe_shrinker.h
> > new file mode 100644
> > index 000000000000..28a038f4fcbf
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_shrinker.h
> > @@ -0,0 +1,18 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#ifndef _XE_SHRINKER_H_
> > +#define _XE_SHRINKER_H_
> > +
> > +struct xe_shrinker;
> > +struct xe_device;
> > +
> > +void xe_shrinker_mod_pages(struct xe_shrinker *shrinker, long
> > shrinkable, long purgeable);
> > +
> > +struct xe_shrinker *xe_shrinker_create(struct xe_device *xe);
> > +
> > +void xe_shrinker_destroy(struct xe_shrinker *shrinker);
> > +
> > +#endif
> > diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
> > index e577528f5dfc..c7e81ae025d9 100644
> > --- a/include/drm/ttm/ttm_bo.h
> > +++ b/include/drm/ttm/ttm_bo.h
> > @@ -229,6 +229,9 @@ struct ttm_lru_walk {
> >  long ttm_lru_walk_for_evict(struct ttm_lru_walk *walk, struct
> > ttm_device *bdev,
> >  			    struct ttm_resource_manager *man, long
> > target);
> >  
> > +long ttm_bo_try_shrink(struct ttm_lru_walk *walk, struct
> > ttm_buffer_object *bo,
> > +		       bool purge, bool writeback);
> > +
> >  /**
> >   * ttm_bo_get - reference a struct ttm_buffer_object
> >   *
> > -- 
> > 2.44.0
> > 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 11/12] drm/ttm, drm/xe: Add a shrinker for xe bos
  2024-07-03 15:38 ` [PATCH v6 11/12] drm/ttm, drm/xe: Add a shrinker for xe bos Thomas Hellström
  2024-08-08  1:37   ` Matthew Brost
@ 2024-08-09 16:05   ` Matthew Auld
  1 sibling, 0 replies; 38+ messages in thread
From: Matthew Auld @ 2024-08-09 16:05 UTC (permalink / raw)
  To: Thomas Hellström, intel-xe
  Cc: Christian König, Somalapuram Amaranath, Matthew Brost,
	dri-devel

Hi,

On 03/07/2024 16:38, Thomas Hellström wrote:
> Rather than relying on the TTM watermark accounting add a shrinker
> for xe_bos in TT or system memory.
> 
> Leverage the newly added TTM per-page shrinking and shmem backup
> support.
> 
> Although xe doesn't fully support WONTNEED (purgeable) bos yet,
> introduce and add shrinker support for purgeable ttm_tts.
> 
> v2:
> - Cleanups bugfixes and a KUNIT shrinker test.
> - Add writeback support, and activate if kswapd.
> v3:
> - Move the try_shrink() helper to core TTM.
> - Minor cleanups.
> v4:
> - Add runtime pm for the shrinker. Shrinking may require an active
>    device for CCS metadata copying.
> v5:
> - Separately purge ghost- and zombie objects in the shrinker.
> - Fix a format specifier - type inconsistency. (Kernel test robot).
> 
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>   drivers/gpu/drm/ttm/ttm_bo_util.c     |  67 ++++++
>   drivers/gpu/drm/xe/Makefile           |   1 +
>   drivers/gpu/drm/xe/tests/xe_bo.c      | 118 +++++++++++
>   drivers/gpu/drm/xe/tests/xe_bo_test.c |   1 +
>   drivers/gpu/drm/xe/tests/xe_bo_test.h |   1 +
>   drivers/gpu/drm/xe/xe_bo.c            | 155 ++++++++++++--
>   drivers/gpu/drm/xe/xe_bo.h            |  26 +++
>   drivers/gpu/drm/xe/xe_device.c        |   8 +
>   drivers/gpu/drm/xe/xe_device_types.h  |   2 +
>   drivers/gpu/drm/xe/xe_shrinker.c      | 287 ++++++++++++++++++++++++++
>   drivers/gpu/drm/xe/xe_shrinker.h      |  18 ++
>   include/drm/ttm/ttm_bo.h              |   3 +
>   12 files changed, 671 insertions(+), 16 deletions(-)
>   create mode 100644 drivers/gpu/drm/xe/xe_shrinker.c
>   create mode 100644 drivers/gpu/drm/xe/xe_shrinker.h
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
> index c4f678f30fc2..563e96a4cf06 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> @@ -924,3 +924,70 @@ long ttm_lru_walk_for_evict(struct ttm_lru_walk *walk, struct ttm_device *bdev,
>   
>   	return progress;
>   }
> +EXPORT_SYMBOL(ttm_lru_walk_for_evict);
> +
> +/**
> + * ttm_bo_try_shrink - LRU walk helper to shrink a ttm buffer object.
> + * @walk: The struct xe_ttm_lru_walk that describes the walk.
> + * @bo: The buffer object.
> + * @purge: Whether to attempt to purge the bo content since it's no
> + * longer needed.
> + * @writeback: If !@purge, attempt to write out to persistent storage.
> + *
> + * The function uses the ttm_tt_back_up functionality to back up or
> + * purge a struct ttm_tt. If the bo is not in system, it's first
> + * moved there.
> + *
> + * Return: The number of pages shrunken or purged, or
> + * negative error code on failure.
> + */
> +long ttm_bo_try_shrink(struct ttm_lru_walk *walk, struct ttm_buffer_object *bo,
> +		       bool purge, bool writeback)
> +{
> +	static const struct ttm_place sys_placement_flags = {
> +		.fpfn = 0,
> +		.lpfn = 0,
> +		.mem_type = TTM_PL_SYSTEM,
> +		.flags = 0,
> +	};
> +	static struct ttm_placement sys_placement = {
> +		.num_placement = 1,
> +		.placement = &sys_placement_flags,
> +	};
> +	struct ttm_operation_ctx *ctx = walk->ctx;
> +	struct ttm_tt *tt = bo->ttm;
> +	long lret;
> +
> +	dma_resv_assert_held(bo->base.resv);
> +
> +	if (!tt || !ttm_tt_is_populated(tt))
> +		return 0;
> +
> +	if (bo->resource->mem_type != TTM_PL_SYSTEM) {
> +		int ret = ttm_bo_validate(bo, &sys_placement, ctx);
> +
> +		if (ret) {
> +			if (ret == -EINTR || ret == -EDEADLK ||
> +			    ret == -ERESTARTSYS)
> +				return ret;
> +			return 0;
> +		}
> +	}
> +
> +	lret = ttm_bo_wait_ctx(bo, ctx);
> +	if (lret < 0) {
> +		if (lret == -ERESTARTSYS)
> +			return lret;
> +		return 0;
> +	}
> +
> +	if (bo->deleted)
> +		lret = ttm_tt_backup(bo->bdev, tt, true, writeback);
> +	else
> +		lret = ttm_tt_backup(bo->bdev, tt, purge, writeback);
> +	if (lret < 0 && lret != -EINTR)
> +		return 0;
> +
> +	return lret;
> +}
> +EXPORT_SYMBOL(ttm_bo_try_shrink);
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index b1e03bfe4a68..1eba51bdd172 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -112,6 +112,7 @@ xe-y += xe_bb.o \
>   	xe_ring_ops.o \
>   	xe_sa.o \
>   	xe_sched_job.o \
> +	xe_shrinker.o \
>   	xe_step.o \
>   	xe_sync.o \
>   	xe_tile.o \
> diff --git a/drivers/gpu/drm/xe/tests/xe_bo.c b/drivers/gpu/drm/xe/tests/xe_bo.c
> index 9f3c02826464..49617f16dc76 100644
> --- a/drivers/gpu/drm/xe/tests/xe_bo.c
> +++ b/drivers/gpu/drm/xe/tests/xe_bo.c
> @@ -6,6 +6,8 @@
>   #include <kunit/test.h>
>   #include <kunit/visibility.h>
>   
> +#include <uapi/linux/sysinfo.h>
> +
>   #include "tests/xe_bo_test.h"
>   #include "tests/xe_pci_test.h"
>   #include "tests/xe_test.h"
> @@ -350,3 +352,119 @@ void xe_bo_evict_kunit(struct kunit *test)
>   	xe_call_for_each_device(evict_test_run_device);
>   }
>   EXPORT_SYMBOL_IF_KUNIT(xe_bo_evict_kunit);
> +
> +struct xe_bo_link {
> +	struct list_head link;
> +	struct xe_bo *bo;
> +};
> +
> +#define XE_BO_SHRINK_SIZE ((unsigned long)SZ_64M)
> +
> +/*
> + * Try to create system bos corresponding to twice the amount
> + * of available system memory to test shrinker functionality.
> + * If no swap space is available to accommodate the
> + * memory overcommit, mark bos purgeable.
> + */
> +static int shrink_test_run_device(struct xe_device *xe)
> +{
> +	struct kunit *test = xe_cur_kunit();
> +	LIST_HEAD(bos);
> +	struct xe_bo_link *link, *next;
> +	struct sysinfo si;
> +	size_t total, alloced;
> +	unsigned int interrupted = 0, successful = 0;
> +
> +	si_meminfo(&si);
> +	total = si.freeram * si.mem_unit;
> +
> +	kunit_info(test, "Free ram is %lu bytes. Will allocate twice of that.\n",
> +		   (unsigned long) total);
> +
> +	total <<= 1;
> +	for (alloced = 0; alloced < total ; alloced += XE_BO_SHRINK_SIZE) {
> +		struct xe_bo *bo;
> +		unsigned int mem_type;
> +
> +		link = kzalloc(sizeof(*link), GFP_KERNEL);
> +		if (!link) {
> +			KUNIT_FAIL(test, "Unexpeced link allocation failure\n");
> +			break;
> +		}
> +
> +		INIT_LIST_HEAD(&link->link);
> +
> +		/* We can create bos using WC caching here. But it is slower. */
> +		bo = xe_bo_create_user(xe, NULL, NULL, XE_BO_SHRINK_SIZE,
> +				       DRM_XE_GEM_CPU_CACHING_WB,
> +				       ttm_bo_type_device,
> +				       XE_BO_FLAG_SYSTEM);
> +		if (IS_ERR(bo)) {
> +			if (bo != ERR_PTR(-ENOMEM) && bo != ERR_PTR(-ENOSPC) &&
> +			    bo != ERR_PTR(-EINTR) && bo != ERR_PTR(-ERESTARTSYS))
> +				KUNIT_FAIL(test, "Error creating bo: %pe\n", bo);
> +			kfree(link);
> +			break;
> +		}
> +		link->bo = bo;
> +		list_add_tail(&link->link, &bos);
> +		xe_bo_lock(bo, false);
> +
> +		/*
> +		 * If we're low on swap entries, we can't shrink unless the bo
> +		 * is marked purgeable.
> +		 */
> +		if (get_nr_swap_pages() < (XE_BO_SHRINK_SIZE >> PAGE_SHIFT) * 128) {
> +			struct xe_ttm_tt *xe_tt =
> +				container_of(bo->ttm.ttm, typeof(*xe_tt), ttm);
> +			long num_pages = xe_tt->ttm.num_pages;
> +
> +			xe_tt->purgeable = true;
> +			xe_shrinker_mod_pages(xe->mem.shrinker, -num_pages,
> +					      num_pages);
> +		}
> +
> +		mem_type = bo->ttm.resource->mem_type;
> +		xe_bo_unlock(bo);
> +		if (mem_type != XE_PL_TT)
> +			KUNIT_FAIL(test, "Bo in incorrect memory type: %u\n",
> +				   bo->ttm.resource->mem_type);
> +		cond_resched();
> +		if (signal_pending(current))
> +			break;
> +	}
> +
> +	/* Read back and destroy bos */
> +	list_for_each_entry_safe_reverse(link, next, &bos, link) {
> +		static struct ttm_operation_ctx ctx = {.interruptible = true};
> +		struct xe_bo *bo = link->bo;
> +		int ret;
> +
> +		if (!signal_pending(current)) {
> +			xe_bo_lock(bo, NULL);
> +			ret = ttm_bo_validate(&bo->ttm, &tt_placement, &ctx);
> +			xe_bo_unlock(bo);
> +			if (ret && ret != -EINTR)
> +				KUNIT_FAIL(test, "Validation failed: %pe\n",
> +					   ERR_PTR(ret));
> +			else if (ret)
> +				interrupted++;
> +			else
> +				successful++;
> +		}
> +		xe_bo_put(link->bo);
> +		list_del(&link->link);
> +		kfree(link);
> +		cond_resched();
> +	}
> +	kunit_info(test, "Readbacks interrupted: %u successful: %u\n",
> +		   interrupted, successful);
> +
> +	return 0;
> +}
> +
> +void xe_bo_shrink_kunit(struct kunit *test)
> +{
> +	xe_call_for_each_device(shrink_test_run_device);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(xe_bo_shrink_kunit);
> diff --git a/drivers/gpu/drm/xe/tests/xe_bo_test.c b/drivers/gpu/drm/xe/tests/xe_bo_test.c
> index a324cde77db8..317fa923e287 100644
> --- a/drivers/gpu/drm/xe/tests/xe_bo_test.c
> +++ b/drivers/gpu/drm/xe/tests/xe_bo_test.c
> @@ -10,6 +10,7 @@
>   static struct kunit_case xe_bo_tests[] = {
>   	KUNIT_CASE(xe_ccs_migrate_kunit),
>   	KUNIT_CASE(xe_bo_evict_kunit),
> +	KUNIT_CASE_SLOW(xe_bo_shrink_kunit),
>   	{}
>   };
>   
> diff --git a/drivers/gpu/drm/xe/tests/xe_bo_test.h b/drivers/gpu/drm/xe/tests/xe_bo_test.h
> index 0113ab45066a..7f44d14a45c5 100644
> --- a/drivers/gpu/drm/xe/tests/xe_bo_test.h
> +++ b/drivers/gpu/drm/xe/tests/xe_bo_test.h
> @@ -10,5 +10,6 @@ struct kunit;
>   
>   void xe_ccs_migrate_kunit(struct kunit *test);
>   void xe_bo_evict_kunit(struct kunit *test);
> +void xe_bo_shrink_kunit(struct kunit *test);
>   
>   #endif
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index 65c696966e96..6ab63d1642ae 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -10,6 +10,7 @@
>   #include <drm/drm_drv.h>
>   #include <drm/drm_gem_ttm_helper.h>
>   #include <drm/drm_managed.h>
> +#include <drm/ttm/ttm_backup.h>
>   #include <drm/ttm/ttm_device.h>
>   #include <drm/ttm/ttm_placement.h>
>   #include <drm/ttm/ttm_tt.h>
> @@ -25,6 +26,7 @@
>   #include "xe_pm.h"
>   #include "xe_preempt_fence.h"
>   #include "xe_res_cursor.h"
> +#include "xe_shrinker.h"
>   #include "xe_trace_bo.h"
>   #include "xe_ttm_stolen_mgr.h"
>   #include "xe_vm.h"
> @@ -278,11 +280,15 @@ static void xe_evict_flags(struct ttm_buffer_object *tbo,
>   	}
>   }
>   
> +/* struct xe_ttm_tt - Subclassed ttm_tt for xe */
>   struct xe_ttm_tt {
>   	struct ttm_tt ttm;
> -	struct device *dev;
> +	/** @xe - The xe device */
> +	struct xe_device *xe;
>   	struct sg_table sgt;
>   	struct sg_table *sg;
> +	/** @purgeable - Whether the bo is purgeable (WONTNEED) */
> +	bool purgeable;
>   };
>   
>   static int xe_tt_map_sg(struct ttm_tt *tt)
> @@ -291,7 +297,8 @@ static int xe_tt_map_sg(struct ttm_tt *tt)
>   	unsigned long num_pages = tt->num_pages;
>   	int ret;
>   
> -	XE_WARN_ON(tt->page_flags & TTM_TT_FLAG_EXTERNAL);
> +	XE_WARN_ON((tt->page_flags & TTM_TT_FLAG_EXTERNAL) &&
> +		   !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE));
>   
>   	if (xe_tt->sg)
>   		return 0;
> @@ -299,13 +306,13 @@ static int xe_tt_map_sg(struct ttm_tt *tt)
>   	ret = sg_alloc_table_from_pages_segment(&xe_tt->sgt, tt->pages,
>   						num_pages, 0,
>   						(u64)num_pages << PAGE_SHIFT,
> -						xe_sg_segment_size(xe_tt->dev),
> +						xe_sg_segment_size(xe_tt->xe->drm.dev),
>   						GFP_KERNEL);
>   	if (ret)
>   		return ret;
>   
>   	xe_tt->sg = &xe_tt->sgt;
> -	ret = dma_map_sgtable(xe_tt->dev, xe_tt->sg, DMA_BIDIRECTIONAL,
> +	ret = dma_map_sgtable(xe_tt->xe->drm.dev, xe_tt->sg, DMA_BIDIRECTIONAL,
>   			      DMA_ATTR_SKIP_CPU_SYNC);
>   	if (ret) {
>   		sg_free_table(xe_tt->sg);
> @@ -321,7 +328,7 @@ static void xe_tt_unmap_sg(struct ttm_tt *tt)
>   	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
>   
>   	if (xe_tt->sg) {
> -		dma_unmap_sgtable(xe_tt->dev, xe_tt->sg,
> +		dma_unmap_sgtable(xe_tt->xe->drm.dev, xe_tt->sg,
>   				  DMA_BIDIRECTIONAL, 0);
>   		sg_free_table(xe_tt->sg);
>   		xe_tt->sg = NULL;
> @@ -336,21 +343,41 @@ struct sg_table *xe_bo_sg(struct xe_bo *bo)
>   	return xe_tt->sg;
>   }
>   
> +/*
> + * Account ttm pages against the device shrinker's shrinkable and
> + * purgeable counts.
> + */
> +static void xe_ttm_tt_account(struct ttm_tt *tt, bool add)
> +{
> +	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);
> +	long num_pages = tt->num_pages;
> +
> +	if (!add)
> +		num_pages = -num_pages;
> +
> +	if (xe_tt->purgeable)
> +		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, 0, num_pages);
> +	else
> +		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, num_pages, 0);
> +}
> +
>   static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
>   				       u32 page_flags)
>   {
>   	struct xe_bo *bo = ttm_to_xe_bo(ttm_bo);
>   	struct xe_device *xe = xe_bo_device(bo);
> -	struct xe_ttm_tt *tt;
> +	struct xe_ttm_tt *xe_tt;
> +	struct ttm_tt *tt;
>   	unsigned long extra_pages;
>   	enum ttm_caching caching;
>   	int err;
>   
> -	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
> -	if (!tt)
> +	xe_tt = kzalloc(sizeof(*xe_tt), GFP_KERNEL);
> +	if (!xe_tt)
>   		return NULL;
>   
> -	tt->dev = xe->drm.dev;
> +	tt = &xe_tt->ttm;
> +	xe_tt->xe = xe;
>   
>   	extra_pages = 0;
>   	if (xe_bo_needs_ccs_pages(bo))
> @@ -387,42 +414,128 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
>   		caching = ttm_uncached;
>   	}
>   
> -	err = ttm_tt_init(&tt->ttm, &bo->ttm, page_flags, caching, extra_pages);
> +	if (ttm_bo->type != ttm_bo_type_sg)
> +		page_flags |= TTM_TT_FLAG_EXTERNAL | TTM_TT_FLAG_EXTERNAL_MAPPABLE;
> +
> +	err = ttm_tt_init(tt, &bo->ttm, page_flags, caching, extra_pages);
>   	if (err) {
> -		kfree(tt);
> +		kfree(xe_tt);
>   		return NULL;
>   	}
>   
> -	return &tt->ttm;
> +	tt->backup = ttm_backup_shmem_create(tt->num_pages << PAGE_SHIFT);

I guess we should make this (loff_t)tt->num_pages << PAGE_SHIFT or similar?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 10/12] drm/ttm: Use fault-injection to test error paths
  2024-08-09 13:53     ` Thomas Hellström
@ 2024-08-09 16:40       ` Matthew Brost
  0 siblings, 0 replies; 38+ messages in thread
From: Matthew Brost @ 2024-08-09 16:40 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Christian König, Somalapuram Amaranath, dri-devel

On Fri, Aug 09, 2024 at 03:53:20PM +0200, Thomas Hellström wrote:
> On Wed, 2024-08-07 at 23:43 +0000, Matthew Brost wrote:
> > On Wed, Jul 03, 2024 at 05:38:11PM +0200, Thomas Hellström wrote:
> > > Use fault-injection to test partial TTM swapout and interrupted
> > > swapin.
> > > Return -EINTR for swapin to test the callers ability to handle and
> > > restart the swapin, and on swapout perform a partial swapout to
> > > test that
> > > the swapin and release_shrunken functionality.
> > > 
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: <dri-devel@lists.freedesktop.org>
> > > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > ---
> > >  drivers/gpu/drm/Kconfig        | 10 ++++++++++
> > >  drivers/gpu/drm/ttm/ttm_pool.c | 17 ++++++++++++++++-
> > >  2 files changed, 26 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> > > index fd0749c0c630..9f27271bfab8 100644
> > > --- a/drivers/gpu/drm/Kconfig
> > > +++ b/drivers/gpu/drm/Kconfig
> > > @@ -272,6 +272,16 @@ config DRM_GPUVM
> > >  	  GPU-VM representation providing helpers to manage a GPUs
> > > virtual
> > >  	  address space
> > >  
> > > +config DRM_TTM_BACKUP_FAULT_INJECT
> > > +	bool "Enable fault injection during TTM backup"
> > > +	depends on DRM_TTM
> > > +	default n
> > > +	help
> > > +	  Inject recoverable failures during TTM backup and
> > > recovery of
> > > +	  backed-up objects. For DRM driver developers only.
> > > +
> > > +	  If in doubt, choose N.
> > > +
> > >  config DRM_BUDDY
> > >  	tristate
> > >  	depends on DRM
> > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c
> > > b/drivers/gpu/drm/ttm/ttm_pool.c
> > > index 38e50cf81b0a..d32a1f2e5e50 100644
> > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > @@ -431,6 +431,7 @@ static int ttm_pool_restore_tt(struct
> > > ttm_pool_tt_restore *restore,
> > >  			       struct ttm_backup *backup,
> > >  			       struct ttm_operation_ctx *ctx)
> > >  {
> > > +	static unsigned long __maybe_unused swappedin;
> > >  	unsigned int i, nr = 1 << restore->order;
> > >  	int ret = 0;
> > >  
> > > @@ -446,6 +447,13 @@ static int ttm_pool_restore_tt(struct
> > > ttm_pool_tt_restore *restore,
> > >  			if (handle == 0)
> > >  				continue;
> > >  
> > > +			if
> > > (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT) &&
> > > +			    ctx->interruptible &&
> > > +			    ++swappedin % 100 == 0) {
> > > +				ret = -EINTR;
> > > +				break;
> > > +			}
> > 
> > So here this -EINTR would be kicked to the user IOCTL which triggered
> > the BO validate and retry? The restore then should be able to
> > successfully pick up where it left off?
> 
> Yes, that's the point. For the direct swap-cache backend I initially
> used (before concluding that the shmem one actually seemed to work
> fine), we had an interruptible wait here.
> 
> Supporting interrupts is generally a good thing, but for the pool code
> it makes things considerably more complicated. However, it's a good way
> to ensure drivers actually support -EINTR throughout the call chain. If
> not, adding interrupt capability "later" will most likely be a PITA.
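>
> To illustrate the contract the injection exercises, a hypothetical
> caller sketch:
>
> 	/* Sketch only: driver validate path hit by the injected -EINTR. */
> 	struct ttm_operation_ctx ctx = { .interruptible = true };
> 	int err;
>
> 	err = ttm_bo_validate(&bo->ttm, placement, &ctx);
> 	if (err == -EINTR || err == -ERESTARTSYS)
> 		return err;	/* Propagate to the ioctl; the restarted call
> 				 * resumes the swapin where it stopped. */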
> 
> > 
> > > +
> > >  			ret = backup->ops->copy_backed_up_page
> > >  				(backup, restore->first_page[i],
> > >  				 handle, ctx->interruptible);
> > > @@ -892,7 +900,14 @@ long ttm_pool_backup_tt(struct ttm_pool *pool,
> > > struct ttm_tt *ttm, bool purge,
> > >  
> > >  	alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN |
> > > __GFP_RETRY_MAYFAIL;
> > >  
> > > -	for (i = 0; i < ttm->num_pages; ++i) {
> > > +	num_pages = ttm->num_pages;
> > > +
> > > +	/* Pretend doing fault injection by shrinking only half of
> > > the pages. */
> > > +
> > > +	if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT))
> > > +		num_pages = DIV_ROUND_UP(num_pages, 2);
> > 
> > So what happens here? Half the pages swapped out, then upon restore
> > half
> > swapped back in? The shrinker continues to walk until enough pages
> > swapped out?
> 
> Yes, exactly. Ideally we'd want some intermediate state here so that a
> partially swapped out bo is still eligible for further shrinking.
> 

Cool, glad my understanding is correct. Having error injection is always
a good idea to ensure error paths / corner cases work rather than
finding they blow up much later when these cases somehow get triggered.

With that:
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

> /Thomas
> 
> 
> > 
> > Matt
> > 
> > > +
> > > +	for (i = 0; i < num_pages; ++i) {
> > >  		page = ttm->pages[i];
> > >  		if (unlikely(!page))
> > >  			continue;
> > > -- 
> > > 2.44.0
> > > 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 11/12] drm/ttm, drm/xe: Add a shrinker for xe bos
  2024-08-09 14:31     ` Thomas Hellström
@ 2024-08-09 17:22       ` Matthew Brost
  0 siblings, 0 replies; 38+ messages in thread
From: Matthew Brost @ 2024-08-09 17:22 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Christian König, Somalapuram Amaranath, dri-devel

On Fri, Aug 09, 2024 at 04:31:27PM +0200, Thomas Hellström wrote:
> On Thu, 2024-08-08 at 01:37 +0000, Matthew Brost wrote:
> > On Wed, Jul 03, 2024 at 05:38:12PM +0200, Thomas Hellström wrote:
> > > Rather than relying on the TTM watermark accounting add a shrinker
> > > for xe_bos in TT or system memory.
> > > 
> > > Leverage the newly added TTM per-page shrinking and shmem backup
> > > support.
> > > 
> > > Although xe doesn't fully support WONTNEED (purgeable) bos yet,
> > > introduce and add shrinker support for purgeable ttm_tts.
> > > 
> > > v2:
> > > - Cleanups bugfixes and a KUNIT shrinker test.
> > > - Add writeback support, and activate if kswapd.
> > > v3:
> > > - Move the try_shrink() helper to core TTM.
> > > - Minor cleanups.
> > > v4:
> > > - Add runtime pm for the shrinker. Shrinking may require an active
> > >   device for CCS metadata copying.
> > > v5:
> > > - Separately purge ghost- and zombie objects in the shrinker.
> > > - Fix a format specifier - type inconsistency. (Kernel test robot).
> > > 
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: <dri-devel@lists.freedesktop.org>
> > > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > ---
> > >  drivers/gpu/drm/ttm/ttm_bo_util.c     |  67 ++++++
> > >  drivers/gpu/drm/xe/Makefile           |   1 +
> > >  drivers/gpu/drm/xe/tests/xe_bo.c      | 118 +++++++++++
> > >  drivers/gpu/drm/xe/tests/xe_bo_test.c |   1 +
> > >  drivers/gpu/drm/xe/tests/xe_bo_test.h |   1 +
> > >  drivers/gpu/drm/xe/xe_bo.c            | 155 ++++++++++++--
> > >  drivers/gpu/drm/xe/xe_bo.h            |  26 +++
> > >  drivers/gpu/drm/xe/xe_device.c        |   8 +
> > >  drivers/gpu/drm/xe/xe_device_types.h  |   2 +
> > >  drivers/gpu/drm/xe/xe_shrinker.c      | 287
> > > ++++++++++++++++++++++++++
> > >  drivers/gpu/drm/xe/xe_shrinker.h      |  18 ++
> > >  include/drm/ttm/ttm_bo.h              |   3 +
> > >  12 files changed, 671 insertions(+), 16 deletions(-)
> > >  create mode 100644 drivers/gpu/drm/xe/xe_shrinker.c
> > >  create mode 100644 drivers/gpu/drm/xe/xe_shrinker.h
> > > 
> > > diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c
> > > b/drivers/gpu/drm/ttm/ttm_bo_util.c
> > > index c4f678f30fc2..563e96a4cf06 100644
> > > --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> > > +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> > > @@ -924,3 +924,70 @@ long ttm_lru_walk_for_evict(struct
> > > ttm_lru_walk *walk, struct ttm_device *bdev,
> > >  
> > >  	return progress;
> > >  }
> > > +EXPORT_SYMBOL(ttm_lru_walk_for_evict);
> > > +
> > > +/**
> > > + * ttm_bo_try_shrink - LRU walk helper to shrink a ttm buffer
> > > object.
> > > + * @walk: The struct xe_ttm_lru_walk that describes the walk.
> > > + * @bo: The buffer object.
> > > + * @purge: Whether to attempt to purge the bo content since it's
> > > no
> > > + * longer needed.
> > > + * @writeback: If !@purge, attempt to write out to persistent
> > > storage.
> > > + *
> > > + * The function uses the ttm_tt_back_up functionality to back up
> > > or
> > > + * purge a struct ttm_tt. If the bo is not in system, it's first
> > > + * moved there.
> > > + *
> > > + * Return: The number of pages shrunken or purged, or
> > > + * negative error code on failure.
> > > + */
> > > +long ttm_bo_try_shrink(struct ttm_lru_walk *walk, struct
> > > ttm_buffer_object *bo,
> > > +		       bool purge, bool writeback)
> > > +{
> > > +	static const struct ttm_place sys_placement_flags = {
> > > +		.fpfn = 0,
> > > +		.lpfn = 0,
> > > +		.mem_type = TTM_PL_SYSTEM,
> > > +		.flags = 0,
> > > +	};
> > > +	static struct ttm_placement sys_placement = {
> > > +		.num_placement = 1,
> > > +		.placement = &sys_placement_flags,
> > > +	};
> > > +	struct ttm_operation_ctx *ctx = walk->ctx;
> > > +	struct ttm_tt *tt = bo->ttm;
> > > +	long lret;
> > > +
> > > +	dma_resv_assert_held(bo->base.resv);
> > > +
> > > +	if (!tt || !ttm_tt_is_populated(tt))
> > > +		return 0;
> > > +
> > > +	if (bo->resource->mem_type != TTM_PL_SYSTEM) {
> > > +		int ret = ttm_bo_validate(bo, &sys_placement,
> > > ctx);
> > > +
> > > +		if (ret) {
> > > +			if (ret == -EINTR || ret == -EDEADLK ||
> > > +			    ret == -ERESTARTSYS)
> > > +				return ret;
> > 
> > Can you explain the various error code returns / supression in this
> > function?
> 
> Want me to add a comment in the code, or explain inline here? Anyway,
> these are the error codes for which the caller wants to restart (signal
> delivery or ww-mutex backoff). For any other error we just move on to
> the next bo on the LRU list.
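>
> Something along these lines, perhaps (sketch, wording to be polished):
>
> 	if (bo->resource->mem_type != TTM_PL_SYSTEM) {
> 		int ret = ttm_bo_validate(bo, &sys_placement, ctx);
>
> 		if (ret) {
> 			/*
> 			 * Errors that should abort and restart the walk:
> 			 * signal delivery and ww-mutex backoff. Any other
> 			 * error: skip this bo and continue with the next
> 			 * one on the LRU list.
> 			 */
> 			if (ret == -EINTR || ret == -EDEADLK ||
> 			    ret == -ERESTARTSYS)
> 				return ret;
> 			return 0;
> 		}
> 	}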
> 

Yes, that would help make this clear and ensure correctness.

> > 
> > > +			return 0;
> > > +		}
> > > +	}
> > > +
> > > +	lret = ttm_bo_wait_ctx(bo, ctx);
> > > +	if (lret < 0) {
> > > +		if (lret == -ERESTARTSYS)
> > > +			return lret;
> > > +		return 0;
> > > +	}
> > > +
> > > +	if (bo->deleted)
> > > +		lret = ttm_tt_backup(bo->bdev, tt, true,
> > > writeback);
> > > +	else
> > > +		lret = ttm_tt_backup(bo->bdev, tt, purge,
> > > writeback);
> > 
> > Hmm, missed this in my previous review. It is frowned upon having
> > multiple bools as arguments. Could this be reworked with flags? Same
> > goes for all functions in the series with multiple bool arguments.
> 
> I agree. I'll see if I can make this look better.
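>
> One possible shape (sketch only, names not final):
>
> 	struct ttm_backup_flags {
> 		/* Free the pages without backing them up. */
> 		u32 purge : 1;
> 		/* Trigger writeback of the backed-up pages immediately. */
> 		u32 writeback : 1;
> 	};
>
> 	long ttm_tt_backup(struct ttm_device *bdev, struct ttm_tt *tt,
> 			   const struct ttm_backup_flags flags);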
>

+1
 
> > 
> > > +	if (lret < 0 && lret != -EINTR)
> > > +		return 0;
> > > +
> > > +	return lret;
> > > +}
> > > +EXPORT_SYMBOL(ttm_bo_try_shrink);
> > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > b/drivers/gpu/drm/xe/Makefile
> > > index b1e03bfe4a68..1eba51bdd172 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -112,6 +112,7 @@ xe-y += xe_bb.o \
> > >  	xe_ring_ops.o \
> > >  	xe_sa.o \
> > >  	xe_sched_job.o \
> > > +	xe_shrinker.o \
> > >  	xe_step.o \
> > >  	xe_sync.o \
> > >  	xe_tile.o \
> > > diff --git a/drivers/gpu/drm/xe/tests/xe_bo.c
> > > b/drivers/gpu/drm/xe/tests/xe_bo.c
> > > index 9f3c02826464..49617f16dc76 100644
> > > --- a/drivers/gpu/drm/xe/tests/xe_bo.c
> > > +++ b/drivers/gpu/drm/xe/tests/xe_bo.c
> > > @@ -6,6 +6,8 @@
> > >  #include <kunit/test.h>
> > >  #include <kunit/visibility.h>
> > >  
> > > +#include <uapi/linux/sysinfo.h>
> > > +
> > >  #include "tests/xe_bo_test.h"
> > >  #include "tests/xe_pci_test.h"
> > >  #include "tests/xe_test.h"
> > > @@ -350,3 +352,119 @@ void xe_bo_evict_kunit(struct kunit *test)
> > >  	xe_call_for_each_device(evict_test_run_device);
> > >  }
> > >  EXPORT_SYMBOL_IF_KUNIT(xe_bo_evict_kunit);
> > > +
> > > +struct xe_bo_link {
> > > +	struct list_head link;
> > > +	struct xe_bo *bo;
> > > +};
> > > +
> > > +#define XE_BO_SHRINK_SIZE ((unsigned long)SZ_64M)
> > > +
> > > +/*
> > > + * Try to create system bos corresponding to twice the amount
> > > + * of available system memory to test shrinker functionality.
> > > + * If no swap space is available to accommodate the
> > > + * memory overcommit, mark bos purgeable.
> > > + */
> > > +static int shrink_test_run_device(struct xe_device *xe)
> > > +{
> > > +	struct kunit *test = xe_cur_kunit();
> > > +	LIST_HEAD(bos);
> > > +	struct xe_bo_link *link, *next;
> > > +	struct sysinfo si;
> > > +	size_t total, alloced;
> > > +	unsigned int interrupted = 0, successful = 0;
> > > +
> > > +	si_meminfo(&si);
> > > +	total = si.freeram * si.mem_unit;
> > > +
> > > +	kunit_info(test, "Free ram is %lu bytes. Will allocate
> > > twice of that.\n",
> > > +		   (unsigned long) total);
> > > +
> > > +	total <<= 1;
> > > +	for (alloced = 0; alloced < total ; alloced +=
> > > XE_BO_SHRINK_SIZE) {
> > > +		struct xe_bo *bo;
> > > +		unsigned int mem_type;
> > > +
> > > +		link = kzalloc(sizeof(*link), GFP_KERNEL);
> > > +		if (!link) {
> > > +			KUNIT_FAIL(test, "Unexpeced link
> > > allocation failure\n");
> > > +			break;
> > > +		}
> > > +
> > > +		INIT_LIST_HEAD(&link->link);
> > > +
> > > +		/* We can create bos using WC caching here. But it
> > > is slower. */
> > > +		bo = xe_bo_create_user(xe, NULL, NULL,
> > > XE_BO_SHRINK_SIZE,
> > > +				       DRM_XE_GEM_CPU_CACHING_WB,
> > > +				       ttm_bo_type_device,
> > > +				       XE_BO_FLAG_SYSTEM);
> > > +		if (IS_ERR(bo)) {
> > > +			if (bo != ERR_PTR(-ENOMEM) && bo !=
> > > ERR_PTR(-ENOSPC) &&
> > > +			    bo != ERR_PTR(-EINTR) && bo !=
> > > ERR_PTR(-ERESTARTSYS))
> > > +				KUNIT_FAIL(test, "Error creating
> > > bo: %pe\n", bo);
> > > +			kfree(link);
> > > +			break;
> > > +		}
> > > +		link->bo = bo;
> > > +		list_add_tail(&link->link, &bos);
> > > +		xe_bo_lock(bo, false);
> > > +
> > > +		/*
> > > +		 * If we're low on swap entries, we can't shrink
> > > unless the bo
> > > +		 * is marked purgeable.
> > > +		 */
> > > +		if (get_nr_swap_pages() < (XE_BO_SHRINK_SIZE >>
> > > PAGE_SHIFT) * 128) {
> > > +			struct xe_ttm_tt *xe_tt =
> > > +				container_of(bo->ttm.ttm,
> > > typeof(*xe_tt), ttm);
> > > +			long num_pages = xe_tt->ttm.num_pages;
> > > +
> > > +			xe_tt->purgeable = true;
> > > +			xe_shrinker_mod_pages(xe->mem.shrinker, -
> > > num_pages,
> > > +					      num_pages);
> > > +		}
> > > +
> > > +		mem_type = bo->ttm.resource->mem_type;
> > > +		xe_bo_unlock(bo);
> > > +		if (mem_type != XE_PL_TT)
> > > +			KUNIT_FAIL(test, "Bo in incorrect memory
> > > type: %u\n",
> > > +				   bo->ttm.resource->mem_type);
> > > +		cond_resched();
> > > +		if (signal_pending(current))
> > > +			break;
> > > +	}
> > > +
> > > +	/* Read back and destroy bos */
> > > +	list_for_each_entry_safe_reverse(link, next, &bos, link) {
> > > +		static struct ttm_operation_ctx ctx =
> > > {.interruptible = true};
> > > +		struct xe_bo *bo = link->bo;
> > > +		int ret;
> > > +
> > > +		if (!signal_pending(current)) {
> > > +			xe_bo_lock(bo, NULL);
> > > +			ret = ttm_bo_validate(&bo->ttm,
> > > &tt_placement, &ctx);
> > > +			xe_bo_unlock(bo);
> > > +			if (ret && ret != -EINTR)
> > > +				KUNIT_FAIL(test, "Validation
> > > failed: %pe\n",
> > > +					   ERR_PTR(ret));
> > > +			else if (ret)
> > > +				interrupted++;
> > > +			else
> > > +				successful++;
> > > +		}
> > > +		xe_bo_put(link->bo);
> > > +		list_del(&link->link);
> > > +		kfree(link);
> > > +		cond_resched();
> > > +	}
> > > +	kunit_info(test, "Readbacks interrupted: %u successful:
> > > %u\n",
> > > +		   interrupted, successful);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +void xe_bo_shrink_kunit(struct kunit *test)
> > > +{
> > > +	xe_call_for_each_device(shrink_test_run_device);
> > > +}
> > > +EXPORT_SYMBOL_IF_KUNIT(xe_bo_shrink_kunit);
> > > diff --git a/drivers/gpu/drm/xe/tests/xe_bo_test.c
> > > b/drivers/gpu/drm/xe/tests/xe_bo_test.c
> > > index a324cde77db8..317fa923e287 100644
> > > --- a/drivers/gpu/drm/xe/tests/xe_bo_test.c
> > > +++ b/drivers/gpu/drm/xe/tests/xe_bo_test.c
> > > @@ -10,6 +10,7 @@
> > >  static struct kunit_case xe_bo_tests[] = {
> > >  	KUNIT_CASE(xe_ccs_migrate_kunit),
> > >  	KUNIT_CASE(xe_bo_evict_kunit),
> > > +	KUNIT_CASE_SLOW(xe_bo_shrink_kunit),
> > >  	{}
> > >  };
> > >  
> > > diff --git a/drivers/gpu/drm/xe/tests/xe_bo_test.h
> > > b/drivers/gpu/drm/xe/tests/xe_bo_test.h
> > > index 0113ab45066a..7f44d14a45c5 100644
> > > --- a/drivers/gpu/drm/xe/tests/xe_bo_test.h
> > > +++ b/drivers/gpu/drm/xe/tests/xe_bo_test.h
> > > @@ -10,5 +10,6 @@ struct kunit;
> > >  
> > >  void xe_ccs_migrate_kunit(struct kunit *test);
> > >  void xe_bo_evict_kunit(struct kunit *test);
> > > +void xe_bo_shrink_kunit(struct kunit *test);
> > >  
> > >  #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_bo.c
> > > b/drivers/gpu/drm/xe/xe_bo.c
> > > index 65c696966e96..6ab63d1642ae 100644
> > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > @@ -10,6 +10,7 @@
> > >  #include <drm/drm_drv.h>
> > >  #include <drm/drm_gem_ttm_helper.h>
> > >  #include <drm/drm_managed.h>
> > > +#include <drm/ttm/ttm_backup.h>
> > >  #include <drm/ttm/ttm_device.h>
> > >  #include <drm/ttm/ttm_placement.h>
> > >  #include <drm/ttm/ttm_tt.h>
> > > @@ -25,6 +26,7 @@
> > >  #include "xe_pm.h"
> > >  #include "xe_preempt_fence.h"
> > >  #include "xe_res_cursor.h"
> > > +#include "xe_shrinker.h"
> > >  #include "xe_trace_bo.h"
> > >  #include "xe_ttm_stolen_mgr.h"
> > >  #include "xe_vm.h"
> > > @@ -278,11 +280,15 @@ static void xe_evict_flags(struct
> > > ttm_buffer_object *tbo,
> > >  	}
> > >  }
> > >  
> > > +/* struct xe_ttm_tt - Subclassed ttm_tt for xe */
> > >  struct xe_ttm_tt {
> > >  	struct ttm_tt ttm;
> > > -	struct device *dev;
> > > +	/** @xe - The xe device */
> > > +	struct xe_device *xe;
> > >  	struct sg_table sgt;
> > >  	struct sg_table *sg;
> > > +	/** @purgeable - Whether the bo is purgeable (WONTNEED) */
> > 
> > So we need to add WONTNEED to our uAPI, right?
> 
> Not strictly, but we have work ongoing to implement this. Bos that UMDs
> pool are typically WONTNEED, and unless we mark them as such, shrinking
> them will not work unless we have swap space available, and then
> only with a completely unnecessary copy...
> 

Makes sense.
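
As a concrete sketch of what such a WONTNEED path could build on: the
helper below is hypothetical (only xe_tt->purgeable and
xe_shrinker_mod_pages() come from this series), and it would have to live
in xe_bo.c where struct xe_ttm_tt is defined. It mirrors what the kunit
test above already does when swap is short.

static void xe_bo_mark_wontneed(struct xe_device *xe, struct xe_bo *bo)
{
	struct xe_ttm_tt *xe_tt;
	long num_pages;

	xe_bo_lock(bo, false);
	if (!bo->ttm.ttm || !ttm_tt_is_populated(bo->ttm.ttm)) {
		xe_bo_unlock(bo);
		return;
	}

	xe_tt = container_of(bo->ttm.ttm, typeof(*xe_tt), ttm);
	if (!xe_tt->purgeable) {
		num_pages = xe_tt->ttm.num_pages;
		xe_tt->purgeable = true;
		/* Move the pages from the shrinkable to the purgeable count. */
		xe_shrinker_mod_pages(xe->mem.shrinker, -num_pages, num_pages);
	}
	xe_bo_unlock(bo);
}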

> > 
> > > +	bool purgeable;
> > >  };
> > >  
> > >  static int xe_tt_map_sg(struct ttm_tt *tt)
> > > @@ -291,7 +297,8 @@ static int xe_tt_map_sg(struct ttm_tt *tt)
> > >  	unsigned long num_pages = tt->num_pages;
> > >  	int ret;
> > >  
> > > -	XE_WARN_ON(tt->page_flags & TTM_TT_FLAG_EXTERNAL);
> > > +	XE_WARN_ON((tt->page_flags & TTM_TT_FLAG_EXTERNAL) &&
> > > +		   !(tt->page_flags &
> > > TTM_TT_FLAG_EXTERNAL_MAPPABLE));
> > >  
> > >  	if (xe_tt->sg)
> > >  		return 0;
> > > @@ -299,13 +306,13 @@ static int xe_tt_map_sg(struct ttm_tt *tt)
> > >  	ret = sg_alloc_table_from_pages_segment(&xe_tt->sgt, tt-
> > > >pages,
> > >  						num_pages, 0,
> > >  						(u64)num_pages <<
> > > PAGE_SHIFT,
> > > -
> > > 						xe_sg_segment_size(xe_tt->dev),
> > > +						xe_sg_segment_size
> > > (xe_tt->xe->drm.dev),
> > >  						GFP_KERNEL);
> > >  	if (ret)
> > >  		return ret;
> > >  
> > >  	xe_tt->sg = &xe_tt->sgt;
> > > -	ret = dma_map_sgtable(xe_tt->dev, xe_tt->sg,
> > > DMA_BIDIRECTIONAL,
> > > +	ret = dma_map_sgtable(xe_tt->xe->drm.dev, xe_tt->sg,
> > > DMA_BIDIRECTIONAL,
> > >  			      DMA_ATTR_SKIP_CPU_SYNC);
> > >  	if (ret) {
> > >  		sg_free_table(xe_tt->sg);
> > > @@ -321,7 +328,7 @@ static void xe_tt_unmap_sg(struct ttm_tt *tt)
> > >  	struct xe_ttm_tt *xe_tt = container_of(tt, struct
> > > xe_ttm_tt, ttm);
> > >  
> > >  	if (xe_tt->sg) {
> > > -		dma_unmap_sgtable(xe_tt->dev, xe_tt->sg,
> > > +		dma_unmap_sgtable(xe_tt->xe->drm.dev, xe_tt->sg,
> > >  				  DMA_BIDIRECTIONAL, 0);
> > >  		sg_free_table(xe_tt->sg);
> > >  		xe_tt->sg = NULL;
> > > @@ -336,21 +343,41 @@ struct sg_table *xe_bo_sg(struct xe_bo *bo)
> > >  	return xe_tt->sg;
> > >  }
> > >  
> > > +/*
> > > + * Account ttm pages against the device shrinker's shrinkable and
> > > + * purgeable counts.
> > > + */
> > > +static void xe_ttm_tt_account(struct ttm_tt *tt, bool add)
> > > +{
> > 
> > Again I think bools are frowned upon as arguments. Maybe just two
> > functions - add / sub?
> 
> OK,
> 
> > 
> > > +	struct xe_ttm_tt *xe_tt = container_of(tt, struct
> > > xe_ttm_tt, ttm);
> > > +	long num_pages = tt->num_pages;
> > > +
> > > +	if (!add)
> > > +		num_pages = -num_pages;
> > > +
> > > +	if (xe_tt->purgeable)
> > > +		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, 0,
> > > num_pages);
> > > +	else
> > > +		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker,
> > > num_pages, 0);
> > > +}
> > > +
> > >  static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object
> > > *ttm_bo,
> > >  				       u32 page_flags)
> > >  {
> > >  	struct xe_bo *bo = ttm_to_xe_bo(ttm_bo);
> > >  	struct xe_device *xe = xe_bo_device(bo);
> > > -	struct xe_ttm_tt *tt;
> > > +	struct xe_ttm_tt *xe_tt;
> > > +	struct ttm_tt *tt;
> > >  	unsigned long extra_pages;
> > >  	enum ttm_caching caching;
> > >  	int err;
> > >  
> > > -	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
> > > -	if (!tt)
> > > +	xe_tt = kzalloc(sizeof(*xe_tt), GFP_KERNEL);
> > > +	if (!xe_tt)
> > >  		return NULL;
> > >  
> > > -	tt->dev = xe->drm.dev;
> > > +	tt = &xe_tt->ttm;
> > > +	xe_tt->xe = xe;
> > >  
> > >  	extra_pages = 0;
> > >  	if (xe_bo_needs_ccs_pages(bo))
> > > @@ -387,42 +414,128 @@ static struct ttm_tt
> > > *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
> > >  		caching = ttm_uncached;
> > >  	}
> > >  
> > > -	err = ttm_tt_init(&tt->ttm, &bo->ttm, page_flags, caching,
> > > extra_pages);
> > > +	if (ttm_bo->type != ttm_bo_type_sg)
> > > +		page_flags |= TTM_TT_FLAG_EXTERNAL |
> > > TTM_TT_FLAG_EXTERNAL_MAPPABLE;
> > > +
> > > +	err = ttm_tt_init(tt, &bo->ttm, page_flags, caching,
> > > extra_pages);
> > >  	if (err) {
> > > -		kfree(tt);
> > > +		kfree(xe_tt);
> > >  		return NULL;
> > >  	}
> > >  
> > > -	return &tt->ttm;
> > > +	tt->backup = ttm_backup_shmem_create(tt->num_pages <<
> > > PAGE_SHIFT);
> > > +	if (IS_ERR(tt->backup)) {
> > > +		ttm_tt_fini(tt);
> > 
> > Mentioned this in the previous review: I think you need to set
> > tt->backup to NULL here or update ttm_tt_fini to understand
> > IS_ERR(tt->backup).
> > 
> > Also, maybe a dumb question, but could we just have a global backup for
> > all BOs? Would that be better than each BO creating its own backup?
> 
> I initially made the code like that, when we had a global backend
> directly into the swap cache. But I figure shmem wants one file per bo,
> otherwise we'd need one gigantic shmem object... Not sure if that
> works...
>

If one backup per shmem object makes sense, then let's do that.

I do think we should remove the restriction of only supporting 1 backup
per BO, so that a driver can use a global backup if it wants to.

e.g. don't fini / free the backup in ttm_tt_fini; rather, have the driver
completely own the backup field. Kinda goofy for the driver to set up the
field but have TTM fini / free it.
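
For what it's worth, a minimal sketch of the error-path fix being asked
for here, keeping ttm_tt_fini() as-is and just making sure it never sees
an ERR_PTR (this is the xe_ttm_tt_create() hunk quoted above):

	tt->backup = ttm_backup_shmem_create(tt->num_pages << PAGE_SHIFT);
	if (IS_ERR(tt->backup)) {
		/* Never leave an ERR_PTR behind for ttm_tt_fini() to free. */
		tt->backup = NULL;
		ttm_tt_fini(tt);
		kfree(xe_tt);
		return NULL;
	}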

> > 
> > > +		kfree(xe_tt);
> > > +		return NULL;
> > > +	}
> > > +
> > > +	return tt;
> > >  }
> > >  
> > >  static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct
> > > ttm_tt *tt,
> > >  			      struct ttm_operation_ctx *ctx)
> > >  {
> > > +	struct xe_ttm_tt *xe_tt = container_of(tt, struct
> > > xe_ttm_tt, ttm);
> > >  	int err;
> > >  
> > >  	/*
> > >  	 * dma-bufs are not populated with pages, and the dma-
> > >  	 * addresses are set up when moved to XE_PL_TT.
> > >  	 */
> > > -	if (tt->page_flags & TTM_TT_FLAG_EXTERNAL)
> > > +	if ((tt->page_flags & TTM_TT_FLAG_EXTERNAL) &&
> > > +	    !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE))
> > >  		return 0;
> > >  
> > >  	err = ttm_pool_alloc(&ttm_dev->pool, tt, ctx);
> > >  	if (err)
> > >  		return err;
> > >  
> > > -	return err;
> > > +	xe_tt->purgeable = false;
> > > +	xe_ttm_tt_account(tt, true);
> > > +
> > > +	return 0;
> > >  }
> > >  
> > >  static void xe_ttm_tt_unpopulate(struct ttm_device *ttm_dev,
> > > struct ttm_tt *tt)
> > >  {
> > > -	if (tt->page_flags & TTM_TT_FLAG_EXTERNAL)
> > > +	if ((tt->page_flags & TTM_TT_FLAG_EXTERNAL) &&
> > > +	    !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE))
> > >  		return;
> > >  
> > >  	xe_tt_unmap_sg(tt);
> > >  
> > > -	return ttm_pool_free(&ttm_dev->pool, tt);
> > > +	ttm_pool_free(&ttm_dev->pool, tt);
> > > +	xe_ttm_tt_account(tt, false);
> > > +}
> > > +
> > > +/**
> > > + * xe_bo_shrink() - Try to shrink an xe bo.
> > > + * @walk:  - The walk parameters
> > > + * @bo: The TTM buffer object
> > > + * @purge: Only consider purgeable bos.
> > > + * @writeback: Try to write back to persistent storage.
> > > + *
> > > + * Try to shrink or purge a bo, and if it succeeds, unmap dma.
> > > + * Note that we also need to be able to handle non-xe bos
> > > + * (ghost bos), but only if the struct ttm_tt is embedded in
> > > + * a struct xe_ttm_tt.
> > > + *
> > > + * Return: The number of pages shrunken or purged, or negative
> > > error
> > > + * code on failure.
> > > + */
> > > +long xe_bo_shrink(struct ttm_lru_walk *walk, struct
> > > ttm_buffer_object *bo,
> > > +		  bool purge, bool writeback)
> > > +{
> > > +	struct ttm_tt *tt = bo->ttm;
> > > +	struct xe_ttm_tt *xe_tt = container_of(tt, struct
> > > xe_ttm_tt, ttm);
> > > +	struct ttm_place place = {.mem_type = bo->resource-
> > > >mem_type};
> > > +	struct xe_bo *xe_bo = ttm_to_xe_bo(bo);
> > > +	struct xe_device *xe = xe_tt->xe;
> > > +	bool needs_rpm;
> > > +	long lret = 0L;
> > > +
> > > +	if (!tt || !ttm_tt_is_populated(tt) ||
> > > +	    !(tt->page_flags & TTM_TT_FLAG_EXTERNAL_MAPPABLE) ||
> > > +	    (purge && !xe_tt->purgeable))
> > > +		return 0L;
> > > +
> > > +	if (!ttm_bo_eviction_valuable(bo, &place))
> > > +		return 0L;
> > > +
> > > +	/* Beware of zombies (GEM object refcount == 0) and
> > > ghosts. */
> > > +	if (!xe_bo_is_xe_bo(bo) || !xe_bo_get_unless_zero(xe_bo))
> > > {
> > > +		struct ttm_placement null_placement = {
> > > .num_placement = 0 };
> > > +
> > > +		lret = ttm_bo_wait_ctx(bo, walk->ctx);
> > > +		if (lret)
> > > +			return lret;
> > > +
> > > +		/* Purge the bo content! */
> > > +		ttm_bo_validate(bo, &null_placement, walk->ctx);
> > > +		return tt->num_pages;
> > > +	}
> > > +
> > > +	/* System CCS needs gpu copy when moving PL_TT ->
> > > PL_SYSTEM */
> > > +	needs_rpm = (!IS_DGFX(xe) && bo->resource->mem_type !=
> > > XE_PL_SYSTEM &&
> > > +		     xe_bo && xe_bo_needs_ccs_pages(xe_bo) &&
> > > !xe_tt->purgeable);
> > 
> > Is xe_bo check really needed here?
> 
> Yes, I think otherwise xe_bo_needs_ccs_pages will be called with NULL
> for ghost objects that aren't xe_bo.
> 

Got it.
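
Going back to the accounting helper further up, the add / sub split being
agreed on could look roughly like this (sketch only; the names are
whatever the next revision ends up picking):

static void xe_ttm_tt_account_add(struct ttm_tt *tt)
{
	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);

	if (xe_tt->purgeable)
		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, 0, tt->num_pages);
	else
		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, tt->num_pages, 0);
}

static void xe_ttm_tt_account_subtract(struct ttm_tt *tt)
{
	struct xe_ttm_tt *xe_tt = container_of(tt, struct xe_ttm_tt, ttm);

	if (xe_tt->purgeable)
		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, 0, -(long)tt->num_pages);
	else
		xe_shrinker_mod_pages(xe_tt->xe->mem.shrinker, -(long)tt->num_pages, 0);
}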

> > 
> > > +	if (needs_rpm && !xe_pm_runtime_get_if_active(xe))
> > > +		goto out_unref;
> > > +
> > > +	lret = ttm_bo_try_shrink(walk, bo, xe_tt->purgeable,
> > > writeback);
> > > +	if (needs_rpm)
> > > +		xe_pm_runtime_put(xe);
> > > +
> > > +	if (lret > 0) {
> > > +		xe_assert(xe, !ttm_tt_is_populated(tt));
> > > +
> > > +		xe_ttm_tt_account(tt, false);
> > > +	}
> > > +
> > > +out_unref:
> > > +	xe_bo_put(xe_bo);
> > > +
> > > +	return lret;
> > >  }
> > >  
> > >  static void xe_ttm_tt_destroy(struct ttm_device *ttm_dev, struct
> > > ttm_tt *tt)
> > > @@ -1238,6 +1351,7 @@ struct xe_bo *___xe_bo_create_locked(struct
> > > xe_device *xe, struct xe_bo *bo,
> > >  	struct ttm_operation_ctx ctx = {
> > >  		.interruptible = true,
> > >  		.no_wait_gpu = false,
> > > +		.gfp_retry_mayfail = true,
> > 
> > Can you explain why you are setting this?
> 
> Hm. We might want this in a separate patch. Without this, bo memory
> allocation will typically never fail, but instead start the OOM killer.
> I don't think that's the behaviour we want, but yeah deserves its own
> patch and review.
> 

Sounds good.
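
For context, the behaviour such a separate patch would need to spell out
is roughly the following (sketch; "placement" is just a stand-in): with
gfp_retry_mayfail set in the ttm_operation_ctx, populating bo memory under
pressure fails with -ENOMEM instead of invoking the OOM killer, so callers
have to be prepared to propagate the error.

	struct ttm_operation_ctx ctx = {
		.interruptible = true,
		.no_wait_gpu = false,
		/* Fail with -ENOMEM under pressure rather than OOM-killing. */
		.gfp_retry_mayfail = true,
	};
	int err;

	err = ttm_bo_validate(&bo->ttm, placement, &ctx);
	if (err == -ENOMEM)
		return err;	/* Propagate to userspace instead of OOM. */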

> > 
> > >  	};
> > >  	struct ttm_placement *placement;
> > >  	uint32_t alignment;
> > > @@ -1681,6 +1795,8 @@ int xe_bo_pin_external(struct xe_bo *bo)
> > >  	}
> > >  
> > >  	ttm_bo_pin(&bo->ttm);
> > > +	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
> > > +		xe_ttm_tt_account(bo->ttm.ttm, false);
> > >  
> > >  	/*
> > >  	 * FIXME: If we always use the reserve / unreserve
> > > functions for locking
> > > @@ -1739,6 +1855,8 @@ int xe_bo_pin(struct xe_bo *bo)
> > >  	}
> > >  
> > >  	ttm_bo_pin(&bo->ttm);
> > > +	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
> > > +		xe_ttm_tt_account(bo->ttm.ttm, false);
> > >  
> > >  	/*
> > >  	 * FIXME: If we always use the reserve / unreserve
> > > functions for locking
> > > @@ -1773,6 +1891,9 @@ void xe_bo_unpin_external(struct xe_bo *bo)
> > >  	spin_unlock(&xe->pinned.lock);
> > >  
> > >  	ttm_bo_unpin(&bo->ttm);
> > > +	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
> > > +		xe_ttm_tt_account(bo->ttm.ttm, true);
> > > +
> > 
> > Nit: Extra newline.
> Will fix.
> > 
> > >  
> > >  	/*
> > >  	 * FIXME: If we always use the reserve / unreserve
> > > functions for locking
> > > @@ -1801,6 +1922,8 @@ void xe_bo_unpin(struct xe_bo *bo)
> > >  	}
> > >  
> > >  	ttm_bo_unpin(&bo->ttm);
> > > +	if (bo->ttm.ttm && ttm_tt_is_populated(bo->ttm.ttm))
> > > +		xe_ttm_tt_account(bo->ttm.ttm, true);
> > >  }
> > >  
> > >  /**
> > > diff --git a/drivers/gpu/drm/xe/xe_bo.h
> > > b/drivers/gpu/drm/xe/xe_bo.h
> > > index 6de894c728f5..8463e3f3f6f1 100644
> > > --- a/drivers/gpu/drm/xe/xe_bo.h
> > > +++ b/drivers/gpu/drm/xe/xe_bo.h
> > > @@ -63,6 +63,7 @@
> > >  #define XE_BO_PROPS_INVALID	(-1)
> > >  
> > >  struct sg_table;
> > > +struct xe_ttm_lru_walk;
> > >  
> > >  struct xe_bo *xe_bo_alloc(void);
> > >  void xe_bo_free(struct xe_bo *bo);
> > > @@ -126,6 +127,28 @@ static inline struct xe_bo *xe_bo_get(struct
> > > xe_bo *bo)
> > >  	return bo;
> > >  }
> > >  
> > > +/*
> > > + * xe_bo_get_unless_zero() - Conditionally obtain a GEM object
> > > refcount on an
> > > + * xe bo
> > > + * @bo: The bo for which we want to obtain a refcount.
> > > + *
> > > + * There is a short window between where the bo's GEM object
> > > refcount reaches
> > > + * zero and where we put the final ttm_bo reference. Code in the
> > > eviction- and
> > > + * shrinking path should therefore attempt to grab a gem object
> > > reference before
> > > + * trying to use members outside of the base class ttm object.
> > > This function is
> > > + * intended for that purpose. On successful return, this function
> > > must be paired
> > > + * with an xe_bo_put().
> > > + *
> > > + * Return: @bo on success, NULL on failure.
> > > + */
> > > +static inline __must_check struct xe_bo
> > > *xe_bo_get_unless_zero(struct xe_bo *bo)
> > > +{
> > > +	if (!bo || !kref_get_unless_zero(&bo->ttm.base.refcount))
> > > +		return NULL;
> > > +
> > > +	return bo;
> > > +}
> > > +
> > >  static inline void xe_bo_put(struct xe_bo *bo)
> > >  {
> > >  	if (bo)
> > > @@ -315,6 +338,9 @@ static inline unsigned int
> > > xe_sg_segment_size(struct device *dev)
> > >  
> > >  #define
> > > i915_gem_object_flush_if_display(obj)		((void)(obj))
> > >  
> > > +long xe_bo_shrink(struct ttm_lru_walk *walk, struct
> > > ttm_buffer_object *bo,
> > > +		  bool purge, bool writeback);
> > > +
> > >  #if IS_ENABLED(CONFIG_DRM_XE_KUNIT_TEST)
> > >  /**
> > >   * xe_bo_is_mem_type - Whether the bo currently resides in the
> > > given
> > > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > > b/drivers/gpu/drm/xe/xe_device.c
> > > index cfda7cb5df2c..58fecc4b0a18 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > @@ -47,6 +47,7 @@
> > >  #include "xe_perf.h"
> > >  #include "xe_pm.h"
> > >  #include "xe_query.h"
> > > +#include "xe_shrinker.h"
> > >  #include "xe_sriov.h"
> > >  #include "xe_tile.h"
> > >  #include "xe_ttm_stolen_mgr.h"
> > > @@ -241,6 +242,9 @@ static void xe_device_destroy(struct drm_device
> > > *dev, void *dummy)
> > >  	if (xe->unordered_wq)
> > >  		destroy_workqueue(xe->unordered_wq);
> > >  
> > > +	if (!IS_ERR_OR_NULL(xe->mem.shrinker))
> > > +		xe_shrinker_destroy(xe->mem.shrinker);
> > > +
> > >  	ttm_device_fini(&xe->ttm);
> > >  }
> > >  
> > > @@ -270,6 +274,10 @@ struct xe_device *xe_device_create(struct
> > > pci_dev *pdev,
> > >  	if (err)
> > >  		goto err;
> > >  
> > > +	xe->mem.shrinker = xe_shrinker_create(xe);
> > > +	if (IS_ERR(xe->mem.shrinker))
> > > +		return ERR_CAST(xe->mem.shrinker);
> > > +
> > >  	xe->info.devid = pdev->device;
> > >  	xe->info.revid = pdev->revision;
> > >  	xe->info.force_execlist = xe_modparam.force_execlist;
> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > > b/drivers/gpu/drm/xe/xe_device_types.h
> > > index c37be471d11c..3d5440aba52e 100644
> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > @@ -325,6 +325,8 @@ struct xe_device {
> > >  		struct xe_mem_region vram;
> > >  		/** @mem.sys_mgr: system TTM manager */
> > >  		struct ttm_resource_manager sys_mgr;
> > > +		/** @mem.shrinker: system memory shrinker. */
> > > +		struct xe_shrinker *shrinker;
> > >  	} mem;
> > >  
> > >  	/** @sriov: device level virtualization data */
> > > diff --git a/drivers/gpu/drm/xe/xe_shrinker.c
> > > b/drivers/gpu/drm/xe/xe_shrinker.c
> > > new file mode 100644
> > > index 000000000000..3f9554bdc06b
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_shrinker.c
> > > @@ -0,0 +1,287 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + */
> > > +
> > > +#include <linux/shrinker.h>
> > > +#include <linux/swap.h>
> > > +
> > > +#include <drm/ttm/ttm_bo.h>
> > > +#include <drm/ttm/ttm_tt.h>
> > > +
> > > +#include "xe_bo.h"
> > > +#include "xe_pm.h"
> > > +#include "xe_shrinker.h"
> > > +
> > > +/**
> > > + * struct xe_shrinker - per-device shrinker
> > > + * @xe: Back pointer to the device.
> > > + * @lock: Lock protecting accounting.
> > > + * @shrinkable_pages: Number of pages that are currently
> > > shrinkable.
> > > + * @purgeable_pages: Number of pages that are currently purgeable.
> > > + * @shrink: Pointer to the mm shrinker.
> > > + * @pm_worker: Worker to wake up the device if required.
> > > + */
> > > +struct xe_shrinker {
> > > +	struct xe_device *xe;
> > > +	rwlock_t lock;
> > > +	long shrinkable_pages;
> > > +	long purgeable_pages;
> > > +	struct shrinker *shrink;
> > > +	struct work_struct pm_worker;
> > > +};
> > > +
> > > +/**
> > > + * struct xe_shrink_lru_walk - lru_walk subclass for shrinker
> > > + * @walk: The embedded base class.
> > > + * @xe: Pointer to the xe device.
> > > + * @purge: Purgeable-only request from the shrinker.
> > > + * @writeback: Try to write back to persistent storage.
> > > + */
> > > +struct xe_shrink_lru_walk {
> > > +	struct ttm_lru_walk walk;
> > > +	struct xe_device *xe;
> > > +	bool purge;
> > > +	bool writeback;
> > > +};
> > > +
> > > +static struct xe_shrinker *to_xe_shrinker(struct shrinker *shrink)
> > > +{
> > > +	return shrink->private_data;
> > > +}
> > > +
> > > +static struct xe_shrink_lru_walk *
> > > +to_xe_shrink_lru_walk(struct ttm_lru_walk *walk)
> > > +{
> > > +	return container_of(walk, struct xe_shrink_lru_walk,
> > > walk);
> > > +}
> > > +
> > > +/**
> > > + * xe_shrinker_mod_pages() - Modify shrinker page accounting
> > > + * @shrinker: Pointer to the struct xe_shrinker.
> > > + * @shrinkable: Shrinkable pages delta. May be negative.
> > > + * @purgeable: Purgeable page delta. May be negative.
> > > + *
> > > + * Modifies the shrinkable and purgeable pages accounting.
> > > + */
> > > +void
> > > +xe_shrinker_mod_pages(struct xe_shrinker *shrinker, long
> > > shrinkable, long purgeable)
> > > +{
> > > +	write_lock(&shrinker->lock);
> > > +	shrinker->shrinkable_pages += shrinkable;
> > > +	shrinker->purgeable_pages += purgeable;
> > > +	write_unlock(&shrinker->lock);
> > > +}
> > > +
> > > +static long xe_shrinker_process_bo(struct ttm_lru_walk *walk,
> > > struct ttm_buffer_object *bo)
> > > +{
> > > +	struct xe_shrink_lru_walk *shrink_walk =
> > > to_xe_shrink_lru_walk(walk);
> > > +
> > > +	return xe_bo_shrink(walk, bo, shrink_walk->purge,
> > > shrink_walk->writeback);
> > > +}
> > > +
> > > +static long xe_shrinker_walk(struct xe_shrink_lru_walk
> > > *shrink_walk, long target)
> > > +{
> > > +	struct xe_device *xe = shrink_walk->xe;
> > > +	struct ttm_resource_manager *man;
> > > +	unsigned int mem_type;
> > > +	long sofar = 0;
> > > +	long lret;
> > > +
> > > +	for (mem_type = XE_PL_SYSTEM; mem_type <= XE_PL_TT;
> > > ++mem_type) {
> > > +		man = ttm_manager_type(&xe->ttm, mem_type);
> > > +		if (!man || !man->use_tt)
> > > +			continue;
> > > +
> > > +		lret = ttm_lru_walk_for_evict(&shrink_walk->walk,
> > > &xe->ttm, man, target);
> > > +		if (lret < 0)
> > > +			return lret;
> > > +
> > > +		sofar += lret;
> > > +		if (sofar >= target)
> > > +			break;
> > > +	}
> > > +
> > > +	return sofar;
> > > +}
> > > +
> > > +static unsigned long
> > > +xe_shrinker_count(struct shrinker *shrink, struct shrink_control
> > > *sc)
> > > +{
> > > +	struct xe_shrinker *shrinker = to_xe_shrinker(shrink);
> > > +	unsigned long num_pages;
> > > +
> > > +	num_pages = get_nr_swap_pages();
> > > +	read_lock(&shrinker->lock);
> > > +	num_pages = min_t(unsigned long, num_pages, shrinker-
> > > >shrinkable_pages);
> > > +	num_pages += shrinker->purgeable_pages;
> > > +	read_unlock(&shrinker->lock);
> > > +
> > > +	return num_pages ? num_pages : SHRINK_EMPTY;
> > > +}
> > > +
> > > +static const struct ttm_lru_walk_ops xe_shrink_ops = {
> > > +	.process_bo = xe_shrinker_process_bo,
> > > +};
> > > +
> > > +/*
> > > + * Check if we need runtime pm, and if so try to grab a reference
> > > if
> > > + * already active. If grabbing a reference fails, queue a worker
> > > that
> > > + * does it for us outside of reclaim, but don't wait for it to
> > > complete.
> > > + * If bo shrinking needs an rpm reference and we don't have it
> > > (yet),
> > > + * that bo will be skipped anyway.
> > > + */
> > > +static bool xe_shrinker_runtime_pm_get(struct xe_shrinker
> > > *shrinker, bool force,
> > > +				       unsigned long nr_to_scan)
> > > +{
> > > +	struct xe_device *xe = shrinker->xe;
> > > +
> > > +	if (IS_DGFX(xe) || !xe_device_has_flat_ccs(xe) ||
> > > +	    !get_nr_swap_pages())
> > > +		return false;
> > > +
> > > +	if (!force) {
> > > +		read_lock(&shrinker->lock);
> > > +		force = (nr_to_scan > shrinker->purgeable_pages);
> > > +		read_unlock(&shrinker->lock);
> > > +		if (!force)
> > > +			return false;
> > > +	}
> > > +
> > > +	if (!xe_pm_runtime_get_if_active(xe)) {
> > > +		queue_work(xe->unordered_wq, &shrinker-
> > > >pm_worker);
> > > +		return false;
> > > +	}
> > > +
> > > +	return true;
> > > +}
> > > +
> > > +static void xe_shrinker_runtime_pm_put(struct xe_shrinker
> > > *shrinker, bool runtime_pm)
> > > +{
> > > +	if (runtime_pm)
> > > +		xe_pm_runtime_put(shrinker->xe);
> > > +}
> > > +
> > > +static unsigned long xe_shrinker_scan(struct shrinker *shrink,
> > > struct shrink_control *sc)
> > > +{
> > > +	struct xe_shrinker *shrinker = to_xe_shrinker(shrink);
> > > +	bool is_kswapd = current_is_kswapd();
> > > +	struct ttm_operation_ctx ctx = {
> > > +		.interruptible = false,
> > > +		.no_wait_gpu = !is_kswapd,
> > > +	};
> > > +	unsigned long nr_to_scan, freed = 0;
> > > +	struct xe_shrink_lru_walk shrink_walk = {
> > > +		.walk = {
> > > +			.ops = &xe_shrink_ops,
> > > +			.ctx = &ctx,
> > > +			.trylock_only = true,
> > > +		},
> > > +		.xe = shrinker->xe,
> > > +		.purge = true,
> > > +		.writeback = is_kswapd,
> > > +	};
> > > +	bool runtime_pm;
> > > +	bool purgeable;
> > > +	long ret;
> > > +
> > > +	sc->nr_scanned = 0;
> > > +	nr_to_scan = sc->nr_to_scan;
> > > +
> > > +	read_lock(&shrinker->lock);
> > > +	purgeable = !!shrinker->purgeable_pages;
> > > +	read_unlock(&shrinker->lock);
> > > +
> > > +	/* Might need runtime PM. Try to wake early if it looks
> > > like it. */
> > > +	runtime_pm = xe_shrinker_runtime_pm_get(shrinker, false,
> > > nr_to_scan);
> > > +
> > > +	while (purgeable && freed < nr_to_scan) {
> > > +		ret = xe_shrinker_walk(&shrink_walk, nr_to_scan);
> > > +		if (ret <= 0)
> > > +			break;
> > > +
> > > +		freed += ret;
> > > +	}
> > > +
> > > +	sc->nr_scanned = freed;
> > > +	if (freed < nr_to_scan)
> > > +		nr_to_scan -= freed;
> > > +	else
> > > +		nr_to_scan = 0;
> > > +	if (!nr_to_scan)
> > > +		goto out;
> > > +
> > > +	/* If we didn't wake before, try to do it now if needed.
> > > */
> > > +	if (!runtime_pm)
> > > +		runtime_pm = xe_shrinker_runtime_pm_get(shrinker,
> > > true, 0);
> > > +
> > > +	shrink_walk.purge = false;
> > > +	nr_to_scan = sc->nr_to_scan;
> > > +	while (freed < nr_to_scan) {
> > > +		ret = xe_shrinker_walk(&shrink_walk, nr_to_scan);
> > > +		if (ret <= 0)
> > > +			break;
> > > +
> > > +		freed += ret;
> > > +	}
> > > +
> > > +	sc->nr_scanned = freed;
> > > +
> > > +out:
> > > +	xe_shrinker_runtime_pm_put(shrinker, runtime_pm);
> > > +	return freed ? freed : SHRINK_STOP;
> > > +}
> > > +
> > > +/* Wake up the device for shrinking. */
> > > +static void xe_shrinker_pm(struct work_struct *work)
> > > +{
> > > +	struct xe_shrinker *shrinker =
> > > +		container_of(work, typeof(*shrinker), pm_worker);
> > > +
> > > +	xe_pm_runtime_get(shrinker->xe);
> > > +	xe_pm_runtime_put(shrinker->xe);
> > 
> > So I don't really get this. How does this help the shrinker get a PM
> > ref? In the small window between xe_pm_runtime_get / put the shrinker
> > grabs one via xe_pm_runtime_get_if_active? Seems fragile.
> 
> The pm has a delay after the last put (typically 1s?), and if the
> shrinker fails to obtain a reference it will be retried. Typically it
> will then succeed, and when it does, it starts a new delay.
> 

Ah, forgot that there is a delay on the final put. This makes sense now.
Can't really think of a better solution than this at the moment. Maybe
add a comment indicating this?

Matt
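
A comment along these lines on the worker would probably capture it
(sketch; the exact autosuspend delay is whatever xe configures):

/*
 * Wake up the device for shrinking.
 *
 * The shrinker itself only takes opportunistic runtime-pm references via
 * xe_pm_runtime_get_if_active(), since it must not wait inside reclaim.
 * This worker resumes the device outside of reclaim; after the put, the
 * runtime-pm autosuspend delay (on the order of a second) keeps the device
 * active, so a retried shrinker invocation will normally find it awake and
 * be able to make progress.
 */
static void xe_shrinker_pm(struct work_struct *work)
{
	struct xe_shrinker *shrinker =
		container_of(work, typeof(*shrinker), pm_worker);

	xe_pm_runtime_get(shrinker->xe);
	xe_pm_runtime_put(shrinker->xe);
}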

> This doesn't strictly guarantee shrinking progress, but there's no way
> we can synchronously delay shrinking until we have the pm reference.
> 
> What we could perhaps do is to report back that we actually made
> progress but postpone the actual shrinking to the worker with writeback
> set to on; hoping that the shrinking core doesn't give up before it
> sees the released memory...
> 
> But I'd prefer to leave that out until we see that that's actually a
> real problem.
> 
> Best would be if we could access the LNL CCS memory using the
> CPU...
> 
> /Thomas
> 
> 
> > 
> > Matt
> > 
> > > +}
> > > +
> > > +/**
> > > + * xe_shrinker_create() - Create an xe per-device shrinker
> > > + * @xe: Pointer to the xe device.
> > > + *
> > > + * Returns: A pointer to the created shrinker on success,
> > > + * Negative error code on failure.
> > > + */
> > > +struct xe_shrinker *xe_shrinker_create(struct xe_device *xe)
> > > +{
> > > +	struct xe_shrinker *shrinker = kzalloc(sizeof(*shrinker),
> > > GFP_KERNEL);
> > > +
> > > +	if (!shrinker)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	shrinker->shrink = shrinker_alloc(0, "xe system
> > > shrinker");
> > > +	if (!shrinker->shrink) {
> > > +		kfree(shrinker);
> > > +		return ERR_PTR(-ENOMEM);
> > > +	}
> > > +
> > > +	INIT_WORK(&shrinker->pm_worker, xe_shrinker_pm);
> > > +	shrinker->xe = xe;
> > > +	rwlock_init(&shrinker->lock);
> > > +	shrinker->shrink->count_objects = xe_shrinker_count;
> > > +	shrinker->shrink->scan_objects = xe_shrinker_scan;
> > > +	shrinker->shrink->private_data = shrinker;
> > > +	shrinker_register(shrinker->shrink);
> > > +
> > > +	return shrinker;
> > > +}
> > > +
> > > +/**
> > > + * xe_shrinker_destroy() - Destroy an xe per-device shrinker
> > > + * @shrinker: Pointer to the shrinker to destroy.
> > > + */
> > > +void xe_shrinker_destroy(struct xe_shrinker *shrinker)
> > > +{
> > > +	xe_assert(shrinker->xe, !shrinker->shrinkable_pages);
> > > +	xe_assert(shrinker->xe, !shrinker->purgeable_pages);
> > > +	shrinker_free(shrinker->shrink);
> > > +	flush_work(&shrinker->pm_worker);
> > > +	kfree(shrinker);
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/xe_shrinker.h
> > > b/drivers/gpu/drm/xe/xe_shrinker.h
> > > new file mode 100644
> > > index 000000000000..28a038f4fcbf
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_shrinker.h
> > > @@ -0,0 +1,18 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + */
> > > +
> > > +#ifndef _XE_SHRINKER_H_
> > > +#define _XE_SHRINKER_H_
> > > +
> > > +struct xe_shrinker;
> > > +struct xe_device;
> > > +
> > > +void xe_shrinker_mod_pages(struct xe_shrinker *shrinker, long
> > > shrinkable, long purgeable);
> > > +
> > > +struct xe_shrinker *xe_shrinker_create(struct xe_device *xe);
> > > +
> > > +void xe_shrinker_destroy(struct xe_shrinker *shrinker);
> > > +
> > > +#endif
> > > diff --git a/include/drm/ttm/ttm_bo.h b/include/drm/ttm/ttm_bo.h
> > > index e577528f5dfc..c7e81ae025d9 100644
> > > --- a/include/drm/ttm/ttm_bo.h
> > > +++ b/include/drm/ttm/ttm_bo.h
> > > @@ -229,6 +229,9 @@ struct ttm_lru_walk {
> > >  long ttm_lru_walk_for_evict(struct ttm_lru_walk *walk, struct
> > > ttm_device *bdev,
> > >  			    struct ttm_resource_manager *man, long
> > > target);
> > >  
> > > +long ttm_bo_try_shrink(struct ttm_lru_walk *walk, struct
> > > ttm_buffer_object *bo,
> > > +		       bool purge, bool writeback);
> > > +
> > >  /**
> > >   * ttm_bo_get - reference a struct ttm_buffer_object
> > >   *
> > > -- 
> > > 2.44.0
> > > 
> 


* Re: [PATCH v6 09/12] drm/ttm/pool: Provide a helper to shrink pages
  2024-08-07 23:38   ` Matthew Brost
@ 2024-08-16  9:47     ` Thomas Hellström
  0 siblings, 0 replies; 38+ messages in thread
From: Thomas Hellström @ 2024-08-16  9:47 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, Christian König, Somalapuram Amaranath, dri-devel

On Wed, 2024-08-07 at 23:38 +0000, Matthew Brost wrote:
> On Wed, Jul 03, 2024 at 05:38:10PM +0200, Thomas Hellström wrote:
> > Provide a helper to shrink ttm_tt page-vectors on a per-page
> > basis. A ttm_backup backend could then in theory get away with
> > allocating a single temporary page for each struct ttm_tt.
> > 
> > This is accomplished by splitting larger pages before trying to
> > back them up.
> > 
> > In the future we could allow ttm_backup to handle backing up
> > large pages as well, but currently there's no benefit in
> > doing that, since the shmem backup backend would have to
> > split those anyway to avoid allocating too much temporary
> > memory, and if the backend instead inserts pages into the
> > swap-cache, those are split on reclaim by the core.
> > 
> > Due to potential backup and recovery errors, allow partially swapped
> > out struct ttm_tt's, although mark them as swapped out, stopping them
> > from being swapped out a second time. More details in the ttm_pool.c
> > DOC section.
> > 
> > v2:
> > - A couple of cleanups and error fixes in ttm_pool_back_up_tt.
> > - s/back_up/backup/
> > - Add a writeback parameter to the exported interface.
> > 
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: <dri-devel@lists.freedesktop.org>
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> >  drivers/gpu/drm/ttm/ttm_pool.c | 397
> > +++++++++++++++++++++++++++++++--
> >  drivers/gpu/drm/ttm/ttm_tt.c   |  37 +++
> >  include/drm/ttm/ttm_pool.h     |   5 +
> >  include/drm/ttm/ttm_tt.h       |  20 ++
> >  4 files changed, 446 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c
> > b/drivers/gpu/drm/ttm/ttm_pool.c
> > index 6e1fd6985ffc..38e50cf81b0a 100644
> > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > @@ -41,6 +41,7 @@
> >  #include <asm/set_memory.h>
> >  #endif
> >  
> > +#include <drm/ttm/ttm_backup.h>
> >  #include <drm/ttm/ttm_pool.h>
> >  #include <drm/ttm/ttm_tt.h>
> >  #include <drm/ttm/ttm_bo.h>
> > @@ -58,6 +59,32 @@ struct ttm_pool_dma {
> >  	unsigned long vaddr;
> >  };
> >  
> > +/**
> > + * struct ttm_pool_tt_restore - State representing restore from
> > backup
> > + * @alloced_pages: Total number of already allocated pages for the
> > ttm_tt.
> > + * @restored_pages: Number of (sub) pages restored from swap for
> > this
> > + *		     chunk of 1 << @order pages.
> > > + * @first_page: The ttm page ptr corresponding to @old_pages[0].
> > + * @caching_divide: Page pointer where subsequent pages are
> > cached.
> > + * @old_pages: Backup copy of page pointers that were replaced by
> > the new
> > + *	       page allocation.
> > + * @pool: The pool used for page allocation while restoring.
> > + * @order: The order of the last page allocated while restoring.
> > + *
> > + * Recovery from backup might fail when we've recovered less than
> > the
> > > + * full ttm_tt. In order not to lose any data (yet), keep
> > information
> > + * around that allows us to restart a failed ttm backup recovery.
> > + */
> > +struct ttm_pool_tt_restore {
> > +	pgoff_t alloced_pages;
> > +	pgoff_t restored_pages;
> > +	struct page **first_page;
> > +	struct page **caching_divide;
> > +	struct ttm_pool *pool;
> > +	unsigned int order;
> > +	struct page *old_pages[];
> > +};
> > +
> >  static unsigned long page_pool_size;
> >  
> >  MODULE_PARM_DESC(page_pool_size, "Number of pages in the WC/UC/DMA
> > pool");
> > @@ -354,11 +381,102 @@ static unsigned int
> > ttm_pool_page_order(struct ttm_pool *pool, struct page *p)
> >  	return p->private;
> >  }
> >  
> > +/*
> > + * To be able to insert single pages into backup directly,
> > + * we need to split multi-order page allocations and make them
> > look
> > + * like single-page allocations.
> > + */
> > +static void ttm_pool_split_for_swap(struct ttm_pool *pool, struct
> > page *p)
> > +{
> > +	unsigned int order = ttm_pool_page_order(pool, p);
> > +	pgoff_t nr;
> > +
> > +	if (!order)
> > +		return;
> > +
> > +	split_page(p, order);
> > +	nr = 1UL << order;
> > +	while (nr--)
> > +		(p++)->private = 0;
> > +}
> > +
> > +/**
> > + * DOC: Partial backup and restoration of a struct ttm_tt.
> > + *
> > + * Swapout using ttm_backup::ops::backup_page() and swapin using
> > + * ttm_backup::ops::copy_backed_up_page() may fail.
> > + * The former most likely due to lack of swap-space or memory, the
> > latter due
> > + * to lack of memory or because of signal interruption during
> > waits.
> > + *
> > > + * Backup failure is easily handled by using a ttm_tt pages vector
> > that holds
> > + * both swap entries and page pointers. This has to be taken into
> > account when
> > + * restoring such a ttm_tt from backup, and when freeing it while
> > backed up.
> > + * When restoring, for simplicity, new pages are actually
> > allocated from the
> > + * pool and the contents of any old pages are copied in and then
> > the old pages
> > + * are released.
> > + *
> > + * For restoration failures, the struct ttm_pool_tt_restore holds
> > sufficient state
> > + * to be able to resume an interrupted restore, and that structure
> > is freed once
> > + * the restoration is complete. If the struct ttm_tt is destroyed
> > while there
> > + * is a valid struct ttm_pool_tt_restore attached, that is also
> > properly taken
> > + * care of.
> > + */
> > +
> > +static bool ttm_pool_restore_valid(const struct
> > ttm_pool_tt_restore *restore)
> > +{
> > +	return restore && restore->restored_pages < (1 << restore-
> > >order);
> > +}
> > +
> > +static int ttm_pool_restore_tt(struct ttm_pool_tt_restore
> > *restore,
> > +			       struct ttm_backup *backup,
> > +			       struct ttm_operation_ctx *ctx)
> > +{
> > +	unsigned int i, nr = 1 << restore->order;
> > +	int ret = 0;
> > +
> > +	if (!ttm_pool_restore_valid(restore))
> > +		return 0;
> > +
> > +	for (i = restore->restored_pages; i < nr; ++i) {
> > +		struct page *p = restore->old_pages[i];
> > +
> > +		if (ttm_backup_page_ptr_is_handle(p)) {
> > +			unsigned long handle =
> > ttm_backup_page_ptr_to_handle(p);
> > +
> > +			if (handle == 0)
> > +				continue;
> > +
> > +			ret = backup->ops->copy_backed_up_page
> > +				(backup, restore->first_page[i],
> > +				 handle, ctx->interruptible);
> > +			if (ret)
> > +				break;
> > +
> > +			backup->ops->drop(backup, handle);
> > +		} else if (p) {
> > +			/*
> > +			 * We could probably avoid splitting the
> > old page
> > +			 * using clever logic, but ATM we don't
> > care.
> > +			 */
> > +			ttm_pool_split_for_swap(restore->pool, p);
> > +			copy_highpage(restore->first_page[i], p);
> > +			__free_pages(p, 0);
> > +		}
> > +
> > +		restore->restored_pages++;
> > +		restore->old_pages[i] = NULL;
> > +		cond_resched();
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> >  /* Called when we got a page, either from a pool or newly
> > allocated */
> >  static int ttm_pool_page_allocated(struct ttm_pool *pool, unsigned
> > int order,
> >  				   struct page *p, dma_addr_t
> > **dma_addr,
> >  				   unsigned long *num_pages,
> > -				   struct page ***pages)
> > +				   struct page ***pages,
> > +				   struct ttm_pool_tt_restore
> > *restore)
> >  {
> >  	unsigned int i;
> >  	int r;
> > @@ -369,6 +487,16 @@ static int ttm_pool_page_allocated(struct
> > ttm_pool *pool, unsigned int order,
> >  			return r;
> >  	}
> >  
> > +	if (restore) {
> > +		memcpy(restore->old_pages, *pages,
> > +		       (1 << order) * sizeof(*restore-
> > >old_pages));
> > +		memset(*pages, 0, (1 << order) * sizeof(**pages));
> > +		restore->order = order;
> > +		restore->restored_pages = 0;
> > +		restore->first_page = *pages;
> > +		restore->alloced_pages += 1UL << order;
> > +	}
> > +
> >  	*num_pages -= 1 << order;
> >  	for (i = 1 << order; i; --i, ++(*pages), ++p)
> >  		**pages = p;
> > @@ -394,22 +522,39 @@ static void ttm_pool_free_range(struct
> > ttm_pool *pool, struct ttm_tt *tt,
> >  				pgoff_t start_page, pgoff_t
> > end_page)
> >  {
> >  	struct page **pages = &tt->pages[start_page];
> > +	struct ttm_backup *backup = tt->backup;
> >  	unsigned int order;
> >  	pgoff_t i, nr;
> >  
> >  	for (i = start_page; i < end_page; i += nr, pages += nr) {
> >  		struct ttm_pool_type *pt = NULL;
> > +		struct page *p = *pages;
> > +
> > +		if (ttm_backup_page_ptr_is_handle(p)) {
> > +			unsigned long handle =
> > ttm_backup_page_ptr_to_handle(p);
> > +
> > +			nr = 1;
> > +			if (handle != 0)
> > +				backup->ops->drop(backup, handle);
> > +			continue;
> > +		}
> > +
> > +		if (pool) {
> > +			order = ttm_pool_page_order(pool, p);
> > +			nr = (1UL << order);
> > +			if (tt->dma_address)
> > +				ttm_pool_unmap(pool, tt-
> > >dma_address[i], nr);
> >  
> > -		order = ttm_pool_page_order(pool, *pages);
> > -		nr = (1UL << order);
> > -		if (tt->dma_address)
> > -			ttm_pool_unmap(pool, tt->dma_address[i],
> > nr);
> > +			pt = ttm_pool_select_type(pool, caching,
> > order);
> > +		} else {
> > +			order = p->private;
> > +			nr = (1UL << order);
> > +		}
> >  
> > -		pt = ttm_pool_select_type(pool, caching, order);
> >  		if (pt)
> > -			ttm_pool_type_give(pt, *pages);
> > +			ttm_pool_type_give(pt, p);
> >  		else
> > -			ttm_pool_free_page(pool, caching, order,
> > *pages);
> > +			ttm_pool_free_page(pool, caching, order,
> > p);
> >  	}
> >  }
> >  
> > @@ -453,9 +598,37 @@ int ttm_pool_alloc(struct ttm_pool *pool,
> > struct ttm_tt *tt,
> >  	else
> >  		gfp_flags |= GFP_HIGHUSER;
> >  
> > -	for (order = min_t(unsigned int, MAX_PAGE_ORDER,
> > __fls(num_pages));
> > -	     num_pages;
> > -	     order = min_t(unsigned int, order, __fls(num_pages)))
> > {
> > +	order = min_t(unsigned int, MAX_PAGE_ORDER,
> > __fls(num_pages));
> > +
> > +	if (tt->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP) {
> > +		if (!tt->restore) {
> > +			gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
> > +
> > +			if (ctx->gfp_retry_mayfail)
> > +				gfp |= __GFP_RETRY_MAYFAIL;
> > +
> > +			tt->restore =
> > +				kvzalloc(struct_size(tt->restore,
> > old_pages,
> > +						     (size_t)1 <<
> > order), gfp);
> > +			/* RFC: Possibly loop on -ENOMEM and
> > reduce order. */
> 
> I'd say this is fine as is. If we can't allocate memory for an array of
> pages here we're likely pretty much screwed, right? e.g. We likely don't
> have a chance of actually allocating new pages for the backing store
> anyways. Also wouldn't the restart be broken if we can't fully track the
> state of the restore?
> 
> > +			if (!tt->restore)
> > +				return -ENOMEM;
> > +		} else if (ttm_pool_restore_valid(tt->restore)) {
> > +			struct ttm_pool_tt_restore *restore = tt-
> > >restore;
> > +
> > +			num_pages -= restore->alloced_pages;
> > +			order = min_t(unsigned int, order,
> > __fls(num_pages));
> > +			pages += restore->alloced_pages;
> > +			r = ttm_pool_restore_tt(restore, tt-
> > >backup, ctx);
> > +			if (r)
> > +				return r;
> > +			caching = restore->caching_divide;
> > +		}
> > +
> > +		tt->restore->pool = pool;
> > +	}
> > +
> > +	for (; num_pages; order = min_t(unsigned int, order,
> > __fls(num_pages))) {
> >  		struct ttm_pool_type *pt;
> >  
> >  		page_caching = tt->caching;
> > @@ -472,11 +645,19 @@ int ttm_pool_alloc(struct ttm_pool *pool,
> > struct ttm_tt *tt,
> >  				r = ttm_pool_page_allocated(pool,
> > order, p,
> >  							   
> > &dma_addr,
> >  							   
> > &num_pages,
> > -							   
> > &pages);
> > +							   
> > &pages,
> > +							    tt-
> > >restore);
> >  				if (r)
> >  					goto error_free_page;
> >  
> >  				caching = pages;
> > +				if (ttm_pool_restore_valid(tt-
> > >restore)) {
> > +					r =
> > ttm_pool_restore_tt(tt->restore, tt->backup,
> > +								ct
> > x);
> > +					if (r)
> > +						goto
> > error_free_all;
> > +				}
> > +
> >  				if (num_pages < (1 << order))
> >  					break;
> >  
> > @@ -496,9 +677,17 @@ int ttm_pool_alloc(struct ttm_pool *pool,
> > struct ttm_tt *tt,
> >  				caching = pages;
> >  			}
> >  			r = ttm_pool_page_allocated(pool, order,
> > p, &dma_addr,
> > -						    &num_pages,
> > &pages);
> > +						    &num_pages,
> > &pages,
> > +						    tt->restore);
> >  			if (r)
> >  				goto error_free_page;
> > +
> > +			if (ttm_pool_restore_valid(tt->restore)) {
> > +				r = ttm_pool_restore_tt(tt-
> > >restore, tt->backup, ctx);
> > +				if (r)
> > +					goto error_free_all;
> > +			}
> > +
> >  			if (PageHighMem(p))
> >  				caching = pages;
> >  		}
> > @@ -517,12 +706,26 @@ int ttm_pool_alloc(struct ttm_pool *pool,
> > struct ttm_tt *tt,
> >  	if (r)
> >  		goto error_free_all;
> >  
> > +	if (tt->restore) {
> > +		kvfree(tt->restore);
> > +		tt->restore = NULL;
> > +	}
> > +
> > +	if (tt->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP)
> > +		tt->page_flags &= ~(TTM_TT_FLAG_PRIV_BACKED_UP |
> > +				    TTM_TT_FLAG_SWAPPED);
> > +
> >  	return 0;
> >  
> >  error_free_page:
> >  	ttm_pool_free_page(pool, page_caching, order, p);
> >  
> >  error_free_all:
> > +	if (tt->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP) {
> > +		tt->restore->caching_divide = caching;
> > +		return r;
> > +	}
> > +
> >  	num_pages = tt->num_pages - num_pages;
> >  	caching_divide = caching - tt->pages;
> >  	ttm_pool_free_range(pool, tt, tt->caching, 0,
> > caching_divide);
> > @@ -549,6 +752,174 @@ void ttm_pool_free(struct ttm_pool *pool,
> > struct ttm_tt *tt)
> >  }
> >  EXPORT_SYMBOL(ttm_pool_free);
> >  
> > +/**
> > + * ttm_pool_release_backed_up() - Release content of a swapped-out
> > struct ttm_tt
> > + * @tt: The struct ttm_tt.
> > + *
> > + * Release handles with associated content or any remaining pages
> > of
> > + * a backed-up struct ttm_tt.
> > + */
> > +void ttm_pool_release_backed_up(struct ttm_tt *tt)
> > +{
> > +	struct ttm_backup *backup = tt->backup;
> > +	struct ttm_pool_tt_restore *restore;
> > +	pgoff_t i, start_page = 0;
> > +	unsigned long handle;
> > +
> > +	if (!(tt->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP))
> > +		return;
> > +
> > +	restore = tt->restore;
> > +
> > +	if (ttm_pool_restore_valid(restore)) {
> > +		pgoff_t nr = 1UL << restore->order;
> > +
> > +		for (i = restore->restored_pages; i < nr; ++i) {
> > +			struct page *p = restore->old_pages[i];
> > +
> > +			if (ttm_backup_page_ptr_is_handle(p)) {
> > +				handle =
> > ttm_backup_page_ptr_to_handle(p);
> > +				if (handle == 0)
> > +					continue;
> > +
> > +				backup->ops->drop(backup, handle);
> > +			} else if (p) {
> > +				ttm_pool_split_for_swap(restore-
> > >pool, p);
> > +				__free_pages(p, 0);
> > +			}
> > +		}
> > +	}
> > +
> > +	if (restore) {
> > +		pgoff_t mid = restore->caching_divide - tt->pages;
> > +
> > +		start_page = restore->alloced_pages;
> > +		/* Pages that might be dma-mapped and non-cached
> > */
> > +		ttm_pool_free_range(restore->pool, tt, tt-
> > >caching,
> > +				    0, mid);
> > +		/* Pages that might be dma-mapped but cached */
> > +		ttm_pool_free_range(restore->pool, tt, ttm_cached,
> > +				    mid, restore->alloced_pages);
> > +	}
> > +
> > +	/* Shrunken pages. Cached and not dma-mapped. */
> > +	ttm_pool_free_range(NULL, tt, ttm_cached, start_page, tt-
> > >num_pages);
> > +
> > +	if (restore) {
> > +		kvfree(restore);
> > +		tt->restore = NULL;
> > +	}
> > +
> > +	tt->page_flags &= ~(TTM_TT_FLAG_PRIV_BACKED_UP |
> > TTM_TT_FLAG_SWAPPED);
> > +}
> > +
> > +/**
> > + * ttm_pool_backup_tt() - Back up or purge a struct ttm_tt
> > + * @pool: The pool used when allocating the struct ttm_tt.
> > + * @ttm: The struct ttm_tt.
> > + * @purge: Don't back up but release pages directly to system.
> > + * @writeback: If !@purge, Try to write out directly to the
> > + * underlying persistent media.
> > + *
> > + * Back up or purge a struct ttm_tt. If @purge is true, then
> > + * all pages will be freed directly to the system rather than to
> > the pool
> > + * they were allocated from, making the function behave similarly
> > to
> > + * ttm_pool_free(). If @purge is false the pages will be backed up
> > instead,
> > + * exchanged for handles.
> > + * A subsequent call to ttm_pool_alloc() will then read back the
> > content and
> > + * a subsequent call to ttm_pool_release_backed_up() will drop it.
> > + * If backup of a page fails for whatever reason, @ttm will still
> > be
> > + * partially backed up, retaining those pages for which backup
> > fails.
> > + *
> > + * Return: Number of pages actually backed up or freed, or
> > negative
> > + * error code on error.
> > + */
> > +long ttm_pool_backup_tt(struct ttm_pool *pool, struct ttm_tt *ttm,
> > bool purge,
> > +			bool writeback)
> > +{
> > +	struct ttm_backup *backup = ttm->backup;
> > +	struct page *page;
> > +	unsigned long handle;
> > +	gfp_t alloc_gfp;
> > +	gfp_t gfp;
> > +	int ret = 0;
> > +	pgoff_t shrunken = 0;
> > +	pgoff_t i, num_pages;
> > +
> > +	if ((!get_nr_swap_pages() && !purge) ||
> > +	    pool->use_dma_alloc ||
> > +	    (ttm->page_flags & TTM_TT_FLAG_PRIV_BACKED_UP))
> > +		return -EBUSY;
> > +
> > +#ifdef CONFIG_X86
> > +	/* Anything returned to the system needs to be cached. */
> > +	if (ttm->caching != ttm_cached)
> > +		set_pages_array_wb(ttm->pages, ttm->num_pages);
> > +#endif
> > +
> > +	if (ttm->dma_address || purge) {
> > +		for (i = 0; i < ttm->num_pages; i += num_pages) {
> > +			unsigned int order;
> > +
> > +			page = ttm->pages[i];
> > +			if (unlikely(!page)) {
> > +				num_pages = 1;
> > +				continue;
> > +			}
> > +
> > +			order = ttm_pool_page_order(pool, page);
> > +			num_pages = 1UL << order;
> > +			if (ttm->dma_address)
> > +				ttm_pool_unmap(pool, ttm-
> > >dma_address[i],
> > +					       num_pages);
> > +			if (purge) {
> > +				shrunken += num_pages;
> > +				page->private = 0;
> > +				__free_pages(page, order);
> > +				memset(ttm->pages + i, 0,
> > +				       num_pages * sizeof(*ttm-
> > >pages));
> > +			}
> > +		}
> > +	}
> > +
> > +	if (purge)
> 
> if (purge || !backup)?
> 
> > +		return shrunken;
> > +
> > +	if (pool->use_dma32)
> > +		gfp = GFP_DMA32;
> > +	else
> > +		gfp = GFP_HIGHUSER;
> > +
> > +	alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN |
> > __GFP_RETRY_MAYFAIL;
> > +
> > +	for (i = 0; i < ttm->num_pages; ++i) {
> > +		page = ttm->pages[i];
> > +		if (unlikely(!page))
> > +			continue;
> > +
> > +		ttm_pool_split_for_swap(pool, page);
> > +
> > +		handle = backup->ops->backup_page(backup, page,
> > writeback, i,
> > +						  gfp, alloc_gfp);
> > +		if (handle) {
> > +			ttm->pages[i] =
> > ttm_backup_handle_to_page_ptr(handle);
> > +			put_page(page);
> > +			shrunken++;
> > +		} else {
> > +			/* We allow partially shrunken tts */
> > +			ret = -ENOMEM;
> > +			break;
> > +		}
> > +		cond_resched();
> > +	}
> > +
> > +	if (shrunken)
> > +		ttm->page_flags |= (TTM_TT_FLAG_PRIV_BACKED_UP |
> > +				    TTM_TT_FLAG_SWAPPED);
> > +
> > +	return shrunken ? shrunken : ret;
> > +}
> > +
> >  /**
> >   * ttm_pool_init - Initialize a pool
> >   *
> > diff --git a/drivers/gpu/drm/ttm/ttm_tt.c
> > b/drivers/gpu/drm/ttm/ttm_tt.c
> > index 4b51b9023126..98ce25197b38 100644
> > --- a/drivers/gpu/drm/ttm/ttm_tt.c
> > +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> > @@ -40,6 +40,7 @@
> >  #include <drm/drm_cache.h>
> >  #include <drm/drm_device.h>
> >  #include <drm/drm_util.h>
> > +#include <drm/ttm/ttm_backup.h>
> >  #include <drm/ttm/ttm_bo.h>
> >  #include <drm/ttm/ttm_tt.h>
> >  
> > @@ -158,6 +159,7 @@ static void ttm_tt_init_fields(struct ttm_tt
> > *ttm,
> >  	ttm->swap_storage = NULL;
> >  	ttm->sg = bo->sg;
> >  	ttm->caching = caching;
> > +	ttm->restore = NULL;
> 
> Set backup to NULL? Seems problematic if it's not set to NULL and the
> driver doesn't choose to set the backup.
> 
> >  }
> >  
> >  int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
> > @@ -182,6 +184,12 @@ void ttm_tt_fini(struct ttm_tt *ttm)
> >  		fput(ttm->swap_storage);
> >  	ttm->swap_storage = NULL;
> >  
> > +	ttm_pool_release_backed_up(ttm);
> > +	if (ttm->backup) {
> 
> In patch 12 you don't set this to NULL on error. You will have to set
> it to NULL there or change this to:
> 
> if (ttm->backup && !IS_ERR(ttm->backup))
> 
> > +		ttm->backup->ops->fini(ttm->backup);
> > +		ttm->backup = NULL;
> > +	}
> > +
> >  	if (ttm->pages)
> >  		kvfree(ttm->pages);
> >  	else
> > @@ -253,6 +261,35 @@ int ttm_tt_swapin(struct ttm_tt *ttm)
> >  }
> >  EXPORT_SYMBOL_FOR_TESTS_ONLY(ttm_tt_swapin);
> >  
> > +/**
> > + * ttm_tt_backup() - Helper to back up a struct ttm_tt.
> > + * @bdev: The TTM device.
> > + * @tt: The struct ttm_tt.
> > + * @purge: Don't back up but release pages directly to system,
> > + * bypassing any pooling.
> > + * @writeback: If !@purge, try to write out directly to the
> > + * underlying persistent media.
> > + *
> > + * Helper for a TTM driver to use from the bo_shrink() method to
> > shrink
> > + * a struct ttm_tt, after it has done the necessary unbinding.
> > This function
> > + * will update the page accounting and call ttm_pool_backup_tt() to
> > free pages
> > + * or move them to the swap cache.
> > + *
> > + * Return: Number of pages freed or swapped out, or negative error
> > code on
> > + * error.
> > + */
> > +long ttm_tt_backup(struct ttm_device *bdev, struct ttm_tt *tt,
> > bool purge,
> > +		   bool writeback)
> > +{
> > +	long ret = ttm_pool_backup_tt(&bdev->pool, tt, purge,
> > writeback);
> > +
> > +	if (ret > 0)
> > +		tt->page_flags &= ~TTM_TT_FLAG_PRIV_POPULATED;
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(ttm_tt_backup);
> > +
> >  /**
> >   * ttm_tt_swapout - swap out tt object
> >   *
> > diff --git a/include/drm/ttm/ttm_pool.h
> > b/include/drm/ttm/ttm_pool.h
> > index 160d954a261e..4e4db369952b 100644
> > --- a/include/drm/ttm/ttm_pool.h
> > +++ b/include/drm/ttm/ttm_pool.h
> > @@ -89,6 +89,11 @@ void ttm_pool_fini(struct ttm_pool *pool);
> >  
> >  int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);
> >  
> > +void ttm_pool_release_backed_up(struct ttm_tt *tt);
> > +
> > +long ttm_pool_backup_tt(struct ttm_pool *pool, struct ttm_tt *ttm,
> > +			bool purge, bool writeback);
> > +
> >  int ttm_pool_mgr_init(unsigned long num_pages);
> >  void ttm_pool_mgr_fini(void);
> >  
> > diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
> > index 2b9d856ff388..6b990f1e7dd0 100644
> > --- a/include/drm/ttm/ttm_tt.h
> > +++ b/include/drm/ttm/ttm_tt.h
> > @@ -32,11 +32,13 @@
> >  #include <drm/ttm/ttm_caching.h>
> >  #include <drm/ttm/ttm_kmap_iter.h>
> >  
> > +struct ttm_backup;
> >  struct ttm_device;
> >  struct ttm_tt;
> >  struct ttm_resource;
> >  struct ttm_buffer_object;
> >  struct ttm_operation_ctx;
> > +struct ttm_pool_tt_restore;
> >  
> >  /**
> >   * struct ttm_tt - This is a structure holding the pages, caching-
> > and aperture
> > @@ -85,6 +87,9 @@ struct ttm_tt {
> >  	 * fault handling abuses the DMA api a bit and
> > dma_map_attrs can't be
> >  	 * used to assure pgprot always matches.
> >  	 *
> > +	 * TTM_TT_FLAG_PRIV_BACKED_UP: TTM internal only. This is
> > set if the
> > +	 * struct ttm_tt has been (possibly partially) backed up.
> > +	 *
> >  	 * TTM_TT_FLAG_PRIV_POPULATED: TTM internal only. DO NOT
> > USE. This is
> >  	 * set by TTM after ttm_tt_populate() has successfully
> > returned, and is
> >  	 * then unset when TTM calls ttm_tt_unpopulate().
> > @@ -96,6 +101,7 @@ struct ttm_tt {
> >  #define TTM_TT_FLAG_DECRYPTED		BIT(4)
> >  
> >  #define TTM_TT_FLAG_PRIV_POPULATED	BIT(5)
> > +#define TTM_TT_FLAG_PRIV_BACKED_UP	BIT(6)
> >  	uint32_t page_flags;
> >  	/** @num_pages: Number of pages in the page array. */
> >  	uint32_t num_pages;
> > @@ -105,11 +111,21 @@ struct ttm_tt {
> >  	dma_addr_t *dma_address;
> >  	/** @swap_storage: Pointer to shmem struct file for swap
> > storage. */
> >  	struct file *swap_storage;
> > +	/**
> > +	 * @backup: Pointer to backup struct for backed up tts.
> > +	 * RFC: Could possibly be unified with @swap_storage.
> 
> I think long-term unifying with swap_storage is probably a good idea.
> Kinda goofy having two backup mechanisms.
> 
> In the meantime, can you add a comment that this is a driver-owned
> field? This confused me until I looked at the last patch in this series
> where this field was being set up.
> 
> > +	 */
> > +	struct ttm_backup *backup;
> >  	/**
> >  	 * @caching: The current caching state of the pages, see
> > enum
> >  	 * ttm_caching.
> >  	 */
> >  	enum ttm_caching caching;
> > +	/**
> > +	 * @restore: Partial restoration from backup state.
> > +	 * RFC: Incorporate in struct ttm_backup?
> 
> I think having a standalone restore field makes sense. Also probably
> mention that this is a TTM-private field and drivers shouldn't touch it.

Thanks for reviewing, Matt. I'll incorporate all your suggestions, but
assert that ttm->backup is valid in ttm_tt.c

Thanks,
Thomas
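
Putting the pieces above together, the ttm_tt.c side of those fixups could
end up looking roughly like this (a sketch of the intent only, not the
actual next revision):

	/* In ttm_tt_init_fields(): don't leave driver-optional fields stale. */
	ttm->backup = NULL;
	ttm->restore = NULL;

	/* In ttm_tt_fini(): @backup is driver-owned, so it must be NULL or valid. */
	ttm_pool_release_backed_up(ttm);
	if (!WARN_ON(IS_ERR(ttm->backup)) && ttm->backup) {
		ttm->backup->ops->fini(ttm->backup);
		ttm->backup = NULL;
	}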


> 
> Matt
> 
> > +	 */
> > +	struct ttm_pool_tt_restore *restore;
> >  };
> >  
> >  /**
> > @@ -230,6 +246,10 @@ void ttm_tt_mgr_init(unsigned long num_pages,
> > unsigned long num_dma32_pages);
> >  struct ttm_kmap_iter *ttm_kmap_iter_tt_init(struct
> > ttm_kmap_iter_tt *iter_tt,
> >  					    struct ttm_tt *tt);
> >  unsigned long ttm_tt_pages_limit(void);
> > +
> > +long ttm_tt_backup(struct ttm_device *bdev, struct ttm_tt *tt,
> > bool purge,
> > +		   bool writeback);
> > +
> >  #if IS_ENABLED(CONFIG_AGP)
> >  #include <linux/agp_backend.h>
> >  
> > -- 
> > 2.44.0
> > 



end of thread, other threads:[~2024-08-16  9:47 UTC | newest]

Thread overview: 38+ messages
2024-07-03 15:38 [PATCH v6 00/12] TTM shrinker helpers and xe buffer object shrinker Thomas Hellström
2024-07-03 15:38 ` [PATCH v6 01/12] drm/ttm: Allow TTM LRU list nodes of different types Thomas Hellström
2024-07-03 15:38 ` [PATCH v6 02/12] drm/ttm: Slightly clean up LRU list iteration Thomas Hellström
2024-07-03 15:38 ` [PATCH v6 03/12] drm/ttm: Use LRU hitches Thomas Hellström
2024-07-04  9:05   ` Christian König
2024-07-03 15:38 ` [PATCH v6 04/12] drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist moves Thomas Hellström
2024-07-03 17:53   ` Matthew Brost
2024-07-04  9:21   ` Christian König
2024-07-04 12:41     ` Thomas Hellström
2024-07-04 13:13       ` Christian König
2024-07-04 13:53         ` Thomas Hellström
2024-07-04 14:32           ` Christian König
2024-07-03 15:38 ` [PATCH v6 05/12] drm/ttm: Provide a generic LRU walker helper Thomas Hellström
2024-07-03 15:38 ` [PATCH v6 06/12] drm/ttm: Use the LRU walker helper for swapping Thomas Hellström
2024-07-03 18:24   ` Matthew Brost
2024-07-03 15:38 ` [PATCH v6 07/12] drm/ttm: Use the LRU walker for eviction Thomas Hellström
2024-07-03 19:20   ` Matthew Brost
2024-07-03 15:38 ` [PATCH v6 08/12] drm/ttm: Add a virtual base class for graphics memory backup Thomas Hellström
2024-07-03 19:47   ` Matthew Brost
2024-07-04 11:57   ` Christian König
2024-07-03 15:38 ` [PATCH v6 09/12] drm/ttm/pool: Provide a helper to shrink pages Thomas Hellström
2024-08-07 23:38   ` Matthew Brost
2024-08-16  9:47     ` Thomas Hellström
2024-07-03 15:38 ` [PATCH v6 10/12] drm/ttm: Use fault-injection to test error paths Thomas Hellström
2024-08-07 23:43   ` Matthew Brost
2024-08-09 13:53     ` Thomas Hellström
2024-08-09 16:40       ` Matthew Brost
2024-07-03 15:38 ` [PATCH v6 11/12] drm/ttm, drm/xe: Add a shrinker for xe bos Thomas Hellström
2024-08-08  1:37   ` Matthew Brost
2024-08-09 14:31     ` Thomas Hellström
2024-08-09 17:22       ` Matthew Brost
2024-08-09 16:05   ` Matthew Auld
2024-07-03 15:38 ` [PATCH v6 12/12] drm/xe: Increase the XE_PL_TT watermark Thomas Hellström
2024-08-05 18:35   ` Souza, Jose
2024-08-07 23:13     ` Matthew Brost
2024-08-09 12:22       ` Thomas Hellström
2024-08-07 23:44   ` Matthew Brost
2024-08-09 13:53     ` Thomas Hellström
