Intel-XE Archive on lore.kernel.org
* [RFC PATCH 0/2] Clear system pages with GPU
@ 2024-06-20 13:46 Nirmoy Das
  2024-06-20 13:46 ` [RFC PATCH 1/2] drm/ttm/pool: Introduce a way to skip clear on free Nirmoy Das
  2024-06-20 13:46 ` [PATCH] drm/xe/lnl: Offload system clear page activity to GPU Nirmoy Das
  0 siblings, 2 replies; 9+ messages in thread
From: Nirmoy Das @ 2024-06-20 13:46 UTC (permalink / raw)
  To: dri-devel
  Cc: intel-xe, Nirmoy Das, Christian Koenig, Thomas Hellström,
	Matthew Auld

TTM clears pooled pages on free and non-pooled pages on alloc, using
the CPU; this can incur large latency for large buffer objects.

The GPU can clear pages much faster, but mostly for larger buffers, as
GPU clearing requires a job submission which can make latency worse
for small buffers.

On devices with flat CCS, the Xe driver clears CCS metadata with a
clear job submission for all buffers. This series extends that clear
job to also clear system pages using the GPU, improving job submission
latency.

To test the series I created a small test that tries to submit a job
after binding buffers of various sizes.

Without the series:
sudo ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-benchmark
IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.9.0-xe+ x86_64)
Using IGT_SRANDOM=1718889799 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-store-benchmark
Starting dynamic subtest: WC
Dynamic subtest WC: SUCCESS (0.000s)
Time taken for size SZ_4K: 4882 us
Time taken for size SZ_2M: 3679 us
Time taken for size SZ_64M: 13367 us
Time taken for size SZ_128M: 21034 us
Time taken for size SZ_256M: 32940 us
Time taken for size SZ_1G: 116261 us
Starting dynamic subtest: WB
Dynamic subtest WB: SUCCESS (0.000s)
Time taken for size SZ_4K: 5417 us
Time taken for size SZ_2M: 5711 us
Time taken for size SZ_64M: 15718 us
Time taken for size SZ_128M: 26170 us
Time taken for size SZ_256M: 50529 us
Time taken for size SZ_1G: 177933 us
Subtest basic-store-benchmark: SUCCESS (0.504s)

With the series:
~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-benchmark
IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.9.0-xe+ x86_64)
Using IGT_SRANDOM=1718889593 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-store-benchmark
Starting dynamic subtest: WC
Dynamic subtest WC: SUCCESS (0.000s)
Time taken for size SZ_4K: 4479 us
Time taken for size SZ_2M: 3291 us
Time taken for size SZ_64M: 6595 us
Time taken for size SZ_128M: 9069 us
Time taken for size SZ_256M: 12681 us
Time taken for size SZ_1G: 41806 us
Starting dynamic subtest: WB
Dynamic subtest WB: SUCCESS (0.000s)
Time taken for size SZ_4K: 3317 us
Time taken for size SZ_2M: 6458 us
Time taken for size SZ_64M: 12802 us
Time taken for size SZ_128M: 19579 us
Time taken for size SZ_256M: 38768 us
Time taken for size SZ_1G: 143250 us
Subtest basic-store-benchmark: SUCCESS (0.328s)

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>

Nirmoy Das (2):
  drm/ttm/pool: Introduce a way to skip clear on free
  drm/xe/lnl: Offload system clear page activity to GPU

 drivers/gpu/drm/ttm/ttm_device.c     | 42 +++++++++++++++++++++---
 drivers/gpu/drm/ttm/ttm_pool.c       | 49 +++++++++++++++++++++-------
 drivers/gpu/drm/xe/xe_bo.c           |  4 +++
 drivers/gpu/drm/xe/xe_device.c       | 38 ++++++++++++++++-----
 drivers/gpu/drm/xe/xe_device_types.h |  2 ++
 drivers/gpu/drm/xe/xe_migrate.c      |  6 ++--
 include/drm/ttm/ttm_device.h         |  8 +++++
 include/drm/ttm/ttm_pool.h           | 11 +++++++
 8 files changed, 133 insertions(+), 27 deletions(-)

-- 
2.42.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH 1/2] drm/ttm/pool: Introduce a way to skip clear on free
  2024-06-20 13:46 [RFC PATCH 0/2] Clear system pages with GPU Nirmoy Das
@ 2024-06-20 13:46 ` Nirmoy Das
  2024-06-20 14:08   ` Christian König
  2024-06-20 13:46 ` [PATCH] drm/xe/lnl: Offload system clear page activity to GPU Nirmoy Das
  1 sibling, 1 reply; 9+ messages in thread
From: Nirmoy Das @ 2024-06-20 13:46 UTC (permalink / raw)
  To: dri-devel
  Cc: intel-xe, Nirmoy Das, Christian Koenig, Thomas Hellström,
	Matthew Auld

Clearing pages can be very slow when using the CPU, but GPUs can perform
this task much faster. With this new pool API a driver can decide if it
wants to clear pages using the GPU. This provides the flexibility to
choose the desired clear policy, either during allocation or upon
freeing, as per the driver's preference.

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
---
 drivers/gpu/drm/ttm/ttm_device.c | 42 +++++++++++++++++++++++----
 drivers/gpu/drm/ttm/ttm_pool.c   | 49 +++++++++++++++++++++++++-------
 include/drm/ttm/ttm_device.h     |  8 ++++++
 include/drm/ttm/ttm_pool.h       | 11 +++++++
 4 files changed, 94 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 434cf0258000..54a3ea825c2e 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -191,15 +191,19 @@ EXPORT_SYMBOL(ttm_device_swapout);
  * @vma_manager: A pointer to a vma manager.
  * @use_dma_alloc: If coherent DMA allocation API should be used.
  * @use_dma32: If we should use GFP_DMA32 for device memory allocations.
+ * @pool_flags: Desired pool flags
  *
  * Initializes a struct ttm_device:
  * Returns:
  * !0: Failure.
  */
-int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *funcs,
-		    struct device *dev, struct address_space *mapping,
-		    struct drm_vma_offset_manager *vma_manager,
-		    bool use_dma_alloc, bool use_dma32)
+int ttm_device_init_with_pool_flags(struct ttm_device *bdev,
+				    const struct ttm_device_funcs *funcs,
+				    struct device *dev,
+				    struct address_space *mapping,
+				    struct drm_vma_offset_manager *vma_manager,
+				    bool use_dma_alloc, bool use_dma32,
+				    unsigned int pool_flags)
 {
 	struct ttm_global *glob = &ttm_glob;
 	int ret, nid;
@@ -227,7 +231,8 @@ int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *func
 	else
 		nid = NUMA_NO_NODE;
 
-	ttm_pool_init(&bdev->pool, dev, nid, use_dma_alloc, use_dma32);
+	ttm_pool_init_with_flags(&bdev->pool, dev, nid, use_dma_alloc,
+				 use_dma32, pool_flags);
 
 	bdev->vma_manager = vma_manager;
 	spin_lock_init(&bdev->lru_lock);
@@ -239,6 +244,33 @@ int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *func
 
 	return 0;
 }
+EXPORT_SYMBOL(ttm_device_init_with_pool_flags);
+
+
+/**
+ * ttm_device_init
+ *
+ * @bdev: A pointer to a struct ttm_device to initialize.
+ * @funcs: Function table for the device.
+ * @dev: The core kernel device pointer for DMA mappings and allocations.
+ * @mapping: The address space to use for this bo.
+ * @vma_manager: A pointer to a vma manager.
+ * @use_dma_alloc: If coherent DMA allocation API should be used.
+ * @use_dma32: If we should use GFP_DMA32 for device memory allocations.
+ *
+ * Initializes a struct ttm_device:
+ * Returns:
+ * !0: Failure.
+ */
+int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *funcs,
+		    struct device *dev, struct address_space *mapping,
+		    struct drm_vma_offset_manager *vma_manager,
+		    bool use_dma_alloc, bool use_dma32)
+{
+	return ttm_device_init_with_pool_flags(bdev, funcs, dev, mapping,
+					       vma_manager, use_dma_alloc,
+					       use_dma32, 0);
+}
 EXPORT_SYMBOL(ttm_device_init);
 
 void ttm_device_fini(struct ttm_device *bdev)
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 6e1fd6985ffc..6f33c3e7cdf2 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -222,15 +222,17 @@ static void ttm_pool_unmap(struct ttm_pool *pool, dma_addr_t dma_addr,
 }
 
 /* Give pages into a specific pool_type */
-static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
+static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p, bool skip_clear)
 {
 	unsigned int i, num_pages = 1 << pt->order;
 
-	for (i = 0; i < num_pages; ++i) {
-		if (PageHighMem(p))
-			clear_highpage(p + i);
-		else
-			clear_page(page_address(p + i));
+	if (!skip_clear) {
+		for (i = 0; i < num_pages; ++i) {
+			if (PageHighMem(p))
+				clear_highpage(p + i);
+			else
+				clear_page(page_address(p + i));
+		}
 	}
 
 	spin_lock(&pt->lock);
@@ -396,7 +398,10 @@ static void ttm_pool_free_range(struct ttm_pool *pool, struct ttm_tt *tt,
 	struct page **pages = &tt->pages[start_page];
 	unsigned int order;
 	pgoff_t i, nr;
+	bool skip_clear = false;
 
+	if (pool->flags & TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE)
+		skip_clear = true;
 	for (i = start_page; i < end_page; i += nr, pages += nr) {
 		struct ttm_pool_type *pt = NULL;
 
@@ -407,7 +412,7 @@ static void ttm_pool_free_range(struct ttm_pool *pool, struct ttm_tt *tt,
 
 		pt = ttm_pool_select_type(pool, caching, order);
 		if (pt)
-			ttm_pool_type_give(pt, *pages);
+			ttm_pool_type_give(pt, *pages, skip_clear);
 		else
 			ttm_pool_free_page(pool, caching, order, *pages);
 	}
@@ -550,18 +555,21 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
 EXPORT_SYMBOL(ttm_pool_free);
 
 /**
- * ttm_pool_init - Initialize a pool
+ * ttm_pool_init_with_flags - Initialize a pool with flags
  *
  * @pool: the pool to initialize
  * @dev: device for DMA allocations and mappings
  * @nid: NUMA node to use for allocations
  * @use_dma_alloc: true if coherent DMA alloc should be used
  * @use_dma32: true if GFP_DMA32 should be used
+ * @flags: control flags for the pool
+ *
+ * Initialize the pool and its pool types with flags to modify defaults
  *
- * Initialize the pool and its pool types.
  */
-void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
-		   int nid, bool use_dma_alloc, bool use_dma32)
+void ttm_pool_init_with_flags(struct ttm_pool *pool, struct device *dev,
+		   int nid, bool use_dma_alloc, bool use_dma32,
+		   unsigned int flags)
 {
 	unsigned int i, j;
 
@@ -571,6 +579,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
 	pool->nid = nid;
 	pool->use_dma_alloc = use_dma_alloc;
 	pool->use_dma32 = use_dma32;
+	pool->flags = flags;
 
 	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
 		for (j = 0; j < NR_PAGE_ORDERS; ++j) {
@@ -585,6 +594,24 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
 		}
 	}
 }
+EXPORT_SYMBOL(ttm_pool_init_with_flags);
+
+/**
+ * ttm_pool_init - Initialize a pool
+ *
+ * @pool: the pool to initialize
+ * @dev: device for DMA allocations and mappings
+ * @nid: NUMA node to use for allocations
+ * @use_dma_alloc: true if coherent DMA alloc should be used
+ * @use_dma32: true if GFP_DMA32 should be used
+ *
+ * Initialize the pool and its pool types.
+ */
+void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
+		   int nid, bool use_dma_alloc, bool use_dma32)
+{
+	ttm_pool_init_with_flags(pool, dev, nid, use_dma_alloc, use_dma32, 0);
+}
 EXPORT_SYMBOL(ttm_pool_init);
 
 /**
diff --git a/include/drm/ttm/ttm_device.h b/include/drm/ttm/ttm_device.h
index c22f30535c84..1b20c5798e97 100644
--- a/include/drm/ttm/ttm_device.h
+++ b/include/drm/ttm/ttm_device.h
@@ -291,6 +291,14 @@ int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *func
 		    struct device *dev, struct address_space *mapping,
 		    struct drm_vma_offset_manager *vma_manager,
 		    bool use_dma_alloc, bool use_dma32);
+int ttm_device_init_with_pool_flags(struct ttm_device *bdev,
+				    const struct ttm_device_funcs *funcs,
+				    struct device *dev,
+				    struct address_space *mapping,
+				    struct drm_vma_offset_manager *vma_manager,
+				    bool use_dma_alloc, bool use_dma32,
+				    unsigned int pool_flags);
+
 void ttm_device_fini(struct ttm_device *bdev);
 void ttm_device_clear_dma_mappings(struct ttm_device *bdev);
 
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index 160d954a261e..9822996309e5 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -66,10 +66,17 @@ struct ttm_pool_type {
  * @use_dma_alloc: if coherent DMA allocations should be used
  * @use_dma32: if GFP_DMA32 should be used
  * @caching: pools for each caching/order
+ * @flags: flags to control certain pool behaviour
+ *
+ * The @flags can be:
+ *  - %TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE - This flag can be used to
+ *    skip clear on free when the driver decides to do that on its own.
  */
 struct ttm_pool {
 	struct device *dev;
 	int nid;
+#define TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE	(1 << 0)
+	unsigned int flags;
 
 	bool use_dma_alloc;
 	bool use_dma32;
@@ -85,6 +92,10 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);
 
 void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
 		   int nid, bool use_dma_alloc, bool use_dma32);
+void ttm_pool_init_with_flags(struct ttm_pool *pool, struct device *dev,
+			      int nid, bool use_dma_alloc, bool use_dma32,
+			      unsigned int flags);
+
 void ttm_pool_fini(struct ttm_pool *pool);
 
 int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);
-- 
2.42.0



* [PATCH] drm/xe/lnl: Offload system clear page activity to GPU
  2024-06-20 13:46 [RFC PATCH 0/2] Clear system pages with GPU Nirmoy Das
  2024-06-20 13:46 ` [RFC PATCH 1/2] drm/ttm/pool: Introduce a way to skip clear on free Nirmoy Das
@ 2024-06-20 13:46 ` Nirmoy Das
  2024-06-21 12:08   ` Ghimiray, Himal Prasad
  1 sibling, 1 reply; 9+ messages in thread
From: Nirmoy Das @ 2024-06-20 13:46 UTC (permalink / raw)
  To: dri-devel
  Cc: intel-xe, Nirmoy Das, Christian Koenig, Thomas Hellström,
	Matthew Auld

On LNL, because of flat CCS, the driver will create a migrate job to
clear CCS metadata. Extend that to also clear pages using the GPU via
the new TTM pool flag, which allows offloading page-clear activity to
the GPU.

This gives a very nice improvement for large buffers:
Without the patch:
 ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-benchmark
IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.9.0-xe+ x86_64)
Using IGT_SRANDOM=1718889799 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-store-benchmark
Starting dynamic subtest: WC
Dynamic subtest WC: SUCCESS (0.000s)
Time taken for size SZ_4K: 4882 us
Time taken for size SZ_2M: 3679 us
Time taken for size SZ_64M: 13367 us
Time taken for size SZ_128M: 21034 us
Time taken for size SZ_256M: 32940 us
Time taken for size SZ_1G: 116261 us
Starting dynamic subtest: WB
Dynamic subtest WB: SUCCESS (0.000s)
Time taken for size SZ_4K: 5417 us
Time taken for size SZ_2M: 5711 us
Time taken for size SZ_64M: 15718 us
Time taken for size SZ_128M: 26170 us
Time taken for size SZ_256M: 50529 us
Time taken for size SZ_1G: 177933 us
Subtest basic-store-benchmark: SUCCESS (0.504s)

With the patch:
sudo ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-benchmark
IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.9.0-xe+ x86_64)
Using IGT_SRANDOM=1718889593 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-store-benchmark
Starting dynamic subtest: WC
Dynamic subtest WC: SUCCESS (0.000s)
Time taken for size SZ_4K: 4479 us
Time taken for size SZ_2M: 3291 us
Time taken for size SZ_64M: 6595 us
Time taken for size SZ_128M: 9069 us
Time taken for size SZ_256M: 12681 us
Time taken for size SZ_1G: 41806 us
Starting dynamic subtest: WB
Dynamic subtest WB: SUCCESS (0.000s)
Time taken for size SZ_4K: 3317 us
Time taken for size SZ_2M: 6458 us
Time taken for size SZ_64M: 12802 us
Time taken for size SZ_128M: 19579 us
Time taken for size SZ_256M: 38768 us
Time taken for size SZ_1G: 143250 us
Subtest basic-store-benchmark: SUCCESS (0.328s)

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c           |  4 ++++
 drivers/gpu/drm/xe/xe_device.c       | 36 +++++++++++++++++++++-------
 drivers/gpu/drm/xe/xe_device_types.h |  2 ++
 drivers/gpu/drm/xe/xe_migrate.c      |  6 ++---
 4 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 65c696966e96..10ec02412dc4 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -387,6 +387,10 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
 		caching = ttm_uncached;
 	}
 
+	/* Clear TTM_TT_FLAG_ZERO_ALLOC when GPU is set to clear pages */
+	if (xe->mem.gpu_page_clear)
+		page_flags &= ~TTM_TT_FLAG_ZERO_ALLOC;
+
 	err = ttm_tt_init(&tt->ttm, &bo->ttm, page_flags, caching, extra_pages);
 	if (err) {
 		kfree(tt);
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 75d4c8ae9234..8e8d54c59aae 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -240,8 +240,6 @@ static void xe_device_destroy(struct drm_device *dev, void *dummy)
 
 	if (xe->unordered_wq)
 		destroy_workqueue(xe->unordered_wq);
-
-	ttm_device_fini(&xe->ttm);
 }
 
 struct xe_device *xe_device_create(struct pci_dev *pdev,
@@ -260,12 +258,6 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
 	if (IS_ERR(xe))
 		return xe;
 
-	err = ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
-			      xe->drm.anon_inode->i_mapping,
-			      xe->drm.vma_offset_manager, false, false);
-	if (WARN_ON(err))
-		goto err;
-
 	err = drmm_add_action_or_reset(&xe->drm, xe_device_destroy, NULL);
 	if (err)
 		goto err;
@@ -543,6 +535,13 @@ static int xe_device_set_has_flat_ccs(struct  xe_device *xe)
 	return xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
 }
 
+static void xe_device_destroy_ttm_device(struct drm_device *dev, void *dummy)
+{
+	struct xe_device *xe = to_xe_device(dev);
+
+	ttm_device_fini(&xe->ttm);
+}
+
 int xe_device_probe(struct xe_device *xe)
 {
 	struct xe_tile *tile;
@@ -550,6 +549,7 @@ int xe_device_probe(struct xe_device *xe)
 	int err;
 	u8 last_gt;
 	u8 id;
+	unsigned int ttm_pool_flags = 0;
 
 	xe_pat_init_early(xe);
 
@@ -572,6 +572,26 @@ int xe_device_probe(struct xe_device *xe)
 
 	xe_ttm_sys_mgr_init(xe);
 
+	/* On iGFX devices with flat CCS we already clear CCS metadata; let's
+	 * extend that and use the GPU to clear pages as well.
+	 */
+	if (xe_device_has_flat_ccs(xe) && !IS_DGFX(xe)) {
+		ttm_pool_flags = TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE;
+		xe->mem.gpu_page_clear = true;
+	}
+
+	err = ttm_device_init_with_pool_flags(&xe->ttm, &xe_ttm_funcs,
+					      xe->drm.dev,
+					      xe->drm.anon_inode->i_mapping,
+					      xe->drm.vma_offset_manager,
+					      false, false, ttm_pool_flags);
+	if (WARN_ON(err))
+		return err;
+
+	err = drmm_add_action_or_reset(&xe->drm, xe_device_destroy_ttm_device, NULL);
+	if (err)
+		return err;
+
 	for_each_gt(gt, xe, id) {
 		err = xe_gt_init_early(gt);
 		if (err)
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index c37be471d11c..ece68c6f3668 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -325,6 +325,8 @@ struct xe_device {
 		struct xe_mem_region vram;
 		/** @mem.sys_mgr: system TTM manager */
 		struct ttm_resource_manager sys_mgr;
+		/** @gpu_page_clear: clear pages offloaded to GPU */
+		bool gpu_page_clear;
 	} mem;
 
 	/** @sriov: device level virtualization data */
diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 05f933787860..0023f32d147d 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -1003,6 +1003,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 	struct xe_gt *gt = m->tile->primary_gt;
 	struct xe_device *xe = gt_to_xe(gt);
 	bool clear_system_ccs = (xe_bo_needs_ccs_pages(bo) && !IS_DGFX(xe)) ? true : false;
+	bool clear_on_create = xe->mem.gpu_page_clear;
 	struct dma_fence *fence = NULL;
 	u64 size = bo->size;
 	struct xe_res_cursor src_it;
@@ -1022,7 +1023,6 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 		struct xe_sched_job *job;
 		struct xe_bb *bb;
 		u32 batch_size, update_idx;
-
 		bool usm = xe->info.has_usm;
 		u32 avail_pts = max_mem_transfer_per_pass(xe) / LEVEL0_PAGE_TABLE_ENCODE_SIZE;
 
@@ -1032,7 +1032,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 		batch_size = 2 +
 			pte_update_size(m, clear_vram, src, &src_it,
 					&clear_L0, &clear_L0_ofs, &clear_L0_pt,
-					clear_system_ccs ? 0 : emit_clear_cmd_len(gt), 0,
+					!clear_on_create ? 0 : emit_clear_cmd_len(gt), 0,
 					avail_pts);
 
 		if (xe_device_has_flat_ccs(xe))
@@ -1060,7 +1060,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
 		update_idx = bb->len;
 
-		if (!clear_system_ccs)
+		if (clear_on_create)
 			emit_clear(gt, bb, clear_L0_ofs, clear_L0, XE_PAGE_SIZE, clear_vram);
 
 		if (xe_device_has_flat_ccs(xe)) {
-- 
2.42.0



* Re: [RFC PATCH 1/2] drm/ttm/pool: Introduce a way to skip clear on free
  2024-06-20 13:46 ` [RFC PATCH 1/2] drm/ttm/pool: Introduce a way to skip clear on free Nirmoy Das
@ 2024-06-20 14:08   ` Christian König
  2024-06-20 14:37     ` Nirmoy Das
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2024-06-20 14:08 UTC (permalink / raw)
  To: Nirmoy Das, dri-devel; +Cc: intel-xe, Thomas Hellström, Matthew Auld

Am 20.06.24 um 15:46 schrieb Nirmoy Das:
> Clearing pages can be very slow when using the CPU, but GPUs can perform
> this task much faster. With this new pool API a driver can decide if it
> wants to clear pages using the GPU. This provides the flexibility to
> choose the desired clear policy, either during allocation or upon
> freeing, as per the driver's preference.

We already have TTM_TT_FLAG_ZERO_ALLOC to indicate if pages need to
be cleared on alloc from the OS.

I'm not sure we really need the option to not clear them in the pool
as well, but if we do, I suggest switching from clear on free to clear
on alloc again and just honoring the flag.

Alternatively you could also split the pools into cleared and not 
cleared pages as well.

Regards,
Christian.

>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
> ---
>   drivers/gpu/drm/ttm/ttm_device.c | 42 +++++++++++++++++++++++----
>   drivers/gpu/drm/ttm/ttm_pool.c   | 49 +++++++++++++++++++++++++-------
>   include/drm/ttm/ttm_device.h     |  8 ++++++
>   include/drm/ttm/ttm_pool.h       | 11 +++++++
>   4 files changed, 94 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 434cf0258000..54a3ea825c2e 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -191,15 +191,19 @@ EXPORT_SYMBOL(ttm_device_swapout);
>    * @vma_manager: A pointer to a vma manager.
>    * @use_dma_alloc: If coherent DMA allocation API should be used.
>    * @use_dma32: If we should use GFP_DMA32 for device memory allocations.
> + * @pool_flags: Desired pool flags
>    *
>    * Initializes a struct ttm_device:
>    * Returns:
>    * !0: Failure.
>    */
> -int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *funcs,
> -		    struct device *dev, struct address_space *mapping,
> -		    struct drm_vma_offset_manager *vma_manager,
> -		    bool use_dma_alloc, bool use_dma32)
> +int ttm_device_init_with_pool_flags(struct ttm_device *bdev,
> +				    const struct ttm_device_funcs *funcs,
> +				    struct device *dev,
> +				    struct address_space *mapping,
> +				    struct drm_vma_offset_manager *vma_manager,
> +				    bool use_dma_alloc, bool use_dma32,
> +				    unsigned int pool_flags)
>   {
>   	struct ttm_global *glob = &ttm_glob;
>   	int ret, nid;
> @@ -227,7 +231,8 @@ int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *func
>   	else
>   		nid = NUMA_NO_NODE;
>   
> -	ttm_pool_init(&bdev->pool, dev, nid, use_dma_alloc, use_dma32);
> +	ttm_pool_init_with_flags(&bdev->pool, dev, nid, use_dma_alloc,
> +				 use_dma32, pool_flags);
>   
>   	bdev->vma_manager = vma_manager;
>   	spin_lock_init(&bdev->lru_lock);
> @@ -239,6 +244,33 @@ int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *func
>   
>   	return 0;
>   }
> +EXPORT_SYMBOL(ttm_device_init_with_pool_flags);
> +
> +
> +/**
> + * ttm_device_init
> + *
> + * @bdev: A pointer to a struct ttm_device to initialize.
> + * @funcs: Function table for the device.
> + * @dev: The core kernel device pointer for DMA mappings and allocations.
> + * @mapping: The address space to use for this bo.
> + * @vma_manager: A pointer to a vma manager.
> + * @use_dma_alloc: If coherent DMA allocation API should be used.
> + * @use_dma32: If we should use GFP_DMA32 for device memory allocations.
> + *
> + * Initializes a struct ttm_device:
> + * Returns:
> + * !0: Failure.
> + */
> +int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *funcs,
> +		    struct device *dev, struct address_space *mapping,
> +		    struct drm_vma_offset_manager *vma_manager,
> +		    bool use_dma_alloc, bool use_dma32)
> +{
> +	return ttm_device_init_with_pool_flags(bdev, funcs, dev, mapping,
> +					       vma_manager, use_dma_alloc,
> +					       use_dma32, 0);
> +}
>   EXPORT_SYMBOL(ttm_device_init);
>   
>   void ttm_device_fini(struct ttm_device *bdev)
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index 6e1fd6985ffc..6f33c3e7cdf2 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -222,15 +222,17 @@ static void ttm_pool_unmap(struct ttm_pool *pool, dma_addr_t dma_addr,
>   }
>   
>   /* Give pages into a specific pool_type */
> -static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p)
> +static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page *p, bool skip_clear)
>   {
>   	unsigned int i, num_pages = 1 << pt->order;
>   
> -	for (i = 0; i < num_pages; ++i) {
> -		if (PageHighMem(p))
> -			clear_highpage(p + i);
> -		else
> -			clear_page(page_address(p + i));
> +	if (!skip_clear) {
> +		for (i = 0; i < num_pages; ++i) {
> +			if (PageHighMem(p))
> +				clear_highpage(p + i);
> +			else
> +				clear_page(page_address(p + i));
> +		}
>   	}
>   
>   	spin_lock(&pt->lock);
> @@ -396,7 +398,10 @@ static void ttm_pool_free_range(struct ttm_pool *pool, struct ttm_tt *tt,
>   	struct page **pages = &tt->pages[start_page];
>   	unsigned int order;
>   	pgoff_t i, nr;
> +	bool skip_clear = false;
>   
> +	if (pool->flags & TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE)
> +		skip_clear = true;
>   	for (i = start_page; i < end_page; i += nr, pages += nr) {
>   		struct ttm_pool_type *pt = NULL;
>   
> @@ -407,7 +412,7 @@ static void ttm_pool_free_range(struct ttm_pool *pool, struct ttm_tt *tt,
>   
>   		pt = ttm_pool_select_type(pool, caching, order);
>   		if (pt)
> -			ttm_pool_type_give(pt, *pages);
> +			ttm_pool_type_give(pt, *pages, skip_clear);
>   		else
>   			ttm_pool_free_page(pool, caching, order, *pages);
>   	}
> @@ -550,18 +555,21 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt)
>   EXPORT_SYMBOL(ttm_pool_free);
>   
>   /**
> - * ttm_pool_init - Initialize a pool
> + * ttm_pool_init_with_flags - Initialize a pool with flags
>    *
>    * @pool: the pool to initialize
>    * @dev: device for DMA allocations and mappings
>    * @nid: NUMA node to use for allocations
>    * @use_dma_alloc: true if coherent DMA alloc should be used
>    * @use_dma32: true if GFP_DMA32 should be used
> + * @flags: control flags for the pool
> + *
> + * Initialize the pool and its pool types with flags to modify defaults
>    *
> - * Initialize the pool and its pool types.
>    */
> -void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
> -		   int nid, bool use_dma_alloc, bool use_dma32)
> +void ttm_pool_init_with_flags(struct ttm_pool *pool, struct device *dev,
> +		   int nid, bool use_dma_alloc, bool use_dma32,
> +		   unsigned int flags)
>   {
>   	unsigned int i, j;
>   
> @@ -571,6 +579,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>   	pool->nid = nid;
>   	pool->use_dma_alloc = use_dma_alloc;
>   	pool->use_dma32 = use_dma32;
> +	pool->flags = flags;
>   
>   	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
>   		for (j = 0; j < NR_PAGE_ORDERS; ++j) {
> @@ -585,6 +594,24 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>   		}
>   	}
>   }
> +EXPORT_SYMBOL(ttm_pool_init_with_flags);
> +
> +/**
> + * ttm_pool_init - Initialize a pool
> + *
> + * @pool: the pool to initialize
> + * @dev: device for DMA allocations and mappings
> + * @nid: NUMA node to use for allocations
> + * @use_dma_alloc: true if coherent DMA alloc should be used
> + * @use_dma32: true if GFP_DMA32 should be used
> + *
> + * Initialize the pool and its pool types.
> + */
> +void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
> +		   int nid, bool use_dma_alloc, bool use_dma32)
> +{
> +	ttm_pool_init_with_flags(pool, dev, nid, use_dma_alloc, use_dma32, 0);
> +}
>   EXPORT_SYMBOL(ttm_pool_init);
>   
>   /**
> diff --git a/include/drm/ttm/ttm_device.h b/include/drm/ttm/ttm_device.h
> index c22f30535c84..1b20c5798e97 100644
> --- a/include/drm/ttm/ttm_device.h
> +++ b/include/drm/ttm/ttm_device.h
> @@ -291,6 +291,14 @@ int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *func
>   		    struct device *dev, struct address_space *mapping,
>   		    struct drm_vma_offset_manager *vma_manager,
>   		    bool use_dma_alloc, bool use_dma32);
> +int ttm_device_init_with_pool_flags(struct ttm_device *bdev,
> +				    const struct ttm_device_funcs *funcs,
> +				    struct device *dev,
> +				    struct address_space *mapping,
> +				    struct drm_vma_offset_manager *vma_manager,
> +				    bool use_dma_alloc, bool use_dma32,
> +				    unsigned int pool_flags);
> +
>   void ttm_device_fini(struct ttm_device *bdev);
>   void ttm_device_clear_dma_mappings(struct ttm_device *bdev);
>   
> diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
> index 160d954a261e..9822996309e5 100644
> --- a/include/drm/ttm/ttm_pool.h
> +++ b/include/drm/ttm/ttm_pool.h
> @@ -66,10 +66,17 @@ struct ttm_pool_type {
>    * @use_dma_alloc: if coherent DMA allocations should be used
>    * @use_dma32: if GFP_DMA32 should be used
>    * @caching: pools for each caching/order
> + * @flags: flags to control certain pool behaviour
> + *
> + * The @flags can be:
> + *  - %TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE - This flag can be used to
> + *    skip clear on free when the driver decides to do that on its own.
>    */
>   struct ttm_pool {
>   	struct device *dev;
>   	int nid;
> +#define TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE	(1 << 0)
> +	unsigned int flags;
>   
>   	bool use_dma_alloc;
>   	bool use_dma32;
> @@ -85,6 +92,10 @@ void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);
>   
>   void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>   		   int nid, bool use_dma_alloc, bool use_dma32);
> +void ttm_pool_init_with_flags(struct ttm_pool *pool, struct device *dev,
> +			      int nid, bool use_dma_alloc, bool use_dma32,
> +			      unsigned int flags);
> +
>   void ttm_pool_fini(struct ttm_pool *pool);
>   
>   int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 1/2] drm/ttm/pool: Introduce a way to skip clear on free
  2024-06-20 14:08   ` Christian König
@ 2024-06-20 14:37     ` Nirmoy Das
  2024-06-20 14:45       ` Christian König
  0 siblings, 1 reply; 9+ messages in thread
From: Nirmoy Das @ 2024-06-20 14:37 UTC (permalink / raw)
  To: Christian König, dri-devel
  Cc: intel-xe, Thomas Hellström, Matthew Auld

Hi Christian,

On 6/20/2024 4:08 PM, Christian König wrote:
> Am 20.06.24 um 15:46 schrieb Nirmoy Das:
>> Clearing pages can be very slow when using CPU but GPUs can perform this
>> task much faster. With this new pool API driver can decide if it 
>> wants to
>> clear pages using GPU. This provides the flexibility to choose the 
>> desired
>> clear policy, either during allocation or upon freeing, as per the 
>> driver's
>> preference.
>
> We already have the TTM_TT_FLAG_ZERO_ALLOC to indicate if pages needs 
> to be cleared on alloc from the OS.
>
> I'm not sure if we really need the option to not clear them in the 
> pool as well, but if we really need this I suggest to switch from 
> clear on free to clear on alloc again and just honor the flag.


Perf reported higher latency caused by clearing pages before giving 
them back to the pool. I think it would be nice if the drm driver could 
avoid it.

I can modify this to move the page clearing into ttm_pool_type_take() 
to honor the TTM_TT_FLAG_ZERO_ALLOC flag.


>
> Alternatively you could also split the pools into cleared and not 
> cleared pages as well.


Could you expand on this, please?


I have another question. Our userspace team has found that there is 
higher latency for ttm_cached type buffers as well, and using a GPU 
clear doesn't help much because the kernel will clear the pages anyway 
when alloc_on_init/CONFIG_INIT_ON_ALLOC_DEFAULT_ON is active.

The only way I see to mitigate this is to use a pool for ttm_cached 
buffers. I was thinking of using a pool flag to also allow a drm driver 
to create a pool for ttm_cached. I wonder what you think about that, 
and whether there is any better solution.


Thanks,

Nirmoy

>
> Regards,
> Christian.
>
>>
>> Cc: Christian Koenig <christian.koenig@amd.com>
>> Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
>> ---
>>   drivers/gpu/drm/ttm/ttm_device.c | 42 +++++++++++++++++++++++----
>>   drivers/gpu/drm/ttm/ttm_pool.c   | 49 +++++++++++++++++++++++++-------
>>   include/drm/ttm/ttm_device.h     |  8 ++++++
>>   include/drm/ttm/ttm_pool.h       | 11 +++++++
>>   4 files changed, 94 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_device.c 
>> b/drivers/gpu/drm/ttm/ttm_device.c
>> index 434cf0258000..54a3ea825c2e 100644
>> --- a/drivers/gpu/drm/ttm/ttm_device.c
>> +++ b/drivers/gpu/drm/ttm/ttm_device.c
>> @@ -191,15 +191,19 @@ EXPORT_SYMBOL(ttm_device_swapout);
>>    * @vma_manager: A pointer to a vma manager.
>>    * @use_dma_alloc: If coherent DMA allocation API should be used.
>>    * @use_dma32: If we should use GFP_DMA32 for device memory 
>> allocations.
>> + * @pool_flags: Desired pool flags
>>    *
>>    * Initializes a struct ttm_device:
>>    * Returns:
>>    * !0: Failure.
>>    */
>> -int ttm_device_init(struct ttm_device *bdev, const struct 
>> ttm_device_funcs *funcs,
>> -            struct device *dev, struct address_space *mapping,
>> -            struct drm_vma_offset_manager *vma_manager,
>> -            bool use_dma_alloc, bool use_dma32)
>> +int ttm_device_init_with_pool_flags(struct ttm_device *bdev,
>> +                    const struct ttm_device_funcs *funcs,
>> +                    struct device *dev,
>> +                    struct address_space *mapping,
>> +                    struct drm_vma_offset_manager *vma_manager,
>> +                    bool use_dma_alloc, bool use_dma32,
>> +                    unsigned int pool_flags)
>>   {
>>       struct ttm_global *glob = &ttm_glob;
>>       int ret, nid;
>> @@ -227,7 +231,8 @@ int ttm_device_init(struct ttm_device *bdev, 
>> const struct ttm_device_funcs *func
>>       else
>>           nid = NUMA_NO_NODE;
>>   -    ttm_pool_init(&bdev->pool, dev, nid, use_dma_alloc, use_dma32);
>> +    ttm_pool_init_with_flags(&bdev->pool, dev, nid, use_dma_alloc,
>> +                 use_dma32, pool_flags);
>>         bdev->vma_manager = vma_manager;
>>       spin_lock_init(&bdev->lru_lock);
>> @@ -239,6 +244,33 @@ int ttm_device_init(struct ttm_device *bdev, 
>> const struct ttm_device_funcs *func
>>         return 0;
>>   }
>> +EXPORT_SYMBOL(ttm_device_init_with_pool_flags);
>> +
>> +
>> +/**
>> + * ttm_device_init
>> + *
>> + * @bdev: A pointer to a struct ttm_device to initialize.
>> + * @funcs: Function table for the device.
>> + * @dev: The core kernel device pointer for DMA mappings and 
>> allocations.
>> + * @mapping: The address space to use for this bo.
>> + * @vma_manager: A pointer to a vma manager.
>> + * @use_dma_alloc: If coherent DMA allocation API should be used.
>> + * @use_dma32: If we should use GFP_DMA32 for device memory 
>> allocations.
>> + *
>> + * Initializes a struct ttm_device:
>> + * Returns:
>> + * !0: Failure.
>> + */
>> +int ttm_device_init(struct ttm_device *bdev, const struct 
>> ttm_device_funcs *funcs,
>> +            struct device *dev, struct address_space *mapping,
>> +            struct drm_vma_offset_manager *vma_manager,
>> +            bool use_dma_alloc, bool use_dma32)
>> +{
>> +    return ttm_device_init_with_pool_flags(bdev, funcs, dev, mapping,
>> +                           vma_manager, use_dma_alloc,
>> +                           use_dma32, 0);
>> +}
>>   EXPORT_SYMBOL(ttm_device_init);
>>     void ttm_device_fini(struct ttm_device *bdev)
>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c 
>> b/drivers/gpu/drm/ttm/ttm_pool.c
>> index 6e1fd6985ffc..6f33c3e7cdf2 100644
>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>> @@ -222,15 +222,17 @@ static void ttm_pool_unmap(struct ttm_pool 
>> *pool, dma_addr_t dma_addr,
>>   }
>>     /* Give pages into a specific pool_type */
>> -static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page 
>> *p)
>> +static void ttm_pool_type_give(struct ttm_pool_type *pt, struct page 
>> *p, bool skip_clear)
>>   {
>>       unsigned int i, num_pages = 1 << pt->order;
>>   -    for (i = 0; i < num_pages; ++i) {
>> -        if (PageHighMem(p))
>> -            clear_highpage(p + i);
>> -        else
>> -            clear_page(page_address(p + i));
>> +    if (!skip_clear) {
>> +        for (i = 0; i < num_pages; ++i) {
>> +            if (PageHighMem(p))
>> +                clear_highpage(p + i);
>> +            else
>> +                clear_page(page_address(p + i));
>> +        }
>>       }
>>         spin_lock(&pt->lock);
>> @@ -396,7 +398,10 @@ static void ttm_pool_free_range(struct ttm_pool 
>> *pool, struct ttm_tt *tt,
>>       struct page **pages = &tt->pages[start_page];
>>       unsigned int order;
>>       pgoff_t i, nr;
>> +    bool skip_clear = false;
>>   +    if (pool->flags & TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE)
>> +        skip_clear = true;
>>       for (i = start_page; i < end_page; i += nr, pages += nr) {
>>           struct ttm_pool_type *pt = NULL;
>>   @@ -407,7 +412,7 @@ static void ttm_pool_free_range(struct ttm_pool 
>> *pool, struct ttm_tt *tt,
>>             pt = ttm_pool_select_type(pool, caching, order);
>>           if (pt)
>> -            ttm_pool_type_give(pt, *pages);
>> +            ttm_pool_type_give(pt, *pages, skip_clear);
>>           else
>>               ttm_pool_free_page(pool, caching, order, *pages);
>>       }
>> @@ -550,18 +555,21 @@ void ttm_pool_free(struct ttm_pool *pool, 
>> struct ttm_tt *tt)
>>   EXPORT_SYMBOL(ttm_pool_free);
>>     /**
>> - * ttm_pool_init - Initialize a pool
>> + * ttm_pool_init_with_flags - Initialize a pool with flags
>>    *
>>    * @pool: the pool to initialize
>>    * @dev: device for DMA allocations and mappings
>>    * @nid: NUMA node to use for allocations
>>    * @use_dma_alloc: true if coherent DMA alloc should be used
>>    * @use_dma32: true if GFP_DMA32 should be used
>> + * @flags: control flags for the pool
>> + *
>> + * Initialize the pool and its pool types with flags to modify defaults
>>    *
>> - * Initialize the pool and its pool types.
>>    */
>> -void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>> -           int nid, bool use_dma_alloc, bool use_dma32)
>> +void ttm_pool_init_with_flags(struct ttm_pool *pool, struct device 
>> *dev,
>> +           int nid, bool use_dma_alloc, bool use_dma32,
>> +           unsigned int flags)
>>   {
>>       unsigned int i, j;
>>   @@ -571,6 +579,7 @@ void ttm_pool_init(struct ttm_pool *pool, 
>> struct device *dev,
>>       pool->nid = nid;
>>       pool->use_dma_alloc = use_dma_alloc;
>>       pool->use_dma32 = use_dma32;
>> +    pool->flags = flags;
>>         for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
>>           for (j = 0; j < NR_PAGE_ORDERS; ++j) {
>> @@ -585,6 +594,24 @@ void ttm_pool_init(struct ttm_pool *pool, struct 
>> device *dev,
>>           }
>>       }
>>   }
>> +EXPORT_SYMBOL(ttm_pool_init_with_flags);
>> +
>> +/**
>> + * ttm_pool_init - Initialize a pool
>> + *
>> + * @pool: the pool to initialize
>> + * @dev: device for DMA allocations and mappings
>> + * @nid: NUMA node to use for allocations
>> + * @use_dma_alloc: true if coherent DMA alloc should be used
>> + * @use_dma32: true if GFP_DMA32 should be used
>> + *
>> + * Initialize the pool and its pool types.
>> + */
>> +void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>> +           int nid, bool use_dma_alloc, bool use_dma32)
>> +{
>> +    ttm_pool_init_with_flags(pool, dev, nid, use_dma_alloc, 
>> use_dma32, 0);
>> +}
>>   EXPORT_SYMBOL(ttm_pool_init);
>>     /**
>> diff --git a/include/drm/ttm/ttm_device.h b/include/drm/ttm/ttm_device.h
>> index c22f30535c84..1b20c5798e97 100644
>> --- a/include/drm/ttm/ttm_device.h
>> +++ b/include/drm/ttm/ttm_device.h
>> @@ -291,6 +291,14 @@ int ttm_device_init(struct ttm_device *bdev, 
>> const struct ttm_device_funcs *func
>>               struct device *dev, struct address_space *mapping,
>>               struct drm_vma_offset_manager *vma_manager,
>>               bool use_dma_alloc, bool use_dma32);
>> +int ttm_device_init_with_pool_flags(struct ttm_device *bdev,
>> +                    const struct ttm_device_funcs *funcs,
>> +                    struct device *dev,
>> +                    struct address_space *mapping,
>> +                    struct drm_vma_offset_manager *vma_manager,
>> +                    bool use_dma_alloc, bool use_dma32,
>> +                    unsigned int pool_flags);
>> +
>>   void ttm_device_fini(struct ttm_device *bdev);
>>   void ttm_device_clear_dma_mappings(struct ttm_device *bdev);
>>   diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
>> index 160d954a261e..9822996309e5 100644
>> --- a/include/drm/ttm/ttm_pool.h
>> +++ b/include/drm/ttm/ttm_pool.h
>> @@ -66,10 +66,17 @@ struct ttm_pool_type {
>>    * @use_dma_alloc: if coherent DMA allocations should be used
>>    * @use_dma32: if GFP_DMA32 should be used
>>    * @caching: pools for each caching/order
>> + * @flags: flags to control certain pool behaviour
>> + *
>> + * The @flags can be:
>> + *  - %TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE - This flag can be used to
>> + *    skip clear on free when driver decides to do that on it's own.
>>    */
>>   struct ttm_pool {
>>       struct device *dev;
>>       int nid;
>> +#define TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE    1 << 0
>> +    unsigned int flags;
>>         bool use_dma_alloc;
>>       bool use_dma32;
>> @@ -85,6 +92,10 @@ void ttm_pool_free(struct ttm_pool *pool, struct 
>> ttm_tt *tt);
>>     void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>              int nid, bool use_dma_alloc, bool use_dma32);
>> +void ttm_pool_init_with_flags(struct ttm_pool *pool, struct device 
>> *dev,
>> +                  int nid, bool use_dma_alloc, bool use_dma32,
>> +                  unsigned int flags);
>> +
>>   void ttm_pool_fini(struct ttm_pool *pool);
>>     int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);
>


* Re: [RFC PATCH 1/2] drm/ttm/pool: Introduce a way to skip clear on free
  2024-06-20 14:37     ` Nirmoy Das
@ 2024-06-20 14:45       ` Christian König
  2024-06-20 15:11         ` Nirmoy Das
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2024-06-20 14:45 UTC (permalink / raw)
  To: Nirmoy Das, dri-devel; +Cc: intel-xe, Thomas Hellström, Matthew Auld

Hi Nirmoy,

Am 20.06.24 um 16:37 schrieb Nirmoy Das:
> Hi Christian,
>
> On 6/20/2024 4:08 PM, Christian König wrote:
>> Am 20.06.24 um 15:46 schrieb Nirmoy Das:
>>> Clearing pages can be very slow when using CPU but GPUs can perform 
>>> this
>>> task much faster. With this new pool API driver can decide if it 
>>> wants to
>>> clear pages using GPU. This provides the flexibility to choose the 
>>> desired
>>> clear policy, either during allocation or upon freeing, as per the 
>>> driver's
>>> preference.
>>
>> We already have the TTM_TT_FLAG_ZERO_ALLOC to indicate if pages needs 
>> to be cleared on alloc from the OS.
>>
>> I'm not sure if we really need the option to not clear them in the 
>> pool as well, but if we really need this I suggest to switch from 
>> clear on free to clear on alloc again and just honor the flag.
>
>
> Perf reported higher latency because of clearing pages before giving 
> back to the pool. I think it would be nice if drm driver could avoid it.
>
> I  can modify this to move clearing page to ttm_pool_type_take() to 
> honor TTM_TT_FLAG_ZERO_ALLOC flags.

Both approaches have pros and cons. IIRC we intentionally moved the 
clearing to the free function to avoid latency on allocation.

>>
>> Alternatively you could also split the pools into cleared and not 
>> cleared pages as well.
>
>
> Could you expand this please ?

Just create separate pools for cleared and uncleared pages (or separate 
lists inside the pools).

Then when you see the TTM_TT_FLAG_ZERO_ALLOC flag you try to grab things 
from the uncleared pool and if you don't see it try to grab things from 
the cleared pool.

Same for release of pages, just the other way around.
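
If it helps to picture this, here is a rough user-space sketch of the 
split-pool idea (plain C, not the actual TTM code; the names 
pool_take/pool_give, the single-page lists, and the ZERO_ALLOC bit are 
simplifications invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 16
#define ZERO_ALLOC 0x1u /* models TTM_TT_FLAG_ZERO_ALLOC */

struct page { char data[PAGE_SZ]; struct page *next; };

/* One pool type with separate free lists for cleared and uncleared pages. */
struct pool_type { struct page *cleared; struct page *uncleared; };

static struct page *list_pop(struct page **head)
{
	struct page *p = *head;

	if (p)
		*head = p->next;
	return p;
}

static void list_push(struct page **head, struct page *p)
{
	p->next = *head;
	*head = p;
}

/* Take: prefer the list matching the request; when a zeroed page is
 * wanted but only dirty ones remain, clear on demand. */
static struct page *pool_take(struct pool_type *pt, unsigned int flags)
{
	struct page *p;

	if (flags & ZERO_ALLOC) {
		p = list_pop(&pt->cleared);
		if (!p && (p = list_pop(&pt->uncleared)))
			memset(p->data, 0, PAGE_SZ); /* clear on demand */
	} else {
		p = list_pop(&pt->uncleared);
		if (!p)
			p = list_pop(&pt->cleared);
	}
	return p;
}

/* Give: pages the driver already cleared (e.g. via a GPU job) go on the
 * cleared list; others stay dirty with clearing deferred to take time. */
static void pool_give(struct pool_type *pt, struct page *p, bool is_cleared)
{
	if (is_cleared)
		list_push(&pt->cleared, p);
	else
		list_push(&pt->uncleared, p);
}
```

This way clearing happens at most once per page, and only for consumers 
that actually asked for zeroed memory.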

>
> I have another question. Our userspace team have  found that there is 
> higher latency for ttm_cached type buffer as well and using gpu clear 
> doesn't help much
>
> because kernel will clear pages anyways if 
> alloc_on_init/CONFIG_INIT_ON_ALLOC_DEFAULT_ON is active. I see that 
> only way to mitigate this is to use a pool for
>
> ttm_cached buffers.  I was thinking of using a pool flag to also allow 
> drm driver to create a pool for ttm_cached. I wonder what do you think 
> about it and if
>
> there is any other better solution.

Well, I would clearly have to NAK a hack like this.

We only have a pool for uncached and WC pages because of the lack of 
support for those in the general memory and DMA management.

The TTM_TT_FLAG_ZERO_ALLOC flag should control whether GFP_ZERO is set. 
If the core MM decides to ignore that and clear pages anyway, then you 
need to talk to the core MM people if you want to avoid that.
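
A toy model of that interaction (user-space C, not kernel code; 
toy_alloc_page and the init_on_alloc variable merely mimic the effect of 
CONFIG_INIT_ON_ALLOC_DEFAULT_ON): with init-on-alloc active the 
allocator zeroes the memory regardless of whether the caller asked for 
it, so a later GPU clear of the same pages is redundant work:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define GFP_ZERO 0x1u

/* Mimics booting with init_on_alloc=1 (CONFIG_INIT_ON_ALLOC_DEFAULT_ON). */
static bool init_on_alloc = true;

/* With init-on-alloc the allocator zeroes unconditionally; GFP_ZERO
 * (what TTM_TT_FLAG_ZERO_ALLOC would translate to) adds nothing, and
 * clearing again on the GPU would zero the same memory a second time. */
static void *toy_alloc_page(unsigned int gfp, size_t size)
{
	void *p = malloc(size);

	if (p && (init_on_alloc || (gfp & GFP_ZERO)))
		memset(p, 0, size);
	return p;
}
```

Under this model, skipping the GPU clear when init-on-alloc is active 
would be the only way to avoid the double clear; the allocator side 
cannot be bypassed from a driver.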

Regards,
Christian.

>
>
> Thanks,
>
> Nirmoy
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Cc: Christian Koenig <christian.koenig@amd.com>
>>> Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
>>> Cc: Matthew Auld <matthew.auld@intel.com>
>>> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
>>> ---
>>>   drivers/gpu/drm/ttm/ttm_device.c | 42 +++++++++++++++++++++++----
>>>   drivers/gpu/drm/ttm/ttm_pool.c   | 49 
>>> +++++++++++++++++++++++++-------
>>>   include/drm/ttm/ttm_device.h     |  8 ++++++
>>>   include/drm/ttm/ttm_pool.h       | 11 +++++++
>>>   4 files changed, 94 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/ttm/ttm_device.c 
>>> b/drivers/gpu/drm/ttm/ttm_device.c
>>> index 434cf0258000..54a3ea825c2e 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_device.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_device.c
>>> @@ -191,15 +191,19 @@ EXPORT_SYMBOL(ttm_device_swapout);
>>>    * @vma_manager: A pointer to a vma manager.
>>>    * @use_dma_alloc: If coherent DMA allocation API should be used.
>>>    * @use_dma32: If we should use GFP_DMA32 for device memory 
>>> allocations.
>>> + * @pool_flags: Desired pool flags
>>>    *
>>>    * Initializes a struct ttm_device:
>>>    * Returns:
>>>    * !0: Failure.
>>>    */
>>> -int ttm_device_init(struct ttm_device *bdev, const struct 
>>> ttm_device_funcs *funcs,
>>> -            struct device *dev, struct address_space *mapping,
>>> -            struct drm_vma_offset_manager *vma_manager,
>>> -            bool use_dma_alloc, bool use_dma32)
>>> +int ttm_device_init_with_pool_flags(struct ttm_device *bdev,
>>> +                    const struct ttm_device_funcs *funcs,
>>> +                    struct device *dev,
>>> +                    struct address_space *mapping,
>>> +                    struct drm_vma_offset_manager *vma_manager,
>>> +                    bool use_dma_alloc, bool use_dma32,
>>> +                    unsigned int pool_flags)
>>>   {
>>>       struct ttm_global *glob = &ttm_glob;
>>>       int ret, nid;
>>> @@ -227,7 +231,8 @@ int ttm_device_init(struct ttm_device *bdev, 
>>> const struct ttm_device_funcs *func
>>>       else
>>>           nid = NUMA_NO_NODE;
>>>   -    ttm_pool_init(&bdev->pool, dev, nid, use_dma_alloc, use_dma32);
>>> +    ttm_pool_init_with_flags(&bdev->pool, dev, nid, use_dma_alloc,
>>> +                 use_dma32, pool_flags);
>>>         bdev->vma_manager = vma_manager;
>>>       spin_lock_init(&bdev->lru_lock);
>>> @@ -239,6 +244,33 @@ int ttm_device_init(struct ttm_device *bdev, 
>>> const struct ttm_device_funcs *func
>>>         return 0;
>>>   }
>>> +EXPORT_SYMBOL(ttm_device_init_with_pool_flags);
>>> +
>>> +
>>> +/**
>>> + * ttm_device_init
>>> + *
>>> + * @bdev: A pointer to a struct ttm_device to initialize.
>>> + * @funcs: Function table for the device.
>>> + * @dev: The core kernel device pointer for DMA mappings and 
>>> allocations.
>>> + * @mapping: The address space to use for this bo.
>>> + * @vma_manager: A pointer to a vma manager.
>>> + * @use_dma_alloc: If coherent DMA allocation API should be used.
>>> + * @use_dma32: If we should use GFP_DMA32 for device memory 
>>> allocations.
>>> + *
>>> + * Initializes a struct ttm_device:
>>> + * Returns:
>>> + * !0: Failure.
>>> + */
>>> +int ttm_device_init(struct ttm_device *bdev, const struct 
>>> ttm_device_funcs *funcs,
>>> +            struct device *dev, struct address_space *mapping,
>>> +            struct drm_vma_offset_manager *vma_manager,
>>> +            bool use_dma_alloc, bool use_dma32)
>>> +{
>>> +    return ttm_device_init_with_pool_flags(bdev, funcs, dev, mapping,
>>> +                           vma_manager, use_dma_alloc,
>>> +                           use_dma32, 0);
>>> +}
>>>   EXPORT_SYMBOL(ttm_device_init);
>>>     void ttm_device_fini(struct ttm_device *bdev)
>>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c 
>>> b/drivers/gpu/drm/ttm/ttm_pool.c
>>> index 6e1fd6985ffc..6f33c3e7cdf2 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>>> @@ -222,15 +222,17 @@ static void ttm_pool_unmap(struct ttm_pool 
>>> *pool, dma_addr_t dma_addr,
>>>   }
>>>     /* Give pages into a specific pool_type */
>>> -static void ttm_pool_type_give(struct ttm_pool_type *pt, struct 
>>> page *p)
>>> +static void ttm_pool_type_give(struct ttm_pool_type *pt, struct 
>>> page *p, bool skip_clear)
>>>   {
>>>       unsigned int i, num_pages = 1 << pt->order;
>>>   -    for (i = 0; i < num_pages; ++i) {
>>> -        if (PageHighMem(p))
>>> -            clear_highpage(p + i);
>>> -        else
>>> -            clear_page(page_address(p + i));
>>> +    if (!skip_clear) {
>>> +        for (i = 0; i < num_pages; ++i) {
>>> +            if (PageHighMem(p))
>>> +                clear_highpage(p + i);
>>> +            else
>>> +                clear_page(page_address(p + i));
>>> +        }
>>>       }
>>>         spin_lock(&pt->lock);
>>> @@ -396,7 +398,10 @@ static void ttm_pool_free_range(struct ttm_pool 
>>> *pool, struct ttm_tt *tt,
>>>       struct page **pages = &tt->pages[start_page];
>>>       unsigned int order;
>>>       pgoff_t i, nr;
>>> +    bool skip_clear = false;
>>>   +    if (pool->flags & TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE)
>>> +        skip_clear = true;
>>>       for (i = start_page; i < end_page; i += nr, pages += nr) {
>>>           struct ttm_pool_type *pt = NULL;
>>>   @@ -407,7 +412,7 @@ static void ttm_pool_free_range(struct 
>>> ttm_pool *pool, struct ttm_tt *tt,
>>>             pt = ttm_pool_select_type(pool, caching, order);
>>>           if (pt)
>>> -            ttm_pool_type_give(pt, *pages);
>>> +            ttm_pool_type_give(pt, *pages, skip_clear);
>>>           else
>>>               ttm_pool_free_page(pool, caching, order, *pages);
>>>       }
>>> @@ -550,18 +555,21 @@ void ttm_pool_free(struct ttm_pool *pool, 
>>> struct ttm_tt *tt)
>>>   EXPORT_SYMBOL(ttm_pool_free);
>>>     /**
>>> - * ttm_pool_init - Initialize a pool
>>> + * ttm_pool_init_with_flags - Initialize a pool with flags
>>>    *
>>>    * @pool: the pool to initialize
>>>    * @dev: device for DMA allocations and mappings
>>>    * @nid: NUMA node to use for allocations
>>>    * @use_dma_alloc: true if coherent DMA alloc should be used
>>>    * @use_dma32: true if GFP_DMA32 should be used
>>> + * @flags: control flags for the pool
>>> + *
>>> + * Initialize the pool and its pool types with flags to modify 
>>> defaults
>>>    *
>>> - * Initialize the pool and its pool types.
>>>    */
>>> -void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>> -           int nid, bool use_dma_alloc, bool use_dma32)
>>> +void ttm_pool_init_with_flags(struct ttm_pool *pool, struct device 
>>> *dev,
>>> +           int nid, bool use_dma_alloc, bool use_dma32,
>>> +           unsigned int flags)
>>>   {
>>>       unsigned int i, j;
>>>   @@ -571,6 +579,7 @@ void ttm_pool_init(struct ttm_pool *pool, 
>>> struct device *dev,
>>>       pool->nid = nid;
>>>       pool->use_dma_alloc = use_dma_alloc;
>>>       pool->use_dma32 = use_dma32;
>>> +    pool->flags = flags;
>>>         for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
>>>           for (j = 0; j < NR_PAGE_ORDERS; ++j) {
>>> @@ -585,6 +594,24 @@ void ttm_pool_init(struct ttm_pool *pool, 
>>> struct device *dev,
>>>           }
>>>       }
>>>   }
>>> +EXPORT_SYMBOL(ttm_pool_init_with_flags);
>>> +
>>> +/**
>>> + * ttm_pool_init - Initialize a pool
>>> + *
>>> + * @pool: the pool to initialize
>>> + * @dev: device for DMA allocations and mappings
>>> + * @nid: NUMA node to use for allocations
>>> + * @use_dma_alloc: true if coherent DMA alloc should be used
>>> + * @use_dma32: true if GFP_DMA32 should be used
>>> + *
>>> + * Initialize the pool and its pool types.
>>> + */
>>> +void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>> +           int nid, bool use_dma_alloc, bool use_dma32)
>>> +{
>>> +    ttm_pool_init_with_flags(pool, dev, nid, use_dma_alloc, 
>>> use_dma32, 0);
>>> +}
>>>   EXPORT_SYMBOL(ttm_pool_init);
>>>     /**
>>> diff --git a/include/drm/ttm/ttm_device.h 
>>> b/include/drm/ttm/ttm_device.h
>>> index c22f30535c84..1b20c5798e97 100644
>>> --- a/include/drm/ttm/ttm_device.h
>>> +++ b/include/drm/ttm/ttm_device.h
>>> @@ -291,6 +291,14 @@ int ttm_device_init(struct ttm_device *bdev, 
>>> const struct ttm_device_funcs *func
>>>               struct device *dev, struct address_space *mapping,
>>>               struct drm_vma_offset_manager *vma_manager,
>>>               bool use_dma_alloc, bool use_dma32);
>>> +int ttm_device_init_with_pool_flags(struct ttm_device *bdev,
>>> +                    const struct ttm_device_funcs *funcs,
>>> +                    struct device *dev,
>>> +                    struct address_space *mapping,
>>> +                    struct drm_vma_offset_manager *vma_manager,
>>> +                    bool use_dma_alloc, bool use_dma32,
>>> +                    unsigned int pool_flags);
>>> +
>>>   void ttm_device_fini(struct ttm_device *bdev);
>>>   void ttm_device_clear_dma_mappings(struct ttm_device *bdev);
>>>   diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
>>> index 160d954a261e..9822996309e5 100644
>>> --- a/include/drm/ttm/ttm_pool.h
>>> +++ b/include/drm/ttm/ttm_pool.h
>>> @@ -66,10 +66,17 @@ struct ttm_pool_type {
>>>    * @use_dma_alloc: if coherent DMA allocations should be used
>>>    * @use_dma32: if GFP_DMA32 should be used
>>>    * @caching: pools for each caching/order
>>> + * @flags: flags to control certain pool behaviour
>>> + *
>>> + * The @flags can be:
>>> + *  - %TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE - This flag can be used to
>>> + *    skip clear on free when driver decides to do that on it's own.
>>>    */
>>>   struct ttm_pool {
>>>       struct device *dev;
>>>       int nid;
>>> +#define TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE    1 << 0
>>> +    unsigned int flags;
>>>         bool use_dma_alloc;
>>>       bool use_dma32;
>>> @@ -85,6 +92,10 @@ void ttm_pool_free(struct ttm_pool *pool, struct 
>>> ttm_tt *tt);
>>>     void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>>              int nid, bool use_dma_alloc, bool use_dma32);
>>> +void ttm_pool_init_with_flags(struct ttm_pool *pool, struct device 
>>> *dev,
>>> +                  int nid, bool use_dma_alloc, bool use_dma32,
>>> +                  unsigned int flags);
>>> +
>>>   void ttm_pool_fini(struct ttm_pool *pool);
>>>     int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);
>>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 1/2] drm/ttm/pool: Introduce a way to skip clear on free
  2024-06-20 14:45       ` Christian König
@ 2024-06-20 15:11         ` Nirmoy Das
  0 siblings, 0 replies; 9+ messages in thread
From: Nirmoy Das @ 2024-06-20 15:11 UTC (permalink / raw)
  To: Christian König, Nirmoy Das, dri-devel
  Cc: intel-xe, Thomas Hellström, Matthew Auld

Hi Christian,

On 6/20/2024 4:45 PM, Christian König wrote:
> Hi Nirmoy,
>
> Am 20.06.24 um 16:37 schrieb Nirmoy Das:
>> Hi Christian,
>>
>> On 6/20/2024 4:08 PM, Christian König wrote:
>>> Am 20.06.24 um 15:46 schrieb Nirmoy Das:
>>>> Clearing pages can be very slow when using the CPU, but GPUs can
>>>> perform this task much faster. With this new pool API a driver can
>>>> decide if it wants to clear pages using the GPU. This provides the
>>>> flexibility to choose the desired clear policy, either during
>>>> allocation or upon freeing, as per the driver's preference.
>>>
>>> We already have TTM_TT_FLAG_ZERO_ALLOC to indicate if pages need to
>>> be cleared on alloc from the OS.
>>>
>>> I'm not sure if we really need the option to not clear them in the
>>> pool as well, but if we really need this I suggest switching from
>>> clear-on-free to clear-on-alloc again and just honoring the flag.
>>
>>
>> Perf reported higher latency because of clearing pages before giving
>> them back to the pool. I think it would be nice if the drm driver
>> could avoid it.
>>
>> I can modify this to move the page clearing to ttm_pool_type_take()
>> to honor the TTM_TT_FLAG_ZERO_ALLOC flag.
>
> Both approaches have some pros and cons. IIRC we intentionally moved
> the clearing to the free function to avoid latency on allocation.


Ah, I see. I will send a patch to honor the TTM_TT_FLAG_ZERO_ALLOC flag,
and if we need clear-on-free then we can add another flag for that.

>
>>>
>>> Alternatively you could also split the pools into cleared and not 
>>> cleared pages as well.
>>
>>
>> Could you expand this please ?
>
> Just create separate pools for cleared and uncleared pages (or 
> separate lists inside the pools).
>
> Then when you see the TTM_TT_FLAG_ZERO_ALLOC flag you try to grab 
> things from the uncleared pool and if you don't see it try to grab 
> things from the cleared pool.
>
> Same for release of pages, just the other way around.


I get it now. My main goal is to avoid the CPU clear, so this will work
well with the above change too.

>
>>
>> I have another question. Our userspace team has found that there is
>> higher latency for ttm_cached type buffers as well, and using the GPU
>> clear doesn't help much because the kernel will clear pages anyway if
>> alloc_on_init/CONFIG_INIT_ON_ALLOC_DEFAULT_ON is active. The only way
>> I see to mitigate this is to use a pool for ttm_cached buffers. I was
>> thinking of using a pool flag to also allow the drm driver to create
>> a pool for ttm_cached. I wonder what you think about it and if there
>> is any better solution.
>
> Well I would clearly have to NAK a hack like this.

I thought so :)


>
> We only have a pool for uncached and WC pages because of lack of 
> support for that in the general memory and DMA management.
>
> The TTM_TT_FLAG_ZERO_ALLOC should control if GFP_ZERO is set or not. 
> If the core MM decides to ignore that and clear pages anyway then you 
> need to talk to the core MM people if you want to avoid that.
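Honoring the flag as described could look roughly like this (a toy
sketch; the TOY_* constants and helper names are made-up stand-ins for
GFP_ZERO, TTM_TT_FLAG_ZERO_ALLOC and the pool/driver code, values
chosen arbitrarily):

```c
#include <assert.h>
#include <stdbool.h>

/* Made-up stand-ins for the real kernel flags. */
#define TOY_GFP_ZERO           (1u << 0)
#define TOY_TT_FLAG_ZERO_ALLOC (1u << 1)

/*
 * Pool side: request zeroed memory from the page allocator only when
 * the tt asked for it via the ZERO_ALLOC flag.
 */
static unsigned int toy_alloc_gfp(unsigned int tt_flags)
{
	return (tt_flags & TOY_TT_FLAG_ZERO_ALLOC) ? TOY_GFP_ZERO : 0;
}

/*
 * Driver side: drop ZERO_ALLOC when the GPU will do the clearing
 * instead, so neither the pool nor the allocator touches the pages.
 */
static unsigned int toy_tt_flags(unsigned int tt_flags,
				 bool gpu_page_clear)
{
	if (gpu_page_clear)
		tt_flags &= ~TOY_TT_FLAG_ZERO_ALLOC;
	return tt_flags;
}
```

As noted, CONFIG_INIT_ON_ALLOC_DEFAULT_ON can still force zeroing at
the allocator level regardless of the gfp mask the pool passes down.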


Now I know why we don't have a pool for WB. It also seems the pools are
x86 exclusive. I just found that there was a proposal in core MM[1] to
avoid alloc_on_init, but a quick search didn't turn up any follow-up.

[1] 
https://patchwork.kernel.org/project/linux-mm/patch/20230831105252.1385911-1-zhaoyang.huang@unisoc.com/#25494667


Thanks a lot,

Nirmoy

>
> Regards,
> Christian.
>
>>
>>
>> Thanks,
>>
>> Nirmoy
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Cc: Christian Koenig <christian.koenig@amd.com>
>>>> Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
>>>> Cc: Matthew Auld <matthew.auld@intel.com>
>>>> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
>>>> ---
>>>>   drivers/gpu/drm/ttm/ttm_device.c | 42 +++++++++++++++++++++++----
>>>>   drivers/gpu/drm/ttm/ttm_pool.c   | 49 
>>>> +++++++++++++++++++++++++-------
>>>>   include/drm/ttm/ttm_device.h     |  8 ++++++
>>>>   include/drm/ttm/ttm_pool.h       | 11 +++++++
>>>>   4 files changed, 94 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_device.c 
>>>> b/drivers/gpu/drm/ttm/ttm_device.c
>>>> index 434cf0258000..54a3ea825c2e 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_device.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_device.c
>>>> @@ -191,15 +191,19 @@ EXPORT_SYMBOL(ttm_device_swapout);
>>>>    * @vma_manager: A pointer to a vma manager.
>>>>    * @use_dma_alloc: If coherent DMA allocation API should be used.
>>>>    * @use_dma32: If we should use GFP_DMA32 for device memory 
>>>> allocations.
>>>> + * @pool_flags: Desired pool flags
>>>>    *
>>>>    * Initializes a struct ttm_device:
>>>>    * Returns:
>>>>    * !0: Failure.
>>>>    */
>>>> -int ttm_device_init(struct ttm_device *bdev, const struct 
>>>> ttm_device_funcs *funcs,
>>>> -            struct device *dev, struct address_space *mapping,
>>>> -            struct drm_vma_offset_manager *vma_manager,
>>>> -            bool use_dma_alloc, bool use_dma32)
>>>> +int ttm_device_init_with_pool_flags(struct ttm_device *bdev,
>>>> +                    const struct ttm_device_funcs *funcs,
>>>> +                    struct device *dev,
>>>> +                    struct address_space *mapping,
>>>> +                    struct drm_vma_offset_manager *vma_manager,
>>>> +                    bool use_dma_alloc, bool use_dma32,
>>>> +                    unsigned int pool_flags)
>>>>   {
>>>>       struct ttm_global *glob = &ttm_glob;
>>>>       int ret, nid;
>>>> @@ -227,7 +231,8 @@ int ttm_device_init(struct ttm_device *bdev, 
>>>> const struct ttm_device_funcs *func
>>>>       else
>>>>           nid = NUMA_NO_NODE;
>>>>   -    ttm_pool_init(&bdev->pool, dev, nid, use_dma_alloc, use_dma32);
>>>> +    ttm_pool_init_with_flags(&bdev->pool, dev, nid, use_dma_alloc,
>>>> +                 use_dma32, pool_flags);
>>>>         bdev->vma_manager = vma_manager;
>>>>       spin_lock_init(&bdev->lru_lock);
>>>> @@ -239,6 +244,33 @@ int ttm_device_init(struct ttm_device *bdev, 
>>>> const struct ttm_device_funcs *func
>>>>         return 0;
>>>>   }
>>>> +EXPORT_SYMBOL(ttm_device_init_with_pool_flags);
>>>> +
>>>> +
>>>> +/**
>>>> + * ttm_device_init
>>>> + *
>>>> + * @bdev: A pointer to a struct ttm_device to initialize.
>>>> + * @funcs: Function table for the device.
>>>> + * @dev: The core kernel device pointer for DMA mappings and 
>>>> allocations.
>>>> + * @mapping: The address space to use for this bo.
>>>> + * @vma_manager: A pointer to a vma manager.
>>>> + * @use_dma_alloc: If coherent DMA allocation API should be used.
>>>> + * @use_dma32: If we should use GFP_DMA32 for device memory 
>>>> allocations.
>>>> + *
>>>> + * Initializes a struct ttm_device:
>>>> + * Returns:
>>>> + * !0: Failure.
>>>> + */
>>>> +int ttm_device_init(struct ttm_device *bdev, const struct 
>>>> ttm_device_funcs *funcs,
>>>> +            struct device *dev, struct address_space *mapping,
>>>> +            struct drm_vma_offset_manager *vma_manager,
>>>> +            bool use_dma_alloc, bool use_dma32)
>>>> +{
>>>> +    return ttm_device_init_with_pool_flags(bdev, funcs, dev, mapping,
>>>> +                           vma_manager, use_dma_alloc,
>>>> +                           use_dma32, 0);
>>>> +}
>>>>   EXPORT_SYMBOL(ttm_device_init);
>>>>     void ttm_device_fini(struct ttm_device *bdev)
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c 
>>>> b/drivers/gpu/drm/ttm/ttm_pool.c
>>>> index 6e1fd6985ffc..6f33c3e7cdf2 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>>>> @@ -222,15 +222,17 @@ static void ttm_pool_unmap(struct ttm_pool 
>>>> *pool, dma_addr_t dma_addr,
>>>>   }
>>>>     /* Give pages into a specific pool_type */
>>>> -static void ttm_pool_type_give(struct ttm_pool_type *pt, struct 
>>>> page *p)
>>>> +static void ttm_pool_type_give(struct ttm_pool_type *pt, struct 
>>>> page *p, bool skip_clear)
>>>>   {
>>>>       unsigned int i, num_pages = 1 << pt->order;
>>>>   -    for (i = 0; i < num_pages; ++i) {
>>>> -        if (PageHighMem(p))
>>>> -            clear_highpage(p + i);
>>>> -        else
>>>> -            clear_page(page_address(p + i));
>>>> +    if (!skip_clear) {
>>>> +        for (i = 0; i < num_pages; ++i) {
>>>> +            if (PageHighMem(p))
>>>> +                clear_highpage(p + i);
>>>> +            else
>>>> +                clear_page(page_address(p + i));
>>>> +        }
>>>>       }
>>>>         spin_lock(&pt->lock);
>>>> @@ -396,7 +398,10 @@ static void ttm_pool_free_range(struct 
>>>> ttm_pool *pool, struct ttm_tt *tt,
>>>>       struct page **pages = &tt->pages[start_page];
>>>>       unsigned int order;
>>>>       pgoff_t i, nr;
>>>> +    bool skip_clear = false;
>>>>   +    if (pool->flags & TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE)
>>>> +        skip_clear = true;
>>>>       for (i = start_page; i < end_page; i += nr, pages += nr) {
>>>>           struct ttm_pool_type *pt = NULL;
>>>>   @@ -407,7 +412,7 @@ static void ttm_pool_free_range(struct 
>>>> ttm_pool *pool, struct ttm_tt *tt,
>>>>             pt = ttm_pool_select_type(pool, caching, order);
>>>>           if (pt)
>>>> -            ttm_pool_type_give(pt, *pages);
>>>> +            ttm_pool_type_give(pt, *pages, skip_clear);
>>>>           else
>>>>               ttm_pool_free_page(pool, caching, order, *pages);
>>>>       }
>>>> @@ -550,18 +555,21 @@ void ttm_pool_free(struct ttm_pool *pool, 
>>>> struct ttm_tt *tt)
>>>>   EXPORT_SYMBOL(ttm_pool_free);
>>>>     /**
>>>> - * ttm_pool_init - Initialize a pool
>>>> + * ttm_pool_init_with_flags - Initialize a pool with flags
>>>>    *
>>>>    * @pool: the pool to initialize
>>>>    * @dev: device for DMA allocations and mappings
>>>>    * @nid: NUMA node to use for allocations
>>>>    * @use_dma_alloc: true if coherent DMA alloc should be used
>>>>    * @use_dma32: true if GFP_DMA32 should be used
>>>> + * @flags: control flags for the pool
>>>> + *
>>>> + * Initialize the pool and its pool types with flags to modify 
>>>> defaults
>>>>    *
>>>> - * Initialize the pool and its pool types.
>>>>    */
>>>> -void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>>> -           int nid, bool use_dma_alloc, bool use_dma32)
>>>> +void ttm_pool_init_with_flags(struct ttm_pool *pool, struct device 
>>>> *dev,
>>>> +           int nid, bool use_dma_alloc, bool use_dma32,
>>>> +           unsigned int flags)
>>>>   {
>>>>       unsigned int i, j;
>>>>   @@ -571,6 +579,7 @@ void ttm_pool_init(struct ttm_pool *pool, 
>>>> struct device *dev,
>>>>       pool->nid = nid;
>>>>       pool->use_dma_alloc = use_dma_alloc;
>>>>       pool->use_dma32 = use_dma32;
>>>> +    pool->flags = flags;
>>>>         for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
>>>>           for (j = 0; j < NR_PAGE_ORDERS; ++j) {
>>>> @@ -585,6 +594,24 @@ void ttm_pool_init(struct ttm_pool *pool, 
>>>> struct device *dev,
>>>>           }
>>>>       }
>>>>   }
>>>> +EXPORT_SYMBOL(ttm_pool_init_with_flags);
>>>> +
>>>> +/**
>>>> + * ttm_pool_init - Initialize a pool
>>>> + *
>>>> + * @pool: the pool to initialize
>>>> + * @dev: device for DMA allocations and mappings
>>>> + * @nid: NUMA node to use for allocations
>>>> + * @use_dma_alloc: true if coherent DMA alloc should be used
>>>> + * @use_dma32: true if GFP_DMA32 should be used
>>>> + *
>>>> + * Initialize the pool and its pool types.
>>>> + */
>>>> +void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>>> +           int nid, bool use_dma_alloc, bool use_dma32)
>>>> +{
>>>> +    ttm_pool_init_with_flags(pool, dev, nid, use_dma_alloc, 
>>>> use_dma32, 0);
>>>> +}
>>>>   EXPORT_SYMBOL(ttm_pool_init);
>>>>     /**
>>>> diff --git a/include/drm/ttm/ttm_device.h 
>>>> b/include/drm/ttm/ttm_device.h
>>>> index c22f30535c84..1b20c5798e97 100644
>>>> --- a/include/drm/ttm/ttm_device.h
>>>> +++ b/include/drm/ttm/ttm_device.h
>>>> @@ -291,6 +291,14 @@ int ttm_device_init(struct ttm_device *bdev, 
>>>> const struct ttm_device_funcs *func
>>>>               struct device *dev, struct address_space *mapping,
>>>>               struct drm_vma_offset_manager *vma_manager,
>>>>               bool use_dma_alloc, bool use_dma32);
>>>> +int ttm_device_init_with_pool_flags(struct ttm_device *bdev,
>>>> +                    const struct ttm_device_funcs *funcs,
>>>> +                    struct device *dev,
>>>> +                    struct address_space *mapping,
>>>> +                    struct drm_vma_offset_manager *vma_manager,
>>>> +                    bool use_dma_alloc, bool use_dma32,
>>>> +                    unsigned int pool_flags);
>>>> +
>>>>   void ttm_device_fini(struct ttm_device *bdev);
>>>>   void ttm_device_clear_dma_mappings(struct ttm_device *bdev);
>>>>   diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
>>>> index 160d954a261e..9822996309e5 100644
>>>> --- a/include/drm/ttm/ttm_pool.h
>>>> +++ b/include/drm/ttm/ttm_pool.h
>>>> @@ -66,10 +66,17 @@ struct ttm_pool_type {
>>>>    * @use_dma_alloc: if coherent DMA allocations should be used
>>>>    * @use_dma32: if GFP_DMA32 should be used
>>>>    * @caching: pools for each caching/order
>>>> + * @flags: flags to control certain pool behaviour
>>>> + *
>>>> + * The @flags can be:
>>>> + *  - %TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE - This flag can be used to
>>>> + *    skip clear on free when driver decides to do that on it's own.
>>>>    */
>>>>   struct ttm_pool {
>>>>       struct device *dev;
>>>>       int nid;
>>>> +#define TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE    1 << 0
>>>> +    unsigned int flags;
>>>>         bool use_dma_alloc;
>>>>       bool use_dma32;
>>>> @@ -85,6 +92,10 @@ void ttm_pool_free(struct ttm_pool *pool, struct 
>>>> ttm_tt *tt);
>>>>     void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>>>              int nid, bool use_dma_alloc, bool use_dma32);
>>>> +void ttm_pool_init_with_flags(struct ttm_pool *pool, struct device 
>>>> *dev,
>>>> +                  int nid, bool use_dma_alloc, bool use_dma32,
>>>> +                  unsigned int flags);
>>>> +
>>>>   void ttm_pool_fini(struct ttm_pool *pool);
>>>>     int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);
>>>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/xe/lnl: Offload system clear page activity to GPU
  2024-06-20 13:46 ` [PATCH] drm/xe/lnl: Offload system clear page activity to GPU Nirmoy Das
@ 2024-06-21 12:08   ` Ghimiray, Himal Prasad
  2024-06-21 12:12     ` Nirmoy Das
  0 siblings, 1 reply; 9+ messages in thread
From: Ghimiray, Himal Prasad @ 2024-06-21 12:08 UTC (permalink / raw)
  To: Nirmoy Das, dri-devel
  Cc: intel-xe, Christian Koenig, Thomas Hellström, Matthew Auld



On 20-06-2024 19:16, Nirmoy Das wrote:
> On LNL, because of flat CCS, the driver will create a migrate job to
> clear CCS metadata. Extend that to also clear pages using the GPU with
> a new ttm pool flag which allows offloading page clear activity to the
> GPU.
> 
> This gives a very nice improvement for large buffers:
> Without the patch:
>   ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-benchmark
> IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.9.0-xe+ x86_64)
> Using IGT_SRANDOM=1718889799 for randomisation
> Opened device: /dev/dri/card0
> Starting subtest: basic-store-benchmark
> Starting dynamic subtest: WC
> Dynamic subtest WC: SUCCESS (0.000s)
> Time taken for size SZ_4K: 4882 us
> Time taken for size SZ_2M: 3679 us
> Time taken for size SZ_64M: 13367 us
> Time taken for size SZ_128M: 21034 us
> Time taken for size SZ_256M: 32940 us
> Time taken for size SZ_1G: 116261 us
> Starting dynamic subtest: WB
> Dynamic subtest WB: SUCCESS (0.000s)
> Time taken for size SZ_4K: 5417 us
> Time taken for size SZ_2M: 5711 us
> Time taken for size SZ_64M: 15718 us
> Time taken for size SZ_128M: 26170 us
> Time taken for size SZ_256M: 50529 us
> Time taken for size SZ_1G: 177933 us
> Subtest basic-store-benchmark: SUCCESS (0.504s)
> 
> With the patch:
> sudo  ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-benchmark
> IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.9.0-xe+ x86_64)
> Using IGT_SRANDOM=1718889593 for randomisation
> Opened device: /dev/dri/card0
> Starting subtest: basic-store-benchmark
> Starting dynamic subtest: WC
> Dynamic subtest WC: SUCCESS (0.000s)
> Time taken for size SZ_4K: 4479 us
> Time taken for size SZ_2M: 3291 us
> Time taken for size SZ_64M: 6595 us
> Time taken for size SZ_128M: 9069 us
> Time taken for size SZ_256M: 12681 us
> Time taken for size SZ_1G: 41806 us
> Starting dynamic subtest: WB
> Dynamic subtest WB: SUCCESS (0.000s)
> Time taken for size SZ_4K: 3317 us
> Time taken for size SZ_2M: 6458 us
> Time taken for size SZ_64M: 12802 us
> Time taken for size SZ_128M: 19579 us
> Time taken for size SZ_256M: 38768 us
> Time taken for size SZ_1G: 143250 us
> Subtest basic-store-benchmark: SUCCESS (0.328s)
> 
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_bo.c           |  4 ++++
>   drivers/gpu/drm/xe/xe_device.c       | 36 +++++++++++++++++++++-------
>   drivers/gpu/drm/xe/xe_device_types.h |  2 ++
>   drivers/gpu/drm/xe/xe_migrate.c      |  6 ++---
>   4 files changed, 37 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index 65c696966e96..10ec02412dc4 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -387,6 +387,10 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
>   		caching = ttm_uncached;
>   	}
>   
> +	/* Clear TTM_TT_FLAG_ZERO_ALLOC when GPU is set to clear pages */
> +	if (xe->mem.gpu_page_clear)
> +		page_flags &= ~TTM_TT_FLAG_ZERO_ALLOC;
> +
>   	err = ttm_tt_init(&tt->ttm, &bo->ttm, page_flags, caching, extra_pages);
>   	if (err) {
>   		kfree(tt);
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 75d4c8ae9234..8e8d54c59aae 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -240,8 +240,6 @@ static void xe_device_destroy(struct drm_device *dev, void *dummy)
>   
>   	if (xe->unordered_wq)
>   		destroy_workqueue(xe->unordered_wq);
> -
> -	ttm_device_fini(&xe->ttm);
>   }
>   
>   struct xe_device *xe_device_create(struct pci_dev *pdev,
> @@ -260,12 +258,6 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
>   	if (IS_ERR(xe))
>   		return xe;
>   
> -	err = ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
> -			      xe->drm.anon_inode->i_mapping,
> -			      xe->drm.vma_offset_manager, false, false);
> -	if (WARN_ON(err))
> -		goto err;
> -
>   	err = drmm_add_action_or_reset(&xe->drm, xe_device_destroy, NULL);
>   	if (err)
>   		goto err;
> @@ -543,6 +535,13 @@ static int xe_device_set_has_flat_ccs(struct  xe_device *xe)
>   	return xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
>   }
>   
> +static void xe_device_destroy_ttm_device(struct drm_device *dev, void *dummy)
> +{
> +	struct xe_device *xe = to_xe_device(dev);
> +
> +	ttm_device_fini(&xe->ttm);
> +}
> +
>   int xe_device_probe(struct xe_device *xe)
>   {
>   	struct xe_tile *tile;
> @@ -550,6 +549,7 @@ int xe_device_probe(struct xe_device *xe)
>   	int err;
>   	u8 last_gt;
>   	u8 id;
> +	unsigned int ttm_pool_flags = 0;
>   
>   	xe_pat_init_early(xe);
>   
> @@ -572,6 +572,26 @@ int xe_device_probe(struct xe_device *xe)
>   
>   	xe_ttm_sys_mgr_init(xe);
>   
> +	/* On iGFX device with flat CCS we clear CCS metadata, let's extend that
> +	 * and use GPU to clear  pages as well.
> +	 */
> +	if (xe_device_has_flat_ccs(xe) && !IS_DGFX(xe)) {
> +		ttm_pool_flags = TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE;
> +		xe->mem.gpu_page_clear = true;
> +	}
> +
> +	err = ttm_device_init_with_pool_flags(&xe->ttm, &xe_ttm_funcs,
> +					      xe->drm.dev,
> +					      xe->drm.anon_inode->i_mapping,
> +					      xe->drm.vma_offset_manager,
> +					      false, false, ttm_pool_flags);
> +	if (WARN_ON(err))
> +		return err;
> +
> +	err = drmm_add_action_or_reset(&xe->drm, xe_device_destroy_ttm_device, NULL);
> +	if (err)
> +		return err;
> +
>   	for_each_gt(gt, xe, id) {
>   		err = xe_gt_init_early(gt);
>   		if (err)
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index c37be471d11c..ece68c6f3668 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -325,6 +325,8 @@ struct xe_device {
>   		struct xe_mem_region vram;
>   		/** @mem.sys_mgr: system TTM manager */
>   		struct ttm_resource_manager sys_mgr;
> +		/** @gpu_page_clear: clear pages offloaded to GPU */
> +		bool gpu_page_clear;
>   	} mem;
>   
>   	/** @sriov: device level virtualization data */
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index 05f933787860..0023f32d147d 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -1003,6 +1003,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
>   	struct xe_gt *gt = m->tile->primary_gt;
>   	struct xe_device *xe = gt_to_xe(gt);
>   	bool clear_system_ccs = (xe_bo_needs_ccs_pages(bo) && !IS_DGFX(xe)) ? true : false;
> +	bool clear_on_create = xe->mem.gpu_page_clear;
>   	struct dma_fence *fence = NULL;
>   	u64 size = bo->size;
>   	struct xe_res_cursor src_it;
> @@ -1022,7 +1023,6 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
>   		struct xe_sched_job *job;
>   		struct xe_bb *bb;
>   		u32 batch_size, update_idx;
> -
>   		bool usm = xe->info.has_usm;
>   		u32 avail_pts = max_mem_transfer_per_pass(xe) / LEVEL0_PAGE_TABLE_ENCODE_SIZE;
>   
> @@ -1032,7 +1032,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
>   		batch_size = 2 +
>   			pte_update_size(m, clear_vram, src, &src_it,
>   					&clear_L0, &clear_L0_ofs, &clear_L0_pt,
> -					clear_system_ccs ? 0 : emit_clear_cmd_len(gt), 0,
> +					!clear_on_create ? 0 : emit_clear_cmd_len(gt), 0,
>   					avail_pts);
>   
>   		if (xe_device_has_flat_ccs(xe))
> @@ -1060,7 +1060,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
>   		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
>   		update_idx = bb->len;
>   
> -		if (!clear_system_ccs)
> +		if (clear_on_create)

This will break on dGFX, where clear_on_create is never set but the
VRAM clear is still needed.

>   			emit_clear(gt, bb, clear_L0_ofs, clear_L0, XE_PAGE_SIZE, clear_vram);
>   
>   		if (xe_device_has_flat_ccs(xe)) {

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/xe/lnl: Offload system clear page activity to GPU
  2024-06-21 12:08   ` Ghimiray, Himal Prasad
@ 2024-06-21 12:12     ` Nirmoy Das
  0 siblings, 0 replies; 9+ messages in thread
From: Nirmoy Das @ 2024-06-21 12:12 UTC (permalink / raw)
  To: Ghimiray, Himal Prasad, dri-devel
  Cc: intel-xe, Christian Koenig, Thomas Hellström, Matthew Auld


On 6/21/2024 2:08 PM, Ghimiray, Himal Prasad wrote:
>
>
> On 20-06-2024 19:16, Nirmoy Das wrote:
>> On LNL, because of flat CCS, the driver will create a migrate job to
>> clear CCS metadata. Extend that to also clear pages using the GPU
>> with a new ttm pool flag which allows offloading page clear activity
>> to the GPU.
>>
>> This gives a very nice improvement for large buffers:
>> Without the patch:
>>   ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-benchmark
>> IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.9.0-xe+ x86_64)
>> Using IGT_SRANDOM=1718889799 for randomisation
>> Opened device: /dev/dri/card0
>> Starting subtest: basic-store-benchmark
>> Starting dynamic subtest: WC
>> Dynamic subtest WC: SUCCESS (0.000s)
>> Time taken for size SZ_4K: 4882 us
>> Time taken for size SZ_2M: 3679 us
>> Time taken for size SZ_64M: 13367 us
>> Time taken for size SZ_128M: 21034 us
>> Time taken for size SZ_256M: 32940 us
>> Time taken for size SZ_1G: 116261 us
>> Starting dynamic subtest: WB
>> Dynamic subtest WB: SUCCESS (0.000s)
>> Time taken for size SZ_4K: 5417 us
>> Time taken for size SZ_2M: 5711 us
>> Time taken for size SZ_64M: 15718 us
>> Time taken for size SZ_128M: 26170 us
>> Time taken for size SZ_256M: 50529 us
>> Time taken for size SZ_1G: 177933 us
>> Subtest basic-store-benchmark: SUCCESS (0.504s)
>>
>> With the patch:
>> sudo  ~/igt-gpu-tools/build/tests/xe_exec_store --run 
>> basic-store-benchmark
>> IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.9.0-xe+ x86_64)
>> Using IGT_SRANDOM=1718889593 for randomisation
>> Opened device: /dev/dri/card0
>> Starting subtest: basic-store-benchmark
>> Starting dynamic subtest: WC
>> Dynamic subtest WC: SUCCESS (0.000s)
>> Time taken for size SZ_4K: 4479 us
>> Time taken for size SZ_2M: 3291 us
>> Time taken for size SZ_64M: 6595 us
>> Time taken for size SZ_128M: 9069 us
>> Time taken for size SZ_256M: 12681 us
>> Time taken for size SZ_1G: 41806 us
>> Starting dynamic subtest: WB
>> Dynamic subtest WB: SUCCESS (0.000s)
>> Time taken for size SZ_4K: 3317 us
>> Time taken for size SZ_2M: 6458 us
>> Time taken for size SZ_64M: 12802 us
>> Time taken for size SZ_128M: 19579 us
>> Time taken for size SZ_256M: 38768 us
>> Time taken for size SZ_1G: 143250 us
>> Subtest basic-store-benchmark: SUCCESS (0.328s)
>>
>> Cc: Christian Koenig <christian.koenig@amd.com>
>> Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_bo.c           |  4 ++++
>>   drivers/gpu/drm/xe/xe_device.c       | 36 +++++++++++++++++++++-------
>>   drivers/gpu/drm/xe/xe_device_types.h |  2 ++
>>   drivers/gpu/drm/xe/xe_migrate.c      |  6 ++---
>>   4 files changed, 37 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>> index 65c696966e96..10ec02412dc4 100644
>> --- a/drivers/gpu/drm/xe/xe_bo.c
>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>> @@ -387,6 +387,10 @@ static struct ttm_tt *xe_ttm_tt_create(struct 
>> ttm_buffer_object *ttm_bo,
>>           caching = ttm_uncached;
>>       }
>>   +    /* Clear TTM_TT_FLAG_ZERO_ALLOC when GPU is set to clear pages */
>> +    if (xe->mem.gpu_page_clear)
>> +        page_flags &= ~TTM_TT_FLAG_ZERO_ALLOC;
>> +
>>       err = ttm_tt_init(&tt->ttm, &bo->ttm, page_flags, caching, 
>> extra_pages);
>>       if (err) {
>>           kfree(tt);
>> diff --git a/drivers/gpu/drm/xe/xe_device.c 
>> b/drivers/gpu/drm/xe/xe_device.c
>> index 75d4c8ae9234..8e8d54c59aae 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -240,8 +240,6 @@ static void xe_device_destroy(struct drm_device 
>> *dev, void *dummy)
>>         if (xe->unordered_wq)
>>           destroy_workqueue(xe->unordered_wq);
>> -
>> -    ttm_device_fini(&xe->ttm);
>>   }
>>     struct xe_device *xe_device_create(struct pci_dev *pdev,
>> @@ -260,12 +258,6 @@ struct xe_device *xe_device_create(struct 
>> pci_dev *pdev,
>>       if (IS_ERR(xe))
>>           return xe;
>>   -    err = ttm_device_init(&xe->ttm, &xe_ttm_funcs, xe->drm.dev,
>> -                  xe->drm.anon_inode->i_mapping,
>> -                  xe->drm.vma_offset_manager, false, false);
>> -    if (WARN_ON(err))
>> -        goto err;
>> -
>>       err = drmm_add_action_or_reset(&xe->drm, xe_device_destroy, NULL);
>>       if (err)
>>           goto err;
>> @@ -543,6 +535,13 @@ static int xe_device_set_has_flat_ccs(struct  
>> xe_device *xe)
>>       return xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
>>   }
>>   +static void xe_device_destroy_ttm_device(struct drm_device *dev, 
>> void *dummy)
>> +{
>> +    struct xe_device *xe = to_xe_device(dev);
>> +
>> +    ttm_device_fini(&xe->ttm);
>> +}
>> +
>>   int xe_device_probe(struct xe_device *xe)
>>   {
>>       struct xe_tile *tile;
>> @@ -550,6 +549,7 @@ int xe_device_probe(struct xe_device *xe)
>>       int err;
>>       u8 last_gt;
>>       u8 id;
>> +    unsigned int ttm_pool_flags = 0;
>>         xe_pat_init_early(xe);
>>   @@ -572,6 +572,26 @@ int xe_device_probe(struct xe_device *xe)
>>         xe_ttm_sys_mgr_init(xe);
>>   +    /* On iGFX device with flat CCS we clear CCS metadata, let's 
>> extend that
>> +     * and use GPU to clear  pages as well.
>> +     */
>> +    if (xe_device_has_flat_ccs(xe) && !IS_DGFX(xe)) {
>> +        ttm_pool_flags = TTM_POOL_FLAG_SKIP_CLEAR_ON_FREE;
>> +        xe->mem.gpu_page_clear = true;
>> +    }
>> +
>> +    err = ttm_device_init_with_pool_flags(&xe->ttm, &xe_ttm_funcs,
>> +                          xe->drm.dev,
>> +                          xe->drm.anon_inode->i_mapping,
>> +                          xe->drm.vma_offset_manager,
>> +                          false, false, ttm_pool_flags);
>> +    if (WARN_ON(err))
>> +        return err;
>> +
>> +    err = drmm_add_action_or_reset(&xe->drm, 
>> xe_device_destroy_ttm_device, NULL);
>> +    if (err)
>> +        return err;
>> +
>>       for_each_gt(gt, xe, id) {
>>           err = xe_gt_init_early(gt);
>>           if (err)
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h 
>> b/drivers/gpu/drm/xe/xe_device_types.h
>> index c37be471d11c..ece68c6f3668 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -325,6 +325,8 @@ struct xe_device {
>>           struct xe_mem_region vram;
>>           /** @mem.sys_mgr: system TTM manager */
>>           struct ttm_resource_manager sys_mgr;
>> +        /** @gpu_page_clear: clear pages offloaded to GPU */
>> +        bool gpu_page_clear;
>>       } mem;
>>         /** @sriov: device level virtualization data */
>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c 
>> b/drivers/gpu/drm/xe/xe_migrate.c
>> index 05f933787860..0023f32d147d 100644
>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>> @@ -1003,6 +1003,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
>>       struct xe_gt *gt = m->tile->primary_gt;
>>       struct xe_device *xe = gt_to_xe(gt);
>>       bool clear_system_ccs = (xe_bo_needs_ccs_pages(bo) && !IS_DGFX(xe)) ? true : false;
>> +    bool clear_on_create = xe->mem.gpu_page_clear;
>>       struct dma_fence *fence = NULL;
>>       u64 size = bo->size;
>>       struct xe_res_cursor src_it;
>> @@ -1022,7 +1023,6 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
>>           struct xe_sched_job *job;
>>           struct xe_bb *bb;
>>           u32 batch_size, update_idx;
>> -
>>           bool usm = xe->info.has_usm;
>>           u32 avail_pts = max_mem_transfer_per_pass(xe) / LEVEL0_PAGE_TABLE_ENCODE_SIZE;
>>   @@ -1032,7 +1032,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
>>           batch_size = 2 +
>>               pte_update_size(m, clear_vram, src, &src_it,
>>                       &clear_L0, &clear_L0_ofs, &clear_L0_pt,
>> -                    clear_system_ccs ? 0 : emit_clear_cmd_len(gt), 0,
>> +                    !clear_on_create ? 0 : emit_clear_cmd_len(gt), 0,
>>                       avail_pts);
>>             if (xe_device_has_flat_ccs(xe))
>> @@ -1060,7 +1060,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
>>           bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
>>           update_idx = bb->len;
>>   -        if (!clear_system_ccs)
>> +        if (clear_on_create)
>
> will break on dgfx

True, I had a different version but I didn't update this afterwards.

Let me resend.


Regards,

Nirmoy

>
>>               emit_clear(gt, bb, clear_L0_ofs, clear_L0, XE_PAGE_SIZE, clear_vram);
>>             if (xe_device_has_flat_ccs(xe)) {


Thread overview: 9+ messages
2024-06-20 13:46 [RFC PATCH 0/2] Clear system pages with GPU Nirmoy Das
2024-06-20 13:46 ` [RFC PATCH 1/2] drm/ttm/pool: Introduce a way to skip clear on free Nirmoy Das
2024-06-20 14:08   ` Christian König
2024-06-20 14:37     ` Nirmoy Das
2024-06-20 14:45       ` Christian König
2024-06-20 15:11         ` Nirmoy Das
2024-06-20 13:46 ` [PATCH] drm/xe/lnl: Offload system clear page activity to GPU Nirmoy Das
2024-06-21 12:08   ` Ghimiray, Himal Prasad
2024-06-21 12:12     ` Nirmoy Das
