igt-dev.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs
@ 2025-11-13 16:32 nishit.sharma
  2025-11-13 16:33 ` [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers nishit.sharma
                   ` (8 more replies)
  0 siblings, 9 replies; 34+ messages in thread
From: nishit.sharma @ 2025-11-13 16:32 UTC (permalink / raw)
  To: igt-dev, thomas.hellstrom, nishit.sharma

From: Nishit Sharma <nishit.sharma@intel.com>

This patch series adds comprehensive SVM multi-GPU IGT test coverage for
madvise and prefetch functionality.

ver2:
- Test name changed in commits
- A patch is missing from v1 on patchwork because the last patch was not sent

ver3:
- Tags were added in patch 7; it was not sent on patchwork

ver4:
- A patch added a file that is not available in the source tree, which
  caused a CI build failure.

ver5:
- Added subtest function wrappers
- Subtests now execute on all enumerated GPUs

ver7:
- Optimized frequently called functions
- Incorporated review comments (Thomas Hellstrom)


Nishit Sharma (10):
  lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync
    helpers
  tests/intel/xe_exec_system_allocator: Add parameter in madvise call
  tests/intel/xe_multi_gpusvm: Add SVM multi-GPU cross-GPU memory access
    test
  tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU atomic operations
  tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU coherency test
  tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU performance test
  tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU fault handling test
  tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU simultaneous access
    test
  tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU conflicting madvise
    test
  tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU migration test

 include/drm-uapi/xe_drm.h              |    4 +-
 lib/xe/xe_ioctl.c                      |   53 +-
 lib/xe/xe_ioctl.h                      |   11 +-
 tests/intel/xe_exec_system_allocator.c |    8 +-
 tests/intel/xe_multi_gpusvm.c          | 1441 ++++++++++++++++++++++++
 tests/meson.build                      |    1 +
 6 files changed, 1504 insertions(+), 14 deletions(-)
 create mode 100644 tests/intel/xe_multi_gpusvm.c

-- 
2.48.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers
  2025-11-13 16:32 [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs nishit.sharma
@ 2025-11-13 16:33 ` nishit.sharma
  2025-11-17 12:34   ` Hellstrom, Thomas
  2025-11-13 16:33 ` [PATCH i-g-t v7 02/10] tests/intel/xe_exec_system_allocator: Add parameter in madvise call nishit.sharma
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: nishit.sharma @ 2025-11-13 16:33 UTC (permalink / raw)
  To: igt-dev, thomas.hellstrom, nishit.sharma

From: Nishit Sharma <nishit.sharma@intel.com>

Add an 'instance' parameter to xe_vm_madvise and __xe_vm_madvise to
support per-instance memory advice operations. Implement xe_vm_bind_lr_sync
and xe_vm_unbind_lr_sync helpers for synchronous VM bind/unbind using user
fences.

These changes improve memory advice and binding operations for multi-GPU
and multi-instance scenarios in IGT tests.
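
A minimal usage sketch under the new signatures (fd, peer_fd, addr, size and
va_bits are placeholders; error handling trimmed):

	uint32_t vm = xe_vm_create(fd, DRM_XE_VM_CREATE_FLAG_LR_MODE |
				   DRM_XE_VM_CREATE_FLAG_FAULT_MODE, 0);

	/* Mirror the CPU address space, waiting on the bind user fence. */
	xe_vm_bind_lr_sync(fd, vm, 0, 0, 0, 1ull << va_bits,
			   DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR);

	/* Prefer placement on the peer device, VRAM instance 0. The call
	 * now returns instead of asserting, so callers may fall back. */
	if (xe_vm_madvise(fd, vm, addr, size, 0,
			  DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
			  peer_fd, 0 /* policy */, 0 /* instance */))
		igt_info("madvise failed, leaving placement to the kernel\n");

	xe_vm_unbind_lr_sync(fd, vm, 0, 0, 1ull << va_bits);
	xe_vm_destroy(fd, vm);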

Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
---
 include/drm-uapi/xe_drm.h |  4 +--
 lib/xe/xe_ioctl.c         | 53 +++++++++++++++++++++++++++++++++++----
 lib/xe/xe_ioctl.h         | 11 +++++---
 3 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/include/drm-uapi/xe_drm.h b/include/drm-uapi/xe_drm.h
index 89ab54935..3472efa58 100644
--- a/include/drm-uapi/xe_drm.h
+++ b/include/drm-uapi/xe_drm.h
@@ -2060,8 +2060,8 @@ struct drm_xe_madvise {
 			/** @preferred_mem_loc.migration_policy: Page migration policy */
 			__u16 migration_policy;
 
-			/** @preferred_mem_loc.pad : MBZ */
-			__u16 pad;
+			/** @preferred_mem_loc.region_instance: Region instance */
+			__u16 region_instance;
 
 			/** @preferred_mem_loc.reserved : Reserved */
 			__u64 reserved;
diff --git a/lib/xe/xe_ioctl.c b/lib/xe/xe_ioctl.c
index 39c4667a1..06ce8a339 100644
--- a/lib/xe/xe_ioctl.c
+++ b/lib/xe/xe_ioctl.c
@@ -687,7 +687,8 @@ int64_t xe_wait_ufence(int fd, uint64_t *addr, uint64_t value,
 }
 
 int __xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range,
-		    uint64_t ext, uint32_t type, uint32_t op_val, uint16_t policy)
+		    uint64_t ext, uint32_t type, uint32_t op_val, uint16_t policy,
+		    uint16_t instance)
 {
 	struct drm_xe_madvise madvise = {
 		.type = type,
@@ -704,6 +705,7 @@ int __xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range,
 	case DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC:
 		madvise.preferred_mem_loc.devmem_fd = op_val;
 		madvise.preferred_mem_loc.migration_policy = policy;
+		madvise.preferred_mem_loc.region_instance = instance;
 		igt_debug("madvise.preferred_mem_loc.devmem_fd = %d\n",
 			  madvise.preferred_mem_loc.devmem_fd);
 		break;
@@ -731,14 +733,55 @@ int __xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range,
  * @type: type of attribute
  * @op_val: fd/atomic value/pat index, depending upon type of operation
  * @policy: Page migration policy
+ * @instance: vram instance
  *
  * Function initializes different members of struct drm_xe_madvise and calls
  * MADVISE IOCTL .
  *
- * Asserts in case of error returned by DRM_IOCTL_XE_MADVISE.
+ * Returns 0 on success or a negative error number on failure.
  */
-void xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range,
-		   uint64_t ext, uint32_t type, uint32_t op_val, uint16_t policy)
+int xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range,
+		   uint64_t ext, uint32_t type, uint32_t op_val, uint16_t policy,
+		   uint16_t instance)
 {
-	igt_assert_eq(__xe_vm_madvise(fd, vm, addr, range, ext, type, op_val, policy), 0);
+	return __xe_vm_madvise(fd, vm, addr, range, ext, type, op_val, policy, instance);
+}
+
+#define        BIND_SYNC_VAL   0x686868
+void xe_vm_bind_lr_sync(int fd, uint32_t vm, uint32_t bo, uint64_t offset,
+			uint64_t addr, uint64_t size, uint32_t flags)
+{
+	volatile uint64_t *sync_addr = calloc(1, sizeof(*sync_addr));
+	struct drm_xe_sync sync = {
+		.flags = DRM_XE_SYNC_FLAG_SIGNAL,
+		.type = DRM_XE_SYNC_TYPE_USER_FENCE,
+		.addr = to_user_pointer((uint64_t *)sync_addr),
+		.timeline_value = BIND_SYNC_VAL,
+	};
+
+	igt_assert(!!sync_addr);
+	xe_vm_bind_async_flags(fd, vm, 0, bo, 0, addr, size, &sync, 1, flags);
+	if (*sync_addr != BIND_SYNC_VAL)
+		xe_wait_ufence(fd, (uint64_t *)sync_addr, BIND_SYNC_VAL, 0, NSEC_PER_SEC * 10);
+	/* Only free if the wait succeeds */
+	free((void *)sync_addr);
+}
+
+void xe_vm_unbind_lr_sync(int fd, uint32_t vm, uint64_t offset,
+			  uint64_t addr, uint64_t size)
+{
+	volatile uint64_t *sync_addr = malloc(sizeof(*sync_addr));
+	struct drm_xe_sync sync = {
+		.flags = DRM_XE_SYNC_FLAG_SIGNAL,
+		.type = DRM_XE_SYNC_TYPE_USER_FENCE,
+		.addr = to_user_pointer((uint64_t *)sync_addr),
+		.timeline_value = BIND_SYNC_VAL,
+	};
+
+	igt_assert(!!sync_addr);
+	*sync_addr = 0;
+	xe_vm_unbind_async(fd, vm, 0, 0, addr, size, &sync, 1);
+	if (*sync_addr != BIND_SYNC_VAL)
+		xe_wait_ufence(fd, (uint64_t *)sync_addr, BIND_SYNC_VAL, 0, NSEC_PER_SEC * 10);
+	free((void *)sync_addr);
 }
diff --git a/lib/xe/xe_ioctl.h b/lib/xe/xe_ioctl.h
index ae8a23a54..1ae38029d 100644
--- a/lib/xe/xe_ioctl.h
+++ b/lib/xe/xe_ioctl.h
@@ -100,13 +100,18 @@ int __xe_wait_ufence(int fd, uint64_t *addr, uint64_t value,
 int64_t xe_wait_ufence(int fd, uint64_t *addr, uint64_t value,
 		       uint32_t exec_queue, int64_t timeout);
 int __xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range, uint64_t ext,
-		    uint32_t type, uint32_t op_val, uint16_t policy);
-void xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range, uint64_t ext,
-		   uint32_t type, uint32_t op_val, uint16_t policy);
+		    uint32_t type, uint32_t op_val, uint16_t policy, uint16_t instance);
+int xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range, uint64_t ext,
+		  uint32_t type, uint32_t op_val, uint16_t policy, uint16_t instance);
 int xe_vm_number_vmas_in_range(int fd, struct drm_xe_vm_query_mem_range_attr *vmas_attr);
 int xe_vm_vma_attrs(int fd, struct drm_xe_vm_query_mem_range_attr *vmas_attr,
 		    struct drm_xe_mem_range_attr *mem_attr);
 struct drm_xe_mem_range_attr
 *xe_vm_get_mem_attr_values_in_range(int fd, uint32_t vm, uint64_t start,
 				    uint64_t range, uint32_t *num_ranges);
+void xe_vm_bind_lr_sync(int fd, uint32_t vm, uint32_t bo,
+			uint64_t offset, uint64_t addr,
+			uint64_t size, uint32_t flags);
+void xe_vm_unbind_lr_sync(int fd, uint32_t vm, uint64_t offset,
+			  uint64_t addr, uint64_t size);
 #endif /* XE_IOCTL_H */
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH i-g-t v7 02/10] tests/intel/xe_exec_system_allocator: Add parameter in madvise call
  2025-11-13 16:32 [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs nishit.sharma
  2025-11-13 16:33 ` [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers nishit.sharma
@ 2025-11-13 16:33 ` nishit.sharma
  2025-11-17 12:38   ` Hellstrom, Thomas
  2025-11-13 16:33 ` [PATCH i-g-t v7 03/10] tests/intel/xe_multi_gpusvm: Add SVM multi-GPU cross-GPU memory access test nishit.sharma
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: nishit.sharma @ 2025-11-13 16:33 UTC (permalink / raw)
  To: igt-dev, thomas.hellstrom, nishit.sharma

From: Nishit Sharma <nishit.sharma@intel.com>

An instance parameter was added to the xe_vm_madvise() call. This
addition causes a compilation failure in the system_allocator test, so 0
is now passed as the instance parameter in its xe_vm_madvise() calls.

Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
---
 tests/intel/xe_exec_system_allocator.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tests/intel/xe_exec_system_allocator.c b/tests/intel/xe_exec_system_allocator.c
index b88967e58..1e7175061 100644
--- a/tests/intel/xe_exec_system_allocator.c
+++ b/tests/intel/xe_exec_system_allocator.c
@@ -1164,7 +1164,7 @@ madvise_swizzle_op_exec(int fd, uint32_t vm, struct test_exec_data *data,
 	xe_vm_madvise(fd, vm, to_user_pointer(data), bo_size, 0,
 		      DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
 		      preferred_loc,
-		      0);
+		      0, 0);
 }
 
 static void
@@ -1172,7 +1172,7 @@ xe_vm_madvixe_pat_attr(int fd, uint32_t vm, uint64_t addr, uint64_t range,
 		       int pat_index)
 {
 	xe_vm_madvise(fd, vm, addr, range, 0,
-		      DRM_XE_MEM_RANGE_ATTR_PAT, pat_index, 0);
+		      DRM_XE_MEM_RANGE_ATTR_PAT, pat_index, 0, 0);
 }
 
 static void
@@ -1181,7 +1181,7 @@ xe_vm_madvise_atomic_attr(int fd, uint32_t vm, uint64_t addr, uint64_t range,
 {
 	xe_vm_madvise(fd, vm, addr, range, 0,
 		      DRM_XE_MEM_RANGE_ATTR_ATOMIC,
-		      mem_attr, 0);
+		      mem_attr, 0, 0);
 }
 
 static void
@@ -1190,7 +1190,7 @@ xe_vm_madvise_migrate_pages(int fd, uint32_t vm, uint64_t addr, uint64_t range)
 	xe_vm_madvise(fd, vm, addr, range, 0,
 		      DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
 		      DRM_XE_PREFERRED_LOC_DEFAULT_SYSTEM,
-		      DRM_XE_MIGRATE_ALL_PAGES);
+		      DRM_XE_MIGRATE_ALL_PAGES, 0);
 }
 
 static void
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH i-g-t v7 03/10] tests/intel/xe_multi_gpusvm: Add SVM multi-GPU cross-GPU memory access test
  2025-11-13 16:32 [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs nishit.sharma
  2025-11-13 16:33 ` [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers nishit.sharma
  2025-11-13 16:33 ` [PATCH i-g-t v7 02/10] tests/intel/xe_exec_system_allocator: Add parameter in madvise call nishit.sharma
@ 2025-11-13 16:33 ` nishit.sharma
  2025-11-17 13:00   ` Hellstrom, Thomas
  2025-11-13 16:33 ` [PATCH i-g-t v7 04/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU atomic operations nishit.sharma
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: nishit.sharma @ 2025-11-13 16:33 UTC (permalink / raw)
  To: igt-dev, thomas.hellstrom, nishit.sharma

From: Nishit Sharma <nishit.sharma@intel.com>

This test allocates a buffer in SVM, writes data to it from the src GPU, and
reads/verifies the data from the dst GPU. Optionally, the CPU also reads or
modifies the buffer and both GPUs verify the results, ensuring correct
cross-GPU and CPU memory access in a multi-GPU environment.
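
The preferred-location madvise is applied through a small helper with a
fallback chain, so the test still runs on configurations without a fast
interconnect. A condensed sketch of that helper (src_fd/dst_fd are the two
device fds, dst_instance/src_instance the VRAM region instances):

	ret = xe_vm_madvise(src_fd, vm, addr, size, 0,
			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
			    dst_fd, 0, dst_instance);
	if (ret == -ENOLINK) {
		/* No fast interconnect: fall back to local VRAM ... */
		ret = xe_vm_madvise(src_fd, vm, addr, size, 0,
				    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
				    src_fd, 0, src_instance);
		if (ret)
			/* ... and finally to system memory. */
			ret = xe_vm_madvise(src_fd, vm, addr, size, 0,
					    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
					    0, 0, 0);
	}
	igt_assert_eq(ret, 0);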

Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 tests/intel/xe_multi_gpusvm.c | 373 ++++++++++++++++++++++++++++++++++
 tests/meson.build             |   1 +
 2 files changed, 374 insertions(+)
 create mode 100644 tests/intel/xe_multi_gpusvm.c

diff --git a/tests/intel/xe_multi_gpusvm.c b/tests/intel/xe_multi_gpusvm.c
new file mode 100644
index 000000000..6614ea3d1
--- /dev/null
+++ b/tests/intel/xe_multi_gpusvm.c
@@ -0,0 +1,373 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <unistd.h>
+
+#include "drmtest.h"
+#include "igt.h"
+#include "igt_multigpu.h"
+
+#include "intel_blt.h"
+#include "intel_mocs.h"
+#include "intel_reg.h"
+
+#include "xe/xe_ioctl.h"
+#include "xe/xe_query.h"
+#include "xe/xe_util.h"
+
+/**
+ * TEST: Basic multi-gpu SVM testing
+ * Category: SVM
+ * Mega feature: Compute
+ * Sub-category: Compute tests
+ * Functionality: SVM p2p access, madvise and prefetch.
+ * Test category: functionality test
+ *
+ * SUBTEST: cross-gpu-mem-access
+ * Description:
+ *      This test creates two malloced regions, places the destination
+ *      region both remotely and locally and copies to it. Reads back to
+ *      system memory and checks the result.
+ *
+ */
+
+#define MAX_XE_REGIONS	8
+#define MAX_XE_GPUS 8
+#define NUM_LOOPS 1
+#define BATCH_SIZE(_fd) ALIGN(SZ_8K, xe_get_default_alignment(_fd))
+#define BIND_SYNC_VAL 0x686868
+#define EXEC_SYNC_VAL 0x676767
+#define COPY_SIZE SZ_64M
+
+struct xe_svm_gpu_info {
+	bool supports_faults;
+	int vram_regions[MAX_XE_REGIONS];
+	unsigned int num_regions;
+	unsigned int va_bits;
+	int fd;
+};
+
+struct multigpu_ops_args {
+	bool prefetch_req;
+	bool op_mod;
+};
+
+typedef void (*gpu_pair_fn) (
+		struct xe_svm_gpu_info *src,
+		struct xe_svm_gpu_info *dst,
+		struct drm_xe_engine_class_instance *eci,
+		void *extra_args
+);
+
+static void for_each_gpu_pair(int num_gpus,
+			      struct xe_svm_gpu_info *gpus,
+			      struct drm_xe_engine_class_instance *eci,
+			      gpu_pair_fn fn,
+			      void *extra_args);
+
+static void gpu_mem_access_wrapper(struct xe_svm_gpu_info *src,
+				   struct xe_svm_gpu_info *dst,
+				   struct drm_xe_engine_class_instance *eci,
+				   void *extra_args);
+
+static void open_pagemaps(int fd, struct xe_svm_gpu_info *info);
+
+static void
+create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct drm_xe_engine_class_instance *eci,
+		    uint32_t *vm, uint32_t *exec_queue)
+{
+	*vm = xe_vm_create(gpu->fd,
+			   DRM_XE_VM_CREATE_FLAG_LR_MODE | DRM_XE_VM_CREATE_FLAG_FAULT_MODE, 0);
+	*exec_queue = xe_exec_queue_create(gpu->fd, *vm, eci, 0);
+	xe_vm_bind_lr_sync(gpu->fd, *vm, 0, 0, 0, 1ull << gpu->va_bits,
+			   DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR);
+}
+
+static void
+setup_sync(struct drm_xe_sync *sync, volatile uint64_t **sync_addr, uint64_t timeline_value)
+{
+	*sync_addr = malloc(sizeof(**sync_addr));
+	igt_assert(*sync_addr);
+	sync->flags = DRM_XE_SYNC_FLAG_SIGNAL;
+	sync->type = DRM_XE_SYNC_TYPE_USER_FENCE;
+	sync->addr = to_user_pointer((uint64_t *)*sync_addr);
+	sync->timeline_value = timeline_value;
+	**sync_addr = 0;
+}
+
+static void
+cleanup_vm_and_queue(struct xe_svm_gpu_info *gpu, uint32_t vm, uint32_t exec_queue)
+{
+	xe_vm_unbind_lr_sync(gpu->fd, vm, 0, 0, 1ull << gpu->va_bits);
+	xe_exec_queue_destroy(gpu->fd, exec_queue);
+	xe_vm_destroy(gpu->fd, vm);
+}
+
+static void xe_multigpu_madvise(int src_fd, uint32_t vm, uint64_t addr, uint64_t size,
+				uint64_t ext, uint32_t type, int dst_fd, uint16_t policy,
+				uint16_t instance, uint32_t exec_queue, int local_fd,
+				uint16_t local_vram)
+{
+	int ret;
+
+#define SYSTEM_MEMORY	0
+	if (src_fd != dst_fd) {
+		ret = xe_vm_madvise(src_fd, vm, addr, size, ext, type, dst_fd, policy, instance);
+		if (ret == -ENOLINK) {
+			igt_info("No fast interconnect between GPU0 and GPU1, falling back to local VRAM\n");
+			ret = xe_vm_madvise(src_fd, vm, addr, size, ext, type, local_fd,
+					    policy, local_vram);
+			if (ret) {
+				igt_info("Local VRAM madvise failed, falling back to system memory\n");
+				ret = xe_vm_madvise(src_fd, vm, addr, size, ext, type,
+						    SYSTEM_MEMORY, policy, SYSTEM_MEMORY);
+				igt_assert_eq(ret, 0);
+			}
+		} else {
+			igt_assert_eq(ret, 0);
+		}
+	} else {
+		ret = xe_vm_madvise(src_fd, vm, addr, size, ext, type, dst_fd, policy, instance);
+		igt_assert_eq(ret, 0);
+
+	}
+
+}
+
+static void xe_multigpu_prefetch(int src_fd, uint32_t vm, uint64_t addr, uint64_t size,
+				 struct drm_xe_sync *sync, volatile uint64_t *sync_addr,
+				 uint32_t exec_queue, bool prefetch_req)
+{
+	if (prefetch_req) {
+		xe_vm_prefetch_async(src_fd, vm, 0, 0, addr, size, sync, 1,
+				     DRM_XE_CONSULT_MEM_ADVISE_PREF_LOC);
+		if (*sync_addr != sync->timeline_value)
+			xe_wait_ufence(src_fd, (uint64_t *)sync_addr, sync->timeline_value,
+				       exec_queue, NSEC_PER_SEC * 10);
+	}
+	free((void *)sync_addr);
+}
+
+static void for_each_gpu_pair(int num_gpus, struct xe_svm_gpu_info *gpus,
+			      struct drm_xe_engine_class_instance *eci,
+			      gpu_pair_fn fn, void *extra_args)
+{
+	for (int src = 0; src < num_gpus; src++) {
+		if (!gpus[src].supports_faults)
+			continue;
+
+		for (int dst = 0; dst < num_gpus; dst++) {
+			if (src == dst)
+				continue;
+			fn(&gpus[src], &gpus[dst], eci, extra_args);
+		}
+	}
+}
+
+static void batch_init(int fd, uint32_t vm, uint64_t src_addr,
+		       uint64_t dst_addr, uint64_t copy_size,
+		       uint32_t *bo, uint64_t *addr)
+{
+	uint32_t width = copy_size / 256;
+	uint32_t height = 1;
+	uint32_t batch_bo_size = BATCH_SIZE(fd);
+	uint32_t batch_bo;
+	uint64_t batch_addr;
+	void *batch;
+	uint32_t *cmd;
+	uint32_t mocs_index = intel_get_uc_mocs_index(fd);
+	int i = 0;
+
+	batch_bo = xe_bo_create(fd, vm, batch_bo_size, vram_if_possible(fd, 0), 0);
+	batch = xe_bo_map(fd, batch_bo, batch_bo_size);
+	cmd = (uint32_t *) batch;
+	cmd[i++] = MEM_COPY_CMD | (1 << 19);
+	cmd[i++] = width - 1;
+	cmd[i++] = height - 1;
+	cmd[i++] = width - 1;
+	cmd[i++] = width - 1;
+	cmd[i++] = src_addr & ((1UL << 32) - 1);
+	cmd[i++] = src_addr >> 32;
+	cmd[i++] = dst_addr & ((1UL << 32) - 1);
+	cmd[i++] = dst_addr >> 32;
+	cmd[i++] = mocs_index << XE2_MEM_COPY_MOCS_SHIFT | mocs_index;
+	cmd[i++] = MI_BATCH_BUFFER_END;
+	cmd[i++] = MI_BATCH_BUFFER_END;
+
+	batch_addr = to_user_pointer(batch);
+	/* Punch a gap in the SVM map where we map the batch_bo */
+	xe_vm_bind_lr_sync(fd, vm, batch_bo, 0, batch_addr, batch_bo_size, 0);
+	*bo = batch_bo;
+	*addr = batch_addr;
+}
+
+static void batch_fini(int fd, uint32_t vm, uint32_t bo, uint64_t addr)
+{
+        /* Unmap the batch bo by re-instating the SVM binding. */
+        xe_vm_bind_lr_sync(fd, vm, 0, 0, addr, BATCH_SIZE(fd),
+                           DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR);
+        gem_close(fd, bo);
+}
+
+
+static void open_pagemaps(int fd, struct xe_svm_gpu_info *info)
+{
+	unsigned int count = 0;
+	uint64_t regions = all_memory_regions(fd);
+	uint32_t region;
+
+	xe_for_each_mem_region(fd, regions, region) {
+		if (XE_IS_VRAM_MEMORY_REGION(fd, region)) {
+			struct drm_xe_mem_region *mem_region =
+				xe_mem_region(fd, 1ull << (region - 1));
+			igt_assert(count < MAX_XE_REGIONS);
+			info->vram_regions[count++] = mem_region->instance;
+		}
+	}
+
+	info->num_regions = count;
+}
+
+static int get_device_info(struct xe_svm_gpu_info gpus[], int num_gpus)
+{
+	int cnt;
+	int xe;
+	int i;
+
+	for (i = 0, cnt = 0; i < 128 && cnt < num_gpus; i++) {
+		xe = __drm_open_driver_another(i, DRIVER_XE);
+		if (xe < 0)
+			break;
+
+		gpus[cnt].fd = xe;
+		cnt++;
+	}
+
+	return cnt;
+}
+
+static void
+copy_src_dst(struct xe_svm_gpu_info *gpu0,
+	     struct xe_svm_gpu_info *gpu1,
+	     struct drm_xe_engine_class_instance *eci,
+	     bool prefetch_req)
+{
+	uint32_t vm[1];
+	uint32_t exec_queue[2];
+	uint32_t batch_bo;
+	void *copy_src, *copy_dst;
+	uint64_t batch_addr;
+	struct drm_xe_sync sync = {};
+	volatile uint64_t *sync_addr;
+	int local_fd = gpu0->fd;
+	uint16_t local_vram = gpu0->vram_regions[0];
+
+	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
+
+	/* Allocate source and destination buffers */
+	copy_src = aligned_alloc(xe_get_default_alignment(gpu0->fd), SZ_64M);
+	igt_assert(copy_src);
+	copy_dst = aligned_alloc(xe_get_default_alignment(gpu1->fd), SZ_64M);
+	igt_assert(copy_dst);
+
+	/*
+	 * Initialize, map and bind the batch bo. Note that Xe doesn't seem to enjoy
+	 * batch buffer memory accessed over PCIe p2p.
+	 */
+	batch_init(gpu0->fd, vm[0], to_user_pointer(copy_src), to_user_pointer(copy_dst),
+		   COPY_SIZE, &batch_bo, &batch_addr);
+
+	/* Fill the source with a pattern, clear the destination. */
+	memset(copy_src, 0x67, COPY_SIZE);
+	memset(copy_dst, 0x0, COPY_SIZE);
+
+	xe_multigpu_madvise(gpu0->fd, vm[0], to_user_pointer(copy_dst), COPY_SIZE,
+			     0, DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
+			     gpu1->fd, 0, gpu1->vram_regions[0], exec_queue[0],
+			     local_fd, local_vram);
+
+	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
+	xe_multigpu_prefetch(gpu0->fd, vm[0], to_user_pointer(copy_dst), COPY_SIZE, &sync,
+			     sync_addr, exec_queue[0], prefetch_req);
+
+	sync_addr = (void *)((char *)batch_addr + SZ_4K);
+	sync.addr = to_user_pointer((uint64_t *)sync_addr);
+	sync.timeline_value = EXEC_SYNC_VAL;
+	*sync_addr = 0;
+
+	/* Execute a GPU copy. */
+	xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
+	if (*sync_addr != EXEC_SYNC_VAL)
+		xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr, EXEC_SYNC_VAL, exec_queue[0],
+			       NSEC_PER_SEC * 10);
+
+	igt_assert(memcmp(copy_src, copy_dst, COPY_SIZE) == 0);
+
+	free(copy_dst);
+	free(copy_src);
+	munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
+	batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
+	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
+}
+
+static void
+gpu_mem_access_wrapper(struct xe_svm_gpu_info *src,
+		       struct xe_svm_gpu_info *dst,
+		       struct drm_xe_engine_class_instance *eci,
+		       void *extra_args)
+{
+	struct multigpu_ops_args *args = (struct multigpu_ops_args *)extra_args;
+	igt_assert(src);
+	igt_assert(dst);
+
+	copy_src_dst(src, dst, eci, args->prefetch_req);
+}
+
+igt_main
+{
+	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
+	struct xe_device *xe;
+	int gpu, gpu_cnt;
+
+	struct drm_xe_engine_class_instance eci = {
+                .engine_class = DRM_XE_ENGINE_CLASS_COPY,
+        };
+
+	igt_fixture {
+		gpu_cnt = get_device_info(gpus, ARRAY_SIZE(gpus));
+		igt_skip_on(gpu_cnt < 2);
+
+		for (gpu = 0; gpu < gpu_cnt; ++gpu) {
+			igt_assert(gpu < MAX_XE_GPUS);
+
+			open_pagemaps(gpus[gpu].fd, &gpus[gpu]);
+			/* NOTE! inverted return value. */
+			gpus[gpu].supports_faults = !xe_supports_faults(gpus[gpu].fd);
+			fprintf(stderr, "GPU %u has %u VRAM region%s, and %s SVM VMs.\n",
+				gpu, gpus[gpu].num_regions,
+				gpus[gpu].num_regions != 1 ? "s" : "",
+				gpus[gpu].supports_faults ? "supports" : "doesn't support");
+
+			xe = xe_device_get(gpus[gpu].fd);
+			gpus[gpu].va_bits = xe->va_bits;
+		}
+	}
+
+	igt_describe("gpu-gpu write-read");
+	igt_subtest("cross-gpu-mem-access") {
+		struct multigpu_ops_args op_args;
+		op_args.prefetch_req = 1;
+		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_mem_access_wrapper, &op_args);
+		op_args.prefetch_req = 0;
+		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_mem_access_wrapper, &op_args);
+	}
+
+	igt_fixture {
+		int cnt;
+
+		for (cnt = 0; cnt < gpu_cnt; cnt++)
+			drm_close_driver(gpus[cnt].fd);
+	}
+}
diff --git a/tests/meson.build b/tests/meson.build
index 9736f2338..1209f84a4 100644
--- a/tests/meson.build
+++ b/tests/meson.build
@@ -313,6 +313,7 @@ intel_xe_progs = [
 	'xe_media_fill',
 	'xe_mmap',
 	'xe_module_load',
+	'xe_multi_gpusvm',
 	'xe_noexec_ping_pong',
 	'xe_oa',
 	'xe_pat',
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH i-g-t v7 04/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU atomic operations
  2025-11-13 16:32 [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs nishit.sharma
                   ` (2 preceding siblings ...)
  2025-11-13 16:33 ` [PATCH i-g-t v7 03/10] tests/intel/xe_multi_gpusvm: Add SVM multi-GPU cross-GPU memory access test nishit.sharma
@ 2025-11-13 16:33 ` nishit.sharma
  2025-11-17 13:10   ` Hellstrom, Thomas
  2025-11-13 16:33 ` [PATCH i-g-t v7 05/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU coherency test nishit.sharma
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: nishit.sharma @ 2025-11-13 16:33 UTC (permalink / raw)
  To: igt-dev, thomas.hellstrom, nishit.sharma

From: Nishit Sharma <nishit.sharma@intel.com>

This test performs an atomic increment operation on a shared SVM buffer
from both GPUs and the CPU in a multi-GPU environment. It uses madvise
and prefetch to control buffer placement and verifies correctness and
ordering of atomic updates across agents.
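
Condensed ordering of the checks in atomic_inc_op() below (shared_val points
at a malloced, SVM-backed uint32_t; the increment batches are built by the
atomic_batch_init() helper):

	*shared_val = ATOMIC_OP_VAL - 1;	/* CPU seeds the counter */

	/* MI_ATOMIC | MI_ATOMIC_INC batch executed on GPU0 */
	igt_assert_eq(*shared_val, ATOMIC_OP_VAL);

	/* the same batch rebuilt and executed on GPU1, same address */
	igt_assert_eq(*shared_val, ATOMIC_OP_VAL + 1);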

Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
---
 tests/intel/xe_multi_gpusvm.c | 157 +++++++++++++++++++++++++++++++++-
 1 file changed, 156 insertions(+), 1 deletion(-)

diff --git a/tests/intel/xe_multi_gpusvm.c b/tests/intel/xe_multi_gpusvm.c
index 6614ea3d1..54e036724 100644
--- a/tests/intel/xe_multi_gpusvm.c
+++ b/tests/intel/xe_multi_gpusvm.c
@@ -31,6 +31,11 @@
  *      region both remotely and locally and copies to it. Reads back to
  *      system memory and checks the result.
  *
+ * SUBTEST: atomic-inc-gpu-op
+ * Description:
+ * 	This test does atomic operation in multi-gpu by executing atomic
+ *	operation on GPU1 and then on GPU2 using the same
+ *	address
  */
 
 #define MAX_XE_REGIONS	8
@@ -40,6 +45,7 @@
 #define BIND_SYNC_VAL 0x686868
 #define EXEC_SYNC_VAL 0x676767
 #define COPY_SIZE SZ_64M
+#define	ATOMIC_OP_VAL	56
 
 struct xe_svm_gpu_info {
 	bool supports_faults;
@@ -49,6 +55,16 @@ struct xe_svm_gpu_info {
 	int fd;
 };
 
+struct test_exec_data {
+	uint32_t batch[32];
+	uint64_t pad;
+	uint64_t vm_sync;
+	uint64_t exec_sync;
+	uint32_t data;
+	uint32_t expected_data;
+	uint64_t batch_addr;
+};
+
 struct multigpu_ops_args {
 	bool prefetch_req;
 	bool op_mod;
@@ -72,7 +88,10 @@ static void gpu_mem_access_wrapper(struct xe_svm_gpu_info *src,
 				   struct drm_xe_engine_class_instance *eci,
 				   void *extra_args);
 
-static void open_pagemaps(int fd, struct xe_svm_gpu_info *info);
+static void gpu_atomic_inc_wrapper(struct xe_svm_gpu_info *src,
+				   struct xe_svm_gpu_info *dst,
+				   struct drm_xe_engine_class_instance *eci,
+				   void *extra_args);
 
 static void
 create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct drm_xe_engine_class_instance *eci,
@@ -166,6 +185,35 @@ static void for_each_gpu_pair(int num_gpus, struct xe_svm_gpu_info *gpus,
 	}
 }
 
+static void open_pagemaps(int fd, struct xe_svm_gpu_info *info);
+
+static void
+atomic_batch_init(int fd, uint32_t vm, uint64_t src_addr,
+		  uint32_t *bo, uint64_t *addr)
+{
+	uint32_t batch_bo_size = BATCH_SIZE(fd);
+	uint32_t batch_bo;
+	uint64_t batch_addr;
+	void *batch;
+	uint32_t *cmd;
+	int i = 0;
+
+	batch_bo = xe_bo_create(fd, vm, batch_bo_size, vram_if_possible(fd, 0), 0);
+	batch = xe_bo_map(fd, batch_bo, batch_bo_size);
+	cmd = (uint32_t *)batch;
+
+	cmd[i++] = MI_ATOMIC | MI_ATOMIC_INC;
+	cmd[i++] = src_addr;
+	cmd[i++] = src_addr >> 32;
+	cmd[i++] = MI_BATCH_BUFFER_END;
+
+	batch_addr = to_user_pointer(batch);
+	/* Punch a gap in the SVM map where we map the batch_bo */
+	xe_vm_bind_lr_sync(fd, vm, batch_bo, 0, batch_addr, batch_bo_size, 0);
+	*bo = batch_bo;
+	*addr = batch_addr;
+}
+
 static void batch_init(int fd, uint32_t vm, uint64_t src_addr,
 		       uint64_t dst_addr, uint64_t copy_size,
 		       uint32_t *bo, uint64_t *addr)
@@ -325,6 +373,105 @@ gpu_mem_access_wrapper(struct xe_svm_gpu_info *src,
 	copy_src_dst(src, dst, eci, args->prefetch_req);
 }
 
+static void
+atomic_inc_op(struct xe_svm_gpu_info *gpu0,
+	      struct xe_svm_gpu_info *gpu1,
+	      struct drm_xe_engine_class_instance *eci,
+	      bool prefetch_req)
+{
+	uint64_t addr;
+	uint32_t vm[2];
+	uint32_t exec_queue[2];
+	uint32_t batch_bo;
+	struct test_exec_data *data;
+	uint64_t batch_addr;
+	struct drm_xe_sync sync = {};
+	volatile uint64_t *sync_addr;
+	volatile uint32_t *shared_val;
+
+	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
+	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
+
+	data = aligned_alloc(SZ_2M, SZ_4K);
+	igt_assert(data);
+	data[0].vm_sync = 0;
+	addr = to_user_pointer(data);
+
+	shared_val = (volatile uint32_t *)addr;
+	*shared_val = ATOMIC_OP_VAL - 1;
+
+	atomic_batch_init(gpu0->fd, vm[0], addr, &batch_bo, &batch_addr);
+
+	/* Place destination in an optionally remote location to test */
+	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
+			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
+			    gpu0->fd, 0, gpu0->vram_regions[0], exec_queue[0],
+			    0, 0);
+
+	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
+	xe_multigpu_prefetch(gpu0->fd, vm[0], addr, SZ_4K, &sync,
+			     sync_addr, exec_queue[0], prefetch_req);
+
+	sync_addr = (void *)((char *)batch_addr + SZ_4K);
+	sync.addr = to_user_pointer((uint64_t *)sync_addr);
+	sync.timeline_value = EXEC_SYNC_VAL;
+	*sync_addr = 0;
+
+	/* Executing ATOMIC_INC on GPU0. */
+	xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
+	if (*sync_addr != EXEC_SYNC_VAL)
+		xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr, EXEC_SYNC_VAL, exec_queue[0],
+			       NSEC_PER_SEC * 10);
+
+	igt_assert_eq(*shared_val, ATOMIC_OP_VAL);
+
+	atomic_batch_init(gpu1->fd, vm[1], addr, &batch_bo, &batch_addr);
+
+	/* Place destination in an optionally remote location to test */
+	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K, 0,
+			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
+			    gpu1->fd, 0, gpu1->vram_regions[0], exec_queue[0],
+			    0, 0);
+
+	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
+	xe_multigpu_prefetch(gpu1->fd, vm[1], addr, SZ_4K, &sync,
+			     sync_addr, exec_queue[1], prefetch_req);
+
+	sync_addr = (void *)((char *)batch_addr + SZ_4K);
+	sync.addr = to_user_pointer((uint64_t *)sync_addr);
+	sync.timeline_value = EXEC_SYNC_VAL;
+	*sync_addr = 0;
+
+	/* Execute ATOMIC_INC on GPU1 */
+	xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr, &sync, 1);
+	if (*sync_addr != EXEC_SYNC_VAL)
+		xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr, EXEC_SYNC_VAL, exec_queue[1],
+			       NSEC_PER_SEC * 10);
+
+	igt_assert_eq(*shared_val, ATOMIC_OP_VAL + 1);
+
+	munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
+	batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
+	batch_fini(gpu1->fd, vm[1], batch_bo, batch_addr);
+	free(data);
+
+	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
+	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
+}
+
+static void
+gpu_atomic_inc_wrapper(struct xe_svm_gpu_info *src,
+		       struct xe_svm_gpu_info *dst,
+		       struct drm_xe_engine_class_instance *eci,
+		       void *extra_args)
+{
+	struct multigpu_ops_args *args = (struct multigpu_ops_args *)extra_args;
+	igt_assert(src);
+	igt_assert(dst);
+
+	atomic_inc_op(src, dst, eci, args->prefetch_req);
+}
+
 igt_main
 {
 	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
@@ -364,6 +511,14 @@ igt_main
 		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_mem_access_wrapper, &op_args);
 	}
 
+	igt_subtest("atomic-inc-gpu-op") {
+		struct multigpu_ops_args atomic_args;
+		atomic_args.prefetch_req = 1;
+		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_atomic_inc_wrapper, &atomic_args);
+		atomic_args.prefetch_req = 0;
+		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_atomic_inc_wrapper, &atomic_args);
+	}
+
 	igt_fixture {
 		int cnt;
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH i-g-t v7 05/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU coherency test
  2025-11-13 16:32 [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs nishit.sharma
                   ` (3 preceding siblings ...)
  2025-11-13 16:33 ` [PATCH i-g-t v7 04/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU atomic operations nishit.sharma
@ 2025-11-13 16:33 ` nishit.sharma
  2025-11-17 14:02   ` Hellstrom, Thomas
  2025-11-13 16:33 ` [PATCH i-g-t v7 06/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU performance test nishit.sharma
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: nishit.sharma @ 2025-11-13 16:33 UTC (permalink / raw)
  To: igt-dev, thomas.hellstrom, nishit.sharma

From: Nishit Sharma <nishit.sharma@intel.com>

This test verifies memory coherency in a multi-GPU environment using SVM.
GPU 1 writes to a shared buffer, GPU 2 reads and checks for correct data
without explicit synchronization, and the test is repeated with the CPU and
both GPUs to ensure consistent memory visibility across agents.
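
A condensed sketch of the checks in coherency_test_multigpu() below (addr is
the CPU pointer to a malloced, SVM-backed allocation shared by both VMs,
value an arbitrary constant; casts simplified):

	/* GPU0: MI_STORE_DWORD_IMM of value to addr */
	igt_assert_eq(*(uint32_t *)addr, value);	/* visible to the CPU */

	/* GPU1: MEM_COPY from addr to a second malloced buffer */
	igt_assert_eq(*(uint32_t *)copy_dst, value);	/* visible via GPU1 */

	/* CPU write, then read back without any explicit flush */
	memset(addr, 10, sizeof(int));
	igt_assert_eq(*(uint32_t *)addr, 0x0A0A0A0A);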

Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
---
 tests/intel/xe_multi_gpusvm.c | 203 +++++++++++++++++++++++++++++++++-
 1 file changed, 201 insertions(+), 2 deletions(-)

diff --git a/tests/intel/xe_multi_gpusvm.c b/tests/intel/xe_multi_gpusvm.c
index 54e036724..6792ef72c 100644
--- a/tests/intel/xe_multi_gpusvm.c
+++ b/tests/intel/xe_multi_gpusvm.c
@@ -34,8 +34,13 @@
  * SUBTEST: atomic-inc-gpu-op
  * Description:
  * 	This test does atomic operation in multi-gpu by executing atomic
- *	operation on GPU1 and then on GPU2 using the same
- *	address
+ * 	operation on GPU1 and then on GPU2 using the same
+ * 	address
+ *
+ * SUBTEST: coherency-multi-gpu
+ * Description:
+ * 	This test checks coherency in multi-gpu by writing from GPU0,
+ * 	reading and verifying from GPU1, and repeating with the CPU and both GPUs
  */
 
 #define MAX_XE_REGIONS	8
@@ -93,6 +98,11 @@ static void gpu_atomic_inc_wrapper(struct xe_svm_gpu_info *src,
 				   struct drm_xe_engine_class_instance *eci,
 				   void *extra_args);
 
+static void gpu_coherecy_test_wrapper(struct xe_svm_gpu_info *src,
+				      struct xe_svm_gpu_info *dst,
+				      struct drm_xe_engine_class_instance *eci,
+				      void *extra_args);
+
 static void
 create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct drm_xe_engine_class_instance *eci,
 		    uint32_t *vm, uint32_t *exec_queue)
@@ -214,6 +224,35 @@ atomic_batch_init(int fd, uint32_t vm, uint64_t src_addr,
 	*addr = batch_addr;
 }
 
+static void
+store_dword_batch_init(int fd, uint32_t vm, uint64_t src_addr,
+                       uint32_t *bo, uint64_t *addr, int value)
+{
+        uint32_t batch_bo_size = BATCH_SIZE(fd);
+        uint32_t batch_bo;
+        uint64_t batch_addr;
+        void *batch;
+        uint32_t *cmd;
+        int i = 0;
+
+        batch_bo = xe_bo_create(fd, vm, batch_bo_size, vram_if_possible(fd, 0), 0);
+        batch = xe_bo_map(fd, batch_bo, batch_bo_size);
+        cmd = (uint32_t *) batch;
+
+        cmd[i++] = MI_STORE_DWORD_IMM_GEN4;
+        cmd[i++] = src_addr;
+        cmd[i++] = src_addr >> 32;
+        cmd[i++] = value;
+        cmd[i++] = MI_BATCH_BUFFER_END;
+
+        batch_addr = to_user_pointer(batch);
+
+        /* Punch a gap in the SVM map where we map the batch_bo */
+        xe_vm_bind_lr_sync(fd, vm, batch_bo, 0, batch_addr, batch_bo_size, 0);
+        *bo = batch_bo;
+        *addr = batch_addr;
+}
+
 static void batch_init(int fd, uint32_t vm, uint64_t src_addr,
 		       uint64_t dst_addr, uint64_t copy_size,
 		       uint32_t *bo, uint64_t *addr)
@@ -373,6 +412,143 @@ gpu_mem_access_wrapper(struct xe_svm_gpu_info *src,
 	copy_src_dst(src, dst, eci, args->prefetch_req);
 }
 
+static void
+coherency_test_multigpu(struct xe_svm_gpu_info *gpu0,
+			struct xe_svm_gpu_info *gpu1,
+			struct drm_xe_engine_class_instance *eci,
+			bool coh_fail_set,
+			bool prefetch_req)
+{
+        uint64_t addr;
+        uint32_t vm[2];
+        uint32_t exec_queue[2];
+        uint32_t batch_bo, batch1_bo[2];
+        uint64_t batch_addr, batch1_addr[2];
+        struct drm_xe_sync sync = {};
+        volatile uint64_t *sync_addr;
+        int value = 60;
+	uint64_t *data1;
+	void *copy_dst;
+
+	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
+	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
+
+        data1 = aligned_alloc(SZ_2M, SZ_4K);
+	igt_assert(data1);
+	addr = to_user_pointer(data1);
+
+	copy_dst = aligned_alloc(SZ_2M, SZ_4K);
+	igt_assert(copy_dst);
+
+        store_dword_batch_init(gpu0->fd, vm[0], addr, &batch_bo, &batch_addr, value);
+
+        /* Place destination in GPU0 local memory location to test */
+	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
+			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
+			    gpu0->fd, 0, gpu0->vram_regions[0], exec_queue[0],
+			    0, 0);
+
+	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
+	xe_multigpu_prefetch(gpu0->fd, vm[0], addr, SZ_4K, &sync,
+			     sync_addr, exec_queue[0], prefetch_req);
+
+        sync_addr = (void *)((char *)batch_addr + SZ_4K);
+        sync.addr = to_user_pointer((uint64_t *)sync_addr);
+        sync.timeline_value = EXEC_SYNC_VAL;
+        *sync_addr = 0;
+
+        /* Execute STORE command on GPU0 */
+        xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
+        if (*sync_addr != EXEC_SYNC_VAL)
+                xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr, EXEC_SYNC_VAL, exec_queue[0],
+			       NSEC_PER_SEC * 10);
+
+        igt_assert_eq(*(uint64_t *)addr, value);
+
+	/* Creating batch for GPU1 using addr as Src which have value from GPU0 */
+	batch_init(gpu1->fd, vm[1], addr, to_user_pointer(copy_dst),
+		   SZ_4K, &batch_bo, &batch_addr);
+
+        /* Place destination in GPU1 local memory location to test */
+	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K, 0,
+			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
+			    gpu1->fd, 0, gpu1->vram_regions[0], exec_queue[1],
+			    0, 0);
+
+	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
+	xe_multigpu_prefetch(gpu1->fd, vm[1], addr, SZ_4K, &sync,
+			     sync_addr, exec_queue[1], prefetch_req);
+
+        sync_addr = (void *)((char *)batch_addr + SZ_4K);
+        sync.addr = to_user_pointer((uint64_t *)sync_addr);
+        sync.timeline_value = EXEC_SYNC_VAL;
+        *sync_addr = 0;
+
+        /* Execute COPY command on GPU1 */
+        xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr, &sync, 1);
+        if (*sync_addr != EXEC_SYNC_VAL)
+                xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr, EXEC_SYNC_VAL, exec_queue[1],
+			       NSEC_PER_SEC * 10);
+
+        igt_assert_eq(*(uint64_t *)copy_dst, value);
+
+        /* CPU writes 10; memset sets bytes, not integers, so it fills 4 bytes with 0x0A */
+        memset((void *)(uintptr_t)addr, 10, sizeof(int));
+        igt_assert_eq(*(uint64_t *)addr, 0x0A0A0A0A);
+
+	if (coh_fail_set) {
+		igt_info("coherency fail impl\n");
+
+		/* Coherency fail scenario */
+		store_dword_batch_init(gpu0->fd, vm[0], addr, &batch1_bo[0], &batch1_addr[0], value + 10);
+		store_dword_batch_init(gpu1->fd, vm[1], addr, &batch1_bo[1], &batch1_addr[1], value + 20);
+
+		sync_addr = (void *)((char *)batch1_addr[0] + SZ_4K);
+		sync.addr = to_user_pointer((uint64_t *)sync_addr);
+		sync.timeline_value = EXEC_SYNC_VAL;
+		*sync_addr = 0;
+
+		/* Execute STORE command on GPU1 */
+		xe_exec_sync(gpu0->fd, exec_queue[0], batch1_addr[0], &sync, 1);
+		if (*sync_addr != EXEC_SYNC_VAL)
+			xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr, EXEC_SYNC_VAL, exec_queue[0],
+				       NSEC_PER_SEC * 10);
+
+		sync_addr = (void *)((char *)batch1_addr[1] + SZ_4K);
+		sync.addr = to_user_pointer((uint64_t *)sync_addr);
+		sync.timeline_value = EXEC_SYNC_VAL;
+		*sync_addr = 0;
+
+		/* Execute STORE command on GPU2 */
+		xe_exec_sync(gpu1->fd, exec_queue[1], batch1_addr[1], &sync, 1);
+		if (*sync_addr != EXEC_SYNC_VAL)
+			xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr, EXEC_SYNC_VAL, exec_queue[1],
+				       NSEC_PER_SEC * 10);
+
+		igt_warn_on_f(*(uint64_t *)addr != (value + 10),
+			      "GPU2 (dst_gpu) has overwritten value at addr\n");
+
+		munmap((void *)batch1_addr[0], BATCH_SIZE(gpu0->fd));
+		munmap((void *)batch1_addr[1], BATCH_SIZE(gpu1->fd));
+
+		batch_fini(gpu0->fd, vm[0], batch1_bo[0], batch1_addr[0]);
+		batch_fini(gpu1->fd, vm[1], batch1_bo[1], batch1_addr[1]);
+	}
+
+        /* CPU writes 11; memset sets bytes, not integers, so it fills 4 bytes with 0x0B */
+        memset((void *)(uintptr_t)addr, 11, sizeof(int));
+        igt_assert_eq(*(uint64_t *)addr, 0x0B0B0B0B);
+
+        munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
+        batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
+        batch_fini(gpu1->fd, vm[1], batch_bo, batch_addr);
+        free(data1);
+	free(copy_dst);
+
+	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
+	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
+}
+
 static void
 atomic_inc_op(struct xe_svm_gpu_info *gpu0,
 	      struct xe_svm_gpu_info *gpu1,
@@ -472,6 +648,19 @@ gpu_atomic_inc_wrapper(struct xe_svm_gpu_info *src,
 	atomic_inc_op(src, dst, eci, args->prefetch_req);
 }
 
+static void
+gpu_coherecy_test_wrapper(struct xe_svm_gpu_info *src,
+			  struct xe_svm_gpu_info *dst,
+			  struct drm_xe_engine_class_instance *eci,
+			  void *extra_args)
+{
+	struct multigpu_ops_args *args = (struct multigpu_ops_args *)extra_args;
+	igt_assert(src);
+	igt_assert(dst);
+
+	coherency_test_multigpu(src, dst, eci, args->op_mod, args->prefetch_req);
+}
+
 igt_main
 {
 	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
@@ -519,6 +708,16 @@ igt_main
 		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_atomic_inc_wrapper, &atomic_args);
 	}
 
+	igt_subtest("coherency-multi-gpu") {
+		struct multigpu_ops_args coh_args;
+		coh_args.prefetch_req = 1;
+		coh_args.op_mod = 0;
+		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_coherecy_test_wrapper, &coh_args);
+		coh_args.prefetch_req = 0;
+		coh_args.op_mod = 1;
+		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_coherecy_test_wrapper, &coh_args);
+	}
+
 	igt_fixture {
 		int cnt;
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH i-g-t v7 06/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU performance test
  2025-11-13 16:32 [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs nishit.sharma
                   ` (4 preceding siblings ...)
  2025-11-13 16:33 ` [PATCH i-g-t v7 05/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU coherency test nishit.sharma
@ 2025-11-13 16:33 ` nishit.sharma
  2025-11-17 14:39   ` Hellstrom, Thomas
  2025-11-13 16:33 ` [PATCH i-g-t v7 07/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU fault handling test nishit.sharma
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: nishit.sharma @ 2025-11-13 16:33 UTC (permalink / raw)
  To: igt-dev, thomas.hellstrom, nishit.sharma

From: Nishit Sharma <nishit.sharma@intel.com>

This test measures latency and bandwidth for buffer access from each GPU
and the CPU in a multi-GPU SVM environment. It compares performance for
local versus remote access using madvise and prefetch to control buffer
placement.
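
Timing is taken with CLOCK_MONOTONIC around each access path and converted
to a bandwidth figure; the arithmetic used by the test below is simply
(bytes and seconds stand for the measured quantities):

	static double time_diff(struct timespec *start, struct timespec *end)
	{
		return (end->tv_sec - start->tv_sec) +
		       (end->tv_nsec - start->tv_nsec) / 1e9;
	}

	/* elapsed seconds -> MB/s for a transfer of 'bytes' bytes */
	double bw_mbps = bytes / seconds / (1024 * 1024);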

Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
---
 tests/intel/xe_multi_gpusvm.c | 181 ++++++++++++++++++++++++++++++++++
 1 file changed, 181 insertions(+)

diff --git a/tests/intel/xe_multi_gpusvm.c b/tests/intel/xe_multi_gpusvm.c
index 6792ef72c..2c8e62e34 100644
--- a/tests/intel/xe_multi_gpusvm.c
+++ b/tests/intel/xe_multi_gpusvm.c
@@ -13,6 +13,8 @@
 #include "intel_mocs.h"
 #include "intel_reg.h"
 
+#include <time.h>
+
 #include "xe/xe_ioctl.h"
 #include "xe/xe_query.h"
 #include "xe/xe_util.h"
@@ -41,6 +43,11 @@
  * Description:
 * 	This test checks coherency in multi-gpu by writing from GPU0,
 * 	reading and verifying from GPU1, and repeating with the CPU and both GPUs
+ *
+ * SUBTEST: latency-multi-gpu
+ * Description:
+ * 	This test measures and compares latency and bandwidth for buffer access
+ * 	from CPU, local GPU, and remote GPU
  */
 
 #define MAX_XE_REGIONS	8
@@ -103,6 +110,11 @@ static void gpu_coherecy_test_wrapper(struct xe_svm_gpu_info *src,
 				      struct drm_xe_engine_class_instance *eci,
 				      void *extra_args);
 
+static void gpu_latency_test_wrapper(struct xe_svm_gpu_info *src,
+				     struct xe_svm_gpu_info *dst,
+				     struct drm_xe_engine_class_instance *eci,
+				     void *extra_args);
+
 static void
 create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct drm_xe_engine_class_instance *eci,
 		    uint32_t *vm, uint32_t *exec_queue)
@@ -197,6 +209,11 @@ static void for_each_gpu_pair(int num_gpus, struct xe_svm_gpu_info *gpus,
 
 static void open_pagemaps(int fd, struct xe_svm_gpu_info *info);
 
+static double time_diff(struct timespec *start, struct timespec *end)
+{
+    return (end->tv_sec - start->tv_sec) + (end->tv_nsec - start->tv_nsec) / 1e9;
+}
+
 static void
 atomic_batch_init(int fd, uint32_t vm, uint64_t src_addr,
 		  uint32_t *bo, uint64_t *addr)
@@ -549,6 +566,147 @@ coherency_test_multigpu(struct xe_svm_gpu_info *gpu0,
 	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
 }
 
+static void
+latency_test_multigpu(struct xe_svm_gpu_info *gpu0,
+		      struct xe_svm_gpu_info *gpu1,
+		      struct drm_xe_engine_class_instance *eci,
+		      bool remote_copy,
+		      bool prefetch_req)
+{
+        uint64_t addr;
+        uint32_t vm[2];
+        uint32_t exec_queue[2];
+        uint32_t batch_bo;
+        uint8_t *copy_dst;
+        uint64_t batch_addr;
+        struct drm_xe_sync sync = {};
+        volatile uint64_t *sync_addr;
+        int value = 60;
+        int shared_val[4];
+        struct test_exec_data *data;
+	struct timespec t_start, t_end;
+	double cpu_latency, gpu1_latency, gpu2_latency;
+	double cpu_bw, gpu1_bw, gpu2_bw;
+
+	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
+	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
+
+        data = aligned_alloc(SZ_2M, SZ_4K);
+        igt_assert(data);
+        data[0].vm_sync = 0;
+        addr = to_user_pointer(data);
+
+        copy_dst = aligned_alloc(SZ_2M, SZ_4K);
+        igt_assert(copy_dst);
+
+        store_dword_batch_init(gpu0->fd, vm[0], addr, &batch_bo, &batch_addr, value);
+
+	/* Measure GPU0 access latency/bandwidth */
+	clock_gettime(CLOCK_MONOTONIC, &t_start);
+
+        /* GPU0(src_gpu) access */
+	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
+			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
+			    gpu0->fd, 0, gpu0->vram_regions[0], exec_queue[0],
+			    0, 0);
+
+	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
+	xe_multigpu_prefetch(gpu0->fd, vm[0], addr, SZ_4K, &sync,
+			     sync_addr, exec_queue[0], prefetch_req);
+
+	clock_gettime(CLOCK_MONOTONIC, &t_end);
+	gpu1_latency = time_diff(&t_start, &t_end);
+	gpu1_bw = COPY_SIZE / gpu1_latency / (1024 * 1024); // MB/s
+
+        sync_addr = (void *)((char *)batch_addr + SZ_4K);
+        sync.addr = to_user_pointer((uint64_t *)sync_addr);
+        sync.timeline_value = EXEC_SYNC_VAL;
+        *sync_addr = 0;
+
+        /* Execute STORE command on GPU0 */
+        xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
+        if (*sync_addr != EXEC_SYNC_VAL)
+                xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr, EXEC_SYNC_VAL, exec_queue[0],
+			       NSEC_PER_SEC * 10);
+
+	memcpy(shared_val, (void *)addr, 4);
+	igt_assert_eq(shared_val[0], value);
+
+        /* CPU writes 10; memset sets bytes, not integers, so it fills 4 bytes with 0x0A */
+        memset((void *)(uintptr_t)addr, 10, sizeof(int));
+        memcpy(shared_val, (void *)(uintptr_t)addr, sizeof(shared_val));
+        igt_assert_eq(shared_val[0], 0x0A0A0A0A);
+
+	*(uint64_t *)addr = 50;
+
+	if (remote_copy) {
+		igt_info("creating batch for COPY_CMD on GPU1\n");
+		batch_init(gpu1->fd, vm[1], addr, to_user_pointer(copy_dst),
+			   SZ_4K, &batch_bo, &batch_addr);
+	} else {
+		igt_info("creating batch for STORE_CMD on GPU1\n");
+		store_dword_batch_init(gpu1->fd, vm[1], addr, &batch_bo, &batch_addr, value + 10);
+	}
+
+	/* Measure GPU1 access latency/bandwidth */
+	clock_gettime(CLOCK_MONOTONIC, &t_start);
+
+        /* GPU1(dst_gpu) access */
+	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K, 0,
+			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
+			    gpu1->fd, 0, gpu1->vram_regions[0], exec_queue[1],
+			    0, 0);
+
+	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
+	xe_multigpu_prefetch(gpu1->fd, vm[1], addr, SZ_4K, &sync,
+			     sync_addr, exec_queue[1], prefetch_req);
+
+	clock_gettime(CLOCK_MONOTONIC, &t_end);
+	gpu2_latency = time_diff(&t_start, &t_end);
+	gpu2_bw = COPY_SIZE / gpu2_latency / (1024 * 1024); // MB/s
+
+        sync_addr = (void *)((char *)batch_addr + SZ_4K);
+        sync.addr = to_user_pointer((uint64_t *)sync_addr);
+        sync.timeline_value = EXEC_SYNC_VAL;
+        *sync_addr = 0;
+
+        /* Execute COPY/STORE command on GPU1 */
+        xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr, &sync, 1);
+        if (*sync_addr != EXEC_SYNC_VAL)
+                xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr, EXEC_SYNC_VAL, exec_queue[1],
+			       NSEC_PER_SEC * 10);
+
+	if (!remote_copy)
+		igt_assert_eq(*(uint64_t *)addr, value + 10);
+	else
+		igt_assert_eq(*(uint64_t *)copy_dst, 50);
+
+        /* CPU writes 11; memset sets bytes, not integers, so it fills 4 bytes with 0x0B */
+	/* Measure CPU access latency/bandwidth */
+	clock_gettime(CLOCK_MONOTONIC, &t_start);
+        memset((void *)(uintptr_t)addr, 11, sizeof(int));
+        memcpy(shared_val, (void *)(uintptr_t)addr, sizeof(shared_val));
+	clock_gettime(CLOCK_MONOTONIC, &t_end);
+	cpu_latency = time_diff(&t_start, &t_end);
+	cpu_bw = COPY_SIZE / cpu_latency / (1024 * 1024); // MB/s
+
+        igt_assert_eq(shared_val[0], 0x0B0B0B0B);
+
+	/* Print results */
+	igt_info("CPU: Latency %.6f s, Bandwidth %.2f MB/s\n", cpu_latency, cpu_bw);
+	igt_info("GPU0: Latency %.6f s, Bandwidth %.2f MB/s\n", gpu1_latency, gpu1_bw);
+	igt_info("GPU1: Latency %.6f s, Bandwidth %.2f MB/s\n", gpu2_latency, gpu2_bw);
+
+        munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
+        batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
+        batch_fini(gpu1->fd, vm[1], batch_bo, batch_addr);
+        free(data);
+        free(copy_dst);
+
+	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
+	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
+}
+
 static void
 atomic_inc_op(struct xe_svm_gpu_info *gpu0,
 	      struct xe_svm_gpu_info *gpu1,
@@ -661,6 +819,19 @@ gpu_coherecy_test_wrapper(struct xe_svm_gpu_info *src,
 	coherency_test_multigpu(src, dst, eci, args->op_mod, args->prefetch_req);
 }
 
+static void
+gpu_latency_test_wrapper(struct xe_svm_gpu_info *src,
+			 struct xe_svm_gpu_info *dst,
+			 struct drm_xe_engine_class_instance *eci,
+			 void *extra_args)
+{
+	struct multigpu_ops_args *args = (struct multigpu_ops_args *)extra_args;
+	igt_assert(src);
+	igt_assert(dst);
+
+	latency_test_multigpu(src, dst, eci, args->op_mod, args->prefetch_req);
+}
+
 igt_main
 {
 	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
@@ -718,6 +889,16 @@ igt_main
 		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_coherecy_test_wrapper, &coh_args);
 	}
 
+	igt_subtest("latency-multi-gpu") {
+		struct multigpu_ops_args latency_args;
+		latency_args.prefetch_req = 1;
+		latency_args.op_mod = 1;
+		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_latency_test_wrapper, &latency_args);
+		latency_args.prefetch_req = 0;
+		latency_args.op_mod = 0;
+		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_latency_test_wrapper, &latency_args);
+	}
+
 	igt_fixture {
 		int cnt;
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH i-g-t v7 07/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU fault handling test
  2025-11-13 16:32 [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs nishit.sharma
                   ` (5 preceding siblings ...)
  2025-11-13 16:33 ` [PATCH i-g-t v7 06/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU performance test nishit.sharma
@ 2025-11-13 16:33 ` nishit.sharma
  2025-11-17 14:48   ` Hellstrom, Thomas
  2025-11-13 16:33 ` [PATCH i-g-t v7 08/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU simultaneous access test nishit.sharma
  2025-11-13 16:33 ` [PATCH i-g-t v7 09/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU conflicting madvise test nishit.sharma
  8 siblings, 1 reply; 34+ messages in thread
From: nishit.sharma @ 2025-11-13 16:33 UTC (permalink / raw)
  To: igt-dev, thomas.hellstrom, nishit.sharma

From: Nishit Sharma <nishit.sharma@intel.com>

This test intentionally triggers page faults by accessing regions without
prefetch from both GPUs in a multi-GPU environment.
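
Fault activity is observed through the per-GT SVM pagefault counter rather
than through explicit mappings. A condensed sketch of the check in
pagefault_test_multigpu() below (gpu, eci and prefetch_req come from the
surrounding test code):

	const char *stat = "svm_pagefault_count";
	int pf_before, pf_after;

	pf_before = xe_gt_stats_get_count(gpu->fd, eci->gt_id, stat);
	/* store to the malloced range from the GPU, with or without prefetch */
	pf_after = xe_gt_stats_get_count(gpu->fd, eci->gt_id, stat);

	if (pf_after != pf_before)
		igt_warn("pagefault count changed: %d -> %d (prefetch=%d)\n",
			 pf_before, pf_after, prefetch_req);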

Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
---
 tests/intel/xe_multi_gpusvm.c | 102 ++++++++++++++++++++++++++++++++++
 1 file changed, 102 insertions(+)

diff --git a/tests/intel/xe_multi_gpusvm.c b/tests/intel/xe_multi_gpusvm.c
index 2c8e62e34..6feb543ae 100644
--- a/tests/intel/xe_multi_gpusvm.c
+++ b/tests/intel/xe_multi_gpusvm.c
@@ -15,6 +15,7 @@
 
 #include <time.h>
 
+#include "xe/xe_gt.h"
 #include "xe/xe_ioctl.h"
 #include "xe/xe_query.h"
 #include "xe/xe_util.h"
@@ -48,6 +49,11 @@
  * Description:
  * 	This test measures and compares latency and bandwidth for buffer access
  * 	from CPU, local GPU, and remote GPU
+ *
+ * SUBTEST: pagefault-multi-gpu
+ * Description:
+ * 	This test intentionally triggers page faults by accessing unmapped SVM
+ * 	regions from both GPUs
  */
 
 #define MAX_XE_REGIONS	8
@@ -115,6 +121,11 @@ static void gpu_latency_test_wrapper(struct xe_svm_gpu_info *src,
 				     struct drm_xe_engine_class_instance *eci,
 				     void *extra_args);
 
+static void gpu_fault_test_wrapper(struct xe_svm_gpu_info *src,
+				   struct xe_svm_gpu_info *dst,
+				   struct drm_xe_engine_class_instance *eci,
+				   void *extra_args);
+
 static void
 create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct drm_xe_engine_class_instance *eci,
 		    uint32_t *vm, uint32_t *exec_queue)
@@ -707,6 +718,76 @@ latency_test_multigpu(struct xe_svm_gpu_info *gpu0,
 	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
 }
 
+static void
+pagefault_test_multigpu(struct xe_svm_gpu_info *gpu0,
+			struct xe_svm_gpu_info *gpu1,
+			struct drm_xe_engine_class_instance *eci,
+			bool prefetch_req)
+{
+        uint64_t addr;
+        uint32_t vm[2];
+        uint32_t exec_queue[2];
+        uint32_t batch_bo;
+        uint64_t batch_addr;
+        struct drm_xe_sync sync = {};
+        volatile uint64_t *sync_addr;
+        int value = 60, pf_count_1, pf_count_2;
+        void *data;
+	const char *pf_count_stat = "svm_pagefault_count";
+
+	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
+	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
+
+        data = aligned_alloc(SZ_2M, SZ_4K);
+        igt_assert(data);
+        addr = to_user_pointer(data);
+
+	pf_count_1 = xe_gt_stats_get_count(gpu0->fd, eci->gt_id, pf_count_stat);
+
+	/* checking pagefault count on GPU */
+        store_dword_batch_init(gpu0->fd, vm[0], addr, &batch_bo, &batch_addr, value);
+
+	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
+			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
+			    gpu0->fd, 0, gpu0->vram_regions[0], exec_queue[0],
+			    0, 0);
+
+	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
+	xe_multigpu_prefetch(gpu0->fd, vm[0], addr, SZ_4K, &sync,
+			     sync_addr, exec_queue[0], prefetch_req);
+
+        sync_addr = (void *)((char *)batch_addr + SZ_4K);
+        sync.addr = to_user_pointer((uint64_t *)sync_addr);
+        sync.timeline_value = EXEC_SYNC_VAL;
+        *sync_addr = 0;
+
+	/* Execute STORE command on GPU */
+        xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
+        if (*sync_addr != EXEC_SYNC_VAL)
+                xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr, EXEC_SYNC_VAL, exec_queue[0],
+			       NSEC_PER_SEC * 10);
+
+	pf_count_2 = xe_gt_stats_get_count(gpu0->fd, eci->gt_id, pf_count_stat);
+
+	if (pf_count_2 != pf_count_1) {
+		igt_warn("GPU pf: pf_count_2(%d) != pf_count_1(%d) prefetch_req :%d\n",
+			 pf_count_2, pf_count_1, prefetch_req);
+	}
+
+        igt_assert_eq(*(uint64_t *)addr, value);
+
+        /* CPU writes 11; memset sets bytes, not integers, so it fills 4 bytes with 0x0B */
+        memset((void *)(uintptr_t)addr, 11, sizeof(int));
+        igt_assert_eq(*(uint64_t *)addr, 0x0B0B0B0B);
+
+	munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
+	batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
+	free(data);
+
+	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
+	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
+}
+
 static void
 atomic_inc_op(struct xe_svm_gpu_info *gpu0,
 	      struct xe_svm_gpu_info *gpu1,
@@ -832,6 +913,19 @@ gpu_latency_test_wrapper(struct xe_svm_gpu_info *src,
 	latency_test_multigpu(src, dst, eci, args->op_mod, args->prefetch_req);
 }
 
+static void
+gpu_fault_test_wrapper(struct xe_svm_gpu_info *src,
+		       struct xe_svm_gpu_info *dst,
+		       struct drm_xe_engine_class_instance *eci,
+		       void *extra_args)
+{
+	struct multigpu_ops_args *args = (struct multigpu_ops_args *)extra_args;
+	igt_assert(src);
+	igt_assert(dst);
+
+	pagefault_test_multigpu(src, dst, eci, args->prefetch_req);
+}
+
 igt_main
 {
 	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
@@ -899,6 +993,14 @@ igt_main
 		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_latency_test_wrapper, &latency_args);
 	}
 
+	igt_subtest("pagefault-multi-gpu") {
+		struct multigpu_ops_args fault_args;
+		fault_args.prefetch_req = 1;
+		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_fault_test_wrapper, &fault_args);
+		fault_args.prefetch_req = 0;
+		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_fault_test_wrapper, &fault_args);
+	}
+
 	igt_fixture {
 		int cnt;
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH i-g-t v7 08/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU simultaneous access test
  2025-11-13 16:32 [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs nishit.sharma
                   ` (6 preceding siblings ...)
  2025-11-13 16:33 ` [PATCH i-g-t v7 07/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU fault handling test nishit.sharma
@ 2025-11-13 16:33 ` nishit.sharma
  2025-11-17 14:57   ` Hellstrom, Thomas
  2025-11-13 16:33 ` [PATCH i-g-t v7 09/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU conflicting madvise test nishit.sharma
  8 siblings, 1 reply; 34+ messages in thread
From: nishit.sharma @ 2025-11-13 16:33 UTC (permalink / raw)
  To: igt-dev, thomas.hellstrom, nishit.sharma

From: Nishit Sharma <nishit.sharma@intel.com>

This test launches compute or copy workloads on both GPUs that access the same
SVM buffer, using synchronization primitives (fences/semaphores) to coordinate
access. It verifies data integrity and checks for the absence of race conditions
in a multi-GPU SVM environment.
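
The per-GPU submission boils down to signalling a user fence from the batch
and waiting on it from the CPU before the next access. A minimal sketch of
that loop (illustrative only, not part of the patch; fd[], exec_queue[] and
batch_addr[] are assumed to be set up the way the test below does):

	for (int i = 0; i < 2; i++) {
		volatile uint64_t *ufence =
			(void *)(uintptr_t)(batch_addr[i] + SZ_4K);
		struct drm_xe_sync sync = {
			.flags = DRM_XE_SYNC_FLAG_SIGNAL,
			.type = DRM_XE_SYNC_TYPE_USER_FENCE,
			.addr = to_user_pointer((uint64_t *)ufence),
			.timeline_value = EXEC_SYNC_VAL,
		};

		/* arm the fence, submit, then block until the GPU signals it */
		*ufence = 0;
		xe_exec_sync(fd[i], exec_queue[i], batch_addr[i], &sync, 1);
		if (*ufence != EXEC_SYNC_VAL)
			xe_wait_ufence(fd[i], (uint64_t *)ufence, EXEC_SYNC_VAL,
				       exec_queue[i], NSEC_PER_SEC * 10);
	}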

Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
---
 tests/intel/xe_multi_gpusvm.c | 133 ++++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)

diff --git a/tests/intel/xe_multi_gpusvm.c b/tests/intel/xe_multi_gpusvm.c
index 6feb543ae..dc2a8f9c8 100644
--- a/tests/intel/xe_multi_gpusvm.c
+++ b/tests/intel/xe_multi_gpusvm.c
@@ -54,6 +54,11 @@
  * Description:
  * 	This test intentionally triggers page faults by accessing unmapped SVM
  * 	regions from both GPUs
+ *
+ * SUBTEST: concurrent-access-multi-gpu
+ * Description:
+ * 	This test launches simultaneous workloads on both GPUs accessing the
+ * 	same SVM buffer, synchronizes with fences, and verifies data integrity
  */
 
 #define MAX_XE_REGIONS	8
@@ -126,6 +131,11 @@ static void gpu_fault_test_wrapper(struct xe_svm_gpu_info *src,
 				   struct drm_xe_engine_class_instance *eci,
 				   void *extra_args);
 
+static void gpu_simult_test_wrapper(struct xe_svm_gpu_info *src,
+				    struct xe_svm_gpu_info *dst,
+				    struct drm_xe_engine_class_instance *eci,
+				    void *extra_args);
+
 static void
 create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct drm_xe_engine_class_instance *eci,
 		    uint32_t *vm, uint32_t *exec_queue)
@@ -900,6 +910,108 @@ gpu_coherecy_test_wrapper(struct xe_svm_gpu_info *src,
 	coherency_test_multigpu(src, dst, eci, args->op_mod, args->prefetch_req);
 }
 
+static void
+multigpu_access_test(struct xe_svm_gpu_info *gpu0,
+		     struct xe_svm_gpu_info *gpu1,
+		     struct drm_xe_engine_class_instance *eci,
+		     bool no_prefetch)
+{
+	uint64_t addr;
+	uint32_t vm[2];
+	uint32_t exec_queue[2];
+	uint32_t batch_bo[2];
+	struct test_exec_data *data;
+	uint64_t batch_addr[2];
+	struct drm_xe_sync sync[2] = {};
+	volatile uint64_t *sync_addr[2];
+	volatile uint32_t *shared_val;
+
+	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
+	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
+
+	data = aligned_alloc(SZ_2M, SZ_4K);
+	igt_assert(data);
+	data[0].vm_sync = 0;
+	addr = to_user_pointer(data);
+
+	shared_val = (volatile uint32_t *)addr;
+	*shared_val = ATOMIC_OP_VAL - 1;
+
+	atomic_batch_init(gpu0->fd, vm[0], addr, &batch_bo[0], &batch_addr[0]);
+	*shared_val = ATOMIC_OP_VAL - 2;
+	atomic_batch_init(gpu1->fd, vm[1], addr, &batch_bo[1], &batch_addr[1]);
+
+	/* Place destination in an optionally remote location to test */
+	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
+			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
+			    gpu0->fd, 0, gpu0->vram_regions[0], exec_queue[0],
+			    0, 0);
+	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K, 0,
+			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
+			    gpu1->fd, 0, gpu1->vram_regions[0], exec_queue[1],
+			    0, 0);
+
+	setup_sync(&sync[0], &sync_addr[0], BIND_SYNC_VAL);
+	setup_sync(&sync[1], &sync_addr[1], BIND_SYNC_VAL);
+
+	/* For simultaneous access need to call xe_wait_ufence for both gpus after prefetch */
+	if (!no_prefetch) {
+		xe_vm_prefetch_async(gpu0->fd, vm[0], 0, 0, addr,
+				     SZ_4K, &sync[0], 1,
+				     DRM_XE_CONSULT_MEM_ADVISE_PREF_LOC);
+
+		xe_vm_prefetch_async(gpu1->fd, vm[1], 0, 0, addr,
+				     SZ_4K, &sync[1], 1,
+				     DRM_XE_CONSULT_MEM_ADVISE_PREF_LOC);
+
+		if (*sync_addr[0] != BIND_SYNC_VAL)
+			xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr[0], BIND_SYNC_VAL, exec_queue[0],
+				       NSEC_PER_SEC * 10);
+		free((void *)sync_addr[0]);
+		if (*sync_addr[1] != BIND_SYNC_VAL)
+			xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr[1], BIND_SYNC_VAL, exec_queue[1],
+				       NSEC_PER_SEC * 10);
+		free((void *)sync_addr[1]);
+	}
+
+	if (no_prefetch) {
+		free((void *)sync_addr[0]);
+		free((void *)sync_addr[1]);
+	}
+
+	for (int i = 0; i < 100; i++) {
+		sync_addr[0] = (void *)((char *)batch_addr[0] + SZ_4K);
+		sync[0].addr = to_user_pointer((uint64_t *)sync_addr[0]);
+		sync[0].timeline_value = EXEC_SYNC_VAL;
+
+		sync_addr[1] = (void *)((char *)batch_addr[1] + SZ_4K);
+		sync[1].addr = to_user_pointer((uint64_t *)sync_addr[1]);
+		sync[1].timeline_value = EXEC_SYNC_VAL;
+		*sync_addr[0] = 0;
+		*sync_addr[1] = 0;
+
+		xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr[0], &sync[0], 1);
+		if (*sync_addr[0] != EXEC_SYNC_VAL)
+			xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr[0], EXEC_SYNC_VAL, exec_queue[0],
+				       NSEC_PER_SEC * 10);
+		xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr[1], &sync[1], 1);
+		if (*sync_addr[1] != EXEC_SYNC_VAL)
+			xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr[1], EXEC_SYNC_VAL, exec_queue[1],
+				       NSEC_PER_SEC * 10);
+	}
+
+	igt_assert_eq(*(uint64_t *)addr, 254);
+
+	munmap((void *)batch_addr[0], BATCH_SIZE(gpu0->fd));
+	munmap((void *)batch_addr[1], BATCH_SIZE(gpu0->fd));
+	batch_fini(gpu0->fd, vm[0], batch_bo[0], batch_addr[0]);
+	batch_fini(gpu1->fd, vm[1], batch_bo[1], batch_addr[1]);
+	free(data);
+
+	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
+	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
+}
+
 static void
 gpu_latency_test_wrapper(struct xe_svm_gpu_info *src,
 			 struct xe_svm_gpu_info *dst,
@@ -926,6 +1038,19 @@ gpu_fault_test_wrapper(struct xe_svm_gpu_info *src,
 	pagefault_test_multigpu(src, dst, eci, args->prefetch_req);
 }
 
+static void
+gpu_simult_test_wrapper(struct xe_svm_gpu_info *src,
+			struct xe_svm_gpu_info *dst,
+			struct drm_xe_engine_class_instance *eci,
+			void *extra_args)
+{
+	struct multigpu_ops_args *args = (struct multigpu_ops_args *)extra_args;
+	igt_assert(src);
+	igt_assert(dst);
+
+	multigpu_access_test(src, dst, eci, args->prefetch_req);
+}
+
 igt_main
 {
 	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
@@ -1001,6 +1126,14 @@ igt_main
 		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_fault_test_wrapper, &fault_args);
 	}
 
+	igt_subtest("concurrent-access-multi-gpu") {
+		struct multigpu_ops_args simul_args;
+		simul_args.prefetch_req = 1;
+		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_simult_test_wrapper, &simul_args);
+		simul_args.prefetch_req = 0;
+		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_simult_test_wrapper, &simul_args);
+	}
+
 	igt_fixture {
 		int cnt;
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH i-g-t v7 09/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU conflicting madvise test
  2025-11-13 16:32 [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs nishit.sharma
                   ` (7 preceding siblings ...)
  2025-11-13 16:33 ` [PATCH i-g-t v7 08/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU simultaneous access test nishit.sharma
@ 2025-11-13 16:33 ` nishit.sharma
  2025-11-17 15:11   ` Hellstrom, Thomas
  8 siblings, 1 reply; 34+ messages in thread
From: nishit.sharma @ 2025-11-13 16:33 UTC (permalink / raw)
  To: igt-dev, thomas.hellstrom, nishit.sharma

From: Nishit Sharma <nishit.sharma@intel.com>

This test calls madvise operations on GPU0 with the preferred location set
to GPU1 and vice versa. It reports conflicts when conflicting memory advice
is given for shared SVM buffers in a multi-GPU environment.
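
Stripped of the test plumbing, the conflicting advice amounts to each VM
preferring the other device's VRAM for the same CPU range. A minimal sketch
using the updated library helper (policy and region-instance values are
placeholders; gpu0/gpu1, vm[] and addr are assumed to be set up as in the
test):

	/* GPU0's VM prefers GPU1's VRAM for the shared range ... */
	xe_vm_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
		      DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
		      gpu1->fd, 0, 0);
	/* ... while GPU1's VM prefers GPU0's VRAM for the same range. */
	xe_vm_madvise(gpu1->fd, vm[1], addr, SZ_4K, 0,
		      DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
		      gpu0->fd, 0, 0);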

Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
---
 tests/intel/xe_multi_gpusvm.c | 143 ++++++++++++++++++++++++++++++++++
 1 file changed, 143 insertions(+)

diff --git a/tests/intel/xe_multi_gpusvm.c b/tests/intel/xe_multi_gpusvm.c
index dc2a8f9c8..afbf010e6 100644
--- a/tests/intel/xe_multi_gpusvm.c
+++ b/tests/intel/xe_multi_gpusvm.c
@@ -59,6 +59,11 @@
  * Description:
  * 	This test launches simultaneous workloads on both GPUs accessing the
  * 	same SVM buffer, synchronizes with fences, and verifies data integrity
+ *
+ * SUBTEST: conflicting-madvise-gpu
+ * Description:
+ * 	This test checks conflicting madvise by allocating a shared buffer,
+ * 	prefetching it from both GPUs, and checking for migration conflicts
  */
 
 #define MAX_XE_REGIONS	8
@@ -69,6 +74,8 @@
 #define EXEC_SYNC_VAL 0x676767
 #define COPY_SIZE SZ_64M
 #define	ATOMIC_OP_VAL	56
+#define USER_FENCE_VALUE        0xdeadbeefdeadbeefull
+#define FIVE_SEC                (5LL * NSEC_PER_SEC)
 
 struct xe_svm_gpu_info {
 	bool supports_faults;
@@ -136,6 +143,11 @@ static void gpu_simult_test_wrapper(struct xe_svm_gpu_info *src,
 				    struct drm_xe_engine_class_instance *eci,
 				    void *extra_args);
 
+static void gpu_conflict_test_wrapper(struct xe_svm_gpu_info *src,
+				      struct xe_svm_gpu_info *dst,
+				      struct drm_xe_engine_class_instance *eci,
+				      void *extra_args);
+
 static void
 create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct drm_xe_engine_class_instance *eci,
 		    uint32_t *vm, uint32_t *exec_queue)
@@ -798,6 +810,116 @@ pagefault_test_multigpu(struct xe_svm_gpu_info *gpu0,
 	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
 }
 
+#define	XE_BO_FLAG_SYSTEM	BIT(1)
+#define XE_BO_FLAG_CPU_ADDR_MIRROR      BIT(24)
+
+static void
+conflicting_madvise(struct xe_svm_gpu_info *gpu0,
+		    struct xe_svm_gpu_info *gpu1,
+		    struct drm_xe_engine_class_instance *eci,
+		    bool no_prefetch)
+{
+	uint64_t addr;
+	uint32_t vm[2];
+	uint32_t exec_queue[2];
+	uint32_t batch_bo[2];
+	void *data;
+	uint64_t batch_addr[2];
+	struct drm_xe_sync sync[2] = {};
+	volatile uint64_t *sync_addr[2];
+	int local_fd;
+	uint16_t local_vram;
+
+	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
+	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
+
+	data = aligned_alloc(SZ_2M, SZ_4K);
+	igt_assert(data);
+	addr = to_user_pointer(data);
+
+	xe_vm_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
+		      DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
+		      DRM_XE_PREFERRED_LOC_DEFAULT_SYSTEM, 0, 0);
+
+	store_dword_batch_init(gpu0->fd, vm[0], addr, &batch_bo[0], &batch_addr[0], 10);
+	store_dword_batch_init(gpu1->fd, vm[1], addr, &batch_bo[1], &batch_addr[1], 20);
+
+	/* Place destination in an optionally remote location to test */
+	local_fd = gpu0->fd;
+	local_vram = gpu0->vram_regions[0];
+	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K,
+			    0, DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
+			    gpu1->fd, 0, gpu1->vram_regions[0], exec_queue[0],
+			    local_fd, local_vram);
+
+	local_fd = gpu1->fd;
+	local_vram = gpu1->vram_regions[0];
+	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K,
+			    0, DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
+			    gpu0->fd, 0, gpu0->vram_regions[0], exec_queue[0],
+			    local_fd, local_vram);
+
+	setup_sync(&sync[0], &sync_addr[0], BIND_SYNC_VAL);
+	setup_sync(&sync[1], &sync_addr[1], BIND_SYNC_VAL);
+
+	/* For simultaneous access need to call xe_wait_ufence for both gpus after prefetch */
+	if (!no_prefetch) {
+		xe_vm_prefetch_async(gpu0->fd, vm[0], 0, 0, addr,
+				     SZ_4K, &sync[0], 1,
+				     DRM_XE_CONSULT_MEM_ADVISE_PREF_LOC);
+
+		xe_vm_prefetch_async(gpu1->fd, vm[1], 0, 0, addr,
+				     SZ_4K, &sync[1], 1,
+				     DRM_XE_CONSULT_MEM_ADVISE_PREF_LOC);
+
+		if (*sync_addr[0] != BIND_SYNC_VAL)
+			xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr[0], BIND_SYNC_VAL, exec_queue[0],
+				       NSEC_PER_SEC * 10);
+		free((void *)sync_addr[0]);
+		if (*sync_addr[1] != BIND_SYNC_VAL)
+			xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr[1], BIND_SYNC_VAL, exec_queue[1],
+				       NSEC_PER_SEC * 10);
+		free((void *)sync_addr[1]);
+	}
+
+	if (no_prefetch) {
+		free((void *)sync_addr[0]);
+		free((void *)sync_addr[1]);
+	}
+
+	for (int i = 0; i < 1; i++) {
+		sync_addr[0] = (void *)((char *)batch_addr[0] + SZ_4K);
+		sync[0].addr = to_user_pointer((uint64_t *)sync_addr[0]);
+		sync[0].timeline_value = EXEC_SYNC_VAL;
+
+		sync_addr[1] = (void *)((char *)batch_addr[1] + SZ_4K);
+		sync[1].addr = to_user_pointer((uint64_t *)sync_addr[1]);
+		sync[1].timeline_value = EXEC_SYNC_VAL;
+		*sync_addr[0] = 0;
+		*sync_addr[1] = 0;
+
+		xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr[0], &sync[0], 1);
+		if (*sync_addr[0] != EXEC_SYNC_VAL)
+			xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr[0], EXEC_SYNC_VAL, exec_queue[0],
+				       NSEC_PER_SEC * 10);
+		xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr[1], &sync[1], 1);
+		if (*sync_addr[1] != EXEC_SYNC_VAL)
+			xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr[1], EXEC_SYNC_VAL, exec_queue[1],
+				       NSEC_PER_SEC * 10);
+	}
+
+	igt_assert_eq(*(uint64_t *)addr, 20);
+
+	munmap((void *)batch_addr[0], BATCH_SIZE(gpu0->fd));
+	munmap((void *)batch_addr[1], BATCH_SIZE(gpu0->fd));
+	batch_fini(gpu0->fd, vm[0], batch_bo[0], batch_addr[0]);
+	batch_fini(gpu1->fd, vm[1], batch_bo[1], batch_addr[1]);
+	free(data);
+
+	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
+	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
+}
+
 static void
 atomic_inc_op(struct xe_svm_gpu_info *gpu0,
 	      struct xe_svm_gpu_info *gpu1,
@@ -1012,6 +1134,19 @@ multigpu_access_test(struct xe_svm_gpu_info *gpu0,
 	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
 }
 
+static void
+gpu_conflict_test_wrapper(struct xe_svm_gpu_info *src,
+			  struct xe_svm_gpu_info *dst,
+			  struct drm_xe_engine_class_instance *eci,
+			  void *extra_args)
+{
+	struct multigpu_ops_args *args = (struct multigpu_ops_args *)extra_args;
+	igt_assert(src);
+	igt_assert(dst);
+
+	conflicting_madvise(src, dst, eci, args->prefetch_req);
+}
+
 static void
 gpu_latency_test_wrapper(struct xe_svm_gpu_info *src,
 			 struct xe_svm_gpu_info *dst,
@@ -1108,6 +1243,14 @@ igt_main
 		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_coherecy_test_wrapper, &coh_args);
 	}
 
+	igt_subtest("conflicting-madvise-gpu") {
+		struct multigpu_ops_args conflict_args;
+		conflict_args.prefetch_req = 1;
+		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_conflict_test_wrapper, &conflict_args);
+		conflict_args.prefetch_req = 0;
+		for_each_gpu_pair(gpu_cnt, gpus, &eci, gpu_conflict_test_wrapper, &conflict_args);
+	}
+
 	igt_subtest("latency-multi-gpu") {
 		struct multigpu_ops_args latency_args;
 		latency_args.prefetch_req = 1;
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers
  2025-11-13 16:49 [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs nishit.sharma
@ 2025-11-13 16:49 ` nishit.sharma
  0 siblings, 0 replies; 34+ messages in thread
From: nishit.sharma @ 2025-11-13 16:49 UTC (permalink / raw)
  To: igt-dev, nishit.sharma

From: Nishit Sharma <nishit.sharma@intel.com>

Add an 'instance' parameter to xe_vm_madvise and __xe_vm_madvise to
support per-instance memory advice operations. Implement xe_vm_bind_lr_sync
and xe_vm_unbind_lr_sync helpers for synchronous VM bind/unbind using user
fences.
These changes improve memory advice and binding operations for multi-GPU
and multi-instance scenarios in IGT tests.
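
For callers the visible change is the extra trailing instance argument and
the int return. A minimal usage sketch (vram_fd, bo, bo_size and addr are
placeholders for whatever the caller has set up):

	int err;

	/* Prefer VRAM instance 1 of the device behind vram_fd for this range. */
	err = xe_vm_madvise(fd, vm, addr, SZ_2M, 0,
			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
			    vram_fd, 0 /* policy */, 1 /* region instance */);
	igt_assert_eq(err, 0);

	/* Synchronous bind/unbind using the new user-fence based helpers. */
	xe_vm_bind_lr_sync(fd, vm, bo, 0, addr, bo_size, 0 /* flags */);
	xe_vm_unbind_lr_sync(fd, vm, 0, addr, bo_size);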

Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
---
 include/drm-uapi/xe_drm.h |  4 +--
 lib/xe/xe_ioctl.c         | 53 +++++++++++++++++++++++++++++++++++----
 lib/xe/xe_ioctl.h         | 11 +++++---
 3 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/include/drm-uapi/xe_drm.h b/include/drm-uapi/xe_drm.h
index 89ab54935..3472efa58 100644
--- a/include/drm-uapi/xe_drm.h
+++ b/include/drm-uapi/xe_drm.h
@@ -2060,8 +2060,8 @@ struct drm_xe_madvise {
 			/** @preferred_mem_loc.migration_policy: Page migration policy */
 			__u16 migration_policy;
 
-			/** @preferred_mem_loc.pad : MBZ */
-			__u16 pad;
+			/** @preferred_mem_loc.region_instance: Region instance */
+			__u16 region_instance;
 
 			/** @preferred_mem_loc.reserved : Reserved */
 			__u64 reserved;
diff --git a/lib/xe/xe_ioctl.c b/lib/xe/xe_ioctl.c
index 39c4667a1..06ce8a339 100644
--- a/lib/xe/xe_ioctl.c
+++ b/lib/xe/xe_ioctl.c
@@ -687,7 +687,8 @@ int64_t xe_wait_ufence(int fd, uint64_t *addr, uint64_t value,
 }
 
 int __xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range,
-		    uint64_t ext, uint32_t type, uint32_t op_val, uint16_t policy)
+		    uint64_t ext, uint32_t type, uint32_t op_val, uint16_t policy,
+		    uint16_t instance)
 {
 	struct drm_xe_madvise madvise = {
 		.type = type,
@@ -704,6 +705,7 @@ int __xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range,
 	case DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC:
 		madvise.preferred_mem_loc.devmem_fd = op_val;
 		madvise.preferred_mem_loc.migration_policy = policy;
+		madvise.preferred_mem_loc.region_instance = instance;
 		igt_debug("madvise.preferred_mem_loc.devmem_fd = %d\n",
 			  madvise.preferred_mem_loc.devmem_fd);
 		break;
@@ -731,14 +733,55 @@ int __xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range,
  * @type: type of attribute
  * @op_val: fd/atomic value/pat index, depending upon type of operation
  * @policy: Page migration policy
+ * @instance: vram instance
  *
  * Function initializes different members of struct drm_xe_madvise and calls
  * MADVISE IOCTL .
  *
- * Asserts in case of error returned by DRM_IOCTL_XE_MADVISE.
+ * Returns an error number on failure and 0 on success.
  */
-void xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range,
-		   uint64_t ext, uint32_t type, uint32_t op_val, uint16_t policy)
+int xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range,
+		   uint64_t ext, uint32_t type, uint32_t op_val, uint16_t policy,
+		   uint16_t instance)
 {
-	igt_assert_eq(__xe_vm_madvise(fd, vm, addr, range, ext, type, op_val, policy), 0);
+	return __xe_vm_madvise(fd, vm, addr, range, ext, type, op_val, policy, instance);
+}
+
+#define        BIND_SYNC_VAL   0x686868
+void xe_vm_bind_lr_sync(int fd, uint32_t vm, uint32_t bo, uint64_t offset,
+			uint64_t addr, uint64_t size, uint32_t flags)
+{
+	volatile uint64_t *sync_addr = malloc(sizeof(*sync_addr));
+	struct drm_xe_sync sync = {
+		.flags = DRM_XE_SYNC_FLAG_SIGNAL,
+		.type = DRM_XE_SYNC_TYPE_USER_FENCE,
+		.addr = to_user_pointer((uint64_t *)sync_addr),
+		.timeline_value = BIND_SYNC_VAL,
+	};
+
+	igt_assert(!!sync_addr);
+	xe_vm_bind_async_flags(fd, vm, 0, bo, 0, addr, size, &sync, 1, flags);
+	if (*sync_addr != BIND_SYNC_VAL)
+		xe_wait_ufence(fd, (uint64_t *)sync_addr, BIND_SYNC_VAL, 0, NSEC_PER_SEC * 10);
+	/* Only free if the wait succeeds */
+	free((void *)sync_addr);
+}
+
+void xe_vm_unbind_lr_sync(int fd, uint32_t vm, uint64_t offset,
+			  uint64_t addr, uint64_t size)
+{
+	volatile uint64_t *sync_addr = malloc(sizeof(*sync_addr));
+	struct drm_xe_sync sync = {
+		.flags = DRM_XE_SYNC_FLAG_SIGNAL,
+		.type = DRM_XE_SYNC_TYPE_USER_FENCE,
+		.addr = to_user_pointer((uint64_t *)sync_addr),
+		.timeline_value = BIND_SYNC_VAL,
+	};
+
+	igt_assert(!!sync_addr);
+	*sync_addr = 0;
+	xe_vm_unbind_async(fd, vm, 0, 0, addr, size, &sync, 1);
+	if (*sync_addr != BIND_SYNC_VAL)
+		xe_wait_ufence(fd, (uint64_t *)sync_addr, BIND_SYNC_VAL, 0, NSEC_PER_SEC * 10);
+	free((void *)sync_addr);
 }
diff --git a/lib/xe/xe_ioctl.h b/lib/xe/xe_ioctl.h
index ae8a23a54..1ae38029d 100644
--- a/lib/xe/xe_ioctl.h
+++ b/lib/xe/xe_ioctl.h
@@ -100,13 +100,18 @@ int __xe_wait_ufence(int fd, uint64_t *addr, uint64_t value,
 int64_t xe_wait_ufence(int fd, uint64_t *addr, uint64_t value,
 		       uint32_t exec_queue, int64_t timeout);
 int __xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range, uint64_t ext,
-		    uint32_t type, uint32_t op_val, uint16_t policy);
-void xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range, uint64_t ext,
-		   uint32_t type, uint32_t op_val, uint16_t policy);
+		    uint32_t type, uint32_t op_val, uint16_t policy, uint16_t instance);
+int xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t range, uint64_t ext,
+		  uint32_t type, uint32_t op_val, uint16_t policy, uint16_t instance);
 int xe_vm_number_vmas_in_range(int fd, struct drm_xe_vm_query_mem_range_attr *vmas_attr);
 int xe_vm_vma_attrs(int fd, struct drm_xe_vm_query_mem_range_attr *vmas_attr,
 		    struct drm_xe_mem_range_attr *mem_attr);
 struct drm_xe_mem_range_attr
 *xe_vm_get_mem_attr_values_in_range(int fd, uint32_t vm, uint64_t start,
 				    uint64_t range, uint32_t *num_ranges);
+void xe_vm_bind_lr_sync(int fd, uint32_t vm, uint32_t bo,
+			uint64_t offset, uint64_t addr,
+			uint64_t size, uint32_t flags);
+void xe_vm_unbind_lr_sync(int fd, uint32_t vm, uint64_t offset,
+			  uint64_t addr, uint64_t size);
 #endif /* XE_IOCTL_H */
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers
  2025-11-13 16:33 ` [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers nishit.sharma
@ 2025-11-17 12:34   ` Hellstrom, Thomas
  2025-11-17 15:43     ` Sharma, Nishit
  0 siblings, 1 reply; 34+ messages in thread
From: Hellstrom, Thomas @ 2025-11-17 12:34 UTC (permalink / raw)
  To: igt-dev@lists.freedesktop.org, Sharma,  Nishit

On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
> From: Nishit Sharma <nishit.sharma@intel.com>
> 
> Add an 'instance' parameter to xe_vm_madvise and __xe_vm_madvise to
> support per-instance memory advice operations.Implement
> xe_vm_bind_lr_sync
> and xe_vm_unbind_lr_sync helpers for synchronous VM bind/unbind using
> user
> fences.
> These changes improve memory advice and binding operations for multi-
> GPU
> and multi-instance scenarios in IGT tests.

s memory advice/memory advise/ ?

Also the lr_sync part is unrelated and should be split out to a
separate patch.

Thanks,
Thomas



> 
> Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
> ---
>  include/drm-uapi/xe_drm.h |  4 +--
>  lib/xe/xe_ioctl.c         | 53 +++++++++++++++++++++++++++++++++++--
> --
>  lib/xe/xe_ioctl.h         | 11 +++++---
>  3 files changed, 58 insertions(+), 10 deletions(-)
> 
> diff --git a/include/drm-uapi/xe_drm.h b/include/drm-uapi/xe_drm.h
> index 89ab54935..3472efa58 100644
> --- a/include/drm-uapi/xe_drm.h
> +++ b/include/drm-uapi/xe_drm.h
> @@ -2060,8 +2060,8 @@ struct drm_xe_madvise {
>  			/** @preferred_mem_loc.migration_policy:
> Page migration policy */
>  			__u16 migration_policy;
>  
> -			/** @preferred_mem_loc.pad : MBZ */
> -			__u16 pad;
> +			/** @preferred_mem_loc.region_instance:
> Region instance */
> +			__u16 region_instance;
>  
>  			/** @preferred_mem_loc.reserved : Reserved
> */
>  			__u64 reserved;
> diff --git a/lib/xe/xe_ioctl.c b/lib/xe/xe_ioctl.c
> index 39c4667a1..06ce8a339 100644
> --- a/lib/xe/xe_ioctl.c
> +++ b/lib/xe/xe_ioctl.c
> @@ -687,7 +687,8 @@ int64_t xe_wait_ufence(int fd, uint64_t *addr,
> uint64_t value,
>  }
>  
>  int __xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
> range,
> -		    uint64_t ext, uint32_t type, uint32_t op_val,
> uint16_t policy)
> +		    uint64_t ext, uint32_t type, uint32_t op_val,
> uint16_t policy,
> +		    uint16_t instance)
>  {
>  	struct drm_xe_madvise madvise = {
>  		.type = type,
> @@ -704,6 +705,7 @@ int __xe_vm_madvise(int fd, uint32_t vm, uint64_t
> addr, uint64_t range,
>  	case DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC:
>  		madvise.preferred_mem_loc.devmem_fd = op_val;
>  		madvise.preferred_mem_loc.migration_policy = policy;
> +		madvise.preferred_mem_loc.region_instance =
> instance;
>  		igt_debug("madvise.preferred_mem_loc.devmem_fd =
> %d\n",
>  			  madvise.preferred_mem_loc.devmem_fd);
>  		break;
> @@ -731,14 +733,55 @@ int __xe_vm_madvise(int fd, uint32_t vm,
> uint64_t addr, uint64_t range,
>   * @type: type of attribute
>   * @op_val: fd/atomic value/pat index, depending upon type of
> operation
>   * @policy: Page migration policy
> + * @instance: vram instance
>   *
>   * Function initializes different members of struct drm_xe_madvise
> and calls
>   * MADVISE IOCTL .
>   *
> - * Asserts in case of error returned by DRM_IOCTL_XE_MADVISE.
> + * Returns error number in failure and 0 if pass.
>   */
> -void xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
> range,
> -		   uint64_t ext, uint32_t type, uint32_t op_val,
> uint16_t policy)
> +int xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
> range,
> +		   uint64_t ext, uint32_t type, uint32_t op_val,
> uint16_t policy,
> +		   uint16_t instance)
>  {
> -	igt_assert_eq(__xe_vm_madvise(fd, vm, addr, range, ext,
> type, op_val, policy), 0);
> +	return __xe_vm_madvise(fd, vm, addr, range, ext, type,
> op_val, policy, instance);
> +}
> +
> +#define        BIND_SYNC_VAL   0x686868
> +void xe_vm_bind_lr_sync(int fd, uint32_t vm, uint32_t bo, uint64_t
> offset,
> +			uint64_t addr, uint64_t size, uint32_t
> flags)
> +{
> +	volatile uint64_t *sync_addr = malloc(sizeof(*sync_addr));
> +	struct drm_xe_sync sync = {
> +		.flags = DRM_XE_SYNC_FLAG_SIGNAL,
> +		.type = DRM_XE_SYNC_TYPE_USER_FENCE,
> +		.addr = to_user_pointer((uint64_t *)sync_addr),
> +		.timeline_value = BIND_SYNC_VAL,
> +	};
> +
> +	igt_assert(!!sync_addr);
> +	xe_vm_bind_async_flags(fd, vm, 0, bo, 0, addr, size, &sync,
> 1, flags);
> +	if (*sync_addr != BIND_SYNC_VAL)
> +		xe_wait_ufence(fd, (uint64_t *)sync_addr,
> BIND_SYNC_VAL, 0, NSEC_PER_SEC * 10);
> +	/* Only free if the wait succeeds */
> +	free((void *)sync_addr);
> +}
> +
> +void xe_vm_unbind_lr_sync(int fd, uint32_t vm, uint64_t offset,
> +			  uint64_t addr, uint64_t size)
> +{
> +	volatile uint64_t *sync_addr = malloc(sizeof(*sync_addr));
> +	struct drm_xe_sync sync = {
> +		.flags = DRM_XE_SYNC_FLAG_SIGNAL,
> +		.type = DRM_XE_SYNC_TYPE_USER_FENCE,
> +		.addr = to_user_pointer((uint64_t *)sync_addr),
> +		.timeline_value = BIND_SYNC_VAL,
> +	};
> +
> +	igt_assert(!!sync_addr);
> +	*sync_addr = 0;
> +	xe_vm_unbind_async(fd, vm, 0, 0, addr, size, &sync, 1);
> +	if (*sync_addr != BIND_SYNC_VAL)
> +		xe_wait_ufence(fd, (uint64_t *)sync_addr,
> BIND_SYNC_VAL, 0, NSEC_PER_SEC * 10);
> +	free((void *)sync_addr);
>  }
> diff --git a/lib/xe/xe_ioctl.h b/lib/xe/xe_ioctl.h
> index ae8a23a54..1ae38029d 100644
> --- a/lib/xe/xe_ioctl.h
> +++ b/lib/xe/xe_ioctl.h
> @@ -100,13 +100,18 @@ int __xe_wait_ufence(int fd, uint64_t *addr,
> uint64_t value,
>  int64_t xe_wait_ufence(int fd, uint64_t *addr, uint64_t value,
>  		       uint32_t exec_queue, int64_t timeout);
>  int __xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
> range, uint64_t ext,
> -		    uint32_t type, uint32_t op_val, uint16_t
> policy);
> -void xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
> range, uint64_t ext,
> -		   uint32_t type, uint32_t op_val, uint16_t policy);
> +		    uint32_t type, uint32_t op_val, uint16_t policy,
> uint16_t instance);
> +int xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
> range, uint64_t ext,
> +		  uint32_t type, uint32_t op_val, uint16_t policy,
> uint16_t instance);
>  int xe_vm_number_vmas_in_range(int fd, struct
> drm_xe_vm_query_mem_range_attr *vmas_attr);
>  int xe_vm_vma_attrs(int fd, struct drm_xe_vm_query_mem_range_attr
> *vmas_attr,
>  		    struct drm_xe_mem_range_attr *mem_attr);
>  struct drm_xe_mem_range_attr
>  *xe_vm_get_mem_attr_values_in_range(int fd, uint32_t vm, uint64_t
> start,
>  				    uint64_t range, uint32_t
> *num_ranges);
> +void xe_vm_bind_lr_sync(int fd, uint32_t vm, uint32_t bo,
> +			uint64_t offset, uint64_t addr,
> +			uint64_t size, uint32_t flags);
> +void xe_vm_unbind_lr_sync(int fd, uint32_t vm, uint64_t offset,
> +			  uint64_t addr, uint64_t size);
>  #endif /* XE_IOCTL_H */


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 02/10] tests/intel/xe_exec_system_allocator: Add parameter in madvise call
  2025-11-13 16:33 ` [PATCH i-g-t v7 02/10] tests/intel/xe_exec_system_allocator: Add parameter in madvise call nishit.sharma
@ 2025-11-17 12:38   ` Hellstrom, Thomas
  0 siblings, 0 replies; 34+ messages in thread
From: Hellstrom, Thomas @ 2025-11-17 12:38 UTC (permalink / raw)
  To: igt-dev@lists.freedesktop.org, Sharma,  Nishit

On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
> From: Nishit Sharma <nishit.sharma@intel.com>
> 
> Parameter instance added in xe_vm_madvise() call. This parameter
> addition cause compilation issue in system_allocator test. As a fix
> 0 as instance parameter passed in xe_vm_madvise() calls.
> 
> Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>

Shouldn't this be merged with the previous patch, so that each commit
leaves the tree in a state that compiles?

/Thomas



> ---
>  tests/intel/xe_exec_system_allocator.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/tests/intel/xe_exec_system_allocator.c
> b/tests/intel/xe_exec_system_allocator.c
> index b88967e58..1e7175061 100644
> --- a/tests/intel/xe_exec_system_allocator.c
> +++ b/tests/intel/xe_exec_system_allocator.c
> @@ -1164,7 +1164,7 @@ madvise_swizzle_op_exec(int fd, uint32_t vm,
> struct test_exec_data *data,
>  	xe_vm_madvise(fd, vm, to_user_pointer(data), bo_size, 0,
>  		      DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
>  		      preferred_loc,
> -		      0);
> +		      0, 0);
>  }
>  
>  static void
> @@ -1172,7 +1172,7 @@ xe_vm_madvixe_pat_attr(int fd, uint32_t vm,
> uint64_t addr, uint64_t range,
>  		       int pat_index)
>  {
>  	xe_vm_madvise(fd, vm, addr, range, 0,
> -		      DRM_XE_MEM_RANGE_ATTR_PAT, pat_index, 0);
> +		      DRM_XE_MEM_RANGE_ATTR_PAT, pat_index, 0, 0);
>  }
>  
>  static void
> @@ -1181,7 +1181,7 @@ xe_vm_madvise_atomic_attr(int fd, uint32_t vm,
> uint64_t addr, uint64_t range,
>  {
>  	xe_vm_madvise(fd, vm, addr, range, 0,
>  		      DRM_XE_MEM_RANGE_ATTR_ATOMIC,
> -		      mem_attr, 0);
> +		      mem_attr, 0, 0);
>  }
>  
>  static void
> @@ -1190,7 +1190,7 @@ xe_vm_madvise_migrate_pages(int fd, uint32_t
> vm, uint64_t addr, uint64_t range)
>  	xe_vm_madvise(fd, vm, addr, range, 0,
>  		      DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
>  		      DRM_XE_PREFERRED_LOC_DEFAULT_SYSTEM,
> -		      DRM_XE_MIGRATE_ALL_PAGES);
> +		      DRM_XE_MIGRATE_ALL_PAGES, 0);
>  }
>  
>  static void


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 03/10] tests/intel/xe_multi_gpusvm: Add SVM multi-GPU cross-GPU memory access test
  2025-11-13 16:33 ` [PATCH i-g-t v7 03/10] tests/intel/xe_multi_gpusvm: Add SVM multi-GPU cross-GPU memory access test nishit.sharma
@ 2025-11-17 13:00   ` Hellstrom, Thomas
  2025-11-17 15:49     ` Sharma, Nishit
  0 siblings, 1 reply; 34+ messages in thread
From: Hellstrom, Thomas @ 2025-11-17 13:00 UTC (permalink / raw)
  To: igt-dev@lists.freedesktop.org, Sharma,  Nishit

On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
> From: Nishit Sharma <nishit.sharma@intel.com>
> 
> This test allocates a buffer in SVM, writes data to it from src GPU ,
> and reads/verifies
> the data from dst GPU. Optionally, the CPU also reads or modifies the
> buffer and both
> GPUs verify the results, ensuring correct cross-GPU and CPU memory
> access in a
> multi-GPU environment.
> 
> Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  tests/intel/xe_multi_gpusvm.c | 373
> ++++++++++++++++++++++++++++++++++
>  tests/meson.build             |   1 +
>  2 files changed, 374 insertions(+)
>  create mode 100644 tests/intel/xe_multi_gpusvm.c
> 
> diff --git a/tests/intel/xe_multi_gpusvm.c
> b/tests/intel/xe_multi_gpusvm.c
> new file mode 100644
> index 000000000..6614ea3d1
> --- /dev/null
> +++ b/tests/intel/xe_multi_gpusvm.c
> @@ -0,0 +1,373 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#include <unistd.h>
> +
> +#include "drmtest.h"
> +#include "igt.h"
> +#include "igt_multigpu.h"
> +
> +#include "intel_blt.h"
> +#include "intel_mocs.h"
> +#include "intel_reg.h"
> +
> +#include "xe/xe_ioctl.h"
> +#include "xe/xe_query.h"
> +#include "xe/xe_util.h"
> +
> +/**
> + * TEST: Basic multi-gpu SVM testing
> + * Category: SVM
> + * Mega feature: Compute
> + * Sub-category: Compute tests
> + * Functionality: SVM p2p access, madvise and prefetch.
> + * Test category: functionality test
> + *
> + * SUBTEST: cross-gpu-mem-access
> + * Description:
> + *      This test creates two malloced regions, places the
> destination
> + *      region both remotely and locally and copies to it. Reads
> back to
> + *      system memory and checks the result.
> + *
> + */
> +
> +#define MAX_XE_REGIONS	8
> +#define MAX_XE_GPUS 8
> +#define NUM_LOOPS 1
> +#define BATCH_SIZE(_fd) ALIGN(SZ_8K, xe_get_default_alignment(_fd))
> +#define BIND_SYNC_VAL 0x686868
> +#define EXEC_SYNC_VAL 0x676767
> +#define COPY_SIZE SZ_64M
> +
> +struct xe_svm_gpu_info {
> +	bool supports_faults;
> +	int vram_regions[MAX_XE_REGIONS];
> +	unsigned int num_regions;
> +	unsigned int va_bits;
> +	int fd;
> +};
> +
> +struct multigpu_ops_args {
> +	bool prefetch_req;
> +	bool op_mod;
> +};
> +
> +typedef void (*gpu_pair_fn) (
> +		struct xe_svm_gpu_info *src,
> +		struct xe_svm_gpu_info *dst,
> +		struct drm_xe_engine_class_instance *eci,
> +		void *extra_args
> +);
> +
> +static void for_each_gpu_pair(int num_gpus,
> +			      struct xe_svm_gpu_info *gpus,
> +			      struct drm_xe_engine_class_instance
> *eci,
> +			      gpu_pair_fn fn,
> +			      void *extra_args);
> +
> +static void gpu_mem_access_wrapper(struct xe_svm_gpu_info *src,
> +				   struct xe_svm_gpu_info *dst,
> +				   struct
> drm_xe_engine_class_instance *eci,
> +				   void *extra_args);
> +
> +static void open_pagemaps(int fd, struct xe_svm_gpu_info *info);
> +
> +static void
> +create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct
> drm_xe_engine_class_instance *eci,
> +		    uint32_t *vm, uint32_t *exec_queue)
> +{
> +	*vm = xe_vm_create(gpu->fd,
> +			   DRM_XE_VM_CREATE_FLAG_LR_MODE |
> DRM_XE_VM_CREATE_FLAG_FAULT_MODE, 0);
> +	*exec_queue = xe_exec_queue_create(gpu->fd, *vm, eci, 0);
> +	xe_vm_bind_lr_sync(gpu->fd, *vm, 0, 0, 0, 1ull << gpu-
> >va_bits,
> +			   DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR);
> +}
> +
> +static void
> +setup_sync(struct drm_xe_sync *sync, volatile uint64_t **sync_addr,
> uint64_t timeline_value)
> +{
> +	*sync_addr = malloc(sizeof(**sync_addr));
> +	igt_assert(*sync_addr);
> +	sync->flags = DRM_XE_SYNC_FLAG_SIGNAL;
> +	sync->type = DRM_XE_SYNC_TYPE_USER_FENCE;
> +	sync->addr = to_user_pointer((uint64_t *)*sync_addr);
> +	sync->timeline_value = timeline_value;
> +	**sync_addr = 0;
> +}
> +
> +static void
> +cleanup_vm_and_queue(struct xe_svm_gpu_info *gpu, uint32_t vm,
> uint32_t exec_queue)
> +{
> +	xe_vm_unbind_lr_sync(gpu->fd, vm, 0, 0, 1ull << gpu-
> >va_bits);
> +	xe_exec_queue_destroy(gpu->fd, exec_queue);
> +	xe_vm_destroy(gpu->fd, vm);
> +}
> +
> +static void xe_multigpu_madvise(int src_fd, uint32_t vm, uint64_t
> addr, uint64_t size,
> +				uint64_t ext, uint32_t type, int
> dst_fd, uint16_t policy,
> +				uint16_t instance, uint32_t
> exec_queue, int local_fd,
> +				uint16_t local_vram)
> +{
> +	int ret;
> +
> +#define SYSTEM_MEMORY	0

Please use DRM_XE_PREFERRED_LOC_DEFAULT_SYSTEM.
A new define isn't necessary, and this one is also incorrect.
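
For the system-memory fallback further down, I'd expect something along
these lines (untested sketch; that the define goes in the op_val slot
and that 0 is right for the instance argument are assumptions on my
side):

	ret = xe_vm_madvise(src_fd, vm, addr, size, ext, type,
			    DRM_XE_PREFERRED_LOC_DEFAULT_SYSTEM, policy, 0);
	igt_assert_eq(ret, 0);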

> +	if (src_fd != dst_fd) {
> +		ret = xe_vm_madvise(src_fd, vm, addr, size, ext,
> type, dst_fd, policy, instance);
> +		if (ret == -ENOLINK) {
> +			igt_info("No fast interconnect between GPU0
> and GPU1, falling back to local VRAM\n");
> +			ret = xe_vm_madvise(src_fd, vm, addr, size,
> ext, type, local_fd,
> +					    policy, local_vram);

Please use DRM_XE_PREFERRED_LOC_DEFAULT_DEVICE
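
I.e. presumably something like this for the local-VRAM fallback
(untested; whether the instance argument should still be local_vram
together with DRM_XE_PREFERRED_LOC_DEFAULT_DEVICE is an assumption on
my side):

	ret = xe_vm_madvise(src_fd, vm, addr, size, ext, type,
			    DRM_XE_PREFERRED_LOC_DEFAULT_DEVICE, policy,
			    local_vram);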


> +			if (ret) {
> +				igt_info("Local VRAM madvise failed,
> falling back to system memory\n");
> +				ret = xe_vm_madvise(src_fd, vm,
> addr, size, ext, type,
> +						    SYSTEM_MEMORY,
> policy, SYSTEM_MEMORY);

> 
> +				igt_assert_eq(ret, 0);
> +			}
> +		} else {
> +			igt_assert_eq(ret, 0);
> +		}
> +	} else {
> +		ret = xe_vm_madvise(src_fd, vm, addr, size, ext,
> type, dst_fd, policy, instance);
> +		igt_assert_eq(ret, 0);
> +
> +	}
> +
> +}
> +
> +static void xe_multigpu_prefetch(int src_fd, uint32_t vm, uint64_t
> addr, uint64_t size,
> +				 struct drm_xe_sync *sync, volatile
> uint64_t *sync_addr,
> +				 uint32_t exec_queue, bool
> prefetch_req)
> +{
> +	if (prefetch_req) {
> +		xe_vm_prefetch_async(src_fd, vm, 0, 0, addr, size,
> sync, 1,
> +				    
> DRM_XE_CONSULT_MEM_ADVISE_PREF_LOC);
> +		if (*sync_addr != sync->timeline_value)
> +			xe_wait_ufence(src_fd, (uint64_t
> *)sync_addr, sync->timeline_value,
> +				       exec_queue, NSEC_PER_SEC *
> 10);
> +	}
> +	free((void *)sync_addr);
> +}
> +
> +static void for_each_gpu_pair(int num_gpus, struct xe_svm_gpu_info
> *gpus,
> +			      struct drm_xe_engine_class_instance
> *eci,
> +			      gpu_pair_fn fn, void *extra_args)
> +{
> +	for (int src = 0; src < num_gpus; src++) {
> +		if(!gpus[src].supports_faults)
> +			continue;
> +
> +		for (int dst = 0; dst < num_gpus; dst++) {
> +			if (src == dst)
> +				continue;
> +			fn(&gpus[src], &gpus[dst], eci, extra_args);
> +		}
> +	}
> +}
> +
> +static void batch_init(int fd, uint32_t vm, uint64_t src_addr,
> +		       uint64_t dst_addr, uint64_t copy_size,
> +		       uint32_t *bo, uint64_t *addr)
> +{
> +	uint32_t width = copy_size / 256;
> +	uint32_t height = 1;
> +	uint32_t batch_bo_size = BATCH_SIZE(fd);
> +	uint32_t batch_bo;
> +	uint64_t batch_addr;
> +	void *batch;
> +	uint32_t *cmd;
> +	uint32_t mocs_index = intel_get_uc_mocs_index(fd);
> +	int i = 0;
> +
> +	batch_bo = xe_bo_create(fd, vm, batch_bo_size,
> vram_if_possible(fd, 0), 0);
> +	batch = xe_bo_map(fd, batch_bo, batch_bo_size);
> +	cmd = (uint32_t *) batch;
> +	cmd[i++] = MEM_COPY_CMD | (1 << 19);
> +	cmd[i++] = width - 1;
> +	cmd[i++] = height - 1;
> +	cmd[i++] = width - 1;
> +	cmd[i++] = width - 1;
> +	cmd[i++] = src_addr & ((1UL << 32) - 1);
> +	cmd[i++] = src_addr >> 32;
> +	cmd[i++] = dst_addr & ((1UL << 32) - 1);
> +	cmd[i++] = dst_addr >> 32;
> +	cmd[i++] = mocs_index << XE2_MEM_COPY_MOCS_SHIFT |
> mocs_index;
> +	cmd[i++] = MI_BATCH_BUFFER_END;
> +	cmd[i++] = MI_BATCH_BUFFER_END;
> +
> +	batch_addr = to_user_pointer(batch);
> +	/* Punch a gap in the SVM map where we map the batch_bo */
> +	xe_vm_bind_lr_sync(fd, vm, batch_bo, 0, batch_addr,
> batch_bo_size, 0);
> +	*bo = batch_bo;
> +	*addr = batch_addr;
> +}
> +
> +static void batch_fini(int fd, uint32_t vm, uint32_t bo, uint64_t
> addr)
> +{
> +        /* Unmap the batch bo by re-instating the SVM binding. */
> +        xe_vm_bind_lr_sync(fd, vm, 0, 0, addr, BATCH_SIZE(fd),
> +                           DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR);
> +        gem_close(fd, bo);
> +}
> +
> +
> +static void open_pagemaps(int fd, struct xe_svm_gpu_info *info)
> +{
> +	unsigned int count = 0;
> +	uint64_t regions = all_memory_regions(fd);
> +	uint32_t region;
> +
> +	xe_for_each_mem_region(fd, regions, region) {
> +		if (XE_IS_VRAM_MEMORY_REGION(fd, region)) {
> +			struct drm_xe_mem_region *mem_region =
> +				xe_mem_region(fd, 1ull << (region -
> 1));
> +			igt_assert(count < MAX_XE_REGIONS);
> +			info->vram_regions[count++] = mem_region-
> >instance;
> +		}
> +	}
> +
> +	info->num_regions = count;
> +}
> +
> +static int get_device_info(struct xe_svm_gpu_info gpus[], int
> num_gpus)
> +{
> +	int cnt;
> +	int xe;
> +	int i;
> +
> +	for (i = 0, cnt = 0 && i < 128; cnt < num_gpus; i++) {
> +		xe = __drm_open_driver_another(i, DRIVER_XE);
> +		if (xe < 0)
> +			break;
> +
> +		gpus[cnt].fd = xe;
> +		cnt++;
> +	}
> +
> +	return cnt;
> +}
> +
> +static void
> +copy_src_dst(struct xe_svm_gpu_info *gpu0,
> +	     struct xe_svm_gpu_info *gpu1,
> +	     struct drm_xe_engine_class_instance *eci,
> +	     bool prefetch_req)
> +{
> +	uint32_t vm[1];
> +	uint32_t exec_queue[2];
> +	uint32_t batch_bo;
> +	void *copy_src, *copy_dst;
> +	uint64_t batch_addr;
> +	struct drm_xe_sync sync = {};
> +	volatile uint64_t *sync_addr;
> +	int local_fd = gpu0->fd;
> +	uint16_t local_vram = gpu0->vram_regions[0];
> +
> +	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
> +
> +	/* Allocate source and destination buffers */
> +	copy_src = aligned_alloc(xe_get_default_alignment(gpu0->fd),
> SZ_64M);
> +	igt_assert(copy_src);
> +	copy_dst = aligned_alloc(xe_get_default_alignment(gpu1->fd),
> SZ_64M);
> +	igt_assert(copy_dst);
> +
> +	/*
> +	 * Initialize, map and bind the batch bo. Note that Xe
> doesn't seem to enjoy
> +	 * batch buffer memory accessed over PCIe p2p.
> +	 */
> +	batch_init(gpu0->fd, vm[0], to_user_pointer(copy_src),
> to_user_pointer(copy_dst),
> +		   COPY_SIZE, &batch_bo, &batch_addr);
> +
> +	/* Fill the source with a pattern, clear the destination. */
> +	memset(copy_src, 0x67, COPY_SIZE);
> +	memset(copy_dst, 0x0, COPY_SIZE);
> +
> +	xe_multigpu_madvise(gpu0->fd, vm[0],
> to_user_pointer(copy_dst), COPY_SIZE,
> +			     0, DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> +			     gpu1->fd, 0, gpu1->vram_regions[0],
> exec_queue[0],
> +			     local_fd, local_vram);
> +
> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
> +	xe_multigpu_prefetch(gpu0->fd, vm[0],
> to_user_pointer(copy_dst), COPY_SIZE, &sync,
> +			     sync_addr, exec_queue[0],
> prefetch_req);
> +
> +	sync_addr = (void *)((char *)batch_addr + SZ_4K);
> +	sync.addr = to_user_pointer((uint64_t *)sync_addr);
> +	sync.timeline_value = EXEC_SYNC_VAL;
> +	*sync_addr = 0;
> +
> +	/* Execute a GPU copy. */
> +	xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
> +	if (*sync_addr != EXEC_SYNC_VAL)
> +		xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr,
> EXEC_SYNC_VAL, exec_queue[0],
> +			       NSEC_PER_SEC * 10);
> +
> +	igt_assert(memcmp(copy_src, copy_dst, COPY_SIZE) == 0);
> +
> +	free(copy_dst);
> +	free(copy_src);
> +	munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
> +	batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
> +	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
> +}
> +
> +static void
> +gpu_mem_access_wrapper(struct xe_svm_gpu_info *src,
> +		       struct xe_svm_gpu_info *dst,
> +		       struct drm_xe_engine_class_instance *eci,
> +		       void *extra_args)
> +{
> +	struct multigpu_ops_args *args = (struct multigpu_ops_args
> *)extra_args;
> +	igt_assert(src);
> +	igt_assert(dst);
> +
> +	copy_src_dst(src, dst, eci, args->prefetch_req);
> +}
> +
> +igt_main
> +{
> +	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
> +	struct xe_device *xe;
> +	int gpu, gpu_cnt;
> +
> +	struct drm_xe_engine_class_instance eci = {
> +                .engine_class = DRM_XE_ENGINE_CLASS_COPY,
> +        };
> +
> +	igt_fixture {
> +		gpu_cnt = get_device_info(gpus, ARRAY_SIZE(gpus));
> +		igt_skip_on(gpu_cnt < 2);
> +
> +		for (gpu = 0; gpu < gpu_cnt; ++gpu) {
> +			igt_assert(gpu < MAX_XE_GPUS);
> +
> +			open_pagemaps(gpus[gpu].fd, &gpus[gpu]);
> +			/* NOTE! inverted return value. */
> +			gpus[gpu].supports_faults =
> !xe_supports_faults(gpus[gpu].fd);
> +			fprintf(stderr, "GPU %u has %u VRAM
> regions%s, and %s SVM VMs.\n",
> +				gpu, gpus[gpu].num_regions,
> +				gpus[gpu].num_regions != 1 ? "s" :
> "",
> +				gpus[gpu].supports_faults ?
> "supports" : "doesn't support");
> +
> +			xe = xe_device_get(gpus[gpu].fd);
> +			gpus[gpu].va_bits = xe->va_bits;
> +		}
> +	}
> +
> +	igt_describe("gpu-gpu write-read");
> +	igt_subtest("cross-gpu-mem-access") {
> +		struct multigpu_ops_args op_args;
> +		op_args.prefetch_req = 1;
> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_mem_access_wrapper, &op_args);
> +		op_args.prefetch_req = 0;
> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_mem_access_wrapper, &op_args);

Wouldn't separate subtests make sense here, like many other tests that
define a base test with variants indicated by an unsigned long flags
field?

So we'd have cross-gpu-mem-access-%s, where %s can be "basic" or
"prefetch"?

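Something like this untested sketch is what I have in mind; the PREFETCH
flag name and the sections[] table are made up for illustration:

#define PREFETCH	(0x1 << 0)

	const struct section {
		const char *name;
		unsigned long flags;
	} sections[] = {
		{ "basic", 0 },
		{ "prefetch", PREFETCH },
		{ NULL },
	};

	for (const struct section *s = sections; s->name; s++) {
		igt_subtest_f("cross-gpu-mem-access-%s", s->name) {
			struct multigpu_ops_args op_args = {
				.prefetch_req = !!(s->flags & PREFETCH),
			};

			for_each_gpu_pair(gpu_cnt, gpus, &eci,
					  gpu_mem_access_wrapper, &op_args);
		}
	}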

> +	}
> +
> +	igt_fixture {
> +		int cnt;
> +
> +		for (cnt = 0; cnt < gpu_cnt; cnt++)
> +			drm_close_driver(gpus[cnt].fd);
> +	}
> +}
> diff --git a/tests/meson.build b/tests/meson.build
> index 9736f2338..1209f84a4 100644
> --- a/tests/meson.build
> +++ b/tests/meson.build
> @@ -313,6 +313,7 @@ intel_xe_progs = [
>  	'xe_media_fill',
>  	'xe_mmap',
>  	'xe_module_load',
> +        'xe_multi_gpusvm',
>  	'xe_noexec_ping_pong',
>  	'xe_oa',
>  	'xe_pat',


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 04/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU atomic operations
  2025-11-13 16:33 ` [PATCH i-g-t v7 04/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU atomic operations nishit.sharma
@ 2025-11-17 13:10   ` Hellstrom, Thomas
  2025-11-17 15:50     ` Sharma, Nishit
  0 siblings, 1 reply; 34+ messages in thread
From: Hellstrom, Thomas @ 2025-11-17 13:10 UTC (permalink / raw)
  To: igt-dev@lists.freedesktop.org, Sharma,  Nishit

On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
> From: Nishit Sharma <nishit.sharma@intel.com>
> 
> This test performs atomic increment operation on a shared SVM buffer
> from both GPUs and the CPU in a multi-GPU environment. It uses
> madvise
> and prefetch to control buffer placement and verifies correctness and
> ordering of atomic updates across agents.
> 
> Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
> ---
>  tests/intel/xe_multi_gpusvm.c | 157
> +++++++++++++++++++++++++++++++++-
>  1 file changed, 156 insertions(+), 1 deletion(-)
> 
> diff --git a/tests/intel/xe_multi_gpusvm.c
> b/tests/intel/xe_multi_gpusvm.c
> index 6614ea3d1..54e036724 100644
> --- a/tests/intel/xe_multi_gpusvm.c
> +++ b/tests/intel/xe_multi_gpusvm.c
> @@ -31,6 +31,11 @@
>   *      region both remotely and locally and copies to it. Reads
> back to
>   *      system memory and checks the result.
>   *
> + * SUBTEST: atomic-inc-gpu-op
> + * Description:
> + * 	This test does atomic operation in multi-gpu by executing
> atomic
> + *	operation on GPU1 and then atomic operation on GPU2 using
> same
> + *	adress
>   */
>  
>  #define MAX_XE_REGIONS	8
> @@ -40,6 +45,7 @@
>  #define BIND_SYNC_VAL 0x686868
>  #define EXEC_SYNC_VAL 0x676767
>  #define COPY_SIZE SZ_64M
> +#define	ATOMIC_OP_VAL	56
>  
>  struct xe_svm_gpu_info {
>  	bool supports_faults;
> @@ -49,6 +55,16 @@ struct xe_svm_gpu_info {
>  	int fd;
>  };
>  
> +struct test_exec_data {
> +	uint32_t batch[32];
> +	uint64_t pad;
> +	uint64_t vm_sync;
> +	uint64_t exec_sync;
> +	uint32_t data;
> +	uint32_t expected_data;
> +	uint64_t batch_addr;
> +};
> +
>  struct multigpu_ops_args {
>  	bool prefetch_req;
>  	bool op_mod;
> @@ -72,7 +88,10 @@ static void gpu_mem_access_wrapper(struct
> xe_svm_gpu_info *src,
>  				   struct
> drm_xe_engine_class_instance *eci,
>  				   void *extra_args);
>  
> -static void open_pagemaps(int fd, struct xe_svm_gpu_info *info);
> +static void gpu_atomic_inc_wrapper(struct xe_svm_gpu_info *src,
> +				   struct xe_svm_gpu_info *dst,
> +				   struct
> drm_xe_engine_class_instance *eci,
> +				   void *extra_args);
>  
>  static void
>  create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct
> drm_xe_engine_class_instance *eci,
> @@ -166,6 +185,35 @@ static void for_each_gpu_pair(int num_gpus,
> struct xe_svm_gpu_info *gpus,
>  	}
>  }
>  
> +static void open_pagemaps(int fd, struct xe_svm_gpu_info *info);
> +
> +static void
> +atomic_batch_init(int fd, uint32_t vm, uint64_t src_addr,
> +		  uint32_t *bo, uint64_t *addr)
> +{
> +	uint32_t batch_bo_size = BATCH_SIZE(fd);
> +	uint32_t batch_bo;
> +	uint64_t batch_addr;
> +	void *batch;
> +	uint32_t *cmd;
> +	int i = 0;
> +
> +	batch_bo = xe_bo_create(fd, vm, batch_bo_size,
> vram_if_possible(fd, 0), 0);
> +	batch = xe_bo_map(fd, batch_bo, batch_bo_size);
> +	cmd = (uint32_t *)batch;
> +
> +	cmd[i++] = MI_ATOMIC | MI_ATOMIC_INC;
> +	cmd[i++] = src_addr;
> +	cmd[i++] = src_addr >> 32;
> +	cmd[i++] = MI_BATCH_BUFFER_END;
> +
> +	batch_addr = to_user_pointer(batch);
> +	/* Punch a gap in the SVM map where we map the batch_bo */
> +	xe_vm_bind_lr_sync(fd, vm, batch_bo, 0, batch_addr,
> batch_bo_size, 0);
> +	*bo = batch_bo;
> +	*addr = batch_addr;
> +}
> +
>  static void batch_init(int fd, uint32_t vm, uint64_t src_addr,
>  		       uint64_t dst_addr, uint64_t copy_size,
>  		       uint32_t *bo, uint64_t *addr)
> @@ -325,6 +373,105 @@ gpu_mem_access_wrapper(struct xe_svm_gpu_info
> *src,
>  	copy_src_dst(src, dst, eci, args->prefetch_req);
>  }
>  
> +static void
> +atomic_inc_op(struct xe_svm_gpu_info *gpu0,
> +	      struct xe_svm_gpu_info *gpu1,
> +	      struct drm_xe_engine_class_instance *eci,
> +	      bool prefetch_req)
> +{
> +	uint64_t addr;
> +	uint32_t vm[2];
> +	uint32_t exec_queue[2];
> +	uint32_t batch_bo;
> +	struct test_exec_data *data;
> +	uint64_t batch_addr;
> +	struct drm_xe_sync sync = {};
> +	volatile uint64_t *sync_addr;
> +	volatile uint32_t *shared_val;
> +
> +	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
> +	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
> +
> +	data = aligned_alloc(SZ_2M, SZ_4K);
> +	igt_assert(data);
> +	data[0].vm_sync = 0;
> +	addr = to_user_pointer(data);
> +
> +	shared_val = (volatile uint32_t *)addr;
> +	*shared_val = ATOMIC_OP_VAL - 1;
> +
> +	atomic_batch_init(gpu0->fd, vm[0], addr, &batch_bo,
> &batch_addr);
> +
> +	/* Place destination in an optionally remote location to
> test */
> +	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> +			    gpu0->fd, 0, gpu0->vram_regions[0],
> exec_queue[0],
> +			    0, 0);
> +
> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
> +	xe_multigpu_prefetch(gpu0->fd, vm[0], addr, SZ_4K, &sync,
> +			     sync_addr, exec_queue[0],
> prefetch_req);
> +
> +	sync_addr = (void *)((char *)batch_addr + SZ_4K);
> +	sync.addr = to_user_pointer((uint64_t *)sync_addr);
> +	sync.timeline_value = EXEC_SYNC_VAL;
> +	*sync_addr = 0;
> +
> +	/* Executing ATOMIC_INC on GPU0. */
> +	xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
> +	if (*sync_addr != EXEC_SYNC_VAL)
> +		xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr,
> EXEC_SYNC_VAL, exec_queue[0],
> +			       NSEC_PER_SEC * 10);
> +
> +	igt_assert_eq(*shared_val, ATOMIC_OP_VAL);
> +
> +	atomic_batch_init(gpu1->fd, vm[1], addr, &batch_bo,
> &batch_addr);
> +
> +	/* Place destination in an optionally remote location to
> test */

We're never actually using a remote location here; the destination is
always advised to the local VRAM.
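
If the intention is to also exercise a remote placement for the second
increment, I'd have expected something along these lines (untested
sketch, arguments per the helper's current signature):

	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K, 0,
			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
			    gpu0->fd, 0, gpu0->vram_regions[0], exec_queue[1],
			    gpu1->fd, gpu1->vram_regions[0]);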

> +	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K, 0,
> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> +			    gpu1->fd, 0, gpu1->vram_regions[0],
> exec_queue[0],
> +			    0, 0);



> +
> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
> +	xe_multigpu_prefetch(gpu1->fd, vm[1], addr, SZ_4K, &sync,
> +			     sync_addr, exec_queue[1],
> prefetch_req);
> +
> +	sync_addr = (void *)((char *)batch_addr + SZ_4K);
> +	sync.addr = to_user_pointer((uint64_t *)sync_addr);
> +	sync.timeline_value = EXEC_SYNC_VAL;
> +	*sync_addr = 0;
> +
> +	/* Execute ATOMIC_INC on GPU1 */
> +	xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr, &sync, 1);

If gpu1 here doesn't support faults, we shouldn't execute this.
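
E.g. a guard like this in the wrapper (or at the top of atomic_inc_op()),
before touching the second GPU at all (untested sketch):

	if (!dst->supports_faults) {
		igt_debug("dst GPU doesn't support faults, skipping pair\n");
		return;
	}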


> +	if (*sync_addr != EXEC_SYNC_VAL)
> +		xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr,
> EXEC_SYNC_VAL, exec_queue[1],
> +			       NSEC_PER_SEC * 10);
> +
> +	igt_assert_eq(*shared_val, ATOMIC_OP_VAL + 1);
> +
> +	munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
> +	batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
> +	batch_fini(gpu1->fd, vm[1], batch_bo, batch_addr);
> +	free(data);
> +
> +	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
> +	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
> +}
> +
> +static void
> +gpu_atomic_inc_wrapper(struct xe_svm_gpu_info *src,
> +		       struct xe_svm_gpu_info *dst,
> +		       struct drm_xe_engine_class_instance *eci,
> +		       void *extra_args)
> +{
> +	struct multigpu_ops_args *args = (struct multigpu_ops_args
> *)extra_args;
> +	igt_assert(src);
> +	igt_assert(dst);
> +
> +	atomic_inc_op(src, dst, eci, args->prefetch_req);
> +}
> +
>  igt_main
>  {
>  	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
> @@ -364,6 +511,14 @@ igt_main
>  		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_mem_access_wrapper, &op_args);
>  	}
>  
> +	igt_subtest("atomic-inc-gpu-op") {
> +		struct multigpu_ops_args atomic_args;
> +		atomic_args.prefetch_req = 1;
> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_atomic_inc_wrapper, &atomic_args);
> +		atomic_args.prefetch_req = 0;
> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_atomic_inc_wrapper, &atomic_args);

Same comment here as for the first test.

/Thomas



> +	}
> +
>  	igt_fixture {
>  		int cnt;
>  


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 05/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU coherency test
  2025-11-13 16:33 ` [PATCH i-g-t v7 05/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU coherency test nishit.sharma
@ 2025-11-17 14:02   ` Hellstrom, Thomas
  2025-11-17 16:18     ` Sharma, Nishit
  0 siblings, 1 reply; 34+ messages in thread
From: Hellstrom, Thomas @ 2025-11-17 14:02 UTC (permalink / raw)
  To: igt-dev@lists.freedesktop.org, Sharma,  Nishit

On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
> From: Nishit Sharma <nishit.sharma@intel.com>
> 
> This test verifies memory coherency in a multi-GPU environment using
> SVM.
> GPU 1 writes to a shared buffer, GPU 2 reads and checks for correct
> data
> without explicit synchronization, and the test is repeated with CPU
> and
> both GPUs to ensure consistent memory visibility across agents.
> 
> Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
> ---
>  tests/intel/xe_multi_gpusvm.c | 203
> +++++++++++++++++++++++++++++++++-
>  1 file changed, 201 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/intel/xe_multi_gpusvm.c
> b/tests/intel/xe_multi_gpusvm.c
> index 54e036724..6792ef72c 100644
> --- a/tests/intel/xe_multi_gpusvm.c
> +++ b/tests/intel/xe_multi_gpusvm.c
> @@ -34,8 +34,13 @@
>   * SUBTEST: atomic-inc-gpu-op
>   * Description:
>   * 	This test does atomic operation in multi-gpu by executing
> atomic
> - *	operation on GPU1 and then atomic operation on GPU2 using
> same
> - *	adress
> + * 	operation on GPU1 and then atomic operation on GPU2 using
> same
> + * 	adress
> + *
> + * SUBTEST: coherency-multi-gpu
> + * Description:
> + * 	This test checks coherency in multi-gpu by writing from GPU0
> + * 	reading from GPU1 and verify and repeating with CPU and both
> GPUs
>   */
>  
>  #define MAX_XE_REGIONS	8
> @@ -93,6 +98,11 @@ static void gpu_atomic_inc_wrapper(struct
> xe_svm_gpu_info *src,
>  				   struct
> drm_xe_engine_class_instance *eci,
>  				   void *extra_args);
>  
> +static void gpu_coherecy_test_wrapper(struct xe_svm_gpu_info *src,
> +				      struct xe_svm_gpu_info *dst,
> +				      struct
> drm_xe_engine_class_instance *eci,
> +				      void *extra_args);
> +
>  static void
>  create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct
> drm_xe_engine_class_instance *eci,
>  		    uint32_t *vm, uint32_t *exec_queue)
> @@ -214,6 +224,35 @@ atomic_batch_init(int fd, uint32_t vm, uint64_t
> src_addr,
>  	*addr = batch_addr;
>  }
>  
> +static void
> +store_dword_batch_init(int fd, uint32_t vm, uint64_t src_addr,
> +                       uint32_t *bo, uint64_t *addr, int value)
> +{
> +        uint32_t batch_bo_size = BATCH_SIZE(fd);
> +        uint32_t batch_bo;
> +        uint64_t batch_addr;
> +        void *batch;
> +        uint32_t *cmd;
> +        int i = 0;
> +
> +        batch_bo = xe_bo_create(fd, vm, batch_bo_size,
> vram_if_possible(fd, 0), 0);
> +        batch = xe_bo_map(fd, batch_bo, batch_bo_size);
> +        cmd = (uint32_t *) batch;
> +
> +        cmd[i++] = MI_STORE_DWORD_IMM_GEN4;
> +        cmd[i++] = src_addr;
> +        cmd[i++] = src_addr >> 32;
> +        cmd[i++] = value;
> +        cmd[i++] = MI_BATCH_BUFFER_END;
> +
> +        batch_addr = to_user_pointer(batch);
> +
> +        /* Punch a gap in the SVM map where we map the batch_bo */
> +        xe_vm_bind_lr_sync(fd, vm, batch_bo, 0, batch_addr,
> batch_bo_size, 0);
> +        *bo = batch_bo;
> +        *addr = batch_addr;
> +}
> +
>  static void batch_init(int fd, uint32_t vm, uint64_t src_addr,
>  		       uint64_t dst_addr, uint64_t copy_size,
>  		       uint32_t *bo, uint64_t *addr)
> @@ -373,6 +412,143 @@ gpu_mem_access_wrapper(struct xe_svm_gpu_info
> *src,
>  	copy_src_dst(src, dst, eci, args->prefetch_req);
>  }
>  
> +static void
> +coherency_test_multigpu(struct xe_svm_gpu_info *gpu0,
> +			struct xe_svm_gpu_info *gpu1,
> +			struct drm_xe_engine_class_instance *eci,
> +			bool coh_fail_set,
> +			bool prefetch_req)
> +{
> +        uint64_t addr;
> +        uint32_t vm[2];
> +        uint32_t exec_queue[2];
> +        uint32_t batch_bo, batch1_bo[2];
> +        uint64_t batch_addr, batch1_addr[2];
> +        struct drm_xe_sync sync = {};
> +        volatile uint64_t *sync_addr;
> +        int value = 60;
> +	uint64_t *data1;
> +	void *copy_dst;
> +
> +	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
> +	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
> +
> +        data1 = aligned_alloc(SZ_2M, SZ_4K);
> +	igt_assert(data1);
> +	addr = to_user_pointer(data1);
> +
> +	copy_dst = aligned_alloc(SZ_2M, SZ_4K);
> +	igt_assert(copy_dst);
> +
> +        store_dword_batch_init(gpu0->fd, vm[0], addr, &batch_bo,
> &batch_addr, value);
> +
> +        /* Place destination in GPU0 local memory location to test
> */

Indentation looks odd throughout this function. Has a formatting /
style checker been run on these patches?

> +	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> +			    gpu0->fd, 0, gpu0->vram_regions[0],
> exec_queue[0],
> +			    0, 0);
> +
> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
> +	xe_multigpu_prefetch(gpu0->fd, vm[0], addr, SZ_4K, &sync,
> +			     sync_addr, exec_queue[0],
> prefetch_req);
> +
> +        sync_addr = (void *)((char *)batch_addr + SZ_4K);
> +        sync.addr = to_user_pointer((uint64_t *)sync_addr);
> +        sync.timeline_value = EXEC_SYNC_VAL;
> +        *sync_addr = 0;

> +
> +        /* Execute STORE command on GPU0 */
> +        xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
> +        if (*sync_addr != EXEC_SYNC_VAL)
> +                xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr,
> EXEC_SYNC_VAL, exec_queue[0],
> +			       NSEC_PER_SEC * 10);
> +
> +        igt_assert_eq(*(uint64_t *)addr, value);

This assert will cause a CPU read, which migrates the data back to
system memory, so it's perhaps not ideal if we want to test coherency
across GPUs?


> +
> +	/* Creating batch for GPU1 using addr as Src which have
> value from GPU0 */
> +	batch_init(gpu1->fd, vm[1], addr, to_user_pointer(copy_dst),
> +		   SZ_4K, &batch_bo, &batch_addr);
> +
> +        /* Place destination in GPU1 local memory location to test
> */
> +	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K, 0,
> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> +			    gpu1->fd, 0, gpu1->vram_regions[0],
> exec_queue[1],
> +			    0, 0);
> +
> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
> +	xe_multigpu_prefetch(gpu1->fd, vm[1], addr, SZ_4K, &sync,
> +			     sync_addr, exec_queue[1],
> prefetch_req);
> +
> +        sync_addr = (void *)((char *)batch_addr + SZ_4K);
> +        sync.addr = to_user_pointer((uint64_t *)sync_addr);
> +        sync.timeline_value = EXEC_SYNC_VAL;
> +        *sync_addr = 0;
> +
> +        /* Execute COPY command on GPU1 */
> +        xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr, &sync, 1);
> +        if (*sync_addr != EXEC_SYNC_VAL)
> +                xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr,
> EXEC_SYNC_VAL, exec_queue[1],
> +			       NSEC_PER_SEC * 10);
> +
> +        igt_assert_eq(*(uint64_t *)copy_dst, value);
> +
> +        /* CPU writes 10, memset set bytes no integer hence memset
> fills 4 bytes with 0x0A */
> +        memset((void *)(uintptr_t)addr, 10, sizeof(int));
> +        igt_assert_eq(*(uint64_t *)addr, 0x0A0A0A0A);
> +
> +	if (coh_fail_set) {
> +		igt_info("coherency fail impl\n");
> +
> +		/* Coherency fail scenario */
> +		store_dword_batch_init(gpu0->fd, vm[0], addr,
> &batch1_bo[0], &batch1_addr[0], value + 10);
> +		store_dword_batch_init(gpu1->fd, vm[1], addr,
> &batch1_bo[1], &batch1_addr[1], value + 20);
> +
> +		sync_addr = (void *)((char *)batch1_addr[0] +
> SZ_4K);
> +		sync.addr = to_user_pointer((uint64_t *)sync_addr);
> +		sync.timeline_value = EXEC_SYNC_VAL;
> +		*sync_addr = 0;
> +
> +		/* Execute STORE command on GPU1 */
> +		xe_exec_sync(gpu0->fd, exec_queue[0],
> batch1_addr[0], &sync, 1);
> +		if (*sync_addr != EXEC_SYNC_VAL)
> +			xe_wait_ufence(gpu0->fd, (uint64_t
> *)sync_addr, EXEC_SYNC_VAL, exec_queue[0],
> +				       NSEC_PER_SEC * 10);
> +
> +		sync_addr = (void *)((char *)batch1_addr[1] +
> SZ_4K);
> +		sync.addr = to_user_pointer((uint64_t *)sync_addr);
> +		sync.timeline_value = EXEC_SYNC_VAL;
> +		*sync_addr = 0;
> +
> +		/* Execute STORE command on GPU2 */
> +		xe_exec_sync(gpu1->fd, exec_queue[1],
> batch1_addr[1], &sync, 1);
> +		if (*sync_addr != EXEC_SYNC_VAL)
> +			xe_wait_ufence(gpu1->fd, (uint64_t
> *)sync_addr, EXEC_SYNC_VAL, exec_queue[1],
> +				       NSEC_PER_SEC * 10);
> +
> +		igt_warn_on_f(*(uint64_t *)addr != (value + 10),
> +			      "GPU2(dst_gpu] has overwritten value
> at addr\n");
Parenthesis mismatch in the warning string ("GPU2(dst_gpu]").

BTW, isn't gpu2 supposed to overwrite the value here? Perhaps I'm
missing something?
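
If the second store is indeed expected to win, I'd have expected the
check to be roughly (untested):

	igt_warn_on_f(*(uint64_t *)addr != (value + 20),
		      "GPU1 (dst gpu) did not overwrite the value at addr\n");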

Also, regarding the previous comment WRT naming: the code mixes
gpu0/gpu1 and GPU1/GPU2. Shouldn't we try to be consistent here to
avoid confusion?

/Thomas



> +
> +		munmap((void *)batch1_addr[0], BATCH_SIZE(gpu0-
> >fd));
> +		munmap((void *)batch1_addr[1], BATCH_SIZE(gpu1-
> >fd));
> +
> +		batch_fini(gpu0->fd, vm[0], batch1_bo[0],
> batch1_addr[0]);
> +		batch_fini(gpu1->fd, vm[1], batch1_bo[1],
> batch1_addr[1]);
> +	}
> +
> +        /* CPU writes 11, memset set bytes no integer hence memset
> fills 4 bytes with 0x0B */
> +        memset((void *)(uintptr_t)addr, 11, sizeof(int));
> +        igt_assert_eq(*(uint64_t *)addr, 0x0B0B0B0B);
> +
> +        munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
> +        batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
> +        batch_fini(gpu1->fd, vm[1], batch_bo, batch_addr);
> +        free(data1);
> +	free(copy_dst);
> +
> +	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
> +	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
> +}
> +
>  static void
>  atomic_inc_op(struct xe_svm_gpu_info *gpu0,
>  	      struct xe_svm_gpu_info *gpu1,
> @@ -472,6 +648,19 @@ gpu_atomic_inc_wrapper(struct xe_svm_gpu_info
> *src,
>  	atomic_inc_op(src, dst, eci, args->prefetch_req);
>  }
>  
> +static void
> +gpu_coherecy_test_wrapper(struct xe_svm_gpu_info *src,
> +			  struct xe_svm_gpu_info *dst,
> +			  struct drm_xe_engine_class_instance *eci,
> +			  void *extra_args)
> +{
> +	struct multigpu_ops_args *args = (struct multigpu_ops_args
> *)extra_args;
> +	igt_assert(src);
> +	igt_assert(dst);
> +
> +	coherency_test_multigpu(src, dst, eci, args->op_mod, args-
> >prefetch_req);
> +}
> +
>  igt_main
>  {
>  	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
> @@ -519,6 +708,16 @@ igt_main
>  		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_atomic_inc_wrapper, &atomic_args);
>  	}
>  
> +	igt_subtest("coherency-multi-gpu") {
> +		struct multigpu_ops_args coh_args;
> +		coh_args.prefetch_req = 1;
> +		coh_args.op_mod = 0;
> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_coherecy_test_wrapper, &coh_args);
> +		coh_args.prefetch_req = 0;
> +		coh_args.op_mod = 1;
> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_coherecy_test_wrapper, &coh_args);
> +	}
> +
>  	igt_fixture {
>  		int cnt;
>  


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 06/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU performance test
  2025-11-13 16:33 ` [PATCH i-g-t v7 06/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU performance test nishit.sharma
@ 2025-11-17 14:39   ` Hellstrom, Thomas
  0 siblings, 0 replies; 34+ messages in thread
From: Hellstrom, Thomas @ 2025-11-17 14:39 UTC (permalink / raw)
  To: igt-dev@lists.freedesktop.org, Sharma,  Nishit

On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
> From: Nishit Sharma <nishit.sharma@intel.com>
> 
> This test measures latency and bandwidth for buffer access from each
> GPU
> and the CPU in a multi-GPU SVM environment. It compares performance
> for
> local versus remote access using madvise and prefetch to control
> buffer
> placement
> 
> Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
> ---
>  tests/intel/xe_multi_gpusvm.c | 181
> ++++++++++++++++++++++++++++++++++
>  1 file changed, 181 insertions(+)
> 
> diff --git a/tests/intel/xe_multi_gpusvm.c
> b/tests/intel/xe_multi_gpusvm.c
> index 6792ef72c..2c8e62e34 100644
> --- a/tests/intel/xe_multi_gpusvm.c
> +++ b/tests/intel/xe_multi_gpusvm.c
> @@ -13,6 +13,8 @@
>  #include "intel_mocs.h"
>  #include "intel_reg.h"
>  
> +#include "time.h"
> +
>  #include "xe/xe_ioctl.h"
>  #include "xe/xe_query.h"
>  #include "xe/xe_util.h"
> @@ -41,6 +43,11 @@
>   * Description:
>   * 	This test checks coherency in multi-gpu by writing from GPU0
>   * 	reading from GPU1 and verify and repeating with CPU and both
> GPUs
> + *
> + * SUBTEST: latency-multi-gpu
> + * Description:
> + * 	This test measures and compares latency and bandwidth for
> buffer access
> + * 	from CPU, local GPU, and remote GPU
>   */
>  
>  #define MAX_XE_REGIONS	8
> @@ -103,6 +110,11 @@ static void gpu_coherecy_test_wrapper(struct
> xe_svm_gpu_info *src,
>  				      struct
> drm_xe_engine_class_instance *eci,
>  				      void *extra_args);
>  
> +static void gpu_latency_test_wrapper(struct xe_svm_gpu_info *src,
> +				     struct xe_svm_gpu_info *dst,
> +				     struct
> drm_xe_engine_class_instance *eci,
> +				     void *extra_args);
> +
>  static void
>  create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct
> drm_xe_engine_class_instance *eci,
>  		    uint32_t *vm, uint32_t *exec_queue)
> @@ -197,6 +209,11 @@ static void for_each_gpu_pair(int num_gpus,
> struct xe_svm_gpu_info *gpus,
>  
>  static void open_pagemaps(int fd, struct xe_svm_gpu_info *info);
>  
> +static double time_diff(struct timespec *start, struct timespec
> *end)
> +{
> +    return (end->tv_sec - start->tv_sec) + (end->tv_nsec - start-
> >tv_nsec) / 1e9;
> +}
> +
>  static void
>  atomic_batch_init(int fd, uint32_t vm, uint64_t src_addr,
>  		  uint32_t *bo, uint64_t *addr)
> @@ -549,6 +566,147 @@ coherency_test_multigpu(struct xe_svm_gpu_info
> *gpu0,
>  	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
>  }
>  
> +static void
> +latency_test_multigpu(struct xe_svm_gpu_info *gpu0,
> +		      struct xe_svm_gpu_info *gpu1,
> +		      struct drm_xe_engine_class_instance *eci,
> +		      bool remote_copy,
> +		      bool prefetch_req)
> +{
> +        uint64_t addr;
> +        uint32_t vm[2];
> +        uint32_t exec_queue[2];
> +        uint32_t batch_bo;
> +        uint8_t *copy_dst;
> +        uint64_t batch_addr;
> +        struct drm_xe_sync sync = {};
> +        volatile uint64_t *sync_addr;
> +        int value = 60;
> +        int shared_val[4];
> +        struct test_exec_data *data;
> +	struct timespec t_start, t_end;
> +	double cpu_latency, gpu1_latency, gpu2_latency;
> +	double cpu_bw, gpu1_bw, gpu2_bw;

Also here, indentation looks inconsistent throughout the function.

> +
> +	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
> +	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
> +
> +        data = aligned_alloc(SZ_2M, SZ_4K);
> +        igt_assert(data);
> +        data[0].vm_sync = 0;
> +        addr = to_user_pointer(data);
> +
> +        copy_dst = aligned_alloc(SZ_2M, SZ_4K);
> +        igt_assert(copy_dst);
> +
> +        store_dword_batch_init(gpu0->fd, vm[0], addr, &batch_bo,
> &batch_addr, value);
> +
> +	/* Measure GPU0 access latency/bandwidth */
> +	clock_gettime(CLOCK_MONOTONIC, &t_start);
> +
> +        /* GPU0(src_gpu) access */
> +	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> +			    gpu0->fd, 0, gpu0->vram_regions[0],
> exec_queue[0],
> +			    0, 0);
> +
> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
> +	xe_multigpu_prefetch(gpu0->fd, vm[0], addr, SZ_4K, &sync,
> +			     sync_addr, exec_queue[0],
> prefetch_req);
> +
> +	clock_gettime(CLOCK_MONOTONIC, &t_end);

So here we are measuring the madvise and prefetch (if any) latency
only. Is that intentional?

> +	gpu1_latency = time_diff(&t_start, &t_end);
> +	gpu1_bw = COPY_SIZE / gpu1_latency / (1024 * 1024); // MB/s

Aren't we doing a single dword store? Why are you calculating bandwidth
based on COPY_SIZE?
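
If the goal is GPU access latency, I'd expect the timed region to wrap
the actual submission and wait, and any bandwidth number to be based on
the size that is actually written, roughly (untested sketch, reusing
the locals already in the function):

	clock_gettime(CLOCK_MONOTONIC, &t_start);
	xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
	if (*sync_addr != EXEC_SYNC_VAL)
		xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr, EXEC_SYNC_VAL,
			       exec_queue[0], NSEC_PER_SEC * 10);
	clock_gettime(CLOCK_MONOTONIC, &t_end);
	gpu1_latency = time_diff(&t_start, &t_end);
	/* This is a single dword store; a bandwidth figure only makes
	 * sense for a copy of a known size, e.g. COPY_SIZE with the
	 * MEM_COPY batch. */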

> +
> +        sync_addr = (void *)((char *)batch_addr + SZ_4K);
> +        sync.addr = to_user_pointer((uint64_t *)sync_addr);
> +        sync.timeline_value = EXEC_SYNC_VAL;
> +        *sync_addr = 0;
> +
> +        /* Execute STORE command on GPU0 */
> +        xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
> +        if (*sync_addr != EXEC_SYNC_VAL)
> +                xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr,
> EXEC_SYNC_VAL, exec_queue[0],
> +			       NSEC_PER_SEC * 10);
> +

Why isn't the GPU execution included in the timing?


> +	memcpy(shared_val, (void *)addr, 4);

I think I asked about this a couple of times before, but why are you
doing a memcpy to "shared_val" here?
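
Reading the value back directly would seem simpler, e.g.:

	igt_assert_eq(*(uint32_t *)addr, value);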


> +	igt_assert_eq(shared_val[0], value);
> +
> +        /* CPU writes 10, memset set bytes no integer hence memset
> fills 4 bytes with 0x0A */
> +        memset((void *)(uintptr_t)addr, 10, sizeof(int));
> +        memcpy(shared_val, (void *)(uintptr_t)addr,
> sizeof(shared_val));
> +        igt_assert_eq(shared_val[0], 0x0A0A0A0A);


And here? Also, why this exercise of setting addr to 0x0a0a0a0a? Isn't
this a performance / latency test?

> +
> +	*(uint64_t *)addr = 50;
> +
> +	if(remote_copy) {
> +		igt_info("creating batch for COPY_CMD on GPU1\n");
> +		batch_init(gpu1->fd, vm[1], addr,
> to_user_pointer(copy_dst),
> +			   SZ_4K, &batch_bo, &batch_addr);
> +	} else {
> +		igt_info("creating batch for STORE_CMD on GPU1\n");
> +		store_dword_batch_init(gpu1->fd, vm[1], addr,
> &batch_bo, &batch_addr, value + 10);
> +	}
> +
> +	/* Measure GPU1 access latency/bandwidth */
> +	clock_gettime(CLOCK_MONOTONIC, &t_start);

Here we also include the madvise and prefetch timing. Is that
intentional?

> +
> +        /* GPU1(dst_gpu) access */
> +	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K, 0,
> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> +			    gpu1->fd, 0, gpu1->vram_regions[0],
> exec_queue[1],
> +			    0, 0);

Setting the madvise to gpu1 will migrate the data to gpu1 at prefetch
or fault time. Is that really what's intended?

> +
> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
> +	xe_multigpu_prefetch(gpu1->fd, vm[1], addr, SZ_4K, &sync,
> +			     sync_addr, exec_queue[1],
> prefetch_req);
> +
> +	clock_gettime(CLOCK_MONOTONIC, &t_end);
> +	gpu2_latency = time_diff(&t_start, &t_end);
> +	gpu2_bw = COPY_SIZE / gpu2_latency / (1024 * 1024); // MB/s

Same comment as previously. Shouldn't we be timing the GPU execution?
Also, COPY_SIZE seems wrong here.

> +
> +        sync_addr = (void *)((char *)batch_addr + SZ_4K);
> +        sync.addr = to_user_pointer((uint64_t *)sync_addr);
> +        sync.timeline_value = EXEC_SYNC_VAL;
> +        *sync_addr = 0;
> +
> +        /* Execute COPY/STORE command on GPU1 */
> +        xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr, &sync, 1);

Shouldn't this be included in the timing?

> +        if (*sync_addr != EXEC_SYNC_VAL)
> +                xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr,
> EXEC_SYNC_VAL, exec_queue[1],
> +			       NSEC_PER_SEC * 10);
> +
> +	if (!remote_copy)
> +		igt_assert_eq(*(uint64_t *)addr, value + 10);
> +	else
> +		igt_assert_eq(*(uint64_t *)copy_dst, 50);
> +
> +        /* CPU writes 11, memset set bytes no integer hence memset
> fills 4 bytes with 0x0B */
> +	/* Measure CPU access latency/bandwidth */
> +	clock_gettime(CLOCK_MONOTONIC, &t_start);
> +        memset((void *)(uintptr_t)addr, 11, sizeof(int));
> +        memcpy(shared_val, (void *)(uintptr_t)addr,
> sizeof(shared_val));
> +	clock_gettime(CLOCK_MONOTONIC, &t_end);

I don't think it's meaningful to measure a single dword CPU write and
read in this way. In theory, the CPU accesses may actually never reach
physical memory. Suggest skipping the CPU timing.
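
If CPU numbers are kept at all, they would at least need to cover the
whole buffer rather than a single int, something like (untested; note
the buffer here is only SZ_4K):

	clock_gettime(CLOCK_MONOTONIC, &t_start);
	memset((void *)(uintptr_t)addr, 0x0b, SZ_4K);
	clock_gettime(CLOCK_MONOTONIC, &t_end);
	cpu_latency = time_diff(&t_start, &t_end);
	cpu_bw = SZ_4K / cpu_latency / (1024.0 * 1024.0); /* MB/s */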

> +	cpu_latency = time_diff(&t_start, &t_end);
> +	cpu_bw = COPY_SIZE / cpu_latency / (1024 * 1024); // MB/s
> +
> +        igt_assert_eq(shared_val[0], 0x0B0B0B0B);
> +
> +	/* Print results */
> +	igt_info("CPU: Latency %.6f s, Bandwidth %.2f MB/s\n",
> cpu_latency, cpu_bw);
> +	igt_info("GPU: Latency %.6f s, Bandwidth %.2f MB/s\n",
> gpu1_latency, gpu1_bw);
> +	igt_info("GPU: Latency %.6f s, Bandwidth %.2f MB/s\n",
> gpu2_latency, gpu2_bw);
> +
> +        munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
> +        batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
> +        batch_fini(gpu1->fd, vm[1], batch_bo, batch_addr);
> +        free(data);
> +        free(copy_dst);
> +
> +	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
> +	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
> +}
> +
>  static void
>  atomic_inc_op(struct xe_svm_gpu_info *gpu0,
>  	      struct xe_svm_gpu_info *gpu1,
> @@ -661,6 +819,19 @@ gpu_coherecy_test_wrapper(struct xe_svm_gpu_info
> *src,
>  	coherency_test_multigpu(src, dst, eci, args->op_mod, args-
> >prefetch_req);
>  }
>  
> +static void
> +gpu_latency_test_wrapper(struct xe_svm_gpu_info *src,
> +			 struct xe_svm_gpu_info *dst,
> +			 struct drm_xe_engine_class_instance *eci,
> +			 void *extra_args)
> +{
> +	struct multigpu_ops_args *args = (struct multigpu_ops_args
> *)extra_args;
> +	igt_assert(src);
> +	igt_assert(dst);
> +
> +	latency_test_multigpu(src, dst, eci, args->op_mod, args-
> >prefetch_req);
> +}
> +
>  igt_main
>  {
>  	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
> @@ -718,6 +889,16 @@ igt_main
>  		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_coherecy_test_wrapper, &coh_args);
>  	}
>  
> +	igt_subtest("latency-multi-gpu") {
> +		struct multigpu_ops_args latency_args;
> +		latency_args.prefetch_req = 1;
> +		latency_args.op_mod = 1;
> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_latency_test_wrapper, &latency_args);
> +		latency_args.prefetch_req = 0;
> +		latency_args.op_mod = 0;
> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_latency_test_wrapper, &latency_args);
> +	}

Suggest using a flag variable similar to other xe igts, as discussed
previously.

/Thomas


> +
>  	igt_fixture {
>  		int cnt;
>  


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 07/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU fault handling test
  2025-11-13 16:33 ` [PATCH i-g-t v7 07/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU fault handling test nishit.sharma
@ 2025-11-17 14:48   ` Hellstrom, Thomas
  0 siblings, 0 replies; 34+ messages in thread
From: Hellstrom, Thomas @ 2025-11-17 14:48 UTC (permalink / raw)
  To: igt-dev@lists.freedesktop.org, Sharma,  Nishit

On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
> From: Nishit Sharma <nishit.sharma@intel.com>
> 
> This test intentionally triggers page faults by accessing regions
> without
> prefetch for both GPUs in a multi-GPU environment.
> 
> Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
> ---
>  tests/intel/xe_multi_gpusvm.c | 102
> ++++++++++++++++++++++++++++++++++
>  1 file changed, 102 insertions(+)
> 
> diff --git a/tests/intel/xe_multi_gpusvm.c
> b/tests/intel/xe_multi_gpusvm.c
> index 2c8e62e34..6feb543ae 100644
> --- a/tests/intel/xe_multi_gpusvm.c
> +++ b/tests/intel/xe_multi_gpusvm.c
> @@ -15,6 +15,7 @@
>  
>  #include "time.h"
>  
> +#include "xe/xe_gt.h"
>  #include "xe/xe_ioctl.h"
>  #include "xe/xe_query.h"
>  #include "xe/xe_util.h"
> @@ -48,6 +49,11 @@
>   * Description:
>   * 	This test measures and compares latency and bandwidth for
> buffer access
>   * 	from CPU, local GPU, and remote GPU
> + *
> + * SUBTEST: pagefault-multi-gpu
> + * Description:
> + * 	This test intentionally triggers page faults by accessing
> unmapped SVM
> + * 	regions from both GPUs
>   */
>  
>  #define MAX_XE_REGIONS	8
> @@ -115,6 +121,11 @@ static void gpu_latency_test_wrapper(struct
> xe_svm_gpu_info *src,
>  				     struct
> drm_xe_engine_class_instance *eci,
>  				     void *extra_args);
>  
> +static void gpu_fault_test_wrapper(struct xe_svm_gpu_info *src,
> +				   struct xe_svm_gpu_info *dst,
> +				   struct
> drm_xe_engine_class_instance *eci,
> +				   void *extra_args);
> +
>  static void
>  create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct
> drm_xe_engine_class_instance *eci,
>  		    uint32_t *vm, uint32_t *exec_queue)
> @@ -707,6 +718,76 @@ latency_test_multigpu(struct xe_svm_gpu_info
> *gpu0,
>  	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
>  }
>  
> +static void
> +pagefault_test_multigpu(struct xe_svm_gpu_info *gpu0,
> +			struct xe_svm_gpu_info *gpu1,
> +			struct drm_xe_engine_class_instance *eci,
> +			bool prefetch_req)
> +{
> +        uint64_t addr;
> +        uint32_t vm[2];
> +        uint32_t exec_queue[2];
> +        uint32_t batch_bo;
> +        uint64_t batch_addr;
> +        struct drm_xe_sync sync = {};
> +        volatile uint64_t *sync_addr;
> +        int value = 60, pf_count_1, pf_count_2;
> +        void *data;
> +	const char *pf_count_stat = "svm_pagefault_count";
> +
> +	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
> +	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
> +
> +        data = aligned_alloc(SZ_2M, SZ_4K);
> +        igt_assert(data);
> +        addr = to_user_pointer(data);
> +
> +	pf_count_1 = xe_gt_stats_get_count(gpu0->fd, eci->gt_id,
> pf_count_stat);
> +
> +	/* checking pagefault count on GPU */
> +        store_dword_batch_init(gpu0->fd, vm[0], addr, &batch_bo,
> &batch_addr, value);
> +
> +	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> +			    gpu0->fd, 0, gpu0->vram_regions[0],
> exec_queue[0],
> +			    0, 0);
> +
> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
> +	xe_multigpu_prefetch(gpu0->fd, vm[0], addr, SZ_4K, &sync,
> +			     sync_addr, exec_queue[0],
> prefetch_req);
> +
> +        sync_addr = (void *)((char *)batch_addr + SZ_4K);
> +        sync.addr = to_user_pointer((uint64_t *)sync_addr);
> +        sync.timeline_value = EXEC_SYNC_VAL;
> +        *sync_addr = 0;

Indentation appears odd across the function.

> +
> +	/* Execute STORE command on GPU */
> +        xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
> +        if (*sync_addr != EXEC_SYNC_VAL)
> +                xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr,
> EXEC_SYNC_VAL, exec_queue[0],
> +			       NSEC_PER_SEC * 10);
> +
> +	pf_count_2 = xe_gt_stats_get_count(gpu0->fd, eci->gt_id,
> pf_count_stat);
> +
> +	if (pf_count_2 != pf_count_1) {
> +		igt_warn("GPU pf: pf_count_2(%d) != pf_count_1(%d)
> prefetch_req :%d\n",
> +			 pf_count_2, pf_count_1, prefetch_req);
> +	}
> +
> +        igt_assert_eq(*(uint64_t *)addr, value);
> +
> +        /* CPU writes 11, memset set bytes no integer hence memset
> fills 4 bytes with 0x0B */
> +        memset((void *)(uintptr_t)addr, 11, sizeof(int));
> +        igt_assert_eq(*(uint64_t *)addr, 0x0B0B0B0B);

Still not clear what the above is trying to accomplish. Isn't this a
pagefault counting test?

> +
> +        munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
> +        batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
> +        free(data);

It looks like we're only testing on a single GPU here. Perhaps it would
make sense to also execute on gpu1 (perhaps without prefetch) and verify
that page faults actually happen on gpu1?
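
Roughly along these lines (untested sketch; pf_before/pf_after are new
locals, and the stored value doesn't really matter):

	pf_before = xe_gt_stats_get_count(gpu1->fd, eci->gt_id, pf_count_stat);

	store_dword_batch_init(gpu1->fd, vm[1], addr, &batch_bo, &batch_addr,
			       value + 1);

	sync_addr = (void *)((char *)batch_addr + SZ_4K);
	sync.addr = to_user_pointer((uint64_t *)sync_addr);
	sync.timeline_value = EXEC_SYNC_VAL;
	*sync_addr = 0;

	/* No madvise/prefetch here, so the store should fault on gpu1 */
	xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr, &sync, 1);
	if (*sync_addr != EXEC_SYNC_VAL)
		xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr, EXEC_SYNC_VAL,
			       exec_queue[1], NSEC_PER_SEC * 10);

	pf_after = xe_gt_stats_get_count(gpu1->fd, eci->gt_id, pf_count_stat);
	igt_assert_lt(pf_before, pf_after);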

Thanks,
Thomas


> +
> +	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
> +	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
> +}
> +
>  static void
>  atomic_inc_op(struct xe_svm_gpu_info *gpu0,
>  	      struct xe_svm_gpu_info *gpu1,
> @@ -832,6 +913,19 @@ gpu_latency_test_wrapper(struct xe_svm_gpu_info
> *src,
>  	latency_test_multigpu(src, dst, eci, args->op_mod, args-
> >prefetch_req);
>  }
>  
> +static void
> +gpu_fault_test_wrapper(struct xe_svm_gpu_info *src,
> +		       struct xe_svm_gpu_info *dst,
> +		       struct drm_xe_engine_class_instance *eci,
> +		       void *extra_args)
> +{
> +	struct multigpu_ops_args *args = (struct multigpu_ops_args
> *)extra_args;
> +	igt_assert(src);
> +	igt_assert(dst);
> +
> +        pagefault_test_multigpu(src, dst, eci, args->prefetch_req);
> +}
> +
>  igt_main
>  {
>  	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
> @@ -899,6 +993,14 @@ igt_main
>  		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_latency_test_wrapper, &latency_args);
>  	}
>  
> +	igt_subtest("pagefault-multi-gpu") {
> +		struct multigpu_ops_args fault_args;
> +		fault_args.prefetch_req = 1;
> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_fault_test_wrapper, &fault_args);
> +		fault_args.prefetch_req = 0;
> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_fault_test_wrapper, &fault_args);
> +	}
> +
>  	igt_fixture {
>  		int cnt;
>  


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 08/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU simultaneous access test
  2025-11-13 16:33 ` [PATCH i-g-t v7 08/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU simultaneous access test nishit.sharma
@ 2025-11-17 14:57   ` Hellstrom, Thomas
  0 siblings, 0 replies; 34+ messages in thread
From: Hellstrom, Thomas @ 2025-11-17 14:57 UTC (permalink / raw)
  To: igt-dev@lists.freedesktop.org, Sharma,  Nishit

On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
> From: Nishit Sharma <nishit.sharma@intel.com>
> 
> This test launches compute or copy workloads on both GPUs that access
> the same
> SVM buffer, using synchronization primitives (fences/semaphores) to
> coordinate
> access. It verifies data integrity and checks for the absence of race
> conditions
> in a multi-GPU SVM environment.
> 
> Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
> ---
>  tests/intel/xe_multi_gpusvm.c | 133
> ++++++++++++++++++++++++++++++++++
>  1 file changed, 133 insertions(+)
> 
> diff --git a/tests/intel/xe_multi_gpusvm.c
> b/tests/intel/xe_multi_gpusvm.c
> index 6feb543ae..dc2a8f9c8 100644
> --- a/tests/intel/xe_multi_gpusvm.c
> +++ b/tests/intel/xe_multi_gpusvm.c
> @@ -54,6 +54,11 @@
>   * Description:
>   * 	This test intentionally triggers page faults by accessing
> unmapped SVM
>   * 	regions from both GPUs
> + *
> + * SUBTEST: concurrent-access-multi-gpu
> + * Description:
> + * 	This tests aunches simultaneous workloads on both GPUs
> accessing the
> + * 	same SVM buffer synchronizes with fences, and verifies data
> integrity
>   */
>  
>  #define MAX_XE_REGIONS	8
> @@ -126,6 +131,11 @@ static void gpu_fault_test_wrapper(struct
> xe_svm_gpu_info *src,
>  				   struct
> drm_xe_engine_class_instance *eci,
>  				   void *extra_args);
>  
> +static void gpu_simult_test_wrapper(struct xe_svm_gpu_info *src,
> +				    struct xe_svm_gpu_info *dst,
> +				    struct
> drm_xe_engine_class_instance *eci,
> +				    void *extra_args);
> +
>  static void
>  create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct
> drm_xe_engine_class_instance *eci,
>  		    uint32_t *vm, uint32_t *exec_queue)
> @@ -900,6 +910,108 @@ gpu_coherecy_test_wrapper(struct
> xe_svm_gpu_info *src,
>  	coherency_test_multigpu(src, dst, eci, args->op_mod, args-
> >prefetch_req);
>  }
>  
> +static void
> +multigpu_access_test(struct xe_svm_gpu_info *gpu0,
> +		     struct xe_svm_gpu_info *gpu1,
> +		     struct drm_xe_engine_class_instance *eci,
> +		     bool no_prefetch)
> +{
> +	uint64_t addr;
> +	uint32_t vm[2];
> +	uint32_t exec_queue[2];
> +	uint32_t batch_bo[2];
> +	struct test_exec_data *data;
> +	uint64_t batch_addr[2];
> +	struct drm_xe_sync sync[2] = {};
> +	volatile uint64_t *sync_addr[2];
> +	volatile uint32_t *shared_val;
> +
> +	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
> +	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
> +
> +	data = aligned_alloc(SZ_2M, SZ_4K);
> +	igt_assert(data);
> +	data[0].vm_sync = 0;
> +	addr = to_user_pointer(data);
> +
> +	shared_val = (volatile uint32_t *)addr;
> +	*shared_val = ATOMIC_OP_VAL - 1;
> +
> +	atomic_batch_init(gpu0->fd, vm[0], addr, &batch_bo[0],
> &batch_addr[0]);
> +	*shared_val = ATOMIC_OP_VAL - 2;
> +	atomic_batch_init(gpu1->fd, vm[1], addr, &batch_bo[1],
> &batch_addr[1]);
> +
> +	/* Place destination in an optionally remote location to
> test */
> +	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> +			    gpu0->fd, 0, gpu0->vram_regions[0],
> exec_queue[0],
> +			    0, 0);
> +	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K, 0,
> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> +			    gpu1->fd, 0, gpu1->vram_regions[0],
> exec_queue[1],
> +			    0, 0);
> +
> +	setup_sync(&sync[0], &sync_addr[0], BIND_SYNC_VAL);
> +	setup_sync(&sync[1], &sync_addr[1], BIND_SYNC_VAL);
> +
> +	/* For simultaneous access need to call xe_wait_ufence for
> both gpus after prefetch */
> +	if(!no_prefetch) {

Here we have double negation. Perhaps invert the meaning of the
variable and call it do_prefetch.
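
I.e., something like (sketch only):

	if (do_prefetch) {
		xe_vm_prefetch_async(gpu0->fd, vm[0], 0, 0, addr, SZ_4K,
				     &sync[0], 1, DRM_XE_CONSULT_MEM_ADVISE_PREF_LOC);
		xe_vm_prefetch_async(gpu1->fd, vm[1], 0, 0, addr, SZ_4K,
				     &sync[1], 1, DRM_XE_CONSULT_MEM_ADVISE_PREF_LOC);

		if (*sync_addr[0] != BIND_SYNC_VAL)
			xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr[0],
				       BIND_SYNC_VAL, exec_queue[0],
				       NSEC_PER_SEC * 10);
		if (*sync_addr[1] != BIND_SYNC_VAL)
			xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr[1],
				       BIND_SYNC_VAL, exec_queue[1],
				       NSEC_PER_SEC * 10);
	}

	/* The fence allocations are no longer needed either way. */
	free((void *)sync_addr[0]);
	free((void *)sync_addr[1]);

which also lets the separate if (no_prefetch) block below go away.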


> +		xe_vm_prefetch_async(gpu0->fd, vm[0], 0, 0, addr,
> +				     SZ_4K, &sync[0], 1,
> +				    
> DRM_XE_CONSULT_MEM_ADVISE_PREF_LOC);
> +
> +		xe_vm_prefetch_async(gpu1->fd, vm[1], 0, 0, addr,
> +				     SZ_4K, &sync[1], 1,
> +				    
> DRM_XE_CONSULT_MEM_ADVISE_PREF_LOC);
> +
> +		if (*sync_addr[0] != BIND_SYNC_VAL)
> +			xe_wait_ufence(gpu0->fd, (uint64_t
> *)sync_addr[0], BIND_SYNC_VAL, exec_queue[0],
> +				       NSEC_PER_SEC * 10);
> +		free((void *)sync_addr[0]);
> +		if (*sync_addr[1] != BIND_SYNC_VAL)
> +			xe_wait_ufence(gpu1->fd, (uint64_t
> *)sync_addr[1], BIND_SYNC_VAL, exec_queue[1],
> +				       NSEC_PER_SEC * 10);
> +		free((void *)sync_addr[1]);
> +	}
> +
> +	if (no_prefetch) {
> +		free((void *)sync_addr[0]);
> +		free((void *)sync_addr[1]);
> +	}
> +
> +	for (int i = 0; i < 100; i++) {
> +		sync_addr[0] = (void *)((char *)batch_addr[0] +
> SZ_4K);
> +		sync[0].addr = to_user_pointer((uint64_t
> *)sync_addr[0]);
> +		sync[0].timeline_value = EXEC_SYNC_VAL;
> +
> +		sync_addr[1] = (void *)((char *)batch_addr[1] +
> SZ_4K);
> +		sync[1].addr = to_user_pointer((uint64_t
> *)sync_addr[1]);
> +		sync[1].timeline_value = EXEC_SYNC_VAL;
> +		*sync_addr[0] = 0;
> +		*sync_addr[1] = 0;
> +
> +		xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr[0],
> &sync[0], 1);
> +		if (*sync_addr[0] != EXEC_SYNC_VAL)
> +			xe_wait_ufence(gpu0->fd, (uint64_t
> *)sync_addr[0], EXEC_SYNC_VAL, exec_queue[0],
> +				       NSEC_PER_SEC * 10);
> +		xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr[1],
> &sync[1], 1);
> +		if (*sync_addr[1] != EXEC_SYNC_VAL)
> +			xe_wait_ufence(gpu1->fd, (uint64_t
> *)sync_addr[1], EXEC_SYNC_VAL, exec_queue[1],
> +				       NSEC_PER_SEC * 10);

Here you are synchronizing after each batch execution, so nothing
really runs in parallel. I suggest only synchronizing on the last
iteration and not using any sync objects on the previous iterations.
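
Something like this, maybe (rough sketch; assuming xe_exec_sync() is happy
with num_syncs == 0):

	sync_addr[0] = (void *)((char *)batch_addr[0] + SZ_4K);
	sync[0].addr = to_user_pointer((uint64_t *)sync_addr[0]);
	sync[0].timeline_value = EXEC_SYNC_VAL;
	*sync_addr[0] = 0;

	sync_addr[1] = (void *)((char *)batch_addr[1] + SZ_4K);
	sync[1].addr = to_user_pointer((uint64_t *)sync_addr[1]);
	sync[1].timeline_value = EXEC_SYNC_VAL;
	*sync_addr[1] = 0;

	for (int i = 0; i < 100; i++) {
		bool last = (i == 99);

		/* Submit back to back, attach the user fence on the last pass only. */
		xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr[0],
			     last ? &sync[0] : NULL, last ? 1 : 0);
		xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr[1],
			     last ? &sync[1] : NULL, last ? 1 : 0);
	}

	/* Wait once, after everything has been submitted. */
	if (*sync_addr[0] != EXEC_SYNC_VAL)
		xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr[0], EXEC_SYNC_VAL,
			       exec_queue[0], NSEC_PER_SEC * 10);
	if (*sync_addr[1] != EXEC_SYNC_VAL)
		xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr[1], EXEC_SYNC_VAL,
			       exec_queue[1], NSEC_PER_SEC * 10);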

Thanks,
Thomas



> +	}
> +
> +	igt_assert_eq(*(uint64_t *)addr, 254);
> +
> +	munmap((void *)batch_addr[0], BATCH_SIZE(gpu0->fd));
> +	munmap((void *)batch_addr[1], BATCH_SIZE(gpu0->fd));
> +	batch_fini(gpu0->fd, vm[0], batch_bo[0], batch_addr[0]);
> +	batch_fini(gpu1->fd, vm[1], batch_bo[1], batch_addr[1]);
> +	free(data);
> +
> +	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
> +	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
> +}
> +
>  static void
>  gpu_latency_test_wrapper(struct xe_svm_gpu_info *src,
>  			 struct xe_svm_gpu_info *dst,
> @@ -926,6 +1038,19 @@ gpu_fault_test_wrapper(struct xe_svm_gpu_info
> *src,
>          pagefault_test_multigpu(src, dst, eci, args->prefetch_req);
>  }
>  
> +static void
> +gpu_simult_test_wrapper(struct xe_svm_gpu_info *src,
> +			struct xe_svm_gpu_info *dst,
> +			struct drm_xe_engine_class_instance *eci,
> +			void *extra_args)
> +{
> +	struct multigpu_ops_args *args = (struct multigpu_ops_args
> *)extra_args;
> +	igt_assert(src);
> +	igt_assert(dst);
> +
> +	multigpu_access_test(src, dst, eci, args->prefetch_req);
> +}
> +
>  igt_main
>  {
>  	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
> @@ -1001,6 +1126,14 @@ igt_main
>  		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_fault_test_wrapper, &fault_args);
>  	}
>  
> +	igt_subtest("concurrent-access-multi-gpu") {
> +		struct multigpu_ops_args simul_args;
> +		simul_args.prefetch_req = 1;
> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_simult_test_wrapper, &simul_args);
> +		simul_args.prefetch_req = 0;
> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_simult_test_wrapper, &simul_args);
> +	}
> +
>  	igt_fixture {
>  		int cnt;
>  


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 09/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU conflicting madvise test
  2025-11-13 16:33 ` [PATCH i-g-t v7 09/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU conflicting madvise test nishit.sharma
@ 2025-11-17 15:11   ` Hellstrom, Thomas
  0 siblings, 0 replies; 34+ messages in thread
From: Hellstrom, Thomas @ 2025-11-17 15:11 UTC (permalink / raw)
  To: igt-dev@lists.freedesktop.org, Sharma,  Nishit

On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
> From: Nishit Sharma <nishit.sharma@intel.com>
> 
> This test calls madvise operations on GPU0 with the preferred
> location set
> to GPU1 and vice versa. It reports conflicts when conflicting memory
> advice
> is given for shared SVM buffers in a multi-GPU environment.
> 
> Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
> ---
>  tests/intel/xe_multi_gpusvm.c | 143
> ++++++++++++++++++++++++++++++++++
>  1 file changed, 143 insertions(+)
> 
> diff --git a/tests/intel/xe_multi_gpusvm.c
> b/tests/intel/xe_multi_gpusvm.c
> index dc2a8f9c8..afbf010e6 100644
> --- a/tests/intel/xe_multi_gpusvm.c
> +++ b/tests/intel/xe_multi_gpusvm.c
> @@ -59,6 +59,11 @@
>   * Description:
>   * 	This tests aunches simultaneous workloads on both GPUs
> accessing the
>   * 	same SVM buffer synchronizes with fences, and verifies data
> integrity
> + *
> + * SUBTEST: conflicting-madvise-gpu
> + * Description:
> + * 	This test checks conflicting madvise by allocating shared
> buffer
> + * 	prefetches from both and checks for migration conflicts
>   */
>  
>  #define MAX_XE_REGIONS	8
> @@ -69,6 +74,8 @@
>  #define EXEC_SYNC_VAL 0x676767
>  #define COPY_SIZE SZ_64M
>  #define	ATOMIC_OP_VAL	56
> +#define USER_FENCE_VALUE        0xdeadbeefdeadbeefull
> +#define FIVE_SEC                (5LL * NSEC_PER_SEC)
>  
>  struct xe_svm_gpu_info {
>  	bool supports_faults;
> @@ -136,6 +143,11 @@ static void gpu_simult_test_wrapper(struct
> xe_svm_gpu_info *src,
>  				    struct
> drm_xe_engine_class_instance *eci,
>  				    void *extra_args);
>  
> +static void gpu_conflict_test_wrapper(struct xe_svm_gpu_info *src,
> +				      struct xe_svm_gpu_info *dst,
> +				      struct
> drm_xe_engine_class_instance *eci,
> +				      void *extra_args);
> +
>  static void
>  create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct
> drm_xe_engine_class_instance *eci,
>  		    uint32_t *vm, uint32_t *exec_queue)
> @@ -798,6 +810,116 @@ pagefault_test_multigpu(struct xe_svm_gpu_info
> *gpu0,
>  	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
>  }
>  
> +#define	XE_BO_FLAG_SYSTEM	BIT(1)
> +#define XE_BO_FLAG_CPU_ADDR_MIRROR      BIT(24)
> +
> +static void
> +conflicting_madvise(struct xe_svm_gpu_info *gpu0,
> +		    struct xe_svm_gpu_info *gpu1,
> +		    struct drm_xe_engine_class_instance *eci,
> +		    bool no_prefetch)
> +{
> +	uint64_t addr;
> +	uint32_t vm[2];
> +	uint32_t exec_queue[2];
> +	uint32_t batch_bo[2];
> +	void *data;
> +	uint64_t batch_addr[2];
> +	struct drm_xe_sync sync[2] = {};
> +	volatile uint64_t *sync_addr[2];
> +	int local_fd;
> +	uint16_t local_vram;
> +
> +	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
> +	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
> +
> +	data = aligned_alloc(SZ_2M, SZ_4K);
> +	igt_assert(data);
> +	addr = to_user_pointer(data);
> +
> +	xe_vm_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
> +		      DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> +		      DRM_XE_PREFERRED_LOC_DEFAULT_SYSTEM, 0, 0);
> +
> +	store_dword_batch_init(gpu0->fd, vm[0], addr, &batch_bo[0],
> &batch_addr[0], 10);
> +	store_dword_batch_init(gpu1->fd, vm[0], addr, &batch_bo[1],
> &batch_addr[1], 20);
> +
> +	/* Place destination in an optionally remote location to
> test */
> +	local_fd = gpu0->fd;
> +	local_vram = gpu0->vram_regions[0];
> +	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K,
> +			    0, DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> +			    gpu1->fd, 0, gpu1->vram_regions[0],
> exec_queue[0],
> +			    local_fd, local_vram);
> +
> +	local_fd = gpu1->fd;
> +	local_vram = gpu1->vram_regions[0];
> +	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K,
> +			    0, DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> +			    gpu0->fd, 0, gpu0->vram_regions[0],
> exec_queue[0],
> +			    local_fd, local_vram);
> +
> +	setup_sync(&sync[0], &sync_addr[0], BIND_SYNC_VAL);
> +	setup_sync(&sync[1], &sync_addr[1], BIND_SYNC_VAL);
> +
> +	/* For simultaneous access need to call xe_wait_ufence for
> both gpus after prefetch */
> +	if(!no_prefetch) {

Again, I'd suggest a flag parameter and separate tests.

> +		xe_vm_prefetch_async(gpu0->fd, vm[0], 0, 0, addr,
> +				     SZ_4K, &sync[0], 1,
> +				    
> DRM_XE_CONSULT_MEM_ADVISE_PREF_LOC);
> +
> +		xe_vm_prefetch_async(gpu1->fd, vm[1], 0, 0, addr,
> +				     SZ_4K, &sync[1], 1,
> +				    
> DRM_XE_CONSULT_MEM_ADVISE_PREF_LOC);
> +
> +		if (*sync_addr[0] != BIND_SYNC_VAL)
> +			xe_wait_ufence(gpu0->fd, (uint64_t
> *)sync_addr[0], BIND_SYNC_VAL, exec_queue[0],
> +				       NSEC_PER_SEC * 10);
> +		free((void *)sync_addr[0]);
> +		if (*sync_addr[1] != BIND_SYNC_VAL)
> +			xe_wait_ufence(gpu1->fd, (uint64_t
> *)sync_addr[1], BIND_SYNC_VAL, exec_queue[1],
> +				       NSEC_PER_SEC * 10);
> +		free((void *)sync_addr[1]);
> +	}
> +
> +	if (no_prefetch) {
> +		free((void *)sync_addr[0]);
> +		free((void *)sync_addr[1]);
> +	}
> +
> +	for (int i = 0; i < 1; i++) {

This loop only runs once, with i == 0?

Thanks,
Thomas




> +		sync_addr[0] = (void *)((char *)batch_addr[0] +
> SZ_4K);
> +		sync[0].addr = to_user_pointer((uint64_t
> *)sync_addr[0]);
> +		sync[0].timeline_value = EXEC_SYNC_VAL;
> +
> +		sync_addr[1] = (void *)((char *)batch_addr[1] +
> SZ_4K);
> +		sync[1].addr = to_user_pointer((uint64_t
> *)sync_addr[1]);
> +		sync[1].timeline_value = EXEC_SYNC_VAL;
> +		*sync_addr[0] = 0;
> +		*sync_addr[1] = 0;
> +
> +		xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr[0],
> &sync[0], 1);
> +		if (*sync_addr[0] != EXEC_SYNC_VAL)
> +			xe_wait_ufence(gpu0->fd, (uint64_t
> *)sync_addr[0], EXEC_SYNC_VAL, exec_queue[0],
> +				       NSEC_PER_SEC * 10);
> +		xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr[1],
> &sync[1], 1);
> +		if (*sync_addr[1] != EXEC_SYNC_VAL)
> +			xe_wait_ufence(gpu1->fd, (uint64_t
> *)sync_addr[1], EXEC_SYNC_VAL, exec_queue[1],
> +				       NSEC_PER_SEC * 10);
> +	}
> +
> +	igt_assert_eq(*(uint64_t *)addr, 20);
> +
> +	munmap((void *)batch_addr[0], BATCH_SIZE(gpu0->fd));
> +	munmap((void *)batch_addr[1], BATCH_SIZE(gpu0->fd));
> +	batch_fini(gpu0->fd, vm[0], batch_bo[0], batch_addr[0]);
> +	batch_fini(gpu1->fd, vm[1], batch_bo[1], batch_addr[1]);
> +	free(data);
> +
> +	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
> +	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
> +}
> +
>  static void
>  atomic_inc_op(struct xe_svm_gpu_info *gpu0,
>  	      struct xe_svm_gpu_info *gpu1,
> @@ -1012,6 +1134,19 @@ multigpu_access_test(struct xe_svm_gpu_info
> *gpu0,
>  	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
>  }
>  
> +static void
> +gpu_conflict_test_wrapper(struct xe_svm_gpu_info *src,
> +			  struct xe_svm_gpu_info *dst,
> +			  struct drm_xe_engine_class_instance *eci,
> +			  void *extra_args)
> +{
> +	struct multigpu_ops_args *args = (struct multigpu_ops_args
> *)extra_args;
> +	igt_assert(src);
> +	igt_assert(dst);
> +
> +        conflicting_madvise(src, dst, eci, args->prefetch_req);
> +}
> +
>  static void
>  gpu_latency_test_wrapper(struct xe_svm_gpu_info *src,
>  			 struct xe_svm_gpu_info *dst,
> @@ -1108,6 +1243,14 @@ igt_main
>  		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_coherecy_test_wrapper, &coh_args);
>  	}
>  
> +	igt_subtest("conflicting-madvise-gpu") {
> +		struct multigpu_ops_args conflict_args;
> +		conflict_args.prefetch_req = 1;
> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_conflict_test_wrapper, &conflict_args);
> +		conflict_args.prefetch_req = 0;
> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_conflict_test_wrapper, &conflict_args);
> +	}
> +
>  	igt_subtest("latency-multi-gpu") {
>  		struct multigpu_ops_args latency_args;
>  		latency_args.prefetch_req = 1;


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers
  2025-11-17 12:34   ` Hellstrom, Thomas
@ 2025-11-17 15:43     ` Sharma, Nishit
  2025-11-18  9:23       ` Hellstrom, Thomas
  0 siblings, 1 reply; 34+ messages in thread
From: Sharma, Nishit @ 2025-11-17 15:43 UTC (permalink / raw)
  To: Hellstrom, Thomas, igt-dev@lists.freedesktop.org


On 11/17/2025 6:04 PM, Hellstrom, Thomas wrote:
> On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
>> From: Nishit Sharma <nishit.sharma@intel.com>
>>
>> Add an 'instance' parameter to xe_vm_madvise and __xe_vm_madvise to
>> support per-instance memory advice operations.Implement
>> xe_vm_bind_lr_sync
>> and xe_vm_unbind_lr_sync helpers for synchronous VM bind/unbind using
>> user
>> fences.
>> These changes improve memory advice and binding operations for multi-
>> GPU
>> and multi-instance scenarios in IGT tests.
> s/memory advice/memory advise/ ?
Got it. Will edit the description.
>
> Also the lr_sync part is unrelated and should be split out to a
> separate patch.

Sure, will create a separate patch for the lr_sync part. Also, should the
xe_exec_system_allocator() changes in Patch [2/10] be merged along with
the madvise changes in Patch [1/10]?


>
> Thanks,
> Thomas
>
>
>
>> Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
>> ---
>>   include/drm-uapi/xe_drm.h |  4 +--
>>   lib/xe/xe_ioctl.c         | 53 +++++++++++++++++++++++++++++++++++--
>> --
>>   lib/xe/xe_ioctl.h         | 11 +++++---
>>   3 files changed, 58 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/drm-uapi/xe_drm.h b/include/drm-uapi/xe_drm.h
>> index 89ab54935..3472efa58 100644
>> --- a/include/drm-uapi/xe_drm.h
>> +++ b/include/drm-uapi/xe_drm.h
>> @@ -2060,8 +2060,8 @@ struct drm_xe_madvise {
>>   			/** @preferred_mem_loc.migration_policy:
>> Page migration policy */
>>   			__u16 migration_policy;
>>   
>> -			/** @preferred_mem_loc.pad : MBZ */
>> -			__u16 pad;
>> +			/** @preferred_mem_loc.region_instance:
>> Region instance */
>> +			__u16 region_instance;
>>   
>>   			/** @preferred_mem_loc.reserved : Reserved
>> */
>>   			__u64 reserved;
>> diff --git a/lib/xe/xe_ioctl.c b/lib/xe/xe_ioctl.c
>> index 39c4667a1..06ce8a339 100644
>> --- a/lib/xe/xe_ioctl.c
>> +++ b/lib/xe/xe_ioctl.c
>> @@ -687,7 +687,8 @@ int64_t xe_wait_ufence(int fd, uint64_t *addr,
>> uint64_t value,
>>   }
>>   
>>   int __xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
>> range,
>> -		    uint64_t ext, uint32_t type, uint32_t op_val,
>> uint16_t policy)
>> +		    uint64_t ext, uint32_t type, uint32_t op_val,
>> uint16_t policy,
>> +		    uint16_t instance)
>>   {
>>   	struct drm_xe_madvise madvise = {
>>   		.type = type,
>> @@ -704,6 +705,7 @@ int __xe_vm_madvise(int fd, uint32_t vm, uint64_t
>> addr, uint64_t range,
>>   	case DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC:
>>   		madvise.preferred_mem_loc.devmem_fd = op_val;
>>   		madvise.preferred_mem_loc.migration_policy = policy;
>> +		madvise.preferred_mem_loc.region_instance =
>> instance;
>>   		igt_debug("madvise.preferred_mem_loc.devmem_fd =
>> %d\n",
>>   			  madvise.preferred_mem_loc.devmem_fd);
>>   		break;
>> @@ -731,14 +733,55 @@ int __xe_vm_madvise(int fd, uint32_t vm,
>> uint64_t addr, uint64_t range,
>>    * @type: type of attribute
>>    * @op_val: fd/atomic value/pat index, depending upon type of
>> operation
>>    * @policy: Page migration policy
>> + * @instance: vram instance
>>    *
>>    * Function initializes different members of struct drm_xe_madvise
>> and calls
>>    * MADVISE IOCTL .
>>    *
>> - * Asserts in case of error returned by DRM_IOCTL_XE_MADVISE.
>> + * Returns error number in failure and 0 if pass.
>>    */
>> -void xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
>> range,
>> -		   uint64_t ext, uint32_t type, uint32_t op_val,
>> uint16_t policy)
>> +int xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
>> range,
>> +		   uint64_t ext, uint32_t type, uint32_t op_val,
>> uint16_t policy,
>> +		   uint16_t instance)
>>   {
>> -	igt_assert_eq(__xe_vm_madvise(fd, vm, addr, range, ext,
>> type, op_val, policy), 0);
>> +	return __xe_vm_madvise(fd, vm, addr, range, ext, type,
>> op_val, policy, instance);
>> +}
>> +
>> +#define        BIND_SYNC_VAL   0x686868
>> +void xe_vm_bind_lr_sync(int fd, uint32_t vm, uint32_t bo, uint64_t
>> offset,
>> +			uint64_t addr, uint64_t size, uint32_t
>> flags)
>> +{
>> +	volatile uint64_t *sync_addr = malloc(sizeof(*sync_addr));
>> +	struct drm_xe_sync sync = {
>> +		.flags = DRM_XE_SYNC_FLAG_SIGNAL,
>> +		.type = DRM_XE_SYNC_TYPE_USER_FENCE,
>> +		.addr = to_user_pointer((uint64_t *)sync_addr),
>> +		.timeline_value = BIND_SYNC_VAL,
>> +	};
>> +
>> +	igt_assert(!!sync_addr);
>> +	xe_vm_bind_async_flags(fd, vm, 0, bo, 0, addr, size, &sync,
>> 1, flags);
>> +	if (*sync_addr != BIND_SYNC_VAL)
>> +		xe_wait_ufence(fd, (uint64_t *)sync_addr,
>> BIND_SYNC_VAL, 0, NSEC_PER_SEC * 10);
>> +	/* Only free if the wait succeeds */
>> +	free((void *)sync_addr);
>> +}
>> +
>> +void xe_vm_unbind_lr_sync(int fd, uint32_t vm, uint64_t offset,
>> +			  uint64_t addr, uint64_t size)
>> +{
>> +	volatile uint64_t *sync_addr = malloc(sizeof(*sync_addr));
>> +	struct drm_xe_sync sync = {
>> +		.flags = DRM_XE_SYNC_FLAG_SIGNAL,
>> +		.type = DRM_XE_SYNC_TYPE_USER_FENCE,
>> +		.addr = to_user_pointer((uint64_t *)sync_addr),
>> +		.timeline_value = BIND_SYNC_VAL,
>> +	};
>> +
>> +	igt_assert(!!sync_addr);
>> +	*sync_addr = 0;
>> +	xe_vm_unbind_async(fd, vm, 0, 0, addr, size, &sync, 1);
>> +	if (*sync_addr != BIND_SYNC_VAL)
>> +		xe_wait_ufence(fd, (uint64_t *)sync_addr,
>> BIND_SYNC_VAL, 0, NSEC_PER_SEC * 10);
>> +	free((void *)sync_addr);
>>   }
>> diff --git a/lib/xe/xe_ioctl.h b/lib/xe/xe_ioctl.h
>> index ae8a23a54..1ae38029d 100644
>> --- a/lib/xe/xe_ioctl.h
>> +++ b/lib/xe/xe_ioctl.h
>> @@ -100,13 +100,18 @@ int __xe_wait_ufence(int fd, uint64_t *addr,
>> uint64_t value,
>>   int64_t xe_wait_ufence(int fd, uint64_t *addr, uint64_t value,
>>   		       uint32_t exec_queue, int64_t timeout);
>>   int __xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
>> range, uint64_t ext,
>> -		    uint32_t type, uint32_t op_val, uint16_t
>> policy);
>> -void xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
>> range, uint64_t ext,
>> -		   uint32_t type, uint32_t op_val, uint16_t policy);
>> +		    uint32_t type, uint32_t op_val, uint16_t policy,
>> uint16_t instance);
>> +int xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
>> range, uint64_t ext,
>> +		  uint32_t type, uint32_t op_val, uint16_t policy,
>> uint16_t instance);
>>   int xe_vm_number_vmas_in_range(int fd, struct
>> drm_xe_vm_query_mem_range_attr *vmas_attr);
>>   int xe_vm_vma_attrs(int fd, struct drm_xe_vm_query_mem_range_attr
>> *vmas_attr,
>>   		    struct drm_xe_mem_range_attr *mem_attr);
>>   struct drm_xe_mem_range_attr
>>   *xe_vm_get_mem_attr_values_in_range(int fd, uint32_t vm, uint64_t
>> start,
>>   				    uint64_t range, uint32_t
>> *num_ranges);
>> +void xe_vm_bind_lr_sync(int fd, uint32_t vm, uint32_t bo,
>> +			uint64_t offset, uint64_t addr,
>> +			uint64_t size, uint32_t flags);
>> +void xe_vm_unbind_lr_sync(int fd, uint32_t vm, uint64_t offset,
>> +			  uint64_t addr, uint64_t size);
>>   #endif /* XE_IOCTL_H */

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 03/10] tests/intel/xe_multi_gpusvm: Add SVM multi-GPU cross-GPU memory access test
  2025-11-17 13:00   ` Hellstrom, Thomas
@ 2025-11-17 15:49     ` Sharma, Nishit
  2025-11-17 20:40       ` Hellstrom, Thomas
  2025-11-18  9:24       ` Hellstrom, Thomas
  0 siblings, 2 replies; 34+ messages in thread
From: Sharma, Nishit @ 2025-11-17 15:49 UTC (permalink / raw)
  To: Hellstrom, Thomas, igt-dev@lists.freedesktop.org


On 11/17/2025 6:30 PM, Hellstrom, Thomas wrote:
> On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
>> From: Nishit Sharma <nishit.sharma@intel.com>
>>
>> This test allocates a buffer in SVM, writes data to it from src GPU ,
>> and reads/verifies
>> the data from dst GPU. Optionally, the CPU also reads or modifies the
>> buffer and both
>> GPUs verify the results, ensuring correct cross-GPU and CPU memory
>> access in a
>> multi-GPU environment.
>>
>> Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
>> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> ---
>>   tests/intel/xe_multi_gpusvm.c | 373
>> ++++++++++++++++++++++++++++++++++
>>   tests/meson.build             |   1 +
>>   2 files changed, 374 insertions(+)
>>   create mode 100644 tests/intel/xe_multi_gpusvm.c
>>
>> diff --git a/tests/intel/xe_multi_gpusvm.c
>> b/tests/intel/xe_multi_gpusvm.c
>> new file mode 100644
>> index 000000000..6614ea3d1
>> --- /dev/null
>> +++ b/tests/intel/xe_multi_gpusvm.c
>> @@ -0,0 +1,373 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2023 Intel Corporation
>> + */
>> +
>> +#include <unistd.h>
>> +
>> +#include "drmtest.h"
>> +#include "igt.h"
>> +#include "igt_multigpu.h"
>> +
>> +#include "intel_blt.h"
>> +#include "intel_mocs.h"
>> +#include "intel_reg.h"
>> +
>> +#include "xe/xe_ioctl.h"
>> +#include "xe/xe_query.h"
>> +#include "xe/xe_util.h"
>> +
>> +/**
>> + * TEST: Basic multi-gpu SVM testing
>> + * Category: SVM
>> + * Mega feature: Compute
>> + * Sub-category: Compute tests
>> + * Functionality: SVM p2p access, madvise and prefetch.
>> + * Test category: functionality test
>> + *
>> + * SUBTEST: cross-gpu-mem-access
>> + * Description:
>> + *      This test creates two malloced regions, places the
>> destination
>> + *      region both remotely and locally and copies to it. Reads
>> back to
>> + *      system memory and checks the result.
>> + *
>> + */
>> +
>> +#define MAX_XE_REGIONS	8
>> +#define MAX_XE_GPUS 8
>> +#define NUM_LOOPS 1
>> +#define BATCH_SIZE(_fd) ALIGN(SZ_8K, xe_get_default_alignment(_fd))
>> +#define BIND_SYNC_VAL 0x686868
>> +#define EXEC_SYNC_VAL 0x676767
>> +#define COPY_SIZE SZ_64M
>> +
>> +struct xe_svm_gpu_info {
>> +	bool supports_faults;
>> +	int vram_regions[MAX_XE_REGIONS];
>> +	unsigned int num_regions;
>> +	unsigned int va_bits;
>> +	int fd;
>> +};
>> +
>> +struct multigpu_ops_args {
>> +	bool prefetch_req;
>> +	bool op_mod;
>> +};
>> +
>> +typedef void (*gpu_pair_fn) (
>> +		struct xe_svm_gpu_info *src,
>> +		struct xe_svm_gpu_info *dst,
>> +		struct drm_xe_engine_class_instance *eci,
>> +		void *extra_args
>> +);
>> +
>> +static void for_each_gpu_pair(int num_gpus,
>> +			      struct xe_svm_gpu_info *gpus,
>> +			      struct drm_xe_engine_class_instance
>> *eci,
>> +			      gpu_pair_fn fn,
>> +			      void *extra_args);
>> +
>> +static void gpu_mem_access_wrapper(struct xe_svm_gpu_info *src,
>> +				   struct xe_svm_gpu_info *dst,
>> +				   struct
>> drm_xe_engine_class_instance *eci,
>> +				   void *extra_args);
>> +
>> +static void open_pagemaps(int fd, struct xe_svm_gpu_info *info);
>> +
>> +static void
>> +create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct
>> drm_xe_engine_class_instance *eci,
>> +		    uint32_t *vm, uint32_t *exec_queue)
>> +{
>> +	*vm = xe_vm_create(gpu->fd,
>> +			   DRM_XE_VM_CREATE_FLAG_LR_MODE |
>> DRM_XE_VM_CREATE_FLAG_FAULT_MODE, 0);
>> +	*exec_queue = xe_exec_queue_create(gpu->fd, *vm, eci, 0);
>> +	xe_vm_bind_lr_sync(gpu->fd, *vm, 0, 0, 0, 1ull << gpu-
>>> va_bits,
>> +			   DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR);
>> +}
>> +
>> +static void
>> +setup_sync(struct drm_xe_sync *sync, volatile uint64_t **sync_addr,
>> uint64_t timeline_value)
>> +{
>> +	*sync_addr = malloc(sizeof(**sync_addr));
>> +	igt_assert(*sync_addr);
>> +	sync->flags = DRM_XE_SYNC_FLAG_SIGNAL;
>> +	sync->type = DRM_XE_SYNC_TYPE_USER_FENCE;
>> +	sync->addr = to_user_pointer((uint64_t *)*sync_addr);
>> +	sync->timeline_value = timeline_value;
>> +	**sync_addr = 0;
>> +}
>> +
>> +static void
>> +cleanup_vm_and_queue(struct xe_svm_gpu_info *gpu, uint32_t vm,
>> uint32_t exec_queue)
>> +{
>> +	xe_vm_unbind_lr_sync(gpu->fd, vm, 0, 0, 1ull << gpu-
>>> va_bits);
>> +	xe_exec_queue_destroy(gpu->fd, exec_queue);
>> +	xe_vm_destroy(gpu->fd, vm);
>> +}
>> +
>> +static void xe_multigpu_madvise(int src_fd, uint32_t vm, uint64_t
>> addr, uint64_t size,
>> +				uint64_t ext, uint32_t type, int
>> dst_fd, uint16_t policy,
>> +				uint16_t instance, uint32_t
>> exec_queue, int local_fd,
>> +				uint16_t local_vram)
>> +{
>> +	int ret;
>> +
>> +#define SYSTEM_MEMORY	0
> Please use DRM_XE_PREFERRED_LOC_DEFAULT_SYSTEM.
> A new define isn't necessary and it's also incorrect.
Sure, will use that where required.
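
So the fallback chain in xe_multigpu_madvise() would then look roughly like
this, if I understand the comments correctly (sketch):

	if (ret == -ENOLINK) {
		igt_info("No fast interconnect between GPU0 and GPU1, falling back to local VRAM\n");
		ret = xe_vm_madvise(src_fd, vm, addr, size, ext, type,
				    DRM_XE_PREFERRED_LOC_DEFAULT_DEVICE,
				    policy, local_vram);
		if (ret) {
			igt_info("Local VRAM madvise failed, falling back to system memory\n");
			ret = xe_vm_madvise(src_fd, vm, addr, size, ext, type,
					    DRM_XE_PREFERRED_LOC_DEFAULT_SYSTEM,
					    policy, 0);
			igt_assert_eq(ret, 0);
		}
	}
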
>
>> +	if (src_fd != dst_fd) {
>> +		ret = xe_vm_madvise(src_fd, vm, addr, size, ext,
>> type, dst_fd, policy, instance);
>> +		if (ret == -ENOLINK) {
>> +			igt_info("No fast interconnect between GPU0
>> and GPU1, falling back to local VRAM\n");
>> +			ret = xe_vm_madvise(src_fd, vm, addr, size,
>> ext, type, local_fd,
>> +					    policy, local_vram);
> Please use DRM_XE_PREFERRED_LOC_DEFAULT_DEVICE
>
>
>> +			if (ret) {
>> +				igt_info("Local VRAM madvise failed,
>> falling back to system memory\n");
>> +				ret = xe_vm_madvise(src_fd, vm,
>> addr, size, ext, type,
>> +						    SYSTEM_MEMORY,
>> policy, SYSTEM_MEMORY);
>> +				igt_assert_eq(ret, 0);
>> +			}
>> +		} else {
>> +			igt_assert_eq(ret, 0);
>> +		}
>> +	} else {
>> +		ret = xe_vm_madvise(src_fd, vm, addr, size, ext,
>> type, dst_fd, policy, instance);
>> +		igt_assert_eq(ret, 0);
>> +
>> +	}
>> +
>> +}
>> +
>> +static void xe_multigpu_prefetch(int src_fd, uint32_t vm, uint64_t
>> addr, uint64_t size,
>> +				 struct drm_xe_sync *sync, volatile
>> uint64_t *sync_addr,
>> +				 uint32_t exec_queue, bool
>> prefetch_req)
>> +{
>> +	if (prefetch_req) {
>> +		xe_vm_prefetch_async(src_fd, vm, 0, 0, addr, size,
>> sync, 1,
>> +				
>> DRM_XE_CONSULT_MEM_ADVISE_PREF_LOC);
>> +		if (*sync_addr != sync->timeline_value)
>> +			xe_wait_ufence(src_fd, (uint64_t
>> *)sync_addr, sync->timeline_value,
>> +				       exec_queue, NSEC_PER_SEC *
>> 10);
>> +	}
>> +	free((void *)sync_addr);
>> +}
>> +
>> +static void for_each_gpu_pair(int num_gpus, struct xe_svm_gpu_info
>> *gpus,
>> +			      struct drm_xe_engine_class_instance
>> *eci,
>> +			      gpu_pair_fn fn, void *extra_args)
>> +{
>> +	for (int src = 0; src < num_gpus; src++) {
>> +		if(!gpus[src].supports_faults)
>> +			continue;
>> +
>> +		for (int dst = 0; dst < num_gpus; dst++) {
>> +			if (src == dst)
>> +				continue;
>> +			fn(&gpus[src], &gpus[dst], eci, extra_args);
>> +		}
>> +	}
>> +}
>> +
>> +static void batch_init(int fd, uint32_t vm, uint64_t src_addr,
>> +		       uint64_t dst_addr, uint64_t copy_size,
>> +		       uint32_t *bo, uint64_t *addr)
>> +{
>> +	uint32_t width = copy_size / 256;
>> +	uint32_t height = 1;
>> +	uint32_t batch_bo_size = BATCH_SIZE(fd);
>> +	uint32_t batch_bo;
>> +	uint64_t batch_addr;
>> +	void *batch;
>> +	uint32_t *cmd;
>> +	uint32_t mocs_index = intel_get_uc_mocs_index(fd);
>> +	int i = 0;
>> +
>> +	batch_bo = xe_bo_create(fd, vm, batch_bo_size,
>> vram_if_possible(fd, 0), 0);
>> +	batch = xe_bo_map(fd, batch_bo, batch_bo_size);
>> +	cmd = (uint32_t *) batch;
>> +	cmd[i++] = MEM_COPY_CMD | (1 << 19);
>> +	cmd[i++] = width - 1;
>> +	cmd[i++] = height - 1;
>> +	cmd[i++] = width - 1;
>> +	cmd[i++] = width - 1;
>> +	cmd[i++] = src_addr & ((1UL << 32) - 1);
>> +	cmd[i++] = src_addr >> 32;
>> +	cmd[i++] = dst_addr & ((1UL << 32) - 1);
>> +	cmd[i++] = dst_addr >> 32;
>> +	cmd[i++] = mocs_index << XE2_MEM_COPY_MOCS_SHIFT |
>> mocs_index;
>> +	cmd[i++] = MI_BATCH_BUFFER_END;
>> +	cmd[i++] = MI_BATCH_BUFFER_END;
>> +
>> +	batch_addr = to_user_pointer(batch);
>> +	/* Punch a gap in the SVM map where we map the batch_bo */
>> +	xe_vm_bind_lr_sync(fd, vm, batch_bo, 0, batch_addr,
>> batch_bo_size, 0);
>> +	*bo = batch_bo;
>> +	*addr = batch_addr;
>> +}
>> +
>> +static void batch_fini(int fd, uint32_t vm, uint32_t bo, uint64_t
>> addr)
>> +{
>> +        /* Unmap the batch bo by re-instating the SVM binding. */
>> +        xe_vm_bind_lr_sync(fd, vm, 0, 0, addr, BATCH_SIZE(fd),
>> +                           DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR);
>> +        gem_close(fd, bo);
>> +}
>> +
>> +
>> +static void open_pagemaps(int fd, struct xe_svm_gpu_info *info)
>> +{
>> +	unsigned int count = 0;
>> +	uint64_t regions = all_memory_regions(fd);
>> +	uint32_t region;
>> +
>> +	xe_for_each_mem_region(fd, regions, region) {
>> +		if (XE_IS_VRAM_MEMORY_REGION(fd, region)) {
>> +			struct drm_xe_mem_region *mem_region =
>> +				xe_mem_region(fd, 1ull << (region -
>> 1));
>> +			igt_assert(count < MAX_XE_REGIONS);
>> +			info->vram_regions[count++] = mem_region-
>>> instance;
>> +		}
>> +	}
>> +
>> +	info->num_regions = count;
>> +}
>> +
>> +static int get_device_info(struct xe_svm_gpu_info gpus[], int
>> num_gpus)
>> +{
>> +	int cnt;
>> +	int xe;
>> +	int i;
>> +
>> +	for (i = 0, cnt = 0 && i < 128; cnt < num_gpus; i++) {
>> +		xe = __drm_open_driver_another(i, DRIVER_XE);
>> +		if (xe < 0)
>> +			break;
>> +
>> +		gpus[cnt].fd = xe;
>> +		cnt++;
>> +	}
>> +
>> +	return cnt;
>> +}
>> +
>> +static void
>> +copy_src_dst(struct xe_svm_gpu_info *gpu0,
>> +	     struct xe_svm_gpu_info *gpu1,
>> +	     struct drm_xe_engine_class_instance *eci,
>> +	     bool prefetch_req)
>> +{
>> +	uint32_t vm[1];
>> +	uint32_t exec_queue[2];
>> +	uint32_t batch_bo;
>> +	void *copy_src, *copy_dst;
>> +	uint64_t batch_addr;
>> +	struct drm_xe_sync sync = {};
>> +	volatile uint64_t *sync_addr;
>> +	int local_fd = gpu0->fd;
>> +	uint16_t local_vram = gpu0->vram_regions[0];
>> +
>> +	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
>> +
>> +	/* Allocate source and destination buffers */
>> +	copy_src = aligned_alloc(xe_get_default_alignment(gpu0->fd),
>> SZ_64M);
>> +	igt_assert(copy_src);
>> +	copy_dst = aligned_alloc(xe_get_default_alignment(gpu1->fd),
>> SZ_64M);
>> +	igt_assert(copy_dst);
>> +
>> +	/*
>> +	 * Initialize, map and bind the batch bo. Note that Xe
>> doesn't seem to enjoy
>> +	 * batch buffer memory accessed over PCIe p2p.
>> +	 */
>> +	batch_init(gpu0->fd, vm[0], to_user_pointer(copy_src),
>> to_user_pointer(copy_dst),
>> +		   COPY_SIZE, &batch_bo, &batch_addr);
>> +
>> +	/* Fill the source with a pattern, clear the destination. */
>> +	memset(copy_src, 0x67, COPY_SIZE);
>> +	memset(copy_dst, 0x0, COPY_SIZE);
>> +
>> +	xe_multigpu_madvise(gpu0->fd, vm[0],
>> to_user_pointer(copy_dst), COPY_SIZE,
>> +			     0, DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
>> +			     gpu1->fd, 0, gpu1->vram_regions[0],
>> exec_queue[0],
>> +			     local_fd, local_vram);
>> +
>> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
>> +	xe_multigpu_prefetch(gpu0->fd, vm[0],
>> to_user_pointer(copy_dst), COPY_SIZE, &sync,
>> +			     sync_addr, exec_queue[0],
>> prefetch_req);
>> +
>> +	sync_addr = (void *)((char *)batch_addr + SZ_4K);
>> +	sync.addr = to_user_pointer((uint64_t *)sync_addr);
>> +	sync.timeline_value = EXEC_SYNC_VAL;
>> +	*sync_addr = 0;
>> +
>> +	/* Execute a GPU copy. */
>> +	xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
>> +	if (*sync_addr != EXEC_SYNC_VAL)
>> +		xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr,
>> EXEC_SYNC_VAL, exec_queue[0],
>> +			       NSEC_PER_SEC * 10);
>> +
>> +	igt_assert(memcmp(copy_src, copy_dst, COPY_SIZE) == 0);
>> +
>> +	free(copy_dst);
>> +	free(copy_src);
>> +	munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
>> +	batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
>> +	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
>> +}
>> +
>> +static void
>> +gpu_mem_access_wrapper(struct xe_svm_gpu_info *src,
>> +		       struct xe_svm_gpu_info *dst,
>> +		       struct drm_xe_engine_class_instance *eci,
>> +		       void *extra_args)
>> +{
>> +	struct multigpu_ops_args *args = (struct multigpu_ops_args
>> *)extra_args;
>> +	igt_assert(src);
>> +	igt_assert(dst);
>> +
>> +	copy_src_dst(src, dst, eci, args->prefetch_req);
>> +}
>> +
>> +igt_main
>> +{
>> +	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
>> +	struct xe_device *xe;
>> +	int gpu, gpu_cnt;
>> +
>> +	struct drm_xe_engine_class_instance eci = {
>> +                .engine_class = DRM_XE_ENGINE_CLASS_COPY,
>> +        };
>> +
>> +	igt_fixture {
>> +		gpu_cnt = get_device_info(gpus, ARRAY_SIZE(gpus));
>> +		igt_skip_on(gpu_cnt < 2);
>> +
>> +		for (gpu = 0; gpu < gpu_cnt; ++gpu) {
>> +			igt_assert(gpu < MAX_XE_GPUS);
>> +
>> +			open_pagemaps(gpus[gpu].fd, &gpus[gpu]);
>> +			/* NOTE! inverted return value. */
>> +			gpus[gpu].supports_faults =
>> !xe_supports_faults(gpus[gpu].fd);
>> +			fprintf(stderr, "GPU %u has %u VRAM
>> regions%s, and %s SVM VMs.\n",
>> +				gpu, gpus[gpu].num_regions,
>> +				gpus[gpu].num_regions != 1 ? "s" :
>> "",
>> +				gpus[gpu].supports_faults ?
>> "supports" : "doesn't support");
>> +
>> +			xe = xe_device_get(gpus[gpu].fd);
>> +			gpus[gpu].va_bits = xe->va_bits;
>> +		}
>> +	}
>> +
>> +	igt_describe("gpu-gpu write-read");
>> +	igt_subtest("cross-gpu-mem-access") {
>> +		struct multigpu_ops_args op_args;
>> +		op_args.prefetch_req = 1;
>> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
>> gpu_mem_access_wrapper, &op_args);
>> +		op_args.prefetch_req = 0;
>> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
>> gpu_mem_access_wrapper, &op_args);
> Wouldn't a separate test make sense here, like many other tests defines
> a base test with variants that are indicated in an unsigned long flags?
>
> So we have cross-gpu-mem-access-%s where %s can take "basic" and
> "prefetch"?
Those already-defined tests do this via a struct section {} table. Let me
work on this and update.
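
Something along these lines, I guess (rough sketch; the PREFETCH flag name
is just a placeholder):

#define PREFETCH	(1 << 0)	/* placeholder flag name */

	static const struct section {
		const char *name;
		unsigned int flags;
	} sections[] = {
		{ "basic", 0 },
		{ "prefetch", PREFETCH },
		{ NULL },
	};
	const struct section *s;

	for (s = sections; s->name; s++) {
		igt_subtest_f("cross-gpu-mem-access-%s", s->name) {
			struct multigpu_ops_args op_args = {
				.prefetch_req = !!(s->flags & PREFETCH),
			};

			for_each_gpu_pair(gpu_cnt, gpus, &eci,
					  gpu_mem_access_wrapper, &op_args);
		}
	}
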
>
>
>> +	}
>> +
>> +	igt_fixture {
>> +		int cnt;
>> +
>> +		for (cnt = 0; cnt < gpu_cnt; cnt++)
>> +			drm_close_driver(gpus[cnt].fd);
>> +	}
>> +}
>> diff --git a/tests/meson.build b/tests/meson.build
>> index 9736f2338..1209f84a4 100644
>> --- a/tests/meson.build
>> +++ b/tests/meson.build
>> @@ -313,6 +313,7 @@ intel_xe_progs = [
>>   	'xe_media_fill',
>>   	'xe_mmap',
>>   	'xe_module_load',
>> +        'xe_multi_gpusvm',
>>   	'xe_noexec_ping_pong',
>>   	'xe_oa',
>>   	'xe_pat',

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 04/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU atomic operations
  2025-11-17 13:10   ` Hellstrom, Thomas
@ 2025-11-17 15:50     ` Sharma, Nishit
  2025-11-18  9:26       ` Hellstrom, Thomas
  0 siblings, 1 reply; 34+ messages in thread
From: Sharma, Nishit @ 2025-11-17 15:50 UTC (permalink / raw)
  To: Hellstrom, Thomas, igt-dev@lists.freedesktop.org


On 11/17/2025 6:40 PM, Hellstrom, Thomas wrote:
> On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
>> From: Nishit Sharma <nishit.sharma@intel.com>
>>
>> This test performs atomic increment operation on a shared SVM buffer
>> from both GPUs and the CPU in a multi-GPU environment. It uses
>> madvise
>> and prefetch to control buffer placement and verifies correctness and
>> ordering of atomic updates across agents.
>>
>> Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
>> ---
>>   tests/intel/xe_multi_gpusvm.c | 157
>> +++++++++++++++++++++++++++++++++-
>>   1 file changed, 156 insertions(+), 1 deletion(-)
>>
>> diff --git a/tests/intel/xe_multi_gpusvm.c
>> b/tests/intel/xe_multi_gpusvm.c
>> index 6614ea3d1..54e036724 100644
>> --- a/tests/intel/xe_multi_gpusvm.c
>> +++ b/tests/intel/xe_multi_gpusvm.c
>> @@ -31,6 +31,11 @@
>>    *      region both remotely and locally and copies to it. Reads
>> back to
>>    *      system memory and checks the result.
>>    *
>> + * SUBTEST: atomic-inc-gpu-op
>> + * Description:
>> + * 	This test does atomic operation in multi-gpu by executing
>> atomic
>> + *	operation on GPU1 and then atomic operation on GPU2 using
>> same
>> + *	adress
>>    */
>>   
>>   #define MAX_XE_REGIONS	8
>> @@ -40,6 +45,7 @@
>>   #define BIND_SYNC_VAL 0x686868
>>   #define EXEC_SYNC_VAL 0x676767
>>   #define COPY_SIZE SZ_64M
>> +#define	ATOMIC_OP_VAL	56
>>   
>>   struct xe_svm_gpu_info {
>>   	bool supports_faults;
>> @@ -49,6 +55,16 @@ struct xe_svm_gpu_info {
>>   	int fd;
>>   };
>>   
>> +struct test_exec_data {
>> +	uint32_t batch[32];
>> +	uint64_t pad;
>> +	uint64_t vm_sync;
>> +	uint64_t exec_sync;
>> +	uint32_t data;
>> +	uint32_t expected_data;
>> +	uint64_t batch_addr;
>> +};
>> +
>>   struct multigpu_ops_args {
>>   	bool prefetch_req;
>>   	bool op_mod;
>> @@ -72,7 +88,10 @@ static void gpu_mem_access_wrapper(struct
>> xe_svm_gpu_info *src,
>>   				   struct
>> drm_xe_engine_class_instance *eci,
>>   				   void *extra_args);
>>   
>> -static void open_pagemaps(int fd, struct xe_svm_gpu_info *info);
>> +static void gpu_atomic_inc_wrapper(struct xe_svm_gpu_info *src,
>> +				   struct xe_svm_gpu_info *dst,
>> +				   struct
>> drm_xe_engine_class_instance *eci,
>> +				   void *extra_args);
>>   
>>   static void
>>   create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct
>> drm_xe_engine_class_instance *eci,
>> @@ -166,6 +185,35 @@ static void for_each_gpu_pair(int num_gpus,
>> struct xe_svm_gpu_info *gpus,
>>   	}
>>   }
>>   
>> +static void open_pagemaps(int fd, struct xe_svm_gpu_info *info);
>> +
>> +static void
>> +atomic_batch_init(int fd, uint32_t vm, uint64_t src_addr,
>> +		  uint32_t *bo, uint64_t *addr)
>> +{
>> +	uint32_t batch_bo_size = BATCH_SIZE(fd);
>> +	uint32_t batch_bo;
>> +	uint64_t batch_addr;
>> +	void *batch;
>> +	uint32_t *cmd;
>> +	int i = 0;
>> +
>> +	batch_bo = xe_bo_create(fd, vm, batch_bo_size,
>> vram_if_possible(fd, 0), 0);
>> +	batch = xe_bo_map(fd, batch_bo, batch_bo_size);
>> +	cmd = (uint32_t *)batch;
>> +
>> +	cmd[i++] = MI_ATOMIC | MI_ATOMIC_INC;
>> +	cmd[i++] = src_addr;
>> +	cmd[i++] = src_addr >> 32;
>> +	cmd[i++] = MI_BATCH_BUFFER_END;
>> +
>> +	batch_addr = to_user_pointer(batch);
>> +	/* Punch a gap in the SVM map where we map the batch_bo */
>> +	xe_vm_bind_lr_sync(fd, vm, batch_bo, 0, batch_addr,
>> batch_bo_size, 0);
>> +	*bo = batch_bo;
>> +	*addr = batch_addr;
>> +}
>> +
>>   static void batch_init(int fd, uint32_t vm, uint64_t src_addr,
>>   		       uint64_t dst_addr, uint64_t copy_size,
>>   		       uint32_t *bo, uint64_t *addr)
>> @@ -325,6 +373,105 @@ gpu_mem_access_wrapper(struct xe_svm_gpu_info
>> *src,
>>   	copy_src_dst(src, dst, eci, args->prefetch_req);
>>   }
>>   
>> +static void
>> +atomic_inc_op(struct xe_svm_gpu_info *gpu0,
>> +	      struct xe_svm_gpu_info *gpu1,
>> +	      struct drm_xe_engine_class_instance *eci,
>> +	      bool prefetch_req)
>> +{
>> +	uint64_t addr;
>> +	uint32_t vm[2];
>> +	uint32_t exec_queue[2];
>> +	uint32_t batch_bo;
>> +	struct test_exec_data *data;
>> +	uint64_t batch_addr;
>> +	struct drm_xe_sync sync = {};
>> +	volatile uint64_t *sync_addr;
>> +	volatile uint32_t *shared_val;
>> +
>> +	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
>> +	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
>> +
>> +	data = aligned_alloc(SZ_2M, SZ_4K);
>> +	igt_assert(data);
>> +	data[0].vm_sync = 0;
>> +	addr = to_user_pointer(data);
>> +
>> +	shared_val = (volatile uint32_t *)addr;
>> +	*shared_val = ATOMIC_OP_VAL - 1;
>> +
>> +	atomic_batch_init(gpu0->fd, vm[0], addr, &batch_bo,
>> &batch_addr);
>> +
>> +	/* Place destination in an optionally remote location to
>> test */
>> +	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
>> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
>> +			    gpu0->fd, 0, gpu0->vram_regions[0],
>> exec_queue[0],
>> +			    0, 0);
>> +
>> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
>> +	xe_multigpu_prefetch(gpu0->fd, vm[0], addr, SZ_4K, &sync,
>> +			     sync_addr, exec_queue[0],
>> prefetch_req);
>> +
>> +	sync_addr = (void *)((char *)batch_addr + SZ_4K);
>> +	sync.addr = to_user_pointer((uint64_t *)sync_addr);
>> +	sync.timeline_value = EXEC_SYNC_VAL;
>> +	*sync_addr = 0;
>> +
>> +	/* Executing ATOMIC_INC on GPU0. */
>> +	xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
>> +	if (*sync_addr != EXEC_SYNC_VAL)
>> +		xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr,
>> EXEC_SYNC_VAL, exec_queue[0],
>> +			       NSEC_PER_SEC * 10);
>> +
>> +	igt_assert_eq(*shared_val, ATOMIC_OP_VAL);
>> +
>> +	atomic_batch_init(gpu1->fd, vm[1], addr, &batch_bo,
>> &batch_addr);
>> +
>> +	/* Place destination in an optionally remote location to
>> test */
> We're actually never using a remote location here? It's always advised
> to local.
Will edit the explanation.
>
>> +	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K, 0,
>> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
>> +			    gpu1->fd, 0, gpu1->vram_regions[0],
>> exec_queue[0],
>> +			    0, 0);
>
>
>> +
>> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
>> +	xe_multigpu_prefetch(gpu1->fd, vm[1], addr, SZ_4K, &sync,
>> +			     sync_addr, exec_queue[1],
>> prefetch_req);
>> +
>> +	sync_addr = (void *)((char *)batch_addr + SZ_4K);
>> +	sync.addr = to_user_pointer((uint64_t *)sync_addr);
>> +	sync.timeline_value = EXEC_SYNC_VAL;
>> +	*sync_addr = 0;
>> +
>> +	/* Execute ATOMIC_INC on GPU1 */
>> +	xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr, &sync, 1);
> If gpu1 here doesn't support faults, we shouldn't execute this.
So is this condition applicable to all tests? If faults are not supported,
xe_exec_sync(gpxx->fd,.....) shouldn't be called?
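
Or would it be enough to just filter such GPUs out of the pair loop, e.g.
(sketch):

	for (int dst = 0; dst < num_gpus; dst++) {
		if (src == dst || !gpus[dst].supports_faults)
			continue;
		fn(&gpus[src], &gpus[dst], eci, extra_args);
	}
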
>
>
>> +	if (*sync_addr != EXEC_SYNC_VAL)
>> +		xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr,
>> EXEC_SYNC_VAL, exec_queue[1],
>> +			       NSEC_PER_SEC * 10);
>> +
>> +	igt_assert_eq(*shared_val, ATOMIC_OP_VAL + 1);
>> +
>> +	munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
>> +	batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
>> +	batch_fini(gpu1->fd, vm[1], batch_bo, batch_addr);
>> +	free(data);
>> +
>> +	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
>> +	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
>> +}
>> +
>> +static void
>> +gpu_atomic_inc_wrapper(struct xe_svm_gpu_info *src,
>> +		       struct xe_svm_gpu_info *dst,
>> +		       struct drm_xe_engine_class_instance *eci,
>> +		       void *extra_args)
>> +{
>> +	struct multigpu_ops_args *args = (struct multigpu_ops_args
>> *)extra_args;
>> +	igt_assert(src);
>> +	igt_assert(dst);
>> +
>> +	atomic_inc_op(src, dst, eci, args->prefetch_req);
>> +}
>> +
>>   igt_main
>>   {
>>   	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
>> @@ -364,6 +511,14 @@ igt_main
>>   		for_each_gpu_pair(gpu_cnt, gpus, &eci,
>> gpu_mem_access_wrapper, &op_args);
>>   	}
>>   
>> +	igt_subtest("atomic-inc-gpu-op") {
>> +		struct multigpu_ops_args atomic_args;
>> +		atomic_args.prefetch_req = 1;
>> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
>> gpu_atomic_inc_wrapper, &atomic_args);
>> +		atomic_args.prefetch_req = 0;
>> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
>> gpu_atomic_inc_wrapper, &atomic_args);
> Same comment here as for the first test.
>
> /Thomas
>
>
>
>> +	}
>> +
>>   	igt_fixture {
>>   		int cnt;
>>   

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 05/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU coherency test
  2025-11-17 14:02   ` Hellstrom, Thomas
@ 2025-11-17 16:18     ` Sharma, Nishit
  2025-11-27  7:36       ` Gurram, Pravalika
  0 siblings, 1 reply; 34+ messages in thread
From: Sharma, Nishit @ 2025-11-17 16:18 UTC (permalink / raw)
  To: Hellstrom, Thomas, igt-dev@lists.freedesktop.org


On 11/17/2025 7:32 PM, Hellstrom, Thomas wrote:
> On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
>> From: Nishit Sharma <nishit.sharma@intel.com>
>>
>> This test verifies memory coherency in a multi-GPU environment using
>> SVM.
>> GPU 1 writes to a shared buffer, GPU 2 reads and checks for correct
>> data
>> without explicit synchronization, and the test is repeated with CPU
>> and
>> both GPUs to ensure consistent memory visibility across agents.
>>
>> Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
>> ---
>>   tests/intel/xe_multi_gpusvm.c | 203
>> +++++++++++++++++++++++++++++++++-
>>   1 file changed, 201 insertions(+), 2 deletions(-)
>>
>> diff --git a/tests/intel/xe_multi_gpusvm.c
>> b/tests/intel/xe_multi_gpusvm.c
>> index 54e036724..6792ef72c 100644
>> --- a/tests/intel/xe_multi_gpusvm.c
>> +++ b/tests/intel/xe_multi_gpusvm.c
>> @@ -34,8 +34,13 @@
>>    * SUBTEST: atomic-inc-gpu-op
>>    * Description:
>>    * 	This test does atomic operation in multi-gpu by executing
>> atomic
>> - *	operation on GPU1 and then atomic operation on GPU2 using
>> same
>> - *	adress
>> + * 	operation on GPU1 and then atomic operation on GPU2 using
>> same
>> + * 	adress
>> + *
>> + * SUBTEST: coherency-multi-gpu
>> + * Description:
>> + * 	This test checks coherency in multi-gpu by writing from GPU0
>> + * 	reading from GPU1 and verify and repeating with CPU and both
>> GPUs
>>    */
>>   
>>   #define MAX_XE_REGIONS	8
>> @@ -93,6 +98,11 @@ static void gpu_atomic_inc_wrapper(struct
>> xe_svm_gpu_info *src,
>>   				   struct
>> drm_xe_engine_class_instance *eci,
>>   				   void *extra_args);
>>   
>> +static void gpu_coherecy_test_wrapper(struct xe_svm_gpu_info *src,
>> +				      struct xe_svm_gpu_info *dst,
>> +				      struct
>> drm_xe_engine_class_instance *eci,
>> +				      void *extra_args);
>> +
>>   static void
>>   create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct
>> drm_xe_engine_class_instance *eci,
>>   		    uint32_t *vm, uint32_t *exec_queue)
>> @@ -214,6 +224,35 @@ atomic_batch_init(int fd, uint32_t vm, uint64_t
>> src_addr,
>>   	*addr = batch_addr;
>>   }
>>   
>> +static void
>> +store_dword_batch_init(int fd, uint32_t vm, uint64_t src_addr,
>> +                       uint32_t *bo, uint64_t *addr, int value)
>> +{
>> +        uint32_t batch_bo_size = BATCH_SIZE(fd);
>> +        uint32_t batch_bo;
>> +        uint64_t batch_addr;
>> +        void *batch;
>> +        uint32_t *cmd;
>> +        int i = 0;
>> +
>> +        batch_bo = xe_bo_create(fd, vm, batch_bo_size,
>> vram_if_possible(fd, 0), 0);
>> +        batch = xe_bo_map(fd, batch_bo, batch_bo_size);
>> +        cmd = (uint32_t *) batch;
>> +
>> +        cmd[i++] = MI_STORE_DWORD_IMM_GEN4;
>> +        cmd[i++] = src_addr;
>> +        cmd[i++] = src_addr >> 32;
>> +        cmd[i++] = value;
>> +        cmd[i++] = MI_BATCH_BUFFER_END;
>> +
>> +        batch_addr = to_user_pointer(batch);
>> +
>> +        /* Punch a gap in the SVM map where we map the batch_bo */
>> +        xe_vm_bind_lr_sync(fd, vm, batch_bo, 0, batch_addr,
>> batch_bo_size, 0);
>> +        *bo = batch_bo;
>> +        *addr = batch_addr;
>> +}
>> +
>>   static void batch_init(int fd, uint32_t vm, uint64_t src_addr,
>>   		       uint64_t dst_addr, uint64_t copy_size,
>>   		       uint32_t *bo, uint64_t *addr)
>> @@ -373,6 +412,143 @@ gpu_mem_access_wrapper(struct xe_svm_gpu_info
>> *src,
>>   	copy_src_dst(src, dst, eci, args->prefetch_req);
>>   }
>>   
>> +static void
>> +coherency_test_multigpu(struct xe_svm_gpu_info *gpu0,
>> +			struct xe_svm_gpu_info *gpu1,
>> +			struct drm_xe_engine_class_instance *eci,
>> +			bool coh_fail_set,
>> +			bool prefetch_req)
>> +{
>> +        uint64_t addr;
>> +        uint32_t vm[2];
>> +        uint32_t exec_queue[2];
>> +        uint32_t batch_bo, batch1_bo[2];
>> +        uint64_t batch_addr, batch1_addr[2];
>> +        struct drm_xe_sync sync = {};
>> +        volatile uint64_t *sync_addr;
>> +        int value = 60;
>> +	uint64_t *data1;
>> +	void *copy_dst;
>> +
>> +	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
>> +	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
>> +
>> +        data1 = aligned_alloc(SZ_2M, SZ_4K);
>> +	igt_assert(data1);
>> +	addr = to_user_pointer(data1);
>> +
>> +	copy_dst = aligned_alloc(SZ_2M, SZ_4K);
>> +	igt_assert(copy_dst);
>> +
>> +        store_dword_batch_init(gpu0->fd, vm[0], addr, &batch_bo,
>> &batch_addr, value);
>> +
>> +        /* Place destination in GPU0 local memory location to test
>> */
> Indentation looks odd throughout this function. Has a formatting /
> style checker been run on these patches?
>
>> +	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
>> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
>> +			    gpu0->fd, 0, gpu0->vram_regions[0],
>> exec_queue[0],
>> +			    0, 0);
>> +
>> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
>> +	xe_multigpu_prefetch(gpu0->fd, vm[0], addr, SZ_4K, &sync,
>> +			     sync_addr, exec_queue[0],
>> prefetch_req);
>> +
>> +        sync_addr = (void *)((char *)batch_addr + SZ_4K);
>> +        sync.addr = to_user_pointer((uint64_t *)sync_addr);
>> +        sync.timeline_value = EXEC_SYNC_VAL;
>> +        *sync_addr = 0;
>> +
>> +        /* Execute STORE command on GPU0 */
>> +        xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
>> +        if (*sync_addr != EXEC_SYNC_VAL)
>> +                xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr,
>> EXEC_SYNC_VAL, exec_queue[0],
>> +			       NSEC_PER_SEC * 10);
>> +
>> +        igt_assert_eq(*(uint64_t *)addr, value);
> This assert will cause a CPU read, which migrates the data back to
> system memory, so it's perhaps not ideal if we want to test coherency
> across GPUs?
>
>
>> +
>> +	/* Creating batch for GPU1 using addr as Src which have
>> value from GPU0 */
>> +	batch_init(gpu1->fd, vm[1], addr, to_user_pointer(copy_dst),
>> +		   SZ_4K, &batch_bo, &batch_addr);
>> +
>> +        /* Place destination in GPU1 local memory location to test
>> */
>> +	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K, 0,
>> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
>> +			    gpu1->fd, 0, gpu1->vram_regions[0],
>> exec_queue[1],
>> +			    0, 0);
>> +
>> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
>> +	xe_multigpu_prefetch(gpu1->fd, vm[1], addr, SZ_4K, &sync,
>> +			     sync_addr, exec_queue[1],
>> prefetch_req);
>> +
>> +        sync_addr = (void *)((char *)batch_addr + SZ_4K);
>> +        sync.addr = to_user_pointer((uint64_t *)sync_addr);
>> +        sync.timeline_value = EXEC_SYNC_VAL;
>> +        *sync_addr = 0;
>> +
>> +        /* Execute COPY command on GPU1 */
>> +        xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr, &sync, 1);
>> +        if (*sync_addr != EXEC_SYNC_VAL)
>> +                xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr,
>> EXEC_SYNC_VAL, exec_queue[1],
>> +			       NSEC_PER_SEC * 10);
>> +
>> +        igt_assert_eq(*(uint64_t *)copy_dst, value);
>> +
>> +        /* CPU writes 10, memset set bytes no integer hence memset
>> fills 4 bytes with 0x0A */
>> +        memset((void *)(uintptr_t)addr, 10, sizeof(int));
>> +        igt_assert_eq(*(uint64_t *)addr, 0x0A0A0A0A);
>> +
>> +	if (coh_fail_set) {
>> +		igt_info("coherency fail impl\n");
>> +
>> +		/* Coherency fail scenario */
>> +		store_dword_batch_init(gpu0->fd, vm[0], addr,
>> &batch1_bo[0], &batch1_addr[0], value + 10);
>> +		store_dword_batch_init(gpu1->fd, vm[1], addr,
>> &batch1_bo[1], &batch1_addr[1], value + 20);
>> +
>> +		sync_addr = (void *)((char *)batch1_addr[0] +
>> SZ_4K);
>> +		sync.addr = to_user_pointer((uint64_t *)sync_addr);
>> +		sync.timeline_value = EXEC_SYNC_VAL;
>> +		*sync_addr = 0;
>> +
>> +		/* Execute STORE command on GPU1 */
>> +		xe_exec_sync(gpu0->fd, exec_queue[0],
>> batch1_addr[0], &sync, 1);
>> +		if (*sync_addr != EXEC_SYNC_VAL)
>> +			xe_wait_ufence(gpu0->fd, (uint64_t
>> *)sync_addr, EXEC_SYNC_VAL, exec_queue[0],
>> +				       NSEC_PER_SEC * 10);
>> +
>> +		sync_addr = (void *)((char *)batch1_addr[1] +
>> SZ_4K);
>> +		sync.addr = to_user_pointer((uint64_t *)sync_addr);
>> +		sync.timeline_value = EXEC_SYNC_VAL;
>> +		*sync_addr = 0;
>> +
>> +		/* Execute STORE command on GPU2 */
>> +		xe_exec_sync(gpu1->fd, exec_queue[1],
>> batch1_addr[1], &sync, 1);
>> +		if (*sync_addr != EXEC_SYNC_VAL)
>> +			xe_wait_ufence(gpu1->fd, (uint64_t
>> *)sync_addr, EXEC_SYNC_VAL, exec_queue[1],
>> +				       NSEC_PER_SEC * 10);
>> +
>> +		igt_warn_on_f(*(uint64_t *)addr != (value + 10),
>> +			      "GPU2(dst_gpu] has overwritten value
>> at addr\n");
> Parenthesis mismatch.
>
> BTW, isn't gpu2 supposed to overwrite the value here? Perhaps I'm
> missing something?
Will change this logic
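Something along these lines, perhaps (sketch): since the two stores are
serialized by the ufence waits, the second GPU's store should be the final
value, so

	igt_warn_on_f(*(uint64_t *)addr != (value + 20),
		      "second GPU store not visible at addr\n");

with the gpu0/gpu1 naming made consistent per the comment below.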
>
> Also regarding the previous comment WRT using the naming gpu0 and gpu1
> vs gpu1 and gpu2? Shouldn't we try to be consistent here to avoid
> confusion?
>
> /Thomas
>
>
>
>> +
>> +		munmap((void *)batch1_addr[0], BATCH_SIZE(gpu0->fd));
>> +		munmap((void *)batch1_addr[1], BATCH_SIZE(gpu1->fd));
>> +
>> +		batch_fini(gpu0->fd, vm[0], batch1_bo[0], batch1_addr[0]);
>> +		batch_fini(gpu1->fd, vm[1], batch1_bo[1], batch1_addr[1]);
>> +	}
>> +
>> +        /* CPU writes 11, memset set bytes no integer hence memset
>> fills 4 bytes with 0x0B */
>> +        memset((void *)(uintptr_t)addr, 11, sizeof(int));
>> +        igt_assert_eq(*(uint64_t *)addr, 0x0B0B0B0B);
>> +
>> +        munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
>> +        batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
>> +        batch_fini(gpu1->fd, vm[1], batch_bo, batch_addr);
>> +        free(data1);
>> +	free(copy_dst);
>> +
>> +	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
>> +	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
>> +}
>> +
>>   static void
>>   atomic_inc_op(struct xe_svm_gpu_info *gpu0,
>>   	      struct xe_svm_gpu_info *gpu1,
>> @@ -472,6 +648,19 @@ gpu_atomic_inc_wrapper(struct xe_svm_gpu_info
>> *src,
>>   	atomic_inc_op(src, dst, eci, args->prefetch_req);
>>   }
>>   
>> +static void
>> +gpu_coherecy_test_wrapper(struct xe_svm_gpu_info *src,
>> +			  struct xe_svm_gpu_info *dst,
>> +			  struct drm_xe_engine_class_instance *eci,
>> +			  void *extra_args)
>> +{
>> +	struct multigpu_ops_args *args = (struct multigpu_ops_args
>> *)extra_args;
>> +	igt_assert(src);
>> +	igt_assert(dst);
>> +
>> +	coherency_test_multigpu(src, dst, eci, args->op_mod, args->prefetch_req);
>> +}
>> +
>>   igt_main
>>   {
>>   	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
>> @@ -519,6 +708,16 @@ igt_main
>>   		for_each_gpu_pair(gpu_cnt, gpus, &eci,
>> gpu_atomic_inc_wrapper, &atomic_args);
>>   	}
>>   
>> +	igt_subtest("coherency-multi-gpu") {
>> +		struct multigpu_ops_args coh_args;
>> +		coh_args.prefetch_req = 1;
>> +		coh_args.op_mod = 0;
>> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
>> gpu_coherecy_test_wrapper, &coh_args);
>> +		coh_args.prefetch_req = 0;
>> +		coh_args.op_mod = 1;
>> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
>> gpu_coherecy_test_wrapper, &coh_args);
>> +	}
>> +
>>   	igt_fixture {
>>   		int cnt;
>>   

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 03/10] tests/intel/xe_multi_gpusvm: Add SVM multi-GPU cross-GPU memory access test
  2025-11-17 15:49     ` Sharma, Nishit
@ 2025-11-17 20:40       ` Hellstrom, Thomas
  2025-11-18  9:24       ` Hellstrom, Thomas
  1 sibling, 0 replies; 34+ messages in thread
From: Hellstrom, Thomas @ 2025-11-17 20:40 UTC (permalink / raw)
  To: igt-dev@lists.freedesktop.org, Sharma,  Nishit

On Mon, 2025-11-17 at 21:19 +0530, Sharma, Nishit wrote:
> 
> On 11/17/2025 6:30 PM, Hellstrom, Thomas wrote:
> > On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
> > > From: Nishit Sharma <nishit.sharma@intel.com>
> > > 
> > > This test allocates a buffer in SVM, writes data to it from src
> > > GPU ,
> > > and reads/verifies
> > > the data from dst GPU. Optionally, the CPU also reads or modifies
> > > the
> > > buffer and both
> > > GPUs verify the results, ensuring correct cross-GPU and CPU
> > > memory
> > > access in a
> > > multi-GPU environment.
> > > 
> > > Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
> > > Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > ---
> > >   tests/intel/xe_multi_gpusvm.c | 373
> > > ++++++++++++++++++++++++++++++++++
> > >   tests/meson.build             |   1 +
> > >   2 files changed, 374 insertions(+)
> > >   create mode 100644 tests/intel/xe_multi_gpusvm.c
> > > 
> > > diff --git a/tests/intel/xe_multi_gpusvm.c
> > > b/tests/intel/xe_multi_gpusvm.c
> > > new file mode 100644
> > > index 000000000..6614ea3d1
> > > --- /dev/null
> > > +++ b/tests/intel/xe_multi_gpusvm.c
> > > @@ -0,0 +1,373 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2023 Intel Corporation
> > > + */
> > > +
> > > +#include <unistd.h>
> > > +
> > > +#include "drmtest.h"
> > > +#include "igt.h"
> > > +#include "igt_multigpu.h"
> > > +
> > > +#include "intel_blt.h"
> > > +#include "intel_mocs.h"
> > > +#include "intel_reg.h"
> > > +
> > > +#include "xe/xe_ioctl.h"
> > > +#include "xe/xe_query.h"
> > > +#include "xe/xe_util.h"
> > > +
> > > +/**
> > > + * TEST: Basic multi-gpu SVM testing
> > > + * Category: SVM
> > > + * Mega feature: Compute
> > > + * Sub-category: Compute tests
> > > + * Functionality: SVM p2p access, madvise and prefetch.
> > > + * Test category: functionality test
> > > + *
> > > + * SUBTEST: cross-gpu-mem-access
> > > + * Description:
> > > + *      This test creates two malloced regions, places the
> > > destination
> > > + *      region both remotely and locally and copies to it. Reads
> > > back to
> > > + *      system memory and checks the result.
> > > + *
> > > + */
> > > +
> > > +#define MAX_XE_REGIONS	8
> > > +#define MAX_XE_GPUS 8
> > > +#define NUM_LOOPS 1
> > > +#define BATCH_SIZE(_fd) ALIGN(SZ_8K,
> > > xe_get_default_alignment(_fd))
> > > +#define BIND_SYNC_VAL 0x686868
> > > +#define EXEC_SYNC_VAL 0x676767
> > > +#define COPY_SIZE SZ_64M
> > > +
> > > +struct xe_svm_gpu_info {
> > > +	bool supports_faults;
> > > +	int vram_regions[MAX_XE_REGIONS];
> > > +	unsigned int num_regions;
> > > +	unsigned int va_bits;
> > > +	int fd;
> > > +};
> > > +
> > > +struct multigpu_ops_args {
> > > +	bool prefetch_req;
> > > +	bool op_mod;
> > > +};
> > > +
> > > +typedef void (*gpu_pair_fn) (
> > > +		struct xe_svm_gpu_info *src,
> > > +		struct xe_svm_gpu_info *dst,
> > > +		struct drm_xe_engine_class_instance *eci,
> > > +		void *extra_args
> > > +);
> > > +
> > > +static void for_each_gpu_pair(int num_gpus,
> > > +			      struct xe_svm_gpu_info *gpus,
> > > +			      struct
> > > drm_xe_engine_class_instance
> > > *eci,
> > > +			      gpu_pair_fn fn,
> > > +			      void *extra_args);
> > > +
> > > +static void gpu_mem_access_wrapper(struct xe_svm_gpu_info *src,
> > > +				   struct xe_svm_gpu_info *dst,
> > > +				   struct
> > > drm_xe_engine_class_instance *eci,
> > > +				   void *extra_args);
> > > +
> > > +static void open_pagemaps(int fd, struct xe_svm_gpu_info *info);
> > > +
> > > +static void
> > > +create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct
> > > drm_xe_engine_class_instance *eci,
> > > +		    uint32_t *vm, uint32_t *exec_queue)
> > > +{
> > > +	*vm = xe_vm_create(gpu->fd,
> > > +			   DRM_XE_VM_CREATE_FLAG_LR_MODE |
> > > DRM_XE_VM_CREATE_FLAG_FAULT_MODE, 0);
> > > +	*exec_queue = xe_exec_queue_create(gpu->fd, *vm, eci,
> > > 0);
> > > +	xe_vm_bind_lr_sync(gpu->fd, *vm, 0, 0, 0, 1ull << gpu->va_bits,
> > > +			   DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR);
> > > +}
> > > +
> > > +static void
> > > +setup_sync(struct drm_xe_sync *sync, volatile uint64_t
> > > **sync_addr,
> > > uint64_t timeline_value)
> > > +{
> > > +	*sync_addr = malloc(sizeof(**sync_addr));
> > > +	igt_assert(*sync_addr);
> > > +	sync->flags = DRM_XE_SYNC_FLAG_SIGNAL;
> > > +	sync->type = DRM_XE_SYNC_TYPE_USER_FENCE;
> > > +	sync->addr = to_user_pointer((uint64_t *)*sync_addr);
> > > +	sync->timeline_value = timeline_value;
> > > +	**sync_addr = 0;
> > > +}
> > > +
> > > +static void
> > > +cleanup_vm_and_queue(struct xe_svm_gpu_info *gpu, uint32_t vm,
> > > uint32_t exec_queue)
> > > +{
> > > +	xe_vm_unbind_lr_sync(gpu->fd, vm, 0, 0, 1ull << gpu->va_bits);
> > > +	xe_exec_queue_destroy(gpu->fd, exec_queue);
> > > +	xe_vm_destroy(gpu->fd, vm);
> > > +}
> > > +
> > > +static void xe_multigpu_madvise(int src_fd, uint32_t vm,
> > > uint64_t
> > > addr, uint64_t size,
> > > +				uint64_t ext, uint32_t type, int
> > > dst_fd, uint16_t policy,
> > > +				uint16_t instance, uint32_t
> > > exec_queue, int local_fd,
> > > +				uint16_t local_vram)
> > > +{
> > > +	int ret;
> > > +
> > > +#define SYSTEM_MEMORY	0
> > Please use DRM_XE_PREFERRED_LOC_DEFAULT_SYSTEM.
> > A new define isn't necessary and it's also incorrect.
> Sure, will use that at required places.
> > 
> > > +	if (src_fd != dst_fd) {
> > > +		ret = xe_vm_madvise(src_fd, vm, addr, size, ext,
> > > type, dst_fd, policy, instance);
> > > +		if (ret == -ENOLINK) {
> > > +			igt_info("No fast interconnect between
> > > GPU0
> > > and GPU1, falling back to local VRAM\n");
> > > +			ret = xe_vm_madvise(src_fd, vm, addr,
> > > size,
> > > ext, type, local_fd,
> > > +					    policy, local_vram);
> > Please use DRM_XE_PREFERRED_LOC_DEFAULT_DEVICE

Note that this also means you can skip the last two parameters to the
function AFAICT.
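
I.e. the -ENOLINK fallback could then simply be (rough sketch):

	ret = xe_vm_madvise(src_fd, vm, addr, size, ext, type,
			    DRM_XE_PREFERRED_LOC_DEFAULT_DEVICE, policy, 0);

and the local_fd / local_vram arguments can be dropped from the helper.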

/Thomas


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers
  2025-11-17 15:43     ` Sharma, Nishit
@ 2025-11-18  9:23       ` Hellstrom, Thomas
  0 siblings, 0 replies; 34+ messages in thread
From: Hellstrom, Thomas @ 2025-11-18  9:23 UTC (permalink / raw)
  To: igt-dev@lists.freedesktop.org, Sharma,  Nishit

On Mon, 2025-11-17 at 21:13 +0530, Sharma, Nishit wrote:
> 
> On 11/17/2025 6:04 PM, Hellstrom, Thomas wrote:
> > On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
> > > From: Nishit Sharma <nishit.sharma@intel.com>
> > > 
> > > Add an 'instance' parameter to xe_vm_madvise and __xe_vm_madvise
> > > to
> > > support per-instance memory advice operations.Implement
> > > xe_vm_bind_lr_sync
> > > and xe_vm_unbind_lr_sync helpers for synchronous VM bind/unbind
> > > using
> > > user
> > > fences.
> > > These changes improve memory advice and binding operations for
> > > multi-
> > > GPU
> > > and multi-instance scenarios in IGT tests.
> > s memory advice/memory advise/ ?
> Got it. Will edit the description.
> > 
> > Also the lr_sync part is unrelated and should be split out to a
> > separate patch.
> 
> Sure, will create a separate patch for the lr_sync part. Also, should the
> xe_exec_system_allocator() changes in Patch [2/10] be merged along with the
> madvise changes in Patch [1/10]?

Yes, IIRC I noted that in that review.

We need to ensure that the code compiles after each patch.
/Thomas



> 
> 
> > 
> > Thanks,
> > Thomas
> > 
> > 
> > 
> > > Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
> > > ---
> > >   include/drm-uapi/xe_drm.h |  4 +--
> > >   lib/xe/xe_ioctl.c         | 53
> > > +++++++++++++++++++++++++++++++++++--
> > > --
> > >   lib/xe/xe_ioctl.h         | 11 +++++---
> > >   3 files changed, 58 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/include/drm-uapi/xe_drm.h b/include/drm-
> > > uapi/xe_drm.h
> > > index 89ab54935..3472efa58 100644
> > > --- a/include/drm-uapi/xe_drm.h
> > > +++ b/include/drm-uapi/xe_drm.h
> > > @@ -2060,8 +2060,8 @@ struct drm_xe_madvise {
> > >   			/** @preferred_mem_loc.migration_policy:
> > > Page migration policy */
> > >   			__u16 migration_policy;
> > >   
> > > -			/** @preferred_mem_loc.pad : MBZ */
> > > -			__u16 pad;
> > > +			/** @preferred_mem_loc.region_instance:
> > > Region instance */
> > > +			__u16 region_instance;
> > >   
> > >   			/** @preferred_mem_loc.reserved :
> > > Reserved
> > > */
> > >   			__u64 reserved;
> > > diff --git a/lib/xe/xe_ioctl.c b/lib/xe/xe_ioctl.c
> > > index 39c4667a1..06ce8a339 100644
> > > --- a/lib/xe/xe_ioctl.c
> > > +++ b/lib/xe/xe_ioctl.c
> > > @@ -687,7 +687,8 @@ int64_t xe_wait_ufence(int fd, uint64_t
> > > *addr,
> > > uint64_t value,
> > >   }
> > >   
> > >   int __xe_vm_madvise(int fd, uint32_t vm, uint64_t addr,
> > > uint64_t
> > > range,
> > > -		    uint64_t ext, uint32_t type, uint32_t
> > > op_val,
> > > uint16_t policy)
> > > +		    uint64_t ext, uint32_t type, uint32_t
> > > op_val,
> > > uint16_t policy,
> > > +		    uint16_t instance)
> > >   {
> > >   	struct drm_xe_madvise madvise = {
> > >   		.type = type,
> > > @@ -704,6 +705,7 @@ int __xe_vm_madvise(int fd, uint32_t vm,
> > > uint64_t
> > > addr, uint64_t range,
> > >   	case DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC:
> > >   		madvise.preferred_mem_loc.devmem_fd = op_val;
> > >   		madvise.preferred_mem_loc.migration_policy =
> > > policy;
> > > +		madvise.preferred_mem_loc.region_instance =
> > > instance;
> > >   		igt_debug("madvise.preferred_mem_loc.devmem_fd =
> > > %d\n",
> > >   			  madvise.preferred_mem_loc.devmem_fd);
> > >   		break;
> > > @@ -731,14 +733,55 @@ int __xe_vm_madvise(int fd, uint32_t vm,
> > > uint64_t addr, uint64_t range,
> > >    * @type: type of attribute
> > >    * @op_val: fd/atomic value/pat index, depending upon type of
> > > operation
> > >    * @policy: Page migration policy
> > > + * @instance: vram instance
> > >    *
> > >    * Function initializes different members of struct
> > > drm_xe_madvise
> > > and calls
> > >    * MADVISE IOCTL .
> > >    *
> > > - * Asserts in case of error returned by DRM_IOCTL_XE_MADVISE.
> > > + * Returns error number in failure and 0 if pass.
> > >    */
> > > -void xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
> > > range,
> > > -		   uint64_t ext, uint32_t type, uint32_t op_val,
> > > uint16_t policy)
> > > +int xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
> > > range,
> > > +		   uint64_t ext, uint32_t type, uint32_t op_val,
> > > uint16_t policy,
> > > +		   uint16_t instance)
> > >   {
> > > -	igt_assert_eq(__xe_vm_madvise(fd, vm, addr, range, ext,
> > > type, op_val, policy), 0);
> > > +	return __xe_vm_madvise(fd, vm, addr, range, ext, type,
> > > op_val, policy, instance);
> > > +}
> > > +
> > > +#define        BIND_SYNC_VAL   0x686868
> > > +void xe_vm_bind_lr_sync(int fd, uint32_t vm, uint32_t bo,
> > > uint64_t
> > > offset,
> > > +			uint64_t addr, uint64_t size, uint32_t
> > > flags)
> > > +{
> > > +	volatile uint64_t *sync_addr =
> > > malloc(sizeof(*sync_addr));
> > > +	struct drm_xe_sync sync = {
> > > +		.flags = DRM_XE_SYNC_FLAG_SIGNAL,
> > > +		.type = DRM_XE_SYNC_TYPE_USER_FENCE,
> > > +		.addr = to_user_pointer((uint64_t *)sync_addr),
> > > +		.timeline_value = BIND_SYNC_VAL,
> > > +	};
> > > +
> > > +	igt_assert(!!sync_addr);
> > > +	xe_vm_bind_async_flags(fd, vm, 0, bo, 0, addr, size,
> > > &sync,
> > > 1, flags);
> > > +	if (*sync_addr != BIND_SYNC_VAL)
> > > +		xe_wait_ufence(fd, (uint64_t *)sync_addr,
> > > BIND_SYNC_VAL, 0, NSEC_PER_SEC * 10);
> > > +	/* Only free if the wait succeeds */
> > > +	free((void *)sync_addr);
> > > +}
> > > +
> > > +void xe_vm_unbind_lr_sync(int fd, uint32_t vm, uint64_t offset,
> > > +			  uint64_t addr, uint64_t size)
> > > +{
> > > +	volatile uint64_t *sync_addr =
> > > malloc(sizeof(*sync_addr));
> > > +	struct drm_xe_sync sync = {
> > > +		.flags = DRM_XE_SYNC_FLAG_SIGNAL,
> > > +		.type = DRM_XE_SYNC_TYPE_USER_FENCE,
> > > +		.addr = to_user_pointer((uint64_t *)sync_addr),
> > > +		.timeline_value = BIND_SYNC_VAL,
> > > +	};
> > > +
> > > +	igt_assert(!!sync_addr);
> > > +	*sync_addr = 0;
> > > +	xe_vm_unbind_async(fd, vm, 0, 0, addr, size, &sync, 1);
> > > +	if (*sync_addr != BIND_SYNC_VAL)
> > > +		xe_wait_ufence(fd, (uint64_t *)sync_addr,
> > > BIND_SYNC_VAL, 0, NSEC_PER_SEC * 10);
> > > +	free((void *)sync_addr);
> > >   }
> > > diff --git a/lib/xe/xe_ioctl.h b/lib/xe/xe_ioctl.h
> > > index ae8a23a54..1ae38029d 100644
> > > --- a/lib/xe/xe_ioctl.h
> > > +++ b/lib/xe/xe_ioctl.h
> > > @@ -100,13 +100,18 @@ int __xe_wait_ufence(int fd, uint64_t
> > > *addr,
> > > uint64_t value,
> > >   int64_t xe_wait_ufence(int fd, uint64_t *addr, uint64_t value,
> > >   		       uint32_t exec_queue, int64_t timeout);
> > >   int __xe_vm_madvise(int fd, uint32_t vm, uint64_t addr,
> > > uint64_t
> > > range, uint64_t ext,
> > > -		    uint32_t type, uint32_t op_val, uint16_t
> > > policy);
> > > -void xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
> > > range, uint64_t ext,
> > > -		   uint32_t type, uint32_t op_val, uint16_t
> > > policy);
> > > +		    uint32_t type, uint32_t op_val, uint16_t
> > > policy,
> > > uint16_t instance);
> > > +int xe_vm_madvise(int fd, uint32_t vm, uint64_t addr, uint64_t
> > > range, uint64_t ext,
> > > +		  uint32_t type, uint32_t op_val, uint16_t
> > > policy,
> > > uint16_t instance);
> > >   int xe_vm_number_vmas_in_range(int fd, struct
> > > drm_xe_vm_query_mem_range_attr *vmas_attr);
> > >   int xe_vm_vma_attrs(int fd, struct
> > > drm_xe_vm_query_mem_range_attr
> > > *vmas_attr,
> > >   		    struct drm_xe_mem_range_attr *mem_attr);
> > >   struct drm_xe_mem_range_attr
> > >   *xe_vm_get_mem_attr_values_in_range(int fd, uint32_t vm,
> > > uint64_t
> > > start,
> > >   				    uint64_t range, uint32_t
> > > *num_ranges);
> > > +void xe_vm_bind_lr_sync(int fd, uint32_t vm, uint32_t bo,
> > > +			uint64_t offset, uint64_t addr,
> > > +			uint64_t size, uint32_t flags);
> > > +void xe_vm_unbind_lr_sync(int fd, uint32_t vm, uint64_t offset,
> > > +			  uint64_t addr, uint64_t size);
> > >   #endif /* XE_IOCTL_H */


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 03/10] tests/intel/xe_multi_gpusvm: Add SVM multi-GPU cross-GPU memory access test
  2025-11-17 15:49     ` Sharma, Nishit
  2025-11-17 20:40       ` Hellstrom, Thomas
@ 2025-11-18  9:24       ` Hellstrom, Thomas
  1 sibling, 0 replies; 34+ messages in thread
From: Hellstrom, Thomas @ 2025-11-18  9:24 UTC (permalink / raw)
  To: igt-dev@lists.freedesktop.org, Sharma,  Nishit

On Mon, 2025-11-17 at 21:19 +0530, Sharma, Nishit wrote:
> > 
> > So we have cross-gpu-mem-access-%s where %s can take "basic" and
> > "prefetch"?
> Those already defined tests are part of struct section {}. Let me
> work 
> on this and update.

So if you end up with a base test with variants, then perhaps we could add
the conflicting madvises as variants as well, increasing the coverage?
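
Something along these lines, perhaps (rough sketch; the section names and
the conflicting-madvise knob are only illustrative):

	static const struct section {
		const char *name;
		bool prefetch_req;
		bool conflicting_madvise;
	} sections[] = {
		{ "basic", false, false },
		{ "prefetch", true, false },
		{ "conflicting-madvise", false, true },
		{ NULL },
	};

	...

	for (const struct section *s = sections; s->name; s++) {
		igt_subtest_f("cross-gpu-mem-access-%s", s->name) {
			struct multigpu_ops_args args = {
				.prefetch_req = s->prefetch_req,
			};

			for_each_gpu_pair(gpu_cnt, gpus, &eci,
					  gpu_mem_access_wrapper, &args);
		}
	}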

Thanks,
Thomas


> > 
> > 
> > > +	}
> > > +
> > > +	igt_fixture {
> > > +		int cnt;
> > > +
> > > +		for (cnt = 0; cnt < gpu_cnt; cnt++)
> > > +			drm_close_driver(gpus[cnt].fd);
> > > +	}
> > > +}
> > > diff --git a/tests/meson.build b/tests/meson.build
> > > index 9736f2338..1209f84a4 100644
> > > --- a/tests/meson.build
> > > +++ b/tests/meson.build
> > > @@ -313,6 +313,7 @@ intel_xe_progs = [
> > >   	'xe_media_fill',
> > >   	'xe_mmap',
> > >   	'xe_module_load',
> > > +        'xe_multi_gpusvm',
> > >   	'xe_noexec_ping_pong',
> > >   	'xe_oa',
> > >   	'xe_pat',


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH i-g-t v7 04/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU atomic operations
  2025-11-17 15:50     ` Sharma, Nishit
@ 2025-11-18  9:26       ` Hellstrom, Thomas
  0 siblings, 0 replies; 34+ messages in thread
From: Hellstrom, Thomas @ 2025-11-18  9:26 UTC (permalink / raw)
  To: igt-dev@lists.freedesktop.org, Sharma,  Nishit

On Mon, 2025-11-17 at 21:20 +0530, Sharma, Nishit wrote:
> 
> On 11/17/2025 6:40 PM, Hellstrom, Thomas wrote:
> > On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
> > > From: Nishit Sharma <nishit.sharma@intel.com>
> > > 
> > > This test performs atomic increment operation on a shared SVM
> > > buffer
> > > from both GPUs and the CPU in a multi-GPU environment. It uses
> > > madvise
> > > and prefetch to control buffer placement and verifies correctness
> > > and
> > > ordering of atomic updates across agents.
> > > 
> > > Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
> > > ---
> > >   tests/intel/xe_multi_gpusvm.c | 157
> > > +++++++++++++++++++++++++++++++++-
> > >   1 file changed, 156 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/tests/intel/xe_multi_gpusvm.c
> > > b/tests/intel/xe_multi_gpusvm.c
> > > index 6614ea3d1..54e036724 100644
> > > --- a/tests/intel/xe_multi_gpusvm.c
> > > +++ b/tests/intel/xe_multi_gpusvm.c
> > > @@ -31,6 +31,11 @@
> > >    *      region both remotely and locally and copies to it.
> > > Reads
> > > back to
> > >    *      system memory and checks the result.
> > >    *
> > > + * SUBTEST: atomic-inc-gpu-op
> > > + * Description:
> > > + * 	This test does atomic operation in multi-gpu by
> > > executing
> > > atomic
> > > + *	operation on GPU1 and then atomic operation on GPU2
> > > using
> > > same
> > > + *	adress
> > >    */
> > >   
> > >   #define MAX_XE_REGIONS	8
> > > @@ -40,6 +45,7 @@
> > >   #define BIND_SYNC_VAL 0x686868
> > >   #define EXEC_SYNC_VAL 0x676767
> > >   #define COPY_SIZE SZ_64M
> > > +#define	ATOMIC_OP_VAL	56
> > >   
> > >   struct xe_svm_gpu_info {
> > >   	bool supports_faults;
> > > @@ -49,6 +55,16 @@ struct xe_svm_gpu_info {
> > >   	int fd;
> > >   };
> > >   
> > > +struct test_exec_data {
> > > +	uint32_t batch[32];
> > > +	uint64_t pad;
> > > +	uint64_t vm_sync;
> > > +	uint64_t exec_sync;
> > > +	uint32_t data;
> > > +	uint32_t expected_data;
> > > +	uint64_t batch_addr;
> > > +};
> > > +
> > >   struct multigpu_ops_args {
> > >   	bool prefetch_req;
> > >   	bool op_mod;
> > > @@ -72,7 +88,10 @@ static void gpu_mem_access_wrapper(struct
> > > xe_svm_gpu_info *src,
> > >   				   struct
> > > drm_xe_engine_class_instance *eci,
> > >   				   void *extra_args);
> > >   
> > > -static void open_pagemaps(int fd, struct xe_svm_gpu_info *info);
> > > +static void gpu_atomic_inc_wrapper(struct xe_svm_gpu_info *src,
> > > +				   struct xe_svm_gpu_info *dst,
> > > +				   struct
> > > drm_xe_engine_class_instance *eci,
> > > +				   void *extra_args);
> > >   
> > >   static void
> > >   create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct
> > > drm_xe_engine_class_instance *eci,
> > > @@ -166,6 +185,35 @@ static void for_each_gpu_pair(int num_gpus,
> > > struct xe_svm_gpu_info *gpus,
> > >   	}
> > >   }
> > >   
> > > +static void open_pagemaps(int fd, struct xe_svm_gpu_info *info);
> > > +
> > > +static void
> > > +atomic_batch_init(int fd, uint32_t vm, uint64_t src_addr,
> > > +		  uint32_t *bo, uint64_t *addr)
> > > +{
> > > +	uint32_t batch_bo_size = BATCH_SIZE(fd);
> > > +	uint32_t batch_bo;
> > > +	uint64_t batch_addr;
> > > +	void *batch;
> > > +	uint32_t *cmd;
> > > +	int i = 0;
> > > +
> > > +	batch_bo = xe_bo_create(fd, vm, batch_bo_size,
> > > vram_if_possible(fd, 0), 0);
> > > +	batch = xe_bo_map(fd, batch_bo, batch_bo_size);
> > > +	cmd = (uint32_t *)batch;
> > > +
> > > +	cmd[i++] = MI_ATOMIC | MI_ATOMIC_INC;
> > > +	cmd[i++] = src_addr;
> > > +	cmd[i++] = src_addr >> 32;
> > > +	cmd[i++] = MI_BATCH_BUFFER_END;
> > > +
> > > +	batch_addr = to_user_pointer(batch);
> > > +	/* Punch a gap in the SVM map where we map the batch_bo
> > > */
> > > +	xe_vm_bind_lr_sync(fd, vm, batch_bo, 0, batch_addr,
> > > batch_bo_size, 0);
> > > +	*bo = batch_bo;
> > > +	*addr = batch_addr;
> > > +}
> > > +
> > >   static void batch_init(int fd, uint32_t vm, uint64_t src_addr,
> > >   		       uint64_t dst_addr, uint64_t copy_size,
> > >   		       uint32_t *bo, uint64_t *addr)
> > > @@ -325,6 +373,105 @@ gpu_mem_access_wrapper(struct
> > > xe_svm_gpu_info
> > > *src,
> > >   	copy_src_dst(src, dst, eci, args->prefetch_req);
> > >   }
> > >   
> > > +static void
> > > +atomic_inc_op(struct xe_svm_gpu_info *gpu0,
> > > +	      struct xe_svm_gpu_info *gpu1,
> > > +	      struct drm_xe_engine_class_instance *eci,
> > > +	      bool prefetch_req)
> > > +{
> > > +	uint64_t addr;
> > > +	uint32_t vm[2];
> > > +	uint32_t exec_queue[2];
> > > +	uint32_t batch_bo;
> > > +	struct test_exec_data *data;
> > > +	uint64_t batch_addr;
> > > +	struct drm_xe_sync sync = {};
> > > +	volatile uint64_t *sync_addr;
> > > +	volatile uint32_t *shared_val;
> > > +
> > > +	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
> > > +	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
> > > +
> > > +	data = aligned_alloc(SZ_2M, SZ_4K);
> > > +	igt_assert(data);
> > > +	data[0].vm_sync = 0;
> > > +	addr = to_user_pointer(data);
> > > +
> > > +	shared_val = (volatile uint32_t *)addr;
> > > +	*shared_val = ATOMIC_OP_VAL - 1;
> > > +
> > > +	atomic_batch_init(gpu0->fd, vm[0], addr, &batch_bo,
> > > &batch_addr);
> > > +
> > > +	/* Place destination in an optionally remote location to
> > > test */
> > > +	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
> > > +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> > > +			    gpu0->fd, 0, gpu0->vram_regions[0],
> > > exec_queue[0],
> > > +			    0, 0);
> > > +
> > > +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
> > > +	xe_multigpu_prefetch(gpu0->fd, vm[0], addr, SZ_4K,
> > > &sync,
> > > +			     sync_addr, exec_queue[0],
> > > prefetch_req);
> > > +
> > > +	sync_addr = (void *)((char *)batch_addr + SZ_4K);
> > > +	sync.addr = to_user_pointer((uint64_t *)sync_addr);
> > > +	sync.timeline_value = EXEC_SYNC_VAL;
> > > +	*sync_addr = 0;
> > > +
> > > +	/* Executing ATOMIC_INC on GPU0. */
> > > +	xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync,
> > > 1);
> > > +	if (*sync_addr != EXEC_SYNC_VAL)
> > > +		xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr,
> > > EXEC_SYNC_VAL, exec_queue[0],
> > > +			       NSEC_PER_SEC * 10);
> > > +
> > > +	igt_assert_eq(*shared_val, ATOMIC_OP_VAL);
> > > +
> > > +	atomic_batch_init(gpu1->fd, vm[1], addr, &batch_bo,
> > > &batch_addr);
> > > +
> > > +	/* Place destination in an optionally remote location to
> > > test */
> > We're actually never using a remote location here? It's always
> > advised
> > to local.
> will edit the explanation.

Great. I think this goes for most of the tests, actually, so please review
this comment for all of them.
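
For reference, the remote-placement variant with the helper from patch 3
would look roughly like this (sketch only, same argument order as the
current xe_multigpu_madvise()):

	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
			    gpu1->fd, 0, gpu1->vram_regions[0], exec_queue[0],
			    gpu0->fd, gpu0->vram_regions[0]);

i.e. advise the range to the other GPU's VRAM and let the helper fall back
to local VRAM when there is no fast interconnect.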

Thanks,
Thomas


> > 
> > > +	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K, 0,
> > > +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> > > +			    gpu1->fd, 0, gpu1->vram_regions[0],
> > > exec_queue[0],
> > > +			    0, 0);
> > 
> > 
> > > +
> > > +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
> > > +	xe_multigpu_prefetch(gpu1->fd, vm[1], addr, SZ_4K,
> > > &sync,
> > > +			     sync_addr, exec_queue[1],
> > > prefetch_req);
> > > +
> > > +	sync_addr = (void *)((char *)batch_addr + SZ_4K);
> > > +	sync.addr = to_user_pointer((uint64_t *)sync_addr);
> > > +	sync.timeline_value = EXEC_SYNC_VAL;
> > > +	*sync_addr = 0;
> > > +
> > > +	/* Execute ATOMIC_INC on GPU1 */
> > > +	xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr, &sync,
> > > 1);
> > If gpu1 here doesn't support faults, we shouldn't execute this.
> So this condition is applicable to all tests: if faults are not supported,
> xe_exec_sync(gpxx->fd,.....) shouldn't be called?
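
Roughly, yes. E.g. guard the second exec (or skip the whole pair) on the
supports_faults flag, something like (sketch):

	if (gpu1->supports_faults) {
		xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr, &sync, 1);
		if (*sync_addr != EXEC_SYNC_VAL)
			xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr,
				       EXEC_SYNC_VAL, exec_queue[1],
				       NSEC_PER_SEC * 10);
		igt_assert_eq(*shared_val, ATOMIC_OP_VAL + 1);
	}
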
> > 
> > 
> > > +	if (*sync_addr != EXEC_SYNC_VAL)
> > > +		xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr,
> > > EXEC_SYNC_VAL, exec_queue[1],
> > > +			       NSEC_PER_SEC * 10);
> > > +
> > > +	igt_assert_eq(*shared_val, ATOMIC_OP_VAL + 1);
> > > +
> > > +	munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
> > > +	batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
> > > +	batch_fini(gpu1->fd, vm[1], batch_bo, batch_addr);
> > > +	free(data);
> > > +
> > > +	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
> > > +	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]);
> > > +}
> > > +
> > > +static void
> > > +gpu_atomic_inc_wrapper(struct xe_svm_gpu_info *src,
> > > +		       struct xe_svm_gpu_info *dst,
> > > +		       struct drm_xe_engine_class_instance *eci,
> > > +		       void *extra_args)
> > > +{
> > > +	struct multigpu_ops_args *args = (struct
> > > multigpu_ops_args
> > > *)extra_args;
> > > +	igt_assert(src);
> > > +	igt_assert(dst);
> > > +
> > > +	atomic_inc_op(src, dst, eci, args->prefetch_req);
> > > +}
> > > +
> > >   igt_main
> > >   {
> > >   	struct xe_svm_gpu_info gpus[MAX_XE_GPUS];
> > > @@ -364,6 +511,14 @@ igt_main
> > >   		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> > > gpu_mem_access_wrapper, &op_args);
> > >   	}
> > >   
> > > +	igt_subtest("atomic-inc-gpu-op") {
> > > +		struct multigpu_ops_args atomic_args;
> > > +		atomic_args.prefetch_req = 1;
> > > +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> > > gpu_atomic_inc_wrapper, &atomic_args);
> > > +		atomic_args.prefetch_req = 0;
> > > +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> > > gpu_atomic_inc_wrapper, &atomic_args);
> > Same comment here as for the first test.
> > 
> > /Thomas
> > 
> > 
> > 
> > > +	}
> > > +
> > >   	igt_fixture {
> > >   		int cnt;
> > >   


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH i-g-t v7 05/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU coherency test
  2025-11-17 16:18     ` Sharma, Nishit
@ 2025-11-27  7:36       ` Gurram, Pravalika
  0 siblings, 0 replies; 34+ messages in thread
From: Gurram, Pravalika @ 2025-11-27  7:36 UTC (permalink / raw)
  To: Sharma, Nishit, Hellstrom, Thomas, igt-dev@lists.freedesktop.org



> -----Original Message-----
> From: igt-dev <igt-dev-bounces@lists.freedesktop.org> On Behalf Of
> Sharma, Nishit
> Sent: Monday, November 17, 2025 9:48 PM
> To: Hellstrom, Thomas <thomas.hellstrom@intel.com>; igt-
> dev@lists.freedesktop.org
> Subject: Re: [PATCH i-g-t v7 05/10] tests/intel/xe_multi_gpusvm.c: Add SVM
> multi-GPU coherency test
> 
> 
> On 11/17/2025 7:32 PM, Hellstrom, Thomas wrote:
> > On Thu, 2025-11-13 at 16:33 +0000, nishit.sharma@intel.com wrote:
> >> From: Nishit Sharma <nishit.sharma@intel.com>
> >>
> >> This test verifies memory coherency in a multi-GPU environment using
> >> SVM.
> >> GPU 1 writes to a shared buffer, GPU 2 reads and checks for correct
> >> data without explicit synchronization, and the test is repeated with
> >> CPU and both GPUs to ensure consistent memory visibility across
> >> agents.
> >>
> >> Signed-off-by: Nishit Sharma <nishit.sharma@intel.com>
> >> ---
> >>   tests/intel/xe_multi_gpusvm.c | 203
> >> +++++++++++++++++++++++++++++++++-
> >>   1 file changed, 201 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/tests/intel/xe_multi_gpusvm.c
> >> b/tests/intel/xe_multi_gpusvm.c index 54e036724..6792ef72c 100644
> >> --- a/tests/intel/xe_multi_gpusvm.c
> >> +++ b/tests/intel/xe_multi_gpusvm.c
> >> @@ -34,8 +34,13 @@
> >>    * SUBTEST: atomic-inc-gpu-op
> >>    * Description:
> >>    * 	This test does atomic operation in multi-gpu by executing
> >> atomic
> >> - *	operation on GPU1 and then atomic operation on GPU2 using
> >> same
> >> - *	adress
> >> + * 	operation on GPU1 and then atomic operation on GPU2 using
> >> same
> >> + * 	adress
> >> + *
> >> + * SUBTEST: coherency-multi-gpu
> >> + * Description:
> >> + * 	This test checks coherency in multi-gpu by writing from GPU0
> >> + * 	reading from GPU1 and verify and repeating with CPU and both
> >> GPUs
> >>    */
> >>
> >>   #define MAX_XE_REGIONS	8
> >> @@ -93,6 +98,11 @@ static void gpu_atomic_inc_wrapper(struct
> >> xe_svm_gpu_info *src,
> >>   				   struct
> >> drm_xe_engine_class_instance *eci,
> >>   				   void *extra_args);
> >>
> >> +static void gpu_coherecy_test_wrapper(struct xe_svm_gpu_info *src,
> >> +				      struct xe_svm_gpu_info *dst,
> >> +				      struct
> >> drm_xe_engine_class_instance *eci,
> >> +				      void *extra_args);
> >> +
> >>   static void
> >>   create_vm_and_queue(struct xe_svm_gpu_info *gpu, struct
> >> drm_xe_engine_class_instance *eci,
> >>   		    uint32_t *vm, uint32_t *exec_queue) @@ -214,6 +224,35
> @@
> >> atomic_batch_init(int fd, uint32_t vm, uint64_t src_addr,
> >>   	*addr = batch_addr;
> >>   }
> >>
> >> +static void
> >> +store_dword_batch_init(int fd, uint32_t vm, uint64_t src_addr,
> >> +                       uint32_t *bo, uint64_t *addr, int value) {
> >> +        uint32_t batch_bo_size = BATCH_SIZE(fd);
> >> +        uint32_t batch_bo;
> >> +        uint64_t batch_addr;
> >> +        void *batch;
> >> +        uint32_t *cmd;
> >> +        int i = 0;
> >> +
> >> +        batch_bo = xe_bo_create(fd, vm, batch_bo_size,
> >> vram_if_possible(fd, 0), 0);
> >> +        batch = xe_bo_map(fd, batch_bo, batch_bo_size);
> >> +        cmd = (uint32_t *) batch;
> >> +
> >> +        cmd[i++] = MI_STORE_DWORD_IMM_GEN4;
> >> +        cmd[i++] = src_addr;
> >> +        cmd[i++] = src_addr >> 32;
> >> +        cmd[i++] = value;
> >> +        cmd[i++] = MI_BATCH_BUFFER_END;
> >> +
> >> +        batch_addr = to_user_pointer(batch);
> >> +
> >> +        /* Punch a gap in the SVM map where we map the batch_bo */
> >> +        xe_vm_bind_lr_sync(fd, vm, batch_bo, 0, batch_addr,
> >> batch_bo_size, 0);
> >> +        *bo = batch_bo;
> >> +        *addr = batch_addr;
> >> +}
> >> +
> >>   static void batch_init(int fd, uint32_t vm, uint64_t src_addr,
> >>   		       uint64_t dst_addr, uint64_t copy_size,
> >>   		       uint32_t *bo, uint64_t *addr) @@ -373,6 +412,143 @@
> >> gpu_mem_access_wrapper(struct xe_svm_gpu_info *src,
> >>   	copy_src_dst(src, dst, eci, args->prefetch_req);
> >>   }
> >>
> >> +static void
> >> +coherency_test_multigpu(struct xe_svm_gpu_info *gpu0,
> >> +			struct xe_svm_gpu_info *gpu1,
> >> +			struct drm_xe_engine_class_instance *eci,
> >> +			bool coh_fail_set,
> >> +			bool prefetch_req)
> >> +{
> >> +        uint64_t addr;
> >> +        uint32_t vm[2];
> >> +        uint32_t exec_queue[2];
> >> +        uint32_t batch_bo, batch1_bo[2];
> >> +        uint64_t batch_addr, batch1_addr[2];
> >> +        struct drm_xe_sync sync = {};
> >> +        volatile uint64_t *sync_addr;
> >> +        int value = 60;
> >> +	uint64_t *data1;
> >> +	void *copy_dst;
> >> +
> >> +	create_vm_and_queue(gpu0, eci, &vm[0], &exec_queue[0]);
> >> +	create_vm_and_queue(gpu1, eci, &vm[1], &exec_queue[1]);
> >> +

Instead of just 2 GPUs racing, test N GPUs writing simultaneously:
// Setup: Create VMs and queues for ALL GPUs
    for (int i = 0; i < gpu_count; i++) {
        create_vm_and_queue(&gpus[i], eci, &vm[i], &exec_queue[i]);
    }

> >> +        data1 = aligned_alloc(SZ_2M, SZ_4K);
> >> +	igt_assert(data1);
> >> +	addr = to_user_pointer(data1);
> >> +
> >> +	copy_dst = aligned_alloc(SZ_2M, SZ_4K);
> >> +	igt_assert(copy_dst);
> >> +
> >> +        store_dword_batch_init(gpu0->fd, vm[0], addr, &batch_bo,
> >> &batch_addr, value);
> >> +
// Each GPU writes its own unique value
    for (int i = 0; i < gpu_count; i++) {
        int unique_value = 100 + (i * 10);  // GPU0=100, GPU1=110, GPU2=120, etc.
        store_dword_batch_init(gpus[i].fd, vm[i], addr,
                              &batch_bo[i], &batch_addr[i], unique_value);
        setup_sync(&sync[i], &sync_addr[i], EXEC_SYNC_VAL);
    }
> >> +        /* Place destination in GPU0 local memory location to test
> >> */
> > Indentation looks odd throughout this function. Is there a formatting
> > / style checker that has been run on these patches?
> >
> >> +	xe_multigpu_madvise(gpu0->fd, vm[0], addr, SZ_4K, 0,
> >> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> >> +			    gpu0->fd, 0, gpu0->vram_regions[0],
> >> exec_queue[0],
> >> +			    0, 0);
> >> +
> >> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
> >> +	xe_multigpu_prefetch(gpu0->fd, vm[0], addr, SZ_4K, &sync,
> >> +			     sync_addr, exec_queue[0],
> >> prefetch_req);
> >> +
> >> +        sync_addr = (void *)((char *)batch_addr + SZ_4K);
> >> +        sync.addr = to_user_pointer((uint64_t *)sync_addr);
> >> +        sync.timeline_value = EXEC_SYNC_VAL;
> >> +        *sync_addr = 0;
> >> +
> >> +        /* Execute STORE command on GPU0 */
> >> +        xe_exec_sync(gpu0->fd, exec_queue[0], batch_addr, &sync, 1);
    // LAUNCH ALL GPUS SIMULTANEOUSLY
    for (int i = 0; i < gpu_count; i++) {
        xe_exec_sync(gpus[i].fd, exec_queue[i], batch_addr[i], &sync[i], 1);
    }
> >> +        if (*sync_addr != EXEC_SYNC_VAL)
> >> +                xe_wait_ufence(gpu0->fd, (uint64_t *)sync_addr,
> >> EXEC_SYNC_VAL, exec_queue[0],
> >> +			       NSEC_PER_SEC * 10);
> >> +
    // Wait for all to complete
    for (int i = 0; i < gpu_count; i++) {
        xe_wait_ufence(gpus[i].fd, sync_addr[i], EXEC_SYNC_VAL,
                      exec_queue[i], NSEC_PER_SEC * 10);
    }	
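    // Finally verify the outcome (sketch, reusing the arrays and the
    // unique_value scheme above): without extra synchronization the winner is
    // not deterministic, so only check the final value is one of the per-GPU values.
    uint32_t final = *(volatile uint32_t *)addr;
    bool matched = false;

    for (int i = 0; i < gpu_count; i++)
        matched |= (final == (uint32_t)(100 + i * 10));
    igt_assert_f(matched, "final value %u not written by any GPU\n", final);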
> >> +        igt_assert_eq(*(uint64_t *)addr, value);
> > This assert will cause a CPU read which migrates the data to system,
> > so perhaps not ideal if we want to test coherency across gpus?
> >
> >
> >> +
> >> +	/* Creating batch for GPU1 using addr as Src which have
> >> value from GPU0 */
> >> +	batch_init(gpu1->fd, vm[1], addr, to_user_pointer(copy_dst),
> >> +		   SZ_4K, &batch_bo, &batch_addr);
> >> +
> >> +        /* Place destination in GPU1 local memory location to test
> >> */
> >> +	xe_multigpu_madvise(gpu1->fd, vm[1], addr, SZ_4K, 0,
> >> +			    DRM_XE_MEM_RANGE_ATTR_PREFERRED_LOC,
> >> +			    gpu1->fd, 0, gpu1->vram_regions[0],
> >> exec_queue[1],
> >> +			    0, 0);
> >> +
> >> +	setup_sync(&sync, &sync_addr, BIND_SYNC_VAL);
> >> +	xe_multigpu_prefetch(gpu1->fd, vm[1], addr, SZ_4K, &sync,
> >> +			     sync_addr, exec_queue[1],
> >> prefetch_req);
> >> +
> >> +        sync_addr = (void *)((char *)batch_addr + SZ_4K);
> >> +        sync.addr = to_user_pointer((uint64_t *)sync_addr);
> >> +        sync.timeline_value = EXEC_SYNC_VAL;
> >> +        *sync_addr = 0;
> >> +
> >> +        /* Execute COPY command on GPU1 */
> >> +        xe_exec_sync(gpu1->fd, exec_queue[1], batch_addr, &sync, 1);
> >> +        if (*sync_addr != EXEC_SYNC_VAL)
> >> +                xe_wait_ufence(gpu1->fd, (uint64_t *)sync_addr,
> >> EXEC_SYNC_VAL, exec_queue[1],
> >> +			       NSEC_PER_SEC * 10);
> >> +
> >> +        igt_assert_eq(*(uint64_t *)copy_dst, value);
> >> +
> >> +        /* CPU writes 10, memset set bytes no integer hence memset
> >> fills 4 bytes with 0x0A */
> >> +        memset((void *)(uintptr_t)addr, 10, sizeof(int));
> >> +        igt_assert_eq(*(uint64_t *)addr, 0x0A0A0A0A);
> >> +
> >> +	if (coh_fail_set) {
> >> +		igt_info("coherency fail impl\n");
> >> +
> >> +		/* Coherency fail scenario */
> >> +		store_dword_batch_init(gpu0->fd, vm[0], addr,
> >> &batch1_bo[0], &batch1_addr[0], value + 10);
> >> +		store_dword_batch_init(gpu1->fd, vm[1], addr,
> >> &batch1_bo[1], &batch1_addr[1], value + 20);
> >> +
> >> +		sync_addr = (void *)((char *)batch1_addr[0] +
> >> SZ_4K);
> >> +		sync.addr = to_user_pointer((uint64_t *)sync_addr);
> >> +		sync.timeline_value = EXEC_SYNC_VAL;
> >> +		*sync_addr = 0;
> >> +
> >> +		/* Execute STORE command on GPU1 */
> >> +		xe_exec_sync(gpu0->fd, exec_queue[0],
> >> batch1_addr[0], &sync, 1);
> >> +		if (*sync_addr != EXEC_SYNC_VAL)
> >> +			xe_wait_ufence(gpu0->fd, (uint64_t
> >> *)sync_addr, EXEC_SYNC_VAL, exec_queue[0],
> >> +				       NSEC_PER_SEC * 10);
> >> +
> >> +		sync_addr = (void *)((char *)batch1_addr[1] +
> >> SZ_4K);
> >> +		sync.addr = to_user_pointer((uint64_t *)sync_addr);
> >> +		sync.timeline_value = EXEC_SYNC_VAL;
> >> +		*sync_addr = 0;
> >> +
> >> +		/* Execute STORE command on GPU2 */
> >> +		xe_exec_sync(gpu1->fd, exec_queue[1],
> >> batch1_addr[1], &sync, 1);
> >> +		if (*sync_addr != EXEC_SYNC_VAL)
> >> +			xe_wait_ufence(gpu1->fd, (uint64_t
> >> *)sync_addr, EXEC_SYNC_VAL, exec_queue[1],
> >> +				       NSEC_PER_SEC * 10);
> >> +
> >> +		igt_warn_on_f(*(uint64_t *)addr != (value + 10),
> >> +			      "GPU2(dst_gpu] has overwritten value at addr\n");
> > Parenthesis mismatch.
> >
> > BTW, isn't gpu2 supposed to overwrite the value here? Perhaps I'm
> > missing something?
> Will change this logic
> >
> > Also regarding the previous comment WRT using the naming gpu0 and gpu1
> > vs gpu1 and gpu2? Shouldn't we try to be consistent here to avoid
> > confusion?
> >
> > /Thomas
> >
> >
> >
> >> +
> >> +		munmap((void *)batch1_addr[0], BATCH_SIZE(gpu0->fd));
> >> +		munmap((void *)batch1_addr[1], BATCH_SIZE(gpu1->fd));
> >> +
> >> +		batch_fini(gpu0->fd, vm[0], batch1_bo[0], batch1_addr[0]);
> >> +		batch_fini(gpu1->fd, vm[1], batch1_bo[1], batch1_addr[1]);
> >> +	}
> >> +
> >> +        /* CPU writes 11, memset set bytes no integer hence memset
> >> fills 4 bytes with 0x0B */
> >> +        memset((void *)(uintptr_t)addr, 11, sizeof(int));
> >> +        igt_assert_eq(*(uint64_t *)addr, 0x0B0B0B0B);
> >> +
> >> +        munmap((void *)batch_addr, BATCH_SIZE(gpu0->fd));
> >> +        batch_fini(gpu0->fd, vm[0], batch_bo, batch_addr);
> >> +        batch_fini(gpu1->fd, vm[1], batch_bo, batch_addr);
> >> +        free(data1);
> >> +	free(copy_dst);
> >> +
> >> +	cleanup_vm_and_queue(gpu0, vm[0], exec_queue[0]);
> >> +	cleanup_vm_and_queue(gpu1, vm[1], exec_queue[1]); }
> >> +
> >>   static void
> >>   atomic_inc_op(struct xe_svm_gpu_info *gpu0,
> >>   	      struct xe_svm_gpu_info *gpu1, @@ -472,6 +648,19 @@
> >> gpu_atomic_inc_wrapper(struct xe_svm_gpu_info *src,
> >>   	atomic_inc_op(src, dst, eci, args->prefetch_req);
> >>   }
> >>
> >> +static void
> >> +gpu_coherecy_test_wrapper(struct xe_svm_gpu_info *src,
> >> +			  struct xe_svm_gpu_info *dst,
> >> +			  struct drm_xe_engine_class_instance *eci,
> >> +			  void *extra_args)
> >> +{
> >> +	struct multigpu_ops_args *args = (struct multigpu_ops_args
> >> *)extra_args;
> >> +	igt_assert(src);
> >> +	igt_assert(dst);
> >> +
> >> +	coherency_test_multigpu(src, dst, eci, args->op_mod, args->prefetch_req);
> >> +}
> >> +
> >>   igt_main
> >>   {
> >>   	struct xe_svm_gpu_info gpus[MAX_XE_GPUS]; @@ -519,6 +708,16
> @@
> >> igt_main
> >>   		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> gpu_atomic_inc_wrapper,
> >> &atomic_args);
> >>   	}
> >>
> >> +	igt_subtest("coherency-multi-gpu") {
> >> +		struct multigpu_ops_args coh_args;
> >> +		coh_args.prefetch_req = 1;
> >> +		coh_args.op_mod = 0;
> >> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> >> gpu_coherecy_test_wrapper, &coh_args);
> >> +		coh_args.prefetch_req = 0;
> >> +		coh_args.op_mod = 1;
> >> +		for_each_gpu_pair(gpu_cnt, gpus, &eci,
> >> gpu_coherecy_test_wrapper, &coh_args);
> >> +	}
> >> +
> >>   	igt_fixture {
> >>   		int cnt;
> >>

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2025-11-27  7:37 UTC | newest]

Thread overview: 34+ messages
2025-11-13 16:32 [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs nishit.sharma
2025-11-13 16:33 ` [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers nishit.sharma
2025-11-17 12:34   ` Hellstrom, Thomas
2025-11-17 15:43     ` Sharma, Nishit
2025-11-18  9:23       ` Hellstrom, Thomas
2025-11-13 16:33 ` [PATCH i-g-t v7 02/10] tests/intel/xe_exec_system_allocator: Add parameter in madvise call nishit.sharma
2025-11-17 12:38   ` Hellstrom, Thomas
2025-11-13 16:33 ` [PATCH i-g-t v7 03/10] tests/intel/xe_multi_gpusvm: Add SVM multi-GPU cross-GPU memory access test nishit.sharma
2025-11-17 13:00   ` Hellstrom, Thomas
2025-11-17 15:49     ` Sharma, Nishit
2025-11-17 20:40       ` Hellstrom, Thomas
2025-11-18  9:24       ` Hellstrom, Thomas
2025-11-13 16:33 ` [PATCH i-g-t v7 04/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU atomic operations nishit.sharma
2025-11-17 13:10   ` Hellstrom, Thomas
2025-11-17 15:50     ` Sharma, Nishit
2025-11-18  9:26       ` Hellstrom, Thomas
2025-11-13 16:33 ` [PATCH i-g-t v7 05/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU coherency test nishit.sharma
2025-11-17 14:02   ` Hellstrom, Thomas
2025-11-17 16:18     ` Sharma, Nishit
2025-11-27  7:36       ` Gurram, Pravalika
2025-11-13 16:33 ` [PATCH i-g-t v7 06/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU performance test nishit.sharma
2025-11-17 14:39   ` Hellstrom, Thomas
2025-11-13 16:33 ` [PATCH i-g-t v7 07/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU fault handling test nishit.sharma
2025-11-17 14:48   ` Hellstrom, Thomas
2025-11-13 16:33 ` [PATCH i-g-t v7 08/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU simultaneous access test nishit.sharma
2025-11-17 14:57   ` Hellstrom, Thomas
2025-11-13 16:33 ` [PATCH i-g-t v7 09/10] tests/intel/xe_multi_gpusvm.c: Add SVM multi-GPU conflicting madvise test nishit.sharma
2025-11-17 15:11   ` Hellstrom, Thomas
  -- strict thread matches above, loose matches on Subject: below --
2025-11-13 17:16 [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs nishit.sharma
2025-11-13 17:16 ` [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers nishit.sharma
2025-11-13 17:15 [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs nishit.sharma
2025-11-13 17:15 ` [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers nishit.sharma
2025-11-13 17:09 [PATCH i-g-t v7 00/10] SVM madvise feature in multi-GPU config nishit.sharma
2025-11-13 17:09 ` [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers nishit.sharma
2025-11-13 17:04 [PATCH i-g-t v7 00/10] Add SVM madvise feature for multi-GPU configurations nishit.sharma
2025-11-13 17:04 ` [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers nishit.sharma
2025-11-13 16:49 [PATCH i-g-t v7 00/10] Madvise feature in SVM for Multi-GPU configs nishit.sharma
2025-11-13 16:49 ` [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers nishit.sharma
2025-11-04 15:31 [PATCH i-g-t v2 0/7] Madvise feature in SVM for Multi-GPU configs nishit.sharma
2025-11-13 17:00 ` [PATCH i-g-t v7 00/10] " Nishit Sharma
2025-11-13 17:00   ` [PATCH i-g-t v7 01/10] lib/xe: Add instance parameter to xe_vm_madvise and introduce lr_sync helpers Nishit Sharma
