* [RFC/POC PATCH 01/12] drm/amdgpu: add SVM UAPI definitions
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
@ 2026-03-17 11:29 ` Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 02/12] drm/amdgpu: add SVM data structures and header Honglei Huang
` (11 subsequent siblings)
12 siblings, 0 replies; 36+ messages in thread
From: Honglei Huang @ 2026-03-17 11:29 UTC (permalink / raw)
To: Alexander.Deucher, Felix.Kuehling, Christian.Koenig, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen
Cc: amd-gfx, dri-devel, honghuan
From: Honglei Huang <honghuan@amd.com>
Add amdgpu DRM SVM UAPI definitions built on the
DRM GPUSVM framework.
This includes:
- DRM_AMDGPU_GEM_SVM ioctl
- AMDGPU_SVM_FLAG_* flags
- AMDGPU_SVM_OP_SET_ATTR / AMDGPU_SVM_OP_GET_ATTR operations
- AMDGPU_SVM_ATTR_* attribute types
- AMDGPU_SVM_LOCATION_SYSMEM / AMDGPU_SVM_LOCATION_UNDEFINED
- struct drm_amdgpu_svm_attribute and struct drm_amdgpu_gem_svm
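For illustration only (not part of this patch), a minimal userspace sketch of
driving the new ioctl; the ioctl number, flags and structures are the ones
added here, while the already-open amdgpu render-node fd and the particular
attribute choices are assumptions:

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/amdgpu_drm.h>

/* Prefer system memory and make [start, start + size) read-only for the
 * GPU, assuming fd is an open amdgpu render node. */
static int svm_set_attrs_example(int fd, uint64_t start, uint64_t size)
{
	struct drm_amdgpu_svm_attribute attrs[] = {
		{ .type = AMDGPU_SVM_ATTR_PREFERRED_LOC,
		  .value = AMDGPU_SVM_LOCATION_SYSMEM },
		{ .type = AMDGPU_SVM_ATTR_SET_FLAGS,
		  .value = AMDGPU_SVM_FLAG_GPU_RO },
	};
	struct drm_amdgpu_gem_svm args = {
		.start_addr = start,
		.size = size,
		.operation = AMDGPU_SVM_OP_SET_ATTR,
		.nattr = 2,
		.attrs_ptr = (uint64_t)(uintptr_t)attrs,
	};

	return ioctl(fd, DRM_IOCTL_AMDGPU_GEM_SVM, &args);
}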
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
include/uapi/drm/amdgpu_drm.h | 39 +++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/drm/amdgpu_drm.h
index 406a42be4..bed71ed9b 100644
--- a/include/uapi/drm/amdgpu_drm.h
+++ b/include/uapi/drm/amdgpu_drm.h
@@ -58,6 +58,7 @@ extern "C" {
#define DRM_AMDGPU_USERQ_SIGNAL 0x17
#define DRM_AMDGPU_USERQ_WAIT 0x18
#define DRM_AMDGPU_GEM_LIST_HANDLES 0x19
+#define DRM_AMDGPU_GEM_SVM 0x1a
#define DRM_IOCTL_AMDGPU_GEM_CREATE DRM_IOWR(DRM_COMMAND_BASE + DRM_AMDGPU_GEM_CREATE, union drm_amdgpu_gem_create)
#define DRM_IOCTL_AMDGPU_GEM_MMAP DRM_IOWR(DRM_COMMAND_BASE + DRM_AMDGPU_GEM_MMAP, union drm_amdgpu_gem_mmap)
@@ -79,6 +80,7 @@ extern "C" {
#define DRM_IOCTL_AMDGPU_USERQ_SIGNAL DRM_IOWR(DRM_COMMAND_BASE + DRM_AMDGPU_USERQ_SIGNAL, struct drm_amdgpu_userq_signal)
#define DRM_IOCTL_AMDGPU_USERQ_WAIT DRM_IOWR(DRM_COMMAND_BASE + DRM_AMDGPU_USERQ_WAIT, struct drm_amdgpu_userq_wait)
#define DRM_IOCTL_AMDGPU_GEM_LIST_HANDLES DRM_IOWR(DRM_COMMAND_BASE + DRM_AMDGPU_GEM_LIST_HANDLES, struct drm_amdgpu_gem_list_handles)
+#define DRM_IOCTL_AMDGPU_GEM_SVM DRM_IOWR(DRM_COMMAND_BASE + DRM_AMDGPU_GEM_SVM, struct drm_amdgpu_gem_svm)
/**
* DOC: memory domains
@@ -1665,6 +1667,43 @@ struct drm_color_ctm_3x4 {
__u64 matrix[12];
};
+#define AMDGPU_SVM_FLAG_HOST_ACCESS 0x00000001
+#define AMDGPU_SVM_FLAG_COHERENT 0x00000002
+#define AMDGPU_SVM_FLAG_HIVE_LOCAL 0x00000004
+#define AMDGPU_SVM_FLAG_GPU_RO 0x00000008
+#define AMDGPU_SVM_FLAG_GPU_EXEC 0x00000010
+#define AMDGPU_SVM_FLAG_GPU_READ_MOSTLY 0x00000020
+#define AMDGPU_SVM_FLAG_GPU_ALWAYS_MAPPED 0x00000040
+#define AMDGPU_SVM_FLAG_EXT_COHERENT 0x00000080
+
+#define AMDGPU_SVM_OP_SET_ATTR 0
+#define AMDGPU_SVM_OP_GET_ATTR 1
+
+#define AMDGPU_SVM_ATTR_PREFERRED_LOC 0
+#define AMDGPU_SVM_ATTR_PREFETCH_LOC 1
+#define AMDGPU_SVM_ATTR_ACCESS 2
+#define AMDGPU_SVM_ATTR_ACCESS_IN_PLACE 3
+#define AMDGPU_SVM_ATTR_NO_ACCESS 4
+#define AMDGPU_SVM_ATTR_SET_FLAGS 5
+#define AMDGPU_SVM_ATTR_CLR_FLAGS 6
+#define AMDGPU_SVM_ATTR_GRANULARITY 7
+
+#define AMDGPU_SVM_LOCATION_SYSMEM 0
+#define AMDGPU_SVM_LOCATION_UNDEFINED 0xffffffff
+
+struct drm_amdgpu_svm_attribute {
+ __u32 type;
+ __u32 value;
+};
+
+struct drm_amdgpu_gem_svm {
+ __u64 start_addr;
+ __u64 size;
+ __u32 operation;
+ __u32 nattr;
+ __u64 attrs_ptr;
+};
+
#if defined(__cplusplus)
}
#endif
--
2.34.1
* [RFC/POC PATCH 02/12] drm/amdgpu: add SVM data structures and header
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 01/12] drm/amdgpu: add SVM UAPI definitions Honglei Huang
@ 2026-03-17 11:29 ` Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 03/12] drm/amdgpu: add SVM attribute data structures Honglei Huang
` (10 subsequent siblings)
12 siblings, 0 replies; 36+ messages in thread
From: Honglei Huang @ 2026-03-17 11:29 UTC (permalink / raw)
To: Alexander.Deucher, Felix.Kuehling, Christian.Koenig, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen
Cc: amd-gfx, dri-devel, honghuan
From: Honglei Huang <honghuan@amd.com>
Add the amdgpu SVM core header and an amdgpu_svm pointer in
struct amdgpu_vm.
This includes:
- struct amdgpu_svm containing the drm_gpusvm, refcount,
attr_tree, workqueues, locks, atomics, and per-mode callbacks
- Helper macros and functions
- Function declarations with CONFIG_DRM_AMDGPU_SVM guards and inline
stubs
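As a hedged sketch of how the guards are meant to be consumed (the caller
below is hypothetical, not part of this patch): code can call the SVM entry
points unconditionally and still build with CONFIG_DRM_AMDGPU_SVM disabled,
where the inline stubs report SVM as disabled or unsupported.

#include <linux/errno.h>
#include "amdgpu_svm.h"

/* Hypothetical fault-path caller: with CONFIG_DRM_AMDGPU_SVM=n the stubs
 * make this report the fault as unhandled. */
static int example_try_svm_fault(struct amdgpu_device *adev,
				 struct amdgpu_vm *vm, uint32_t pasid,
				 uint64_t addr, bool write_fault)
{
	if (!amdgpu_svm_is_enabled(vm))
		return -EOPNOTSUPP;

	return amdgpu_svm_handle_fault(adev, pasid, addr, write_fault);
}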
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++++++++++++++++++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
2 files changed, 151 insertions(+)
create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
new file mode 100644
index 000000000..a1bfe8b47
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright 2026 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __AMDGPU_SVM_H__
+#define __AMDGPU_SVM_H__
+
+#include <drm/amdgpu_drm.h>
+#include <drm/drm_gpusvm.h>
+#include <linux/atomic.h>
+#include <linux/kref.h>
+#include <linux/list.h>
+#include <linux/printk.h>
+#include <linux/rwsem.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/types.h>
+#include <linux/workqueue.h>
+
+struct amdgpu_device;
+struct amdgpu_vm;
+struct amdgpu_svm_attr_tree;
+struct drm_device;
+struct drm_file;
+
+#define AMDGPU_SVM_TRACE(fmt, ...) \
+ pr_debug("%s: " fmt, __func__, ##__VA_ARGS__)
+
+#define AMDGPU_SVM_KMEM_CACHE_CREATE(name, type) \
+ kmem_cache_create((name), sizeof(type), 0, 0, NULL)
+
+#define AMDGPU_SVM_KMEM_CACHE_DESTROY(cache) \
+ do { \
+ if ((cache) != NULL) { \
+ kmem_cache_destroy((cache)); \
+ (cache) = NULL; \
+ } \
+ } while (0)
+
+struct amdgpu_svm {
+ struct drm_gpusvm gpusvm;
+ struct kref refcount;
+ struct amdgpu_device *adev;
+ struct amdgpu_vm *vm;
+ struct amdgpu_svm_attr_tree *attr_tree;
+ struct workqueue_struct *gc_wq;
+ struct workqueue_struct *restore_wq;
+ struct rw_semaphore svm_lock;
+ spinlock_t gc_lock;
+ struct list_head gc_list;
+ struct work_struct gc_work;
+ struct list_head restore_work_list;
+ struct delayed_work restore_work;
+ atomic_t kfd_queues_quiesced;
+ atomic_t evicted_ranges;
+ atomic_t exiting;
+ u8 default_granularity;
+ bool xnack_enabled;
+ void (*begin_restore)(struct amdgpu_svm *svm);
+ void (*end_restore)(struct amdgpu_svm *svm);
+ void (*flush_tlb)(struct amdgpu_svm *svm);
+};
+
+static inline struct amdgpu_svm *to_amdgpu_svm(struct drm_gpusvm *gpusvm)
+{
+ return container_of(gpusvm, struct amdgpu_svm, gpusvm);
+}
+
+#if IS_ENABLED(CONFIG_DRM_AMDGPU_SVM)
+int amdgpu_svm_cache_init(void);
+void amdgpu_svm_cache_fini(void);
+
+int amdgpu_svm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm);
+void amdgpu_svm_close(struct amdgpu_vm *vm);
+void amdgpu_svm_fini(struct amdgpu_vm *vm);
+
+int amdgpu_svm_handle_fault(struct amdgpu_device *adev, uint32_t pasid,
+ uint64_t fault_addr, bool write_fault);
+bool amdgpu_svm_is_enabled(struct amdgpu_vm *vm);
+
+int amdgpu_gem_svm_ioctl(struct drm_device *dev, void *data,
+ struct drm_file *filp);
+#else
+static inline int amdgpu_svm_init(struct amdgpu_device *adev,
+ struct amdgpu_vm *vm)
+{
+ return 0;
+}
+
+static inline int amdgpu_svm_cache_init(void)
+{
+ return 0;
+}
+
+static inline void amdgpu_svm_cache_fini(void)
+{
+}
+
+static inline void amdgpu_svm_close(struct amdgpu_vm *vm)
+{
+}
+
+static inline void amdgpu_svm_fini(struct amdgpu_vm *vm)
+{
+}
+
+static inline int amdgpu_svm_handle_fault(struct amdgpu_device *adev,
+ uint32_t pasid,
+ uint64_t fault_addr,
+ bool write_fault)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline bool amdgpu_svm_is_enabled(struct amdgpu_vm *vm)
+{
+ return false;
+}
+
+static inline int amdgpu_gem_svm_ioctl(struct drm_device *dev, void *data,
+ struct drm_file *filp)
+{
+ return -EOPNOTSUPP;
+}
+#endif /* CONFIG_DRM_AMDGPU_SVM */
+
+#endif /* __AMDGPU_SVM_H__ */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index cf0ec94e8..7a5aeefdf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -43,6 +43,7 @@ struct amdgpu_bo_va;
struct amdgpu_job;
struct amdgpu_bo_list_entry;
struct amdgpu_bo_vm;
+struct amdgpu_svm;
/*
* GPUVM handling
@@ -445,6 +446,9 @@ struct amdgpu_vm {
/* cached fault info */
struct amdgpu_vm_fault_info fault_info;
+
+ /* SVM experimental implementation */
+ struct amdgpu_svm *svm;
};
struct amdgpu_vm_manager {
--
2.34.1
* [RFC/POC PATCH 03/12] drm/amdgpu: add SVM attribute data structures
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 01/12] drm/amdgpu: add SVM UAPI definitions Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 02/12] drm/amdgpu: add SVM data structures and header Honglei Huang
@ 2026-03-17 11:29 ` Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 04/12] drm/amdgpu: implement SVM attribute tree operations Honglei Huang
` (9 subsequent siblings)
12 siblings, 0 replies; 36+ messages in thread
From: Honglei Huang @ 2026-03-17 11:29 UTC (permalink / raw)
To: Alexander.Deucher, Felix.Kuehling, Christian.Koenig, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen
Cc: amd-gfx, dri-devel, honghuan
From: Honglei Huang <honghuan@amd.com>
Add the SVM attribute subsystem header defining:
- enum amdgpu_svm_attr_access
- flag masks for PTE and mapping flag changes
- struct amdgpu_svm_attrs, kept separate from the drm svm range
- struct amdgpu_svm_attr_range: interval-tree node
- struct amdgpu_svm_attr_tree
- enum amdgpu_svm_attr_change_trigger
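A small illustrative sketch (helper name invented here, not part of this
patch) of how these types fit together: the attribute tree is indexed in page
units, so a byte range maps to an inclusive page interval stored in the
it_node of an amdgpu_svm_attr_range.

#include <linux/mm.h>
#include "amdgpu_svm_attr.h"

/* Illustrative helper: cover the pages backing [start, start + size)
 * with one attribute range carrying the given attrs. */
static void example_attr_range_init(struct amdgpu_svm_attr_range *range,
				    uint64_t start, uint64_t size,
				    const struct amdgpu_svm_attrs *attrs)
{
	range->it_node.start = start >> PAGE_SHIFT;
	range->it_node.last = (start + size - 1) >> PAGE_SHIFT;
	range->attrs = *attrs;
	INIT_LIST_HEAD(&range->list);
}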
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 +++++++++++++++++++
1 file changed, 110 insertions(+)
create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
new file mode 100644
index 000000000..d49f6bb72
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
@@ -0,0 +1,110 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright 2026 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __AMDGPU_SVM_ATTR_H__
+#define __AMDGPU_SVM_ATTR_H__
+
+#include <drm/amdgpu_drm.h>
+#include <linux/interval_tree.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/rbtree.h>
+#include <linux/types.h>
+
+
+/* One fd, one svm, one GPU, so no bitmap is needed;
+ * only three states for this pattern.
+ */
+enum amdgpu_svm_attr_access {
+ AMDGPU_SVM_ACCESS_NONE = 0,
+ AMDGPU_SVM_ACCESS_ENABLE = 1,
+ AMDGPU_SVM_ACCESS_IN_PLACE = 2,
+};
+
+#define AMDGPU_SVM_PTE_FLAG_MASK \
+ (AMDGPU_SVM_FLAG_COHERENT | AMDGPU_SVM_FLAG_EXT_COHERENT | \
+ AMDGPU_SVM_FLAG_GPU_RO | AMDGPU_SVM_FLAG_GPU_EXEC)
+
+#define AMDGPU_SVM_MAPPING_FLAG_MASK \
+ (AMDGPU_SVM_FLAG_HOST_ACCESS | AMDGPU_SVM_FLAG_HIVE_LOCAL | \
+ AMDGPU_SVM_FLAG_GPU_READ_MOSTLY | AMDGPU_SVM_FLAG_GPU_ALWAYS_MAPPED)
+
+struct amdgpu_svm_attrs {
+ /* keep preferred_loc to adapt to kfd API */
+ int32_t preferred_loc;
+ int32_t prefetch_loc;
+ uint32_t flags;
+ uint32_t granularity;
+ enum amdgpu_svm_attr_access access;
+};
+
+struct amdgpu_svm_attr_range {
+ struct interval_tree_node it_node;
+ struct list_head list;
+ struct amdgpu_svm_attrs attrs;
+};
+
+struct amdgpu_svm;
+
+struct amdgpu_svm_attr_tree {
+ struct mutex lock;
+ struct rb_root_cached tree;
+ struct list_head range_list;
+ struct amdgpu_svm *svm;
+};
+
+enum amdgpu_svm_attr_change_trigger {
+ AMDGPU_SVM_ATTR_TRIGGER_ACCESS_CHANGE = (1U << 0),
+ AMDGPU_SVM_ATTR_TRIGGER_PTE_FLAG_CHANGE = (1U << 1),
+ AMDGPU_SVM_ATTR_TRIGGER_MAPPING_FLAG_CHANGE = (1U << 2),
+ AMDGPU_SVM_ATTR_TRIGGER_LOCATION_CHANGE = (1U << 3),
+ AMDGPU_SVM_ATTR_TRIGGER_GRANULARITY_CHANGE = (1U << 4),
+ AMDGPU_SVM_ATTR_TRIGGER_ATTR_ONLY = (1U << 5), /* no changes */
+};
+
+struct amdgpu_svm_attr_tree *
+amdgpu_svm_attr_tree_create(struct amdgpu_svm *svm);
+void amdgpu_svm_attr_tree_destroy(struct amdgpu_svm_attr_tree *attr_tree);
+int amdgpu_svm_attr_cache_init(void);
+void amdgpu_svm_attr_cache_fini(void);
+void amdgpu_svm_attr_lookup_page_locked(struct amdgpu_svm_attr_tree *attr_tree,
+ unsigned long page,
+ struct amdgpu_svm_attrs *attrs,
+ unsigned long *seg_last);
+
+int amdgpu_svm_attr_set(struct amdgpu_svm_attr_tree *attr_tree,
+ uint64_t start,
+ uint64_t size,
+ uint32_t nattr,
+ const struct drm_amdgpu_svm_attribute *attrs);
+int amdgpu_svm_attr_get(struct amdgpu_svm_attr_tree *attr_tree,
+ uint64_t start,
+ uint64_t size,
+ uint32_t nattr,
+ struct drm_amdgpu_svm_attribute *attrs);
+int amdgpu_svm_attr_clear_pages(struct amdgpu_svm_attr_tree *attr_tree,
+ unsigned long start_page,
+ unsigned long last_page);
+
+#endif /* __AMDGPU_SVM_ATTR_H__ */
--
2.34.1
* [RFC/POC PATCH 04/12] drm/amdgpu: implement SVM attribute tree operations
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
` (2 preceding siblings ...)
2026-03-17 11:29 ` [RFC/POC PATCH 03/12] drm/amdgpu: add SVM attribute data structures Honglei Huang
@ 2026-03-17 11:29 ` Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 05/12] drm/amdgpu: implement SVM attribute set Honglei Huang
` (8 subsequent siblings)
12 siblings, 0 replies; 36+ messages in thread
From: Honglei Huang @ 2026-03-17 11:29 UTC (permalink / raw)
To: Alexander.Deucher, Felix.Kuehling, Christian.Koenig, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen
Cc: amd-gfx, dri-devel, honghuan
From: Honglei Huang <honghuan@amd.com>
Implement the attribute tree operations:
- amdgpu_svm_attr_cache_init/fini for the attribute range kmem cache
- amdgpu_svm_attr_tree_create/destroy for lifecycle management
- amdgpu_svm_attr_lookup_page_locked and amdgpu_svm_attr_get for
per-page and whole-range attribute queries
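When a queried span crosses several attribute ranges, amdgpu_svm_attr_get
folds them: flags are AND-ed for SET_FLAGS and OR-ed (then inverted) for
CLR_FLAGS, differing locations collapse to AMDGPU_SVM_LOCATION_UNDEFINED, and
the minimum granularity is reported. A hedged usage sketch (helper name
invented, not part of this patch) that reports the flags common to an entire
byte range:

#include "amdgpu_svm_attr.h"

/* Hypothetical query helper: flags only set on part of the range drop
 * out of the AND-ed result. */
static int example_query_common_flags(struct amdgpu_svm_attr_tree *tree,
				      uint64_t start, uint64_t size,
				      uint32_t *flags)
{
	struct drm_amdgpu_svm_attribute attr = {
		.type = AMDGPU_SVM_ATTR_SET_FLAGS,
	};
	int r;

	r = amdgpu_svm_attr_get(tree, start, size, 1, &attr);
	if (r)
		return r;

	*flags = attr.value;
	return 0;
}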
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 346 +++++++++++++++++++
1 file changed, 346 insertions(+)
create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
new file mode 100644
index 000000000..137dfcb58
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
@@ -0,0 +1,346 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright 2026 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#include "amdgpu_svm.h"
+#include "amdgpu_svm_attr.h"
+#include "amdgpu_svm_range.h"
+
+#include <linux/errno.h>
+#include <linux/gfp.h>
+#include <linux/lockdep.h>
+#include <linux/minmax.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+
+static struct kmem_cache *amdgpu_svm_attr_range_cache;
+
+struct attr_get_ctx {
+ int32_t preferred_loc;
+ int32_t prefetch_loc;
+ enum amdgpu_svm_attr_access access;
+ uint32_t granularity;
+ uint32_t flags_and;
+ uint32_t flags_or;
+ bool has_range;
+};
+
+int amdgpu_svm_attr_cache_init(void)
+{
+ amdgpu_svm_attr_range_cache = AMDGPU_SVM_KMEM_CACHE_CREATE(
+ "amdgpu_svm_attr_range_cache", struct amdgpu_svm_attr_range);
+ if (!amdgpu_svm_attr_range_cache)
+ return -ENOMEM;
+
+ return 0;
+}
+
+void amdgpu_svm_attr_cache_fini(void)
+{
+ AMDGPU_SVM_KMEM_CACHE_DESTROY(amdgpu_svm_attr_range_cache);
+}
+
+static void attr_set_interval(struct amdgpu_svm_attr_range *range,
+ unsigned long start_page,
+ unsigned long last_page)
+{
+ range->it_node.start = start_page;
+ range->it_node.last = last_page;
+}
+
+static unsigned long attr_start_page(const struct amdgpu_svm_attr_range *range)
+{
+ return range->it_node.start;
+}
+
+static unsigned long attr_last_page(const struct amdgpu_svm_attr_range *range)
+{
+ return range->it_node.last;
+}
+
+static void attr_set_default(struct amdgpu_svm *svm,
+ struct amdgpu_svm_attrs *attrs)
+{
+ attrs->preferred_loc = AMDGPU_SVM_LOCATION_UNDEFINED;
+ attrs->prefetch_loc = AMDGPU_SVM_LOCATION_UNDEFINED;
+ attrs->granularity = svm->default_granularity;
+ attrs->flags = AMDGPU_SVM_FLAG_HOST_ACCESS | AMDGPU_SVM_FLAG_COHERENT;
+ attrs->access = svm->xnack_enabled ?
+ AMDGPU_SVM_ACCESS_ENABLE : AMDGPU_SVM_ACCESS_NONE;
+}
+
+void amdgpu_svm_attr_lookup_page_locked(struct amdgpu_svm_attr_tree *attr_tree,
+ unsigned long page,
+ struct amdgpu_svm_attrs *attrs,
+ unsigned long *range_last)
+{
+ struct interval_tree_node *node;
+ struct amdgpu_svm_attr_range *range;
+
+ node = interval_tree_iter_first(&attr_tree->tree, page, page);
+ if (node) {
+ range = container_of(node, struct amdgpu_svm_attr_range, it_node);
+ *attrs = range->attrs;
+ *range_last = range->it_node.last;
+ return;
+ }
+
+ attr_set_default(attr_tree->svm, attrs);
+ *range_last = ULONG_MAX;
+
+ if (page == ULONG_MAX)
+ return;
+
+ node = interval_tree_iter_first(&attr_tree->tree, page + 1, ULONG_MAX);
+ if (!node)
+ return;
+
+ range = container_of(node, struct amdgpu_svm_attr_range, it_node);
+ if (range->it_node.start > page)
+ *range_last = range->it_node.start - 1;
+}
+
+static bool amdgpu_svm_attr_equal(const struct amdgpu_svm_attrs *a,
+ const struct amdgpu_svm_attrs *b)
+{
+ return a->flags == b->flags &&
+ a->preferred_loc == b->preferred_loc &&
+ a->prefetch_loc == b->prefetch_loc &&
+ a->granularity == b->granularity &&
+ a->access == b->access;
+}
+
+static struct amdgpu_svm_attr_range *
+attr_alloc_range(unsigned long start,
+ unsigned long last,
+ const struct amdgpu_svm_attrs *attrs)
+{
+ struct amdgpu_svm_attr_range *range;
+
+ range = kmem_cache_zalloc(amdgpu_svm_attr_range_cache, GFP_KERNEL);
+ if (!range)
+ return NULL;
+
+ INIT_LIST_HEAD(&range->list);
+ attr_set_interval(range, start, last);
+ range->attrs = *attrs;
+ return range;
+}
+
+static void attr_insert_range_locked(struct amdgpu_svm_attr_tree *attr_tree,
+ struct amdgpu_svm_attr_range *range)
+{
+ struct interval_tree_node *node;
+ struct amdgpu_svm_attr_range *next;
+
+ lockdep_assert_held(&attr_tree->lock);
+
+ node = interval_tree_iter_first(&attr_tree->tree, attr_start_page(range),
+ ULONG_MAX);
+ if (node) {
+ next = container_of(node, struct amdgpu_svm_attr_range, it_node);
+ list_add_tail(&range->list, &next->list);
+ } else {
+ list_add_tail(&range->list, &attr_tree->range_list);
+ }
+
+ interval_tree_insert(&range->it_node, &attr_tree->tree);
+}
+
+static void attr_remove_range_locked(struct amdgpu_svm_attr_tree *attr_tree,
+ struct amdgpu_svm_attr_range *range,
+ bool free_range)
+{
+ lockdep_assert_held(&attr_tree->lock);
+
+ interval_tree_remove(&range->it_node, &attr_tree->tree);
+ list_del_init(&range->list);
+ if (free_range)
+ kmem_cache_free(amdgpu_svm_attr_range_cache, range);
+}
+
+struct amdgpu_svm_attr_tree *
+amdgpu_svm_attr_tree_create(struct amdgpu_svm *svm)
+{
+ struct amdgpu_svm_attr_tree *attr_tree;
+
+ attr_tree = kzalloc(sizeof(*attr_tree), GFP_KERNEL);
+ if (!attr_tree)
+ return NULL;
+
+ mutex_init(&attr_tree->lock);
+ attr_tree->tree = RB_ROOT_CACHED;
+ INIT_LIST_HEAD(&attr_tree->range_list);
+ attr_tree->svm = svm;
+ return attr_tree;
+}
+
+void amdgpu_svm_attr_tree_destroy(struct amdgpu_svm_attr_tree *attr_tree)
+{
+ struct amdgpu_svm_attr_range *range, *tmp;
+
+ if (!attr_tree)
+ return;
+
+ mutex_lock(&attr_tree->lock);
+ list_for_each_entry_safe(range, tmp, &attr_tree->range_list, list) {
+ interval_tree_remove(&range->it_node, &attr_tree->tree);
+ list_del_init(&range->list);
+ kmem_cache_free(amdgpu_svm_attr_range_cache, range);
+ }
+ mutex_unlock(&attr_tree->lock);
+
+ mutex_destroy(&attr_tree->lock);
+ kfree(attr_tree);
+}
+
+static void attr_get_ctx_add(struct attr_get_ctx *ctx,
+ const struct amdgpu_svm_attrs *attrs)
+{
+ if (!ctx->has_range) {
+ ctx->preferred_loc = attrs->preferred_loc;
+ ctx->prefetch_loc = attrs->prefetch_loc;
+ ctx->granularity = attrs->granularity;
+ ctx->access = attrs->access;
+ ctx->flags_and = attrs->flags;
+ ctx->flags_or = attrs->flags;
+ ctx->has_range = true;
+ return;
+ }
+
+ if (ctx->preferred_loc != attrs->preferred_loc)
+ ctx->preferred_loc = AMDGPU_SVM_LOCATION_UNDEFINED;
+ if (ctx->prefetch_loc != attrs->prefetch_loc)
+ ctx->prefetch_loc = AMDGPU_SVM_LOCATION_UNDEFINED;
+ if (attrs->granularity < ctx->granularity)
+ ctx->granularity = attrs->granularity;
+ if (ctx->access != attrs->access)
+ ctx->access = AMDGPU_SVM_ACCESS_NONE;
+ ctx->flags_and &= attrs->flags;
+ ctx->flags_or |= attrs->flags;
+}
+
+static int attr_get_ctx_to_result(const struct attr_get_ctx *ctx,
+ uint32_t nattr,
+ struct drm_amdgpu_svm_attribute *attrs)
+{
+ uint32_t i;
+
+ for (i = 0; i < nattr; i++) {
+ switch (attrs[i].type) {
+ case AMDGPU_SVM_ATTR_PREFERRED_LOC:
+ attrs[i].value = ctx->preferred_loc;
+ break;
+ case AMDGPU_SVM_ATTR_PREFETCH_LOC:
+ attrs[i].value = ctx->prefetch_loc;
+ break;
+ case AMDGPU_SVM_ATTR_ACCESS:
+ if (ctx->access == AMDGPU_SVM_ACCESS_ENABLE)
+ attrs[i].type = AMDGPU_SVM_ATTR_ACCESS;
+ else if (ctx->access == AMDGPU_SVM_ACCESS_IN_PLACE)
+ attrs[i].type = AMDGPU_SVM_ATTR_ACCESS_IN_PLACE;
+ else
+ attrs[i].type = AMDGPU_SVM_ATTR_NO_ACCESS;
+ break;
+ case AMDGPU_SVM_ATTR_SET_FLAGS:
+ attrs[i].value = ctx->flags_and;
+ break;
+ case AMDGPU_SVM_ATTR_CLR_FLAGS:
+ attrs[i].value = ~ctx->flags_or;
+ break;
+ case AMDGPU_SVM_ATTR_GRANULARITY:
+ attrs[i].value = ctx->granularity;
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
+int amdgpu_svm_attr_get(struct amdgpu_svm_attr_tree *attr_tree,
+ uint64_t start, uint64_t size,
+ uint32_t nattr,
+ struct drm_amdgpu_svm_attribute *attrs)
+{
+ struct amdgpu_svm_attrs default_attrs;
+ struct attr_get_ctx ctx = { 0 };
+ struct interval_tree_node *node;
+ unsigned long start_page, last_page, cursor;
+ int r;
+
+ start_page = start >> PAGE_SHIFT;
+ last_page = (start + size - 1) >> PAGE_SHIFT;
+
+ mutex_lock(&attr_tree->lock);
+ attr_set_default(attr_tree->svm, &default_attrs);
+ node = interval_tree_iter_first(&attr_tree->tree, start_page, last_page);
+
+ cursor = start_page;
+ while (cursor <= last_page) {
+ const struct amdgpu_svm_attrs *range_attrs;
+ unsigned long range_last = last_page;
+ struct amdgpu_svm_attr_range *range = NULL;
+ unsigned long next;
+
+ if (node) {
+ range = container_of(node, struct amdgpu_svm_attr_range,
+ it_node);
+
+ if (attr_last_page(range) < cursor) {
+ node = interval_tree_iter_next(node, start_page,
+ last_page);
+ continue;
+ }
+
+ if (attr_start_page(range) <= cursor) {
+ range_last = min(last_page, attr_last_page(range));
+ node = interval_tree_iter_next(node, start_page,
+ last_page);
+ } else {
+ range_last = min(last_page,
+ attr_start_page(range) - 1);
+ range = NULL;
+ }
+ }
+
+ range_attrs = range ? &range->attrs : &default_attrs;
+ attr_get_ctx_add(&ctx, range_attrs);
+
+ if (range_last == ULONG_MAX)
+ break;
+
+ next = range_last + 1;
+ if (next <= cursor)
+ break;
+ cursor = next;
+ }
+
+ if (!ctx.has_range)
+ attr_get_ctx_add(&ctx, &default_attrs);
+
+ r = attr_get_ctx_to_result(&ctx, nattr, attrs);
+ mutex_unlock(&attr_tree->lock);
+ return r;
+}
--
2.34.1
* [RFC/POC PATCH 05/12] drm/amdgpu: implement SVM attribute set
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
` (3 preceding siblings ...)
2026-03-17 11:29 ` [RFC/POC PATCH 04/12] drm/amdgpu: implement SVM attribute tree operations Honglei Huang
@ 2026-03-17 11:29 ` Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 06/12] drm/amdgpu: add SVM range data structures Honglei Huang
` (7 subsequent siblings)
12 siblings, 0 replies; 36+ messages in thread
From: Honglei Huang @ 2026-03-17 11:29 UTC (permalink / raw)
To: Alexander.Deucher, Felix.Kuehling, Christian.Koenig, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen
Cc: amd-gfx, dri-devel, honghuan
From: Honglei Huang <honghuan@amd.com>
Implement the attribute set path:
- Attribute application: apply UAPI attributes to the internal attrs
- Attribute tree set/split/remove operations
- amdgpu_svm_attr_set with retry on -EAGAIN
- amdgpu_svm_attr_clear_pages: remove attribute ranges on unmap
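As a hedged example of the split behaviour (region layout invented for
illustration, not part of this patch): setting a flag on the middle of a
region that is covered by a single attribute range splits that range into a
head, a middle carrying the new flags, and a tail, and amdgpu_svm_attr_set
retries internally when applying the change to the GPU mapping returns
-EAGAIN.

#include "amdgpu_svm_attr.h"

/* Make the middle 2 MiB of an 8 MiB region GPU read-only; the covering
 * attribute range is split into [0, 3 MiB), [3 MiB, 5 MiB) with
 * AMDGPU_SVM_FLAG_GPU_RO added, and [5 MiB, 8 MiB). */
static int example_mark_middle_ro(struct amdgpu_svm_attr_tree *tree,
				  uint64_t region_start)
{
	struct drm_amdgpu_svm_attribute attr = {
		.type = AMDGPU_SVM_ATTR_SET_FLAGS,
		.value = AMDGPU_SVM_FLAG_GPU_RO,
	};

	return amdgpu_svm_attr_set(tree, region_start + (3ULL << 20),
				   2ULL << 20, 1, &attr);
}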
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 548 +++++++++++++++++++
1 file changed, 548 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
index 137dfcb58..cd972026f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
@@ -33,8 +33,23 @@
#include <linux/mm.h>
#include <linux/slab.h>
+#define AMDGPU_SVM_VALID_FLAG_MASK \
+ (AMDGPU_SVM_FLAG_HOST_ACCESS | AMDGPU_SVM_FLAG_COHERENT | \
+ AMDGPU_SVM_FLAG_HIVE_LOCAL | AMDGPU_SVM_FLAG_GPU_RO | \
+ AMDGPU_SVM_FLAG_GPU_EXEC | AMDGPU_SVM_FLAG_GPU_READ_MOSTLY | \
+ AMDGPU_SVM_FLAG_GPU_ALWAYS_MAPPED | AMDGPU_SVM_FLAG_EXT_COHERENT)
+
+
static struct kmem_cache *amdgpu_svm_attr_range_cache;
+struct attr_set_ctx {
+ unsigned long start;
+ unsigned long last;
+ uint32_t trigger;
+ struct amdgpu_svm_attrs prev_attrs;
+ struct amdgpu_svm_attrs new_attrs;
+};
+
struct attr_get_ctx {
int32_t preferred_loc;
int32_t prefetch_loc;
@@ -130,6 +145,48 @@ static bool amdgpu_svm_attr_equal(const struct amdgpu_svm_attrs *a,
a->access == b->access;
}
+static uint32_t
+attr_change_ctx_trigger(const struct amdgpu_svm_attrs *prev_attrs,
+ const struct amdgpu_svm_attrs *new_attrs)
+{
+ uint32_t trigger = 0;
+ uint32_t changed_flags = prev_attrs->flags ^ new_attrs->flags;
+
+ if (prev_attrs->access != new_attrs->access)
+ trigger |= AMDGPU_SVM_ATTR_TRIGGER_ACCESS_CHANGE;
+
+ if (changed_flags & AMDGPU_SVM_PTE_FLAG_MASK)
+ trigger |= AMDGPU_SVM_ATTR_TRIGGER_PTE_FLAG_CHANGE;
+ if (changed_flags & AMDGPU_SVM_MAPPING_FLAG_MASK)
+ trigger |= AMDGPU_SVM_ATTR_TRIGGER_MAPPING_FLAG_CHANGE;
+ if (prev_attrs->preferred_loc != new_attrs->preferred_loc ||
+ prev_attrs->prefetch_loc != new_attrs->prefetch_loc)
+ trigger |= AMDGPU_SVM_ATTR_TRIGGER_LOCATION_CHANGE;
+ if (prev_attrs->granularity != new_attrs->granularity)
+ trigger |= AMDGPU_SVM_ATTR_TRIGGER_GRANULARITY_CHANGE;
+
+ if (!trigger)
+ trigger = AMDGPU_SVM_ATTR_TRIGGER_ATTR_ONLY;
+
+ return trigger;
+}
+
+static bool attr_has_access(uint32_t nattr,
+ const struct drm_amdgpu_svm_attribute *attrs)
+{
+ uint32_t i;
+
+ for (i = 0; i < nattr; i++) {
+ switch (attrs[i].type) {
+ case AMDGPU_SVM_ATTR_ACCESS:
+ case AMDGPU_SVM_ATTR_ACCESS_IN_PLACE:
+ return true;
+ }
+ }
+
+ return false;
+}
+
static struct amdgpu_svm_attr_range *
attr_alloc_range(unsigned long start,
unsigned long last,
@@ -179,6 +236,388 @@ static void attr_remove_range_locked(struct amdgpu_svm_attr_tree *attr_tree,
kmem_cache_free(amdgpu_svm_attr_range_cache, range);
}
+static void amdgpu_svm_attr_change_ctx_set(
+ struct attr_set_ctx *change,
+ unsigned long start,
+ unsigned long last,
+ uint32_t trigger,
+ const struct amdgpu_svm_attrs *prev_attrs,
+ const struct amdgpu_svm_attrs *new_attrs)
+{
+ change->start = start;
+ change->last = last;
+ change->trigger = trigger;
+ change->prev_attrs = *prev_attrs;
+ change->new_attrs = *new_attrs;
+}
+
+static int amdgpu_svm_attr_apply_change(
+ struct amdgpu_svm *svm,
+ const struct attr_set_ctx *change)
+{
+ int ret;
+
+ lockdep_assert_held_write(&svm->svm_lock);
+
+ if (!change->trigger ||
+ change->trigger == AMDGPU_SVM_ATTR_TRIGGER_ATTR_ONLY)
+ return 0;
+
+ ret = amdgpu_svm_range_apply_attr_change(svm, change->start, change->last,
+ change->trigger, &change->prev_attrs,
+ &change->new_attrs);
+ if (ret)
+ AMDGPU_SVM_TRACE("mapping apply failed ret=%d [0x%lx-0x%lx]-0x%lx trigger=0x%x\n",
+ ret, change->start, change->last,
+ change->last - change->start + 1,
+ change->trigger);
+
+ return ret;
+}
+
+static inline int attr_check_preferred_loc(uint32_t value)
+{
+ /* Because there is one svm per GPU, a value > 0 means the preferred loc is this GPU */
+ if (value == AMDGPU_SVM_LOCATION_SYSMEM || value == AMDGPU_SVM_LOCATION_UNDEFINED)
+ return 0;
+
+ return 0;
+}
+
+static inline int attr_check_prefetch_loc(uint32_t value)
+{
+ /* Because there is one svm per GPU, a value > 0 means the prefetch loc is this GPU;
+ * keep prefetch loc to adapt to the KFD API
+ */
+ if (value == AMDGPU_SVM_LOCATION_SYSMEM)
+ return 0;
+
+ if (value == AMDGPU_SVM_LOCATION_UNDEFINED)
+ return -EINVAL;
+
+ return 0;
+}
+
+static inline int attr_check_access(uint32_t value)
+{
+ if (!value || value == AMDGPU_SVM_LOCATION_UNDEFINED)
+ return -EINVAL;
+
+ return 0;
+}
+
+static inline int attr_check_flags(uint32_t value)
+{
+ if (value & ~AMDGPU_SVM_VALID_FLAG_MASK)
+ return -EINVAL;
+
+ return 0;
+}
+
+static inline int attr_check_granularity(uint32_t value)
+{
+ return 0;
+}
+
+static int
+amdgpu_svm_attr_validate_range_vma(struct amdgpu_svm_attr_tree *attr_tree,
+ unsigned long start_page,
+ unsigned long last_page)
+{
+ const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
+ struct mm_struct *mm;
+ unsigned long start, end;
+ int ret = 0;
+
+ if (start_page > last_page)
+ return -EINVAL;
+
+ if (last_page == ULONG_MAX)
+ return -EINVAL;
+
+ start = start_page << PAGE_SHIFT;
+ end = (last_page + 1) << PAGE_SHIFT;
+ mm = attr_tree->svm->gpusvm.mm;
+ if (!mm)
+ return -EFAULT;
+
+ mmap_read_lock(mm);
+ while (start < end) {
+ struct vm_area_struct *vma = vma_lookup(mm, start);
+
+ if (!vma || (vma->vm_flags & device_vma)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ start = min(end, vma->vm_end);
+ }
+ mmap_read_unlock(mm);
+
+ return ret;
+}
+
+static int amdgpu_svm_attr_set_validate(const struct drm_amdgpu_svm_attribute *attr)
+{
+ switch (attr->type) {
+ case AMDGPU_SVM_ATTR_PREFERRED_LOC:
+ return attr_check_preferred_loc(attr->value);
+ case AMDGPU_SVM_ATTR_PREFETCH_LOC:
+ return attr_check_prefetch_loc(attr->value);
+ case AMDGPU_SVM_ATTR_ACCESS:
+ case AMDGPU_SVM_ATTR_ACCESS_IN_PLACE:
+ case AMDGPU_SVM_ATTR_NO_ACCESS:
+ return attr_check_access(attr->value);
+ case AMDGPU_SVM_ATTR_SET_FLAGS:
+ case AMDGPU_SVM_ATTR_CLR_FLAGS:
+ return attr_check_flags(attr->value);
+ case AMDGPU_SVM_ATTR_GRANULARITY:
+ return attr_check_granularity(attr->value);
+ default:
+ return -EINVAL;
+ }
+}
+
+static void amdgpu_svm_attr_apply(struct amdgpu_svm_attrs *attrs,
+ uint32_t nattr,
+ const struct drm_amdgpu_svm_attribute *pattrs)
+{
+ const struct drm_amdgpu_svm_attribute *attr;
+
+ for (attr = pattrs; nattr--; attr++) {
+ switch (attr->type) {
+ case AMDGPU_SVM_ATTR_PREFERRED_LOC:
+ attrs->preferred_loc = (int32_t)attr->value;
+ break;
+ case AMDGPU_SVM_ATTR_PREFETCH_LOC:
+ attrs->prefetch_loc = (int32_t)attr->value;
+ break;
+ case AMDGPU_SVM_ATTR_ACCESS:
+ attrs->access = AMDGPU_SVM_ACCESS_ENABLE;
+ break;
+ case AMDGPU_SVM_ATTR_ACCESS_IN_PLACE:
+ attrs->access = AMDGPU_SVM_ACCESS_IN_PLACE;
+ break;
+ case AMDGPU_SVM_ATTR_NO_ACCESS:
+ attrs->access = AMDGPU_SVM_ACCESS_NONE;
+ break;
+ case AMDGPU_SVM_ATTR_SET_FLAGS:
+ attrs->flags |= attr->value;
+ break;
+ case AMDGPU_SVM_ATTR_CLR_FLAGS:
+ attrs->flags &= ~attr->value;
+ break;
+ case AMDGPU_SVM_ATTR_GRANULARITY:
+ attrs->granularity = min_t(uint32_t, attr->value, 0x3f);
+ break;
+ default:
+ break;
+ }
+ }
+}
+
+static bool attr_same_attrs(const struct amdgpu_svm_attr_range *range,
+ uint32_t nattr,
+ const struct drm_amdgpu_svm_attribute *attrs)
+{
+ struct amdgpu_svm_attrs target;
+
+ target = range->attrs;
+ amdgpu_svm_attr_apply(&target, nattr, attrs);
+ return amdgpu_svm_attr_equal(&range->attrs, &target);
+}
+
+static int
+amdgpu_svm_attr_set_hole(struct amdgpu_svm_attr_tree *attr_tree,
+ const struct amdgpu_svm_attrs *default_attrs,
+ unsigned long start, unsigned long last,
+ uint32_t nattr,
+ const struct drm_amdgpu_svm_attribute *attrs,
+ struct attr_set_ctx *change)
+{
+ struct amdgpu_svm_attrs new_attrs;
+ struct amdgpu_svm_attr_range *range;
+ uint32_t trigger;
+
+ lockdep_assert_held(&attr_tree->lock);
+
+ if (start > last)
+ return 0;
+
+ /* no action if default attr */
+ new_attrs = *default_attrs;
+ amdgpu_svm_attr_apply(&new_attrs, nattr, attrs);
+ if (amdgpu_svm_attr_equal(default_attrs, &new_attrs))
+ return 0;
+
+ range = attr_alloc_range(start, last, &new_attrs);
+ if (!range)
+ return -ENOMEM;
+
+ attr_insert_range_locked(attr_tree, range);
+
+ trigger = attr_change_ctx_trigger(default_attrs, &new_attrs);
+ amdgpu_svm_attr_change_ctx_set(change, start, last, trigger,
+ default_attrs, &new_attrs);
+ return 0;
+}
+
+static int
+amdgpu_svm_attr_set_existing(struct amdgpu_svm_attr_tree *attr_tree,
+ struct amdgpu_svm_attr_range *range,
+ unsigned long start, unsigned long last,
+ uint32_t nattr,
+ const struct drm_amdgpu_svm_attribute *attrs,
+ struct attr_set_ctx *change)
+{
+ unsigned long range_start = attr_start_page(range);
+ unsigned long range_last = attr_last_page(range);
+ struct amdgpu_svm_attr_range *left = NULL;
+ struct amdgpu_svm_attr_range *right = NULL;
+ struct amdgpu_svm_attrs old_attrs;
+ struct amdgpu_svm_attrs new_attrs;
+ uint32_t trigger;
+ bool force_trigger;
+
+ lockdep_assert_held(&attr_tree->lock);
+
+ old_attrs = range->attrs;
+
+ /* The attr layer doesn't store the GPU mapped state, so to align with KFD
+ * we need to force the range layer to check whether the range is GPU mapped.
+ */
+ force_trigger = !attr_tree->svm->xnack_enabled && attr_has_access(nattr, attrs);
+
+ if (attr_same_attrs(range, nattr, attrs)) {
+ if (!force_trigger)
+ return 0;
+
+ amdgpu_svm_attr_change_ctx_set(change, start, last,
+ AMDGPU_SVM_ATTR_TRIGGER_ACCESS_CHANGE,
+ &old_attrs, &old_attrs);
+ return 0;
+ }
+
+ new_attrs = old_attrs;
+ amdgpu_svm_attr_apply(&new_attrs, nattr, attrs);
+ trigger = attr_change_ctx_trigger(&old_attrs, &new_attrs);
+
+ /* only need to update attr */
+ if (start == range_start && last == range_last) {
+ range->attrs = new_attrs;
+ amdgpu_svm_attr_change_ctx_set(change, start, last,
+ trigger, &old_attrs, &new_attrs);
+ return 0;
+ }
+
+ /* split head */
+ if (start > range_start) {
+ left = attr_alloc_range(range_start, start - 1, &old_attrs);
+ if (!left)
+ return -ENOMEM;
+ }
+
+ /* split tail */
+ if (last < range_last) {
+ right = attr_alloc_range(last + 1, range_last, &old_attrs);
+ if (!right) {
+ if (left)
+ kmem_cache_free(amdgpu_svm_attr_range_cache, left);
+ return -ENOMEM;
+ }
+ }
+
+ attr_remove_range_locked(attr_tree, range, false);
+ if (left)
+ attr_insert_range_locked(attr_tree, left);
+ attr_set_interval(range, start, last);
+ range->attrs = new_attrs;
+ attr_insert_range_locked(attr_tree, range);
+ if (right)
+ attr_insert_range_locked(attr_tree, right);
+
+ amdgpu_svm_attr_change_ctx_set(change, start, last, trigger,
+ &old_attrs, &new_attrs);
+ return 0;
+}
+
+static int
+amdgpu_svm_attr_set_range(struct amdgpu_svm_attr_tree *attr_tree,
+ const struct amdgpu_svm_attrs *default_attrs,
+ unsigned long start, unsigned long last,
+ uint32_t nattr,
+ const struct drm_amdgpu_svm_attribute *attrs)
+{
+ struct amdgpu_svm *svm = attr_tree->svm;
+ unsigned long cursor = start;
+ bool need_retry = false;
+
+ while (cursor <= last) {
+ struct interval_tree_node *node;
+ unsigned long seg_last;
+ struct attr_set_ctx change = { 0 };
+ int ret;
+
+ mutex_lock(&attr_tree->lock);
+ node = interval_tree_iter_first(&attr_tree->tree, cursor, cursor);
+ if (node) {
+ struct amdgpu_svm_attr_range *range;
+
+ range = container_of(node, struct amdgpu_svm_attr_range, it_node);
+ seg_last = min(last, attr_last_page(range));
+ ret = amdgpu_svm_attr_set_existing(attr_tree, range,
+ cursor, seg_last,
+ nattr, attrs, &change);
+ } else {
+ struct interval_tree_node *next;
+
+ seg_last = last;
+ if (cursor != ULONG_MAX) {
+ next = interval_tree_iter_first(&attr_tree->tree,
+ cursor + 1,
+ ULONG_MAX);
+ if (next) {
+ struct amdgpu_svm_attr_range *next_range;
+
+ next_range = container_of(next,
+ struct amdgpu_svm_attr_range,
+ it_node);
+ seg_last = min(last,
+ attr_start_page(next_range) - 1);
+ }
+ }
+ ret = amdgpu_svm_attr_set_hole(attr_tree,
+ default_attrs,
+ cursor, seg_last,
+ nattr, attrs,
+ &change);
+ }
+ mutex_unlock(&attr_tree->lock);
+
+ if (ret)
+ return ret;
+
+ down_write(&svm->svm_lock);
+ ret = amdgpu_svm_attr_apply_change(svm, &change);
+ up_write(&svm->svm_lock);
+
+ if (ret == -EAGAIN) {
+ need_retry = true;
+ ret = 0;
+ }
+
+ if (ret)
+ return ret;
+
+ if (seg_last == ULONG_MAX || seg_last == last)
+ break;
+
+ cursor = seg_last + 1;
+ }
+
+ return need_retry ? -EAGAIN : 0;
+}
+
struct amdgpu_svm_attr_tree *
amdgpu_svm_attr_tree_create(struct amdgpu_svm *svm)
{
@@ -214,6 +653,115 @@ void amdgpu_svm_attr_tree_destroy(struct amdgpu_svm_attr_tree *attr_tree)
kfree(attr_tree);
}
+int amdgpu_svm_attr_set(struct amdgpu_svm_attr_tree *attr_tree,
+ uint64_t start,
+ uint64_t size,
+ uint32_t nattr,
+ const struct drm_amdgpu_svm_attribute *attrs)
+{
+ struct amdgpu_svm *svm = attr_tree->svm;
+ struct amdgpu_svm_attrs default_attrs;
+ unsigned long start_page, last_page;
+ uint32_t i;
+ int r;
+
+ start_page = start >> PAGE_SHIFT;
+ last_page = (start + size - 1) >> PAGE_SHIFT;
+
+ for (i = 0; i < nattr; i++) {
+ AMDGPU_SVM_TRACE("set attr type %u value 0x%08x for page range [%lx, %lx] xnack:%d",
+ attrs[i].type, attrs[i].value, start_page, last_page, svm->xnack_enabled ? 1 : 0);
+ r = amdgpu_svm_attr_set_validate(&attrs[i]);
+ if (r) {
+ AMDGPU_SVM_TRACE("invalid attribute %u value 0x%08x", attrs[i].type, attrs[i].value);
+ return r;
+ }
+ }
+
+ r = amdgpu_svm_attr_validate_range_vma(attr_tree, start_page, last_page);
+ if (r)
+ return r;
+
+ attr_set_default(attr_tree->svm, &default_attrs);
+
+ /*
+ * POC/WA:
+ * We cannot hold the mmap lock here because of the drm gpusvm framework design (drm_gpusvm_range_find_or_insert);
+ * the hmm operations and GPU mapping may fail, so add a retry mechanism
+ *
+ * TODO: add an mmap-locked flag to drm_gpusvm_ctx so the mmap lock can be held for the entire ioctl
+ */
+retry:
+ r = amdgpu_svm_attr_set_range(attr_tree, &default_attrs,
+ start_page, last_page,
+ nattr, attrs);
+ if (r == -EAGAIN) {
+ AMDGPU_SVM_TRACE("attr_set retry [0x%lx-0x%lx]\n",
+ start_page, last_page);
+ amdgpu_svm_range_flush(svm);
+ cond_resched();
+ goto retry;
+ }
+
+ return r;
+}
+
+int amdgpu_svm_attr_clear_pages(struct amdgpu_svm_attr_tree *attr_tree,
+ unsigned long start_page,
+ unsigned long last_page)
+{
+ struct interval_tree_node *node;
+ int r = 0;
+
+ if (start_page > last_page)
+ return -EINVAL;
+
+ mutex_lock(&attr_tree->lock);
+
+ node = interval_tree_iter_first(&attr_tree->tree, start_page, last_page);
+ while (node) {
+ struct interval_tree_node *next;
+ struct amdgpu_svm_attr_range *range;
+ unsigned long range_start;
+ unsigned long range_last;
+
+ range = container_of(node, struct amdgpu_svm_attr_range, it_node);
+ next = interval_tree_iter_next(node, start_page, last_page);
+ range_start = attr_start_page(range);
+ range_last = attr_last_page(range);
+
+ if (range_start < start_page && range_last > last_page) {
+ struct amdgpu_svm_attr_range *tail;
+
+ tail = attr_alloc_range(last_page + 1, range_last, &range->attrs);
+ if (!tail) {
+ r = -ENOMEM;
+ break;
+ }
+
+ attr_remove_range_locked(attr_tree, range, false);
+ attr_set_interval(range, range_start, start_page - 1);
+ attr_insert_range_locked(attr_tree, range);
+ attr_insert_range_locked(attr_tree, tail);
+ } else if (range_start < start_page) {
+ attr_remove_range_locked(attr_tree, range, false);
+ attr_set_interval(range, range_start, start_page - 1);
+ attr_insert_range_locked(attr_tree, range);
+ } else if (range_last > last_page) {
+ attr_remove_range_locked(attr_tree, range, false);
+ attr_set_interval(range, last_page + 1, range_last);
+ attr_insert_range_locked(attr_tree, range);
+ } else {
+ attr_remove_range_locked(attr_tree, range, true);
+ }
+
+ node = next;
+ }
+
+ mutex_unlock(&attr_tree->lock);
+ return r;
+}
+
static void attr_get_ctx_add(struct attr_get_ctx *ctx,
const struct amdgpu_svm_attrs *attrs)
{
--
2.34.1
* [RFC/POC PATCH 06/12] drm/amdgpu: add SVM range data structures
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
` (4 preceding siblings ...)
2026-03-17 11:29 ` [RFC/POC PATCH 05/12] drm/amdgpu: implement SVM attribute set Honglei Huang
@ 2026-03-17 11:29 ` Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 07/12] drm/amdgpu: implement SVM range PTE flags and GPU mapping Honglei Huang
` (6 subsequent siblings)
12 siblings, 0 replies; 36+ messages in thread
From: Honglei Huang @ 2026-03-17 11:29 UTC (permalink / raw)
To: Alexander.Deucher, Felix.Kuehling, Christian.Koenig, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen
Cc: amd-gfx, dri-devel, honghuan
From: Honglei Huang <honghuan@amd.com>
Add the SVM range header:
- struct amdgpu_svm_range: extends drm_gpusvm_range with mapping state
- Helper functions such as to_amdgpu_svm_range()
- Function declarations for range work init/fini, flush, sync,
mapping, attribute change application, invalidation, and queue
stop/restore.
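A brief illustrative sketch (helper invented here, not part of this patch) of
why the wrapper exists: drm_gpusvm callbacks hand back the embedded base
range, and the driver recovers its own mapping state with
to_amdgpu_svm_range().

#include "amdgpu_svm_range.h"

/* Illustrative: decide whether a range must be (re)mapped, based on the
 * cached GPU-mapped flag and the PTE flags used for the last mapping. */
static bool example_range_needs_remap(struct drm_gpusvm_range *base,
				      uint64_t new_pte_flags)
{
	struct amdgpu_svm_range *range = to_amdgpu_svm_range(base);

	return !range->gpu_mapped || range->pte_flags != new_pte_flags;
}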
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 +++++++++++++++++++
1 file changed, 76 insertions(+)
create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
new file mode 100644
index 000000000..18bf3dad1
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright 2026 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __AMDGPU_SVM_RANGE_H__
+#define __AMDGPU_SVM_RANGE_H__
+
+#include <drm/drm_gpusvm.h>
+
+#include <linux/list.h>
+#include <linux/types.h>
+
+struct amdgpu_svm;
+struct amdgpu_svm_attrs;
+struct drm_gpusvm_notifier;
+struct drm_gpusvm_range;
+struct mmu_notifier_range;
+
+struct amdgpu_svm_range {
+ struct drm_gpusvm_range base;
+ struct list_head gc_node;
+ bool gpu_mapped;
+ bool gc_queued;
+ bool restore_queued;
+ bool in_queue;
+ u8 pending_ops;
+ unsigned long pending_start;
+ unsigned long pending_last;
+ uint64_t pte_flags;
+ uint32_t attr_flags;
+};
+
+static inline struct amdgpu_svm_range *
+to_amdgpu_svm_range(struct drm_gpusvm_range *range)
+{
+ return container_of(range, struct amdgpu_svm_range, base);
+}
+
+int amdgpu_svm_range_work_init(struct amdgpu_svm *svm);
+void amdgpu_svm_range_work_fini(struct amdgpu_svm *svm);
+void amdgpu_svm_range_flush(struct amdgpu_svm *svm);
+void amdgpu_svm_range_sync_work(struct amdgpu_svm *svm);
+int amdgpu_svm_range_map_attr_ranges(struct amdgpu_svm *svm,
+ unsigned long start_page,
+ unsigned long last_page);
+int amdgpu_svm_range_apply_attr_change(
+ struct amdgpu_svm *svm, unsigned long start, unsigned long last,
+ uint32_t trigger, const struct amdgpu_svm_attrs *prev_attrs,
+ const struct amdgpu_svm_attrs *new_attrs);
+void amdgpu_svm_range_invalidate(struct amdgpu_svm *svm,
+ struct drm_gpusvm_notifier *notifier,
+ const struct mmu_notifier_range *mmu_range);
+void amdgpu_svm_range_restore_begin_compute(struct amdgpu_svm *svm);
+void amdgpu_svm_range_restore_end_compute(struct amdgpu_svm *svm);
+
+#endif /* __AMDGPU_SVM_RANGE_H__ */
--
2.34.1
* [RFC/POC PATCH 07/12] drm/amdgpu: implement SVM range PTE flags and GPU mapping
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
` (5 preceding siblings ...)
2026-03-17 11:29 ` [RFC/POC PATCH 06/12] drm/amdgpu: add SVM range data structures Honglei Huang
@ 2026-03-17 11:29 ` Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 08/12] drm/amdgpu: implement SVM range notifier and invalidation Honglei Huang
` (5 subsequent siblings)
12 siblings, 0 replies; 36+ messages in thread
From: Honglei Huang @ 2026-03-17 11:29 UTC (permalink / raw)
To: Alexander.Deucher, Felix.Kuehling, Christian.Koenig, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen
Cc: amd-gfx, dri-devel, honghuan
From: Honglei Huang <honghuan@amd.com>
Implement the GPU page table mapping core for SVM ranges:
- PTE flag computation per GC IP version (9.4.x, 11.x, 12.x) with
coherency mode selection (UC/NC/CC/RW) based on SVM flags
- GPU PTE update helpers using amdgpu_vm_update_range with DMA
address coalescing across contiguous pagemap entries
- Range mapping loop: find_or_insert via drm_gpusvm, get_pages,
validate under notifier lock, update GPU PTEs, flush TLB
- Attribute-aware mapping: walk the attr tree to map only accessible
ranges with correct PTE flags
- Attribute change handler: detect trigger types and remap intervals
when PTE flags, mapping flags, or access state changes
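As a hedged sketch of the coalescing rule used when walking
range->pages.dma_addr (helper name invented, not part of this patch): two
neighbouring pagemap entries fold into a single amdgpu_vm_update_range() call
when they use the same interconnect and are contiguous in DMA address space.

#include <linux/mm.h>
#include <linux/types.h>
#include <drm/drm_pagemap.h>

/* Illustrative: true when entry b continues directly after a's a_pages
 * pages, so both can be covered by one PTE update. */
static bool example_entries_contiguous(const struct drm_pagemap_addr *a,
				       unsigned long a_pages,
				       const struct drm_pagemap_addr *b)
{
	return a->proto == b->proto &&
	       b->addr == a->addr + ((dma_addr_t)a_pages << PAGE_SHIFT);
}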
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 539 ++++++++++++++++++
1 file changed, 539 insertions(+)
create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
new file mode 100644
index 000000000..b3bd4e2e6
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
@@ -0,0 +1,539 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright 2026 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#include "amdgpu_svm.h"
+#include "amdgpu_svm_attr.h"
+#include "amdgpu_svm_range.h"
+#include "amdgpu.h"
+#include "amdgpu_amdkfd.h"
+#include "amdgpu_vm.h"
+
+#include <drm/drm_exec.h>
+#include <drm/drm_pagemap.h>
+
+#include <linux/mmu_notifier.h>
+#include <uapi/linux/kfd_ioctl.h>
+
+enum amdgpu_svm_range_queue_op {
+ AMDGPU_SVM_RANGE_OP_RESTORE = 0,
+ AMDGPU_SVM_RANGE_OP_UNMAP = 1,
+};
+
+enum amdgpu_svm_range_pending_op {
+ AMDGPU_SVM_RANGE_PENDING_OP_NONE = 0,
+ AMDGPU_SVM_RANGE_PENDING_OP_UNMAP = BIT(0),
+ AMDGPU_SVM_RANGE_PENDING_OP_RESTORE = BIT(1),
+};
+
+#define UNMAP_WORK(ops) ((ops) & AMDGPU_SVM_RANGE_PENDING_OP_UNMAP)
+
+#define RESTORE_WORK(ops) ((ops) & AMDGPU_SVM_RANGE_PENDING_OP_RESTORE)
+
+#define NEED_REBUILD(svm) (!(svm)->xnack_enabled)
+
+enum amdgpu_svm_range_notifier_op {
+ AMDGPU_SVM_RANGE_NOTIFIER_CLEAR_PTE = BIT(0),
+ AMDGPU_SVM_RANGE_NOTIFIER_QUEUE_INTERVAL = BIT(1),
+};
+
+struct range_pending_op_ctx {
+ struct amdgpu_svm_range *range;
+ unsigned long start;
+ unsigned long last;
+ uint8_t pending_ops;
+};
+
+#define AMDGPU_SVM_RANGE_RESTORE_DELAY_MS 1
+#define AMDGPU_SVM_RANGE_WQ_NAME "amdgpu_svm_range"
+#define AMDGPU_SVM_RESTORE_WQ_NAME "amdgpu_svm_restore"
+
+static void
+amdgpu_svm_range_enqueue(struct amdgpu_svm *svm,
+ struct amdgpu_svm_range *range,
+ unsigned long start,
+ unsigned long last,
+ enum amdgpu_svm_range_queue_op op);
+
+static inline bool
+range_has_access(enum amdgpu_svm_attr_access access)
+{
+ return access == AMDGPU_SVM_ACCESS_ENABLE ||
+ access == AMDGPU_SVM_ACCESS_IN_PLACE;
+}
+
+static void
+range_invalidate_gpu_mapping(struct drm_gpusvm_range *range)
+{
+ WRITE_ONCE(to_amdgpu_svm_range(range)->gpu_mapped, false);
+}
+
+static bool
+range_attr_match(struct drm_gpusvm_range *range,
+ const struct amdgpu_svm_attrs *attrs,
+ uint64_t pte_flags)
+{
+ struct amdgpu_svm_range *r = to_amdgpu_svm_range(range);
+
+ if (!READ_ONCE(r->gpu_mapped))
+ return false;
+
+ return READ_ONCE(r->pte_flags) == pte_flags &&
+ READ_ONCE(r->attr_flags) == attrs->flags;
+}
+
+static bool
+range_pages_valid(struct amdgpu_svm *svm,
+ struct drm_gpusvm_range *range)
+{
+ lockdep_assert_held(&svm->gpusvm.notifier_lock);
+
+ if (range->pages.flags.unmapped || range->pages.flags.partial_unmap)
+ return false;
+
+ return drm_gpusvm_range_pages_valid(&svm->gpusvm, range);
+}
+
+static uint64_t
+amdgpu_svm_range_attr_pte_flags(struct amdgpu_svm *svm,
+ const struct amdgpu_svm_attrs *attrs)
+{
+ /* WA/POC: a simple pte flags func */
+ uint32_t gc_ip_version = amdgpu_ip_version(svm->adev, GC_HWIP, 0);
+ uint32_t flags = attrs->flags;
+ uint32_t mapping_flags = 0;
+ uint64_t pte_flags;
+ bool coherent = flags & (AMDGPU_SVM_FLAG_COHERENT |
+ AMDGPU_SVM_FLAG_EXT_COHERENT);
+ bool ext_coherent = flags & AMDGPU_SVM_FLAG_EXT_COHERENT;
+ bool snoop = true;
+ unsigned int mtype_local;
+
+ switch (gc_ip_version) {
+ case IP_VERSION(9, 4, 1):
+ case IP_VERSION(9, 4, 2):
+ mapping_flags |= coherent ?
+ AMDGPU_VM_MTYPE_UC : AMDGPU_VM_MTYPE_NC;
+ break;
+ case IP_VERSION(9, 4, 3):
+ case IP_VERSION(9, 4, 4):
+ case IP_VERSION(9, 5, 0):
+ if (ext_coherent)
+ mtype_local = AMDGPU_VM_MTYPE_CC;
+ else
+ mtype_local = amdgpu_mtype_local == 1 ? AMDGPU_VM_MTYPE_NC :
+ amdgpu_mtype_local == 2 ? AMDGPU_VM_MTYPE_CC :
+ AMDGPU_VM_MTYPE_RW;
+ if (svm->adev->flags & AMD_IS_APU) {
+ if (num_possible_nodes() <= 1)
+ mapping_flags |= mtype_local;
+ else
+ mapping_flags |= ext_coherent ?
+ AMDGPU_VM_MTYPE_UC : AMDGPU_VM_MTYPE_NC;
+ } else {
+ if (gc_ip_version < IP_VERSION(9, 5, 0) || ext_coherent)
+ mapping_flags |= AMDGPU_VM_MTYPE_UC;
+ else
+ mapping_flags |= AMDGPU_VM_MTYPE_NC;
+ }
+ break;
+ case IP_VERSION(11, 0, 0):
+ case IP_VERSION(11, 0, 1):
+ case IP_VERSION(11, 0, 2):
+ case IP_VERSION(11, 0, 3):
+ case IP_VERSION(11, 0, 4):
+ case IP_VERSION(11, 5, 0):
+ case IP_VERSION(11, 5, 1):
+ case IP_VERSION(11, 5, 2):
+ case IP_VERSION(11, 5, 3):
+ mapping_flags |= coherent ?
+ AMDGPU_VM_MTYPE_UC : AMDGPU_VM_MTYPE_NC;
+ break;
+ case IP_VERSION(12, 0, 0):
+ case IP_VERSION(12, 0, 1):
+ mapping_flags |= AMDGPU_VM_MTYPE_NC;
+ break;
+ default:
+ mapping_flags |= coherent ?
+ AMDGPU_VM_MTYPE_UC : AMDGPU_VM_MTYPE_NC;
+ break;
+ }
+
+ if (flags & AMDGPU_SVM_FLAG_GPU_EXEC)
+ mapping_flags |= AMDGPU_VM_PAGE_EXECUTABLE;
+
+ pte_flags = AMDGPU_PTE_VALID | AMDGPU_PTE_SYSTEM;
+ pte_flags |= snoop ? AMDGPU_PTE_SNOOPED : 0;
+ if (gc_ip_version >= IP_VERSION(12, 0, 0))
+ pte_flags |= AMDGPU_PTE_IS_PTE;
+
+ amdgpu_gmc_get_vm_pte(svm->adev, svm->vm, NULL, mapping_flags, &pte_flags);
+ pte_flags |= AMDGPU_PTE_READABLE;
+ if (!(flags & AMDGPU_SVM_FLAG_GPU_RO))
+ pte_flags |= AMDGPU_PTE_WRITEABLE;
+
+ return pte_flags;
+}
+
+static int amdgpu_svm_range_lock_vm_pd(struct amdgpu_svm *svm, struct drm_exec *exec)
+{
+ int ret;
+
+ drm_exec_init(exec, DRM_EXEC_IGNORE_DUPLICATES, 0);
+ drm_exec_until_all_locked(exec) {
+ ret = amdgpu_vm_lock_pd(svm->vm, exec, 1);
+ drm_exec_retry_on_contention(exec);
+ if (ret) {
+ drm_exec_fini(exec);
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+static int
+amdgpu_svm_range_update_gpu(struct amdgpu_svm *svm, unsigned long start_page,
+ unsigned long last_page, uint64_t pte_flags,
+ dma_addr_t *pages_addr, bool flush_tlb,
+ bool update_pdes, bool wait_fence)
+{
+ struct drm_exec exec;
+ struct dma_fence *fence = NULL;
+ int ret;
+
+ ret = amdgpu_svm_range_lock_vm_pd(svm, &exec);
+ if (ret)
+ return ret;
+
+ ret = amdgpu_vm_update_range(svm->adev, svm->vm, false, false,
+ flush_tlb, true,
+ NULL, start_page, last_page, pte_flags, 0, 0,
+ NULL, pages_addr, wait_fence ? &fence : NULL);
+ if (!ret && wait_fence && fence) {
+ ret = dma_fence_wait(fence, false);
+ if (ret < 0)
+ AMDGPU_SVM_TRACE("wait unmap fence failed: ret=%d [0x%lx-0x%lx]-0x%lx\n",
+ ret, start_page, last_page,
+ last_page - start_page + 1);
+ }
+ if (!ret && update_pdes)
+ ret = amdgpu_vm_update_pdes(svm->adev, svm->vm, false);
+
+ dma_fence_put(fence);
+ drm_exec_fini(&exec);
+ return ret;
+}
+
+static int
+amdgpu_svm_range_update_gpu_range(struct amdgpu_svm *svm,
+ struct drm_gpusvm_range *range,
+ uint64_t pte_flags,
+ bool flush_tlb,
+ bool wait_fence,
+ struct dma_fence **fence)
+{
+ lockdep_assert_held(&svm->gpusvm.notifier_lock);
+
+ const unsigned long range_start_page = drm_gpusvm_range_start(range) >> PAGE_SHIFT;
+ const unsigned long range_end_page = drm_gpusvm_range_end(range) >> PAGE_SHIFT;
+ const unsigned long npages = range_end_page - range_start_page;
+ unsigned long mapped_pages = 0;
+ unsigned long dma_idx = 0;
+ int ret;
+
+ if (!range->pages.dma_addr || !npages)
+ return -EINVAL;
+
+ while (mapped_pages < npages) {
+ const struct drm_pagemap_addr *entry = &range->pages.dma_addr[dma_idx++];
+ unsigned long seg_pages = min_t(unsigned long, 1UL << entry->order,
+ npages - mapped_pages);
+ dma_addr_t seg_addr = entry->addr;
+ unsigned long start_page, last_page;
+ bool is_last_seg;
+
+ if (entry->proto != DRM_INTERCONNECT_SYSTEM)
+ return -EOPNOTSUPP;
+
+ while (mapped_pages + seg_pages < npages) {
+ const struct drm_pagemap_addr *next = &range->pages.dma_addr[dma_idx];
+ unsigned long next_pages = min_t(unsigned long,
+ 1UL << next->order,
+ npages - (mapped_pages + seg_pages));
+
+ if (next->proto != entry->proto ||
+ next->addr != seg_addr + ((dma_addr_t)seg_pages << PAGE_SHIFT))
+ break;
+
+ seg_pages += next_pages;
+ dma_idx++;
+ }
+
+ start_page = range_start_page + mapped_pages;
+ last_page = start_page + seg_pages - 1;
+ is_last_seg = mapped_pages + seg_pages == npages;
+
+ ret = amdgpu_vm_update_range(svm->adev, svm->vm, false, false,
+ flush_tlb && is_last_seg, true, NULL,
+ start_page, last_page, pte_flags,
+ 0, seg_addr, NULL, NULL,
+ wait_fence && is_last_seg ? fence : NULL);
+ if (ret)
+ return ret;
+
+ mapped_pages += seg_pages;
+ }
+
+ return 0;
+}
+
+static int
+amdgpu_svm_range_map(struct amdgpu_svm *svm,
+ unsigned long start,
+ unsigned long end,
+ const struct amdgpu_svm_attrs *attrs,
+ const struct drm_gpusvm_ctx *gpusvm_ctx,
+ uint64_t pte_flags)
+{
+ unsigned long addr = start;
+ int ret;
+
+ while (addr < end) {
+ struct drm_exec exec;
+ struct drm_gpusvm_ctx map_ctx;
+ struct drm_gpusvm_range *range;
+ struct dma_fence *fence = NULL;
+ unsigned long vma_start;
+ unsigned long next_addr;
+ uint64_t range_pte_flags;
+ unsigned int flags;
+ bool skip_map;
+
+ vma_start = drm_gpusvm_find_vma_start(&svm->gpusvm, addr, end);
+ if (vma_start > addr)
+ return -EFAULT;
+
+ map_ctx = *gpusvm_ctx;
+retry:
+ range = drm_gpusvm_range_find_or_insert(&svm->gpusvm, addr,
+ vma_start, end,
+ &map_ctx);
+ if (IS_ERR(range)) {
+ ret = PTR_ERR(range);
+ /*
+ * drm_gpusvm rejects the request with -EPERM when the requested access
+ * does not match the VMA permissions, but some UMD tests do not set RO
+ * for read-only MM VMAs, so set read only and retry when ret == -EPERM.
+ */
+ if (ret == -EPERM && !map_ctx.read_only) {
+ map_ctx.read_only = true;
+ goto retry;
+ }
+ return ret;
+ }
+
+ next_addr = drm_gpusvm_range_end(range);
+ if (next_addr <= addr)
+ return -EINVAL;
+
+ range_pte_flags = map_ctx.read_only ?
+ (pte_flags & ~AMDGPU_PTE_WRITEABLE) : pte_flags;
+
+ skip_map = range_attr_match(range, attrs, range_pte_flags);
+
+ AMDGPU_SVM_TRACE("range_map: [0x%lx-0x%lx] skip=%d pte=0x%llx\n",
+ addr, next_addr, skip_map ? 1 : 0, range_pte_flags);
+
+ if (!skip_map) {
+ ret = drm_gpusvm_range_get_pages(&svm->gpusvm, range, &map_ctx);
+ if (ret)
+ return ret;
+ }
+
+ ret = amdgpu_svm_range_lock_vm_pd(svm, &exec);
+ if (ret)
+ return ret;
+
+ flags = memalloc_noreclaim_save();
+ drm_gpusvm_notifier_lock(&svm->gpusvm);
+ if (skip_map) {
+ /* slow path must validate under notifier lock */
+ if (!range_attr_match(range, attrs, range_pte_flags) ||
+ !range_pages_valid(svm, range)) {
+ range_invalidate_gpu_mapping(range);
+ ret = -EAGAIN;
+ } else {
+ ret = 0;
+ }
+ } else if (!range_pages_valid(svm, range)) {
+ /* not protected by the mmap lock; may have been changed by the mmu notifier */
+ ret = -EAGAIN;
+ } else {
+ ret = amdgpu_svm_range_update_gpu_range(svm, range,
+ range_pte_flags,
+ true, true, &fence);
+ }
+ drm_gpusvm_notifier_unlock(&svm->gpusvm);
+ memalloc_noreclaim_restore(flags);
+
+ if (!ret && fence)
+ dma_fence_wait(fence, false);
+
+ dma_fence_put(fence);
+
+ if (!ret)
+ ret = amdgpu_vm_update_pdes(svm->adev, svm->vm, false);
+ if (!ret) {
+ svm->flush_tlb(svm);
+ WRITE_ONCE(to_amdgpu_svm_range(range)->pte_flags, range_pte_flags);
+ WRITE_ONCE(to_amdgpu_svm_range(range)->attr_flags, attrs->flags);
+ WRITE_ONCE(to_amdgpu_svm_range(range)->gpu_mapped, true);
+ }
+ drm_exec_fini(&exec);
+
+ if (ret)
+ return ret;
+
+ addr = next_addr;
+ }
+
+ return 0;
+}
+
+static int
+amdgpu_svm_range_map_interval(struct amdgpu_svm *svm, unsigned long start_page,
+ unsigned long last_page,
+ const struct amdgpu_svm_attrs *attrs)
+{
+ struct drm_gpusvm_ctx gpusvm_ctx = {
+ .read_only = !!(attrs->flags & AMDGPU_SVM_FLAG_GPU_RO),
+ };
+ unsigned long start = start_page << PAGE_SHIFT;
+ unsigned long end = (last_page + 1) << PAGE_SHIFT;
+ uint64_t pte_flags;
+ int ret;
+
+ pte_flags = amdgpu_svm_range_attr_pte_flags(svm, attrs);
+
+ ret = amdgpu_svm_range_map(svm, start, end, attrs, &gpusvm_ctx,
+ pte_flags);
+ if (ret)
+ AMDGPU_SVM_TRACE("map_interval failed: ret=%d [0x%lx-0x%lx)-0x%lx\n",
+ ret, start, end, end - start);
+
+ return ret;
+}
+
+int
+amdgpu_svm_range_map_attr_ranges(struct amdgpu_svm *svm,
+ unsigned long start_page,
+ unsigned long last_page)
+{
+ lockdep_assert_held_write(&svm->svm_lock);
+
+ struct amdgpu_svm_attr_tree *attr_tree = svm->attr_tree;
+ unsigned long cursor = start_page;
+
+ while (cursor <= last_page) {
+ struct amdgpu_svm_attrs attrs;
+ unsigned long seg_last;
+ unsigned long next;
+ int ret;
+
+ mutex_lock(&attr_tree->lock);
+ amdgpu_svm_attr_lookup_page_locked(attr_tree, cursor, &attrs,
+ &seg_last);
+ mutex_unlock(&attr_tree->lock);
+
+ seg_last = min(seg_last, last_page);
+ if (range_has_access(attrs.access)) {
+ /* map may fail here because there is no VMA or access is denied */
+ ret = amdgpu_svm_range_map_interval(svm, cursor, seg_last,
+ &attrs);
+ if (ret)
+ return ret;
+ }
+
+ if (seg_last == ULONG_MAX || seg_last == last_page)
+ break;
+
+ next = seg_last + 1;
+ if (next <= cursor)
+ break;
+ cursor = next;
+ }
+
+ return 0;
+}
+
+int amdgpu_svm_range_apply_attr_change(struct amdgpu_svm *svm,
+ unsigned long start,
+ unsigned long last,
+ uint32_t trigger,
+ const struct amdgpu_svm_attrs *prev_attrs,
+ const struct amdgpu_svm_attrs *new_attrs)
+{
+ lockdep_assert_held_write(&svm->svm_lock);
+
+ bool old_access, new_access;
+ bool update_mapping = false;
+
+ old_access = range_has_access(prev_attrs->access);
+ new_access = range_has_access(new_attrs->access);
+
+ AMDGPU_SVM_TRACE("attr change trigger=0x%x old_access=%d new_access=%d [0x%lx-0x%lx]-0x%lx, xnack=%d\n",
+ trigger, old_access, new_access, start, last, last - start + 1,
+ svm->xnack_enabled ? 1 : 0);
+
+ if (trigger & AMDGPU_SVM_ATTR_TRIGGER_ACCESS_CHANGE) {
+ if (!new_access && old_access) {
+ /*
+ * Do nothing, to align with kfd svm behavior.
+ * TODO: unmap ranges from the GPU that lost access
+ */
+ AMDGPU_SVM_TRACE("skip unmap ioctl operation [0x%lx-0x%lx]-0x%lx\n",
+ start, last, last - start + 1);
+ } else if (new_access) {
+ if (NEED_REBUILD(svm) ||
+ (new_attrs->flags & AMDGPU_SVM_FLAG_GPU_ALWAYS_MAPPED))
+ update_mapping = true;
+ }
+ }
+
+ if ((trigger & (AMDGPU_SVM_ATTR_TRIGGER_PTE_FLAG_CHANGE |
+ AMDGPU_SVM_ATTR_TRIGGER_MAPPING_FLAG_CHANGE)) &&
+ new_access)
+ update_mapping = true;
+
+ if (trigger & AMDGPU_SVM_ATTR_TRIGGER_LOCATION_CHANGE) {
+ /* TODO: add migration */
+ }
+
+ if (!update_mapping)
+ return 0;
+
+ AMDGPU_SVM_TRACE("mapping update: remap interval [0x%lx-0x%lx]-0x%lx\n",
+ start, last, last - start + 1);
+ return amdgpu_svm_range_map_interval(svm, start, last, new_attrs);
+}
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread* [RFC/POC PATCH 08/12] drm/amdgpu: implement SVM range notifier and invalidation
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
` (6 preceding siblings ...)
2026-03-17 11:29 ` [RFC/POC PATCH 07/12] drm/amdgpu: implement SVM range PTE flags and GPU mapping Honglei Huang
@ 2026-03-17 11:29 ` Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 09/12] drm/amdgpu: implement SVM range workers Honglei Huang
` (4 subsequent siblings)
12 siblings, 0 replies; 36+ messages in thread
From: Honglei Huang @ 2026-03-17 11:29 UTC (permalink / raw)
To: Alexander.Deucher, Felix.Kuehling, Christian.Koenig, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen
Cc: amd-gfx, dri-devel, honghuan
From: Honglei Huang <honghuan@amd.com>
Implement MMU notifier handling and range lifecycle management:
- GPU unmap in notifier context: synchronous PTE clear with
memalloc_noreclaim protection and fence wait
- Range removal: unmap pages via drm_gpusvm, invalidate GPU mapping,
remove from gpusvm
- Overlap removal: iterate notifiers and ranges in an interval,
remove all overlapping ranges, track rebuild bounds (a userspace-visible sketch follows below)
- Rebuild: remove overlapping ranges then remap via attr tree or
clear GPU PTEs with TLB flush
- Notifier range processing: walk ranges in a notifier for an MMU
event, clear PTEs and/or queue work depending on event type
- MMU invalidation dispatcher: classify events (unmap vs other),
determine operation (clear PTE, queue interval), trigger restore
for non-xnack mode
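As a userspace-visible illustration of the unmap path above, here is a minimal,
hypothetical sketch. It is not part of this patch: it borrows the SET_ATTR ioctl
added later in this series, assumes the updated amdgpu_drm.h uapi header is
installed as <drm/amdgpu_drm.h>, uses /dev/dri/renderD128 as an example node,
treats the ACCESS attribute value as illustrative only, and in practice the VM
must already be a compute VM with an SVM context (e.g. acquired through the
KFD/ROCr path) for the ioctl to succeed.

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <drm/amdgpu_drm.h>

int main(void)
{
        int fd = open("/dev/dri/renderD128", O_RDWR);
        size_t sz = 4UL << 20;
        char *buf;
        struct drm_amdgpu_svm_attribute attr = {
                .type = AMDGPU_SVM_ATTR_ACCESS,
                .value = 0, /* assumed "enable access" value, see the UAPI patch */
        };
        struct drm_amdgpu_gem_svm args = { 0 };

        if (fd < 0)
                return 1;
        buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        args.start_addr = (unsigned long)buf;
        args.size = sz;
        args.operation = AMDGPU_SVM_OP_SET_ATTR;
        args.nattr = 1;
        args.attrs_ptr = (unsigned long)&attr;
        ioctl(fd, DRM_IOCTL_AMDGPU_GEM_SVM, &args);

        memset(buf, 0, sz);

        /*
         * Unmapping only the middle 2 MiB raises MMU_NOTIFY_UNMAP for
         * that sub-range: the notifier clears the GPU PTEs and queues
         * the interval to the GC worker, which clears the attributes
         * of the unmapped pages, drops the whole overlapping drm_gpusvm
         * range and, on XNACK off, rebuilds the still-valid outer 1 MiB
         * halves from the attribute tree.
         */
        munmap(buf + (1UL << 20), 2UL << 20);

        close(fd);
        return 0;
}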
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 253 ++++++++++++++++++
1 file changed, 253 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
index b3bd4e2e6..eba0a52be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
@@ -114,6 +114,57 @@ range_pages_valid(struct amdgpu_svm *svm,
return drm_gpusvm_range_pages_valid(&svm->gpusvm, range);
}
+static int
+amdgpu_svm_range_gpu_unmap_in_notifier(struct amdgpu_svm *svm,
+ struct drm_gpusvm_range *range,
+ const struct mmu_notifier_range *mmu_range)
+{
+ struct dma_fence *fence = NULL;
+ unsigned long start = max(drm_gpusvm_range_start(range), mmu_range->start);
+ unsigned long end = min(drm_gpusvm_range_end(range), mmu_range->end);
+ unsigned int flags;
+ int ret;
+
+ if (end <= start)
+ return 0;
+
+ start >>= PAGE_SHIFT;
+ end = (end - 1) >> PAGE_SHIFT;
+
+ flags = memalloc_noreclaim_save();
+ ret = amdgpu_vm_update_range(svm->adev, svm->vm, false, true, true, false,
+ NULL, start, end, 0, 0, 0, NULL,
+ NULL, &fence);
+ memalloc_noreclaim_restore(flags);
+
+ if (!ret && fence) {
+ ret = dma_fence_wait(fence, false);
+ if (ret < 0)
+ AMDGPU_SVM_TRACE("notifier unmap fence wait failed: ret=%d [0x%lx-0x%lx]-0x%lx\n",
+ ret, start, end,
+ end - start + 1);
+ }
+
+ dma_fence_put(fence);
+ return ret;
+}
+
+static bool
+has_always_mapped_range(
+ struct drm_gpusvm_notifier *notifier,
+ const struct mmu_notifier_range *mmu_range)
+{
+ struct drm_gpusvm_range *range = NULL;
+
+ drm_gpusvm_for_each_range(range, notifier, mmu_range->start, mmu_range->end) {
+ if (READ_ONCE(to_amdgpu_svm_range(range)->attr_flags) &
+ AMDGPU_SVM_FLAG_GPU_ALWAYS_MAPPED)
+ return true;
+ }
+
+ return false;
+}
+
static uint64_t
amdgpu_svm_range_attr_pte_flags(struct amdgpu_svm *svm,
const struct amdgpu_svm_attrs *attrs)
@@ -487,6 +538,163 @@ amdgpu_svm_range_map_attr_ranges(struct amdgpu_svm *svm,
return 0;
}
+static void amdgpu_svm_range_remove(struct amdgpu_svm *svm,
+ struct drm_gpusvm_range *range,
+ struct drm_gpusvm_ctx *ctx)
+{
+ lockdep_assert_held_write(&svm->svm_lock);
+
+ if (!range->pages.flags.unmapped && !range->pages.flags.partial_unmap)
+ drm_gpusvm_range_unmap_pages(&svm->gpusvm, range, ctx);
+
+ range_invalidate_gpu_mapping(range);
+ drm_gpusvm_range_remove(&svm->gpusvm, range);
+}
+
+static bool
+amdgpu_svm_range_remove_overlaps(struct amdgpu_svm *svm, unsigned long start_page,
+ unsigned long last_page,
+ unsigned long *rebuild_start,
+ unsigned long *rebuild_last)
+{
+ lockdep_assert_held_write(&svm->svm_lock);
+
+ struct drm_gpusvm_ctx ctx = {
+ .in_notifier = false,
+ };
+ unsigned long start = start_page << PAGE_SHIFT;
+ unsigned long end = (last_page + 1) << PAGE_SHIFT;
+ struct drm_gpusvm_notifier *notifier, *next_notifier;
+ bool removed = false;
+
+ if (rebuild_start && rebuild_last) {
+ *rebuild_start = ULONG_MAX;
+ *rebuild_last = 0;
+ }
+
+ /* remove overlapping ranges; the entire range must be removed */
+ drm_gpusvm_for_each_notifier_safe(notifier, next_notifier, &svm->gpusvm,
+ start, end) {
+ struct drm_gpusvm_range *range, *next_range;
+
+ drm_gpusvm_for_each_range_safe(range, next_range, notifier, start,
+ end) {
+ unsigned long rs = drm_gpusvm_range_start(range) >> PAGE_SHIFT;
+ unsigned long rl = (drm_gpusvm_range_end(range) >> PAGE_SHIFT) - 1;
+
+ removed = true;
+ /* record rebuild start end, first range start and last range end */
+ if (rebuild_start && rebuild_last) {
+ *rebuild_start = min(*rebuild_start, rs);
+ *rebuild_last = max(*rebuild_last, rl);
+ }
+ amdgpu_svm_range_remove(svm, range, &ctx);
+ }
+ }
+
+ return removed;
+}
+
+static int amdgpu_svm_range_rebuild_locked(struct amdgpu_svm *svm,
+ unsigned long start_page,
+ unsigned long last_page,
+ bool rebuild)
+{
+ unsigned long rebuild_start = start_page;
+ unsigned long rebuild_last = last_page;
+ bool removed;
+ int ret;
+
+ lockdep_assert_held_write(&svm->svm_lock);
+
+ AMDGPU_SVM_TRACE("remove and rebuild: [0x%lx-0x%lx] rebuild=%d\n",
+ start_page, last_page, rebuild ? 1 : 0);
+
+ removed = amdgpu_svm_range_remove_overlaps(svm, start_page, last_page,
+ &rebuild_start,
+ &rebuild_last);
+ if (!removed)
+ return 0;
+
+ /* scan rebuild start/end to rebuild the extra removed ranges */
+ if (rebuild)
+ return amdgpu_svm_range_map_attr_ranges(svm, rebuild_start,
+ rebuild_last);
+
+ ret = amdgpu_svm_range_update_gpu(svm, rebuild_start, rebuild_last,
+ 0, NULL, true, true, true);
+ if (!ret)
+ svm->flush_tlb(svm);
+
+ return ret;
+}
+
+static void
+amdgpu_svm_range_process_notifier_ranges(struct amdgpu_svm *svm,
+ struct drm_gpusvm_notifier *notifier,
+ const struct mmu_notifier_range *mmu_range,
+ uint32_t notifier_op,
+ enum amdgpu_svm_range_queue_op queue_op)
+{
+ struct drm_gpusvm_ctx ctx = {
+ .in_notifier = true,
+ };
+ struct drm_gpusvm_range *range = NULL;
+ bool queue_ranges = notifier_op & AMDGPU_SVM_RANGE_NOTIFIER_QUEUE_INTERVAL;
+ bool clear_pte = notifier_op & AMDGPU_SVM_RANGE_NOTIFIER_CLEAR_PTE;
+ bool is_unmap = mmu_range->event == MMU_NOTIFY_UNMAP;
+ bool has_range = false;
+
+ lockdep_assert_held(&svm->gpusvm.notifier_lock);
+
+ drm_gpusvm_for_each_range(range, notifier, mmu_range->start, mmu_range->end) {
+ has_range = true;
+ if (clear_pte) {
+ amdgpu_svm_range_gpu_unmap_in_notifier(svm, range,
+ mmu_range);
+ range_invalidate_gpu_mapping(range);
+ }
+
+ drm_gpusvm_range_unmap_pages(&svm->gpusvm, range, &ctx);
+ if (is_unmap)
+ drm_gpusvm_range_set_unmapped(range, mmu_range);
+
+ if (queue_ranges) {
+ unsigned long start = max(drm_gpusvm_range_start(range),
+ mmu_range->start) >> PAGE_SHIFT;
+ unsigned long last = (min(drm_gpusvm_range_end(range),
+ mmu_range->end) - 1) >> PAGE_SHIFT;
+
+ amdgpu_svm_range_enqueue(svm, to_amdgpu_svm_range(range),
+ start, last, queue_op);
+ }
+ }
+
+ if (has_range && clear_pte)
+ svm->flush_tlb(svm);
+}
+
+static bool
+amdgpu_svm_range_interval_has_range(struct amdgpu_svm *svm,
+ unsigned long start_page,
+ unsigned long last_page)
+{
+ lockdep_assert_held(&svm->svm_lock);
+
+ unsigned long start = start_page << PAGE_SHIFT;
+ unsigned long end = (last_page + 1) << PAGE_SHIFT;
+ struct drm_gpusvm_notifier *notifier;
+
+ drm_gpusvm_for_each_notifier(notifier, &svm->gpusvm, start, end) {
+ struct drm_gpusvm_range *range = NULL;
+
+ drm_gpusvm_for_each_range(range, notifier, start, end)
+ return true;
+ }
+
+ return false;
+}
+
int amdgpu_svm_range_apply_attr_change(struct amdgpu_svm *svm,
unsigned long start,
unsigned long last,
@@ -537,3 +745,48 @@ int amdgpu_svm_range_apply_attr_change(struct amdgpu_svm *svm,
start, last, last - start + 1);
return amdgpu_svm_range_map_interval(svm, start, last, new_attrs);
}
+
+static void amdgpu_svm_range_begin_restore(struct amdgpu_svm *svm)
+{
+ if (atomic_inc_return(&svm->evicted_ranges) != 1)
+ return;
+
+ svm->begin_restore(svm);
+}
+
+void amdgpu_svm_range_invalidate(struct amdgpu_svm *svm,
+ struct drm_gpusvm_notifier *notifier,
+ const struct mmu_notifier_range *mmu_range)
+{
+ bool is_unmap = mmu_range->event == MMU_NOTIFY_UNMAP;
+ uint32_t op;
+ enum amdgpu_svm_range_queue_op queue_op;
+
+ if (mmu_range->event == MMU_NOTIFY_RELEASE)
+ return;
+ if (atomic_read(&svm->exiting))
+ return;
+
+ if (!drm_gpusvm_range_find(notifier, mmu_range->start,
+ mmu_range->end))
+ return;
+
+ if (is_unmap) {
+ op = AMDGPU_SVM_RANGE_NOTIFIER_CLEAR_PTE |
+ AMDGPU_SVM_RANGE_NOTIFIER_QUEUE_INTERVAL;
+ queue_op = AMDGPU_SVM_RANGE_OP_UNMAP;
+ if (NEED_REBUILD(svm))
+ amdgpu_svm_range_begin_restore(svm);
+ } else if (NEED_REBUILD(svm) ||
+ has_always_mapped_range(notifier, mmu_range)) {
+ op = AMDGPU_SVM_RANGE_NOTIFIER_QUEUE_INTERVAL;
+ queue_op = AMDGPU_SVM_RANGE_OP_RESTORE;
+ amdgpu_svm_range_begin_restore(svm);
+ } else {
+ op = AMDGPU_SVM_RANGE_NOTIFIER_CLEAR_PTE;
+ queue_op = AMDGPU_SVM_RANGE_OP_RESTORE;
+ }
+
+ amdgpu_svm_range_process_notifier_ranges(svm, notifier, mmu_range,
+ op, queue_op);
+}
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread* [RFC/POC PATCH 09/12] drm/amdgpu: implement SVM range workers
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
` (7 preceding siblings ...)
2026-03-17 11:29 ` [RFC/POC PATCH 08/12] drm/amdgpu: implement SVM range notifier and invalidation Honglei Huang
@ 2026-03-17 11:29 ` Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 10/12] drm/amdgpu: implement SVM core initialization and fini Honglei Huang
` (3 subsequent siblings)
12 siblings, 0 replies; 36+ messages in thread
From: Honglei Huang @ 2026-03-17 11:29 UTC (permalink / raw)
To: Alexander.Deucher, Felix.Kuehling, Christian.Koenig, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen
Cc: amd-gfx, dri-devel, honghuan
From: Honglei Huang <honghuan@amd.com>
- KFD queue quiesce/resume: reuse the KFD APIs
- GC worker: processes unmap events by clearing attributes and
rebuilding GPU mappings, queueing into the restore queue if the rebuild failed
- Restore worker: restores evicted ranges via attr tree lookup
- Flush/sync helpers for orderly shutdown
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 404 ++++++++++++++++++
1 file changed, 404 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
index eba0a52be..472a641fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
@@ -114,6 +114,7 @@ range_pages_valid(struct amdgpu_svm *svm,
return drm_gpusvm_range_pages_valid(&svm->gpusvm, range);
}
+
static int
amdgpu_svm_range_gpu_unmap_in_notifier(struct amdgpu_svm *svm,
struct drm_gpusvm_range *range,
@@ -246,6 +247,59 @@ amdgpu_svm_range_attr_pte_flags(struct amdgpu_svm *svm,
return pte_flags;
}
+ /*
+ * POC/WA: reuse the KFD APIs for queue quiesce/resume.
+ * The KFD APIs operate at process level, not GPU VM level,
+ * so potential issues need to be considered.
+ */
+void amdgpu_svm_range_restore_begin_compute(struct amdgpu_svm *svm)
+{
+ int ret;
+
+ if (!svm->gpusvm.mm)
+ return;
+
+ if (atomic_cmpxchg(&svm->kfd_queues_quiesced, 0, 1) != 0)
+ return;
+
+ ret = kgd2kfd_quiesce_mm(svm->gpusvm.mm, KFD_QUEUE_EVICTION_TRIGGER_SVM);
+ if (ret == -ESRCH) {
+ AMDGPU_SVM_TRACE("kfd quiesce skipped no KFD process\n");
+ atomic_set(&svm->kfd_queues_quiesced, 0);
+ return;
+ }
+
+ if (ret) {
+ AMDGPU_SVM_TRACE("kfd quiesce failed ret=%d\n", ret);
+ atomic_set(&svm->kfd_queues_quiesced, 0);
+ return;
+ }
+
+ AMDGPU_SVM_TRACE("kfd quiesce ret=%d\n", ret);
+}
+
+void amdgpu_svm_range_restore_end_compute(struct amdgpu_svm *svm)
+{
+ int ret;
+
+ if (atomic_cmpxchg(&svm->kfd_queues_quiesced, 1, 0) != 1)
+ return;
+
+ if (!svm->gpusvm.mm)
+ return;
+
+ ret = kgd2kfd_resume_mm(svm->gpusvm.mm);
+ if (ret == -ESRCH) {
+ AMDGPU_SVM_TRACE("kfd resume skipped no KFD process\n");
+ return;
+ }
+
+ if (ret)
+ AMDGPU_SVM_TRACE("kfd resume failed ret=%d\n", ret);
+ else
+ AMDGPU_SVM_TRACE("kfd resume ret=%d\n", ret);
+}
+
static int amdgpu_svm_range_lock_vm_pd(struct amdgpu_svm *svm, struct drm_exec *exec)
{
int ret;
@@ -746,6 +800,169 @@ int amdgpu_svm_range_apply_attr_change(struct amdgpu_svm *svm,
return amdgpu_svm_range_map_interval(svm, start, last, new_attrs);
}
+static bool
+range_dequeue_locked(struct amdgpu_svm *svm,
+ struct list_head *work_list,
+ bool restore_queue,
+ struct range_pending_op_ctx *op_ctx)
+{
+ struct amdgpu_svm_range *range;
+
+ lockdep_assert_held(&svm->gc_lock);
+
+ range = list_first_entry_or_null(work_list, struct amdgpu_svm_range,
+ gc_node);
+ if (!range)
+ return false;
+
+ list_del_init(&range->gc_node);
+ if (restore_queue)
+ range->restore_queued = false;
+ else
+ range->gc_queued = false;
+
+ op_ctx->range = range;
+ op_ctx->start = range->pending_start;
+ op_ctx->last = range->pending_last;
+ op_ctx->pending_ops = range->pending_ops;
+
+ range->pending_start = ULONG_MAX;
+ range->pending_last = 0;
+ range->pending_ops = AMDGPU_SVM_RANGE_PENDING_OP_NONE;
+
+ return true;
+}
+
+static void
+range_requeue_restore_locked(struct amdgpu_svm *svm,
+ struct amdgpu_svm_range *range,
+ unsigned long start,
+ unsigned long last)
+{
+ lockdep_assert_held(&svm->gc_lock);
+
+ range->pending_start = min(range->pending_start, start);
+ range->pending_last = max(range->pending_last, last);
+ range->pending_ops |= AMDGPU_SVM_RANGE_PENDING_OP_RESTORE;
+
+ if (!range->gc_queued && !range->restore_queued) {
+ list_add_tail(&range->gc_node, &svm->restore_work_list);
+ range->restore_queued = true;
+ }
+}
+
+static bool
+range_try_dequeue(struct amdgpu_svm_range *range)
+{
+ if (!range->in_queue)
+ return false;
+
+ if (range->gc_queued || range->restore_queued ||
+ range->pending_start <= range->pending_last ||
+ range->pending_ops != AMDGPU_SVM_RANGE_PENDING_OP_NONE)
+ return false;
+
+ range->in_queue = false;
+ return true;
+}
+
+static void
+range_put_if_dequeued(struct amdgpu_svm *svm,
+ struct amdgpu_svm_range *range)
+{
+ bool dequeue;
+
+ spin_lock(&svm->gc_lock);
+ dequeue = range_try_dequeue(range);
+ spin_unlock(&svm->gc_lock);
+
+ if (dequeue)
+ drm_gpusvm_range_put(&range->base);
+}
+
+static void
+amdgpu_svm_range_enqueue(struct amdgpu_svm *svm,
+ struct amdgpu_svm_range *range,
+ unsigned long start,
+ unsigned long last,
+ enum amdgpu_svm_range_queue_op op)
+{
+ bool queue_gc_work = false;
+ bool queue_restore_work = false;
+
+ if (atomic_read(&svm->exiting))
+ return;
+
+ spin_lock(&svm->gc_lock);
+ if (!range->in_queue) {
+ drm_gpusvm_range_get(&range->base);
+ range->in_queue = true;
+ }
+
+ range->pending_start = min(range->pending_start, start);
+ range->pending_last = max(range->pending_last, last);
+
+ switch (op) {
+ case AMDGPU_SVM_RANGE_OP_UNMAP:
+ range->pending_ops |= AMDGPU_SVM_RANGE_PENDING_OP_UNMAP;
+ if (NEED_REBUILD(svm))
+ range->pending_ops |= AMDGPU_SVM_RANGE_PENDING_OP_RESTORE;
+ break;
+ case AMDGPU_SVM_RANGE_OP_RESTORE:
+ range->pending_ops |= AMDGPU_SVM_RANGE_PENDING_OP_RESTORE;
+ break;
+ }
+
+ if (UNMAP_WORK(range->pending_ops)) {
+ if (range->restore_queued) {
+ list_move_tail(&range->gc_node, &svm->gc_list);
+ range->restore_queued = false;
+ range->gc_queued = true;
+ } else if (!range->gc_queued) {
+ list_add_tail(&range->gc_node, &svm->gc_list);
+ range->gc_queued = true;
+ }
+ queue_gc_work = true;
+ } else if (RESTORE_WORK(range->pending_ops)) {
+ if (!range->gc_queued && !range->restore_queued) {
+ list_add_tail(&range->gc_node, &svm->restore_work_list);
+ range->restore_queued = true;
+ }
+ queue_restore_work = true;
+ }
+
+ spin_unlock(&svm->gc_lock);
+
+ if (queue_gc_work)
+ queue_work(svm->gc_wq, &svm->gc_work);
+ if (queue_restore_work)
+ queue_delayed_work(svm->restore_wq, &svm->restore_work,
+ msecs_to_jiffies(AMDGPU_SVM_RANGE_RESTORE_DELAY_MS));
+}
+
+static int
+amdgpu_svm_range_process_unmap_interval(struct amdgpu_svm *svm,
+ unsigned long start, unsigned long last,
+ bool rebuild)
+{
+ int ret = 0;
+
+ down_write(&svm->svm_lock);
+ /* clean attrs */
+ amdgpu_svm_attr_clear_pages(svm->attr_tree, start, last);
+
+ /* rebuild if needed */
+ if (amdgpu_svm_range_interval_has_range(svm, start, last))
+ ret = amdgpu_svm_range_rebuild_locked(svm, start, last, rebuild);
+
+ up_write(&svm->svm_lock);
+
+ AMDGPU_SVM_TRACE("work=UNMAP ret=%d start=0x%lx last=0x%lx rebuild=%d\n",
+ ret, start, last, rebuild ? 1 : 0);
+
+ return ret;
+}
+
static void amdgpu_svm_range_begin_restore(struct amdgpu_svm *svm)
{
if (atomic_inc_return(&svm->evicted_ranges) != 1)
@@ -754,6 +971,121 @@ static void amdgpu_svm_range_begin_restore(struct amdgpu_svm *svm)
svm->begin_restore(svm);
}
+static void amdgpu_svm_range_restore_worker(struct work_struct *w)
+{
+ struct delayed_work *dwork = to_delayed_work(w);
+ struct amdgpu_svm *svm = container_of(dwork, struct amdgpu_svm, restore_work);
+ unsigned long resched_delay =
+ max_t(unsigned long, 1,
+ msecs_to_jiffies(AMDGPU_SVM_RANGE_RESTORE_DELAY_MS));
+ struct range_pending_op_ctx op_ctx;
+ int evicted_record;
+ bool need_resched = false;
+ bool has_pending;
+ int ret;
+
+ if (atomic_read(&svm->exiting))
+ return;
+
+ evicted_record = atomic_read(&svm->evicted_ranges);
+ if (!evicted_record)
+ return;
+
+ if (!svm->gpusvm.mm) {
+ atomic_set(&svm->evicted_ranges, 0);
+ svm->end_restore(svm);
+ return;
+ }
+
+ spin_lock(&svm->gc_lock);
+ while (range_dequeue_locked(svm, &svm->restore_work_list,
+ true, &op_ctx)) {
+ spin_unlock(&svm->gc_lock);
+
+ down_write(&svm->svm_lock);
+ ret = amdgpu_svm_range_map_attr_ranges(svm, op_ctx.start,
+ op_ctx.last);
+ up_write(&svm->svm_lock);
+
+ if (ret) {
+ AMDGPU_SVM_TRACE("restore work retry: ret=%d start=0x%lx last=0x%lx\n",
+ ret, op_ctx.start, op_ctx.last);
+ spin_lock(&svm->gc_lock);
+ range_requeue_restore_locked(svm, op_ctx.range,
+ op_ctx.start, op_ctx.last);
+ spin_unlock(&svm->gc_lock);
+ need_resched = true;
+ }
+
+ range_put_if_dequeued(svm, op_ctx.range);
+ spin_lock(&svm->gc_lock);
+ }
+ spin_unlock(&svm->gc_lock);
+
+ spin_lock(&svm->gc_lock);
+ has_pending = !list_empty(&svm->restore_work_list) ||
+ !list_empty(&svm->gc_list);
+ spin_unlock(&svm->gc_lock);
+
+ if (!need_resched && !has_pending) {
+
+ drm_gpusvm_notifier_lock(&svm->gpusvm);
+ spin_lock(&svm->gc_lock);
+
+ has_pending = !list_empty(&svm->restore_work_list) || !list_empty(&svm->gc_list);
+
+ spin_unlock(&svm->gc_lock);
+
+ if (!has_pending &&
+ atomic_cmpxchg(&svm->evicted_ranges, evicted_record, 0) == evicted_record) {
+
+ drm_gpusvm_notifier_unlock(&svm->gpusvm);
+ svm->end_restore(svm);
+ return;
+
+ }
+ drm_gpusvm_notifier_unlock(&svm->gpusvm);
+ }
+
+ queue_delayed_work(svm->restore_wq, &svm->restore_work, resched_delay);
+}
+
+static void amdgpu_svm_range_gc_worker(struct work_struct *w)
+{
+ struct amdgpu_svm *svm = container_of(w, struct amdgpu_svm, gc_work);
+ struct range_pending_op_ctx op_ctx;
+
+ spin_lock(&svm->gc_lock);
+ while (range_dequeue_locked(svm, &svm->gc_list,
+ false, &op_ctx)) {
+ int ret = 0;
+
+ spin_unlock(&svm->gc_lock);
+
+ if (UNMAP_WORK(op_ctx.pending_ops))
+ ret = amdgpu_svm_range_process_unmap_interval(svm,
+ op_ctx.start, op_ctx.last,
+ NEED_REBUILD(svm));
+
+ if (RESTORE_WORK(op_ctx.pending_ops)) {
+ /* queue into restore wq, if rebuild failed */
+ if (NEED_REBUILD(svm) && !ret)
+ queue_delayed_work(svm->restore_wq,
+ &svm->restore_work,
+ msecs_to_jiffies(AMDGPU_SVM_RANGE_RESTORE_DELAY_MS));
+ else
+ amdgpu_svm_range_enqueue(svm, op_ctx.range,
+ op_ctx.start,
+ op_ctx.last,
+ AMDGPU_SVM_RANGE_OP_RESTORE);
+ }
+
+ range_put_if_dequeued(svm, op_ctx.range);
+ spin_lock(&svm->gc_lock);
+ }
+ spin_unlock(&svm->gc_lock);
+}
+
void amdgpu_svm_range_invalidate(struct amdgpu_svm *svm,
struct drm_gpusvm_notifier *notifier,
const struct mmu_notifier_range *mmu_range)
@@ -790,3 +1122,75 @@ void amdgpu_svm_range_invalidate(struct amdgpu_svm *svm,
amdgpu_svm_range_process_notifier_ranges(svm, notifier, mmu_range,
op, queue_op);
}
+
+int amdgpu_svm_range_work_init(struct amdgpu_svm *svm)
+{
+ svm->gc_wq = alloc_workqueue(AMDGPU_SVM_RANGE_WQ_NAME,
+ WQ_UNBOUND | WQ_HIGHPRI | WQ_MEM_RECLAIM, 0);
+ if (!svm->gc_wq)
+ return -ENOMEM;
+
+ svm->restore_wq = alloc_ordered_workqueue(AMDGPU_SVM_RESTORE_WQ_NAME,
+ WQ_HIGHPRI | WQ_MEM_RECLAIM);
+ if (!svm->restore_wq) {
+ destroy_workqueue(svm->gc_wq);
+ svm->gc_wq = NULL;
+ return -ENOMEM;
+ }
+
+ init_rwsem(&svm->svm_lock);
+ spin_lock_init(&svm->gc_lock);
+ INIT_LIST_HEAD(&svm->gc_list);
+ INIT_LIST_HEAD(&svm->restore_work_list);
+ INIT_WORK(&svm->gc_work, amdgpu_svm_range_gc_worker);
+ INIT_DELAYED_WORK(&svm->restore_work, amdgpu_svm_range_restore_worker);
+
+ return 0;
+}
+
+void amdgpu_svm_range_flush(struct amdgpu_svm *svm)
+{
+ flush_work(&svm->gc_work);
+ flush_delayed_work(&svm->restore_work);
+ flush_work(&svm->gc_work);
+}
+
+void amdgpu_svm_range_sync_work(struct amdgpu_svm *svm)
+{
+ amdgpu_svm_range_flush(svm);
+ flush_workqueue(svm->gc_wq);
+ flush_workqueue(svm->restore_wq);
+}
+
+static void
+amdgpu_svm_range_clean_queue(struct amdgpu_svm *svm,
+ struct list_head *work_list,
+ bool restore_queue)
+{
+ struct range_pending_op_ctx op_ctx;
+
+ spin_lock(&svm->gc_lock);
+ while (range_dequeue_locked(svm, work_list,
+ restore_queue, &op_ctx)) {
+ spin_unlock(&svm->gc_lock);
+ range_put_if_dequeued(svm, op_ctx.range);
+ spin_lock(&svm->gc_lock);
+ }
+ spin_unlock(&svm->gc_lock);
+}
+
+void amdgpu_svm_range_work_fini(struct amdgpu_svm *svm)
+{
+ cancel_delayed_work_sync(&svm->restore_work);
+ flush_work(&svm->gc_work);
+ amdgpu_svm_range_clean_queue(svm, &svm->gc_list, false);
+ amdgpu_svm_range_clean_queue(svm, &svm->restore_work_list, true);
+ atomic_set(&svm->evicted_ranges, 0);
+ if (atomic_read(&svm->kfd_queues_quiesced))
+ svm->end_restore(svm);
+
+ destroy_workqueue(svm->restore_wq);
+ svm->restore_wq = NULL;
+ destroy_workqueue(svm->gc_wq);
+ svm->gc_wq = NULL;
+}
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread* [RFC/POC PATCH 10/12] drm/amdgpu: implement SVM core initialization and fini
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
` (8 preceding siblings ...)
2026-03-17 11:29 ` [RFC/POC PATCH 09/12] drm/amdgpu: implement SVM range workers Honglei Huang
@ 2026-03-17 11:29 ` Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 11/12] drm/amdgpu: implement SVM ioctl and fault handler Honglei Huang
` (2 subsequent siblings)
12 siblings, 0 replies; 36+ messages in thread
From: Honglei Huang @ 2026-03-17 11:29 UTC (permalink / raw)
To: Alexander.Deucher, Felix.Kuehling, Christian.Koenig, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen
Cc: amd-gfx, dri-devel, honghuan
From: Honglei Huang <honghuan@amd.com>
- kmem_cache management for amdgpu_svm_range
- Reference counting: kref-based release for async safety
- XNACK default-enablement helper
- TLB flush helper for compute mode
- amdgpu_svm_init_with_ops: allocate SVM context, initialize
attr tree, work queues, and drm_gpusvm with configurable
chunk sizes and notifier size
- amdgpu_svm_init/close/fini: public lifecycle API
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 270 ++++++++++++++++++++++++
1 file changed, 270 insertions(+)
create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
new file mode 100644
index 000000000..aa40e1126
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
@@ -0,0 +1,270 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright 2026 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#include <linux/sched/mm.h>
+#include <linux/uaccess.h>
+#include <linux/xarray.h>
+
+#include <drm/drm_file.h>
+
+#include "amdgpu.h"
+#include "amdgpu_svm.h"
+#include "amdgpu_svm_attr.h"
+#include "amdgpu_svm_range.h"
+#include "amdgpu_vm.h"
+
+#if IS_ENABLED(CONFIG_DRM_AMDGPU_SVM)
+
+#define AMDGPU_SVM_MAX_ATTRS 64
+#define AMDGPU_SVM_DEFAULT_SVM_NOTIFIER_SIZE 512
+
+static const unsigned long amdgpu_svm_chunk_sizes[] = {
+ SZ_2M,
+ SZ_64K,
+ SZ_4K,
+};
+
+static struct kmem_cache *amdgpu_svm_range_cache;
+
+static void amdgpu_svm_invalidate(struct drm_gpusvm *gpusvm,
+ struct drm_gpusvm_notifier *notifier,
+ const struct mmu_notifier_range *mmu_range)
+{
+ amdgpu_svm_range_invalidate(to_amdgpu_svm(gpusvm), notifier, mmu_range);
+}
+
+static struct drm_gpusvm_range *amdgpu_svm_range_alloc(struct drm_gpusvm *gpusvm)
+{
+ struct amdgpu_svm_range *range;
+
+ range = kmem_cache_zalloc(amdgpu_svm_range_cache, GFP_KERNEL);
+ if (!range)
+ return NULL;
+
+ INIT_LIST_HEAD(&range->gc_node);
+ range->pending_start = ULONG_MAX;
+ return &range->base;
+}
+
+static void amdgpu_svm_range_free(struct drm_gpusvm_range *range)
+{
+ kmem_cache_free(amdgpu_svm_range_cache, to_amdgpu_svm_range(range));
+}
+
+static const struct drm_gpusvm_ops amdgpu_gpusvm_ops = {
+ .range_alloc = amdgpu_svm_range_alloc,
+ .range_free = amdgpu_svm_range_free,
+ .invalidate = amdgpu_svm_invalidate,
+};
+
+static void amdgpu_svm_release(struct kref *ref)
+{
+ kfree(container_of(ref, struct amdgpu_svm, refcount));
+}
+
+static void amdgpu_svm_put(struct amdgpu_svm *svm)
+{
+ if (svm)
+ kref_put(&svm->refcount, amdgpu_svm_release);
+}
+
+int amdgpu_svm_cache_init(void)
+{
+ int ret = 0;
+
+ if (amdgpu_svm_range_cache)
+ return 0;
+
+ amdgpu_svm_range_cache = AMDGPU_SVM_KMEM_CACHE_CREATE("amdgpu_svm_range_cache",
+ struct amdgpu_svm_range);
+ if (!amdgpu_svm_range_cache)
+ return -ENOMEM;
+
+ ret = amdgpu_svm_attr_cache_init();
+ if (ret)
+ goto free_out;
+
+ return 0;
+free_out:
+ amdgpu_svm_attr_cache_fini();
+ AMDGPU_SVM_KMEM_CACHE_DESTROY(amdgpu_svm_range_cache);
+ return ret;
+}
+
+void amdgpu_svm_cache_fini(void)
+{
+ if (!amdgpu_svm_range_cache)
+ return;
+
+ amdgpu_svm_attr_cache_fini();
+ AMDGPU_SVM_KMEM_CACHE_DESTROY(amdgpu_svm_range_cache);
+}
+
+static bool amdgpu_svm_default_xnack_enabled(struct amdgpu_device *adev)
+{
+ uint32_t gc_ver = amdgpu_ip_version(adev, GC_HWIP, 0);
+
+ if (gc_ver < IP_VERSION(9, 0, 1))
+ return false;
+ if (!amdgpu_sriov_xnack_support(adev))
+ return false;
+
+ switch (gc_ver) {
+ case IP_VERSION(9, 4, 2):
+ case IP_VERSION(9, 4, 3):
+ case IP_VERSION(9, 4, 4):
+ case IP_VERSION(9, 5, 0):
+ return true;
+ default:
+ break;
+ }
+ if (gc_ver >= IP_VERSION(10, 1, 1))
+ return false;
+ return !adev->gmc.noretry;
+}
+
+static void amdgpu_svm_flush_tlb_compute(struct amdgpu_svm *svm)
+{
+ amdgpu_vm_flush_compute_tlb(svm->adev, svm->vm, TLB_FLUSH_HEAVYWEIGHT,
+ svm->adev->gfx.xcc_mask);
+}
+
+static int amdgpu_svm_init_with_ops(struct amdgpu_device *adev,
+ struct amdgpu_vm *vm,
+ void (*begin_restore)(struct amdgpu_svm *),
+ void (*end_restore)(struct amdgpu_svm *),
+ void (*flush_tlb)(struct amdgpu_svm *))
+{
+ struct amdgpu_svm *svm;
+ int ret;
+
+ if (vm->svm)
+ return 0;
+
+ ret = amdgpu_svm_cache_init();
+ if (ret)
+ return ret;
+
+ svm = kzalloc(sizeof(*svm), GFP_KERNEL);
+ if (!svm)
+ return -ENOMEM;
+
+ kref_init(&svm->refcount);
+ svm->adev = adev;
+ svm->vm = vm;
+
+ svm->default_granularity = min_t(u8, amdgpu_svm_default_granularity, 0x3f);
+ svm->xnack_enabled = amdgpu_svm_default_xnack_enabled(adev);
+ svm->xnack_enabled = false; // WA/POC: force to disable xnack
+ svm->begin_restore = begin_restore;
+ svm->end_restore = end_restore;
+ svm->flush_tlb = flush_tlb;
+ atomic_set(&svm->kfd_queues_quiesced, 0);
+ atomic_set(&svm->evicted_ranges, 0);
+ atomic_set(&svm->exiting, 0);
+
+ ret = amdgpu_svm_range_work_init(svm);
+ if (ret)
+ goto err_free;
+
+ svm->attr_tree = amdgpu_svm_attr_tree_create(svm);
+ if (!svm->attr_tree) {
+ ret = -ENOMEM;
+ goto err_range_work_fini;
+ }
+
+ ret = drm_gpusvm_init(&svm->gpusvm, "AMDGPU SVM",
+ adev_to_drm(adev), current->mm, 0,
+ adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT,
+ AMDGPU_SVM_DEFAULT_SVM_NOTIFIER_SIZE * SZ_1M,
+ &amdgpu_gpusvm_ops,
+ amdgpu_svm_chunk_sizes,
+ ARRAY_SIZE(amdgpu_svm_chunk_sizes));
+
+ if (ret)
+ goto err_attr_tree_destroy;
+
+ drm_gpusvm_driver_set_lock(&svm->gpusvm, &svm->svm_lock);
+ vm->svm = svm;
+ return 0;
+
+err_attr_tree_destroy:
+ amdgpu_svm_attr_tree_destroy(svm->attr_tree);
+err_range_work_fini:
+ amdgpu_svm_range_work_fini(svm);
+err_free:
+ kfree(svm);
+ return ret;
+}
+
+static int amdgpu_svm_init_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm)
+{
+ return amdgpu_svm_init_with_ops(adev, vm,
+ amdgpu_svm_range_restore_begin_compute,
+ amdgpu_svm_range_restore_end_compute,
+ amdgpu_svm_flush_tlb_compute);
+}
+
+int amdgpu_svm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
+{
+ /* graphics SVM init may be different */
+
+ return amdgpu_svm_init_compute(adev, vm);
+}
+
+void amdgpu_svm_close(struct amdgpu_vm *vm)
+{
+ if (!vm->svm)
+ return;
+
+ if (atomic_xchg(&vm->svm->exiting, 1))
+ return;
+
+ amdgpu_svm_range_sync_work(vm->svm);
+}
+
+void amdgpu_svm_fini(struct amdgpu_vm *vm)
+{
+ struct amdgpu_svm *svm = vm->svm;
+
+ if (!svm)
+ return;
+
+ amdgpu_svm_close(vm);
+ down_write(&svm->svm_lock);
+ drm_gpusvm_fini(&svm->gpusvm);
+ up_write(&svm->svm_lock);
+
+ amdgpu_svm_range_work_fini(svm);
+ amdgpu_svm_attr_tree_destroy(svm->attr_tree);
+ vm->svm = NULL;
+ amdgpu_svm_put(svm);
+}
+
+bool amdgpu_svm_is_enabled(struct amdgpu_vm *vm)
+{
+ return vm->svm != NULL;
+}
+
+#endif /* CONFIG_DRM_AMDGPU_SVM */
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread* [RFC/POC PATCH 11/12] drm/amdgpu: implement SVM ioctl and fault handler
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
` (9 preceding siblings ...)
2026-03-17 11:29 ` [RFC/POC PATCH 10/12] drm/amdgpu: implement SVM core initialization and fini Honglei Huang
@ 2026-03-17 11:29 ` Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 12/12] drm/amdgpu: wire up SVM build system " Honglei Huang
2026-03-17 11:48 ` [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Christian König
12 siblings, 0 replies; 36+ messages in thread
From: Honglei Huang @ 2026-03-17 11:29 UTC (permalink / raw)
To: Alexander.Deucher, Felix.Kuehling, Christian.Koenig, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen
Cc: amd-gfx, dri-devel, honghuan
From: Honglei Huang <honghuan@amd.com>
Add the userspace and fault entry points for SVM:
- amdgpu_svm_lookup_by_pasid: look up SVM context from PASID via
vm_manager.pasids xarray with kref protection for async safety
- amdgpu_gem_svm_ioctl: ioctl handler that copies attributes from
userspace, validates page alignment and range, dispatches to
set_attr or get_attr, and copies results back for GET operations (a usage sketch follows below)
- amdgpu_svm_handle_fault: GPU page fault handler that looks up
SVM by PASID, checks xnack and exiting state, then maps the
faulting page range via the attribute tree under svm_lock
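A minimal, hypothetical userspace sketch of the ioctl flow described above (not
part of this patch): it assumes the updated amdgpu_drm.h uapi header, the
/dev/dri/renderD128 node, that the VM already has an SVM context via the
compute/KFD path, and that GET_ATTR fills in the value field for each requested
type; the ACCESS value used and the local svm_ioctl() helper are illustrative
only.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <drm/amdgpu_drm.h>

static int svm_ioctl(int fd, void *addr, size_t size, unsigned int op,
                     struct drm_amdgpu_svm_attribute *attrs, unsigned int nattr)
{
        struct drm_amdgpu_gem_svm args = {
                .start_addr = (unsigned long)addr,
                .size = size,
                .operation = op,
                .nattr = nattr,
                .attrs_ptr = (unsigned long)attrs,
        };

        return ioctl(fd, DRM_IOCTL_AMDGPU_GEM_SVM, &args);
}

int main(void)
{
        int fd = open("/dev/dri/renderD128", O_RDWR);
        size_t sz = 2UL << 20;
        void *buf;
        struct drm_amdgpu_svm_attribute set_attr = {
                .type = AMDGPU_SVM_ATTR_ACCESS,
                .value = 0, /* assumed "enable access" value */
        };
        struct drm_amdgpu_svm_attribute get_attr = {
                .type = AMDGPU_SVM_ATTR_ACCESS, /* value filled in by GET_ATTR */
        };

        if (fd < 0)
                return 1;
        buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        if (svm_ioctl(fd, buf, sz, AMDGPU_SVM_OP_SET_ATTR, &set_attr, 1))
                perror("AMDGPU_SVM_OP_SET_ATTR");
        if (svm_ioctl(fd, buf, sz, AMDGPU_SVM_OP_GET_ATTR, &get_attr, 1))
                perror("AMDGPU_SVM_OP_GET_ATTR");
        else
                printf("ACCESS attribute value: %u\n", get_attr.value);

        close(fd);
        return 0;
}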
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 160 ++++++++++++++++++++++++
1 file changed, 160 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
index aa40e1126..57103a140 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
@@ -89,6 +89,24 @@ static void amdgpu_svm_put(struct amdgpu_svm *svm)
kref_put(&svm->refcount, amdgpu_svm_release);
}
+static struct amdgpu_svm *
+amdgpu_svm_lookup_by_pasid(struct amdgpu_device *adev, uint32_t pasid)
+{
+ struct amdgpu_svm *svm = NULL;
+ struct amdgpu_vm *vm;
+ unsigned long irqflags;
+
+ xa_lock_irqsave(&adev->vm_manager.pasids, irqflags);
+ vm = xa_load(&adev->vm_manager.pasids, pasid);
+ if (vm && vm->svm) {
+ svm = vm->svm;
+ kref_get(&svm->refcount);
+ }
+ xa_unlock_irqrestore(&adev->vm_manager.pasids, irqflags);
+
+ return svm;
+}
+
int amdgpu_svm_cache_init(void)
{
int ret = 0;
@@ -121,6 +139,33 @@ void amdgpu_svm_cache_fini(void)
AMDGPU_SVM_KMEM_CACHE_DESTROY(amdgpu_svm_range_cache);
}
+static int amdgpu_svm_set_attr(struct amdgpu_vm *vm,
+ uint64_t start,
+ uint64_t size,
+ uint32_t nattr,
+ const struct drm_amdgpu_svm_attribute *attrs)
+{
+ struct amdgpu_svm *svm = vm->svm;
+
+ /* drm_gpusvm_range_find_or_insert() takes the mmap read lock, and we
+ * cannot hold the mmap lock for the whole ioctl, so just flush the
+ * pending work to reduce the probability of failure.
+ */
+ amdgpu_svm_range_sync_work(svm);
+
+ return amdgpu_svm_attr_set(svm->attr_tree, start, size, nattr,
+ attrs);
+}
+
+static int amdgpu_svm_get_attr(struct amdgpu_vm *vm,
+ uint64_t start,
+ uint64_t size,
+ uint32_t nattr,
+ struct drm_amdgpu_svm_attribute *attrs)
+{
+ return amdgpu_svm_attr_get(vm->svm->attr_tree, start, size, nattr, attrs);
+}
+
static bool amdgpu_svm_default_xnack_enabled(struct amdgpu_device *adev)
{
uint32_t gc_ver = amdgpu_ip_version(adev, GC_HWIP, 0);
@@ -262,9 +307,124 @@ void amdgpu_svm_fini(struct amdgpu_vm *vm)
amdgpu_svm_put(svm);
}
+int amdgpu_svm_handle_fault(struct amdgpu_device *adev, uint32_t pasid,
+ uint64_t fault_addr, bool write_fault)
+{
+ struct amdgpu_svm *svm;
+ unsigned long fault_page;
+ int ret;
+
+ AMDGPU_SVM_TRACE("handle_fault enter: pasid=%u addr=0x%llx write=%d\n",
+ pasid, fault_addr, write_fault ? 1 : 0);
+
+ svm = amdgpu_svm_lookup_by_pasid(adev, pasid);
+ if (!svm) {
+ AMDGPU_SVM_TRACE("handle_fault: pasid %u lookup failed\n", pasid);
+ return -EOPNOTSUPP;
+ }
+
+ AMDGPU_SVM_TRACE("handle_fault: pasid %u svm=%p exiting=%d xnack=%d\n",
+ pasid, svm, atomic_read(&svm->exiting),
+ svm->xnack_enabled ? 1 : 0);
+
+ if (atomic_read(&svm->exiting)) {
+ ret = -EAGAIN;
+ goto out;
+ }
+
+ if (!svm->xnack_enabled) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
+
+ fault_page = fault_addr >> PAGE_SHIFT;
+ AMDGPU_SVM_TRACE("handle_fault: map_attr page=0x%lx\n", fault_page);
+
+ down_write(&svm->svm_lock);
+ ret = amdgpu_svm_range_map_attr_ranges(svm, fault_page, fault_page);
+ up_write(&svm->svm_lock);
+
+ if (ret)
+ AMDGPU_SVM_TRACE("fault map failed: ret=%d addr=0x%llx write=%d\n",
+ ret, fault_addr, write_fault ? 1 : 0);
+ else
+ AMDGPU_SVM_TRACE("fault map success: addr=0x%llx write=%d\n",
+ fault_addr, write_fault ? 1 : 0);
+
+out:
+ AMDGPU_SVM_TRACE("handle_fault exit: pasid=%u addr=0x%llx ret=%d\n",
+ pasid, fault_addr, ret);
+ amdgpu_svm_put(svm);
+ return ret;
+}
+
bool amdgpu_svm_is_enabled(struct amdgpu_vm *vm)
{
return vm->svm != NULL;
}
+static int amdgpu_svm_copy_attrs(const struct drm_amdgpu_gem_svm *args,
+ struct drm_amdgpu_svm_attribute **attrs,
+ size_t *size)
+{
+ if (!args->nattr || args->nattr > AMDGPU_SVM_MAX_ATTRS)
+ return -EINVAL;
+ if (!args->attrs_ptr)
+ return -EINVAL;
+
+ *size = args->nattr * sizeof(**attrs);
+ *attrs = memdup_user(u64_to_user_ptr(args->attrs_ptr), *size);
+
+ return PTR_ERR_OR_ZERO(*attrs);
+}
+
+int amdgpu_gem_svm_ioctl(struct drm_device *dev, void *data,
+ struct drm_file *filp)
+{
+ struct amdgpu_fpriv *fpriv = filp->driver_priv;
+ struct drm_amdgpu_gem_svm *args = data;
+ struct drm_amdgpu_svm_attribute *attrs = NULL;
+ struct amdgpu_vm *vm;
+ size_t attrs_size = 0;
+ int ret = 0;
+
+ AMDGPU_SVM_TRACE("ioctl op=%u va:[0x%llx-0x%llx)-0x%llx nattr=%u\n",
+ args->operation, args->start_addr, args->start_addr + args->size,
+ args->size, args->nattr);
+
+ vm = &fpriv->vm;
+ if (!amdgpu_svm_is_enabled(vm))
+ return -EOPNOTSUPP;
+
+ if ((args->start_addr & ~PAGE_MASK) || (args->size & ~PAGE_MASK))
+ return -EINVAL;
+
+ if (!args->start_addr || !args->size)
+ return -EINVAL;
+
+ ret = amdgpu_svm_copy_attrs(args, &attrs, &attrs_size);
+ if (ret)
+ return ret;
+
+ switch (args->operation) {
+ case AMDGPU_SVM_OP_SET_ATTR:
+ ret = amdgpu_svm_set_attr(vm, args->start_addr, args->size,
+ args->nattr, attrs);
+ break;
+ case AMDGPU_SVM_OP_GET_ATTR:
+ ret = amdgpu_svm_get_attr(vm, args->start_addr, args->size,
+ args->nattr, attrs);
+ if (!ret && copy_to_user(u64_to_user_ptr(args->attrs_ptr),
+ attrs, attrs_size))
+ ret = -EFAULT;
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ kvfree(attrs);
+ return ret;
+}
+
#endif /* CONFIG_DRM_AMDGPU_SVM */
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread* [RFC/POC PATCH 12/12] drm/amdgpu: wire up SVM build system and fault handler
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
` (10 preceding siblings ...)
2026-03-17 11:29 ` [RFC/POC PATCH 11/12] drm/amdgpu: implement SVM ioctl and fault handler Honglei Huang
@ 2026-03-17 11:29 ` Honglei Huang
2026-03-17 11:48 ` [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Christian König
12 siblings, 0 replies; 36+ messages in thread
From: Honglei Huang @ 2026-03-17 11:29 UTC (permalink / raw)
To: Alexander.Deucher, Felix.Kuehling, Christian.Koenig, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen
Cc: amd-gfx, dri-devel, honghuan
From: Honglei Huang <honghuan@amd.com>
Enable SVM compilation and integrate it into the VM subsystem:
Kconfig:
- Add CONFIG_DRM_AMDGPU_SVM option (depends on DRM_AMDGPU and
DEVICE_PRIVATE, selects DRM_GPUSVM, HMM_MIRROR, MMU_NOTIFIER)
Makefile:
- Build amdgpu_svm.o, amdgpu_svm_attr.o, amdgpu_svm_range.o when
CONFIG_DRM_AMDGPU_SVM is enabled
amdgpu_drv.c:
- Register DRM_IOCTL_AMDGPU_GEM_SVM in the ioctl table (a probe sketch follows below)
amdgpu_vm.c:
- Initialize vm->svm = NULL in amdgpu_vm_init
- Call amdgpu_svm_init in amdgpu_vm_make_compute for compute VMs
- Call amdgpu_svm_close + amdgpu_svm_fini in amdgpu_vm_fini
- Integrate SVM fault handling in amdgpu_vm_handle_fault
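As a quick, hypothetical smoke test of the newly registered ioctl (not part of
this patch): per the handler added in the previous patch, a VM without an SVM
context fails with EOPNOTSUPP, while an SVM-enabled VM rejects a deliberately
empty attribute list with EINVAL. Kernels without the ioctl may also report
EINVAL, so only the EOPNOTSUPP case is unambiguous; the header and device-node
paths are assumptions as in the earlier sketches.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <drm/amdgpu_drm.h>

int main(void)
{
        int fd = open("/dev/dri/renderD128", O_RDWR);
        struct drm_amdgpu_gem_svm args = {
                .start_addr = 0x200000, /* page aligned, never dereferenced */
                .size = 0x1000,
                .operation = AMDGPU_SVM_OP_SET_ATTR,
                .nattr = 0,             /* deliberately invalid */
        };

        if (fd < 0)
                return 1;

        if (ioctl(fd, DRM_IOCTL_AMDGPU_GEM_SVM, &args) == 0)
                printf("unexpected success\n");
        else if (errno == EOPNOTSUPP)
                printf("ioctl present, but this VM has no SVM context\n");
        else if (errno == EINVAL)
                printf("SVM active (empty attribute list rejected), or ioctl unknown\n");
        else
                perror("DRM_IOCTL_AMDGPU_GEM_SVM");

        close(fd);
        return 0;
}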
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +++++++
drivers/gpu/drm/amd/amdgpu/Makefile | 13 ++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 ++++++++++++++++++++++---
4 files changed, 62 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/Kconfig b/drivers/gpu/drm/amd/amdgpu/Kconfig
index 1acfed2f9..22f679b85 100644
--- a/drivers/gpu/drm/amd/amdgpu/Kconfig
+++ b/drivers/gpu/drm/amd/amdgpu/Kconfig
@@ -74,6 +74,17 @@ config DRM_AMDGPU_USERPTR
This option selects CONFIG_HMM and CONFIG_HMM_MIRROR if it
isn't already selected to enabled full userptr support.
+config DRM_AMDGPU_SVM
+ bool "Enable AMDGPU SVM support (experimental)"
+ depends on DRM_AMDGPU
+ depends on DEVICE_PRIVATE
+ select DRM_GPUSVM
+ select HMM_MIRROR
+ select MMU_NOTIFIER
+ default y
+ help
+ Experimental SVM support based on DRM GPUSVM.
+
config DRM_AMD_ISP
bool "Enable AMD Image Signal Processor IP support"
depends on DRM_AMDGPU && ACPI
diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile
index 64e7acff8..6507d9a39 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -43,6 +43,10 @@ ccflags-y := -I$(FULL_AMD_PATH)/include/asic_reg \
subdir-ccflags-y += -Wno-override-init
subdir-ccflags-$(CONFIG_DRM_AMDGPU_WERROR) += -Werror
+ifneq ($(wildcard $(objtree)/drivers/gpu/drm/Module.symvers),)
+KBUILD_EXTRA_SYMBOLS += $(objtree)/drivers/gpu/drm/Module.symvers
+endif
+
amdgpu-y := amdgpu_drv.o
# add KMS driver
@@ -303,6 +307,15 @@ amdgpu-$(CONFIG_VGA_SWITCHEROO) += amdgpu_atpx_handler.o
amdgpu-$(CONFIG_ACPI) += amdgpu_acpi.o
amdgpu-$(CONFIG_HMM_MIRROR) += amdgpu_hmm.o
+# svm support
+amdgpu-$(CONFIG_DRM_AMDGPU_SVM) += amdgpu_svm.o amdgpu_svm_attr.o \
+ amdgpu_svm_range.o
+
+.PHONY: clean-svm
+clean-svm:
+ rm -f $(obj)/amdgpu_svm.o $(obj)/amdgpu_svm_attr.o $(obj)/amdgpu_svm_range.o \
+ $(obj)/.amdgpu_svm.o.cmd $(obj)/.amdgpu_svm_attr.o.cmd $(obj)/.amdgpu_svm_range.o.cmd
+
include $(FULL_AMD_PATH)/pm/Makefile
amdgpu-y += $(AMD_POWERPLAY_FILES)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 7333e1929..12b587f9c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -50,6 +50,7 @@
#include "amdgpu_ras.h"
#include "amdgpu_reset.h"
#include "amdgpu_sched.h"
+#include "amdgpu_svm.h"
#include "amdgpu_xgmi.h"
#include "amdgpu_userq.h"
#include "amdgpu_userq_fence.h"
@@ -3068,6 +3069,7 @@ const struct drm_ioctl_desc amdgpu_ioctls_kms[] = {
DRM_IOCTL_DEF_DRV(AMDGPU_USERQ_SIGNAL, amdgpu_userq_signal_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
DRM_IOCTL_DEF_DRV(AMDGPU_USERQ_WAIT, amdgpu_userq_wait_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
DRM_IOCTL_DEF_DRV(AMDGPU_GEM_LIST_HANDLES, amdgpu_gem_list_handles_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
+ DRM_IOCTL_DEF_DRV(AMDGPU_GEM_SVM, amdgpu_gem_svm_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
};
static const struct drm_driver amdgpu_kms_driver = {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 676e24fb8..f64392117 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -43,6 +43,7 @@
#include "amdgpu_xgmi.h"
#include "amdgpu_dma_buf.h"
#include "amdgpu_res_cursor.h"
+#include "amdgpu_svm.h"
#include "kfd_svm.h"
/**
@@ -2564,6 +2565,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
int r, i;
vm->va = RB_ROOT_CACHED;
+ vm->svm = NULL;
for (i = 0; i < AMDGPU_MAX_VMHUBS; i++)
vm->reserved_vmid[i] = NULL;
INIT_LIST_HEAD(&vm->evicted);
@@ -2722,6 +2724,10 @@ int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm)
vm->last_update = dma_fence_get_stub();
vm->is_compute_context = true;
+ r = amdgpu_svm_init(adev, vm);
+ if (r)
+ goto unreserve_bo;
+
unreserve_bo:
amdgpu_bo_unreserve(vm->root.bo);
return r;
@@ -2754,6 +2760,9 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
unsigned long flags;
int i;
+ amdgpu_svm_close(vm);
+ amdgpu_svm_fini(vm);
+
amdgpu_amdkfd_gpuvm_destroy_cb(adev, vm);
root = amdgpu_bo_ref(vm->root.bo);
@@ -2939,8 +2948,10 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
bool write_fault)
{
bool is_compute_context = false;
+ bool has_svm = false;
struct amdgpu_bo *root;
unsigned long irqflags;
+ uint64_t fault_addr = addr;
uint64_t value, flags;
struct amdgpu_vm *vm;
int r;
@@ -2950,6 +2961,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
if (vm) {
root = amdgpu_bo_ref(vm->root.bo);
is_compute_context = vm->is_compute_context;
+ has_svm = !!vm->svm;
} else {
root = NULL;
}
@@ -2960,12 +2972,30 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
addr /= AMDGPU_GPU_PAGE_SIZE;
- if (is_compute_context && !svm_range_restore_pages(adev, pasid, vmid,
- node_id, addr, ts, write_fault)) {
- amdgpu_bo_unref(&root);
- return true;
+ pr_debug("vm_handle_fault: pasid=%u addr=0x%llx compute=%d has_svm=%d write=%d\n",
+ pasid, fault_addr, is_compute_context, has_svm, write_fault);
+
+ if (is_compute_context && has_svm) {
+ r = amdgpu_svm_handle_fault(adev, pasid, fault_addr, write_fault);
+ pr_debug("vm_handle_fault: svm_handle_fault returned %d\n", r);
+ if (!r) {
+ amdgpu_bo_unref(&root);
+ return true;
+ }
}
+ if (is_compute_context && !has_svm) {
+ r = svm_range_restore_pages(adev, pasid, vmid,
+ node_id, addr, ts, write_fault);
+ pr_debug("vm_handle_fault: kfd svm_range_restore_pages returned %d\n", r);
+ if (!r) {
+ amdgpu_bo_unref(&root);
+ return true;
+ }
+ }
+
+ pr_debug("vm_handle_fault: SVM paths exhausted, falling through to NORETRY path\n");
+
r = amdgpu_bo_reserve(root, true);
if (r)
goto error_unref;
@@ -3020,6 +3050,8 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
error_unref:
amdgpu_bo_unref(&root);
+ pr_debug("vm_handle_fault: returning false (unhandled) pasid=%u addr=0x%llx\n",
+ pasid, fault_addr);
return false;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
` (11 preceding siblings ...)
2026-03-17 11:29 ` [RFC/POC PATCH 12/12] drm/amdgpu: wire up SVM build system " Honglei Huang
@ 2026-03-17 11:48 ` Christian König
2026-03-18 8:59 ` Honglei Huang
12 siblings, 1 reply; 36+ messages in thread
From: Christian König @ 2026-03-17 11:48 UTC (permalink / raw)
To: Honglei Huang, Alexander.Deucher, Felix.Kuehling, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen, Matthew Brost, Thomas Hellström,
Rodrigo Vivi, Danilo Krummrich, Alice Ryhl
Cc: amd-gfx, dri-devel, honghuan
Adding a few XE and drm_gpuvm people on TO.
On 3/17/26 12:29, Honglei Huang wrote:
> From: Honglei Huang <honghuan@amd.com>
>
> This is a POC/draft patch series of the SVM feature in amdgpu based on the
> drm_gpusvm framework. The primary purpose of this RFC is to validate
> the framework's applicability, identify implementation challenges,
> and start discussion on framework evolution. This is not a production
> ready submission.
>
> This patch series implements basic SVM support with the following features:
>
> 1. Attributes separated from physical page management:
>
> - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
> tree that stores SVM attributes. Managed through the SET_ATTR,
> and mmu notifier callback.
>
> - Physical page layer (drm_gpusvm ranges): managed by the
> drm_gpusvm framework, representing actual HMM backed DMA
> mappings and GPU page table entries.
>
> This separation is necessary:
> - The framework does not support range splitting, so a partial
> munmap destroys the entire overlapping range, including the
> still valid parts. If attributes were stored inside drm_gpusvm
> ranges, they would be lost on unmapping.
> The separate attr tree preserves userspace-set attributes
> across range operations.
Isn't that actually intended? When parts of the range are unmapped, that usually means the whole range isn't valid any more.
>
> - drm_gpusvm range boundaries are determined by fault address
> and preset chunk size, not by userspace attribute boundaries.
> Ranges may be rechunked on memory changes. Embedding
> attributes in framework ranges would scatter attr state
> across many small ranges and require complex reassembly
> logic when operating on attributes.
Yeah, that makes a lot of sense.
>
> 2) System memory mapping via drm_gpusvm
>
> The core mapping path uses drm_gpusvm_range_find_or_insert() to
> create ranges, drm_gpusvm_range_get_pages() for HMM page fault
> and DMA mapping, then updates GPU page tables via
> amdgpu_vm_update_range().
>
> 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
>
> On XNACK off hardware the GPU cannot recover from page faults,
> so mappings must be established through ioctl. When
> userspace calls SET_ATTR with ACCESS=ENABLE, the driver
> walks the attr tree and maps all accessible intervals
> to the GPU by amdgpu_svm_range_map_attr_ranges().
>
> 4) Invalidation, GC worker, and restore worker
>
> MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
> three cases based on event type and hardware mode:
> - unmap event: clear GPU PTEs in the notifier context,
> unmap DMA pages, mark ranges as unmapped, flush TLB,
> and enqueue to the GC worker. On XNACK off, also
> quiesce KFD queues and schedule rebuild of the
> still valid portions that were destroyed together with
> the unmapped subregion.
>
> - evict on XNACK off:
> quiesce KFD queues first, then unmap DMA pages and
> enqueue to the restore worker.
Is that done through the DMA fence or by talking directly to the MES/HWS?
Thanks,
Christian.
>
> - evict on XNACK on:
> clear GPU PTEs, unmap DMA pages, and flush TLB, but do
> not schedule any worker. The GPU will fault on next
> access and the fault handler establishes the mapping.
>
> Not supported feature:
> - XNACK on GPU page fault mode
> - migration and prefetch feature
> - Multi GPU support
>
> XNACK on enablement is ongoing. The GPUs that support XNACK on
> are currently only accessible to us via remote lab machines, which slows
> down progress.
>
> Patch overview:
>
> 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
> SET_ATTR/GET_ATTR operations, attribute types, and related
> structs in amdgpu_drm.h.
>
> 02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
> refcount, attr_tree, workqueues, locks, and
> callbacks (begin/end_restore, flush_tlb).
>
> 03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
> (interval tree node), attr_tree, access enum, flag masks,
> and change trigger enum.
>
> 04/12 Attribute tree operations: interval tree lookup, insert,
> remove, and tree create/destroy lifecycle.
>
> 05/12 Attribute set: validate UAPI attributes, apply to internal
> attrs, handle hole/existing range with head/tail splitting,
> compute change triggers, and -EAGAIN retry loop.
> Implements attr_clear_pages for unmap cleanup and attr_get.
>
> 06/12 Range data structures: amdgpu_svm_range extending
> drm_gpusvm_range with gpu_mapped state, pending ops,
> pte_flags cache, and GC/restore queue linkage.
>
> 07/12 PTE flags and GPU mapping: simple gpu pte function,
> GPU page table update with DMA address, range mapping loop:
> find_or_insert -> get_pages -> validate -> update PTE,
> and attribute change driven mapping function.
>
> 08/12 Notifier and invalidation: synchronous GPU PTE clear in
> notifier context, range removal and overlap cleanup,
> rebuild after destroy logic, and MMU event dispatcher
>
> 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
> worker for unmap processing and rebuild, ordered restore
> worker for mapping evicted ranges, and flush/sync
> helpers.
>
> 10/12 Initialization and fini: kmem_cache for range/attr,
> drm_gpusvm_init with chunk sizes, XNACK detection, TLB
> flush helper, and amdgpu_svm init/close/fini lifecycle.
>
> 11/12 IOCTL and fault handler: PASID based SVM lookup with kref
> protection, amdgpu_gem_svm_ioctl dispatcher, and
> amdgpu_svm_handle_fault for GPU page fault recovery.
>
> 12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
> Makefile rules, ioctl table registration, and amdgpu_vm
> hooks (init in make_compute, close/fini, fault dispatch).
>
> Test result:
> on gfx1100(W7900) and gfx943(MI300x)
> kfd test: 95%+ passed, same failed cases as the official release
> rocr test: all passed
> hip catch test: 20 cases failed out of 5366, +13 failures vs the official release
>
> During implementation we identified several challenges / design questions:
>
> 1. No range splitting on partial unmap
>
> drm_gpusvm explicitly does not support range splitting (see drm_gpusvm.c:122).
> A partial munmap needs to destroy the entire range, including the valid interval.
> GPU fault driven hardware can handle this design with extra GPU fault handling,
> but AMDGPU needs to support XNACK off hardware, where this design requires the
> driver to rebuild the valid part of the removed range. This brings very heavy
> restore work into the work queue/GC worker: unmap/destroy -> rebuild (insert and map).
> This restore work is even heavier than in kfd_svm. In the previous driver the work
> queue only needed to restore or unmap, but with drm_gpusvm the driver needs to
> unmap and restore, which brings more complex logic, a heavier worker queue
> workload, and synchronization issues.
>
> 2. Fault driven vs ioctl driven mapping
>
> drm_gpusvm is designed around GPU page fault handlers. The primary entry
> point drm_gpusvm_range_find_or_insert() takes a fault_addr.
> AMDGPU needs to support IOCTL driven mapping because on no-XNACK hardware
> the GPU cannot fault at all.
>
> The ioctl path cannot hold mmap_read_lock across the entire operation
> because drm_gpusvm_range_find_or_insert() acquires/releases it
> internally. This creates race windows with MMU notifiers / workers.
>
> 3. Multi GPU support
>
> drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
> each GPU gets an independent instance with its own range tree, MMU
> notifiers, notifier_lock, and DMA mappings.
>
> This may bring huge overhead:
> - N x MMU notifier registrations for the same address range
> - N x hmm_range_fault() calls for the same page (KFD: 1x)
> - N x DMA mapping memory
> - N x invalidation + restore worker scheduling per CPU unmap event
> - N x GPU page table flush / TLB invalidation
> - Increased mmap_lock hold time, N callbacks serialize under it
>
> compatibility issues:
> - Quiesce/resume scope mismatch: to integrate with KFD compute
> queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm()
> which have process level semantics. Under the per GPU
> drm_gpusvm model there may be synchronization issues. To properly
> integrate with KFD under the per SVM model, compatibility or
> new per VM level queue control APIs may need to be introduced.
>
> Migration challenges:
>
> - No global migration decision logic: each per GPU SVM
> instance maintains its own attribute tree independently. This
> allows conflicting settings (e.g., GPU0's SVM sets
> PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
> for the same address range) with no detection or resolution.
> A global attribute coordinator or a shared manager is needed to
> provide a unified global view for migration decisions
>
> - migrate_vma_setup broadcast: one GPU's migration triggers MMU
> notifier callbacks in ALL N-1 other drm_gpusvm instances,
> causing N-1 unnecessary restore workers to be scheduled, and
> creating races between the initiating migration and the other
> instance's restore attempts.
>
> - No cross instance migration serialization: each per GPU
> drm_gpusvm instance has independent locking, so two GPUs'
> "decide -> migrate -> remap" sequences can interleave. While
> the kernel page lock prevents truly simultaneous migration of
> the same physical page, the losing side's retry (evict from
> other GPU's VRAM -> migrate back) triggers broadcast notifier
> invalidations and restore workers, compounding the ping pong
> problem above.
>
> - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
> hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
> it only selects system memory pages for migration.
>
> - CPU fault reverse migration race: CPU page fault triggers
> migrate_to_ram while GPU instances are concurrently operating.
> Per GPU notifier_lock does not protect cross GPU operations.
>
> We believe a strong, well designed solution at the framework level is
> needed to properly address these problems, and we look forward to
> discussion and suggestions.
>
> Honglei Huang (12):
> drm/amdgpu: add SVM UAPI definitions
> drm/amdgpu: add SVM data structures and header
> drm/amdgpu: add SVM attribute data structures
> drm/amdgpu: implement SVM attribute tree operations
> drm/amdgpu: implement SVM attribute set
> drm/amdgpu: add SVM range data structures
> drm/amdgpu: implement SVM range PTE flags and GPU mapping
> drm/amdgpu: implement SVM range notifier and invalidation
> drm/amdgpu: implement SVM range workers
> drm/amdgpu: implement SVM core initialization and fini
> drm/amdgpu: implement SVM ioctl and fault handler
> drm/amdgpu: wire up SVM build system and fault handler
>
> drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
> drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 ++++++++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
> include/uapi/drm/amdgpu_drm.h | 39 +
> 12 files changed, 2958 insertions(+), 4 deletions(-)
> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
>
>
> base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-03-17 11:48 ` [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Christian König
@ 2026-03-18 8:59 ` Honglei Huang
2026-03-19 5:08 ` Matthew Brost
0 siblings, 1 reply; 36+ messages in thread
From: Honglei Huang @ 2026-03-18 8:59 UTC (permalink / raw)
To: Christian König
Cc: amd-gfx, dri-devel, Alexander.Deucher, Felix.Kuehling,
Honglei Huang, Oak.Zeng, Jenny-Jing.Liu, Philip.Yang,
Xiaogang.Chen, Ray.Huang, Lingshan.Zhu, Junhua.Shen,
Matthew Brost, Thomas Hellström, Rodrigo Vivi,
Danilo Krummrich, Alice Ryhl
On 3/17/26 19:48, Christian König wrote:
> Adding a few XE and drm_gpuvm people on TO.
>
> On 3/17/26 12:29, Honglei Huang wrote:
>> From: Honglei Huang <honghuan@amd.com>
>>
>> This is a POC/draft patch series of SVM feature in amdgpu based on the
>> drm_gpusvm framework. The primary purpose of this RFC is to validate
>> the framework's applicability, identify implementation challenges,
>> and start discussion on framework evolution. This is not a production
>> ready submission.
>>
>> This patch series implements basic SVM support with the following features:
>>
>> 1. attributes sepatarated from physical page management:
>>
>> - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
>> tree that stores SVM attributes. Managed through the SET_ATTR,
>> and mmu notifier callback.
>>
>> - Physical page layer (drm_gpusvm ranges): managed by the
>> drm_gpusvm framework, representing actual HMM backed DMA
>> mappings and GPU page table entries.
>>
>> This separation is necessary:
>> - The framework does not support range splitting, so a partial
>> munmap destroys the entire overlapping range, including the
>> still valid parts. If attributes were stored inside drm_gpusvm
>> ranges, they would be lost on unmapping.
>> The separate attr tree preserves userspace set attributes
>> across range operations.
>
> Isn't that actually intended? When parts of the range unmap then that usually means the whole range isn't valid any more.
It is about partial unmap: some subregion in a drm_gpusvm_range is still
valid while some other subregion is invalid, but under drm_gpusvm the
entire range needs to be destroyed.
e.g.:
[---------------unmap region in mmu notifier-----------------]
[0x1000 ------------ 0x9000]
[ valid ][ invalid ]
see detail at drm_gpusvm.c line 110,
section "Partial Unmapping of Ranges".
>
>>
>> - drm_gpusvm range boundaries are determined by fault address
>> and pre setted chunk size, not by userspace attribute boundaries.
>> Ranges may be rechunked on memory changes. Embedding
>> attributes in framework ranges would scatter attr state
>> across many small ranges and require complex reassemble
>> logic when operate attrbute.
>
> Yeah, that makes a lot of sense.
>
>>
>> 2) System memory mapping via drm_gpusvm
>>
>> The core mapping path uses drm_gpusvm_range_find_or_insert() to
>> create ranges, drm_gpusvm_range_get_pages() for HMM page fault
>> and DMA mapping, then updates GPU page tables via
>> amdgpu_vm_update_range().
>>
>> 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
>>
>> On XNACK off hardware the GPU cannot recover from page faults,
>> so mappings must be established through ioctl. When
>> userspace calls SET_ATTR with ACCESS=ENABLE, the driver
>> walks the attr tree and maps all accessible intervals
>> to the GPU by amdgpu_svm_range_map_attr_ranges().
>>
>> 4) Invalidation, GC worker, and restore worker
>>
>> MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
>> three cases based on event type and hardware mode:
>> - unmap event: clear GPU PTEs in the notifier context,
>> unmap DMA pages, mark ranges as unmapped, flush TLB,
>> and enqueue to the GC worker. On XNACK off, also
>> quiesce KFD queues and schedule rebuild of the
>> still valid portions that were destroyed together with
>> the unmapped subregion.
>>
>> - evict on XNACK off:
>> quiesce KFD queues first, then unmap DMA pages and
>> enqueue to the restore worker.
>
> Is that done through the DMA fence or by talking directly to the MES/HWS?
Currently the KFD queue quiesce/resume APIs are reused; looking forward to a
better solution.
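
For illustration only, a minimal sketch of what that reuse looks like around
an eviction, assuming the kgd2kfd_quiesce_mm()/kgd2kfd_resume_mm() prototypes
from amdgpu_amdkfd.h; KFD_SVM_EVICT_TRIGGER is a placeholder for whichever
eviction trigger value the driver actually passes, and this is not the patch
code itself:

/* Hedged sketch, not the patch code: stop every queue of the process
 * before touching PTEs on XNACK-off hardware, resume once the restore
 * worker has re-established the mappings.
 */
static int svm_evict_sketch(struct mm_struct *mm)
{
	int r;

	/* Process-level quiesce: all KFD queues of this mm are stopped. */
	r = kgd2kfd_quiesce_mm(mm, KFD_SVM_EVICT_TRIGGER /* placeholder */);
	if (r)
		return r;

	/* ... unmap DMA pages, clear PTEs, schedule the restore worker ... */

	/* Process-level resume once the mappings are valid again. */
	return kgd2kfd_resume_mm(mm);
}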
Regards,
Honglei
>
> Thanks,
> Christian.
>
>>
>> - evict on XNACK on:
>> clear GPU PTEs, unmap DMA pages, and flush TLB, but do
>> not schedule any worker. The GPU will fault on next
>> access and the fault handler establishes the mapping.
>>
>> Not supported feature:
>> - XNACK on GPU page fault mode
>> - migration and prefetch feature
>> - Multi GPU support
>>
>> XNACK on enablement is ongoing.The GPUs that support XNACK on
>> are currently only accessible to us via remote lab machines, which slows
>> down progress.
>>
>> Patch overview:
>>
>> 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
>> SET_ATTR/GET_ATTR operations, attribute types, and related
>> structs in amdgpu_drm.h.
>>
>> 02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
>> refcount, attr_tree, workqueues, locks, and
>> callbacks (begin/end_restore, flush_tlb).
>>
>> 03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
>> (interval tree node), attr_tree, access enum, flag masks,
>> and change trigger enum.
>>
>> 04/12 Attribute tree operations: interval tree lookup, insert,
>> remove, and tree create/destroy lifecycle.
>>
>> 05/12 Attribute set: validate UAPI attributes, apply to internal
>> attrs, handle hole/existing range with head/tail splitting,
>> compute change triggers, and -EAGAIN retry loop.
>> Implements attr_clear_pages for unmap cleanup and attr_get.
>>
>> 06/12 Range data structures: amdgpu_svm_range extending
>> drm_gpusvm_range with gpu_mapped state, pending ops,
>> pte_flags cache, and GC/restore queue linkage.
>>
>> 07/12 PTE flags and GPU mapping: simple gpu pte function,
>> GPU page table update with DMA address, range mapping loop:
>> find_or_insert -> get_pages -> validate -> update PTE,
>> and attribute change driven mapping function.
>>
>> 08/12 Notifier and invalidation: synchronous GPU PTE clear in
>> notifier context, range removal and overlap cleanup,
>> rebuild after destroy logic, and MMU event dispatcher
>>
>> 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
>> worker for unmap processing and rebuild, ordered restore
>> worker for mapping evicted ranges, and flush/sync
>> helpers.
>>
>> 10/12 Initialization and fini: kmem_cache for range/attr,
>> drm_gpusvm_init with chunk sizes, XNACK detection, TLB
>> flush helper, and amdgpu_svm init/close/fini lifecycle.
>>
>> 11/12 IOCTL and fault handler: PASID based SVM lookup with kref
>> protection, amdgpu_gem_svm_ioctl dispatcher, and
>> amdgpu_svm_handle_fault for GPU page fault recovery.
>>
>> 12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
>> Makefile rules, ioctl table registration, and amdgpu_vm
>> hooks (init in make_compute, close/fini, fault dispatch).
>>
>> Test result:
>> on gfx1100(W7900) and gfx943(MI300x)
>> kfd test: 95%+ passed, same failed cases with offical relase
>> rocr test: all passed
>> hip catch test: 20 cases failed in all 5366 cases, +13 failures vs offical relase
>>
>> During implementation we identified several challenges / design questions:
>>
>> 1. No range splitting on partial unmap
>>
>> drm_gpusvm explicitly does not support range splitting in drm_gpusvm.c:122.
>> Partial munmap needs to destroy the entire range including the valid interval.
>> GPU fault driven hardware can handle this design by extra gpu fault handle,
>> but AMDGPU needs to support XNACK off hardware, this design requires driver
>> rebuild the valid part in the removed entire range. Whichs bring a very heavy
>> restore work in work queue/GC worker: unmap/destroy -> rebuild(insert and map)
>> this restore work even heavier than kfd_svm. In previous driver work queue
>> only needs to restore or unmap, but in drm_gpusvm driver needs to unmap and restore.
>> which brings about more complex logic, heavier worker queue workload, and
>> synchronization issues.
>>
>> 2. Fault driven vs ioctl driven mapping
>>
>> drm_gpusvm is designed around GPU page fault handlers. The primary entry
>> point drm_gpusvm_range_find_or_insert() takes a fault_addr.
>> AMDGPU needs to support IOCTL driven mapping cause No XNACK hardware that
>> GPU cannot fault at all
>>
>> The ioctl path cannot hold mmap_read_lock across the entire operation
>> because drm_gpusvm_range_find_or_insert() acquires/releases it
>> internally. This creates race windows with MMU notifiers / workers.
>>
>> 3. Multi GPU support
>>
>> drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
>> each GPU gets an independent instance with its own range tree, MMU
>> notifiers, notifier_lock, and DMA mappings.
>>
>> This may brings huge overhead:
>> - N x MMU notifier registrations for the same address range
>> - N x hmm_range_fault() calls for the same page (KFD: 1x)
>> - N x DMA mapping memory
>> - N x invalidation + restore worker scheduling per CPU unmap event
>> - N x GPU page table flush / TLB invalidation
>> - Increased mmap_lock hold time, N callbacks serialize under it
>>
>> compatibility issues:
>> - Quiesce/resume scope mismatch: to integrate with KFD compute
>> queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm()
>> which have process level semantics. Under the per GPU
>> drm_gpusvm model, maybe there are some issues on sync. To properly
>> integrate with KFD under the per SVM model, a compatibility or
>> new per VM level queue control APIs maybe need to introduced.
>>
>> Migration challenges:
>>
>> - No global migration decision logic: each per GPU SVM
>> instance maintains its own attribute tree independently. This
>> allows conflicting settings (e.g., GPU0's SVM sets
>> PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
>> for the same address range) with no detection or resolution.
>> A global attribute coordinator or a shared manager is needed to
>> provide a unified global view for migration decisions
>>
>> - migrate_vma_setup broadcast: one GPU's migration triggers MMU
>> notifier callbacks in ALL N-1 other drm_gpusvm instances,
>> causing N-1 unnecessary restore workers to be scheduled. And
>> creates races between the initiating migration and the other
>> instance's restore attempts.
>>
>> - No cross instance migration serialization: each per GPU
>> drm_gpusvm instance has independent locking, so two GPUs'
>> "decide -> migrate -> remap" sequences can interleave. While
>> the kernel page lock prevents truly simultaneous migration of
>> the same physical page, the losing side's retry (evict from
>> other GPU's VRAM -> migrate back) triggers broadcast notifier
>> invalidations and restore workers, compounding the ping pong
>> problem above.
>>
>> - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
>> hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
>> it only selects system memory pages for migration.
>>
>> - CPU fault reverse migration race: CPU page fault triggers
>> migrate_to_ram while GPU instances are concurrently operating.
>> Per GPU notifier_lock does not protect cross GPU operations.
>>
>> We believe a strong, well designed solution at the framework level is
>> needed to properly address these problems, and we look forward to
>> discussion and suggestions.
>>
>> Honglei Huang (12):
>> drm/amdgpu: add SVM UAPI definitions
>> drm/amdgpu: add SVM data structures and header
>> drm/amdgpu: add SVM attribute data structures
>> drm/amdgpu: implement SVM attribute tree operations
>> drm/amdgpu: implement SVM attribute set
>> drm/amdgpu: add SVM range data structures
>> drm/amdgpu: implement SVM range PTE flags and GPU mapping
>> drm/amdgpu: implement SVM range notifier and invalidation
>> drm/amdgpu: implement SVM range workers
>> drm/amdgpu: implement SVM core initialization and fini
>> drm/amdgpu: implement SVM ioctl and fault handler
>> drm/amdgpu: wire up SVM build system and fault handler
>>
>> drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
>> drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 ++++++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
>> include/uapi/drm/amdgpu_drm.h | 39 +
>> 12 files changed, 2958 insertions(+), 4 deletions(-)
>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
>>
>>
>> base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
>
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-03-18 8:59 ` Honglei Huang
@ 2026-03-19 5:08 ` Matthew Brost
2026-03-19 14:17 ` Honglei Huang
0 siblings, 1 reply; 36+ messages in thread
From: Matthew Brost @ 2026-03-19 5:08 UTC (permalink / raw)
To: Honglei Huang
Cc: Christian König, amd-gfx, dri-devel, Alexander.Deucher,
Felix.Kuehling, Honglei Huang, Oak.Zeng, Jenny-Jing.Liu,
Philip.Yang, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu, Junhua.Shen,
Thomas Hellström, Rodrigo Vivi, Danilo Krummrich, Alice Ryhl
On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
>
Disclaimer: I haven't looked at any code in this series yet.
>
> On 3/17/26 19:48, Christian König wrote:
> > Adding a few XE and drm_gpuvm people on TO.
> >
> > On 3/17/26 12:29, Honglei Huang wrote:
> > > From: Honglei Huang <honghuan@amd.com>
> > >
> > > This is a POC/draft patch series of SVM feature in amdgpu based on the
> > > drm_gpusvm framework. The primary purpose of this RFC is to validate
> > > the framework's applicability, identify implementation challenges,
> > > and start discussion on framework evolution. This is not a production
+1. Open to any ideas. Given this was designed originally for Xe we very
well could have missed other drivers' requirements.
> > > ready submission.
> > >
> > > This patch series implements basic SVM support with the following features:
> > >
> > > 1. attributes sepatarated from physical page management:
> > >
> > > - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
> > > tree that stores SVM attributes. Managed through the SET_ATTR,
> > > and mmu notifier callback.
Can you explain the mmu notifier callback interaction here? See below: in
Xe the attribute tree is the existing VMA tree (gpuvm).
> > >
> > > - Physical page layer (drm_gpusvm ranges): managed by the
> > > drm_gpusvm framework, representing actual HMM backed DMA
> > > mappings and GPU page table entries.
> > >
> > > This separation is necessary:
> > > - The framework does not support range splitting, so a partial
> > > munmap destroys the entire overlapping range, including the
> > > still valid parts. If attributes were stored inside drm_gpusvm
> > > ranges, they would be lost on unmapping.
> > > The separate attr tree preserves userspace set attributes
> > > across range operations.
Yes, in Xe the divide is at the VMA level (set by user space) via VM
bind (parts of VM may be mappings BOs, parts could be setup for SVM) or
madvise IOCTLs which reflect user space attributes on current SVM
mappings or future ones.
The SVM range tree reflects mappings that have been faulted into the
device and contain pages. This is an intentional choice.
> >
> > Isn't that actually intended? When parts of the range unmap then that usually means the whole range isn't valid any more.
Yes, this was an intentional design choice to not support partial unmap,
and instead rely on the driver to recreate a new range.
The reasoning is:
- In practice, this should be rare for well-behaved applications.
- With THP / large device pages, if a sub-range is unmapped, the entire
GPU mapping is invalidated anyway due to the page size change. As a
result, the cost of creating a new range is minimal, since the device
will likely fault again on the remaining pages.
So there is no need to over-engineer the common code.
FWIW, to even test partial unmaps in Xe, I had to do things I doubt
anyone would ever do:
ptr = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
/* fault in memory to the device */
munmap(ptr, SZ_1M);
/* touch memory again on the device */
>
>
> It is about partial unmap, some subregion in drm_gpusvm_range is still valid
> but some other subregion is invalid, but under drm_gpusvm, need to destroy
> the entire range.
>
> e.g.:
>
> [---------------unmap region in mmu notifier-----------------]
> [0x1000 ------------ 0x9000]
> [ valid ][ invalid ]
>
> see deatil in drm_gpusvm.c:110 line
> section:Partial Unmapping of Ranges
>
>
> >
> > >
> > > - drm_gpusvm range boundaries are determined by fault address
> > > and pre setted chunk size, not by userspace attribute boundaries.
> > > Ranges may be rechunked on memory changes. Embedding
> > > attributes in framework ranges would scatter attr state
> > > across many small ranges and require complex reassemble
> > > logic when operate attrbute.
> >
> > Yeah, that makes a lot of sense.
> >
> > >
> > > 2) System memory mapping via drm_gpusvm
> > >
> > > The core mapping path uses drm_gpusvm_range_find_or_insert() to
> > > create ranges, drm_gpusvm_range_get_pages() for HMM page fault
> > > and DMA mapping, then updates GPU page tables via
> > > amdgpu_vm_update_range().
> > >
> > > 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
> > >
> > > On XNACK off hardware the GPU cannot recover from page faults,
> > > so mappings must be established through ioctl. When
> > > userspace calls SET_ATTR with ACCESS=ENABLE, the driver
> > > walks the attr tree and maps all accessible intervals
> > > to the GPU by amdgpu_svm_range_map_attr_ranges().
Can you expand on XNACK off / GPU no faults? Is this to share the GPU
between 3D (dma-fences) and faulting clients? We have something similar
in Xe, but it isn't an explicit IOCTL; rather we switch on demand
as a 3D client submits and then resume page faults when all dma-fences
have signaled.
I see below you mention page tables are modified while KFD queues are
quiesced? I'm not sure that is required - you just need to guarantee
faulting clients won't trigger page faults while a dma-fence is in flight.
Maybe give me an explanation of exactly what the requirements from AMD
are here so I have a better picture.
> > >
> > > 4) Invalidation, GC worker, and restore worker
> > >
> > > MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
> > > three cases based on event type and hardware mode:
> > > - unmap event: clear GPU PTEs in the notifier context,
> > > unmap DMA pages, mark ranges as unmapped, flush TLB,
> > > and enqueue to the GC worker. On XNACK off, also
> > > quiesce KFD queues and schedule rebuild of the
> > > still valid portions that were destroyed together with
> > > the unmapped subregion.
> > >
> > > - evict on XNACK off:
> > > quiesce KFD queues first, then unmap DMA pages and
> > > enqueue to the restore worker.
> >
> > Is that done through the DMA fence or by talking directly to the MES/HWS?
>
> Currently KFD queues quiesce/resume API are reused, lookig forward to a
> better solution.
>
+1
> Regards,
> Honglei
>
> >
> > Thanks,
> > Christian.
> >
> > >
> > > - evict on XNACK on:
> > > clear GPU PTEs, unmap DMA pages, and flush TLB, but do
> > > not schedule any worker. The GPU will fault on next
> > > access and the fault handler establishes the mapping.
> > >
> > > Not supported feature:
> > > - XNACK on GPU page fault mode
> > > - migration and prefetch feature
> > > - Multi GPU support
> > >
> > > XNACK on enablement is ongoing.The GPUs that support XNACK on
> > > are currently only accessible to us via remote lab machines, which slows
> > > down progress.
> > >
> > > Patch overview:
> > >
> > > 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
> > > SET_ATTR/GET_ATTR operations, attribute types, and related
> > > structs in amdgpu_drm.h.
> > >
> > > 02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
> > > refcount, attr_tree, workqueues, locks, and
> > > callbacks (begin/end_restore, flush_tlb).
> > >
> > > 03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
> > > (interval tree node), attr_tree, access enum, flag masks,
> > > and change trigger enum.
> > >
> > > 04/12 Attribute tree operations: interval tree lookup, insert,
> > > remove, and tree create/destroy lifecycle.
> > >
> > > 05/12 Attribute set: validate UAPI attributes, apply to internal
> > > attrs, handle hole/existing range with head/tail splitting,
> > > compute change triggers, and -EAGAIN retry loop.
> > > Implements attr_clear_pages for unmap cleanup and attr_get.
> > >
> > > 06/12 Range data structures: amdgpu_svm_range extending
> > > drm_gpusvm_range with gpu_mapped state, pending ops,
> > > pte_flags cache, and GC/restore queue linkage.
> > >
> > > 07/12 PTE flags and GPU mapping: simple gpu pte function,
> > > GPU page table update with DMA address, range mapping loop:
> > > find_or_insert -> get_pages -> validate -> update PTE,
> > > and attribute change driven mapping function.
> > >
> > > 08/12 Notifier and invalidation: synchronous GPU PTE clear in
> > > notifier context, range removal and overlap cleanup,
> > > rebuild after destroy logic, and MMU event dispatcher
> > >
> > > 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
> > > worker for unmap processing and rebuild, ordered restore
> > > worker for mapping evicted ranges, and flush/sync
> > > helpers.
> > >
> > > 10/12 Initialization and fini: kmem_cache for range/attr,
> > > drm_gpusvm_init with chunk sizes, XNACK detection, TLB
> > > flush helper, and amdgpu_svm init/close/fini lifecycle.
> > >
> > > 11/12 IOCTL and fault handler: PASID based SVM lookup with kref
> > > protection, amdgpu_gem_svm_ioctl dispatcher, and
> > > amdgpu_svm_handle_fault for GPU page fault recovery.
> > >
> > > 12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
> > > Makefile rules, ioctl table registration, and amdgpu_vm
> > > hooks (init in make_compute, close/fini, fault dispatch).
> > >
> > > Test result:
> > > on gfx1100(W7900) and gfx943(MI300x)
> > > kfd test: 95%+ passed, same failed cases with offical relase
> > > rocr test: all passed
> > > hip catch test: 20 cases failed in all 5366 cases, +13 failures vs offical relase
> > >
> > > During implementation we identified several challenges / design questions:
> > >
> > > 1. No range splitting on partial unmap
> > >
> > > drm_gpusvm explicitly does not support range splitting in drm_gpusvm.c:122.
> > > Partial munmap needs to destroy the entire range including the valid interval.
> > > GPU fault driven hardware can handle this design by extra gpu fault handle,
> > > but AMDGPU needs to support XNACK off hardware, this design requires driver
> > > rebuild the valid part in the removed entire range. Whichs bring a very heavy
> > > restore work in work queue/GC worker: unmap/destroy -> rebuild(insert and map)
> > > this restore work even heavier than kfd_svm. In previous driver work queue
> > > only needs to restore or unmap, but in drm_gpusvm driver needs to unmap and restore.
> > > which brings about more complex logic, heavier worker queue workload, and
> > > synchronization issues.
Is this common in the workload you are running? I'm also wondering if
your restore logic / KFD's design is actually contributing to the
problem.
> > >
> > > 2. Fault driven vs ioctl driven mapping
> > >
> > > drm_gpusvm is designed around GPU page fault handlers. The primary entry
> > > point drm_gpusvm_range_find_or_insert() takes a fault_addr.
> > > AMDGPU needs to support IOCTL driven mapping cause No XNACK hardware that
> > > GPU cannot fault at all
I think we refer to these as prefetch IOCTLs in Xe. Ideally, user space
issues these so the device does not fault (e.g., prefetch creates a set
of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
specific VM bind operations.
> > >
> > > The ioctl path cannot hold mmap_read_lock across the entire operation
> > > because drm_gpusvm_range_find_or_insert() acquires/releases it
> > > internally. This creates race windows with MMU notifiers / workers.
This is a very intentional choice in the locking design: mmap_read_lock
is held only in very specific parts of GPU SVM, and the driver should
never need to take this lock.
Yes, notifiers can race, which is why the GPU fault handler and prefetch
handler are structured as retry loops when a notifier race is detected.
In practice, with well-behaved applications, these races should be
rare—but they do occur, and the driver must handle them.
__xe_svm_handle_pagefault implements the page fault retry loop. VM bind
prefetch has similar logic, although it is more spread out given that it
is part of a deeper software pipeline.
FWIW, holding locks to avoid races was rejected by Sima because we
reasoned it is essentially impossible to guarantee the absence of races
by holding a lock. CPU page fault handlers are also effectively just
large retry loops.
So this is one point I believe you will need to fix up on the driver side.
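
To make the retry shape concrete, below is a minimal sketch of a
fault-handler-style loop against the drm_gpusvm API. It is only an
illustration under the assumption that drm_gpusvm_range_get_pages() returns
-EAGAIN on a notifier race and that PTEs are written under the notifier lock;
amdgpu_svm_update_ptes() is a hypothetical driver helper, and this is not a
copy of __xe_svm_handle_pagefault:

/* Hedged sketch of the retry-loop shape, assuming <drm/drm_gpusvm.h>. */
int amdgpu_svm_update_ptes(struct drm_gpusvm_range *range); /* hypothetical */

static int svm_handle_fault_sketch(struct drm_gpusvm *gpusvm,
				   unsigned long fault_addr,
				   unsigned long start, unsigned long end,
				   const struct drm_gpusvm_ctx *ctx)
{
	struct drm_gpusvm_range *range;
	int err;

retry:
	range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
						start, end, ctx);
	if (IS_ERR(range))
		return PTR_ERR(range);

	/* Faults in CPU pages and DMA-maps them; -EAGAIN means a notifier
	 * raced with us and the pages are already stale, so start over. */
	err = drm_gpusvm_range_get_pages(gpusvm, range, ctx);
	if (err == -EAGAIN)
		goto retry;
	if (err)
		return err;

	/* PTEs are only written under the notifier lock, so a notifier
	 * that fired after get_pages() forces another pass. */
	drm_gpusvm_notifier_lock(gpusvm);
	if (!drm_gpusvm_range_pages_valid(gpusvm, range)) {
		drm_gpusvm_notifier_unlock(gpusvm);
		goto retry;
	}
	err = amdgpu_svm_update_ptes(range); /* hypothetical PTE update */
	drm_gpusvm_notifier_unlock(gpusvm);

	return err;
}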
> > >
> > > 3. Multi GPU support
> > >
> > > drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
> > > each GPU gets an independent instance with its own range tree, MMU
> > > notifiers, notifier_lock, and DMA mappings.
> > >
This is a part I am absolutely open to fixing. Right now, each
drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
decoupling a GPU SVM instance from a single device, allowing each
drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
device).
This would give drivers the flexibility to use one GPU SVM instance per
VM/device instance (as in Xe), or to maintain a single GPU SVM per CPU
MM.
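
As a purely hypothetical sketch of the shape being proposed (no such
structure exists in drm_gpusvm today; struct drm_gpusvm_pages is assumed to be
the per-range page container the framework already carries):

/* Hypothetical only: today each drm_gpusvm_range embeds exactly one set
 * of pages; the proposal is one set of drm_gpusvm_pages per device so a
 * single GPU SVM instance (and notifier) can back several GPUs.
 */
struct sample_multi_device_range {
	struct drm_gpusvm_range base;	/* interval + notifier bookkeeping */
	unsigned int num_devices;	/* GPUs sharing this CPU mm */
	struct drm_gpusvm_pages *pages;	/* array, one entry per device */
};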
> > > This may brings huge overhead:
> > > - N x MMU notifier registrations for the same address range
The notifier overhead is a real concern. We recently introduced two-pass
notifiers [1] to speed up multi-device notifiers. At least in Xe, the
TLB invalidations—which are the truly expensive part—can be pipelined
using the two-pass approach. Currently, [1] only implements two-pass
notifiers for userptr, but Xe’s GPU SVM will be updated to use them
shortly.
[1] https://patchwork.freedesktop.org/series/153280/
> > > - N x hmm_range_fault() calls for the same page (KFD: 1x)
hmm_range_fault is extremely fast compared to the actual migration.
Running hmm_range_fault on a 2MB region using 4KB pages takes less
than 1µs. With THP or large device pages [2] (merged last week), it’s
around 1/20 of a microsecond. So I wouldn’t be too concerned about this.
[2] https://patchwork.freedesktop.org/series/163141/
> > > - N x DMA mapping memory
You will always have N x DMA mapping memory if the pages are in system
memory as the dma-mapping API is per device.
> > > - N x invalidation + restore worker scheduling per CPU unmap event
> > > - N x GPU page table flush / TLB invalidation
I agree you do not want serialize GPU page table flush / TLB
invalidations. Hence two-pass notifiers [1].
> > > - Increased mmap_lock hold time, N callbacks serialize under it
> > >
> > > compatibility issues:
> > > - Quiesce/resume scope mismatch: to integrate with KFD compute
> > > queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm()
> > > which have process level semantics. Under the per GPU
> > > drm_gpusvm model, maybe there are some issues on sync. To properly
> > > integrate with KFD under the per SVM model, a compatibility or
> > > new per VM level queue control APIs maybe need to introduced.
> > >
I thought the idea was to get rid of KFD and move over to AMDGPU? I thought
Christian mentioned this to me at XDC.
> > > Migration challenges:
> > >
> > > - No global migration decision logic: each per GPU SVM
> > > instance maintains its own attribute tree independently. This
> > > allows conflicting settings (e.g., GPU0's SVM sets
> > > PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
> > > for the same address range) with no detection or resolution.
> > > A global attribute coordinator or a shared manager is needed to
> > > provide a unified global view for migration decisions
Yes, this is a hole in the Xe API too. We have told UMDs that if they set up
individual VMs with conflicting attributes for a single CPU address space,
the behavior is undefined. Our UMD's madvise implementation is basically a loop
over all GPU VMs setting the same attributes.
> > >
> > > - migrate_vma_setup broadcast: one GPU's migration triggers MMU
> > > notifier callbacks in ALL N-1 other drm_gpusvm instances,
> > > causing N-1 unnecessary restore workers to be scheduled. And
My feeling is that you shouldn’t reschedule restore workers unless you
actually have to invalidate page tables (i.e., you have a local SVM
range within the notifier). So the first migration to an untouched
region may trigger notifiers, but they won’t do anything because you
don’t have any valid SVM ranges yet. Subsequent mappings of the migrated
region won’t trigger a notifier unless the memory is moved again.
> > > creates races between the initiating migration and the other
> > > instance's restore attempts.
Yes, if multiple devices try to migrate the same CPU pages at the same
time, that will race. That’s why in Xe we have a module-level
driver_migrate_lock. The first migration runs in read mode; if it
detects a race and aborts, it then takes driver_migrate_lock in write
mode so it becomes the only device allowed to move memory / CPU pages.
See xe_svm_alloc_vram() for how this is used.
I’m not sure this approach will work for you, but I just wanted to point
out that we identified this as a potential issue.
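
The pattern described here is essentially "optimistic read, fall back to
exclusive write". A generic sketch with a module-level rwsem is below; the
names are illustrative placeholders, not Xe's actual symbols, and
sample_try_migrate() stands in for the driver's migrate-to-VRAM attempt,
assumed to return -EBUSY when another device raced on the same pages:

#include <linux/rwsem.h>

/* Illustrative module-level lock, not Xe's real driver_migrate_lock. */
static DECLARE_RWSEM(sample_migrate_lock);

int sample_try_migrate(void *alloc);	/* placeholder migrate attempt */

static int sample_migrate_with_fallback(void *alloc)
{
	int err;

	/* Common case: migrations to unrelated regions run in parallel. */
	down_read(&sample_migrate_lock);
	err = sample_try_migrate(alloc);
	up_read(&sample_migrate_lock);
	if (err != -EBUSY)
		return err;

	/* Race detected: retry as the only device allowed to move pages. */
	down_write(&sample_migrate_lock);
	err = sample_try_migrate(alloc);
	up_write(&sample_migrate_lock);

	return err;
}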
> > >
> > > - No cross instance migration serialization: each per GPU
> > > drm_gpusvm instance has independent locking, so two GPUs'
> > > "decide -> migrate -> remap" sequences can interleave. While
> > > the kernel page lock prevents truly simultaneous migration of
> > > the same physical page, the losing side's retry (evict from
> > > other GPU's VRAM -> migrate back) triggers broadcast notifier
> > > invalidations and restore workers, compounding the ping pong
> > > problem above.
> > >
See the driver_migrate_lock above.
> > > - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
> > > hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
> > > it only selects system memory pages for migration.
> > >
I think this is fixed? We did find some core MM bugs that blocked VRAM
to VRAM but those have been worked out.
The code I'm looking at:
517 int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
518 struct mm_struct *mm,
519 unsigned long start, unsigned long end,
520 const struct drm_pagemap_migrate_details *mdetails)
521 {
522 const struct drm_pagemap_devmem_ops *ops = devmem_allocation->ops;
523 struct drm_pagemap *dpagemap = devmem_allocation->dpagemap;
524 struct dev_pagemap *pagemap = dpagemap->pagemap;
525 struct migrate_vma migrate = {
526 .start = start,
527 .end = end,
528 .pgmap_owner = pagemap->owner,
529 .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
530 MIGRATE_VMA_SELECT_DEVICE_PRIVATE | MIGRATE_VMA_SELECT_COMPOUND,
531 };
> > > - CPU fault reverse migration race: CPU page fault triggers
> > > migrate_to_ram while GPU instances are concurrently operating.
> > > Per GPU notifier_lock does not protect cross GPU operations.
No, again retry loop as discussed above.
> > >
> > > We believe a strong, well designed solution at the framework level is
> > > needed to properly address these problems, and we look forward to
> > > discussion and suggestions.
Let's work together to figure out what is missing here.
Matt
> > >
> > > Honglei Huang (12):
> > > drm/amdgpu: add SVM UAPI definitions
> > > drm/amdgpu: add SVM data structures and header
> > > drm/amdgpu: add SVM attribute data structures
> > > drm/amdgpu: implement SVM attribute tree operations
> > > drm/amdgpu: implement SVM attribute set
> > > drm/amdgpu: add SVM range data structures
> > > drm/amdgpu: implement SVM range PTE flags and GPU mapping
> > > drm/amdgpu: implement SVM range notifier and invalidation
> > > drm/amdgpu: implement SVM range workers
> > > drm/amdgpu: implement SVM core initialization and fini
> > > drm/amdgpu: implement SVM ioctl and fault handler
> > > drm/amdgpu: wire up SVM build system and fault handler
> > >
> > > drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
> > > drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 ++++++++++++
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
> > > include/uapi/drm/amdgpu_drm.h | 39 +
> > > 12 files changed, 2958 insertions(+), 4 deletions(-)
> > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
> > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
> > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
> > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
> > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
> > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
> > >
> > >
> > > base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
> >
>
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-03-19 5:08 ` Matthew Brost
@ 2026-03-19 14:17 ` Honglei Huang
2026-03-23 6:31 ` Matthew Brost
0 siblings, 1 reply; 36+ messages in thread
From: Honglei Huang @ 2026-03-19 14:17 UTC (permalink / raw)
To: Matthew Brost
Cc: Christian König, amd-gfx, dri-devel, Alexander.Deucher,
Felix.Kuehling, Honglei Huang, Oak.Zeng, Jenny-Jing.Liu,
Philip.Yang, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu, Junhua.Shen,
Thomas Hellström, Rodrigo Vivi, Danilo Krummrich, Alice Ryhl
On 3/19/26 13:08, Matthew Brost wrote:
> On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
>>
>
> Disclaimer I haven't look at any code in this series yet.
>
>>
>> On 3/17/26 19:48, Christian König wrote:
>>> Adding a few XE and drm_gpuvm people on TO.
>>>
>>> On 3/17/26 12:29, Honglei Huang wrote:
>>>> From: Honglei Huang <honghuan@amd.com>
>>>>
>>>> This is a POC/draft patch series of SVM feature in amdgpu based on the
>>>> drm_gpusvm framework. The primary purpose of this RFC is to validate
>>>> the framework's applicability, identify implementation challenges,
>>>> and start discussion on framework evolution. This is not a production
>
> +1. Open to any ideas. Given this was designed originally for Xe we very
> well could have missed other drivers requirements.
Hi Matt,
Thank you for the openness. And thank you so much for the incredibly
detailed and patient response. I really appreciate you taking the time
to walk through each point.
Actually I am still a learner when it comes to the drm_gpusvm framework
and GPU SVM design in general. Some of my descriptions below may not be
entirely accurate. But I really want to bring drm_gpusvm into amdgpu and
make it work well.
>
>>>> ready submission.
>>>>
>>>> This patch series implements basic SVM support with the following features:
>>>>
>>>> 1. attributes sepatarated from physical page management:
>>>>
>>>> - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
>>>> tree that stores SVM attributes. Managed through the SET_ATTR,
>>>> and mmu notifier callback.
>
> Can you explain the mmu notifier callback interaction here? See below in
> Xe the attribute tree is existing VMA tree (gpuvm).
>
Let me try to explain, apologies if the description is not fully
precise.
In the current implementation, the MMU notifier callback interacts with the
attr tree only in the munmap path: it removes the corresponding attribute
entries from the attr tree so that stale attributes do not persist for
freed address space.
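
For illustration only, that munmap-path interaction can be sketched with the
stock interval tree helpers; sample_attr_range is a placeholder for the real
amdgpu_svm_attr_range, and a real implementation would also split nodes that
only partially overlap, as patch 05 does:

#include <linux/interval_tree.h>
#include <linux/slab.h>

/* Placeholder node; the real attr node also carries the SVM attributes. */
struct sample_attr_range {
	struct interval_tree_node it;	/* [it.start, it.last] covered span */
	/* ... preferred location, access flags, granularity ... */
};

/* Called from the MMU notifier on an unmap event: drop every attribute
 * node that overlaps the span removed from the CPU address space. */
static void sample_attr_tree_unmap(struct rb_root_cached *attr_tree,
				   unsigned long start, unsigned long last)
{
	struct interval_tree_node *node, *next;

	node = interval_tree_iter_first(attr_tree, start, last);
	while (node) {
		next = interval_tree_iter_next(node, start, last);
		interval_tree_remove(node, attr_tree);
		kfree(container_of(node, struct sample_attr_range, it));
		node = next;
	}
}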
>>>>
>>>> - Physical page layer (drm_gpusvm ranges): managed by the
>>>> drm_gpusvm framework, representing actual HMM backed DMA
>>>> mappings and GPU page table entries.
>>>>
>>>> This separation is necessary:
>>>> - The framework does not support range splitting, so a partial
>>>> munmap destroys the entire overlapping range, including the
>>>> still valid parts. If attributes were stored inside drm_gpusvm
>>>> ranges, they would be lost on unmapping.
>>>> The separate attr tree preserves userspace set attributes
>>>> across range operations.
>
> Yes, in Xe the divide is at the VMA level (set by user space) via VM
> bind (parts of VM may be mappings BOs, parts could be setup for SVM) or
> madvise IOCTLs which reflect user space attributes on current SVM
> mappings or future ones.
>
> The SVM range tree reflects mappings that have been faulted into the
> device and contain pages. This is an intentional choice.
That makes a lot of sense. Thank you for clarifying the design intent. I
think the current implementation adopts the same principle: the drm_gpusvm
range tree only reflects actual faulted-in mappings.
>
>>>
>>> Isn't that actually intended? When parts of the range unmap then that usually means the whole range isn't valid any more.
>
>
> Yes, this was an intentional design choice to not support partial unmap,
> and instead rely on the driver to recreate a new range.
>
> The reasoning is:
>
> - In practice, this should be rare for well-behaved applications.
>
> - With THP / large device pages, if a sub-range is unmapped, the entire
> GPU mapping is invalidated anyway due to the page size change. As a
> result, the cost of creating a new range is minimal, since the device
> will likely fault again on the remaining pages.
>
> So there is no need to over-engineer the common code.
>
> FWIW, to even test partial unmaps in Xe, I had to do things I doubt
> anyone would ever do:
>
> ptr = mmap(SZ_2M);
> /* fault in memory to the device */
> munmap(ptr, SZ_1M);
> /* touch memory again on the device */
>
Thank you for this explanation and the concrete example. After further
internal discussion with Christian, we are now aligned on the same
position regarding partial unmap. We will remove the rebuild-on-partial-unmap
logic in the next version and handle it as a partially backed range.
>>
>>
>> It is about partial unmap, some subregion in drm_gpusvm_range is still valid
>> but some other subregion is invalid, but under drm_gpusvm, need to destroy
>> the entire range.
>>
>> e.g.:
>>
>> [---------------unmap region in mmu notifier-----------------]
>> [0x1000 ------------ 0x9000]
>> [ valid ][ invalid ]
>>
>> see deatil in drm_gpusvm.c:110 line
>> section:Partial Unmapping of Ranges
>>
>>
>>>
>>>>
>>>> - drm_gpusvm range boundaries are determined by fault address
>>>> and pre setted chunk size, not by userspace attribute boundaries.
>>>> Ranges may be rechunked on memory changes. Embedding
>>>> attributes in framework ranges would scatter attr state
>>>> across many small ranges and require complex reassemble
>>>> logic when operate attrbute.
>>>
>>> Yeah, that makes a lot of sense.
>>>
>>>>
>>>> 2) System memory mapping via drm_gpusvm
>>>>
>>>> The core mapping path uses drm_gpusvm_range_find_or_insert() to
>>>> create ranges, drm_gpusvm_range_get_pages() for HMM page fault
>>>> and DMA mapping, then updates GPU page tables via
>>>> amdgpu_vm_update_range().
>>>>
>>>> 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
>>>>
>>>> On XNACK off hardware the GPU cannot recover from page faults,
>>>> so mappings must be established through ioctl. When
>>>> userspace calls SET_ATTR with ACCESS=ENABLE, the driver
>>>> walks the attr tree and maps all accessible intervals
>>>> to the GPU by amdgpu_svm_range_map_attr_ranges().
>
> Can you expand on XNACK off / GPU no faults? Is this to the share GPU
> between 3D (dma-fences) and faulting clients? We have something similar
> in Xe, but it isn't an explicit IOCTL rather we switch between on demand
> as 3D client submits and then resumes page faults when all dma-fences
> have signaled.
>
> I see below you mention page tables are modified during quiesce KFD
> queues? I'm not sure that is required - you just need to guarnette
> faulting clients won't trigger page faults when dma-fence is in flight.
>
> Maybe give me an explaination of exactly what the requirement from AMD
> are here so I have better picture.
Thank you for the patience, let me try to explain our situation, though
I may not get every detail right.
XNACK off means hardware that does not have GPU page fault capability
(or has it turned off).
So for these GPUs, ALL page table entries must be fully populated before
the GPU can access the memory. This is why we need the ioctl driven
mapping path: when userspace calls SET_ATTR with ACCESS=ENABLE, we need to
walk the attribute tree and eagerly map all accessible intervals into the
GPU page tables. This is functionally similar to what you describe as
prefetch IOCTLs / VM bind in Xe.
Regarding queue quiesce during page table modification: on XNACK off
hardware, because the GPU cannot fault, we must ensure the GPU is
completely stopped before modifying any PTE it might be accessing.
Otherwise the GPU could access a partially updated page table and hang.
The quiesce/resume is the mechanism to guarantee this.
I hope that helps clarify the picture.
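
As a rough sketch of what that eager, ioctl-driven walk could look like on
top of drm_gpusvm, assuming <drm/drm_gpusvm.h>: one accessible attribute
interval is faulted and mapped chunk by chunk. amdgpu_svm_map_one_range() is a
hypothetical helper for the driver's PTE update, and the -EAGAIN retry and
locking details are omitted; this is not the patch code:

int amdgpu_svm_map_one_range(struct drm_gpusvm_range *range); /* hypothetical */

static int svm_map_accessible_sketch(struct drm_gpusvm *gpusvm,
				     unsigned long start, unsigned long last,
				     const struct drm_gpusvm_ctx *ctx)
{
	unsigned long addr = start;
	int err;

	while (addr <= last) {
		struct drm_gpusvm_range *range;

		/* No fault address exists on XNACK-off hardware, so the
		 * attribute interval itself drives range creation. */
		range = drm_gpusvm_range_find_or_insert(gpusvm, addr,
							start, last + 1, ctx);
		if (IS_ERR(range))
			return PTR_ERR(range);

		/* HMM fault + DMA map; real code retries on -EAGAIN. */
		err = drm_gpusvm_range_get_pages(gpusvm, range, ctx);
		if (err)
			return err;

		err = amdgpu_svm_map_one_range(range);	/* hypothetical */
		if (err)
			return err;

		addr = drm_gpusvm_range_end(range);
	}

	return 0;
}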
>
>>>>
>>>> 4) Invalidation, GC worker, and restore worker
>>>>
>>>> MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
>>>> three cases based on event type and hardware mode:
>>>> - unmap event: clear GPU PTEs in the notifier context,
>>>> unmap DMA pages, mark ranges as unmapped, flush TLB,
>>>> and enqueue to the GC worker. On XNACK off, also
>>>> quiesce KFD queues and schedule rebuild of the
>>>> still valid portions that were destroyed together with
>>>> the unmapped subregion.
>>>>
>>>> - evict on XNACK off:
>>>> quiesce KFD queues first, then unmap DMA pages and
>>>> enqueue to the restore worker.
>>>
>>> Is that done through the DMA fence or by talking directly to the MES/HWS?
>>
>> Currently KFD queues quiesce/resume API are reused, lookig forward to a
>> better solution.
>>
>
> +1
>
>> Regards,
>> Honglei
>>
>>>
>>> Thanks,
>>> Christian.
>>>
>>>>
>>>> - evict on XNACK on:
>>>> clear GPU PTEs, unmap DMA pages, and flush TLB, but do
>>>> not schedule any worker. The GPU will fault on next
>>>> access and the fault handler establishes the mapping.
>>>>
>>>> Not supported feature:
>>>> - XNACK on GPU page fault mode
>>>> - migration and prefetch feature
>>>> - Multi GPU support
>>>>
>>>> XNACK on enablement is ongoing.The GPUs that support XNACK on
>>>> are currently only accessible to us via remote lab machines, which slows
>>>> down progress.
>>>>
>>>> Patch overview:
>>>>
>>>> 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
>>>> SET_ATTR/GET_ATTR operations, attribute types, and related
>>>> structs in amdgpu_drm.h.
>>>>
>>>> 02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
>>>> refcount, attr_tree, workqueues, locks, and
>>>> callbacks (begin/end_restore, flush_tlb).
>>>>
>>>> 03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
>>>> (interval tree node), attr_tree, access enum, flag masks,
>>>> and change trigger enum.
>>>>
>>>> 04/12 Attribute tree operations: interval tree lookup, insert,
>>>> remove, and tree create/destroy lifecycle.
>>>>
>>>> 05/12 Attribute set: validate UAPI attributes, apply to internal
>>>> attrs, handle hole/existing range with head/tail splitting,
>>>> compute change triggers, and -EAGAIN retry loop.
>>>> Implements attr_clear_pages for unmap cleanup and attr_get.
>>>>
>>>> 06/12 Range data structures: amdgpu_svm_range extending
>>>> drm_gpusvm_range with gpu_mapped state, pending ops,
>>>> pte_flags cache, and GC/restore queue linkage.
>>>>
>>>> 07/12 PTE flags and GPU mapping: simple gpu pte function,
>>>> GPU page table update with DMA address, range mapping loop:
>>>> find_or_insert -> get_pages -> validate -> update PTE,
>>>> and attribute change driven mapping function.
>>>>
>>>> 08/12 Notifier and invalidation: synchronous GPU PTE clear in
>>>> notifier context, range removal and overlap cleanup,
>>>> rebuild after destroy logic, and MMU event dispatcher
>>>>
>>>> 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
>>>> worker for unmap processing and rebuild, ordered restore
>>>> worker for mapping evicted ranges, and flush/sync
>>>> helpers.
>>>>
>>>> 10/12 Initialization and fini: kmem_cache for range/attr,
>>>> drm_gpusvm_init with chunk sizes, XNACK detection, TLB
>>>> flush helper, and amdgpu_svm init/close/fini lifecycle.
>>>>
>>>> 11/12 IOCTL and fault handler: PASID based SVM lookup with kref
>>>> protection, amdgpu_gem_svm_ioctl dispatcher, and
>>>> amdgpu_svm_handle_fault for GPU page fault recovery.
>>>>
>>>> 12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
>>>> Makefile rules, ioctl table registration, and amdgpu_vm
>>>> hooks (init in make_compute, close/fini, fault dispatch).
>>>>
>>>> Test result:
>>>> on gfx1100(W7900) and gfx943(MI300x)
>>>> kfd test: 95%+ passed, same failed cases with offical relase
>>>> rocr test: all passed
>>>> hip catch test: 20 cases failed in all 5366 cases, +13 failures vs offical relase
>>>>
>>>> During implementation we identified several challenges / design questions:
>>>>
>>>> 1. No range splitting on partial unmap
>>>>
>>>> drm_gpusvm explicitly does not support range splitting in drm_gpusvm.c:122.
>>>> Partial munmap needs to destroy the entire range including the valid interval.
>>>> GPU fault driven hardware can handle this design by extra gpu fault handle,
>>>> but AMDGPU needs to support XNACK off hardware, this design requires driver
>>>> rebuild the valid part in the removed entire range. Whichs bring a very heavy
>>>> restore work in work queue/GC worker: unmap/destroy -> rebuild(insert and map)
>>>> this restore work even heavier than kfd_svm. In previous driver work queue
>>>> only needs to restore or unmap, but in drm_gpusvm driver needs to unmap and restore.
>>>> which brings about more complex logic, heavier worker queue workload, and
>>>> synchronization issues.
>
> Is this common in the workload you are running? I'm also wondering if
> your restore logic / KFDs design is contributing to this actally the
> problem.
>
Honestly, you raise a fair point.
We will redesign the partial munmap logic, which should
eliminate most of this complexity.
>>>>
>>>> 2. Fault driven vs ioctl driven mapping
>>>>
>>>> drm_gpusvm is designed around GPU page fault handlers. The primary entry
>>>> point drm_gpusvm_range_find_or_insert() takes a fault_addr.
>>>> AMDGPU needs to support IOCTL driven mapping cause No XNACK hardware that
>>>> GPU cannot fault at all
>
> I think we refer to these as prefetch IOCTLs in Xe. Ideally, user space
> issues these so the device does not fault (e.g., prefetch creates a set
> of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
> specific VM bind operations.
>
That is a very helpful way to think about it. Yes, our ioctl driven
mapping (XNACK off) is essentially equivalent to a prefetch operation. We
are trying to improve it.
>>>>
>>>> The ioctl path cannot hold mmap_read_lock across the entire operation
>>>> because drm_gpusvm_range_find_or_insert() acquires/releases it
>>>> internally. This creates race windows with MMU notifiers / workers.
>
> This is a very intentional choice in the locking design: mmap_read_lock
> is held only in very specific parts of GPU SVM, and the driver should
> never need to take this lock.
>
> Yes, notifiers can race, which is why the GPU fault handler and prefetch
> handler are structured as retry loops when a notifier race is detected.
> In practice, with well-behaved applications, these races should be
> rare—but they do occur, and the driver must handle them.
>
> __xe_svm_handle_pagefault implements the page fault retry loop. VM bind
> prefetch has similar logic, although it is more spread out given that it
> is part of a deeper software pipeline.
>
> FWIW, holding locks to avoid races was rejected by Sima because we
> reasoned it is essentially impossible to guarantee the absence of races
> by holding a lock. CPU page fault handlers are also effectively just
> large retry loops.
>
> So this is one point I believe you will need to fixup driver side.
>
Understood. Thank you for the detailed explanation and for pointing to
__xe_svm_handle_pagefault as a reference. We will restructure both our
fault handler and ioctl path around a retry loop pattern with sequence
number race detection.
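To check our understanding, here is a minimal sketch of the loop we have in
mind (the amdgpu_svm_* helpers and the ctx member are placeholders from our
POC, and the drm_gpusvm signatures are from the tree we are based on, so they
may differ upstream):

static int amdgpu_svm_fault_retry(struct amdgpu_svm *svm, unsigned long addr,
                                  unsigned long start, unsigned long end)
{
        struct drm_gpusvm_range *range;
        int err;

retry:
        /* may take and drop mmap_read_lock internally, as you noted */
        range = drm_gpusvm_range_find_or_insert(&svm->gpusvm, addr,
                                                start, end, &svm->ctx);
        if (IS_ERR(range))
                return PTR_ERR(range);

        err = drm_gpusvm_range_get_pages(&svm->gpusvm, range, &svm->ctx);
        if (err == -EAGAIN)
                goto retry;     /* notifier raced, collect the pages again */
        if (err)
                return err;

        /*
         * Map under the notifier lock; if the pages were invalidated
         * between get_pages() and here, drop the lock and retry rather
         * than trying to hold a lock across the whole operation.
         */
        drm_gpusvm_notifier_lock(&svm->gpusvm);
        if (!drm_gpusvm_range_pages_valid(&svm->gpusvm, range)) {
                drm_gpusvm_notifier_unlock(&svm->gpusvm);
                goto retry;
        }
        err = amdgpu_svm_range_map_locked(svm, range);  /* placeholder PTE update */
        drm_gpusvm_notifier_unlock(&svm->gpusvm);

        return err;
}

The key change for us is that nothing outside drm_gpusvm takes mmap_read_lock,
and any invalidation detected after page collection simply restarts the loop.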
>>>>
>>>> 3. Multi GPU support
>>>>
>>>> drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
>>>> each GPU gets an independent instance with its own range tree, MMU
>>>> notifiers, notifier_lock, and DMA mappings.
>>>>
>
> This is a part I am absolutely open to fixing. Right now, each
> drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
> decoupling a GPU SVM instance from a single device, allowing each
> drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
> device).
>
> This would give drivers the flexibility to use one GPU SVM instance per
> VM/device instance (as in Xe), or to maintain a single GPU SVM per CPU
> MM.
>
That would be wonderful! Looking forward to your patch very much!
>>>> This may bring huge overhead:
>>>> - N x MMU notifier registrations for the same address range
>
> The notifier overhead is a real concern. We recently introduced two-pass
> notifiers [1] to speed up multi-device notifiers. At least in Xe, the
> TLB invalidations—which are the truly expensive part—can be pipelined
> using the two-pass approach. Currently, [1] only implements two-pass
> notifiers for userptr, but Xe’s GPU SVM will be updated to use them
> shortly.
>
> [1] https://patchwork.freedesktop.org/series/153280/
>
Thank you for the pointer to two-pass notifiers. Will study this
series.
>>>> - N x hmm_range_fault() calls for the same page (KFD: 1x)
>
> hmm_range_fault is extremely fast compared to the actual migration.
> Running hmm_range_fault on a 2MB region using 4KB pages takes less
> than 1µs. With THP or large device pages [2] (merged last week), it’s
> around 1/20 of a microsecond. So I wouldn’t be too concerned about this.
>
> [2] https://patchwork.freedesktop.org/series/163141/
>
That is very helpful data. Perhaps we were worrying too much.
>>>> - N x DMA mapping memory
>
> You will always have N x DMA mapping memory if the pages are in system
> memory as the dma-mapping API is per device.
Totally agreed.
>
>>>> - N x invalidation + restore worker scheduling per CPU unmap event
>>>> - N x GPU page table flush / TLB invalidation
>
> I agree you do not want to serialize GPU page table flush / TLB
> invalidations. Hence two-pass notifiers [1].
Yes, will study it.
>
>>>> - Increased mmap_lock hold time, N callbacks serialize under it
>>>>
>>>> compatibility issues:
>>>> - Quiesce/resume scope mismatch: to integrate with KFD compute
>>>> queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm()
>>>> which have process level semantics. Under the per GPU
>>>> drm_gpusvm model there may be synchronization issues. To properly
>>>> integrate with KFD under the per SVM model, a compatibility layer or
>>>> new per VM level queue control APIs may need to be introduced.
>>>>
>
> I thought the idea was to get rid of KFD and move over to AMDGPU? I thought
> Christian mentioned this to me at XDC.
>
>>>> Migration challenges:
>>>>
>>>> - No global migration decision logic: each per GPU SVM
>>>> instance maintains its own attribute tree independently. This
>>>> allows conflicting settings (e.g., GPU0's SVM sets
>>>> PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
>>>> for the same address range) with no detection or resolution.
>>>> A global attribute coordinator or a shared manager is needed to
>>>> provide a unified global view for migration decisions
>
> Yes, this is a hole in the Xe API too. We have told UMDs that if they set up
> individual VMs with conflicting attributes for a single CPU address space,
> the behavior is undefined. Our UMD implements madvise as basically a loop
> over all GPU VMs setting the same attributes.
Will follow the same approach for now: the UMD is responsible for
setting consistent attributes across GPU VMs.
>
>>>>
>>>> - migrate_vma_setup broadcast: one GPU's migration triggers MMU
>>>> notifier callbacks in ALL N-1 other drm_gpusvm instances,
>>>> causing N-1 unnecessary restore workers to be scheduled. And
>
> My feeling is that you shouldn’t reschedule restore workers unless you
> actually have to invalidate page tables (i.e., you have a local SVM
> range within the notifier). So the first migration to an untouched
> region may trigger notifiers, but they won’t do anything because you
> don’t have any valid SVM ranges yet. Subsequent mappings of the migrated
> region won’t trigger a notifier unless the memory is moved again.
>
That is a very good point. We should check whether we actually have
valid SVM ranges before scheduling restore workers. If there is nothing
to invalidate, the notifier callback should be a no-op. We will review
our notifier callback logic to ensure we are not doing unnecessary work
here. Thank you for pointing this out.
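Concretely, the callback would end up looking roughly like this (the amdgpu
struct, field, and helper names are from our POC and may change):

static void amdgpu_svm_range_invalidate(struct drm_gpusvm *gpusvm,
                                        struct drm_gpusvm_notifier *notifier,
                                        const struct mmu_notifier_range *mmu_range)
{
        struct amdgpu_svm *svm = container_of(gpusvm, struct amdgpu_svm, gpusvm);
        struct drm_gpusvm_range *range;
        bool need_work = false;

        drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
                                  mmu_range->end) {
                struct amdgpu_svm_range *srange = to_amdgpu_svm_range(range);

                /* nothing mapped on this GPU, so nothing to invalidate */
                if (!srange->gpu_mapped)
                        continue;

                amdgpu_svm_range_clear_ptes(svm, srange);       /* placeholder */
                need_work = true;
        }

        /* only schedule GC/restore work when PTEs were actually touched */
        if (need_work)
                queue_work(svm->gc_wq, &svm->gc_work);
}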
>>>> creates races between the initiating migration and the other
>>>> instance's restore attempts.
>
> Yes, if multiple devices try to migrate the same CPU pages at the same
> time, that will race. That’s why in Xe we have a module-level
> driver_migrate_lock. The first migration runs in read mode; if it
> detects a race and aborts, it then takes driver_migrate_lock in write
> mode so it becomes the only device allowed to move memory / CPU pages.
> See xe_svm_alloc_vram() for how this is used.
>
> I’m not sure this approach will work for you, but I just wanted to point
> out that we identified this as a potential issue.
>
Thank you for sharing the driver_migrate_lock approach and pointing to
xe_svm_alloc_vram(). Will explore whether a similar lock pattern can
work for our case.
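If the pattern does transfer, it would probably look something like this on
our side (the lock and helpers below are invented for illustration, not
existing amdgpu or Xe symbols):

/* one lock for the whole module, as in the Xe approach described above */
static DECLARE_RWSEM(amdgpu_svm_migrate_sem);

static int amdgpu_svm_migrate_to_vram(struct amdgpu_svm *svm,
                                      unsigned long start, unsigned long end)
{
        int err;

        /* optimistic pass: GPUs migrating different pages can run in parallel */
        down_read(&amdgpu_svm_migrate_sem);
        err = amdgpu_svm_do_migrate(svm, start, end);   /* placeholder */
        up_read(&amdgpu_svm_migrate_sem);
        if (err != -EBUSY)
                return err;

        /* raced with another device moving the same pages: retry exclusively */
        down_write(&amdgpu_svm_migrate_sem);
        err = amdgpu_svm_do_migrate(svm, start, end);
        up_write(&amdgpu_svm_migrate_sem);

        return err;
}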
>>>>
>>>> - No cross instance migration serialization: each per GPU
>>>> drm_gpusvm instance has independent locking, so two GPUs'
>>>> "decide -> migrate -> remap" sequences can interleave. While
>>>> the kernel page lock prevents truly simultaneous migration of
>>>> the same physical page, the losing side's retry (evict from
>>>> other GPU's VRAM -> migrate back) triggers broadcast notifier
>>>> invalidations and restore workers, compounding the ping pong
>>>> problem above.
>>>>
>
> See the driver_migrate_lock above.
Acknowledged, thank you.
>
>>>> - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
>>>> hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
>>>> it only selects system memory pages for migration.
>>>>
>
> I think this is fixed? We did find some core MM bugs that blocked VRAM
> to VRAM but those have been worked out.
>
> The code I'm looking at:
>
> 517 int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
> 518 struct mm_struct *mm,
> 519 unsigned long start, unsigned long end,
> 520 const struct drm_pagemap_migrate_details *mdetails)
> 521 {
> 522 const struct drm_pagemap_devmem_ops *ops = devmem_allocation->ops;
> 523 struct drm_pagemap *dpagemap = devmem_allocation->dpagemap;
> 524 struct dev_pagemap *pagemap = dpagemap->pagemap;
> 525 struct migrate_vma migrate = {
> 526 .start = start,
> 527 .end = end,
> 528 .pgmap_owner = pagemap->owner,
> 529 .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
> 530 MIGRATE_VMA_SELECT_DEVICE_PRIVATE | MIGRATE_VMA_SELECT_COMPOUND,
> 531 };
>
Thank you for checking! I am using v6.18 for this POC, missed the fix,
will rebase to the latest.
>>>> - CPU fault reverse migration race: CPU page fault triggers
>>>> migrate_to_ram while GPU instances are concurrently operating.
>>>> Per GPU notifier_lock does not protect cross GPU operations.
>
> No, again retry loop as discussed above.
Understood.
>
>>>>
>>>> We believe a strong, well designed solution at the framework level is
>>>> needed to properly address these problems, and we look forward to
>>>> discussion and suggestions.
>
> Let's work together to figure out what is missing here.
Thank you so much, Matt. Your feedback has been incredibly valuable and
has given us a much clearer picture of the framework's design.
I really appreciate the effort you put into building drm_gpusvm as a
shared framework. Will incorporate your suggestions into our next
revision and look forward to continuing the collaboration.
Regards,
Honglei
>
> Matt
>
>>>>
>>>> Honglei Huang (12):
>>>> drm/amdgpu: add SVM UAPI definitions
>>>> drm/amdgpu: add SVM data structures and header
>>>> drm/amdgpu: add SVM attribute data structures
>>>> drm/amdgpu: implement SVM attribute tree operations
>>>> drm/amdgpu: implement SVM attribute set
>>>> drm/amdgpu: add SVM range data structures
>>>> drm/amdgpu: implement SVM range PTE flags and GPU mapping
>>>> drm/amdgpu: implement SVM range notifier and invalidation
>>>> drm/amdgpu: implement SVM range workers
>>>> drm/amdgpu: implement SVM core initialization and fini
>>>> drm/amdgpu: implement SVM ioctl and fault handler
>>>> drm/amdgpu: wire up SVM build system and fault handler
>>>>
>>>> drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
>>>> drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 ++++++++++++
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
>>>> include/uapi/drm/amdgpu_drm.h | 39 +
>>>> 12 files changed, 2958 insertions(+), 4 deletions(-)
>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
>>>>
>>>>
>>>> base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
>>>
>>
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-03-19 14:17 ` Honglei Huang
@ 2026-03-23 6:31 ` Matthew Brost
2026-03-24 7:24 ` Honglei Huang
2026-04-23 6:09 ` Huang, Honglei1
0 siblings, 2 replies; 36+ messages in thread
From: Matthew Brost @ 2026-03-23 6:31 UTC (permalink / raw)
To: Honglei Huang
Cc: Christian König, amd-gfx, dri-devel, Alexander.Deucher,
Felix.Kuehling, Honglei Huang, Oak.Zeng, Jenny-Jing.Liu,
Philip.Yang, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu, Junhua.Shen,
Thomas Hellström, Rodrigo Vivi, Danilo Krummrich, Alice Ryhl
On Thu, Mar 19, 2026 at 10:17:36PM +0800, Honglei Huang wrote:
>
>
> On 3/19/26 13:08, Matthew Brost wrote:
> > On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
> > >
> >
> > Disclaimer I haven't look at any code in this series yet.
> >
> > >
> > > On 3/17/26 19:48, Christian König wrote:
> > > > Adding a few XE and drm_gpuvm people on TO.
> > > >
> > > > On 3/17/26 12:29, Honglei Huang wrote:
> > > > > From: Honglei Huang <honghuan@amd.com>
> > > > >
> > > > > This is a POC/draft patch series of SVM feature in amdgpu based on the
> > > > > drm_gpusvm framework. The primary purpose of this RFC is to validate
> > > > > the framework's applicability, identify implementation challenges,
> > > > > and start discussion on framework evolution. This is not a production
> >
> > +1. Open to any ideas. Given this was designed originally for Xe we very
> > well could have missed other drivers requirements.
> Hi Matt,
>
> Thank you for the openness. And thank you so much for the incredibly
> detailed and patient response. I really appreciate you taking the time to
> walk through each point.
>
I'm here to help.
> Actually I am still a learner when it comes to the drm_gpusvm framework and
> GPU SVM design in general. Some of my descriptions below may not be entirely
> accurate. But I really want to bring drm_gpusvm into amdgpu and make it work
> well.
I appreciate another driver jumping in and using this framework—it
becomes easier to validate as more users adopt it.
>
> >
> > > > > ready submission.
> > > > >
> > > > > This patch series implements basic SVM support with the following features:
> > > > >
> > > > > 1. attributes sepatarated from physical page management:
> > > > >
> > > > > - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
> > > > > tree that stores SVM attributes. Managed through the SET_ATTR,
> > > > > and mmu notifier callback.
> >
> > Can you explain the mmu notifier callback interaction here? See below in
> > Xe the attribute tree is existing VMA tree (gpuvm).
> >
>
> Let me try to explain, apologies if the description is not fully
> precise.
>
> In current implementation, the MMU notifier callback interacts with the attr
> tree only in the munmap path remove the corresponding attribute
> entries from the attr tree so that stale attributes do not persist for
> freed address space.
>
Ah, yes. We reset our attributes upon munmap too. We actually don't do this
100% correctly either, and there is a series in flight to fix it [1].
[1] https://patchwork.freedesktop.org/series/161815/
> > > > >
> > > > > - Physical page layer (drm_gpusvm ranges): managed by the
> > > > > drm_gpusvm framework, representing actual HMM backed DMA
> > > > > mappings and GPU page table entries.
> > > > >
> > > > > This separation is necessary:
> > > > > - The framework does not support range splitting, so a partial
> > > > > munmap destroys the entire overlapping range, including the
> > > > > still valid parts. If attributes were stored inside drm_gpusvm
> > > > > ranges, they would be lost on unmapping.
> > > > > The separate attr tree preserves userspace set attributes
> > > > > across range operations.
> >
> > Yes, in Xe the divide is at the VMA level (set by user space) via VM
> > bind (parts of VM may be mappings BOs, parts could be setup for SVM) or
> > madvise IOCTLs which reflect user space attributes on current SVM
> > mappings or future ones.
> >
> > The SVM range tree reflects mappings that have been faulted into the
> > device and contain pages. This is an intentional choice.
>
> That makes a lot of sense. Thank you for clarifying the design intent. I
> think the current adopt the same principle: the drm_gpusvm range tree only
> reflect actual faulted in mappings.
>
> >
> > > >
> > > > Isn't that actually intended? When parts of the range unmap then that usually means the whole range isn't valid any more.
> >
> >
> > Yes, this was an intentional design choice to not support partial unmap,
> > and instead rely on the driver to recreate a new range.
> >
> > The reasoning is:
> >
> > - In practice, this should be rare for well-behaved applications.
> >
> > - With THP / large device pages, if a sub-range is unmapped, the entire
> > GPU mapping is invalidated anyway due to the page size change. As a
> > result, the cost of creating a new range is minimal, since the device
> > will likely fault again on the remaining pages.
> >
> > So there is no need to over-engineer the common code.
> >
> > FWIW, to even test partial unmaps in Xe, I had to do things I doubt
> > anyone would ever do:
> >
> > ptr = mmap(SZ_2M);
> > /* fault in memory to the device */
> > munmap(ptr, SZ_1M);
> > /* touch memory again on the device */
> >
>
> Thank you for this explanation and the concrete example. After further
> discussion internally with Christian, we are now aligned with same position
> partial unmap. Will remove rebuild on partial unmap logic in the next
> version and handle it as only partially backed range.
>
> > >
> > >
> > > It is about partial unmap, some subregion in drm_gpusvm_range is still valid
> > > but some other subregion is invalid, but under drm_gpusvm, need to destroy
> > > the entire range.
> > >
> > > e.g.:
> > >
> > > [---------------unmap region in mmu notifier-----------------]
> > > [0x1000 ------------ 0x9000]
> > > [ valid ][ invalid ]
> > >
> > > see deatil in drm_gpusvm.c:110 line
> > > section:Partial Unmapping of Ranges
> > >
> > >
> > > >
> > > > >
> > > > > - drm_gpusvm range boundaries are determined by fault address
> > > > > and pre setted chunk size, not by userspace attribute boundaries.
> > > > > Ranges may be rechunked on memory changes. Embedding
> > > > > attributes in framework ranges would scatter attr state
> > > > > across many small ranges and require complex reassemble
> > > > > logic when operate attrbute.
> > > >
> > > > Yeah, that makes a lot of sense.
> > > >
> > > > >
> > > > > 2) System memory mapping via drm_gpusvm
> > > > >
> > > > > The core mapping path uses drm_gpusvm_range_find_or_insert() to
> > > > > create ranges, drm_gpusvm_range_get_pages() for HMM page fault
> > > > > and DMA mapping, then updates GPU page tables via
> > > > > amdgpu_vm_update_range().
> > > > >
> > > > > 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
> > > > >
> > > > > On XNACK off hardware the GPU cannot recover from page faults,
> > > > > so mappings must be established through ioctl. When
> > > > > userspace calls SET_ATTR with ACCESS=ENABLE, the driver
> > > > > walks the attr tree and maps all accessible intervals
> > > > > to the GPU by amdgpu_svm_range_map_attr_ranges().
> >
> > Can you expand on XNACK off / GPU no faults? Is this to the share GPU
> > between 3D (dma-fences) and faulting clients? We have something similar
> > in Xe, but it isn't an explicit IOCTL rather we switch between on demand
> > as 3D client submits and then resumes page faults when all dma-fences
> > have signaled.
> >
> > I see below you mention page tables are modified during quiesce KFD
> > queues? I'm not sure that is required - you just need to guarnette
> > faulting clients won't trigger page faults when dma-fence is in flight.
> >
> > Maybe give me an explaination of exactly what the requirement from AMD
> > are here so I have better picture.
>
> Thank you for the patience, let me try to explain our situation, though
> I may not get every detail right.
>
> XNACK off means hardware that does not have GPU page fault capability (or
> turned off)
>
> So for these GPUs, ALL page table entries must be fully populated before
> the GPU can access the memory. This is why we need the ioctl driven
> mapping path, when userspace calls SET_ATTR with ACCESS=ENABLE, need
> walk the attribute tree and eagerly map all accessible ranges into the
> GPU page tables. This is functionally similar to what you describe as
> prefetch IOCTLs / VM bind in Xe.
>
> Regarding queue quiesce during page table modification: on XNACK off
> hardware, because the GPU cannot fault, we must ensure the GPU is
> completely stopped before modifying any PTE it might be accessing.
> Otherwise the GPU could access a partially updated page table and hang.
> The quiesce/resume is the mechanism to guarantee this.
>
> I hope that helps clarify the picture.
>
This clarifies a lot. This is what we’d call in Xe “preemption fence”
mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
resume. We don’t actually support SVM in this case; instead, we use
“userptr binds,” which are built on gpusvm for page collection. However,
we don’t support migrating memory to the device—though we could.
I’d look at how we converted 'userptr' to be based on GPU SVM [2]. In
this case, don’t maintain a range tree, as those—as you suggest—are more
of an on-demand-faulting driver concern. Instead, just embed 'struct
drm_gpusvm_pages' in the VMA struct defined by the IOCTLs.
We could extend this to support migrating 'userptr', but we just haven’t
done that yet—this may be what you want to do in “XNACK off” mode.
[2] https://patchwork.freedesktop.org/series/146553/
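Very roughly, the shape is something like this (all names below are made up on
the spot; see [2] for what Xe actually does):

struct amdgpu_svm_vma {                         /* the IOCTL-defined mapping */
        struct list_head        vm_link;        /* tracked by the GPU VM */
        unsigned long           start;
        unsigned long           end;
        struct amdgpu_svm_attrs attrs;          /* attributes from SET_ATTR */
        struct drm_gpusvm_pages pages;          /* HMM page collection, no range tree */
        bool                    gpu_mapped;
};

The point is that the IOCTL-defined mapping owns the page collection directly,
so there is no fault-driven range tree to rebuild on partial unmaps.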
>
> >
> > > > >
> > > > > 4) Invalidation, GC worker, and restore worker
> > > > >
> > > > > MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
> > > > > three cases based on event type and hardware mode:
> > > > > - unmap event: clear GPU PTEs in the notifier context,
> > > > > unmap DMA pages, mark ranges as unmapped, flush TLB,
> > > > > and enqueue to the GC worker. On XNACK off, also
> > > > > quiesce KFD queues and schedule rebuild of the
> > > > > still valid portions that were destroyed together with
> > > > > the unmapped subregion.
> > > > >
> > > > > - evict on XNACK off:
> > > > > quiesce KFD queues first, then unmap DMA pages and
> > > > > enqueue to the restore worker.
> > > >
> > > > Is that done through the DMA fence or by talking directly to the MES/HWS?
> > >
> > > Currently KFD queues quiesce/resume API are reused, lookig forward to a
> > > better solution.
> > >
> >
> > +1
> >
> > > Regards,
> > > Honglei
> > >
> > > >
> > > > Thanks,
> > > > Christian.
> > > >
> > > > >
> > > > > - evict on XNACK on:
> > > > > clear GPU PTEs, unmap DMA pages, and flush TLB, but do
> > > > > not schedule any worker. The GPU will fault on next
> > > > > access and the fault handler establishes the mapping.
> > > > >
> > > > > Not supported feature:
> > > > > - XNACK on GPU page fault mode
> > > > > - migration and prefetch feature
> > > > > - Multi GPU support
> > > > >
> > > > > XNACK on enablement is ongoing.The GPUs that support XNACK on
> > > > > are currently only accessible to us via remote lab machines, which slows
> > > > > down progress.
> > > > >
> > > > > Patch overview:
> > > > >
> > > > > 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
> > > > > SET_ATTR/GET_ATTR operations, attribute types, and related
> > > > > structs in amdgpu_drm.h.
> > > > >
> > > > > 02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
> > > > > refcount, attr_tree, workqueues, locks, and
> > > > > callbacks (begin/end_restore, flush_tlb).
> > > > >
> > > > > 03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
> > > > > (interval tree node), attr_tree, access enum, flag masks,
> > > > > and change trigger enum.
> > > > >
> > > > > 04/12 Attribute tree operations: interval tree lookup, insert,
> > > > > remove, and tree create/destroy lifecycle.
> > > > >
> > > > > 05/12 Attribute set: validate UAPI attributes, apply to internal
> > > > > attrs, handle hole/existing range with head/tail splitting,
> > > > > compute change triggers, and -EAGAIN retry loop.
> > > > > Implements attr_clear_pages for unmap cleanup and attr_get.
> > > > >
> > > > > 06/12 Range data structures: amdgpu_svm_range extending
> > > > > drm_gpusvm_range with gpu_mapped state, pending ops,
> > > > > pte_flags cache, and GC/restore queue linkage.
> > > > >
> > > > > 07/12 PTE flags and GPU mapping: simple gpu pte function,
> > > > > GPU page table update with DMA address, range mapping loop:
> > > > > find_or_insert -> get_pages -> validate -> update PTE,
> > > > > and attribute change driven mapping function.
> > > > >
> > > > > 08/12 Notifier and invalidation: synchronous GPU PTE clear in
> > > > > notifier context, range removal and overlap cleanup,
> > > > > rebuild after destroy logic, and MMU event dispatcher
> > > > >
> > > > > 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
> > > > > worker for unmap processing and rebuild, ordered restore
> > > > > worker for mapping evicted ranges, and flush/sync
> > > > > helpers.
> > > > >
> > > > > 10/12 Initialization and fini: kmem_cache for range/attr,
> > > > > drm_gpusvm_init with chunk sizes, XNACK detection, TLB
> > > > > flush helper, and amdgpu_svm init/close/fini lifecycle.
> > > > >
> > > > > 11/12 IOCTL and fault handler: PASID based SVM lookup with kref
> > > > > protection, amdgpu_gem_svm_ioctl dispatcher, and
> > > > > amdgpu_svm_handle_fault for GPU page fault recovery.
> > > > >
> > > > > 12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
> > > > > Makefile rules, ioctl table registration, and amdgpu_vm
> > > > > hooks (init in make_compute, close/fini, fault dispatch).
> > > > >
> > > > > Test result:
> > > > > on gfx1100(W7900) and gfx943(MI300x)
> > > > > kfd test: 95%+ passed, same failed cases with offical relase
> > > > > rocr test: all passed
> > > > > hip catch test: 20 cases failed in all 5366 cases, +13 failures vs offical relase
> > > > >
> > > > > During implementation we identified several challenges / design questions:
> > > > >
> > > > > 1. No range splitting on partial unmap
> > > > >
> > > > > drm_gpusvm explicitly does not support range splitting in drm_gpusvm.c:122.
> > > > > Partial munmap needs to destroy the entire range including the valid interval.
> > > > > GPU fault driven hardware can handle this design by extra gpu fault handle,
> > > > > but AMDGPU needs to support XNACK off hardware, this design requires driver
> > > > > rebuild the valid part in the removed entire range. Whichs bring a very heavy
> > > > > restore work in work queue/GC worker: unmap/destroy -> rebuild(insert and map)
> > > > > this restore work even heavier than kfd_svm. In previous driver work queue
> > > > > only needs to restore or unmap, but in drm_gpusvm driver needs to unmap and restore.
> > > > > which brings about more complex logic, heavier worker queue workload, and
> > > > > synchronization issues.
> >
> > Is this common in the workload you are running? I'm also wondering if
> > your restore logic / KFDs design is contributing to this actally the
> > problem.
> >
>
> Honestly, you raise a fair point.
>
> We will redesign the logic about the partial munap, which should eliminate
> most of this complexity.
>
>
+1, yes, test it, but don't optimize for it.
> > > > >
> > > > > 2. Fault driven vs ioctl driven mapping
> > > > >
> > > > > drm_gpusvm is designed around GPU page fault handlers. The primary entry
> > > > > point drm_gpusvm_range_find_or_insert() takes a fault_addr.
> > > > > AMDGPU needs to support IOCTL driven mapping cause No XNACK hardware that
> > > > > GPU cannot fault at all
> >
> > I think we refer to these as prefetch IOCTLs in Xe. Ideally, user space
> > issues these so the device does not fault (e.g., prefetch creates a set
> > of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
> > specific VM bind operations.
> >
>
> That is a very helpful way to think about it. Yes, our ioctl driven
> mapping(xnack off) is essentially equivalent to a prefetch operation. We are
> trying to improve it.
>
See above wrt 'userptr'.
>
> > > > >
> > > > > The ioctl path cannot hold mmap_read_lock across the entire operation
> > > > > because drm_gpusvm_range_find_or_insert() acquires/releases it
> > > > > internally. This creates race windows with MMU notifiers / workers.
> >
> > This is a very intentional choice in the locking design: mmap_read_lock
> > is held only in very specific parts of GPU SVM, and the driver should
> > never need to take this lock.
> >
> > Yes, notifiers can race, which is why the GPU fault handler and prefetch
> > handler are structured as retry loops when a notifier race is detected.
> > In practice, with well-behaved applications, these races should be
> > rare—but they do occur, and the driver must handle them.
> >
> > __xe_svm_handle_pagefault implements the page fault retry loop. VM bind
> > prefetch has similar logic, although it is more spread out given that it
> > is part of a deeper software pipeline.
> >
> > FWIW, holding locks to avoid races was rejected by Sima because we
> > reasoned it is essentially impossible to guarantee the absence of races
> > by holding a lock. CPU page fault handlers are also effectively just
> > large retry loops.
> >
> > So this is one point I believe you will need to fixup driver side.
> >
>
> Understood. Thank you for the detailed explanation and for pointing to
> __xe_svm_handle_pagefault as a reference. We will restructure both our
> fault handler and ioctl path to a betterretry loop pattern with sequence
> number race detection.
>
Yes, the typical pattern is:
- Try to migrate once
- If you hit a race, give up, evict all memory back to system memory, and bind it
Atomics make this tricky because memory must move, but I’m not sure
“XNACK off” applies here. However, GPU SVM provides a timeslice
mechanism to ensure the CPU can’t move memory while the GPU needs to
execute something.
> > > > >
> > > > > 3. Multi GPU support
> > > > >
> > > > > drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
> > > > > each GPU gets an independent instance with its own range tree, MMU
> > > > > notifiers, notifier_lock, and DMA mappings.
> > > > >
> >
> > This is a part I am absolutely open to fixing. Right now, each
> > drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
> > decoupling a GPU SVM instance from a single device, allowing each
> > drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
> > device).
> >
> > This would give drivers the flexibility to use one GPU SVM instance per
> > VM/device instance (as in Xe), or to maintain a single GPU SVM per CPU
> > MM.
> >
>
> That would be wonderful! Looking forward to your patch very much!
>
I can't say I'll code this, but we have thought about it as an option and are
very open to patches which refactor the object model for multiple use cases.
>
> > > > > This may brings huge overhead:
> > > > > - N x MMU notifier registrations for the same address range
> >
> > The notifier overhead is a real concern. We recently introduced two-pass
> > notifiers [1] to speed up multi-device notifiers. At least in Xe, the
> > TLB invalidations—which are the truly expensive part—can be pipelined
> > using the two=pass approach. Currently, [1] only implements two-pass
> > notifiers for userptr, but Xe’s GPU SVM will be updated to use them
> > shortly.
> >
> > [1] https://patchwork.freedesktop.org/series/153280/
> >
>
> Thank you for the pointer to two-pass notifiers. Will study this
> series.
>
> > > > > - N x hmm_range_fault() calls for the same page (KFD: 1x)
> >
> > hmm_range_fault is extremely fast compared to the actual migration.
> > Running hmm_range_fault on a 2MB region using 4KB pages takes less
> > than 1µs. With THP or large device pages [2] (merged last week), it’s
> > around 1/20 of a microsecond. So I wouldn’t be too concerned about this.
> >
> > [2] https://patchwork.freedesktop.org/series/163141/
> >
>
> That is very helpful data. Perhaps worry too much.
>
> > > > > - N x DMA mapping memory
> >
> > You will always have N x DMA mapping memory if the pages are in system
> > memory as the dma-mapping API is per device.
>
> Totally agreed.
>
> >
> > > > > - N x invalidation + restore worker scheduling per CPU unmap event
> > > > > - N x GPU page table flush / TLB invalidation
> >
> > I agree you do not want serialize GPU page table flush / TLB
> > invalidations. Hence two-pass notifiers [1].
>
> Yes, will learn it.
>
> >
> > > > > - Increased mmap_lock hold time, N callbacks serialize under it
> > > > >
> > > > > compatibility issues:
> > > > > - Quiesce/resume scope mismatch: to integrate with KFD compute
> > > > > queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm()
> > > > > which have process level semantics. Under the per GPU
> > > > > drm_gpusvm model, maybe there are some issues on sync. To properly
> > > > > integrate with KFD under the per SVM model, a compatibility or
> > > > > new per VM level queue control APIs maybe need to introduced.
> > > > >
> >
> > I thought the idea to get rid of KFD and move over to AMDGPU? I thought
> > Christian mentioned this to me at XDC.
> >
>
> > > > > Migration challenges:
> > > > >
> > > > > - No global migration decision logic: each per GPU SVM
> > > > > instance maintains its own attribute tree independently. This
> > > > > allows conflicting settings (e.g., GPU0's SVM sets
> > > > > PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
> > > > > for the same address range) with no detection or resolution.
> > > > > A global attribute coordinator or a shared manager is needed to
> > > > > provide a unified global view for migration decisions
> >
> > Yes, this is hole in the Xe API too. We have told UMDs if they setup
> > individual VMs with conflict attributes for a single CPU address space
> > the behavior is undefined. Our UMD implement madvise is basically loop
> > over al GPU VMs setting the same attributes.
>
> Will follow the same approach for now, the UMD is responsible for setting
> consistent attributes across GPU VMs.
>
+1
> >
> > > > >
> > > > > - migrate_vma_setup broadcast: one GPU's migration triggers MMU
> > > > > notifier callbacks in ALL N-1 other drm_gpusvm instances,
> > > > > causing N-1 unnecessary restore workers to be scheduled. And
> >
> > My feeling is that you shouldn’t reschedule restore workers unless you
> > actually have to invalidate page tables (i.e., you have a local SVM
> > range within the notifier). So the first migration to an untouched
> > region may trigger notifiers, but they won’t do anything because you
> > don’t have any valid SVM ranges yet. Subsequent mappings of the migrated
> > region won’t trigger a notifier unless the memory is moved again.
> >
>
> That is a very good point. We should check whether we actually have
> valid SVM ranges before scheduling restore workers. If there is nothing
> to invalidate, the notifier callback should be a no-op. We will review
> our notifier callback logic to ensure we are not doing unnecessary work
> here. Thank you for pointing this out.
>
> > > > > creates races between the initiating migration and the other
> > > > > instance's restore attempts.
> >
> > Yes, if multiple devices try to migrate the same CPU pages at the same
> > time, that will race. That’s why in Xe we have a module-level
> > driver_migrate_lock. The first migration runs in read mode; if it
> > detects a race and aborts, it then takes driver_migrate_lock in write
> > mode so it becomes the only device allowed to move memory / CPU pages.
> > See xe_svm_alloc_vram() for how this is used.
> >
> > I’m not sure this approach will work for you, but I just wanted to point
> > out that we identified this as a potential issue.
> >
>
> Thank you for sharing the driver_migrate_lock approach and pointing to
> xe_svm_alloc_vram(). Will explore whether a similar lock pattern can work
> for our case.
>
> > > > >
> > > > > - No cross instance migration serialization: each per GPU
> > > > > drm_gpusvm instance has independent locking, so two GPUs'
> > > > > "decide -> migrate -> remap" sequences can interleave. While
> > > > > the kernel page lock prevents truly simultaneous migration of
> > > > > the same physical page, the losing side's retry (evict from
> > > > > other GPU's VRAM -> migrate back) triggers broadcast notifier
> > > > > invalidations and restore workers, compounding the ping pong
> > > > > problem above.
> > > > >
> >
> > See the driver_migrate_lock above.
>
> Acknowledged, thank you.
> >
> > > > > - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
> > > > > hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
> > > > > it only selects system memory pages for migration.
> > > > >
> >
> > I think this is fixed? We did find some core MM bugs that blocked VRAM
> > to VRAM but those have been worked out.
> >
> > The code I'm looking at:
> >
> > 517 int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
> > 518 struct mm_struct *mm,
> > 519 unsigned long start, unsigned long end,
> > 520 const struct drm_pagemap_migrate_details *mdetails)
> > 521 {
> > 522 const struct drm_pagemap_devmem_ops *ops = devmem_allocation->ops;
> > 523 struct drm_pagemap *dpagemap = devmem_allocation->dpagemap;
> > 524 struct dev_pagemap *pagemap = dpagemap->pagemap;
> > 525 struct migrate_vma migrate = {
> > 526 .start = start,
> > 527 .end = end,
> > 528 .pgmap_owner = pagemap->owner,
> > 529 .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
> > 530 MIGRATE_VMA_SELECT_DEVICE_PRIVATE | MIGRATE_VMA_SELECT_COMPOUND,
> > 531 };
> >
>
> Thank you for checking! I am using v6.18 for this POC, missed the fix, will
> rebase to the latest.
>
>
> > > > > - CPU fault reverse migration race: CPU page fault triggers
> > > > > migrate_to_ram while GPU instances are concurrently operating.
> > > > > Per GPU notifier_lock does not protect cross GPU operations.
> >
> > No, again retry loop as discussed above.
>
> Understood.
>
> >
> > > > >
> > > > > We believe a strong, well designed solution at the framework level is
> > > > > needed to properly address these problems, and we look forward to
> > > > > discussion and suggestions.
> >
> > Let's work together to figure out what is missing here.
>
> Thank you so much, Matt. Your feedback has been incredibly valuable and
> has given us a much clearer picture of the framework's design.
> Ireally appreciate the effort you put into building drm_gpusvm as a
> shared framework. Will incorporate your suggestions into our next
> revision and look forward to continuing the collaboration.
>
No problem. Happy to help.
Matt
> Regards,
> Honglei
>
>
> >
> > Matt
> >
> > > > >
> > > > > Honglei Huang (12):
> > > > > drm/amdgpu: add SVM UAPI definitions
> > > > > drm/amdgpu: add SVM data structures and header
> > > > > drm/amdgpu: add SVM attribute data structures
> > > > > drm/amdgpu: implement SVM attribute tree operations
> > > > > drm/amdgpu: implement SVM attribute set
> > > > > drm/amdgpu: add SVM range data structures
> > > > > drm/amdgpu: implement SVM range PTE flags and GPU mapping
> > > > > drm/amdgpu: implement SVM range notifier and invalidation
> > > > > drm/amdgpu: implement SVM range workers
> > > > > drm/amdgpu: implement SVM core initialization and fini
> > > > > drm/amdgpu: implement SVM ioctl and fault handler
> > > > > drm/amdgpu: wire up SVM build system and fault handler
> > > > >
> > > > > drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
> > > > > drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
> > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
> > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
> > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
> > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 ++++++++++++
> > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
> > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
> > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
> > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
> > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
> > > > > include/uapi/drm/amdgpu_drm.h | 39 +
> > > > > 12 files changed, 2958 insertions(+), 4 deletions(-)
> > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
> > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
> > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
> > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
> > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
> > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
> > > > >
> > > > >
> > > > > base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
> > > >
> > >
>
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-03-23 6:31 ` Matthew Brost
@ 2026-03-24 7:24 ` Honglei Huang
2026-03-25 22:24 ` Matthew Brost
2026-04-23 6:09 ` Huang, Honglei1
1 sibling, 1 reply; 36+ messages in thread
From: Honglei Huang @ 2026-03-24 7:24 UTC (permalink / raw)
To: Matthew Brost
Cc: Christian König, amd-gfx, dri-devel, Alexander.Deucher,
Felix.Kuehling, Honglei Huang, Oak.Zeng, Jenny-Jing.Liu,
Philip.Yang, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu, Junhua.Shen,
Thomas Hellström, Rodrigo Vivi, Danilo Krummrich, Alice Ryhl
On 3/23/26 14:31, Matthew Brost wrote:
> On Thu, Mar 19, 2026 at 10:17:36PM +0800, Honglei Huang wrote:
>>
>>
>> On 3/19/26 13:08, Matthew Brost wrote:
>>> On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
>>>>
>>>
>>> Disclaimer I haven't look at any code in this series yet.
>>>
>>>>
>>>> On 3/17/26 19:48, Christian König wrote:
>>>>> Adding a few XE and drm_gpuvm people on TO.
>>>>>
>>>>> On 3/17/26 12:29, Honglei Huang wrote:
>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>
>>>>>> This is a POC/draft patch series of SVM feature in amdgpu based on the
>>>>>> drm_gpusvm framework. The primary purpose of this RFC is to validate
>>>>>> the framework's applicability, identify implementation challenges,
>>>>>> and start discussion on framework evolution. This is not a production
>>>
>>> +1. Open to any ideas. Given this was designed originally for Xe we very
>>> well could have missed other drivers requirements.
>> Hi Matt,
>>
>> Thank you for the openness. And thank you so much for the incredibly
>> detailed and patient response. I really appreciate you taking the time to
>> walk through each point.
>>
>
> I'm here to help.
>
>> Actually I am still a learner when it comes to the drm_gpusvm framework and
>> GPU SVM design in general. Some of my descriptions below may not be entirely
>> accurate. But I really want to bring drm_gpusvm into amdgpu and make it work
>> well.
>
> I appreciate another driver jumping in and using this framework—it
> becomes easier to validate as more users adopt it.
>
>>
>>>
>>>>>> ready submission.
>>>>>>
>>>>>> This patch series implements basic SVM support with the following features:
>>>>>>
>>>>>> 1. attributes sepatarated from physical page management:
>>>>>>
>>>>>> - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
>>>>>> tree that stores SVM attributes. Managed through the SET_ATTR,
>>>>>> and mmu notifier callback.
>>>
>>> Can you explain the mmu notifier callback interaction here? See below in
>>> Xe the attribute tree is existing VMA tree (gpuvm).
>>>
>>
>> Let me try to explain, apologies if the description is not fully
>> precise.
>>
>> In current implementation, the MMU notifier callback interacts with the attr
>> tree only in the munmap path remove the corresponding attribute
>> entries from the attr tree so that stale attributes do not persist for
>> freed address space.
>>
>
> Ah, yes. We reset our attributes upon munmap too. We actually don't this
> 100% correct quite either and series in flight to fix [1].
>
> [1] https://patchwork.freedesktop.org/series/161815/
>
I studied [1]. This draft has a similar mechanism to handle attributes
on munmap, but there are some slight differences in detail, maybe
caused by different UMD runtime behaviors.
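For reference, the munmap path in the draft does roughly the following
(structure and field names follow patches 03-04/12 but are simplified here, so
treat it as a sketch):

static void amdgpu_svm_attr_tree_unmap(struct amdgpu_svm_attr_tree *tree,
                                       unsigned long start, unsigned long last)
{
        struct interval_tree_node *node, *next;

        mutex_lock(&tree->lock);
        node = interval_tree_iter_first(&tree->root, start, last);
        while (node) {
                struct amdgpu_svm_attr_range *attr =
                        container_of(node, struct amdgpu_svm_attr_range, it);

                next = interval_tree_iter_next(node, start, last);
                if (node->start >= start && node->last <= last) {
                        /* fully covered by the munmap: drop stale attributes */
                        interval_tree_remove(node, &tree->root);
                        kfree(attr);
                } else {
                        /* partially covered: clamp/split so the attributes of
                         * the still mapped part survive */
                        amdgpu_svm_attr_range_trim(tree, attr, start, last);
                }
                node = next;
        }
        mutex_unlock(&tree->lock);
}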
>>>>>>
>>>>>> - Physical page layer (drm_gpusvm ranges): managed by the
>>>>>> drm_gpusvm framework, representing actual HMM backed DMA
>>>>>> mappings and GPU page table entries.
>>>>>>
>>>>>> This separation is necessary:
>>>>>> - The framework does not support range splitting, so a partial
>>>>>> munmap destroys the entire overlapping range, including the
>>>>>> still valid parts. If attributes were stored inside drm_gpusvm
>>>>>> ranges, they would be lost on unmapping.
>>>>>> The separate attr tree preserves userspace set attributes
>>>>>> across range operations.
>>>
>>> Yes, in Xe the divide is at the VMA level (set by user space) via VM
>>> bind (parts of VM may be mappings BOs, parts could be setup for SVM) or
>>> madvise IOCTLs which reflect user space attributes on current SVM
>>> mappings or future ones.
>>>
>>> The SVM range tree reflects mappings that have been faulted into the
>>> device and contain pages. This is an intentional choice.
>>
>> That makes a lot of sense. Thank you for clarifying the design intent. I
>> think the current adopt the same principle: the drm_gpusvm range tree only
>> reflect actual faulted in mappings.
>>
>>>
>>>>>
>>>>> Isn't that actually intended? When parts of the range unmap then that usually means the whole range isn't valid any more.
>>>
>>>
>>> Yes, this was an intentional design choice to not support partial unmap,
>>> and instead rely on the driver to recreate a new range.
>>>
>>> The reasoning is:
>>>
>>> - In practice, this should be rare for well-behaved applications.
>>>
>>> - With THP / large device pages, if a sub-range is unmapped, the entire
>>> GPU mapping is invalidated anyway due to the page size change. As a
>>> result, the cost of creating a new range is minimal, since the device
>>> will likely fault again on the remaining pages.
>>>
>>> So there is no need to over-engineer the common code.
>>>
>>> FWIW, to even test partial unmaps in Xe, I had to do things I doubt
>>> anyone would ever do:
>>>
>>> ptr = mmap(SZ_2M);
>>> /* fault in memory to the device */
>>> munmap(ptr, SZ_1M);
>>> /* touch memory again on the device */
>>>
>>
>> Thank you for this explanation and the concrete example. After further
>> discussion internally with Christian, we are now aligned with same position
>> partial unmap. Will remove rebuild on partial unmap logic in the next
>> version and handle it as only partially backed range.
>>
>>>>
>>>>
>>>> It is about partial unmap, some subregion in drm_gpusvm_range is still valid
>>>> but some other subregion is invalid, but under drm_gpusvm, need to destroy
>>>> the entire range.
>>>>
>>>> e.g.:
>>>>
>>>> [---------------unmap region in mmu notifier-----------------]
>>>> [0x1000 ------------ 0x9000]
>>>> [ valid ][ invalid ]
>>>>
>>>> see deatil in drm_gpusvm.c:110 line
>>>> section:Partial Unmapping of Ranges
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> - drm_gpusvm range boundaries are determined by fault address
>>>>>> and pre setted chunk size, not by userspace attribute boundaries.
>>>>>> Ranges may be rechunked on memory changes. Embedding
>>>>>> attributes in framework ranges would scatter attr state
>>>>>> across many small ranges and require complex reassemble
>>>>>> logic when operate attrbute.
>>>>>
>>>>> Yeah, that makes a lot of sense.
>>>>>
>>>>>>
>>>>>> 2) System memory mapping via drm_gpusvm
>>>>>>
>>>>>> The core mapping path uses drm_gpusvm_range_find_or_insert() to
>>>>>> create ranges, drm_gpusvm_range_get_pages() for HMM page fault
>>>>>> and DMA mapping, then updates GPU page tables via
>>>>>> amdgpu_vm_update_range().
>>>>>>
>>>>>> 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
>>>>>>
>>>>>> On XNACK off hardware the GPU cannot recover from page faults,
>>>>>> so mappings must be established through ioctl. When
>>>>>> userspace calls SET_ATTR with ACCESS=ENABLE, the driver
>>>>>> walks the attr tree and maps all accessible intervals
>>>>>> to the GPU by amdgpu_svm_range_map_attr_ranges().
>>>
>>> Can you expand on XNACK off / GPU no faults? Is this to the share GPU
>>> between 3D (dma-fences) and faulting clients? We have something similar
>>> in Xe, but it isn't an explicit IOCTL rather we switch between on demand
>>> as 3D client submits and then resumes page faults when all dma-fences
>>> have signaled.
>>>
>>> I see below you mention page tables are modified during quiesce KFD
>>> queues? I'm not sure that is required - you just need to guarnette
>>> faulting clients won't trigger page faults when dma-fence is in flight.
>>>
>>> Maybe give me an explaination of exactly what the requirement from AMD
>>> are here so I have better picture.
>>
>> Thank you for the patience, let me try to explain our situation, though
>> I may not get every detail right.
>>
>> XNACK off means hardware that does not have GPU page fault capability (or
>> turned off)
>>
>> So for these GPUs, ALL page table entries must be fully populated before
>> the GPU can access the memory. This is why we need the ioctl driven
>> mapping path, when userspace calls SET_ATTR with ACCESS=ENABLE, need
>> walk the attribute tree and eagerly map all accessible ranges into the
>> GPU page tables. This is functionally similar to what you describe as
>> prefetch IOCTLs / VM bind in Xe.
>>
>> Regarding queue quiesce during page table modification: on XNACK off
>> hardware, because the GPU cannot fault, we must ensure the GPU is
>> completely stopped before modifying any PTE it might be accessing.
>> Otherwise the GPU could access a partially updated page table and hang.
>> The quiesce/resume is the mechanism to guarantee this.
>>
>> I hope that helps clarify the picture.
>>
>
> This clarifies a lot. This is what we’d call in Xe “preemption fence”
> mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
> resume. We don’t actually support SVM in this case; instead, we use
> “userptr binds,” which are built on gpusvm for page collection. However,
> we don’t support migrating memory to the device—though we could.
>
> I’d look at how we converted 'userptr' to be based on GPU SVM [2]. In
> this case, don’t maintain a range tree, as those—as you suggest—are more
> of an on-demand fault driver concern. Instead, just embed 'struct
> drm_gpusvm_pages' in the VMA struct defined by the IOCTLs..
>
> We could extend this to support migrating 'userptr', but we just haven’t
> done that yet—this may be what you want to do in “XNACK off..
>
> [2] https://patchwork.freedesktop.org/series/146553/
>
Actually we need to switch the XNACK mode between on and off, so in XNACK
off mode the driver operates in an "implicit prefetch mode". This is mainly
for compatibility with older hardware and the needs of the UMD runtime.
We will further discuss the handling of XNACK off internally.
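To make the XNACK off flow concrete, SET_ATTR with access enabled currently
does roughly the following (a simplified sketch of the path behind
amdgpu_svm_range_map_attr_ranges() from patch 07/12; the enum and helper names
here are not the real ones):

static int amdgpu_svm_map_accessible(struct amdgpu_svm *svm,
                                     unsigned long start, unsigned long last)
{
        struct interval_tree_node *node;
        int err = 0;

        /* walk the attribute tree and eagerly map every accessible interval,
         * since the GPU has no way to fault the pages in later */
        for (node = interval_tree_iter_first(&svm->attr_tree.root, start, last);
             node && !err;
             node = interval_tree_iter_next(node, start, last)) {
                struct amdgpu_svm_attr_range *attr =
                        container_of(node, struct amdgpu_svm_attr_range, it);

                if (attr->access != AMDGPU_SVM_ACCESS_ENABLE)   /* simplified enum */
                        continue;

                /* find_or_insert + get_pages + PTE update, using the same
                 * retry loop discussed earlier in the thread */
                err = amdgpu_svm_range_map_interval(svm, node->start,
                                                    node->last + 1);
        }

        return err;
}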
>>
>>>
>>>>>>
>>>>>> 4) Invalidation, GC worker, and restore worker
>>>>>>
>>>>>> MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
>>>>>> three cases based on event type and hardware mode:
>>>>>> - unmap event: clear GPU PTEs in the notifier context,
>>>>>> unmap DMA pages, mark ranges as unmapped, flush TLB,
>>>>>> and enqueue to the GC worker. On XNACK off, also
>>>>>> quiesce KFD queues and schedule rebuild of the
>>>>>> still valid portions that were destroyed together with
>>>>>> the unmapped subregion.
>>>>>>
>>>>>> - evict on XNACK off:
>>>>>> quiesce KFD queues first, then unmap DMA pages and
>>>>>> enqueue to the restore worker.
>>>>>
>>>>> Is that done through the DMA fence or by talking directly to the MES/HWS?
>>>>
>>>> Currently KFD queues quiesce/resume API are reused, lookig forward to a
>>>> better solution.
>>>>
>>>
>>> +1
>>>
>>>> Regards,
>>>> Honglei
>>>>
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> - evict on XNACK on:
>>>>>> clear GPU PTEs, unmap DMA pages, and flush TLB, but do
>>>>>> not schedule any worker. The GPU will fault on next
>>>>>> access and the fault handler establishes the mapping.
>>>>>>
>>>>>> Not supported feature:
>>>>>> - XNACK on GPU page fault mode
>>>>>> - migration and prefetch feature
>>>>>> - Multi GPU support
>>>>>>
>>>>>> XNACK on enablement is ongoing.The GPUs that support XNACK on
>>>>>> are currently only accessible to us via remote lab machines, which slows
>>>>>> down progress.
>>>>>>
>>>>>> Patch overview:
>>>>>>
>>>>>> 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
>>>>>> SET_ATTR/GET_ATTR operations, attribute types, and related
>>>>>> structs in amdgpu_drm.h.
>>>>>>
>>>>>> 02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
>>>>>> refcount, attr_tree, workqueues, locks, and
>>>>>> callbacks (begin/end_restore, flush_tlb).
>>>>>>
>>>>>> 03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
>>>>>> (interval tree node), attr_tree, access enum, flag masks,
>>>>>> and change trigger enum.
>>>>>>
>>>>>> 04/12 Attribute tree operations: interval tree lookup, insert,
>>>>>> remove, and tree create/destroy lifecycle.
>>>>>>
>>>>>> 05/12 Attribute set: validate UAPI attributes, apply to internal
>>>>>> attrs, handle hole/existing range with head/tail splitting,
>>>>>> compute change triggers, and -EAGAIN retry loop.
>>>>>> Implements attr_clear_pages for unmap cleanup and attr_get.
>>>>>>
>>>>>> 06/12 Range data structures: amdgpu_svm_range extending
>>>>>> drm_gpusvm_range with gpu_mapped state, pending ops,
>>>>>> pte_flags cache, and GC/restore queue linkage.
>>>>>>
>>>>>> 07/12 PTE flags and GPU mapping: simple gpu pte function,
>>>>>> GPU page table update with DMA address, range mapping loop:
>>>>>> find_or_insert -> get_pages -> validate -> update PTE,
>>>>>> and attribute change driven mapping function.
>>>>>>
>>>>>> 08/12 Notifier and invalidation: synchronous GPU PTE clear in
>>>>>> notifier context, range removal and overlap cleanup,
>>>>>> rebuild after destroy logic, and MMU event dispatcher
>>>>>>
>>>>>> 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
>>>>>> worker for unmap processing and rebuild, ordered restore
>>>>>> worker for mapping evicted ranges, and flush/sync
>>>>>> helpers.
>>>>>>
>>>>>> 10/12 Initialization and fini: kmem_cache for range/attr,
>>>>>> drm_gpusvm_init with chunk sizes, XNACK detection, TLB
>>>>>> flush helper, and amdgpu_svm init/close/fini lifecycle.
>>>>>>
>>>>>> 11/12 IOCTL and fault handler: PASID based SVM lookup with kref
>>>>>> protection, amdgpu_gem_svm_ioctl dispatcher, and
>>>>>> amdgpu_svm_handle_fault for GPU page fault recovery.
>>>>>>
>>>>>> 12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
>>>>>> Makefile rules, ioctl table registration, and amdgpu_vm
>>>>>> hooks (init in make_compute, close/fini, fault dispatch).
>>>>>>
>>>>>> Test result:
>>>>>> on gfx1100(W7900) and gfx943(MI300x)
>>>>>> kfd test: 95%+ passed, same failed cases with offical relase
>>>>>> rocr test: all passed
>>>>>> hip catch test: 20 cases failed in all 5366 cases, +13 failures vs offical relase
>>>>>>
>>>>>> During implementation we identified several challenges / design questions:
>>>>>>
>>>>>> 1. No range splitting on partial unmap
>>>>>>
>>>>>> drm_gpusvm explicitly does not support range splitting in drm_gpusvm.c:122.
>>>>>> Partial munmap needs to destroy the entire range including the valid interval.
>>>>>> GPU fault driven hardware can handle this design by extra gpu fault handle,
>>>>>> but AMDGPU needs to support XNACK off hardware, this design requires driver
>>>>>> rebuild the valid part in the removed entire range. Whichs bring a very heavy
>>>>>> restore work in work queue/GC worker: unmap/destroy -> rebuild(insert and map)
>>>>>> this restore work even heavier than kfd_svm. In previous driver work queue
>>>>>> only needs to restore or unmap, but in drm_gpusvm driver needs to unmap and restore.
>>>>>> which brings about more complex logic, heavier worker queue workload, and
>>>>>> synchronization issues.
>>>
>>> Is this common in the workload you are running? I'm also wondering if
>>> your restore logic / KFDs design is contributing to this actally the
>>> problem.
>>>
>>
>> Honestly, you raise a fair point.
>>
>> We will redesign the logic about the partial munap, which should eliminate
>> most of this complexity.
>>
>>
>
> +1, yes test but do optimize for.
>
>>>>>>
>>>>>> 2. Fault driven vs ioctl driven mapping
>>>>>>
>>>>>> drm_gpusvm is designed around GPU page fault handlers. The primary entry
>>>>>> point drm_gpusvm_range_find_or_insert() takes a fault_addr.
>>>>>> AMDGPU needs to support IOCTL driven mapping cause No XNACK hardware that
>>>>>> GPU cannot fault at all
>>>
>>> I think we refer to these as prefetch IOCTLs in Xe. Ideally, user space
>>> issues these so the device does not fault (e.g., prefetch creates a set
>>> of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
>>> specific VM bind operations.
>>>
>>
>> That is a very helpful way to think about it. Yes, our ioctl driven
>> mapping(xnack off) is essentially equivalent to a prefetch operation. We are
>> trying to improve it.
>>
>
> See above wrt 'userptr'.
Got it.
>
>>
>>>>>>
>>>>>> The ioctl path cannot hold mmap_read_lock across the entire operation
>>>>>> because drm_gpusvm_range_find_or_insert() acquires/releases it
>>>>>> internally. This creates race windows with MMU notifiers / workers.
>>>
>>> This is a very intentional choice in the locking design: mmap_read_lock
>>> is held only in very specific parts of GPU SVM, and the driver should
>>> never need to take this lock.
>>>
>>> Yes, notifiers can race, which is why the GPU fault handler and prefetch
>>> handler are structured as retry loops when a notifier race is detected.
>>> In practice, with well-behaved applications, these races should be
>>> rare—but they do occur, and the driver must handle them.
>>>
>>> __xe_svm_handle_pagefault implements the page fault retry loop. VM bind
>>> prefetch has similar logic, although it is more spread out given that it
>>> is part of a deeper software pipeline.
>>>
>>> FWIW, holding locks to avoid races was rejected by Sima because we
>>> reasoned it is essentially impossible to guarantee the absence of races
>>> by holding a lock. CPU page fault handlers are also effectively just
>>> large retry loops.
>>>
>>> So this is one point I believe you will need to fixup driver side.
>>>
>>
>> Understood. Thank you for the detailed explanation and for pointing to
>> __xe_svm_handle_pagefault as a reference. We will restructure both our
>> fault handler and ioctl path to a better retry loop pattern with sequence
>> number race detection.
>>
>
> Yes, the typical pattern is:
>
> - Try to migrate once
> - If you hit a race, give up, evict all memory back to system memory, and bind it
>
> Atomics make this tricky because memory must move, but I’m not sure
> “XNACK off” applies here. However, GPU SVM provides a timeslice
> mechanism to ensure the CPU can’t move memory while the GPU needs to
> execute something.
Understood.
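
To make that retry shape concrete, here is a minimal driver-side sketch. The drm_gpusvm entry points are named as in the framework, but their signatures and error conventions are simplified from memory, and amdgpu_svm_update_gpu_ptes() is a hypothetical placeholder rather than a function from this series:

#include <linux/err.h>
#include <drm/drm_gpusvm.h>

/* Hypothetical helper: write GPU PTEs for the collected pages of @range. */
int amdgpu_svm_update_gpu_ptes(struct drm_gpusvm_range *range);

static int amdgpu_svm_fault_retry_sketch(struct drm_gpusvm *gpusvm,
                                         unsigned long fault_addr,
                                         unsigned long start,
                                         unsigned long end)
{
        struct drm_gpusvm_ctx ctx = {};
        struct drm_gpusvm_range *range;
        int err;

retry:
        /* Create (or look up) the range covering the faulting address. */
        range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
                                                start, end, &ctx);
        if (IS_ERR(range))
                return PTR_ERR(range);

        /* Collect and DMA-map CPU pages (hmm_range_fault() underneath). */
        err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
        if (err == -EAGAIN)
                goto retry;     /* raced with an invalidation, collect again */
        if (err)
                return err;

        /* Publish GPU PTEs only while the collected pages are still valid. */
        drm_gpusvm_notifier_lock(gpusvm);
        if (!drm_gpusvm_range_pages_valid(gpusvm, range)) {
                drm_gpusvm_notifier_unlock(gpusvm);
                goto retry;     /* a notifier ran in between, start over */
        }
        err = amdgpu_svm_update_gpu_ptes(range);
        drm_gpusvm_notifier_unlock(gpusvm);

        return err;
}
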
>
>>>>>>
>>>>>> 3. Multi GPU support
>>>>>>
>>>>>> drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
>>>>>> each GPU gets an independent instance with its own range tree, MMU
>>>>>> notifiers, notifier_lock, and DMA mappings.
>>>>>>
>>>
>>> This is a part I am absolutely open to fixing. Right now, each
>>> drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
>>> decoupling a GPU SVM instance from a single device, allowing each
>>> drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
>>> device).
>>>
>>> This would give drivers the flexibility to use one GPU SVM instance per
>>> VM/device instance (as in Xe), or to maintain a single GPU SVM per CPU
>>> MM.
>>>
>>
>> That would be wonderful! Looking forward to your patch very much!
>>
>
> I can't say I'll code this, but we thought about it as an option and are very
> open to patches which refactor the object model for multiple use cases.
Understood. I will focus on single GPU first, and once we have a
solid v1, we'd be happy to explore contributing patches for the
multi-device object model refactoring.
>
>>
>>>>>> This may bring huge overhead:
>>>>>> - N x MMU notifier registrations for the same address range
>>>
>>> The notifier overhead is a real concern. We recently introduced two-pass
>>> notifiers [1] to speed up multi-device notifiers. At least in Xe, the
>>> TLB invalidations—which are the truly expensive part—can be pipelined
>>> using the two-pass approach. Currently, [1] only implements two-pass
>>> notifiers for userptr, but Xe’s GPU SVM will be updated to use them
>>> shortly.
>>>
>>> [1] https://patchwork.freedesktop.org/series/153280/
>>>
>>
>> Thank you for the pointer to two-pass notifiers. Will study this
>> series.
>>
>>>>>> - N x hmm_range_fault() calls for the same page (KFD: 1x)
>>>
>>> hmm_range_fault is extremely fast compared to the actual migration.
>>> Running hmm_range_fault on a 2MB region using 4KB pages takes less
>>> than 1µs. With THP or large device pages [2] (merged last week), it’s
>>> around 1/20 of a microsecond. So I wouldn’t be too concerned about this.
>>>
>>> [2] https://patchwork.freedesktop.org/series/163141/
>>>
>>
>> That is very helpful data. Perhaps we were worrying too much.
>>
>>>>>> - N x DMA mapping memory
>>>
>>> You will always have N x DMA mapping memory if the pages are in system
>>> memory as the dma-mapping API is per device.
>>
>> Totally agreed.
>>
>>>
>>>>>> - N x invalidation + restore worker scheduling per CPU unmap event
>>>>>> - N x GPU page table flush / TLB invalidation
>>>
>>> I agree you do not want to serialize GPU page table flush / TLB
>>> invalidations. Hence two-pass notifiers [1].
>>
>> Yes, will learn it.
>>
>>>
>>>>>> - Increased mmap_lock hold time, N callbacks serialize under it
>>>>>>
>>>>>> compatibility issues:
>>>>>> - Quiesce/resume scope mismatch: to integrate with KFD compute
>>>>>> queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm()
>>>>>> which have process level semantics. Under the per GPU
>>>>>> drm_gpusvm model, there may be some synchronization issues. To properly
>>>>>> integrate with KFD under the per SVM model, a compatibility layer or
>>>>>> new per VM level queue control APIs may need to be introduced.
>>>>>>
>>>
>>> I thought the idea was to get rid of KFD and move over to AMDGPU? I thought
>>> Christian mentioned this to me at XDC.
>>>
>>
>>>>>> Migration challenges:
>>>>>>
>>>>>> - No global migration decision logic: each per GPU SVM
>>>>>> instance maintains its own attribute tree independently. This
>>>>>> allows conflicting settings (e.g., GPU0's SVM sets
>>>>>> PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
>>>>>> for the same address range) with no detection or resolution.
>>>>>> A global attribute coordinator or a shared manager is needed to
>>>>>> provide a unified global view for migration decisions
>>>
>>> Yes, this is a hole in the Xe API too. We have told UMDs that if they set up
>>> individual VMs with conflicting attributes for a single CPU address space,
>>> the behavior is undefined. Our UMD's madvise implementation is basically a loop
>>> over all GPU VMs setting the same attributes.
>>
>> We will follow the same approach for now: the UMD is responsible for setting
>> consistent attributes across GPU VMs.
>>
>
> +1
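
As a concrete illustration of that approach, a rough userspace-side sketch follows. It assumes the UAPI proposed in patch 01 of this series (DRM_IOCTL_AMDGPU_GEM_SVM with AMDGPU_SVM_OP_SET_ATTR); the helper name and the error handling are illustrative only:

#include <stdint.h>
#include <xf86drm.h>
#include "amdgpu_drm.h"        /* UAPI header proposed in patch 01 of this series */

/*
 * Illustrative only: apply the same SVM attributes to every GPU's DRM fd so
 * that no two per-GPU SVM instances end up with conflicting settings for the
 * same CPU address range.
 */
static int svm_set_attrs_all_gpus(const int *drm_fds, int num_gpus,
                                  uint64_t start, uint64_t size,
                                  struct drm_amdgpu_svm_attribute *attrs,
                                  uint32_t nattr)
{
        struct drm_amdgpu_gem_svm args = {
                .start_addr = start,
                .size = size,
                .operation = AMDGPU_SVM_OP_SET_ATTR,
                .nattr = nattr,
                .attrs_ptr = (uint64_t)(uintptr_t)attrs,
        };
        int i, ret;

        for (i = 0; i < num_gpus; i++) {
                ret = drmIoctl(drm_fds[i], DRM_IOCTL_AMDGPU_GEM_SVM, &args);
                if (ret)
                        return ret;     /* ioctl failed; caller decides how to unwind */
        }
        return 0;
}
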
>
>>>
>>>>>>
>>>>>> - migrate_vma_setup broadcast: one GPU's migration triggers MMU
>>>>>> notifier callbacks in ALL N-1 other drm_gpusvm instances,
>>>>>> causing N-1 unnecessary restore workers to be scheduled. And
>>>
>>> My feeling is that you shouldn’t reschedule restore workers unless you
>>> actually have to invalidate page tables (i.e., you have a local SVM
>>> range within the notifier). So the first migration to an untouched
>>> region may trigger notifiers, but they won’t do anything because you
>>> don’t have any valid SVM ranges yet. Subsequent mappings of the migrated
>>> region won’t trigger a notifier unless the memory is moved again.
>>>
>>
>> That is a very good point. We should check whether we actually have
>> valid SVM ranges before scheduling restore workers. If there is nothing
>> to invalidate, the notifier callback should be a no-op. We will review
>> our notifier callback logic to ensure we are not doing unnecessary work
>> here. Thank you for pointing this out.
>>
>>>>>> creates races between the initiating migration and the other
>>>>>> instance's restore attempts.
>>>
>>> Yes, if multiple devices try to migrate the same CPU pages at the same
>>> time, that will race. That’s why in Xe we have a module-level
>>> driver_migrate_lock. The first migration runs in read mode; if it
>>> detects a race and aborts, it then takes driver_migrate_lock in write
>>> mode so it becomes the only device allowed to move memory / CPU pages.
>>> See xe_svm_alloc_vram() for how this is used.
>>>
>>> I’m not sure this approach will work for you, but I just wanted to point
>>> out that we identified this as a potential issue.
>>>
>>
>> Thank you for sharing the driver_migrate_lock approach and pointing to
>> xe_svm_alloc_vram(). Will explore whether a similar lock pattern can work
>> for our case.
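
For reference, a loose sketch of the lock pattern described above; this is a paraphrase of the idea, not the actual Xe code, and all names here are hypothetical:

#include <linux/errno.h>
#include <linux/rwsem.h>

struct svm_range;

static DECLARE_RWSEM(svm_migrate_lock); /* module-level, shared by all GPUs */

/* Hypothetical helper: perform one migration attempt for @range. */
int svm_do_migrate_to_vram(struct svm_range *range);

static int svm_migrate_to_vram_sketch(struct svm_range *range)
{
        int err;

        /* Common case: migrations of different pages may proceed in parallel. */
        down_read(&svm_migrate_lock);
        err = svm_do_migrate_to_vram(range);
        up_read(&svm_migrate_lock);
        if (err != -EAGAIN)
                return err;

        /* Raced with another device: retry as the only allowed migrator. */
        down_write(&svm_migrate_lock);
        err = svm_do_migrate_to_vram(range);
        up_write(&svm_migrate_lock);

        return err;
}
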
>>
>>>>>>
>>>>>> - No cross instance migration serialization: each per GPU
>>>>>> drm_gpusvm instance has independent locking, so two GPUs'
>>>>>> "decide -> migrate -> remap" sequences can interleave. While
>>>>>> the kernel page lock prevents truly simultaneous migration of
>>>>>> the same physical page, the losing side's retry (evict from
>>>>>> other GPU's VRAM -> migrate back) triggers broadcast notifier
>>>>>> invalidations and restore workers, compounding the ping pong
>>>>>> problem above.
>>>>>>
>>>
>>> See the driver_migrate_lock above.
>>
>> Acknowledged, thank you.
>>>
>>>>>> - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
>>>>>> hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
>>>>>> it only selects system memory pages for migration.
>>>>>>
>>>
>>> I think this is fixed? We did find some core MM bugs that blocked VRAM
>>> to VRAM but those have been worked out.
>>>
>>> The code I'm looking at:
>>>
>>> 517 int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
>>> 518 struct mm_struct *mm,
>>> 519 unsigned long start, unsigned long end,
>>> 520 const struct drm_pagemap_migrate_details *mdetails)
>>> 521 {
>>> 522 const struct drm_pagemap_devmem_ops *ops = devmem_allocation->ops;
>>> 523 struct drm_pagemap *dpagemap = devmem_allocation->dpagemap;
>>> 524 struct dev_pagemap *pagemap = dpagemap->pagemap;
>>> 525 struct migrate_vma migrate = {
>>> 526 .start = start,
>>> 527 .end = end,
>>> 528 .pgmap_owner = pagemap->owner,
>>> 529 .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
>>> 530 MIGRATE_VMA_SELECT_DEVICE_PRIVATE | MIGRATE_VMA_SELECT_COMPOUND,
>>> 531 };
>>>
>>
>> Thank you for checking! I am using v6.18 for this POC, missed the fix, will
>> rebase to the latest.
>>
>>
>>>>>> - CPU fault reverse migration race: CPU page fault triggers
>>>>>> migrate_to_ram while GPU instances are concurrently operating.
>>>>>> Per GPU notifier_lock does not protect cross GPU operations.
>>>
>>> No, again retry loop as discussed above.
>>
>> Understood.
>>
>>>
>>>>>>
>>>>>> We believe a strong, well designed solution at the framework level is
>>>>>> needed to properly address these problems, and we look forward to
>>>>>> discussion and suggestions.
>>>
>>> Let's work together to figure out what is missing here.
>>
>> Thank you so much, Matt. Your feedback has been incredibly valuable and
>> has given us a much clearer picture of the framework's design.
>> I really appreciate the effort you put into building drm_gpusvm as a
>> shared framework. We will incorporate your suggestions into our next
>> revision and look forward to continuing the collaboration.
>>
>
> No problem. Happy to help.
Thank you again for all the detailed feedback.
Regards,
Honglei
>
> Matt
>
>> Regards,
>> Honglei
>>
>>
>>>
>>> Matt
>>>
>>>>>>
>>>>>> Honglei Huang (12):
>>>>>> drm/amdgpu: add SVM UAPI definitions
>>>>>> drm/amdgpu: add SVM data structures and header
>>>>>> drm/amdgpu: add SVM attribute data structures
>>>>>> drm/amdgpu: implement SVM attribute tree operations
>>>>>> drm/amdgpu: implement SVM attribute set
>>>>>> drm/amdgpu: add SVM range data structures
>>>>>> drm/amdgpu: implement SVM range PTE flags and GPU mapping
>>>>>> drm/amdgpu: implement SVM range notifier and invalidation
>>>>>> drm/amdgpu: implement SVM range workers
>>>>>> drm/amdgpu: implement SVM core initialization and fini
>>>>>> drm/amdgpu: implement SVM ioctl and fault handler
>>>>>> drm/amdgpu: wire up SVM build system and fault handler
>>>>>>
>>>>>> drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
>>>>>> drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 ++++++++++++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
>>>>>> include/uapi/drm/amdgpu_drm.h | 39 +
>>>>>> 12 files changed, 2958 insertions(+), 4 deletions(-)
>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
>>>>>>
>>>>>>
>>>>>> base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
>>>>>
>>>>
>>
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-03-24 7:24 ` Honglei Huang
@ 2026-03-25 22:24 ` Matthew Brost
2026-03-26 12:16 ` Honglei Huang
0 siblings, 1 reply; 36+ messages in thread
From: Matthew Brost @ 2026-03-25 22:24 UTC (permalink / raw)
To: Honglei Huang
Cc: Christian König, amd-gfx, dri-devel, Alexander.Deucher,
Felix.Kuehling, Honglei Huang, Oak.Zeng, Jenny-Jing.Liu,
Philip.Yang, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu, Junhua.Shen,
Thomas Hellström, Rodrigo Vivi, Danilo Krummrich, Alice Ryhl
On Tue, Mar 24, 2026 at 03:24:43PM +0800, Honglei Huang wrote:
>
>
> On 3/23/26 14:31, Matthew Brost wrote:
> > On Thu, Mar 19, 2026 at 10:17:36PM +0800, Honglei Huang wrote:
> > >
> > >
> > > On 3/19/26 13:08, Matthew Brost wrote:
> > > > On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
> > > > >
> > > >
> > > > Disclaimer I haven't look at any code in this series yet.
> > > >
> > > > >
> > > > > On 3/17/26 19:48, Christian König wrote:
> > > > > > Adding a few XE and drm_gpuvm people on TO.
> > > > > >
> > > > > > On 3/17/26 12:29, Honglei Huang wrote:
> > > > > > > From: Honglei Huang <honghuan@amd.com>
> > > > > > >
> > > > > > > This is a POC/draft patch series of SVM feature in amdgpu based on the
> > > > > > > drm_gpusvm framework. The primary purpose of this RFC is to validate
> > > > > > > the framework's applicability, identify implementation challenges,
> > > > > > > and start discussion on framework evolution. This is not a production
> > > >
> > > > +1. Open to any ideas. Given this was designed originally for Xe we very
> > > > well could have missed other drivers requirements.
> > > Hi Matt,
> > >
> > > Thank you for the openness. And thank you so much for the incredibly
> > > detailed and patient response. I really appreciate you taking the time to
> > > walk through each point.
> > >
> >
> > I'm here to help.
> >
> > > Actually I am still a learner when it comes to the drm_gpusvm framework and
> > > GPU SVM design in general. Some of my descriptions below may not be entirely
> > > accurate. But I really want to bring drm_gpusvm into amdgpu and make it work
> > > well.
> >
> > I appreciate another driver jumping in and using this framework—it
> > becomes easier to validate as more users adopt it.
> >
> > >
> > > >
> > > > > > > ready submission.
> > > > > > >
> > > > > > > This patch series implements basic SVM support with the following features:
> > > > > > >
> > > > > > > 1. attributes sepatarated from physical page management:
> > > > > > >
> > > > > > > - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
> > > > > > > tree that stores SVM attributes. Managed through the SET_ATTR,
> > > > > > > and mmu notifier callback.
> > > >
> > > > Can you explain the mmu notifier callback interaction here? See below in
> > > > Xe the attribute tree is existing VMA tree (gpuvm).
> > > >
> > >
> > > Let me try to explain, apologies if the description is not fully
> > > precise.
> > >
> > > In the current implementation, the MMU notifier callback interacts with the attr
> > > tree only in the munmap path: it removes the corresponding attribute
> > > entries from the attr tree so that stale attributes do not persist for
> > > freed address space.
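
For illustration, a rough sketch of that munmap-path cleanup using the kernel's generic interval tree; the svm_attr_range wrapper is hypothetical, and the locking and the head/tail splitting needed for partial overlaps are omitted:

#include <linux/container_of.h>
#include <linux/interval_tree.h>
#include <linux/slab.h>

struct svm_attr_range {
        struct interval_tree_node it;   /* it.start / it.last, inclusive */
        /* ... per-range attributes ... */
};

static void svm_attr_tree_unmap(struct rb_root_cached *tree,
                                unsigned long start, unsigned long last)
{
        struct interval_tree_node *node, *next;

        node = interval_tree_iter_first(tree, start, last);
        while (node) {
                next = interval_tree_iter_next(node, start, last);
                /* Entry fully covered by the unmapped region: drop it. */
                if (node->start >= start && node->last <= last) {
                        interval_tree_remove(node, tree);
                        kfree(container_of(node, struct svm_attr_range, it));
                }
                node = next;
        }
}
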
> > >
> >
> > Ah, yes. We reset our attributes upon munmap too. We actually don't do this
> > 100% correctly either, and there is a series in flight to fix it [1].
> >
> > [1] https://patchwork.freedesktop.org/series/161815/
> >
>
> I studied [1]. This draft has a similar mechanism to handle attributes on
> munmap. But there are some slight differences in detail, maybe caused by
> different UMD runtime behaviors.
>
>
> > > > > > >
> > > > > > > - Physical page layer (drm_gpusvm ranges): managed by the
> > > > > > > drm_gpusvm framework, representing actual HMM backed DMA
> > > > > > > mappings and GPU page table entries.
> > > > > > >
> > > > > > > This separation is necessary:
> > > > > > > - The framework does not support range splitting, so a partial
> > > > > > > munmap destroys the entire overlapping range, including the
> > > > > > > still valid parts. If attributes were stored inside drm_gpusvm
> > > > > > > ranges, they would be lost on unmapping.
> > > > > > > The separate attr tree preserves userspace set attributes
> > > > > > > across range operations.
> > > >
> > > > Yes, in Xe the divide is at the VMA level (set by user space) via VM
> > > > bind (parts of VM may be mappings BOs, parts could be setup for SVM) or
> > > > madvise IOCTLs which reflect user space attributes on current SVM
> > > > mappings or future ones.
> > > >
> > > > The SVM range tree reflects mappings that have been faulted into the
> > > > device and contain pages. This is an intentional choice.
> > >
> > > That makes a lot of sense. Thank you for clarifying the design intent. I
> > > think the current implementation adopts the same principle: the drm_gpusvm range
> > > tree only reflects actual faulted-in mappings.
> > >
> > > >
> > > > > >
> > > > > > Isn't that actually intended? When parts of the range unmap then that usually means the whole range isn't valid any more.
> > > >
> > > >
> > > > Yes, this was an intentional design choice to not support partial unmap,
> > > > and instead rely on the driver to recreate a new range.
> > > >
> > > > The reasoning is:
> > > >
> > > > - In practice, this should be rare for well-behaved applications.
> > > >
> > > > - With THP / large device pages, if a sub-range is unmapped, the entire
> > > > GPU mapping is invalidated anyway due to the page size change. As a
> > > > result, the cost of creating a new range is minimal, since the device
> > > > will likely fault again on the remaining pages.
> > > >
> > > > So there is no need to over-engineer the common code.
> > > >
> > > > FWIW, to even test partial unmaps in Xe, I had to do things I doubt
> > > > anyone would ever do:
> > > >
> > > > ptr = mmap(SZ_2M);
> > > > /* fault in memory to the device */
> > > > munmap(ptr, SZ_1M);
> > > > /* touch memory again on the device */
> > > >
> > >
> > > Thank you for this explanation and the concrete example. After further
> > > discussion internally with Christian, we are now aligned on the same position
> > > regarding partial unmap. We will remove the rebuild-on-partial-unmap logic in the
> > > next version and handle it as an only partially backed range.
> > >
> > > > >
> > > > >
> > > > > It is about partial unmap, some subregion in drm_gpusvm_range is still valid
> > > > > but some other subregion is invalid, but under drm_gpusvm, need to destroy
> > > > > the entire range.
> > > > >
> > > > > e.g.:
> > > > >
> > > > > [---------------unmap region in mmu notifier-----------------]
> > > > > [0x1000 ------------ 0x9000]
> > > > > [ valid ][ invalid ]
> > > > >
> > > > > see deatil in drm_gpusvm.c:110 line
> > > > > section:Partial Unmapping of Ranges
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > - drm_gpusvm range boundaries are determined by fault address
> > > > > > > and pre setted chunk size, not by userspace attribute boundaries.
> > > > > > > Ranges may be rechunked on memory changes. Embedding
> > > > > > > attributes in framework ranges would scatter attr state
> > > > > > > across many small ranges and require complex reassemble
> > > > > > > logic when operate attrbute.
> > > > > >
> > > > > > Yeah, that makes a lot of sense.
> > > > > >
> > > > > > >
> > > > > > > 2) System memory mapping via drm_gpusvm
> > > > > > >
> > > > > > > The core mapping path uses drm_gpusvm_range_find_or_insert() to
> > > > > > > create ranges, drm_gpusvm_range_get_pages() for HMM page fault
> > > > > > > and DMA mapping, then updates GPU page tables via
> > > > > > > amdgpu_vm_update_range().
> > > > > > >
> > > > > > > 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
> > > > > > >
> > > > > > > On XNACK off hardware the GPU cannot recover from page faults,
> > > > > > > so mappings must be established through ioctl. When
> > > > > > > userspace calls SET_ATTR with ACCESS=ENABLE, the driver
> > > > > > > walks the attr tree and maps all accessible intervals
> > > > > > > to the GPU by amdgpu_svm_range_map_attr_ranges().
> > > >
> > > > Can you expand on XNACK off / GPU no faults? Is this to share the GPU
> > > > between 3D (dma-fences) and faulting clients? We have something similar
> > > > in Xe, but it isn't an explicit IOCTL; rather, we switch on demand
> > > > as a 3D client submits and then resume page faults when all dma-fences
> > > > have signaled.
> > > >
> > > > I see below you mention page tables are modified while KFD queues are
> > > > quiesced? I'm not sure that is required - you just need to guarantee
> > > > faulting clients won't trigger page faults while a dma-fence is in flight.
> > > >
> > > > Maybe give me an explanation of exactly what the requirements from AMD
> > > > are here so I have a better picture.
> > >
> > > Thank you for the patience, let me try to explain our situation, though
> > > I may not get every detail right.
> > >
> > > XNACK off means hardware that does not have GPU page fault capability (or
> > > has it turned off).
> > >
> > > So for these GPUs, ALL page table entries must be fully populated before
> > > the GPU can access the memory. This is why we need the ioctl driven
> > > mapping path: when userspace calls SET_ATTR with ACCESS=ENABLE, we need to
> > > walk the attribute tree and eagerly map all accessible ranges into the
> > > GPU page tables. This is functionally similar to what you describe as
> > > prefetch IOCTLs / VM bind in Xe.
> > >
> > > Regarding queue quiesce during page table modification: on XNACK off
> > > hardware, because the GPU cannot fault, we must ensure the GPU is
> > > completely stopped before modifying any PTE it might be accessing.
> > > Otherwise the GPU could access a partially updated page table and hang.
> > > The quiesce/resume is the mechanism to guarantee this.
> > >
> > > I hope that helps clarify the picture.
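
A minimal sketch of what that eager, ioctl driven population could look like on top of the same framework entry points; signatures are simplified, amdgpu_svm_update_gpu_ptes() is a hypothetical placeholder, and the retry/race handling follows the fault-path sketch shown earlier in the thread:

#include <linux/err.h>
#include <drm/drm_gpusvm.h>

int amdgpu_svm_update_gpu_ptes(struct drm_gpusvm_range *range);        /* hypothetical */

static int amdgpu_svm_prefetch_interval_sketch(struct drm_gpusvm *gpusvm,
                                               unsigned long start,
                                               unsigned long end)
{
        struct drm_gpusvm_ctx ctx = {};
        unsigned long addr = start;

        while (addr < end) {
                struct drm_gpusvm_range *range;
                int err;

                /* Reuse the fault-path entry point, treating @addr as the "fault". */
                range = drm_gpusvm_range_find_or_insert(gpusvm, addr, start, end, &ctx);
                if (IS_ERR(range))
                        return PTR_ERR(range);

                err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
                if (err)
                        return err;

                err = amdgpu_svm_update_gpu_ptes(range);
                if (err)
                        return err;

                addr = drm_gpusvm_range_end(range);     /* advance past this chunk */
        }

        return 0;
}
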
> > >
> >
> > This clarifies a lot. This is what we’d call in Xe “preemption fence”
> > mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
> > resume. We don’t actually support SVM in this case; instead, we use
> > “userptr binds,” which are built on gpusvm for page collection. However,
> > we don’t support migrating memory to the device—though we could.
> >
> > I’d look at how we converted 'userptr' to be based on GPU SVM [2]. In
> > this case, don’t maintain a range tree, as those—as you suggest—are more
> > of an on-demand fault driver concern. Instead, just embed 'struct
> > drm_gpusvm_pages' in the VMA struct defined by the IOCTLs.
> >
> > We could extend this to support migrating 'userptr', but we just haven’t
> > done that yet—this may be what you want to do in “XNACK off”.
> >
> > [2] https://patchwork.freedesktop.org/series/146553/
> >
>
> Actually we need to switch the XNACK mode between on and off, so in XNACK off
> mode the driver operates in "implicit prefetch mode". This may be due to
> compatibility with older hardware and the needs of the UMD runtime. We will
> further discuss the handling method under XNACK off internally.
>
>
> > >
> > > >
> > > > > > >
> > > > > > > 4) Invalidation, GC worker, and restore worker
> > > > > > >
> > > > > > > MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
> > > > > > > three cases based on event type and hardware mode:
> > > > > > > - unmap event: clear GPU PTEs in the notifier context,
> > > > > > > unmap DMA pages, mark ranges as unmapped, flush TLB,
> > > > > > > and enqueue to the GC worker. On XNACK off, also
> > > > > > > quiesce KFD queues and schedule rebuild of the
> > > > > > > still valid portions that were destroyed together with
> > > > > > > the unmapped subregion.
> > > > > > >
> > > > > > > - evict on XNACK off:
> > > > > > > quiesce KFD queues first, then unmap DMA pages and
> > > > > > > enqueue to the restore worker.
> > > > > >
> > > > > > Is that done through the DMA fence or by talking directly to the MES/HWS?
> > > > >
> > > > > Currently KFD queues quiesce/resume API are reused, lookig forward to a
> > > > > better solution.
> > > > >
> > > >
> > > > +1
> > > >
> > > > > Regards,
> > > > > Honglei
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Christian.
> > > > > >
> > > > > > >
> > > > > > > - evict on XNACK on:
> > > > > > > clear GPU PTEs, unmap DMA pages, and flush TLB, but do
> > > > > > > not schedule any worker. The GPU will fault on next
> > > > > > > access and the fault handler establishes the mapping.
> > > > > > >
> > > > > > > Not supported feature:
> > > > > > > - XNACK on GPU page fault mode
> > > > > > > - migration and prefetch feature
> > > > > > > - Multi GPU support
> > > > > > >
> > > > > > > XNACK on enablement is ongoing.The GPUs that support XNACK on
> > > > > > > are currently only accessible to us via remote lab machines, which slows
> > > > > > > down progress.
> > > > > > >
> > > > > > > Patch overview:
> > > > > > >
> > > > > > > 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
> > > > > > > SET_ATTR/GET_ATTR operations, attribute types, and related
> > > > > > > structs in amdgpu_drm.h.
> > > > > > >
> > > > > > > 02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
> > > > > > > refcount, attr_tree, workqueues, locks, and
> > > > > > > callbacks (begin/end_restore, flush_tlb).
> > > > > > >
> > > > > > > 03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
> > > > > > > (interval tree node), attr_tree, access enum, flag masks,
> > > > > > > and change trigger enum.
> > > > > > >
> > > > > > > 04/12 Attribute tree operations: interval tree lookup, insert,
> > > > > > > remove, and tree create/destroy lifecycle.
> > > > > > >
> > > > > > > 05/12 Attribute set: validate UAPI attributes, apply to internal
> > > > > > > attrs, handle hole/existing range with head/tail splitting,
> > > > > > > compute change triggers, and -EAGAIN retry loop.
> > > > > > > Implements attr_clear_pages for unmap cleanup and attr_get.
> > > > > > >
> > > > > > > 06/12 Range data structures: amdgpu_svm_range extending
> > > > > > > drm_gpusvm_range with gpu_mapped state, pending ops,
> > > > > > > pte_flags cache, and GC/restore queue linkage.
> > > > > > >
> > > > > > > 07/12 PTE flags and GPU mapping: simple gpu pte function,
> > > > > > > GPU page table update with DMA address, range mapping loop:
> > > > > > > find_or_insert -> get_pages -> validate -> update PTE,
> > > > > > > and attribute change driven mapping function.
> > > > > > >
> > > > > > > 08/12 Notifier and invalidation: synchronous GPU PTE clear in
> > > > > > > notifier context, range removal and overlap cleanup,
> > > > > > > rebuild after destroy logic, and MMU event dispatcher
> > > > > > >
> > > > > > > 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
> > > > > > > worker for unmap processing and rebuild, ordered restore
> > > > > > > worker for mapping evicted ranges, and flush/sync
> > > > > > > helpers.
> > > > > > >
> > > > > > > 10/12 Initialization and fini: kmem_cache for range/attr,
> > > > > > > drm_gpusvm_init with chunk sizes, XNACK detection, TLB
> > > > > > > flush helper, and amdgpu_svm init/close/fini lifecycle.
> > > > > > >
> > > > > > > 11/12 IOCTL and fault handler: PASID based SVM lookup with kref
> > > > > > > protection, amdgpu_gem_svm_ioctl dispatcher, and
> > > > > > > amdgpu_svm_handle_fault for GPU page fault recovery.
> > > > > > >
> > > > > > > 12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
> > > > > > > Makefile rules, ioctl table registration, and amdgpu_vm
> > > > > > > hooks (init in make_compute, close/fini, fault dispatch).
> > > > > > >
> > > > > > > Test result:
> > > > > > > on gfx1100(W7900) and gfx943(MI300x)
> > > > > > > kfd test: 95%+ passed, same failed cases with offical relase
> > > > > > > rocr test: all passed
> > > > > > > hip catch test: 20 cases failed in all 5366 cases, +13 failures vs offical relase
> > > > > > >
> > > > > > > During implementation we identified several challenges / design questions:
> > > > > > >
> > > > > > > 1. No range splitting on partial unmap
> > > > > > >
> > > > > > > drm_gpusvm explicitly does not support range splitting in drm_gpusvm.c:122.
> > > > > > > Partial munmap needs to destroy the entire range including the valid interval.
> > > > > > > GPU fault driven hardware can handle this design by extra gpu fault handle,
> > > > > > > but AMDGPU needs to support XNACK off hardware, this design requires driver
> > > > > > > rebuild the valid part in the removed entire range. Whichs bring a very heavy
> > > > > > > restore work in work queue/GC worker: unmap/destroy -> rebuild(insert and map)
> > > > > > > this restore work even heavier than kfd_svm. In previous driver work queue
> > > > > > > only needs to restore or unmap, but in drm_gpusvm driver needs to unmap and restore.
> > > > > > > which brings about more complex logic, heavier worker queue workload, and
> > > > > > > synchronization issues.
> > > >
> > > > Is this common in the workload you are running? I'm also wondering if
> > > > your restore logic / KFDs design is contributing to this actally the
> > > > problem.
> > > >
> > >
> > > Honestly, you raise a fair point.
> > >
> > > We will redesign the logic about the partial munap, which should eliminate
> > > most of this complexity.
> > >
> > >
> >
> > +1, yes test but do optimize for.
> >
> > > > > > >
> > > > > > > 2. Fault driven vs ioctl driven mapping
> > > > > > >
> > > > > > > drm_gpusvm is designed around GPU page fault handlers. The primary entry
> > > > > > > point drm_gpusvm_range_find_or_insert() takes a fault_addr.
> > > > > > > AMDGPU needs to support IOCTL driven mapping cause No XNACK hardware that
> > > > > > > GPU cannot fault at all
> > > >
> > > > I think we refer to these as prefetch IOCTLs in Xe. Ideally, user space
> > > > issues these so the device does not fault (e.g., prefetch creates a set
> > > > of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
> > > > specific VM bind operations.
> > > >
> > >
> > > That is a very helpful way to think about it. Yes, our ioctl driven
> > > mapping(xnack off) is essentially equivalent to a prefetch operation. We are
> > > trying to improve it.
> > >
> >
> > See above wrt 'userptr'.
>
> Got it.
>
> >
> > >
> > > > > > >
> > > > > > > The ioctl path cannot hold mmap_read_lock across the entire operation
> > > > > > > because drm_gpusvm_range_find_or_insert() acquires/releases it
> > > > > > > internally. This creates race windows with MMU notifiers / workers.
> > > >
> > > > This is a very intentional choice in the locking design: mmap_read_lock
> > > > is held only in very specific parts of GPU SVM, and the driver should
> > > > never need to take this lock.
> > > >
> > > > Yes, notifiers can race, which is why the GPU fault handler and prefetch
> > > > handler are structured as retry loops when a notifier race is detected.
> > > > In practice, with well-behaved applications, these races should be
> > > > rare—but they do occur, and the driver must handle them.
> > > >
> > > > __xe_svm_handle_pagefault implements the page fault retry loop. VM bind
> > > > prefetch has similar logic, although it is more spread out given that it
> > > > is part of a deeper software pipeline.
> > > >
> > > > FWIW, holding locks to avoid races was rejected by Sima because we
> > > > reasoned it is essentially impossible to guarantee the absence of races
> > > > by holding a lock. CPU page fault handlers are also effectively just
> > > > large retry loops.
> > > >
> > > > So this is one point I believe you will need to fixup driver side.
> > > >
> > >
> > > Understood. Thank you for the detailed explanation and for pointing to
> > > __xe_svm_handle_pagefault as a reference. We will restructure both our
> > > fault handler and ioctl path to a betterretry loop pattern with sequence
> > > number race detection.
> > >
> >
> > Yes, the typical pattern is:
> >
> > - Try to migrate once
> > - If you hit a race, give up, evict all memory back to system memory, and bind it
> >
> > Atomics make this tricky because memory must move, but I’m not sure
> > “XNACK off” applies here. However, GPU SVM provides a timeslice
> > mechanism to ensure the CPU can’t move memory while the GPU needs to
> > execute something.
>
> Understood.
>
> >
> > > > > > >
> > > > > > > 3. Multi GPU support
> > > > > > >
> > > > > > > drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
> > > > > > > each GPU gets an independent instance with its own range tree, MMU
> > > > > > > notifiers, notifier_lock, and DMA mappings.
> > > > > > >
> > > >
> > > > This is a part I am absolutely open to fixing. Right now, each
> > > > drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
> > > > decoupling a GPU SVM instance from a single device, allowing each
> > > > drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
> > > > device).
> > > >
> > > > This would give drivers the flexibility to use one GPU SVM instance per
> > > > VM/device instance (as in Xe), or to maintain a single GPU SVM per CPU
> > > > MM.
> > > >
> > >
> > > That would be wonderful! Looking forward to your patch very much!
> > >
> >
> > I can't say I'll code this but we thought about is as options and very
> > open patches which refactor the object model for multiple use cases.
>
> Understood. I will focus on single GPU first, and once we have a
> solid v1, we'd be happy to explore contributing patches for the
> multi-device object model refactoring.
>
I think roughly what would need to be done is:
- Move struct drm_gpusvm_pages out of struct drm_gpusvm_range.
- Embed either a struct device or a struct drm_device in struct
drm_gpusvm_pages.
- Drop struct drm_device from struct drm_gpusvm.
- Have the driver’s range structure embed one or more struct
drm_gpusvm_pages in addition to struct drm_gpusvm_range.
- Refactor a few range-based helpers (drm_gpusvm_range_pages_valid,
drm_gpusvm_range_get_pages, drm_gpusvm_range_unmap_pages), or simply
drop them entirely and update drivers to use the drm_gpusvm_pages
helpers instead.
Then it is up to the drivers whether struct drm_gpusvm maps to a single
device or multiple devices. Either use case seems valid, and giving
drivers the option appears to be the right approach, rather than having
the common drm_gpusvm layer impose its own constraints.
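
To make the direction concrete, a very rough sketch of what a driver-side range could look like under such a decoupled model; this is illustrative only, not a proposed API, and MAX_SVM_GPUS and the field names are made up:

#include <drm/drm_device.h>
#include <drm/drm_gpusvm.h>

#define MAX_SVM_GPUS    8       /* made-up bound for the sketch */

struct amdgpu_svm_range_multi {
        struct drm_gpusvm_range base;   /* CPU address interval tracked by GPU SVM */

        /* One page collection per device that has faulted this range in. */
        struct {
                struct drm_device *drm;         /* device owning the DMA mappings */
                struct drm_gpusvm_pages pages;  /* per-device page/DMA state */
        } dev[MAX_SVM_GPUS];
        unsigned int num_devs;
};
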
This type of refactor can be done at any time as an independent patch,
so feel free to post it whenever and I can verify on the Xe side that
everything looks good.
Matt
> >
> > >
> > > > > > > This may brings huge overhead:
> > > > > > > - N x MMU notifier registrations for the same address range
> > > >
> > > > The notifier overhead is a real concern. We recently introduced two-pass
> > > > notifiers [1] to speed up multi-device notifiers. At least in Xe, the
> > > > TLB invalidations—which are the truly expensive part—can be pipelined
> > > > using the two=pass approach. Currently, [1] only implements two-pass
> > > > notifiers for userptr, but Xe’s GPU SVM will be updated to use them
> > > > shortly.
> > > >
> > > > [1] https://patchwork.freedesktop.org/series/153280/
> > > >
> > >
> > > Thank you for the pointer to two-pass notifiers. Will study this
> > > series.
> > >
> > > > > > > - N x hmm_range_fault() calls for the same page (KFD: 1x)
> > > >
> > > > hmm_range_fault is extremely fast compared to the actual migration.
> > > > Running hmm_range_fault on a 2MB region using 4KB pages takes less
> > > > than 1µs. With THP or large device pages [2] (merged last week), it’s
> > > > around 1/20 of a microsecond. So I wouldn’t be too concerned about this.
> > > >
> > > > [2] https://patchwork.freedesktop.org/series/163141/
> > > >
> > >
> > > That is very helpful data. Perhaps worry too much.
> > >
> > > > > > > - N x DMA mapping memory
> > > >
> > > > You will always have N x DMA mapping memory if the pages are in system
> > > > memory as the dma-mapping API is per device.
> > >
> > > Totally agreed.
> > >
> > > >
> > > > > > > - N x invalidation + restore worker scheduling per CPU unmap event
> > > > > > > - N x GPU page table flush / TLB invalidation
> > > >
> > > > I agree you do not want serialize GPU page table flush / TLB
> > > > invalidations. Hence two-pass notifiers [1].
> > >
> > > Yes, will learn it.
> > >
> > > >
> > > > > > > - Increased mmap_lock hold time, N callbacks serialize under it
> > > > > > >
> > > > > > > compatibility issues:
> > > > > > > - Quiesce/resume scope mismatch: to integrate with KFD compute
> > > > > > > queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm()
> > > > > > > which have process level semantics. Under the per GPU
> > > > > > > drm_gpusvm model, maybe there are some issues on sync. To properly
> > > > > > > integrate with KFD under the per SVM model, a compatibility or
> > > > > > > new per VM level queue control APIs maybe need to introduced.
> > > > > > >
> > > >
> > > > I thought the idea to get rid of KFD and move over to AMDGPU? I thought
> > > > Christian mentioned this to me at XDC.
> > > >
> > >
> > > > > > > Migration challenges:
> > > > > > >
> > > > > > > - No global migration decision logic: each per GPU SVM
> > > > > > > instance maintains its own attribute tree independently. This
> > > > > > > allows conflicting settings (e.g., GPU0's SVM sets
> > > > > > > PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
> > > > > > > for the same address range) with no detection or resolution.
> > > > > > > A global attribute coordinator or a shared manager is needed to
> > > > > > > provide a unified global view for migration decisions
> > > >
> > > > Yes, this is hole in the Xe API too. We have told UMDs if they setup
> > > > individual VMs with conflict attributes for a single CPU address space
> > > > the behavior is undefined. Our UMD implement madvise is basically loop
> > > > over al GPU VMs setting the same attributes.
> > >
> > > Will follow the same approach for now, the UMD is responsible for setting
> > > consistent attributes across GPU VMs.
> > >
> >
> > +1
> >
> > > >
> > > > > > >
> > > > > > > - migrate_vma_setup broadcast: one GPU's migration triggers MMU
> > > > > > > notifier callbacks in ALL N-1 other drm_gpusvm instances,
> > > > > > > causing N-1 unnecessary restore workers to be scheduled. And
> > > >
> > > > My feeling is that you shouldn’t reschedule restore workers unless you
> > > > actually have to invalidate page tables (i.e., you have a local SVM
> > > > range within the notifier). So the first migration to an untouched
> > > > region may trigger notifiers, but they won’t do anything because you
> > > > don’t have any valid SVM ranges yet. Subsequent mappings of the migrated
> > > > region won’t trigger a notifier unless the memory is moved again.
> > > >
> > >
> > > That is a very good point. We should check whether we actually have
> > > valid SVM ranges before scheduling restore workers. If there is nothing
> > > to invalidate, the notifier callback should be a no-op. We will review
> > > our notifier callback logic to ensure we are not doing unnecessary work
> > > here. Thank you for pointing this out.
> > >
> > > > > > > creates races between the initiating migration and the other
> > > > > > > instance's restore attempts.
> > > >
> > > > Yes, if multiple devices try to migrate the same CPU pages at the same
> > > > time, that will race. That’s why in Xe we have a module-level
> > > > driver_migrate_lock. The first migration runs in read mode; if it
> > > > detects a race and aborts, it then takes driver_migrate_lock in write
> > > > mode so it becomes the only device allowed to move memory / CPU pages.
> > > > See xe_svm_alloc_vram() for how this is used.
> > > >
> > > > I’m not sure this approach will work for you, but I just wanted to point
> > > > out that we identified this as a potential issue.
> > > >
> > >
> > > Thank you for sharing the driver_migrate_lock approach and pointing to
> > > xe_svm_alloc_vram(). Will explore whether a similar lock pattern can work
> > > for our case.
> > >
> > > > > > >
> > > > > > > - No cross instance migration serialization: each per GPU
> > > > > > > drm_gpusvm instance has independent locking, so two GPUs'
> > > > > > > "decide -> migrate -> remap" sequences can interleave. While
> > > > > > > the kernel page lock prevents truly simultaneous migration of
> > > > > > > the same physical page, the losing side's retry (evict from
> > > > > > > other GPU's VRAM -> migrate back) triggers broadcast notifier
> > > > > > > invalidations and restore workers, compounding the ping pong
> > > > > > > problem above.
> > > > > > >
> > > >
> > > > See the driver_migrate_lock above.
> > >
> > > Acknowledged, thank you.
> > > >
> > > > > > > - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
> > > > > > > hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
> > > > > > > it only selects system memory pages for migration.
> > > > > > >
> > > >
> > > > I think this is fixed? We did find some core MM bugs that blocked VRAM
> > > > to VRAM but those have been worked out.
> > > >
> > > > The code I'm looking at:
> > > >
> > > > 517 int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
> > > > 518 struct mm_struct *mm,
> > > > 519 unsigned long start, unsigned long end,
> > > > 520 const struct drm_pagemap_migrate_details *mdetails)
> > > > 521 {
> > > > 522 const struct drm_pagemap_devmem_ops *ops = devmem_allocation->ops;
> > > > 523 struct drm_pagemap *dpagemap = devmem_allocation->dpagemap;
> > > > 524 struct dev_pagemap *pagemap = dpagemap->pagemap;
> > > > 525 struct migrate_vma migrate = {
> > > > 526 .start = start,
> > > > 527 .end = end,
> > > > 528 .pgmap_owner = pagemap->owner,
> > > > 529 .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
> > > > 530 MIGRATE_VMA_SELECT_DEVICE_PRIVATE | MIGRATE_VMA_SELECT_COMPOUND,
> > > > 531 };
> > > >
> > >
> > > Thank you for checking! I am using v6.18 for this POC, missed the fix, will
> > > rebase to the latest.
> > >
> > >
> > > > > > > - CPU fault reverse migration race: CPU page fault triggers
> > > > > > > migrate_to_ram while GPU instances are concurrently operating.
> > > > > > > Per GPU notifier_lock does not protect cross GPU operations.
> > > >
> > > > No, again retry loop as discussed above.
> > >
> > > Understood.
> > >
> > > >
> > > > > > >
> > > > > > > We believe a strong, well designed solution at the framework level is
> > > > > > > needed to properly address these problems, and we look forward to
> > > > > > > discussion and suggestions.
> > > >
> > > > Let's work together to figure out what is missing here.
> > >
> > > Thank you so much, Matt. Your feedback has been incredibly valuable and
> > > has given us a much clearer picture of the framework's design.
> > > Ireally appreciate the effort you put into building drm_gpusvm as a
> > > shared framework. Will incorporate your suggestions into our next
> > > revision and look forward to continuing the collaboration.
> > >
> >
> > No problem. Happy to help.
>
> Thank you again for all the detailed feedback.
>
> Regards,
> Honglei
>
> >
> > Matt
> >
> > > Regards,
> > > Honglei
> > >
> > >
> > > >
> > > > Matt
> > > >
> > > > > > >
> > > > > > > Honglei Huang (12):
> > > > > > > drm/amdgpu: add SVM UAPI definitions
> > > > > > > drm/amdgpu: add SVM data structures and header
> > > > > > > drm/amdgpu: add SVM attribute data structures
> > > > > > > drm/amdgpu: implement SVM attribute tree operations
> > > > > > > drm/amdgpu: implement SVM attribute set
> > > > > > > drm/amdgpu: add SVM range data structures
> > > > > > > drm/amdgpu: implement SVM range PTE flags and GPU mapping
> > > > > > > drm/amdgpu: implement SVM range notifier and invalidation
> > > > > > > drm/amdgpu: implement SVM range workers
> > > > > > > drm/amdgpu: implement SVM core initialization and fini
> > > > > > > drm/amdgpu: implement SVM ioctl and fault handler
> > > > > > > drm/amdgpu: wire up SVM build system and fault handler
> > > > > > >
> > > > > > > drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
> > > > > > > drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 ++++++++++++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
> > > > > > > include/uapi/drm/amdgpu_drm.h | 39 +
> > > > > > > 12 files changed, 2958 insertions(+), 4 deletions(-)
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
> > > > > > >
> > > > > > >
> > > > > > > base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
> > > > > >
> > > > >
> > >
>
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-03-25 22:24 ` Matthew Brost
@ 2026-03-26 12:16 ` Honglei Huang
2026-04-15 10:04 ` Huang, Honglei1
0 siblings, 1 reply; 36+ messages in thread
From: Honglei Huang @ 2026-03-26 12:16 UTC (permalink / raw)
To: Matthew Brost
Cc: Christian König, amd-gfx, dri-devel, Alexander.Deucher,
Felix.Kuehling, Honglei Huang, Oak.Zeng, Jenny-Jing.Liu,
Philip.Yang, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu, Junhua.Shen,
Thomas Hellström, Rodrigo Vivi, Danilo Krummrich, Alice Ryhl
On 3/26/26 06:24, Matthew Brost wrote:
> On Tue, Mar 24, 2026 at 03:24:43PM +0800, Honglei Huang wrote:
>>
>>
>> On 3/23/26 14:31, Matthew Brost wrote:
>>> On Thu, Mar 19, 2026 at 10:17:36PM +0800, Honglei Huang wrote:
>>>>
>>>>
>>>> On 3/19/26 13:08, Matthew Brost wrote:
>>>>> On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
>>>>>>
>>>>>
>>>>> Disclaimer I haven't look at any code in this series yet.
>>>>>
>>>>>>
>>>>>> On 3/17/26 19:48, Christian König wrote:
>>>>>>> Adding a few XE and drm_gpuvm people on TO.
>>>>>>>
>>>>>>> On 3/17/26 12:29, Honglei Huang wrote:
>>>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>>>
>>>>>>>> This is a POC/draft patch series of SVM feature in amdgpu based on the
>>>>>>>> drm_gpusvm framework. The primary purpose of this RFC is to validate
>>>>>>>> the framework's applicability, identify implementation challenges,
>>>>>>>> and start discussion on framework evolution. This is not a production
>>>>>
>>>>> +1. Open to any ideas. Given this was designed originally for Xe we very
>>>>> well could have missed other drivers requirements.
>>>> Hi Matt,
>>>>
>>>> Thank you for the openness. And thank you so much for the incredibly
>>>> detailed and patient response. I really appreciate you taking the time to
>>>> walk through each point.
>>>>
>>>
>>> I'm here to help.
>>>
>>>> Actually I am still a learner when it comes to the drm_gpusvm framework and
>>>> GPU SVM design in general. Some of my descriptions below may not be entirely
>>>> accurate. But I really want to bring drm_gpusvm into amdgpu and make it work
>>>> well.
>>>
>>> I appreciate another driver jumping in and using this framework—it
>>> becomes easier to validate as more users adopt it.
>>>
>>>>
>>>>>
>>>>>>>> ready submission.
>>>>>>>>
>>>>>>>> This patch series implements basic SVM support with the following features:
>>>>>>>>
>>>>>>>> 1. attributes sepatarated from physical page management:
>>>>>>>>
>>>>>>>> - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
>>>>>>>> tree that stores SVM attributes. Managed through the SET_ATTR,
>>>>>>>> and mmu notifier callback.
>>>>>
>>>>> Can you explain the mmu notifier callback interaction here? See below in
>>>>> Xe the attribute tree is existing VMA tree (gpuvm).
>>>>>
>>>>
>>>> Let me try to explain, apologies if the description is not fully
>>>> precise.
>>>>
>>>> In current implementation, the MMU notifier callback interacts with the attr
>>>> tree only in the munmap path remove the corresponding attribute
>>>> entries from the attr tree so that stale attributes do not persist for
>>>> freed address space.
>>>>
>>>
>>> Ah, yes. We reset our attributes upon munmap too. We actually don't this
>>> 100% correct quite either and series in flight to fix [1].
>>>
>>> [1] https://patchwork.freedesktop.org/series/161815/
>>>
>>
>> I studied [1]. This draft has a simliar mechanism to handle attributes when
>> munmap. But there are some sligt differences in detail, maybe casued by
>> different UMD runtime behaviors.
>>
>>
>>>>>>>>
>>>>>>>> - Physical page layer (drm_gpusvm ranges): managed by the
>>>>>>>> drm_gpusvm framework, representing actual HMM backed DMA
>>>>>>>> mappings and GPU page table entries.
>>>>>>>>
>>>>>>>> This separation is necessary:
>>>>>>>> - The framework does not support range splitting, so a partial
>>>>>>>> munmap destroys the entire overlapping range, including the
>>>>>>>> still valid parts. If attributes were stored inside drm_gpusvm
>>>>>>>> ranges, they would be lost on unmapping.
>>>>>>>> The separate attr tree preserves userspace set attributes
>>>>>>>> across range operations.
>>>>>
>>>>> Yes, in Xe the divide is at the VMA level (set by user space) via VM
>>>>> bind (parts of VM may be mappings BOs, parts could be setup for SVM) or
>>>>> madvise IOCTLs which reflect user space attributes on current SVM
>>>>> mappings or future ones.
>>>>>
>>>>> The SVM range tree reflects mappings that have been faulted into the
>>>>> device and contain pages. This is an intentional choice.
>>>>
>>>> That makes a lot of sense. Thank you for clarifying the design intent. I
>>>> think the current adopt the same principle: the drm_gpusvm range tree only
>>>> reflect actual faulted in mappings.
>>>>
>>>>>
>>>>>>>
>>>>>>> Isn't that actually intended? When parts of the range unmap then that usually means the whole range isn't valid any more.
>>>>>
>>>>>
>>>>> Yes, this was an intentional design choice to not support partial unmap,
>>>>> and instead rely on the driver to recreate a new range.
>>>>>
>>>>> The reasoning is:
>>>>>
>>>>> - In practice, this should be rare for well-behaved applications.
>>>>>
>>>>> - With THP / large device pages, if a sub-range is unmapped, the entire
>>>>> GPU mapping is invalidated anyway due to the page size change. As a
>>>>> result, the cost of creating a new range is minimal, since the device
>>>>> will likely fault again on the remaining pages.
>>>>>
>>>>> So there is no need to over-engineer the common code.
>>>>>
>>>>> FWIW, to even test partial unmaps in Xe, I had to do things I doubt
>>>>> anyone would ever do:
>>>>>
>>>>> ptr = mmap(SZ_2M);
>>>>> /* fault in memory to the device */
>>>>> munmap(ptr, SZ_1M);
>>>>> /* touch memory again on the device */
>>>>>
>>>>
>>>> Thank you for this explanation and the concrete example. After further
>>>> discussion internally with Christian, we are now aligned with same position
>>>> partial unmap. Will remove rebuild on partial unmap logic in the next
>>>> version and handle it as only partially backed range.
>>>>
>>>>>>
>>>>>>
>>>>>> It is about partial unmap, some subregion in drm_gpusvm_range is still valid
>>>>>> but some other subregion is invalid, but under drm_gpusvm, need to destroy
>>>>>> the entire range.
>>>>>>
>>>>>> e.g.:
>>>>>>
>>>>>> [---------------unmap region in mmu notifier-----------------]
>>>>>> [0x1000 ------------ 0x9000]
>>>>>> [ valid ][ invalid ]
>>>>>>
>>>>>> see deatil in drm_gpusvm.c:110 line
>>>>>> section:Partial Unmapping of Ranges
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> - drm_gpusvm range boundaries are determined by fault address
>>>>>>>> and pre setted chunk size, not by userspace attribute boundaries.
>>>>>>>> Ranges may be rechunked on memory changes. Embedding
>>>>>>>> attributes in framework ranges would scatter attr state
>>>>>>>> across many small ranges and require complex reassemble
>>>>>>>> logic when operate attrbute.
>>>>>>>
>>>>>>> Yeah, that makes a lot of sense.
>>>>>>>
>>>>>>>>
>>>>>>>> 2) System memory mapping via drm_gpusvm
>>>>>>>>
>>>>>>>> The core mapping path uses drm_gpusvm_range_find_or_insert() to
>>>>>>>> create ranges, drm_gpusvm_range_get_pages() for HMM page fault
>>>>>>>> and DMA mapping, then updates GPU page tables via
>>>>>>>> amdgpu_vm_update_range().
>>>>>>>>
>>>>>>>> 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
>>>>>>>>
>>>>>>>> On XNACK off hardware the GPU cannot recover from page faults,
>>>>>>>> so mappings must be established through ioctl. When
>>>>>>>> userspace calls SET_ATTR with ACCESS=ENABLE, the driver
>>>>>>>> walks the attr tree and maps all accessible intervals
>>>>>>>> to the GPU by amdgpu_svm_range_map_attr_ranges().
>>>>>
>>>>> Can you expand on XNACK off / GPU no faults? Is this to the share GPU
>>>>> between 3D (dma-fences) and faulting clients? We have something similar
>>>>> in Xe, but it isn't an explicit IOCTL rather we switch between on demand
>>>>> as 3D client submits and then resumes page faults when all dma-fences
>>>>> have signaled.
>>>>>
>>>>> I see below you mention page tables are modified during quiesce KFD
>>>>> queues? I'm not sure that is required - you just need to guarnette
>>>>> faulting clients won't trigger page faults when dma-fence is in flight.
>>>>>
>>>>> Maybe give me an explaination of exactly what the requirement from AMD
>>>>> are here so I have better picture.
>>>>
>>>> Thank you for the patience, let me try to explain our situation, though
>>>> I may not get every detail right.
>>>>
>>>> XNACK off means hardware that does not have GPU page fault capability (or
>>>> turned off)
>>>>
>>>> So for these GPUs, ALL page table entries must be fully populated before
>>>> the GPU can access the memory. This is why we need the ioctl driven
>>>> mapping path, when userspace calls SET_ATTR with ACCESS=ENABLE, need
>>>> walk the attribute tree and eagerly map all accessible ranges into the
>>>> GPU page tables. This is functionally similar to what you describe as
>>>> prefetch IOCTLs / VM bind in Xe.
>>>>
>>>> Regarding queue quiesce during page table modification: on XNACK off
>>>> hardware, because the GPU cannot fault, we must ensure the GPU is
>>>> completely stopped before modifying any PTE it might be accessing.
>>>> Otherwise the GPU could access a partially updated page table and hang.
>>>> The quiesce/resume is the mechanism to guarantee this.
>>>>
>>>> I hope that helps clarify the picture.
>>>>
>>>
>>> This clarifies a lot. This is what we’d call in Xe “preemption fence”
>>> mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
>>> resume. We don’t actually support SVM in this case; instead, we use
>>> “userptr binds,” which are built on gpusvm for page collection. However,
>>> we don’t support migrating memory to the device—though we could.
>>>
>>> I’d look at how we converted 'userptr' to be based on GPU SVM [2]. In
>>> this case, don’t maintain a range tree, as those—as you suggest—are more
>>> of an on-demand fault driver concern. Instead, just embed 'struct
>>> drm_gpusvm_pages' in the VMA struct defined by the IOCTLs..
>>>
>>> We could extend this to support migrating 'userptr', but we just haven’t
>>> done that yet—this may be what you want to do in “XNACK off..
>>>
>>> [2] https://patchwork.freedesktop.org/series/146553/
>>>
>>
>> Actually we need to swith the xnack mode between on and off, so in xnack off
>> mode, the driver operats in "implicit prefetch mode". This may be due to
>> compatibility with older hardware and the need for UMD runtime. We will
>> further discuss the handling method under xnack off internally.
>>
>>
>>>>
>>>>>
>>>>>>>>
>>>>>>>> 4) Invalidation, GC worker, and restore worker
>>>>>>>>
>>>>>>>> MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
>>>>>>>> three cases based on event type and hardware mode:
>>>>>>>> - unmap event: clear GPU PTEs in the notifier context,
>>>>>>>> unmap DMA pages, mark ranges as unmapped, flush TLB,
>>>>>>>> and enqueue to the GC worker. On XNACK off, also
>>>>>>>> quiesce KFD queues and schedule rebuild of the
>>>>>>>> still valid portions that were destroyed together with
>>>>>>>> the unmapped subregion.
>>>>>>>>
>>>>>>>> - evict on XNACK off:
>>>>>>>> quiesce KFD queues first, then unmap DMA pages and
>>>>>>>> enqueue to the restore worker.
>>>>>>>
>>>>>>> Is that done through the DMA fence or by talking directly to the MES/HWS?
>>>>>>
>>>>>> Currently the KFD queue quiesce/resume APIs are reused, looking forward to a
>>>>>> better solution.
>>>>>>
>>>>>
>>>>> +1
>>>>>
>>>>>> Regards,
>>>>>> Honglei
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>> - evict on XNACK on:
>>>>>>>> clear GPU PTEs, unmap DMA pages, and flush TLB, but do
>>>>>>>> not schedule any worker. The GPU will fault on next
>>>>>>>> access and the fault handler establishes the mapping.
>>>>>>>>
>>>>>>>> Not supported feature:
>>>>>>>> - XNACK on GPU page fault mode
>>>>>>>> - migration and prefetch feature
>>>>>>>> - Multi GPU support
>>>>>>>>
>>>>>>>> XNACK on enablement is ongoing. The GPUs that support XNACK on
>>>>>>>> are currently only accessible to us via remote lab machines, which slows
>>>>>>>> down progress.
>>>>>>>>
>>>>>>>> Patch overview:
>>>>>>>>
>>>>>>>> 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
>>>>>>>> SET_ATTR/GET_ATTR operations, attribute types, and related
>>>>>>>> structs in amdgpu_drm.h.
>>>>>>>>
>>>>>>>> 02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
>>>>>>>> refcount, attr_tree, workqueues, locks, and
>>>>>>>> callbacks (begin/end_restore, flush_tlb).
>>>>>>>>
>>>>>>>> 03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
>>>>>>>> (interval tree node), attr_tree, access enum, flag masks,
>>>>>>>> and change trigger enum.
>>>>>>>>
>>>>>>>> 04/12 Attribute tree operations: interval tree lookup, insert,
>>>>>>>> remove, and tree create/destroy lifecycle.
>>>>>>>>
>>>>>>>> 05/12 Attribute set: validate UAPI attributes, apply to internal
>>>>>>>> attrs, handle hole/existing range with head/tail splitting,
>>>>>>>> compute change triggers, and -EAGAIN retry loop.
>>>>>>>> Implements attr_clear_pages for unmap cleanup and attr_get.
>>>>>>>>
>>>>>>>> 06/12 Range data structures: amdgpu_svm_range extending
>>>>>>>> drm_gpusvm_range with gpu_mapped state, pending ops,
>>>>>>>> pte_flags cache, and GC/restore queue linkage.
>>>>>>>>
>>>>>>>> 07/12 PTE flags and GPU mapping: simple gpu pte function,
>>>>>>>> GPU page table update with DMA address, range mapping loop:
>>>>>>>> find_or_insert -> get_pages -> validate -> update PTE,
>>>>>>>> and attribute change driven mapping function.
>>>>>>>>
>>>>>>>> 08/12 Notifier and invalidation: synchronous GPU PTE clear in
>>>>>>>> notifier context, range removal and overlap cleanup,
>>>>>>>> rebuild after destroy logic, and MMU event dispatcher
>>>>>>>>
>>>>>>>> 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
>>>>>>>> worker for unmap processing and rebuild, ordered restore
>>>>>>>> worker for mapping evicted ranges, and flush/sync
>>>>>>>> helpers.
>>>>>>>>
>>>>>>>> 10/12 Initialization and fini: kmem_cache for range/attr,
>>>>>>>> drm_gpusvm_init with chunk sizes, XNACK detection, TLB
>>>>>>>> flush helper, and amdgpu_svm init/close/fini lifecycle.
>>>>>>>>
>>>>>>>> 11/12 IOCTL and fault handler: PASID based SVM lookup with kref
>>>>>>>> protection, amdgpu_gem_svm_ioctl dispatcher, and
>>>>>>>> amdgpu_svm_handle_fault for GPU page fault recovery.
>>>>>>>>
>>>>>>>> 12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
>>>>>>>> Makefile rules, ioctl table registration, and amdgpu_vm
>>>>>>>> hooks (init in make_compute, close/fini, fault dispatch).
>>>>>>>>
>>>>>>>> Test result:
>>>>>>>> on gfx1100(W7900) and gfx943(MI300x)
>>>>>>>> kfd test: 95%+ passed, same failed cases as the official release
>>>>>>>> rocr test: all passed
>>>>>>>> hip catch test: 20 cases failed out of 5366 total, +13 failures vs the official release
>>>>>>>>
>>>>>>>> During implementation we identified several challenges / design questions:
>>>>>>>>
>>>>>>>> 1. No range splitting on partial unmap
>>>>>>>>
>>>>>>>> drm_gpusvm explicitly does not support range splitting in drm_gpusvm.c:122.
>>>>>>>> A partial munmap has to destroy the entire range, including the still valid interval.
>>>>>>>> GPU fault driven hardware can handle this design with an extra GPU fault handler,
>>>>>>>> but AMDGPU needs to support XNACK off hardware, where this design requires the
>>>>>>>> driver to rebuild the valid part of the removed range. This brings very heavy
>>>>>>>> restore work in the work queue/GC worker: unmap/destroy -> rebuild (insert and map).
>>>>>>>> This restore work is even heavier than in kfd_svm: previously the driver work queue
>>>>>>>> only needed to restore or unmap, but with drm_gpusvm the driver needs to unmap and
>>>>>>>> restore, which brings more complex logic, a heavier worker queue workload, and
>>>>>>>> synchronization issues.
>>>>>
>>>>> Is this common in the workload you are running? I'm also wondering if
>>>>> your restore logic / KFD's design is what is actually contributing to the
>>>>> problem.
>>>>>
>>>>
>>>> Honestly, you raise a fair point.
>>>>
>>>> We will redesign the logic around the partial munmap, which should eliminate
>>>> most of this complexity.
>>>>
>>>>
>>>
>>> +1, yes, test it, but don't optimize for it.
>>>
>>>>>>>>
>>>>>>>> 2. Fault driven vs ioctl driven mapping
>>>>>>>>
>>>>>>>> drm_gpusvm is designed around GPU page fault handlers. The primary entry
>>>>>>>> point drm_gpusvm_range_find_or_insert() takes a fault_addr.
>>>>>>>> AMDGPU needs to support IOCTL driven mapping because on no-XNACK hardware the
>>>>>>>> GPU cannot fault at all.
>>>>>
>>>>> I think we refer to these as prefetch IOCTLs in Xe. Ideally, user space
>>>>> issues these so the device does not fault (e.g., prefetch creates a set
>>>>> of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
>>>>> specific VM bind operations.
>>>>>
>>>>
>>>> That is a very helpful way to think about it. Yes, our ioctl driven
>>>> mapping(xnack off) is essentially equivalent to a prefetch operation. We are
>>>> trying to improve it.
>>>>
>>>
>>> See above wrt 'userptr'.
>>
>> Got it.
>>
>>>
>>>>
>>>>>>>>
>>>>>>>> The ioctl path cannot hold mmap_read_lock across the entire operation
>>>>>>>> because drm_gpusvm_range_find_or_insert() acquires/releases it
>>>>>>>> internally. This creates race windows with MMU notifiers / workers.
>>>>>
>>>>> This is a very intentional choice in the locking design: mmap_read_lock
>>>>> is held only in very specific parts of GPU SVM, and the driver should
>>>>> never need to take this lock.
>>>>>
>>>>> Yes, notifiers can race, which is why the GPU fault handler and prefetch
>>>>> handler are structured as retry loops when a notifier race is detected.
>>>>> In practice, with well-behaved applications, these races should be
>>>>> rare—but they do occur, and the driver must handle them.
>>>>>
>>>>> __xe_svm_handle_pagefault implements the page fault retry loop. VM bind
>>>>> prefetch has similar logic, although it is more spread out given that it
>>>>> is part of a deeper software pipeline.
>>>>>
>>>>> FWIW, holding locks to avoid races was rejected by Sima because we
>>>>> reasoned it is essentially impossible to guarantee the absence of races
>>>>> by holding a lock. CPU page fault handlers are also effectively just
>>>>> large retry loops.
>>>>>
>>>>> So this is one point I believe you will need to fixup driver side.
>>>>>
>>>>
>>>> Understood. Thank you for the detailed explanation and for pointing to
>>>> __xe_svm_handle_pagefault as a reference. We will restructure both our
>>>> fault handler and ioctl path to a better retry loop pattern with sequence
>>>> number race detection.
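>>>>
>>>> Just to confirm my understanding, the shape I have in mind is roughly the
>>>> sketch below. The drm_gpusvm calls are the ones already used in this
>>>> series; the amdgpu_svm_* helper and the exact error handling are
>>>> placeholders, not final code:
>>>>
>>>> retry:
>>>> 	range = drm_gpusvm_range_find_or_insert(&svm->gpusvm, addr,
>>>> 						start, end, &ctx);
>>>> 	if (IS_ERR(range))
>>>> 		return PTR_ERR(range);
>>>>
>>>> 	err = drm_gpusvm_range_get_pages(&svm->gpusvm, range, &ctx);
>>>> 	if (err == -EAGAIN)	/* assumed transient notifier race, try again */
>>>> 		goto retry;
>>>> 	if (err)
>>>> 		return err;
>>>>
>>>> 	/* Commit PTEs under the notifier lock; if the pages were invalidated
>>>> 	 * in between, loop again instead of holding extra locks across the
>>>> 	 * whole operation. */
>>>> 	drm_gpusvm_notifier_lock(&svm->gpusvm);
>>>> 	if (!drm_gpusvm_range_pages_valid(&svm->gpusvm, range)) {
>>>> 		drm_gpusvm_notifier_unlock(&svm->gpusvm);
>>>> 		goto retry;
>>>> 	}
>>>> 	err = amdgpu_svm_range_update_gpu_pte(svm, range);	/* placeholder */
>>>> 	drm_gpusvm_notifier_unlock(&svm->gpusvm);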
>>>>
>>>
>>> Yes, the typical pattern is:
>>>
>>> - Try to migrate once
>>> - If you hit a race, give up, evict all memory back to system memory, and bind it
>>>
>>> Atomics make this tricky because memory must move, but I’m not sure
>>> “XNACK off” applies here. However, GPU SVM provides a timeslice
>>> mechanism to ensure the CPU can’t move memory while the GPU needs to
>>> execute something.
>>
>> Understood.
>>
>>>
>>>>>>>>
>>>>>>>> 3. Multi GPU support
>>>>>>>>
>>>>>>>> drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
>>>>>>>> each GPU gets an independent instance with its own range tree, MMU
>>>>>>>> notifiers, notifier_lock, and DMA mappings.
>>>>>>>>
>>>>>
>>>>> This is a part I am absolutely open to fixing. Right now, each
>>>>> drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
>>>>> decoupling a GPU SVM instance from a single device, allowing each
>>>>> drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
>>>>> device).
>>>>>
>>>>> This would give drivers the flexibility to use one GPU SVM instance per
>>>>> VM/device instance (as in Xe), or to maintain a single GPU SVM per CPU
>>>>> MM.
>>>>>
>>>>
>>>> That would be wonderful! Looking forward to your patch very much!
>>>>
>>>
>>> I can't say I'll code this, but we thought about it as an option and are very
>>> open to patches which refactor the object model for multiple use cases.
>>
>> Understood. I will focus on single GPU first, and once we have a
>> solid v1, we'd be happy to explore contributing patches for the
>> multi-device object model refactoring.
>>
>
> I think roughly what would need to be done is:
>
> - Move struct drm_gpusvm_pages out of struct drm_gpusvm_range.
> - Embed either a struct device or a struct drm_device in struct
> drm_gpusvm_pages.
> - Drop struct drm_device from struct drm_gpusvm.
> - Have the driver’s range structure embed one or more struct
> drm_gpusvm_pages in addition to struct drm_gpusvm_range.
> - Refactor a few range-based helpers (drm_gpusvm_range_pages_valid,
> drm_gpusvm_range_get_pages, drm_gpusvm_range_unmap_pages), or simply
> drop them entirely and update drivers to use the drm_gpusvm_pages
> helpers instead.
>
> Then it is up to the drivers whether struct drm_gpusvm maps to a single
> device or multiple devices. Either use case seems valid, and giving
> drivers the option appears to be the right approach, rather than having
> the common drm_gpusvm layer impose its own constraints.
>
> This type of refactor can be done at any time as an independent patch,
> so feel free to post it whenever and I can verify on the Xe side that
> everything looks good.
>
> Matt
>
Thanks a lot for the detailed guidance and steps, it is very clear and
actionable. I'm excited about this direction, since it gives drivers more
flexibility. I'll start working on this as soon as possible and will post
the multi-device refactor as a standalone series once it is well
validated. Thanks again for being so open to collaboration!
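
To double check that I read the steps correctly, the object layout I have in
mind after the refactor is roughly the following (a sketch only; the field
names and the two-device array are my guesses, not a proposal):

	struct drm_gpusvm_pages {
		struct drm_device *drm;	/* or struct device *, per your step 2 */
		/* dma_addr array, npages, flags, ... stay as today */
	};

	struct drm_gpusvm_range {
		/* interval tree node + notifier linkage only, no pages here */
	};

	/* amdgpu side, illustrative two-GPU case */
	struct amdgpu_svm_range {
		struct drm_gpusvm_range base;
		struct drm_gpusvm_pages pages[2];	/* one set per device */
	};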
Regards,
Honglei
>>>
>>>>
>>>>>>>> This may bring huge overhead:
>>>>>>>> - N x MMU notifier registrations for the same address range
>>>>>
>>>>> The notifier overhead is a real concern. We recently introduced two-pass
>>>>> notifiers [1] to speed up multi-device notifiers. At least in Xe, the
>>>>> TLB invalidations—which are the truly expensive part—can be pipelined
>>>>> using the two-pass approach. Currently, [1] only implements two-pass
>>>>> notifiers for userptr, but Xe’s GPU SVM will be updated to use them
>>>>> shortly.
>>>>>
>>>>> [1] https://patchwork.freedesktop.org/series/153280/
>>>>>
>>>>
>>>> Thank you for the pointer to two-pass notifiers. Will study this
>>>> series.
>>>>
>>>>>>>> - N x hmm_range_fault() calls for the same page (KFD: 1x)
>>>>>
>>>>> hmm_range_fault is extremely fast compared to the actual migration.
>>>>> Running hmm_range_fault on a 2MB region using 4KB pages takes less
>>>>> than 1µs. With THP or large device pages [2] (merged last week), it’s
>>>>> around 1/20 of a microsecond. So I wouldn’t be too concerned about this.
>>>>>
>>>>> [2] https://patchwork.freedesktop.org/series/163141/
>>>>>
>>>>
>>>> That is very helpful data. Perhaps we worried too much.
>>>>
>>>>>>>> - N x DMA mapping memory
>>>>>
>>>>> You will always have N x DMA mapping memory if the pages are in system
>>>>> memory as the dma-mapping API is per device.
>>>>
>>>> Totally agreed.
>>>>
>>>>>
>>>>>>>> - N x invalidation + restore worker scheduling per CPU unmap event
>>>>>>>> - N x GPU page table flush / TLB invalidation
>>>>>
>>>>> I agree you do not want to serialize GPU page table flush / TLB
>>>>> invalidations. Hence two-pass notifiers [1].
>>>>
>>>> Yes, will learn it.
>>>>
>>>>>
>>>>>>>> - Increased mmap_lock hold time, N callbacks serialize under it
>>>>>>>>
>>>>>>>> compatibility issues:
>>>>>>>> - Quiesce/resume scope mismatch: to integrate with KFD compute
>>>>>>>> queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm()
>>>>>>>> which have process level semantics. Under the per GPU
>>>>>>>> drm_gpusvm model, maybe there are some issues on sync. To properly
>>>>>>>> integrate with KFD under the per SVM model, a compatibility or
>>>>>>>> new per VM level queue control APIs maybe need to introduced.
>>>>>>>>
>>>>>
>>>>> I thought the idea was to get rid of KFD and move over to AMDGPU? I thought
>>>>> Christian mentioned this to me at XDC.
>>>>>
>>>>
>>>>>>>> Migration challenges:
>>>>>>>>
>>>>>>>> - No global migration decision logic: each per GPU SVM
>>>>>>>> instance maintains its own attribute tree independently. This
>>>>>>>> allows conflicting settings (e.g., GPU0's SVM sets
>>>>>>>> PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
>>>>>>>> for the same address range) with no detection or resolution.
>>>>>>>> A global attribute coordinator or a shared manager is needed to
>>>>>>>> provide a unified global view for migration decisions
>>>>>
>>>>> Yes, this is a hole in the Xe API too. We have told UMDs that if they set up
>>>>> individual VMs with conflicting attributes for a single CPU address space,
>>>>> the behavior is undefined. Our UMD's madvise implementation is basically a loop
>>>>> over all GPU VMs setting the same attributes.
>>>>
>>>> Will follow the same approach for now, the UMD is responsible for setting
>>>> consistent attributes across GPU VMs.
>>>>
>>>
>>> +1
>>>
>>>>>
>>>>>>>>
>>>>>>>> - migrate_vma_setup broadcast: one GPU's migration triggers MMU
>>>>>>>> notifier callbacks in ALL N-1 other drm_gpusvm instances,
>>>>>>>> causing N-1 unnecessary restore workers to be scheduled. And
>>>>>
>>>>> My feeling is that you shouldn’t reschedule restore workers unless you
>>>>> actually have to invalidate page tables (i.e., you have a local SVM
>>>>> range within the notifier). So the first migration to an untouched
>>>>> region may trigger notifiers, but they won’t do anything because you
>>>>> don’t have any valid SVM ranges yet. Subsequent mappings of the migrated
>>>>> region won’t trigger a notifier unless the memory is moved again.
>>>>>
>>>>
>>>> That is a very good point. We should check whether we actually have
>>>> valid SVM ranges before scheduling restore workers. If there is nothing
>>>> to invalidate, the notifier callback should be a no-op. We will review
>>>> our notifier callback logic to ensure we are not doing unnecessary work
>>>> here. Thank you for pointing this out.
>>>>
>>>>>>>> creates races between the initiating migration and the other
>>>>>>>> instance's restore attempts.
>>>>>
>>>>> Yes, if multiple devices try to migrate the same CPU pages at the same
>>>>> time, that will race. That’s why in Xe we have a module-level
>>>>> driver_migrate_lock. The first migration runs in read mode; if it
>>>>> detects a race and aborts, it then takes driver_migrate_lock in write
>>>>> mode so it becomes the only device allowed to move memory / CPU pages.
>>>>> See xe_svm_alloc_vram() for how this is used.
>>>>>
>>>>> I’m not sure this approach will work for you, but I just wanted to point
>>>>> out that we identified this as a potential issue.
>>>>>
>>>>
>>>> Thank you for sharing the driver_migrate_lock approach and pointing to
>>>> xe_svm_alloc_vram(). Will explore whether a similar lock pattern can work
>>>> for our case.
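>>>>
>>>> If I read the pattern correctly, at module scope it would look roughly like
>>>> the sketch below (purely illustrative; the amdgpu_svm_* names are
>>>> placeholders for this POC, the structure mirrors what you describe for
>>>> xe_svm_alloc_vram()):
>>>>
>>>> 	static DECLARE_RWSEM(amdgpu_svm_migrate_lock);	/* hypothetical, module scope */
>>>>
>>>> 	static int amdgpu_svm_migrate_to_vram(struct amdgpu_svm *svm,
>>>> 					      unsigned long start, unsigned long end)
>>>> 	{
>>>> 		int err;
>>>>
>>>> 		down_read(&amdgpu_svm_migrate_lock);
>>>> 		err = amdgpu_svm_migrate_once(svm, start, end);	/* optimistic try */
>>>> 		up_read(&amdgpu_svm_migrate_lock);
>>>> 		if (err != -EBUSY)
>>>> 			return err;
>>>>
>>>> 		/* Raced with another device moving the same pages: retry as the
>>>> 		 * sole migrator. */
>>>> 		down_write(&amdgpu_svm_migrate_lock);
>>>> 		err = amdgpu_svm_migrate_once(svm, start, end);
>>>> 		up_write(&amdgpu_svm_migrate_lock);
>>>> 		return err;
>>>> 	}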
>>>>
>>>>>>>>
>>>>>>>> - No cross instance migration serialization: each per GPU
>>>>>>>> drm_gpusvm instance has independent locking, so two GPUs'
>>>>>>>> "decide -> migrate -> remap" sequences can interleave. While
>>>>>>>> the kernel page lock prevents truly simultaneous migration of
>>>>>>>> the same physical page, the losing side's retry (evict from
>>>>>>>> other GPU's VRAM -> migrate back) triggers broadcast notifier
>>>>>>>> invalidations and restore workers, compounding the ping pong
>>>>>>>> problem above.
>>>>>>>>
>>>>>
>>>>> See the driver_migrate_lock above.
>>>>
>>>> Acknowledged, thank you.
>>>>>
>>>>>>>> - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
>>>>>>>> hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
>>>>>>>> it only selects system memory pages for migration.
>>>>>>>>
>>>>>
>>>>> I think this is fixed? We did find some core MM bugs that blocked VRAM
>>>>> to VRAM but those have been worked out.
>>>>>
>>>>> The code I'm looking at:
>>>>>
>>>>> 517 int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
>>>>> 518 struct mm_struct *mm,
>>>>> 519 unsigned long start, unsigned long end,
>>>>> 520 const struct drm_pagemap_migrate_details *mdetails)
>>>>> 521 {
>>>>> 522 const struct drm_pagemap_devmem_ops *ops = devmem_allocation->ops;
>>>>> 523 struct drm_pagemap *dpagemap = devmem_allocation->dpagemap;
>>>>> 524 struct dev_pagemap *pagemap = dpagemap->pagemap;
>>>>> 525 struct migrate_vma migrate = {
>>>>> 526 .start = start,
>>>>> 527 .end = end,
>>>>> 528 .pgmap_owner = pagemap->owner,
>>>>> 529 .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
>>>>> 530 MIGRATE_VMA_SELECT_DEVICE_PRIVATE | MIGRATE_VMA_SELECT_COMPOUND,
>>>>> 531 };
>>>>>
>>>>
>>>> Thank you for checking! I am using v6.18 for this POC, missed the fix, will
>>>> rebase to the latest.
>>>>
>>>>
>>>>>>>> - CPU fault reverse migration race: CPU page fault triggers
>>>>>>>> migrate_to_ram while GPU instances are concurrently operating.
>>>>>>>> Per GPU notifier_lock does not protect cross GPU operations.
>>>>>
>>>>> No, again retry loop as discussed above.
>>>>
>>>> Understood.
>>>>
>>>>>
>>>>>>>>
>>>>>>>> We believe a strong, well designed solution at the framework level is
>>>>>>>> needed to properly address these problems, and we look forward to
>>>>>>>> discussion and suggestions.
>>>>>
>>>>> Let's work together to figure out what is missing here.
>>>>
>>>> Thank you so much, Matt. Your feedback has been incredibly valuable and
>>>> has given us a much clearer picture of the framework's design.
>>>> I really appreciate the effort you put into building drm_gpusvm as a
>>>> shared framework. Will incorporate your suggestions into our next
>>>> revision and look forward to continuing the collaboration.
>>>>
>>>
>>> No problem. Happy to help.
>>
>> Thank you again for all the detailed feedback.
>>
>> Regards,
>> Honglei
>>
>>>
>>> Matt
>>>
>>>> Regards,
>>>> Honglei
>>>>
>>>>
>>>>>
>>>>> Matt
>>>>>
>>>>>>>>
>>>>>>>> Honglei Huang (12):
>>>>>>>> drm/amdgpu: add SVM UAPI definitions
>>>>>>>> drm/amdgpu: add SVM data structures and header
>>>>>>>> drm/amdgpu: add SVM attribute data structures
>>>>>>>> drm/amdgpu: implement SVM attribute tree operations
>>>>>>>> drm/amdgpu: implement SVM attribute set
>>>>>>>> drm/amdgpu: add SVM range data structures
>>>>>>>> drm/amdgpu: implement SVM range PTE flags and GPU mapping
>>>>>>>> drm/amdgpu: implement SVM range notifier and invalidation
>>>>>>>> drm/amdgpu: implement SVM range workers
>>>>>>>> drm/amdgpu: implement SVM core initialization and fini
>>>>>>>> drm/amdgpu: implement SVM ioctl and fault handler
>>>>>>>> drm/amdgpu: wire up SVM build system and fault handler
>>>>>>>>
>>>>>>>> drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
>>>>>>>> drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 ++++++++++++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
>>>>>>>> include/uapi/drm/amdgpu_drm.h | 39 +
>>>>>>>> 12 files changed, 2958 insertions(+), 4 deletions(-)
>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
>>>>>>>>
>>>>>>>>
>>>>>>>> base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
>>>>>>>
>>>>>>
>>>>
>>
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-03-26 12:16 ` Honglei Huang
@ 2026-04-15 10:04 ` Huang, Honglei1
2026-04-23 6:40 ` Matthew Brost
0 siblings, 1 reply; 36+ messages in thread
From: Huang, Honglei1 @ 2026-04-15 10:04 UTC (permalink / raw)
To: Matthew Brost
Cc: Christian König, amd-gfx, dri-devel, Alexander.Deucher,
Felix.Kuehling, Honglei Huang, Oak.Zeng, Jenny-Jing.Liu,
Philip.Yang, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu, Junhua.Shen,
Thomas Hellström, Rodrigo Vivi, Danilo Krummrich, Alice Ryhl
On 3/26/2026 8:16 PM, Honglei Huang wrote:
>
>
> On 3/26/26 06:24, Matthew Brost wrote:
>> On Tue, Mar 24, 2026 at 03:24:43PM +0800, Honglei Huang wrote:
>>>
>>>
>>> On 3/23/26 14:31, Matthew Brost wrote:
>>>> On Thu, Mar 19, 2026 at 10:17:36PM +0800, Honglei Huang wrote:
>>>>>
>>>>>
>>>>> On 3/19/26 13:08, Matthew Brost wrote:
>>>>>> On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
>>>>>>>
>>>>>>
>>>>>> Disclaimer: I haven't looked at any code in this series yet.
>>>>>>
>>>>>>>
>>>>>>> On 3/17/26 19:48, Christian König wrote:
>>>>>>>> Adding a few XE and drm_gpuvm people on TO.
>>>>>>>>
>>>>>>>> On 3/17/26 12:29, Honglei Huang wrote:
>>>>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>>>>
>>>>>>>>> This is a POC/draft patch series of SVM feature in amdgpu based
>>>>>>>>> on the
>>>>>>>>> drm_gpusvm framework. The primary purpose of this RFC is to
>>>>>>>>> validate
>>>>>>>>> the framework's applicability, identify implementation challenges,
>>>>>>>>> and start discussion on framework evolution. This is not a
>>>>>>>>> production
>>>>>>
>>>>>> +1. Open to any ideas. Given this was designed originally for Xe
>>>>>> we very
>>>>>> well could have missed other drivers requirements.
>>>>> Hi Matt,
>>>>>
>>>>> Thank you for the openness. And thank you so much for the incredibly
>>>>> detailed and patient response. I really appreciate you taking the
>>>>> time to
>>>>> walk through each point.
>>>>>
>>>>
>>>> I'm here to help.
>>>>
>>>>> Actually I am still a learner when it comes to the drm_gpusvm
>>>>> framework and
>>>>> GPU SVM design in general. Some of my descriptions below may not be
>>>>> entirely
>>>>> accurate. But I really want to bring drm_gpusvm into amdgpu and
>>>>> make it work
>>>>> well.
>>>>
>>>> I appreciate another driver jumping in and using this framework—it
>>>> becomes easier to validate as more users adopt it.
>>>>
>>>>>
>>>>>>
>>>>>>>>> ready submission.
>>>>>>>>>
>>>>>>>>> This patch series implements basic SVM support with the
>>>>>>>>> following features:
>>>>>>>>>
>>>>>>>>> 1. attributes separated from physical page management:
>>>>>>>>>
>>>>>>>>> - Attribute layer (amdgpu_svm_attr_tree): a driver side
>>>>>>>>> interval
>>>>>>>>> tree that stores SVM attributes. Managed through the
>>>>>>>>> SET_ATTR,
>>>>>>>>> and mmu notifier callback.
>>>>>>
>>>>>> Can you explain the mmu notifier callback interaction here? See
>>>>>> below in
>>>>>> Xe the attribute tree is existing VMA tree (gpuvm).
>>>>>>
>>>>>
>>>>> Let me try to explain, apologies if the description is not fully
>>>>> precise.
>>>>>
>>>>> In current implementation, the MMU notifier callback interacts with
>>>>> the attr
>>>>> tree only in the munmap path, removing the corresponding attribute
>>>>> entries from the attr tree so that stale attributes do not persist for
>>>>> freed address space.
>>>>>
>>>>
>>>> Ah, yes. We reset our attributes upon munmap too. We
>>>> actually don't do this
>>>> 100% correctly quite yet either, and there is a series in flight to fix it [1].
>>>>
>>>> [1] https://patchwork.freedesktop.org/series/161815/
>>>>
>>>
>>> I studied [1]. This draft has a similar mechanism to handle
>>> attributes on
>>> munmap. But there are some slight differences in detail, maybe caused by
>>> different UMD runtime behaviors.
>>>
>>>
>>>>>>>>>
>>>>>>>>> - Physical page layer (drm_gpusvm ranges): managed by the
>>>>>>>>> drm_gpusvm framework, representing actual HMM backed DMA
>>>>>>>>> mappings and GPU page table entries.
>>>>>>>>>
>>>>>>>>> This separation is necessary:
>>>>>>>>> - The framework does not support range splitting,
>>>>>>>>> so a partial
>>>>>>>>> munmap destroys the entire overlapping range,
>>>>>>>>> including the
>>>>>>>>> still valid parts. If attributes were stored
>>>>>>>>> inside drm_gpusvm
>>>>>>>>> ranges, they would be lost on unmapping.
>>>>>>>>> The separate attr tree preserves userspace set
>>>>>>>>> attributes
>>>>>>>>> across range operations.
>>>>>>
>>>>>> Yes, in Xe the divide is at the VMA level (set by user space) via VM
>>>>>> bind (parts of VM may be mappings BOs, parts could be setup for
>>>>>> SVM) or
>>>>>> madvise IOCTLs which reflect user space attributes on current SVM
>>>>>> mappings or future ones.
>>>>>>
>>>>>> The SVM range tree reflects mappings that have been faulted into the
>>>>>> device and contain pages. This is an intentional choice.
>>>>>
>>>>> That makes a lot of sense. Thank you for clarifying the design
>>>>> intent. I
>>>>> think the current adopt the same principle: the drm_gpusvm range
>>>>> tree only
>>>>> reflect actual faulted in mappings.
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> Isn't that actually intended? When parts of the range unmap then
>>>>>>>> that usually means the whole range isn't valid any more.
>>>>>>
>>>>>>
>>>>>> Yes, this was an intentional design choice to not support partial
>>>>>> unmap,
>>>>>> and instead rely on the driver to recreate a new range.
>>>>>>
>>>>>> The reasoning is:
>>>>>>
>>>>>> - In practice, this should be rare for well-behaved applications.
>>>>>>
>>>>>> - With THP / large device pages, if a sub-range is unmapped, the
>>>>>> entire
>>>>>> GPU mapping is invalidated anyway due to the page size change. As a
>>>>>> result, the cost of creating a new range is minimal, since the device
>>>>>> will likely fault again on the remaining pages.
>>>>>>
>>>>>> So there is no need to over-engineer the common code.
>>>>>>
>>>>>> FWIW, to even test partial unmaps in Xe, I had to do things I doubt
>>>>>> anyone would ever do:
>>>>>>
>>>>>> ptr = mmap(SZ_2M);
>>>>>> /* fault in memory to the device */
>>>>>> munmap(ptr, SZ_1M);
>>>>>> /* touch memory again on the device */
>>>>>>
>>>>>
>>>>> Thank you for this explanation and the concrete example. After further
>>>>> discussion internally with Christian, we are now aligned with same
>>>>> position
>>>>> partial unmap. Will remove rebuild on partial unmap logic in the next
>>>>> version and handle it as only partially backed range.
>>>>>
>>>>>>>
>>>>>>>
>>>>>>> It is about partial unmap, some subregion in drm_gpusvm_range is
>>>>>>> still valid
>>>>>>> but some other subregion is invalid, but under drm_gpusvm, need
>>>>>>> to destroy
>>>>>>> the entire range.
>>>>>>>
>>>>>>> e.g.:
>>>>>>>
>>>>>>> [---------------unmap region in mmu
>>>>>>> notifier-----------------]
>>>>>>> [0x1000 ------------ 0x9000]
>>>>>>> [ valid ][ invalid ]
>>>>>>>
>>>>>>> see deatil in drm_gpusvm.c:110 line
>>>>>>> section:Partial Unmapping of Ranges
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> - drm_gpusvm range boundaries are determined by
>>>>>>>>> fault address
>>>>>>>>> and pre setted chunk size, not by userspace
>>>>>>>>> attribute boundaries.
>>>>>>>>> Ranges may be rechunked on memory changes.
>>>>>>>>> Embedding
>>>>>>>>> attributes in framework ranges would scatter attr
>>>>>>>>> state
>>>>>>>>> across many small ranges and require complex
>>>>>>>>> reassemble
>>>>>>>>> logic when operate attrbute.
>>>>>>>>
>>>>>>>> Yeah, that makes a lot of sense.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2) System memory mapping via drm_gpusvm
>>>>>>>>>
>>>>>>>>> The core mapping path uses
>>>>>>>>> drm_gpusvm_range_find_or_insert() to
>>>>>>>>> create ranges, drm_gpusvm_range_get_pages() for HMM
>>>>>>>>> page fault
>>>>>>>>> and DMA mapping, then updates GPU page tables via
>>>>>>>>> amdgpu_vm_update_range().
>>>>>>>>>
>>>>>>>>> 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
>>>>>>>>>
>>>>>>>>> On XNACK off hardware the GPU cannot recover from page
>>>>>>>>> faults,
>>>>>>>>> so mappings must be established through ioctl. When
>>>>>>>>> userspace calls SET_ATTR with ACCESS=ENABLE, the driver
>>>>>>>>> walks the attr tree and maps all accessible intervals
>>>>>>>>> to the GPU by amdgpu_svm_range_map_attr_ranges().
>>>>>>
>>>>>> Can you expand on XNACK off / GPU no faults? Is this to share the GPU
>>>>>> between 3D (dma-fences) and faulting clients? We have something
>>>>>> similar
>>>>>> in Xe, but it isn't an explicit IOCTL rather we switch between on
>>>>>> demand
>>>>>> as 3D client submits and then resumes page faults when all dma-fences
>>>>>> have signaled.
>>>>>>
>>>>>> I see below you mention page tables are modified during quiesce KFD
>>>>>> queues? I'm not sure that is required - you just need to guarantee
>>>>>> faulting clients won't trigger page faults when dma-fence is in
>>>>>> flight.
>>>>>>
>>>>>> Maybe give me an explanation of exactly what the requirements from
>>>>>> AMD
>>>>>> are here so I have better picture.
>>>>>
>>>>> Thank you for the patience, let me try to explain our situation,
>>>>> though
>>>>> I may not get every detail right.
>>>>>
>>>>> XNACK off means hardware that does not have GPU page fault
>>>>> capability (or
>>>>> turned off)
>>>>>
>>>>> So for these GPUs, ALL page table entries must be fully populated
>>>>> before
>>>>> the GPU can access the memory. This is why we need the ioctl driven
>>>>> mapping path: when userspace calls SET_ATTR with ACCESS=ENABLE, we need to
>>>>> walk the attribute tree and eagerly map all accessible ranges into the
>>>>> GPU page tables. This is functionally similar to what you describe as
>>>>> prefetch IOCTLs / VM bind in Xe.
>>>>>
>>>>> Regarding queue quiesce during page table modification: on XNACK off
>>>>> hardware, because the GPU cannot fault, we must ensure the GPU is
>>>>> completely stopped before modifying any PTE it might be accessing.
>>>>> Otherwise the GPU could access a partially updated page table and
>>>>> hang.
>>>>> The quiesce/resume is the mechanism to guarantee this.
>>>>>
>>>>> I hope that helps clarify the picture.
>>>>>
>>>>
>>>> This clarifies a lot. This is what we’d call in Xe “preemption fence”
>>>> mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
>>>> resume. We don’t actually support SVM in this case; instead, we use
>>>> “userptr binds,” which are built on gpusvm for page collection.
>>>> However,
>>>> we don’t support migrating memory to the device—though we could.
>>>>
>>>> I’d look at how we converted 'userptr' to be based on GPU SVM [2]. In
>>>> this case, don’t maintain a range tree, as those—as you suggest—are
>>>> more
>>>> of an on-demand fault driver concern. Instead, just embed 'struct
>>>> drm_gpusvm_pages' in the VMA struct defined by the IOCTLs..
>>>>
>>>> We could extend this to support migrating 'userptr', but we just
>>>> haven’t
>>>> done that yet—this may be what you want to do in “XNACK off..
>>>>
>>>> [2] https://patchwork.freedesktop.org/series/146553/
>>>>
>>>
>>> Actually we need to switch the xnack mode between on and off, so in
>>> xnack off
>>> mode, the driver operates in "implicit prefetch mode". This may be
>>> due to
>>> compatibility with older hardware and the needs of the UMD runtime. We will
>>> further discuss the handling method under xnack off internally.
>>>
Hi Matt,
I studied the xe_userptr code and the conversion series [2] you
pointed to.
I have a question:
Would it be possible to reuse drm_gpusvm_range to handle hardware
without the GPU fault feature (XNACK off mode)?
Reusing drm_gpusvm_range for the XNACK-off case would simplify our
implementation considerably: it already provides the large page chunk
optimization and can reuse the existing migration infrastructure.
Building this on top of a standalone drm_gpusvm_pages
would mean reimplementing much of what the range layer already offers.
It would also let us keep a single code path for both XNACK modes,
which reduces the maintenance burden and avoids behavioral differences.
Would this direction be acceptable, or do you see concerns with reusing
the range infrastructure for the no-fault case?
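
For context, the reuse I have in mind for the XNACK-off SET_ATTR path is
roughly the loop below (a sketch only; the amdgpu_svm_* helper is a
placeholder from this POC, the drm_gpusvm ctx flags are omitted, and the
attribute interval drives the walk instead of a fault address):

	for (addr = start; addr < end; addr = drm_gpusvm_range_end(range)) {
		range = drm_gpusvm_range_find_or_insert(&svm->gpusvm, addr,
							start, end, &ctx);
		if (IS_ERR(range))
			return PTR_ERR(range);

		err = drm_gpusvm_range_get_pages(&svm->gpusvm, range, &ctx);
		if (err)
			return err;

		/* placeholder: amdgpu_vm_update_range() under the notifier lock */
		err = amdgpu_svm_range_update_gpu_pte(svm, range);
		if (err)
			return err;
	}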
Regards,
Honglei
>>>
>>>>>
>>>>>>
>>>>>>>>>
>>>>>>>>> 4) Invalidation, GC worker, and restore worker
>>>>>>>>>
>>>>>>>>> MMU notifier callbacks (amdgpu_svm_range_invalidate)
>>>>>>>>> handle
>>>>>>>>> three cases based on event type and hardware mode:
>>>>>>>>> - unmap event: clear GPU PTEs in the notifier context,
>>>>>>>>> unmap DMA pages, mark ranges as unmapped, flush TLB,
>>>>>>>>> and enqueue to the GC worker. On XNACK off, also
>>>>>>>>> quiesce KFD queues and schedule rebuild of the
>>>>>>>>> still valid portions that were destroyed together
>>>>>>>>> with
>>>>>>>>> the unmapped subregion.
>>>>>>>>>
>>>>>>>>> - evict on XNACK off:
>>>>>>>>> quiesce KFD queues first, then unmap DMA pages and
>>>>>>>>> enqueue to the restore worker.
>>>>>>>>
>>>>>>>> Is that done through the DMA fence or by talking directly to the
>>>>>>>> MES/HWS?
>>>>>>>
>>>>>>> Currently the KFD queue quiesce/resume APIs are reused, looking
>>>>>>> forward to a
>>>>>>> better solution.
>>>>>>>
>>>>>>
>>>>>> +1
>>>>>>
>>>>>>> Regards,
>>>>>>> Honglei
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> - evict on XNACK on:
>>>>>>>>> clear GPU PTEs, unmap DMA pages, and flush TLB,
>>>>>>>>> but do
>>>>>>>>> not schedule any worker. The GPU will fault on next
>>>>>>>>> access and the fault handler establishes the mapping.
>>>>>>>>>
>>>>>>>>> Not supported feature:
>>>>>>>>> - XNACK on GPU page fault mode
>>>>>>>>> - migration and prefetch feature
>>>>>>>>> - Multi GPU support
>>>>>>>>>
>>>>>>>>> XNACK on enablement is ongoing. The GPUs that support
>>>>>>>>> XNACK on
>>>>>>>>> are currently only accessible to us via remote lab
>>>>>>>>> machines, which slows
>>>>>>>>> down progress.
>>>>>>>>>
>>>>>>>>> Patch overview:
>>>>>>>>>
>>>>>>>>> 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
>>>>>>>>> SET_ATTR/GET_ATTR operations, attribute types, and
>>>>>>>>> related
>>>>>>>>> structs in amdgpu_drm.h.
>>>>>>>>>
>>>>>>>>> 02/12 Core data structures: amdgpu_svm wrapping
>>>>>>>>> drm_gpusvm with
>>>>>>>>> refcount, attr_tree, workqueues, locks, and
>>>>>>>>> callbacks (begin/end_restore, flush_tlb).
>>>>>>>>>
>>>>>>>>> 03/12 Attribute data structures: amdgpu_svm_attrs,
>>>>>>>>> attr_range
>>>>>>>>> (interval tree node), attr_tree, access enum, flag
>>>>>>>>> masks,
>>>>>>>>> and change trigger enum.
>>>>>>>>>
>>>>>>>>> 04/12 Attribute tree operations: interval tree lookup,
>>>>>>>>> insert,
>>>>>>>>> remove, and tree create/destroy lifecycle.
>>>>>>>>>
>>>>>>>>> 05/12 Attribute set: validate UAPI attributes, apply to
>>>>>>>>> internal
>>>>>>>>> attrs, handle hole/existing range with head/tail
>>>>>>>>> splitting,
>>>>>>>>> compute change triggers, and -EAGAIN retry loop.
>>>>>>>>> Implements attr_clear_pages for unmap cleanup and
>>>>>>>>> attr_get.
>>>>>>>>>
>>>>>>>>> 06/12 Range data structures: amdgpu_svm_range extending
>>>>>>>>> drm_gpusvm_range with gpu_mapped state, pending ops,
>>>>>>>>> pte_flags cache, and GC/restore queue linkage.
>>>>>>>>>
>>>>>>>>> 07/12 PTE flags and GPU mapping: simple gpu pte function,
>>>>>>>>> GPU page table update with DMA address, range
>>>>>>>>> mapping loop:
>>>>>>>>> find_or_insert -> get_pages -> validate -> update PTE,
>>>>>>>>> and attribute change driven mapping function.
>>>>>>>>>
>>>>>>>>> 08/12 Notifier and invalidation: synchronous GPU PTE
>>>>>>>>> clear in
>>>>>>>>> notifier context, range removal and overlap cleanup,
>>>>>>>>> rebuild after destroy logic, and MMU event dispatcher
>>>>>>>>>
>>>>>>>>> 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
>>>>>>>>> worker for unmap processing and rebuild, ordered
>>>>>>>>> restore
>>>>>>>>> worker for mapping evicted ranges, and flush/sync
>>>>>>>>> helpers.
>>>>>>>>>
>>>>>>>>> 10/12 Initialization and fini: kmem_cache for range/attr,
>>>>>>>>> drm_gpusvm_init with chunk sizes, XNACK detection, TLB
>>>>>>>>> flush helper, and amdgpu_svm init/close/fini
>>>>>>>>> lifecycle.
>>>>>>>>>
>>>>>>>>> 11/12 IOCTL and fault handler: PASID based SVM lookup
>>>>>>>>> with kref
>>>>>>>>> protection, amdgpu_gem_svm_ioctl dispatcher, and
>>>>>>>>> amdgpu_svm_handle_fault for GPU page fault recovery.
>>>>>>>>>
>>>>>>>>> 12/12 Build integration: Kconfig option
>>>>>>>>> (CONFIG_DRM_AMDGPU_SVM),
>>>>>>>>> Makefile rules, ioctl table registration, and
>>>>>>>>> amdgpu_vm
>>>>>>>>> hooks (init in make_compute, close/fini, fault
>>>>>>>>> dispatch).
>>>>>>>>>
>>>>>>>>> Test result:
>>>>>>>>> on gfx1100(W7900) and gfx943(MI300x)
>>>>>>>>> kfd test: 95%+ passed, same failed cases as the official release
>>>>>>>>> rocr test: all passed
>>>>>>>>> hip catch test: 20 cases failed out of 5366 total, +13
>>>>>>>>> failures vs the official release
>>>>>>>>>
>>>>>>>>> During implementation we identified several challenges / design
>>>>>>>>> questions:
>>>>>>>>>
>>>>>>>>> 1. No range splitting on partial unmap
>>>>>>>>>
>>>>>>>>> drm_gpusvm explicitly does not support range splitting in
>>>>>>>>> drm_gpusvm.c:122.
>>>>>>>>> Partial munmap needs to destroy the entire range
>>>>>>>>> including the valid interval.
>>>>>>>>> GPU fault driven hardware can handle this design by extra
>>>>>>>>> gpu fault handle,
>>>>>>>>> but AMDGPU needs to support XNACK off hardware, this
>>>>>>>>> design requires driver
>>>>>>>>> rebuild the valid part in the removed entire range.
>>>>>>>>> This brings very heavy
>>>>>>>>> restore work in work queue/GC worker: unmap/destroy ->
>>>>>>>>> rebuild(insert and map)
>>>>>>>>> this restore work even heavier than kfd_svm. In previous
>>>>>>>>> driver work queue
>>>>>>>>> only needs to restore or unmap, but in drm_gpusvm driver
>>>>>>>>> needs to unmap and restore.
>>>>>>>>> which brings about more complex logic, heavier worker
>>>>>>>>> queue workload, and
>>>>>>>>> synchronization issues.
>>>>>>
>>>>>> Is this common in the workload you are running? I'm also wondering if
>>>>>> your restore logic / KFDs design is contributing to this actally the
>>>>>> problem.
>>>>>>
>>>>>
>>>>> Honestly, you raise a fair point.
>>>>>
>>>>> We will redesign the logic around the partial munmap, which should
>>>>> eliminate
>>>>> most of this complexity.
>>>>>
>>>>>
>>>>
>>>> +1, yes, test it, but don't optimize for it.
>>>>
>>>>>>>>>
>>>>>>>>> 2. Fault driven vs ioctl driven mapping
>>>>>>>>>
>>>>>>>>> drm_gpusvm is designed around GPU page fault handlers.
>>>>>>>>> The primary entry
>>>>>>>>> point drm_gpusvm_range_find_or_insert() takes a fault_addr.
>>>>>>>>> AMDGPU needs to support IOCTL driven mapping because on
>>>>>>>>> no-XNACK hardware the
>>>>>>>>> GPU cannot fault at all.
>>>>>>
>>>>>> I think we refer to these as prefetch IOCTLs in Xe. Ideally, user
>>>>>> space
>>>>>> issues these so the device does not fault (e.g., prefetch creates
>>>>>> a set
>>>>>> of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
>>>>>> specific VM bind operations.
>>>>>>
>>>>>
>>>>> That is a very helpful way to think about it. Yes, our ioctl driven
>>>>> mapping(xnack off) is essentially equivalent to a prefetch
>>>>> operation. We are
>>>>> trying to improve it.
>>>>>
>>>>
>>>> See above wrt 'userptr'.
>>>
>>> Got it.
>>>
>>>>
>>>>>
>>>>>>>>>
>>>>>>>>> The ioctl path cannot hold mmap_read_lock across the
>>>>>>>>> entire operation
>>>>>>>>> because drm_gpusvm_range_find_or_insert() acquires/
>>>>>>>>> releases it
>>>>>>>>> internally. This creates race windows with MMU
>>>>>>>>> notifiers / workers.
>>>>>>
>>>>>> This is a very intentional choice in the locking design:
>>>>>> mmap_read_lock
>>>>>> is held only in very specific parts of GPU SVM, and the driver should
>>>>>> never need to take this lock.
>>>>>>
>>>>>> Yes, notifiers can race, which is why the GPU fault handler and
>>>>>> prefetch
>>>>>> handler are structured as retry loops when a notifier race is
>>>>>> detected.
>>>>>> In practice, with well-behaved applications, these races should be
>>>>>> rare—but they do occur, and the driver must handle them.
>>>>>>
>>>>>> __xe_svm_handle_pagefault implements the page fault retry loop. VM
>>>>>> bind
>>>>>> prefetch has similar logic, although it is more spread out given
>>>>>> that it
>>>>>> is part of a deeper software pipeline.
>>>>>>
>>>>>> FWIW, holding locks to avoid races was rejected by Sima because we
>>>>>> reasoned it is essentially impossible to guarantee the absence of
>>>>>> races
>>>>>> by holding a lock. CPU page fault handlers are also effectively just
>>>>>> large retry loops.
>>>>>>
>>>>>> So this is one point I believe you will need to fixup driver side.
>>>>>>
>>>>>
>>>>> Understood. Thank you for the detailed explanation and for pointing to
>>>>> __xe_svm_handle_pagefault as a reference. We will restructure both our
>>>>> fault handler and ioctl path to a better retry loop pattern with
>>>>> sequence
>>>>> number race detection.
>>>>>
>>>>
>>>> Yes, the typical pattern is:
>>>>
>>>> - Try to migrate once
>>>> - If you hit a race, give up, evict all memory back to system
>>>> memory, and bind it
>>>>
>>>> Atomics make this tricky because memory must move, but I’m not sure
>>>> “XNACK off” applies here. However, GPU SVM provides a timeslice
>>>> mechanism to ensure the CPU can’t move memory while the GPU needs to
>>>> execute something.
>>>
>>> Understood.
>>>
>>>>
>>>>>>>>>
>>>>>>>>> 3. Multi GPU support
>>>>>>>>>
>>>>>>>>> drm_gpusvm binds one drm_device to one instance. In multi GPU
>>>>>>>>> systems,
>>>>>>>>> each GPU gets an independent instance with its own range tree, MMU
>>>>>>>>> notifiers, notifier_lock, and DMA mappings.
>>>>>>>>>
>>>>>>
>>>>>> This is a part I am absolutely open to fixing. Right now, each
>>>>>> drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
>>>>>> decoupling a GPU SVM instance from a single device, allowing each
>>>>>> drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
>>>>>> device).
>>>>>>
>>>>>> This would give drivers the flexibility to use one GPU SVM
>>>>>> instance per
>>>>>> VM/device instance (as in Xe), or to maintain a single GPU SVM per
>>>>>> CPU
>>>>>> MM.
>>>>>>
>>>>>
>>>>> That would be wonderful! Looking forward to your patch very much!
>>>>>
>>>>
>>>> I can't say I'll code this, but we thought about it as an option and are very
>>>> open to patches which refactor the object model for multiple use cases.
>>>
>>> Understood. I will focus on single GPU first, and once we have a
>>> solid v1, we'd be happy to explore contributing patches for the
>>> multi-device object model refactoring.
>>>
>>
>> I think roughly what would need to be done is:
>>
>> - Move struct drm_gpusvm_pages out of struct drm_gpusvm_range.
>> - Embed either a struct device or a struct drm_device in struct
>> drm_gpusvm_pages.
>> - Drop struct drm_device from struct drm_gpusvm.
>> - Have the driver’s range structure embed one or more struct
>> drm_gpusvm_pages in addition to struct drm_gpusvm_range.
>> - Refactor a few range-based helpers (drm_gpusvm_range_pages_valid,
>> drm_gpusvm_range_get_pages, drm_gpusvm_range_unmap_pages), or simply
>> drop them entirely and update drivers to use the drm_gpusvm_pages
>> helpers instead.
>>
>> Then it is up to the drivers whether struct drm_gpusvm maps to a single
>> device or multiple devices. Either use case seems valid, and giving
>> drivers the option appears to be the right approach, rather than having
>> the common drm_gpusvm layer impose its own constraints.
>>
>> This type of refactor can be done at any time as an independent patch,
>> so feel free to post it whenever and I can verify on the Xe side that
>> everything looks good.
>>
>> Matt
>>
>
> Really thanks for the detailed guidance and steps, it is very clear and
> actionable. I'm excited about this direction, it gives the drivers more
> flexibility. I'll start working on this as soon as possible. Will post
> the multi-device refactor as a standalone series once it's well
> validated. Thanks again for being so open to collaboration!
>
> Regards,
> Honglei
>
>>>>
>>>>>
>>>>>>>>> This may bring huge overhead:
>>>>>>>>> - N x MMU notifier registrations for the same address
>>>>>>>>> range
>>>>>>
>>>>>> The notifier overhead is a real concern. We recently introduced
>>>>>> two-pass
>>>>>> notifiers [1] to speed up multi-device notifiers. At least in Xe, the
>>>>>> TLB invalidations—which are the truly expensive part—can be pipelined
>>>>>> using the two-pass approach. Currently, [1] only implements two-pass
>>>>>> notifiers for userptr, but Xe’s GPU SVM will be updated to use them
>>>>>> shortly.
>>>>>>
>>>>>> [1] https://patchwork.freedesktop.org/series/153280/
>>>>>>
>>>>>
>>>>> Thank you for the pointer to two-pass notifiers. Will study this
>>>>> series.
>>>>>
>>>>>>>>> - N x hmm_range_fault() calls for the same page (KFD: 1x)
>>>>>>
>>>>>> hmm_range_fault is extremely fast compared to the actual migration.
>>>>>> Running hmm_range_fault on a 2MB region using 4KB pages takes less
>>>>>> than 1µs. With THP or large device pages [2] (merged last week), it’s
>>>>>> around 1/20 of a microsecond. So I wouldn’t be too concerned about
>>>>>> this.
>>>>>>
>>>>>> [2] https://patchwork.freedesktop.org/series/163141/
>>>>>>
>>>>>
>>>>> That is very helpful data. Perhaps we worried too much.
>>>>>
>>>>>>>>> - N x DMA mapping memory
>>>>>>
>>>>>> You will always have N x DMA mapping memory if the pages are in
>>>>>> system
>>>>>> memory as the dma-mapping API is per device.
>>>>>
>>>>> Totally agreed.
>>>>>
>>>>>>
>>>>>>>>> - N x invalidation + restore worker scheduling per CPU
>>>>>>>>> unmap event
>>>>>>>>> - N x GPU page table flush / TLB invalidation
>>>>>>
>>>>>> I agree you do not want serialize GPU page table flush / TLB
>>>>>> invalidations. Hence two-pass notifiers [1].
>>>>>
>>>>> Yes, will learn it.
>>>>>
>>>>>>
>>>>>>>>> - Increased mmap_lock hold time, N callbacks serialize
>>>>>>>>> under it
>>>>>>>>>
>>>>>>>>> compatibility issues:
>>>>>>>>> - Quiesce/resume scope mismatch: to integrate with KFD
>>>>>>>>> compute
>>>>>>>>> queues, the driver reuses kgd2kfd_quiesce_mm()/
>>>>>>>>> resume_mm()
>>>>>>>>> which have process level semantics. Under the per GPU
>>>>>>>>> drm_gpusvm model, maybe there are some issues on
>>>>>>>>> sync. To properly
>>>>>>>>> integrate with KFD under the per SVM model, a
>>>>>>>>> compatibility or
>>>>>>>>> new per VM level queue control APIs maybe need to
>>>>>>>>> introduced.
>>>>>>>>>
>>>>>>
>>>>>> I thought the idea to get rid of KFD and move over to AMDGPU? I
>>>>>> thought
>>>>>> Christian mentioned this to me at XDC.
>>>>>>
>>>>>
>>>>>>>>> Migration challenges:
>>>>>>>>>
>>>>>>>>> - No global migration decision logic: each per GPU SVM
>>>>>>>>> instance maintains its own attribute tree
>>>>>>>>> independently. This
>>>>>>>>> allows conflicting settings (e.g., GPU0's SVM sets
>>>>>>>>> PREFERRED_LOC=GPU0 while GPU1's SVM sets
>>>>>>>>> PREFERRED_LOC=GPU1
>>>>>>>>> for the same address range) with no detection or
>>>>>>>>> resolution.
>>>>>>>>> A global attribute coordinator or a shared manager is
>>>>>>>>> needed to
>>>>>>>>> provide a unified global view for migration decisions
>>>>>>
>>>>>> Yes, this is hole in the Xe API too. We have told UMDs if they setup
>>>>>> individual VMs with conflict attributes for a single CPU address
>>>>>> space
>>>>>> the behavior is undefined. Our UMD implement madvise is basically
>>>>>> loop
>>>>>> over al GPU VMs setting the same attributes.
>>>>>
>>>>> Will follow the same approach for now, the UMD is responsible for
>>>>> setting
>>>>> consistent attributes across GPU VMs.
>>>>>
>>>>
>>>> +1
>>>>
>>>>>>
>>>>>>>>>
>>>>>>>>> - migrate_vma_setup broadcast: one GPU's migration
>>>>>>>>> triggers MMU
>>>>>>>>> notifier callbacks in ALL N-1 other drm_gpusvm instances,
>>>>>>>>> causing N-1 unnecessary restore workers to be
>>>>>>>>> scheduled. And
>>>>>>
>>>>>> My feeling is that you shouldn’t reschedule restore workers unless
>>>>>> you
>>>>>> actually have to invalidate page tables (i.e., you have a local SVM
>>>>>> range within the notifier). So the first migration to an untouched
>>>>>> region may trigger notifiers, but they won’t do anything because you
>>>>>> don’t have any valid SVM ranges yet. Subsequent mappings of the
>>>>>> migrated
>>>>>> region won’t trigger a notifier unless the memory is moved again.
>>>>>>
>>>>>
>>>>> That is a very good point. We should check whether we actually have
>>>>> valid SVM ranges before scheduling restore workers. If there is
>>>>> nothing
>>>>> to invalidate, the notifier callback should be a no-op. We will review
>>>>> our notifier callback logic to ensure we are not doing unnecessary
>>>>> work
>>>>> here. Thank you for pointing this out.
>>>>>
>>>>>>>>> creates races between the initiating migration and the
>>>>>>>>> other
>>>>>>>>> instance's restore attempts.
>>>>>>
>>>>>> Yes, if multiple devices try to migrate the same CPU pages at the
>>>>>> same
>>>>>> time, that will race. That’s why in Xe we have a module-level
>>>>>> driver_migrate_lock. The first migration runs in read mode; if it
>>>>>> detects a race and aborts, it then takes driver_migrate_lock in write
>>>>>> mode so it becomes the only device allowed to move memory / CPU
>>>>>> pages.
>>>>>> See xe_svm_alloc_vram() for how this is used.
>>>>>>
>>>>>> I’m not sure this approach will work for you, but I just wanted to
>>>>>> point
>>>>>> out that we identified this as a potential issue.
>>>>>>
>>>>>
>>>>> Thank you for sharing the driver_migrate_lock approach and pointing to
>>>>> xe_svm_alloc_vram(). Will explore whether a similar lock pattern
>>>>> can work
>>>>> for our case.
>>>>>
>>>>>>>>>
>>>>>>>>> - No cross instance migration serialization: each per GPU
>>>>>>>>> drm_gpusvm instance has independent locking, so two GPUs'
>>>>>>>>> "decide -> migrate -> remap" sequences can interleave.
>>>>>>>>> While
>>>>>>>>> the kernel page lock prevents truly simultaneous
>>>>>>>>> migration of
>>>>>>>>> the same physical page, the losing side's retry (evict
>>>>>>>>> from
>>>>>>>>> other GPU's VRAM -> migrate back) triggers broadcast
>>>>>>>>> notifier
>>>>>>>>> invalidations and restore workers, compounding the ping
>>>>>>>>> pong
>>>>>>>>> problem above.
>>>>>>>>>
>>>>>>
>>>>>> See the driver_migrate_lock above.
>>>>>
>>>>> Acknowledged, thank you.
>>>>>>
>>>>>>>>> - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
>>>>>>>>> hardcodes MIGRATE_VMA_SELECT_SYSTEM
>>>>>>>>> (drm_pagemap.c:328), meaning
>>>>>>>>> it only selects system memory pages for migration.
>>>>>>>>>
>>>>>>
>>>>>> I think this is fixed? We did find some core MM bugs that blocked
>>>>>> VRAM
>>>>>> to VRAM but those have been worked out.
>>>>>>
>>>>>> The code I'm looking at:
>>>>>>
>>>>>> 517 int drm_pagemap_migrate_to_devmem(struct
>>>>>> drm_pagemap_devmem *devmem_allocation,
>>>>>> 518 struct mm_struct *mm,
>>>>>> 519 unsigned long start,
>>>>>> unsigned long end,
>>>>>> 520 const struct
>>>>>> drm_pagemap_migrate_details *mdetails)
>>>>>> 521 {
>>>>>> 522 const struct drm_pagemap_devmem_ops *ops =
>>>>>> devmem_allocation->ops;
>>>>>> 523 struct drm_pagemap *dpagemap = devmem_allocation-
>>>>>> >dpagemap;
>>>>>> 524 struct dev_pagemap *pagemap = dpagemap->pagemap;
>>>>>> 525 struct migrate_vma migrate = {
>>>>>> 526 .start = start,
>>>>>> 527 .end = end,
>>>>>> 528 .pgmap_owner = pagemap->owner,
>>>>>> 529 .flags =
>>>>>> MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
>>>>>> 530 MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
>>>>>> MIGRATE_VMA_SELECT_COMPOUND,
>>>>>> 531 };
>>>>>>
>>>>>
>>>>> Thank you for checking! I am using v6.18 for this POC, missed the
>>>>> fix, will
>>>>> rebase to the latest.
>>>>>
>>>>>
>>>>>>>>> - CPU fault reverse migration race: CPU page fault triggers
>>>>>>>>> migrate_to_ram while GPU instances are concurrently
>>>>>>>>> operating.
>>>>>>>>> Per GPU notifier_lock does not protect cross GPU
>>>>>>>>> operations.
>>>>>>
>>>>>> No, again retry loop as discussed above.
>>>>>
>>>>> Understood.
>>>>>
>>>>>>
>>>>>>>>>
>>>>>>>>> We believe a strong, well designed solution at the framework
>>>>>>>>> level is
>>>>>>>>> needed to properly address these problems, and we look forward to
>>>>>>>>> discussion and suggestions.
>>>>>>
>>>>>> Let's work together to figure out what is missing here.
>>>>>
>>>>> Thank you so much, Matt. Your feedback has been incredibly valuable
>>>>> and
>>>>> has given us a much clearer picture of the framework's design.
>>>>> I really appreciate the effort you put into building drm_gpusvm as a
>>>>> shared framework. Will incorporate your suggestions into our next
>>>>> revision and look forward to continuing the collaboration.
>>>>>
>>>>
>>>> No problem. Happy to help.
>>>
>>> Thank you again for all the detailed feedback.
>>>
>>> Regards,
>>> Honglei
>>>
>>>>
>>>> Matt
>>>>
>>>>> Regards,
>>>>> Honglei
>>>>>
>>>>>
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>>>>
>>>>>>>>> Honglei Huang (12):
>>>>>>>>> drm/amdgpu: add SVM UAPI definitions
>>>>>>>>> drm/amdgpu: add SVM data structures and header
>>>>>>>>> drm/amdgpu: add SVM attribute data structures
>>>>>>>>> drm/amdgpu: implement SVM attribute tree operations
>>>>>>>>> drm/amdgpu: implement SVM attribute set
>>>>>>>>> drm/amdgpu: add SVM range data structures
>>>>>>>>> drm/amdgpu: implement SVM range PTE flags and GPU mapping
>>>>>>>>> drm/amdgpu: implement SVM range notifier and invalidation
>>>>>>>>> drm/amdgpu: implement SVM range workers
>>>>>>>>> drm/amdgpu: implement SVM core initialization and fini
>>>>>>>>> drm/amdgpu: implement SVM ioctl and fault handler
>>>>>>>>> drm/amdgpu: wire up SVM build system and fault handler
>>>>>>>>>
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 +++++
>>>>>>>>> +++++++
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++
>>>>>>>>> ++++++++++++
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
>>>>>>>>> include/uapi/drm/amdgpu_drm.h | 39 +
>>>>>>>>> 12 files changed, 2958 insertions(+), 4 deletions(-)
>>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
>>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
>>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/
>>>>>>>>> amdgpu_svm_attr.c
>>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/
>>>>>>>>> amdgpu_svm_attr.h
>>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/
>>>>>>>>> amdgpu_svm_range.c
>>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/
>>>>>>>>> amdgpu_svm_range.h
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
>>>>>>>>
>>>>>>>
>>>>>
>>>
>
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-04-15 10:04 ` Huang, Honglei1
@ 2026-04-23 6:40 ` Matthew Brost
2026-04-23 7:18 ` Matthew Brost
0 siblings, 1 reply; 36+ messages in thread
From: Matthew Brost @ 2026-04-23 6:40 UTC (permalink / raw)
To: Huang, Honglei1
Cc: Christian König, amd-gfx, dri-devel, Alexander.Deucher,
Felix.Kuehling, Honglei Huang, Oak.Zeng, Jenny-Jing.Liu,
Philip.Yang, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu, Junhua.Shen,
Thomas Hellström, Rodrigo Vivi, Danilo Krummrich, Alice Ryhl
On Wed, Apr 15, 2026 at 06:04:11PM +0800, Huang, Honglei1 wrote:
>
>
> On 3/26/2026 8:16 PM, Honglei Huang wrote:
> >
> >
> > On 3/26/26 06:24, Matthew Brost wrote:
> > > On Tue, Mar 24, 2026 at 03:24:43PM +0800, Honglei Huang wrote:
> > > >
> > > >
> > > > On 3/23/26 14:31, Matthew Brost wrote:
> > > > > On Thu, Mar 19, 2026 at 10:17:36PM +0800, Honglei Huang wrote:
> > > > > >
> > > > > >
> > > > > > On 3/19/26 13:08, Matthew Brost wrote:
> > > > > > > On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
> > > > > > > >
> > > > > > >
> > > > > > > Disclaimer I haven't look at any code in this series yet.
> > > > > > >
> > > > > > > >
> > > > > > > > On 3/17/26 19:48, Christian König wrote:
> > > > > > > > > Adding a few XE and drm_gpuvm people on TO.
> > > > > > > > >
> > > > > > > > > On 3/17/26 12:29, Honglei Huang wrote:
> > > > > > > > > > From: Honglei Huang <honghuan@amd.com>
> > > > > > > > > >
> > > > > > > > > > This is a POC/draft patch series of SVM
> > > > > > > > > > feature in amdgpu based on the
> > > > > > > > > > drm_gpusvm framework. The primary
> > > > > > > > > > purpose of this RFC is to validate
> > > > > > > > > > the framework's applicability, identify implementation challenges,
> > > > > > > > > > and start discussion on framework
> > > > > > > > > > evolution. This is not a production
> > > > > > >
> > > > > > > +1. Open to any ideas. Given this was designed
> > > > > > > originally for Xe we very
> > > > > > > well could have missed other drivers requirements.
> > > > > > Hi Matt,
> > > > > >
> > > > > > Thank you for the openness. And thank you so much for the incredibly
> > > > > > detailed and patient response. I really appreciate you
> > > > > > taking the time to
> > > > > > walk through each point.
> > > > > >
> > > > >
> > > > > I'm here to help.
> > > > >
> > > > > > Actually I am still a learner when it comes to the
> > > > > > drm_gpusvm framework and
> > > > > > GPU SVM design in general. Some of my descriptions below
> > > > > > may not be entirely
> > > > > > accurate. But I really want to bring drm_gpusvm into
> > > > > > amdgpu and make it work
> > > > > > well.
> > > > >
> > > > > I appreciate another driver jumping in and using this framework—it
> > > > > becomes easier to validate as more users adopt it.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > > > ready submission.
> > > > > > > > > >
> > > > > > > > > > This patch series implements basic SVM
> > > > > > > > > > support with the following features:
> > > > > > > > > >
> > > > > > > > > > 1. attributes sepatarated from physical page management:
> > > > > > > > > >
> > > > > > > > > > - Attribute layer
> > > > > > > > > > (amdgpu_svm_attr_tree): a driver side
> > > > > > > > > > interval
> > > > > > > > > > tree that stores SVM
> > > > > > > > > > attributes. Managed through the
> > > > > > > > > > SET_ATTR,
> > > > > > > > > > and mmu notifier callback.
> > > > > > >
> > > > > > > Can you explain the mmu notifier callback
> > > > > > > interaction here? See below in
> > > > > > > Xe the attribute tree is existing VMA tree (gpuvm).
> > > > > > >
> > > > > >
> > > > > > Let me try to explain, apologies if the description is not fully
> > > > > > precise.
> > > > > >
> > > > > > In current implementation, the MMU notifier callback
> > > > > > interacts with the attr
> > > > > > tree only in the munmap path remove the corresponding attribute
> > > > > > entries from the attr tree so that stale attributes do not persist for
> > > > > > freed address space.
> > > > > >
> > > > >
> > > > > Ah, yes. We reset our attributes upon munmap too. We
> > > > > actually don't this
> > > > > 100% correct quite either and series in flight to fix [1].
> > > > >
> > > > > [1] https://patchwork.freedesktop.org/series/161815/
> > > > >
> > > >
> > > > I studied [1]. This draft has a simliar mechanism to handle
> > > > attributes when
> > > > munmap. But there are some sligt differences in detail, maybe casued by
> > > > different UMD runtime behaviors.
> > > >
> > > >
> > > > > > > > > >
> > > > > > > > > > - Physical page layer (drm_gpusvm ranges): managed by the
> > > > > > > > > > drm_gpusvm framework, representing actual HMM backed DMA
> > > > > > > > > > mappings and GPU page table entries.
> > > > > > > > > >
> > > > > > > > > > This separation is necessary:
> > > > > > > > > > - The framework does not
> > > > > > > > > > support range splitting, so a partial
> > > > > > > > > > munmap destroys the entire
> > > > > > > > > > overlapping range, including the
> > > > > > > > > > still valid parts. If
> > > > > > > > > > attributes were stored inside drm_gpusvm
> > > > > > > > > > ranges, they would be lost on unmapping.
> > > > > > > > > > The separate attr tree
> > > > > > > > > > preserves userspace set attributes
> > > > > > > > > > across range operations.
> > > > > > >
> > > > > > > Yes, in Xe the divide is at the VMA level (set by user space) via VM
> > > > > > > bind (parts of VM may be mappings BOs, parts could
> > > > > > > be setup for SVM) or
> > > > > > > madvise IOCTLs which reflect user space attributes on current SVM
> > > > > > > mappings or future ones.
> > > > > > >
> > > > > > > The SVM range tree reflects mappings that have been faulted into the
> > > > > > > device and contain pages. This is an intentional choice.
> > > > > >
> > > > > > That makes a lot of sense. Thank you for clarifying the
> > > > > > design intent. I
> > > > > > think the current adopt the same principle: the
> > > > > > drm_gpusvm range tree only
> > > > > > reflect actual faulted in mappings.
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > Isn't that actually intended? When parts of
> > > > > > > > > the range unmap then that usually means the
> > > > > > > > > whole range isn't valid any more.
> > > > > > >
> > > > > > >
> > > > > > > Yes, this was an intentional design choice to not
> > > > > > > support partial unmap,
> > > > > > > and instead rely on the driver to recreate a new range.
> > > > > > >
> > > > > > > The reasoning is:
> > > > > > >
> > > > > > > - In practice, this should be rare for well-behaved applications.
> > > > > > >
> > > > > > > - With THP / large device pages, if a sub-range is
> > > > > > > unmapped, the entire
> > > > > > > GPU mapping is invalidated anyway due to the page size change. As a
> > > > > > > result, the cost of creating a new range is minimal, since the device
> > > > > > > will likely fault again on the remaining pages.
> > > > > > >
> > > > > > > So there is no need to over-engineer the common code.
> > > > > > >
> > > > > > > FWIW, to even test partial unmaps in Xe, I had to do things I doubt
> > > > > > > anyone would ever do:
> > > > > > >
> > > > > > > ptr = mmap(SZ_2M);
> > > > > > > /* fault in memory to the device */
> > > > > > > munmap(ptr, SZ_1M);
> > > > > > > /* touch memory again on the device */
> > > > > > >
> > > > > >
> > > > > > Thank you for this explanation and the concrete example. After further
> > > > > > discussion internally with Christian, we are now aligned
> > > > > > with same position
> > > > > > partial unmap. Will remove rebuild on partial unmap logic in the next
> > > > > > version and handle it as only partially backed range.
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > It is about partial unmap, some subregion in
> > > > > > > > drm_gpusvm_range is still valid
> > > > > > > > but some other subregion is invalid, but under
> > > > > > > > drm_gpusvm, need to destroy
> > > > > > > > the entire range.
> > > > > > > >
> > > > > > > > e.g.:
> > > > > > > >
> > > > > > > > [---------------unmap region in mmu
> > > > > > > > notifier-----------------]
> > > > > > > > [0x1000 ------------ 0x9000]
> > > > > > > > [ valid ][ invalid ]
> > > > > > > >
> > > > > > > > see deatil in drm_gpusvm.c:110 line
> > > > > > > > section:Partial Unmapping of Ranges
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > - drm_gpusvm range
> > > > > > > > > > boundaries are determined by fault
> > > > > > > > > > address
> > > > > > > > > > and pre setted chunk size,
> > > > > > > > > > not by userspace attribute boundaries.
> > > > > > > > > > Ranges may be rechunked
> > > > > > > > > > on memory changes. Embedding
> > > > > > > > > > attributes in framework
> > > > > > > > > > ranges would scatter attr state
> > > > > > > > > > across many small ranges
> > > > > > > > > > and require complex reassemble
> > > > > > > > > > logic when operate attrbute.
> > > > > > > > >
> > > > > > > > > Yeah, that makes a lot of sense.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 2) System memory mapping via drm_gpusvm
> > > > > > > > > >
> > > > > > > > > > The core mapping path uses
> > > > > > > > > > drm_gpusvm_range_find_or_insert() to
> > > > > > > > > > create ranges,
> > > > > > > > > > drm_gpusvm_range_get_pages() for HMM
> > > > > > > > > > page fault
> > > > > > > > > > and DMA mapping, then updates GPU page tables via
> > > > > > > > > > amdgpu_vm_update_range().
> > > > > > > > > >
> > > > > > > > > > 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
> > > > > > > > > >
> > > > > > > > > > On XNACK off hardware the GPU
> > > > > > > > > > cannot recover from page faults,
> > > > > > > > > > so mappings must be established through ioctl. When
> > > > > > > > > > userspace calls SET_ATTR with ACCESS=ENABLE, the driver
> > > > > > > > > > walks the attr tree and maps all accessible intervals
> > > > > > > > > > to the GPU by amdgpu_svm_range_map_attr_ranges().
> > > > > > >
> > > > > > > Can you expand on XNACK off / GPU no faults? Is this to the share GPU
> > > > > > > between 3D (dma-fences) and faulting clients? We
> > > > > > > have something similar
> > > > > > > in Xe, but it isn't an explicit IOCTL rather we
> > > > > > > switch between on demand
> > > > > > > as 3D client submits and then resumes page faults when all dma-fences
> > > > > > > have signaled.
> > > > > > >
> > > > > > > I see below you mention page tables are modified during quiesce KFD
> > > > > > > queues? I'm not sure that is required - you just need to guarnette
> > > > > > > faulting clients won't trigger page faults when
> > > > > > > dma-fence is in flight.
> > > > > > >
> > > > > > > Maybe give me an explaination of exactly what the
> > > > > > > requirement from AMD
> > > > > > > are here so I have better picture.
> > > > > >
> > > > > > Thank you for the patience, let me try to explain our
> > > > > > situation, though
> > > > > > I may not get every detail right.
> > > > > >
> > > > > > XNACK off means hardware that does not have GPU page
> > > > > > fault capability (or
> > > > > > turned off)
> > > > > >
> > > > > > So for these GPUs, ALL page table entries must be fully
> > > > > > populated before
> > > > > > the GPU can access the memory. This is why we need the ioctl driven
> > > > > > mapping path, when userspace calls SET_ATTR with ACCESS=ENABLE, need
> > > > > > walk the attribute tree and eagerly map all accessible ranges into the
> > > > > > GPU page tables. This is functionally similar to what you describe as
> > > > > > prefetch IOCTLs / VM bind in Xe.
> > > > > >
> > > > > > Regarding queue quiesce during page table modification: on XNACK off
> > > > > > hardware, because the GPU cannot fault, we must ensure the GPU is
> > > > > > completely stopped before modifying any PTE it might be accessing.
> > > > > > Otherwise the GPU could access a partially updated page
> > > > > > table and hang.
> > > > > > The quiesce/resume is the mechanism to guarantee this.
> > > > > >
> > > > > > I hope that helps clarify the picture.
> > > > > >
> > > > >
> > > > > This clarifies a lot. This is what we’d call in Xe “preemption fence”
> > > > > mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
> > > > > resume. We don’t actually support SVM in this case; instead, we use
> > > > > “userptr binds,” which are built on gpusvm for page
> > > > > collection. However,
> > > > > we don’t support migrating memory to the device—though we could.
> > > > >
> > > > > I’d look at how we converted 'userptr' to be based on GPU SVM [2]. In
> > > > > this case, don’t maintain a range tree, as those—as you
> > > > > suggest—are more
> > > > > of an on-demand fault driver concern. Instead, just embed 'struct
> > > > > drm_gpusvm_pages' in the VMA struct defined by the IOCTLs..
> > > > >
> > > > > We could extend this to support migrating 'userptr', but we
> > > > > just haven’t
> > > > > done that yet—this may be what you want to do in “XNACK off..
> > > > >
> > > > > [2] https://patchwork.freedesktop.org/series/146553/
> > > > >
> > > >
> > > > Actually we need to swith the xnack mode between on and off, so
> > > > in xnack off
> > > > mode, the driver operats in "implicit prefetch mode". This may
> > > > be due to
> > > > compatibility with older hardware and the need for UMD runtime. We will
> > > > further discuss the handling method under xnack off internally.
> > > >
>
> Hi Matt,
>
> I studied the xe_userptr code and the conversion series [2] you
> pointed to.
>
> I have a question that:
> Would it be possible to reuse drm_gpusvm_range to handle the hardware
> without gpu fault feature(xnack off mode).
That’s not how we’ve done it. We embedded drm_gpusvm_pages into our VMA
structure and then attached a notifier. The notifier attachment is
open-coded on the Xe side, and this could be normalized and opened up
for common driver use cases.
The problem with reusing drm_gpusvm_range directly is that a VMA may
span multiple gpusvm notifiers—i.e., it can be larger than the notifier
size. Of course, we could rework this as well.
So either way, the Xe userptr + gpusvm implementation should be refined
further for common driver use.
>
> Reusing drm_gpusvm_range for the XNACK-off case would simplify our
> implementation considerably, it already provides large page chunk
> optimization, can reuse the existing migration infrastructure.
>
> Building these on top of a standalone drm_gpusvm_pages
> would mean reimplementing much of what the range layer already offers.
> It would also let us keep a single code path for both XNACK modes,
> which reduces maintenance burden and avoids behavioral difference.
>
> Would this direction be acceptable, or do you see concerns with reusing
> the range infrastructure for the no-fault case?
>
If you prefer something like "insert a range exactly here" + create range
+ notifier, I think that's a completely reasonable direction, and Xe would
likely switch over to using it.
I guess my only concern is sub-userptr migration. We are trending
towards allowing userptrs to be migrated either via prefetch IOCTLs
or access counters on the GPU side - with access counters we'd likely
migrate a single 2M page at a time within the userptr. get_pages()
supports mixed mappings between VRAM + system but likely needs some
more work to really make this complete though.
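To make the "insert a range exactly here" direction concrete, a rough
sketch of what it might look like (the helper below is hypothetical, not
existing drm_gpusvm API; svm, attr and ctx are assumed driver-side
objects; drm_gpusvm_range_get_pages() is the existing page-collection
entry point):

/*
 * Hypothetical helper: insert a range at an exact [start, end) chosen by
 * the driver (e.g. from an attribute tree on XNACK-off hardware) instead
 * of deriving it from a fault address, creating the backing notifier as
 * needed.
 */
struct drm_gpusvm_range *
drm_gpusvm_range_insert_exact(struct drm_gpusvm *gpusvm,
                              unsigned long start, unsigned long end);

/* XNACK-off / prefetch-style usage sketch */
range = drm_gpusvm_range_insert_exact(&svm->gpusvm, attr->start, attr->end);
if (IS_ERR(range))
        return PTR_ERR(range);
err = drm_gpusvm_range_get_pages(&svm->gpusvm, range, &ctx);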
Matt
> Regards,
> Honglei
>
>
>
> > > >
> > > > > >
> > > > > > >
> > > > > > > > > >
> > > > > > > > > > 4) Invalidation, GC worker, and restore worker
> > > > > > > > > >
> > > > > > > > > > MMU notifier callbacks
> > > > > > > > > > (amdgpu_svm_range_invalidate) handle
> > > > > > > > > > three cases based on event type and hardware mode:
> > > > > > > > > > - unmap event: clear GPU PTEs in the notifier context,
> > > > > > > > > > unmap DMA pages, mark ranges as unmapped, flush TLB,
> > > > > > > > > > and enqueue to the GC worker. On XNACK off, also
> > > > > > > > > > quiesce KFD queues and schedule rebuild of the
> > > > > > > > > > still valid portions that
> > > > > > > > > > were destroyed together with
> > > > > > > > > > the unmapped subregion.
> > > > > > > > > >
> > > > > > > > > > - evict on XNACK off:
> > > > > > > > > > quiesce KFD queues first, then unmap DMA pages and
> > > > > > > > > > enqueue to the restore worker.
> > > > > > > > >
> > > > > > > > > Is that done through the DMA fence or by
> > > > > > > > > talking directly to the MES/HWS?
> > > > > > > >
> > > > > > > > Currently KFD queues quiesce/resume API are
> > > > > > > > reused, lookig forward to a
> > > > > > > > better solution.
> > > > > > > >
> > > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > > Regards,
> > > > > > > > Honglei
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Christian.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > - evict on XNACK on:
> > > > > > > > > > clear GPU PTEs, unmap DMA
> > > > > > > > > > pages, and flush TLB, but do
> > > > > > > > > > not schedule any worker. The GPU will fault on next
> > > > > > > > > > access and the fault handler establishes the mapping.
> > > > > > > > > >
> > > > > > > > > > Not supported feature:
> > > > > > > > > > - XNACK on GPU page fault mode
> > > > > > > > > > - migration and prefetch feature
> > > > > > > > > > - Multi GPU support
> > > > > > > > > >
> > > > > > > > > > XNACK on enablement is ongoing.The
> > > > > > > > > > GPUs that support XNACK on
> > > > > > > > > > are currently only accessible to
> > > > > > > > > > us via remote lab machines, which slows
> > > > > > > > > > down progress.
> > > > > > > > > >
> > > > > > > > > > Patch overview:
> > > > > > > > > >
> > > > > > > > > > 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
> > > > > > > > > > SET_ATTR/GET_ATTR
> > > > > > > > > > operations, attribute types, and related
> > > > > > > > > > structs in amdgpu_drm.h.
> > > > > > > > > >
> > > > > > > > > > 02/12 Core data structures:
> > > > > > > > > > amdgpu_svm wrapping drm_gpusvm with
> > > > > > > > > > refcount, attr_tree, workqueues, locks, and
> > > > > > > > > > callbacks (begin/end_restore, flush_tlb).
> > > > > > > > > >
> > > > > > > > > > 03/12 Attribute data structures:
> > > > > > > > > > amdgpu_svm_attrs, attr_range
> > > > > > > > > > (interval tree node),
> > > > > > > > > > attr_tree, access enum, flag masks,
> > > > > > > > > > and change trigger enum.
> > > > > > > > > >
> > > > > > > > > > 04/12 Attribute tree operations:
> > > > > > > > > > interval tree lookup, insert,
> > > > > > > > > > remove, and tree create/destroy lifecycle.
> > > > > > > > > >
> > > > > > > > > > 05/12 Attribute set: validate UAPI
> > > > > > > > > > attributes, apply to internal
> > > > > > > > > > attrs, handle hole/existing
> > > > > > > > > > range with head/tail splitting,
> > > > > > > > > > compute change triggers, and -EAGAIN retry loop.
> > > > > > > > > > Implements attr_clear_pages
> > > > > > > > > > for unmap cleanup and attr_get.
> > > > > > > > > >
> > > > > > > > > > 06/12 Range data structures: amdgpu_svm_range extending
> > > > > > > > > > drm_gpusvm_range with gpu_mapped state, pending ops,
> > > > > > > > > > pte_flags cache, and GC/restore queue linkage.
> > > > > > > > > >
> > > > > > > > > > 07/12 PTE flags and GPU mapping: simple gpu pte function,
> > > > > > > > > > GPU page table update with
> > > > > > > > > > DMA address, range mapping loop:
> > > > > > > > > > find_or_insert -> get_pages -> validate -> update PTE,
> > > > > > > > > > and attribute change driven mapping function.
> > > > > > > > > >
> > > > > > > > > > 08/12 Notifier and invalidation:
> > > > > > > > > > synchronous GPU PTE clear in
> > > > > > > > > > notifier context, range removal and overlap cleanup,
> > > > > > > > > > rebuild after destroy logic, and MMU event dispatcher
> > > > > > > > > >
> > > > > > > > > > 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
> > > > > > > > > > worker for unmap processing
> > > > > > > > > > and rebuild, ordered restore
> > > > > > > > > > worker for mapping evicted ranges, and flush/sync
> > > > > > > > > > helpers.
> > > > > > > > > >
> > > > > > > > > > 10/12 Initialization and fini: kmem_cache for range/attr,
> > > > > > > > > > drm_gpusvm_init with chunk sizes, XNACK detection, TLB
> > > > > > > > > > flush helper, and amdgpu_svm
> > > > > > > > > > init/close/fini lifecycle.
> > > > > > > > > >
> > > > > > > > > > 11/12 IOCTL and fault handler:
> > > > > > > > > > PASID based SVM lookup with kref
> > > > > > > > > > protection, amdgpu_gem_svm_ioctl dispatcher, and
> > > > > > > > > > amdgpu_svm_handle_fault for GPU page fault recovery.
> > > > > > > > > >
> > > > > > > > > > 12/12 Build integration: Kconfig
> > > > > > > > > > option (CONFIG_DRM_AMDGPU_SVM),
> > > > > > > > > > Makefile rules, ioctl table
> > > > > > > > > > registration, and amdgpu_vm
> > > > > > > > > > hooks (init in make_compute,
> > > > > > > > > > close/fini, fault dispatch).
> > > > > > > > > >
> > > > > > > > > > Test result:
> > > > > > > > > > on gfx1100(W7900) and gfx943(MI300x)
> > > > > > > > > > kfd test: 95%+ passed, same failed cases with offical relase
> > > > > > > > > > rocr test: all passed
> > > > > > > > > > hip catch test: 20 cases failed in
> > > > > > > > > > all 5366 cases, +13 failures vs offical
> > > > > > > > > > relase
> > > > > > > > > >
> > > > > > > > > > During implementation we identified
> > > > > > > > > > several challenges / design questions:
> > > > > > > > > >
> > > > > > > > > > 1. No range splitting on partial unmap
> > > > > > > > > >
> > > > > > > > > > drm_gpusvm explicitly does not
> > > > > > > > > > support range splitting in
> > > > > > > > > > drm_gpusvm.c:122.
> > > > > > > > > > Partial munmap needs to destroy
> > > > > > > > > > the entire range including the valid
> > > > > > > > > > interval.
> > > > > > > > > > GPU fault driven hardware can
> > > > > > > > > > handle this design by extra gpu fault
> > > > > > > > > > handle,
> > > > > > > > > > but AMDGPU needs to support XNACK
> > > > > > > > > > off hardware, this design requires
> > > > > > > > > > driver
> > > > > > > > > > rebuild the valid part in the
> > > > > > > > > > removed entire range. Whichs bring a
> > > > > > > > > > very heavy
> > > > > > > > > > restore work in work queue/GC
> > > > > > > > > > worker: unmap/destroy -> rebuild(insert
> > > > > > > > > > and map)
> > > > > > > > > > this restore work even heavier
> > > > > > > > > > than kfd_svm. In previous driver work
> > > > > > > > > > queue
> > > > > > > > > > only needs to restore or unmap,
> > > > > > > > > > but in drm_gpusvm driver needs to unmap
> > > > > > > > > > and restore.
> > > > > > > > > > which brings about more complex
> > > > > > > > > > logic, heavier worker queue workload,
> > > > > > > > > > and
> > > > > > > > > > synchronization issues.
> > > > > > >
> > > > > > > Is this common in the workload you are running? I'm also wondering if
> > > > > > > your restore logic / KFDs design is contributing to this actally the
> > > > > > > problem.
> > > > > > >
> > > > > >
> > > > > > Honestly, you raise a fair point.
> > > > > >
> > > > > > We will redesign the logic about the partial munap,
> > > > > > which should eliminate
> > > > > > most of this complexity.
> > > > > >
> > > > > >
> > > > >
> > > > > +1, yes test but do optimize for.
> > > > >
> > > > > > > > > >
> > > > > > > > > > 2. Fault driven vs ioctl driven mapping
> > > > > > > > > >
> > > > > > > > > > drm_gpusvm is designed around GPU
> > > > > > > > > > page fault handlers. The primary entry
> > > > > > > > > > point drm_gpusvm_range_find_or_insert() takes a fault_addr.
> > > > > > > > > > AMDGPU needs to support IOCTL
> > > > > > > > > > driven mapping cause No XNACK hardware
> > > > > > > > > > that
> > > > > > > > > > GPU cannot fault at all
> > > > > > >
> > > > > > > I think we refer to these as prefetch IOCTLs in Xe.
> > > > > > > Ideally, user space
> > > > > > > issues these so the device does not fault (e.g.,
> > > > > > > prefetch creates a set
> > > > > > > of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
> > > > > > > specific VM bind operations.
> > > > > > >
> > > > > >
> > > > > > That is a very helpful way to think about it. Yes, our ioctl driven
> > > > > > mapping(xnack off) is essentially equivalent to a
> > > > > > prefetch operation. We are
> > > > > > trying to improve it.
> > > > > >
> > > > >
> > > > > See above wrt 'userptr'.
> > > >
> > > > Got it.
> > > >
> > > > >
> > > > > >
> > > > > > > > > >
> > > > > > > > > > The ioctl path cannot hold
> > > > > > > > > > mmap_read_lock across the entire
> > > > > > > > > > operation
> > > > > > > > > > because
> > > > > > > > > > drm_gpusvm_range_find_or_insert()
> > > > > > > > > > acquires/ releases it
> > > > > > > > > > internally. This creates race
> > > > > > > > > > windows with MMU notifiers / workers.
> > > > > > >
> > > > > > > This is a very intentional choice in the locking
> > > > > > > design: mmap_read_lock
> > > > > > > is held only in very specific parts of GPU SVM, and the driver should
> > > > > > > never need to take this lock.
> > > > > > >
> > > > > > > Yes, notifiers can race, which is why the GPU fault
> > > > > > > handler and prefetch
> > > > > > > handler are structured as retry loops when a
> > > > > > > notifier race is detected.
> > > > > > > In practice, with well-behaved applications, these races should be
> > > > > > > rare—but they do occur, and the driver must handle them.
> > > > > > >
> > > > > > > __xe_svm_handle_pagefault implements the page fault
> > > > > > > retry loop. VM bind
> > > > > > > prefetch has similar logic, although it is more
> > > > > > > spread out given that it
> > > > > > > is part of a deeper software pipeline.
> > > > > > >
> > > > > > > FWIW, holding locks to avoid races was rejected by Sima because we
> > > > > > > reasoned it is essentially impossible to guarantee
> > > > > > > the absence of races
> > > > > > > by holding a lock. CPU page fault handlers are also effectively just
> > > > > > > large retry loops.
> > > > > > >
> > > > > > > So this is one point I believe you will need to fixup driver side.
> > > > > > >
> > > > > >
> > > > > > Understood. Thank you for the detailed explanation and for pointing to
> > > > > > __xe_svm_handle_pagefault as a reference. We will restructure both our
> > > > > > fault handler and ioctl path to a betterretry loop
> > > > > > pattern with sequence
> > > > > > number race detection.
> > > > > >
> > > > >
> > > > > Yes, the typical pattern is:
> > > > >
> > > > > - Try to migrate once
> > > > > - If you hit a race, give up, evict all memory back to
> > > > > system memory, and bind it
> > > > >
> > > > > Atomics make this tricky because memory must move, but I’m not sure
> > > > > “XNACK off” applies here. However, GPU SVM provides a timeslice
> > > > > mechanism to ensure the CPU can’t move memory while the GPU needs to
> > > > > execute something.
> > > >
> > > > Understood.
> > > >
> > > > >
> > > > > > > > > >
> > > > > > > > > > 3. Multi GPU support
> > > > > > > > > >
> > > > > > > > > > drm_gpusvm binds one drm_device to one
> > > > > > > > > > instance. In multi GPU systems,
> > > > > > > > > > each GPU gets an independent instance with its own range tree, MMU
> > > > > > > > > > notifiers, notifier_lock, and DMA mappings.
> > > > > > > > > >
> > > > > > >
> > > > > > > This is a part I am absolutely open to fixing. Right now, each
> > > > > > > drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
> > > > > > > decoupling a GPU SVM instance from a single device, allowing each
> > > > > > > drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
> > > > > > > device).
> > > > > > >
> > > > > > > This would give drivers the flexibility to use one
> > > > > > > GPU SVM instance per
> > > > > > > VM/device instance (as in Xe), or to maintain a
> > > > > > > single GPU SVM per CPU
> > > > > > > MM.
> > > > > > >
> > > > > >
> > > > > > That would be wonderful! Looking forward to your patch very much!
> > > > > >
> > > > >
> > > > > I can't say I'll code this but we thought about is as options and very
> > > > > open patches which refactor the object model for multiple use cases.
> > > >
> > > > Understood. I will focus on single GPU first, and once we have a
> > > > solid v1, we'd be happy to explore contributing patches for the
> > > > multi-device object model refactoring.
> > > >
> > >
> > > I think roughly what would need to be done is:
> > >
> > > - Move struct drm_gpusvm_pages out of struct drm_gpusvm_range.
> > > - Embed either a struct device or a struct drm_device in struct
> > > drm_gpusvm_pages.
> > > - Drop struct drm_device from struct drm_gpusvm.
> > > - Have the driver’s range structure embed one or more struct
> > > drm_gpusvm_pages in addition to struct drm_gpusvm_range.
> > > - Refactor a few range-based helpers (drm_gpusvm_range_pages_valid,
> > > drm_gpusvm_range_get_pages, drm_gpusvm_range_unmap_pages), or simply
> > > drop them entirely and update drivers to use the drm_gpusvm_pages
> > > helpers instead.
> > >
> > > Then it is up to the drivers whether struct drm_gpusvm maps to a single
> > > device or multiple devices. Either use case seems valid, and giving
> > > drivers the option appears to be the right approach, rather than having
> > > the common drm_gpusvm layer impose its own constraints.
> > >
> > > This type of refactor can be done at any time as an independent patch,
> > > so feel free to post it whenever and I can verify on the Xe side that
> > > everything looks good.
> > >
> > > Matt
> > >
> >
> > Really thanks for the detailed guidance and steps, it is very clear and
> > actionable. I'm excited about this direction, it gives the drivers more
> > flexibility. I'll start working on this as soon as possible. Will post
> > the multi-device refactor as a standalone series once it's well
> > validated. Thanks again for being so open to collaboration!
> >
> > Regards,
> > Honglei
> >
> > > > >
> > > > > >
> > > > > > > > > > This may brings huge overhead:
> > > > > > > > > > - N x MMU notifier registrations
> > > > > > > > > > for the same address range
> > > > > > >
> > > > > > > The notifier overhead is a real concern. We recently
> > > > > > > introduced two-pass
> > > > > > > notifiers [1] to speed up multi-device notifiers. At least in Xe, the
> > > > > > > TLB invalidations—which are the truly expensive part—can be pipelined
> > > > > > > using the two=pass approach. Currently, [1] only implements two-pass
> > > > > > > notifiers for userptr, but Xe’s GPU SVM will be updated to use them
> > > > > > > shortly.
> > > > > > >
> > > > > > > [1] https://patchwork.freedesktop.org/series/153280/
> > > > > > >
> > > > > >
> > > > > > Thank you for the pointer to two-pass notifiers. Will study this
> > > > > > series.
> > > > > >
> > > > > > > > > > - N x hmm_range_fault() calls for the same page (KFD: 1x)
> > > > > > >
> > > > > > > hmm_range_fault is extremely fast compared to the actual migration.
> > > > > > > Running hmm_range_fault on a 2MB region using 4KB pages takes less
> > > > > > > than 1µs. With THP or large device pages [2] (merged last week), it’s
> > > > > > > around 1/20 of a microsecond. So I wouldn’t be too
> > > > > > > concerned about this.
> > > > > > >
> > > > > > > [2] https://patchwork.freedesktop.org/series/163141/
> > > > > > >
> > > > > >
> > > > > > That is very helpful data. Perhaps worry too much.
> > > > > >
> > > > > > > > > > - N x DMA mapping memory
> > > > > > >
> > > > > > > You will always have N x DMA mapping memory if the
> > > > > > > pages are in system
> > > > > > > memory as the dma-mapping API is per device.
> > > > > >
> > > > > > Totally agreed.
> > > > > >
> > > > > > >
> > > > > > > > > > - N x invalidation + restore
> > > > > > > > > > worker scheduling per CPU unmap event
> > > > > > > > > > - N x GPU page table flush / TLB invalidation
> > > > > > >
> > > > > > > I agree you do not want serialize GPU page table flush / TLB
> > > > > > > invalidations. Hence two-pass notifiers [1].
> > > > > >
> > > > > > Yes, will learn it.
> > > > > >
> > > > > > >
> > > > > > > > > > - Increased mmap_lock hold time,
> > > > > > > > > > N callbacks serialize under it
> > > > > > > > > >
> > > > > > > > > > compatibility issues:
> > > > > > > > > > - Quiesce/resume scope mismatch:
> > > > > > > > > > to integrate with KFD compute
> > > > > > > > > > queues, the driver reuses
> > > > > > > > > > kgd2kfd_quiesce_mm()/ resume_mm()
> > > > > > > > > > which have process level semantics. Under the per GPU
> > > > > > > > > > drm_gpusvm model, maybe there
> > > > > > > > > > are some issues on sync. To properly
> > > > > > > > > > integrate with KFD under the
> > > > > > > > > > per SVM model, a compatibility or
> > > > > > > > > > new per VM level queue control
> > > > > > > > > > APIs maybe need to introduced.
> > > > > > > > > >
> > > > > > >
> > > > > > > I thought the idea to get rid of KFD and move over
> > > > > > > to AMDGPU? I thought
> > > > > > > Christian mentioned this to me at XDC.
> > > > > > >
> > > > > >
> > > > > > > > > > Migration challenges:
> > > > > > > > > >
> > > > > > > > > > - No global migration decision logic: each per GPU SVM
> > > > > > > > > > instance maintains its own
> > > > > > > > > > attribute tree independently. This
> > > > > > > > > > allows conflicting settings (e.g., GPU0's SVM sets
> > > > > > > > > > PREFERRED_LOC=GPU0 while GPU1's
> > > > > > > > > > SVM sets PREFERRED_LOC=GPU1
> > > > > > > > > > for the same address range) with
> > > > > > > > > > no detection or resolution.
> > > > > > > > > > A global attribute coordinator
> > > > > > > > > > or a shared manager is needed to
> > > > > > > > > > provide a unified global view for migration decisions
> > > > > > >
> > > > > > > Yes, this is hole in the Xe API too. We have told UMDs if they setup
> > > > > > > individual VMs with conflict attributes for a single
> > > > > > > CPU address space
> > > > > > > the behavior is undefined. Our UMD implement madvise
> > > > > > > is basically loop
> > > > > > > over al GPU VMs setting the same attributes.
> > > > > >
> > > > > > Will follow the same approach for now, the UMD is
> > > > > > responsible for setting
> > > > > > consistent attributes across GPU VMs.
> > > > > >
> > > > >
> > > > > +1
> > > > >
> > > > > > >
> > > > > > > > > >
> > > > > > > > > > - migrate_vma_setup broadcast: one
> > > > > > > > > > GPU's migration triggers MMU
> > > > > > > > > > notifier callbacks in ALL N-1 other drm_gpusvm instances,
> > > > > > > > > > causing N-1 unnecessary restore
> > > > > > > > > > workers to be scheduled. And
> > > > > > >
> > > > > > > My feeling is that you shouldn’t reschedule restore
> > > > > > > workers unless you
> > > > > > > actually have to invalidate page tables (i.e., you have a local SVM
> > > > > > > range within the notifier). So the first migration to an untouched
> > > > > > > region may trigger notifiers, but they won’t do anything because you
> > > > > > > don’t have any valid SVM ranges yet. Subsequent
> > > > > > > mappings of the migrated
> > > > > > > region won’t trigger a notifier unless the memory is moved again.
> > > > > > >
> > > > > >
> > > > > > That is a very good point. We should check whether we actually have
> > > > > > valid SVM ranges before scheduling restore workers. If
> > > > > > there is nothing
> > > > > > to invalidate, the notifier callback should be a no-op. We will review
> > > > > > our notifier callback logic to ensure we are not doing
> > > > > > unnecessary work
> > > > > > here. Thank you for pointing this out.
> > > > > >
> > > > > > > > > > creates races between the
> > > > > > > > > > initiating migration and the other
> > > > > > > > > > instance's restore attempts.
> > > > > > >
> > > > > > > Yes, if multiple devices try to migrate the same CPU
> > > > > > > pages at the same
> > > > > > > time, that will race. That’s why in Xe we have a module-level
> > > > > > > driver_migrate_lock. The first migration runs in read mode; if it
> > > > > > > detects a race and aborts, it then takes driver_migrate_lock in write
> > > > > > > mode so it becomes the only device allowed to move
> > > > > > > memory / CPU pages.
> > > > > > > See xe_svm_alloc_vram() for how this is used.
> > > > > > >
> > > > > > > I’m not sure this approach will work for you, but I
> > > > > > > just wanted to point
> > > > > > > out that we identified this as a potential issue.
> > > > > > >
> > > > > >
> > > > > > Thank you for sharing the driver_migrate_lock approach and pointing to
> > > > > > xe_svm_alloc_vram(). Will explore whether a similar lock
> > > > > > pattern can work
> > > > > > for our case.
> > > > > >
> > > > > > > > > >
> > > > > > > > > > - No cross instance migration serialization: each per GPU
> > > > > > > > > > drm_gpusvm instance has independent locking, so two GPUs'
> > > > > > > > > > "decide -> migrate -> remap"
> > > > > > > > > > sequences can interleave. While
> > > > > > > > > > the kernel page lock prevents
> > > > > > > > > > truly simultaneous migration of
> > > > > > > > > > the same physical page, the
> > > > > > > > > > losing side's retry (evict from
> > > > > > > > > > other GPU's VRAM -> migrate
> > > > > > > > > > back) triggers broadcast notifier
> > > > > > > > > > invalidations and restore
> > > > > > > > > > workers, compounding the ping pong
> > > > > > > > > > problem above.
> > > > > > > > > >
> > > > > > >
> > > > > > > See the driver_migrate_lock above.
> > > > > >
> > > > > > Acknowledged, thank you.
> > > > > > >
> > > > > > > > > > - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
> > > > > > > > > > hardcodes
> > > > > > > > > > MIGRATE_VMA_SELECT_SYSTEM
> > > > > > > > > > (drm_pagemap.c:328), meaning
> > > > > > > > > > it only selects system memory pages for migration.
> > > > > > > > > >
> > > > > > >
> > > > > > > I think this is fixed? We did find some core MM bugs
> > > > > > > that blocked VRAM
> > > > > > > to VRAM but those have been worked out.
> > > > > > >
> > > > > > > The code I'm looking at:
> > > > > > >
> > > > > > > 517 int drm_pagemap_migrate_to_devmem(struct
> > > > > > > drm_pagemap_devmem *devmem_allocation,
> > > > > > > 518 struct mm_struct *mm,
> > > > > > > 519 unsigned
> > > > > > > long start, unsigned long end,
> > > > > > > 520 const
> > > > > > > struct drm_pagemap_migrate_details *mdetails)
> > > > > > > 521 {
> > > > > > > 522 const struct drm_pagemap_devmem_ops
> > > > > > > *ops = devmem_allocation->ops;
> > > > > > > 523 struct drm_pagemap *dpagemap =
> > > > > > > devmem_allocation- >dpagemap;
> > > > > > > 524 struct dev_pagemap *pagemap = dpagemap->pagemap;
> > > > > > > 525 struct migrate_vma migrate = {
> > > > > > > 526 .start = start,
> > > > > > > 527 .end = end,
> > > > > > > 528 .pgmap_owner = pagemap->owner,
> > > > > > > 529 .flags =
> > > > > > > MIGRATE_VMA_SELECT_SYSTEM |
> > > > > > > MIGRATE_VMA_SELECT_DEVICE_COHERENT |
> > > > > > > 530
> > > > > > > MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
> > > > > > > MIGRATE_VMA_SELECT_COMPOUND,
> > > > > > > 531 };
> > > > > > >
> > > > > >
> > > > > > Thank you for checking! I am using v6.18 for this POC,
> > > > > > missed the fix, will
> > > > > > rebase to the latest.
> > > > > >
> > > > > >
> > > > > > > > > > - CPU fault reverse migration race: CPU page fault triggers
> > > > > > > > > > migrate_to_ram while GPU
> > > > > > > > > > instances are concurrently operating.
> > > > > > > > > > Per GPU notifier_lock does not
> > > > > > > > > > protect cross GPU operations.
> > > > > > >
> > > > > > > No, again retry loop as discussed above.
> > > > > >
> > > > > > Understood.
> > > > > >
> > > > > > >
> > > > > > > > > >
> > > > > > > > > > We believe a strong, well designed
> > > > > > > > > > solution at the framework level is
> > > > > > > > > > needed to properly address these problems, and we look forward to
> > > > > > > > > > discussion and suggestions.
> > > > > > >
> > > > > > > Let's work together to figure out what is missing here.
> > > > > >
> > > > > > Thank you so much, Matt. Your feedback has been
> > > > > > incredibly valuable and
> > > > > > has given us a much clearer picture of the framework's design.
> > > > > > Ireally appreciate the effort you put into building drm_gpusvm as a
> > > > > > shared framework. Will incorporate your suggestions into our next
> > > > > > revision and look forward to continuing the collaboration.
> > > > > >
> > > > >
> > > > > No problem. Happy to help.
> > > >
> > > > Thank you again for all the detailed feedback.
> > > >
> > > > Regards,
> > > > Honglei
> > > >
> > > > >
> > > > > Matt
> > > > >
> > > > > > Regards,
> > > > > > Honglei
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Matt
> > > > > > >
> > > > > > > > > >
> > > > > > > > > > Honglei Huang (12):
> > > > > > > > > > drm/amdgpu: add SVM UAPI definitions
> > > > > > > > > > drm/amdgpu: add SVM data structures and header
> > > > > > > > > > drm/amdgpu: add SVM attribute data structures
> > > > > > > > > > drm/amdgpu: implement SVM attribute tree operations
> > > > > > > > > > drm/amdgpu: implement SVM attribute set
> > > > > > > > > > drm/amdgpu: add SVM range data structures
> > > > > > > > > > drm/amdgpu: implement SVM range PTE flags and GPU mapping
> > > > > > > > > > drm/amdgpu: implement SVM range notifier and invalidation
> > > > > > > > > > drm/amdgpu: implement SVM range workers
> > > > > > > > > > drm/amdgpu: implement SVM core initialization and fini
> > > > > > > > > > drm/amdgpu: implement SVM ioctl and fault handler
> > > > > > > > > > drm/amdgpu: wire up SVM build system and fault handler
> > > > > > > > > >
> > > > > > > > > > drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
> > > > > > > > > > drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
> > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
> > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
> > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
> > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 ++++++++++++
> > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
> > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
> > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
> > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
> > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
> > > > > > > > > > include/uapi/drm/amdgpu_drm.h | 39 +
> > > > > > > > > > 12 files changed, 2958 insertions(+), 4 deletions(-)
> > > > > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
> > > > > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
> > > > > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
> > > > > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
> > > > > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
> > > > > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> >
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-04-23 6:40 ` Matthew Brost
@ 2026-04-23 7:18 ` Matthew Brost
2026-04-23 11:03 ` Huang, Honglei1
0 siblings, 1 reply; 36+ messages in thread
From: Matthew Brost @ 2026-04-23 7:18 UTC (permalink / raw)
To: Huang, Honglei1
Cc: Christian König, amd-gfx, dri-devel, Alexander.Deucher,
Felix.Kuehling, Honglei Huang, Oak.Zeng, Jenny-Jing.Liu,
Philip.Yang, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu, Junhua.Shen,
Thomas Hellström, Rodrigo Vivi, Danilo Krummrich, Alice Ryhl
On Wed, Apr 22, 2026 at 11:40:56PM -0700, Matthew Brost wrote:
> On Wed, Apr 15, 2026 at 06:04:11PM +0800, Huang, Honglei1 wrote:
> >
> >
> > On 3/26/2026 8:16 PM, Honglei Huang wrote:
> > >
> > >
> > > On 3/26/26 06:24, Matthew Brost wrote:
> > > > On Tue, Mar 24, 2026 at 03:24:43PM +0800, Honglei Huang wrote:
> > > > >
> > > > >
> > > > > On 3/23/26 14:31, Matthew Brost wrote:
> > > > > > On Thu, Mar 19, 2026 at 10:17:36PM +0800, Honglei Huang wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 3/19/26 13:08, Matthew Brost wrote:
> > > > > > > > On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
> > > > > > > > >
> > > > > > > >
> > > > > > > > Disclaimer I haven't look at any code in this series yet.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On 3/17/26 19:48, Christian König wrote:
> > > > > > > > > > Adding a few XE and drm_gpuvm people on TO.
> > > > > > > > > >
> > > > > > > > > > On 3/17/26 12:29, Honglei Huang wrote:
> > > > > > > > > > > From: Honglei Huang <honghuan@amd.com>
> > > > > > > > > > >
> > > > > > > > > > > This is a POC/draft patch series of SVM
> > > > > > > > > > > feature in amdgpu based on the
> > > > > > > > > > > drm_gpusvm framework. The primary
> > > > > > > > > > > purpose of this RFC is to validate
> > > > > > > > > > > the framework's applicability, identify implementation challenges,
> > > > > > > > > > > and start discussion on framework
> > > > > > > > > > > evolution. This is not a production
> > > > > > > >
> > > > > > > > +1. Open to any ideas. Given this was designed
> > > > > > > > originally for Xe we very
> > > > > > > > well could have missed other drivers requirements.
> > > > > > > Hi Matt,
> > > > > > >
> > > > > > > Thank you for the openness. And thank you so much for the incredibly
> > > > > > > detailed and patient response. I really appreciate you
> > > > > > > taking the time to
> > > > > > > walk through each point.
> > > > > > >
> > > > > >
> > > > > > I'm here to help.
> > > > > >
> > > > > > > Actually I am still a learner when it comes to the
> > > > > > > drm_gpusvm framework and
> > > > > > > GPU SVM design in general. Some of my descriptions below
> > > > > > > may not be entirely
> > > > > > > accurate. But I really want to bring drm_gpusvm into
> > > > > > > amdgpu and make it work
> > > > > > > well.
> > > > > >
> > > > > > I appreciate another driver jumping in and using this framework—it
> > > > > > becomes easier to validate as more users adopt it.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > > > ready submission.
> > > > > > > > > > >
> > > > > > > > > > > This patch series implements basic SVM
> > > > > > > > > > > support with the following features:
> > > > > > > > > > >
> > > > > > > > > > > 1. attributes sepatarated from physical page management:
> > > > > > > > > > >
> > > > > > > > > > > - Attribute layer
> > > > > > > > > > > (amdgpu_svm_attr_tree): a driver side
> > > > > > > > > > > interval
> > > > > > > > > > > tree that stores SVM
> > > > > > > > > > > attributes. Managed through the
> > > > > > > > > > > SET_ATTR,
> > > > > > > > > > > and mmu notifier callback.
> > > > > > > >
> > > > > > > > Can you explain the mmu notifier callback
> > > > > > > > interaction here? See below in
> > > > > > > > Xe the attribute tree is existing VMA tree (gpuvm).
> > > > > > > >
> > > > > > >
> > > > > > > Let me try to explain, apologies if the description is not fully
> > > > > > > precise.
> > > > > > >
> > > > > > > In current implementation, the MMU notifier callback
> > > > > > > interacts with the attr
> > > > > > > tree only in the munmap path remove the corresponding attribute
> > > > > > > entries from the attr tree so that stale attributes do not persist for
> > > > > > > freed address space.
> > > > > > >
> > > > > >
> > > > > > Ah, yes. We reset our attributes upon munmap too. We
> > > > > > actually don't this
> > > > > > 100% correct quite either and series in flight to fix [1].
> > > > > >
> > > > > > [1] https://patchwork.freedesktop.org/series/161815/
> > > > > >
> > > > >
> > > > > I studied [1]. This draft has a simliar mechanism to handle
> > > > > attributes when
> > > > > munmap. But there are some sligt differences in detail, maybe casued by
> > > > > different UMD runtime behaviors.
> > > > >
> > > > >
> > > > > > > > > > >
> > > > > > > > > > > - Physical page layer (drm_gpusvm ranges): managed by the
> > > > > > > > > > > drm_gpusvm framework, representing actual HMM backed DMA
> > > > > > > > > > > mappings and GPU page table entries.
> > > > > > > > > > >
> > > > > > > > > > > This separation is necessary:
> > > > > > > > > > > - The framework does not
> > > > > > > > > > > support range splitting, so a partial
> > > > > > > > > > > munmap destroys the entire
> > > > > > > > > > > overlapping range, including the
> > > > > > > > > > > still valid parts. If
> > > > > > > > > > > attributes were stored inside drm_gpusvm
> > > > > > > > > > > ranges, they would be lost on unmapping.
> > > > > > > > > > > The separate attr tree
> > > > > > > > > > > preserves userspace set attributes
> > > > > > > > > > > across range operations.
> > > > > > > >
> > > > > > > > Yes, in Xe the divide is at the VMA level (set by user space) via VM
> > > > > > > > bind (parts of VM may be mappings BOs, parts could
> > > > > > > > be setup for SVM) or
> > > > > > > > madvise IOCTLs which reflect user space attributes on current SVM
> > > > > > > > mappings or future ones.
> > > > > > > >
> > > > > > > > The SVM range tree reflects mappings that have been faulted into the
> > > > > > > > device and contain pages. This is an intentional choice.
> > > > > > >
> > > > > > > That makes a lot of sense. Thank you for clarifying the
> > > > > > > design intent. I
> > > > > > > think the current adopt the same principle: the
> > > > > > > drm_gpusvm range tree only
> > > > > > > reflect actual faulted in mappings.
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Isn't that actually intended? When parts of
> > > > > > > > > > the range unmap then that usually means the
> > > > > > > > > > whole range isn't valid any more.
> > > > > > > >
> > > > > > > >
> > > > > > > > Yes, this was an intentional design choice to not
> > > > > > > > support partial unmap,
> > > > > > > > and instead rely on the driver to recreate a new range.
> > > > > > > >
> > > > > > > > The reasoning is:
> > > > > > > >
> > > > > > > > - In practice, this should be rare for well-behaved applications.
> > > > > > > >
> > > > > > > > - With THP / large device pages, if a sub-range is
> > > > > > > > unmapped, the entire
> > > > > > > > GPU mapping is invalidated anyway due to the page size change. As a
> > > > > > > > result, the cost of creating a new range is minimal, since the device
> > > > > > > > will likely fault again on the remaining pages.
> > > > > > > >
> > > > > > > > So there is no need to over-engineer the common code.
> > > > > > > >
> > > > > > > > FWIW, to even test partial unmaps in Xe, I had to do things I doubt
> > > > > > > > anyone would ever do:
> > > > > > > >
> > > > > > > > ptr = mmap(SZ_2M);
> > > > > > > > /* fault in memory to the device */
> > > > > > > > munmap(ptr, SZ_1M);
> > > > > > > > /* touch memory again on the device */
> > > > > > > >
> > > > > > >
> > > > > > > Thank you for this explanation and the concrete example. After further
> > > > > > > discussion internally with Christian, we are now aligned
> > > > > > > with same position
> > > > > > > partial unmap. Will remove rebuild on partial unmap logic in the next
> > > > > > > version and handle it as only partially backed range.
> > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > It is about partial unmap, some subregion in
> > > > > > > > > drm_gpusvm_range is still valid
> > > > > > > > > but some other subregion is invalid, but under
> > > > > > > > > drm_gpusvm, need to destroy
> > > > > > > > > the entire range.
> > > > > > > > >
> > > > > > > > > e.g.:
> > > > > > > > >
> > > > > > > > > [---------------unmap region in mmu
> > > > > > > > > notifier-----------------]
> > > > > > > > > [0x1000 ------------ 0x9000]
> > > > > > > > > [ valid ][ invalid ]
> > > > > > > > >
> > > > > > > > > see deatil in drm_gpusvm.c:110 line
> > > > > > > > > section:Partial Unmapping of Ranges
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > - drm_gpusvm range
> > > > > > > > > > > boundaries are determined by fault
> > > > > > > > > > > address
> > > > > > > > > > > and pre setted chunk size,
> > > > > > > > > > > not by userspace attribute boundaries.
> > > > > > > > > > > Ranges may be rechunked
> > > > > > > > > > > on memory changes. Embedding
> > > > > > > > > > > attributes in framework
> > > > > > > > > > > ranges would scatter attr state
> > > > > > > > > > > across many small ranges
> > > > > > > > > > > and require complex reassemble
> > > > > > > > > > > logic when operate attrbute.
> > > > > > > > > >
> > > > > > > > > > Yeah, that makes a lot of sense.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 2) System memory mapping via drm_gpusvm
> > > > > > > > > > >
> > > > > > > > > > > The core mapping path uses
> > > > > > > > > > > drm_gpusvm_range_find_or_insert() to
> > > > > > > > > > > create ranges,
> > > > > > > > > > > drm_gpusvm_range_get_pages() for HMM
> > > > > > > > > > > page fault
> > > > > > > > > > > and DMA mapping, then updates GPU page tables via
> > > > > > > > > > > amdgpu_vm_update_range().
> > > > > > > > > > >
> > > > > > > > > > > 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
> > > > > > > > > > >
> > > > > > > > > > > On XNACK off hardware the GPU
> > > > > > > > > > > cannot recover from page faults,
> > > > > > > > > > > so mappings must be established through ioctl. When
> > > > > > > > > > > userspace calls SET_ATTR with ACCESS=ENABLE, the driver
> > > > > > > > > > > walks the attr tree and maps all accessible intervals
> > > > > > > > > > > to the GPU by amdgpu_svm_range_map_attr_ranges().
> > > > > > > >
> > > > > > > > Can you expand on XNACK off / GPU no faults? Is this to the share GPU
> > > > > > > > between 3D (dma-fences) and faulting clients? We
> > > > > > > > have something similar
> > > > > > > > in Xe, but it isn't an explicit IOCTL rather we
> > > > > > > > switch between on demand
> > > > > > > > as 3D client submits and then resumes page faults when all dma-fences
> > > > > > > > have signaled.
> > > > > > > >
> > > > > > > > I see below you mention page tables are modified during quiesce KFD
> > > > > > > > queues? I'm not sure that is required - you just need to guarnette
> > > > > > > > faulting clients won't trigger page faults when
> > > > > > > > dma-fence is in flight.
> > > > > > > >
> > > > > > > > Maybe give me an explaination of exactly what the
> > > > > > > > requirement from AMD
> > > > > > > > are here so I have better picture.
> > > > > > >
> > > > > > > Thank you for the patience, let me try to explain our
> > > > > > > situation, though
> > > > > > > I may not get every detail right.
> > > > > > >
> > > > > > > XNACK off means hardware that does not have GPU page
> > > > > > > fault capability (or
> > > > > > > turned off)
> > > > > > >
> > > > > > > So for these GPUs, ALL page table entries must be fully
> > > > > > > populated before
> > > > > > > the GPU can access the memory. This is why we need the ioctl driven
> > > > > > > mapping path, when userspace calls SET_ATTR with ACCESS=ENABLE, need
> > > > > > > walk the attribute tree and eagerly map all accessible ranges into the
> > > > > > > GPU page tables. This is functionally similar to what you describe as
> > > > > > > prefetch IOCTLs / VM bind in Xe.
> > > > > > >
> > > > > > > Regarding queue quiesce during page table modification: on XNACK off
> > > > > > > hardware, because the GPU cannot fault, we must ensure the GPU is
> > > > > > > completely stopped before modifying any PTE it might be accessing.
> > > > > > > Otherwise the GPU could access a partially updated page
> > > > > > > table and hang.
> > > > > > > The quiesce/resume is the mechanism to guarantee this.
> > > > > > >
> > > > > > > I hope that helps clarify the picture.
> > > > > > >
> > > > > >
> > > > > > This clarifies a lot. This is what we’d call in Xe “preemption fence”
> > > > > > mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
> > > > > > resume. We don’t actually support SVM in this case; instead, we use
> > > > > > “userptr binds,” which are built on gpusvm for page
> > > > > > collection. However,
> > > > > > we don’t support migrating memory to the device—though we could.
> > > > > >
> > > > > > I’d look at how we converted 'userptr' to be based on GPU SVM [2]. In
> > > > > > this case, don’t maintain a range tree, as those—as you
> > > > > > suggest—are more
> > > > > > of an on-demand fault driver concern. Instead, just embed 'struct
> > > > > > drm_gpusvm_pages' in the VMA struct defined by the IOCTLs..
> > > > > >
> > > > > > We could extend this to support migrating 'userptr', but we
> > > > > > just haven’t
> > > > > > done that yet—this may be what you want to do in “XNACK off..
> > > > > >
> > > > > > [2] https://patchwork.freedesktop.org/series/146553/
> > > > > >
> > > > >
> > > > > Actually we need to switch the xnack mode between on and off, so
> > > > > in xnack off mode the driver operates in "implicit prefetch mode".
> > > > > This may be due to compatibility with older hardware and the needs
> > > > > of the UMD runtime. We will further discuss the handling method
> > > > > under xnack off internally.
> > > > >
> >
> > Hi Matt,
> >
> > I studied the xe_userptr code and the conversion series [2] you
> > pointed to.
> >
> > I have a question:
> > Would it be possible to reuse drm_gpusvm_range to handle hardware
> > without the GPU fault feature (XNACK off mode)?
>
> That’s not how we’ve done it. We embedded drm_gpusvm_pages into our VMA
> structure and then attached a notifier. The notifier attachment is
> open-coded on the Xe side, and this could be normalized and opened up
> for common driver use cases.
>
> The problem with reusing drm_gpusvm_range directly is that a VMA may
> span multiple gpusvm notifiers—i.e., it can be larger than the notifier
> size. Of course, we could rework this as well.
>
Sorry for the double reply—I just glanced at the latest series. I don’t
think creating a range per page of the userptr is desirable. While it
would work, from a time-complexity point of view I don’t think this is
ideal.
The issue with spans across multiple notifiers is real, though.
My rough idea would be:
- Give drivers an interface to create larger ranges (rough sketch below).
- If the range fits inside a single notifier’s size → done.
- If the range spans multiple notifier sizes → round up to a power of
two and create a larger notifier. This may overlap with existing
notifiers, which is likely fine given that interval trees support
overlaps (?). We’d need to double-check and test this. If overlapping
notifiers are not acceptable, we’d need some heavy-handed notifier merge
logic—it will be complicated, but isolated, so once we get it right
everyone can use it.
- Finally, make sure that individual userptr pages can reside at any
location.
Or conversely:
- Normalize embedding of drm_gpusvm_pages in VMA structs + notifier
creation
- Make sure that individual userptr pages can reside at any location.
Both options actually sound really similar after typing this out.
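To make the first option above concrete, a minimal sketch of what such an
interface could look like - drm_gpusvm_range_insert_fixed() does not exist
in the current drm_gpusvm, and everything beyond the rounding idea is a
hypothetical placeholder:

/*
 * Hypothetical helper, not part of the current drm_gpusvm API: insert a
 * range covering exactly [start, end), creating a notifier large enough
 * to hold it.
 */
struct drm_gpusvm_range *
drm_gpusvm_range_insert_fixed(struct drm_gpusvm *gpusvm,
			      unsigned long start, unsigned long end,
			      const struct drm_gpusvm_ctx *ctx)
{
	unsigned long nsize = gpusvm->notifier_size;

	/*
	 * Range spans multiple notifier-size windows: round the notifier
	 * up to the next power of two covering the span. This may overlap
	 * existing notifiers; interval trees do support overlaps, but that
	 * still needs to be verified, otherwise notifier merge logic would
	 * be required.
	 */
	if (end - start > nsize)
		nsize = roundup_pow_of_two(end - start);

	/* create or look up a notifier of size nsize, then insert the range */
	return ERR_PTR(-EOPNOTSUPP);	/* placeholder in this sketch */
}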
Matt
> So either way, the Xe userptr + gpusvm implementation should be refined
> further for common driver use.
>
> >
> > Reusing drm_gpusvm_range for the XNACK-off case would simplify our
> > implementation considerably: it already provides the large page chunk
> > optimization and can reuse the existing migration infrastructure.
> >
> > Building these on top of a standalone drm_gpusvm_pages
> > would mean reimplementing much of what the range layer already offers.
> > It would also let us keep a single code path for both XNACK modes,
> > which reduces maintenance burden and avoids behavioral differences.
> >
> > Would this direction be acceptable, or do you see concerns with reusing
> > the range infrastructure for the no-fault case?
> >
>
> If you prefer something like "insert a range exactly here" + create range
> + notifier, I think that's a completely reasonable direction and Xe would
> likely switch over to using this.
>
> I guess my only concern is sub-userptr migration. We are trending
> towards allowing userptrs to be migrated either via prefetch IOCTLs
> or access counters on the GPU side - for access counters we'd likely
> migrate a single 2M page at a time within the userptr. get_pages()
> supports mixed mappings between VRAM + system but likely needs some
> more work to really make this complete though.
>
> Matt
>
> > Regards,
> > Honglei
> >
> >
> >
> > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 4) Invalidation, GC worker, and restore worker
> > > > > > > > > > >
> > > > > > > > > > > MMU notifier callbacks
> > > > > > > > > > > (amdgpu_svm_range_invalidate) handle
> > > > > > > > > > > three cases based on event type and hardware mode:
> > > > > > > > > > > - unmap event: clear GPU PTEs in the notifier context,
> > > > > > > > > > > unmap DMA pages, mark ranges as unmapped, flush TLB,
> > > > > > > > > > > and enqueue to the GC worker. On XNACK off, also
> > > > > > > > > > > quiesce KFD queues and schedule rebuild of the
> > > > > > > > > > > still valid portions that
> > > > > > > > > > > were destroyed together with
> > > > > > > > > > > the unmapped subregion.
> > > > > > > > > > >
> > > > > > > > > > > - evict on XNACK off:
> > > > > > > > > > > quiesce KFD queues first, then unmap DMA pages and
> > > > > > > > > > > enqueue to the restore worker.
> > > > > > > > > >
> > > > > > > > > > Is that done through the DMA fence or by
> > > > > > > > > > talking directly to the MES/HWS?
> > > > > > > > >
> > > > > > > > > Currently the KFD queue quiesce/resume APIs are reused;
> > > > > > > > > looking forward to a better solution.
> > > > > > > > >
> > > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Honglei
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Christian.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > - evict on XNACK on:
> > > > > > > > > > > clear GPU PTEs, unmap DMA
> > > > > > > > > > > pages, and flush TLB, but do
> > > > > > > > > > > not schedule any worker. The GPU will fault on next
> > > > > > > > > > > access and the fault handler establishes the mapping.
> > > > > > > > > > >
> > > > > > > > > > > Not supported features:
> > > > > > > > > > > - XNACK on GPU page fault mode
> > > > > > > > > > > - migration and prefetch feature
> > > > > > > > > > > - Multi GPU support
> > > > > > > > > > >
> > > > > > > > > > > XNACK on enablement is ongoing. The
> > > > > > > > > > > GPUs that support XNACK on
> > > > > > > > > > > are currently only accessible to
> > > > > > > > > > > us via remote lab machines, which slows
> > > > > > > > > > > down progress.
> > > > > > > > > > >
> > > > > > > > > > > Patch overview:
> > > > > > > > > > >
> > > > > > > > > > > 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
> > > > > > > > > > > SET_ATTR/GET_ATTR
> > > > > > > > > > > operations, attribute types, and related
> > > > > > > > > > > structs in amdgpu_drm.h.
> > > > > > > > > > >
> > > > > > > > > > > 02/12 Core data structures:
> > > > > > > > > > > amdgpu_svm wrapping drm_gpusvm with
> > > > > > > > > > > refcount, attr_tree, workqueues, locks, and
> > > > > > > > > > > callbacks (begin/end_restore, flush_tlb).
> > > > > > > > > > >
> > > > > > > > > > > 03/12 Attribute data structures:
> > > > > > > > > > > amdgpu_svm_attrs, attr_range
> > > > > > > > > > > (interval tree node),
> > > > > > > > > > > attr_tree, access enum, flag masks,
> > > > > > > > > > > and change trigger enum.
> > > > > > > > > > >
> > > > > > > > > > > 04/12 Attribute tree operations:
> > > > > > > > > > > interval tree lookup, insert,
> > > > > > > > > > > remove, and tree create/destroy lifecycle.
> > > > > > > > > > >
> > > > > > > > > > > 05/12 Attribute set: validate UAPI
> > > > > > > > > > > attributes, apply to internal
> > > > > > > > > > > attrs, handle hole/existing
> > > > > > > > > > > range with head/tail splitting,
> > > > > > > > > > > compute change triggers, and -EAGAIN retry loop.
> > > > > > > > > > > Implements attr_clear_pages
> > > > > > > > > > > for unmap cleanup and attr_get.
> > > > > > > > > > >
> > > > > > > > > > > 06/12 Range data structures: amdgpu_svm_range extending
> > > > > > > > > > > drm_gpusvm_range with gpu_mapped state, pending ops,
> > > > > > > > > > > pte_flags cache, and GC/restore queue linkage.
> > > > > > > > > > >
> > > > > > > > > > > 07/12 PTE flags and GPU mapping: simple gpu pte function,
> > > > > > > > > > > GPU page table update with
> > > > > > > > > > > DMA address, range mapping loop:
> > > > > > > > > > > find_or_insert -> get_pages -> validate -> update PTE,
> > > > > > > > > > > and attribute change driven mapping function.
> > > > > > > > > > >
> > > > > > > > > > > 08/12 Notifier and invalidation:
> > > > > > > > > > > synchronous GPU PTE clear in
> > > > > > > > > > > notifier context, range removal and overlap cleanup,
> > > > > > > > > > > rebuild after destroy logic, and MMU event dispatcher
> > > > > > > > > > >
> > > > > > > > > > > 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
> > > > > > > > > > > worker for unmap processing
> > > > > > > > > > > and rebuild, ordered restore
> > > > > > > > > > > worker for mapping evicted ranges, and flush/sync
> > > > > > > > > > > helpers.
> > > > > > > > > > >
> > > > > > > > > > > 10/12 Initialization and fini: kmem_cache for range/attr,
> > > > > > > > > > > drm_gpusvm_init with chunk sizes, XNACK detection, TLB
> > > > > > > > > > > flush helper, and amdgpu_svm
> > > > > > > > > > > init/close/fini lifecycle.
> > > > > > > > > > >
> > > > > > > > > > > 11/12 IOCTL and fault handler:
> > > > > > > > > > > PASID based SVM lookup with kref
> > > > > > > > > > > protection, amdgpu_gem_svm_ioctl dispatcher, and
> > > > > > > > > > > amdgpu_svm_handle_fault for GPU page fault recovery.
> > > > > > > > > > >
> > > > > > > > > > > 12/12 Build integration: Kconfig
> > > > > > > > > > > option (CONFIG_DRM_AMDGPU_SVM),
> > > > > > > > > > > Makefile rules, ioctl table
> > > > > > > > > > > registration, and amdgpu_vm
> > > > > > > > > > > hooks (init in make_compute,
> > > > > > > > > > > close/fini, fault dispatch).
> > > > > > > > > > >
> > > > > > > > > > > Test result:
> > > > > > > > > > > on gfx1100(W7900) and gfx943(MI300x)
> > > > > > > > > > > kfd test: 95%+ passed, same failed cases as the official release
> > > > > > > > > > > rocr test: all passed
> > > > > > > > > > > hip catch test: 20 cases failed out of 5366 total,
> > > > > > > > > > > +13 failures vs the official release
> > > > > > > > > > >
> > > > > > > > > > > During implementation we identified
> > > > > > > > > > > several challenges / design questions:
> > > > > > > > > > >
> > > > > > > > > > > 1. No range splitting on partial unmap
> > > > > > > > > > >
> > > > > > > > > > > drm_gpusvm explicitly does not
> > > > > > > > > > > support range splitting in
> > > > > > > > > > > drm_gpusvm.c:122.
> > > > > > > > > > > Partial munmap needs to destroy
> > > > > > > > > > > the entire range including the valid
> > > > > > > > > > > interval.
> > > > > > > > > > > GPU fault driven hardware can handle this design with an
> > > > > > > > > > > extra GPU fault handler, but AMDGPU also needs to support
> > > > > > > > > > > XNACK off hardware, where this design requires the driver
> > > > > > > > > > > to rebuild the still-valid part of the removed range. This
> > > > > > > > > > > brings very heavy restore work in the work queue / GC
> > > > > > > > > > > worker: unmap/destroy -> rebuild (insert and map). This
> > > > > > > > > > > restore work is even heavier than in kfd_svm: previously
> > > > > > > > > > > the driver work queue only needed to restore or unmap, but
> > > > > > > > > > > with drm_gpusvm the driver needs to unmap and restore,
> > > > > > > > > > > which brings more complex logic, a heavier worker queue
> > > > > > > > > > > workload, and synchronization issues.
> > > > > > > >
> > > > > > > > Is this common in the workloads you are running? I'm also wondering if
> > > > > > > > your restore logic / KFD's design is actually contributing to the
> > > > > > > > problem.
> > > > > > > >
> > > > > > >
> > > > > > > Honestly, you raise a fair point.
> > > > > > >
> > > > > > > We will redesign the partial munmap handling logic, which should
> > > > > > > eliminate most of this complexity.
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > +1, yes, test it, but don't optimize for it.
> > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 2. Fault driven vs ioctl driven mapping
> > > > > > > > > > >
> > > > > > > > > > > drm_gpusvm is designed around GPU
> > > > > > > > > > > page fault handlers. The primary entry
> > > > > > > > > > > point drm_gpusvm_range_find_or_insert() takes a fault_addr.
> > > > > > > > > > > AMDGPU needs to support IOCTL driven mapping because on
> > > > > > > > > > > XNACK off hardware the GPU cannot fault at all.
> > > > > > > >
> > > > > > > > I think we refer to these as prefetch IOCTLs in Xe.
> > > > > > > > Ideally, user space
> > > > > > > > issues these so the device does not fault (e.g.,
> > > > > > > > prefetch creates a set
> > > > > > > > of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
> > > > > > > > specific VM bind operations.
> > > > > > > >
> > > > > > >
> > > > > > > That is a very helpful way to think about it. Yes, our ioctl driven
> > > > > > > mapping (XNACK off) is essentially equivalent to a
> > > > > > > prefetch operation. We are
> > > > > > > trying to improve it.
> > > > > > >
> > > > > >
> > > > > > See above wrt 'userptr'.
> > > > >
> > > > > Got it.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > The ioctl path cannot hold
> > > > > > > > > > > mmap_read_lock across the entire
> > > > > > > > > > > operation
> > > > > > > > > > > because
> > > > > > > > > > > drm_gpusvm_range_find_or_insert()
> > > > > > > > > > > acquires/releases it
> > > > > > > > > > > internally. This creates race
> > > > > > > > > > > windows with MMU notifiers / workers.
> > > > > > > >
> > > > > > > > This is a very intentional choice in the locking
> > > > > > > > design: mmap_read_lock
> > > > > > > > is held only in very specific parts of GPU SVM, and the driver should
> > > > > > > > never need to take this lock.
> > > > > > > >
> > > > > > > > Yes, notifiers can race, which is why the GPU fault
> > > > > > > > handler and prefetch
> > > > > > > > handler are structured as retry loops when a
> > > > > > > > notifier race is detected.
> > > > > > > > In practice, with well-behaved applications, these races should be
> > > > > > > > rare—but they do occur, and the driver must handle them.
> > > > > > > >
> > > > > > > > __xe_svm_handle_pagefault implements the page fault
> > > > > > > > retry loop. VM bind
> > > > > > > > prefetch has similar logic, although it is more
> > > > > > > > spread out given that it
> > > > > > > > is part of a deeper software pipeline.
> > > > > > > >
> > > > > > > > FWIW, holding locks to avoid races was rejected by Sima because we
> > > > > > > > reasoned it is essentially impossible to guarantee
> > > > > > > > the absence of races
> > > > > > > > by holding a lock. CPU page fault handlers are also effectively just
> > > > > > > > large retry loops.
> > > > > > > >
> > > > > > > > So this is one point I believe you will need to fix up on the driver side.
> > > > > > > >
> > > > > > >
> > > > > > > Understood. Thank you for the detailed explanation and for pointing to
> > > > > > > __xe_svm_handle_pagefault as a reference. We will restructure both our
> > > > > > > fault handler and ioctl path to a betterretry loop
> > > > > > > pattern with sequence
> > > > > > > number race detection.
> > > > > > >
> > > > > >
> > > > > > Yes, the typical pattern is:
> > > > > >
> > > > > > - Try to migrate once
> > > > > > - If you hit a race, give up, evict all memory back to
> > > > > > system memory, and bind it
> > > > > >
> > > > > > Atomics make this tricky because memory must move, but I’m not sure
> > > > > > “XNACK off” applies here. However, GPU SVM provides a timeslice
> > > > > > mechanism to ensure the CPU can’t move memory while the GPU needs to
> > > > > > execute something.
> > > > >
> > > > > Understood.
> > > > >
> > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 3. Multi GPU support
> > > > > > > > > > >
> > > > > > > > > > > drm_gpusvm binds one drm_device to one
> > > > > > > > > > > instance. In multi GPU systems,
> > > > > > > > > > > each GPU gets an independent instance with its own range tree, MMU
> > > > > > > > > > > notifiers, notifier_lock, and DMA mappings.
> > > > > > > > > > >
> > > > > > > >
> > > > > > > > This is a part I am absolutely open to fixing. Right now, each
> > > > > > > > drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
> > > > > > > > decoupling a GPU SVM instance from a single device, allowing each
> > > > > > > > drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
> > > > > > > > device).
> > > > > > > >
> > > > > > > > This would give drivers the flexibility to use one
> > > > > > > > GPU SVM instance per
> > > > > > > > VM/device instance (as in Xe), or to maintain a
> > > > > > > > single GPU SVM per CPU
> > > > > > > > MM.
> > > > > > > >
> > > > > > >
> > > > > > > That would be wonderful! Looking forward to your patch very much!
> > > > > > >
> > > > > >
> > > > > > I can't say I'll code this, but we have thought about it as an option and
> > > > > > are very open to patches which refactor the object model for multiple use cases.
> > > > >
> > > > > Understood. I will focus on single GPU first, and once we have a
> > > > > solid v1, we'd be happy to explore contributing patches for the
> > > > > multi-device object model refactoring.
> > > > >
> > > >
> > > > I think roughly what would need to be done is:
> > > >
> > > > - Move struct drm_gpusvm_pages out of struct drm_gpusvm_range.
> > > > - Embed either a struct device or a struct drm_device in struct
> > > > drm_gpusvm_pages.
> > > > - Drop struct drm_device from struct drm_gpusvm.
> > > > - Have the driver’s range structure embed one or more struct
> > > > drm_gpusvm_pages in addition to struct drm_gpusvm_range.
> > > > - Refactor a few range-based helpers (drm_gpusvm_range_pages_valid,
> > > > drm_gpusvm_range_get_pages, drm_gpusvm_range_unmap_pages), or simply
> > > > drop them entirely and update drivers to use the drm_gpusvm_pages
> > > > helpers instead.
> > > >
> > > > Then it is up to the drivers whether struct drm_gpusvm maps to a single
> > > > device or multiple devices. Either use case seems valid, and giving
> > > > drivers the option appears to be the right approach, rather than having
> > > > the common drm_gpusvm layer impose its own constraints.
> > > >
> > > > This type of refactor can be done at any time as an independent patch,
> > > > so feel free to post it whenever and I can verify on the Xe side that
> > > > everything looks good.
> > > >
> > > > Matt
> > > >
> > >
> > > Thanks a lot for the detailed guidance and steps, it is very clear and
> > > actionable. I'm excited about this direction, it gives the drivers more
> > > flexibility. I'll start working on this as soon as possible. Will post
> > > the multi-device refactor as a standalone series once it's well
> > > validated. Thanks again for being so open to collaboration!
> > >
> > > Regards,
> > > Honglei
> > >
> > > > > >
> > > > > > >
> > > > > > > > > > > This may bring huge overhead:
> > > > > > > > > > > - N x MMU notifier registrations
> > > > > > > > > > > for the same address range
> > > > > > > >
> > > > > > > > The notifier overhead is a real concern. We recently
> > > > > > > > introduced two-pass
> > > > > > > > notifiers [1] to speed up multi-device notifiers. At least in Xe, the
> > > > > > > > TLB invalidations—which are the truly expensive part—can be pipelined
> > > > > > > > using the two-pass approach. Currently, [1] only implements two-pass
> > > > > > > > notifiers for userptr, but Xe’s GPU SVM will be updated to use them
> > > > > > > > shortly.
> > > > > > > >
> > > > > > > > [1] https://patchwork.freedesktop.org/series/153280/
> > > > > > > >
> > > > > > >
> > > > > > > Thank you for the pointer to two-pass notifiers. Will study this
> > > > > > > series.
> > > > > > >
> > > > > > > > > > > - N x hmm_range_fault() calls for the same page (KFD: 1x)
> > > > > > > >
> > > > > > > > hmm_range_fault is extremely fast compared to the actual migration.
> > > > > > > > Running hmm_range_fault on a 2MB region using 4KB pages takes less
> > > > > > > > than 1µs. With THP or large device pages [2] (merged last week), it’s
> > > > > > > > around 1/20 of a microsecond. So I wouldn’t be too
> > > > > > > > concerned about this.
> > > > > > > >
> > > > > > > > [2] https://patchwork.freedesktop.org/series/163141/
> > > > > > > >
> > > > > > >
> > > > > > > That is very helpful data. Perhaps we worry too much.
> > > > > > >
> > > > > > > > > > > - N x DMA mapping memory
> > > > > > > >
> > > > > > > > You will always have N x DMA mapping memory if the
> > > > > > > > pages are in system
> > > > > > > > memory as the dma-mapping API is per device.
> > > > > > >
> > > > > > > Totally agreed.
> > > > > > >
> > > > > > > >
> > > > > > > > > > > - N x invalidation + restore
> > > > > > > > > > > worker scheduling per CPU unmap event
> > > > > > > > > > > - N x GPU page table flush / TLB invalidation
> > > > > > > >
> > > > > > > > I agree you do not want to serialize GPU page table flush / TLB
> > > > > > > > invalidations. Hence two-pass notifiers [1].
> > > > > > >
> > > > > > > Yes, will learn it.
> > > > > > >
> > > > > > > >
> > > > > > > > > > > - Increased mmap_lock hold time,
> > > > > > > > > > > N callbacks serialize under it
> > > > > > > > > > >
> > > > > > > > > > > compatibility issues:
> > > > > > > > > > > - Quiesce/resume scope mismatch:
> > > > > > > > > > > to integrate with KFD compute
> > > > > > > > > > > queues, the driver reuses
> > > > > > > > > > > kgd2kfd_quiesce_mm()/resume_mm(),
> > > > > > > > > > > which have process level semantics. Under the per GPU
> > > > > > > > > > > drm_gpusvm model there may be synchronization issues. To
> > > > > > > > > > > properly integrate with KFD under the per SVM model,
> > > > > > > > > > > either compatibility APIs or new per VM level queue
> > > > > > > > > > > control APIs may need to be introduced.
> > > > > > > > > > >
> > > > > > > >
> > > > > > > > I thought the idea was to get rid of KFD and move over
> > > > > > > > to AMDGPU? I thought
> > > > > > > > Christian mentioned this to me at XDC.
> > > > > > > >
> > > > > > >
> > > > > > > > > > > Migration challenges:
> > > > > > > > > > >
> > > > > > > > > > > - No global migration decision logic: each per GPU SVM
> > > > > > > > > > > instance maintains its own
> > > > > > > > > > > attribute tree independently. This
> > > > > > > > > > > allows conflicting settings (e.g., GPU0's SVM sets
> > > > > > > > > > > PREFERRED_LOC=GPU0 while GPU1's
> > > > > > > > > > > SVM sets PREFERRED_LOC=GPU1
> > > > > > > > > > > for the same address range) with
> > > > > > > > > > > no detection or resolution.
> > > > > > > > > > > A global attribute coordinator
> > > > > > > > > > > or a shared manager is needed to
> > > > > > > > > > > provide a unified global view for migration decisions
> > > > > > > >
> > > > > > > > Yes, this is a hole in the Xe API too. We have told UMDs that if
> > > > > > > > they set up individual VMs with conflicting attributes for a single
> > > > > > > > CPU address space, the behavior is undefined. Our UMD's madvise
> > > > > > > > implementation basically loops over all GPU VMs setting the same
> > > > > > > > attributes.
> > > > > > >
> > > > > > > We will follow the same approach for now: the UMD is responsible
> > > > > > > for setting consistent attributes across GPU VMs.
> > > > > > >
> > > > > >
> > > > > > +1
> > > > > >
> > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > - migrate_vma_setup broadcast: one
> > > > > > > > > > > GPU's migration triggers MMU
> > > > > > > > > > > notifier callbacks in ALL N-1 other drm_gpusvm instances,
> > > > > > > > > > > causing N-1 unnecessary restore
> > > > > > > > > > > workers to be scheduled. And
> > > > > > > >
> > > > > > > > My feeling is that you shouldn’t reschedule restore
> > > > > > > > workers unless you
> > > > > > > > actually have to invalidate page tables (i.e., you have a local SVM
> > > > > > > > range within the notifier). So the first migration to an untouched
> > > > > > > > region may trigger notifiers, but they won’t do anything because you
> > > > > > > > don’t have any valid SVM ranges yet. Subsequent
> > > > > > > > mappings of the migrated
> > > > > > > > region won’t trigger a notifier unless the memory is moved again.
> > > > > > > >
> > > > > > >
> > > > > > > That is a very good point. We should check whether we actually have
> > > > > > > valid SVM ranges before scheduling restore workers. If
> > > > > > > there is nothing
> > > > > > > to invalidate, the notifier callback should be a no-op. We will review
> > > > > > > our notifier callback logic to ensure we are not doing
> > > > > > > unnecessary work
> > > > > > > here. Thank you for pointing this out.
> > > > > > >
> > > > > > > > > > > creates races between the
> > > > > > > > > > > initiating migration and the other
> > > > > > > > > > > instance's restore attempts.
> > > > > > > >
> > > > > > > > Yes, if multiple devices try to migrate the same CPU
> > > > > > > > pages at the same
> > > > > > > > time, that will race. That’s why in Xe we have a module-level
> > > > > > > > driver_migrate_lock. The first migration runs in read mode; if it
> > > > > > > > detects a race and aborts, it then takes driver_migrate_lock in write
> > > > > > > > mode so it becomes the only device allowed to move
> > > > > > > > memory / CPU pages.
> > > > > > > > See xe_svm_alloc_vram() for how this is used.
> > > > > > > >
> > > > > > > > I’m not sure this approach will work for you, but I
> > > > > > > > just wanted to point
> > > > > > > > out that we identified this as a potential issue.
> > > > > > > >
> > > > > > >
> > > > > > > Thank you for sharing the driver_migrate_lock approach and pointing to
> > > > > > > xe_svm_alloc_vram(). Will explore whether a similar lock
> > > > > > > pattern can work
> > > > > > > for our case.
> > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > - No cross instance migration serialization: each per GPU
> > > > > > > > > > > drm_gpusvm instance has independent locking, so two GPUs'
> > > > > > > > > > > "decide -> migrate -> remap"
> > > > > > > > > > > sequences can interleave. While
> > > > > > > > > > > the kernel page lock prevents
> > > > > > > > > > > truly simultaneous migration of
> > > > > > > > > > > the same physical page, the
> > > > > > > > > > > losing side's retry (evict from
> > > > > > > > > > > other GPU's VRAM -> migrate
> > > > > > > > > > > back) triggers broadcast notifier
> > > > > > > > > > > invalidations and restore
> > > > > > > > > > > workers, compounding the ping pong
> > > > > > > > > > > problem above.
> > > > > > > > > > >
> > > > > > > >
> > > > > > > > See the driver_migrate_lock above.
> > > > > > >
> > > > > > > Acknowledged, thank you.
> > > > > > > >
> > > > > > > > > > > - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
> > > > > > > > > > > hardcodes
> > > > > > > > > > > MIGRATE_VMA_SELECT_SYSTEM
> > > > > > > > > > > (drm_pagemap.c:328), meaning
> > > > > > > > > > > it only selects system memory pages for migration.
> > > > > > > > > > >
> > > > > > > >
> > > > > > > > I think this is fixed? We did find some core MM bugs
> > > > > > > > that blocked VRAM
> > > > > > > > to VRAM but those have been worked out.
> > > > > > > >
> > > > > > > > The code I'm looking at:
> > > > > > > >
> > > > > > > > 517 int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
> > > > > > > > 518                                   struct mm_struct *mm,
> > > > > > > > 519                                   unsigned long start, unsigned long end,
> > > > > > > > 520                                   const struct drm_pagemap_migrate_details *mdetails)
> > > > > > > > 521 {
> > > > > > > > 522         const struct drm_pagemap_devmem_ops *ops = devmem_allocation->ops;
> > > > > > > > 523         struct drm_pagemap *dpagemap = devmem_allocation->dpagemap;
> > > > > > > > 524         struct dev_pagemap *pagemap = dpagemap->pagemap;
> > > > > > > > 525         struct migrate_vma migrate = {
> > > > > > > > 526                 .start = start,
> > > > > > > > 527                 .end = end,
> > > > > > > > 528                 .pgmap_owner = pagemap->owner,
> > > > > > > > 529                 .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
> > > > > > > > 530                          MIGRATE_VMA_SELECT_DEVICE_PRIVATE | MIGRATE_VMA_SELECT_COMPOUND,
> > > > > > > > 531         };
> > > > > > > >
> > > > > > >
> > > > > > > Thank you for checking! I am using v6.18 for this POC and missed
> > > > > > > the fix; will rebase to the latest.
> > > > > > >
> > > > > > >
> > > > > > > > > > > - CPU fault reverse migration race: CPU page fault triggers
> > > > > > > > > > > migrate_to_ram while GPU
> > > > > > > > > > > instances are concurrently operating.
> > > > > > > > > > > Per GPU notifier_lock does not
> > > > > > > > > > > protect cross GPU operations.
> > > > > > > >
> > > > > > > > No, again retry loop as discussed above.
> > > > > > >
> > > > > > > Understood.
> > > > > > >
> > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > We believe a strong, well designed
> > > > > > > > > > > solution at the framework level is
> > > > > > > > > > > needed to properly address these problems, and we look forward to
> > > > > > > > > > > discussion and suggestions.
> > > > > > > >
> > > > > > > > Let's work together to figure out what is missing here.
> > > > > > >
> > > > > > > Thank you so much, Matt. Your feedback has been
> > > > > > > incredibly valuable and
> > > > > > > has given us a much clearer picture of the framework's design.
> > > > > > > I really appreciate the effort you put into building drm_gpusvm as a
> > > > > > > shared framework. Will incorporate your suggestions into our next
> > > > > > > revision and look forward to continuing the collaboration.
> > > > > > >
> > > > > >
> > > > > > No problem. Happy to help.
> > > > >
> > > > > Thank you again for all the detailed feedback.
> > > > >
> > > > > Regards,
> > > > > Honglei
> > > > >
> > > > > >
> > > > > > Matt
> > > > > >
> > > > > > > Regards,
> > > > > > > Honglei
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Matt
> > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Honglei Huang (12):
> > > > > > > > > > > drm/amdgpu: add SVM UAPI definitions
> > > > > > > > > > > drm/amdgpu: add SVM data structures and header
> > > > > > > > > > > drm/amdgpu: add SVM attribute data structures
> > > > > > > > > > > drm/amdgpu: implement SVM attribute tree operations
> > > > > > > > > > > drm/amdgpu: implement SVM attribute set
> > > > > > > > > > > drm/amdgpu: add SVM range data structures
> > > > > > > > > > > drm/amdgpu: implement SVM range PTE flags and GPU mapping
> > > > > > > > > > > drm/amdgpu: implement SVM range notifier and invalidation
> > > > > > > > > > > drm/amdgpu: implement SVM range workers
> > > > > > > > > > > drm/amdgpu: implement SVM core initialization and fini
> > > > > > > > > > > drm/amdgpu: implement SVM ioctl and fault handler
> > > > > > > > > > > drm/amdgpu: wire up SVM build system and fault handler
> > > > > > > > > > >
> > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/Kconfig            |   11 +
> > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/Makefile           |   13 +
> > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |    2 +
> > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c       |  430 ++++++
> > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h       |  147 ++
> > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c  |  894 ++++++++++
> > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h  |  110 ++
> > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 ++++++++++++++
> > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h |   76 ++
> > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   40 +-
> > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |    4 +
> > > > > > > > > > > include/uapi/drm/amdgpu_drm.h                 |   39 +
> > > > > > > > > > > 12 files changed, 2958 insertions(+), 4 deletions(-)
> > > > > > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
> > > > > > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
> > > > > > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
> > > > > > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
> > > > > > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
> > > > > > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> >
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-04-23 7:18 ` Matthew Brost
@ 2026-04-23 11:03 ` Huang, Honglei1
2026-04-23 20:21 ` Matthew Brost
0 siblings, 1 reply; 36+ messages in thread
From: Huang, Honglei1 @ 2026-04-23 11:03 UTC (permalink / raw)
To: Matthew Brost
Cc: Christian König, amd-gfx, dri-devel, Alexander.Deucher,
Felix.Kuehling, Honglei Huang, Oak.Zeng, Jenny-Jing.Liu,
Philip.Yang, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu, Junhua.Shen,
Thomas Hellström, Rodrigo Vivi, Danilo Krummrich, Alice Ryhl
On 4/23/2026 3:18 PM, Matthew Brost wrote:
...
>>>>>>> This clarifies a lot. This is what we’d call in Xe “preemption fence”
>>>>>>> mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
>>>>>>> resume. We don’t actually support SVM in this case; instead, we use
>>>>>>> “userptr binds,” which are built on gpusvm for page
>>>>>>> collection. However,
>>>>>>> we don’t support migrating memory to the device—though we could.
>>>>>>>
>>>>>>> I’d look at how we converted 'userptr' to be based on GPU SVM [2]. In
>>>>>>> this case, don’t maintain a range tree, as those—as you
>>>>>>> suggest—are more
>>>>>>> of an on-demand fault driver concern. Instead, just embed 'struct
>>>>>>> drm_gpusvm_pages' in the VMA struct defined by the IOCTLs..
>>>>>>>
>>>>>>> We could extend this to support migrating 'userptr', but we
>>>>>>> just haven’t
>>>>>>> done that yet—this may be what you want to do in “XNACK off..
>>>>>>>
>>>>>>> [2] https://patchwork.freedesktop.org/series/146553/
>>>>>>>
>>>>>>
>>>>>> Actually we need to switch the xnack mode between on and off, so
>>>>>> in xnack off mode the driver operates in "implicit prefetch mode".
>>>>>> This may be due to compatibility with older hardware and the needs
>>>>>> of the UMD runtime. We will further discuss the handling method
>>>>>> under xnack off internally.
>>>>>>
>>>
>>> Hi Matt,
>>>
>>> I studied the xe_userptr code and the conversion series [2] you
>>> pointed to.
>>>
>>> I have a question that:
>>> Would it be possible to reuse drm_gpusvm_range to handle the hardware
>>> without gpu fault feature(xnack off mode).
>>
>> That’s not how we’ve done it. We embedded drm_gpusvm_pages into our VMA
>> structure and then attached a notifier. The notifier attachment is
>> open-coded on the Xe side, and this could be normalized and opened up
>> for common driver use cases.
The xe_userptr approach is like the implementation in kfd_svm: embed the
physical pages into a structure and attach a notifier of the same size.
But kfd_svm is an implementation of SVM semantics: it supports partial
unmap, does not need an explicit delete-userptr ioctl call on removal,
and does not need an explicit userptr flag at creation.
And there is actually already an existing implementation of userptr
semantics in amdgpu KFD: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR.
If the no GPU fault mode cannot use the drm_gpusvm framework and has to
follow the same approach as xe_userptr, it seems like duplicating that work.
I think the core gap is that we are trying to use drm_gpusvm to implement
a driver with SVM semantics for no GPU fault hardware, not userptr
semantics.
>>
>> The problem with reusing drm_gpusvm_range directly is that a VMA may
>> span multiple gpusvm notifiers—i.e., it can be larger than the notifier
>> size. Of course, we could rework this as well.
So the "VMA spans multiple gpusvm notifiers" concern: I'd like to
clarify that this is not actually a blocker for amdgpu's XNACK-off path,
because amdgpu does not try to represent one user ioctl virtual address
interval as a single drm_gpusvm_range.
We walk the attr interval and call drm_gpusvm_range_find_or_insert()
repeatedly, letting gpusvm pick chunk-aligned ranges bounded by
notifier_size. One ioctl interval will create N chunk-sized ranges.
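For reference, roughly the loop described above, as a sketch only - the
amdgpu_svm_map_interval() name and the PTE-update step are placeholders
for this POC, while the drm_gpusvm_* calls follow the signatures in the
current drm_gpusvm.h:

/* Cover [start, end) with chunk-sized gpusvm ranges and map each one
 * eagerly at ioctl time (XNACK off). Error/retry handling trimmed.
 */
static int amdgpu_svm_map_interval(struct drm_gpusvm *gpusvm,
				   unsigned long start, unsigned long end,
				   const struct drm_gpusvm_ctx *ctx)
{
	unsigned long addr = start;

	while (addr < end) {
		struct drm_gpusvm_range *range;
		int err;

		/* gpusvm picks a chunk-aligned range bounded by notifier_size */
		range = drm_gpusvm_range_find_or_insert(gpusvm, addr,
							start, end, ctx);
		if (IS_ERR(range))
			return PTR_ERR(range);

		err = drm_gpusvm_range_get_pages(gpusvm, range, ctx);
		if (err)
			return err;	/* real code retries on -EAGAIN */

		/* driver side: validate and write GPU PTEs for this range */

		addr = drm_gpusvm_range_end(range);
	}

	return 0;
}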
>>
>
> Sorry for the double reply—I just glanced at the latest series. I don’t
> think creating a range per page of the userptr is desirable. While it
> would work, from a time-complexity point of view I don’t think this is
> ideal.
>
> The issue with spans across multiple notifiers is real, though.
>
> My rough idea would be:
>
> - Give drivers an interface to create larger ranges.
So maybe we do not need to create larger ranges if we call
drm_gpusvm_range_find_or_insert() repeatedly.
>
> - If the range fits inside a single notifier’s size → done.
>
> - If the range spans multiple notifier sizes → round up to a power of
> two and create a larger notifier. This may overlap with existing
> notifiers, which is likely fine given that interval trees support
> overlaps (?). We’d need to double-check and test this. If overlapping
> notifiers are not acceptable, we’d need some heavy-handed notifier merge
> logic—it will be complicated, but isolated, so once we get it right
> everyone can use it.
If we call drm_gpusvm_range_find_or_insert() repeatedly, drm_gpusvm
will create the corresponding notifiers correctly as far as I can see.
Regards,
Honglei
>
> - Finally, make sure that individual userptr pages can reside at any
> location.
>
> Or conversely:
>
> - Normalize embedding of drm_gpusvm_pages in VMA structs + notifier
> creation
>
> - Make sure that individual userptr pages can reside at any location.
>
> Both options actually sound really similar after typing this out.
>
> Matt
>
>> So either way, the Xe userptr + gpusvm implementation should be refined
>> further for common driver use.
>>
>>>
>>> Reusing drm_gpusvm_range for the XNACK-off case would simplify our
>>> implementation considerably, it already provides large page chunk
>>> optimization, can reuse the existing migration infrastructure.
>>>
>>> Building these on top of a standalone drm_gpusvm_pages
>>> would mean reimplementing much of what the range layer already offers.
>>> It would also let us keep a single code path for both XNACK modes,
>>> which reduces maintenance burden and avoids behavioral difference.
>>>
>>> Would this direction be acceptable, or do you see concerns with reusing
>>> the range infrastructure for the no-fault case?
>>>
>>
>> If you prefer something like insert a range exactly here + create range
>> + notifier I think that completely reasonable direction and Xe would
>> likely switch over to using this.
>>
>> I guess my only concern is sub-userptr migration. We are trending
>> towards allowing userptrs to being migrated either via prefetch IOCTLs
>> or access counters on the GPU side - access counter we'd likely a single
>> 2M page at time migration within the userptr. get_pages() supports mixed
>> mappings between VRAM + system but likely needs some more work to really
>> make this complete though.
>>
>> Matt
>>
>>> Regards,
>>> Honglei
...
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-04-23 11:03 ` Huang, Honglei1
@ 2026-04-23 20:21 ` Matthew Brost
2026-04-24 10:43 ` Huang, Honglei1
0 siblings, 1 reply; 36+ messages in thread
From: Matthew Brost @ 2026-04-23 20:21 UTC (permalink / raw)
To: Huang, Honglei1
Cc: Christian König, amd-gfx, dri-devel, Alexander.Deucher,
Felix.Kuehling, Honglei Huang, Oak.Zeng, Jenny-Jing.Liu,
Philip.Yang, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu, Junhua.Shen,
Thomas Hellström, Rodrigo Vivi, Danilo Krummrich, Alice Ryhl
On Thu, Apr 23, 2026 at 07:03:52PM +0800, Huang, Honglei1 wrote:
>
>
> On 4/23/2026 3:18 PM, Matthew Brost wrote:
> ...
> > > > > > > > This clarifies a lot. This is what we’d call in Xe “preemption fence”
> > > > > > > > mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
> > > > > > > > resume. We don’t actually support SVM in this case; instead, we use
> > > > > > > > “userptr binds,” which are built on gpusvm for page
> > > > > > > > collection. However,
> > > > > > > > we don’t support migrating memory to the device—though we could.
> > > > > > > >
> > > > > > > > I’d look at how we converted 'userptr' to be based on GPU SVM [2]. In
> > > > > > > > this case, don’t maintain a range tree, as those—as you
> > > > > > > > suggest—are more
> > > > > > > > of an on-demand fault driver concern. Instead, just embed 'struct
> > > > > > > > drm_gpusvm_pages' in the VMA struct defined by the IOCTLs..
> > > > > > > >
> > > > > > > > We could extend this to support migrating 'userptr', but we
> > > > > > > > just haven’t
> > > > > > > > done that yet—this may be what you want to do in “XNACK off..
> > > > > > > >
> > > > > > > > [2] https://patchwork.freedesktop.org/series/146553/
> > > > > > > >
> > > > > > >
> > > > > > > Actually we need to swith the xnack mode between on and off, so
> > > > > > > in xnack off
> > > > > > > mode, the driver operats in "implicit prefetch mode". This may
> > > > > > > be due to
> > > > > > > compatibility with older hardware and the need for UMD runtime. We will
> > > > > > > further discuss the handling method under xnack off internally.
> > > > > > >
> > > >
> > > > Hi Matt,
> > > >
> > > > I studied the xe_userptr code and the conversion series [2] you
> > > > pointed to.
> > > >
> > > > I have a question that:
> > > > Would it be possible to reuse drm_gpusvm_range to handle the hardware
> > > > without gpu fault feature(xnack off mode).
> > >
> > > That’s not how we’ve done it. We embedded drm_gpusvm_pages into our VMA
> > > structure and then attached a notifier. The notifier attachment is
> > > open-coded on the Xe side, and this could be normalized and opened up
> > > for common driver use cases.
>
> The way in xe_userptr likes the implementation in kfd_svm: embeded physical
> pages into structure and attach same size notifier.
> But kfd_svm is an implementation of SVM semantics, which supports partial
> unmap, doesn't need explicitly delete userptr ioctl calling when remove ,
> and doesn't need a explicitly userptr flag when creating.
> And actually there is also a existing implementation for userptr semantics
> in amdgpu kfd: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR.
> If the no gpu fault mode can not use the drm gpu svm fram work, use the same
> way for xe_userptr, it seems like doing the duplicate work.
>
> I think the core gap is we are trying to use the drmgpu_svm to implement a
> SVM semantics driver for no gpu fault hardware instead of userptr semantics.
>
> > >
> > > The problem with reusing drm_gpusvm_range directly is that a VMA may
> > > span multiple gpusvm notifiers—i.e., it can be larger than the notifier
> > > size. Of course, we could rework this as well.
>
> So the "VMA spans multiple gpusvm notifiers" concern: I'd like to clarify
> that this is not actually a blocker for amdgpu's XNACK-off path, because
> amdgpu does not try to represent one user ioctl virtual address interval as
> a single drm_gpusvm_range.
>
> we walk the attr interval and call drm_gpusvm_range_find_or_insert()
> repeatedly, letting gpusvm pick chunk aligned ranges bounded by
> notifier_size. One ioctl interval will create N chunk sized ranges.
>
> > >
> >
> > Sorry for the double reply—I just glanced at the latest series. I don’t
> > think creating a range per page of the userptr is desirable. While it
> > would work, from a time-complexity point of view I don’t think this is
> > ideal.
> >
> > The issue with spans across multiple notifiers is real, though.
> >
> > My rough idea would be:
> >
> > - Give drivers an interface to create larger ranges.
>
> So maybe we do not need to create larger ranges if we call
> drm_gpusvm_range_find_or_insert() repeatedly.
>
That will be functional, but consider it from a time-complexity point of
view.
Multiple ranges increase the time complexity of range-tree searches.
This isn’t a huge deal, but it will show up to some extent.
Multiple ranges will also slow down DMA mapping and migration. We
switched over to the dma_iova_alloc/link/unlink/sync API here [1].
While dma_iova_link is a relatively fast radix-tree walk, the allocation
and sync steps are where things get expensive. Therefore, it is
advantageous to perform these steps as few times as possible. For
example, if your SVM buffer is 512MB, instead of doing these steps 256
times, you do them once. The same logic applies to the migrate_vma_*
functions—they are quite expensive, so doing them in a single shot is
significantly faster.
The same applies to invalidations. If you can invalidate a large range
in a single shot, it will be faster. Although the logic in the notifier
should be able to zap multiple ranges in one shot (Xe does this), having
to DMA-unmap a single large range will still be faster than multiple
smaller DMA unmaps.
The TL;DR is that if your driver knows the size of the SVM allocation
upfront (e.g., an IOCTL tells you the size) it makes more sense to use a
single large struct (either embed drm_gpusvm_pages into a VMA or we
figure out an interface to insert large ranges / notifiers).
[1] https://patchwork.freedesktop.org/series/160587/
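As a toy illustration of the fixed-cost argument (do_map_span() below is
a purely hypothetical stand-in for the IOVA alloc/sync plus migrate_vma_*
setup; only the call counts matter):

/* How often the expensive per-range setup runs for a 512MB SVM buffer. */
static void map_buffer(unsigned long start, bool one_shot)
{
	const unsigned long size = 512ul << 20;	/* 512MB buffer */
	const unsigned long chunk = 2ul << 20;	/* 2MB gpusvm chunk */
	unsigned long addr;

	if (one_shot) {
		do_map_span(start, size);	/* fixed cost paid once */
		return;
	}

	/* fixed cost paid size / chunk = 256 times */
	for (addr = start; addr < start + size; addr += chunk)
		do_map_span(addr, chunk);
}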
> >
> > - If the range fits inside a single notifier’s size → done.
> >
> > - If the range spans multiple notifier sizes → round up to a power of
> > two and create a larger notifier. This may overlap with existing
> > notifiers, which is likely fine given that interval trees support
> > overlaps (?). We’d need to double-check and test this. If overlapping
> > notifiers are not acceptable, we’d need some heavy-handed notifier merge
> > logic—it will be complicated, but isolated, so once we get it right
> > everyone can use it.
>
> If we call drm_gpusvm_range_find_or_insert() repeatedly the drmgpu_svm will
> create the corresponding notifier correctly as far as I can see.
>
I agree this will be functional but not ideal. You can always start with
the approach you have here and optimize it later by adding the required
support in GPU SVM.
Matt
> Regards,
> Honglei
>
> >
> > - Finally, make sure that individual userptr pages can reside at any
> > location.
> >
> > Or conversely:
> >
> > - Normalize embedding of drm_gpusvm_pages in VMA structs + notifier
> > creation
> >
> > - Make sure that individual userptr pages can reside at any location.
>
> >
> > Both options actually sound really similar after typing this out.
> >
> > Matt
> >
> > > So either way, the Xe userptr + gpusvm implementation should be refined
> > > further for common driver use.
> > >
> > > >
> > > > Reusing drm_gpusvm_range for the XNACK-off case would simplify our
> > > > implementation considerably, it already provides large page chunk
> > > > optimization, can reuse the existing migration infrastructure.
> > > >
> > > > Building these on top of a standalone drm_gpusvm_pages
> > > > would mean reimplementing much of what the range layer already offers.
> > > > It would also let us keep a single code path for both XNACK modes,
> > > > which reduces maintenance burden and avoids behavioral difference.
> > > >
> > > > Would this direction be acceptable, or do you see concerns with reusing
> > > > the range infrastructure for the no-fault case?
> > > >
> > >
> > > If you prefer something like insert a range exactly here + create range
> > > + notifier I think that completely reasonable direction and Xe would
> > > likely switch over to using this.
> > >
> > > I guess my only concern is sub-userptr migration. We are trending
> > > towards allowing userptrs to being migrated either via prefetch IOCTLs
> > > or access counters on the GPU side - access counter we'd likely a single
> > > 2M page at time migration within the userptr. get_pages() supports mixed
> > > mappings between VRAM + system but likely needs some more work to really
> > > make this complete though.
> > >
> > > Matt
> > > > Regards,
> > > > Honglei
> ...
>
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-04-23 20:21 ` Matthew Brost
@ 2026-04-24 10:43 ` Huang, Honglei1
2026-04-27 20:00 ` Felix Kuehling
0 siblings, 1 reply; 36+ messages in thread
From: Huang, Honglei1 @ 2026-04-24 10:43 UTC (permalink / raw)
To: Matthew Brost, Christian König, Felix.Kuehling
Cc: amd-gfx, dri-devel, Alexander.Deucher, Honglei Huang, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen, Thomas Hellström, Rodrigo Vivi,
Danilo Krummrich, Alice Ryhl
On 4/24/2026 4:21 AM, Matthew Brost wrote:
> On Thu, Apr 23, 2026 at 07:03:52PM +0800, Huang, Honglei1 wrote:
>>
>>
>> On 4/23/2026 3:18 PM, Matthew Brost wrote:
>> ...
>>>>>>>>> This clarifies a lot. This is what we’d call in Xe “preemption fence”
>>>>>>>>> mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
>>>>>>>>> resume. We don’t actually support SVM in this case; instead, we use
>>>>>>>>> “userptr binds,” which are built on gpusvm for page
>>>>>>>>> collection. However,
>>>>>>>>> we don’t support migrating memory to the device—though we could.
>>>>>>>>>
>>>>>>>>> I’d look at how we converted 'userptr' to be based on GPU SVM [2]. In
>>>>>>>>> this case, don’t maintain a range tree, as those—as you
>>>>>>>>> suggest—are more
>>>>>>>>> of an on-demand fault driver concern. Instead, just embed 'struct
>>>>>>>>> drm_gpusvm_pages' in the VMA struct defined by the IOCTLs..
>>>>>>>>>
>>>>>>>>> We could extend this to support migrating 'userptr', but we
>>>>>>>>> just haven’t
>>>>>>>>> done that yet—this may be what you want to do in “XNACK off..
>>>>>>>>>
>>>>>>>>> [2] https://patchwork.freedesktop.org/series/146553/
>>>>>>>>>
>>>>>>>>
>>>>>>>> Actually we need to swith the xnack mode between on and off, so
>>>>>>>> in xnack off
>>>>>>>> mode, the driver operats in "implicit prefetch mode". This may
>>>>>>>> be due to
>>>>>>>> compatibility with older hardware and the need for UMD runtime. We will
>>>>>>>> further discuss the handling method under xnack off internally.
>>>>>>>>
>>>>>
>>>>> Hi Matt,
>>>>>
>>>>> I studied the xe_userptr code and the conversion series [2] you
>>>>> pointed to.
>>>>>
>>>>> I have a question that:
>>>>> Would it be possible to reuse drm_gpusvm_range to handle the hardware
>>>>> without gpu fault feature(xnack off mode).
>>>>
>>>> That’s not how we’ve done it. We embedded drm_gpusvm_pages into our VMA
>>>> structure and then attached a notifier. The notifier attachment is
>>>> open-coded on the Xe side, and this could be normalized and opened up
>>>> for common driver use cases.
>>
>> The way in xe_userptr likes the implementation in kfd_svm: embeded physical
>> pages into structure and attach same size notifier.
>> But kfd_svm is an implementation of SVM semantics, which supports partial
>> unmap, doesn't need explicitly delete userptr ioctl calling when remove ,
>> and doesn't need a explicitly userptr flag when creating.
>> And actually there is also a existing implementation for userptr semantics
>> in amdgpu kfd: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR.
>> If the no gpu fault mode can not use the drm gpu svm fram work, use the same
>> way for xe_userptr, it seems like doing the duplicate work.
>>
>> I think the core gap is we are trying to use the drmgpu_svm to implement a
>> SVM semantics driver for no gpu fault hardware instead of userptr semantics.
>>
>>>>
>>>> The problem with reusing drm_gpusvm_range directly is that a VMA may
>>>> span multiple gpusvm notifiers—i.e., it can be larger than the notifier
>>>> size. Of course, we could rework this as well.
>>
>> So the "VMA spans multiple gpusvm notifiers" concern: I'd like to clarify
>> that this is not actually a blocker for amdgpu's XNACK-off path, because
>> amdgpu does not try to represent one user ioctl virtual address interval as
>> a single drm_gpusvm_range.
>>
>> we walk the attr interval and call drm_gpusvm_range_find_or_insert()
>> repeatedly, letting gpusvm pick chunk aligned ranges bounded by
>> notifier_size. One ioctl interval will create N chunk sized ranges.
>>
>>>>
>>>
>>> Sorry for the double reply—I just glanced at the latest series. I don’t
>>> think creating a range per page of the userptr is desirable. While it
>>> would work, from a time-complexity point of view I don’t think this is
>>> ideal.
>>>
>>> The issue with spans across multiple notifiers is real, though.
>>>
>>> My rough idea would be:
>>>
>>> - Give drivers an interface to create larger ranges.
>>
>> So maybe we do not need to create larger ranges if we call
>> drm_gpusvm_range_find_or_insert() repeatedly.
>>
>
> That will be functional, but consider it from a time-complexity point of
> view.
>
> Multiple ranges increase the time complexity of range-tree searches.
> This isn’t a huge deal, but it will show up to some extent.
>
> Multiple ranges will also slow down DMA mapping and migration. We
> switched over to the dma_iova_alloc/link/unlink/sync uAPI here [1].
> While dma_iova_link is a relatively fast radix-tree walk, the allocation
> and sync steps are where things get expensive. Therefore, it is
> advantageous to perform these steps as few times as possible. For
> example, if your SVM buffer is 512MB, instead of doing these steps 256
> times, you do them once. The same logic applies to the migrate_vma_*
> functions—they are quite expensive, so doing them in a single shot is
> significantly faster.
>
> The same applies to invalidations. If you can invalidate a large range
> in a single shot, it will be faster. Although the logic in the notifier
> should be able to zap multiple ranges in one shot (Xe does this), having
> to DMA-unmap a single large range will still be faster than multiple
> smaller DMA unmaps.
>
> The TL;DR is if your driver knows size of SVM allocation upfront (e.g.,
> an IOCTL tells you the size) it makes more sense to use a single large
> struct (either embedded drm_gpusvm_pages into a VMA or we figure out an
> interface to insert large ranges / notifiers).
>
> [1] https://patchwork.freedesktop.org/series/160587/
>
>>>
>>> - If the range fits inside a single notifier’s size → done.
>>>
>>> - If the range spans multiple notifier sizes → round up to a power of
>>> two and create a larger notifier. This may overlap with existing
>>> notifiers, which is likely fine given that interval trees support
>>> overlaps (?). We’d need to double-check and test this. If overlapping
>>> notifiers are not acceptable, we’d need some heavy-handed notifier merge
>>> logic—it will be complicated, but isolated, so once we get it right
>>> everyone can use it.
>>
>> If we call drm_gpusvm_range_find_or_insert() repeatedly the drmgpu_svm will
>> create the corresponding notifier correctly as far as I can see.
>>
>
> I agree this will be functional but not ideal. You can always start the
> approach you have here and optimize it later by adding the required
> support in GPU SVM.
>
Hi Matt,
Thanks a lot for the information, it really helps!
Hi Christian, Felix,
According to the discussion with Matt on the previous thread, I'd like
to align with you on the XNACK off direction before start to the series.
According to the information form Matt:
when the allocation size is known doing one big operation is
significantly faster than doing many small ranges, because
the allocation and sync steps are where things get expensive.
Doing them in a single shot is significantly faster, especially in the
situlation of xnack off mode, which needs pre fault and pre map in
ioctl, and the size is known.
It is confirmed that repeatedly calling drm_gpusvm_range_find_or_insert() is
functional, and suggested we land it first and optimize later by adding
large range support in GPU SVM core. That motivates the two phase plan
below.
Phase 1:
- Reuse drm_gpusvm_range for XNACK-off: one ioctl interval is split by
  drm_gpusvm_range_find_or_insert() into N chunk-sized ranges bounded by
  notifier_size, the same mechanism as the fault path (see the sketch
  below).
- Populate all ranges at ioctl / submit time instead of on fault.
- Invalidation -> GPU queue stop -> rebind/restore the pages and GPU
  mapping -> restore queue.
Phase 2:
Add a large range / large notifier insert interface to the GPU SVM core
so one ioctl interval maps to a single range, improving efficiency.
This needs changes to the drm_gpusvm framework.
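To make Phase 1 concrete, here is a rough sketch of the ioctl-time
prefault loop I have in mind (illustration only: the helper name
amdgpu_svm_prefault_interval is hypothetical, and error unwinding plus
the actual amdgpu_vm_update_range() call are omitted):

#include <linux/err.h>
#include <drm/drm_gpusvm.h>

static int amdgpu_svm_prefault_interval(struct drm_gpusvm *gpusvm,
					unsigned long start,
					unsigned long end,
					const struct drm_gpusvm_ctx *ctx)
{
	unsigned long addr = start;

	while (addr < end) {
		struct drm_gpusvm_range *range;
		int err;

		/* gpusvm picks a chunk-aligned range bounded by notifier_size */
		range = drm_gpusvm_range_find_or_insert(gpusvm, addr,
							start, end, ctx);
		if (IS_ERR(range))
			return PTR_ERR(range);

		/* hmm_range_fault + DMA mapping for this chunk */
		err = drm_gpusvm_range_get_pages(gpusvm, range, ctx);
		if (err)
			return err;

		/* driver would update the GPU page tables for this range here */

		addr = drm_gpusvm_range_end(range);
	}

	return 0;
}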
May I know your thoughts on this plan?
Regards,
Honglei
> Matt
>
>> Regards,
>> Honglei
>>
>>>
>>> - Finally, make sure that individual userptr pages can reside at any
>>> location.
>>>
>>> Over conversely:
>>>
>>> - Normalize embedding of drm_gpusvm_pages in VMA structs + notifier
>>> creation
>>>
>>> - Make sure that individual userptr pages can reside at any location.
>>
>>>
>>> Both options actually sound really similar after typing this out.
>>>
>>> Matt
>>>
>>>> So either way, the Xe userptr + gpusvm implementation should be refined
>>>> further for common driver use.
>>>>
>>>>>
>>>>> Reusing drm_gpusvm_range for the XNACK-off case would simplify our
>>>>> implementation considerably, it already provides large page chunk
>>>>> optimization, can reuse the existing migration infrastructure.
>>>>>
>>>>> Building these on top of a standalone drm_gpusvm_pages
>>>>> would mean reimplementing much of what the range layer already offers.
>>>>> It would also let us keep a single code path for both XNACK modes,
>>>>> which reduces maintenance burden and avoids behavioral difference.
>>>>>
>>>>> Would this direction be acceptable, or do you see concerns with reusing
>>>>> the range infrastructure for the no-fault case?
>>>>>
>>>>
>>>> If you prefer something like insert a range exactly here + create range
>>>> + notifier I think that completely reasonable direction and Xe would
>>>> likely switch over to using this.
>>>>
>>>> I guess my only concern is sub-userptr migration. We are trending
>>>> towards allowing userptrs to being migrated either via prefetch IOCTLs
>>>> or access counters on the GPU side - access counter we'd likely a single
>>>> 2M page at time migration within the userptr. get_pages() supports mixed
>>>> mappings between VRAM + system but likely needs some more work to really
>>>> make this complete though.
>>>>
>>>> Matt
>>>>> Regards,
>>>>> Honglei
>> ...
>>
>>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-04-24 10:43 ` Huang, Honglei1
@ 2026-04-27 20:00 ` Felix Kuehling
2026-04-28 2:23 ` Huang, Honglei1
0 siblings, 1 reply; 36+ messages in thread
From: Felix Kuehling @ 2026-04-27 20:00 UTC (permalink / raw)
To: Huang, Honglei1, Matthew Brost, Christian König
Cc: amd-gfx, dri-devel, Alexander.Deucher, Honglei Huang, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen, Thomas Hellström, Rodrigo Vivi,
Danilo Krummrich, Alice Ryhl
On 2026-04-24 06:43, Huang, Honglei1 wrote:
>
>
> On 4/24/2026 4:21 AM, Matthew Brost wrote:
>> On Thu, Apr 23, 2026 at 07:03:52PM +0800, Huang, Honglei1 wrote:
>>>
>>>
>>> On 4/23/2026 3:18 PM, Matthew Brost wrote:
>>> ...
>>>>>>>>>> This clarifies a lot. This is what we’d call in Xe
>>>>>>>>>> “preemption fence”
>>>>>>>>>> mode for a VM. Anytime memory is moved, we trigger a GPU
>>>>>>>>>> preemption and
>>>>>>>>>> resume. We don’t actually support SVM in this case; instead,
>>>>>>>>>> we use
>>>>>>>>>> “userptr binds,” which are built on gpusvm for page
>>>>>>>>>> collection. However,
>>>>>>>>>> we don’t support migrating memory to the device—though we could.
>>>>>>>>>>
>>>>>>>>>> I’d look at how we converted 'userptr' to be based on GPU SVM
>>>>>>>>>> [2]. In
>>>>>>>>>> this case, don’t maintain a range tree, as those—as you
>>>>>>>>>> suggest—are more
>>>>>>>>>> of an on-demand fault driver concern. Instead, just embed
>>>>>>>>>> 'struct
>>>>>>>>>> drm_gpusvm_pages' in the VMA struct defined by the IOCTLs..
>>>>>>>>>>
>>>>>>>>>> We could extend this to support migrating 'userptr', but we
>>>>>>>>>> just haven’t
>>>>>>>>>> done that yet—this may be what you want to do in “XNACK off..
>>>>>>>>>>
>>>>>>>>>> [2] https://patchwork.freedesktop.org/series/146553/
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Actually we need to swith the xnack mode between on and off, so
>>>>>>>>> in xnack off
>>>>>>>>> mode, the driver operats in "implicit prefetch mode". This may
>>>>>>>>> be due to
>>>>>>>>> compatibility with older hardware and the need for UMD
>>>>>>>>> runtime. We will
>>>>>>>>> further discuss the handling method under xnack off internally.
>>>>>>>>>
>>>>>>
>>>>>> Hi Matt,
>>>>>>
>>>>>> I studied the xe_userptr code and the conversion series [2] you
>>>>>> pointed to.
>>>>>>
>>>>>> I have a question that:
>>>>>> Would it be possible to reuse drm_gpusvm_range to handle the
>>>>>> hardware
>>>>>> without gpu fault feature(xnack off mode).
>>>>>
>>>>> That’s not how we’ve done it. We embedded drm_gpusvm_pages into
>>>>> our VMA
>>>>> structure and then attached a notifier. The notifier attachment is
>>>>> open-coded on the Xe side, and this could be normalized and opened up
>>>>> for common driver use cases.
>>>
>>> The way in xe_userptr likes the implementation in kfd_svm: embeded
>>> physical
>>> pages into structure and attach same size notifier.
>>> But kfd_svm is an implementation of SVM semantics, which supports
>>> partial
>>> unmap, doesn't need explicitly delete userptr ioctl calling when
>>> remove ,
>>> and doesn't need a explicitly userptr flag when creating.
>>> And actually there is also a existing implementation for userptr
>>> semantics
>>> in amdgpu kfd: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR.
>>> If the no gpu fault mode can not use the drm gpu svm fram work, use
>>> the same
>>> way for xe_userptr, it seems like doing the duplicate work.
>>>
>>> I think the core gap is we are trying to use the drmgpu_svm to
>>> implement a
>>> SVM semantics driver for no gpu fault hardware instead of userptr
>>> semantics.
>>>
>>>>>
>>>>> The problem with reusing drm_gpusvm_range directly is that a VMA may
>>>>> span multiple gpusvm notifiers—i.e., it can be larger than the
>>>>> notifier
>>>>> size. Of course, we could rework this as well.
>>>
>>> So the "VMA spans multiple gpusvm notifiers" concern: I'd like to
>>> clarify
>>> that this is not actually a blocker for amdgpu's XNACK-off path,
>>> because
>>> amdgpu does not try to represent one user ioctl virtual address
>>> interval as
>>> a single drm_gpusvm_range.
>>>
>>> we walk the attr interval and call drm_gpusvm_range_find_or_insert()
>>> repeatedly, letting gpusvm pick chunk aligned ranges bounded by
>>> notifier_size. One ioctl interval will create N chunk sized ranges.
>>>
>>>>>
>>>>
>>>> Sorry for the double reply—I just glanced at the latest series. I
>>>> don’t
>>>> think creating a range per page of the userptr is desirable. While it
>>>> would work, from a time-complexity point of view I don’t think this is
>>>> ideal.
>>>>
>>>> The issue with spans across multiple notifiers is real, though.
>>>>
>>>> My rough idea would be:
>>>>
>>>> - Give drivers an interface to create larger ranges.
>>>
>>> So maybe we do not need to create larger ranges if we call
>>> drm_gpusvm_range_find_or_insert() repeatedly.
>>>
>>
>> That will be functional, but consider it from a time-complexity point of
>> view.
>>
>> Multiple ranges increase the time complexity of range-tree searches.
>> This isn’t a huge deal, but it will show up to some extent.
>>
>> Multiple ranges will also slow down DMA mapping and migration. We
>> switched over to the dma_iova_alloc/link/unlink/sync uAPI here [1].
>> While dma_iova_link is a relatively fast radix-tree walk, the allocation
>> and sync steps are where things get expensive. Therefore, it is
>> advantageous to perform these steps as few times as possible. For
>> example, if your SVM buffer is 512MB, instead of doing these steps 256
>> times, you do them once. The same logic applies to the migrate_vma_*
>> functions—they are quite expensive, so doing them in a single shot is
>> significantly faster.
>>
>> The same applies to invalidations. If you can invalidate a large range
>> in a single shot, it will be faster. Although the logic in the notifier
>> should be able to zap multiple ranges in one shot (Xe does this), having
>> to DMA-unmap a single large range will still be faster than multiple
>> smaller DMA unmaps.
>>
>> The TL;DR is if your driver knows size of SVM allocation upfront (e.g.,
>> an IOCTL tells you the size) it makes more sense to use a single large
>> struct (either embedded drm_gpusvm_pages into a VMA or we figure out an
>> interface to insert large ranges / notifiers).
>>
>> [1] https://patchwork.freedesktop.org/series/160587/
>>
>>>>
>>>> - If the range fits inside a single notifier’s size → done.
>>>>
>>>> - If the range spans multiple notifier sizes → round up to a power of
>>>> two and create a larger notifier. This may overlap with existing
>>>> notifiers, which is likely fine given that interval trees support
>>>> overlaps (?). We’d need to double-check and test this. If
>>>> overlapping
>>>> notifiers are not acceptable, we’d need some heavy-handed
>>>> notifier merge
>>>> logic—it will be complicated, but isolated, so once we get it
>>>> right
>>>> everyone can use it.
>>>
>>> If we call drm_gpusvm_range_find_or_insert() repeatedly the
>>> drmgpu_svm will
>>> create the corresponding notifier correctly as far as I can see.
>>>
>>
>> I agree this will be functional but not ideal. You can always start the
>> approach you have here and optimize it later by adding the required
>> support in GPU SVM.
>>
>
> Hi Matt,
>
> Really thanks for your information, this really helps a lot!
>
>
> Hi Christian, Felix,
>
> According to the discussion with Matt on the previous thread, I'd like
> to align with you on the XNACK off direction before start to the series.
>
> According to the information form Matt:
> when the allocation size is known doing one big operation is
> significantly faster than doing many small ranges, because
> the allocation and sync steps are where things get expensive.
> Doing them in a single shot is significantly faster, especially in the
> situlation of xnack off mode, which needs pre fault and pre map in
> ioctl, and the size is known.
>
> It is confirmed that repeatedly calling
> drm_gpusvm_range_find_or_insert() is
> functional, and suggested we land it first and optimize later by adding
> large range support in GPU SVM core. That motivates the two phase plan
> below.
>
> Phase 1
> - Reuse drm_gpusvm_range for XNACK-off, one ioctl interval is split by
> drm_gpusvm_range_find_or_insert() into
> N chunk-sized ranges bounded by notifier_size, same mechanism as the
> fault path.
> - populate all ranges at ioctl / submit time instead of on fault.
> - Invalidation -> GPU queue stop -> rebind/restore the pages and gpu
> map ->restore queue
>
> Phase 2:
> Add a large range / large notifier insert interface in GPU SVM core
> so one ioctl interval maps to a single range to improve efficiency.
> This needs modify the drmgpu_svm frame work.
>
> May I know your thoughts on this plan?
I think drm_gpusvm_range_find_or_insert already has all the parameters
necessary to allocate larger notifiers and ranges. All it would take is
maybe adding a flag in drm_gpusvm_ctx to request larger range allocation
instead of arbitrary chunking.
I agree this could be done as a second phase and is mostly work in the
drm_gpusvm code.
Regards,
Felix
>
> Regards,
> Honglei
>
>
>> Matt
>>
>>> Regards,
>>> Honglei
>>>
>>>>
>>>> - Finally, make sure that individual userptr pages can reside at any
>>>> location.
>>>>
>>>> Over conversely:
>>>>
>>>> - Normalize embedding of drm_gpusvm_pages in VMA structs + notifier
>>>> creation
>>>>
>>>> - Make sure that individual userptr pages can reside at any location.
>>>
>>>>
>>>> Both options actually sound really similar after typing this out.
>>>>
>>>> Matt
>>>>
>>>>> So either way, the Xe userptr + gpusvm implementation should be
>>>>> refined
>>>>> further for common driver use.
>>>>>
>>>>>>
>>>>>> Reusing drm_gpusvm_range for the XNACK-off case would simplify our
>>>>>> implementation considerably, it already provides large page chunk
>>>>>> optimization, can reuse the existing migration infrastructure.
>>>>>>
>>>>>> Building these on top of a standalone drm_gpusvm_pages
>>>>>> would mean reimplementing much of what the range layer already
>>>>>> offers.
>>>>>> It would also let us keep a single code path for both XNACK modes,
>>>>>> which reduces maintenance burden and avoids behavioral difference.
>>>>>>
>>>>>> Would this direction be acceptable, or do you see concerns with
>>>>>> reusing
>>>>>> the range infrastructure for the no-fault case?
>>>>>>
>>>>>
>>>>> If you prefer something like insert a range exactly here + create
>>>>> range
>>>>> + notifier I think that completely reasonable direction and Xe would
>>>>> likely switch over to using this.
>>>>>
>>>>> I guess my only concern is sub-userptr migration. We are trending
>>>>> towards allowing userptrs to being migrated either via prefetch
>>>>> IOCTLs
>>>>> or access counters on the GPU side - access counter we'd likely a
>>>>> single
>>>>> 2M page at time migration within the userptr. get_pages() supports
>>>>> mixed
>>>>> mappings between VRAM + system but likely needs some more work to
>>>>> really
>>>>> make this complete though.
>>>>>
>>>>> Matt
>>>>>> Regards,
>>>>>> Honglei
>>> ...
>>>
>>>
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-04-27 20:00 ` Felix Kuehling
@ 2026-04-28 2:23 ` Huang, Honglei1
2026-04-30 3:04 ` Matthew Brost
0 siblings, 1 reply; 36+ messages in thread
From: Huang, Honglei1 @ 2026-04-28 2:23 UTC (permalink / raw)
To: Felix Kuehling, Matthew Brost, Christian König
Cc: amd-gfx, dri-devel, Alexander.Deucher, Honglei Huang, Oak.Zeng,
Jenny-Jing.Liu, Philip.Yang, Xiaogang.Chen, Ray.Huang,
Lingshan.Zhu, Junhua.Shen, Thomas Hellström, Rodrigo Vivi,
Danilo Krummrich, Alice Ryhl
On 4/28/2026 4:00 AM, Felix Kuehling wrote:
>
> On 2026-04-24 06:43, Huang, Honglei1 wrote:
>>
>>
>> On 4/24/2026 4:21 AM, Matthew Brost wrote:
>>> On Thu, Apr 23, 2026 at 07:03:52PM +0800, Huang, Honglei1 wrote:
>>>>
>>>>
>>>> On 4/23/2026 3:18 PM, Matthew Brost wrote:
>>>> ...
>>>>>>>>>>> This clarifies a lot. This is what we’d call in Xe
>>>>>>>>>>> “preemption fence”
>>>>>>>>>>> mode for a VM. Anytime memory is moved, we trigger a GPU
>>>>>>>>>>> preemption and
>>>>>>>>>>> resume. We don’t actually support SVM in this case; instead,
>>>>>>>>>>> we use
>>>>>>>>>>> “userptr binds,” which are built on gpusvm for page
>>>>>>>>>>> collection. However,
>>>>>>>>>>> we don’t support migrating memory to the device—though we could.
>>>>>>>>>>>
>>>>>>>>>>> I’d look at how we converted 'userptr' to be based on GPU SVM
>>>>>>>>>>> [2]. In
>>>>>>>>>>> this case, don’t maintain a range tree, as those—as you
>>>>>>>>>>> suggest—are more
>>>>>>>>>>> of an on-demand fault driver concern. Instead, just embed
>>>>>>>>>>> 'struct
>>>>>>>>>>> drm_gpusvm_pages' in the VMA struct defined by the IOCTLs..
>>>>>>>>>>>
>>>>>>>>>>> We could extend this to support migrating 'userptr', but we
>>>>>>>>>>> just haven’t
>>>>>>>>>>> done that yet—this may be what you want to do in “XNACK off..
>>>>>>>>>>>
>>>>>>>>>>> [2] https://patchwork.freedesktop.org/series/146553/
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Actually we need to swith the xnack mode between on and off, so
>>>>>>>>>> in xnack off
>>>>>>>>>> mode, the driver operats in "implicit prefetch mode". This may
>>>>>>>>>> be due to
>>>>>>>>>> compatibility with older hardware and the need for UMD
>>>>>>>>>> runtime. We will
>>>>>>>>>> further discuss the handling method under xnack off internally.
>>>>>>>>>>
>>>>>>>
>>>>>>> Hi Matt,
>>>>>>>
>>>>>>> I studied the xe_userptr code and the conversion series [2] you
>>>>>>> pointed to.
>>>>>>>
>>>>>>> I have a question that:
>>>>>>> Would it be possible to reuse drm_gpusvm_range to handle the
>>>>>>> hardware
>>>>>>> without gpu fault feature(xnack off mode).
>>>>>>
>>>>>> That’s not how we’ve done it. We embedded drm_gpusvm_pages into
>>>>>> our VMA
>>>>>> structure and then attached a notifier. The notifier attachment is
>>>>>> open-coded on the Xe side, and this could be normalized and opened up
>>>>>> for common driver use cases.
>>>>
>>>> The way in xe_userptr likes the implementation in kfd_svm: embeded
>>>> physical
>>>> pages into structure and attach same size notifier.
>>>> But kfd_svm is an implementation of SVM semantics, which supports
>>>> partial
>>>> unmap, doesn't need explicitly delete userptr ioctl calling when
>>>> remove ,
>>>> and doesn't need a explicitly userptr flag when creating.
>>>> And actually there is also a existing implementation for userptr
>>>> semantics
>>>> in amdgpu kfd: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR.
>>>> If the no gpu fault mode can not use the drm gpu svm fram work, use
>>>> the same
>>>> way for xe_userptr, it seems like doing the duplicate work.
>>>>
>>>> I think the core gap is we are trying to use the drmgpu_svm to
>>>> implement a
>>>> SVM semantics driver for no gpu fault hardware instead of userptr
>>>> semantics.
>>>>
>>>>>>
>>>>>> The problem with reusing drm_gpusvm_range directly is that a VMA may
>>>>>> span multiple gpusvm notifiers—i.e., it can be larger than the
>>>>>> notifier
>>>>>> size. Of course, we could rework this as well.
>>>>
>>>> So the "VMA spans multiple gpusvm notifiers" concern: I'd like to
>>>> clarify
>>>> that this is not actually a blocker for amdgpu's XNACK-off path,
>>>> because
>>>> amdgpu does not try to represent one user ioctl virtual address
>>>> interval as
>>>> a single drm_gpusvm_range.
>>>>
>>>> we walk the attr interval and call drm_gpusvm_range_find_or_insert()
>>>> repeatedly, letting gpusvm pick chunk aligned ranges bounded by
>>>> notifier_size. One ioctl interval will create N chunk sized ranges.
>>>>
>>>>>>
>>>>>
>>>>> Sorry for the double reply—I just glanced at the latest series. I
>>>>> don’t
>>>>> think creating a range per page of the userptr is desirable. While it
>>>>> would work, from a time-complexity point of view I don’t think this is
>>>>> ideal.
>>>>>
>>>>> The issue with spans across multiple notifiers is real, though.
>>>>>
>>>>> My rough idea would be:
>>>>>
>>>>> - Give drivers an interface to create larger ranges.
>>>>
>>>> So maybe we do not need to create larger ranges if we call
>>>> drm_gpusvm_range_find_or_insert() repeatedly.
>>>>
>>>
>>> That will be functional, but consider it from a time-complexity point of
>>> view.
>>>
>>> Multiple ranges increase the time complexity of range-tree searches.
>>> This isn’t a huge deal, but it will show up to some extent.
>>>
>>> Multiple ranges will also slow down DMA mapping and migration. We
>>> switched over to the dma_iova_alloc/link/unlink/sync uAPI here [1].
>>> While dma_iova_link is a relatively fast radix-tree walk, the allocation
>>> and sync steps are where things get expensive. Therefore, it is
>>> advantageous to perform these steps as few times as possible. For
>>> example, if your SVM buffer is 512MB, instead of doing these steps 256
>>> times, you do them once. The same logic applies to the migrate_vma_*
>>> functions—they are quite expensive, so doing them in a single shot is
>>> significantly faster.
>>>
>>> The same applies to invalidations. If you can invalidate a large range
>>> in a single shot, it will be faster. Although the logic in the notifier
>>> should be able to zap multiple ranges in one shot (Xe does this), having
>>> to DMA-unmap a single large range will still be faster than multiple
>>> smaller DMA unmaps.
>>>
>>> The TL;DR is if your driver knows size of SVM allocation upfront (e.g.,
>>> an IOCTL tells you the size) it makes more sense to use a single large
>>> struct (either embedded drm_gpusvm_pages into a VMA or we figure out an
>>> interface to insert large ranges / notifiers).
>>>
>>> [1] https://patchwork.freedesktop.org/series/160587/
>>>
>>>>>
>>>>> - If the range fits inside a single notifier’s size → done.
>>>>>
>>>>> - If the range spans multiple notifier sizes → round up to a power of
>>>>> two and create a larger notifier. This may overlap with existing
>>>>> notifiers, which is likely fine given that interval trees support
>>>>> overlaps (?). We’d need to double-check and test this. If
>>>>> overlapping
>>>>> notifiers are not acceptable, we’d need some heavy-handed
>>>>> notifier merge
>>>>> logic—it will be complicated, but isolated, so once we get it
>>>>> right
>>>>> everyone can use it.
>>>>
>>>> If we call drm_gpusvm_range_find_or_insert() repeatedly the
>>>> drmgpu_svm will
>>>> create the corresponding notifier correctly as far as I can see.
>>>>
>>>
>>> I agree this will be functional but not ideal. You can always start the
>>> approach you have here and optimize it later by adding the required
>>> support in GPU SVM.
>>>
>>
>> Hi Matt,
>>
>> Really thanks for your information, this really helps a lot!
>>
>>
>> Hi Christian, Felix,
>>
>> According to the discussion with Matt on the previous thread, I'd like
>> to align with you on the XNACK off direction before start to the series.
>>
>> According to the information form Matt:
>> when the allocation size is known doing one big operation is
>> significantly faster than doing many small ranges, because
>> the allocation and sync steps are where things get expensive.
>> Doing them in a single shot is significantly faster, especially in the
>> situlation of xnack off mode, which needs pre fault and pre map in
>> ioctl, and the size is known.
>>
>> It is confirmed that repeatedly calling
>> drm_gpusvm_range_find_or_insert() is
>> functional, and suggested we land it first and optimize later by adding
>> large range support in GPU SVM core. That motivates the two phase plan
>> below.
>>
>> Phase 1
>> - Reuse drm_gpusvm_range for XNACK-off, one ioctl interval is split by
>> drm_gpusvm_range_find_or_insert() into
>> N chunk-sized ranges bounded by notifier_size, same mechanism as the
>> fault path.
>> - populate all ranges at ioctl / submit time instead of on fault.
>> - Invalidation -> GPU queue stop -> rebind/restore the pages and gpu
>> map ->restore queue
>>
>> Phase 2:
>> Add a large range / large notifier insert interface in GPU SVM core
>> so one ioctl interval maps to a single range to improve efficiency.
>> This needs modify the drmgpu_svm frame work.
>>
>> May I know your thoughts on this plan?
>
> I think drm_gpusvm_range_find_or_insert already has all the parameters
> necessary to allocate larger notifiers and ranges. All it would take is
> maybe adding a flag in drm_gpusvm_ctx to request larger range allocation
> instead of arbitrary chunking.
>
> I agree this could be done as a second phase and is mostly work in the
> drm_gpusvm code.
Thanks very much for the reply. I will implement the large-range feature
according to your suggestion.
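Just to confirm my understanding, something like the following is what I
have in mind (sketch only; the .large_range flag is purely hypothetical
and not an existing drm_gpusvm_ctx member):

	struct drm_gpusvm_ctx ctx = {
		/* ...existing context settings... */
		.large_range = 1,	/* one range for the whole interval
					 * instead of arbitrary chunking */
	};

	range = drm_gpusvm_range_find_or_insert(gpusvm, start, start, end, &ctx);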
Regards,
Honglei
>
> Regards,
> Felix
>
>
>>
>> Regards,
>> Honglei
>>
>>
>>> Matt
>>>
>>>> Regards,
>>>> Honglei
>>>>
>>>>>
>>>>> - Finally, make sure that individual userptr pages can reside at any
>>>>> location.
>>>>>
>>>>> Over conversely:
>>>>>
>>>>> - Normalize embedding of drm_gpusvm_pages in VMA structs + notifier
>>>>> creation
>>>>>
>>>>> - Make sure that individual userptr pages can reside at any location.
>>>>
>>>>>
>>>>> Both options actually sound really similar after typing this out.
>>>>>
>>>>> Matt
>>>>>
>>>>>> So either way, the Xe userptr + gpusvm implementation should be
>>>>>> refined
>>>>>> further for common driver use.
>>>>>>
>>>>>>>
>>>>>>> Reusing drm_gpusvm_range for the XNACK-off case would simplify our
>>>>>>> implementation considerably, it already provides large page chunk
>>>>>>> optimization, can reuse the existing migration infrastructure.
>>>>>>>
>>>>>>> Building these on top of a standalone drm_gpusvm_pages
>>>>>>> would mean reimplementing much of what the range layer already
>>>>>>> offers.
>>>>>>> It would also let us keep a single code path for both XNACK modes,
>>>>>>> which reduces maintenance burden and avoids behavioral difference.
>>>>>>>
>>>>>>> Would this direction be acceptable, or do you see concerns with
>>>>>>> reusing
>>>>>>> the range infrastructure for the no-fault case?
>>>>>>>
>>>>>>
>>>>>> If you prefer something like insert a range exactly here + create
>>>>>> range
>>>>>> + notifier I think that completely reasonable direction and Xe would
>>>>>> likely switch over to using this.
>>>>>>
>>>>>> I guess my only concern is sub-userptr migration. We are trending
>>>>>> towards allowing userptrs to being migrated either via prefetch
>>>>>> IOCTLs
>>>>>> or access counters on the GPU side - access counter we'd likely a
>>>>>> single
>>>>>> 2M page at time migration within the userptr. get_pages() supports
>>>>>> mixed
>>>>>> mappings between VRAM + system but likely needs some more work to
>>>>>> really
>>>>>> make this complete though.
>>>>>>
>>>>>> Matt
>>>>>>> Regards,
>>>>>>> Honglei
>>>> ...
>>>>
>>>>
>>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-04-28 2:23 ` Huang, Honglei1
@ 2026-04-30 3:04 ` Matthew Brost
0 siblings, 0 replies; 36+ messages in thread
From: Matthew Brost @ 2026-04-30 3:04 UTC (permalink / raw)
To: Huang, Honglei1
Cc: Felix Kuehling, Christian König, amd-gfx, dri-devel,
Alexander.Deucher, Honglei Huang, Oak.Zeng, Jenny-Jing.Liu,
Philip.Yang, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu, Junhua.Shen,
Thomas Hellström, Rodrigo Vivi, Danilo Krummrich, Alice Ryhl
On Tue, Apr 28, 2026 at 10:23:18AM +0800, Huang, Honglei1 wrote:
>
>
> On 4/28/2026 4:00 AM, Felix Kuehling wrote:
> >
> > On 2026-04-24 06:43, Huang, Honglei1 wrote:
> > >
> > >
> > > On 4/24/2026 4:21 AM, Matthew Brost wrote:
> > > > On Thu, Apr 23, 2026 at 07:03:52PM +0800, Huang, Honglei1 wrote:
> > > > >
> > > > >
> > > > > On 4/23/2026 3:18 PM, Matthew Brost wrote:
> > > > > ...
> > > > > > > > > > > > This clarifies a lot. This is
> > > > > > > > > > > > what we’d call in Xe “preemption
> > > > > > > > > > > > fence”
> > > > > > > > > > > > mode for a VM. Anytime memory is
> > > > > > > > > > > > moved, we trigger a GPU
> > > > > > > > > > > > preemption and
> > > > > > > > > > > > resume. We don’t actually
> > > > > > > > > > > > support SVM in this case;
> > > > > > > > > > > > instead, we use
> > > > > > > > > > > > “userptr binds,” which are built on gpusvm for page
> > > > > > > > > > > > collection. However,
> > > > > > > > > > > > we don’t support migrating memory to the device—though we could.
> > > > > > > > > > > >
> > > > > > > > > > > > I’d look at how we converted
> > > > > > > > > > > > 'userptr' to be based on GPU SVM
> > > > > > > > > > > > [2]. In
> > > > > > > > > > > > this case, don’t maintain a range tree, as those—as you
> > > > > > > > > > > > suggest—are more
> > > > > > > > > > > > of an on-demand fault driver
> > > > > > > > > > > > concern. Instead, just embed
> > > > > > > > > > > > 'struct
> > > > > > > > > > > > drm_gpusvm_pages' in the VMA struct defined by the IOCTLs..
> > > > > > > > > > > >
> > > > > > > > > > > > We could extend this to support migrating 'userptr', but we
> > > > > > > > > > > > just haven’t
> > > > > > > > > > > > done that yet—this may be what you want to do in “XNACK off..
> > > > > > > > > > > >
> > > > > > > > > > > > [2] https://patchwork.freedesktop.org/series/146553/
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Actually we need to swith the xnack mode between on and off, so
> > > > > > > > > > > in xnack off
> > > > > > > > > > > mode, the driver operats in "implicit prefetch mode". This may
> > > > > > > > > > > be due to
> > > > > > > > > > > compatibility with older hardware
> > > > > > > > > > > and the need for UMD runtime. We
> > > > > > > > > > > will
> > > > > > > > > > > further discuss the handling method under xnack off internally.
> > > > > > > > > > >
> > > > > > > >
> > > > > > > > Hi Matt,
> > > > > > > >
> > > > > > > > I studied the xe_userptr code and the conversion series [2] you
> > > > > > > > pointed to.
> > > > > > > >
> > > > > > > > I have a question that:
> > > > > > > > Would it be possible to reuse drm_gpusvm_range
> > > > > > > > to handle the hardware
> > > > > > > > without gpu fault feature(xnack off mode).
> > > > > > >
> > > > > > > That’s not how we’ve done it. We embedded
> > > > > > > drm_gpusvm_pages into our VMA
> > > > > > > structure and then attached a notifier. The notifier attachment is
> > > > > > > open-coded on the Xe side, and this could be normalized and opened up
> > > > > > > for common driver use cases.
> > > > >
> > > > > The way in xe_userptr likes the implementation in kfd_svm:
> > > > > embeded physical
> > > > > pages into structure and attach same size notifier.
> > > > > But kfd_svm is an implementation of SVM semantics, which
> > > > > supports partial
> > > > > unmap, doesn't need explicitly delete userptr ioctl calling
> > > > > when remove ,
> > > > > and doesn't need a explicitly userptr flag when creating.
> > > > > And actually there is also a existing implementation for
> > > > > userptr semantics
> > > > > in amdgpu kfd: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR.
> > > > > If the no gpu fault mode can not use the drm gpu svm fram
> > > > > work, use the same
> > > > > way for xe_userptr, it seems like doing the duplicate work.
> > > > >
> > > > > I think the core gap is we are trying to use the drmgpu_svm
> > > > > to implement a
> > > > > SVM semantics driver for no gpu fault hardware instead of
> > > > > userptr semantics.
> > > > >
> > > > > > >
> > > > > > > The problem with reusing drm_gpusvm_range directly is that a VMA may
> > > > > > > span multiple gpusvm notifiers—i.e., it can be
> > > > > > > larger than the notifier
> > > > > > > size. Of course, we could rework this as well.
> > > > >
> > > > > So the "VMA spans multiple gpusvm notifiers" concern: I'd
> > > > > like to clarify
> > > > > that this is not actually a blocker for amdgpu's XNACK-off
> > > > > path, because
> > > > > amdgpu does not try to represent one user ioctl virtual
> > > > > address interval as
> > > > > a single drm_gpusvm_range.
> > > > >
> > > > > we walk the attr interval and call drm_gpusvm_range_find_or_insert()
> > > > > repeatedly, letting gpusvm pick chunk aligned ranges bounded by
> > > > > notifier_size. One ioctl interval will create N chunk sized ranges.
> > > > >
> > > > > > >
> > > > > >
> > > > > > Sorry for the double reply—I just glanced at the latest
> > > > > > series. I don’t
> > > > > > think creating a range per page of the userptr is desirable. While it
> > > > > > would work, from a time-complexity point of view I don’t think this is
> > > > > > ideal.
> > > > > >
> > > > > > The issue with spans across multiple notifiers is real, though.
> > > > > >
> > > > > > My rough idea would be:
> > > > > >
> > > > > > - Give drivers an interface to create larger ranges.
> > > > >
> > > > > So maybe we do not need to create larger ranges if we call
> > > > > drm_gpusvm_range_find_or_insert() repeatedly.
> > > > >
> > > >
> > > > That will be functional, but consider it from a time-complexity point of
> > > > view.
> > > >
> > > > Multiple ranges increase the time complexity of range-tree searches.
> > > > This isn’t a huge deal, but it will show up to some extent.
> > > >
> > > > Multiple ranges will also slow down DMA mapping and migration. We
> > > > switched over to the dma_iova_alloc/link/unlink/sync uAPI here [1].
> > > > While dma_iova_link is a relatively fast radix-tree walk, the allocation
> > > > and sync steps are where things get expensive. Therefore, it is
> > > > advantageous to perform these steps as few times as possible. For
> > > > example, if your SVM buffer is 512MB, instead of doing these steps 256
> > > > times, you do them once. The same logic applies to the migrate_vma_*
> > > > functions—they are quite expensive, so doing them in a single shot is
> > > > significantly faster.
> > > >
> > > > The same applies to invalidations. If you can invalidate a large range
> > > > in a single shot, it will be faster. Although the logic in the notifier
> > > > should be able to zap multiple ranges in one shot (Xe does this), having
> > > > to DMA-unmap a single large range will still be faster than multiple
> > > > smaller DMA unmaps.
> > > >
> > > > The TL;DR is if your driver knows size of SVM allocation upfront (e.g.,
> > > > an IOCTL tells you the size) it makes more sense to use a single large
> > > > struct (either embedded drm_gpusvm_pages into a VMA or we figure out an
> > > > interface to insert large ranges / notifiers).
> > > >
> > > > [1] https://patchwork.freedesktop.org/series/160587/
> > > >
> > > > > >
> > > > > > - If the range fits inside a single notifier’s size → done.
> > > > > >
> > > > > > - If the range spans multiple notifier sizes → round up to a power of
> > > > > > two and create a larger notifier. This may overlap with existing
> > > > > > notifiers, which is likely fine given that interval trees support
> > > > > > overlaps (?). We’d need to double-check and test
> > > > > > this. If overlapping
> > > > > > notifiers are not acceptable, we’d need some
> > > > > > heavy-handed notifier merge
> > > > > > logic—it will be complicated, but isolated, so once
> > > > > > we get it right
> > > > > > everyone can use it.
> > > > >
> > > > > If we call drm_gpusvm_range_find_or_insert() repeatedly the
> > > > > drmgpu_svm will
> > > > > create the corresponding notifier correctly as far as I can see.
> > > > >
> > > >
> > > > I agree this will be functional but not ideal. You can always start the
> > > > approach you have here and optimize it later by adding the required
> > > > support in GPU SVM.
> > > >
> > >
> > > Hi Matt,
> > >
> > > Really thanks for your information, this really helps a lot!
> > >
> > >
> > > Hi Christian, Felix,
> > >
> > > According to the discussion with Matt on the previous thread, I'd
> > > like to align with you on the XNACK off direction before start to
> > > the series.
> > >
> > > According to the information form Matt:
> > > when the allocation size is known doing one big operation is
> > > significantly faster than doing many small ranges, because
> > > the allocation and sync steps are where things get expensive.
> > > Doing them in a single shot is significantly faster, especially in the
> > > situlation of xnack off mode, which needs pre fault and pre map in
> > > ioctl, and the size is known.
> > >
> > > It is confirmed that repeatedly calling
> > > drm_gpusvm_range_find_or_insert() is
> > > functional, and suggested we land it first and optimize later by adding
> > > large range support in GPU SVM core. That motivates the two phase
> > > plan below.
> > >
> > > Phase 1
> > > - Reuse drm_gpusvm_range for XNACK-off, one ioctl interval is split
> > > by drm_gpusvm_range_find_or_insert() into
> > > N chunk-sized ranges bounded by notifier_size, same mechanism as
> > > the fault path.
> > > - populate all ranges at ioctl / submit time instead of on fault.
> > > - Invalidation -> GPU queue stop -> rebind/restore the pages and gpu
> > > map ->restore queue
> > >
> > > Phase 2:
> > > Add a large range / large notifier insert interface in GPU SVM core
> > > so one ioctl interval maps to a single range to improve efficiency.
> > > This needs modify the drmgpu_svm frame work.
> > >
> > > May I know your thoughts on this plan?
> >
> > I think drm_gpusvm_range_find_or_insert already has all the parameters
> > necessary to allocate larger notifiers and ranges. All it would take is
> > maybe adding a flag in drm_gpusvm_ctx to request larger range allocation
> > instead of arbitrary chunking.
Yes, I agree this is a completely reasonable direction. Something like
overriding 'chunks' with a direct placement + size, and then figuring out
the notifier install algorithm in the gpusvm layer - again, this only
gets tricky if a direct placement spans multiple notifiers.
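Just to sketch the shape of it (a hypothetical interface, nothing that
exists today):

/*
 * Hypothetical "direct placement + size" insert that bypasses the chunk
 * selection; the notifier-install policy for spans larger than
 * notifier_size would live inside the gpusvm layer.
 */
struct drm_gpusvm_range *
drm_gpusvm_range_insert_sized(struct drm_gpusvm *gpusvm,
			      unsigned long start, unsigned long size,
			      const struct drm_gpusvm_ctx *ctx);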
Matt
> >
> > I agree this could be done as a second phase and is mostly work in the
> > drm_gpusvm code.
>
>
> Really thanks for the reply, will implement the large range feature
> according your suggestion.
>
> Regards,
> Honglei
>
> >
> > Regards,
> > Felix
> >
> >
> > >
> > > Regards,
> > > Honglei
> > >
> > >
> > > > Matt
> > > >
> > > > > Regards,
> > > > > Honglei
> > > > >
> > > > > >
> > > > > > - Finally, make sure that individual userptr pages can reside at any
> > > > > > location.
> > > > > >
> > > > > > Over conversely:
> > > > > >
> > > > > > - Normalize embedding of drm_gpusvm_pages in VMA structs + notifier
> > > > > > creation
> > > > > >
> > > > > > - Make sure that individual userptr pages can reside at any location.
> > > > >
> > > > > >
> > > > > > Both options actually sound really similar after typing this out.
> > > > > >
> > > > > > Matt
> > > > > >
> > > > > > > So either way, the Xe userptr + gpusvm
> > > > > > > implementation should be refined
> > > > > > > further for common driver use.
> > > > > > >
> > > > > > > >
> > > > > > > > Reusing drm_gpusvm_range for the XNACK-off case would simplify our
> > > > > > > > implementation considerably, it already provides large page chunk
> > > > > > > > optimization, can reuse the existing migration infrastructure.
> > > > > > > >
> > > > > > > > Building these on top of a standalone drm_gpusvm_pages
> > > > > > > > would mean reimplementing much of what the range
> > > > > > > > layer already offers.
> > > > > > > > It would also let us keep a single code path for both XNACK modes,
> > > > > > > > which reduces maintenance burden and avoids behavioral difference.
> > > > > > > >
> > > > > > > > Would this direction be acceptable, or do you
> > > > > > > > see concerns with reusing
> > > > > > > > the range infrastructure for the no-fault case?
> > > > > > > >
> > > > > > >
> > > > > > > If you prefer something like insert a range exactly
> > > > > > > here + create range
> > > > > > > + notifier I think that completely reasonable direction and Xe would
> > > > > > > likely switch over to using this.
> > > > > > >
> > > > > > > I guess my only concern is sub-userptr migration. We are trending
> > > > > > > towards allowing userptrs to being migrated either
> > > > > > > via prefetch IOCTLs
> > > > > > > or access counters on the GPU side - access counter
> > > > > > > we'd likely a single
> > > > > > > 2M page at time migration within the userptr.
> > > > > > > get_pages() supports mixed
> > > > > > > mappings between VRAM + system but likely needs some
> > > > > > > more work to really
> > > > > > > make this complete though.
> > > > > > >
> > > > > > > Matt
> > > > > > > > Regards,
> > > > > > > > Honglei
> > > > > ...
> > > > >
> > > > >
> > >
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-03-23 6:31 ` Matthew Brost
2026-03-24 7:24 ` Honglei Huang
@ 2026-04-23 6:09 ` Huang, Honglei1
2026-04-23 6:52 ` Matthew Brost
1 sibling, 1 reply; 36+ messages in thread
From: Huang, Honglei1 @ 2026-04-23 6:09 UTC (permalink / raw)
To: Matthew Brost, Christian König, Felix.Kuehling, Philip.Yang
Cc: amd-gfx, dri-devel, Alexander.Deucher, Honglei Huang, Oak.Zeng,
Jenny-Jing.Liu, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu,
Junhua.Shen, Thomas Hellström, Rodrigo Vivi,
Danilo Krummrich, Alice Ryhl
On 3/23/2026 2:31 PM, Matthew Brost wrote:
> On Thu, Mar 19, 2026 at 10:17:36PM +0800, Honglei Huang wrote:
>>
>>
>> On 3/19/26 13:08, Matthew Brost wrote:
>>> On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
>>>>
>>>
>>> Disclaimer I haven't look at any code in this series yet.
>>>
>>>>
>>>> On 3/17/26 19:48, Christian König wrote:
>>>>> Adding a few XE and drm_gpuvm people on TO.
>>>>>
>>>>> On 3/17/26 12:29, Honglei Huang wrote:
>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>
>>>>>> This is a POC/draft patch series of SVM feature in amdgpu based on the
>>>>>> drm_gpusvm framework. The primary purpose of this RFC is to validate
>>>>>> the framework's applicability, identify implementation challenges,
>>>>>> and start discussion on framework evolution. This is not a production
>>>
>>> +1. Open to any ideas. Given this was designed originally for Xe we very
>>> well could have missed other drivers requirements.
>> Hi Matt,
>>
>> Thank you for the openness. And thank you so much for the incredibly
>> detailed and patient response. I really appreciate you taking the time to
>> walk through each point.
>>
>
> I'm here to help.
>
>> Actually I am still a learner when it comes to the drm_gpusvm framework and
>> GPU SVM design in general. Some of my descriptions below may not be entirely
>> accurate. But I really want to bring drm_gpusvm into amdgpu and make it work
>> well.
>
> I appreciate another driver jumping in and using this framework—it
> becomes easier to validate as more users adopt it.
>
>>
>>>
>>>>>> ready submission.
>>>>>>
>>>>>> This patch series implements basic SVM support with the following features:
>>>>>>
>>>>>> 1. attributes sepatarated from physical page management:
>>>>>>
>>>>>> - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
>>>>>> tree that stores SVM attributes. Managed through the SET_ATTR,
>>>>>> and mmu notifier callback.
>>>
>>> Can you explain the mmu notifier callback interaction here? See below in
>>> Xe the attribute tree is existing VMA tree (gpuvm).
>>>
>>
>> Let me try to explain, apologies if the description is not fully
>> precise.
>>
>> In current implementation, the MMU notifier callback interacts with the attr
>> tree only in the munmap path remove the corresponding attribute
>> entries from the attr tree so that stale attributes do not persist for
>> freed address space.
>>
>
> Ah, yes. We reset our attributes upon munmap too. We actually don't this
> 100% correct quite either and series in flight to fix [1].
>
> [1] https://patchwork.freedesktop.org/series/161815/
Hi Matt,
It looks like you are changing the implementation to remove the
attributes on munmap.
We actually had an internal discussion about whether the driver needs to
remove the attributes on munmap. There are several ideas:
1. Keep the attributes: they may be needed again when a new VMA appears
or on subsequent faults.
2. Keep the attributes: attributes can be set independently of whether
the memory is currently mapped; they persist and are modified explicitly
via ioctl, not implicitly by notifier callbacks.
3. Remove the attributes: since the VMA is gone, the driver can do
nothing without a VMA.
I also saw that xe_svm set default attributes in a previous version,
which is another option.
Could you share why xe_svm is moving towards removing the attributes on
munmap? And is keeping the attributes a valid approach?
Regards,
Honglei
>
>>>>>>
>>>>>> - Physical page layer (drm_gpusvm ranges): managed by the
>>>>>> drm_gpusvm framework, representing actual HMM backed DMA
>>>>>> mappings and GPU page table entries.
>>>>>>
>>>>>> This separation is necessary:
>>>>>> - The framework does not support range splitting, so a partial
>>>>>> munmap destroys the entire overlapping range, including the
>>>>>> still valid parts. If attributes were stored inside drm_gpusvm
>>>>>> ranges, they would be lost on unmapping.
>>>>>> The separate attr tree preserves userspace set attributes
>>>>>> across range operations.
>>>
>>> Yes, in Xe the divide is at the VMA level (set by user space) via VM
>>> bind (parts of VM may be mappings BOs, parts could be setup for SVM) or
>>> madvise IOCTLs which reflect user space attributes on current SVM
>>> mappings or future ones.
>>>
>>> The SVM range tree reflects mappings that have been faulted into the
>>> device and contain pages. This is an intentional choice.
>>
>> That makes a lot of sense. Thank you for clarifying the design intent. I
>> think the current adopt the same principle: the drm_gpusvm range tree only
>> reflect actual faulted in mappings.
>>
>>>
>>>>>
>>>>> Isn't that actually intended? When parts of the range unmap then that usually means the whole range isn't valid any more.
>>>
>>>
>>> Yes, this was an intentional design choice to not support partial unmap,
>>> and instead rely on the driver to recreate a new range.
>>>
>>> The reasoning is:
>>>
>>> - In practice, this should be rare for well-behaved applications.
>>>
>>> - With THP / large device pages, if a sub-range is unmapped, the entire
>>> GPU mapping is invalidated anyway due to the page size change. As a
>>> result, the cost of creating a new range is minimal, since the device
>>> will likely fault again on the remaining pages.
>>>
>>> So there is no need to over-engineer the common code.
>>>
>>> FWIW, to even test partial unmaps in Xe, I had to do things I doubt
>>> anyone would ever do:
>>>
>>> ptr = mmap(SZ_2M);
>>> /* fault in memory to the device */
>>> munmap(ptr, SZ_1M);
>>> /* touch memory again on the device */
>>>
>>
>> Thank you for this explanation and the concrete example. After further
>> discussion internally with Christian, we are now aligned with same position
>> partial unmap. Will remove rebuild on partial unmap logic in the next
>> version and handle it as only partially backed range.
>>
>>>>
>>>>
>>>> It is about partial unmap, some subregion in drm_gpusvm_range is still valid
>>>> but some other subregion is invalid, but under drm_gpusvm, need to destroy
>>>> the entire range.
>>>>
>>>> e.g.:
>>>>
>>>> [---------------unmap region in mmu notifier-----------------]
>>>> [0x1000 ------------ 0x9000]
>>>> [ valid ][ invalid ]
>>>>
>>>> see deatil in drm_gpusvm.c:110 line
>>>> section:Partial Unmapping of Ranges
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> - drm_gpusvm range boundaries are determined by fault address
>>>>>> and pre setted chunk size, not by userspace attribute boundaries.
>>>>>> Ranges may be rechunked on memory changes. Embedding
>>>>>> attributes in framework ranges would scatter attr state
>>>>>> across many small ranges and require complex reassemble
>>>>>> logic when operate attrbute.
>>>>>
>>>>> Yeah, that makes a lot of sense.
>>>>>
>>>>>>
>>>>>> 2) System memory mapping via drm_gpusvm
>>>>>>
>>>>>> The core mapping path uses drm_gpusvm_range_find_or_insert() to
>>>>>> create ranges, drm_gpusvm_range_get_pages() for HMM page fault
>>>>>> and DMA mapping, then updates GPU page tables via
>>>>>> amdgpu_vm_update_range().
>>>>>>
>>>>>> 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
>>>>>>
>>>>>> On XNACK off hardware the GPU cannot recover from page faults,
>>>>>> so mappings must be established through ioctl. When
>>>>>> userspace calls SET_ATTR with ACCESS=ENABLE, the driver
>>>>>> walks the attr tree and maps all accessible intervals
>>>>>> to the GPU by amdgpu_svm_range_map_attr_ranges().
>>>
>>> Can you expand on XNACK off / GPU no faults? Is this to the share GPU
>>> between 3D (dma-fences) and faulting clients? We have something similar
>>> in Xe, but it isn't an explicit IOCTL rather we switch between on demand
>>> as 3D client submits and then resumes page faults when all dma-fences
>>> have signaled.
>>>
>>> I see below you mention page tables are modified during quiesce KFD
>>> queues? I'm not sure that is required - you just need to guarnette
>>> faulting clients won't trigger page faults when dma-fence is in flight.
>>>
>>> Maybe give me an explaination of exactly what the requirement from AMD
>>> are here so I have better picture.
>>
>> Thank you for the patience, let me try to explain our situation, though
>> I may not get every detail right.
>>
>> XNACK off means hardware that does not have GPU page fault capability (or
>> turned off)
>>
>> So for these GPUs, ALL page table entries must be fully populated before
>> the GPU can access the memory. This is why we need the ioctl driven
>> mapping path, when userspace calls SET_ATTR with ACCESS=ENABLE, need
>> walk the attribute tree and eagerly map all accessible ranges into the
>> GPU page tables. This is functionally similar to what you describe as
>> prefetch IOCTLs / VM bind in Xe.
>>
>> Regarding queue quiesce during page table modification: on XNACK off
>> hardware, because the GPU cannot fault, we must ensure the GPU is
>> completely stopped before modifying any PTE it might be accessing.
>> Otherwise the GPU could access a partially updated page table and hang.
>> The quiesce/resume is the mechanism to guarantee this.
>>
>> I hope that helps clarify the picture.
>>
>
> This clarifies a lot. This is what we’d call in Xe “preemption fence”
> mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
> resume. We don’t actually support SVM in this case; instead, we use
> “userptr binds,” which are built on gpusvm for page collection. However,
> we don’t support migrating memory to the device—though we could.
>
> I’d look at how we converted 'userptr' to be based on GPU SVM [2]. In
> this case, don’t maintain a range tree, as those—as you suggest—are more
> of an on-demand fault driver concern. Instead, just embed 'struct
> drm_gpusvm_pages' in the VMA struct defined by the IOCTLs..
>
> We could extend this to support migrating 'userptr', but we just haven’t
> done that yet—this may be what you want to do in “XNACK off..
>
> [2] https://patchwork.freedesktop.org/series/146553/
>
>>
>>>
>>>>>>
>>>>>> 4) Invalidation, GC worker, and restore worker
>>>>>>
>>>>>> MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
>>>>>> three cases based on event type and hardware mode:
>>>>>> - unmap event: clear GPU PTEs in the notifier context,
>>>>>> unmap DMA pages, mark ranges as unmapped, flush TLB,
>>>>>> and enqueue to the GC worker. On XNACK off, also
>>>>>> quiesce KFD queues and schedule rebuild of the
>>>>>> still valid portions that were destroyed together with
>>>>>> the unmapped subregion.
>>>>>>
>>>>>> - evict on XNACK off:
>>>>>> quiesce KFD queues first, then unmap DMA pages and
>>>>>> enqueue to the restore worker.
>>>>>
>>>>> Is that done through the DMA fence or by talking directly to the MES/HWS?
>>>>
>>>> Currently KFD queues quiesce/resume API are reused, lookig forward to a
>>>> better solution.
>>>>
>>>
>>> +1
>>>
>>>> Regards,
>>>> Honglei
>>>>
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> - evict on XNACK on:
>>>>>> clear GPU PTEs, unmap DMA pages, and flush TLB, but do
>>>>>> not schedule any worker. The GPU will fault on next
>>>>>> access and the fault handler establishes the mapping.
>>>>>>
>>>>>> Not supported feature:
>>>>>> - XNACK on GPU page fault mode
>>>>>> - migration and prefetch feature
>>>>>> - Multi GPU support
>>>>>>
>>>>>> XNACK on enablement is ongoing.The GPUs that support XNACK on
>>>>>> are currently only accessible to us via remote lab machines, which slows
>>>>>> down progress.
>>>>>>
>>>>>> Patch overview:
>>>>>>
>>>>>> 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
>>>>>> SET_ATTR/GET_ATTR operations, attribute types, and related
>>>>>> structs in amdgpu_drm.h.
>>>>>>
>>>>>> 02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
>>>>>> refcount, attr_tree, workqueues, locks, and
>>>>>> callbacks (begin/end_restore, flush_tlb).
>>>>>>
>>>>>> 03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
>>>>>> (interval tree node), attr_tree, access enum, flag masks,
>>>>>> and change trigger enum.
>>>>>>
>>>>>> 04/12 Attribute tree operations: interval tree lookup, insert,
>>>>>> remove, and tree create/destroy lifecycle.
>>>>>>
>>>>>> 05/12 Attribute set: validate UAPI attributes, apply to internal
>>>>>> attrs, handle hole/existing range with head/tail splitting,
>>>>>> compute change triggers, and -EAGAIN retry loop.
>>>>>> Implements attr_clear_pages for unmap cleanup and attr_get.
>>>>>>
>>>>>> 06/12 Range data structures: amdgpu_svm_range extending
>>>>>> drm_gpusvm_range with gpu_mapped state, pending ops,
>>>>>> pte_flags cache, and GC/restore queue linkage.
>>>>>>
>>>>>> 07/12 PTE flags and GPU mapping: simple gpu pte function,
>>>>>> GPU page table update with DMA address, range mapping loop:
>>>>>> find_or_insert -> get_pages -> validate -> update PTE,
>>>>>> and attribute change driven mapping function.
>>>>>>
>>>>>> 08/12 Notifier and invalidation: synchronous GPU PTE clear in
>>>>>> notifier context, range removal and overlap cleanup,
>>>>>> rebuild after destroy logic, and MMU event dispatcher
>>>>>>
>>>>>> 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
>>>>>> worker for unmap processing and rebuild, ordered restore
>>>>>> worker for mapping evicted ranges, and flush/sync
>>>>>> helpers.
>>>>>>
>>>>>> 10/12 Initialization and fini: kmem_cache for range/attr,
>>>>>> drm_gpusvm_init with chunk sizes, XNACK detection, TLB
>>>>>> flush helper, and amdgpu_svm init/close/fini lifecycle.
>>>>>>
>>>>>> 11/12 IOCTL and fault handler: PASID based SVM lookup with kref
>>>>>> protection, amdgpu_gem_svm_ioctl dispatcher, and
>>>>>> amdgpu_svm_handle_fault for GPU page fault recovery.
>>>>>>
>>>>>> 12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
>>>>>> Makefile rules, ioctl table registration, and amdgpu_vm
>>>>>> hooks (init in make_compute, close/fini, fault dispatch).
>>>>>>
>>>>>> Test result:
>>>>>> on gfx1100(W7900) and gfx943(MI300x)
>>>>>> kfd test: 95%+ passed, same failed cases with offical relase
>>>>>> rocr test: all passed
>>>>>> hip catch test: 20 cases failed in all 5366 cases, +13 failures vs offical relase
>>>>>>
>>>>>> During implementation we identified several challenges / design questions:
>>>>>>
>>>>>> 1. No range splitting on partial unmap
>>>>>>
>>>>>> drm_gpusvm explicitly does not support range splitting in drm_gpusvm.c:122.
>>>>>> Partial munmap needs to destroy the entire range including the valid interval.
>>>>>> GPU fault driven hardware can handle this design by extra gpu fault handle,
>>>>>> but AMDGPU needs to support XNACK off hardware, this design requires driver
>>>>>> rebuild the valid part in the removed entire range. Whichs bring a very heavy
>>>>>> restore work in work queue/GC worker: unmap/destroy -> rebuild(insert and map)
>>>>>> this restore work even heavier than kfd_svm. In previous driver work queue
>>>>>> only needs to restore or unmap, but in drm_gpusvm driver needs to unmap and restore.
>>>>>> which brings about more complex logic, heavier worker queue workload, and
>>>>>> synchronization issues.
>>>
>>> Is this common in the workload you are running? I'm also wondering if
>>> your restore logic / KFDs design is contributing to this actally the
>>> problem.
>>>
>>
>> Honestly, you raise a fair point.
>>
>> We will redesign the logic about the partial munap, which should eliminate
>> most of this complexity.
>>
>>
>
> +1, yes test but do optimize for.
>
>>>>>>
>>>>>> 2. Fault driven vs ioctl driven mapping
>>>>>>
>>>>>> drm_gpusvm is designed around GPU page fault handlers. The primary entry
>>>>>> point drm_gpusvm_range_find_or_insert() takes a fault_addr.
>>>>>> AMDGPU needs to support IOCTL driven mapping cause No XNACK hardware that
>>>>>> GPU cannot fault at all
>>>
>>> I think we refer to these as prefetch IOCTLs in Xe. Ideally, user space
>>> issues these so the device does not fault (e.g., prefetch creates a set
>>> of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
>>> specific VM bind operations.
>>>
>>
>> That is a very helpful way to think about it. Yes, our ioctl driven
>> mapping(xnack off) is essentially equivalent to a prefetch operation. We are
>> trying to improve it.
>>
>
> See above wrt 'userptr'.
>
>>
>>>>>>
>>>>>> The ioctl path cannot hold mmap_read_lock across the entire operation
>>>>>> because drm_gpusvm_range_find_or_insert() acquires/releases it
>>>>>> internally. This creates race windows with MMU notifiers / workers.
>>>
>>> This is a very intentional choice in the locking design: mmap_read_lock
>>> is held only in very specific parts of GPU SVM, and the driver should
>>> never need to take this lock.
>>>
>>> Yes, notifiers can race, which is why the GPU fault handler and prefetch
>>> handler are structured as retry loops when a notifier race is detected.
>>> In practice, with well-behaved applications, these races should be
>>> rare—but they do occur, and the driver must handle them.
>>>
>>> __xe_svm_handle_pagefault implements the page fault retry loop. VM bind
>>> prefetch has similar logic, although it is more spread out given that it
>>> is part of a deeper software pipeline.
>>>
>>> FWIW, holding locks to avoid races was rejected by Sima because we
>>> reasoned it is essentially impossible to guarantee the absence of races
>>> by holding a lock. CPU page fault handlers are also effectively just
>>> large retry loops.
>>>
>>> So this is one point I believe you will need to fixup driver side.
>>>
>>
>> Understood. Thank you for the detailed explanation and for pointing to
>> __xe_svm_handle_pagefault as a reference. We will restructure both our
>> fault handler and ioctl path into a better retry-loop pattern with
>> sequence-number race detection.
>>
>
> Yes, the typical pattern is:
>
> - Try to migrate once
> - If you hit a race, give up, evict all memory back to system memory, and bind it
>
> Atomics make this tricky because memory must move, but I’m not sure
> “XNACK off” applies here. However, GPU SVM provides a timeslice
> mechanism to ensure the CPU can’t move memory while the GPU needs to
> execute something.
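>
> If it helps, a stripped-down sketch of the retry shape (this is not the
> actual Xe code; my_collect_pages() / my_bind_range() are made-up driver
> helpers standing in for the GPU SVM page-collection and bind steps):
>
> static int my_handle_fault(struct my_vm *vm, u64 fault_addr)
> {
>         int err;
>
> retry:
>         /* hmm_range_fault()-based page collection via the GPU SVM helpers. */
>         err = my_collect_pages(vm, fault_addr);
>         if (err == -EAGAIN)
>                 goto retry;     /* a notifier ran while collecting; just retry */
>         if (err)
>                 return err;
>
>         /*
>          * Bind under the notifier lock; if the notifier sequence number moved
>          * since collection, the pages are stale, so retry instead of mapping.
>          */
>         err = my_bind_range(vm, fault_addr);
>         if (err == -EAGAIN)
>                 goto retry;
>
>         return err;
> }
>
> No long-held mmap_read_lock anywhere; correctness comes purely from the
> sequence-number check plus the retry.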
>
>>>>>>
>>>>>> 3. Multi GPU support
>>>>>>
>>>>>> drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
>>>>>> each GPU gets an independent instance with its own range tree, MMU
>>>>>> notifiers, notifier_lock, and DMA mappings.
>>>>>>
>>>
>>> This is a part I am absolutely open to fixing. Right now, each
>>> drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
>>> decoupling a GPU SVM instance from a single device, allowing each
>>> drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
>>> device).
>>>
>>> This would give drivers the flexibility to use one GPU SVM instance per
>>> VM/device instance (as in Xe), or to maintain a single GPU SVM per CPU
>>> MM.
>>>
>>
>> That would be wonderful! Looking forward to your patch very much!
>>
>
> I can't say I'll code this, but we have thought about it as an option and
> are very open to patches which refactor the object model for multiple use cases.
>
>>
>>>>>> This may bring huge overhead:
>>>>>> - N x MMU notifier registrations for the same address range
>>>
>>> The notifier overhead is a real concern. We recently introduced two-pass
>>> notifiers [1] to speed up multi-device notifiers. At least in Xe, the
>>> TLB invalidations—which are the truly expensive part—can be pipelined
>>> using the two-pass approach. Currently, [1] only implements two-pass
>>> notifiers for userptr, but Xe’s GPU SVM will be updated to use them
>>> shortly.
>>>
>>> [1] https://patchwork.freedesktop.org/series/153280/
>>>
>>
>> Thank you for the pointer to two-pass notifiers. Will study this
>> series.
>>
>>>>>> - N x hmm_range_fault() calls for the same page (KFD: 1x)
>>>
>>> hmm_range_fault is extremely fast compared to the actual migration.
>>> Running hmm_range_fault on a 2MB region using 4KB pages takes less
>>> than 1µs. With THP or large device pages [2] (merged last week), it’s
>>> around 1/20 of a microsecond. So I wouldn’t be too concerned about this.
>>>
>>> [2] https://patchwork.freedesktop.org/series/163141/
>>>
>>
>> That is very helpful data. Perhaps we were worrying too much.
>>
>>>>>> - N x DMA mapping memory
>>>
>>> You will always have N x DMA mapping memory if the pages are in system
>>> memory as the dma-mapping API is per device.
>>
>> Totally agreed.
>>
>>>
>>>>>> - N x invalidation + restore worker scheduling per CPU unmap event
>>>>>> - N x GPU page table flush / TLB invalidation
>>>
>>> I agree you do not want to serialize GPU page table flush / TLB
>>> invalidations. Hence two-pass notifiers [1].
>>
>> Yes, will learn it.
>>
>>>
>>>>>> - Increased mmap_lock hold time, N callbacks serialize under it
>>>>>>
>>>>>> compatibility issues:
>>>>>> - Quiesce/resume scope mismatch: to integrate with KFD compute
>>>>>> queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm(),
>>>>>> which have process-level semantics. Under the per-GPU
>>>>>> drm_gpusvm model there may be synchronization issues. To properly
>>>>>> integrate with KFD under the per-SVM model, a compatibility layer or
>>>>>> new per-VM queue control APIs may need to be introduced.
>>>>>>
>>>
>>> I thought the idea was to get rid of KFD and move over to AMDGPU? I thought
>>> Christian mentioned this to me at XDC.
>>>
>>
>>>>>> Migration challenges:
>>>>>>
>>>>>> - No global migration decision logic: each per GPU SVM
>>>>>> instance maintains its own attribute tree independently. This
>>>>>> allows conflicting settings (e.g., GPU0's SVM sets
>>>>>> PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
>>>>>> for the same address range) with no detection or resolution.
>>>>>> A global attribute coordinator or a shared manager is needed to
>>>>>> provide a unified global view for migration decisions
>>>
>>> Yes, this is a hole in the Xe API too. We have told UMDs that if they set up
>>> individual VMs with conflicting attributes for a single CPU address space,
>>> the behavior is undefined. Our UMD madvise implementation is basically a loop
>>> over all GPU VMs setting the same attributes.
>>
>> We will follow the same approach for now: the UMD is responsible for setting
>> consistent attributes across GPU VMs.
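>>
>> (I.e. something like this on the UMD side, sketched with a made-up wrapper
>> name around the SET_ATTR ioctl:
>>
>>         for (int i = 0; i < num_gpus; i++)
>>                 my_svm_set_attr(gpu_fd[i], start, size, attrs, nattr);
>>
>> so every per-GPU attribute tree ends up with the same view.)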
>>
>
> +1
>
>>>
>>>>>>
>>>>>> - migrate_vma_setup broadcast: one GPU's migration triggers MMU
>>>>>> notifier callbacks in ALL N-1 other drm_gpusvm instances,
>>>>>> causing N-1 unnecessary restore workers to be scheduled. And
>>>
>>> My feeling is that you shouldn’t reschedule restore workers unless you
>>> actually have to invalidate page tables (i.e., you have a local SVM
>>> range within the notifier). So the first migration to an untouched
>>> region may trigger notifiers, but they won’t do anything because you
>>> don’t have any valid SVM ranges yet. Subsequent mappings of the migrated
>>> region won’t trigger a notifier unless the memory is moved again.
>>>
>>
>> That is a very good point. We should check whether we actually have
>> valid SVM ranges before scheduling restore workers. If there is nothing
>> to invalidate, the notifier callback should be a no-op. We will review
>> our notifier callback logic to ensure we are not doing unnecessary work
>> here. Thank you for pointing this out.
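>>
>> As a rough sketch of what we have in mind (simplified to a raw
>> mmu_interval_notifier callback, with made-up my_svm_* helpers rather than
>> the real amdgpu_svm / drm_gpusvm plumbing):
>>
>> static bool my_svm_invalidate(struct mmu_interval_notifier *mni,
>>                               const struct mmu_notifier_range *mrange,
>>                               unsigned long cur_seq)
>> {
>>         struct my_svm *svm = container_of(mni, struct my_svm, notifier);
>>
>>         mmu_interval_set_seq(mni, cur_seq);
>>
>>         /*
>>          * Nothing has been faulted in for this interval yet: no PTEs to
>>          * clear, no restore worker to schedule, so bail out early.
>>          */
>>         if (!my_svm_has_valid_ranges(svm, mrange->start, mrange->end))
>>                 return true;
>>
>>         my_svm_clear_ptes(svm, mrange->start, mrange->end);
>>         my_svm_queue_restore(svm, mrange->start, mrange->end);
>>         return true;
>> }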
>>
>>>>>> creates races between the initiating migration and the other
>>>>>> instance's restore attempts.
>>>
>>> Yes, if multiple devices try to migrate the same CPU pages at the same
>>> time, that will race. That’s why in Xe we have a module-level
>>> driver_migrate_lock. The first migration runs in read mode; if it
>>> detects a race and aborts, it then takes driver_migrate_lock in write
>>> mode so it becomes the only device allowed to move memory / CPU pages.
>>> See xe_svm_alloc_vram() for how this is used.
>>>
>>> I’m not sure this approach will work for you, but I just wanted to point
>>> out that we identified this as a potential issue.
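>>>
>>> Roughly (simplified, not the actual Xe code, and my_try_migrate() is a
>>> made-up stand-in for the migrate + race-detection step):
>>>
>>> static DECLARE_RWSEM(my_migrate_lock);
>>>
>>> static int my_migrate_to_vram(struct my_vm *vm, u64 start, u64 end)
>>> {
>>>         int err;
>>>
>>>         /* Optimistic path: many devices may migrate concurrently. */
>>>         down_read(&my_migrate_lock);
>>>         err = my_try_migrate(vm, start, end);
>>>         up_read(&my_migrate_lock);
>>>         if (err != -EBUSY)
>>>                 return err;
>>>
>>>         /* Lost a race: retry as the only device allowed to move these pages. */
>>>         down_write(&my_migrate_lock);
>>>         err = my_try_migrate(vm, start, end);
>>>         up_write(&my_migrate_lock);
>>>         return err;
>>> }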
>>>
>>
>> Thank you for sharing the driver_migrate_lock approach and pointing to
>> xe_svm_alloc_vram(). Will explore whether a similar lock pattern can work
>> for our case.
>>
>>>>>>
>>>>>> - No cross instance migration serialization: each per GPU
>>>>>> drm_gpusvm instance has independent locking, so two GPUs'
>>>>>> "decide -> migrate -> remap" sequences can interleave. While
>>>>>> the kernel page lock prevents truly simultaneous migration of
>>>>>> the same physical page, the losing side's retry (evict from
>>>>>> other GPU's VRAM -> migrate back) triggers broadcast notifier
>>>>>> invalidations and restore workers, compounding the ping pong
>>>>>> problem above.
>>>>>>
>>>
>>> See the driver_migrate_lock above.
>>
>> Acknowledged, thank you.
>>>
>>>>>> - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
>>>>>> hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
>>>>>> it only selects system memory pages for migration.
>>>>>>
>>>
>>> I think this is fixed? We did find some core MM bugs that blocked VRAM
>>> to VRAM but those have been worked out.
>>>
>>> The code I'm looking at:
>>>
>>> 517 int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
>>> 518 struct mm_struct *mm,
>>> 519 unsigned long start, unsigned long end,
>>> 520 const struct drm_pagemap_migrate_details *mdetails)
>>> 521 {
>>> 522 const struct drm_pagemap_devmem_ops *ops = devmem_allocation->ops;
>>> 523 struct drm_pagemap *dpagemap = devmem_allocation->dpagemap;
>>> 524 struct dev_pagemap *pagemap = dpagemap->pagemap;
>>> 525 struct migrate_vma migrate = {
>>> 526 .start = start,
>>> 527 .end = end,
>>> 528 .pgmap_owner = pagemap->owner,
>>> 529 .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
>>> 530 MIGRATE_VMA_SELECT_DEVICE_PRIVATE | MIGRATE_VMA_SELECT_COMPOUND,
>>> 531 };
>>>
>>
>> Thank you for checking! I am using v6.18 for this POC and missed the fix;
>> will rebase to the latest.
>>
>>
>>>>>> - CPU fault reverse migration race: CPU page fault triggers
>>>>>> migrate_to_ram while GPU instances are concurrently operating.
>>>>>> Per GPU notifier_lock does not protect cross GPU operations.
>>>
>>> No, again retry loop as discussed above.
>>
>> Understood.
>>
>>>
>>>>>>
>>>>>> We believe a strong, well designed solution at the framework level is
>>>>>> needed to properly address these problems, and we look forward to
>>>>>> discussion and suggestions.
>>>
>>> Let's work together to figure out what is missing here.
>>
>> Thank you so much, Matt. Your feedback has been incredibly valuable and
>> has given us a much clearer picture of the framework's design.
>> I really appreciate the effort you put into building drm_gpusvm as a
>> shared framework. Will incorporate your suggestions into our next
>> revision and look forward to continuing the collaboration.
>>
>
> No problem. Happy to help.
>
> Matt
>
>> Regards,
>> Honglei
>>
>>
>>>
>>> Matt
>>>
>>>>>>
>>>>>> Honglei Huang (12):
>>>>>> drm/amdgpu: add SVM UAPI definitions
>>>>>> drm/amdgpu: add SVM data structures and header
>>>>>> drm/amdgpu: add SVM attribute data structures
>>>>>> drm/amdgpu: implement SVM attribute tree operations
>>>>>> drm/amdgpu: implement SVM attribute set
>>>>>> drm/amdgpu: add SVM range data structures
>>>>>> drm/amdgpu: implement SVM range PTE flags and GPU mapping
>>>>>> drm/amdgpu: implement SVM range notifier and invalidation
>>>>>> drm/amdgpu: implement SVM range workers
>>>>>> drm/amdgpu: implement SVM core initialization and fini
>>>>>> drm/amdgpu: implement SVM ioctl and fault handler
>>>>>> drm/amdgpu: wire up SVM build system and fault handler
>>>>>>
>>>>>> drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
>>>>>> drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 ++++++++++++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
>>>>>> include/uapi/drm/amdgpu_drm.h | 39 +
>>>>>> 12 files changed, 2958 insertions(+), 4 deletions(-)
>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
>>>>>>
>>>>>>
>>>>>> base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
>>>>>
>>>>
>>
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-04-23 6:09 ` Huang, Honglei1
@ 2026-04-23 6:52 ` Matthew Brost
2026-04-23 8:22 ` Huang, Honglei1
0 siblings, 1 reply; 36+ messages in thread
From: Matthew Brost @ 2026-04-23 6:52 UTC (permalink / raw)
To: Huang, Honglei1
Cc: Christian König, Felix.Kuehling, Philip.Yang, amd-gfx,
dri-devel, Alexander.Deucher, Honglei Huang, Oak.Zeng,
Jenny-Jing.Liu, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu,
Junhua.Shen, Thomas Hellström, Rodrigo Vivi,
Danilo Krummrich, Alice Ryhl
On Thu, Apr 23, 2026 at 02:09:59PM +0800, Huang, Honglei1 wrote:
>
>
> On 3/23/2026 2:31 PM, Matthew Brost wrote:
> > On Thu, Mar 19, 2026 at 10:17:36PM +0800, Honglei Huang wrote:
> > >
> > >
> > > On 3/19/26 13:08, Matthew Brost wrote:
> > > > On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
> > > > >
> > > >
> > > > Disclaimer I haven't look at any code in this series yet.
> > > >
> > > > >
> > > > > On 3/17/26 19:48, Christian König wrote:
> > > > > > Adding a few XE and drm_gpuvm people on TO.
> > > > > >
> > > > > > On 3/17/26 12:29, Honglei Huang wrote:
> > > > > > > From: Honglei Huang <honghuan@amd.com>
> > > > > > >
> > > > > > > This is a POC/draft patch series of SVM feature in amdgpu based on the
> > > > > > > drm_gpusvm framework. The primary purpose of this RFC is to validate
> > > > > > > the framework's applicability, identify implementation challenges,
> > > > > > > and start discussion on framework evolution. This is not a production
> > > >
> > > > +1. Open to any ideas. Given this was designed originally for Xe we very
> > > > well could have missed other drivers requirements.
> > > Hi Matt,
> > >
> > > Thank you for the openness. And thank you so much for the incredibly
> > > detailed and patient response. I really appreciate you taking the time to
> > > walk through each point.
> > >
> >
> > I'm here to help.
> >
> > > Actually I am still a learner when it comes to the drm_gpusvm framework and
> > > GPU SVM design in general. Some of my descriptions below may not be entirely
> > > accurate. But I really want to bring drm_gpusvm into amdgpu and make it work
> > > well.
> >
> > I appreciate another driver jumping in and using this framework—it
> > becomes easier to validate as more users adopt it.
> >
> > >
> > > >
> > > > > > > ready submission.
> > > > > > >
> > > > > > > This patch series implements basic SVM support with the following features:
> > > > > > >
> > > > > > > 1. attributes sepatarated from physical page management:
> > > > > > >
> > > > > > > - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
> > > > > > > tree that stores SVM attributes. Managed through the SET_ATTR,
> > > > > > > and mmu notifier callback.
> > > >
> > > > Can you explain the mmu notifier callback interaction here? See below in
> > > > Xe the attribute tree is existing VMA tree (gpuvm).
> > > >
> > >
> > > Let me try to explain, apologies if the description is not fully
> > > precise.
> > >
> > > In current implementation, the MMU notifier callback interacts with the attr
> > > tree only in the munmap path remove the corresponding attribute
> > > entries from the attr tree so that stale attributes do not persist for
> > > freed address space.
> > >
> >
> > Ah, yes. We reset our attributes upon munmap too. We actually don't this
> > 100% correct quite either and series in flight to fix [1].
> >
> > [1] https://patchwork.freedesktop.org/series/161815/
>
> Hi Matt,
>
> It seems like you are trying to modify the implementation to remove the
> attributes on munmap.
>
> Actually, we had an internal discussion about whether the driver needs to
> remove the attributes on munmap.
>
> There are several ideas:
>
> 1. Keep the attributes: they may be needed again when a new VMA
> appears or on subsequent faults.
> 2. Keep the attributes: they can be set independently of whether memory
> is currently mapped; attributes persist and are modified explicitly via
> ioctl, not implicitly by notifier callbacks.
> 3. Remove the attributes: the VMA is gone, and the driver can do nothing
> without a VMA.
>
> I also saw xe_svm set default attributes in a previous version; that is also
> an option.
>
> Can you please give some background on why xe_svm is moving to removing the
> attributes on munmap? And is keeping the attributes a valid approach?
>
This is a semantic choice, and we’re trying to match the semantics of
CPU madvise. I believe any semantics an individual driver stack wants to
define are valid, but if vendors' semantics mismatch, this will create a
level of vendor lock-in, which may (cough Nvidia, CUDA) or may not (open
source) be desired.
AFAIK, if you do something like this in C (a CPU-only program):
mmap(addr_range);
madvise(addr_range, some_flags);
munmap(addr_range);
mmap(addr_range); /* Here the madvise attributes are reset */
Also, AFAIK, the CUDA GPU madvise API works this way as well.
That said, making this work 100% reliably is quite difficult, especially
with a rude user.
For example:
mmap(addr_range);
gpu_madvise(addr_range, some_flags);
/* GPU never actually touches memory */
munmap(addr_range);
So we have an opt-in VM bind flag,
DRM_XE_VM_BIND_FLAG_MADVISE_AUTORESET, which we’re working on to mostly
handle the “rude” case above. Maybe we can reach 100% correctness, but
again, this is a difficult problem and the fix is still WIP.
Matt
>
> Regards,
> Honglei
>
> >
> > > > > > >
> > > > > > > - Physical page layer (drm_gpusvm ranges): managed by the
> > > > > > > drm_gpusvm framework, representing actual HMM backed DMA
> > > > > > > mappings and GPU page table entries.
> > > > > > >
> > > > > > > This separation is necessary:
> > > > > > > - The framework does not support range splitting, so a partial
> > > > > > > munmap destroys the entire overlapping range, including the
> > > > > > > still valid parts. If attributes were stored inside drm_gpusvm
> > > > > > > ranges, they would be lost on unmapping.
> > > > > > > The separate attr tree preserves userspace set attributes
> > > > > > > across range operations.
> > > >
> > > > Yes, in Xe the divide is at the VMA level (set by user space) via VM
> > > > bind (parts of VM may be mappings BOs, parts could be setup for SVM) or
> > > > madvise IOCTLs which reflect user space attributes on current SVM
> > > > mappings or future ones.
> > > >
> > > > The SVM range tree reflects mappings that have been faulted into the
> > > > device and contain pages. This is an intentional choice.
> > >
> > > That makes a lot of sense. Thank you for clarifying the design intent. I
> > > think the current adopt the same principle: the drm_gpusvm range tree only
> > > reflect actual faulted in mappings.
> > >
> > > >
> > > > > >
> > > > > > Isn't that actually intended? When parts of the range unmap then that usually means the whole range isn't valid any more.
> > > >
> > > >
> > > > Yes, this was an intentional design choice to not support partial unmap,
> > > > and instead rely on the driver to recreate a new range.
> > > >
> > > > The reasoning is:
> > > >
> > > > - In practice, this should be rare for well-behaved applications.
> > > >
> > > > - With THP / large device pages, if a sub-range is unmapped, the entire
> > > > GPU mapping is invalidated anyway due to the page size change. As a
> > > > result, the cost of creating a new range is minimal, since the device
> > > > will likely fault again on the remaining pages.
> > > >
> > > > So there is no need to over-engineer the common code.
> > > >
> > > > FWIW, to even test partial unmaps in Xe, I had to do things I doubt
> > > > anyone would ever do:
> > > >
> > > > ptr = mmap(SZ_2M);
> > > > /* fault in memory to the device */
> > > > munmap(ptr, SZ_1M);
> > > > /* touch memory again on the device */
> > > >
> > >
> > > Thank you for this explanation and the concrete example. After further
> > > discussion internally with Christian, we are now aligned with same position
> > > partial unmap. Will remove rebuild on partial unmap logic in the next
> > > version and handle it as only partially backed range.
> > >
> > > > >
> > > > >
> > > > > It is about partial unmap, some subregion in drm_gpusvm_range is still valid
> > > > > but some other subregion is invalid, but under drm_gpusvm, need to destroy
> > > > > the entire range.
> > > > >
> > > > > e.g.:
> > > > >
> > > > > [---------------unmap region in mmu notifier-----------------]
> > > > > [0x1000 ------------ 0x9000]
> > > > > [ valid ][ invalid ]
> > > > >
> > > > > see deatil in drm_gpusvm.c:110 line
> > > > > section:Partial Unmapping of Ranges
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > - drm_gpusvm range boundaries are determined by fault address
> > > > > > > and pre setted chunk size, not by userspace attribute boundaries.
> > > > > > > Ranges may be rechunked on memory changes. Embedding
> > > > > > > attributes in framework ranges would scatter attr state
> > > > > > > across many small ranges and require complex reassemble
> > > > > > > logic when operate attrbute.
> > > > > >
> > > > > > Yeah, that makes a lot of sense.
> > > > > >
> > > > > > >
> > > > > > > 2) System memory mapping via drm_gpusvm
> > > > > > >
> > > > > > > The core mapping path uses drm_gpusvm_range_find_or_insert() to
> > > > > > > create ranges, drm_gpusvm_range_get_pages() for HMM page fault
> > > > > > > and DMA mapping, then updates GPU page tables via
> > > > > > > amdgpu_vm_update_range().
> > > > > > >
> > > > > > > 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
> > > > > > >
> > > > > > > On XNACK off hardware the GPU cannot recover from page faults,
> > > > > > > so mappings must be established through ioctl. When
> > > > > > > userspace calls SET_ATTR with ACCESS=ENABLE, the driver
> > > > > > > walks the attr tree and maps all accessible intervals
> > > > > > > to the GPU by amdgpu_svm_range_map_attr_ranges().
> > > >
> > > > Can you expand on XNACK off / GPU no faults? Is this to the share GPU
> > > > between 3D (dma-fences) and faulting clients? We have something similar
> > > > in Xe, but it isn't an explicit IOCTL rather we switch between on demand
> > > > as 3D client submits and then resumes page faults when all dma-fences
> > > > have signaled.
> > > >
> > > > I see below you mention page tables are modified during quiesce KFD
> > > > queues? I'm not sure that is required - you just need to guarnette
> > > > faulting clients won't trigger page faults when dma-fence is in flight.
> > > >
> > > > Maybe give me an explaination of exactly what the requirement from AMD
> > > > are here so I have better picture.
> > >
> > > Thank you for the patience, let me try to explain our situation, though
> > > I may not get every detail right.
> > >
> > > XNACK off means hardware that does not have GPU page fault capability (or
> > > turned off)
> > >
> > > So for these GPUs, ALL page table entries must be fully populated before
> > > the GPU can access the memory. This is why we need the ioctl driven
> > > mapping path, when userspace calls SET_ATTR with ACCESS=ENABLE, need
> > > walk the attribute tree and eagerly map all accessible ranges into the
> > > GPU page tables. This is functionally similar to what you describe as
> > > prefetch IOCTLs / VM bind in Xe.
> > >
> > > Regarding queue quiesce during page table modification: on XNACK off
> > > hardware, because the GPU cannot fault, we must ensure the GPU is
> > > completely stopped before modifying any PTE it might be accessing.
> > > Otherwise the GPU could access a partially updated page table and hang.
> > > The quiesce/resume is the mechanism to guarantee this.
> > >
> > > I hope that helps clarify the picture.
> > >
> >
> > This clarifies a lot. This is what we’d call in Xe “preemption fence”
> > mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
> > resume. We don’t actually support SVM in this case; instead, we use
> > “userptr binds,” which are built on gpusvm for page collection. However,
> > we don’t support migrating memory to the device—though we could.
> >
> > I’d look at how we converted 'userptr' to be based on GPU SVM [2]. In
> > this case, don’t maintain a range tree, as those—as you suggest—are more
> > of an on-demand fault driver concern. Instead, just embed 'struct
> > drm_gpusvm_pages' in the VMA struct defined by the IOCTLs..
> >
> > We could extend this to support migrating 'userptr', but we just haven’t
> > done that yet—this may be what you want to do in “XNACK off..
> >
> > [2] https://patchwork.freedesktop.org/series/146553/
> >
> > >
> > > >
> > > > > > >
> > > > > > > 4) Invalidation, GC worker, and restore worker
> > > > > > >
> > > > > > > MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
> > > > > > > three cases based on event type and hardware mode:
> > > > > > > - unmap event: clear GPU PTEs in the notifier context,
> > > > > > > unmap DMA pages, mark ranges as unmapped, flush TLB,
> > > > > > > and enqueue to the GC worker. On XNACK off, also
> > > > > > > quiesce KFD queues and schedule rebuild of the
> > > > > > > still valid portions that were destroyed together with
> > > > > > > the unmapped subregion.
> > > > > > >
> > > > > > > - evict on XNACK off:
> > > > > > > quiesce KFD queues first, then unmap DMA pages and
> > > > > > > enqueue to the restore worker.
> > > > > >
> > > > > > Is that done through the DMA fence or by talking directly to the MES/HWS?
> > > > >
> > > > > Currently KFD queues quiesce/resume API are reused, lookig forward to a
> > > > > better solution.
> > > > >
> > > >
> > > > +1
> > > >
> > > > > Regards,
> > > > > Honglei
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Christian.
> > > > > >
> > > > > > >
> > > > > > > - evict on XNACK on:
> > > > > > > clear GPU PTEs, unmap DMA pages, and flush TLB, but do
> > > > > > > not schedule any worker. The GPU will fault on next
> > > > > > > access and the fault handler establishes the mapping.
> > > > > > >
> > > > > > > Not supported feature:
> > > > > > > - XNACK on GPU page fault mode
> > > > > > > - migration and prefetch feature
> > > > > > > - Multi GPU support
> > > > > > >
> > > > > > > XNACK on enablement is ongoing.The GPUs that support XNACK on
> > > > > > > are currently only accessible to us via remote lab machines, which slows
> > > > > > > down progress.
> > > > > > >
> > > > > > > Patch overview:
> > > > > > >
> > > > > > > 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
> > > > > > > SET_ATTR/GET_ATTR operations, attribute types, and related
> > > > > > > structs in amdgpu_drm.h.
> > > > > > >
> > > > > > > 02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
> > > > > > > refcount, attr_tree, workqueues, locks, and
> > > > > > > callbacks (begin/end_restore, flush_tlb).
> > > > > > >
> > > > > > > 03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
> > > > > > > (interval tree node), attr_tree, access enum, flag masks,
> > > > > > > and change trigger enum.
> > > > > > >
> > > > > > > 04/12 Attribute tree operations: interval tree lookup, insert,
> > > > > > > remove, and tree create/destroy lifecycle.
> > > > > > >
> > > > > > > 05/12 Attribute set: validate UAPI attributes, apply to internal
> > > > > > > attrs, handle hole/existing range with head/tail splitting,
> > > > > > > compute change triggers, and -EAGAIN retry loop.
> > > > > > > Implements attr_clear_pages for unmap cleanup and attr_get.
> > > > > > >
> > > > > > > 06/12 Range data structures: amdgpu_svm_range extending
> > > > > > > drm_gpusvm_range with gpu_mapped state, pending ops,
> > > > > > > pte_flags cache, and GC/restore queue linkage.
> > > > > > >
> > > > > > > 07/12 PTE flags and GPU mapping: simple gpu pte function,
> > > > > > > GPU page table update with DMA address, range mapping loop:
> > > > > > > find_or_insert -> get_pages -> validate -> update PTE,
> > > > > > > and attribute change driven mapping function.
> > > > > > >
> > > > > > > 08/12 Notifier and invalidation: synchronous GPU PTE clear in
> > > > > > > notifier context, range removal and overlap cleanup,
> > > > > > > rebuild after destroy logic, and MMU event dispatcher
> > > > > > >
> > > > > > > 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
> > > > > > > worker for unmap processing and rebuild, ordered restore
> > > > > > > worker for mapping evicted ranges, and flush/sync
> > > > > > > helpers.
> > > > > > >
> > > > > > > 10/12 Initialization and fini: kmem_cache for range/attr,
> > > > > > > drm_gpusvm_init with chunk sizes, XNACK detection, TLB
> > > > > > > flush helper, and amdgpu_svm init/close/fini lifecycle.
> > > > > > >
> > > > > > > 11/12 IOCTL and fault handler: PASID based SVM lookup with kref
> > > > > > > protection, amdgpu_gem_svm_ioctl dispatcher, and
> > > > > > > amdgpu_svm_handle_fault for GPU page fault recovery.
> > > > > > >
> > > > > > > 12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
> > > > > > > Makefile rules, ioctl table registration, and amdgpu_vm
> > > > > > > hooks (init in make_compute, close/fini, fault dispatch).
> > > > > > >
> > > > > > > Test result:
> > > > > > > on gfx1100(W7900) and gfx943(MI300x)
> > > > > > > kfd test: 95%+ passed, same failed cases with offical relase
> > > > > > > rocr test: all passed
> > > > > > > hip catch test: 20 cases failed in all 5366 cases, +13 failures vs offical relase
> > > > > > >
> > > > > > > During implementation we identified several challenges / design questions:
> > > > > > >
> > > > > > > 1. No range splitting on partial unmap
> > > > > > >
> > > > > > > drm_gpusvm explicitly does not support range splitting in drm_gpusvm.c:122.
> > > > > > > Partial munmap needs to destroy the entire range including the valid interval.
> > > > > > > GPU fault driven hardware can handle this design by extra gpu fault handle,
> > > > > > > but AMDGPU needs to support XNACK off hardware, this design requires driver
> > > > > > > rebuild the valid part in the removed entire range. Whichs bring a very heavy
> > > > > > > restore work in work queue/GC worker: unmap/destroy -> rebuild(insert and map)
> > > > > > > this restore work even heavier than kfd_svm. In previous driver work queue
> > > > > > > only needs to restore or unmap, but in drm_gpusvm driver needs to unmap and restore.
> > > > > > > which brings about more complex logic, heavier worker queue workload, and
> > > > > > > synchronization issues.
> > > >
> > > > Is this common in the workload you are running? I'm also wondering if
> > > > your restore logic / KFDs design is contributing to this actally the
> > > > problem.
> > > >
> > >
> > > Honestly, you raise a fair point.
> > >
> > > We will redesign the logic about the partial munap, which should eliminate
> > > most of this complexity.
> > >
> > >
> >
> > +1, yes test but do optimize for.
> >
> > > > > > >
> > > > > > > 2. Fault driven vs ioctl driven mapping
> > > > > > >
> > > > > > > drm_gpusvm is designed around GPU page fault handlers. The primary entry
> > > > > > > point drm_gpusvm_range_find_or_insert() takes a fault_addr.
> > > > > > > AMDGPU needs to support IOCTL driven mapping cause No XNACK hardware that
> > > > > > > GPU cannot fault at all
> > > >
> > > > I think we refer to these as prefetch IOCTLs in Xe. Ideally, user space
> > > > issues these so the device does not fault (e.g., prefetch creates a set
> > > > of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
> > > > specific VM bind operations.
> > > >
> > >
> > > That is a very helpful way to think about it. Yes, our ioctl driven
> > > mapping(xnack off) is essentially equivalent to a prefetch operation. We are
> > > trying to improve it.
> > >
> >
> > See above wrt 'userptr'.
> >
> > >
> > > > > > >
> > > > > > > The ioctl path cannot hold mmap_read_lock across the entire operation
> > > > > > > because drm_gpusvm_range_find_or_insert() acquires/releases it
> > > > > > > internally. This creates race windows with MMU notifiers / workers.
> > > >
> > > > This is a very intentional choice in the locking design: mmap_read_lock
> > > > is held only in very specific parts of GPU SVM, and the driver should
> > > > never need to take this lock.
> > > >
> > > > Yes, notifiers can race, which is why the GPU fault handler and prefetch
> > > > handler are structured as retry loops when a notifier race is detected.
> > > > In practice, with well-behaved applications, these races should be
> > > > rare—but they do occur, and the driver must handle them.
> > > >
> > > > __xe_svm_handle_pagefault implements the page fault retry loop. VM bind
> > > > prefetch has similar logic, although it is more spread out given that it
> > > > is part of a deeper software pipeline.
> > > >
> > > > FWIW, holding locks to avoid races was rejected by Sima because we
> > > > reasoned it is essentially impossible to guarantee the absence of races
> > > > by holding a lock. CPU page fault handlers are also effectively just
> > > > large retry loops.
> > > >
> > > > So this is one point I believe you will need to fixup driver side.
> > > >
> > >
> > > Understood. Thank you for the detailed explanation and for pointing to
> > > __xe_svm_handle_pagefault as a reference. We will restructure both our
> > > fault handler and ioctl path to a betterretry loop pattern with sequence
> > > number race detection.
> > >
> >
> > Yes, the typical pattern is:
> >
> > - Try to migrate once
> > - If you hit a race, give up, evict all memory back to system memory, and bind it
> >
> > Atomics make this tricky because memory must move, but I’m not sure
> > “XNACK off” applies here. However, GPU SVM provides a timeslice
> > mechanism to ensure the CPU can’t move memory while the GPU needs to
> > execute something.
> >
> > > > > > >
> > > > > > > 3. Multi GPU support
> > > > > > >
> > > > > > > drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
> > > > > > > each GPU gets an independent instance with its own range tree, MMU
> > > > > > > notifiers, notifier_lock, and DMA mappings.
> > > > > > >
> > > >
> > > > This is a part I am absolutely open to fixing. Right now, each
> > > > drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
> > > > decoupling a GPU SVM instance from a single device, allowing each
> > > > drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
> > > > device).
> > > >
> > > > This would give drivers the flexibility to use one GPU SVM instance per
> > > > VM/device instance (as in Xe), or to maintain a single GPU SVM per CPU
> > > > MM.
> > > >
> > >
> > > That would be wonderful! Looking forward to your patch very much!
> > >
> >
> > I can't say I'll code this but we thought about is as options and very
> > open patches which refactor the object model for multiple use cases.
> >
> > >
> > > > > > > This may brings huge overhead:
> > > > > > > - N x MMU notifier registrations for the same address range
> > > >
> > > > The notifier overhead is a real concern. We recently introduced two-pass
> > > > notifiers [1] to speed up multi-device notifiers. At least in Xe, the
> > > > TLB invalidations—which are the truly expensive part—can be pipelined
> > > > using the two=pass approach. Currently, [1] only implements two-pass
> > > > notifiers for userptr, but Xe’s GPU SVM will be updated to use them
> > > > shortly.
> > > >
> > > > [1] https://patchwork.freedesktop.org/series/153280/
> > > >
> > >
> > > Thank you for the pointer to two-pass notifiers. Will study this
> > > series.
> > >
> > > > > > > - N x hmm_range_fault() calls for the same page (KFD: 1x)
> > > >
> > > > hmm_range_fault is extremely fast compared to the actual migration.
> > > > Running hmm_range_fault on a 2MB region using 4KB pages takes less
> > > > than 1µs. With THP or large device pages [2] (merged last week), it’s
> > > > around 1/20 of a microsecond. So I wouldn’t be too concerned about this.
> > > >
> > > > [2] https://patchwork.freedesktop.org/series/163141/
> > > >
> > >
> > > That is very helpful data. Perhaps worry too much.
> > >
> > > > > > > - N x DMA mapping memory
> > > >
> > > > You will always have N x DMA mapping memory if the pages are in system
> > > > memory as the dma-mapping API is per device.
> > >
> > > Totally agreed.
> > >
> > > >
> > > > > > > - N x invalidation + restore worker scheduling per CPU unmap event
> > > > > > > - N x GPU page table flush / TLB invalidation
> > > >
> > > > I agree you do not want serialize GPU page table flush / TLB
> > > > invalidations. Hence two-pass notifiers [1].
> > >
> > > Yes, will learn it.
> > >
> > > >
> > > > > > > - Increased mmap_lock hold time, N callbacks serialize under it
> > > > > > >
> > > > > > > compatibility issues:
> > > > > > > - Quiesce/resume scope mismatch: to integrate with KFD compute
> > > > > > > queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm()
> > > > > > > which have process level semantics. Under the per GPU
> > > > > > > drm_gpusvm model, maybe there are some issues on sync. To properly
> > > > > > > integrate with KFD under the per SVM model, a compatibility or
> > > > > > > new per VM level queue control APIs maybe need to introduced.
> > > > > > >
> > > >
> > > > I thought the idea to get rid of KFD and move over to AMDGPU? I thought
> > > > Christian mentioned this to me at XDC.
> > > >
> > >
> > > > > > > Migration challenges:
> > > > > > >
> > > > > > > - No global migration decision logic: each per GPU SVM
> > > > > > > instance maintains its own attribute tree independently. This
> > > > > > > allows conflicting settings (e.g., GPU0's SVM sets
> > > > > > > PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
> > > > > > > for the same address range) with no detection or resolution.
> > > > > > > A global attribute coordinator or a shared manager is needed to
> > > > > > > provide a unified global view for migration decisions
> > > >
> > > > Yes, this is hole in the Xe API too. We have told UMDs if they setup
> > > > individual VMs with conflict attributes for a single CPU address space
> > > > the behavior is undefined. Our UMD implement madvise is basically loop
> > > > over al GPU VMs setting the same attributes.
> > >
> > > Will follow the same approach for now, the UMD is responsible for setting
> > > consistent attributes across GPU VMs.
> > >
> >
> > +1
> >
> > > >
> > > > > > >
> > > > > > > - migrate_vma_setup broadcast: one GPU's migration triggers MMU
> > > > > > > notifier callbacks in ALL N-1 other drm_gpusvm instances,
> > > > > > > causing N-1 unnecessary restore workers to be scheduled. And
> > > >
> > > > My feeling is that you shouldn’t reschedule restore workers unless you
> > > > actually have to invalidate page tables (i.e., you have a local SVM
> > > > range within the notifier). So the first migration to an untouched
> > > > region may trigger notifiers, but they won’t do anything because you
> > > > don’t have any valid SVM ranges yet. Subsequent mappings of the migrated
> > > > region won’t trigger a notifier unless the memory is moved again.
> > > >
> > >
> > > That is a very good point. We should check whether we actually have
> > > valid SVM ranges before scheduling restore workers. If there is nothing
> > > to invalidate, the notifier callback should be a no-op. We will review
> > > our notifier callback logic to ensure we are not doing unnecessary work
> > > here. Thank you for pointing this out.
> > >
> > > > > > > creates races between the initiating migration and the other
> > > > > > > instance's restore attempts.
> > > >
> > > > Yes, if multiple devices try to migrate the same CPU pages at the same
> > > > time, that will race. That’s why in Xe we have a module-level
> > > > driver_migrate_lock. The first migration runs in read mode; if it
> > > > detects a race and aborts, it then takes driver_migrate_lock in write
> > > > mode so it becomes the only device allowed to move memory / CPU pages.
> > > > See xe_svm_alloc_vram() for how this is used.
> > > >
> > > > I’m not sure this approach will work for you, but I just wanted to point
> > > > out that we identified this as a potential issue.
> > > >
> > >
> > > Thank you for sharing the driver_migrate_lock approach and pointing to
> > > xe_svm_alloc_vram(). Will explore whether a similar lock pattern can work
> > > for our case.
> > >
> > > > > > >
> > > > > > > - No cross instance migration serialization: each per GPU
> > > > > > > drm_gpusvm instance has independent locking, so two GPUs'
> > > > > > > "decide -> migrate -> remap" sequences can interleave. While
> > > > > > > the kernel page lock prevents truly simultaneous migration of
> > > > > > > the same physical page, the losing side's retry (evict from
> > > > > > > other GPU's VRAM -> migrate back) triggers broadcast notifier
> > > > > > > invalidations and restore workers, compounding the ping pong
> > > > > > > problem above.
> > > > > > >
> > > >
> > > > See the driver_migrate_lock above.
> > >
> > > Acknowledged, thank you.
> > > >
> > > > > > > - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
> > > > > > > hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
> > > > > > > it only selects system memory pages for migration.
> > > > > > >
> > > >
> > > > I think this is fixed? We did find some core MM bugs that blocked VRAM
> > > > to VRAM but those have been worked out.
> > > >
> > > > The code I'm looking at:
> > > >
> > > > 517 int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
> > > > 518 struct mm_struct *mm,
> > > > 519 unsigned long start, unsigned long end,
> > > > 520 const struct drm_pagemap_migrate_details *mdetails)
> > > > 521 {
> > > > 522 const struct drm_pagemap_devmem_ops *ops = devmem_allocation->ops;
> > > > 523 struct drm_pagemap *dpagemap = devmem_allocation->dpagemap;
> > > > 524 struct dev_pagemap *pagemap = dpagemap->pagemap;
> > > > 525 struct migrate_vma migrate = {
> > > > 526 .start = start,
> > > > 527 .end = end,
> > > > 528 .pgmap_owner = pagemap->owner,
> > > > 529 .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
> > > > 530 MIGRATE_VMA_SELECT_DEVICE_PRIVATE | MIGRATE_VMA_SELECT_COMPOUND,
> > > > 531 };
> > > >
> > >
> > > Thank you for checking! I am using v6.18 for this POC, missed the fix, will
> > > rebase to the latest.
> > >
> > >
> > > > > > > - CPU fault reverse migration race: CPU page fault triggers
> > > > > > > migrate_to_ram while GPU instances are concurrently operating.
> > > > > > > Per GPU notifier_lock does not protect cross GPU operations.
> > > >
> > > > No, again retry loop as discussed above.
> > >
> > > Understood.
> > >
> > > >
> > > > > > >
> > > > > > > We believe a strong, well designed solution at the framework level is
> > > > > > > needed to properly address these problems, and we look forward to
> > > > > > > discussion and suggestions.
> > > >
> > > > Let's work together to figure out what is missing here.
> > >
> > > Thank you so much, Matt. Your feedback has been incredibly valuable and
> > > has given us a much clearer picture of the framework's design.
> > > Ireally appreciate the effort you put into building drm_gpusvm as a
> > > shared framework. Will incorporate your suggestions into our next
> > > revision and look forward to continuing the collaboration.
> > >
> >
> > No problem. Happy to help.
> >
> > Matt
> >
> > > Regards,
> > > Honglei
> > >
> > >
> > > >
> > > > Matt
> > > >
> > > > > > >
> > > > > > > Honglei Huang (12):
> > > > > > > drm/amdgpu: add SVM UAPI definitions
> > > > > > > drm/amdgpu: add SVM data structures and header
> > > > > > > drm/amdgpu: add SVM attribute data structures
> > > > > > > drm/amdgpu: implement SVM attribute tree operations
> > > > > > > drm/amdgpu: implement SVM attribute set
> > > > > > > drm/amdgpu: add SVM range data structures
> > > > > > > drm/amdgpu: implement SVM range PTE flags and GPU mapping
> > > > > > > drm/amdgpu: implement SVM range notifier and invalidation
> > > > > > > drm/amdgpu: implement SVM range workers
> > > > > > > drm/amdgpu: implement SVM core initialization and fini
> > > > > > > drm/amdgpu: implement SVM ioctl and fault handler
> > > > > > > drm/amdgpu: wire up SVM build system and fault handler
> > > > > > >
> > > > > > > drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
> > > > > > > drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 ++++++++++++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
> > > > > > > include/uapi/drm/amdgpu_drm.h | 39 +
> > > > > > > 12 files changed, 2958 insertions(+), 4 deletions(-)
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
> > > > > > >
> > > > > > >
> > > > > > > base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
> > > > > >
> > > > >
> > >
>
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-04-23 6:52 ` Matthew Brost
@ 2026-04-23 8:22 ` Huang, Honglei1
2026-04-29 9:56 ` Huang, Honglei1
0 siblings, 1 reply; 36+ messages in thread
From: Huang, Honglei1 @ 2026-04-23 8:22 UTC (permalink / raw)
To: Matthew Brost, Christian König, Felix.Kuehling, Philip.Yang
Cc: amd-gfx, dri-devel, Alexander.Deucher, Honglei Huang, Oak.Zeng,
Jenny-Jing.Liu, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu,
Junhua.Shen, Thomas Hellström, Rodrigo Vivi,
Danilo Krummrich, Alice Ryhl
On 4/23/2026 2:52 PM, Matthew Brost wrote:
> On Thu, Apr 23, 2026 at 02:09:59PM +0800, Huang, Honglei1 wrote:
>>
>>
>> On 3/23/2026 2:31 PM, Matthew Brost wrote:
>>> On Thu, Mar 19, 2026 at 10:17:36PM +0800, Honglei Huang wrote:
>>>>
>>>>
>>>> On 3/19/26 13:08, Matthew Brost wrote:
>>>>> On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
>>>>>>
>>>>>
>>>>> Disclaimer I haven't look at any code in this series yet.
>>>>>
>>>>>>
>>>>>> On 3/17/26 19:48, Christian König wrote:
>>>>>>> Adding a few XE and drm_gpuvm people on TO.
>>>>>>>
>>>>>>> On 3/17/26 12:29, Honglei Huang wrote:
>>>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>>>
>>>>>>>> This is a POC/draft patch series of SVM feature in amdgpu based on the
>>>>>>>> drm_gpusvm framework. The primary purpose of this RFC is to validate
>>>>>>>> the framework's applicability, identify implementation challenges,
>>>>>>>> and start discussion on framework evolution. This is not a production
>>>>>
>>>>> +1. Open to any ideas. Given this was designed originally for Xe we very
>>>>> well could have missed other drivers requirements.
>>>> Hi Matt,
>>>>
>>>> Thank you for the openness. And thank you so much for the incredibly
>>>> detailed and patient response. I really appreciate you taking the time to
>>>> walk through each point.
>>>>
>>>
>>> I'm here to help.
>>>
>>>> Actually I am still a learner when it comes to the drm_gpusvm framework and
>>>> GPU SVM design in general. Some of my descriptions below may not be entirely
>>>> accurate. But I really want to bring drm_gpusvm into amdgpu and make it work
>>>> well.
>>>
>>> I appreciate another driver jumping in and using this framework—it
>>> becomes easier to validate as more users adopt it.
>>>
>>>>
>>>>>
>>>>>>>> ready submission.
>>>>>>>>
>>>>>>>> This patch series implements basic SVM support with the following features:
>>>>>>>>
>>>>>>>> 1. attributes sepatarated from physical page management:
>>>>>>>>
>>>>>>>> - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
>>>>>>>> tree that stores SVM attributes. Managed through the SET_ATTR,
>>>>>>>> and mmu notifier callback.
>>>>>
>>>>> Can you explain the mmu notifier callback interaction here? See below in
>>>>> Xe the attribute tree is existing VMA tree (gpuvm).
>>>>>
>>>>
>>>> Let me try to explain, apologies if the description is not fully
>>>> precise.
>>>>
>>>> In current implementation, the MMU notifier callback interacts with the attr
>>>> tree only in the munmap path remove the corresponding attribute
>>>> entries from the attr tree so that stale attributes do not persist for
>>>> freed address space.
>>>>
>>>
>>> Ah, yes. We reset our attributes upon munmap too. We actually don't this
>>> 100% correct quite either and series in flight to fix [1].
>>>
>>> [1] https://patchwork.freedesktop.org/series/161815/
>>
>> Hi matt,
>>
>> It seems like you are tring to modify the implementation into remove the
>> attributes when munmap.
>>
>> Actuall we have a discussion internally that does the driver need to remove
>> the attributes when munmap.
>>
>> So there servel ideas:
>>
>> 1. attribute need keep: attributes may be needed again when a new VMA
>> appears or on subsequent faults.
>> 2.attribute need keep: attributes can be set independent of whether memory
>> is currently mapped; attributes persist and are modified explicitly via
>> ioctl, not implicitly by notifier callbacks.
>> 3. attribute need remove: casue VMA is gone, driver can do nothing without
>> VMA.
>>
>> and I saw xe_svm set default attribute in the previous version, this is also
>> a option.
>>
>> Can you please help to give some information that why xe_svm is turing to
>> remove the attribute when munmap? And does keeping attribute is a valid way?
>>
>
> This is a semantic choice, and we’re trying to match the semantics of
> CPU madvise. I believe any semantic an individual driver stack wants to
> define is valid, but if vendors mismatch sematics this will create a
> level of vendor lock in which may (cough Nvidia, CUDA) or may not (open
> source) be desired.
>
> AFAIK, if you do something like this in C (a CPU-only program):
>
> mmap(addr_range);
> madvise(addr_range, some_flags);
> munmap(addr_range);
>
> mmap(addr_range); /* Here the madvise attributes are reset */
>
> Also, AFAIK, the CUDA GPU madvise API works this way as well.
>
> That said, making this work 100% reliably is quite difficult, especially
> with a rude user.
>
> For example:
>
> mmap(addr_range);
> gpu_madvise(addr_range, some_flags);
> /* GPU never actually touches memory */
> munmap(addr_range);
>
> So we have an opt-in VM bind flag,
> DRM_XE_VM_BIND_FLAG_MADVISE_AUTORESET, which we’re working on to mostly
> handle the “rude” case above. Maybe we can reach 100% correctness, but
> again, this is a difficult problem is WIP.
>
I think the current amdgpu SVM draft version also has this issue in the
rude-user situation. Maybe this is caused by the separation of the
attribute layer and the physical layer; KFD_SVM does not seem to have this
issue.
Maybe the driver can do find_or_insert in the madvise ioctl path to register
an MMU notifier without calling get_pages, and then clean up the attributes
in the MMU notifier callback instead of in the GC worker. This is just my
thought; a rough sketch follows below.
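Something like this, purely as a sketch (the my_* names are made up, and
whether inserting a range without get_pages is a sane use of
drm_gpusvm_range_find_or_insert() is exactly the open question):

static int my_svm_madvise(struct my_svm *svm, u64 start, u64 size, u32 attr)
{
        int err;

        /*
         * Insert a tracking range so the MMU notifier covers
         * [start, start + size), but do NOT fault pages in here
         * (no get_pages / DMA mapping on this path).
         */
        err = my_svm_insert_tracking_range(svm, start, size);
        if (err)
                return err;

        /* Record the attribute in the driver-side interval tree. */
        return my_svm_attr_tree_set(svm, start, size, attr);
}

/*
 * ...and then the notifier's unmap path would drop the matching attribute
 * entries directly, instead of leaving that to the GC worker.
 */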
And really, thanks for the information; looking forward to others' comments.
Regards,
Honglei
> Matt
>
>>
>> Regards,
>> Honglei
>>
>>>
>>>>>>>>
>>>>>>>> - Physical page layer (drm_gpusvm ranges): managed by the
>>>>>>>> drm_gpusvm framework, representing actual HMM backed DMA
>>>>>>>> mappings and GPU page table entries.
>>>>>>>>
>>>>>>>> This separation is necessary:
>>>>>>>> - The framework does not support range splitting, so a partial
>>>>>>>> munmap destroys the entire overlapping range, including the
>>>>>>>> still valid parts. If attributes were stored inside drm_gpusvm
>>>>>>>> ranges, they would be lost on unmapping.
>>>>>>>> The separate attr tree preserves userspace set attributes
>>>>>>>> across range operations.
>>>>>
>>>>> Yes, in Xe the divide is at the VMA level (set by user space) via VM
>>>>> bind (parts of VM may be mappings BOs, parts could be setup for SVM) or
>>>>> madvise IOCTLs which reflect user space attributes on current SVM
>>>>> mappings or future ones.
>>>>>
>>>>> The SVM range tree reflects mappings that have been faulted into the
>>>>> device and contain pages. This is an intentional choice.
>>>>
>>>> That makes a lot of sense. Thank you for clarifying the design intent. I
>>>> think the current adopt the same principle: the drm_gpusvm range tree only
>>>> reflect actual faulted in mappings.
>>>>
>>>>>
>>>>>>>
>>>>>>> Isn't that actually intended? When parts of the range unmap then that usually means the whole range isn't valid any more.
>>>>>
>>>>>
>>>>> Yes, this was an intentional design choice to not support partial unmap,
>>>>> and instead rely on the driver to recreate a new range.
>>>>>
>>>>> The reasoning is:
>>>>>
>>>>> - In practice, this should be rare for well-behaved applications.
>>>>>
>>>>> - With THP / large device pages, if a sub-range is unmapped, the entire
>>>>> GPU mapping is invalidated anyway due to the page size change. As a
>>>>> result, the cost of creating a new range is minimal, since the device
>>>>> will likely fault again on the remaining pages.
>>>>>
>>>>> So there is no need to over-engineer the common code.
>>>>>
>>>>> FWIW, to even test partial unmaps in Xe, I had to do things I doubt
>>>>> anyone would ever do:
>>>>>
>>>>> ptr = mmap(SZ_2M);
>>>>> /* fault in memory to the device */
>>>>> munmap(ptr, SZ_1M);
>>>>> /* touch memory again on the device */
>>>>>
>>>>
>>>> Thank you for this explanation and the concrete example. After further
>>>> discussion internally with Christian, we are now aligned with same position
>>>> partial unmap. Will remove rebuild on partial unmap logic in the next
>>>> version and handle it as only partially backed range.
>>>>
>>>>>>
>>>>>>
>>>>>> It is about partial unmap, some subregion in drm_gpusvm_range is still valid
>>>>>> but some other subregion is invalid, but under drm_gpusvm, need to destroy
>>>>>> the entire range.
>>>>>>
>>>>>> e.g.:
>>>>>>
>>>>>> [---------------unmap region in mmu notifier-----------------]
>>>>>> [0x1000 ------------ 0x9000]
>>>>>> [ valid ][ invalid ]
>>>>>>
>>>>>> see deatil in drm_gpusvm.c:110 line
>>>>>> section:Partial Unmapping of Ranges
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> - drm_gpusvm range boundaries are determined by fault address
>>>>>>>> and pre setted chunk size, not by userspace attribute boundaries.
>>>>>>>> Ranges may be rechunked on memory changes. Embedding
>>>>>>>> attributes in framework ranges would scatter attr state
>>>>>>>> across many small ranges and require complex reassemble
>>>>>>>> logic when operate attrbute.
>>>>>>>
>>>>>>> Yeah, that makes a lot of sense.
>>>>>>>
>>>>>>>>
>>>>>>>> 2) System memory mapping via drm_gpusvm
>>>>>>>>
>>>>>>>> The core mapping path uses drm_gpusvm_range_find_or_insert() to
>>>>>>>> create ranges, drm_gpusvm_range_get_pages() for HMM page fault
>>>>>>>> and DMA mapping, then updates GPU page tables via
>>>>>>>> amdgpu_vm_update_range().
>>>>>>>>
>>>>>>>> 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
>>>>>>>>
>>>>>>>> On XNACK off hardware the GPU cannot recover from page faults,
>>>>>>>> so mappings must be established through ioctl. When
>>>>>>>> userspace calls SET_ATTR with ACCESS=ENABLE, the driver
>>>>>>>> walks the attr tree and maps all accessible intervals
>>>>>>>> to the GPU by amdgpu_svm_range_map_attr_ranges().
>>>>>
>>>>> Can you expand on XNACK off / GPU no faults? Is this to the share GPU
>>>>> between 3D (dma-fences) and faulting clients? We have something similar
>>>>> in Xe, but it isn't an explicit IOCTL rather we switch between on demand
>>>>> as 3D client submits and then resumes page faults when all dma-fences
>>>>> have signaled.
>>>>>
>>>>> I see below you mention page tables are modified during quiesce KFD
>>>>> queues? I'm not sure that is required - you just need to guarnette
>>>>> faulting clients won't trigger page faults when dma-fence is in flight.
>>>>>
>>>>> Maybe give me an explaination of exactly what the requirement from AMD
>>>>> are here so I have better picture.
>>>>
>>>> Thank you for the patience, let me try to explain our situation, though
>>>> I may not get every detail right.
>>>>
>>>> XNACK off means hardware that does not have GPU page fault capability (or
>>>> turned off)
>>>>
>>>> So for these GPUs, ALL page table entries must be fully populated before
>>>> the GPU can access the memory. This is why we need the ioctl driven
>>>> mapping path, when userspace calls SET_ATTR with ACCESS=ENABLE, need
>>>> walk the attribute tree and eagerly map all accessible ranges into the
>>>> GPU page tables. This is functionally similar to what you describe as
>>>> prefetch IOCTLs / VM bind in Xe.
>>>>
>>>> Regarding queue quiesce during page table modification: on XNACK off
>>>> hardware, because the GPU cannot fault, we must ensure the GPU is
>>>> completely stopped before modifying any PTE it might be accessing.
>>>> Otherwise the GPU could access a partially updated page table and hang.
>>>> The quiesce/resume is the mechanism to guarantee this.
>>>>
>>>> I hope that helps clarify the picture.
>>>>
>>>
>>> This clarifies a lot. This is what we’d call in Xe “preemption fence”
>>> mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
>>> resume. We don’t actually support SVM in this case; instead, we use
>>> “userptr binds,” which are built on gpusvm for page collection. However,
>>> we don’t support migrating memory to the device—though we could.
>>>
>>> I’d look at how we converted 'userptr' to be based on GPU SVM [2]. In
>>> this case, don’t maintain a range tree, as those—as you suggest—are more
>>> of an on-demand fault driver concern. Instead, just embed 'struct
>>> drm_gpusvm_pages' in the VMA struct defined by the IOCTLs.
>>>
>>> We could extend this to support migrating 'userptr', but we just haven’t
>>> done that yet—this may be what you want to do in the “XNACK off” case.
>>>
>>> [2] https://patchwork.freedesktop.org/series/146553/
>>>
>>>>
>>>>>
>>>>>>>>
>>>>>>>> 4) Invalidation, GC worker, and restore worker
>>>>>>>>
>>>>>>>> MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
>>>>>>>> three cases based on event type and hardware mode:
>>>>>>>> - unmap event: clear GPU PTEs in the notifier context,
>>>>>>>> unmap DMA pages, mark ranges as unmapped, flush TLB,
>>>>>>>> and enqueue to the GC worker. On XNACK off, also
>>>>>>>> quiesce KFD queues and schedule rebuild of the
>>>>>>>> still valid portions that were destroyed together with
>>>>>>>> the unmapped subregion.
>>>>>>>>
>>>>>>>> - evict on XNACK off:
>>>>>>>> quiesce KFD queues first, then unmap DMA pages and
>>>>>>>> enqueue to the restore worker.
>>>>>>>
>>>>>>> Is that done through the DMA fence or by talking directly to the MES/HWS?
>>>>>>
>>>>>> Currently the KFD queue quiesce/resume APIs are reused; looking forward to a
>>>>>> better solution.
>>>>>>
>>>>>
>>>>> +1
>>>>>
>>>>>> Regards,
>>>>>> Honglei
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>> - evict on XNACK on:
>>>>>>>> clear GPU PTEs, unmap DMA pages, and flush TLB, but do
>>>>>>>> not schedule any worker. The GPU will fault on next
>>>>>>>> access and the fault handler establishes the mapping.
>>>>>>>>
>>>>>>>> Not supported feature:
>>>>>>>> - XNACK on GPU page fault mode
>>>>>>>> - migration and prefetch feature
>>>>>>>> - Multi GPU support
>>>>>>>>
>>>>>>>> XNACK on enablement is ongoing. The GPUs that support XNACK on
>>>>>>>> are currently only accessible to us via remote lab machines, which slows
>>>>>>>> down progress.
>>>>>>>>
>>>>>>>> Patch overview:
>>>>>>>>
>>>>>>>> 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
>>>>>>>> SET_ATTR/GET_ATTR operations, attribute types, and related
>>>>>>>> structs in amdgpu_drm.h.
>>>>>>>>
>>>>>>>> 02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
>>>>>>>> refcount, attr_tree, workqueues, locks, and
>>>>>>>> callbacks (begin/end_restore, flush_tlb).
>>>>>>>>
>>>>>>>> 03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
>>>>>>>> (interval tree node), attr_tree, access enum, flag masks,
>>>>>>>> and change trigger enum.
>>>>>>>>
>>>>>>>> 04/12 Attribute tree operations: interval tree lookup, insert,
>>>>>>>> remove, and tree create/destroy lifecycle.
>>>>>>>>
>>>>>>>> 05/12 Attribute set: validate UAPI attributes, apply to internal
>>>>>>>> attrs, handle hole/existing range with head/tail splitting,
>>>>>>>> compute change triggers, and -EAGAIN retry loop.
>>>>>>>> Implements attr_clear_pages for unmap cleanup and attr_get.
>>>>>>>>
>>>>>>>> 06/12 Range data structures: amdgpu_svm_range extending
>>>>>>>> drm_gpusvm_range with gpu_mapped state, pending ops,
>>>>>>>> pte_flags cache, and GC/restore queue linkage.
>>>>>>>>
>>>>>>>> 07/12 PTE flags and GPU mapping: simple gpu pte function,
>>>>>>>> GPU page table update with DMA address, range mapping loop:
>>>>>>>> find_or_insert -> get_pages -> validate -> update PTE,
>>>>>>>> and attribute change driven mapping function.
>>>>>>>>
>>>>>>>> 08/12 Notifier and invalidation: synchronous GPU PTE clear in
>>>>>>>> notifier context, range removal and overlap cleanup,
>>>>>>>> rebuild after destroy logic, and MMU event dispatcher
>>>>>>>>
>>>>>>>> 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
>>>>>>>> worker for unmap processing and rebuild, ordered restore
>>>>>>>> worker for mapping evicted ranges, and flush/sync
>>>>>>>> helpers.
>>>>>>>>
>>>>>>>> 10/12 Initialization and fini: kmem_cache for range/attr,
>>>>>>>> drm_gpusvm_init with chunk sizes, XNACK detection, TLB
>>>>>>>> flush helper, and amdgpu_svm init/close/fini lifecycle.
>>>>>>>>
>>>>>>>> 11/12 IOCTL and fault handler: PASID based SVM lookup with kref
>>>>>>>> protection, amdgpu_gem_svm_ioctl dispatcher, and
>>>>>>>> amdgpu_svm_handle_fault for GPU page fault recovery.
>>>>>>>>
>>>>>>>> 12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
>>>>>>>> Makefile rules, ioctl table registration, and amdgpu_vm
>>>>>>>> hooks (init in make_compute, close/fini, fault dispatch).
>>>>>>>>
>>>>>>>> Test result:
>>>>>>>> on gfx1100(W7900) and gfx943(MI300x)
>>>>>>>> kfd test: 95%+ passed, same failing cases as the official release
>>>>>>>> rocr test: all passed
>>>>>>>> hip catch test: 20 failures out of 5366 cases, +13 failures vs the official release
>>>>>>>>
>>>>>>>> During implementation we identified several challenges / design questions:
>>>>>>>>
>>>>>>>> 1. No range splitting on partial unmap
>>>>>>>>
>>>>>>>> drm_gpusvm explicitly does not support range splitting (see drm_gpusvm.c:122).
>>>>>>>> A partial munmap needs to destroy the entire range, including the valid interval.
>>>>>>>> GPU fault driven hardware can handle this design with extra GPU fault handling,
>>>>>>>> but AMDGPU needs to support XNACK off hardware, where this design requires the
>>>>>>>> driver to rebuild the valid part of the removed range. This brings very heavy
>>>>>>>> restore work in the work queue/GC worker: unmap/destroy -> rebuild (insert and
>>>>>>>> map), even heavier than in kfd_svm. Previously the driver work queue only needed
>>>>>>>> to restore or unmap, but with drm_gpusvm the driver needs to unmap and restore,
>>>>>>>> which brings more complex logic, a heavier work queue load, and
>>>>>>>> synchronization issues.
>>>>>
>>>>> Is this common in the workloads you are running? I'm also wondering if
>>>>> your restore logic / KFD's design is actually contributing to the
>>>>> problem.
>>>>>
>>>>
>>>> Honestly, you raise a fair point.
>>>>
>>>> We will redesign the logic around partial munmap, which should eliminate
>>>> most of this complexity.
>>>>
>>>>
>>>
>>> +1, yes, test it, but don't optimize for it.
>>>
>>>>>>>>
>>>>>>>> 2. Fault driven vs ioctl driven mapping
>>>>>>>>
>>>>>>>> drm_gpusvm is designed around GPU page fault handlers. The primary entry
>>>>>>>> point drm_gpusvm_range_find_or_insert() takes a fault_addr.
>>>>>>>> AMDGPU needs to support IOCTL driven mapping because on no-XNACK hardware
>>>>>>>> the GPU cannot fault at all.
>>>>>
>>>>> I think we refer to these as prefetch IOCTLs in Xe. Ideally, user space
>>>>> issues these so the device does not fault (e.g., prefetch creates a set
>>>>> of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
>>>>> specific VM bind operations.
>>>>>
>>>>
>>>> That is a very helpful way to think about it. Yes, our ioctl driven
>>>> mapping (XNACK off) is essentially equivalent to a prefetch operation. We are
>>>> trying to improve it.
>>>>
>>>
>>> See above wrt 'userptr'.
>>>
>>>>
>>>>>>>>
>>>>>>>> The ioctl path cannot hold mmap_read_lock across the entire operation
>>>>>>>> because drm_gpusvm_range_find_or_insert() acquires/releases it
>>>>>>>> internally. This creates race windows with MMU notifiers / workers.
>>>>>
>>>>> This is a very intentional choice in the locking design: mmap_read_lock
>>>>> is held only in very specific parts of GPU SVM, and the driver should
>>>>> never need to take this lock.
>>>>>
>>>>> Yes, notifiers can race, which is why the GPU fault handler and prefetch
>>>>> handler are structured as retry loops when a notifier race is detected.
>>>>> In practice, with well-behaved applications, these races should be
>>>>> rare—but they do occur, and the driver must handle them.
>>>>>
>>>>> __xe_svm_handle_pagefault implements the page fault retry loop. VM bind
>>>>> prefetch has similar logic, although it is more spread out given that it
>>>>> is part of a deeper software pipeline.
>>>>>
>>>>> FWIW, holding locks to avoid races was rejected by Sima because we
>>>>> reasoned it is essentially impossible to guarantee the absence of races
>>>>> by holding a lock. CPU page fault handlers are also effectively just
>>>>> large retry loops.
>>>>>
>>>>> So this is one point I believe you will need to fixup driver side.
>>>>>
>>>>
>>>> Understood. Thank you for the detailed explanation and for pointing to
>>>> __xe_svm_handle_pagefault as a reference. We will restructure both our
>>>> fault handler and ioctl path into a better retry loop pattern with sequence
>>>> number race detection.
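>>>>
>>>> For our own notes, the core of that retry loop looks roughly like the sketch
>>>> below; the svm_* helpers and the svm_ctx fields are placeholders, only the
>>>> mmu_interval_read_begin()/mmu_interval_read_retry() pairing is the real
>>>> kernel API:
>>>>
>>>>     static int svm_handle_fault(struct svm_ctx *ctx, unsigned long addr)
>>>>     {
>>>>         unsigned long seq;
>>>>         int err;
>>>>
>>>>     again:
>>>>         seq = mmu_interval_read_begin(&ctx->notifier);
>>>>
>>>>         err = svm_collect_pages(ctx, addr);   /* hmm_range_fault + DMA map */
>>>>         if (err)
>>>>             return err;
>>>>
>>>>         mutex_lock(&ctx->pt_lock);
>>>>         if (mmu_interval_read_retry(&ctx->notifier, seq)) {
>>>>             /* A notifier fired while we collected pages: drop and retry. */
>>>>             mutex_unlock(&ctx->pt_lock);
>>>>             goto again;
>>>>         }
>>>>         err = svm_update_gpu_page_tables(ctx, addr);
>>>>         mutex_unlock(&ctx->pt_lock);
>>>>
>>>>         return err;
>>>>     }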
>>>>
>>>
>>> Yes, the typical pattern is:
>>>
>>> - Try to migrate once
>>> - If you hit a race, give up, evict all memory back to system memory, and bind it
>>>
>>> Atomics make this tricky because memory must move, but I’m not sure
>>> “XNACK off” applies here. However, GPU SVM provides a timeslice
>>> mechanism to ensure the CPU can’t move memory while the GPU needs to
>>> execute something.
>>>
>>>>>>>>
>>>>>>>> 3. Multi GPU support
>>>>>>>>
>>>>>>>> drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
>>>>>>>> each GPU gets an independent instance with its own range tree, MMU
>>>>>>>> notifiers, notifier_lock, and DMA mappings.
>>>>>>>>
>>>>>
>>>>> This is a part I am absolutely open to fixing. Right now, each
>>>>> drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
>>>>> decoupling a GPU SVM instance from a single device, allowing each
>>>>> drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
>>>>> device).
>>>>>
>>>>> This would give drivers the flexibility to use one GPU SVM instance per
>>>>> VM/device instance (as in Xe), or to maintain a single GPU SVM per CPU
>>>>> MM.
>>>>>
>>>>
>>>> That would be wonderful! Looking forward to your patch very much!
>>>>
>>>
>>> I can't say I'll code this, but we have thought about it as an option and are
>>> very open to patches which refactor the object model for multiple use cases.
>>>
>>>>
>>>>>>>> This may bring huge overhead:
>>>>>>>> - N x MMU notifier registrations for the same address range
>>>>>
>>>>> The notifier overhead is a real concern. We recently introduced two-pass
>>>>> notifiers [1] to speed up multi-device notifiers. At least in Xe, the
>>>>> TLB invalidations—which are the truly expensive part—can be pipelined
>>>>> using the two-pass approach. Currently, [1] only implements two-pass
>>>>> notifiers for userptr, but Xe’s GPU SVM will be updated to use them
>>>>> shortly.
>>>>>
>>>>> [1] https://patchwork.freedesktop.org/series/153280/
>>>>>
>>>>
>>>> Thank you for the pointer to two-pass notifiers. Will study this
>>>> series.
>>>>
>>>>>>>> - N x hmm_range_fault() calls for the same page (KFD: 1x)
>>>>>
>>>>> hmm_range_fault is extremely fast compared to the actual migration.
>>>>> Running hmm_range_fault on a 2MB region using 4KB pages takes less
>>>>> than 1µs. With THP or large device pages [2] (merged last week), it’s
>>>>> around 1/20 of a microsecond. So I wouldn’t be too concerned about this.
>>>>>
>>>>> [2] https://patchwork.freedesktop.org/series/163141/
>>>>>
>>>>
>>>> That is very helpful data. Perhaps we worried too much.
>>>>
>>>>>>>> - N x DMA mapping memory
>>>>>
>>>>> You will always have N x DMA mapping memory if the pages are in system
>>>>> memory as the dma-mapping API is per device.
>>>>
>>>> Totally agreed.
>>>>
>>>>>
>>>>>>>> - N x invalidation + restore worker scheduling per CPU unmap event
>>>>>>>> - N x GPU page table flush / TLB invalidation
>>>>>
>>>>> I agree you do not want to serialize GPU page table flush / TLB
>>>>> invalidations. Hence two-pass notifiers [1].
>>>>
>>>> Yes, we will study it.
>>>>
>>>>>
>>>>>>>> - Increased mmap_lock hold time, N callbacks serialize under it
>>>>>>>>
>>>>>>>> compatibility issues:
>>>>>>>> - Quiesce/resume scope mismatch: to integrate with KFD compute
>>>>>>>> queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm(),
>>>>>>>> which have process level semantics. Under the per GPU
>>>>>>>> drm_gpusvm model there may be synchronization issues. To properly
>>>>>>>> integrate with KFD under the per SVM model, a compatibility layer or
>>>>>>>> new per VM level queue control APIs may need to be introduced.
>>>>>>>>
>>>>>
>>>>> I thought the idea was to get rid of KFD and move over to AMDGPU? I thought
>>>>> Christian mentioned this to me at XDC.
>>>>>
>>>>
>>>>>>>> Migration challenges:
>>>>>>>>
>>>>>>>> - No global migration decision logic: each per GPU SVM
>>>>>>>> instance maintains its own attribute tree independently. This
>>>>>>>> allows conflicting settings (e.g., GPU0's SVM sets
>>>>>>>> PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
>>>>>>>> for the same address range) with no detection or resolution.
>>>>>>>> A global attribute coordinator or a shared manager is needed to
>>>>>>>> provide a unified global view for migration decisions
>>>>>
>>>>> Yes, this is a hole in the Xe API too. We have told UMDs that if they set up
>>>>> individual VMs with conflicting attributes for a single CPU address space,
>>>>> the behavior is undefined. Our UMD's madvise implementation is basically a
>>>>> loop over all GPU VMs setting the same attributes.
>>>>
>>>> We will follow the same approach for now: the UMD is responsible for setting
>>>> consistent attributes across GPU VMs.
>>>>
>>>
>>> +1
>>>
>>>>>
>>>>>>>>
>>>>>>>> - migrate_vma_setup broadcast: one GPU's migration triggers MMU
>>>>>>>> notifier callbacks in ALL N-1 other drm_gpusvm instances,
>>>>>>>> causing N-1 unnecessary restore workers to be scheduled. And
>>>>>
>>>>> My feeling is that you shouldn’t reschedule restore workers unless you
>>>>> actually have to invalidate page tables (i.e., you have a local SVM
>>>>> range within the notifier). So the first migration to an untouched
>>>>> region may trigger notifiers, but they won’t do anything because you
>>>>> don’t have any valid SVM ranges yet. Subsequent mappings of the migrated
>>>>> region won’t trigger a notifier unless the memory is moved again.
>>>>>
>>>>
>>>> That is a very good point. We should check whether we actually have
>>>> valid SVM ranges before scheduling restore workers. If there is nothing
>>>> to invalidate, the notifier callback should be a no-op. We will review
>>>> our notifier callback logic to ensure we are not doing unnecessary work
>>>> here. Thank you for pointing this out.
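>>>>
>>>> As a note to ourselves, that check can be as small as the sketch below; the
>>>> svm_* names and the svm_notifier struct are placeholders, only the
>>>> mmu_interval_notifier invalidate callback signature and
>>>> mmu_interval_set_seq() are the real kernel interface:
>>>>
>>>>     static bool svm_notifier_invalidate(struct mmu_interval_notifier *mni,
>>>>                                         const struct mmu_notifier_range *range,
>>>>                                         unsigned long cur_seq)
>>>>     {
>>>>         struct svm_notifier *sn = container_of(mni, struct svm_notifier, mni);
>>>>
>>>>         mmu_interval_set_seq(mni, cur_seq);
>>>>
>>>>         /* Nothing of ours is mapped in this interval: no PTE clear, and
>>>>          * most importantly no restore worker gets scheduled.
>>>>          */
>>>>         if (!svm_any_valid_range(sn, range->start, range->end))
>>>>             return true;
>>>>
>>>>         svm_clear_gpu_ptes(sn, range->start, range->end);
>>>>         svm_queue_restore_worker(sn);
>>>>         return true;
>>>>     }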
>>>>
>>>>>>>> creates races between the initiating migration and the other
>>>>>>>> instance's restore attempts.
>>>>>
>>>>> Yes, if multiple devices try to migrate the same CPU pages at the same
>>>>> time, that will race. That’s why in Xe we have a module-level
>>>>> driver_migrate_lock. The first migration runs in read mode; if it
>>>>> detects a race and aborts, it then takes driver_migrate_lock in write
>>>>> mode so it becomes the only device allowed to move memory / CPU pages.
>>>>> See xe_svm_alloc_vram() for how this is used.
>>>>>
>>>>> I’m not sure this approach will work for you, but I just wanted to point
>>>>> out that we identified this as a potential issue.
>>>>>
>>>>
>>>> Thank you for sharing the driver_migrate_lock approach and pointing to
>>>> xe_svm_alloc_vram(). Will explore whether a similar lock pattern can work
>>>> for our case.
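>>>>
>>>> For reference, the pattern reduces to something like the sketch below;
>>>> svm_try_migrate() and svm_ctx are placeholders, the rwsem calls are the
>>>> stock kernel API:
>>>>
>>>>     /* Hypothetical module-level lock mirroring Xe's driver_migrate_lock. */
>>>>     static DECLARE_RWSEM(svm_migrate_lock);
>>>>
>>>>     static int svm_migrate_to_vram(struct svm_ctx *ctx,
>>>>                                    unsigned long start, unsigned long end)
>>>>     {
>>>>         int err;
>>>>
>>>>         /* Optimistic path: many devices may migrate concurrently. */
>>>>         down_read(&svm_migrate_lock);
>>>>         err = svm_try_migrate(ctx, start, end);
>>>>         up_read(&svm_migrate_lock);
>>>>         if (err != -EAGAIN)
>>>>             return err;
>>>>
>>>>         /* Raced with another device: retry as the exclusive migrator. */
>>>>         down_write(&svm_migrate_lock);
>>>>         err = svm_try_migrate(ctx, start, end);
>>>>         up_write(&svm_migrate_lock);
>>>>         return err;
>>>>     }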
>>>>
>>>>>>>>
>>>>>>>> - No cross instance migration serialization: each per GPU
>>>>>>>> drm_gpusvm instance has independent locking, so two GPUs'
>>>>>>>> "decide -> migrate -> remap" sequences can interleave. While
>>>>>>>> the kernel page lock prevents truly simultaneous migration of
>>>>>>>> the same physical page, the losing side's retry (evict from
>>>>>>>> other GPU's VRAM -> migrate back) triggers broadcast notifier
>>>>>>>> invalidations and restore workers, compounding the ping pong
>>>>>>>> problem above.
>>>>>>>>
>>>>>
>>>>> See the driver_migrate_lock above.
>>>>
>>>> Acknowledged, thank you.
>>>>>
>>>>>>>> - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
>>>>>>>> hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
>>>>>>>> it only selects system memory pages for migration.
>>>>>>>>
>>>>>
>>>>> I think this is fixed? We did find some core MM bugs that blocked VRAM
>>>>> to VRAM but those have been worked out.
>>>>>
>>>>> The code I'm looking at:
>>>>>
>>>>> 517 int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
>>>>> 518 struct mm_struct *mm,
>>>>> 519 unsigned long start, unsigned long end,
>>>>> 520 const struct drm_pagemap_migrate_details *mdetails)
>>>>> 521 {
>>>>> 522 const struct drm_pagemap_devmem_ops *ops = devmem_allocation->ops;
>>>>> 523 struct drm_pagemap *dpagemap = devmem_allocation->dpagemap;
>>>>> 524 struct dev_pagemap *pagemap = dpagemap->pagemap;
>>>>> 525 struct migrate_vma migrate = {
>>>>> 526 .start = start,
>>>>> 527 .end = end,
>>>>> 528 .pgmap_owner = pagemap->owner,
>>>>> 529 .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
>>>>> 530 MIGRATE_VMA_SELECT_DEVICE_PRIVATE | MIGRATE_VMA_SELECT_COMPOUND,
>>>>> 531 };
>>>>>
>>>>
>>>> Thank you for checking! I am using v6.18 for this POC and missed the fix; we
>>>> will rebase to the latest.
>>>>
>>>>
>>>>>>>> - CPU fault reverse migration race: CPU page fault triggers
>>>>>>>> migrate_to_ram while GPU instances are concurrently operating.
>>>>>>>> Per GPU notifier_lock does not protect cross GPU operations.
>>>>>
>>>>> No, again retry loop as discussed above.
>>>>
>>>> Understood.
>>>>
>>>>>
>>>>>>>>
>>>>>>>> We believe a strong, well designed solution at the framework level is
>>>>>>>> needed to properly address these problems, and we look forward to
>>>>>>>> discussion and suggestions.
>>>>>
>>>>> Let's work together to figure out what is missing here.
>>>>
>>>> Thank you so much, Matt. Your feedback has been incredibly valuable and
>>>> has given us a much clearer picture of the framework's design.
>>>> I really appreciate the effort you put into building drm_gpusvm as a
>>>> shared framework. We will incorporate your suggestions into our next
>>>> revision and look forward to continuing the collaboration.
>>>>
>>>
>>> No problem. Happy to help.
>>>
>>> Matt
>>>
>>>> Regards,
>>>> Honglei
>>>>
>>>>
>>>>>
>>>>> Matt
>>>>>
>>>>>>>>
>>>>>>>> Honglei Huang (12):
>>>>>>>> drm/amdgpu: add SVM UAPI definitions
>>>>>>>> drm/amdgpu: add SVM data structures and header
>>>>>>>> drm/amdgpu: add SVM attribute data structures
>>>>>>>> drm/amdgpu: implement SVM attribute tree operations
>>>>>>>> drm/amdgpu: implement SVM attribute set
>>>>>>>> drm/amdgpu: add SVM range data structures
>>>>>>>> drm/amdgpu: implement SVM range PTE flags and GPU mapping
>>>>>>>> drm/amdgpu: implement SVM range notifier and invalidation
>>>>>>>> drm/amdgpu: implement SVM range workers
>>>>>>>> drm/amdgpu: implement SVM core initialization and fini
>>>>>>>> drm/amdgpu: implement SVM ioctl and fault handler
>>>>>>>> drm/amdgpu: wire up SVM build system and fault handler
>>>>>>>>
>>>>>>>> drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
>>>>>>>> drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 ++++++++++++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
>>>>>>>> include/uapi/drm/amdgpu_drm.h | 39 +
>>>>>>>> 12 files changed, 2958 insertions(+), 4 deletions(-)
>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
>>>>>>>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
>>>>>>>>
>>>>>>>>
>>>>>>>> base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
>>>>>>>
>>>>>>
>>>>
>>
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-04-23 8:22 ` Huang, Honglei1
@ 2026-04-29 9:56 ` Huang, Honglei1
2026-04-30 2:56 ` Huang, Honglei1
0 siblings, 1 reply; 36+ messages in thread
From: Huang, Honglei1 @ 2026-04-29 9:56 UTC (permalink / raw)
To: Matthew Brost, Christian König, Felix.Kuehling, Philip.Yang
Cc: amd-gfx, dri-devel, Alexander.Deucher, Honglei Huang, Oak.Zeng,
Jenny-Jing.Liu, Xiaogang.Chen, Ray.Huang, Lingshan.Zhu,
Junhua.Shen, Thomas Hellström, Rodrigo Vivi,
Danilo Krummrich, Alice Ryhl
On 4/23/2026 4:22 PM, Huang, Honglei1 wrote:
...
>>>>>>>>>
>>>>>>>>> This patch series implements basic SVM support with the
>>>>>>>>> following features:
>>>>>>>>>
>>>>>>>>> 1. attributes separated from physical page management:
>>>>>>>>>
>>>>>>>>> - Attribute layer (amdgpu_svm_attr_tree): a driver side
>>>>>>>>> interval
>>>>>>>>> tree that stores SVM attributes. Managed through the
>>>>>>>>> SET_ATTR ioctl and the mmu notifier callback.
>>>>>>
>>>>>> Can you explain the mmu notifier callback interaction here? See
>>>>>> below: in Xe the attribute tree is the existing VMA tree (gpuvm).
>>>>>>
>>>>>
>>>>> Let me try to explain, apologies if the description is not fully
>>>>> precise.
>>>>>
>>>>> In the current implementation, the MMU notifier callback interacts with
>>>>> the attr
>>>>> tree only in the munmap path, removing the corresponding attribute
>>>>> entries from the attr tree so that stale attributes do not persist for
>>>>> freed address space.
>>>>>
>>>>
>>>> Ah, yes. We reset our attributes upon munmap too. We actually don't do this
>>>> 100% correctly either, and there is a series in flight to fix it [1].
>>>>
>>>> [1] https://patchwork.freedesktop.org/series/161815/
>>>
>>> Hi Matt,
>>>
>>> It seems like you are trying to change the implementation to remove the
>>> attributes on munmap.
>>>
>>> Actually we had an internal discussion about whether the driver needs to
>>> remove the attributes on munmap.
>>>
>>> There were several ideas:
>>>
>>> 1. attributes need to be kept: attributes may be needed again when a new VMA
>>> appears or on subsequent faults.
>>> 2. attributes need to be kept: attributes can be set independent of whether
>>> memory is currently mapped; attributes persist and are modified explicitly via
>>> ioctl, not implicitly by notifier callbacks.
>>> 3. attributes need to be removed: because the VMA is gone, the driver can do
>>> nothing without the VMA.
>>>
>>> And I saw xe_svm set default attributes in a previous version; this is also
>>> an option.
>>>
>>> Can you please give some information on why xe_svm is moving to removing
>>> the attributes on munmap? And is keeping the attributes a valid way?
>>>
>>
>> This is a semantic choice, and we’re trying to match the semantics of
>> CPU madvise. I believe any semantic an individual driver stack wants to
>> define is valid, but if vendors mismatch semantics this will create a
>> level of vendor lock-in which may (cough Nvidia, CUDA) or may not (open
>> source) be desired.
>>
>> AFAIK, if you do something like this in C (a CPU-only program):
>>
>> mmap(addr_range);
>> madvise(addr_range, some_flags);
>> munmap(addr_range);
>>
>> mmap(addr_range); /* Here the madvise attributes are reset */
>>
>> Also, AFAIK, the CUDA GPU madvise API works this way as well.
>>
>> That said, making this work 100% reliably is quite difficult, especially
>> with a rude user.
>>
>> For example:
>>
>> mmap(addr_range);
>> gpu_madvise(addr_range, some_flags);
>> /* GPU never actually touches memory */
>> munmap(addr_range);
>>
>> So we have an opt-in VM bind flag,
>> DRM_XE_VM_BIND_FLAG_MADVISE_AUTORESET, which we’re working on to mostly
>> handle the “rude” case above. Maybe we can reach 100% correctness, but
>> again, this is a difficult problem and is still WIP.
>>
>
> I think the current amdgpu SVM draft version also has this issue for the
> rude user situation. Maybe this is caused by the separation of the
> attribute layer and the physical layer. It seems KFD_SVM doesn't have this
> issue.
>
> Maybe the driver can find_or_insert in the madvise ioctl path, to add an MMU
> notifier but not get_pages, and then clean the attribute in the mmu
> notifier callback instead of the GC worker. This is just my thought.
>
> And thanks again for the information; waiting for others' comments.
>
Hi Matt,
I'd like to share a concrete bug we hit in the amdgpu SVM
implementation/tests, related to stale attributes.
In short, a stale attr_range overlaps with a VM_PFNMAP VMA.
1: User allocates memory and sets GPU attributes but never uses/faults it.
CPU VMA (anonymous):
|<── 0x1000 ── 0x5000 ──>|
attr_range:
|<── 0x1000 ── 0x5000 ──>|
2: User munmaps the region, attr_range is NOT cleaned up
CPU VMA: (gone)
attr_range: stale
|<── 0x1000 ── 0x5000 ─>|
No gpusvm_range existed, No MMU notifier, No GC, No cleanup
3: User mmaps new memory for a device pfn remap partially overlapping the
old range, and new memory on which GPU attributes are set
CPU VMAs:
|<── 0x1000 ── 0x2000 ──>|<── 0x2000 ── 0x4000 ──>|
| VM_PFNMAP | anonymous (new alloc) |
attr_range: STILL STALE
|<────── 0x1000 ─── 0x5000 ─────────────>|
4: GPU faults at address 0x3000
Fault handler finds the stale attr_range [0x1000, 0x5000):
gpuva_start = 0x1000 (from stale attr_range)
gpuva_end = 0x5000 (from stale attr_range)
drm_gpusvm_range_chunk_size() finds the chunk 0x0000-0x5000, covering
the VM_PFNMAP area
|<── 0x1000 ── 0x2000 ──>|<── 0x2000 ──── 0x4000 ───>|
| VM_PFNMAP | |
hmm_range_fault fails
or vma check:
VM_PFNMAP -> -EOPNOTSUPP
I think it is caused by the stale attribute.
Or is this considered invalid userspace behavior?
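For clarity, a minimal userspace sequence that reproduces this would look
roughly like the sketch below; gpu_set_attr()/gpu_touch() stand in for the
real SET_ATTR ioctl and a GPU access, and pfnmap_fd for whatever device
provides the VM_PFNMAP mapping:
    void *p = mmap(NULL, 4 * 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    gpu_set_attr(p, 4 * 4096);      /* attr_range created, GPU never faults */
    munmap(p, 4 * 4096);            /* attr_range stays stale */
    /* Reuse part of the same VA range for a device PFN mapping... */
    mmap(p, 4096, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, pfnmap_fd, 0);
    /* ...and fresh anonymous memory right behind it. */
    void *q = mmap((char *)p + 4096, 2 * 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    gpu_touch(q);                   /* GPU fault lands inside the stale attr_range */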
Regards,
Honglei
> Regards,
> Honglei
>
>
>> Matt
>>
>>>
>>> Regards,
>>> Honglei...
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-04-29 9:56 ` Huang, Honglei1
@ 2026-04-30 2:56 ` Huang, Honglei1
2026-04-30 3:12 ` Matthew Brost
0 siblings, 1 reply; 36+ messages in thread
From: Huang, Honglei1 @ 2026-04-30 2:56 UTC (permalink / raw)
To: Deucher, Alexander, Kuehling, Felix, Koenig, Christian, Zeng, Oak,
Liu, Jenny (Jing), Yang, Philip, Chen, Xiaogang, Huang, Ray,
Zhu, Lingshan, Shen, Junhua, yiru.ma, sima@ffwll.ch,
matthew.brost@intel.com, rodrigo.vivi@intel.com,
thomas.hellstrom@linux.intel.com, dakr@kernel.org,
aliceryhl@google.com
Cc: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
On 4/29/2026 5:56 PM, Huang, Honglei1 wrote:
>
>
> On 4/23/2026 4:22 PM, Huang, Honglei1 wrote:
> ...
>>>>>>>>>>
>>>>>>>>>> This patch series implements basic SVM support with the
>>>>>>>>>> following features:
>>>>>>>>>>
>>>>>>>>>> 1. attributes separated from physical page management:
>>>>>>>>>>
>>>>>>>>>> - Attribute layer (amdgpu_svm_attr_tree): a driver
>>>>>>>>>> side interval
>>>>>>>>>> tree that stores SVM attributes. Managed through the
>>>>>>>>>> SET_ATTR,
>>>>>>>>>> and mmu notifier callback.
>>>>>>>
>>>>>>> Can you explain the mmu notifier callback interaction here? See
>>>>>>> below in
>>>>>>> Xe the attribute tree is existing VMA tree (gpuvm).
>>>>>>>
>>>>>>
>>>>>> Let me try to explain, apologies if the description is not fully
>>>>>> precise.
>>>>>>
>>>>>> In the current implementation, the MMU notifier callback interacts
>>>>>> with the attr
>>>>>> tree only in the munmap path, removing the corresponding attribute
>>>>>> entries from the attr tree so that stale attributes do not persist
>>>>>> for
>>>>>> freed address space.
>>>>>>
>>>>>
>>>>> Ah, yes. We reset our attributes upon munmap too. We actually don't do
>>>>> this
>>>>> 100% correctly either, and there is a series in flight to fix it [1].
>>>>>
>>>>> [1] https://patchwork.freedesktop.org/series/161815/
>>>>
>>>> Hi Matt,
>>>>
>>>> It seems like you are trying to change the implementation to remove the
>>>> attributes on munmap.
>>>>
>>>> Actually we had an internal discussion about whether the driver needs to
>>>> remove the attributes on munmap.
>>>>
>>>> There were several ideas:
>>>>
>>>> 1. attributes need to be kept: attributes may be needed again when a new VMA
>>>> appears or on subsequent faults.
>>>> 2. attributes need to be kept: attributes can be set independent of whether
>>>> memory is currently mapped; attributes persist and are modified explicitly via
>>>> ioctl, not implicitly by notifier callbacks.
>>>> 3. attributes need to be removed: because the VMA is gone, the driver can do
>>>> nothing without the VMA.
>>>>
>>>> And I saw xe_svm set default attributes in a previous version; this is also
>>>> an option.
>>>>
>>>> Can you please give some information on why xe_svm is moving to removing
>>>> the attributes on munmap? And is keeping the attributes a valid way?
>>>>
>>>
>>> This is a semantic choice, and we’re trying to match the semantics of
>>> CPU madvise. I believe any semantic an individual driver stack wants to
>>> define is valid, but if vendors mismatch semantics this will create a
>>> level of vendor lock-in which may (cough Nvidia, CUDA) or may not (open
>>> source) be desired.
>>>
>>> AFAIK, if you do something like this in C (a CPU-only program):
>>>
>>> mmap(addr_range);
>>> madvise(addr_range, some_flags);
>>> munmap(addr_range);
>>>
>>> mmap(addr_range); /* Here the madvise attributes are reset */
>>>
>>> Also, AFAIK, the CUDA GPU madvise API works this way as well.
>>>
>>> That said, making this work 100% reliably is quite difficult, especially
>>> with a rude user.
>>>
>>> For example:
>>>
>>> mmap(addr_range);
>>> gpu_madvise(addr_range, some_flags);
>>> /* GPU never actually touches memory */
>>> munmap(addr_range);
>>>
>>> So we have an opt-in VM bind flag,
>>> DRM_XE_VM_BIND_FLAG_MADVISE_AUTORESET, which we’re working on to mostly
>>> handle the “rude” case above. Maybe we can reach 100% correctness, but
>>> again, this is a difficult problem and is still WIP.
>>>
>>
>> I think the current amdgpu SVM draft version also has this issue for
>> the rude user situation. Maybe this is caused by the separation of the
>> attribute layer and the physical layer. It seems KFD_SVM doesn't have
>> this issue.
>>
>> Maybe the driver can find_or_insert in the madvise ioctl path, to add an MMU
>> notifier but not get_pages, and then clean the attribute in the mmu
>> notifier callback instead of the GC worker. This is just my thought.
>>
>> And thanks again for the information; waiting for others' comments.
>>
>
> Hi Matt,
>
> I'd like to share a concrete bug we hit in the amdgpu SVM
> implementation/tests about stale attributes.
>
> In short, a stale attr_range overlaps with a VM_PFNMAP VMA.
>
> 1: User allocates memory and sets GPU attributes but never uses/faults it.
>
> CPU VMA (anonymous):
> |<── 0x1000 ── 0x5000 ──>|
>
> attr_range:
> |<── 0x1000 ── 0x5000 ──>|
>
> 2: User munmaps the region, attr_range is NOT cleaned up
>
> CPU VMA: (gone)
>
> attr_range: stale
> |<── 0x1000 ── 0x5000 ─>|
>
> No gpusvm_range existed, No MMU notifier, No GC, No cleanup
>
>
> 3: User mmaps new memory for a device pfn remap partially overlapping the
> old range, and new memory on which GPU attributes are set
>
>
> CPU VMAs:
> |<── 0x1000 ── 0x2000 ──>|<── 0x2000 ── 0x4000 ──>|
> | VM_PFNMAP | anonymous (new alloc) |
>
> attr_range: STILL STALE
> |<────── 0x1000 ─── 0x5000 ─────────────>|
>
>
> 4: GPU faults at address 0x3000
>
> Fault handler finds the stale attr_range [0x1000, 0x5000):
> gpuva_start = 0x1000 (from stale attr_range)
> gpuva_end = 0x5000 (from stale attr_range)
>
> drm_gpusvm_range_chunk_size() finds the chunk 0x0000-0x5000, covering
> the VM_PFNMAP area
>
> |<── 0x1000 ── 0x2000 ──>|<── 0x2000 ──── 0x4000 ───>|
> | VM_PFNMAP | |
> hmm_range_fault fails
> or vma check:
> VM_PFNMAP -> -EOPNOTSUPP
>
> I think it is caused by the stale attribute.
> Or is this considered invalid userspace behavior?
Hi all,
After internal discussion, I'd like to summarize some conclusions from the team.
It is decided that the default behavior will be to keep attributes on munmap,
for behavioral consistency, the advantages of explicit interfaces, safety
and extensibility.
We will also provide an opt-in flag similar to Xe's
DRM_XE_VM_BIND_FLAG_MADVISE_AUTORESET.
There is a concern about behavioral consistency from Felix. In
userspace, whether free() actually triggers a munmap is outside the
user's control; the C library allocator may retain pages internally
rather than returning them to the OS. As he said: "when you call
malloc and then free, that doesn't necessarily result in unmapping the
pages; it may result in unmapping pages if you are freeing something big
that was allocated with mmap under the hood, but it may also just stick
around. From a user's point of view who just uses malloc/free and sets
some attributes, they have no way of knowing whether their attributes
will stick or not. I think if the attributes always stick, that would
give you a more consistent behavior."
And from Christian, the point that explicit interfaces are better than
implicit ones: auto-removing attributes on munmap is an implicit kernel
reaction that users may not even be aware of. As he said: "Explicit
interfaces which say 'hey kernel, do something' are usually better than
implicit interfaces where the kernel is doing something on its own and
we are just reacting to it." It was also noted that we can always add
a flag for auto-removal later if needed, but the reverse would be a UAPI
breaking change.
Based on the above, the next steps are:
1. Change the logic to keep attributes by default in the current
implementation.
2. Maybe add a new ioctl op to explicitly delete attributes? Currently
the UAPI can only overwrite attributes to a no-map/default value but
cannot truly remove them.
3. Add a new UAPI flag similar to DRM_XE_VM_BIND_FLAG_MADVISE_AUTORESET.
4. For the auto reset path, address the "never-faulted" issue. Maybe we
can register a lightweight MMU notifier at madvise ioctl time to observe
munmap and clean up the attr_range, even if the GPU never touches the
range; a rough sketch follows below.
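A minimal sketch of what item 4 could look like, assuming a hypothetical
amdgpu_svm_attr_range that carries its own interval notifier; only the
mmu_interval_notifier_insert()/ops interface is the stock kernel API, the
other names are illustrative:
    static bool attr_range_invalidate(struct mmu_interval_notifier *mni,
                                      const struct mmu_notifier_range *range,
                                      unsigned long cur_seq)
    {
        struct amdgpu_svm_attr_range *ar =
            container_of(mni, struct amdgpu_svm_attr_range, notifier);
        mmu_interval_set_seq(mni, cur_seq);
        /* Only munmap should drop the attributes; other events keep them. */
        if (range->event == MMU_NOTIFY_UNMAP)
            amdgpu_svm_attr_range_mark_stale(ar, range->start, range->end);
        return true;
    }
    static const struct mmu_interval_notifier_ops attr_range_notifier_ops = {
        .invalidate = attr_range_invalidate,
    };
    /* Called from the madvise/SET_ATTR ioctl path, before any GPU fault. */
    static int attr_range_register_notifier(struct amdgpu_svm_attr_range *ar,
                                            struct mm_struct *mm)
    {
        return mmu_interval_notifier_insert(&ar->notifier, mm, ar->start,
                                            ar->end - ar->start,
                                            &attr_range_notifier_ops);
    }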
Please correct me if I'm wrong on any of the above assumptions. I would be
interested to hear thoughts on this direction.
Regards,
Honglei
> Regards,
> Honglei
>
>> Regards,
>> Honglei
>>
>>
>>> Matt
>>>
>>>>
>>>> Regards,
>>>> Honglei...
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
2026-04-30 2:56 ` Huang, Honglei1
@ 2026-04-30 3:12 ` Matthew Brost
0 siblings, 0 replies; 36+ messages in thread
From: Matthew Brost @ 2026-04-30 3:12 UTC (permalink / raw)
To: Huang, Honglei1
Cc: Deucher, Alexander, Kuehling, Felix, Koenig, Christian, Zeng, Oak,
Liu, Jenny (Jing), Yang, Philip, Chen, Xiaogang, Huang, Ray,
Zhu, Lingshan, Shen, Junhua, yiru.ma, sima@ffwll.ch,
rodrigo.vivi@intel.com, thomas.hellstrom@linux.intel.com,
dakr@kernel.org, aliceryhl@google.com,
amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
On Thu, Apr 30, 2026 at 10:56:47AM +0800, Huang, Honglei1 wrote:
>
>
> On 4/29/2026 5:56 PM, Huang, Honglei1 wrote:
> >
> >
> > On 4/23/2026 4:22 PM, Huang, Honglei1 wrote:
> > ...
> > > > > > > > > > >
> > > > > > > > > > > This patch series implements basic
> > > > > > > > > > > SVM support with the following
> > > > > > > > > > > features:
> > > > > > > > > > >
> > > > > > > > > > > 1. attributes separated from physical page management:
> > > > > > > > > > >
> > > > > > > > > > > - Attribute layer
> > > > > > > > > > > (amdgpu_svm_attr_tree): a driver
> > > > > > > > > > > side interval
> > > > > > > > > > > tree that stores SVM
> > > > > > > > > > > attributes. Managed through the
> > > > > > > > > > > SET_ATTR,
> > > > > > > > > > > and mmu notifier callback.
> > > > > > > >
> > > > > > > > Can you explain the mmu notifier callback
> > > > > > > > interaction here? See below in
> > > > > > > > Xe the attribute tree is existing VMA tree (gpuvm).
> > > > > > > >
> > > > > > >
> > > > > > > Let me try to explain, apologies if the description is not fully
> > > > > > > precise.
> > > > > > >
> > > > > > > In current implementation, the MMU notifier callback
> > > > > > > interacts with the attr
> > > > > > > tree only in the munmap path remove the corresponding attribute
> > > > > > > entries from the attr tree so that stale attributes
> > > > > > > do not persist for
> > > > > > > freed address space.
> > > > > > >
> > > > > >
> > > > > > Ah, yes. We reset our attributes upon munmap too. We
> > > > > > actually don't this
> > > > > > 100% correct quite either and series in flight to fix [1].
> > > > > >
> > > > > > [1] https://patchwork.freedesktop.org/series/161815/
> > > > >
> > > > > Hi matt,
> > > > >
> > > > > It seems like you are trying to modify the implementation
> > > > > into remove the
> > > > > attributes when munmap.
> > > > >
> > > > > Actually we have a discussion internally that does the driver
> > > > > need to remove
> > > > > the attributes when munmap.
> > > > >
> > > > > So there were several ideas:
> > > > >
> > > > > 1. attribute need keep: attributes may be needed again when a new VMA
> > > > > appears or on subsequent faults.
> > > > > 2.attribute need keep: attributes can be set independent of
> > > > > whether memory
> > > > > is currently mapped; attributes persist and are modified explicitly via
> > > > > ioctl, not implicitly by notifier callbacks.
> > > > > 3. attribute need remove: because VMA is gone, driver can do
> > > > > nothing without
> > > > > VMA.
> > > > >
> > > > > and I saw xe_svm set default attribute in the previous
> > > > > version, this is also
> > > > > a option.
> > > > >
> > > > > Can you please help to give some information that why xe_svm
> > > > > is turning to
> > > > > remove the attribute when munmap? And does keeping attribute
> > > > > is a valid way?
> > > > >
> > > >
> > > > This is a semantic choice, and we’re trying to match the semantics of
> > > > CPU madvise. I believe any semantic an individual driver stack wants to
> > > > define is valid, but if vendors mismatch semantics this will create a
> > > > level of vendor lock in which may (cough Nvidia, CUDA) or may not (open
> > > > source) be desired.
> > > >
> > > > AFAIK, if you do something like this in C (a CPU-only program):
> > > >
> > > > mmap(addr_range);
> > > > madvise(addr_range, some_flags);
> > > > munmap(addr_range);
> > > >
> > > > mmap(addr_range); /* Here the madvise attributes are reset */
> > > >
> > > > Also, AFAIK, the CUDA GPU madvise API works this way as well.
> > > >
> > > > That said, making this work 100% reliably is quite difficult, especially
> > > > with a rude user.
> > > >
> > > > For example:
> > > >
> > > > mmap(addr_range);
> > > > gpu_madvise(addr_range, some_flags);
> > > > /* GPU never actually touches memory */
> > > > munmap(addr_range);
> > > >
> > > > So we have an opt-in VM bind flag,
> > > > DRM_XE_VM_BIND_FLAG_MADVISE_AUTORESET, which we’re working on to mostly
> > > > handle the “rude” case above. Maybe we can reach 100% correctness, but
> > > > again, this is a difficult problem and is still WIP.
> > > >
> > >
> > > I think the current amdgpu SVM draft version also has the issue for
> > > the rude user situation. Maybe this is caused by the separation of
> > > attribute layer and physical layer. Seems like KFD_SVM doesn't have
> > > this issue.
> > >
> > > Maybe driver can find_or_insert in madvise ioctl path, to add a MMU
> > > notifier but do not get_pages, and then clean the attribute in mmu
> > > notifier callback instead of GC. This is just my thought.
> > >
> > > And really thanks for the information, and waiting for other's comments.
> > >
> >
> > Hi Matt,
> >
> > I'd like to share a concrete bug we hit in the amdgpu SVM
> > implementation/tests about stale attributes.
> >
> > In short it is stale attr_range overlaps with VM_PFNMAP VMA
> >
> > 1: User allocates memory and sets GPU attributes but never use/fault it.
> >
> > CPU VMA (anonymous):
> > |<── 0x1000 ── 0x5000 ──>|
> >
> > attr_range:
> > |<── 0x1000 ── 0x5000 ──>|
> >
> > 2: User munmaps the region, attr_range is NOT cleaned up
> >
> > CPU VMA: (gone)
> >
> > attr_range: stale
> > |<── 0x1000 ── 0x5000 ─>|
> >
> > No gpusvm_range existed, No MMU notifier, No GC, No cleanup
> >
> >
> > 3: User mmaps new memory for device pfn remap partially overlapping the
> > old range,
> > and new memory for GPU set attribute
> >
> >
> > CPU VMAs:
> > |<── 0x1000 ── 0x2000 ──>|<── 0x2000 ── 0x4000 ──>|
> > | VM_PFNMAP | anonymous (new alloc) |
> >
> > attr_range: STILL STALE
> > |<────── 0x1000 ─── 0x5000 ─────────────>|
> >
> >
> > 4: GPU faults at address 0x3000
> >
> > Fault handler finds the stale attr_range [0x1000, 0x5000):
> > gpuva_start = 0x1000 (from stale attr_range)
> > gpuva_end = 0x5000 (from stale attr_range)
> >
> > drm_gpusvm_range_chunk_size() find the chunk: 0x0000-0x5000 cover
> > the VM_PFNMAP area
> >
> > |<── 0x1000 ── 0x2000 ──>|<── 0x2000 ──── 0x4000 ───>|
> > | VM_PFNMAP | |
> > hmm_range_fault fails
> > or vma check:
> > VM_PFNMAP -> -EOPNOTSUPP
> >
> > I think it is caused by the stale attribute.
> > Or is this considered as an invalid userspace behavior?
The above example is exactly what we are concerned about, and handling it in Xe
is WIP (i.e., this example breaks Xe).
>
> Hi all,
>
> After internal discussion, I'd like to summary some conclusions from team.
>
> It is decided that default behavior will be to keep attributes on munmap for
> behavioral consistency, advantages of explicit interfaces, safety and
> extensibility.
> And provide an opt-in flag similar to Xe's
> DRM_XE_VM_BIND_FLAG_MADVISE_AUTORESET.
>
> There is a concern about behavioral consistency from Felix. In userspace,
> whether free() actually triggers a munmap is outside the user's control, the
Yes, free() is the real killer here as libc may or may not call munmap.
Thus we have an opt-in flag...
> C library allocator may retain pages internally rather than returning them
> to the OS. As It is said: "when you call malloc and then free, that doesn't
> necessarily result in unmapping the pages it may result in unmapping pages
> if you are freeing something big that was allocated with mmap under the
> hood, but it may also just stick around. From a user's point of view who
> just uses malloc/free and sets some attributes, they have no way of knowing
> whether their attributes will stick or not. I think if the attributes always
> stick, that would give you a more consistent behavior."
>
> And it is from Christian that explicit interfaces are better than implicit
> ones auto-removing attributes on munmap is an implicit kernel reaction that
> users may not even be aware of. As it is said: "Explicit interfaces which
> say 'hey kernel, do something' are usually better than implicit interfaces
> where the kernel is doing something on its own and we are just reacting to
> it." And it is also noted that we can always add a flag for auto-removal
> later if needed, but the reverse would be a UAPI breaking change.
>
This might be better - I believe Xe already has explicit interfaces. So
DRM_XE_VM_BIND_FLAG_MADVISE_AUTORESET might be a nonsense idea, but it
is in our uAPI so what is done is done... The free() case is where
DRM_XE_VM_BIND_FLAG_MADVISE_AUTORESET may fall apart / become
unpredictable... But it will work (after some Xe changes) if,
say, an app is allocating larger sizes (128k plus, iirc, for default
libc behavior) where libc malloc/free directly map to mmap / munmap,
which is why I believe we provided this option.
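For illustration only, a tiny userspace snippet of why the allocation size
matters; 128 KiB is glibc's default M_MMAP_THRESHOLD (and glibc may adjust it
dynamically), so whether free() ever reaches munmap is allocator policy, not
something the madvise semantics can rely on:
    #include <malloc.h>
    #include <stdlib.h>
    int main(void)
    {
        /* Below the mmap threshold: free() typically keeps the pages on the
         * heap, so no munmap and no notifier fires. */
        void *small = malloc(4096);
        free(small);
        /* Above the threshold: glibc services this with mmap(), and free()
         * maps straight to munmap(), which does fire the notifier. */
        void *big = malloc(1 << 20);
        free(big);
        /* The threshold itself is just allocator policy and can be tuned. */
        mallopt(M_MMAP_THRESHOLD, 64 * 1024);
        return 0;
    }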
Matt
> Based on the above, the next steps are:
>
> 1. Change the logic to keep the attribute as default in current
> implementation.
> 2. Maybe need adding a new ioctl op to explicitly delete attribute?
> Currently the UAPI can only overwrite attributes to not to map/default but
> not truly remove them.
> 3. Add a new UAPI flag similar to DRM_XE_VM_BIND_FLAG_MADVISE_AUTORESET.
> 4. For the auto reset path, need to address the "never-faulted" issue.
> Maybe we can register a lightweight MMU notifier at madvise ioctl time
> to observe munmap and clean up attr_range, even if the GPU never touches the
> range.
>
> Please correct me if I'm wrong on any of the above assumptions. Would be
> interested to hear thoughts on this direction.
>
> Regards,
> Honglei
>
> > Regards,
> > Honglei
> >
> > > Regards,
> > > Honglei
> > >
> > >
> > > > Matt
> > > >
> > > > >
> > > > > Regards,
> > > > > Honglei...
>
^ permalink raw reply [flat|nested] 36+ messages in thread