* [PATCH v2 0/4] Improve SVM migrate event report
@ 2024-07-30 20:15 Philip Yang
2024-07-30 20:15 ` [PATCH v2 1/4] drm/amdkfd: Document and define SVM events message macro Philip Yang
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Philip Yang @ 2024-07-30 20:15 UTC
To: amd-gfx; +Cc: Felix.Kuehling, Philip Yang
1. Document how to use the SMI (system management interface) to receive SVM
events, and define string format macros for user mode.
2. Output the migrate end event with an error code if migration failed.
3. Increase the event kfifo size to reduce the chance of dropping events.
4. Report the dropped event count if the fifo is full.
Philip Yang (4):
drm/amdkfd: Document and define SVM events message macro
drm/amdkfd: Output migrate end event if migrate failed
drm/amdkfd: Increase SMI event fifo size
drm/amdkfd: SMI report dropped event count
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 14 ++-
drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 95 +++++++++++------
drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h | 3 +-
include/uapi/linux/kfd_ioctl.h | 107 +++++++++++++++++---
4 files changed, 164 insertions(+), 55 deletions(-)
--
2.43.2
* [PATCH v2 1/4] drm/amdkfd: Document and define SVM events message macro
2024-07-30 20:15 [PATCH v2 0/4] Improve SVM migrate event report Philip Yang
@ 2024-07-30 20:15 ` Philip Yang
2024-08-22 14:32 ` James Zhu
2024-07-30 20:15 ` [PATCH v2 2/4] drm/amdkfd: Output migrate end event if migrate failed Philip Yang
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Philip Yang @ 2024-07-30 20:15 UTC
To: amd-gfx; +Cc: Felix.Kuehling, Philip Yang
Document how to use the SMI (system management interface) to enable and
receive SVM events. Document the SVM event triggers.
Define SVM event message string format macros that user mode can use
with sscanf to parse the events. Add them to the uAPI header file to make
it obvious that changing a format string in the future changes the uAPI.
No functional changes.
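As a rough illustration of the sscanf use described above, the following user-mode sketch decodes one migrate start line. The macro is copied from this patch so the example builds without the updated header. The struct, the helper name parse_migrate_start and the sample line used to exercise it are assumptions made for the example, not part of the patch.

```c
#include <stdio.h>

/* Copied from this patch; user mode would normally get it from
 * <linux/kfd_ioctl.h> once the header is updated. */
#define KFD_EVENT_FMT_MIGRATE_START(ns, pid, start, size, from, to, prefetch_loc,\
		preferred_loc, migrate_trigger)\
		"%lld -%d @%lx(%lx) %x->%x %x:%x %d\n", (ns), (pid), (start), (size),\
		(from), (to), (prefetch_loc), (preferred_loc), (migrate_trigger)

/* Hypothetical container for the decoded fields. */
struct migrate_start {
	long long ns;			/* timestamp in nanoseconds */
	int pid;			/* process that generated the event */
	unsigned long start, size;	/* range, in pages */
	unsigned int from, to;		/* GPU ID, or 0 for system memory */
	unsigned int prefetch_loc, preferred_loc;
	int trigger;			/* enum KFD_MIGRATE_TRIGGERS value */
};

/* Decode one event line. Returns 0 on success, -1 if the line does not
 * match the migrate start layout. */
static int parse_migrate_start(const char *line, unsigned int *event_id,
			       struct migrate_start *ev)
{
	int off = 0;

	/* Every line starts with the event id in hex; %n records where the
	 * event specific part begins. */
	if (sscanf(line, "%x %n", event_id, &off) != 1)
		return -1;

	/* The macro expands to the format string followed by its arguments,
	 * so passing pointers turns it into a sscanf argument list. */
	if (sscanf(line + off, KFD_EVENT_FMT_MIGRATE_START(&ev->ns, &ev->pid,
			&ev->start, &ev->size, &ev->from, &ev->to,
			&ev->prefetch_loc, &ev->preferred_loc,
			&ev->trigger)) != 9)
		return -1;

	return 0;
}
```

A caller would first check the returned event id against enum kfd_smi_event and only then pick the matching format macro, since the layout differs per event.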
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 45 +++++----
include/uapi/linux/kfd_ioctl.h | 100 +++++++++++++++++---
2 files changed, 109 insertions(+), 36 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
index ea6a8e43bd5b..de8b9abf7afc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
@@ -235,17 +235,16 @@ void kfd_smi_event_update_gpu_reset(struct kfd_node *dev, bool post_reset,
amdgpu_reset_get_desc(reset_context, reset_cause,
sizeof(reset_cause));
- kfd_smi_event_add(0, dev, event, "%x %s\n",
- dev->reset_seq_num,
- reset_cause);
+ kfd_smi_event_add(0, dev, event, KFD_EVENT_FMT_UPDATE_GPU_RESET(
+ dev->reset_seq_num, reset_cause));
}
void kfd_smi_event_update_thermal_throttling(struct kfd_node *dev,
uint64_t throttle_bitmask)
{
- kfd_smi_event_add(0, dev, KFD_SMI_EVENT_THERMAL_THROTTLE, "%llx:%llx\n",
+ kfd_smi_event_add(0, dev, KFD_SMI_EVENT_THERMAL_THROTTLE, KFD_EVENT_FMT_THERMAL_THROTTLING(
throttle_bitmask,
- amdgpu_dpm_get_thermal_throttling_counter(dev->adev));
+ amdgpu_dpm_get_thermal_throttling_counter(dev->adev)));
}
void kfd_smi_event_update_vmfault(struct kfd_node *dev, uint16_t pasid)
@@ -256,8 +255,8 @@ void kfd_smi_event_update_vmfault(struct kfd_node *dev, uint16_t pasid)
if (task_info) {
/* Report VM faults from user applications, not retry from kernel */
if (task_info->pid)
- kfd_smi_event_add(0, dev, KFD_SMI_EVENT_VMFAULT, "%x:%s\n",
- task_info->pid, task_info->task_name);
+ kfd_smi_event_add(0, dev, KFD_SMI_EVENT_VMFAULT, KFD_EVENT_FMT_VMFAULT(
+ task_info->pid, task_info->task_name));
amdgpu_vm_put_task_info(task_info);
}
}
@@ -267,16 +266,16 @@ void kfd_smi_event_page_fault_start(struct kfd_node *node, pid_t pid,
ktime_t ts)
{
kfd_smi_event_add(pid, node, KFD_SMI_EVENT_PAGE_FAULT_START,
- "%lld -%d @%lx(%x) %c\n", ktime_to_ns(ts), pid,
- address, node->id, write_fault ? 'W' : 'R');
+ KFD_EVENT_FMT_PAGEFAULT_START(ktime_to_ns(ts), pid,
+ address, node->id, write_fault ? 'W' : 'R'));
}
void kfd_smi_event_page_fault_end(struct kfd_node *node, pid_t pid,
unsigned long address, bool migration)
{
kfd_smi_event_add(pid, node, KFD_SMI_EVENT_PAGE_FAULT_END,
- "%lld -%d @%lx(%x) %c\n", ktime_get_boottime_ns(),
- pid, address, node->id, migration ? 'M' : 'U');
+ KFD_EVENT_FMT_PAGEFAULT_END(ktime_get_boottime_ns(),
+ pid, address, node->id, migration ? 'M' : 'U'));
}
void kfd_smi_event_migration_start(struct kfd_node *node, pid_t pid,
@@ -286,9 +285,9 @@ void kfd_smi_event_migration_start(struct kfd_node *node, pid_t pid,
uint32_t trigger)
{
kfd_smi_event_add(pid, node, KFD_SMI_EVENT_MIGRATE_START,
- "%lld -%d @%lx(%lx) %x->%x %x:%x %d\n",
+ KFD_EVENT_FMT_MIGRATE_START(
ktime_get_boottime_ns(), pid, start, end - start,
- from, to, prefetch_loc, preferred_loc, trigger);
+ from, to, prefetch_loc, preferred_loc, trigger));
}
void kfd_smi_event_migration_end(struct kfd_node *node, pid_t pid,
@@ -296,24 +295,24 @@ void kfd_smi_event_migration_end(struct kfd_node *node, pid_t pid,
uint32_t from, uint32_t to, uint32_t trigger)
{
kfd_smi_event_add(pid, node, KFD_SMI_EVENT_MIGRATE_END,
- "%lld -%d @%lx(%lx) %x->%x %d\n",
+ KFD_EVENT_FMT_MIGRATE_END(
ktime_get_boottime_ns(), pid, start, end - start,
- from, to, trigger);
+ from, to, trigger));
}
void kfd_smi_event_queue_eviction(struct kfd_node *node, pid_t pid,
uint32_t trigger)
{
kfd_smi_event_add(pid, node, KFD_SMI_EVENT_QUEUE_EVICTION,
- "%lld -%d %x %d\n", ktime_get_boottime_ns(), pid,
- node->id, trigger);
+ KFD_EVENT_FMT_QUEUE_EVICTION(ktime_get_boottime_ns(), pid,
+ node->id, trigger));
}
void kfd_smi_event_queue_restore(struct kfd_node *node, pid_t pid)
{
kfd_smi_event_add(pid, node, KFD_SMI_EVENT_QUEUE_RESTORE,
- "%lld -%d %x\n", ktime_get_boottime_ns(), pid,
- node->id);
+ KFD_EVENT_FMT_QUEUE_RESTORE(ktime_get_boottime_ns(), pid,
+ node->id, 0));
}
void kfd_smi_event_queue_restore_rescheduled(struct mm_struct *mm)
@@ -330,8 +329,8 @@ void kfd_smi_event_queue_restore_rescheduled(struct mm_struct *mm)
kfd_smi_event_add(p->lead_thread->pid, pdd->dev,
KFD_SMI_EVENT_QUEUE_RESTORE,
- "%lld -%d %x %c\n", ktime_get_boottime_ns(),
- p->lead_thread->pid, pdd->dev->id, 'R');
+ KFD_EVENT_FMT_QUEUE_RESTORE(ktime_get_boottime_ns(),
+ p->lead_thread->pid, pdd->dev->id, 'R'));
}
kfd_unref_process(p);
}
@@ -341,8 +340,8 @@ void kfd_smi_event_unmap_from_gpu(struct kfd_node *node, pid_t pid,
uint32_t trigger)
{
kfd_smi_event_add(pid, node, KFD_SMI_EVENT_UNMAP_FROM_GPU,
- "%lld -%d @%lx(%lx) %x %d\n", ktime_get_boottime_ns(),
- pid, address, last - address + 1, node->id, trigger);
+ KFD_EVENT_FMT_UNMAP_FROM_GPU(ktime_get_boottime_ns(),
+ pid, address, last - address + 1, node->id, trigger));
}
int kfd_smi_event_open(struct kfd_node *dev, uint32_t *fd)
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index 71a7ce5f2d4c..c94182ad8fb8 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -540,26 +540,29 @@ enum kfd_smi_event {
KFD_SMI_EVENT_ALL_PROCESS = 64
};
+/* The reason of the page migration event */
enum KFD_MIGRATE_TRIGGERS {
- KFD_MIGRATE_TRIGGER_PREFETCH,
- KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
- KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU,
- KFD_MIGRATE_TRIGGER_TTM_EVICTION
+ KFD_MIGRATE_TRIGGER_PREFETCH, /* Prefetch to GPU */
+ KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, /* GPU page fault recover */
+ KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU, /* CPU page fault recover */
+ KFD_MIGRATE_TRIGGER_TTM_EVICTION /* TTM eviction */
};
+/* The reason of user queue eviction event */
enum KFD_QUEUE_EVICTION_TRIGGERS {
- KFD_QUEUE_EVICTION_TRIGGER_SVM,
- KFD_QUEUE_EVICTION_TRIGGER_USERPTR,
- KFD_QUEUE_EVICTION_TRIGGER_TTM,
- KFD_QUEUE_EVICTION_TRIGGER_SUSPEND,
- KFD_QUEUE_EVICTION_CRIU_CHECKPOINT,
- KFD_QUEUE_EVICTION_CRIU_RESTORE
+ KFD_QUEUE_EVICTION_TRIGGER_SVM, /* SVM buffer migration */
+ KFD_QUEUE_EVICTION_TRIGGER_USERPTR, /* userptr movement */
+ KFD_QUEUE_EVICTION_TRIGGER_TTM, /* TTM move buffer */
+ KFD_QUEUE_EVICTION_TRIGGER_SUSPEND, /* GPU suspend */
+ KFD_QUEUE_EVICTION_CRIU_CHECKPOINT, /* CRIU checkpoint */
+ KFD_QUEUE_EVICTION_CRIU_RESTORE /* CRIU restore */
};
+/* The reason of unmap buffer from GPU event */
enum KFD_SVM_UNMAP_TRIGGERS {
- KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY,
- KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY_MIGRATE,
- KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU
+ KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY, /* MMU notifier CPU buffer movement */
+ KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY_MIGRATE,/* MMU notifier page migration */
+ KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU /* Unmap to free the buffer */
};
#define KFD_SMI_EVENT_MASK_FROM_INDEX(i) (1ULL << ((i) - 1))
@@ -570,6 +573,77 @@ struct kfd_ioctl_smi_events_args {
__u32 anon_fd; /* from KFD */
};
+/*
+ * SVM event tracing via SMI system management interface
+ *
+ * Open event file descriptor
+ * use ioctl AMDKFD_IOC_SMI_EVENTS, pass in gpuid and return an anonymous file
+ * descriptor to receive SMI events.
+ * If calling with sudo permission, then file descriptor can be used to receive
+ * SVM events from all processes, otherwise, to only receive SVM events of same
+ * process.
+ *
+ * To enable the SVM event
+ * Write the KFD_SMI_EVENT_MASK_FROM_INDEX(event) bitmap mask to the event file
+ * descriptor to start recording the event to the kfifo; combine bitmap masks
+ * for multiple events. A new event mask overwrites the previous event mask.
+ * The KFD_SMI_EVENT_MASK_FROM_INDEX(KFD_SMI_EVENT_ALL_PROCESS) bit requires sudo
+ * permission to receive SVM events from all processes.
+ *
+ * To receive the event
+ * Application can poll file descriptor to wait for the events, then read event
+ * from the file into a buffer. Each event is one line string message, starting
+ * with the event id, then the event specific information.
+ *
+ * To decode event information
+ * The following event format string macro can be used with sscanf to decode
+ * the specific event information.
+ * event triggers: the reason to generate the event, defined as enum for unmap,
+ * eviction and migrate events.
+ * node, from, to, prefetch_loc, preferred_loc: GPU ID, or 0 for system memory.
+ * addr: user mode address, in pages
+ * size: in pages
+ * pid: the process ID to generate the event
+ * ns: timestamp in nanosecond-resolution, starts at system boot time but
+ * stops during suspend
+ * migrate_update: GPU page fault is recovered by 'M' for migrate, 'U' for update
+ * rw: 'W' for write page fault, 'R' for read page fault
+ * rescheduled: 'R' if the queue restore failed and rescheduled to try again
+ */
+#define KFD_EVENT_FMT_UPDATE_GPU_RESET(reset_seq_num, reset_cause)\
+ "%x %s\n", (reset_seq_num), (reset_cause)
+
+#define KFD_EVENT_FMT_THERMAL_THROTTLING(bitmask, counter)\
+ "%llx:%llx\n", (bitmask), (counter)
+
+#define KFD_EVENT_FMT_VMFAULT(pid, task_name)\
+ "%x:%s\n", (pid), (task_name)
+
+#define KFD_EVENT_FMT_PAGEFAULT_START(ns, pid, addr, node, rw)\
+ "%lld -%d @%lx(%x) %c\n", (ns), (pid), (addr), (node), (rw)
+
+#define KFD_EVENT_FMT_PAGEFAULT_END(ns, pid, addr, node, migrate_update)\
+ "%lld -%d @%lx(%x) %c\n", (ns), (pid), (addr), (node), (migrate_update)
+
+#define KFD_EVENT_FMT_MIGRATE_START(ns, pid, start, size, from, to, prefetch_loc,\
+ preferred_loc, migrate_trigger)\
+ "%lld -%d @%lx(%lx) %x->%x %x:%x %d\n", (ns), (pid), (start), (size),\
+ (from), (to), (prefetch_loc), (preferred_loc), (migrate_trigger)
+
+#define KFD_EVENT_FMT_MIGRATE_END(ns, pid, start, size, from, to, migrate_trigger)\
+ "%lld -%d @%lx(%lx) %x->%x %d\n", (ns), (pid), (start), (size),\
+ (from), (to), (migrate_trigger)
+
+#define KFD_EVENT_FMT_QUEUE_EVICTION(ns, pid, node, evict_trigger)\
+ "%lld -%d %x %d\n", (ns), (pid), (node), (evict_trigger)
+
+#define KFD_EVENT_FMT_QUEUE_RESTORE(ns, pid, node, rescheduled)\
+ "%lld -%d %x %c\n", (ns), (pid), (node), (rescheduled)
+
+#define KFD_EVENT_FMT_UNMAP_FROM_GPU(ns, pid, addr, size, node, unmap_trigger)\
+ "%lld -%d @%lx(%lx) %x %d\n", (ns), (pid), (addr), (size),\
+ (node), (unmap_trigger)
+
/**************************************************************************************************
* CRIU IOCTLs (Checkpoint Restore In Userspace)
*
--
2.43.2
* [PATCH v2 2/4] drm/amdkfd: Output migrate end event if migrate failed
2024-07-30 20:15 [PATCH v2 0/4] Improve SVM migrate event report Philip Yang
2024-07-30 20:15 ` [PATCH v2 1/4] drm/amdkfd: Document and define SVM events message macro Philip Yang
@ 2024-07-30 20:15 ` Philip Yang
2024-07-30 20:15 ` [PATCH v2 3/4] drm/amdkfd: Increase SMI event fifo size Philip Yang
2024-07-30 20:15 ` [PATCH v2 4/4] drm/amdkfd: SMI report dropped event count Philip Yang
3 siblings, 0 replies; 10+ messages in thread
From: Philip Yang @ 2024-07-30 20:15 UTC
To: amd-gfx; +Cc: Felix.Kuehling, Philip Yang
If page migration fails, also output the migrate end event to match the
migrate start event, with the failure error_code appended to the end of the
migrate end message macro. This does not break the uAPI, because an
application that uses the old message macro with sscanf ignores the trailing
error_code.
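The compatibility argument can be checked with a small user-mode sketch. The two format strings are copied from the old and new KFD_EVENT_FMT_MIGRATE_END. The struct, the helper names and the sample text used to exercise them are assumptions made for the example. The input is the event specific part of a line, after the leading event id.

```c
#include <stdio.h>

/* Format strings copied from the old and new KFD_EVENT_FMT_MIGRATE_END. */
#define MIGRATE_END_FMT_OLD "%lld -%d @%lx(%lx) %x->%x %d\n"
#define MIGRATE_END_FMT_NEW "%lld -%d @%lx(%lx) %x->%x %d %d\n"

/* Hypothetical container for the decoded fields. */
struct migrate_end {
	long long ns;
	int pid;
	unsigned long start, size;	/* in pages */
	unsigned int from, to;		/* GPU ID, or 0 for system memory */
	int trigger;
	int error_code;			/* 0 if no error */
};

/* Parser built against the old header. Returns the number of fields that
 * sscanf converted; a trailing error_code is left unread. */
static int parse_migrate_end_old(const char *info, struct migrate_end *ev)
{
	return sscanf(info, MIGRATE_END_FMT_OLD, &ev->ns, &ev->pid,
		      &ev->start, &ev->size, &ev->from, &ev->to,
		      &ev->trigger);
}

/* Parser built against the new header. Returns 8 for a line with an
 * error_code, 7 for a line from a kernel that does not emit one; in the
 * latter case error_code keeps its default of 0. */
static int parse_migrate_end_new(const char *info, struct migrate_end *ev)
{
	ev->error_code = 0;
	return sscanf(info, MIGRATE_END_FMT_NEW, &ev->ns, &ev->pid,
		      &ev->start, &ev->size, &ev->from, &ev->to,
		      &ev->trigger, &ev->error_code);
}
```

Checking the return value for 7 or 8 rather than for an exact match lets one binary handle kernels with and without this patch.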
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 14 ++++++--------
drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 5 +++--
drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h | 3 ++-
include/uapi/linux/kfd_ioctl.h | 7 ++++---
4 files changed, 15 insertions(+), 14 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 8ee3d07ffbdf..eacfeb32f35d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -445,14 +445,13 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct svm_range *prange,
pr_debug("successful/cpages/npages 0x%lx/0x%lx/0x%lx\n",
mpages, cpages, migrate.npages);
- kfd_smi_event_migration_end(node, p->lead_thread->pid,
- start >> PAGE_SHIFT, end >> PAGE_SHIFT,
- 0, node->id, trigger);
-
svm_range_dma_unmap_dev(adev->dev, scratch, 0, npages);
out_free:
kvfree(buf);
+ kfd_smi_event_migration_end(node, p->lead_thread->pid,
+ start >> PAGE_SHIFT, end >> PAGE_SHIFT,
+ 0, node->id, trigger, r);
out:
if (!r && mpages) {
pdd = svm_range_get_pdd_by_node(prange, node);
@@ -751,14 +750,13 @@ svm_migrate_vma_to_ram(struct kfd_node *node, struct svm_range *prange,
svm_migrate_copy_done(adev, mfence);
migrate_vma_finalize(&migrate);
- kfd_smi_event_migration_end(node, p->lead_thread->pid,
- start >> PAGE_SHIFT, end >> PAGE_SHIFT,
- node->id, 0, trigger);
-
svm_range_dma_unmap_dev(adev->dev, scratch, 0, npages);
out_free:
kvfree(buf);
+ kfd_smi_event_migration_end(node, p->lead_thread->pid,
+ start >> PAGE_SHIFT, end >> PAGE_SHIFT,
+ node->id, 0, trigger, r);
out:
if (!r && cpages) {
mpages = cpages - upages;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
index de8b9abf7afc..1d94b445a060 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
@@ -292,12 +292,13 @@ void kfd_smi_event_migration_start(struct kfd_node *node, pid_t pid,
void kfd_smi_event_migration_end(struct kfd_node *node, pid_t pid,
unsigned long start, unsigned long end,
- uint32_t from, uint32_t to, uint32_t trigger)
+ uint32_t from, uint32_t to, uint32_t trigger,
+ int error_code)
{
kfd_smi_event_add(pid, node, KFD_SMI_EVENT_MIGRATE_END,
KFD_EVENT_FMT_MIGRATE_END(
ktime_get_boottime_ns(), pid, start, end - start,
- from, to, trigger));
+ from, to, trigger, error_code));
}
void kfd_smi_event_queue_eviction(struct kfd_node *node, pid_t pid,
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h
index 85010b8307f8..503bff13d815 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h
@@ -44,7 +44,8 @@ void kfd_smi_event_migration_start(struct kfd_node *node, pid_t pid,
uint32_t trigger);
void kfd_smi_event_migration_end(struct kfd_node *node, pid_t pid,
unsigned long start, unsigned long end,
- uint32_t from, uint32_t to, uint32_t trigger);
+ uint32_t from, uint32_t to, uint32_t trigger,
+ int error_code);
void kfd_smi_event_queue_eviction(struct kfd_node *node, pid_t pid,
uint32_t trigger);
void kfd_smi_event_queue_restore(struct kfd_node *node, pid_t pid);
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index c94182ad8fb8..e4ed8fec3294 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -609,6 +609,7 @@ struct kfd_ioctl_smi_events_args {
* migrate_update: GPU page fault is recovered by 'M' for migrate, 'U' for update
* rw: 'W' for write page fault, 'R' for read page fault
* rescheduled: 'R' if the queue restore failed and rescheduled to try again
+ * error_code: migrate failure error code, 0 if no error
*/
#define KFD_EVENT_FMT_UPDATE_GPU_RESET(reset_seq_num, reset_cause)\
"%x %s\n", (reset_seq_num), (reset_cause)
@@ -630,9 +631,9 @@ struct kfd_ioctl_smi_events_args {
"%lld -%d @%lx(%lx) %x->%x %x:%x %d\n", (ns), (pid), (start), (size),\
(from), (to), (prefetch_loc), (preferred_loc), (migrate_trigger)
-#define KFD_EVENT_FMT_MIGRATE_END(ns, pid, start, size, from, to, migrate_trigger)\
- "%lld -%d @%lx(%lx) %x->%x %d\n", (ns), (pid), (start), (size),\
- (from), (to), (migrate_trigger)
+#define KFD_EVENT_FMT_MIGRATE_END(ns, pid, start, size, from, to, migrate_trigger, error_code) \
+ "%lld -%d @%lx(%lx) %x->%x %d %d\n", (ns), (pid), (start), (size),\
+ (from), (to), (migrate_trigger), (error_code)
#define KFD_EVENT_FMT_QUEUE_EVICTION(ns, pid, node, evict_trigger)\
"%lld -%d %x %d\n", (ns), (pid), (node), (evict_trigger)
--
2.43.2
* [PATCH v2 3/4] drm/amdkfd: Increase SMI event fifo size
2024-07-30 20:15 [PATCH v2 0/4] Improve SVM migrate event report Philip Yang
2024-07-30 20:15 ` [PATCH v2 1/4] drm/amdkfd: Document and define SVM events message macro Philip Yang
2024-07-30 20:15 ` [PATCH v2 2/4] drm/amdkfd: Output migrate end event if migrate failed Philip Yang
@ 2024-07-30 20:15 ` Philip Yang
2024-08-22 14:34 ` James Zhu
2024-07-30 20:15 ` [PATCH v2 4/4] drm/amdkfd: SMI report dropped event count Philip Yang
3 siblings, 1 reply; 10+ messages in thread
From: Philip Yang @ 2024-07-30 20:15 UTC
To: amd-gfx; +Cc: Felix.Kuehling, Philip Yang
The 1KB SMI event fifo was enough to report GPU vm fault or reset
events. Increase it to 8KB to store about 100 migrate events, so migrate
events are less likely to be dropped if many migrations happen in a short
period of time. Add the KFD prefix to the macro name.
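The capacity estimate can be reproduced with a short user-mode sketch that formats one sample migrate start line and divides the fifo size by its length. The format string is the one used for migrate start events. The sample field values, the event id and the helper names are assumptions made for the example, and real lines vary in length with the width of the timestamp, pid and addresses.

```c
#include <stdio.h>
#include <stddef.h>

#define KFD_MAX_KFIFO_SIZE 8192

/* Length in bytes of one migrate start line built from made-up sample
 * values, including the leading hex event id. */
static size_t sample_migrate_start_len(void)
{
	/* snprintf with a NULL buffer only reports the length needed. */
	int len = snprintf(NULL, 0,
			   "%x " "%lld -%d @%lx(%lx) %x->%x %x:%x %d\n",
			   5u, 123456789012345LL, 12345,
			   0x7f123456UL, 0x200UL,
			   0u, 0xa1b2u, 0xa1b2u, 0u, 1);

	return len > 0 ? (size_t)len : 0;
}

/* Number of whole events of event_len bytes that fit in fifo_bytes. */
static size_t events_that_fit(size_t fifo_bytes, size_t event_len)
{
	return event_len ? fifo_bytes / event_len : 0;
}
```

Calling events_that_fit(KFD_MAX_KFIFO_SIZE, sample_migrate_start_len()) gives a rough upper bound for one event type; a migration also emits an end event, so the number of complete start/end pairs is lower.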
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
index 1d94b445a060..9b8169761ec5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
@@ -44,7 +44,7 @@ struct kfd_smi_client {
bool suser;
};
-#define MAX_KFIFO_SIZE 1024
+#define KFD_MAX_KFIFO_SIZE 8192
static __poll_t kfd_smi_ev_poll(struct file *, struct poll_table_struct *);
static ssize_t kfd_smi_ev_read(struct file *, char __user *, size_t, loff_t *);
@@ -86,7 +86,7 @@ static ssize_t kfd_smi_ev_read(struct file *filep, char __user *user,
struct kfd_smi_client *client = filep->private_data;
unsigned char *buf;
- size = min_t(size_t, size, MAX_KFIFO_SIZE);
+ size = min_t(size_t, size, KFD_MAX_KFIFO_SIZE);
buf = kmalloc(size, GFP_KERNEL);
if (!buf)
return -ENOMEM;
@@ -355,7 +355,7 @@ int kfd_smi_event_open(struct kfd_node *dev, uint32_t *fd)
return -ENOMEM;
INIT_LIST_HEAD(&client->list);
- ret = kfifo_alloc(&client->fifo, MAX_KFIFO_SIZE, GFP_KERNEL);
+ ret = kfifo_alloc(&client->fifo, KFD_MAX_KFIFO_SIZE, GFP_KERNEL);
if (ret) {
kfree(client);
return ret;
--
2.43.2
* [PATCH v2 4/4] drm/amdkfd: SMI report dropped event count
2024-07-30 20:15 [PATCH v2 0/4] Improve SVM migrate event report Philip Yang
` (2 preceding siblings ...)
2024-07-30 20:15 ` [PATCH v2 3/4] drm/amdkfd: Increase SMI event fifo size Philip Yang
@ 2024-07-30 20:15 ` Philip Yang
2024-08-22 14:45 ` James Zhu
3 siblings, 1 reply; 10+ messages in thread
From: Philip Yang @ 2024-07-30 20:15 UTC
To: amd-gfx; +Cc: Felix.Kuehling, Philip Yang
Add a new SMI event to report the dropped event count when the event kfifo
is full.
When the kfifo has space for two events, generate a dropped event record
that reports how many events were dropped, and add it to the kfifo together
with the next event.
After reading events out of the kfifo, if events were dropped,
generate a dropped event record and add it to the kfifo.
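The accounting described above can be modelled in user space with byte counters only. This is a simplified sketch, not the driver code: struct fifo_model and both helpers are hypothetical, record lengths are passed in by the caller, and locking and wakeups are left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-client kfifo plus drop counter. */
struct fifo_model {
	size_t capacity;		/* total fifo size in bytes */
	size_t used;			/* bytes currently queued */
	unsigned int drop_count;	/* events dropped since the last report */
};

/* Writer side: queue an event of event_len bytes. If drops are pending,
 * the dropped event record (dropped_len bytes) must fit as well and is
 * queued first. Returns 1 if the event was queued, 0 if it was dropped. */
static int model_add_event(struct fifo_model *f, size_t event_len,
			   size_t dropped_len)
{
	size_t extra = f->drop_count ? dropped_len : 0;

	if (f->capacity - f->used >= event_len + extra) {
		f->used += event_len + extra;
		f->drop_count = 0;
		return 1;
	}

	f->drop_count++;
	return 0;
}

/* Reader side: consume up to n bytes, then report pending drops. Like
 * kfifo_in(), the model stores only as many bytes as fit. */
static void model_read(struct fifo_model *f, size_t n, size_t dropped_len)
{
	size_t avail;

	if (n > f->used)
		n = f->used;
	f->used -= n;

	if (f->drop_count) {
		avail = f->capacity - f->used;
		f->used += dropped_len < avail ? dropped_len : avail;
		f->drop_count = 0;
	}
}
```

The model shows why a reader sees the dropped event record before the first event queued after an overflow, and why a read that frees space also flushes the pending count.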
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 41 ++++++++++++++++++---
include/uapi/linux/kfd_ioctl.h | 6 +++
2 files changed, 41 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
index 9b8169761ec5..9b47657d5160 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
@@ -42,6 +42,7 @@ struct kfd_smi_client {
struct rcu_head rcu;
pid_t pid;
bool suser;
+ u32 drop_count;
};
#define KFD_MAX_KFIFO_SIZE 8192
@@ -103,12 +104,26 @@ static ssize_t kfd_smi_ev_read(struct file *filep, char __user *user,
}
to_copy = min(size, to_copy);
ret = kfifo_out(&client->fifo, buf, to_copy);
- spin_unlock(&client->lock);
if (ret <= 0) {
+ spin_unlock(&client->lock);
ret = -EAGAIN;
goto ret_err;
}
+ if (client->drop_count) {
+ char msg[KFD_SMI_EVENT_MSG_SIZE];
+ int len;
+
+ len = snprintf(msg, sizeof(msg), "%x ", KFD_SMI_EVENT_DROPPED_EVENT);
+ len += snprintf(msg + len, sizeof(msg) - len,
+ KFD_EVENT_FMT_DROPPED_EVENT(ktime_get_boottime_ns(),
+ client->pid, client->drop_count));
+ kfifo_in(&client->fifo, msg, len);
+ client->drop_count = 0;
+ }
+
+ spin_unlock(&client->lock);
+
ret = copy_to_user(user, buf, to_copy);
if (ret) {
ret = -EFAULT;
@@ -173,22 +188,36 @@ static bool kfd_smi_ev_enabled(pid_t pid, struct kfd_smi_client *client,
}
static void add_event_to_kfifo(pid_t pid, struct kfd_node *dev,
- unsigned int smi_event, char *event_msg, int len)
+ unsigned int smi_event, char *event_msg, int event_len)
{
struct kfd_smi_client *client;
+ char msg[KFD_SMI_EVENT_MSG_SIZE];
+ int len = 0;
rcu_read_lock();
list_for_each_entry_rcu(client, &dev->smi_clients, list) {
if (!kfd_smi_ev_enabled(pid, client, smi_event))
continue;
+
spin_lock(&client->lock);
- if (kfifo_avail(&client->fifo) >= len) {
- kfifo_in(&client->fifo, event_msg, len);
+ if (client->drop_count) {
+ len = snprintf(msg, sizeof(msg), "%x ", KFD_SMI_EVENT_DROPPED_EVENT);
+ len += snprintf(msg + len, sizeof(msg) - len,
+ KFD_EVENT_FMT_DROPPED_EVENT(ktime_get_boottime_ns(), pid,
+ client->drop_count));
+ }
+
+ if (kfifo_avail(&client->fifo) >= event_len + len) {
+ if (len)
+ kfifo_in(&client->fifo, msg, len);
+ kfifo_in(&client->fifo, event_msg, event_len);
wake_up_all(&client->wait_queue);
+ client->drop_count = 0;
} else {
- pr_debug("smi_event(EventID: %u): no space left\n",
- smi_event);
+ client->drop_count++;
+ pr_debug("smi_event(EventID: %u): no space left drop_count %d\n",
+ smi_event, client->drop_count);
}
spin_unlock(&client->lock);
}
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index e4ed8fec3294..915d1e7c67fe 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -530,6 +530,7 @@ enum kfd_smi_event {
KFD_SMI_EVENT_QUEUE_EVICTION = 9,
KFD_SMI_EVENT_QUEUE_RESTORE = 10,
KFD_SMI_EVENT_UNMAP_FROM_GPU = 11,
+ KFD_SMI_EVENT_DROPPED_EVENT = 12,
/*
* max event number, as a flag bit to get events from all processes,
@@ -610,6 +611,7 @@ struct kfd_ioctl_smi_events_args {
* rw: 'W' for write page fault, 'R' for read page fault
* rescheduled: 'R' if the queue restore failed and rescheduled to try again
* error_code: migrate failure error code, 0 if no error
+ * drop_count: how many events dropped when fifo is full
*/
#define KFD_EVENT_FMT_UPDATE_GPU_RESET(reset_seq_num, reset_cause)\
"%x %s\n", (reset_seq_num), (reset_cause)
@@ -645,6 +647,10 @@ struct kfd_ioctl_smi_events_args {
"%lld -%d @%lx(%lx) %x %d\n", (ns), (pid), (addr), (size),\
(node), (unmap_trigger)
+#define KFD_EVENT_FMT_DROPPED_EVENT(ns, pid, drop_count)\
+ "%lld -%d %d\n", (ns), (pid), (drop_count)
+
+
/**************************************************************************************************
* CRIU IOCTLs (Checkpoint Restore In Userspace)
*
--
2.43.2
* Re: [PATCH v2 1/4] drm/amdkfd: Document and define SVM events message macro
2024-07-30 20:15 ` [PATCH v2 1/4] drm/amdkfd: Document and define SVM events message macro Philip Yang
@ 2024-08-22 14:32 ` James Zhu
2024-08-22 14:58 ` Philip Yang
0 siblings, 1 reply; 10+ messages in thread
From: James Zhu @ 2024-08-22 14:32 UTC
To: Philip Yang, amd-gfx; +Cc: Felix.Kuehling
On 2024-07-30 16:15, Philip Yang wrote:
> Document how to use SMI system management interface to enable and
> receive SVM events. Document SVM event triggers.
>
> Define SVM events message string format macro that could be used by user
> mode for sscanf to parse the event. Add it to uAPI header file to make
> it obvious that is changing uAPI in future.
>
> No functional changes.
>
> Signed-off-by: Philip Yang<Philip.Yang@amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 45 +++++----
> include/uapi/linux/kfd_ioctl.h | 100 +++++++++++++++++---
> 2 files changed, 109 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
> index ea6a8e43bd5b..de8b9abf7afc 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
> @@ -235,17 +235,16 @@ void kfd_smi_event_update_gpu_reset(struct kfd_node *dev, bool post_reset,
> amdgpu_reset_get_desc(reset_context, reset_cause,
> sizeof(reset_cause));
>
> - kfd_smi_event_add(0, dev, event, "%x %s\n",
> - dev->reset_seq_num,
> - reset_cause);
> + kfd_smi_event_add(0, dev, event, KFD_EVENT_FMT_UPDATE_GPU_RESET(
> + dev->reset_seq_num, reset_cause));
> }
>
> void kfd_smi_event_update_thermal_throttling(struct kfd_node *dev,
> uint64_t throttle_bitmask)
> {
> - kfd_smi_event_add(0, dev, KFD_SMI_EVENT_THERMAL_THROTTLE, "%llx:%llx\n",
> + kfd_smi_event_add(0, dev, KFD_SMI_EVENT_THERMAL_THROTTLE, KFD_EVENT_FMT_THERMAL_THROTTLING(
> throttle_bitmask,
> - amdgpu_dpm_get_thermal_throttling_counter(dev->adev));
> + amdgpu_dpm_get_thermal_throttling_counter(dev->adev)));
> }
>
> void kfd_smi_event_update_vmfault(struct kfd_node *dev, uint16_t pasid)
> @@ -256,8 +255,8 @@ void kfd_smi_event_update_vmfault(struct kfd_node *dev, uint16_t pasid)
> if (task_info) {
> /* Report VM faults from user applications, not retry from kernel */
> if (task_info->pid)
> - kfd_smi_event_add(0, dev, KFD_SMI_EVENT_VMFAULT, "%x:%s\n",
> - task_info->pid, task_info->task_name);
> + kfd_smi_event_add(0, dev, KFD_SMI_EVENT_VMFAULT, KFD_EVENT_FMT_VMFAULT(
> + task_info->pid, task_info->task_name));
> amdgpu_vm_put_task_info(task_info);
> }
> }
> @@ -267,16 +266,16 @@ void kfd_smi_event_page_fault_start(struct kfd_node *node, pid_t pid,
> ktime_t ts)
> {
> kfd_smi_event_add(pid, node, KFD_SMI_EVENT_PAGE_FAULT_START,
> - "%lld -%d @%lx(%x) %c\n", ktime_to_ns(ts), pid,
> - address, node->id, write_fault ? 'W' : 'R');
> + KFD_EVENT_FMT_PAGEFAULT_START(ktime_to_ns(ts), pid,
> + address, node->id, write_fault ? 'W' : 'R'));
> }
>
> void kfd_smi_event_page_fault_end(struct kfd_node *node, pid_t pid,
> unsigned long address, bool migration)
> {
> kfd_smi_event_add(pid, node, KFD_SMI_EVENT_PAGE_FAULT_END,
> - "%lld -%d @%lx(%x) %c\n", ktime_get_boottime_ns(),
> - pid, address, node->id, migration ? 'M' : 'U');
> + KFD_EVENT_FMT_PAGEFAULT_END(ktime_get_boottime_ns(),
> + pid, address, node->id, migration ? 'M' : 'U'));
> }
>
> void kfd_smi_event_migration_start(struct kfd_node *node, pid_t pid,
> @@ -286,9 +285,9 @@ void kfd_smi_event_migration_start(struct kfd_node *node, pid_t pid,
> uint32_t trigger)
> {
> kfd_smi_event_add(pid, node, KFD_SMI_EVENT_MIGRATE_START,
> - "%lld -%d @%lx(%lx) %x->%x %x:%x %d\n",
> + KFD_EVENT_FMT_MIGRATE_START(
> ktime_get_boottime_ns(), pid, start, end - start,
> - from, to, prefetch_loc, preferred_loc, trigger);
> + from, to, prefetch_loc, preferred_loc, trigger));
> }
>
> void kfd_smi_event_migration_end(struct kfd_node *node, pid_t pid,
> @@ -296,24 +295,24 @@ void kfd_smi_event_migration_end(struct kfd_node *node, pid_t pid,
> uint32_t from, uint32_t to, uint32_t trigger)
> {
> kfd_smi_event_add(pid, node, KFD_SMI_EVENT_MIGRATE_END,
> - "%lld -%d @%lx(%lx) %x->%x %d\n",
> + KFD_EVENT_FMT_MIGRATE_END(
> ktime_get_boottime_ns(), pid, start, end - start,
> - from, to, trigger);
> + from, to, trigger));
> }
>
> void kfd_smi_event_queue_eviction(struct kfd_node *node, pid_t pid,
> uint32_t trigger)
> {
> kfd_smi_event_add(pid, node, KFD_SMI_EVENT_QUEUE_EVICTION,
> - "%lld -%d %x %d\n", ktime_get_boottime_ns(), pid,
> - node->id, trigger);
> + KFD_EVENT_FMT_QUEUE_EVICTION(ktime_get_boottime_ns(), pid,
> + node->id, trigger));
> }
>
> void kfd_smi_event_queue_restore(struct kfd_node *node, pid_t pid)
> {
> kfd_smi_event_add(pid, node, KFD_SMI_EVENT_QUEUE_RESTORE,
> - "%lld -%d %x\n", ktime_get_boottime_ns(), pid,
> - node->id);
> + KFD_EVENT_FMT_QUEUE_RESTORE(ktime_get_boottime_ns(), pid,
> + node->id, 0));
> }
>
> void kfd_smi_event_queue_restore_rescheduled(struct mm_struct *mm)
> @@ -330,8 +329,8 @@ void kfd_smi_event_queue_restore_rescheduled(struct mm_struct *mm)
>
> kfd_smi_event_add(p->lead_thread->pid, pdd->dev,
> KFD_SMI_EVENT_QUEUE_RESTORE,
> - "%lld -%d %x %c\n", ktime_get_boottime_ns(),
> - p->lead_thread->pid, pdd->dev->id, 'R');
> + KFD_EVENT_FMT_QUEUE_RESTORE(ktime_get_boottime_ns(),
> + p->lead_thread->pid, pdd->dev->id, 'R'));
> }
> kfd_unref_process(p);
> }
> @@ -341,8 +340,8 @@ void kfd_smi_event_unmap_from_gpu(struct kfd_node *node, pid_t pid,
> uint32_t trigger)
> {
> kfd_smi_event_add(pid, node, KFD_SMI_EVENT_UNMAP_FROM_GPU,
> - "%lld -%d @%lx(%lx) %x %d\n", ktime_get_boottime_ns(),
> - pid, address, last - address + 1, node->id, trigger);
> + KFD_EVENT_FMT_UNMAP_FROM_GPU(ktime_get_boottime_ns(),
> + pid, address, last - address + 1, node->id, trigger));
> }
>
> int kfd_smi_event_open(struct kfd_node *dev, uint32_t *fd)
> diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
> index 71a7ce5f2d4c..c94182ad8fb8 100644
> --- a/include/uapi/linux/kfd_ioctl.h
> +++ b/include/uapi/linux/kfd_ioctl.h
> @@ -540,26 +540,29 @@ enum kfd_smi_event {
> KFD_SMI_EVENT_ALL_PROCESS = 64
> };
>
> +/* The reason of the page migration event */
> enum KFD_MIGRATE_TRIGGERS {
> - KFD_MIGRATE_TRIGGER_PREFETCH,
> - KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
> - KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU,
> - KFD_MIGRATE_TRIGGER_TTM_EVICTION
> + KFD_MIGRATE_TRIGGER_PREFETCH, /* Prefetch to GPU */
[JZ] Could it be prefetched to system RAM also?
> + KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, /* GPU page fault recover */
> + KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU, /* CPU page fault recover */
> + KFD_MIGRATE_TRIGGER_TTM_EVICTION /* TTM eviction */
> };
>
> +/* The reason of user queue eviction event */
> enum KFD_QUEUE_EVICTION_TRIGGERS {
> - KFD_QUEUE_EVICTION_TRIGGER_SVM,
> - KFD_QUEUE_EVICTION_TRIGGER_USERPTR,
> - KFD_QUEUE_EVICTION_TRIGGER_TTM,
> - KFD_QUEUE_EVICTION_TRIGGER_SUSPEND,
> - KFD_QUEUE_EVICTION_CRIU_CHECKPOINT,
> - KFD_QUEUE_EVICTION_CRIU_RESTORE
> + KFD_QUEUE_EVICTION_TRIGGER_SVM, /* SVM buffer migration */
> + KFD_QUEUE_EVICTION_TRIGGER_USERPTR, /* userptr movement */
> + KFD_QUEUE_EVICTION_TRIGGER_TTM, /* TTM move buffer */
> + KFD_QUEUE_EVICTION_TRIGGER_SUSPEND, /* GPU suspend */
> + KFD_QUEUE_EVICTION_CRIU_CHECKPOINT, /* CRIU checkpoint */
> + KFD_QUEUE_EVICTION_CRIU_RESTORE /* CRIU restore */
> };
>
> +/* The reason of unmap buffer from GPU event */
> enum KFD_SVM_UNMAP_TRIGGERS {
> - KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY,
> - KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY_MIGRATE,
> - KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU
> + KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY, /* MMU notifier CPU buffer movement */
> + KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY_MIGRATE,/* MMU notifier page migration */
> + KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU /* Unmap to free the buffer */
> };
>
> #define KFD_SMI_EVENT_MASK_FROM_INDEX(i) (1ULL << ((i) - 1))
> @@ -570,6 +573,77 @@ struct kfd_ioctl_smi_events_args {
> __u32 anon_fd; /* from KFD */
> };
>
> +/*
> + * SVM event tracing via SMI system management interface
> + *
> + * Open event file descriptor
> + * use ioctl AMDKFD_IOC_SMI_EVENTS, pass in gpuid and return an anonymous file
> + * descriptor to receive SMI events.
> + * If calling with sudo permission, then file descriptor can be used to receive
> + * SVM events from all processes, otherwise, to only receive SVM events of same
> + * process.
> + *
> + * To enable the SVM event
> + * Write event file descriptor with KFD_SMI_EVENT_MASK_FROM_INDEX(event) bitmap
> + * mask to start record the event to the kfifo, use bitmap mask combination
> + * for multiple events. New event mask will overwrite the previous event mask.
> + * KFD_SMI_EVENT_MASK_FROM_INDEX(KFD_SMI_EVENT_ALL_PROCESS) bit requires sudo
> + * permission to receive SVM events from all processes.
> + *
> + * To receive the event
> + * Application can poll file descriptor to wait for the events, then read event
> + * from the file into a buffer. Each event is one line string message, starting
> + * with the event id, then the event specific information.
> + *
> + * To decode event information
> + * The following event format string macro can be used with sscanf to decode
> + * the specific event information.
> + * event triggers: the reason to generate the event, defined as enum for unmap,
> + * eviction and migrate events.
> + * node, from, to, prefetch_loc, preferred_loc: GPU ID, or 0 for system memory.
> + * addr: user mode address, in pages
> + * size: in pages
> + * pid: the process ID to generate the event
> + * ns: timestamp in nanosecond-resolution, starts at system boot time but
> + * stops during suspend
> + * migrate_update: GPU page fault is recovered by 'M' for migrate, 'U' for update
> + * rw: 'W' for write page fault, 'R' for read page fault
> + * rescheduled: 'R' if the queue restore failed and rescheduled to try again
> + */
> +#define KFD_EVENT_FMT_UPDATE_GPU_RESET(reset_seq_num, reset_cause)\
> + "%x %s\n", (reset_seq_num), (reset_cause)
> +
> +#define KFD_EVENT_FMT_THERMAL_THROTTLING(bitmask, counter)\
> + "%llx:%llx\n", (bitmask), (counter)
> +
> +#define KFD_EVENT_FMT_VMFAULT(pid, task_name)\
> + "%x:%s\n", (pid), (task_name)
> +
> +#define KFD_EVENT_FMT_PAGEFAULT_START(ns, pid, addr, node, rw)\
> + "%lld -%d @%lx(%x) %c\n", (ns), (pid), (addr), (node), (rw)
> +
> +#define KFD_EVENT_FMT_PAGEFAULT_END(ns, pid, addr, node, migrate_update)\
> + "%lld -%d @%lx(%x) %c\n", (ns), (pid), (addr), (node), (migrate_update)
> +
> +#define KFD_EVENT_FMT_MIGRATE_START(ns, pid, start, size, from, to, prefetch_loc,\
> + preferred_loc, migrate_trigger)\
> + "%lld -%d @%lx(%lx) %x->%x %x:%x %d\n", (ns), (pid), (start), (size),\
> + (from), (to), (prefetch_loc), (preferred_loc), (migrate_trigger)
> +
> +#define KFD_EVENT_FMT_MIGRATE_END(ns, pid, start, size, from, to, migrate_trigger)\
> + "%lld -%d @%lx(%lx) %x->%x %d\n", (ns), (pid), (start), (size),\
> + (from), (to), (migrate_trigger)
> +
> +#define KFD_EVENT_FMT_QUEUE_EVICTION(ns, pid, node, evict_trigger)\
> + "%lld -%d %x %d\n", (ns), (pid), (node), (evict_trigger)
> +
> +#define KFD_EVENT_FMT_QUEUE_RESTORE(ns, pid, node, rescheduled)\
> + "%lld -%d %x %c\n", (ns), (pid), (node), (rescheduled)
> +
> +#define KFD_EVENT_FMT_UNMAP_FROM_GPU(ns, pid, addr, size, node, unmap_trigger)\
> + "%lld -%d @%lx(%lx) %x %d\n", (ns), (pid), (addr), (size),\
> + (node), (unmap_trigger)
> +
> /**************************************************************************************************
> * CRIU IOCTLs (Checkpoint Restore In Userspace)
> *
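The header comment quoted above describes two user-side steps: build an event bitmask with KFD_SMI_EVENT_MASK_FROM_INDEX, then decode each received line with sscanf. A minimal sketch of both follows. It assumes the conversion strings are copied into the program rather than taken from an installed kfd_ioctl.h, and the helper names (smi_event_mask, parse_unmap_event, struct unmap_event) are hypothetical and not part of the series. Event id 11 for unmap-from-GPU and the leading "%x " event-id prefix are taken from the quoted patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Copied from the quoted header: bit (i - 1) selects event id i. */
#define SMI_EVENT_MASK_FROM_INDEX(i) (1ULL << ((i) - 1))

/* Event id as listed in enum kfd_smi_event in this series. */
#define SMI_EVENT_UNMAP_FROM_GPU 11u

/* Conversion string of KFD_EVENT_FMT_UNMAP_FROM_GPU, without its arguments. */
#define UNMAP_FROM_GPU_FMT "%lld -%d @%lx(%lx) %x %d"

/* Hypothetical container for one decoded unmap-from-GPU event. */
struct unmap_event {
	long long ns;        /* boot-time timestamp in nanoseconds */
	int pid;             /* process that generated the event */
	unsigned long addr;  /* start address, in pages */
	unsigned long size;  /* size, in pages */
	unsigned int node;   /* GPU ID */
	int trigger;         /* enum KFD_SVM_UNMAP_TRIGGERS value */
};

/* OR together the mask bits for n event ids (valid ids are 1..64). */
static unsigned long long smi_event_mask(const unsigned int *ids, size_t n)
{
	unsigned long long mask = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(ids[i] >= 1 && ids[i] <= 64);
		mask |= SMI_EVENT_MASK_FROM_INDEX(ids[i]);
	}
	return mask;
}

/* Decode one event line; returns 0 on success, -1 if it is not an unmap event. */
static int parse_unmap_event(const char *line, struct unmap_event *ev)
{
	unsigned int id;
	int n;

	/* Each line starts with the event id in hex, then the event fields. */
	n = sscanf(line, "%x " UNMAP_FROM_GPU_FMT, &id, &ev->ns, &ev->pid,
		   &ev->addr, &ev->size, &ev->node, &ev->trigger);
	if (n != 7 || id != SMI_EVENT_UNMAP_FROM_GPU)
		return -1;
	return 0;
}
```

The value returned by smi_event_mask() is the bitmap the comment says to write to the event file descriptor. A reader that handles several event types would presumably read the leading id first and then pick the matching conversion string.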
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 3/4] drm/amdkfd: Increase SMI event fifo size
2024-07-30 20:15 ` [PATCH v2 3/4] drm/amdkfd: Increase SMI event fifo size Philip Yang
@ 2024-08-22 14:34 ` James Zhu
2024-08-22 15:06 ` Philip Yang
0 siblings, 1 reply; 10+ messages in thread
From: James Zhu @ 2024-08-22 14:34 UTC (permalink / raw)
To: Philip Yang, amd-gfx; +Cc: Felix.Kuehling
On 2024-07-30 16:15, Philip Yang wrote:
> SMI event fifo size 1KB was enough to report GPU vm fault or reset
[JZ] There is a typo here; it should say "NOT enough".
> event, increase it to 8KB to store about 100 migrate events, less chance
> to drop the migrate events if lots of migration happened in the short
> period of time. Add KFD prefix to the macro name.
>
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
> index 1d94b445a060..9b8169761ec5 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
> @@ -44,7 +44,7 @@ struct kfd_smi_client {
> bool suser;
> };
>
> -#define MAX_KFIFO_SIZE 1024
> +#define KFD_MAX_KFIFO_SIZE 8192
>
> static __poll_t kfd_smi_ev_poll(struct file *, struct poll_table_struct *);
> static ssize_t kfd_smi_ev_read(struct file *, char __user *, size_t, loff_t *);
> @@ -86,7 +86,7 @@ static ssize_t kfd_smi_ev_read(struct file *filep, char __user *user,
> struct kfd_smi_client *client = filep->private_data;
> unsigned char *buf;
>
> - size = min_t(size_t, size, MAX_KFIFO_SIZE);
> + size = min_t(size_t, size, KFD_MAX_KFIFO_SIZE);
> buf = kmalloc(size, GFP_KERNEL);
> if (!buf)
> return -ENOMEM;
> @@ -355,7 +355,7 @@ int kfd_smi_event_open(struct kfd_node *dev, uint32_t *fd)
> return -ENOMEM;
> INIT_LIST_HEAD(&client->list);
>
> - ret = kfifo_alloc(&client->fifo, MAX_KFIFO_SIZE, GFP_KERNEL);
> + ret = kfifo_alloc(&client->fifo, KFD_MAX_KFIFO_SIZE, GFP_KERNEL);
> if (ret) {
> kfree(client);
> return ret;
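The "about 100 migrate events" figure in the commit message can be sanity-checked with a rough user-space calculation. The sketch below measures one formatted migrate-start record and divides the fifo size by it. The conversion string is copied from KFD_EVENT_FMT_MIGRATE_START in patch 1/4; the helper names and any field values passed in are illustrative assumptions, and real record lengths vary with the timestamp, pid and address widths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Conversion string copied from KFD_EVENT_FMT_MIGRATE_START in patch 1/4. */
#define MIGRATE_START_FMT "%lld -%d @%lx(%lx) %x->%x %x:%x %d\n"

/* Bytes taken by one migrate-start record, including the "%x " event-id prefix. */
static size_t migrate_start_len(unsigned int event_id, long long ns, int pid,
				unsigned long start, unsigned long size,
				unsigned int from, unsigned int to,
				unsigned int prefetch_loc,
				unsigned int preferred_loc, int trigger)
{
	/* snprintf with a zero-sized buffer only reports the formatted length. */
	int n = snprintf(NULL, 0, "%x " MIGRATE_START_FMT, event_id, ns, pid,
			 start, size, from, to, prefetch_loc, preferred_loc,
			 trigger);

	assert(n > 0);
	return (size_t)n;
}

/* Number of whole records of record_len bytes that fit in fifo_size bytes. */
static size_t events_per_fifo(size_t fifo_size, size_t record_len)
{
	return record_len ? fifo_size / record_len : 0;
}
```

With a 16-digit timestamp, a 6-digit pid, an 8-hex-digit page address and 4-hex-digit GPU IDs, a record comes to roughly 60 bytes, so an 8192-byte fifo holds somewhat over 100 of them while a 1024-byte fifo holds fewer than 20. That is in line with the estimate in the commit message.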
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 4/4] drm/amdkfd: SMI report dropped event count
2024-07-30 20:15 ` [PATCH v2 4/4] drm/amdkfd: SMI report dropped event count Philip Yang
@ 2024-08-22 14:45 ` James Zhu
0 siblings, 0 replies; 10+ messages in thread
From: James Zhu @ 2024-08-22 14:45 UTC (permalink / raw)
To: Philip Yang, amd-gfx; +Cc: Felix.Kuehling
On 2024-07-30 16:15, Philip Yang wrote:
> Add new SMI event to report the dropped event count when the event kfifo
> is full.
>
> When the kfifo has space for two events, generate a dropped event record
> to report how many events were dropped, together with the next event to
> add to kfifo.
>
> After reading event out from kfifo, if there were events dropped,
> generate a dropped event record and add to kfifo.
[JZ] I am wondering if it would be better to use the method below:
calculate the dropped event count while adding events, and generate/send
the dropped event only in event read. When there are dropped events and
there is space left for the dropped event, send out the dropped event with
this event read; otherwise, send it out with the next event read.
>
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 41 ++++++++++++++++++---
> include/uapi/linux/kfd_ioctl.h | 6 +++
> 2 files changed, 41 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
> index 9b8169761ec5..9b47657d5160 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
> @@ -42,6 +42,7 @@ struct kfd_smi_client {
> struct rcu_head rcu;
> pid_t pid;
> bool suser;
> + u32 drop_count;
> };
>
> #define KFD_MAX_KFIFO_SIZE 8192
> @@ -103,12 +104,26 @@ static ssize_t kfd_smi_ev_read(struct file *filep, char __user *user,
> }
> to_copy = min(size, to_copy);
> ret = kfifo_out(&client->fifo, buf, to_copy);
> - spin_unlock(&client->lock);
> if (ret <= 0) {
> + spin_unlock(&client->lock);
> ret = -EAGAIN;
> goto ret_err;
> }
>
> + if (client->drop_count) {
> + char msg[KFD_SMI_EVENT_MSG_SIZE];
> + int len;
> +
> + len = snprintf(msg, sizeof(msg), "%x ", KFD_SMI_EVENT_DROPPED_EVENT);
> + len += snprintf(msg + len, sizeof(msg) - len,
> + KFD_EVENT_FMT_DROPPED_EVENT(ktime_get_boottime_ns(),
> + client->pid, client->drop_count));
> + kfifo_in(&client->fifo, msg, len);
> + client->drop_count = 0;
> + }
> +
> + spin_unlock(&client->lock);
> +
> ret = copy_to_user(user, buf, to_copy);
> if (ret) {
> ret = -EFAULT;
> @@ -173,22 +188,36 @@ static bool kfd_smi_ev_enabled(pid_t pid, struct kfd_smi_client *client,
> }
>
> static void add_event_to_kfifo(pid_t pid, struct kfd_node *dev,
> - unsigned int smi_event, char *event_msg, int len)
> + unsigned int smi_event, char *event_msg, int event_len)
> {
> struct kfd_smi_client *client;
> + char msg[KFD_SMI_EVENT_MSG_SIZE];
> + int len = 0;
>
> rcu_read_lock();
>
> list_for_each_entry_rcu(client, &dev->smi_clients, list) {
> if (!kfd_smi_ev_enabled(pid, client, smi_event))
> continue;
> +
> spin_lock(&client->lock);
> - if (kfifo_avail(&client->fifo) >= len) {
> - kfifo_in(&client->fifo, event_msg, len);
> + if (client->drop_count) {
> + len = snprintf(msg, sizeof(msg), "%x ", KFD_SMI_EVENT_DROPPED_EVENT);
> + len += snprintf(msg + len, sizeof(msg) - len,
> + KFD_EVENT_FMT_DROPPED_EVENT(ktime_get_boottime_ns(), pid,
> + client->drop_count));
> + }
> +
> + if (kfifo_avail(&client->fifo) >= event_len + len) {
> + if (len)
> + kfifo_in(&client->fifo, msg, len);
> + kfifo_in(&client->fifo, event_msg, event_len);
> wake_up_all(&client->wait_queue);
> + client->drop_count = 0;
> } else {
> - pr_debug("smi_event(EventID: %u): no space left\n",
> - smi_event);
> + client->drop_count++;
> + pr_debug("smi_event(EventID: %u): no space left drop_count %d\n",
> + smi_event, client->drop_count);
> }
> spin_unlock(&client->lock);
> }
> diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
> index e4ed8fec3294..915d1e7c67fe 100644
> --- a/include/uapi/linux/kfd_ioctl.h
> +++ b/include/uapi/linux/kfd_ioctl.h
> @@ -530,6 +530,7 @@ enum kfd_smi_event {
> KFD_SMI_EVENT_QUEUE_EVICTION = 9,
> KFD_SMI_EVENT_QUEUE_RESTORE = 10,
> KFD_SMI_EVENT_UNMAP_FROM_GPU = 11,
> + KFD_SMI_EVENT_DROPPED_EVENT = 12,
>
> /*
> * max event number, as a flag bit to get events from all processes,
> @@ -610,6 +611,7 @@ struct kfd_ioctl_smi_events_args {
> * rw: 'W' for write page fault, 'R' for read page fault
> * rescheduled: 'R' if the queue restore failed and rescheduled to try again
> * error_code: migrate failure error code, 0 if no error
> + * drop_count: how many events dropped when fifo is full
> */
> #define KFD_EVENT_FMT_UPDATE_GPU_RESET(reset_seq_num, reset_cause)\
> "%x %s\n", (reset_seq_num), (reset_cause)
> @@ -645,6 +647,10 @@ struct kfd_ioctl_smi_events_args {
> "%lld -%d @%lx(%lx) %x %d\n", (ns), (pid), (addr), (size),\
> (node), (unmap_trigger)
>
> +#define KFD_EVENT_FMT_DROPPED_EVENT(ns, pid, drop_count)\
> + "%lld -%d %d\n", (ns), (pid), (drop_count)
> +
> +
> /**************************************************************************************************
> * CRIU IOCTLs (Checkpoint Restore In Userspace)
> *
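The drop accounting in the quoted patch can be followed with a small user-space model. This is only a sketch under simplifying assumptions: the fifo is reduced to a byte counter, every dropped-event record has the fixed length drop_len, and the names (struct fifo_model, model_add_event, model_read) are hypothetical. It mirrors the two places where the patch emits a dropped-event record: in the add path, when there is room for both the record and the new event, and in the read path, after data has been taken out.

```c
#include <assert.h>
#include <stddef.h>

/* Byte-counting stand-in for one client's kfifo plus its drop counter. */
struct fifo_model {
	size_t capacity;            /* total fifo size in bytes */
	size_t used;                /* bytes currently queued */
	unsigned int drop_count;    /* events dropped since the last report */
	unsigned int last_reported; /* count carried by the latest dropped-event record */
};

/* Add path: returns 1 if the event was queued, 0 if it was dropped. */
static int model_add_event(struct fifo_model *f, size_t event_len, size_t drop_len)
{
	/* A dropped-event record is only needed if something was dropped. */
	size_t extra = f->drop_count ? drop_len : 0;

	assert(f->used <= f->capacity);
	if (f->capacity - f->used >= event_len + extra) {
		if (extra)
			f->last_reported = f->drop_count;
		f->used += event_len + extra;
		f->drop_count = 0;
		return 1;
	}
	f->drop_count++;
	return 0;
}

/* Read path: returns the number of bytes handed to the reader. */
static size_t model_read(struct fifo_model *f, size_t want, size_t drop_len)
{
	size_t got = want < f->used ? want : f->used;

	if (got == 0)
		return 0; /* the patch returns -EAGAIN in this case */
	f->used -= got;
	if (f->drop_count) {
		size_t room = f->capacity - f->used;

		/* kfifo_in() copies at most the space that is available. */
		f->used += drop_len < room ? drop_len : room;
		f->last_reported = f->drop_count;
		f->drop_count = 0;
	}
	return got;
}
```

In this model the record queued by model_read() is seen by the reader on its next read, which matches the patch adding the record to the kfifo after kfifo_out(). The alternative raised in the review comment would keep only the counting in the add path and leave all record generation to the read path.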
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 1/4] drm/amdkfd: Document and define SVM events message macro
2024-08-22 14:32 ` James Zhu
@ 2024-08-22 14:58 ` Philip Yang
0 siblings, 0 replies; 10+ messages in thread
From: Philip Yang @ 2024-08-22 14:58 UTC (permalink / raw)
To: James Zhu, Philip Yang, amd-gfx; +Cc: Felix.Kuehling
[-- Attachment #1: Type: text/html, Size: 12546 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 3/4] drm/amdkfd: Increase SMI event fifo size
2024-08-22 14:34 ` James Zhu
@ 2024-08-22 15:06 ` Philip Yang
0 siblings, 0 replies; 10+ messages in thread
From: Philip Yang @ 2024-08-22 15:06 UTC (permalink / raw)
To: James Zhu, Philip Yang, amd-gfx; +Cc: Felix.Kuehling
[-- Attachment #1: Type: text/html, Size: 4003 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2024-08-22 15:06 UTC | newest]
Thread overview: 10+ messages
2024-07-30 20:15 [PATCH v2 0/4] Improve SVM migrate event report Philip Yang
2024-07-30 20:15 ` [PATCH v2 1/4] drm/amdkfd: Document and define SVM events message macro Philip Yang
2024-08-22 14:32 ` James Zhu
2024-08-22 14:58 ` Philip Yang
2024-07-30 20:15 ` [PATCH v2 2/4] drm/amdkfd: Output migrate end event if migrate failed Philip Yang
2024-07-30 20:15 ` [PATCH v2 3/4] drm/amdkfd: Increase SMI event fifo size Philip Yang
2024-08-22 14:34 ` James Zhu
2024-08-22 15:06 ` Philip Yang
2024-07-30 20:15 ` [PATCH v2 4/4] drm/amdkfd: SMI report dropped event count Philip Yang
2024-08-22 14:45 ` James Zhu