From: Jonathan Kim <Jonathan.Kim@amd.com>
To: <amd-gfx@lists.freedesktop.org>
Cc: JinHuiEric.Huang@amd.com, Alice Wong <shiwei.wong@amd.com>,
Felix.Kuehling@amd.com, Jonathan Kim <jonathan.kim@amd.com>,
Harish.Kasiviswanathan@amd.com
Subject: [PATCH] drm/amdkfd: fix mes set shader debugger process management
Date: Mon, 11 Dec 2023 16:16:12 -0500 [thread overview]
Message-ID: <20231211211612.3109-1-Jonathan.Kim@amd.com> (raw)
MES provides the driver a call to explicitly flush stale process memory
within the MES to avoid a race condition that results in a fatal
memory violation.
When SET_SHADER_DEBUGGER is called, the driver passes a memory address
that represents a process context address MES uses to keep track of
future per-process calls.
Normally, MES will purge its process context list when the last queue
has been removed. The driver, however, can call SET_SHADER_DEBUGGER
regardless of whether a queue has been added or not.
If SET_SHADER_DEBUGGER has been called with no queues as the last call
prior to process termination, the passed process context address will
still reside within MES.
On a new process call to SET_SHADER_DEBUGGER, the driver may end up
passing an identical process context address value (based on per-process
gpu memory address) to MES but is now pointing to a new allocated buffer
object during KFD process creation. Since the MES is unaware of this,
access of the passed address points to the stale object within MES and
triggers a fatal memory violation.
The solution is for KFD to explicitly flush the process context address
from MES on process termination.
Note that the flush call and the MES debugger calls use the same MES
interface but are separated as KFD calls to avoid conflicting with each
other.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Tested-by: Alice Wong <shiwei.wong@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 31 +++++++++++++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 10 +++---
.../amd/amdkfd/kfd_process_queue_manager.c | 1 +
drivers/gpu/drm/amd/include/mes_v11_api_def.h | 3 +-
4 files changed, 40 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index e544b823abf6..e98de23250dc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -916,6 +916,11 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device *adev,
op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
op_input.set_shader_debugger.process_context_addr = process_context_addr;
op_input.set_shader_debugger.flags.u32all = flags;
+
+ /* use amdgpu mes_flush_shader_debugger instead */
+ if (op_input.set_shader_debugger.flags.process_ctx_flush)
+ return -EINVAL;
+
op_input.set_shader_debugger.spi_gdbg_per_vmid_cntl = spi_gdbg_per_vmid_cntl;
memcpy(op_input.set_shader_debugger.tcp_watch_cntl, tcp_watch_cntl,
sizeof(op_input.set_shader_debugger.tcp_watch_cntl));
@@ -935,6 +940,32 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device *adev,
return r;
}
+int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev,
+ uint64_t process_context_addr)
+{
+ struct mes_misc_op_input op_input = {0};
+ int r;
+
+ if (!adev->mes.funcs->misc_op) {
+ DRM_ERROR("mes flush shader debugger is not supported!\n");
+ return -EINVAL;
+ }
+
+ op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
+ op_input.set_shader_debugger.process_context_addr = process_context_addr;
+ op_input.set_shader_debugger.flags.process_ctx_flush = true;
+
+ amdgpu_mes_lock(&adev->mes);
+
+ r = adev->mes.funcs->misc_op(&adev->mes, &op_input);
+ if (r)
+ DRM_ERROR("failed to set_shader_debugger\n");
+
+ amdgpu_mes_unlock(&adev->mes);
+
+ return r;
+}
+
static void
amdgpu_mes_ring_to_queue_props(struct amdgpu_device *adev,
struct amdgpu_ring *ring,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index 894b9b133000..7d4f93fea937 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -296,9 +296,10 @@ struct mes_misc_op_input {
uint64_t process_context_addr;
union {
struct {
- uint64_t single_memop : 1;
- uint64_t single_alu_op : 1;
- uint64_t reserved: 30;
+ uint32_t single_memop : 1;
+ uint32_t single_alu_op : 1;
+ uint32_t reserved: 29;
+ uint32_t process_ctx_flush: 1;
};
uint32_t u32all;
} flags;
@@ -374,7 +375,8 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device *adev,
const uint32_t *tcp_watch_cntl,
uint32_t flags,
bool trap_en);
-
+int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev,
+ uint64_t process_context_addr);
int amdgpu_mes_add_ring(struct amdgpu_device *adev, int gang_id,
int queue_type, int idx,
struct amdgpu_mes_ctx_data *ctx_data,
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index 77f493262e05..8e55e78fce4e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -87,6 +87,7 @@ void kfd_process_dequeue_from_device(struct kfd_process_device *pdd)
return;
dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
+ amdgpu_mes_flush_shader_debugger(dev->adev, pdd->proc_ctx_gpu_addr);
pdd->already_dequeued = true;
}
diff --git a/drivers/gpu/drm/amd/include/mes_v11_api_def.h b/drivers/gpu/drm/amd/include/mes_v11_api_def.h
index 1fbfd1aa987e..ec5b9ab67c5e 100644
--- a/drivers/gpu/drm/amd/include/mes_v11_api_def.h
+++ b/drivers/gpu/drm/amd/include/mes_v11_api_def.h
@@ -572,7 +572,8 @@ struct SET_SHADER_DEBUGGER {
struct {
uint32_t single_memop : 1; /* SQ_DEBUG.single_memop */
uint32_t single_alu_op : 1; /* SQ_DEBUG.single_alu_op */
- uint32_t reserved : 30;
+ uint32_t reserved : 29;
+ uint32_t process_ctx_flush : 1;
};
uint32_t u32all;
} flags;
--
2.34.1
next reply other threads:[~2023-12-11 21:16 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-11 21:16 Jonathan Kim [this message]
2023-12-12 17:49 ` [PATCH] drm/amdkfd: fix mes set shader debugger process management Eric Huang
2023-12-12 21:00 ` Liu, Shaoyun
2023-12-12 21:33 ` Kim, Jonathan
2023-12-12 21:44 ` Liu, Shaoyun
2023-12-12 21:48 ` Kim, Jonathan
2023-12-12 22:44 ` Liu, Shaoyun
2023-12-12 23:06 ` Kim, Jonathan
2023-12-13 0:07 ` Liu, Shaoyun
2023-12-13 1:19 ` Kim, Jonathan
2023-12-13 4:07 ` Liu, Shaoyun
2023-12-13 4:43 ` Kim, Jonathan
2023-12-13 22:14 ` Liu, Shaoyun
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20231211211612.3109-1-Jonathan.Kim@amd.com \
--to=jonathan.kim@amd.com \
--cc=Felix.Kuehling@amd.com \
--cc=Harish.Kasiviswanathan@amd.com \
--cc=JinHuiEric.Huang@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=shiwei.wong@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.