public inbox for linux-kernel@vger.kernel.org
* [PATCH] drm/amdgpu: Enable runtime modification of gpu_recovery parameter with validation
@ 2024-12-28  6:32 Shuai Xue
  2024-12-29 20:11 ` Christian König
  0 siblings, 1 reply; 8+ messages in thread
From: Shuai Xue @ 2024-12-28  6:32 UTC
  To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	lijo.lazar, le.ma, hamza.mahfooz, tzimmermann, shaoyun.liu,
	Jun.Ma2
  Cc: xueshuai, amd-gfx, dri-devel, linux-kernel

Most GPU jobs use less than a full server, typically with each GPU
running an independent job. If a job consumes poisoned data, a SIGBUS
signal is sent to terminate it. With gpu_recovery at its default of -1,
the amdgpu driver also resets all GPUs on the server, so every job is
terminated, not only the affected one. Setting gpu_recovery to 0 provides
an opportunity to evacuate the other jobs first and then reset all GPUs
manually. However, the parameter is read-only, so it must be set correctly
at driver load, and reloading the GPU driver in a production environment
is difficult because various monitoring services hold references to it.

Make the gpu_recovery parameter read-write to enable runtime modification,
and validate writes so that only -1, 0 and 1 are accepted. This lets users
manage GPU recovery dynamically based on real-time requirements or
conditions.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
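For illustration only, and not part of the patch: a minimal userspace
sketch of the check the new setter is meant to perform. The function name
parse_gpu_recovery is hypothetical, and strtol() only approximates the
kernel's kstrtol() (it skips leading whitespace, which kstrtol() does not).

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for the setter's validation.
 * Parses a base-10 integer and accepts only -1, 0 or 1.
 * Returns 0 and stores the value in *out on success; returns
 * -ERANGE on overflow and -EINVAL for anything else. *out is
 * left untouched on failure.
 */
int parse_gpu_recovery(const char *buf, int *out)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == buf)
		return -EINVAL;	/* no digits were consumed */
	if (*end == '\n')
		end++;		/* tolerate one trailing newline, as sysfs writes often have */
	if (*end != '\0')
		return -EINVAL;	/* trailing garbage after the number */

	if (val != 1 && val != 0 && val != -1)
		return -EINVAL;

	*out = (int)val;
	return 0;
}
```

With this sketch, "-1", "0\n" and "1" are accepted, while "2", "abc" and
"1x" are rejected with -EINVAL.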
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 26 ++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 38686203bea6..03dd902e1cec 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -563,12 +563,36 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
 MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");
 module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
 
+static int amdgpu_set_gpu_recovery(const char *buf,
+				   const struct kernel_param *kp)
+{
+	long val;
+	int ret;
+
+	ret = kstrtol(buf, 10, &val);
+	if (ret < 0)
+		return ret;
+
+	if (val != 1 && val != 0 && val != -1) {
+		pr_err("Invalid value for gpu_recovery: %ld, expected 0,1,-1\n",
+		       val);
+		return -EINVAL;
+	}
+
+	return param_set_int(buf, kp);
+}
+
+static const struct kernel_param_ops amdgpu_gpu_recovery_ops = {
+	.set = amdgpu_set_gpu_recovery,
+	.get = param_get_int,
+};
+
 /**
  * DOC: gpu_recovery (int)
  * Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV).
  */
 MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
-module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
+module_param_cb(gpu_recovery, &amdgpu_gpu_recovery_ops, &amdgpu_gpu_recovery, 0644);
 
 /**
  * DOC: emu_mode (int)
-- 
2.39.3



Thread overview: 8+ messages
2024-12-28  6:32 [PATCH] drm/amdgpu: Enable runtime modification of gpu_recovery parameter with validation Shuai Xue
2024-12-29 20:11 ` Christian König
2024-12-30  8:50   ` Shuai Xue
2025-01-03  8:21     ` AW: " Koenig, Christian
2025-01-06 12:09       ` Simona Vetter
2025-01-07 12:36         ` Christian König
2025-01-07  7:06       ` AW: " Shuai Xue
2025-01-07 12:34         ` Christian König
