patches.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Lijo Lazar <lijo.lazar@amd.com>, Asad Kamal <asad.kamal@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Sasha Levin <sashal@kernel.org>,
	christian.koenig@amd.com, Hawking.Zhang@amd.com,
	mario.limonciello@amd.com, alexandre.f.demers@gmail.com,
	cesun102@amd.com
Subject: [PATCH AUTOSEL 6.17] drm/amdgpu: Report individual reset error
Date: Sun, 26 Oct 2025 10:48:51 -0400	[thread overview]
Message-ID: <20251026144958.26750-13-sashal@kernel.org> (raw)
In-Reply-To: <20251026144958.26750-1-sashal@kernel.org>

From: Lijo Lazar <lijo.lazar@amd.com>

[ Upstream commit 2e97663760e5fb7ee14f399c68e57b894f01e505 ]

If reinitialization of one of the GPUs fails after reset, it logs
failure on all subsequent GPUs eventhough they have resumed
successfully.

A sample log where only device at 0000:95:00.0 had a failure -

	amdgpu 0000:15:00.0: amdgpu: GPU reset(19) succeeded!
	amdgpu 0000:65:00.0: amdgpu: GPU reset(19) succeeded!
	amdgpu 0000:75:00.0: amdgpu: GPU reset(19) succeeded!
	amdgpu 0000:85:00.0: amdgpu: GPU reset(19) succeeded!
	amdgpu 0000:95:00.0: amdgpu: GPU reset(19) failed
	amdgpu 0000:e5:00.0: amdgpu: GPU reset(19) failed
	amdgpu 0000:f5:00.0: amdgpu: GPU reset(19) failed
	amdgpu 0000:05:00.0: amdgpu: GPU reset(19) failed
	amdgpu 0000:15:00.0: amdgpu: GPU reset end with ret = -5

To avoid confusion, report the error for each device
separately and return the first error as the overall result.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

YES
- The fix makes `amdgpu_device_sched_resume()` report failures per-
  device by gating the failure path on `tmp_adev->asic_reset_res` and
  logging the GPU-specific errno
  (`drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6341-6356`). Without
  this, once any GPU in an XGMI hive fails to reinitialize, every
  subsequent GPU is logged — and reported to SR-IOV guests — as failed
  even when it succeeds, exactly matching the problem described in the
  commit message.
- `amdgpu_vf_error_put()` now receives the correct GPU’s error code,
  preventing benign devices from being flagged as reset failures to the
  management stack
  (`drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6354-6356`).
- By only recording the first non-zero `r` and leaving later successful
  devices untouched
  (`drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6357-6359`), the overall
  reset outcome still reflects a real fault while no longer clobbering
  it with spurious data.
- The change is self-contained (single function, no new APIs, no
  behavior change for the success path), so regression risk is minimal,
  yet it fixes a real user-visible bug in multi-GPU recovery flows—clear
  stable-tree material.

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 25 +++++++++++++---------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c8459337fcb89..690bda2ab8d2b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6338,23 +6338,28 @@ static int amdgpu_device_sched_resume(struct list_head *device_list,
 		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
 			drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
 
-		if (tmp_adev->asic_reset_res)
-			r = tmp_adev->asic_reset_res;
-
-		tmp_adev->asic_reset_res = 0;
-
-		if (r) {
+		if (tmp_adev->asic_reset_res) {
 			/* bad news, how to tell it to userspace ?
 			 * for ras error, we should report GPU bad status instead of
 			 * reset failure
 			 */
 			if (reset_context->src != AMDGPU_RESET_SRC_RAS ||
 			    !amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
-				dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
-					atomic_read(&tmp_adev->gpu_reset_counter));
-			amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
+				dev_info(
+					tmp_adev->dev,
+					"GPU reset(%d) failed with error %d \n",
+					atomic_read(
+						&tmp_adev->gpu_reset_counter),
+					tmp_adev->asic_reset_res);
+			amdgpu_vf_error_put(tmp_adev,
+					    AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0,
+					    tmp_adev->asic_reset_res);
+			if (!r)
+				r = tmp_adev->asic_reset_res;
+			tmp_adev->asic_reset_res = 0;
 		} else {
-			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
+			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n",
+				 atomic_read(&tmp_adev->gpu_reset_counter));
 			if (amdgpu_acpi_smart_shift_update(tmp_adev,
 							   AMDGPU_SS_DEV_D0))
 				dev_warn(tmp_adev->dev,
-- 
2.51.0


  parent reply	other threads:[~2025-10-26 14:50 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-26 14:48 [PATCH AUTOSEL 6.17-5.4] ACPI: property: Return present device nodes only on fwnode interface Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17-5.4] ceph: add checking of wait_for_completion_killable() return value Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17-5.4] 9p: sysfs_init: don't hardcode error to ENOMEM Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17-5.10] um: Fix help message for ssl-non-raw Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17] clk: thead: th1520-ap: set all AXI clocks to CLK_IS_CRITICAL Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17-6.1] NTB: epf: Allow arbitrary BAR mapping Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17] rtc: zynqmp: Restore alarm functionality after kexec transition Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17] hyperv: Add missing field to hv_output_map_device_interrupt Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17-5.4] fbdev: Add bounds checking in bit_putcs to fix vmalloc-out-of-bounds Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17] fbdev: core: Fix ubsan warning in pixel_to_pat Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17-5.10] ASoC: meson: aiu-encoder-i2s: fix bit clock polarity Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17-5.4] fs/hpfs: Fix error code for new_inode() failure in mkdir/create/mknod/symlink Sasha Levin
2025-10-26 14:48 ` Sasha Levin [this message]
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17-5.15] clk: ti: am33xx: keep WKUP_DEBUGSS_CLKCTRL enabled Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17] clk: at91: add ACR in all PLL settings Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17-6.12] clk: scmi: Add duty cycle ops only when duty cycle is supported Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17-5.10] ARM: at91: pm: save and restore ACR during PLL disable/enable Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17-6.6] rtc: pcf2127: fix watchdog interrupt mask on pcf2131 Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17-5.15] clk: at91: clk-master: Add check for divide by 3 Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17-5.15] rtc: pcf2127: clear minute/second interrupt Sasha Levin
2025-10-26 14:48 ` [PATCH AUTOSEL 6.17-6.12] clk: at91: sam9x7: Add peripheral clock id for pmecc Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-5.4] 9p: fix /sys/fs/9p/caches overwriting itself Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.12] 9p/trans_fd: p9_fd_request: kick rx thread if EPOLLIN Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17] clk: samsung: exynos990: Add missing USB clock registers to HSI0 Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17] clocksource: hyper-v: Skip unnecessary checks for the root partition Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.12] ceph: fix multifs mds auth caps issue Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.12] LoongArch: Handle new atomic instructions for probes Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.6] ceph: refactor wake_up_bit() pattern of calling Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.12] drm/amdkfd: Fix mmap write lock not release Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.12] ceph: fix potential race condition in ceph_ioctl_lazyio() Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.12] clk: qcom: gcc-ipq6018: rework nss_port5 clock to multiple conf Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.1] clk: at91: clk-sam9x60-pll: force write to PLL_UPDT register Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-5.4] tools bitmap: Add missing asm-generic/bitsperlong.h include Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17] ALSA: hda/realtek: Add quirk for ASUS ROG Zephyrus Duo Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.1] kbuild: uapi: Strip comments before size type check Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.1] tools: lib: thermal: don't preserve owner in install Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.1] scsi: ufs: core: Include UTP error in INT_FATAL_ERRORS Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.1] clk: sunxi-ng: sun6i-rtc: Add A523 specifics Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.12] clk: scmi: migrate round_rate() to determine_rate() Sasha Levin
2025-10-26 23:16   ` Brian Masney
2025-10-28 17:47     ` Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.12] clk: clocking-wizard: Fix output clock register offset for Versal platforms Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17] ASoC: rt722: add settings for rt722VB Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-5.15] cpufreq: tegra186: Initialize all cores to max frequencies Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.1] tools: lib: thermal: use pkg-config to locate libnl3 Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17] clk: renesas: rzv2h: Re-assert reset on deassert timeout Sasha Levin
2025-10-26 14:49 ` [PATCH AUTOSEL 6.17-6.12] net: wwan: t7xx: add support for HP DRMR-H01 Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251026144958.26750-13-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=Hawking.Zhang@amd.com \
    --cc=alexander.deucher@amd.com \
    --cc=alexandre.f.demers@gmail.com \
    --cc=asad.kamal@amd.com \
    --cc=cesun102@amd.com \
    --cc=christian.koenig@amd.com \
    --cc=lijo.lazar@amd.com \
    --cc=mario.limonciello@amd.com \
    --cc=patches@lists.linux.dev \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).