From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2D65B23EA89; Sun, 26 Oct 2025 14:50:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761490225; cv=none; b=ANF1zsUsoN4wZYJISb7PkJ2G2CUgCE1TaD9aqI5YPJMOBY9trz63PXQQBOmb1YkAqACGtT6rdWghARcqYtXQ9BGI/nnuEzL04m2zTINn4425QkuJEvD1Elm1LlvHm2s64hULaAPoS/aPdPHYTOty+8cWzlh148iq+v2NsJC1I0Y= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761490225; c=relaxed/simple; bh=nAPdwis2vsUYx9A+DDlbYAp2acEo+E4+SZdScECi7/g=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=Wgjl3MxkcGRBridC7j3/Ce/4DtN1dUCS+BG/QemK53yoAwgHlfKlmmSfvuXNIouhr1iaMkqj/snchCHZRO9mFMRr6e0J41I1hbRr3hFa126yIvMSqRpn/9VR1MQLV2Eog1Nl9y2JTiP1Yngd00nsljyHbb2MgJIeli7Z7UKWgZk= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=lWVOVw27; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="lWVOVw27" Received: by smtp.kernel.org (Postfix) with ESMTPSA id D088EC4CEE7; Sun, 26 Oct 2025 14:50:23 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1761490225; bh=nAPdwis2vsUYx9A+DDlbYAp2acEo+E4+SZdScECi7/g=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=lWVOVw27kIquhwL2M/Je+0DBXAsf9r3fa3XUx/VoLgJbnQehvg9pvr1AjzD1V3srr zD4ui1Nx5l6yvGnrgJUtfw1JgEyAk+6x4Vi3u+H21na7AR888WcPHjuUsK9rdz+SRo uskJ+sT6kCJEWwVPwrIf/CCROqpdJUluqj5/pWgwhvYIdytsliL+tJLOGKtLp72izp 7UrprPvpaTXvlD7e8QtlzJ1Z0NDGGlAm5KqZZOSMH3dew2lKONbBe8uftVeQx2wMb/ EW6Cs5OmLYQdet13x5NL5DKjICl5xMlbuiPg4tqgEY1AERGLkIkHstCTsWoCrRWgFs wVAHBUyNYDchg== From: Sasha Levin To: patches@lists.linux.dev, stable@vger.kernel.org Cc: Lijo Lazar , Asad Kamal , Alex Deucher , Sasha Levin , christian.koenig@amd.com, Hawking.Zhang@amd.com, mario.limonciello@amd.com, alexandre.f.demers@gmail.com, cesun102@amd.com Subject: [PATCH AUTOSEL 6.17] drm/amdgpu: Report individual reset error Date: Sun, 26 Oct 2025 10:48:51 -0400 Message-ID: <20251026144958.26750-13-sashal@kernel.org> X-Mailer: git-send-email 2.51.0 In-Reply-To: <20251026144958.26750-1-sashal@kernel.org> References: <20251026144958.26750-1-sashal@kernel.org> Precedence: bulk X-Mailing-List: patches@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-stable: review X-Patchwork-Hint: Ignore X-stable-base: Linux 6.17.5 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit From: Lijo Lazar [ Upstream commit 2e97663760e5fb7ee14f399c68e57b894f01e505 ] If reinitialization of one of the GPUs fails after reset, it logs failure on all subsequent GPUs eventhough they have resumed successfully. A sample log where only device at 0000:95:00.0 had a failure - amdgpu 0000:15:00.0: amdgpu: GPU reset(19) succeeded! amdgpu 0000:65:00.0: amdgpu: GPU reset(19) succeeded! amdgpu 0000:75:00.0: amdgpu: GPU reset(19) succeeded! amdgpu 0000:85:00.0: amdgpu: GPU reset(19) succeeded! amdgpu 0000:95:00.0: amdgpu: GPU reset(19) failed amdgpu 0000:e5:00.0: amdgpu: GPU reset(19) failed amdgpu 0000:f5:00.0: amdgpu: GPU reset(19) failed amdgpu 0000:05:00.0: amdgpu: GPU reset(19) failed amdgpu 0000:15:00.0: amdgpu: GPU reset end with ret = -5 To avoid confusion, report the error for each device separately and return the first error as the overall result. Signed-off-by: Lijo Lazar Reviewed-by: Asad Kamal Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin --- LLM Generated explanations, may be completely bogus: YES - The fix makes `amdgpu_device_sched_resume()` report failures per- device by gating the failure path on `tmp_adev->asic_reset_res` and logging the GPU-specific errno (`drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6341-6356`). Without this, once any GPU in an XGMI hive fails to reinitialize, every subsequent GPU is logged — and reported to SR-IOV guests — as failed even when it succeeds, exactly matching the problem described in the commit message. - `amdgpu_vf_error_put()` now receives the correct GPU’s error code, preventing benign devices from being flagged as reset failures to the management stack (`drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6354-6356`). - By only recording the first non-zero `r` and leaving later successful devices untouched (`drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6357-6359`), the overall reset outcome still reflects a real fault while no longer clobbering it with spurious data. - The change is self-contained (single function, no new APIs, no behavior change for the success path), so regression risk is minimal, yet it fixes a real user-visible bug in multi-GPU recovery flows—clear stable-tree material. drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 25 +++++++++++++--------- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index c8459337fcb89..690bda2ab8d2b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -6338,23 +6338,28 @@ static int amdgpu_device_sched_resume(struct list_head *device_list, if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled) drm_helper_resume_force_mode(adev_to_drm(tmp_adev)); - if (tmp_adev->asic_reset_res) - r = tmp_adev->asic_reset_res; - - tmp_adev->asic_reset_res = 0; - - if (r) { + if (tmp_adev->asic_reset_res) { /* bad news, how to tell it to userspace ? * for ras error, we should report GPU bad status instead of * reset failure */ if (reset_context->src != AMDGPU_RESET_SRC_RAS || !amdgpu_ras_eeprom_check_err_threshold(tmp_adev)) - dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", - atomic_read(&tmp_adev->gpu_reset_counter)); - amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r); + dev_info( + tmp_adev->dev, + "GPU reset(%d) failed with error %d \n", + atomic_read( + &tmp_adev->gpu_reset_counter), + tmp_adev->asic_reset_res); + amdgpu_vf_error_put(tmp_adev, + AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, + tmp_adev->asic_reset_res); + if (!r) + r = tmp_adev->asic_reset_res; + tmp_adev->asic_reset_res = 0; } else { - dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter)); + dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", + atomic_read(&tmp_adev->gpu_reset_counter)); if (amdgpu_acpi_smart_shift_update(tmp_adev, AMDGPU_SS_DEV_D0)) dev_warn(tmp_adev->dev, -- 2.51.0