From: Sasha Levin
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Xiang Liu, Hawking Zhang, Alex Deucher, Sasha Levin, tao.zhou1@amd.com, ganglxie@amd.com, lijo.lazar@amd.com, candice.li@amd.com, alexandre.f.demers@gmail.com
Subject: [PATCH AUTOSEL 6.17] drm/amdgpu: Notify pmfw bad page threshold exceeded
Date: Sat, 25 Oct 2025 11:59:33 -0400
Message-ID: <20251025160905.3857885-342-sashal@kernel.org>
X-Mailer: git-send-email 2.51.0
In-Reply-To: <20251025160905.3857885-1-sashal@kernel.org>
References: <20251025160905.3857885-1-sashal@kernel.org>
Precedence: bulk
X-Mailing-List: patches@lists.linux.dev
MIME-Version: 1.0
X-stable: review
X-Patchwork-Hint: Ignore
X-stable-base: Linux 6.17.5
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Xiang Liu

[ Upstream commit c8d6e90abe50377110f92702fbebc6efdd22391d ]

Notify pmfw when bad page threshold is exceeded, no matter the module
parameter 'bad_page_threshold' is set or not.

Signed-off-by: Xiang Liu
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher
Signed-off-by: Sasha Levin
---

LLM Generated explanations, may be completely bogus:

YES

**Rationale**

- What changed
  - The call to notify the PMFW/SMU about an RMA reason
    (`amdgpu_dpm_send_rma_reason(adev)`) is moved outside the inner check
    that previously only executed for user-defined thresholds. Now it runs
    whenever the bad-page threshold is exceeded (and the feature isn’t
    disabled), regardless of whether the module parameter is left at
    default (-1) or formula-based (-2).
  - Reference: `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:772` (inner
    check for user-defined thresholds),
    `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:783` (unconditional
    PMFW notify within the threshold-exceeded block).
- Why it matters (bug fix, not a feature)
  - With the default (-1) or formula-based (-2) settings of
    `bad_page_threshold`, the driver already computes a threshold and warns
    when it’s exceeded, but previously did not always notify PMFW. This
    commit ensures PMFW is notified whenever the bad-page count crosses the
    computed threshold, aligning behavior across configurations and
    avoiding missed PMFW-side actions/telemetry.
  - Threshold semantics are documented and unchanged: -1 (default),
    0 (disable), -2 (formula), N>0 (user-defined). Reference:
    `drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:979` (module param
    description), `drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:980` (parameter
    definition); threshold computation paths:
    `drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3283`,
    `drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3289`,
    `drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3292`.
- Scope and containment
  - The change is confined to a single function in AMDGPU RAS EEPROM
    handling and only adjusts when a single notification is sent. No
    architectural changes, no interface changes.
- Safety and regression risk
  - The PMFW notification path is robust: `amdgpu_dpm_send_rma_reason`
    guards for unsupported SW SMU and returns `-EOPNOTSUPP`; the caller
    ignores such failures by design (see comment just above the call).
    References: `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:782`
    (comment “ignore the -ENOTSUPP”),
    `drivers/gpu/drm/amd/pm/amdgpu_dpm.c:760` (unsupported check),
    `drivers/gpu/drm/amd/pm/amdgpu_dpm.c:763` (mutex),
    `drivers/gpu/drm/amd/pm/amdgpu_dpm.c:764` (SMU call),
    `drivers/gpu/drm/amd/pm/amdgpu_dpm.c:767` (return).
  - The driver continues to mark RMA in the EEPROM header
    (`ras->is_rma = true` and `header = RAS_TABLE_HDR_BAD`) only for
    user-defined thresholds, unchanged. Reference:
    `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:772` to
    `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:780`.
  - The feature remains disabled when `bad_page_threshold == 0`; the outer
    guard still requires `amdgpu_bad_page_threshold != 0`. Reference:
    `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:763`.
- User impact
  - Fixes a real behavioral gap: in common default/auto modes, PMFW was not
    being notified of threshold-exceeded events. This can affect
    reliability handling/telemetry on systems that rely on PMFW awareness.
    The fix is minimal, localized, and low risk.

Given the small, targeted nature of the fix, its correctness, and low
regression risk, this is a good candidate for stable backport.

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 9bda9ad13f882..88ded6296be34 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -774,9 +774,10 @@ amdgpu_ras_eeprom_update_header(struct amdgpu_ras_eeprom_control *control)
 				control->tbl_rai.health_percent = 0;
 			}
 			ras->is_rma = true;
-			/* ignore the -ENOTSUPP return value */
-			amdgpu_dpm_send_rma_reason(adev);
 		}
+
+		/* ignore the -ENOTSUPP return value */
+		amdgpu_dpm_send_rma_reason(adev);
 	}
 
 	if (control->tbl_hdr.version >= RAS_TABLE_VER_V2_1)
-- 
2.51.0
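
For readers who want to check the control-flow change in isolation, the
following is a minimal user-space sketch of the before/after behaviour
described in the rationale. It is not driver code: `struct ras_model`, its
fields, `user_defined()`, `update_header_before()` and
`update_header_after()` are hypothetical names introduced for illustration,
and the strict `>` comparison against the threshold is an assumption. The
`pmfw_notified` flag only marks the point where
`amdgpu_dpm_send_rma_reason(adev)` would be called.

```c
#include <stdbool.h>

/* Hypothetical user-space model; these are NOT the real amdgpu structures. */
struct ras_model {
	int threshold_param;     /* models 'bad_page_threshold': -1, 0, -2 or N > 0 */
	unsigned int bad_pages;  /* bad pages recorded so far */
	unsigned int threshold;  /* threshold the driver computed or was given */
	bool is_rma;             /* models ras->is_rma */
	bool pmfw_notified;      /* set where amdgpu_dpm_send_rma_reason() would run */
};

/* True for a user-defined threshold, i.e. neither default (-1) nor formula (-2). */
static bool user_defined(int param)
{
	return param != -1 && param != -2;
}

/* Before the patch: PMFW is notified only inside the user-defined branch. */
static void update_header_before(struct ras_model *m)
{
	/* Assumption: "exceeded" means strictly greater than the threshold. */
	if (m->threshold_param != 0 && m->bad_pages > m->threshold) {
		if (user_defined(m->threshold_param)) {
			m->is_rma = true;
			m->pmfw_notified = true;
		}
	}
}

/* After the patch: PMFW is notified whenever the outer guard passes. */
static void update_header_after(struct ras_model *m)
{
	if (m->threshold_param != 0 && m->bad_pages > m->threshold) {
		if (user_defined(m->threshold_param))
			m->is_rma = true;

		m->pmfw_notified = true;
	}
}
```

With `threshold_param = -1`, `bad_pages = 11` and `threshold = 10`, the
"before" model leaves `pmfw_notified` false while the "after" model sets
it. In both models `is_rma` stays false for -1 and -2, and nothing happens
when `threshold_param` is 0.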