public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: "Jake Hillion" <jake@hillion.co.uk>,
	"Blaise Sanouillet" <linux@blaise.sanouillet.com>,
	"Suma Hegde" <suma.hegde@amd.com>,
	"Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>,
	"Sasha Levin" <sashal@kernel.org>,
	naveenkrishna.chatradhi@amd.com,
	platform-driver-x86@vger.kernel.org
Subject: [PATCH AUTOSEL 6.15 01/20] x86/platform/amd: move final timeout check to after final sleep
Date: Tue, 24 Jun 2025 00:11:00 -0400	[thread overview]
Message-ID: <20250624041120.83191-1-sashal@kernel.org> (raw)

From: Jake Hillion <jake@hillion.co.uk>

[ Upstream commit f8afb12a2d7503de6558c23cacd7acbf6e9fe678 ]

__hsmp_send_message sleeps between result read attempts and has a
timeout of 100ms. Under extreme load it's possible for these sleeps to
take a long time, exceeding the 100ms. In this case the current code
does not check the register and fails with ETIMEDOUT.

Refactor the loop to ensure there is at least one read of the register
after a sleep of any duration. This removes instances of ETIMEDOUT with
a single caller, even with a misbehaving scheduler. Tested on AMD
Bergamo machines.

Suggested-by: Blaise Sanouillet <linux@blaise.sanouillet.com>
Reviewed-by: Suma Hegde <suma.hegde@amd.com>
Tested-by: Suma Hegde <suma.hegde@amd.com>
Signed-off-by: Jake Hillion <jake@hillion.co.uk>
Link: https://lore.kernel.org/r/20250605-amd-hsmp-v2-1-a811bc3dd74a@hillion.co.uk
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

**YES**

This commit should be backported to stable kernel trees. Here's my
extensive analysis:

## Bug Analysis

The commit fixes a critical race condition in the AMD HSMP driver's
timeout handling. The issue occurs in the `__hsmp_send_message()`
function where:

1. **Original code flaw**: The loop checks `time_before(jiffies,
   timeout)` at the start of each iteration, then sleeps, but doesn't
   check the register one final time after the last sleep completes.

2. **Race condition scenario**: Under heavy system load (as mentioned in
   the commit message - "Under extreme load"), the sleep operations
   (`usleep_range()`) can take significantly longer than intended. If
   the final sleep extends beyond the 100ms timeout window, the code
   exits with `-ETIMEDOUT` without checking if the SMU actually
   responded during that extended sleep.

3. **The fix**: Changes the loop from `while (time_before(jiffies,
   timeout))` to `while (true)` and moves the timeout check to after the
   register read, ensuring at least one register check occurs after any
   sleep duration.

## Why This Qualifies for Stable Backport

1. **Fixes a real bug affecting users**: The commit message explicitly
   states this was "Tested on AMD Bergamo machines" and fixes actual
   instances of ETIMEDOUT errors. AMD Bergamo is a server processor used
   in production data centers.

2. **Small, contained fix**: The change is minimal - just restructuring
   the loop logic without changing functionality. This meets the stable
   kernel criteria of being a small, obvious fix.

3. **No new features**: This purely fixes existing behavior without
   adding capabilities.

4. **Critical subsystem**: The HSMP driver is essential for AMD EPYC
   server management, controlling:
   - Power management
   - Thermal monitoring
   - Performance states
   - System telemetry

5. **Production impact**: Spurious `-ETIMEDOUT` errors would cause
   failures in:
   - Data center management tools
   - Power capping operations
   - Performance monitoring
   - Thermal management

6. **Clear problem and solution**: The race condition is well-
   understood, and the fix ensures the code behaves as intended -
   checking the register after sleeping rather than potentially timing
   out without a final check.

## Risk Assessment

The risk is minimal because:
- The logic change is straightforward and correct
- It's been tested on production AMD Bergamo systems
- It only affects the timeout path behavior
- The worst case is the same as before (timeout after 100ms)
- The best case fixes false timeouts under load

This is exactly the type of bug fix that stable kernels exist to deliver
- fixing real issues users encounter in production without introducing
new risks.

 drivers/platform/x86/amd/hsmp/hsmp.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/platform/x86/amd/hsmp/hsmp.c b/drivers/platform/x86/amd/hsmp/hsmp.c
index a3ac09a90de45..ab877112f4c80 100644
--- a/drivers/platform/x86/amd/hsmp/hsmp.c
+++ b/drivers/platform/x86/amd/hsmp/hsmp.c
@@ -99,7 +99,7 @@ static int __hsmp_send_message(struct hsmp_socket *sock, struct hsmp_message *ms
 	short_sleep = jiffies + msecs_to_jiffies(HSMP_SHORT_SLEEP);
 	timeout	= jiffies + msecs_to_jiffies(HSMP_MSG_TIMEOUT);
 
-	while (time_before(jiffies, timeout)) {
+	while (true) {
 		ret = sock->amd_hsmp_rdwr(sock, mbinfo->msg_resp_off, &mbox_status, HSMP_RD);
 		if (ret) {
 			dev_err(sock->dev, "Error %d reading mailbox status\n", ret);
@@ -108,6 +108,10 @@ static int __hsmp_send_message(struct hsmp_socket *sock, struct hsmp_message *ms
 
 		if (mbox_status != HSMP_STATUS_NOT_READY)
 			break;
+
+		if (!time_before(jiffies, timeout))
+			break;
+
 		if (time_before(jiffies, short_sleep))
 			usleep_range(50, 100);
 		else
-- 
2.39.5


             reply	other threads:[~2025-06-24  4:11 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-24  4:11 Sasha Levin [this message]
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 02/20] drm/msm: Fix a fence leak in submit error path Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 03/20] drm/msm: Fix another leak in the " Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 04/20] ALSA: sb: Don't allow changing the DMA mode during operations Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 05/20] ALSA: sb: Force to disable DMAs once when DMA mode is changed Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 06/20] ata: libata-acpi: Do not assume 40 wire cable if no devices are enabled Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 07/20] ata: pata_cs5536: fix build on 32-bit UML Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 08/20] ASoC: amd: yc: Add quirk for MSI Bravo 17 D7VF internal mic Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 09/20] platform/x86/amd/pmc: Add PCSpecialist Lafite Pro V 14M to 8042 quirks list Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 10/20] genirq/irq_sim: Initialize work context pointers properly Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 11/20] powerpc: Fix struct termio related ioctl macros Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 12/20] ASoC: amd: yc: update quirk data for HP Victus Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 13/20] regulator: fan53555: add enable_time support and soft-start times Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 14/20] scsi: target: Fix NULL pointer dereference in core_scsi3_decode_spec_i_port() Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 15/20] aoe: defer rexmit timer downdev work to workqueue Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 16/20] wifi: mac80211: drop invalid source address OCB frames Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 17/20] wifi: ath6kl: remove WARN on bad firmware input Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 18/20] ACPICA: Refuse to evaluate a method if arguments are missing Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 19/20] mtd: spinand: fix memory leak of ECC engine conf Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 20/20] rcu: Return early if callback is not specified Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250624041120.83191-1-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=ilpo.jarvinen@linux.intel.com \
    --cc=jake@hillion.co.uk \
    --cc=linux@blaise.sanouillet.com \
    --cc=naveenkrishna.chatradhi@amd.com \
    --cc=patches@lists.linux.dev \
    --cc=platform-driver-x86@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=suma.hegde@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox