From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Aaron Plattner <aplattner@nvidia.com>,
Timur Tabi <ttabi@nvidia.com>, Guenter Roeck <linux@roeck-us.net>,
Wim Van Sebroeck <wim@linux-watchdog.org>,
Sasha Levin <sashal@kernel.org>,
linux-watchdog@vger.kernel.org
Subject: [PATCH AUTOSEL 6.16-5.15] watchdog: sbsa: Adjust keepalive timeout to avoid MediaTek WS0 race condition
Date: Tue, 5 Aug 2025 09:08:47 -0400 [thread overview]
Message-ID: <20250805130945.471732-12-sashal@kernel.org> (raw)
In-Reply-To: <20250805130945.471732-1-sashal@kernel.org>
From: Aaron Plattner <aplattner@nvidia.com>
[ Upstream commit 48defdf6b083f74a44e1f742db284960d3444aec ]
The MediaTek implementation of the sbsa_gwdt watchdog has a race
condition where a write to SBSA_GWDT_WRR is ignored if it occurs while
the hardware is processing a timeout refresh that asserts WS0.
Detect this based on the hardware implementer and adjust
wdd->min_hw_heartbeat_ms to avoid the race by forcing the keepalive ping
to be one second later.
Signed-off-by: Aaron Plattner <aplattner@nvidia.com>
Acked-by: Timur Tabi <ttabi@nvidia.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20250721230640.2244915-1-aplattner@nvidia.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
**Backport Status: YES**
This commit should be backported to stable kernel trees. Here's my
extensive analysis:
## Nature of the Fix
This is a **hardware-specific bug fix** that addresses a race condition
in MediaTek's implementation of the SBSA generic watchdog. The race
occurs when a watchdog keepalive ping (`writel(0, gwdt->refresh_base +
SBSA_GWDT_WRR)`) happens simultaneously with the hardware asserting the
WS0 timeout signal. In affected hardware, this write is silently
ignored, potentially leading to unexpected system resets.
## Key Code Changes Analysis
1. **Hardware Detection** (lines 78-79, 217-219):
- Adds `SBSA_GWDT_IMPL_MEDIATEK` constant (0x426) to identify
MediaTek hardware
- Reads hardware implementer ID from `SBSA_GWDT_W_IIDR` register
- Sets `need_ws0_race_workaround` flag only for MediaTek
implementations when `action=0`
2. **Workaround Implementation** (lines 171-191):
- Adjusts `wdd->min_hw_heartbeat_ms` to `timeout * 500 + 1000`
- This forces keepalive pings to occur 1 second after the WOR timeout
- Avoids the exact moment when WS0 is being asserted
3. **Minimum Timeout Adjustment** (lines 337-343):
- Sets minimum timeout to 3 seconds for affected hardware
- Ensures WOR is set to 1.5 seconds with heartbeats every 2.5 seconds
## Why This Qualifies for Stable Backport
1. **Fixes a Real Bug**: Addresses a hardware race condition that can
cause **unexpected system resets** - a critical reliability issue in
production systems.
2. **Limited Scope**:
- Only affects MediaTek SBSA watchdog implementations
- Workaround is conditionally applied based on hardware detection
- No impact on other implementations
3. **Minimal Risk**:
- Changes are confined to timing adjustments
- No architectural changes or new features
- Preserves existing behavior for all non-MediaTek hardware
4. **Clear Hardware Issue**: The commit message explicitly describes the
race condition where "a write to SBSA_GWDT_WRR is ignored if it
occurs while the hardware is processing a timeout refresh that
asserts WS0."
5. **Production Impact**: Without this fix, systems with MediaTek SBSA
watchdog can experience spurious reboots when the race condition is
triggered, affecting system availability.
6. **Follows Stable Rules**:
- Fixes an actual bug (not adding features)
- Small, contained change (~50 lines)
- Already reviewed by subsystem maintainers
- Hardware-specific fix with clear problem/solution
## Comparison with Similar Commits
Looking at recent watchdog commits that were backported:
- `c6e89348fd58`: Fixed incorrect timeout calculations - backported with
Cc: stable
- `041fdbe73de9`: Fixed timeout programming limits - backported to
stable
This commit follows the same pattern of fixing hardware-specific bugs
that affect system reliability.
The fix is essential for MediaTek-based systems using SBSA watchdog to
prevent unexpected reboots caused by the hardware race condition.
drivers/watchdog/sbsa_gwdt.c | 50 +++++++++++++++++++++++++++++++++---
1 file changed, 47 insertions(+), 3 deletions(-)
diff --git a/drivers/watchdog/sbsa_gwdt.c b/drivers/watchdog/sbsa_gwdt.c
index 5f23913ce3b4..6ce1bfb39064 100644
--- a/drivers/watchdog/sbsa_gwdt.c
+++ b/drivers/watchdog/sbsa_gwdt.c
@@ -75,11 +75,17 @@
#define SBSA_GWDT_VERSION_MASK 0xF
#define SBSA_GWDT_VERSION_SHIFT 16
+#define SBSA_GWDT_IMPL_MASK 0x7FF
+#define SBSA_GWDT_IMPL_SHIFT 0
+#define SBSA_GWDT_IMPL_MEDIATEK 0x426
+
/**
* struct sbsa_gwdt - Internal representation of the SBSA GWDT
* @wdd: kernel watchdog_device structure
* @clk: store the System Counter clock frequency, in Hz.
* @version: store the architecture version
+ * @need_ws0_race_workaround:
+ * indicate whether to adjust wdd->timeout to avoid a race with WS0
* @refresh_base: Virtual address of the watchdog refresh frame
* @control_base: Virtual address of the watchdog control frame
*/
@@ -87,6 +93,7 @@ struct sbsa_gwdt {
struct watchdog_device wdd;
u32 clk;
int version;
+ bool need_ws0_race_workaround;
void __iomem *refresh_base;
void __iomem *control_base;
};
@@ -161,6 +168,31 @@ static int sbsa_gwdt_set_timeout(struct watchdog_device *wdd,
*/
sbsa_gwdt_reg_write(((u64)gwdt->clk / 2) * timeout, gwdt);
+ /*
+ * Some watchdog hardware has a race condition where it will ignore
+ * sbsa_gwdt_keepalive() if it is called at the exact moment that a
+ * timeout occurs and WS0 is being asserted. Unfortunately, the default
+ * behavior of the watchdog core is very likely to trigger this race
+ * when action=0 because it programs WOR to be half of the desired
+ * timeout, and watchdog_next_keepalive() chooses the exact same time to
+ * send keepalive pings.
+ *
+ * This triggers a race where sbsa_gwdt_keepalive() can be called right
+ * as WS0 is being asserted, and affected hardware will ignore that
+ * write and continue to assert WS0. After another (timeout / 2)
+ * seconds, the same race happens again. If the driver wins then the
+ * explicit refresh will reset WS0 to false but if the hardware wins,
+ * then WS1 is asserted and the system resets.
+ *
+ * Avoid the problem by scheduling keepalive heartbeats one second later
+ * than the WOR timeout.
+ *
+ * This workaround might not be needed in a future revision of the
+ * hardware.
+ */
+ if (gwdt->need_ws0_race_workaround)
+ wdd->min_hw_heartbeat_ms = timeout * 500 + 1000;
+
return 0;
}
@@ -202,12 +234,15 @@ static int sbsa_gwdt_keepalive(struct watchdog_device *wdd)
static void sbsa_gwdt_get_version(struct watchdog_device *wdd)
{
struct sbsa_gwdt *gwdt = watchdog_get_drvdata(wdd);
- int ver;
+ int iidr, ver, impl;
- ver = readl(gwdt->control_base + SBSA_GWDT_W_IIDR);
- ver = (ver >> SBSA_GWDT_VERSION_SHIFT) & SBSA_GWDT_VERSION_MASK;
+ iidr = readl(gwdt->control_base + SBSA_GWDT_W_IIDR);
+ ver = (iidr >> SBSA_GWDT_VERSION_SHIFT) & SBSA_GWDT_VERSION_MASK;
+ impl = (iidr >> SBSA_GWDT_IMPL_SHIFT) & SBSA_GWDT_IMPL_MASK;
gwdt->version = ver;
+ gwdt->need_ws0_race_workaround =
+ !action && (impl == SBSA_GWDT_IMPL_MEDIATEK);
}
static int sbsa_gwdt_start(struct watchdog_device *wdd)
@@ -299,6 +334,15 @@ static int sbsa_gwdt_probe(struct platform_device *pdev)
else
wdd->max_hw_heartbeat_ms = GENMASK_ULL(47, 0) / gwdt->clk * 1000;
+ if (gwdt->need_ws0_race_workaround) {
+ /*
+ * A timeout of 3 seconds means that WOR will be set to 1.5
+ * seconds and the heartbeat will be scheduled every 2.5
+ * seconds.
+ */
+ wdd->min_timeout = 3;
+ }
+
status = readl(cf_base + SBSA_GWDT_WCS);
if (status & SBSA_GWDT_WCS_WS1) {
dev_warn(dev, "System reset by WDT.\n");
--
2.39.5
next prev parent reply other threads:[~2025-08-05 13:10 UTC|newest]
Thread overview: 73+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-05 13:08 [PATCH AUTOSEL 6.16-6.6] mfd: axp20x: Set explicit ID for AXP313 regulator Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] MIPS: vpe-mt: add missing prototypes for vpe_{alloc,start,stop,free} Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.10] leds: leds-lp50xx: Handle reg to get correct multi_index Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] scsi: bfa: Double-free fix Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] pinctrl: stm32: Manage irq affinity settings Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16] PCI: dw-rockchip: Delay link training after hot reset in EP mode Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.6] phy: rockchip-pcie: Properly disable TEST_WRITE strobe signal Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.6] soundwire: Move handle_nested_irq outside of sdw_dev_lock Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] media: uvcvideo: Fix bandwidth issue for Alcor camera Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.15] crypto: hisilicon/hpre - fix dma unmap sequence Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.6] soundwire: amd: serialize amd manager resume sequence during pm_prepare Sasha Levin
2025-08-05 13:08 ` Sasha Levin [this message]
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.6] clk: qcom: ipq5018: keep XO clock always on Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16] media: i2c: vd55g1: Fix RATE macros not being expressed in bps Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] media: usb: hdpvr: disable zero-length read messages Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.15] media: raspberrypi: cfe: Fix min_reqbufs_allocation Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.1] hwmon: (emc2305) Set initial PWM minimum value during probe based on thermal state Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.12] media: uvcvideo: Add quirk for HP Webcam HD 2300 Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.1] drm/amd/display: Only finalize atomic_obj if it was initialized Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] vhost: fail early when __vhost_add_used() fails Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.12] scsi: lpfc: Ensure HBA_SETUP flag is used only for SLI4 in dev_loss_tmo_callbk Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16] ext4: limit the maximum folio order Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] fs/orangefs: use snprintf() instead of sprintf() Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16] crypto: caam - Support iMX8QXP and variants thereof Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] crypto: ccp - Add missing bootloader info reg for pspv6 Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] scsi: lpfc: Check for hdwq null ptr when cleaning up lpfc_vport structure Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] media: dvb-frontends: dib7090p: fix null-ptr-deref in dib7090p_rw_on_apb() Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] scsi: pm80xx: Free allocated tags after failure Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] HID: rate-limit hid_warn to prevent log flooding Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16] media: i2c: vd55g1: Setup sensor external clock before patching Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.15] watchdog: iTCO_wdt: Report error if timeout configuration fails Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] media: iris: Add handling for corrupt and drop frames Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] phy: rockchip-pcie: Enable all four lanes if required Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] watchdog: dw_wdt: Fix default timeout Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] MIPS: Don't crash in stack_top() for tasks without ABI or vDSO Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.6] crypto: jitter - fix intermediary handling Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.1] vsock/virtio: Resize receive buffers so that each SKB fits in a 4K page Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.1] MIPS: lantiq: falcon: sysctrl: fix request memory check logic Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] media: tc358743: Check I2C succeeded during probe Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.1] scsi: mpi3mr: Correctly handle ATA device errors Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] clk: renesas: rzg2l: Postpone updating priv->clks[] Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] scsi: mpt3sas: Correctly handle ATA device errors Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] smb: client: fix session setup against servers that require SPN Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.1] fbdev: fix potential buffer overflow in do_register_framebuffer() Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] sphinx: kernel_abi: fix performance regression with O=<dir> Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] media: tc358743: Return an appropriate colorspace from tc358743_set_fmt Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.6] drm/amd/display: Avoid configuring PSR granularity if PSR-SU not supported Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] media: tc358743: Increase FIFO trigger level to 374 Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] jfs: truncate good inode pages when hard link is 0 Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.15] media: v4l2-common: Reduce warnings about missing V4L2_CID_LINK_FREQ control Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.1] dmaengine: stm32-dma: configure next sg only if there are more than 2 sgs Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] RDMA/bnxt_re: Fix size of uverbs_copy_to() in BNXT_RE_METHOD_GET_TOGGLE_MEM Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] cifs: Fix calling CIFSFindFirst() for root path without msearch Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.10] RDMA/core: reduce stack using in nldev_stat_get_doit() Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] scsi: libiscsi: Initialize iscsi_conn->dd_data only if memory is allocated Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] media: dvb-frontends: w7090p: fix null-ptr-deref in w7090p_tuner_write_serpar and w7090p_tuner_read_serpar Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] soundwire: amd: cancel pending slave status handling workqueue during remove sequence Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.6] PCI: xgene-msi: Resend an MSI racing with itself on a different CPU Sasha Levin
2025-08-05 13:20 ` Marc Zyngier
2025-08-05 13:59 ` Sasha Levin
2025-08-05 18:09 ` Marc Zyngier
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.6] clk: tegra: periph: Fix error handling and resolve unsigned compare warning Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] drm/amd/display: Disable dsc_power_gate for dcn314 by default Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] RDMA: hfi1: fix possible divide-by-zero in find_hw_thread_mask() Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.15] crypto: octeontx2 - add timeout for load_fvc completion poll Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.6] power: supply: qcom_battmgr: Add lithium-polymer entry Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] media: ipu-bridge: Add _HID for OV5670 Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] media: hi556: Fix reset GPIO timings Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] clk: thead: Mark essential bus clocks as CLK_IGNORE_UNUSED Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] media: uvcvideo: Set V4L2_CTRL_FLAG_DISABLED during queryctrl errors Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] jfs: Regular file corruption check Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] jfs: upper bound check of tree index in dbAllocAG Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250805130945.471732-12-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=aplattner@nvidia.com \
--cc=linux-watchdog@vger.kernel.org \
--cc=linux@roeck-us.net \
--cc=patches@lists.linux.dev \
--cc=stable@vger.kernel.org \
--cc=ttabi@nvidia.com \
--cc=wim@linux-watchdog.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox