From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Aaron Plattner <aplattner@nvidia.com>,
Timur Tabi <ttabi@nvidia.com>, Guenter Roeck <linux@roeck-us.net>,
Wim Van Sebroeck <wim@linux-watchdog.org>,
Sasha Levin <sashal@kernel.org>,
linux-watchdog@vger.kernel.org
Subject: [PATCH AUTOSEL 6.16-5.15] watchdog: sbsa: Adjust keepalive timeout to avoid MediaTek WS0 race condition
Date: Tue, 5 Aug 2025 09:08:47 -0400 [thread overview]
Message-ID: <20250805130945.471732-12-sashal@kernel.org> (raw)
In-Reply-To: <20250805130945.471732-1-sashal@kernel.org>
From: Aaron Plattner <aplattner@nvidia.com>
[ Upstream commit 48defdf6b083f74a44e1f742db284960d3444aec ]
The MediaTek implementation of the sbsa_gwdt watchdog has a race
condition where a write to SBSA_GWDT_WRR is ignored if it occurs while
the hardware is processing a timeout refresh that asserts WS0.
Detect this based on the hardware implementer and adjust
wdd->min_hw_heartbeat_ms to avoid the race by forcing the keepalive ping
to be one second later.
Signed-off-by: Aaron Plattner <aplattner@nvidia.com>
Acked-by: Timur Tabi <ttabi@nvidia.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20250721230640.2244915-1-aplattner@nvidia.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
**Backport Status: YES**
This commit should be backported to stable kernel trees. Here's my
extensive analysis:
## Nature of the Fix
This is a **hardware-specific bug fix** that addresses a race condition
in MediaTek's implementation of the SBSA generic watchdog. The race
occurs when a watchdog keepalive ping (`writel(0, gwdt->refresh_base +
SBSA_GWDT_WRR)`) happens simultaneously with the hardware asserting the
WS0 timeout signal. In affected hardware, this write is silently
ignored, potentially leading to unexpected system resets.
## Key Code Changes Analysis
1. **Hardware Detection** (lines 78-79, 217-219):
- Adds `SBSA_GWDT_IMPL_MEDIATEK` constant (0x426) to identify
MediaTek hardware
- Reads hardware implementer ID from `SBSA_GWDT_W_IIDR` register
- Sets `need_ws0_race_workaround` flag only for MediaTek
implementations when `action=0`
2. **Workaround Implementation** (lines 171-191):
- Adjusts `wdd->min_hw_heartbeat_ms` to `timeout * 500 + 1000`
- This forces keepalive pings to occur 1 second after the WOR timeout
- Avoids the exact moment when WS0 is being asserted
3. **Minimum Timeout Adjustment** (lines 337-343):
- Sets minimum timeout to 3 seconds for affected hardware
- Ensures WOR is set to 1.5 seconds with heartbeats every 2.5 seconds
## Why This Qualifies for Stable Backport
1. **Fixes a Real Bug**: Addresses a hardware race condition that can
cause **unexpected system resets** - a critical reliability issue in
production systems.
2. **Limited Scope**:
- Only affects MediaTek SBSA watchdog implementations
- Workaround is conditionally applied based on hardware detection
- No impact on other implementations
3. **Minimal Risk**:
- Changes are confined to timing adjustments
- No architectural changes or new features
- Preserves existing behavior for all non-MediaTek hardware
4. **Clear Hardware Issue**: The commit message explicitly describes the
race condition where "a write to SBSA_GWDT_WRR is ignored if it
occurs while the hardware is processing a timeout refresh that
asserts WS0."
5. **Production Impact**: Without this fix, systems with MediaTek SBSA
watchdog can experience spurious reboots when the race condition is
triggered, affecting system availability.
6. **Follows Stable Rules**:
- Fixes an actual bug (not adding features)
- Small, contained change (~50 lines)
- Already reviewed by subsystem maintainers
- Hardware-specific fix with clear problem/solution
## Comparison with Similar Commits
Looking at recent watchdog commits that were backported:
- `c6e89348fd58`: Fixed incorrect timeout calculations - backported with
Cc: stable
- `041fdbe73de9`: Fixed timeout programming limits - backported to
stable
This commit follows the same pattern of fixing hardware-specific bugs
that affect system reliability.
The fix is essential for MediaTek-based systems using SBSA watchdog to
prevent unexpected reboots caused by the hardware race condition.
drivers/watchdog/sbsa_gwdt.c | 50 +++++++++++++++++++++++++++++++++---
1 file changed, 47 insertions(+), 3 deletions(-)
diff --git a/drivers/watchdog/sbsa_gwdt.c b/drivers/watchdog/sbsa_gwdt.c
index 5f23913ce3b4..6ce1bfb39064 100644
--- a/drivers/watchdog/sbsa_gwdt.c
+++ b/drivers/watchdog/sbsa_gwdt.c
@@ -75,11 +75,17 @@
#define SBSA_GWDT_VERSION_MASK 0xF
#define SBSA_GWDT_VERSION_SHIFT 16
+#define SBSA_GWDT_IMPL_MASK 0x7FF
+#define SBSA_GWDT_IMPL_SHIFT 0
+#define SBSA_GWDT_IMPL_MEDIATEK 0x426
+
/**
* struct sbsa_gwdt - Internal representation of the SBSA GWDT
* @wdd: kernel watchdog_device structure
* @clk: store the System Counter clock frequency, in Hz.
* @version: store the architecture version
+ * @need_ws0_race_workaround:
+ * indicate whether to adjust wdd->timeout to avoid a race with WS0
* @refresh_base: Virtual address of the watchdog refresh frame
* @control_base: Virtual address of the watchdog control frame
*/
@@ -87,6 +93,7 @@ struct sbsa_gwdt {
struct watchdog_device wdd;
u32 clk;
int version;
+ bool need_ws0_race_workaround;
void __iomem *refresh_base;
void __iomem *control_base;
};
@@ -161,6 +168,31 @@ static int sbsa_gwdt_set_timeout(struct watchdog_device *wdd,
*/
sbsa_gwdt_reg_write(((u64)gwdt->clk / 2) * timeout, gwdt);
+ /*
+ * Some watchdog hardware has a race condition where it will ignore
+ * sbsa_gwdt_keepalive() if it is called at the exact moment that a
+ * timeout occurs and WS0 is being asserted. Unfortunately, the default
+ * behavior of the watchdog core is very likely to trigger this race
+ * when action=0 because it programs WOR to be half of the desired
+ * timeout, and watchdog_next_keepalive() chooses the exact same time to
+ * send keepalive pings.
+ *
+ * This triggers a race where sbsa_gwdt_keepalive() can be called right
+ * as WS0 is being asserted, and affected hardware will ignore that
+ * write and continue to assert WS0. After another (timeout / 2)
+ * seconds, the same race happens again. If the driver wins then the
+ * explicit refresh will reset WS0 to false but if the hardware wins,
+ * then WS1 is asserted and the system resets.
+ *
+ * Avoid the problem by scheduling keepalive heartbeats one second later
+ * than the WOR timeout.
+ *
+ * This workaround might not be needed in a future revision of the
+ * hardware.
+ */
+ if (gwdt->need_ws0_race_workaround)
+ wdd->min_hw_heartbeat_ms = timeout * 500 + 1000;
+
return 0;
}
@@ -202,12 +234,15 @@ static int sbsa_gwdt_keepalive(struct watchdog_device *wdd)
static void sbsa_gwdt_get_version(struct watchdog_device *wdd)
{
struct sbsa_gwdt *gwdt = watchdog_get_drvdata(wdd);
- int ver;
+ int iidr, ver, impl;
- ver = readl(gwdt->control_base + SBSA_GWDT_W_IIDR);
- ver = (ver >> SBSA_GWDT_VERSION_SHIFT) & SBSA_GWDT_VERSION_MASK;
+ iidr = readl(gwdt->control_base + SBSA_GWDT_W_IIDR);
+ ver = (iidr >> SBSA_GWDT_VERSION_SHIFT) & SBSA_GWDT_VERSION_MASK;
+ impl = (iidr >> SBSA_GWDT_IMPL_SHIFT) & SBSA_GWDT_IMPL_MASK;
gwdt->version = ver;
+ gwdt->need_ws0_race_workaround =
+ !action && (impl == SBSA_GWDT_IMPL_MEDIATEK);
}
static int sbsa_gwdt_start(struct watchdog_device *wdd)
@@ -299,6 +334,15 @@ static int sbsa_gwdt_probe(struct platform_device *pdev)
else
wdd->max_hw_heartbeat_ms = GENMASK_ULL(47, 0) / gwdt->clk * 1000;
+ if (gwdt->need_ws0_race_workaround) {
+ /*
+ * A timeout of 3 seconds means that WOR will be set to 1.5
+ * seconds and the heartbeat will be scheduled every 2.5
+ * seconds.
+ */
+ wdd->min_timeout = 3;
+ }
+
status = readl(cf_base + SBSA_GWDT_WCS);
if (status & SBSA_GWDT_WCS_WS1) {
dev_warn(dev, "System reset by WDT.\n");
--
2.39.5
next prev parent reply other threads:[~2025-08-05 13:10 UTC|newest]
Thread overview: 77+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-05 13:08 [PATCH AUTOSEL 6.16-6.6] mfd: axp20x: Set explicit ID for AXP313 regulator Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] MIPS: vpe-mt: add missing prototypes for vpe_{alloc,start,stop,free} Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.10] leds: leds-lp50xx: Handle reg to get correct multi_index Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] scsi: bfa: Double-free fix Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] pinctrl: stm32: Manage irq affinity settings Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16] PCI: dw-rockchip: Delay link training after hot reset in EP mode Sasha Levin
2025-08-05 13:08 ` Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.6] phy: rockchip-pcie: Properly disable TEST_WRITE strobe signal Sasha Levin
2025-08-05 13:08 ` Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.6] soundwire: Move handle_nested_irq outside of sdw_dev_lock Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] media: uvcvideo: Fix bandwidth issue for Alcor camera Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.15] crypto: hisilicon/hpre - fix dma unmap sequence Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.6] soundwire: amd: serialize amd manager resume sequence during pm_prepare Sasha Levin
2025-08-05 13:08 ` Sasha Levin [this message]
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.6] clk: qcom: ipq5018: keep XO clock always on Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16] media: i2c: vd55g1: Fix RATE macros not being expressed in bps Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] media: usb: hdpvr: disable zero-length read messages Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.15] media: raspberrypi: cfe: Fix min_reqbufs_allocation Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.1] hwmon: (emc2305) Set initial PWM minimum value during probe based on thermal state Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.12] media: uvcvideo: Add quirk for HP Webcam HD 2300 Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.1] drm/amd/display: Only finalize atomic_obj if it was initialized Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] vhost: fail early when __vhost_add_used() fails Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.12] scsi: lpfc: Ensure HBA_SETUP flag is used only for SLI4 in dev_loss_tmo_callbk Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16] ext4: limit the maximum folio order Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] fs/orangefs: use snprintf() instead of sprintf() Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16] crypto: caam - Support iMX8QXP and variants thereof Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] crypto: ccp - Add missing bootloader info reg for pspv6 Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] scsi: lpfc: Check for hdwq null ptr when cleaning up lpfc_vport structure Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] media: dvb-frontends: dib7090p: fix null-ptr-deref in dib7090p_rw_on_apb() Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] scsi: pm80xx: Free allocated tags after failure Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] HID: rate-limit hid_warn to prevent log flooding Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16] media: i2c: vd55g1: Setup sensor external clock before patching Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.15] watchdog: iTCO_wdt: Report error if timeout configuration fails Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] media: iris: Add handling for corrupt and drop frames Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] phy: rockchip-pcie: Enable all four lanes if required Sasha Levin
2025-08-05 13:09 ` Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] watchdog: dw_wdt: Fix default timeout Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] MIPS: Don't crash in stack_top() for tasks without ABI or vDSO Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.6] crypto: jitter - fix intermediary handling Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.1] vsock/virtio: Resize receive buffers so that each SKB fits in a 4K page Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.1] MIPS: lantiq: falcon: sysctrl: fix request memory check logic Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] media: tc358743: Check I2C succeeded during probe Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.1] scsi: mpi3mr: Correctly handle ATA device errors Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] clk: renesas: rzg2l: Postpone updating priv->clks[] Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] scsi: mpt3sas: Correctly handle ATA device errors Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] smb: client: fix session setup against servers that require SPN Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.1] fbdev: fix potential buffer overflow in do_register_framebuffer() Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] sphinx: kernel_abi: fix performance regression with O=<dir> Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] media: tc358743: Return an appropriate colorspace from tc358743_set_fmt Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.6] drm/amd/display: Avoid configuring PSR granularity if PSR-SU not supported Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] media: tc358743: Increase FIFO trigger level to 374 Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] jfs: truncate good inode pages when hard link is 0 Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.15] media: v4l2-common: Reduce warnings about missing V4L2_CID_LINK_FREQ control Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.1] dmaengine: stm32-dma: configure next sg only if there are more than 2 sgs Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] RDMA/bnxt_re: Fix size of uverbs_copy_to() in BNXT_RE_METHOD_GET_TOGGLE_MEM Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] cifs: Fix calling CIFSFindFirst() for root path without msearch Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.10] RDMA/core: reduce stack using in nldev_stat_get_doit() Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] scsi: libiscsi: Initialize iscsi_conn->dd_data only if memory is allocated Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] media: dvb-frontends: w7090p: fix null-ptr-deref in w7090p_tuner_write_serpar and w7090p_tuner_read_serpar Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] soundwire: amd: cancel pending slave status handling workqueue during remove sequence Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.6] PCI: xgene-msi: Resend an MSI racing with itself on a different CPU Sasha Levin
2025-08-05 13:20 ` Marc Zyngier
2025-08-05 13:59 ` Sasha Levin
2025-08-05 18:09 ` Marc Zyngier
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.6] clk: tegra: periph: Fix error handling and resolve unsigned compare warning Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] drm/amd/display: Disable dsc_power_gate for dcn314 by default Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] RDMA: hfi1: fix possible divide-by-zero in find_hw_thread_mask() Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.15] crypto: octeontx2 - add timeout for load_fvc completion poll Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.6] power: supply: qcom_battmgr: Add lithium-polymer entry Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] media: ipu-bridge: Add _HID for OV5670 Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] media: hi556: Fix reset GPIO timings Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.12] clk: thead: Mark essential bus clocks as CLK_IGNORE_UNUSED Sasha Levin
2025-08-05 13:09 ` Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] media: uvcvideo: Set V4L2_CTRL_FLAG_DISABLED during queryctrl errors Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] jfs: Regular file corruption check Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] jfs: upper bound check of tree index in dbAllocAG Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250805130945.471732-12-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=aplattner@nvidia.com \
--cc=linux-watchdog@vger.kernel.org \
--cc=linux@roeck-us.net \
--cc=patches@lists.linux.dev \
--cc=stable@vger.kernel.org \
--cc=ttabi@nvidia.com \
--cc=wim@linux-watchdog.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.