* [RFC PATCH RESEND 0/3] net: ath11k: Firmware lockup detection & mitigation
@ 2026-03-30 10:05 Matthew Leach
2026-03-30 10:05 ` [PATCH RESEND RFC 1/3] net: ath11k: fix redundant reset from stale pending workqueue bit Matthew Leach
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Matthew Leach @ 2026-03-30 10:05 UTC
To: Jeff Johnson; +Cc: linux-wireless, ath11k, linux-kernel, kernel, Matthew Leach
After sitting idle for approximately 24 hours, a user's ath11k chip suffered
a firmware lockup, producing the following log output:
systemd-timesyncd[558]: Timed out waiting for reply from 23.95.49.216:123 (2.arch.pool.ntp.org).
systemd-timesyncd[558]: Timed out waiting for reply from 23.186.168.125:123 (2.arch.pool.ntp.org).
systemd-timesyncd[558]: Timed out waiting for reply from 64.79.100.197:123 (2.arch.pool.ntp.org).
systemd-timesyncd[558]: Timed out waiting for reply from 69.89.207.199:123 (2.arch.pool.ntp.org).
kernel: ath11k_pci 0000:03:00.0: failed to transmit frame -12
kernel: ath11k_pci 0000:03:00.0: failed to transmit frame -12
kernel: ath11k_pci 0000:03:00.0: failed to transmit frame -12
[...]
kernel: ath11k_pci 0000:03:00.0: failed to flush transmit queue, data pkts pending 564
kernel: ath11k_pci 0000:03:00.0: wmi command 20486 timeout
kernel: ath11k_pci 0000:03:00.0: failed to submit WMI_VDEV_STOP cmd
kernel: ath11k_pci 0000:03:00.0: failed to stop WMI vdev 0: -11
kernel: ath11k_pci 0000:03:00.0: failed to stop vdev 0: -11
kernel: ath11k_pci 0000:03:00.0: failed to do early vdev stop: -11
kernel: ath11k_pci 0000:03:00.0: Failed to remove station: xx:xx:xx:xx:xx:xx for VDEV: 0
kernel: ath11k_pci 0000:03:00.0: Found peer entry xx:xx:xx:xx:xx:xx n vdev 0 after it was supposedly removed
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 0 PID: 1229 at net/mac80211/sta_info.c:1490 __sta_info_destroy_part2+0x14e/0x180 [mac80211]
This patch series:
- Fixes a bug in the core reset logic which could cause a second redundant reset
after the original reset completes.
- Implements error correlation logic to detect a lockup and queues a chip reset when one is detected.
- Adds a simulation to the simulate_fw_crash debugfs file to test the
detection logic.
Signed-off-by: Matthew Leach <matthew.leach@collabora.com>
---
Matthew Leach (3):
net: ath11k: fix redundant reset from stale pending workqueue bit
net: ath11k: add firmware lockup detection and recovery
net: ath11k: add lockup simulation via debugfs
drivers/net/wireless/ath/ath11k/core.h | 3 +++
drivers/net/wireless/ath/ath11k/debugfs.c | 7 ++++++-
drivers/net/wireless/ath/ath11k/hal.c | 7 +++++--
drivers/net/wireless/ath/ath11k/htc.c | 2 +-
drivers/net/wireless/ath/ath11k/mac.c | 10 ++++++++++
drivers/net/wireless/ath/ath11k/wmi.c | 28 +++++++++++++++++++++++++++-
6 files changed, 52 insertions(+), 5 deletions(-)
---
base-commit: 11439c4635edd669ae435eec308f4ab8a0804808
change-id: 20260304-ath11k-lockup-fixes-b808b5c7318b
Best regards,
--
Matt
* [PATCH RESEND RFC 1/3] net: ath11k: fix redundant reset from stale pending workqueue bit
2026-03-30 10:05 [RFC PATCH RESEND 0/3] net: ath11k: Firmware lockup detection & mitigation Matthew Leach
@ 2026-03-30 10:05 ` Matthew Leach
2026-05-12 23:09 ` Jeff Johnson
2026-03-30 10:05 ` [PATCH RESEND RFC 2/3] net: ath11k: add firmware lockup detection and recovery Matthew Leach
2026-03-30 10:05 ` [PATCH RESEND RFC 3/3] net: ath11k: add lockup simulation via debugfs Matthew Leach
2 siblings, 1 reply; 6+ messages in thread
From: Matthew Leach @ 2026-03-30 10:05 UTC
To: Jeff Johnson; +Cc: linux-wireless, ath11k, linux-kernel, kernel, Matthew Leach
During a firmware lockup, WMI commands time out in rapid succession,
each calling queue_work() to schedule ath11k_core_reset(). This can
cause a spurious extra reset after recovery completes:
1. First WMI timeout calls queue_work(), sets the pending bit and
schedules ath11k_core_reset(). The workqueue clears the pending bit
before invoking the work function. reset_count becomes 1 and the reset
is kicked off asynchronously. ath11k_core_reset() returns.
2. Second WMI timeout calls queue_work() and re-queues the work. When it
runs after step 1 returns, it sees reset_count > 1 and blocks in
wait_for_completion(). The pending bit is again cleared.
3. Third WMI timeout calls queue_work(). The pending bit was cleared in
step 2, so this succeeds and arms another execution.
4. The asynchronous reset finishes. ath11k_mac_op_reconfig_complete()
decrements reset_count and calls complete(). The blocked worker from
step 2 wakes, takes the early-exit path, and decrements reset_count to
0.
5. The workqueue sees the pending bit from step 3 and runs
ath11k_core_reset() again. reset_count is 0, triggering a
full redundant hardware reset.
Fix this by calling cancel_work() on reset_work in
ath11k_mac_op_reconfig_complete() before signalling completion. This
clears any stale pending bit, preventing the spurious re-execution.
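The five steps above can be replayed with a small single-threaded model. This is only a sketch: every name in it (reset_model, model_queue_work() and so on) is hypothetical and none of it is ath11k or workqueue code. It assumes the pending bit behaves as described in the steps, and it collapses the asynchronous reset and the blocked worker into plain function calls.

```c
#include <stdbool.h>

/* Toy model; every name below is hypothetical, not an ath11k symbol. */
struct reset_model {
	bool pending;        /* models the work item's pending bit */
	bool waiter_blocked; /* a worker is parked waiting for completion */
	int reset_count;     /* models ab->reset_count */
	int full_resets;     /* number of full hardware resets started */
};

/* Like queue_work(): succeeds only if the item is not already pending. */
static bool model_queue_work(struct reset_model *m)
{
	if (m->pending)
		return false;
	m->pending = true;
	return true;
}

/* The workqueue clears the pending bit, then runs the reset handler. */
static void model_run_work(struct reset_model *m)
{
	if (!m->pending)
		return;
	m->pending = false;

	if (++m->reset_count > 1) {
		m->waiter_blocked = true; /* a reset is in flight: wait for it */
		return;
	}
	m->full_resets++; /* start the asynchronous reset */
}

/* Models the reconfig-complete path, with or without the proposed cancel. */
static void model_reconfig_complete(struct reset_model *m, bool cancel_pending)
{
	if (cancel_pending)
		m->pending = false; /* the effect of cancel_work() */

	m->reset_count--;
	if (m->waiter_blocked) {
		m->waiter_blocked = false;
		m->reset_count--; /* woken worker takes the early-exit path */
	}
}

/* Replays steps 1-5 and returns how many full resets were started. */
static int model_replay(bool cancel_pending)
{
	struct reset_model m = { 0 };

	model_queue_work(&m);   /* step 1: first WMI timeout */
	model_run_work(&m);
	model_queue_work(&m);   /* step 2: second timeout, worker blocks */
	model_run_work(&m);
	model_queue_work(&m);   /* step 3: third timeout re-arms the item */
	model_reconfig_complete(&m, cancel_pending); /* step 4 */
	model_run_work(&m);     /* step 5: runs only if still pending */

	return m.full_resets;
}
```

In this model, model_replay(false) starts two full resets, while model_replay(true) starts only one, which is the behaviour the cancel is meant to restore.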
Signed-off-by: Matthew Leach <matthew.leach@collabora.com>
---
drivers/net/wireless/ath/ath11k/mac.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/net/wireless/ath/ath11k/mac.c b/drivers/net/wireless/ath/ath11k/mac.c
index e4ee2ba1f669..748f779b3d1b 100644
--- a/drivers/net/wireless/ath/ath11k/mac.c
+++ b/drivers/net/wireless/ath/ath11k/mac.c
@@ -9274,6 +9274,10 @@ ath11k_mac_op_reconfig_complete(struct ieee80211_hw *hw,
* the recovery has to be done for each radio
*/
if (recovery_count == ab->num_radios) {
+ /* Cancel any pending work, preventing a second redudant
+ * reset.
+ */
+ cancel_work(&ab->reset_work);
atomic_dec(&ab->reset_count);
complete(&ab->reset_complete);
ab->is_reset = false;
--
2.53.0
* [PATCH RESEND RFC 2/3] net: ath11k: add firmware lockup detection and recovery
2026-03-30 10:05 [RFC PATCH RESEND 0/3] net: ath11k: Firmware lockup detection & mitigation Matthew Leach
2026-03-30 10:05 ` [PATCH RESEND RFC 1/3] net: ath11k: fix redundant reset from stale pending workqueue bit Matthew Leach
@ 2026-03-30 10:05 ` Matthew Leach
2026-03-30 10:05 ` [PATCH RESEND RFC 3/3] net: ath11k: add lockup simulation via debugfs Matthew Leach
2 siblings, 0 replies; 6+ messages in thread
From: Matthew Leach @ 2026-03-30 10:05 UTC
To: Jeff Johnson; +Cc: linux-wireless, ath11k, linux-kernel, kernel, Matthew Leach
Detect a firmware lockup when a WMI command times out and a frame transmit
failure (TX descriptor exhaustion) was recorded within the preceding
ATH11K_LOCKUP_DESC_ERR_RANGE_HZ (1 minute). In this case, consider the
firmware dead.
When a lockup is detected, queue reset work to restart the chip.
After reset completes, clear the lockup detection state.
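The correlation rule can be sketched in userspace as a pure function of two timestamps. This is an illustration only: lockup_suspected() and LOCKUP_WINDOW_TICKS are hypothetical names, the tick rate is assumed to be 1000 per second, and plain 64-bit comparisons stand in for the kernel's time_in_range64() helper.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for ATH11K_LOCKUP_DESC_ERR_RANGE_HZ, assuming
 * 1000 ticks per second (the real value depends on the kernel's HZ).
 */
#define LOCKUP_WINDOW_TICKS (60u * 1000u)

/*
 * Return true when a WMI timeout observed at 'now' follows a recorded
 * TX failure by no more than the window. last_tx_err == 0 means no
 * failure has been recorded since the last reset.
 */
static bool lockup_suspected(uint64_t last_tx_err, uint64_t now)
{
	uint64_t window_start;

	if (last_tx_err == 0)
		return false;

	/* Clamp at zero so an early timestamp cannot underflow. */
	window_start = now > LOCKUP_WINDOW_TICKS ? now - LOCKUP_WINDOW_TICKS : 0;

	return last_tx_err >= window_start && last_tx_err <= now;
}
```

For example, a TX failure at tick 100000 followed by a WMI timeout at tick 160000 falls exactly on the edge of the window and is treated as a lockup, while a failure one tick earlier is not.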
Signed-off-by: Matthew Leach <matthew.leach@collabora.com>
---
drivers/net/wireless/ath/ath11k/core.h | 2 ++
drivers/net/wireless/ath/ath11k/mac.c | 6 ++++++
drivers/net/wireless/ath/ath11k/wmi.c | 24 +++++++++++++++++++++++-
3 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/drivers/net/wireless/ath/ath11k/core.h b/drivers/net/wireless/ath/ath11k/core.h
index a0d725923ef2..221dcd23b3dd 100644
--- a/drivers/net/wireless/ath/ath11k/core.h
+++ b/drivers/net/wireless/ath/ath11k/core.h
@@ -70,6 +70,7 @@ extern bool ath11k_ftm_mode;
#define ATH11K_RESET_FAIL_TIMEOUT_HZ (20 * HZ)
#define ATH11K_RECONFIGURE_TIMEOUT_HZ (10 * HZ)
#define ATH11K_RECOVER_START_TIMEOUT_HZ (20 * HZ)
+#define ATH11K_LOCKUP_DESC_ERR_RANGE_HZ (60 * HZ)
enum ath11k_supported_bw {
ATH11K_BW_20 = 0,
@@ -1039,6 +1040,7 @@ struct ath11k_base {
struct ath11k_dbring_cap *db_caps;
u32 num_db_cap;
+ u64 last_frame_tx_error_jiffies;
/* To synchronize 11d scan vdev id */
struct mutex vdev_id_11d_lock;
diff --git a/drivers/net/wireless/ath/ath11k/mac.c b/drivers/net/wireless/ath/ath11k/mac.c
index 748f779b3d1b..a0b4d60da330 100644
--- a/drivers/net/wireless/ath/ath11k/mac.c
+++ b/drivers/net/wireless/ath/ath11k/mac.c
@@ -9,6 +9,7 @@
#include <linux/etherdevice.h>
#include <linux/bitfield.h>
#include <linux/inetdevice.h>
+#include <linux/jiffies.h>
#include <net/if_inet6.h>
#include <net/ipv6.h>
@@ -6546,6 +6547,10 @@ static void ath11k_mac_op_tx(struct ieee80211_hw *hw,
ret = ath11k_dp_tx(ar, arvif, arsta, skb);
if (unlikely(ret)) {
+ scoped_guard(spinlock_bh, &ar->ab->base_lock) {
+ ar->ab->last_frame_tx_error_jiffies = jiffies_64;
+ }
+
ath11k_warn(ar->ab, "failed to transmit frame %d\n", ret);
ieee80211_free_txskb(ar->hw, skb);
}
@@ -9281,6 +9286,7 @@ ath11k_mac_op_reconfig_complete(struct ieee80211_hw *hw,
atomic_dec(&ab->reset_count);
complete(&ab->reset_complete);
ab->is_reset = false;
+ ab->last_frame_tx_error_jiffies = 0;
atomic_set(&ab->fail_cont_count, 0);
ath11k_dbg(ab, ATH11K_DBG_BOOT, "reset success\n");
}
diff --git a/drivers/net/wireless/ath/ath11k/wmi.c b/drivers/net/wireless/ath/ath11k/wmi.c
index 40747fba3b0c..7d9f0bcbb3b0 100644
--- a/drivers/net/wireless/ath/ath11k/wmi.c
+++ b/drivers/net/wireless/ath/ath11k/wmi.c
@@ -7,8 +7,11 @@
#include <linux/ctype.h>
#include <net/mac80211.h>
#include <net/cfg80211.h>
+#include <linux/cleanup.h>
#include <linux/completion.h>
#include <linux/if_ether.h>
+#include <linux/jiffies.h>
+#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/pci.h>
#include <linux/uuid.h>
@@ -325,9 +328,28 @@ int ath11k_wmi_cmd_send(struct ath11k_pdev_wmi *wmi, struct sk_buff *skb,
}), WMI_SEND_TIMEOUT_HZ);
}
- if (ret == -EAGAIN)
+ if (ret == -EAGAIN) {
+ u64 range_start;
+
ath11k_warn(wmi_ab->ab, "wmi command %d timeout\n", cmd_id);
+ guard(spinlock_bh)(&ab->base_lock);
+
+ if (ab->last_frame_tx_error_jiffies == 0)
+ return ret;
+
+ range_start =
+ (jiffies_64 > ATH11K_LOCKUP_DESC_ERR_RANGE_HZ) ?
+ jiffies_64 - ATH11K_LOCKUP_DESC_ERR_RANGE_HZ :
+ 0;
+
+ if (time_in_range64(ab->last_frame_tx_error_jiffies,
+ range_start, jiffies_64) &&
+ queue_work(ab->workqueue_aux, &ab->reset_work))
+ ath11k_err(wmi_ab->ab,
+ "Firmware lockup detected. Resetting.");
+ }
+
if (ret == -ENOBUFS)
ath11k_warn(wmi_ab->ab, "ce desc not available for wmi command %d\n",
cmd_id);
--
2.53.0
* [PATCH RESEND RFC 3/3] net: ath11k: add lockup simulation via debugfs
2026-03-30 10:05 [RFC PATCH RESEND 0/3] net: ath11k: Firmware lockup detection & mitigation Matthew Leach
2026-03-30 10:05 ` [PATCH RESEND RFC 1/3] net: ath11k: fix redundant reset from stale pending workqueue bit Matthew Leach
2026-03-30 10:05 ` [PATCH RESEND RFC 2/3] net: ath11k: add firmware lockup detection and recovery Matthew Leach
@ 2026-03-30 10:05 ` Matthew Leach
2026-05-12 23:19 ` Jeff Johnson
2 siblings, 1 reply; 6+ messages in thread
From: Matthew Leach @ 2026-03-30 10:05 UTC
To: Jeff Johnson; +Cc: linux-wireless, ath11k, linux-kernel, kernel, Matthew Leach
Add a debugfs command to simulate a firmware lockup.
This does not hang the hardware. Instead, it forces the driver down an
error path that reproduces the sequence observed during real lockups:
ath11k_pci 0000:03:00.0: failed to transmit frame -12
ath11k_pci 0000:03:00.0: failed to transmit frame -12
ath11k_pci 0000:03:00.0: failed to transmit frame -12
...
ath11k_pci 0000:03:00.0: wmi command 28680 timeout
ath11k_pci 0000:03:00.0: failed to submit WMI_MGMT_TX_SEND_CMDID cmd
ath11k_pci 0000:03:00.0: failed to send mgmt frame: -11
This allows validation of the firmware lockup detection and recovery
mechanism without requiring a real hardware failure.
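The core of the simulation is that a flag makes a source ring report "no free entry" regardless of the real ring state. A minimal userspace sketch of that idea follows. The toy_src_ring type, its field sizes and toy_ring_next_entry() are hypothetical and only loosely mirror the shape of the check added to ath11k_hal_srng_src_get_next_entry().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_WORDS  8u /* hypothetical ring size, in 32-bit words */
#define ENTRY_WORDS 2u /* hypothetical descriptor size, in words */

struct toy_src_ring {
	uint32_t base[RING_WORDS];
	uint32_t hp;          /* producer (host) head pointer, in words */
	uint32_t cached_tp;   /* last tail pointer read from the consumer */
	bool simulate_lockup; /* when set, pretend the ring is always full */
};

/*
 * Return the next free descriptor and advance the head pointer, or NULL
 * when the ring is full or a lockup is being simulated.
 */
static uint32_t *toy_ring_next_entry(struct toy_src_ring *r)
{
	uint32_t next_hp = (r->hp + ENTRY_WORDS) % RING_WORDS;
	uint32_t *desc;

	if (next_hp == r->cached_tp || r->simulate_lockup)
		return NULL;

	desc = &r->base[r->hp];
	r->hp = next_hp;
	return desc;
}
```

With the flag set, every caller sees the same NULL it would see on genuine descriptor exhaustion, so the existing error paths run unchanged. Clearing the flag lets the ring hand out entries again from where it left off.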
Signed-off-by: Matthew Leach <matthew.leach@collabora.com>
---
drivers/net/wireless/ath/ath11k/core.h | 1 +
drivers/net/wireless/ath/ath11k/debugfs.c | 7 ++++++-
drivers/net/wireless/ath/ath11k/hal.c | 7 +++++--
drivers/net/wireless/ath/ath11k/htc.c | 2 +-
drivers/net/wireless/ath/ath11k/wmi.c | 6 +++++-
5 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/drivers/net/wireless/ath/ath11k/core.h b/drivers/net/wireless/ath/ath11k/core.h
index 221dcd23b3dd..44b02ae1e85b 100644
--- a/drivers/net/wireless/ath/ath11k/core.h
+++ b/drivers/net/wireless/ath/ath11k/core.h
@@ -1041,6 +1041,7 @@ struct ath11k_base {
struct ath11k_dbring_cap *db_caps;
u32 num_db_cap;
u64 last_frame_tx_error_jiffies;
+ bool simulate_lockup;
/* To synchronize 11d scan vdev id */
struct mutex vdev_id_11d_lock;
diff --git a/drivers/net/wireless/ath/ath11k/debugfs.c b/drivers/net/wireless/ath/ath11k/debugfs.c
index 0c1138407838..ca0b72a3e0b0 100644
--- a/drivers/net/wireless/ath/ath11k/debugfs.c
+++ b/drivers/net/wireless/ath/ath11k/debugfs.c
@@ -356,7 +356,8 @@ static ssize_t ath11k_read_simulate_fw_crash(struct file *file,
const char buf[] =
"To simulate firmware crash write one of the keywords to this file:\n"
"`assert` - this will send WMI_FORCE_FW_HANG_CMDID to firmware to cause assert.\n"
- "`hw-restart` - this will simply queue hw restart without fw/hw actually crashing.\n";
+ "`hw-restart` - this will simply queue hw restart without fw/hw actually crashing.\n"
+ "`lockup` - simulate a firmware lockup without the h/w actually hanging.\n";
return simple_read_from_buffer(user_buf, count, ppos, buf, strlen(buf));
}
@@ -413,6 +414,10 @@ static ssize_t ath11k_write_simulate_fw_crash(struct file *file,
ath11k_info(ab, "user requested hw restart\n");
queue_work(ab->workqueue_aux, &ab->reset_work);
ret = 0;
+ } else if (!strcmp(buf, "lockup")) {
+ ath11k_info(ab, "simulating lockup\n");
+ ab->simulate_lockup = true;
+ ret = 0;
} else {
ret = -EINVAL;
goto exit;
diff --git a/drivers/net/wireless/ath/ath11k/hal.c b/drivers/net/wireless/ath/ath11k/hal.c
index e821e5a62c1c..e01fb17a4734 100644
--- a/drivers/net/wireless/ath/ath11k/hal.c
+++ b/drivers/net/wireless/ath/ath11k/hal.c
@@ -691,7 +691,7 @@ int ath11k_hal_srng_dst_num_free(struct ath11k_base *ab, struct hal_srng *srng,
tp = srng->u.dst_ring.tp;
- if (sync_hw_ptr) {
+ if (sync_hw_ptr && !ab->simulate_lockup) {
hp = *srng->u.dst_ring.hp_addr;
srng->u.dst_ring.cached_hp = hp;
} else {
@@ -743,7 +743,7 @@ u32 *ath11k_hal_srng_src_get_next_entry(struct ath11k_base *ab,
*/
next_hp = (srng->u.src_ring.hp + srng->entry_size) % srng->ring_size;
- if (next_hp == srng->u.src_ring.cached_tp)
+ if (next_hp == srng->u.src_ring.cached_tp || ab->simulate_lockup)
return NULL;
desc = srng->ring_base_vaddr + srng->u.src_ring.hp;
@@ -828,6 +828,9 @@ void ath11k_hal_srng_access_begin(struct ath11k_base *ab, struct hal_srng *srng)
lockdep_assert_held(&srng->lock);
+ if (ab->simulate_lockup)
+ return;
+
if (srng->ring_dir == HAL_SRNG_DIR_SRC) {
srng->u.src_ring.cached_tp =
*(volatile u32 *)srng->u.src_ring.tp_addr;
diff --git a/drivers/net/wireless/ath/ath11k/htc.c b/drivers/net/wireless/ath/ath11k/htc.c
index 4571d01cc33d..b05d04a1f5e8 100644
--- a/drivers/net/wireless/ath/ath11k/htc.c
+++ b/drivers/net/wireless/ath/ath11k/htc.c
@@ -208,7 +208,7 @@ static int ath11k_htc_process_trailer(struct ath11k_htc *htc,
break;
}
- if (ab->hw_params.credit_flow) {
+ if (ab->hw_params.credit_flow && !ab->simulate_lockup) {
switch (record->hdr.id) {
case ATH11K_HTC_RECORD_CREDITS:
len = sizeof(struct ath11k_htc_credit_report);
diff --git a/drivers/net/wireless/ath/ath11k/wmi.c b/drivers/net/wireless/ath/ath11k/wmi.c
index 7d9f0bcbb3b0..27d6d4a2f803 100644
--- a/drivers/net/wireless/ath/ath11k/wmi.c
+++ b/drivers/net/wireless/ath/ath11k/wmi.c
@@ -345,9 +345,13 @@ int ath11k_wmi_cmd_send(struct ath11k_pdev_wmi *wmi, struct sk_buff *skb,
if (time_in_range64(ab->last_frame_tx_error_jiffies,
range_start, jiffies_64) &&
- queue_work(ab->workqueue_aux, &ab->reset_work))
+ queue_work(ab->workqueue_aux, &ab->reset_work)) {
ath11k_err(wmi_ab->ab,
"Firmware lockup detected. Resetting.");
+
+ /* Assume that reset gets us out of lockup. */
+ ab->simulate_lockup = false;
+ }
}
if (ret == -ENOBUFS)
--
2.53.0
* Re: [PATCH RESEND RFC 1/3] net: ath11k: fix redundant reset from stale pending workqueue bit
2026-03-30 10:05 ` [PATCH RESEND RFC 1/3] net: ath11k: fix redundant reset from stale pending workqueue bit Matthew Leach
@ 2026-05-12 23:09 ` Jeff Johnson
0 siblings, 0 replies; 6+ messages in thread
From: Jeff Johnson @ 2026-05-12 23:09 UTC
To: Matthew Leach, Jeff Johnson; +Cc: linux-wireless, ath11k, linux-kernel, kernel
On 3/30/2026 3:05 AM, Matthew Leach wrote:
> During a firmware lockup, WMI commands time out in rapid succession,
> each calling queue_work() to schedule ath11k_core_reset(). This can
> cause a spurious extra reset after recovery completes:
>
> 1. First WMI timeout calls queue_work(), sets the pending bit and
> schedules ath11k_core_reset(). The workqueue clears the pending bit
> before invoking the work function. reset_count becomes 1 and the reset
> is kicked off asynchronously. ath11k_core_reset() returns.
>
> 2. Second WMI timeout calls queue_work() and re-queues the work. When it
> runs after step 1 returns, it sees reset_count > 1 and blocks in
> wait_for_completion(). The pending bit is again cleared.
>
> 3. Third WMI timeout calls queue_work(), the pending bit was cleared in
> step 2, so this succeeds and arms another execution.
>
> 4. The asynchronous reset finishes. ath11k_mac_op_reconfig_complete()
> decrements reset_count and calls complete(). The blocked worker from
> step 2 wakes, takes the early-exit path, and decrements reset_count to
> 0.
>
> 5. The workqueue sees the pending bit from step 3 and runs
> ath11k_core_reset() again. reset_count is 0, triggering a
> full redundant hardware reset.
>
> Fix this by calling cancel_work() on reset_work in
> ath11k_mac_op_reconfig_complete() before signalling completion. This
> clears any stale pending bit, preventing the spurious re-execution.
>
> Signed-off-by: Matthew Leach <matthew.leach@collabora.com>
> ---
> drivers/net/wireless/ath/ath11k/mac.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/wireless/ath/ath11k/mac.c b/drivers/net/wireless/ath/ath11k/mac.c
> index e4ee2ba1f669..748f779b3d1b 100644
> --- a/drivers/net/wireless/ath/ath11k/mac.c
> +++ b/drivers/net/wireless/ath/ath11k/mac.c
> @@ -9274,6 +9274,10 @@ ath11k_mac_op_reconfig_complete(struct ieee80211_hw *hw,
> * the recovery has to be done for each radio
> */
> if (recovery_count == ab->num_radios) {
> + /* Cancel any pending work, preventing a second redudant
nits:
1) networking no longer uses a different block comment style, so use the
standard style where /* is on a line by itself
2) s/redudant/redundant/ (subject has it right)
but don't post a new version just for these -- wait for any other comments.
I'm pinging the development team to look at this thread.
> + * reset.
> + */
> + cancel_work(&ab->reset_work);
> atomic_dec(&ab->reset_count);
> complete(&ab->reset_complete);
> ab->is_reset = false;
>
* Re: [PATCH RESEND RFC 3/3] net: ath11k: add lockup simulation via debugfs
2026-03-30 10:05 ` [PATCH RESEND RFC 3/3] net: ath11k: add lockup simulation via debugfs Matthew Leach
@ 2026-05-12 23:19 ` Jeff Johnson
0 siblings, 0 replies; 6+ messages in thread
From: Jeff Johnson @ 2026-05-12 23:19 UTC
To: Matthew Leach, Jeff Johnson; +Cc: linux-wireless, ath11k, linux-kernel, kernel
On 3/30/2026 3:05 AM, Matthew Leach wrote:
> Add a debugfs command to simulate a firmware lockup.
>
> This does not hang the hardware. Instead, it forces the driver down an
> error path that reproduces the sequence observed during real lockups:
>
> ath11k_pci 0000:03:00.0: failed to transmit frame -12
> ath11k_pci 0000:03:00.0: failed to transmit frame -12
> ath11k_pci 0000:03:00.0: failed to transmit frame -12
> ...
> ath11k_pci 0000:03:00.0: wmi command 28680 timeout
> ath11k_pci 0000:03:00.0: failed to submit WMI_MGMT_TX_SEND_CMDID cmd
> ath11k_pci 0000:03:00.0: failed to send mgmt frame: -11
>
> This allows validation of the firmware lockup detection and recovery
> mechanism without requiring a real hardware failure.
>
> Signed-off-by: Matthew Leach <matthew.leach@collabora.com>
> ---
> drivers/net/wireless/ath/ath11k/core.h | 1 +
> drivers/net/wireless/ath/ath11k/debugfs.c | 7 ++++++-
> drivers/net/wireless/ath/ath11k/hal.c | 7 +++++--
> drivers/net/wireless/ath/ath11k/htc.c | 2 +-
> drivers/net/wireless/ath/ath11k/wmi.c | 6 +++++-
> 5 files changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath11k/core.h b/drivers/net/wireless/ath/ath11k/core.h
> index 221dcd23b3dd..44b02ae1e85b 100644
> --- a/drivers/net/wireless/ath/ath11k/core.h
> +++ b/drivers/net/wireless/ath/ath11k/core.h
> @@ -1041,6 +1041,7 @@ struct ath11k_base {
> struct ath11k_dbring_cap *db_caps;
> u32 num_db_cap;
> u64 last_frame_tx_error_jiffies;
> + bool simulate_lockup;
>
> /* To synchronize 11d scan vdev id */
> struct mutex vdev_id_11d_lock;
> diff --git a/drivers/net/wireless/ath/ath11k/debugfs.c b/drivers/net/wireless/ath/ath11k/debugfs.c
> index 0c1138407838..ca0b72a3e0b0 100644
> --- a/drivers/net/wireless/ath/ath11k/debugfs.c
> +++ b/drivers/net/wireless/ath/ath11k/debugfs.c
> @@ -356,7 +356,8 @@ static ssize_t ath11k_read_simulate_fw_crash(struct file *file,
> const char buf[] =
> "To simulate firmware crash write one of the keywords to this file:\n"
> "`assert` - this will send WMI_FORCE_FW_HANG_CMDID to firmware to cause assert.\n"
> - "`hw-restart` - this will simply queue hw restart without fw/hw actually crashing.\n";
> + "`hw-restart` - this will simply queue hw restart without fw/hw actually crashing.\n"
> + "`lockup` - simulate a firmware lockup without the h/w actually hanging.\n";
>
> return simple_read_from_buffer(user_buf, count, ppos, buf, strlen(buf));
> }
> @@ -413,6 +414,10 @@ static ssize_t ath11k_write_simulate_fw_crash(struct file *file,
> ath11k_info(ab, "user requested hw restart\n");
> queue_work(ab->workqueue_aux, &ab->reset_work);
> ret = 0;
> + } else if (!strcmp(buf, "lockup")) {
> + ath11k_info(ab, "simulating lockup\n");
> + ab->simulate_lockup = true;
> + ret = 0;
> } else {
> ret = -EINVAL;
> goto exit;
> diff --git a/drivers/net/wireless/ath/ath11k/hal.c b/drivers/net/wireless/ath/ath11k/hal.c
> index e821e5a62c1c..e01fb17a4734 100644
> --- a/drivers/net/wireless/ath/ath11k/hal.c
> +++ b/drivers/net/wireless/ath/ath11k/hal.c
> @@ -691,7 +691,7 @@ int ath11k_hal_srng_dst_num_free(struct ath11k_base *ab, struct hal_srng *srng,
>
> tp = srng->u.dst_ring.tp;
>
> - if (sync_hw_ptr) {
> + if (sync_hw_ptr && !ab->simulate_lockup) {
> hp = *srng->u.dst_ring.hp_addr;
> srng->u.dst_ring.cached_hp = hp;
> } else {
> @@ -743,7 +743,7 @@ u32 *ath11k_hal_srng_src_get_next_entry(struct ath11k_base *ab,
> */
> next_hp = (srng->u.src_ring.hp + srng->entry_size) % srng->ring_size;
>
> - if (next_hp == srng->u.src_ring.cached_tp)
> + if (next_hp == srng->u.src_ring.cached_tp || ab->simulate_lockup)
> return NULL;
>
> desc = srng->ring_base_vaddr + srng->u.src_ring.hp;
> @@ -828,6 +828,9 @@ void ath11k_hal_srng_access_begin(struct ath11k_base *ab, struct hal_srng *srng)
>
> lockdep_assert_held(&srng->lock);
>
> + if (ab->simulate_lockup)
> + return;
> +
> if (srng->ring_dir == HAL_SRNG_DIR_SRC) {
> srng->u.src_ring.cached_tp =
> *(volatile u32 *)srng->u.src_ring.tp_addr;
> diff --git a/drivers/net/wireless/ath/ath11k/htc.c b/drivers/net/wireless/ath/ath11k/htc.c
> index 4571d01cc33d..b05d04a1f5e8 100644
> --- a/drivers/net/wireless/ath/ath11k/htc.c
> +++ b/drivers/net/wireless/ath/ath11k/htc.c
> @@ -208,7 +208,7 @@ static int ath11k_htc_process_trailer(struct ath11k_htc *htc,
> break;
> }
>
> - if (ab->hw_params.credit_flow) {
> + if (ab->hw_params.credit_flow && !ab->simulate_lockup) {
> switch (record->hdr.id) {
> case ATH11K_HTC_RECORD_CREDITS:
> len = sizeof(struct ath11k_htc_credit_report);
> diff --git a/drivers/net/wireless/ath/ath11k/wmi.c b/drivers/net/wireless/ath/ath11k/wmi.c
> index 7d9f0bcbb3b0..27d6d4a2f803 100644
> --- a/drivers/net/wireless/ath/ath11k/wmi.c
> +++ b/drivers/net/wireless/ath/ath11k/wmi.c
> @@ -345,9 +345,13 @@ int ath11k_wmi_cmd_send(struct ath11k_pdev_wmi *wmi, struct sk_buff *skb,
>
> if (time_in_range64(ab->last_frame_tx_error_jiffies,
> range_start, jiffies_64) &&
> - queue_work(ab->workqueue_aux, &ab->reset_work))
> + queue_work(ab->workqueue_aux, &ab->reset_work)) {
> ath11k_err(wmi_ab->ab,
> "Firmware lockup detected. Resetting.");
> +
> + /* Assume that reset gets us out of lockup. */
> + ab->simulate_lockup = false;
> + }
> }
>
> if (ret == -ENOBUFS)
>
My first impression of this patch is that the datapath folks are not going to
like the ab->simulate_lockup checks in the hot path. But I'll let the
engineers speak for themselves.
/jeff