* [PATCH] net: wwan: t7xx: fix race between TX thread and system PM suspend
From: Tim JH Chen(陳仁鴻) @ 2026-05-13  8:37 UTC
  To: netdev@vger.kernel.org
  Cc: chandrashekar.devegowda@intel.com, haijun.liu@mediatek.com,
	ricardo.martinez@linux.intel.com, loic.poulain@oss.qualcomm.com,
	ryazanov.s.a@gmail.com, davem@davemloft.net, kuba@kernel.org,
	linux-kernel@vger.kernel.org



Date: Wed, 13 May 2026 09:21:40 +0800
Subject: [PATCH] net: wwan: t7xx: fix race between TX thread and system PM
 suspend
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When system suspend is triggered while the DPMAIF TX kthread
(t7xx_dpmaif_tx_hw_push_thread) is running, a deadlock can occur
leading to a CPU soft lockup.

The root cause is two-fold:

1. t7xx_dpmaif_suspend() calls t7xx_dpmaif_tx_stop(), which only stops
   the TX work-queue items (by clearing txq->que_started and waiting on
   txq->tx_processing). It does NOT signal the kthread and does NOT
   update dpmaif_ctrl->state, which stays DPMAIF_STATE_PWRON.

2. The kthread's state guard ("if ... state != DPMAIF_STATE_PWRON")
   is only checked at the top of each loop iteration. If the thread
   has already passed this guard, it proceeds unconditionally to call
   pm_runtime_resume_and_get(), which tries to acquire the PM spinlock
   also held (or contended) by the system PM suspend path.
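
The first point can be illustrated with a small userspace model. This is
only a sketch under stated assumptions: every pre_* name is hypothetical,
nothing here is driver code, and the guard is reduced to the single state
comparison described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the driver's types. */
enum pre_state { PRE_PWROFF, PRE_PWRON };

struct pre_ctrl {
	enum pre_state state;	/* models dpmaif_ctrl->state */
	bool que_started;	/* models txq->que_started */
};

/* Models the pre-patch suspend path: stops the queue, leaves state alone. */
static void pre_suspend(struct pre_ctrl *c)
{
	assert(c != NULL);
	c->que_started = false;
}

/* Models the kthread guard: only the power state is consulted. */
static bool pre_guard_passes(const struct pre_ctrl *c)
{
	assert(c != NULL);
	return c->state == PRE_PWRON;
}

/* Returns true if the thread would still reach the PM call after suspend. */
static bool pre_thread_proceeds_after_suspend(void)
{
	struct pre_ctrl c = { .state = PRE_PWRON, .que_started = true };

	pre_suspend(&c);
	return pre_guard_passes(&c);
}
```

In this model pre_thread_proceeds_after_suspend() returns true: the guard
has no way to notice that suspend has started, because suspend never
changes the field it tests.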

The result is a spinlock deadlock observed as:

  watchdog: BUG: soft lockup - CPU#N stuck for 26s! [dpmaif_tx_hw_pu]
  RIP: _raw_spin_unlock_irqrestore
  Call Trace:
    __pm_runtime_resume+0x5b/0x80
    t7xx_dpmaif_tx_hw_push_thread+0xc4 [mtk_t7xx]

The condition requires ASPM L1 enabled on the endpoint (which extends
the time pm_runtime_resume_and_get() holds the PM lock during L1.2
link retraining) and hundreds of repeated suspend/resume cycles to
trigger reliably.

Fix this with three coordinated changes:

- In t7xx_dpmaif_suspend(): immediately set state to DPMAIF_STATE_PWROFF
  after stopping the TX queue, then call wake_up() so any sleeping thread
  re-evaluates the wait_event condition and stops.

- In t7xx_dpmaif_resume(): restore state to DPMAIF_STATE_PWRON before
  re-enabling the TX queues, symmetric with the suspend change.
  Without this the kthread would never wake up after resume.

- In t7xx_dpmaif_tx_hw_push_thread(): add a second state check
  immediately before pm_runtime_resume_and_get() to close the TOCTOU
  window between the wait_event guard and the pm call.
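
The interaction of the three changes can be sketched as a deterministic
userspace model. It is an illustration only, under stated assumptions: the
fix_* names are hypothetical, the PM call is replaced by a counter, and the
suspend_in_window flag stands in for a suspend that lands between the two
checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the driver's types. */
enum fix_state { FIX_PWROFF, FIX_PWRON };

struct fix_ctrl {
	enum fix_state state;	/* models dpmaif_ctrl->state */
	int pm_calls;		/* counts calls to the stand-in PM step */
};

/* Models the patched suspend path: state is switched off. */
static void fix_suspend(struct fix_ctrl *c)
{
	assert(c != NULL);
	c->state = FIX_PWROFF;
}

/* Models the patched resume path: state is restored. */
static void fix_resume(struct fix_ctrl *c)
{
	assert(c != NULL);
	c->state = FIX_PWRON;
}

/*
 * One loop iteration of the modelled thread. Returns true if the
 * stand-in PM call was made.
 */
static bool fix_iteration(struct fix_ctrl *c, bool suspend_in_window)
{
	assert(c != NULL);

	if (c->state != FIX_PWRON)
		return false;	/* top-of-loop guard */

	if (suspend_in_window)
		fix_suspend(c);	/* suspend arrives after the guard */

	if (c->state != FIX_PWRON)
		return false;	/* second check, as added by the patch */

	c->pm_calls++;		/* stand-in for pm_runtime_resume_and_get() */
	return true;
}
```

With state initially FIX_PWRON, fix_iteration(&c, false) reaches the PM
step, fix_iteration(&c, true) does not, and further iterations stay
blocked until fix_resume(&c) is called. The model is single-threaded; in
concurrent code an unsynchronized recheck narrows the window rather than
removing it unless access to the state is otherwise serialized.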

Tested: no soft lockup observed over 500+ suspend/resume cycles with
SIM registered and ASPM L1 enabled (previously triggered in < 300).

Fixes: 05f7e89ab ("Linux 6.19")
Signed-off-by: Tim JH Chen <tim.jh.chen@wnc.com.tw>
---
 drivers/net/wwan/t7xx/t7xx_hif_dpmaif.c    | 3 +++
 drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/drivers/net/wwan/t7xx/t7xx_hif_dpmaif.c b/drivers/net/wwan/t7xx/t7xx_hif_dpmaif.c
index 7ff33c1d6..315a77e24 100644
--- a/drivers/net/wwan/t7xx/t7xx_hif_dpmaif.c
+++ b/drivers/net/wwan/t7xx/t7xx_hif_dpmaif.c
@@ -412,6 +412,8 @@ static int t7xx_dpmaif_suspend(struct t7xx_pci_dev *t7xx_dev, void *param)
 	struct dpmaif_ctrl *dpmaif_ctrl = param;
 
 	t7xx_dpmaif_tx_stop(dpmaif_ctrl);
+	dpmaif_ctrl->state = DPMAIF_STATE_PWROFF;
+	wake_up(&dpmaif_ctrl->tx_wq);
 	t7xx_dpmaif_hw_stop_all_txq(&dpmaif_ctrl->hw_info);
 	t7xx_dpmaif_hw_stop_all_rxq(&dpmaif_ctrl->hw_info);
 	t7xx_dpmaif_disable_irq(dpmaif_ctrl);
@@ -451,6 +453,7 @@ static int t7xx_dpmaif_resume(struct t7xx_pci_dev *t7xx_dev, void *param)
 	if (!dpmaif_ctrl)
 		return 0;
 
+	dpmaif_ctrl->state = DPMAIF_STATE_PWRON;
 	t7xx_dpmaif_start_txrx_qs(dpmaif_ctrl);
 	t7xx_dpmaif_enable_irq(dpmaif_ctrl);
 	t7xx_dpmaif_unmask_dlq_intr(dpmaif_ctrl);
diff --git a/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c b/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c
index 236d632cf..d5a5befec 100644
--- a/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c
+++ b/drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c
@@ -460,6 +460,9 @@ static int t7xx_dpmaif_tx_hw_push_thread(void *arg)
 				break;
 		}
 
+		if (dpmaif_ctrl->state != DPMAIF_STATE_PWRON)
+			continue;
+
 		ret = pm_runtime_resume_and_get(dpmaif_ctrl->dev);
 		if (ret < 0 && ret != -EACCES)
 			return ret;
--
2.25.1




* Re: [PATCH] net: wwan: t7xx: fix race between TX thread and system PM suspend
From: Jakub Kicinski @ 2026-05-15  0:19 UTC
  To: Tim JH Chen(陳仁鴻)
  Cc: netdev@vger.kernel.org, chandrashekar.devegowda@intel.com,
	haijun.liu@mediatek.com, ricardo.martinez@linux.intel.com,
	loic.poulain@oss.qualcomm.com, ryazanov.s.a@gmail.com,
	davem@davemloft.net, linux-kernel@vger.kernel.org

On Wed, 13 May 2026 08:37:48 +0000 Tim JH Chen(陳仁鴻) wrote:
> Date: Wed, 13 May 2026 09:21:40 +0800
> Subject: [PATCH] net: wwan: t7xx: fix race between TX thread and system PM
>  suspend
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit

Something has corrupted this patch (either your email client or server).
Please try to fix your setup and resend (maybe use b4 gateway).
