From: David Christensen <drc@linux.vnet.ibm.com>
To: netdev@vger.kernel.org
Cc: David Christensen <drc@linux.vnet.ibm.com>,
Siva Reddy Kallam <siva.kallam@broadcom.com>,
Prashant Sreedharan <prashant@broadcom.com>,
Michael Chan <mchan@broadcom.com>
Subject: [PATCH] net/tg3: resolve deadlock in tg3_reset_task() during EEH
Date: Mon, 23 Jan 2023 19:31:07 -0500 [thread overview]
Message-ID: <20230124003107.214307-1-drc@linux.vnet.ibm.com> (raw)
During EEH error injection testing, a deadlock was encountered in the tg3
driver when tg3_io_error_detected() was attempting to cancel outstanding
reset tasks:
crash> foreach UN bt
...
PID: 159 TASK: c0000000067c6000 CPU: 8 COMMAND: "eehd"
...
#5 [c00000000681f990] __cancel_work_timer at c00000000019fd18
#6 [c00000000681fa30] tg3_io_error_detected at c00800000295f098 [tg3]
#7 [c00000000681faf0] eeh_report_error at c00000000004e25c
...
PID: 290 TASK: c000000036e5f800 CPU: 6 COMMAND: "kworker/6:1"
...
#4 [c00000003721fbc0] rtnl_lock at c000000000c940d8
#5 [c00000003721fbe0] tg3_reset_task at c008000002969358 [tg3]
#6 [c00000003721fc60] process_one_work at c00000000019e5c4
...
PID: 296 TASK: c000000037a65800 CPU: 21 COMMAND: "kworker/21:1"
...
#4 [c000000037247bc0] rtnl_lock at c000000000c940d8
#5 [c000000037247be0] tg3_reset_task at c008000002969358 [tg3]
#6 [c000000037247c60] process_one_work at c00000000019e5c4
...
PID: 655 TASK: c000000036f49000 CPU: 16 COMMAND: "kworker/16:2"
...:1
#4 [c0000000373ebbc0] rtnl_lock at c000000000c940d8
#5 [c0000000373ebbe0] tg3_reset_task at c008000002969358 [tg3]
#6 [c0000000373ebc60] process_one_work at c00000000019e5c4
...
Code inspection shows that both tg3_io_error_detected() and
tg3_reset_task() attempt to acquire the RTNL lock at the beginning of
their code blocks. If tg3_reset_task() should happen to execute between
the times when tg3_io_error_deteced() acquires the RTNL lock and
tg3_reset_task_cancel() is called, a deadlock will occur.
Moving tg3_reset_task_cancel() call earlier within the code block, prior
to acquiring RTNL, prevents this from happening, but also exposes another
deadlock issue where tg3_reset_task() may execute AFTER
tg3_io_error_detected() has executed:
crash> foreach UN bt
PID: 159 TASK: c0000000067d2000 CPU: 9 COMMAND: "eehd"
...
#4 [c000000006867a60] rtnl_lock at c000000000c940d8
#5 [c000000006867a80] tg3_io_slot_reset at c0080000026c2ea8 [tg3]
#6 [c000000006867b00] eeh_report_reset at c00000000004de88
...
PID: 363 TASK: c000000037564000 CPU: 6 COMMAND: "kworker/6:1"
...
#3 [c000000036c1bb70] msleep at c000000000259e6c
#4 [c000000036c1bba0] napi_disable at c000000000c6b848
#5 [c000000036c1bbe0] tg3_reset_task at c0080000026d942c [tg3]
#6 [c000000036c1bc60] process_one_work at c00000000019e5c4
...
This issue can be avoided by aborting tg3_reset_task() if EEH error
recovery is already in progress.
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
Cc: Siva Reddy Kallam <siva.kallam@broadcom.com>
Cc: Prashant Sreedharan <prashant@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
---
drivers/net/ethernet/broadcom/tg3.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 59debdc344a5..ee4604e6900e 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -11166,7 +11166,8 @@ static void tg3_reset_task(struct work_struct *work)
rtnl_lock();
tg3_full_lock(tp, 0);
- if (!netif_running(tp->dev)) {
+ // Skip reset task if no netdev or already in PCI recovery
+ if (!tp->dev || tp->pcierr_recovery || !netif_running(tp->dev)) {
tg3_flag_clear(tp, RESET_TASK_PENDING);
tg3_full_unlock(tp);
rtnl_unlock();
@@ -18101,6 +18102,9 @@ static pci_ers_result_t tg3_io_error_detected(struct pci_dev *pdev,
netdev_info(netdev, "PCI I/O error detected\n");
+ /* Want to make sure that the reset task doesn't run */
+ tg3_reset_task_cancel(tp);
+
rtnl_lock();
/* Could be second call or maybe we don't have netdev yet */
@@ -18117,9 +18121,6 @@ static pci_ers_result_t tg3_io_error_detected(struct pci_dev *pdev,
tg3_timer_stop(tp);
- /* Want to make sure that the reset task doesn't run */
- tg3_reset_task_cancel(tp);
-
netif_device_detach(netdev);
/* Clean up software state, even if MMIO is blocked */
--
2.31.1
next reply other threads:[~2023-01-24 0:31 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-01-24 0:31 David Christensen [this message]
2023-01-24 7:25 ` [PATCH] net/tg3: resolve deadlock in tg3_reset_task() during EEH Pavan Chebbi
2023-01-24 8:43 ` Michael Chan
2023-01-24 9:12 ` Pavan Chebbi
2023-01-24 18:53 ` [PATCH v2] " David Christensen
2023-01-25 8:33 ` Pavan Chebbi
2023-01-26 6:50 ` patchwork-bot+netdevbpf
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230124003107.214307-1-drc@linux.vnet.ibm.com \
--to=drc@linux.vnet.ibm.com \
--cc=mchan@broadcom.com \
--cc=netdev@vger.kernel.org \
--cc=prashant@broadcom.com \
--cc=siva.kallam@broadcom.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).