From: Thinh Tran <thinhtr@linux.vnet.ibm.com>
To: netdev@vger.kernel.org, siva.kallam@broadcom.com,
	prashant@broadcom.com, mchan@broadcom.com,
	pavan.chebbi@broadcom.com, drc@linux.vnet.ibm.com
Cc: venkata.sai.duggi@ibm.com, Thinh Tran <thinhtr@linux.vnet.ibm.com>
Subject: [PATCH v2] net/tg3: fix race condition in tg3_reset_task()
Date: Thu,  2 Nov 2023 11:12:19 -0500
Message-ID: <20231102161219.220-1-thinhtr@linux.vnet.ibm.com>
In-Reply-To: <20231002185510.1488-1-thinhtr@linux.vnet.ibm.com>

When an EEH error is encountered by a PCI adapter, the EEH driver
modifies the PCI channel's state as shown below:

   enum {
      /* I/O channel is in normal state */
      pci_channel_io_normal = (__force pci_channel_state_t) 1,

      /* I/O to channel is blocked */
      pci_channel_io_frozen = (__force pci_channel_state_t) 2,

      /* PCI card is dead */
      pci_channel_io_perm_failure = (__force pci_channel_state_t) 3,
   };

If the same EEH error then causes the tg3 driver's transmit timeout
logic to execute, the tg3_tx_timeout() function schedules a reset
task via tg3_reset_task_schedule(). This can cause a race between
the tg3 and EEH drivers, as both attempt to recover the HW via a
reset action.

EEH driver                        tg3 driver (scheduler)
----------                        ----------------------
gets error event
--> eeh_set_channel_state()
    sets the device to one of
    the error states above
                                  tg3_reset_task() gets an error
                                  returned from tg3_init_hw()
                                  --> dev_close() shuts down
                                      the interface
tg3_io_slot_reset() and
tg3_io_resume() fail to
reset/resume the device
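
The ordering above can be modelled in user space. This is only a
sketch, not driver code: every mock_* name is hypothetical, and it
assumes (as a simplification) that hardware init fails whenever the
channel is not in the normal state and that the recovery path cannot
bring back an interface that has already been closed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the pci_channel_state_t values quoted above. */
enum mock_channel_state {
	MOCK_IO_NORMAL = 1,
	MOCK_IO_FROZEN = 2,
	MOCK_IO_PERM_FAILURE = 3,
};

/* Hypothetical, minimal model of the adapter state. */
struct mock_nic {
	enum mock_channel_state error_state; /* set by the mock EEH side */
	bool if_up;                          /* interface is still open */
};

/* Assumption: register access fails while the channel is not normal. */
static int mock_init_hw(const struct mock_nic *nic)
{
	return nic->error_state == MOCK_IO_NORMAL ? 0 : -1;
}

/* Models the reset task: a failed init closes the interface. */
static void mock_reset_task(struct mock_nic *nic)
{
	assert(nic != NULL);
	if (mock_init_hw(nic) != 0)
		nic->if_up = false;
}

/*
 * Models the EEH resume step: the channel returns to normal, but the
 * resume only succeeds if the interface was not closed in the meantime.
 */
static bool mock_eeh_resume(struct mock_nic *nic)
{
	assert(nic != NULL);
	nic->error_state = MOCK_IO_NORMAL;
	return nic->if_up;
}
```

If mock_reset_task() runs while error_state is MOCK_IO_FROZEN, the
later mock_eeh_resume() reports failure; if it never runs, the resume
succeeds.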


To resolve this issue, we avoid the race condition by checking the PCI
channel state in the tg3_tx_timeout() function and skipping the tg3
driver-initiated reset when the PCI channel is not in the normal state.
(The driver has no access to tg3 device registers at this point and
cannot even complete the reset task successfully without external
assistance.) We leave the reset procedure to be managed by the EEH
driver, which calls the tg3_io_error_detected(), tg3_io_slot_reset()
and tg3_io_resume() functions as appropriate.
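
The decision described above can be sketched outside the kernel as
follows. This is an illustration under stated assumptions, not the
patch itself: the mock_* types and functions are hypothetical
stand-ins for struct pci_dev, its error_state field and the timeout
handler, and a counter stands in for tg3_reset_task_schedule().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the pci_channel_state_t values quoted above. */
enum mock_channel_state {
	MOCK_IO_NORMAL = 1,
	MOCK_IO_FROZEN = 2,
	MOCK_IO_PERM_FAILURE = 3,
};

/* Hypothetical stand-in for the relevant part of struct pci_dev. */
struct mock_pdev {
	enum mock_channel_state error_state;
};

/* True when the driver-initiated reset should be left to EEH. */
static bool mock_skip_driver_reset(const struct mock_pdev *pdev)
{
	switch (pdev->error_state) {
	case MOCK_IO_FROZEN:       /* I/O blocked, EEH handles recovery */
	case MOCK_IO_PERM_FAILURE: /* card is dead, a reset cannot help */
		return true;
	default:
		return false;
	}
}

/*
 * Models the timeout handler: returns true and bumps the counter when
 * a reset would be scheduled, returns false when it is skipped.
 */
static bool mock_tx_timeout(const struct mock_pdev *pdev,
			    int *resets_scheduled)
{
	assert(pdev != NULL && resets_scheduled != NULL);

	if (mock_skip_driver_reset(pdev))
		return false;

	(*resets_scheduled)++;
	return true;
}
```

With the channel in the normal state the counter is incremented once
per timeout; in either error state the handler returns early and the
counter is left unchanged.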



Signed-off-by: Thinh Tran <thinhtr@linux.vnet.ibm.com>
Tested-by: Venkata Sai Duggi <venkata.sai.duggi@ibm.com>
Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>

---
 drivers/net/ethernet/broadcom/tg3.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 14b311196b8f..1c72ef05ab1b 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -7630,6 +7630,26 @@ static void tg3_tx_timeout(struct net_device *dev, unsigned int txqueue)
 {
 	struct tg3 *tp = netdev_priv(dev);
 
+	/* Check the PCI channel state for hard errors.
+	 * For the pci_channel_io_frozen case:
+	 *   - I/O to the channel is blocked.
+	 *     The EEH layer and the I/O error detection
+	 *     callbacks will handle the reset procedure.
+	 * For the pci_channel_io_perm_failure case:
+	 *   - The PCI card is dead.
+	 *     A reset will not help.
+	 * Report the error for both cases and return.
+	 */
+	if (tp->pdev->error_state == pci_channel_io_frozen) {
+		netdev_err(dev, " %s, I/O to channel is blocked\n", __func__);
+		return;
+	}
+
+	if (tp->pdev->error_state == pci_channel_io_perm_failure) {
+		netdev_err(dev, " %s, adapter has failed permanently!\n", __func__);
+		return;
+	}
+
 	if (netif_msg_tx_err(tp)) {
 		netdev_err(dev, "transmit timed out, resetting\n");
 		tg3_dump_state(tp);
-- 
2.25.1


Thread overview: 22+ messages
2023-10-02 18:55 [PATCH] net/tg3: fix race condition in tg3_reset_task_cancel() Thinh Tran
2023-10-03  4:34 ` Pavan Chebbi
2023-10-31 23:18   ` Thinh Tran
2023-10-03  9:37 ` Michael Chan
2023-10-03 22:05   ` Thinh Tran
2023-11-02 16:02     ` Thinh Tran
2023-11-02 16:12 ` Thinh Tran [this message]
2023-11-02 17:27   ` [PATCH v2] net/tg3: fix race condition in tg3_reset_task() Michael Chan
2023-11-02 20:37     ` Thinh Tran
2023-11-14 17:39       ` Thinh Tran
2023-11-14 21:03         ` Michael Chan
2023-11-15 18:23           ` Thinh Tran
2023-11-15 18:56             ` Michael Chan
2023-11-16 14:41               ` Thinh Tran
2023-11-16 15:18   ` [PATCH v3] " Thinh Tran
2023-11-16 21:34     ` Michael Chan
2023-11-17 16:19       ` Thinh Tran
2023-11-17 18:31         ` Michael Chan
2023-11-30 22:29           ` Thinh Tran
2023-12-01  0:19     ` [PATCH v4] " Thinh Tran
2023-12-01 16:50       ` Michael Chan
2023-12-02  0:40       ` patchwork-bot+netdevbpf
