[PATCH] net/tg3: fix race condition in tg3_reset_task

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] net/tg3: fix race condition in tg3_reset_task_cancel()
@ 2023-10-02 18:55 Thinh Tran
  2023-10-03  4:34 ` Pavan Chebbi
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Thinh Tran @ 2023-10-02 18:55 UTC (permalink / raw)
  To: netdev
  Cc: siva.kallam, prashant, mchan, drc, pavan.chebbi, Thinh Tran,
	Venkata Sai Duggi

during the EEH error injection tests on the 4-port 1 GbE NetXtreme
BCM5719 Gigabit Ethernet PCIe adapter, a race condition was observed in
the process of resetting and setting the driver flag to
TX_RECOVERY_PENDING between tg3_reset_task_cancel() and tg3_tx_recover().
As a result, it occasionally leads to transmit timeouts and the
subsequent disabling of all the driver's interfaces.

[12046.886221] NETDEV WATCHDOG: eth16 (tg3): transmit queue 0 timed out
[12046.886238] WARNING: CPU: 7 PID: 0 at ../net/sched/sch_generic.c:478
   dev_watchdog+0x42c/0x440
[12046.886247] Modules linked in: tg3 libphy nfsv3 nfs_acl .......
 ..........
[12046.886571] tg3 0021:01:00.0 eth16: transmit timed out, resetting
...........
[12046.966175] tg3 0021:01:00.1 eth15: transmit timed out, resetting
...........
[12046.981584] tg3 0021:01:00.2 eth14: transmit timed out, resetting
...........
[12047.056165] tg3 0021:01:00.3 eth13: transmit timed out, resetting


Fixing this issue by taking the spinlock when modifying the driver flag


Fixes: 6c4ca03bd890 ("net/tg3: resolve deadlock in tg3_reset_task() during EEH")


Signed-off-by: Thinh Tran <thinhtr@linux.vnet.ibm.com>
Tested-by: Venkata Sai Duggi <venkata.sai.duggi@ibm.com>

---
 drivers/net/ethernet/broadcom/tg3.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 14b311196b8f..f4558762f9de 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -6507,7 +6507,9 @@ static void tg3_tx_recover(struct tg3 *tp)
 		    "Please report the problem to the driver maintainer "
 		    "and include system chipset information.\n");
 
+	tg3_full_lock(tp, 0);
 	tg3_flag_set(tp, TX_RECOVERY_PENDING);
+	tg3_full_unlock(tp);
 }
 
 static inline u32 tg3_tx_avail(struct tg3_napi *tnapi)
@@ -7210,7 +7212,10 @@ static inline void tg3_reset_task_cancel(struct tg3 *tp)
 {
 	if (test_and_clear_bit(TG3_FLAG_RESET_TASK_PENDING, tp->tg3_flags))
 		cancel_work_sync(&tp->reset_task);
+
+	tg3_full_lock(tp, 0);
 	tg3_flag_clear(tp, TX_RECOVERY_PENDING);
+	tg3_full_unlock(tp);
 }
 
 static int tg3_poll_msix(struct napi_struct *napi, int budget)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] net/tg3: fix race condition in tg3_reset_task_cancel()
  2023-10-02 18:55 [PATCH] net/tg3: fix race condition in tg3_reset_task_cancel() Thinh Tran
@ 2023-10-03  4:34 ` Pavan Chebbi
  2023-10-31 23:18   ` Thinh Tran
  2023-10-03  9:37 ` Michael Chan
  2023-11-02 16:12 ` [PATCH v2] net/tg3: fix race condition in tg3_reset_task() Thinh Tran
  2 siblings, 1 reply; 22+ messages in thread
From: Pavan Chebbi @ 2023-10-03  4:34 UTC (permalink / raw)
  To: Thinh Tran; +Cc: netdev, siva.kallam, prashant, mchan, drc, Venkata Sai Duggi

[-- Attachment #1: Type: text/plain, Size: 716 bytes --]

On Tue, Oct 3, 2023 at 12:25 AM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote:
>
> during the EEH error injection tests on the 4-port 1 GbE NetXtreme
> BCM5719 Gigabit Ethernet PCIe adapter, a race condition was observed in
> the process of resetting and setting the driver flag to
> TX_RECOVERY_PENDING between tg3_reset_task_cancel() and tg3_tx_recover().
> As a result, it occasionally leads to transmit timeouts and the
> subsequent disabling of all the driver's interfaces.

Can you elaborate on the race condition please? Are you saying
tg3_reset_task_cancel() cleared the flag and tg3_tx_recover() set it
again and the reset task never got a chance to run?
Is that what is leading to TX stall?

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] net/tg3: fix race condition in tg3_reset_task_cancel()
  2023-10-02 18:55 [PATCH] net/tg3: fix race condition in tg3_reset_task_cancel() Thinh Tran
  2023-10-03  4:34 ` Pavan Chebbi
@ 2023-10-03  9:37 ` Michael Chan
  2023-10-03 22:05   ` Thinh Tran
  2023-11-02 16:12 ` [PATCH v2] net/tg3: fix race condition in tg3_reset_task() Thinh Tran
  2 siblings, 1 reply; 22+ messages in thread
From: Michael Chan @ 2023-10-03  9:37 UTC (permalink / raw)
  To: Thinh Tran
  Cc: netdev, siva.kallam, prashant, mchan, drc, pavan.chebbi,
	Venkata Sai Duggi

[-- Attachment #1: Type: text/plain, Size: 2718 bytes --]

On Mon, Oct 2, 2023 at 11:55 AM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote:
>
> during the EEH error injection tests on the 4-port 1 GbE NetXtreme
> BCM5719 Gigabit Ethernet PCIe adapter, a race condition was observed in
> the process of resetting and setting the driver flag to
> TX_RECOVERY_PENDING between tg3_reset_task_cancel() and tg3_tx_recover().
> As a result, it occasionally leads to transmit timeouts and the
> subsequent disabling of all the driver's interfaces.
>
> [12046.886221] NETDEV WATCHDOG: eth16 (tg3): transmit queue 0 timed out
> [12046.886238] WARNING: CPU: 7 PID: 0 at ../net/sched/sch_generic.c:478
>    dev_watchdog+0x42c/0x440
> [12046.886247] Modules linked in: tg3 libphy nfsv3 nfs_acl .......
>  ..........
> [12046.886571] tg3 0021:01:00.0 eth16: transmit timed out, resetting
> ...........
> [12046.966175] tg3 0021:01:00.1 eth15: transmit timed out, resetting
> ...........
> [12046.981584] tg3 0021:01:00.2 eth14: transmit timed out, resetting
> ...........
> [12047.056165] tg3 0021:01:00.3 eth13: transmit timed out, resetting
>
>
> Fixing this issue by taking the spinlock when modifying the driver flag
>
>
> Fixes: 6c4ca03bd890 ("net/tg3: resolve deadlock in tg3_reset_task() during EEH")
>
>
> Signed-off-by: Thinh Tran <thinhtr@linux.vnet.ibm.com>
> Tested-by: Venkata Sai Duggi <venkata.sai.duggi@ibm.com>
>
> ---
>  drivers/net/ethernet/broadcom/tg3.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index 14b311196b8f..f4558762f9de 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -6507,7 +6507,9 @@ static void tg3_tx_recover(struct tg3 *tp)
>                     "Please report the problem to the driver maintainer "
>                     "and include system chipset information.\n");
>
> +       tg3_full_lock(tp, 0);
>         tg3_flag_set(tp, TX_RECOVERY_PENDING);
> +       tg3_full_unlock(tp);

tg3_flag_set() calls set_bit() which is atomic.  The same is true for
tg3_flag_clear().  Maybe we just need some smp_mb__after_atomic() or
similar memory barriers.

>  }
>
>  static inline u32 tg3_tx_avail(struct tg3_napi *tnapi)
> @@ -7210,7 +7212,10 @@ static inline void tg3_reset_task_cancel(struct tg3 *tp)
>  {
>         if (test_and_clear_bit(TG3_FLAG_RESET_TASK_PENDING, tp->tg3_flags))
>                 cancel_work_sync(&tp->reset_task);
> +
> +       tg3_full_lock(tp, 0);
>         tg3_flag_clear(tp, TX_RECOVERY_PENDING);
> +       tg3_full_unlock(tp);
>  }
>
>  static int tg3_poll_msix(struct napi_struct *napi, int budget)
> --
> 2.25.1
>

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] net/tg3: fix race condition in tg3_reset_task_cancel()
  2023-10-03  9:37 ` Michael Chan
@ 2023-10-03 22:05   ` Thinh Tran
  2023-11-02 16:02     ` Thinh Tran
  0 siblings, 1 reply; 22+ messages in thread
From: Thinh Tran @ 2023-10-03 22:05 UTC (permalink / raw)
  To: Michael Chan
  Cc: netdev, siva.kallam, prashant, mchan, drc, pavan.chebbi,
	Venkata Sai Duggi

Thanks for the review.

On 10/3/2023 4:37 AM, Michael Chan wrote:

> 
> tg3_flag_set() calls set_bit() which is atomic.  The same is true for
> tg3_flag_clear().  Maybe we just need some smp_mb__after_atomic() or
> similar memory barriers.
> 

I did not see it being used in this driver. I'll try that.

Thinh Tran


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] net/tg3: fix race condition in tg3_reset_task_cancel()
  2023-10-03  4:34 ` Pavan Chebbi
@ 2023-10-31 23:18   ` Thinh Tran
  0 siblings, 0 replies; 22+ messages in thread
From: Thinh Tran @ 2023-10-31 23:18 UTC (permalink / raw)
  To: Pavan Chebbi; +Cc: netdev, siva.kallam, prashant, mchan, drc, Venkata Sai Duggi

Thanks for the review and I apologize for the delayed response. I had 
some trouble accessing the system, which delayed my investigation.

On 10/2/2023 11:34 PM, Pavan Chebbi wrote:
> 
> Can you elaborate on the race condition please? Are you saying
> tg3_reset_task_cancel() cleared the flag and tg3_tx_recover() set it
> again and the reset task never got a chance to run?
> Is that what is leading to TX stall?

This code path only triggered once, and after updating both the system 
and adapter firmware, I haven't encountered it again. However, the race 
condition issue still persists, causing the interfaces to go down.

Implementing the memory barrier, smp_mb__after_atomic(), as suggested by 
Michael Chan, the intermittent problem still persists. Upon closer 
investigation, I identified the root cause, details in the next version 
of the patch. When I commented out the call to the tg3_dump_state() 
function in the tg3_tx_timeout() function, the issue occurred quicker.

I'm working on submitting the v2 of the patch.

Regards,
Thinh Tran

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] net/tg3: fix race condition in tg3_reset_task_cancel()
  2023-10-03 22:05   ` Thinh Tran
@ 2023-11-02 16:02     ` Thinh Tran
  0 siblings, 0 replies; 22+ messages in thread
From: Thinh Tran @ 2023-11-02 16:02 UTC (permalink / raw)
  To: Michael Chan
  Cc: netdev, siva.kallam, prashant, mchan, drc, pavan.chebbi,
	Venkata Sai Duggi

Hi Michael,

On 10/3/2023 5:05 PM, Thinh Tran wrote:
> Thanks for the review.
> 
> On 10/3/2023 4:37 AM, Michael Chan wrote:
> 
>>
>> tg3_flag_set() calls set_bit() which is atomic.  The same is true for
>> tg3_flag_clear().  Maybe we just need some smp_mb__after_atomic() or
>> similar memory barriers.
>>
> 
> I did not see it being used in this driver. I'll try that.
> 
> Thinh Tran
> 

I tried that but still the intermittent problem still persists.
I have a fix that I'll describe in V2 of the patch

Thinh Tran

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2] net/tg3: fix race condition in tg3_reset_task()
  2023-10-02 18:55 [PATCH] net/tg3: fix race condition in tg3_reset_task_cancel() Thinh Tran
  2023-10-03  4:34 ` Pavan Chebbi
  2023-10-03  9:37 ` Michael Chan
@ 2023-11-02 16:12 ` Thinh Tran
  2023-11-02 17:27   ` Michael Chan
  2023-11-16 15:18   ` [PATCH v3] " Thinh Tran
  2 siblings, 2 replies; 22+ messages in thread
From: Thinh Tran @ 2023-11-02 16:12 UTC (permalink / raw)
  To: netdev, siva.kallam, prashant, mchan, pavan.chebbi, drc
  Cc: venkata.sai.duggi, Thinh Tran

When an EEH error is encountered by a PCI adapter, the EEH driver
modifies the PCI channel's state as shown below:

   enum {
      /* I/O channel is in normal state */
      pci_channel_io_normal = (__force pci_channel_state_t) 1,

      /* I/O to channel is blocked */
      pci_channel_io_frozen = (__force pci_channel_state_t) 2,

      /* PCI card is dead */
      pci_channel_io_perm_failure = (__force pci_channel_state_t) 3,
   };

If the same EEH error then causes the tg3 driver's transmit timeout
logic to execute, the tg3_tx_timeout() function schedules a reset
task via tg3_reset_task_schedule(), which may cause a race condition
between the tg3 and EEH driver as both attempt to recover the HW via
a reset action.

EEH driver gets error event
--> eeh_set_channel_state()
    and set device to one of
    error state above		scheduler: tg3_reset_task() get 
   				returned error from tg3_init_hw()
			     --> dev_close() shuts down the interface

tg3_io_slot_reset() and 
tg3_io_resume() fail to
reset/resume the device


To resolve this issue, we avoid the race condition by checking the PCI
channel state in the tg3_tx_timeout() function and skip the tg3 driver
initiated reset when the PCI channel is not in the normal state.  (The
driver has no access to tg3 device registers at this point and cannot
even complete the reset task successfully without external assistance.)
We'll leave the reset procedure to be managed by the EEH driver which
calls the tg3_io_error_detected(), tg3_io_slot_reset() and 
tg3_io_resume() functions as appropriate. 



Signed-off-by: Thinh Tran <thinhtr@linux.vnet.ibm.com>
Tested-by: Venkata Sai Duggi <venkata.sai.duggi@ibm.com>
Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>

---
 drivers/net/ethernet/broadcom/tg3.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 14b311196b8f..1c72ef05ab1b 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -7630,6 +7630,26 @@ static void tg3_tx_timeout(struct net_device *dev, unsigned int txqueue)
 {
 	struct tg3 *tp = netdev_priv(dev);
 
+	/* checking the PCI channel state for hard errors
+	 * for pci_channel_io_frozen case
+	 *   - I/O to channel is blocked.
+	 *     The EEH layer and I/O error detections will
+	 *     handle the reset procedure
+	 * for pci_channel_io_perm_failure  case
+	 *   - the PCI card is dead.
+	 *     The reset will not help
+	 * report the error for both cases and return.
+	 */
+	if (tp->pdev->error_state == pci_channel_io_frozen) {
+		netdev_err(dev, " %s, I/O to channel is blocked\n", __func__);
+		return;
+	}
+
+	if (tp->pdev->error_state == pci_channel_io_perm_failure) {
+		netdev_err(dev, " %s, adapter has failed permanently!\n", __func__);
+		return;
+	}
+
 	if (netif_msg_tx_err(tp)) {
 		netdev_err(dev, "transmit timed out, resetting\n");
 		tg3_dump_state(tp);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] net/tg3: fix race condition in tg3_reset_task()
  2023-11-02 16:12 ` [PATCH v2] net/tg3: fix race condition in tg3_reset_task() Thinh Tran
@ 2023-11-02 17:27   ` Michael Chan
  2023-11-02 20:37     ` Thinh Tran
  2023-11-16 15:18   ` [PATCH v3] " Thinh Tran
  1 sibling, 1 reply; 22+ messages in thread
From: Michael Chan @ 2023-11-02 17:27 UTC (permalink / raw)
  To: Thinh Tran
  Cc: netdev, siva.kallam, prashant, mchan, pavan.chebbi, drc,
	venkata.sai.duggi

[-- Attachment #1: Type: text/plain, Size: 2016 bytes --]

On Thu, Nov 2, 2023 at 9:16 AM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote:
>
> When an EEH error is encountered by a PCI adapter, the EEH driver
> modifies the PCI channel's state as shown below:
>
>    enum {
>       /* I/O channel is in normal state */
>       pci_channel_io_normal = (__force pci_channel_state_t) 1,
>
>       /* I/O to channel is blocked */
>       pci_channel_io_frozen = (__force pci_channel_state_t) 2,
>
>       /* PCI card is dead */
>       pci_channel_io_perm_failure = (__force pci_channel_state_t) 3,
>    };
>
> If the same EEH error then causes the tg3 driver's transmit timeout
> logic to execute, the tg3_tx_timeout() function schedules a reset
> task via tg3_reset_task_schedule(), which may cause a race condition
> between the tg3 and EEH driver as both attempt to recover the HW via
> a reset action.
>
> EEH driver gets error event
> --> eeh_set_channel_state()
>     and set device to one of
>     error state above           scheduler: tg3_reset_task() get
>                                 returned error from tg3_init_hw()
>                              --> dev_close() shuts down the interface
>
> tg3_io_slot_reset() and
> tg3_io_resume() fail to
> reset/resume the device
>
>
> To resolve this issue, we avoid the race condition by checking the PCI
> channel state in the tg3_tx_timeout() function and skip the tg3 driver
> initiated reset when the PCI channel is not in the normal state.  (The
> driver has no access to tg3 device registers at this point and cannot
> even complete the reset task successfully without external assistance.)
> We'll leave the reset procedure to be managed by the EEH driver which
> calls the tg3_io_error_detected(), tg3_io_slot_reset() and
> tg3_io_resume() functions as appropriate.

This scenario can affect other drivers too, right?  Shouldn't this be
handled in a higher layer before calling ->ndo_tx_timeout() so we
don't have to add this logic to all the other drivers?  Thanks.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] net/tg3: fix race condition in tg3_reset_task()
  2023-11-02 17:27   ` Michael Chan
@ 2023-11-02 20:37     ` Thinh Tran
  2023-11-14 17:39       ` Thinh Tran
  0 siblings, 1 reply; 22+ messages in thread
From: Thinh Tran @ 2023-11-02 20:37 UTC (permalink / raw)
  To: Michael Chan
  Cc: netdev, siva.kallam, prashant, mchan, pavan.chebbi, drc,
	venkata.sai.duggi

Thanks for the review.

On 11/2/2023 12:27 PM, Michael Chan wrote:
> 
> This scenario can affect other drivers too, right?  Shouldn't this be
> handled in a higher layer before calling ->ndo_tx_timeout() so we
> don't have to add this logic to all the other drivers?  Thanks.

Yes, it does. We can add this into the dev_watchdog() function, but 
further investigations are required. This is because each driver may 
have a different approach to handling its own ->ndo_tx_timeout() function.

Thinh Tran

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] net/tg3: fix race condition in tg3_reset_task()
  2023-11-02 20:37     ` Thinh Tran
@ 2023-11-14 17:39       ` Thinh Tran
  2023-11-14 21:03         ` Michael Chan
  0 siblings, 1 reply; 22+ messages in thread
From: Thinh Tran @ 2023-11-14 17:39 UTC (permalink / raw)
  To: Michael Chan
  Cc: netdev, siva.kallam, prashant, mchan, pavan.chebbi, drc,
	venkata.sai.duggi

Hi Michael,

On 11/2/2023 3:37 PM, Thinh Tran wrote:
> Thanks for the review.
> 
> On 11/2/2023 12:27 PM, Michael Chan wrote:
>>
>> This scenario can affect other drivers too, right?  Shouldn't this be
>> handled in a higher layer before calling ->ndo_tx_timeout() so we
>> don't have to add this logic to all the other drivers?  Thanks.
> 
> Yes, it does. We can add this into the dev_watchdog() function, but 
> further investigations are required. This is because each driver may 
> have a different approach to handling its own ->ndo_tx_timeout() function.
> 
> Thinh Tran

I attempted to relocate the fix into the dev_watchdog(), but we 
experienced crashes in various drivers, leading to the destabilization 
of other drivers.
I propose to accept the current patch, rather than introducing a new fix 
that may potentially create issues for other drivers

Thanks,
Thinh Tran

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] net/tg3: fix race condition in tg3_reset_task()
  2023-11-14 17:39       ` Thinh Tran
@ 2023-11-14 21:03         ` Michael Chan
  2023-11-15 18:23           ` Thinh Tran
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Chan @ 2023-11-14 21:03 UTC (permalink / raw)
  To: Thinh Tran
  Cc: netdev, siva.kallam, prashant, mchan, pavan.chebbi, drc,
	venkata.sai.duggi

[-- Attachment #1: Type: text/plain, Size: 614 bytes --]

On Tue, Nov 14, 2023 at 9:39 AM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote:
> I attempted to relocate the fix into the dev_watchdog(), but we
> experienced crashes in various drivers, leading to the destabilization
> of other drivers.

Could you provide more information about the crashes?  The
dev_watchdog() code already checks for netif_device_present() and
netif_running() and netif_carrier_ok() before proceeding to check for
TX timeout.  Why would adding some additional checks for PCI errors
cause problems?  Of course the additional checks should only be done
on PCI devices only.  Thanks.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] net/tg3: fix race condition in tg3_reset_task()
  2023-11-14 21:03         ` Michael Chan
@ 2023-11-15 18:23           ` Thinh Tran
  2023-11-15 18:56             ` Michael Chan
  0 siblings, 1 reply; 22+ messages in thread
From: Thinh Tran @ 2023-11-15 18:23 UTC (permalink / raw)
  To: Michael Chan
  Cc: netdev, siva.kallam, prashant, mchan, pavan.chebbi, drc,
	venkata.sai.duggi


On 11/14/2023 3:03 PM, Michael Chan wrote:
> 
> Could you provide more information about the crashes?  The
> dev_watchdog() code already checks for netif_device_present() and
> netif_running() and netif_carrier_ok() before proceeding to check for
> TX timeout.  Why would adding some additional checks for PCI errors
> cause problems?  Of course the additional checks should only be done
> on PCI devices only.  Thanks.

The checking for PCI errors is not the problem, avoiding calling drivers 
->ndo_tx_timeout() function, causing some issue.
Here is the fix in the dev_watchdog():

--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -24,6 +24,7 @@
  #include <linux/if_vlan.h>
  #include <linux/skb_array.h>
  #include <linux/if_macvlan.h>
+#include <linux/pci.h>
  #include <net/sch_generic.h>
  #include <net/pkt_sched.h>
  #include <net/dst.h>
@@ -521,12 +522,32 @@ static void dev_watchdog(struct timer_list *t)
               }

               if (unlikely(some_queue_timedout)) {
+                      struct pci_dev *pci_dev;
+
                        trace_net_dev_xmit_timeout(dev, i);
                        WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s 
(%s): transmit queue %u timed out\n",
                        dev->name, netdev_drivername(dev), i);
-                      netif_freeze_queues(dev);
-                      dev->netdev_ops->ndo_tx_timeout(dev, i);
-                      netif_unfreeze_queues(dev);
+                      pci_dev = to_pci_dev(dev->dev.parent);
+                      if (pci_dev && (pci_dev->error_state != 
pci_channel_io_normal)) {
+                    /* checking the PCI channel state for hard errors
+                     * for pci_channel_io_frozen case
+                     * - I/O to channel is blocked.
+                     *   The EEH layer and I/O error detections will
+                     *   handle the reset procedure
+                     * for pci_channel_io_perm_failure  case
+                     * - the PCI card is dead. The reset will not help
+                     * Report the error for either case, but not calling
+                     * the driver's ndo_tx_timeout() function.
+                     */
+                          if (pci_dev->error_state == 
pci_channel_io_frozen)
+                               netdev_err(dev, " %s, I/O to channel is 
blocked\n",dev->name);
+                          else
+                              netdev_err(dev, " %s, adapter has failed 
permanently!\n", dev->name );
+                      } else {
+                              netif_freeze_queues(dev);
+                              dev->netdev_ops->ndo_tx_timeout(dev, i);
+                              netif_unfreeze_queues(dev);
+                      }
               }
               if (!mod_timer(&dev->watchdog_timer,
                                 round_jiffies(jiffies +
--

from the crash dump on a system with 4-port BCM57800 adapter
crash> net
    NET_DEVICE     NAME       IP ADDRESS(ES)
c000000003581000  lo         127.0.0.1, ::1
c000000008f8b000  net0       9.3.233.69
c00000000315c000  enP23p1s0f0 101.1.233.69
c000000003164000  enP23p1s0f1 102.1.233.69
c00000000316c000  enP23p1s0f2 103.1.233.69
c000000003174000  enP23p1s0f3 104.1.233.69
crash> dmesg
[  752.115994] ------------[ cut here ]------------
[  752.115996] NETDEV WATCHDOG: enP23p1s0f0 (bnx2x): transmit queue 2 
timed out
[  752.116018] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:528 
dev_watchdog+0x3c8/0x420
[  752.116037] Modules linked in: rpadlpar_io rpaphp xsk_diag 
nft_counter nft_compat nf_tables nfnetlink rfkill bonding tls sunrpc 
binfmt_misc pseries_rng drm drm_panel_orientation_quirks xfs sd_mod 
t10_pi sg ibmvscsi scsi_transport_srp ibmveth bnx2x vmx_crypto mdio 
pseries_wdt libcrc32c dm_mirror dm_region_hash dm_log dm_mod fuse

------ snip the watchdog's dump ---------
------ in dev_watchdog() checking the PCI error and skipping 
->ndo_tx_timeout()  --------

[  752.116430] ---[ end trace 868b17f3f105be7b ]---
[  752.116437] bnx2x 0017:01:00.0 enP23p1s0f0:  enP23p1s0f0, I/O to 
channel is blocked
[  752.195975] bnx2x: [bnx2x_timer:5811(enP23p1s0f3)]MFW seems hanged: 
drv_pulse (0x2ca) != mcp_pulse (0x7fff)
[  752.195986] bnx2x: 
[bnx2x_acquire_hw_lock:2023(enP23p1s0f3)]lock_status 0xffffffff 
resource_bit 0x1
[  752.196507] bnx2x 0017:01:00.3 enP23p1s0f3: MDC/MDIO access timeout
[  752.196792] EEH: Recovering PHB#17-PE#10000
[  752.196795] EEH: PE location: N/A, PHB location: N/A
[  752.196797] EEH: Frozen PHB#17-PE#10000 detected
[  752.196798] EEH: Call Trace:
[  752.196799] EEH: [c00000000005102c] __eeh_send_failure_event+0x7c/0x160
[  752.196808] EEH: [c000000000049cf4] 
eeh_dev_check_failure.part.0+0x254/0x650
[  752.196812] EEH: [c00800000107c7a0] bnx2x_timer+0x1e8/0x250 [bnx2x]
[  752.196863] EEH: [c00000000023eb00] call_timer_fn+0x50/0x1c0

--------- snip here -------------

[  752.196913] EEH: This PCI device has failed 1 times in the last hour 
and will be permanently disabled after 5 failures.
[  752.196915] EEH: Notify device drivers to shutdown
[  752.196918] EEH: Beginning: 'error_detected(IO frozen)'
[  752.196920] PCI 0017:01:00.0#10000: EEH: Invoking 
bnx2x->error_detected(IO frozen)
[  752.196924] bnx2x: [bnx2x_io_error_detected:14190(enP23p1s0f0)]IO 
error detected
[  752.197024] bnx2x 0017:01:00.3 enP23p1s0f3: MDC/MDIO access timeout
[  752.197039] PCI 0017:01:00.0#10000: EEH: bnx2x driver reports: 'need 
reset'
[  752.197041] PCI 0017:01:00.1#10000: EEH: Invoking 
bnx2x->error_detected(IO frozen)
[  752.197042] bnx2x: [bnx2x_io_error_detected:14190(enP23p1s0f1)]IO 
error detected
[  752.197093] PCI 0017:01:00.1#10000: EEH: bnx2x driver reports: 'need 
reset'
[  752.197095] PCI 0017:01:00.2#10000: EEH: Invoking 
bnx2x->error_detected(IO frozen)
[  752.197096] bnx2x: [bnx2x_io_error_detected:14190(enP23p1s0f2)]IO 
error detected
[  752.197151] PCI 0017:01:00.2#10000: EEH: bnx2x driver reports: 'need 
reset'
[  752.197153] PCI 0017:01:00.3#10000: EEH: Invoking 
bnx2x->error_detected(IO frozen)
[  752.197154] bnx2x: [bnx2x_io_error_detected:14190(enP23p1s0f3)]IO 
error detected
[  752.197208] PCI 0017:01:00.3#10000: EEH: bnx2x driver reports: 'need 
reset'
[  752.197210] EEH: Finished:'error_detected(IO frozen)' with aggregate 
recovery state:'need reset'

-------------- snip here --------

[  754.407972] EEH: Beginning: 'slot_reset'
[  754.407978] PCI 0017:01:00.0#10000: EEH: Invoking bnx2x->slot_reset()
[  754.407981] bnx2x: [bnx2x_io_slot_reset:14225(enP23p1s0f0)]IO slot 
reset initializing...
[  754.408047] bnx2x 0017:01:00.0: enabling device (0140 -> 0142)
[  754.412432] bnx2x: [bnx2x_io_slot_reset:14241(enP23p1s0f0)]IO slot 
reset --> driver unload

--------- snip here ------------

[  764.526802] PCI 0017:01:00.0#10000: EEH: bnx2x driver reports: 
'recovered'
[  764.526806] PCI 0017:01:00.1#10000: EEH: Invoking bnx2x->slot_reset()
[  764.526808] bnx2x: [bnx2x_io_slot_reset:14225(enP23p1s0f1)]IO slot 
reset initializing...
[  764.526898] bnx2x 0017:01:00.1: enabling device (0140 -> 0142)
[  764.531117] bnx2x: [bnx2x_io_slot_reset:14241(enP23p1s0f1)]IO slot 
reset --> driver unload

---------- snip here --------

[  772.770957] bnx2x: [bnx2x_io_slot_reset:14241(enP23p1s0f3)]IO slot 
reset --> driver unload
[  772.886717] PCI 0017:01:00.3#10000: EEH: bnx2x driver reports: 
'recovered'
[  772.886720] EEH: Finished:'slot_reset' with aggregate recovery 
state:'recovered'
[  772.886721] EEH: Notify device driver to resume
[  772.886722] EEH: Beginning: 'resume'
[  772.886723] PCI 0017:01:00.0#10000: EEH: Invoking bnx2x->resume()
[  773.476919] bnx2x 0017:01:00.0 enP23p1s0f0: using MSI-X  IRQs: sp 55 
fp[0] 57 ... fp[7] 64
[  773.706115] bnx2x 0017:01:00.0 enP23p1s0f0: NIC Link is Up, 10000 
Mbps full duplex, Flow control: ON - receive & transmit
[  773.708230] PCI 0017:01:00.0#10000: EEH: bnx2x driver reports: 'none'
[  773.708234] PCI 0017:01:00.1#10000: EEH: Invoking bnx2x->resume()
[  774.307404] bnx2x 0017:01:00.1 enP23p1s0f1: using MSI-X  IRQs: sp 65 
fp[0] 67 ... fp[7] 74
[  774.546123] bnx2x 0017:01:00.1 enP23p1s0f1: NIC Link is Up, 10000 
Mbps full duplex, Flow control: ON - receive & transmit
[  774.548304] PCI 0017:01:00.1#10000: EEH: bnx2x driver reports: 'none'
[  774.548311] PCI 0017:01:00.2#10000: EEH: Invoking bnx2x->resume()
[  774.747483] bnx2x 0017:01:00.2 enP23p1s0f2: using MSI-X  IRQs: sp 75 
fp[0] 77 ... fp[7] 84
[  774.756111] bnx2x: [bnx2x_hw_stats_update:871(enP23p1s0f0)]NIG timer 
max (0)
[  775.038454] PCI 0017:01:00.2#10000: EEH: bnx2x driver reports: 'none'
[  775.038466] PCI 0017:01:00.3#10000: EEH: Invoking bnx2x->resume()
[  775.228049] bnx2x 0017:01:00.3 enP23p1s0f3: using MSI-X  IRQs: sp 85 
fp[0] 87 ... fp[7] 94
[  775.548237] PCI 0017:01:00.3#10000: EEH: bnx2x driver reports: 'none'
[  775.548245] EEH: Finished:'resume'
[  775.548247] EEH: Recovery successful.
[  775.556120] bnx2x: [bnx2x_hw_stats_update:871(enP23p1s0f1)]NIG timer 
max (0)
[ 1203.919654] bnx2x 0017:01:00.0 enP23p1s0f0: using MSI-X  IRQs: sp 55 
fp[0] 57 ... fp[7] 64
[ 1204.156946] bnx2x 0017:01:00.0 enP23p1s0f0: NIC Link is Up, 10000 
Mbps full duplex, Flow control: ON - receive & transmit
[ 1204.159011] IPv6: ADDRCONF(NETDEV_CHANGE): enP23p1s0f0: link becomes 
ready
[ 1204.386939] bnx2x 0017:01:00.0 enP23p1s0f0: NIC Link is Down
[ 1209.789617] bnx2x 0017:01:00.1 enP23p1s0f1: using MSI-X  IRQs: sp 65 
fp[0] 67 ... fp[7] 74
[ 1210.026894] bnx2x 0017:01:00.1 enP23p1s0f1: NIC Link is Up, 10000 
Mbps full duplex, Flow control: ON - receive & transmit
[ 1210.028955] IPv6: ADDRCONF(NETDEV_CHANGE): enP23p1s0f1: link becomes 
ready
[ 1210.357268] bnx2x 0017:01:00.0 enP23p1s0f0: NIC Link is Up, 10000 
Mbps full duplex, Flow control: ON - receive & transmit
[ 1214.526868] bnx2x 0017:01:00.1 enP23p1s0f1: NIC Link is Down
[ 1215.647561] bnx2x 0017:01:00.2 enP23p1s0f2: using MSI-X  IRQs: sp 75 
fp[0] 77 ... fp[7] 84
[ 1220.357087] bnx2x 0017:01:00.1 enP23p1s0f1: NIC Link is Up, 10000 
Mbps full duplex, Flow control: ON - receive & transmit
[ 1221.517564] bnx2x 0017:01:00.3 enP23p1s0f3: using MSI-X  IRQs: sp 85 
fp[0] 87 ... fp[7] 94
[ 1222.012323] systemd-rc-local-generator[16948]: /etc/rc.d/rc.local is 
not marked executable, skipping.
[ 1232.476909] bnx2x 0017:01:00.2 enP23p1s0f2: NIC Link is Up, 1000 Mbps 
full duplex, Flow control: ON - receive & transmit
[ 1232.476941] IPv6: ADDRCONF(NETDEV_CHANGE): enP23p1s0f2: link becomes 
ready
[ 1237.996937] bnx2x 0017:01:00.3 enP23p1s0f3: NIC Link is Up, 1000 Mbps 
full duplex, Flow control: ON - receive & transmit
[ 1237.996961] IPv6: ADDRCONF(NETDEV_CHANGE): enP23p1s0f3: link becomes 
ready


---------- snip here ---------
[ 1592.978832] Kernel attempted to write user page (e) - exploit 
attempt? (uid: 0)
[ 1592.978836] BUG: Kernel NULL pointer dereference on write at 0x0000000e
[ 1592.978838] Faulting instruction address: 0xc0080000010bb1e8
[ 1592.978841] Oops: Kernel access of bad area, sig: 11 [#1]
----------
crash> bt
PID: 41       TASK: c000000003d29b00  CPU: 5    COMMAND: "ksoftirqd/5"
  R0:  c0080000010bb1a0    R1:  c000000003b7b910    R2:  c008000001178000
  R3:  08000001173928be    R4:  c00c00000045ce40    R5:  00000000000028be
  R6:  0000000000000001    R7:  ffffffffffffffff    R8:  0000000000000000
  R9:  0000000000000010    R10: 0000000000000000    R11: c0080000010fee78
  R12: c000000000231cf0    R13: c000000fffff9080    R14: 0000000000000000
  R15: 0000000000000000    R16: 0000000000000001    R17: 0000000000000000
  R18: 08000001173928be    R19: 0000000000000000    R20: 0000000000000000
  R21: c0000001173928be    R22: c000000003164a00    R23: c000000012570200
  R24: 0000000000000000    R25: 0000000000000001    R26: c00000001c51e050
  R27: 0000000000000005    R28: c000000003164000    R29: 0000000000000000
  R30: c0000000d38a8ae0    R31: 0000000000000000
  NIP: c0080000010bb1e8    MSR: 800000000280b033    OR3: c000000000230128
  CTR: c000000000231cf0    LR:  c0080000010bb1a0    XER: 0000000020040000
  CCR: 0000000048008482    MQ:  0000000000000000    DAR: 000000000000000e
  DSISR: 0000000042000000     Syscall Result: 0000000000000000
  [NIP  : bnx2x_start_xmit+496]
  [LR   : bnx2x_start_xmit+424]
  #0 [c000000003b7b4e0] crash_kexec at c000000000279f8c
  #1 [c000000003b7b510] oops_end at c0000000000291a8
  #2 [c000000003b7b590] __bad_page_fault at c00000000008d1cc
  #3 [c000000003b7b600] data_access_common_virt at c0000000000088dc
  Data Access [300] exception frame:
  R0:  c0080000010bb1a0    R1:  c000000003b7b910    R2:  c008000001178000
  R3:  08000001173928be    R4:  c00c00000045ce40    R5:  00000000000028be
  R6:  0000000000000001    R7:  ffffffffffffffff    R8:  0000000000000000
  R9:  0000000000000010    R10: 0000000000000000    R11: c0080000010fee78
  R12: c000000000231cf0    R13: c000000fffff9080    R14: 0000000000000000
  R15: 0000000000000000    R16: 0000000000000001    R17: 0000000000000000
  R18: 08000001173928be    R19: 0000000000000000    R20: 0000000000000000
  R21: c0000001173928be    R22: c000000003164a00    R23: c000000012570200
  R24: 0000000000000000    R25: 0000000000000001    R26: c00000001c51e050
  R27: 0000000000000005    R28: c000000003164000    R29: 0000000000000000
  R30: c0000000d38a8ae0    R31: 0000000000000000
  NIP: c0080000010bb1e8    MSR: 800000000280b033    OR3: c000000000230128
  CTR: c000000000231cf0    LR:  c0080000010bb1a0    XER: 0000000020040000
  CCR: 0000000048008482    MQ:  0000000000000000    DAR: 000000000000000e
crash> bt
PID: 41       TASK: c000000003d29b00  CPU: 5    COMMAND: "ksoftirqd/5"
  R0:  c0080000010bb1a0    R1:  c000000003b7b910    R2:  c008000001178000
  R3:  08000001173928be    R4:  c00c00000045ce40    R5:  00000000000028be
  R6:  0000000000000001    R7:  ffffffffffffffff    R8:  0000000000000000
  R9:  0000000000000010    R10: 0000000000000000    R11: c0080000010fee78
  R12: c000000000231cf0    R13: c000000fffff9080    R14: 0000000000000000
  R15: 0000000000000000    R16: 0000000000000001    R17: 0000000000000000
  R18: 08000001173928be    R19: 0000000000000000    R20: 0000000000000000
  R21: c0000001173928be    R22: c000000003164a00    R23: c000000012570200
  R24: 0000000000000000    R25: 0000000000000001    R26: c00000001c51e050
  R27: 0000000000000005    R28: c000000003164000    R29: 0000000000000000
  R30: c0000000d38a8ae0    R31: 0000000000000000
  NIP: c0080000010bb1e8    MSR: 800000000280b033    OR3: c000000000230128
  CTR: c000000000231cf0    LR:  c0080000010bb1a0    XER: 0000000020040000
  CCR: 0000000048008482    MQ:  0000000000000000    DAR: 000000000000000e
  DSISR: 0000000042000000     Syscall Result: 0000000000000000
  [NIP  : bnx2x_start_xmit+496]
  [LR   : bnx2x_start_xmit+424]
  #0 [c000000003b7b4e0] crash_kexec at c000000000279f8c
  #1 [c000000003b7b510] oops_end at c0000000000291a8
  #2 [c000000003b7b590] __bad_page_fault at c00000000008d1cc
  #3 [c000000003b7b600] data_access_common_virt at c0000000000088dc
  Data Access [300] exception frame:
  R0:  c0080000010bb1a0    R1:  c000000003b7b910    R2:  c008000001178000
  R3:  08000001173928be    R4:  c00c00000045ce40    R5:  00000000000028be
  R6:  0000000000000001    R7:  ffffffffffffffff    R8:  0000000000000000
  R9:  0000000000000010    R10: 0000000000000000    R11: c0080000010fee78
  R12: c000000000231cf0    R13: c000000fffff9080    R14: 0000000000000000
  R15: 0000000000000000    R16: 0000000000000001    R17: 0000000000000000
  R18: 08000001173928be    R19: 0000000000000000    R20: 0000000000000000
  R21: c0000001173928be    R22: c000000003164a00    R23: c000000012570200
  R24: 0000000000000000    R25: 0000000000000001    R26: c00000001c51e050
  R27: 0000000000000005    R28: c000000003164000    R29: 0000000000000000
  R30: c0000000d38a8ae0    R31: 0000000000000000
  NIP: c0080000010bb1e8    MSR: 800000000280b033    OR3: c000000000230128
  CTR: c000000000231cf0    LR:  c0080000010bb1a0    XER: 0000000020040000
  CCR: 0000000048008482    MQ:  0000000000000000    DAR: 000000000000000e

crash> dis -s bnx2x_start_xmit+496
FILE: drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
LINE: 3858

   3853          /* get a tx_buf and first BD
   3854           * tx_start_bd may be changed during SPLIT,
   3855           * but first_bd will always stay first
   3856           */
   3857          tx_buf = &txdata->tx_buf_ring[TX_BD(pkt_prod)];
* 3858          tx_start_bd = &txdata->tx_desc_ring[bd_prod].start_bd;
   3859          first_bd = tx_start_bd;
   3860

I have not identified the root of this crash yet.

Regards,
Thinh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] net/tg3: fix race condition in tg3_reset_task()
  2023-11-15 18:23           ` Thinh Tran
@ 2023-11-15 18:56             ` Michael Chan
  2023-11-16 14:41               ` Thinh Tran
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Chan @ 2023-11-15 18:56 UTC (permalink / raw)
  To: Thinh Tran
  Cc: netdev, siva.kallam, prashant, mchan, pavan.chebbi, drc,
	venkata.sai.duggi

[-- Attachment #1: Type: text/plain, Size: 890 bytes --]

On Wed, Nov 15, 2023 at 10:23 AM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote:
>
>
> On 11/14/2023 3:03 PM, Michael Chan wrote:
> >
> > Could you provide more information about the crashes?  The
> > dev_watchdog() code already checks for netif_device_present() and
> > netif_running() and netif_carrier_ok() before proceeding to check for
> > TX timeout.  Why would adding some additional checks for PCI errors
> > cause problems?  Of course the additional checks should only be done
> > on PCI devices only.  Thanks.
>
> The checking for PCI errors is not the problem, avoiding calling drivers
> ->ndo_tx_timeout() function, causing some issue.

I see.  By skipping TX timeout during PCI errors, bnx2x crashes in
.ndo_start_xmit() after EEH error recovery.

I think it should be fine to fix the original EEH issue in tg3 then.
Please re-post the tg3 patch.  Thanks.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] net/tg3: fix race condition in tg3_reset_task()
  2023-11-15 18:56             ` Michael Chan
@ 2023-11-16 14:41               ` Thinh Tran
  0 siblings, 0 replies; 22+ messages in thread
From: Thinh Tran @ 2023-11-16 14:41 UTC (permalink / raw)
  To: Michael Chan
  Cc: netdev, siva.kallam, prashant, mchan, pavan.chebbi, drc,
	venkata.sai.duggi

I'll re-post the V2 patch shortly.
Thanks for the review.

Thinh Tran
On 11/15/2023 12:56 PM, Michael Chan wrote:
> On Wed, Nov 15, 2023 at 10:23 AM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote:
>>
>>
>> On 11/14/2023 3:03 PM, Michael Chan wrote:
>>>
>>> Could you provide more information about the crashes?  The
>>> dev_watchdog() code already checks for netif_device_present() and
>>> netif_running() and netif_carrier_ok() before proceeding to check for
>>> TX timeout.  Why would adding some additional checks for PCI errors
>>> cause problems?  Of course the additional checks should only be done
>>> on PCI devices only.  Thanks.
>>
>> The checking for PCI errors is not the problem, avoiding calling drivers
>> ->ndo_tx_timeout() function, causing some issue.
> 
> I see.  By skipping TX timeout during PCI errors, bnx2x crashes in
> .ndo_start_xmit() after EEH error recovery.
> 
> I think it should be fine to fix the original EEH issue in tg3 then.
> Please re-post the tg3 patch.  Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v3] net/tg3: fix race condition in tg3_reset_task()
  2023-11-02 16:12 ` [PATCH v2] net/tg3: fix race condition in tg3_reset_task() Thinh Tran
  2023-11-02 17:27   ` Michael Chan
@ 2023-11-16 15:18   ` Thinh Tran
  2023-11-16 21:34     ` Michael Chan
  2023-12-01  0:19     ` [PATCH v4] " Thinh Tran
  1 sibling, 2 replies; 22+ messages in thread
From: Thinh Tran @ 2023-11-16 15:18 UTC (permalink / raw)
  To: mchan
  Cc: pavan.chebbi, netdev, prashant, siva.kallam, drc,
	venkata.sai.duggi, thinhtr, edumazet, kuba, pabeni, davem

When an EEH error is encountered by a PCI adapter, the EEH driver
modifies the PCI channel's state as shown below:

   enum {
      /* I/O channel is in normal state */
      pci_channel_io_normal = (__force pci_channel_state_t) 1,

      /* I/O to channel is blocked */
      pci_channel_io_frozen = (__force pci_channel_state_t) 2,

      /* PCI card is dead */
      pci_channel_io_perm_failure = (__force pci_channel_state_t) 3,
   };

If the same EEH error then causes the tg3 driver's transmit timeout
logic to execute, the tg3_tx_timeout() function schedules a reset
task via tg3_reset_task_schedule(), which may cause a race condition
between the tg3 and EEH driver as both attempt to recover the HW via
a reset action.

EEH driver gets error event
--> eeh_set_channel_state()
    and set device to one of
    error state above		scheduler: tg3_reset_task() get 
   				returned error from tg3_init_hw()
			     --> dev_close() shuts down the interface

tg3_io_slot_reset() and 
tg3_io_resume() fail to
reset/resume the device


To resolve this issue, we avoid the race condition by checking the PCI
channel state in the tg3_tx_timeout() function and skip the tg3 driver
initiated reset when the PCI channel is not in the normal state.  (The
driver has no access to tg3 device registers at this point and cannot
even complete the reset task successfully without external assistance.)
We'll leave the reset procedure to be managed by the EEH driver which
calls the tg3_io_error_detected(), tg3_io_slot_reset() and 
tg3_io_resume() functions as appropriate. 



Signed-off-by: Thinh Tran <thinhtr@linux.vnet.ibm.com>
Tested-by: Venkata Sai Duggi <venkata.sai.duggi@ibm.com>
Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>

  v3: re-post the patch.
  v2: checking PCI errors in tg3_tx_timeout()

---
 drivers/net/ethernet/broadcom/tg3.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 14b311196b8f..1c72ef05ab1b 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -7630,6 +7630,26 @@ static void tg3_tx_timeout(struct net_device *dev, unsigned int txqueue)
 {
 	struct tg3 *tp = netdev_priv(dev);
 
+	/* checking the PCI channel state for hard errors
+	 * for pci_channel_io_frozen case
+	 *   - I/O to channel is blocked.
+	 *     The EEH layer and I/O error detections will
+	 *     handle the reset procedure
+	 * for pci_channel_io_perm_failure  case
+	 *   - the PCI card is dead.
+	 *     The reset will not help
+	 * report the error for both cases and return.
+	 */
+	if (tp->pdev->error_state == pci_channel_io_frozen) {
+		netdev_err(dev, " %s, I/O to channel is blocked\n", __func__);
+		return;
+	}
+
+	if (tp->pdev->error_state == pci_channel_io_perm_failure) {
+		netdev_err(dev, " %s, adapter has failed permanently!\n", __func__);
+		return;
+	}
+
 	if (netif_msg_tx_err(tp)) {
 		netdev_err(dev, "transmit timed out, resetting\n");
 		tg3_dump_state(tp);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] net/tg3: fix race condition in tg3_reset_task()
  2023-11-16 15:18   ` [PATCH v3] " Thinh Tran
@ 2023-11-16 21:34     ` Michael Chan
  2023-11-17 16:19       ` Thinh Tran
  2023-12-01  0:19     ` [PATCH v4] " Thinh Tran
  1 sibling, 1 reply; 22+ messages in thread
From: Michael Chan @ 2023-11-16 21:34 UTC (permalink / raw)
  To: Thinh Tran
  Cc: mchan, pavan.chebbi, netdev, prashant, siva.kallam, drc,
	venkata.sai.duggi, edumazet, kuba, pabeni, davem

[-- Attachment #1: Type: text/plain, Size: 1884 bytes --]

On Thu, Nov 16, 2023 at 7:19 AM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote:
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index 14b311196b8f..1c72ef05ab1b 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -7630,6 +7630,26 @@ static void tg3_tx_timeout(struct net_device *dev, unsigned int txqueue)
>  {
>         struct tg3 *tp = netdev_priv(dev);
>
> +       /* checking the PCI channel state for hard errors
> +        * for pci_channel_io_frozen case
> +        *   - I/O to channel is blocked.
> +        *     The EEH layer and I/O error detections will
> +        *     handle the reset procedure
> +        * for pci_channel_io_perm_failure  case
> +        *   - the PCI card is dead.
> +        *     The reset will not help
> +        * report the error for both cases and return.
> +        */
> +       if (tp->pdev->error_state == pci_channel_io_frozen) {
> +               netdev_err(dev, " %s, I/O to channel is blocked\n", __func__);
> +               return;
> +       }
> +
> +       if (tp->pdev->error_state == pci_channel_io_perm_failure) {
> +               netdev_err(dev, " %s, adapter has failed permanently!\n", __func__);
> +               return;
> +       }
> +

I think it will be better to add these 2 checks in tg3_reset_task().
We are already doing a similar check there for tp->pcierr_recovery so
it is better to centralize all the related checks in the same place.
I don't think tg3_dump_state() below will cause any problems.  We'll
probably read 0xffffffff for all registers and it will actually
confirm that the registers are not readable.

>         if (netif_msg_tx_err(tp)) {
>                 netdev_err(dev, "transmit timed out, resetting\n");
>                 tg3_dump_state(tp);
> --
> 2.25.1
>

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] net/tg3: fix race condition in tg3_reset_task()
  2023-11-16 21:34     ` Michael Chan
@ 2023-11-17 16:19       ` Thinh Tran
  2023-11-17 18:31         ` Michael Chan
  0 siblings, 1 reply; 22+ messages in thread
From: Thinh Tran @ 2023-11-17 16:19 UTC (permalink / raw)
  To: Michael Chan
  Cc: mchan, pavan.chebbi, netdev, prashant, siva.kallam, drc,
	venkata.sai.duggi, edumazet, kuba, pabeni, davem

On 11/16/2023 3:34 PM, Michael Chan wrote:
> 
> I think it will be better to add these 2 checks in tg3_reset_task().
> We are already doing a similar check there for tp->pcierr_recovery so
> it is better to centralize all the related checks in the same place.
> I don't think tg3_dump_state() below will cause any problems.  We'll
> probably read 0xffffffff for all registers and it will actually
> confirm that the registers are not readable.
> 

I'm concerned that race conditions could still occur during the handling 
of Partitionable Endpoint (PE) reset by the EEH driver. The issue lies 
in the dependency on the lower-level FW reset procedure. When the driver 
executes tg3_dump_state(), and then follows it with tg3_reset_task(), 
the EEH driver possibility changes in the PCI devices' state, but the 
MMIO and DMA reset procedures may not have completed yet. Leading to a 
crash in tg3_reset_task().  This patch tries to prevent that scenario.

While tg3_dump_state() is helpful, it also poses issues as it takes a 
considerable amount of time, approximately 1 or 2 seconds per device. 
Given our 4-port adapter, this could extend to more than 10 seconds to 
write to the dmesg buffer. With the default size, the dmesg buffer may 
be over-written and not retain all useful information.

Thanks,
Thinh Tran

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] net/tg3: fix race condition in tg3_reset_task()
  2023-11-17 16:19       ` Thinh Tran
@ 2023-11-17 18:31         ` Michael Chan
  2023-11-30 22:29           ` Thinh Tran
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Chan @ 2023-11-17 18:31 UTC (permalink / raw)
  To: Thinh Tran
  Cc: mchan, pavan.chebbi, netdev, prashant, siva.kallam, drc,
	venkata.sai.duggi, edumazet, kuba, pabeni, davem

[-- Attachment #1: Type: text/plain, Size: 2056 bytes --]

On Fri, Nov 17, 2023 at 8:19 AM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote:
>
>
> On 11/16/2023 3:34 PM, Michael Chan wrote:
> >
> > I think it will be better to add these 2 checks in tg3_reset_task().
> > We are already doing a similar check there for tp->pcierr_recovery so
> > it is better to centralize all the related checks in the same place.
> > I don't think tg3_dump_state() below will cause any problems.  We'll
> > probably read 0xffffffff for all registers and it will actually
> > confirm that the registers are not readable.
> >
>
> I'm concerned that race conditions could still occur during the handling
> of Partitionable Endpoint (PE) reset by the EEH driver. The issue lies
> in the dependency on the lower-level FW reset procedure. When the driver
> executes tg3_dump_state(), and then follows it with tg3_reset_task(),
> the EEH driver possibility changes in the PCI devices' state, but the
> MMIO and DMA reset procedures may not have completed yet. Leading to a
> crash in tg3_reset_task().  This patch tries to prevent that scenario.

It seems fragile if you are relying on such timing.  TG3_TX_TIMEOUT is
5 seconds but the actual time tg3_tx_timeout() is called varies
depending on when the TX queue is stopped.  So tg3_tx_timeout() will
be called 5 seconds or more after EEH if there are uncompleted TX
packets but we don't know precisely when.

>
> While tg3_dump_state() is helpful, it also poses issues as it takes a
> considerable amount of time, approximately 1 or 2 seconds per device.
> Given our 4-port adapter, this could extend to more than 10 seconds to
> write to the dmesg buffer. With the default size, the dmesg buffer may
> be over-written and not retain all useful information.
>

If tg3_dump_state() is not useful and fills the dmesg log with useless
data, we can add the same check in tg3_dump_state() and skip it.
tg3_dump_state() is also called by tg3_process_error() so we can avoid
dumping useless data if we hit tg3_process_error() during EEH or AER.

Thanks.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] net/tg3: fix race condition in tg3_reset_task()
  2023-11-17 18:31         ` Michael Chan
@ 2023-11-30 22:29           ` Thinh Tran
  0 siblings, 0 replies; 22+ messages in thread
From: Thinh Tran @ 2023-11-30 22:29 UTC (permalink / raw)
  To: Michael Chan
  Cc: mchan, pavan.chebbi, netdev, prashant, siva.kallam, drc,
	venkata.sai.duggi, edumazet, kuba, pabeni, davem


On 11/17/2023 12:31 PM, Michael Chan wrote:
> On Fri, Nov 17, 2023 at 8:19 AM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote:
>>
>>
>> On 11/16/2023 3:34 PM, Michael Chan wrote:
>>>
>>> I think it will be better to add these 2 checks in tg3_reset_task().
>>> We are already doing a similar check there for tp->pcierr_recovery so
>>> it is better to centralize all the related checks in the same place.
>>> I don't think tg3_dump_state() below will cause any problems.  We'll
>>> probably read 0xffffffff for all registers and it will actually
>>> confirm that the registers are not readable.
>>>
>>
>> I'm concerned that race conditions could still occur during the handling
>> of Partitionable Endpoint (PE) reset by the EEH driver. The issue lies
>> in the dependency on the lower-level FW reset procedure. When the driver
>> executes tg3_dump_state(), and then follows it with tg3_reset_task(),
>> the EEH driver possibility changes in the PCI devices' state, but the
>> MMIO and DMA reset procedures may not have completed yet. Leading to a
>> crash in tg3_reset_task().  This patch tries to prevent that scenario.
> 
> It seems fragile if you are relying on such timing.  TG3_TX_TIMEOUT is
> 5 seconds but the actual time tg3_tx_timeout() is called varies
> depending on when the TX queue is stopped.  So tg3_tx_timeout() will
> be called 5 seconds or more after EEH if there are uncompleted TX
> packets but we don't know precisely when.
> 
>>
>> While tg3_dump_state() is helpful, it also poses issues as it takes a
>> considerable amount of time, approximately 1 or 2 seconds per device.
>> Given our 4-port adapter, this could extend to more than 10 seconds to
>> write to the dmesg buffer. With the default size, the dmesg buffer may
>> be over-written and not retain all useful information.
>>
> 
> If tg3_dump_state() is not useful and fills the dmesg log with useless
> data, we can add the same check in tg3_dump_state() and skip it.
> tg3_dump_state() is also called by tg3_process_error() so we can avoid
> dumping useless data if we hit tg3_process_error() during EEH or AER.
> 
> Thanks.

I implemented the fix as you suggested and passed the tests.
I will send the next version of patch shortly.

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v4] net/tg3: fix race condition in tg3_reset_task()
  2023-11-16 15:18   ` [PATCH v3] " Thinh Tran
  2023-11-16 21:34     ` Michael Chan
@ 2023-12-01  0:19     ` Thinh Tran
  2023-12-01 16:50       ` Michael Chan
  2023-12-02  0:40       ` patchwork-bot+netdevbpf
  1 sibling, 2 replies; 22+ messages in thread
From: Thinh Tran @ 2023-12-01  0:19 UTC (permalink / raw)
  To: michael.chan
  Cc: davem, drc, edumazet, kuba, mchan, netdev, pabeni, pavan.chebbi,
	prashant, siva.kallam, thinhtr, Venkata Sai Duggi

When an EEH error is encountered by a PCI adapter, the EEH driver
modifies the PCI channel's state as shown below:

   enum {
      /* I/O channel is in normal state */
      pci_channel_io_normal = (__force pci_channel_state_t) 1,

      /* I/O to channel is blocked */
      pci_channel_io_frozen = (__force pci_channel_state_t) 2,

      /* PCI card is dead */
      pci_channel_io_perm_failure = (__force pci_channel_state_t) 3,
   };

If the same EEH error then causes the tg3 driver's transmit timeout
logic to execute, the tg3_tx_timeout() function schedules a reset
task via tg3_reset_task_schedule(), which may cause a race condition
between the tg3 and EEH driver as both attempt to recover the HW via
a reset action.

EEH driver gets error event
--> eeh_set_channel_state()
    and set device to one of
    error state above           scheduler: tg3_reset_task() get
                                returned error from tg3_init_hw()
                             --> dev_close() shuts down the interface
tg3_io_slot_reset() and
tg3_io_resume() fail to
reset/resume the device

To resolve this issue, we avoid the race condition by checking the PCI
channel state in the tg3_reset_task() function and skip the tg3 driver
initiated reset when the PCI channel is not in the normal state.  (The
driver has no access to tg3 device registers at this point and cannot
even complete the reset task successfully without external assistance.)
We'll leave the reset procedure to be managed by the EEH driver which
calls the tg3_io_error_detected(), tg3_io_slot_reset() and
tg3_io_resume() functions as appropriate.

Adding the same checking in tg3_dump_state() to avoid dumping all
device registers when the PCI channel is not in the normal state.


Signed-off-by: Thinh Tran <thinhtr@linux.vnet.ibm.com>
Tested-by: Venkata Sai Duggi <venkata.sai.duggi@ibm.com>
Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>

  v4: moving the PCI error checking to tg3_reset_task() and
      tg3_dump_state() 
  v3: re-post the patch.
  v2: checking PCI errors in tg3_tx_timeout()

---
 drivers/net/ethernet/broadcom/tg3.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 48b6191efa56..49f299c868a1 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -6474,6 +6474,14 @@ static void tg3_dump_state(struct tg3 *tp)
 	int i;
 	u32 *regs;
 
+	/* If it is a PCI error, all registers will be 0xffff,
+	 * we don't dump them out, just report the error and return
+	 */
+	if (tp->pdev->error_state != pci_channel_io_normal) {
+		netdev_err(tp->dev, "PCI channel ERROR!\n");
+		return;
+	}
+
 	regs = kzalloc(TG3_REG_BLK_SIZE, GFP_ATOMIC);
 	if (!regs)
 		return;
@@ -11259,7 +11267,8 @@ static void tg3_reset_task(struct work_struct *work)
 	rtnl_lock();
 	tg3_full_lock(tp, 0);
 
-	if (tp->pcierr_recovery || !netif_running(tp->dev)) {
+	if (tp->pcierr_recovery || !netif_running(tp->dev) ||
+	    tp->pdev->error_state != pci_channel_io_normal) {
 		tg3_flag_clear(tp, RESET_TASK_PENDING);
 		tg3_full_unlock(tp);
 		rtnl_unlock();
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v4] net/tg3: fix race condition in tg3_reset_task()
  2023-12-01  0:19     ` [PATCH v4] " Thinh Tran
@ 2023-12-01 16:50       ` Michael Chan
  2023-12-02  0:40       ` patchwork-bot+netdevbpf
  1 sibling, 0 replies; 22+ messages in thread
From: Michael Chan @ 2023-12-01 16:50 UTC (permalink / raw)
  To: Thinh Tran
  Cc: davem, drc, edumazet, kuba, mchan, netdev, pabeni, pavan.chebbi,
	prashant, siva.kallam, Venkata Sai Duggi

[-- Attachment #1: Type: text/plain, Size: 2365 bytes --]

On Thu, Nov 30, 2023 at 4:19 PM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote:
>
> When an EEH error is encountered by a PCI adapter, the EEH driver
> modifies the PCI channel's state as shown below:
>
>    enum {
>       /* I/O channel is in normal state */
>       pci_channel_io_normal = (__force pci_channel_state_t) 1,
>
>       /* I/O to channel is blocked */
>       pci_channel_io_frozen = (__force pci_channel_state_t) 2,
>
>       /* PCI card is dead */
>       pci_channel_io_perm_failure = (__force pci_channel_state_t) 3,
>    };
>
> If the same EEH error then causes the tg3 driver's transmit timeout
> logic to execute, the tg3_tx_timeout() function schedules a reset
> task via tg3_reset_task_schedule(), which may cause a race condition
> between the tg3 and EEH driver as both attempt to recover the HW via
> a reset action.
>
> EEH driver gets error event
> --> eeh_set_channel_state()
>     and set device to one of
>     error state above           scheduler: tg3_reset_task() get
>                                 returned error from tg3_init_hw()
>                              --> dev_close() shuts down the interface
> tg3_io_slot_reset() and
> tg3_io_resume() fail to
> reset/resume the device
>
> To resolve this issue, we avoid the race condition by checking the PCI
> channel state in the tg3_reset_task() function and skip the tg3 driver
> initiated reset when the PCI channel is not in the normal state.  (The
> driver has no access to tg3 device registers at this point and cannot
> even complete the reset task successfully without external assistance.)
> We'll leave the reset procedure to be managed by the EEH driver which
> calls the tg3_io_error_detected(), tg3_io_slot_reset() and
> tg3_io_resume() functions as appropriate.
>
> Adding the same checking in tg3_dump_state() to avoid dumping all
> device registers when the PCI channel is not in the normal state.
>
>
> Signed-off-by: Thinh Tran <thinhtr@linux.vnet.ibm.com>
> Tested-by: Venkata Sai Duggi <venkata.sai.duggi@ibm.com>
> Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>
>
>   v4: moving the PCI error checking to tg3_reset_task() and
>       tg3_dump_state()
>   v3: re-post the patch.
>   v2: checking PCI errors in tg3_tx_timeout()

Thanks.
Reviewed-by: Michael Chan <michael.chan@broadcom.com>

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4] net/tg3: fix race condition in tg3_reset_task()
  2023-12-01  0:19     ` [PATCH v4] " Thinh Tran
  2023-12-01 16:50       ` Michael Chan
@ 2023-12-02  0:40       ` patchwork-bot+netdevbpf
  1 sibling, 0 replies; 22+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-12-02  0:40 UTC (permalink / raw)
  To: Thinh Tran
  Cc: michael.chan, davem, drc, edumazet, kuba, mchan, netdev, pabeni,
	pavan.chebbi, prashant, siva.kallam, venkata.sai.duggi

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 30 Nov 2023 18:19:11 -0600 you wrote:
> When an EEH error is encountered by a PCI adapter, the EEH driver
> modifies the PCI channel's state as shown below:
> 
>    enum {
>       /* I/O channel is in normal state */
>       pci_channel_io_normal = (__force pci_channel_state_t) 1,
> 
> [...]

Here is the summary with links:
  - [v4] net/tg3: fix race condition in tg3_reset_task()
    https://git.kernel.org/netdev/net/c/16b55b1f2269

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-12-02  0:40 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-10-02 18:55 [PATCH] net/tg3: fix race condition in tg3_reset_task_cancel() Thinh Tran
2023-10-03  4:34 ` Pavan Chebbi
2023-10-31 23:18   ` Thinh Tran
2023-10-03  9:37 ` Michael Chan
2023-10-03 22:05   ` Thinh Tran
2023-11-02 16:02     ` Thinh Tran
2023-11-02 16:12 ` [PATCH v2] net/tg3: fix race condition in tg3_reset_task() Thinh Tran
2023-11-02 17:27   ` Michael Chan
2023-11-02 20:37     ` Thinh Tran
2023-11-14 17:39       ` Thinh Tran
2023-11-14 21:03         ` Michael Chan
2023-11-15 18:23           ` Thinh Tran
2023-11-15 18:56             ` Michael Chan
2023-11-16 14:41               ` Thinh Tran
2023-11-16 15:18   ` [PATCH v3] " Thinh Tran
2023-11-16 21:34     ` Michael Chan
2023-11-17 16:19       ` Thinh Tran
2023-11-17 18:31         ` Michael Chan
2023-11-30 22:29           ` Thinh Tran
2023-12-01  0:19     ` [PATCH v4] " Thinh Tran
2023-12-01 16:50       ` Michael Chan
2023-12-02  0:40       ` patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).