Netdev List
 help / color / mirror / Atom feed
* [PATCH net v3] net: tg3: guard napi_disable and pci_disable_device calls
@ 2026-05-25  9:49 Yury Murashka
  2026-05-26  3:37 ` Pavan Chebbi
  2026-05-27  1:42 ` Jakub Kicinski
  0 siblings, 2 replies; 5+ messages in thread
From: Yury Murashka @ 2026-05-25  9:49 UTC (permalink / raw)
  To: Pavan Chebbi, mchan
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel, Yury Murashka

During PCIe hot-plug events, uncorrectable errors can be reported and
AER recovery for the tg3 device is initiated by the AER kernel driver.
The tg3_io_error_detected function is the AER error recovery handler.

From tg3_io_error_detected, we call tg3_netif_stop->tg3_napi_disable->
napi_disable and return PCI_ERS_RESULT_NEED_RESET on non-fatal error.
We expect that during AER recovery tg3_io_slot_reset and tg3_io_resume
will be called. But AER error recovery can fail. For example, when one
of PCIe devices on the same bus reports PCI_ERS_RESULT_NO_AER_DRIVER.
As a result, tg3_io_slot_reset and tg3_io_resume are not called, PCIe
device is disabled and NAPI is disabled (pci_disable_device and
napi_disable are called from tg3_io_error_detected). Then we can try to
disable PCIe link and napi_disable will be called again:

  napi_disable+0x1b/0x1b0
  tg3_napi_disable+0x89/0xa0 [tg3]
  tg3_netif_stop+0x37/0xe3 [tg3]
  tg3_stop+0x30/0x160 [tg3]
  tg3_close+0x2a/0x60 [tg3]
  __dev_close_many+0xad/0x130
  dev_close_many+0xb2/0x190
  unregister_netdevice_many_notify+0x19d/0xa00
  unregister_netdevice_queue+0xf8/0x140
  unregister_netdev+0x1c/0x30
  tg3_remove_one+0xaa/0x150 [tg3]
  pci_device_remove+0x42/0xb0
  device_release_driver_internal+0x19c/0x200
  pci_stop_bus_device+0x85/0xb0
  pci_stop_bus_device+0x2c/0xb0
  pci_stop_bus_device+0x2c/0xb0
  pci_stop_and_remove_bus_device+0x12/0x20
  pciehp_unconfigure_device+0x9f/0x160
  pciehp_disable_slot+0x67/0x100
  pciehp_handle_presence_or_link_change+0x77/0x350

This is not expected by napi_disable and a thread can be locked in
napi_disable forever. We have pcierr_recovery to cover a similar issue,
but for fatal errors. We cannot reuse this flag because it is reset in
tg3_io_resume, but it is not called when AER recovery fails.

Similarly, if an AER error is reported and tg3_io_error_detected calls
pci_disable_device, a subsequent device removal via tg3_remove_one or
tg3_shutdown will call pci_disable_device again for the already-disabled
device.

Add a napi_enabled flag to struct tg3 to track whether napi_enable has
been called. Guard tg3_napi_disable() so it returns early if NAPI was
not previously enabled. Also guard pci_disable_device() calls in
tg3_remove_one() and tg3_shutdown() with pci_is_enabled() to avoid
disabling an already-disabled device.

Fixes: b45aa2f6192e ("tg3: Add EEH support")
Signed-off-by: Yury Murashka <yurypm@arista.com>
---
 drivers/net/ethernet/broadcom/tg3.c | 14 ++++++++++++--
 drivers/net/ethernet/broadcom/tg3.h |  1 +
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c
b/drivers/net/ethernet/broadcom/tg3.c
index 73a4b569b03e..86995e689519 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -7398,6 +7398,11 @@ static void tg3_napi_disable(struct tg3 *tp)
        struct tg3_napi *tnapi;
        int i;

+       if (!tp->napi_enabled)
+               return;
+
+       tp->napi_enabled = false;
+
        for (i = tp->irq_cnt - 1; i >= 0; i--) {
                tnapi = &tp->napi[i];
                if (tnapi->tx_buffers) {
@@ -7420,6 +7425,8 @@ static void tg3_napi_enable(struct tg3 *tp)
        struct tg3_napi *tnapi;
        int i;

+       tp->napi_enabled = true;
+
        for (i = 0; i < tp->irq_cnt; i++) {
                tnapi = &tp->napi[i];
                napi_enable_locked(&tnapi->napi);
@@ -17718,6 +17725,7 @@ static int tg3_init_one(struct pci_dev *pdev,
        tp->tx_mode = TG3_DEF_TX_MODE;
        tp->irq_sync = 1;
        tp->pcierr_recovery = false;
+       tp->napi_enabled = false;

        if (tg3_debug > 0)
                tp->msg_enable = tg3_debug;
@@ -18099,7 +18107,8 @@ static void tg3_remove_one(struct pci_dev *pdev)
                }
                free_netdev(dev);
                pci_release_regions(pdev);
-               pci_disable_device(pdev);
+               if (pci_is_enabled(pdev))
+                       pci_disable_device(pdev);
        }
 }

@@ -18257,7 +18266,8 @@ static void tg3_shutdown(struct pci_dev *pdev)

        rtnl_unlock();

-       pci_disable_device(pdev);
+       if (pci_is_enabled(pdev))
+               pci_disable_device(pdev);
 }

 /**
diff --git a/drivers/net/ethernet/broadcom/tg3.h
b/drivers/net/ethernet/broadcom/tg3.h
index a9e7f88fa26d..34fb771e8a86 100644
--- a/drivers/net/ethernet/broadcom/tg3.h
+++ b/drivers/net/ethernet/broadcom/tg3.h
@@ -3429,6 +3429,7 @@ struct tg3 {
        struct device                   *hwmon_dev;
        bool                            link_up;
        bool                            pcierr_recovery;
+       bool                            napi_enabled;

        u32                             ape_hb;
        unsigned long                   ape_hb_interval;
--
2.51.0

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH net v3] net: tg3: guard napi_disable and pci_disable_device calls
  2026-05-25  9:49 [PATCH net v3] net: tg3: guard napi_disable and pci_disable_device calls Yury Murashka
@ 2026-05-26  3:37 ` Pavan Chebbi
  2026-05-26  7:09   ` ALOK TIWARI
  2026-05-27  1:42 ` Jakub Kicinski
  1 sibling, 1 reply; 5+ messages in thread
From: Pavan Chebbi @ 2026-05-26  3:37 UTC (permalink / raw)
  To: Yury Murashka
  Cc: mchan, andrew+netdev, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2951 bytes --]

On Mon, May 25, 2026 at 3:19 PM Yury Murashka <yurypm@arista.com> wrote:
>
> During PCIe hot-plug events, uncorrectable errors can be reported and
> AER recovery for the tg3 device is initiated by the AER kernel driver.
> The tg3_io_error_detected function is the AER error recovery handler.
>
> From tg3_io_error_detected, we call tg3_netif_stop->tg3_napi_disable->
> napi_disable and return PCI_ERS_RESULT_NEED_RESET on non-fatal error.
> We expect that during AER recovery tg3_io_slot_reset and tg3_io_resume
> will be called. But AER error recovery can fail. For example, when one
> of PCIe devices on the same bus reports PCI_ERS_RESULT_NO_AER_DRIVER.
> As a result, tg3_io_slot_reset and tg3_io_resume are not called, PCIe
> device is disabled and NAPI is disabled (pci_disable_device and
> napi_disable are called from tg3_io_error_detected). Then we can try to
> disable PCIe link and napi_disable will be called again:
>
>   napi_disable+0x1b/0x1b0
>   tg3_napi_disable+0x89/0xa0 [tg3]
>   tg3_netif_stop+0x37/0xe3 [tg3]
>   tg3_stop+0x30/0x160 [tg3]
>   tg3_close+0x2a/0x60 [tg3]
>   __dev_close_many+0xad/0x130
>   dev_close_many+0xb2/0x190
>   unregister_netdevice_many_notify+0x19d/0xa00
>   unregister_netdevice_queue+0xf8/0x140
>   unregister_netdev+0x1c/0x30
>   tg3_remove_one+0xaa/0x150 [tg3]
>   pci_device_remove+0x42/0xb0
>   device_release_driver_internal+0x19c/0x200
>   pci_stop_bus_device+0x85/0xb0
>   pci_stop_bus_device+0x2c/0xb0
>   pci_stop_bus_device+0x2c/0xb0
>   pci_stop_and_remove_bus_device+0x12/0x20
>   pciehp_unconfigure_device+0x9f/0x160
>   pciehp_disable_slot+0x67/0x100
>   pciehp_handle_presence_or_link_change+0x77/0x350
>
> This is not expected by napi_disable and a thread can be locked in
> napi_disable forever. We have pcierr_recovery to cover a similar issue,
> but for fatal errors. We cannot reuse this flag because it is reset in
> tg3_io_resume, but it is not called when AER recovery fails.
>
> Similarly, if an AER error is reported and tg3_io_error_detected calls
> pci_disable_device, a subsequent device removal via tg3_remove_one or
> tg3_shutdown will call pci_disable_device again for the already-disabled
> device.
>
> Add a napi_enabled flag to struct tg3 to track whether napi_enable has
> been called. Guard tg3_napi_disable() so it returns early if NAPI was
> not previously enabled. Also guard pci_disable_device() calls in
> tg3_remove_one() and tg3_shutdown() with pci_is_enabled() to avoid
> disabling an already-disabled device.
>
> Fixes: b45aa2f6192e ("tg3: Add EEH support")
> Signed-off-by: Yury Murashka <yurypm@arista.com>
> ---
>  drivers/net/ethernet/broadcom/tg3.c | 14 ++++++++++++--
>  drivers/net/ethernet/broadcom/tg3.h |  1 +
>  2 files changed, 13 insertions(+), 2 deletions(-)
>

LGTM.
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
nit: The revision-change log for the patch is missing..

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5469 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net v3] net: tg3: guard napi_disable and pci_disable_device calls
  2026-05-26  3:37 ` Pavan Chebbi
@ 2026-05-26  7:09   ` ALOK TIWARI
  2026-05-26  8:58     ` Pavan Chebbi
  0 siblings, 1 reply; 5+ messages in thread
From: ALOK TIWARI @ 2026-05-26  7:09 UTC (permalink / raw)
  To: Pavan Chebbi, Yury Murashka
  Cc: mchan, andrew+netdev, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel



On 5/26/2026 9:07 AM, Pavan Chebbi wrote:
> On Mon, May 25, 2026 at 3:19 PM Yury Murashka <yurypm@arista.com> wrote:
>>
>> During PCIe hot-plug events, uncorrectable errors can be reported and
>> AER recovery for the tg3 device is initiated by the AER kernel driver.
>> The tg3_io_error_detected function is the AER error recovery handler.
>>
>>  From tg3_io_error_detected, we call tg3_netif_stop->tg3_napi_disable->
>> napi_disable and return PCI_ERS_RESULT_NEED_RESET on non-fatal error.
>> We expect that during AER recovery tg3_io_slot_reset and tg3_io_resume
>> will be called. But AER error recovery can fail. For example, when one
>> of PCIe devices on the same bus reports PCI_ERS_RESULT_NO_AER_DRIVER.
>> As a result, tg3_io_slot_reset and tg3_io_resume are not called, PCIe
>> device is disabled and NAPI is disabled (pci_disable_device and
>> napi_disable are called from tg3_io_error_detected). Then we can try to
>> disable PCIe link and napi_disable will be called again:
>>
>>    napi_disable+0x1b/0x1b0
>>    tg3_napi_disable+0x89/0xa0 [tg3]
>>    tg3_netif_stop+0x37/0xe3 [tg3]
>>    tg3_stop+0x30/0x160 [tg3]
>>    tg3_close+0x2a/0x60 [tg3]
>>    __dev_close_many+0xad/0x130
>>    dev_close_many+0xb2/0x190
>>    unregister_netdevice_many_notify+0x19d/0xa00
>>    unregister_netdevice_queue+0xf8/0x140
>>    unregister_netdev+0x1c/0x30
>>    tg3_remove_one+0xaa/0x150 [tg3]
>>    pci_device_remove+0x42/0xb0
>>    device_release_driver_internal+0x19c/0x200
>>    pci_stop_bus_device+0x85/0xb0
>>    pci_stop_bus_device+0x2c/0xb0
>>    pci_stop_bus_device+0x2c/0xb0
>>    pci_stop_and_remove_bus_device+0x12/0x20
>>    pciehp_unconfigure_device+0x9f/0x160
>>    pciehp_disable_slot+0x67/0x100
>>    pciehp_handle_presence_or_link_change+0x77/0x350
>>
>> This is not expected by napi_disable and a thread can be locked in
>> napi_disable forever. We have pcierr_recovery to cover a similar issue,
>> but for fatal errors. We cannot reuse this flag because it is reset in
>> tg3_io_resume, but it is not called when AER recovery fails.
>>
>> Similarly, if an AER error is reported and tg3_io_error_detected calls
>> pci_disable_device, a subsequent device removal via tg3_remove_one or
>> tg3_shutdown will call pci_disable_device again for the already-disabled
>> device.
>>
>> Add a napi_enabled flag to struct tg3 to track whether napi_enable has
>> been called. Guard tg3_napi_disable() so it returns early if NAPI was
>> not previously enabled. Also guard pci_disable_device() calls in
>> tg3_remove_one() and tg3_shutdown() with pci_is_enabled() to avoid
>> disabling an already-disabled device.
>>
>> Fixes: b45aa2f6192e ("tg3: Add EEH support")
>> Signed-off-by: Yury Murashka <yurypm@arista.com>
>> ---
>>   drivers/net/ethernet/broadcom/tg3.c | 14 ++++++++++++--
>>   drivers/net/ethernet/broadcom/tg3.h |  1 +
>>   2 files changed, 13 insertions(+), 2 deletions(-)
>>
> 
> LGTM.
> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
> nit: The revision-change log for the patch is missing..

It seems this patch addresses two separate issues in the AER recovery flow:

NAPI enable/disable state handling
PCI device enabled/disabled state handling

Would it make sense to split this into two patches?

Thanks,
Alok


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net v3] net: tg3: guard napi_disable and pci_disable_device calls
  2026-05-26  7:09   ` ALOK TIWARI
@ 2026-05-26  8:58     ` Pavan Chebbi
  0 siblings, 0 replies; 5+ messages in thread
From: Pavan Chebbi @ 2026-05-26  8:58 UTC (permalink / raw)
  To: ALOK TIWARI
  Cc: Yury Murashka, mchan, andrew+netdev, davem, edumazet, kuba,
	pabeni, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3647 bytes --]

On Tue, May 26, 2026 at 12:39 PM ALOK TIWARI <alok.a.tiwari@oracle.com> wrote:
>
>
>
> On 5/26/2026 9:07 AM, Pavan Chebbi wrote:
> > On Mon, May 25, 2026 at 3:19 PM Yury Murashka <yurypm@arista.com> wrote:
> >>
> >> During PCIe hot-plug events, uncorrectable errors can be reported and
> >> AER recovery for the tg3 device is initiated by the AER kernel driver.
> >> The tg3_io_error_detected function is the AER error recovery handler.
> >>
> >>  From tg3_io_error_detected, we call tg3_netif_stop->tg3_napi_disable->
> >> napi_disable and return PCI_ERS_RESULT_NEED_RESET on non-fatal error.
> >> We expect that during AER recovery tg3_io_slot_reset and tg3_io_resume
> >> will be called. But AER error recovery can fail. For example, when one
> >> of PCIe devices on the same bus reports PCI_ERS_RESULT_NO_AER_DRIVER.
> >> As a result, tg3_io_slot_reset and tg3_io_resume are not called, PCIe
> >> device is disabled and NAPI is disabled (pci_disable_device and
> >> napi_disable are called from tg3_io_error_detected). Then we can try to
> >> disable PCIe link and napi_disable will be called again:
> >>
> >>    napi_disable+0x1b/0x1b0
> >>    tg3_napi_disable+0x89/0xa0 [tg3]
> >>    tg3_netif_stop+0x37/0xe3 [tg3]
> >>    tg3_stop+0x30/0x160 [tg3]
> >>    tg3_close+0x2a/0x60 [tg3]
> >>    __dev_close_many+0xad/0x130
> >>    dev_close_many+0xb2/0x190
> >>    unregister_netdevice_many_notify+0x19d/0xa00
> >>    unregister_netdevice_queue+0xf8/0x140
> >>    unregister_netdev+0x1c/0x30
> >>    tg3_remove_one+0xaa/0x150 [tg3]
> >>    pci_device_remove+0x42/0xb0
> >>    device_release_driver_internal+0x19c/0x200
> >>    pci_stop_bus_device+0x85/0xb0
> >>    pci_stop_bus_device+0x2c/0xb0
> >>    pci_stop_bus_device+0x2c/0xb0
> >>    pci_stop_and_remove_bus_device+0x12/0x20
> >>    pciehp_unconfigure_device+0x9f/0x160
> >>    pciehp_disable_slot+0x67/0x100
> >>    pciehp_handle_presence_or_link_change+0x77/0x350
> >>
> >> This is not expected by napi_disable and a thread can be locked in
> >> napi_disable forever. We have pcierr_recovery to cover a similar issue,
> >> but for fatal errors. We cannot reuse this flag because it is reset in
> >> tg3_io_resume, but it is not called when AER recovery fails.
> >>
> >> Similarly, if an AER error is reported and tg3_io_error_detected calls
> >> pci_disable_device, a subsequent device removal via tg3_remove_one or
> >> tg3_shutdown will call pci_disable_device again for the already-disabled
> >> device.
> >>
> >> Add a napi_enabled flag to struct tg3 to track whether napi_enable has
> >> been called. Guard tg3_napi_disable() so it returns early if NAPI was
> >> not previously enabled. Also guard pci_disable_device() calls in
> >> tg3_remove_one() and tg3_shutdown() with pci_is_enabled() to avoid
> >> disabling an already-disabled device.
> >>
> >> Fixes: b45aa2f6192e ("tg3: Add EEH support")
> >> Signed-off-by: Yury Murashka <yurypm@arista.com>
> >> ---
> >>   drivers/net/ethernet/broadcom/tg3.c | 14 ++++++++++++--
> >>   drivers/net/ethernet/broadcom/tg3.h |  1 +
> >>   2 files changed, 13 insertions(+), 2 deletions(-)
> >>
> >
> > LGTM.
> > Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
> > nit: The revision-change log for the patch is missing..
>
> It seems this patch addresses two separate issues in the AER recovery flow:
>
> NAPI enable/disable state handling
> PCI device enabled/disabled state handling
>
> Would it make sense to split this into two patches?

I am fine with a single patch, both guards together make the solution
complete which has a single root case.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5469 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net v3] net: tg3: guard napi_disable and pci_disable_device calls
  2026-05-25  9:49 [PATCH net v3] net: tg3: guard napi_disable and pci_disable_device calls Yury Murashka
  2026-05-26  3:37 ` Pavan Chebbi
@ 2026-05-27  1:42 ` Jakub Kicinski
  1 sibling, 0 replies; 5+ messages in thread
From: Jakub Kicinski @ 2026-05-27  1:42 UTC (permalink / raw)
  To: Yury Murashka
  Cc: Pavan Chebbi, mchan, andrew+netdev, davem, edumazet, pabeni,
	netdev, linux-kernel

On Mon, 25 May 2026 10:49:03 +0100 Yury Murashka wrote:
> Subject: [PATCH net v3] net: tg3: guard napi_disable and pci_disable_device calls

The "net" in the subject indicates that this patch is meant for this
tree:

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/

to which it currently does not apply. So please rebase & repost.
-- 
pw-bot: cr

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-05-27  1:42 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-25  9:49 [PATCH net v3] net: tg3: guard napi_disable and pci_disable_device calls Yury Murashka
2026-05-26  3:37 ` Pavan Chebbi
2026-05-26  7:09   ` ALOK TIWARI
2026-05-26  8:58     ` Pavan Chebbi
2026-05-27  1:42 ` Jakub Kicinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox