* [Intel-wired-lan] [PATCH net] igb: clean up in all error paths when enabling SR-IOV
@ 2023-08-24 9:16 Corinna Vinschen
2023-08-24 11:02 ` Akihiko Odaki
2023-08-24 13:07 ` Corinna Vinschen
0 siblings, 2 replies; 6+ messages in thread
From: Corinna Vinschen @ 2023-08-24 9:16 UTC (permalink / raw)
To: jesse.brandeburg, anthony.l.nguyen, davem, kuba, Akihiko Odaki,
intel-wired-lan, netdev
After commit 50f303496d92 ("igb: Enable SR-IOV after reinit"), removing
the igb module could hang or crash (depending on the machine) when the
module has been loaded with the max_vfs parameter set to some value != 0.
In case of one test machine with a dual port 82580, this hang occured:
[ 232.480687] igb 0000:41:00.1: removed PHC on enp65s0f1
[ 233.093257] igb 0000:41:00.1: IOV Disabled
[ 233.329969] pcieport 0000:40:01.0: AER: Multiple Uncorrected (Non-Fatal) err0
[ 233.340302] igb 0000:41:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fata)
[ 233.352248] igb 0000:41:00.0: device [8086:1516] error status/mask=00100000
[ 233.361088] igb 0000:41:00.0: [20] UnsupReq (First)
[ 233.368183] igb 0000:41:00.0: AER: TLP Header: 40000001 0000040f cdbfc00c c
[ 233.376846] igb 0000:41:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fata)
[ 233.388779] igb 0000:41:00.1: device [8086:1516] error status/mask=00100000
[ 233.397629] igb 0000:41:00.1: [20] UnsupReq (First)
[ 233.404736] igb 0000:41:00.1: AER: TLP Header: 40000001 0000040f cdbfc00c c
[ 233.538214] pci 0000:41:00.1: AER: can't recover (no error_detected callback)
[ 233.538401] igb 0000:41:00.0: removed PHC on enp65s0f0
[ 233.546197] pcieport 0000:40:01.0: AER: device recovery failed
[ 234.157244] igb 0000:41:00.0: IOV Disabled
[ 371.619705] INFO: task irq/35-aerdrv:257 blocked for more than 122 seconds.
[ 371.627489] Not tainted 6.4.0-dirty #2
[ 371.632257] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this.
[ 371.641000] task:irq/35-aerdrv state:D stack:0 pid:257 ppid:2 f0
[ 371.650330] Call Trace:
[ 371.653061] <TASK>
[ 371.655407] __schedule+0x20e/0x660
[ 371.659313] schedule+0x5a/0xd0
[ 371.662824] schedule_preempt_disabled+0x11/0x20
[ 371.667983] __mutex_lock.constprop.0+0x372/0x6c0
[ 371.673237] ? __pfx_aer_root_reset+0x10/0x10
[ 371.678105] report_error_detected+0x25/0x1c0
[ 371.682974] ? __pfx_report_normal_detected+0x10/0x10
[ 371.688618] pci_walk_bus+0x72/0x90
[ 371.692519] pcie_do_recovery+0xb2/0x330
[ 371.696899] aer_process_err_devices+0x117/0x170
[ 371.702055] aer_isr+0x1c0/0x1e0
[ 371.705661] ? __set_cpus_allowed_ptr+0x54/0xa0
[ 371.710723] ? __pfx_irq_thread_fn+0x10/0x10
[ 371.715496] irq_thread_fn+0x20/0x60
[ 371.719491] irq_thread+0xe6/0x1b0
[ 371.723291] ? __pfx_irq_thread_dtor+0x10/0x10
[ 371.728255] ? __pfx_irq_thread+0x10/0x10
[ 371.732731] kthread+0xe2/0x110
[ 371.736243] ? __pfx_kthread+0x10/0x10
[ 371.740430] ret_from_fork+0x2c/0x50
[ 371.744428] </TASK>
The reproducer was a simple script:
#!/bin/sh
for i in `seq 1 5`; do
modprobe -rv igb
modprobe -v igb max_vfs=1
sleep 1
modprobe -rv igb
done
It turned out that this could only be reproduce on 82580 (quad and
dual-port), but not on 82576, i350 and i210. Further debugging showed
that igb_enable_sriov()'s call to pci_enable_sriov() is failing, because
dev->is_physfn is 0 on 82580.
Prior to commit 50f303496d92 ("igb: Enable SR-IOV after reinit"),
igb_enable_sriov() jumped into the "err_out" cleanup branch. After this
commit it only returned the error code.
So the cleanup didn't take place, and the incorrect VF setup in the
igb_adapter structure fooled the igb driver into assuming that VFs have
been set up where no VF actually existed.
Fix this problem by cleaning up again if pci_enable_sriov() fails.
Fixes: 50f303496d92 ("igb: Enable SR-IOV after reinit")
Signed-off-by: Corinna Vinschen <vinschen@redhat.com>
---
drivers/net/ethernet/intel/igb/igb_main.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 9a2561409b06..42ab9ca7f97e 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3827,8 +3827,11 @@ static int igb_enable_sriov(struct pci_dev *pdev, int num_vfs, bool reinit)
}
/* only call pci_enable_sriov() if no VFs are allocated already */
- if (!old_vfs)
+ if (!old_vfs) {
err = pci_enable_sriov(pdev, adapter->vfs_allocated_count);
+ if (err)
+ goto err_out;
+ }
goto out;
--
2.41.0
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [Intel-wired-lan] [PATCH net] igb: clean up in all error paths when enabling SR-IOV
2023-08-24 9:16 [Intel-wired-lan] [PATCH net] igb: clean up in all error paths when enabling SR-IOV Corinna Vinschen
@ 2023-08-24 11:02 ` Akihiko Odaki
2023-08-24 13:07 ` Corinna Vinschen
1 sibling, 0 replies; 6+ messages in thread
From: Akihiko Odaki @ 2023-08-24 11:02 UTC (permalink / raw)
To: Corinna Vinschen, jesse.brandeburg, anthony.l.nguyen, davem, kuba,
intel-wired-lan, netdev
On 2023/08/24 18:16, Corinna Vinschen wrote:
> After commit 50f303496d92 ("igb: Enable SR-IOV after reinit"), removing
> the igb module could hang or crash (depending on the machine) when the
> module has been loaded with the max_vfs parameter set to some value != 0.
>
> In case of one test machine with a dual port 82580, this hang occured:
>
> [ 232.480687] igb 0000:41:00.1: removed PHC on enp65s0f1
> [ 233.093257] igb 0000:41:00.1: IOV Disabled
> [ 233.329969] pcieport 0000:40:01.0: AER: Multiple Uncorrected (Non-Fatal) err0
> [ 233.340302] igb 0000:41:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fata)
> [ 233.352248] igb 0000:41:00.0: device [8086:1516] error status/mask=00100000
> [ 233.361088] igb 0000:41:00.0: [20] UnsupReq (First)
> [ 233.368183] igb 0000:41:00.0: AER: TLP Header: 40000001 0000040f cdbfc00c c
> [ 233.376846] igb 0000:41:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fata)
> [ 233.388779] igb 0000:41:00.1: device [8086:1516] error status/mask=00100000
> [ 233.397629] igb 0000:41:00.1: [20] UnsupReq (First)
> [ 233.404736] igb 0000:41:00.1: AER: TLP Header: 40000001 0000040f cdbfc00c c
> [ 233.538214] pci 0000:41:00.1: AER: can't recover (no error_detected callback)
> [ 233.538401] igb 0000:41:00.0: removed PHC on enp65s0f0
> [ 233.546197] pcieport 0000:40:01.0: AER: device recovery failed
> [ 234.157244] igb 0000:41:00.0: IOV Disabled
> [ 371.619705] INFO: task irq/35-aerdrv:257 blocked for more than 122 seconds.
> [ 371.627489] Not tainted 6.4.0-dirty #2
> [ 371.632257] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this.
> [ 371.641000] task:irq/35-aerdrv state:D stack:0 pid:257 ppid:2 f0
> [ 371.650330] Call Trace:
> [ 371.653061] <TASK>
> [ 371.655407] __schedule+0x20e/0x660
> [ 371.659313] schedule+0x5a/0xd0
> [ 371.662824] schedule_preempt_disabled+0x11/0x20
> [ 371.667983] __mutex_lock.constprop.0+0x372/0x6c0
> [ 371.673237] ? __pfx_aer_root_reset+0x10/0x10
> [ 371.678105] report_error_detected+0x25/0x1c0
> [ 371.682974] ? __pfx_report_normal_detected+0x10/0x10
> [ 371.688618] pci_walk_bus+0x72/0x90
> [ 371.692519] pcie_do_recovery+0xb2/0x330
> [ 371.696899] aer_process_err_devices+0x117/0x170
> [ 371.702055] aer_isr+0x1c0/0x1e0
> [ 371.705661] ? __set_cpus_allowed_ptr+0x54/0xa0
> [ 371.710723] ? __pfx_irq_thread_fn+0x10/0x10
> [ 371.715496] irq_thread_fn+0x20/0x60
> [ 371.719491] irq_thread+0xe6/0x1b0
> [ 371.723291] ? __pfx_irq_thread_dtor+0x10/0x10
> [ 371.728255] ? __pfx_irq_thread+0x10/0x10
> [ 371.732731] kthread+0xe2/0x110
> [ 371.736243] ? __pfx_kthread+0x10/0x10
> [ 371.740430] ret_from_fork+0x2c/0x50
> [ 371.744428] </TASK>
>
> The reproducer was a simple script:
>
> #!/bin/sh
> for i in `seq 1 5`; do
> modprobe -rv igb
> modprobe -v igb max_vfs=1
> sleep 1
> modprobe -rv igb
> done
>
> It turned out that this could only be reproduce on 82580 (quad and
> dual-port), but not on 82576, i350 and i210. Further debugging showed
> that igb_enable_sriov()'s call to pci_enable_sriov() is failing, because
> dev->is_physfn is 0 on 82580.
>
> Prior to commit 50f303496d92 ("igb: Enable SR-IOV after reinit"),
> igb_enable_sriov() jumped into the "err_out" cleanup branch. After this
> commit it only returned the error code.
>
> So the cleanup didn't take place, and the incorrect VF setup in the
> igb_adapter structure fooled the igb driver into assuming that VFs have
> been set up where no VF actually existed.
>
> Fix this problem by cleaning up again if pci_enable_sriov() fails.
>
> Fixes: 50f303496d92 ("igb: Enable SR-IOV after reinit")
> Signed-off-by: Corinna Vinschen <vinschen@redhat.com>
> ---
> drivers/net/ethernet/intel/igb/igb_main.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index 9a2561409b06..42ab9ca7f97e 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -3827,8 +3827,11 @@ static int igb_enable_sriov(struct pci_dev *pdev, int num_vfs, bool reinit)
> }
>
> /* only call pci_enable_sriov() if no VFs are allocated already */
> - if (!old_vfs)
> + if (!old_vfs) {
> err = pci_enable_sriov(pdev, adapter->vfs_allocated_count);
> + if (err)
> + goto err_out;
> + }
>
> goto out;
>
Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com>
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Intel-wired-lan] [PATCH net] igb: clean up in all error paths when enabling SR-IOV
2023-08-24 9:16 [Intel-wired-lan] [PATCH net] igb: clean up in all error paths when enabling SR-IOV Corinna Vinschen
2023-08-24 11:02 ` Akihiko Odaki
@ 2023-08-24 13:07 ` Corinna Vinschen
2023-08-29 23:09 ` Tony Nguyen
1 sibling, 1 reply; 6+ messages in thread
From: Corinna Vinschen @ 2023-08-24 13:07 UTC (permalink / raw)
To: jesse.brandeburg, anthony.l.nguyen, intel-wired-lan
Question to the Intel folks:
On Aug 24 11:16, Corinna Vinschen wrote:
> After commit 50f303496d92 ("igb: Enable SR-IOV after reinit"), removing
> the igb module could hang or crash (depending on the machine) when the
> module has been loaded with the max_vfs parameter set to some value != 0.
>
> In case of one test machine with a dual port 82580, this hang occured:
> [...]
> The reproducer was a simple script:
>
> #!/bin/sh
> for i in `seq 1 5`; do
> modprobe -rv igb
> modprobe -v igb max_vfs=1
> sleep 1
> modprobe -rv igb
> done
>
> It turned out that this could only be reproduce on 82580 (quad and
> dual-port), but not on 82576, i350 and i210. Further debugging showed
> that igb_enable_sriov()'s call to pci_enable_sriov() is failing, because
> dev->is_physfn is 0 on 82580.
Along these lines, isn't the first and foremost bug that igb_enable_sriov()
has been called for this NIC at all? In terms of patches, shouldn't the
guard expression in igb_probe_vfs()
/* Virtualization features not supported on i210 family. */
if ((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211))
return;
get changed to:
/* Virtualization features not supported on i210 and 82580 family. */
if ((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211) ||
(hw->mac.type == e1000_82580))
return;
or, to make it independent of the actual HW:
/* Virtualization features not supported? */
if (!pdev->is_physfn)
return;
Thanks,
Corinna
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Intel-wired-lan] [PATCH net] igb: clean up in all error paths when enabling SR-IOV
2023-08-24 13:07 ` Corinna Vinschen
@ 2023-08-29 23:09 ` Tony Nguyen
2023-08-31 8:10 ` Corinna Vinschen
0 siblings, 1 reply; 6+ messages in thread
From: Tony Nguyen @ 2023-08-29 23:09 UTC (permalink / raw)
To: jesse.brandeburg, intel-wired-lan, Corinna Vinschen
On 8/24/2023 6:07 AM, Corinna Vinschen wrote:
> Question to the Intel folks:
>
> On Aug 24 11:16, Corinna Vinschen wrote:
>> After commit 50f303496d92 ("igb: Enable SR-IOV after reinit"), removing
>> the igb module could hang or crash (depending on the machine) when the
>> module has been loaded with the max_vfs parameter set to some value != 0.
>>
>> In case of one test machine with a dual port 82580, this hang occured:
>> [...]
>> The reproducer was a simple script:
>>
>> #!/bin/sh
>> for i in `seq 1 5`; do
>> modprobe -rv igb
>> modprobe -v igb max_vfs=1
>> sleep 1
>> modprobe -rv igb
>> done
>>
>> It turned out that this could only be reproduce on 82580 (quad and
>> dual-port), but not on 82576, i350 and i210. Further debugging showed
>> that igb_enable_sriov()'s call to pci_enable_sriov() is failing, because
>> dev->is_physfn is 0 on 82580.
>
> Along these lines, isn't the first and foremost bug that igb_enable_sriov()
> has been called for this NIC at all? In terms of patches, shouldn't the
> guard expression in igb_probe_vfs()
>
> /* Virtualization features not supported on i210 family. */
> if ((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211))
> return;
>
> get changed to:
>
> /* Virtualization features not supported on i210 and 82580 family. */
> if ((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211) ||
> (hw->mac.type == e1000_82580))
> return;
Hi Corinna,
Adding 82580 to this seems like a good change; did you want to submit a
patch to do this?
Thanks,
Tony
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Intel-wired-lan] [PATCH net] igb: clean up in all error paths when enabling SR-IOV
2023-08-29 23:09 ` Tony Nguyen
@ 2023-08-31 8:10 ` Corinna Vinschen
2023-09-11 8:02 ` Romanowski, Rafal
0 siblings, 1 reply; 6+ messages in thread
From: Corinna Vinschen @ 2023-08-31 8:10 UTC (permalink / raw)
To: Tony Nguyen; +Cc: intel-wired-lan, jesse.brandeburg
On Aug 29 16:09, Tony Nguyen wrote:
> On 8/24/2023 6:07 AM, Corinna Vinschen wrote:
> > Question to the Intel folks:
> >
> > On Aug 24 11:16, Corinna Vinschen wrote:
> > > After commit 50f303496d92 ("igb: Enable SR-IOV after reinit"), removing
> > > the igb module could hang or crash (depending on the machine) when the
> > > module has been loaded with the max_vfs parameter set to some value != 0.
> > >
> > > In case of one test machine with a dual port 82580, this hang occured:
> > > [...]
> > > The reproducer was a simple script:
> > >
> > > #!/bin/sh
> > > for i in `seq 1 5`; do
> > > modprobe -rv igb
> > > modprobe -v igb max_vfs=1
> > > sleep 1
> > > modprobe -rv igb
> > > done
> > >
> > > It turned out that this could only be reproduce on 82580 (quad and
> > > dual-port), but not on 82576, i350 and i210. Further debugging showed
> > > that igb_enable_sriov()'s call to pci_enable_sriov() is failing, because
> > > dev->is_physfn is 0 on 82580.
> >
> > Along these lines, isn't the first and foremost bug that igb_enable_sriov()
> > has been called for this NIC at all? In terms of patches, shouldn't the
> > guard expression in igb_probe_vfs()
> >
> > /* Virtualization features not supported on i210 family. */
> > if ((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211))
> > return;
> >
> > get changed to:
> >
> > /* Virtualization features not supported on i210 and 82580 family. */
> > if ((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211) ||
> > (hw->mac.type == e1000_82580))
> > return;
>
> Hi Corinna,
>
> Adding 82580 to this seems like a good change; did you want to submit a
> patch to do this?
Hi Tony,
sure, I just sent the patch.
Thanks,
Corinna
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Intel-wired-lan] [PATCH net] igb: clean up in all error paths when enabling SR-IOV
2023-08-31 8:10 ` Corinna Vinschen
@ 2023-09-11 8:02 ` Romanowski, Rafal
0 siblings, 0 replies; 6+ messages in thread
From: Romanowski, Rafal @ 2023-09-11 8:02 UTC (permalink / raw)
To: Vinschen, Corinna, Nguyen, Anthony L
Cc: intel-wired-lan@lists.osuosl.org, Brandeburg, Jesse
> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of
> Corinna Vinschen
> Sent: Thursday, August 31, 2023 10:10 AM
> To: Nguyen, Anthony L <anthony.l.nguyen@intel.com>
> Cc: intel-wired-lan@lists.osuosl.org; Brandeburg, Jesse
> <jesse.brandeburg@intel.com>
> Subject: Re: [Intel-wired-lan] [PATCH net] igb: clean up in all error paths when
> enabling SR-IOV
>
> On Aug 29 16:09, Tony Nguyen wrote:
> > On 8/24/2023 6:07 AM, Corinna Vinschen wrote:
> > > Question to the Intel folks:
> > >
> > > On Aug 24 11:16, Corinna Vinschen wrote:
> > > > After commit 50f303496d92 ("igb: Enable SR-IOV after reinit"),
> > > > removing the igb module could hang or crash (depending on the
> > > > machine) when the module has been loaded with the max_vfs parameter
> set to some value != 0.
> > > >
> > > > In case of one test machine with a dual port 82580, this hang occured:
> > > > [...]
> > > > The reproducer was a simple script:
> > > >
> > > > #!/bin/sh
> > > > for i in `seq 1 5`; do
> > > > modprobe -rv igb
> > > > modprobe -v igb max_vfs=1
> > > > sleep 1
> > > > modprobe -rv igb
> > > > done
> > > >
> > > > It turned out that this could only be reproduce on 82580 (quad and
> > > > dual-port), but not on 82576, i350 and i210. Further debugging
> > > > showed that igb_enable_sriov()'s call to pci_enable_sriov() is
> > > > failing, because
> > > > dev->is_physfn is 0 on 82580.
> > >
> > > Along these lines, isn't the first and foremost bug that
> > > igb_enable_sriov() has been called for this NIC at all? In terms of
> > > patches, shouldn't the guard expression in igb_probe_vfs()
> > >
> > > /* Virtualization features not supported on i210 family. */
> > > if ((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211))
> > > return;
> > >
> > > get changed to:
> > >
> > > /* Virtualization features not supported on i210 and 82580 family. */
> > > if ((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211)
> ||
> > > (hw->mac.type == e1000_82580))
> > > return;
> >
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2023-09-11 8:02 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-24 9:16 [Intel-wired-lan] [PATCH net] igb: clean up in all error paths when enabling SR-IOV Corinna Vinschen
2023-08-24 11:02 ` Akihiko Odaki
2023-08-24 13:07 ` Corinna Vinschen
2023-08-29 23:09 ` Tony Nguyen
2023-08-31 8:10 ` Corinna Vinschen
2023-09-11 8:02 ` Romanowski, Rafal
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox