* [PATCH net] iavf: fix hang on reboot with ice
@ 2023-03-10 12:26 Stefan Assmann
2023-03-10 17:24 ` Michal Kubiak
0 siblings, 1 reply; 5+ messages in thread
From: Stefan Assmann @ 2023-03-10 12:26 UTC (permalink / raw)
To: intel-wired-lan
Cc: netdev, anthony.l.nguyen, patryk.piotrowski, slawomirx.laba,
sassmann
When a system with E810 with existing VFs gets rebooted the following
hang may be observed.
Pid 1 is hung in iavf_remove(), part of a network driver:
PID: 1 TASK: ffff965400e5a340 CPU: 24 COMMAND: "systemd-shutdow"
#0 [ffffaad04005fa50] __schedule at ffffffff8b3239cb
#1 [ffffaad04005fae8] schedule at ffffffff8b323e2d
#2 [ffffaad04005fb00] schedule_hrtimeout_range_clock at ffffffff8b32cebc
#3 [ffffaad04005fb80] usleep_range_state at ffffffff8b32c930
#4 [ffffaad04005fbb0] iavf_remove at ffffffffc12b9b4c [iavf]
#5 [ffffaad04005fbf0] pci_device_remove at ffffffff8add7513
#6 [ffffaad04005fc10] device_release_driver_internal at ffffffff8af08baa
#7 [ffffaad04005fc40] pci_stop_bus_device at ffffffff8adcc5fc
#8 [ffffaad04005fc60] pci_stop_and_remove_bus_device at ffffffff8adcc81e
#9 [ffffaad04005fc70] pci_iov_remove_virtfn at ffffffff8adf9429
#10 [ffffaad04005fca8] sriov_disable at ffffffff8adf98e4
#11 [ffffaad04005fcc8] ice_free_vfs at ffffffffc04bb2c8 [ice]
#12 [ffffaad04005fd10] ice_remove at ffffffffc04778fe [ice]
#13 [ffffaad04005fd38] ice_shutdown at ffffffffc0477946 [ice]
#14 [ffffaad04005fd50] pci_device_shutdown at ffffffff8add58f1
#15 [ffffaad04005fd70] device_shutdown at ffffffff8af05386
#16 [ffffaad04005fd98] kernel_restart at ffffffff8a92a870
#17 [ffffaad04005fda8] __do_sys_reboot at ffffffff8a92abd6
#18 [ffffaad04005fee0] do_syscall_64 at ffffffff8b317159
#19 [ffffaad04005ff08] __context_tracking_enter at ffffffff8b31b6fc
#20 [ffffaad04005ff18] syscall_exit_to_user_mode at ffffffff8b31b50d
#21 [ffffaad04005ff28] do_syscall_64 at ffffffff8b317169
#22 [ffffaad04005ff50] entry_SYSCALL_64_after_hwframe at ffffffff8b40009b
RIP: 00007f1baa5c13d7 RSP: 00007fffbcc55a98 RFLAGS: 00000202
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1baa5c13d7
RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead
RBP: 00007fffbcc55ca0 R8: 0000000000000000 R9: 00007fffbcc54e90
R10: 00007fffbcc55050 R11: 0000000000000202 R12: 0000000000000005
R13: 0000000000000000 R14: 00007fffbcc55af0 R15: 0000000000000000
ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b
During reboot all drivers PM shutdown callbacks are invoked.
In iavf_shutdown() the adapter state is changed to __IAVF_REMOVE.
In ice_shutdown() the call chain above is executed, which at some point
calls iavf_remove(). However iavf_remove() expects the VF to be in one
of the states __IAVF_RUNNING, __IAVF_DOWN or __IAVF_INIT_FAILED. If
that's not the case it sleeps forever.
So if iavf_shutdown() gets invoked before ice_shutdown() the system will
hang indefinitely because the adapter is already in state __IAVF_REMOVE.
Fix this by adding __IAVF_REMOVE to the list of allowed states in
iavf_remove().
Fixes: 974578017fc1 ("iavf: Add waiting so the port is initialized in remove")
Reported-by: Marius Cornea <mcornea@redhat.com>
Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
---
drivers/net/ethernet/intel/iavf/iavf_main.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index 3273aeb8fa67..83ef3a343ef0 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -5062,7 +5062,8 @@ static void iavf_remove(struct pci_dev *pdev)
mutex_lock(&adapter->crit_lock);
if (adapter->state == __IAVF_RUNNING ||
adapter->state == __IAVF_DOWN ||
- adapter->state == __IAVF_INIT_FAILED) {
+ adapter->state == __IAVF_INIT_FAILED ||
+ adapter->state == __IAVF_REMOVE) {
mutex_unlock(&adapter->crit_lock);
break;
}
--
2.39.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH net] iavf: fix hang on reboot with ice
2023-03-10 12:26 [PATCH net] iavf: fix hang on reboot with ice Stefan Assmann
@ 2023-03-10 17:24 ` Michal Kubiak
2023-03-13 6:48 ` Stefan Assmann
0 siblings, 1 reply; 5+ messages in thread
From: Michal Kubiak @ 2023-03-10 17:24 UTC (permalink / raw)
To: Stefan Assmann
Cc: intel-wired-lan, netdev, anthony.l.nguyen, patryk.piotrowski,
slawomirx.laba
On Fri, Mar 10, 2023 at 01:26:53PM +0100, Stefan Assmann wrote:
> When a system with E810 with existing VFs gets rebooted the following
> hang may be observed.
>
> Pid 1 is hung in iavf_remove(), part of a network driver:
> PID: 1 TASK: ffff965400e5a340 CPU: 24 COMMAND: "systemd-shutdow"
> #0 [ffffaad04005fa50] __schedule at ffffffff8b3239cb
> #1 [ffffaad04005fae8] schedule at ffffffff8b323e2d
> #2 [ffffaad04005fb00] schedule_hrtimeout_range_clock at ffffffff8b32cebc
> #3 [ffffaad04005fb80] usleep_range_state at ffffffff8b32c930
> #4 [ffffaad04005fbb0] iavf_remove at ffffffffc12b9b4c [iavf]
> #5 [ffffaad04005fbf0] pci_device_remove at ffffffff8add7513
> #6 [ffffaad04005fc10] device_release_driver_internal at ffffffff8af08baa
> #7 [ffffaad04005fc40] pci_stop_bus_device at ffffffff8adcc5fc
> #8 [ffffaad04005fc60] pci_stop_and_remove_bus_device at ffffffff8adcc81e
> #9 [ffffaad04005fc70] pci_iov_remove_virtfn at ffffffff8adf9429
> #10 [ffffaad04005fca8] sriov_disable at ffffffff8adf98e4
> #11 [ffffaad04005fcc8] ice_free_vfs at ffffffffc04bb2c8 [ice]
> #12 [ffffaad04005fd10] ice_remove at ffffffffc04778fe [ice]
> #13 [ffffaad04005fd38] ice_shutdown at ffffffffc0477946 [ice]
> #14 [ffffaad04005fd50] pci_device_shutdown at ffffffff8add58f1
> #15 [ffffaad04005fd70] device_shutdown at ffffffff8af05386
> #16 [ffffaad04005fd98] kernel_restart at ffffffff8a92a870
> #17 [ffffaad04005fda8] __do_sys_reboot at ffffffff8a92abd6
> #18 [ffffaad04005fee0] do_syscall_64 at ffffffff8b317159
> #19 [ffffaad04005ff08] __context_tracking_enter at ffffffff8b31b6fc
> #20 [ffffaad04005ff18] syscall_exit_to_user_mode at ffffffff8b31b50d
> #21 [ffffaad04005ff28] do_syscall_64 at ffffffff8b317169
> #22 [ffffaad04005ff50] entry_SYSCALL_64_after_hwframe at ffffffff8b40009b
> RIP: 00007f1baa5c13d7 RSP: 00007fffbcc55a98 RFLAGS: 00000202
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1baa5c13d7
> RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead
> RBP: 00007fffbcc55ca0 R8: 0000000000000000 R9: 00007fffbcc54e90
> R10: 00007fffbcc55050 R11: 0000000000000202 R12: 0000000000000005
> R13: 0000000000000000 R14: 00007fffbcc55af0 R15: 0000000000000000
> ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b
>
> During reboot all drivers PM shutdown callbacks are invoked.
> In iavf_shutdown() the adapter state is changed to __IAVF_REMOVE.
> In ice_shutdown() the call chain above is executed, which at some point
> calls iavf_remove(). However iavf_remove() expects the VF to be in one
> of the states __IAVF_RUNNING, __IAVF_DOWN or __IAVF_INIT_FAILED. If
> that's not the case it sleeps forever.
> So if iavf_shutdown() gets invoked before ice_shutdown() the system will
> hang indefinitely because the adapter is already in state __IAVF_REMOVE.
>
> Fix this by adding __IAVF_REMOVE to the list of allowed states in
> iavf_remove().
>
> Fixes: 974578017fc1 ("iavf: Add waiting so the port is initialized in remove")
> Reported-by: Marius Cornea <mcornea@redhat.com>
> Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
> ---
> drivers/net/ethernet/intel/iavf/iavf_main.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
> index 3273aeb8fa67..83ef3a343ef0 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_main.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
> @@ -5062,7 +5062,8 @@ static void iavf_remove(struct pci_dev *pdev)
> mutex_lock(&adapter->crit_lock);
> if (adapter->state == __IAVF_RUNNING ||
> adapter->state == __IAVF_DOWN ||
> - adapter->state == __IAVF_INIT_FAILED) {
> + adapter->state == __IAVF_INIT_FAILED ||
> + adapter->state == __IAVF_REMOVE) {
> mutex_unlock(&adapter->crit_lock);
> break;
> }
Adding the __IAVF_REMOVE state to the loop break condition seems OK to
me.
I would only consider adding a timeout to this loop to prevent endless hangs
for other potential corner cases.
Thanks,
Michal
> --
> 2.39.1
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH net] iavf: fix hang on reboot with ice
2023-03-10 17:24 ` Michal Kubiak
@ 2023-03-13 6:48 ` Stefan Assmann
2023-03-13 9:15 ` Michal Kubiak
0 siblings, 1 reply; 5+ messages in thread
From: Stefan Assmann @ 2023-03-13 6:48 UTC (permalink / raw)
To: Michal Kubiak
Cc: intel-wired-lan, netdev, anthony.l.nguyen, patryk.piotrowski,
slawomirx.laba
On 10.03.23 18:24, Michal Kubiak wrote:
> On Fri, Mar 10, 2023 at 01:26:53PM +0100, Stefan Assmann wrote:
>> When a system with E810 with existing VFs gets rebooted the following
>> hang may be observed.
>>
>> Pid 1 is hung in iavf_remove(), part of a network driver:
>> PID: 1 TASK: ffff965400e5a340 CPU: 24 COMMAND: "systemd-shutdow"
>> #0 [ffffaad04005fa50] __schedule at ffffffff8b3239cb
>> #1 [ffffaad04005fae8] schedule at ffffffff8b323e2d
>> #2 [ffffaad04005fb00] schedule_hrtimeout_range_clock at ffffffff8b32cebc
>> #3 [ffffaad04005fb80] usleep_range_state at ffffffff8b32c930
>> #4 [ffffaad04005fbb0] iavf_remove at ffffffffc12b9b4c [iavf]
>> #5 [ffffaad04005fbf0] pci_device_remove at ffffffff8add7513
>> #6 [ffffaad04005fc10] device_release_driver_internal at ffffffff8af08baa
>> #7 [ffffaad04005fc40] pci_stop_bus_device at ffffffff8adcc5fc
>> #8 [ffffaad04005fc60] pci_stop_and_remove_bus_device at ffffffff8adcc81e
>> #9 [ffffaad04005fc70] pci_iov_remove_virtfn at ffffffff8adf9429
>> #10 [ffffaad04005fca8] sriov_disable at ffffffff8adf98e4
>> #11 [ffffaad04005fcc8] ice_free_vfs at ffffffffc04bb2c8 [ice]
>> #12 [ffffaad04005fd10] ice_remove at ffffffffc04778fe [ice]
>> #13 [ffffaad04005fd38] ice_shutdown at ffffffffc0477946 [ice]
>> #14 [ffffaad04005fd50] pci_device_shutdown at ffffffff8add58f1
>> #15 [ffffaad04005fd70] device_shutdown at ffffffff8af05386
>> #16 [ffffaad04005fd98] kernel_restart at ffffffff8a92a870
>> #17 [ffffaad04005fda8] __do_sys_reboot at ffffffff8a92abd6
>> #18 [ffffaad04005fee0] do_syscall_64 at ffffffff8b317159
>> #19 [ffffaad04005ff08] __context_tracking_enter at ffffffff8b31b6fc
>> #20 [ffffaad04005ff18] syscall_exit_to_user_mode at ffffffff8b31b50d
>> #21 [ffffaad04005ff28] do_syscall_64 at ffffffff8b317169
>> #22 [ffffaad04005ff50] entry_SYSCALL_64_after_hwframe at ffffffff8b40009b
>> RIP: 00007f1baa5c13d7 RSP: 00007fffbcc55a98 RFLAGS: 00000202
>> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1baa5c13d7
>> RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead
>> RBP: 00007fffbcc55ca0 R8: 0000000000000000 R9: 00007fffbcc54e90
>> R10: 00007fffbcc55050 R11: 0000000000000202 R12: 0000000000000005
>> R13: 0000000000000000 R14: 00007fffbcc55af0 R15: 0000000000000000
>> ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b
>>
>> During reboot all drivers PM shutdown callbacks are invoked.
>> In iavf_shutdown() the adapter state is changed to __IAVF_REMOVE.
>> In ice_shutdown() the call chain above is executed, which at some point
>> calls iavf_remove(). However iavf_remove() expects the VF to be in one
>> of the states __IAVF_RUNNING, __IAVF_DOWN or __IAVF_INIT_FAILED. If
>> that's not the case it sleeps forever.
>> So if iavf_shutdown() gets invoked before ice_shutdown() the system will
>> hang indefinitely because the adapter is already in state __IAVF_REMOVE.
>>
>> Fix this by adding __IAVF_REMOVE to the list of allowed states in
>> iavf_remove().
>>
>> Fixes: 974578017fc1 ("iavf: Add waiting so the port is initialized in remove")
>> Reported-by: Marius Cornea <mcornea@redhat.com>
>> Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
>> ---
>> drivers/net/ethernet/intel/iavf/iavf_main.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
>> index 3273aeb8fa67..83ef3a343ef0 100644
>> --- a/drivers/net/ethernet/intel/iavf/iavf_main.c
>> +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
>> @@ -5062,7 +5062,8 @@ static void iavf_remove(struct pci_dev *pdev)
>> mutex_lock(&adapter->crit_lock);
>> if (adapter->state == __IAVF_RUNNING ||
>> adapter->state == __IAVF_DOWN ||
>> - adapter->state == __IAVF_INIT_FAILED) {
>> + adapter->state == __IAVF_INIT_FAILED ||
>> + adapter->state == __IAVF_REMOVE) {
>> mutex_unlock(&adapter->crit_lock);
>> break;
>> }
>
> Adding the __IAVF_REMOVE state to the loop break condition seems OK to
> me.
> I would only consider adding a timeout to this loop to prevent endless hangs
> for other potential corner cases.
Hi Michal,
is it okay to add this change in a follow-up patch? I'd like to get this
patch merged quickly since we have a customer being blocked by this
issue.
Thanks!
Stefan
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH net] iavf: fix hang on reboot with ice
2023-03-13 6:48 ` Stefan Assmann
@ 2023-03-13 9:15 ` Michal Kubiak
2023-03-13 14:52 ` Stefan Assmann
0 siblings, 1 reply; 5+ messages in thread
From: Michal Kubiak @ 2023-03-13 9:15 UTC (permalink / raw)
To: Stefan Assmann
Cc: intel-wired-lan, netdev, anthony.l.nguyen, patryk.piotrowski,
slawomirx.laba
On Mon, Mar 13, 2023 at 07:48:52AM +0100, Stefan Assmann wrote:
>
> Hi Michal,
>
> is it okay to add this change in a follow-up patch? I'd like to get this
> patch merged quickly since we have a customer being blocked by this
> issue.
>
> Thanks!
>
> Stefan
Hi Stefan,
I think it is OK.
Moreover, I consulted that idea with Slawomir Laba - the author of that loop.
It seems adding a timeout needs a further investigation. Slawomir claims
that adding that timeout may cause some side effects in other corner cases,
so for now let's leave your patch as it is.
Thank you for fixing it!
Thanks,
Michal
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH net] iavf: fix hang on reboot with ice
2023-03-13 9:15 ` Michal Kubiak
@ 2023-03-13 14:52 ` Stefan Assmann
0 siblings, 0 replies; 5+ messages in thread
From: Stefan Assmann @ 2023-03-13 14:52 UTC (permalink / raw)
To: Michal Kubiak
Cc: intel-wired-lan, netdev, anthony.l.nguyen, patryk.piotrowski,
slawomirx.laba
On 13.03.23 10:15, Michal Kubiak wrote:
> On Mon, Mar 13, 2023 at 07:48:52AM +0100, Stefan Assmann wrote:
>>
>> Hi Michal,
>>
>> is it okay to add this change in a follow-up patch? I'd like to get this
>> patch merged quickly since we have a customer being blocked by this
>> issue.
>>
>> Thanks!
>>
>> Stefan
>
>
> Hi Stefan,
>
> I think it is OK.
> Moreover, I consulted that idea with Slawomir Laba - the author of that loop.
> It seems adding a timeout needs a further investigation. Slawomir claims
> that adding that timeout may cause some side effects in other corner cases,
> so for now let's leave your patch as it is.
> Thank you for fixing it!
Hi Michal, Slawomir,
I looked at this once more and noticed that iavf_remove() had the
following code removed in a8417330f8a5.
/* When reboot/shutdown is in progress no need to do anything
* as the adapter is already REMOVE state that was set during
* iavf_shutdown() callback.
*/
if (adapter->state == __IAVF_REMOVE)
return;
So instead of breaking out of the while(1) loop, in iavf_remove(), it
might be better to simply return and avoid going through the whole of
iavf_remove(). That's more in line with the behaviour before
a8417330f8a5.
Otherwise pci_disable_device() might be called twice and give a warning,
Stefan
>
> Thanks,
> Michal
>
> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
>
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2023-03-13 15:00 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-03-10 12:26 [PATCH net] iavf: fix hang on reboot with ice Stefan Assmann
2023-03-10 17:24 ` Michal Kubiak
2023-03-13 6:48 ` Stefan Assmann
2023-03-13 9:15 ` Michal Kubiak
2023-03-13 14:52 ` Stefan Assmann
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).