* [PATCH net 1/2] net/smc: fix deadlock triggered by cancel_delayed_work_syn()
2023-03-13 10:08 [PATCH net 0/2] net/smc: Fixes 2023-03-01 Wenjia Zhang
@ 2023-03-13 10:08 ` Wenjia Zhang
2023-03-13 11:14 ` Tony Lu
2023-03-13 10:08 ` [PATCH net 2/2] net/smc: Fix device de-init sequence Wenjia Zhang
2023-03-15 8:20 ` [PATCH net 0/2] net/smc: Fixes 2023-03-01 patchwork-bot+netdevbpf
2 siblings, 1 reply; 6+ messages in thread
From: Wenjia Zhang @ 2023-03-13 10:08 UTC
To: David Miller, Jakub Kicinski
Cc: netdev, linux-s390, Eric Dumazet, Paolo Abeni, Heiko Carstens,
Karsten Graul, Alexandra Winter, Jan Karcher, Stefan Raspl,
Tony Lu, Wenjia Zhang
The following LOCKDEP was detected:
Workqueue: events smc_lgr_free_work [smc]
WARNING: possible circular locking dependency detected
6.1.0-20221027.rc2.git8.56bc5b569087.300.fc36.s390x+debug #1 Not tainted
------------------------------------------------------
kworker/3:0/176251 is trying to acquire lock:
00000000f1467148 ((wq_completion)smc_tx_wq-00000000#2){+.+.}-{0:0},
at: __flush_workqueue+0x7a/0x4f0
but task is already holding lock:
0000037fffe97dc8 ((work_completion)(&(&lgr->free_work)->work)){+.+.}-{0:0},
at: process_one_work+0x232/0x730
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 ((work_completion)(&(&lgr->free_work)->work)){+.+.}-{0:0}:
__lock_acquire+0x58e/0xbd8
lock_acquire.part.0+0xe2/0x248
lock_acquire+0xac/0x1c8
__flush_work+0x76/0xf0
__cancel_work_timer+0x170/0x220
__smc_lgr_terminate.part.0+0x34/0x1c0 [smc]
smc_connect_rdma+0x15e/0x418 [smc]
__smc_connect+0x234/0x480 [smc]
smc_connect+0x1d6/0x230 [smc]
__sys_connect+0x90/0xc0
__do_sys_socketcall+0x186/0x370
__do_syscall+0x1da/0x208
system_call+0x82/0xb0
-> #3 (smc_client_lgr_pending){+.+.}-{3:3}:
__lock_acquire+0x58e/0xbd8
lock_acquire.part.0+0xe2/0x248
lock_acquire+0xac/0x1c8
__mutex_lock+0x96/0x8e8
mutex_lock_nested+0x32/0x40
smc_connect_rdma+0xa4/0x418 [smc]
__smc_connect+0x234/0x480 [smc]
smc_connect+0x1d6/0x230 [smc]
__sys_connect+0x90/0xc0
__do_sys_socketcall+0x186/0x370
__do_syscall+0x1da/0x208
system_call+0x82/0xb0
-> #2 (sk_lock-AF_SMC){+.+.}-{0:0}:
__lock_acquire+0x58e/0xbd8
lock_acquire.part.0+0xe2/0x248
lock_acquire+0xac/0x1c8
lock_sock_nested+0x46/0xa8
smc_tx_work+0x34/0x50 [smc]
process_one_work+0x30c/0x730
worker_thread+0x62/0x420
kthread+0x138/0x150
__ret_from_fork+0x3c/0x58
ret_from_fork+0xa/0x40
-> #1 ((work_completion)(&(&smc->conn.tx_work)->work)){+.+.}-{0:0}:
__lock_acquire+0x58e/0xbd8
lock_acquire.part.0+0xe2/0x248
lock_acquire+0xac/0x1c8
process_one_work+0x2bc/0x730
worker_thread+0x62/0x420
kthread+0x138/0x150
__ret_from_fork+0x3c/0x58
ret_from_fork+0xa/0x40
-> #0 ((wq_completion)smc_tx_wq-00000000#2){+.+.}-{0:0}:
check_prev_add+0xd8/0xe88
validate_chain+0x70c/0xb20
__lock_acquire+0x58e/0xbd8
lock_acquire.part.0+0xe2/0x248
lock_acquire+0xac/0x1c8
__flush_workqueue+0xaa/0x4f0
drain_workqueue+0xaa/0x158
destroy_workqueue+0x44/0x2d8
smc_lgr_free+0x9e/0xf8 [smc]
process_one_work+0x30c/0x730
worker_thread+0x62/0x420
kthread+0x138/0x150
__ret_from_fork+0x3c/0x58
ret_from_fork+0xa/0x40
other info that might help us debug this:
Chain exists of:
(wq_completion)smc_tx_wq-00000000#2
--> smc_client_lgr_pending
--> (work_completion)(&(&lgr->free_work)->work)
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock((work_completion)(&(&lgr->free_work)->work));
lock(smc_client_lgr_pending);
lock((work_completion)
(&(&lgr->free_work)->work));
lock((wq_completion)smc_tx_wq-00000000#2);
*** DEADLOCK ***
2 locks held by kworker/3:0/176251:
#0: 0000000080183548
((wq_completion)events){+.+.}-{0:0},
at: process_one_work+0x232/0x730
#1: 0000037fffe97dc8
((work_completion)
(&(&lgr->free_work)->work)){+.+.}-{0:0},
at: process_one_work+0x232/0x730
stack backtrace:
CPU: 3 PID: 176251 Comm: kworker/3:0 Not tainted
Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
Call Trace:
[<000000002983c3e4>] dump_stack_lvl+0xac/0x100
[<0000000028b477ae>] check_noncircular+0x13e/0x160
[<0000000028b48808>] check_prev_add+0xd8/0xe88
[<0000000028b49cc4>] validate_chain+0x70c/0xb20
[<0000000028b4bd26>] __lock_acquire+0x58e/0xbd8
[<0000000028b4cf6a>] lock_acquire.part.0+0xe2/0x248
[<0000000028b4d17c>] lock_acquire+0xac/0x1c8
[<0000000028addaaa>] __flush_workqueue+0xaa/0x4f0
[<0000000028addf9a>] drain_workqueue+0xaa/0x158
[<0000000028ae303c>] destroy_workqueue+0x44/0x2d8
[<000003ff8029af26>] smc_lgr_free+0x9e/0xf8 [smc]
[<0000000028adf3d4>] process_one_work+0x30c/0x730
[<0000000028adf85a>] worker_thread+0x62/0x420
[<0000000028aeac50>] kthread+0x138/0x150
[<0000000028a63914>] __ret_from_fork+0x3c/0x58
[<00000000298503da>] ret_from_fork+0xa/0x40
INFO: lockdep is turned off.
===================================================================
This deadlock occurs because cancel_delayed_work_sync() waits for
the work (&lgr->free_work) to finish, while &lgr->free_work itself
waits for the work on lgr->tx_wq, which needs sk_lock-AF_SMC, a lock
that the connect path already holds together with the
smc_client_lgr_pending mutex.
The solution is to use cancel_delayed_work() instead, which only
kills off a pending work and does not wait for a running one.
Fixes: a52bcc919b14 ("net/smc: improve termination processing")
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Jan Karcher <jaka@linux.ibm.com>
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
---
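The circular dependency described in the commit message above can be modeled in user space. What follows is a minimal sketch, not kernel code and not how lockdep is implemented: the enum, the `dep_*` helpers and `smc_chain_has_cycle()` are hypothetical names introduced only for illustration. Each lock class from the report is a graph node, each "held while waiting for" relation is an edge, and a depth-first search looks for a cycle. The `sync_cancel` flag is an assumption that stands in for whether `__smc_lgr_terminate()` waits for `free_work`.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Lock classes named in the report (illustrative enum, not a kernel API). */
enum lock_class {
	TX_WQ,        /* (wq_completion)smc_tx_wq */
	TX_WORK,      /* (work_completion) of smc->conn.tx_work */
	SK_LOCK,      /* sk_lock-AF_SMC */
	LGR_PENDING,  /* smc_client_lgr_pending */
	FREE_WORK,    /* (work_completion) of lgr->free_work */
	NR_CLASSES
};

/* edge[a][b] is true when b is acquired or waited for while a is held. */
struct dep_graph {
	bool edge[NR_CLASSES][NR_CLASSES];
};

static void dep_graph_init(struct dep_graph *g)
{
	memset(g, 0, sizeof(*g));
}

static void dep_add(struct dep_graph *g, int held, int wanted)
{
	assert(held >= 0 && held < NR_CLASSES);
	assert(wanted >= 0 && wanted < NR_CLASSES);
	g->edge[held][wanted] = true;
}

/* state: 0 = unvisited, 1 = on the current DFS path, 2 = fully explored */
static bool dep_dfs(const struct dep_graph *g, int node, int *state)
{
	state[node] = 1;
	for (int next = 0; next < NR_CLASSES; next++) {
		if (!g->edge[node][next])
			continue;
		if (state[next] == 1)
			return true; /* back edge: cycle found */
		if (state[next] == 0 && dep_dfs(g, next, state))
			return true;
	}
	state[node] = 2;
	return false;
}

static bool dep_has_cycle(const struct dep_graph *g)
{
	int state[NR_CLASSES] = { 0 };

	for (int n = 0; n < NR_CLASSES; n++)
		if (state[n] == 0 && dep_dfs(g, n, state))
			return true;
	return false;
}

/* Rebuild the chain from the report and check it for a cycle. */
static bool smc_chain_has_cycle(bool sync_cancel)
{
	struct dep_graph g;

	dep_graph_init(&g);
	dep_add(&g, TX_WQ, TX_WORK);       /* flushing the queue waits for tx_work */
	dep_add(&g, TX_WORK, SK_LOCK);     /* smc_tx_work() takes the socket lock */
	dep_add(&g, SK_LOCK, LGR_PENDING); /* connect path takes the mutex under it */
	if (sync_cancel)                   /* the _sync variant waits for free_work */
		dep_add(&g, LGR_PENDING, FREE_WORK);
	dep_add(&g, FREE_WORK, TX_WQ);     /* smc_lgr_free() destroys tx_wq */
	return dep_has_cycle(&g);
}
```

In this model, `smc_chain_has_cycle(true)` closes the loop through the `LGR_PENDING -> FREE_WORK` edge, while `smc_chain_has_cycle(false)` leaves the chain open, which mirrors the effect of switching to cancel_delayed_work().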
net/smc/smc_core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index d52060b2680c..454356771cda 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1464,7 +1464,7 @@ static void __smc_lgr_terminate(struct smc_link_group *lgr, bool soft)
if (lgr->terminating)
return; /* lgr already terminating */
/* cancel free_work sync, will terminate when lgr->freeing is set */
- cancel_delayed_work_sync(&lgr->free_work);
+ cancel_delayed_work(&lgr->free_work);
lgr->terminating = 1;
/* kill remaining link group connections */
--
2.37.2
* Re: [PATCH net 1/2] net/smc: fix deadlock triggered by cancel_delayed_work_syn()
2023-03-13 10:08 ` [PATCH net 1/2] net/smc: fix deadlock triggered by cancel_delayed_work_syn() Wenjia Zhang
@ 2023-03-13 11:14 ` Tony Lu
0 siblings, 0 replies; 6+ messages in thread
From: Tony Lu @ 2023-03-13 11:14 UTC
To: Wenjia Zhang
Cc: David Miller, Jakub Kicinski, netdev, linux-s390, Eric Dumazet,
Paolo Abeni, Heiko Carstens, Karsten Graul, Alexandra Winter,
Jan Karcher, Stefan Raspl
On Mon, Mar 13, 2023 at 11:08:28AM +0100, Wenjia Zhang wrote:
> The following LOCKDEP was detected:
> Workqueue: events smc_lgr_free_work [smc]
> WARNING: possible circular locking dependency detected
> 6.1.0-20221027.rc2.git8.56bc5b569087.300.fc36.s390x+debug #1 Not tainted
> ------------------------------------------------------
> kworker/3:0/176251 is trying to acquire lock:
> 00000000f1467148 ((wq_completion)smc_tx_wq-00000000#2){+.+.}-{0:0},
> at: __flush_workqueue+0x7a/0x4f0
> but task is already holding lock:
> 0000037fffe97dc8 ((work_completion)(&(&lgr->free_work)->work)){+.+.}-{0:0},
> at: process_one_work+0x232/0x730
> which lock already depends on the new lock.
> the existing dependency chain (in reverse order) is:
> -> #4 ((work_completion)(&(&lgr->free_work)->work)){+.+.}-{0:0}:
> __lock_acquire+0x58e/0xbd8
> lock_acquire.part.0+0xe2/0x248
> lock_acquire+0xac/0x1c8
> __flush_work+0x76/0xf0
> __cancel_work_timer+0x170/0x220
> __smc_lgr_terminate.part.0+0x34/0x1c0 [smc]
> smc_connect_rdma+0x15e/0x418 [smc]
> __smc_connect+0x234/0x480 [smc]
> smc_connect+0x1d6/0x230 [smc]
> __sys_connect+0x90/0xc0
> __do_sys_socketcall+0x186/0x370
> __do_syscall+0x1da/0x208
> system_call+0x82/0xb0
> -> #3 (smc_client_lgr_pending){+.+.}-{3:3}:
> __lock_acquire+0x58e/0xbd8
> lock_acquire.part.0+0xe2/0x248
> lock_acquire+0xac/0x1c8
> __mutex_lock+0x96/0x8e8
> mutex_lock_nested+0x32/0x40
> smc_connect_rdma+0xa4/0x418 [smc]
> __smc_connect+0x234/0x480 [smc]
> smc_connect+0x1d6/0x230 [smc]
> __sys_connect+0x90/0xc0
> __do_sys_socketcall+0x186/0x370
> __do_syscall+0x1da/0x208
> system_call+0x82/0xb0
> -> #2 (sk_lock-AF_SMC){+.+.}-{0:0}:
> __lock_acquire+0x58e/0xbd8
> lock_acquire.part.0+0xe2/0x248
> lock_acquire+0xac/0x1c8
> lock_sock_nested+0x46/0xa8
> smc_tx_work+0x34/0x50 [smc]
> process_one_work+0x30c/0x730
> worker_thread+0x62/0x420
> kthread+0x138/0x150
> __ret_from_fork+0x3c/0x58
> ret_from_fork+0xa/0x40
> -> #1 ((work_completion)(&(&smc->conn.tx_work)->work)){+.+.}-{0:0}:
> __lock_acquire+0x58e/0xbd8
> lock_acquire.part.0+0xe2/0x248
> lock_acquire+0xac/0x1c8
> process_one_work+0x2bc/0x730
> worker_thread+0x62/0x420
> kthread+0x138/0x150
> __ret_from_fork+0x3c/0x58
> ret_from_fork+0xa/0x40
> -> #0 ((wq_completion)smc_tx_wq-00000000#2){+.+.}-{0:0}:
> check_prev_add+0xd8/0xe88
> validate_chain+0x70c/0xb20
> __lock_acquire+0x58e/0xbd8
> lock_acquire.part.0+0xe2/0x248
> lock_acquire+0xac/0x1c8
> __flush_workqueue+0xaa/0x4f0
> drain_workqueue+0xaa/0x158
> destroy_workqueue+0x44/0x2d8
> smc_lgr_free+0x9e/0xf8 [smc]
> process_one_work+0x30c/0x730
> worker_thread+0x62/0x420
> kthread+0x138/0x150
> __ret_from_fork+0x3c/0x58
> ret_from_fork+0xa/0x40
> other info that might help us debug this:
> Chain exists of:
> (wq_completion)smc_tx_wq-00000000#2
> --> smc_client_lgr_pending
> --> (work_completion)(&(&lgr->free_work)->work)
> Possible unsafe locking scenario:
> CPU0 CPU1
> ---- ----
> lock((work_completion)(&(&lgr->free_work)->work));
> lock(smc_client_lgr_pending);
> lock((work_completion)
> (&(&lgr->free_work)->work));
> lock((wq_completion)smc_tx_wq-00000000#2);
> *** DEADLOCK ***
> 2 locks held by kworker/3:0/176251:
> #0: 0000000080183548
> ((wq_completion)events){+.+.}-{0:0},
> at: process_one_work+0x232/0x730
> #1: 0000037fffe97dc8
> ((work_completion)
> (&(&lgr->free_work)->work)){+.+.}-{0:0},
> at: process_one_work+0x232/0x730
> stack backtrace:
> CPU: 3 PID: 176251 Comm: kworker/3:0 Not tainted
> Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
> Call Trace:
> [<000000002983c3e4>] dump_stack_lvl+0xac/0x100
> [<0000000028b477ae>] check_noncircular+0x13e/0x160
> [<0000000028b48808>] check_prev_add+0xd8/0xe88
> [<0000000028b49cc4>] validate_chain+0x70c/0xb20
> [<0000000028b4bd26>] __lock_acquire+0x58e/0xbd8
> [<0000000028b4cf6a>] lock_acquire.part.0+0xe2/0x248
> [<0000000028b4d17c>] lock_acquire+0xac/0x1c8
> [<0000000028addaaa>] __flush_workqueue+0xaa/0x4f0
> [<0000000028addf9a>] drain_workqueue+0xaa/0x158
> [<0000000028ae303c>] destroy_workqueue+0x44/0x2d8
> [<000003ff8029af26>] smc_lgr_free+0x9e/0xf8 [smc]
> [<0000000028adf3d4>] process_one_work+0x30c/0x730
> [<0000000028adf85a>] worker_thread+0x62/0x420
> [<0000000028aeac50>] kthread+0x138/0x150
> [<0000000028a63914>] __ret_from_fork+0x3c/0x58
> [<00000000298503da>] ret_from_fork+0xa/0x40
> INFO: lockdep is turned off.
> ===================================================================
>
> This deadlock occurs because cancel_delayed_work_sync() waits for
> the work (&lgr->free_work) to finish, while &lgr->free_work itself
> waits for the work on lgr->tx_wq, which needs sk_lock-AF_SMC, a lock
> that the connect path already holds together with the
> smc_client_lgr_pending mutex.
>
> The solution is to use cancel_delayed_work() instead, which only
> kills off a pending work and does not wait for a running one.
>
> Fixes: a52bcc919b14 ("net/smc: improve termination processing")
> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
> Reviewed-by: Jan Karcher <jaka@linux.ibm.com>
> Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Thanks Wenjia, LGTM.
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
> ---
> net/smc/smc_core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index d52060b2680c..454356771cda 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -1464,7 +1464,7 @@ static void __smc_lgr_terminate(struct smc_link_group *lgr, bool soft)
> if (lgr->terminating)
> return; /* lgr already terminating */
> /* cancel free_work sync, will terminate when lgr->freeing is set */
> - cancel_delayed_work_sync(&lgr->free_work);
> + cancel_delayed_work(&lgr->free_work);
> lgr->terminating = 1;
>
> /* kill remaining link group connections */
> --
> 2.37.2
* [PATCH net 2/2] net/smc: Fix device de-init sequence
2023-03-13 10:08 [PATCH net 0/2] net/smc: Fixes 2023-03-01 Wenjia Zhang
2023-03-13 10:08 ` [PATCH net 1/2] net/smc: fix deadlock triggered by cancel_delayed_work_syn() Wenjia Zhang
@ 2023-03-13 10:08 ` Wenjia Zhang
2023-03-13 11:15 ` Tony Lu
2023-03-15 8:20 ` [PATCH net 0/2] net/smc: Fixes 2023-03-01 patchwork-bot+netdevbpf
2 siblings, 1 reply; 6+ messages in thread
From: Wenjia Zhang @ 2023-03-13 10:08 UTC
To: David Miller, Jakub Kicinski
Cc: netdev, linux-s390, Eric Dumazet, Paolo Abeni, Heiko Carstens,
Karsten Graul, Alexandra Winter, Jan Karcher, Stefan Raspl,
Tony Lu, Wenjia Zhang, Alexander Gordeev
From: Stefan Raspl <raspl@linux.ibm.com>
CLC message initialization was not properly reversed in error handling path.
Reported-and-suggested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
---
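The kind of bug addressed here, an error path that does not undo every completed initialization step, can be illustrated with a small user-space sketch of the goto-based unwind pattern. This is not the SMC code: `init_a()`, `init_b()`, `init_c()`, their `exit_*()` counterparts, the `live_*` flags and `demo_init()` are all hypothetical names assumed for the example. The point it tries to show is that each error label undoes exactly the steps that succeeded before the failure, in reverse order.

```c
#include <stdbool.h>

/* Hypothetical subsystems; the flags record which ones are initialized. */
static int live_a, live_b, live_c;

static int init_a(void) { live_a = 1; return 0; }
static void exit_a(void) { live_a = 0; }

static int init_b(void) { live_b = 1; return 0; }
static void exit_b(void) { live_b = 0; }

/* The third step can be forced to fail to exercise the error path. */
static int init_c(bool fail)
{
	if (fail)
		return -1;
	live_c = 1;
	return 0;
}
static void exit_c(void) { live_c = 0; }

static int demo_init(bool fail_c)
{
	int rc;

	rc = init_a();
	if (rc)
		goto out;
	rc = init_b();
	if (rc)
		goto out_a;
	rc = init_c(fail_c);
	if (rc)
		goto out_b;
	return 0;

out_b:
	exit_b(); /* leaving this call out would leak subsystem b on failure */
out_a:
	exit_a();
out:
	return rc;
}

/* Regular teardown mirrors the init order in reverse. */
static void demo_exit(void)
{
	exit_c();
	exit_b();
	exit_a();
}
```

Calling `demo_init(true)` returns a non-zero code and leaves all three flags cleared; dropping the `exit_b()` call under `out_b` would leave `live_b` set, which is the same class of omission the one-line patch below corrects.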
net/smc/af_smc.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index a4cccdfdc00a..50052f53a1dd 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -3498,6 +3498,7 @@ static int __init smc_init(void)
out_nl:
smc_nl_exit();
out_ism:
+ smc_clc_exit();
smc_ism_exit();
out_pernet_subsys_stat:
unregister_pernet_subsys(&smc_net_stat_ops);
--
2.37.2
* Re: [PATCH net 2/2] net/smc: Fix device de-init sequence
2023-03-13 10:08 ` [PATCH net 2/2] net/smc: Fix device de-init sequence Wenjia Zhang
@ 2023-03-13 11:15 ` Tony Lu
0 siblings, 0 replies; 6+ messages in thread
From: Tony Lu @ 2023-03-13 11:15 UTC
To: Wenjia Zhang
Cc: David Miller, Jakub Kicinski, netdev, linux-s390, Eric Dumazet,
Paolo Abeni, Heiko Carstens, Karsten Graul, Alexandra Winter,
Jan Karcher, Stefan Raspl, Alexander Gordeev
On Mon, Mar 13, 2023 at 11:08:29AM +0100, Wenjia Zhang wrote:
> From: Stefan Raspl <raspl@linux.ibm.com>
>
> CLC message initialization was not properly reversed in error handling path.
>
> Reported-and-suggested-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Thanks, LGTM.
> ---
> net/smc/af_smc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index a4cccdfdc00a..50052f53a1dd 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -3498,6 +3498,7 @@ static int __init smc_init(void)
> out_nl:
> smc_nl_exit();
> out_ism:
> + smc_clc_exit();
> smc_ism_exit();
> out_pernet_subsys_stat:
> unregister_pernet_subsys(&smc_net_stat_ops);
> --
> 2.37.2
* Re: [PATCH net 0/2] net/smc: Fixes 2023-03-01
2023-03-13 10:08 [PATCH net 0/2] net/smc: Fixes 2023-03-01 Wenjia Zhang
2023-03-13 10:08 ` [PATCH net 1/2] net/smc: fix deadlock triggered by cancel_delayed_work_syn() Wenjia Zhang
2023-03-13 10:08 ` [PATCH net 2/2] net/smc: Fix device de-init sequence Wenjia Zhang
@ 2023-03-15 8:20 ` patchwork-bot+netdevbpf
2 siblings, 0 replies; 6+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-03-15 8:20 UTC
To: Wenjia Zhang
Cc: davem, kuba, netdev, linux-s390, edumazet, pabeni, hca, kgraul,
wintera, jaka, raspl, tonylu
Hello:
This series was applied to netdev/net.git (main)
by David S. Miller <davem@davemloft.net>:
On Mon, 13 Mar 2023 11:08:27 +0100 you wrote:
> The 1st patch solves the problem that CLC message initialization was
> not properly reversed in error handling path. And the 2nd one fixes
> the possible deadlock triggered by cancel_delayed_work_sync().
>
> Stefan Raspl (1):
> net/smc: Fix device de-init sequence
>
> [...]
Here is the summary with links:
- [net,1/2] net/smc: fix deadlock triggered by cancel_delayed_work_syn()
https://git.kernel.org/netdev/net/c/13085e1b5cab
- [net,2/2] net/smc: Fix device de-init sequence
https://git.kernel.org/netdev/net/c/9d876d3ef27f
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html