From: Dan Aloni <dan.aloni@vastdata.com>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org, linux-rdma@vger.kernel.org
Subject: Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen
Date: Sun, 5 May 2024 21:36:28 +0300 [thread overview]
Message-ID: <20240505183628.g2hhzkrtna5asz6b@gmail.com> (raw)
In-Reply-To: <ZjeZQmi7um7LDOd3@tissot.1015granger.net>
On 2024-05-05 10:35:46, Chuck Lever wrote:
> On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> > underflow.
> >
> > The specific scenario that caused this to happen is IB device bonding,
> > when bringing down one of the ports, or all ports. The situation is not
> > just a print - it also causes a non-recoverable state it is not even
> > possible to complete the disconnect and shut it down the mount,
> > requiring a reboot, suggesting that tear-down is also incomplete in this
> > state.
> >
> > The trivial fix seems to work as such - if we did not receive a
> > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> > `rpcrdma_xprt_disconnect`.
> >
> > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
>
> Hi Dan, thanks for the bug report!
>
[..]
> Second, I wonder if, when bonding interfaces, there's an opportunity
> for the verbs consumer to take an additional transport reference.
> Cc'ing linux-rdma for input on that issue. Can you show a brief
> diagram of when the ep reference count changes when bonding?
Not sure we need an additional reference here. I understand regarding rpcrdma
managing its internal EP refcount, that it is having three in total: one for
init, another one for ESTABLISHED mode, and a third for 'outstanding receives',
so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me,
and I found some more about it, please read on.
> Also, I note that the WARNING below happened on a RHEL9 kernel:
>
> 5.14.0-284.11.1.el9_2.x86_64
>
> Can you confirm that this issue reproduces on v6.8 and newer ?
Some context: I originally tested with a version that also has implemented a
timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat
'culprit' in the negative decref, but there is the other issue and second order
effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes
the unkillable transport, so we are not reaching teardown at all. With both of
these changes applied both problems are gone.
As for upstream version testing, with the elrepo build of 6.8.9 which matches
vanilla, I don't see the negative kref but I do see the other effect of
`XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below),
which does not release even after ports are back up online.
Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
on bonding, I'll post in a reply.
```
INFO: task kworker/u128:6:1688 blocked for more than 122 seconds.
Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u128:6 state:D stack:0 pid:1688 tgid:1688 ppid:2 flags:0x00004000
Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
Call Trace:
<TASK>
__schedule+0x21c/0x680
schedule+0x31/0xd0
schedule_timeout+0x148/0x160
? mutex_lock+0xe/0x30
? cma_process_remove+0x218/0x260 [rdma_cm]
? preempt_count_add+0x70/0xa0
__wait_for_common+0x90/0x1e0
? __pfx_schedule_timeout+0x10/0x10
cma_remove_one+0x5c/0xd0 [rdma_cm]
remove_client_context+0x88/0xd0 [ib_core]
disable_device+0x8a/0x160 [ib_core]
? _raw_spin_unlock+0x15/0x30
__ib_unregister_device+0x42/0xb0 [ib_core]
ib_unregister_device+0x22/0x30 [ib_core]
mlx5r_remove+0x36/0x60 [mlx5_ib]
auxiliary_bus_remove+0x18/0x30
device_release_driver_internal+0x193/0x200
bus_remove_device+0xbf/0x130
device_del+0x157/0x3e0
? devl_param_driverinit_value_get+0x29/0xa0
mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core]
mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core]
mlx5_do_bond+0x2cb/0x390 [mlx5_core]
mlx5_do_bond_work+0x96/0xf0 [mlx5_core]
process_one_work+0x174/0x340
worker_thread+0x27e/0x390
? __pfx_worker_thread+0x10/0x10
kthread+0xee/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2d/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
INFO: task fio:10584 blocked for more than 122 seconds.
Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1
```
--
Dan Aloni
next prev parent reply other threads:[~2024-05-05 18:36 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-05 12:49 [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen Dan Aloni
2024-05-05 14:35 ` Chuck Lever
2024-05-05 18:36 ` Dan Aloni [this message]
2024-05-05 18:38 ` [PATCH] rpcrdma: decref EP only if ESTABLISHED and handle DEVICE_REMOVAL Dan Aloni
2024-05-05 19:00 ` [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen Chuck Lever
2024-05-05 20:10 ` Dan Aloni
2024-05-12 17:51 ` kernel test robot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240505183628.g2hhzkrtna5asz6b@gmail.com \
--to=dan.aloni@vastdata.com \
--cc=chuck.lever@oracle.com \
--cc=linux-nfs@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox