* rdma provider module references
@ 2010-12-15 16:15 Steve Wise
0 siblings, 1 reply; 5+ messages in thread
From: Steve Wise @ 2010-12-15 16:15 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma
Hey Roland,
I notice that if I have a user rdma application running that has an rdma connection using iw_cxgb3, then the iw_cxgb3
module reference count is bumped and thus it cannot be unloaded. However when I have an NFSRDMA connection that
utilizes iw_cxgb3, the module reference count is not bumped, and iw_cxgb3 can erroneously be unloaded while the NFSRDMA
connection is still active, causing a crash.
My question is really: what is the right thing to do here?
Should the module refcnt be the way to prohibit a provider from being unloaded?
Thoughts?
Thanks,
Steve.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: rdma provider module references
@ 2010-12-15 17:09 Roland Dreier
0 siblings, 1 reply; 5+ messages in thread
From: Roland Dreier @ 2010-12-15 17:09 UTC (permalink / raw)
To: Steve Wise; +Cc: linux-rdma

> I notice that if I have a user rdma application running that has an
> rdma connection using iw_cxgb3, then the iw_cxgb3 module reference
> count is bumped and thus it cannot be unloaded. However when I have
> an NFSRDMA connection that utilizes iw_cxgb3, the module reference
> count is not bumped, and iw_cxgb3 can erroneously be unloaded while
> the NFSRDMA connection is still active, causing a crash.

What is supposed to happen is that as the HW driver is unloading, it
calls ib_unregister_device() first, and this calls each client's
.remove() method to have it release everything related to that device.

However I guess NFS/RDMA is behind the RDMA CM, which is supposed to
handle device removal. In that code it seems to end up in
cma_process_remove(), which appears at first glance to do the right
things to destroy all connections etc.

The idea is that RDMA devices should be like net devices, ie you can
remove them even if they're in use -- things should just clean up,
rather than blocking the module removal. The uverbs case is a bit of a
hack because we don't have a way to handle revoking the mmap regions
etc yet.

What goes wrong with NFS/RDMA in this scheme? It looks like it should work.

 - R.
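
The removal flow described in this message can be modeled in user space. The sketch below is only a toy under stated assumptions, not kernel code: every name in it (toy_device, toy_client, toy_register_client, toy_unregister_device, toy_release_all, toy_well_behaved) is hypothetical and merely mirrors the idea that unregistering a device invokes each client's remove callback, which is expected to release everything tied to that device before the driver goes away.

```c
#include <assert.h>
#include <stddef.h>

/* Toy user-space model; every name here is hypothetical, not the kernel API. */
#define MAX_CLIENTS 4

struct toy_device {
    const char *name;
    int resources_in_use;   /* resources consumers still hold on this device */
};

struct toy_client {
    const char *name;
    /* Called when a device goes away; must release everything tied to it. */
    void (*remove)(struct toy_device *dev);
};

static struct toy_client *clients[MAX_CLIENTS];
static size_t nr_clients;

/* Register a consumer that wants to be told about device removal. */
int toy_register_client(struct toy_client *c)
{
    assert(c != NULL);
    if (nr_clients == MAX_CLIENTS)
        return -1;
    clients[nr_clients++] = c;
    return 0;
}

/*
 * Unregister a device: call every client's remove() and report how many
 * resources are still outstanding. 0 means it would be safe to unload.
 */
int toy_unregister_device(struct toy_device *dev)
{
    assert(dev != NULL);
    for (size_t i = 0; i < nr_clients; i++)
        if (clients[i]->remove)
            clients[i]->remove(dev);
    return dev->resources_in_use;
}

/* Example client callback that releases its resources synchronously. */
static void toy_release_all(struct toy_device *dev)
{
    dev->resources_in_use = 0;
}

struct toy_client toy_well_behaved = { "well_behaved", toy_release_all };
```

In this model, registering toy_well_behaved and then calling toy_unregister_device() on a device with outstanding resources returns 0, because the callback cleaned up before returning. A client whose callback left resources_in_use untouched would make the call return a nonzero count, which corresponds to the crash scenario discussed in the thread.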
* Re: rdma provider module references
@ 2010-12-15 18:53 Steve Wise
2 siblings, 0 replies; 5+ messages in thread
From: Steve Wise @ 2010-12-15 18:53 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma

On 12/15/2010 11:09 AM, Roland Dreier wrote:
> > I notice that if I have a user rdma application running that has an
> > rdma connection using iw_cxgb3, then the iw_cxgb3 module reference
> > count is bumped and thus it cannot be unloaded. However when I have
> > an NFSRDMA connection that utilizes iw_cxgb3, the module reference
> > count is not bumped, and iw_cxgb3 can erroneously be unloaded while
> > the NFSRDMA connection is still active, causing a crash.
>
> What is supposed to happen is that as the HW driver is unloading, it
> calls ib_unregister_device() first, and this calls each client's
> .remove() method to have it release everything related to that device.
>
> However I guess NFS/RDMA is behind the RDMA CM, which is supposed to
> handle device removal. In that code it seems to end up in
> cma_process_remove(), which appears at first glance to do the right
> things to destroy all connections etc.
>
> The idea is that RDMA devices should be like net devices, ie you can
> remove them even if they're in use -- things should just clean up,
> rather than blocking the module removal. The uverbs case is a bit of a
> hack because we don't have a way to handle revoking the mmap regions
> etc yet.
>

Thanks for the description of how it should work.

> What goes wrong with NFS/RDMA in this scheme? It looks like it should work.
>

I'm still investigating exactly what happens. I'll follow up once I know more.

Thanks!

Steve.
* Re: rdma provider module references
@ 2010-12-16 15:34 Steve Wise
2 siblings, 0 replies; 5+ messages in thread
From: Steve Wise @ 2010-12-16 15:34 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma

On 12/15/2010 11:09 AM, Roland Dreier wrote:
> > I notice that if I have a user rdma application running that has an
> > rdma connection using iw_cxgb3, then the iw_cxgb3 module reference
> > count is bumped and thus it cannot be unloaded. However when I have
> > an NFSRDMA connection that utilizes iw_cxgb3, the module reference
> > count is not bumped, and iw_cxgb3 can erroneously be unloaded while
> > the NFSRDMA connection is still active, causing a crash.
>
> What is supposed to happen is that as the HW driver is unloading, it
> calls ib_unregister_device() first, and this calls each client's
> .remove() method to have it release everything related to that device.
>
> However I guess NFS/RDMA is behind the RDMA CM, which is supposed to
> handle device removal. In that code it seems to end up in
> cma_process_remove(), which appears at first glance to do the right
> things to destroy all connections etc.
>
> The idea is that RDMA devices should be like net devices, ie you can
> remove them even if they're in use -- things should just clean up,
> rather than blocking the module removal. The uverbs case is a bit of a
> hack because we don't have a way to handle revoking the mmap regions
> etc yet.
>
> What goes wrong with NFS/RDMA in this scheme? It looks like it should work.
>

Here's one stack. From this I assume the offload connection was still
active after iw_cxgb3 was unloaded...

Call Trace:
 <IRQ>  [<ffffffff80037136>] kref_get+0x38/0x3d
 [<ffffffff885fb5b1>] :iw_cxgb3:sched+0x17/0x49
 [<ffffffff8824cf37>] :cxgb3:process_rx+0x37/0x8b
 [<ffffffff8824a3e7>] :cxgb3:process_responses+0xc09/0xc63
 [<ffffffff8824ac65>] :cxgb3:napi_rx_handler+0x36/0xa4
 [<ffffffff8000c88a>] net_rx_action+0xac/0x1e0
 [<ffffffff8824ac15>] :cxgb3:t3_sge_intr_msix_napi+0x173/0x18d
 [<ffffffff80012409>] __do_softirq+0x89/0x133
 [<ffffffff8005f2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006dba8>] do_softirq+0x2c/0x85
 [<ffffffff8006da30>] do_IRQ+0xec/0xf5
 [<ffffffff800575d0>] mwait_idle+0x0/0x4a
 [<ffffffff8005e615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff80057606>] mwait_idle+0x36/0x4a
 [<ffffffff800497be>] cpu_idle+0x95/0xb8
 [<ffffffff80078997>] start_secondary+0x498/0x4a7
* Re: rdma provider module references
@ 2010-12-16 15:48 Steve Wise
2 siblings, 0 replies; 5+ messages in thread
From: Steve Wise @ 2010-12-16 15:48 UTC (permalink / raw)
To: Roland Dreier; +Cc: Steve Wise, linux-rdma, Tom Tucker

> However I guess NFS/RDMA is behind the RDMA CM, which is supposed to
> handle device removal. In that code it seems to end up in
> cma_process_remove(), which appears at first glance to do the right
> things to destroy all connections etc.
>

Function cma_process_remove() calls cma_remove_id_dev() for each cm_id
bound to the device being removed. Function cma_remove_id_dev() calls
the event handler function for each cm_id and passes a
RDMA_CM_EVENT_DEVICE_REMOVAL event. The NFSRDMA server marks the RPC
transport as XPT_CLOSE, but doesn't immediately destroy the cm_id in the
event handler function. This is in
net/sunrpc/xprtrdma/svc_rdma_transport.c / rdma_cma_handler().

That's the issue methinks. Each RDMA kernel user must destroy all the
resources in the event handler function itself. These cannot be
scheduled or deferred in any way given the current design.

Steve.