From mboxrd@z Thu Jan 1 00:00:00 1970 From: bvanassche@acm.org (Bart Van Assche) Date: Thu, 25 Oct 2018 15:20:42 -0700 Subject: [PATCH 1/2] nvmet-rdma: Avoid that nvmet_rdma_release_queue_work() triggers a lockdep complaint In-Reply-To: <93a3139c6f66cc8a9fa38a1b52309f501c98233e.camel@sipsolutions.net> References: <20181025151940.388-1-bvanassche@acm.org> <20181025151940.388-2-bvanassche@acm.org> <93a3139c6f66cc8a9fa38a1b52309f501c98233e.camel@sipsolutions.net> Message-ID: <1540506042.66186.76.camel@acm.org> On Thu, 2018-10-25@23:05 +0200, Johannes Berg wrote: > Yes, you said to ignore this, but I'm looking at the lockdep report > again anyway ... > > So ... if I understand this correctly, we have: > > cma's id_priv->handler_mutex - a listen_id's mutex > cma's id_priv->handler_mutex/1 - a conn_id's mutex > queue - an RDMA queue, belonging to ...? > nvmet-rdma-delete-wq - the global deletion workqueue > > In cma_ib_req_handler(), we create a dependency from > handler_mutex -> handler_mutex/1 > since we lock both of them (here's the "real _nested()" part). > > While holding those, we call the event handler, which is > nvmet_rdma_cm_handler. This, via nvmet_rdma_queue_connect, flushes the > workqueue, creating that dependency. > > Obviously, queue->release_work has a dependency to the workqueue since > it runs on there. > > Finally, the queue->release_work calls rdma_destroy_id() which locks the > destroyed id_priv's mutex. > > Does that about seem right? Hi Johannes, I don't think it's that complicated. I think only the conn_id handler_mutex and the workqueue lockdep annotations are involved in the lockdep complaint. In drivers/infiniband/core/cma.c one can see that a new CM ID (conn_id) is created before the RDMA/CM handler callback occurs for event type RDMA_CM_EVENT_CONNECT_REQUEST. Otherwise I agree that a flush_workqueue() call occurs with the handler_mutex held and that the same handler_mutex is obtained while executing work queued on the same workqueue that gets flushed. > I'm not sure I see a solution right now. If you were to replace the > workqueue by a list of objects to be destroyed and then iterate it where > you have the flush_workqueue() now (*), you'd end up with a very similar > lockdep report I think, so it's not inherent in the lockdep workqueue > annotations, it's inherent in locking some object(s) while holding the > same class of lock for another object. This would be a pretty typical > case of mutex_lock_nested(), but it's hard to do across the layers here, > and pretty much impossible to do across the workqueue. > > I'd usually solve that sort of problem by flushing/iterating before > taking all the locks, since it's a different object anyway, but given > the two layers here that doesn't seem feasible either. > > (*) of course in addition to iterating it in some thread, or work > struct, or something else asynchronous > > I think again you could suppress this by using flush_workqueue_nested() > here, basically saying that while it's the same workqueue, it cannot > actually connect back to itself locking-wise due to being guaranteed to > have different objects on it... I think, but it's getting late here. Since I'm not sure accepting every connect request in case of a DoS attack is the best approach, my own preference is to address this by refusing new connections if too many CM IDs are being cleaned up. So I changed the approach of this patch series. In case you would like to have a look at v3 of this patch series, it is available at http://lists.infradead.org/pipermail/linux-nvme/2018-October/020553.html. Best regards, Bart.