From mboxrd@z Thu Jan 1 00:00:00 1970
From: hch@lst.de (Christoph Hellwig)
Date: Tue, 21 Jun 2016 18:01:34 +0200
Subject: target crash / host hang with nvme-all.3 branch of nvme-fabrics
In-Reply-To: <57631C0E.80200@grimberg.me>
References: <00d801d1c7de$e17fc7d0$a47f5770$@opengridcomputing.com>
 <20160616145724.GA32635@infradead.org>
 <20160616151048.GA13218@lst.de>
 <5762F9E2.7030101@grimberg.me>
 <20160616203824.GA19113@lst.de>
 <57631C0E.80200@grimberg.me>
Message-ID: <20160621160134.GA8428@lst.de>

On Fri, Jun 17, 2016 at 12:37:18AM +0300, Sagi Grimberg wrote:
>> to false because the queue is on the local list, and now we have thread 1
>> and 2 racing for disconnecting the queue.
>
> But the list removal and list_empty evaluation is still under a mutex,
> isn't that sufficient to avoid the race?

The lock isn't very helpful if only one side takes it.  We can execute
nvmet_rdma_queue_disconnect from the CM handler at the same time that
the queue sits on the to-be-removed list, which creates two issues:

 a) we manipulate the local del_list without any knowledge of the thread
    calling nvmet_rdma_delete_ctrl, leading to potential list corruption,
    and
 b) we can call into __nvmet_rdma_queue_disconnect concurrently.

As you pointed out, we still have the per-queue state_lock inside
__nvmet_rdma_queue_disconnect, so b) is probably harmless for now, as
long as the queue hasn't already been freed by one of the racing
threads, which is fairly unlikely.

Either way - using list_empty to check whether something is still alive
because it is linked into a list, and using a local dispose list, simply
don't mix.  Both are useful patterns on their own, but should not be
combined.
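
To make the two patterns concrete, here is a reduced sketch - not the
actual nvmet_rdma code, just the two idioms in isolation, with made-up
names and disconnect() standing in for the real teardown path:

#include <linux/list.h>
#include <linux/mutex.h>

struct queue {
	struct list_head	entry;
};

static LIST_HEAD(queue_list);			/* global list of live queues */
static DEFINE_MUTEX(queue_list_mutex);

static void disconnect(struct queue *q);	/* stand-in for the teardown path */

/* thread 1: move everything onto a local dispose list, then tear it down */
static void delete_all(void)
{
	LIST_HEAD(del_list);
	struct queue *q, *n;

	mutex_lock(&queue_list_mutex);
	list_splice_init(&queue_list, &del_list);
	mutex_unlock(&queue_list_mutex);

	/* del_list is assumed to be private here, so no lock is held */
	list_for_each_entry_safe(q, n, &del_list, entry)
		disconnect(q);
}

/* thread 2: use list membership as an "is it still alive?" check */
static void cm_disconnect(struct queue *q)
{
	bool disconnect_me = false;

	mutex_lock(&queue_list_mutex);
	if (!list_empty(&q->entry)) {
		/*
		 * list_empty() is also false while q sits on del_list, so
		 * this list_del_init() edits thread 1's "private" list
		 * behind its back (issue a) ...
		 */
		list_del_init(&q->entry);
		disconnect_me = true;
	}
	mutex_unlock(&queue_list_mutex);

	/* ... and both threads may now call disconnect() on q (issue b). */
	if (disconnect_me)
		disconnect(q);
}

The mutex serializes each individual list operation, but it can't make
list membership a reliable liveness indicator once entries have been
spliced onto a list the lock no longer covers.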