From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Teigland Date: Mon, 9 Oct 2017 09:37:29 -0500 Subject: [Cluster-devel] [BUG] fs/dlm: A possible sleep-in-atomic bug in dlm_master_lookup In-Reply-To: <20171007022611.GO21978@ZenIV.linux.org.uk> References: <58efdfb6-d18d-cb45-ecd4-4c9b680d7595@163.com> <20171007022611.GO21978@ZenIV.linux.org.uk> Message-ID: <20171009143729.GA8549@redhat.com> List-Id: To: cluster-devel.redhat.com MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit On Sat, Oct 07, 2017 at 03:26:11AM +0100, Al Viro wrote: > On Sat, Oct 07, 2017 at 09:59:41AM +0800, Jia-Ju Bai wrote: > > According to fs/dlm/lock.c, the kernel may sleep under a spinlock, > > and the function call path is: > > dlm_master_lookup (acquire the spinlock) > > dlm_send_rcom_lookup_dump > > create_rcom > > dlm_lowcomms_get_buffer > > nodeid2con > > mutex_lock --> may sleep > > > > This bug is found by my static analysis tool and my code review. > > Umm... dlm_master_lookup() locking is not nice, but to trigger that > you would need a combination of > > * from_nodeid != our_nodeid (or we would've buggered off long before that point) > * dir_nodeid == our_nodeid > * failing dlm_search_rsb_tree(&ls->ls_rsbtbl[b].keep, name, len, &r) > (success would have the lock dropped) > * succeeding dlm_search_rsb_tree(&ls->ls_rsbtbl[b].toss, name, len, &r) > * from_master being true > * r->res_master_nodeid != from_nodeid and r->res_master_nodeid == our_nodeid > (the former is follows from the latter, actually) > > The last one might or might not be impossible - I'm not familiar with dlm > guts, but it does have > log_error(ls, "from_master %d our_master", from_nodeid); > just before that call, so it's worth a further look. dlm_send_rcom_lookup_dump() was for debugging and can be removed. It's a condition that shouldn't happen, and I'm guessing I added that to catch any evidence if it did. I'm surprised it wasn't removed in the final version of the patch, but after 5 years I don't remember what I was thinking. I've pushed a commit dropping it to linux-dlm.git next. Thanks, Dave From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754889AbdJIOhn (ORCPT ); Mon, 9 Oct 2017 10:37:43 -0400 Received: from mx1.redhat.com ([209.132.183.28]:52462 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754846AbdJIOhg (ORCPT ); Mon, 9 Oct 2017 10:37:36 -0400 DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com A233B80477 Authentication-Results: ext-mx04.extmail.prod.ext.phx2.redhat.com; dmarc=none (p=none dis=none) header.from=redhat.com Authentication-Results: ext-mx04.extmail.prod.ext.phx2.redhat.com; spf=fail smtp.mailfrom=teigland@redhat.com Date: Mon, 9 Oct 2017 09:37:29 -0500 From: David Teigland To: Al Viro Cc: Jia-Ju Bai , ccaulfie@redhat.com, cluster-devel@redhat.com, linux-kernel@vger.kernel.org Subject: Re: [BUG] fs/dlm: A possible sleep-in-atomic bug in dlm_master_lookup Message-ID: <20171009143729.GA8549@redhat.com> References: <58efdfb6-d18d-cb45-ecd4-4c9b680d7595@163.com> <20171007022611.GO21978@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20171007022611.GO21978@ZenIV.linux.org.uk> User-Agent: Mutt/1.8.3 (2017-05-23) X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.28]); Mon, 09 Oct 2017 14:37:35 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Oct 07, 2017 at 03:26:11AM +0100, Al Viro wrote: > On Sat, Oct 07, 2017 at 09:59:41AM +0800, Jia-Ju Bai wrote: > > According to fs/dlm/lock.c, the kernel may sleep under a spinlock, > > and the function call path is: > > dlm_master_lookup (acquire the spinlock) > > dlm_send_rcom_lookup_dump > > create_rcom > > dlm_lowcomms_get_buffer > > nodeid2con > > mutex_lock --> may sleep > > > > This bug is found by my static analysis tool and my code review. > > Umm... dlm_master_lookup() locking is not nice, but to trigger that > you would need a combination of > > * from_nodeid != our_nodeid (or we would've buggered off long before that point) > * dir_nodeid == our_nodeid > * failing dlm_search_rsb_tree(&ls->ls_rsbtbl[b].keep, name, len, &r) > (success would have the lock dropped) > * succeeding dlm_search_rsb_tree(&ls->ls_rsbtbl[b].toss, name, len, &r) > * from_master being true > * r->res_master_nodeid != from_nodeid and r->res_master_nodeid == our_nodeid > (the former is follows from the latter, actually) > > The last one might or might not be impossible - I'm not familiar with dlm > guts, but it does have > log_error(ls, "from_master %d our_master", from_nodeid); > just before that call, so it's worth a further look. dlm_send_rcom_lookup_dump() was for debugging and can be removed. It's a condition that shouldn't happen, and I'm guessing I added that to catch any evidence if it did. I'm surprised it wasn't removed in the final version of the patch, but after 5 years I don't remember what I was thinking. I've pushed a commit dropping it to linux-dlm.git next. Thanks, Dave