From mboxrd@z Thu Jan  1 00:00:00 1970
From: piaojun
Date: Wed, 18 Oct 2017 16:17:31 +0800
Subject: [Ocfs2-devel] [PATCH] ocfs2: fix cluster hang after a node dies
In-Reply-To: <63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-EX.srv.huawei-3com.com>
References: <63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-EX.srv.huawei-3com.com>
Message-ID: <59E70E1B.7000303@huawei.com>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Changwei Ge, "ocfs2-devel@oss.oracle.com", Mark Fasheh, Junxiao Bi, Joseph Qi, Joel Becker
Cc: "linux-fsdevel@vger.kernel.org", Vitaly Mayatskih

Hi Changwei,

Could you share the method to reproduce the problem?

On 2017/10/17 14:48, Changwei Ge wrote:
> When a node dies, other live nodes have to choose a new master
> for an existing lock resource mastered by the dead node.
>
> In the ocfs2/dlm implementation, this is done by the function
> dlm_move_lockres_to_recovery_list, which marks those lock resources
> as DLM_LOCK_RES_RECOVERING and manages them via a list from which
> DLM changes the lock resource's master later.
>
> So without invoking dlm_move_lockres_to_recovery_list, no master will
> be chosen after DLM recovery completes, since no lock resource can
> be found through ::resource list.
>
> What's worse is that if DLM_LOCK_RES_RECOVERING is not marked for
> lock resources mastered by a dead node, it will break synchronization
> among nodes.
>
> So invoke dlm_move_lockres_to_recovery_list again.
>
> Fixes: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery
> lockres when recovery master goes down")'
>
> Reported-by: Vitaly Mayatskih
> Signed-off-by: Changwei Ge
> ---
>  fs/ocfs2/dlm/dlmrecovery.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 74407c6..ec8f758 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node)
>  					dlm_lockres_put(res);
>  					continue;
>  				}
> +				dlm_move_lockres_to_recovery_list(dlm, res);
>  			} else if (res->owner == dlm->node_num) {
>  				dlm_free_dead_locks(dlm, res, dead_node);
>  				__dlm_lockres_calc_usage(dlm, res);
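
To make the failure mode in the quoted description easier to reason about, here is a minimal user-space sketch of the idea. It is not the kernel code: every name prefixed with toy_ or TOY_ is hypothetical, the fixed-size array stands in for the real recovery list, and locking and reference counting are left out. It only models the claim that a lock resource owned by the dead node gets a new master if, and only if, it was flagged as recovering and queued during cleanup.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_RES_RECOVERING 0x1u  /* hypothetical stand-in for DLM_LOCK_RES_RECOVERING */
#define TOY_MAX_RES        8     /* arbitrary capacity for this sketch */

struct toy_lockres {
	unsigned owner;  /* node number of the current master */
	unsigned state;  /* bit flags */
};

struct toy_dlm {
	struct toy_lockres *recovery[TOY_MAX_RES];  /* toy recovery list */
	size_t nr_recovery;
};

/* Toy analogue of dlm_move_lockres_to_recovery_list(): flag and queue. */
static void toy_move_to_recovery_list(struct toy_dlm *dlm,
				      struct toy_lockres *res)
{
	if (res->state & TOY_RES_RECOVERING)
		return;  /* already queued */
	if (dlm->nr_recovery == TOY_MAX_RES)
		return;  /* sketch only: no growth, no error handling */
	res->state |= TOY_RES_RECOVERING;
	dlm->recovery[dlm->nr_recovery++] = res;
}

/*
 * Toy cleanup after a node death. with_fix == false models the
 * behaviour described as broken: resources owned by the dead node
 * are never queued for recovery.
 */
static void toy_local_recovery_cleanup(struct toy_dlm *dlm,
				       struct toy_lockres *all, size_t n,
				       unsigned dead_node, bool with_fix)
{
	for (size_t i = 0; i < n; i++) {
		if (all[i].owner == dead_node && with_fix)
			toy_move_to_recovery_list(dlm, &all[i]);
	}
}

/* Remaster everything on the recovery list; returns how many were moved. */
static size_t toy_finish_recovery(struct toy_dlm *dlm, unsigned new_master)
{
	size_t remastered = 0;

	for (size_t i = 0; i < dlm->nr_recovery; i++) {
		struct toy_lockres *res = dlm->recovery[i];

		res->owner = new_master;
		res->state &= ~TOY_RES_RECOVERING;
		remastered++;
	}
	dlm->nr_recovery = 0;
	return remastered;
}
```

In this model, running the cleanup with with_fix set to false leaves the recovery list empty, so toy_finish_recovery() has nothing to remaster and the resources keep pointing at the dead node. That corresponds to the hang the patch description attributes to the missing call.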