From mboxrd@z Thu Jan 1 00:00:00 1970 From: Junxiao Bi Date: Thu, 20 Feb 2014 13:51:57 +0800 Subject: [Ocfs2-devel] [PATCH] ocfs2: dlm: fix recovery hung In-Reply-To: <53056939.3050702@oracle.com> References: <53056939.3050702@oracle.com> Message-ID: <530597FD.4020108@oracle.com> List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: ocfs2-devel@oss.oracle.com Hi Wengang, On 02/20/2014 10:32 AM, Wengang wrote: > Yes. Checking DLM_RECO_STATE_ACTIVE would introduce the dead lock you > described. I don't see the need we have to set this flag before > picking up the recovery master(till now the remaster locks not started > yet). We may think of setting this flag only when the node become the > really recovery master. While this may be a separate change. > For your change, seems you changed the use of xxx_FINALIZE(making it > also can set on recovery master). you'd better add some comment for that. Thank you for reviewing it. I will add some comments for it. > > Signed-off-by: Wengang Wang Do you mean Reviewed-by: Wengang Wang ? Thanks, Junxiao. > > > ? 2014?02?19? 21:44, Junxiao Bi ??: >> Hi Wengang, >> >> I thought about this changes before, it seemed may cause a deadlock. >> When the new recovery master is sending the RECO_BEGIN message, >> another node x may be trying to become the recovery master in >> dlm_pick_recovery_master(), at there, DLM_RECO_STATE_ACTIVE had been >> set before in dlm_do_recovery(), so the RECO_BEGIN message handler >> dlm_pick_recovery_master() can not be executed, then >> dlm->reco.new_master will not be set, so node x will loop in >> dlm_pick_recovery_master() forever. >> >> Thanks, >> Junxiao. >> >> ----- Original Message ----- >> From: wen.gang.wang at oracle.com >> To: junxiao.bi at oracle.com, ocfs2-devel at oss.oracle.com >> Cc: mfasheh at suse.de, joe.jin at oracle.com >> Sent: Wednesday, February 19, 2014 6:10:43 PM GMT +08:00 Beijing / >> Chongqing / Hong Kong / Urumqi >> Subject: Re: [Ocfs2-devel] [PATCH] ocfs2: dlm: fix recovery hung >> >> Seems DLM_RECO_STATE_FINALIZE is used to track the situation that the >> recovery master(node 0) dead (just) after it had finished the dlm >> recovery. So that the original dead node(node 1) needed not be >> recovered again. This flag is set only on non recovery master nodes. >> In this case, things looks like the recovery master reset >> "recover_master" and "dead node" after another node(node 2) had >> started to recover another dead node(node 3). This would happen only >> on the "old" recovery master(node 0). Note this flag >> DLM_RECO_STATE_ACTIVE, it's set only on master when it begins to >> recover and cleared when recovery finished. For an "old" master, >> checking DLM_RECO_STATE_ACTIVE makes sense. I'd like to suggest: >> >> --- a/fs/ocfs2/dlm/dlmrecovery.c >> +++ b/fs/ocfs2/dlm/dlmrecovery.c >> @@ -2704,7 +2704,7 @@ int dlm_begin_reco_handler(struct o2net_msg >> *msg, u32 len, void *data, >> return 0; >> spin_lock(&dlm->spinlock); >> - if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) { >> + if (dlm->reco.state & (DLM_RECO_STATE_FINALIZE | >> DLM_RECO_STATE_ACTIVE)) { >> mlog(0, "%s: node %u wants to recover node %u >> (%u:%u) " >> "but this node is in finalize state, waiting >> on finalize2\n", >> dlm->name, br->node_idx, br->dead_node, >> >> thanks >> wengang >> >> ? 2014?02?19? 16:30, Junxiao Bi ??: >>> There is a race window in dlm_do_recovery() between >>> dlm_remaster_locks() >>> and dlm_reset_recovery() when the recovery master nearly finish the >>> recovery >>> process for a dead node. After the master sends FINALIZE_RECO >>> message in >>> dlm_remaster_locks(), another node may become the recovery master >>> for another >>> dead node, and then send the BEGIN_RECO message to all the nodes >>> included the >>> old master, in the handler of this message dlm_begin_reco_handler() >>> of old master, >>> dlm->reco.dead_node and dlm->reco.new_master will be set to the >>> second dead >>> node and the new master, then in dlm_reset_recovery(), these two >>> variables >>> will be reset to default value. This will cause new recovery master >>> can not finish >>> the recovery process and hung, at last the whole cluster will hung >>> for recovery. >>> >>> old recovery master: new recovery >>> master: >>> dlm_remaster_locks() >>> become recovery >>> master for >>> another dead node. >>> >>> dlm_send_begin_reco_message() >>> dlm_begin_reco_handler() >>> { >>> if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) { >>> return -EAGAIN; >>> } >>> dlm_set_reco_master(dlm, br->node_idx); >>> dlm_set_reco_dead_node(dlm, br->dead_node); >>> } >>> dlm_reset_recovery() >>> { >>> dlm_set_reco_dead_node(dlm, O2NM_INVALID_NODE_NUM); >>> dlm_set_reco_master(dlm, O2NM_INVALID_NODE_NUM); >>> } >>> will hung in >>> dlm_remaster_locks() for >>> request dlm >>> locks info >>> >>> Before send FINALIZE_RECO message, recovery master should set >>> DLM_RECO_STATE_FINALIZE >>> for itself and clear it after the recovery done, this can break the >>> race windows as >>> the BEGIN_RECO messages will not be handled before >>> DLM_RECO_STATE_FINALIZE flag is >>> cleared. >>> >>> A similar race may happen between new recovery master and normal >>> node which is in >>> dlm_finalize_reco_handler(), also fix it. >>> >>> Signed-off-by: Junxiao Bi >>> --- >>> fs/ocfs2/dlm/dlmrecovery.c | 11 +++++++++-- >>> 1 file changed, 9 insertions(+), 2 deletions(-) >>> >>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c >>> index 7035af0..fe58e8b 100644 >>> --- a/fs/ocfs2/dlm/dlmrecovery.c >>> +++ b/fs/ocfs2/dlm/dlmrecovery.c >>> @@ -537,7 +537,10 @@ master_here: >>> /* success! see if any other nodes need recovery */ >>> mlog(0, "DONE mastering recovery of %s:%u here(this=%u)!\n", >>> dlm->name, dlm->reco.dead_node, dlm->node_num); >>> - dlm_reset_recovery(dlm); >>> + spin_lock(&dlm->spinlock); >>> + __dlm_reset_recovery(dlm); >>> + dlm->reco.state &= ~DLM_RECO_STATE_FINALIZE; >>> + spin_unlock(&dlm->spinlock); >>> } >>> dlm_end_recovery(dlm); >>> @@ -695,6 +698,10 @@ static int dlm_remaster_locks(struct >>> dlm_ctxt *dlm, u8 dead_node) >>> if (all_nodes_done) { >>> int ret; >>> + spin_lock(&dlm->spinlock); >>> + dlm->reco.state |= DLM_RECO_STATE_FINALIZE; >>> + spin_unlock(&dlm->spinlock); >>> + >>> /* all nodes are now in DLM_RECO_NODE_DATA_DONE state >>> * just send a finalize message to everyone and >>> * clean up */ >>> @@ -2882,8 +2889,8 @@ int dlm_finalize_reco_handler(struct o2net_msg >>> *msg, u32 len, void *data, >>> BUG(); >>> } >>> dlm->reco.state &= ~DLM_RECO_STATE_FINALIZE; >>> + __dlm_reset_recovery(dlm); >>> spin_unlock(&dlm->spinlock); >>> - dlm_reset_recovery(dlm); >>> dlm_kick_recovery_thread(dlm); >>> break; >>> default: >