From mboxrd@z Thu Jan 1 00:00:00 1970 From: Wengang Date: Thu, 20 Feb 2014 13:51:28 +0800 Subject: [Ocfs2-devel] [PATCH] ocfs2: dlm: fix recovery hung In-Reply-To: <530597FD.4020108@oracle.com> References: <53056939.3050702@oracle.com> <530597FD.4020108@oracle.com> Message-ID: <530597E0.1060607@oracle.com> List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: ocfs2-devel@oss.oracle.com Yes. thanks wengang ? 2014?02?20? 13:51, Junxiao Bi ??: > Hi Wengang, > > On 02/20/2014 10:32 AM, Wengang wrote: >> Yes. Checking DLM_RECO_STATE_ACTIVE would introduce the dead lock you >> described. I don't see the need we have to set this flag before >> picking up the recovery master(till now the remaster locks not started >> yet). We may think of setting this flag only when the node become the >> really recovery master. While this may be a separate change. >> For your change, seems you changed the use of xxx_FINALIZE(making it >> also can set on recovery master). you'd better add some comment for that. > Thank you for reviewing it. I will add some comments for it. >> Signed-off-by: Wengang Wang > Do you mean Reviewed-by: Wengang Wang ? > > Thanks, > Junxiao. >> >> ? 2014?02?19? 21:44, Junxiao Bi ??: >>> Hi Wengang, >>> >>> I thought about this changes before, it seemed may cause a deadlock. >>> When the new recovery master is sending the RECO_BEGIN message, >>> another node x may be trying to become the recovery master in >>> dlm_pick_recovery_master(), at there, DLM_RECO_STATE_ACTIVE had been >>> set before in dlm_do_recovery(), so the RECO_BEGIN message handler >>> dlm_pick_recovery_master() can not be executed, then >>> dlm->reco.new_master will not be set, so node x will loop in >>> dlm_pick_recovery_master() forever. >>> >>> Thanks, >>> Junxiao. >>> >>> ----- Original Message ----- >>> From: wen.gang.wang at oracle.com >>> To: junxiao.bi at oracle.com, ocfs2-devel at oss.oracle.com >>> Cc: mfasheh at suse.de, joe.jin at oracle.com >>> Sent: Wednesday, February 19, 2014 6:10:43 PM GMT +08:00 Beijing / >>> Chongqing / Hong Kong / Urumqi >>> Subject: Re: [Ocfs2-devel] [PATCH] ocfs2: dlm: fix recovery hung >>> >>> Seems DLM_RECO_STATE_FINALIZE is used to track the situation that the >>> recovery master(node 0) dead (just) after it had finished the dlm >>> recovery. So that the original dead node(node 1) needed not be >>> recovered again. This flag is set only on non recovery master nodes. >>> In this case, things looks like the recovery master reset >>> "recover_master" and "dead node" after another node(node 2) had >>> started to recover another dead node(node 3). This would happen only >>> on the "old" recovery master(node 0). Note this flag >>> DLM_RECO_STATE_ACTIVE, it's set only on master when it begins to >>> recover and cleared when recovery finished. For an "old" master, >>> checking DLM_RECO_STATE_ACTIVE makes sense. I'd like to suggest: >>> >>> --- a/fs/ocfs2/dlm/dlmrecovery.c >>> +++ b/fs/ocfs2/dlm/dlmrecovery.c >>> @@ -2704,7 +2704,7 @@ int dlm_begin_reco_handler(struct o2net_msg >>> *msg, u32 len, void *data, >>> return 0; >>> spin_lock(&dlm->spinlock); >>> - if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) { >>> + if (dlm->reco.state & (DLM_RECO_STATE_FINALIZE | >>> DLM_RECO_STATE_ACTIVE)) { >>> mlog(0, "%s: node %u wants to recover node %u >>> (%u:%u) " >>> "but this node is in finalize state, waiting >>> on finalize2\n", >>> dlm->name, br->node_idx, br->dead_node, >>> >>> thanks >>> wengang >>> >>> ? 2014?02?19? 16:30, Junxiao Bi ??: >>>> There is a race window in dlm_do_recovery() between >>>> dlm_remaster_locks() >>>> and dlm_reset_recovery() when the recovery master nearly finish the >>>> recovery >>>> process for a dead node. After the master sends FINALIZE_RECO >>>> message in >>>> dlm_remaster_locks(), another node may become the recovery master >>>> for another >>>> dead node, and then send the BEGIN_RECO message to all the nodes >>>> included the >>>> old master, in the handler of this message dlm_begin_reco_handler() >>>> of old master, >>>> dlm->reco.dead_node and dlm->reco.new_master will be set to the >>>> second dead >>>> node and the new master, then in dlm_reset_recovery(), these two >>>> variables >>>> will be reset to default value. This will cause new recovery master >>>> can not finish >>>> the recovery process and hung, at last the whole cluster will hung >>>> for recovery. >>>> >>>> old recovery master: new recovery >>>> master: >>>> dlm_remaster_locks() >>>> become recovery >>>> master for >>>> another dead node. >>>> >>>> dlm_send_begin_reco_message() >>>> dlm_begin_reco_handler() >>>> { >>>> if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) { >>>> return -EAGAIN; >>>> } >>>> dlm_set_reco_master(dlm, br->node_idx); >>>> dlm_set_reco_dead_node(dlm, br->dead_node); >>>> } >>>> dlm_reset_recovery() >>>> { >>>> dlm_set_reco_dead_node(dlm, O2NM_INVALID_NODE_NUM); >>>> dlm_set_reco_master(dlm, O2NM_INVALID_NODE_NUM); >>>> } >>>> will hung in >>>> dlm_remaster_locks() for >>>> request dlm >>>> locks info >>>> >>>> Before send FINALIZE_RECO message, recovery master should set >>>> DLM_RECO_STATE_FINALIZE >>>> for itself and clear it after the recovery done, this can break the >>>> race windows as >>>> the BEGIN_RECO messages will not be handled before >>>> DLM_RECO_STATE_FINALIZE flag is >>>> cleared. >>>> >>>> A similar race may happen between new recovery master and normal >>>> node which is in >>>> dlm_finalize_reco_handler(), also fix it. >>>> >>>> Signed-off-by: Junxiao Bi >>>> --- >>>> fs/ocfs2/dlm/dlmrecovery.c | 11 +++++++++-- >>>> 1 file changed, 9 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c >>>> index 7035af0..fe58e8b 100644 >>>> --- a/fs/ocfs2/dlm/dlmrecovery.c >>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c >>>> @@ -537,7 +537,10 @@ master_here: >>>> /* success! see if any other nodes need recovery */ >>>> mlog(0, "DONE mastering recovery of %s:%u here(this=%u)!\n", >>>> dlm->name, dlm->reco.dead_node, dlm->node_num); >>>> - dlm_reset_recovery(dlm); >>>> + spin_lock(&dlm->spinlock); >>>> + __dlm_reset_recovery(dlm); >>>> + dlm->reco.state &= ~DLM_RECO_STATE_FINALIZE; >>>> + spin_unlock(&dlm->spinlock); >>>> } >>>> dlm_end_recovery(dlm); >>>> @@ -695,6 +698,10 @@ static int dlm_remaster_locks(struct >>>> dlm_ctxt *dlm, u8 dead_node) >>>> if (all_nodes_done) { >>>> int ret; >>>> + spin_lock(&dlm->spinlock); >>>> + dlm->reco.state |= DLM_RECO_STATE_FINALIZE; >>>> + spin_unlock(&dlm->spinlock); >>>> + >>>> /* all nodes are now in DLM_RECO_NODE_DATA_DONE state >>>> * just send a finalize message to everyone and >>>> * clean up */ >>>> @@ -2882,8 +2889,8 @@ int dlm_finalize_reco_handler(struct o2net_msg >>>> *msg, u32 len, void *data, >>>> BUG(); >>>> } >>>> dlm->reco.state &= ~DLM_RECO_STATE_FINALIZE; >>>> + __dlm_reset_recovery(dlm); >>>> spin_unlock(&dlm->spinlock); >>>> - dlm_reset_recovery(dlm); >>>> dlm_kick_recovery_thread(dlm); >>>> break; >>>> default: