From mboxrd@z Thu Jan 1 00:00:00 1970 From: Xue jiufei Date: Tue, 18 Jun 2013 11:13:14 +0800 Subject: [Ocfs2-devel] [PATCH] Fix waiting status race condition in dlm recovery V2 In-Reply-To: <51BFC7E1.10908@huawei.com> References: <51BFC7E1.10908@huawei.com> Message-ID: <51BFD04A.1030404@huawei.com> List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: ocfs2-devel@oss.oracle.com > From: "Xiaowei.Hu" > > when the master requested locks ,but one/some of the live nodes died, > after it received the request msg and before send out the locks packages, > the recovery will fall into endless loop,waiting for the status changed to finalize > > NodeA NodeB > selected as recovery master > dlm_remaster_locks > -> dlm_requeset_all_locks > this send request locks msg to B > received the msg from A, > queue worker dlm_request_all_locks_worker > return 0 > go on set state to requested > wait for the state become done > NodeB lost connection due to network > before the worker begin, or it die. > > NodeA still waiting for the change of reco state. > It won't end if it not get data done msg. > And at this time nodeB do not realize this (or it just died), > it won't send the msg for ever, nodeA left in the recovery process forever. > > This patch let the recovery master check if the node still in live node > map when it stay in REQUESTED status. > Hi, xiaowei, We have reviewed this patch and have some questions: 1) in dlm_is_node_in_livemap(), I think it should use !!(test_bit(node, dlm->live_nodes_map)) to determine whether a node is live; 2) why not use dlm_is_node_dead() instead of dlm_is_node_in_livemap() in dlm_remaster_locks? I think dlm_is_node_dead() is better because dlm->live_nodes_map may be remain when another node umount. Thanks. > Signed-off-by: Xiaowei.Hu > --- > fs/ocfs2/dlm/dlmrecovery.c | 16 +++++++++++++++- > 1 files changed, 15 insertions(+), 1 deletions(-) > > diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c > index 01ebfd0..546c5b5 100644 > --- a/fs/ocfs2/dlm/dlmrecovery.c > +++ b/fs/ocfs2/dlm/dlmrecovery.c > @@ -339,6 +339,17 @@ static int dlm_reco_master_ready(struct dlm_ctxt *dlm) > return ready; > } > > +/* returns true if node is still in the live node map > + * this map is cleared before domain map,could be checked in recovery*/ > +int dlm_is_node_in_livemap(struct dlm_ctxt *dlm, u8 node) > +{ > + int live; > + spin_lock(&dlm->spinlock); > + live = !test_bit(node, dlm->live_nodes_map); > + spin_unlock(&dlm->spinlock); > + return live; > +} > + > /* returns true if node is no longer in the domain > * could be dead or just not joined */ > int dlm_is_node_dead(struct dlm_ctxt *dlm, u8 node) > @@ -679,7 +690,10 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node) > dlm->name, ndata->node_num, > ndata->state==DLM_RECO_NODE_DATA_RECEIVING ? > "receiving" : "requested"); > - all_nodes_done = 0; > + if (!dlm_is_node_in_livemap(dlm, ndata->node_num)) > + ndata->state = DLM_RECO_NODE_DATA_DEAD; > + else > + all_nodes_done = 0; > break; > case DLM_RECO_NODE_DATA_DONE: > mlog(0, "%s: node %u state is done\n", > .