[Ocfs2-devel] [PATCH] Fix waiting status race condition in dlm recovery

From: Xiaowei <xiaowei.hu@oracle.com>
To: ocfs2-devel@oss.oracle.com
Subject: [Ocfs2-devel] [PATCH] Fix waiting status race condition in dlm recovery
Date: Wed, 30 May 2012 08:41:09 +0800
Message-ID: <4FC56CA5.8040902@oracle.com>
In-Reply-To: <CAEeiSHXcaKXi7Qm5vLBmTp2CjiB7DCrUee5qmr03YpuJbzP5yg@mail.gmail.com>

On 05/30/2012 06:09 AM, Sunil Mushran wrote:
> On Thu, May 24, 2012 at 10:53 PM, <xiaowei.hu@oracle.com> wrote:
>
>
>     diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>     index 01ebfd0..62659e8 100644
>     --- a/fs/ocfs2/dlm/dlmrecovery.c
>     +++ b/fs/ocfs2/dlm/dlmrecovery.c
>     @@ -555,6 +555,7 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node)
>            int all_nodes_done;
>            int destroy = 0;
>            int pass = 0;
>     +       int dying = 0;
>
>            do {
>                    /* we have become recovery master.  there is no escaping
>     @@ -659,6 +660,7 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node)
>                    list_for_each_entry(ndata, &dlm->reco.node_data, list) {
>                            mlog(0, "checking recovery state of node %u\n",
>                                 ndata->node_num);
>     +                       dying = 0;
>                            switch (ndata->state) {
>                                    case DLM_RECO_NODE_DATA_INIT:
>                                    case DLM_RECO_NODE_DATA_REQUESTING:
>     @@ -679,6 +681,13 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node)
>                                                 dlm->name, ndata->node_num,
>                                                 ndata->state==DLM_RECO_NODE_DATA_RECEIVING ?
>                                                 "receiving" : "requested");
>     +                                       spin_lock(&dlm->spinlock);
>     +                                       dying = !test_bit(ndata->node_num, dlm->live_nodes_map);
>     +                                       spin_unlock(&dlm->spinlock);
>     +                                       if (dying) {
>     +                                               ndata->state = DLM_RECO_NODE_DATA_DEAD;
>     +                                               break;
>     +                                       }
>
>
>
>
>
> I would suggest exploring adding this in dlm hb down event. Checking
> live map all over the place is hacky. We do it more than we should
> right now. Let's not add to the mess.
Hi Sunil,

Do you mean we should clear the bit in the domain map directly in the dlm hb
down event when the node goes down, and then check with dlm_is_node_dead here?
Or how else could we ensure the node stays alive during the whole migration
process? A node could die after it sends out one package of locks and before
it sends the next, if there are too many locks on that lockres.
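
To make the question concrete, here is a minimal user-space sketch of one way
the hb down event could mark the per-node recovery state, so that the remaster
loop only looks at the state and never at the live map. This is an assumption
about what the suggestion might look like, not code from dlmrecovery.c: every
sketch_* name is hypothetical, the states are simplified stand-ins for
DLM_RECO_NODE_DATA_*, and the locking a real version would need is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 8

/* Hypothetical, simplified stand-ins for the DLM_RECO_NODE_DATA_* states. */
enum sketch_reco_state {
	SKETCH_RECO_INIT,
	SKETCH_RECO_REQUESTED,
	SKETCH_RECO_RECEIVING,
	SKETCH_RECO_DONE,
	SKETCH_RECO_DEAD,
};

/* Hypothetical per-domain recovery bookkeeping, indexed by node number. */
struct sketch_reco {
	bool tracked[SKETCH_MAX_NODES];	/* node takes part in this recovery */
	enum sketch_reco_state state[SKETCH_MAX_NODES];
};

/*
 * Hypothetical hb down hook: called once when a node is declared dead.
 * It flips the node's state at that moment, so a node that dies between
 * two lock packages is caught even though it already answered once.
 * A node that already finished stays DONE (an assumption of this sketch).
 */
static void sketch_hb_node_down(struct sketch_reco *r, unsigned int node)
{
	assert(node < SKETCH_MAX_NODES);
	if (r->tracked[node] && r->state[node] != SKETCH_RECO_DONE)
		r->state[node] = SKETCH_RECO_DEAD;
}

/*
 * Recovery-master pass: returns true while some node is still expected
 * to send data. It reads only the recorded state, not a live-node map.
 */
static bool sketch_any_node_pending(const struct sketch_reco *r)
{
	for (size_t i = 0; i < SKETCH_MAX_NODES; i++) {
		if (!r->tracked[i])
			continue;
		switch (r->state[i]) {
		case SKETCH_RECO_INIT:
		case SKETCH_RECO_REQUESTED:
		case SKETCH_RECO_RECEIVING:
			return true;
		case SKETCH_RECO_DONE:
		case SKETCH_RECO_DEAD:
			break;
		}
	}
	return false;
}
```

In this sketch the master would stop waiting on a node as soon as the down
event fires, instead of testing the live map on every pass. What the master
then does with a DEAD entry (retry, restart recovery) is deliberately not
modelled here.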

Thanks,
Xiaowei
>
>
>
>                                            all_nodes_done = 0;
>                                            break;
>                                    case DLM_RECO_NODE_DATA_DONE:
>     --
>     1.7.7.6
>
>
>     _______________________________________________
>     Ocfs2-devel mailing list
>     Ocfs2-devel at oss.oracle.com <mailto:Ocfs2-devel@oss.oracle.com>
>     http://oss.oracle.com/mailman/listinfo/ocfs2-devel
>
>