From mboxrd@z Thu Jan 1 00:00:00 1970
From: Xiaowei
Date: Thu, 26 Jul 2012 14:52:17 +0800
Subject: [Ocfs2-devel] [PATCH] Fix waiting status race condition in dlm recovery
In-Reply-To:
References: <1337925202-13086-1-git-send-email-xiaowei.hu@oracle.com> <4FC56CA5.8040902@oracle.com>
Message-ID: <5010E921.40808@oracle.com>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: ocfs2-devel@oss.oracle.com

Hi Sunil,

I considered your suggestion about this patch. It is possible to change the
status in the dlm hb down event, but what needs to change are the
dlm_reco_node_data structures in the dlm->reco.node_data list. This list is
initialized in dlm_remaster_locks when it begins the lock remaster and is
destroyed before that function exits. So it's not proper to check data in
such a list from the dlm hb down event, am I right?

If we change the status from the dlm hb down event, that means we make the
recovery thread rely on more information from the hb down event. Actually,
dlm->live_nodes_map is already marked in this event for others to check,
right?

This race condition only happens when the cluster is already in recovery and
a node dies during recovery. The recovery thread blocks the update of
dlm->domain_map, so I fall back to checking live_nodes_map, which won't be
blocked.

Please reconsider this patch.

Thanks,
Xiaowei

On 05/31/2012 09:18 AM, Sunil Mushran wrote:
> On Tue, May 29, 2012 at 5:41 PM, Xiaowei wrote:
>> On 05/30/2012 06:09 AM, Sunil Mushran wrote:
>> I would suggest exploring adding this in dlm hb down event. Checking live
>> map all over the place is hacky. We do it more than we should right now.
>> Let's not add to the mess.
>>
>> HI Sunil,
>>
>> Do you mean we should clear the bit in domain map in dlm hb down event
>> directly when the node down and check with dlm_is_node_dead at here?
>> Or how could we explore and ensure the node is alive during the whole
>> migrate process? One node could die even after it sends out one locks
>> package and before the next if there were too many locks on that lockres.
> dlm hb down event is triggered when a node is declared dead. That's where we
> clean up pending mles, etc. You can add a check for recovery and add logic to
> change the reco state for that node there.
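
For illustration, the fallback described at the top of this mail (checking
live_nodes_map while the domain_map update is blocked by recovery) can be
modelled in user space roughly as below. This is only a sketch under stated
assumptions: the model_* names, the 64-bit maps and the reduced state enum
are made up for the example and are not the ocfs2 kernel definitions or the
patch itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified user-space model; NOT the ocfs2 kernel structures. */
enum model_reco_state {
    RECO_DEAD = -1,
    RECO_INIT = 0,
    RECO_REQUESTED,
    RECO_RECEIVING,
    RECO_DONE
};

struct model_dlm {
    uint64_t domain_map;     /* assumed: update is blocked while recovery runs */
    uint64_t live_nodes_map; /* assumed: cleared directly by the hb down event */
};

struct model_reco_node {
    unsigned node_num;       /* must be < 64 in this model */
    enum model_reco_state state;
};

static bool node_in_map(uint64_t map, unsigned node)
{
    return ((map >> node) & 1u) != 0;
}

/* Hypothetical stand-in for the hb down event: only the live map changes. */
static void model_hb_down(struct model_dlm *dlm, unsigned node)
{
    dlm->live_nodes_map &= ~(UINT64_C(1) << node);
}

/*
 * Returns true if the recovery thread should keep waiting for this node.
 * A node missing from live_nodes_map is marked dead even though its bit
 * may still be set in domain_map.
 */
static bool model_still_waiting(const struct model_dlm *dlm,
                                struct model_reco_node *nd)
{
    if (nd->state == RECO_DONE || nd->state == RECO_DEAD)
        return false;
    if (!node_in_map(dlm->live_nodes_map, nd->node_num)) {
        nd->state = RECO_DEAD;
        return false;
    }
    return true;
}
```

In this model, a node that dies mid-recovery still has its bit set in
domain_map, but model_still_waiting() stops waiting for it because the live
map has already been cleared.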