From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sunil Mushran
Date: Mon, 21 Sep 2009 10:25:36 -0700
Subject: [Ocfs2-devel] dlm stress test hangs OCFS2
In-Reply-To: <4AB7B721.6060307@suse.de>
References: <4A8B0083.8030400@suse.de> <4A8B6C29.30802@oracle.com>
 <4A9EA759.5090906@suse.de> <4A9EEB26.2080204@oracle.com>
 <4A9FEDA8.3080108@suse.de> <4A9FEDAC.50704@oracle.com>
 <4AA80AE4.9090105@suse.de> <4AA82136.9000403@oracle.com>
 <4AA890ED.3040406@suse.de> <4AAAD5C6.4000800@oracle.com>
 <4AACFCEB.4060902@suse.de> <4AAE99DF.3030005@oracle.com>
 <4AAEA64C.3030607@suse.de> <4AAED882.9020601@oracle.com>
 <4AAF3E24.9050207@suse.de> <4AB0360B.4050602@oracle.com>
 <4AB7B721.6060307@suse.de>
Message-ID: <4AB7B710.3040801@oracle.com>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: ocfs2-devel@oss.oracle.com

The patch does not contain a fix, only tracing. We may have to disable
a printk for the 2-node cluster to reproduce the hang.

For the BUG, can I have the full logs: the oops trace and the tracing
from all nodes?

Thanks
Sunil

Coly Li wrote:
> Hi Sunil,
>
> I tried this patch, on 2 nodes cluster, it works. No blocking observed so far.
> Then I run it on a 4 nodes cluster, run make_panic on each node simultaneously,
> and BUG inside ocfs2_prepare_downconvert() triggered (in line 3224) on one of
> the nodes (I observed the oops on node x4),
>
> 3214 static unsigned int ocfs2_prepare_downconvert(struct ocfs2_lock_res *lockres,
> 3215                                               int new_level)
> 3216 {
> 3217         assert_spin_locked(&lockres->l_lock);
> 3218
> 3219         BUG_ON(lockres->l_blocking <= DLM_LOCK_NL);
> 3220
> 3221         if (lockres->l_level <= new_level) {
> 3222                 mlog(ML_ERROR, "lockres->l_level (%d) <= new_level (%d)\n",
> 3223                      lockres->l_level, new_level);
> 3224                 BUG();
> 3225         }
> 3226
> 3227         mlog(ML_NOTICE, "lock %s, new_level = %d, l_blocking = %d\n",
> 3228              lockres->l_name, new_level, lockres->l_blocking);
> 3229
> 3230         lockres->l_action = OCFS2_AST_DOWNCONVERT;
> 3231         lockres->l_requested = new_level;
> 3232         lockres_or_flags(lockres, OCFS2_LOCK_BUSY);
> 3233         return lockres_set_pending(lockres);
> 3234 }
>
> I am trying to understand what you did now :-)
>
> Sunil Mushran Wrote:
>
>> So originally my thinking was that the dc thread was not getting kicked.
>> That is not the case. The lock is getting downconverted. But it is getting
>> upconverted shortly thereafter. This just could be the case in which dlmglue
>> is slow to increment the holders to block the dc thread from downconverting
>> the lock. The snippet shows that BAST is received 16 usecs after the
>> upconvert.
>>
>> Coly, I have another patch. Pop out the older patch before applying this
>> one.
>> http://oss.oracle.com/~smushran/0001-ocfs2-Patch-to-debug-hang-in-dlmglue-when-running-d.patch