From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sunil Mushran Date: Wed, 02 Sep 2009 15:01:10 -0700 Subject: [Ocfs2-devel] dlm stress test hangs OCFS2 In-Reply-To: <4A9EA759.5090906@suse.de> References: <4A8B0083.8030400@suse.de> <4A8B6C29.30802@oracle.com> <4A9EA759.5090906@suse.de> Message-ID: <4A9EEB26.2080204@oracle.com> List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: ocfs2-devel@oss.oracle.com So the thread is removing the lockres from the list (thus making the count 0), and then calling ocfs2_process_blocked_lock() which is adding that lockres back in the list (thus 1). Trace ocfs2_process_blocked_lock() to see as to why it is putting it back on the list. Coly Li wrote: > > Sunil Mushran Wrote: >> Read this thread for some background. There are others like this. >> http://oss.oracle.com/pipermail/ocfs2-devel/2009-April/004313.html >> >> David had run into a similar issue with two nodes. The symptoms were the >> same. In that case, we were failing to kick the downconvert thread under >> one situation. >> >> Bottomline, the reason for the hang is that a node is not downconverting >> its lock. It could be a race in dlmglue or something else. >> >> The node has a PR and an another nodes wants an EX. Unless the node >> downconverts >> to a NL, the master cannot upconvert the other node to EX. Hang. Also, >> cancel >> converts are in the mix. >> [snip] >> The downcnvt shows 1 lockres is queued. We have to assume it is this one. >> If not, then we have a bigger problem. Maybe add a quick/dirty hack to dump >> the lockres in this queue. >> >> Maybe we are forgetting to kick it like last time. I did scan the code >> for that but came up empty handed. >> >> To solve this mystery, you have to find out as to why the dc thread is >> not acting on the lockres. Forget stats. Just add printks in that thread. >> Starting from say ocfs2_downconvert_thread_do_work(). >> > > I simplified the original perl script to a simple bash script, > --------------------------------- > #!/bin/sh > > prefix=`hostname` > i=1 > while [ 1 ];do > f="$prefix"_"$i" > echo $f > touch $f > i=`expr $i + 1` > if [ $i -ge 1000 ];then > i=1 > rm -f "$prefix"_* > fi > done > --------------------------------- > > Run the above script on both nodes can also reproduce the blocking issue. > > When the blocking happens, ocfs2_downconvert_thread_do_work() still gets called > again and again. > > I add a printk to display osb->blocked_lock_count before the while(1) loop > inside ocfs2_downconvert_thread_do_work(). > > Here is what I observed, > 1) Before the blocking happens, the > number sequence is, > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 2 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1 > (the count could be 1, 0, 2 and in an irregular sequence) > > 2) when the blocking happens, the number sequence of osb->blocked_lock_count is > always like this, > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0 > ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1 > (all are 0-1-0-1-0-1-... in a regular sequence) > > Continue to track... > >