From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sunil Mushran Date: Wed, 15 Apr 2009 19:07:19 -0700 Subject: [Ocfs2-devel] hang with fsdlm In-Reply-To: <20090415165717.GD22705@redhat.com> References: <20090415165717.GD22705@redhat.com> Message-ID: <49E692D7.5030009@oracle.com> List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: ocfs2-devel@oss.oracle.com David, So it's the same issue. bull-02 knows it is blocking an EX, has no holders, but is not doing anything to downconvert. Coly Li is working on this same issue on novellbz#492055. https://bugzilla.novell.com/show_bug.cgi?id=492055 He is unable to reproduce with the sles11 kernel. The dlm bits should be the same as mainline. I think. I've asked him to run with mainline. Do you have a novell bz account? Sunil David Teigland wrote: > Using fsdlm/ocfs2_controld.cman, I've rerun the test I've been having problems > with on 2.6.30-rc1. After running for several minutes in the same directory > on three nodes, the test hangs, and I collect the following information: > > > bull-01 > ------- > > 3053 S< [ocfs2dc] ocfs2_downconvert_thread > 3054 S< [dlm_astd] dlm_astd > 3055 S< [dlm_scand] dlm_scand > 3056 S< [dlm_recv/0] worker_thread > 3057 S< [dlm_recv/1] worker_thread > 3058 S< [dlm_recv/2] worker_thread > 3059 S< [dlm_recv/3] worker_thread > 3060 S< [dlm_send] worker_thread > 3061 S< [dlm_recoverd] dlm_recoverd > 3067 S< [kjournald2] kjournald2 > 3068 S< [ocfs2cmt] ocfs2_commit_thread > 3082 D+ /usr/bin/perl /root/make_pa ocfs2_cluster_lock > > make_panic D 0000000000000002 0 3082 3021 > ffff8801360f1a38 0000000000000046 ffffffffa0240b08 ffff88013703ea40 > ffff88013eb877f8 0000000000004000 00000000001d0a80 000000000000cb10 > 0000000000000000 ffffffffa0240af0 ffff88013703ea40 ffff88007ffee340 > Call Trace: > [] ? dlm_put_lockspace+0x18/0x2b [dlm] > [] schedule+0x9/0x1d > [] schedule_timeout+0x24/0x15e > [] ? wait_for_common+0x3c/0x123 > [] wait_for_common+0xcf/0x123 > [] ? default_wake_function+0x0/0xf > [] wait_for_completion+0x18/0x1a > [] ocfs2_cluster_lock+0x91f/0x93e [ocfs2] > [] ? _spin_unlock+0x26/0x2a > [] ? ocfs2_recovery_completed+0x34/0x3c [ocfs2] > [] ocfs2_inode_lock_full+0x180/0xccb [ocfs2] > [] ? print_lock_contention_bug+0x1e/0x110 > [] ocfs2_unlink+0x1a3/0xa6e [ocfs2] > [] ? vfs_unlink+0x95/0x110 > [] vfs_unlink+0xbe/0x110 > [] do_unlinkat+0xcd/0x15d > [] ? sys_newlstat+0x31/0x3c > [] ? lockdep_sys_exit_thunk+0x35/0x67 > [] sys_unlink+0x11/0x13 > [] system_call_fastpath+0x16/0x1b > > Device => Id: 253,2 Uuid: 8A77044F9D98416FAB240B44B1843EDA Gen: 0x7622CA50 > Label: > Volume => State: 2 Flags: 0x0 > Sizes => Block: 4096 Cluster: 4096 > Features => Compat: 0x3 Incompat: 0x1D0 ROcompat: 0x1 > Mount => Opts: 0x0 AtimeQuanta: 60 > Cluster => Stack: cman Name: 8A77044F9D98416FAB240B44B1843EDA Version: 1.0 > DownCnvt => Pid: 3053 Count: 0 WakeSeq: 55892 WorkSeq: 55892 > Recovery => Pid: -1 Nodes: None > Commit => Pid: 3068 Interval: 0 Needs: 1 > Journal => State: 1 NumTxns: 2 TxnId: 1136 > Stats => GlobalAllocs: 3 LocalAllocs: 915 SubAllocs: 2938 LAWinMoves: 1 > SAExtends: 2 > LocalAlloc => State: 1 Descriptor: 2032128 Size: 2048 bits Default: 2048 bits > Steal => Slot: -1 NumStolen: 0 > Slots => Num RecoGen > * 0 3 > 1 3 > 2 3 > 3 3 > > Lockres: M0000000000000000046e2a00000000 Mode: Protected Read > Flags: Initialized Attached Busy > RO Holders: 0 EX Holders: 0 > Pending Action: Convert Pending Unlock Action: None > Requested Mode: Exclusive Blocking Mode: No Lock > PR > Gets: 982 Fails: 0 Waits (usec) Total: 637775 Max: 79972 > EX > Gets: 69 Fails: 0 Waits (usec) Total: 723745 Max: 40985 > Disk Refreshes: 0 > > Resource len 31 "M0000000000000000046e2a00000000" > Master > LVB len 64 seq 123 > 05 00 00 00 00 00 00 02 00 00 00 00 00 00 00 00 > 12 75 a1 17 e0 68 d0 10 12 79 39 dd b8 79 ee b4 > 12 79 39 dd b8 79 ee b4 00 00 00 00 00 00 20 00 > 41 ed 00 02 00 00 00 00 82 92 d2 7a 00 00 00 00 > Granted > 01ce0003 PR Remote: 4 03d80001 > 014a0002 PR Remote: 2 00d70001 > Convert > 038f0001 PR (EX) > > > bull-02 > ------- > > 2759 S< [ocfs2dc] ocfs2_downconvert_thread > 2760 S< [dlm_astd] dlm_astd > 2761 S< [dlm_scand] dlm_scand > 2762 S< [dlm_recv/0] worker_thread > 2763 S< [dlm_recv/1] worker_thread > 2764 S< [dlm_recv/2] worker_thread > 2765 S< [dlm_recv/3] worker_thread > 2766 S< [dlm_send] worker_thread > 2767 S< [dlm_recoverd] dlm_recoverd > 2779 S< [kjournald2] kjournald2 > 2780 S< [ocfs2cmt] ocfs2_commit_thread > 2783 D+ /usr/bin/perl /root/make_pa ocfs2_cluster_lock > > make_panic D 0000000000000002 0 2783 2727 > ffff88013f497958 0000000000000046 ffffffffa024ab08 ffff88013ed6e540 > ffff88007e9557f8 0000000000004000 00000000001d0a80 000000000000cb10 > 0000000000000000 ffffffffa024aaf0 ffff88013ed6e540 ffff88007fffe440 > Call Trace: > [] ? dlm_put_lockspace+0x18/0x2b [dlm] > [] schedule+0x9/0x1d > [] schedule_timeout+0x24/0x15e > [] ? wait_for_common+0x3c/0x123 > [] wait_for_common+0xcf/0x123 > [] ? default_wake_function+0x0/0xf > [] wait_for_completion+0x18/0x1a > [] ocfs2_cluster_lock+0x91f/0x93e [ocfs2] > [] ? print_lock_contention_bug+0x1e/0x110 > [] ? _spin_unlock+0x26/0x2a > [] ? ocfs2_recovery_completed+0x34/0x3c [ocfs2] > [] ocfs2_inode_lock_full+0x180/0xccb [ocfs2] > [] ? ocfs2_dentry_revalidate+0x194/0x268 [ocfs2] > [] ocfs2_permission+0x70/0x167 [ocfs2] > [] inode_permission+0x69/0x97 > [] __link_path_walk+0x134/0xdea > [] path_walk+0x69/0xd4 > [] do_path_lookup+0x187/0x1df > [] ? get_empty_filp+0xfc/0x188 > [] path_lookup_open+0x5a/0x9b > [] do_filp_open+0xb6/0x899 > [] ? alloc_fd+0x3a/0x12e > [] ? alloc_fd+0x11d/0x12e > [] do_sys_open+0x53/0xd3 > [] sys_open+0x1b/0x1d > [] system_call_fastpath+0x16/0x1b > > Device => Id: 253,2 Uuid: 8A77044F9D98416FAB240B44B1843EDA Gen: 0x7622CA50 Label: > Volume => State: 2 Flags: 0x0 > Sizes => Block: 4096 Cluster: 4096 > Features => Compat: 0x3 Incompat: 0x1D0 ROcompat: 0x1 > Mount => Opts: 0x0 AtimeQuanta: 60 > Cluster => Stack: cman Name: 8A77044F9D98416FAB240B44B1843EDA Version: 1.0 > DownCnvt => Pid: 2759 Count: 1 WakeSeq: 14838 WorkSeq: 14838 > Recovery => Pid: -1 Nodes: None > Commit => Pid: 2780 Interval: 0 Needs: 1 > Journal => State: 1 NumTxns: 8 TxnId: 928 > Stats => GlobalAllocs: 3 LocalAllocs: 513 SubAllocs: 1556 LAWinMoves: 1 SAExtends: 2 > LocalAlloc => State: 1 Descriptor: 2064384 Size: 2048 bits Default: 2048 bits > Steal => Slot: -1 NumStolen: 0 > Slots => Num RecoGen > 0 3 > * 1 3 > 2 3 > 3 3 > > Lockres: M0000000000000000046e2a00000000 Mode: Protected Read > Flags: Initialized Attached Blocked Needs Refresh Queued > RO Holders: 0 EX Holders: 0 > Pending Action: None Pending Unlock Action: None > Requested Mode: Protected Read Blocking Mode: Exclusive > PR > Gets: 463 Fails: 0 Waits (usec) Total: 1052625 Max: 41985 > EX > Gets: 37 Fails: 0 Waits (usec) Total: 990652 Max: 79971 > Disk Refreshes: 0 > > Resource len 31 "M0000000000000000046e2a00000000" > Local 1 > Granted > 00d70001 PR Master: 1 014a0002 > > dlm: 8A77044F9D98416FAB240B44B1843EDA: addwait d70001 cur 2 overlap 4 count 2 f 100000 > dlm: 8A77044F9D98416FAB240B44B1843EDA: remwait d70001 convert_reply zap overlap_cancel > dlm: 8A77044F9D98416FAB240B44B1843EDA: remwait d70001 cancel_reply wait_type 0 > > messages mean a cancel operation occurs while a remote convert is in progress, > then the convert reply comes back indicating that the convert was successful, > so the cancel is a moot point and being ignored. > > > bull-04 > ------- > > 4114 S< [ocfs2dc] ocfs2_downconvert_thread > 4115 S< [dlm_astd] dlm_astd > 4116 S< [dlm_scand] dlm_scand > 4117 S< [dlm_recv/0] worker_thread > 4118 S< [dlm_recv/1] worker_thread > 4119 S< [dlm_recv/2] worker_thread > 4120 S< [dlm_recv/3] worker_thread > 4121 S< [dlm_send] worker_thread > 4122 S< [dlm_recoverd] dlm_recoverd > 4136 S< [kjournald2] kjournald2 > 4137 S< [ocfs2cmt] ocfs2_commit_thread > 4142 D+ /usr/bin/perl /root/make_pa ocfs2_wait_for_mask > > make_panic D 0000000000000002 0 4142 4086 > ffff88013147b998 0000000000000046 ffff88013147b938 ffffffff8025aea3 > 0000000000000000 ffff88013142acc0 ffff88007ffee340 ffff88013142b078 > 0000000100078e31 0000000000000246 0000000000000292 ffffffffa02b102b > Call Trace: > [] ? print_lock_contention_bug+0x1b/0xe1 > [] ? dlm_put_lockspace+0x18/0x2b [dlm] > [] schedule+0x9/0x1d > [] schedule_timeout+0x28/0x176 > [] ? wait_for_common+0x3d/0x11e > [] wait_for_common+0xc6/0x11e > [] ? default_wake_function+0x0/0xf > [] ? ocfs2_cluster_lock+0xf2/0x8e9 [ocfs2] > [] wait_for_completion+0x18/0x1a > [] ocfs2_wait_for_mask+0x15/0x23 [ocfs2] > [] ocfs2_cluster_lock+0x8c4/0x8e9 [ocfs2] > [] ? print_lock_contention_bug+0x1b/0xe1 > [] ? _spin_unlock+0x26/0x2a > [] ? ocfs2_recovery_completed+0x34/0x3c [ocfs2] > [] ocfs2_inode_lock_full+0x181/0xcbc [ocfs2] > [] ? _spin_unlock+0x26/0x2a > [] ? ocfs2_dentry_revalidate+0x194/0x26c [ocfs2] > [] ocfs2_permission+0x70/0x166 [ocfs2] > [] inode_permission+0x69/0x97 > [] __link_path_walk+0x13f/0xced > [] ? print_lock_contention_bug+0x1b/0xe1 > [] path_walk+0x4e/0x97 > [] do_path_lookup+0x118/0x175 > [] do_filp_open+0xde/0x819 > [] ? print_lock_contention_bug+0x1b/0xe1 > [] ? getname+0x26/0x1ac > [] ? alloc_fd+0x33/0x11e > [] ? _spin_unlock+0x26/0x2a > [] ? alloc_fd+0x10f/0x11e > [] do_sys_open+0x53/0xd3 > [] sys_open+0x1b/0x1d > [] system_call_fastpath+0x16/0x1b > > Device => Id: 253,2 Uuid: 8A77044F9D98416FAB240B44B1843EDA Gen: 0x7622CA50 Label: > Volume => State: 2 Flags: 0x0 > Sizes => Block: 4096 Cluster: 4096 > Features => Compat: 0x3 Incompat: 0x1D0 ROcompat: 0x1 > Mount => Opts: 0x0 AtimeQuanta: 60 > Cluster => Stack: cman Name: 8A77044F9D98416FAB240B44B1843EDA Version: 1.0 > DownCnvt => Pid: 4114 Count: 1 WakeSeq: 14213 WorkSeq: 14213 > Recovery => Pid: -1 Nodes: None > Commit => Pid: 4137 Interval: 0 Needs: 1 > Journal => State: 1 NumTxns: 4 TxnId: 851 > Stats => GlobalAllocs: 2 LocalAllocs: 414 SubAllocs: 1396 LAWinMoves: 1 SAExtends: 1 > LocalAlloc => State: 1 Descriptor: 2096640 Size: 2048 bits Default: 2048 bits > Steal => Slot: -1 NumStolen: 0 > Slots => Num RecoGen > 0 3 > 1 3 > 2 3 > * 3 3 > > no M0000000000000000046e2a00000000 exists in locking_state > > Resource len 31 "M0000000000000000046e2a00000000" > Local 1 > Granted > 03d80001 PR Master: 1 01ce0003 > > dlm: 8A77044F9D98416FAB240B44B1843EDA: validate_unlock_args -16 3d80001 0 10c 2 0 M0000000000000000046e2a00000000 > > message means the lock is granted, i.e. the convert being canceled already > completed, so the cancel operation is being ignored. > > > _______________________________________________ > Ocfs2-devel mailing list > Ocfs2-devel at oss.oracle.com > http://oss.oracle.com/mailman/listinfo/ocfs2-devel >