From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Teigland Date: Wed, 15 Apr 2009 11:57:17 -0500 Subject: [Ocfs2-devel] hang with fsdlm Message-ID: <20090415165717.GD22705@redhat.com> List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: ocfs2-devel@oss.oracle.com Using fsdlm/ocfs2_controld.cman, I've rerun the test I've been having problems with on 2.6.30-rc1. After running for several minutes in the same directory on three nodes, the test hangs, and I collect the following information: bull-01 ------- 3053 S< [ocfs2dc] ocfs2_downconvert_thread 3054 S< [dlm_astd] dlm_astd 3055 S< [dlm_scand] dlm_scand 3056 S< [dlm_recv/0] worker_thread 3057 S< [dlm_recv/1] worker_thread 3058 S< [dlm_recv/2] worker_thread 3059 S< [dlm_recv/3] worker_thread 3060 S< [dlm_send] worker_thread 3061 S< [dlm_recoverd] dlm_recoverd 3067 S< [kjournald2] kjournald2 3068 S< [ocfs2cmt] ocfs2_commit_thread 3082 D+ /usr/bin/perl /root/make_pa ocfs2_cluster_lock make_panic D 0000000000000002 0 3082 3021 ffff8801360f1a38 0000000000000046 ffffffffa0240b08 ffff88013703ea40 ffff88013eb877f8 0000000000004000 00000000001d0a80 000000000000cb10 0000000000000000 ffffffffa0240af0 ffff88013703ea40 ffff88007ffee340 Call Trace: [] ? dlm_put_lockspace+0x18/0x2b [dlm] [] schedule+0x9/0x1d [] schedule_timeout+0x24/0x15e [] ? wait_for_common+0x3c/0x123 [] wait_for_common+0xcf/0x123 [] ? default_wake_function+0x0/0xf [] wait_for_completion+0x18/0x1a [] ocfs2_cluster_lock+0x91f/0x93e [ocfs2] [] ? _spin_unlock+0x26/0x2a [] ? ocfs2_recovery_completed+0x34/0x3c [ocfs2] [] ocfs2_inode_lock_full+0x180/0xccb [ocfs2] [] ? print_lock_contention_bug+0x1e/0x110 [] ocfs2_unlink+0x1a3/0xa6e [ocfs2] [] ? vfs_unlink+0x95/0x110 [] vfs_unlink+0xbe/0x110 [] do_unlinkat+0xcd/0x15d [] ? sys_newlstat+0x31/0x3c [] ? lockdep_sys_exit_thunk+0x35/0x67 [] sys_unlink+0x11/0x13 [] system_call_fastpath+0x16/0x1b Device => Id: 253,2 Uuid: 8A77044F9D98416FAB240B44B1843EDA Gen: 0x7622CA50 Label: Volume => State: 2 Flags: 0x0 Sizes => Block: 4096 Cluster: 4096 Features => Compat: 0x3 Incompat: 0x1D0 ROcompat: 0x1 Mount => Opts: 0x0 AtimeQuanta: 60 Cluster => Stack: cman Name: 8A77044F9D98416FAB240B44B1843EDA Version: 1.0 DownCnvt => Pid: 3053 Count: 0 WakeSeq: 55892 WorkSeq: 55892 Recovery => Pid: -1 Nodes: None Commit => Pid: 3068 Interval: 0 Needs: 1 Journal => State: 1 NumTxns: 2 TxnId: 1136 Stats => GlobalAllocs: 3 LocalAllocs: 915 SubAllocs: 2938 LAWinMoves: 1 SAExtends: 2 LocalAlloc => State: 1 Descriptor: 2032128 Size: 2048 bits Default: 2048 bits Steal => Slot: -1 NumStolen: 0 Slots => Num RecoGen * 0 3 1 3 2 3 3 3 Lockres: M0000000000000000046e2a00000000 Mode: Protected Read Flags: Initialized Attached Busy RO Holders: 0 EX Holders: 0 Pending Action: Convert Pending Unlock Action: None Requested Mode: Exclusive Blocking Mode: No Lock PR > Gets: 982 Fails: 0 Waits (usec) Total: 637775 Max: 79972 EX > Gets: 69 Fails: 0 Waits (usec) Total: 723745 Max: 40985 Disk Refreshes: 0 Resource len 31 "M0000000000000000046e2a00000000" Master LVB len 64 seq 123 05 00 00 00 00 00 00 02 00 00 00 00 00 00 00 00 12 75 a1 17 e0 68 d0 10 12 79 39 dd b8 79 ee b4 12 79 39 dd b8 79 ee b4 00 00 00 00 00 00 20 00 41 ed 00 02 00 00 00 00 82 92 d2 7a 00 00 00 00 Granted 01ce0003 PR Remote: 4 03d80001 014a0002 PR Remote: 2 00d70001 Convert 038f0001 PR (EX) bull-02 ------- 2759 S< [ocfs2dc] ocfs2_downconvert_thread 2760 S< [dlm_astd] dlm_astd 2761 S< [dlm_scand] dlm_scand 2762 S< [dlm_recv/0] worker_thread 2763 S< [dlm_recv/1] worker_thread 2764 S< [dlm_recv/2] worker_thread 2765 S< [dlm_recv/3] worker_thread 2766 S< [dlm_send] worker_thread 2767 S< [dlm_recoverd] dlm_recoverd 2779 S< [kjournald2] kjournald2 2780 S< [ocfs2cmt] ocfs2_commit_thread 2783 D+ /usr/bin/perl /root/make_pa ocfs2_cluster_lock make_panic D 0000000000000002 0 2783 2727 ffff88013f497958 0000000000000046 ffffffffa024ab08 ffff88013ed6e540 ffff88007e9557f8 0000000000004000 00000000001d0a80 000000000000cb10 0000000000000000 ffffffffa024aaf0 ffff88013ed6e540 ffff88007fffe440 Call Trace: [] ? dlm_put_lockspace+0x18/0x2b [dlm] [] schedule+0x9/0x1d [] schedule_timeout+0x24/0x15e [] ? wait_for_common+0x3c/0x123 [] wait_for_common+0xcf/0x123 [] ? default_wake_function+0x0/0xf [] wait_for_completion+0x18/0x1a [] ocfs2_cluster_lock+0x91f/0x93e [ocfs2] [] ? print_lock_contention_bug+0x1e/0x110 [] ? _spin_unlock+0x26/0x2a [] ? ocfs2_recovery_completed+0x34/0x3c [ocfs2] [] ocfs2_inode_lock_full+0x180/0xccb [ocfs2] [] ? ocfs2_dentry_revalidate+0x194/0x268 [ocfs2] [] ocfs2_permission+0x70/0x167 [ocfs2] [] inode_permission+0x69/0x97 [] __link_path_walk+0x134/0xdea [] path_walk+0x69/0xd4 [] do_path_lookup+0x187/0x1df [] ? get_empty_filp+0xfc/0x188 [] path_lookup_open+0x5a/0x9b [] do_filp_open+0xb6/0x899 [] ? alloc_fd+0x3a/0x12e [] ? alloc_fd+0x11d/0x12e [] do_sys_open+0x53/0xd3 [] sys_open+0x1b/0x1d [] system_call_fastpath+0x16/0x1b Device => Id: 253,2 Uuid: 8A77044F9D98416FAB240B44B1843EDA Gen: 0x7622CA50 Label: Volume => State: 2 Flags: 0x0 Sizes => Block: 4096 Cluster: 4096 Features => Compat: 0x3 Incompat: 0x1D0 ROcompat: 0x1 Mount => Opts: 0x0 AtimeQuanta: 60 Cluster => Stack: cman Name: 8A77044F9D98416FAB240B44B1843EDA Version: 1.0 DownCnvt => Pid: 2759 Count: 1 WakeSeq: 14838 WorkSeq: 14838 Recovery => Pid: -1 Nodes: None Commit => Pid: 2780 Interval: 0 Needs: 1 Journal => State: 1 NumTxns: 8 TxnId: 928 Stats => GlobalAllocs: 3 LocalAllocs: 513 SubAllocs: 1556 LAWinMoves: 1 SAExtends: 2 LocalAlloc => State: 1 Descriptor: 2064384 Size: 2048 bits Default: 2048 bits Steal => Slot: -1 NumStolen: 0 Slots => Num RecoGen 0 3 * 1 3 2 3 3 3 Lockres: M0000000000000000046e2a00000000 Mode: Protected Read Flags: Initialized Attached Blocked Needs Refresh Queued RO Holders: 0 EX Holders: 0 Pending Action: None Pending Unlock Action: None Requested Mode: Protected Read Blocking Mode: Exclusive PR > Gets: 463 Fails: 0 Waits (usec) Total: 1052625 Max: 41985 EX > Gets: 37 Fails: 0 Waits (usec) Total: 990652 Max: 79971 Disk Refreshes: 0 Resource len 31 "M0000000000000000046e2a00000000" Local 1 Granted 00d70001 PR Master: 1 014a0002 dlm: 8A77044F9D98416FAB240B44B1843EDA: addwait d70001 cur 2 overlap 4 count 2 f 100000 dlm: 8A77044F9D98416FAB240B44B1843EDA: remwait d70001 convert_reply zap overlap_cancel dlm: 8A77044F9D98416FAB240B44B1843EDA: remwait d70001 cancel_reply wait_type 0 messages mean a cancel operation occurs while a remote convert is in progress, then the convert reply comes back indicating that the convert was successful, so the cancel is a moot point and being ignored. bull-04 ------- 4114 S< [ocfs2dc] ocfs2_downconvert_thread 4115 S< [dlm_astd] dlm_astd 4116 S< [dlm_scand] dlm_scand 4117 S< [dlm_recv/0] worker_thread 4118 S< [dlm_recv/1] worker_thread 4119 S< [dlm_recv/2] worker_thread 4120 S< [dlm_recv/3] worker_thread 4121 S< [dlm_send] worker_thread 4122 S< [dlm_recoverd] dlm_recoverd 4136 S< [kjournald2] kjournald2 4137 S< [ocfs2cmt] ocfs2_commit_thread 4142 D+ /usr/bin/perl /root/make_pa ocfs2_wait_for_mask make_panic D 0000000000000002 0 4142 4086 ffff88013147b998 0000000000000046 ffff88013147b938 ffffffff8025aea3 0000000000000000 ffff88013142acc0 ffff88007ffee340 ffff88013142b078 0000000100078e31 0000000000000246 0000000000000292 ffffffffa02b102b Call Trace: [] ? print_lock_contention_bug+0x1b/0xe1 [] ? dlm_put_lockspace+0x18/0x2b [dlm] [] schedule+0x9/0x1d [] schedule_timeout+0x28/0x176 [] ? wait_for_common+0x3d/0x11e [] wait_for_common+0xc6/0x11e [] ? default_wake_function+0x0/0xf [] ? ocfs2_cluster_lock+0xf2/0x8e9 [ocfs2] [] wait_for_completion+0x18/0x1a [] ocfs2_wait_for_mask+0x15/0x23 [ocfs2] [] ocfs2_cluster_lock+0x8c4/0x8e9 [ocfs2] [] ? print_lock_contention_bug+0x1b/0xe1 [] ? _spin_unlock+0x26/0x2a [] ? ocfs2_recovery_completed+0x34/0x3c [ocfs2] [] ocfs2_inode_lock_full+0x181/0xcbc [ocfs2] [] ? _spin_unlock+0x26/0x2a [] ? ocfs2_dentry_revalidate+0x194/0x26c [ocfs2] [] ocfs2_permission+0x70/0x166 [ocfs2] [] inode_permission+0x69/0x97 [] __link_path_walk+0x13f/0xced [] ? print_lock_contention_bug+0x1b/0xe1 [] path_walk+0x4e/0x97 [] do_path_lookup+0x118/0x175 [] do_filp_open+0xde/0x819 [] ? print_lock_contention_bug+0x1b/0xe1 [] ? getname+0x26/0x1ac [] ? alloc_fd+0x33/0x11e [] ? _spin_unlock+0x26/0x2a [] ? alloc_fd+0x10f/0x11e [] do_sys_open+0x53/0xd3 [] sys_open+0x1b/0x1d [] system_call_fastpath+0x16/0x1b Device => Id: 253,2 Uuid: 8A77044F9D98416FAB240B44B1843EDA Gen: 0x7622CA50 Label: Volume => State: 2 Flags: 0x0 Sizes => Block: 4096 Cluster: 4096 Features => Compat: 0x3 Incompat: 0x1D0 ROcompat: 0x1 Mount => Opts: 0x0 AtimeQuanta: 60 Cluster => Stack: cman Name: 8A77044F9D98416FAB240B44B1843EDA Version: 1.0 DownCnvt => Pid: 4114 Count: 1 WakeSeq: 14213 WorkSeq: 14213 Recovery => Pid: -1 Nodes: None Commit => Pid: 4137 Interval: 0 Needs: 1 Journal => State: 1 NumTxns: 4 TxnId: 851 Stats => GlobalAllocs: 2 LocalAllocs: 414 SubAllocs: 1396 LAWinMoves: 1 SAExtends: 1 LocalAlloc => State: 1 Descriptor: 2096640 Size: 2048 bits Default: 2048 bits Steal => Slot: -1 NumStolen: 0 Slots => Num RecoGen 0 3 1 3 2 3 * 3 3 no M0000000000000000046e2a00000000 exists in locking_state Resource len 31 "M0000000000000000046e2a00000000" Local 1 Granted 03d80001 PR Master: 1 01ce0003 dlm: 8A77044F9D98416FAB240B44B1843EDA: validate_unlock_args -16 3d80001 0 10c 2 0 M0000000000000000046e2a00000000 message means the lock is granted, i.e. the convert being canceled already completed, so the cancel operation is being ignored.