From mboxrd@z Thu Jan 1 00:00:00 1970
From: eric
Date: Tue, 13 Oct 2015 18:07:14 +0800
Subject: [Cluster-devel] (no subject)
Message-ID: <561CD7D2.7070107@suse.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi David and list,

I'm working on ocfs2 and have run into a problem with DLM POSIX file locks. After some investigation, I'd like to share what I found and get some hints from you.

Environment:
kernel: 3.12.47
FS: OCFS2
stack: pacemaker
cluster: 2 testing nodes, node1, node2

Issue desc:
There is a deadlock test case for file locks in the ocfs2 test suite. The test first prepares a test file, file1, on the shared disk. Then node1 does "fcntl(file1, F_SETLKW, {F_WRLCK, SEEK_SET, 0, 0})". Then node2 sets alarm(10s) and also does "fcntl(file1, F_SETLKW, {F_WRLCK, SEEK_SET, 0, 0})".

The test expects the alarm to expire, deliver SIGALRM, and wake up the sleeping process, as "man fcntl" says: "If a signal is caught while waiting, then the call is interrupted and (after the signal handler has returned) returns immediately (with return value -1 and errno set to EINTR)".

However, ps showed the process on node2 in the "Dl" state, and the signal was never delivered. So the test case hung forever.
Investigations:

* Key debug info:

process stack on node1:
n1:/opt/ocfs2-test/bin # cat /proc/22677/stack
[] kvm_clock_get_cycles+0x1b/0x20
[] __getnstimeofday+0x34/0xc0
[] getnstimeofday+0xa/0x30
[] SyS_poll+0x5d/0xf0
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff

process stack on node2:
n2:~ # cat /proc/1534/stack
[] dlm_posix_lock+0x185/0x380 [dlm]
[] fcntl_setlk+0x12e/0x2d0
[] SyS_fcntl+0x261/0x510
[] system_call_fastpath+0x16/0x1b
[<00007f3f5721eb42>] 0x7f3f5721eb42
[] 0xffffffffffffffff

* dlm_posix_lock

By adding printk calls and recompiling the dlm kernel module, I located where n2 hangs:

dlm_posix_lock -> wait_event_killable

wait_event_killable puts the process into the TASK_KILLABLE state, which is like TASK_UNINTERRUPTIBLE except that the process can still be woken up by fatal signals. In my tests, SIGTERM woke the process, but SIGALRM did not.

Does this go against POSIX file lock semantics? Any hints would be very much appreciated! I can provide more information if needed ;-)

Thanks,
Eric