From: eric <zren@suse.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] (no subject)
Date: Tue, 13 Oct 2015 18:07:14 +0800
Message-ID: <561CD7D2.7070107@suse.com>

Hi David and list,

I'm working on ocfs2 and have run into a problem with DLM POSIX file locks.
After some investigation, I'd like to share what I found and get some hints
from you.

Environment:
    kernel: 3.12.47
    FS: OCFS2
    stack: pacemaker
    cluster: 2 testing nodes, node1, node2

Issue description:
The ocfs2 test suite contains a deadlock test case for file locks. The test
first prepares a test file, file1, on the shared disk. Then node1 calls
"fcntl(file1, F_SETLKW, {F_WRLCK, SEEK_SET, 0, 0})". Then node2 sets
alarm(10s) and issues the same call,
"fcntl(file1, F_SETLKW, {F_WRLCK, SEEK_SET, 0, 0})".
The test expects the alarm to expire and deliver SIGALRM, waking up the
sleeping process, as "man fcntl" says: "If a signal is caught while waiting,
then the call is interrupted and (after the signal handler has returned)
returns immediately (with return value -1 and errno set to EINTR)".

However, ps showed the process on node2 in the "Dl" state, and the signal
had no effect on it, so the test case hung forever.

Investigations:
* Key debug info:
process stack on node1:

n1:/opt/ocfs2-test/bin # cat /proc/22677/stack
[<ffffffff8104250b>] kvm_clock_get_cycles+0x1b/0x20
[<ffffffff810ba924>] __getnstimeofday+0x34/0xc0
[<ffffffff810ba9ba>] getnstimeofday+0xa/0x30
[<ffffffff811bb30d>] SyS_poll+0x5d/0xf0
[<ffffffff81529809>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

process stack on node2:
n2:~ # cat /proc/1534/stack
[<ffffffffa050fa65>] dlm_posix_lock+0x185/0x380 [dlm]
[<ffffffff811f39ce>] fcntl_setlk+0x12e/0x2d0
[<ffffffff811b8231>] SyS_fcntl+0x261/0x510
[<ffffffff81529809>] system_call_fastpath+0x16/0x1b
[<00007f3f5721eb42>] 0x7f3f5721eb42
[<ffffffffffffffff>] 0xffffffffffffffff

* dlm_posix_lock
By adding printk calls and recompiling the dlm kernel module, I located where
n2 hangs:
      dlm_posix_lock -> wait_event_killable
wait_event_killable puts the process into the TASK_KILLABLE state, which is
like TASK_UNINTERRUPTIBLE except that fatal signals can wake the process up.
In my tests, SIGTERM woke it up, but SIGALRM did not.

Does this go against POSIX file lock semantics? Any hints would be much
appreciated! I can provide more information if needed ;-)

Thanks,
Eric



