From: Menyhart Zoltan <Zoltan.Menyhart@bull.net>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] "->ls_in_recovery" not released
Date: Mon, 22 Nov 2010 17:31:25 +0100
Message-ID: <4CEA9ADD.2050109@bull.net>
Hi,
We have a two-node OCFS2 file system managed by Pacemaker.
As part of our robustness tests, we block access to the "other" node.
On the "local" machine, "dlm_recoverd" is then blocked:
PID: 15617 TASK: ffff880c77572d90 CPU: 38 COMMAND: "dlm_recoverd"
 #0 [ffff880c7cb07c30] schedule at ffffffff81452830
 #1 [ffff880c7cb07cf8] dlm_wait_function at ffffffffa03aaffb
 #2 [ffff880c7cb07d68] dlm_rcom_status at ffffffffa03aa3d9
                       ping_members
 #3 [ffff880c7cb07db8] dlm_recover_members at ffffffffa03a58a3
                       ls_recover
                       do_ls_recovery
 #4 [ffff880c7cb07e48] dlm_recoverd at ffffffffa03abc89
 #5 [ffff880c7cb07ee8] kthread at ffffffff810820f6
 #6 [ffff880c7cb07f48] kernel_thread at ffffffff8100d1aa
If either the monitor device is closed or someone sends a "stop"
down the control device, "ls_recover()" takes the "fail:" branch
without releasing "->ls_in_recovery".
As a result, OCFS2 operations remain blocked, e.g.:
PID: 3385 TASK: ffff880876e69520 CPU: 1 COMMAND: "bash"
#0 [ffff88087cb91980] schedule at ffffffff81452830
#1 [ffff88087cb91a48] rwsem_down_failed_common at ffffffff81454c95
#2 [ffff88087cb91a98] rwsem_down_read_failed at ffffffff81454e26
#3 [ffff88087cb91ad8] call_rwsem_down_read_failed at ffffffff81248004
#4 [ffff88087cb91b40] dlm_lock at ffffffffa03a17b2
#5 [ffff88087cb91c00] user_dlm_lock at ffffffffa020d18e
#6 [ffff88087cb91c30] ocfs2_dlm_lock at ffffffffa00683c2
#7 [ffff88087cb91c40] __ocfs2_cluster_lock at ffffffffa04f951c
#8 [ffff88087cb91d60] ocfs2_inode_lock_full_nested at ffffffffa04fd800
#9 [ffff88087cb91df0] ocfs2_inode_revalidate at ffffffffa0507566
#10 [ffff88087cb91e20] ocfs2_getattr at ffffffffa050270b
#11 [ffff88087cb91e60] vfs_getattr at ffffffff8115cac1
#12 [ffff88087cb91ea0] vfs_fstatat at ffffffff8115cb50
#13 [ffff88087cb91ee0] vfs_stat at ffffffff8115cc9b
#14 [ffff88087cb91ef0] sys_newstat at ffffffff8115ccc4
#15 [ffff88087cb91f80] system_call_fastpath at ffffffff8100c172
"ls_recover()" has several other paths that simply jump to the
"fail:" branch without releasing "->ls_in_recovery" and without
cleaning up the inconsistent data left behind.
I think some error handling code is missing in "ls_recover()".
Have you modified the DLM since RHEL 6.0?
Thanks,
Zoltan Menyhart