From mboxrd@z Thu Jan 1 00:00:00 1970
From: teigland@sourceware.org
Date: 14 Jan 2008 15:35:31 -0000
Subject: [Cluster-devel] cluster/gfs-kernel/src/dlm mount.c
Message-ID: <20080114153531.26366.qmail@sourceware.org>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

CVSROOT:	/cvs/cluster
Module name:	cluster
Branch:	RHEL4
Changes by:	teigland at sourceware.org	2008-01-14 15:35:30

Modified files:
	gfs-kernel/src/dlm: mount.c

Log message:
	bz 324881

	It's easy to tell if you've hit this bug, because a message like
	this will always appear in /var/log/messages:

	SM: 02000378 ignoring service callback id=2000144 event=1324

	If you look at /proc/cluster/lock_dlm/debug on this node at this
	point, you'll see something like this at the end, which shows what
	the problem is:

	others_may_mount start_done 1322 b

	The event_id that others_may_mount() uses when calling
	kcl_start_done() is incorrect; it's using 1322 when it should be
	1324.

	I believe the fix is for others_may_mount() to read the event_id
	after taking the unmount_lock semaphore, which serializes
	others_may_mount() with a start callback from the lock_dlm thread.
	In this case, I believe the start callback is changing the event_id
	after others_may_mount() reads it, and before others_may_mount()
	gets the unmount_lock semaphore.

Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/gfs-kernel/src/dlm/mount.c.diff?cvsroot=cluster&only_with_tag=RHEL4&r1=1.11.2.3&r2=1.11.2.4

--- cluster/gfs-kernel/src/dlm/Attic/mount.c	2005/06/29 07:28:21	1.11.2.3
+++ cluster/gfs-kernel/src/dlm/Attic/mount.c	2008/01/14 15:35:30	1.11.2.4
@@ -316,11 +316,12 @@
 		return;
 	}
 
+	down(&dlm->unmount_lock);
+
 	spin_lock(&dlm->async_lock);
 	last_start = dlm->mg_last_start;
 	spin_unlock(&dlm->async_lock);
 
-	down(&dlm->unmount_lock);
 	set_bit(DFL_OTHERSMAYMOUNT, &dlm->flags);
 
 	/* There's been a start to add a second node while we've been