From mboxrd@z Thu Jan  1 00:00:00 1970
From: Bob Peterson <rpeterso@redhat.com>
Date: Sun, 24 Jun 2018 08:49:52 -0400 (EDT)
Subject: [Cluster-devel] [GFS2 PATCH] GFS2: Skip taking journal recovery
 locks for spectators
In-Reply-To: <CAHc6FU6DzLDjKQT-EMGtnoWKkQDfX0wieL3kO8B4JoQ=LcQUqA@mail.gmail.com>
References: <2061107895.44805743.1529618725855.JavaMail.zimbra@redhat.com>
	<602032757.44805751.1529618745642.JavaMail.zimbra@redhat.com>
	<CAHc6FU6DzLDjKQT-EMGtnoWKkQDfX0wieL3kO8B4JoQ=LcQUqA@mail.gmail.com>
Message-ID: <1041963515.45201495.1529844592291.JavaMail.zimbra@redhat.com>
List-Id: <cluster-devel.redhat.com>
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

----- Original Message -----
> Bob,
> 
> On 22 June 2018 at 00:05, Bob Peterson <rpeterso@redhat.com> wrote:
> > Hi,
> >
> > Before this patch, spectator mounts would try to acquire the dlm's
> > control lock and mounted lock as part of its normal recovery sequence.
> > It's not necessary because spectators don't ever do journal recovery.
> > And if they acquire those locks (at all) it will prevent another
> > first-mounter from acquiring the lock in EX mode, which means it
> > also cannot do journal recovery because it doesn't think it's the
> > first node to mount the file system.
> >
> > This patch checks if the mounter is a spectator, and if so, avoids
> > grabbing the control lock and mounted lock. This allows a secondary
> > mounter who is really the first rw mounter, to do journal recovery:
> > since the spectator doesn't acquire those locks, it can grab them
> > in EX mode, and therefore consider itself to be the first mounter.
> 
> if I understand things correctly, spectator mounts were so far holding
> the mounted_lock in PR mode which did prevent other nodes from
> starting recovery. With this change, spectators can now end up reading
> from the filesystem during recovery. That doesn't seem safe,
> especially since recovery won't take the usual glocks, and so there
> would be no synchronization with spectators anymore. I'm wondering if
> spectators could pause and relinquish the mounted_lock when another
> node tries to acquire it in EX mode, and resume after re-acquiring the
> lock. Or something similar.
> 
> Thanks,
> Andreas
> 
Hi Andreas,

With the patch, I suppose _new_ spectators can now mount the file
system before recovery and see an inconsistent view. However, what's
the difference between that and an already-mounted spectator, which
can already see an unrecovered file system before the journal is
replayed? IOW, a node that mounted as spectator before the rw node
has done its recovery, or before it has even been fenced, for that
matter. It turns out there is none.

I did a little experiment using only stock kernels:

1. Node 1 mounts rw
2. Node 2 mounts -o spectator
3. Node 1 writes to the file system
4. Node 1 gets fenced while it's writing
5. Node 2 does ls -lR on the file system before the rw node reboots

Guess what happens? Boom! It sees an inconsistent view of the file
system. This is true, in theory, with all kernels today.

[43241.051707] GFS2: fsid=dm-12.s: fatal: invalid metadata block
[43241.051707] GFS2: fsid=dm-12.s:   bh = 360123 (magic number)
[43241.051707] GFS2: fsid=dm-12.s:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 428
[43241.072890] GFS2: fsid=dm-12.s: about to withdraw this file system
[43241.079112] GFS2: fsid=dm-12.s: withdrawn

So there's a larger issue here. We should prevent spectators from
seeing an inconsistent view of the file system, regardless of
the status of the mounted lock.

The concept of cluster quorum was supposed to prevent inconsistent
views because there's always someone around to replay the journal.
But when we introduced spectator mounts, we nullified that,
unintentionally.

In the old pre-spectator way of thinking, when a cluster loses
quorum, dlm locking is temporarily halted, which prevents
inconsistent views until quorum is regained and the journal is
replayed. The replay can even be done by a read-only mounter, and
before the failed node rejoins the cluster.

With spectators, there might not be anyone capable of replaying
the journal, regardless of quorum. You could have a 16 node cluster
with 15 spectators, all of whom instantly see an inconsistent view
of the file system metadata as soon as the writer disappears from
the cluster.
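
To make the 16-node case concrete, here is a toy model in Python. It
is only a sketch, not GFS2 code: the Node class, the mode strings and
the helper names are all made up for illustration. The one rule it
encodes comes from the discussion above: rw and read-only mounters
can replay a journal, spectators cannot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    mode: str  # hypothetical labels: "rw", "ro" or "spectator"


def can_replay_journal(node):
    # Assumption from the thread: spectators never do journal
    # recovery, while rw and read-only mounters can.
    return node.mode in ("rw", "ro")


def journal_replayers(nodes):
    # Names of the nodes that could replay a failed node's journal.
    return [n.name for n in nodes if can_replay_journal(n)]


# One writer plus 15 spectators, as in the example above.
cluster = [Node("node1", "rw")]
cluster += [Node(f"node{i}", "spectator") for i in range(2, 17)]

# The writer disappears; only the spectators are left.
survivors = [n for n in cluster if n.name != "node1"]

print(len(survivors), "survivors, replayers:",
      journal_replayers(survivors))
```

The list of replayers comes back empty even though 15 of 16 nodes are
still up, so quorum by itself says nothing about whether the journal
can be replayed.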

Bottom line: My patch makes things a little better because now at
least the rw node can do recovery and replay the journal, whereas
before it couldn't. It's true that we used to prevent other nodes
from mounting -o spectator at that point, but nodes that are already
mounted as spectator have the same (worse) problem.
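
The effect of the patch on the first-mounter check can be sketched
the same way. This is a simplified model, not the lock_dlm code: it
uses only the NL/PR/EX part of the dlm mode compatibility rules, and
the helpers spectator_holds() and is_first_mounter() are hypothetical
names. It assumes, as Andreas describes, that an unpatched spectator
holds the mounted lock in PR.

```python
# Subset of the dlm lock mode compatibility rules:
# NL is compatible with everything, PR with PR, EX only with NL.
COMPATIBLE = {
    ("NL", "NL"): True, ("NL", "PR"): True, ("NL", "EX"): True,
    ("PR", "NL"): True, ("PR", "PR"): True, ("PR", "EX"): False,
    ("EX", "NL"): True, ("EX", "PR"): False, ("EX", "EX"): False,
}


def can_grant(requested, held_modes):
    # A request is grantable only if it is compatible with every
    # mode already held on the lock by other nodes.
    return all(COMPATIBLE[(requested, held)] for held in held_modes)


def spectator_holds(patched):
    # Hypothetical helper: modes a mounted spectator holds on the
    # mounted lock. Assumed PR before the patch, nothing after it.
    return [] if patched else ["PR"]


def is_first_mounter(held_modes):
    # Simplified rule: a mounter treats itself as the first mounter
    # (and so does journal recovery) only if it gets the lock in EX.
    return can_grant("EX", held_modes)


for patched in (False, True):
    held = spectator_holds(patched)
    print("patched:", patched,
          "rw node is first mounter:", is_first_mounter(held))
```

Without the patch the spectator's PR blocks the EX request, so the rw
node never considers itself first mounter. With the patch nothing is
held and the EX request can be granted.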

We have a larger problem that is not addressed by my patch, which
might be difficult to resolve.

Bob Peterson