From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bob Peterson
Date: Sun, 24 Jun 2018 08:49:52 -0400 (EDT)
Subject: [Cluster-devel] [GFS2 PATCH] GFS2: Skip taking journal recovery locks for spectators
In-Reply-To:
References: <2061107895.44805743.1529618725855.JavaMail.zimbra@redhat.com> <602032757.44805751.1529618745642.JavaMail.zimbra@redhat.com>
Message-ID: <1041963515.45201495.1529844592291.JavaMail.zimbra@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

----- Original Message -----
> Bob,
>
> On 22 June 2018 at 00:05, Bob Peterson wrote:
> > Hi,
> >
> > Before this patch, spectator mounts would try to acquire the dlm's
> > control lock and mounted lock as part of its normal recovery sequence.
> > It's not necessary because spectators don't ever do journal recovery.
> > And if they acquire those locks (at all) it will prevent another
> > first-mounter from acquiring the lock in EX mode, which means it
> > also cannot do journal recovery because it doesn't think it's the
> > first node to mount the file system.
> >
> > This patch checks if the mounter is a spectator, and if so, avoids
> > grabbing the control lock and mounted lock. This allows a secondary
> > mounter who is really the first rw mounter, to do journal recovery:
> > since the spectator doesn't acquire those locks, it can grab them
> > in EX mode, and therefore consider itself to be the first mounter.
>
> if I understand things correctly, spectator mounts were so far holding
> the mounted_lock in PR mode which did prevent other nodes from
> starting recovery. With this change, spectators can now end up reading
> from the filesystem during recovery. That doesn't seem safe,
> especially since recovery won't take the usual glocks, and so there
> would be no synchronization with spectators anymore.
> I'm wondering if
> spectators could pause and relinquish the mounted_lock when another
> node tries to acquire it in EX mode, and resume after re-acquiring the
> lock. Or something similar.
>
> Thanks,
> Andreas
>

Hi Andreas,

With the patch, I suppose _new_ spectators can now mount the file system
before recovery and see an inconsistent view. However, what's the
difference between that and an already-mounted spectator, who can
already see an unrecovered file system before the journal is replayed?
Iow, if it's already mounted as spectator, but before the rw mounting
node has done its recovery? Or even been fenced for that matter.

It turns out, there is none. I did a little experiment using only stock
kernels:

1. Node 1 mounts rw
2. Node 2 mounts -o spectator
3. Node 1 writes to the file system
4. Node 1 gets fenced while it's writing
5. Node 2 does ls -lR on the file system before the rw node reboots

Guess what happens? Boom! It sees an inconsistent view of the file
system. This is true, in theory, with all kernels today.

[43241.051707] GFS2: fsid=dm-12.s: fatal: invalid metadata block
[43241.051707] GFS2: fsid=dm-12.s: bh = 360123 (magic number)
[43241.051707] GFS2: fsid=dm-12.s: function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 428
[43241.072890] GFS2: fsid=dm-12.s: about to withdraw this file system
[43241.079112] GFS2: fsid=dm-12.s: withdrawn

So there's a larger issue here. We should prevent spectators from seeing
an inconsistent view of the file system, regardless of the status of the
mounted lock.

The concept of cluster quorum was supposed to prevent inconsistent views
because there's always someone around to replay the journal. But when we
introduced spectator mounts, we nullified that, unintentionally.

In the old pre-spectator way of thinking, when a cluster loses quorum,
the dlm locking is temporarily halted until quorum is regained, thus
preventing inconsistent views until quorum is regained and the journal
is replayed.
This can even be done by a read-only mounter, and before the node
rejoins the cluster.

With spectators, there might not be anyone capable of replaying the
journal, regardless of quorum. You could have a 16 node cluster with
15 spectators, all of whom instantly see an inconsistent view of the
file system metadata as soon as the writer disappears from the cluster.

Bottom line: My patch makes things a little better because now at least
the rw node can do recovery and replay the journal, whereas before it
couldn't. It's true that we used to prevent other nodes from mounting
-o spectator at that point, but our existing nodes who mount as
spectator already have the same (worse) problem.

We have a larger problem that is not addressed by my patch, which might
be difficult to resolve.

Bob Peterson
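
The lock-mode argument in this thread can be sketched as a toy model.
It is not GFS2 or dlm kernel code: the function names are hypothetical,
the compatibility table is the standard DLM mode matrix, and the
assumption that a mounted spectator holds the mounted lock in PR mode is
taken from Andreas's description above.

```python
# Toy model only; not GFS2 or dlm kernel code.
# Standard DLM lock-mode compatibility: COMPATIBLE[held] is the set of
# modes that can be granted while another node holds `held`.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def can_grant(requested, held_modes):
    """True if `requested` is compatible with every mode already held."""
    return all(requested in COMPATIBLE[held] for held in held_modes)


def rw_node_can_be_first_mounter(spectator_count, spectators_take_lock):
    """Can a newly mounting rw node get the mounted lock in EX mode?

    Assumption (hypothetical model): each mounted spectator either holds
    the mounted lock in PR (behaviour before the patch) or does not hold
    it at all (behaviour with the patch).
    """
    held = ["PR"] * spectator_count if spectators_take_lock else []
    return can_grant("EX", held)


# Before the patch: one spectator holding PR blocks the EX request, so
# the rw node does not consider itself the first mounter.
print(rw_node_can_be_first_mounter(1, spectators_take_lock=True))    # False

# With the patch: 15 spectators hold nothing, so EX can be granted and
# the rw node can do journal recovery.
print(rw_node_can_be_first_mounter(15, spectators_take_lock=False))  # True
```

The model covers only the first-mounter decision. It says nothing about
what spectators read while recovery is running, which is the larger
problem described above.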