From mboxrd@z Thu Jan  1 00:00:00 1970
From: Steven Whitehouse
Date: Thu, 22 Nov 2018 10:16:25 +0000
Subject: [Cluster-devel] [PATCH 0/2] gfs2: improvements to recovery and withdraw process (v2)
In-Reply-To: <20181121185257.11206-1-rpeterso@redhat.com>
References: <20181121185257.11206-1-rpeterso@redhat.com>
Message-ID: <896c2088-633d-3cad-844e-03445e072bb3@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi,

On 21/11/18 18:52, Bob Peterson wrote:
> Hi,
>
> This is a second draft of a two-patch set to fix some of the nasty
> journal recovery problems I've found lately.
>
> The original post from 08 November had horribly bad and inaccurate
> comments, as Steve Whitehouse and Andreas Gruenbacher pointed out.
> This version is hopefully better and more accurately describes what
> the patches do and how they work. Also, I fixed a superblock flag
> that was improperly declared as a glock flag.
>
> Other than the renamed and re-valued superblock flag, the code
> remains unchanged from the previous version. It probably needs a bit
> more testing, but it seems to work well.
> ---
> The problems have to do with file system corruption caused when recovery
> replays a journal after the resource group blocks have been unlocked
> by the recovery process. In other words, when no cluster node takes
> responsibility to replay the journal of a withdrawing node, then it
> gets replayed later on, after the block contents have been changed.
>
> The first patch prevents gfs2 from attempting recovery if the file system
> is withdrawn or has journal IO errors. Trying to recover your own journal
> from either of these unstable conditions is dangerous and likely to corrupt
> the file system.
>
> The second patch is more extensive. When a node withdraws from a file system
> it signals all other nodes with the file system mounted to perform recovery
> on its journal, since it cannot safely recover its own journal. This is
> accomplished by a new non-disk callback glop used exclusively by the
> "live" glock, which sets up an lvb in the glock to indicate which
> journal(s) need to be recovered.
>
> Regards,
>
> Bob Peterson
> ---
> Bob Peterson (2):
>   gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
>   gfs2: initiate journal recovery as soon as a node withdraws
>
>  fs/gfs2/glock.c    |  5 ++-
>  fs/gfs2/glops.c    | 47 +++++++++++++++++++++++
>  fs/gfs2/incore.h   |  3 ++
>  fs/gfs2/lock_dlm.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++
>  fs/gfs2/log.c      | 62 ++++++++++++++++--------------
>  fs/gfs2/super.c    |  5 ++-
>  fs/gfs2/super.h    |  1 +
>  fs/gfs2/util.c     | 84 ++++++++++++++++++++++++++++++++++++++++
>  fs/gfs2/util.h     | 13 +++++++
>  9 files changed, 282 insertions(+), 33 deletions(-)
>

Yes, that looks a bit cleaner now,

Steve.