From mboxrd@z Thu Jan  1 00:00:00 1970
From: Steven Whitehouse <swhiteho@redhat.com>
Date: Thu, 22 Nov 2018 10:16:25 +0000
Subject: [Cluster-devel] [PATCH 0/2] gfs2: improvements to recovery and
 withdraw process (v2)
In-Reply-To: <20181121185257.11206-1-rpeterso@redhat.com>
References: <20181121185257.11206-1-rpeterso@redhat.com>
Message-ID: <896c2088-633d-3cad-844e-03445e072bb3@redhat.com>
List-Id: <cluster-devel.redhat.com>
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi,


On 21/11/18 18:52, Bob Peterson wrote:
> Hi,
>
> This is a second draft of a two-patch set to fix some of the nasty
> journal recovery problems I've found lately.
>
> The original post from 08 November had horribly bad and inaccurate
> comments, as Steve Whitehouse and Andreas Gruenbacher pointed out.
> This version is hopefully better and more accurately describes what
> the patches do and how they work. Also, I fixed a superblock flag
> that was improperly declared as a glock flag.
>
> Other than the renamed and re-valued superblock flag, the code
> remains unchanged from the previous version. It probably needs a bit
> more testing, but it seems to work well.
> ---
> The problems have to do with file system corruption caused when recovery
> replays a journal after the resource group blocks have been unlocked
> by the recovery process. In other words, when no cluster node takes
> responsibility to replay the journal of a withdrawing node, then it
> gets replayed later on, after the block contents have been changed.
>
> The first patch prevents gfs2 from attempting recovery if the file system
> is withdrawn or has journal IO errors. Trying to recover your own journal
> from either of these unstable conditions is dangerous and likely to corrupt
> the file system.
>
> The second patch is more extensive. When a node withdraws from a file system
> it signals all other nodes with the file system mounted to perform recovery
> on its journal, since it cannot safely recover its own journal. This is
> accomplished by a new non-disk callback glop used exclusively by the
> "live" glock, which sets up an LVB in the glock to indicate which
> journal(s) need to be recovered.
>
> Regards,
>
> Bob Peterson
> ---
> Bob Peterson (2):
>    gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
>    gfs2: initiate journal recovery as soon as a node withdraws
>
>   fs/gfs2/glock.c    |  5 ++-
>   fs/gfs2/glops.c    | 47 +++++++++++++++++++++++
>   fs/gfs2/incore.h   |  3 ++
>   fs/gfs2/lock_dlm.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++
>   fs/gfs2/log.c      | 62 ++++++++++++++++--------------
>   fs/gfs2/super.c    |  5 ++-
>   fs/gfs2/super.h    |  1 +
>   fs/gfs2/util.c     | 84 ++++++++++++++++++++++++++++++++++++++++
>   fs/gfs2/util.h     | 13 +++++++
>   9 files changed, 282 insertions(+), 33 deletions(-)
>
Yes, that looks a bit cleaner now,

Steve.