cluster-devel.redhat.com archive mirror
* [Cluster-devel] [PATCH 0/2] gfs2: improvements to recovery and withdraw process (v2)
@ 2018-11-21 18:52 Bob Peterson
  2018-11-21 18:52 ` [Cluster-devel] [PATCH 1/2] gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn Bob Peterson
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Bob Peterson @ 2018-11-21 18:52 UTC
  To: cluster-devel.redhat.com

Hi,

This is a second draft of a two-patch set to fix some of the nasty
journal recovery problems I've found lately.

The original post from 08 November had horribly bad and inaccurate
comments, as Steve Whitehouse and Andreas Gruenbacher pointed out.
This version is hopefully better and more accurately describes what
the patches do and how they work. Also, I fixed a superblock flag
that was improperly declared as a glock flag.

Other than the renamed and re-valued superblock flag, the code
remains unchanged from the previous version. It probably needs a bit
more testing, but it seems to work well.
---
The problems have to do with file system corruption caused when recovery
replays a journal after the resource group blocks have been unlocked
by the recovery process. In other words, when no cluster node takes
responsibility to replay the journal of a withdrawing node, then it
gets replayed later on, after the block contents have been changed.

The first patch prevents gfs2 from attempting recovery if the file system
is withdrawn or has journal IO errors. Trying to recover your own journal
from either of these unstable conditions is dangerous and likely to corrupt
the file system.
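
As a rough illustration of the guard described above (not the code from
the patch), the check could look like the sketch below. Every name here
(demo_sb, DEMO_SB_WITHDRAWN, DEMO_SB_JOURNAL_IOERR, demo_try_recover) is
hypothetical; the real gfs2 flags and recovery entry points are named and
structured differently.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flag bits, standing in for the real superblock flags. */
enum demo_sb_flag {
	DEMO_SB_WITHDRAWN     = 1 << 0,	/* file system has withdrawn */
	DEMO_SB_JOURNAL_IOERR = 1 << 1,	/* I/O error seen on the journal */
};

/* Hypothetical, stripped-down stand-in for the in-core superblock. */
struct demo_sb {
	unsigned int flags;
};

/* Recovery is only safe when neither unstable condition is present. */
static bool demo_may_recover(const struct demo_sb *sb)
{
	return (sb->flags & (DEMO_SB_WITHDRAWN | DEMO_SB_JOURNAL_IOERR)) == 0;
}

/* Returns 0 if replay may proceed, -1 if the attempt is ignored. */
static int demo_try_recover(const struct demo_sb *sb, unsigned int jid)
{
	if (!demo_may_recover(sb)) {
		fprintf(stderr, "jid=%u: recovery ignored, node is "
			"withdrawn or has journal I/O errors\n", jid);
		return -1;
	}
	/* A real implementation would replay journal 'jid' here. */
	return 0;
}
```

The point of the sketch is only that the check happens before any journal
block is read or replayed, so an unstable node never writes stale journal
contents back to disk.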

The second patch is more extensive. When a node withdraws from a file system
it signals all other nodes with the file system mounted to perform recovery
on its journal, since it cannot safely recover its own journal. This is
accomplished by a new non-disk callback glop used exclusively by the
"live" glock, which sets up an lvb in the glock to indicate which
journal(s) need to be recovered.
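
To make the lvb idea concrete, here is a minimal sketch that assumes the
lvb is treated as a bitmap with one bit per journal id. That encoding is
an assumption for illustration only, as are the 32-byte size and all of
the demo_* names; the patch itself may lay out the lvb differently.

```c
#include <stdbool.h>
#include <string.h>

#define DEMO_LVB_SIZE 32			/* hypothetical lvb size, in bytes */
#define DEMO_MAX_JID  (DEMO_LVB_SIZE * 8)	/* one bit per journal id */

/* Hypothetical lvb layout: a plain bitmap indexed by journal id. */
struct demo_lvb {
	unsigned char bits[DEMO_LVB_SIZE];
};

static void demo_lvb_clear(struct demo_lvb *lvb)
{
	memset(lvb->bits, 0, sizeof(lvb->bits));
}

/* Withdrawing node: flag its own journal as needing recovery. */
static bool demo_lvb_request_recovery(struct demo_lvb *lvb, unsigned int jid)
{
	if (jid >= DEMO_MAX_JID)
		return false;
	lvb->bits[jid / 8] |= (unsigned char)(1u << (jid % 8));
	return true;
}

/* Other nodes: check whether a given journal has been flagged. */
static bool demo_lvb_needs_recovery(const struct demo_lvb *lvb,
				    unsigned int jid)
{
	if (jid >= DEMO_MAX_JID)
		return false;
	return ((lvb->bits[jid / 8] >> (jid % 8)) & 1u) != 0;
}

/* Recovering node: clear the flag once the journal has been replayed. */
static void demo_lvb_recovery_done(struct demo_lvb *lvb, unsigned int jid)
{
	if (jid < DEMO_MAX_JID)
		lvb->bits[jid / 8] &= (unsigned char)~(1u << (jid % 8));
}
```

A bitmap lets several withdrawing nodes flag their journals in the same
lvb without overwriting each other's requests, which is why it was chosen
for the sketch.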

Regards,

Bob Peterson
---
Bob Peterson (2):
  gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
  gfs2: initiate journal recovery as soon as a node withdraws

 fs/gfs2/glock.c    |  5 ++-
 fs/gfs2/glops.c    | 47 +++++++++++++++++++++++
 fs/gfs2/incore.h   |  3 ++
 fs/gfs2/lock_dlm.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/gfs2/log.c      | 62 ++++++++++++++++--------------
 fs/gfs2/super.c    |  5 ++-
 fs/gfs2/super.h    |  1 +
 fs/gfs2/util.c     | 84 ++++++++++++++++++++++++++++++++++++++++
 fs/gfs2/util.h     | 13 +++++++
 9 files changed, 282 insertions(+), 33 deletions(-)

-- 
2.19.1



* [Cluster-devel] [PATCH 0/2] gfs2: improvements to recovery and withdraw process
@ 2018-11-08 20:25 Bob Peterson
  2018-11-08 20:25 ` [Cluster-devel] [PATCH 2/2] gfs2: initiate journal recovery as soon as a node withdraws Bob Peterson
  0 siblings, 1 reply; 6+ messages in thread
From: Bob Peterson @ 2018-11-08 20:25 UTC
  To: cluster-devel.redhat.com

Hi,

This is a first draft of a two-patch set to fix some of the nasty
journal recovery problems I've found lately.

The problems have to do with file system corruption caused when recovery
replays a journal after the resource group blocks have been unlocked
by the recovery process. In other words, when no cluster node takes
responsibility to replay the journal of a withdrawing node, then it
gets replayed later on, after the block contents have been changed.

The first patch prevents gfs2 from attempting recovery if the file system
is withdrawn or has journal IO errors. Trying to recover your own journal
from either of these unstable conditions is dangerous and likely to corrupt
the file system.

The second patch is more extensive. When a node withdraws from a file system
it first empties out all outstanding pages in the ail lists, then it
signals all other nodes with the file system mounted to perform recovery
on its journal since it cannot safely recover its own journal. This is
accomplished by a new non-disk callback glop used exclusively by the
"live" glock, which sets up an lvb in the glock to indicate which
journal(s) need to be replayed. This system makes it necessary to prevent
recursion, since the journal operations themselves (i.e. the ones that
empty out the ail list on withdraw) can also withdraw. Thus, the withdraw
system is now separated into "journal" and "non-journal" withdraws.
Also, the "withdraw" flag is now replaced by a superblock bit because
once the file system withdraws in this way, it needs to remember that from
that point on.
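
The recursion problem described above can be sketched as follows. This is
a simplified model under stated assumptions, not the patch's code: the
demo_* names, the sticky "withdrawn" boolean and the flush_fails test knob
are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical split of withdraws into the two kinds described above. */
enum demo_withdraw_kind {
	DEMO_WITHDRAW_NON_JOURNAL,	/* raised outside the journal code */
	DEMO_WITHDRAW_JOURNAL,		/* raised by a journal operation */
};

struct demo_fs {
	bool withdrawn;		/* sticky: never cleared once set */
	bool flush_fails;	/* test knob: make the ail flush hit an error */
	int flush_calls;	/* how many times the ail flush ran */
};

static void demo_withdraw(struct demo_fs *fs, enum demo_withdraw_kind kind);

/* Stand-in for emptying the ail lists; it may itself need to withdraw. */
static void demo_flush_ail(struct demo_fs *fs)
{
	fs->flush_calls++;
	if (fs->flush_fails)
		demo_withdraw(fs, DEMO_WITHDRAW_JOURNAL);
}

static void demo_withdraw(struct demo_fs *fs, enum demo_withdraw_kind kind)
{
	if (fs->withdrawn)
		return;		/* already withdrawn, nothing more to do */

	/*
	 * Only a non-journal withdraw flushes the ail lists. A withdraw
	 * raised by the journal code must not flush again, otherwise the
	 * flush and the withdraw would call each other without end.
	 */
	if (kind == DEMO_WITHDRAW_NON_JOURNAL)
		demo_flush_ail(fs);

	fs->withdrawn = true;
}
```

In this model a failing flush triggers a nested journal withdraw, which
records the withdrawn state and returns without flushing a second time.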

Regards,

Bob Peterson
---
Bob Peterson (2):
  gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
  gfs2: initiate journal recovery as soon as a node withdraws

 fs/gfs2/glock.c    |  5 ++-
 fs/gfs2/glops.c    | 47 +++++++++++++++++++++++
 fs/gfs2/incore.h   |  3 ++
 fs/gfs2/lock_dlm.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/gfs2/log.c      | 62 ++++++++++++++++--------------
 fs/gfs2/super.c    |  5 ++-
 fs/gfs2/super.h    |  1 +
 fs/gfs2/util.c     | 84 ++++++++++++++++++++++++++++++++++++++++
 fs/gfs2/util.h     | 13 +++++++
 9 files changed, 282 insertions(+), 33 deletions(-)

-- 
2.17.2




end of thread, other threads:[~2018-11-29 20:56 UTC | newest]

Thread overview: 6+ messages
2018-11-21 18:52 [Cluster-devel] [PATCH 0/2] gfs2: improvements to recovery and withdraw process (v2) Bob Peterson
2018-11-21 18:52 ` [Cluster-devel] [PATCH 1/2] gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn Bob Peterson
2018-11-29 20:56   ` Andreas Gruenbacher
2018-11-21 18:52 ` [Cluster-devel] [PATCH 2/2] gfs2: initiate journal recovery as soon as a node withdraws Bob Peterson
2018-11-22 10:16 ` [Cluster-devel] [PATCH 0/2] gfs2: improvements to recovery and withdraw process (v2) Steven Whitehouse
  -- strict thread matches above, loose matches on Subject: below --
2018-11-08 20:25 [Cluster-devel] [PATCH 0/2] gfs2: improvements to recovery and withdraw process Bob Peterson
2018-11-08 20:25 ` [Cluster-devel] [PATCH 2/2] gfs2: initiate journal recovery as soon as a node withdraws Bob Peterson
