From: Bob Peterson <rpeterso@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [GFS2 v8 PATCH 03/22] gfs2: Rework how rgrp buffer_heads are managed
Date: Mon, 9 Dec 2019 09:36:41 -0600 [thread overview]
Message-ID: <20191209153700.700208-4-rpeterso@redhat.com> (raw)
In-Reply-To: <20191209153700.700208-1-rpeterso@redhat.com>
Before this patch, the rgrp code had a serious problem related to
how it managed buffer_heads for resource groups. The problem caused
file system corruption, especially in cases of journal replay.
When an rgrp glock was demoted to transfer ownership to a
different cluster node, do_xmote() first called rgrp_go_sync and
then rgrp_go_inval, as expected. rgrp_go_sync called
gfs2_rgrp_brelse(), which dropped the buffer_head reference count.
In most cases the reference count went to zero, which is correct.
However, there were other places where the buffers were handled
differently.
After rgrp_go_sync, do_xmote called rgrp_go_inval, which called
gfs2_rgrp_brelse a second time; rgrp_go_inval's call to
truncate_inode_pages_range would then get rid of the pages in
memory, but only if the reference count dropped to 0.
Unfortunately, gfs2_rgrp_brelse was setting bi->bi_bh = NULL.
So when rgrp_go_sync called gfs2_rgrp_brelse, it lost the pointer
to the buffer_heads in cases where the reference count was still 1.
Therefore, when rgrp_go_inval called gfs2_rgrp_brelse a second time,
it failed the check for "if (bi->bi_bh)" and thus failed to call
brelse a second time. Because of that, the reference count on those
buffers sometimes failed to drop from 1 to 0. And that caused
function truncate_inode_pages_range to keep the pages in page cache
rather than freeing them.
The next time the rgrp glock was acquired, the metadata read of
the rgrp buffers re-used the pages in memory, which were now
wrong because they had likely been modified by the other node that
acquired the glock in EX (which is why we demoted the glock).
This re-use of the page cache caused corruption because changes
made by the other nodes were never seen, so the bitmaps were
inaccurate.
For some reason, the problem became most apparent when journal
replay forced the replay of rgrps in memory, which caused newer
rgrp data to be overwritten by the older in-core pages.
A big part of the problem was that the rgrp buffers were released
in multiple places: the go_unlock function would release them when
the glock was released rather than when it was demoted, which is
clearly wrong because our intent was to cache them until the glock
is demoted from SH or EX.
This patch attempts to clean up the mess and make one consistent
and centralized mechanism for managing the rgrp buffer_heads by
implementing several changes:
1. It eliminates the call to gfs2_rgrp_brelse() from rgrp_go_sync.
We don't want to release the buffers or zero the pointers when
syncing, for the reasons stated above. It only makes sense to
release them when the glock is actually invalidated (go_inval),
and when we do, we set the bh pointers to NULL.
2. The go_unlock function (which was only used for rgrps) is
eliminated, as we've talked about doing many times before.
The go_unlock function was called too early in the glock dq
process, and should not happen until the glock is invalidated.
3. It also eliminates the call to rgrp_brelse in gfs2_clear_rgrpd.
That will now happen automatically when the rgrp glocks are
demoted, and shouldn't happen any sooner or later than that.
Instead, function gfs2_clear_rgrpd has been modified to demote
the rgrp glocks, and therefore, free those pages, before the
remaining glocks are culled by gfs2_gl_hash_clear. This
prevents the gl_object from hanging around when the glocks are
culled.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
fs/gfs2/glock.c | 7 ------
fs/gfs2/glops.c | 9 +-------
fs/gfs2/incore.h | 1 -
fs/gfs2/rgrp.c | 60 +++++++++++++++++++++++++++++++-----------------
fs/gfs2/rgrp.h | 1 -
5 files changed, 40 insertions(+), 38 deletions(-)
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index b7123de7c180..46c50090c3ca 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1252,13 +1252,6 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
list_del_init(&gh->gh_list);
clear_bit(HIF_HOLDER, &gh->gh_iflags);
if (find_first_holder(gl) == NULL) {
- if (glops->go_unlock) {
- GLOCK_BUG_ON(gl, test_and_set_bit(GLF_LOCK, &gl->gl_flags));
- spin_unlock(&gl->gl_lockref.lock);
- glops->go_unlock(gh);
- spin_lock(&gl->gl_lockref.lock);
- clear_bit(GLF_LOCK, &gl->gl_flags);
- }
if (list_empty(&gl->gl_holders) &&
!test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
!test_bit(GLF_DEMOTE, &gl->gl_flags))
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 4ede1f18de85..dec5e245b991 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -144,15 +144,9 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
{
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
struct address_space *mapping = &sdp->sd_aspace;
- struct gfs2_rgrpd *rgd;
+ struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
int error;
- spin_lock(&gl->gl_lockref.lock);
- rgd = gl->gl_object;
- if (rgd)
- gfs2_rgrp_brelse(rgd);
- spin_unlock(&gl->gl_lockref.lock);
-
if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
return;
GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);
@@ -600,7 +594,6 @@ const struct gfs2_glock_operations gfs2_rgrp_glops = {
.go_sync = rgrp_go_sync,
.go_inval = rgrp_go_inval,
.go_lock = gfs2_rgrp_go_lock,
- .go_unlock = gfs2_rgrp_go_unlock,
.go_dump = gfs2_rgrp_dump,
.go_type = LM_TYPE_RGRP,
.go_flags = GLOF_LVB,
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 6042babb7324..9a50235567f4 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -239,7 +239,6 @@ struct gfs2_glock_operations {
void (*go_inval) (struct gfs2_glock *gl, int flags);
int (*go_demote_ok) (const struct gfs2_glock *gl);
int (*go_lock) (struct gfs2_holder *gh);
- void (*go_unlock) (struct gfs2_holder *gh);
void (*go_dump)(struct seq_file *seq, struct gfs2_glock *gl,
const char *fs_id_buf);
void (*go_callback)(struct gfs2_glock *gl, bool remote);
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 2466bb44a23c..10d3397ed3cd 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -730,8 +730,12 @@ void gfs2_clear_rgrpd(struct gfs2_sbd *sdp)
rb_erase(n, &sdp->sd_rindex_tree);
if (gl) {
- glock_clear_object(gl, rgd);
+ if (gl->gl_state != LM_ST_UNLOCKED) {
+ gfs2_glock_cb(gl, LM_ST_UNLOCKED);
+ flush_delayed_work(&gl->gl_work);
+ }
gfs2_rgrp_brelse(rgd);
+ glock_clear_object(gl, rgd);
gfs2_glock_put(gl);
}
@@ -1283,6 +1287,10 @@ int gfs2_rgrp_go_lock(struct gfs2_holder *gh)
* gfs2_rgrp_brelse - Release RG bitmaps read in with gfs2_rgrp_bh_get()
* @rgd: The resource group
*
+ * Under normal circumstances, b_count should be 1. However, if we withdraw
+ * there might be buffer_heads stranded that were pinned but never unpinned
+ * because an io error may have resulted in withdraw, which avoids the bulk
+ * of function gfs2_log_flush. Therefore, we need to release all refs.
*/
void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd)
@@ -1291,28 +1299,38 @@ void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd)
for (x = 0; x < length; x++) {
struct gfs2_bitmap *bi = rgd->rd_bits + x;
- if (bi->bi_bh) {
- brelse(bi->bi_bh);
- bi->bi_bh = NULL;
+ if (bi->bi_bh == NULL)
+ continue;
+
+ smp_mb();
+ if (!gfs2_withdrawn(rgd->rd_sbd) &&
+ (atomic_read(&bi->bi_bh->b_count) != 1)) {
+ struct gfs2_sbd *sdp = rgd->rd_gl->gl_name.ln_sbd;
+ struct gfs2_bufdata *bd = bi->bi_bh->b_private;
+
+ fs_err(sdp, "blk:0x%llx b_count=%d pinned:%d "
+ "locked:%d dirty:%d up2dt:%d bd?%d\n",
+ (unsigned long long)bi->bi_bh->b_blocknr,
+ atomic_read(&bi->bi_bh->b_count),
+ buffer_pinned(bi->bi_bh),
+ buffer_locked(bi->bi_bh),
+ buffer_dirty(bi->bi_bh),
+ buffer_uptodate(bi->bi_bh), bd ? 1 : 0);
+ if (bd)
+ fs_err(sdp, "bd_blkno:0x%llx bd_tr:%p bd_list:%d "
+ "st_list:%d gl_list:%d\n",
+ (unsigned long long)bd->bd_blkno,
+ bd->bd_tr,
+ list_empty(&bd->bd_list)?0:1,
+ list_empty(&bd->bd_ail_st_list)?0:1,
+ list_empty(&bd->bd_ail_gl_list)?0:1);
+ gfs2_assert_withdraw(rgd->rd_sbd,
+ (atomic_read(&bi->bi_bh->b_count) == 1));
}
+ brelse(bi->bi_bh);
+ bi->bi_bh = NULL;
}
-
-}
-
-/**
- * gfs2_rgrp_go_unlock - Unlock a rgrp glock
- * @gh: The glock holder for the resource group
- *
- */
-
-void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
-{
- struct gfs2_rgrpd *rgd = gh->gh_gl->gl_object;
- int demote_requested = test_bit(GLF_DEMOTE, &gh->gh_gl->gl_flags) |
- test_bit(GLF_PENDING_DEMOTE, &gh->gh_gl->gl_flags);
-
- if (rgd && demote_requested)
- gfs2_rgrp_brelse(rgd);
+ smp_mb();
}
int gfs2_rgrp_send_discards(struct gfs2_sbd *sdp, u64 offset,
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index c14a673ae36f..a584f3096418 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -33,7 +33,6 @@ extern int gfs2_rindex_update(struct gfs2_sbd *sdp);
extern void gfs2_free_clones(struct gfs2_rgrpd *rgd);
extern int gfs2_rgrp_go_lock(struct gfs2_holder *gh);
extern void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd);
-extern void gfs2_rgrp_go_unlock(struct gfs2_holder *gh);
extern struct gfs2_alloc *gfs2_alloc_get(struct gfs2_inode *ip);
--
2.23.0