From: Bob Peterson <rpeterso@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [GFS2 PATCH v3] gfs2: Fix gfs2_testbit to use clone bitmaps
Date: Fri, 3 Aug 2018 12:35:54 -0400 (EDT) [thread overview]
Message-ID: <821495790.57211167.1533314154544.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <144324987.57209973.1533314072434.JavaMail.zimbra@redhat.com>
Hi,
I sent a previous version of this patch earlier, but Andreas asked
me to add some comments to the code to explain the purpose of the
clone bitmaps. This version adds an extensive new comment to incore.h.
Bob Peterson
---
In the gfs2 block allocator there's a concept of an in-core only
"clone" bitmap that exists after blocks are freed or unlinked.
When a block is freed or unlinked, memory for the clone bitmap
is allocated for the bitmap affected, but only the real bitmap
is updated. The clone bitmap will still reflect its prior
(allocated) state.
When blocks are assigned (free -> used or free -> dinode), it
does not need to allocate a clone bitmap. However, if one exists,
both the real bitmap and the clone bitmap are updated to reflect
the new state of the block. So block assignments are more
"sticky" than frees.
When the bit search algorithms try to find blocks of a certain
state, they use the clone bitmaps (if they exist). That way,
the only blocks to be allocated are never ones that have been
recently freed.
However, the management of the clone bitmaps is not done properly
in the case of function gfs2_testbit.
Function gfs2_testbit is called in three places. Two of those
places, gfs2_alloc_extent and gfs2_unaligned_extlen, should be
using the clone bitmaps, not the "real" bitmaps. Function
gfs2_unaligned_extlen is used by the block reservations scheme
to determine the length of an extent of free blocks. Before this
patch, it wasn't using the clone bitmap, which means soon-to-be-
freed blocks were treated as free blocks for the purposes of
an extent, and that's clearly wrong for the reasons stated above.
This patch adds a new parameter to gfs2_testbit to indicate whether
or not the clone bitmaps should be used (if available).
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
fs/gfs2/incore.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/gfs2/rgrp.c | 45 ++++++++++++++++-------------------
2 files changed, 91 insertions(+), 25 deletions(-)
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index b50908211b69..3e744cc40ced 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -65,6 +65,77 @@ struct gfs2_log_operations {
#define GBF_FULL 1
+/**
+ * Explanation of clone bitmaps (bi_clone):
+ *
+ * In the gfs2 block allocator there's a concept of an in-core only "clone"
+ * bitmap that exists after blocks are freed or unlinked.
+ *
+ * The reason gfs2 goes through all this trouble is to make sure that journals
+ * never contain a transaction that both frees and reassigns the same block.
+ * If they were allowed to do so, journal replay (due to a node failure) would
+ * introduce file system corruption. For example, suppose block 12345 is a
+ * data block for dinode X. Then:
+ *
+ * 1. One process truncates dinode X and frees data block 12345.
+ * The updated bitmap buffer is now in a transaction in the journal, and
+ * reflects block 12345 is "free".
+ * 2. A different process wants to write a new data block for dinode Y. It
+ * searches the bitmap for a free block and finds block 12345, and changes
+ * its bit status from "free" back to "data." Now the bitmap that reflects
+ * block 12345 is a "data" block again, but the new transaction is merged
+ * with the previous via function gfs2_merge_trans().
+ * 3. The machine is fenced or goes down for some reason.
+ *
+ * At this point, the failing system's journal needs to be replayed, but
+ * the journal now reflects corruption: Since the buffer_heads for dinode Y
+ * and the rgrp may have been written back to the media before dinode X,
+ * the inplace dinode X may still have a data pointer to block 12345, but
+ * so does dinode Y.
+ *
+ * The clone bitmaps are meant to ensure that this cannot ever happen because
+ * the bitmap allocations can never find blocks that have been freed since the
+ * last log flush. This is how they accomplish that:
+ *
+ * When a block is freed or unlinked, memory for a "clone bitmap" is allocated
+ * for the bitmap affected and becomes an exact copy of the data in the
+ * state it was in before the free or unlink operation. When blocks (maybe
+ * multiple blocks) are freed or unlinked, only the "real" bitmap is updated
+ * to reflect the new state, but the clone bitmap still reflects its state
+ * prior to the free / unlink.
+ *
+ * When blocks are assigned (free -> used or free -> dinode), it does not need
+ * to allocate a clone bitmap. However, if one exists, both the real bitmap
+ * and the clone bitmap are updated to reflect the new state of the block. So
+ * block assignments are more "sticky" than block frees. In that way, block
+ * allocations are always reflected in both real and clone bitmaps, but frees
+ * are (temporarily) not reflected in the clone bitmaps.
+ *
+ * When the bit search algorithms try to find blocks of a certain state, they
+ * use the clone bitmaps if they exist, or the real bitmaps if there isn't
+ * a clone. That way, the only blocks that may be allocated are never blocks
+ * that have been recently freed or unlinked.
+ *
+ * So the clone bitmap is used to search for free blocks, but the only free
+ * blocks it knows about are blocks that are really free since before the last
+ * log flush. The clone bitmaps are never put in the journal nor written back.
+ *
+ * At the end of a log_flush (lo_after_commit), gfs2_unpin copies the "real"
+ * bitmap buffer data back to the clones (via maybe_release_space) and those
+ * blocks will now appear correctly as "free." The clone bitmaps, however, are
+ * kept around rather than doing a lot of unnecessary kmalloc/kfree operations.
+ *
+ * By the way, this is not a problem in the case of multiple node failures:
+ *
+ * If there are multiple node failures and we have multiple dirty journals,
+ * we can tell which is the latest version of a rgrp bitmap: Once the bitmap
+ * block is written back, it makes its way to the ail2 list, then it builds
+ * a revoke transaction for the bitmap before giving another node the glock
+ * used to modify the bitmap, and the journal is flushed. So if the bitmap
+ * buffer is revoked, we ignore it at journal replay time. If it's not revoked,
+ * we replay it.
+ *
+ */
struct gfs2_bitmap {
struct buffer_head *bi_bh;
char *bi_clone;
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7c5afeba8888..5593c7956d07 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -123,17 +123,27 @@ static inline void gfs2_setbit(const struct gfs2_rbm *rbm, bool do_clone,
/**
* gfs2_testbit - test a bit in the bitmaps
* @rbm: The bit to test
+ * @use_clone: If true, test the clone bitmap, not the official bitmap.
+ *
+ * Some callers, like gfs2_unaligned_extlen, need to test the clone bitmaps,
+ * not the "real" bitmaps, because we can't reserve blocks if they're still
+ * considered metadata in the journal.
*
* Returns: The two bit block state of the requested bit
*/
-static inline u8 gfs2_testbit(const struct gfs2_rbm *rbm)
+static inline u8 gfs2_testbit(const struct gfs2_rbm *rbm, bool use_clone)
{
struct gfs2_bitmap *bi = rbm_bi(rbm);
- const u8 *buffer = bi->bi_bh->b_data + bi->bi_offset;
+ const u8 *buffer;
const u8 *byte;
unsigned int bit;
+ if (use_clone && bi->bi_clone)
+ buffer = bi->bi_clone;
+ else
+ buffer = bi->bi_bh->b_data;
+ buffer += bi->bi_offset;
byte = buffer + (rbm->offset / GFS2_NBBY);
bit = (rbm->offset % GFS2_NBBY) * GFS2_BIT_SIZE;
@@ -322,7 +332,7 @@ static bool gfs2_unaligned_extlen(struct gfs2_rbm *rbm, u32 n_unaligned, u32 *le
u8 res;
for (n = 0; n < n_unaligned; n++) {
- res = gfs2_testbit(rbm);
+ res = gfs2_testbit(rbm, true);
if (res != GFS2_BLKST_FREE)
return true;
(*len)--;
@@ -2146,26 +2156,6 @@ void gfs2_inplace_release(struct gfs2_inode *ip)
gfs2_glock_dq_uninit(&rs->rs_rgd_gh);
}
-/**
- * gfs2_get_block_type - Check a block in a RG is of given type
- * @rgd: the resource group holding the block
- * @block: the block number
- *
- * Returns: The block type (GFS2_BLKST_*)
- */
-
-static unsigned char gfs2_get_block_type(struct gfs2_rgrpd *rgd, u64 block)
-{
- struct gfs2_rbm rbm = { .rgd = rgd, };
- int ret;
-
- ret = gfs2_rbm_from_block(&rbm, block);
- WARN_ON_ONCE(ret != 0);
-
- return gfs2_testbit(&rbm);
-}
-
-
/**
* gfs2_alloc_extent - allocate an extent from a given bitmap
* @rbm: the resource group information
@@ -2190,7 +2180,7 @@ static void gfs2_alloc_extent(const struct gfs2_rbm *rbm, bool dinode,
block++;
while (*n < elen) {
ret = gfs2_rbm_from_block(&pos, block);
- if (ret || gfs2_testbit(&pos) != GFS2_BLKST_FREE)
+ if (ret || gfs2_testbit(&pos, true) != GFS2_BLKST_FREE)
break;
gfs2_trans_add_meta(pos.rgd->rd_gl, rbm_bi(&pos)->bi_bh);
gfs2_setbit(&pos, true, GFS2_BLKST_USED);
@@ -2543,6 +2533,7 @@ int gfs2_check_blk_type(struct gfs2_sbd *sdp, u64 no_addr, unsigned int type)
{
struct gfs2_rgrpd *rgd;
struct gfs2_holder rgd_gh;
+ struct gfs2_rbm rbm;
int error = -EINVAL;
rgd = gfs2_blk2rgrpd(sdp, no_addr, 1);
@@ -2553,7 +2544,11 @@ int gfs2_check_blk_type(struct gfs2_sbd *sdp, u64 no_addr, unsigned int type)
if (error)
goto fail;
- if (gfs2_get_block_type(rgd, no_addr) != type)
+ rbm.rgd = rgd;
+ error = gfs2_rbm_from_block(&rbm, no_addr);
+ WARN_ON_ONCE(error != 0);
+
+ if (gfs2_testbit(&rbm, false) != type)
error = -ESTALE;
gfs2_glock_dq_uninit(&rgd_gh);
parent reply other threads:[~2018-08-03 16:35 UTC|newest]
Thread overview: expand[flat|nested] mbox.gz Atom feed
[parent not found: <144324987.57209973.1533314072434.JavaMail.zimbra@redhat.com>]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=821495790.57211167.1533314154544.JavaMail.zimbra@redhat.com \
--to=rpeterso@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).