public inbox for linux-bcache@vger.kernel.org
From: Coly Li <colyli@suse.de>
To: linux-bcache@vger.kernel.org
Cc: linux-block@vger.kernel.org, Coly Li <colyli@suse.de>,
	stable@vger.kernel.org
Subject: [RFC PATCH v2 03/16] bcache: reload journal key information during journal replay
Date: Sat, 20 Apr 2019 00:04:56 +0800
Message-ID: <20190419160509.66298-4-colyli@suse.de>
In-Reply-To: <20190419160509.66298-1-colyli@suse.de>

When the bcache journal is initialized as a cache set starts running,
c->journal.blocks_free is initialized to 0. Then, during journal replay,
if journal_meta() is called and an empty jset is written to the cache
device, journal_reclaim() is called. If there is a journal bucket
available to reclaim, c->journal.blocks_free is set to the number of
blocks in a journal bucket, which is
c->sb.bucket_size >> c->block_bits.

Most of the time the above process works correctly, except for the
condition when journal space is almost full. "Almost full" means there
is no free journal bucket, but there are still free blocks in the last
available bucket, indexed by ja->cur_idx.

If the system crashes or reboots when journal space is almost full, a
problem arises. During the cache set reload after the reboot,
c->journal.blocks_free is initialized to 0. When the journal replay
process writes to the bcache journal, journal_reclaim() is called to
reclaim an available journal bucket and set c->journal.blocks_free to
c->sb.bucket_size >> c->block_bits. But there is no fully free bucket
for journal_reclaim() to reclaim, so c->journal.blocks_free stays 0.
If the first journal entry processed by journal_replay() causes a btree
split and requires writing to journal space via journal_meta(),
journal_meta() goes into an infinite loop trying to reclaim a journal
bucket, and blocks the whole cache set from running.
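
To illustrate the hang described above, here is a minimal user-space
sketch. It is not the kernel code: struct toy_journal, toy_reclaim() and
toy_can_write() are hypothetical names invented for this illustration,
and the retry count is capped so that the sketch terminates, whereas the
loop described above never exits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of the journal state discussed above. */
struct toy_journal {
	unsigned int blocks_free;	/* free blocks in the current bucket */
	unsigned int blocks_per_bucket;	/* bucket_size >> block_bits */
	unsigned int free_buckets;	/* fully free buckets left to reclaim */
};

/* Model of the reclaim step: only a fully free bucket refills blocks_free. */
static void toy_reclaim(struct toy_journal *j)
{
	assert(j != NULL);

	if (j->blocks_free)
		return;

	if (j->free_buckets) {
		j->free_buckets--;
		j->blocks_free = j->blocks_per_bucket;
	}
}

/*
 * Return 1 if a write of `need` blocks can proceed within `max_tries`
 * reclaim attempts, 0 if the journal stays stuck.
 */
static int toy_can_write(struct toy_journal *j, unsigned int need,
			 unsigned int max_tries)
{
	while (max_tries--) {
		if (j->blocks_free >= need)
			return 1;
		toy_reclaim(j);
	}
	return j->blocks_free >= need;
}
```

With blocks_free == 0 and free_buckets == 0 the model never makes
progress, which corresponds to the almost-full case after a reload. If
blocks_free starts at its pre-crash value instead, the same write
succeeds without any reclaim.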

This situation can be resolved by doing the following before journal
replay starts:
- Recover the value c->journal.blocks_free had in the last run, and use
  it as the initial value of the current c->journal.blocks_free.
- Recover the value ja->cur_idx had in the last run, and use it to set
  the KEY_PTR of the current c->journal.key as the initial value.

After c->journal.blocks_free and c->journal.key are recovered, when
journal space is almost full and the cache set is reloaded, the meta
journal entry from journal replay can be written into the free blocks of
the last available journal bucket. The old journal entries can then be
replayed and reclaimed for further journaling requests.
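
The recovery of blocks_free can be sketched in user space as follows.
This is a simplified model under stated assumptions, not the patch
itself: struct toy_jset, TOY_MAGIC and toy_blocks_free() are
hypothetical, and each jset is reduced to a magic value plus a length in
blocks. The scan stops at the first entry whose magic does not match,
and whatever remains of the bucket is counted as free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAGIC 0x6a6f75726e616cULL	/* arbitrary value for this sketch */

/* Hypothetical stand-in for an on-disk jset: magic plus size in blocks. */
struct toy_jset {
	uint64_t     magic;
	unsigned int blocks;
};

/*
 * Walk the jsets found in the current journal bucket and return how
 * many blocks are still free, or -1 if a jset would overrun the bucket
 * (treated here as corruption).
 */
static int toy_blocks_free(const struct toy_jset *sets, size_t n,
			   unsigned int blocks_per_bucket)
{
	unsigned int used = 0;
	size_t i;

	assert(sets != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* First non-matching magic marks the end of valid jsets. */
		if (sets[i].magic != TOY_MAGIC)
			break;

		if (sets[i].blocks > blocks_per_bucket - used)
			return -1;

		used += sets[i].blocks;
	}

	return (int)(blocks_per_bucket - used);
}
```

For example, a 128-block bucket holding valid jsets of 2 and 3 blocks
followed by stale data leaves 123 blocks free for the meta journal entry
written during replay.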

This patch adds bch_journal_key_reload() to recover the journal
blocks_free and key ptr values for the above purpose.
bch_journal_key_reload() is called in bch_journal_read() before the
journal is replayed by bch_journal_replay().

Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 87 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 5180bed911ef..a6deb16c15c8 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -143,6 +143,89 @@ reread:		left = ca->sb.bucket_size - offset;
 	return ret;
 }
 
+static int bch_journal_key_reload(struct cache_set *c)
+{
+	struct cache *ca;
+	unsigned int iter, n = 0;
+	struct bkey *k = &c->journal.key;
+	int ret = 0;
+
+	for_each_cache(ca, c, iter) {
+		struct journal_device *ja = &ca->journal;
+		struct bio *bio = &ja->bio;
+		struct jset *j, *data = c->journal.w[0].data;
+		struct closure cl;
+		unsigned int len, left;
+		unsigned int offset = 0, used_blocks = 0;
+		sector_t bucket = bucket_to_sector(c, ca->sb.d[ja->cur_idx]);
+
+		closure_init_stack(&cl);
+
+		while (offset < ca->sb.bucket_size) {
+reread:			left = ca->sb.bucket_size - offset;
+			len = min_t(unsigned int,
+				    left, PAGE_SECTORS << JSET_BITS);
+
+			bio_reset(bio);
+			bio->bi_iter.bi_sector = bucket + offset;
+			bio_set_dev(bio, ca->bdev);
+			bio->bi_iter.bi_size = len << 9;
+
+			bio->bi_end_io = journal_read_endio;
+			bio->bi_private = &cl;
+			bio_set_op_attrs(bio, REQ_OP_READ, 0);
+			bch_bio_map(bio, data);
+
+			closure_bio_submit(c, bio, &cl);
+			closure_sync(&cl);
+
+			j = data;
+			while (len) {
+				size_t blocks, bytes = set_bytes(j);
+
+				if (j->magic != jset_magic(&ca->sb))
+					goto out;
+
+				if (bytes > left << 9 ||
+				    bytes > PAGE_SIZE << JSET_BITS) {
+					pr_err("jset may be corrupted: too big");
+					ret = -EIO;
+					goto err;
+				}
+
+				if (bytes > len << 9)
+					goto reread;
+
+				if (j->csum != csum_set(j)) {
+					pr_err("jset may be corrupted: bad csum");
+					ret = -EIO;
+					goto err;
+				}
+
+				blocks = set_blocks(j, block_bytes(c));
+				used_blocks += blocks;
+
+				offset	+= blocks * ca->sb.block_size;
+				len	-= blocks * ca->sb.block_size;
+				j = ((void *) j) + blocks * block_bytes(ca);
+			}
+		}
+out:
+		c->journal.blocks_free =
+			(c->sb.bucket_size >> c->block_bits) -
+			used_blocks;
+
+		k->ptr[n++] = MAKE_PTR(0, bucket, ca->sb.nr_this_dev);
+	}
+
+	BUG_ON(n == 0);
+	bkey_init(k);
+	SET_KEY_PTRS(k, n);
+
+err:
+	return ret;
+}
+
 int bch_journal_read(struct cache_set *c, struct list_head *list)
 {
 #define read_bucket(b)							\
@@ -268,6 +351,10 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
 					    struct journal_replay,
 					    list)->j.seq;
 
+	/* Initial value of c->journal.blocks_free should be 0 */
+	BUG_ON(c->journal.blocks_free != 0);
+	ret = bch_journal_key_reload(c);
+
 	return ret;
 #undef read_bucket
 }
-- 
2.16.4
