public inbox for linux-bcache@vger.kernel.org
* [RFC PATCH v2 02/16] bcache: never set 0 to KEY_PTRS of journal key in journal_reclaim()
       [not found] <20190419160509.66298-1-colyli@suse.de>
@ 2019-04-19 16:04 ` Coly Li
  2019-04-23  6:50   ` Hannes Reinecke
  2019-04-19 16:04 ` [RFC PATCH v2 03/16] bcache: reload journal key information during journal replay Coly Li
  2019-04-19 16:05 ` [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write() Coly Li
  2 siblings, 1 reply; 9+ messages in thread
From: Coly Li @ 2019-04-19 16:04 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, stable

In journal_reclaim(), ja->cur_idx of each cache is updated to reclaim
available journal buckets. The variable 'int n' counts how many caches
are successfully reclaimed, then n is set as the KEY_PTRS value of
c->journal.key by SET_KEY_PTRS(). Later in journal_write_unlocked(), a
for_each_cache() loop writes the jset data onto each cache.

The problem is, if all journal buckets on each cache are full, the
following code in journal_reclaim(),

529 for_each_cache(ca, c, iter) {
530         struct journal_device *ja = &ca->journal;
531         unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
532
533         /* No space available on this device */
534         if (next == ja->discard_idx)
535                 continue;
536
537         ja->cur_idx = next;
538         k->ptr[n++] = MAKE_PTR(0,
539                           bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
540                           ca->sb.nr_this_dev);
541 }
542
543 bkey_init(k);
544 SET_KEY_PTRS(k, n);

If there is no available bucket to reclaim, the if() condition at line
534 will always be true, and n remains 0. Then at line 544,
SET_KEY_PTRS() sets the KEY_PTRS field of c->journal.key to 0.

Setting the KEY_PTRS field of c->journal.key to 0 is wrong, because in
journal_write_unlocked() the journal data is written in the following
loop,

649	for (i = 0; i < KEY_PTRS(k); i++) {
650-671		submit journal data to cache device
672	}

If the KEY_PTRS field is set to 0 in journal_reclaim(), the journal
data won't be written to the cache device here. If the system crashes
or reboots before the bkeys of the lost journal entries are written
into btree nodes, data corruption will be reported during bcache
reload after rebooting the system.

In practice there is only one cache in a cache set, so there is no need
to set the KEY_PTRS field in journal_reclaim() at all. But in order to
keep the for_each_cache() logic consistent for now, this patch fixes
the above problem by not setting KEY_PTRS of the journal key to 0 when
there is no bucket available to reclaim.

Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 6e18057d1d82..5180bed911ef 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -541,11 +541,11 @@ static void journal_reclaim(struct cache_set *c)
 				  ca->sb.nr_this_dev);
 	}
 
-	bkey_init(k);
-	SET_KEY_PTRS(k, n);
-
-	if (n)
+	if (n) {
+		bkey_init(k);
+		SET_KEY_PTRS(k, n);
 		c->journal.blocks_free = c->sb.bucket_size >> c->block_bits;
+	}
 out:
 	if (!journal_full(&c->journal))
 		__closure_wake_up(&c->journal.wait);
@@ -672,6 +672,9 @@ static void journal_write_unlocked(struct closure *cl)
 		ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
 	}
 
+	/* If KEY_PTRS(k) == 0, this jset gets lost in air */
+	BUG_ON(i == 0);
+
 	atomic_dec_bug(&fifo_back(&c->journal.pin));
 	bch_journal_next(&c->journal);
 	journal_reclaim(c);
-- 
2.16.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH v2 03/16] bcache: reload journal key information during journal replay
       [not found] <20190419160509.66298-1-colyli@suse.de>
  2019-04-19 16:04 ` [RFC PATCH v2 02/16] bcache: never set 0 to KEY_PTRS of journal key in journal_reclaim() Coly Li
@ 2019-04-19 16:04 ` Coly Li
  2019-04-23  6:54   ` Hannes Reinecke
  2019-04-19 16:05 ` [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write() Coly Li
  2 siblings, 1 reply; 9+ messages in thread
From: Coly Li @ 2019-04-19 16:04 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, stable

When the bcache journal is initialized for a running cache set, the
cache set's journal.blocks_free is initialized to 0. Then during
journal replay, if journal_meta() is called and an empty jset is
written to the cache device, journal_reclaim() is called. If there is
an available journal bucket to reclaim, c->journal.blocks_free is set
to the number of blocks in a journal bucket, which is
c->sb.bucket_size >> c->block_bits.

Most of the time the above process works correctly, except for the
condition when the journal space is almost full. "Almost full" means
there is no free journal bucket, but there are still free blocks in
the last available bucket indexed by ja->cur_idx.

If the system crashes or reboots when the journal space is almost
full, a problem arises. During cache set reload after the reboot,
c->journal.blocks_free is initialized to 0. When the journal replay
process writes the bcache journal, journal_reclaim() is called to
reclaim an available journal bucket and set c->journal.blocks_free to
c->sb.bucket_size >> c->block_bits. But there is no fully free bucket
to reclaim in journal_reclaim(), so c->journal.blocks_free stays 0. If
the first journal entry processed by journal_replay() causes a btree
split and requires writing journal space by journal_meta(),
journal_meta() goes into an infinite loop trying to reclaim a journal
bucket, and blocks the whole cache set from running.

Such a buggy situation can be avoided if we do the following things
before journal replay starts,
- Recover the value of c->journal.blocks_free from the last run, and
  set it as the initial value of the current c->journal.blocks_free.
- Recover the value of ja->cur_idx from the last run, and set it into
  the KEY_PTR of the current c->journal.key as the initial value.

After c->journal.blocks_free and c->journal.key are recovered, in the
condition when the journal space is almost full and the cache set is
reloaded, meta journal entries from journal replay can be written into
the free blocks of the last available journal bucket, then old journal
entries can be replayed and reclaimed for further journaling requests.

This patch adds bch_journal_key_reload() to recover the journal
blocks_free and key ptr values for the above purpose.
bch_journal_key_reload() is called in bch_journal_read() before
replaying the journal by bch_journal_replay().

Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 87 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 5180bed911ef..a6deb16c15c8 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -143,6 +143,89 @@ reread:		left = ca->sb.bucket_size - offset;
 	return ret;
 }
 
+static int bch_journal_key_reload(struct cache_set *c)
+{
+	struct cache *ca;
+	unsigned int iter, n = 0;
+	struct bkey *k = &c->journal.key;
+	int ret = 0;
+
+	for_each_cache(ca, c, iter) {
+		struct journal_device *ja = &ca->journal;
+		struct bio *bio = &ja->bio;
+		struct jset *j, *data = c->journal.w[0].data;
+		struct closure cl;
+		unsigned int len, left;
+		unsigned int offset = 0, used_blocks = 0;
+		sector_t bucket = bucket_to_sector(c, ca->sb.d[ja->cur_idx]);
+
+		closure_init_stack(&cl);
+
+		while (offset < ca->sb.bucket_size) {
+reread:			left = ca->sb.bucket_size - offset;
+			len = min_t(unsigned int,
+				    left, PAGE_SECTORS << JSET_BITS);
+
+			bio_reset(bio);
+			bio->bi_iter.bi_sector = bucket + offset;
+			bio_set_dev(bio, ca->bdev);
+			bio->bi_iter.bi_size = len << 9;
+
+			bio->bi_end_io = journal_read_endio;
+			bio->bi_private = &cl;
+			bio_set_op_attrs(bio, REQ_OP_READ, 0);
+			bch_bio_map(bio, data);
+
+			closure_bio_submit(c, bio, &cl);
+			closure_sync(&cl);
+
+			j = data;
+			while (len) {
+				size_t blocks, bytes = set_bytes(j);
+
+				if (j->magic != jset_magic(&ca->sb))
+					goto out;
+
+				if (bytes > left << 9 ||
+				    bytes > PAGE_SIZE << JSET_BITS) {
+					pr_err("jset may be corrupted: too big");
+					ret = -EIO;
+					goto err;
+				}
+
+				if (bytes > len << 9)
+					goto reread;
+
+				if (j->csum != csum_set(j)) {
+					pr_err("jset may be corrupted: bad csum");
+					ret = -EIO;
+					goto err;
+				}
+
+				blocks = set_blocks(j, block_bytes(c));
+				used_blocks += blocks;
+
+				offset	+= blocks * ca->sb.block_size;
+				len	-= blocks * ca->sb.block_size;
+				j = ((void *) j) + blocks * block_bytes(ca);
+			}
+		}
+out:
+		c->journal.blocks_free =
+			(c->sb.bucket_size >> c->block_bits) -
+			used_blocks;
+
+		k->ptr[n++] = MAKE_PTR(0, bucket, ca->sb.nr_this_dev);
+	}
+
+	BUG_ON(n == 0);
+	bkey_init(k);
+	SET_KEY_PTRS(k, n);
+
+err:
+	return ret;
+}
+
 int bch_journal_read(struct cache_set *c, struct list_head *list)
 {
 #define read_bucket(b)							\
@@ -268,6 +351,10 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
 					    struct journal_replay,
 					    list)->j.seq;
 
+	/* Initial value of c->journal.blocks_free should be 0 */
+	BUG_ON(c->journal.blocks_free != 0);
+	ret = bch_journal_key_reload(c);
+
 	return ret;
 #undef read_bucket
 }
-- 
2.16.4


* [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write()
       [not found] <20190419160509.66298-1-colyli@suse.de>
  2019-04-19 16:04 ` [RFC PATCH v2 02/16] bcache: never set 0 to KEY_PTRS of journal key in journal_reclaim() Coly Li
  2019-04-19 16:04 ` [RFC PATCH v2 03/16] bcache: reload journal key information during journal replay Coly Li
@ 2019-04-19 16:05 ` Coly Li
       [not found]   ` <20190419231642.90AB82171F@mail.kernel.org>
  2019-04-23  7:09   ` Hannes Reinecke
  2 siblings, 2 replies; 9+ messages in thread
From: Coly Li @ 2019-04-19 16:05 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, stable

Current journal_max_cmp() and journal_min_cmp() assume that a smaller
fifo index indicates an older journal entry, but this is only true
when the fifo index is not swapped.

The fifo structure journal.pin is implemented by a circular buffer; if
the head index reaches the highest location of the circular buffer, it
will be swapped to 0. Once the swapping happens, a smaller fifo index
might be associated with a newer journal entry. So the btree node with
the oldest journal entry won't be selected by btree_flush_write() to be
flushed out to the cache device. The result is that the oldest journal
entries may never get a chance to be written to the cache device, and
after a reboot bch_journal_replay() may complain that some journal
entries are missing.

This patch handles the fifo index swapping conditions properly, so
that in btree_flush_write() the btree node with the oldest journal
entry can be selected from c->flush_btree correctly.

Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 47 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index bdb6f9cefe48..bc0e01151155 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -464,12 +464,47 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 }
 
 /* Journalling */
-#define journal_max_cmp(l, r) \
-	(fifo_idx(&c->journal.pin, btree_current_write(l)->journal) < \
-	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
-#define journal_min_cmp(l, r) \
-	(fifo_idx(&c->journal.pin, btree_current_write(l)->journal) > \
-	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
+#define journal_max_cmp(l, r)						\
+({									\
+	int l_idx, r_idx, f_idx, b_idx;					\
+	bool _ret = true;						\
+									\
+	l_idx = fifo_idx(&c->journal.pin, btree_current_write(l)->journal); \
+	r_idx = fifo_idx(&c->journal.pin, btree_current_write(r)->journal); \
+	f_idx = c->journal.pin.front;					\
+	b_idx = c->journal.pin.back;					\
+									\
+	_ret = (l_idx < r_idx);						\
+	/* in case fifo back pointer is swapped */			\
+	if (b_idx < f_idx) { 						\
+		if (l_idx <= b_idx && r_idx >= f_idx)			\
+			_ret = false;					\
+		else if (l_idx >= f_idx && r_idx <= b_idx)		\
+			_ret = true;					\
+	}								\
+	_ret;								\
+})
+
+#define journal_min_cmp(l, r)						\
+({									\
+	int l_idx, r_idx, f_idx, b_idx;					\
+	bool _ret = true;						\
+									\
+	l_idx = fifo_idx(&c->journal.pin, btree_current_write(l)->journal); \
+	r_idx = fifo_idx(&c->journal.pin, btree_current_write(r)->journal); \
+	f_idx = c->journal.pin.front;					\
+	b_idx = c->journal.pin.back;					\
+									\
+	_ret = (l_idx > r_idx);						\
+	/* in case fifo back pointer is swapped */			\
+	if (b_idx < f_idx) {						\
+		if (l_idx <= b_idx && r_idx >= f_idx)			\
+			_ret = true;					\
+		else if (l_idx >= f_idx && r_idx <= b_idx)		\
+			_ret = false;					\
+	}								\
+	_ret;								\
+})
 
 static void btree_flush_write(struct cache_set *c)
 {
-- 
2.16.4


* Re: [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write()
       [not found]   ` <20190419231642.90AB82171F@mail.kernel.org>
@ 2019-04-20 13:20     ` Coly Li
  0 siblings, 0 replies; 9+ messages in thread
From: Coly Li @ 2019-04-20 13:20 UTC (permalink / raw)
  To: Sasha Levin, linux-bcache; +Cc: linux-block, stable

On 2019/4/20 7:16 AM, Sasha Levin wrote:
> Hi,
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees: all
> 
> The bot has tested the following trees: v5.0.8, v4.19.35, v4.14.112, v4.9.169, v4.4.178, v3.18.138.
> 
> v5.0.8: Build OK!
> v4.19.35: Build OK!
> v4.14.112: Failed to apply! Possible dependencies:
>     a728eacbbdd2 ("bcache: add journal statistic")
>     c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
> 
> v4.9.169: Failed to apply! Possible dependencies:
>     a728eacbbdd2 ("bcache: add journal statistic")
>     c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
> 
> v4.4.178: Failed to apply! Possible dependencies:
>     a728eacbbdd2 ("bcache: add journal statistic")
>     c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
> 
> v3.18.138: Failed to apply! Possible dependencies:
>     a728eacbbdd2 ("bcache: add journal statistic")
>     c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
> 
> 
> How should we proceed with this patch?

These patches will go into Linux v5.2. We can have them in stable
after they are upstream.

Thanks.

-- 

Coly Li


* Re: [RFC PATCH v2 02/16] bcache: never set 0 to KEY_PTRS of journal key in journal_reclaim()
  2019-04-19 16:04 ` [RFC PATCH v2 02/16] bcache: never set 0 to KEY_PTRS of journal key in journal_reclaim() Coly Li
@ 2019-04-23  6:50   ` Hannes Reinecke
  0 siblings, 0 replies; 9+ messages in thread
From: Hannes Reinecke @ 2019-04-23  6:50 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, stable

On 4/19/19 6:04 PM, Coly Li wrote:
> In journal_reclaim(), ja->cur_idx of each cache is updated to reclaim
> available journal buckets. The variable 'int n' counts how many caches
> are successfully reclaimed, then n is set as the KEY_PTRS value of
> c->journal.key by SET_KEY_PTRS(). Later in journal_write_unlocked(), a
> for_each_cache() loop writes the jset data onto each cache.
> 
> The problem is, if all journal buckets on each cache are full, the
> following code in journal_reclaim(),
> 
> 529 for_each_cache(ca, c, iter) {
> 530         struct journal_device *ja = &ca->journal;
> 531         unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
> 532
> 533         /* No space available on this device */
> 534         if (next == ja->discard_idx)
> 535                 continue;
> 536
> 537         ja->cur_idx = next;
> 538         k->ptr[n++] = MAKE_PTR(0,
> 539                           bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
> 540                           ca->sb.nr_this_dev);
> 541 }
> 542
> 543 bkey_init(k);
> 544 SET_KEY_PTRS(k, n);
> 
> If there is no available bucket to reclaim, the if() condition at line
> 534 will always be true, and n remains 0. Then at line 544,
> SET_KEY_PTRS() sets the KEY_PTRS field of c->journal.key to 0.
> 
> Setting the KEY_PTRS field of c->journal.key to 0 is wrong, because
> in journal_write_unlocked() the journal data is written in the
> following loop,
> 
> 649	for (i = 0; i < KEY_PTRS(k); i++) {
> 650-671		submit journal data to cache device
> 672	}
> 
> If the KEY_PTRS field is set to 0 in journal_reclaim(), the journal
> data won't be written to the cache device here. If the system crashes
> or reboots before the bkeys of the lost journal entries are written
> into btree nodes, data corruption will be reported during bcache
> reload after rebooting the system.
> 
> In practice there is only one cache in a cache set, so there is no
> need to set the KEY_PTRS field in journal_reclaim() at all. But in
> order to keep the for_each_cache() logic consistent for now, this
> patch fixes the above problem by not setting KEY_PTRS of the journal
> key to 0 when there is no bucket available to reclaim.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/journal.c | 11 +++++++----
>   1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 6e18057d1d82..5180bed911ef 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -541,11 +541,11 @@ static void journal_reclaim(struct cache_set *c)
>   				  ca->sb.nr_this_dev);
>   	}
>   
> -	bkey_init(k);
> -	SET_KEY_PTRS(k, n);
> -
> -	if (n)
> +	if (n) {
> +		bkey_init(k);
> +		SET_KEY_PTRS(k, n);
>   		c->journal.blocks_free = c->sb.bucket_size >> c->block_bits;
> +	}
>   out:
>   	if (!journal_full(&c->journal))
>   		__closure_wake_up(&c->journal.wait);
> @@ -672,6 +672,9 @@ static void journal_write_unlocked(struct closure *cl)
>   		ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
>   	}
>   
> +	/* If KEY_PTRS(k) == 0, this jset gets lost in air */
> +	BUG_ON(i == 0);
> +
>   	atomic_dec_bug(&fifo_back(&c->journal.pin));
>   	bch_journal_next(&c->journal);
>   	journal_reclaim(c);
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


* Re: [RFC PATCH v2 03/16] bcache: reload journal key information during journal replay
  2019-04-19 16:04 ` [RFC PATCH v2 03/16] bcache: reload journal key information during journal replay Coly Li
@ 2019-04-23  6:54   ` Hannes Reinecke
  2019-04-23  6:56     ` Coly Li
  0 siblings, 1 reply; 9+ messages in thread
From: Hannes Reinecke @ 2019-04-23  6:54 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, stable

On 4/19/19 6:04 PM, Coly Li wrote:
> When the bcache journal is initialized for a running cache set, the
> cache set's journal.blocks_free is initialized to 0. Then during
> journal replay, if journal_meta() is called and an empty jset is
> written to the cache device, journal_reclaim() is called. If there is
> an available journal bucket to reclaim, c->journal.blocks_free is set
> to the number of blocks in a journal bucket, which is
> c->sb.bucket_size >> c->block_bits.
> 
> Most of the time the above process works correctly, except for the
> condition when the journal space is almost full. "Almost full" means
> there is no free journal bucket, but there are still free blocks in
> the last available bucket indexed by ja->cur_idx.
> 
> If the system crashes or reboots when the journal space is almost
> full, a problem arises. During cache set reload after the reboot,
> c->journal.blocks_free is initialized to 0. When the journal replay
> process writes the bcache journal, journal_reclaim() is called to
> reclaim an available journal bucket and set c->journal.blocks_free to
> c->sb.bucket_size >> c->block_bits. But there is no fully free bucket
> to reclaim in journal_reclaim(), so c->journal.blocks_free stays 0.
> If the first journal entry processed by journal_replay() causes a
> btree split and requires writing journal space by journal_meta(),
> journal_meta() goes into an infinite loop trying to reclaim a journal
> bucket, and blocks the whole cache set from running.
> 
> Such a buggy situation can be avoided if we do the following things
> before journal replay starts,
> - Recover the value of c->journal.blocks_free from the last run, and
>   set it as the initial value of the current c->journal.blocks_free.
> - Recover the value of ja->cur_idx from the last run, and set it into
>   the KEY_PTR of the current c->journal.key as the initial value.
> 
> After c->journal.blocks_free and c->journal.key are recovered, in the
> condition when the journal space is almost full and the cache set is
> reloaded, meta journal entries from journal replay can be written into
> the free blocks of the last available journal bucket, then old journal
> entries can be replayed and reclaimed for further journaling requests.
> 
> This patch adds bch_journal_key_reload() to recover the journal
> blocks_free and key ptr values for the above purpose.
> bch_journal_key_reload() is called in bch_journal_read() before
> replaying the journal by bch_journal_replay().
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/journal.c | 87 +++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 87 insertions(+)
> 
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 5180bed911ef..a6deb16c15c8 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -143,6 +143,89 @@ reread:		left = ca->sb.bucket_size - offset;
>   	return ret;
>   }
>   
> +static int bch_journal_key_reload(struct cache_set *c)
> +{
> +	struct cache *ca;
> +	unsigned int iter, n = 0;
> +	struct bkey *k = &c->journal.key;
> +	int ret = 0;
> +
> +	for_each_cache(ca, c, iter) {
> +		struct journal_device *ja = &ca->journal;
> +		struct bio *bio = &ja->bio;
> +		struct jset *j, *data = c->journal.w[0].data;
> +		struct closure cl;
> +		unsigned int len, left;
> +		unsigned int offset = 0, used_blocks = 0;
> +		sector_t bucket = bucket_to_sector(c, ca->sb.d[ja->cur_idx]);
> +
> +		closure_init_stack(&cl);
> +
> +		while (offset < ca->sb.bucket_size) {
> +reread:			left = ca->sb.bucket_size - offset;

Please place the label on a line of its own.

> +			len = min_t(unsigned int,
> +				    left, PAGE_SECTORS << JSET_BITS);
> +
> +			bio_reset(bio);
> +			bio->bi_iter.bi_sector = bucket + offset;
> +			bio_set_dev(bio, ca->bdev);
> +			bio->bi_iter.bi_size = len << 9;
> +
> +			bio->bi_end_io = journal_read_endio;
> +			bio->bi_private = &cl;
> +			bio_set_op_attrs(bio, REQ_OP_READ, 0);
> +			bch_bio_map(bio, data);
> +
> +			closure_bio_submit(c, bio, &cl);
> +			closure_sync(&cl);
> +
> +			j = data;
> +			while (len) {
> +				size_t blocks, bytes = set_bytes(j);
> +
> +				if (j->magic != jset_magic(&ca->sb))
> +					goto out;
> +
> +				if (bytes > left << 9 ||
> +				    bytes > PAGE_SIZE << JSET_BITS) {
> +					pr_err("jset may be corrupted: too big");
> +					ret = -EIO;
> +					goto err;
> +				}
> +
> +				if (bytes > len << 9)
> +					goto reread;
> +
> +				if (j->csum != csum_set(j)) {
> +					pr_err("jset may be corrupted: bad csum");
> +					ret = -EIO;
> +					goto err;
> +				}
> +
> +				blocks = set_blocks(j, block_bytes(c));
> +				used_blocks += blocks;
> +
> +				offset	+= blocks * ca->sb.block_size;
> +				len	-= blocks * ca->sb.block_size;
> +				j = ((void *) j) + blocks * block_bytes(ca);
> +			}
> +		}
> +out:
> +		c->journal.blocks_free =
> +			(c->sb.bucket_size >> c->block_bits) -
> +			used_blocks;
> +
> +		k->ptr[n++] = MAKE_PTR(0, bucket, ca->sb.nr_this_dev);
> +	}
> +
> +	BUG_ON(n == 0);
> +	bkey_init(k);
> +	SET_KEY_PTRS(k, n);
> +
> +err:
> +	return ret;
> +}
> +
>   int bch_journal_read(struct cache_set *c, struct list_head *list)
>   {
>   #define read_bucket(b)							\
This is a _quite_ convoluted nested loop.
It would be far better if it could be split into functions, so as to 
avoid the loop-within-loop constructs.
Oh, and some documentation (especially the 'reread' bit) would be nice.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


* Re: [RFC PATCH v2 03/16] bcache: reload jouranl key information during journal replay
  2019-04-23  6:54   ` Hannes Reinecke
@ 2019-04-23  6:56     ` Coly Li
  0 siblings, 0 replies; 9+ messages in thread
From: Coly Li @ 2019-04-23  6:56 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block, stable

On 2019/4/23 2:54 PM, Hannes Reinecke wrote:
> On 4/19/19 6:04 PM, Coly Li wrote:
>> When the bcache journal is initialized for a running cache set, the
>> cache set's journal.blocks_free is initialized to 0. Then during
>> journal replay, if journal_meta() is called and an empty jset is
>> written to the cache device, journal_reclaim() is called. If there is
>> an available journal bucket to reclaim, c->journal.blocks_free is set
>> to the number of blocks in a journal bucket, which is
>> c->sb.bucket_size >> c->block_bits.
>>
>> Most of the time the above process works correctly, except for the
>> condition when the journal space is almost full. "Almost full" means
>> there is no free journal bucket, but there are still free blocks in
>> the last available bucket indexed by ja->cur_idx.
>>
>> If the system crashes or reboots when the journal space is almost
>> full, a problem arises. During cache set reload after the reboot,
>> c->journal.blocks_free is initialized to 0. When the journal replay
>> process writes the bcache journal, journal_reclaim() is called to
>> reclaim an available journal bucket and set c->journal.blocks_free to
>> c->sb.bucket_size >> c->block_bits. But there is no fully free bucket
>> to reclaim in journal_reclaim(), so c->journal.blocks_free stays 0.
>> If the first journal entry processed by journal_replay() causes a
>> btree split and requires writing journal space by journal_meta(),
>> journal_meta() goes into an infinite loop trying to reclaim a journal
>> bucket, and blocks the whole cache set from running.
>>
>> Such a buggy situation can be avoided if we do the following things
>> before journal replay starts,
>> - Recover the value of c->journal.blocks_free from the last run, and
>>    set it as the initial value of the current c->journal.blocks_free.
>> - Recover the value of ja->cur_idx from the last run, and set it into
>>    the KEY_PTR of the current c->journal.key as the initial value.
>>
>> After c->journal.blocks_free and c->journal.key are recovered, in the
>> condition when the journal space is almost full and the cache set is
>> reloaded, meta journal entries from journal replay can be written into
>> the free blocks of the last available journal bucket, then old journal
>> entries can be replayed and reclaimed for further journaling requests.
>>
>> This patch adds bch_journal_key_reload() to recover the journal
>> blocks_free and key ptr values for the above purpose.
>> bch_journal_key_reload() is called in bch_journal_read() before
>> replaying the journal by bch_journal_replay().
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>   drivers/md/bcache/journal.c | 87
>> +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 87 insertions(+)
>>
>> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
>> index 5180bed911ef..a6deb16c15c8 100644
>> --- a/drivers/md/bcache/journal.c
>> +++ b/drivers/md/bcache/journal.c
>> @@ -143,6 +143,89 @@ reread:        left = ca->sb.bucket_size - offset;
>>       return ret;
>>   }
>>   +static int bch_journal_key_reload(struct cache_set *c)
>> +{
>> +    struct cache *ca;
>> +    unsigned int iter, n = 0;
>> +    struct bkey *k = &c->journal.key;
>> +    int ret = 0;
>> +
>> +    for_each_cache(ca, c, iter) {
>> +        struct journal_device *ja = &ca->journal;
>> +        struct bio *bio = &ja->bio;
>> +        struct jset *j, *data = c->journal.w[0].data;
>> +        struct closure cl;
>> +        unsigned int len, left;
>> +        unsigned int offset = 0, used_blocks = 0;
>> +        sector_t bucket = bucket_to_sector(c, ca->sb.d[ja->cur_idx]);
>> +
>> +        closure_init_stack(&cl);
>> +
>> +        while (offset < ca->sb.bucket_size) {
>> +reread:            left = ca->sb.bucket_size - offset;
> 
> Please place the label on a line of its own.
> 
>> +            len = min_t(unsigned int,
>> +                    left, PAGE_SECTORS << JSET_BITS);
>> +
>> +            bio_reset(bio);
>> +            bio->bi_iter.bi_sector = bucket + offset;
>> +            bio_set_dev(bio, ca->bdev);
>> +            bio->bi_iter.bi_size = len << 9;
>> +
>> +            bio->bi_end_io = journal_read_endio;
>> +            bio->bi_private = &cl;
>> +            bio_set_op_attrs(bio, REQ_OP_READ, 0);
>> +            bch_bio_map(bio, data);
>> +
>> +            closure_bio_submit(c, bio, &cl);
>> +            closure_sync(&cl);
>> +
>> +            j = data;
>> +            while (len) {
>> +                size_t blocks, bytes = set_bytes(j);
>> +
>> +                if (j->magic != jset_magic(&ca->sb))
>> +                    goto out;
>> +
>> +                if (bytes > left << 9 ||
>> +                    bytes > PAGE_SIZE << JSET_BITS) {
>> +                    pr_err("jset may be corrupted: too big");
>> +                    ret = -EIO;
>> +                    goto err;
>> +                }
>> +
>> +                if (bytes > len << 9)
>> +                    goto reread;
>> +
>> +                if (j->csum != csum_set(j)) {
>> +                    pr_err("jset may be corrupted: bad csum");
>> +                    ret = -EIO;
>> +                    goto err;
>> +                }
>> +
>> +                blocks = set_blocks(j, block_bytes(c));
>> +                used_blocks += blocks;
>> +
>> +                offset    += blocks * ca->sb.block_size;
>> +                len    -= blocks * ca->sb.block_size;
>> +                j = ((void *) j) + blocks * block_bytes(ca);
>> +            }
>> +        }
>> +out:
>> +        c->journal.blocks_free =
>> +            (c->sb.bucket_size >> c->block_bits) -
>> +            used_blocks;
>> +
>> +        k->ptr[n++] = MAKE_PTR(0, bucket, ca->sb.nr_this_dev);
>> +    }
>> +
>> +    BUG_ON(n == 0);
>> +    bkey_init(k);
>> +    SET_KEY_PTRS(k, n);
>> +
>> +err:
>> +    return ret;
>> +}
>> +
>>   int bch_journal_read(struct cache_set *c, struct list_head *list)
>>   {
>>   #define read_bucket(b)                            \
> This is a _quite_ convoluted nested loop.
> It would be far better if it could be split into functions, so as to
> avoid the loop-within-loop constructs.
> Oh, and some documentation (especially the 'reread' bit) would be nice.

Hi Hannes,

Sure I will fix them in next version. Thanks.


-- 

Coly Li


* Re: [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write()
  2019-04-19 16:05 ` [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write() Coly Li
       [not found]   ` <20190419231642.90AB82171F@mail.kernel.org>
@ 2019-04-23  7:09   ` Hannes Reinecke
  2019-04-23  7:16     ` Coly Li
  1 sibling, 1 reply; 9+ messages in thread
From: Hannes Reinecke @ 2019-04-23  7:09 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, stable

On 4/19/19 6:05 PM, Coly Li wrote:
> Current journal_max_cmp() and journal_min_cmp() assume that a smaller
> fifo index indicates an older journal entry, but this is only true
> when the fifo index is not swapped.
> 
> The fifo structure journal.pin is implemented as a circular buffer;
> when the head index reaches the highest location of the buffer, it is
> swapped (wraps) to 0. Once this swapping happens, a smaller fifo index
> might be associated with a newer journal entry, so the btree node with
> the oldest journal entry won't be selected by btree_flush_write() to
> flush out to the cache device. The result is that the oldest journal
> entries may never get a chance to be written to the cache device, and
> after a reboot bch_journal_replay() may complain that some journal
> entries are missing.
> 
> This patch handles the fifo index swapping conditions properly, so
> that in btree_flush_write() the btree node with the oldest journal
> entry can be selected from c->flush_btree correctly.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/journal.c | 47 +++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 41 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index bdb6f9cefe48..bc0e01151155 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -464,12 +464,47 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
>   }
>   
>   /* Journalling */
> -#define journal_max_cmp(l, r) \
> -	(fifo_idx(&c->journal.pin, btree_current_write(l)->journal) < \
> -	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
> -#define journal_min_cmp(l, r) \
> -	(fifo_idx(&c->journal.pin, btree_current_write(l)->journal) > \
> -	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
> +#define journal_max_cmp(l, r)						\
> +({									\
> +	int l_idx, r_idx, f_idx, b_idx;					\
> +	bool _ret = true;						\
> +									\
> +	l_idx = fifo_idx(&c->journal.pin, btree_current_write(l)->journal); \
> +	r_idx = fifo_idx(&c->journal.pin, btree_current_write(r)->journal); \
> +	f_idx = c->journal.pin.front;					\
> +	b_idx = c->journal.pin.back;					\
> +									\
> +	_ret = (l_idx < r_idx);						\
> +	/* in case fifo back pointer is swapped */			\
> +	if (b_idx < f_idx) { 						\
> +		if (l_idx <= b_idx && r_idx >= f_idx)			\
> +			_ret = false;					\
> +		else if (l_idx >= f_idx && r_idx <= b_idx)		\
> +			_ret = true;					\
> +	}								\
> +	_ret;								\
> +})
> +
> +#define journal_min_cmp(l, r)						\
> +({									\
> +	int l_idx, r_idx, f_idx, b_idx;					\
> +	bool _ret = true;						\
> +									\
> +	l_idx = fifo_idx(&c->journal.pin, btree_current_write(l)->journal); \
> +	r_idx = fifo_idx(&c->journal.pin, btree_current_write(r)->journal); \
> +	f_idx = c->journal.pin.front;					\
> +	b_idx = c->journal.pin.back;					\
> +									\
> +	_ret = (l_idx > r_idx);						\
> +	/* in case fifo back pointer is swapped */			\
> +	if (b_idx < f_idx) {						\
> +		if (l_idx <= b_idx && r_idx >= f_idx)			\
> +			_ret = true;					\
> +		else if (l_idx >= f_idx && r_idx <= b_idx)		\
> +			_ret = false;					\
> +	}								\
> +	_ret;								\
> +})
>   
>   static void btree_flush_write(struct cache_set *c)
>   {
> 
Please make it a proper function.
This is far too convoluted for being handled via #define, and it would
avoid cluttering the function namespace with hidden variables.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


* Re: [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write()
  2019-04-23  7:09   ` Hannes Reinecke
@ 2019-04-23  7:16     ` Coly Li
  0 siblings, 0 replies; 9+ messages in thread
From: Coly Li @ 2019-04-23  7:16 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block, stable

On 2019/4/23 3:09 PM, Hannes Reinecke wrote:
> On 4/19/19 6:05 PM, Coly Li wrote:
>> Current journal_max_cmp() and journal_min_cmp() assume that a smaller
>> fifo index indicates an older journal entry, but this is only true
>> when the fifo index is not swapped.
>>
>> The fifo structure journal.pin is implemented as a circular buffer;
>> when the head index reaches the highest location of the buffer, it is
>> swapped (wraps) to 0. Once this swapping happens, a smaller fifo index
>> might be associated with a newer journal entry, so the btree node with
>> the oldest journal entry won't be selected by btree_flush_write() to
>> flush out to the cache device. The result is that the oldest journal
>> entries may never get a chance to be written to the cache device, and
>> after a reboot bch_journal_replay() may complain that some journal
>> entries are missing.
>>
>> This patch handles the fifo index swapping conditions properly, so
>> that in btree_flush_write() the btree node with the oldest journal
>> entry can be selected from c->flush_btree correctly.
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>   drivers/md/bcache/journal.c | 47 +++++++++++++++++++++++++++++++++++++++------
>>   1 file changed, 41 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
>> index bdb6f9cefe48..bc0e01151155 100644
>> --- a/drivers/md/bcache/journal.c
>> +++ b/drivers/md/bcache/journal.c
>> @@ -464,12 +464,47 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
>>   }
>>
>>   /* Journalling */
>> -#define journal_max_cmp(l, r) \
>> -	(fifo_idx(&c->journal.pin, btree_current_write(l)->journal) < \
>> -	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
>> -#define journal_min_cmp(l, r) \
>> -	(fifo_idx(&c->journal.pin, btree_current_write(l)->journal) > \
>> -	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
>> +#define journal_max_cmp(l, r)						\
>> +({									\
>> +	int l_idx, r_idx, f_idx, b_idx;					\
>> +	bool _ret = true;						\
>> +									\
>> +	l_idx = fifo_idx(&c->journal.pin, btree_current_write(l)->journal); \
>> +	r_idx = fifo_idx(&c->journal.pin, btree_current_write(r)->journal); \
>> +	f_idx = c->journal.pin.front;					\
>> +	b_idx = c->journal.pin.back;					\
>> +									\
>> +	_ret = (l_idx < r_idx);						\
>> +	/* in case fifo back pointer is swapped */			\
>> +	if (b_idx < f_idx) {						\
>> +		if (l_idx <= b_idx && r_idx >= f_idx)			\
>> +			_ret = false;					\
>> +		else if (l_idx >= f_idx && r_idx <= b_idx)		\
>> +			_ret = true;					\
>> +	}								\
>> +	_ret;								\
>> +})
>> +
>> +#define journal_min_cmp(l, r)						\
>> +({									\
>> +	int l_idx, r_idx, f_idx, b_idx;					\
>> +	bool _ret = true;						\
>> +									\
>> +	l_idx = fifo_idx(&c->journal.pin, btree_current_write(l)->journal); \
>> +	r_idx = fifo_idx(&c->journal.pin, btree_current_write(r)->journal); \
>> +	f_idx = c->journal.pin.front;					\
>> +	b_idx = c->journal.pin.back;					\
>> +									\
>> +	_ret = (l_idx > r_idx);						\
>> +	/* in case fifo back pointer is swapped */			\
>> +	if (b_idx < f_idx) {						\
>> +		if (l_idx <= b_idx && r_idx >= f_idx)			\
>> +			_ret = true;					\
>> +		else if (l_idx >= f_idx && r_idx <= b_idx)		\
>> +			_ret = false;					\
>> +	}								\
>> +	_ret;								\
>> +})
>>     static void btree_flush_write(struct cache_set *c)
>>   {
>>
> Please make it a proper function.
> This is far too convoluted for being handled via #define, and it would
> avoid cluttering the function namespace with hidden variables.

Hi Hannes,

Sure, let me do it in the next version. Thanks.


-- 

Coly Li


end of thread, other threads:[~2019-04-23  7:16 UTC | newest]

Thread overview: 9+ messages
     [not found] <20190419160509.66298-1-colyli@suse.de>
2019-04-19 16:04 ` [RFC PATCH v2 02/16] bcache: never set 0 to KEY_PTRS of jouranl key in journal_reclaim() Coly Li
2019-04-23  6:50   ` Hannes Reinecke
2019-04-19 16:04 ` [RFC PATCH v2 03/16] bcache: reload jouranl key information during journal replay Coly Li
2019-04-23  6:54   ` Hannes Reinecke
2019-04-23  6:56     ` Coly Li
2019-04-19 16:05 ` [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write() Coly Li
     [not found]   ` <20190419231642.90AB82171F@mail.kernel.org>
2019-04-20 13:20     ` Coly Li
2019-04-23  7:09   ` Hannes Reinecke
2019-04-23  7:16     ` Coly Li
