[PATCH 0/2] btrfs: reduce the duplicated reads during P/Q scrub

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] btrfs: reduce the duplicated reads during P/Q scrub
@ 2023-04-13  5:57 Qu Wenruo
  2023-04-13  5:57 ` [PATCH 1/2] btrfs: scrub: try harder to mark RAID56 block groups read-only Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Qu Wenruo @ 2023-04-13  5:57 UTC (permalink / raw)
  To: linux-btrfs

[PROBLEM]
It's a known problem that btrfs scrub for RAID56 is pretty slow.

[CAUSE]
One of the causes is that we read the same data stripes at least twice
during P/Q stripes scrub.

This means, during a full fs scrub (one scrub process for each device),
there will be quite some extra seeks just because of this.

[FIX]
The truth is, scrub stripes have a much better view of the data stripes.
As btrfs would firstly verify all data stripes, and only continue
scrubing the P/Q stripes if all the data stripes is fine after needed
repair.

So this means, as long as there is no new RMW writes into the RAID56
block group, we can re-use the scrub cache for P/Q verification.

This patchset would fix it by:

- Ensure the RAID56 block groups are marked read-only for scrub
  This is to avoid RMW in to the block group, or scrub cache is no
  longer reliable.

- Introduce a new interface to pass cached pages to RAID56 cache
  The only disadvantage is here we still need to do page copy, due to
  the uncertain lifespan of an rbio.

Qu Wenruo (2):
  btrfs: scrub: try harder to mark RAID56 block groups read-only
  btrfs: scrub: use recovered data stripes as cache to avoid unnecessary
    read

 fs/btrfs/block-group.c | 16 +++++++++++++--
 fs/btrfs/raid56.c      | 46 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/raid56.h      |  3 +++
 fs/btrfs/scrub.c       | 16 ++++++++++++++-
 4 files changed, 78 insertions(+), 3 deletions(-)

-- 
2.39.2

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH 1/2] btrfs: scrub: try harder to mark RAID56 block groups read-only
  2023-04-13  5:57 [PATCH 0/2] btrfs: reduce the duplicated reads during P/Q scrub Qu Wenruo
@ 2023-04-13  5:57 ` Qu Wenruo
  2023-04-13  5:57 ` [PATCH 2/2] btrfs: scrub: use recovered data stripes as cache to avoid unnecessary read Qu Wenruo
  2023-04-17 22:51 ` [PATCH 0/2] btrfs: reduce the duplicated reads during P/Q scrub David Sterba
  2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2023-04-13  5:57 UTC (permalink / raw)
  To: linux-btrfs

Currently we allows a block group not to be marked read-only for scrub.

But for RAID56 block groups if we require the block group to be
read-only, then we're allowed to use cached content from scrub stripe to
reduce unnecessary raid56 reads.

So this patch would:

- Make btrfs_inc_block_group_ro() to try harder
  During my tests, for cases like btrfs/061 and btrfs/064, we can hit
  -ENOSPC from btrfs_inc_block_group_ro() calls during scrub.

  The reason is if we only have one single data chunk, and trying to
  scrub it, we won't have any space left for any newer data writes.

  But this check should be done by the caller, especially for scrub
  cases we only temporarily mark the chunk read-only.
  And newer data writes would always try to allocate a new data chunk
  when needed.

- Return error for scrub if we failed to mark a RAID56 chunk read-only

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 16 ++++++++++++++--
 fs/btrfs/scrub.c       |  9 ++++++++-
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 957ad1c31c4f..1da798752159 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2818,10 +2818,22 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
 	}
 
 	ret = inc_block_group_ro(cache, 0);
-	if (!do_chunk_alloc || ret == -ETXTBSY)
-		goto unlock_out;
 	if (!ret)
 		goto out;
+	if (ret == -ETXTBSY)
+		goto unlock_out;
+
+	/*
+	 * Skip chunk alloc if the bg is SYSTEM, this is to avoid
+	 * system chunk allocation storm to exhaust the system chunk
+	 * array.
+	 * Otherwise we still want to try our best to mark the
+	 * block group read-only.
+	 */
+	if (!do_chunk_alloc && ret == -ENOSPC &&
+	    (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM))
+		goto unlock_out;
+
 	alloc_flags = btrfs_get_alloc_profile(fs_info, cache->space_info->flags);
 	ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
 	if (ret < 0)
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 836725a19661..22ce3a628eb5 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2518,13 +2518,20 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 
 		if (ret == 0) {
 			ro_set = 1;
-		} else if (ret == -ENOSPC && !sctx->is_dev_replace) {
+		} else if (ret == -ENOSPC && !sctx->is_dev_replace &&
+			   !(cache->flags & BTRFS_BLOCK_GROUP_RAID56_MASK)) {
 			/*
 			 * btrfs_inc_block_group_ro return -ENOSPC when it
 			 * failed in creating new chunk for metadata.
 			 * It is not a problem for scrub, because
 			 * metadata are always cowed, and our scrub paused
 			 * commit_transactions.
+			 *
+			 * For RAID56 chunks, we have to mark them read-only
+			 * for scrub, as later we would use our own cache
+			 * out of RAID56 realm.
+			 * Thus we want the RAID56 bg to br marked RO to
+			 * prevent RMW from screwing up out cache.
 			 */
 			ro_set = 0;
 		} else if (ret == -ETXTBSY) {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 2/2] btrfs: scrub: use recovered data stripes as cache to avoid unnecessary read
  2023-04-13  5:57 [PATCH 0/2] btrfs: reduce the duplicated reads during P/Q scrub Qu Wenruo
  2023-04-13  5:57 ` [PATCH 1/2] btrfs: scrub: try harder to mark RAID56 block groups read-only Qu Wenruo
@ 2023-04-13  5:57 ` Qu Wenruo
  2023-04-17 22:51 ` [PATCH 0/2] btrfs: reduce the duplicated reads during P/Q scrub David Sterba
  2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2023-04-13  5:57 UTC (permalink / raw)
  To: linux-btrfs

For P/Q stripe scrub, we have quite some duplicated read IO:

- Data stripes read for verification
  This is triggered by the scrub_submit_initial_read() inside
  scrub_raid56_parity_stripe().

- Data stripes read (again) for P/Q stripe verification
  This is triggered by scrub_assemble_read_bios() from scrub_rbio().

  Although we can have hit rbio cache and avoid unnecessary read, the
  chance is very low, as scrub would easily flush the whole rbio cache.

This means, even we're just scrubing a single P/Q stripe, we would read
the data stripes twice for the best case scenario.
If we need to recover some data stripes, it would cause more reads on
the same data stripes, again and again.

However before we call raid56_parity_submit_scrub_rbio() we already
have all data stripes repaired and their contents ready to use.
But RAID56 cache is unaware about the scrub cache, thus RAID56 layer
itself still needs to re-read the data stripes.

To avoid such cache miss, this patch would:

- Introduce a new helper, raid56_parity_cache_data_pages()
  This function would grab the pages from an array, and copy the content
  to the rbio, marking all the involved sectors uptodate.

  The page copy is unavoidable because of the cache pages of rbio are all
  self managed, thus can not utilize outside pages without screwing up
  the lifespan.

- Use the repaired data stripes as cache inside
  scrub_raid56_parity_stripe()

By this, we ensure all the data sectors of the scrub rbio are already
uptodate, and no need to read them again from disk.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/raid56.h |  3 +++
 fs/btrfs/scrub.c  |  7 +++++++
 3 files changed, 56 insertions(+)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 2fab37f062de..eaa96e8b3f10 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2747,3 +2747,49 @@ void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio)
 	if (!lock_stripe_add(rbio))
 		start_async_work(rbio, scrub_rbio_work_locked);
 }
+
+/*
+ * This is for scrub call sites where we already have correct data contents.
+ * This allows us to avoid reading data stripes again.
+ *
+ * Unfortunately here we have to do page copy, other than reusing the pages.
+ * This is due to the fact rbio has its own page management for its cache.
+ */
+void raid56_parity_cache_data_pages(struct btrfs_raid_bio *rbio,
+				    struct page **data_pages, u64 data_logical)
+{
+	const u64 offset_in_full_stripe = data_logical -
+					  rbio->bioc->full_stripe_logical;
+	const int page_index = offset_in_full_stripe >> PAGE_SHIFT;
+	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
+	const u32 sectors_per_page = PAGE_SIZE / sectorsize;
+	int ret;
+
+	/*
+	 * If we hit ENOMEM temporarily, but later at
+	 * raid56_parity_submit_scrub_rbio() time it succeeded, we just do
+	 * the extra read, not a big deal.
+	 *
+	 * If we hit ENOMEM later at raid56_parity_submit_scrub_rbio() time,
+	 * the bio would got proper error number set.
+	 */
+	ret = alloc_rbio_data_pages(rbio);
+	if (ret < 0)
+		return;
+
+	/* @data_logical must be at stripe boundary and inside the full stripe. */
+	ASSERT(IS_ALIGNED(offset_in_full_stripe, BTRFS_STRIPE_LEN));
+	ASSERT(offset_in_full_stripe < (rbio->nr_data << BTRFS_STRIPE_LEN_SHIFT));
+
+	for (int page_nr = 0; page_nr < (BTRFS_STRIPE_LEN >> PAGE_SHIFT);
+	     page_nr++) {
+		struct page *dst = rbio->stripe_pages[page_nr + page_index];
+		struct page *src = data_pages[page_nr];
+
+		memcpy_page(dst, 0, src, 0, PAGE_SIZE);
+		for (int sector_nr = sectors_per_page * page_index;
+		     sector_nr < sectors_per_page * (page_index + 1);
+		     sector_nr++)
+			rbio->stripe_sectors[sector_nr].uptodate = true;
+	}
+}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 0f7f31c8cb98..0e84c9c9293f 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -193,6 +193,9 @@ struct btrfs_raid_bio *raid56_parity_alloc_scrub_rbio(struct bio *bio,
 				unsigned long *dbitmap, int stripe_nsectors);
 void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio);
 
+void raid56_parity_cache_data_pages(struct btrfs_raid_bio *rbio,
+				    struct page **data_pages, u64 data_logical);
+
 int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info);
 void btrfs_free_stripe_hash_table(struct btrfs_fs_info *info);
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 22ce3a628eb5..33c49a563986 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1944,6 +1944,13 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
 		btrfs_bio_counter_dec(fs_info);
 		goto out;
 	}
+	/* Use the recovered stripes as cache to avoid read them from disk again. */
+	for (int i = 0; i < data_stripes; i++) {
+		stripe = &sctx->raid56_data_stripes[i];
+
+		raid56_parity_cache_data_pages(rbio, stripe->pages,
+				full_stripe_start + (i << BTRFS_STRIPE_LEN_SHIFT));
+	}
 	raid56_parity_submit_scrub_rbio(rbio);
 	wait_for_completion_io(&io_done);
 	ret = blk_status_to_errno(bio->bi_status);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH 0/2] btrfs: reduce the duplicated reads during P/Q scrub
  2023-04-13  5:57 [PATCH 0/2] btrfs: reduce the duplicated reads during P/Q scrub Qu Wenruo
  2023-04-13  5:57 ` [PATCH 1/2] btrfs: scrub: try harder to mark RAID56 block groups read-only Qu Wenruo
  2023-04-13  5:57 ` [PATCH 2/2] btrfs: scrub: use recovered data stripes as cache to avoid unnecessary read Qu Wenruo
@ 2023-04-17 22:51 ` David Sterba
  2 siblings, 0 replies; 4+ messages in thread
From: David Sterba @ 2023-04-17 22:51 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Apr 13, 2023 at 01:57:16PM +0800, Qu Wenruo wrote:
> [PROBLEM]
> It's a known problem that btrfs scrub for RAID56 is pretty slow.
> 
> [CAUSE]
> One of the causes is that we read the same data stripes at least twice
> during P/Q stripes scrub.
> 
> This means, during a full fs scrub (one scrub process for each device),
> there will be quite some extra seeks just because of this.
> 
> [FIX]
> The truth is, scrub stripes have a much better view of the data stripes.
> As btrfs would firstly verify all data stripes, and only continue
> scrubing the P/Q stripes if all the data stripes is fine after needed
> repair.
> 
> So this means, as long as there is no new RMW writes into the RAID56
> block group, we can re-use the scrub cache for P/Q verification.
> 
> This patchset would fix it by:
> 
> - Ensure the RAID56 block groups are marked read-only for scrub
>   This is to avoid RMW in to the block group, or scrub cache is no
>   longer reliable.
> 
> - Introduce a new interface to pass cached pages to RAID56 cache
>   The only disadvantage is here we still need to do page copy, due to
>   the uncertain lifespan of an rbio.
> 
> Qu Wenruo (2):
>   btrfs: scrub: try harder to mark RAID56 block groups read-only
>   btrfs: scrub: use recovered data stripes as cache to avoid unnecessary
>     read

Added to misc-next, thanks. I'd like to batch it with the rest of the
scrub but we also need to let it test for a while it so it'll be part of
some rc.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-04-17 22:51 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-13  5:57 [PATCH 0/2] btrfs: reduce the duplicated reads during P/Q scrub Qu Wenruo
2023-04-13  5:57 ` [PATCH 1/2] btrfs: scrub: try harder to mark RAID56 block groups read-only Qu Wenruo
2023-04-13  5:57 ` [PATCH 2/2] btrfs: scrub: use recovered data stripes as cache to avoid unnecessary read Qu Wenruo
2023-04-17 22:51 ` [PATCH 0/2] btrfs: reduce the duplicated reads during P/Q scrub David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.