[PATCH 0/3] btrfs: raid56: make raid56 to use more accurate error bitmap for error detection

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

* [PATCH 0/3] btrfs: raid56: make raid56 to use more accurate error bitmap for error detection
@ 2022-11-07  7:32 Qu Wenruo
  2022-11-07  7:32 ` [PATCH 1/3] btrfs: raid56: introduce btrfs_raid_bio::error_bitmap Qu Wenruo
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Qu Wenruo @ 2022-11-07  7:32 UTC (permalink / raw)
  To: linux-btrfs

Currently btrfs raid56 uses stripe based error detection.

This means, any error (which vary from single sector csum mismatch, to a
missing device) will mark the whole horizontal stripe as error.

This can lead to some unexpected behavior, for example:

             0        32K       64K
     Data 1  |XXXXXXXX|         |
     Data 2  |        |XXXXXXXXX|
     Parity  |        |         |

When reading data 1 [0, 32K), we got csum mismatch and go RAID56
recovery path.

If going the old path, we will mark the whole data 1 [0, 64K) all as
error, and recover using data 2 and parity.

But since data 2 [32K, 64K) is also corrupted, the recovered data will
also be corrupted.

Thankfully such problem will be mostly avoided after commit f6065f8edeb2
("btrfs: raid56: don't trust any cached sector in
__raid56_parity_recover()"), as when we read the sectors in data 2 [32K,
64K), we will recover discarding all the cached result.

This patchset will change the behavior by introducing an error bitmap,
recording corrupted sector one by one, so for above case, at least we
won't try to recover data 1 [32K, 64K) using incorrect data.

The true solution to this destructive RMW problem will be read time csum
verification, but this patchset introduces the basis to handle extra
csum mismatch error better (csum mismatch will also be marked as error,
but only for the offending sectors).

This patchset itself doesn't improve the raid56 destructive RMW
situation by itself, but would make later destructive RMW fix much
easier to implement.

Qu Wenruo (3):
  btrfs: raid56: introduce btrfs_raid_bio::error_bitmap
  btrfs: raid56: migrate recovery and scrub recovery path to use
    error_bitmap
  btrfs: raid56: remove the old error tracing system

 fs/btrfs/raid56.c | 572 ++++++++++++++++++++++++++--------------------
 fs/btrfs/raid56.h |  19 +-
 2 files changed, 334 insertions(+), 257 deletions(-)

-- 
2.38.1

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/3] btrfs: raid56: introduce btrfs_raid_bio::error_bitmap
  2022-11-07  7:32 [PATCH 0/3] btrfs: raid56: make raid56 to use more accurate error bitmap for error detection Qu Wenruo
@ 2022-11-07  7:32 ` Qu Wenruo
  2022-11-07 17:03   ` David Sterba
  2022-11-07  7:32 ` [PATCH 2/3] btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap Qu Wenruo
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2022-11-07  7:32 UTC (permalink / raw)
  To: linux-btrfs

Currently btrfs raid56 uses btrfs_raid_bio::faila and failb to indicate
which stripe(s) had IO errors.

But that has some problems:

- If one sector failed csum check, the whole stripe where the corruption
  is will be marked error.
  This can reduce the chance we do recover, like this:

          0  4K 8K
  Data 1  |XX|  |
  Data 2  |  |XX|
  Parity  |  |  |

  In above case, 0~4K in data 1 should be recovered using data 2 and
  parity, while 4K~8K in data 2 should be recovered using data 1 and
  parity.

  Currently if we trigger read on 0~4K of data 1, we will also recover
  4K~8K of data 1 using corrupted data 2 and parity, causing wrong
  result in rbio cache.

- Harder to expand for future M-N scheme
  As we're limited to just faila/b, two corruptions.

- Harder to expand to handle extra csum errors
  This can be problematic if we start to do csum verification.

This patch will introduce an extra @error_bitmap, where one bit
represent error has happened for that sector.

The choice to introduce a new error bitmap other than reusing
sector_ptr, is to avoid extra search between rbio::stripe_sectors[] and
rbio::bio_sectors[].

Since we can submit bio using sectors from both sectors, doing proper
search on both array will more complex.

Although the new bitmap will take extra memory, later we can remove
things like @error and @faila/b to save some memory.

Currently the new error bitmap and failab mechanism is co-existing,
the error bitmap is only updated at endio time and recover entrance.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 104 ++++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/raid56.h |  11 +++++
 2 files changed, 108 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 143c115c977e..e5c630a0e8e9 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -76,6 +76,7 @@ static void scrub_rbio_work_locked(struct work_struct *work);
 
 static void free_raid_bio_pointers(struct btrfs_raid_bio *rbio)
 {
+	bitmap_free(rbio->error_bitmap);
 	kfree(rbio->stripe_pages);
 	kfree(rbio->bio_sectors);
 	kfree(rbio->stripe_sectors);
@@ -950,9 +951,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
 	rbio->stripe_sectors = kcalloc(num_sectors, sizeof(struct sector_ptr),
 				       GFP_NOFS);
 	rbio->finish_pointers = kcalloc(real_stripes, sizeof(void *), GFP_NOFS);
+	rbio->error_bitmap = bitmap_zalloc(num_sectors, GFP_NOFS);
 
 	if (!rbio->stripe_pages || !rbio->bio_sectors || !rbio->stripe_sectors ||
-	    !rbio->finish_pointers) {
+	    !rbio->finish_pointers || !rbio->error_bitmap) {
 		free_raid_bio_pointers(rbio);
 		kfree(rbio);
 		return ERR_PTR(-ENOMEM);
@@ -1044,8 +1046,11 @@ static int rbio_add_io_sector(struct btrfs_raid_bio *rbio,
 	disk_start = stripe->physical + sector_nr * sectorsize;
 
 	/* if the device is missing, just fail this stripe */
-	if (!stripe->dev->bdev)
+	if (!stripe->dev->bdev) {
+		set_bit(stripe_nr * rbio->stripe_nsectors + sector_nr,
+			rbio->error_bitmap);
 		return fail_rbio_index(rbio, stripe_nr);
+	}
 
 	/* see if we can add this page onto our existing bio */
 	if (last) {
@@ -1209,6 +1214,7 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
 	 * write.
 	 */
 	atomic_set(&rbio->error, 0);
+	bitmap_clear(rbio->error_bitmap, 0, rbio->nr_sectors);
 	rbio->faila = -1;
 	rbio->failb = -1;
 
@@ -1332,6 +1338,43 @@ static int find_logical_bio_stripe(struct btrfs_raid_bio *rbio,
 	return -1;
 }
 
+static void set_rbio_range_error(struct btrfs_raid_bio *rbio,
+				  struct bio *bio)
+{
+	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
+	u32 offset = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+		     rbio->bioc->raid_map[0];
+	int total_nr_sector = offset >> fs_info->sectorsize_bits;
+
+	ASSERT(total_nr_sector < rbio->nr_data * rbio->stripe_nsectors);
+
+	bitmap_set(rbio->error_bitmap, total_nr_sector,
+		   bio->bi_iter.bi_size >> fs_info->sectorsize_bits);
+
+	/*
+	 * Special handling for raid56_alloc_missing_rbio() used by
+	 * scrub/replace.
+	 * Unlike call path in raid56_parity_recover(), they pass an
+	 * empty bio here.
+	 * Thus we have to find out the missing device and mark the stripe
+	 * error instead.
+	 */
+	if (bio->bi_iter.bi_size == 0) {
+		bool found_missing = false;
+		int stripe_nr;
+
+		for (stripe_nr = 0; stripe_nr < rbio->real_stripes; stripe_nr++) {
+			if (!rbio->bioc->stripes[stripe_nr].dev->bdev) {
+				found_missing = true;
+				bitmap_set(rbio->error_bitmap,
+					   stripe_nr * rbio->stripe_nsectors,
+					   rbio->stripe_nsectors);
+			}
+		}
+		ASSERT(found_missing);
+	}
+}
+
 /*
  * returns -EIO if we had too many failures
  */
@@ -1423,13 +1466,50 @@ static void set_bio_pages_uptodate(struct btrfs_raid_bio *rbio, struct bio *bio)
 	}
 }
 
+static int get_bio_sector_nr(struct btrfs_raid_bio *rbio, struct bio *bio)
+{
+	struct bio_vec *bv = bio_first_bvec_all(bio);
+	int i;
+
+	for (i = 0; i < rbio->nr_sectors; i++) {
+		struct sector_ptr *sector;
+
+		sector = &rbio->stripe_sectors[i];
+		if (sector->page == bv->bv_page &&
+		    sector->pgoff == bv->bv_offset)
+			break;
+		sector = &rbio->bio_sectors[i];
+		if (sector->page == bv->bv_page &&
+		    sector->pgoff == bv->bv_offset)
+			break;
+	}
+	ASSERT(i < rbio->nr_sectors);
+	return i;
+}
+
+static void rbio_update_error_bitmap(struct btrfs_raid_bio *rbio,
+				     struct bio *bio)
+{
+	int total_sector_nr = get_bio_sector_nr(rbio, bio);
+	int bio_size = 0;
+	struct bio_vec *bvec;
+	struct bvec_iter_all iter_all;
+
+	bio_for_each_segment_all(bvec, bio, iter_all)
+		bio_size += bvec->bv_len;
+
+	bitmap_set(rbio->error_bitmap, total_sector_nr,
+		   bio_size >> rbio->bioc->fs_info->sectorsize_bits);
+}
+
 static void raid_wait_read_end_io(struct bio *bio)
 {
 	struct btrfs_raid_bio *rbio = bio->bi_private;
 
-	if (bio->bi_status)
+	if (bio->bi_status) {
 		fail_bio_stripe(rbio, bio);
-	else
+		rbio_update_error_bitmap(rbio, bio);
+	} else
 		set_bio_pages_uptodate(rbio, bio);
 
 	bio_put(bio);
@@ -1863,10 +1943,10 @@ static int recover_assemble_read_bios(struct btrfs_raid_bio *rbio,
 		struct sector_ptr *sector;
 
 		if (rbio->faila == stripe || rbio->failb == stripe) {
-			atomic_inc(&rbio->error);
 			/* Skip the current stripe. */
 			ASSERT(sectornr == 0);
 			total_sector_nr += rbio->stripe_nsectors - 1;
+			atomic_inc(&rbio->error);
 			continue;
 		}
 		sector = rbio_stripe_sector(rbio, stripe, sectornr);
@@ -1891,9 +1971,10 @@ static int recover_rbio(struct btrfs_raid_bio *rbio)
 
 	/*
 	 * Either we're doing recover for a read failure or degraded write,
-	 * caller should have set faila/b correctly.
+	 * caller should have set faila/b and error bitmap correctly.
 	 */
 	ASSERT(rbio->faila >= 0 || rbio->failb >= 0);
+	ASSERT(bitmap_weight(rbio->error_bitmap, rbio->nr_sectors));
 	bio_list_init(&bio_list);
 
 	/*
@@ -1978,6 +2059,8 @@ void raid56_parity_recover(struct bio *bio, struct btrfs_io_context *bioc,
 	rbio->operation = BTRFS_RBIO_READ_REBUILD;
 	rbio_add_bio(rbio, bio);
 
+	set_rbio_range_error(rbio, bio);
+
 	rbio->faila = find_logical_bio_stripe(rbio, bio);
 	if (rbio->faila == -1) {
 		btrfs_warn(fs_info,
@@ -2038,8 +2121,10 @@ static void raid_wait_write_end_io(struct bio *bio)
 	struct btrfs_raid_bio *rbio = bio->bi_private;
 	blk_status_t err = bio->bi_status;
 
-	if (err)
+	if (err) {
 		fail_bio_stripe(rbio, bio);
+		rbio_update_error_bitmap(rbio, bio);
+	}
 	bio_put(bio);
 	atomic_dec(&rbio->stripes_pending);
 	wake_up(&rbio->io_wait);
@@ -2117,6 +2202,7 @@ static int rmw_rbio(struct btrfs_raid_bio *rbio)
 	spin_unlock_irq(&rbio->bio_list_lock);
 
 	atomic_set(&rbio->error, 0);
+	bitmap_clear(rbio->error_bitmap, 0, rbio->nr_sectors);
 
 	index_rbio_pages(rbio);
 
@@ -2328,6 +2414,7 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio, int need_check)
 	}
 
 	atomic_set(&rbio->error, 0);
+	bitmap_clear(rbio->error_bitmap, 0, rbio->nr_sectors);
 
 	/* Map the parity stripe just once */
 	pointers[nr_data] = kmap_local_page(p_sector.page);
@@ -2533,6 +2620,8 @@ static int scrub_rbio(struct btrfs_raid_bio *rbio)
 		goto cleanup;
 
 	atomic_set(&rbio->error, 0);
+	bitmap_clear(rbio->error_bitmap, 0, rbio->nr_sectors);
+
 	ret = scrub_assemble_read_bios(rbio, &bio_list);
 	if (ret < 0)
 		goto cleanup;
@@ -2612,6 +2701,7 @@ raid56_alloc_missing_rbio(struct bio *bio, struct btrfs_io_context *bioc)
 	 */
 	ASSERT(!bio->bi_iter.bi_size);
 
+	set_rbio_range_error(rbio, bio);
 	rbio->faila = find_logical_bio_stripe(rbio, bio);
 	if (rbio->faila == -1) {
 		btrfs_warn_rl(fs_info,
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 9fae97b7a2a5..e38da4fa76d6 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -126,6 +126,17 @@ struct btrfs_raid_bio {
 
 	/* Allocated with real_stripes-many pointers for finish_*() calls */
 	void **finish_pointers;
+
+	/*
+	 * The bitmap recording where IO errors happened.
+	 * Each bit is corresponding to one sector in either bio_sectors[] or
+	 * stripe_sectors[] array.
+	 *
+	 * The reason we don't use another bit in sector_ptr is, we have two
+	 * arrays of sectors, and a lot of IO can use sectors in both arrays.
+	 * Thus making it much harder to iterate.
+	 */
+	unsigned long *error_bitmap;
 };
 
 /*
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/3] btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap
  2022-11-07  7:32 [PATCH 0/3] btrfs: raid56: make raid56 to use more accurate error bitmap for error detection Qu Wenruo
  2022-11-07  7:32 ` [PATCH 1/3] btrfs: raid56: introduce btrfs_raid_bio::error_bitmap Qu Wenruo
@ 2022-11-07  7:32 ` Qu Wenruo
  2022-11-07  7:32 ` [PATCH 3/3] btrfs: raid56: remove the old error tracing system Qu Wenruo
  2022-11-07 17:04 ` [PATCH 0/3] btrfs: raid56: make raid56 to use more accurate error bitmap for error detection David Sterba
  3 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2022-11-07  7:32 UTC (permalink / raw)
  To: linux-btrfs

Since we have rbio::error_bitmap to indicate exactly where the errors
are (including read error and csum mismatch error), we can make recovery
path more accurate.

For example:

             0        32K       64K
     Data 1  |XXXXXXXX|         |
     Data 2  |        |XXXXXXXXX|
     Parity  |        |         |

1) Get csum mismatch when reading data 1 [0, 32K)

2) Mark corresponding range error
   The old code will mark the whole data 1 stripe as error.
   While the new code will only mark data 1 [0, 32K) as error.

3) Recovery path
   The old code will recover data 1 [0, 64K), all using Data 2 and
   parity.

   This means, Data 1 [32K, 64K) will be corrupted data, as data 2
   [32K, 64K) is already corrupted.

   While the new code will only recover data 1 [0, 32K), as only
   that range has error so far.

This new behavior can avoid populating rbio cache with incorrect data.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 281 ++++++++++++++++++++++++++++++++--------------
 1 file changed, 195 insertions(+), 86 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index e5c630a0e8e9..800213ab44a1 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1013,6 +1013,37 @@ static int alloc_rbio_parity_pages(struct btrfs_raid_bio *rbio)
 	return 0;
 }
 
+/*
+ * Return the total errors found in the vertical stripe of @sector_nr.
+ *
+ * @faila and @failb will also be updated to the first and second stripe
+ * number of the errors.
+ */
+static int get_rbio_veritical_errors(struct btrfs_raid_bio *rbio, int sector_nr,
+				     int *faila, int *failb)
+{
+	int stripe_nr;
+	int found_errors = 0;
+
+	ASSERT(faila && failb);
+	*faila = -1;
+	*failb = -1;
+
+	for (stripe_nr = 0; stripe_nr < rbio->real_stripes; stripe_nr++) {
+		int total_sector_nr = stripe_nr * rbio->stripe_nsectors +
+				      sector_nr;
+
+		if (test_bit(total_sector_nr, rbio->error_bitmap)) {
+			found_errors++;
+			if (*faila < 0)
+				*faila = stripe_nr;
+			else if (*failb < 0)
+				*failb = stripe_nr;
+		}
+	}
+	return found_errors;
+}
+
 /*
  * Add a single sector @sector into our list of bios for IO.
  *
@@ -1745,14 +1776,15 @@ void raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc)
  * @*pointers are the pre-allocated pointers by the caller, so we don't
  * need to allocate/free the pointers again and again.
  */
-static void recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
-			     void **pointers, void **unmap_array)
+static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
+			    void **pointers, void **unmap_array)
 {
 	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
 	struct sector_ptr *sector;
 	const u32 sectorsize = fs_info->sectorsize;
-	const int faila = rbio->faila;
-	const int failb = rbio->failb;
+	int found_errors;
+	int faila;
+	int failb;
 	int stripe_nr;
 
 	/*
@@ -1761,7 +1793,19 @@ static void recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
 	 */
 	if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB &&
 	    !test_bit(sector_nr, &rbio->dbitmap))
-		return;
+		return 0;
+
+	found_errors = get_rbio_veritical_errors(rbio, sector_nr, &faila,
+						 &failb);
+	/*
+	 * No errors in the veritical stripe, skip it.
+	 * Can happen for recovery which only part of a stripe failed csum check.
+	 */
+	if (!found_errors)
+		return 0;
+
+	if (found_errors > rbio->bioc->max_errors)
+		return -EIO;
 
 	/*
 	 * Setup our array of pointers with sectors from each stripe
@@ -1772,11 +1816,10 @@ static void recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
 	for (stripe_nr = 0; stripe_nr < rbio->real_stripes; stripe_nr++) {
 		/*
 		 * If we're rebuilding a read, we have to use
-		 * pages from the bio list
+		 * pages from the bio list if possible.
 		 */
 		if ((rbio->operation == BTRFS_RBIO_READ_REBUILD ||
-		     rbio->operation == BTRFS_RBIO_REBUILD_MISSING) &&
-		    (stripe_nr == faila || stripe_nr == failb)) {
+		     rbio->operation == BTRFS_RBIO_REBUILD_MISSING)) {
 			sector = sector_in_rbio(rbio, stripe_nr, sector_nr, 0);
 		} else {
 			sector = rbio_stripe_sector(rbio, stripe_nr, sector_nr);
@@ -1864,18 +1907,19 @@ static void recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
 	 * Especially if we determine to cache the rbio, we need to
 	 * have at least all data sectors uptodate.
 	 */
-	if (rbio->faila >= 0) {
-		sector = rbio_stripe_sector(rbio, rbio->faila, sector_nr);
+	if (faila >= 0) {
+		sector = rbio_stripe_sector(rbio, faila, sector_nr);
 		sector->uptodate = 1;
 	}
-	if (rbio->failb >= 0) {
-		sector = rbio_stripe_sector(rbio, rbio->failb, sector_nr);
+	if (failb >= 0) {
+		sector = rbio_stripe_sector(rbio, failb, sector_nr);
 		sector->uptodate = 1;
 	}
 
 cleanup:
 	for (stripe_nr = rbio->real_stripes - 1; stripe_nr >= 0; stripe_nr--)
 		kunmap_local(unmap_array[stripe_nr]);
+	return 0;
 }
 
 static int recover_sectors(struct btrfs_raid_bio *rbio)
@@ -1898,10 +1942,6 @@ static int recover_sectors(struct btrfs_raid_bio *rbio)
 		goto out;
 	}
 
-	/* Make sure faila and fail b are in order. */
-	if (rbio->faila >= 0 && rbio->failb >= 0 && rbio->faila > rbio->failb)
-		swap(rbio->faila, rbio->failb);
-
 	if (rbio->operation == BTRFS_RBIO_READ_REBUILD ||
 	    rbio->operation == BTRFS_RBIO_REBUILD_MISSING) {
 		spin_lock_irq(&rbio->bio_list_lock);
@@ -1911,8 +1951,11 @@ static int recover_sectors(struct btrfs_raid_bio *rbio)
 
 	index_rbio_pages(rbio);
 
-	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++)
-		recover_vertical(rbio, sectornr, pointers, unmap_array);
+	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++) {
+		ret = recover_vertical(rbio, sectornr, pointers, unmap_array);
+		if (ret < 0)
+			break;
+	}
 
 out:
 	kfree(pointers);
@@ -1942,13 +1985,21 @@ static int recover_assemble_read_bios(struct btrfs_raid_bio *rbio,
 		int sectornr = total_sector_nr % rbio->stripe_nsectors;
 		struct sector_ptr *sector;
 
-		if (rbio->faila == stripe || rbio->failb == stripe) {
-			/* Skip the current stripe. */
-			ASSERT(sectornr == 0);
-			total_sector_nr += rbio->stripe_nsectors - 1;
-			atomic_inc(&rbio->error);
+		/*
+		 * Skip the range which has error.
+		 * It can be a range which is marked error (for csum mismatch),
+		 * or it can be a missing device.
+		 */
+		if (!rbio->bioc->stripes[stripe].dev->bdev ||
+		    test_bit(total_sector_nr, rbio->error_bitmap)) {
+			/*
+			 * Also set the error bit for missing device, which
+			 * may not yet have its error bit set.
+			 */
+			set_bit(total_sector_nr, rbio->error_bitmap);
 			continue;
 		}
+
 		sector = rbio_stripe_sector(rbio, stripe, sectornr);
 		ret = rbio_add_io_sector(rbio, bio_list, sector, stripe,
 					 sectornr, REQ_OP_READ);
@@ -1971,9 +2022,8 @@ static int recover_rbio(struct btrfs_raid_bio *rbio)
 
 	/*
 	 * Either we're doing recover for a read failure or degraded write,
-	 * caller should have set faila/b and error bitmap correctly.
+	 * caller should have set error bitmap correctly.
 	 */
-	ASSERT(rbio->faila >= 0 || rbio->failb >= 0);
 	ASSERT(bitmap_weight(rbio->error_bitmap, rbio->nr_sectors));
 	bio_list_init(&bio_list);
 
@@ -1997,12 +2047,6 @@ static int recover_rbio(struct btrfs_raid_bio *rbio)
 	submit_read_bios(rbio, &bio_list);
 	wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
 
-	/* We have more errors than our tolerance during the read. */
-	if (atomic_read(&rbio->error) > rbio->bioc->max_errors) {
-		ret = -EIO;
-		goto out;
-	}
-
 	ret = recover_sectors(rbio);
 
 out:
@@ -2037,6 +2081,52 @@ static void recover_rbio_work_locked(struct work_struct *work)
 	rbio_orig_end_io(rbio, errno_to_blk_status(ret));
 }
 
+static void set_rbio_raid6_extra_error(struct btrfs_raid_bio *rbio,
+				       int mirror_num)
+{
+	bool found = false;
+	int sector_nr;
+
+	/*
+	 * This is for RAID6 extra recovery tries, thus mirror number should
+	 * be large than 2.
+	 * Mirror 1 means read from data stripes. Mirror 2 means rebuild using
+	 * RAID5 methods.
+	 */
+	ASSERT(mirror_num > 2);
+	for (sector_nr = 0; sector_nr < rbio->stripe_nsectors; sector_nr++) {
+		int found_errors;
+		int faila;
+		int failb;
+
+		found_errors = get_rbio_veritical_errors(rbio, sector_nr,
+							 &faila, &failb);
+		/* This vertical stripe doesn't have errors. */
+		if (!found_errors)
+			continue;
+
+		/*
+		 * If we found errors, there should be only one error marked
+		 * by previous set_rbio_range_error().
+		 */
+		ASSERT(found_errors == 1);
+		found = true;
+
+		/* Now select another stripe to mark as error. */
+		failb = rbio->real_stripes - (mirror_num - 1);
+		if (failb <= faila)
+			failb--;
+
+		/* Set the extra bit in error bitmap. */
+		if (failb >= 0)
+			set_bit(failb * rbio->stripe_nsectors + sector_nr,
+				rbio->error_bitmap);
+	}
+
+	/* We should found at least one vertical stripe with error.*/
+	ASSERT(found);
+}
+
 /*
  * the main entry point for reads from the higher layers.  This
  * is really only called when the normal read path had a failure,
@@ -2079,11 +2169,7 @@ void raid56_parity_recover(struct bio *bio, struct btrfs_io_context *bioc,
 	 * for 'mirror_num > 2', select a stripe to fail on every retry.
 	 */
 	if (mirror_num > 2) {
-		/*
-		 * 'mirror == 3' is to fail the p stripe and
-		 * reconstruct from the q stripe.  'mirror > 3' is to
-		 * fail a data stripe and reconstruct from p+q stripe.
-		 */
+		set_rbio_raid6_extra_error(rbio, mirror_num);
 		rbio->failb = rbio->real_stripes - (mirror_num - 1);
 		ASSERT(rbio->failb > 0);
 		if (rbio->failb <= rbio->faila)
@@ -2512,48 +2598,85 @@ static inline int is_data_stripe(struct btrfs_raid_bio *rbio, int stripe)
 
 static int recover_scrub_rbio(struct btrfs_raid_bio *rbio)
 {
-	int dfail = 0, failp = -1;
+	void **pointers = NULL;
+	void **unmap_array = NULL;
+	int sector_nr;
 	int ret;
 
-	/* No error case should be already handled by the caller. */
-	ASSERT(rbio->faila >= 0 || rbio->failb >= 0);
-
-	if (is_data_stripe(rbio, rbio->faila))
-		dfail++;
-	else if (is_parity_stripe(rbio->faila))
-		failp = rbio->faila;
-
-	if (is_data_stripe(rbio, rbio->failb))
-		dfail++;
-	else if (is_parity_stripe(rbio->failb))
-		failp = rbio->failb;
-
 	/*
-	 * Because we can not use a scrubbing parity to repair
-	 * the data, so the capability of the repair is declined.
-	 * (In the case of RAID5, we can not repair anything)
+	 * @pointers array stores the pointer for each sector.
+	 *
+	 * @unmap_array stores copy of pointers that does not get reordered
+	 * during reconstruction so that kunmap_local works.
 	 */
-	if (dfail > rbio->bioc->max_errors - 1)
-		return -EIO;
+	pointers = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
+	unmap_array = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
+	if (!pointers || !unmap_array) {
+		ret = -ENOMEM;
+		goto out;
+	}
 
-	/*
-	 * If all data is good, only parity is correctly, just
-	 * repair the parity.
-	 */
-	if (dfail == 0)
-		return 0;
+	for (sector_nr = 0; sector_nr < rbio->stripe_nsectors; sector_nr++) {
+		int dfail = 0, failp = -1;
+		int faila;
+		int failb;
+		int found_errors;
 
-	/*
-	 * Here means we got one corrupted data stripe and one
-	 * corrupted parity on RAID6, if the corrupted parity
-	 * is scrubbing parity, luckily, use the other one to repair
-	 * the data, or we can not repair the data stripe.
-	 */
-	if (failp != rbio->scrubp)
-		return -EIO;
+		found_errors = get_rbio_veritical_errors(rbio, sector_nr,
+							 &faila, &failb);
+		if (found_errors > rbio->bioc->max_errors) {
+			ret = -EIO;
+			goto out;
+		}
+		if (found_errors == 0)
+			continue;
 
-	/* We have some corrupted sectors, need to repair them. */
-	ret = recover_sectors(rbio);
+		/* We should have at least one error here. */
+		ASSERT(faila >= 0 || failb >= 0);
+
+		if (is_data_stripe(rbio, faila))
+			dfail++;
+		else if (is_parity_stripe(faila))
+			failp = faila;
+
+		if (is_data_stripe(rbio, failb))
+			dfail++;
+		else if (is_parity_stripe(failb))
+			failp = failb;
+		/*
+		 * Because we can not use a scrubbing parity to repair
+		 * the data, so the capability of the repair is declined.
+		 * (In the case of RAID5, we can not repair anything)
+		 */
+		if (dfail > rbio->bioc->max_errors - 1) {
+			ret = -EIO;
+			goto out;
+		}
+		/*
+		 * If all data is good, only parity is correctly, just
+		 * repair the parity, no need to recover data stripes.
+		 */
+		if (dfail == 0)
+			continue;
+
+		/*
+		 * Here means we got one corrupted data stripe and one
+		 * corrupted parity on RAID6, if the corrupted parity
+		 * is scrubbing parity, luckily, use the other one to repair
+		 * the data, or we can not repair the data stripe.
+		 */
+		if (failp != rbio->scrubp) {
+			ret = -EIO;
+			goto out;
+		}
+
+		ret = recover_vertical(rbio, sector_nr, pointers, unmap_array);
+		if (ret < 0)
+			goto out;
+	}
+out:
+	kfree(pointers);
+	kfree(unmap_array);
 	return ret;
 }
 
@@ -2629,25 +2752,11 @@ static int scrub_rbio(struct btrfs_raid_bio *rbio)
 	submit_read_bios(rbio, &bio_list);
 	wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
 
-	if (atomic_read(&rbio->error) > rbio->bioc->max_errors) {
-		ret = -EIO;
-		goto cleanup;
-	}
-	/*
-	 * No error during read, can finish the scrub and need to verify the
-	 * P/Q sectors;
-	 */
-	if (atomic_read(&rbio->error) == 0) {
-		need_check = true;
-		goto finish;
-	}
-
-	/* We have some failures, need to recover the failed sectors first. */
+	/* We may have some failures, recover the failed sectors first. */
 	ret = recover_scrub_rbio(rbio);
 	if (ret < 0)
 		goto cleanup;
 
-finish:
 	/*
 	 * We have every sector properly prepared. Can finish the scrub
 	 * and writeback the good content.
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 3/3] btrfs: raid56: remove the old error tracing system
  2022-11-07  7:32 [PATCH 0/3] btrfs: raid56: make raid56 to use more accurate error bitmap for error detection Qu Wenruo
  2022-11-07  7:32 ` [PATCH 1/3] btrfs: raid56: introduce btrfs_raid_bio::error_bitmap Qu Wenruo
  2022-11-07  7:32 ` [PATCH 2/3] btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap Qu Wenruo
@ 2022-11-07  7:32 ` Qu Wenruo
  2022-11-07 17:04 ` [PATCH 0/3] btrfs: raid56: make raid56 to use more accurate error bitmap for error detection David Sterba
  3 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2022-11-07  7:32 UTC (permalink / raw)
  To: linux-btrfs

Since all the recovery path has migrated to the new error bitmap based
system, we can remove the old stripe number based system.

This cleanup involves one behavior change:

- Rebuild rbio can no longer be merged
  Previously a rebuild rbio (caused by retry after data csum mismatch)
  can be merged, if the the error happens in the same stripe.

  But with the new error bitmap based solution, it's much harder to
  compare error bitmaps.

  So here we just don't merge rebuild rbio at all.
  This may introduce some performance impact at extreme corner cases,
  but I'm willing to take it.

Other than that, this patch will cleanup the following members:

- rbio::faila
- rbio::failb
  They will be replaced by per-veritical stripe check, which is more
  accurate.

- rbio::error
  It will be replace by per-vertical stripe error bitmap check.

- Allow get_rbio_vertical_errors() to accept NULL pointers for
  @faila and @failb
  Some call sites only want to check if we have errors beyond the
  tolerance.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 235 +++++++++++-----------------------------------
 fs/btrfs/raid56.h |   8 --
 2 files changed, 55 insertions(+), 188 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 800213ab44a1..73dce024a25e 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -66,8 +66,6 @@ struct sector_ptr {
 
 static void rmw_rbio_work(struct work_struct *work);
 static void rmw_rbio_work_locked(struct work_struct *work);
-static int fail_bio_stripe(struct btrfs_raid_bio *rbio, struct bio *bio);
-static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed);
 static void index_rbio_pages(struct btrfs_raid_bio *rbio);
 static int alloc_rbio_pages(struct btrfs_raid_bio *rbio);
 
@@ -588,28 +586,10 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
 	if (last->operation == BTRFS_RBIO_PARITY_SCRUB)
 		return 0;
 
-	if (last->operation == BTRFS_RBIO_REBUILD_MISSING)
+	if (last->operation == BTRFS_RBIO_REBUILD_MISSING ||
+	    last->operation == BTRFS_RBIO_READ_REBUILD)
 		return 0;
 
-	if (last->operation == BTRFS_RBIO_READ_REBUILD) {
-		int fa = last->faila;
-		int fb = last->failb;
-		int cur_fa = cur->faila;
-		int cur_fb = cur->failb;
-
-		if (last->faila >= last->failb) {
-			fa = last->failb;
-			fb = last->faila;
-		}
-
-		if (cur->faila >= cur->failb) {
-			cur_fa = cur->failb;
-			cur_fb = cur->faila;
-		}
-
-		if (fa != cur_fa || fb != cur_fb)
-			return 0;
-	}
 	return 1;
 }
 
@@ -973,10 +953,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
 	rbio->real_stripes = real_stripes;
 	rbio->stripe_npages = stripe_npages;
 	rbio->stripe_nsectors = stripe_nsectors;
-	rbio->faila = -1;
-	rbio->failb = -1;
 	refcount_set(&rbio->refs, 1);
-	atomic_set(&rbio->error, 0);
 	atomic_set(&rbio->stripes_pending, 0);
 
 	ASSERT(btrfs_nr_parity_stripes(bioc->map_type));
@@ -1025,9 +1002,15 @@ static int get_rbio_veritical_errors(struct btrfs_raid_bio *rbio, int sector_nr,
 	int stripe_nr;
 	int found_errors = 0;
 
-	ASSERT(faila && failb);
-	*faila = -1;
-	*failb = -1;
+	if (faila || failb) {
+		/*
+		 * Both @faila and @failb should be valid pointers if any of them is
+		 * specified.
+		 */
+		ASSERT(faila && failb);
+		*faila = -1;
+		*failb = -1;
+	}
 
 	for (stripe_nr = 0; stripe_nr < rbio->real_stripes; stripe_nr++) {
 		int total_sector_nr = stripe_nr * rbio->stripe_nsectors +
@@ -1035,10 +1018,13 @@ static int get_rbio_veritical_errors(struct btrfs_raid_bio *rbio, int sector_nr,
 
 		if (test_bit(total_sector_nr, rbio->error_bitmap)) {
 			found_errors++;
-			if (*faila < 0)
-				*faila = stripe_nr;
-			else if (*failb < 0)
-				*failb = stripe_nr;
+			if (faila) {
+				/* Update faila and failb. */
+				if (*faila < 0)
+					*faila = stripe_nr;
+				else if (*failb < 0)
+					*failb = stripe_nr;
+			}
 		}
 	}
 	return found_errors;
@@ -1078,9 +1064,17 @@ static int rbio_add_io_sector(struct btrfs_raid_bio *rbio,
 
 	/* if the device is missing, just fail this stripe */
 	if (!stripe->dev->bdev) {
+		int found_errors;
+
 		set_bit(stripe_nr * rbio->stripe_nsectors + sector_nr,
 			rbio->error_bitmap);
-		return fail_rbio_index(rbio, stripe_nr);
+
+		/* Check if we have reached tolerance early. */
+		found_errors = get_rbio_veritical_errors(rbio, sector_nr,
+							 NULL, NULL);
+		if (found_errors > rbio->bioc->max_errors)
+			return -EIO;
+		return 0;
 	}
 
 	/* see if we can add this page onto our existing bio */
@@ -1244,10 +1238,7 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
 	 * Reset errors, as we may have errors inherited from from degraded
 	 * write.
 	 */
-	atomic_set(&rbio->error, 0);
 	bitmap_clear(rbio->error_bitmap, 0, rbio->nr_sectors);
-	rbio->faila = -1;
-	rbio->failb = -1;
 
 	/*
 	 * Start assembly.  Make bios for everything from the higher layers (the
@@ -1325,50 +1316,6 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
 	return -EIO;
 }
 
-/*
- * helper to find the stripe number for a given bio.  Used to figure out which
- * stripe has failed.  This expects the bio to correspond to a physical disk,
- * so it looks up based on physical sector numbers.
- */
-static int find_bio_stripe(struct btrfs_raid_bio *rbio,
-			   struct bio *bio)
-{
-	u64 physical = bio->bi_iter.bi_sector;
-	int i;
-	struct btrfs_io_stripe *stripe;
-
-	physical <<= 9;
-
-	for (i = 0; i < rbio->bioc->num_stripes; i++) {
-		stripe = &rbio->bioc->stripes[i];
-		if (in_range(physical, stripe->physical, BTRFS_STRIPE_LEN) &&
-		    stripe->dev->bdev && bio->bi_bdev == stripe->dev->bdev) {
-			return i;
-		}
-	}
-	return -1;
-}
-
-/*
- * helper to find the stripe number for a given
- * bio (before mapping).  Used to figure out which stripe has
- * failed.  This looks up based on logical block numbers.
- */
-static int find_logical_bio_stripe(struct btrfs_raid_bio *rbio,
-				   struct bio *bio)
-{
-	u64 logical = bio->bi_iter.bi_sector << 9;
-	int i;
-
-	for (i = 0; i < rbio->nr_data; i++) {
-		u64 stripe_start = rbio->bioc->raid_map[i];
-
-		if (in_range(logical, stripe_start, BTRFS_STRIPE_LEN))
-			return i;
-	}
-	return -1;
-}
-
 static void set_rbio_range_error(struct btrfs_raid_bio *rbio,
 				  struct bio *bio)
 {
@@ -1406,52 +1353,6 @@ static void set_rbio_range_error(struct btrfs_raid_bio *rbio,
 	}
 }
 
-/*
- * returns -EIO if we had too many failures
- */
-static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed)
-{
-	unsigned long flags;
-	int ret = 0;
-
-	spin_lock_irqsave(&rbio->bio_list_lock, flags);
-
-	/* we already know this stripe is bad, move on */
-	if (rbio->faila == failed || rbio->failb == failed)
-		goto out;
-
-	if (rbio->faila == -1) {
-		/* first failure on this rbio */
-		rbio->faila = failed;
-		atomic_inc(&rbio->error);
-	} else if (rbio->failb == -1) {
-		/* second failure on this rbio */
-		rbio->failb = failed;
-		atomic_inc(&rbio->error);
-	} else {
-		ret = -EIO;
-	}
-out:
-	spin_unlock_irqrestore(&rbio->bio_list_lock, flags);
-
-	return ret;
-}
-
-/*
- * helper to fail a stripe based on a physical disk
- * bio.
- */
-static int fail_bio_stripe(struct btrfs_raid_bio *rbio,
-			   struct bio *bio)
-{
-	int failed = find_bio_stripe(rbio, bio);
-
-	if (failed < 0)
-		return -EIO;
-
-	return fail_rbio_index(rbio, failed);
-}
-
 /*
  * For subpage case, we can no longer set page Uptodate directly for
  * stripe_pages[], thus we need to locate the sector.
@@ -1537,10 +1438,9 @@ static void raid_wait_read_end_io(struct bio *bio)
 {
 	struct btrfs_raid_bio *rbio = bio->bi_private;
 
-	if (bio->bi_status) {
-		fail_bio_stripe(rbio, bio);
+	if (bio->bi_status)
 		rbio_update_error_bitmap(rbio, bio);
-	} else
+	else
 		set_bio_pages_uptodate(rbio, bio);
 
 	bio_put(bio);
@@ -2027,12 +1927,6 @@ static int recover_rbio(struct btrfs_raid_bio *rbio)
 	ASSERT(bitmap_weight(rbio->error_bitmap, rbio->nr_sectors));
 	bio_list_init(&bio_list);
 
-	/*
-	 * Reset error to 0, as we will later increase error for missing
-	 * devices.
-	 */
-	atomic_set(&rbio->error, 0);
-
 	/* For recovery, we need to read all sectors including P/Q. */
 	ret = alloc_rbio_pages(rbio);
 	if (ret < 0)
@@ -2151,30 +2045,13 @@ void raid56_parity_recover(struct bio *bio, struct btrfs_io_context *bioc,
 
 	set_rbio_range_error(rbio, bio);
 
-	rbio->faila = find_logical_bio_stripe(rbio, bio);
-	if (rbio->faila == -1) {
-		btrfs_warn(fs_info,
-"%s could not find the bad stripe in raid56 so that we cannot recover any more (bio has logical %llu len %llu, bioc has map_type %llu)",
-			   __func__, bio->bi_iter.bi_sector << 9,
-			   (u64)bio->bi_iter.bi_size, bioc->map_type);
-		free_raid_bio(rbio);
-		bio->bi_status = BLK_STS_IOERR;
-		bio_endio(bio);
-		return;
-	}
-
 	/*
 	 * Loop retry:
 	 * for 'mirror == 2', reconstruct from all other stripes.
 	 * for 'mirror_num > 2', select a stripe to fail on every retry.
 	 */
-	if (mirror_num > 2) {
+	if (mirror_num > 2)
 		set_rbio_raid6_extra_error(rbio, mirror_num);
-		rbio->failb = rbio->real_stripes - (mirror_num - 1);
-		ASSERT(rbio->failb > 0);
-		if (rbio->failb <= rbio->faila)
-			rbio->failb--;
-	}
 
 	start_async_work(rbio, recover_rbio_work);
 }
@@ -2186,7 +2063,6 @@ static int rmw_read_and_wait(struct btrfs_raid_bio *rbio)
 	int ret;
 
 	bio_list_init(&bio_list);
-	atomic_set(&rbio->error, 0);
 
 	ret = rmw_assemble_read_bios(rbio, &bio_list);
 	if (ret < 0)
@@ -2207,10 +2083,8 @@ static void raid_wait_write_end_io(struct bio *bio)
 	struct btrfs_raid_bio *rbio = bio->bi_private;
 	blk_status_t err = bio->bi_status;
 
-	if (err) {
-		fail_bio_stripe(rbio, bio);
+	if (err)
 		rbio_update_error_bitmap(rbio, bio);
-	}
 	bio_put(bio);
 	atomic_dec(&rbio->stripes_pending);
 	wake_up(&rbio->io_wait);
@@ -2260,19 +2134,14 @@ static int rmw_rbio(struct btrfs_raid_bio *rbio)
 	if (ret < 0)
 		return ret;
 
-	atomic_set(&rbio->error, 0);
 	index_rbio_pages(rbio);
 
 	ret = rmw_read_and_wait(rbio);
 	if (ret < 0)
 		return ret;
 
-	/* Too many read errors, beyond our tolerance. */
-	if (atomic_read(&rbio->error) > rbio->bioc->max_errors)
-		return ret;
-
-	/* Have read failures but under tolerance, needs recovery. */
-	if (rbio->faila >= 0 || rbio->failb >= 0) {
+	/* We have read errors, try recovery path. */
+	if (!bitmap_empty(rbio->error_bitmap, rbio->nr_sectors)) {
 		ret = recover_rbio(rbio);
 		if (ret < 0)
 			return ret;
@@ -2287,7 +2156,6 @@ static int rmw_rbio(struct btrfs_raid_bio *rbio)
 	set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
 	spin_unlock_irq(&rbio->bio_list_lock);
 
-	atomic_set(&rbio->error, 0);
 	bitmap_clear(rbio->error_bitmap, 0, rbio->nr_sectors);
 
 	index_rbio_pages(rbio);
@@ -2316,9 +2184,17 @@ static int rmw_rbio(struct btrfs_raid_bio *rbio)
 	submit_write_bios(rbio, &bio_list);
 	wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
 
-	/* We have more errors than our tolerance during the read. */
-	if (atomic_read(&rbio->error) > rbio->bioc->max_errors)
-		ret = -EIO;
+	/* We may have more errors than our tolerance during the read. */
+	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++) {
+		int found_errors;
+
+		found_errors = get_rbio_veritical_errors(rbio, sectornr,
+							 NULL, NULL);
+		if (found_errors > rbio->bioc->max_errors) {
+			ret = -EIO;
+			break;
+		}
+	}
 	return ret;
 }
 
@@ -2499,7 +2375,6 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio, int need_check)
 		pointers[rbio->real_stripes - 1] = kmap_local_page(q_sector.page);
 	}
 
-	atomic_set(&rbio->error, 0);
 	bitmap_clear(rbio->error_bitmap, 0, rbio->nr_sectors);
 
 	/* Map the parity stripe just once */
@@ -2733,6 +2608,7 @@ static int scrub_rbio(struct btrfs_raid_bio *rbio)
 {
 	bool need_check = false;
 	struct bio_list bio_list;
+	int sector_nr;
 	int ret;
 	struct bio *bio;
 
@@ -2742,7 +2618,6 @@ static int scrub_rbio(struct btrfs_raid_bio *rbio)
 	if (ret)
 		goto cleanup;
 
-	atomic_set(&rbio->error, 0);
 	bitmap_clear(rbio->error_bitmap, 0, rbio->nr_sectors);
 
 	ret = scrub_assemble_read_bios(rbio, &bio_list);
@@ -2763,8 +2638,16 @@ static int scrub_rbio(struct btrfs_raid_bio *rbio)
 	 */
 	ret = finish_parity_scrub(rbio, need_check);
 	wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
-	if (atomic_read(&rbio->error) > rbio->bioc->max_errors)
-		ret = -EIO;
+	for (sector_nr = 0; sector_nr < rbio->stripe_nsectors; sector_nr++) {
+		int found_errors;
+
+		found_errors = get_rbio_veritical_errors(rbio, sector_nr,
+							 NULL, NULL);
+		if (found_errors > rbio->bioc->max_errors) {
+			ret = -EIO;
+			break;
+		}
+	}
 	return ret;
 
 cleanup:
@@ -2811,14 +2694,6 @@ raid56_alloc_missing_rbio(struct bio *bio, struct btrfs_io_context *bioc)
 	ASSERT(!bio->bi_iter.bi_size);
 
 	set_rbio_range_error(rbio, bio);
-	rbio->faila = find_logical_bio_stripe(rbio, bio);
-	if (rbio->faila == -1) {
-		btrfs_warn_rl(fs_info,
-	"can not determine the failed stripe number for full stripe %llu",
-			      bioc->raid_map[0]);
-		free_raid_bio(rbio);
-		return NULL;
-	}
 
 	return rbio;
 }
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index e38da4fa76d6..a2e653e93fd8 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -74,12 +74,6 @@ struct btrfs_raid_bio {
 	/* How many sectors there are for each stripe */
 	u8 stripe_nsectors;
 
-	/* First bad stripe, -1 means no corruption */
-	s8 faila;
-
-	/* Second bad stripe (for RAID6 use) */
-	s8 failb;
-
 	/* Stripe number that we're scrubbing  */
 	u8 scrubp;
 
@@ -93,8 +87,6 @@ struct btrfs_raid_bio {
 
 	atomic_t stripes_pending;
 
-	atomic_t error;
-
 	wait_queue_head_t io_wait;
 
 	/* Bitmap to record which horizontal stripe has data */
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/3] btrfs: raid56: introduce btrfs_raid_bio::error_bitmap
  2022-11-07  7:32 ` [PATCH 1/3] btrfs: raid56: introduce btrfs_raid_bio::error_bitmap Qu Wenruo
@ 2022-11-07 17:03   ` David Sterba
  0 siblings, 0 replies; 6+ messages in thread
From: David Sterba @ 2022-11-07 17:03 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Nov 07, 2022 at 03:32:29PM +0800, Qu Wenruo wrote:
> +
> +		sector = &rbio->stripe_sectors[i];
> +		if (sector->page == bv->bv_page &&
> +		    sector->pgoff == bv->bv_offset)
> +			break;
> +		sector = &rbio->bio_sectors[i];
> +		if (sector->page == bv->bv_page &&
> +		    sector->pgoff == bv->bv_offset)
> +			break;
> +	}
> +	ASSERT(i < rbio->nr_sectors);
> +	return i;
> +}
> +
> +static void rbio_update_error_bitmap(struct btrfs_raid_bio *rbio,
> +				     struct bio *bio)
> +{
> +	int total_sector_nr = get_bio_sector_nr(rbio, bio);
> +	int bio_size = 0;

bio_size is better a u32 as it's the bv_len type and there's a shift
done later.

> +	struct bio_vec *bvec;
> +	struct bvec_iter_all iter_all;
> +
> +	bio_for_each_segment_all(bvec, bio, iter_all)
> +		bio_size += bvec->bv_len;
> +
> +	bitmap_set(rbio->error_bitmap, total_sector_nr,
> +		   bio_size >> rbio->bioc->fs_info->sectorsize_bits);
> +}
> +
>  static void raid_wait_read_end_io(struct bio *bio)
>  {
>  	struct btrfs_raid_bio *rbio = bio->bi_private;
>  
> -	if (bio->bi_status)
> +	if (bio->bi_status) {
>  		fail_bio_stripe(rbio, bio);
> -	else
> +		rbio_update_error_bitmap(rbio, bio);
> +	} else
>  		set_bio_pages_uptodate(rbio, bio);

The else branch needs { }

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/3] btrfs: raid56: make raid56 to use more accurate error bitmap for error detection
  2022-11-07  7:32 [PATCH 0/3] btrfs: raid56: make raid56 to use more accurate error bitmap for error detection Qu Wenruo
                   ` (2 preceding siblings ...)
  2022-11-07  7:32 ` [PATCH 3/3] btrfs: raid56: remove the old error tracing system Qu Wenruo
@ 2022-11-07 17:04 ` David Sterba
  3 siblings, 0 replies; 6+ messages in thread
From: David Sterba @ 2022-11-07 17:04 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Nov 07, 2022 at 03:32:28PM +0800, Qu Wenruo wrote:
> Currently btrfs raid56 uses stripe based error detection.
> 
> This means, any error (which vary from single sector csum mismatch, to a
> missing device) will mark the whole horizontal stripe as error.
> 
> This can lead to some unexpected behavior, for example:
> 
> 
>              0        32K       64K
>      Data 1  |XXXXXXXX|         |
>      Data 2  |        |XXXXXXXXX|
>      Parity  |        |         |
> 
> When reading data 1 [0, 32K), we got csum mismatch and go RAID56
> recovery path.
> 
> If going the old path, we will mark the whole data 1 [0, 64K) all as
> error, and recover using data 2 and parity.
> 
> But since data 2 [32K, 64K) is also corrupted, the recovered data will
> also be corrupted.
> 
> Thankfully such problem will be mostly avoided after commit f6065f8edeb2
> ("btrfs: raid56: don't trust any cached sector in
> __raid56_parity_recover()"), as when we read the sectors in data 2 [32K,
> 64K), we will recover discarding all the cached result.
> 
> 
> This patchset will change the behavior by introducing an error bitmap,
> recording corrupted sector one by one, so for above case, at least we
> won't try to recover data 1 [32K, 64K) using incorrect data.
> 
> The true solution to this destructive RMW problem will be read time csum
> verification, but this patchset introduces the basis to handle extra
> csum mismatch error better (csum mismatch will also be marked as error,
> but only for the offending sectors).
> 
> This patchset itself doesn't improve the raid56 destructive RMW
> situation by itself, but would make later destructive RMW fix much
> easier to implement.
> 
> Qu Wenruo (3):
>   btrfs: raid56: introduce btrfs_raid_bio::error_bitmap
>   btrfs: raid56: migrate recovery and scrub recovery path to use
>     error_bitmap
>   btrfs: raid56: remove the old error tracing system

With minor fixups added to misc-next, thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-11-07 17:05 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-11-07  7:32 [PATCH 0/3] btrfs: raid56: make raid56 to use more accurate error bitmap for error detection Qu Wenruo
2022-11-07  7:32 ` [PATCH 1/3] btrfs: raid56: introduce btrfs_raid_bio::error_bitmap Qu Wenruo
2022-11-07 17:03   ` David Sterba
2022-11-07  7:32 ` [PATCH 2/3] btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap Qu Wenruo
2022-11-07  7:32 ` [PATCH 3/3] btrfs: raid56: remove the old error tracing system Qu Wenruo
2022-11-07 17:04 ` [PATCH 0/3] btrfs: raid56: make raid56 to use more accurate error bitmap for error detection David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox