linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/9] Implement device scrub/replace for RAID56
@ 2014-11-14 13:50 Miao Xie
  2014-11-14 13:50 ` [PATCH 1/9] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition Miao Xie
                   ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-14 13:50 UTC (permalink / raw)
  To: linux-btrfs

This patchset implement the device scrub/replace function for RAID56, the
most implementation of the common data is similar to the other RAID type.
The differentia or difficulty is the parity process. In order to avoid
that problem the data that is easy to be change out the stripe lock,
we do most work in the RAID56 stripe lock context.

And in order to avoid making the code more and more complex, we copy some
code of common data process for the parity, the cleanup work is in my
TODO list.

We have done some test, the patchset worked well. Of course, more tests
are welcome. If you are interesting to use it or test it, you can pull
the patchset from

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Thanks
Miao

Miao Xie (6):
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs,raid56: use a variant to record the operation type
  Btrfs,raid56: support parity scrub on raid56
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, replace: write raid56 parity into the replace target device

Zhao Lei (3):
  Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  Btrfs: remove unnecessary code of stripe_index assignment in
    __btrfs_map_block
  Btrfs, replace: enable dev-replace for raid56

 fs/btrfs/dev-replace.c |   5 -
 fs/btrfs/raid56.c      | 711 +++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/raid56.h      |  14 +-
 fs/btrfs/scrub.c       | 793 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.c     |  47 ++-
 fs/btrfs/volumes.h     |  14 +-
 6 files changed, 1471 insertions(+), 113 deletions(-)

-- 
1.9.3


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/9] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  2014-11-14 13:50 [PATCH 0/9] Implement device scrub/replace for RAID56 Miao Xie
@ 2014-11-14 13:50 ` Miao Xie
  2014-11-14 14:57   ` David Sterba
  2014-11-14 13:50 ` [PATCH 2/9] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block Miao Xie
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Miao Xie @ 2014-11-14 13:50 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

bbio_ret in this condition is always !NULL because previous code
already have a check-and-skip:
4908 if (!bbio_ret)
4909     goto out;

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f61278f..41b0dff 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5162,8 +5162,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 				BTRFS_BLOCK_GROUP_RAID6)) {
 		u64 tmp;
 
-		if (bbio_ret && ((rw & REQ_WRITE) || mirror_num > 1)
-		    && raid_map_ret) {
+		if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
 			int i, rot;
 
 			/* push stripe_nr back to the start of the full stripe */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/9] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
  2014-11-14 13:50 [PATCH 0/9] Implement device scrub/replace for RAID56 Miao Xie
  2014-11-14 13:50 ` [PATCH 1/9] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition Miao Xie
@ 2014-11-14 13:50 ` Miao Xie
  2014-11-14 14:55   ` David Sterba
  2014-11-14 13:50 ` [PATCH 3/9] Btrfs, raid56: don't change bbio and raid_map Miao Xie
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Miao Xie @ 2014-11-14 13:50 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

stripe_index's value was set again in latter line:
stripe_index = 0;

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 41b0dff..66d5035 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5167,9 +5167,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 
 			/* push stripe_nr back to the start of the full stripe */
 			stripe_nr = raid56_full_stripe_start;
-			do_div(stripe_nr, stripe_len);
-
-			stripe_index = do_div(stripe_nr, nr_data_stripes(map));
+			do_div(stripe_nr, stripe_len * nr_data_stripes(map));
 
 			/* RAID[56] write or recovery. Return all stripes */
 			num_stripes = map->num_stripes;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 3/9] Btrfs, raid56: don't change bbio and raid_map
  2014-11-14 13:50 [PATCH 0/9] Implement device scrub/replace for RAID56 Miao Xie
  2014-11-14 13:50 ` [PATCH 1/9] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition Miao Xie
  2014-11-14 13:50 ` [PATCH 2/9] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block Miao Xie
@ 2014-11-14 13:50 ` Miao Xie
  2014-11-14 13:50 ` [PATCH 4/9] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted Miao Xie
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-14 13:50 UTC (permalink / raw)
  To: linux-btrfs

Because we will reuse bbio and raid_map during the scrub later, it is
better that we don't change any variant of bbio and don't free it at
the end of IO request. So we introduced similar variants into the raid
bio, and don't access those bbio's variants any more.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/raid56.c | 42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6a41631..c54b0e6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,7 +58,6 @@
  */
 #define RBIO_CACHE_READY_BIT	3
 
-
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -146,6 +145,10 @@ struct btrfs_raid_bio {
 
 	atomic_t refs;
 
+
+	atomic_t stripes_pending;
+
+	atomic_t error;
 	/*
 	 * these are two arrays of pointers.  We allocate the
 	 * rbio big enough to hold them both and setup their
@@ -858,13 +861,13 @@ static void raid_write_end_io(struct bio *bio, int err)
 
 	bio_put(bio);
 
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
 		return;
 
 	err = 0;
 
 	/* OK, we have read all the stripes we need to. */
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
 		err = -EIO;
 
 	rbio_orig_end_io(rbio, err, 0);
@@ -949,6 +952,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	rbio->faila = -1;
 	rbio->failb = -1;
 	atomic_set(&rbio->refs, 1);
+	atomic_set(&rbio->error, 0);
+	atomic_set(&rbio->stripes_pending, 0);
 
 	/*
 	 * the stripe_pages and bio_pages array point to the extra
@@ -1169,7 +1174,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
 	spin_unlock_irq(&rbio->bio_list_lock);
 
-	atomic_set(&rbio->bbio->error, 0);
+	atomic_set(&rbio->error, 0);
 
 	/*
 	 * now that we've set rmw_locked, run through the
@@ -1245,8 +1250,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 		}
 	}
 
-	atomic_set(&bbio->stripes_pending, bio_list_size(&bio_list));
-	BUG_ON(atomic_read(&bbio->stripes_pending) == 0);
+	atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
+	BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
 	while (1) {
 		bio = bio_list_pop(&bio_list);
@@ -1331,11 +1336,11 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed)
 	if (rbio->faila == -1) {
 		/* first failure on this rbio */
 		rbio->faila = failed;
-		atomic_inc(&rbio->bbio->error);
+		atomic_inc(&rbio->error);
 	} else if (rbio->failb == -1) {
 		/* second failure on this rbio */
 		rbio->failb = failed;
-		atomic_inc(&rbio->bbio->error);
+		atomic_inc(&rbio->error);
 	} else {
 		ret = -EIO;
 	}
@@ -1394,11 +1399,11 @@ static void raid_rmw_end_io(struct bio *bio, int err)
 
 	bio_put(bio);
 
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
 		return;
 
 	err = 0;
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
 		goto cleanup;
 
 	/*
@@ -1439,7 +1444,6 @@ static void async_read_rebuild(struct btrfs_raid_bio *rbio)
 static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 {
 	int bios_to_read = 0;
-	struct btrfs_bio *bbio = rbio->bbio;
 	struct bio_list bio_list;
 	int ret;
 	int nr_pages = DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE);
@@ -1455,7 +1459,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 
 	index_rbio_pages(rbio);
 
-	atomic_set(&rbio->bbio->error, 0);
+	atomic_set(&rbio->error, 0);
 	/*
 	 * build a list of bios to read all the missing parts of this
 	 * stripe
@@ -1503,7 +1507,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 	 * the bbio may be freed once we submit the last bio.  Make sure
 	 * not to touch it after that
 	 */
-	atomic_set(&bbio->stripes_pending, bios_to_read);
+	atomic_set(&rbio->stripes_pending, bios_to_read);
 	while (1) {
 		bio = bio_list_pop(&bio_list);
 		if (!bio)
@@ -1917,10 +1921,10 @@ static void raid_recover_end_io(struct bio *bio, int err)
 		set_bio_pages_uptodate(bio);
 	bio_put(bio);
 
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
 		return;
 
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
 		rbio_orig_end_io(rbio, -EIO, 0);
 	else
 		__raid_recover_end_io(rbio);
@@ -1951,7 +1955,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 	if (ret)
 		goto cleanup;
 
-	atomic_set(&rbio->bbio->error, 0);
+	atomic_set(&rbio->error, 0);
 
 	/*
 	 * read everything that hasn't failed.  Thanks to the
@@ -1960,7 +1964,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 	 */
 	for (stripe = 0; stripe < bbio->num_stripes; stripe++) {
 		if (rbio->faila == stripe || rbio->failb == stripe) {
-			atomic_inc(&rbio->bbio->error);
+			atomic_inc(&rbio->error);
 			continue;
 		}
 
@@ -1990,7 +1994,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 		 * were up to date, or we might have no bios to read because
 		 * the devices were gone.
 		 */
-		if (atomic_read(&rbio->bbio->error) <= rbio->bbio->max_errors) {
+		if (atomic_read(&rbio->error) <= rbio->bbio->max_errors) {
 			__raid_recover_end_io(rbio);
 			goto out;
 		} else {
@@ -2002,7 +2006,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 	 * the bbio may be freed once we submit the last bio.  Make sure
 	 * not to touch it after that
 	 */
-	atomic_set(&bbio->stripes_pending, bios_to_read);
+	atomic_set(&rbio->stripes_pending, bios_to_read);
 	while (1) {
 		bio = bio_list_pop(&bio_list);
 		if (!bio)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/9] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  2014-11-14 13:50 [PATCH 0/9] Implement device scrub/replace for RAID56 Miao Xie
                   ` (2 preceding siblings ...)
  2014-11-14 13:50 ` [PATCH 3/9] Btrfs, raid56: don't change bbio and raid_map Miao Xie
@ 2014-11-14 13:50 ` Miao Xie
  2014-11-24  9:31   ` [PATCH v2 " Miao Xie
  2014-11-14 13:50 ` [PATCH 5/9] Btrfs,raid56: use a variant to record the operation type Miao Xie
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Miao Xie @ 2014-11-14 13:50 UTC (permalink / raw)
  To: linux-btrfs

This patch implement the RAID5/6 common data repair function, the
implementation is similar to the scrub on the other RAID such as
RAID1, the differentia is that we don't read the data from the
mirror, we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/raid56.c  |  42 +++++++++---
 fs/btrfs/raid56.h  |   2 +-
 fs/btrfs/scrub.c   | 194 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.c |  16 ++++-
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 226 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c54b0e6..b3e9c76 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,6 +58,8 @@
  */
 #define RBIO_CACHE_READY_BIT	3
 
+#define RBIO_HOLD_BBIO_MAP_BIT	4
+
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -799,6 +801,21 @@ done_nolock:
 		remove_rbio_from_cache(rbio);
 }
 
+static inline void
+___free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need)
+{
+	if (need) {
+		kfree(raid_map);
+		kfree(bbio);
+	}
+}
+
+static inline void __free_bbio_and_raid_map(struct btrfs_raid_bio *rbio)
+{
+	___free_bbio_and_raid_map(rbio->bbio, rbio->raid_map,
+			!test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags));
+}
+
 static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 {
 	int i;
@@ -817,8 +834,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 			rbio->stripe_pages[i] = NULL;
 		}
 	}
-	kfree(rbio->raid_map);
-	kfree(rbio->bbio);
+
+	__free_bbio_and_raid_map(rbio);
+
 	kfree(rbio);
 }
 
@@ -933,11 +951,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 
 	rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
 			GFP_NOFS);
-	if (!rbio) {
-		kfree(raid_map);
-		kfree(bbio);
+	if (!rbio)
 		return ERR_PTR(-ENOMEM);
-	}
 
 	bio_list_init(&rbio->bio_list);
 	INIT_LIST_HEAD(&rbio->plug_list);
@@ -1692,8 +1707,10 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 	struct blk_plug_cb *cb;
 
 	rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-	if (IS_ERR(rbio))
+	if (IS_ERR(rbio)) {
+		___free_bbio_and_raid_map(bbio, raid_map, 1);
 		return PTR_ERR(rbio);
+	}
 	bio_list_add(&rbio->bio_list, bio);
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
@@ -2038,15 +2055,19 @@ cleanup:
  */
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 			  struct btrfs_bio *bbio, u64 *raid_map,
-			  u64 stripe_len, int mirror_num)
+			  u64 stripe_len, int mirror_num, int hold_bbio)
 {
 	struct btrfs_raid_bio *rbio;
 	int ret;
 
 	rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-	if (IS_ERR(rbio))
+	if (IS_ERR(rbio)) {
+		___free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
 		return PTR_ERR(rbio);
+	}
 
+	if (hold_bbio)
+		set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
 	rbio->read_rebuild = 1;
 	bio_list_add(&rbio->bio_list, bio);
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
@@ -2054,8 +2075,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 	rbio->faila = find_logical_bio_stripe(rbio, bio);
 	if (rbio->faila == -1) {
 		BUG();
-		kfree(raid_map);
-		kfree(bbio);
+		___free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
 		kfree(rbio);
 		return -EIO;
 	}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index ea5d73b..b310e8c 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map)
 
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 				 struct btrfs_bio *bbio, u64 *raid_map,
-				 u64 stripe_len, int mirror_num);
+				 u64 stripe_len, int mirror_num, int hold_bbio);
 int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 			       struct btrfs_bio *bbio, u64 *raid_map,
 			       u64 stripe_len);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index efa0831..ca4b9eb 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -63,6 +63,13 @@ struct scrub_ctx;
  */
 #define SCRUB_MAX_PAGES_PER_BLOCK	16	/* 64k per node/leaf/sector */
 
+struct scrub_recover {
+	atomic_t		refs;
+	struct btrfs_bio	*bbio;
+	u64			*raid_map;
+	u64			map_length;
+};
+
 struct scrub_page {
 	struct scrub_block	*sblock;
 	struct page		*page;
@@ -79,6 +86,8 @@ struct scrub_page {
 		unsigned int	io_error:1;
 	};
 	u8			csum[BTRFS_CSUM_SIZE];
+
+	struct scrub_recover	*recover;
 };
 
 struct scrub_bio {
@@ -196,7 +205,7 @@ static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 				struct scrub_block *sblock, int is_metadata,
 				int have_csum, u8 *csum, u64 generation,
-				u16 csum_size);
+				u16 csum_size, int retry_failed_mirror);
 static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
 					 struct scrub_block *sblock,
 					 int is_metadata, int have_csum,
@@ -790,6 +799,20 @@ out:
 	scrub_pending_trans_workers_dec(sctx);
 }
 
+static inline void scrub_get_recover(struct scrub_recover *recover)
+{
+	atomic_inc(&recover->refs);
+}
+
+static inline void scrub_put_recover(struct scrub_recover *recover)
+{
+	if (atomic_dec_and_test(&recover->refs)) {
+		kfree(recover->bbio);
+		kfree(recover->raid_map);
+		kfree(recover);
+	}
+}
+
 /*
  * scrub_handle_errored_block gets called when either verification of the
  * pages failed or the bio failed to read, e.g. with EIO. In the latter
@@ -906,7 +929,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 
 	/* build and submit the bios for the failed mirror, check checksums */
 	scrub_recheck_block(fs_info, sblock_bad, is_metadata, have_csum,
-			    csum, generation, sctx->csum_size);
+			    csum, generation, sctx->csum_size, 1);
 
 	if (!sblock_bad->header_error && !sblock_bad->checksum_error &&
 	    sblock_bad->no_io_error_seen) {
@@ -1019,7 +1042,7 @@ nodatasum_case:
 		/* build and submit the bios, check checksums */
 		scrub_recheck_block(fs_info, sblock_other, is_metadata,
 				    have_csum, csum, generation,
-				    sctx->csum_size);
+				    sctx->csum_size, 0);
 
 		if (!sblock_other->header_error &&
 		    !sblock_other->checksum_error &&
@@ -1169,7 +1192,7 @@ nodatasum_case:
 			 */
 			scrub_recheck_block(fs_info, sblock_bad,
 					    is_metadata, have_csum, csum,
-					    generation, sctx->csum_size);
+					    generation, sctx->csum_size, 1);
 			if (!sblock_bad->header_error &&
 			    !sblock_bad->checksum_error &&
 			    sblock_bad->no_io_error_seen)
@@ -1201,11 +1224,18 @@ out:
 		     mirror_index++) {
 			struct scrub_block *sblock = sblocks_for_recheck +
 						     mirror_index;
+			struct scrub_recover *recover;
 			int page_index;
 
 			for (page_index = 0; page_index < sblock->page_count;
 			     page_index++) {
 				sblock->pagev[page_index]->sblock = NULL;
+				recover = sblock->pagev[page_index]->recover;
+				if (recover) {
+					scrub_put_recover(recover);
+					sblock->pagev[page_index]->recover =
+									NULL;
+				}
 				scrub_page_put(sblock->pagev[page_index]);
 			}
 		}
@@ -1215,14 +1245,63 @@ out:
 	return 0;
 }
 
+static inline int scrub_nr_raid_mirrors(struct btrfs_bio *bbio, u64 *raid_map)
+{
+	if (raid_map) {
+		if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
+			return 3;
+		else
+			return 2;
+	} else {
+		return (int)bbio->num_stripes;
+	}
+}
+
+static inline void scrub_stripe_index_and_offset(u64 logical, u64 *raid_map,
+						 u64 mapped_length,
+						 int nstripes, int mirror,
+						 int *stripe_index,
+						 u64 *stripe_offset)
+{
+	int i;
+
+	if (raid_map) {
+		/* RAID5/6 */
+		for (i = 0; i < nstripes; i++) {
+			if (raid_map[i] == RAID6_Q_STRIPE ||
+			    raid_map[i] == RAID5_P_STRIPE)
+				continue;
+
+			if (logical >= raid_map[i] &&
+			    logical < raid_map[i] + mapped_length)
+				break;
+		}
+
+		*stripe_index = i;
+		*stripe_offset = logical - raid_map[i];
+	} else {
+		/* The other RAID type */
+		*stripe_index = mirror;
+		*stripe_offset = 0;
+	}
+}
+
 static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 				     struct btrfs_fs_info *fs_info,
 				     struct scrub_block *original_sblock,
 				     u64 length, u64 logical,
 				     struct scrub_block *sblocks_for_recheck)
 {
+	struct scrub_recover *recover;
+	struct btrfs_bio *bbio;
+	u64 *raid_map;
+	u64 sublen;
+	u64 mapped_length;
+	u64 stripe_offset;
+	int stripe_index;
 	int page_index;
 	int mirror_index;
+	int nmirrors;
 	int ret;
 
 	/*
@@ -1233,23 +1312,39 @@ static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 
 	page_index = 0;
 	while (length > 0) {
-		u64 sublen = min_t(u64, length, PAGE_SIZE);
-		u64 mapped_length = sublen;
-		struct btrfs_bio *bbio = NULL;
+		sublen = min_t(u64, length, PAGE_SIZE);
+		mapped_length = sublen;
+		bbio = NULL;
+		raid_map = NULL;
 
 		/*
 		 * with a length of PAGE_SIZE, each returned stripe
 		 * represents one mirror
 		 */
-		ret = btrfs_map_block(fs_info, REQ_GET_READ_MIRRORS, logical,
-				      &mapped_length, &bbio, 0);
+		ret = btrfs_map_sblock(fs_info, REQ_GET_READ_MIRRORS, logical,
+				       &mapped_length, &bbio, 0, &raid_map);
 		if (ret || !bbio || mapped_length < sublen) {
 			kfree(bbio);
+			kfree(raid_map);
 			return -EIO;
 		}
 
+		recover = kzalloc(sizeof(struct scrub_recover), GFP_NOFS);
+		if (!recover) {
+			kfree(bbio);
+			kfree(raid_map);
+			return -ENOMEM;
+		}
+
+		atomic_set(&recover->refs, 1);
+		recover->bbio = bbio;
+		recover->raid_map = raid_map;
+		recover->map_length = mapped_length;
+
 		BUG_ON(page_index >= SCRUB_PAGES_PER_RD_BIO);
-		for (mirror_index = 0; mirror_index < (int)bbio->num_stripes;
+
+		nmirrors = scrub_nr_raid_mirrors(bbio, raid_map);
+		for (mirror_index = 0; mirror_index < nmirrors;
 		     mirror_index++) {
 			struct scrub_block *sblock;
 			struct scrub_page *page;
@@ -1265,26 +1360,38 @@ leave_nomem:
 				spin_lock(&sctx->stat_lock);
 				sctx->stat.malloc_errors++;
 				spin_unlock(&sctx->stat_lock);
-				kfree(bbio);
+				scrub_put_recover(recover);
 				return -ENOMEM;
 			}
 			scrub_page_get(page);
 			sblock->pagev[page_index] = page;
 			page->logical = logical;
-			page->physical = bbio->stripes[mirror_index].physical;
+
+			scrub_stripe_index_and_offset(logical, raid_map,
+						      mapped_length,
+						      bbio->num_stripes,
+						      mirror_index,
+						      &stripe_index,
+						      &stripe_offset);
+			page->physical = bbio->stripes[stripe_index].physical +
+					 stripe_offset;
+			page->dev = bbio->stripes[stripe_index].dev;
+
 			BUG_ON(page_index >= original_sblock->page_count);
 			page->physical_for_dev_replace =
 				original_sblock->pagev[page_index]->
 				physical_for_dev_replace;
 			/* for missing devices, dev->bdev is NULL */
-			page->dev = bbio->stripes[mirror_index].dev;
 			page->mirror_num = mirror_index + 1;
 			sblock->page_count++;
 			page->page = alloc_page(GFP_NOFS);
 			if (!page->page)
 				goto leave_nomem;
+
+			scrub_get_recover(recover);
+			page->recover = recover;
 		}
-		kfree(bbio);
+		scrub_put_recover(recover);
 		length -= sublen;
 		logical += sublen;
 		page_index++;
@@ -1293,6 +1400,51 @@ leave_nomem:
 	return 0;
 }
 
+struct scrub_bio_ret {
+	struct completion event;
+	int error;
+};
+
+static void scrub_bio_wait_endio(struct bio *bio, int error)
+{
+	struct scrub_bio_ret *ret = bio->bi_private;
+
+	ret->error = error;
+	complete(&ret->event);
+}
+
+static inline int scrub_is_page_on_raid56(struct scrub_page *page)
+{
+	return page->recover && page->recover->raid_map;
+}
+
+static int scrub_submit_raid56_bio_wait(struct btrfs_fs_info *fs_info,
+					struct bio *bio,
+					struct scrub_page *page)
+{
+	struct scrub_bio_ret done;
+	int ret;
+
+	init_completion(&done.event);
+	done.error = 0;
+	bio->bi_iter.bi_sector = page->logical >> 9;
+	bio->bi_private = &done;
+	bio->bi_end_io = scrub_bio_wait_endio;
+
+	ret = raid56_parity_recover(fs_info->fs_root, bio, page->recover->bbio,
+				    page->recover->raid_map,
+				    page->recover->map_length,
+				    page->mirror_num, 1);
+	if (ret)
+		return ret;
+
+	wait_for_completion(&done.event);
+	if (done.error)
+		return -EIO;
+
+	return 0;
+}
+
 /*
  * this function will check the on disk data for checksum errors, header
  * errors and read I/O errors. If any I/O errors happen, the exact pages
@@ -1303,7 +1455,7 @@ leave_nomem:
 static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 				struct scrub_block *sblock, int is_metadata,
 				int have_csum, u8 *csum, u64 generation,
-				u16 csum_size)
+				u16 csum_size, int retry_failed_mirror)
 {
 	int page_num;
 
@@ -1329,11 +1481,17 @@ static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 			continue;
 		}
 		bio->bi_bdev = page->dev->bdev;
-		bio->bi_iter.bi_sector = page->physical >> 9;
 
 		bio_add_page(bio, page->page, PAGE_SIZE, 0);
-		if (btrfsic_submit_bio_wait(READ, bio))
-			sblock->no_io_error_seen = 0;
+		if (!retry_failed_mirror && scrub_is_page_on_raid56(page)) {
+			if (scrub_submit_raid56_bio_wait(fs_info, bio, page))
+				sblock->no_io_error_seen = 0;
+		} else {
+			bio->bi_iter.bi_sector = page->physical >> 9;
+
+			if (btrfsic_submit_bio_wait(READ, bio))
+				sblock->no_io_error_seen = 0;
+		}
 
 		bio_put(bio);
 	}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 66d5035..05cf489 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5162,7 +5162,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 				BTRFS_BLOCK_GROUP_RAID6)) {
 		u64 tmp;
 
-		if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
+		if (raid_map_ret &&
+		    ((rw & (REQ_WRITE | REQ_GET_READ_MIRRORS)) ||
+		     mirror_num > 1)) {
 			int i, rot;
 
 			/* push stripe_nr back to the start of the full stripe */
@@ -5441,6 +5443,16 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 				 mirror_num, NULL);
 }
 
+/* For Scrub/replace */
+int btrfs_map_sblock(struct btrfs_fs_info *fs_info, int rw,
+		     u64 logical, u64 *length,
+		     struct btrfs_bio **bbio_ret, int mirror_num,
+		     u64 **raid_map_ret)
+{
+	return __btrfs_map_block(fs_info, rw, logical, length, bbio_ret,
+				 mirror_num, raid_map_ret);
+}
+
 int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
 		     u64 chunk_start, u64 physical, u64 devid,
 		     u64 **logical, int *naddrs, int *stripe_len)
@@ -5810,7 +5822,7 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
 		} else {
 			ret = raid56_parity_recover(root, bio, bbio,
 						    raid_map, map_length,
-						    mirror_num);
+						    mirror_num, 0);
 		}
 		/*
 		 * FIXME, replace dosen't support raid56 yet, please fix
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 08980fa..01094bb 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -393,6 +393,10 @@ int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
 int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		    u64 logical, u64 *length,
 		    struct btrfs_bio **bbio_ret, int mirror_num);
+int btrfs_map_sblock(struct btrfs_fs_info *fs_info, int rw,
+		     u64 logical, u64 *length,
+		     struct btrfs_bio **bbio_ret, int mirror_num,
+		     u64 **raid_map_ret);
 int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
 		     u64 chunk_start, u64 physical, u64 devid,
 		     u64 **logical, int *naddrs, int *stripe_len);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 5/9] Btrfs,raid56: use a variant to record the operation type
  2014-11-14 13:50 [PATCH 0/9] Implement device scrub/replace for RAID56 Miao Xie
                   ` (3 preceding siblings ...)
  2014-11-14 13:50 ` [PATCH 4/9] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted Miao Xie
@ 2014-11-14 13:50 ` Miao Xie
  2014-11-14 13:50 ` [PATCH 6/9] Btrfs,raid56: support parity scrub on raid56 Miao Xie
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-14 13:50 UTC (permalink / raw)
  To: linux-btrfs

We will introduce new operation type later, if we still use integer
variant as bool variant to record the operation type, we would add new
variant and increase the size of raid bio structure. It is not good,
by this patch, we define different number for different operation,
and we can just use a variant to record the operation type.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/raid56.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index b3e9c76..d550e9b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -62,6 +62,11 @@
 
 #define RBIO_CACHE_SIZE 1024
 
+enum btrfs_rbio_ops {
+	BTRFS_RBIO_WRITE	= 0,
+	BTRFS_RBIO_READ_REBUILD	= 1,
+};
+
 struct btrfs_raid_bio {
 	struct btrfs_fs_info *fs_info;
 	struct btrfs_bio *bbio;
@@ -124,7 +129,7 @@ struct btrfs_raid_bio {
 	 * differently from a parity rebuild as part of
 	 * rmw
 	 */
-	int read_rebuild;
+	enum btrfs_rbio_ops operation;
 
 	/* first bad stripe */
 	int faila;
@@ -147,7 +152,6 @@ struct btrfs_raid_bio {
 
 	atomic_t refs;
 
-
 	atomic_t stripes_pending;
 
 	atomic_t error;
@@ -583,8 +587,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
 		return 0;
 
 	/* reads can't merge with writes */
-	if (last->read_rebuild !=
-	    cur->read_rebuild) {
+	if (last->operation != cur->operation) {
 		return 0;
 	}
 
@@ -777,9 +780,9 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
 			spin_unlock(&rbio->bio_list_lock);
 			spin_unlock_irqrestore(&h->lock, flags);
 
-			if (next->read_rebuild)
+			if (next->operation == BTRFS_RBIO_READ_REBUILD)
 				async_read_rebuild(next);
-			else {
+			else if (next->operation == BTRFS_RBIO_WRITE){
 				steal_rbio(rbio, next);
 				async_rmw_stripe(next);
 			}
@@ -1713,6 +1716,7 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 	}
 	bio_list_add(&rbio->bio_list, bio);
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
+	rbio->operation = BTRFS_RBIO_WRITE;
 
 	/*
 	 * don't plug on full rbios, just get them out the door
@@ -1761,7 +1765,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 	faila = rbio->faila;
 	failb = rbio->failb;
 
-	if (rbio->read_rebuild) {
+	if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
 		spin_lock_irq(&rbio->bio_list_lock);
 		set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
 		spin_unlock_irq(&rbio->bio_list_lock);
@@ -1778,7 +1782,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 			 * if we're rebuilding a read, we have to use
 			 * pages from the bio list
 			 */
-			if (rbio->read_rebuild &&
+			if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
 			    (stripe == faila || stripe == failb)) {
 				page = page_in_rbio(rbio, stripe, pagenr, 0);
 			} else {
@@ -1871,7 +1875,7 @@ pstripe:
 		 * know they can be trusted.  If this was a read reconstruction,
 		 * other endio functions will fiddle the uptodate bits
 		 */
-		if (!rbio->read_rebuild) {
+		if (rbio->operation == BTRFS_RBIO_WRITE) {
 			for (i = 0;  i < nr_pages; i++) {
 				if (faila != -1) {
 					page = rbio_stripe_page(rbio, faila, i);
@@ -1888,7 +1892,7 @@ pstripe:
 			 * if we're rebuilding a read, we have to use
 			 * pages from the bio list
 			 */
-			if (rbio->read_rebuild &&
+			if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
 			    (stripe == faila || stripe == failb)) {
 				page = page_in_rbio(rbio, stripe, pagenr, 0);
 			} else {
@@ -1904,7 +1908,7 @@ cleanup:
 
 cleanup_io:
 
-	if (rbio->read_rebuild) {
+	if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
 		if (err == 0)
 			cache_rbio_pages(rbio);
 		else
@@ -2042,7 +2046,7 @@ out:
 	return 0;
 
 cleanup:
-	if (rbio->read_rebuild)
+	if (rbio->operation == BTRFS_RBIO_READ_REBUILD)
 		rbio_orig_end_io(rbio, -EIO, 0);
 	return -EIO;
 }
@@ -2068,7 +2072,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 
 	if (hold_bbio)
 		set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
-	rbio->read_rebuild = 1;
+	rbio->operation = BTRFS_RBIO_READ_REBUILD;
 	bio_list_add(&rbio->bio_list, bio);
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 6/9] Btrfs,raid56: support parity scrub on raid56
  2014-11-14 13:50 [PATCH 0/9] Implement device scrub/replace for RAID56 Miao Xie
                   ` (4 preceding siblings ...)
  2014-11-14 13:50 ` [PATCH 5/9] Btrfs,raid56: use a variant to record the operation type Miao Xie
@ 2014-11-14 13:50 ` Miao Xie
  2014-11-14 13:50 ` [PATCH 7/9] Btrfs, replace: write dirty pages into the replace target device Miao Xie
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-14 13:50 UTC (permalink / raw)
  To: linux-btrfs

The implementation is:
- Read and check all the data with checksum in the same stripe.
  All the data which has checksum is COW data, and we are sure
  that it is not changed though we don't lock the stripe. because
  the space of that data just can be reclaimed after the current
  transction is committed, and then the fs can use it to store the
  other data, but when doing scrub, we hold the current transaction,
  that is that data can not be recovered, it is safe that read and check
  it out of the stripe lock.
- Lock the stripe
- Read out all the data without checksum and parity
  The data without checksum and the parity may be changed if we don't
  lock the stripe, so we need read it in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write back it if the old parity
  is not right
- Unlock the stripe

If we can not read out the data or the data we read is corrupted,
we will try to repair it. If the repair fails. we will mark the
horizontal sub-stripe(pages on the same horizontal) as corrupted
sub-stripe, and we will skip the parity check and repair of that
horizontal sub-stripe.

And in order to skip the horizontal sub-tripe that has no data, we
introduce a bitmap. If there is some data on the horizontal sub-stripe,
we will the relative bit to 1, and when we check and repair the
parity, we will skip those horizontal sub-stripes that the relative
bits is 0.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/raid56.c | 500 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/raid56.h |  12 ++
 fs/btrfs/scrub.c  | 599 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 1099 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index d550e9b..a13eb1b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -65,6 +65,7 @@
 enum btrfs_rbio_ops {
 	BTRFS_RBIO_WRITE	= 0,
 	BTRFS_RBIO_READ_REBUILD	= 1,
+	BTRFS_RBIO_PARITY_SCRUB	= 2,
 };
 
 struct btrfs_raid_bio {
@@ -123,6 +124,7 @@ struct btrfs_raid_bio {
 	/* number of data stripes (no p/q) */
 	int nr_data;
 
+	int stripe_npages;
 	/*
 	 * set if we're doing a parity rebuild
 	 * for a read from higher up, which is handled
@@ -137,6 +139,7 @@ struct btrfs_raid_bio {
 	/* second bad stripe (for raid6 use) */
 	int failb;
 
+	int scrubp;
 	/*
 	 * number of pages needed to represent the full
 	 * stripe
@@ -171,6 +174,11 @@ struct btrfs_raid_bio {
 	 * here for faster lookup
 	 */
 	struct page **bio_pages;
+
+	/*
+	 * bitmap to record which horizontal stripe has data
+	 */
+	unsigned long *dbitmap;
 };
 
 static int __raid56_parity_recover(struct btrfs_raid_bio *rbio);
@@ -185,6 +193,8 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio);
 static void index_rbio_pages(struct btrfs_raid_bio *rbio);
 static int alloc_rbio_pages(struct btrfs_raid_bio *rbio);
 
+static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
+					 int need_check);
 /*
  * the stripe hash table is used for locking, and to collect
  * bios in hopes of making a full stripe
@@ -950,9 +960,11 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	struct btrfs_raid_bio *rbio;
 	int nr_data = 0;
 	int num_pages = rbio_nr_pages(stripe_len, bbio->num_stripes);
+	int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
 	void *p;
 
-	rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
+	rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2 +
+		       DIV_ROUND_UP(stripe_npages, BITS_PER_LONG / 8),
 			GFP_NOFS);
 	if (!rbio)
 		return ERR_PTR(-ENOMEM);
@@ -967,6 +979,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	rbio->fs_info = root->fs_info;
 	rbio->stripe_len = stripe_len;
 	rbio->nr_pages = num_pages;
+	rbio->stripe_npages = stripe_npages;
 	rbio->faila = -1;
 	rbio->failb = -1;
 	atomic_set(&rbio->refs, 1);
@@ -980,6 +993,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	p = rbio + 1;
 	rbio->stripe_pages = p;
 	rbio->bio_pages = p + sizeof(struct page *) * num_pages;
+	rbio->dbitmap = p + sizeof(struct page *) * num_pages * 2;
 
 	if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
 		nr_data = bbio->num_stripes - 2;
@@ -1774,6 +1788,14 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 	index_rbio_pages(rbio);
 
 	for (pagenr = 0; pagenr < nr_pages; pagenr++) {
+		/*
+		 * Now we just use bitmap to mark the horizontal stripes in
+		 * which we have data when doing parity scrub.
+		 */
+		if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB &&
+		    !test_bit(pagenr, rbio->dbitmap))
+			continue;
+
 		/* setup our array of pointers with pages
 		 * from each stripe
 		 */
@@ -1918,7 +1940,13 @@ cleanup_io:
 	} else if (err == 0) {
 		rbio->faila = -1;
 		rbio->failb = -1;
-		finish_rmw(rbio);
+
+		if (rbio->operation == BTRFS_RBIO_WRITE)
+			finish_rmw(rbio);
+		else if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB)
+			finish_parity_scrub(rbio, 0);
+		else
+			BUG();
 	} else {
 		rbio_orig_end_io(rbio, err, 0);
 	}
@@ -2126,3 +2154,471 @@ static void read_rebuild_work(struct btrfs_work *work)
 	rbio = container_of(work, struct btrfs_raid_bio, work);
 	__raid56_parity_recover(rbio);
 }
+
+/*
+ * The following code is used to scrub/replace the parity stripe
+ *
+ * Note: We need make sure all the pages that add into the scrub/replace
+ * raid bio are correct and not be changed during the scrub/replace. That
+ * is those pages just hold metadata or file data with checksum.
+ */
+
+struct btrfs_raid_bio *
+raid56_parity_alloc_scrub_rbio(struct btrfs_root *root, struct bio *bio,
+			       struct btrfs_bio *bbio, u64 *raid_map,
+			       u64 stripe_len, struct btrfs_device *scrub_dev,
+			       unsigned long *dbitmap, int stripe_nsectors)
+{
+	struct btrfs_raid_bio *rbio;
+	int i;
+
+	rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
+	if (IS_ERR(rbio))
+		return NULL;
+	bio_list_add(&rbio->bio_list, bio);
+	/*
+	 * This is a special bio which is used to hold the completion handler
+	 * and make the scrub rbio is similar to the other types
+	 */
+	ASSERT(!bio->bi_iter.bi_size);
+	rbio->operation = BTRFS_RBIO_PARITY_SCRUB;
+	/*
+	 * We've need read the full stripe from the drive.
+	 * check and repair the parity and write the new results.
+	 *
+	 * We're not allowed to add any new bios to the
+	 * bio list here, anyone else that wants to
+	 * change this stripe needs to do their own rmw.
+	 */
+	set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
+
+	for (i = 0; i < bbio->num_stripes; i++) {
+		if (bbio->stripes[i].dev == scrub_dev) {
+			rbio->scrubp = i;
+			break;
+		}
+	}
+
+	/* Now we just support the sectorsize equals to page size */
+	ASSERT(root->sectorsize == PAGE_SIZE);
+	ASSERT(rbio->stripe_npages == stripe_nsectors);
+	bitmap_copy(rbio->dbitmap, dbitmap, stripe_nsectors);
+
+	return rbio;
+}
+
+void raid56_parity_add_scrub_pages(struct btrfs_raid_bio *rbio,
+				   struct page *page, u64 logical)
+{
+	int stripe_offset;
+	int index;
+
+	ASSERT(logical >= rbio->raid_map[0]);
+	ASSERT(logical + PAGE_SIZE <= rbio->raid_map[0] +
+				rbio->stripe_len * rbio->nr_data);
+	stripe_offset = (int)(logical - rbio->raid_map[0]);
+	index = stripe_offset >> PAGE_CACHE_SHIFT;
+	rbio->bio_pages[index] = page;
+}
+
+/*
+ * We just scrub the parity that we have correct data on the same horizontal,
+ * so we needn't allocate all pages for all the stripes.
+ */
+static int alloc_rbio_essential_pages(struct btrfs_raid_bio *rbio)
+{
+	int i;
+	int bit;
+	int index;
+	struct page *page;
+
+	for_each_set_bit(bit, rbio->dbitmap, rbio->stripe_npages) {
+		for (i = 0; i < rbio->bbio->num_stripes; i++) {
+			index = i * rbio->stripe_npages + bit;
+			if (rbio->stripe_pages[index])
+				continue;
+
+			page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+			if (!page)
+				return -ENOMEM;
+			rbio->stripe_pages[index] = page;
+			ClearPageUptodate(page);
+		}
+	}
+	return 0;
+}
+
+/*
+ * end io function used by finish_rmw.  When we finally
+ * get here, we've written a full stripe
+ */
+static void raid_write_parity_end_io(struct bio *bio, int err)
+{
+	struct btrfs_raid_bio *rbio = bio->bi_private;
+
+	if (err)
+		fail_bio_stripe(rbio, bio);
+
+	bio_put(bio);
+
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
+		return;
+
+	err = 0;
+
+	if (atomic_read(&rbio->error))
+		err = -EIO;
+
+	rbio_orig_end_io(rbio, err, 0);
+}
+
+static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
+					 int need_check)
+{
+	struct btrfs_bio *bbio = rbio->bbio;
+	void *pointers[bbio->num_stripes];
+	int nr_data = rbio->nr_data;
+	int stripe;
+	int pagenr;
+	int p_stripe = -1;
+	int q_stripe = -1;
+	struct page *p_page = NULL;
+	struct page *q_page = NULL;
+	struct bio_list bio_list;
+	struct bio *bio;
+	int ret;
+
+	bio_list_init(&bio_list);
+
+	if (bbio->num_stripes - rbio->nr_data == 1) {
+		p_stripe = bbio->num_stripes - 1;
+	} else if (bbio->num_stripes - rbio->nr_data == 2) {
+		p_stripe = bbio->num_stripes - 2;
+		q_stripe = bbio->num_stripes - 1;
+	} else {
+		BUG();
+	}
+
+	/*
+	 * Because the higher layers(scrubber) are unlikely to
+	 * use this area of the disk again soon, so don't cache
+	 * it.
+	 */
+	clear_bit(RBIO_CACHE_READY_BIT, &rbio->flags);
+
+	if (!need_check)
+		goto writeback;
+
+	p_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+	if (!p_page)
+		goto cleanup;
+	SetPageUptodate(p_page);
+
+	if (q_stripe != -1) {
+		q_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (!q_page) {
+			__free_page(p_page);
+			goto cleanup;
+		}
+		SetPageUptodate(q_page);
+	}
+
+	atomic_set(&rbio->error, 0);
+
+	for_each_set_bit(pagenr, rbio->dbitmap, rbio->stripe_npages) {
+		struct page *p;
+		void *parity;
+		/* first collect one page from each data stripe */
+		for (stripe = 0; stripe < nr_data; stripe++) {
+			p = page_in_rbio(rbio, stripe, pagenr, 0);
+			pointers[stripe] = kmap(p);
+		}
+
+		/* then add the parity stripe */
+		pointers[stripe++] = kmap(p_page);
+
+		if (q_stripe != -1) {
+
+			/*
+			 * raid6, add the qstripe and call the
+			 * library function to fill in our p/q
+			 */
+			pointers[stripe++] = kmap(q_page);
+
+			raid6_call.gen_syndrome(bbio->num_stripes, PAGE_SIZE,
+						pointers);
+		} else {
+			/* raid5 */
+			memcpy(pointers[nr_data], pointers[0], PAGE_SIZE);
+			run_xor(pointers + 1, nr_data - 1, PAGE_CACHE_SIZE);
+		}
+
+		/* Check scrubbing pairty and repair it */
+		p = rbio_stripe_page(rbio, rbio->scrubp, pagenr);
+		parity = kmap(p);
+		if (memcmp(parity, pointers[rbio->scrubp], PAGE_CACHE_SIZE))
+			memcpy(parity, pointers[rbio->scrubp], PAGE_CACHE_SIZE);
+		else
+			/* Parity is right, needn't writeback */
+			bitmap_clear(rbio->dbitmap, pagenr, 1);
+		kunmap(p);
+
+		for (stripe = 0; stripe < bbio->num_stripes; stripe++)
+			kunmap(page_in_rbio(rbio, stripe, pagenr, 0));
+	}
+
+	__free_page(p_page);
+	if (q_page)
+		__free_page(q_page);
+
+writeback:
+	/*
+	 * time to start writing.  Make bios for everything from the
+	 * higher layers (the bio_list in our rbio) and our p/q.  Ignore
+	 * everything else.
+	 */
+	for_each_set_bit(pagenr, rbio->dbitmap, rbio->stripe_npages) {
+		struct page *page;
+
+		page = rbio_stripe_page(rbio, rbio->scrubp, pagenr);
+		ret = rbio_add_io_page(rbio, &bio_list,
+			       page, rbio->scrubp, pagenr, rbio->stripe_len);
+		if (ret)
+			goto cleanup;
+	}
+
+	nr_data = bio_list_size(&bio_list);
+	if (!nr_data) {
+		/* Every parity is right */
+		rbio_orig_end_io(rbio, 0, 0);
+		return;
+	}
+
+	atomic_set(&rbio->stripes_pending, nr_data);
+
+	while (1) {
+		bio = bio_list_pop(&bio_list);
+		if (!bio)
+			break;
+
+		bio->bi_private = rbio;
+		bio->bi_end_io = raid_write_parity_end_io;
+		BUG_ON(!test_bit(BIO_UPTODATE, &bio->bi_flags));
+		submit_bio(WRITE, bio);
+	}
+	return;
+
+cleanup:
+	rbio_orig_end_io(rbio, -EIO, 0);
+}
+
+static inline int is_data_stripe(struct btrfs_raid_bio *rbio, int stripe)
+{
+	if (stripe >= 0 && stripe < rbio->nr_data)
+		return 1;
+	return 0;
+}
+
+/*
+ * While we're doing the parity check and repair, we could have errors
+ * in reading pages off the disk.  This checks for errors and if we're
+ * not able to read the page it'll trigger parity reconstruction.  The
+ * parity scrub will be finished after we've reconstructed the failed
+ * stripes
+ */
+static void validate_rbio_for_parity_scrub(struct btrfs_raid_bio *rbio)
+{
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
+		goto cleanup;
+
+	if (rbio->faila >= 0 || rbio->failb >= 0) {
+		int dfail = 0, failp = -1;
+
+		if (is_data_stripe(rbio, rbio->faila))
+			dfail++;
+		else if (is_parity_stripe(rbio->faila))
+			failp = rbio->faila;
+
+		if (is_data_stripe(rbio, rbio->failb))
+			dfail++;
+		else if (is_parity_stripe(rbio->failb))
+			failp = rbio->failb;
+
+		/*
+		 * Because we can not use a scrubbing parity to repair
+		 * the data, so the capability of the repair is declined.
+		 * (In the case of RAID5, we can not repair anything)
+		 */
+		if (dfail > rbio->bbio->max_errors - 1)
+			goto cleanup;
+
+		/*
+		 * If all data is good, only parity is correctly, just
+		 * repair the parity.
+		 */
+		if (dfail == 0) {
+			finish_parity_scrub(rbio, 0);
+			return;
+		}
+
+		/*
+		 * Here means we got one corrupted data stripe and one
+		 * corrupted parity on RAID6, if the corrupted parity
+		 * is scrubbing parity, luckly, use the other one to repair
+		 * the data, or we can not repair the data stripe.
+		 */
+		if (failp != rbio->scrubp)
+			goto cleanup;
+
+		__raid_recover_end_io(rbio);
+	} else {
+		finish_parity_scrub(rbio, 1);
+	}
+	return;
+
+cleanup:
+	rbio_orig_end_io(rbio, -EIO, 0);
+}
+
+/*
+ * end io for the read phase of the rmw cycle.  All the bios here are physical
+ * stripe bios we've read from the disk so we can recalculate the parity of the
+ * stripe.
+ *
+ * This will usually kick off finish_rmw once all the bios are read in, but it
+ * may trigger parity reconstruction if we had any errors along the way
+ */
+static void raid56_parity_scrub_end_io(struct bio *bio, int err)
+{
+	struct btrfs_raid_bio *rbio = bio->bi_private;
+
+	if (err)
+		fail_bio_stripe(rbio, bio);
+	else
+		set_bio_pages_uptodate(bio);
+
+	bio_put(bio);
+
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
+		return;
+
+	/*
+	 * this will normally call finish_rmw to start our write
+	 * but if there are any failed stripes we'll reconstruct
+	 * from parity first
+	 */
+	validate_rbio_for_parity_scrub(rbio);
+}
+
+static void raid56_parity_scrub_stripe(struct btrfs_raid_bio *rbio)
+{
+	int bios_to_read = 0;
+	struct btrfs_bio *bbio = rbio->bbio;
+	struct bio_list bio_list;
+	int ret;
+	int pagenr;
+	int stripe;
+	struct bio *bio;
+
+	ret = alloc_rbio_essential_pages(rbio);
+	if (ret)
+		goto cleanup;
+
+	bio_list_init(&bio_list);
+
+	atomic_set(&rbio->error, 0);
+	/*
+	 * build a list of bios to read all the missing parts of this
+	 * stripe
+	 */
+	for (stripe = 0; stripe < bbio->num_stripes; stripe++) {
+		for_each_set_bit(pagenr, rbio->dbitmap, rbio->stripe_npages) {
+			struct page *page;
+			/*
+			 * we want to find all the pages missing from
+			 * the rbio and read them from the disk.  If
+			 * page_in_rbio finds a page in the bio list
+			 * we don't need to read it off the stripe.
+			 */
+			page = page_in_rbio(rbio, stripe, pagenr, 1);
+			if (page)
+				continue;
+
+			page = rbio_stripe_page(rbio, stripe, pagenr);
+			/*
+			 * the bio cache may have handed us an uptodate
+			 * page.  If so, be happy and use it
+			 */
+			if (PageUptodate(page))
+				continue;
+
+			ret = rbio_add_io_page(rbio, &bio_list, page,
+				       stripe, pagenr, rbio->stripe_len);
+			if (ret)
+				goto cleanup;
+		}
+	}
+
+	bios_to_read = bio_list_size(&bio_list);
+	if (!bios_to_read) {
+		/*
+		 * this can happen if others have merged with
+		 * us, it means there is nothing left to read.
+		 * But if there are missing devices it may not be
+		 * safe to do the full stripe write yet.
+		 */
+		goto finish;
+	}
+
+	/*
+	 * the bbio may be freed once we submit the last bio.  Make sure
+	 * not to touch it after that
+	 */
+	atomic_set(&rbio->stripes_pending, bios_to_read);
+	while (1) {
+		bio = bio_list_pop(&bio_list);
+		if (!bio)
+			break;
+
+		bio->bi_private = rbio;
+		bio->bi_end_io = raid56_parity_scrub_end_io;
+
+		btrfs_bio_wq_end_io(rbio->fs_info, bio,
+				    BTRFS_WQ_ENDIO_RAID56);
+
+		BUG_ON(!test_bit(BIO_UPTODATE, &bio->bi_flags));
+		submit_bio(READ, bio);
+	}
+	/* the actual write will happen once the reads are done */
+	return;
+
+cleanup:
+	rbio_orig_end_io(rbio, -EIO, 0);
+	return;
+
+finish:
+	validate_rbio_for_parity_scrub(rbio);
+}
+
+static void scrub_parity_work(struct btrfs_work *work)
+{
+	struct btrfs_raid_bio *rbio;
+
+	rbio = container_of(work, struct btrfs_raid_bio, work);
+	raid56_parity_scrub_stripe(rbio);
+}
+
+static void async_scrub_parity(struct btrfs_raid_bio *rbio)
+{
+	btrfs_init_work(&rbio->work, btrfs_rmw_helper,
+			scrub_parity_work, NULL, NULL);
+
+	btrfs_queue_work(rbio->fs_info->rmw_workers,
+			 &rbio->work);
+}
+
+void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio)
+{
+	if (!lock_stripe_add(rbio))
+		async_scrub_parity(rbio);
+}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index b310e8c..3d4ddb3 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -39,6 +39,9 @@ static inline int nr_data_stripes(struct map_lookup *map)
 #define is_parity_stripe(x) (((x) == RAID5_P_STRIPE) ||		\
 			     ((x) == RAID6_Q_STRIPE))
 
+struct btrfs_raid_bio;
+struct btrfs_device;
+
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 				 struct btrfs_bio *bbio, u64 *raid_map,
 				 u64 stripe_len, int mirror_num, int hold_bbio);
@@ -46,6 +49,15 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 			       struct btrfs_bio *bbio, u64 *raid_map,
 			       u64 stripe_len);
 
+struct btrfs_raid_bio *
+raid56_parity_alloc_scrub_rbio(struct btrfs_root *root, struct bio *bio,
+			       struct btrfs_bio *bbio, u64 *raid_map,
+			       u64 stripe_len, struct btrfs_device *scrub_dev,
+			       unsigned long *dbitmap, int stripe_nsectors);
+void raid56_parity_add_scrub_pages(struct btrfs_raid_bio *rbio,
+				   struct page *page, u64 logical);
+void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio);
+
 int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info);
 void btrfs_free_stripe_hash_table(struct btrfs_fs_info *info);
 #endif
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ca4b9eb..3ef1e1b 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -74,6 +74,7 @@ struct scrub_page {
 	struct scrub_block	*sblock;
 	struct page		*page;
 	struct btrfs_device	*dev;
+	struct list_head	list;
 	u64			flags;  /* extent flags */
 	u64			generation;
 	u64			logical;
@@ -114,14 +115,52 @@ struct scrub_block {
 	atomic_t		outstanding_pages;
 	atomic_t		ref_count; /* free mem on transition to zero */
 	struct scrub_ctx	*sctx;
+	struct scrub_parity	*sparity;
 	struct {
 		unsigned int	header_error:1;
 		unsigned int	checksum_error:1;
 		unsigned int	no_io_error_seen:1;
 		unsigned int	generation_error:1; /* also sets header_error */
+
+		/* The following is for the data used to check parity */
+		/* It is for the data with checksum */
+		unsigned int	data_corrected:1;
 	};
 };
 
+/* Used for the chunks with parity stripe such RAID5/6 */
+struct scrub_parity {
+	struct scrub_ctx	*sctx;
+
+	struct btrfs_device	*scrub_dev;
+
+	u64			logic_start;
+
+	u64			logic_end;
+
+	int			nsectors;
+
+	int			stripe_len;
+
+	atomic_t		ref_count;
+
+	struct list_head	spages;
+
+	/* Work of parity check and repair */
+	struct btrfs_work	work;
+
+	/* Mark the parity blocks which have data */
+	unsigned long		*dbitmap;
+
+	/*
+	 * Mark the parity blocks which have data, but errors happen when
+	 * read data or check data
+	 */
+	unsigned long		*ebitmap;
+
+	unsigned long		bitmap[0];
+};
+
 struct scrub_wr_ctx {
 	struct scrub_bio *wr_curr_bio;
 	struct btrfs_device *tgtdev;
@@ -227,6 +266,8 @@ static void scrub_block_get(struct scrub_block *sblock);
 static void scrub_block_put(struct scrub_block *sblock);
 static void scrub_page_get(struct scrub_page *spage);
 static void scrub_page_put(struct scrub_page *spage);
+static void scrub_parity_get(struct scrub_parity *sparity);
+static void scrub_parity_put(struct scrub_parity *sparity);
 static int scrub_add_page_to_rd_bio(struct scrub_ctx *sctx,
 				    struct scrub_page *spage);
 static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
@@ -943,6 +984,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		 */
 		spin_lock(&sctx->stat_lock);
 		sctx->stat.unverified_errors++;
+		sblock_to_check->data_corrected = 1;
 		spin_unlock(&sctx->stat_lock);
 
 		if (sctx->is_dev_replace)
@@ -1203,6 +1245,7 @@ nodatasum_case:
 corrected_error:
 			spin_lock(&sctx->stat_lock);
 			sctx->stat.corrected_errors++;
+			sblock_to_check->data_corrected = 1;
 			spin_unlock(&sctx->stat_lock);
 			printk_ratelimited_in_rcu(KERN_ERR
 				"BTRFS: fixed up error at logical %llu on dev %s\n",
@@ -1644,6 +1687,13 @@ static void scrub_write_block_to_dev_replace(struct scrub_block *sblock)
 {
 	int page_num;
 
+	/*
+	 * This block is used for the check of the parity on the source device,
+	 * so the data needn't be written into the destination device.
+	 */
+	if (sblock->sparity)
+		return;
+
 	for (page_num = 0; page_num < sblock->page_count; page_num++) {
 		int ret;
 
@@ -2025,6 +2075,9 @@ static void scrub_block_put(struct scrub_block *sblock)
 	if (atomic_dec_and_test(&sblock->ref_count)) {
 		int i;
 
+		if (sblock->sparity)
+			scrub_parity_put(sblock->sparity);
+
 		for (i = 0; i < sblock->page_count; i++)
 			scrub_page_put(sblock->pagev[i]);
 		kfree(sblock);
@@ -2282,9 +2335,51 @@ static void scrub_bio_end_io_worker(struct btrfs_work *work)
 	scrub_pending_bio_dec(sctx);
 }
 
+static inline void __scrub_mark_bitmap(struct scrub_parity *sparity,
+				       unsigned long *bitmap,
+				       u64 start, u64 len)
+{
+	int offset;
+	int nsectors;
+	int sectorsize = sparity->sctx->dev_root->sectorsize;
+
+	if (len >= sparity->stripe_len) {
+		bitmap_set(bitmap, 0, sparity->nsectors);
+		return;
+	}
+
+	start -= sparity->logic_start;
+	offset = (int)do_div(start, sparity->stripe_len);
+	offset /= sectorsize;
+	nsectors = (int)len / sectorsize;
+
+	if (offset + nsectors <= sparity->nsectors) {
+		bitmap_set(bitmap, offset, nsectors);
+		return;
+	}
+
+	bitmap_set(bitmap, offset, sparity->nsectors - offset);
+	bitmap_set(bitmap, 0, nsectors - (sparity->nsectors - offset));
+}
+
+static inline void scrub_parity_mark_sectors_error(struct scrub_parity *sparity,
+						   u64 start, u64 len)
+{
+	__scrub_mark_bitmap(sparity, sparity->ebitmap, start, len);
+}
+
+static inline void scrub_parity_mark_sectors_data(struct scrub_parity *sparity,
+						  u64 start, u64 len)
+{
+	__scrub_mark_bitmap(sparity, sparity->dbitmap, start, len);
+}
+
 static void scrub_block_complete(struct scrub_block *sblock)
 {
+	int corrupted = 0;
+
 	if (!sblock->no_io_error_seen) {
+		corrupted = 1;
 		scrub_handle_errored_block(sblock);
 	} else {
 		/*
@@ -2292,9 +2387,19 @@ static void scrub_block_complete(struct scrub_block *sblock)
 		 * dev replace case, otherwise write here in dev replace
 		 * case.
 		 */
-		if (!scrub_checksum(sblock) && sblock->sctx->is_dev_replace)
+		corrupted = scrub_checksum(sblock);
+		if (!corrupted && sblock->sctx->is_dev_replace)
 			scrub_write_block_to_dev_replace(sblock);
 	}
+
+	if (sblock->sparity && corrupted && !sblock->data_corrected) {
+		u64 start = sblock->pagev[0]->logical;
+		u64 end = sblock->pagev[sblock->page_count - 1]->logical +
+			  PAGE_SIZE;
+
+		scrub_parity_mark_sectors_error(sblock->sparity,
+						start, end - start);
+	}
 }
 
 static int scrub_find_csum(struct scrub_ctx *sctx, u64 logical, u64 len,
@@ -2386,6 +2491,132 @@ behind_scrub_pages:
 	return 0;
 }
 
+static int scrub_pages_for_parity(struct scrub_parity *sparity,
+				  u64 logical, u64 len,
+				  u64 physical, struct btrfs_device *dev,
+				  u64 flags, u64 gen, int mirror_num, u8 *csum)
+{
+	struct scrub_ctx *sctx = sparity->sctx;
+	struct scrub_block *sblock;
+	int index;
+
+	sblock = kzalloc(sizeof(*sblock), GFP_NOFS);
+	if (!sblock) {
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.malloc_errors++;
+		spin_unlock(&sctx->stat_lock);
+		return -ENOMEM;
+	}
+
+	/* one ref inside this function, plus one for each page added to
+	 * a bio later on */
+	atomic_set(&sblock->ref_count, 1);
+	sblock->sctx = sctx;
+	sblock->no_io_error_seen = 1;
+	sblock->sparity = sparity;
+	scrub_parity_get(sparity);
+
+	for (index = 0; len > 0; index++) {
+		struct scrub_page *spage;
+		u64 l = min_t(u64, len, PAGE_SIZE);
+
+		spage = kzalloc(sizeof(*spage), GFP_NOFS);
+		if (!spage) {
+leave_nomem:
+			spin_lock(&sctx->stat_lock);
+			sctx->stat.malloc_errors++;
+			spin_unlock(&sctx->stat_lock);
+			scrub_block_put(sblock);
+			return -ENOMEM;
+		}
+		BUG_ON(index >= SCRUB_MAX_PAGES_PER_BLOCK);
+		/* For scrub block */
+		scrub_page_get(spage);
+		sblock->pagev[index] = spage;
+		/* For scrub parity */
+		scrub_page_get(spage);
+		list_add_tail(&spage->list, &sparity->spages);
+		spage->sblock = sblock;
+		spage->dev = dev;
+		spage->flags = flags;
+		spage->generation = gen;
+		spage->logical = logical;
+		spage->physical = physical;
+		spage->mirror_num = mirror_num;
+		if (csum) {
+			spage->have_csum = 1;
+			memcpy(spage->csum, csum, sctx->csum_size);
+		} else {
+			spage->have_csum = 0;
+		}
+		sblock->page_count++;
+		spage->page = alloc_page(GFP_NOFS);
+		if (!spage->page)
+			goto leave_nomem;
+		len -= l;
+		logical += l;
+		physical += l;
+	}
+
+	WARN_ON(sblock->page_count == 0);
+	for (index = 0; index < sblock->page_count; index++) {
+		struct scrub_page *spage = sblock->pagev[index];
+		int ret;
+
+		ret = scrub_add_page_to_rd_bio(sctx, spage);
+		if (ret) {
+			scrub_block_put(sblock);
+			return ret;
+		}
+	}
+
+	/* last one frees, either here or in bio completion for last page */
+	scrub_block_put(sblock);
+	return 0;
+}
+
+static int scrub_extent_for_parity(struct scrub_parity *sparity,
+				   u64 logical, u64 len,
+				   u64 physical, struct btrfs_device *dev,
+				   u64 flags, u64 gen, int mirror_num)
+{
+	struct scrub_ctx *sctx = sparity->sctx;
+	int ret;
+	u8 csum[BTRFS_CSUM_SIZE];
+	u32 blocksize;
+
+	if (flags & BTRFS_EXTENT_FLAG_DATA) {
+		blocksize = sctx->sectorsize;
+	} else if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+		blocksize = sctx->nodesize;
+	} else {
+		blocksize = sctx->sectorsize;
+		WARN_ON(1);
+	}
+
+	while (len) {
+		u64 l = min_t(u64, len, blocksize);
+		int have_csum = 0;
+
+		if (flags & BTRFS_EXTENT_FLAG_DATA) {
+			/* push csums to sbio */
+			have_csum = scrub_find_csum(sctx, logical, l, csum);
+			if (have_csum == 0)
+				goto skip;
+		}
+		ret = scrub_pages_for_parity(sparity, logical, l, physical, dev,
+					     flags, gen, mirror_num,
+					     have_csum ? csum : NULL);
+skip:
+		if (ret)
+			return ret;
+		len -= l;
+		logical += l;
+		physical += l;
+	}
+	return 0;
+}
+
 /*
  * Given a physical address, this will calculate it's
  * logical offset. if this is a parity stripe, it will return
@@ -2427,13 +2658,330 @@ static int get_raid56_logic_offset(u64 physical, int num,
 	return 1;
 }
 
+static void scrub_free_parity(struct scrub_parity *sparity)
+{
+	struct scrub_ctx *sctx = sparity->sctx;
+	struct scrub_page *curr, *next;
+	int nbits;
+
+	nbits = bitmap_weight(sparity->ebitmap, sparity->nsectors);
+	if (nbits) {
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.read_errors += nbits;
+		sctx->stat.uncorrectable_errors += nbits;
+		spin_unlock(&sctx->stat_lock);
+	}
+
+	list_for_each_entry_safe(curr, next, &sparity->spages, list) {
+		list_del_init(&curr->list);
+		scrub_page_put(curr);
+	}
+
+	kfree(sparity);
+}
+
+static void scrub_parity_bio_endio(struct bio *bio, int error)
+{
+	struct scrub_parity *sparity = (struct scrub_parity *)bio->bi_private;
+	struct scrub_ctx *sctx = sparity->sctx;
+
+	if (error)
+		bitmap_or(sparity->ebitmap, sparity->ebitmap, sparity->dbitmap,
+			  sparity->nsectors);
+
+	scrub_free_parity(sparity);
+	scrub_pending_bio_dec(sctx);
+	bio_put(bio);
+}
+
+static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
+{
+	struct scrub_ctx *sctx = sparity->sctx;
+	struct bio *bio;
+	struct btrfs_raid_bio *rbio;
+	struct scrub_page *spage;
+	struct btrfs_bio *bbio = NULL;
+	u64 *raid_map = NULL;
+	u64 length;
+	int ret;
+
+	if (!bitmap_andnot(sparity->dbitmap, sparity->dbitmap, sparity->ebitmap,
+			   sparity->nsectors))
+		goto out;
+
+	length = sparity->logic_end - sparity->logic_start + 1;
+	ret = btrfs_map_sblock(sctx->dev_root->fs_info, REQ_GET_READ_MIRRORS,
+			       sparity->logic_start,
+			       &length, &bbio, 0, &raid_map);
+	if (ret || !bbio || !raid_map)
+		goto bbio_out;
+
+	bio = btrfs_io_bio_alloc(GFP_NOFS, 0);
+	if (!bio)
+		goto bbio_out;
+
+	bio->bi_iter.bi_sector = sparity->logic_start >> 9;
+	bio->bi_private = sparity;
+	bio->bi_end_io = scrub_parity_bio_endio;
+
+	rbio = raid56_parity_alloc_scrub_rbio(sctx->dev_root, bio, bbio,
+					      raid_map, length,
+					      sparity->scrub_dev,
+					      sparity->dbitmap,
+					      sparity->nsectors);
+	if (!rbio)
+		goto rbio_out;
+
+	list_for_each_entry(spage, &sparity->spages, list)
+		raid56_parity_add_scrub_pages(rbio, spage->page,
+					      spage->logical);
+
+	scrub_pending_bio_inc(sctx);
+	raid56_parity_submit_scrub_rbio(rbio);
+	return;
+
+rbio_out:
+	bio_put(bio);
+bbio_out:
+	kfree(bbio);
+	kfree(raid_map);
+	bitmap_or(sparity->ebitmap, sparity->ebitmap, sparity->dbitmap,
+		  sparity->nsectors);
+	spin_lock(&sctx->stat_lock);
+	sctx->stat.malloc_errors++;
+	spin_unlock(&sctx->stat_lock);
+out:
+	scrub_free_parity(sparity);
+}
+
+static inline int scrub_calc_parity_bitmap_len(int nsectors)
+{
+	return DIV_ROUND_UP(nsectors, BITS_PER_LONG) * (BITS_PER_LONG / 8);
+}
+
+static void scrub_parity_get(struct scrub_parity *sparity)
+{
+	atomic_inc(&sparity->ref_count);
+}
+
+static void scrub_parity_put(struct scrub_parity *sparity)
+{
+	if (!atomic_dec_and_test(&sparity->ref_count))
+		return;
+
+	scrub_parity_check_and_repair(sparity);
+}
+
+static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
+						  struct map_lookup *map,
+						  struct btrfs_device *sdev,
+						  struct btrfs_path *path,
+						  u64 logic_start,
+						  u64 logic_end)
+{
+	struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
+	struct btrfs_root *root = fs_info->extent_root;
+	struct btrfs_root *csum_root = fs_info->csum_root;
+	struct btrfs_extent_item *extent;
+	u64 flags;
+	int ret;
+	int slot;
+	struct extent_buffer *l;
+	struct btrfs_key key;
+	u64 generation;
+	u64 extent_logical;
+	u64 extent_physical;
+	u64 extent_len;
+	struct btrfs_device *extent_dev;
+	struct scrub_parity *sparity;
+	int nsectors;
+	int bitmap_len;
+	int extent_mirror_num;
+	int stop_loop = 0;
+
+	nsectors = map->stripe_len / root->sectorsize;
+	bitmap_len = scrub_calc_parity_bitmap_len(nsectors);
+	sparity = kzalloc(sizeof(struct scrub_parity) + 2 * bitmap_len,
+			  GFP_NOFS);
+	if (!sparity) {
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.malloc_errors++;
+		spin_unlock(&sctx->stat_lock);
+		return -ENOMEM;
+	}
+
+	sparity->stripe_len = map->stripe_len;
+	sparity->nsectors = nsectors;
+	sparity->sctx = sctx;
+	sparity->scrub_dev = sdev;
+	sparity->logic_start = logic_start;
+	sparity->logic_end = logic_end;
+	atomic_set(&sparity->ref_count, 1);
+	INIT_LIST_HEAD(&sparity->spages);
+	sparity->dbitmap = sparity->bitmap;
+	sparity->ebitmap = (void *)sparity->bitmap + bitmap_len;
+
+	ret = 0;
+	while (logic_start < logic_end) {
+		if (btrfs_fs_incompat(fs_info, SKINNY_METADATA))
+			key.type = BTRFS_METADATA_ITEM_KEY;
+		else
+			key.type = BTRFS_EXTENT_ITEM_KEY;
+		key.objectid = logic_start;
+		key.offset = (u64)-1;
+
+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+		if (ret < 0)
+			goto out;
+
+		if (ret > 0) {
+			ret = btrfs_previous_extent_item(root, path, 0);
+			if (ret < 0)
+				goto out;
+			if (ret > 0) {
+				btrfs_release_path(path);
+				ret = btrfs_search_slot(NULL, root, &key,
+							path, 0, 0);
+				if (ret < 0)
+					goto out;
+			}
+		}
+
+		stop_loop = 0;
+		while (1) {
+			u64 bytes;
+
+			l = path->nodes[0];
+			slot = path->slots[0];
+			if (slot >= btrfs_header_nritems(l)) {
+				ret = btrfs_next_leaf(root, path);
+				if (ret == 0)
+					continue;
+				if (ret < 0)
+					goto out;
+
+				stop_loop = 1;
+				break;
+			}
+			btrfs_item_key_to_cpu(l, &key, slot);
+
+			if (key.type == BTRFS_METADATA_ITEM_KEY)
+				bytes = root->nodesize;
+			else
+				bytes = key.offset;
+
+			if (key.objectid + bytes <= logic_start)
+				goto next;
+
+			if (key.type != BTRFS_EXTENT_ITEM_KEY &&
+			    key.type != BTRFS_METADATA_ITEM_KEY)
+				goto next;
+
+			if (key.objectid > logic_end) {
+				stop_loop = 1;
+				break;
+			}
+
+			while (key.objectid >= logic_start + map->stripe_len)
+				logic_start += map->stripe_len;
+
+			extent = btrfs_item_ptr(l, slot,
+						struct btrfs_extent_item);
+			flags = btrfs_extent_flags(l, extent);
+			generation = btrfs_extent_generation(l, extent);
+
+			if (key.objectid < logic_start &&
+			    (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)) {
+				btrfs_err(fs_info,
+					  "scrub: tree block %llu spanning stripes, ignored. logical=%llu",
+					   key.objectid, logic_start);
+				goto next;
+			}
+again:
+			extent_logical = key.objectid;
+			extent_len = bytes;
+
+			if (extent_logical < logic_start) {
+				extent_len -= logic_start - extent_logical;
+				extent_logical = logic_start;
+			}
+
+			if (extent_logical + extent_len >
+			    logic_start + map->stripe_len)
+				extent_len = logic_start + map->stripe_len -
+					     extent_logical;
+
+			scrub_parity_mark_sectors_data(sparity, extent_logical,
+						       extent_len);
+
+			scrub_remap_extent(fs_info, extent_logical,
+					   extent_len, &extent_physical,
+					   &extent_dev,
+					   &extent_mirror_num);
+
+			ret = btrfs_lookup_csums_range(csum_root,
+						extent_logical,
+						extent_logical + extent_len - 1,
+						&sctx->csum_list, 1);
+			if (ret)
+				goto out;
+
+			ret = scrub_extent_for_parity(sparity, extent_logical,
+						      extent_len,
+						      extent_physical,
+						      extent_dev, flags,
+						      generation,
+						      extent_mirror_num);
+			if (ret)
+				goto out;
+
+			scrub_free_csums(sctx);
+			if (extent_logical + extent_len <
+			    key.objectid + bytes) {
+				logic_start += map->stripe_len;
+
+				if (logic_start >= logic_end) {
+					stop_loop = 1;
+					break;
+				}
+
+				if (logic_start < key.objectid + bytes) {
+					cond_resched();
+					goto again;
+				}
+			}
+next:
+			path->slots[0]++;
+		}
+
+		btrfs_release_path(path);
+
+		if (stop_loop)
+			break;
+
+		logic_start += map->stripe_len;
+	}
+out:
+	if (ret < 0)
+		scrub_parity_mark_sectors_error(sparity, logic_start,
+						logic_end - logic_start + 1);
+	scrub_parity_put(sparity);
+	scrub_submit(sctx);
+	mutex_lock(&sctx->wr_ctx.wr_lock);
+	scrub_wr_submit(sctx);
+	mutex_unlock(&sctx->wr_ctx.wr_lock);
+
+	btrfs_release_path(path);
+	return ret < 0 ? ret : 0;
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct map_lookup *map,
 					   struct btrfs_device *scrub_dev,
 					   int num, u64 base, u64 length,
 					   int is_dev_replace)
 {
-	struct btrfs_path *path;
+	struct btrfs_path *path, *ppath;
 	struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
 	struct btrfs_root *root = fs_info->extent_root;
 	struct btrfs_root *csum_root = fs_info->csum_root;
@@ -2460,6 +3008,8 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	u64 extent_logical;
 	u64 extent_physical;
 	u64 extent_len;
+	u64 stripe_logical;
+	u64 stripe_end;
 	struct btrfs_device *extent_dev;
 	int extent_mirror_num;
 	int stop_loop = 0;
@@ -2497,6 +3047,12 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	if (!path)
 		return -ENOMEM;
 
+	ppath = btrfs_alloc_path();
+	if (!ppath) {
+		btrfs_free_path(ppath);
+		return -ENOMEM;
+	}
+
 	/*
 	 * work on commit root. The related disk blocks are static as
 	 * long as COW is applied. This means, it is save to rewrite
@@ -2564,8 +3120,17 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 			ret = get_raid56_logic_offset(physical, num,
 					map, &logical);
 			logical += base;
-			if (ret)
+			if (ret) {
+				stripe_logical = logical;
+				stripe_logical -= num * map->stripe_len;
+				stripe_end = stripe_logical + increment - 1;
+				ret = scrub_raid56_parity(sctx, map, scrub_dev,
+						ppath, stripe_logical,
+						stripe_end);
+				if (ret)
+					goto out;
 				goto skip;
+			}
 		}
 		/*
 		 * canceled?
@@ -2716,13 +3281,26 @@ again:
 					 * loop until we find next data stripe
 					 * or we have finished all stripes.
 					 */
-					do {
-						physical += map->stripe_len;
-						ret = get_raid56_logic_offset(
-								physical, num,
-								map, &logical);
-						logical += base;
-					} while (physical < physical_end && ret);
+loop:
+					physical += map->stripe_len;
+					ret = get_raid56_logic_offset(physical,
+							num, map, &logical);
+					logical += base;
+
+					if (ret && physical < physical_end) {
+						stripe_logical = logical;
+						stripe_logical -= num *
+								map->stripe_len;
+						stripe_end = stripe_logical +
+								increment - 1;
+						ret = scrub_raid56_parity(sctx,
+							map, scrub_dev, ppath,
+							stripe_logical,
+							stripe_end);
+						if (ret)
+							goto out;
+						goto loop;
+					}
 				} else {
 					physical += map->stripe_len;
 					logical += increment;
@@ -2763,6 +3341,7 @@ out:
 
 	blk_finish_plug(&plug);
 	btrfs_free_path(path);
+	btrfs_free_path(ppath);
 	return ret < 0 ? ret : 0;
 }
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 7/9] Btrfs, replace: write dirty pages into the replace target device
  2014-11-14 13:50 [PATCH 0/9] Implement device scrub/replace for RAID56 Miao Xie
                   ` (5 preceding siblings ...)
  2014-11-14 13:50 ` [PATCH 6/9] Btrfs,raid56: support parity scrub on raid56 Miao Xie
@ 2014-11-14 13:50 ` Miao Xie
  2014-11-14 13:51 ` [PATCH 8/9] Btrfs, replace: write raid56 parity " Miao Xie
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-14 13:50 UTC (permalink / raw)
  To: linux-btrfs

The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in btrfs bio, and we sort those target
  device stripes in the array. And we keep the number of the target
  device stripes in the btrfs bio.
- Except write operation on RAID56, all the other operation don't
  take the target device stripes into account.
- When we do write operation, we read the data from the common devices
  and calculate the parity. Then write the dirty data and new parity
  out, at this time, we will find the relative replace target stripes
  and wirte the relative data into it.

Note: The function that copying old data on the source device to
the target device was implemented in the past, it is similar to
the other RAID type.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/raid56.c  | 104 +++++++++++++++++++++++++++++++++--------------------
 fs/btrfs/volumes.c |  26 ++++++++++++--
 fs/btrfs/volumes.h |  10 ++++--
 3 files changed, 97 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index a13eb1b..7ad9546a 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -124,6 +124,8 @@ struct btrfs_raid_bio {
 	/* number of data stripes (no p/q) */
 	int nr_data;
 
+	int real_stripes;
+
 	int stripe_npages;
 	/*
 	 * set if we're doing a parity rebuild
@@ -619,7 +621,7 @@ static struct page *rbio_pstripe_page(struct btrfs_raid_bio *rbio, int index)
  */
 static struct page *rbio_qstripe_page(struct btrfs_raid_bio *rbio, int index)
 {
-	if (rbio->nr_data + 1 == rbio->bbio->num_stripes)
+	if (rbio->nr_data + 1 == rbio->real_stripes)
 		return NULL;
 
 	index += ((rbio->nr_data + 1) * rbio->stripe_len) >>
@@ -959,7 +961,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 {
 	struct btrfs_raid_bio *rbio;
 	int nr_data = 0;
-	int num_pages = rbio_nr_pages(stripe_len, bbio->num_stripes);
+	int real_stripes = bbio->num_stripes - bbio->num_tgtdevs;
+	int num_pages = rbio_nr_pages(stripe_len, real_stripes);
 	int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
 	void *p;
 
@@ -979,6 +982,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	rbio->fs_info = root->fs_info;
 	rbio->stripe_len = stripe_len;
 	rbio->nr_pages = num_pages;
+	rbio->real_stripes = real_stripes;
 	rbio->stripe_npages = stripe_npages;
 	rbio->faila = -1;
 	rbio->failb = -1;
@@ -995,10 +999,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	rbio->bio_pages = p + sizeof(struct page *) * num_pages;
 	rbio->dbitmap = p + sizeof(struct page *) * num_pages * 2;
 
-	if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
-		nr_data = bbio->num_stripes - 2;
+	if (raid_map[real_stripes - 1] == RAID6_Q_STRIPE)
+		nr_data = real_stripes - 2;
 	else
-		nr_data = bbio->num_stripes - 1;
+		nr_data = real_stripes - 1;
 
 	rbio->nr_data = nr_data;
 	return rbio;
@@ -1110,7 +1114,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio)
 {
 	if (rbio->faila >= 0 || rbio->failb >= 0) {
-		BUG_ON(rbio->faila == rbio->bbio->num_stripes - 1);
+		BUG_ON(rbio->faila == rbio->real_stripes - 1);
 		__raid56_parity_recover(rbio);
 	} else {
 		finish_rmw(rbio);
@@ -1171,7 +1175,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 {
 	struct btrfs_bio *bbio = rbio->bbio;
-	void *pointers[bbio->num_stripes];
+	void *pointers[rbio->real_stripes];
 	int stripe_len = rbio->stripe_len;
 	int nr_data = rbio->nr_data;
 	int stripe;
@@ -1185,11 +1189,11 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 
 	bio_list_init(&bio_list);
 
-	if (bbio->num_stripes - rbio->nr_data == 1) {
-		p_stripe = bbio->num_stripes - 1;
-	} else if (bbio->num_stripes - rbio->nr_data == 2) {
-		p_stripe = bbio->num_stripes - 2;
-		q_stripe = bbio->num_stripes - 1;
+	if (rbio->real_stripes - rbio->nr_data == 1) {
+		p_stripe = rbio->real_stripes - 1;
+	} else if (rbio->real_stripes - rbio->nr_data == 2) {
+		p_stripe = rbio->real_stripes - 2;
+		q_stripe = rbio->real_stripes - 1;
 	} else {
 		BUG();
 	}
@@ -1246,7 +1250,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 			SetPageUptodate(p);
 			pointers[stripe++] = kmap(p);
 
-			raid6_call.gen_syndrome(bbio->num_stripes, PAGE_SIZE,
+			raid6_call.gen_syndrome(rbio->real_stripes, PAGE_SIZE,
 						pointers);
 		} else {
 			/* raid5 */
@@ -1255,7 +1259,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 		}
 
 
-		for (stripe = 0; stripe < bbio->num_stripes; stripe++)
+		for (stripe = 0; stripe < rbio->real_stripes; stripe++)
 			kunmap(page_in_rbio(rbio, stripe, pagenr, 0));
 	}
 
@@ -1264,7 +1268,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	 * higher layers (the bio_list in our rbio) and our p/q.  Ignore
 	 * everything else.
 	 */
-	for (stripe = 0; stripe < bbio->num_stripes; stripe++) {
+	for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
 		for (pagenr = 0; pagenr < pages_per_stripe; pagenr++) {
 			struct page *page;
 			if (stripe < rbio->nr_data) {
@@ -1282,6 +1286,32 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 		}
 	}
 
+	if (likely(!bbio->num_tgtdevs))
+		goto write_data;
+
+	for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
+		if (!bbio->tgtdev_map[stripe])
+			continue;
+
+		for (pagenr = 0; pagenr < pages_per_stripe; pagenr++) {
+			struct page *page;
+			if (stripe < rbio->nr_data) {
+				page = page_in_rbio(rbio, stripe, pagenr, 1);
+				if (!page)
+					continue;
+			} else {
+			       page = rbio_stripe_page(rbio, stripe, pagenr);
+			}
+
+			ret = rbio_add_io_page(rbio, &bio_list, page,
+					       rbio->bbio->tgtdev_map[stripe],
+					       pagenr, rbio->stripe_len);
+			if (ret)
+				goto cleanup;
+		}
+	}
+
+write_data:
 	atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
 	BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
@@ -1320,7 +1350,8 @@ static int find_bio_stripe(struct btrfs_raid_bio *rbio,
 		stripe = &rbio->bbio->stripes[i];
 		stripe_start = stripe->physical;
 		if (physical >= stripe_start &&
-		    physical < stripe_start + rbio->stripe_len) {
+		    physical < stripe_start + rbio->stripe_len &&
+		    bio->bi_bdev == stripe->dev->bdev) {
 			return i;
 		}
 	}
@@ -1769,7 +1800,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 	int err;
 	int i;
 
-	pointers = kzalloc(rbio->bbio->num_stripes * sizeof(void *),
+	pointers = kzalloc(rbio->real_stripes * sizeof(void *),
 			   GFP_NOFS);
 	if (!pointers) {
 		err = -ENOMEM;
@@ -1799,7 +1830,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 		/* setup our array of pointers with pages
 		 * from each stripe
 		 */
-		for (stripe = 0; stripe < rbio->bbio->num_stripes; stripe++) {
+		for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
 			/*
 			 * if we're rebuilding a read, we have to use
 			 * pages from the bio list
@@ -1814,7 +1845,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 		}
 
 		/* all raid6 handling here */
-		if (rbio->raid_map[rbio->bbio->num_stripes - 1] ==
+		if (rbio->raid_map[rbio->real_stripes - 1] ==
 		    RAID6_Q_STRIPE) {
 
 			/*
@@ -1864,10 +1895,10 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 			}
 
 			if (rbio->raid_map[failb] == RAID5_P_STRIPE) {
-				raid6_datap_recov(rbio->bbio->num_stripes,
+				raid6_datap_recov(rbio->real_stripes,
 						  PAGE_SIZE, faila, pointers);
 			} else {
-				raid6_2data_recov(rbio->bbio->num_stripes,
+				raid6_2data_recov(rbio->real_stripes,
 						  PAGE_SIZE, faila, failb,
 						  pointers);
 			}
@@ -1909,7 +1940,7 @@ pstripe:
 				}
 			}
 		}
-		for (stripe = 0; stripe < rbio->bbio->num_stripes; stripe++) {
+		for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
 			/*
 			 * if we're rebuilding a read, we have to use
 			 * pages from the bio list
@@ -1990,7 +2021,6 @@ static void raid_recover_end_io(struct bio *bio, int err)
 static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 {
 	int bios_to_read = 0;
-	struct btrfs_bio *bbio = rbio->bbio;
 	struct bio_list bio_list;
 	int ret;
 	int nr_pages = DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE);
@@ -2011,7 +2041,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 	 * stripe cache, it is possible that some or all of these
 	 * pages are going to be uptodate.
 	 */
-	for (stripe = 0; stripe < bbio->num_stripes; stripe++) {
+	for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
 		if (rbio->faila == stripe || rbio->failb == stripe) {
 			atomic_inc(&rbio->error);
 			continue;
@@ -2117,7 +2147,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 	 * asking for mirror 3
 	 */
 	if (mirror_num == 3)
-		rbio->failb = bbio->num_stripes - 2;
+		rbio->failb = rbio->real_stripes - 2;
 
 	ret = lock_stripe_add(rbio);
 
@@ -2192,7 +2222,7 @@ raid56_parity_alloc_scrub_rbio(struct btrfs_root *root, struct bio *bio,
 	 */
 	set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
 
-	for (i = 0; i < bbio->num_stripes; i++) {
+	for (i = 0; i < rbio->real_stripes; i++) {
 		if (bbio->stripes[i].dev == scrub_dev) {
 			rbio->scrubp = i;
 			break;
@@ -2233,7 +2263,7 @@ static int alloc_rbio_essential_pages(struct btrfs_raid_bio *rbio)
 	struct page *page;
 
 	for_each_set_bit(bit, rbio->dbitmap, rbio->stripe_npages) {
-		for (i = 0; i < rbio->bbio->num_stripes; i++) {
+		for (i = 0; i < rbio->real_stripes; i++) {
 			index = i * rbio->stripe_npages + bit;
 			if (rbio->stripe_pages[index])
 				continue;
@@ -2275,8 +2305,7 @@ static void raid_write_parity_end_io(struct bio *bio, int err)
 static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 					 int need_check)
 {
-	struct btrfs_bio *bbio = rbio->bbio;
-	void *pointers[bbio->num_stripes];
+	void *pointers[rbio->real_stripes];
 	int nr_data = rbio->nr_data;
 	int stripe;
 	int pagenr;
@@ -2290,11 +2319,11 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 
 	bio_list_init(&bio_list);
 
-	if (bbio->num_stripes - rbio->nr_data == 1) {
-		p_stripe = bbio->num_stripes - 1;
-	} else if (bbio->num_stripes - rbio->nr_data == 2) {
-		p_stripe = bbio->num_stripes - 2;
-		q_stripe = bbio->num_stripes - 1;
+	if (rbio->real_stripes - rbio->nr_data == 1) {
+		p_stripe = rbio->real_stripes - 1;
+	} else if (rbio->real_stripes - rbio->nr_data == 2) {
+		p_stripe = rbio->real_stripes - 2;
+		q_stripe = rbio->real_stripes - 1;
 	} else {
 		BUG();
 	}
@@ -2345,7 +2374,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 			 */
 			pointers[stripe++] = kmap(q_page);
 
-			raid6_call.gen_syndrome(bbio->num_stripes, PAGE_SIZE,
+			raid6_call.gen_syndrome(rbio->real_stripes, PAGE_SIZE,
 						pointers);
 		} else {
 			/* raid5 */
@@ -2363,7 +2392,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 			bitmap_clear(rbio->dbitmap, pagenr, 1);
 		kunmap(p);
 
-		for (stripe = 0; stripe < bbio->num_stripes; stripe++)
+		for (stripe = 0; stripe < rbio->real_stripes; stripe++)
 			kunmap(page_in_rbio(rbio, stripe, pagenr, 0));
 	}
 
@@ -2513,7 +2542,6 @@ static void raid56_parity_scrub_end_io(struct bio *bio, int err)
 static void raid56_parity_scrub_stripe(struct btrfs_raid_bio *rbio)
 {
 	int bios_to_read = 0;
-	struct btrfs_bio *bbio = rbio->bbio;
 	struct bio_list bio_list;
 	int ret;
 	int pagenr;
@@ -2531,7 +2559,7 @@ static void raid56_parity_scrub_stripe(struct btrfs_raid_bio *rbio)
 	 * build a list of bios to read all the missing parts of this
 	 * stripe
 	 */
-	for (stripe = 0; stripe < bbio->num_stripes; stripe++) {
+	for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
 		for_each_set_bit(pagenr, rbio->dbitmap, rbio->stripe_npages) {
 			struct page *page;
 			/*
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 05cf489..28d42ba 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4882,13 +4882,15 @@ static inline int parity_smaller(u64 a, u64 b)
 static void sort_parity_stripes(struct btrfs_bio *bbio, u64 *raid_map)
 {
 	struct btrfs_bio_stripe s;
+	int real_stripes = bbio->num_stripes - bbio->num_tgtdevs;
 	int i;
 	u64 l;
 	int again = 1;
+	int m;
 
 	while (again) {
 		again = 0;
-		for (i = 0; i < bbio->num_stripes - 1; i++) {
+		for (i = 0; i < real_stripes - 1; i++) {
 			if (parity_smaller(raid_map[i], raid_map[i+1])) {
 				s = bbio->stripes[i];
 				l = raid_map[i];
@@ -4896,6 +4898,14 @@ static void sort_parity_stripes(struct btrfs_bio *bbio, u64 *raid_map)
 				raid_map[i] = raid_map[i+1];
 				bbio->stripes[i+1] = s;
 				raid_map[i+1] = l;
+
+				if (bbio->tgtdev_map) {
+					m = bbio->tgtdev_map[i];
+					bbio->tgtdev_map[i] =
+							bbio->tgtdev_map[i + 1];
+					bbio->tgtdev_map[i + 1] = m;
+				}
+
 				again = 1;
 			}
 		}
@@ -4924,6 +4934,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	int ret = 0;
 	int num_stripes;
 	int max_errors = 0;
+	int tgtdev_indexes = 0;
 	struct btrfs_bio *bbio = NULL;
 	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
 	int dev_replace_is_ongoing = 0;
@@ -5235,14 +5246,19 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			num_alloc_stripes <<= 1;
 		if (rw & REQ_GET_READ_MIRRORS)
 			num_alloc_stripes++;
+		tgtdev_indexes = num_stripes;
 	}
-	bbio = kzalloc(btrfs_bio_size(num_alloc_stripes), GFP_NOFS);
+
+	bbio = kzalloc(btrfs_bio_size(num_alloc_stripes, tgtdev_indexes),
+		       GFP_NOFS);
 	if (!bbio) {
 		kfree(raid_map);
 		ret = -ENOMEM;
 		goto out;
 	}
 	atomic_set(&bbio->error, 0);
+	if (dev_replace_is_ongoing)
+		bbio->tgtdev_map = (int *)(bbio->stripes + num_alloc_stripes);
 
 	if (rw & REQ_DISCARD) {
 		int factor = 0;
@@ -5327,6 +5343,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	if (rw & (REQ_WRITE | REQ_GET_READ_MIRRORS))
 		max_errors = btrfs_chunk_max_errors(map);
 
+	tgtdev_indexes = 0;
 	if (dev_replace_is_ongoing && (rw & (REQ_WRITE | REQ_DISCARD)) &&
 	    dev_replace->tgtdev != NULL) {
 		int index_where_to_add;
@@ -5355,8 +5372,10 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 				new->physical = old->physical;
 				new->length = old->length;
 				new->dev = dev_replace->tgtdev;
+				bbio->tgtdev_map[i] = index_where_to_add;
 				index_where_to_add++;
 				max_errors++;
+				tgtdev_indexes++;
 			}
 		}
 		num_stripes = index_where_to_add;
@@ -5402,7 +5421,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 				tgtdev_stripe->length =
 					bbio->stripes[index_srcdev].length;
 				tgtdev_stripe->dev = dev_replace->tgtdev;
+				bbio->tgtdev_map[index_srcdev] = num_stripes;
 
+				tgtdev_indexes++;
 				num_stripes++;
 			}
 		}
@@ -5412,6 +5433,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	bbio->num_stripes = num_stripes;
 	bbio->max_errors = max_errors;
 	bbio->mirror_num = mirror_num;
+	bbio->num_tgtdevs = tgtdev_indexes;
 
 	/*
 	 * this is the case that REQ_READ && dev_replace_is_ongoing &&
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 01094bb..70be257 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -292,7 +292,7 @@ struct btrfs_bio_stripe {
 struct btrfs_bio;
 typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
 
-#define BTRFS_BIO_ORIG_BIO_SUBMITTED	0x1
+#define BTRFS_BIO_ORIG_BIO_SUBMITTED	(1 << 0)
 
 struct btrfs_bio {
 	atomic_t stripes_pending;
@@ -305,6 +305,8 @@ struct btrfs_bio {
 	int max_errors;
 	int num_stripes;
 	int mirror_num;
+	int num_tgtdevs;
+	int *tgtdev_map;
 	struct btrfs_bio_stripe stripes[];
 };
 
@@ -387,8 +389,10 @@ struct btrfs_balance_control {
 int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
 				   u64 end, u64 *length);
 
-#define btrfs_bio_size(n) (sizeof(struct btrfs_bio) + \
-			    (sizeof(struct btrfs_bio_stripe) * (n)))
+#define btrfs_bio_size(total_stripes, real_stripes)		\
+	(sizeof(struct btrfs_bio) +				\
+	 (sizeof(struct btrfs_bio_stripe) * (total_stripes)) +	\
+	 (sizeof(int) * (real_stripes)))
 
 int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		    u64 logical, u64 *length,
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 8/9] Btrfs, replace: write raid56 parity into the replace target device
  2014-11-14 13:50 [PATCH 0/9] Implement device scrub/replace for RAID56 Miao Xie
                   ` (6 preceding siblings ...)
  2014-11-14 13:50 ` [PATCH 7/9] Btrfs, replace: write dirty pages into the replace target device Miao Xie
@ 2014-11-14 13:51 ` Miao Xie
  2014-11-14 13:51 ` [PATCH 9/9] Btrfs, replace: enable dev-replace for raid56 Miao Xie
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-14 13:51 UTC (permalink / raw)
  To: linux-btrfs

This function reused the code of parity scrub, and we just write
the right parity or corrected parity into the target device before
the parity scrub end.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/raid56.c | 23 +++++++++++++++++++++++
 fs/btrfs/scrub.c  |  2 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7ad9546a..b69c01f 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2305,7 +2305,9 @@ static void raid_write_parity_end_io(struct bio *bio, int err)
 static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 					 int need_check)
 {
+	struct btrfs_bio *bbio = rbio->bbio;
 	void *pointers[rbio->real_stripes];
+	DECLARE_BITMAP(pbitmap, rbio->stripe_npages);
 	int nr_data = rbio->nr_data;
 	int stripe;
 	int pagenr;
@@ -2315,6 +2317,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 	struct page *q_page = NULL;
 	struct bio_list bio_list;
 	struct bio *bio;
+	int is_replace = 0;
 	int ret;
 
 	bio_list_init(&bio_list);
@@ -2328,6 +2331,11 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 		BUG();
 	}
 
+	if (bbio->num_tgtdevs && bbio->tgtdev_map[rbio->scrubp]) {
+		is_replace = 1;
+		bitmap_copy(pbitmap, rbio->dbitmap, rbio->stripe_npages);
+	}
+
 	/*
 	 * Because the higher layers(scrubber) are unlikely to
 	 * use this area of the disk again soon, so don't cache
@@ -2416,6 +2424,21 @@ writeback:
 			goto cleanup;
 	}
 
+	if (!is_replace)
+		goto submit_write;
+
+	for_each_set_bit(pagenr, pbitmap, rbio->stripe_npages) {
+		struct page *page;
+
+		page = rbio_stripe_page(rbio, rbio->scrubp, pagenr);
+		ret = rbio_add_io_page(rbio, &bio_list, page,
+				       bbio->tgtdev_map[rbio->scrubp],
+				       pagenr, rbio->stripe_len);
+		if (ret)
+			goto cleanup;
+	}
+
+submit_write:
 	nr_data = bio_list_size(&bio_list);
 	if (!nr_data) {
 		/* Every parity is right */
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 3ef1e1b..f690c8f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2710,7 +2710,7 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
 		goto out;
 
 	length = sparity->logic_end - sparity->logic_start + 1;
-	ret = btrfs_map_sblock(sctx->dev_root->fs_info, REQ_GET_READ_MIRRORS,
+	ret = btrfs_map_sblock(sctx->dev_root->fs_info, WRITE,
 			       sparity->logic_start,
 			       &length, &bbio, 0, &raid_map);
 	if (ret || !bbio || !raid_map)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 9/9] Btrfs, replace: enable dev-replace for raid56
  2014-11-14 13:50 [PATCH 0/9] Implement device scrub/replace for RAID56 Miao Xie
                   ` (7 preceding siblings ...)
  2014-11-14 13:51 ` [PATCH 8/9] Btrfs, replace: write raid56 parity " Miao Xie
@ 2014-11-14 13:51 ` Miao Xie
  2014-11-14 14:28 ` [PATCH 0/9] Implement device scrub/replace for RAID56 Chris Mason
  2014-11-25 15:14 ` Chris Mason
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-14 13:51 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/dev-replace.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 6f662b3..6aa835c 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -316,11 +316,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 	struct btrfs_device *tgt_device = NULL;
 	struct btrfs_device *src_device = NULL;
 
-	if (btrfs_fs_incompat(fs_info, RAID56)) {
-		btrfs_warn(fs_info, "dev_replace cannot yet handle RAID5/RAID6");
-		return -EOPNOTSUPP;
-	}
-
 	switch (args->start.cont_reading_from_srcdev_mode) {
 	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
 	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] Implement device scrub/replace for RAID56
  2014-11-14 13:50 [PATCH 0/9] Implement device scrub/replace for RAID56 Miao Xie
                   ` (8 preceding siblings ...)
  2014-11-14 13:51 ` [PATCH 9/9] Btrfs, replace: enable dev-replace for raid56 Miao Xie
@ 2014-11-14 14:28 ` Chris Mason
  2014-11-25 15:14 ` Chris Mason
  10 siblings, 0 replies; 32+ messages in thread
From: Chris Mason @ 2014-11-14 14:28 UTC (permalink / raw)
  To: Miao Xie; +Cc: linux-btrfs



On Fri, Nov 14, 2014 at 8:50 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> This patchset implement the device scrub/replace function for RAID56, 
> the
> most implementation of the common data is similar to the other RAID 
> type.
> The differentia or difficulty is the parity process. In order to avoid
> that problem the data that is easy to be change out the stripe lock,
> we do most work in the RAID56 stripe lock context.
> 
> And in order to avoid making the code more and more complex, we copy 
> some
> code of common data process for the parity, the cleanup work is in my
> TODO list.

I'm starting to review and test these, but many thanks for tackling 
this.

-chris


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/9] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
  2014-11-14 13:50 ` [PATCH 2/9] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block Miao Xie
@ 2014-11-14 14:55   ` David Sterba
  0 siblings, 0 replies; 32+ messages in thread
From: David Sterba @ 2014-11-14 14:55 UTC (permalink / raw)
  To: Miao Xie; +Cc: linux-btrfs, Zhao Lei

On Fri, Nov 14, 2014 at 09:50:54PM +0800, Miao Xie wrote:
> From: Zhao Lei <zhaolei@cn.fujitsu.com>
> 
> stripe_index's value was set again in latter line:
> stripe_index = 0;
> 
> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>

Reviewed-by: David Sterba <dsterba@suse.cz>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/9] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  2014-11-14 13:50 ` [PATCH 1/9] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition Miao Xie
@ 2014-11-14 14:57   ` David Sterba
  0 siblings, 0 replies; 32+ messages in thread
From: David Sterba @ 2014-11-14 14:57 UTC (permalink / raw)
  To: Miao Xie; +Cc: linux-btrfs, Zhao Lei

On Fri, Nov 14, 2014 at 09:50:53PM +0800, Miao Xie wrote:
> From: Zhao Lei <zhaolei@cn.fujitsu.com>
> 
> bbio_ret in this condition is always !NULL because previous code
> already have a check-and-skip:
> 4908 if (!bbio_ret)
> 4909     goto out;
> 
> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>

Reviewed-by: David Sterba <dsterba@suse.cz>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 4/9] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  2014-11-14 13:50 ` [PATCH 4/9] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted Miao Xie
@ 2014-11-24  9:31   ` Miao Xie
  0 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-24  9:31 UTC (permalink / raw)
  To: linux-btrfs

This patch implement the RAID5/6 common data repair function, the
implementation is similar to the scrub on the other RAID such as
RAID1, the differentia is that we don't read the data from the
mirror, we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v2:
- Remove the redundant prefix underscores of the function names to make
  them obey the common pattern of the source in raid56.c
---
 fs/btrfs/raid56.c  |  42 +++++++++---
 fs/btrfs/raid56.h  |   2 +-
 fs/btrfs/scrub.c   | 194 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.c |  16 ++++-
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 226 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c54b0e6..6013d88 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,6 +58,8 @@
  */
 #define RBIO_CACHE_READY_BIT	3
 
+#define RBIO_HOLD_BBIO_MAP_BIT	4
+
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -799,6 +801,21 @@ done_nolock:
 		remove_rbio_from_cache(rbio);
 }
 
+static inline void
+__free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need)
+{
+	if (need) {
+		kfree(raid_map);
+		kfree(bbio);
+	}
+}
+
+static inline void free_bbio_and_raid_map(struct btrfs_raid_bio *rbio)
+{
+	__free_bbio_and_raid_map(rbio->bbio, rbio->raid_map,
+			!test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags));
+}
+
 static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 {
 	int i;
@@ -817,8 +834,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 			rbio->stripe_pages[i] = NULL;
 		}
 	}
-	kfree(rbio->raid_map);
-	kfree(rbio->bbio);
+
+	free_bbio_and_raid_map(rbio);
+
 	kfree(rbio);
 }
 
@@ -933,11 +951,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 
 	rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
 			GFP_NOFS);
-	if (!rbio) {
-		kfree(raid_map);
-		kfree(bbio);
+	if (!rbio)
 		return ERR_PTR(-ENOMEM);
-	}
 
 	bio_list_init(&rbio->bio_list);
 	INIT_LIST_HEAD(&rbio->plug_list);
@@ -1692,8 +1707,10 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 	struct blk_plug_cb *cb;
 
 	rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-	if (IS_ERR(rbio))
+	if (IS_ERR(rbio)) {
+		__free_bbio_and_raid_map(bbio, raid_map, 1);
 		return PTR_ERR(rbio);
+	}
 	bio_list_add(&rbio->bio_list, bio);
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
@@ -2038,15 +2055,19 @@ cleanup:
  */
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 			  struct btrfs_bio *bbio, u64 *raid_map,
-			  u64 stripe_len, int mirror_num)
+			  u64 stripe_len, int mirror_num, int hold_bbio)
 {
 	struct btrfs_raid_bio *rbio;
 	int ret;
 
 	rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-	if (IS_ERR(rbio))
+	if (IS_ERR(rbio)) {
+		__free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
 		return PTR_ERR(rbio);
+	}
 
+	if (hold_bbio)
+		set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
 	rbio->read_rebuild = 1;
 	bio_list_add(&rbio->bio_list, bio);
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
@@ -2054,8 +2075,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 	rbio->faila = find_logical_bio_stripe(rbio, bio);
 	if (rbio->faila == -1) {
 		BUG();
-		kfree(raid_map);
-		kfree(bbio);
+		__free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
 		kfree(rbio);
 		return -EIO;
 	}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index ea5d73b..b310e8c 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map)
 
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 				 struct btrfs_bio *bbio, u64 *raid_map,
-				 u64 stripe_len, int mirror_num);
+				 u64 stripe_len, int mirror_num, int hold_bbio);
 int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 			       struct btrfs_bio *bbio, u64 *raid_map,
 			       u64 stripe_len);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index efa0831..ca4b9eb 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -63,6 +63,13 @@ struct scrub_ctx;
  */
 #define SCRUB_MAX_PAGES_PER_BLOCK	16	/* 64k per node/leaf/sector */
 
+struct scrub_recover {
+	atomic_t		refs;
+	struct btrfs_bio	*bbio;
+	u64			*raid_map;
+	u64			map_length;
+};
+
 struct scrub_page {
 	struct scrub_block	*sblock;
 	struct page		*page;
@@ -79,6 +86,8 @@ struct scrub_page {
 		unsigned int	io_error:1;
 	};
 	u8			csum[BTRFS_CSUM_SIZE];
+
+	struct scrub_recover	*recover;
 };
 
 struct scrub_bio {
@@ -196,7 +205,7 @@ static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 				struct scrub_block *sblock, int is_metadata,
 				int have_csum, u8 *csum, u64 generation,
-				u16 csum_size);
+				u16 csum_size, int retry_failed_mirror);
 static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
 					 struct scrub_block *sblock,
 					 int is_metadata, int have_csum,
@@ -790,6 +799,20 @@ out:
 	scrub_pending_trans_workers_dec(sctx);
 }
 
+static inline void scrub_get_recover(struct scrub_recover *recover)
+{
+	atomic_inc(&recover->refs);
+}
+
+static inline void scrub_put_recover(struct scrub_recover *recover)
+{
+	if (atomic_dec_and_test(&recover->refs)) {
+		kfree(recover->bbio);
+		kfree(recover->raid_map);
+		kfree(recover);
+	}
+}
+
 /*
  * scrub_handle_errored_block gets called when either verification of the
  * pages failed or the bio failed to read, e.g. with EIO. In the latter
@@ -906,7 +929,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 
 	/* build and submit the bios for the failed mirror, check checksums */
 	scrub_recheck_block(fs_info, sblock_bad, is_metadata, have_csum,
-			    csum, generation, sctx->csum_size);
+			    csum, generation, sctx->csum_size, 1);
 
 	if (!sblock_bad->header_error && !sblock_bad->checksum_error &&
 	    sblock_bad->no_io_error_seen) {
@@ -1019,7 +1042,7 @@ nodatasum_case:
 		/* build and submit the bios, check checksums */
 		scrub_recheck_block(fs_info, sblock_other, is_metadata,
 				    have_csum, csum, generation,
-				    sctx->csum_size);
+				    sctx->csum_size, 0);
 
 		if (!sblock_other->header_error &&
 		    !sblock_other->checksum_error &&
@@ -1169,7 +1192,7 @@ nodatasum_case:
 			 */
 			scrub_recheck_block(fs_info, sblock_bad,
 					    is_metadata, have_csum, csum,
-					    generation, sctx->csum_size);
+					    generation, sctx->csum_size, 1);
 			if (!sblock_bad->header_error &&
 			    !sblock_bad->checksum_error &&
 			    sblock_bad->no_io_error_seen)
@@ -1201,11 +1224,18 @@ out:
 		     mirror_index++) {
 			struct scrub_block *sblock = sblocks_for_recheck +
 						     mirror_index;
+			struct scrub_recover *recover;
 			int page_index;
 
 			for (page_index = 0; page_index < sblock->page_count;
 			     page_index++) {
 				sblock->pagev[page_index]->sblock = NULL;
+				recover = sblock->pagev[page_index]->recover;
+				if (recover) {
+					scrub_put_recover(recover);
+					sblock->pagev[page_index]->recover =
+									NULL;
+				}
 				scrub_page_put(sblock->pagev[page_index]);
 			}
 		}
@@ -1215,14 +1245,63 @@ out:
 	return 0;
 }
 
+static inline int scrub_nr_raid_mirrors(struct btrfs_bio *bbio, u64 *raid_map)
+{
+	if (raid_map) {
+		if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
+			return 3;
+		else
+			return 2;
+	} else {
+		return (int)bbio->num_stripes;
+	}
+}
+
+static inline void scrub_stripe_index_and_offset(u64 logical, u64 *raid_map,
+						 u64 mapped_length,
+						 int nstripes, int mirror,
+						 int *stripe_index,
+						 u64 *stripe_offset)
+{
+	int i;
+
+	if (raid_map) {
+		/* RAID5/6 */
+		for (i = 0; i < nstripes; i++) {
+			if (raid_map[i] == RAID6_Q_STRIPE ||
+			    raid_map[i] == RAID5_P_STRIPE)
+				continue;
+
+			if (logical >= raid_map[i] &&
+			    logical < raid_map[i] + mapped_length)
+				break;
+		}
+
+		*stripe_index = i;
+		*stripe_offset = logical - raid_map[i];
+	} else {
+		/* The other RAID type */
+		*stripe_index = mirror;
+		*stripe_offset = 0;
+	}
+}
+
 static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 				     struct btrfs_fs_info *fs_info,
 				     struct scrub_block *original_sblock,
 				     u64 length, u64 logical,
 				     struct scrub_block *sblocks_for_recheck)
 {
+	struct scrub_recover *recover;
+	struct btrfs_bio *bbio;
+	u64 *raid_map;
+	u64 sublen;
+	u64 mapped_length;
+	u64 stripe_offset;
+	int stripe_index;
 	int page_index;
 	int mirror_index;
+	int nmirrors;
 	int ret;
 
 	/*
@@ -1233,23 +1312,39 @@ static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 
 	page_index = 0;
 	while (length > 0) {
-		u64 sublen = min_t(u64, length, PAGE_SIZE);
-		u64 mapped_length = sublen;
-		struct btrfs_bio *bbio = NULL;
+		sublen = min_t(u64, length, PAGE_SIZE);
+		mapped_length = sublen;
+		bbio = NULL;
+		raid_map = NULL;
 
 		/*
 		 * with a length of PAGE_SIZE, each returned stripe
 		 * represents one mirror
 		 */
-		ret = btrfs_map_block(fs_info, REQ_GET_READ_MIRRORS, logical,
-				      &mapped_length, &bbio, 0);
+		ret = btrfs_map_sblock(fs_info, REQ_GET_READ_MIRRORS, logical,
+				       &mapped_length, &bbio, 0, &raid_map);
 		if (ret || !bbio || mapped_length < sublen) {
 			kfree(bbio);
+			kfree(raid_map);
 			return -EIO;
 		}
 
+		recover = kzalloc(sizeof(struct scrub_recover), GFP_NOFS);
+		if (!recover) {
+			kfree(bbio);
+			kfree(raid_map);
+			return -ENOMEM;
+		}
+
+		atomic_set(&recover->refs, 1);
+		recover->bbio = bbio;
+		recover->raid_map = raid_map;
+		recover->map_length = mapped_length;
+
 		BUG_ON(page_index >= SCRUB_PAGES_PER_RD_BIO);
-		for (mirror_index = 0; mirror_index < (int)bbio->num_stripes;
+
+		nmirrors = scrub_nr_raid_mirrors(bbio, raid_map);
+		for (mirror_index = 0; mirror_index < nmirrors;
 		     mirror_index++) {
 			struct scrub_block *sblock;
 			struct scrub_page *page;
@@ -1265,26 +1360,38 @@ leave_nomem:
 				spin_lock(&sctx->stat_lock);
 				sctx->stat.malloc_errors++;
 				spin_unlock(&sctx->stat_lock);
-				kfree(bbio);
+				scrub_put_recover(recover);
 				return -ENOMEM;
 			}
 			scrub_page_get(page);
 			sblock->pagev[page_index] = page;
 			page->logical = logical;
-			page->physical = bbio->stripes[mirror_index].physical;
+
+			scrub_stripe_index_and_offset(logical, raid_map,
+						      mapped_length,
+						      bbio->num_stripes,
+						      mirror_index,
+						      &stripe_index,
+						      &stripe_offset);
+			page->physical = bbio->stripes[stripe_index].physical +
+					 stripe_offset;
+			page->dev = bbio->stripes[stripe_index].dev;
+
 			BUG_ON(page_index >= original_sblock->page_count);
 			page->physical_for_dev_replace =
 				original_sblock->pagev[page_index]->
 				physical_for_dev_replace;
 			/* for missing devices, dev->bdev is NULL */
-			page->dev = bbio->stripes[mirror_index].dev;
 			page->mirror_num = mirror_index + 1;
 			sblock->page_count++;
 			page->page = alloc_page(GFP_NOFS);
 			if (!page->page)
 				goto leave_nomem;
+
+			scrub_get_recover(recover);
+			page->recover = recover;
 		}
-		kfree(bbio);
+		scrub_put_recover(recover);
 		length -= sublen;
 		logical += sublen;
 		page_index++;
@@ -1293,6 +1400,51 @@ leave_nomem:
 	return 0;
 }
 
+struct scrub_bio_ret {
+	struct completion event;
+	int error;
+};
+
+static void scrub_bio_wait_endio(struct bio *bio, int error)
+{
+	struct scrub_bio_ret *ret = bio->bi_private;
+
+	ret->error = error;
+	complete(&ret->event);
+}
+
+static inline int scrub_is_page_on_raid56(struct scrub_page *page)
+{
+	return page->recover && page->recover->raid_map;
+}
+
+static int scrub_submit_raid56_bio_wait(struct btrfs_fs_info *fs_info,
+					struct bio *bio,
+					struct scrub_page *page)
+{
+	struct scrub_bio_ret done;
+	int ret;
+
+	init_completion(&done.event);
+	done.error = 0;
+	bio->bi_iter.bi_sector = page->logical >> 9;
+	bio->bi_private = &done;
+	bio->bi_end_io = scrub_bio_wait_endio;
+
+	ret = raid56_parity_recover(fs_info->fs_root, bio, page->recover->bbio,
+				    page->recover->raid_map,
+				    page->recover->map_length,
+				    page->mirror_num, 1);
+	if (ret)
+		return ret;
+
+	wait_for_completion(&done.event);
+	if (done.error)
+		return -EIO;
+
+	return 0;
+}
+
 /*
  * this function will check the on disk data for checksum errors, header
  * errors and read I/O errors. If any I/O errors happen, the exact pages
@@ -1303,7 +1455,7 @@ leave_nomem:
 static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 				struct scrub_block *sblock, int is_metadata,
 				int have_csum, u8 *csum, u64 generation,
-				u16 csum_size)
+				u16 csum_size, int retry_failed_mirror)
 {
 	int page_num;
 
@@ -1329,11 +1481,17 @@ static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 			continue;
 		}
 		bio->bi_bdev = page->dev->bdev;
-		bio->bi_iter.bi_sector = page->physical >> 9;
 
 		bio_add_page(bio, page->page, PAGE_SIZE, 0);
-		if (btrfsic_submit_bio_wait(READ, bio))
-			sblock->no_io_error_seen = 0;
+		if (!retry_failed_mirror && scrub_is_page_on_raid56(page)) {
+			if (scrub_submit_raid56_bio_wait(fs_info, bio, page))
+				sblock->no_io_error_seen = 0;
+		} else {
+			bio->bi_iter.bi_sector = page->physical >> 9;
+
+			if (btrfsic_submit_bio_wait(READ, bio))
+				sblock->no_io_error_seen = 0;
+		}
 
 		bio_put(bio);
 	}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 66d5035..05cf489 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5162,7 +5162,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 				BTRFS_BLOCK_GROUP_RAID6)) {
 		u64 tmp;
 
-		if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
+		if (raid_map_ret &&
+		    ((rw & (REQ_WRITE | REQ_GET_READ_MIRRORS)) ||
+		     mirror_num > 1)) {
 			int i, rot;
 
 			/* push stripe_nr back to the start of the full stripe */
@@ -5441,6 +5443,16 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 				 mirror_num, NULL);
 }
 
+/* For Scrub/replace */
+int btrfs_map_sblock(struct btrfs_fs_info *fs_info, int rw,
+		     u64 logical, u64 *length,
+		     struct btrfs_bio **bbio_ret, int mirror_num,
+		     u64 **raid_map_ret)
+{
+	return __btrfs_map_block(fs_info, rw, logical, length, bbio_ret,
+				 mirror_num, raid_map_ret);
+}
+
 int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
 		     u64 chunk_start, u64 physical, u64 devid,
 		     u64 **logical, int *naddrs, int *stripe_len)
@@ -5810,7 +5822,7 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
 		} else {
 			ret = raid56_parity_recover(root, bio, bbio,
 						    raid_map, map_length,
-						    mirror_num);
+						    mirror_num, 0);
 		}
 		/*
 		 * FIXME, replace dosen't support raid56 yet, please fix
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 08980fa..01094bb 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -393,6 +393,10 @@ int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
 int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		    u64 logical, u64 *length,
 		    struct btrfs_bio **bbio_ret, int mirror_num);
+int btrfs_map_sblock(struct btrfs_fs_info *fs_info, int rw,
+		     u64 logical, u64 *length,
+		     struct btrfs_bio **bbio_ret, int mirror_num,
+		     u64 **raid_map_ret);
 int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
 		     u64 chunk_start, u64 physical, u64 devid,
 		     u64 **logical, int *naddrs, int *stripe_len);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] Implement device scrub/replace for RAID56
  2014-11-14 13:50 [PATCH 0/9] Implement device scrub/replace for RAID56 Miao Xie
                   ` (9 preceding siblings ...)
  2014-11-14 14:28 ` [PATCH 0/9] Implement device scrub/replace for RAID56 Chris Mason
@ 2014-11-25 15:14 ` Chris Mason
  2014-11-26 13:04   ` [PATCH v3 00/11] " Miao Xie
  10 siblings, 1 reply; 32+ messages in thread
From: Chris Mason @ 2014-11-25 15:14 UTC (permalink / raw)
  To: Miao Xie; +Cc: linux-btrfs

On Fri, Nov 14, 2014 at 8:50 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> This patchset implement the device scrub/replace function for RAID56, 
> the
> most implementation of the common data is similar to the other RAID 
> type.
> The differentia or difficulty is the parity process. In order to avoid
> that problem the data that is easy to be change out the stripe lock,
> we do most work in the RAID56 stripe lock context.
> 
> And in order to avoid making the code more and more complex, we copy 
> some
> code of common data process for the parity, the cleanup work is in my
> TODO list.
> 
> We have done some test, the patchset worked well. Of course, more 
> tests
> are welcome. If you are interesting to use it or test it, you can pull
> the patchset from
> 
>   https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

I'm getting crashes from btrfs/060 with these in place:

> [ 1649.712413] BTRFS: assertion failed: logical + PAGE_SIZE <= 
> rbio->raid_map[0] + rbio->stripe_len * rbio->nr_data, file: 
> fs/btrfs/raid56.c, line: 2248^M
> [ 1649.738982] ------------[ cut here ]------------^M
> [ 1649.748727] kernel BUG at fs/btrfs/ctree.h:4020!^M
> [ 1649.758039] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC^M
> [ 1649.768977] Modules linked in: fuse loop btrfs raid6_pq 
> zlib_deflate lzo_compress xor k10temp coretemp hwmon xfs exportfs 
> libcrc32c tcp_diag inet_diag nfsv4 ip6table_filter ip6_tables 
> xt_NFLOG nfnetlink_log nfnetlink xt_comment xt_statistic 
> iptable_filter ip_tables x_tables nfsv3 nfs lockd grace mptctl 
> netconsole autofs4 rpcsec_gss_krb5 auth_rpcgss oid_registry sunrpc 
> ipv6 ext3 jbd dm_mod iTCO_wdt iTCO_vendor_support rtc_cmos ipmi_si 
> ipmi_msghandler pcspkr i2c_i801 lpc_ich mfd_core shpchp ehci_pci 
> ehci_hcd mlx4_en ptp pps_core mlx4_core ses enclosure sg button 
> megaraid_sas^M
> [ 1649.872917] CPU: 0 PID: 16687 Comm: kworker/u65:0 Not tainted 
> 3.18.0-rc6-mason+ #3^M
> [ 1649.888171] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, 
> BIOS 1.07 05/10/2012^M
> [ 1649.903962] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper 
> [btrfs]^M
> [ 1649.916588] task: ffff88072557dd90 ti: ffff88070fdc4000 task.ti: 
> ffff88070fdc4000^M
> [ 1649.931669] RIP: 0010:[<ffffffffa060db2f>]  [<ffffffffa060db2f>] 
> raid56_parity_add_scrub_pages+0x8f/0xa0 [btrfs]^M
> [ 1649.952169] RSP: 0018:ffff88070fdc7b68  EFLAGS: 00010292^M
> [ 1649.962852] RAX: 0000000000000089 RBX: ffff8804cf681f30 RCX: 
> 0000000000004b4a^M
> [ 1649.977177] RDX: 000000000000004a RSI: 0000000000000001 RDI: 
> 0000000000000000^M
> [ 1649.991496] RBP: ffff88070fdc7b68 R08: 0000000000000001 R09: 
> 0000000000000000^M
> [ 1650.005819] R10: 0000000000000001 R11: 0000000000000000 R12: 
> ffff880689b62800^M
> [ 1650.020140] R13: ffff88024d85cf80 R14: ffff88075d0dd800 R15: 
> 0000000000000003^M
> [ 1650.034459] FS:  0000000000000000(0000) GS:ffff88085fc00000(0000) 
> knlGS:0000000000000000^M
> [ 1650.050757] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> [ 1650.062306] CR2: 00007f445b6d0e78 CR3: 0000000001c14000 CR4: 
> 00000000000407f0^M
> [ 1650.076625] Stack:^M
> [ 1650.080716]  ffff88070fdc7bc8 ffffffffa05f2e50 ffff8804cf681fc8 
> ffff880700000010^M
> [ 1650.095761]  ffff88070fdc7b98 0000000000010000 ffff880290f92340 
> ffff88074eda9f00^M
> [ 1650.110792]  ffff8807edbb1700 ffff880639910e20 ffff8807edbb1700 
> 0000000000001000^M
> [ 1650.125865] Call Trace:^M
> [ 1650.130845]  [<ffffffffa05f2e50>] 
> scrub_parity_check_and_repair+0x140/0x1e0 [btrfs]^M
> [ 1650.146286]  [<ffffffffa05f2f7d>] scrub_block_put+0x8d/0x90 
> [btrfs]^M
> [ 1650.158884]  [<ffffffff810961e0>] ? 
> cpuacct_account_field+0xd0/0xd0^M
> [ 1650.171493]  [<ffffffffa05f6e19>] 
> scrub_bio_end_io_worker+0xe9/0x870 [btrfs]^M
> [ 1650.185725]  [<ffffffffa05c8b84>] normal_work_helper+0x84/0x330 
> [btrfs]^M
> [ 1650.199041]  [<ffffffffa05c8e82>] btrfs_scrub_helper+0x12/0x20 
> [btrfs]^M
> [ 1650.212165]  [<ffffffff8106c50f>] process_one_work+0x1bf/0x520^M
> [ 1650.223892]  [<ffffffff8106c48d>] ? process_one_work+0x13d/0x520^M
> [ 1650.235988]  [<ffffffff8106c98e>] worker_thread+0x11e/0x4b0^M
> [ 1650.247204]  [<ffffffff81653ac9>] ? __schedule+0x389/0x880^M
> [ 1650.258242]  [<ffffffff8106c870>] ? process_one_work+0x520/0x520^M
> [ 1650.270314]  [<ffffffff81071e2e>] kthread+0xde/0x100^M
> [ 1650.280302]  [<ffffffff81071d50>] ? 
> __init_kthread_worker+0x70/0x70^M
> [ 1650.292894]  [<ffffffff81659eac>] ret_from_fork+0x7c/0xb0^M
> [ 1650.303746]  [<ffffffff81071d50>] ? 
> __init_kthread_worker+0x70/0x70^M
> [ 1650.316359] Code: c0 e8 1d 5a 04 e1 0f 0b eb fe b9 c8 08 00 00 48 
> c7 c2 71 6b 62 a0 48 c7 c6 b8 c4 62 a0 48 c7 c7 80 c4 62 a0 31 c0 e8 
> f8 59 04 e1 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 
> 48 89 e5 ^M
> [ 1650.356466] RIP  [<ffffffffa060db2f>] 
> raid56_parity_add_scrub_pages+0x8f/0xa0 [btrfs]^M
> [ 1650.372307]  RSP <ffff88070fdc7b68>^M
> [ 1650.381427] ---[ end trace 14445249faa12848 ]---^M


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 00/11] Implement device scrub/replace for RAID56
@ 2014-11-26 13:04   ` Miao Xie
  2014-11-26 13:04     ` [PATCH v3 01/11] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition Miao Xie
                       ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-26 13:04 UTC (permalink / raw)
  To: linux-btrfs

This patchset implement the device scrub/replace function for RAID56, the
most implementation of the common data is similar to the other RAID type.
The differentia or difficulty is the parity process. The basic idea is reading
and check the data which has checksum out of the raid56 stripe lock, if the
data is right, then lock the raid56 stripe, read out the other data in the
same stripe, if no IO error happens, calculate the parity and check the
original one, if the original parity is right, the scrub parity passes.
or write out the new one. But if the common data(not parity) that we read out
is wrong, we will try to recover it, and then check and repair the parity.

And in order to avoid making the code more and more complex, we copy some
code of common data process for the parity, the cleanup work is in my
TODO list.

We have done some test, the patchset worked well. Of course, more tests
are welcome. If you are interesting to use it or test it, you can pull
the patchset from

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Changelog v2 -> v3:
- Fix wrong stripe start logical address calculation which was reported
  by Chris.
- Fix unhandled raid bios for parity scrub, which are added into the plug
  list of the head raid bio.
- Fix possible deadlock caused by the pending bios in the plug list
  when the io submitters were going to sleep.
- Fix undealt use-after-free problem of the source device in the final
  device replace procedure.
- Modify the code that is used to avoid the rbio merge.

Changelog v1 -> v2:
- Change some function names in raid56.c to make them fit the code style
  of the raid56.

Thanks
Miao

Miao Xie (8):
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs, raid56: use a variant to record the operation type
  Btrfs, raid56: support parity scrub on raid56
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, replace: write raid56 parity into the replace target device
  Btrfs, raid56: fix use-after-free problem in the final device replace
    procedure on raid56
  Btrfs: fix possible deadlock caused by pending I/O in plug list

Zhao Lei (3):
  Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  Btrfs: remove unnecessary code of stripe_index assignment in
    __btrfs_map_block
  Btrfs, replace: enable dev-replace for raid56

 fs/btrfs/dev-replace.c |  20 +-
 fs/btrfs/raid56.c      | 746 ++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/raid56.h      |  16 +-
 fs/btrfs/scrub.c       | 803 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.c     |  52 +++-
 fs/btrfs/volumes.h     |  14 +-
 6 files changed, 1521 insertions(+), 130 deletions(-)

-- 
1.9.3


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 01/11] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  2014-11-26 13:04   ` [PATCH v3 00/11] " Miao Xie
@ 2014-11-26 13:04     ` Miao Xie
  2014-11-26 13:04     ` [PATCH v3 02/11] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block Miao Xie
                       ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-26 13:04 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

bbio_ret in this condition is always !NULL because previous code
already have a check-and-skip:
4908 if (!bbio_ret)
4909     goto out;

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f61278f..41b0dff 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5162,8 +5162,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 				BTRFS_BLOCK_GROUP_RAID6)) {
 		u64 tmp;
 
-		if (bbio_ret && ((rw & REQ_WRITE) || mirror_num > 1)
-		    && raid_map_ret) {
+		if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
 			int i, rot;
 
 			/* push stripe_nr back to the start of the full stripe */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 02/11] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
  2014-11-26 13:04   ` [PATCH v3 00/11] " Miao Xie
  2014-11-26 13:04     ` [PATCH v3 01/11] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition Miao Xie
@ 2014-11-26 13:04     ` Miao Xie
  2014-11-26 13:04     ` [PATCH v3 03/11] Btrfs, raid56: don't change bbio and raid_map Miao Xie
                       ` (8 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-26 13:04 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

stripe_index's value was set again in latter line:
stripe_index = 0;

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 41b0dff..66d5035 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5167,9 +5167,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 
 			/* push stripe_nr back to the start of the full stripe */
 			stripe_nr = raid56_full_stripe_start;
-			do_div(stripe_nr, stripe_len);
-
-			stripe_index = do_div(stripe_nr, nr_data_stripes(map));
+			do_div(stripe_nr, stripe_len * nr_data_stripes(map));
 
 			/* RAID[56] write or recovery. Return all stripes */
 			num_stripes = map->num_stripes;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 03/11] Btrfs, raid56: don't change bbio and raid_map
  2014-11-26 13:04   ` [PATCH v3 00/11] " Miao Xie
  2014-11-26 13:04     ` [PATCH v3 01/11] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition Miao Xie
  2014-11-26 13:04     ` [PATCH v3 02/11] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block Miao Xie
@ 2014-11-26 13:04     ` Miao Xie
  2014-11-26 13:04     ` [PATCH v3 04/11] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted Miao Xie
                       ` (7 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-26 13:04 UTC (permalink / raw)
  To: linux-btrfs

Because we will reuse bbio and raid_map during the scrub later, it is
better that we don't change any variant of bbio and don't free it at
the end of IO request. So we introduced similar variants into the raid
bio, and don't access those bbio's variants any more.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/raid56.c | 42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6a41631..c54b0e6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,7 +58,6 @@
  */
 #define RBIO_CACHE_READY_BIT	3
 
-
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -146,6 +145,10 @@ struct btrfs_raid_bio {
 
 	atomic_t refs;
 
+
+	atomic_t stripes_pending;
+
+	atomic_t error;
 	/*
 	 * these are two arrays of pointers.  We allocate the
 	 * rbio big enough to hold them both and setup their
@@ -858,13 +861,13 @@ static void raid_write_end_io(struct bio *bio, int err)
 
 	bio_put(bio);
 
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
 		return;
 
 	err = 0;
 
 	/* OK, we have read all the stripes we need to. */
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
 		err = -EIO;
 
 	rbio_orig_end_io(rbio, err, 0);
@@ -949,6 +952,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	rbio->faila = -1;
 	rbio->failb = -1;
 	atomic_set(&rbio->refs, 1);
+	atomic_set(&rbio->error, 0);
+	atomic_set(&rbio->stripes_pending, 0);
 
 	/*
 	 * the stripe_pages and bio_pages array point to the extra
@@ -1169,7 +1174,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
 	spin_unlock_irq(&rbio->bio_list_lock);
 
-	atomic_set(&rbio->bbio->error, 0);
+	atomic_set(&rbio->error, 0);
 
 	/*
 	 * now that we've set rmw_locked, run through the
@@ -1245,8 +1250,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 		}
 	}
 
-	atomic_set(&bbio->stripes_pending, bio_list_size(&bio_list));
-	BUG_ON(atomic_read(&bbio->stripes_pending) == 0);
+	atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
+	BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
 	while (1) {
 		bio = bio_list_pop(&bio_list);
@@ -1331,11 +1336,11 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed)
 	if (rbio->faila == -1) {
 		/* first failure on this rbio */
 		rbio->faila = failed;
-		atomic_inc(&rbio->bbio->error);
+		atomic_inc(&rbio->error);
 	} else if (rbio->failb == -1) {
 		/* second failure on this rbio */
 		rbio->failb = failed;
-		atomic_inc(&rbio->bbio->error);
+		atomic_inc(&rbio->error);
 	} else {
 		ret = -EIO;
 	}
@@ -1394,11 +1399,11 @@ static void raid_rmw_end_io(struct bio *bio, int err)
 
 	bio_put(bio);
 
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
 		return;
 
 	err = 0;
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
 		goto cleanup;
 
 	/*
@@ -1439,7 +1444,6 @@ static void async_read_rebuild(struct btrfs_raid_bio *rbio)
 static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 {
 	int bios_to_read = 0;
-	struct btrfs_bio *bbio = rbio->bbio;
 	struct bio_list bio_list;
 	int ret;
 	int nr_pages = DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE);
@@ -1455,7 +1459,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 
 	index_rbio_pages(rbio);
 
-	atomic_set(&rbio->bbio->error, 0);
+	atomic_set(&rbio->error, 0);
 	/*
 	 * build a list of bios to read all the missing parts of this
 	 * stripe
@@ -1503,7 +1507,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 	 * the bbio may be freed once we submit the last bio.  Make sure
 	 * not to touch it after that
 	 */
-	atomic_set(&bbio->stripes_pending, bios_to_read);
+	atomic_set(&rbio->stripes_pending, bios_to_read);
 	while (1) {
 		bio = bio_list_pop(&bio_list);
 		if (!bio)
@@ -1917,10 +1921,10 @@ static void raid_recover_end_io(struct bio *bio, int err)
 		set_bio_pages_uptodate(bio);
 	bio_put(bio);
 
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
 		return;
 
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
 		rbio_orig_end_io(rbio, -EIO, 0);
 	else
 		__raid_recover_end_io(rbio);
@@ -1951,7 +1955,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 	if (ret)
 		goto cleanup;
 
-	atomic_set(&rbio->bbio->error, 0);
+	atomic_set(&rbio->error, 0);
 
 	/*
 	 * read everything that hasn't failed.  Thanks to the
@@ -1960,7 +1964,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 	 */
 	for (stripe = 0; stripe < bbio->num_stripes; stripe++) {
 		if (rbio->faila == stripe || rbio->failb == stripe) {
-			atomic_inc(&rbio->bbio->error);
+			atomic_inc(&rbio->error);
 			continue;
 		}
 
@@ -1990,7 +1994,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 		 * were up to date, or we might have no bios to read because
 		 * the devices were gone.
 		 */
-		if (atomic_read(&rbio->bbio->error) <= rbio->bbio->max_errors) {
+		if (atomic_read(&rbio->error) <= rbio->bbio->max_errors) {
 			__raid_recover_end_io(rbio);
 			goto out;
 		} else {
@@ -2002,7 +2006,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 	 * the bbio may be freed once we submit the last bio.  Make sure
 	 * not to touch it after that
 	 */
-	atomic_set(&bbio->stripes_pending, bios_to_read);
+	atomic_set(&rbio->stripes_pending, bios_to_read);
 	while (1) {
 		bio = bio_list_pop(&bio_list);
 		if (!bio)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 04/11] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  2014-11-26 13:04   ` [PATCH v3 00/11] " Miao Xie
                       ` (2 preceding siblings ...)
  2014-11-26 13:04     ` [PATCH v3 03/11] Btrfs, raid56: don't change bbio and raid_map Miao Xie
@ 2014-11-26 13:04     ` Miao Xie
  2014-11-26 13:04     ` [PATCH v3 05/11] Btrfs, raid56: use a variant to record the operation type Miao Xie
                       ` (6 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-26 13:04 UTC (permalink / raw)
  To: linux-btrfs

This patch implement the RAID5/6 common data repair function, the
implementation is similar to the scrub on the other RAID such as
RAID1, the differentia is that we don't read the data from the
mirror, we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/raid56.c  |  42 +++++++++---
 fs/btrfs/raid56.h  |   2 +-
 fs/btrfs/scrub.c   | 194 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.c |  16 ++++-
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 226 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c54b0e6..6013d88 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,6 +58,8 @@
  */
 #define RBIO_CACHE_READY_BIT	3
 
+#define RBIO_HOLD_BBIO_MAP_BIT	4
+
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -799,6 +801,21 @@ done_nolock:
 		remove_rbio_from_cache(rbio);
 }
 
+static inline void
+__free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need)
+{
+	if (need) {
+		kfree(raid_map);
+		kfree(bbio);
+	}
+}
+
+static inline void free_bbio_and_raid_map(struct btrfs_raid_bio *rbio)
+{
+	__free_bbio_and_raid_map(rbio->bbio, rbio->raid_map,
+			!test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags));
+}
+
 static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 {
 	int i;
@@ -817,8 +834,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 			rbio->stripe_pages[i] = NULL;
 		}
 	}
-	kfree(rbio->raid_map);
-	kfree(rbio->bbio);
+
+	free_bbio_and_raid_map(rbio);
+
 	kfree(rbio);
 }
 
@@ -933,11 +951,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 
 	rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
 			GFP_NOFS);
-	if (!rbio) {
-		kfree(raid_map);
-		kfree(bbio);
+	if (!rbio)
 		return ERR_PTR(-ENOMEM);
-	}
 
 	bio_list_init(&rbio->bio_list);
 	INIT_LIST_HEAD(&rbio->plug_list);
@@ -1692,8 +1707,10 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 	struct blk_plug_cb *cb;
 
 	rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-	if (IS_ERR(rbio))
+	if (IS_ERR(rbio)) {
+		__free_bbio_and_raid_map(bbio, raid_map, 1);
 		return PTR_ERR(rbio);
+	}
 	bio_list_add(&rbio->bio_list, bio);
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
@@ -2038,15 +2055,19 @@ cleanup:
  */
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 			  struct btrfs_bio *bbio, u64 *raid_map,
-			  u64 stripe_len, int mirror_num)
+			  u64 stripe_len, int mirror_num, int hold_bbio)
 {
 	struct btrfs_raid_bio *rbio;
 	int ret;
 
 	rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-	if (IS_ERR(rbio))
+	if (IS_ERR(rbio)) {
+		__free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
 		return PTR_ERR(rbio);
+	}
 
+	if (hold_bbio)
+		set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
 	rbio->read_rebuild = 1;
 	bio_list_add(&rbio->bio_list, bio);
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
@@ -2054,8 +2075,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 	rbio->faila = find_logical_bio_stripe(rbio, bio);
 	if (rbio->faila == -1) {
 		BUG();
-		kfree(raid_map);
-		kfree(bbio);
+		__free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
 		kfree(rbio);
 		return -EIO;
 	}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index ea5d73b..b310e8c 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map)
 
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 				 struct btrfs_bio *bbio, u64 *raid_map,
-				 u64 stripe_len, int mirror_num);
+				 u64 stripe_len, int mirror_num, int hold_bbio);
 int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 			       struct btrfs_bio *bbio, u64 *raid_map,
 			       u64 stripe_len);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index efa0831..ca4b9eb 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -63,6 +63,13 @@ struct scrub_ctx;
  */
 #define SCRUB_MAX_PAGES_PER_BLOCK	16	/* 64k per node/leaf/sector */
 
+struct scrub_recover {
+	atomic_t		refs;
+	struct btrfs_bio	*bbio;
+	u64			*raid_map;
+	u64			map_length;
+};
+
 struct scrub_page {
 	struct scrub_block	*sblock;
 	struct page		*page;
@@ -79,6 +86,8 @@ struct scrub_page {
 		unsigned int	io_error:1;
 	};
 	u8			csum[BTRFS_CSUM_SIZE];
+
+	struct scrub_recover	*recover;
 };
 
 struct scrub_bio {
@@ -196,7 +205,7 @@ static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 				struct scrub_block *sblock, int is_metadata,
 				int have_csum, u8 *csum, u64 generation,
-				u16 csum_size);
+				u16 csum_size, int retry_failed_mirror);
 static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
 					 struct scrub_block *sblock,
 					 int is_metadata, int have_csum,
@@ -790,6 +799,20 @@ out:
 	scrub_pending_trans_workers_dec(sctx);
 }
 
+static inline void scrub_get_recover(struct scrub_recover *recover)
+{
+	atomic_inc(&recover->refs);
+}
+
+static inline void scrub_put_recover(struct scrub_recover *recover)
+{
+	if (atomic_dec_and_test(&recover->refs)) {
+		kfree(recover->bbio);
+		kfree(recover->raid_map);
+		kfree(recover);
+	}
+}
+
 /*
  * scrub_handle_errored_block gets called when either verification of the
  * pages failed or the bio failed to read, e.g. with EIO. In the latter
@@ -906,7 +929,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 
 	/* build and submit the bios for the failed mirror, check checksums */
 	scrub_recheck_block(fs_info, sblock_bad, is_metadata, have_csum,
-			    csum, generation, sctx->csum_size);
+			    csum, generation, sctx->csum_size, 1);
 
 	if (!sblock_bad->header_error && !sblock_bad->checksum_error &&
 	    sblock_bad->no_io_error_seen) {
@@ -1019,7 +1042,7 @@ nodatasum_case:
 		/* build and submit the bios, check checksums */
 		scrub_recheck_block(fs_info, sblock_other, is_metadata,
 				    have_csum, csum, generation,
-				    sctx->csum_size);
+				    sctx->csum_size, 0);
 
 		if (!sblock_other->header_error &&
 		    !sblock_other->checksum_error &&
@@ -1169,7 +1192,7 @@ nodatasum_case:
 			 */
 			scrub_recheck_block(fs_info, sblock_bad,
 					    is_metadata, have_csum, csum,
-					    generation, sctx->csum_size);
+					    generation, sctx->csum_size, 1);
 			if (!sblock_bad->header_error &&
 			    !sblock_bad->checksum_error &&
 			    sblock_bad->no_io_error_seen)
@@ -1201,11 +1224,18 @@ out:
 		     mirror_index++) {
 			struct scrub_block *sblock = sblocks_for_recheck +
 						     mirror_index;
+			struct scrub_recover *recover;
 			int page_index;
 
 			for (page_index = 0; page_index < sblock->page_count;
 			     page_index++) {
 				sblock->pagev[page_index]->sblock = NULL;
+				recover = sblock->pagev[page_index]->recover;
+				if (recover) {
+					scrub_put_recover(recover);
+					sblock->pagev[page_index]->recover =
+									NULL;
+				}
 				scrub_page_put(sblock->pagev[page_index]);
 			}
 		}
@@ -1215,14 +1245,63 @@ out:
 	return 0;
 }
 
+static inline int scrub_nr_raid_mirrors(struct btrfs_bio *bbio, u64 *raid_map)
+{
+	if (raid_map) {
+		if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
+			return 3;
+		else
+			return 2;
+	} else {
+		return (int)bbio->num_stripes;
+	}
+}
+
+static inline void scrub_stripe_index_and_offset(u64 logical, u64 *raid_map,
+						 u64 mapped_length,
+						 int nstripes, int mirror,
+						 int *stripe_index,
+						 u64 *stripe_offset)
+{
+	int i;
+
+	if (raid_map) {
+		/* RAID5/6 */
+		for (i = 0; i < nstripes; i++) {
+			if (raid_map[i] == RAID6_Q_STRIPE ||
+			    raid_map[i] == RAID5_P_STRIPE)
+				continue;
+
+			if (logical >= raid_map[i] &&
+			    logical < raid_map[i] + mapped_length)
+				break;
+		}
+
+		*stripe_index = i;
+		*stripe_offset = logical - raid_map[i];
+	} else {
+		/* The other RAID type */
+		*stripe_index = mirror;
+		*stripe_offset = 0;
+	}
+}
+
 static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 				     struct btrfs_fs_info *fs_info,
 				     struct scrub_block *original_sblock,
 				     u64 length, u64 logical,
 				     struct scrub_block *sblocks_for_recheck)
 {
+	struct scrub_recover *recover;
+	struct btrfs_bio *bbio;
+	u64 *raid_map;
+	u64 sublen;
+	u64 mapped_length;
+	u64 stripe_offset;
+	int stripe_index;
 	int page_index;
 	int mirror_index;
+	int nmirrors;
 	int ret;
 
 	/*
@@ -1233,23 +1312,39 @@ static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 
 	page_index = 0;
 	while (length > 0) {
-		u64 sublen = min_t(u64, length, PAGE_SIZE);
-		u64 mapped_length = sublen;
-		struct btrfs_bio *bbio = NULL;
+		sublen = min_t(u64, length, PAGE_SIZE);
+		mapped_length = sublen;
+		bbio = NULL;
+		raid_map = NULL;
 
 		/*
 		 * with a length of PAGE_SIZE, each returned stripe
 		 * represents one mirror
 		 */
-		ret = btrfs_map_block(fs_info, REQ_GET_READ_MIRRORS, logical,
-				      &mapped_length, &bbio, 0);
+		ret = btrfs_map_sblock(fs_info, REQ_GET_READ_MIRRORS, logical,
+				       &mapped_length, &bbio, 0, &raid_map);
 		if (ret || !bbio || mapped_length < sublen) {
 			kfree(bbio);
+			kfree(raid_map);
 			return -EIO;
 		}
 
+		recover = kzalloc(sizeof(struct scrub_recover), GFP_NOFS);
+		if (!recover) {
+			kfree(bbio);
+			kfree(raid_map);
+			return -ENOMEM;
+		}
+
+		atomic_set(&recover->refs, 1);
+		recover->bbio = bbio;
+		recover->raid_map = raid_map;
+		recover->map_length = mapped_length;
+
 		BUG_ON(page_index >= SCRUB_PAGES_PER_RD_BIO);
-		for (mirror_index = 0; mirror_index < (int)bbio->num_stripes;
+
+		nmirrors = scrub_nr_raid_mirrors(bbio, raid_map);
+		for (mirror_index = 0; mirror_index < nmirrors;
 		     mirror_index++) {
 			struct scrub_block *sblock;
 			struct scrub_page *page;
@@ -1265,26 +1360,38 @@ leave_nomem:
 				spin_lock(&sctx->stat_lock);
 				sctx->stat.malloc_errors++;
 				spin_unlock(&sctx->stat_lock);
-				kfree(bbio);
+				scrub_put_recover(recover);
 				return -ENOMEM;
 			}
 			scrub_page_get(page);
 			sblock->pagev[page_index] = page;
 			page->logical = logical;
-			page->physical = bbio->stripes[mirror_index].physical;
+
+			scrub_stripe_index_and_offset(logical, raid_map,
+						      mapped_length,
+						      bbio->num_stripes,
+						      mirror_index,
+						      &stripe_index,
+						      &stripe_offset);
+			page->physical = bbio->stripes[stripe_index].physical +
+					 stripe_offset;
+			page->dev = bbio->stripes[stripe_index].dev;
+
 			BUG_ON(page_index >= original_sblock->page_count);
 			page->physical_for_dev_replace =
 				original_sblock->pagev[page_index]->
 				physical_for_dev_replace;
 			/* for missing devices, dev->bdev is NULL */
-			page->dev = bbio->stripes[mirror_index].dev;
 			page->mirror_num = mirror_index + 1;
 			sblock->page_count++;
 			page->page = alloc_page(GFP_NOFS);
 			if (!page->page)
 				goto leave_nomem;
+
+			scrub_get_recover(recover);
+			page->recover = recover;
 		}
-		kfree(bbio);
+		scrub_put_recover(recover);
 		length -= sublen;
 		logical += sublen;
 		page_index++;
@@ -1293,6 +1400,51 @@ leave_nomem:
 	return 0;
 }
 
+struct scrub_bio_ret {
+	struct completion event;
+	int error;
+};
+
+static void scrub_bio_wait_endio(struct bio *bio, int error)
+{
+	struct scrub_bio_ret *ret = bio->bi_private;
+
+	ret->error = error;
+	complete(&ret->event);
+}
+
+static inline int scrub_is_page_on_raid56(struct scrub_page *page)
+{
+	return page->recover && page->recover->raid_map;
+}
+
+static int scrub_submit_raid56_bio_wait(struct btrfs_fs_info *fs_info,
+					struct bio *bio,
+					struct scrub_page *page)
+{
+	struct scrub_bio_ret done;
+	int ret;
+
+	init_completion(&done.event);
+	done.error = 0;
+	bio->bi_iter.bi_sector = page->logical >> 9;
+	bio->bi_private = &done;
+	bio->bi_end_io = scrub_bio_wait_endio;
+
+	ret = raid56_parity_recover(fs_info->fs_root, bio, page->recover->bbio,
+				    page->recover->raid_map,
+				    page->recover->map_length,
+				    page->mirror_num, 1);
+	if (ret)
+		return ret;
+
+	wait_for_completion(&done.event);
+	if (done.error)
+		return -EIO;
+
+	return 0;
+}
+
 /*
  * this function will check the on disk data for checksum errors, header
  * errors and read I/O errors. If any I/O errors happen, the exact pages
@@ -1303,7 +1455,7 @@ leave_nomem:
 static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 				struct scrub_block *sblock, int is_metadata,
 				int have_csum, u8 *csum, u64 generation,
-				u16 csum_size)
+				u16 csum_size, int retry_failed_mirror)
 {
 	int page_num;
 
@@ -1329,11 +1481,17 @@ static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 			continue;
 		}
 		bio->bi_bdev = page->dev->bdev;
-		bio->bi_iter.bi_sector = page->physical >> 9;
 
 		bio_add_page(bio, page->page, PAGE_SIZE, 0);
-		if (btrfsic_submit_bio_wait(READ, bio))
-			sblock->no_io_error_seen = 0;
+		if (!retry_failed_mirror && scrub_is_page_on_raid56(page)) {
+			if (scrub_submit_raid56_bio_wait(fs_info, bio, page))
+				sblock->no_io_error_seen = 0;
+		} else {
+			bio->bi_iter.bi_sector = page->physical >> 9;
+
+			if (btrfsic_submit_bio_wait(READ, bio))
+				sblock->no_io_error_seen = 0;
+		}
 
 		bio_put(bio);
 	}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 66d5035..05cf489 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5162,7 +5162,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 				BTRFS_BLOCK_GROUP_RAID6)) {
 		u64 tmp;
 
-		if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
+		if (raid_map_ret &&
+		    ((rw & (REQ_WRITE | REQ_GET_READ_MIRRORS)) ||
+		     mirror_num > 1)) {
 			int i, rot;
 
 			/* push stripe_nr back to the start of the full stripe */
@@ -5441,6 +5443,16 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 				 mirror_num, NULL);
 }
 
+/* For Scrub/replace */
+int btrfs_map_sblock(struct btrfs_fs_info *fs_info, int rw,
+		     u64 logical, u64 *length,
+		     struct btrfs_bio **bbio_ret, int mirror_num,
+		     u64 **raid_map_ret)
+{
+	return __btrfs_map_block(fs_info, rw, logical, length, bbio_ret,
+				 mirror_num, raid_map_ret);
+}
+
 int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
 		     u64 chunk_start, u64 physical, u64 devid,
 		     u64 **logical, int *naddrs, int *stripe_len)
@@ -5810,7 +5822,7 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
 		} else {
 			ret = raid56_parity_recover(root, bio, bbio,
 						    raid_map, map_length,
-						    mirror_num);
+						    mirror_num, 0);
 		}
 		/*
 		 * FIXME, replace dosen't support raid56 yet, please fix
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 08980fa..01094bb 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -393,6 +393,10 @@ int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
 int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		    u64 logical, u64 *length,
 		    struct btrfs_bio **bbio_ret, int mirror_num);
+int btrfs_map_sblock(struct btrfs_fs_info *fs_info, int rw,
+		     u64 logical, u64 *length,
+		     struct btrfs_bio **bbio_ret, int mirror_num,
+		     u64 **raid_map_ret);
 int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
 		     u64 chunk_start, u64 physical, u64 devid,
 		     u64 **logical, int *naddrs, int *stripe_len);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 05/11] Btrfs, raid56: use a variant to record the operation type
  2014-11-26 13:04   ` [PATCH v3 00/11] " Miao Xie
                       ` (3 preceding siblings ...)
  2014-11-26 13:04     ` [PATCH v3 04/11] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted Miao Xie
@ 2014-11-26 13:04     ` Miao Xie
  2014-11-26 13:04     ` [PATCH v3 06/11] Btrfs, raid56: support parity scrub on raid56 Miao Xie
                       ` (5 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-26 13:04 UTC (permalink / raw)
  To: linux-btrfs

We will introduce new operation type later, if we still use integer
variant as bool variant to record the operation type, we would add new
variant and increase the size of raid bio structure. It is not good,
by this patch, we define different number for different operation,
and we can just use a variant to record the operation type.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/raid56.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6013d88..bfc406d 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -62,6 +62,11 @@
 
 #define RBIO_CACHE_SIZE 1024
 
+enum btrfs_rbio_ops {
+	BTRFS_RBIO_WRITE	= 0,
+	BTRFS_RBIO_READ_REBUILD	= 1,
+};
+
 struct btrfs_raid_bio {
 	struct btrfs_fs_info *fs_info;
 	struct btrfs_bio *bbio;
@@ -124,7 +129,7 @@ struct btrfs_raid_bio {
 	 * differently from a parity rebuild as part of
 	 * rmw
 	 */
-	int read_rebuild;
+	enum btrfs_rbio_ops operation;
 
 	/* first bad stripe */
 	int faila;
@@ -147,7 +152,6 @@ struct btrfs_raid_bio {
 
 	atomic_t refs;
 
-
 	atomic_t stripes_pending;
 
 	atomic_t error;
@@ -583,8 +587,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
 		return 0;
 
 	/* reads can't merge with writes */
-	if (last->read_rebuild !=
-	    cur->read_rebuild) {
+	if (last->operation != cur->operation) {
 		return 0;
 	}
 
@@ -777,9 +780,9 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
 			spin_unlock(&rbio->bio_list_lock);
 			spin_unlock_irqrestore(&h->lock, flags);
 
-			if (next->read_rebuild)
+			if (next->operation == BTRFS_RBIO_READ_REBUILD)
 				async_read_rebuild(next);
-			else {
+			else if (next->operation == BTRFS_RBIO_WRITE){
 				steal_rbio(rbio, next);
 				async_rmw_stripe(next);
 			}
@@ -1713,6 +1716,7 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 	}
 	bio_list_add(&rbio->bio_list, bio);
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
+	rbio->operation = BTRFS_RBIO_WRITE;
 
 	/*
 	 * don't plug on full rbios, just get them out the door
@@ -1761,7 +1765,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 	faila = rbio->faila;
 	failb = rbio->failb;
 
-	if (rbio->read_rebuild) {
+	if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
 		spin_lock_irq(&rbio->bio_list_lock);
 		set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
 		spin_unlock_irq(&rbio->bio_list_lock);
@@ -1778,7 +1782,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 			 * if we're rebuilding a read, we have to use
 			 * pages from the bio list
 			 */
-			if (rbio->read_rebuild &&
+			if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
 			    (stripe == faila || stripe == failb)) {
 				page = page_in_rbio(rbio, stripe, pagenr, 0);
 			} else {
@@ -1871,7 +1875,7 @@ pstripe:
 		 * know they can be trusted.  If this was a read reconstruction,
 		 * other endio functions will fiddle the uptodate bits
 		 */
-		if (!rbio->read_rebuild) {
+		if (rbio->operation == BTRFS_RBIO_WRITE) {
 			for (i = 0;  i < nr_pages; i++) {
 				if (faila != -1) {
 					page = rbio_stripe_page(rbio, faila, i);
@@ -1888,7 +1892,7 @@ pstripe:
 			 * if we're rebuilding a read, we have to use
 			 * pages from the bio list
 			 */
-			if (rbio->read_rebuild &&
+			if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
 			    (stripe == faila || stripe == failb)) {
 				page = page_in_rbio(rbio, stripe, pagenr, 0);
 			} else {
@@ -1904,7 +1908,7 @@ cleanup:
 
 cleanup_io:
 
-	if (rbio->read_rebuild) {
+	if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
 		if (err == 0)
 			cache_rbio_pages(rbio);
 		else
@@ -2042,7 +2046,7 @@ out:
 	return 0;
 
 cleanup:
-	if (rbio->read_rebuild)
+	if (rbio->operation == BTRFS_RBIO_READ_REBUILD)
 		rbio_orig_end_io(rbio, -EIO, 0);
 	return -EIO;
 }
@@ -2068,7 +2072,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 
 	if (hold_bbio)
 		set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
-	rbio->read_rebuild = 1;
+	rbio->operation = BTRFS_RBIO_READ_REBUILD;
 	bio_list_add(&rbio->bio_list, bio);
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 06/11] Btrfs, raid56: support parity scrub on raid56
  2014-11-26 13:04   ` [PATCH v3 00/11] " Miao Xie
                       ` (4 preceding siblings ...)
  2014-11-26 13:04     ` [PATCH v3 05/11] Btrfs, raid56: use a variant to record the operation type Miao Xie
@ 2014-11-26 13:04     ` Miao Xie
  2014-11-26 13:04     ` [PATCH v3 07/11] Btrfs, replace: write dirty pages into the replace target device Miao Xie
                       ` (4 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-26 13:04 UTC (permalink / raw)
  To: linux-btrfs

The implementation is:
- Read and check all the data with checksum in the same stripe.
  All the data which has checksum is COW data, and we are sure
  that it is not changed though we don't lock the stripe. because
  the space of that data just can be reclaimed after the current
  transction is committed, and then the fs can use it to store the
  other data, but when doing scrub, we hold the current transaction,
  that is that data can not be recovered, it is safe that read and check
  it out of the stripe lock.
- Lock the stripe
- Read out all the data without checksum and parity
  The data without checksum and the parity may be changed if we don't
  lock the stripe, so we need read it in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write back it if the old parity
  is not right
- Unlock the stripe

If we can not read out the data or the data we read is corrupted,
we will try to repair it. If the repair fails. we will mark the
horizontal sub-stripe(pages on the same horizontal) as corrupted
sub-stripe, and we will skip the parity check and repair of that
horizontal sub-stripe.

And in order to skip the horizontal sub-stripe that has no data, we
introduce a bitmap. If there is some data on the horizontal sub-stripe,
we will the relative bit to 1, and when we check and repair the
parity, we will skip those horizontal sub-stripes that the relative
bits is 0.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v2 -> v3:
- Fix wrong stripe start logical address calculation which was reported
  by Chris.
- Fix unhandled raid bios for parity scrub, which are added into the plug
  list of the head raid bio.
- Modify the code that is used to avoid the rbio merge.

Changelog v1 -> v2:
- None.
---
 fs/btrfs/raid56.c | 514 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/raid56.h |  12 ++
 fs/btrfs/scrub.c  | 609 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 1115 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index bfc406d..3b99cbc 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -65,6 +65,7 @@
 enum btrfs_rbio_ops {
 	BTRFS_RBIO_WRITE	= 0,
 	BTRFS_RBIO_READ_REBUILD	= 1,
+	BTRFS_RBIO_PARITY_SCRUB	= 2,
 };
 
 struct btrfs_raid_bio {
@@ -123,6 +124,7 @@ struct btrfs_raid_bio {
 	/* number of data stripes (no p/q) */
 	int nr_data;
 
+	int stripe_npages;
 	/*
 	 * set if we're doing a parity rebuild
 	 * for a read from higher up, which is handled
@@ -137,6 +139,7 @@ struct btrfs_raid_bio {
 	/* second bad stripe (for raid6 use) */
 	int failb;
 
+	int scrubp;
 	/*
 	 * number of pages needed to represent the full
 	 * stripe
@@ -171,6 +174,11 @@ struct btrfs_raid_bio {
 	 * here for faster lookup
 	 */
 	struct page **bio_pages;
+
+	/*
+	 * bitmap to record which horizontal stripe has data
+	 */
+	unsigned long *dbitmap;
 };
 
 static int __raid56_parity_recover(struct btrfs_raid_bio *rbio);
@@ -185,6 +193,10 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio);
 static void index_rbio_pages(struct btrfs_raid_bio *rbio);
 static int alloc_rbio_pages(struct btrfs_raid_bio *rbio);
 
+static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
+					 int need_check);
+static void async_scrub_parity(struct btrfs_raid_bio *rbio);
+
 /*
  * the stripe hash table is used for locking, and to collect
  * bios in hopes of making a full stripe
@@ -586,10 +598,20 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
 	    cur->raid_map[0])
 		return 0;
 
-	/* reads can't merge with writes */
-	if (last->operation != cur->operation) {
+	/* we can't merge with different operations */
+	if (last->operation != cur->operation)
+		return 0;
+	/*
+	 * We've need read the full stripe from the drive.
+	 * check and repair the parity and write the new results.
+	 *
+	 * We're not allowed to add any new bios to the
+	 * bio list here, anyone else that wants to
+	 * change this stripe needs to do their own rmw.
+	 */
+	if (last->operation == BTRFS_RBIO_PARITY_SCRUB ||
+	    cur->operation == BTRFS_RBIO_PARITY_SCRUB)
 		return 0;
-	}
 
 	return 1;
 }
@@ -782,9 +804,12 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
 
 			if (next->operation == BTRFS_RBIO_READ_REBUILD)
 				async_read_rebuild(next);
-			else if (next->operation == BTRFS_RBIO_WRITE){
+			else if (next->operation == BTRFS_RBIO_WRITE) {
 				steal_rbio(rbio, next);
 				async_rmw_stripe(next);
+			} else if (next->operation == BTRFS_RBIO_PARITY_SCRUB) {
+				steal_rbio(rbio, next);
+				async_scrub_parity(next);
 			}
 
 			goto done_nolock;
@@ -950,9 +975,11 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	struct btrfs_raid_bio *rbio;
 	int nr_data = 0;
 	int num_pages = rbio_nr_pages(stripe_len, bbio->num_stripes);
+	int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
 	void *p;
 
-	rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
+	rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2 +
+		       DIV_ROUND_UP(stripe_npages, BITS_PER_LONG / 8),
 			GFP_NOFS);
 	if (!rbio)
 		return ERR_PTR(-ENOMEM);
@@ -967,6 +994,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	rbio->fs_info = root->fs_info;
 	rbio->stripe_len = stripe_len;
 	rbio->nr_pages = num_pages;
+	rbio->stripe_npages = stripe_npages;
 	rbio->faila = -1;
 	rbio->failb = -1;
 	atomic_set(&rbio->refs, 1);
@@ -980,6 +1008,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	p = rbio + 1;
 	rbio->stripe_pages = p;
 	rbio->bio_pages = p + sizeof(struct page *) * num_pages;
+	rbio->dbitmap = p + sizeof(struct page *) * num_pages * 2;
 
 	if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
 		nr_data = bbio->num_stripes - 2;
@@ -1774,6 +1803,14 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 	index_rbio_pages(rbio);
 
 	for (pagenr = 0; pagenr < nr_pages; pagenr++) {
+		/*
+		 * Now we just use bitmap to mark the horizontal stripes in
+		 * which we have data when doing parity scrub.
+		 */
+		if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB &&
+		    !test_bit(pagenr, rbio->dbitmap))
+			continue;
+
 		/* setup our array of pointers with pages
 		 * from each stripe
 		 */
@@ -1918,7 +1955,13 @@ cleanup_io:
 	} else if (err == 0) {
 		rbio->faila = -1;
 		rbio->failb = -1;
-		finish_rmw(rbio);
+
+		if (rbio->operation == BTRFS_RBIO_WRITE)
+			finish_rmw(rbio);
+		else if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB)
+			finish_parity_scrub(rbio, 0);
+		else
+			BUG();
 	} else {
 		rbio_orig_end_io(rbio, err, 0);
 	}
@@ -2126,3 +2169,462 @@ static void read_rebuild_work(struct btrfs_work *work)
 	rbio = container_of(work, struct btrfs_raid_bio, work);
 	__raid56_parity_recover(rbio);
 }
+
+/*
+ * The following code is used to scrub/replace the parity stripe
+ *
+ * Note: We need make sure all the pages that add into the scrub/replace
+ * raid bio are correct and not be changed during the scrub/replace. That
+ * is those pages just hold metadata or file data with checksum.
+ */
+
+struct btrfs_raid_bio *
+raid56_parity_alloc_scrub_rbio(struct btrfs_root *root, struct bio *bio,
+			       struct btrfs_bio *bbio, u64 *raid_map,
+			       u64 stripe_len, struct btrfs_device *scrub_dev,
+			       unsigned long *dbitmap, int stripe_nsectors)
+{
+	struct btrfs_raid_bio *rbio;
+	int i;
+
+	rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
+	if (IS_ERR(rbio))
+		return NULL;
+	bio_list_add(&rbio->bio_list, bio);
+	/*
+	 * This is a special bio which is used to hold the completion handler
+	 * and make the scrub rbio is similar to the other types
+	 */
+	ASSERT(!bio->bi_iter.bi_size);
+	rbio->operation = BTRFS_RBIO_PARITY_SCRUB;
+
+	for (i = 0; i < bbio->num_stripes; i++) {
+		if (bbio->stripes[i].dev == scrub_dev) {
+			rbio->scrubp = i;
+			break;
+		}
+	}
+
+	/* Now we just support the sectorsize equals to page size */
+	ASSERT(root->sectorsize == PAGE_SIZE);
+	ASSERT(rbio->stripe_npages == stripe_nsectors);
+	bitmap_copy(rbio->dbitmap, dbitmap, stripe_nsectors);
+
+	return rbio;
+}
+
+void raid56_parity_add_scrub_pages(struct btrfs_raid_bio *rbio,
+				   struct page *page, u64 logical)
+{
+	int stripe_offset;
+	int index;
+
+	ASSERT(logical >= rbio->raid_map[0]);
+	ASSERT(logical + PAGE_SIZE <= rbio->raid_map[0] +
+				rbio->stripe_len * rbio->nr_data);
+	stripe_offset = (int)(logical - rbio->raid_map[0]);
+	index = stripe_offset >> PAGE_CACHE_SHIFT;
+	rbio->bio_pages[index] = page;
+}
+
+/*
+ * We just scrub the parity that we have correct data on the same horizontal,
+ * so we needn't allocate all pages for all the stripes.
+ */
+static int alloc_rbio_essential_pages(struct btrfs_raid_bio *rbio)
+{
+	int i;
+	int bit;
+	int index;
+	struct page *page;
+
+	for_each_set_bit(bit, rbio->dbitmap, rbio->stripe_npages) {
+		for (i = 0; i < rbio->bbio->num_stripes; i++) {
+			index = i * rbio->stripe_npages + bit;
+			if (rbio->stripe_pages[index])
+				continue;
+
+			page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+			if (!page)
+				return -ENOMEM;
+			rbio->stripe_pages[index] = page;
+			ClearPageUptodate(page);
+		}
+	}
+	return 0;
+}
+
+/*
+ * end io function used by finish_rmw.  When we finally
+ * get here, we've written a full stripe
+ */
+static void raid_write_parity_end_io(struct bio *bio, int err)
+{
+	struct btrfs_raid_bio *rbio = bio->bi_private;
+
+	if (err)
+		fail_bio_stripe(rbio, bio);
+
+	bio_put(bio);
+
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
+		return;
+
+	err = 0;
+
+	if (atomic_read(&rbio->error))
+		err = -EIO;
+
+	rbio_orig_end_io(rbio, err, 0);
+}
+
+static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
+					 int need_check)
+{
+	struct btrfs_bio *bbio = rbio->bbio;
+	void *pointers[bbio->num_stripes];
+	int nr_data = rbio->nr_data;
+	int stripe;
+	int pagenr;
+	int p_stripe = -1;
+	int q_stripe = -1;
+	struct page *p_page = NULL;
+	struct page *q_page = NULL;
+	struct bio_list bio_list;
+	struct bio *bio;
+	int ret;
+
+	bio_list_init(&bio_list);
+
+	if (bbio->num_stripes - rbio->nr_data == 1) {
+		p_stripe = bbio->num_stripes - 1;
+	} else if (bbio->num_stripes - rbio->nr_data == 2) {
+		p_stripe = bbio->num_stripes - 2;
+		q_stripe = bbio->num_stripes - 1;
+	} else {
+		BUG();
+	}
+
+	/*
+	 * Because the higher layers(scrubber) are unlikely to
+	 * use this area of the disk again soon, so don't cache
+	 * it.
+	 */
+	clear_bit(RBIO_CACHE_READY_BIT, &rbio->flags);
+
+	if (!need_check)
+		goto writeback;
+
+	p_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+	if (!p_page)
+		goto cleanup;
+	SetPageUptodate(p_page);
+
+	if (q_stripe != -1) {
+		q_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (!q_page) {
+			__free_page(p_page);
+			goto cleanup;
+		}
+		SetPageUptodate(q_page);
+	}
+
+	atomic_set(&rbio->error, 0);
+
+	for_each_set_bit(pagenr, rbio->dbitmap, rbio->stripe_npages) {
+		struct page *p;
+		void *parity;
+		/* first collect one page from each data stripe */
+		for (stripe = 0; stripe < nr_data; stripe++) {
+			p = page_in_rbio(rbio, stripe, pagenr, 0);
+			pointers[stripe] = kmap(p);
+		}
+
+		/* then add the parity stripe */
+		pointers[stripe++] = kmap(p_page);
+
+		if (q_stripe != -1) {
+
+			/*
+			 * raid6, add the qstripe and call the
+			 * library function to fill in our p/q
+			 */
+			pointers[stripe++] = kmap(q_page);
+
+			raid6_call.gen_syndrome(bbio->num_stripes, PAGE_SIZE,
+						pointers);
+		} else {
+			/* raid5 */
+			memcpy(pointers[nr_data], pointers[0], PAGE_SIZE);
+			run_xor(pointers + 1, nr_data - 1, PAGE_CACHE_SIZE);
+		}
+
+		/* Check scrubbing pairty and repair it */
+		p = rbio_stripe_page(rbio, rbio->scrubp, pagenr);
+		parity = kmap(p);
+		if (memcmp(parity, pointers[rbio->scrubp], PAGE_CACHE_SIZE))
+			memcpy(parity, pointers[rbio->scrubp], PAGE_CACHE_SIZE);
+		else
+			/* Parity is right, needn't writeback */
+			bitmap_clear(rbio->dbitmap, pagenr, 1);
+		kunmap(p);
+
+		for (stripe = 0; stripe < bbio->num_stripes; stripe++)
+			kunmap(page_in_rbio(rbio, stripe, pagenr, 0));
+	}
+
+	__free_page(p_page);
+	if (q_page)
+		__free_page(q_page);
+
+writeback:
+	/*
+	 * time to start writing.  Make bios for everything from the
+	 * higher layers (the bio_list in our rbio) and our p/q.  Ignore
+	 * everything else.
+	 */
+	for_each_set_bit(pagenr, rbio->dbitmap, rbio->stripe_npages) {
+		struct page *page;
+
+		page = rbio_stripe_page(rbio, rbio->scrubp, pagenr);
+		ret = rbio_add_io_page(rbio, &bio_list,
+			       page, rbio->scrubp, pagenr, rbio->stripe_len);
+		if (ret)
+			goto cleanup;
+	}
+
+	nr_data = bio_list_size(&bio_list);
+	if (!nr_data) {
+		/* Every parity is right */
+		rbio_orig_end_io(rbio, 0, 0);
+		return;
+	}
+
+	atomic_set(&rbio->stripes_pending, nr_data);
+
+	while (1) {
+		bio = bio_list_pop(&bio_list);
+		if (!bio)
+			break;
+
+		bio->bi_private = rbio;
+		bio->bi_end_io = raid_write_parity_end_io;
+		BUG_ON(!test_bit(BIO_UPTODATE, &bio->bi_flags));
+		submit_bio(WRITE, bio);
+	}
+	return;
+
+cleanup:
+	rbio_orig_end_io(rbio, -EIO, 0);
+}
+
+static inline int is_data_stripe(struct btrfs_raid_bio *rbio, int stripe)
+{
+	if (stripe >= 0 && stripe < rbio->nr_data)
+		return 1;
+	return 0;
+}
+
+/*
+ * While we're doing the parity check and repair, we could have errors
+ * in reading pages off the disk.  This checks for errors and if we're
+ * not able to read the page it'll trigger parity reconstruction.  The
+ * parity scrub will be finished after we've reconstructed the failed
+ * stripes
+ */
+static void validate_rbio_for_parity_scrub(struct btrfs_raid_bio *rbio)
+{
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
+		goto cleanup;
+
+	if (rbio->faila >= 0 || rbio->failb >= 0) {
+		int dfail = 0, failp = -1;
+
+		if (is_data_stripe(rbio, rbio->faila))
+			dfail++;
+		else if (is_parity_stripe(rbio->faila))
+			failp = rbio->faila;
+
+		if (is_data_stripe(rbio, rbio->failb))
+			dfail++;
+		else if (is_parity_stripe(rbio->failb))
+			failp = rbio->failb;
+
+		/*
+		 * Because we can not use a scrubbing parity to repair
+		 * the data, so the capability of the repair is declined.
+		 * (In the case of RAID5, we can not repair anything)
+		 */
+		if (dfail > rbio->bbio->max_errors - 1)
+			goto cleanup;
+
+		/*
+		 * If all data is good, only parity is correctly, just
+		 * repair the parity.
+		 */
+		if (dfail == 0) {
+			finish_parity_scrub(rbio, 0);
+			return;
+		}
+
+		/*
+		 * Here means we got one corrupted data stripe and one
+		 * corrupted parity on RAID6, if the corrupted parity
+		 * is scrubbing parity, luckly, use the other one to repair
+		 * the data, or we can not repair the data stripe.
+		 */
+		if (failp != rbio->scrubp)
+			goto cleanup;
+
+		__raid_recover_end_io(rbio);
+	} else {
+		finish_parity_scrub(rbio, 1);
+	}
+	return;
+
+cleanup:
+	rbio_orig_end_io(rbio, -EIO, 0);
+}
+
+/*
+ * end io for the read phase of the rmw cycle.  All the bios here are physical
+ * stripe bios we've read from the disk so we can recalculate the parity of the
+ * stripe.
+ *
+ * This will usually kick off finish_rmw once all the bios are read in, but it
+ * may trigger parity reconstruction if we had any errors along the way
+ */
+static void raid56_parity_scrub_end_io(struct bio *bio, int err)
+{
+	struct btrfs_raid_bio *rbio = bio->bi_private;
+
+	if (err)
+		fail_bio_stripe(rbio, bio);
+	else
+		set_bio_pages_uptodate(bio);
+
+	bio_put(bio);
+
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
+		return;
+
+	/*
+	 * this will normally call finish_rmw to start our write
+	 * but if there are any failed stripes we'll reconstruct
+	 * from parity first
+	 */
+	validate_rbio_for_parity_scrub(rbio);
+}
+
+static void raid56_parity_scrub_stripe(struct btrfs_raid_bio *rbio)
+{
+	int bios_to_read = 0;
+	struct btrfs_bio *bbio = rbio->bbio;
+	struct bio_list bio_list;
+	int ret;
+	int pagenr;
+	int stripe;
+	struct bio *bio;
+
+	ret = alloc_rbio_essential_pages(rbio);
+	if (ret)
+		goto cleanup;
+
+	bio_list_init(&bio_list);
+
+	atomic_set(&rbio->error, 0);
+	/*
+	 * build a list of bios to read all the missing parts of this
+	 * stripe
+	 */
+	for (stripe = 0; stripe < bbio->num_stripes; stripe++) {
+		for_each_set_bit(pagenr, rbio->dbitmap, rbio->stripe_npages) {
+			struct page *page;
+			/*
+			 * we want to find all the pages missing from
+			 * the rbio and read them from the disk.  If
+			 * page_in_rbio finds a page in the bio list
+			 * we don't need to read it off the stripe.
+			 */
+			page = page_in_rbio(rbio, stripe, pagenr, 1);
+			if (page)
+				continue;
+
+			page = rbio_stripe_page(rbio, stripe, pagenr);
+			/*
+			 * the bio cache may have handed us an uptodate
+			 * page.  If so, be happy and use it
+			 */
+			if (PageUptodate(page))
+				continue;
+
+			ret = rbio_add_io_page(rbio, &bio_list, page,
+				       stripe, pagenr, rbio->stripe_len);
+			if (ret)
+				goto cleanup;
+		}
+	}
+
+	bios_to_read = bio_list_size(&bio_list);
+	if (!bios_to_read) {
+		/*
+		 * this can happen if others have merged with
+		 * us, it means there is nothing left to read.
+		 * But if there are missing devices it may not be
+		 * safe to do the full stripe write yet.
+		 */
+		goto finish;
+	}
+
+	/*
+	 * the bbio may be freed once we submit the last bio.  Make sure
+	 * not to touch it after that
+	 */
+	atomic_set(&rbio->stripes_pending, bios_to_read);
+	while (1) {
+		bio = bio_list_pop(&bio_list);
+		if (!bio)
+			break;
+
+		bio->bi_private = rbio;
+		bio->bi_end_io = raid56_parity_scrub_end_io;
+
+		btrfs_bio_wq_end_io(rbio->fs_info, bio,
+				    BTRFS_WQ_ENDIO_RAID56);
+
+		BUG_ON(!test_bit(BIO_UPTODATE, &bio->bi_flags));
+		submit_bio(READ, bio);
+	}
+	/* the actual write will happen once the reads are done */
+	return;
+
+cleanup:
+	rbio_orig_end_io(rbio, -EIO, 0);
+	return;
+
+finish:
+	validate_rbio_for_parity_scrub(rbio);
+}
+
+static void scrub_parity_work(struct btrfs_work *work)
+{
+	struct btrfs_raid_bio *rbio;
+
+	rbio = container_of(work, struct btrfs_raid_bio, work);
+	raid56_parity_scrub_stripe(rbio);
+}
+
+static void async_scrub_parity(struct btrfs_raid_bio *rbio)
+{
+	btrfs_init_work(&rbio->work, btrfs_rmw_helper,
+			scrub_parity_work, NULL, NULL);
+
+	btrfs_queue_work(rbio->fs_info->rmw_workers,
+			 &rbio->work);
+}
+
+void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio)
+{
+	if (!lock_stripe_add(rbio))
+		async_scrub_parity(rbio);
+}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index b310e8c..3d4ddb3 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -39,6 +39,9 @@ static inline int nr_data_stripes(struct map_lookup *map)
 #define is_parity_stripe(x) (((x) == RAID5_P_STRIPE) ||		\
 			     ((x) == RAID6_Q_STRIPE))
 
+struct btrfs_raid_bio;
+struct btrfs_device;
+
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 				 struct btrfs_bio *bbio, u64 *raid_map,
 				 u64 stripe_len, int mirror_num, int hold_bbio);
@@ -46,6 +49,15 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 			       struct btrfs_bio *bbio, u64 *raid_map,
 			       u64 stripe_len);
 
+struct btrfs_raid_bio *
+raid56_parity_alloc_scrub_rbio(struct btrfs_root *root, struct bio *bio,
+			       struct btrfs_bio *bbio, u64 *raid_map,
+			       u64 stripe_len, struct btrfs_device *scrub_dev,
+			       unsigned long *dbitmap, int stripe_nsectors);
+void raid56_parity_add_scrub_pages(struct btrfs_raid_bio *rbio,
+				   struct page *page, u64 logical);
+void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio);
+
 int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info);
 void btrfs_free_stripe_hash_table(struct btrfs_fs_info *info);
 #endif
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ca4b9eb..7f95afc 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -74,6 +74,7 @@ struct scrub_page {
 	struct scrub_block	*sblock;
 	struct page		*page;
 	struct btrfs_device	*dev;
+	struct list_head	list;
 	u64			flags;  /* extent flags */
 	u64			generation;
 	u64			logical;
@@ -114,14 +115,52 @@ struct scrub_block {
 	atomic_t		outstanding_pages;
 	atomic_t		ref_count; /* free mem on transition to zero */
 	struct scrub_ctx	*sctx;
+	struct scrub_parity	*sparity;
 	struct {
 		unsigned int	header_error:1;
 		unsigned int	checksum_error:1;
 		unsigned int	no_io_error_seen:1;
 		unsigned int	generation_error:1; /* also sets header_error */
+
+		/* The following is for the data used to check parity */
+		/* It is for the data with checksum */
+		unsigned int	data_corrected:1;
 	};
 };
 
+/* Used for the chunks with parity stripe such RAID5/6 */
+struct scrub_parity {
+	struct scrub_ctx	*sctx;
+
+	struct btrfs_device	*scrub_dev;
+
+	u64			logic_start;
+
+	u64			logic_end;
+
+	int			nsectors;
+
+	int			stripe_len;
+
+	atomic_t		ref_count;
+
+	struct list_head	spages;
+
+	/* Work of parity check and repair */
+	struct btrfs_work	work;
+
+	/* Mark the parity blocks which have data */
+	unsigned long		*dbitmap;
+
+	/*
+	 * Mark the parity blocks which have data, but errors happen when
+	 * read data or check data
+	 */
+	unsigned long		*ebitmap;
+
+	unsigned long		bitmap[0];
+};
+
 struct scrub_wr_ctx {
 	struct scrub_bio *wr_curr_bio;
 	struct btrfs_device *tgtdev;
@@ -227,6 +266,8 @@ static void scrub_block_get(struct scrub_block *sblock);
 static void scrub_block_put(struct scrub_block *sblock);
 static void scrub_page_get(struct scrub_page *spage);
 static void scrub_page_put(struct scrub_page *spage);
+static void scrub_parity_get(struct scrub_parity *sparity);
+static void scrub_parity_put(struct scrub_parity *sparity);
 static int scrub_add_page_to_rd_bio(struct scrub_ctx *sctx,
 				    struct scrub_page *spage);
 static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
@@ -943,6 +984,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		 */
 		spin_lock(&sctx->stat_lock);
 		sctx->stat.unverified_errors++;
+		sblock_to_check->data_corrected = 1;
 		spin_unlock(&sctx->stat_lock);
 
 		if (sctx->is_dev_replace)
@@ -1203,6 +1245,7 @@ nodatasum_case:
 corrected_error:
 			spin_lock(&sctx->stat_lock);
 			sctx->stat.corrected_errors++;
+			sblock_to_check->data_corrected = 1;
 			spin_unlock(&sctx->stat_lock);
 			printk_ratelimited_in_rcu(KERN_ERR
 				"BTRFS: fixed up error at logical %llu on dev %s\n",
@@ -1644,6 +1687,13 @@ static void scrub_write_block_to_dev_replace(struct scrub_block *sblock)
 {
 	int page_num;
 
+	/*
+	 * This block is used for the check of the parity on the source device,
+	 * so the data needn't be written into the destination device.
+	 */
+	if (sblock->sparity)
+		return;
+
 	for (page_num = 0; page_num < sblock->page_count; page_num++) {
 		int ret;
 
@@ -2025,6 +2075,9 @@ static void scrub_block_put(struct scrub_block *sblock)
 	if (atomic_dec_and_test(&sblock->ref_count)) {
 		int i;
 
+		if (sblock->sparity)
+			scrub_parity_put(sblock->sparity);
+
 		for (i = 0; i < sblock->page_count; i++)
 			scrub_page_put(sblock->pagev[i]);
 		kfree(sblock);
@@ -2282,9 +2335,51 @@ static void scrub_bio_end_io_worker(struct btrfs_work *work)
 	scrub_pending_bio_dec(sctx);
 }
 
+static inline void __scrub_mark_bitmap(struct scrub_parity *sparity,
+				       unsigned long *bitmap,
+				       u64 start, u64 len)
+{
+	int offset;
+	int nsectors;
+	int sectorsize = sparity->sctx->dev_root->sectorsize;
+
+	if (len >= sparity->stripe_len) {
+		bitmap_set(bitmap, 0, sparity->nsectors);
+		return;
+	}
+
+	start -= sparity->logic_start;
+	offset = (int)do_div(start, sparity->stripe_len);
+	offset /= sectorsize;
+	nsectors = (int)len / sectorsize;
+
+	if (offset + nsectors <= sparity->nsectors) {
+		bitmap_set(bitmap, offset, nsectors);
+		return;
+	}
+
+	bitmap_set(bitmap, offset, sparity->nsectors - offset);
+	bitmap_set(bitmap, 0, nsectors - (sparity->nsectors - offset));
+}
+
+static inline void scrub_parity_mark_sectors_error(struct scrub_parity *sparity,
+						   u64 start, u64 len)
+{
+	__scrub_mark_bitmap(sparity, sparity->ebitmap, start, len);
+}
+
+static inline void scrub_parity_mark_sectors_data(struct scrub_parity *sparity,
+						  u64 start, u64 len)
+{
+	__scrub_mark_bitmap(sparity, sparity->dbitmap, start, len);
+}
+
 static void scrub_block_complete(struct scrub_block *sblock)
 {
+	int corrupted = 0;
+
 	if (!sblock->no_io_error_seen) {
+		corrupted = 1;
 		scrub_handle_errored_block(sblock);
 	} else {
 		/*
@@ -2292,9 +2387,19 @@ static void scrub_block_complete(struct scrub_block *sblock)
 		 * dev replace case, otherwise write here in dev replace
 		 * case.
 		 */
-		if (!scrub_checksum(sblock) && sblock->sctx->is_dev_replace)
+		corrupted = scrub_checksum(sblock);
+		if (!corrupted && sblock->sctx->is_dev_replace)
 			scrub_write_block_to_dev_replace(sblock);
 	}
+
+	if (sblock->sparity && corrupted && !sblock->data_corrected) {
+		u64 start = sblock->pagev[0]->logical;
+		u64 end = sblock->pagev[sblock->page_count - 1]->logical +
+			  PAGE_SIZE;
+
+		scrub_parity_mark_sectors_error(sblock->sparity,
+						start, end - start);
+	}
 }
 
 static int scrub_find_csum(struct scrub_ctx *sctx, u64 logical, u64 len,
@@ -2386,6 +2491,132 @@ behind_scrub_pages:
 	return 0;
 }
 
+static int scrub_pages_for_parity(struct scrub_parity *sparity,
+				  u64 logical, u64 len,
+				  u64 physical, struct btrfs_device *dev,
+				  u64 flags, u64 gen, int mirror_num, u8 *csum)
+{
+	struct scrub_ctx *sctx = sparity->sctx;
+	struct scrub_block *sblock;
+	int index;
+
+	sblock = kzalloc(sizeof(*sblock), GFP_NOFS);
+	if (!sblock) {
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.malloc_errors++;
+		spin_unlock(&sctx->stat_lock);
+		return -ENOMEM;
+	}
+
+	/* one ref inside this function, plus one for each page added to
+	 * a bio later on */
+	atomic_set(&sblock->ref_count, 1);
+	sblock->sctx = sctx;
+	sblock->no_io_error_seen = 1;
+	sblock->sparity = sparity;
+	scrub_parity_get(sparity);
+
+	for (index = 0; len > 0; index++) {
+		struct scrub_page *spage;
+		u64 l = min_t(u64, len, PAGE_SIZE);
+
+		spage = kzalloc(sizeof(*spage), GFP_NOFS);
+		if (!spage) {
+leave_nomem:
+			spin_lock(&sctx->stat_lock);
+			sctx->stat.malloc_errors++;
+			spin_unlock(&sctx->stat_lock);
+			scrub_block_put(sblock);
+			return -ENOMEM;
+		}
+		BUG_ON(index >= SCRUB_MAX_PAGES_PER_BLOCK);
+		/* For scrub block */
+		scrub_page_get(spage);
+		sblock->pagev[index] = spage;
+		/* For scrub parity */
+		scrub_page_get(spage);
+		list_add_tail(&spage->list, &sparity->spages);
+		spage->sblock = sblock;
+		spage->dev = dev;
+		spage->flags = flags;
+		spage->generation = gen;
+		spage->logical = logical;
+		spage->physical = physical;
+		spage->mirror_num = mirror_num;
+		if (csum) {
+			spage->have_csum = 1;
+			memcpy(spage->csum, csum, sctx->csum_size);
+		} else {
+			spage->have_csum = 0;
+		}
+		sblock->page_count++;
+		spage->page = alloc_page(GFP_NOFS);
+		if (!spage->page)
+			goto leave_nomem;
+		len -= l;
+		logical += l;
+		physical += l;
+	}
+
+	WARN_ON(sblock->page_count == 0);
+	for (index = 0; index < sblock->page_count; index++) {
+		struct scrub_page *spage = sblock->pagev[index];
+		int ret;
+
+		ret = scrub_add_page_to_rd_bio(sctx, spage);
+		if (ret) {
+			scrub_block_put(sblock);
+			return ret;
+		}
+	}
+
+	/* last one frees, either here or in bio completion for last page */
+	scrub_block_put(sblock);
+	return 0;
+}
+
+static int scrub_extent_for_parity(struct scrub_parity *sparity,
+				   u64 logical, u64 len,
+				   u64 physical, struct btrfs_device *dev,
+				   u64 flags, u64 gen, int mirror_num)
+{
+	struct scrub_ctx *sctx = sparity->sctx;
+	int ret;
+	u8 csum[BTRFS_CSUM_SIZE];
+	u32 blocksize;
+
+	if (flags & BTRFS_EXTENT_FLAG_DATA) {
+		blocksize = sctx->sectorsize;
+	} else if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+		blocksize = sctx->nodesize;
+	} else {
+		blocksize = sctx->sectorsize;
+		WARN_ON(1);
+	}
+
+	while (len) {
+		u64 l = min_t(u64, len, blocksize);
+		int have_csum = 0;
+
+		if (flags & BTRFS_EXTENT_FLAG_DATA) {
+			/* push csums to sbio */
+			have_csum = scrub_find_csum(sctx, logical, l, csum);
+			if (have_csum == 0)
+				goto skip;
+		}
+		ret = scrub_pages_for_parity(sparity, logical, l, physical, dev,
+					     flags, gen, mirror_num,
+					     have_csum ? csum : NULL);
+skip:
+		if (ret)
+			return ret;
+		len -= l;
+		logical += l;
+		physical += l;
+	}
+	return 0;
+}
+
 /*
  * Given a physical address, this will calculate it's
  * logical offset. if this is a parity stripe, it will return
@@ -2394,7 +2625,8 @@ behind_scrub_pages:
  * return 0 if it is a data stripe, 1 means parity stripe.
  */
 static int get_raid56_logic_offset(u64 physical, int num,
-				   struct map_lookup *map, u64 *offset)
+				   struct map_lookup *map, u64 *offset,
+				   u64 *stripe_start)
 {
 	int i;
 	int j = 0;
@@ -2405,6 +2637,9 @@ static int get_raid56_logic_offset(u64 physical, int num,
 
 	last_offset = (physical - map->stripes[num].physical) *
 		      nr_data_stripes(map);
+	if (stripe_start)
+		*stripe_start = last_offset;
+
 	*offset = last_offset;
 	for (i = 0; i < nr_data_stripes(map); i++) {
 		*offset = last_offset + i * map->stripe_len;
@@ -2427,13 +2662,330 @@ static int get_raid56_logic_offset(u64 physical, int num,
 	return 1;
 }
 
+static void scrub_free_parity(struct scrub_parity *sparity)
+{
+	struct scrub_ctx *sctx = sparity->sctx;
+	struct scrub_page *curr, *next;
+	int nbits;
+
+	nbits = bitmap_weight(sparity->ebitmap, sparity->nsectors);
+	if (nbits) {
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.read_errors += nbits;
+		sctx->stat.uncorrectable_errors += nbits;
+		spin_unlock(&sctx->stat_lock);
+	}
+
+	list_for_each_entry_safe(curr, next, &sparity->spages, list) {
+		list_del_init(&curr->list);
+		scrub_page_put(curr);
+	}
+
+	kfree(sparity);
+}
+
+static void scrub_parity_bio_endio(struct bio *bio, int error)
+{
+	struct scrub_parity *sparity = (struct scrub_parity *)bio->bi_private;
+	struct scrub_ctx *sctx = sparity->sctx;
+
+	if (error)
+		bitmap_or(sparity->ebitmap, sparity->ebitmap, sparity->dbitmap,
+			  sparity->nsectors);
+
+	scrub_free_parity(sparity);
+	scrub_pending_bio_dec(sctx);
+	bio_put(bio);
+}
+
+static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
+{
+	struct scrub_ctx *sctx = sparity->sctx;
+	struct bio *bio;
+	struct btrfs_raid_bio *rbio;
+	struct scrub_page *spage;
+	struct btrfs_bio *bbio = NULL;
+	u64 *raid_map = NULL;
+	u64 length;
+	int ret;
+
+	if (!bitmap_andnot(sparity->dbitmap, sparity->dbitmap, sparity->ebitmap,
+			   sparity->nsectors))
+		goto out;
+
+	length = sparity->logic_end - sparity->logic_start + 1;
+	ret = btrfs_map_sblock(sctx->dev_root->fs_info, REQ_GET_READ_MIRRORS,
+			       sparity->logic_start,
+			       &length, &bbio, 0, &raid_map);
+	if (ret || !bbio || !raid_map)
+		goto bbio_out;
+
+	bio = btrfs_io_bio_alloc(GFP_NOFS, 0);
+	if (!bio)
+		goto bbio_out;
+
+	bio->bi_iter.bi_sector = sparity->logic_start >> 9;
+	bio->bi_private = sparity;
+	bio->bi_end_io = scrub_parity_bio_endio;
+
+	rbio = raid56_parity_alloc_scrub_rbio(sctx->dev_root, bio, bbio,
+					      raid_map, length,
+					      sparity->scrub_dev,
+					      sparity->dbitmap,
+					      sparity->nsectors);
+	if (!rbio)
+		goto rbio_out;
+
+	list_for_each_entry(spage, &sparity->spages, list)
+		raid56_parity_add_scrub_pages(rbio, spage->page,
+					      spage->logical);
+
+	scrub_pending_bio_inc(sctx);
+	raid56_parity_submit_scrub_rbio(rbio);
+	return;
+
+rbio_out:
+	bio_put(bio);
+bbio_out:
+	kfree(bbio);
+	kfree(raid_map);
+	bitmap_or(sparity->ebitmap, sparity->ebitmap, sparity->dbitmap,
+		  sparity->nsectors);
+	spin_lock(&sctx->stat_lock);
+	sctx->stat.malloc_errors++;
+	spin_unlock(&sctx->stat_lock);
+out:
+	scrub_free_parity(sparity);
+}
+
+static inline int scrub_calc_parity_bitmap_len(int nsectors)
+{
+	return DIV_ROUND_UP(nsectors, BITS_PER_LONG) * (BITS_PER_LONG / 8);
+}
+
+static void scrub_parity_get(struct scrub_parity *sparity)
+{
+	atomic_inc(&sparity->ref_count);
+}
+
+static void scrub_parity_put(struct scrub_parity *sparity)
+{
+	if (!atomic_dec_and_test(&sparity->ref_count))
+		return;
+
+	scrub_parity_check_and_repair(sparity);
+}
+
+static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
+						  struct map_lookup *map,
+						  struct btrfs_device *sdev,
+						  struct btrfs_path *path,
+						  u64 logic_start,
+						  u64 logic_end)
+{
+	struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
+	struct btrfs_root *root = fs_info->extent_root;
+	struct btrfs_root *csum_root = fs_info->csum_root;
+	struct btrfs_extent_item *extent;
+	u64 flags;
+	int ret;
+	int slot;
+	struct extent_buffer *l;
+	struct btrfs_key key;
+	u64 generation;
+	u64 extent_logical;
+	u64 extent_physical;
+	u64 extent_len;
+	struct btrfs_device *extent_dev;
+	struct scrub_parity *sparity;
+	int nsectors;
+	int bitmap_len;
+	int extent_mirror_num;
+	int stop_loop = 0;
+
+	nsectors = map->stripe_len / root->sectorsize;
+	bitmap_len = scrub_calc_parity_bitmap_len(nsectors);
+	sparity = kzalloc(sizeof(struct scrub_parity) + 2 * bitmap_len,
+			  GFP_NOFS);
+	if (!sparity) {
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.malloc_errors++;
+		spin_unlock(&sctx->stat_lock);
+		return -ENOMEM;
+	}
+
+	sparity->stripe_len = map->stripe_len;
+	sparity->nsectors = nsectors;
+	sparity->sctx = sctx;
+	sparity->scrub_dev = sdev;
+	sparity->logic_start = logic_start;
+	sparity->logic_end = logic_end;
+	atomic_set(&sparity->ref_count, 1);
+	INIT_LIST_HEAD(&sparity->spages);
+	sparity->dbitmap = sparity->bitmap;
+	sparity->ebitmap = (void *)sparity->bitmap + bitmap_len;
+
+	ret = 0;
+	while (logic_start < logic_end) {
+		if (btrfs_fs_incompat(fs_info, SKINNY_METADATA))
+			key.type = BTRFS_METADATA_ITEM_KEY;
+		else
+			key.type = BTRFS_EXTENT_ITEM_KEY;
+		key.objectid = logic_start;
+		key.offset = (u64)-1;
+
+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+		if (ret < 0)
+			goto out;
+
+		if (ret > 0) {
+			ret = btrfs_previous_extent_item(root, path, 0);
+			if (ret < 0)
+				goto out;
+			if (ret > 0) {
+				btrfs_release_path(path);
+				ret = btrfs_search_slot(NULL, root, &key,
+							path, 0, 0);
+				if (ret < 0)
+					goto out;
+			}
+		}
+
+		stop_loop = 0;
+		while (1) {
+			u64 bytes;
+
+			l = path->nodes[0];
+			slot = path->slots[0];
+			if (slot >= btrfs_header_nritems(l)) {
+				ret = btrfs_next_leaf(root, path);
+				if (ret == 0)
+					continue;
+				if (ret < 0)
+					goto out;
+
+				stop_loop = 1;
+				break;
+			}
+			btrfs_item_key_to_cpu(l, &key, slot);
+
+			if (key.type == BTRFS_METADATA_ITEM_KEY)
+				bytes = root->nodesize;
+			else
+				bytes = key.offset;
+
+			if (key.objectid + bytes <= logic_start)
+				goto next;
+
+			if (key.type != BTRFS_EXTENT_ITEM_KEY &&
+			    key.type != BTRFS_METADATA_ITEM_KEY)
+				goto next;
+
+			if (key.objectid > logic_end) {
+				stop_loop = 1;
+				break;
+			}
+
+			while (key.objectid >= logic_start + map->stripe_len)
+				logic_start += map->stripe_len;
+
+			extent = btrfs_item_ptr(l, slot,
+						struct btrfs_extent_item);
+			flags = btrfs_extent_flags(l, extent);
+			generation = btrfs_extent_generation(l, extent);
+
+			if (key.objectid < logic_start &&
+			    (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)) {
+				btrfs_err(fs_info,
+					  "scrub: tree block %llu spanning stripes, ignored. logical=%llu",
+					   key.objectid, logic_start);
+				goto next;
+			}
+again:
+			extent_logical = key.objectid;
+			extent_len = bytes;
+
+			if (extent_logical < logic_start) {
+				extent_len -= logic_start - extent_logical;
+				extent_logical = logic_start;
+			}
+
+			if (extent_logical + extent_len >
+			    logic_start + map->stripe_len)
+				extent_len = logic_start + map->stripe_len -
+					     extent_logical;
+
+			scrub_parity_mark_sectors_data(sparity, extent_logical,
+						       extent_len);
+
+			scrub_remap_extent(fs_info, extent_logical,
+					   extent_len, &extent_physical,
+					   &extent_dev,
+					   &extent_mirror_num);
+
+			ret = btrfs_lookup_csums_range(csum_root,
+						extent_logical,
+						extent_logical + extent_len - 1,
+						&sctx->csum_list, 1);
+			if (ret)
+				goto out;
+
+			ret = scrub_extent_for_parity(sparity, extent_logical,
+						      extent_len,
+						      extent_physical,
+						      extent_dev, flags,
+						      generation,
+						      extent_mirror_num);
+			if (ret)
+				goto out;
+
+			scrub_free_csums(sctx);
+			if (extent_logical + extent_len <
+			    key.objectid + bytes) {
+				logic_start += map->stripe_len;
+
+				if (logic_start >= logic_end) {
+					stop_loop = 1;
+					break;
+				}
+
+				if (logic_start < key.objectid + bytes) {
+					cond_resched();
+					goto again;
+				}
+			}
+next:
+			path->slots[0]++;
+		}
+
+		btrfs_release_path(path);
+
+		if (stop_loop)
+			break;
+
+		logic_start += map->stripe_len;
+	}
+out:
+	if (ret < 0)
+		scrub_parity_mark_sectors_error(sparity, logic_start,
+						logic_end - logic_start + 1);
+	scrub_parity_put(sparity);
+	scrub_submit(sctx);
+	mutex_lock(&sctx->wr_ctx.wr_lock);
+	scrub_wr_submit(sctx);
+	mutex_unlock(&sctx->wr_ctx.wr_lock);
+
+	btrfs_release_path(path);
+	return ret < 0 ? ret : 0;
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct map_lookup *map,
 					   struct btrfs_device *scrub_dev,
 					   int num, u64 base, u64 length,
 					   int is_dev_replace)
 {
-	struct btrfs_path *path;
+	struct btrfs_path *path, *ppath;
 	struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
 	struct btrfs_root *root = fs_info->extent_root;
 	struct btrfs_root *csum_root = fs_info->csum_root;
@@ -2460,6 +3012,8 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	u64 extent_logical;
 	u64 extent_physical;
 	u64 extent_len;
+	u64 stripe_logical;
+	u64 stripe_end;
 	struct btrfs_device *extent_dev;
 	int extent_mirror_num;
 	int stop_loop = 0;
@@ -2485,7 +3039,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 		mirror_num = num % map->num_stripes + 1;
 	} else if (map->type & (BTRFS_BLOCK_GROUP_RAID5 |
 				BTRFS_BLOCK_GROUP_RAID6)) {
-		get_raid56_logic_offset(physical, num, map, &offset);
+		get_raid56_logic_offset(physical, num, map, &offset, NULL);
 		increment = map->stripe_len * nr_data_stripes(map);
 		mirror_num = 1;
 	} else {
@@ -2497,6 +3051,12 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	if (!path)
 		return -ENOMEM;
 
+	ppath = btrfs_alloc_path();
+	if (!ppath) {
+		btrfs_free_path(ppath);
+		return -ENOMEM;
+	}
+
 	/*
 	 * work on commit root. The related disk blocks are static as
 	 * long as COW is applied. This means, it is save to rewrite
@@ -2515,7 +3075,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	if (map->type & (BTRFS_BLOCK_GROUP_RAID5 |
 			 BTRFS_BLOCK_GROUP_RAID6)) {
 		get_raid56_logic_offset(physical_end, num,
-					map, &logic_end);
+					map, &logic_end, NULL);
 		logic_end += base;
 	} else {
 		logic_end = logical + increment * nstripes;
@@ -2562,10 +3122,18 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 		if (map->type & (BTRFS_BLOCK_GROUP_RAID5 |
 				BTRFS_BLOCK_GROUP_RAID6)) {
 			ret = get_raid56_logic_offset(physical, num,
-					map, &logical);
+					map, &logical, &stripe_logical);
 			logical += base;
-			if (ret)
+			if (ret) {
+				stripe_logical += base;
+				stripe_end = stripe_logical + increment - 1;
+				ret = scrub_raid56_parity(sctx, map, scrub_dev,
+						ppath, stripe_logical,
+						stripe_end);
+				if (ret)
+					goto out;
 				goto skip;
+			}
 		}
 		/*
 		 * canceled?
@@ -2716,13 +3284,25 @@ again:
 					 * loop until we find next data stripe
 					 * or we have finished all stripes.
 					 */
-					do {
-						physical += map->stripe_len;
-						ret = get_raid56_logic_offset(
-								physical, num,
-								map, &logical);
-						logical += base;
-					} while (physical < physical_end && ret);
+loop:
+					physical += map->stripe_len;
+					ret = get_raid56_logic_offset(physical,
+							num, map, &logical,
+							&stripe_logical);
+					logical += base;
+
+					if (ret && physical < physical_end) {
+						stripe_logical += base;
+						stripe_end = stripe_logical +
+								increment - 1;
+						ret = scrub_raid56_parity(sctx,
+							map, scrub_dev, ppath,
+							stripe_logical,
+							stripe_end);
+						if (ret)
+							goto out;
+						goto loop;
+					}
 				} else {
 					physical += map->stripe_len;
 					logical += increment;
@@ -2763,6 +3343,7 @@ out:
 
 	blk_finish_plug(&plug);
 	btrfs_free_path(path);
+	btrfs_free_path(ppath);
 	return ret < 0 ? ret : 0;
 }
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 07/11] Btrfs, replace: write dirty pages into the replace target device
  2014-11-26 13:04   ` [PATCH v3 00/11] " Miao Xie
                       ` (5 preceding siblings ...)
  2014-11-26 13:04     ` [PATCH v3 06/11] Btrfs, raid56: support parity scrub on raid56 Miao Xie
@ 2014-11-26 13:04     ` Miao Xie
  2014-11-26 13:04     ` [PATCH v3 08/11] Btrfs, replace: write raid56 parity " Miao Xie
                       ` (3 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-26 13:04 UTC (permalink / raw)
  To: linux-btrfs

The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in btrfs bio, and we sort those target
  device stripes in the array. And we keep the number of the target
  device stripes in the btrfs bio.
- Except write operation on RAID56, all the other operation don't
  take the target device stripes into account.
- When we do write operation, we read the data from the common devices
  and calculate the parity. Then write the dirty data and new parity
  out, at this time, we will find the relative replace target stripes
  and wirte the relative data into it.

Note: The function that copying old data on the source device to
the target device was implemented in the past, it is similar to
the other RAID type.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/raid56.c  | 104 +++++++++++++++++++++++++++++++++--------------------
 fs/btrfs/volumes.c |  26 ++++++++++++--
 fs/btrfs/volumes.h |  10 ++++--
 3 files changed, 97 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 3b99cbc..6f82c1b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -124,6 +124,8 @@ struct btrfs_raid_bio {
 	/* number of data stripes (no p/q) */
 	int nr_data;
 
+	int real_stripes;
+
 	int stripe_npages;
 	/*
 	 * set if we're doing a parity rebuild
@@ -631,7 +633,7 @@ static struct page *rbio_pstripe_page(struct btrfs_raid_bio *rbio, int index)
  */
 static struct page *rbio_qstripe_page(struct btrfs_raid_bio *rbio, int index)
 {
-	if (rbio->nr_data + 1 == rbio->bbio->num_stripes)
+	if (rbio->nr_data + 1 == rbio->real_stripes)
 		return NULL;
 
 	index += ((rbio->nr_data + 1) * rbio->stripe_len) >>
@@ -974,7 +976,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 {
 	struct btrfs_raid_bio *rbio;
 	int nr_data = 0;
-	int num_pages = rbio_nr_pages(stripe_len, bbio->num_stripes);
+	int real_stripes = bbio->num_stripes - bbio->num_tgtdevs;
+	int num_pages = rbio_nr_pages(stripe_len, real_stripes);
 	int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
 	void *p;
 
@@ -994,6 +997,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	rbio->fs_info = root->fs_info;
 	rbio->stripe_len = stripe_len;
 	rbio->nr_pages = num_pages;
+	rbio->real_stripes = real_stripes;
 	rbio->stripe_npages = stripe_npages;
 	rbio->faila = -1;
 	rbio->failb = -1;
@@ -1010,10 +1014,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	rbio->bio_pages = p + sizeof(struct page *) * num_pages;
 	rbio->dbitmap = p + sizeof(struct page *) * num_pages * 2;
 
-	if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
-		nr_data = bbio->num_stripes - 2;
+	if (raid_map[real_stripes - 1] == RAID6_Q_STRIPE)
+		nr_data = real_stripes - 2;
 	else
-		nr_data = bbio->num_stripes - 1;
+		nr_data = real_stripes - 1;
 
 	rbio->nr_data = nr_data;
 	return rbio;
@@ -1125,7 +1129,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio)
 {
 	if (rbio->faila >= 0 || rbio->failb >= 0) {
-		BUG_ON(rbio->faila == rbio->bbio->num_stripes - 1);
+		BUG_ON(rbio->faila == rbio->real_stripes - 1);
 		__raid56_parity_recover(rbio);
 	} else {
 		finish_rmw(rbio);
@@ -1186,7 +1190,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 {
 	struct btrfs_bio *bbio = rbio->bbio;
-	void *pointers[bbio->num_stripes];
+	void *pointers[rbio->real_stripes];
 	int stripe_len = rbio->stripe_len;
 	int nr_data = rbio->nr_data;
 	int stripe;
@@ -1200,11 +1204,11 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 
 	bio_list_init(&bio_list);
 
-	if (bbio->num_stripes - rbio->nr_data == 1) {
-		p_stripe = bbio->num_stripes - 1;
-	} else if (bbio->num_stripes - rbio->nr_data == 2) {
-		p_stripe = bbio->num_stripes - 2;
-		q_stripe = bbio->num_stripes - 1;
+	if (rbio->real_stripes - rbio->nr_data == 1) {
+		p_stripe = rbio->real_stripes - 1;
+	} else if (rbio->real_stripes - rbio->nr_data == 2) {
+		p_stripe = rbio->real_stripes - 2;
+		q_stripe = rbio->real_stripes - 1;
 	} else {
 		BUG();
 	}
@@ -1261,7 +1265,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 			SetPageUptodate(p);
 			pointers[stripe++] = kmap(p);
 
-			raid6_call.gen_syndrome(bbio->num_stripes, PAGE_SIZE,
+			raid6_call.gen_syndrome(rbio->real_stripes, PAGE_SIZE,
 						pointers);
 		} else {
 			/* raid5 */
@@ -1270,7 +1274,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 		}
 
 
-		for (stripe = 0; stripe < bbio->num_stripes; stripe++)
+		for (stripe = 0; stripe < rbio->real_stripes; stripe++)
 			kunmap(page_in_rbio(rbio, stripe, pagenr, 0));
 	}
 
@@ -1279,7 +1283,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	 * higher layers (the bio_list in our rbio) and our p/q.  Ignore
 	 * everything else.
 	 */
-	for (stripe = 0; stripe < bbio->num_stripes; stripe++) {
+	for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
 		for (pagenr = 0; pagenr < pages_per_stripe; pagenr++) {
 			struct page *page;
 			if (stripe < rbio->nr_data) {
@@ -1297,6 +1301,32 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 		}
 	}
 
+	if (likely(!bbio->num_tgtdevs))
+		goto write_data;
+
+	for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
+		if (!bbio->tgtdev_map[stripe])
+			continue;
+
+		for (pagenr = 0; pagenr < pages_per_stripe; pagenr++) {
+			struct page *page;
+			if (stripe < rbio->nr_data) {
+				page = page_in_rbio(rbio, stripe, pagenr, 1);
+				if (!page)
+					continue;
+			} else {
+			       page = rbio_stripe_page(rbio, stripe, pagenr);
+			}
+
+			ret = rbio_add_io_page(rbio, &bio_list, page,
+					       rbio->bbio->tgtdev_map[stripe],
+					       pagenr, rbio->stripe_len);
+			if (ret)
+				goto cleanup;
+		}
+	}
+
+write_data:
 	atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
 	BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
@@ -1335,7 +1365,8 @@ static int find_bio_stripe(struct btrfs_raid_bio *rbio,
 		stripe = &rbio->bbio->stripes[i];
 		stripe_start = stripe->physical;
 		if (physical >= stripe_start &&
-		    physical < stripe_start + rbio->stripe_len) {
+		    physical < stripe_start + rbio->stripe_len &&
+		    bio->bi_bdev == stripe->dev->bdev) {
 			return i;
 		}
 	}
@@ -1784,7 +1815,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 	int err;
 	int i;
 
-	pointers = kzalloc(rbio->bbio->num_stripes * sizeof(void *),
+	pointers = kzalloc(rbio->real_stripes * sizeof(void *),
 			   GFP_NOFS);
 	if (!pointers) {
 		err = -ENOMEM;
@@ -1814,7 +1845,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 		/* setup our array of pointers with pages
 		 * from each stripe
 		 */
-		for (stripe = 0; stripe < rbio->bbio->num_stripes; stripe++) {
+		for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
 			/*
 			 * if we're rebuilding a read, we have to use
 			 * pages from the bio list
@@ -1829,7 +1860,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 		}
 
 		/* all raid6 handling here */
-		if (rbio->raid_map[rbio->bbio->num_stripes - 1] ==
+		if (rbio->raid_map[rbio->real_stripes - 1] ==
 		    RAID6_Q_STRIPE) {
 
 			/*
@@ -1879,10 +1910,10 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 			}
 
 			if (rbio->raid_map[failb] == RAID5_P_STRIPE) {
-				raid6_datap_recov(rbio->bbio->num_stripes,
+				raid6_datap_recov(rbio->real_stripes,
 						  PAGE_SIZE, faila, pointers);
 			} else {
-				raid6_2data_recov(rbio->bbio->num_stripes,
+				raid6_2data_recov(rbio->real_stripes,
 						  PAGE_SIZE, faila, failb,
 						  pointers);
 			}
@@ -1924,7 +1955,7 @@ pstripe:
 				}
 			}
 		}
-		for (stripe = 0; stripe < rbio->bbio->num_stripes; stripe++) {
+		for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
 			/*
 			 * if we're rebuilding a read, we have to use
 			 * pages from the bio list
@@ -2005,7 +2036,6 @@ static void raid_recover_end_io(struct bio *bio, int err)
 static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 {
 	int bios_to_read = 0;
-	struct btrfs_bio *bbio = rbio->bbio;
 	struct bio_list bio_list;
 	int ret;
 	int nr_pages = DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE);
@@ -2026,7 +2056,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 	 * stripe cache, it is possible that some or all of these
 	 * pages are going to be uptodate.
 	 */
-	for (stripe = 0; stripe < bbio->num_stripes; stripe++) {
+	for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
 		if (rbio->faila == stripe || rbio->failb == stripe) {
 			atomic_inc(&rbio->error);
 			continue;
@@ -2132,7 +2162,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 	 * asking for mirror 3
 	 */
 	if (mirror_num == 3)
-		rbio->failb = bbio->num_stripes - 2;
+		rbio->failb = rbio->real_stripes - 2;
 
 	ret = lock_stripe_add(rbio);
 
@@ -2198,7 +2228,7 @@ raid56_parity_alloc_scrub_rbio(struct btrfs_root *root, struct bio *bio,
 	ASSERT(!bio->bi_iter.bi_size);
 	rbio->operation = BTRFS_RBIO_PARITY_SCRUB;
 
-	for (i = 0; i < bbio->num_stripes; i++) {
+	for (i = 0; i < rbio->real_stripes; i++) {
 		if (bbio->stripes[i].dev == scrub_dev) {
 			rbio->scrubp = i;
 			break;
@@ -2239,7 +2269,7 @@ static int alloc_rbio_essential_pages(struct btrfs_raid_bio *rbio)
 	struct page *page;
 
 	for_each_set_bit(bit, rbio->dbitmap, rbio->stripe_npages) {
-		for (i = 0; i < rbio->bbio->num_stripes; i++) {
+		for (i = 0; i < rbio->real_stripes; i++) {
 			index = i * rbio->stripe_npages + bit;
 			if (rbio->stripe_pages[index])
 				continue;
@@ -2281,8 +2311,7 @@ static void raid_write_parity_end_io(struct bio *bio, int err)
 static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 					 int need_check)
 {
-	struct btrfs_bio *bbio = rbio->bbio;
-	void *pointers[bbio->num_stripes];
+	void *pointers[rbio->real_stripes];
 	int nr_data = rbio->nr_data;
 	int stripe;
 	int pagenr;
@@ -2296,11 +2325,11 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 
 	bio_list_init(&bio_list);
 
-	if (bbio->num_stripes - rbio->nr_data == 1) {
-		p_stripe = bbio->num_stripes - 1;
-	} else if (bbio->num_stripes - rbio->nr_data == 2) {
-		p_stripe = bbio->num_stripes - 2;
-		q_stripe = bbio->num_stripes - 1;
+	if (rbio->real_stripes - rbio->nr_data == 1) {
+		p_stripe = rbio->real_stripes - 1;
+	} else if (rbio->real_stripes - rbio->nr_data == 2) {
+		p_stripe = rbio->real_stripes - 2;
+		q_stripe = rbio->real_stripes - 1;
 	} else {
 		BUG();
 	}
@@ -2351,7 +2380,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 			 */
 			pointers[stripe++] = kmap(q_page);
 
-			raid6_call.gen_syndrome(bbio->num_stripes, PAGE_SIZE,
+			raid6_call.gen_syndrome(rbio->real_stripes, PAGE_SIZE,
 						pointers);
 		} else {
 			/* raid5 */
@@ -2369,7 +2398,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 			bitmap_clear(rbio->dbitmap, pagenr, 1);
 		kunmap(p);
 
-		for (stripe = 0; stripe < bbio->num_stripes; stripe++)
+		for (stripe = 0; stripe < rbio->real_stripes; stripe++)
 			kunmap(page_in_rbio(rbio, stripe, pagenr, 0));
 	}
 
@@ -2519,7 +2548,6 @@ static void raid56_parity_scrub_end_io(struct bio *bio, int err)
 static void raid56_parity_scrub_stripe(struct btrfs_raid_bio *rbio)
 {
 	int bios_to_read = 0;
-	struct btrfs_bio *bbio = rbio->bbio;
 	struct bio_list bio_list;
 	int ret;
 	int pagenr;
@@ -2537,7 +2565,7 @@ static void raid56_parity_scrub_stripe(struct btrfs_raid_bio *rbio)
 	 * build a list of bios to read all the missing parts of this
 	 * stripe
 	 */
-	for (stripe = 0; stripe < bbio->num_stripes; stripe++) {
+	for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
 		for_each_set_bit(pagenr, rbio->dbitmap, rbio->stripe_npages) {
 			struct page *page;
 			/*
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 05cf489..28d42ba 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4882,13 +4882,15 @@ static inline int parity_smaller(u64 a, u64 b)
 static void sort_parity_stripes(struct btrfs_bio *bbio, u64 *raid_map)
 {
 	struct btrfs_bio_stripe s;
+	int real_stripes = bbio->num_stripes - bbio->num_tgtdevs;
 	int i;
 	u64 l;
 	int again = 1;
+	int m;
 
 	while (again) {
 		again = 0;
-		for (i = 0; i < bbio->num_stripes - 1; i++) {
+		for (i = 0; i < real_stripes - 1; i++) {
 			if (parity_smaller(raid_map[i], raid_map[i+1])) {
 				s = bbio->stripes[i];
 				l = raid_map[i];
@@ -4896,6 +4898,14 @@ static void sort_parity_stripes(struct btrfs_bio *bbio, u64 *raid_map)
 				raid_map[i] = raid_map[i+1];
 				bbio->stripes[i+1] = s;
 				raid_map[i+1] = l;
+
+				if (bbio->tgtdev_map) {
+					m = bbio->tgtdev_map[i];
+					bbio->tgtdev_map[i] =
+							bbio->tgtdev_map[i + 1];
+					bbio->tgtdev_map[i + 1] = m;
+				}
+
 				again = 1;
 			}
 		}
@@ -4924,6 +4934,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	int ret = 0;
 	int num_stripes;
 	int max_errors = 0;
+	int tgtdev_indexes = 0;
 	struct btrfs_bio *bbio = NULL;
 	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
 	int dev_replace_is_ongoing = 0;
@@ -5235,14 +5246,19 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			num_alloc_stripes <<= 1;
 		if (rw & REQ_GET_READ_MIRRORS)
 			num_alloc_stripes++;
+		tgtdev_indexes = num_stripes;
 	}
-	bbio = kzalloc(btrfs_bio_size(num_alloc_stripes), GFP_NOFS);
+
+	bbio = kzalloc(btrfs_bio_size(num_alloc_stripes, tgtdev_indexes),
+		       GFP_NOFS);
 	if (!bbio) {
 		kfree(raid_map);
 		ret = -ENOMEM;
 		goto out;
 	}
 	atomic_set(&bbio->error, 0);
+	if (dev_replace_is_ongoing)
+		bbio->tgtdev_map = (int *)(bbio->stripes + num_alloc_stripes);
 
 	if (rw & REQ_DISCARD) {
 		int factor = 0;
@@ -5327,6 +5343,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	if (rw & (REQ_WRITE | REQ_GET_READ_MIRRORS))
 		max_errors = btrfs_chunk_max_errors(map);
 
+	tgtdev_indexes = 0;
 	if (dev_replace_is_ongoing && (rw & (REQ_WRITE | REQ_DISCARD)) &&
 	    dev_replace->tgtdev != NULL) {
 		int index_where_to_add;
@@ -5355,8 +5372,10 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 				new->physical = old->physical;
 				new->length = old->length;
 				new->dev = dev_replace->tgtdev;
+				bbio->tgtdev_map[i] = index_where_to_add;
 				index_where_to_add++;
 				max_errors++;
+				tgtdev_indexes++;
 			}
 		}
 		num_stripes = index_where_to_add;
@@ -5402,7 +5421,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 				tgtdev_stripe->length =
 					bbio->stripes[index_srcdev].length;
 				tgtdev_stripe->dev = dev_replace->tgtdev;
+				bbio->tgtdev_map[index_srcdev] = num_stripes;
 
+				tgtdev_indexes++;
 				num_stripes++;
 			}
 		}
@@ -5412,6 +5433,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	bbio->num_stripes = num_stripes;
 	bbio->max_errors = max_errors;
 	bbio->mirror_num = mirror_num;
+	bbio->num_tgtdevs = tgtdev_indexes;
 
 	/*
 	 * this is the case that REQ_READ && dev_replace_is_ongoing &&
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 01094bb..70be257 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -292,7 +292,7 @@ struct btrfs_bio_stripe {
 struct btrfs_bio;
 typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
 
-#define BTRFS_BIO_ORIG_BIO_SUBMITTED	0x1
+#define BTRFS_BIO_ORIG_BIO_SUBMITTED	(1 << 0)
 
 struct btrfs_bio {
 	atomic_t stripes_pending;
@@ -305,6 +305,8 @@ struct btrfs_bio {
 	int max_errors;
 	int num_stripes;
 	int mirror_num;
+	int num_tgtdevs;
+	int *tgtdev_map;
 	struct btrfs_bio_stripe stripes[];
 };
 
@@ -387,8 +389,10 @@ struct btrfs_balance_control {
 int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
 				   u64 end, u64 *length);
 
-#define btrfs_bio_size(n) (sizeof(struct btrfs_bio) + \
-			    (sizeof(struct btrfs_bio_stripe) * (n)))
+#define btrfs_bio_size(total_stripes, real_stripes)		\
+	(sizeof(struct btrfs_bio) +				\
+	 (sizeof(struct btrfs_bio_stripe) * (total_stripes)) +	\
+	 (sizeof(int) * (real_stripes)))
 
 int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		    u64 logical, u64 *length,
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 08/11] Btrfs, replace: write raid56 parity into the replace target device
  2014-11-26 13:04   ` [PATCH v3 00/11] " Miao Xie
                       ` (6 preceding siblings ...)
  2014-11-26 13:04     ` [PATCH v3 07/11] Btrfs, replace: write dirty pages into the replace target device Miao Xie
@ 2014-11-26 13:04     ` Miao Xie
  2014-11-26 13:04     ` [PATCH v3 09/11] Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56 Miao Xie
                       ` (2 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-26 13:04 UTC (permalink / raw)
  To: linux-btrfs

This function reused the code of parity scrub, and we just write
the right parity or corrected parity into the target device before
the parity scrub end.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/raid56.c | 23 +++++++++++++++++++++++
 fs/btrfs/scrub.c  |  2 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6f82c1b..cfa449f 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2311,7 +2311,9 @@ static void raid_write_parity_end_io(struct bio *bio, int err)
 static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 					 int need_check)
 {
+	struct btrfs_bio *bbio = rbio->bbio;
 	void *pointers[rbio->real_stripes];
+	DECLARE_BITMAP(pbitmap, rbio->stripe_npages);
 	int nr_data = rbio->nr_data;
 	int stripe;
 	int pagenr;
@@ -2321,6 +2323,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 	struct page *q_page = NULL;
 	struct bio_list bio_list;
 	struct bio *bio;
+	int is_replace = 0;
 	int ret;
 
 	bio_list_init(&bio_list);
@@ -2334,6 +2337,11 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 		BUG();
 	}
 
+	if (bbio->num_tgtdevs && bbio->tgtdev_map[rbio->scrubp]) {
+		is_replace = 1;
+		bitmap_copy(pbitmap, rbio->dbitmap, rbio->stripe_npages);
+	}
+
 	/*
 	 * Because the higher layers(scrubber) are unlikely to
 	 * use this area of the disk again soon, so don't cache
@@ -2422,6 +2430,21 @@ writeback:
 			goto cleanup;
 	}
 
+	if (!is_replace)
+		goto submit_write;
+
+	for_each_set_bit(pagenr, pbitmap, rbio->stripe_npages) {
+		struct page *page;
+
+		page = rbio_stripe_page(rbio, rbio->scrubp, pagenr);
+		ret = rbio_add_io_page(rbio, &bio_list, page,
+				       bbio->tgtdev_map[rbio->scrubp],
+				       pagenr, rbio->stripe_len);
+		if (ret)
+			goto cleanup;
+	}
+
+submit_write:
 	nr_data = bio_list_size(&bio_list);
 	if (!nr_data) {
 		/* Every parity is right */
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 7f95afc..0ae837f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2714,7 +2714,7 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
 		goto out;
 
 	length = sparity->logic_end - sparity->logic_start + 1;
-	ret = btrfs_map_sblock(sctx->dev_root->fs_info, REQ_GET_READ_MIRRORS,
+	ret = btrfs_map_sblock(sctx->dev_root->fs_info, WRITE,
 			       sparity->logic_start,
 			       &length, &bbio, 0, &raid_map);
 	if (ret || !bbio || !raid_map)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 09/11] Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
  2014-11-26 13:04   ` [PATCH v3 00/11] " Miao Xie
                       ` (7 preceding siblings ...)
  2014-11-26 13:04     ` [PATCH v3 08/11] Btrfs, replace: write raid56 parity " Miao Xie
@ 2014-11-26 13:04     ` Miao Xie
  2014-11-26 13:04     ` [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list Miao Xie
  2014-11-26 13:04     ` [PATCH v3 11/11] Btrfs, replace: enable dev-replace for raid56 Miao Xie
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-26 13:04 UTC (permalink / raw)
  To: linux-btrfs

The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of device
replace, but at that time, btrfs didn't support device replace
on raid56, so we didn't fix the problem on the raid56 profile.
Currently, we implemented device replace for raid56, so we need
kick that problem out before we enable that function for raid56.

The fix method is very simple, we just increase the bio per-cpu
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v2 -> v3:
- New patch to fix undealt use-after-free problem of the source device
  in the final device replace procedure.

Changelog v1 -> v2:
- None.
---
 fs/btrfs/ctree.h       |  7 ++++++-
 fs/btrfs/dev-replace.c |  4 ++--
 fs/btrfs/raid56.c      | 41 ++++++++++++++++++++++++++++++++---------
 fs/btrfs/raid56.h      |  4 ++--
 fs/btrfs/scrub.c       |  2 +-
 fs/btrfs/volumes.c     |  7 ++-----
 6 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fe69edd..470e317 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4097,7 +4097,12 @@ int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
 /* dev-replace.c */
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
 void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info);
-void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info);
+void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount);
+
+static inline void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info)
+{
+	btrfs_bio_counter_sub(fs_info, 1);
+}
 
 /* reada.c */
 struct reada_control {
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 6f662b3..fa27b4e 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -920,9 +920,9 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
 	percpu_counter_inc(&fs_info->bio_counter);
 }
 
-void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info)
+void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
 {
-	percpu_counter_dec(&fs_info->bio_counter);
+	percpu_counter_sub(&fs_info->bio_counter, amount);
 
 	if (waitqueue_active(&fs_info->replace_wait))
 		wake_up(&fs_info->replace_wait);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index cfa449f..4bdb822 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -155,6 +155,8 @@ struct btrfs_raid_bio {
 	 */
 	int bio_list_bytes;
 
+	int generic_bio_cnt;
+
 	atomic_t refs;
 
 	atomic_t stripes_pending;
@@ -347,6 +349,7 @@ static void merge_rbio(struct btrfs_raid_bio *dest,
 {
 	bio_list_merge(&dest->bio_list, &victim->bio_list);
 	dest->bio_list_bytes += victim->bio_list_bytes;
+	dest->generic_bio_cnt += victim->generic_bio_cnt;
 	bio_list_init(&victim->bio_list);
 }
 
@@ -884,6 +887,10 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, int err, int uptodate)
 {
 	struct bio *cur = bio_list_get(&rbio->bio_list);
 	struct bio *next;
+
+	if (rbio->generic_bio_cnt)
+		btrfs_bio_counter_sub(rbio->fs_info, rbio->generic_bio_cnt);
+
 	free_raid_bio(rbio);
 
 	while (cur) {
@@ -1768,6 +1775,7 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 	struct btrfs_raid_bio *rbio;
 	struct btrfs_plug_cb *plug = NULL;
 	struct blk_plug_cb *cb;
+	int ret;
 
 	rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
 	if (IS_ERR(rbio)) {
@@ -1778,12 +1786,19 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
 	rbio->operation = BTRFS_RBIO_WRITE;
 
+	btrfs_bio_counter_inc_noblocked(root->fs_info);
+	rbio->generic_bio_cnt = 1;
+
 	/*
 	 * don't plug on full rbios, just get them out the door
 	 * as quickly as we can
 	 */
-	if (rbio_is_full(rbio))
-		return full_stripe_write(rbio);
+	if (rbio_is_full(rbio)) {
+		ret = full_stripe_write(rbio);
+		if (ret)
+			btrfs_bio_counter_dec(root->fs_info);
+		return ret;
+	}
 
 	cb = blk_check_plugged(btrfs_raid_unplug, root->fs_info,
 			       sizeof(*plug));
@@ -1794,10 +1809,13 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 			INIT_LIST_HEAD(&plug->rbio_list);
 		}
 		list_add_tail(&rbio->plug_list, &plug->rbio_list);
+		ret = 0;
 	} else {
-		return __raid56_parity_write(rbio);
+		ret = __raid56_parity_write(rbio);
+		if (ret)
+			btrfs_bio_counter_dec(root->fs_info);
 	}
-	return 0;
+	return ret;
 }
 
 /*
@@ -2132,19 +2150,17 @@ cleanup:
  */
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 			  struct btrfs_bio *bbio, u64 *raid_map,
-			  u64 stripe_len, int mirror_num, int hold_bbio)
+			  u64 stripe_len, int mirror_num, int generic_io)
 {
 	struct btrfs_raid_bio *rbio;
 	int ret;
 
 	rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
 	if (IS_ERR(rbio)) {
-		__free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
+		__free_bbio_and_raid_map(bbio, raid_map, generic_io);
 		return PTR_ERR(rbio);
 	}
 
-	if (hold_bbio)
-		set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
 	rbio->operation = BTRFS_RBIO_READ_REBUILD;
 	bio_list_add(&rbio->bio_list, bio);
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
@@ -2152,11 +2168,18 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 	rbio->faila = find_logical_bio_stripe(rbio, bio);
 	if (rbio->faila == -1) {
 		BUG();
-		__free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
+		__free_bbio_and_raid_map(bbio, raid_map, generic_io);
 		kfree(rbio);
 		return -EIO;
 	}
 
+	if (generic_io) {
+		btrfs_bio_counter_inc_noblocked(root->fs_info);
+		rbio->generic_bio_cnt = 1;
+	} else {
+		set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
+	}
+
 	/*
 	 * reconstruct from the q stripe if they are
 	 * asking for mirror 3
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 3d4ddb3..31d4a15 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -43,8 +43,8 @@ struct btrfs_raid_bio;
 struct btrfs_device;
 
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
-				 struct btrfs_bio *bbio, u64 *raid_map,
-				 u64 stripe_len, int mirror_num, int hold_bbio);
+			  struct btrfs_bio *bbio, u64 *raid_map,
+			  u64 stripe_len, int mirror_num, int generic_io);
 int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 			       struct btrfs_bio *bbio, u64 *raid_map,
 			       u64 stripe_len);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 0ae837f..27f2e16cd 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1477,7 +1477,7 @@ static int scrub_submit_raid56_bio_wait(struct btrfs_fs_info *fs_info,
 	ret = raid56_parity_recover(fs_info->fs_root, bio, page->recover->bbio,
 				    page->recover->raid_map,
 				    page->recover->map_length,
-				    page->mirror_num, 1);
+				    page->mirror_num, 0);
 	if (ret)
 		return ret;
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 28d42ba..4bcbb2b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5844,12 +5844,9 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
 		} else {
 			ret = raid56_parity_recover(root, bio, bbio,
 						    raid_map, map_length,
-						    mirror_num, 0);
+						    mirror_num, 1);
 		}
-		/*
-		 * FIXME, replace dosen't support raid56 yet, please fix
-		 * it in the future.
-		 */
+
 		btrfs_bio_counter_dec(root->fs_info);
 		return ret;
 	}
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list
  2014-11-26 13:04   ` [PATCH v3 00/11] " Miao Xie
                       ` (8 preceding siblings ...)
  2014-11-26 13:04     ` [PATCH v3 09/11] Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56 Miao Xie
@ 2014-11-26 13:04     ` Miao Xie
  2014-11-26 15:02       ` Chris Mason
  2014-11-26 13:04     ` [PATCH v3 11/11] Btrfs, replace: enable dev-replace for raid56 Miao Xie
  10 siblings, 1 reply; 32+ messages in thread
From: Miao Xie @ 2014-11-26 13:04 UTC (permalink / raw)
  To: linux-btrfs

The increase/decrease of bio counter is on the I/O path, so we should
use io_schedule() instead of schedule(), or the deadlock might be
triggered by the pending I/O in the plug list. io_schedule() can help
us because it will flush all the pending I/O before the task is going
to sleep.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v2 -> v3:
- New patch to fix possible deadlock caused by the pending bios in the
  plug list when the io submitters were going to sleep.

Changelog v1 -> v2:
- None.
---
 fs/btrfs/dev-replace.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index fa27b4e..894796a 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -928,16 +928,23 @@ void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
 		wake_up(&fs_info->replace_wait);
 }
 
+#define btrfs_wait_event_io(wq, condition)				\
+do {									\
+	if (condition)							\
+		break;							\
+	(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
+			    io_schedule());				\
+} while (0)
+
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info)
 {
-	DEFINE_WAIT(wait);
 again:
 	percpu_counter_inc(&fs_info->bio_counter);
 	if (test_bit(BTRFS_FS_STATE_DEV_REPLACING, &fs_info->fs_state)) {
 		btrfs_bio_counter_dec(fs_info);
-		wait_event(fs_info->replace_wait,
-			   !test_bit(BTRFS_FS_STATE_DEV_REPLACING,
-				     &fs_info->fs_state));
+		btrfs_wait_event_io(fs_info->replace_wait,
+				    !test_bit(BTRFS_FS_STATE_DEV_REPLACING,
+					      &fs_info->fs_state));
 		goto again;
 	}
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 11/11] Btrfs, replace: enable dev-replace for raid56
  2014-11-26 13:04   ` [PATCH v3 00/11] " Miao Xie
                       ` (9 preceding siblings ...)
  2014-11-26 13:04     ` [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list Miao Xie
@ 2014-11-26 13:04     ` Miao Xie
  10 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-11-26 13:04 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zhao Lei

From: Zhao Lei <zhaolei@cn.fujitsu.com>

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/dev-replace.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 894796a..9f6a464 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -316,11 +316,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 	struct btrfs_device *tgt_device = NULL;
 	struct btrfs_device *src_device = NULL;
 
-	if (btrfs_fs_incompat(fs_info, RAID56)) {
-		btrfs_warn(fs_info, "dev_replace cannot yet handle RAID5/RAID6");
-		return -EOPNOTSUPP;
-	}
-
 	switch (args->start.cont_reading_from_srcdev_mode) {
 	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
 	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list
  2014-11-26 13:04     ` [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list Miao Xie
@ 2014-11-26 15:02       ` Chris Mason
  2014-11-27  1:39         ` Miao Xie
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Mason @ 2014-11-26 15:02 UTC (permalink / raw)
  To: Miao Xie; +Cc: linux-btrfs

On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> The increase/decrease of bio counter is on the I/O path, so we should
> use io_schedule() instead of schedule(), or the deadlock might be
> triggered by the pending I/O in the plug list. io_schedule() can help
> us because it will flush all the pending I/O before the task is going
> to sleep.

Can you please describe this deadlock in more detail?  schedule() also 
triggers a flush of the plug list, and if that's no longer sufficient 
we can run into other problems (especially with preemption on).

-chris


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list
  2014-11-26 15:02       ` Chris Mason
@ 2014-11-27  1:39         ` Miao Xie
  2014-11-27  3:00           ` Miao Xie
  0 siblings, 1 reply; 32+ messages in thread
From: Miao Xie @ 2014-11-27  1:39 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Wed, 26 Nov 2014 10:02:23 -0500, Chris Mason wrote:
> On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>> The increase/decrease of bio counter is on the I/O path, so we should
>> use io_schedule() instead of schedule(), or the deadlock might be
>> triggered by the pending I/O in the plug list. io_schedule() can help
>> us because it will flush all the pending I/O before the task is going
>> to sleep.
> 
> Can you please describe this deadlock in more detail?  schedule() also triggers
> a flush of the plug list, and if that's no longer sufficient we can run into other
> problems (especially with preemption on).

Sorry for my miss. I forgot to check the current implementation of schedule(), which flushes the plug list unconditionally. Please ignore this patch.

Thanks
Miao

> 
> -chris
> 
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list
  2014-11-27  1:39         ` Miao Xie
@ 2014-11-27  3:00           ` Miao Xie
  2014-11-28 21:32             ` Chris Mason
  0 siblings, 1 reply; 32+ messages in thread
From: Miao Xie @ 2014-11-27  3:00 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Thu, 27 Nov 2014 09:39:56 +0800, Miao Xie wrote:
> On Wed, 26 Nov 2014 10:02:23 -0500, Chris Mason wrote:
>> On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>> The increase/decrease of bio counter is on the I/O path, so we should
>>> use io_schedule() instead of schedule(), or the deadlock might be
>>> triggered by the pending I/O in the plug list. io_schedule() can help
>>> us because it will flush all the pending I/O before the task is going
>>> to sleep.
>>
>> Can you please describe this deadlock in more detail?  schedule() also triggers
>> a flush of the plug list, and if that's no longer sufficient we can run into other
>> problems (especially with preemption on).
> 
> Sorry for my miss. I forgot to check the current implementation of schedule(), which flushes the plug list unconditionally. Please ignore this patch.

I have updated my raid56-scrub-replace branch, please re-pull the branch.

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Thanks
Miao

> 
> Thanks
> Miao
> 
>>
>> -chris
>>
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list
  2014-11-27  3:00           ` Miao Xie
@ 2014-11-28 21:32             ` Chris Mason
  2014-12-02 13:02               ` Miao Xie
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Mason @ 2014-11-28 21:32 UTC (permalink / raw)
  To: miaox; +Cc: linux-btrfs

On Wed, Nov 26, 2014 at 10:00 PM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> On Thu, 27 Nov 2014 09:39:56 +0800, Miao Xie wrote:
>>  On Wed, 26 Nov 2014 10:02:23 -0500, Chris Mason wrote:
>>>  On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie <miaox@cn.fujitsu.com> 
>>> wrote:
>>>>  The increase/decrease of bio counter is on the I/O path, so we 
>>>> should
>>>>  use io_schedule() instead of schedule(), or the deadlock might be
>>>>  triggered by the pending I/O in the plug list. io_schedule() can 
>>>> help
>>>>  us because it will flush all the pending I/O before the task is 
>>>> going
>>>>  to sleep.
>>> 
>>>  Can you please describe this deadlock in more detail?  schedule() 
>>> also triggers
>>>  a flush of the plug list, and if that's no longer sufficient we 
>>> can run into other
>>>  problems (especially with preemption on).
>> 
>>  Sorry for my miss. I forgot to check the current implementation of 
>> schedule(), which flushes the plug list unconditionally. Please 
>> ignore this patch.
> 
> I have updated my raid56-scrub-replace branch, please re-pull the 
> branch.
> 
>   https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Sorry, I wasn't clear.  I do like the patch because it uses a slightly 
better trigger mechanism for the flush.  I was just worried about a 
larger deadlock.

I ran the raid56 work with stress.sh overnight, then scrubbed the 
resulting filesystem and ran balance when the scrub completed.  All of 
these passed without errors (excellent!).

Then I zero'd 4GB of one drive and ran scrub again.  This was the 
result.  Please make sure CONFIG_DEBUG_PAGEALLOC is enabled and you 
should be able to reproduce.

[192392.495260] BUG: unable to handle kernel paging request at 
ffff880303062f80
[192392.495279] IP: [<ffffffffa05fe77a>] lock_stripe_add+0xba/0x390 
[btrfs]
[192392.495281] PGD 2bdb067 PUD 107e7fd067 PMD 107e7e4067 PTE 
8000000303062060
[192392.495283] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[192392.495307] Modules linked in: ipmi_devintf loop fuse k10temp 
coretemp hwmon btrfs raid6_pq zlib_deflate lzo_compress xor xfs 
exportfs libcrc32c tcp_diag inet_diag nfsv4 ip6table_filter ip6_tables 
xt_NFLOG nfnetlink_log nfnetlink xt_comment xt_statistic iptable_filter 
ip_tables x_tables mptctl netconsole autofs4 nfsv3 nfs lockd grace 
rpcsec_gss_krb5 auth_rpcgss oid_registry sunrpc ipv6 ext3 jbd dm_mod 
rtc_cmos ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support pcspkr 
i2c_i801 lpc_ich mfd_core shpchp ehci_pci ehci_hcd mlx4_en ptp pps_core 
mlx4_core sg ses enclosure button megaraid_sas
[192392.495310] CPU: 0 PID: 11992 Comm: kworker/u65:2 Not tainted 
3.18.0-rc6-mason+ #7
[192392.495310] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, 
BIOS 1.07 05/10/2012
[192392.495323] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
[192392.495324] task: ffff88013dae9110 ti: ffff8802296a0000 task.ti: 
ffff8802296a0000
[192392.495335] RIP: 0010:[<ffffffffa05fe77a>]  [<ffffffffa05fe77a>] 
lock_stripe_add+0xba/0x390 [btrfs]
[192392.495335] RSP: 0018:ffff8802296a3ac8  EFLAGS: 00010006
[192392.495336] RAX: ffff880577e85018 RBX: ffff880497f0b2f8 RCX: 
ffff8801190fb000
[192392.495337] RDX: 000000000000013d RSI: ffff880303062f80 RDI: 
0000040c275a0000
[192392.495338] RBP: ffff8802296a3b48 R08: ffff880497f00000 R09: 
0000000000000001
[192392.495339] R10: 0000000000000000 R11: 0000000000000000 R12: 
0000000000000282
[192392.495339] R13: 000000000000b250 R14: ffff880577e85000 R15: 
ffff880497f0b2a0
[192392.495340] FS:  0000000000000000(0000) GS:ffff88085fc00000(0000) 
knlGS:0000000000000000
[192392.495341] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[192392.495342] CR2: ffff880303062f80 CR3: 0000000005289000 CR4: 
00000000000407f0
[192392.495342] Stack:
[192392.495344]  ffff880755e28000 ffff880497f00000 000000000000013d 
ffff8801190fb000
[192392.495346]  0000000000000000 ffff88013dae9110 ffffffff81090d40 
ffff8802296a3b00
[192392.495347]  ffff8802296a3b00 0000000000000010 ffff8802296a3b68 
ffff8801190fb000
[192392.495348] Call Trace:
[192392.495353]  [<ffffffff81090d40>] ? bit_waitqueue+0xa0/0xa0
[192392.495363]  [<ffffffffa05fea66>] 
raid56_parity_submit_scrub_rbio+0x16/0x30 [btrfs]
[192392.495372]  [<ffffffffa05e2f0e>] 
scrub_parity_check_and_repair+0x15e/0x1e0 [btrfs]
[192392.495380]  [<ffffffffa05e301d>] scrub_block_put+0x8d/0x90 [btrfs]
[192392.495388]  [<ffffffffa05e6ed7>] ? 
scrub_bio_end_io_worker+0xd7/0x870 [btrfs]
[192392.495396]  [<ffffffffa05e6ee9>] 
scrub_bio_end_io_worker+0xe9/0x870 [btrfs]
[192392.495405]  [<ffffffffa05b8c44>] normal_work_helper+0x84/0x330 
[btrfs]
[192392.495414]  [<ffffffffa05b8f42>] btrfs_scrub_helper+0x12/0x20 
[btrfs]
[192392.495417]  [<ffffffff8106c50f>] process_one_work+0x1bf/0x520
[192392.495419]  [<ffffffff8106c48d>] ? process_one_work+0x13d/0x520
[192392.495421]  [<ffffffff8106c98e>] worker_thread+0x11e/0x4b0
[192392.495424]  [<ffffffff81653ac9>] ? __schedule+0x389/0x880
[192392.495426]  [<ffffffff8106c870>] ? process_one_work+0x520/0x520
[192392.495428]  [<ffffffff81071e2e>] kthread+0xde/0x100
[192392.495430]  [<ffffffff81071d50>] ? __init_kthread_worker+0x70/0x70
[192392.495431]  [<ffffffff81659eac>] ret_from_fork+0x7c/0xb0
[192392.495433]  [<ffffffff81071d50>] ? __init_kthread_worker+0x70/0x70
[192392.495449] Code: 45 88 49 89 c4 4f 8d 7c 28 50 4b 8b 44 28 50 48 
8b 55 90 4c 8d 70 e8 4c 39 f8 48 8b 4d 98 74 32 48 8b 71 10 48 8b 3e 48 
8b 70 f8 <48> 39 3e 75 12 eb 6f 0f 1f 80 00 00 00 00 48 8b 76 f8 48 39 
3e
[192392.495458] RIP  [<ffffffffa05fe77a>] lock_stripe_add+0xba/0x390 
[btrfs]
[192392.495458]  RSP <ffff8802296a3ac8>
[192392.495458] CR2: ffff880303062f80
[192392.496389] ---[ end trace c04c23ee0d843df0 ]---




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list
  2014-11-28 21:32             ` Chris Mason
@ 2014-12-02 13:02               ` Miao Xie
  0 siblings, 0 replies; 32+ messages in thread
From: Miao Xie @ 2014-12-02 13:02 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

hi, Chris

On Fri, 28 Nov 2014 16:32:03 -0500, Chris Mason wrote:
> On Wed, Nov 26, 2014 at 10:00 PM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>> On Thu, 27 Nov 2014 09:39:56 +0800, Miao Xie wrote:
>>>  On Wed, 26 Nov 2014 10:02:23 -0500, Chris Mason wrote:
>>>>  On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>>>>  The increase/decrease of bio counter is on the I/O path, so we should
>>>>>  use io_schedule() instead of schedule(), or the deadlock might be
>>>>>  triggered by the pending I/O in the plug list. io_schedule() can help
>>>>>  us because it will flush all the pending I/O before the task is going
>>>>>  to sleep.
>>>>
>>>>  Can you please describe this deadlock in more detail?  schedule() also triggers
>>>>  a flush of the plug list, and if that's no longer sufficient we can run into other
>>>>  problems (especially with preemption on).
>>>
>>>  Sorry for my miss. I forgot to check the current implementation of schedule(), which flushes the plug list unconditionally. Please ignore this patch.
>>
>> I have updated my raid56-scrub-replace branch, please re-pull the branch.
>>
>>   https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace
> 
> Sorry, I wasn't clear.  I do like the patch because it uses a slightly better trigger mechanism for the flush.  I was just worried about a larger deadlock.
> 
> I ran the raid56 work with stress.sh overnight, then scrubbed the resulting filesystem and ran balance when the scrub completed.  All of these passed without errors (excellent!).
> 
> Then I zero'd 4GB of one drive and ran scrub again.  This was the result.  Please make sure CONFIG_DEBUG_PAGEALLOC is enabled and you should be able to reproduce.

I sent out the 4th version of the patchset, please try it.

I have pushed the new patchset to my git tree, you can re-pull it.
  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Thanks
Miao

> 
> [192392.495260] BUG: unable to handle kernel paging request at ffff880303062f80
> [192392.495279] IP: [<ffffffffa05fe77a>] lock_stripe_add+0xba/0x390 [btrfs]
> [192392.495281] PGD 2bdb067 PUD 107e7fd067 PMD 107e7e4067 PTE 8000000303062060
> [192392.495283] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> [192392.495307] Modules linked in: ipmi_devintf loop fuse k10temp coretemp hwmon btrfs raid6_pq zlib_deflate lzo_compress xor xfs exportfs libcrc32c tcp_diag inet_diag nfsv4 ip6table_filter ip6_tables xt_NFLOG nfnetlink_log nfnetlink xt_comment xt_statistic iptable_filter ip_tables x_tables mptctl netconsole autofs4 nfsv3 nfs lockd grace rpcsec_gss_krb5 auth_rpcgss oid_registry sunrpc ipv6 ext3 jbd dm_mod rtc_cmos ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 lpc_ich mfd_core shpchp ehci_pci ehci_hcd mlx4_en ptp pps_core mlx4_core sg ses enclosure button megaraid_sas
> [192392.495310] CPU: 0 PID: 11992 Comm: kworker/u65:2 Not tainted 3.18.0-rc6-mason+ #7
> [192392.495310] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, BIOS 1.07 05/10/2012
> [192392.495323] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
> [192392.495324] task: ffff88013dae9110 ti: ffff8802296a0000 task.ti: ffff8802296a0000
> [192392.495335] RIP: 0010:[<ffffffffa05fe77a>]  [<ffffffffa05fe77a>] lock_stripe_add+0xba/0x390 [btrfs]
> [192392.495335] RSP: 0018:ffff8802296a3ac8  EFLAGS: 00010006
> [192392.495336] RAX: ffff880577e85018 RBX: ffff880497f0b2f8 RCX: ffff8801190fb000
> [192392.495337] RDX: 000000000000013d RSI: ffff880303062f80 RDI: 0000040c275a0000
> [192392.495338] RBP: ffff8802296a3b48 R08: ffff880497f00000 R09: 0000000000000001
> [192392.495339] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000282
> [192392.495339] R13: 000000000000b250 R14: ffff880577e85000 R15: ffff880497f0b2a0
> [192392.495340] FS:  0000000000000000(0000) GS:ffff88085fc00000(0000) knlGS:0000000000000000
> [192392.495341] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [192392.495342] CR2: ffff880303062f80 CR3: 0000000005289000 CR4: 00000000000407f0
> [192392.495342] Stack:
> [192392.495344]  ffff880755e28000 ffff880497f00000 000000000000013d ffff8801190fb000
> [192392.495346]  0000000000000000 ffff88013dae9110 ffffffff81090d40 ffff8802296a3b00
> [192392.495347]  ffff8802296a3b00 0000000000000010 ffff8802296a3b68 ffff8801190fb000
> [192392.495348] Call Trace:
> [192392.495353]  [<ffffffff81090d40>] ? bit_waitqueue+0xa0/0xa0
> [192392.495363]  [<ffffffffa05fea66>] raid56_parity_submit_scrub_rbio+0x16/0x30 [btrfs]
> [192392.495372]  [<ffffffffa05e2f0e>] scrub_parity_check_and_repair+0x15e/0x1e0 [btrfs]
> [192392.495380]  [<ffffffffa05e301d>] scrub_block_put+0x8d/0x90 [btrfs]
> [192392.495388]  [<ffffffffa05e6ed7>] ? scrub_bio_end_io_worker+0xd7/0x870 [btrfs]
> [192392.495396]  [<ffffffffa05e6ee9>] scrub_bio_end_io_worker+0xe9/0x870 [btrfs]
> [192392.495405]  [<ffffffffa05b8c44>] normal_work_helper+0x84/0x330 [btrfs]
> [192392.495414]  [<ffffffffa05b8f42>] btrfs_scrub_helper+0x12/0x20 [btrfs]
> [192392.495417]  [<ffffffff8106c50f>] process_one_work+0x1bf/0x520
> [192392.495419]  [<ffffffff8106c48d>] ? process_one_work+0x13d/0x520
> [192392.495421]  [<ffffffff8106c98e>] worker_thread+0x11e/0x4b0
> [192392.495424]  [<ffffffff81653ac9>] ? __schedule+0x389/0x880
> [192392.495426]  [<ffffffff8106c870>] ? process_one_work+0x520/0x520
> [192392.495428]  [<ffffffff81071e2e>] kthread+0xde/0x100
> [192392.495430]  [<ffffffff81071d50>] ? __init_kthread_worker+0x70/0x70
> [192392.495431]  [<ffffffff81659eac>] ret_from_fork+0x7c/0xb0
> [192392.495433]  [<ffffffff81071d50>] ? __init_kthread_worker+0x70/0x70
> [192392.495449] Code: 45 88 49 89 c4 4f 8d 7c 28 50 4b 8b 44 28 50 48 8b 55 90 4c 8d 70 e8 4c 39 f8 48 8b 4d 98 74 32 48 8b 71 10 48 8b 3e 48 8b 70 f8 <48> 39 3e 75 12 eb 6f 0f 1f 80 00 00 00 00 48 8b 76 f8 48 39 3e
> [192392.495458] RIP  [<ffffffffa05fe77a>] lock_stripe_add+0xba/0x390 [btrfs]
> [192392.495458]  RSP <ffff8802296a3ac8>
> [192392.495458] CR2: ffff880303062f80
> [192392.496389] ---[ end trace c04c23ee0d843df0 ]---
> 
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2014-12-02 13:00 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-11-14 13:50 [PATCH 0/9] Implement device scrub/replace for RAID56 Miao Xie
2014-11-14 13:50 ` [PATCH 1/9] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition Miao Xie
2014-11-14 14:57   ` David Sterba
2014-11-14 13:50 ` [PATCH 2/9] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block Miao Xie
2014-11-14 14:55   ` David Sterba
2014-11-14 13:50 ` [PATCH 3/9] Btrfs, raid56: don't change bbio and raid_map Miao Xie
2014-11-14 13:50 ` [PATCH 4/9] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted Miao Xie
2014-11-24  9:31   ` [PATCH v2 " Miao Xie
2014-11-14 13:50 ` [PATCH 5/9] Btrfs,raid56: use a variant to record the operation type Miao Xie
2014-11-14 13:50 ` [PATCH 6/9] Btrfs,raid56: support parity scrub on raid56 Miao Xie
2014-11-14 13:50 ` [PATCH 7/9] Btrfs, replace: write dirty pages into the replace target device Miao Xie
2014-11-14 13:51 ` [PATCH 8/9] Btrfs, replace: write raid56 parity " Miao Xie
2014-11-14 13:51 ` [PATCH 9/9] Btrfs, replace: enable dev-replace for raid56 Miao Xie
2014-11-14 14:28 ` [PATCH 0/9] Implement device scrub/replace for RAID56 Chris Mason
2014-11-25 15:14 ` Chris Mason
2014-11-26 13:04   ` [PATCH v3 00/11] " Miao Xie
2014-11-26 13:04     ` [PATCH v3 01/11] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition Miao Xie
2014-11-26 13:04     ` [PATCH v3 02/11] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block Miao Xie
2014-11-26 13:04     ` [PATCH v3 03/11] Btrfs, raid56: don't change bbio and raid_map Miao Xie
2014-11-26 13:04     ` [PATCH v3 04/11] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted Miao Xie
2014-11-26 13:04     ` [PATCH v3 05/11] Btrfs, raid56: use a variant to record the operation type Miao Xie
2014-11-26 13:04     ` [PATCH v3 06/11] Btrfs, raid56: support parity scrub on raid56 Miao Xie
2014-11-26 13:04     ` [PATCH v3 07/11] Btrfs, replace: write dirty pages into the replace target device Miao Xie
2014-11-26 13:04     ` [PATCH v3 08/11] Btrfs, replace: write raid56 parity " Miao Xie
2014-11-26 13:04     ` [PATCH v3 09/11] Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56 Miao Xie
2014-11-26 13:04     ` [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list Miao Xie
2014-11-26 15:02       ` Chris Mason
2014-11-27  1:39         ` Miao Xie
2014-11-27  3:00           ` Miao Xie
2014-11-28 21:32             ` Chris Mason
2014-12-02 13:02               ` Miao Xie
2014-11-26 13:04     ` [PATCH v3 11/11] Btrfs, replace: enable dev-replace for raid56 Miao Xie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).