public inbox for linux-kernel@vger.kernel.org
* [PATCH 1/2] md/raid5: skip stripes with bad reads during reshape to avoid stall
@ 2025-01-25  1:26 Doug V Johnson
  2025-01-25  1:26 ` [PATCH 2/2] md/raid5: warn when failing a read due to bad blocks metadata Doug V Johnson
  2025-01-26  6:27 ` [PATCH 1/2] md/raid5: skip stripes with bad reads during reshape to avoid stall Yu Kuai
  0 siblings, 2 replies; 12+ messages in thread
From: Doug V Johnson @ 2025-01-25  1:26 UTC (permalink / raw)
  Cc: Doug Johnson, Doug V Johnson, Song Liu, Yu Kuai,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, open list

While adding an additional drive to a raid6 array, the reshape stalled
at about 13% complete and any I/O operations on the array hung,
creating an effective soft lock. The kernel reported a hung task in the
mdXX_reshape thread and I had to use magic sysrq to recover, as systemd
hung as well.

I first suspected an issue with one of the underlying block devices and
as a precaution I recovered the data in read-only mode to a new array,
but the problem turned out to be in the RAID layer, as I was able to
recreate the issue from a superblock dump in sparse files.

After poking around some, I discovered that I had somehow propagated
the bad block list to several devices in the array such that a few
blocks were unreadable. The bad read was reported correctly in userspace
during recovery, but it wasn't obvious at the time that it came from the
bad block list metadata, and it instead confirmed my bias toward
suspecting hardware issues.

I was able to reproduce the issue with a minimal test case using small
loopback devices. I put a script for this in a github repository:

https://github.com/dougvj/md_badblock_reshape_stall_test

This patch handles bad reads during a reshape by clearing the
STRIPE_EXPANDING and STRIPE_EXPAND_READY bits, effectively skipping the
stripe, and reports the issue in dmesg.

Signed-off-by: Doug V Johnson <dougvj@dougvj.net>
---
 drivers/md/raid5.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5c79429acc64..0ae9ac695d8e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4987,6 +4987,14 @@ static void handle_stripe(struct stripe_head *sh)
 			handle_failed_stripe(conf, sh, &s, disks);
 		if (s.syncing + s.replacing)
 			handle_failed_sync(conf, sh, &s);
+		if (test_bit(STRIPE_EXPANDING, &sh->state)) {
+			pr_warn_ratelimited("md/raid:%s: read error during reshape at %lu",
+					    mdname(conf->mddev),
+					    (unsigned long)sh->sector);
+			/* Abort the current stripe */
+			clear_bit(STRIPE_EXPANDING, &sh->state);
+			clear_bit(STRIPE_EXPAND_READY, &sh->state);
+		}
 	}
 
 	/* Now we check to see if any write operations have recently
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] md/raid5: warn when failing a read due to bad blocks metadata
  2025-01-25  1:26 [PATCH 1/2] md/raid5: skip stripes with bad reads during reshape to avoid stall Doug V Johnson
@ 2025-01-25  1:26 ` Doug V Johnson
  2025-01-26  6:27 ` [PATCH 1/2] md/raid5: skip stripes with bad reads during reshape to avoid stall Yu Kuai
  1 sibling, 0 replies; 12+ messages in thread
From: Doug V Johnson @ 2025-01-25  1:26 UTC (permalink / raw)
  Cc: Doug Johnson, Doug V Johnson, Song Liu, Yu Kuai,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, open list

It's easy to suspect that there might be some underlying hardware
failures or similar issues when userspace receives a Buffer I/O error
from a raid device.

In order to send more sysadmins down the right track, let's report
when a read fails at least in part due to bad blocks in the bad block
list in the device metadata.

There are real-world examples where bad block lists accidentally get
propagated or copied around, so having this warning helps mitigate the
consequences.

Signed-off-by: Doug V Johnson <dougvj@dougvj.net>
---
 drivers/md/raid5.c | 10 +++++++++-
 drivers/md/raid5.h |  2 +-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0ae9ac695d8e..5d80e9bcbd6f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3671,7 +3671,14 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			       sh->dev[i].sector + RAID5_STRIPE_SECTORS(conf)) {
 				struct bio *nextbi =
 					r5_next_bio(conf, bi, sh->dev[i].sector);
-
+				/* If we recorded bad blocks from the metadata
+				 * on any of the devices then report this to
+				 * userspace in case anyone might suspect
+				 * something more fundamental instead
+				 */
+				if (s->bad_blocks)
+					pr_warn_ratelimited("%s: read encountered block in device bad block list.",
+							    mdname(conf->mddev));
 				bio_io_error(bi);
 				bi = nextbi;
 			}
@@ -4682,6 +4689,7 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
 		if (rdev) {
 			is_bad = rdev_has_badblock(rdev, sh->sector,
 						   RAID5_STRIPE_SECTORS(conf));
+			s->bad_blocks++;
 			if (s->blocked_rdev == NULL) {
 				if (is_bad < 0)
 					set_bit(BlockedBadBlocks, &rdev->flags);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index eafc6e9ed6ee..c755c321ae36 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -282,7 +282,7 @@ struct stripe_head_state {
 	 * read all devices, just the replacement targets.
 	 */
 	int syncing, expanding, expanded, replacing;
-	int locked, uptodate, to_read, to_write, failed, written;
+	int locked, uptodate, to_read, to_write, failed, written, bad_blocks;
 	int to_fill, compute, req_compute, non_overwrite;
 	int injournal, just_cached;
 	int failed_num[2];
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] md/raid5: skip stripes with bad reads during reshape to avoid stall
  2025-01-25  1:26 [PATCH 1/2] md/raid5: skip stripes with bad reads during reshape to avoid stall Doug V Johnson
  2025-01-25  1:26 ` [PATCH 2/2] md/raid5: warn when failing a read due to bad blocks metadata Doug V Johnson
@ 2025-01-26  6:27 ` Yu Kuai
  2025-01-27  8:51   ` Doug V Johnson
  2025-01-27  9:00   ` [PATCH v2 1/2] md/raid5: freeze reshape when encountering a bad read Doug V Johnson
  1 sibling, 2 replies; 12+ messages in thread
From: Yu Kuai @ 2025-01-26  6:27 UTC (permalink / raw)
  To: Doug V Johnson
  Cc: Doug Johnson, Song Liu,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, open list,
	yukuai (C)

Hi,

On 2025/01/25 9:26, Doug V Johnson wrote:
> While adding an additional drive to a raid6 array, the reshape stalled
> at about 13% complete and any I/O operations on the array hung,
> creating an effective soft lock. The kernel reported a hung task in the
> mdXX_reshape thread and I had to use magic sysrq to recover, as systemd
> hung as well.
> 
> I first suspected an issue with one of the underlying block devices and
> as a precaution I recovered the data in read-only mode to a new array,
> but the problem turned out to be in the RAID layer, as I was able to
> recreate the issue from a superblock dump in sparse files.
> 
> After poking around some, I discovered that I had somehow propagated
> the bad block list to several devices in the array such that a few
> blocks were unreadable. The bad read was reported correctly in userspace
> during recovery, but it wasn't obvious at the time that it came from the
> bad block list metadata, and it instead confirmed my bias toward
> suspecting hardware issues.
> 
> I was able to reproduce the issue with a minimal test case using small
> loopback devices. I put a script for this in a github repository:
> 
> https://github.com/dougvj/md_badblock_reshape_stall_test
> 
> This patch handles bad reads during a reshape by clearing the
> STRIPE_EXPANDING and STRIPE_EXPAND_READY bits, effectively skipping the
> stripe, and reports the issue in dmesg.
> 
> Signed-off-by: Doug V Johnson <dougvj@dougvj.net>
> ---
>   drivers/md/raid5.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 5c79429acc64..0ae9ac695d8e 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -4987,6 +4987,14 @@ static void handle_stripe(struct stripe_head *sh)
>   			handle_failed_stripe(conf, sh, &s, disks);
>   		if (s.syncing + s.replacing)
>   			handle_failed_sync(conf, sh, &s);
> +		if (test_bit(STRIPE_EXPANDING, &sh->state)) {
> +			pr_warn_ratelimited("md/raid:%s: read error during reshape at %lu",
> +					    mdname(conf->mddev),
> +					    (unsigned long)sh->sector);
> +			/* Abort the current stripe */
> +			clear_bit(STRIPE_EXPANDING, &sh->state);
> +			clear_bit(STRIPE_EXPAND_READY, &sh->state);
> +		}
>   	}

Thanks for the patch. However, consider this example:

before reshape, three disks raid5:
rdev0           rdev1           rdev2
chunk0          chunk1          P0
chunk2(BB)      P1(BB)          chunk3
P2              chunk4          chunk5
chunk6          chunk7          P3
chunk8          P4              chunk9
P5              chunk10         chunk11

after reshape, four disks raid5:
rdev0           rdev1           rdev2           rdev3
chunk0          chunk1          chunk2(lost)    P0
chunk3(BB)      chunk4(BB)      P1              chunk5
chunk6          P2              chunk7          chunk8
P3              chunk9          chunk10         chunk11

In this case, the data in chunk2 is already lost before the reshape;
after the reshape, chunk2 is still lost, and bad blocks need to be set
on rdev2 to prevent the user from reading wrong data. Meanwhile, chunk3
and chunk4 are both lost, because rdev0 and rdev1 have bad blocks there.

So perhaps just aborting the reshape would make more sense to me,
because otherwise the user will lose more data.

Thanks,
Kuai

>   
>   	/* Now we check to see if any write operations have recently
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] md/raid5: skip stripes with bad reads during reshape to avoid stall
  2025-01-26  6:27 ` [PATCH 1/2] md/raid5: skip stripes with bad reads during reshape to avoid stall Yu Kuai
@ 2025-01-27  8:51   ` Doug V Johnson
  2025-01-27  9:00   ` [PATCH v2 1/2] md/raid5: freeze reshape when encountering a bad read Doug V Johnson
  1 sibling, 0 replies; 12+ messages in thread
From: Doug V Johnson @ 2025-01-27  8:51 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Doug Johnson, Song Liu,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, open list,
	yukuai (C)

> So perhaps just aborting the reshape would make more sense to me,
> because otherwise the user will lose more data.

Ah yes, I agree that does make more sense. I appreciate the prompt 
feedback and explanation. I will submit a v2 soon.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/2] md/raid5: freeze reshape when encountering a bad read
  2025-01-26  6:27 ` [PATCH 1/2] md/raid5: skip stripes with bad reads during reshape to avoid stall Yu Kuai
  2025-01-27  8:51   ` Doug V Johnson
@ 2025-01-27  9:00   ` Doug V Johnson
  2025-01-27  9:00     ` [PATCH v2 2/2] md/raid5: warn when failing a read due to bad blocks metadata Doug V Johnson
  2025-02-08  2:49     ` [PATCH v2 1/2] md/raid5: freeze reshape when encountering a bad read Yu Kuai
  1 sibling, 2 replies; 12+ messages in thread
From: Doug V Johnson @ 2025-01-27  9:00 UTC (permalink / raw)
  Cc: Doug Johnson, Doug V Johnson, Song Liu, Yu Kuai,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, open list

While adding an additional drive to a raid6 array, the reshape stalled
at about 13% complete and any I/O operations on the array hung,
creating an effective soft lock. The kernel reported a hung task in the
mdXX_reshape thread and I had to use magic sysrq to recover, as systemd
hung as well.

I first suspected an issue with one of the underlying block devices and
as a precaution I recovered the data in read-only mode to a new array,
but the problem turned out to be in the RAID layer, as I was able to
recreate the issue from a superblock dump in sparse files.

After poking around some, I discovered that I had somehow propagated
the bad block list to several devices in the array such that a few
blocks were unreadable. The bad read was reported correctly in userspace
during recovery, but it wasn't obvious at the time that it came from the
bad block list metadata, and it instead confirmed my bias toward
suspecting hardware issues.

I was able to reproduce the issue with a minimal test case using small
loopback devices. I put a script for this in a github repository:

https://github.com/dougvj/md_badblock_reshape_stall_test

This patch handles bad reads during a reshape by introducing a
handle_failed_reshape() function in a similar manner to
handle_failed_sync(). The function aborts the current stripe by clearing
STRIPE_EXPANDING and STRIPE_EXPAND_READY, sets the MD_RECOVERY_FROZEN
bit, reverts the reshape progress to the safe position, and reports the
situation in dmesg.

Signed-off-by: Doug V Johnson <dougvj@dougvj.net>
---
 drivers/md/raid5.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5c79429acc64..bc0b0c2540f0 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3738,6 +3738,27 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
 	md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf), !abort);
 }
 
+static void
+handle_failed_reshape(struct r5conf *conf, struct stripe_head *sh,
+		      struct stripe_head_state *s)
+{
+	// Abort the current stripe
+	clear_bit(STRIPE_EXPANDING, &sh->state);
+	clear_bit(STRIPE_EXPAND_READY, &sh->state);
+	pr_warn_ratelimited("md/raid:%s: read error during reshape at %lu, cannot progress",
+			    mdname(conf->mddev),
+			    (unsigned long)sh->sector);
+	// Freeze the reshape
+	set_bit(MD_RECOVERY_FROZEN, &conf->mddev->recovery);
+	// Revert progress to safe position
+	spin_lock_irq(&conf->device_lock);
+	conf->reshape_progress = conf->reshape_safe;
+	spin_unlock_irq(&conf->device_lock);
+	// report failed md sync
+	md_done_sync(conf->mddev, 0, 0);
+	wake_up(&conf->wait_for_reshape);
+}
+
 static int want_replace(struct stripe_head *sh, int disk_idx)
 {
 	struct md_rdev *rdev;
@@ -4987,6 +5008,8 @@ static void handle_stripe(struct stripe_head *sh)
 			handle_failed_stripe(conf, sh, &s, disks);
 		if (s.syncing + s.replacing)
 			handle_failed_sync(conf, sh, &s);
+		if (test_bit(STRIPE_EXPANDING, &sh->state))
+			handle_failed_reshape(conf, sh, &s);
 	}
 
 	/* Now we check to see if any write operations have recently
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 2/2] md/raid5: warn when failing a read due to bad blocks metadata
  2025-01-27  9:00   ` [PATCH v2 1/2] md/raid5: freeze reshape when encountering a bad read Doug V Johnson
@ 2025-01-27  9:00     ` Doug V Johnson
  2025-02-08  2:49     ` [PATCH v2 1/2] md/raid5: freeze reshape when encountering a bad read Yu Kuai
  1 sibling, 0 replies; 12+ messages in thread
From: Doug V Johnson @ 2025-01-27  9:00 UTC (permalink / raw)
  Cc: Doug Johnson, Doug V Johnson, Song Liu, Yu Kuai,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, open list

It's easy to suspect that there might be some underlying hardware
failures or similar issues when userspace receives a Buffer I/O error
from a raid device.

In order to send more sysadmins down the right track, let's report
when a read fails at least in part due to bad blocks in the bad block
list in the device metadata.

There are real-world examples where bad block lists accidentally get
propagated or copied around, so having this warning helps mitigate the
consequences.

Signed-off-by: Doug V Johnson <dougvj@dougvj.net>
---
 drivers/md/raid5.c | 10 +++++++++-
 drivers/md/raid5.h |  2 +-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index bc0b0c2540f0..631ec72e7ab9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3671,7 +3671,14 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			       sh->dev[i].sector + RAID5_STRIPE_SECTORS(conf)) {
 				struct bio *nextbi =
 					r5_next_bio(conf, bi, sh->dev[i].sector);
-
+				/* If we recorded bad blocks from the metadata
+				 * on any of the devices then report this to
+				 * userspace in case anyone might suspect
+				 * something more fundamental instead
+				 */
+				if (s->bad_blocks)
+					pr_warn_ratelimited("%s: read encountered block in device bad block list.",
+							    mdname(conf->mddev));
 				bio_io_error(bi);
 				bi = nextbi;
 			}
@@ -4703,6 +4710,7 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
 		if (rdev) {
 			is_bad = rdev_has_badblock(rdev, sh->sector,
 						   RAID5_STRIPE_SECTORS(conf));
+			s->bad_blocks++;
 			if (s->blocked_rdev == NULL) {
 				if (is_bad < 0)
 					set_bit(BlockedBadBlocks, &rdev->flags);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index eafc6e9ed6ee..c755c321ae36 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -282,7 +282,7 @@ struct stripe_head_state {
 	 * read all devices, just the replacement targets.
 	 */
 	int syncing, expanding, expanded, replacing;
-	int locked, uptodate, to_read, to_write, failed, written;
+	int locked, uptodate, to_read, to_write, failed, written, bad_blocks;
 	int to_fill, compute, req_compute, non_overwrite;
 	int injournal, just_cached;
 	int failed_num[2];
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/2] md/raid5: freeze reshape when encountering a bad read
  2025-01-27  9:00   ` [PATCH v2 1/2] md/raid5: freeze reshape when encountering a bad read Doug V Johnson
  2025-01-27  9:00     ` [PATCH v2 2/2] md/raid5: warn when failing a read due to bad blocks metadata Doug V Johnson
@ 2025-02-08  2:49     ` Yu Kuai
  2025-02-18  7:20       ` Doug V Johnson
  1 sibling, 1 reply; 12+ messages in thread
From: Yu Kuai @ 2025-02-08  2:49 UTC (permalink / raw)
  To: Doug V Johnson
  Cc: Doug Johnson, Song Liu,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, open list,
	yukuai (C)

Hi,

On 2025/01/27 17:00, Doug V Johnson wrote:
> While adding an additional drive to a raid6 array, the reshape stalled
> at about 13% complete and any I/O operations on the array hung,
> creating an effective soft lock. The kernel reported a hung task in the
> mdXX_reshape thread and I had to use magic sysrq to recover, as systemd
> hung as well.
> 
> I first suspected an issue with one of the underlying block devices and
> as a precaution I recovered the data in read-only mode to a new array,
> but the problem turned out to be in the RAID layer, as I was able to
> recreate the issue from a superblock dump in sparse files.
> 
> After poking around some, I discovered that I had somehow propagated
> the bad block list to several devices in the array such that a few
> blocks were unreadable. The bad read was reported correctly in userspace
> during recovery, but it wasn't obvious at the time that it came from the
> bad block list metadata, and it instead confirmed my bias toward
> suspecting hardware issues.
> 
> I was able to reproduce the issue with a minimal test case using small
> loopback devices. I put a script for this in a github repository:
> 
> https://github.com/dougvj/md_badblock_reshape_stall_test
> 
> This patch handles bad reads during a reshape by introducing a
> handle_failed_reshape() function in a similar manner to
> handle_failed_sync(). The function aborts the current stripe by clearing
> STRIPE_EXPANDING and STRIPE_EXPAND_READY, sets the MD_RECOVERY_FROZEN
> bit, reverts the reshape progress to the safe position, and reports the
> situation in dmesg.
> 
> Signed-off-by: Doug V Johnson <dougvj@dougvj.net>
> ---
>   drivers/md/raid5.c | 23 +++++++++++++++++++++++
>   1 file changed, 23 insertions(+)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 5c79429acc64..bc0b0c2540f0 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3738,6 +3738,27 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
>   	md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf), !abort);
>   }
>   
> +static void
> +handle_failed_reshape(struct r5conf *conf, struct stripe_head *sh,
> +		      struct stripe_head_state *s)
> +{
> +	// Abort the current stripe
> +	clear_bit(STRIPE_EXPANDING, &sh->state);
> +	clear_bit(STRIPE_EXPAND_READY, &sh->state);
> +	pr_warn_ratelimited("md/raid:%s: read error during reshape at %lu, cannot progress",
> +			    mdname(conf->mddev),
> +			    (unsigned long)sh->sector);
pr_err() is fine here.

> +	// Freeze the reshape
> +	set_bit(MD_RECOVERY_FROZEN, &conf->mddev->recovery);
> +	// Revert progress to safe position
> +	spin_lock_irq(&conf->device_lock);
> +	conf->reshape_progress = conf->reshape_safe;
> +	spin_unlock_irq(&conf->device_lock);
> +	// report failed md sync
> +	md_done_sync(conf->mddev, 0, 0);
> +	wake_up(&conf->wait_for_reshape);

Just wondering: this will leave the array in a state where the reshape
is still in progress and forbid any other sync action (see
md_choose_sync_action()). It would be better to avoid that, though I'm
not quite sure how yet. :(

In theory, if the user provides a new disk, this weird state can be
resolved by doing a recovery concurrent with the reshape. However, this
is too complicated and doesn't seem worth it.

Perhaps checking badblocks before starting the reshape in the first
place can solve most problems in this case? We can keep this patch just
in case new badblocks are set while performing the reshape.

Thanks,
Kuai

> +}
> +
>   static int want_replace(struct stripe_head *sh, int disk_idx)
>   {
>   	struct md_rdev *rdev;
> @@ -4987,6 +5008,8 @@ static void handle_stripe(struct stripe_head *sh)
>   			handle_failed_stripe(conf, sh, &s, disks);
>   		if (s.syncing + s.replacing)
>   			handle_failed_sync(conf, sh, &s);
> +		if (test_bit(STRIPE_EXPANDING, &sh->state))
> +			handle_failed_reshape(conf, sh, &s);
>   	}
>   
>   	/* Now we check to see if any write operations have recently
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/2] md/raid5: freeze reshape when encountering a bad read
  2025-02-08  2:49     ` [PATCH v2 1/2] md/raid5: freeze reshape when encountering a bad read Yu Kuai
@ 2025-02-18  7:20       ` Doug V Johnson
  2025-02-24  9:02         ` [PATCH v3 1/3] " Doug V Johnson
  0 siblings, 1 reply; 12+ messages in thread
From: Doug V Johnson @ 2025-02-18  7:20 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Doug Johnson, Song Liu,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, open list,
	yukuai (C)

> Perhaps checking badblocks before starting the reshape in the first
> place can solve most problems in this case? We can keep this patch just
> in case new badblocks are set while performing the reshape.
> 
> Thanks,
> Kuai
> 
This is a really good idea and should mitigate the risk of getting into
these weird states. I started work on a patch that seems to work; I will
submit it after I have done some testing.

Thanks for your feedback,
Doug

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v3 1/3] md/raid5: freeze reshape when encountering a bad read
  2025-02-18  7:20       ` Doug V Johnson
@ 2025-02-24  9:02         ` Doug V Johnson
  2025-02-24  9:02           ` [PATCH v3 2/3] md/raid5: warn when failing a read due to bad blocks metadata Doug V Johnson
  2025-02-24  9:02           ` [PATCH v3 3/3] md/raid5: check for overlapping bad blocks before starting reshape Doug V Johnson
  0 siblings, 2 replies; 12+ messages in thread
From: Doug V Johnson @ 2025-02-24  9:02 UTC (permalink / raw)
  Cc: Doug Johnson, Doug V Johnson, Song Liu, Yu Kuai,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, open list

While adding an additional drive to a raid6 array, the reshape stalled
at about 13% complete and any I/O operations on the array hung,
creating an effective soft lock. The kernel reported a hung task in the
mdXX_reshape thread and I had to use magic sysrq to recover, as systemd
hung as well.

I first suspected an issue with one of the underlying block devices and
as a precaution I recovered the data in read-only mode to a new array,
but the problem turned out to be in the RAID layer, as I was able to
recreate the issue from a superblock dump in sparse files.

After poking around some, I discovered that I had somehow propagated
the bad block list to several devices in the array such that a few
blocks were unreadable. The bad read was reported correctly in userspace
during recovery, but it wasn't obvious at the time that it came from the
bad block list metadata, and it instead confirmed my bias toward
suspecting hardware issues.

I was able to reproduce the issue with a minimal test case using small
loopback devices. I put a script for this in a github repository:

https://github.com/dougvj/md_badblock_reshape_stall_test

This patch handles bad reads during a reshape by introducing a
handle_failed_reshape() function in a similar manner to
handle_failed_sync(). The function aborts the current stripe by clearing
STRIPE_EXPANDING and STRIPE_EXPAND_READY, sets the MD_RECOVERY_FROZEN
bit, reverts the reshape progress to the safe position, and reports the
situation in dmesg.

Signed-off-by: Doug V Johnson <dougvj@dougvj.net>
---
 drivers/md/raid5.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5c79429acc64..3b5345e66daf 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3738,6 +3738,27 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
 	md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf), !abort);
 }
 
+static void
+handle_failed_reshape(struct r5conf *conf, struct stripe_head *sh,
+		      struct stripe_head_state *s)
+{
+	// Abort the current stripe
+	clear_bit(STRIPE_EXPANDING, &sh->state);
+	clear_bit(STRIPE_EXPAND_READY, &sh->state);
+	pr_err("md/raid:%s: read error during reshape at %lu, cannot progress",
+	       mdname(conf->mddev),
+	       (unsigned long)sh->sector);
+	// Freeze the reshape
+	set_bit(MD_RECOVERY_FROZEN, &conf->mddev->recovery);
+	// Revert progress to safe position
+	spin_lock_irq(&conf->device_lock);
+	conf->reshape_progress = conf->reshape_safe;
+	spin_unlock_irq(&conf->device_lock);
+	// report failed md sync
+	md_done_sync(conf->mddev, 0, 0);
+	wake_up(&conf->wait_for_reshape);
+}
+
 static int want_replace(struct stripe_head *sh, int disk_idx)
 {
 	struct md_rdev *rdev;
@@ -4987,6 +5008,8 @@ static void handle_stripe(struct stripe_head *sh)
 			handle_failed_stripe(conf, sh, &s, disks);
 		if (s.syncing + s.replacing)
 			handle_failed_sync(conf, sh, &s);
+		if (test_bit(STRIPE_EXPANDING, &sh->state))
+			handle_failed_reshape(conf, sh, &s);
 	}
 
 	/* Now we check to see if any write operations have recently
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 2/3] md/raid5: warn when failing a read due to bad blocks metadata
  2025-02-24  9:02         ` [PATCH v3 1/3] " Doug V Johnson
@ 2025-02-24  9:02           ` Doug V Johnson
  2025-02-24  9:02           ` [PATCH v3 3/3] md/raid5: check for overlapping bad blocks before starting reshape Doug V Johnson
  1 sibling, 0 replies; 12+ messages in thread
From: Doug V Johnson @ 2025-02-24  9:02 UTC (permalink / raw)
  Cc: Doug Johnson, Doug V Johnson, Song Liu, Yu Kuai,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, open list

It's easy to suspect that there might be some underlying hardware
failures or similar issues when userspace receives a Buffer I/O error
from a raid device.

In order to send more sysadmins down the right track, let's report
when a read fails at least in part due to bad blocks in the bad block
list in the device metadata.

There are real-world examples where bad block lists accidentally get
propagated or copied around, so having this warning helps mitigate the
consequences.

Signed-off-by: Doug V Johnson <dougvj@dougvj.net>
---
 drivers/md/raid5.c | 16 +++++++++++++++-
 drivers/md/raid5.h |  2 +-
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3b5345e66daf..8b23109d6f37 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3671,7 +3671,15 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			       sh->dev[i].sector + RAID5_STRIPE_SECTORS(conf)) {
 				struct bio *nextbi =
 					r5_next_bio(conf, bi, sh->dev[i].sector);
-
+				/* If we recorded bad blocks from the metadata
+				 * on any of the devices then report this to
+				 * userspace in case anyone might suspect
+				 * something more fundamental instead
+				 */
+				if (s->bad_blocks)
+					pr_warn_ratelimited("%s: read encountered block in device bad block list at %lu",
+							    mdname(conf->mddev),
+							    (unsigned long)sh->sector);
 				bio_io_error(bi);
 				bi = nextbi;
 			}
@@ -4703,6 +4711,12 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
 		if (rdev) {
 			is_bad = rdev_has_badblock(rdev, sh->sector,
 						   RAID5_STRIPE_SECTORS(conf));
+			if (is_bad) {
+				s->bad_blocks++;
+				pr_debug("bad blocks encountered dev %i sector %lu %lu",
+					 i, (unsigned long)sh->sector,
+					 RAID5_STRIPE_SECTORS(conf));
+			}
 			if (s->blocked_rdev == NULL) {
 				if (is_bad < 0)
 					set_bit(BlockedBadBlocks, &rdev->flags);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index eafc6e9ed6ee..c755c321ae36 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -282,7 +282,7 @@ struct stripe_head_state {
 	 * read all devices, just the replacement targets.
 	 */
 	int syncing, expanding, expanded, replacing;
-	int locked, uptodate, to_read, to_write, failed, written;
+	int locked, uptodate, to_read, to_write, failed, written, bad_blocks;
 	int to_fill, compute, req_compute, non_overwrite;
 	int injournal, just_cached;
 	int failed_num[2];
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 3/3] md/raid5: check for overlapping bad blocks before starting reshape
  2025-02-24  9:02         ` [PATCH v3 1/3] " Doug V Johnson
  2025-02-24  9:02           ` [PATCH v3 2/3] md/raid5: warn when failing a read due to bad blocks metadata Doug V Johnson
@ 2025-02-24  9:02           ` Doug V Johnson
  2025-03-05  6:36             ` Yu Kuai
  1 sibling, 1 reply; 12+ messages in thread
From: Doug V Johnson @ 2025-02-24  9:02 UTC (permalink / raw)
  Cc: Doug Johnson, Doug V Johnson, Song Liu, Yu Kuai,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, open list

In addition to halting a reshape in progress when we encounter bad
blocks, we want to make sure that we do not even attempt a reshape if we
know beforehand that there are too many overlapping bad blocks and we
would have to stall the reshape.

To do this, we add a new internal function, array_has_badblock(), which
first checks whether there are enough drives with bad blocks for the
condition to occur and, if there are, proceeds to do a simple O(n^2)
check for overlapping bad blocks. If more overlaps are found than can be
corrected for, we return 1 to indicate the presence of bad blocks,
otherwise 0.

This function is invoked in raid5_start_reshape(); if bad blocks are
present, raid5_start_reshape() returns -EIO, which is reported to
userspace.

It's possible for bad blocks to be discovered or put in the metadata
after a reshape has started, so we want to leave in place the
functionality to detect and halt a reshape.

Signed-off-by: Doug V Johnson <dougvj@dougvj.net>
---
 drivers/md/raid5.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8b23109d6f37..4b907a674dd1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8451,6 +8451,94 @@ static int check_reshape(struct mddev *mddev)
 				     + mddev->delta_disks));
 }
 
+static int array_has_badblock(struct r5conf *conf)
+{
+	/* Searches for overlapping bad blocks on devices that would result
+	 * in an unreadable condition
+	 */
+	int i, j;
+	/* First see if we even have bad blocks on enough drives to have a
+	 * bad read condition
+	 */
+	int num_badblock_devs = 0;
+
+	for (i = 0; i < conf->raid_disks; i++) {
+		if (rdev_has_badblock(conf->disks[i].rdev,
+				      0, conf->disks[i].rdev->sectors))
+			num_badblock_devs++;
+	}
+	if (num_badblock_devs <= conf->max_degraded) {
+		/* There are not enough devices with bad blocks to pose any
+		 * read problem
+		 */
+		return 0;
+	}
+	pr_debug("%s: running overlapping bad block check",
+		 mdname(conf->mddev));
+	/* Do a more sophisticated check for overlapping regions */
+	for (i = 0; i < conf->raid_disks; i++) {
+		sector_t first_bad;
+		int bad_sectors;
+		sector_t next_check_s = 0;
+		int next_check_sectors = conf->disks[i].rdev->sectors;
+
+		pr_debug("%s: badblock check: %i (s: %lu, sec: %i)",
+			 mdname(conf->mddev), i,
+			 (unsigned long)next_check_s, next_check_sectors);
+		while (is_badblock(conf->disks[i].rdev,
+				   next_check_s, next_check_sectors,
+				   &first_bad,
+				   &bad_sectors) != 0) {
+			/* Align bad blocks to the size of our stripe */
+			sector_t aligned_first_bad = first_bad &
+				~((sector_t)RAID5_STRIPE_SECTORS(conf) - 1);
+			int aligned_bad_sectors =
+				max_t(int, RAID5_STRIPE_SECTORS(conf),
+				      bad_sectors);
+			int this_num_bad = 1;
+
+			pr_debug("%s: found blocks %i %lu -> %i",
+				 mdname(conf->mddev), i,
+				 (unsigned long)aligned_first_bad,
+				 aligned_bad_sectors);
+			for (j = 0; j < conf->raid_disks; j++) {
+				sector_t this_first_bad;
+				int this_bad_sectors;
+
+				if (j == i)
+					continue;
+				if (is_badblock(conf->disks[j].rdev,
+						aligned_first_bad,
+						aligned_bad_sectors,
+						&this_first_bad,
+						&this_bad_sectors)) {
+					this_num_bad++;
+					pr_debug("md/raid:%s: bad block overlap dev %i: %lu %i",
+						 mdname(conf->mddev), j,
+						 (unsigned long)this_first_bad,
+						 this_bad_sectors);
+				}
+			}
+			if (this_num_bad > conf->max_degraded) {
+				pr_debug("md/raid:%s: %i drives with unreadable sector(s) around %lu %i due to bad block list",
+					 mdname(conf->mddev),
+					 this_num_bad,
+					 (unsigned long)first_bad,
+					 bad_sectors);
+				return 1;
+			}
+			next_check_s = first_bad + bad_sectors;
+			next_check_sectors =
+				next_check_sectors - (first_bad + bad_sectors);
+			pr_debug("%s: badblock check: %i (s: %lu, sec: %i)",
+				 mdname(conf->mddev), i,
+				 (unsigned long)next_check_s,
+				 next_check_sectors);
+		}
+	}
+	return 0;
+}
+
 static int raid5_start_reshape(struct mddev *mddev)
 {
 	struct r5conf *conf = mddev->private;
@@ -8498,6 +8586,12 @@ static int raid5_start_reshape(struct mddev *mddev)
 		return -EINVAL;
 	}
 
+	if (array_has_badblock(conf)) {
+		pr_warn("md/raid:%s: reshape not possible due to bad block list",
+			mdname(mddev));
+		return -EIO;
+	}
+
 	atomic_set(&conf->reshape_stripes, 0);
 	spin_lock_irq(&conf->device_lock);
 	write_seqcount_begin(&conf->gen_lock);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 3/3] md/raid5: check for overlapping bad blocks before starting reshape
  2025-02-24  9:02           ` [PATCH v3 3/3] md/raid5: check for overlapping bad blocks before starting reshape Doug V Johnson
@ 2025-03-05  6:36             ` Yu Kuai
  0 siblings, 0 replies; 12+ messages in thread
From: Yu Kuai @ 2025-03-05  6:36 UTC (permalink / raw)
  To: Doug V Johnson
  Cc: Doug Johnson, Song Liu,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, open list,
	yukuai (C)

Hi,

On 2025/02/24 17:02, Doug V Johnson wrote:
> In addition to halting a reshape in progress when we encounter bad
> blocks, we want to make sure that we do not even attempt a reshape if we
> know beforehand that there are too many overlapping bad blocks and we
> would have to stall the reshape.
> 
> To do this, we add a new internal function, array_has_badblock(), which
> first checks whether there are enough drives with bad blocks for the
> condition to occur and, if there are, proceeds to do a simple O(n^2)
> check for overlapping bad blocks. If more overlaps are found than can be
> corrected for, we return 1 to indicate the presence of bad blocks,
> otherwise 0.
> 
> This function is invoked in raid5_start_reshape(); if bad blocks are
> present, raid5_start_reshape() returns -EIO, which is reported to
> userspace.
> 
> It's possible for bad blocks to be discovered or put in the metadata
> after a reshape has started, so we want to leave in place the
> functionality to detect and halt a reshape.
> 
> Signed-off-by: Doug V Johnson <dougvj@dougvj.net>
> ---
>   drivers/md/raid5.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 94 insertions(+)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 8b23109d6f37..4b907a674dd1 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -8451,6 +8451,94 @@ static int check_reshape(struct mddev *mddev)
>   				     + mddev->delta_disks));
>   }
>   
> +static int array_has_badblock(struct r5conf *conf)
> +{
> +	/* Searches for overlapping bad blocks on devices that would result
> +	 * in an unreadable condition
> +	 */
> +	int i, j;
> +	/* First see if we even have bad blocks on enough drives to have a
> +	 * bad read condition
> +	 */
> +	int num_badblock_devs = 0;
> +
> +	for (i = 0; i < conf->raid_disks; i++) {
> +		if (rdev_has_badblock(conf->disks[i].rdev,
> +				      0, conf->disks[i].rdev->sectors))
		if (rdev->badblocks.count)

> +			num_badblock_devs++;
> +	}
> +	if (num_badblock_devs <= conf->max_degraded) {
> +		/* There are not enough devices with bad blocks to pose any
> +		 * read problem
> +		 */
> +		return 0;
> +	}
> +	pr_debug("%s: running overlapping bad block check",
> +		 mdname(conf->mddev));
> +	/* Do a more sophisticated check for overlapping regions */
> +	for (i = 0; i < conf->raid_disks; i++) {
> +		sector_t first_bad;
> +		int bad_sectors;
> +		sector_t next_check_s = 0;
> +		int next_check_sectors = conf->disks[i].rdev->sectors;
> +
> +		pr_debug("%s: badblock check: %i (s: %lu, sec: %i)",
> +			 mdname(conf->mddev), i,
> +			 (unsigned long)next_check_s, next_check_sectors);
> +		while (is_badblock(conf->disks[i].rdev,
> +				   next_check_s, next_check_sectors,
> +				   &first_bad,
> +				   &bad_sectors) != 0) {
> +			/* Align bad blocks to the size of our stripe */
> +			sector_t aligned_first_bad = first_bad &
> +				~((sector_t)RAID5_STRIPE_SECTORS(conf) - 1);
> +			int aligned_bad_sectors =
> +				max_t(int, RAID5_STRIPE_SECTORS(conf),
> +				      bad_sectors);
> +			int this_num_bad = 1;
For example, suppose first_bad is 0 and bad_sectors is 512 on rdev0.

> +
> +			pr_debug("%s: found blocks %i %lu -> %i",
> +				 mdname(conf->mddev), i,
> +				 (unsigned long)aligned_first_bad,
> +				 aligned_bad_sectors);
> +			for (j = 0; j < conf->raid_disks; j++) {
> +				sector_t this_first_bad;
> +				int this_bad_sectors;
> +
> +				if (j == i)
> +					continue;
> +				if (is_badblock(conf->disks[j].rdev,
> +						aligned_first_bad,
> +						aligned_bad_sectors,
> +						&this_first_bad,
> +						&this_bad_sectors)) {
And rdev1 has bad blocks 0+256, while rdev2 has bad blocks 256+256.

If this array is a raid6 with max_degraded=2, then it's actually fine
(at most two members have bad blocks over any given range), but the
check above would count three devices with overlapping bad blocks and
reject it.

Perhaps a pseudocode loop like the following?

  sector_t offset = 0;
  while (offset < dev_sectors) {
          len = dev_sectors - offset;
          bad_disks = 0;
          for (i = 0; i < conf->raid_disks; ++i) {
                  if (is_badblock(rdev, offset, len, &first_bad, &bad_sectors)) {
                          if (first_bad <= offset) {
                                  len = min(len, first_bad + bad_sectors - offset);
                                  bad_disks++;
                          } else {
                                  len = min(len, first_bad - offset);
                          }
                  }
          }

          if (bad_disks > max_degraded)
                  return false;

          offset += len;
  }

  return true;
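
Untested, but for illustration, a self-contained userspace C sketch of
the loop above could look like this. range_has_badblock() and the
structures below are simplified stand-ins, not the kernel's
is_badblock() or md structures:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

struct bad_range { sector_t start; sector_t len; };

struct fake_rdev {
	const struct bad_range *bb;	/* bad block table of this member */
	int nr_bb;
};

static sector_t min_sectors(sector_t a, sector_t b)
{
	return a < b ? a : b;
}

/* Simplified stand-in for is_badblock(): report the earliest bad range
 * intersecting [offset, offset + len). */
static bool range_has_badblock(const struct fake_rdev *rdev, sector_t offset,
			       sector_t len, sector_t *first_bad,
			       sector_t *bad_sectors)
{
	bool found = false;
	int i;

	for (i = 0; i < rdev->nr_bb; i++) {
		const struct bad_range *r = &rdev->bb[i];

		if (r->start < offset + len && r->start + r->len > offset &&
		    (!found || r->start < *first_bad)) {
			*first_bad = r->start;
			*bad_sectors = r->len;
			found = true;
		}
	}
	return found;
}

/* Walk the members in variable-length windows.  A member only counts as
 * bad if a bad range covers the window start, and the window is clipped
 * so the set of bad members cannot change inside it. */
static bool reshape_source_readable(const struct fake_rdev *rdevs,
				    int raid_disks, int max_degraded,
				    sector_t dev_sectors)
{
	sector_t offset = 0;

	while (offset < dev_sectors) {
		sector_t len = dev_sectors - offset;
		int bad_disks = 0;
		int i;

		for (i = 0; i < raid_disks; i++) {
			sector_t first_bad, bad_sectors;

			if (!range_has_badblock(&rdevs[i], offset, len,
						&first_bad, &bad_sectors))
				continue;
			if (first_bad <= offset) {
				/* window start is bad on this member */
				len = min_sectors(len, first_bad + bad_sectors - offset);
				bad_disks++;
			} else {
				/* stop the window where the bad range begins */
				len = min_sectors(len, first_bad - offset);
			}
		}
		if (bad_disks > max_degraded)
			return false;
		offset += len;
	}
	return true;
}

With the layout from the example above (rdev0 bad at 0+512, rdev1 at
0+256, rdev2 at 256+256) and max_degraded=2, this returns true, since
no window ever has more than two bad members.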

Thanks,
Kuai

> +					this_num_bad++;
> +					pr_debug("md/raid:%s: bad block overlap dev %i: %lu %i",
> +						 mdname(conf->mddev), j,
> +						 (unsigned long)this_first_bad,
> +						 this_bad_sectors);
> +				}
> +			}
> +			if (this_num_bad > conf->max_degraded) {
> +				pr_debug("md/raid:%s: %i drives with unreadable sector(s) around %lu %i due to bad block list",
> +					 mdname(conf->mddev),
> +					 this_num_bad,
> +					 (unsigned long)first_bad,
> +					 bad_sectors);
> +				return 1;
> +			}
> +			next_check_s = first_bad + bad_sectors;
> +			next_check_sectors =
> +				next_check_sectors - (first_bad + bad_sectors);
> +			pr_debug("%s: badblock check: %i (s: %lu, sec: %i)",
> +				 mdname(conf->mddev), i,
> +				 (unsigned long)next_check_s,
> +				 next_check_sectors);
> +		}
> +	}
> +	return 0;
> +}
> +
>   static int raid5_start_reshape(struct mddev *mddev)
>   {
>   	struct r5conf *conf = mddev->private;
> @@ -8498,6 +8586,12 @@ static int raid5_start_reshape(struct mddev *mddev)
>   		return -EINVAL;
>   	}
>   
> +	if (array_has_badblock(conf)) {
> +		pr_warn("md/raid:%s: reshape not possible due to bad block list",
> +			mdname(mddev));
> +		return -EIO;
> +	}
> +
>   	atomic_set(&conf->reshape_stripes, 0);
>   	spin_lock_irq(&conf->device_lock);
>   	write_seqcount_begin(&conf->gen_lock);
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread

Thread overview: 12+ messages
2025-01-25  1:26 [PATCH 1/2] md/raid5: skip stripes with bad reads during reshape to avoid stall Doug V Johnson
2025-01-25  1:26 ` [PATCH 2/2] md/raid5: warn when failing a read due to bad blocks metadata Doug V Johnson
2025-01-26  6:27 ` [PATCH 1/2] md/raid5: skip stripes with bad reads during reshape to avoid stall Yu Kuai
2025-01-27  8:51   ` Doug V Johnson
2025-01-27  9:00   ` [PATCH v2 1/2] md/raid5: freeze reshape when encountering a bad read Doug V Johnson
2025-01-27  9:00     ` [PATCH v2 2/2] md/raid5: warn when failing a read due to bad blocks metadata Doug V Johnson
2025-02-08  2:49     ` [PATCH v2 1/2] md/raid5: freeze reshape when encountering a bad read Yu Kuai
2025-02-18  7:20       ` Doug V Johnson
2025-02-24  9:02         ` [PATCH v3 1/3] " Doug V Johnson
2025-02-24  9:02           ` [PATCH v3 2/3] md/raid5: warn when failing a read due to bad blocks metadata Doug V Johnson
2025-02-24  9:02           ` [PATCH v3 3/3] md/raid5: check for overlapping bad blocks before starting reshape Doug V Johnson
2025-03-05  6:36             ` Yu Kuai
