* [PATCH] [md] raid5: check faulty flag for array status during recovery.
From: Eric Mei @ 2015-01-06 22:24 UTC
To: neilb; +Cc: linux-raid, eric.mei
Hi Neil,
In an MDRAID-derived work we found and fixed a data corruption bug. We think it also affects vanilla MDRAID, but we didn't directly prove that by constructing a test that shows the corruption. The following is the theoretical analysis; please kindly review it and see if I missed something.
To rebuild a stripe, MD checks whether the array will be optimal after the rebuild completes; if so, the corresponding write-intent bitmap (WIB) bits are marked to be cleared, which is what enables "incremental rebuild". The code section looks like this:
	/* Need to check if array will still be degraded after recovery/resync
	 * We don't need to check the 'failed' flag as when that gets set,
	 * recovery aborts.
	 */
	for (i = 0; i < conf->raid_disks; i++)
		if (conf->disks[i].rdev == NULL)
			still_degraded = 1;
The problem is that checking only rdev == NULL might not be enough. Suppose two drives, D0 and D1, both fail and are marked Faulty. We immediately remove D0 from the array, but because of some lingering IO on D1, it remains in the array with its Faulty flag set. A new drive is pulled in and a rebuild against D0 starts. Now, because no rdev is NULL, MD thinks the array will be optimal. If some writes happen before the rebuild reaches that region, their dirty bits in the WIB will be cleared. When D1 is later added back into the array, we will skip rebuilding those stripes, hence data corruption.
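To make the sequence concrete, here is the same check annotated with what happens in this scenario (an illustrative sketch rather than a complete excerpt of sync_request(); surrounding declarations are elided):

	/*
	 * D0 and D1 are both Faulty; D0 has been removed (rdev == NULL),
	 * D1 is still in the array with Faulty set, and a spare is
	 * rebuilding D0's slot.
	 */
	int still_degraded = 0;

	for (i = 0; i < conf->raid_disks; i++)
		if (conf->disks[i].rdev == NULL)	/* D1's slot is non-NULL, so it is missed */
			still_degraded = 1;

	/*
	 * still_degraded stays 0, so bitmap_start_sync() is told the array
	 * will be clean after this pass, and the write-intent bits for the
	 * region are allowed to be cleared even though D1 never saw those
	 * writes.  A later bitmap-based re-add of D1 then skips exactly
	 * those stripes.
	 */
	bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, still_degraded);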
The attached patch (against 3.18.0-rc6) is supposed to fix this issue.
Thanks
Eric
[-- Attachment #2: 0001-md-raid5-check-faulty-flag-for-array-status-during-r.patch --]
[-- Type: application/octet-stream, Size: 1550 bytes --]
From 6e569429d9deac31df3c7023a59c93ac862b10e5 Mon Sep 17 00:00:00 2001
From: Eric Mei <eric.mei@seagate.com>
Date: Tue, 6 Jan 2015 09:35:02 -0800
Subject: [PATCH] [md] raid5: check faulty flag for array status during recovery.
When we have more than one drive failure, it's possible that we start
rebuilding one drive while leaving another faulty drive in the array.
To determine whether the array will be optimal after rebuilding, the
current code only checks whether a drive is missing, which could
potentially lead to data corruption. This patch adds a check of the
Faulty flag as well.
---
drivers/md/raid5.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9c66e59..a01a25e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5054,11 +5054,12 @@ static inline sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int
 		schedule_timeout_uninterruptible(1);
 	}
 	/* Need to check if array will still be degraded after recovery/resync
-	 * We don't need to check the 'failed' flag as when that gets set,
-	 * recovery aborts.
+	 * Note in case of > 1 drive failures it's possible we're rebuilding
+	 * one drive while leaving another faulty drive in array.
 	 */
 	for (i = 0; i < conf->raid_disks; i++)
-		if (conf->disks[i].rdev == NULL)
+		if (conf->disks[i].rdev == NULL ||
+		    test_bit(Faulty, &conf->disks[i].rdev->flags))
 			still_degraded = 1;
 
 	bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, still_degraded);
--
1.7.1
* Re: [PATCH] [md] raid5: check faulty flag for array status during recovery.
From: NeilBrown @ 2015-02-19 21:51 UTC
To: Eric Mei; +Cc: linux-raid, eric.mei
On Tue, 6 Jan 2015 15:24:24 -0700 Eric Mei <meijia@gmail.com> wrote:
> Hi Neil,
>
> In an MDRAID-derived work we found and fixed a data corruption bug. We think it also affects vanilla MDRAID, but we didn't directly prove that by constructing a test that shows the corruption. The following is the theoretical analysis; please kindly review it and see if I missed something.
>
> To rebuild a stripe, MD checks whether the array will be optimal after the rebuild completes; if so, the corresponding write-intent bitmap (WIB) bits are marked to be cleared, which is what enables "incremental rebuild". The code section looks like this:
>
> 	/* Need to check if array will still be degraded after recovery/resync
> 	 * We don't need to check the 'failed' flag as when that gets set,
> 	 * recovery aborts.
> 	 */
> 	for (i = 0; i < conf->raid_disks; i++)
> 		if (conf->disks[i].rdev == NULL)
> 			still_degraded = 1;
>
> The problem is that checking only rdev == NULL might not be enough. Suppose two drives, D0 and D1, both fail and are marked Faulty. We immediately remove D0 from the array, but because of some lingering IO on D1, it remains in the array with its Faulty flag set. A new drive is pulled in and a rebuild against D0 starts. Now, because no rdev is NULL, MD thinks the array will be optimal. If some writes happen before the rebuild reaches that region, their dirty bits in the WIB will be cleared. When D1 is later added back into the array, we will skip rebuilding those stripes, hence data corruption.
>
> The attached patch (against 3.18.0-rc6) is supposed to fix this issue.
>
> Thanks
> Eric
>
Hi Eric,
sorry for the delay, and thanks for the reminder...
The issue you described could only affect RAID6 as it requires the array to
continue with two failed drives.
However in the RAID6 case I think you are correct - there is a chance of
corruption if there is a double failure and a delay in removing one device.
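For reference, whether the array can keep going at all is roughly what the
has_failed() helper in drivers/md/raid5.c decides; a simplified sketch,
ignoring the reshape handling in the real code:

	/* The array has failed once more devices are lost than the parity
	 * scheme can absorb: max_degraded is 1 for RAID4/5 and 2 for RAID6,
	 * which is why a RAID5 recovery aborts before this scenario arises. */
	static int has_failed(struct r5conf *conf)
	{
		return conf->mddev->degraded > conf->max_degraded;
	}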
Your patch isn't quite safe, though: conf->disks[i].rdev can become NULL at
any moment, so it could become NULL between the test and the dereference.
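In other words (an illustrative fragment, not a verbatim excerpt of either
version of the patch):

	/* The proposed check reads the slot twice; md can clear it in between. */
	if (conf->disks[i].rdev == NULL ||                  /* read #1 sees a non-NULL rdev...  */
	    test_bit(Faulty, &conf->disks[i].rdev->flags))  /* ...read #2 may then find NULL    */
		still_degraded = 1;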
So I've modified it as follows.
Thanks,
NeilBrown
Author: Eric Mei <eric.mei@seagate.com>
Date: Tue Jan 6 09:35:02 2015 -0800
raid5: check faulty flag for array status during recovery.
When we have more than one drive failure, it's possible that we start
rebuilding one drive while leaving another faulty drive in the array.
To determine whether the array will be optimal after rebuilding, the
current code only checks whether a drive is missing, which could
potentially lead to data corruption. This patch adds a check of the
Faulty flag as well.
Signed-off-by: NeilBrown <neilb@suse.de>
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index bc6d7595ad76..022a0d99e110 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5120,12 +5120,17 @@ static inline sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int
 		schedule_timeout_uninterruptible(1);
 	}
 	/* Need to check if array will still be degraded after recovery/resync
-	 * We don't need to check the 'failed' flag as when that gets set,
-	 * recovery aborts.
+	 * Note in case of > 1 drive failures it's possible we're rebuilding
+	 * one drive while leaving another faulty drive in array.
 	 */
-	for (i = 0; i < conf->raid_disks; i++)
-		if (conf->disks[i].rdev == NULL)
+	rcu_read_lock();
+	for (i = 0; i < conf->raid_disks; i++) {
+		struct md_rdev *rdev = ACCESS_ONCE(conf->disks[i].rdev);
+
+		if (rdev == NULL || test_bit(Faulty, &rdev->flags))
 			still_degraded = 1;
+	}
+	rcu_read_unlock();
 
 	bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, still_degraded);
* Re: [PATCH] [md] raid5: check faulty flag for array status during recovery.
From: Eric Mei @ 2015-02-19 22:49 UTC
To: NeilBrown; +Cc: linux-raid, eric.mei
Hi Neil,

You are absolutely right, we need the RCU lock for this. Thank you so much!
Eric
On 2015-02-19 2:51 PM, NeilBrown wrote:
> On Tue, 6 Jan 2015 15:24:24 -0700 Eric Mei <meijia@gmail.com> wrote:
>
>> Hi Neil,
>>
>> In an MDRAID-derived work we found and fixed a data corruption bug. We think it also affects vanilla MDRAID, but we didn't directly prove that by constructing a test that shows the corruption. The following is the theoretical analysis; please kindly review it and see if I missed something.
>>
>> To rebuild a stripe, MD checks whether the array will be optimal after the rebuild completes; if so, the corresponding write-intent bitmap (WIB) bits are marked to be cleared, which is what enables "incremental rebuild". The code section looks like this:
>>
>> 	/* Need to check if array will still be degraded after recovery/resync
>> 	 * We don't need to check the 'failed' flag as when that gets set,
>> 	 * recovery aborts.
>> 	 */
>> 	for (i = 0; i < conf->raid_disks; i++)
>> 		if (conf->disks[i].rdev == NULL)
>> 			still_degraded = 1;
>>
>> The problem is that checking only rdev == NULL might not be enough. Suppose two drives, D0 and D1, both fail and are marked Faulty. We immediately remove D0 from the array, but because of some lingering IO on D1, it remains in the array with its Faulty flag set. A new drive is pulled in and a rebuild against D0 starts. Now, because no rdev is NULL, MD thinks the array will be optimal. If some writes happen before the rebuild reaches that region, their dirty bits in the WIB will be cleared. When D1 is later added back into the array, we will skip rebuilding those stripes, hence data corruption.
>>
>> The attached patch (against 3.18.0-rc6) is supposed to fix this issue.
>>
>> Thanks
>> Eric
>>
> Hi Eric,
> sorry for the delay, and thanks for the reminder...
>
> The issue you described could only affect RAID6 as it requires the array to
> continue with two failed drives.
>
> However in the RAID6 case I think you are correct - there is a chance of
> corruption if there is a double failure and a delay in removing one device.
>
> Your patch isn't quite safe, though: conf->disks[i].rdev can become NULL at
> any moment, so it could become NULL between the test and the dereference.
> So I've modified it as follows.
>
> Thanks,
> NeilBrown
>
>
>
> Author: Eric Mei <eric.mei@seagate.com>
> Date: Tue Jan 6 09:35:02 2015 -0800
>
> raid5: check faulty flag for array status during recovery.
>
> When we have more than one drive failure, it's possible that we start
> rebuilding one drive while leaving another faulty drive in the array.
> To determine whether the array will be optimal after rebuilding, the
> current code only checks whether a drive is missing, which could
> potentially lead to data corruption. This patch adds a check of the
> Faulty flag as well.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index bc6d7595ad76..022a0d99e110 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5120,12 +5120,17 @@ static inline sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int
>  		schedule_timeout_uninterruptible(1);
>  	}
>  	/* Need to check if array will still be degraded after recovery/resync
> -	 * We don't need to check the 'failed' flag as when that gets set,
> -	 * recovery aborts.
> +	 * Note in case of > 1 drive failures it's possible we're rebuilding
> +	 * one drive while leaving another faulty drive in array.
>  	 */
> -	for (i = 0; i < conf->raid_disks; i++)
> -		if (conf->disks[i].rdev == NULL)
> +	rcu_read_lock();
> +	for (i = 0; i < conf->raid_disks; i++) {
> +		struct md_rdev *rdev = ACCESS_ONCE(conf->disks[i].rdev);
> +
> +		if (rdev == NULL || test_bit(Faulty, &rdev->flags))
>  			still_degraded = 1;
> +	}
> +	rcu_read_unlock();
>
>  	bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, still_degraded);
>