* [PATCH 000 of 3] md: raid5 patches suitable for 2.6.26 and -stable
@ 2008-05-27 6:31 NeilBrown
2008-05-27 6:32 ` [PATCH 001 of 3] md: md: fix prexor vs sync_request race NeilBrown
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: NeilBrown @ 2008-05-27 6:31 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Dan Williams, stable
Following are three patches that fix bugs in raid5 which could
conceivably cause data corruption (1 and 3) or an oops (2).
Bugs 1 and 3 can cause a 'resync' to mistakenly think the parity block
is correct when in fact it isn't (it looks at a parity block that was
generated rather than read from disk). If this leaves a
parity block wrong, and a device then fails, data regenerated from
that parity block will be wrong.
Once a patched kernel is installed, running a repair pass
(echo repair > .../sync_action) will fix any incorrect parity.
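For example, a repair pass on an affected array could be run and checked
roughly like this ("mdN" is a placeholder for the real array name, and
mismatch_cnt is the standard md sysfs counter, mentioned here only as a
way to see what the pass found):
  # start a repair pass; parity that does not match the data is rewritten
  echo repair > /sys/block/mdN/md/sync_action
  # watch progress until the array goes idle again
  cat /proc/mdstat
  # mismatches found during the last check/repair pass
  cat /sys/block/mdN/md/mismatch_cnt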
The oops (patch 2) can only happen if you write to a particular sysfs
file that only root has access to and that (currently) only developers
have any reason to write to.
NeilBrown
[PATCH 001 of 3] md: md: fix prexor vs sync_request race
[PATCH 002 of 3] md: fix uninitialized use of mddev->recovery_wait
[PATCH 003 of 3] md: Do not compute parity unless it is on a failed drive
* [PATCH 001 of 3] md: md: fix prexor vs sync_request race
2008-05-27 6:31 [PATCH 000 of 3] md: raid5 patches suitable for 2.6.26 and -stable NeilBrown
@ 2008-05-27 6:32 ` NeilBrown
2008-05-27 6:32 ` [PATCH 002 of 3] md: fix uninitialized use of mddev->recovery_wait NeilBrown
2008-05-27 6:32 ` [PATCH 003 of 3] md: Do not compute parity unless it is on a failed drive NeilBrown
2 siblings, 0 replies; 4+ messages in thread
From: NeilBrown @ 2008-05-27 6:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Dan Williams, stable
From: Dan Williams <dan.j.williams@intel.com>
During the initial array synchronization process there is a window,
between when a prexor operation is scheduled to a specific stripe and
when it completes, in which a sync_request can be scheduled to the same
stripe. When this happens the prexor completes and the stripe is
unconditionally marked "insync", effectively canceling the sync_request
for the stripe. Prior to 2.6.23 this was not a problem because the
prexor operation was done under sh->lock. The effect in older kernels
was that the prexor would still erroneously mark the stripe "insync",
but sync_request would be held off and re-mark the stripe as "!in_sync".
Change the write completion logic to not mark the stripe "in_sync" if
a prexor was performed. The effect of the change is to sometimes not
set STRIPE_INSYNC. The worst this can do is cause the resync to stall
waiting for STRIPE_INSYNC to be set. If this were happening, then
STRIPE_SYNCING would be set and handle_issuing_new_read_requests would
cause all available blocks to eventually be read, at which point
prexor would never be used on that stripe any more and STRIPE_INSYNC
would eventually be set.
echo repair > /sys/block/mdN/md/sync_action will correct arrays that may
have lost this race.
Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid5.c | 5 +++++
1 file changed, 5 insertions(+)
diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c 2008-05-27 16:24:02.000000000 +1000
+++ ./drivers/md/raid5.c 2008-05-27 16:24:18.000000000 +1000
@@ -2645,6 +2645,7 @@ static void handle_stripe5(struct stripe
struct r5dev *dev;
unsigned long pending = 0;
mdk_rdev_t *blocked_rdev = NULL;
+ int prexor;
memset(&s, 0, sizeof(s));
pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d "
@@ -2774,9 +2775,11 @@ static void handle_stripe5(struct stripe
/* leave prexor set until postxor is done, allows us to distinguish
* a rmw from a rcw during biodrain
*/
+ prexor = 0;
if (test_bit(STRIPE_OP_PREXOR, &sh->ops.complete) &&
test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete)) {
+ prexor = 1;
clear_bit(STRIPE_OP_PREXOR, &sh->ops.complete);
clear_bit(STRIPE_OP_PREXOR, &sh->ops.ack);
clear_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
@@ -2810,6 +2813,8 @@ static void handle_stripe5(struct stripe
if (!test_and_set_bit(
STRIPE_OP_IO, &sh->ops.pending))
sh->ops.count++;
+ if (prexor)
+ continue;
if (!test_bit(R5_Insync, &dev->flags) ||
(i == sh->pd_idx && s.failed == 0))
set_bit(STRIPE_INSYNC, &sh->state);
* [PATCH 002 of 3] md: fix uninitialized use of mddev->recovery_wait
2008-05-27 6:31 [PATCH 000 of 3] md: raid5 patches suitable for 2.6.26 and -stable NeilBrown
2008-05-27 6:32 ` [PATCH 001 of 3] md: md: fix prexor vs sync_request race NeilBrown
@ 2008-05-27 6:32 ` NeilBrown
2008-05-27 6:32 ` [PATCH 003 of 3] md: Do not compute parity unless it is on a failed drive NeilBrown
2 siblings, 0 replies; 4+ messages in thread
From: NeilBrown @ 2008-05-27 6:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Dan Williams, stable
From: Dan Williams <dan.j.williams@intel.com>
If an array was created with --assume-clean we will oops when trying to set
->resync_max.
Fix this by initializing ->recovery_wait in mddev_find.
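A rough sketch of that scenario (device names here are placeholders, and
this assumes ->resync_max is the value exposed as the md/sync_max sysfs
attribute):
  # create an array whose initial resync is skipped
  mdadm --create /dev/md0 --level=5 --raid-devices=3 --assume-clean \
        /dev/sdb1 /dev/sdc1 /dev/sdd1
  # on an unpatched kernel, setting a resync limit now uses the
  # never-initialized ->recovery_wait waitqueue
  echo 1000 > /sys/block/md0/md/sync_max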
Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-05-27 16:24:02.000000000 +1000
+++ ./drivers/md/md.c 2008-05-27 16:24:34.000000000 +1000
@@ -276,6 +276,7 @@ static mddev_t * mddev_find(dev_t unit)
atomic_set(&new->active, 1);
spin_lock_init(&new->write_lock);
init_waitqueue_head(&new->sb_wait);
+ init_waitqueue_head(&new->recovery_wait);
new->reshape_position = MaxSector;
new->resync_max = MaxSector;
new->level = LEVEL_NONE;
@@ -5665,7 +5666,6 @@ void md_do_sync(mddev_t *mddev)
window/2,(unsigned long long) max_sectors/2);
atomic_set(&mddev->recovery_active, 0);
- init_waitqueue_head(&mddev->recovery_wait);
last_check = 0;
if (j>2) {
* [PATCH 003 of 3] md: Do not compute parity unless it is on a failed drive
2008-05-27 6:31 [PATCH 000 of 3] md: raid5 patches suitable for 2.6.26 and -stable NeilBrown
2008-05-27 6:32 ` [PATCH 001 of 3] md: md: fix prexor vs sync_request race NeilBrown
2008-05-27 6:32 ` [PATCH 002 of 3] md: fix uninitialized use of mddev->recovery_wait NeilBrown
@ 2008-05-27 6:32 ` NeilBrown
2 siblings, 0 replies; 4+ messages in thread
From: NeilBrown @ 2008-05-27 6:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Dan Williams, stable
From: Dan Williams <dan.j.williams@intel.com>
If a block is computed (rather than read) then a check/repair operation
may be led to believe that the data on disk is correct, when in fact it
isn't. So only compute blocks for failed devices.
This issue has been around since at least 2.6.12, but has become harder to hit
in recent kernels since most reads bypass the cache.
echo repair > /sys/block/mdN/md/sync_action will set the parity blocks to the
correct state.
Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid5.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c 2008-05-27 16:24:18.000000000 +1000
+++ ./drivers/md/raid5.c 2008-05-27 16:24:41.000000000 +1000
@@ -2002,6 +2002,7 @@ static int __handle_issuing_new_read_req
* have quiesced.
*/
if ((s->uptodate == disks - 1) &&
+ (s->failed && disk_idx == s->failed_num) &&
!test_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
set_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending);
set_bit(R5_Wantcompute, &dev->flags);
@@ -2087,7 +2088,9 @@ static void handle_issuing_new_read_requ
/* we would like to get this block, possibly
* by computing it, but we might not be able to
*/
- if (s->uptodate == disks-1) {
+ if ((s->uptodate == disks - 1) &&
+ (s->failed && (i == r6s->failed_num[0] ||
+ i == r6s->failed_num[1]))) {
pr_debug("Computing stripe %llu block %d\n",
(unsigned long long)sh->sector, i);
compute_block_1(sh, i, 0);