* [PATCH] Re: Write and verify correct data to read-failed sectors before degrading array?
2004-09-17 2:13 ` Sebastian Sobolewski
@ 2004-09-22 0:06 ` Sebastian Sobolewski
0 siblings, 0 replies; 9+ messages in thread
From: Sebastian Sobolewski @ 2004-09-22 0:06 UTC (permalink / raw)
To: linux-raid
[-- Attachment #1: Type: text/plain, Size: 12655 bytes --]
Here is the patch for URE recovery from a good mirror that I've been
using. I would not use it with /dev/raw, as a read-write conflict during URE
recovery can result in the data in one copy potentially becoming corrupted. I've
used this with ext2/ext3 and xfs. This is against the kernel.org 2.4.26
kernel. Please note that I do not verify that the recovery write wrote
correct data; I assume that if the write was successful, the drive
remapped the sector and did not fail on us silently.
-Sebastian
--- linux-2.4.26/include/linux/raid/raid1.h 2001-08-12
13:39:02.000000000 -0600
+++ ../2420/linux/include/linux/raid/raid1.h 2004-09-21
09:18:03.000000000 -0600
@@ -18,6 +18,7 @@
int spare;
int used_slot;
+ atomic_t rr_count;
};
struct raid1_private_data {
@@ -59,6 +60,9 @@
md_wait_queue_head_t wait_done;
md_wait_queue_head_t wait_ready;
md_spinlock_t segment_lock;
+ /* Use Read Recovery */
+ int use_read_recovery;
+ atomic_t rr_total;
};
typedef struct raid1_private_data raid1_conf_t;
@@ -86,6 +90,7 @@
struct buffer_head *mirror_bh_list;
struct buffer_head bh_req;
struct raid1_bh *next_r1; /* next for retry or in free
list */
+ kdev_t failed_dev;
};
/* bits for raid1_bh.state */
#define R1BH_Uptodate 1
--- linux-2.4.26/drivers/md/raid1.c 2004-04-14 07:05:30.000000000 -0600
+++ ../2420/linux/drivers/md/raid1.c 2004-09-21 09:21:59.000000000 -0600
@@ -32,10 +32,19 @@
#define MD_DRIVER
#define MD_PERSONALITY
-#define MAX_WORK_PER_DISK 128
-
#define NR_RESERVED_BUFS 32
+unsigned MAX_WORK_PER_DISK = 128;
+MODULE_PARM(RAID1_MAX_WORK_PER_DISK, "i");
+MODULE_PARM_DESC(RAID1_MAX_WORK_PER_DISK, "The Maximum number of
sectors given to any disk before we switch disks in read balance code");
+
+/*
+ * Enable Read Recovery code. For more information see
end_request_recovery()
+ */
+unsigned RAID1_READ_RECOVERY = 1;
+MODULE_PARM(RAID1_READ_RECOVERY, "i");
+MODULE_PARM_DESC(RAID1_READ_RECOVERY, "Use raid1 read recovery code");
+
/*
* The following can be used to debug the driver
@@ -165,6 +174,7 @@
r1_bh->next_r1 = NULL;
r1_bh->state = (1 << R1BH_PreAlloc);
r1_bh->bh_req.b_state = 0;
+ r1_bh->failed_dev = 0;
}
md_spin_unlock_irq(&conf->device_lock);
if (r1_bh)
@@ -262,6 +272,7 @@
r1_bh = conf->freebuf;
conf->freebuf = r1_bh->next_r1;
r1_bh->next_r1= NULL;
+ r1_bh->failed_dev = 0;
md_spin_unlock_irq(&conf->device_lock);
return r1_bh;
@@ -321,6 +332,33 @@
}
}
+static int raid1_map_notsame(mddev_t *mddev, kdev_t *rdev)
+{
+ raid1_conf_t *conf = mddev_to_conf(mddev);
+ //kdev_t new_dev = *rdev;
+ int i, disks = MD_SB_DISKS;
+
+ /*
+ * Later we do read balancing on the read side
+ * now we use the first available disk.
+ */
+
+ for (i = 0; i < disks; i++) {
+ if (conf->mirrors[i].operational) {
+ /*
+ * Pick a different device then the original
+ */
+ if( conf->mirrors[i].dev != *rdev ){
+ *rdev = conf->mirrors[i].dev;
+ return (0);
+ }
+ }
+ }
+
+ return (-1);
+}
+
+#if 0
static int raid1_map (mddev_t *mddev, kdev_t *rdev)
{
raid1_conf_t *conf = mddev_to_conf(mddev);
@@ -341,6 +379,7 @@
printk (KERN_ERR "raid1_map(): huh, no more operational devices?\n");
return (-1);
}
+#endif
static void raid1_reschedule_retry (struct raid1_bh *r1_bh)
{
@@ -403,6 +442,103 @@
bh->b_end_io(bh, uptodate);
raid1_free_r1bh(r1_bh);
}
+
+/*
+ * incement the read recovery counter for the mddevice as well as on
the individual disk.
+ * this information is output in raid1_status. This allows us to
print out the per disk error counts
+ * so we can decide when the disk is likely to be going completely bad
as opposed to having partial media/sector
+ * errors.
+ */
+void raid1_mark_recovered( raid1_conf_t *conf , kdev_t rdev )
+{
+ int i, disks = MD_SB_DISKS;
+ atomic_inc( &conf->rr_total );
+ for (i = 0; i < disks; i++) {
+ if( conf->mirrors[i].dev == rdev ){
+ atomic_inc( &conf->mirrors[i].rr_count );
+ return ;
+ }
+ }
+}
+
+/*
+ * This is the completion callback for the correcting write operation.
If the write fails
+ * the disk is definetly bad. Otherwise the write forced the disk
drive to remap the bad sector
+ * to one of it's spares.
+ *
+ * If we want to be paranoid, we can issue a read of the just written
sector and compare it to the mirror
+ * copy before we acknowledge the read. However impirical data has
shown that if the write succeeds, the read will be correct.
+ *
+ */
+void raid1_end_request_recover_complete( struct buffer_head *bh, int
uptodate )
+{
+ struct raid1_bh * r1_bh = (struct raid1_bh *)(bh->b_private);
+ raid1_conf_t *conf = mddev_to_conf(r1_bh->mddev);
+
+ if( !uptodate )
+ {
+ printk(KERN_ERR "raid1: %s: recover failed lba=%lu\n",
partition_name(bh->b_dev), bh->b_blocknr);
+ md_error (r1_bh->mddev, r1_bh->failed_dev );
+ r1_bh->failed_dev = 0;
+ }
+ else
+ {
+ raid1_mark_recovered( conf, r1_bh->failed_dev );
+ printk(KERN_INFO "raid1: %s: recover success lba=%lu\n",
partition_name(bh->b_dev), bh->b_blocknr);
+ }
+ /*
+ * We got here because the write recovery attempt failed us,
however since we made it this far
+ * it means that the read WAS SUCESSFULL originally.
+ */
+ r1_bh->cmd = READ;
+ raid1_end_bh_io(r1_bh, 1 );
+}
+
+/*
+ * This is the io completion callback for the read from the redundant
mirror. If the read is sucessfull we will issue
+ * a write to the mirror that previsouly failed the read of this
sector. This should cause the drive to remap the bad
+ * sector to a spare.
+ */
+void raid1_end_request_recover( struct buffer_head *bh, int uptodate )
+{
+ struct raid1_bh * r1_bh = (struct raid1_bh *)(bh->b_private);
+ if( !uptodate )
+ {
+ printk(KERN_ERR "raid1: %s: recovering lba=%lu failed.. double
fault\n", partition_name(bh->b_dev), bh->b_blocknr);
+ /*
+ * We delayed the failure of this device earlier. Now we had 2
failures in a row from different devices
+ * thus we must fail the previos device to ensure we don;t do
this forever.
+ */
+ md_error (r1_bh->mddev, r1_bh->failed_dev);
+ r1_bh->failed_dev = bh->b_dev;
+
+ /*
+ * Now retry one more time. We may have more valid devices/
If not raid1d READ/READA handler
+ * will tell us so.
+ */
+ printk(KERN_INFO "raid1: %s: rescheduling lba=%lu again (have
more devices?)\n",
+ partition_name(bh->b_dev), bh->b_blocknr);
+ raid1_reschedule_retry(r1_bh);
+ return;
+ }
+ else
+ {
+ /*
+ * FIXME: this whole thing only recovers 1 raid mirror. To do
this 100% correctly we need to keep a list of prefail devices
+ * since a 3 way mirror will drop 1 device before
rebuilding
+ */
+ printk(KERN_INFO "raid1: %s: recovering block lba=%lu read ok..
do write\n", partition_name(bh->b_dev), bh->b_blocknr);
+ //
+ // We need to map in the recovery device
+ //
+ r1_bh->cmd = WRITE;
+ bh->b_end_io = raid1_end_request_recover_complete;
+ bh->b_dev = r1_bh->failed_dev;
+ bh->b_rdev = r1_bh->failed_dev;
+ raid1_reschedule_retry(r1_bh);
+ }
+}
+
void raid1_end_request (struct buffer_head *bh, int uptodate)
{
struct raid1_bh * r1_bh = (struct raid1_bh *)(bh->b_private);
@@ -410,8 +546,27 @@
/*
* this branch is our 'one mirror IO has finished' event handler:
*/
- if (!uptodate)
+ if (!uptodate){
+ raid1_conf_t *conf = mddev_to_conf(r1_bh->mddev);
+
+ if ( ( conf->use_read_recovery == 1 ) && ( (r1_bh->cmd == READ)
|| (r1_bh->cmd == READA) ) )
+ {
+ /*
+ * Remap the end io for this device since we are going to
try recovery
+ * DO NOT change the mddev.. we need the original failing MDDEV
+ */
+ printk(KERN_INFO "raid1: %s: read recovery will be
attempted on lba=%lu from another mirror\n",
+ partition_name(bh->b_dev), bh->b_blocknr);
+
+ bh->b_end_io = raid1_end_request_recover;
+ r1_bh->failed_dev = bh->b_dev;
+ }
+ else
+ {
md_error (r1_bh->mddev, bh->b_dev);
+ }
+
+ }
else
/*
* Set R1BH_Uptodate in our master buffer_head, so that
@@ -441,7 +596,7 @@
/*
* oops, read error:
*/
- printk(KERN_ERR "raid1: %s: rescheduling block %lu\n",
+ printk(KERN_ERR "raid1: %s: rescheduling lba=%lu\n",
partition_name(bh->b_dev), bh->b_blocknr);
raid1_reschedule_retry(r1_bh);
return;
@@ -480,10 +635,11 @@
unsigned long current_distance;
/*
- * Check if it is sane at all to balance
+ * Check if it is sane at all to balance.
+ * make sure the last used drive is operational (it may have been
removed).
*/
- if (conf->resync_mirrors)
+ if (conf->resync_mirrors && conf->mirrors[new_disk].operational)
goto rb_out;
@@ -737,10 +893,17 @@
seq_printf(seq, " [%d/%d] [", conf->raid_disks,
conf->working_disks);
+
for (i = 0; i < conf->raid_disks; i++)
seq_printf(seq, "%s",
conf->mirrors[i].operational ? "U" : "_");
seq_printf(seq, "]");
+
+ seq_printf(seq, " ( ");
+ for (i = 0; i < conf->raid_disks; i++){
+ seq_printf(seq,"%d ",atomic_read( &conf->mirrors[i].rr_count ) );
+ }
+ seq_printf(seq, ")");
}
#define LAST_DISK KERN_ALERT \
@@ -783,6 +946,7 @@
static int raid1_error (mddev_t *mddev, kdev_t dev)
{
+ mdk_rdev_t *rrdev = NULL;
raid1_conf_t *conf = mddev_to_conf(mddev);
struct mirror_info * mirrors = conf->mirrors;
int disks = MD_SB_DISKS;
@@ -808,6 +972,16 @@
return 1;
}
+
+ rrdev = find_rdev( mddev, dev );
+ if( rrdev )
+ {
+ rrdev->faulty = 1;
+ }
+ else
+ {
+ printk("raid1: rrdev == NULL in raid1_error\n");
+ }
mark_disk_bad(mddev, i);
return 0;
}
@@ -963,6 +1137,7 @@
tmp = conf->mirrors + i;
if (!tmp->used_slot) {
added_disk = i;
+ atomic_set(&tmp->rr_count,0);
break;
}
}
@@ -1129,7 +1304,7 @@
conf->nr_disks++;
break;
-
+
default:
MD_BUG();
err = 1;
@@ -1266,12 +1441,26 @@
}
break;
+
+ case WRITE:
+ /*
+ * We do not map the dev. It SHOULD be already mapped for us
+ */
+ printk ("raid1: %s: read-error recovery lba=%lu (writing
recovered lba)\n",partition_name(bh->b_dev),bh->b_blocknr);
+ generic_make_request (r1_bh->cmd, bh);
+ break;
+
case READ:
case READA:
dev = bh->b_dev;
- raid1_map (mddev, &bh->b_dev);
+ raid1_map_notsame(mddev, &bh->b_dev);
if (bh->b_dev == dev) {
printk (IO_ERROR, partition_name(bh->b_dev),
bh->b_blocknr);
+ /* if( r1_bh->failed_dev )
+ {
+ md_error (r1_bh->mddev, r1_bh->failed_dev);
+ r1_bh->failed_dev = 0;
+ }*/
raid1_end_bh_io(r1_bh, 0);
} else {
printk (REDIRECT_SECTOR,
@@ -1596,6 +1785,8 @@
disk_idx = descriptor->raid_disk;
disk = conf->mirrors + disk_idx;
+ atomic_set(&disk->rr_count, 0 );
+
if (disk_faulty(descriptor)) {
disk->number = descriptor->number;
disk->raid_disk = disk_idx;
@@ -1761,6 +1952,8 @@
}
}
sb->active_disks = conf->working_disks;
+ /* Set the read recovery flag to the default value */
+ conf->use_read_recovery = RAID1_READ_RECOVERY;
if (start_recovery)
md_recover_arrays();
@@ -1859,6 +2052,9 @@
static int md__init raid1_init (void)
{
+ if( RAID1_READ_RECOVERY ){
+ printk("raid1: Read Recovery Enabled\n");
+ }
return register_md_personality (RAID1, &raid1_personality);
}
[-- Attachment #2: raid1rr.patch --]
[-- Type: text/plain, Size: 10679 bytes --]
--- linux-2.4.26/include/linux/raid/raid1.h 2001-08-12 13:39:02.000000000 -0600
+++ ../2420/linux/include/linux/raid/raid1.h 2004-09-21 09:18:03.000000000 -0600
@@ -18,6 +18,7 @@
int spare;
int used_slot;
+ atomic_t rr_count;
};
struct raid1_private_data {
@@ -59,6 +60,9 @@
md_wait_queue_head_t wait_done;
md_wait_queue_head_t wait_ready;
md_spinlock_t segment_lock;
+ /* Use Read Recovery */
+ int use_read_recovery;
+ atomic_t rr_total;
};
typedef struct raid1_private_data raid1_conf_t;
@@ -86,6 +90,7 @@
struct buffer_head *mirror_bh_list;
struct buffer_head bh_req;
struct raid1_bh *next_r1; /* next for retry or in free list */
+ kdev_t failed_dev;
};
/* bits for raid1_bh.state */
#define R1BH_Uptodate 1
--- linux-2.4.26/drivers/md/raid1.c 2004-04-14 07:05:30.000000000 -0600
+++ ../2420/linux/drivers/md/raid1.c 2004-09-21 09:21:59.000000000 -0600
@@ -32,10 +32,19 @@
#define MD_DRIVER
#define MD_PERSONALITY
-#define MAX_WORK_PER_DISK 128
-
#define NR_RESERVED_BUFS 32
+unsigned MAX_WORK_PER_DISK = 128;
+MODULE_PARM(RAID1_MAX_WORK_PER_DISK, "i");
+MODULE_PARM_DESC(RAID1_MAX_WORK_PER_DISK, "The Maximum number of sectors given to any disk before we switch disks in read balance code");
+
+/*
+ * Enable Read Recovery code. For more information see end_request_recovery()
+ */
+unsigned RAID1_READ_RECOVERY = 1;
+MODULE_PARM(RAID1_READ_RECOVERY, "i");
+MODULE_PARM_DESC(RAID1_READ_RECOVERY, "Use raid1 read recovery code");
+
/*
* The following can be used to debug the driver
@@ -165,6 +174,7 @@
r1_bh->next_r1 = NULL;
r1_bh->state = (1 << R1BH_PreAlloc);
r1_bh->bh_req.b_state = 0;
+ r1_bh->failed_dev = 0;
}
md_spin_unlock_irq(&conf->device_lock);
if (r1_bh)
@@ -262,6 +272,7 @@
r1_bh = conf->freebuf;
conf->freebuf = r1_bh->next_r1;
r1_bh->next_r1= NULL;
+ r1_bh->failed_dev = 0;
md_spin_unlock_irq(&conf->device_lock);
return r1_bh;
@@ -321,6 +332,33 @@
}
}
+static int raid1_map_notsame(mddev_t *mddev, kdev_t *rdev)
+{
+ raid1_conf_t *conf = mddev_to_conf(mddev);
+ //kdev_t new_dev = *rdev;
+ int i, disks = MD_SB_DISKS;
+
+ /*
+ * Later we do read balancing on the read side
+ * now we use the first available disk.
+ */
+
+ for (i = 0; i < disks; i++) {
+ if (conf->mirrors[i].operational) {
+ /*
+ * Pick a different device then the original
+ */
+ if( conf->mirrors[i].dev != *rdev ){
+ *rdev = conf->mirrors[i].dev;
+ return (0);
+ }
+ }
+ }
+
+ return (-1);
+}
+
+#if 0
static int raid1_map (mddev_t *mddev, kdev_t *rdev)
{
raid1_conf_t *conf = mddev_to_conf(mddev);
@@ -341,6 +379,7 @@
printk (KERN_ERR "raid1_map(): huh, no more operational devices?\n");
return (-1);
}
+#endif
static void raid1_reschedule_retry (struct raid1_bh *r1_bh)
{
@@ -403,6 +442,103 @@
bh->b_end_io(bh, uptodate);
raid1_free_r1bh(r1_bh);
}
+
+/*
+ * incement the read recovery counter for the mddevice as well as on the individual disk.
+ * this information is output in raid1_status. This allows us to print out the per disk error counts
+ * so we can decide when the disk is likely to be going completely bad as opposed to having partial media/sector
+ * errors.
+ */
+void raid1_mark_recovered( raid1_conf_t *conf , kdev_t rdev )
+{
+ int i, disks = MD_SB_DISKS;
+ atomic_inc( &conf->rr_total );
+ for (i = 0; i < disks; i++) {
+ if( conf->mirrors[i].dev == rdev ){
+ atomic_inc( &conf->mirrors[i].rr_count );
+ return ;
+ }
+ }
+}
+
+/*
+ * This is the completion callback for the correcting write operation. If the write fails
+ * the disk is definetly bad. Otherwise the write forced the disk drive to remap the bad sector
+ * to one of it's spares.
+ *
+ * If we want to be paranoid, we can issue a read of the just written sector and compare it to the mirror
+ * copy before we acknowledge the read. However impirical data has shown that if the write succeeds, the read will be correct.
+ *
+ */
+void raid1_end_request_recover_complete( struct buffer_head *bh, int uptodate )
+{
+ struct raid1_bh * r1_bh = (struct raid1_bh *)(bh->b_private);
+ raid1_conf_t *conf = mddev_to_conf(r1_bh->mddev);
+
+ if( !uptodate )
+ {
+ printk(KERN_ERR "raid1: %s: recover failed lba=%lu\n", partition_name(bh->b_dev), bh->b_blocknr);
+ md_error (r1_bh->mddev, r1_bh->failed_dev );
+ r1_bh->failed_dev = 0;
+ }
+ else
+ {
+ raid1_mark_recovered( conf, r1_bh->failed_dev );
+ printk(KERN_INFO "raid1: %s: recover success lba=%lu\n", partition_name(bh->b_dev), bh->b_blocknr);
+ }
+ /*
+ * We got here because the write recovery attempt failed us, however since we made it this far
+ * it means that the read WAS SUCESSFULL originally.
+ */
+ r1_bh->cmd = READ;
+ raid1_end_bh_io(r1_bh, 1 );
+}
+
+/*
+ * This is the io completion callback for the read from the redundant mirror. If the read is sucessfull we will issue
+ * a write to the mirror that previsouly failed the read of this sector. This should cause the drive to remap the bad
+ * sector to a spare.
+ */
+void raid1_end_request_recover( struct buffer_head *bh, int uptodate )
+{
+ struct raid1_bh * r1_bh = (struct raid1_bh *)(bh->b_private);
+ if( !uptodate )
+ {
+ printk(KERN_ERR "raid1: %s: recovering lba=%lu failed.. double fault\n", partition_name(bh->b_dev), bh->b_blocknr);
+ /*
+ * We delayed the failure of this device earlier. Now we had 2 failures in a row from different devices
+ * thus we must fail the previos device to ensure we don;t do this forever.
+ */
+ md_error (r1_bh->mddev, r1_bh->failed_dev);
+ r1_bh->failed_dev = bh->b_dev;
+
+ /*
+ * Now retry one more time. We may have more valid devices/ If not raid1d READ/READA handler
+ * will tell us so.
+ */
+ printk(KERN_INFO "raid1: %s: rescheduling lba=%lu again (have more devices?)\n",
+ partition_name(bh->b_dev), bh->b_blocknr);
+ raid1_reschedule_retry(r1_bh);
+ return;
+ }
+ else
+ {
+ /*
+ * FIXME: this whole thing only recovers 1 raid mirror. To do this 100% correctly we need to keep a list of prefail devices
+ * since a 3 way mirror will drop 1 device before rebuilding
+ */
+ printk(KERN_INFO "raid1: %s: recovering block lba=%lu read ok.. do write\n", partition_name(bh->b_dev), bh->b_blocknr);
+ //
+ // We need to map in the recovery device
+ //
+ r1_bh->cmd = WRITE;
+ bh->b_end_io = raid1_end_request_recover_complete;
+ bh->b_dev = r1_bh->failed_dev;
+ bh->b_rdev = r1_bh->failed_dev;
+ raid1_reschedule_retry(r1_bh);
+ }
+}
+
void raid1_end_request (struct buffer_head *bh, int uptodate)
{
struct raid1_bh * r1_bh = (struct raid1_bh *)(bh->b_private);
@@ -410,8 +546,27 @@
/*
* this branch is our 'one mirror IO has finished' event handler:
*/
- if (!uptodate)
+ if (!uptodate){
+ raid1_conf_t *conf = mddev_to_conf(r1_bh->mddev);
+
+ if ( ( conf->use_read_recovery == 1 ) && ( (r1_bh->cmd == READ) || (r1_bh->cmd == READA) ) )
+ {
+ /*
+ * Remap the end io for this device since we are going to try recovery
+ * DO NOT change the mddev.. we need the original failing MDDEV
+ */
+ printk(KERN_INFO "raid1: %s: read recovery will be attempted on lba=%lu from another mirror\n",
+ partition_name(bh->b_dev), bh->b_blocknr);
+
+ bh->b_end_io = raid1_end_request_recover;
+ r1_bh->failed_dev = bh->b_dev;
+ }
+ else
+ {
md_error (r1_bh->mddev, bh->b_dev);
+ }
+
+ }
else
/*
* Set R1BH_Uptodate in our master buffer_head, so that
@@ -441,7 +596,7 @@
/*
* oops, read error:
*/
- printk(KERN_ERR "raid1: %s: rescheduling block %lu\n",
+ printk(KERN_ERR "raid1: %s: rescheduling lba=%lu\n",
partition_name(bh->b_dev), bh->b_blocknr);
raid1_reschedule_retry(r1_bh);
return;
@@ -480,10 +635,11 @@
unsigned long current_distance;
/*
- * Check if it is sane at all to balance
+ * Check if it is sane at all to balance.
+ * make sure the last used drive is operational (it may have been removed).
*/
- if (conf->resync_mirrors)
+ if (conf->resync_mirrors && conf->mirrors[new_disk].operational)
goto rb_out;
@@ -737,10 +893,17 @@
seq_printf(seq, " [%d/%d] [", conf->raid_disks,
conf->working_disks);
+
for (i = 0; i < conf->raid_disks; i++)
seq_printf(seq, "%s",
conf->mirrors[i].operational ? "U" : "_");
seq_printf(seq, "]");
+
+ seq_printf(seq, " ( ");
+ for (i = 0; i < conf->raid_disks; i++){
+ seq_printf(seq,"%d ",atomic_read( &conf->mirrors[i].rr_count ) );
+ }
+ seq_printf(seq, ")");
}
#define LAST_DISK KERN_ALERT \
@@ -783,6 +946,7 @@
static int raid1_error (mddev_t *mddev, kdev_t dev)
{
+ mdk_rdev_t *rrdev = NULL;
raid1_conf_t *conf = mddev_to_conf(mddev);
struct mirror_info * mirrors = conf->mirrors;
int disks = MD_SB_DISKS;
@@ -808,6 +972,16 @@
return 1;
}
+
+ rrdev = find_rdev( mddev, dev );
+ if( rrdev )
+ {
+ rrdev->faulty = 1;
+ }
+ else
+ {
+ printk("raid1: rrdev == NULL in raid1_error\n");
+ }
mark_disk_bad(mddev, i);
return 0;
}
@@ -963,6 +1137,7 @@
tmp = conf->mirrors + i;
if (!tmp->used_slot) {
added_disk = i;
+ atomic_set(&tmp->rr_count,0);
break;
}
}
@@ -1129,7 +1304,7 @@
conf->nr_disks++;
break;
-
+
default:
MD_BUG();
err = 1;
@@ -1266,12 +1441,26 @@
}
break;
+
+ case WRITE:
+ /*
+ * We do not map the dev. It SHOULD be already mapped for us
+ */
+ printk ("raid1: %s: read-error recovery lba=%lu (writing recovered lba)\n",partition_name(bh->b_dev),bh->b_blocknr);
+ generic_make_request (r1_bh->cmd, bh);
+ break;
+
case READ:
case READA:
dev = bh->b_dev;
- raid1_map (mddev, &bh->b_dev);
+ raid1_map_notsame(mddev, &bh->b_dev);
if (bh->b_dev == dev) {
printk (IO_ERROR, partition_name(bh->b_dev), bh->b_blocknr);
+ /* if( r1_bh->failed_dev )
+ {
+ md_error (r1_bh->mddev, r1_bh->failed_dev);
+ r1_bh->failed_dev = 0;
+ }*/
raid1_end_bh_io(r1_bh, 0);
} else {
printk (REDIRECT_SECTOR,
@@ -1596,6 +1785,8 @@
disk_idx = descriptor->raid_disk;
disk = conf->mirrors + disk_idx;
+ atomic_set(&disk->rr_count, 0 );
+
if (disk_faulty(descriptor)) {
disk->number = descriptor->number;
disk->raid_disk = disk_idx;
@@ -1761,6 +1952,8 @@
}
}
sb->active_disks = conf->working_disks;
+ /* Set the read recovery flag to the default value */
+ conf->use_read_recovery = RAID1_READ_RECOVERY;
if (start_recovery)
md_recover_arrays();
@@ -1859,6 +2052,9 @@
static int md__init raid1_init (void)
{
+ if( RAID1_READ_RECOVERY ){
+ printk("raid1: Read Recovery Enabled\n");
+ }
return register_md_personality (RAID1, &raid1_personality);
}
^ permalink raw reply [flat|nested] 9+ messages in thread