* Problem with raid1 read error patch
@ 2004-09-20 15:16 Martin Josefsson
2004-09-20 15:28 ` Paul Clements
0 siblings, 1 reply; 4+ messages in thread
From: Martin Josefsson @ 2004-09-20 15:16 UTC (permalink / raw)
To: neilb; +Cc: linux-raid
Hi Neil
I'm having some trouble with your patch that "fixes" raid1 read error
handling that went into Linus tree. Backing it out fixes it again.
The latest kernel I've tried is 2.6.9-rc2-bk6
ChangeSet 1.1926, 2004/06/24 09:36:53-07:00, akpm@osdl.org
[PATCH] md: Fix up handling for read error in raid1.
From: NeilBrown <neilb@cse.unsw.edu.au>
There is severe bit-rot in this code, which is to say that it doesn't work
at all: an io error during read will do bad things. It should work better
with this patch.
diff -Nru a/drivers/md/raid1.c b/drivers/md/raid1.c
--- a/drivers/md/raid1.c 2004-06-24 10:35:44 -07:00
+++ b/drivers/md/raid1.c 2004-06-24 10:35:44 -07:00
@@ -206,7 +206,7 @@
*rdevp = rdev;
atomic_inc(&rdev->nr_pending);
spin_unlock_irq(&conf->device_lock);
- return 0;
+ return i;
}
}
spin_unlock_irq(&conf->device_lock);
@@ -919,18 +919,22 @@
mddev = r1_bio->mddev;
conf = mddev_to_conf(mddev);
- bio = r1_bio->master_bio;
if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
sync_request_write(mddev, r1_bio);
unplug = 1;
} else {
- if (map(mddev, &rdev) == -1) {
+ int disk;
+ bio = r1_bio->bios[r1_bio->read_disk];
+ if ((disk=map(mddev, &rdev)) == -1) {
printk(KERN_ALERT "raid1: %s: unrecoverable I/O"
" read error for block %llu\n",
bdevname(bio->bi_bdev,b),
(unsigned long long)r1_bio->sector);
raid_end_bio_io(r1_bio);
} else {
+ r1_bio->bios[r1_bio->read_disk] = NULL;
+ r1_bio->read_disk = disk;
+ r1_bio->bios[r1_bio->read_disk] = bio;
printk(KERN_ERR "raid1: %s: redirecting sector %llu to"
" another mirror\n",
bdevname(rdev->bdev,b),
After this patch I get infinite loops in sector rescheduling when one disk
fails (I physically remove it). I have two disks (sda, sdb) with 4
partitions each. They make up 4 raid1 arrays. I'm removing sda from the
scsi-chain.
Example:
Sep 17 16:23:45 faioffer kernel: SCSI error : <0 0 1 0> return code = 0x10000
Sep 17 16:23:45 faioffer kernel: end_request: I/O error, dev sda, sector 4208897
Sep 17 16:23:45 faioffer kernel: md: write_disk_sb failed for device sda1
Sep 17 16:23:45 faioffer kernel: md: errors occurred during superblock update, repeating
Sep 17 16:23:46 faioffer kernel: SCSI error : <0 0 1 0> return code = 0x10000
Sep 17 16:23:46 faioffer kernel: end_request: I/O error, dev sda, sector 4208897
Sep 17 16:23:46 faioffer kernel: md: write_disk_sb failed for device sda1
Sep 17 16:23:46 faioffer kernel: md: errors occurred during superblock update, repeating
Sep 17 16:23:46 faioffer kernel: SCSI error : <0 0 1 0> return code = 0x10000
Sep 17 16:23:46 faioffer kernel: end_request: I/O error, dev sda, sector 4208897
Sep 17 16:23:46 faioffer kernel: md: write_disk_sb failed for device sda1
Sep 17 16:23:46 faioffer kernel: md: errors occurred during superblock update, repeating
Sep 17 16:23:46 faioffer kernel: SCSI error : <0 0 1 0> return code = 0x10000
Sep 17 16:23:46 faioffer kernel: end_request: I/O error, dev sda, sector 2887473
Sep 17 16:23:46 faioffer kernel: raid1: Disk failure on sda1, disabling device.
Sep 17 16:23:46 faioffer kernel: ^IOperation continuing on 1 devices
Sep 17 16:23:46 faioffer kernel: raid1: sda1: rescheduling sector 2887472
Sep 17 16:23:46 faioffer kernel: raid1: sdb1: redirecting sector 2887472 to another mirror
Sep 17 16:23:46 faioffer kernel: raid1: sdb1: rescheduling sector 2887472
Sep 17 16:23:46 faioffer kernel: SCSI error : <0 0 1 0> return code = 0x10000
Sep 17 16:23:46 faioffer kernel: end_request: I/O error, dev sda, sector 33
Sep 17 16:23:46 faioffer kernel: RAID1 conf printout:
Sep 17 16:23:46 faioffer kernel: --- wd:1 rd:2
Sep 17 16:23:46 faioffer kernel: disk 0, wo:1, o:0, dev:sda1
Sep 17 16:23:46 faioffer kernel: disk 1, wo:0, o:1, dev:sdb1
Sep 17 16:23:46 faioffer kernel: RAID1 conf printout:
Sep 17 16:23:46 faioffer kernel: --- wd:1 rd:2
Sep 17 16:23:46 faioffer kernel: disk 1, wo:0, o:1, dev:sdb1
Sep 17 16:23:46 faioffer kernel: raid1: sdb1: redirecting sector 2887472 to another mirror
Sep 17 16:23:47 faioffer kernel: raid1: sdb1: rescheduling sector 2887472
Sep 17 16:23:47 faioffer kernel: raid1: sdb1: redirecting sector 2887472 to another mirror
Sep 17 16:23:47 faioffer kernel: raid1: sdb1: rescheduling sector 2887472
Sep 17 16:23:47 faioffer kernel: raid1: sdb1: redirecting sector 2887472 to another mirror
Sep 17 16:23:47 faioffer kernel: raid1: sdb1: rescheduling sector 2887472
Sep 17 16:23:47 faioffer kernel: raid1: sdb1: redirecting sector 2887472 to another mirror
Sep 17 16:23:47 faioffer kernel: raid1: sdb1: rescheduling sector 2887472
Sep 17 16:23:47 faioffer kernel: raid1: sdb1: redirecting sector 2887472 to another mirror
Sep 17 16:23:47 faioffer kernel: raid1: sdb1: rescheduling sector 2887472
It continues like that forever.
After backing that patch out, with some minor modifications because the
code has changed a little bit, I get a number of scsi-errors and after a
while the drive gets disabled like above but life continues like before
the patch went in. That is, no infinite loop and everything works :)
Any ideas to what went wrong?
/Martin
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Problem with raid1 read error patch
2004-09-20 15:16 Problem with raid1 read error patch Martin Josefsson
@ 2004-09-20 15:28 ` Paul Clements
2004-09-20 16:14 ` Martin Josefsson
0 siblings, 1 reply; 4+ messages in thread
From: Paul Clements @ 2004-09-20 15:28 UTC (permalink / raw)
To: Martin Josefsson; +Cc: neilb, linux-raid
Martin Josefsson wrote:
> Hi Neil
>
> I'm having some trouble with your patch that "fixes" raid1 read error
> handling that went into Linus tree. Backing it out fixes it again.
> The latest kernel I've tried is 2.6.9-rc2-bk6
>
> ChangeSet 1.1926, 2004/06/24 09:36:53-07:00, akpm@osdl.org
>
> [PATCH] md: Fix up handling for read error in raid1.
>
> From: NeilBrown <neilb@cse.unsw.edu.au>
>
> There is severe bit-rot in this code, which is to say that it doesn't work
> at all: an io error during read will do bad things. It should work better
> with this patch.
>
> diff -Nru a/drivers/md/raid1.c b/drivers/md/raid1.c
> --- a/drivers/md/raid1.c 2004-06-24 10:35:44 -07:00
> +++ b/drivers/md/raid1.c 2004-06-24 10:35:44 -07:00
> @@ -206,7 +206,7 @@
> *rdevp = rdev;
> atomic_inc(&rdev->nr_pending);
> spin_unlock_irq(&conf->device_lock);
> - return 0;
> + return i;
> }
> }
> spin_unlock_irq(&conf->device_lock);
> @@ -919,18 +919,22 @@
>
> mddev = r1_bio->mddev;
> conf = mddev_to_conf(mddev);
> - bio = r1_bio->master_bio;
> if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
> sync_request_write(mddev, r1_bio);
> unplug = 1;
> } else {
> - if (map(mddev, &rdev) == -1) {
> + int disk;
> + bio = r1_bio->bios[r1_bio->read_disk];
> + if ((disk=map(mddev, &rdev)) == -1) {
> printk(KERN_ALERT "raid1: %s: unrecoverable I/O"
> " read error for block %llu\n",
> bdevname(bio->bi_bdev,b),
> (unsigned long long)r1_bio->sector);
> raid_end_bio_io(r1_bio);
> } else {
> + r1_bio->bios[r1_bio->read_disk] = NULL;
> + r1_bio->read_disk = disk;
> + r1_bio->bios[r1_bio->read_disk] = bio;
> printk(KERN_ERR "raid1: %s: redirecting sector %llu to"
> " another mirror\n",
> bdevname(rdev->bdev,b),
>
> After this patch I get infinite loops in sector rescheduling when one disk
> fails (I physically remove it). I have two disks (sda, sdb) with 4
> partitions each. They make up 4 raid1 arrays. I'm removing sda from the
> scsi-chain.
>
> Example:
>
> Sep 17 16:23:45 faioffer kernel: SCSI error : <0 0 1 0> return code = 0x10000
> Sep 17 16:23:45 faioffer kernel: end_request: I/O error, dev sda, sector 4208897
> Sep 17 16:23:45 faioffer kernel: md: write_disk_sb failed for device sda1
> Sep 17 16:23:45 faioffer kernel: md: errors occurred during superblock update, repeating
> Sep 17 16:23:46 faioffer kernel: SCSI error : <0 0 1 0> return code = 0x10000
> Sep 17 16:23:46 faioffer kernel: end_request: I/O error, dev sda, sector 4208897
> Sep 17 16:23:46 faioffer kernel: md: write_disk_sb failed for device sda1
> Sep 17 16:23:46 faioffer kernel: md: errors occurred during superblock update, repeating
> Sep 17 16:23:46 faioffer kernel: SCSI error : <0 0 1 0> return code = 0x10000
> Sep 17 16:23:46 faioffer kernel: end_request: I/O error, dev sda, sector 4208897
> Sep 17 16:23:46 faioffer kernel: md: write_disk_sb failed for device sda1
> Sep 17 16:23:46 faioffer kernel: md: errors occurred during superblock update, repeating
> Sep 17 16:23:46 faioffer kernel: SCSI error : <0 0 1 0> return code = 0x10000
> Sep 17 16:23:46 faioffer kernel: end_request: I/O error, dev sda, sector 2887473
> Sep 17 16:23:46 faioffer kernel: raid1: Disk failure on sda1, disabling device.
> Sep 17 16:23:46 faioffer kernel: ^IOperation continuing on 1 devices
> Sep 17 16:23:46 faioffer kernel: raid1: sda1: rescheduling sector 2887472
> Sep 17 16:23:46 faioffer kernel: raid1: sdb1: redirecting sector 2887472 to another mirror
> Sep 17 16:23:46 faioffer kernel: raid1: sdb1: rescheduling sector 2887472
> Sep 17 16:23:46 faioffer kernel: SCSI error : <0 0 1 0> return code = 0x10000
> Sep 17 16:23:46 faioffer kernel: end_request: I/O error, dev sda, sector 33
> Sep 17 16:23:46 faioffer kernel: RAID1 conf printout:
> Sep 17 16:23:46 faioffer kernel: --- wd:1 rd:2
> Sep 17 16:23:46 faioffer kernel: disk 0, wo:1, o:0, dev:sda1
> Sep 17 16:23:46 faioffer kernel: disk 1, wo:0, o:1, dev:sdb1
> Sep 17 16:23:46 faioffer kernel: RAID1 conf printout:
> Sep 17 16:23:46 faioffer kernel: --- wd:1 rd:2
> Sep 17 16:23:46 faioffer kernel: disk 1, wo:0, o:1, dev:sdb1
> Sep 17 16:23:46 faioffer kernel: raid1: sdb1: redirecting sector 2887472 to another mirror
> Sep 17 16:23:47 faioffer kernel: raid1: sdb1: rescheduling sector 2887472
> Sep 17 16:23:47 faioffer kernel: raid1: sdb1: redirecting sector 2887472 to another mirror
> Sep 17 16:23:47 faioffer kernel: raid1: sdb1: rescheduling sector 2887472
> Sep 17 16:23:47 faioffer kernel: raid1: sdb1: redirecting sector 2887472 to another mirror
> Sep 17 16:23:47 faioffer kernel: raid1: sdb1: rescheduling sector 2887472
> Sep 17 16:23:47 faioffer kernel: raid1: sdb1: redirecting sector 2887472 to another mirror
> Sep 17 16:23:47 faioffer kernel: raid1: sdb1: rescheduling sector 2887472
> Sep 17 16:23:47 faioffer kernel: raid1: sdb1: redirecting sector 2887472 to another mirror
> Sep 17 16:23:47 faioffer kernel: raid1: sdb1: rescheduling sector 2887472
>
> It continues like that forever.
>
> After backing that patch out, with some minor modifications because the
> code has changed a little bit, I get a number of scsi-errors and after a
> while the drive gets disabled like above but life continues like before
> the patch went in. That is, no infinite loop and everything works :)
>
> Any ideas to what went wrong?
Yes. You're getting the infinite retries because the BIO_UPTODATE flag
in the bio is not set.
I've been debugging the same problem just recently. In addition to this
patch from Neil, you'll also need a patch that I posted here last week,
which does a bio_put() and a bio_clone() to get rid of the old bio that
the read error occurred on, and create a new (clean) bio to retry the
read against:
http://marc.theaimsgroup.com/?l=linux-raid&m=109527014728404&w=2
--
Paul
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Problem with raid1 read error patch
2004-09-20 15:28 ` Paul Clements
@ 2004-09-20 16:14 ` Martin Josefsson
2004-09-21 16:07 ` Martin Josefsson
0 siblings, 1 reply; 4+ messages in thread
From: Martin Josefsson @ 2004-09-20 16:14 UTC (permalink / raw)
To: Paul Clements; +Cc: neilb, linux-raid
[-- Attachment #1: Type: text/plain, Size: 1042 bytes --]
On Mon, 2004-09-20 at 17:28, Paul Clements wrote:
Hi Paul
> > After backing that patch out, with some minor modifications because the
> > code has changed a little bit, I get a number of scsi-errors and after a
> > while the drive gets disabled like above but life continues like before
> > the patch went in. That is, no infinite loop and everything works :)
> >
> > Any ideas to what went wrong?
>
> Yes. You're getting the infinite retries because the BIO_UPTODATE flag
> in the bio is not set.
Ah
> I've been debugging the same problem just recently. In addition to this
> patch from Neil, you'll also need a patch that I posted here last week,
> which does a bio_put() and a bio_clone() to get rid of the old bio that
> the read error occurred on, and create a new (clean) bio to retry the
> read against:
>
> http://marc.theaimsgroup.com/?l=linux-raid&m=109527014728404&w=2
Thanks, will try it out tomorrow when I'm back in front of the machine.
Didn't expect an answer so soon :)
--
/Martin
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Problem with raid1 read error patch
2004-09-20 16:14 ` Martin Josefsson
@ 2004-09-21 16:07 ` Martin Josefsson
0 siblings, 0 replies; 4+ messages in thread
From: Martin Josefsson @ 2004-09-21 16:07 UTC (permalink / raw)
To: Paul Clements; +Cc: neilb, linux-raid
[-- Attachment #1.1: Type: text/plain, Size: 905 bytes --]
On Mon, 2004-09-20 at 18:14, Martin Josefsson wrote:
Hi Paul
> > I've been debugging the same problem just recently. In addition to this
> > patch from Neil, you'll also need a patch that I posted here last week,
> > which does a bio_put() and a bio_clone() to get rid of the old bio that
> > the read error occurred on, and create a new (clean) bio to retry the
> > read against:
> >
> > http://marc.theaimsgroup.com/?l=linux-raid&m=109527014728404&w=2
>
> Thanks, will try it out tomorrow when I'm back in front of the machine.
> Didn't expect an answer so soon :)
I tested the patch today, worked like a charm. Thank you.
There's been some small changes in 2.6.9-rc2-bk6 which is the kernel
I've been testing with. I applied your fix by hand and diffed, the
resulting patch is attached to make life easy for others with the same
problem.
Thanks again.
--
/Martin
[-- Attachment #1.2: raid1-errorfix.patch-2.6.9-rc2-bk6 --]
[-- Type: text/plain, Size: 954 bytes --]
--- linux-2.6.9-rc2-bk6/drivers/md/raid1.c.nofix 2004-09-21 09:25:48.000000000 +0200
+++ linux-2.6.9-rc2-bk6/drivers/md/raid1.c 2004-09-21 09:27:56.000000000 +0200
@@ -941,6 +941,9 @@
} else {
r1_bio->bios[r1_bio->read_disk] = NULL;
r1_bio->read_disk = disk;
+ /* discard the failed bio and clone a new one */
+ bio_put(bio);
+ bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
r1_bio->bios[r1_bio->read_disk] = bio;
rdev = conf->mirrors[disk].rdev;
if (printk_ratelimit())
@@ -948,9 +951,11 @@
" another mirror\n",
bdevname(rdev->bdev,b),
(unsigned long long)r1_bio->sector);
- bio->bi_bdev = rdev->bdev;
bio->bi_sector = r1_bio->sector + rdev->data_offset;
+ bio->bi_bdev = rdev->bdev;
+ bio->bi_end_io = raid1_end_read_request;
bio->bi_rw = READ;
+ bio->bi_private = r1_bio;
unplug = 1;
generic_make_request(bio);
}
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2004-09-21 16:07 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-09-20 15:16 Problem with raid1 read error patch Martin Josefsson
2004-09-20 15:28 ` Paul Clements
2004-09-20 16:14 ` Martin Josefsson
2004-09-21 16:07 ` Martin Josefsson
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.