* [patch] limit error rate
@ 2008-04-12 18:16 Bernd Schubert
2008-04-23 0:44 ` Dan Williams
0 siblings, 1 reply; 5+ messages in thread
From: Bernd Schubert @ 2008-04-12 18:16 UTC (permalink / raw)
To: linux-raid
Hello,
last night we had scsi problems and a hardware raid
unit was offlined during heavy i/o. While this happened we got for
about 3 minutes a huge number messages like these
Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error not correctable (sector 2993096568 on sdj2).
I guess the high error rate is responsible for not scheduling other
events - during this time the system was not pingable and in the end
also other devices run into scsi command timeouts causing problems on
these unrelated devices as well.
Signed-off-by: Bernd Schubert <bernd-schubert@gmx.de>
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b162b83..3c82dfb 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1141,10 +1141,12 @@ static void raid5_end_read_request(struct bio * bi, int error)
set_bit(R5_UPTODATE, &sh->dev[i].flags);
if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
rdev = conf->disks[i].rdev;
- printk(KERN_INFO "raid5:%s: read error corrected (%lu sectors at %llu on %s)\n",
- mdname(conf->mddev), STRIPE_SECTORS,
- (unsigned long long)(sh->sector + rdev->data_offset),
- bdevname(rdev->bdev, b));
+ if (printk_ratelimit())
+ printk(KERN_INFO "raid5:%s: read error corrected"
+ " (%lu sectors at %llu on %s)\n",
+ mdname(conf->mddev), STRIPE_SECTORS,
+ (unsigned long long)(sh->sector + rdev->data_offset),
+ bdevname(rdev->bdev, b));
clear_bit(R5_ReadError, &sh->dev[i].flags);
clear_bit(R5_ReWrite, &sh->dev[i].flags);
}
@@ -1157,12 +1159,13 @@ static void raid5_end_read_request(struct bio * bi, int error)
clear_bit(R5_UPTODATE, &sh->dev[i].flags);
atomic_inc(&rdev->read_errors);
- if (conf->mddev->degraded)
+ if (conf->mddev->degraded && printk_ratelimit())
printk(KERN_WARNING "raid5:%s: read error not correctable (sector %llu on %s).\n",
mdname(conf->mddev),
(unsigned long long)(sh->sector + rdev->data_offset),
bdn);
- else if (test_bit(R5_ReWrite, &sh->dev[i].flags))
+ else if (test_bit(R5_ReWrite, &sh->dev[i].flags) &&
+ printk_ratelimit())
/* Oh, no!!! */
printk(KERN_WARNING "raid5:%s: read error NOT corrected!! (sector %llu on %s).\n",
mdname(conf->mddev),
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [patch] limit error rate
2008-04-12 18:16 [patch] limit error rate Bernd Schubert
@ 2008-04-23 0:44 ` Dan Williams
2008-04-23 22:55 ` Bernd Schubert
0 siblings, 1 reply; 5+ messages in thread
From: Dan Williams @ 2008-04-23 0:44 UTC (permalink / raw)
To: Bernd Schubert; +Cc: linux-raid
On Sat, Apr 12, 2008 at 11:16 AM, Bernd Schubert <bernd-schubert@gmx.de> wrote:
> Hello,
>
> last night we had scsi problems and a hardware raid
> unit was offlined during heavy i/o. While this happened we got for
> about 3 minutes a huge number messages like these
>
> Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error not correctable (sector 2993096568 on sdj2).
>
> I guess the high error rate is responsible for not scheduling other
> events - during this time the system was not pingable and in the end
> also other devices run into scsi command timeouts causing problems on
> these unrelated devices as well.
>
>
> Signed-off-by: Bernd Schubert <bernd-schubert@gmx.de>
>
Hi Bernd,
This patch is whitespace damaged (tabs-->spaces). Can you resend as
an attachment?
Thanks,
Dan
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [patch] limit error rate
2008-04-23 0:44 ` Dan Williams
@ 2008-04-23 22:55 ` Bernd Schubert
2008-04-28 1:44 ` Neil Brown
0 siblings, 1 reply; 5+ messages in thread
From: Bernd Schubert @ 2008-04-23 22:55 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 1363 bytes --]
Hello Dan,
On Wednesday 23 April 2008, Dan Williams wrote:
> On Sat, Apr 12, 2008 at 11:16 AM, Bernd Schubert <bernd-schubert@gmx.de>
wrote:
> > Hello,
> >
> > last night we had scsi problems and a hardware raid
> > unit was offlined during heavy i/o. While this happened we got for
> > about 3 minutes a huge number messages like these
> >
> > Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error
> > not correctable (sector 2993096568 on sdj2).
> >
> > I guess the high error rate is responsible for not scheduling other
> > events - during this time the system was not pingable and in the end
> > also other devices run into scsi command timeouts causing problems on
> > these unrelated devices as well.
> >
> >
> > Signed-off-by: Bernd Schubert <bernd-schubert@gmx.de>
>
> Hi Bernd,
>
> This patch is whitespace damaged (tabs-->spaces). Can you resend as
> an attachment?
hmm, don't know how I managed to do that. Probably copied it from the shell...
I have attached it this time. I also just added another printk_ratelimit().
Btw, from my point of view the
if (printk_ratelimit())
printk("print output");
looks odd. I just don't see why the API isn't
printk_ratelimit("print output");
Oh well, modifying this all over the code would give a huge almost useless
patch _only_ improving the beauty of code.
Thanks,
Bernd
[-- Attachment #2: limit_read_error_not_correctable.patch --]
[-- Type: text/x-diff, Size: 2025 bytes --]
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b162b83..60d3442 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1141,10 +1141,12 @@ static void raid5_end_read_request(struct bio * bi, int error)
set_bit(R5_UPTODATE, &sh->dev[i].flags);
if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
rdev = conf->disks[i].rdev;
- printk(KERN_INFO "raid5:%s: read error corrected (%lu sectors at %llu on %s)\n",
- mdname(conf->mddev), STRIPE_SECTORS,
- (unsigned long long)(sh->sector + rdev->data_offset),
- bdevname(rdev->bdev, b));
+ if (printk_ratelimit())
+ printk(KERN_INFO "raid5:%s: read error corrected"
+ " (%lu sectors at %llu on %s)\n",
+ mdname(conf->mddev), STRIPE_SECTORS,
+ (unsigned long long)(sh->sector + rdev->data_offset),
+ bdevname(rdev->bdev, b));
clear_bit(R5_ReadError, &sh->dev[i].flags);
clear_bit(R5_ReWrite, &sh->dev[i].flags);
}
@@ -1157,19 +1159,20 @@ static void raid5_end_read_request(struct bio * bi, int error)
clear_bit(R5_UPTODATE, &sh->dev[i].flags);
atomic_inc(&rdev->read_errors);
- if (conf->mddev->degraded)
+ if (conf->mddev->degraded && printk_ratelimit())
printk(KERN_WARNING "raid5:%s: read error not correctable (sector %llu on %s).\n",
mdname(conf->mddev),
(unsigned long long)(sh->sector + rdev->data_offset),
bdn);
- else if (test_bit(R5_ReWrite, &sh->dev[i].flags))
+ else if (test_bit(R5_ReWrite, &sh->dev[i].flags) &&
+ printk_ratelimit())
/* Oh, no!!! */
printk(KERN_WARNING "raid5:%s: read error NOT corrected!! (sector %llu on %s).\n",
mdname(conf->mddev),
(unsigned long long)(sh->sector + rdev->data_offset),
bdn);
else if (atomic_read(&rdev->read_errors)
- > conf->max_nr_stripes)
+ > conf->max_nr_stripes && printk_ratelimit())
printk(KERN_WARNING
"raid5:%s: Too many read errors, failing device %s.\n",
mdname(conf->mddev), bdn);
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [patch] limit error rate
2008-04-23 22:55 ` Bernd Schubert
@ 2008-04-28 1:44 ` Neil Brown
2008-04-28 21:33 ` Bernd Schubert
0 siblings, 1 reply; 5+ messages in thread
From: Neil Brown @ 2008-04-28 1:44 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Dan Williams, linux-raid
On Thursday April 24, bernd-schubert@gmx.de wrote:
>
> @@ -1157,19 +1159,20 @@ static void raid5_end_read_request(struct bio * bi, int error)
>
> clear_bit(R5_UPTODATE, &sh->dev[i].flags);
> atomic_inc(&rdev->read_errors);
> - if (conf->mddev->degraded)
> + if (conf->mddev->degraded && printk_ratelimit())
> printk(KERN_WARNING "raid5:%s: read error not correctable (sector %llu on %s).\n",
> mdname(conf->mddev),
> (unsigned long long)(sh->sector + rdev->data_offset),
> bdn);
> - else if (test_bit(R5_ReWrite, &sh->dev[i].flags))
> + else if (test_bit(R5_ReWrite, &sh->dev[i].flags) &&
> + printk_ratelimit())
> /* Oh, no!!! */
> printk(KERN_WARNING "raid5:%s: read error NOT corrected!! (sector %llu on %s).\n",
> mdname(conf->mddev),
> (unsigned long long)(sh->sector + rdev->data_offset),
> bdn);
> else if (atomic_read(&rdev->read_errors)
> - > conf->max_nr_stripes)
> + > conf->max_nr_stripes && printk_ratelimit())
> printk(KERN_WARNING
> "raid5:%s: Too many read errors, failing device %s.\n",
> mdname(conf->mddev), bdn);
This is not good, you are subtly changing semantics.
You have changed:
if (A)
X;
else if (B)
Y;
else if (C)
Z;
else
W;
to
if (A && foo())
X;
else if (B && foo())
Y;
else if (C && foo())
Z;
else
W;
If 'A' and not 'foo()', you want nothing to happen, but you will
actually get one of Y, Z, or W happening, all of which are wrong.
>
> Btw, from my point of view the
>
> if (printk_ratelimit())
> printk("print output");
>
> looks odd. I just don't see why the API isn't
>
> printk_ratelimit("print output");
Very sensible. I've put
#define printk_rl printk_ratelimit() ?: printk
at the start of raid5.c, and used it to fix the problematic printks.
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [patch] limit error rate
2008-04-28 1:44 ` Neil Brown
@ 2008-04-28 21:33 ` Bernd Schubert
0 siblings, 0 replies; 5+ messages in thread
From: Bernd Schubert @ 2008-04-28 21:33 UTC (permalink / raw)
To: Neil Brown; +Cc: Dan Williams, linux-raid
On Monday 28 April 2008, Neil Brown wrote:
> This is not good, you are subtly changing semantics.
> You have changed:
>
> if (A)
> X;
> else if (B)
> Y;
> else if (C)
> Z;
> else
> W;
>
> to
> if (A && foo())
> X;
> else if (B && foo())
> Y;
> else if (C && foo())
> Z;
> else
> W;
>
> If 'A' and not 'foo()', you want nothing to happen, but you will
> actually get one of Y, Z, or W happening, all of which are wrong.
Ouch, right. I shouldn't write patches in the middle of the night...
>
> > Btw, from my point of view the
> >
> > if (printk_ratelimit())
> > printk("print output");
> >
> > looks odd. I just don't see why the API isn't
> >
> > printk_ratelimit("print output");
>
> Very sensible. I've put
>
> #define printk_rl printk_ratelimit() ?: printk
>
> at the start of raid5.c, and used it to fix the problematic printks.
Great!
Thanks a lot,
Bernd
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2008-04-28 21:33 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-04-12 18:16 [patch] limit error rate Bernd Schubert
2008-04-23 0:44 ` Dan Williams
2008-04-23 22:55 ` Bernd Schubert
2008-04-28 1:44 ` Neil Brown
2008-04-28 21:33 ` Bernd Schubert
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).