* [patch] limit error rate @ 2008-04-12 18:16 Bernd Schubert 2008-04-23 0:44 ` Dan Williams 0 siblings, 1 reply; 5+ messages in thread From: Bernd Schubert @ 2008-04-12 18:16 UTC (permalink / raw) To: linux-raid Hello, last night we had scsi problems and a hardware raid unit was offlined during heavy i/o. While this happened we got for about 3 minutes a huge number messages like these Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error not correctable (sector 2993096568 on sdj2). I guess the high error rate is responsible for not scheduling other events - during this time the system was not pingable and in the end also other devices run into scsi command timeouts causing problems on these unrelated devices as well. Signed-off-by: Bernd Schubert <bernd-schubert@gmx.de> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index b162b83..3c82dfb 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1141,10 +1141,12 @@ static void raid5_end_read_request(struct bio * bi, int error) set_bit(R5_UPTODATE, &sh->dev[i].flags); if (test_bit(R5_ReadError, &sh->dev[i].flags)) { rdev = conf->disks[i].rdev; - printk(KERN_INFO "raid5:%s: read error corrected (%lu sectors at %llu on %s)\n", - mdname(conf->mddev), STRIPE_SECTORS, - (unsigned long long)(sh->sector + rdev->data_offset), - bdevname(rdev->bdev, b)); + if (printk_ratelimit()) + printk(KERN_INFO "raid5:%s: read error corrected" + " (%lu sectors at %llu on %s)\n", + mdname(conf->mddev), STRIPE_SECTORS, + (unsigned long long)(sh->sector + rdev->data_offset), + bdevname(rdev->bdev, b)); clear_bit(R5_ReadError, &sh->dev[i].flags); clear_bit(R5_ReWrite, &sh->dev[i].flags); } @@ -1157,12 +1159,13 @@ static void raid5_end_read_request(struct bio * bi, int error) clear_bit(R5_UPTODATE, &sh->dev[i].flags); atomic_inc(&rdev->read_errors); - if (conf->mddev->degraded) + if (conf->mddev->degraded && printk_ratelimit()) printk(KERN_WARNING "raid5:%s: read error not correctable (sector %llu on %s).\n", mdname(conf->mddev), (unsigned long long)(sh->sector + rdev->data_offset), bdn); - else if (test_bit(R5_ReWrite, &sh->dev[i].flags)) + else if (test_bit(R5_ReWrite, &sh->dev[i].flags) && + printk_ratelimit()) /* Oh, no!!! */ printk(KERN_WARNING "raid5:%s: read error NOT corrected!! (sector %llu on %s).\n", mdname(conf->mddev), ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [patch] limit error rate 2008-04-12 18:16 [patch] limit error rate Bernd Schubert @ 2008-04-23 0:44 ` Dan Williams 2008-04-23 22:55 ` Bernd Schubert 0 siblings, 1 reply; 5+ messages in thread From: Dan Williams @ 2008-04-23 0:44 UTC (permalink / raw) To: Bernd Schubert; +Cc: linux-raid On Sat, Apr 12, 2008 at 11:16 AM, Bernd Schubert <bernd-schubert@gmx.de> wrote: > Hello, > > last night we had scsi problems and a hardware raid > unit was offlined during heavy i/o. While this happened we got for > about 3 minutes a huge number messages like these > > Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error not correctable (sector 2993096568 on sdj2). > > I guess the high error rate is responsible for not scheduling other > events - during this time the system was not pingable and in the end > also other devices run into scsi command timeouts causing problems on > these unrelated devices as well. > > > Signed-off-by: Bernd Schubert <bernd-schubert@gmx.de> > Hi Bernd, This patch is whitespace damaged (tabs-->spaces). Can you resend as an attachment? Thanks, Dan ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [patch] limit error rate 2008-04-23 0:44 ` Dan Williams @ 2008-04-23 22:55 ` Bernd Schubert 2008-04-28 1:44 ` Neil Brown 0 siblings, 1 reply; 5+ messages in thread From: Bernd Schubert @ 2008-04-23 22:55 UTC (permalink / raw) To: Dan Williams; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 1363 bytes --] Hello Dan, On Wednesday 23 April 2008, Dan Williams wrote: > On Sat, Apr 12, 2008 at 11:16 AM, Bernd Schubert <bernd-schubert@gmx.de> wrote: > > Hello, > > > > last night we had scsi problems and a hardware raid > > unit was offlined during heavy i/o. While this happened we got for > > about 3 minutes a huge number messages like these > > > > Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error > > not correctable (sector 2993096568 on sdj2). > > > > I guess the high error rate is responsible for not scheduling other > > events - during this time the system was not pingable and in the end > > also other devices run into scsi command timeouts causing problems on > > these unrelated devices as well. > > > > > > Signed-off-by: Bernd Schubert <bernd-schubert@gmx.de> > > Hi Bernd, > > This patch is whitespace damaged (tabs-->spaces). Can you resend as > an attachment? hmm, don't know how I managed to do that. Probably copied it from the shell... I have attached it this time. I also just added another printk_ratelimit(). Btw, from my point of view the if (printk_ratelimit()) printk("print output"); looks odd. I just don't see why the API isn't printk_ratelimit("print output"); Oh well, modifying this all over the code would give a huge almost useless patch _only_ improving the beauty of code. Thanks, Bernd [-- Attachment #2: limit_read_error_not_correctable.patch --] [-- Type: text/x-diff, Size: 2025 bytes --] diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index b162b83..60d3442 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1141,10 +1141,12 @@ static void raid5_end_read_request(struct bio * bi, int error) set_bit(R5_UPTODATE, &sh->dev[i].flags); if (test_bit(R5_ReadError, &sh->dev[i].flags)) { rdev = conf->disks[i].rdev; - printk(KERN_INFO "raid5:%s: read error corrected (%lu sectors at %llu on %s)\n", - mdname(conf->mddev), STRIPE_SECTORS, - (unsigned long long)(sh->sector + rdev->data_offset), - bdevname(rdev->bdev, b)); + if (printk_ratelimit()) + printk(KERN_INFO "raid5:%s: read error corrected" + " (%lu sectors at %llu on %s)\n", + mdname(conf->mddev), STRIPE_SECTORS, + (unsigned long long)(sh->sector + rdev->data_offset), + bdevname(rdev->bdev, b)); clear_bit(R5_ReadError, &sh->dev[i].flags); clear_bit(R5_ReWrite, &sh->dev[i].flags); } @@ -1157,19 +1159,20 @@ static void raid5_end_read_request(struct bio * bi, int error) clear_bit(R5_UPTODATE, &sh->dev[i].flags); atomic_inc(&rdev->read_errors); - if (conf->mddev->degraded) + if (conf->mddev->degraded && printk_ratelimit()) printk(KERN_WARNING "raid5:%s: read error not correctable (sector %llu on %s).\n", mdname(conf->mddev), (unsigned long long)(sh->sector + rdev->data_offset), bdn); - else if (test_bit(R5_ReWrite, &sh->dev[i].flags)) + else if (test_bit(R5_ReWrite, &sh->dev[i].flags) && + printk_ratelimit()) /* Oh, no!!! */ printk(KERN_WARNING "raid5:%s: read error NOT corrected!! (sector %llu on %s).\n", mdname(conf->mddev), (unsigned long long)(sh->sector + rdev->data_offset), bdn); else if (atomic_read(&rdev->read_errors) - > conf->max_nr_stripes) + > conf->max_nr_stripes && printk_ratelimit()) printk(KERN_WARNING "raid5:%s: Too many read errors, failing device %s.\n", mdname(conf->mddev), bdn); ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [patch] limit error rate 2008-04-23 22:55 ` Bernd Schubert @ 2008-04-28 1:44 ` Neil Brown 2008-04-28 21:33 ` Bernd Schubert 0 siblings, 1 reply; 5+ messages in thread From: Neil Brown @ 2008-04-28 1:44 UTC (permalink / raw) To: Bernd Schubert; +Cc: Dan Williams, linux-raid On Thursday April 24, bernd-schubert@gmx.de wrote: > > @@ -1157,19 +1159,20 @@ static void raid5_end_read_request(struct bio * bi, int error) > > clear_bit(R5_UPTODATE, &sh->dev[i].flags); > atomic_inc(&rdev->read_errors); > - if (conf->mddev->degraded) > + if (conf->mddev->degraded && printk_ratelimit()) > printk(KERN_WARNING "raid5:%s: read error not correctable (sector %llu on %s).\n", > mdname(conf->mddev), > (unsigned long long)(sh->sector + rdev->data_offset), > bdn); > - else if (test_bit(R5_ReWrite, &sh->dev[i].flags)) > + else if (test_bit(R5_ReWrite, &sh->dev[i].flags) && > + printk_ratelimit()) > /* Oh, no!!! */ > printk(KERN_WARNING "raid5:%s: read error NOT corrected!! (sector %llu on %s).\n", > mdname(conf->mddev), > (unsigned long long)(sh->sector + rdev->data_offset), > bdn); > else if (atomic_read(&rdev->read_errors) > - > conf->max_nr_stripes) > + > conf->max_nr_stripes && printk_ratelimit()) > printk(KERN_WARNING > "raid5:%s: Too many read errors, failing device %s.\n", > mdname(conf->mddev), bdn); This is not good, you are subtly changing semantics. You have changed: if (A) X; else if (B) Y; else if (C) Z; else W; to if (A && foo()) X; else if (B && foo()) Y; else if (C && foo()) Z; else W; If 'A' and not 'foo()', you want nothing to happen, but you will actually get one of Y, Z, or W happening, all of which are wrong. > > Btw, from my point of view the > > if (printk_ratelimit()) > printk("print output"); > > looks odd. I just don't see why the API isn't > > printk_ratelimit("print output"); Very sensible. I've put #define printk_rl printk_ratelimit() ?: printk at the start of raid5.c, and used it to fix the problematic printks. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [patch] limit error rate 2008-04-28 1:44 ` Neil Brown @ 2008-04-28 21:33 ` Bernd Schubert 0 siblings, 0 replies; 5+ messages in thread From: Bernd Schubert @ 2008-04-28 21:33 UTC (permalink / raw) To: Neil Brown; +Cc: Dan Williams, linux-raid On Monday 28 April 2008, Neil Brown wrote: > This is not good, you are subtly changing semantics. > You have changed: > > if (A) > X; > else if (B) > Y; > else if (C) > Z; > else > W; > > to > if (A && foo()) > X; > else if (B && foo()) > Y; > else if (C && foo()) > Z; > else > W; > > If 'A' and not 'foo()', you want nothing to happen, but you will > actually get one of Y, Z, or W happening, all of which are wrong. Ouch, right. I shouldn't write patches in the middle of the night... > > > Btw, from my point of view the > > > > if (printk_ratelimit()) > > printk("print output"); > > > > looks odd. I just don't see why the API isn't > > > > printk_ratelimit("print output"); > > Very sensible. I've put > > #define printk_rl printk_ratelimit() ?: printk > > at the start of raid5.c, and used it to fix the problematic printks. Great! Thanks a lot, Bernd ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2008-04-28 21:33 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-04-12 18:16 [patch] limit error rate Bernd Schubert 2008-04-23 0:44 ` Dan Williams 2008-04-23 22:55 ` Bernd Schubert 2008-04-28 1:44 ` Neil Brown 2008-04-28 21:33 ` Bernd Schubert
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).