From: NeilBrown <neilb@suse.de>
To: Christian Balzer <chibi@gol.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: Fatal crash/hang in scsi_lib after RAID disk failure
Date: Tue, 3 Jul 2012 17:31:45 +1000 [thread overview]
Message-ID: <20120703173145.3825674e@notabene.brown> (raw)
In-Reply-To: <20120703161200.2904740f@batzmaru.gol.ad.jp>
[-- Attachment #1: Type: text/plain, Size: 4308 bytes --]
On Tue, 3 Jul 2012 16:12:00 +0900 Christian Balzer <chibi@gol.com> wrote:
> On Tue, 3 Jul 2012 16:45:28 +1000 NeilBrown wrote:
>
> > On Tue, 3 Jul 2012 15:10:38 +0900 Christian Balzer <chibi@gol.com> wrote:
> >
> > > On Tue, 3 Jul 2012 15:50:45 +1000 NeilBrown wrote:
> > >
> [snip]
> > > > That took *way* too long to find given how simple the fix is.
> > >
> > > Well, given how long it takes with some OSS projects, I'd say 4 days is
> > > pretty good. ^o^
> >
> > I meant the 4 hours of my time searching, not the 4 days of your time
> > waiting :-)
> >
> Hehehe, if you put it that way... ^o^
>
> >
> > >
> > > > I spent ages staring at the code, was about to reply and say "no idea"
> > > > when I thought I should test it myself. Test failed immediately.
> > >
> > > Could you elaborate a bit?
> > > As in, was this something introduced only very recently, since I had
> > > dozens of disks fail before w/o any such pyrotechnics.
> > > Or were there some special circumstances that triggered it?
> > > (But looking at the patch, I guess it should have been pretty
> > > universal)
> >
> > Bug was introduced by commit 58c54fcca3bac5bf9 which first appeared in
> > Linux 3.1. Since then, any read error on RAID10 will trigger the bug.
> >
> Ouch, that's a pretty substantial number of machines I'd reckon.
Could be. But they all seem to have very reliable disks. Except yours :-)
>
> But now I'm even more intrigued, how do you (or the md code) define a read
> error then?
The obvious way I guess.
> Remember this beauty here, which triggered the hunt and kill of the R10
> recovery bug of uneven member sets?
Looks like that was a write error. They are handled quite differently.
NeilBrown
> ---
> Jun 20 18:22:01 borg03b kernel: [1383357.792044] mptscsih: ioc0: attempting task abort! (sc=ffff88023c3c5180)
> Jun 20 18:22:01 borg03b kernel: [1383357.792049] sd 8:0:3:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
> Jun 20 18:22:06 borg03b kernel: [1383362.317346] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
> Jun 20 18:22:06 borg03b kernel: [1383362.317589] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.567292] mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000) cb_idx mptscsih_io_done
> Jun 20 18:22:06 borg03b kernel: [1383362.567316] mptscsih: ioc0: attempting target reset! (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.567321] sd 8:0:3:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
> Jun 20 18:22:06 borg03b kernel: [1383362.568040] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.568068] mptscsih: ioc0: attempting host reset! (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.568074] mptbase: ioc0: Initiating recovery
> Jun 20 18:22:29 borg03b kernel: [1383385.440045] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88023c3c5180)
> Jun 20 18:22:29 borg03b kernel: [1383385.484846] Device returned, unsetting inDMD
> Jun 20 18:22:39 borg03b kernel: [1383395.448043] sd 8:0:3:0: Device offlined - not ready after error recovery
> Jun 20 18:22:39 borg03b kernel: [1383395.448135] sd 8:0:3:0: rejecting I/O to offline device
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] end_request: I/O error, dev sdh, sector 71
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md: super_written gets error=-5, uptodate=0
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Disk failure on sdh1, disabling device.
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Operation continuing on 4 devices.
> ---
> That was a 3.2.18 kernel, but it didn't die and neither did the other
> cluster member with a very similar failure two weeks earlier.
>
> So I guess the device getting kicked out by the libata layer below is
> fine, but it returning medium errors triggers the bug?
>
> Anyways, time to patch stuff, thankfully this is the only production
> cluster I have with a 3.2 kernel using RAID10. ^.^;
>
> Regards,
>
> Christian
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
Thread overview: 6+ messages
2012-06-29 0:35 Fatal crash/hang in scsi_lib after RAID disk failure Christian Balzer
2012-07-03 5:50 ` NeilBrown
2012-07-03 6:10 ` Christian Balzer
2012-07-03 6:45 ` NeilBrown
2012-07-03 7:12 ` Christian Balzer
2012-07-03 7:31 ` NeilBrown [this message]