From: NeilBrown <neilb@suse.de>
To: Christian Balzer <chibi@gol.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: Fatal crash/hang in scsi_lib after RAID disk failure
Date: Tue, 3 Jul 2012 17:31:45 +1000 [thread overview]
Message-ID: <20120703173145.3825674e@notabene.brown> (raw)
In-Reply-To: <20120703161200.2904740f@batzmaru.gol.ad.jp>
[-- Attachment #1: Type: text/plain, Size: 4308 bytes --]
On Tue, 3 Jul 2012 16:12:00 +0900 Christian Balzer <chibi@gol.com> wrote:
> On Tue, 3 Jul 2012 16:45:28 +1000 NeilBrown wrote:
>
> > On Tue, 3 Jul 2012 15:10:38 +0900 Christian Balzer <chibi@gol.com> wrote:
> >
> > > On Tue, 3 Jul 2012 15:50:45 +1000 NeilBrown wrote:
> > >
> [snip]
> > > > That took *way* too long to find given how simple the fix is.
> > >
> > > Well, given how long it takes with some OSS projects, I'd say 4 days is
> > > pretty good. ^o^
> >
> > I meant the 4 hours of my time searching, not the 4 days of your time
> > waiting :-)
> >
> Hehehe, if you put it that way... ^o^
>
> >
> > >
> > > > I spent ages staring at the code, was about to reply and say "no idea"
> > > > when I thought I should test it myself. Test failed immediately.
> > >
> > > Could you elaborate a bit?
> > > As in, was this something introduced only very recently, since I had
> > > dozens of disks fail before w/o any such pyrotechnics.
> > > Or were there some special circumstances that triggered it?
> > > (But looking at the patch, I guess it should have been pretty
> > > universal)
> >
> > Bug was introduced by commit 58c54fcca3bac5bf9 which first appeared in
> > Linux 3.1. Since then, any read error on RAID10 will trigger the bug.
> >
> Ouch, that's a pretty substantial number of machines I'd reckon.
Could be. But they all seem to have very reliable disks. Except yours :-)
>
> But now I'm even more intrigued, how do you (or the md code) define a read
> error then?
The obvious way I guess.
> Remember this beauty here, which triggered the hunt and kill of the R10
> recovery bug of uneven member sets?
Looks like that was a write error. They are handled quite differently.
NeilBrown
> ---
> Jun 20 18:22:01 borg03b kernel: [1383357.792044] mptscsih: ioc0: attempting task abort! (sc=ffff88023c3c5180)
> Jun 20 18:22:01 borg03b kernel: [1383357.792049] sd 8:0:3:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
> Jun 20 18:22:06 borg03b kernel: [1383362.317346] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
> Jun 20 18:22:06 borg03b kernel: [1383362.317589] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.567292] mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000) cb_idx mptscsih_io_done
> Jun 20 18:22:06 borg03b kernel: [1383362.567316] mptscsih: ioc0: attempting target reset! (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.567321] sd 8:0:3:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
> Jun 20 18:22:06 borg03b kernel: [1383362.568040] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.568068] mptscsih: ioc0: attempting host reset! (sc=ffff88023c3c5180)
> Jun 20 18:22:06 borg03b kernel: [1383362.568074] mptbase: ioc0: Initiating recovery
> Jun 20 18:22:29 borg03b kernel: [1383385.440045] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88023c3c5180)
> Jun 20 18:22:29 borg03b kernel: [1383385.484846] Device returned, unsetting inDMD
> Jun 20 18:22:39 borg03b kernel: [1383395.448043] sd 8:0:3:0: Device offlined - not ready after error recovery
> Jun 20 18:22:39 borg03b kernel: [1383395.448135] sd 8:0:3:0: rejecting I/O to offline device
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] end_request: I/O error, dev sdh, sector 71
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md: super_written gets error=-5, uptodate=0
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Disk failure on sdh1, disabling device.
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Operation continuing on 4 devices.
> ---
> That was a 3.2.18 kernel, but it didn't die and neither did the other
> cluster member with a very similar failure two weeks earlier.
>
> So I guess the device getting kicked out by the libata layer below is
> fine, but it returning medium errors triggers the bug?
>
> Anyways, time to patch stuff, thankfully this is the only production
> cluster I have with a 3.2 kernel using RAID10. ^.^;
>
> Regards,
>
> Christian
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
Thread overview: 6+ messages
2012-06-29 0:35 Fatal crash/hang in scsi_lib after RAID disk failure Christian Balzer
2012-07-03 5:50 ` NeilBrown
2012-07-03 6:10 ` Christian Balzer
2012-07-03 6:45 ` NeilBrown
2012-07-03 7:12 ` Christian Balzer
2012-07-03 7:31 ` NeilBrown [this message]