public inbox for linux-scsi@vger.kernel.org
From: Paul Smith <paul@mad-scientist.net>
To: Mike Anderson <andmike@linux.vnet.ibm.com>
Cc: linux-scsi@vger.kernel.org
Subject: Re: [2.6.27.25] Hang in SCSI sync cache when a disk is removed--?
Date: Mon, 06 Jul 2009 14:04:13 -0400
Message-ID: <1246903453.9022.7246.camel@psmith-ubeta.netezza.com>
In-Reply-To: <20090702174151.GA17414@linux.vnet.ibm.com>

On Thu, 2009-07-02 at 10:41 -0700, Mike Anderson wrote:
> Paul Smith <paul@mad-scientist.net> wrote:
> > Hi all; we are seeing a problem where, when we pull a disk out of our
> > disk array (even one that's not actively being used), the entire IO
> > subsystem in Linux hangs.  Here are some details:
> > 
> > I have an IBM Bladecenter with an LSI EXP3000 SAS expander with 12 1TB
> > Seagate SAS disks.  Relevant lspci output for the SAS controllers:
> > 
> >         # lspci | grep LSI
> >         02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1064ET PCI-Express Fusion-MPT SAS (rev 02)
> >         08:01.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1064 PCI-X Fusion-MPT SAS (rev 03)
> >         14:01.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1064 PCI-X Fusion-MPT SAS (rev 03)
> 
> Have you tried a minimum level of logging like the following without the
> error going away?
> "sysctl -w dev.scsi.logging_level=4100"

We enabled this and were able to reproduce the hang.  Please see the log
output below.
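
For reference, a minimal sketch of how the 4100 value breaks down. It
assumes the logging level is a bitfield with 3 bits per facility, using
the facility names and shifts as I recall them from
drivers/scsi/scsi_logging.h; the table and the function name are
illustrative, so check them against your kernel tree.

```python
# Facility names and bit shifts are an assumption based on
# drivers/scsi/scsi_logging.h; verify against your kernel source.
SCSI_LOG_SHIFTS = [
    ("ERROR", 0), ("TIMEOUT", 3), ("SCAN", 6), ("MLQUEUE", 9),
    ("MLCOMPLETE", 12), ("LLQUEUE", 15), ("LLCOMPLETE", 18),
    ("HLQUEUE", 21), ("HLCOMPLETE", 24), ("IOCTL", 27),
]


def decode_logging_level(value):
    """Return {facility: level} for each facility with a non-zero 3-bit level."""
    levels = {}
    for name, shift in SCSI_LOG_SHIFTS:
        level = (value >> shift) & 0x7  # each facility owns 3 bits
        if level:
            levels[name] = level
    return levels


if __name__ == "__main__":
    # 4100 == 0x1004
    print(decode_logging_level(4100))
```

Under that assumed layout, 4100 (0x1004) selects error-handling
messages at level 4 and mid-layer completion messages at level 1.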

> Can you run "cat /sys/class/scsi_host/*/state" when you are in the hung
> state?
> 
> If the host is in recovery no IOs will move forward. I assume if you can
> get a run with the 4100 level of logging it will show a host reset sent,
> but no waking up host to restart (unless the reset is being generated for
> other reasons outside of the scsi error handler).

This seems to be the case, yes.  The output of the above-requested cat
is:

# cat /sys/class/scsi_host/host*/state
running
running
running
recovery
running

# cat /sys/class/scsi_host/host3/state
recovery

# ls -l /sys/class/scsi_host/host3/device
lrwxrwxrwx    1 root     root            0 Jul  6 13:27 /sys/class/scsi_host/host3/device -> ../../../devices/pci0000:00/0000:00:04.0/0000:11:00.0/0000:13:08.0/0000:14:01.0/host3

Note that host3 is ioc1.
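
The check above can also be scripted so that a stuck host is reported
by name rather than by position in the cat output. This is only a
sketch: the function name and its parameters are made up for
illustration, and it relies on nothing beyond the per-host "state"
files shown above.

```python
import glob
import os


def hosts_in_state(wanted="recovery", sysfs_root="/sys/class/scsi_host"):
    """Return sorted names of SCSI hosts whose sysfs 'state' equals `wanted`."""
    matches = []
    for state_path in glob.glob(os.path.join(sysfs_root, "host*", "state")):
        try:
            with open(state_path) as f:
                state = f.read().strip()
        except OSError:
            continue  # host went away between the glob and the open
        if state == wanted:
            # .../host3/state -> "host3"
            matches.append(os.path.basename(os.path.dirname(state_path)))
    return sorted(matches)


if __name__ == "__main__":
    for host in hosts_in_state("recovery"):
        print(host, "is in recovery")
```

Run in the hung state described here, it would be expected to name
host3 only.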

Thanks for looking at this; it's really killing us.


kernel log output:
------------------
Jul  6 13:31:18  user.info kernel: mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:31:18  user.info kernel: mptbase: ioc1: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:31:27  user.info kernel: mptbase: ioc1: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:31:27  user.info kernel: mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:31:35  user.info kernel: mptbase: ioc1: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:31:35  user.info kernel: sd 3:0:1:0: [sdn] Result: hostbyte=0x02 driverbyte=0x00
Jul  6 13:31:35  user.warn kernel: device-mapper: multipath: Failing path 8:208.
Jul  6 13:31:35  daemon.notice multipathd: 8:208: mark as failed
Jul  6 13:31:35  daemon.notice multipathd: encl1Slot2: remaining active paths: 1
Jul  6 13:31:43  user.info kernel: mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:31:43  user.info kernel: sd 2:0:1:0: [sdb] <6>mptscsih: ioc0: attempting task abort! (sc=ffff8804623e4b40)
Jul  6 13:31:43  user.info kernel: sd 2:0:1:0: [sdb] CDB: cdb[0]=0x12: 12 01 80 00 fe 00
Jul  6 13:31:43  user.info kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff8804623e4b40)
Jul  6 13:31:43  user.warn kernel: Result: hostbyte=0x02 driverbyte=0x00
Jul  6 13:31:43  user.warn kernel: device-mapper: multipath: Failing path 8:16.
Jul  6 13:31:43  daemon.notice multipathd: 8:16: mark as failed
Jul  6 13:31:43  daemon.notice multipathd: encl1Slot2: remaining active paths: 0
Jul  6 13:31:51  user.info kernel: mptbase: ioc1: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:31:51  user.info kernel: mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:31:51  user.info kernel: mptscsih: ioc0: attempting task abort! (sc=ffff880463a6b680)
Jul  6 13:31:51  user.info kernel: sd 2:0:1:0: [sdb] CDB: 
Jul  6 13:31:51  user.info kernel: cdb[0]=0x12
Jul  6 13:31:51  user.info kernel: : 12
Jul  6 13:31:51  user.info kernel:  01
Jul  6 13:31:51  user.info kernel:  80 00
Jul  6 13:31:51  user.info kernel:  fe 00
Jul  6 13:31:51  user.info kernel: 
Jul  6 13:31:51  user.info kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff880463a6b680)
Jul  6 13:31:59  daemon.notice multipathd: sdn: readsector0 checker reports path is down
Jul  6 13:31:59  user.info kernel: mptbase: ioc1: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:31:59  user.info kernel: mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:31:59  user.info kernel: mptscsih: ioc0: attempting target reset! (sc=ffff8804623e4b40)
Jul  6 13:31:59  user.info kernel: sd 2:0:1:0: [sdb] CDB: cdb[0]=0x12
Jul  6 13:31:59  user.info kernel: : 12
Jul  6 13:31:59  user.info kernel:  01 80
Jul  6 13:31:59  user.info kernel:  00 fe
Jul  6 13:31:59  user.info kernel:  00
Jul  6 13:31:59  user.info kernel: 
Jul  6 13:31:59  user.info kernel: mptscsih: ioc0: target reset: SUCCESS (sc=ffff8804623e4b40)
Jul  6 13:32:07  user.info kernel: mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:32:07  user.info kernel: mptscsih: ioc0: attempting bus reset! (sc=ffff8804623e4b40)
Jul  6 13:32:07  user.info kernel: sd 2:0:1:0: [sdb] CDB: 
Jul  6 13:32:07  user.info kernel: cdb[0]=0x12
Jul  6 13:32:07  user.info kernel: : 12
Jul  6 13:32:07  user.info kernel:  01
Jul  6 13:32:07  user.info kernel:  80 00
Jul  6 13:32:07  user.info kernel:  fe
Jul  6 13:32:07  user.info kernel:  00
Jul  6 13:32:11  user.info kernel: mptscsih: ioc0: bus reset: SUCCESS (sc=ffff8804623e4b40)
Jul  6 13:32:23  user.info kernel: mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:32:31  user.info kernel: mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:32:31  user.info kernel: mptscsih: ioc0: attempting host reset! (sc=ffff8804623e4b40)
Jul  6 13:32:42  user.info kernel: mptscsih: ioc0: host reset: SUCCESS (sc=ffff8804623e4b40)
Jul  6 13:32:43  user.info kernel: mptbase: ioc0: LogInfo(0x30030501): Originator={IOP}, Code={Invalid Page}, SubCode(0x0501)
Jul  6 13:32:43  user.info kernel: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 27, phy 1, sas_addr 0x5000c5000d298821
Jul  6 13:32:43  user.debug kernel:  phy-2:2:79: mptsas: ioc0: delete phy 1, phy-obj (0xffff88046a5e6800)
Jul  6 13:32:43  user.debug kernel:  port-2:2:1: mptsas: ioc0: delete port 1, sas_addr (0x5000c5000d298821)
Jul  6 13:32:43  user.notice kernel: sd 2:0:1:0: [sdb] Synchronizing SCSI cache
Jul  6 13:32:52  user.info kernel: sd 2:0:1:0: Device offlined - not ready after error recovery
Jul  6 13:32:53  user.info kernel: sd 2:0:1:0: [sdb] md: super_written gets error=-5, uptodate=0
Jul  6 13:32:53  user.alert kernel: raid1: Disk failure on dm-13, disabling device.
Jul  6 13:32:53  user.alert kernel: raid1: Operation continuing on 1 devices.
Jul  6 13:32:53  daemon.notice multipathd: sdb: readsector0 checker reports path is down
Jul  6 13:32:53  user.warn kernel: Result: hostbyte=0x01 driverbyte=0x00
Jul  6 13:32:53  user.info kernel: mptbase: ioc0: LogInfo(0x30030501): Originator={IOP}, Code={Invalid Page}, SubCode(0x0501)
Jul  6 13:32:53  user.warn kernel: RAID1 conf printout:
Jul  6 13:32:53  user.warn kernel:  --- wd:1 rd:2
Jul  6 13:32:53  user.warn kernel:  disk 0, wo:0, o:1, dev:dm-6
Jul  6 13:32:53  user.warn kernel:  disk 1, wo:1, o:0, dev:dm-13
Jul  6 13:32:53  user.warn kernel: RAID1 conf printout:
Jul  6 13:32:53  user.warn kernel:  --- wd:1 rd:2
Jul  6 13:32:53  user.warn kernel:  disk 0, wo:0, o:1, dev:dm-6
Jul  6 13:32:55  daemon.notice multipathd: sdn: readsector0 checker reports path is down
Jul  6 13:32:55  user.info kernel: mptbase: ioc1: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:33:02  user.info kernel: mptscsih: ioc1: attempting task abort! (sc=ffff880461f6a280)
Jul  6 13:33:02  user.info kernel: sd 3:0:1:0: CDB: 
Jul  6 13:33:02  user.info kernel: cdb[0]=0x12
Jul  6 13:33:02  user.info kernel: :
Jul  6 13:33:02  user.info kernel:  12
Jul  6 13:33:02  user.info kernel:  00 00
Jul  6 13:33:02  user.info kernel:  00 24
Jul  6 13:33:02  user.info kernel:  00
Jul  6 13:33:02  user.info kernel: 
Jul  6 13:33:02  user.info kernel: mptbase: ioc1: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
Jul  6 13:33:02  user.info kernel: mptscsih: ioc1: task abort: SUCCESS (sc=ffff880461f6a280)
Jul  6 13:33:03  user.info kernel: mptbase: ioc1: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:33:03  user.info kernel: mptscsih: ioc1: attempting task abort! (sc=ffff88046c04a000)
Jul  6 13:33:03  user.info kernel: sd 3:0:1:0: [sdn] CDB: 
Jul  6 13:33:03  user.info kernel: cdb[0]=0x12
Jul  6 13:33:03  user.info kernel: : 12
Jul  6 13:33:03  user.info kernel:  01
Jul  6 13:33:03  user.info kernel:  80 00
Jul  6 13:33:03  user.info kernel:  fe
Jul  6 13:33:03  user.info kernel:  00
Jul  6 13:33:03  user.info kernel: 
Jul  6 13:33:03  user.info kernel: mptscsih: ioc1: task abort: SUCCESS (sc=ffff88046c04a000)
Jul  6 13:33:04  user.err kernel: scsi 2:0:1:0: rejecting I/O to dead device
Jul  6 13:33:04  daemon.notice multipathd: sdb: readsector0 checker reports path is down
Jul  6 13:33:11  user.info kernel: mptbase: ioc1: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:33:11  user.info kernel: mptscsih: ioc1: attempting task abort! (sc=ffff880463a6adc0)
Jul  6 13:33:11  user.info kernel: sd 3:0:1:0: [sdn] CDB: 
Jul  6 13:33:11  user.info kernel: cdb[0]=0x12
Jul  6 13:33:11  user.info kernel: : 12
Jul  6 13:33:11  user.info kernel:  01
Jul  6 13:33:11  user.info kernel:  80 00
Jul  6 13:33:11  user.info kernel:  fe
Jul  6 13:33:11  user.info kernel:  00
Jul  6 13:33:11  user.info kernel: mptscsih: ioc1: task abort: SUCCESS (sc=ffff880463a6adc0)
Jul  6 13:33:19  user.info kernel: mptbase: ioc1: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:33:19  user.info kernel: mptscsih: ioc1: attempting target reset! (sc=ffff880461f6a280)
Jul  6 13:33:19  user.info kernel: sd 3:0:1:0: CDB: 
Jul  6 13:33:19  user.info kernel: cdb[0]=0x12
Jul  6 13:33:19  user.info kernel: : 12
Jul  6 13:33:19  user.info kernel:  00 00
Jul  6 13:33:19  user.info kernel:  00 24
Jul  6 13:33:19  user.info kernel:  00
Jul  6 13:33:19  user.info kernel: 
Jul  6 13:33:19  user.info kernel: mptscsih: ioc1: target reset: SUCCESS (sc=ffff880461f6a280)
Jul  6 13:33:27  user.info kernel: mptbase: ioc1: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jul  6 13:33:27  user.info kernel: mptscsih: ioc1: attempting bus reset! (sc=ffff880461f6a280)
Jul  6 13:33:27  user.info kernel: sd 3:0:1:0: CDB: 
Jul  6 13:33:27  user.info kernel: cdb[0]=0x12
Jul  6 13:33:27  user.info kernel: :
Jul  6 13:33:27  user.info kernel:  12
Jul  6 13:33:27  user.info kernel:  00
Jul  6 13:33:27  user.info kernel:  24
Jul  6 13:33:27  user.info kernel:  00
Jul  6 13:33:27  user.info kernel: 
Jul  6 13:33:31  user.info kernel: mptscsih: ioc1: bus reset: SUCCESS (sc=ffff880461f6a280)
Jul  6 13:33:42  user.info kernel: mptbase: ioc1: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
Jul  6 13:33:42  user.info kernel: mptscsih: ioc1: attempting host reset! (sc=ffff880461f6a280)
Jul  6 13:33:42  user.info kernel: mptsas: ioc1: removing ssp device: fw_channel 0, fw_id 62, phy 1, sas_addr 0x5000c5000d298822
Jul  6 13:33:42  user.debug kernel:  phy-3:2:79: mptsas: ioc1: delete phy 1, phy-obj (0xffff88046de9dc00)
Jul  6 13:33:42  user.debug kernel:  port-3:2:1: mptsas: ioc1: delete port 1, sas_addr (0x5000c5000d298822)
Jul  6 13:33:42  user.notice kernel: sd 3:0:1:0: [sdn] Synchronizing SCSI cache
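
The LogInfo words in the log above can be split into fields by hand.
The sketch below assumes the layout I believe mptbase uses for SAS
LogInfo values (bus type in the top 4 bits, then a 4-bit originator,
an 8-bit code and a 16-bit subcode); the function name and the
originator table are illustrative. The results agree with the
Originator and SubCode strings the driver printed.

```python
# Originator values assumed from the mptbase SAS LogInfo decoding.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_sas_loginfo(loginfo):
    """Split a 32-bit Fusion-MPT SAS LogInfo word into its assumed fields."""
    return {
        "bus_type": (loginfo >> 28) & 0xF,   # 3 is SAS under this layout
        "originator": ORIGINATORS.get((loginfo >> 24) & 0xF, "unknown"),
        "code": (loginfo >> 16) & 0xFF,
        "subcode": loginfo & 0xFFFF,
    }


if __name__ == "__main__":
    for word in (0x31170000, 0x31130000, 0x30030501):
        print(hex(word), decode_sas_loginfo(word))
```

With this split, 0x31170000 is originator PL with code 0x17, the value
logged as "IO Device Missing Delay Retry", and 0x30030501 is originator
IOP with code 0x03 and subcode 0x0501, logged as "Invalid Page".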




