linux-raid.vger.kernel.org archive mirror
* sd takes drive offline but md does not know
@ 2008-11-29  8:19 Richard Scobie
  2008-11-30  1:57 ` David Lethe
  0 siblings, 1 reply; 3+ messages in thread
From: Richard Scobie @ 2008-11-29  8:19 UTC
  To: Linux RAID Mailing List

I have a system running 2.6.26.6-79.fc9.x86_64 with a 16-drive SATA md
RAID6 behind an LSI 1068 SAS controller.

The current stable version of smartmontools cannot be started at boot 
time if samba is also started at the same time - see:

http://marc.info/?l=smartmontools-support&m=122518510306493&w=2

Up until today, for about a month, I have been able to run smartd and
issue smartctl commands without problems.

Today I smartctl'ed a drive (sdr) in the array and the drive was reset 
and finally offlined.

Is it to be expected that, in this scenario, md remained unaware of the
offlining and /proc/mdstat still showed the drive as present?

Only when the array is unmounted, and possibly when filesystem activity
occurs, do things fall over badly. In this case external ssh and console
access hung and a reset was required. The log shows nothing of note
after the following until the machine reboots:

Nov 29 13:12:56 avidstorage kernel: mptscsih: ioc0: attempting task abort! (sc=ffff810226524dc0)
Nov 29 13:12:56 avidstorage kernel: sd 8:0:15:0: [sdr] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
Nov 29 13:12:58 avidstorage kernel: mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Nov 29 13:12:58 avidstorage kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff810226524dc0)
Nov 29 13:13:08 avidstorage kernel: mptscsih: ioc0: attempting task abort! (sc=ffff810226524dc0)
Nov 29 13:13:08 avidstorage kernel: sd 8:0:15:0: [sdr] CDB: Test Unit Ready: 00 00 00 00 00 00
Nov 29 13:13:10 avidstorage kernel: mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Nov 29 13:13:10 avidstorage kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff810226524dc0)
Nov 29 13:13:10 avidstorage kernel: mptscsih: ioc0: attempting target reset! (sc=ffff810226524dc0)
Nov 29 13:13:10 avidstorage kernel: sd 8:0:15:0: [sdr] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
Nov 29 13:13:12 avidstorage kernel: mptscsih: ioc0: Issue of TaskMgmt failed!
Nov 29 13:13:12 avidstorage kernel: mptscsih: ioc0: target reset: FAILED (sc=ffff810226524dc0)
Nov 29 13:13:12 avidstorage kernel: mptscsih: ioc0: attempting bus reset! (sc=ffff810226524dc0)
Nov 29 13:13:12 avidstorage kernel: sd 8:0:15:0: [sdr] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
Nov 29 13:13:20 avidstorage kernel: mptscsih: ioc0: bus reset: SUCCESS (sc=ffff810226524dc0)
Nov 29 13:13:40 avidstorage kernel: mptscsih: ioc0: attempting task abort! (sc=ffff810226524dc0)
Nov 29 13:13:40 avidstorage kernel: sd 8:0:15:0: [sdr] CDB: Test Unit Ready: 00 00 00 00 00 00
Nov 29 13:13:42 avidstorage kernel: mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
Nov 29 13:13:42 avidstorage kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff810226524dc0)
Nov 29 13:13:42 avidstorage kernel: mptscsih: ioc0: attempting host reset! (sc=ffff810226524dc0)
Nov 29 13:13:42 avidstorage kernel: mptbase: ioc0: Initiating recovery
Nov 29 13:13:57 avidstorage kernel: mptscsih: ioc0: host reset: SUCCESS (sc=ffff810226524dc0)
Nov 29 13:13:57 avidstorage kernel: sd 8:0:15:0: Device offlined - not ready after error recovery
Nov 29 13:18:05 avidstorage ntpd[3101]: kernel time sync status change 4001
Nov 29 13:26:40 avidstorage smartd[3468]: Device: /dev/sdr, No such device or address, open() failed
Nov 29 13:26:40 avidstorage smartd[3468]: Sending warning via mail to root@sauce.co.nz ...
Nov 29 13:26:40 avidstorage smartd[3468]: Warning via mail to root@sauce.co.nz: successful
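
For readers who want to see which command was being aborted, the CDB
bytes in the log above can be decoded with a short script. This is only
a sketch: the byte offsets follow my reading of the SCSI/ATA Translation
(SAT) layout for ATA PASS-THROUGH (16), the function and table names are
made up for illustration, and the subcommand table is deliberately
incomplete.

```python
# Sketch: decode the 16-byte ATA PASS-THROUGH CDB quoted in the log.
# Field offsets are an assumption based on the SAT layout; verify them
# against the specification before relying on the result.

LOGGED_CDB = "85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00"

# Partial table of ATA Features values used with the SMART command (0xB0).
SMART_SUBCOMMANDS = {
    0xD0: "SMART READ DATA",
    0xD5: "SMART READ LOG",
    0xD8: "SMART ENABLE OPERATIONS",
    0xDA: "SMART RETURN STATUS",
}


def decode_ata_passthrough16(cdb_hex):
    """Split an ATA PASS-THROUGH (16) CDB into its ATA register fields."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH (16) CDB")
    return {
        "protocol": (cdb[1] >> 1) & 0x0F,   # 4 is PIO Data-In
        "extend": cdb[1] & 0x01,            # 1 for 48-bit commands
        "from_device": bool(cdb[2] & 0x08), # T_DIR bit
        "features": cdb[4],
        "sector_count": cdb[6],
        "lba_low": cdb[8],
        "lba_mid": cdb[10],
        "lba_high": cdb[12],
        "device": cdb[13],
        "command": cdb[14],
    }


def describe(fields):
    """Return a one-line, human-readable summary of the decoded fields."""
    if fields["command"] == 0xB0:
        name = SMART_SUBCOMMANDS.get(fields["features"], "unknown SMART subcommand")
    else:
        name = "non-SMART command"
    return "ATA command 0x%02x (%s), LBA low 0x%02x, count %d" % (
        fields["command"], name, fields["lba_low"], fields["sector_count"])


if __name__ == "__main__":
    print(describe(decode_ata_passthrough16(LOGGED_CDB)))
```

Under those assumptions the aborted command decodes as an ATA SMART
command (0xB0) with the SMART READ LOG subcommand and the usual
0x4F/0xC2 signature in LBA mid/high, which is consistent with a smartctl
query being the request that timed out.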


Regards,

Richard


* RE: sd takes drive offline but md does not know
From: David Lethe @ 2008-11-30  1:57 UTC
  To: Richard Scobie, Linux RAID Mailing List

What firmware, driver and BIOS versions is the LSI controller running,
and what is the exact model number?

Several things to consider:
 - If you enabled SMART yourself rather than telling the controller to
enable SMART for the individual drives, then this will cause a problem
depending on the specifics of what you have, especially if the
controller is running the RAID firmware.
 - There are firmware issues with some LSI chipsets and
driver/BIOS/MPT-library revision combinations which can cause bus
resets. In this case, the bus reset made the controller think the disk
had timed out on whatever I/O operations the LSI controller told it to
perform, so the controller took the disk offline.

My suggestion is to go to the MPT BIOS screen and enable SMART for all
disks, and let the controller manage it.
Although you didn't say what firmware you have, let me also suggest
that if the LSI controller is running the RAID version of the firmware,
rather than the -IT (non-RAID) version, you flash the IT firmware.
You'll get better performance.

Note, don't change firmware from RAID to non-RAID or vice versa with
live data.  The number of blocks and the location of metadata for the
RAID firmware are somewhat dependent on what you have and what you are
going to.

David @ santools.com




* Re: sd takes drive offline but md does not know
From: Richard Scobie @ 2008-11-30  7:34 UTC
  To: David Lethe; +Cc: Linux RAID Mailing List

David Lethe wrote:

> What firmware, drivers & BIOS is the LSI controller running, and what is
> exact model number?

The card is the SAS3442E-R using the B3 version of the 1068 controller,
and it has the latest public versions of the BIOS and the IT version of
the firmware.

> Several things to consider
>  - if you enabled SMART rather than telling the controller to enable
> SMART for the individual drives, then this will cause a problem
> depending on specifics of what you have .. especially if the controller
> is running the RAID firmware.
>  - There are firmware issues with some LSI chipsets and
> driver/bios/MPT-library revision logic which can cause bus resets.   In
> this case, the bus reset made the controller think the disk timed out to
> whatever I/O operations the LSI controller told it to perform ... so the
> controller took disk to offline state.


At this stage I am no longer concerned about using smartmontools. The
card has performed flawlessly in all other respects, so I will simply
avoid smartmontools in future.

I am concerned that when the drive was offlined, md was not made aware 
of it. Perhaps this is to be expected?
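
One way to spot this kind of mismatch without waiting for md to notice
is to compare /proc/mdstat against the SCSI device state that sysfs
exposes in /sys/block/<disk>/device/state. The sketch below does that.
As far as I understand, md only marks a member faulty once an I/O it
issued to that member fails, so an idle array may keep listing a drive
that the SCSI layer has already offlined; treat that as an assumption.
The function names are hypothetical, the mdstat parsing is simplified,
and the partition-to-disk mapping only handles sdX-style names.

```python
# Sketch: list md members that md still considers healthy although the
# SCSI layer reports the underlying disk as something other than "running".
import re
from pathlib import Path


def md_members(mdstat_text):
    """Parse /proc/mdstat text into {array: {member: is_faulty}}."""
    arrays = {}
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\S*)\s*:\s*(.*)$", line)
        if not m:
            continue
        members = {}
        # Members look like "sdr1[15]" with optional flags such as "(F)".
        for dev, flags in re.findall(r"(\S+?)\[\d+\]((?:\([A-Z]\))*)", m.group(2)):
            members[dev] = "(F)" in flags
        arrays[m.group(1)] = members
    return arrays


def sysfs_state(disk):
    """Return the SCSI state string for a whole disk, or None if unreadable."""
    try:
        return Path("/sys/block/%s/device/state" % disk).read_text().strip()
    except OSError:
        return None


def stale_members(mdstat_text, state_of=sysfs_state):
    """Return (array, member, state) for members md has not failed but sd has."""
    stale = []
    for array, members in md_members(mdstat_text).items():
        for member, faulty in members.items():
            disk = re.sub(r"\d+$", "", member)  # "sdr1" -> "sdr" (simplification)
            state = state_of(disk)
            if not faulty and state != "running":
                stale.append((array, member, state))
    return stale


if __name__ == "__main__":
    for array, member, state in stale_members(Path("/proc/mdstat").read_text()):
        print("%s: %s is still listed by md but sd reports %r" % (array, member, state))
```

Run from cron or by hand after a reset storm like the one in the log,
this would only report the discrepancy; failing the member out of the
array would still be a manual step.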

Unfortunately this machine is in production now, so I cannot really 
participate in any more testing/debugging.

Regards,

Richard
