* Raid failure - drives or controller?
@ 2012-03-07 17:52 Danilo Godec
2012-03-07 19:33 ` Ray Morris
2012-03-07 21:11 ` Mathias Burén
0 siblings, 2 replies; 5+ messages in thread
From: Danilo Godec @ 2012-03-07 17:52 UTC (permalink / raw)
To: linux-raid
Hi,
I had two drive failures on a RAID5 within a short time (unfortunately too
short to rebuild onto a spare disk). However, the drives seem to work on a
test machine and didn't report any errors. I also put them back into the
original server (after rebooting) and they work now.
The first drive's errors were:
> Mar 6 05:15:19 san1 kernel: [10681162.473960] sd 4:0:3:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Mar 6 05:15:19 san1 kernel: [10681162.473965] sd 4:0:3:0: [sde] Sense Key : Aborted Command [current]
> Mar 6 05:15:19 san1 kernel: [10681162.473969] sd 4:0:3:0: [sde] Add. Sense: No additional sense information
> Mar 6 05:15:19 san1 kernel: [10681162.473973] sd 4:0:3:0: [sde] CDB: Read(10): 28 00 07 af 38 3f 00 00 08 00
> Mar 6 05:15:19 san1 kernel: [10681162.473980] end_request: I/O error, dev sde, sector 128923711
> Mar 6 05:17:53 san1 kernel: [10681316.885221] sd 4:0:3:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Mar 6 05:17:53 san1 kernel: [10681316.885225] sd 4:0:3:0: [sde] Sense Key : Illegal Request [current]
> Mar 6 05:17:53 san1 kernel: [10681316.885229] sd 4:0:3:0: [sde] Add. Sense: Logical block address out of range
> Mar 6 05:17:53 san1 kernel: [10681316.885234] sd 4:0:3:0: [sde] CDB: Write(10): 2a 08 74 70 58 c7 00 00 08 00
> Mar 6 05:17:53 san1 kernel: [10681316.885242] end_request: I/O error, dev sde, sector 1953519815
> Mar 6 05:17:53 san1 kernel: [10681316.885246] end_request: I/O error, dev sde, sector 1953519815
> Mar 6 05:17:53 san1 kernel: [10681316.885252] raid5: Disk failure on sde1, disabling device.
> Mar 6 05:20:27 san1 kernel: [10681470.600610] sd 4:0:3:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Mar 6 05:20:27 san1 kernel: [10681470.600615] sd 4:0:3:0: [sde] Sense Key : Illegal Request [current]
> Mar 6 05:20:27 san1 kernel: [10681470.600619] sd 4:0:3:0: [sde] Add. Sense: Logical block address out of range
> Mar 6 05:20:27 san1 kernel: [10681470.600624] sd 4:0:3:0: [sde] CDB: Write(10): 2a 08 74 70 59 27 00 00 08 00
> Mar 6 05:20:27 san1 kernel: [10681470.600631] end_request: I/O error, dev sde, sector 1953519911
> Mar 6 05:20:27 san1 kernel: [10681470.600636] end_request: I/O error, dev sde, sector 1953519911
> Mar 6 05:20:28 san1 kernel: [10681471.664682] disk 3, o:0, dev:sde1
> Mar 6 05:21:47 san1 kernel: [10681549.746852] sd 4:0:3:0: [sde] Synchronizing SCSI cache
> Mar 6 05:21:47 san1 kernel: [10681549.746905] sd 4:0:3:0: [sde] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
The second drive did this:
> Mar 7 02:31:37 san1 kernel: [10757598.197391] sd 4:0:5:0: [sdg] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Mar 7 02:31:37 san1 kernel: [10757598.197396] sd 4:0:5:0: [sdg] Sense Key : Aborted Command [current]
> Mar 7 02:31:37 san1 kernel: [10757598.197400] sd 4:0:5:0: [sdg] Add. Sense: No additional sense information
> Mar 7 02:31:37 san1 kernel: [10757598.197404] sd 4:0:5:0: [sdg] CDB: Read(10): 28 00 07 12 05 9f 00 00 10 00
> Mar 7 02:31:37 san1 kernel: [10757598.197411] end_request: I/O error, dev sdg, sector 118621599
> Mar 7 02:31:37 san1 kernel: [10757598.583990] raid5: Disk failure on sdg1, disabling device.
> Mar 7 02:31:37 san1 kernel: [10757598.616232] disk 5, o:0, dev:sdg1
Can anyone make some actual sense out of these sense messages?
Are these drives really / likely bad or is it more likely it was a
controller failure?
D.
--
Danilo Godec, system administration
Visit our updated web page at www.agenda.si
OPEN SOURCE AND LINUX
SERVICES : BUSINESS SOLUTIONS : IT MANAGEMENT : IT INFRASTRUCTURE : TRAINING : SOFTWARE
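
As a partial answer to "making sense of the sense messages": the CDB bytes in the logs above can be decoded by hand. The following is a minimal Python sketch, not a tool from this thread; the function name `decode_rw10` is made up for illustration. It assumes the standard 10-byte READ(10)/WRITE(10) layout (opcode in byte 0, FUA flag in bit 3 of byte 1, big-endian LBA in bytes 2-5, big-endian transfer length in bytes 7-8).

```python
def decode_rw10(cdb_hex):
    """Decode a 10-byte SCSI READ(10)/WRITE(10) CDB given as hex text.

    Hypothetical helper for illustration only; it handles just the two
    opcodes that appear in the kernel log.
    """
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is skipped
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if cdb[0] not in opcodes:
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    return {
        "op": opcodes[cdb[0]],
        "fua": bool(cdb[1] & 0x08),                # Force Unit Access bit
        "lba": int.from_bytes(cdb[2:6], "big"),     # first logical block
        "blocks": int.from_bytes(cdb[7:9], "big"),  # transfer length
    }


if __name__ == "__main__":
    for line in (
        "28 00 07 af 38 3f 00 00 08 00",  # sde, 05:15:19
        "2a 08 74 70 58 c7 00 00 08 00",  # sde, 05:17:53
        "2a 08 74 70 59 27 00 00 08 00",  # sde, 05:20:27
        "28 00 07 12 05 9f 00 00 10 00",  # sdg, 02:31:37
    ):
        print(line, "->", decode_rw10(line))
```

The decoded LBAs equal the sector numbers in the matching `end_request: I/O error` lines (for example, `07 af 38 3f` is 128923711), so the kernel is reporting the first block of each failed command. The two failed writes on sde also have the FUA bit set.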
* Re: Raid failure - drives or controller?
2012-03-07 17:52 Raid failure - drives or controller? Danilo Godec
@ 2012-03-07 19:33 ` Ray Morris
2012-03-07 21:08 ` Danilo Godec
2012-03-07 21:11 ` Mathias Burén
1 sibling, 1 reply; 5+ messages in thread
From: Ray Morris @ 2012-03-07 19:33 UTC (permalink / raw)
To: Danilo Godec; +Cc: linux-raid
> drives or controller?
Don't forget cables and loose connections.
We've had more cable problems than anything else.
--
Ray Morris
support@bettercgi.com
Strongbox - The next generation in site security:
http://www.bettercgi.com/strongbox/
Throttlebox - Intelligent Bandwidth Control
http://www.bettercgi.com/throttlebox/
Strongbox / Throttlebox affiliate program:
http://www.bettercgi.com/affiliates/user/register.php
* Re: Raid failure - drives or controller?
2012-03-07 19:33 ` Ray Morris
@ 2012-03-07 21:08 ` Danilo Godec
0 siblings, 0 replies; 5+ messages in thread
From: Danilo Godec @ 2012-03-07 21:08 UTC (permalink / raw)
To: linux-raid; +Cc: Ray Morris
On 7.3.2012 20:33, Ray Morris wrote:
>> drives or controller?
> Don't forget cables and loose connections.
> We've had more cable problems than anything else.
Sorry, I should probably have mentioned the hardware specs in the first place:
The server is an Intel SR2612UR with a 12-drive SAS/SATA backplane, an
Attotech ExpressSAS H608 (PCIe) SAS controller and a single Mini-SAS
SFF-8087 cable between the controller and the backplane.
The drives are 'WD1002FBYS' (WD RE3 'enterprise' models), 7 of them (6 in
RAID5 + 1 spare).
The OS (OpenSuSE 11.4 x86_64) is installed on a separate SATA SSD,
connected to the on-board SATA controller. The driver for the SAS controller
is 'esas2hba' version 1.65, supplied by Attotech.
D.
* Re: Raid failure - drives or controller?
2012-03-07 17:52 Raid failure - drives or controller? Danilo Godec
2012-03-07 19:33 ` Ray Morris
@ 2012-03-07 21:11 ` Mathias Burén
2012-03-07 21:15 ` Danilo Godec
1 sibling, 1 reply; 5+ messages in thread
From: Mathias Burén @ 2012-03-07 21:11 UTC (permalink / raw)
To: Danilo Godec; +Cc: linux-raid
On 7 March 2012 17:52, Danilo Godec <danilo.godec@agenda.si> wrote:
> Can anyone make some actual sense out of these sense messages?
>
> Are these drives really / likely bad or is it more likely it was a
> controller failure?
Can you please post the smartctl -a (from smartmontools) output for both drives?
Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Raid failure - drives or controller?
2012-03-07 21:11 ` Mathias Burén
@ 2012-03-07 21:15 ` Danilo Godec
0 siblings, 0 replies; 5+ messages in thread
From: Danilo Godec @ 2012-03-07 21:15 UTC (permalink / raw)
To: Mathias Burén; +Cc: linux-raid
On 7.3.2012 22:11, Mathias Burén wrote:
>
> Can you please post the smartctl -a (from smartmontools) output for both drives?
>
> Mathias
> --
# smartctl -a /dev/sde
> smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (openSUSE RPM)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Device: WDC WD1002FBYS-02A6B Version: 0C06
> Serial number: WD-WMATV6536667
> Device type: disk
> scsiModePageOffset: response length too short, resp_len=12 offset=12 bd_len=8
> Local Time is: Wed Mar 7 22:13:35 2012 CET
> Device does not support SMART
>
> Current Drive Temperature: <not available>
>
> Error Counter logging not supported
>
> [GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
>
> SMART Self-test log
> Num  Test         Status     segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description             number   (hours)
> #256 Default      Completed     -      12946          -       [-   -    -]
> #512 Default      Completed     -      12946          -       [-   -    -]
> #768 Default      Completed     -        554          -       [-   -    -]
# smartctl -a /dev/sdg
> smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (openSUSE RPM)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Device: WDC WD1002FBYS-02A6B Version: 0C06
> Serial number: WD-WMATV6535018
> Device type: disk
> scsiModePageOffset: response length too short, resp_len=12 offset=12 bd_len=8
> Local Time is: Wed Mar 7 22:13:40 2012 CET
> Device does not support SMART
>
> Current Drive Temperature: <not available>
>
> Error Counter logging not supported
>
> [GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
>
> SMART Self-test log
> Num  Test         Status     segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description             number   (hours)
> #256 Default      Completed     -       1359          -       [-   -    -]
> #512 Default      Completed     -       1040          -       [-   -    -]
> #768 Default      Completed     -        581          -       [-   -    -]
Thanks, Danilo
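
A note on the output above: "Device does not support SMART" on a SATA drive (WDC WD1002FBYS) suggests smartctl addressed the disk as a plain SCSI device. smartmontools has a `-d sat` device type that asks for ATA commands to be passed through the SCSI layer (SCSI-to-ATA Translation). Whether the esas2hba driver and the H608 actually forward such pass-through commands is an assumption that nobody in this thread has confirmed. A minimal Python sketch that builds the commands to try (the helper name `smartctl_sat_argv` is made up for illustration):

```python
import shlex


def smartctl_sat_argv(device):
    """Build a smartctl command line that requests SAT pass-through.

    '-d sat' and '-a' are existing smartctl options; whether the HBA and
    its driver honour ATA pass-through here is an untested assumption.
    """
    return ["smartctl", "-d", "sat", "-a", device]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe
    # to execute on a machine without these devices.
    for dev in ("/dev/sde", "/dev/sdg"):
        print(" ".join(shlex.quote(arg) for arg in smartctl_sat_argv(dev)))
```

If the pass-through works, the output should include the ATA attribute table (reallocated and pending sector counts, UDMA CRC error count), which would help separate drive faults from cable or backplane faults.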