Raid failure - drives or controller?


* Raid failure - drives or controller?
@ 2012-03-07 17:52 Danilo Godec
  2012-03-07 19:33 ` Ray Morris
  2012-03-07 21:11 ` Mathias Burén
  0 siblings, 2 replies; 5+ messages in thread
From: Danilo Godec @ 2012-03-07 17:52 UTC (permalink / raw)
  To: linux-raid

Hi,

I had two drive failures on a RAID5 within a short time (unfortunately too
short to rebuild onto a spare disk). However, the drives seem to work on a
test machine and didn't report any errors. I also put them back into the
original server (after rebooting) and they work now.

The first drive's errors were:

> Mar  6 05:15:19 san1 kernel: [10681162.473960] sd 4:0:3:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Mar  6 05:15:19 san1 kernel: [10681162.473965] sd 4:0:3:0: [sde] Sense Key : Aborted Command [current]
> Mar  6 05:15:19 san1 kernel: [10681162.473969] sd 4:0:3:0: [sde] Add. Sense: No additional sense information
> Mar  6 05:15:19 san1 kernel: [10681162.473973] sd 4:0:3:0: [sde] CDB: Read(10): 28 00 07 af 38 3f 00 00 08 00
> Mar  6 05:15:19 san1 kernel: [10681162.473980] end_request: I/O error, dev sde, sector 128923711
> Mar  6 05:17:53 san1 kernel: [10681316.885221] sd 4:0:3:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Mar  6 05:17:53 san1 kernel: [10681316.885225] sd 4:0:3:0: [sde] Sense Key : Illegal Request [current]
> Mar  6 05:17:53 san1 kernel: [10681316.885229] sd 4:0:3:0: [sde] Add. Sense: Logical block address out of range
> Mar  6 05:17:53 san1 kernel: [10681316.885234] sd 4:0:3:0: [sde] CDB: Write(10): 2a 08 74 70 58 c7 00 00 08 00
> Mar  6 05:17:53 san1 kernel: [10681316.885242] end_request: I/O error, dev sde, sector 1953519815
> Mar  6 05:17:53 san1 kernel: [10681316.885246] end_request: I/O error, dev sde, sector 1953519815
> Mar  6 05:17:53 san1 kernel: [10681316.885252] raid5: Disk failure on sde1, disabling device.
> Mar  6 05:20:27 san1 kernel: [10681470.600610] sd 4:0:3:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Mar  6 05:20:27 san1 kernel: [10681470.600615] sd 4:0:3:0: [sde] Sense Key : Illegal Request [current]
> Mar  6 05:20:27 san1 kernel: [10681470.600619] sd 4:0:3:0: [sde] Add. Sense: Logical block address out of range
> Mar  6 05:20:27 san1 kernel: [10681470.600624] sd 4:0:3:0: [sde] CDB: Write(10): 2a 08 74 70 59 27 00 00 08 00
> Mar  6 05:20:27 san1 kernel: [10681470.600631] end_request: I/O error, dev sde, sector 1953519911
> Mar  6 05:20:27 san1 kernel: [10681470.600636] end_request: I/O error, dev sde, sector 1953519911
> Mar  6 05:20:28 san1 kernel: [10681471.664682]  disk 3, o:0, dev:sde1
> Mar  6 05:21:47 san1 kernel: [10681549.746852] sd 4:0:3:0: [sde] Synchronizing SCSI cache
> Mar  6 05:21:47 san1 kernel: [10681549.746905] sd 4:0:3:0: [sde] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
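
The CDB bytes in the log above can be cross-checked against the sector numbers the kernel printed. The following is only a sketch: the field offsets follow the standard READ(10)/WRITE(10) layout, but the helper name `decode_cdb10` and the capacity constant are illustrative assumptions that do not come from the thread (the drive's sector count is never stated here; the value used is the nominal figure for a 1 TB drive).

```python
# Hedged sketch: decode the 10-byte CDBs printed in the kernel log.
# decode_cdb10 and ASSUMED_TOTAL_SECTORS are illustrative names, not from the thread.

OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

# Assumption: nominal LBA count of a 1 TB drive. The thread never states it.
ASSUMED_TOTAL_SECTORS = 1_953_525_168


def decode_cdb10(hex_bytes: str) -> dict:
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(hex_bytes)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    return {
        "op": OPCODES.get(cdb[0], hex(cdb[0])),
        "fua": bool(cdb[1] & 0x08),                 # Force Unit Access bit
        "lba": int.from_bytes(cdb[2:6], "big"),     # bytes 2..5, big-endian
        "blocks": int.from_bytes(cdb[7:9], "big"),  # bytes 7..8, transfer length
    }


if __name__ == "__main__":
    for raw in (
        "28 00 07 af 38 3f 00 00 08 00",  # sde: Aborted Command
        "2a 08 74 70 58 c7 00 00 08 00",  # sde: LBA out of range
        "2a 08 74 70 59 27 00 00 08 00",  # sde: LBA out of range
    ):
        d = decode_cdb10(raw)
        last_lba = d["lba"] + d["blocks"] - 1
        print(d, "within assumed capacity:", last_lba < ASSUMED_TOTAL_SECTORS)
```

The decoded LBAs (128923711, 1953519815, 1953519911) equal the sectors in the matching end_request lines, and both writes carry the FUA bit. Under the assumed capacity, all three requests end inside the drive, so the "Logical block address out of range" replies would be surprising from a drive that was reporting its full size at that moment. That is an observation under the stated assumption, not a diagnosis.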

The second drive did this:

> Mar  7 02:31:37 san1 kernel: [10757598.197391] sd 4:0:5:0: [sdg] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Mar  7 02:31:37 san1 kernel: [10757598.197396] sd 4:0:5:0: [sdg] Sense Key : Aborted Command [current]
> Mar  7 02:31:37 san1 kernel: [10757598.197400] sd 4:0:5:0: [sdg] Add. Sense: No additional sense information
> Mar  7 02:31:37 san1 kernel: [10757598.197404] sd 4:0:5:0: [sdg] CDB: Read(10): 28 00 07 12 05 9f 00 00 10 00
> Mar  7 02:31:37 san1 kernel: [10757598.197411] end_request: I/O error, dev sdg, sector 118621599
> Mar  7 02:31:37 san1 kernel: [10757598.583990] raid5: Disk failure on sdg1, disabling device.
> Mar  7 02:31:37 san1 kernel: [10757598.616232]  disk 5, o:0, dev:sdg1
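
To put a number on how close together the two failures were, the "Disk failure" lines from both logs can be parsed and their kernel timestamps subtracted. This is a minimal sketch: the two log lines are copied from this message with the mail wrapping undone, while the regex and the function name `failure_events` are illustrative assumptions.

```python
import re

# Two lines copied from the logs above, rejoined where the mail client wrapped them.
LOG = """\
Mar  6 05:17:53 san1 kernel: [10681316.885252] raid5: Disk failure on sde1, disabling device.
Mar  7 02:31:37 san1 kernel: [10757598.583990] raid5: Disk failure on sdg1, disabling device.
"""

# The bracketed number is the kernel's printk timestamp in seconds.
FAILURE = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\] raid5: Disk failure on (?P<dev>\w+),")


def failure_events(text: str) -> list:
    """Return (device, kernel timestamp in seconds) for each md failure line."""
    return [(m.group("dev"), float(m.group("ts"))) for m in FAILURE.finditer(text)]


if __name__ == "__main__":
    (dev_a, t_a), (dev_b, t_b) = failure_events(LOG)
    print(f"{dev_a} -> {dev_b}: {(t_b - t_a) / 3600:.1f} hours apart")
```

By the kernel timestamps the two members were kicked out roughly 21 hours apart.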

Can anyone make some actual sense out of these sense messages?

Are these drives really / likely bad or is it more likely it was a 
controller failure?


    D.


-- 
Danilo Godec, sistemska podpora / system administration




* Re: Raid failure - drives or controller?
  2012-03-07 17:52 Raid failure - drives or controller? Danilo Godec
@ 2012-03-07 19:33 ` Ray Morris
  2012-03-07 21:08   ` Danilo Godec
  2012-03-07 21:11 ` Mathias Burén
  1 sibling, 1 reply; 5+ messages in thread
From: Ray Morris @ 2012-03-07 19:33 UTC (permalink / raw)
  To: Danilo Godec; +Cc: linux-raid

> drives or controller?

Don't forget cables and loose connections. 
We've had more cable problems than anything else.
-- 
Ray Morris
support@bettercgi.com







* Re: Raid failure - drives or controller?
  2012-03-07 19:33 ` Ray Morris
@ 2012-03-07 21:08   ` Danilo Godec
  0 siblings, 0 replies; 5+ messages in thread
From: Danilo Godec @ 2012-03-07 21:08 UTC (permalink / raw)
  To: linux-raid; +Cc: Ray Morris

On 7.3.2012 20:33, Ray Morris wrote:
>> drives or controller?
> Don't forget cables and loose connections. 
> We've had more cable problems than anything else.

Sorry, I should probably have mentioned the hardware specs in the first place:

The server is an Intel SR2612UR with 12-drive SAS/SATA backplane with an
Attotech ExpressSAS H608 (PCIe) SAS controller and a single Mini-SAS
SFF-8087 cable between controller and backplane.

Drives are 'WD1002FBYS' (WD RE3 'enterprise' models), 7 of them (6 RAID5
+ 1 spare).

The OS (OpenSuSE 11.4 x86_64) is installed on a separate SATA SSD,
connected to on-board SATA controller. The driver for the controller is
'esas2hba' version 1.65, supplied by Attotech.

   D.


* Re: Raid failure - drives or controller?
  2012-03-07 17:52 Raid failure - drives or controller? Danilo Godec
  2012-03-07 19:33 ` Ray Morris
@ 2012-03-07 21:11 ` Mathias Burén
  2012-03-07 21:15   ` Danilo Godec
  1 sibling, 1 reply; 5+ messages in thread
From: Mathias Burén @ 2012-03-07 21:11 UTC (permalink / raw)
  To: Danilo Godec; +Cc: linux-raid

Can you please post the smartctl -a (from smartmontools) output for both drives?

Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Raid failure - drives or controller?
  2012-03-07 21:11 ` Mathias Burén
@ 2012-03-07 21:15   ` Danilo Godec
  0 siblings, 0 replies; 5+ messages in thread
From: Danilo Godec @ 2012-03-07 21:15 UTC (permalink / raw)
  To: Mathias Burén; +Cc: linux-raid

On 7.3.2012 22:11, Mathias Burén wrote:
>
> Can you please post the smartctl -a (from smartmontools) output for both drives?
>
> Mathias
> --

# smartctl -a /dev/sde

> smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (openSUSE RPM)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Device: WDC      WD1002FBYS-02A6B Version: 0C06
> Serial number:      WD-WMATV6536667
> Device type: disk
> scsiModePageOffset: response length too short, resp_len=12 offset=12 bd_len=8
> Local Time is: Wed Mar  7 22:13:35 2012 CET
> Device does not support SMART
>
> Current Drive Temperature:     <not available>
>
> Error Counter logging not supported
>
> [GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
>
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                              number   (hours)
> #256  Default           Completed                   -    12946                 - [-   -    -]
> #512  Default           Completed                   -    12946                 - [-   -    -]
> #768  Default           Completed                   -      554                 - [-   -    -]

# smartctl -a /dev/sdg
> smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (openSUSE RPM)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Device: WDC      WD1002FBYS-02A6B Version: 0C06
> Serial number:      WD-WMATV6535018
> Device type: disk
> scsiModePageOffset: response length too short, resp_len=12 offset=12 bd_len=8
> Local Time is: Wed Mar  7 22:13:40 2012 CET
> Device does not support SMART
>
> Current Drive Temperature:     <not available>
>
> Error Counter logging not supported
>
> [GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
>
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                              number   (hours)
> #256  Default           Completed                   -     1359                 - [-   -    -]
> #512  Default           Completed                   -     1040                 - [-   -    -]
> #768  Default           Completed                   -      581                 - [-   -    -]
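
Both outputs say "Device does not support SMART", which suggests smartctl addressed the SATA disks as plain SCSI devices behind the SAS controller. smartctl has a `-d sat` device type that asks for ATA pass-through instead. Whether the controller and its driver pass those commands through is an assumption that this thread does not confirm. The sketch below only builds and optionally runs that command; the function names are illustrative.

```python
import shlex
import subprocess


def smartctl_sat_cmd(device: str) -> list:
    """Build a smartctl command that requests SAT (ATA pass-through) access."""
    return ["smartctl", "-a", "-d", "sat", device]


def run_smartctl(device: str) -> str:
    """Run smartctl and return its stdout; needs root and smartmontools installed."""
    # check=False: smartctl uses a non-zero exit status as a bitmask of findings.
    proc = subprocess.run(
        smartctl_sat_cmd(device), capture_output=True, text=True, check=False
    )
    return proc.stdout


if __name__ == "__main__":
    # Print the commands rather than running them, so this is safe to try anywhere.
    for dev in ("/dev/sde", "/dev/sdg"):
        print(shlex.join(smartctl_sat_cmd(dev)))
```

If the pass-through works, the output should include the ATA attribute table and error log that the SCSI view above cannot show.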



   Thanks, Danilo


