Western Digital RE3: Raid Failure

linux-raid.vger.kernel.org archive mirror

* Western Digital RE3: Raid Failure
From: MOgWai46[Saurceful of Secrets] @ 2009-08-27  9:43 UTC
  To: linux-raid

Sunday night I had a problem on my CentOS 5.3 server:

Kernel: 2.6.18-128.4.1.el5
Hard disks: 4 x Western Digital RAID Edition 3, 500 GB each

ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 51/04:08:38:df:f7/00:00:00:00:00/a7 Emask 0x1 (device error)
ata2.00: status: { DRDY ERR }
ata2.00: error: { ABRT }
ata2.00: configured for UDMA/133
ata2.01: configured for UDMA/133
ata2: EH complete
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 51/04:00:38:df:f7/00:00:00:00:00/a7 Emask 0x1 (device error)
ata2.00: status: { DRDY ERR }
ata2.00: error: { ABRT }
ata2.00: configured for UDMA/133
ata2.01: configured for UDMA/133
ata2: EH complete
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 51/04:08:38:df:f7/00:00:00:00:00/a7 Emask 0x1 (device error)
ata2.00: status: { DRDY ERR }
ata2.00: error: { ABRT }
ata2.00: configured for UDMA/133
ata2.01: configured for UDMA/133
ata2: EH complete
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 51/04:00:38:df:f7/00:00:00:00:00/a7 Emask 0x1 (device error)
ata2.00: status: { DRDY ERR }
ata2.00: error: { ABRT }
ata2.00: configured for UDMA/133
ata2.01: configured for UDMA/133
ata2: EH complete
SCSI device sdc: 976773168 512-byte hdwr sectors (500108 MB)
sdc: Write Protect is off
sdc: Mode Sense: 00 3a 00 00
SCSI device sdc: drive cache: write back
SCSI device sdd: 976773168 512-byte hdwr sectors (500108 MB)
sdd: Write Protect is off
sdd: Mode Sense: 00 3a 00 00
SCSI device sdd: drive cache: write back
SCSI device sdc: 976773168 512-byte hdwr sectors (500108 MB)
sdc: Write Protect is off
sdc: Mode Sense: 00 3a 00 00
SCSI device sdc: drive cache: write back
SCSI device sdd: 976773168 512-byte hdwr sectors (500108 MB)
sdd: Write Protect is off
sdd: Mode Sense: 00 3a 00 00
SCSI device sdd: drive cache: write back
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: BMDMA stat 0x65
ata2.00: cmd ca/00:08:69:22:ea/00:00:00:00:00/eb tag 0 dma 4096 out
         res 51/10:08:69:22:ea/00:00:00:00:00/eb Emask 0x81 (invalid argument)
ata2.00: status: { DRDY ERR }
ata2.00: error: { IDNF }
ata2.00: qc timeout (cmd 0xec)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x5)
ata2.00: revalidation failed (errno=-5)
ata2: failed to recover some devices, retrying in 5 secs
ata2: soft resetting link
ata2.00: configured for UDMA/133
ata2.01: configured for UDMA/133
sd 1:0:0:0: SCSI error: return code = 0x08000002
sdc: Current [descriptor]: sense key: Aborted Command
    Add. Sense: Recorded entity not found

Descriptor sense data with sense descriptors (in hex):
        72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
        0b ea 22 69
end_request: I/O error, dev sdc, sector 199893609
raid1: Disk failure on sdc2, disabling device.
        Operation continuing on 1 devices
ata2: EH complete
SCSI device sdc: 976773168 512-byte hdwr sectors (500108 MB)
sdc: Write Protect is off
sdc: Mode Sense: 00 3a 00 00
SCSI device sdc: drive cache: write back
SCSI device sdd: 976773168 512-byte hdwr sectors (500108 MB)
sdd: Write Protect is off
sdd: Mode Sense: 00 3a 00 00
SCSI device sdd: drive cache: write back
SCSI device sdc: 976773168 512-byte hdwr sectors (500108 MB)
sdc: Write Protect is off
sdc: Mode Sense: 00 3a 00 00
SCSI device sdc: drive cache: write back
SCSI device sdd: 976773168 512-byte hdwr sectors (500108 MB)
sdd: Write Protect is off
sdd: Mode Sense: 00 3a 00 00
SCSI device sdd: drive cache: write back
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:0, dev:sdc2
 disk 1, wo:0, o:1, dev:sda2
RAID1 conf printout:
 --- wd:1 rd:2
 disk 1, wo:0, o:1, dev:sda2
md: unbind<sdc2>
md: export_rdev(sdc2)
md: bind<sdc2>
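
As a cross-check on the log above, the failing write can be decoded by hand. The sketch below assumes the usual libata layout of the "cmd" line (command/feature:nsect:lbal:lbam:lbah/<hob bytes>/device) and treats the command as an LBA28 one (0xca, WRITE DMA); the function names are made up for this illustration. It pulls the LBA out of the taskfile and out of the descriptor sense data, so both can be compared with the sector that end_request reports.

```python
import re

# Lines copied from the kernel log above.
CMD_LINE = "ata2.00: cmd ca/00:08:69:22:ea/00:00:00:00:00/eb tag 0 dma 4096 out"
SENSE_HEX = ("72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00 "
             "0b ea 22 69")

# command / feature:nsect:lbal:lbam:lbah / hob bytes / device
_TASKFILE = re.compile(
    r"cmd (?P<cmd>[0-9a-f]{2})/(?P<cur>[0-9a-f:]{14})/"
    r"(?P<hob>[0-9a-f:]{14})/(?P<dev>[0-9a-f]{2})"
)


def lba28_from_taskfile(line: str) -> int:
    """Return the 28-bit LBA encoded in a libata 'cmd' line (LBA28 commands only)."""
    m = _TASKFILE.search(line)
    if m is None:
        raise ValueError("no libata taskfile found in line")
    _feature, _nsect, lbal, lbam, lbah = (
        int(x, 16) for x in m.group("cur").split(":")
    )
    device = int(m.group("dev"), 16)
    # In LBA28 mode the low nibble of the device register holds LBA bits 27..24.
    return ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal


def lba_from_descriptor_sense(hex_bytes: str) -> int:
    """Return the INFORMATION field of descriptor-format sense data."""
    data = bytes.fromhex(hex_bytes)
    if data[0] & 0x7F not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")
    # The first descriptor starts at byte 8; type 0x00 is the information
    # descriptor, whose 8-byte big-endian value sits at bytes 12..19.
    if data[8] != 0x00:
        raise ValueError("first descriptor is not an information descriptor")
    return int.from_bytes(data[12:20], "big")


if __name__ == "__main__":
    print("taskfile LBA:", lba28_from_taskfile(CMD_LINE))
    print("sense LBA:   ", lba_from_descriptor_sense(SENSE_HEX))
```

Both values come out as 199893609, the same sector that the end_request line names, so the three log lines describe one and the same failed write.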

Monday morning I verified that the sdc hard disk didn't have bad blocks
and re-added it to the RAID array md1. The array rebuilt
with no problems.

I didn't modify the TLER settings on these four hard disks (default
value: 7 seconds).

I installed this type of drive because I had the same issue with the
WD VelociRaptor drives :(

Could it be an issue with the server hardware?

Andrea

* Re: Western Digital RE3: Raid Failure
From: Michal Soltys @ 2009-08-27 13:38 UTC
  To: MOgWai46[Saurceful of Secrets]; +Cc: linux-raid

MOgWai46[Saurceful of Secrets] wrote:
> Sunday night I had a problem on my CentOS 5.3 server:
> 
> [cut]
> 
> Monday morning I verified that the sdc hard disk didn't have bad blocks
> and re-added it to the RAID array md1. The array rebuilt
> with no problems.
> 
> I didn't modify the TLER settings on these four hard disks (default
> value: 7 seconds).
> 
> I installed this type of drive because I had the same issue with the
> WD VelociRaptor drives :(
> 

Were you perhaps hit (in the VelociRaptor case) by:
http://forums.storagereview.net/index.php?showtopic=27303&view=findpost&p=256912

As for the RE3 drives: did the above happen after roughly 49 days of uptime?

A reappearance of this firmware bug in the RE3 drives would be hard to believe,
but you never know.
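
A figure of roughly 49 days is what a 32-bit counter of milliseconds gives when it wraps around. Whether the firmware bug linked above really works that way is an assumption made here for illustration, not something stated in this thread. The arithmetic, as a minimal sketch:

```python
# Assumption: the drive keeps an unsigned 32-bit uptime counter in milliseconds.
MS_PER_DAY = 24 * 60 * 60 * 1000

wrap_ms = 2 ** 32                 # first value that no longer fits in 32 bits
wrap_days = wrap_ms / MS_PER_DAY  # uptime at which such a counter would wrap

print(f"a 32-bit millisecond counter wraps after {wrap_days:.2f} days")
```

Under that assumption, a failure that shows up well before day 49 of continuous power-on time would point away from this particular bug.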


* Re: Western Digital RE3: Raid Failure
From: Drew @ 2009-08-27 14:44 UTC
  To: Michal Soltys; +Cc: MOgWai46[Saurceful of Secrets], linux-raid

> Were you perhaps hit (in the VelociRaptor case) by:
> http://forums.storagereview.net/index.php?showtopic=27303&view=findpost&p=256912
>
> As for the RE3 drives: did the above happen after roughly 49 days of uptime?
>
> A reappearance of this firmware bug in the RE3 drives would be hard to believe, but
> you never know.

If the TLER problem showed up in the VelociRaptors, I'd expect to see
it in RE3s of the same vintage. The RE (RAID Edition) lineup
markets TLER as one of its benefits and was positioned to compete
against Seagate's Barracuda ES series.


-- 
Drew

"Nothing in life is to be feared. It is only to be understood."
--Marie Curie

* Re: Western Digital RE3: Raid Failure
From: MOgWai46[Saurceful of Secrets] @ 2009-08-31 10:11 UTC
  To: Michal Soltys; +Cc: linux-raid

2009/8/27 Michal Soltys <soltys@ziu.info>:

> Were you perhaps hit (in the VelociRaptor case) by:
> http://forums.storagereview.net/index.php?showtopic=27303&view=findpost&p=256912
>
> As for the RE3 drives: did the above happen after roughly 49 days of uptime?
>
> A reappearance of this firmware bug in the RE3 drives would be hard to believe, but
> you never know.

Last night the array broke again, and the problem was on the same
drive as last week :|

The drives have been online for 13 days (since the last reboot), and I did
the last power cycle on 10 August (RAM upgrade).

Do I have to disable TLER on my 4 drives when using Linux software RAID?

Andrea

* Re: Western Digital RE3: Raid Failure
From: Richard Scobie @ 2009-08-31 18:44 UTC
  To: MOgWai46[Saurceful of Secrets]; +Cc: linux-raid

MOgWai46[Saurceful of Secrets] wrote:

> Do I have to disable TLER on my 4 drives when using Linux software RAID?

I have never needed to do so.

Regards,

Richard

* Re: Western Digital RE3: Raid Failure
From: Zdenek Kaspar @ 2009-08-31 20:58 UTC
  To: MOgWai46[Saurceful of Secrets]; +Cc: linux-raid

MOgWai46[Saurceful of Secrets] wrote:
> Sunday night I had a problem on my CentOS 5.3 server:
> 
> [cut]
> 
> Monday morning I verified that the sdc hard disk didn't have bad blocks
> and re-added it to the RAID array md1. The array rebuilt
> with no problems.
> 
> I didn't modify the TLER settings on these four hard disks (default
> value: 7 seconds).
> 
> I installed this type of drive because I had the same issue with the
> WD VelociRaptor drives :(
> 
> Could it be an issue with the server hardware?
> 
> Andrea

Do you have some sort of port multiplier [because it looks like there
are 2 devices on ata2 -> ata2.00 (sdc), ata2.01 (sdd?)], and can you
provide the smartctl -a /dev/sdc output?

* Re: Western Digital RE3: Raid Failure
From: MOgWai46[Saurceful of Secrets] @ 2009-09-03  9:14 UTC
  To: Zdenek Kaspar; +Cc: linux-raid

2009/8/31 Zdenek Kaspar <zkaspar82@gmail.com>:

> Do you have some sort of port multiplier [because it looks like there
> are 2 devices on ata2 -> ata2.00 (sdc), ata2.01 (sdd?)], and can you
> provide the smartctl -a /dev/sdc output?

This is the output of smartctl -a /dev/sdc: http://nopaste.com/p/aXwmnSEphb

This is the output of smartctl -a /dev/sdd: http://nopaste.com/p/aYzhpzppv

There isn't any port multiplier on this little server (it is not a
critical or production server). It has 4 SATA ports on the mainboard:

# lspci -v

00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) SATA IDE Controller (rev 01) (prog-if 8a [Master SecP PriP])
        Subsystem: Hewlett-Packard Company Unknown device 3206
        Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 233
        I/O ports at <unassigned>
        I/O ports at <unassigned>
        I/O ports at <unassigned>
        I/O ports at <unassigned>
        I/O ports at 3080 [size=16]
        Memory at 88000000 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [70] Power Management version 2

Thanks, Andrea

* Re: Western Digital RE3: Raid Failure
From: Gabor Gombas @ 2009-09-03 11:06 UTC
  To: MOgWai46[Saurceful of Secrets]; +Cc: Zdenek Kaspar, linux-raid

On Thu, Sep 03, 2009 at 11:14:45AM +0200, MOgWai46[Saurceful of Secrets] wrote:

> This is the output of smartctl -a /dev/sdc: http://nopaste.com/p/aXwmnSEphb

  5 Reallocated_Sector_Ct   0x0033   197   197   140    Pre-fail  Always       -       24
[...]
196 Reallocated_Event_Count 0x0032   197   197   000    Old_age   Always       -       3

IMHO it's time to replace the drive. I think it also shows that TLER is
working just as expected: timing out when the recovery would have taken
too long.

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------

* Re: Western Digital RE3: Raid Failure
From: Alex Butcher @ 2009-09-03 11:24 UTC
  To: Gabor Gombas; +Cc: MOgWai46[Saurceful of Secrets], Zdenek Kaspar, linux-raid

On Thu, 3 Sep 2009, Gabor Gombas wrote:

> On Thu, Sep 03, 2009 at 11:14:45AM +0200, MOgWai46[Saurceful of Secrets] wrote:
>
>> This is the output of smartctl -a /dev/sdc: http://nopaste.com/p/aXwmnSEphb
>
>  5 Reallocated_Sector_Ct   0x0033   197   197   140    Pre-fail  Always       -       24
> [...]
> 196 Reallocated_Event_Count 0x0032   197   197   000    Old_age   Always       -       3
>
> IMHO it's time to replace the drive. I think it also shows that TLER is
> working just as expected: timing out when the recovery would have taken
> too long.

The drive has only been spinning for 1008 hours, or about 42 days. It's
probably still under warranty, but I'd be surprised if WD (or any other
vendor) will accept an RMA since it will probably pass all its diagnostic
tests.

Note that those SMART attributes are nothing particularly scary and this is
backed up by both attributes being significantly higher than their
respective thresholds; modern drives have thousands of spare sectors, so a
few going bad is to be expected.
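
To make the comparison concrete, here is a small sketch that parses the two attribute lines quoted above and prints how far each normalized value sits above its threshold. It assumes the standard smartctl attribute table columns (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE); the helper names are invented for this example.

```python
from typing import NamedTuple

# The two lines quoted above, plus the power-on hours mentioned in this message.
ATTR_LINES = [
    "  5 Reallocated_Sector_Ct   0x0033   197   197   140    Pre-fail  Always       -       24",
    "196 Reallocated_Event_Count 0x0032   197   197   000    Old_age   Always       -       3",
]
POWER_ON_HOURS = 1008


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized value (higher is better)
    worst: int
    thresh: int  # failure threshold for the normalized value
    raw: int     # vendor-specific raw counter


def parse_attr(line: str) -> SmartAttr:
    """Parse one row of the smartctl attribute table (whitespace-separated)."""
    f = line.split()
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), int(f[9]))


def margin(attr: SmartAttr) -> int:
    """Distance between the normalized value and its threshold."""
    return attr.value - attr.thresh


attrs = [parse_attr(line) for line in ATTR_LINES]

if __name__ == "__main__":
    for a in attrs:
        print(f"{a.name}: value={a.value} thresh={a.thresh} "
              f"margin={margin(a)} raw={a.raw}")
    print(f"power-on time: {POWER_ON_HOURS / 24:.0f} days")
```

The margin only says that the drive's own failure criterion has not tripped; it does not settle the disagreement in this thread about whether 24 reallocated sectors after 42 days is acceptable.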

> Gabor

Best Regards,
Alex

* Re: Western Digital RE3: Raid Failure
From: Zdenek Kaspar @ 2009-09-03 12:28 UTC
  To: MOgWai46[Saurceful of Secrets]; +Cc: linux-raid

MOgWai46[Saurceful of Secrets] wrote:
> 2009/8/31 Zdenek Kaspar <zkaspar82@gmail.com>:
> 
>> Do you have some sort of port multiplier [because it looks like there
>> are 2 devices on ata2 -> ata2.00 (sdc), ata2.01 (sdd?)], and can you
>> provide the smartctl -a /dev/sdc output?
> 
> This is the output of smartctl -a /dev/sdc: http://nopaste.com/p/aXwmnSEphb
> 
> This is the output of smartctl -a /dev/sdd: http://nopaste.com/p/aYzhpzppv
> 
> There isn't any port multiplier on this little server (it is not a
> critical or production server). It has 4 SATA ports on the mainboard:
> 
> # lspci -v
> 
> 00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family)
> SATA IDE Controller (rev 01) (prog-if 8a [Master SecP PriP])
>         Subsystem: Hewlett-Packard Company Unknown device 3206
>         Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 233
>         I/O ports at <unassigned>
>         I/O ports at <unassigned>
>         I/O ports at <unassigned>
>         I/O ports at <unassigned>
>         I/O ports at 3080 [size=16]
>         Memory at 88000000 (32-bit, non-prefetchable) [size=1K]
>         Capabilities: [70] Power Management version 2
> 
> Thanks, Andrea


As others have already pointed out, the Reallocated_Sector_Ct value is a bad
sign, and you should replace (RMA) the drive quickly. BTW, speed is
significantly degraded when the disk works with sectors from the remapped area.

HTH, Z.

* Re: Western Digital RE3: Raid Failure
From: Thomas Fjellstrom @ 2009-09-05  3:53 UTC
  To: linux-raid

On Thu September 3 2009, you wrote:
> On Thu, 3 Sep 2009, Gabor Gombas wrote:
> > On Thu, Sep 03, 2009 at 11:14:45AM +0200, MOgWai46[Saurceful of Secrets] wrote:
> >> This is the output of smartctl -a /dev/sdc: http://nopaste.com/p/aXwmnSEphb
> >
> >  5 Reallocated_Sector_Ct   0x0033   197   197   140    Pre-fail  Always       -       24
> > [...]
> > 196 Reallocated_Event_Count 0x0032   197   197   000    Old_age   Always       -       3
> >
> > IMHO it's time to replace the drive. I think it also shows that TLER is
> > working just as expected: timing out when the recovery would have taken
> > too long.
>
> The drive has only been spinning for 1008 hours, or about 42 days. It's
> probably still under warranty, but I'd be surprised if WD (or any other
> vendor) will accept an RMA since it will probably pass all its diagnostic
> tests.

I just returned a Seagate drive that passed all tests. But its reallocated 
sector count was over 1500 (and climbing 10 or more per hour). The support 
tech didn't have a problem with authorizing the RMA. So if the problem gets 
bad enough, they SHOULD take the drive back, or at least Seagate will.

>
> Note that those SMART attributes are nothing particularly scary and this is
> backed up by both attributes being significantly higher than their
> respective thresholds; modern drives have thousands of spare sectors, so a
> few going bad is to be expected.
>
> > Gabor
>
> Best Regards,
> Alex


-- 
Thomas Fjellstrom
tfjellstrom@shaw.ca

* Re: Western Digital RE3: Raid Failure
From: Alex Butcher @ 2009-09-05 12:15 UTC
  To: Thomas Fjellstrom; +Cc: linux-raid

On Fri, 4 Sep 2009, Thomas Fjellstrom wrote:

> On Thu September 3 2009, you wrote:
>> On Thu, 3 Sep 2009, Gabor Gombas wrote:
>>> On Thu, Sep 03, 2009 at 11:14:45AM +0200, MOgWai46[Saurceful of Secrets] wrote:
>>>> This is the output of smartctl -a /dev/sdc: http://nopaste.com/p/aXwmnSEphb
>>>
>>>  5 Reallocated_Sector_Ct   0x0033   197   197   140    Pre-fail  Always       -       24
>>> [...]
>>> 196 Reallocated_Event_Count 0x0032   197   197   000    Old_age   Always       -       3
>>>
>>> IMHO it's time to replace the drive. I think it also shows that TLER is
>>> working just as expected: timing out when the recovery would have taken
>>> too long.
>>
>> The drive has only been spinning for 1008 hours, or about 42 days. It's
>> probably still under warranty, but I'd be surprised if WD (or any other
>> vendor) will accept an RMA since it will probably pass all its diagnostic
>> tests.
>
> I just returned a Seagate drive that passed all tests. But its reallocated
> sector count was over 1500 (and climbing 10 or more per hour). The support
> tech didn't have a problem with authorizing the RMA. So if the problem gets
> bad enough, they SHOULD take the drive back, or at least Seagate will.

Different circumstances; your drive's reallocated count was higher and
consistently climbing.

Sectors go bad, and drives have spares; drives with reallocated sectors
aren't necessarily dead or about to die imminently. Keep backups of files
that are important to you in case a file occupies a sector that goes bad.
RAID is not a backup.
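
The distinction drawn here (a small, stable count versus a count that keeps climbing) can be checked by sampling the raw Reallocated_Sector_Ct over time. The sketch below is only an illustration: the function name is made up, and both sample series are hypothetical, chosen to resemble the two drives described in this thread rather than taken from real readings.

```python
from typing import List, Tuple

# Each sample is (hours since the first reading, raw reallocated sector count).
Sample = Tuple[float, int]


def realloc_rate_per_hour(samples: List[Sample]) -> float:
    """Average growth of the raw count between the first and last sample."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples taken at increasing times")
    return (c1 - c0) / (t1 - t0)


# Hypothetical readings, for illustration only.
stable_drive = [(0, 24), (24, 24), (48, 24)]
climbing_drive = [(0, 1500), (1, 1510), (2, 1521)]

if __name__ == "__main__":
    print("stable:  ", realloc_rate_per_hour(stable_drive), "sectors/hour")
    print("climbing:", realloc_rate_per_hour(climbing_drive), "sectors/hour")
```

A rate that stays at zero over days fits the "sectors go bad, drives have spares" case; a steadily positive rate is the situation in which a vendor is more likely to accept an RMA.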

>>> Gabor
>> Alex

Best Regards,
Alex

* Re: Western Digital RE3: Raid Failure
From: Thomas Fjellstrom @ 2009-09-05 12:22 UTC
  To: Alex Butcher; +Cc: linux-raid

On Sat September 5 2009, Alex Butcher wrote:
> On Fri, 4 Sep 2009, Thomas Fjellstrom wrote:
> > On Thu September 3 2009, you wrote:
> >> On Thu, 3 Sep 2009, Gabor Gombas wrote:
> >>> On Thu, Sep 03, 2009 at 11:14:45AM +0200, MOgWai46[Saurceful of Secrets] wrote:
> >>>> This is the output of smartctl -a /dev/sdc: http://nopaste.com/p/aXwmnSEphb
> >>>
> >>>  5 Reallocated_Sector_Ct   0x0033   197   197   140    Pre-fail  Always       -       24
> >>> [...]
> >>> 196 Reallocated_Event_Count 0x0032   197   197   000    Old_age   Always       -       3
> >>>
> >>> IMHO it's time to replace the drive. I think it also shows that TLER is
> >>> working just as expected: timing out when the recovery would have taken
> >>> too long.
> >>
> >> The drive has only been spinning for 1008 hours, or about 42 days. It's
> >> probably still under warranty, but I'd be surprised if WD (or any other
> >> vendor) will accept an RMA since it will probably pass all its
> >> diagnostic tests.
> >
> > I just returned a Seagate drive that passed all tests. But its
> > reallocated sector count was over 1500 (and climbing 10 or more per
> > hour). The support tech didn't have a problem with authorizing the RMA.
> > So if the problem gets bad enough, they SHOULD take the drive back, or at
> > least Seagate will.
> 
> Different circumstances; your drive's reallocated count was higher and
> consistently climbing.

Yeah, I did say "if the problem gets bad enough".

> Sectors go bad, and drives have spares; drives with reallocated sectors
> aren't necessarily dead or about to die imminently. Keep backups of files
> that are important to you in case a file occupies a sector that goes bad.
> RAID is not a backup.
> 
> >>> Gabor
> >>
> >> Alex
> 
> Best Regards,
> Alex


-- 
Thomas Fjellstrom
tfjellstrom@shaw.ca

Thread overview: 13+ messages
2009-08-27  9:43 Western Digital RE3: Raid Failure MOgWai46[Saurceful of Secrets]
2009-08-27 13:38 ` Michal Soltys
2009-08-27 14:44   ` Drew
2009-08-31 10:11   ` MOgWai46[Saurceful of Secrets]
2009-08-31 18:44     ` Richard Scobie
2009-08-31 20:58 ` Zdenek Kaspar
2009-09-03  9:14   ` MOgWai46[Saurceful of Secrets]
2009-09-03 11:06     ` Gabor Gombas
2009-09-03 11:24       ` Alex Butcher
2009-09-05  3:53         ` Thomas Fjellstrom
2009-09-05 12:15           ` Alex Butcher
2009-09-05 12:22             ` Thomas Fjellstrom
2009-09-03 12:28     ` Zdenek Kaspar
