linux-raid.vger.kernel.org archive mirror
* SMART, RAID and real world experience of failures.
From: Steven Haigh @ 2012-01-05 23:53 UTC
  To: linux-raid

Hi all,

Extremely long-time listener, but very infrequent poster.

I got a SMART error email yesterday from my home server with a 4 x 1TB 
RAID6 array. It basically boiled down to:

The following warning/error was logged by the smartd daemon:
Device: /dev/sdd [SAT], 1 Currently unreadable (pending) sectors
Device: /dev/sdd [SAT], 1 Offline uncorrectable sectors

This got me wondering so I ran a long test (smartctl -t long /dev/sdd) 
and sure enough, after an hour or so I got this:

# 2  Extended offline    Completed: read failure       50%     17465         1172842872
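
The last column of that log line is the LBA of the first read error. As a rough sketch, it can be turned into a byte offset on the disk. This assumes 512-byte logical sectors, which the message does not confirm (something like `blockdev --getss /dev/sdd` would report the real value), and `lba_to_bytes` is just a hypothetical helper name:

```shell
# Convert an LBA from the self-test log into a byte offset on the disk.
# The second argument is the logical sector size; 512 is only an assumed default.
lba_to_bytes() {
    echo $(( $1 * ${2:-512} ))
}

# LBA_of_first_error from the failed extended self-test above
lba_to_bytes 1172842872
```

With the assumed 512-byte sectors this puts the failing sector roughly 600 GB into the 1TB drive.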

So, in the spirit of experimentation, I did the following:
# mdadm /dev/md2 --manage --fail /dev/sdd
# mdadm /dev/md2 --manage --remove /dev/sdd
# dd if=/dev/zero of=/dev/sdd bs=10M
# mdadm /dev/md2 --manage --add /dev/sdd
< a resync occurred here, afterwards >
# smartctl -t long /dev/sdd
< long wait >
# smartctl -a /dev/sdd
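
The "resync occurred here" step above can be checked rather than eyeballed. A minimal sketch, assuming the usual /proc/mdstat layout where an active rebuild shows a line such as "recovery = 12.6%"; `md_sync_in_progress` is a hypothetical helper name, and `mdadm --wait /dev/md2` should be a simpler alternative where available:

```shell
# Reads /proc/mdstat-style text on stdin.
# Exits 0 if any array reports a resync, recovery or reshape
# (including the DELAYED/PENDING states), non-zero otherwise.
md_sync_in_progress() {
    grep -Eq '(resync|recovery|reshape) *='
}

# Possible use before starting the second long self-test:
#   while md_sync_in_progress < /proc/mdstat; do sleep 60; done
```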

This is where it gets interesting. Although it originally logged an 
error, I now see the following (with lots of other info trimmed):

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       154
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       17493
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       77
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
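
The three sector-related counters (IDs 5, 197 and 198) are the ones that matter for this question. As a sketch, they can be pulled out of the attribute table with awk. It assumes each attribute is on a single line, as `smartctl -A` prints it, and that the raw value is the last field, which holds for the rows shown here but not necessarily for every attribute or vendor. `smart_sector_counts` is a hypothetical helper name:

```shell
# Reads a "smartctl -A" attribute table on stdin and prints NAME=RAW_VALUE
# for Reallocated_Sector_Ct (5), Current_Pending_Sector (197)
# and Offline_Uncorrectable (198).
smart_sector_counts() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2 "=" $NF }'
}

# Possible use:
#   smartctl -A /dev/sdd | smart_sector_counts
```

Running this before and after the zero-fill would make the change in the pending and reallocated counts easy to compare.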

Then even more interesting:
SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     17489         -
# 2  Extended offline    Completed: read failure       50%     17465         1172842872

This makes me ponder. Has the drive recovered? Has the sector with the 
read failure been remapped and hidden from view? Is it still (more?) 
likely to fail in the near future?

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299

Thread overview: 9+ messages
2012-01-05 23:53 SMART, RAID and real world experience of failures Steven Haigh
2012-01-06  0:42 ` Roman Mamedov
2012-01-06 11:22 ` Peter Grandi
2012-01-06 11:40   ` Steven Haigh
2012-01-06 13:38     ` Phil Turmel
2012-01-09 14:50       ` Peter Grandi
2012-01-09 16:37         ` Phil Turmel
2012-01-09 20:23         ` Peter Grandi
2012-01-09 13:59     ` Peter Grandi