Re: request help with RAID1 array that endlessly attempts to sync

From: Julie Ashworth <ashworth@berkeley.edu>
To: linux-raid@vger.kernel.org
Subject: Re: request help with RAID1 array that endlessly attempts to sync
Date: Tue, 17 Dec 2013 08:53:48 -0800
Message-ID: <20131217165348.GA5070@localhost.localdomain>
In-Reply-To: <20131217065028.GC20941@nx5.priv>

hi all,
The sync ran overnight, and this morning smartctl reports 60 errors on /dev/sdb. So it seems the drive is doomed.
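
For anyone wanting to track that number between sync restarts, here is a minimal sketch that pulls the cumulative count out of the 'smartctl -a' output. The function name ata_error_count is made up for this example, and it assumes the "ATA Error Count:" line format shown in the smartctl 5.38 output quoted below; other versions may word it differently.

```shell
# Print the cumulative ATA error count from `smartctl -a` output read on stdin.
# Prints 0 when the output has no "ATA Error Count:" line.
ata_error_count() {
    awk '/^ATA Error Count:/ { print $4; found = 1; exit }
         END { if (!found) print 0 }'
}

# Usage (needs root and smartmontools installed):
#   smartctl -a /dev/sdb | ata_error_count
```

Logging this value with a timestamp after each restart would show whether the count is still climbing (3, then 24, then 60 in this case).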

It's frustrating, because this is the second time in the last month that a disk failed in a RAID1, I replaced the drive, and the 'good' drive then failed during the sync. Last time I rebuilt from scratch. I presume that is my fate this time too.

I plan to use RAID6 in the future, but I still have important servers with RAID1 arrays. Do you folks recommend replacing HDDs before they report errors? The drives are all Seagates, about 3 years old.
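
One way to catch a drive before it reports hard errors might be to watch the sector-related SMART attributes. The sketch below flags any of them with a non-zero raw value. The choice of IDs (5, 187, 197, 198) is my own illustrative pick rather than an official list, the function name smart_warnings is hypothetical, and it assumes the ten-column attribute table layout shown in the quoted smartctl output.

```shell
# Read `smartctl -A` (attribute table) output on stdin and print every
# watched attribute whose raw value (column 10) is non-zero.
# Watched IDs (an illustrative choice, not an official list):
#   5 Reallocated_Sector_Ct, 187 Reported_Uncorrect,
#   197 Current_Pending_Sector, 198 Offline_Uncorrectable
smart_warnings() {
    awk '($1 == 5 || $1 == 187 || $1 == 197 || $1 == 198) && $10 + 0 > 0 {
             printf "%s %s raw=%s\n", $1, $2, $10
         }'
}

# Usage (needs root and smartmontools installed):
#   smartctl -A /dev/sdb | smart_warnings
```

Run against the attribute table quoted below, this would flag all four attributes on /dev/sdb, even though the overall self-assessment still says PASSED.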

I should probably stop the sync. I presume the best way to do this is to fail/remove /dev/sda (the new disk).
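
Roughly what I have in mind, as a sketch only. It assumes the md1 member on the new disk is /dev/sda2 (as /proc/mdstat shows in the quoted message below), and it prints the command by default instead of running it. The md_drop_member name and the MDADM_RUN switch are invented for this example and are not part of mdadm.

```shell
# Print (dry run, the default) or execute the mdadm call that marks one
# member as failed and then removes it from the array.
# MDADM_RUN=1 is a hypothetical switch for this sketch, not an mdadm feature.
md_drop_member() {
    array=$1 member=$2
    set -- mdadm "$array" --fail "$member" --remove "$member"
    if [ "${MDADM_RUN:-0}" = 1 ]; then
        "$@"          # really run it (needs root)
    else
        echo "$@"     # dry run: only show the command
    fi
}

# Dry run for the array in question (assumed member: /dev/sda2).
md_drop_member /dev/md1 /dev/sda2
```

This is the same -f/-r pair I used to pull the old drive, just spelled with the long options.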

Thanks again!
best,
Julie
 


On 16-12-2013 22.50 -0800, Julie Ashworth wrote:
> hi,
> I have a RAID1 array (md1) with two partitions (/dev/sda2 and /dev/sdb2).
> 
> Earlier today, I replaced /dev/sda because it had errors (reported by smartd/smartctl)
> # mdadm /dev/md0 -f /dev/sda1 -r /dev/sda1
> # mdadm /dev/md1 -f /dev/sda2 -r /dev/sda2
> 
> I replaced and formatted the drive and added it to the RAID1 arrays:
> 
> # mdadm /dev/md0 -a /dev/sda1
> # mdadm /dev/md1 -a /dev/sda2
> 
> Everything looked great at first:
> # cat /proc/mdstat 
> Personalities : [raid1] 
> md0 : active raid1 sda1[0] sdb1[1]
>       521984 blocks [2/2] [UU]
>       
> md1 : active raid1 sda2[2] sdb2[1]
>       976237824 blocks [2/1] [_U]
>       [====>................]  recovery = 22.4% (219600512/976237824) finish=131.5min speed=95860K/sec
>       
> unused devices: <none>
> 
> 
> But then the sync restarted from the beginning, without reporting an error.
> 
> So, I ran:
> # smartctl -a /dev/sdb
> 
> ... which returned 3 errors.
> 
> After the second time the sync restarted, smartctl reported 24 errors on /dev/sdb. It has restarted a few times since then, but smartctl reports the same number of errors (24).
> 
> I'm enclosing the output from 'smartctl -a /dev/sdb'.
> I tried to run a short self-test, but aborted it after 10 minutes, because I was concerned that I shouldn't run a self-test while the array is rebuilding.
> 
> For what it's worth, I can't pause the sync. The command:
> 
> # echo idle > /sys/block/md1/md/sync_action
> 
> ... apparently has no effect.
> 
> Can anybody make a recommendation? I'd rather not reboot, but I have a planned outage scheduled Friday.
> 
> Thanks in advance for any help,
> Julie 
> -----------
> 
> 
>  

> smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
> 
> === START OF INFORMATION SECTION ===
> Device Model:     ST31000340NS
> Serial Number:    9QJ6Y79S
> Firmware Version: SN06
> User Capacity:    1,000,204,886,016 bytes
> Device is:        Not in smartctl database [for details use: -P showall]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 4
> Local Time is:    Mon Dec 16 22:27:54 2013 PST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x82)	Offline data collection activity
> 					was completed without error.
> 					Auto Offline Data Collection: Enabled.
> Self-test execution status:      (  22)	The self-test routine was aborted by
> 					the host.
> Total time to complete Offline 
> data collection: 		 ( 625) seconds.
> Offline data collection
> capabilities: 			 (0x7b) SMART execute Offline immediate.
> 					Auto Offline data collection on/off support.
> 					Suspend Offline collection upon new
> 					command.
> 					Offline surface scan supported.
> 					Self-test supported.
> 					Conveyance Self-test supported.
> 					Selective Self-test supported.
> SMART capabilities:            (0x0003)	Saves SMART data before entering
> 					power-saving mode.
> 					Supports SMART auto save timer.
> Error logging capability:        (0x01)	Error logging supported.
> 					General Purpose Logging supported.
> Short self-test routine 
> recommended polling time: 	 (   1) minutes.
> Extended self-test routine
> recommended polling time: 	 ( 220) minutes.
> Conveyance self-test routine
> recommended polling time: 	 (   2) minutes.
> SCT capabilities: 	       (0x103d)	SCT Status supported.
> 					SCT Feature Control supported.
> 					SCT Data Table supported.
> 
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   079   062   044    Pre-fail  Always       -       94946845
>   3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
>   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       29
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
>   7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       131642238
>   9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       29562
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       29
> 184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
> 187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
> 188 Unknown_Attribute       0x0032   100   096   000    Old_age   Always       -       42950328381
> 189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   078   060   045    Old_age   Always       -       22 (Lifetime Min/Max 18/40)
> 194 Temperature_Celsius     0x0022   022   040   000    Old_age   Always       -       22 (0 15 0 0)
> 195 Hardware_ECC_Recovered  0x001a   064   048   000    Old_age   Always       -       94946845
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 
> SMART Error Log Version: 1
> ATA Error Count: 24 (device log contains only the most recent five errors)
> 	CR = Command Register [HEX]
> 	FR = Features Register [HEX]
> 	SC = Sector Count Register [HEX]
> 	SN = Sector Number Register [HEX]
> 	CL = Cylinder Low Register [HEX]
> 	CH = Cylinder High Register [HEX]
> 	DH = Device/Head Register [HEX]
> 	DC = Device Command Register [HEX]
> 	ER = Error register [HEX]
> 	ST = Status register [HEX]
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
> 
> Error 24 occurred at disk power-on lifetime: 29559 hours (1231 days + 15 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   40 51 00 ff ff ff 0f
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 00 08 ff ff ff 4f 00  23d+11:18:28.172  READ FPDMA QUEUED
>   27 00 00 00 00 00 e0 00  23d+11:18:28.145  READ NATIVE MAX ADDRESS EXT
>   ec 00 00 00 00 00 a0 00  23d+11:18:28.143  IDENTIFY DEVICE
>   ef 03 46 00 00 00 a0 00  23d+11:18:28.130  SET FEATURES [Set transfer mode]
>   27 00 00 00 00 00 e0 00  23d+11:18:28.102  READ NATIVE MAX ADDRESS EXT
> 
> Error 23 occurred at disk power-on lifetime: 29559 hours (1231 days + 15 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   40 51 00 ff ff ff 0f
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 00 08 ff ff ff 4f 00  23d+11:18:25.024  READ FPDMA QUEUED
>   27 00 00 00 00 00 e0 00  23d+11:18:24.996  READ NATIVE MAX ADDRESS EXT
>   ec 00 00 00 00 00 a0 00  23d+11:18:24.995  IDENTIFY DEVICE
>   ef 03 46 00 00 00 a0 00  23d+11:18:24.982  SET FEATURES [Set transfer mode]
>   27 00 00 00 00 00 e0 00  23d+11:18:24.954  READ NATIVE MAX ADDRESS EXT
> 
> Error 22 occurred at disk power-on lifetime: 29559 hours (1231 days + 15 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   40 51 00 ff ff ff 0f
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 00 08 ff ff ff 4f 00  23d+11:18:21.884  READ FPDMA QUEUED
>   27 00 00 00 00 00 e0 00  23d+11:18:21.856  READ NATIVE MAX ADDRESS EXT
>   ec 00 00 00 00 00 a0 00  23d+11:18:21.855  IDENTIFY DEVICE
>   ef 03 46 00 00 00 a0 00  23d+11:18:21.841  SET FEATURES [Set transfer mode]
>   27 00 00 00 00 00 e0 00  23d+11:18:21.814  READ NATIVE MAX ADDRESS EXT
> 
> Error 21 occurred at disk power-on lifetime: 29559 hours (1231 days + 15 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   40 51 00 ff ff ff 0f
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 00 08 ff ff ff 4f 00  23d+11:18:18.752  READ FPDMA QUEUED
>   27 00 00 00 00 00 e0 00  23d+11:18:18.724  READ NATIVE MAX ADDRESS EXT
>   ec 00 00 00 00 00 a0 00  23d+11:18:18.723  IDENTIFY DEVICE
>   ef 03 46 00 00 00 a0 00  23d+11:18:18.710  SET FEATURES [Set transfer mode]
>   27 00 00 00 00 00 e0 00  23d+11:18:18.682  READ NATIVE MAX ADDRESS EXT
> 
> Error 20 occurred at disk power-on lifetime: 29559 hours (1231 days + 15 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   40 51 00 ff ff ff 0f
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 00 08 ff ff ff 4f 00  23d+11:18:15.645  READ FPDMA QUEUED
>   27 00 00 00 00 00 e0 00  23d+11:18:15.617  READ NATIVE MAX ADDRESS EXT
>   ec 00 00 00 00 00 a0 00  23d+11:18:15.616  IDENTIFY DEVICE
>   ef 03 46 00 00 00 a0 00  23d+11:18:15.603  SET FEATURES [Set transfer mode]
>   27 00 00 00 00 00 e0 00  23d+11:18:15.575  READ NATIVE MAX ADDRESS EXT
> 
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Aborted by host               60%     29560         -
> 
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> 

---end quoted text---
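
To spot a sync that keeps restarting, the recovery percentage in the /proc/mdstat output quoted above can be sampled periodically; if it drops back toward zero, the sync has started over. A minimal sketch follows. The function name md_sync_progress is hypothetical, and it assumes the "recovery = NN.N%" progress-line layout shown in the quoted output.

```shell
# Print "<array> <action> <percent>" for each array in mdstat-format text
# (stdin) that is currently recovering or resyncing.
md_sync_progress() {
    awk '/^md[0-9]+ :/ { array = $1 }
         / (recovery|resync) = / {
             for (i = 2; i < NF; i++)
                 if ($i == "=") { print array, $(i - 1), $(i + 1); break }
         }'
}

# Usage:
#   md_sync_progress < /proc/mdstat
```

Arrays that are fully in sync (like md0 above) produce no output, so an empty result means nothing is rebuilding.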
