From mboxrd@z Thu Jan 1 00:00:00 1970
From: Adam Goryachev
Subject: Re: RAID1 degraded
Date: Tue, 04 Aug 2015 16:02:57 +1000
Message-ID: <55C05591.4090306@websitemanagers.com.au>
References: <55BFFDD3.5000005@websitemanagers.com.au> <478DB4ED-9FAB-4035-A482-0BC11046B6C2@me.com> <55C0409E.5010004@websitemanagers.com.au>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
In-Reply-To:
Sender: linux-raid-owner@vger.kernel.org
To: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

On 04/08/15 15:51, Hans Malissa wrote:
> Thanks a lot for your help!
> Rebooting the system didn't solve the problem, /dev/sdc is still
> nowhere to be found.
> So I will have to replace /dev/sdc.
> I tried to learn a bit about SCT/ERC from list archives, and it seems
> like my hard drives (1TB Seagate Barracudas) don't support this option:
>
> # smartctl -l scterc /dev/sdb
> smartctl 6.0 2012-10-10 r3643 [x86_64-linux-3.7.10-1.45-desktop] (SUSE RPM)
> Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control command not supported
>
> /dev/sdc is (was) of exactly the same type, so it wouldn't support
> SCT/ERC either.
> This doesn't seem to be the problem here, since the drive has just
> disappeared.

No, that is true, but it is always good to learn about an issue before
it becomes a problem.

> But I will certainly take this into account when buying a replacement
> drive. Any current recommendations about what would work best in a
> RAID1 instead of a 1TB Seagate Barracuda?

My personal preference was WD Black drives, or else Enterprise Black,
but they were always a lot more expensive. I think WD Red are "RAID
Certified" these days. (Note: I mostly use SSDs now rather than any brand
of HDD, so I don't have much recent experience.)

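For the SCT/ERC point above, here is a minimal sketch of how the result of `smartctl -l scterc` could be turned into a decision. It is only an illustration: the function name `erc_plan` is made up, the 7.0 second ERC setting and the 180 second kernel timeout are commonly suggested values rather than anything stated in this thread, and the function only prints the command it would suggest instead of running it.

```shell
#!/bin/sh
# erc_plan (hypothetical helper): read `smartctl -l scterc` output on stdin
# and print the command one might run for the given disk. Nothing is changed.
erc_plan() {
    dev="$1"    # kernel name without /dev/, e.g. sdb
    if grep -q "SCT Error Recovery Control command not supported"; then
        # The drive cannot limit its own error recovery time, so the usual
        # workaround is a longer kernel command timeout (180 s is an assumption).
        echo "echo 180 > /sys/block/$dev/device/timeout"
    else
        # The drive supports SCT ERC: cap read and write recovery at 7.0 s
        # (the values are in units of 100 ms).
        echo "smartctl -l scterc,70,70 /dev/$dev"
    fi
}

# Example using the message quoted above for /dev/sdb:
printf '%s\n' "SCT Error Recovery Control command not supported" | erc_plan sdb
```

Either setting may need to be reapplied at boot (for example from a boot script), since it may not survive a power cycle.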
> Just to make sure I understand correctly how to replace /dev/sdc and
> repair my RAID1, the steps to do would be:
>
> 1. Shutdown PC
> 2. Replace /dev/sdc
> 3. Restart computer
> 4. Partition the new /dev/sdc
> 5. Run # mdadm --manage /dev/md0 --add /dev/sdc1
> 6. Wait for synchronization to finish
>
> Did I get this right? Am I missing anything? Are there additional
> steps (I am backing my data up, anyway) that I can take to maximize my
> chance for success?

Yep, all sounds good. Just make a note of the serial number for sdb
before you shut down, and ensure you are removing the correct drive. The
good thing with RAID1 is that it is difficult to really screw it up, but
backups are *always* a good idea :)

Regards,
Adam

> Thanks a lot,
>
> Hans
>
> On Aug 3, 2015, at 10:33 PM, Adam Goryachev wrote:
>
>> On 04/08/15 14:16, Hans Malissa wrote:
>>> Thanks a lot for your help!
>>> smartctl yields the following information (details see below):
>>> /dev/sdb looks ok, but /dev/sdc seems to have quite a problem.
>>> /dev/sdc seems nonexistent, it's not even in /dev/ anymore. The disk
>>> is physically present, but that's about it.
>>> The kernel logs contain a lot of information; what should I be
>>> looking for?
>>
>> The logs should contain information on why or what happened when the
>> disk (sdc) vanished. In your case, it does indeed look like sdc has
>> failed, so you have a number of options depending on your preference:
>> 1) Simply reboot (including a complete power off) the machine, and
>> see if sdc comes back. If it does, do some tests, and then add it back
>> to the array. If it survives, then carry on as normal.
>>
>> 2) If you are more cautious (and more prepared to spend the money
>> rather than risk the data), then purchase a replacement disk, and
>> replace sdc with the new disk. Prepare the drive/partition, and add
>> it to the raid array.
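
Preparing the new disk and adding it to the array, as in option 2, might look roughly like the sketch below. It is a dry run that only prints commands. The helper names `serial_of` and `replace_plan` are invented for illustration, the device names are the ones used in this thread, and copying the partition table with `sfdisk` assumes the surviving disk uses an MBR-style table that the installed sfdisk can dump and restore.

```shell
#!/bin/sh
# serial_of (hypothetical helper): print the serial number from
# `smartctl -i` output on stdin, to match against the drive's label
# before pulling it from the case.
serial_of() {
    awk -F': *' '/^Serial Number:/ { print $2; exit }'
}

# replace_plan (hypothetical helper): print, but do not run, the commands
# for re-adding a replaced disk. Arguments: array, surviving disk, new disk.
replace_plan() {
    array="$1"; good="$2"; new="$3"
    echo "sfdisk -d $good | sfdisk $new"          # copy the partition layout (assumes MBR)
    echo "mdadm --manage $array --add ${new}1"    # add the new partition to the array
    echo "cat /proc/mdstat"                       # check resync progress
}

printf 'Serial Number:    Z4Y6N2J3\n' | serial_of
replace_plan /dev/md0 /dev/sdb /dev/sdc
```

Review the printed commands against your own device names before running any of them as root.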
>>
>> Please make sure you "Research SCT/ERC on this list"!!! before
>> purchasing the replacement drive. It is far better to buy the right
>> drive if possible.
>>
>> Regards,
>> Adam
>>
>>> Thanks a lot,
>>>
>>> Hans
>>>
>>> # smartctl -a /dev/sdb
>>> smartctl 6.0 2012-10-10 r3643 [x86_64-linux-3.7.10-1.45-desktop] (SUSE RPM)
>>> Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
>>>
>>> === START OF INFORMATION SECTION ===
>>> Model Family:     Seagate Barracuda 7200.14 (AF)
>>> Device Model:     ST1000DM003-1ER162
>>> Serial Number:    Z4Y6N2J3
>>> LU WWN Device Id: 5 000c50 07afe5c18
>>> Firmware Version: CC45
>>> User Capacity:    1,000,204,886,016 bytes [1.00 TB]
>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>> Rotation Rate:    7200 rpm
>>> Device is:        In smartctl database [for details use: -P show]
>>> ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
>>> SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
>>> Local Time is:    Mon Aug 3 21:52:32 2015 MDT
>>>
>>> ==> WARNING: A firmware update for this drive may be available,
>>> see the following Seagate web pages:
>>> http://knowledge.seagate.com/articles/en_US/FAQ/207931en
>>> http://knowledge.seagate.com/articles/en_US/FAQ/223651en
>>>
>>> SMART support is: Available - device has SMART capability.
>>> SMART support is: Enabled
>>>
>>> === START OF READ SMART DATA SECTION ===
>>> SMART overall-health self-assessment test result: PASSED
>>>
>>> General SMART Values:
>>> Offline data collection status:  (0x00) Offline data collection activity
>>>                                         was never started.
>>>                                         Auto Offline Data Collection: Disabled.
>>> Self-test execution status:      (   0) The previous self-test routine completed
>>>                                         without error or no self-test has ever
>>>                                         been run.
>>> Total time to complete Offline
>>> data collection:                 (  80) seconds.
>>> Offline data collection
>>> capabilities:                    (0x73) SMART execute Offline immediate.
>>>                                         Auto Offline data collection on/off support.
>>>                                         Suspend Offline collection upon new
>>>                                         command.
>>>                                         No Offline surface scan supported.
>>>                                         Self-test supported.
>>>                                         Conveyance Self-test supported.
>>>                                         Selective Self-test supported.
>>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>>                                         power-saving mode.
>>>                                         Supports SMART auto save timer.
>>> Error logging capability:        (0x01) Error logging supported.
>>>                                         General Purpose Logging supported.
>>> Short self-test routine
>>> recommended polling time:        (   1) minutes.
>>> Extended self-test routine
>>> recommended polling time:        ( 105) minutes.
>>> Conveyance self-test routine
>>> recommended polling time:        (   2) minutes.
>>> SCT capabilities:              (0x1085) SCT Status supported.
>>>
>>> SMART Attributes Data Structure revision number: 10
>>> Vendor Specific SMART Attributes with Thresholds:
>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>>   1 Raw_Read_Error_Rate     0x000f   111   100   006    Pre-fail  Always       -       39301104
>>>   3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
>>>   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       20
>>>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>>>   7 Seek_Error_Rate         0x000f   063   060   030    Pre-fail  Always       -       2152462
>>>   9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1872
>>>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
>>>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       20
>>> 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
>>> 184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
>>> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
>>> 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
>>> 189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
>>> 190 Airflow_Temperature_Cel 0x0022   068   064   045    Old_age   Always       -       32 (Min/Max
>>> 26/35)
>>> 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
>>> 192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
>>> 193 Load_Cycle_Count        0x0032   093   093   000    Old_age   Always       -       15119
>>> 194 Temperature_Celsius     0x0022   032   040   000    Old_age   Always       -       32 (0 19 0 0 0)
>>> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
>>> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
>>> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
>>> 240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       662h+04m+56.474s
>>> 241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2212066311
>>> 242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4204083236
>>>
>>> SMART Error Log Version: 1
>>> No Errors Logged
>>>
>>> SMART Self-test log structure revision number 1
>>> No self-tests have been logged. [To run self-tests, use: smartctl -t]
>>>
>>>
>>> SMART Selective self-test log data structure revision number 1
>>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>>     1        0        0  Not_testing
>>>     2        0        0  Not_testing
>>>     3        0        0  Not_testing
>>>     4        0        0  Not_testing
>>>     5        0        0  Not_testing
>>> Selective self-test flags (0x0):
>>>   After scanning selected spans, do NOT read-scan remainder of disk.
>>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>>
>>> # smartctl -a /dev/sdc
>>> smartctl 6.0 2012-10-10 r3643 [x86_64-linux-3.7.10-1.45-desktop] (SUSE RPM)
>>> Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
>>>
>>> Smartctl open device: /dev/sdc failed: No such device
>>>
>>> On Aug 3, 2015, at 5:48 PM, Adam Goryachev wrote:
>>>
>>>> On 04/08/15 08:18, Hans Malissa wrote:
>>>>> Hi everybody,
>>>>>
>>>>> It looks like one of my disks in my RAID1 just failed:
>>>>>
>>>>> [SNIP]
>>>>>
>>>>> Are there any other tests I could run in order to figure out
>>>>> what's going on?
>>>>> It looks like I will have to replace /dev/sdc1
>>>>> with a new hard drive. What is the correct procedure to do so
>>>>> without losing my data?
>>>>>
>>>> Have a look at dmesg or your system kernel logs for details.
>>>> Also, use smartctl to examine what the drive itself thinks.
>>>> Also, try to use dd to read/write the drive.
>>>>
>>>> One common scenario is that you haven't configured the timing for
>>>> the drive correctly, and the drive is working perfectly, but didn't
>>>> respond to the kernel quickly enough. Research SCT/ERC on this list.
>>>>
>>>> Regards,
>>>> Adam
>>>> --
>>>> Adam Goryachev Website Managers www.websitemanagers.com.au
>>
>>
>> --
>> Adam Goryachev Website Managers www.websitemanagers.com.au
>>
>

-- 
Adam Goryachev Website Managers www.websitemanagers.com.au
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html