* Uncorrectable errors on RAID-1?
@ 2014-12-21 19:34 constantine
2014-12-21 21:56 ` Robert White
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: constantine @ 2014-12-21 19:34 UTC (permalink / raw)
To: linux-btrfs
Some months ago I had 6 uncorrectable errors. I deleted the files that
contained them, and after scrubbing I had 0 uncorrectable errors.
A few weeks later I encountered new uncorrectable errors.
Question 1:
Why do I have uncorrectable errors on a RAID-1 filesystem in the first place?
Question 2:
How do I properly correct them? (Again by deleting their files? :( )
Question 3:
How do I prevent this from happening?
Thanks a lot!
constantine
PS.
The disks can be considered old (some with > 15000 hrs online), but
SMART long tests complete without errors. I have this filesystem:
# btrfs fi show /mnt/thefilesystem
Label: 'thefilesystem' uuid: 1d1d0850-d1bc-4c76-96a1-17d168ff2431
Total devices 5 FS bytes used 6.11TiB
devid 1 size 2.73TiB used 2.63TiB path /dev/sda1
devid 2 size 3.64TiB used 3.54TiB path /dev/sdg1
devid 3 size 1.82TiB used 1.72TiB path /dev/sdd1
devid 4 size 1.82TiB used 1.72TiB path /dev/sdc1
devid 5 size 2.73TiB used 2.63TiB path /dev/sdh1
Btrfs v3.17.3
# btrfs fi df /mnt/thefilesystem
Data, RAID1: total=6.10TiB, used=6.10TiB
System, RAID1: total=32.00MiB, used=896.00KiB
Metadata, RAID1: total=10.00GiB, used=8.98GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
===================
SMART information from each of the disks:
# for i in a g d c h ; do smartctl -A /dev/sd$i; done
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f  200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027  177   175   021    Pre-fail Always  -           6108
  4 Start_Stop_Count        0x0032  100   100   000    Old_age  Always  -           201
  5 Reallocated_Sector_Ct   0x0033  200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e  200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032  093   093   000    Old_age  Always  -           5836
 10 Spin_Retry_Count        0x0032  100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032  100   100   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032  100   100   000    Old_age  Always  -           185
192 Power-Off_Retract_Count 0x0032  200   200   000    Old_age  Always  -           118
193 Load_Cycle_Count        0x0032  189   189   000    Old_age  Always  -           33154
194 Temperature_Celsius     0x0022  114   098   000    Old_age  Always  -           36
196 Reallocated_Event_Count 0x0032  200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032  200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030  200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032  200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008  200   200   000    Old_age  Offline -           0
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f  200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027  179   175   021    Pre-fail Always  -           8050
  4 Start_Stop_Count        0x0032  100   100   000    Old_age  Always  -           141
  5 Reallocated_Sector_Ct   0x0033  200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e  200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032  094   094   000    Old_age  Always  -           4842
 10 Spin_Retry_Count        0x0032  100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032  100   100   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032  100   100   000    Old_age  Always  -           140
192 Power-Off_Retract_Count 0x0032  200   200   000    Old_age  Always  -           91
193 Load_Cycle_Count        0x0032  194   194   000    Old_age  Always  -           18614
194 Temperature_Celsius     0x0022  114   100   000    Old_age  Always  -           38
196 Reallocated_Event_Count 0x0032  200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032  200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030  200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032  200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008  200   200   000    Old_age  Offline -           0
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  102   099   006    Pre-fail Always  -           4738696
  3 Spin_Up_Time            0x0003  092   092   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032  100   100   020    Old_age  Always  -           836
  5 Reallocated_Sector_Ct   0x0033  100   100   036    Pre-fail Always  -           144
  7 Seek_Error_Rate         0x000f  078   060   030    Pre-fail Always  -           69594766
  9 Power_On_Hours          0x0032  077   077   000    Old_age  Always  -           20554
 10 Spin_Retry_Count        0x0013  100   100   097    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032  100   100   020    Old_age  Always  -           721
183 Runtime_Bad_Block       0x0032  092   092   000    Old_age  Always  -           8
184 End-to-End_Error        0x0032  100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032  100   100   000    Old_age  Always  -           0
188 Command_Timeout         0x0032  100   099   000    Old_age  Always  -           14
189 High_Fly_Writes         0x003a  097   097   000    Old_age  Always  -           3
190 Airflow_Temperature_Cel 0x0022  068   042   045    Old_age  Always  In_the_past 32 (0 15 39 23 0)
191 G-Sense_Error_Rate      0x0032  100   100   000    Old_age  Always  -           0
192 Power-Off_Retract_Count 0x0032  100   100   000    Old_age  Always  -           320
193 Load_Cycle_Count        0x0032  100   100   000    Old_age  Always  -           947
194 Temperature_Celsius     0x0022  032   058   000    Old_age  Always  -           32 (0 13 0 0 0)
195 Hardware_ECC_Recovered  0x001a  014   003   000    Old_age  Always  -           4738696
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always  -           0
240 Head_Flying_Hours       0x0000  100   253   000    Old_age  Offline -           19390 (116 2 0)
241 Total_LBAs_Written      0x0000  100   253   000    Old_age  Offline -           2165686930
242 Total_LBAs_Read         0x0000  100   253   000    Old_age  Offline -           1913785108
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f  200   200   051    Pre-fail Always  -           1
  3 Spin_Up_Time            0x0027  182   178   021    Pre-fail Always  -           5900
  4 Start_Stop_Count        0x0032  100   100   000    Old_age  Always  -           310
  5 Reallocated_Sector_Ct   0x0033  200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e  200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032  086   086   000    Old_age  Always  -           10839
 10 Spin_Retry_Count        0x0032  100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032  100   100   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032  100   100   000    Old_age  Always  -           275
192 Power-Off_Retract_Count 0x0032  200   200   000    Old_age  Always  -           175
193 Load_Cycle_Count        0x0032  123   123   000    Old_age  Always  -           233706
194 Temperature_Celsius     0x0022  120   102   000    Old_age  Always  -           30
196 Reallocated_Event_Count 0x0032  200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032  200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030  200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032  200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008  200   200   000    Old_age  Offline -           0
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  117   099   006    Pre-fail Always  -           154070800
  3 Spin_Up_Time            0x0003  094   093   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032  100   100   020    Old_age  Always  -           198
  5 Reallocated_Sector_Ct   0x0033  100   100   010    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f  077   060   030    Pre-fail Always  -           4346841135
  9 Power_On_Hours          0x0032  090   090   000    Old_age  Always  -           9283
 10 Spin_Retry_Count        0x0013  100   100   097    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032  100   100   020    Old_age  Always  -           185
183 Runtime_Bad_Block       0x0032  100   100   000    Old_age  Always  -           0
184 End-to-End_Error        0x0032  100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032  100   100   000    Old_age  Always  -           0
188 Command_Timeout         0x0032  100   100   000    Old_age  Always  -           0 0 0
189 High_Fly_Writes         0x003a  098   098   000    Old_age  Always  -           2
190 Airflow_Temperature_Cel 0x0022  065   046   045    Old_age  Always  -           35 (Min/Max 23/45)
191 G-Sense_Error_Rate      0x0032  100   100   000    Old_age  Always  -           0
192 Power-Off_Retract_Count 0x0032  100   100   000    Old_age  Always  -           129
193 Load_Cycle_Count        0x0032  098   098   000    Old_age  Always  -           5879
194 Temperature_Celsius     0x0022  035   054   000    Old_age  Always  -           35 (0 19 0 0 0)
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always  -           0
240 Head_Flying_Hours       0x0000  100   253   000    Old_age  Offline -           8753h+05m+40.278s
241 Total_LBAs_Written      0x0000  100   253   000    Old_age  Offline -           36640474598
242 Total_LBAs_Read         0x0000  100   253   000    Old_age  Offline -           94882096088
^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Uncorrectable errors on RAID-1?
  2014-12-21 19:34 Uncorrectable errors on RAID-1? constantine
@ 2014-12-21 21:56 ` Robert White
  2014-12-21 22:17   ` Hugo Mills
  2014-12-22  0:25 ` Chris Murphy
  [not found] ` <CAJCQCtQYhaDEic5bwd+PEcEfwOqLwAe8cT8VPZ9je+JLRP1GPw@mail.gmail.com>
  2 siblings, 1 reply; 18+ messages in thread
From: Robert White @ 2014-12-21 21:56 UTC (permalink / raw)
To: constantine, linux-btrfs

On 12/21/2014 11:34 AM, constantine wrote:
> Some months ago I had 6 uncorrectable errors. I deleted the files that
> contained them, and after scrubbing I had 0 uncorrectable errors.
> A few weeks later I encountered new uncorrectable errors.
>
> Question 1:
> Why do I have uncorrectable errors on a RAID-1 filesystem in the first place?

These are disk/platter/hardware errors. They happen for one of two
reasons. Most likely, there is a flaw, new or existing, on the platter
itself, and data just cannot live in that spot. Less likely, you
suffered an environmental hazard (a hard jolt) while a sector was being
written and the drive is choking on the digital wreckage.

> Question 2:
> How do I properly correct them? (Again by deleting their files? :( )

You have to _force_ the drive to rewrite the sector. If the disk can
correct the sector (not a hardware flaw), the problem goes away forever.
If it can't, the drive will remap the sector onto a spare sector and it
will seem to go away forever.

Here is a decent tutorial:

http://smartmontools.sourceforge.net/badblockhowto.html

Which variant of the procedure you need varies by hardware, so read the
whole thing. _BUT_ on my system I had to use hdparm to write the
sectors instead of just using dd. Math is involved to find the LBA, and
you have to use the "yes, I really know what I am doing" option to
force the write at the low level.

[Quick version: run smartctl --test=long (or a selective test if you
know the range). The test will stop on the read error. Force-write the
"LBA of first error" block with hdparm (or use the sg-spare route).
Repeat until the long test can read the entire drive.]

My current smartctl --all /dev/sda shows that recent remapping exercise:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%      36605        -
# 2  Selective offline   Completed without error        00%      36603        -
# 3  Selective offline   Aborted by host                90%      36603        -
# 4  Selective offline   Completed without error        00%      36603        -
# 5  Selective offline   Completed: read failure        90%      36603        19530186
# 6  Selective offline   Completed: read failure        90%      36603        19530182
# 7  Extended offline    Completed: read failure        90%      36602        19530182
# 8  Extended offline    Completed: read failure        90%      36602        19530182
# 9  Extended offline    Completed: read failure        90%      36592        19530182
#10  Extended offline    Completed: read failure        90%      36094        19530182
#11  Extended offline    Completed without error        00%       4222        -
6 of 6 failed self-tests are outdated by newer successful extended offline self-test # 1
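As a rough sketch of that loop (the device name and LBA here are
placeholders, not values from this thread, and your hdparm has to
support the raw sector commands):

# smartctl -t long /dev/sda                  # kick off the extended self-test
# smartctl -l selftest /dev/sda              # note LBA_of_first_error, e.g. 19530182
# hdparm --read-sector 19530182 /dev/sda     # confirm the sector really fails
# hdparm --yes-i-know-what-i-am-doing --write-sector 19530182 /dev/sda
# smartctl -t long /dev/sda                  # repeat until the test completes clean

Keep in mind that --write-sector zeroes the sector, so only do it once
you know which file (if any) owns that block, or once a scrub can
restore the data from the mirror copy.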
The good news is that since you are using RAID1 and checksums, you
shouldn't need to delete any files. Just coerce the write, then btrfs
scrub the filesystem, and the checksum/rewrite machinery should recover
the degraded copy from the good copy in the mirror.

> Question 3:
> How do I prevent this from happening?

If the disk only shows an error or two, it's probably still in normal
range. If you have to spare out a lot of sectors, the disk may be
reaching end-of-life and likely needs replacing.

ALL DISKS FAIL EVENTUALLY, so you don't "prevent it from happening".
You use RAID1 (etc.) and backups, and you periodically run the tests
and check the output. That is, you can't prevent eventual disk loss;
your job is to prevent data loss. So good on you for the RAID1.

> Thanks a lot!
>
> constantine
>
> PS.
> The disks can be considered old (some with > 15000 hrs online), but
> SMART long tests complete without errors. I have this filesystem:

I don't see the SMART test results in any of these blocks. Are you sure
you are looking at the correct part of the output? If you are trying to
show us tests completing without errors, you should be showing us the
table after the heading "SMART Self-test log structure revision number
1". See the smartctl --all and/or --xall output (_lower_ _case_ "a" or
"x", not upper case "A" for "attributes"); the test results will be
near the bottom. The "attribute" section is interesting but not
dispositive of recent test results; it only shows non-test event
counters.
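To pull that table for all five drives at once, something along the
lines of your -A loop should work (assuming the same device letters):

# for i in a g d c h ; do smartctl -l selftest /dev/sd$i ; done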
the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-21 21:56 ` Robert White
@ 2014-12-21 22:17   ` Hugo Mills
  0 siblings, 0 replies; 18+ messages in thread
From: Hugo Mills @ 2014-12-21 22:17 UTC (permalink / raw)
To: Robert White; +Cc: constantine, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2711 bytes --]

On Sun, Dec 21, 2014 at 01:56:54PM -0800, Robert White wrote:
> On 12/21/2014 11:34 AM, constantine wrote:
> > Some months ago I had 6 uncorrectable errors. I deleted the files that
> > contained them, and after scrubbing I had 0 uncorrectable errors.
> > A few weeks later I encountered new uncorrectable errors.
> >
> > Question 1:
> > Why do I have uncorrectable errors on a RAID-1 filesystem in the first place?
>
> These are disk/platter/hardware errors. They happen for one of two
> reasons. Most likely, there is a flaw, new or existing, on the platter
> itself, and data just cannot live in that spot. Less likely, you
> suffered an environmental hazard (a hard jolt) while a sector was
> being written and the drive is choking on the digital wreckage.
>
> > Question 2:
> > How do I properly correct them? (Again by deleting their files? :( )
>
> You have to _force_ the drive to rewrite the sector. If the disk can
> correct the sector (not a hardware flaw), the problem goes away
> forever. If it can't, the drive will remap the sector onto a spare
> sector and it will seem to go away forever.

   Note that one of the drives already has reallocated sectors, so
it's on its way to failing, and you should start saving up your
pennies for a new one now, even if it hasn't gone properly boom yet.
However, that doesn't explain on its own why you're getting
unrecoverable errors -- the FS should be able to deal with that.

[snip]

> The good news is that since you are using RAID1 and checksums, you
> shouldn't need to delete any files. Just coerce the write, then btrfs
> scrub the filesystem, and the checksum/rewrite machinery should
> recover the degraded copy from the good copy in the mirror.

   If btrfs detects a checksum error, it will try to fix it by reading
the other copy and then writing good data over the broken copy. You
don't have to force a write to the FS to make it fix broken data this
way. A scrub will do this check-and-repair on all content of the
filesystem.

   If the FS is reporting uncorrectable errors, then it's tried both
copies and both fail their checksums. This is basically not fixable
without removing the files and replacing them with copies from your
backup. It's not obvious why you've got correlated errors on two
devices, though, and I'm not sure how to work it out. I'd suggest
running the full SMART tests on the disks, running a scrub on the FS,
and checking your logs for SATA errors and similar problems.

   Hugo.

[snip]

-- 
Hugo Mills             | I must be musical: I've got *loads* of CDs
hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: 65E74AC0 |
                       |                           Fran, Black Books

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread
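A concrete sketch of the test-and-scrub pass Hugo suggests, using the
device letters and mountpoint from the original post (the grep pattern
is only illustrative):

# for i in a g d c h ; do smartctl -t long /dev/sd$i ; done   # in-drive tests, take hours
# btrfs scrub start /mnt/thefilesystem
# btrfs scrub status /mnt/thefilesystem                       # error counts so far
# dmesg | grep -iE 'ata[0-9]|csum|i/o error'                  # SATA resets, csum failures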
* Re: Uncorrectable errors on RAID-1?
  2014-12-21 19:34 Uncorrectable errors on RAID-1? constantine
  2014-12-21 21:56 ` Robert White
@ 2014-12-22  0:25 ` Chris Murphy
  2014-12-23 21:16   ` Zygo Blaxell
  [not found] ` <CAJCQCtQYhaDEic5bwd+PEcEfwOqLwAe8cT8VPZ9je+JLRP1GPw@mail.gmail.com>
  2 siblings, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2014-12-22 0:25 UTC (permalink / raw)
To: Btrfs BTRFS

On Sun, Dec 21, 2014 at 12:34 PM, constantine <costas.magnuse@gmail.com> wrote:
> Some months ago I had 6 uncorrectable errors. I deleted the files that
> contained them, and after scrubbing I had 0 uncorrectable errors.
> A few weeks later I encountered new uncorrectable errors.
>
> Question 1:
> Why do I have uncorrectable errors on a RAID-1 filesystem in the first place?
>
> Question 2:
> How do I properly correct them? (Again by deleting their files? :( )
>
> Question 3:
> How do I prevent this from happening?

There are multiple kinds of uncorrectable errors, so it depends on the
exact error. If Btrfs is reporting uncorrectable errors, that suggests
both copies are bad.

Whether md, LVM, or Btrfs raid, make sure the value from

  cat /sys/block/sdX/device/timeout

is larger than the value reported by

  smartctl -l scterc /dev/sdX

Note that the units for the first command are seconds; the units for
the second command are deciseconds.
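For example (sdX is a placeholder; the scterc set step only works on
drives that support SCT ERC):

# smartctl -l scterc /dev/sdX                 # drive's recovery limit, in deciseconds
# cat /sys/block/sdX/device/timeout           # kernel's SCSI command timer, in seconds
# smartctl -l scterc,70,70 /dev/sdX           # if supported, cap recovery at 7.0 seconds
# echo 120 > /sys/block/sdX/device/timeout    # otherwise raise the kernel timer instead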
For the kernel to automatically fix bad sectors by overwriting them,
the drive needs to explicitly report read errors. If the SCSI command
timer value is shorter than the drive's error recovery, the SATA link
might get reset before the drive reports the read error, and then
uncorrected errors will persist instead of being automatically fixed.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Uncorrectable errors on RAID-1?
  2014-12-22  0:25 ` Chris Murphy
@ 2014-12-23 21:16   ` Zygo Blaxell
  2014-12-23 22:09     ` Chris Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: Zygo Blaxell @ 2014-12-23 21:16 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 979 bytes --]

On Sun, Dec 21, 2014 at 05:25:47PM -0700, Chris Murphy wrote:
> For the kernel to automatically fix bad sectors by overwriting them,
> the drive needs to explicitly report read errors. If the SCSI command
> timer value is shorter than the drive's error recovery, the SATA link
> might get reset before the drive reports the read error, and then
> uncorrected errors will persist instead of being automatically fixed.

Is there a way to tell the kernel to go ahead and assume that all
timeouts are effectively read errors? For a simple non-removable hard
disk (i.e. not removable and not optical), that seems like a reasonable
workaround for an assortment of firmware brokenness.

I just did a quick survey of random drives here and found less than 10%
support "smartctl -l scterc". A lot of server drives (or at least the
drives that shipped in servers) don't have it, but laptop drives do.
Drives with firmware that has horrifying known bugs do also have this
feature. :-P

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-23 21:16 ` Zygo Blaxell
@ 2014-12-23 22:09   ` Chris Murphy
  2014-12-23 22:23     ` Chris Murphy
  2014-12-28  3:12     ` Phillip Susi
  0 siblings, 2 replies; 18+ messages in thread
From: Chris Murphy @ 2014-12-23 22:09 UTC (permalink / raw)
To: Btrfs BTRFS

On Tue, Dec 23, 2014 at 2:16 PM, Zygo Blaxell <zblaxell@furryterror.org> wrote:
> On Sun, Dec 21, 2014 at 05:25:47PM -0700, Chris Murphy wrote:
>> For the kernel to automatically fix bad sectors by overwriting them,
>> the drive needs to explicitly report read errors. If the SCSI command
>> timer value is shorter than the drive's error recovery, the SATA link
>> might get reset before the drive reports the read error, and then
>> uncorrected errors will persist instead of being automatically fixed.
>
> Is there a way to tell the kernel to go ahead and assume that all
> timeouts are effectively read errors?

The timer in /sys is a kernel command timer; it's not a device timer,
even though it's pointed at a block device. You need to change that
from 30 to something higher to get the behavior you want. It doesn't
really make sense to say "time out in 30 seconds, but instead of
reporting a timeout, report it as a read error". They're completely
different things.

There are all sorts of errors listed in libata, so for all of them to
get dumped into a read error doesn't make sense. A lot of those errors
don't report back a sector, and the key part of the read error is
which sector(s) have the problem so that they can be fixed. Without
that information, the ability to fix it is lost. And it's the drive
that needs to report this.

> For a simple non-removable hard disk (i.e. not removable and not
> optical), that seems like a reasonable workaround for an assortment
> of firmware brokenness.

Oven doesn't work, so let's spray gasoline on it and light it and the
kitchen on fire so that we can cook this damn pizza! That's what I
just read. Sorry. It doesn't seem like a good idea to me to map all
errors as read errors.

> I just did a quick survey of random drives here and found less than 10%
> support "smartctl -l scterc". A lot of server drives (or at least the
> drives that shipped in servers) don't have it, but laptop drives do.
> Drives with firmware that has horrifying known bugs do also have this
> feature. :-P

Any decent server SATA drive should support SCT ERC. The inexpensive
WDC Red drives for NAS use all have it, and by default it is a
reasonable 70 deciseconds last time I checked.

It might be that you're using SAS drives? In that case they may have
something other than SCT ERC that serves the same purpose, but I don't
have any SAS drives here to check. I'd expect any SAS drive to already
have short error recoveries by default, but that expectation might be
flawed.

Chris Murphy

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-23 22:09 ` Chris Murphy
@ 2014-12-23 22:23   ` Chris Murphy
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Murphy @ 2014-12-23 22:23 UTC (permalink / raw)
To: Btrfs BTRFS

The other thing to note is that the SCSI command timer is a maximum:
if a command to the drive hasn't completed within 30 seconds, the
drive is considered hung and the link is reset. Whatever error
recovery is in the drive is also a maximum. If the sector is plainly
bad, the drive will produce a read error immediately. The case where
you get these long recoveries, where the drive keeps retrying beyond
the 30-second SCSI command timer, is when the drive firmware's ECC
thinks it can recover (or reconstruct) the data instead of producing a
read error.

A gotcha with changing the SCSI command timer to a much larger value
is that it possibly gives the drive enough time to recover the data,
report it back to the kernel, and then everything goes on normally.
The "slow sector" doesn't get fixed. Even a scrub wouldn't fix that,
unless the drive returned wrongly recovered data and the Btrfs
checksums catch it.

So what you want to do with a drive that has, or is suspected of
having, such slow sectors is to balance it. Rewrite everything. That
should cause the drive firmware to map out those sectors if they
result in persistent write errors. What ought to happen is that the
data from slow sectors, once recovered, gets written to a reserve
sector and the old sector is removed from use (remapping: the LBA
stays the same but the physical sector is different), but every drive
firmware handles this differently. I have definitely had drives where
this doesn't happen automatically.

Also, I've had drives that, when ATA Secure Erased, did not test for
persistent write errors, so bad sectors weren't removed from use; they
remained persistently bad when doing smartctl -t long tests. In those
cases, using badblocks -w fixed the problem, but of course that's
destructive.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 18+ messages in thread
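A sketch of that rewrite-everything pass for the filesystem in this
thread, plus the destructive whole-device fallback (badblocks -w
erases every sector, so only run it on a drive already removed from
the array; sdX is a placeholder):

# btrfs balance start /mnt/thefilesystem               # rewrites every chunk, data and metadata
# btrfs balance status /mnt/thefilesystem
# badblocks -w -s /dev/sdX                             # DESTRUCTIVE write test of the whole device
# smartctl -A /dev/sdX | grep -iE 'realloc|pending'    # check whether sectors were remapped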
* Re: Uncorrectable errors on RAID-1?
  2014-12-23 22:09 ` Chris Murphy
  2014-12-23 22:23   ` Chris Murphy
@ 2014-12-28  3:12   ` Phillip Susi
  2014-12-29 21:53     ` Chris Murphy
  1 sibling, 1 reply; 18+ messages in thread
From: Phillip Susi @ 2014-12-28 3:12 UTC (permalink / raw)
To: Chris Murphy, Btrfs BTRFS; +Cc: Zygo Blaxell

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 12/23/2014 05:09 PM, Chris Murphy wrote:
> The timer in /sys is a kernel command timer; it's not a device
> timer, even though it's pointed at a block device. You need to
> change that from 30 to something higher to get the behavior you
> want. It doesn't really make sense to say "time out in 30 seconds,
> but instead of reporting a timeout, report it as a read error".
> They're completely different things.

The idea is not to give the drive a ridiculous amount of time to
recover without timing out, but for the timeout to be handled properly.

> There are all sorts of errors listed in libata, so for all of them
> to get dumped into a read error doesn't make sense. A lot of those
> errors don't report back a sector, and the key part of the read
> error is which sector(s) have the problem so that they can be fixed.
> Without that information, the ability to fix it is lost. And it's
> the drive that needs to report this.

It is not lost. The information is simply fuzzed from an exact
individual sector to the range of sectors in the timed-out request. In
an ideal world the drive would give up in a reasonable time and report
the failure, but if it doesn't, then we should deal with that in a
better way than hanging all IO for an unacceptably long time.

> Oven doesn't work, so let's spray gasoline on it and light it and
> the kitchen on fire so that we can cook this damn pizza! That's
> what I just read. Sorry. It doesn't seem like a good idea to me to
> map all errors as read errors.

How do you conclude that? In the face of a timeout your choices are
between kicking the whole drive out of the array immediately, or
attempting to repair it by recovering the affected sector(s) and
rewriting them. Unless that recovery attempt could cause more harm
than degrading the array, where is the "throwing gasoline on it" part?
This is simply a case of the device not providing a specific error
that says whether it can be recovered or not, so let's attempt the
recovery and see if it works, instead of assuming that it won't and
possibly causing data loss that could be avoided.

> Any decent server SATA drive should support SCT ERC. The
> inexpensive WDC Red drives for NAS use all have it, and by default
> it is a reasonable 70 deciseconds last time I checked.

And yet it isn't supported on the cheaper but otherwise identical
Greens, or the higher-performing Blues. We should not be helping
vendors charge a premium for zero-cost firmware features that are
"required" for raid use when they really aren't (even if they are nice
to have).

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBCgAGBQJUn3U5AAoJENRVrw2cjl5RFIQIAJAr86Y5s8RWuL8/We/AlM5Q
JUuZGGaE1IGmMROdUAEzmj78L8lI2U3D95sERDKmd3aJosfpi1SVOExQZebSIqch
hhkLGC0FecxE5VC/67E2wwmfbropSk0mlA5Fbgx8mYf60iUHWcFUkc01kER3JGnd
xMI2jV0UpqVD/gY/a5O7Z7bPeHICQcIyXCN7MAbTMBrDWsYhDACQpij+aNXu5+ke
rCNV5c/VkYFQZ9aaMb6Mxmi9KOkCVv2+kBOsxwqPxlO5s9vKORDhxMp8XeJQEvhU
X2GAgS8r8gSGVdPutekXR1vB+TwhdMxftBWL9jcI1y05Y0z3GcOX+/90S9mrSaU=
=2tIU
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-28  3:12 ` Phillip Susi
@ 2014-12-29 21:53   ` Chris Murphy
  2014-12-30 20:46     ` Phillip Susi
  2014-12-31 15:40     ` Austin S Hemmelgarn
  0 siblings, 2 replies; 18+ messages in thread
From: Chris Murphy @ 2014-12-29 21:53 UTC (permalink / raw)
To: Btrfs BTRFS

On Sat, Dec 27, 2014 at 8:12 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 12/23/2014 05:09 PM, Chris Murphy wrote:
>> The timer in /sys is a kernel command timer; it's not a device
>> timer, even though it's pointed at a block device. You need to
>> change that from 30 to something higher to get the behavior you
>> want. It doesn't really make sense to say "time out in 30 seconds,
>> but instead of reporting a timeout, report it as a read error".
>> They're completely different things.
>
> The idea is not to give the drive a ridiculous amount of time to
> recover without timing out, but for the timeout to be handled properly.

Get drives supporting configurable or faster recoveries. There's no
way around this.

>> There are all sorts of errors listed in libata, so for all of them
>> to get dumped into a read error doesn't make sense. A lot of those
>> errors don't report back a sector, and the key part of the read
>> error is which sector(s) have the problem so that they can be fixed.
>> Without that information, the ability to fix it is lost. And it's
>> the drive that needs to report this.
>
> It is not lost. The information is simply fuzzed from an exact
> individual sector to the range of sectors in the timed-out request. In
> an ideal world the drive would give up in a reasonable time and report
> the failure, but if it doesn't, then we should deal with that in a
> better way than hanging all IO for an unacceptably long time.

This is honestly a broken-record topic. The drives under discussion
were never meant to be used in raid; they're desktop drives, designed
with long recoveries because it's reasonable to try to recover the
data even in the face of delays rather than not recover at all.
Whether there are also some design flaws in here I can't say, because
I'm not a hardware designer or developer, but they are very clearly
targeted at certain use cases and not others, not least of which is
their error recovery time, but also their vibration tolerance when
multiple drives are in close proximity to each other.

If you don't like long recoveries, don't buy drives with long
recoveries. Simple.

>> Oven doesn't work, so let's spray gasoline on it and light it and
>> the kitchen on fire so that we can cook this damn pizza! That's
>> what I just read. Sorry. It doesn't seem like a good idea to me to
>> map all errors as read errors.
>
> How do you conclude that? In the face of a timeout your choices are
> between kicking the whole drive out of the array immediately, or
> attempting to repair it by recovering the affected sector(s) and
> rewriting them. Unless that recovery attempt could cause more harm
> than degrading the array, where is the "throwing gasoline on it"
> part? This is simply a case of the device not providing a specific
> error that says whether it can be recovered or not, so let's attempt
> the recovery and see if it works, instead of assuming that it won't
> and possibly causing data loss that could be avoided.

The device will absolutely provide a specific error so long as its
link isn't reset prematurely, which happens to be the Linux default
behavior when combined with drives that have long error recovery
times. Hence the recommendation to increase the Linux command timer
value. That is the solution right now.
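Applying that persistently can be as small as this (the 180-second
figure and the blanket sd* match are illustrative, not values anyone
in this thread recommends; pick something comfortably above the
drive's worst-case recovery, which for desktop drives can run to a
couple of minutes):

# e.g. in rc.local or an equivalent boot-time script:
for t in /sys/block/sd*/device/timeout ; do echo 180 > "$t" ; done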
If you want a different behavior, someone has to write the code to do
it, because it doesn't exist yet; and so far there seems to be zero
interest in actually doing that work, just some interest in
hand-waving that it ought to exist, maybe.

>> Any decent server SATA drive should support SCT ERC. The
>> inexpensive WDC Red drives for NAS use all have it, and by default
>> it is a reasonable 70 deciseconds last time I checked.
>
> And yet it isn't supported on the cheaper but otherwise identical
> Greens, or the higher-performing Blues. We should not be helping
> vendors charge a premium for zero-cost firmware features that are
> "required" for raid use when they really aren't (even if they are
> nice to have).

The manufacturer says they differ in vibration characteristics, 24x7
usage expectation, and warranty, among the most relevant features. The
Red has a 3-year warranty, the Green a 1-year warranty. That alone
easily accounts for the $15 difference, although that's perhaps
somewhat subjective. I don't actually know the wholesale prices; they
could be the same if the purchasing terms are identical.

Western Digital Red NAS Hard Drive WD30EFRX 3TB IntelliPower 64MB
Cache SATA 6.0Gb/s 3.5" NAS Hard Drive: $114 on Newegg.com

Western Digital WD Green WD30EZRX 3TB IntelliPower 64MB Cache SATA
6.0Gb/s 3.5" Internal Hard Drive Bare Drive - OEM: $99 on Newegg.com

And none of the manufacturers actually says these features are
required for raid use. What they say is that they reserve the right to
deny warranty claims if you're using a drive in a manner inconsistent
with its intended usage, which is rather easily found information.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-29 21:53 ` Chris Murphy
@ 2014-12-30 20:46   ` Phillip Susi
  2014-12-30 23:58     ` Chris Murphy
  1 sibling, 1 reply; 18+ messages in thread
From: Phillip Susi @ 2014-12-30 20:46 UTC (permalink / raw)
To: Chris Murphy, Btrfs BTRFS

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 12/29/2014 4:53 PM, Chris Murphy wrote:
> Get drives supporting configurable or faster recoveries. There's
> no way around this.

Practically available right now? Sure. In theory, no.

> This is honestly a broken-record topic. The drives under
> discussion were never meant to be used in raid; they're desktop
> drives, designed with long recoveries because it's reasonable to
> try to

The intention to use the drives in a raid is entirely at the
discretion of the user, not the manufacturer. The only reason we are
even having this conversation is that the manufacturer has added a
misfeature that makes them sub-optimal for use in a raid.

> recover the data even in the face of delays rather than not recover
> at all. Whether there are also some design flaws in here I can't
> say, because I'm not a hardware designer or developer, but they are
> very clearly targeted at certain use cases and not others, not
> least of which is their error recovery time, but also their
> vibration tolerance when multiple drives are in close proximity to
> each other.

Drives have no business whatsoever retrying for so long; every version
of DOS or Windows ever released has been able to report an IO error
and give the *user* the option of retrying it in the hopes that it
will work that time, because drives used to be sane and not keep
retrying a positively ridiculous number of times.

> If you don't like long recoveries, don't buy drives with long
> recoveries. Simple.

Better to fix the software to deal with it sensibly instead of
encouraging manufacturers to engage in hamstringing their lower-priced
products to coax more money out of their customers.

> The device will absolutely provide a specific error so long as its
> link isn't reset prematurely, which happens to be the Linux default
> behavior when combined with drives that have long error recovery
> times. Hence the recommendation to increase the Linux command timer
> value. That is the solution right now. If you want a different
> behavior, someone has to write the code to do it, because it
> doesn't exist yet; and so far there seems to be zero interest in
> actually doing that work, just some interest in hand-waving that it
> ought to exist, maybe.

If this is your way of saying "patches welcome" then it probably would
have been better just to say that.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)

iQEcBAEBAgAGBQJUow8ZAAoJENRVrw2cjl5Rr9UH+wd3yJ1ZnoaxDG3JPCBq9MJb
Tb6nhjHovRDREeus4UWLESp9kYUyy5OfKmahARhM6AbaBXWYeleoD9SEtMahFXfn
/2Kn9yRBqZCBDloVQGNOUaSZyfhTRRl31cGABbbynRo6IDkLEfMQQPWgvz9ttch7
3aPciHhehs1CeseNuiiUPk6HIMb8lJLvgW5J1O5FwgXZ6Wyi9OZdoPL+prnFh2bP
5E2rGblYUHIUiLkOKFOOsEs8q2H9RICFJIBsz8KoPzjCDtdNETBF5mvx8bIUJpg0
Q7cQOo7IRxpFUL/7gnBtWgRIw3lvRY+SY2G+2YwaMiqdeuYcLCr853ONDYg0NCc=
=AYGW
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-30 20:46 ` Phillip Susi
@ 2014-12-30 23:58   ` Chris Murphy
  2014-12-31  3:16     ` Phillip Susi
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2014-12-30 23:58 UTC (permalink / raw)
To: Btrfs BTRFS

On Tue, Dec 30, 2014 at 1:46 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 12/29/2014 4:53 PM, Chris Murphy wrote:
>> Get drives supporting configurable or faster recoveries. There's
>> no way around this.
>
> Practically available right now? Sure. In theory, no.

I have no idea what this means. Such drives exist; you can buy them or
not buy them.

>> This is honestly a broken-record topic. The drives under
>> discussion were never meant to be used in raid; they're desktop
>> drives, designed with long recoveries because it's reasonable to
>> try to
>
> The intention to use the drives in a raid is entirely at the
> discretion of the user, not the manufacturer. The only reason we are
> even having this conversation is that the manufacturer has added a
> misfeature that makes them sub-optimal for use in a raid.

Clearly you have never owned a business, nor have you been involved in
volume manufacturing, or you wouldn't be so keen to demand that one
market subsidize another. 24x7 usage is a non-trivial quantity of
additional wear and tear on the drive compared to an 8-hour/day,
40-hour/week duty cycle. But you seem to think the manufacturer has no
right to produce a cheaper drive for seldom-used hardware, or a more
expensive one for constantly used hardware.

And of course you completely ignored, and deleted, my point about the
difference in warranties.

Does the SATA specification require configurable SCT ERC? Does it
require even supporting SCT ERC? I think your argument is flawed by
mis-distributing the economic burden while simultaneously denying one
even exists, or holding that these companies should just eat the cost
differential if it does. In any case the argument is asinine.

>> recover the data even in the face of delays rather than not recover
>> at all. Whether there are also some design flaws in here I can't
>> say, because I'm not a hardware designer or developer, but they are
>> very clearly targeted at certain use cases and not others, not
>> least of which is their error recovery time, but also their
>> vibration tolerance when multiple drives are in close proximity to
>> each other.
>
> Drives have no business whatsoever retrying for so long; every version
> of DOS or Windows ever released has been able to report an IO error
> and give the *user* the option of retrying it in the hopes that it
> will work that time, because drives used to be sane and not keep
> retrying a positively ridiculous number of times.

When the encoded data signal weakens, the bits effectively become
fuzzy: each read produces different results. Obviously this is a very
rare condition or there'd be widespread panic. However, it's common
and expected enough that the drive manufacturers are all, to very
little varying degree, dealing with this problem in a similar way,
which is multiple reads.

Now you could say they're all in collusion with each other to screw
users over, rather than having legitimate reasons for all of these
retries. Unless you're a hard drive engineer, I'm unlikely to find
such an argument compelling. Besides, it would also be a charge of
fraud.

>> If you don't like long recoveries, don't buy drives with long
>> recoveries. Simple.
>
> Better to fix the software to deal with it sensibly instead of
> encouraging manufacturers to engage in hamstringing their lower-priced
> products to coax more money out of their customers.

In the meantime, there already is a working software alternative:
(re)write over all sectors periodically. Perhaps every 6-12 months is
sufficient to mitigate such signal weakening on marginal sectors that
aren't persistently failing on writes. This can be done with a
periodic reshape if it's md raid. It can be done with balance on
Btrfs. It can be done with resilvering on ZFS.

>> The device will absolutely provide a specific error so long as its
>> link isn't reset prematurely, which happens to be the Linux default
>> behavior when combined with drives that have long error recovery
>> times. Hence the recommendation to increase the Linux command timer
>> value. That is the solution right now. If you want a different
>> behavior, someone has to write the code to do it, because it
>> doesn't exist yet; and so far there seems to be zero interest in
>> actually doing that work, just some interest in hand-waving that it
>> ought to exist, maybe.
>
> If this is your way of saying "patches welcome" then it probably would
> have been better just to say that.

Certainly not. I'm not the maintainer of anything; I have no idea if
such things are welcome. I'm not even a developer. I couldn't code my
way out of a hat.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-30 23:58 ` Chris Murphy
@ 2014-12-31  3:16   ` Phillip Susi
  2015-01-03  5:31     ` Chris Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: Phillip Susi @ 2014-12-31 3:16 UTC (permalink / raw)
To: Chris Murphy, Btrfs BTRFS

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 12/30/2014 06:58 PM, Chris Murphy wrote:
>> Practically available right now? Sure. In theory, no.
>
> I have no idea what this means. Such drives exist; you can buy them
> or not buy them.

I was referring to the "no way around this" part. Currently you are
correct, but in theory the way around it is exactly the subject of
this thread.

> Clearly you have never owned a business, nor have you been involved
> in volume manufacturing, or you wouldn't be so keen to demand that
> one market subsidize another. 24x7 usage is a non-trivial quantity
> of additional wear and tear on the drive compared to an 8-hour/day,
> 40-hour/week duty cycle. But you seem to think the manufacturer has
> no right to produce a cheaper drive for seldom-used hardware, or a
> more expensive one for constantly used hardware.

Just because I want a raid doesn't mean I need it to operate reliably
24x7. For that matter, it has long been established that power cycling
drives puts more wear and tear on them, and as a general rule, leaving
them on 24x7 results in them lasting longer.

> And of course you completely ignored, and deleted, my point about
> the difference in warranties.

Because I don't care? It's nice and all that they warranty the more
expensive drive more, and it may possibly even mean that they are
actually more reliable (but not likely), but that doesn't mean the
system should have an unnecessarily terrible response to the behavior
of the cheaper drives. Is it worth recommending the more expensive
drives? Sure... but the system should also handle the cheaper drives
with grace.

> Does the SATA specification require configurable SCT ERC? Does it
> require even supporting SCT ERC? I think your argument is flawed
> by mis-distributing the economic burden while simultaneously denying
> one even exists, or holding that these companies should just eat the
> cost differential if it does. In any case the argument is asinine.

There didn't used to be any such thing; drives simply did not *ever*
go into absurdly long internal retries, so there was no need. The fact
that they do these days I consider a misfeature, and one that *can* be
worked around in software, which is the point here.

> When the encoded data signal weakens, the bits effectively become
> fuzzy: each read produces different results. Obviously this is a
> very rare condition or there'd be widespread panic. However, it's
> common and expected enough that the drive manufacturers are all, to
> very little varying degree, dealing with this problem in a similar
> way, which is multiple reads.

Sure, but the noise introduced by the read (as opposed to the noise in
the actual signal on the platter) isn't that large, so retrying 10,000
times isn't going to give any better results than retrying, say, 100
times; and if the user really desires that many retries, they have
always been able to do them at the software level rather than
depending on the drive. There is no reason for the drives to have
increased their internal retries that much, and then deliberately
withheld the essentially zero-cost ability to limit those internal
retries, other than to drive customers to pay for the more expensive
models.

> Now you could say they're all in collusion with each other to screw
> users over, rather than having legitimate reasons for all of these
> retries. Unless you're a hard drive engineer, I'm unlikely to find
> such an argument compelling. Besides, it would also be a charge of
> fraud.

Calling it fraud might be a bit of a stretch, but yes, there is no
legitimate reason for *that* many retries, since people have been
retrying failed reads in software for decades, and there are
diminishing returns to increasing the number of retries.

> In the meantime, there already is a working software alternative:
> (re)write over all sectors periodically. Perhaps every 6-12 months
> is sufficient to mitigate such signal weakening on marginal sectors
> that aren't persistently failing on writes. This can be done with a
> periodic reshape if it's md raid. It can be done with balance on
> Btrfs. It can be done with resilvering on ZFS.

Is there any actual evidence that this is effective? Or that the
recording degrades as a function of time? I doubt it, since I have
data on drives that were last written 10 years ago that is still
readable. Even if so, this is really a non sequitur: if the signal has
degraded making it hard to read, in a raid we can simply recover using
the other drives. The issue here is whether we should be doing such
recovery sooner rather than waiting for the silly drive to retry
100,000 times before giving up.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBCgAGBQJUo2p7AAoJENRVrw2cjl5RBRQH/iPeByoKWCBCNcSH+slHQpLu
UgFw1Sb0VhkcMV7LWGHRPVCOqOqRUyiDUIWBqjnnKAtGWvngqoVa8oCrYXYfgzeT
snarm36vtm5jWQygn62mpZKoFVby5ttKTP3+rwQi+OjZ3+EWKKVkuXRFYpwt5ylt
f/Xix2EpgMrl9hi8Bt8D/aLPtyPIF47D5vwa2nw7f5/gU0rKDfG9OZ4B7Bs1Jl0Q
UA+bXlz4zi0cD6S7gwKStrDljAmMKjLnpWqMPHHnTWUgKuRRM/VKwzIhZmEZraqD
y3SdY1JBj1qli50ZvKH+lkEag0mixMLvzN4mC6gYKqXjG2EAsHMp8185kK97gSQ=
=agsX
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-31  3:16 ` Phillip Susi
@ 2015-01-03  5:31 ` Chris Murphy
  2015-01-05  4:18   ` Phillip Susi
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2015-01-03 5:31 UTC (permalink / raw)
To: Phillip Susi; +Cc: Chris Murphy, Btrfs BTRFS

On Tue, Dec 30, 2014 at 8:16 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> Just because I want a raid doesn't mean I need it to operate
> reliably 24x7. For that matter, it has long been established that
> power cycling drives puts more wear and tear on them and, as a
> general rule, leaving them on 24x7 results in them lasting longer.

It's not a made-to-order hard drive industry. Maybe one day you'll
be able to 3D print your own with its own specs.

>> And of course you completely ignored, and deleted, my point about
>> the difference in warranties.
>
> Because I don't care?

Sticking fingers in your ears doesn't change the fact that there's a
measurable difference in support requirements.

> It's nice and all that they give the more expensive drive a longer
> warranty, and it may even mean that it is actually more reliable
> (but not likely), but that doesn't mean that the system should
> have an unnecessarily terrible response to the behavior of the
> cheaper drives. Is it worth recommending the more expensive
> drives? Sure... but the system should also handle the cheaper
> drives with grace.

This is architecture astronaut territory.

The system only has a terrible response for two reasons: 1. The
user spec'd the wrong hardware for the use case; 2. The distro isn't
automatically leveraging existing ways to mitigate that user mistake
by changing either SCT ERC on the drives, or the SCSI command timer
for each block device.

Now, even though that solution *might* mean long recoveries on
occasion, it's still better than the link reset behavior we have
today, because it causes the underlying problem to be fixed by
md/dm/Btrfs once the read error is reported. But no distro has
implemented this $500 man-hour solution. Instead you're suggesting a
$500,000 fix that will take hundreds of man-hours and end-user
testing to find all the edge cases. It's like, seriously, WTF?

>> Does the SATA specification require configurable SCT ERC? Does it
>> require even supporting SCT ERC? I think your argument is flawed
>> by mis-distributing the economic burden while simultaneously
>> denying that one even exists, or holding that these companies
>> should just eat the cost differential if it does. In any case the
>> argument is asinine.
>
> There didn't used to be any such thing; drives simply did not
> *ever* go into absurdly long internal retries, so there was no
> need. The fact that they do these days I consider a misfeature,
> and one that *can* be worked around in software, which is the
> point here.

OK, well, I think that's hubris unless you're a hard drive engineer.
You're referring to how drives behaved over a decade ago, when bad
sectors were persistent rather than remapped, and we had to scan the
drive at format time to build a map so the bad ones wouldn't be used
by the filesystem.

>> When the encoded data signal weakens, the bits effectively become
>> fuzzy. Each read produces different results. Obviously this is a
>> very rare condition or there'd be widespread panic. However, it's
>> common and expected enough that the drive manufacturers are all,
>> with very little variation between them, dealing with this
>> problem in a similar way, which is multiple reads.
>
> Sure, but the noise introduced by the read (as opposed to the
> noise in the actual signal on the platter) isn't that large, and
> so retrying 10,000 times isn't going to give any better results
> than retrying, say, 100 times; and if the user really desires that
> many retries, they have always been able to do so at the software
> level rather than depending on the drive to try that much. There
> is no reason for the drives to have increased their internal
> retries that much, and then deliberately withheld the essentially
> zero-cost ability to limit those internal retries, other than to
> drive customers to pay for the more expensive models.

http://www.seagate.com/files/www-content/support-content/documentation/product-manuals/en-us/Enterprise/Savvio/Savvio%2015K.3/100629381e.pdf

That's a high-end SAS drive. Its default is to retry up to 20 times,
which takes ~1.4 seconds, per sector. But also note how it says that
lowering the default increases the unrecoverable error rate. That
makes sense. So even if the probability is low that retrying for up
to 120 seconds will work, statistically, increasing the default
improves the unrecoverable error rate.

If I'm going to be a conspiracy theorist, I'd say the recoveries are
getting longer by default in order to keep the specifications
reporting sane unrecoverable error rates. Maybe you'd prefer seeing
these big, cheap, "green" drives have shorter ERC times, with a
commensurate reality check in their unrecoverable error rate, which
right now is already two orders of magnitude higher than enterprise
SAS drives. So what if this means that rate is 3 or 4 orders of
magnitude higher?

Now I'm just going to wait for you to suggest that this sucks donkey
tail and that the manufacturers should produce drives with the same
UER as drives 10 years ago *and* with the same error recovery times,
and charge no additional money. OK, good luck with that!
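
(For anyone who wants to experiment: on drives that implement SCT
ERC, smartctl exposes it directly. sdX is a placeholder, and the
values are in deciseconds, so 70 means 7.0 seconds.)

# smartctl -l scterc /dev/sdX
# smartctl -l scterc,70,70 /dev/sdX

The first reports the drive's current read/write recovery limits;
the second caps both at 7.0 seconds, on drives that accept the
command.

-- 
Chris Murphy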
* Re: Uncorrectable errors on RAID-1?
  2015-01-03  5:31 ` Chris Murphy
@ 2015-01-05  4:18 ` Phillip Susi
  2015-01-05  7:41   ` Chris Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: Phillip Susi @ 2015-01-05 4:18 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS

On 01/03/2015 12:31 AM, Chris Murphy wrote:
> It's not a made-to-order hard drive industry. Maybe one day you'll
> be able to 3D print your own with its own specs.

And Wookiees did not live on Endor. What's your point?

> Sticking fingers in your ears doesn't change the fact that there's
> a measurable difference in support requirements.

Sure, just don't misrepresent one requirement as another. Just
because I don't care about a warranty from the hardware manufacturer
does not mean I have no right to expect the kernel to perform
*reasonably* on that hardware.

> This is architecture astronaut territory.
>
> The system only has a terrible response for two reasons: 1. The
> user spec'd the wrong hardware for the use case; 2. The distro
> isn't automatically leveraging existing ways to mitigate that user
> mistake by changing either SCT ERC on the drives, or the SCSI
> command timer for each block device.

No, it has a terrible response because the kernel either waits an
unreasonable time or fails the drive and kicks it out of the array
instead of trying to repair it. Blaming the user for not buying
better hardware is not an appropriate response to the kernel failing
so badly to handle commonly available hardware that doesn't behave
in the most ideal way.

> Now, even though that solution *might* mean long recoveries on
> occasion, it's still better than the link reset behavior we have
> today, because it causes the underlying problem to be fixed by
> md/dm/Btrfs once the read error is reported. But no distro has
> implemented this $500 man-hour solution. Instead you're suggesting
> a $500,000 fix that will take hundreds of man-hours and end-user
> testing to find all the edge cases. It's like, seriously, WTF?

Seriously? Treating a timeout the same way you treat an
unrecoverable media error is no Herculean task.

> OK, well, I think that's hubris unless you're a hard drive
> engineer. You're referring to how drives behaved over a decade
> ago, when bad sectors were persistent rather than remapped, and we
> had to scan the drive at format time to build a map so the bad
> ones wouldn't be used by the filesystem.

Remapping has nothing to do with it: we are talking about *read*
errors, which do not trigger a remap.

> http://www.seagate.com/files/www-content/support-content/documentation/product-manuals/en-us/Enterprise/Savvio/Savvio%2015K.3/100629381e.pdf
>
> That's a high-end SAS drive. Its default is to retry up to 20
> times, which takes ~1.4 seconds, per sector. But also note how it
> says

20 retries on a 15,000 rpm drive only takes 80 milliseconds, not 1.4
seconds: 15,000 rpm / 60 seconds per minute = 250 rotations, and
hence retries, per second.

> Maybe you'd prefer seeing these big, cheap, "green" drives have
> shorter ERC times, with a commensurate reality check in their
> unrecoverable error rate, which right now is already two orders of
> magnitude higher than enterprise SAS drives. So what if this means
> that rate is 3 or 4 orders of magnitude higher?

20 retries vs. 200 retries does not reduce the URE rate by orders of
magnitude; more like 1%, *maybe*. 200 vs. 2000 makes no measurable
difference at all.
* Re: Uncorrectable errors on RAID-1?
  2015-01-05  4:18 ` Phillip Susi
@ 2015-01-05  7:41 ` Chris Murphy
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Murphy @ 2015-01-05 7:41 UTC (permalink / raw)
To: Btrfs BTRFS

On Sun, Jan 4, 2015 at 9:18 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 01/03/2015 12:31 AM, Chris Murphy wrote:
>> This is architecture astronaut territory.
>>
>> The system only has a terrible response for two reasons: 1. The
>> user spec'd the wrong hardware for the use case; 2. The distro
>> isn't automatically leveraging existing ways to mitigate that
>> user mistake by changing either SCT ERC on the drives, or the
>> SCSI command timer for each block device.
>
> No, it has a terrible response because the kernel either waits an
> unreasonable time or fails the drive and kicks it out of the array
> instead of trying to repair it.

It's a default that works for more use cases than not. The kernel
isn't dynamically self-configuring, and it isn't even the kernel's
job to take the first step, which is to enable and correctly set SCT
ERC on each drive. I think it's a bad idea to assume that the large
pile of causes for a drive freezing on a command can all be treated
as read errors (after the link reset). But since it's your idea, and
I'm not a kernel developer, you should propose it on linux-raid@
instead of arguing with me.

> Blaming the user for not buying better hardware is not an
> appropriate response to the kernel failing so badly to handle
> commonly available hardware that doesn't behave in the most ideal
> way.

"Hi, I'm a good and knowledgeable sysadmin. I buy hardware that's
explicitly stated in the company's marketing data sheet as being
incompatible with my use case. This is someone else's fault."

Sounds like buck passing.

>> Now, even though that solution *might* mean long recoveries on
>> occasion, it's still better than the link reset behavior we have
>> today, because it causes the underlying problem to be fixed by
>> md/dm/Btrfs once the read error is reported. But no distro has
>> implemented this $500 man-hour solution. Instead you're
>> suggesting a $500,000 fix that will take hundreds of man-hours
>> and end-user testing to find all the edge cases. It's like,
>> seriously, WTF?
>
> Seriously? Treating a timeout the same way you treat an
> unrecoverable media error is no Herculean task.

So you keep saying. But the best practice is already known and
tested, and can be implemented with a startup script (sketched at
the end of this message). Yet no distro does this for the user, even
though it's much, much simpler than what you're proposing, and it
actually fixes both sources of the problem. That it is, in your
opinion, an imperfect fix is not relevant. It's still better
behavior than what we have today, and yet still no distro does this,
thereby tacitly preferring the status quo. And if the current
behavior is simply good enough that no one has taken action to
automatically implement the known best-practice workaround of the
day, why should kernel developers give two shits about this idea?

Sounds like more buck passing.

>> http://www.seagate.com/files/www-content/support-content/documentation/product-manuals/en-us/Enterprise/Savvio/Savvio%2015K.3/100629381e.pdf
>>
>> That's a high-end SAS drive. Its default is to retry up to 20
>> times, which takes ~1.4 seconds, per sector. But also note how it
>> says
>
> 20 retries on a 15,000 rpm drive only takes 80 milliseconds, not
> 1.4 seconds: 15,000 rpm / 60 seconds per minute = 250 rotations,
> and hence retries, per second.

The PDF contains a table saying 20 retries takes 1.4 seconds. I
didn't compute this number myself; it's in the bloody manufacturer's
own documentation. Obviously the ECC is doing things that take more
than one revolution of the spindle.

>> Maybe you'd prefer seeing these big, cheap, "green" drives have
>> shorter ERC times, with a commensurate reality check in their
>> unrecoverable error rate, which right now is already two orders
>> of magnitude higher than enterprise SAS drives. So what if this
>> means that rate is 3 or 4 orders of magnitude higher?
>
> 20 retries vs. 200 retries does not reduce the URE rate by orders
> of magnitude; more like 1%, *maybe*. 200 vs. 2000 makes no
> measurable difference at all.

I see. Well, I guess you prefer believing in fraud and conspiracy
theories, by multiple companies, to screw users over, even while
they admit the incompatibility with the intended use case right on
their data sheets.
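
For the record, the startup script in question is roughly the sketch
below. The 7-second ERC cap and 180-second command timer are the
commonly recommended values, not gospel, and it assumes smartctl
exits non-zero when a drive rejects the SCT ERC command:

#!/bin/sh
# For each SATA disk: cap in-drive error recovery at 7.0 seconds via
# SCT ERC. If the drive doesn't support that, raise the kernel's
# command timer instead, so long internal recoveries finish and get
# reported as read errors rather than being cut short by a link reset.
for disk in /dev/sd[a-z]; do
    [ -b "$disk" ] || continue
    dev=${disk#/dev/}
    if smartctl -l scterc,70,70 "$disk" > /dev/null 2>&1; then
        echo "$disk: SCT ERC capped at 7.0 seconds"
    else
        echo 180 > /sys/block/"$dev"/device/timeout
        echo "$disk: no SCT ERC; command timer raised to 180 seconds"
    fi
done

Neither setting survives a reboot, which is exactly why it belongs
in a boot-time script or udev rule.

-- 
Chris Murphy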
* Re: Uncorrectable errors on RAID-1?
  2014-12-29 21:53 ` Chris Murphy
  2014-12-30 20:46 ` Phillip Susi
@ 2014-12-31 15:40 ` Austin S Hemmelgarn
  1 sibling, 0 replies; 18+ messages in thread
From: Austin S Hemmelgarn @ 2014-12-31 15:40 UTC (permalink / raw)
To: Chris Murphy, Btrfs BTRFS

On 2014-12-29 16:53, Chris Murphy wrote:
> On Sat, Dec 27, 2014 at 8:12 PM, Phillip Susi <psusi@ubuntu.com> wrote:
>> On 12/23/2014 05:09 PM, Chris Murphy wrote:
>>> The timer in /sys is a kernel command timer, it's not a device
>>> timer even though it's pointed at a block device. You need to
>>> change that from 30 to something higher to get the behavior you
>>> want. It doesn't really make sense to say "time out in 30
>>> seconds, but instead of reporting a timeout, report it as a read
>>> error". They're completely different things.
>>
>> The idea is not to give the drive a ridiculous amount of time to
>> recover without timing out, but for the timeout to be handled
>> properly.
>
> Get drives supporting configurable or faster recoveries. There's
> no way around this.
>
>>> There are all sorts of errors listed in libata, so for all of
>>> them to get dumped into a read error doesn't make sense. A lot
>>> of those errors don't report back a sector, and the key part of
>>> the read error is what sector(s) have the problem so that they
>>> can be fixed. Without that information, the ability to fix it is
>>> lost. And it's the drive that needs to report this.
>>
>> It is not lost. The information is simply fuzzed from an exact
>> individual sector to a range of sectors in the timed-out request.
>> In an ideal world the drive would give up in a reasonable time
>> and report the failure, but if it doesn't, then we should deal
>> with that in a better way than hanging all IO for an unacceptably
>> long time.
>
> This is a broken record topic, honestly. The drives under
> discussion aren't ever meant to be used in raid; they're desktop
> drives, designed with long recoveries because it's reasonable to
> try to recover the data even in the face of delays rather than not
> recover at all. Whether there are also some design flaws in here I
> can't say, because I'm not a hardware designer or developer, but
> they are very clearly targeted at certain use cases and not
> others, not least because of their error recovery time but also
> their vibration tolerance when multiple drives are in close
> proximity to each other.
>
> If you don't like long recoveries, don't buy drives with long
> recoveries. Simple.
>
>>> Oven doesn't work, so let's spray gasoline on it and light it
>>> and the kitchen on fire so that we can cook this damn pizza!
>>> That's what I just read. Sorry. It doesn't seem like a good idea
>>> to me to map all errors as read errors.
>>
>> How do you conclude that? In the face of a timeout your choices
>> are between kicking the whole drive out of the array immediately,
>> or attempting to repair it by recovering the affected sector(s)
>> and rewriting them. Unless that recovery attempt could cause more
>> harm than degrading the array, then where is the "throwing
>> gasoline on it" part? This is simply a case of the device not
>> providing a specific error that says whether it can be recovered
>> or not, so let's attempt the recovery and see if it works,
>> instead of assuming that it won't and possibly causing data loss
>> that could be avoided.
>
> The device will absolutely provide a specific error so long as its
> link isn't reset prematurely, which happens to be the Linux
> default behavior when combined with drives that have long error
> recovery times. Hence the recommendation is to increase the Linux
> command timer value. That is the solution right now. If you want a
> different behavior, someone has to write the code to do it,
> because it doesn't exist yet, and so far there seems to be zero
> interest in actually doing that work, just some interest in hand
> waving that it ought to exist, maybe.
>
>>> Any decent server SATA drive should support SCT ERC. The
>>> inexpensive WDC Red drives for NASes all have it, and by default
>>> it is set to a reasonable 70 deciseconds last time I checked.
>>
>> And yet it isn't supported on the cheaper but otherwise identical
>> Greens, or the higher-performing Blues. We should not be helping
>> vendors charge a premium for zero-cost firmware features that are
>> "required" for raid use when they really aren't (even if they are
>> nice to have).
>
> The manufacturer says they differ in vibration characteristics,
> 24x7 usage expectation, and warranty, among the top relevant
> features. The Red has a 3-year warranty, the Green a 1-year
> warranty. That alone easily accounts for the $15 difference,
> although that's perhaps somewhat subjective. I don't actually know
> the wholesale prices; they could be the same if the purchasing
> terms are identical.
>
> Western Digital Red NAS Hard Drive WD30EFRX 3TB IntelliPower 64MB
> Cache SATA 6.0Gb/s 3.5" NAS Hard Drive
> $114 on Newegg.com
>
> Western Digital WD Green WD30EZRX 3TB IntelliPower 64MB Cache SATA
> 6.0Gb/s 3.5" Internal Hard Drive Bare Drive - OEM
> $99 on Newegg.com
>
> And none of the manufacturers actually says these features are
> required for raid use. What they say is that they reserve the
> right to deny warranty claims if you're using a drive in a manner
> inconsistent with its intended usage, which is rather easily found
> information.

The fact is, though, that the _hardware_ on the Green drives _does_
have everything needed to support SCT ERC properly; it's just that
the firmware refuses to support it. The same is the case on every
Seagate desktop drive I've ever seen. It's essentially the same as
how NVIDIA sells the same hardware under both the GeForce and Quadro
brands, with only the BIOS/firmware differing. There is a
long-standing tradition among hardware manufacturers of crippling
good hardware in firmware and selling it at a lower price than the
non-crippled product. It's not too hard (with a little know-how) to
modify a firmware updater for one of the drives in question to flash
it with the good firmware instead of the crap that comes on it by
default (although this WILL void your warranty and, depending on the
laws where you live, might also count as reverse engineering).
[parent not found: <CAJCQCtQYhaDEic5bwd+PEcEfwOqLwAe8cT8VPZ9je+JLRP1GPw@mail.gmail.com>]
* Re: Uncorrectable errors on RAID-1?
       [not found] ` <CAJCQCtQYhaDEic5bwd+PEcEfwOqLwAe8cT8VPZ9je+JLRP1GPw@mail.gmail.com>
@ 2014-12-22 14:28 ` constantine
  2014-12-22 16:05 ` Chris Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: constantine @ 2014-12-22 14:28 UTC (permalink / raw)
To: Chris Murphy, linux-btrfs

On Mon, Dec 22, 2014 at 12:24 AM, Chris Murphy <lists@colorremedies.com> wrote:
> smartctl -l scterc /dev/sdX

That's really good to know. My drives are desktop drives and this
feature is not supported; hence I get "SCT Error Recovery Control
command not supported".

I'll definitely go for enterprise/raid-class drives that support
this command in the future; they somehow seem more transparent (if I
may say) to maintain.
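
For the archives, this is the exchange on each of my disks (sdX
stands for each member drive):

# smartctl -l scterc /dev/sdX
SCT Error Recovery Control command not supported

whereas, from what I've read, a drive that supports it reports
something like this (the 70-decisecond values are illustrative):

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)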
* Re: Uncorrectable errors on RAID-1?
  2014-12-22 14:28 ` constantine
@ 2014-12-22 16:05 ` Chris Murphy
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Murphy @ 2014-12-22 16:05 UTC (permalink / raw)
To: constantine; +Cc: Chris Murphy, Btrfs BTRFS

On Mon, Dec 22, 2014 at 7:28 AM, constantine <costas.magnuse@gmail.com> wrote:
> On Mon, Dec 22, 2014 at 12:24 AM, Chris Murphy <lists@colorremedies.com> wrote:
>> smartctl -l scterc /dev/sdX
>
> That's really good to know. My drives are desktop drives and this
> feature is not supported; hence I get "SCT Error Recovery Control
> command not supported".
>
> I'll definitely go for enterprise/raid-class drives that support
> this command in the future; they somehow seem more transparent (if
> I may say) to maintain.

Not knowing anything else, I'd say the kernel command timer should
be set to 121 seconds in your case, for each drive. If you find
evidence it can be shorter, go with that. Bad sectors will fail
fast, which is what you want since you have mirrored data. Marginal
sectors might take a while for the firmware to either recover or
fail.

It's possible to mitigate long recoveries with a periodic balance,
say once every six months. This rewrites all data, after which all
sectors ought to have a decently strong signal. Any sector with a
persistent write problem is removed from use automatically by the
drive firmware, but this tends to require a write operation to
trigger it.
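
Concretely, that comes down to something like this (sdX is each
member drive and the mount point is a placeholder; the timeout
setting does not persist across reboots, so put it somewhere that
runs at startup):

# echo 121 > /sys/block/sdX/device/timeout
# btrfs balance start /mnt/yourarray

The first line raises the kernel's command timer for one device;
repeat it per drive. The second starts a full balance, which
rewrites every allocated chunk, refreshing the on-platter signal and
giving the firmware a chance to remap any sector with a persistent
write problem.

-- 
Chris Murphy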
end of thread

Thread overview: 18+ messages
2014-12-21 19:34 Uncorrectable errors on RAID-1? constantine
2014-12-21 21:56 ` Robert White
2014-12-21 22:17 ` Hugo Mills
2014-12-22 0:25 ` Chris Murphy
2014-12-23 21:16 ` Zygo Blaxell
2014-12-23 22:09 ` Chris Murphy
2014-12-23 22:23 ` Chris Murphy
2014-12-28 3:12 ` Phillip Susi
2014-12-29 21:53 ` Chris Murphy
2014-12-30 20:46 ` Phillip Susi
2014-12-30 23:58 ` Chris Murphy
2014-12-31 3:16 ` Phillip Susi
2015-01-03 5:31 ` Chris Murphy
2015-01-05 4:18 ` Phillip Susi
2015-01-05 7:41 ` Chris Murphy
2014-12-31 15:40 ` Austin S Hemmelgarn
[not found] ` <CAJCQCtQYhaDEic5bwd+PEcEfwOqLwAe8cT8VPZ9je+JLRP1GPw@mail.gmail.com>
2014-12-22 14:28 ` constantine
2014-12-22 16:05 ` Chris Murphy