* Uncorrectable errors on RAID-1?
@ 2014-12-21 19:34 constantine
2014-12-21 21:56 ` Robert White
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: constantine @ 2014-12-21 19:34 UTC (permalink / raw)
To: linux-btrfs
Some months ago I had 6 uncorrectable errors. I deleted the files that
contained them, and after scrubbing I had 0 uncorrectable errors.
A few weeks later I encountered new uncorrectable errors.
Question 1:
Why do I have uncorrectable errors on a RAID-1 filesystem in the first place?
Question 2:
How do I properly correct them? (Again by deleting their files? :( )
Question 3:
How do I prevent this from happening?
Thanks a lot!
constantine
PS.
The disks can be considered old (some with > 15000 hrs online), but
SMART long tests complete without errors. I have this filesystem:
# btrfs fi show /mnt/thefilesystem
Label: 'thefilesystem' uuid: 1d1d0850-d1bc-4c76-96a1-17d168ff2431
Total devices 5 FS bytes used 6.11TiB
devid 1 size 2.73TiB used 2.63TiB path /dev/sda1
devid 2 size 3.64TiB used 3.54TiB path /dev/sdg1
devid 3 size 1.82TiB used 1.72TiB path /dev/sdd1
devid 4 size 1.82TiB used 1.72TiB path /dev/sdc1
devid 5 size 2.73TiB used 2.63TiB path /dev/sdh1
Btrfs v3.17.3
# btrfs fi df /mnt/thefilesystem
Data, RAID1: total=6.10TiB, used=6.10TiB
System, RAID1: total=32.00MiB, used=896.00KiB
Metadata, RAID1: total=10.00GiB, used=8.98GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
===================
SMART information from each of the disks:
# for i in a g d c h ; do smartctl -A /dev/sd$i; done
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f  200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027  177   175   021    Pre-fail Always  -           6108
  4 Start_Stop_Count        0x0032  100   100   000    Old_age  Always  -           201
  5 Reallocated_Sector_Ct   0x0033  200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e  200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032  093   093   000    Old_age  Always  -           5836
 10 Spin_Retry_Count        0x0032  100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032  100   100   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032  100   100   000    Old_age  Always  -           185
192 Power-Off_Retract_Count 0x0032  200   200   000    Old_age  Always  -           118
193 Load_Cycle_Count        0x0032  189   189   000    Old_age  Always  -           33154
194 Temperature_Celsius     0x0022  114   098   000    Old_age  Always  -           36
196 Reallocated_Event_Count 0x0032  200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032  200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030  200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032  200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008  200   200   000    Old_age  Offline -           0
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f  200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027  179   175   021    Pre-fail Always  -           8050
  4 Start_Stop_Count        0x0032  100   100   000    Old_age  Always  -           141
  5 Reallocated_Sector_Ct   0x0033  200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e  200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032  094   094   000    Old_age  Always  -           4842
 10 Spin_Retry_Count        0x0032  100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032  100   100   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032  100   100   000    Old_age  Always  -           140
192 Power-Off_Retract_Count 0x0032  200   200   000    Old_age  Always  -           91
193 Load_Cycle_Count        0x0032  194   194   000    Old_age  Always  -           18614
194 Temperature_Celsius     0x0022  114   100   000    Old_age  Always  -           38
196 Reallocated_Event_Count 0x0032  200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032  200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030  200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032  200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008  200   200   000    Old_age  Offline -           0
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  102   099   006    Pre-fail Always  -           4738696
  3 Spin_Up_Time            0x0003  092   092   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032  100   100   020    Old_age  Always  -           836
  5 Reallocated_Sector_Ct   0x0033  100   100   036    Pre-fail Always  -           144
  7 Seek_Error_Rate         0x000f  078   060   030    Pre-fail Always  -           69594766
  9 Power_On_Hours          0x0032  077   077   000    Old_age  Always  -           20554
 10 Spin_Retry_Count        0x0013  100   100   097    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032  100   100   020    Old_age  Always  -           721
183 Runtime_Bad_Block       0x0032  092   092   000    Old_age  Always  -           8
184 End-to-End_Error        0x0032  100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032  100   100   000    Old_age  Always  -           0
188 Command_Timeout         0x0032  100   099   000    Old_age  Always  -           14
189 High_Fly_Writes         0x003a  097   097   000    Old_age  Always  -           3
190 Airflow_Temperature_Cel 0x0022  068   042   045    Old_age  Always  In_the_past 32 (0 15 39 23 0)
191 G-Sense_Error_Rate      0x0032  100   100   000    Old_age  Always  -           0
192 Power-Off_Retract_Count 0x0032  100   100   000    Old_age  Always  -           320
193 Load_Cycle_Count        0x0032  100   100   000    Old_age  Always  -           947
194 Temperature_Celsius     0x0022  032   058   000    Old_age  Always  -           32 (0 13 0 0 0)
195 Hardware_ECC_Recovered  0x001a  014   003   000    Old_age  Always  -           4738696
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always  -           0
240 Head_Flying_Hours       0x0000  100   253   000    Old_age  Offline -           19390 (116 2 0)
241 Total_LBAs_Written      0x0000  100   253   000    Old_age  Offline -           2165686930
242 Total_LBAs_Read         0x0000  100   253   000    Old_age  Offline -           1913785108
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f  200   200   051    Pre-fail Always  -           1
  3 Spin_Up_Time            0x0027  182   178   021    Pre-fail Always  -           5900
  4 Start_Stop_Count        0x0032  100   100   000    Old_age  Always  -           310
  5 Reallocated_Sector_Ct   0x0033  200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e  200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032  086   086   000    Old_age  Always  -           10839
 10 Spin_Retry_Count        0x0032  100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032  100   100   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032  100   100   000    Old_age  Always  -           275
192 Power-Off_Retract_Count 0x0032  200   200   000    Old_age  Always  -           175
193 Load_Cycle_Count        0x0032  123   123   000    Old_age  Always  -           233706
194 Temperature_Celsius     0x0022  120   102   000    Old_age  Always  -           30
196 Reallocated_Event_Count 0x0032  200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032  200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030  200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032  200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008  200   200   000    Old_age  Offline -           0
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  117   099   006    Pre-fail Always  -           154070800
  3 Spin_Up_Time            0x0003  094   093   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032  100   100   020    Old_age  Always  -           198
  5 Reallocated_Sector_Ct   0x0033  100   100   010    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f  077   060   030    Pre-fail Always  -           4346841135
  9 Power_On_Hours          0x0032  090   090   000    Old_age  Always  -           9283
 10 Spin_Retry_Count        0x0013  100   100   097    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032  100   100   020    Old_age  Always  -           185
183 Runtime_Bad_Block       0x0032  100   100   000    Old_age  Always  -           0
184 End-to-End_Error        0x0032  100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032  100   100   000    Old_age  Always  -           0
188 Command_Timeout         0x0032  100   100   000    Old_age  Always  -           0 0 0
189 High_Fly_Writes         0x003a  098   098   000    Old_age  Always  -           2
190 Airflow_Temperature_Cel 0x0022  065   046   045    Old_age  Always  -           35 (Min/Max 23/45)
191 G-Sense_Error_Rate      0x0032  100   100   000    Old_age  Always  -           0
192 Power-Off_Retract_Count 0x0032  100   100   000    Old_age  Always  -           129
193 Load_Cycle_Count        0x0032  098   098   000    Old_age  Always  -           5879
194 Temperature_Celsius     0x0022  035   054   000    Old_age  Always  -           35 (0 19 0 0 0)
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always  -           0
240 Head_Flying_Hours       0x0000  100   253   000    Old_age  Offline -           8753h+05m+40.278s
241 Total_LBAs_Written      0x0000  100   253   000    Old_age  Offline -           36640474598
242 Total_LBAs_Read         0x0000  100   253   000    Old_age  Offline -           94882096088
^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Uncorrectable errors on RAID-1?
  2014-12-21 19:34 Uncorrectable errors on RAID-1? constantine
@ 2014-12-21 21:56 ` Robert White
  2014-12-21 22:17   ` Hugo Mills
  2014-12-22  0:25 ` Chris Murphy
  [not found] ` <CAJCQCtQYhaDEic5bwd+PEcEfwOqLwAe8cT8VPZ9je+JLRP1GPw@mail.gmail.com>
  2 siblings, 1 reply; 18+ messages in thread
From: Robert White @ 2014-12-21 21:56 UTC (permalink / raw)
To: constantine, linux-btrfs

On 12/21/2014 11:34 AM, constantine wrote:
> Some months ago I had 6 uncorrectable errors. I deleted the files that
> contained them, and after scrubbing I had 0 uncorrectable errors.
> A few weeks later I encountered new uncorrectable errors.
>
> Question 1:
> Why do I have uncorrectable errors on a RAID-1 filesystem in the first place?

These are disk/platter/hardware errors. They happen for one of two
reasons. Most likely, there is a flaw, new or existing, on the platter
itself, and data just cannot live in that spot. Less likely, you
suffered an environmental hazard (a hard jolt) while a sector was being
written and the drive is choking on the digital wreckage.

> Question 2:
> How do I properly correct them? (Again by deleting their files? :( )

You have to _force_ the drive to rewrite the sector. If the disk can
correct the sector (not a hardware flaw), the problem goes away forever.
If it can't, the drive will remap the sector onto a spare sector and it
will seem to go away forever.

Here is a decent tutorial:

http://smartmontools.sourceforge.net/badblockhowto.html

Which variant of the procedure you need varies by hardware, so read the
whole thing. _BUT_ on my system I had to use hdparm to write the
sectors instead of just using dd. Math is involved to find the LBA, and
you have to use the "yes, I really know what I am doing" option to
force the write at the low level.

[Quick version: run smartctl --test=long (or a selective test if you
know the range). The test will stop on the read error. Force-write the
"LBA of first error" block with hdparm (or use the sg-spare route).
Repeat until the long test can read the entire drive.]

My current smartctl --all /dev/sda shows that recent remapping exercise:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%      36605        -
# 2  Selective offline   Completed without error        00%      36603        -
# 3  Selective offline   Aborted by host                90%      36603        -
# 4  Selective offline   Completed without error        00%      36603        -
# 5  Selective offline   Completed: read failure        90%      36603        19530186
# 6  Selective offline   Completed: read failure        90%      36603        19530182
# 7  Extended offline    Completed: read failure        90%      36602        19530182
# 8  Extended offline    Completed: read failure        90%      36602        19530182
# 9  Extended offline    Completed: read failure        90%      36592        19530182
#10  Extended offline    Completed: read failure        90%      36094        19530182
#11  Extended offline    Completed without error        00%       4222        -
6 of 6 failed self-tests are outdated by newer successful extended offline self-test # 1
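As a rough sketch of that loop (the device name and LBA here are
placeholders, not values from this thread, and your hdparm has to
support the raw sector commands):

# smartctl -t long /dev/sda                  # kick off the extended self-test
# smartctl -l selftest /dev/sda              # note LBA_of_first_error, e.g. 19530182
# hdparm --read-sector 19530182 /dev/sda     # confirm the sector really fails
# hdparm --yes-i-know-what-i-am-doing --write-sector 19530182 /dev/sda
# smartctl -t long /dev/sda                  # repeat until the test completes clean

Keep in mind that --write-sector zeroes the sector, so only do it once
you know which file (if any) owns that block, or once a scrub can
restore the data from the mirror copy.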
The good news is that since you are using RAID1 and checksums, you
shouldn't need to delete any files. Just coerce the write, then btrfs
scrub the filesystem, and the checksum/rewrite machinery should recover
the degraded copy from the good copy in the mirror.

> Question 3:
> How do I prevent this from happening?

If the disk only shows an error or two, it's probably still in normal
range. If you have to spare out a lot of sectors, the disk may be
reaching end-of-life and likely needs replacing.

ALL DISKS FAIL EVENTUALLY, so you don't "prevent it from happening".
You use RAID1 (etc.) and backups, and you periodically run the tests
and check the output. That is, you can't prevent eventual disk loss;
your job is to prevent data loss. So good on you for the RAID1.

> Thanks a lot!
>
> constantine
>
> PS.
> The disks can be considered old (some with > 15000 hrs online), but
> SMART long tests complete without errors. I have this filesystem:

I don't see the SMART test results in any of these blocks. Are you sure
you are looking at the correct part of the output? If you are trying to
show us tests completing without errors, you should be showing us the
table after the heading "SMART Self-test log structure revision number
1". See the smartctl --all and/or --xall output (_lower_ _case_ "a" or
"x", not upper case "A" for "attributes"); the test results will be
near the bottom. The "attribute" section is interesting but not
dispositive of recent test results; it only shows non-test event
counters.
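To pull that table for all five drives at once, something along the
lines of your -A loop should work (assuming the same device letters):

# for i in a g d c h ; do smartctl -l selftest /dev/sd$i ; done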
the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-21 21:56 ` Robert White
@ 2014-12-21 22:17   ` Hugo Mills
  0 siblings, 0 replies; 18+ messages in thread
From: Hugo Mills @ 2014-12-21 22:17 UTC (permalink / raw)
To: Robert White; +Cc: constantine, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2711 bytes --]

On Sun, Dec 21, 2014 at 01:56:54PM -0800, Robert White wrote:
> On 12/21/2014 11:34 AM, constantine wrote:
> > Some months ago I had 6 uncorrectable errors. I deleted the files that
> > contained them, and after scrubbing I had 0 uncorrectable errors.
> > A few weeks later I encountered new uncorrectable errors.
> >
> > Question 1:
> > Why do I have uncorrectable errors on a RAID-1 filesystem in the first place?
>
> These are disk/platter/hardware errors. They happen for one of two
> reasons. Most likely, there is a flaw, new or existing, on the platter
> itself, and data just cannot live in that spot. Less likely, you
> suffered an environmental hazard (a hard jolt) while a sector was
> being written and the drive is choking on the digital wreckage.
>
> > Question 2:
> > How do I properly correct them? (Again by deleting their files? :( )
>
> You have to _force_ the drive to rewrite the sector. If the disk can
> correct the sector (not a hardware flaw), the problem goes away
> forever. If it can't, the drive will remap the sector onto a spare
> sector and it will seem to go away forever.

   Note that one of the drives already has reallocated sectors, so
it's on its way to failing, and you should start saving up your
pennies for a new one now, even if it hasn't gone properly boom yet.
However, that doesn't explain on its own why you're getting
unrecoverable errors -- the FS should be able to deal with that.

[snip]

> The good news is that since you are using RAID1 and checksums, you
> shouldn't need to delete any files. Just coerce the write, then btrfs
> scrub the filesystem, and the checksum/rewrite machinery should
> recover the degraded copy from the good copy in the mirror.

   If btrfs detects a checksum error, it will try to fix it by reading
the other copy and then writing good data over the broken copy. You
don't have to force a write to the FS to make it fix broken data this
way. A scrub will do this check-and-repair on all content of the
filesystem.

   If the FS is reporting uncorrectable errors, then it's tried both
copies and both fail their checksums. This is basically not fixable
without removing the files and replacing them with copies from your
backup. It's not obvious why you've got correlated errors on two
devices, though, and I'm not sure how to work it out. I'd suggest
running the full SMART tests on the disks, running a scrub on the FS,
and checking your logs for SATA errors and similar problems.

   Hugo.

[snip]

-- 
Hugo Mills             | I must be musical: I've got *loads* of CDs
hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: 65E74AC0 |
                       |                           Fran, Black Books

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread
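A concrete sketch of the test-and-scrub pass Hugo suggests, using the
device letters and mountpoint from the original post (the grep pattern
is only illustrative):

# for i in a g d c h ; do smartctl -t long /dev/sd$i ; done   # in-drive tests, take hours
# btrfs scrub start /mnt/thefilesystem
# btrfs scrub status /mnt/thefilesystem                       # error counts so far
# dmesg | grep -iE 'ata[0-9]|csum|i/o error'                  # SATA resets, csum failures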
* Re: Uncorrectable errors on RAID-1?
  2014-12-21 19:34 Uncorrectable errors on RAID-1? constantine
  2014-12-21 21:56 ` Robert White
@ 2014-12-22  0:25 ` Chris Murphy
  2014-12-23 21:16   ` Zygo Blaxell
  [not found] ` <CAJCQCtQYhaDEic5bwd+PEcEfwOqLwAe8cT8VPZ9je+JLRP1GPw@mail.gmail.com>
  2 siblings, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2014-12-22 0:25 UTC (permalink / raw)
To: Btrfs BTRFS

On Sun, Dec 21, 2014 at 12:34 PM, constantine <costas.magnuse@gmail.com> wrote:
> Some months ago I had 6 uncorrectable errors. I deleted the files that
> contained them, and after scrubbing I had 0 uncorrectable errors.
> A few weeks later I encountered new uncorrectable errors.
>
> Question 1:
> Why do I have uncorrectable errors on a RAID-1 filesystem in the first place?
>
> Question 2:
> How do I properly correct them? (Again by deleting their files? :( )
>
> Question 3:
> How do I prevent this from happening?

There are multiple kinds of uncorrectable errors, so it depends on the
exact error. If Btrfs is reporting uncorrectable errors, that suggests
both copies are bad.

Whether md, LVM, or Btrfs raid, make sure the value from

  cat /sys/block/sdX/device/timeout

is larger than the value reported by

  smartctl -l scterc /dev/sdX

Note that the units for the first command are seconds; the units for
the second command are deciseconds.
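For example (sdX is a placeholder; the scterc set step only works on
drives that support SCT ERC):

# smartctl -l scterc /dev/sdX                 # drive's recovery limit, in deciseconds
# cat /sys/block/sdX/device/timeout           # kernel's SCSI command timer, in seconds
# smartctl -l scterc,70,70 /dev/sdX           # if supported, cap recovery at 7.0 seconds
# echo 120 > /sys/block/sdX/device/timeout    # otherwise raise the kernel timer instead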
For the kernel to automatically fix bad sectors by overwriting them,
the drive needs to explicitly report read errors. If the SCSI command
timer value is shorter than the drive's error recovery, the SATA link
might get reset before the drive reports the read error, and then
uncorrected errors will persist instead of being automatically fixed.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Uncorrectable errors on RAID-1?
  2014-12-22  0:25 ` Chris Murphy
@ 2014-12-23 21:16   ` Zygo Blaxell
  2014-12-23 22:09     ` Chris Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: Zygo Blaxell @ 2014-12-23 21:16 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 979 bytes --]

On Sun, Dec 21, 2014 at 05:25:47PM -0700, Chris Murphy wrote:
> For the kernel to automatically fix bad sectors by overwriting them,
> the drive needs to explicitly report read errors. If the SCSI command
> timer value is shorter than the drive's error recovery, the SATA link
> might get reset before the drive reports the read error, and then
> uncorrected errors will persist instead of being automatically fixed.

Is there a way to tell the kernel to go ahead and assume that all
timeouts are effectively read errors? For a simple non-removable hard
disk (i.e. not removable and not optical), that seems like a reasonable
workaround for an assortment of firmware brokenness.

I just did a quick survey of random drives here and found less than 10%
support "smartctl -l scterc". A lot of server drives (or at least the
drives that shipped in servers) don't have it, but laptop drives do.
Drives with firmware that has horrifying known bugs do also have this
feature. :-P

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-23 21:16 ` Zygo Blaxell
@ 2014-12-23 22:09   ` Chris Murphy
  2014-12-23 22:23     ` Chris Murphy
  2014-12-28  3:12     ` Phillip Susi
  0 siblings, 2 replies; 18+ messages in thread
From: Chris Murphy @ 2014-12-23 22:09 UTC (permalink / raw)
To: Btrfs BTRFS

On Tue, Dec 23, 2014 at 2:16 PM, Zygo Blaxell <zblaxell@furryterror.org> wrote:
> On Sun, Dec 21, 2014 at 05:25:47PM -0700, Chris Murphy wrote:
>> For the kernel to automatically fix bad sectors by overwriting them,
>> the drive needs to explicitly report read errors. If the SCSI command
>> timer value is shorter than the drive's error recovery, the SATA link
>> might get reset before the drive reports the read error, and then
>> uncorrected errors will persist instead of being automatically fixed.
>
> Is there a way to tell the kernel to go ahead and assume that all
> timeouts are effectively read errors?

The timer in /sys is a kernel command timer; it's not a device timer,
even though it's pointed at a block device. You need to change that
from 30 to something higher to get the behavior you want. It doesn't
really make sense to say "time out in 30 seconds, but instead of
reporting a timeout, report it as a read error". They're completely
different things.

There are all sorts of errors listed in libata, so for all of them to
get dumped into a read error doesn't make sense. A lot of those errors
don't report back a sector, and the key part of the read error is
which sector(s) have the problem so that they can be fixed. Without
that information, the ability to fix it is lost. And it's the drive
that needs to report this.

> For a simple non-removable hard disk (i.e. not removable and not
> optical), that seems like a reasonable workaround for an assortment
> of firmware brokenness.

Oven doesn't work, so let's spray gasoline on it and light it and the
kitchen on fire so that we can cook this damn pizza! That's what I
just read. Sorry. It doesn't seem like a good idea to me to map all
errors as read errors.

> I just did a quick survey of random drives here and found less than 10%
> support "smartctl -l scterc". A lot of server drives (or at least the
> drives that shipped in servers) don't have it, but laptop drives do.
> Drives with firmware that has horrifying known bugs do also have this
> feature. :-P

Any decent server SATA drive should support SCT ERC. The inexpensive
WDC Red drives for NAS use all have it, and by default it is a
reasonable 70 deciseconds last time I checked.

It might be that you're using SAS drives? In that case they may have
something other than SCT ERC that serves the same purpose, but I don't
have any SAS drives here to check. I'd expect any SAS drive to already
have short error recoveries by default, but that expectation might be
flawed.

Chris Murphy

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-23 22:09 ` Chris Murphy
@ 2014-12-23 22:23   ` Chris Murphy
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Murphy @ 2014-12-23 22:23 UTC (permalink / raw)
To: Btrfs BTRFS

The other thing to note is that the SCSI command timer is a maximum:
if a command to the drive hasn't completed within 30 seconds, the
drive is considered hung and the link is reset. Whatever error
recovery is in the drive is also a maximum. If the sector is plainly
bad, the drive will produce a read error immediately. The case where
you get these long recoveries, where the drive keeps retrying beyond
the 30-second SCSI command timer, is when the drive firmware's ECC
thinks it can recover (or reconstruct) the data instead of producing a
read error.

A gotcha with changing the SCSI command timer to a much larger value
is that it possibly gives the drive enough time to recover the data,
report it back to the kernel, and then everything goes on normally.
The "slow sector" doesn't get fixed. Even a scrub wouldn't fix that,
unless the drive returned wrongly recovered data and the Btrfs
checksums catch it.

So what you want to do with a drive that has, or is suspected of
having, such slow sectors is to balance it. Rewrite everything. That
should cause the drive firmware to map out those sectors if they
result in persistent write errors. What ought to happen is that the
data from slow sectors, once recovered, gets written to a reserve
sector and the old sector is removed from use (remapping: the LBA
stays the same but the physical sector is different), but every drive
firmware handles this differently. I have definitely had drives where
this doesn't happen automatically.

Also, I've had drives that, when ATA Secure Erased, did not test for
persistent write errors, so bad sectors weren't removed from use; they
remained persistently bad when doing smartctl -t long tests. In those
cases, using badblocks -w fixed the problem, but of course that's
destructive.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 18+ messages in thread
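A sketch of that rewrite-everything pass for the filesystem in this
thread, plus the destructive whole-device fallback (badblocks -w
erases every sector, so only run it on a drive already removed from
the array; sdX is a placeholder):

# btrfs balance start /mnt/thefilesystem               # rewrites every chunk, data and metadata
# btrfs balance status /mnt/thefilesystem
# badblocks -w -s /dev/sdX                             # DESTRUCTIVE write test of the whole device
# smartctl -A /dev/sdX | grep -iE 'realloc|pending'    # check whether sectors were remapped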
* Re: Uncorrectable errors on RAID-1?
  2014-12-23 22:09 ` Chris Murphy
  2014-12-23 22:23   ` Chris Murphy
@ 2014-12-28  3:12   ` Phillip Susi
  2014-12-29 21:53     ` Chris Murphy
  1 sibling, 1 reply; 18+ messages in thread
From: Phillip Susi @ 2014-12-28 3:12 UTC (permalink / raw)
To: Chris Murphy, Btrfs BTRFS; +Cc: Zygo Blaxell

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 12/23/2014 05:09 PM, Chris Murphy wrote:
> The timer in /sys is a kernel command timer; it's not a device
> timer, even though it's pointed at a block device. You need to
> change that from 30 to something higher to get the behavior you
> want. It doesn't really make sense to say "time out in 30 seconds,
> but instead of reporting a timeout, report it as a read error".
> They're completely different things.

The idea is not to give the drive a ridiculous amount of time to
recover without timing out, but for the timeout to be handled properly.

> There are all sorts of errors listed in libata, so for all of them
> to get dumped into a read error doesn't make sense. A lot of those
> errors don't report back a sector, and the key part of the read
> error is which sector(s) have the problem so that they can be fixed.
> Without that information, the ability to fix it is lost. And it's
> the drive that needs to report this.

It is not lost. The information is simply fuzzed from an exact
individual sector to the range of sectors in the timed-out request. In
an ideal world the drive would give up in a reasonable time and report
the failure, but if it doesn't, then we should deal with that in a
better way than hanging all IO for an unacceptably long time.

> Oven doesn't work, so let's spray gasoline on it and light it and
> the kitchen on fire so that we can cook this damn pizza! That's
> what I just read. Sorry. It doesn't seem like a good idea to me to
> map all errors as read errors.

How do you conclude that? In the face of a timeout your choices are
between kicking the whole drive out of the array immediately, or
attempting to repair it by recovering the affected sector(s) and
rewriting them. Unless that recovery attempt could cause more harm
than degrading the array, where is the "throwing gasoline on it" part?
This is simply a case of the device not providing a specific error
that says whether it can be recovered or not, so let's attempt the
recovery and see if it works, instead of assuming that it won't and
possibly causing data loss that could be avoided.

> Any decent server SATA drive should support SCT ERC. The
> inexpensive WDC Red drives for NAS use all have it, and by default
> it is a reasonable 70 deciseconds last time I checked.

And yet it isn't supported on the cheaper but otherwise identical
Greens, or the higher-performing Blues. We should not be helping
vendors charge a premium for zero-cost firmware features that are
"required" for raid use when they really aren't (even if they are nice
to have).

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBCgAGBQJUn3U5AAoJENRVrw2cjl5RFIQIAJAr86Y5s8RWuL8/We/AlM5Q
JUuZGGaE1IGmMROdUAEzmj78L8lI2U3D95sERDKmd3aJosfpi1SVOExQZebSIqch
hhkLGC0FecxE5VC/67E2wwmfbropSk0mlA5Fbgx8mYf60iUHWcFUkc01kER3JGnd
xMI2jV0UpqVD/gY/a5O7Z7bPeHICQcIyXCN7MAbTMBrDWsYhDACQpij+aNXu5+ke
rCNV5c/VkYFQZ9aaMb6Mxmi9KOkCVv2+kBOsxwqPxlO5s9vKORDhxMp8XeJQEvhU
X2GAgS8r8gSGVdPutekXR1vB+TwhdMxftBWL9jcI1y05Y0z3GcOX+/90S9mrSaU=
=2tIU
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-28  3:12 ` Phillip Susi
@ 2014-12-29 21:53   ` Chris Murphy
  2014-12-30 20:46     ` Phillip Susi
  2014-12-31 15:40     ` Austin S Hemmelgarn
  0 siblings, 2 replies; 18+ messages in thread
From: Chris Murphy @ 2014-12-29 21:53 UTC (permalink / raw)
To: Btrfs BTRFS

On Sat, Dec 27, 2014 at 8:12 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 12/23/2014 05:09 PM, Chris Murphy wrote:
>> The timer in /sys is a kernel command timer; it's not a device
>> timer, even though it's pointed at a block device. You need to
>> change that from 30 to something higher to get the behavior you
>> want. It doesn't really make sense to say "time out in 30 seconds,
>> but instead of reporting a timeout, report it as a read error".
>> They're completely different things.
>
> The idea is not to give the drive a ridiculous amount of time to
> recover without timing out, but for the timeout to be handled properly.

Get drives supporting configurable or faster recoveries. There's no
way around this.

>> There are all sorts of errors listed in libata, so for all of them
>> to get dumped into a read error doesn't make sense. A lot of those
>> errors don't report back a sector, and the key part of the read
>> error is which sector(s) have the problem so that they can be fixed.
>> Without that information, the ability to fix it is lost. And it's
>> the drive that needs to report this.
>
> It is not lost. The information is simply fuzzed from an exact
> individual sector to the range of sectors in the timed-out request. In
> an ideal world the drive would give up in a reasonable time and report
> the failure, but if it doesn't, then we should deal with that in a
> better way than hanging all IO for an unacceptably long time.

This is honestly a broken-record topic. The drives under discussion
were never meant to be used in raid; they're desktop drives, designed
with long recoveries because it's reasonable to try to recover the
data even in the face of delays rather than not recover at all.
Whether there are also some design flaws in here I can't say, because
I'm not a hardware designer or developer, but they are very clearly
targeted at certain use cases and not others, not least of which is
their error recovery time, but also their vibration tolerance when
multiple drives are in close proximity to each other.

If you don't like long recoveries, don't buy drives with long
recoveries. Simple.

>> Oven doesn't work, so let's spray gasoline on it and light it and
>> the kitchen on fire so that we can cook this damn pizza! That's
>> what I just read. Sorry. It doesn't seem like a good idea to me to
>> map all errors as read errors.
>
> How do you conclude that? In the face of a timeout your choices are
> between kicking the whole drive out of the array immediately, or
> attempting to repair it by recovering the affected sector(s) and
> rewriting them. Unless that recovery attempt could cause more harm
> than degrading the array, where is the "throwing gasoline on it"
> part? This is simply a case of the device not providing a specific
> error that says whether it can be recovered or not, so let's attempt
> the recovery and see if it works, instead of assuming that it won't
> and possibly causing data loss that could be avoided.

The device will absolutely provide a specific error so long as its
link isn't reset prematurely, which happens to be the Linux default
behavior when combined with drives that have long error recovery
times. Hence the recommendation to increase the Linux command timer
value. That is the solution right now.
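Applying that persistently can be as small as this (the 180-second
figure and the blanket sd* match are illustrative, not values anyone
in this thread recommends; pick something comfortably above the
drive's worst-case recovery, which for desktop drives can run to a
couple of minutes):

# e.g. in rc.local or an equivalent boot-time script:
for t in /sys/block/sd*/device/timeout ; do echo 180 > "$t" ; done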
If you want a different behavior, someone has to write the code to do
it, because it doesn't exist yet; and so far there seems to be zero
interest in actually doing that work, just some interest in
hand-waving that it ought to exist, maybe.

>> Any decent server SATA drive should support SCT ERC. The
>> inexpensive WDC Red drives for NAS use all have it, and by default
>> it is a reasonable 70 deciseconds last time I checked.
>
> And yet it isn't supported on the cheaper but otherwise identical
> Greens, or the higher-performing Blues. We should not be helping
> vendors charge a premium for zero-cost firmware features that are
> "required" for raid use when they really aren't (even if they are
> nice to have).

The manufacturer says they differ in vibration characteristics, 24x7
usage expectation, and warranty, among the most relevant features. The
Red has a 3-year warranty, the Green a 1-year warranty. That alone
easily accounts for the $15 difference, although that's perhaps
somewhat subjective. I don't actually know the wholesale prices; they
could be the same if the purchasing terms are identical.

Western Digital Red NAS Hard Drive WD30EFRX 3TB IntelliPower 64MB
Cache SATA 6.0Gb/s 3.5" NAS Hard Drive: $114 on Newegg.com

Western Digital WD Green WD30EZRX 3TB IntelliPower 64MB Cache SATA
6.0Gb/s 3.5" Internal Hard Drive Bare Drive - OEM: $99 on Newegg.com

And none of the manufacturers actually says these features are
required for raid use. What they say is that they reserve the right to
deny warranty claims if you're using a drive in a manner inconsistent
with its intended usage, which is rather easily found information.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-29 21:53 ` Chris Murphy
@ 2014-12-30 20:46   ` Phillip Susi
  2014-12-30 23:58     ` Chris Murphy
  1 sibling, 1 reply; 18+ messages in thread
From: Phillip Susi @ 2014-12-30 20:46 UTC (permalink / raw)
To: Chris Murphy, Btrfs BTRFS

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 12/29/2014 4:53 PM, Chris Murphy wrote:
> Get drives supporting configurable or faster recoveries. There's
> no way around this.

Practically available right now? Sure. In theory, no.

> This is honestly a broken-record topic. The drives under
> discussion were never meant to be used in raid; they're desktop
> drives, designed with long recoveries because it's reasonable to
> try to

The intention to use the drives in a raid is entirely at the
discretion of the user, not the manufacturer. The only reason we are
even having this conversation is that the manufacturer has added a
misfeature that makes them sub-optimal for use in a raid.

> recover the data even in the face of delays rather than not recover
> at all. Whether there are also some design flaws in here I can't
> say, because I'm not a hardware designer or developer, but they are
> very clearly targeted at certain use cases and not others, not
> least of which is their error recovery time, but also their
> vibration tolerance when multiple drives are in close proximity to
> each other.

Drives have no business whatsoever retrying for so long; every version
of DOS or Windows ever released has been able to report an IO error
and give the *user* the option of retrying it in the hopes that it
will work that time, because drives used to be sane and not keep
retrying a positively ridiculous number of times.

> If you don't like long recoveries, don't buy drives with long
> recoveries. Simple.

Better to fix the software to deal with it sensibly instead of
encouraging manufacturers to engage in hamstringing their lower-priced
products to coax more money out of their customers.

> The device will absolutely provide a specific error so long as its
> link isn't reset prematurely, which happens to be the Linux default
> behavior when combined with drives that have long error recovery
> times. Hence the recommendation to increase the Linux command timer
> value. That is the solution right now. If you want a different
> behavior, someone has to write the code to do it, because it
> doesn't exist yet; and so far there seems to be zero interest in
> actually doing that work, just some interest in hand-waving that it
> ought to exist, maybe.

If this is your way of saying "patches welcome" then it probably would
have been better just to say that.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)

iQEcBAEBAgAGBQJUow8ZAAoJENRVrw2cjl5Rr9UH+wd3yJ1ZnoaxDG3JPCBq9MJb
Tb6nhjHovRDREeus4UWLESp9kYUyy5OfKmahARhM6AbaBXWYeleoD9SEtMahFXfn
/2Kn9yRBqZCBDloVQGNOUaSZyfhTRRl31cGABbbynRo6IDkLEfMQQPWgvz9ttch7
3aPciHhehs1CeseNuiiUPk6HIMb8lJLvgW5J1O5FwgXZ6Wyi9OZdoPL+prnFh2bP
5E2rGblYUHIUiLkOKFOOsEs8q2H9RICFJIBsz8KoPzjCDtdNETBF5mvx8bIUJpg0
Q7cQOo7IRxpFUL/7gnBtWgRIw3lvRY+SY2G+2YwaMiqdeuYcLCr853ONDYg0NCc=
=AYGW
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-30 20:46 ` Phillip Susi
@ 2014-12-30 23:58   ` Chris Murphy
  2014-12-31  3:16     ` Phillip Susi
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2014-12-30 23:58 UTC (permalink / raw)
To: Btrfs BTRFS

On Tue, Dec 30, 2014 at 1:46 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 12/29/2014 4:53 PM, Chris Murphy wrote:
>> Get drives supporting configurable or faster recoveries. There's
>> no way around this.
>
> Practically available right now? Sure. In theory, no.

I have no idea what this means. Such drives exist; you can buy them or
not buy them.

>> This is honestly a broken-record topic. The drives under
>> discussion were never meant to be used in raid; they're desktop
>> drives, designed with long recoveries because it's reasonable to
>> try to
>
> The intention to use the drives in a raid is entirely at the
> discretion of the user, not the manufacturer. The only reason we are
> even having this conversation is that the manufacturer has added a
> misfeature that makes them sub-optimal for use in a raid.

Clearly you have never owned a business, nor have you been involved in
volume manufacturing, or you wouldn't be so keen to demand that one
market subsidize another. 24x7 usage is a non-trivial quantity of
additional wear and tear on the drive compared to an 8-hour/day,
40-hour/week duty cycle. But you seem to think the manufacturer has no
right to produce a cheaper drive for seldom-used hardware, or a more
expensive one for constantly used hardware.

And of course you completely ignored, and deleted, my point about the
difference in warranties.

Does the SATA specification require configurable SCT ERC? Does it
require even supporting SCT ERC? I think your argument is flawed by
mis-distributing the economic burden while simultaneously denying one
even exists, or holding that these companies should just eat the cost
differential if it does. In any case the argument is asinine.

>> recover the data even in the face of delays rather than not recover
>> at all. Whether there are also some design flaws in here I can't
>> say, because I'm not a hardware designer or developer, but they are
>> very clearly targeted at certain use cases and not others, not
>> least of which is their error recovery time, but also their
>> vibration tolerance when multiple drives are in close proximity to
>> each other.
>
> Drives have no business whatsoever retrying for so long; every version
> of DOS or Windows ever released has been able to report an IO error
> and give the *user* the option of retrying it in the hopes that it
> will work that time, because drives used to be sane and not keep
> retrying a positively ridiculous number of times.

When the encoded data signal weakens, the bits effectively become
fuzzy: each read produces different results. Obviously this is a very
rare condition or there'd be widespread panic. However, it's common
and expected enough that the drive manufacturers are all, to very
little varying degree, dealing with this problem in a similar way,
which is multiple reads.

Now you could say they're all in collusion with each other to screw
users over, rather than having legitimate reasons for all of these
retries. Unless you're a hard drive engineer, I'm unlikely to find
such an argument compelling. Besides, it would also be a charge of
fraud.

>> If you don't like long recoveries, don't buy drives with long
>> recoveries. Simple.
>
> Better to fix the software to deal with it sensibly instead of
> encouraging manufacturers to engage in hamstringing their lower-priced
> products to coax more money out of their customers.

In the meantime, there already is a working software alternative:
(re)write over all sectors periodically. Perhaps every 6-12 months is
sufficient to mitigate such signal weakening on marginal sectors that
aren't persistently failing on writes. This can be done with a
periodic reshape if it's md raid. It can be done with balance on
Btrfs. It can be done with resilvering on ZFS.

>> The device will absolutely provide a specific error so long as its
>> link isn't reset prematurely, which happens to be the Linux default
>> behavior when combined with drives that have long error recovery
>> times. Hence the recommendation to increase the Linux command timer
>> value. That is the solution right now. If you want a different
>> behavior, someone has to write the code to do it, because it
>> doesn't exist yet; and so far there seems to be zero interest in
>> actually doing that work, just some interest in hand-waving that it
>> ought to exist, maybe.
>
> If this is your way of saying "patches welcome" then it probably would
> have been better just to say that.

Certainly not. I'm not the maintainer of anything; I have no idea if
such things are welcome. I'm not even a developer. I couldn't code my
way out of a hat.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-30 23:58 ` Chris Murphy
@ 2014-12-31  3:16   ` Phillip Susi
  2015-01-03  5:31     ` Chris Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: Phillip Susi @ 2014-12-31 3:16 UTC (permalink / raw)
To: Chris Murphy, Btrfs BTRFS

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 12/30/2014 06:58 PM, Chris Murphy wrote:
>> Practically available right now? Sure. In theory, no.
>
> I have no idea what this means. Such drives exist; you can buy them
> or not buy them.

I was referring to the "no way around this" part. Currently you are
correct, but in theory the way around it is exactly the subject of
this thread.

> Clearly you have never owned a business, nor have you been involved
> in volume manufacturing, or you wouldn't be so keen to demand that
> one market subsidize another. 24x7 usage is a non-trivial quantity
> of additional wear and tear on the drive compared to an 8-hour/day,
> 40-hour/week duty cycle. But you seem to think the manufacturer has
> no right to produce a cheaper drive for seldom-used hardware, or a
> more expensive one for constantly used hardware.

Just because I want a raid doesn't mean I need it to operate reliably
24x7. For that matter, it has long been established that power cycling
drives puts more wear and tear on them, and as a general rule, leaving
them on 24x7 results in them lasting longer.

> And of course you completely ignored, and deleted, my point about
> the difference in warranties.

Because I don't care? It's nice and all that they warranty the more
expensive drive more, and it may possibly even mean that they are
actually more reliable (but not likely), but that doesn't mean the
system should have an unnecessarily terrible response to the behavior
of the cheaper drives. Is it worth recommending the more expensive
drives? Sure... but the system should also handle the cheaper drives
with grace.

> Does the SATA specification require configurable SCT ERC? Does it
> require even supporting SCT ERC? I think your argument is flawed
> by mis-distributing the economic burden while simultaneously denying
> one even exists, or holding that these companies should just eat the
> cost differential if it does. In any case the argument is asinine.

There didn't used to be any such thing; drives simply did not *ever*
go into absurdly long internal retries, so there was no need. The fact
that they do these days I consider a misfeature, and one that *can* be
worked around in software, which is the point here.

> When the encoded data signal weakens, the bits effectively become
> fuzzy: each read produces different results. Obviously this is a
> very rare condition or there'd be widespread panic. However, it's
> common and expected enough that the drive manufacturers are all, to
> very little varying degree, dealing with this problem in a similar
> way, which is multiple reads.

Sure, but the noise introduced by the read (as opposed to the noise in
the actual signal on the platter) isn't that large, so retrying 10,000
times isn't going to give any better results than retrying, say, 100
times; and if the user really desires that many retries, they have
always been able to do them at the software level rather than
depending on the drive. There is no reason for the drives to have
increased their internal retries that much, and then deliberately
withheld the essentially zero-cost ability to limit those internal
retries, other than to drive customers to pay for the more expensive
models.

> Now you could say they're all in collusion with each other to screw
> users over, rather than having legitimate reasons for all of these
> retries. Unless you're a hard drive engineer, I'm unlikely to find
> such an argument compelling. Besides, it would also be a charge of
> fraud.

Calling it fraud might be a bit of a stretch, but yes, there is no
legitimate reason for *that* many retries, since people have been
retrying failed reads in software for decades, and there are
diminishing returns to increasing the number of retries.

> In the meantime, there already is a working software alternative:
> (re)write over all sectors periodically. Perhaps every 6-12 months
> is sufficient to mitigate such signal weakening on marginal sectors
> that aren't persistently failing on writes. This can be done with a
> periodic reshape if it's md raid. It can be done with balance on
> Btrfs. It can be done with resilvering on ZFS.

Is there any actual evidence that this is effective? Or that the
recording degrades as a function of time? I doubt it, since I have
data on drives that were last written 10 years ago that is still
readable. Even if so, this is really a non sequitur: if the signal has
degraded making it hard to read, in a raid we can simply recover using
the other drives. The issue here is whether we should be doing such
recovery sooner rather than waiting for the silly drive to retry
100,000 times before giving up.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBCgAGBQJUo2p7AAoJENRVrw2cjl5RBRQH/iPeByoKWCBCNcSH+slHQpLu
UgFw1Sb0VhkcMV7LWGHRPVCOqOqRUyiDUIWBqjnnKAtGWvngqoVa8oCrYXYfgzeT
snarm36vtm5jWQygn62mpZKoFVby5ttKTP3+rwQi+OjZ3+EWKKVkuXRFYpwt5ylt
f/Xix2EpgMrl9hi8Bt8D/aLPtyPIF47D5vwa2nw7f5/gU0rKDfG9OZ4B7Bs1Jl0Q
UA+bXlz4zi0cD6S7gwKStrDljAmMKjLnpWqMPHHnTWUgKuRRM/VKwzIhZmEZraqD
y3SdY1JBj1qli50ZvKH+lkEag0mixMLvzN4mC6gYKqXjG2EAsHMp8185kK97gSQ=
=agsX
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Uncorrectable errors on RAID-1?
  2014-12-31  3:16 ` Phillip Susi
@ 2015-01-03  5:31 ` Chris Murphy
  2015-01-05  4:18   ` Phillip Susi
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2015-01-03 5:31 UTC (permalink / raw)
To: Phillip Susi; +Cc: Chris Murphy, Btrfs BTRFS

On Tue, Dec 30, 2014 at 8:16 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> Just because I want a raid doesn't mean I need it to operate
> reliably 24x7. For that matter, it has long been established that
> power cycling drives puts more wear and tear on them and, as a
> general rule, leaving them on 24x7 results in them lasting longer.

It's not a made-to-order hard drive industry. Maybe one day you'll
be able to 3D print your own with its own specs.

>> And of course you completely ignored, and deleted, my point about
>> the difference in warranties.
>
> Because I don't care?

Sticking fingers in your ears doesn't change the fact that there's a
measurable difference in support requirements.

> It's nice and all that they give the more expensive drive a longer
> warranty, and it may even mean that it is actually more reliable
> (but not likely), but that doesn't mean that the system should
> have an unnecessarily terrible response to the behavior of the
> cheaper drives. Is it worth recommending the more expensive
> drives? Sure... but the system should also handle the cheaper
> drives with grace.

This is architecture astronaut territory.

The system only has a terrible response for two reasons: 1. The
user spec'd the wrong hardware for the use case; 2. The distro isn't
automatically leveraging existing ways to mitigate that user mistake
by changing either SCT ERC on the drives, or the SCSI command timer
for each block device.

Now, even though that solution *might* mean long recoveries on
occasion, it's still better than the link reset behavior we have
today, because it causes the underlying problem to be fixed by
md/dm/Btrfs once the read error is reported. But no distro has
implemented this $500 man-hour solution. Instead you're suggesting a
$500,000 fix that will take hundreds of man-hours and end-user
testing to find all the edge cases. It's like, seriously, WTF?

>> Does the SATA specification require configurable SCT ERC? Does it
>> require even supporting SCT ERC? I think your argument is flawed
>> by mis-distributing the economic burden while simultaneously
>> denying that one even exists, or holding that these companies
>> should just eat the cost differential if it does. In any case the
>> argument is asinine.
>
> There didn't used to be any such thing; drives simply did not
> *ever* go into absurdly long internal retries, so there was no
> need. The fact that they do these days I consider a misfeature,
> and one that *can* be worked around in software, which is the
> point here.

OK, well, I think that's hubris unless you're a hard drive engineer.
You're referring to how drives behaved over a decade ago, when bad
sectors were persistent rather than remapped, and we had to scan the
drive at format time to build a map so the bad ones wouldn't be used
by the filesystem.

>> When the encoded data signal weakens, the bits effectively become
>> fuzzy. Each read produces different results. Obviously this is a
>> very rare condition or there'd be widespread panic. However, it's
>> common and expected enough that the drive manufacturers are all,
>> with very little variation between them, dealing with this
>> problem in a similar way, which is multiple reads.
>
> Sure, but the noise introduced by the read (as opposed to the
> noise in the actual signal on the platter) isn't that large, and
> so retrying 10,000 times isn't going to give any better results
> than retrying, say, 100 times; and if the user really desires that
> many retries, they have always been able to do so at the software
> level rather than depending on the drive to try that much. There
> is no reason for the drives to have increased their internal
> retries that much, and then deliberately withheld the essentially
> zero-cost ability to limit those internal retries, other than to
> drive customers to pay for the more expensive models.

http://www.seagate.com/files/www-content/support-content/documentation/product-manuals/en-us/Enterprise/Savvio/Savvio%2015K.3/100629381e.pdf

That's a high-end SAS drive. Its default is to retry up to 20 times,
which takes ~1.4 seconds, per sector. But also note how it says that
lowering the default increases the unrecoverable error rate. That
makes sense. So even if the probability is low that retrying for up
to 120 seconds will work, statistically, increasing the default
improves the unrecoverable error rate.

If I'm going to be a conspiracy theorist, I'd say the recoveries are
getting longer by default in order to keep the specifications
reporting sane unrecoverable error rates. Maybe you'd prefer seeing
these big, cheap, "green" drives have shorter ERC times, with a
commensurate reality check in their unrecoverable error rate, which
right now is already two orders of magnitude higher than enterprise
SAS drives. So what if this means that rate is 3 or 4 orders of
magnitude higher?

Now I'm just going to wait for you to suggest that this sucks donkey
tail and that the manufacturers should produce drives with the same
UER as drives 10 years ago *and* with the same error recovery times,
and charge no additional money. OK, good luck with that!
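
(For anyone who wants to experiment: on drives that implement SCT
ERC, smartctl exposes it directly. sdX is a placeholder, and the
values are in deciseconds, so 70 means 7.0 seconds.)

# smartctl -l scterc /dev/sdX
# smartctl -l scterc,70,70 /dev/sdX

The first reports the drive's current read/write recovery limits;
the second caps both at 7.0 seconds, on drives that accept the
command.

-- 
Chris Murphy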
* Re: Uncorrectable errors on RAID-1?
  2015-01-03  5:31 ` Chris Murphy
@ 2015-01-05  4:18 ` Phillip Susi
  2015-01-05  7:41   ` Chris Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: Phillip Susi @ 2015-01-05 4:18 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS

On 01/03/2015 12:31 AM, Chris Murphy wrote:
> It's not a made-to-order hard drive industry. Maybe one day you'll
> be able to 3D print your own with its own specs.

And Wookiees did not live on Endor. What's your point?

> Sticking fingers in your ears doesn't change the fact that there's
> a measurable difference in support requirements.

Sure, just don't misrepresent one requirement as another. Just
because I don't care about a warranty from the hardware manufacturer
does not mean I have no right to expect the kernel to perform
*reasonably* on that hardware.

> This is architecture astronaut territory.
>
> The system only has a terrible response for two reasons: 1. The
> user spec'd the wrong hardware for the use case; 2. The distro
> isn't automatically leveraging existing ways to mitigate that user
> mistake by changing either SCT ERC on the drives, or the SCSI
> command timer for each block device.

No, it has a terrible response because the kernel either waits an
unreasonable time or fails the drive and kicks it out of the array
instead of trying to repair it. Blaming the user for not buying
better hardware is not an appropriate response to the kernel failing
so badly to handle commonly available hardware that doesn't behave
in the most ideal way.

> Now, even though that solution *might* mean long recoveries on
> occasion, it's still better than the link reset behavior we have
> today, because it causes the underlying problem to be fixed by
> md/dm/Btrfs once the read error is reported. But no distro has
> implemented this $500 man-hour solution. Instead you're suggesting
> a $500,000 fix that will take hundreds of man-hours and end-user
> testing to find all the edge cases. It's like, seriously, WTF?

Seriously? Treating a timeout the same way you treat an
unrecoverable media error is no Herculean task.

> OK, well, I think that's hubris unless you're a hard drive
> engineer. You're referring to how drives behaved over a decade
> ago, when bad sectors were persistent rather than remapped, and we
> had to scan the drive at format time to build a map so the bad
> ones wouldn't be used by the filesystem.

Remapping has nothing to do with it: we are talking about *read*
errors, which do not trigger a remap.

> http://www.seagate.com/files/www-content/support-content/documentation/product-manuals/en-us/Enterprise/Savvio/Savvio%2015K.3/100629381e.pdf
>
> That's a high-end SAS drive. Its default is to retry up to 20
> times, which takes ~1.4 seconds, per sector. But also note how it
> says

20 retries on a 15,000 rpm drive only takes 80 milliseconds, not 1.4
seconds: 15,000 rpm / 60 seconds per minute = 250 rotations, and
hence retries, per second.

> Maybe you'd prefer seeing these big, cheap, "green" drives have
> shorter ERC times, with a commensurate reality check in their
> unrecoverable error rate, which right now is already two orders of
> magnitude higher than enterprise SAS drives. So what if this means
> that rate is 3 or 4 orders of magnitude higher?

20 retries vs. 200 retries does not reduce the URE rate by orders of
magnitude; more like 1%, *maybe*. 200 vs. 2000 makes no measurable
difference at all.
* Re: Uncorrectable errors on RAID-1?
  2015-01-05  4:18 ` Phillip Susi
@ 2015-01-05  7:41 ` Chris Murphy
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Murphy @ 2015-01-05 7:41 UTC (permalink / raw)
To: Btrfs BTRFS

On Sun, Jan 4, 2015 at 9:18 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 01/03/2015 12:31 AM, Chris Murphy wrote:
>> This is architecture astronaut territory.
>>
>> The system only has a terrible response for two reasons: 1. The
>> user spec'd the wrong hardware for the use case; 2. The distro
>> isn't automatically leveraging existing ways to mitigate that
>> user mistake by changing either SCT ERC on the drives, or the
>> SCSI command timer for each block device.
>
> No, it has a terrible response because the kernel either waits an
> unreasonable time or fails the drive and kicks it out of the array
> instead of trying to repair it.

It's a default that works for more use cases than not. The kernel
isn't dynamically self-configuring, and it isn't even the kernel's
job to take the first step, which is to enable and correctly set SCT
ERC on each drive. I think it's a bad idea to assume that the large
pile of causes for a drive freezing on a command can all be treated
as read errors (after the link reset). But since it's your idea, and
I'm not a kernel developer, you should propose it on linux-raid@
instead of arguing with me.

> Blaming the user for not buying better hardware is not an
> appropriate response to the kernel failing so badly to handle
> commonly available hardware that doesn't behave in the most ideal
> way.

"Hi, I'm a good and knowledgeable sysadmin. I buy hardware that's
explicitly stated in the company's marketing data sheet as being
incompatible with my use case. This is someone else's fault."

Sounds like buck passing.

>> Now, even though that solution *might* mean long recoveries on
>> occasion, it's still better than the link reset behavior we have
>> today, because it causes the underlying problem to be fixed by
>> md/dm/Btrfs once the read error is reported. But no distro has
>> implemented this $500 man-hour solution. Instead you're
>> suggesting a $500,000 fix that will take hundreds of man-hours
>> and end-user testing to find all the edge cases. It's like,
>> seriously, WTF?
>
> Seriously? Treating a timeout the same way you treat an
> unrecoverable media error is no Herculean task.

So you keep saying. But the best practice is already known and
tested, and can be implemented with a startup script (sketched at
the end of this message). Yet no distro does this for the user, even
though it's much, much simpler than what you're proposing, and it
actually fixes both sources of the problem. That it is, in your
opinion, an imperfect fix is not relevant. It's still better
behavior than what we have today, and yet still no distro does this,
thereby tacitly preferring the status quo. And if the current
behavior is simply good enough that no one has taken action to
automatically implement the known best-practice workaround of the
day, why should kernel developers give two shits about this idea?

Sounds like more buck passing.

>> http://www.seagate.com/files/www-content/support-content/documentation/product-manuals/en-us/Enterprise/Savvio/Savvio%2015K.3/100629381e.pdf
>>
>> That's a high-end SAS drive. Its default is to retry up to 20
>> times, which takes ~1.4 seconds, per sector. But also note how it
>> says
>
> 20 retries on a 15,000 rpm drive only takes 80 milliseconds, not
> 1.4 seconds: 15,000 rpm / 60 seconds per minute = 250 rotations,
> and hence retries, per second.

The PDF contains a table saying 20 retries takes 1.4 seconds. I
didn't compute this number myself; it's in the bloody manufacturer's
own documentation. Obviously the ECC is doing things that take more
than one revolution of the spindle.

>> Maybe you'd prefer seeing these big, cheap, "green" drives have
>> shorter ERC times, with a commensurate reality check in their
>> unrecoverable error rate, which right now is already two orders
>> of magnitude higher than enterprise SAS drives. So what if this
>> means that rate is 3 or 4 orders of magnitude higher?
>
> 20 retries vs. 200 retries does not reduce the URE rate by orders
> of magnitude; more like 1%, *maybe*. 200 vs. 2000 makes no
> measurable difference at all.

I see. Well, I guess you prefer believing in fraud and conspiracy
theories, by multiple companies, to screw users over, even while
they admit the incompatibility with the intended use case right on
their data sheets.
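
For the record, the startup script in question is roughly the sketch
below. The 7-second ERC cap and 180-second command timer are the
commonly recommended values, not gospel, and it assumes smartctl
exits non-zero when a drive rejects the SCT ERC command:

#!/bin/sh
# For each SATA disk: cap in-drive error recovery at 7.0 seconds via
# SCT ERC. If the drive doesn't support that, raise the kernel's
# command timer instead, so long internal recoveries finish and get
# reported as read errors rather than being cut short by a link reset.
for disk in /dev/sd[a-z]; do
    [ -b "$disk" ] || continue
    dev=${disk#/dev/}
    if smartctl -l scterc,70,70 "$disk" > /dev/null 2>&1; then
        echo "$disk: SCT ERC capped at 7.0 seconds"
    else
        echo 180 > /sys/block/"$dev"/device/timeout
        echo "$disk: no SCT ERC; command timer raised to 180 seconds"
    fi
done

Neither setting survives a reboot, which is exactly why it belongs
in a boot-time script or udev rule.

-- 
Chris Murphy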
* Re: Uncorrectable errors on RAID-1?
  2014-12-29 21:53 ` Chris Murphy
  2014-12-30 20:46 ` Phillip Susi
@ 2014-12-31 15:40 ` Austin S Hemmelgarn
  1 sibling, 0 replies; 18+ messages in thread
From: Austin S Hemmelgarn @ 2014-12-31 15:40 UTC (permalink / raw)
To: Chris Murphy, Btrfs BTRFS

On 2014-12-29 16:53, Chris Murphy wrote:
> On Sat, Dec 27, 2014 at 8:12 PM, Phillip Susi <psusi@ubuntu.com> wrote:
>> On 12/23/2014 05:09 PM, Chris Murphy wrote:
>>> The timer in /sys is a kernel command timer, it's not a device
>>> timer even though it's pointed at a block device. You need to
>>> change that from 30 to something higher to get the behavior you
>>> want. It doesn't really make sense to say "time out in 30
>>> seconds, but instead of reporting a timeout, report it as a read
>>> error". They're completely different things.
>>
>> The idea is not to give the drive a ridiculous amount of time to
>> recover without timing out, but for the timeout to be handled
>> properly.
>
> Get drives supporting configurable or faster recoveries. There's
> no way around this.
>
>>> There are all sorts of errors listed in libata, so for all of
>>> them to get dumped into a read error doesn't make sense. A lot
>>> of those errors don't report back a sector, and the key part of
>>> the read error is what sector(s) have the problem so that they
>>> can be fixed. Without that information, the ability to fix it is
>>> lost. And it's the drive that needs to report this.
>>
>> It is not lost. The information is simply fuzzed from an exact
>> individual sector to a range of sectors in the timed-out request.
>> In an ideal world the drive would give up in a reasonable time
>> and report the failure, but if it doesn't, then we should deal
>> with that in a better way than hanging all IO for an unacceptably
>> long time.
>
> This is a broken record topic, honestly. The drives under
> discussion aren't ever meant to be used in raid; they're desktop
> drives, designed with long recoveries because it's reasonable to
> try to recover the data even in the face of delays rather than not
> recover at all. Whether there are also some design flaws in here I
> can't say, because I'm not a hardware designer or developer, but
> they are very clearly targeted at certain use cases and not
> others, not least because of their error recovery time but also
> their vibration tolerance when multiple drives are in close
> proximity to each other.
>
> If you don't like long recoveries, don't buy drives with long
> recoveries. Simple.
>
>>> Oven doesn't work, so let's spray gasoline on it and light it
>>> and the kitchen on fire so that we can cook this damn pizza!
>>> That's what I just read. Sorry. It doesn't seem like a good idea
>>> to me to map all errors as read errors.
>>
>> How do you conclude that? In the face of a timeout your choices
>> are between kicking the whole drive out of the array immediately,
>> or attempting to repair it by recovering the affected sector(s)
>> and rewriting them. Unless that recovery attempt could cause more
>> harm than degrading the array, then where is the "throwing
>> gasoline on it" part? This is simply a case of the device not
>> providing a specific error that says whether it can be recovered
>> or not, so let's attempt the recovery and see if it works,
>> instead of assuming that it won't and possibly causing data loss
>> that could be avoided.
>
> The device will absolutely provide a specific error so long as its
> link isn't reset prematurely, which happens to be the Linux
> default behavior when combined with drives that have long error
> recovery times. Hence the recommendation is to increase the Linux
> command timer value. That is the solution right now. If you want a
> different behavior, someone has to write the code to do it,
> because it doesn't exist yet, and so far there seems to be zero
> interest in actually doing that work, just some interest in hand
> waving that it ought to exist, maybe.
>
>>> Any decent server SATA drive should support SCT ERC. The
>>> inexpensive WDC Red drives for NASes all have it, and by default
>>> it is set to a reasonable 70 deciseconds last time I checked.
>>
>> And yet it isn't supported on the cheaper but otherwise identical
>> Greens, or the higher-performing Blues. We should not be helping
>> vendors charge a premium for zero-cost firmware features that are
>> "required" for raid use when they really aren't (even if they are
>> nice to have).
>
> The manufacturer says they differ in vibration characteristics,
> 24x7 usage expectation, and warranty, among the top relevant
> features. The Red has a 3-year warranty, the Green a 1-year
> warranty. That alone easily accounts for the $15 difference,
> although that's perhaps somewhat subjective. I don't actually know
> the wholesale prices; they could be the same if the purchasing
> terms are identical.
>
> Western Digital Red NAS Hard Drive WD30EFRX 3TB IntelliPower 64MB
> Cache SATA 6.0Gb/s 3.5" NAS Hard Drive
> $114 on Newegg.com
>
> Western Digital WD Green WD30EZRX 3TB IntelliPower 64MB Cache SATA
> 6.0Gb/s 3.5" Internal Hard Drive Bare Drive - OEM
> $99 on Newegg.com
>
> And none of the manufacturers actually says these features are
> required for raid use. What they say is that they reserve the
> right to deny warranty claims if you're using a drive in a manner
> inconsistent with its intended usage, which is rather easily found
> information.

The fact is, though, that the _hardware_ on the Green drives _does_
have everything needed to support SCT ERC properly; it's just that
the firmware refuses to support it. The same is the case on every
Seagate desktop drive I've ever seen. It's essentially the same as
how NVIDIA sells the same hardware under both the GeForce and Quadro
brands, with only the BIOS/firmware differing. There is a
long-standing tradition among hardware manufacturers of crippling
good hardware in firmware and selling it at a lower price than the
non-crippled product. It's not too hard (with a little know-how) to
modify a firmware updater for one of the drives in question to flash
it with the good firmware instead of the crap that comes on it by
default (although this WILL void your warranty and, depending on the
laws where you live, might also count as reverse engineering).
[parent not found: <CAJCQCtQYhaDEic5bwd+PEcEfwOqLwAe8cT8VPZ9je+JLRP1GPw@mail.gmail.com>]
* Re: Uncorrectable errors on RAID-1?
       [not found] ` <CAJCQCtQYhaDEic5bwd+PEcEfwOqLwAe8cT8VPZ9je+JLRP1GPw@mail.gmail.com>
@ 2014-12-22 14:28 ` constantine
  2014-12-22 16:05 ` Chris Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: constantine @ 2014-12-22 14:28 UTC (permalink / raw)
To: Chris Murphy, linux-btrfs

On Mon, Dec 22, 2014 at 12:24 AM, Chris Murphy <lists@colorremedies.com> wrote:
> smartctl -l scterc /dev/sdX

That's really good to know. My drives are desktop drives and this
feature is not supported; hence I get "SCT Error Recovery Control
command not supported".

I'll definitely go for enterprise/raid-class drives that support
this command in the future; they somehow seem more transparent (if I
may say) to maintain.
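
For the archives, this is the exchange on each of my disks (sdX
stands for each member drive):

# smartctl -l scterc /dev/sdX
SCT Error Recovery Control command not supported

whereas, from what I've read, a drive that supports it reports
something like this (the 70-decisecond values are illustrative):

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)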
* Re: Uncorrectable errors on RAID-1?
  2014-12-22 14:28 ` constantine
@ 2014-12-22 16:05 ` Chris Murphy
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Murphy @ 2014-12-22 16:05 UTC (permalink / raw)
To: constantine; +Cc: Chris Murphy, Btrfs BTRFS

On Mon, Dec 22, 2014 at 7:28 AM, constantine <costas.magnuse@gmail.com> wrote:
> On Mon, Dec 22, 2014 at 12:24 AM, Chris Murphy <lists@colorremedies.com> wrote:
>> smartctl -l scterc /dev/sdX
>
> That's really good to know. My drives are desktop drives and this
> feature is not supported; hence I get "SCT Error Recovery Control
> command not supported".
>
> I'll definitely go for enterprise/raid-class drives that support
> this command in the future; they somehow seem more transparent (if
> I may say) to maintain.

Not knowing anything else, I'd say the kernel command timer should
be set to 121 seconds in your case, for each drive. If you find
evidence it can be shorter, go with that. Bad sectors will fail
fast, which is what you want since you have mirrored data. Marginal
sectors might take a while for the firmware to either recover or
fail.

It's possible to mitigate long recoveries with a periodic balance,
say once every six months. This rewrites all data, after which all
sectors ought to have a decently strong signal. Any sector with a
persistent write problem is removed from use automatically by the
drive firmware, but this tends to require a write operation to
trigger it.
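
Concretely, that comes down to something like this (sdX is each
member drive and the mount point is a placeholder; the timeout
setting does not persist across reboots, so put it somewhere that
runs at startup):

# echo 121 > /sys/block/sdX/device/timeout
# btrfs balance start /mnt/yourarray

The first line raises the kernel's command timer for one device;
repeat it per drive. The second starts a full balance, which
rewrites every allocated chunk, refreshing the on-platter signal and
giving the firmware a chance to remap any sector with a persistent
write problem.

-- 
Chris Murphy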
end of thread

Thread overview: 18+ messages
2014-12-21 19:34 Uncorrectable errors on RAID-1? constantine
2014-12-21 21:56 ` Robert White
2014-12-21 22:17 ` Hugo Mills
2014-12-22 0:25 ` Chris Murphy
2014-12-23 21:16 ` Zygo Blaxell
2014-12-23 22:09 ` Chris Murphy
2014-12-23 22:23 ` Chris Murphy
2014-12-28 3:12 ` Phillip Susi
2014-12-29 21:53 ` Chris Murphy
2014-12-30 20:46 ` Phillip Susi
2014-12-30 23:58 ` Chris Murphy
2014-12-31 3:16 ` Phillip Susi
2015-01-03 5:31 ` Chris Murphy
2015-01-05 4:18 ` Phillip Susi
2015-01-05 7:41 ` Chris Murphy
2014-12-31 15:40 ` Austin S Hemmelgarn
[not found] ` <CAJCQCtQYhaDEic5bwd+PEcEfwOqLwAe8cT8VPZ9je+JLRP1GPw@mail.gmail.com>
2014-12-22 14:28 ` constantine
2014-12-22 16:05 ` Chris Murphy