* scrub implies failing drive - smartctl blissfully unaware
From: Brendan Hide @ 2014-11-18 7:29 UTC
To: linux-btrfs@vger.kernel.org

Hey, guys

See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now.

The disk is patently unreliable, but smartctl's output implies there are no issues. Is this somehow standard fare for S.M.A.R.T. output?

Here are (I think) the important bits of the smartctl output for $(smartctl -a /dev/sdb) (the full results are attached):

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 100   253   006    Pre-fail Always  -           0
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           1
  7 Seek_Error_Rate         0x000f 086   060   030    Pre-fail Always  -           440801014
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0000 100   253   000    Old_age  Offline -           0
202 Data_Address_Mark_Errs  0x0032 100   253   000    Old_age  Always  -           0

-------- Original Message --------
Subject: Cron <root@watricky> /usr/local/sbin/btrfs-scrub-all
Date: Tue, 18 Nov 2014 04:19:12 +0200
From: (Cron Daemon) <root@watricky>
To: brendan@watricky

WARNING: errors detected during scrubbing, corrected.
[snip]
scrub device /dev/sdb2 (id 2) done
        scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
        total bytes scrubbed: 189.49GiB with 5420 errors
        error details: read=5 csum=5415
        corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
[snip]

[-- Attachment #2: sdbres[1].txt --]

smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.17.2-1-ARCH] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3250410AS
Serial Number:    6RYF5NP7
Firmware Version: 4.AAA
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:    Tue Nov 18 09:16:03 2014 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity was
                                        completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline data collection: ( 430) seconds.
Offline data collection capabilities:    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine recommended polling time:    (  1) minutes.
Extended self-test routine recommended polling time: ( 64) minutes.
SCT capabilities:              (0x0001) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 100   253   006    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0003 099   097   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           68
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           1
  7 Seek_Error_Rate         0x000f 086   060   030    Pre-fail Always  -           440801057
  9 Power_On_Hours          0x0032 044   044   000    Old_age  Always  -           49106
 10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           89
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
189 High_Fly_Writes         0x003a 098   098   000    Old_age  Always  -           2
190 Airflow_Temperature_Cel 0x0022 060   030   045    Old_age  Always  In_the_past 40 (Min/Max 23/70 #25)
194 Temperature_Celsius     0x0022 040   070   000    Old_age  Always  -           40 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a 069   055   000    Old_age  Always  -           126632051
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0000 100   253   000    Old_age  Offline -           0
202 Data_Address_Mark_Errs  0x0032 100   253   000    Old_age  Always  -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%        37598            -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
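For anyone reproducing this diagnosis: btrfs also keeps per-device error counters, which make it easy to confirm which member of the raid1 is producing the csum errors that scrub reports. A minimal check, with the mount point as a placeholder:

    btrfs device stats /mnt/pool        # cumulative read/write/flush/corruption/generation error counters per device
    btrfs device stats -z /mnt/pool     # reset the counters, e.g. after swapping the cable or drive

The corruption counter for /dev/sdb2 should climb in step with the csum errors shown in the scrub summary above.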
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Roman Mamedov @ 2014-11-18 7:36 UTC
To: Brendan Hide; +Cc: linux-btrfs@vger.kernel.org

On Tue, 18 Nov 2014 09:29:54 +0200
Brendan Hide <brendan@swiftspirit.co.za> wrote:

> Hey, guys
>
> See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now.
>
> The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard faire for S.M.A.R.T. output?

Not necessarily the disk's fault, could be a SATA controller issue. How are your disks connected, which controller brand and chip? Add lspci output, at least if something other than the ordinary "to the motherboard chipset's built-in ports".

-- 
With respect,
Roman
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Brendan Hide @ 2014-11-18 13:24 UTC
To: Roman Mamedov; +Cc: linux-btrfs@vger.kernel.org

On 2014/11/18 09:36, Roman Mamedov wrote:
> On Tue, 18 Nov 2014 09:29:54 +0200
> Brendan Hide <brendan@swiftspirit.co.za> wrote:
>
>> Hey, guys
>>
>> See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now.
>>
>> The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard faire for S.M.A.R.T. output?
>
> Not necessarily the disk's fault, could be a SATA controller issue. How are your disks connected, which controller brand and chip? Add lspci output, at least if something other than the ordinary "to the motherboard chipset's built-in ports".

In this case, yup, it's directly to the motherboard chipset's built-in ports. This is a very old desktop, and the other 3 disks don't have any issues. I'm checking out the alternative pointed out by Austin.

SATA-relevant lspci output:
00:1f.2 SATA controller: Intel Corporation 82801JD/DO (ICH10 Family) SATA AHCI Controller (rev 02)

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Duncan @ 2014-11-18 15:16 UTC
To: linux-btrfs

Brendan Hide posted on Tue, 18 Nov 2014 15:24:48 +0200 as excerpted:

> In this case, yup, its directly to the motherboard chipset's built-in ports. This is a very old desktop, and the other 3 disks don't have any issues. I'm checking out the alternative pointed out by Austin.
>
> SATA-relevant lspci output:
> 00:1f.2 SATA controller: Intel Corporation 82801JD/DO (ICH10 Family) SATA AHCI Controller (rev 02)

I guess your definition of _very_ old desktop, and mine, are _very_ different.

* A quick check of wikipedia says the ICH10 wasn't even /introduced/ until 2008 (the wiki link for the 82801jo/do points to an Intel page, which says it was launched Q3-2008), and it would have been some time after that, likely 2009, that you actually purchased the machine. 2009 is five years ago, middle-aged yes, arguably old, but _very_ old, not so much in this day and age of longer system replace cycles.

* It has SATA, not IDE/PATA.

* It was PCIE 1.1, not PCI-X or PCI and AGP, and DEFINITELY not ISA bus, with or without VLB!

* It has USB 2.0 ports, not USB 1.1, and not only serial/parallel/ps2, and DEFINITELY not an AT keyboard.

* It has Gigabit Ethernet, not simply Fast Ethernet or just Ethernet, and DEFINITELY Ethernet not token-ring.

* It already has Intel Virtualization technology and HD audio instead of AC97 or earlier.

Now I can certainly imagine an "old" desktop having most of these, but you said _very_ old, not simply old, and _very_ old to me would mean PATA/USB-1/AGP/PCI/FastEthernet with AC97 audio or earlier and no virtualization. 64-bit would be questionable as well.

FWIW, I've been playing minitube/youtube C64 music the last few days. Martin Galway, etc. Now C64 really _IS_ _very_ old!

Also FWIW, "only" a couple years ago now (well, about three, time flies!), my old 2003 vintage original 3-digit Opteron based mobo died due to bulging/burst capacitors, after serving me 8 years. I was shooting for a full decade but didn't quite make it...

So indeed, 2009 vintage system, five years, definitely not _very_ old, arguably not even "old", more like middle-aged. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Austin S Hemmelgarn @ 2014-11-18 12:08 UTC
To: Brendan Hide, linux-btrfs@vger.kernel.org

On 2014-11-18 02:29, Brendan Hide wrote:
> Hey, guys
>
> See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now.
>
> The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard faire for S.M.A.R.T. output?
>
> Here are (I think) the important bits of the smartctl output for $(smartctl -a /dev/sdb) (the full results are attached):
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f 100 253 006 Pre-fail Always  - 0
>   5 Reallocated_Sector_Ct   0x0033 100 100 036 Pre-fail Always  - 1
>   7 Seek_Error_Rate         0x000f 086 060 030 Pre-fail Always  - 440801014
> 197 Current_Pending_Sector  0x0012 100 100 000 Old_age  Always  - 0
> 198 Offline_Uncorrectable   0x0010 100 100 000 Old_age  Offline - 0
> 199 UDMA_CRC_Error_Count    0x003e 200 200 000 Old_age  Always  - 0
> 200 Multi_Zone_Error_Rate   0x0000 100 253 000 Old_age  Offline - 0
> 202 Data_Address_Mark_Errs  0x0032 100 253 000 Old_age  Always  - 0
>
> -------- Original Message --------
> Subject: Cron <root@watricky> /usr/local/sbin/btrfs-scrub-all
> Date: Tue, 18 Nov 2014 04:19:12 +0200
> From: (Cron Daemon) <root@watricky>
> To: brendan@watricky
>
> WARNING: errors detected during scrubbing, corrected.
> [snip]
> scrub device /dev/sdb2 (id 2) done
>         scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
>         total bytes scrubbed: 189.49GiB with 5420 errors
>         error details: read=5 csum=5415
>         corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
> [snip]

In addition to the storage controller being a possibility as mentioned in another reply, there are some parts of the drive that aren't covered by SMART attributes on most disks, most notably the on-drive cache. There really isn't a way to disable the read cache on the drive, but you can disable write-caching, which may improve things (and if it's a cheap disk, may provide better reliability for BTRFS as well).

The other thing I would suggest trying is a different data cable to the drive itself; I've had issues with some SATA cables (the cheap red ones you get in the retail packaging for some hard disks in particular) having either bad connectors or bad strain-reliefs, and failing after only a few hundred hours of use.
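A sketch of the write-cache toggle described above, using hdparm; the device name is a placeholder, and note that disabling the volatile write cache costs some write performance and on many drives does not persist across a power cycle:

    hdparm -W  /dev/sdb    # query the current write-cache setting
    hdparm -W0 /dev/sdb    # disable the drive's volatile write cache
    hdparm -W1 /dev/sdb    # re-enable it later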
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Brendan Hide @ 2014-11-18 13:25 UTC
To: Austin S Hemmelgarn; +Cc: linux-btrfs@vger.kernel.org

On 2014/11/18 14:08, Austin S Hemmelgarn wrote:
> [snip] there are some parts of the drive that aren't covered by SMART attributes on most disks, most notably the on-drive cache. There really isn't a way to disable the read cache on the drive, but you can disable write-caching.

It's an old and replaceable disk - but if the cable replacement doesn't work I'll try this for kicks. :)

> The other thing I would suggest trying is a different data cable to the drive itself, I've had issues with some SATA cables (the cheap red ones you get in the retail packaging for some hard disks in particular) having either bad connectors, or bad strain-reliefs, and failing after only a few hundred hours of use.

Thanks. I'll try this first. :)

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Phillip Susi @ 2014-11-18 16:02 UTC
To: Austin S Hemmelgarn, Brendan Hide, linux-btrfs@vger.kernel.org

On 11/18/2014 7:08 AM, Austin S Hemmelgarn wrote:
> In addition to the storage controller being a possibility as mentioned in another reply, there are some parts of the drive that aren't covered by SMART attributes on most disks, most notably the on-drive cache. There really isn't a way to disable the read cache on the drive, but you can disable write-caching, which may improve things (and if it's a cheap disk, may provide better reliability for BTRFS as well). The other thing I would suggest trying is a different data cable to the drive itself, I've had issues with some SATA cables (the cheap red ones you get in the retail packaging for some hard disks in particular) having either bad connectors, or bad strain-reliefs, and failing after only a few hundred hours of use.

SATA does CRC the data going across it, so if it is a bad cable you get CRC, or often times 8b10b coding errors, and the transfer is aborted rather than returning bad data.
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Marc MERLIN @ 2014-11-18 15:35 UTC
To: Brendan Hide; +Cc: linux-btrfs@vger.kernel.org

On Tue, Nov 18, 2014 at 09:29:54AM +0200, Brendan Hide wrote:
> Hey, guys
>
> See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now.
>
> The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard faire for S.M.A.R.T. output?

Try running hdrecover on your drive, it'll scan all your blocks and try to rewrite the ones that are failing, if any:
http://hdrecover.sourceforge.net/

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
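If hdrecover isn't packaged for your distro, a plain read-only surface scan gives similar information about whether the drive itself ever reports a read error (non-destructive as shown; the device name is a placeholder):

    badblocks -sv /dev/sdb    # read-only scan of the whole device; prints the block numbers that fail to read

As the following replies point out, though, a clean pass here would only confirm that the corruption is happening somewhere other than the platters.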
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Phillip Susi @ 2014-11-18 16:04 UTC
To: Marc MERLIN, Brendan Hide; +Cc: linux-btrfs@vger.kernel.org

On 11/18/2014 10:35 AM, Marc MERLIN wrote:
> Try running hdrecover on your drive, it'll scan all your blocks and try to rewrite the ones that are failing, if any:
> http://hdrecover.sourceforge.net/

He doesn't have blocks that are failing; he has blocks that are being silently corrupted.
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Marc MERLIN @ 2014-11-18 16:11 UTC
To: Phillip Susi; +Cc: Brendan Hide, linux-btrfs@vger.kernel.org

On Tue, Nov 18, 2014 at 11:04:00AM -0500, Phillip Susi wrote:
> On 11/18/2014 10:35 AM, Marc MERLIN wrote:
>> Try running hdrecover on your drive, it'll scan all your blocks and try to rewrite the ones that are failing, if any:
>> http://hdrecover.sourceforge.net/
>
> He doesn't have blocks that are failing; he has blocks that are being silently corrupted.

That seems to be the case, but hdrecover will rule that part out at least.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Phillip Susi @ 2014-11-18 16:26 UTC
To: Marc MERLIN; +Cc: Brendan Hide, linux-btrfs@vger.kernel.org

On 11/18/2014 11:11 AM, Marc MERLIN wrote:
> That seems to be the case, but hdrecover will rule that part out at least.

It's already ruled out: if the read failed, that is what the error message would have said rather than a bad checksum.
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Chris Murphy @ 2014-11-18 18:57 UTC
To: Btrfs BTRFS

On Nov 18, 2014, at 8:35 AM, Marc MERLIN <marc@merlins.org> wrote:
> On Tue, Nov 18, 2014 at 09:29:54AM +0200, Brendan Hide wrote:
>> Hey, guys
>>
>> See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now.
>>
>> The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard faire for S.M.A.R.T. output?
>
> Try running hdrecover on your drive, it'll scan all your blocks and try to rewrite the ones that are failing, if any:
> http://hdrecover.sourceforge.net/

The only way it can know if there is a bad sector is if the drive returns a read error, which will include the LBA for the affected sector(s). This is the same thing that would be done with scrub, except any bad sectors that don’t contain data.

A common problem getting a drive to issue the read error, however, is a mismatch between the scsi command timer setting (default 30 seconds) and the SCT error recovery control setting for the drive. The drive SCT ERC value needs to be shorter than the scsi command timer value, otherwise some bad sector errors will cause the drive to go into a longer recovery attempt beyond the scsi command timer value. If that happens, the ata link is reset, and there’s no possibility of finding out what the affected sector is.

So a.) use smartctl -l scterc to change the value below 30 seconds (300 deciseconds), with 70 deciseconds being reasonable. If the drive doesn’t support SCT commands, then b.) change the linux scsi command timer to be greater than 120 seconds.

Strictly speaking the command timer would be set to a value that ensures there are no link reset messages in dmesg, one that’s long enough that the drive itself times out and actually reports a read error. This could be much shorter than 120 seconds. I don’t know if there are any consumer drives that try longer than 2 minutes to recover data from a marginally bad sector.

Ideally though, don’t use drives that lack SCT support in multiple device volume configurations. An up to 2 minute hang of the storage stack isn’t production compatible for most workflows.

Chris Murphy
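The two knobs described above can be inspected and set roughly like this (sdX is a placeholder; scterc values are in deciseconds, and on most drives the setting does not survive a power cycle, so it has to be reapplied at boot):

    smartctl -l scterc /dev/sdX                 # query current SCT ERC settings; fails if the drive lacks SCT ERC support
    smartctl -l scterc,70,70 /dev/sdX           # cap read/write error recovery at 7 seconds
    cat /sys/block/sdX/device/timeout           # kernel's SCSI command timer for this device, in seconds (default 30)
    echo 180 > /sys/block/sdX/device/timeout    # for drives without SCT ERC, raise the kernel timeout instead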
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Phillip Susi @ 2014-11-18 20:58 UTC
To: Chris Murphy, Btrfs BTRFS

On 11/18/2014 1:57 PM, Chris Murphy wrote:
> So a.) use smartctl -l scterc to change the value below 30 seconds (300 deciseconds) with 70 deciseconds being reasonable. If the drive doesn’t support SCT commands, then b.) change the linux scsi command timer to be greater than 120 seconds.
>
> Strictly speaking the command timer would be set to a value that ensures there are no link reset messages in dmesg, that it’s long enough that the drive itself times out and actually reports a read error. This could be much shorter than 120 seconds. I don’t know if there are any consumer drives that try longer than 2 minutes to recover data from a marginally bad sector.

Are there really any that take longer than 30 seconds? That's enough time for thousands of retries. If it can't be read after a dozen tries, it ain't never gonna work. It seems absurd that a drive would keep trying for so long.

> Ideally though, don’t use drives that lack SCT support in multiple device volume configurations. An up to 2 minute hang of the storage stack isn’t production compatible for most workflows.

Wasn't there an early failure flag that md (and therefore btrfs when doing raid) sets so the scsi stack doesn't bother with recovery attempts and just fails the request? Thus if the drive takes longer than the scsi_timeout, the failure would be reported to btrfs, which then can recover using the other copy, write it back to the bad drive, and hopefully that fixes it?

In that case, you probably want to lower the timeout so that the recovery kicks in sooner instead of hanging your IO stack for 30 seconds.
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Chris Murphy @ 2014-11-19 2:40 UTC
To: Phillip Susi; +Cc: Btrfs BTRFS

On Nov 18, 2014, at 1:58 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 11/18/2014 1:57 PM, Chris Murphy wrote:
>> So a.) use smartctl -l scterc to change the value below 30 seconds (300 deciseconds) with 70 deciseconds being reasonable. If the drive doesn’t support SCT commands, then b.) change the linux scsi command timer to be greater than 120 seconds.
>>
>> Strictly speaking the command timer would be set to a value that ensures there are no link reset messages in dmesg, that it’s long enough that the drive itself times out and actually reports a read error. This could be much shorter than 120 seconds. I don’t know if there are any consumer drives that try longer than 2 minutes to recover data from a marginally bad sector.
>
> Are there really any that take longer than 30 seconds? That's enough time for thousands of retries. If it can't be read after a dozen tries, it ain't never gonna work. It seems absurd that a drive would keep trying for so long.

It’s well known on linux-raid@ that consumer drives have well over 30 second "deep recoveries" when they lack SCT command support. The WDC and Seagate “green” drives are over 2 minutes apparently. This isn’t easy to test because it requires a sector with enough error that it requires the ECC to do something, and yet not so much error that it gives up in less than 30 seconds. So you have to track down a drive model spec document (one of those 100 pagers).

This makes sense, sorta, because the manufacturer use case is typically single drive only, and most proscribe raid5/6 with such products. So it’s a “recover data at all costs” behavior, because it’s assumed to be the only (immediately) available copy.

>> Ideally though, don’t use drives that lack SCT support in multiple device volume configurations. An up to 2 minute hang of the storage stack isn’t production compatible for most workflows.
>
> Wasn't there an early failure flag that md ( and therefore, btrfs when doing raid ) sets so the scsi stack doesn't bother with recovery attempts and just fails the request? Thus if the drive takes longer than the scsi_timeout, the failure would be reported to btrfs, which then can recover using the other copy, write it back to the bad drive, and hopefully that fixes it?

I don’t see how that’s possible, because anything other than the drive explicitly producing a read error (which includes the affected LBAs) is ambiguous as far as the kernel is concerned. It has no way of knowing which of possibly dozens of ata commands queued up in the drive have actually hung up the drive. It has no idea why the drive is hung up as well.

The linux-raid@ list is chock full of users having these kinds of problems. It comes up pretty much every week. Someone has an e.g. raid5, and in dmesg all they get are a bunch of ata bus reset messages. So someone tells them to change the scsi command timer for all the block devices that are members of the array in question, and retry (reading a file, or scrub, or whatever), and lo and behold, no more ata bus reset messages. Instead they get explicit read errors with LBAs, and now md can fix the problem.

> In that case, you probably want to lower the timeout so that the recover kicks in sooner instead of hanging your IO stack for 30 seconds.

No, I think 30 is pretty sane for servers using SATA drives, because if the bus is reset all pending commands in the queue get obliterated, which is worse than just waiting up to 30 seconds. With SAS drives maybe less time makes sense. But in either case you still need configurable SCT ERC, or it needs to be a sane fixed default like 70 deciseconds.

Chris Murphy
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Phillip Susi @ 2014-11-19 15:11 UTC
To: Chris Murphy; +Cc: Btrfs BTRFS

On 11/18/2014 9:40 PM, Chris Murphy wrote:
> It’s well known on linux-raid@ that consumer drives have well over 30 second "deep recoveries" when they lack SCT command support. The WDC and Seagate “green” drives are over 2 minutes apparently. This isn’t easy to test because it requires a sector with enough error that it requires the ECC to do something, and yet not so much error that it gives up in less than 30 seconds. So you have to track down a drive model spec document (one of those 100 pagers).
>
> This makes sense, sorta, because the manufacturer use case is typically single drive only, and most proscribe raid5/6 with such products. So it’s a “recover data at all costs” behavior because it’s assumed to be the only (immediately) available copy.

It doesn't make sense to me. If it can't recover the data after one or two hundred retries in one or two seconds, it can keep trying until the cows come home and it just isn't ever going to work.

> I don’t see how that’s possible because anything other than the drive explicitly producing a read error (which includes the affected LBA’s), it’s ambiguous what the actual problem is as far as the kernel is concerned. It has no way of knowing which of possibly dozens of ata commands queued up in the drive have actually hung up the drive. It has no idea why the drive is hung up as well.

IIRC, this is true when the drive returns failure as well. The whole bio is marked as failed, and the page cache layer then begins retrying with progressively smaller requests to see if it can get *some* data out.

> No I think 30 is pretty sane for servers using SATA drives because if the bus is reset all pending commands in the queue get obliterated which is worse than just waiting up to 30 seconds. With SAS drives maybe less time makes sense. But in either case you still need configurable SCT ERC, or it needs to be a sane fixed default like 70 deciseconds.

Who cares if multiple commands in the queue are obliterated if they can all be retried on the other mirror? Better to fall back to the other mirror NOW instead of waiting 30 seconds (or longer!). Sure, you might end up recovering more than you really had to, but that won't hurt anything.
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Chris Murphy @ 2014-11-20 0:05 UTC
To: Btrfs BTRFS

On Wed, Nov 19, 2014 at 8:11 AM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 11/18/2014 9:40 PM, Chris Murphy wrote:
>> It’s well known on linux-raid@ that consumer drives have well over 30 second "deep recoveries" when they lack SCT command support. The WDC and Seagate “green” drives are over 2 minutes apparently. This isn’t easy to test because it requires a sector with enough error that it requires the ECC to do something, and yet not so much error that it gives up in less than 30 seconds. So you have to track down a drive model spec document (one of those 100 pagers).
>>
>> This makes sense, sorta, because the manufacturer use case is typically single drive only, and most proscribe raid5/6 with such products. So it’s a “recover data at all costs” behavior because it’s assumed to be the only (immediately) available copy.
>
> It doesn't make sense to me. If it can't recover the data after one or two hundred retries in one or two seconds, it can keep trying until the cows come home and it just isn't ever going to work.

I'm not a hard drive engineer, so I can't argue either point. But consumer drives clearly do behave this way. On Linux, the kernel's default 30 second command timer eventually results in what look like link errors rather than drive read errors. And instead of the problems being fixed with the normal md and btrfs recovery mechanisms, the errors simply get worse and eventually there's data loss. Exhibits A, B, C, D - the linux-raid list is full to the brim of such reports and their solution.

>> I don’t see how that’s possible because anything other than the drive explicitly producing a read error (which includes the affected LBA’s), it’s ambiguous what the actual problem is as far as the kernel is concerned. It has no way of knowing which of possibly dozens of ata commands queued up in the drive have actually hung up the drive. It has no idea why the drive is hung up as well.
>
> IIRC, this is true when the drive returns failure as well. The whole bio is marked as failed, and the page cache layer then begins retrying with progressively smaller requests to see if it can get *some* data out.

Well that's very coarse. It's not at a sector level, so as long as the drive continues to try to read from a particular LBA, but fails to either succeed reading or give up and report a read error within 30 seconds, then you just get a bunch of wonky system behavior.

Conversely, what I've observed on Windows in such a case is that it tolerates these deep recoveries on consumer drives. So they just get really slow, but the drive does seem to eventually recover (until it doesn't). But yeah, 2 minutes is a long time. So then the user gets annoyed and reinstalls their system. Since that means writing to the affected drive, the firmware logic causes bad sectors to be dereferenced when the write error is persistent. Problem solved, faster system.

>> No I think 30 is pretty sane for servers using SATA drives because if the bus is reset all pending commands in the queue get obliterated which is worse than just waiting up to 30 seconds. With SAS drives maybe less time makes sense. But in either case you still need configurable SCT ERC, or it needs to be a sane fixed default like 70 deciseconds.
>
> Who cares if multiple commands in the queue are obliterated if they can all be retried on the other mirror?

Because now you have a member drive that's inconsistent. At least in the md raid case, a certain number of read failures causes the drive to be ejected from the array. Anytime there's a write failure, it's ejected from the array too. What you want is for the drive to give up sooner with an explicit read error, so md can help fix the problem by writing good data to the affected LBA. That doesn't happen when there are a bunch of link resets happening.

> Better to fall back to the other mirror NOW instead of waiting 30 seconds ( or longer! ). Sure, you might end up recovering more than you really had to, but that won't hurt anything.

Again, if your drive SCT ERC is configurable and set to something sane like 70 deciseconds, that read failure happens at MOST 7 seconds after the read attempt. And md is notified of *exactly* what sectors are affected; it immediately goes to mirror data, or rebuilds it from parity, and then writes the correct data to the previously reported bad sectors. And that will fix the problem.

So really, if you're going to play the multiple device game, you need drive error timing to be shorter than the kernel's.

Chris Murphy
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Phillip Susi @ 2014-11-25 21:34 UTC
To: Chris Murphy, Btrfs BTRFS

On 11/19/2014 7:05 PM, Chris Murphy wrote:
> I'm not a hard drive engineer, so I can't argue either point. But consumer drives clearly do behave this way. On Linux, the kernel's default 30 second command timer eventually results in what look like link errors rather than drive read errors. And instead of the problems being fixed with the normal md and btrfs recovery mechanisms, the errors simply get worse and eventually there's data loss. Exhibits A, B, C, D - the linux-raid list is full to the brim of such reports and their solution.

I have seen plenty of error logs of people with drives that do properly give up and return an error instead of timing out, so I get the feeling that most drives are properly behaved. Is there a particular make/model of drive that is known to exhibit this silly behavior?

>> IIRC, this is true when the drive returns failure as well. The whole bio is marked as failed, and the page cache layer then begins retrying with progressively smaller requests to see if it can get *some* data out.
>
> Well that's very course. It's not at a sector level, so as long as the drive continues to try to read from a particular LBA, but fails to either succeed reading or give up and report a read error, within 30 seconds, then you just get a bunch of wonky system behavior.

I don't understand this response at all. The drive isn't going to keep trying to read the same bad lba; after the kernel times out, it resets the drive, and tries reading different smaller parts to see which it can read and which it can't.

> Conversely what I've observed on Windows in such a case, is it tolerates these deep recoveries on consumer drives. So they just get really slow but the drive does seem to eventually recover (until it doesn't). But yeah 2 minutes is a long time. So then the user gets annoyed and reinstalls their system. Since that means writing to the affected drive, the firmware logic causes bad sectors to be dereferenced when the write error is persistent. Problem solved, faster system.

That seems like rather unsubstantiated guesswork. I.e. the 2 minute+ delays are likely not on an individual request, but from several requests that each go into deep recovery, possibly because windows is retrying the same sector or a few consecutive sectors are bad.

> Because now you have a member drive that's inconsistent. At least in the md raid case, a certain number of read failures causes the drive to be ejected from the array. Anytime there's a write failure, it's ejected from the array too. What you want is for the drive to give up sooner with an explicit read error, so md can help fix the problem by writing good data to the effected LBA. That doesn't happen when there are a bunch of link resets happening.

What? It is no different than when it does return an error, with the exception that the error is incorrectly applied to the entire request instead of just the affected sector.

> Again, if your drive SCT ERC is configurable, and set to something sane like 70 deciseconds, that read failure happens at MOST 7 seconds after the read attempt. And md is notified of *exactly* what sectors are affected, it immediately goes to mirror data, or rebuilds it from parity, and then writes the correct data to the previously reported bad sectors. And that will fix the problem.

Yes... I'm talking about when the drive doesn't support that.
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Chris Murphy @ 2014-11-25 23:13 UTC
To: Btrfs BTRFS

On Tue, Nov 25, 2014 at 2:34 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> I have seen plenty of error logs of people with drives that do properly give up and return an error instead of timing out so I get the feeling that most drives are properly behaved. Is there a particular make/model of drive that is known to exhibit this silly behavior?

The drive will only issue a read error when its ECC absolutely cannot recover the data, hard fail.

A few years ago companies including Western Digital started shipping large cheap drives, think of the "green" drives. These had very high TLER (Time Limited Error Recovery) settings, a.k.a. SCT ERC. Later they completely took out the ability to configure this error recovery timing, so you only get the upward of 2 minutes to actually get a read error reported by the drive. Presumably if the ECC determines it's a hard fail and there's no point in reading the same sector 14000 times, it would issue a read error much sooner. But again, the linux-raid list is full of cases where this doesn't happen, and merely by changing the linux SCSI command timer from 30 to 121 seconds, now the drive reports an explicit read error with LBA information included, and now md can correct the problem.

>>> IIRC, this is true when the drive returns failure as well. The whole bio is marked as failed, and the page cache layer then begins retrying with progressively smaller requests to see if it can get *some* data out.
>>
>> Well that's very course. It's not at a sector level, so as long as the drive continues to try to read from a particular LBA, but fails to either succeed reading or give up and report a read error, within 30 seconds, then you just get a bunch of wonky system behavior.
>
> I don't understand this response at all. The drive isn't going to keep trying to read the same bad lba; after the kernel times out, it resets the drive, and tries reading different smaller parts to see which it can read and which it can't.

That's my whole point. When the link is reset, no read error is submitted by the drive, so the md driver has no idea what the drive's problem was, no idea that it's a read problem, no idea what LBA is affected, and thus no way of writing over the affected bad sector. If the SCSI command timer is raised well above 30 seconds, this problem is resolved. Also, replacing the drive with one that definitively errors out (or can be configured with smartctl -l scterc) before 30 seconds is another option.

>> Conversely what I've observed on Windows in such a case, is it tolerates these deep recoveries on consumer drives. So they just get really slow but the drive does seem to eventually recover (until it doesn't). But yeah 2 minutes is a long time. So then the user gets annoyed and reinstalls their system. Since that means writing to the affected drive, the firmware logic causes bad sectors to be dereferenced when the write error is persistent. Problem solved, faster system.
>
> That seems like rather unsubstantiated guesswork. i.e. the 2 minute+ delays are likely not on an individual request, but from several requests that each go into deep recovery, possibly because windows is retrying the same sector or a few consecutive sectors are bad.

It doesn't really matter; clearly its timeout for drive commands is much higher than the linux default of 30 seconds.

>> Because now you have a member drive that's inconsistent. At least in the md raid case, a certain number of read failures causes the drive to be ejected from the array. Anytime there's a write failure, it's ejected from the array too. What you want is for the drive to give up sooner with an explicit read error, so md can help fix the problem by writing good data to the effected LBA. That doesn't happen when there are a bunch of link resets happening.
>
> What? It is no different than when it does return an error, with the exception that the error is incorrectly applied to the entire request instead of just the affected sector.

OK, that doesn't actually happen, and it would be completely f'n wrong behavior if it were happening. All the kernel knows is the command timer has expired; it doesn't know why the drive isn't responding. It doesn't know there are uncorrectable sector errors causing the problem. To just assume link resets are the same thing as bad sectors and to just wholesale start writing possibly a metric shit ton of data when you don't know what the problem is would be asinine. It might even be sabotage. Jesus...

>> Again, if your drive SCT ERC is configurable, and set to something sane like 70 deciseconds, that read failure happens at MOST 7 seconds after the read attempt. And md is notified of *exactly* what sectors are affected, it immediately goes to mirror data, or rebuilds it from parity, and then writes the correct data to the previously reported bad sectors. And that will fix the problem.
>
> Yes... I'm talking about when the drive doesn't support that.

Then there is one option, which is to increase the value of the SCSI command timer. And that applies to all raid: md, lvm, btrfs, and hardware.

-- 
Chris Murphy
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Rich Freeman @ 2014-11-26 1:53 UTC
To: Chris Murphy; +Cc: Btrfs BTRFS

On Tue, Nov 25, 2014 at 6:13 PM, Chris Murphy <lists@colorremedies.com> wrote:
> A few years ago companies including Western Digital started shipping large cheap drives, think of the "green" drives. These had very high TLER (Time Limited Error Recovery) settings, a.k.a. SCT ERC. Later they completely took out the ability to configure this error recovery timing so you only get the upward of 2 minutes to actually get a read error reported by the drive.

Why sell an $80 hard drive when you can change a few bytes in the firmware and sell a crippled $80 drive and an otherwise-identical non-crippled $130 drive?

-- 
Rich
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Phillip Susi @ 2014-12-01 19:10 UTC
To: Chris Murphy, Btrfs BTRFS

On 11/25/2014 6:13 PM, Chris Murphy wrote:
> The drive will only issue a read error when its ECC absolutely cannot recover the data, hard fail.
>
> A few years ago companies including Western Digital started shipping large cheap drives, think of the "green" drives. These had very high TLER (Time Limited Error Recovery) settings, a.k.a. SCT ERC. Later they completely took out the ability to configure this error recovery timing so you only get the upward of 2 minutes to actually get a read error reported by the drive. Presumably if the ECC determines it's a hard fail and no point in reading the same sector 14000 times, it would issue a read error much sooner. But again, the linux-raid list if full of cases where this doesn't happen, and merely by changing the linux SCSI command timer from 30 to 121 seconds, now the drive reports an explicit read error with LBA information included, and now md can correct the problem.

I have one of those and took it out of service when it started reporting read errors (not timeouts). I tried several times to write over the bad sectors to force reallocation and it worked again for a while... then the bad sectors kept coming back. Oddly, the SMART values never indicated anything had been reallocated.

> That's my whole point. When the link is reset, no read error is submitted by the drive, the md driver has no idea what the drive's problem was, no idea that it's a read problem, no idea what LBA is affected, and thus no way of writing over the affected bad sector. If the SCSI command timer is raised well above 30 seconds, this problem is resolved. Also replacing the drive with one that definitively errors out (or can be configured with smartctl -l scterc) before 30 seconds is another option.

It doesn't know why or exactly where, but it does know *something* went wrong.

> It doesn't really matter, clearly its time out for drive commands is much higher than the linux default of 30 seconds.

Only if you are running linux and can see the timeouts. You can't assume that's what is going on under windows just because the desktop stutters.

> OK that doesn't actually happen and it would be completely f'n wrong behavior if it were happening. All the kernel knows is the command timer has expired, it doesn't know why the drive isn't responding. It doesn't know there are uncorrectable sector errors causing the problem. To just assume link resets are the same thing as bad sectors and to just wholesale start writing possibly a metric shit ton of data when you don't know what the problem is would be asinine. It might even be sabotage. Jesus...

In normal single disk operation, sure: the kernel resets the drive and retries the request. But like I said before, I could have sworn there was an early failure flag that md uses to tell the lower layers NOT to attempt that kind of normal recovery, and instead just to return the failure right away so md can just go grab the data from the drive that isn't wigging out. That prevents the system from stalling on paging IO while the drive plays around with its deep recovery, and copying back 512k to the drive with the one bad sector isn't really that big of a deal.

> Then there is one option which is to increase the value of the SCSI command timer. And that applies to all raid: md, lvm, btrfs, and hardware.

And then you get stupid hanging when you could just get the data from the other drive immediately.
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Patrik Lundquist @ 2014-11-28 15:02 UTC
To: Phillip Susi; +Cc: Chris Murphy, Btrfs BTRFS

On 25 November 2014 at 22:34, Phillip Susi <psusi@ubuntu.com> wrote:
> On 11/19/2014 7:05 PM, Chris Murphy wrote:
>> I'm not a hard drive engineer, so I can't argue either point. But consumer drives clearly do behave this way. On Linux, the kernel's default 30 second command timer eventually results in what look like link errors rather than drive read errors. And instead of the problems being fixed with the normal md and btrfs recovery mechanisms, the errors simply get worse and eventually there's data loss. Exhibits A, B, C, D - the linux-raid list is full to the brim of such reports and their solution.
>
> I have seen plenty of error logs of people with drives that do properly give up and return an error instead of timing out so I get the feeling that most drives are properly behaved. Is there a particular make/model of drive that is known to exhibit this silly behavior?

I had a couple of Seagate Barracuda 7200.11 (codename Moose) drives with seriously retarded firmware. They never reported a read error AFAIK but began to time out instead. They wouldn't even respond after a link reset. I had to power cycle the disks.

Funny days with ddrescue. Got almost everything off them.
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-18 20:58 ` Phillip Susi 2014-11-19 2:40 ` Chris Murphy @ 2014-11-19 2:46 ` Duncan 2014-11-19 16:07 ` Phillip Susi 1 sibling, 1 reply; 49+ messages in thread From: Duncan @ 2014-11-19 2:46 UTC (permalink / raw) To: linux-btrfs Phillip Susi posted on Tue, 18 Nov 2014 15:58:18 -0500 as excerpted: > Are there really any that take longer than 30 seconds? That's enough > time for thousands of retries. If it can't be read after a dozen tries, > it ain't never gonna work. It seems absurd that a drive would keep > trying for so long. I'm not sure about normal operation, but certainly, many drives take longer than 30 seconds to stabilize after power-on, and I routinely see resets during this time. In fact, as I recently posted, power-up stabilization time can and often does kill reliable multi-drive device or filesystem (my experience is with mdraid and btrfs raid) resume from suspend to RAM or hibernate to disk, either one or both, because it's often enough the case that one device or another will take enough longer to stabilize than the other, that it'll be failed out of the raid. This doesn't happen on single-hardware-device block devices and filesystems because in that case it's either up or down, if the device doesn't come up in time the resume simply fails entirely, instead of coming up with one or more devices there, but others missing as they didn't stabilize in time, as is unfortunately all too common in the multi- device scenario. I've seen this with both spinning rust and with SSDs, with mdraid and btrfs, with multiple mobos and device controllers, and with resume both from suspend to ram (if the machine powers down the storage devices in that case, as most modern ones do) and hibernate to permanent storage device, over several years worth of kernel series, so it's a reasonably widespread phenomena, at least among consumer-level SATA devices. (My experience doesn't extend to enterprise-raid-level devices or proper SCSI, etc, so I simply don't know, there.) While two minutes is getting a bit long, I think it's still within normal range, and some devices definitely take over a minute enough of the time to be both noticeable and irritating. That said, I SHOULD say I'd be far *MORE* irritated if the device simply pretended it was stable and started reading/writing data before it really had stabilized, particularly with SSDs where that sort of behavior has been observed and is known to put some devices at risk of complete scrambling of either media or firmware, beyond recovery at times. That of course is the risk of going the other direction, and I'd a WHOLE lot rather have devices play it safe for another 30 seconds or so after they / think/ they're stable and be SURE, than pretend to be just fine when voltages have NOT stabilized yet and thus end up scrambling things irrecoverably. I've never had that happen here tho I've never stress- tested for it, only done normal operation, but I've seen testing reports where the testers DID make it happen surprisingly easily, to a surprising number of their test devices. So, umm... I suspect the 2-minute default is 2 minutes due to power-up stabilizing issues, where two minutes is a reasonable compromise between failing the boot most of the time if the timeout is too low, and taking excessively long for very little further gain. 
And in my experience, the only way around that, at the consumer level at least, would be to split the timeouts, perhaps setting something even higher, 2.5-3 minutes on power-on, while lowering the operational timeout to something more sane for operation, probably 30 seconds or so by default, but easily tunable down to 10-20 seconds (or even lower, 5 seconds, even for consumer level devices?) for those who had hardware that fit within that tolerance and wanted the performance. But at least to my knowledge, there's no such split in reset timeout values available (maybe for SCSI?), and due to auto-spindown and power-saving, I'm not sure whether it's even possible, without some specific hardware feature available to tell the kernel that it has in fact NOT been in power-saving mode for say 5-10 minutes, hopefully long enough that voltage readings really /are/ fully stabilized and a shorter timeout is possible. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 49+ messages in thread
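As Duncan says, there is no split power-on vs. operational reset timeout available, but the two timers that do exist can at least be inspected and tuned separately, which covers much of what such a split would buy. A rough sketch, assuming a drive at /dev/sda; whether the drive accepts an SCT ERC setting at all depends on the model:

  # kernel side: seconds the SCSI/libata layer waits on a command before resetting the link
  cat /sys/block/sda/device/timeout
  echo 180 > /sys/block/sda/device/timeout   # give a slow-to-come-ready device more rope

  # drive side: how long the disk itself retries a bad sector (in deciseconds), if SCT ERC is supported
  smartctl -l scterc /dev/sda
  smartctl -l scterc,70,70 /dev/sda          # request 7-second read/write error recovery

Consumer drives that refuse the scterc command are the usual reason people end up raising the kernel-side timer instead.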
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 2:46 ` Duncan @ 2014-11-19 16:07 ` Phillip Susi 2014-11-19 21:05 ` Robert White 2014-11-19 23:59 ` Duncan 0 siblings, 2 replies; 49+ messages in thread From: Phillip Susi @ 2014-11-19 16:07 UTC (permalink / raw) To: Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/18/2014 9:46 PM, Duncan wrote: > I'm not sure about normal operation, but certainly, many drives > take longer than 30 seconds to stabilize after power-on, and I > routinely see resets during this time. As far as I have seen, typical drive spin up time is on the order of 3-7 seconds. Hell, I remember my pair of first generation seagate cheetah 15,000 rpm drives seemed to take *forever* to spin up and that still was maybe only 15 seconds. If a drive takes longer than 30 seconds, then there is something wrong with it. I figure there is a reason why spin up time is tracked by SMART so it seems like long spin up time is a sign of a sick drive. > This doesn't happen on single-hardware-device block devices and > filesystems because in that case it's either up or down, if the > device doesn't come up in time the resume simply fails entirely, > instead of coming up with one or more devices there, but others > missing as they didn't stabilize in time, as is unfortunately all > too common in the multi- device scenario. No, the resume doesn't "fail entirely". The drive is reset, and the IO request is retried, and by then it should succeed. > I've seen this with both spinning rust and with SSDs, with mdraid > and btrfs, with multiple mobos and device controllers, and with > resume both from suspend to ram (if the machine powers down the > storage devices in that case, as most modern ones do) and hibernate > to permanent storage device, over several years worth of kernel > series, so it's a reasonably widespread phenomena, at least among > consumer-level SATA devices. (My experience doesn't extend to > enterprise-raid-level devices or proper SCSI, etc, so I simply > don't know, there.) If you are restoring from hibernation, then the drives are already spun up before the kernel is loaded. > While two minutes is getting a bit long, I think it's still within > normal range, and some devices definitely take over a minute enough > of the time to be both noticeable and irritating. It certainly is not normal for a drive to take that long to spin up. IIRC, the 30 second timeout comes from the ATA specs which state that it can take up to 30 seconds for a drive to spin up. > That said, I SHOULD say I'd be far *MORE* irritated if the device > simply pretended it was stable and started reading/writing data > before it really had stabilized, particularly with SSDs where that > sort of behavior has been observed and is known to put some devices > at risk of complete scrambling of either media or firmware, beyond > recovery at times. That of course is the risk of going the other > direction, and I'd a WHOLE lot rather have devices play it safe for > another 30 seconds or so after they / think/ they're stable and be > SURE, than pretend to be just fine when voltages have NOT > stabilized yet and thus end up scrambling things irrecoverably. > I've never had that happen here tho I've never stress- tested for > it, only done normal operation, but I've seen testing reports where > the testers DID make it happen surprisingly easily, to a surprising > number of their test devices. Power supply voltage is stable within milliseconds. 
What takes HDDs time to start up is mechanically bringing the spinning rust up to speed. On SSDs, I think you are confusing testing done on power *cycling* ( i.e. yanking the power cord in the middle of a write ) with startup. > So, umm... I suspect the 2-minute default is 2 minutes due to > power-up stabilizing issues, where two minutes is a reasonable > compromise between failing the boot most of the time if the timeout > is too low, and taking excessively long for very little further > gain. The default is 30 seconds, not 2 minutes. > sure whether it's even possible, without some specific hardware > feature available to tell the kernel that it has in fact NOT been > in power-saving mode for say 5-10 minutes, hopefully long enough > that voltage readings really /are/ fully stabilized and a shorter > timeout is possible. Again, there is no several minute period where voltage stabilizes and the drive takes longer to access. This is a complete red herring. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) iQEcBAEBAgAGBQJUbMBPAAoJEI5FoCIzSKrwcV0H/20pv7O5+CDf2cRg5G5vt7PR 4J1NuVIBsboKwjwCj8qdxHQJHihvLYkTQKANqaqHv0+wx0u2DaQdPU/LRnqN71xA jP7b9lx9X6rPnAnZUDBbxzAc8HLeutgQ8YD/WB0sE5IXlI1/XFGW4tXIZ4iYmtN9 GUdL+zcdtEiYE993xiGSMXF4UBrN8d/5buBRsUsPVivAZes6OHbf9bd72c1IXBuS ADZ7cH7XGmLL3OXA+hm7d99429HFZYAgI7DjrLWp6Tb9ja5Gvhy+AVvrbU5ZWMwu XUnNsLsBBhEGuZs5xpkotZgaQlmJpw4BFY4BKwC6PL+7ex7ud3hGCGeI6VDmI0U= =DLHU -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 16:07 ` Phillip Susi @ 2014-11-19 21:05 ` Robert White 2014-11-19 21:47 ` Phillip Susi 2014-11-20 0:25 ` Duncan 1 sibling, 2 replies; 49+ messages in thread From: Robert White @ 2014-11-19 21:05 UTC (permalink / raw) To: Phillip Susi, Duncan, linux-btrfs

On 11/19/2014 08:07 AM, Phillip Susi wrote:
> On 11/18/2014 9:46 PM, Duncan wrote:
>> I'm not sure about normal operation, but certainly, many drives take longer than 30 seconds to stabilize after power-on, and I routinely see resets during this time.
>
> As far as I have seen, typical drive spin up time is on the order of 3-7 seconds. Hell, I remember my pair of first generation seagate cheetah 15,000 rpm drives seemed to take *forever* to spin up and that still was maybe only 15 seconds. If a drive takes longer than 30 seconds, then there is something wrong with it. I figure there is a reason why spin up time is tracked by SMART so it seems like long spin up time is a sign of a sick drive.

I was recently re-factoring Underdog (http://underdog.sourceforge.net) startup scripts to separate out the various startup domains (e.g. lvm, luks, mdadm) in the prototype init. So I notice you (Duncan) use the word "stabilize", as do a small number of drivers in the linux kernel. This word has very little to do with "disks" per se.

Between SCSI probing LUNs (where the controller tries every theoretical address and gives a potential device ample time to reply), and usb-storage having a simple timer delay set for each volume it sees, there is a lot of "waiting in the name of safety" going on in the linux kernel at device initialization.

When I added the messages "scanning /dev/sd??" to the startup sequence as I iterate through the disks and partitions present, I discovered that the first time I called blkid (e.g. right between /dev/sda and /dev/sda1) I'd get a huge hit of many human seconds (I didn't time it, but I'd say eight or so) just for having a 2TB My Book WD 3.0 disk enclosure attached as /dev/sdc. That the enclosure had "spun up" in the previous boot cycle, and that this was only a soft reboot, was immaterial. In this case usb-storage is going to take its time and do its deal regardless of the state of the physical drive itself.

So there are _lots_ of places where you are going to get delays, and very few of them involve the disk itself going from power-off to ready. You said it yourself with respect to SSDs. It's cheaper, less error prone, and less likely to generate customer returns if the generic controller chips just "send init, wait a fixed delay, then request a status" compared to trying to "are-you-there-yet" poll each device like a nagging child. And you are going to see that at every level. And you are going to see it multiply with _sparsely_ provisioned buses, where the cycle is going to be retried for absent LUNs (one disk on a Wide SCSI bus and a controller set to probe all LUNs is particularly egregious).

One of the reasons that the whole industry has started favoring point-to-point (SATA, SAS) or physical intercessor chaining point-to-point (eSATA) buses is to remove a lot of those wait-and-see delays.

That said, you should not see a drive (or target enclosure, or controller) "reset" during spin up. In a SCSI setting this is almost always a cabling, termination, or addressing issue. In IDE it's jumper mismatch (master vs slave vs cable-select). Less often it's a partitioning issue (trying to access sectors beyond the end of the drive).

Another strong actor is selecting the wrong storage controller chipset driver. In that case you may be falling back from the high-end device you think it is, through an intermediate chip-set, and back to ACPI or BIOS emulation.

Another common cause is having a dedicated hardware RAID controller (dell likes to put LSI MegaRaid controllers in their boxes for example), many mother boards have hardware RAID support available through the bios, etc, leaving that feature active, then adding a drive and _not_ initializing that drive with the RAID controller disk setup. In this case the controller is going to repeatedly probe the drive for its proprietary controller signature blocks (and reset the drive after each attempt) and then finally fall back to raw block pass-through. This can take a long time (thirty seconds to a minute).

But seriously, if you are seeing "reset" anywhere in any storage chain during a normal power-on cycle then you've got a problem with geometry or configuration.

^ permalink raw reply [flat|nested] 49+ messages in thread
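Whether a given box is actually seeing those resets is easy to check, since libata logs them. A quick sketch; the exact message wording varies by kernel version:

  # list link resets and link-speed renegotiations since boot
  dmesg | grep -Ei 'ata[0-9.]+.*(reset|sata link)'

A clean boot shows one "SATA link up" line per populated port and no resetting or speed-downgrade chatter afterwards.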
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 21:05 ` Robert White @ 2014-11-19 21:47 ` Phillip Susi 2014-11-19 22:25 ` Robert White 2014-11-19 22:33 ` Robert White 2014-11-20 0:25 ` Duncan 1 sibling, 2 replies; 49+ messages in thread From: Phillip Susi @ 2014-11-19 21:47 UTC (permalink / raw) To: Robert White, Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/19/2014 4:05 PM, Robert White wrote: > It's cheaper, and less error prone, and less likely to generate > customer returns if the generic controller chips just "send init, > wait a fixed delay, then request a status" compared to trying to > "are-you-there-yet" poll each device like a nagging child. And you > are going to see that at every level. And you are going to see it > multiply with _sparsely_ provisioned buses where the cycle is going > to be retried for absent LUNs (one disk on a Wide SCSI bus and a > controller set to probe all LUNs is particularly egregious) No, they do not wait a fixed time, then proceed. They do in fact issue the command, then poll or wait for an interrupt to know when it is done, then time out and give up if that doesn't happen within a reasonable amount of time. > One of the reasons that the whole industry has started favoring > point-to-point (SATA, SAS) or physical intercessor chaining > point-to-point (eSATA) buses is to remove a lot of those > wait-and-see delays. Nope... even with the ancient PIO mode PATA interface, you polled a ready bit in the status register to see if it was done yet. If you always waited 30 seconds for every command your system wouldn't boot up until next year. > Another strong actor is selecting the wrong storage controller > chipset driver. In that case you may be faling back from high-end > device you think it is, through intermediate chip-set, and back to > ACPI or BIOS emulation There is no such thing as ACPI or BIOS emulation. AHCI SATA controllers do usually have an old IDE emulation mode instead of AHCI mode, but this isn't going to cause ridiculously long delays. > Another common cause is having a dedicated hardware RAID > controller (dell likes to put LSI MegaRaid controllers in their > boxes for example), many mother boards have hardware RAID support > available through the bios, etc, leaving that feature active, then > the adding a drive and That would be fake raid, not hardware raid. > _not_ initializing that drive with the RAID controller disk setup. > In this case the controller is going to repeatedly probe the drive > for its proprietary controller signature blocks (and reset the > drive after each attempt) and then finally fall back to raw block > pass-through. This can take a long time (thirty seconds to a > minute). No, no, and no. If it reads the drive and does not find its metadata, it falls back to pass through. The actual read takes only milliseconds, though it may have to wait a few seconds for the drive to spin up. There is no reason it would keep retrying after a successful read. The way you end up with 30-60 second startup time with a raid is if you have several drives and staggered spinup mode enabled, then each drive is started one at a time instead of all at once so their cumulative startup time can add up fairly high. 
^ permalink raw reply [flat|nested] 49+ messages in thread
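To put rough numbers on that last point: with staggered spinup enabled and the 3-7 seconds of spin-up per drive cited earlier in the thread, an array of, say, six to eight members spends roughly 20 to 55 seconds just sequencing the motors, so the whole box can take the better part of a minute to become ready without any single drive being slow.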
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 21:47 ` Phillip Susi @ 2014-11-19 22:25 ` Robert White 2014-11-20 20:26 ` Phillip Susi 2014-11-19 22:33 ` Robert White 1 sibling, 1 reply; 49+ messages in thread From: Robert White @ 2014-11-19 22:25 UTC (permalink / raw) To: Phillip Susi, Duncan, linux-btrfs Shame you already know everything? On 11/19/2014 01:47 PM, Phillip Susi wrote: > On 11/19/2014 4:05 PM, Robert White wrote: > >> One of the reasons that the whole industry has started favoring >> point-to-point (SATA, SAS) or physical intercessor chaining >> point-to-point (eSATA) buses is to remove a lot of those >> wait-and-see delays. > > Nope... even with the ancient PIO mode PATA interface, you polled a > ready bit in the status register to see if it was done yet. If you > always waited 30 seconds for every command your system wouldn't boot > up until next year. The controller, the thing that sets the ready bit and sends the interrupt is distinct from the driver, the thing that polls the ready bit when the interrupt is sent. At the bus level there are fixed delays and retries. Try putting two drives on a pin-select IDE bus and strapping them both as _slave_ (or indeed master) sometime and watch the shower of fixed delay retries. >> Another strong actor is selecting the wrong storage controller >> chipset driver. In that case you may be faling back from high-end >> device you think it is, through intermediate chip-set, and back to >> ACPI or BIOS emulation > > There is no such thing as ACPI or BIOS emulation. That's odd... my bios reads from storage to boot the device and it does so using the ACPI storage methods. ACPI 4.0 Specification Section 9.8 even disagrees with you at some length. Let's just do the titles shall we: 9.8 ATA Controller Devices 9.8.1 Objects for both ATA and SATA Controllers. 9.8.2 IDE Controller Device 9.8.3 Serial ATA (SATA) controller Device Oh, and _lookie_ _here_ in Linux Kernel Menuconfig at Device Drivers -> <*> Serial ATA and Parallel ATA drivers (libata) -> <*> ACPI firmware driver for PATA CONFIG_PATA_ACPI: This option enables an ACPI method driver which drives motherboard PATA controller interfaces through the ACPI firmware in the BIOS. This driver can sometimes handle otherwise unsupported hardware. You are a storage _genius_ for knowing that all that stuff doesn't exist... the rest of us must simply muddle along in our delusion... > AHCI SATA > controllers do usually have an old IDE emulation mode instead of AHCI > mode, but this isn't going to cause ridiculously long delays. Do tell us more... I didn't say the driver would cause long delays, I said that the time it takes to error out other improperly supported drivers and fall back to this one could induce long delays and resets. I think I am done with your "expertise" in the question of all things storage related. Not to be rude... but I'm physically ill and maybe I shouldn't be posting right now... 8-) -- Rob. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 22:25 ` Robert White @ 2014-11-20 20:26 ` Phillip Susi 2014-11-20 22:45 ` Robert White 0 siblings, 1 reply; 49+ messages in thread From: Phillip Susi @ 2014-11-20 20:26 UTC (permalink / raw) To: Robert White, Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On 11/19/2014 5:25 PM, Robert White wrote:
> The controller, the thing that sets the ready bit and sends the interrupt, is distinct from the driver, the thing that polls the ready bit when the interrupt is sent. At the bus level there are fixed delays and retries. Try putting two drives on a pin-select IDE bus and strapping them both as _slave_ (or indeed master) sometime and watch the shower of fixed delay retries.

No, it does not. In classical IDE, the "controller" is really just a bus bridge. When you read from the status register in the controller, the read bus cycle is propagated down the IDE ribbon and into the drive, and you are, in fact, reading the register directly from the drive. That is where the name Integrated Device Electronics came from: because the controller was really integrated into the drive. The only fixed delays at the bus level are the bus cycle speed. There are no retries. There are only 3 mentions of the word "retry" in the ATA8-APT spec and they all refer to the host driver.

> That's odd... my bios reads from storage to boot the device and it does so using the ACPI storage methods.

No, it doesn't. It does so by accessing the IDE or AHCI registers just as the pc bios always has. I suppose I also need to remind you that we are talking about the context of linux here, and linux does not make use of the bios for disk access.

> ACPI 4.0 Specification Section 9.8 even disagrees with you at some length.
>
> Let's just do the titles shall we:
>
> 9.8 ATA Controller Devices 9.8.1 Objects for both ATA and SATA Controllers. 9.8.2 IDE Controller Device 9.8.3 Serial ATA (SATA) controller Device
>
> Oh, and _lookie_ _here_ in Linux Kernel Menuconfig at Device Drivers -> <*> Serial ATA and Parallel ATA drivers (libata) -> <*> ACPI firmware driver for PATA
>
> CONFIG_PATA_ACPI:
>
> This option enables an ACPI method driver which drives motherboard PATA controller interfaces through the ACPI firmware in the BIOS. This driver can sometimes handle otherwise unsupported hardware.
>
> You are a storage _genius_ for knowing that all that stuff doesn't exist... the rest of us must simply muddle along in our delusion...

Yes, ACPI 4.0 added this mess. I have yet to see a single system that actually implements it. I can't believe they even bothered adding this driver to the kernel. Is there anyone in the world who has ever used it? If no motherboard vendor has bothered implementing the ACPI FAN specs, I very much doubt anyone will ever bother with this.

> Do tell us more... I didn't say the driver would cause long delays, I said that the time it takes to error out other improperly supported drivers and fall back to this one could induce long delays and resets.

There is no "error out" and "fall back". If the device is in AHCI mode then it identifies itself as such and the AHCI driver is loaded. If it is in IDE mode, then it identifies itself as such, and the IDE driver is loaded.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-20 20:26 ` Phillip Susi @ 2014-11-20 22:45 ` Robert White 2014-11-21 15:11 ` Phillip Susi 0 siblings, 1 reply; 49+ messages in thread From: Robert White @ 2014-11-20 22:45 UTC (permalink / raw) To: Phillip Susi, Duncan, linux-btrfs On 11/20/2014 12:26 PM, Phillip Susi wrote: > Yes, ACPI 4.0 added this mess. I have yet to see a single system that > actually implements it. I can't believe they even bothered adding > this driver to the kernel. Is there anyone in the world who has ever > used it? If no motherboard vendor has bothered implementing the ACPI > FAN specs, I very much doubt anyone will ever bother with this. Nice attempt at saving face, but wrong as _always_. The CONFIG_PATA_ACPI option has been in the kernel since 2008 and lots of people have used it. If you search for "ACPI ide" you'll find people complaining in 2008-2010 about windows error messages indicating the device is present in their system but no OS driver is available. That you "have yet to see a single system that implements it" is about the worst piece of internet research I've ever seen. Do you not _get_ that your opinion about what exists and how it works is not authoritative? You can also find articles about both windows and linux systems actively using ACPI fan control going back to 2009 These are not hard searches to pull off. These are not obscure references. Go to the google box and start typing "ACPI fan..." and check the autocomplete. I'll skip ovea all the parts where you don't know how a chipset works and blah, blah, blah... You really should have just stopped at "I don't know" and "I've never" because you keep demonstrating that you _don't_ know, and that you really _should_ _never_. Tell us more about the lizard aliens controlling your computer, I find your versions of realty fascinating... ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-20 22:45 ` Robert White @ 2014-11-21 15:11 ` Phillip Susi 2014-11-21 21:12 ` Robert White 0 siblings, 1 reply; 49+ messages in thread From: Phillip Susi @ 2014-11-21 15:11 UTC (permalink / raw) To: Robert White, Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/20/2014 5:45 PM, Robert White wrote: > Nice attempt at saving face, but wrong as _always_. > > The CONFIG_PATA_ACPI option has been in the kernel since 2008 and > lots of people have used it. > > If you search for "ACPI ide" you'll find people complaining in > 2008-2010 about windows error messages indicating the device is > present in their system but no OS driver is available. Nope... not finding it. The closest thing was one or two people who said ACPI when they meant AHCI ( and were quickly corrected ). This is probably what you were thinking of since windows xp did not ship with an ahci driver so it was quite common for winxp users to have this problem when in _AHCI_ mode. > That you "have yet to see a single system that implements it" is > about the worst piece of internet research I've ever seen. Do you > not _get_ that your opinion about what exists and how it works is > not authoritative? Show me one and I'll give you a cookie. I have disassembled a number of acpi tables and yet to see one that has it. What's more, motherboard vendors tend to implement only the absolute minimum they have to. Since nobody actually needs this feature, they aren't going to bother with it. Do you not get that your hand waving arguments of "you can google for it" are not authoritative? > You can also find articles about both windows and linux systems > actively using ACPI fan control going back to 2009 Maybe you should have actually read those articles. Linux supports acpi fan control, unfortunately, almost no motherboards actually implement it. Almost everyone who wants fan control working in linux has to install lm-sensors and load a driver that directly accesses one of the embedded controllers that motherboards tend to use and run the fancontrol script to manipulate the pwm channels on that controller. These days you also have to boot with a kernel argument to allow loading the driver since ACPI claims those IO ports for its own use which creates a conflict. Windows users that want to do this have to install a program... I believe a popular one is called q-fan, that likewise directly accesses the embedded controller registers to control the fan, since the acpi tables don't bother properly implementing the acpi fan spec. Then there are thinkpads, and one or two other laptops ( asus comes to mind ) that went and implemented their own proprietary acpi interfaces for fancontrol instead of following the spec, which required some reverse engineering and yet more drivers to handle these proprietary acpi interfaces. You can google for "thinkfan" if you want to see this. > These are not hard searches to pull off. These are not obscure > references. Go to the google box and start typing "ACPI fan..." > and check the autocomplete. > > I'll skip ovea all the parts where you don't know how a chipset > works and blah, blah, blah... > > You really should have just stopped at "I don't know" and "I've > never" because you keep demonstrating that you _don't_ know, and > that you really _should_ _never_. > > Tell us more about the lizard aliens controlling your computer, I > find your versions of realty fascinating... 
By all means, keep embarrassing yourself with nonsense and trying to cover it up by being rude and insulting. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) iQEcBAEBAgAGBQJUb1YxAAoJEI5FoCIzSKrwi54H/Rkd7DloqC9x9QwN4QdmWcAZ /UQg3hcRbtB3wpmp34Mnb3SS0Ii2mCh/dtKmdRGBNE/x5nU1WiQEHHCicKX3Avvq 8OXLNQrsf+xZL9/HGtUJ3RefpEkmwIG5NgFfKJHtv6Iq204Umq32JUxRla+ZQE5s MrUparigpUlj26lrnShc6ByDUqYK3wOTsDxEMxrOyAgi/n/7ESHV/dZVaqsE6jGQ OvPynf1FqJoJSSYC7sNE0XLqfHMu2wnSxcoF6MpuHXlDiwtrSH07tuwgrhCNPagY j7gQyxucew8oim8lcfs+4rrQ60wwVzlsEJwjA9rAXQF7U2x/WoB+ArYhgmJUMgA= =cXJr -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 15:11 ` Phillip Susi @ 2014-11-21 21:12 ` Robert White 2014-11-21 21:41 ` Robert White 2014-11-22 22:06 ` Phillip Susi 0 siblings, 2 replies; 49+ messages in thread From: Robert White @ 2014-11-21 21:12 UTC (permalink / raw) To: Phillip Susi, Duncan, linux-btrfs On 11/21/2014 07:11 AM, Phillip Susi wrote: > On 11/20/2014 5:45 PM, Robert White wrote: >> If you search for "ACPI ide" you'll find people complaining in >> 2008-2010 about windows error messages indicating the device is >> present in their system but no OS driver is available. > > Nope... not finding it. The closest thing was one or two people who > said ACPI when they meant AHCI ( and were quickly corrected ). This > is probably what you were thinking of since windows xp did not ship > with an ahci driver so it was quite common for winxp users to have > this problem when in _AHCI_ mode. I have to give you that one... I should have never trusted any reference to windows. Most of those references to windows support were getting AHCI and ACPI mixed up. Lolz windows users... They didn't get into ACPI disk support till 2010. I should have known they were behind the times. I had to scroll down almost a whole page to find the linux support. So lets just look at the top of the ide/ide-acpi.c from linux 2.6 to consult about when ACPI got into the IDE business... linux/drivers/ide/ide-acpi.c /* * Provides ACPI support for IDE drives. * * Copyright (C) 2005 Intel Corp. * Copyright (C) 2005 Randy Dunlap * Copyright (C) 2006 SUSE Linux Products GmbH * Copyright (C) 2006 Hannes Reinecke */ Here's a bug from 2005 of someone having a problem with the ACPI IDE support... https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=6&cad=rja&uact=8&ved=0CDkQFjAF&url=https%3A%2F%2Fbugzilla.kernel.org%2Fshow_bug.cgi%3Fid%3D5604&ei=g6VvVL73K-HLsASIrYKIDg&usg=AFQjCNGTuuXPJk91svGJtRAf35DUqVqrLg&sig2=eHxwbLYXn4ED5jG-guoZqg People debating the merits of the ACPI IDE drivers in 2005. https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=12&cad=rja&uact=8&ved=0CGUQFjAL&url=http%3A%2F%2Fwww.linuxquestions.org%2Fquestions%2Fslackware-14%2Fbare-ide-and-bare-acpi-kernels-297525%2F&ei=g6VvVL73K-HLsASIrYKIDg&usg=AFQjCNFoyKgH2sOteWwRN_Tdrfw9hOmVGQ&sig2=BmMVcZl24KRz4s4gEvLN_w So "you got me"... windows was behind the curve by five years instead of just three... my bad... But yea, nobody has ever used that ACPI disk drive support that's been in the kernel for nine years. Even when you "get me" for referencing windows, you're still wrong... How many times will you try get out of being hideously horribly wrong about ACPI supporting disk/storage IO? It is neither "recent" nor "rare". How much egg does your face really need before you just see that your fantasy that it's "new" and uncommon is a delusional mistake? Methinks Misters Dunning and Kruger need a word with you... ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 21:12 ` Robert White @ 2014-11-21 21:41 ` Robert White 2014-11-22 22:06 ` Phillip Susi 1 sibling, 0 replies; 49+ messages in thread From: Robert White @ 2014-11-21 21:41 UTC (permalink / raw) To: Phillip Susi, Duncan, linux-btrfs On 11/21/2014 01:12 PM, Robert White wrote: > (wrong links included in post...) Dangit... those two links were bad... wrong clipboard... /sigh... I'll just stand on the pasted text from the driver. 8-) ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 21:12 ` Robert White 2014-11-21 21:41 ` Robert White @ 2014-11-22 22:06 ` Phillip Susi 1 sibling, 0 replies; 49+ messages in thread From: Phillip Susi @ 2014-11-22 22:06 UTC (permalink / raw) To: Robert White, Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 On 11/21/2014 04:12 PM, Robert White wrote: > Here's a bug from 2005 of someone having a problem with the ACPI > IDE support... That is not ACPI "emulation". ACPI is not used to access the disk, but rather it has hooks that give it a chance to diddle with the disk to do things like configure it to lie about its maximum size, or issue a security unlock during suspend/resume. > People debating the merits of the ACPI IDE drivers in 2005. No... that's not a debate at all; it is one guy asking if he should use IDE or "ACPI" mode... someone who again meant AHCI and typed the wrong acronym. > Even when you "get me" for referencing windows, you're still > wrong... > > How many times will you try get out of being hideously horribly > wrong about ACPI supporting disk/storage IO? It is neither "recent" > nor "rare". > > How much egg does your face really need before you just see that > your fantasy that it's "new" and uncommon is a delusional mistake? Project much? It seems I've proven just about everything I originally said you got wrong now so hopefully we can be done. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAEBCgAGBQJUcQj4AAoJENRVrw2cjl5RwmcH+gOW0LUQE4OXEToMY33brK8Z QMKw7T1y4dtXIeeWihugNs+vbwmoI2Wheeej4WPdiqvgqIfX4ov9+N9Nb39JiIsI 7frPJ638n98Et5sirCGKfaVvDTwlF85ApHHtXrVLg2dBY3A+oLM9jVU7jpRBvW1m IFjhJH/SMGDpMhix9SFg6w6cALRh1U5WYV4zMZ1f5/ri/05TYmNJ/M23cjtBicPZ LaIFxOMGef4lylysNaVh0W03424oIJit6d7DB1gxCyjnkUvVuJ43NjuS5ay+y2sP FFrepKrOfhK1oOib9e63zNfRHhWrX4KN0Dqcu/3+/+lhD3q5G1fd4YK2RV/oaso= =nm9l -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 21:47 ` Phillip Susi 2014-11-19 22:25 ` Robert White @ 2014-11-19 22:33 ` Robert White 2014-11-20 20:34 ` Phillip Susi 1 sibling, 1 reply; 49+ messages in thread From: Robert White @ 2014-11-19 22:33 UTC (permalink / raw) To: Phillip Susi, Duncan, linux-btrfs

P.S.

On 11/19/2014 01:47 PM, Phillip Susi wrote:
>> Another common cause is having a dedicated hardware RAID controller (dell likes to put LSI MegaRaid controllers in their boxes for example), many mother boards have hardware RAID support available through the bios, etc, leaving that feature active, then adding a drive and
>
> That would be fake raid, not hardware raid.

The LSI MegaRaid controller people would _love_ to hear more about your insight into how their battery-backed multi-drive RAID controller is "fake". You should go work for them. Try the "contact us" link at the bottom of this page. I'm sure they are waiting for your insight with bated breath!

http://www.lsi.com/products/raid-controllers/pages/megaraid-sas-9260-8i.aspx

>> _not_ initializing that drive with the RAID controller disk setup. In this case the controller is going to repeatedly probe the drive for its proprietary controller signature blocks (and reset the drive after each attempt) and then finally fall back to raw block pass-through. This can take a long time (thirty seconds to a minute).
>
> No, no, and no. If it reads the drive and does not find its metadata, it falls back to pass through. The actual read takes only milliseconds, though it may have to wait a few seconds for the drive to spin up. There is no reason it would keep retrying after a successful read.

Odd, my MegaRaid controller takes about fifteen seconds by-the-clock to initialize and do the integrity check on my single initialized drive.

It's amazing that with a fail and retry it would be _faster_... It's like you know _everything_...

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 22:33 ` Robert White @ 2014-11-20 20:34 ` Phillip Susi 2014-11-20 23:08 ` Robert White 0 siblings, 1 reply; 49+ messages in thread From: Phillip Susi @ 2014-11-20 20:34 UTC (permalink / raw) To: Robert White, Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/19/2014 5:33 PM, Robert White wrote: >> That would be fake raid, not hardware raid. > > The LSI MegaRaid controller people would _love_ to hear more about > your insight into how their battery-backed multi-drive RAID > controller is "fake". You should go work for them. Try the "contact > us" link at the bottom of this page. I'm sure they are waiting for > your insight with baited breath! Forgive me, I should have trimmed the quote a bit more. I was responding specifically to the "many mother boards have hardware RAID support available through the bios" part, not the lsi part. > Odd, my MegaRaid controller takes about fifteen seconds > by-the-clock to initialize and to the integrity check on my single > initialized drive. It is almost certainly spending those 15 seconds on something else, like bootstrapping its firmware code from a slow serial eeprom or waiting for you to press the magic key to enter the bios utility. I would be very surprised to see that time double if you add a second disk. If it does, then they are doing something *very* wrong, and certainly quite different from any other real or fake raid controller I've ever used. > It's amazing that with a fail and retry it would be _faster_... I have no idea what you are talking about here. I said that they aren't going to retry a read that *succeeded* but came back without their magic signature. It isn't like reading it again is going to magically give different results. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) iQEcBAEBAgAGBQJUblBmAAoJEI5FoCIzSKrwFKkIAKNGOGyLrMIcTeV4DQntdbaa NMkjXnWnk6lHeqTyE/pb+l4VgVH8nQwDp8hRCnKNnKHoZbT8LOGFULSmBes+DDmW dxPVDTytUu1AiqB7AyxNJU8213BQCaF0inL7ofZmX95N+0eajuVxOyHIMeokdwUU zLOnXQg0awLkQwk7U6YLAKA4A7HrOEXw4wHt9hPy/yUySMVqCeHYV3tpf7t96guU 0IRctvpwcNvvVtt65I8A4EklR+vCvqEDUZfKyG8WJAeyAdC4UoHT9vZcJAVkiFl+ Y+Mp5wsr1vuo3dYQ1bKO8RvPTB9D9npFyFIlyHEBMJlCHDU43YsNP8hGcu0mKco= =AJ6/ -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-20 20:34 ` Phillip Susi @ 2014-11-20 23:08 ` Robert White 2014-11-21 15:27 ` Phillip Susi 0 siblings, 1 reply; 49+ messages in thread From: Robert White @ 2014-11-20 23:08 UTC (permalink / raw) To: Phillip Susi, Duncan, linux-btrfs

On 11/20/2014 12:34 PM, Phillip Susi wrote:
> On 11/19/2014 5:33 PM, Robert White wrote:
>>> That would be fake raid, not hardware raid.
>>
>> The LSI MegaRaid controller people would _love_ to hear more about your insight into how their battery-backed multi-drive RAID controller is "fake". You should go work for them. Try the "contact us" link at the bottom of this page. I'm sure they are waiting for your insight with bated breath!
>
> Forgive me, I should have trimmed the quote a bit more. I was responding specifically to the "many mother boards have hardware RAID support available through the bios" part, not the lsi part.

Well you should have _actually_ trimmed your response down to not pressing send.

_Many_ motherboards have complete RAID support at levels 0, 1, 10, and 5. A few have RAID6. Some of them even use the LSI chip-set.

Seriously... are you trolling this list with disinformation or just repeating tribal knowledge from fifteen-year-old copies of PC Magazine?

Yea, some of the IDE motherboards (and indeed some of the add-on controllers) that only had RAID1 and RAID0 back in the IDE-only days were really lame just-forked-write devices with no integrity checks (hence "fake raid"), but that's from like the 1990s; it's paleolithic-age "wisdom" at this point.

Phillip say sky god angry, all go hide in cave! /D'oh...

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-20 23:08 ` Robert White @ 2014-11-21 15:27 ` Phillip Susi 0 siblings, 0 replies; 49+ messages in thread From: Phillip Susi @ 2014-11-21 15:27 UTC (permalink / raw) To: Robert White, Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/20/2014 6:08 PM, Robert White wrote: > Well you should have _actually_ trimmed your response down to not > pressing send. > > _Many_ motherboards have complete RAID support at levels 0, 1, 10, > and five 5. A few have RAID6. > > Some of them even use the LSI chip-set. Yes, there are some expensive server class motherboards out there with integrated real raid chips. Your average consumer class motherboards are not those. They contain intel, nvidia, sil, promise, and via chipsets that are fake raid. > Seriously... are you trolling this list with disinformation or > just repeating tribal knowledge from fifteen year old copies of PC > Magazine? Please drop the penis measuring. > Yea, some of the IDE motherboards and that only had RAID1 and RAID0 > (and indeed some of the add-on controllers) back in the IDE-only > days were really lame just-forked-write devices with no integrity > checks (hence "fake raid") but that's from like the 1990s; it's > paleolithic age "wisdom" at this point. Wrong again... fakeraid became popular with the advent of SATA since it was easy to add a knob to the bios to switch it between AHCI and RAID mode, and just change the pci device id. These chipsets are still quite common today and several of them do support raid5 and raid10 ( well, really it's raid 0 + raid1, but that's a whole nother can of worms ). Recent intel chips also now have a caching mode for having an SSD cache a larger HDD. Intel has also done a lot of work integrating support for their chipset into mdadm in the last year or three. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) iQEcBAEBAgAGBQJUb1ngAAoJEI5FoCIzSKrwqMQIAJ3MfA4n74aJ1KUdfHYOz96o vwPBNQJ953yozmCHfjERbTCQlKT5AzwQHWpHoFWsQ4gYoNGmeE1jy2rsqxMfujff eQekfISyX3POExnsr3LnfHWI2/Om39+EAxVPxbA5LN6SC1SCWRut7Q3bQqkuxj/S bYRU65XJ9BZ6eYznutMDFdEELyAr8b9/wnatI/ohzmebOBDgFzBrn8gwilCctz7X DI39HTkCvciWKVXNyVdUZKI5S+MRCEB2JZAkCy3x8LLsENmMnO0xN32o5Od0zlGn nFLcLQFrZfz5dY2ZusxP+z0z0x4RW3sikd4RZ99PEHBkFa5CgJIFrBxtQAsLi1c= =4Yg+ -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 21:05 ` Robert White 2014-11-19 21:47 ` Phillip Susi @ 2014-11-20 0:25 ` Duncan 2014-11-20 2:08 ` Robert White 1 sibling, 1 reply; 49+ messages in thread From: Duncan @ 2014-11-20 0:25 UTC (permalink / raw) To: linux-btrfs Robert White posted on Wed, 19 Nov 2014 13:05:13 -0800 as excerpted: > One of the reasons that the whole industry has started favoring > point-to-point (SATA, SAS) or physical intercessor chaining > point-to-point (eSATA) buses is to remove a lot of those wait-and-see > delays. > > That said, you should not see a drive (or target enclosure, or > controller) "reset" during spin up. In a SCSI setting this is almost > always a cabling, termination, or addressing issue. In IDE its jumper > mismatch (master vs slave vs cable-select). Less often its a > partitioning issue (trying to access sectors beyond the end of the > drive). > > Another strong actor is selecting the wrong storage controller chipset > driver. In that case you may be faling back from high-end device you > think it is, through intermediate chip-set, and back to ACPI or BIOS > emulation FWIW I run a custom-built monolithic kernel, with only the specific drivers (SATA/AHCI in this case) builtin. There's no drivers for anything else it could fallback to. Once in awhile I do see it try at say 6-gig speeds, then eventually fall back to 3 and ultimately 1.5, but that /is/ indicative of other issues when I see it. And like I said, there's no other drivers to fall back to, so obviously I never see it doing that. > Another common cause is having a dedicated hardware RAID controller > (dell likes to put LSI MegaRaid controllers in their boxes for example), > many mother boards have hardware RAID support available through the > bios, etc, leaving that feature active, then the adding a drive and > _not_ initializing that drive with the RAID controller disk setup. In > this case the controller is going to repeatedly probe the drive for its > proprietary controller signature blocks (and reset the drive after each > attempt) and then finally fall back to raw block pass-through. This can > take a long time (thirty seconds to a minute). Everything's set JBOD here. I don't trust those proprietary "firmware" raid things. Besides, that kills portability. JBOD SATA and AHCI are sufficiently standardized that should the hardware die, I can switch out to something else and not have to worry about rebuilding the custom kernel with the new drivers. Some proprietary firmware raid, requiring dmraid at the software kernel level to support, when I can just as easily use full software mdraid on standardized JBOD, no thanks! And be sure, that's one of the first things I check when I setup a new box, any so-called hardware raid that's actually firmware/software raid, disabled, JBOD mode, enabled. > But seriously, if you are seeing "reset" anywhere in any storage chain > during a normal power-on cycle then you've got a problem with geometry > or configuration. IIRC I don't get it routinely. But I've seen it a few times, attributing it as I said to the 30-second SATA level timeout not being long enough. Most often, however, it's at resume, not original startup, which is understandable as state at resume doesn't match state at suspend/ hibernate. 
The irritating thing, as previously discussed, is when one device takes long enough to come back that mdraid or btrfs drops it out, generally forcing the reboot I was trying to avoid with the suspend/ hibernate in the first place, along with a re-add and resync (for mdraid) or a scrub (for btrfs raid). -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 49+ messages in thread
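The recovery Duncan describes after such a drop is at least mechanical. A short sketch with hypothetical names (an md array at /dev/md0 that kicked /dev/sdb1, and a btrfs raid1 mounted at /mnt), assuming the dropped device is otherwise healthy:

  # mdraid: put the kicked member back; with a write-intent bitmap the resync is short
  mdadm /dev/md0 --re-add /dev/sdb1
  cat /proc/mdstat                  # watch the resync

  # btrfs raid1: once both devices are visible again, scrub repairs the stale copies from the good ones
  btrfs scrub start /mnt
  btrfs scrub status /mnt

If --re-add is refused (no bitmap, or the event counts have drifted too far apart), mdadm will instead want a full --add and rebuild.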
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-20 0:25 ` Duncan @ 2014-11-20 2:08 ` Robert White 0 siblings, 0 replies; 49+ messages in thread From: Robert White @ 2014-11-20 2:08 UTC (permalink / raw) To: Duncan, linux-btrfs

On 11/19/2014 04:25 PM, Duncan wrote:
> Most often, however, it's at resume, not original startup, which is understandable as state at resume doesn't match state at suspend/hibernate. The irritating thing, as previously discussed, is when one device takes long enough to come back that mdraid or btrfs drops it out, generally forcing the reboot I was trying to avoid with the suspend/hibernate in the first place, along with a re-add and resync (for mdraid) or a scrub (for btrfs raid).

If you want a practical solution you might want to look at http://underdog.sourceforge.net (my project, shameless plug). The actual user context return isn't in there, but I use the project to build initramfs images into all my kernels. [DISCLAIMER: The cryptsetup and LUKS stuff is rock solid, but the mdadm incremental build stuff is very rough and only lightly tested.]

You could easily add a drive preheat code block (spin up and status check all drives, with a pause-and-repeat function) as a preamble function that could/would safely take place before any glance is made towards the resume stage.

Extemporaneous example:

--- snip ---
cat <<'EOT' >>/opt/underdog/utility/preheat.mod
#!/bin/bash
# ROOT_COMMANDS+=( commands your preheat needs )
UNDERDOG+=( init.d/preheat )
EOT

cat <<'EOT' >>/opt/underdog/prototype/init.d/preheat
#!/bin/bash
function __preamble_preheat() {
  # whatever logic you need
  return 0
}
__preamble_funcs+=( [preheat]=__preamble_preheat )
EOT
--- snip ---

Install underdog, paste the above into a shell once, then edit /opt/underdog/prototype/init.d/preamble to put in whatever logic you need. Follow the instructions in /opt/underdog/README.txt for making the initramfs image or, as I do, build the initramfs into the kernel image.

The preamble will be run in the resultant /init script before the swap partitions are submitted for attempted resume. (The system does support complexity like resuming from a swap partition inside an LVM/LV built over a LUKS-encrypted media expanse, or just a plain laptop with one plain partitioned disk, with zero changes to the necessary default config.)

-- Rob.

^ permalink raw reply [flat|nested] 49+ messages in thread
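As one illustration of what the "whatever logic you need" placeholder above might hold (purely a sketch, not something shipped with Underdog): poll each disk until it answers a status query, with a bounded number of retries, so a slow-to-wake drive gets its extra seconds without stalling the boot forever. This assumes hdparm has been added to ROOT_COMMANDS in the .mod file:

  function __preamble_preheat() {
    local dev try
    for dev in /dev/sd[a-z]; do
      [ -b "$dev" ] || continue
      for try in 1 2 3 4 5 6; do
        # hdparm -C asks the drive for its power state; a reply means it is answering commands
        hdparm -C "$dev" >/dev/null 2>&1 && break
        sleep 5
      done
    done
    return 0
  }

Each device gets up to 30 extra seconds to come ready before the resume and mount logic runs, and devices that respond immediately cost essentially nothing.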
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 16:07 ` Phillip Susi 2014-11-19 21:05 ` Robert White @ 2014-11-19 23:59 ` Duncan 2014-11-25 22:14 ` Phillip Susi 1 sibling, 1 reply; 49+ messages in thread From: Duncan @ 2014-11-19 23:59 UTC (permalink / raw) To: linux-btrfs Phillip Susi posted on Wed, 19 Nov 2014 11:07:43 -0500 as excerpted: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 11/18/2014 9:46 PM, Duncan wrote: >> I'm not sure about normal operation, but certainly, many drives take >> longer than 30 seconds to stabilize after power-on, and I routinely see >> resets during this time. > > As far as I have seen, typical drive spin up time is on the order of 3-7 > seconds. Hell, I remember my pair of first generation seagate cheetah > 15,000 rpm drives seemed to take *forever* to spin up and that still was > maybe only 15 seconds. If a drive takes longer than 30 seconds, then > there is something wrong with it. I figure there is a reason why spin > up time is tracked by SMART so it seems like long spin up time is a sign > of a sick drive. It's not physical spinup, but electronic device-ready. It happens on SSDs too and they don't have anything to spinup. But, for instance on my old seagate 300-gigs that I used to have in 4-way mdraid, when I tried to resume from hibernate the drives would be spunup and talking to the kernel, but for some seconds to a couple minutes or so after spinup, they'd sometimes return something like (example) "Seagrte3x0" instead of "Seagate300". Of course that wasn't the exact string, I think it was the model number or perhaps the serial number or something, but looking at dmsg I could see the ATA layer up for each of the four devices, the connection establish and seem to be returning good data, then the mdraid layer would try to assemble and would kick out a drive or two due to the device string mismatch compared to what was there before the hibernate. With the string mismatch, from its perspective the device had disappeared and been replaced with something else. But if I held it at the grub prompt for a couple minutes and /then/ let it go, or part of the time on its own, all four drives would match and it'd work fine. For just short hibernates (as when testing hibernate/ resume), it'd come back just fine; as it would nearly all the time out to two hours or so. Beyond that, out to 10 or 12 hours, the longer it sat the more likely it would be to fail, if it didn't hold it at the grub prompt for a few minutes to let it stabilize. And now I seen similar behavior resuming from suspend (the old hardware wouldn't resume from suspend to ram, only hibernate, the new hardware resumes from suspend to ram just fine, but I had trouble getting it to resume from hibernate back when I first setup and tried it; I've not tried hibernate since and didn't even setup swap to hibernate to when I got the SSDs so I've not tried it for a couple years) on SSDs with btrfs raid. Btrfs isn't as informative as was mdraid on why it kicks a device, but dmesg says both devices are up, while btrfs is suddenly spitting errors on one device. A reboot later and both devices are back in the btrfs and I can do a scrub to resync, which generally finds and fixes errors on the btrfs that were writable (/home and /var/log), but of course not on the btrfs mounted as root, since it's read-only by default. Same pattern. Immediate suspend and resume is fine. Out to about 6 hours it tends to be fine as well. 
But at 8-10 hours in suspend, btrfs starts spitting errors often enough that I generally quit trying to suspend at all, I simply shut down now. (With SSDs and systemd, shutdown and restart is fast enough, and the delay from having to refill cache low enough, that the time difference between suspend and full shutdown is hardly worth troubling with anyway, certainly not when there's a risk to data due to failure to properly resume.) But it worked fine when I had only a single device to bring back up. Nothing to be slower than another device to respond and thus to be kicked out as dead. I finally realized what was happening after I read a study paper mentioning capacitor charge time and solid-state stability time, and how a lot of cheap devices say they're ready before the electronics have actually properly stabilized. On SSDs, this is a MUCH worse issue than it is on spinning rust, because the logical layout isn't practically forced to serial like it is on spinning rust, and the firmware can get so jumbled it pretty much scrambles the device. And it's not just the normal storage either. In the study, many devices corrupted their own firmware as well! Now that was definitely a worst-case study in that they were deliberately yanking and/or fast-switching the power, not just doing time-on waits, but still, a surprisingly high proportion of SSDs not only scrambled the storage, but scrambled their firmware as well. (On those devices the firmware may well have been on the same media as the storage, with the firmware simply read in first in a hardware bootstrap mode, and the firmware programmed to avoid that area in normal operation thus making it as easily corrupted as the the normal storage.) The paper specifically mentioned that it wasn't necessarily the more expensive devices that were the best, either, but the ones that faired best did tend to have longer device-ready times. The conclusion was that a lot of devices are cutting corners on device-ready, gambling that in normal use they'll work fine, leading to an acceptable return rate, and evidently, the gamble pays off most of the time. That being the case, a longer device-ready, if it actually means the device /is/ ready, can be a /good/ thing. If there's a 30-second timeout layer getting impatient and resetting the drive multiple times because it's not responding as it's not actually ready yet, well... The spinning rust in that study faired far better, with I think none of the devices scrambling their own firmware, and while there was some damage to storage, it was generally far better confined. >> This doesn't happen on single-hardware-device block devices and >> filesystems because in that case it's either up or down, if the device >> doesn't come up in time the resume simply fails entirely, instead of >> coming up with one or more devices there, but others missing as they >> didn't stabilize in time, as is unfortunately all too common in the >> multi- device scenario. > > No, the resume doesn't "fail entirely". The drive is reset, and the IO > request is retried, and by then it should succeed. Yes. I misspoke by abbreviation. The point I was trying to make is that there's only the one device, so ultimately it either works or it doesn't. There's no case of one or more devices coming up correctly and one or more still not being entirely ready. > It certainly is not normal for a drive to take that long to spin up. > IIRC, the 30 second timeout comes from the ATA specs which state that it > can take up to 30 seconds for a drive to spin up. 
>> That said, I SHOULD say I'd be far *MORE* irritated if the device >> simply pretended it was stable and started reading/writing data before >> it really had stabilized, particularly with SSDs where that sort of >> behavior has been observed and is known to put some devices at risk of >> complete scrambling of either media or firmware, beyond recovery at >> times. That was referencing the study I summarized in a bit more depth, above. > Power supply voltage is stable within milliseconds. What takes HDDs > time to start up is mechanically bringing the spinning rust up to speed. > On SSDs, I think you are confusing testing done on power *cycling* ( > i.e. yanking the power cord in the middle of a write ) with startup. But if the startup is showing the symptoms... FWIW, I wasn't a believer at first either. But I know what I see on my own hardware. Tho I now suspect we might be in vehement agreement with each other, just from different viewpoints and stating it differently. =:^) >> So, umm... I suspect the 2-minute default is 2 minutes due to power-up >> stabilizing issues > The default is 30 seconds, not 2 minutes. Well, as discussed by others there's an often two-minute default at one level, and a 30 second default at another. I was replying to someone who couldn't see the logic behind 2 minutes for sure, or even 30 seconds, with a reason why the 2 minute retry timeout might actually make sense. Yes, there's a 30 second time at a different level as well, but I was addressing why 2 minutes can make sense. Regardless, with the 2 minute timeout behind the half-minute timeout, the 2-minute timeout is obviously never going to be seen, which /is/ a problem. > Again, there is no several minute period where voltage stabilizes and > the drive takes longer to access. This is a complete red herring. My experience says otherwise. Else explain why those problems occur in the first two minutes, but don't occur if I hold it at the grub prompt "to stabilize"for two minutes, and never during normal "post- stabilization" operation. Of course perhaps there's another explanation for that, and I'm conflating the two things. But so far, experience matches the theory. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 23:59 ` Duncan @ 2014-11-25 22:14 ` Phillip Susi 2014-11-28 15:55 ` Patrik Lundquist 0 siblings, 1 reply; 49+ messages in thread From: Phillip Susi @ 2014-11-25 22:14 UTC (permalink / raw) To: Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/19/2014 6:59 PM, Duncan wrote: > It's not physical spinup, but electronic device-ready. It happens > on SSDs too and they don't have anything to spinup. If you have an SSD that isn't handling IO within 5 seconds or so of power on, it is badly broken. > But, for instance on my old seagate 300-gigs that I used to have in > 4-way mdraid, when I tried to resume from hibernate the drives > would be spunup and talking to the kernel, but for some seconds to > a couple minutes or so after spinup, they'd sometimes return > something like (example) "Seagrte3x0" instead of "Seagate300". Of > course that wasn't the exact string, I think it was the model > number or perhaps the serial number or something, but looking at > dmsg I could see the ATA layer up for each of the four devices, the > connection establish and seem to be returning good data, then the > mdraid layer would try to assemble and would kick out a drive or > two due to the device string mismatch compared to what was there > before the hibernate. With the string mismatch, from its > perspective the device had disappeared and been replaced with > something else. Again, these drives were badly broken then. Even if it needs extra time to come up for some reason, it shouldn't be reporting that it is ready and returning incorrect information. > And now I seen similar behavior resuming from suspend (the old > hardware wouldn't resume from suspend to ram, only hibernate, the > new hardware resumes from suspend to ram just fine, but I had > trouble getting it to resume from hibernate back when I first setup > and tried it; I've not tried hibernate since and didn't even setup > swap to hibernate to when I got the SSDs so I've not tried it for a > couple years) on SSDs with btrfs raid. Btrfs isn't as informative > as was mdraid on why it kicks a device, but dmesg says both devices > are up, while btrfs is suddenly spitting errors on one device. A > reboot later and both devices are back in the btrfs and I can do a > scrub to resync, which generally finds and fixes errors on the > btrfs that were writable (/home and /var/log), but of course not on > the btrfs mounted as root, since it's read-only by default. Several months back I was working on some patches to avoid blocking a resume until after all disks had spun up ( someone else ended up getting a different version merged to the mainline kernel ). I looked quite hard at the timings of things during suspend and found that my ssd was ready and handling IO darn near instantly and the hd ( 5900 rpm wd green at the time ) took something like 7 seconds before it was completing IO. These days I'm running a raid10 on 3 7200 rpm blues and it comes right up from suspend with no problems, just as it should. > The paper specifically mentioned that it wasn't necessarily the > more expensive devices that were the best, either, but the ones > that faired best did tend to have longer device-ready times. The > conclusion was that a lot of devices are cutting corners on > device-ready, gambling that in normal use they'll work fine, > leading to an acceptable return rate, and evidently, the gamble > pays off most of the time. 
I believe I read the same study and don't recall any such conclusion. Instead the conclusion was that the badly behaving drives aren't ordering their internal writes correctly and flushing their metadata from ram to flash before completing the write request. The problem was on the power *loss* side, not the power application. > The spinning rust in that study faired far better, with I think > none of the devices scrambling their own firmware, and while there > was some damage to storage, it was generally far better confined. That is because they don't have a flash translation layer to get mucked up and prevent them from knowing where the blocks are on disk. The worst thing you get out of a hdd losing power during a write is the sector it was writing is corrupted and you have to re-write it. > My experience says otherwise. Else explain why those problems > occur in the first two minutes, but don't occur if I hold it at the > grub prompt "to stabilize"for two minutes, and never during normal > "post- stabilization" operation. Of course perhaps there's another > explanation for that, and I'm conflating the two things. But so > far, experience matches the theory. I don't know what was broken about these drives, only that it wasn't capacitors since those charge in milliseconds, not seconds. Further, all systems using microprocessors ( like the one in the drive that controls it ) have reset circuitry that prevents them from running until after any caps have charged enough to get the power rail up to the required voltage. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) iQEcBAEBAgAGBQJUdP9jAAoJEI5FoCIzSKrw50IH/jkh48Z8Oh/AS/i68zT6Grtb C98aNNQwhC2sJSvaxRBqJ1qkXY4af5DZM/SOvFdNE4qdPLBDLfg70tnTXwU4PjzN 1mHR1PR6Vgft11t0+u8TPTos669Jm8KJ21NMgY072P18Kj/+UJqNRQ+UUNikAcaM XrTragev53F1Kzu5IrSGGjyS4ryZZNh9YioFtR3oUTh4WuCJIiiqvq1Qpno3ee+D QrL+5/fyzEkv0fAt59lhfheb2SkWe2Po+FmmH853sPP3MfhX4blTRzQbkVqZpixb NwsEMu/1hOGedzlZAp4i6aRRKDcl7B+R+x63frFun/kgY54gdbBEn3auoNSGuZA= =iPNz -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-25 22:14 ` Phillip Susi @ 2014-11-28 15:55 ` Patrik Lundquist 0 siblings, 0 replies; 49+ messages in thread From: Patrik Lundquist @ 2014-11-28 15:55 UTC (permalink / raw) To: Phillip Susi; +Cc: linux-btrfs@vger.kernel.org On 25 November 2014 at 23:14, Phillip Susi <psusi@ubuntu.com> wrote: > On 11/19/2014 6:59 PM, Duncan wrote: > >> The paper specifically mentioned that it wasn't necessarily the >> more expensive devices that were the best, either, but the ones >> that faired best did tend to have longer device-ready times. The >> conclusion was that a lot of devices are cutting corners on >> device-ready, gambling that in normal use they'll work fine, >> leading to an acceptable return rate, and evidently, the gamble >> pays off most of the time. > > I believe I read the same study and don't recall any such conclusion. > Instead the conclusion was that the badly behaving drives aren't > ordering their internal writes correctly and flushing their metadata > from ram to flash before completing the write request. The problem > was on the power *loss* side, not the power application. I've found: http://www.usenix.org/conference/fast13/technical-sessions/presentation/zheng http://lkcl.net/reports/ssd_analysis.html Are there any more studies? ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-18 7:29 ` scrub implies failing drive - smartctl blissfully unaware Brendan Hide ` (2 preceding siblings ...) 2014-11-18 15:35 ` Marc MERLIN @ 2014-11-21 4:58 ` Zygo Blaxell 2014-11-21 7:05 ` Brendan Hide 3 siblings, 1 reply; 49+ messages in thread From: Zygo Blaxell @ 2014-11-21 4:58 UTC (permalink / raw) To: Brendan Hide; +Cc: linux-btrfs@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 8211 bytes --] On Tue, Nov 18, 2014 at 09:29:54AM +0200, Brendan Hide wrote: > Hey, guys > > See further below extracted output from a daily scrub showing csum > errors on sdb, part of a raid1 btrfs. Looking back, it has been > getting errors like this for a few days now. > > The disk is patently unreliable but smartctl's output implies there > are no issues. Is this somehow standard faire for S.M.A.R.T. output? > > Here are (I think) the important bits of the smartctl output for > $(smartctl -a /dev/sdb) (the full results are attached): > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED > WHEN_FAILED RAW_VALUE > 1 Raw_Read_Error_Rate 0x000f 100 253 006 Pre-fail > Always - 0 > 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail > Always - 1 > 7 Seek_Error_Rate 0x000f 086 060 030 Pre-fail > Always - 440801014 > 197 Current_Pending_Sector 0x0012 100 100 000 Old_age > Always - 0 > 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age > Offline - 0 > 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age > Always - 0 > 200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age > Offline - 0 > 202 Data_Address_Mark_Errs 0x0032 100 253 000 Old_age > Always - 0 You have one reallocated sector, so the drive has lost some data at some time in the last 49000(!) hours. Normally reallocations happen during writes so the data that was "lost" was data you were in the process of overwriting anyway; however, the reallocated sector count could also be a sign of deteriorating drive integrity. In /var/lib/smartmontools there might be a csv file with logged error attribute data that you could use to figure out whether that reallocation was recent. I also notice you are not running regular SMART self-tests (e.g. by smartctl -t long) and the last (and first, and only!) self-test the drive ran was ~12000 hours ago. That means most of your SMART data is about 18 months old. The drive won't know about sectors that went bad in the last year and a half unless the host happens to stumble across them during a read. The drive is over five years old in operating hours alone. It is probably so fragile now that it will break if you try to move it. > > > -------- Original Message -------- > Subject: Cron <root@watricky> /usr/local/sbin/btrfs-scrub-all > Date: Tue, 18 Nov 2014 04:19:12 +0200 > From: (Cron Daemon) <root@watricky> > To: brendan@watricky > > > > WARNING: errors detected during scrubbing, corrected. > [snip] > scrub device /dev/sdb2 (id 2) done > scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds > total bytes scrubbed: 189.49GiB with 5420 errors > error details: read=5 csum=5415 > corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164 That seems a little off. If there were 5 read errors, I'd expect the drive to have errors in the SMART error log. Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem. There have been a number of fixes to csums in btrfs pulled into the kernel recently, and I've retired two five-year-old computers this summer due to RAM/CPU failures. 
> [snip] > > smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.17.2-1-ARCH] (local build) > Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org > > === START OF INFORMATION SECTION === > Model Family: Seagate Barracuda 7200.10 > Device Model: ST3250410AS > Serial Number: 6RYF5NP7 > Firmware Version: 4.AAA > User Capacity: 250,059,350,016 bytes [250 GB] > Sector Size: 512 bytes logical/physical > Device is: In smartctl database [for details use: -P show] > ATA Version is: ATA/ATAPI-7 (minor revision not indicated) > Local Time is: Tue Nov 18 09:16:03 2014 SAST > SMART support is: Available - device has SMART capability. > SMART support is: Enabled > > === START OF READ SMART DATA SECTION === > SMART overall-health self-assessment test result: PASSED > See vendor-specific Attribute list for marginal Attributes. > > General SMART Values: > Offline data collection status: (0x82) Offline data collection activity > was completed without error. > Auto Offline Data Collection: Enabled. > Self-test execution status: ( 0) The previous self-test routine completed > without error or no self-test has ever > been run. > Total time to complete Offline > data collection: ( 430) seconds. > Offline data collection > capabilities: (0x5b) SMART execute Offline immediate. > Auto Offline data collection on/off support. > Suspend Offline collection upon new > command. > Offline surface scan supported. > Self-test supported. > No Conveyance Self-test supported. > Selective Self-test supported. > SMART capabilities: (0x0003) Saves SMART data before entering > power-saving mode. > Supports SMART auto save timer. > Error logging capability: (0x01) Error logging supported. > General Purpose Logging supported. > Short self-test routine > recommended polling time: ( 1) minutes. > Extended self-test routine > recommended polling time: ( 64) minutes. > SCT capabilities: (0x0001) SCT Status supported. 
> > SMART Attributes Data Structure revision number: 10 > Vendor Specific SMART Attributes with Thresholds: > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE > 1 Raw_Read_Error_Rate 0x000f 100 253 006 Pre-fail Always - 0 > 3 Spin_Up_Time 0x0003 099 097 000 Pre-fail Always - 0 > 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 68 > 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 1 > 7 Seek_Error_Rate 0x000f 086 060 030 Pre-fail Always - 440801057 > 9 Power_On_Hours 0x0032 044 044 000 Old_age Always - 49106 > 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 > 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 89 > 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 > 189 High_Fly_Writes 0x003a 098 098 000 Old_age Always - 2 > 190 Airflow_Temperature_Cel 0x0022 060 030 045 Old_age Always In_the_past 40 (Min/Max 23/70 #25) > 194 Temperature_Celsius 0x0022 040 070 000 Old_age Always - 40 (0 23 0 0 0) > 195 Hardware_ECC_Recovered 0x001a 069 055 000 Old_age Always - 126632051 > 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 > 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 > 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 > 200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0 > 202 Data_Address_Mark_Errs 0x0032 100 253 000 Old_age Always - 0 > > SMART Error Log Version: 1 > No Errors Logged > > SMART Self-test log structure revision number 1 > Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error > # 1 Extended offline Completed without error 00% 37598 - > > SMART Selective self-test log data structure revision number 1 > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS > 1 0 0 Not_testing > 2 0 0 Not_testing > 3 0 0 Not_testing > 4 0 0 Not_testing > 5 0 0 Not_testing > Selective self-test flags (0x0): > After scanning selected spans, do NOT read-scan remainder of disk. > If Selective self-test is pending on power-up, resume after 0 minute delay. > [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
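A rough sketch of the checks Zygo suggests above, assuming the distribution keeps smartd's attribute history under /var/lib/smartmontools and that the suspect drive is /dev/sdb:

    # attribute history, if smartd logging is enabled, lands in files like these
    ls -l /var/lib/smartmontools/attrlog.*.csv
    # kick off a long self-test (runs in the background on the drive itself)
    smartctl -t long /dev/sdb
    # after the estimated runtime, check the self-test log and the attributes again
    smartctl -l selftest /dev/sdb
    smartctl -A /dev/sdb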
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 4:58 ` Zygo Blaxell @ 2014-11-21 7:05 ` Brendan Hide 2014-11-21 12:55 ` Ian Armstrong 2014-11-21 17:42 ` Zygo Blaxell 0 siblings, 2 replies; 49+ messages in thread From: Brendan Hide @ 2014-11-21 7:05 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs@vger.kernel.org On 2014/11/21 06:58, Zygo Blaxell wrote: > You have one reallocated sector, so the drive has lost some data at some > time in the last 49000(!) hours. Normally reallocations happen during > writes so the data that was "lost" was data you were in the process of > overwriting anyway; however, the reallocated sector count could also be > a sign of deteriorating drive integrity. > > In /var/lib/smartmontools there might be a csv file with logged error > attribute data that you could use to figure out whether that reallocation > was recent. > > I also notice you are not running regular SMART self-tests (e.g. > by smartctl -t long) and the last (and first, and only!) self-test the > drive ran was ~12000 hours ago. That means most of your SMART data is > about 18 months old. The drive won't know about sectors that went bad > in the last year and a half unless the host happens to stumble across > them during a read. > > The drive is over five years old in operating hours alone. It is probably > so fragile now that it will break if you try to move it. All interesting points. Do you schedule SMART self-tests on your own systems? I have smartd running. In theory it tracks changes and sends alerts if it figures a drive is going to fail. But, based on what you've indicated, that isn't good enough. > WARNING: errors detected during scrubbing, corrected. > [snip] > scrub device /dev/sdb2 (id 2) done > scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds > total bytes scrubbed: 189.49GiB with 5420 errors > error details: read=5 csum=5415 > corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164 > That seems a little off. If there were 5 read errors, I'd expect the drive to > have errors in the SMART error log. > > Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem. > There have been a number of fixes to csums in btrfs pulled into the kernel > recently, and I've retired two five-year-old computers this summer due > to RAM/CPU failures. The difference here is that the issue only affects the one drive. This leaves the probable cause at: - the drive itself - the cable/ports with a negligibly-possible cause at the motherboard chipset. -- __________ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 7:05 ` Brendan Hide @ 2014-11-21 12:55 ` Ian Armstrong 2014-11-21 17:45 ` Chris Murphy 2014-11-21 17:42 ` Zygo Blaxell 1 sibling, 1 reply; 49+ messages in thread From: Ian Armstrong @ 2014-11-21 12:55 UTC (permalink / raw) To: linux-btrfs@vger.kernel.org On Fri, 21 Nov 2014 09:05:32 +0200, Brendan Hide wrote: > On 2014/11/21 06:58, Zygo Blaxell wrote: > > I also notice you are not running regular SMART self-tests (e.g. > > by smartctl -t long) and the last (and first, and only!) self-test > > the drive ran was ~12000 hours ago. That means most of your SMART > > data is about 18 months old. The drive won't know about sectors > > that went bad in the last year and a half unless the host happens > > to stumble across them during a read. > > > > The drive is over five years old in operating hours alone. It is > > probably so fragile now that it will break if you try to move it. > All interesting points. Do you schedule SMART self-tests on your own > systems? I have smartd running. In theory it tracks changes and sends > alerts if it figures a drive is going to fail. But, based on what > you've indicated, that isn't good enough. Simply monitoring the smart status without a self-test isn't really that great. I'm not sure on the default config, but smartd can be made to initiate a smart self-test at regular intervals. Depending on the test type (short, long, etc) it could include a full surface scan. This can reveal things like bad sectors before you ever hit them during normal system usage. > > > WARNING: errors detected during scrubbing, corrected. > > [snip] > > scrub device /dev/sdb2 (id 2) done > > scrub started at Tue Nov 18 03:22:58 2014 and finished > > after 2682 seconds total bytes scrubbed: 189.49GiB with 5420 errors > > error details: read=5 csum=5415 > > corrected errors: 5420, uncorrectable errors: 0, unverified > > errors: 164 That seems a little off. If there were 5 read errors, > > I'd expect the drive to have errors in the SMART error log. > > > > Checksum errors could just as easily be a btrfs bug or a RAM/CPU > > problem. There have been a number of fixes to csums in btrfs pulled > > into the kernel recently, and I've retired two five-year-old > > computers this summer due to RAM/CPU failures. > The difference here is that the issue only affects the one drive. > This leaves the probable cause at: > - the drive itself > - the cable/ports > > with a negligibly-possible cause at the motherboard chipset. This is the same problem that I'm currently trying to resolve. I have one drive in a raid1 setup which shows no issues in smart status but often has checksum errors. In my situation what I've found is that if I scrub & let it fix the errors then a second pass immediately after will show no errors. If I then leave it a few days & try again there will be errors, even in old files which have not been accessed for months. If I do a read-only scrub to get a list of errors, a second scrub immediately after will show exactly the same errors. Apart from the scrub errors the system logs shows no issues with that particular drive. My next step is to disable autodefrag & see if the problem persists. (I'm not suggesting a problem with autodefrag, I just want to remove it from the equation & ensure that outside of normal file access, data isn't being rewritten between scrubs) -- Ian ^ permalink raw reply [flat|nested] 49+ messages in thread
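A sketch of the two-pass check Ian describes, assuming the filesystem is mounted at /mnt/data; the -r flag keeps a pass read-only, so identical error counts on back-to-back passes point at the media rather than a transient:

    # first pass, read-only: report errors but don't repair them
    btrfs scrub start -Bd -r /mnt/data
    # second read-only pass immediately after; compare the error counts
    btrfs scrub start -Bd -r /mnt/data
    # a normal scrub then repairs from the good raid1 copy
    btrfs scrub start -Bd /mnt/data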
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 12:55 ` Ian Armstrong @ 2014-11-21 17:45 ` Chris Murphy 2014-11-22 7:18 ` Ian Armstrong 0 siblings, 1 reply; 49+ messages in thread From: Chris Murphy @ 2014-11-21 17:45 UTC (permalink / raw) To: Ian Armstrong; +Cc: linux-btrfs@vger.kernel.org On Fri, Nov 21, 2014 at 5:55 AM, Ian Armstrong <btrfs@iarmst.co.uk> wrote: > In my situation what I've found is that if I scrub & let it fix the > errors then a second pass immediately after will show no errors. If I > then leave it a few days & try again there will be errors, even in > old files which have not been accessed for months. What are the devices? And if they're SSDs are they powered off for these few days? I take it the scrub error type is corruption? You can use badblocks to write a known pattern to the drive. Then power off and leave it for a few days. Then read the drive, matching against the pattern, and see if there are any discrepancies. Doing this outside the code path of Btrfs would fairly conclusively indicate whether it's hardware or software induced. Assuming you have another copy of all of these files :-) you could just sha256sum the two copies to see if they have in fact changed. If they have, well then you've got some silent data corruption somewhere somehow. But if they always match, then that suggests a bug. I don't see how you can get bogus corruption messages, and for it to not be a bug. When you do these scrubs that come up clean, and then later come up with corruptions, have you done any software updates? > My next step is to disable autodefrag & see if the problem persists. > (I'm not suggesting a problem with autodefrag, I just want to remove it > from the equation & ensure that outside of normal file access, data > isn't being rewritten between scrubs) I wouldn't expect autodefrag to touch old files not accessed for months. Doesn't it only affect actively used files? -- Chris Murphy ^ permalink raw reply [flat|nested] 49+ messages in thread
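If a second copy of the affected files exists, the comparison Chris suggests needs nothing more than the sketch below; the paths are hypothetical.

    # hash the copy stored on the suspect filesystem
    ( cd /mnt/data/archive && sha256sum -b * > /tmp/suspect.sums )
    # verify the other copy against those hashes; "OK" on every line means the data matches
    ( cd /backup/archive && sha256sum -c /tmp/suspect.sums )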
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 17:45 ` Chris Murphy @ 2014-11-22 7:18 ` Ian Armstrong 0 siblings, 0 replies; 49+ messages in thread From: Ian Armstrong @ 2014-11-22 7:18 UTC (permalink / raw) To: linux-btrfs On Fri, 21 Nov 2014 10:45:21 -0700 Chris Murphy wrote: > On Fri, Nov 21, 2014 at 5:55 AM, Ian Armstrong <btrfs@iarmst.co.uk> > wrote: > > > In my situation what I've found is that if I scrub & let it fix the > > errors then a second pass immediately after will show no errors. If > > I then leave it a few days & try again there will be errors, even in > > old files which have not been accessed for months. > > What are the devices? And if they're SSDs are they powered off for > these few days? I take it the scrub error type is corruption? It's spinning rust and the checksum error is always on the one drive (a SAMSUNG HD204UI). The firmware has been updated, since some were shipped with a bad version which could result in data corruption. > You can use badblocks to write a known pattern to the drive. Then > power off and leave it for a few days. Then read the drive, matching > against the pattern, and see if there are any discrepancies. Doing > this outside the code path of Btrfs would fairly conclusively indicate > whether it's hardware or software induced. Unfortunately I'm reluctant to go the badblock route for the entire drive since it's the second drive in a 2 drive raid1 and I don't currently have a spare. There is a small 6G partition that I can use, but given that the drive is large and the errors are few, it could take a while for anything to show. I also have a second 2 drive btrfs raid1 in the same machine that doesn't have this problem. All the drives are running off the same controller. > Assuming you have another copy of all of these files :-) you could > just sha256sum the two copies to see if they have in fact changed. If > they have, well then you've got some silent data corruption somewhere > somehow. But if they always match, then that suggests a bug. Some of the files already have an md5 linked to them, while others have parity files to give some level of recovery from corruption or damage. Checking against these show no problems, so I assume that btrfs is doing its job & only serving an intact file. > I don't > see how you can get bogus corruption messages, and for it to not be a > bug. When you do these scrubs that come up clean, and then later come > up with corruptions, have you done any software updates? No software updates between clean & corrupt. I don't have to power down or reboot either for checksum errors to appear. I don't think the corruption messages are bogus, but are indicating a genuine problem. What I would like to be able to do is compare the corrupt block with the one used to repair it and see what the difference is. As I've already stated, the system logs are clean & the smart logs aren't showing any issues. (Well, until today when a self-test failed with a read error, but it must be an unused sector since the scrub doesn't hit it & there are no re-allocated sectors yet) > > My next step is to disable autodefrag & see if the problem persists. > > (I'm not suggesting a problem with autodefrag, I just want to > > remove it from the equation & ensure that outside of normal file > > access, data isn't being rewritten between scrubs) > > I wouldn't expect autodefrag to touch old files not accessed for > months. Doesn't it only affect actively used files? 
The drive is mainly used to hold old archive files, though there are daily rotating files on it as well. The corruption affects both new and old files. -- Ian ^ permalink raw reply [flat|nested] 49+ messages in thread
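A minimal sketch of the badblocks pattern test on that small spare partition; the device name is an assumption, and -w is destructive, so it must only be pointed at a partition holding nothing of value.

    # write a fixed pattern across the partition, read it straight back, report mismatches
    badblocks -ws -t 0xaa /dev/sdc6
    # days later, a read-only pass comparing against the same previously written pattern
    badblocks -s -t 0xaa /dev/sdc6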
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 7:05 ` Brendan Hide 2014-11-21 12:55 ` Ian Armstrong @ 2014-11-21 17:42 ` Zygo Blaxell 2014-11-21 18:06 ` Chris Murphy 1 sibling, 1 reply; 49+ messages in thread From: Zygo Blaxell @ 2014-11-21 17:42 UTC (permalink / raw) To: Brendan Hide; +Cc: linux-btrfs@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 3799 bytes --] On Fri, Nov 21, 2014 at 09:05:32AM +0200, Brendan Hide wrote: > On 2014/11/21 06:58, Zygo Blaxell wrote: > >You have one reallocated sector, so the drive has lost some data at some > >time in the last 49000(!) hours. Normally reallocations happen during > >writes so the data that was "lost" was data you were in the process of > >overwriting anyway; however, the reallocated sector count could also be > >a sign of deteriorating drive integrity. > > > >In /var/lib/smartmontools there might be a csv file with logged error > >attribute data that you could use to figure out whether that reallocation > >was recent. > > > >I also notice you are not running regular SMART self-tests (e.g. > >by smartctl -t long) and the last (and first, and only!) self-test the > >drive ran was ~12000 hours ago. That means most of your SMART data is > >about 18 months old. The drive won't know about sectors that went bad > >in the last year and a half unless the host happens to stumble across > >them during a read. > > > >The drive is over five years old in operating hours alone. It is probably > >so fragile now that it will break if you try to move it. > All interesting points. Do you schedule SMART self-tests on your own > systems? I have smartd running. In theory it tracks changes and > sends alerts if it figures a drive is going to fail. But, based on > what you've indicated, that isn't good enough. I run 'smartctl -t long' from cron overnight (or whenever the drives are most idle). You can also set up smartd.conf to launch the self tests; however, the syntax for test scheduling is byzantine compared to cron (and that's saying something!). On multi-drive systems I schedule a different drive for each night. If you are also doing btrfs scrub, then stagger the scheduling so e.g. smart runs in even weeks and btrfs scrub runs in odd weeks. smartd is OK for monitoring test logs and email alerts. I've had no problems there. > >WARNING: errors detected during scrubbing, corrected. > >[snip] > >scrub device /dev/sdb2 (id 2) done > > scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds > > total bytes scrubbed: 189.49GiB with 5420 errors > > error details: read=5 csum=5415 > > corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164 > >That seems a little off. If there were 5 read errors, I'd expect the drive to > >have errors in the SMART error log. > > > >Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem. > >There have been a number of fixes to csums in btrfs pulled into the kernel > >recently, and I've retired two five-year-old computers this summer due > >to RAM/CPU failures. > The difference here is that the issue only affects the one drive. > This leaves the probable cause at: > - the drive itself > - the cable/ports > > with a negligibly-possible cause at the motherboard chipset. If it was cable, there should be UDMA CRC errors or similar in the SMART counters, but they are zero. You can also try swapping the cable and seeing whether the errors move. I've found many bad cables that way. 
The drive itself could be failing in some way that prevents recording SMART errors (e.g. because of host timeouts triggering a bus reset, which also prevents the SMART counter update for what was going wrong at the time). This is unfortunately quite common, especially with drives configured for non-RAID workloads. > > -- > __________ > Brendan Hide > http://swiftspirit.co.za/ > http://www.webafrica.co.za/?AFF1E97 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
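A crontab sketch of the schedule Zygo describes: one long self-test per night, a different drive each night, with the btrfs scrub kept clear of the self-test windows. Device names, binary paths, mount point, and times are assumptions, and his even/odd-week staggering would need a small wrapper script on top.

    # m h dom mon dow  command
    0 3 * * 1  /usr/sbin/smartctl -t long /dev/sda
    0 3 * * 2  /usr/sbin/smartctl -t long /dev/sdb
    # scrub early Sunday morning, when no self-test is scheduled
    0 4 * * 0  /usr/bin/btrfs scrub start -Bd /mnt/data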
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 17:42 ` Zygo Blaxell @ 2014-11-21 18:06 ` Chris Murphy 2014-11-22 2:25 ` Zygo Blaxell 0 siblings, 1 reply; 49+ messages in thread From: Chris Murphy @ 2014-11-21 18:06 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Brendan Hide, linux-btrfs@vger.kernel.org On Fri, Nov 21, 2014 at 10:42 AM, Zygo Blaxell <zblaxell@furryterror.org> wrote: > I run 'smartctl -t long' from cron overnight (or whenever the drives > are most idle). You can also set up smartd.conf to launch the self > tests; however, the syntax for test scheduling is byzantine compared to > cron (and that's saying something!). On multi-drive systems I schedule > a different drive for each night. > > If you are also doing btrfs scrub, then stagger the scheduling so > e.g. smart runs in even weeks and btrfs scrub runs in odd weeks. > > smartd is OK for monitoring test logs and email alerts. I've had no > problems there. Most attributes are always updated without issuing a smart test of any kind. A drive I have here only has four offline updateable attributes. When it comes to bad sectors, the drive won't use a sector that persistently fails writes. So you don't really have to worry about latent bad sectors that don't have data on them already. The sectors you care about are the ones with data. A scrub reads all of those sectors. First the drive could report a read error in which case Btrfs raid1/10, and any (md, lvm, hardware) raid can use mirrored data, or rebuild it from parity, and write to the affected sector; and also this same mechanism happens in normal reads so it's a kind of passive scrub. But it happens to miss checking inactively read data, which a scrub will check. Second, the drive could report no problem, and Btrfs raid1/10 could still fix the problem in case of a csum mismatch. And it looks like soonish we'll see this apply to raid5/6. So I think a nightly long smart test is a bit overkill. I think you could do nightly -t short tests which will report problems scrub won't notice, such as higher seek times or lower throughput performance. And then scrub once a week. > The drive itself could be failing in some way that prevents recording > SMART errors (e.g. because of host timeouts triggering a bus reset, > which also prevents the SMART counter update for what was going wrong at > the time). This is unfortunately quite common, especially with drives > configured for non-RAID workloads. Libata resetting the link should be recorded in kernel messages. -- Chris Murphy ^ permalink raw reply [flat|nested] 49+ messages in thread
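Chris's alternative maps directly onto smartd's own scheduler; a smartd.conf sketch, assuming a single monitored drive (the -s regex is the T/MM/DD/d/HH syntax Zygo calls byzantine):

    # monitor all attributes, mail root on problems,
    # short self-test every night at 02:00, long self-test Saturdays at 03:00
    /dev/sda -a -m root -s (S/../.././02|L/../../6/03)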
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 18:06 ` Chris Murphy @ 2014-11-22 2:25 ` Zygo Blaxell 0 siblings, 0 replies; 49+ messages in thread From: Zygo Blaxell @ 2014-11-22 2:25 UTC (permalink / raw) To: Chris Murphy; +Cc: Brendan Hide, linux-btrfs@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 3608 bytes --] On Fri, Nov 21, 2014 at 11:06:19AM -0700, Chris Murphy wrote: > On Fri, Nov 21, 2014 at 10:42 AM, Zygo Blaxell <zblaxell@furryterror.org> wrote: > > > I run 'smartctl -t long' from cron overnight (or whenever the drives > > are most idle). You can also set up smartd.conf to launch the self > > tests; however, the syntax for test scheduling is byzantine compared to > > cron (and that's saying something!). On multi-drive systems I schedule > > a different drive for each night. > > > > If you are also doing btrfs scrub, then stagger the scheduling so > > e.g. smart runs in even weeks and btrfs scrub runs in odd weeks. > > > > smartd is OK for monitoring test logs and email alerts. I've had no > > problems there. > > Most attributes are always updated without issuing a smart test of any > kind. A drive I have here only has four offline updateable attributes. One of those four is Offline_Uncorrectable, which is a really important attribute to monitor! > When it comes to bad sectors, the drive won't use a sector that > persistently fails writes. So you don't really have to worry about > latent bad sectors that don't have data on them already. The sectors > you care about are the ones with data. A scrub reads all of those > sectors. A scrub reads all the _allocated_ sectors. A long selftest reads _everything_, and also exercises the electronics and mechanics of the drive in ways that normal operation doesn't. I have several disks that are less than 25% occupied, which means scrubs will ignore 75% of the disk surface at any given time. A sharp increase in the number of bad sectors (no matter how they are detected) usually indicates a total drive failure is coming. Many drives have been nice enough to give me enough warning for their RMA replacements to be delivered just a few hours before the drive totally fails. > First the drive could report a read error in which case Btrfs > raid1/10, and any (md, lvm, hardware) raid can use mirrored data, or > rebuild it from parity, and write to the affected sector; and also > this same mechanism happens in normal reads so it's a kind of passive > scrub. But it happens to miss checking inactively read data, which a > scrub will check. > > Second, the drive could report no problem, and Btrfs raid1/10 could > still fix the problem in case of a csum mismatch. And it looks like > soonish we'll see this apply to raid5/6. > > So I think a nightly long smart test is a bit overkill. I think you > could do nightly -t short tests which will report problems scrub won't > notice, such as higher seek times or lower throughput performance. And > then scrub once a week. Drives quite often drop a sector or two over the years, and it can be harmless. What you want to be watching out for is hundreds of bad sectors showing up over a period of few days--that means something is rattling around on the disk platters, damaging the hardware as it goes. To get that data, you have to test the disks every few days. > > The drive itself could be failing in some way that prevents recording > > SMART errors (e.g. 
because of host timeouts triggering a bus reset, > > which also prevents the SMART counter update for what was going wrong at > > the time). This is unfortunately quite common, especially with drives > > configured for non-RAID workloads. > > Libata resetting the link should be recorded in kernel messages. This is true, but the original question was about SMART data coverage. This is why it's important to monitor both. > -- > Chris Murphy [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
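To watch for the link resets Chris mentions alongside the SMART counters, something along these lines is usually enough; the grep patterns are only a starting point.

    # libata link resets, exceptions and failed commands in the current boot's kernel log
    dmesg | grep -Ei 'ata[0-9]+.*(reset|exception|failed command)'
    # same check via the journal, including earlier boots if persistent logging is enabled
    journalctl -k | grep -Ei 'ata[0-9]+.*(reset|exception)'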