* scrub implies failing drive - smartctl blissfully unaware
From: Brendan Hide @ 2014-11-18 7:29 UTC
To: linux-btrfs@vger.kernel.org

Hey, guys

See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now.

The disk is patently unreliable, but smartctl's output implies there are no issues. Is this somehow standard fare for S.M.A.R.T. output?

Here are (I think) the important bits of the smartctl output for $(smartctl -a /dev/sdb) (the full results are attached):

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 100   253   006    Pre-fail Always  -           0
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           1
  7 Seek_Error_Rate         0x000f 086   060   030    Pre-fail Always  -           440801014
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0000 100   253   000    Old_age  Offline -           0
202 Data_Address_Mark_Errs  0x0032 100   253   000    Old_age  Always  -           0

-------- Original Message --------
Subject: Cron <root@watricky> /usr/local/sbin/btrfs-scrub-all
Date: Tue, 18 Nov 2014 04:19:12 +0200
From: (Cron Daemon) <root@watricky>
To: brendan@watricky

WARNING: errors detected during scrubbing, corrected.
[snip]
scrub device /dev/sdb2 (id 2) done
        scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
        total bytes scrubbed: 189.49GiB with 5420 errors
        error details: read=5 csum=5415
        corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
[snip]

[-- Attachment #2: sdbres[1].txt --]

smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.17.2-1-ARCH] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3250410AS
Serial Number:    6RYF5NP7
Firmware Version: 4.AAA
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:    Tue Nov 18 09:16:03 2014 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity was
                                        completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline data collection: ( 430) seconds.
Offline data collection capabilities:    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine recommended polling time:    (  1) minutes.
Extended self-test routine recommended polling time: ( 64) minutes.
SCT capabilities:              (0x0001) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 100   253   006    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0003 099   097   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           68
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           1
  7 Seek_Error_Rate         0x000f 086   060   030    Pre-fail Always  -           440801057
  9 Power_On_Hours          0x0032 044   044   000    Old_age  Always  -           49106
 10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           89
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
189 High_Fly_Writes         0x003a 098   098   000    Old_age  Always  -           2
190 Airflow_Temperature_Cel 0x0022 060   030   045    Old_age  Always  In_the_past 40 (Min/Max 23/70 #25)
194 Temperature_Celsius     0x0022 040   070   000    Old_age  Always  -           40 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a 069   055   000    Old_age  Always  -           126632051
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0000 100   253   000    Old_age  Offline -           0
202 Data_Address_Mark_Errs  0x0032 100   253   000    Old_age  Always  -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%        37598            -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
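For anyone reproducing this diagnosis: btrfs also keeps per-device error counters, which make it easy to confirm which member of the raid1 is producing the csum errors that scrub reports. A minimal check, with the mount point as a placeholder:

    btrfs device stats /mnt/pool        # cumulative read/write/flush/corruption/generation error counters per device
    btrfs device stats -z /mnt/pool     # reset the counters, e.g. after swapping the cable or drive

The corruption counter for /dev/sdb2 should climb in step with the csum errors shown in the scrub summary above.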
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Roman Mamedov @ 2014-11-18 7:36 UTC
To: Brendan Hide; +Cc: linux-btrfs@vger.kernel.org

On Tue, 18 Nov 2014 09:29:54 +0200
Brendan Hide <brendan@swiftspirit.co.za> wrote:

> Hey, guys
>
> See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now.
>
> The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard faire for S.M.A.R.T. output?

Not necessarily the disk's fault, could be a SATA controller issue. How are your disks connected, which controller brand and chip? Add lspci output, at least if something other than the ordinary "to the motherboard chipset's built-in ports".

-- 
With respect,
Roman
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Brendan Hide @ 2014-11-18 13:24 UTC
To: Roman Mamedov; +Cc: linux-btrfs@vger.kernel.org

On 2014/11/18 09:36, Roman Mamedov wrote:
> On Tue, 18 Nov 2014 09:29:54 +0200
> Brendan Hide <brendan@swiftspirit.co.za> wrote:
>
>> Hey, guys
>>
>> See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now.
>>
>> The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard faire for S.M.A.R.T. output?
>
> Not necessarily the disk's fault, could be a SATA controller issue. How are your disks connected, which controller brand and chip? Add lspci output, at least if something other than the ordinary "to the motherboard chipset's built-in ports".

In this case, yup, it's directly to the motherboard chipset's built-in ports. This is a very old desktop, and the other 3 disks don't have any issues. I'm checking out the alternative pointed out by Austin.

SATA-relevant lspci output:
00:1f.2 SATA controller: Intel Corporation 82801JD/DO (ICH10 Family) SATA AHCI Controller (rev 02)

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Duncan @ 2014-11-18 15:16 UTC
To: linux-btrfs

Brendan Hide posted on Tue, 18 Nov 2014 15:24:48 +0200 as excerpted:

> In this case, yup, its directly to the motherboard chipset's built-in ports. This is a very old desktop, and the other 3 disks don't have any issues. I'm checking out the alternative pointed out by Austin.
>
> SATA-relevant lspci output:
> 00:1f.2 SATA controller: Intel Corporation 82801JD/DO (ICH10 Family) SATA AHCI Controller (rev 02)

I guess your definition of _very_ old desktop, and mine, are _very_ different.

* A quick check of wikipedia says the ICH10 wasn't even /introduced/ until 2008 (the wiki link for the 82801jo/do points to an Intel page, which says it was launched Q3-2008), and it would have been some time after that, likely 2009, that you actually purchased the machine. 2009 is five years ago, middle-aged yes, arguably old, but _very_ old, not so much in this day and age of longer system replace cycles.

* It has SATA, not IDE/PATA.

* It was PCIE 1.1, not PCI-X or PCI and AGP, and DEFINITELY not ISA bus, with or without VLB!

* It has USB 2.0 ports, not USB 1.1, and not only serial/parallel/ps2, and DEFINITELY not an AT keyboard.

* It has Gigabit Ethernet, not simply Fast Ethernet or just Ethernet, and DEFINITELY Ethernet not token-ring.

* It already has Intel Virtualization technology and HD audio instead of AC97 or earlier.

Now I can certainly imagine an "old" desktop having most of these, but you said _very_ old, not simply old, and _very_ old to me would mean PATA/USB-1/AGP/PCI/FastEthernet with AC97 audio or earlier and no virtualization. 64-bit would be questionable as well.

FWIW, I've been playing minitube/youtube C64 music the last few days. Martin Galway, etc. Now C64 really _IS_ _very_ old!

Also FWIW, "only" a couple years ago now (well, about three, time flies!), my old 2003 vintage original 3-digit Opteron based mobo died due to bulging/burst capacitors, after serving me 8 years. I was shooting for a full decade but didn't quite make it...

So indeed, 2009 vintage system, five years, definitely not _very_ old, arguably not even "old", more like middle-aged. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Austin S Hemmelgarn @ 2014-11-18 12:08 UTC
To: Brendan Hide, linux-btrfs@vger.kernel.org

On 2014-11-18 02:29, Brendan Hide wrote:
> Hey, guys
>
> See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now.
>
> The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard faire for S.M.A.R.T. output?
>
> Here are (I think) the important bits of the smartctl output for $(smartctl -a /dev/sdb) (the full results are attached):
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f 100 253 006 Pre-fail Always  - 0
>   5 Reallocated_Sector_Ct   0x0033 100 100 036 Pre-fail Always  - 1
>   7 Seek_Error_Rate         0x000f 086 060 030 Pre-fail Always  - 440801014
> 197 Current_Pending_Sector  0x0012 100 100 000 Old_age  Always  - 0
> 198 Offline_Uncorrectable   0x0010 100 100 000 Old_age  Offline - 0
> 199 UDMA_CRC_Error_Count    0x003e 200 200 000 Old_age  Always  - 0
> 200 Multi_Zone_Error_Rate   0x0000 100 253 000 Old_age  Offline - 0
> 202 Data_Address_Mark_Errs  0x0032 100 253 000 Old_age  Always  - 0
>
> -------- Original Message --------
> Subject: Cron <root@watricky> /usr/local/sbin/btrfs-scrub-all
> Date: Tue, 18 Nov 2014 04:19:12 +0200
> From: (Cron Daemon) <root@watricky>
> To: brendan@watricky
>
> WARNING: errors detected during scrubbing, corrected.
> [snip]
> scrub device /dev/sdb2 (id 2) done
>         scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
>         total bytes scrubbed: 189.49GiB with 5420 errors
>         error details: read=5 csum=5415
>         corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
> [snip]

In addition to the storage controller being a possibility as mentioned in another reply, there are some parts of the drive that aren't covered by SMART attributes on most disks, most notably the on-drive cache. There really isn't a way to disable the read cache on the drive, but you can disable write-caching, which may improve things (and if it's a cheap disk, may provide better reliability for BTRFS as well).

The other thing I would suggest trying is a different data cable to the drive itself; I've had issues with some SATA cables (the cheap red ones you get in the retail packaging for some hard disks in particular) having either bad connectors or bad strain-reliefs, and failing after only a few hundred hours of use.
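A sketch of the write-cache toggle described above, using hdparm; the device name is a placeholder, and note that disabling the volatile write cache costs some write performance and on many drives does not persist across a power cycle:

    hdparm -W  /dev/sdb    # query the current write-cache setting
    hdparm -W0 /dev/sdb    # disable the drive's volatile write cache
    hdparm -W1 /dev/sdb    # re-enable it later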
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Brendan Hide @ 2014-11-18 13:25 UTC
To: Austin S Hemmelgarn; +Cc: linux-btrfs@vger.kernel.org

On 2014/11/18 14:08, Austin S Hemmelgarn wrote:
> [snip] there are some parts of the drive that aren't covered by SMART attributes on most disks, most notably the on-drive cache. There really isn't a way to disable the read cache on the drive, but you can disable write-caching.

It's an old and replaceable disk - but if the cable replacement doesn't work I'll try this for kicks. :)

> The other thing I would suggest trying is a different data cable to the drive itself, I've had issues with some SATA cables (the cheap red ones you get in the retail packaging for some hard disks in particular) having either bad connectors, or bad strain-reliefs, and failing after only a few hundred hours of use.

Thanks. I'll try this first. :)

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Phillip Susi @ 2014-11-18 16:02 UTC
To: Austin S Hemmelgarn, Brendan Hide, linux-btrfs@vger.kernel.org

On 11/18/2014 7:08 AM, Austin S Hemmelgarn wrote:
> In addition to the storage controller being a possibility as mentioned in another reply, there are some parts of the drive that aren't covered by SMART attributes on most disks, most notably the on-drive cache. There really isn't a way to disable the read cache on the drive, but you can disable write-caching, which may improve things (and if it's a cheap disk, may provide better reliability for BTRFS as well). The other thing I would suggest trying is a different data cable to the drive itself, I've had issues with some SATA cables (the cheap red ones you get in the retail packaging for some hard disks in particular) having either bad connectors, or bad strain-reliefs, and failing after only a few hundred hours of use.

SATA does CRC the data going across it, so if it is a bad cable you get CRC, or often times 8b10b coding errors, and the transfer is aborted rather than returning bad data.
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Marc MERLIN @ 2014-11-18 15:35 UTC
To: Brendan Hide; +Cc: linux-btrfs@vger.kernel.org

On Tue, Nov 18, 2014 at 09:29:54AM +0200, Brendan Hide wrote:
> Hey, guys
>
> See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now.
>
> The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard faire for S.M.A.R.T. output?

Try running hdrecover on your drive, it'll scan all your blocks and try to rewrite the ones that are failing, if any:
http://hdrecover.sourceforge.net/

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
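If hdrecover isn't packaged for your distro, a plain read-only surface scan gives similar information about whether the drive itself ever reports a read error (non-destructive as shown; the device name is a placeholder):

    badblocks -sv /dev/sdb    # read-only scan of the whole device; prints the block numbers that fail to read

As the following replies point out, though, a clean pass here would only confirm that the corruption is happening somewhere other than the platters.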
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Phillip Susi @ 2014-11-18 16:04 UTC
To: Marc MERLIN, Brendan Hide; +Cc: linux-btrfs@vger.kernel.org

On 11/18/2014 10:35 AM, Marc MERLIN wrote:
> Try running hdrecover on your drive, it'll scan all your blocks and try to rewrite the ones that are failing, if any:
> http://hdrecover.sourceforge.net/

He doesn't have blocks that are failing; he has blocks that are being silently corrupted.
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Marc MERLIN @ 2014-11-18 16:11 UTC
To: Phillip Susi; +Cc: Brendan Hide, linux-btrfs@vger.kernel.org

On Tue, Nov 18, 2014 at 11:04:00AM -0500, Phillip Susi wrote:
> On 11/18/2014 10:35 AM, Marc MERLIN wrote:
>> Try running hdrecover on your drive, it'll scan all your blocks and try to rewrite the ones that are failing, if any:
>> http://hdrecover.sourceforge.net/
>
> He doesn't have blocks that are failing; he has blocks that are being silently corrupted.

That seems to be the case, but hdrecover will rule that part out at least.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Phillip Susi @ 2014-11-18 16:26 UTC
To: Marc MERLIN; +Cc: Brendan Hide, linux-btrfs@vger.kernel.org

On 11/18/2014 11:11 AM, Marc MERLIN wrote:
> That seems to be the case, but hdrecover will rule that part out at least.

It's already ruled out: if the read failed, that is what the error message would have said rather than a bad checksum.
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Chris Murphy @ 2014-11-18 18:57 UTC
To: Btrfs BTRFS

On Nov 18, 2014, at 8:35 AM, Marc MERLIN <marc@merlins.org> wrote:
> On Tue, Nov 18, 2014 at 09:29:54AM +0200, Brendan Hide wrote:
>> Hey, guys
>>
>> See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now.
>>
>> The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard faire for S.M.A.R.T. output?
>
> Try running hdrecover on your drive, it'll scan all your blocks and try to rewrite the ones that are failing, if any:
> http://hdrecover.sourceforge.net/

The only way it can know if there is a bad sector is if the drive returns a read error, which will include the LBA for the affected sector(s). This is the same thing that would be done with scrub, except any bad sectors that don’t contain data.

A common problem getting a drive to issue the read error, however, is a mismatch between the scsi command timer setting (default 30 seconds) and the SCT error recovery control setting for the drive. The drive SCT ERC value needs to be shorter than the scsi command timer value, otherwise some bad sector errors will cause the drive to go into a longer recovery attempt beyond the scsi command timer value. If that happens, the ata link is reset, and there’s no possibility of finding out what the affected sector is.

So a.) use smartctl -l scterc to change the value below 30 seconds (300 deciseconds), with 70 deciseconds being reasonable. If the drive doesn’t support SCT commands, then b.) change the linux scsi command timer to be greater than 120 seconds.

Strictly speaking the command timer would be set to a value that ensures there are no link reset messages in dmesg, one that’s long enough that the drive itself times out and actually reports a read error. This could be much shorter than 120 seconds. I don’t know if there are any consumer drives that try longer than 2 minutes to recover data from a marginally bad sector.

Ideally though, don’t use drives that lack SCT support in multiple device volume configurations. An up to 2 minute hang of the storage stack isn’t production compatible for most workflows.

Chris Murphy
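The two knobs described above can be inspected and set roughly like this (sdX is a placeholder; scterc values are in deciseconds, and on most drives the setting does not survive a power cycle, so it has to be reapplied at boot):

    smartctl -l scterc /dev/sdX                 # query current SCT ERC settings; fails if the drive lacks SCT ERC support
    smartctl -l scterc,70,70 /dev/sdX           # cap read/write error recovery at 7 seconds
    cat /sys/block/sdX/device/timeout           # kernel's SCSI command timer for this device, in seconds (default 30)
    echo 180 > /sys/block/sdX/device/timeout    # for drives without SCT ERC, raise the kernel timeout instead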
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Phillip Susi @ 2014-11-18 20:58 UTC
To: Chris Murphy, Btrfs BTRFS

On 11/18/2014 1:57 PM, Chris Murphy wrote:
> So a.) use smartctl -l scterc to change the value below 30 seconds (300 deciseconds) with 70 deciseconds being reasonable. If the drive doesn’t support SCT commands, then b.) change the linux scsi command timer to be greater than 120 seconds.
>
> Strictly speaking the command timer would be set to a value that ensures there are no link reset messages in dmesg, that it’s long enough that the drive itself times out and actually reports a read error. This could be much shorter than 120 seconds. I don’t know if there are any consumer drives that try longer than 2 minutes to recover data from a marginally bad sector.

Are there really any that take longer than 30 seconds? That's enough time for thousands of retries. If it can't be read after a dozen tries, it ain't never gonna work. It seems absurd that a drive would keep trying for so long.

> Ideally though, don’t use drives that lack SCT support in multiple device volume configurations. An up to 2 minute hang of the storage stack isn’t production compatible for most workflows.

Wasn't there an early failure flag that md (and therefore btrfs when doing raid) sets so the scsi stack doesn't bother with recovery attempts and just fails the request? Thus if the drive takes longer than the scsi_timeout, the failure would be reported to btrfs, which then can recover using the other copy, write it back to the bad drive, and hopefully that fixes it?

In that case, you probably want to lower the timeout so that the recovery kicks in sooner instead of hanging your IO stack for 30 seconds.
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Chris Murphy @ 2014-11-19 2:40 UTC
To: Phillip Susi; +Cc: Btrfs BTRFS

On Nov 18, 2014, at 1:58 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 11/18/2014 1:57 PM, Chris Murphy wrote:
>> So a.) use smartctl -l scterc to change the value below 30 seconds (300 deciseconds) with 70 deciseconds being reasonable. If the drive doesn’t support SCT commands, then b.) change the linux scsi command timer to be greater than 120 seconds.
>>
>> Strictly speaking the command timer would be set to a value that ensures there are no link reset messages in dmesg, that it’s long enough that the drive itself times out and actually reports a read error. This could be much shorter than 120 seconds. I don’t know if there are any consumer drives that try longer than 2 minutes to recover data from a marginally bad sector.
>
> Are there really any that take longer than 30 seconds? That's enough time for thousands of retries. If it can't be read after a dozen tries, it ain't never gonna work. It seems absurd that a drive would keep trying for so long.

It’s well known on linux-raid@ that consumer drives have well over 30 second "deep recoveries" when they lack SCT command support. The WDC and Seagate “green” drives are over 2 minutes apparently. This isn’t easy to test because it requires a sector with enough error that it requires the ECC to do something, and yet not so much error that it gives up in less than 30 seconds. So you have to track down a drive model spec document (one of those 100 pagers).

This makes sense, sorta, because the manufacturer use case is typically single drive only, and most proscribe raid5/6 with such products. So it’s a “recover data at all costs” behavior, because it’s assumed to be the only (immediately) available copy.

>> Ideally though, don’t use drives that lack SCT support in multiple device volume configurations. An up to 2 minute hang of the storage stack isn’t production compatible for most workflows.
>
> Wasn't there an early failure flag that md ( and therefore, btrfs when doing raid ) sets so the scsi stack doesn't bother with recovery attempts and just fails the request? Thus if the drive takes longer than the scsi_timeout, the failure would be reported to btrfs, which then can recover using the other copy, write it back to the bad drive, and hopefully that fixes it?

I don’t see how that’s possible, because anything other than the drive explicitly producing a read error (which includes the affected LBAs) is ambiguous as far as the kernel is concerned. It has no way of knowing which of possibly dozens of ata commands queued up in the drive have actually hung up the drive. It has no idea why the drive is hung up as well.

The linux-raid@ list is chock full of users having these kinds of problems. It comes up pretty much every week. Someone has an e.g. raid5, and in dmesg all they get are a bunch of ata bus reset messages. So someone tells them to change the scsi command timer for all the block devices that are members of the array in question, and retry (reading a file, or scrub, or whatever), and lo and behold, no more ata bus reset messages. Instead they get explicit read errors with LBAs, and now md can fix the problem.

> In that case, you probably want to lower the timeout so that the recover kicks in sooner instead of hanging your IO stack for 30 seconds.

No, I think 30 is pretty sane for servers using SATA drives, because if the bus is reset all pending commands in the queue get obliterated, which is worse than just waiting up to 30 seconds. With SAS drives maybe less time makes sense. But in either case you still need configurable SCT ERC, or it needs to be a sane fixed default like 70 deciseconds.

Chris Murphy
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Phillip Susi @ 2014-11-19 15:11 UTC
To: Chris Murphy; +Cc: Btrfs BTRFS

On 11/18/2014 9:40 PM, Chris Murphy wrote:
> It’s well known on linux-raid@ that consumer drives have well over 30 second "deep recoveries" when they lack SCT command support. The WDC and Seagate “green” drives are over 2 minutes apparently. This isn’t easy to test because it requires a sector with enough error that it requires the ECC to do something, and yet not so much error that it gives up in less than 30 seconds. So you have to track down a drive model spec document (one of those 100 pagers).
>
> This makes sense, sorta, because the manufacturer use case is typically single drive only, and most proscribe raid5/6 with such products. So it’s a “recover data at all costs” behavior because it’s assumed to be the only (immediately) available copy.

It doesn't make sense to me. If it can't recover the data after one or two hundred retries in one or two seconds, it can keep trying until the cows come home and it just isn't ever going to work.

> I don’t see how that’s possible because anything other than the drive explicitly producing a read error (which includes the affected LBA’s), it’s ambiguous what the actual problem is as far as the kernel is concerned. It has no way of knowing which of possibly dozens of ata commands queued up in the drive have actually hung up the drive. It has no idea why the drive is hung up as well.

IIRC, this is true when the drive returns failure as well. The whole bio is marked as failed, and the page cache layer then begins retrying with progressively smaller requests to see if it can get *some* data out.

> No I think 30 is pretty sane for servers using SATA drives because if the bus is reset all pending commands in the queue get obliterated which is worse than just waiting up to 30 seconds. With SAS drives maybe less time makes sense. But in either case you still need configurable SCT ERC, or it needs to be a sane fixed default like 70 deciseconds.

Who cares if multiple commands in the queue are obliterated if they can all be retried on the other mirror? Better to fall back to the other mirror NOW instead of waiting 30 seconds (or longer!). Sure, you might end up recovering more than you really had to, but that won't hurt anything.
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Chris Murphy @ 2014-11-20 0:05 UTC
To: Btrfs BTRFS

On Wed, Nov 19, 2014 at 8:11 AM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 11/18/2014 9:40 PM, Chris Murphy wrote:
>> It’s well known on linux-raid@ that consumer drives have well over 30 second "deep recoveries" when they lack SCT command support. The WDC and Seagate “green” drives are over 2 minutes apparently. This isn’t easy to test because it requires a sector with enough error that it requires the ECC to do something, and yet not so much error that it gives up in less than 30 seconds. So you have to track down a drive model spec document (one of those 100 pagers).
>>
>> This makes sense, sorta, because the manufacturer use case is typically single drive only, and most proscribe raid5/6 with such products. So it’s a “recover data at all costs” behavior because it’s assumed to be the only (immediately) available copy.
>
> It doesn't make sense to me. If it can't recover the data after one or two hundred retries in one or two seconds, it can keep trying until the cows come home and it just isn't ever going to work.

I'm not a hard drive engineer, so I can't argue either point. But consumer drives clearly do behave this way. On Linux, the kernel's default 30 second command timer eventually results in what look like link errors rather than drive read errors. And instead of the problems being fixed with the normal md and btrfs recovery mechanisms, the errors simply get worse and eventually there's data loss. Exhibits A, B, C, D - the linux-raid list is full to the brim of such reports and their solution.

>> I don’t see how that’s possible because anything other than the drive explicitly producing a read error (which includes the affected LBA’s), it’s ambiguous what the actual problem is as far as the kernel is concerned. It has no way of knowing which of possibly dozens of ata commands queued up in the drive have actually hung up the drive. It has no idea why the drive is hung up as well.
>
> IIRC, this is true when the drive returns failure as well. The whole bio is marked as failed, and the page cache layer then begins retrying with progressively smaller requests to see if it can get *some* data out.

Well that's very coarse. It's not at a sector level, so as long as the drive continues to try to read from a particular LBA, but fails to either succeed reading or give up and report a read error within 30 seconds, then you just get a bunch of wonky system behavior.

Conversely, what I've observed on Windows in such a case is that it tolerates these deep recoveries on consumer drives. So they just get really slow, but the drive does seem to eventually recover (until it doesn't). But yeah, 2 minutes is a long time. So then the user gets annoyed and reinstalls their system. Since that means writing to the affected drive, the firmware logic causes bad sectors to be dereferenced when the write error is persistent. Problem solved, faster system.

>> No I think 30 is pretty sane for servers using SATA drives because if the bus is reset all pending commands in the queue get obliterated which is worse than just waiting up to 30 seconds. With SAS drives maybe less time makes sense. But in either case you still need configurable SCT ERC, or it needs to be a sane fixed default like 70 deciseconds.
>
> Who cares if multiple commands in the queue are obliterated if they can all be retried on the other mirror?

Because now you have a member drive that's inconsistent. At least in the md raid case, a certain number of read failures causes the drive to be ejected from the array. Anytime there's a write failure, it's ejected from the array too. What you want is for the drive to give up sooner with an explicit read error, so md can help fix the problem by writing good data to the affected LBA. That doesn't happen when there are a bunch of link resets happening.

> Better to fall back to the other mirror NOW instead of waiting 30 seconds ( or longer! ). Sure, you might end up recovering more than you really had to, but that won't hurt anything.

Again, if your drive SCT ERC is configurable and set to something sane like 70 deciseconds, that read failure happens at MOST 7 seconds after the read attempt. And md is notified of *exactly* what sectors are affected; it immediately goes to mirror data, or rebuilds it from parity, and then writes the correct data to the previously reported bad sectors. And that will fix the problem.

So really, if you're going to play the multiple device game, you need drive error timing to be shorter than the kernel's.

Chris Murphy
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Phillip Susi @ 2014-11-25 21:34 UTC
To: Chris Murphy, Btrfs BTRFS

On 11/19/2014 7:05 PM, Chris Murphy wrote:
> I'm not a hard drive engineer, so I can't argue either point. But consumer drives clearly do behave this way. On Linux, the kernel's default 30 second command timer eventually results in what look like link errors rather than drive read errors. And instead of the problems being fixed with the normal md and btrfs recovery mechanisms, the errors simply get worse and eventually there's data loss. Exhibits A, B, C, D - the linux-raid list is full to the brim of such reports and their solution.

I have seen plenty of error logs of people with drives that do properly give up and return an error instead of timing out, so I get the feeling that most drives are properly behaved. Is there a particular make/model of drive that is known to exhibit this silly behavior?

>> IIRC, this is true when the drive returns failure as well. The whole bio is marked as failed, and the page cache layer then begins retrying with progressively smaller requests to see if it can get *some* data out.
>
> Well that's very course. It's not at a sector level, so as long as the drive continues to try to read from a particular LBA, but fails to either succeed reading or give up and report a read error, within 30 seconds, then you just get a bunch of wonky system behavior.

I don't understand this response at all. The drive isn't going to keep trying to read the same bad lba; after the kernel times out, it resets the drive, and tries reading different smaller parts to see which it can read and which it can't.

> Conversely what I've observed on Windows in such a case, is it tolerates these deep recoveries on consumer drives. So they just get really slow but the drive does seem to eventually recover (until it doesn't). But yeah 2 minutes is a long time. So then the user gets annoyed and reinstalls their system. Since that means writing to the affected drive, the firmware logic causes bad sectors to be dereferenced when the write error is persistent. Problem solved, faster system.

That seems like rather unsubstantiated guesswork. I.e. the 2 minute+ delays are likely not on an individual request, but from several requests that each go into deep recovery, possibly because windows is retrying the same sector or a few consecutive sectors are bad.

> Because now you have a member drive that's inconsistent. At least in the md raid case, a certain number of read failures causes the drive to be ejected from the array. Anytime there's a write failure, it's ejected from the array too. What you want is for the drive to give up sooner with an explicit read error, so md can help fix the problem by writing good data to the effected LBA. That doesn't happen when there are a bunch of link resets happening.

What? It is no different than when it does return an error, with the exception that the error is incorrectly applied to the entire request instead of just the affected sector.

> Again, if your drive SCT ERC is configurable, and set to something sane like 70 deciseconds, that read failure happens at MOST 7 seconds after the read attempt. And md is notified of *exactly* what sectors are affected, it immediately goes to mirror data, or rebuilds it from parity, and then writes the correct data to the previously reported bad sectors. And that will fix the problem.

Yes... I'm talking about when the drive doesn't support that.
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Chris Murphy @ 2014-11-25 23:13 UTC
To: Btrfs BTRFS

On Tue, Nov 25, 2014 at 2:34 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> I have seen plenty of error logs of people with drives that do properly give up and return an error instead of timing out so I get the feeling that most drives are properly behaved. Is there a particular make/model of drive that is known to exhibit this silly behavior?

The drive will only issue a read error when its ECC absolutely cannot recover the data, hard fail.

A few years ago companies including Western Digital started shipping large cheap drives, think of the "green" drives. These had very high TLER (Time Limited Error Recovery) settings, a.k.a. SCT ERC. Later they completely took out the ability to configure this error recovery timing, so you only get the upward of 2 minutes to actually get a read error reported by the drive. Presumably if the ECC determines it's a hard fail and there's no point in reading the same sector 14000 times, it would issue a read error much sooner. But again, the linux-raid list is full of cases where this doesn't happen, and merely by changing the linux SCSI command timer from 30 to 121 seconds, now the drive reports an explicit read error with LBA information included, and now md can correct the problem.

>>> IIRC, this is true when the drive returns failure as well. The whole bio is marked as failed, and the page cache layer then begins retrying with progressively smaller requests to see if it can get *some* data out.
>>
>> Well that's very course. It's not at a sector level, so as long as the drive continues to try to read from a particular LBA, but fails to either succeed reading or give up and report a read error, within 30 seconds, then you just get a bunch of wonky system behavior.
>
> I don't understand this response at all. The drive isn't going to keep trying to read the same bad lba; after the kernel times out, it resets the drive, and tries reading different smaller parts to see which it can read and which it can't.

That's my whole point. When the link is reset, no read error is submitted by the drive, so the md driver has no idea what the drive's problem was, no idea that it's a read problem, no idea what LBA is affected, and thus no way of writing over the affected bad sector. If the SCSI command timer is raised well above 30 seconds, this problem is resolved. Also, replacing the drive with one that definitively errors out (or can be configured with smartctl -l scterc) before 30 seconds is another option.

>> Conversely what I've observed on Windows in such a case, is it tolerates these deep recoveries on consumer drives. So they just get really slow but the drive does seem to eventually recover (until it doesn't). But yeah 2 minutes is a long time. So then the user gets annoyed and reinstalls their system. Since that means writing to the affected drive, the firmware logic causes bad sectors to be dereferenced when the write error is persistent. Problem solved, faster system.
>
> That seems like rather unsubstantiated guesswork. i.e. the 2 minute+ delays are likely not on an individual request, but from several requests that each go into deep recovery, possibly because windows is retrying the same sector or a few consecutive sectors are bad.

It doesn't really matter; clearly its timeout for drive commands is much higher than the linux default of 30 seconds.

>> Because now you have a member drive that's inconsistent. At least in the md raid case, a certain number of read failures causes the drive to be ejected from the array. Anytime there's a write failure, it's ejected from the array too. What you want is for the drive to give up sooner with an explicit read error, so md can help fix the problem by writing good data to the effected LBA. That doesn't happen when there are a bunch of link resets happening.
>
> What? It is no different than when it does return an error, with the exception that the error is incorrectly applied to the entire request instead of just the affected sector.

OK, that doesn't actually happen, and it would be completely f'n wrong behavior if it were happening. All the kernel knows is the command timer has expired; it doesn't know why the drive isn't responding. It doesn't know there are uncorrectable sector errors causing the problem. To just assume link resets are the same thing as bad sectors and to just wholesale start writing possibly a metric shit ton of data when you don't know what the problem is would be asinine. It might even be sabotage. Jesus...

>> Again, if your drive SCT ERC is configurable, and set to something sane like 70 deciseconds, that read failure happens at MOST 7 seconds after the read attempt. And md is notified of *exactly* what sectors are affected, it immediately goes to mirror data, or rebuilds it from parity, and then writes the correct data to the previously reported bad sectors. And that will fix the problem.
>
> Yes... I'm talking about when the drive doesn't support that.

Then there is one option, which is to increase the value of the SCSI command timer. And that applies to all raid: md, lvm, btrfs, and hardware.

-- 
Chris Murphy
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Rich Freeman @ 2014-11-26 1:53 UTC
To: Chris Murphy; +Cc: Btrfs BTRFS

On Tue, Nov 25, 2014 at 6:13 PM, Chris Murphy <lists@colorremedies.com> wrote:
> A few years ago companies including Western Digital started shipping large cheap drives, think of the "green" drives. These had very high TLER (Time Limited Error Recovery) settings, a.k.a. SCT ERC. Later they completely took out the ability to configure this error recovery timing so you only get the upward of 2 minutes to actually get a read error reported by the drive.

Why sell an $80 hard drive when you can change a few bytes in the firmware and sell a crippled $80 drive and an otherwise-identical non-crippled $130 drive?

-- 
Rich
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Phillip Susi @ 2014-12-01 19:10 UTC
To: Chris Murphy, Btrfs BTRFS

On 11/25/2014 6:13 PM, Chris Murphy wrote:
> The drive will only issue a read error when its ECC absolutely cannot recover the data, hard fail.
>
> A few years ago companies including Western Digital started shipping large cheap drives, think of the "green" drives. These had very high TLER (Time Limited Error Recovery) settings, a.k.a. SCT ERC. Later they completely took out the ability to configure this error recovery timing so you only get the upward of 2 minutes to actually get a read error reported by the drive. Presumably if the ECC determines it's a hard fail and no point in reading the same sector 14000 times, it would issue a read error much sooner. But again, the linux-raid list if full of cases where this doesn't happen, and merely by changing the linux SCSI command timer from 30 to 121 seconds, now the drive reports an explicit read error with LBA information included, and now md can correct the problem.

I have one of those and took it out of service when it started reporting read errors (not timeouts). I tried several times to write over the bad sectors to force reallocation and it worked again for a while... then the bad sectors kept coming back. Oddly, the SMART values never indicated anything had been reallocated.

> That's my whole point. When the link is reset, no read error is submitted by the drive, the md driver has no idea what the drive's problem was, no idea that it's a read problem, no idea what LBA is affected, and thus no way of writing over the affected bad sector. If the SCSI command timer is raised well above 30 seconds, this problem is resolved. Also replacing the drive with one that definitively errors out (or can be configured with smartctl -l scterc) before 30 seconds is another option.

It doesn't know why or exactly where, but it does know *something* went wrong.

> It doesn't really matter, clearly its time out for drive commands is much higher than the linux default of 30 seconds.

Only if you are running linux and can see the timeouts. You can't assume that's what is going on under windows just because the desktop stutters.

> OK that doesn't actually happen and it would be completely f'n wrong behavior if it were happening. All the kernel knows is the command timer has expired, it doesn't know why the drive isn't responding. It doesn't know there are uncorrectable sector errors causing the problem. To just assume link resets are the same thing as bad sectors and to just wholesale start writing possibly a metric shit ton of data when you don't know what the problem is would be asinine. It might even be sabotage. Jesus...

In normal single disk operation, sure: the kernel resets the drive and retries the request. But like I said before, I could have sworn there was an early failure flag that md uses to tell the lower layers NOT to attempt that kind of normal recovery, and instead just to return the failure right away so md can just go grab the data from the drive that isn't wigging out. That prevents the system from stalling on paging IO while the drive plays around with its deep recovery, and copying back 512k to the drive with the one bad sector isn't really that big of a deal.

> Then there is one option which is to increase the value of the SCSI command timer. And that applies to all raid: md, lvm, btrfs, and hardware.

And then you get stupid hanging when you could just get the data from the other drive immediately.
* Re: scrub implies failing drive - smartctl blissfully unaware
From: Patrik Lundquist @ 2014-11-28 15:02 UTC
To: Phillip Susi; +Cc: Chris Murphy, Btrfs BTRFS

On 25 November 2014 at 22:34, Phillip Susi <psusi@ubuntu.com> wrote:
> On 11/19/2014 7:05 PM, Chris Murphy wrote:
>> I'm not a hard drive engineer, so I can't argue either point. But consumer drives clearly do behave this way. On Linux, the kernel's default 30 second command timer eventually results in what look like link errors rather than drive read errors. And instead of the problems being fixed with the normal md and btrfs recovery mechanisms, the errors simply get worse and eventually there's data loss. Exhibits A, B, C, D - the linux-raid list is full to the brim of such reports and their solution.
>
> I have seen plenty of error logs of people with drives that do properly give up and return an error instead of timing out so I get the feeling that most drives are properly behaved. Is there a particular make/model of drive that is known to exhibit this silly behavior?

I had a couple of Seagate Barracuda 7200.11 (codename Moose) drives with seriously retarded firmware. They never reported a read error AFAIK but began to time out instead. They wouldn't even respond after a link reset. I had to power cycle the disks.

Funny days with ddrescue. Got almost everything off them.
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-18 20:58 ` Phillip Susi 2014-11-19 2:40 ` Chris Murphy @ 2014-11-19 2:46 ` Duncan 2014-11-19 16:07 ` Phillip Susi 1 sibling, 1 reply; 49+ messages in thread From: Duncan @ 2014-11-19 2:46 UTC (permalink / raw) To: linux-btrfs Phillip Susi posted on Tue, 18 Nov 2014 15:58:18 -0500 as excerpted: > Are there really any that take longer than 30 seconds? That's enough > time for thousands of retries. If it can't be read after a dozen tries, > it ain't never gonna work. It seems absurd that a drive would keep > trying for so long. I'm not sure about normal operation, but certainly, many drives take longer than 30 seconds to stabilize after power-on, and I routinely see resets during this time. In fact, as I recently posted, power-up stabilization time can and often does kill reliable multi-drive device or filesystem (my experience is with mdraid and btrfs raid) resume from suspend to RAM or hibernate to disk, either one or both, because it's often enough the case that one device or another will take enough longer to stabilize than the other, that it'll be failed out of the raid. This doesn't happen on single-hardware-device block devices and filesystems because in that case it's either up or down, if the device doesn't come up in time the resume simply fails entirely, instead of coming up with one or more devices there, but others missing as they didn't stabilize in time, as is unfortunately all too common in the multi- device scenario. I've seen this with both spinning rust and with SSDs, with mdraid and btrfs, with multiple mobos and device controllers, and with resume both from suspend to ram (if the machine powers down the storage devices in that case, as most modern ones do) and hibernate to permanent storage device, over several years worth of kernel series, so it's a reasonably widespread phenomena, at least among consumer-level SATA devices. (My experience doesn't extend to enterprise-raid-level devices or proper SCSI, etc, so I simply don't know, there.) While two minutes is getting a bit long, I think it's still within normal range, and some devices definitely take over a minute enough of the time to be both noticeable and irritating. That said, I SHOULD say I'd be far *MORE* irritated if the device simply pretended it was stable and started reading/writing data before it really had stabilized, particularly with SSDs where that sort of behavior has been observed and is known to put some devices at risk of complete scrambling of either media or firmware, beyond recovery at times. That of course is the risk of going the other direction, and I'd a WHOLE lot rather have devices play it safe for another 30 seconds or so after they / think/ they're stable and be SURE, than pretend to be just fine when voltages have NOT stabilized yet and thus end up scrambling things irrecoverably. I've never had that happen here tho I've never stress- tested for it, only done normal operation, but I've seen testing reports where the testers DID make it happen surprisingly easily, to a surprising number of their test devices. So, umm... I suspect the 2-minute default is 2 minutes due to power-up stabilizing issues, where two minutes is a reasonable compromise between failing the boot most of the time if the timeout is too low, and taking excessively long for very little further gain. 
And in my experience, the only way around that, at the consumer level at least, would be to split the timeouts, perhaps setting something even higher, 2.5-3 minutes on power-on, while lowering the operational timeout to something more sane for operation, probably 30 seconds or so by default, but easily tunable down to 10-20 seconds (or even lower, 5 seconds, even for consumer level devices?) for those who had hardware that fit within that tolerance and wanted the performance. But at least to my knowledge, there's no such split in reset timeout values available (maybe for SCSI?), and due to auto-spindown and power-saving, I'm not sure whether it's even possible, without some specific hardware feature available to tell the kernel that it has in fact NOT been in power-saving mode for say 5-10 minutes, hopefully long enough that voltage readings really /are/ fully stabilized and a shorter timeout is possible. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 49+ messages in thread
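As Duncan says, there is no split power-on vs. operational reset timeout available, but the two timers that do exist can at least be inspected and tuned separately, which covers much of what such a split would buy. A rough sketch, assuming a drive at /dev/sda; whether the drive accepts an SCT ERC setting at all depends on the model:

  # kernel side: seconds the SCSI/libata layer waits on a command before resetting the link
  cat /sys/block/sda/device/timeout
  echo 180 > /sys/block/sda/device/timeout   # give a slow-to-come-ready device more rope

  # drive side: how long the disk itself retries a bad sector (in deciseconds), if SCT ERC is supported
  smartctl -l scterc /dev/sda
  smartctl -l scterc,70,70 /dev/sda          # request 7-second read/write error recovery

Consumer drives that refuse the scterc command are the usual reason people end up raising the kernel-side timer instead.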
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 2:46 ` Duncan @ 2014-11-19 16:07 ` Phillip Susi 2014-11-19 21:05 ` Robert White 2014-11-19 23:59 ` Duncan 0 siblings, 2 replies; 49+ messages in thread From: Phillip Susi @ 2014-11-19 16:07 UTC (permalink / raw) To: Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/18/2014 9:46 PM, Duncan wrote: > I'm not sure about normal operation, but certainly, many drives > take longer than 30 seconds to stabilize after power-on, and I > routinely see resets during this time. As far as I have seen, typical drive spin up time is on the order of 3-7 seconds. Hell, I remember my pair of first generation seagate cheetah 15,000 rpm drives seemed to take *forever* to spin up and that still was maybe only 15 seconds. If a drive takes longer than 30 seconds, then there is something wrong with it. I figure there is a reason why spin up time is tracked by SMART so it seems like long spin up time is a sign of a sick drive. > This doesn't happen on single-hardware-device block devices and > filesystems because in that case it's either up or down, if the > device doesn't come up in time the resume simply fails entirely, > instead of coming up with one or more devices there, but others > missing as they didn't stabilize in time, as is unfortunately all > too common in the multi- device scenario. No, the resume doesn't "fail entirely". The drive is reset, and the IO request is retried, and by then it should succeed. > I've seen this with both spinning rust and with SSDs, with mdraid > and btrfs, with multiple mobos and device controllers, and with > resume both from suspend to ram (if the machine powers down the > storage devices in that case, as most modern ones do) and hibernate > to permanent storage device, over several years worth of kernel > series, so it's a reasonably widespread phenomena, at least among > consumer-level SATA devices. (My experience doesn't extend to > enterprise-raid-level devices or proper SCSI, etc, so I simply > don't know, there.) If you are restoring from hibernation, then the drives are already spun up before the kernel is loaded. > While two minutes is getting a bit long, I think it's still within > normal range, and some devices definitely take over a minute enough > of the time to be both noticeable and irritating. It certainly is not normal for a drive to take that long to spin up. IIRC, the 30 second timeout comes from the ATA specs which state that it can take up to 30 seconds for a drive to spin up. > That said, I SHOULD say I'd be far *MORE* irritated if the device > simply pretended it was stable and started reading/writing data > before it really had stabilized, particularly with SSDs where that > sort of behavior has been observed and is known to put some devices > at risk of complete scrambling of either media or firmware, beyond > recovery at times. That of course is the risk of going the other > direction, and I'd a WHOLE lot rather have devices play it safe for > another 30 seconds or so after they / think/ they're stable and be > SURE, than pretend to be just fine when voltages have NOT > stabilized yet and thus end up scrambling things irrecoverably. > I've never had that happen here tho I've never stress- tested for > it, only done normal operation, but I've seen testing reports where > the testers DID make it happen surprisingly easily, to a surprising > number of their test devices. Power supply voltage is stable within milliseconds. 
What takes HDDs time to start up is mechanically bringing the spinning rust up to speed. On SSDs, I think you are confusing testing done on power *cycling* ( i.e. yanking the power cord in the middle of a write ) with startup. > So, umm... I suspect the 2-minute default is 2 minutes due to > power-up stabilizing issues, where two minutes is a reasonable > compromise between failing the boot most of the time if the timeout > is too low, and taking excessively long for very little further > gain. The default is 30 seconds, not 2 minutes. > sure whether it's even possible, without some specific hardware > feature available to tell the kernel that it has in fact NOT been > in power-saving mode for say 5-10 minutes, hopefully long enough > that voltage readings really /are/ fully stabilized and a shorter > timeout is possible. Again, there is no several minute period where voltage stabilizes and the drive takes longer to access. This is a complete red herring. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) iQEcBAEBAgAGBQJUbMBPAAoJEI5FoCIzSKrwcV0H/20pv7O5+CDf2cRg5G5vt7PR 4J1NuVIBsboKwjwCj8qdxHQJHihvLYkTQKANqaqHv0+wx0u2DaQdPU/LRnqN71xA jP7b9lx9X6rPnAnZUDBbxzAc8HLeutgQ8YD/WB0sE5IXlI1/XFGW4tXIZ4iYmtN9 GUdL+zcdtEiYE993xiGSMXF4UBrN8d/5buBRsUsPVivAZes6OHbf9bd72c1IXBuS ADZ7cH7XGmLL3OXA+hm7d99429HFZYAgI7DjrLWp6Tb9ja5Gvhy+AVvrbU5ZWMwu XUnNsLsBBhEGuZs5xpkotZgaQlmJpw4BFY4BKwC6PL+7ex7ud3hGCGeI6VDmI0U= =DLHU -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 16:07 ` Phillip Susi @ 2014-11-19 21:05 ` Robert White 2014-11-19 21:47 ` Phillip Susi 2014-11-20 0:25 ` Duncan 1 sibling, 2 replies; 49+ messages in thread From: Robert White @ 2014-11-19 21:05 UTC (permalink / raw) To: Phillip Susi, Duncan, linux-btrfs

On 11/19/2014 08:07 AM, Phillip Susi wrote:
> On 11/18/2014 9:46 PM, Duncan wrote:
>> I'm not sure about normal operation, but certainly, many drives take longer than 30 seconds to stabilize after power-on, and I routinely see resets during this time.
>
> As far as I have seen, typical drive spin up time is on the order of 3-7 seconds. Hell, I remember my pair of first generation seagate cheetah 15,000 rpm drives seemed to take *forever* to spin up and that still was maybe only 15 seconds. If a drive takes longer than 30 seconds, then there is something wrong with it. I figure there is a reason why spin up time is tracked by SMART so it seems like long spin up time is a sign of a sick drive.

I was recently re-factoring Underdog (http://underdog.sourceforge.net) startup scripts to separate out the various startup domains (e.g. lvm, luks, mdadm) in the prototype init. So I notice you (Duncan) use the word "stabilize", as do a small number of drivers in the linux kernel. This word has very little to do with "disks" per se.

Between SCSI probing LUNs (where the controller tries every theoretical address and gives a potential device ample time to reply), and usb-storage having a simple timer delay set for each volume it sees, there is a lot of "waiting in the name of safety" going on in the linux kernel at device initialization.

When I added the messages "scanning /dev/sd??" to the startup sequence as I iterate through the disks and partitions present, I discovered that the first time I called blkid (e.g. right between /dev/sda and /dev/sda1) I'd get a huge hit of many human seconds (I didn't time it, but I'd say eight or so) just for having a 2TB My Book WD 3.0 disk enclosure attached as /dev/sdc. That the enclosure had "spun up" in the previous boot cycle, and that this was only a soft reboot, was immaterial. In this case usb-storage is going to take its time and do its deal regardless of the state of the physical drive itself.

So there are _lots_ of places where you are going to get delays, and very few of them involve the disk itself going from power-off to ready. You said it yourself with respect to SSDs. It's cheaper, less error prone, and less likely to generate customer returns if the generic controller chips just "send init, wait a fixed delay, then request a status" compared to trying to "are-you-there-yet" poll each device like a nagging child. And you are going to see that at every level. And you are going to see it multiply with _sparsely_ provisioned buses, where the cycle is going to be retried for absent LUNs (one disk on a Wide SCSI bus and a controller set to probe all LUNs is particularly egregious).

One of the reasons that the whole industry has started favoring point-to-point (SATA, SAS) or physical intercessor chaining point-to-point (eSATA) buses is to remove a lot of those wait-and-see delays.

That said, you should not see a drive (or target enclosure, or controller) "reset" during spin up. In a SCSI setting this is almost always a cabling, termination, or addressing issue. In IDE it's jumper mismatch (master vs slave vs cable-select). Less often it's a partitioning issue (trying to access sectors beyond the end of the drive).

Another strong actor is selecting the wrong storage controller chipset driver. In that case you may be falling back from the high-end device you think it is, through an intermediate chip-set, and back to ACPI or BIOS emulation.

Another common cause is having a dedicated hardware RAID controller (dell likes to put LSI MegaRaid controllers in their boxes for example), many mother boards have hardware RAID support available through the bios, etc, leaving that feature active, then adding a drive and _not_ initializing that drive with the RAID controller disk setup. In this case the controller is going to repeatedly probe the drive for its proprietary controller signature blocks (and reset the drive after each attempt) and then finally fall back to raw block pass-through. This can take a long time (thirty seconds to a minute).

But seriously, if you are seeing "reset" anywhere in any storage chain during a normal power-on cycle then you've got a problem with geometry or configuration.

^ permalink raw reply [flat|nested] 49+ messages in thread
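Whether a given box is actually seeing those resets is easy to check, since libata logs them. A quick sketch; the exact message wording varies by kernel version:

  # list link resets and link-speed renegotiations since boot
  dmesg | grep -Ei 'ata[0-9.]+.*(reset|sata link)'

A clean boot shows one "SATA link up" line per populated port and no resetting or speed-downgrade chatter afterwards.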
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 21:05 ` Robert White @ 2014-11-19 21:47 ` Phillip Susi 2014-11-19 22:25 ` Robert White 2014-11-19 22:33 ` Robert White 2014-11-20 0:25 ` Duncan 1 sibling, 2 replies; 49+ messages in thread From: Phillip Susi @ 2014-11-19 21:47 UTC (permalink / raw) To: Robert White, Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/19/2014 4:05 PM, Robert White wrote: > It's cheaper, and less error prone, and less likely to generate > customer returns if the generic controller chips just "send init, > wait a fixed delay, then request a status" compared to trying to > "are-you-there-yet" poll each device like a nagging child. And you > are going to see that at every level. And you are going to see it > multiply with _sparsely_ provisioned buses where the cycle is going > to be retried for absent LUNs (one disk on a Wide SCSI bus and a > controller set to probe all LUNs is particularly egregious) No, they do not wait a fixed time, then proceed. They do in fact issue the command, then poll or wait for an interrupt to know when it is done, then time out and give up if that doesn't happen within a reasonable amount of time. > One of the reasons that the whole industry has started favoring > point-to-point (SATA, SAS) or physical intercessor chaining > point-to-point (eSATA) buses is to remove a lot of those > wait-and-see delays. Nope... even with the ancient PIO mode PATA interface, you polled a ready bit in the status register to see if it was done yet. If you always waited 30 seconds for every command your system wouldn't boot up until next year. > Another strong actor is selecting the wrong storage controller > chipset driver. In that case you may be faling back from high-end > device you think it is, through intermediate chip-set, and back to > ACPI or BIOS emulation There is no such thing as ACPI or BIOS emulation. AHCI SATA controllers do usually have an old IDE emulation mode instead of AHCI mode, but this isn't going to cause ridiculously long delays. > Another common cause is having a dedicated hardware RAID > controller (dell likes to put LSI MegaRaid controllers in their > boxes for example), many mother boards have hardware RAID support > available through the bios, etc, leaving that feature active, then > the adding a drive and That would be fake raid, not hardware raid. > _not_ initializing that drive with the RAID controller disk setup. > In this case the controller is going to repeatedly probe the drive > for its proprietary controller signature blocks (and reset the > drive after each attempt) and then finally fall back to raw block > pass-through. This can take a long time (thirty seconds to a > minute). No, no, and no. If it reads the drive and does not find its metadata, it falls back to pass through. The actual read takes only milliseconds, though it may have to wait a few seconds for the drive to spin up. There is no reason it would keep retrying after a successful read. The way you end up with 30-60 second startup time with a raid is if you have several drives and staggered spinup mode enabled, then each drive is started one at a time instead of all at once so their cumulative startup time can add up fairly high. 
^ permalink raw reply [flat|nested] 49+ messages in thread
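To put rough numbers on that last point: with staggered spinup enabled and the 3-7 seconds of spin-up per drive cited earlier in the thread, an array of, say, six to eight members spends roughly 20 to 55 seconds just sequencing the motors, so the whole box can take the better part of a minute to become ready without any single drive being slow.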
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 21:47 ` Phillip Susi @ 2014-11-19 22:25 ` Robert White 2014-11-20 20:26 ` Phillip Susi 2014-11-19 22:33 ` Robert White 1 sibling, 1 reply; 49+ messages in thread From: Robert White @ 2014-11-19 22:25 UTC (permalink / raw) To: Phillip Susi, Duncan, linux-btrfs Shame you already know everything? On 11/19/2014 01:47 PM, Phillip Susi wrote: > On 11/19/2014 4:05 PM, Robert White wrote: > >> One of the reasons that the whole industry has started favoring >> point-to-point (SATA, SAS) or physical intercessor chaining >> point-to-point (eSATA) buses is to remove a lot of those >> wait-and-see delays. > > Nope... even with the ancient PIO mode PATA interface, you polled a > ready bit in the status register to see if it was done yet. If you > always waited 30 seconds for every command your system wouldn't boot > up until next year. The controller, the thing that sets the ready bit and sends the interrupt is distinct from the driver, the thing that polls the ready bit when the interrupt is sent. At the bus level there are fixed delays and retries. Try putting two drives on a pin-select IDE bus and strapping them both as _slave_ (or indeed master) sometime and watch the shower of fixed delay retries. >> Another strong actor is selecting the wrong storage controller >> chipset driver. In that case you may be faling back from high-end >> device you think it is, through intermediate chip-set, and back to >> ACPI or BIOS emulation > > There is no such thing as ACPI or BIOS emulation. That's odd... my bios reads from storage to boot the device and it does so using the ACPI storage methods. ACPI 4.0 Specification Section 9.8 even disagrees with you at some length. Let's just do the titles shall we: 9.8 ATA Controller Devices 9.8.1 Objects for both ATA and SATA Controllers. 9.8.2 IDE Controller Device 9.8.3 Serial ATA (SATA) controller Device Oh, and _lookie_ _here_ in Linux Kernel Menuconfig at Device Drivers -> <*> Serial ATA and Parallel ATA drivers (libata) -> <*> ACPI firmware driver for PATA CONFIG_PATA_ACPI: This option enables an ACPI method driver which drives motherboard PATA controller interfaces through the ACPI firmware in the BIOS. This driver can sometimes handle otherwise unsupported hardware. You are a storage _genius_ for knowing that all that stuff doesn't exist... the rest of us must simply muddle along in our delusion... > AHCI SATA > controllers do usually have an old IDE emulation mode instead of AHCI > mode, but this isn't going to cause ridiculously long delays. Do tell us more... I didn't say the driver would cause long delays, I said that the time it takes to error out other improperly supported drivers and fall back to this one could induce long delays and resets. I think I am done with your "expertise" in the question of all things storage related. Not to be rude... but I'm physically ill and maybe I shouldn't be posting right now... 8-) -- Rob. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 22:25 ` Robert White @ 2014-11-20 20:26 ` Phillip Susi 2014-11-20 22:45 ` Robert White 0 siblings, 1 reply; 49+ messages in thread From: Phillip Susi @ 2014-11-20 20:26 UTC (permalink / raw) To: Robert White, Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On 11/19/2014 5:25 PM, Robert White wrote:
> The controller, the thing that sets the ready bit and sends the interrupt, is distinct from the driver, the thing that polls the ready bit when the interrupt is sent. At the bus level there are fixed delays and retries. Try putting two drives on a pin-select IDE bus and strapping them both as _slave_ (or indeed master) sometime and watch the shower of fixed delay retries.

No, it does not. In classical IDE, the "controller" is really just a bus bridge. When you read from the status register in the controller, the read bus cycle is propagated down the IDE ribbon and into the drive, and you are, in fact, reading the register directly from the drive. That is where the name Integrated Device Electronics came from: because the controller was really integrated into the drive. The only fixed delays at the bus level are the bus cycle speed. There are no retries. There are only 3 mentions of the word "retry" in the ATA8-APT spec and they all refer to the host driver.

> That's odd... my bios reads from storage to boot the device and it does so using the ACPI storage methods.

No, it doesn't. It does so by accessing the IDE or AHCI registers just as the pc bios always has. I suppose I also need to remind you that we are talking about the context of linux here, and linux does not make use of the bios for disk access.

> ACPI 4.0 Specification Section 9.8 even disagrees with you at some length.
>
> Let's just do the titles shall we:
>
> 9.8 ATA Controller Devices 9.8.1 Objects for both ATA and SATA Controllers. 9.8.2 IDE Controller Device 9.8.3 Serial ATA (SATA) controller Device
>
> Oh, and _lookie_ _here_ in Linux Kernel Menuconfig at Device Drivers -> <*> Serial ATA and Parallel ATA drivers (libata) -> <*> ACPI firmware driver for PATA
>
> CONFIG_PATA_ACPI:
>
> This option enables an ACPI method driver which drives motherboard PATA controller interfaces through the ACPI firmware in the BIOS. This driver can sometimes handle otherwise unsupported hardware.
>
> You are a storage _genius_ for knowing that all that stuff doesn't exist... the rest of us must simply muddle along in our delusion...

Yes, ACPI 4.0 added this mess. I have yet to see a single system that actually implements it. I can't believe they even bothered adding this driver to the kernel. Is there anyone in the world who has ever used it? If no motherboard vendor has bothered implementing the ACPI FAN specs, I very much doubt anyone will ever bother with this.

> Do tell us more... I didn't say the driver would cause long delays, I said that the time it takes to error out other improperly supported drivers and fall back to this one could induce long delays and resets.

There is no "error out" and "fall back". If the device is in AHCI mode then it identifies itself as such and the AHCI driver is loaded. If it is in IDE mode, then it identifies itself as such, and the IDE driver is loaded.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-20 20:26 ` Phillip Susi @ 2014-11-20 22:45 ` Robert White 2014-11-21 15:11 ` Phillip Susi 0 siblings, 1 reply; 49+ messages in thread From: Robert White @ 2014-11-20 22:45 UTC (permalink / raw) To: Phillip Susi, Duncan, linux-btrfs On 11/20/2014 12:26 PM, Phillip Susi wrote: > Yes, ACPI 4.0 added this mess. I have yet to see a single system that > actually implements it. I can't believe they even bothered adding > this driver to the kernel. Is there anyone in the world who has ever > used it? If no motherboard vendor has bothered implementing the ACPI > FAN specs, I very much doubt anyone will ever bother with this. Nice attempt at saving face, but wrong as _always_. The CONFIG_PATA_ACPI option has been in the kernel since 2008 and lots of people have used it. If you search for "ACPI ide" you'll find people complaining in 2008-2010 about windows error messages indicating the device is present in their system but no OS driver is available. That you "have yet to see a single system that implements it" is about the worst piece of internet research I've ever seen. Do you not _get_ that your opinion about what exists and how it works is not authoritative? You can also find articles about both windows and linux systems actively using ACPI fan control going back to 2009 These are not hard searches to pull off. These are not obscure references. Go to the google box and start typing "ACPI fan..." and check the autocomplete. I'll skip ovea all the parts where you don't know how a chipset works and blah, blah, blah... You really should have just stopped at "I don't know" and "I've never" because you keep demonstrating that you _don't_ know, and that you really _should_ _never_. Tell us more about the lizard aliens controlling your computer, I find your versions of realty fascinating... ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-20 22:45 ` Robert White @ 2014-11-21 15:11 ` Phillip Susi 2014-11-21 21:12 ` Robert White 0 siblings, 1 reply; 49+ messages in thread From: Phillip Susi @ 2014-11-21 15:11 UTC (permalink / raw) To: Robert White, Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/20/2014 5:45 PM, Robert White wrote: > Nice attempt at saving face, but wrong as _always_. > > The CONFIG_PATA_ACPI option has been in the kernel since 2008 and > lots of people have used it. > > If you search for "ACPI ide" you'll find people complaining in > 2008-2010 about windows error messages indicating the device is > present in their system but no OS driver is available. Nope... not finding it. The closest thing was one or two people who said ACPI when they meant AHCI ( and were quickly corrected ). This is probably what you were thinking of since windows xp did not ship with an ahci driver so it was quite common for winxp users to have this problem when in _AHCI_ mode. > That you "have yet to see a single system that implements it" is > about the worst piece of internet research I've ever seen. Do you > not _get_ that your opinion about what exists and how it works is > not authoritative? Show me one and I'll give you a cookie. I have disassembled a number of acpi tables and yet to see one that has it. What's more, motherboard vendors tend to implement only the absolute minimum they have to. Since nobody actually needs this feature, they aren't going to bother with it. Do you not get that your hand waving arguments of "you can google for it" are not authoritative? > You can also find articles about both windows and linux systems > actively using ACPI fan control going back to 2009 Maybe you should have actually read those articles. Linux supports acpi fan control, unfortunately, almost no motherboards actually implement it. Almost everyone who wants fan control working in linux has to install lm-sensors and load a driver that directly accesses one of the embedded controllers that motherboards tend to use and run the fancontrol script to manipulate the pwm channels on that controller. These days you also have to boot with a kernel argument to allow loading the driver since ACPI claims those IO ports for its own use which creates a conflict. Windows users that want to do this have to install a program... I believe a popular one is called q-fan, that likewise directly accesses the embedded controller registers to control the fan, since the acpi tables don't bother properly implementing the acpi fan spec. Then there are thinkpads, and one or two other laptops ( asus comes to mind ) that went and implemented their own proprietary acpi interfaces for fancontrol instead of following the spec, which required some reverse engineering and yet more drivers to handle these proprietary acpi interfaces. You can google for "thinkfan" if you want to see this. > These are not hard searches to pull off. These are not obscure > references. Go to the google box and start typing "ACPI fan..." > and check the autocomplete. > > I'll skip ovea all the parts where you don't know how a chipset > works and blah, blah, blah... > > You really should have just stopped at "I don't know" and "I've > never" because you keep demonstrating that you _don't_ know, and > that you really _should_ _never_. > > Tell us more about the lizard aliens controlling your computer, I > find your versions of realty fascinating... 
By all means, keep embarrassing yourself with nonsense and trying to cover it up by being rude and insulting. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) iQEcBAEBAgAGBQJUb1YxAAoJEI5FoCIzSKrwi54H/Rkd7DloqC9x9QwN4QdmWcAZ /UQg3hcRbtB3wpmp34Mnb3SS0Ii2mCh/dtKmdRGBNE/x5nU1WiQEHHCicKX3Avvq 8OXLNQrsf+xZL9/HGtUJ3RefpEkmwIG5NgFfKJHtv6Iq204Umq32JUxRla+ZQE5s MrUparigpUlj26lrnShc6ByDUqYK3wOTsDxEMxrOyAgi/n/7ESHV/dZVaqsE6jGQ OvPynf1FqJoJSSYC7sNE0XLqfHMu2wnSxcoF6MpuHXlDiwtrSH07tuwgrhCNPagY j7gQyxucew8oim8lcfs+4rrQ60wwVzlsEJwjA9rAXQF7U2x/WoB+ArYhgmJUMgA= =cXJr -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 15:11 ` Phillip Susi @ 2014-11-21 21:12 ` Robert White 2014-11-21 21:41 ` Robert White 2014-11-22 22:06 ` Phillip Susi 0 siblings, 2 replies; 49+ messages in thread From: Robert White @ 2014-11-21 21:12 UTC (permalink / raw) To: Phillip Susi, Duncan, linux-btrfs On 11/21/2014 07:11 AM, Phillip Susi wrote: > On 11/20/2014 5:45 PM, Robert White wrote: >> If you search for "ACPI ide" you'll find people complaining in >> 2008-2010 about windows error messages indicating the device is >> present in their system but no OS driver is available. > > Nope... not finding it. The closest thing was one or two people who > said ACPI when they meant AHCI ( and were quickly corrected ). This > is probably what you were thinking of since windows xp did not ship > with an ahci driver so it was quite common for winxp users to have > this problem when in _AHCI_ mode. I have to give you that one... I should have never trusted any reference to windows. Most of those references to windows support were getting AHCI and ACPI mixed up. Lolz windows users... They didn't get into ACPI disk support till 2010. I should have known they were behind the times. I had to scroll down almost a whole page to find the linux support. So lets just look at the top of the ide/ide-acpi.c from linux 2.6 to consult about when ACPI got into the IDE business... linux/drivers/ide/ide-acpi.c /* * Provides ACPI support for IDE drives. * * Copyright (C) 2005 Intel Corp. * Copyright (C) 2005 Randy Dunlap * Copyright (C) 2006 SUSE Linux Products GmbH * Copyright (C) 2006 Hannes Reinecke */ Here's a bug from 2005 of someone having a problem with the ACPI IDE support... https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=6&cad=rja&uact=8&ved=0CDkQFjAF&url=https%3A%2F%2Fbugzilla.kernel.org%2Fshow_bug.cgi%3Fid%3D5604&ei=g6VvVL73K-HLsASIrYKIDg&usg=AFQjCNGTuuXPJk91svGJtRAf35DUqVqrLg&sig2=eHxwbLYXn4ED5jG-guoZqg People debating the merits of the ACPI IDE drivers in 2005. https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=12&cad=rja&uact=8&ved=0CGUQFjAL&url=http%3A%2F%2Fwww.linuxquestions.org%2Fquestions%2Fslackware-14%2Fbare-ide-and-bare-acpi-kernels-297525%2F&ei=g6VvVL73K-HLsASIrYKIDg&usg=AFQjCNFoyKgH2sOteWwRN_Tdrfw9hOmVGQ&sig2=BmMVcZl24KRz4s4gEvLN_w So "you got me"... windows was behind the curve by five years instead of just three... my bad... But yea, nobody has ever used that ACPI disk drive support that's been in the kernel for nine years. Even when you "get me" for referencing windows, you're still wrong... How many times will you try get out of being hideously horribly wrong about ACPI supporting disk/storage IO? It is neither "recent" nor "rare". How much egg does your face really need before you just see that your fantasy that it's "new" and uncommon is a delusional mistake? Methinks Misters Dunning and Kruger need a word with you... ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 21:12 ` Robert White @ 2014-11-21 21:41 ` Robert White 2014-11-22 22:06 ` Phillip Susi 1 sibling, 0 replies; 49+ messages in thread From: Robert White @ 2014-11-21 21:41 UTC (permalink / raw) To: Phillip Susi, Duncan, linux-btrfs On 11/21/2014 01:12 PM, Robert White wrote: > (wrong links included in post...) Dangit... those two links were bad... wrong clipboard... /sigh... I'll just stand on the pasted text from the driver. 8-) ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 21:12 ` Robert White 2014-11-21 21:41 ` Robert White @ 2014-11-22 22:06 ` Phillip Susi 1 sibling, 0 replies; 49+ messages in thread From: Phillip Susi @ 2014-11-22 22:06 UTC (permalink / raw) To: Robert White, Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 On 11/21/2014 04:12 PM, Robert White wrote: > Here's a bug from 2005 of someone having a problem with the ACPI > IDE support... That is not ACPI "emulation". ACPI is not used to access the disk, but rather it has hooks that give it a chance to diddle with the disk to do things like configure it to lie about its maximum size, or issue a security unlock during suspend/resume. > People debating the merits of the ACPI IDE drivers in 2005. No... that's not a debate at all; it is one guy asking if he should use IDE or "ACPI" mode... someone who again meant AHCI and typed the wrong acronym. > Even when you "get me" for referencing windows, you're still > wrong... > > How many times will you try get out of being hideously horribly > wrong about ACPI supporting disk/storage IO? It is neither "recent" > nor "rare". > > How much egg does your face really need before you just see that > your fantasy that it's "new" and uncommon is a delusional mistake? Project much? It seems I've proven just about everything I originally said you got wrong now so hopefully we can be done. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAEBCgAGBQJUcQj4AAoJENRVrw2cjl5RwmcH+gOW0LUQE4OXEToMY33brK8Z QMKw7T1y4dtXIeeWihugNs+vbwmoI2Wheeej4WPdiqvgqIfX4ov9+N9Nb39JiIsI 7frPJ638n98Et5sirCGKfaVvDTwlF85ApHHtXrVLg2dBY3A+oLM9jVU7jpRBvW1m IFjhJH/SMGDpMhix9SFg6w6cALRh1U5WYV4zMZ1f5/ri/05TYmNJ/M23cjtBicPZ LaIFxOMGef4lylysNaVh0W03424oIJit6d7DB1gxCyjnkUvVuJ43NjuS5ay+y2sP FFrepKrOfhK1oOib9e63zNfRHhWrX4KN0Dqcu/3+/+lhD3q5G1fd4YK2RV/oaso= =nm9l -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 21:47 ` Phillip Susi 2014-11-19 22:25 ` Robert White @ 2014-11-19 22:33 ` Robert White 2014-11-20 20:34 ` Phillip Susi 1 sibling, 1 reply; 49+ messages in thread From: Robert White @ 2014-11-19 22:33 UTC (permalink / raw) To: Phillip Susi, Duncan, linux-btrfs

P.S.

On 11/19/2014 01:47 PM, Phillip Susi wrote:
>> Another common cause is having a dedicated hardware RAID controller (dell likes to put LSI MegaRaid controllers in their boxes for example), many mother boards have hardware RAID support available through the bios, etc, leaving that feature active, then adding a drive and
>
> That would be fake raid, not hardware raid.

The LSI MegaRaid controller people would _love_ to hear more about your insight into how their battery-backed multi-drive RAID controller is "fake". You should go work for them. Try the "contact us" link at the bottom of this page. I'm sure they are waiting for your insight with bated breath!

http://www.lsi.com/products/raid-controllers/pages/megaraid-sas-9260-8i.aspx

>> _not_ initializing that drive with the RAID controller disk setup. In this case the controller is going to repeatedly probe the drive for its proprietary controller signature blocks (and reset the drive after each attempt) and then finally fall back to raw block pass-through. This can take a long time (thirty seconds to a minute).
>
> No, no, and no. If it reads the drive and does not find its metadata, it falls back to pass through. The actual read takes only milliseconds, though it may have to wait a few seconds for the drive to spin up. There is no reason it would keep retrying after a successful read.

Odd, my MegaRaid controller takes about fifteen seconds by-the-clock to initialize and do the integrity check on my single initialized drive.

It's amazing that with a fail and retry it would be _faster_... It's like you know _everything_...

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 22:33 ` Robert White @ 2014-11-20 20:34 ` Phillip Susi 2014-11-20 23:08 ` Robert White 0 siblings, 1 reply; 49+ messages in thread From: Phillip Susi @ 2014-11-20 20:34 UTC (permalink / raw) To: Robert White, Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/19/2014 5:33 PM, Robert White wrote: >> That would be fake raid, not hardware raid. > > The LSI MegaRaid controller people would _love_ to hear more about > your insight into how their battery-backed multi-drive RAID > controller is "fake". You should go work for them. Try the "contact > us" link at the bottom of this page. I'm sure they are waiting for > your insight with baited breath! Forgive me, I should have trimmed the quote a bit more. I was responding specifically to the "many mother boards have hardware RAID support available through the bios" part, not the lsi part. > Odd, my MegaRaid controller takes about fifteen seconds > by-the-clock to initialize and to the integrity check on my single > initialized drive. It is almost certainly spending those 15 seconds on something else, like bootstrapping its firmware code from a slow serial eeprom or waiting for you to press the magic key to enter the bios utility. I would be very surprised to see that time double if you add a second disk. If it does, then they are doing something *very* wrong, and certainly quite different from any other real or fake raid controller I've ever used. > It's amazing that with a fail and retry it would be _faster_... I have no idea what you are talking about here. I said that they aren't going to retry a read that *succeeded* but came back without their magic signature. It isn't like reading it again is going to magically give different results. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) iQEcBAEBAgAGBQJUblBmAAoJEI5FoCIzSKrwFKkIAKNGOGyLrMIcTeV4DQntdbaa NMkjXnWnk6lHeqTyE/pb+l4VgVH8nQwDp8hRCnKNnKHoZbT8LOGFULSmBes+DDmW dxPVDTytUu1AiqB7AyxNJU8213BQCaF0inL7ofZmX95N+0eajuVxOyHIMeokdwUU zLOnXQg0awLkQwk7U6YLAKA4A7HrOEXw4wHt9hPy/yUySMVqCeHYV3tpf7t96guU 0IRctvpwcNvvVtt65I8A4EklR+vCvqEDUZfKyG8WJAeyAdC4UoHT9vZcJAVkiFl+ Y+Mp5wsr1vuo3dYQ1bKO8RvPTB9D9npFyFIlyHEBMJlCHDU43YsNP8hGcu0mKco= =AJ6/ -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-20 20:34 ` Phillip Susi @ 2014-11-20 23:08 ` Robert White 2014-11-21 15:27 ` Phillip Susi 0 siblings, 1 reply; 49+ messages in thread From: Robert White @ 2014-11-20 23:08 UTC (permalink / raw) To: Phillip Susi, Duncan, linux-btrfs

On 11/20/2014 12:34 PM, Phillip Susi wrote:
> On 11/19/2014 5:33 PM, Robert White wrote:
>>> That would be fake raid, not hardware raid.
>>
>> The LSI MegaRaid controller people would _love_ to hear more about your insight into how their battery-backed multi-drive RAID controller is "fake". You should go work for them. Try the "contact us" link at the bottom of this page. I'm sure they are waiting for your insight with bated breath!
>
> Forgive me, I should have trimmed the quote a bit more. I was responding specifically to the "many mother boards have hardware RAID support available through the bios" part, not the lsi part.

Well you should have _actually_ trimmed your response down to not pressing send.

_Many_ motherboards have complete RAID support at levels 0, 1, 10, and 5. A few have RAID6. Some of them even use the LSI chip-set.

Seriously... are you trolling this list with disinformation or just repeating tribal knowledge from fifteen-year-old copies of PC Magazine?

Yea, some of the IDE motherboards (and indeed some of the add-on controllers) that only had RAID1 and RAID0 back in the IDE-only days were really lame just-forked-write devices with no integrity checks (hence "fake raid"), but that's from like the 1990s; it's paleolithic-age "wisdom" at this point.

Phillip say sky god angry, all go hide in cave! /D'oh...

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-20 23:08 ` Robert White @ 2014-11-21 15:27 ` Phillip Susi 0 siblings, 0 replies; 49+ messages in thread From: Phillip Susi @ 2014-11-21 15:27 UTC (permalink / raw) To: Robert White, Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/20/2014 6:08 PM, Robert White wrote: > Well you should have _actually_ trimmed your response down to not > pressing send. > > _Many_ motherboards have complete RAID support at levels 0, 1, 10, > and five 5. A few have RAID6. > > Some of them even use the LSI chip-set. Yes, there are some expensive server class motherboards out there with integrated real raid chips. Your average consumer class motherboards are not those. They contain intel, nvidia, sil, promise, and via chipsets that are fake raid. > Seriously... are you trolling this list with disinformation or > just repeating tribal knowledge from fifteen year old copies of PC > Magazine? Please drop the penis measuring. > Yea, some of the IDE motherboards and that only had RAID1 and RAID0 > (and indeed some of the add-on controllers) back in the IDE-only > days were really lame just-forked-write devices with no integrity > checks (hence "fake raid") but that's from like the 1990s; it's > paleolithic age "wisdom" at this point. Wrong again... fakeraid became popular with the advent of SATA since it was easy to add a knob to the bios to switch it between AHCI and RAID mode, and just change the pci device id. These chipsets are still quite common today and several of them do support raid5 and raid10 ( well, really it's raid 0 + raid1, but that's a whole nother can of worms ). Recent intel chips also now have a caching mode for having an SSD cache a larger HDD. Intel has also done a lot of work integrating support for their chipset into mdadm in the last year or three. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) iQEcBAEBAgAGBQJUb1ngAAoJEI5FoCIzSKrwqMQIAJ3MfA4n74aJ1KUdfHYOz96o vwPBNQJ953yozmCHfjERbTCQlKT5AzwQHWpHoFWsQ4gYoNGmeE1jy2rsqxMfujff eQekfISyX3POExnsr3LnfHWI2/Om39+EAxVPxbA5LN6SC1SCWRut7Q3bQqkuxj/S bYRU65XJ9BZ6eYznutMDFdEELyAr8b9/wnatI/ohzmebOBDgFzBrn8gwilCctz7X DI39HTkCvciWKVXNyVdUZKI5S+MRCEB2JZAkCy3x8LLsENmMnO0xN32o5Od0zlGn nFLcLQFrZfz5dY2ZusxP+z0z0x4RW3sikd4RZ99PEHBkFa5CgJIFrBxtQAsLi1c= =4Yg+ -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 21:05 ` Robert White 2014-11-19 21:47 ` Phillip Susi @ 2014-11-20 0:25 ` Duncan 2014-11-20 2:08 ` Robert White 1 sibling, 1 reply; 49+ messages in thread From: Duncan @ 2014-11-20 0:25 UTC (permalink / raw) To: linux-btrfs Robert White posted on Wed, 19 Nov 2014 13:05:13 -0800 as excerpted: > One of the reasons that the whole industry has started favoring > point-to-point (SATA, SAS) or physical intercessor chaining > point-to-point (eSATA) buses is to remove a lot of those wait-and-see > delays. > > That said, you should not see a drive (or target enclosure, or > controller) "reset" during spin up. In a SCSI setting this is almost > always a cabling, termination, or addressing issue. In IDE its jumper > mismatch (master vs slave vs cable-select). Less often its a > partitioning issue (trying to access sectors beyond the end of the > drive). > > Another strong actor is selecting the wrong storage controller chipset > driver. In that case you may be faling back from high-end device you > think it is, through intermediate chip-set, and back to ACPI or BIOS > emulation FWIW I run a custom-built monolithic kernel, with only the specific drivers (SATA/AHCI in this case) builtin. There's no drivers for anything else it could fallback to. Once in awhile I do see it try at say 6-gig speeds, then eventually fall back to 3 and ultimately 1.5, but that /is/ indicative of other issues when I see it. And like I said, there's no other drivers to fall back to, so obviously I never see it doing that. > Another common cause is having a dedicated hardware RAID controller > (dell likes to put LSI MegaRaid controllers in their boxes for example), > many mother boards have hardware RAID support available through the > bios, etc, leaving that feature active, then the adding a drive and > _not_ initializing that drive with the RAID controller disk setup. In > this case the controller is going to repeatedly probe the drive for its > proprietary controller signature blocks (and reset the drive after each > attempt) and then finally fall back to raw block pass-through. This can > take a long time (thirty seconds to a minute). Everything's set JBOD here. I don't trust those proprietary "firmware" raid things. Besides, that kills portability. JBOD SATA and AHCI are sufficiently standardized that should the hardware die, I can switch out to something else and not have to worry about rebuilding the custom kernel with the new drivers. Some proprietary firmware raid, requiring dmraid at the software kernel level to support, when I can just as easily use full software mdraid on standardized JBOD, no thanks! And be sure, that's one of the first things I check when I setup a new box, any so-called hardware raid that's actually firmware/software raid, disabled, JBOD mode, enabled. > But seriously, if you are seeing "reset" anywhere in any storage chain > during a normal power-on cycle then you've got a problem with geometry > or configuration. IIRC I don't get it routinely. But I've seen it a few times, attributing it as I said to the 30-second SATA level timeout not being long enough. Most often, however, it's at resume, not original startup, which is understandable as state at resume doesn't match state at suspend/ hibernate. 
The irritating thing, as previously discussed, is when one device takes long enough to come back that mdraid or btrfs drops it out, generally forcing the reboot I was trying to avoid with the suspend/ hibernate in the first place, along with a re-add and resync (for mdraid) or a scrub (for btrfs raid). -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 49+ messages in thread
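The recovery Duncan describes after such a drop is at least mechanical. A short sketch with hypothetical names (an md array at /dev/md0 that kicked /dev/sdb1, and a btrfs raid1 mounted at /mnt), assuming the dropped device is otherwise healthy:

  # mdraid: put the kicked member back; with a write-intent bitmap the resync is short
  mdadm /dev/md0 --re-add /dev/sdb1
  cat /proc/mdstat                  # watch the resync

  # btrfs raid1: once both devices are visible again, scrub repairs the stale copies from the good ones
  btrfs scrub start /mnt
  btrfs scrub status /mnt

If --re-add is refused (no bitmap, or the event counts have drifted too far apart), mdadm will instead want a full --add and rebuild.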
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-20 0:25 ` Duncan @ 2014-11-20 2:08 ` Robert White 0 siblings, 0 replies; 49+ messages in thread From: Robert White @ 2014-11-20 2:08 UTC (permalink / raw) To: Duncan, linux-btrfs

On 11/19/2014 04:25 PM, Duncan wrote:
> Most often, however, it's at resume, not original startup, which is understandable as state at resume doesn't match state at suspend/hibernate. The irritating thing, as previously discussed, is when one device takes long enough to come back that mdraid or btrfs drops it out, generally forcing the reboot I was trying to avoid with the suspend/hibernate in the first place, along with a re-add and resync (for mdraid) or a scrub (for btrfs raid).

If you want a practical solution you might want to look at http://underdog.sourceforge.net (my project, shameless plug). The actual user context return isn't in there, but I use the project to build initramfs images into all my kernels. [DISCLAIMER: The cryptsetup and LUKS stuff is rock solid, but the mdadm incremental build stuff is very rough and only lightly tested.]

You could easily add a drive preheat code block (spin up and status check all drives, with a pause-and-repeat function) as a preamble function that could/would safely take place before any glance is made towards the resume stage.

Extemporaneous example:

--- snip ---
cat <<'EOT' >>/opt/underdog/utility/preheat.mod
#!/bin/bash
# ROOT_COMMANDS+=( commands your preheat needs )
UNDERDOG+=( init.d/preheat )
EOT

cat <<'EOT' >>/opt/underdog/prototype/init.d/preheat
#!/bin/bash
function __preamble_preheat() {
  # whatever logic you need
  return 0
}
__preamble_funcs+=( [preheat]=__preamble_preheat )
EOT
--- snip ---

Install underdog, paste the above into a shell once, then edit /opt/underdog/prototype/init.d/preamble to put in whatever logic you need. Follow the instructions in /opt/underdog/README.txt for making the initramfs image or, as I do, build the initramfs into the kernel image.

The preamble will be run in the resultant /init script before the swap partitions are submitted for attempted resume. (The system does support complexity like resuming from a swap partition inside an LVM/LV built over a LUKS-encrypted media expanse, or just a plain laptop with one plain partitioned disk, with zero changes to the necessary default config.)

-- Rob.

^ permalink raw reply [flat|nested] 49+ messages in thread
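As one illustration of what the "whatever logic you need" placeholder above might hold (purely a sketch, not something shipped with Underdog): poll each disk until it answers a status query, with a bounded number of retries, so a slow-to-wake drive gets its extra seconds without stalling the boot forever. This assumes hdparm has been added to ROOT_COMMANDS in the .mod file:

  function __preamble_preheat() {
    local dev try
    for dev in /dev/sd[a-z]; do
      [ -b "$dev" ] || continue
      for try in 1 2 3 4 5 6; do
        # hdparm -C asks the drive for its power state; a reply means it is answering commands
        hdparm -C "$dev" >/dev/null 2>&1 && break
        sleep 5
      done
    done
    return 0
  }

Each device gets up to 30 extra seconds to come ready before the resume and mount logic runs, and devices that respond immediately cost essentially nothing.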
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 16:07 ` Phillip Susi 2014-11-19 21:05 ` Robert White @ 2014-11-19 23:59 ` Duncan 2014-11-25 22:14 ` Phillip Susi 1 sibling, 1 reply; 49+ messages in thread From: Duncan @ 2014-11-19 23:59 UTC (permalink / raw) To: linux-btrfs Phillip Susi posted on Wed, 19 Nov 2014 11:07:43 -0500 as excerpted: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 11/18/2014 9:46 PM, Duncan wrote: >> I'm not sure about normal operation, but certainly, many drives take >> longer than 30 seconds to stabilize after power-on, and I routinely see >> resets during this time. > > As far as I have seen, typical drive spin up time is on the order of 3-7 > seconds. Hell, I remember my pair of first generation seagate cheetah > 15,000 rpm drives seemed to take *forever* to spin up and that still was > maybe only 15 seconds. If a drive takes longer than 30 seconds, then > there is something wrong with it. I figure there is a reason why spin > up time is tracked by SMART so it seems like long spin up time is a sign > of a sick drive. It's not physical spinup, but electronic device-ready. It happens on SSDs too and they don't have anything to spinup. But, for instance on my old seagate 300-gigs that I used to have in 4-way mdraid, when I tried to resume from hibernate the drives would be spunup and talking to the kernel, but for some seconds to a couple minutes or so after spinup, they'd sometimes return something like (example) "Seagrte3x0" instead of "Seagate300". Of course that wasn't the exact string, I think it was the model number or perhaps the serial number or something, but looking at dmsg I could see the ATA layer up for each of the four devices, the connection establish and seem to be returning good data, then the mdraid layer would try to assemble and would kick out a drive or two due to the device string mismatch compared to what was there before the hibernate. With the string mismatch, from its perspective the device had disappeared and been replaced with something else. But if I held it at the grub prompt for a couple minutes and /then/ let it go, or part of the time on its own, all four drives would match and it'd work fine. For just short hibernates (as when testing hibernate/ resume), it'd come back just fine; as it would nearly all the time out to two hours or so. Beyond that, out to 10 or 12 hours, the longer it sat the more likely it would be to fail, if it didn't hold it at the grub prompt for a few minutes to let it stabilize. And now I seen similar behavior resuming from suspend (the old hardware wouldn't resume from suspend to ram, only hibernate, the new hardware resumes from suspend to ram just fine, but I had trouble getting it to resume from hibernate back when I first setup and tried it; I've not tried hibernate since and didn't even setup swap to hibernate to when I got the SSDs so I've not tried it for a couple years) on SSDs with btrfs raid. Btrfs isn't as informative as was mdraid on why it kicks a device, but dmesg says both devices are up, while btrfs is suddenly spitting errors on one device. A reboot later and both devices are back in the btrfs and I can do a scrub to resync, which generally finds and fixes errors on the btrfs that were writable (/home and /var/log), but of course not on the btrfs mounted as root, since it's read-only by default. Same pattern. Immediate suspend and resume is fine. Out to about 6 hours it tends to be fine as well. 
But at 8-10 hours in suspend, btrfs starts spitting errors often enough that I generally quit trying to suspend at all, I simply shut down now. (With SSDs and systemd, shutdown and restart is fast enough, and the delay from having to refill cache low enough, that the time difference between suspend and full shutdown is hardly worth troubling with anyway, certainly not when there's a risk to data due to failure to properly resume.) But it worked fine when I had only a single device to bring back up. Nothing to be slower than another device to respond and thus to be kicked out as dead. I finally realized what was happening after I read a study paper mentioning capacitor charge time and solid-state stability time, and how a lot of cheap devices say they're ready before the electronics have actually properly stabilized. On SSDs, this is a MUCH worse issue than it is on spinning rust, because the logical layout isn't practically forced to serial like it is on spinning rust, and the firmware can get so jumbled it pretty much scrambles the device. And it's not just the normal storage either. In the study, many devices corrupted their own firmware as well! Now that was definitely a worst-case study in that they were deliberately yanking and/or fast-switching the power, not just doing time-on waits, but still, a surprisingly high proportion of SSDs not only scrambled the storage, but scrambled their firmware as well. (On those devices the firmware may well have been on the same media as the storage, with the firmware simply read in first in a hardware bootstrap mode, and the firmware programmed to avoid that area in normal operation thus making it as easily corrupted as the the normal storage.) The paper specifically mentioned that it wasn't necessarily the more expensive devices that were the best, either, but the ones that faired best did tend to have longer device-ready times. The conclusion was that a lot of devices are cutting corners on device-ready, gambling that in normal use they'll work fine, leading to an acceptable return rate, and evidently, the gamble pays off most of the time. That being the case, a longer device-ready, if it actually means the device /is/ ready, can be a /good/ thing. If there's a 30-second timeout layer getting impatient and resetting the drive multiple times because it's not responding as it's not actually ready yet, well... The spinning rust in that study faired far better, with I think none of the devices scrambling their own firmware, and while there was some damage to storage, it was generally far better confined. >> This doesn't happen on single-hardware-device block devices and >> filesystems because in that case it's either up or down, if the device >> doesn't come up in time the resume simply fails entirely, instead of >> coming up with one or more devices there, but others missing as they >> didn't stabilize in time, as is unfortunately all too common in the >> multi- device scenario. > > No, the resume doesn't "fail entirely". The drive is reset, and the IO > request is retried, and by then it should succeed. Yes. I misspoke by abbreviation. The point I was trying to make is that there's only the one device, so ultimately it either works or it doesn't. There's no case of one or more devices coming up correctly and one or more still not being entirely ready. > It certainly is not normal for a drive to take that long to spin up. > IIRC, the 30 second timeout comes from the ATA specs which state that it > can take up to 30 seconds for a drive to spin up. 
>> That said, I SHOULD say I'd be far *MORE* irritated if the device >> simply pretended it was stable and started reading/writing data before >> it really had stabilized, particularly with SSDs where that sort of >> behavior has been observed and is known to put some devices at risk of >> complete scrambling of either media or firmware, beyond recovery at >> times. That was referencing the study I summarized in a bit more depth, above. > Power supply voltage is stable within milliseconds. What takes HDDs > time to start up is mechanically bringing the spinning rust up to speed. > On SSDs, I think you are confusing testing done on power *cycling* ( > i.e. yanking the power cord in the middle of a write ) with startup. But if the startup is showing the symptoms... FWIW, I wasn't a believer at first either. But I know what I see on my own hardware. Tho I now suspect we might be in vehement agreement with each other, just from different viewpoints and stating it differently. =:^) >> So, umm... I suspect the 2-minute default is 2 minutes due to power-up >> stabilizing issues > The default is 30 seconds, not 2 minutes. Well, as discussed by others there's an often two-minute default at one level, and a 30 second default at another. I was replying to someone who couldn't see the logic behind 2 minutes for sure, or even 30 seconds, with a reason why the 2 minute retry timeout might actually make sense. Yes, there's a 30 second time at a different level as well, but I was addressing why 2 minutes can make sense. Regardless, with the 2 minute timeout behind the half-minute timeout, the 2-minute timeout is obviously never going to be seen, which /is/ a problem. > Again, there is no several minute period where voltage stabilizes and > the drive takes longer to access. This is a complete red herring. My experience says otherwise. Else explain why those problems occur in the first two minutes, but don't occur if I hold it at the grub prompt "to stabilize"for two minutes, and never during normal "post- stabilization" operation. Of course perhaps there's another explanation for that, and I'm conflating the two things. But so far, experience matches the theory. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-19 23:59 ` Duncan @ 2014-11-25 22:14 ` Phillip Susi 2014-11-28 15:55 ` Patrik Lundquist 0 siblings, 1 reply; 49+ messages in thread From: Phillip Susi @ 2014-11-25 22:14 UTC (permalink / raw) To: Duncan, linux-btrfs -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/19/2014 6:59 PM, Duncan wrote: > It's not physical spinup, but electronic device-ready. It happens > on SSDs too and they don't have anything to spinup. If you have an SSD that isn't handling IO within 5 seconds or so of power on, it is badly broken. > But, for instance on my old seagate 300-gigs that I used to have in > 4-way mdraid, when I tried to resume from hibernate the drives > would be spunup and talking to the kernel, but for some seconds to > a couple minutes or so after spinup, they'd sometimes return > something like (example) "Seagrte3x0" instead of "Seagate300". Of > course that wasn't the exact string, I think it was the model > number or perhaps the serial number or something, but looking at > dmsg I could see the ATA layer up for each of the four devices, the > connection establish and seem to be returning good data, then the > mdraid layer would try to assemble and would kick out a drive or > two due to the device string mismatch compared to what was there > before the hibernate. With the string mismatch, from its > perspective the device had disappeared and been replaced with > something else. Again, these drives were badly broken then. Even if it needs extra time to come up for some reason, it shouldn't be reporting that it is ready and returning incorrect information. > And now I seen similar behavior resuming from suspend (the old > hardware wouldn't resume from suspend to ram, only hibernate, the > new hardware resumes from suspend to ram just fine, but I had > trouble getting it to resume from hibernate back when I first setup > and tried it; I've not tried hibernate since and didn't even setup > swap to hibernate to when I got the SSDs so I've not tried it for a > couple years) on SSDs with btrfs raid. Btrfs isn't as informative > as was mdraid on why it kicks a device, but dmesg says both devices > are up, while btrfs is suddenly spitting errors on one device. A > reboot later and both devices are back in the btrfs and I can do a > scrub to resync, which generally finds and fixes errors on the > btrfs that were writable (/home and /var/log), but of course not on > the btrfs mounted as root, since it's read-only by default. Several months back I was working on some patches to avoid blocking a resume until after all disks had spun up ( someone else ended up getting a different version merged to the mainline kernel ). I looked quite hard at the timings of things during suspend and found that my ssd was ready and handling IO darn near instantly and the hd ( 5900 rpm wd green at the time ) took something like 7 seconds before it was completing IO. These days I'm running a raid10 on 3 7200 rpm blues and it comes right up from suspend with no problems, just as it should. > The paper specifically mentioned that it wasn't necessarily the > more expensive devices that were the best, either, but the ones > that faired best did tend to have longer device-ready times. The > conclusion was that a lot of devices are cutting corners on > device-ready, gambling that in normal use they'll work fine, > leading to an acceptable return rate, and evidently, the gamble > pays off most of the time. 
I believe I read the same study and don't recall any such conclusion. Instead the conclusion was that the badly behaving drives aren't ordering their internal writes correctly and flushing their metadata from ram to flash before completing the write request. The problem was on the power *loss* side, not the power application. > The spinning rust in that study faired far better, with I think > none of the devices scrambling their own firmware, and while there > was some damage to storage, it was generally far better confined. That is because they don't have a flash translation layer to get mucked up and prevent them from knowing where the blocks are on disk. The worst thing you get out of a hdd losing power during a write is the sector it was writing is corrupted and you have to re-write it. > My experience says otherwise. Else explain why those problems > occur in the first two minutes, but don't occur if I hold it at the > grub prompt "to stabilize"for two minutes, and never during normal > "post- stabilization" operation. Of course perhaps there's another > explanation for that, and I'm conflating the two things. But so > far, experience matches the theory. I don't know what was broken about these drives, only that it wasn't capacitors since those charge in milliseconds, not seconds. Further, all systems using microprocessors ( like the one in the drive that controls it ) have reset circuitry that prevents them from running until after any caps have charged enough to get the power rail up to the required voltage. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) iQEcBAEBAgAGBQJUdP9jAAoJEI5FoCIzSKrw50IH/jkh48Z8Oh/AS/i68zT6Grtb C98aNNQwhC2sJSvaxRBqJ1qkXY4af5DZM/SOvFdNE4qdPLBDLfg70tnTXwU4PjzN 1mHR1PR6Vgft11t0+u8TPTos669Jm8KJ21NMgY072P18Kj/+UJqNRQ+UUNikAcaM XrTragev53F1Kzu5IrSGGjyS4ryZZNh9YioFtR3oUTh4WuCJIiiqvq1Qpno3ee+D QrL+5/fyzEkv0fAt59lhfheb2SkWe2Po+FmmH853sPP3MfhX4blTRzQbkVqZpixb NwsEMu/1hOGedzlZAp4i6aRRKDcl7B+R+x63frFun/kgY54gdbBEn3auoNSGuZA= =iPNz -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-25 22:14 ` Phillip Susi @ 2014-11-28 15:55 ` Patrik Lundquist 0 siblings, 0 replies; 49+ messages in thread From: Patrik Lundquist @ 2014-11-28 15:55 UTC (permalink / raw) To: Phillip Susi; +Cc: linux-btrfs@vger.kernel.org On 25 November 2014 at 23:14, Phillip Susi <psusi@ubuntu.com> wrote: > On 11/19/2014 6:59 PM, Duncan wrote: > >> The paper specifically mentioned that it wasn't necessarily the >> more expensive devices that were the best, either, but the ones >> that faired best did tend to have longer device-ready times. The >> conclusion was that a lot of devices are cutting corners on >> device-ready, gambling that in normal use they'll work fine, >> leading to an acceptable return rate, and evidently, the gamble >> pays off most of the time. > > I believe I read the same study and don't recall any such conclusion. > Instead the conclusion was that the badly behaving drives aren't > ordering their internal writes correctly and flushing their metadata > from ram to flash before completing the write request. The problem > was on the power *loss* side, not the power application. I've found: http://www.usenix.org/conference/fast13/technical-sessions/presentation/zheng http://lkcl.net/reports/ssd_analysis.html Are there any more studies? ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-18 7:29 ` scrub implies failing drive - smartctl blissfully unaware Brendan Hide ` (2 preceding siblings ...) 2014-11-18 15:35 ` Marc MERLIN @ 2014-11-21 4:58 ` Zygo Blaxell 2014-11-21 7:05 ` Brendan Hide 3 siblings, 1 reply; 49+ messages in thread From: Zygo Blaxell @ 2014-11-21 4:58 UTC (permalink / raw) To: Brendan Hide; +Cc: linux-btrfs@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 8211 bytes --] On Tue, Nov 18, 2014 at 09:29:54AM +0200, Brendan Hide wrote: > Hey, guys > > See further below extracted output from a daily scrub showing csum > errors on sdb, part of a raid1 btrfs. Looking back, it has been > getting errors like this for a few days now. > > The disk is patently unreliable but smartctl's output implies there > are no issues. Is this somehow standard faire for S.M.A.R.T. output? > > Here are (I think) the important bits of the smartctl output for > $(smartctl -a /dev/sdb) (the full results are attached): > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED > WHEN_FAILED RAW_VALUE > 1 Raw_Read_Error_Rate 0x000f 100 253 006 Pre-fail > Always - 0 > 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail > Always - 1 > 7 Seek_Error_Rate 0x000f 086 060 030 Pre-fail > Always - 440801014 > 197 Current_Pending_Sector 0x0012 100 100 000 Old_age > Always - 0 > 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age > Offline - 0 > 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age > Always - 0 > 200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age > Offline - 0 > 202 Data_Address_Mark_Errs 0x0032 100 253 000 Old_age > Always - 0 You have one reallocated sector, so the drive has lost some data at some time in the last 49000(!) hours. Normally reallocations happen during writes so the data that was "lost" was data you were in the process of overwriting anyway; however, the reallocated sector count could also be a sign of deteriorating drive integrity. In /var/lib/smartmontools there might be a csv file with logged error attribute data that you could use to figure out whether that reallocation was recent. I also notice you are not running regular SMART self-tests (e.g. by smartctl -t long) and the last (and first, and only!) self-test the drive ran was ~12000 hours ago. That means most of your SMART data is about 18 months old. The drive won't know about sectors that went bad in the last year and a half unless the host happens to stumble across them during a read. The drive is over five years old in operating hours alone. It is probably so fragile now that it will break if you try to move it. > > > -------- Original Message -------- > Subject: Cron <root@watricky> /usr/local/sbin/btrfs-scrub-all > Date: Tue, 18 Nov 2014 04:19:12 +0200 > From: (Cron Daemon) <root@watricky> > To: brendan@watricky > > > > WARNING: errors detected during scrubbing, corrected. > [snip] > scrub device /dev/sdb2 (id 2) done > scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds > total bytes scrubbed: 189.49GiB with 5420 errors > error details: read=5 csum=5415 > corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164 That seems a little off. If there were 5 read errors, I'd expect the drive to have errors in the SMART error log. Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem. There have been a number of fixes to csums in btrfs pulled into the kernel recently, and I've retired two five-year-old computers this summer due to RAM/CPU failures. 
> [snip] > > smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.17.2-1-ARCH] (local build) > Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org > > === START OF INFORMATION SECTION === > Model Family: Seagate Barracuda 7200.10 > Device Model: ST3250410AS > Serial Number: 6RYF5NP7 > Firmware Version: 4.AAA > User Capacity: 250,059,350,016 bytes [250 GB] > Sector Size: 512 bytes logical/physical > Device is: In smartctl database [for details use: -P show] > ATA Version is: ATA/ATAPI-7 (minor revision not indicated) > Local Time is: Tue Nov 18 09:16:03 2014 SAST > SMART support is: Available - device has SMART capability. > SMART support is: Enabled > > === START OF READ SMART DATA SECTION === > SMART overall-health self-assessment test result: PASSED > See vendor-specific Attribute list for marginal Attributes. > > General SMART Values: > Offline data collection status: (0x82) Offline data collection activity > was completed without error. > Auto Offline Data Collection: Enabled. > Self-test execution status: ( 0) The previous self-test routine completed > without error or no self-test has ever > been run. > Total time to complete Offline > data collection: ( 430) seconds. > Offline data collection > capabilities: (0x5b) SMART execute Offline immediate. > Auto Offline data collection on/off support. > Suspend Offline collection upon new > command. > Offline surface scan supported. > Self-test supported. > No Conveyance Self-test supported. > Selective Self-test supported. > SMART capabilities: (0x0003) Saves SMART data before entering > power-saving mode. > Supports SMART auto save timer. > Error logging capability: (0x01) Error logging supported. > General Purpose Logging supported. > Short self-test routine > recommended polling time: ( 1) minutes. > Extended self-test routine > recommended polling time: ( 64) minutes. > SCT capabilities: (0x0001) SCT Status supported. 
> > SMART Attributes Data Structure revision number: 10 > Vendor Specific SMART Attributes with Thresholds: > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE > 1 Raw_Read_Error_Rate 0x000f 100 253 006 Pre-fail Always - 0 > 3 Spin_Up_Time 0x0003 099 097 000 Pre-fail Always - 0 > 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 68 > 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 1 > 7 Seek_Error_Rate 0x000f 086 060 030 Pre-fail Always - 440801057 > 9 Power_On_Hours 0x0032 044 044 000 Old_age Always - 49106 > 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 > 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 89 > 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 > 189 High_Fly_Writes 0x003a 098 098 000 Old_age Always - 2 > 190 Airflow_Temperature_Cel 0x0022 060 030 045 Old_age Always In_the_past 40 (Min/Max 23/70 #25) > 194 Temperature_Celsius 0x0022 040 070 000 Old_age Always - 40 (0 23 0 0 0) > 195 Hardware_ECC_Recovered 0x001a 069 055 000 Old_age Always - 126632051 > 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 > 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 > 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 > 200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0 > 202 Data_Address_Mark_Errs 0x0032 100 253 000 Old_age Always - 0 > > SMART Error Log Version: 1 > No Errors Logged > > SMART Self-test log structure revision number 1 > Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error > # 1 Extended offline Completed without error 00% 37598 - > > SMART Selective self-test log data structure revision number 1 > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS > 1 0 0 Not_testing > 2 0 0 Not_testing > 3 0 0 Not_testing > 4 0 0 Not_testing > 5 0 0 Not_testing > Selective self-test flags (0x0): > After scanning selected spans, do NOT read-scan remainder of disk. > If Selective self-test is pending on power-up, resume after 0 minute delay. > [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
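A rough sketch of the checks Zygo suggests above, assuming the distribution keeps smartd's attribute history under /var/lib/smartmontools and that the suspect drive is /dev/sdb:

    # attribute history, if smartd logging is enabled, lands in files like these
    ls -l /var/lib/smartmontools/attrlog.*.csv
    # kick off a long self-test (runs in the background on the drive itself)
    smartctl -t long /dev/sdb
    # after the estimated runtime, check the self-test log and the attributes again
    smartctl -l selftest /dev/sdb
    smartctl -A /dev/sdb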
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 4:58 ` Zygo Blaxell @ 2014-11-21 7:05 ` Brendan Hide 2014-11-21 12:55 ` Ian Armstrong 2014-11-21 17:42 ` Zygo Blaxell 0 siblings, 2 replies; 49+ messages in thread From: Brendan Hide @ 2014-11-21 7:05 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs@vger.kernel.org On 2014/11/21 06:58, Zygo Blaxell wrote: > You have one reallocated sector, so the drive has lost some data at some > time in the last 49000(!) hours. Normally reallocations happen during > writes so the data that was "lost" was data you were in the process of > overwriting anyway; however, the reallocated sector count could also be > a sign of deteriorating drive integrity. > > In /var/lib/smartmontools there might be a csv file with logged error > attribute data that you could use to figure out whether that reallocation > was recent. > > I also notice you are not running regular SMART self-tests (e.g. > by smartctl -t long) and the last (and first, and only!) self-test the > drive ran was ~12000 hours ago. That means most of your SMART data is > about 18 months old. The drive won't know about sectors that went bad > in the last year and a half unless the host happens to stumble across > them during a read. > > The drive is over five years old in operating hours alone. It is probably > so fragile now that it will break if you try to move it. All interesting points. Do you schedule SMART self-tests on your own systems? I have smartd running. In theory it tracks changes and sends alerts if it figures a drive is going to fail. But, based on what you've indicated, that isn't good enough. > WARNING: errors detected during scrubbing, corrected. > [snip] > scrub device /dev/sdb2 (id 2) done > scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds > total bytes scrubbed: 189.49GiB with 5420 errors > error details: read=5 csum=5415 > corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164 > That seems a little off. If there were 5 read errors, I'd expect the drive to > have errors in the SMART error log. > > Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem. > There have been a number of fixes to csums in btrfs pulled into the kernel > recently, and I've retired two five-year-old computers this summer due > to RAM/CPU failures. The difference here is that the issue only affects the one drive. This leaves the probable cause at: - the drive itself - the cable/ports with a negligibly-possible cause at the motherboard chipset. -- __________ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 7:05 ` Brendan Hide @ 2014-11-21 12:55 ` Ian Armstrong 2014-11-21 17:45 ` Chris Murphy 2014-11-21 17:42 ` Zygo Blaxell 1 sibling, 1 reply; 49+ messages in thread From: Ian Armstrong @ 2014-11-21 12:55 UTC (permalink / raw) To: linux-btrfs@vger.kernel.org On Fri, 21 Nov 2014 09:05:32 +0200, Brendan Hide wrote: > On 2014/11/21 06:58, Zygo Blaxell wrote: > > I also notice you are not running regular SMART self-tests (e.g. > > by smartctl -t long) and the last (and first, and only!) self-test > > the drive ran was ~12000 hours ago. That means most of your SMART > > data is about 18 months old. The drive won't know about sectors > > that went bad in the last year and a half unless the host happens > > to stumble across them during a read. > > > > The drive is over five years old in operating hours alone. It is > > probably so fragile now that it will break if you try to move it. > All interesting points. Do you schedule SMART self-tests on your own > systems? I have smartd running. In theory it tracks changes and sends > alerts if it figures a drive is going to fail. But, based on what > you've indicated, that isn't good enough. Simply monitoring the smart status without a self-test isn't really that great. I'm not sure on the default config, but smartd can be made to initiate a smart self-test at regular intervals. Depending on the test type (short, long, etc) it could include a full surface scan. This can reveal things like bad sectors before you ever hit them during normal system usage. > > > WARNING: errors detected during scrubbing, corrected. > > [snip] > > scrub device /dev/sdb2 (id 2) done > > scrub started at Tue Nov 18 03:22:58 2014 and finished > > after 2682 seconds total bytes scrubbed: 189.49GiB with 5420 errors > > error details: read=5 csum=5415 > > corrected errors: 5420, uncorrectable errors: 0, unverified > > errors: 164 That seems a little off. If there were 5 read errors, > > I'd expect the drive to have errors in the SMART error log. > > > > Checksum errors could just as easily be a btrfs bug or a RAM/CPU > > problem. There have been a number of fixes to csums in btrfs pulled > > into the kernel recently, and I've retired two five-year-old > > computers this summer due to RAM/CPU failures. > The difference here is that the issue only affects the one drive. > This leaves the probable cause at: > - the drive itself > - the cable/ports > > with a negligibly-possible cause at the motherboard chipset. This is the same problem that I'm currently trying to resolve. I have one drive in a raid1 setup which shows no issues in smart status but often has checksum errors. In my situation what I've found is that if I scrub & let it fix the errors then a second pass immediately after will show no errors. If I then leave it a few days & try again there will be errors, even in old files which have not been accessed for months. If I do a read-only scrub to get a list of errors, a second scrub immediately after will show exactly the same errors. Apart from the scrub errors the system logs shows no issues with that particular drive. My next step is to disable autodefrag & see if the problem persists. (I'm not suggesting a problem with autodefrag, I just want to remove it from the equation & ensure that outside of normal file access, data isn't being rewritten between scrubs) -- Ian ^ permalink raw reply [flat|nested] 49+ messages in thread
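A sketch of the two-pass check Ian describes, assuming the filesystem is mounted at /mnt/data; the -r flag keeps a pass read-only, so identical error counts on back-to-back passes point at the media rather than a transient:

    # first pass, read-only: report errors but don't repair them
    btrfs scrub start -Bd -r /mnt/data
    # second read-only pass immediately after; compare the error counts
    btrfs scrub start -Bd -r /mnt/data
    # a normal scrub then repairs from the good raid1 copy
    btrfs scrub start -Bd /mnt/data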
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 12:55 ` Ian Armstrong @ 2014-11-21 17:45 ` Chris Murphy 2014-11-22 7:18 ` Ian Armstrong 0 siblings, 1 reply; 49+ messages in thread From: Chris Murphy @ 2014-11-21 17:45 UTC (permalink / raw) To: Ian Armstrong; +Cc: linux-btrfs@vger.kernel.org On Fri, Nov 21, 2014 at 5:55 AM, Ian Armstrong <btrfs@iarmst.co.uk> wrote: > In my situation what I've found is that if I scrub & let it fix the > errors then a second pass immediately after will show no errors. If I > then leave it a few days & try again there will be errors, even in > old files which have not been accessed for months. What are the devices? And if they're SSDs are they powered off for these few days? I take it the scrub error type is corruption? You can use badblocks to write a known pattern to the drive. Then power off and leave it for a few days. Then read the drive, matching against the pattern, and see if there are any discrepancies. Doing this outside the code path of Btrfs would fairly conclusively indicate whether it's hardware or software induced. Assuming you have another copy of all of these files :-) you could just sha256sum the two copies to see if they have in fact changed. If they have, well then you've got some silent data corruption somewhere somehow. But if they always match, then that suggests a bug. I don't see how you can get bogus corruption messages, and for it to not be a bug. When you do these scrubs that come up clean, and then later come up with corruptions, have you done any software updates? > My next step is to disable autodefrag & see if the problem persists. > (I'm not suggesting a problem with autodefrag, I just want to remove it > from the equation & ensure that outside of normal file access, data > isn't being rewritten between scrubs) I wouldn't expect autodefrag to touch old files not accessed for months. Doesn't it only affect actively used files? -- Chris Murphy ^ permalink raw reply [flat|nested] 49+ messages in thread
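If a second copy of the affected files exists, the comparison Chris suggests needs nothing more than the sketch below; the paths are hypothetical.

    # hash the copy stored on the suspect filesystem
    ( cd /mnt/data/archive && sha256sum -b * > /tmp/suspect.sums )
    # verify the other copy against those hashes; "OK" on every line means the data matches
    ( cd /backup/archive && sha256sum -c /tmp/suspect.sums )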
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 17:45 ` Chris Murphy @ 2014-11-22 7:18 ` Ian Armstrong 0 siblings, 0 replies; 49+ messages in thread From: Ian Armstrong @ 2014-11-22 7:18 UTC (permalink / raw) To: linux-btrfs On Fri, 21 Nov 2014 10:45:21 -0700 Chris Murphy wrote: > On Fri, Nov 21, 2014 at 5:55 AM, Ian Armstrong <btrfs@iarmst.co.uk> > wrote: > > > In my situation what I've found is that if I scrub & let it fix the > > errors then a second pass immediately after will show no errors. If > > I then leave it a few days & try again there will be errors, even in > > old files which have not been accessed for months. > > What are the devices? And if they're SSDs are they powered off for > these few days? I take it the scrub error type is corruption? It's spinning rust and the checksum error is always on the one drive (a SAMSUNG HD204UI). The firmware has been updated, since some were shipped with a bad version which could result in data corruption. > You can use badblocks to write a known pattern to the drive. Then > power off and leave it for a few days. Then read the drive, matching > against the pattern, and see if there are any discrepancies. Doing > this outside the code path of Btrfs would fairly conclusively indicate > whether it's hardware or software induced. Unfortunately I'm reluctant to go the badblock route for the entire drive since it's the second drive in a 2 drive raid1 and I don't currently have a spare. There is a small 6G partition that I can use, but given that the drive is large and the errors are few, it could take a while for anything to show. I also have a second 2 drive btrfs raid1 in the same machine that doesn't have this problem. All the drives are running off the same controller. > Assuming you have another copy of all of these files :-) you could > just sha256sum the two copies to see if they have in fact changed. If > they have, well then you've got some silent data corruption somewhere > somehow. But if they always match, then that suggests a bug. Some of the files already have an md5 linked to them, while others have parity files to give some level of recovery from corruption or damage. Checking against these show no problems, so I assume that btrfs is doing its job & only serving an intact file. > I don't > see how you can get bogus corruption messages, and for it to not be a > bug. When you do these scrubs that come up clean, and then later come > up with corruptions, have you done any software updates? No software updates between clean & corrupt. I don't have to power down or reboot either for checksum errors to appear. I don't think the corruption messages are bogus, but are indicating a genuine problem. What I would like to be able to do is compare the corrupt block with the one used to repair it and see what the difference is. As I've already stated, the system logs are clean & the smart logs aren't showing any issues. (Well, until today when a self-test failed with a read error, but it must be an unused sector since the scrub doesn't hit it & there are no re-allocated sectors yet) > > My next step is to disable autodefrag & see if the problem persists. > > (I'm not suggesting a problem with autodefrag, I just want to > > remove it from the equation & ensure that outside of normal file > > access, data isn't being rewritten between scrubs) > > I wouldn't expect autodefrag to touch old files not accessed for > months. Doesn't it only affect actively used files? 
The drive is mainly used to hold old archive files, though there are daily rotating files on it as well. The corruption affects both new and old files. -- Ian ^ permalink raw reply [flat|nested] 49+ messages in thread
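A minimal sketch of the badblocks pattern test on that small spare partition; the device name is an assumption, and -w is destructive, so it must only be pointed at a partition holding nothing of value.

    # write a fixed pattern across the partition, read it straight back, report mismatches
    badblocks -ws -t 0xaa /dev/sdc6
    # days later, a read-only pass comparing against the same previously written pattern
    badblocks -s -t 0xaa /dev/sdc6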
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 7:05 ` Brendan Hide 2014-11-21 12:55 ` Ian Armstrong @ 2014-11-21 17:42 ` Zygo Blaxell 2014-11-21 18:06 ` Chris Murphy 1 sibling, 1 reply; 49+ messages in thread From: Zygo Blaxell @ 2014-11-21 17:42 UTC (permalink / raw) To: Brendan Hide; +Cc: linux-btrfs@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 3799 bytes --] On Fri, Nov 21, 2014 at 09:05:32AM +0200, Brendan Hide wrote: > On 2014/11/21 06:58, Zygo Blaxell wrote: > >You have one reallocated sector, so the drive has lost some data at some > >time in the last 49000(!) hours. Normally reallocations happen during > >writes so the data that was "lost" was data you were in the process of > >overwriting anyway; however, the reallocated sector count could also be > >a sign of deteriorating drive integrity. > > > >In /var/lib/smartmontools there might be a csv file with logged error > >attribute data that you could use to figure out whether that reallocation > >was recent. > > > >I also notice you are not running regular SMART self-tests (e.g. > >by smartctl -t long) and the last (and first, and only!) self-test the > >drive ran was ~12000 hours ago. That means most of your SMART data is > >about 18 months old. The drive won't know about sectors that went bad > >in the last year and a half unless the host happens to stumble across > >them during a read. > > > >The drive is over five years old in operating hours alone. It is probably > >so fragile now that it will break if you try to move it. > All interesting points. Do you schedule SMART self-tests on your own > systems? I have smartd running. In theory it tracks changes and > sends alerts if it figures a drive is going to fail. But, based on > what you've indicated, that isn't good enough. I run 'smartctl -t long' from cron overnight (or whenever the drives are most idle). You can also set up smartd.conf to launch the self tests; however, the syntax for test scheduling is byzantine compared to cron (and that's saying something!). On multi-drive systems I schedule a different drive for each night. If you are also doing btrfs scrub, then stagger the scheduling so e.g. smart runs in even weeks and btrfs scrub runs in odd weeks. smartd is OK for monitoring test logs and email alerts. I've had no problems there. > >WARNING: errors detected during scrubbing, corrected. > >[snip] > >scrub device /dev/sdb2 (id 2) done > > scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds > > total bytes scrubbed: 189.49GiB with 5420 errors > > error details: read=5 csum=5415 > > corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164 > >That seems a little off. If there were 5 read errors, I'd expect the drive to > >have errors in the SMART error log. > > > >Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem. > >There have been a number of fixes to csums in btrfs pulled into the kernel > >recently, and I've retired two five-year-old computers this summer due > >to RAM/CPU failures. > The difference here is that the issue only affects the one drive. > This leaves the probable cause at: > - the drive itself > - the cable/ports > > with a negligibly-possible cause at the motherboard chipset. If it was cable, there should be UDMA CRC errors or similar in the SMART counters, but they are zero. You can also try swapping the cable and seeing whether the errors move. I've found many bad cables that way. 
The drive itself could be failing in some way that prevents recording SMART errors (e.g. because of host timeouts triggering a bus reset, which also prevents the SMART counter update for what was going wrong at the time). This is unfortunately quite common, especially with drives configured for non-RAID workloads. > > -- > __________ > Brendan Hide > http://swiftspirit.co.za/ > http://www.webafrica.co.za/?AFF1E97 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
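A crontab sketch of the schedule Zygo describes: one long self-test per night, a different drive each night, with the btrfs scrub kept clear of the self-test windows. Device names, binary paths, mount point, and times are assumptions, and his even/odd-week staggering would need a small wrapper script on top.

    # m h dom mon dow  command
    0 3 * * 1  /usr/sbin/smartctl -t long /dev/sda
    0 3 * * 2  /usr/sbin/smartctl -t long /dev/sdb
    # scrub early Sunday morning, when no self-test is scheduled
    0 4 * * 0  /usr/bin/btrfs scrub start -Bd /mnt/data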
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 17:42 ` Zygo Blaxell @ 2014-11-21 18:06 ` Chris Murphy 2014-11-22 2:25 ` Zygo Blaxell 0 siblings, 1 reply; 49+ messages in thread From: Chris Murphy @ 2014-11-21 18:06 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Brendan Hide, linux-btrfs@vger.kernel.org On Fri, Nov 21, 2014 at 10:42 AM, Zygo Blaxell <zblaxell@furryterror.org> wrote: > I run 'smartctl -t long' from cron overnight (or whenever the drives > are most idle). You can also set up smartd.conf to launch the self > tests; however, the syntax for test scheduling is byzantine compared to > cron (and that's saying something!). On multi-drive systems I schedule > a different drive for each night. > > If you are also doing btrfs scrub, then stagger the scheduling so > e.g. smart runs in even weeks and btrfs scrub runs in odd weeks. > > smartd is OK for monitoring test logs and email alerts. I've had no > problems there. Most attributes are always updated without issuing a smart test of any kind. A drive I have here only has four offline updateable attributes. When it comes to bad sectors, the drive won't use a sector that persistently fails writes. So you don't really have to worry about latent bad sectors that don't have data on them already. The sectors you care about are the ones with data. A scrub reads all of those sectors. First the drive could report a read error in which case Btrfs raid1/10, and any (md, lvm, hardware) raid can use mirrored data, or rebuild it from parity, and write to the affected sector; and also this same mechanism happens in normal reads so it's a kind of passive scrub. But it happens to miss checking inactively read data, which a scrub will check. Second, the drive could report no problem, and Btrfs raid1/10 could still fix the problem in case of a csum mismatch. And it looks like soonish we'll see this apply to raid5/6. So I think a nightly long smart test is a bit overkill. I think you could do nightly -t short tests which will report problems scrub won't notice, such as higher seek times or lower throughput performance. And then scrub once a week. > The drive itself could be failing in some way that prevents recording > SMART errors (e.g. because of host timeouts triggering a bus reset, > which also prevents the SMART counter update for what was going wrong at > the time). This is unfortunately quite common, especially with drives > configured for non-RAID workloads. Libata resetting the link should be recorded in kernel messages. -- Chris Murphy ^ permalink raw reply [flat|nested] 49+ messages in thread
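Chris's alternative maps directly onto smartd's own scheduler; a smartd.conf sketch, assuming a single monitored drive (the -s regex is the T/MM/DD/d/HH syntax Zygo calls byzantine):

    # monitor all attributes, mail root on problems,
    # short self-test every night at 02:00, long self-test Saturdays at 03:00
    /dev/sda -a -m root -s (S/../.././02|L/../../6/03)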
* Re: scrub implies failing drive - smartctl blissfully unaware 2014-11-21 18:06 ` Chris Murphy @ 2014-11-22 2:25 ` Zygo Blaxell 0 siblings, 0 replies; 49+ messages in thread From: Zygo Blaxell @ 2014-11-22 2:25 UTC (permalink / raw) To: Chris Murphy; +Cc: Brendan Hide, linux-btrfs@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 3608 bytes --] On Fri, Nov 21, 2014 at 11:06:19AM -0700, Chris Murphy wrote: > On Fri, Nov 21, 2014 at 10:42 AM, Zygo Blaxell <zblaxell@furryterror.org> wrote: > > > I run 'smartctl -t long' from cron overnight (or whenever the drives > > are most idle). You can also set up smartd.conf to launch the self > > tests; however, the syntax for test scheduling is byzantine compared to > > cron (and that's saying something!). On multi-drive systems I schedule > > a different drive for each night. > > > > If you are also doing btrfs scrub, then stagger the scheduling so > > e.g. smart runs in even weeks and btrfs scrub runs in odd weeks. > > > > smartd is OK for monitoring test logs and email alerts. I've had no > > problems there. > > Most attributes are always updated without issuing a smart test of any > kind. A drive I have here only has four offline updateable attributes. One of those four is Offline_Uncorrectable, which is a really important attribute to monitor! > When it comes to bad sectors, the drive won't use a sector that > persistently fails writes. So you don't really have to worry about > latent bad sectors that don't have data on them already. The sectors > you care about are the ones with data. A scrub reads all of those > sectors. A scrub reads all the _allocated_ sectors. A long selftest reads _everything_, and also exercises the electronics and mechanics of the drive in ways that normal operation doesn't. I have several disks that are less than 25% occupied, which means scrubs will ignore 75% of the disk surface at any given time. A sharp increase in the number of bad sectors (no matter how they are detected) usually indicates a total drive failure is coming. Many drives have been nice enough to give me enough warning for their RMA replacements to be delivered just a few hours before the drive totally fails. > First the drive could report a read error in which case Btrfs > raid1/10, and any (md, lvm, hardware) raid can use mirrored data, or > rebuild it from parity, and write to the affected sector; and also > this same mechanism happens in normal reads so it's a kind of passive > scrub. But it happens to miss checking inactively read data, which a > scrub will check. > > Second, the drive could report no problem, and Btrfs raid1/10 could > still fix the problem in case of a csum mismatch. And it looks like > soonish we'll see this apply to raid5/6. > > So I think a nightly long smart test is a bit overkill. I think you > could do nightly -t short tests which will report problems scrub won't > notice, such as higher seek times or lower throughput performance. And > then scrub once a week. Drives quite often drop a sector or two over the years, and it can be harmless. What you want to be watching out for is hundreds of bad sectors showing up over a period of few days--that means something is rattling around on the disk platters, damaging the hardware as it goes. To get that data, you have to test the disks every few days. > > The drive itself could be failing in some way that prevents recording > > SMART errors (e.g. 
because of host timeouts triggering a bus reset, > > which also prevents the SMART counter update for what was going wrong at > > the time). This is unfortunately quite common, especially with drives > > configured for non-RAID workloads. > > Libata resetting the link should be recorded in kernel messages. This is true, but the original question was about SMART data coverage. This is why it's important to monitor both. > -- > Chris Murphy [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
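To watch for the link resets Chris mentions alongside the SMART counters, something along these lines is usually enough; the grep patterns are only a starting point.

    # libata link resets, exceptions and failed commands in the current boot's kernel log
    dmesg | grep -Ei 'ata[0-9]+.*(reset|exception|failed command)'
    # same check via the journal, including earlier boots if persistent logging is enabled
    journalctl -k | grep -Ei 'ata[0-9]+.*(reset|exception)'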