* Pending sectors in valid array - how to proceed?
@ 2010-07-28 17:46 Stefan G. Weichinger
2010-07-28 18:41 ` Tim Small
0 siblings, 1 reply; 7+ messages in thread
From: Stefan G. Weichinger @ 2010-07-28 17:46 UTC (permalink / raw)
To: linux-raid
Greets,
in a customer-server I run these arrays:
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
md3 : active raid5 sdd3[3](S) sdc3[2] sdb3[1] sda3[0]
      15647104 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
md4 : active raid5 sdd4[3](S) sdc4[2] sdb4[1] sda4[0]
      471941376 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
So far everything OK.
--
smartctl shows for /dev/sdb:
  5 Reallocated_Sector_Ct    0x0033  100   100   036    Pre-fail Always  -           0
195 Hardware_ECC_Recovered   0x001a  058   039   000    Old_age  Always  -           146754005
197 Current_Pending_Sector   0x0012  100   100   000    Old_age  Always  -           13
198 Offline_Uncorrectable    0x0010  100   100   000    Old_age  Offline -           13
(relevant lines as far as I understand ...)
So I have these 13 pending sectors ...
I assume it would be good to swap sdb for safety?
This would mean:
* fail sdb
* wait for the spare to be synced ... which I fear somehow (I once lost
an array when a second drive dropped out during a resync ...)
* change sdb
* re-add sdb
correct?
I also read of a way of removing and re-adding a drive to get rid of
these sectors?
Is this a recommended thing to do?
What would you recommend me to do?
Thank you, Stefan
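The swap steps listed in the message above map onto mdadm's manage-mode options `--fail`, `--remove` and `--add`. The following is only a dry-run sketch: the helper name `print_member_swap` is made up for illustration, it prints commands instead of running them, and it assumes that sdb's partition numbers match the array members shown in /proc/mdstat (sdb1 in md1, sdb3 in md3, sdb4 in md4).

```shell
#!/bin/sh
# Print (never run) the mdadm manage-mode commands for taking one member
# partition out of an array and adding its replacement back.
# print_member_swap is a hypothetical helper, not part of mdadm.
print_member_swap() {
    array="$1"    # array name without /dev/, e.g. md3
    member="$2"   # member partition without /dev/, e.g. sdb3
    echo "mdadm /dev/$array --fail /dev/$member"
    echo "mdadm /dev/$array --remove /dev/$member"
    echo "# ...swap the disk and recreate its partitions, then:"
    echo "mdadm /dev/$array --add /dev/$member"
}

# sdb carries one member of each array in the /proc/mdstat listing.
for pair in md1:sdb1 md3:sdb3 md4:sdb4; do
    print_member_swap "${pair%%:*}" "${pair#*:}"
done
```

Printing first makes it easy to review the plan before anyone types it on the remote box. Note that md1 is a two-disk RAID1 without a spare, so it runs degraded until the new sdb1 has been added and synced.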
* Re: Pending sectors in valid array - how to proceed?
2010-07-28 17:46 Pending sectors in valid array - how to proceed? Stefan G. Weichinger
@ 2010-07-28 18:41 ` Tim Small
2010-07-28 20:27 ` Stefan *St0fF* Huebner
0 siblings, 1 reply; 7+ messages in thread
From: Tim Small @ 2010-07-28 18:41 UTC (permalink / raw)
To: lists; +Cc: linux-raid
Stefan G. Weichinger wrote:
> md3 : active raid5 sdd3[3](S) sdc3[2] sdb3[1] sda3[0]
> 15647104 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>
...

> smartctl shows for /dev/sdb:
>
>   5 Reallocated_Sector_Ct    0x0033  100   100   036    Pre-fail Always  -           0
> 195 Hardware_ECC_Recovered   0x001a  058   039   000    Old_age  Always  -           146754005
> 197 Current_Pending_Sector   0x0012  100   100   000    Old_age  Always  -           13
> 198 Offline_Uncorrectable    0x0010  100   100   000    Old_age  Offline -           13
>
> (relevant lines as far as I understand ...)
>

Do you have any high-fly writes? Are there lots of
Hardware_ECC_Recovered on all the drives? Is vibration likely to be an
issue? What's the drive/chassis?

> I also read of a way of removing and re-adding a drive to get rid of
> these sectors?
>
> Is this a recommended thing to do?
> What would you recommend me to do?
>

I think you should trigger a check; this should attempt to read these
pending sectors (assuming they are within the boundaries of the array),
along with every other sector in the array, and scrub them when the read
fails (i.e. reconstruct the data from the other array members, and write
them to the pending sectors on sdb - thus triggering reallocation of
those sectors).

echo check > /sys/block/md1/md/sync_action

etc.

Personally, I'd then wait to see if/how the reallocated count goes up -
if the sectors are the result of a one-off event, then no problem, but
if they steadily climb, then the drive is probably on its way out -
those ECC_Recovered counts look a bit naff to me. If you're nervous of
losing a drive during resync, then the check is a good thing to do first,
but you could also consider migrating the array to RAID6, to give you
double redundancy...

Cheers,

Tim.
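The check Tim describes is started per array by writing `check` to that array's `sync_action` file, so covering md1, md3 and md4 means repeating the write. A minimal sketch of that loop follows; the function name `start_md_checks` and the `SYSFS_ROOT` override are assumptions added here so the loop can be tried against a scratch directory first, and are not part of the md driver.

```shell
#!/bin/sh
# Start a "check" on every md array that is currently idle.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
start_md_checks() {
    root="${SYSFS_ROOT:-/sys}"
    for action in "$root"/block/md*/md/sync_action; do
        [ -e "$action" ] || continue      # glob matched nothing
        state=$(cat "$action")
        if [ "$state" = "idle" ]; then
            echo check > "$action"
            echo "started check: $action"
        else
            # Leave arrays alone that are already resyncing or checking.
            echo "skipped ($state): $action"
        fi
    done
}
```

Run as root, `start_md_checks` kicks off the scrub; progress then shows up in /proc/mdstat, and once an array is idle again its `mismatch_cnt` file in the same `md/` directory can be read to see whether the check found inconsistencies.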
* Re: Pending sectors in valid array - how to proceed?
2010-07-28 18:41 ` Tim Small
@ 2010-07-28 20:27 ` Stefan *St0fF* Huebner
2010-07-28 21:11 ` Roman Mamedov
0 siblings, 1 reply; 7+ messages in thread
From: Stefan *St0fF* Huebner @ 2010-07-28 20:27 UTC (permalink / raw)
To: Tim Small; +Cc: lists, linux-raid
On 28.07.2010 20:41, Tim Small wrote:
> Stefan G. Weichinger wrote:
>> md3 : active raid5 sdd3[3](S) sdc3[2] sdb3[1] sda3[0]
>> 15647104 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>>
> ...
>
>> smartctl shows for /dev/sdb:
>>
>>   5 Reallocated_Sector_Ct    0x0033  100   100   036    Pre-fail Always  -           0
>> 195 Hardware_ECC_Recovered   0x001a  058   039   000    Old_age  Always  -           146754005
>> 197 Current_Pending_Sector   0x0012  100   100   000    Old_age  Always  -           13
>> 198 Offline_Uncorrectable    0x0010  100   100   000    Old_age  Offline -           13
>>
>> (relevant lines as far as I understand ...)
>>
> Do you have any high-fly writes? Are there lots of
> Hardware_ECC_Recovered on all the drives? Is vibration likely to be an
> issue? What's the drive/chassis?

Hardware ECC recovered counts how many times the drive's internal error
correction succeeded. Indeed, this may indicate vibration or other
external sources of errors.

>> I also read of a way of removing and re-adding a drive to get rid of
>> these sectors?
>>
>> Is this a recommended thing to do?
>> What would you recommend me to do?
>>
> I think you should trigger a check; this should attempt to read these
> pending sectors (assuming they are within the boundaries of the array),
> along with every other sector in the array, and scrub them when the read
> fails (i.e. reconstruct the data from the other array members, and write
> them to the pending sectors on sdb - thus triggering reallocation of
> those sectors).
>
> echo check > /sys/block/md1/md/sync_action

Well, I also think this would be the way to go, but it depends on the
drives used!!! Are the drives consumer-class or enterprise-class drives?
If they are enterprise-class (i.e. RAID edition), go ahead. If they're
consumer-class, please enable ERC (if the drives support it) before
scrubbing; it needs to be active.

If ERC is not supported (or not enabled), then most likely the
respective drive will stop responding while it does its own error
correction on a pending sector. It will still be in its error recovery
procedure when mdraid tries to rewrite the sector. The rewrite will
fail, as the drive won't respond. Then the drive gets kicked out of the
array.

> etc.
>
> Personally, I'd then wait to see if/how the reallocated count goes up -
> if the sectors are the result of a one-off event, then no problem, but
> if they steadily climb, then the drive is probably on its way out -
> those ECC_Recovered counts look a bit naff to me. If you're nervous of
> losing a drive during resync, then the check is a good thing to do first,
> but you could also consider migrating the array to RAID6, to give you
> double redundancy...

I have had the situation where pending sectors just went away ;) No
reallocation occurred. I just wanted to mention that this is another way
it can go, so you're not surprised if it happens.

> Cheers,
>
> Tim.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

ditto,
Stefan
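The ERC advice above can be approached from two sides. On the drive side, smartmontools exposes the setting through `smartctl -l scterc` (the values are in tenths of a second, so `70` means 7.0 seconds). For drives that refuse the command, a common workaround, which is an addition here and not something the message above proposes, is to raise the kernel's per-device command timeout so that it outlasts the drive's internal recovery. A hedged sketch of the latter; `raise_scsi_timeout`, the `SYSFS_ROOT` override and the 180-second figure are illustrative assumptions.

```shell
#!/bin/sh
# Drive side (run manually and read the output before relying on it):
#   smartctl -l scterc /dev/sdb          # show the current ERC setting
#   smartctl -l scterc,70,70 /dev/sdb    # ask for 7.0 s read/write limits
#
# Kernel side: raise the command timeout (30 s by default) for the given
# disks. raise_scsi_timeout and SYSFS_ROOT are hypothetical names.
raise_scsi_timeout() {
    secs="$1"
    shift
    root="${SYSFS_ROOT:-/sys}"
    for dev in "$@"; do
        f="$root/block/$dev/device/timeout"
        if [ -w "$f" ]; then
            echo "$secs" > "$f"
        else
            echo "no writable timeout file for $dev" >&2
        fi
    done
}
```

A call such as `raise_scsi_timeout 180 sdc sdd` would cover the two drives whose ERC support is unknown. Neither setting should be assumed to survive a reboot, and on many drives the ERC value is lost at power-off, so both usually end up in a boot script.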
* Re: Pending sectors in valid array - how to proceed?
2010-07-28 20:27 ` Stefan *St0fF* Huebner
@ 2010-07-28 21:11 ` Roman Mamedov
2010-07-29 2:50 ` Simon Matthews
2010-07-29 8:45 ` Stefan G. Weichinger
0 siblings, 2 replies; 7+ messages in thread
From: Roman Mamedov @ 2010-07-28 21:11 UTC (permalink / raw)
To: st0ff; +Cc: st0ff, Tim Small, lists, linux-raid
On Wed, 28 Jul 2010 22:27:48 +0200
Stefan *St0fF* Huebner <st0ff@gmx.net> wrote:

> >>   5 Reallocated_Sector_Ct    0x0033  100   100   036    Pre-fail Always  -           0
> >> 195 Hardware_ECC_Recovered   0x001a  058   039   000    Old_age  Always  -           146754005
> >> 197 Current_Pending_Sector   0x0012  100   100   000    Old_age  Always  -           13
> >> 198 Offline_Uncorrectable    0x0010  100   100   000    Old_age  Offline -           13
> >>
> >> (relevant lines as far as I understand ...)
> >>
> > Do you have any high-fly writes? Are there lots of
> > Hardware_ECC_Recovered on all the drives? Is vibration likely to be an
> > issue? What's the drive/chassis?
> Hardware ECC recovered means how many times the internal error
> correction of the drive succeeded. Indeed this may indicate vibration
> or other external sources of errors.

That drive is most likely a Seagate, and if so, there's nothing to worry
about. Literally every Seagate drive will have a high value in
Hardware_ECC_Recovered, it's just a peculiarity of their SMART. Other vendors'
drives recover read errors using ECC too, but don't report that into the SMART
metric.

--
With respect,
Roman
* Re: Pending sectors in valid array - how to proceed?
2010-07-28 21:11 ` Roman Mamedov
@ 2010-07-29 2:50 ` Simon Matthews
2010-07-30 4:24 ` Simon Matthews
2010-07-29 8:45 ` Stefan G. Weichinger
1 sibling, 1 reply; 7+ messages in thread
From: Simon Matthews @ 2010-07-29 2:50 UTC (permalink / raw)
To: linux-raid
On Wed, Jul 28, 2010 at 2:11 PM, Roman Mamedov <roman@rm.pp.ru> wrote:
> On Wed, 28 Jul 2010 22:27:48 +0200
> Stefan *St0fF* Huebner <st0ff@gmx.net> wrote:
>
>> >>   5 Reallocated_Sector_Ct    0x0033  100   100   036    Pre-fail Always  -           0
>> >> 195 Hardware_ECC_Recovered   0x001a  058   039   000    Old_age  Always  -           146754005
>> >> 197 Current_Pending_Sector   0x0012  100   100   000    Old_age  Always  -           13
>> >> 198 Offline_Uncorrectable    0x0010  100   100   000    Old_age  Offline -           13
>> >>
>> >> (relevant lines as far as I understand ...)
>> >>
>> > Do you have any high-fly writes? Are there lots of
>> > Hardware_ECC_Recovered on all the drives? Is vibration likely to be an
>> > issue? What's the drive/chassis?
>> Hardware ECC recovered means how many times the internal error
>> correction of the drive succeeded. Indeed this may indicate vibration
>> or other external sources of errors.
>
> That drive is most likely a Seagate, and if so, there's nothing to worry
> about. Literally every Seagate drive will have a high value in
> Hardware_ECC_Recovered, it's just a peculiarity of their SMART. Other vendors'
> drives recover read errors using ECC too, but don't report that into the SMART
> metric.
>

I am waiting for this drive to get to the point that Seagate will accept an RMA:

ID# ATTRIBUTE_NAME           FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate      0x000f  089   076   006    Pre-fail Always  -           173224741
  3 Spin_Up_Time             0x0003  094   093   000    Pre-fail Always  -           0
  4 Start_Stop_Count         0x0032  100   100   020    Old_age  Always  -           69
  5 Reallocated_Sector_Ct    0x0033  100   100   036    Pre-fail Always  -           2002
  7 Seek_Error_Rate          0x000f  046   036   030    Pre-fail Always  -           42786857552386
  9 Power_On_Hours           0x0032  082   082   000    Old_age  Always  -           16170
 10 Spin_Retry_Count         0x0013  100   100   097    Pre-fail Always  -           5
 12 Power_Cycle_Count        0x0032  100   100   020    Old_age  Always  -           69
184 Unknown_Attribute        0x0032  100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect       0x0032  012   012   000    Old_age  Always  -           88
188 Unknown_Attribute        0x0032  100   090   000    Old_age  Always  -           112
189 High_Fly_Writes          0x003a  100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel  0x0022  064   057   045    Old_age  Always  -           36 (Lifetime Min/Max 33/43)
194 Temperature_Celsius      0x0022  036   043   000    Old_age  Always  -           36 (0 10 0 0)
195 Hardware_ECC_Recovered   0x001a  031   020   000    Old_age  Always  -           173224741
197 Current_Pending_Sector   0x0012  100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable    0x0010  100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count     0x003e  200   200   000    Old_age  Always  -           0

It is a desktop drive and is used for half of several RAID1 arrays,
but so far it hasn't been kicked out of any arrays. I have run a check
several times in the last few days. I had expected it to show a
failing state when the reallocated sector count reached 2000, but it
hasn't.

The raw Seek_Error_Rate is an order of magnitude higher than on an
identical drive that is the other half of those RAID1 arrays:

ID# ATTRIBUTE_NAME           FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate      0x000f  108   091   006    Pre-fail Always  -           31895651
  3 Spin_Up_Time             0x0003  094   093   000    Pre-fail Always  -           0
  4 Start_Stop_Count         0x0032  100   100   020    Old_age  Always  -           55
  5 Reallocated_Sector_Ct    0x0033  100   100   036    Pre-fail Always  -           0
  7 Seek_Error_Rate          0x000f  056   051   030    Pre-fail Always  -           3741314243502
  9 Power_On_Hours           0x0032  082   082   000    Old_age  Always  -           16221
 10 Spin_Retry_Count         0x0013  100   100   097    Pre-fail Always  -           2
 12 Power_Cycle_Count        0x0032  100   100   020    Old_age  Always  -           55
184 Unknown_Attribute        0x0032  100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect       0x0032  049   049   000    Old_age  Always  -           51
188 Unknown_Attribute        0x0032  100   098   000    Old_age  Always  -           2
189 High_Fly_Writes          0x003a  100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel  0x0022  059   051   045    Old_age  Always  -           41 (Lifetime Min/Max 39/49)
194 Temperature_Celsius      0x0022  040   049   000    Old_age  Always  -           40 (0 9 0 0)
195 Hardware_ECC_Recovered   0x001a  025   015   000    Old_age  Always  -           31895651
197 Current_Pending_Sector   0x0012  100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable    0x0010  100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count     0x003e  200   192   000    Old_age  Always  -           15

Simon
* Re: Pending sectors in valid array - how to proceed?
2010-07-29 2:50 ` Simon Matthews
@ 2010-07-30 4:24 ` Simon Matthews
0 siblings, 0 replies; 7+ messages in thread
From: Simon Matthews @ 2010-07-30 4:24 UTC (permalink / raw)
To: linux-raid
On Wed, Jul 28, 2010 at 7:50 PM, Simon Matthews
<simon.d.matthews@gmail.com> wrote:
>
> I am waiting for this drive to get to the point that Seagate will accept an RMA:
>
> ID# ATTRIBUTE_NAME           FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate      0x000f  089   076   006    Pre-fail Always  -           173224741
>   3 Spin_Up_Time             0x0003  094   093   000    Pre-fail Always  -           0
>   4 Start_Stop_Count         0x0032  100   100   020    Old_age  Always  -           69
>   5 Reallocated_Sector_Ct    0x0033  100   100   036    Pre-fail Always  -           2002
>   7 Seek_Error_Rate          0x000f  046   036   030    Pre-fail Always  -           42786857552386
>   9 Power_On_Hours           0x0032  082   082   000    Old_age  Always  -           16170
>  10 Spin_Retry_Count         0x0013  100   100   097    Pre-fail Always  -           5
>  12 Power_Cycle_Count        0x0032  100   100   020    Old_age  Always  -           69
> 184 Unknown_Attribute        0x0032  100   100   099    Old_age  Always  -           0
> 187 Reported_Uncorrect       0x0032  012   012   000    Old_age  Always  -           88
> 188 Unknown_Attribute        0x0032  100   090   000    Old_age  Always  -           112
> 189 High_Fly_Writes          0x003a  100   100   000    Old_age  Always  -           0
> 190 Airflow_Temperature_Cel  0x0022  064   057   045    Old_age  Always  -           36 (Lifetime Min/Max 33/43)
> 194 Temperature_Celsius      0x0022  036   043   000    Old_age  Always  -           36 (0 10 0 0)
> 195 Hardware_ECC_Recovered   0x001a  031   020   000    Old_age  Always  -           173224741
> 197 Current_Pending_Sector   0x0012  100   100   000    Old_age  Always  -           0
> 198 Offline_Uncorrectable    0x0010  100   100   000    Old_age  Offline -           0
> 199 UDMA_CRC_Error_Count     0x003e  200   200   000    Old_age  Always  -           0
>
>
> It is a desktop drive and is used for half of several RAID1 arrays,
> but so far it hasn't been kicked out of any arrays. I have run a check
> several times in the last few days. I had expected it to show a
> failing state when the reallocated sector count reached 2000, but it
> hasn't.

Well, despite the S.M.A.R.T. data showing that the drive is OK, it has
apparently totally failed this evening. The disk is totally inaccessible.

From the logs:

Jul 29 20:39:56 server2 kernel: ata1: failed to read log page 10h (errno=-5)
Jul 29 20:40:48 server2 kernel: ata1.00: exception Emask 0x1 SAct 0x403ffff7 SErr 0x0 action 0x0
Jul 29 20:40:48 server2 kernel: ata1.00: irq_stat 0x40000008
Jul 29 20:40:48 server2 kernel: ata1.00: cmd 60/80:00:e8:2b:5a/00:00:47:00:00/40 tag 0 ncq 65536 in
Jul 29 20:40:48 server2 kernel: res 40/00:a8:e8:35:5a/3a:00:47:00:00/40 Emask 0x1 (device error)
Jul 29 20:40:48 server2 kernel: ata1.00: status: { DRDY }
Jul 29 20:40:48 server2 kernel: ata1.00: cmd 60/80:08:e8:2f:5a/00:00:47:00:00/40 tag 1 ncq 65536 in
Jul 29 20:40:48 server2 kernel: res 40/00:a8:e8:35:5a/00:00:47:00:00/40 Emask 0x1 (device error)
Jul 29 20:40:48 server2 kernel: ata1.00: status: { DRDY }
Jul 29 20:40:48 server2 kernel: ata1.00: cmd 60/80:10:e8:33:5a/00:00:47:00:00/40 tag 2 ncq 65536 in
Jul 29 20:40:48 server2 kernel: res 40/00:a8:e8:35:5a/00:00:47:00:00/40 Emask 0x1 (device error)
...
Jul 29 20:40:48 server2 kernel: ata1.00: status: { DRDY }
Jul 29 20:40:48 server2 kernel: ata1.00: qc timeout (cmd 0xec)
Jul 29 20:40:48 server2 kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Jul 29 20:40:48 server2 kernel: ata1.00: revalidation failed (errno=-5)
Jul 29 20:40:48 server2 kernel: ata1: hard resetting link
Jul 29 20:40:48 server2 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul 29 20:40:48 server2 kernel: ata1.00: qc timeout (cmd 0xa1)
Jul 29 20:40:48 server2 kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Jul 29 20:40:48 server2 kernel: ata1.00: revalidation failed (errno=-5)
Jul 29 20:40:48 server2 kernel: ata1: limiting SATA link speed to 1.5 Gbps
Jul 29 20:40:48 server2 kernel: ata1: hard resetting link
Jul 29 20:40:48 server2 kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 29 20:40:48 server2 kernel: ata1.00: qc timeout (cmd 0xa1)
Jul 29 20:40:48 server2 kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Jul 29 20:40:48 server2 kernel: ata1.00: revalidation failed (errno=-5)
Jul 29 20:40:48 server2 kernel: ata1.00: disabled
Jul 29 20:40:48 server2 kernel: ata1: hard resetting link
Jul 29 20:40:48 server2 kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 29 20:40:48 server2 kernel: ata1: EH complete
Jul 29 20:40:48 server2 kernel: sd 0:0:0:0: [sda] Unhandled error code
Jul 29 20:40:48 server2 kernel: sd 0:0:0:0: [sda] Result: hostbyte=0x04 driverbyte=0x00
Jul 29 20:40:48 server2 kernel: end_request: I/O error, dev sda, sector 1197091688

I guess Seagate will accept it for an RMA now!

Simon
* Re: Pending sectors in valid array - how to proceed?
2010-07-28 21:11 ` Roman Mamedov
2010-07-29 2:50 ` Simon Matthews
@ 2010-07-29 8:45 ` Stefan G. Weichinger
1 sibling, 0 replies; 7+ messages in thread
From: Stefan G. Weichinger @ 2010-07-29 8:45 UTC (permalink / raw)
To: Roman Mamedov; +Cc: st0ff, st0ff, Tim Small, linux-raid
On 28.07.2010 23:11, Roman Mamedov wrote:
> That drive is most likely a Seagate, and if so, there's nothing to worry
> about. Literally every Seagate drive will have a high value in
> Hardware_ECC_Recovered, it's just a peculiarity of their SMART. Other vendors'
> drives recover read errors using ECC too, but don't report that into the SMART
> metric.

Yep, it's a Seagate. All four are Seagate:

sda, sdb: ST3250310NS (should have ERC as far as I found online)
sdc, sdd: ST3250621NS (still don't know if they have ERC)

I have now run that check action on all three arrays.

So far it looks good. All three arrays re-synced OK, without any drive
failing. Good :-)

Still no reallocated sectors on any of the four drives.

"Current_Pending_Sector" and "Offline_Uncorrectable" on /dev/sdb are
still at the old value of "13".

Do you think I should swap that drive or not?
(added difficulty: that server is around 400km from me ... I would have
to direct an employee there to swap the hdd ...)

Migrating to RAID6 would make sense, sure, but it would need a kernel
upgrade and involves quite some work. Right now I have 2.6.25-gentoo-r7
there :-(

Thank you all for your replies, Stefan
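Whether to swap sdb comes down to how the three sector counters move over the coming days, which is easier to judge when they are pulled out of the full attribute table. A small sketch, assuming the ten-column `smartctl -A` layout shown earlier in the thread; the function name `smart_sector_counters` is made up for illustration.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print "ID NAME RAW_VALUE" for
# reallocated (5), pending (197) and offline-uncorrectable (198) sectors.
# Assumes RAW_VALUE is the tenth whitespace-separated column.
smart_sector_counters() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $1, $2, $10 }'
}
```

Typical use would be `smartctl -A /dev/sdb | smart_sector_counters`, run after each check and compared with the previous result. A pending count that stays at 13 with no reallocations is a different picture from counters that keep climbing.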