* time limited error recovery and md raid
From: Redeeman @ 2008-12-05 20:57 UTC
To: linux-raid

Hello.

I'm going to be building a software RAID-6 setup with probably 8 disks,
and I've been looking at the WD GP disks, which come in both a standard
and a RAID edition, with the RAID edition being much more expensive.

I have searched around and found that it is indeed possible to activate
TLER on the "normal" disks. However, the setting takes a parameter:
specifically, how many seconds error recovery should be limited to. The
default is 7.

So I was wondering: what should that be set to for optimal use with
Linux md raid? I haven't been able to find any information about this.

Thanks.

mvh.
Kasper Sandberg
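For context on the setting under discussion: on drives that expose it,
this limit is the ATA SCT Error Recovery Control (ERC) value. On the WD
GP drives of this era, the vendor WDTLER tool was the usual way to
toggle it; a later smartctl (newer than the 5.38 shown further down
this thread) can also query and set it where the firmware allows. A
minimal sketch, assuming such a drive and such a smartctl build:

    # Query the current read/write error-recovery limits.
    smartctl -l scterc /dev/sda

    # Limit both read and write recovery to 7.0 seconds
    # (units are 100 ms, so 70 = 7 s). On many drives the setting
    # does not persist across power cycles, so rerun it at boot.
    smartctl -l scterc,70,70 /dev/sda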
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 21:01 UTC
To: Redeeman; +Cc: linux-raid

On Fri, 5 Dec 2008, Redeeman wrote:
> I have searched around and found that it is indeed possible to
> activate TLER on the "normal" disks. However, the setting takes a
> parameter: specifically, how many seconds error recovery should be
> limited to. The default is 7.
>
> So I was wondering: what should that be set to for optimal use with
> Linux md raid?

The exact time is a good question.

Something I have noticed is that when TLER is off, the drives hang when
they hit a bad sector, and when TLER is on, the drive is kicked out of
the array immediately when it reports a bad sector.

Justin.
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 21:07 UTC
To: Justin Piszcz; +Cc: linux-raid

On Fri, 2008-12-05 at 16:01 -0500, Justin Piszcz wrote:
> Something I have noticed is that when TLER is off, the drives hang
> when they hit a bad sector, and when TLER is on, the drive is kicked
> out of the array immediately when it reports a bad sector.

First: does this happen with any frequency on these disks? Would you
recommend other disks?

Second: when TLER is off, does it hang forever, or just until the drive
gives up? And when TLER is on, why does Linux not attempt to remap the
sector?
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 21:12 UTC
To: Redeeman; +Cc: linux-raid

On Fri, 5 Dec 2008, Redeeman wrote:
> First: does this happen with any frequency on these disks? Would you
> recommend other disks?

The disks I have are VelociRaptors; I will be reverting to my old
Raptor 150s shortly to make sure everything is fine with them before I
go buying new hard drives. It happened every week or two with the
VelociRaptors: bad drives, or drives getting kicked out of the array
over and over again.

> Second: when TLER is off, does it hang forever, or just until the
> drive gives up?

When TLER is off it hangs for 1-2 minutes and then you will see a
timeout in the dmesg/kernel log. Sometimes it kicks the drive out of
the array; other times it hard-resets the drive and the array is able
to continue operating normally.

> And when TLER is on, why does Linux not attempt to remap the sector?

See my earlier posts on this question from last week. It would need a
metadata section on each HDD to keep track of the bad sectors before it
writes to the drives; a 3ware card will do this for you. However,
having run md/Linux for a number of years, that is typically not
super-necessary if you run checks on your disks once a week *AND* you
have good drives that don't have problems.

Justin.
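The weekly checks mentioned above are md's built-in scrub, driven
through sysfs. A minimal sketch, assuming an array named md0 and a
kernel with the sync_action interface:

    # Start a background scrub: md reads every stripe, verifies
    # parity/mirror consistency, and rewrites unreadable sectors
    # from redundant data as it goes.
    echo check > /sys/block/md0/md/sync_action

    # Watch progress, then see how many stripes disagreed.
    cat /proc/mdstat
    cat /sys/block/md0/md/mismatch_cnt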
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 21:18 UTC
To: Justin Piszcz; +Cc: linux-raid

On Fri, 2008-12-05 at 16:12 -0500, Justin Piszcz wrote:
> See my earlier posts on this question from last week. It would need a
> metadata section on each HDD to keep track of the bad sectors before
> it writes to the drives; a 3ware card will do this for you.

I thought disks had this internally, and could be prompted to do it by
writing to the sector?

> However, having run md/Linux for a number of years, that is typically
> not super-necessary if you run checks on your disks once a week *AND*
> you have good drives that don't have problems.

Obviously I intend to replace broken disks as I detect them, but as I
understand it, it's a fairly common case that disks will get bad
sectors over time and remap them internally?
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 21:21 UTC
To: Redeeman; +Cc: linux-raid

On Fri, 5 Dec 2008, Redeeman wrote:
> I thought disks had this internally, and could be prompted to do it
> by writing to the sector?

Yes, and that is what I did over and over again (dd if=/dev/zero
of=/dev/dsk). It ran OK for 1-2 days, but then it started erroring
again with the VelociRaptors. With some old 400 GiB Seagates I did the
same thing; their pending sector list rose, but the drives remained
working for 1-2 years after. The problems I mention only apply to my
latest experience, with the VelociRaptor HDDs.

> Obviously I intend to replace broken disks as I detect them, but as I
> understand it, it's a fairly common case that disks will get bad
> sectors over time and remap them internally?

Yes, and I believe 'check' helps to ensure this process happens; it is
like the 'scrubbing' that RAID VERIFY does on a 3ware controller.

Justin.
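Zeroing the whole device, as above, destroys its contents. A more
surgical variant rewrites only the sector the kernel complained about,
giving the drive the chance to remap it from its spare pool. A minimal
sketch; the device name and LBA are hypothetical placeholders standing
in for whatever the kernel log reported, and on a disk that is still an
md member you would normally let md's own rewrite handle this instead:

    # Suppose dmesg reported a media error at LBA 123456789 on
    # /dev/sde. Writing that one 512-byte sector forces the drive
    # either to write it in place or to remap it to a spare.
    dd if=/dev/zero of=/dev/sde bs=512 count=1 seek=123456789 oflag=direct

    # Current_Pending_Sector should then drop, and
    # Reallocated_Sector_Ct may rise by one.
    smartctl -A /dev/sde | egrep 'Pending|Reallocated'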
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 21:31 UTC
To: Justin Piszcz; +Cc: linux-raid

On Fri, 2008-12-05 at 16:21 -0500, Justin Piszcz wrote:
> [...] The problems I mention only apply to my latest experience, with
> the VelociRaptor HDDs.

Okay. Do you happen to have any knowledge to pass on about current 1 TB
disks?

> Yes, and I believe 'check' helps to ensure this process happens; it
> is like the 'scrubbing' that RAID VERIFY does on a 3ware controller.

Sorry to ask so much, but by "this process", do you mean the drive
internally doing the rewriting, avoiding the kickout, or do you mean
the raid system discovering the disk as faulty and kicking it? :)
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 21:42 UTC
To: Redeeman; +Cc: linux-raid

On Fri, 5 Dec 2008, Redeeman wrote:
> Okay. Do you happen to have any knowledge to pass on about current
> 1 TB disks?

I am still looking for some good 1 TB drives myself. I know one user
who has 12 of these, 11 in a RAID-5 array and 1 as a spare on a 12-port
3ware PCI-X card:

  SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s

I would really like to find a disk with a working NCQ implementation
under Linux and 3ware, and one that works well. In a single-disk
configuration, the WD 750G that I have used for over a year now has
been fine. The problem is finding good, reliable disks for use in a
raid configuration; that is when everything changes.

> Sorry to ask so much, but by "this process", do you mean the drive
> internally doing the rewriting, avoiding the kickout, or do you mean
> the raid system discovering the disk as faulty and kicking it? :)

The process being 'check' or 'repair', as noted by mikylie, who also
responded to this thread; see his response regarding check vs. repair.
There is a difference (for check under mdraid, see mikylie's note).
With RAID VERIFY on a 3ware controller, it will remap bad sectors to
other parts of the array as it comes across them, both during RAID
VERIFY and while the array is running live.

Justin.
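Distributions commonly schedule exactly this kind of scrub from cron
(Debian's mdadm package, for example, ships a comparable monthly
checkarray job). A minimal hypothetical sketch of a weekly variant; the
file path and schedule are assumptions:

    # /etc/cron.d/mdcheck: scrub all md arrays every Sunday at 04:00.
    # 'check' only reads and counts mismatches; 'repair' would also
    # rewrite inconsistent parity, so 'check' is the safer routine job.
    0 4 * * 0  root  for md in /sys/block/md*/md; do echo check > $md/sync_action; done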
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 22:09 UTC
To: Justin Piszcz; +Cc: linux-raid

On Fri, 2008-12-05 at 16:42 -0500, Justin Piszcz wrote:
> I am still looking for some good 1 TB drives myself. [...]
>   SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s

I guess those look pretty good.

I personally am running WD RE2 and Seagate ES.2 disks in raids (RAID-1)
without any issues at all, but hmm...
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 23:52 UTC
To: Redeeman; +Cc: linux-raid

On Fri, 5 Dec 2008, Redeeman wrote:
> I personally am running WD RE2 and Seagate ES.2 disks in raids
> (RAID-1) without any issues at all, but hmm...

Can you show the smartctl -a output for each of the disks in your
raids?
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-06 2:59 UTC
To: Justin Piszcz; +Cc: linux-raid

On Fri, 2008-12-05 at 18:52 -0500, Justin Piszcz wrote:
> Can you show the smartctl -a output for each of the disks in your
> raids?

This is a RAID-1 with one WD RE2-GP and one Seagate ES.2:

fileserver1:~# smartctl -a /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD1000FYPS-01ZKB0
Serial Number:    WD-WCASJ1247531
Firmware Version: 02.01B01
User Capacity:    1.000.203.804.160 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Dec  6 04:00:27 2008 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting
                                        command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (27960) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME           FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail Always      -       0
  3 Spin_Up_Time            0x0003   178   178   021    Pre-fail Always      -       8066
  4 Start_Stop_Count        0x0032   100   100   000    Old_age  Always      -       43
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail Always      -       0
  7 Seek_Error_Rate         0x000e   200   200   000    Old_age  Always      -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age  Always      -       3065
 10 Spin_Retry_Count        0x0012   100   253   000    Old_age  Always      -       0
 11 Calibration_Retry_Count 0x0012   100   253   000    Old_age  Always      -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age  Always      -       43
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age  Always      -       59
193 Load_Cycle_Count        0x0032   200   200   000    Old_age  Always      -       675
194 Temperature_Celsius     0x0022   122   108   000    Old_age  Always      -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age  Always      -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age  Always      -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age  Offline     -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age  Always      -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age  Offline     -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       589         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

fileserver1:~# smartctl -a /dev/sdb
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST31000340NS
Serial Number:    9QJ0RPJG
Firmware Version: SN05
User Capacity:    1.000.204.886.016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Dec  6 04:01:06 2008 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 650) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 237) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME           FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   063   044    Pre-fail Always      -       93375833
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail Always      -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age  Always      -       40
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail Always      -       1
  7 Seek_Error_Rate         0x000f   065   060   030    Pre-fail Always      -       21491681226
  9 Power_On_Hours          0x0032   097   097   000    Old_age  Always      -       3068
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail Always      -       0
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age  Always      -       41
184 Unknown_Attribute       0x0032   100   100   099    Old_age  Always      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age  Always      -       0
188 Unknown_Attribute       0x0032   100   090   000    Old_age  Always      -       60
189 High_Fly_Writes         0x003a   100   100   000    Old_age  Always      -       0
190 Airflow_Temperature_Cel 0x0022   067   055   045    Old_age  Always      -       33 (Lifetime Min/Max 18/37)
194 Temperature_Celsius     0x0022   033   045   000    Old_age  Always      -       33 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a   022   022   000    Old_age  Always      -       93375833
197 Current_Pending_Sector  0x0012   100   100   000    Old_age  Always      -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age  Offline     -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age  Always      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       589         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-06 9:23 UTC
To: Redeeman; +Cc: linux-raid

On Sat, 6 Dec 2008, Redeeman wrote:
> This is a RAID-1 with one WD RE2-GP and one Seagate ES.2:
> [smartctl output trimmed]

Just one reallocated sector. Curious, by the way: is there a reason you
do not run daily short SMART tests and weekly long tests? Nothing is
worse than rebuilding a RAID array and then having another disk fail
due to poor health. Check/repair/VERIFY also helps rid disks of bad
sectors (force-reallocate), but still...

Justin.
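The daily-short/weekly-long regimen suggested here is usually automated
with smartd. A minimal sketch of /etc/smartd.conf; the device names
match this particular box, but the schedule is just one reasonable
choice:

    # Monitor both raid members: -a tracks attributes and error logs,
    # and -s schedules a (S)hort self-test daily at 02:00 plus a
    # (L)ong self-test every Saturday (weekday 6) at 03:00.
    /dev/sda -a -s (S/../.././02|L/../../6/03) -m root
    /dev/sdb -a -s (S/../.././02|L/../../6/03) -m root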
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-06 14:33 UTC
To: Justin Piszcz; +Cc: linux-raid

On Sat, 2008-12-06 at 04:23 -0500, Justin Piszcz wrote:
> Just one reallocated sector. Curious, by the way: is there a reason
> you do not run daily short SMART tests and weekly long tests?

I actually thought I already had that set up on this box; I guess I
forgot it. :)
* Re: time limited error recovery and md raid
From: David Greaves @ 2008-12-06 9:14 UTC
To: Justin Piszcz; +Cc: Redeeman, linux-raid

Justin Piszcz wrote:
> I am still looking for some good 1 TB drives myself. I know one user
> who has 12 of these, 11 in a RAID-5 array and 1 as a spare on a
> 12-port 3ware PCI-X card:
>   SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s

I wouldn't.

I had 9 of these in 2 Dell SOHO servers; I RMA'd about 13 (yes, 13)
over a few months before exchanging them.

The HE ones are slightly better.

David
-- 
"Don't worry, you'll be fine; I saw it work in a cartoon once..."
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-06 9:59 UTC
To: David Greaves; +Cc: Redeeman, linux-raid

On Sat, 6 Dec 2008, David Greaves wrote:
> I had 9 of these in 2 Dell SOHO servers; I RMA'd about 13 (yes, 13)
> over a few months before exchanging them.

Ouch! How did they fail, and what kind of workload was imposed on them?
Yeah, I see the enterprise versions as well, at twice the cost; if
those are only 'slightly' better, I take it they fail quite often too?
* Re: time limited error recovery and md raid
From: Michal Soltys @ 2008-12-06 10:32 UTC
To: Justin Piszcz; +Cc: Redeeman, linux-raid

Justin Piszcz wrote:
>> And when TLER is on, why does Linux not attempt to remap the sector?
> See my earlier posts on this question from last week. It would need a
> metadata section on each HDD to keep track of the bad sectors before
> it writes to the drives; a 3ware card will do this for you.

From what I know, md does attempt to repair read errors (though not
other kinds of errors) with a rewrite, and only if that fails does it
kick the drive out of the array. With 1.x superblocks, the count of
repaired sectors (plus those that did not cause the drive to be kicked
out) is preserved across reboots as well (see Documentation/md.txt).
Peeking over the code seems to confirm it, too. It was on Neil's todo
list, and it got implemented a while ago, afaik.

There was a short discussion about this on the smartmontools mailing
list recently:

http://sourceforge.net/mailarchive/forum.php?thread_name=VA.00003465.16ed4a93%40news.conactive.com&forum_name=smartmontools-support

There was no 100% answer there either, though. Neil?
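The persistent per-device error count described here is visible in
sysfs. A minimal sketch for inspecting it; the array name is a
placeholder:

    # For each member of md0, print how many read errors md has
    # corrected on that device without evicting it. With 1.x
    # superblocks this counter survives reboots (Documentation/md.txt).
    for dev in /sys/block/md0/md/dev-*; do
        echo "$dev: $(cat $dev/errors) corrected read errors"
    done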
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-06 10:53 UTC
To: Michal Soltys; +Cc: Redeeman, linux-raid

On Sat, 6 Dec 2008, Michal Soltys wrote:
> From what I know, md does attempt to repair read errors (though not
> other kinds of errors) with a rewrite, and only if that fails does it
> kick the drive out of the array.

It appears so:

[ 1215.654712] raid5:md3: read error not correctable (sector 213000256 on sde1).
[ 1215.654765] raid5: Disk failure on sde1, disabling device.
[ 1215.654766] raid5: Operation continuing on 8 devices.
[ 1215.655412] raid5:md3: read error not correctable (sector 213000264 on sde1).
[ 1215.655473] raid5:md3: read error not correctable (sector 213000272 on sde1).
[ 1215.655533] raid5:md3: read error not correctable (sector 213000280 on sde1).
[ 1215.655592] raid5:md3: read error not correctable (sector 213000288 on sde1).
[ 1215.655644] raid5:md3: read error not correctable (sector 213000296 on sde1).
[ 1215.655694] raid5:md3: read error not correctable (sector 213000304 on sde1).
[ 1215.655746] raid5:md3: read error not correctable (sector 213000312 on sde1).
[ 1215.655800] raid5:md3: read error not correctable (sector 213000320 on sde1).
[ 1215.655852] raid5:md3: read error not correctable (sector 213000328 on sde1).

Justin.
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 23:04 UTC
To: Justin Piszcz; +Cc: linux-raid

On Fri, 2008-12-05 at 16:01 -0500, Justin Piszcz wrote:
> The exact time is a good question.

About this: shouldn't it be possible to find the exact time the Linux
raid system waits before it automatically kicks the drive out?

> Something I have noticed is that when TLER is off, the drives hang
> when they hit a bad sector, and when TLER is on, the drive is kicked
> out of the array immediately when it reports a bad sector.

On further thought, might this not suggest that the Linux raid system
waits indefinitely on an I/O error before kicking the drive out?

If that is indeed the case, then if the use of the raid isn't
time-critical, maybe it's a good thing to have TLER enabled, just in
case the disk is able to fix things?
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 23:52 UTC
To: Redeeman; +Cc: linux-raid

On Sat, 6 Dec 2008, Redeeman wrote:
> On further thought, might this not suggest that the Linux raid system
> waits indefinitely on an I/O error before kicking the drive out?

It times out after 1-2 minutes. I have been dealing with this for a
very long time; another 'symptom' is short SMART tests taking forever
when the disk is in a soon-to-be-failing state.

> If that is indeed the case, then if the use of the raid isn't
> time-critical, maybe it's a good thing to have TLER enabled, just in
> case the disk is able to fix things?

With TLER enabled, the drive is kicked out immediately. The purpose of
TLER is that of HW raid, where the drive can alert the controller about
the error and the controller handles it.
* Re: time limited error recovery and md raid
From: Roger Heflin @ 2008-12-06 0:42 UTC
To: Justin Piszcz; +Cc: Redeeman, linux-raid

Justin Piszcz wrote:
> With TLER enabled, the drive is kicked out immediately. The purpose
> of TLER is that of HW raid, where the drive can alert the controller
> about the error and the controller handles it.

Without TLER, the raid controller will typically declare the disk dead
eventually (after 30 seconds or so) and go on about its business. The
only thing TLER buys you is the 23-second faster timeout before the
disk is declared dead, and if you are doing something time-critical,
the 7-second timeout is still potentially very troublesome. I would
have thought TLER had more value set quite a bit lower than 7 seconds.
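On the Linux side, the rough analogue of that 30-second controller
limit is the SCSI layer's per-device command timeout, which is tunable
through sysfs. A minimal sketch; the device name is a placeholder, and
whether changing the value is wise depends on the drive's own ERC
behavior:

    # Show the current command timeout in seconds (commonly 30).
    # Exceeding it triggers the abort/reset sequence that eventually
    # gets a non-TLER drive dropped from the array.
    cat /sys/block/sde/device/timeout

    # Raise it if the drives lack TLER/ERC and long internal recovery
    # is acceptable; keep it longer than the drive's own ERC limit.
    echo 120 > /sys/block/sde/device/timeout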
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-08 16:59 UTC
To: Justin Piszcz; +Cc: linux-raid

On Fri, 2008-12-05 at 18:52 -0500, Justin Piszcz wrote:
> It times out after 1-2 minutes. I have been dealing with this for a
> very long time; another 'symptom' is short SMART tests taking forever
> when the disk is in a soon-to-be-failing state.

Do you know if this is actually caused by the disk's error recovery
timing out after 1-2 minutes, or by the Linux raid system?
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-08 17:01 UTC
To: Redeeman; +Cc: linux-raid

On Mon, 8 Dec 2008, Redeeman wrote:
> Do you know if this is actually caused by the disk's error recovery
> timing out after 1-2 minutes, or by the Linux raid system?

I am not sure; that is a good question. I am not sure how long the
drive itself will spend 'searching' or trying to repair/read a bad
sector with TLER off, versus when the kernel will drop the disk for not
responding.

Justin.
Thread overview: 22+ messages

2008-12-05 20:57 time limited error recovery and md raid  Redeeman
2008-12-05 21:01 ` Justin Piszcz
2008-12-05 21:07  ` Redeeman
2008-12-05 21:12   ` Justin Piszcz
2008-12-05 21:18    ` Redeeman
2008-12-05 21:21     ` Justin Piszcz
2008-12-05 21:31      ` Redeeman
2008-12-05 21:42       ` Justin Piszcz
2008-12-05 22:09        ` Redeeman
2008-12-05 23:52         ` Justin Piszcz
2008-12-06  2:59          ` Redeeman
2008-12-06  9:23           ` Justin Piszcz
2008-12-06 14:33            ` Redeeman
2008-12-06  9:14        ` David Greaves
2008-12-06  9:59         ` Justin Piszcz
2008-12-06 10:32    ` Michal Soltys
2008-12-06 10:53     ` Justin Piszcz
2008-12-05 23:04  ` Redeeman
2008-12-05 23:52   ` Justin Piszcz
2008-12-06  0:42    ` Roger Heflin
2008-12-08 16:59    ` Redeeman
2008-12-08 17:01     ` Justin Piszcz