* time limited error recovery and md raid
From: Redeeman @ 2008-12-05 20:57 UTC
To: linux-raid
Hello..
I'm going to be building a software RAID-6 setup with probably 8 disks,
and I've been looking at the WD GP disks, which come in both a standard
and a RAID edition, with the RAID edition being much more expensive.
I have searched around and found that it is indeed possible to activate
TLER on the "normal" disks; however, the setting takes a parameter, namely
how many seconds error recovery should be limited to. The default is 7.
So I was wondering: what should that be set to in order to be optimal for
Linux md RAID? I haven't been able to find any information about this.
Thanks.
Best regards,
Kasper Sandberg
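For reference: on drives that honor SCT Error Recovery Control, newer
smartmontools releases (more recent than the 5.38 seen later in this
thread) can query and set this timeout from a running system. A minimal
sketch, assuming the drive accepts the commands; /dev/sda is a placeholder
and the times are given in tenths of a second:
  # show the current read/write error-recovery limits, if supported
  smartctl -l scterc /dev/sda
  # limit both read and write recovery to 7.0 seconds (70 deciseconds)
  smartctl -l scterc,70,70 /dev/sda
On most desktop drives the setting does not survive a power cycle, so it
would have to be reapplied at boot for every array member.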
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 21:01 UTC
To: Redeeman; +Cc: linux-raid
On Fri, 5 Dec 2008, Redeeman wrote:
> [...]
The exact time is a good question.
Something I have noticed is that when TLER is off, the drives hang when
they hit a bad sector, and when TLER is on, the drive is kicked out of the
array immediately when it reports a bad sector.
Justin.
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 21:07 UTC
To: Justin Piszcz; +Cc: linux-raid
On Fri, 2008-12-05 at 16:01 -0500, Justin Piszcz wrote:
> [...]
> The exact time is a good question.
>
> Something I have noticed is that when TLER is off, the drives hang when
> they hit a bad sector, and when TLER is on, the drive is kicked out of the
> array immediately when it reports a bad sector.
First: does this happen with any frequency on these disks? Would you
recommend other disks?
Second, when TLER is off, does it hang forever, or just until it gives up?
And when TLER is on, why does Linux not attempt to remap the sector?
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 21:12 UTC
To: Redeeman; +Cc: linux-raid
On Fri, 5 Dec 2008, Redeeman wrote:
> [...]
>> The exact time is a good question.
>>
>> Something I have noticed is that when TLER is off, the drives hang when
>> they hit a bad sector, and when TLER is on, the drive is kicked out of the
>> array immediately when it reports a bad sector.
>
> First: does this happen with any frequency on these disks? Would you
> recommend other disks?
The disks I have are VelociRaptors; I will be reverting to my old
Raptor 150s shortly to ensure everything is fine with them before I go
buying new hard drives, etc. It happened every week or two with the
VelociRaptors: bad drives, or drives getting kicked out of the array over
and over again.
> Second, when TLER is off, does it hang forever, or just until it gives up?
When TLER is off it hangs for 1-2 minutes, and then you will see a timeout
in the dmesg/kernel log; sometimes it kicks the drive out of the array, and
other times it 'hard' resets the drive and the array is able to continue
operating normally.
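The kernel-side half of that window is the SCSI command timer, which can
be inspected and raised per device. A sketch, with sda as a placeholder
(the 30-second default may vary by distribution):
  # current command timeout in seconds (typically 30)
  cat /sys/block/sda/device/timeout
  # give a non-TLER desktop drive more time to finish its internal recovery
  echo 120 > /sys/block/sda/device/timeout
With retries and bus resets stacked on top of the 30-second timer, a total
stall of 1-2 minutes as described above is plausible.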
> And when TLER is on, why does Linux not attempt to remap the sector?
See my earlier posts on this question from last week. md would need a
metadata section on each HDD to keep track of the bad sectors before it
writes to the drives; a 3ware card will do this for you. However, having
run md/Linux for a number of years, it is typically not strictly necessary
if you run checks on your disks once a week *AND* you have good drives
that don't have problems.
Justin.
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 21:18 UTC
To: Justin Piszcz; +Cc: linux-raid
On Fri, 2008-12-05 at 16:12 -0500, Justin Piszcz wrote:
> [...]
> > First: does this happen with any frequency on these disks? Would you
> > recommend other disks?
> The disks I have are VelociRaptors; I will be reverting to my old
> Raptor 150s shortly to ensure everything is fine with them before I go
> buying new hard drives, etc. It happened every week or two with the
> VelociRaptors: bad drives, or drives getting kicked out of the array over
> and over again.
>
> > Second, when TLER is off, does it hang forever, or just until it gives up?
> When TLER is off it hangs for 1-2 minutes, and then you will see a timeout
> in the dmesg/kernel log; sometimes it kicks the drive out of the array, and
> other times it 'hard' resets the drive and the array is able to continue
> operating normally.
>
> > And when TLER is on, why does Linux not attempt to remap the sector?
> See my earlier posts on this question from last week. md would need a
> metadata section on each HDD to keep track of the bad sectors before it
> writes to the drives; a 3ware card will do this for you. However,
I thought disks had this internally, and could be prompted to do it by
writing to the sector?
> having run md/Linux for a number of years, it is typically not strictly
> necessary if you run checks on your disks once a week *AND* you have good
> drives that don't have problems.
Obviously I intend to replace broken disks as I detect them, but as I
understand it, it is a fairly common case that disks will get bad sectors
over time and remap them internally?
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 21:21 UTC
To: Redeeman; +Cc: linux-raid
On Fri, 5 Dec 2008, Redeeman wrote:
> [...]
> I thought disks had this internally, and could be prompted to do it by
> writing to the sector?
Yes -- and that is what I did over and over again (dd if=/dev/zero
of=/dev/dsk); it ran OK for 1-2 days but then it started erroring again
with the VelociRaptors. With some old 400GB Seagates, I did the same thing
and their pending sector list rose, but the drives still remained working
for 1-2 years after. The problems I mention are only applicable to my
latest experience with VelociRaptor HDDs.
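A more surgical variant of the same idea is to rewrite only the sector
reported in the kernel log rather than the whole disk. A sketch; the
device name and sector number are placeholders, and a wrong seek value
destroys data:
  # rewrite one 512-byte sector that was reported as pending/unreadable
  dd if=/dev/zero of=/dev/sdX bs=512 count=1 seek=213000256 oflag=direct
  # the pending count should drop if the drive reallocated the sector
  smartctl -A /dev/sdX | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector'
If the disk is an active md member, it is safer to let md's own
rewrite-on-read-error path or a scrub handle this, since a direct write
bypasses the array.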
>> However, having run md/Linux for a number of years, it is typically not
>> strictly necessary if you run checks on your disks once a week *AND* you
>> have good drives that don't have problems.
> Obviously I intend to replace broken disks as I detect them, but as I
> understand it, it is a fairly common case that disks will get bad sectors
> over time and remap them internally?
Yes, and I believe md's 'check' helps to ensure this process happens (it
is like the 'scrubbing' done by RAID VERIFY on a 3ware controller).
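For md, that scrub can be started through sysfs. A sketch, with md0 as a
placeholder (many distributions ship a cron job that does this
periodically):
  # read every stripe; unreadable blocks are reconstructed and rewritten
  echo check > /sys/block/md0/md/sync_action
  # watch progress and see how many inconsistencies were found
  cat /proc/mdstat
  cat /sys/block/md0/md/mismatch_cnt
  # 'repair' additionally rewrites stripes whose parity does not match
  echo repair > /sys/block/md0/md/sync_action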
Justin.
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 21:31 UTC
To: Justin Piszcz; +Cc: linux-raid
On Fri, 2008-12-05 at 16:21 -0500, Justin Piszcz wrote:
> [...]
> > I thought disks had this internally, and could be prompted to do it by
> > writing to the sector?
> Yes -- and that is what I did over and over again (dd if=/dev/zero
> of=/dev/dsk); it ran OK for 1-2 days but then it started erroring again
> with the VelociRaptors. With some old 400GB Seagates, I did the same thing
> and their pending sector list rose, but the drives still remained working
> for 1-2 years after. The problems I mention are only applicable to my
> latest experience with VelociRaptor HDDs.
Okay. Do you happen to have any knowledge to pass on about current 1TB
disks?
>
> >> However, having run md/Linux for a number of years, it is typically not
> >> strictly necessary if you run checks on your disks once a week *AND* you
> >> have good drives that don't have problems.
> > Obviously I intend to replace broken disks as I detect them, but as I
> > understand it, it is a fairly common case that disks will get bad sectors
> > over time and remap them internally?
> Yes, and I believe md's 'check' helps to ensure this process happens (it
> is like the 'scrubbing' done by RAID VERIFY on a 3ware controller).
Sorry to ask so much, but by "this process", do you mean the drive
internally doing the rewriting, avoiding the kick-out, or do you mean the
RAID system discovering the disk as faulty and kicking it? :)
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 21:42 UTC
To: Redeeman; +Cc: linux-raid
On Fri, 5 Dec 2008, Redeeman wrote:
> [...]
> Okay. Do you happen to have any knowledge to pass on about current 1TB
> disks?
I am still looking for some good 1TB drives myself. I know one user who
has 12 of these, 11 in a RAID-5 array and 1 as a spare on a 12-port 3ware
PCI-X card:
SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s Hard Drive
I really would like to find a disk with a working NCQ implementation in
Linux and 3ware, and one that works well. In a single-disk configuration,
the WD 750GB that I have used for over a year now has been fine; the
problem is finding good, reliable disks for use in a RAID configuration,
and that is when everything changes.
> Sorry to ask so much, but by "this process", do you mean the drive
> internally doing the rewriting, avoiding the kick-out, or do you mean the
> RAID system discovering the disk as faulty and kicking it? :)
The process being 'check' or 'repair', as noted by mikylie, who also
responded to this thread; see his response regarding check vs. repair.
There is a difference: for check with md RAID, see mikylie's note; with
RAID VERIFY on a 3ware controller, it will remap bad sectors to other
parts of the array as it comes across them, both during RAID VERIFY and
while the array is running live.
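On the 3ware side the equivalent can be driven from Linux with the tw_cli
utility. A sketch; the controller and unit numbers are placeholders and
the exact syntax may vary by firmware release:
  # list controllers/units, then start a verify on unit 0 of controller 0
  tw_cli show
  tw_cli /c0/u0 start verify
  # poll the verify status
  tw_cli /c0/u0 show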
Justin.
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 22:09 UTC
To: Justin Piszcz; +Cc: linux-raid
On Fri, 2008-12-05 at 16:42 -0500, Justin Piszcz wrote:
> [...]
> > Okay. Do you happen to have any knowledge to pass on about current 1TB
> > disks?
> I am still looking for some good 1TB drives myself. I know one user who
> has 12 of these, 11 in a RAID-5 array and 1 as a spare on a 12-port 3ware
> PCI-X card:
> SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s Hard Drive
I guess those look pretty good.
I am personally running WD RE2 and Seagate ES.2 disks in RAID-1 arrays
without any issues at all, but hmm..
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 23:04 UTC
To: Justin Piszcz; +Cc: linux-raid
On Fri, 2008-12-05 at 16:01 -0500, Justin Piszcz wrote:
> [...]
> The exact time is a good question.
About this: shouldn't it be possible to find the exact time that the
Linux RAID system waits before it automatically kicks out the drive?
>
> Something I have noticed is that when TLER is off, the drives hang when
> they hit a bad sector, and when TLER is on, the drive is kicked out of the
> array immediately when it reports a bad sector.
On further thought, might this not suggest that the Linux RAID system
waits indefinitely for an I/O error before kicking out?
If that is indeed the case, and if the use of the RAID isn't
time-critical, then maybe it's a good thing to have TLER enabled, just in
case the disk is able to fix stuff?
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 23:52 UTC
To: Redeeman; +Cc: linux-raid
On Sat, 6 Dec 2008, Redeeman wrote:
> [...]
>> The exact time is a good question.
>
> About this: shouldn't it be possible to find the exact time that the
> Linux RAID system waits before it automatically kicks out the drive?
>
>> Something I have noticed is that when TLER is off, the drives hang when
>> they hit a bad sector, and when TLER is on, the drive is kicked out of the
>> array immediately when it reports a bad sector.
>
> On further thought, might this not suggest that the Linux RAID system
> waits indefinitely for an I/O error before kicking out?
It times out after 1-2 minutes. I have been dealing with this for a very
long time; another 'symptom' is short SMART tests taking forever when the
disk is in a soon-to-be-failing state.
>
> If that is indeed the case, and if the use of the RAID isn't
> time-critical, then maybe it's a good thing to have TLER enabled, just in
> case the disk is able to fix stuff?
With TLER enabled the drive is kicked out immediately. The purpose of TLER
is that of HW RAID, where the drive can alert the controller about the
error quickly and the controller handles it.
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 23:52 UTC
To: Redeeman; +Cc: linux-raid
On Fri, 5 Dec 2008, Redeeman wrote:
> [...]
>>> Okay. Do you happen to have any knowledge to pass on about current 1TB
>>> disks?
>> I am still looking for some good 1TB drives myself. I know one user who
>> has 12 of these, 11 in a RAID-5 array and 1 as a spare on a 12-port 3ware
>> PCI-X card:
>> SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s Hard Drive
> I guess those look pretty good.
>
> I am personally running WD RE2 and Seagate ES.2 disks in RAID-1 arrays
> without any issues at all, but hmm..
Can you show the smartctl -a output for each of the disks in your RAIDs?
* Re: time limited error recovery and md raid
From: Roger Heflin @ 2008-12-06 0:42 UTC
To: Justin Piszcz; +Cc: Redeeman, linux-raid
Justin Piszcz wrote:
>
>> On further thought, might this not suggest that the Linux RAID system
>> waits indefinitely for an I/O error before kicking out?
> It times out after 1-2 minutes. I have been dealing with this for a very
> long time; another 'symptom' is short SMART tests taking forever when the
> disk is in a soon-to-be-failing state.
>>
>> If that is indeed the case, and if the use of the RAID isn't
>> time-critical, then maybe it's a good thing to have TLER enabled, just in
>> case the disk is able to fix stuff?
> With TLER enabled the drive is kicked out immediately. The purpose of TLER
> is that of HW RAID, where the drive can alert the controller about the
> error quickly and the controller handles it.
Without TLER, typically the RAID controller will eventually (after 30
seconds or so) declare the disk dead and go on about its business. The
only thing TLER saves you is the 23-second-faster timeout before the disk
is declared dead, and if you are doing something critical, the 7-second
timeout is still potentially very troublesome. I would have thought TLER
had more value set quite a bit lower than 7 seconds.
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-06 2:59 UTC
To: Justin Piszcz; +Cc: linux-raid
On Fri, 2008-12-05 at 18:52 -0500, Justin Piszcz wrote:
> [...]
> >>> Okay. Do you happen to have any knowledge to pass on about current 1TB
> >>> disks?
> >> I am still looking for some good 1TB drives myself. I know one user who
> >> has 12 of these, 11 in a RAID-5 array and 1 as a spare on a 12-port 3ware
> >> PCI-X card:
> >> SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s Hard Drive
> > I guess those look pretty good.
> >
> > I am personally running WD RE2 and Seagate ES.2 disks in RAID-1 arrays
> > without any issues at all, but hmm..
>
> Can you show the smartctl -a output for each of the disks in your RAIDs?
This is a RAID-1 with 1x WD RE2-GP and 1x Seagate ES.2:
fileserver1:~# smartctl -a /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD1000FYPS-01ZKB0
Serial Number:    WD-WCASJ1247531
Firmware Version: 02.01B01
User Capacity:    1.000.203.804.160 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Dec  6 04:00:27 2008 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever been run.
Total time to complete Offline
data collection:                (27960) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  200   200   051    Pre-fail Always      -       0
  3 Spin_Up_Time            0x0003  178   178   021    Pre-fail Always      -       8066
  4 Start_Stop_Count        0x0032  100   100   000    Old_age  Always      -       43
  5 Reallocated_Sector_Ct   0x0033  200   200   140    Pre-fail Always      -       0
  7 Seek_Error_Rate         0x000e  200   200   000    Old_age  Always      -       0
  9 Power_On_Hours          0x0032  096   096   000    Old_age  Always      -       3065
 10 Spin_Retry_Count        0x0012  100   253   000    Old_age  Always      -       0
 11 Calibration_Retry_Count 0x0012  100   253   000    Old_age  Always      -       0
 12 Power_Cycle_Count       0x0032  100   100   000    Old_age  Always      -       43
192 Power-Off_Retract_Count 0x0032  200   200   000    Old_age  Always      -       59
193 Load_Cycle_Count        0x0032  200   200   000    Old_age  Always      -       675
194 Temperature_Celsius     0x0022  122   108   000    Old_age  Always      -       30
196 Reallocated_Event_Count 0x0032  200   200   000    Old_age  Always      -       0
197 Current_Pending_Sector  0x0012  200   200   000    Old_age  Always      -       0
198 Offline_Uncorrectable   0x0010  200   200   000    Old_age  Offline     -       0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always      -       0
200 Multi_Zone_Error_Rate   0x0008  200   200   000    Old_age  Offline     -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        589        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
fileserver1:~# smartctl -a /dev/sdb
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST31000340NS
Serial Number:    9QJ0RPJG
Firmware Version: SN05
User Capacity:    1.000.204.886.016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Dec  6 04:01:06 2008 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever been run.
Total time to complete Offline
data collection:                 ( 650) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 237) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  079   063   044    Pre-fail Always      -       93375833
  3 Spin_Up_Time            0x0003  099   099   000    Pre-fail Always      -       0
  4 Start_Stop_Count        0x0032  100   100   020    Old_age  Always      -       40
  5 Reallocated_Sector_Ct   0x0033  100   100   036    Pre-fail Always      -       1
  7 Seek_Error_Rate         0x000f  065   060   030    Pre-fail Always      -       21491681226
  9 Power_On_Hours          0x0032  097   097   000    Old_age  Always      -       3068
 10 Spin_Retry_Count        0x0013  100   100   097    Pre-fail Always      -       0
 12 Power_Cycle_Count       0x0032  100   037   020    Old_age  Always      -       41
184 Unknown_Attribute       0x0032  100   100   099    Old_age  Always      -       0
187 Reported_Uncorrect      0x0032  100   100   000    Old_age  Always      -       0
188 Unknown_Attribute       0x0032  100   090   000    Old_age  Always      -       60
189 High_Fly_Writes         0x003a  100   100   000    Old_age  Always      -       0
190 Airflow_Temperature_Cel 0x0022  067   055   045    Old_age  Always      -       33 (Lifetime Min/Max 18/37)
194 Temperature_Celsius     0x0022  033   045   000    Old_age  Always      -       33 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a  022   022   000    Old_age  Always      -       93375833
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always      -       0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline     -       0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        589        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
* Re: time limited error recovery and md raid
From: David Greaves @ 2008-12-06 9:14 UTC
To: Justin Piszcz; +Cc: Redeeman, linux-raid
Justin Piszcz wrote:
> [...]
>> Okay. Do you happen to have any knowledge to pass on about current 1TB
>> disks?
> I am still looking for some good 1TB drives myself. I know one user who
> has 12 of these, 11 in a RAID-5 array and 1 as a spare on a 12-port 3ware
> PCI-X card:
> SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s Hard Drive
I wouldn't.
I had 9 of these in 2 Dell SOHO servers; I RMA'd about 13 (yes 13) over a few
months before exchanging them.
The HE ones are slightly better.
David
--
"Don't worry, you'll be fine; I saw it work in a cartoon once..."
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-06 9:23 UTC
To: Redeeman; +Cc: linux-raid
On Sat, 6 Dec 2008, Redeeman wrote:
> [...]
>> Can you show the smartctl -a output for each of the disks in your RAIDs?
> This is a RAID-1 with 1x WD RE2-GP and 1x Seagate ES.2:
Just 1 reallocated sector. Curious, by the way: is there a reason you do
not do daily short SMART tests and weekly long tests? Nothing is worse
than rebuilding a RAID array and then having another disk fail due to
poor health. Check/repair/VERIFY also helps to rid disks of bad sectors
(forcing reallocation), but still..
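Such a schedule can be expressed either in smartd.conf or as plain cron
entries. A sketch, assuming a hypothetical cron file and placeholder
device names:
  # /etc/cron.d/smart-selftests (hypothetical file)
  # daily short self-test at 02:00, weekly long self-test Saturdays at 03:00
  0 2 * * *  root  smartctl -t short /dev/sda
  0 3 * * 6  root  smartctl -t long  /dev/sda
The results land in the self-test log shown above (smartctl -l selftest),
and smartd can mail a warning when a test fails.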
Justin.
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-06 9:59 UTC
To: David Greaves; +Cc: Redeeman, linux-raid
On Sat, 6 Dec 2008, David Greaves wrote:
> [...]
>> I am still looking for some good 1TB drives myself. I know one user who
>> has 12 of these, 11 in a RAID-5 array and 1 as a spare on a 12-port 3ware
>> PCI-X card:
>> SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s Hard Drive
> I wouldn't.
>
> I had 9 of these in 2 Dell SOHO servers; I RMA'd about 13 (yes, 13) over
> a few months before exchanging them.
Ouch! How did they fail, and what kind of workload was imposed on them?
Yeah, I see the enterprise disks as well for 2x the cost, 'slightly'
better; I take it they fail quite often as well?
>
> The HE ones are slightly better.
>
> David
* Re: time limited error recovery and md raid
From: Michal Soltys @ 2008-12-06 10:32 UTC
To: Justin Piszcz; +Cc: Redeeman, linux-raid
Justin Piszcz wrote:
>
>> And when TLER is on, why does Linux not attempt to remap the sector?
> See my earlier posts on this question from last week. md would need a
> metadata section on each HDD to keep track of the bad sectors before it
> writes to the drives; a 3ware card will do this for you. However, having
> run md/Linux for a number of years, it is typically not strictly necessary
> if you run checks on your disks once a week *AND* you have good drives
> that don't have problems.
>
From what I know, md does attempt to repair read errors (though not other
kinds) with a rewrite, and only if that fails will it kick the drive out
of the array. In the case of 1.x superblocks, the count of repaired
sectors (plus the ones that did not cause the drive to be kicked out) is
preserved across reboots as well (check Documentation/md.txt). Peeking
over the code seems to confirm this.
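The persistent counter Documentation/md.txt describes is exposed per
member device in sysfs. A sketch, with md0 and sda1 as placeholders:
  # read errors seen on this member that did not get it evicted;
  # with version-1 metadata the value persists across restarts
  cat /sys/block/md0/md/dev-sda1/errors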
It was on Neil's todo list, and it got implemented a while ago, AFAIK.
There was a short discussion on the smartmontools mailing list recently:
http://sourceforge.net/mailarchive/forum.php?thread_name=VA.00003465.16ed4a93%40news.conactive.com&forum_name=smartmontools-support
There was no 100% answer there either, though.
Neil?
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-06 10:53 UTC
To: Michal Soltys; +Cc: Redeeman, linux-raid
On Sat, 6 Dec 2008, Michal Soltys wrote:
> [...]
> From what I know, md does attempt to repair read errors (though not other
> kinds) with a rewrite, and only if that fails will it kick the drive out
> of the array. In the case of 1.x superblocks, the count of repaired
> sectors (plus the ones that did not cause the drive to be kicked out) is
> preserved across reboots as well (check Documentation/md.txt). Peeking
> over the code seems to confirm this.
It appears so:
[ 1215.654712] raid5:md3: read error not correctable (sector 213000256 on sde1).
[ 1215.654765] raid5: Disk failure on sde1, disabling device.
[ 1215.654766] raid5: Operation continuing on 8 devices.
[ 1215.655412] raid5:md3: read error not correctable (sector 213000264 on sde1).
[ 1215.655473] raid5:md3: read error not correctable (sector 213000272 on sde1).
[ 1215.655533] raid5:md3: read error not correctable (sector 213000280 on sde1).
[ 1215.655592] raid5:md3: read error not correctable (sector 213000288 on sde1).
[ 1215.655644] raid5:md3: read error not correctable (sector 213000296 on sde1).
[ 1215.655694] raid5:md3: read error not correctable (sector 213000304 on sde1).
[ 1215.655746] raid5:md3: read error not correctable (sector 213000312 on sde1).
[ 1215.655800] raid5:md3: read error not correctable (sector 213000320 on sde1).
[ 1215.655852] raid5:md3: read error not correctable (sector 213000328 on sde1).
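Once a member has been failed out like this, and assuming the underlying
sectors have been dealt with, it can be returned to service. A sketch
using the device names from the log above; a full resync follows:
  # clear the failed slot, then re-add the disk to trigger a rebuild
  mdadm /dev/md3 --remove /dev/sde1
  mdadm /dev/md3 --add /dev/sde1
  # watch the recovery
  cat /proc/mdstat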
Justin.
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-06 14:33 UTC
To: Justin Piszcz; +Cc: linux-raid
On Sat, 2008-12-06 at 04:23 -0500, Justin Piszcz wrote:
> [...]
> Just 1 reallocated sector. Curious, by the way: is there a reason you do
> not do daily short SMART tests and weekly long tests? Nothing is worse
> than rebuilding a RAID array and then having another disk fail due to
> poor health. Check/repair/VERIFY also helps to rid disks of bad sectors
> (forcing reallocation), but still..
I actually thought I already had that set up here; I guess I forgot it :)
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-08 16:59 UTC
To: Justin Piszcz; +Cc: linux-raid
On Fri, 2008-12-05 at 18:52 -0500, Justin Piszcz wrote:
> [...]
> >> The exact time is a good question.
> >
> > About this: shouldn't it be possible to find the exact time that the
> > Linux RAID system waits before it automatically kicks out the drive?
> >
> >> Something I have noticed is that when TLER is off, the drives hang when
> >> they hit a bad sector, and when TLER is on, the drive is kicked out of
> >> the array immediately when it reports a bad sector.
> >
> > On further thought, might this not suggest that the Linux RAID system
> > waits indefinitely for an I/O error before kicking out?
> It times out after 1-2 minutes. I have been dealing with this for a very
> long time; another 'symptom' is short SMART tests taking forever when the
> disk is in a soon-to-be-failing state.
Do you know if this is actually caused by the disk's error recovery
timing out in 1-2 minutes, or by the Linux RAID system?
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-08 17:01 UTC
To: Redeeman; +Cc: linux-raid
On Mon, 8 Dec 2008, Redeeman wrote:
> [...]
>> It times out after 1-2 minutes. I have been dealing with this for a very
>> long time; another 'symptom' is short SMART tests taking forever when the
>> disk is in a soon-to-be-failing state.
>
> Do you know if this is actually caused by the disk's error recovery
> timing out in 1-2 minutes, or by the Linux RAID system?
I am not sure; that is a good question. I am not sure how long the drive
itself will spend 'searching' or trying to repair/read a bad sector with
TLER off, versus when the kernel will drop the disk for not responding.
Justin.