* time limited error recovery and md raid
From: Redeeman @ 2008-12-05 20:57 UTC
To: linux-raid
Hello..
I'm going to be building a software RAID-6 setup with probably 8 disks,
and I've been looking at the WD GP disks, which come in both a standard
and a RAID edition, with the RAID edition being much more expensive.
I have searched around and found that it is indeed possible to activate
TLER on the "normal" disks; however, the setting takes a parameter, namely
how many seconds error recovery should be limited to. The default is 7.
So I was wondering: what should that be set to in order to be optimal for
Linux md RAID? I haven't been able to find any information about this.
Thanks.
Best regards,
Kasper Sandberg
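For reference: on drives that honor SCT Error Recovery Control, newer
smartmontools releases (more recent than the 5.38 seen later in this
thread) can query and set this timeout from a running system. A minimal
sketch, assuming the drive accepts the commands; /dev/sda is a placeholder
and the times are given in tenths of a second:
  # show the current read/write error-recovery limits, if supported
  smartctl -l scterc /dev/sda
  # limit both read and write recovery to 7.0 seconds (70 deciseconds)
  smartctl -l scterc,70,70 /dev/sda
On most desktop drives the setting does not survive a power cycle, so it
would have to be reapplied at boot for every array member.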
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 21:01 UTC
To: Redeeman; +Cc: linux-raid
On Fri, 5 Dec 2008, Redeeman wrote:
> [...]
The exact time is a good question.
Something I have noticed is that when TLER is off, the drives hang when
they hit a bad sector, and when TLER is on, the drive is kicked out of the
array immediately when it reports a bad sector.
Justin.
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 21:07 UTC
To: Justin Piszcz; +Cc: linux-raid
On Fri, 2008-12-05 at 16:01 -0500, Justin Piszcz wrote:
> [...]
> The exact time is a good question.
>
> Something I have noticed is that when TLER is off, the drives hang when
> they hit a bad sector, and when TLER is on, the drive is kicked out of the
> array immediately when it reports a bad sector.
First: does this happen with any frequency on these disks? Would you
recommend other disks?
Second, when TLER is off, does it hang forever, or just until it gives up?
And when TLER is on, why does Linux not attempt to remap the sector?
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 21:12 UTC
To: Redeeman; +Cc: linux-raid
On Fri, 5 Dec 2008, Redeeman wrote:
> [...]
>> The exact time is a good question.
>>
>> Something I have noticed is that when TLER is off, the drives hang when
>> they hit a bad sector, and when TLER is on, the drive is kicked out of the
>> array immediately when it reports a bad sector.
>
> First: does this happen with any frequency on these disks? Would you
> recommend other disks?
The disks I have are VelociRaptors; I will be reverting to my old
Raptor 150s shortly to ensure everything is fine with them before I go
buying new hard drives, etc. It happened every week or two with the
VelociRaptors: bad drives, or drives getting kicked out of the array over
and over again.
> Second, when TLER is off, does it hang forever, or just until it gives up?
When TLER is off it hangs for 1-2 minutes, and then you will see a timeout
in the dmesg/kernel log; sometimes it kicks the drive out of the array, and
other times it 'hard' resets the drive and the array is able to continue
operating normally.
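The kernel-side half of that window is the SCSI command timer, which can
be inspected and raised per device. A sketch, with sda as a placeholder
(the 30-second default may vary by distribution):
  # current command timeout in seconds (typically 30)
  cat /sys/block/sda/device/timeout
  # give a non-TLER desktop drive more time to finish its internal recovery
  echo 120 > /sys/block/sda/device/timeout
With retries and bus resets stacked on top of the 30-second timer, a total
stall of 1-2 minutes as described above is plausible.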
> And when TLER is on, why does Linux not attempt to remap the sector?
See my earlier posts on this question from last week. md would need a
metadata section on each HDD to keep track of the bad sectors before it
writes to the drives; a 3ware card will do this for you. However, having
run md/Linux for a number of years, it is typically not strictly necessary
if you run checks on your disks once a week *AND* you have good drives
that don't have problems.
Justin.
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 21:18 UTC
To: Justin Piszcz; +Cc: linux-raid
On Fri, 2008-12-05 at 16:12 -0500, Justin Piszcz wrote:
> [...]
> > First: does this happen with any frequency on these disks? Would you
> > recommend other disks?
> The disks I have are VelociRaptors; I will be reverting to my old
> Raptor 150s shortly to ensure everything is fine with them before I go
> buying new hard drives, etc. It happened every week or two with the
> VelociRaptors: bad drives, or drives getting kicked out of the array over
> and over again.
>
> > Second, when TLER is off, does it hang forever, or just until it gives up?
> When TLER is off it hangs for 1-2 minutes, and then you will see a timeout
> in the dmesg/kernel log; sometimes it kicks the drive out of the array, and
> other times it 'hard' resets the drive and the array is able to continue
> operating normally.
>
> > And when TLER is on, why does Linux not attempt to remap the sector?
> See my earlier posts on this question from last week. md would need a
> metadata section on each HDD to keep track of the bad sectors before it
> writes to the drives; a 3ware card will do this for you. However,
I thought disks had this internally, and could be prompted to do it by
writing to the sector?
> having run md/Linux for a number of years, it is typically not strictly
> necessary if you run checks on your disks once a week *AND* you have good
> drives that don't have problems.
Obviously I intend to replace broken disks as I detect them, but as I
understand it, it is a fairly common case that disks will get bad sectors
over time and remap them internally?
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 21:21 UTC
To: Redeeman; +Cc: linux-raid
On Fri, 5 Dec 2008, Redeeman wrote:
> [...]
> I thought disks had this internally, and could be prompted to do it by
> writing to the sector?
Yes -- and that is what I did over and over again (dd if=/dev/zero
of=/dev/dsk); it ran OK for 1-2 days but then it started erroring again
with the VelociRaptors. With some old 400GB Seagates, I did the same thing
and their pending sector list rose, but the drives still remained working
for 1-2 years after. The problems I mention are only applicable to my
latest experience with VelociRaptor HDDs.
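A more surgical variant of the same idea is to rewrite only the sector
reported in the kernel log rather than the whole disk. A sketch; the
device name and sector number are placeholders, and a wrong seek value
destroys data:
  # rewrite one 512-byte sector that was reported as pending/unreadable
  dd if=/dev/zero of=/dev/sdX bs=512 count=1 seek=213000256 oflag=direct
  # the pending count should drop if the drive reallocated the sector
  smartctl -A /dev/sdX | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector'
If the disk is an active md member, it is safer to let md's own
rewrite-on-read-error path or a scrub handle this, since a direct write
bypasses the array.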
>> However, having run md/Linux for a number of years, it is typically not
>> strictly necessary if you run checks on your disks once a week *AND* you
>> have good drives that don't have problems.
> Obviously I intend to replace broken disks as I detect them, but as I
> understand it, it is a fairly common case that disks will get bad sectors
> over time and remap them internally?
Yes, and I believe md's 'check' helps to ensure this process happens (it
is like the 'scrubbing' done by RAID VERIFY on a 3ware controller).
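For md, that scrub can be started through sysfs. A sketch, with md0 as a
placeholder (many distributions ship a cron job that does this
periodically):
  # read every stripe; unreadable blocks are reconstructed and rewritten
  echo check > /sys/block/md0/md/sync_action
  # watch progress and see how many inconsistencies were found
  cat /proc/mdstat
  cat /sys/block/md0/md/mismatch_cnt
  # 'repair' additionally rewrites stripes whose parity does not match
  echo repair > /sys/block/md0/md/sync_action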
Justin.
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 21:31 UTC
To: Justin Piszcz; +Cc: linux-raid
On Fri, 2008-12-05 at 16:21 -0500, Justin Piszcz wrote:
> [...]
> > I thought disks had this internally, and could be prompted to do it by
> > writing to the sector?
> Yes -- and that is what I did over and over again (dd if=/dev/zero
> of=/dev/dsk); it ran OK for 1-2 days but then it started erroring again
> with the VelociRaptors. With some old 400GB Seagates, I did the same thing
> and their pending sector list rose, but the drives still remained working
> for 1-2 years after. The problems I mention are only applicable to my
> latest experience with VelociRaptor HDDs.
Okay. Do you happen to have any knowledge to pass on about current 1TB
disks?
>
> >> However, having run md/Linux for a number of years, it is typically not
> >> strictly necessary if you run checks on your disks once a week *AND* you
> >> have good drives that don't have problems.
> > Obviously I intend to replace broken disks as I detect them, but as I
> > understand it, it is a fairly common case that disks will get bad sectors
> > over time and remap them internally?
> Yes, and I believe md's 'check' helps to ensure this process happens (it
> is like the 'scrubbing' done by RAID VERIFY on a 3ware controller).
Sorry to ask so much, but by "this process", do you mean the drive
internally doing the rewriting, avoiding the kick-out, or do you mean the
RAID system discovering the disk as faulty and kicking it? :)
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 21:42 UTC
To: Redeeman; +Cc: linux-raid
On Fri, 5 Dec 2008, Redeeman wrote:
> [...]
> Okay. Do you happen to have any knowledge to pass on about current 1TB
> disks?
I am still looking for some good 1TB drives myself. I know one user who
has 12 of these, 11 in a RAID-5 array and 1 as a spare on a 12-port 3ware
PCI-X card:
SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s Hard Drive
I really would like to find a disk with a working NCQ implementation in
Linux and 3ware, and one that works well. In a single-disk configuration,
the WD 750GB that I have used for over a year now has been fine; the
problem is finding good, reliable disks for use in a RAID configuration,
and that is when everything changes.
> Sorry to ask so much, but by "this process", do you mean the drive
> internally doing the rewriting, avoiding the kick-out, or do you mean the
> RAID system discovering the disk as faulty and kicking it? :)
The process being 'check' or 'repair', as noted by mikylie, who also
responded to this thread; see his response regarding check vs. repair.
There is a difference: for check with md RAID, see mikylie's note; with
RAID VERIFY on a 3ware controller, it will remap bad sectors to other
parts of the array as it comes across them, both during RAID VERIFY and
while the array is running live.
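On the 3ware side the equivalent can be driven from Linux with the tw_cli
utility. A sketch; the controller and unit numbers are placeholders and
the exact syntax may vary by firmware release:
  # list controllers/units, then start a verify on unit 0 of controller 0
  tw_cli show
  tw_cli /c0/u0 start verify
  # poll the verify status
  tw_cli /c0/u0 show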
Justin.
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 22:09 UTC
To: Justin Piszcz; +Cc: linux-raid
On Fri, 2008-12-05 at 16:42 -0500, Justin Piszcz wrote:
> [...]
> > Okay. Do you happen to have any knowledge to pass on about current 1TB
> > disks?
> I am still looking for some good 1TB drives myself. I know one user who
> has 12 of these, 11 in a RAID-5 array and 1 as a spare on a 12-port 3ware
> PCI-X card:
> SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s Hard Drive
I guess those look pretty good.
I am personally running WD RE2 and Seagate ES.2 disks in RAID-1 arrays
without any issues at all, but hmm..
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-05 23:04 UTC
To: Justin Piszcz; +Cc: linux-raid
On Fri, 2008-12-05 at 16:01 -0500, Justin Piszcz wrote:
> [...]
> The exact time is a good question.
About this: shouldn't it be possible to find the exact time that the
Linux RAID system waits before it automatically kicks out the drive?
>
> Something I have noticed is that when TLER is off, the drives hang when
> they hit a bad sector, and when TLER is on, the drive is kicked out of the
> array immediately when it reports a bad sector.
On further thought, might this not suggest that the Linux RAID system
waits indefinitely for an I/O error before kicking out?
If that is indeed the case, and if the use of the RAID isn't
time-critical, then maybe it's a good thing to have TLER enabled, just in
case the disk is able to fix stuff?
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 23:52 UTC
To: Redeeman; +Cc: linux-raid
On Sat, 6 Dec 2008, Redeeman wrote:
> [...]
>> The exact time is a good question.
>
> About this: shouldn't it be possible to find the exact time that the
> Linux RAID system waits before it automatically kicks out the drive?
>
>> Something I have noticed is that when TLER is off, the drives hang when
>> they hit a bad sector, and when TLER is on, the drive is kicked out of the
>> array immediately when it reports a bad sector.
>
> On further thought, might this not suggest that the Linux RAID system
> waits indefinitely for an I/O error before kicking out?
It times out after 1-2 minutes. I have been dealing with this for a very
long time; another 'symptom' is short SMART tests taking forever when the
disk is in a soon-to-be-failing state.
>
> If that is indeed the case, and if the use of the RAID isn't
> time-critical, then maybe it's a good thing to have TLER enabled, just in
> case the disk is able to fix stuff?
With TLER enabled the drive is kicked out immediately. The purpose of TLER
is that of HW RAID, where the drive can alert the controller about the
error quickly and the controller handles it.
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-05 23:52 UTC
To: Redeeman; +Cc: linux-raid
On Fri, 5 Dec 2008, Redeeman wrote:
> [...]
>>> Okay. Do you happen to have any knowledge to pass on about current 1TB
>>> disks?
>> I am still looking for some good 1TB drives myself. I know one user who
>> has 12 of these, 11 in a RAID-5 array and 1 as a spare on a 12-port 3ware
>> PCI-X card:
>> SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s Hard Drive
> I guess those look pretty good.
>
> I am personally running WD RE2 and Seagate ES.2 disks in RAID-1 arrays
> without any issues at all, but hmm..
Can you show the smartctl -a output for each of the disks in your RAIDs?
* Re: time limited error recovery and md raid
From: Roger Heflin @ 2008-12-06 0:42 UTC
To: Justin Piszcz; +Cc: Redeeman, linux-raid
Justin Piszcz wrote:
>
>> On further thought, might this not suggest that the Linux RAID system
>> waits indefinitely for an I/O error before kicking out?
> It times out after 1-2 minutes. I have been dealing with this for a very
> long time; another 'symptom' is short SMART tests taking forever when the
> disk is in a soon-to-be-failing state.
>>
>> If that is indeed the case, and if the use of the RAID isn't
>> time-critical, then maybe it's a good thing to have TLER enabled, just in
>> case the disk is able to fix stuff?
> With TLER enabled the drive is kicked out immediately. The purpose of TLER
> is that of HW RAID, where the drive can alert the controller about the
> error quickly and the controller handles it.
Without TLER, typically the RAID controller will eventually (after 30
seconds or so) declare the disk dead and go on about its business. The
only thing TLER saves you is the 23-second-faster timeout before the disk
is declared dead, and if you are doing something critical, the 7-second
timeout is still potentially very troublesome. I would have thought TLER
had more value set quite a bit lower than 7 seconds.
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-06 2:59 UTC
To: Justin Piszcz; +Cc: linux-raid
On Fri, 2008-12-05 at 18:52 -0500, Justin Piszcz wrote:
> [...]
> >>> Okay. Do you happen to have any knowledge to pass on about current 1TB
> >>> disks?
> >> I am still looking for some good 1TB drives myself. I know one user who
> >> has 12 of these, 11 in a RAID-5 array and 1 as a spare on a 12-port 3ware
> >> PCI-X card:
> >> SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s Hard Drive
> > I guess those look pretty good.
> >
> > I am personally running WD RE2 and Seagate ES.2 disks in RAID-1 arrays
> > without any issues at all, but hmm..
>
> Can you show the smartctl -a output for each of the disks in your RAIDs?
This is a RAID-1 with 1x WD RE2-GP and 1x Seagate ES.2:
fileserver1:~# smartctl -a /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD1000FYPS-01ZKB0
Serial Number:    WD-WCASJ1247531
Firmware Version: 02.01B01
User Capacity:    1.000.203.804.160 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Dec  6 04:00:27 2008 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever been run.
Total time to complete Offline
data collection:                (27960) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  200   200   051    Pre-fail Always      -       0
  3 Spin_Up_Time            0x0003  178   178   021    Pre-fail Always      -       8066
  4 Start_Stop_Count        0x0032  100   100   000    Old_age  Always      -       43
  5 Reallocated_Sector_Ct   0x0033  200   200   140    Pre-fail Always      -       0
  7 Seek_Error_Rate         0x000e  200   200   000    Old_age  Always      -       0
  9 Power_On_Hours          0x0032  096   096   000    Old_age  Always      -       3065
 10 Spin_Retry_Count        0x0012  100   253   000    Old_age  Always      -       0
 11 Calibration_Retry_Count 0x0012  100   253   000    Old_age  Always      -       0
 12 Power_Cycle_Count       0x0032  100   100   000    Old_age  Always      -       43
192 Power-Off_Retract_Count 0x0032  200   200   000    Old_age  Always      -       59
193 Load_Cycle_Count        0x0032  200   200   000    Old_age  Always      -       675
194 Temperature_Celsius     0x0022  122   108   000    Old_age  Always      -       30
196 Reallocated_Event_Count 0x0032  200   200   000    Old_age  Always      -       0
197 Current_Pending_Sector  0x0012  200   200   000    Old_age  Always      -       0
198 Offline_Uncorrectable   0x0010  200   200   000    Old_age  Offline     -       0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always      -       0
200 Multi_Zone_Error_Rate   0x0008  200   200   000    Old_age  Offline     -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        589        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
fileserver1:~# smartctl -a /dev/sdb
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST31000340NS
Serial Number:    9QJ0RPJG
Firmware Version: SN05
User Capacity:    1.000.204.886.016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Dec  6 04:01:06 2008 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever been run.
Total time to complete Offline
data collection:                 ( 650) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 237) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  079   063   044    Pre-fail Always      -       93375833
  3 Spin_Up_Time            0x0003  099   099   000    Pre-fail Always      -       0
  4 Start_Stop_Count        0x0032  100   100   020    Old_age  Always      -       40
  5 Reallocated_Sector_Ct   0x0033  100   100   036    Pre-fail Always      -       1
  7 Seek_Error_Rate         0x000f  065   060   030    Pre-fail Always      -       21491681226
  9 Power_On_Hours          0x0032  097   097   000    Old_age  Always      -       3068
 10 Spin_Retry_Count        0x0013  100   100   097    Pre-fail Always      -       0
 12 Power_Cycle_Count       0x0032  100   037   020    Old_age  Always      -       41
184 Unknown_Attribute       0x0032  100   100   099    Old_age  Always      -       0
187 Reported_Uncorrect      0x0032  100   100   000    Old_age  Always      -       0
188 Unknown_Attribute       0x0032  100   090   000    Old_age  Always      -       60
189 High_Fly_Writes         0x003a  100   100   000    Old_age  Always      -       0
190 Airflow_Temperature_Cel 0x0022  067   055   045    Old_age  Always      -       33 (Lifetime Min/Max 18/37)
194 Temperature_Celsius     0x0022  033   045   000    Old_age  Always      -       33 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a  022   022   000    Old_age  Always      -       93375833
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always      -       0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline     -       0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        589        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
* Re: time limited error recovery and md raid
From: David Greaves @ 2008-12-06 9:14 UTC
To: Justin Piszcz; +Cc: Redeeman, linux-raid
Justin Piszcz wrote:
> [...]
>> Okay. Do you happen to have any knowledge to pass on about current 1TB
>> disks?
> I am still looking for some good 1TB drives myself. I know one user who
> has 12 of these, 11 in a RAID-5 array and 1 as a spare on a 12-port 3ware
> PCI-X card:
> SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s Hard Drive
I wouldn't.
I had 9 of these in 2 Dell SOHO servers; I RMA'd about 13 (yes 13) over a few
months before exchanging them.
The HE ones are slightly better.
David
--
"Don't worry, you'll be fine; I saw it work in a cartoon once..."
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-06 9:23 UTC
To: Redeeman; +Cc: linux-raid
On Sat, 6 Dec 2008, Redeeman wrote:
> [...]
>> Can you show the smartctl -a output for each of the disks in your RAIDs?
> This is a RAID-1 with 1x WD RE2-GP and 1x Seagate ES.2:
Just 1 reallocated sector. Curious, by the way: is there a reason you do
not do daily short SMART tests and weekly long tests? Nothing is worse
than rebuilding a RAID array and then having another disk fail due to
poor health. Check/repair/VERIFY also helps to rid disks of bad sectors
(forcing reallocation), but still..
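Such a schedule can be expressed either in smartd.conf or as plain cron
entries. A sketch, assuming a hypothetical cron file and placeholder
device names:
  # /etc/cron.d/smart-selftests (hypothetical file)
  # daily short self-test at 02:00, weekly long self-test Saturdays at 03:00
  0 2 * * *  root  smartctl -t short /dev/sda
  0 3 * * 6  root  smartctl -t long  /dev/sda
The results land in the self-test log shown above (smartctl -l selftest),
and smartd can mail a warning when a test fails.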
Justin.
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-06 9:59 UTC
To: David Greaves; +Cc: Redeeman, linux-raid
On Sat, 6 Dec 2008, David Greaves wrote:
> [...]
>> I am still looking for some good 1TB drives myself. I know one user who
>> has 12 of these, 11 in a RAID-5 array and 1 as a spare on a 12-port 3ware
>> PCI-X card:
>> SAMSUNG Spinpoint F1 HD103UJ 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s Hard Drive
> I wouldn't.
>
> I had 9 of these in 2 Dell SOHO servers; I RMA'd about 13 (yes, 13) over
> a few months before exchanging them.
Ouch! How did they fail, and what kind of workload was imposed on them?
Yeah, I see the enterprise disks as well for 2x the cost, 'slightly'
better; I take it they fail quite often as well?
>
> The HE ones are slightly better.
>
> David
* Re: time limited error recovery and md raid
From: Michal Soltys @ 2008-12-06 10:32 UTC
To: Justin Piszcz; +Cc: Redeeman, linux-raid
Justin Piszcz wrote:
>
>> And when TLER is on, why does Linux not attempt to remap the sector?
> See my earlier posts on this question from last week. md would need a
> metadata section on each HDD to keep track of the bad sectors before it
> writes to the drives; a 3ware card will do this for you. However, having
> run md/Linux for a number of years, it is typically not strictly necessary
> if you run checks on your disks once a week *AND* you have good drives
> that don't have problems.
>
From what I know, md does attempt to repair read errors (though not other
kinds) with a rewrite, and only if that fails will it kick the drive out
of the array. In the case of 1.x superblocks, the count of repaired
sectors (plus the ones that did not cause the drive to be kicked out) is
preserved across reboots as well (check Documentation/md.txt). Peeking
over the code seems to confirm this.
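The persistent counter Documentation/md.txt describes is exposed per
member device in sysfs. A sketch, with md0 and sda1 as placeholders:
  # read errors seen on this member that did not get it evicted;
  # with version-1 metadata the value persists across restarts
  cat /sys/block/md0/md/dev-sda1/errors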
It was on Neil's todo list, and it got implemented a while ago, AFAIK.
There was a short discussion on the smartmontools mailing list recently:
http://sourceforge.net/mailarchive/forum.php?thread_name=VA.00003465.16ed4a93%40news.conactive.com&forum_name=smartmontools-support
There was no 100% answer there either, though.
Neil?
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-06 10:53 UTC
To: Michal Soltys; +Cc: Redeeman, linux-raid
On Sat, 6 Dec 2008, Michal Soltys wrote:
> [...]
> From what I know, md does attempt to repair read errors (though not other
> kinds) with a rewrite, and only if that fails will it kick the drive out
> of the array. In the case of 1.x superblocks, the count of repaired
> sectors (plus the ones that did not cause the drive to be kicked out) is
> preserved across reboots as well (check Documentation/md.txt). Peeking
> over the code seems to confirm this.
It appears so:
[ 1215.654712] raid5:md3: read error not correctable (sector 213000256 on sde1).
[ 1215.654765] raid5: Disk failure on sde1, disabling device.
[ 1215.654766] raid5: Operation continuing on 8 devices.
[ 1215.655412] raid5:md3: read error not correctable (sector 213000264 on sde1).
[ 1215.655473] raid5:md3: read error not correctable (sector 213000272 on sde1).
[ 1215.655533] raid5:md3: read error not correctable (sector 213000280 on sde1).
[ 1215.655592] raid5:md3: read error not correctable (sector 213000288 on sde1).
[ 1215.655644] raid5:md3: read error not correctable (sector 213000296 on sde1).
[ 1215.655694] raid5:md3: read error not correctable (sector 213000304 on sde1).
[ 1215.655746] raid5:md3: read error not correctable (sector 213000312 on sde1).
[ 1215.655800] raid5:md3: read error not correctable (sector 213000320 on sde1).
[ 1215.655852] raid5:md3: read error not correctable (sector 213000328 on sde1).
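Once a member has been failed out like this, and assuming the underlying
sectors have been dealt with, it can be returned to service. A sketch
using the device names from the log above; a full resync follows:
  # clear the failed slot, then re-add the disk to trigger a rebuild
  mdadm /dev/md3 --remove /dev/sde1
  mdadm /dev/md3 --add /dev/sde1
  # watch the recovery
  cat /proc/mdstat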
Justin.
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-06 14:33 UTC
To: Justin Piszcz; +Cc: linux-raid
On Sat, 2008-12-06 at 04:23 -0500, Justin Piszcz wrote:
> [...]
> Just 1 reallocated sector. Curious, by the way: is there a reason you do
> not do daily short SMART tests and weekly long tests? Nothing is worse
> than rebuilding a RAID array and then having another disk fail due to
> poor health. Check/repair/VERIFY also helps to rid disks of bad sectors
> (forcing reallocation), but still..
I actually thought I already had that set up here; I guess I forgot it :)
* Re: time limited error recovery and md raid
From: Redeeman @ 2008-12-08 16:59 UTC
To: Justin Piszcz; +Cc: linux-raid
On Fri, 2008-12-05 at 18:52 -0500, Justin Piszcz wrote:
> [...]
> >> The exact time is a good question.
> >
> > About this: shouldn't it be possible to find the exact time that the
> > Linux RAID system waits before it automatically kicks out the drive?
> >
> >> Something I have noticed is that when TLER is off, the drives hang when
> >> they hit a bad sector, and when TLER is on, the drive is kicked out of
> >> the array immediately when it reports a bad sector.
> >
> > On further thought, might this not suggest that the Linux RAID system
> > waits indefinitely for an I/O error before kicking out?
> It times out after 1-2 minutes. I have been dealing with this for a very
> long time; another 'symptom' is short SMART tests taking forever when the
> disk is in a soon-to-be-failing state.
Do you know if this is actually caused by the disk's error recovery
timing out in 1-2 minutes, or by the Linux RAID system?
* Re: time limited error recovery and md raid
From: Justin Piszcz @ 2008-12-08 17:01 UTC
To: Redeeman; +Cc: linux-raid
On Mon, 8 Dec 2008, Redeeman wrote:
> [...]
>> It times out after 1-2 minutes. I have been dealing with this for a very
>> long time; another 'symptom' is short SMART tests taking forever when the
>> disk is in a soon-to-be-failing state.
>
> Do you know if this is actually caused by the disk's error recovery
> timing out in 1-2 minutes, or by the Linux RAID system?
I am not sure; that is a good question. I am not sure how long the drive
itself will spend 'searching' or trying to repair/read a bad sector with
TLER off, versus when the kernel will drop the disk for not responding.
Justin.