RE: RAID halting

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* RE: RAID halting
       [not found] <49D7C19C.2050308@gmail.com>
@ 2009-04-05  0:07 ` Lelsie Rhorer
  2009-04-05  0:49   ` Greg Freemyer
  2009-04-05  0:57   ` Roger Heflin
  0 siblings, 2 replies; 84+ messages in thread
From: Lelsie Rhorer @ 2009-04-05  0:07 UTC (permalink / raw)
  To: 'Linux RAID'

> If one of your disks was clearing bad sectors then things get messy
> and when it hits one of these bad sectors that it can successfully
> move you would get a delay almost every time.

Yes, but in that case two things would be true:

1.  Any write of any sort could readily trigger an event.  The system quite
regularly writes more than 5000 sectors / second, but never do any of these
writes trigger an event except in the case where it is a file creation.
Like I said, the drives have no idea whether the sector they are attempting
to write is a new file or not, or part of a directory structure or not.

2.  The kernel would be reporting SMART errors.  It isn't.

Finally, as you said yourself, the situation would result in a delay almost
every time, yet there are signifcant stretches of time when every single
file creation works just fine.  Also, it doesn't take a drive 40 seconds,
let alone 2 minutes, to mark a sector bad.  The array chassis I had
previously had some sort of problem which made the drives think there were
bad sectors, when there weren't.  It cause one drive to be marked with more
than a million bad sectors.  It never paused like this, however.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-05  0:07 ` RAID halting Lelsie Rhorer
@ 2009-04-05  0:49   ` Greg Freemyer
  2009-04-05  5:34     ` Lelsie Rhorer
  2009-04-05  0:57   ` Roger Heflin
  1 sibling, 1 reply; 84+ messages in thread
From: Greg Freemyer @ 2009-04-05  0:49 UTC (permalink / raw)
  To: lrhorer; +Cc: Linux RAID

<snip>
> Also, it doesn't take a drive 40 seconds,
> let alone 2 minutes, to mark a sector bad.

Just for info, with Enterprise class sata drives, you're right.

For consumer grade sata, they have extended auto retry logic as an
effort not to fail a read from a bad sector.  It can easily take times
like your seeing.  I think the standard drive timeout is on the order
of 30 seconds and then libata sometimes has retry logic of its own.

Especially with PATA drives a 2 minute issue is not out of the
question, since the kernel will step down the i/o speed and retry each
speed as it goes.  And PATA (IDE) has a lot of speeds to try.

Greg
-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05  0:49   ` Greg Freemyer
@ 2009-04-05  5:34     ` Lelsie Rhorer
  2009-04-05  7:16       ` Richard Scobie
  2009-04-05  7:33       ` Richard Scobie
  0 siblings, 2 replies; 84+ messages in thread
From: Lelsie Rhorer @ 2009-04-05  5:34 UTC (permalink / raw)
  To: linux-raid

> > Also, it doesn't take a drive 40 seconds,
> > let alone 2 minutes, to mark a sector bad.
> 
> Just for info, with Enterprise class sata drives, you're right.

On this system, with consumer class drives, with the faulty RAID enclosure,
the system marked well over a million sectors bad.  Except when the entire
drive was faulted and taken offline, it never caused a halt of this
magnitude, and every single block of sectors marked bad was reported in the
kernel log.  I can't guarantee this is not what is happening, but I am
highly skeptical, to say the least.  You're also going to have to explain to
me why 5 drives stop reading and 5 do not, and why it is always the same 5
drives.  One flaky drive I can believe.  Five I can't.  You're also going to
have to explain to me why there are no reports of sectors marked bad in the
kernel log when these events occur, although at long intervals (months),
occasionally a block of bad sectors will be reported.

> For consumer grade sata, they have extended auto retry logic as an
> effort not to fail a read from a bad sector.  It can easily take times
> like your seeing.  I think the standard drive timeout is on the order
> of 30 seconds and then libata sometimes has retry logic of its own.
I have never seen any of the marked failures take that long.  As I said,
there were well over a million marked bad, usually in chunks of 128 sectors.

> Especially with PATA drives a 2 minute issue is not out of the
> question, since the kernel will step down the i/o speed and retry each
> speed as it goes.  And PATA (IDE) has a lot of speeds to try.

In my original message I already said these are 1T SATA drives spread across
three different port multipliers.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-05  5:34     ` Lelsie Rhorer
@ 2009-04-05  7:16       ` Richard Scobie
  2009-04-05  8:22         ` Lelsie Rhorer
  2009-04-05  7:33       ` Richard Scobie
  1 sibling, 1 reply; 84+ messages in thread
From: Richard Scobie @ 2009-04-05  7:16 UTC (permalink / raw)
  To: lrhorer; +Cc: linux-raid

Lelsie Rhorer wrote:

> In my original message I already said these are 1T SATA drives spread across
> three different port multipliers.

Are these Seagates?

If so, possibly you are hitting their recent firmware troubles, which I 
believe caused significant delays in drive operation.

Whilst this link refers to 1.5TB drives, I think I have seen reference 
elsewhere to 1TB drives also.

http://techreport.com/discussions.x/15863

Regards,

Richard

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05  7:16       ` Richard Scobie
@ 2009-04-05  8:22         ` Lelsie Rhorer
  2009-04-05 14:05           ` Drew
  0 siblings, 1 reply; 84+ messages in thread
From: Lelsie Rhorer @ 2009-04-05  8:22 UTC (permalink / raw)
  To: linux-raid

> > In my original message I already said these are 1T SATA drives spread
> across
> > three different port multipliers.
> 
> Are these Seagates? 

Two of them are, but they are somewhat older drives, and they didn't have
this problem before.  The problem started immediately the last time I
rebuilt the array and formatted it as Reiserfs, after moving the drives out
of the old RAID chassis.

> If so, possibly you are hitting their recent firmware troubles, which I
> believe caused significant delays in drive operation.
> 
> Whilst this link refers to 1.5TB drives, I think I have seen reference
> elsewhere to 1TB drives also.
> 
> http://techreport.com/discussions.x/15863

I couldn't reach the link, but I've heard of this before, or something to
the effect.  It's possible, but to me this is looking ever less and less
like a problem at the drive level.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-05  8:22         ` Lelsie Rhorer
@ 2009-04-05 14:05           ` Drew
  2009-04-05 18:54             ` Leslie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: Drew @ 2009-04-05 14:05 UTC (permalink / raw)
  To: lrhorer; +Cc: linux-raid

> The problem started immediately the last time I
> rebuilt the array and formatted it as Reiserfs, after moving the drives out
> of the old RAID chassis.

What file system were you using before ReiserFS?

I'm no expert but it seems, based on what you said about the raid
resyncing just fine while a filesystem call blocked, you may want to
look at the filesystem you're using. RAID provides a block level
interface to the FS and if the block level interface is working and
not reporting errors from lower levels, your culprit is higher up the
chain, ie the FS.

It's perhaps a matter of personal preference but I try to avoid
Reiser. It's not that it's a bad FS or anything but I've never had it
work quite right for me. Plus, based on my read of how it works, it
requires more processing on file creation due to the various
optimization algorithms it uses.

-- 
Drew

"Nothing in life is to be feared. It is only to be understood."
--Marie Curie

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05 14:05           ` Drew
@ 2009-04-05 18:54             ` Leslie Rhorer
  2009-04-05 19:17               ` John Robinson
                                 ` (2 more replies)
  0 siblings, 3 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-05 18:54 UTC (permalink / raw)
  To: 'Drew'; +Cc: linux-raid

> > The problem started immediately the last time I
> > rebuilt the array and formatted it as Reiserfs, after moving the drives
> out
> > of the old RAID chassis.
> 
> What file system were you using before ReiserFS?

Several, actually.  Since the RAID array kept crashing, I had to re-create
it numerous times.  I tried ext3 more than once, but the journal kept
getting corrupted, and fixing it lost several files.  Once I lost several
important files during a RAID expansion.  In some cases I converted to ext2,
and others I started out with ext2, but last I checked, one cannot grow an
ext2 file system on the fly.  I know at least once before it was reiserfs,
but I'm pretty sure it was under an earlier kernel, and probably an earlier
version of resierfs.

> I'm no expert but it seems, based on what you said about the raid
> resyncing just fine while a filesystem call blocked, you may want to
> look at the filesystem you're using. RAID provides a block level
> interface to the FS and if the block level interface is working and
> not reporting errors from lower levels, your culprit is higher up the
> chain, ie the FS.

I've suspected this may be the case from the outset.  I don't see a reiserfs
user's mailing list.  I posted on the development mailing list, but I
haven't received any response, assuming it isn't the wrong list in the first
place.  Of course, the fact the RAID level is working while the FS level is
not does not necessarily mean it's the FS' fault.  It could be an
interaction which is causing the problem, not a failure of either one
individually, or it could be the internal calls are working, but there is an
issue with external calls.

> It's perhaps a matter of personal preference but I try to avoid
> Reiser. It's not that it's a bad FS or anything but I've never had it
> work quite right for me. Plus, based on my read of how it works, it
> requires more processing on file creation due to the various
> optimization algorithms it uses.

Yes, but I had many problems with ext3, as well, and I definitely need a
file system which can be grown on the fly.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-05 18:54             ` Leslie Rhorer
@ 2009-04-05 19:17               ` John Robinson
  2009-04-05 20:00                 ` Greg Freemyer
  2009-04-05 21:02                 ` Leslie Rhorer
  2009-04-05 19:26               ` Richard Scobie
  2009-04-05 20:57               ` Peter Grandi
  2 siblings, 2 replies; 84+ messages in thread
From: John Robinson @ 2009-04-05 19:17 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Drew', linux-raid

On 05/04/2009 19:54, Leslie Rhorer wrote:
>>> The problem started immediately the last time I
>>> rebuilt the array and formatted it as Reiserfs, after moving the drives
>>> out of the old RAID chassis.
>> What file system were you using before ReiserFS?
> 
> Several, actually.  Since the RAID array kept crashing, I had to re-create
> it numerous times.
[...]
>> your culprit is higher up the chain, ie the FS.
> 
> I've suspected this may be the case from the outset.

I'm sorry? You've repeatedly had trouble with this system, this array, 
you've tried several filesystems; do you think they're *ALL* broken?

Cheers,

John.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-05 19:17               ` John Robinson
@ 2009-04-05 20:00                 ` Greg Freemyer
  2009-04-05 20:39                   ` Peter Grandi
                                     ` (2 more replies)
  2009-04-05 21:02                 ` Leslie Rhorer
  1 sibling, 3 replies; 84+ messages in thread
From: Greg Freemyer @ 2009-04-05 20:00 UTC (permalink / raw)
  To: John Robinson; +Cc: lrhorer, Drew, linux-raid

On Sun, Apr 5, 2009 at 3:17 PM, John Robinson
<john.robinson@anonymous.org.uk> wrote:
> On 05/04/2009 19:54, Leslie Rhorer wrote:
>>>>
>>>> The problem started immediately the last time I
>>>> rebuilt the array and formatted it as Reiserfs, after moving the drives
>>>> out of the old RAID chassis.
>>>
>>> What file system were you using before ReiserFS?
>>
>> Several, actually.  Since the RAID array kept crashing, I had to re-create
>> it numerous times.
>
> [...]
>>>
>>> your culprit is higher up the chain, ie the FS.
>>
>> I've suspected this may be the case from the outset.
>
> I'm sorry? You've repeatedly had trouble with this system, this array,
> you've tried several filesystems; do you think they're *ALL* broken?
>
> Cheers,
>
> John.

I for one think it is very reasonable that Leslie may have experienced
numerous different problems in the course of trying to put together a
large scale raid system for video editing.

But Leslie, maybe you do need to take a step back and review your
overall design and see what major changes you could make that might
help.

I'm not really keeping up with things like video editing, but as
someone else said XFS was specifically designed for that type of
workload.  It even has a psuedo realtime capability to ensure you
maintain your frame rate, etc.  Or so I understand.  I've never used
that feature.  You could also evaluate the different i/o elevators.

If I were designing a system like you have for myself, I would get one
of the major supported server distros.  (I'm a SuSE fan, so I would go
with SLES, but I assume others are good as well.)  Then I would get
hardware they specifically support and I would use their best practice
configs.  Neil Brown has a suse email address, maybe he can tell you
where to find some suse supported config documents, etc.

FYI: Some of the major problems going in the last year that make me
willing to believe someone is having lots of unrelated issues in
trying to build a system like Leslie's.

==
Reiser's main maintainer is in jail, recent versions of OpenSUSE croak
if reiser is in use because they exercise code paths with serious
bugs. (google "beagle opensuse reiser")

Ext3 is being savaged on the various LKML lists as we speak due to
horrible latency issues with workloads similar to Leslie's.

The latest Linus kernel has a lot ext3 patches in it that reduce the
horrible latency to merely unacceptable.  Linus and Ted Tso are now
thinking the remaining problems are with the CFQ elevator.  (In theory
the AS one is better, but the troubleshooting is ongoing as we speak,
so too soon to say anything definitive just yet.)

Seagate drives have been having major firmware issues for about a year.

Marvell PMP linux kernel support has just been promoted from
experimental recently (if that has even happened yet.)  And Marvell is
used on lot of MBs.

Sil's have a known problem that if the first drive on a PMP is
missing, it screws up the rest of the drives.

Ext4 is claimed "production" but is getting major corruption bugzillas
(and associated patches) weekly.  I for one would not use it for
production work.

Tejun Heo is the core eSata developer and he says not to trust any
eSata cable a meter or longer.  ie. He had lots of spurious
transmission errors when testing with longer cables.

Lot of reported problems turn out to be power supplies not designed to
carry a Sata load.  Apparently sata drives are very demanding and many
"good" power supplies don't cut the mustard.

and that is off the top of my head.

Greg
-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-05 20:00                 ` Greg Freemyer
@ 2009-04-05 20:39                   ` Peter Grandi
  2009-04-05 23:27                     ` Leslie Rhorer
  2009-04-05 22:03                   ` Leslie Rhorer
  2009-04-24  4:52                   ` Leslie Rhorer
  2 siblings, 1 reply; 84+ messages in thread
From: Peter Grandi @ 2009-04-05 20:39 UTC (permalink / raw)
  To: Linux RAID

>>>>> The problem started immediately the last time I rebuilt
>>>>> the array and formatted it as Reiserfs, after moving the
>>>>> drives out of the old RAID chassis.

>>>> What file system were you using before ReiserFS?

>>> Several, actually.  Since the RAID array kept crashing, I
>>> had to re-create it numerous times.

>>>> your culprit is higher up the chain, ie the FS.

>>> I've suspected this may be the case from the outset.

>> I'm sorry? You've repeatedly had trouble with this system,
>> this array, you've tried several filesystems; do you think
>> they're *ALL* broken?

> I for one think it is very reasonable that Leslie may have
> experienced numerous different problems in the course of
> trying to put together a large scale raid system for video
> editing.

> But Leslie, maybe you do need to take a step back and review
> your overall design and see what major changes you could make
> that might help.

Very wise words. Because what the O.P. is trying to do is system
integration on a medium-large scale, and yet expects all the
bits (sw and hw) to snap together. While as you demonstrate to
know system integration means finding the few combinations of
sw/hw/fw that actually work, and work together.

Too many of the messages to this list are by people who have a
"syntactic" approach to systems integration (if the "syntax" is
formally valid, it ought to work...).

> I'm not really keeping up with things like video editing, but
> as someone else said XFS was specifically designed for that
> type of workload.

JFS not too bad either, and it is fairly robust too.

> You could also evaluate the different i/o elevators.

That's more about performance than reliability. Anyhow as to
performance many elevators have peculiar performance profiles
(as you report later for CFQ) and never mind their interaction
with insane "features" like plugging/unplugging in the block
layer. From many tests that I have seen, 'noop' should be the
default in most situations.

> If I were designing a system like you have for myself, I would
> get one of the major supported server distros.

That in my experience does not matter a lot, but it should be
tried. On one hand their kernels are usually quite a bit behind
the state of the hw, on the other their kernels tend to have
lots of useful bug fixes. On balance I am not sure which is more
important. However I like the API stability of major distributions.

> FYI: Some of the major problems going in the last year that
> make me willing to believe someone is having lots of unrelated
> issues in trying to build a system like Leslie's.

All these problems that you list below are typical of system
integration with lots of moving parts :-). Experiences teaches
people like you and me that to expect them. And there are people
at large scale sites that write up about them, for example:

  https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257

> Reiser's main maintainer is in jail, recent versions of
> OpenSUSE croak if reiser is in use because they exercise code
> paths with serious bugs. (google "beagle opensuse reiser")

That the maintainer is in trouble is not so important; but
ReiserFS has indeed some bugs mostly because it is a bit
complicated.

> [ ... ] The latest Linus kernel has a lot ext3 patches in it
> that reduce the horrible latency to merely unacceptable.
> Linus and Ted Tso are now thinking the remaining problems are
> with the CFQ elevator. [ ... ]

I strongly suspect the block layer, which seems to do quite a
few misguided "optimizations" (even if not quite as awful as
some aspects of the VM subsystem).

> Seagate drives have been having major firmware issues for
> about a year.

That's the .11 series, but WD have had problems in the recent
past, and so have had other manufacturers. With ever increasing
fw complexity come many risks...

> Ext4 is claimed "production" but is getting major corruption
> bugzillas (and associated patches) weekly. [ ... ]

That is mostly however IIRC because of the ancient delayed write
feature and the fact that "userspace sucks", that is does not
use 'fsync'.

> Tejun Heo is the core eSata developer and he says not to trust
> any eSata cable a meter or longer. [ ... ]

Longer ATA/80 wire cables also have had problems for a long
time. Longer SATA and eSATA cables also problematic. But SAS
"splitter" cables seem to be usually pretty well shielded.

> Lot of reported problems turn out to be power supplies not
> designed to carry a Sata load.  Apparently sata drives are
> very demanding and many "good" power supplies don't cut the
> mustard.

That probably does not have much to do with SATA drives. It is
more like a combination of factors:

  * Many power supplies are poorly designed or built.

  * Modern hard disks draw a high peak current on startup, and
    many people do not realize that PSU rails have different
    power ratings, and do not stagger power up of many drives.

  * Cooling is often underestimated, with overheating of power
    and other components, especially in dense configurations.

Some of my recommendations:

* Use as simple a setup as you can. RAID10, no LVM, well tested
  file systems like JFS or XFS (or Lustre for extreme cases).

* Only use not-latest components that are reported to work well
  with the vintage of sw that you are using, and do extensive
  web searching as to which vintages of hw/fw/sw seem to work
  well together.

* Oversize by a good margin the power supplies and the cooling
  system, stagger drive startup, and monitor the voltages and
  the temperatures.

* Use disks of many different manufacturers in the same array.

* Run period tests against silent corruption.

The results can be rewarding; I have setup without too much
effort storage systems that deliver several hundred MB/s over
NFS, and a few GB/s over Lustre are also possible (but that
needs more careful thinking).

An example of a recent setup that worked pretty well: Dell
2900/2950 server, PERC 6s (not my favourites though), 2 Dell
MD1000 arrays, 30 750GB drives configured as 3 MD (sw) RAID10
arrays with a few spares, RHEL5/CentOS5, 10Gb/s Myri card.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05 20:39                   ` Peter Grandi
@ 2009-04-05 23:27                     ` Leslie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-05 23:27 UTC (permalink / raw)
  To: 'Peter Grandi', 'Linux RAID'

> Very wise words. Because what the O.P. is trying to do is system
> integration on a medium-large scale

More like a small-medium scale, I would say.  Other than the size of the
array, this is a vary small, very limited system.  It's a motherboard, a
keyboard, a mouse, a power supply, a UPS, a basic non-RAID controller, some
port multipliers, and a bunch of disks.  The desktop system from which I
type this message is more sophisticated / larger in scale.  Its expansion
ports are full, as are its I/O ports.  It just doesn't have a RAID array.

> and yet expects all the
> bits (sw and hw) to snap together.

No, I don't.  Indeed, it has taken over a year and a half of testing and
swapping components (software and hardware) to get to this point.  This is
simply the first problem I have not been able to solve by myself.

> While as you demonstrate to
> know system integration means finding the few combinations of
> sw/hw/fw that actually work, and work together.

Yes, he does, but so do I.  I just need some help finding some diagnostics
which will point to what components are causing the problem.  At this point,
I would guess it's something between the RAID software and reiserfs, but the
data is not yet conclusive, by any means.

> > I'm not really keeping up with things like video editing, but
> > as someone else said XFS was specifically designed for that
> > type of workload.
> 
> JFS not too bad either, and it is fairly robust too.

I'm open to any and all better alternatives, even if they are not part of
the root cause of the problem.  I read a couple of reports, early on, that
gave JFS a black eye.  With so many opinions, it's sometimes hard to sort
out the good from the bad.

> > If I were designing a system like you have for myself, I would
> > get one of the major supported server distros.
> 
> That in my experience does not matter a lot, but it should be
> tried. On one hand their kernels are usually quite a bit behind
> the state of the hw, on the other their kernels tend to have
> lots of useful bug fixes. On balance I am not sure which is more
> important. However I like the API stability of major distributions.

I'll put it on the list.  I can fairly easily create an alternate boot, of
course, and while I don't want to spend all the time to convert the server
softwares unless I am sure this will fix the problem, I should be able to
reproduce the conditions well enough to verify one way or the other.

> > FYI: Some of the major problems going in the last year that
> > make me willing to believe someone is having lots of unrelated
> > issues in trying to build a system like Leslie's.
> 
> All these problems that you list below are typical of system
> integration with lots of moving parts :-). Experiences teaches
> people like you and me that to expect them. And there are people
> at large scale sites that write up about them, for example:

I know that very well.  My professional systems encompass tens of thousands
of miles of fiber plant and tens of thousands of individual hardware
components from more than 200 vendors.  The number that don't talk at all to
one another or do so poorly yet must be employed in the system is appalling.

> > Reiser's main maintainer is in jail, recent versions of
> > OpenSUSE croak if reiser is in use because they exercise code
> > paths with serious bugs. (google "beagle opensuse reiser")
> 
> That the maintainer is in trouble is not so important; but
> ReiserFS has indeed some bugs mostly because it is a bit
> complicated.

Here's what I don't understand.  Given that reiserfs, like virtually any
complex software, is known to have some issues, and given the symptoms I
have encountered point more toward issues at the file system or RAID level
and more away from hardware sources, why are several people basically
yelling at me that it must be a hardware issue?  No, the hardware is not
sophisticated, but then neither is the application.

> Longer ATA/80 wire cables also have had problems for a long
> time. Longer SATA and eSATA cables also problematic. But SAS
> "splitter" cables seem to be usually pretty well shielded.

Appropo of nothing, but it was a SAS / infiniband RAID chassis that had the
big problem.  With 5 drives, I was having trouble, so the manufacturer sent
me a new backplane.  That seemed to resolve the issue, but when I went to 6
drives, the array croaked.  I replaced the drive controller (the second time
I replaced it, actually), to no avail. I had to move the drives around until
I could find a stable configuration of used and unused slots in the chassis.
I went to 8 drives without too much trouble, but then when I needed a 9th it
meant I had to put both controllers in the system, but the motherboard only
had 1 PCI Express x 16 slot.  I purchased a motherboard which was supposed
to be compatible with Linux and the controllers, but I couldn't get it to
work under "Etch" and it would not boot "Lenny" at all.  So I got another MB
which was supposed to work according to both the MB and controller folks.
It worked fine with one controller, but never two.  Finally, I gutted the
multilane system and installed a port multiplier system.  I could get 8
drives to be pretty stable, but with ten drives, the number of "failed"
drives jumped to three or four a day.  The RAID array crashed and burned
completely and unrecoverably 3 times.  I moved the drives out of the
external chassis and into the main chassis, and the problems ceased.  This
worked fine for three weeks.  When the new chassis arrived, and I moved the
drives to it.  I don't recall for certain whether I formatted the array as
reiserfs before or after moving the drives, but I did not notice the issue
with the halts until a week or two after moving the drives.

> > Lot of reported problems turn out to be power supplies not
> > designed to carry a Sata load.  Apparently sata drives are
> > very demanding and many "good" power supplies don't cut the
> > mustard.
> 
> That probably does not have much to do with SATA drives. It is
> more like a combination of factors:
> 
>   * Many power supplies are poorly designed or built.
> 
>   * Modern hard disks draw a high peak current on startup, and
>     many people do not realize that PSU rails have different
>     power ratings, and do not stagger power up of many drives.

Since in this case the drive are already spinning long before the system
boots, the start-up currents shouldn't really be an issue, but even if they
are, the supply is rated for more than enough to handle all the drives
starting together.  A bad supply would be another matter, of course.

>   * Cooling is often underestimated, with overheating of power
>     and other components, especially in dense configurations.

The array chassis is specifically designed to handle 12 drives.  All the
drives are reporting to be continuously cooler than 46C.  All but two are
cooler than 43C.

> Some of my recommendations:
> 
> * Use as simple a setup as you can. RAID10, no LVM, well tested

I'm using RAID6, because it is more robust, and because I couldn't keep the
array up for more than a few hours under RAID5 with the old chassis.

>   file systems like JFS or XFS (or Lustre for extreme cases).

I'll look into them.

> * Only use not-latest components that are reported to work well
>   with the vintage of sw that you are using, and do extensive
>   web searching as to which vintages of hw/fw/sw seem to work
>   well together.

Well, like I said, this is a pretty plain vanilla system.  The motherboard
is a somewhat new model, and of course the 1T drives have only been out a
year or so.  Other than the softwares related to the servers, everything
else is in the distro.  The servers are both Java based, and in any case I
can shut them down and still readily produce the issue.  There is one
Windows client I run under Wine, but likewise I can shut it down and still
trigger the issue with two successive cp commands.

> * Oversize by a good margin the power supplies and the cooling
>   system, stagger drive startup, and monitor the voltages and
>   the temperatures.

The original controllers supported staggered spin-up, but I don't think this
one does.  Since the drives are external to the main system, I don't think
it really makes much difference.

> * Use disks of many different manufacturers in the same array.

Well, I'm using two different manufacturers, and four different models.

> * Run period tests against silent corruption.

Such as ?

> The results can be rewarding; I have setup without too much
> effort storage systems that deliver several hundred MB/s over
> NFS, and a few GB/s over Lustre are also possible (but that
> needs more careful thinking).

The performance of this system is fine.  I haven't done any tuning, but 450
Mbps is much more than necessary, so I'm not inclined to spend any effort on
improving it, unless it will fix this issue.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05 20:00                 ` Greg Freemyer
  2009-04-05 20:39                   ` Peter Grandi
@ 2009-04-05 22:03                   ` Leslie Rhorer
  2009-04-06 22:16                     ` Greg Freemyer
  2009-04-24  4:52                   ` Leslie Rhorer
  2 siblings, 1 reply; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-05 22:03 UTC (permalink / raw)
  To: linux-raid

> I for one think it is very reasonable that Leslie may have experienced
> numerous different problems in the course of trying to put together a
> large scale raid system for video editing.

I've encountered much worse, with many more failure sources.

> But Leslie, maybe you do need to take a step back and review your
> overall design and see what major changes you could make that might
> help.

That's a reasonable solution, but until I know more of where the fundamental
issue actually lies, any changes I make may be a waste of time.  What if
it's a power supply issue?  A bad cable?  What if the new RAID chassis is
also bad?  What if it's a motherboard problem?  I can't afford to replace
the entire system, and even if I did, is the issue due to a one off
component failure or a systemic problem with the entire product line (i.e.
do I replace the component with the same model or a different piece of
equipment altogether? 

> I'm not really keeping up with things like video editing, but as
> someone else said XFS was specifically designed for that type of
> workload.  It even has a psuedo realtime capability to ensure you
> maintain your frame rate, etc.  Or so I understand.  I've never used
> that feature.  You could also evaluate the different i/o elevators.

I'll look into XFS.  Of course, it means taking the system down for several
days while I reconfigure and then copy all the data back.  It also makes me
really nervous to only have only one or two copies of the files in
existence.  When the array is reformatted, the bulk of the data only exists
in one place: the backup server.  Three days is a long time, and it's always
possible the backup server could fail.  In fact, the last time I took down
the RAID server, the backup server *DID* fail.  It's motherboard fried
itself and took the power supply with it.  Fortunately, the LVM was not
corrupted, and all that was lost were the files in the process of being
written to the backup server (which of course was then the main server for
the time being).

As to the file system, it really doesn't make a lot of different at the
application layer.  The video editor is on a Windows (puke!) machine and
only needs a steady stream of bits across a SAMBA share.  Similarly, the
server does not stream video directly.  It merely transfers the file -
possibly filtering through ffmpeg first - to the hard drive of the video
display devices (TiVo DVRs), where at some point the file is streamed out
the video device.  As long as the array can transfer at rates greater than
20 Mbps, everything is fine as far as the video is concerned.

> If I were designing a system like you have for myself, I would get one
> of the major supported server distros.  (I'm a SuSE fan, so I would go
> with SLES, but I assume others are good as well.)  Then I would get

Debian is pretty well supported, and to my eye has consistently the most
bug-free distros, including the commercial distros.  Of course I am not an
expert in this area, but I have worked some with Xandros and Red Hat.
Personally, I much prefer Debian.  This is the first time I have run into an
issue which I could not resolve myself.  Of course, I've only been using
Linux at all since 2002, and I've only had desktop Linux systems for about 4
years, so my experience is not extensive.

> hardware they specifically support and I would use their best practice
> configs.  Neil Brown has a suse email address, maybe he can tell you
> where to find some suse supported config documents, etc.

I don't think I can afford that.  Things are extremely tight right now, and
a whole hog hardware replacement is really not practical.  Although it is
entirely possible this could be related to a number of hardware and software
components, I'm really hoping it is not the case, and if I can pinpoint the
problem through diagnosis and then replace a single element, I think it is
what needs to be done.  That said, if this is due to a hard drive problem -
one or many - the hard drives are due to be replaced after the 3T drives are
shipping anyway, so if one or more drive are the problem, it should go away
at that time.  If not, then it would be better to find and fix the problem
prior to that time.

> Ext4 is claimed "production" but is getting major corruption bugzillas
> (and associated patches) weekly.  I for one would not use it for
> production work.

Uh, yeah.  I was unaware of XFS until today, but I did look at Ext4.  One
look and I said, "Uh-uh".

> Tejun Heo is the core eSata developer and he says not to trust any
> eSata cable a meter or longer.  ie. He had lots of spurious
> transmission errors when testing with longer cables.

Just FYI, the eSATA cables going to the array chassis are 24", and I believe
them to be of high quality.  They have also been replaced with no apparent
effect.

> Lot of reported problems turn out to be power supplies not designed to
> carry a Sata load.  Apparently sata drives are very demanding and many
> "good" power supplies don't cut the mustard.

Well, the server itself only has a single PATA drive as its boot drive, and
the only peripheral card is the SATA controller.  It's a new chassis (6
months) with a 550 Watt supply, so it's unlikely to be the culprit, even
though the CPU is 125 Watts.  The RAID chassis is a 12 slot system with a
400 Watt supply and 11 drives.  I suppose I could try changing the RAID
supply to a 600 or 700 watt model, but really 400W should be enough for 11
drives and 3 port multipliers.  According to the spec sheets, the most power
hungry drives in the mix (Hitachi E7K1000) require an absolute maximum of 26
watts.  If all the drives were the same, that would be 286 watts.
Especially given the Western Digital drives and the one Seagate (not part of
the array) drives are specified to have somewhat lower power consumption,
400W should be fine.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-05 22:03                   ` Leslie Rhorer
@ 2009-04-06 22:16                     ` Greg Freemyer
  2009-04-07 18:22                       ` Leslie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: Greg Freemyer @ 2009-04-06 22:16 UTC (permalink / raw)
  To: lrhorer; +Cc: linux-raid

>> Lot of reported problems turn out to be power supplies not designed to
>> carry a Sata load.  Apparently sata drives are very demanding and many
>> "good" power supplies don't cut the mustard.
>
> Well, the server itself only has a single PATA drive as its boot drive, and
> the only peripheral card is the SATA controller.  It's a new chassis (6
> months) with a 550 Watt supply, so it's unlikely to be the culprit, even
> though the CPU is 125 Watts.  The RAID chassis is a 12 slot system with a
> 400 Watt supply and 11 drives.  I suppose I could try changing the RAID
> supply to a 600 or 700 watt model, but really 400W should be enough for 11
> drives and 3 port multipliers.  According to the spec sheets, the most power
> hungry drives in the mix (Hitachi E7K1000) require an absolute maximum of 26
> watts.  If all the drives were the same, that would be 286 watts.
> Especially given the Western Digital drives and the one Seagate (not part of
> the array) drives are specified to have somewhat lower power consumption,
> 400W should be fine.

It may or may not be as simple as you describe.

A lot of modern PSUs have multiple rails.  Each rail is independent
and provides some number of watts.

So if your CPU / graphic rail(s) are getting 300 watts out of your
550, then that only leave 250.  And that can be split among 2 or 3
rails as well.

It definitely gets complicated.

Or you may have a simple design where all the 5v lines are tied
together, same for +12v,-12v.

OTOH, if your having issues with power you should be seeing sata
transmission errors reported in the kernel log.  I don't recall your
saying, but I assume you've been watching that and nothing is showing
up during these events?

Greg
-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-06 22:16                     ` Greg Freemyer
@ 2009-04-07 18:22                       ` Leslie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-07 18:22 UTC (permalink / raw)
  To: 'Linux RAID'

> It may or may not be as simple as you describe.
> 
> A lot of modern PSUs have multiple rails.  Each rail is independent
> and provides some number of watts.

Now we are in the area where I am a completely qualified expert.  Designing
power systems is one of my primary tasks, and I regularly design power
systems far larger (up to 100,000 amperes and 5.4 Megawatts) and more
sophisticated than those in PCs.

> So if your CPU / graphic rail(s) are getting 300 watts out of your
> 550, then that only leave 250.  And that can be split among 2 or 3
> rails as well.

Since the CPU and graphics card are not even on the same power supply as the
drives, let alone a common rail, this can't apply, can it?  I don't know off
the top of my head how much power the (embedded) GPU uses, but if it is
close to 425 Watts, I'll eat my hat.  The CPU is rated at 125 Watts.  I
haven't checked, but I doubt the eSATA controller is more than 3 watts,
given it has no heat sinks.  At high speed, the fans chew up 35 watts on the
main chassis, although the fans rarely run at more than 1/2 speed.  The
keyboard and mouse are trivial.  I don't know exactly what the fans on the
array draw, but it can't be more than 30 watts - probably much less.

> It definitely gets complicated.

'Not really.  PC supplies are trivial compared to the ones with which I deal
daily.

> Or you may have a simple design where all the 5v lines are tied
> together, same for +12v,-12v.

Yes, my response was simplified.  I checked the total current carrying
capability of each system when I purchased them.  The RAID array supply has
a single 12 rail, a single -12v rail, and a single 5v rail.  It's rated to
be able to handle the spin-up current of 12 of the Hitachi Deskstar drives
on all rails, that being the drive with the highest spin-up current
requirements on all three supplies.  As I said, one of them could be bad,
but the power supplies are not below spec.

> OTOH, if your having issues with power you should be seeing sata
> transmission errors reported in the kernel log.  I don't recall your
> saying, but I assume you've been watching that and nothing is showing
> up during these events?

Yes, I did say, more than once I believe.  There are no errors of any sort
in the logs other than those on the UPS due to a known bug in the UPS
driver, and on the 230G SATA drive, which is known to have problems, but is
not part of the array.  Other than those, the logs only show temperature
info from smartd and sensord, DHCP bindings, etc.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05 20:00                 ` Greg Freemyer
  2009-04-05 20:39                   ` Peter Grandi
  2009-04-05 22:03                   ` Leslie Rhorer
@ 2009-04-24  4:52                   ` Leslie Rhorer
  2009-04-24  6:50                     ` Richard Scobie
                                       ` (4 more replies)
  2 siblings, 5 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-24  4:52 UTC (permalink / raw)
  To: 'Linux RAID'

> I'm not really keeping up with things like video editing, but as
> someone else said XFS was specifically designed for that type of
> workload.  It even has a psuedo realtime capability to ensure you
> maintain your frame rate, etc.  Or so I understand.  I've never used
> that feature.  You could also evaluate the different i/o elevators.

I tried different schedulers, with no apparent effect.  A known bug in
oprofile combined with my own nearly total unfamiliarity with using the tool
has brought me pretty much to a dead end, unless someone has some additional
guidance for me.  In addition, I have since learned I should not have
selected the default .90 superblock when building the array, but should have
selected probably a 1.2 superblock, instead.  Given all that, unless someone
else has a better idea, I am going to go ahead and tear down the array and
rebuild it with a version 1.2 superblock.  I have suspended all writes to
the array and double-backed up all the most critical data along with a small
handful of files which for some unknown reason appear to differ by a few
bytes between the RAID array copy and the backup copy.  I just hope like all
get-out the backup system doesn't crash sometime in the four days after I
tear down the RAID array and start to rebuild it.

I've done some reading, and it's been suggested a 128K chunk size might be a
better choice on my system than the default chunk size of 64K, so I intend
to create the new array on the raw devices with the command:

mdadm --create --raid-devices=10 --metadata=1.2 --chunk=128 --level=6
/dev/sd[a-j]

Does anyone have any better suggestions or comments on creating the array
with these options?  It is going to start as an 8T array and probably grow
to 30T by the end of this year or early next year, increasing the number of
drives to 12 and then swapping out the 1T drives for 3T drives, hopefully
after the price of 3T drives has dropped considerably.

I intend to create an XFS file system on the raw RAID device, which I am
given to understand offers few if any disadvantages compared to
partitioning the array, or partitioning the devices below the array, for
that matter, given I am devoting each entire device to the array and the
entire array to the single file system.  Does anyone strongly disagree?  I
see no advantage to LVM in this application, either.  Again, are there any
dissenting opinions?

Also, in my reading it was suggested by several researchers the best
performance of an XFS file system is achieved if the stripe width of the FS
is set to be the same as the RAID array using the su and sw switches in
mkfs.xfs.  I've also read the man page for mkfs.xfs, but I am quite unclear
on several points, in my defense perhaps because I am really exhausted at
this point.

1.  How do I determine the stripe width for a RAID 6 array, either before or
after creating it?  The entries I have read strongly suggest to me the chunk
size and the stripe size are closely related (I would have thought it would
be the product of the chunk size and the number of drives, excluding the
parity drives), but exactly how they are related escapes me.  I haven't read
anything which states it explicitly, and the examples I've read seem
contradictory to each other, or unclear at best.

2.  Am I correct in assuming the number of bytes in the XFS stripe size
should be equal to the product of the sw and su parameters?  If not, what?
Why are there two separate parameters and what would be the effect of them
both being off, if the product is still equal to the sripe size?

3.  The man page says "When a filesystem is created on a logical volume
device, mkfs.xfs will automatically  query  the  logical  volume  for
appropriate sunit and swidth values."  Does this mean it is best for me to
simply not worry about setting these parameters and let mkfs.xfs do it, or
is there a good reason for me to intervene?

4.  My reading, including the statement in the mkfs.xfs man page which says,
"The value [of the sw parameter] is expressed as a multiplier of the stripe
unit, usually the same as the number of stripe members in the logical volume
configuration, or data disks in a RAID device", suggests to me the optimal
stripe size for an XFS file system will change when the number of member
disks is increased.  Am I correct in this inference?  If so, I haven't seen
anything suggesting the stripe size of the FXS file system can be modified
after the file system is created.  Certainly the man page for xfs_growfs
mentions nothing of it.  The researchers I read all suggested the
performance of FXS is greatly enhanced if the file system stripe size
matches the RAID stripe size.  I'm also a little puzzled why the stripe
width of the XFS file system should be the same as the number of drives in a
RAID 5 or RAID 6 array, since to the file system the stripe extent would
seem to be defined by the data drives, because a payload which fits
perfectly on N drive chunks is spread across N+2 drive chunks on a RAID 6
array.  To put it another way, it seems to me the parity drives should be
excluded from the calculation.

5.  Finally, one other thing concerns me a bit.  The researchers I read
suggested XFS has by far the worst file deletion performance of any of the
journaling file systems, and Reiserfs supposedly has the best.  I find that
shocking, since deleting multi-gigabyte files on the existing file system
can take a rather long time - close to a minute.  Small to moderate sized
files get deleted in a flash, but 20GB or 30GB files take forever.  I didn't
find that abnormal, considering how Linux file systems are structured, and
it's not a huge problem given it never locks up the file system the way a
file creation often does, but if a file system which is supposed to be
super-terrific takes such a long time to delete a file, how bad is it going
to be when I install the worst file system when it comes to file deletion
times?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-24  4:52                   ` Leslie Rhorer
@ 2009-04-24  6:50                     ` Richard Scobie
  2009-04-24 10:03                       ` Leslie Rhorer
  2009-04-24 15:24                     ` Andrew Burgess
                                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 84+ messages in thread
From: Richard Scobie @ 2009-04-24  6:50 UTC (permalink / raw)
  To: lrhorer; +Cc: Linux RAID Mailing List

Leslie Rhorer wrote:
> 
> 3.  The man page says "When a filesystem is created on a logical volume
> device, mkfs.xfs will automatically  query  the  logical  volume  for
> appropriate sunit and swidth values."  Does this mean it is best for me to
> simply not worry about setting these parameters and let mkfs.xfs do it, or
> is there a good reason for me to intervene?

XFS will do exactly that and choose the correct sunit and swidth based 
upon the RAID stripe size you have chosen - do not worry about it.


> 5.  Finally, one other thing concerns me a bit.  The researchers I read
> suggested XFS has by far the worst file deletion performance of any of the
> journaling file systems, and Reiserfs supposedly has the best.  I find that
> shocking, since deleting multi-gigabyte files on the existing file system
> can take a rather long time - close to a minute.  Small to moderate sized
> files get deleted in a flash, but 20GB or 30GB files take forever.  I didn't

Quad core Xeon 2.8GHz md RAID6 of 16 x 750GB. Delete 20GB file:

time rm -f dump

real    0m1.849s
user    0m0.000s
sys     0m1.633s

Regards,

Richard

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-24  6:50                     ` Richard Scobie
@ 2009-04-24 10:03                       ` Leslie Rhorer
  2009-04-28 19:36                         ` lrhorer
  0 siblings, 1 reply; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-24 10:03 UTC (permalink / raw)
  To: 'Linux RAID'

> Quad core Xeon 2.8GHz md RAID6 of 16 x 750GB. Delete 20GB file:
> 
> time rm -f dump
> 
> real    0m1.849s
> user    0m0.000s
> sys     0m1.633s

3.2 GHz AMD Athlon 64 x 2 RAID6 of 10 x 1T:

RAID-Server:/RAID/Recordings# ll sizetest.fil
-rw-rw-rw- 1 lrhorer users 27847620608 2009-04-24 03:21 sizetest.fil
RAID-Server:/RAID/Recordings# time rm sizetest.fil

real    0m21.503s
user    0m0.000s
sys     0m6.852s

See what I mean?  'Zero additional activity on the array other than the rm
task.  We'll see what happens with XFS.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-24 10:03                       ` Leslie Rhorer
@ 2009-04-28 19:36                         ` lrhorer
  0 siblings, 0 replies; 84+ messages in thread
From: lrhorer @ 2009-04-28 19:36 UTC (permalink / raw)
  To: 'Linux RAID'

> > Quad core Xeon 2.8GHz md RAID6 of 16 x 750GB. Delete 20GB file:
> >
> > time rm -f dump
> >
> > real    0m1.849s
> > user    0m0.000s
> > sys     0m1.633s
> 
> 3.2 GHz AMD Athlon 64 x 2 RAID6 of 10 x 1T:
> 
> RAID-Server:/RAID/Recordings# ll sizetest.fil
> -rw-rw-rw- 1 lrhorer users 27847620608 2009-04-24 03:21 sizetest.fil
> RAID-Server:/RAID/Recordings# time rm sizetest.fil
> 
> real    0m21.503s
> user    0m0.000s
> sys     0m6.852s
> 
> See what I mean?  'Zero additional activity on the array other than the rm
> task.  We'll see what happens with XFS.

So far, it's looking much better.  I'll test it again when the array is
closer to being full.

RAID-Server:/RAID/Recordings# ll testfile
-rw-rw-rw- 1 lrhorer users 12506679296 2009-04-25 03:29 testfile
RAID-Server:/RAID/Recordings# time rm testfile

real    0m0.111s
user    0m0.000s
sys     0m0.000s


^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-24  4:52                   ` Leslie Rhorer
  2009-04-24  6:50                     ` Richard Scobie
@ 2009-04-24 15:24                     ` Andrew Burgess
  2009-04-25  4:26                       ` Leslie Rhorer
  2009-04-24 17:03                     ` Doug Ledford
                                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 84+ messages in thread
From: Andrew Burgess @ 2009-04-24 15:24 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

On Thu, 2009-04-23 at 23:52 -0500, Leslie Rhorer wrote:

> Does anyone have any better suggestions or comments on creating the array
> with these options?  It is going to start as an 8T array and probably grow
> to 30T by the end of this year or early next year, increasing the number of
> drives to 12 and then swapping out the 1T drives for 3T drives, hopefully
> after the price of 3T drives has dropped considerably.
> 
> I intend to create an XFS file system 

The one disadvantage to XFS is you cannot shrink the filesystem. This is
handy when upgrading the array when you want to reuse some of the
smaller disks to save money. i.e:

Create a new 3T device array but one that only holds say half your data.
Copy half your data to the new array. Shrink the old array (fs and then
md). This frees up some 1T disks which you can make into 3T devices with
md, add to the new array, grow the fs. Repeat untill all the data is
transferred. Your data is protected against disk failure the whole time.

I did exactly this in the past with ext3 but talked myself into using
xfs for the new array so this time when I upgraded the array from 400G
devices to 750GB devices I had to buy enough 750's to hold everything. I
was still able to reuse some of the 400GB to give lots of extra space on
the new array after the copy.

> on the raw RAID device, which I am
> given to understand offers few if any disadvantages compared to
> partitioning the array, or partitioning the devices below the array, for
> that matter, given I am devoting each entire device to the array and the
> entire array to the single file system.  Does anyone strongly disagree?  I
> see no advantage to LVM in this application, either.  Again, are there any
> dissenting opinions?

I agree about LVM but am no expert

> 3.  The man page says "When a filesystem is created on a logical volume
> device, mkfs.xfs will automatically  query  the  logical  volume  for
> appropriate sunit and swidth values."  Does this mean it is best for me to
> simply not worry about setting these parameters and let mkfs.xfs do it, or
> is there a good reason for me to intervene?
> 
> 4.  My reading, including the statement in the mkfs.xfs man page which says,
> "The value [of the sw parameter] is expressed as a multiplier of the stripe
> unit, usually the same as the number of stripe members in the logical volume
> configuration, or data disks in a RAID device", suggests to me the optimal
> stripe size for an XFS file system will change when the number of member
> disks is increased.  Am I correct in this inference?  If so, I haven't seen
> anything suggesting the stripe size of the FXS file system can be modified
> after the file system is created.  Certainly the man page for xfs_growfs
> mentions nothing of it.  The researchers I read all suggested the
> performance of FXS is greatly enhanced if the file system stripe size
> matches the RAID stripe size.  I'm also a little puzzled why the stripe
> width of the XFS file system should be the same as the number of drives in a
> RAID 5 or RAID 6 array, since to the file system the stripe extent would
> seem to be defined by the data drives, because a payload which fits
> perfectly on N drive chunks is spread across N+2 drive chunks on a RAID 6
> array.  To put it another way, it seems to me the parity drives should be
> excluded from the calculation.

The mount man page says it can be changed at mount time which does seem
a little strange to me.

Quoting man mount:

       sunit=value and swidth=value

"Used  to  specify  the  stripe unit and width for a RAID device or a
stripe volume.  value must be specified in 512-byte block units.  If
this option is not specified and the filesystem was made on a stripe
volume or the stripe width or unit were specified for the RAID device at
mkfs time,  then  the  mount system call will restore the value from the
superblock.  For filesystems that are made directly on RAID devices,
these options can be used to override the information in the superblock
if the underlying disk layout changes after the filesystem has been
created. The swidth option is required if the sunit option has been
specified, and must be a multiple of the sunit value."

Maybe it means newly created files use the new sunit/swidth values?
There is also defrag available for xfs, perhaps this rearranges things
as well, I don't know.

Once you create the fs and examine the values in /proc/mounts you could
see if they change when you add a device to the array, grow the fs and
remount. Also, your argument about the number of data disks makes sense
to me. After you get some data you might ask on the xfs mailing list if
you see a discrepancy.

My 14 device 128K chunk raid6 xfs picked "sunit=256,swidth=1024"
according to /proc/mounts. I think the units are 512 byte sectors so the
sunit is the same as the chunk size. I don't know what these values were
before the last md grow.

> 5.  Finally, one other thing concerns me a bit.  The researchers I read
> suggested XFS has by far the worst file deletion performance of any of the
> journaling file systems

Single file deletes of ~10GB work fine on my system but several in a row
will bog things down. Make sure you measure whats important to you; your
example shows deleting a single 20GB file. Is that what needs to be fast
or do you delete several files like that at once? And benchmarking rm
without a final sync may not be valid (or at least will measure
different things).

Also, there is an alloc_size mount parameter which reduces fragmentation
and may speed deletes. 

HTH

PS I wish I could have helped you with oprofile but its been a while
since I used it - we'd be starting at the same place ;-)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-24 15:24                     ` Andrew Burgess
@ 2009-04-25  4:26                       ` Leslie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-25  4:26 UTC (permalink / raw)
  To: 'Linux RAID'

> > I intend to create an XFS file system
> 
> The one disadvantage to XFS is you cannot shrink the filesystem. This is
> handy when upgrading the array when you want to reuse some of the
> smaller disks to save money. i.e:

Yeah, I noticed that.  All things considered, I think I can live with the
disadvantage.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-24  4:52                   ` Leslie Rhorer
  2009-04-24  6:50                     ` Richard Scobie
  2009-04-24 15:24                     ` Andrew Burgess
@ 2009-04-24 17:03                     ` Doug Ledford
  2009-04-24 20:25                       ` Richard Scobie
  2009-04-25  7:40                       ` Leslie Rhorer
  2009-04-24 20:25                     ` Greg Freemyer
  2009-04-25  7:24                     ` Leslie Rhorer
  4 siblings, 2 replies; 84+ messages in thread
From: Doug Ledford @ 2009-04-24 17:03 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

[-- Attachment #1: Type: text/plain, Size: 2918 bytes --]

On Apr 24, 2009, at 12:52 AM, Leslie Rhorer wrote:
> I've done some reading, and it's been suggested a 128K chunk size  
> might be a
> better choice on my system than the default chunk size of 64K, so I  
> intend
> to create the new array on the raw devices with the command:
>
> mdadm --create --raid-devices=10 --metadata=1.2 --chunk=128 --level=6
> /dev/sd[a-j]

Go with a bigger chunk size.  Especially if you do lots of big file  
manipulation.  During testing (many years ago now, admittedly) with  
Dell for some benchmarks, it was determined that when using linux  
software raid, larger chunk sizes would tend to increase performance.   
In those tests, we settled on a 2MB chunk size.  I wouldn't recommend  
you go *that* high, but I could easily see 256k or 512k chunk sizes.   
However, you are using raid6 and that might impact the optimal chunk  
size towards smaller sizes.  A raid6 expert would need to comment on  
that.

> Does anyone have any better suggestions or comments on creating the  
> array
> with these options?  It is going to start as an 8T array and  
> probably grow
> to 30T by the end of this year or early next year, increasing the  
> number of
> drives to 12 and then swapping out the 1T drives for 3T drives,  
> hopefully
> after the price of 3T drives has dropped considerably.

I'm a big fan of the bitmap stuff.  I use internal bitmaps on all my  
arrays except boot arrays where they are so small it doesn't matter.   
However, the performance reduction you get from a bitmap is  
proportional to the granularity of the bitmap.  So, I use big bitmap- 
chunk sizes too (32768k is usually my normal bitmap size, but I'm  
getting ready to do some testing soon to see if I want to modify that  
for recent hardware).

> I intend to create an XFS file system on the raw RAID device, which  
> I am
> given to understand offers few if any disadvantages compared to
> partitioning the array, or partitioning the devices below the array,  
> for
> that matter, given I am devoting each entire device to the array and  
> the
> entire array to the single file system.  Does anyone strongly  
> disagree?  I
> see no advantage to LVM in this application, either.  Again, are  
> there any
> dissenting opinions?

Sounds right.

> Also, in my reading it was suggested by several researchers the best
> performance of an XFS file system is achieved if the stripe width of  
> the FS
> is set to be the same as the RAID array using the su and sw switches  
> in
> mkfs.xfs.

This is true of any raid aware file system.  I know how to do this for  
ext3, but not for xfs, so I won't comment further on that.  However,  
the stripe size is always chunk size * number of non-parity drives on  
a parity based array.

--

Doug Ledford <dledford@redhat.com>

GPG KeyID: CFBFF194
http://people.redhat.com/dledford

InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 203 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-24 17:03                     ` Doug Ledford
@ 2009-04-24 20:25                       ` Richard Scobie
  2009-04-24 20:28                         ` CoolCold
  2009-04-25  7:40                       ` Leslie Rhorer
  1 sibling, 1 reply; 84+ messages in thread
From: Richard Scobie @ 2009-04-24 20:25 UTC (permalink / raw)
  To: Doug Ledford; +Cc: lrhorer, 'Linux RAID'

Doug Ledford wrote:

>> Also, in my reading it was suggested by several researchers the best
>> performance of an XFS file system is achieved if the stripe width of 
>> the FS
>> is set to be the same as the RAID array using the su and sw switches in
>> mkfs.xfs.
> 
> This is true of any raid aware file system.  I know how to do this for 
> ext3, but not for xfs, so I won't comment further on that.  However, the 
> stripe size is always chunk size * number of non-parity drives on a 
> parity based array.

mkfs.xfs will will query the md device for no. of devices and stripe 
size and automatically set this optimally (su and sw).

If an array is later grown, these two values will need to be manually 
calculated and then be applied as mount options.

Regards,

Richard

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-24 20:25                       ` Richard Scobie
@ 2009-04-24 20:28                         ` CoolCold
  2009-04-24 21:04                           ` Richard Scobie
  0 siblings, 1 reply; 84+ messages in thread
From: CoolCold @ 2009-04-24 20:28 UTC (permalink / raw)
  To: Richard Scobie; +Cc: Doug Ledford, lrhorer, Linux RAID

On Sat, Apr 25, 2009 at 12:25 AM, Richard Scobie <richard@sauce.co.nz> wrote:
> Doug Ledford wrote:
>
>>> Also, in my reading it was suggested by several researchers the best
>>> performance of an XFS file system is achieved if the stripe width of the
>>> FS
>>> is set to be the same as the RAID array using the su and sw switches in
>>> mkfs.xfs.
>>
>> This is true of any raid aware file system.  I know how to do this for
>> ext3, but not for xfs, so I won't comment further on that.  However, the
>> stripe size is always chunk size * number of non-parity drives on a parity
>> based array.
>
> mkfs.xfs will will query the md device for no. of devices and stripe size
> and automatically set this optimally (su and sw).

It will, but need dmsetup installed.

>
> If an array is later grown, these two values will need to be manually
> calculated and then be applied as mount options.
>
> Regards,
>
> Richard
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
-- 
Best regards,
[COOLCOLD-RIPN]
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-24 20:28                         ` CoolCold
@ 2009-04-24 21:04                           ` Richard Scobie
  0 siblings, 0 replies; 84+ messages in thread
From: Richard Scobie @ 2009-04-24 21:04 UTC (permalink / raw)
  To: CoolCold; +Cc: Doug Ledford, lrhorer, Linux RAID

CoolCold wrote:

>> mkfs.xfs will will query the md device for no. of devices and stripe size
>> and automatically set this optimally (su and sw).
> 
> It will, but need dmsetup installed.

Hmm. I have never used this and from reading the man page, it looks as 
though it applies to dm devices. I am using md.

Or am I missing something here. Certainly any time I have XFS formatted 
any md RAID 5/6 device directly (no volume management), the swidth and 
sunit parameters have been set to the optimal values automatically.

Regards,

Richard

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-24 17:03                     ` Doug Ledford
  2009-04-24 20:25                       ` Richard Scobie
@ 2009-04-25  7:40                       ` Leslie Rhorer
  2009-04-25  8:53                         ` Michał Przyłuski
  2009-04-28 19:33                         ` Leslie Rhorer
  1 sibling, 2 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-25  7:40 UTC (permalink / raw)
  To: 'Doug Ledford'; +Cc: 'Linux RAID'

> On Apr 24, 2009, at 12:52 AM, Leslie Rhorer wrote:
> > I've done some reading, and it's been suggested a 128K chunk size
> > might be a
> > better choice on my system than the default chunk size of 64K, so I
> > intend
> > to create the new array on the raw devices with the command:
> >
> > mdadm --create --raid-devices=10 --metadata=1.2 --chunk=128 --level=6
> > /dev/sd[a-j]
> 
> Go with a bigger chunk size.  Especially if you do lots of big file
> manipulation.  During testing (many years ago now, admittedly) with
> Dell for some benchmarks, it was determined that when using linux
> software raid, larger chunk sizes would tend to increase performance.
> In those tests, we settled on a 2MB chunk size.  I wouldn't recommend
> you go *that* high, but I could easily see 256k or 512k chunk sizes.

I went ahead and selected 256K.  There's something a little odd, though.
mdadm is reporting a device size of 2T.  These are 1T drives.  The overall
array size is correct but the device size is goofy.  I hope this doesn't
cause any problems.  As I believe I mentioned before (or maybe it was on
another list), I once had problems with ext3 trying to read beyond the
physical end of the array.

RAID-Server:/Backup/Personal_Folders# mdadm -Dt /dev/md0
/dev/md0:
        Version : 01.02
  Creation Time : Sat Apr 25 01:17:12 2009
     Raid Level : raid6
     Array Size : 7814098944 (7452.11 GiB 8001.64 GB)
  Used Dev Size : 1953524736 (1863.03 GiB 2000.41 GB)
   Raid Devices : 10
  Total Devices : 10
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Apr 25 01:20:47 2009
          State : active, resyncing
 Active Devices : 10
Working Devices : 10
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 256K

 Rebuild Status : 0% complete

           Name : RAID-Server:0  (local to host RAID-Server)
           UUID : 5ff10d73:a096195f:7a646bba:a68986ca
         Events : 1

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       16        1      active sync   /dev/sdb
       2       8       32        2      active sync   /dev/sdc
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       80        5      active sync   /dev/sdf
       6       8       96        6      active sync   /dev/sdg
       7       8      112        7      active sync   /dev/sdh
       8       8      128        8      active sync   /dev/sdi
       9       8      144        9      active sync   /dev/sdj



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-25  7:40                       ` Leslie Rhorer
@ 2009-04-25  8:53                         ` Michał Przyłuski
  2009-04-28 19:33                         ` Leslie Rhorer
  1 sibling, 0 replies; 84+ messages in thread
From: Michał Przyłuski @ 2009-04-25  8:53 UTC (permalink / raw)
  To: lrhorer; +Cc: Doug Ledford, Linux RAID

2009/4/25 Leslie Rhorer <lrhorer@satx.rr.com>:
> I went ahead and selected 256K.  There's something a little odd, though.
> mdadm is reporting a device size of 2T.  These are 1T drives.  The overall
> array size is correct but the device size is goofy.  I hope this doesn't
> cause any problems.  As I believe I mentioned before (or maybe it was on
> another list), I once had problems with ext3 trying to read beyond the
> physical end of the array.
>
> RAID-Server:/Backup/Personal_Folders# mdadm -Dt /dev/md0
> /dev/md0:
>        Version : 01.02
>  Creation Time : Sat Apr 25 01:17:12 2009
>     Raid Level : raid6
>     Array Size : 7814098944 (7452.11 GiB 8001.64 GB)
>  Used Dev Size : 1953524736 (1863.03 GiB 2000.41 GB)

It's a bug in mdadm, probably leftover from blocks/sectors migration,
I believe it was fixed in 2.6.9. The reality is looking okay, array is
safe, it's just mdadm miscalculating on the very last stage of output.

Greets,
Mike

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-25  7:40                       ` Leslie Rhorer
  2009-04-25  8:53                         ` Michał Przyłuski
@ 2009-04-28 19:33                         ` Leslie Rhorer
  2009-04-29 11:25                           ` John Robinson
  2009-05-03  2:23                           ` Leslie Rhorer
  1 sibling, 2 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-28 19:33 UTC (permalink / raw)
  To: 'Linux RAID'

	The RAID array is at 80% on the resync, and approximately 25% of the
data has been transferred from the backups.  I believe the restore is
running so slowly because of an error I made building the LVM on the backup
system.  Rather than 4 days to recover, it looks like it's going to take 16
days, or more.  As you might imagine, I'm pretty nervous, as the backup is
not fault tolerant, and the odds against the failure of a hard drive in a 16
day period are not astronomical by any means.

	This experience has prompted me to re-think my position on the
backup system.  My thoughts previously were I could save on money and effort
by simply employing a JBOD volume managed by LVM for the backup system.
After this experience, and even with the fact once the data is all
transferred I can fix the backup system to allow proper performance (i.e.
50+ MBps, rather than the current average of about 4 MBps), I believe I will
implement a RAID 5 array on the backup system.  Still being limited by
budgetary concerns, however, I saw something here which started me thinking
again.  I'm soon going to have a number of 1T drives available for use, and
replacing the backup system's 1.5T drives with brand new 2T or 3T drives is
going to be a daunting task.  Is it possible, then to use the 1T drives
combined with a like number of 1.5T to create a number (say, 6 or so) of
2.5T LVM volumes and then use mdadm to create a RAID 5 array assembled from
the virtual volumes?  Of course, I obviously could create a RAID 50 array
from the individual discs, but them the member size would be limited to 1T,
wasting .5T of each 1.5T drive.  The difference in cost between a single
1.5T drive and five or six 3T drives is going to be pretty significant, and
using the 1T drives as an interim step hopefully might allow me to put off
buying the 3T drives until they have come down in price.  I can easily
assemble twelve TB from spare parts, which should last me well into the 3rd
quarter of next year or beyond.

	Any comments?  I know with 12 drives vs. 6 drives the liklihood of a
failure is much greater, but the RAID 5 umbrella helps, and the fact this is
a backup system means my required span of reliability is much lower than for
the main system.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-28 19:33                         ` Leslie Rhorer
@ 2009-04-29 11:25                           ` John Robinson
  2009-04-30  0:55                             ` Leslie Rhorer
  2009-05-03  2:23                           ` Leslie Rhorer
  1 sibling, 1 reply; 84+ messages in thread
From: John Robinson @ 2009-04-29 11:25 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

On 28/04/2009 20:33, Leslie Rhorer wrote:
[...]
>  Is it possible, then to use the 1T drives
> combined with a like number of 1.5T to create a number (say, 6 or so) of
> 2.5T LVM volumes and then use mdadm to create a RAID 5 array assembled from
> the virtual volumes?

Yes, you could do that, though I think I'd use md to create 6 linear 
sets, which are then combined into a RAID 5; I imagine md will perform 
better than LVM for this task.

What I'd be much more inclined to do is create 2 RAID 5 sets, one of 1T 
discs, and one of 1.5T discs, then combine the two using LVM. It just 
feels more natural to me that way. In addition, while both setups will 
tolerate one disc failure, yours will then only tolerate one particular 
one of the remaining 11 failing (the companion drive in the LVM/linear 
pair), and mine will tolerate a failure of any drive on the other RAID-5.

Hope this is of interest.

Cheers,

John.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-29 11:25                           ` John Robinson
@ 2009-04-30  0:55                             ` Leslie Rhorer
  2009-04-30 12:34                               ` John Robinson
  0 siblings, 1 reply; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-30  0:55 UTC (permalink / raw)
  To: 'Linux RAID'

> >  Is it possible, then to use the 1T drives
> > combined with a like number of 1.5T to create a number (say, 6 or so) of
> > 2.5T LVM volumes and then use mdadm to create a RAID 5 array assembled
> from
> > the virtual volumes?
> 
> Yes, you could do that, though I think I'd use md to create 6 linear
> sets, which are then combined into a RAID 5; I imagine md will perform
> better than LVM for this task.

I hadn't thought of that.

> What I'd be much more inclined to do is create 2 RAID 5 sets, one of 1T
> discs, and one of 1.5T discs, then combine the two using LVM. It just
> feels more natural to me that way. In addition, while both setups will
> tolerate one disc failure, yours will then only tolerate one particular
> one of the remaining 11 failing (the companion drive in the LVM/linear
> pair), and mine will tolerate a failure of any drive on the other RAID-5.

I had thought of this, but then replacing the 1T + 1.5T pairs with 3T
drives, which I expect to happen eventually, will be more difficult.  I do
of course realize it is a bit more frail, but once again, to be devastating
it would have to have two drives fail during a period (as now) when the data
on the main array is incomplete or the main array has failed.

> Hope this is of interest.

Oh, you bet!


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-30  0:55                             ` Leslie Rhorer
@ 2009-04-30 12:34                               ` John Robinson
  2009-05-03  2:16                                 ` Leslie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: John Robinson @ 2009-04-30 12:34 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

On 30/04/2009 01:55, Leslie Rhorer wrote:
>> What I'd be much more inclined to do is create 2 RAID 5 sets, one of 1T
>> discs, and one of 1.5T discs, then combine the two using LVM. It just
>> feels more natural to me that way. In addition, while both setups will
>> tolerate one disc failure, yours will then only tolerate one particular
>> one of the remaining 11 failing (the companion drive in the LVM/linear
>> pair), and mine will tolerate a failure of any drive on the other RAID-5.
> 
> I had thought of this, but then replacing the 1T + 1.5T pairs with 3T
> drives, which I expect to happen eventually, will be more difficult.

Not really. You can just replace the drives in one array with 3T ones 
(one at a time, waiting for the rebuild to complete each time), grow the 
array, resize LVM's PV with `pvresize`, tell LVM to move the data from 
the other array across with `pvmove`, decommission the now-empty 
array/PV, and finally resize the filesystem.

If you can have all the drives online at once, it's easier than that - 
make a new array of 3T drives, add it to the LVM VG, and `pvmove` all 
your data, decommission the now-empty arrays/PVs, and resize the fs.

Or you might choose to keep one of the arrays of smaller drives and have 
a larger backup system, no `pvmove`s required.

Cheers,

John.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-30 12:34                               ` John Robinson
@ 2009-05-03  2:16                                 ` Leslie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-05-03  2:16 UTC (permalink / raw)
  To: 'Linux RAID'

> On 30/04/2009 01:55, Leslie Rhorer wrote:
> >> What I'd be much more inclined to do is create 2 RAID 5 sets, one of 1T
> >> discs, and one of 1.5T discs, then combine the two using LVM. It just
> >> feels more natural to me that way. In addition, while both setups will
> >> tolerate one disc failure, yours will then only tolerate one particular
> >> one of the remaining 11 failing (the companion drive in the LVM/linear
> >> pair), and mine will tolerate a failure of any drive on the other RAID-
> 5.
> >
> > I had thought of this, but then replacing the 1T + 1.5T pairs with 3T
> > drives, which I expect to happen eventually, will be more difficult.
> 
> Not really. You can just replace the drives in one array with 3T ones
> (one at a time, waiting for the rebuild to complete each time), grow the
> array, resize LVM's PV with `pvresize`, tell LVM to move the data from
> the other array across with `pvmove`, decommission the now-empty
> array/PV, and finally resize the filesystem.

	Yeah, but that requires replacing all the drives of one size and
growing the array.  It's a bit more trouble than simply taking a 1T + 1.5T
pair offline and replacing with a single 3T drive.  It also requires buying
six 3T drives all at once.  With the other method, I can buy two 3T drives,
swap out two pairs of drives, and then add one 1.5T drive to each of two
other 1.5T drives and move the remaining six 1T drives into two sets of
three drives.  Grow the array and I'm up to 15T with two extra 1.5T drives
while only having purchased an additional 2 drives.  As my budget allows, I
can replace each two or 3 element set with a single 3T drive.

> If you can have all the drives online at once, it's easier than that -
> make a new array of 3T drives, add it to the LVM VG, and `pvmove` all
> your data, decommission the now-empty arrays/PVs, and resize the fs.
> 
> Or you might choose to keep one of the arrays of smaller drives and have
> a larger backup system, no `pvmove`s required.

Yeah, but depending on the actual cost of 3T drives at the time, buying six
of them might be a bit difficult.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-28 19:33                         ` Leslie Rhorer
  2009-04-29 11:25                           ` John Robinson
@ 2009-05-03  2:23                           ` Leslie Rhorer
  1 sibling, 0 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-05-03  2:23 UTC (permalink / raw)
  To: 'Linux RAID'

	For anyone who really cares (probably no one), the system has now
transferred more than half the data over from the backup to the RAID system.
I have not yet experienced a single halt.  Performance seems good (I'll test
a  lot more once the data has transferred and I can take the backup array
offline).  With the RAID array more than 40% full, creating a file takes
almost no time.  Deleting a 23G file takes less than 0.9 seconds.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-24  4:52                   ` Leslie Rhorer
                                       ` (2 preceding siblings ...)
  2009-04-24 17:03                     ` Doug Ledford
@ 2009-04-24 20:25                     ` Greg Freemyer
  2009-04-25  7:24                     ` Leslie Rhorer
  4 siblings, 0 replies; 84+ messages in thread
From: Greg Freemyer @ 2009-04-24 20:25 UTC (permalink / raw)
  To: lrhorer; +Cc: Linux RAID

<snip>
> 5.  Finally, one other thing concerns me a bit.  The researchers I read
> suggested XFS has by far the worst file deletion performance of any of the
> journaling file systems, and Reiserfs supposedly has the best.

I'm pretty sure that someone got that wrong.  The real issue is that
xfs is really slow comparatively at deleting *small* files.

I did a test just a couple months ago comparing reiser and xfs.  xfs
was 30x slower.

My test was to make a copy of a linux kernel source tree.  Then time
deleting it.

Reiser did it in a couple seconds.  xfs took a minute.  So xfs is not
a good choice to hold your source directories it your a kernel
developer.

Greg
-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-24  4:52                   ` Leslie Rhorer
                                       ` (3 preceding siblings ...)
  2009-04-24 20:25                     ` Greg Freemyer
@ 2009-04-25  7:24                     ` Leslie Rhorer
  4 siblings, 0 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-25  7:24 UTC (permalink / raw)
  To: lrhorer, 'Linux RAID'

> selected probably a 1.2 superblock, instead.  Given all that, unless
> someone
> else has a better idea, I am going to go ahead and tear down the array and
> rebuild it with a version 1.2 superblock.  I have suspended all writes to
> the array and double-backed up all the most critical data along with a
> small
> handful of files which for some unknown reason appear to differ by a few
> bytes between the RAID array copy and the backup copy.  I just hope like
> all
> get-out the backup system doesn't crash sometime in the four days after I
> tear down the RAID array and start to rebuild it.
> 
> I've done some reading, and it's been suggested a 128K chunk size might be
> a
> better choice on my system than the default chunk size of 64K, so I intend
> to create the new array on the raw devices with the command:
> 
> mdadm --create --raid-devices=10 --metadata=1.2 --chunk=128 --level=6
> /dev/sd[a-j]

No one noticed this was missing the target array.  I didn't either until I
ran it and mdadm complained there weren't enough member disks.  'Puzzled the
dickens out of me until I realized mdadm was trying to create an array at
/dev/sda from disks /dev/sdb - /dev/sdj.  'Silly computer.  :-)

For anyone who is interested, the array has been created and formatted, and
the file transfers from the backup have begun, plus I have started to write
all the data I suspended from transferring over the last day or so.  The
system is also resyncing the drives, of course, so there is a persistent
stream of fairly high bandwidth reads going on in addition to the writes.
See below.  So far, nearly 10,000 files have been created without a halt,
and during a RAID resync the system previous would halt with every single
file creation.  Forty-two GB out of over 6T of data has been transferred,
and the system is starting on the large video files right now.  I have high
hopes the problem may have been resolved.  If so, it is almost certain
reiserfs was the culprit, as nothing else has changed except for the
Superblock format and the disk order within the array.  'Fingers crossed.

RAID-Server:/# iostat 1 2
Linux 2.6.26-1-amd64 (RAID-Server)      04/25/2009      _x86_64_

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.62    0.00    6.58   11.04    0.00   78.77

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              41.00      3440.04      2467.32   19477424   13969924
sdb              41.19      3441.68      2525.79   19486704   14300948
sdc              41.71      3437.65      2533.14   19463912   14342596
sdd              41.62      3445.47      2524.65   19508192   14294540
sde              41.43      3440.61      2467.74   19480680   13972308
sdf              41.24      3441.43      2519.61   19485296   14265996
sdg              41.53      3432.60      2477.87   19435296   14029668
sdh              41.23      3440.57      2528.07   19480416   14313860
sdi              45.09      3443.20      2466.24   19495336   13963796
sdj              45.29      3431.99      2535.49   19431880   14355876
hda               8.56       105.58        89.97     597815     509384
hda1              0.02         0.45         0.00       2540          0
hda2              8.53       104.94        89.92     594160     509104
hda3              0.00         0.00         0.00          6          0
hda5              0.01         0.12         0.05        693        280
md0             103.20         6.34     14438.02      35880   81747792

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.17    0.00   15.53   35.92    0.00   37.38

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              55.00      3128.00      4872.00       3128       4872
sdb              50.00      2808.00      4256.00       2808       4256
sdc              44.00      2472.00      3664.00       2472       3664
sdd              42.00      2872.00      3920.00       2872       3920
sde              55.00      2280.00      5360.00       2280       5360
sdf              68.00      2128.00      6984.00       2128       6984
sdg              56.00      2808.00      5432.00       2808       5432
sdh              48.00      3072.00      4608.00       3072       4608
sdi              54.00      3456.00      5008.00       3456       5008
sdj              59.00      3584.00      5008.00       3584       5008
hda              23.00         0.00       184.00          0        184
hda1              0.00         0.00         0.00          0          0
hda2             23.00         0.00       184.00          0        184
hda3              0.00         0.00         0.00          0          0
hda5              0.00         0.00         0.00          0          0
md0             307.00         0.00     33936.00          0      33936
 


^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05 19:17               ` John Robinson
  2009-04-05 20:00                 ` Greg Freemyer
@ 2009-04-05 21:02                 ` Leslie Rhorer
  1 sibling, 0 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-05 21:02 UTC (permalink / raw)
  To: linux-raid

> > Several, actually.  Since the RAID array kept crashing, I had to re-
> create
> > it numerous times.
> [...]
> >> your culprit is higher up the chain, ie the FS.
> >
> > I've suspected this may be the case from the outset.
> 
> I'm sorry? You've repeatedly had trouble with this system, this array,
> you've tried several filesystems; do you think they're *ALL* broken?

No, the other issues were clearly related to a single hardware issue.  It
took quite a while to find it, it being highly intermittent, but once I did,
replacing the culprit system apparently solved all the issues, and the
system ran fine for several weeks.  Then I reconfigured and formatted under
reiserfs.  That's when this issue popped up.  It's entirely possible there
is a persistent hardware problem which has plagued the system from day 1,
but was masked by the other, more severe hardware issue.  It's possible one
of the drives or some other component went bad just as I reformatted the
array.  It's possible moving the drives physically around has introduced
some problem.  It's possible whatever issue was causing the symptoms went
quiescent during the testing phase after the chassis was replaced.  All are
unlikely given the physical evidence, but none impossible.  To go any
further, I need further diagnostics.

Look, people, I am not a duffer.  I am a professional engineer who has been
in this business for over 30 years.  Personal computers and Linux are not my
area of expertise, but hardware systems and troubleshooting them are.  I am
not asking for anyone to fix my system for me, and I am perfectly capable of
drawing reasonable conclusions from empirical data.  Don't be surprised if
my conclusions differ from yours, but don't make the mistake of thinking I
have discarded any possible solutions, either.  Explain to me how the
alignment of Jupiter and Mars with Lake Michigan could be having an effect,
and I will look into it.  In the mean time, I need the advice of those more
experienced with Linux and possibly PCs than I.  That's you.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-05 18:54             ` Leslie Rhorer
  2009-04-05 19:17               ` John Robinson
@ 2009-04-05 19:26               ` Richard Scobie
  2009-04-05 20:40                 ` Leslie Rhorer
  2009-04-05 20:57               ` Peter Grandi
  2 siblings, 1 reply; 84+ messages in thread
From: Richard Scobie @ 2009-04-05 19:26 UTC (permalink / raw)
  To: lrhorer; +Cc: linux-raid

Leslie Rhorer wrote:

> Yes, but I had many problems with ext3, as well, and I definitely need a
> file system which can be grown on the fly.

XFS can be grown and is ideally suited to your use, which I beleive is 
video editing.

Regards,

Richard

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05 19:26               ` Richard Scobie
@ 2009-04-05 20:40                 ` Leslie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-05 20:40 UTC (permalink / raw)
  To: linux-raid

> > Yes, but I had many problems with ext3, as well, and I definitely need a
> > file system which can be grown on the fly.
> 
> XFS can be grown and is ideally suited to your use, which I beleive is
> video editing.

Principally that and video streaming, yes.  It also hosts some files (small
in size but moderately large in number) acting as a regular low end file
server, as well.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05 18:54             ` Leslie Rhorer
  2009-04-05 19:17               ` John Robinson
  2009-04-05 19:26               ` Richard Scobie
@ 2009-04-05 20:57               ` Peter Grandi
  2009-04-05 23:55                 ` Leslie Rhorer
  2 siblings, 1 reply; 84+ messages in thread
From: Peter Grandi @ 2009-04-05 20:57 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

> Several, actually.  Since the RAID array kept crashing,

Someone who talks or thinks in vague terms like "RAID array kept
crashing" is already on a bad path.

> I had to re-create it numerous times. I tried ext3 more than
> once, but the journal kept getting corrupted, and fixing it
> lost several files.

Well, 'ext3' is *exceptionally* well tested, and this points to
some problems with the storage system or the kernel or some
driver (e.g. use of some blob/binary proprietary driver in the
kernel). In theory everything should work with everything else
and everything should be bug free in every combination... In
practice wise people try not to beg for trouble.

> Once I lost several important files during a RAID expansion.
> In some cases I converted to ext2, and others I started out
> with ext2, but last I checked, one cannot grow an ext2 file
> system on the fly.

Modifying filesystem structure while the filesystem is operating
is a very good way to beg for trouble. Especially if under load.

That something is *possible* does not mean that it is wise to
rely on it.

[ ... ]

> Yes, but I had many problems with ext3, as well,

> and I definitely need a file system which can be grown on the
> fly.

Good luck with that. Perhaps you need to think again about your
requirements, and/or perhaps to get a much larger budget to do
some research and development into "dynamic storage pools".

Not many seem to have sorted out all subtle problems; I have
even heard rumours that even GPFS has issues in demanding
configurations.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05 20:57               ` Peter Grandi
@ 2009-04-05 23:55                 ` Leslie Rhorer
  2009-04-06 20:35                   ` jim owens
  0 siblings, 1 reply; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-05 23:55 UTC (permalink / raw)
  To: 'Linux RAID'

> [ ... ]
> 
> > Several, actually.  Since the RAID array kept crashing,
> 
> Someone who talks or thinks in vague terms like "RAID array kept
> crashing" is already on a bad path.

Into how much detail do you wish me to go?  I could write a small volume on
the various symptoms.  The array was taken offline numerous times due to
drives being disconnected or convicted as bad.  Usually I could recover the
array, but 3 times it proved to be completely unrecoverable.  After
replacing a convicted drive and placing it in another machine, destructive
diagnostics showed no problems.  This happened many times.

> > I had to re-create it numerous times. I tried ext3 more than
> > once, but the journal kept getting corrupted, and fixing it
> > lost several files.
> 
> Well, 'ext3' is *exceptionally* well tested, and this points to
> some problems with the storage system

It *WAS* the storage system, almost surely.  The first time ext3 was crashed
was when I performed an on-line RAID expansion while using a hardware RAID
controller.  Everything seemed to be fine after adding a drive, but the next
morning I could not write to the array.  I re-mounted the drive, and
everything seemed fine.  Fifteen minutes later, I could not write to the
array again.  After nosing around, I found the array was constantly trying
to seek beyond the end of the physical drive system when writing.  When I
tried to run fsck, it wouldn't let me because the journal inode was invalid
(I don't recall the exact error).  I converted to ext2, and once again ran
fsck.  It deleted and fixed a very large number of errors, and when the dust
settled, a number of newer files were lost.  During one of the numerous
array crashes, the journal got munched again.  This time, however, fsck was
able to recover all the errors without converting to ext2 and as far as I
could tell without losing any additional files.  I'm not saying ext3 cause
any of the problems, but it certainly allowed itself to be corrupted by
hardware issues.

> driver (e.g. use of some blob/binary proprietary driver in the
> kernel). In theory everything should work with everything else
> and everything should be bug free in every combination... In
> practice wise people try not to beg for trouble.
> 
> > Once I lost several important files during a RAID expansion.
> > In some cases I converted to ext2, and others I started out
> > with ext2, but last I checked, one cannot grow an ext2 file
> > system on the fly.
> 
> Modifying filesystem structure while the filesystem is operating
> is a very good way to beg for trouble. Especially if under load.

I am aware of the risk, but ext3 claims to be capable of the migration
(indeed I just did one on an LVM system employing ext3), and the RAID
controller has a very prominent utility for OLRE.  Taking the array offline
for 3 days every time I need to do an expansion is not a very thrilling
prospect.  If it were 3 or 4 hours, or even overnight...

> That something is *possible* does not mean that it is wise to
> rely on it.

I am aware of this, too.  I did not consider the option lightly.  At the
time, I did not have the money to put together a backup server, and having
the array offline for three days was not an attractive option.

> Good luck with that. Perhaps you need to think again about your
> requirements, and/or perhaps to get a much larger budget to do
> some research and development into "dynamic storage pools".

I would be happy to, as soon as someone offers me a large pay increase.  Are
you offering?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-05 23:55                 ` Leslie Rhorer
@ 2009-04-06 20:35                   ` jim owens
  2009-04-07 17:47                     ` Leslie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: jim owens @ 2009-04-06 20:35 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

Leslie Rhorer wrote:

> could tell without losing any additional files.  I'm not saying ext3 cause
> any of the problems, but it certainly allowed itself to be corrupted by
> hardware issues.

Some observations from a filesystem guy lurking on this list...

You won't find a filesystem that can't be corrupted by bad hardware.
Most devices lie these days and say the data is recorded when it
is not.  I have had drive firmware remap a bad sector but forget to
write the data to the spare sector. And the list goes on and on.

Most filesystems update some same set of common blocks to do a
create.  This is particularly true of journal filesystems like
reiserFS.  If the journal write stalls, everything else can hang
on a journaling fs.  And I've had drives that did their bad sector
remap by first running through multiple algorithms trying the
original sector before going to the spare on every access!

While I agree your symptoms sound more like a software problem, my
experience with enterprise raid arrays and drives says I would not
rule hardware out as the trigger for the problem.

That 20 minute hang sure sounds like an array ignoring the host.
With an enterprise array a 20 minute state like that is "normal"
and really makes us want to beat the storage guys severely.

As was pointed out, there is a block layer "plug" when a device
says "I'm busy".  That requires the FS to issue an "unplug", but
if a code path doesn't have it... hang until some other path is
taken that does do the unplug.

I suggest using blktrace to see what is happening between the
filesystem, block layer, and device.

As to your choice of ReiserFS.  I don't have personal experience
with it, but check Wikipedia.  What I do see is no developers
are actively working on it, which is not a good sign.  On the
other hand, ext3, JFS, and XFS all have active developers so
they are kept up to date with changes in the surrounding kernel.

But none of them will protect you from bad hardware.

jim

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-06 20:35                   ` jim owens
@ 2009-04-07 17:47                     ` Leslie Rhorer
  2009-04-07 18:18                       ` David Lethe
                                         ` (2 more replies)
  0 siblings, 3 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-07 17:47 UTC (permalink / raw)
  To: 'Linux RAID'

> > could tell without losing any additional files.  I'm not saying ext3
> cause
> > any of the problems, but it certainly allowed itself to be corrupted by
> > hardware issues.
> 
> Some observations from a filesystem guy lurking on this list...
> 
> You won't find a filesystem that can't be corrupted by bad hardware.

That's absolutely true, but a RAID array is supposed to be fault tolerant.
Now, I am well aware of the vast difference between fault tolerant and fault
proof, and I cannot begin to claim another file system would not have
suffered problems, but to my admittedly inexperienced (in the realm of ext3
and other Linux file systems) eye, a journal which thinks the device is
bigger than it really is after an array expansion causing a loss of data
seems pretty frail.  It's not like there was an actual array failure or any
number of bad blocks associated with the event.  It also left a  bit of a
bad taste in my mouth that fsck could not repair the issue until I converted
the system to ext2.

> Most filesystems update some same set of common blocks to do a
> create.  This is particularly true of journal filesystems like
> reiserFS.  If the journal write stalls, everything else can hang
> on a journaling fs.

Yes, I would expect that.  Read or write failures from those common blocks
- and nothing else - should not ordinarily be related to the volume of data
being read or written elsewhere on the array, however.  In other words, if
the common blocks are number 1000 - 2000, then reading and writing to blocks
10,000 and above should not cause the rate of failure of reads from blocks
1000 - 2000 to change.  Instead, what we see quite clearly in this case is
modest to high write and / or read rates in blocks 10,000 and above cause
file creation events on blocks 1000 - 2000 to fail, while low data rates do
not.  I also think it is probably significant that the journal is obviously
written to by both file creations and file writes, yet only creations cause
the failure.  Now if certain sections of the journal blocks are only for
file creation, then why do read-only data rates affect the issue at all?

> While I agree your symptoms sound more like a software problem, my
> experience with enterprise raid arrays and drives says I would not
> rule hardware out as the trigger for the problem.

Nor have I done so.  At this point, I haven't ruled out anything.  It's
taking better than a day to scan each drive using badblocks, so it's going
to be about 2 weeks before I have scanned all 10 drives.  AFAIK, the
badblocks routine itself has not triggered any read/write halts.

> That 20 minute hang sure sounds like an array ignoring the host.
> With an enterprise array a 20 minute state like that is "normal"
> and really makes us want to beat the storage guys severely.

I can't argue against that at this point, for certain.  What puzzles me
(among other things) is why do 5 of the drives show zero reads while 5 of
them show very low levels of read activity, and always the same 5 drives?
The main question, of course, is not so much what is happening, as why, and
of course how can it be avoided?

Fortunately the multi-minute hangs only occur once a month, when the array
is resyncing.  Even so, however, the nearly continuous 40 second hangs are
driving me mad.  I have a large number of videos to edit, and stretching
what should be a 7 minute manual process into 20 minutes 4 or 5 times a day
is getting old fast.

> As was pointed out, there is a block layer "plug" when a device
> says "I'm busy".  That requires the FS to issue an "unplug", but
> if a code path doesn't have it... hang until some other path is
> taken that does do the unplug.
> 
> I suggest using blktrace to see what is happening between the
> filesystem, block layer, and device.

Thanks!  I'll take a look after all the drives are scanned.

> But none of them will protect you from bad hardware.

No, of course not, but I believe I am pretty close to having a stable
hardware set.  Before that gathers any flames, let me hasten to say that in
no way means I am certain of it, or that I refuse to change out any suspect
hardware.  Changing out non-suspect hardware, however, is just a means of
introducing more possible failures into a system.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-07 17:47                     ` Leslie Rhorer
@ 2009-04-07 18:18                       ` David Lethe
  2009-04-08 14:17                         ` Leslie Rhorer
  2009-04-07 18:20                       ` Greg Freemyer
  2009-04-08  8:45                       ` John Robinson
  2 siblings, 1 reply; 84+ messages in thread
From: David Lethe @ 2009-04-07 18:18 UTC (permalink / raw)
  To: lrhorer, Linux RAID

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Leslie Rhorer
> Sent: Tuesday, April 07, 2009 12:47 PM
> To: 'Linux RAID'
> Subject: RE: RAID halting
> 
> > > could tell without losing any additional files.  I'm not saying
> ext3
> > cause
> > > any of the problems, but it certainly allowed itself to be
> corrupted by
> > > hardware issues.
> >
> > Some observations from a filesystem guy lurking on this list...
> >
> > You won't find a filesystem that can't be corrupted by bad hardware.
> 
> That's absolutely true, but a RAID array is supposed to be fault
> tolerant.
> Now, I am well aware of the vast difference between fault tolerant and
> fault
> proof, and I cannot begin to claim another file system would not have
> suffered problems, but to my admittedly inexperienced (in the realm of
> ext3
> and other Linux file systems) eye, a journal which thinks the device
is
> bigger than it really is after an array expansion causing a loss of
> data
> seems pretty frail.  It's not like there was an actual array failure
or
> any
> number of bad blocks associated with the event.  It also left a  bit
of
> a
> bad taste in my mouth that fsck could not repair the issue until I
> converted
> the system to ext2.
> 
> > Most filesystems update some same set of common blocks to do a
> > create.  This is particularly true of journal filesystems like
> > reiserFS.  If the journal write stalls, everything else can hang
> > on a journaling fs.
> 
> Yes, I would expect that.  Read or write failures from those common
> blocks
> - and nothing else - should not ordinarily be related to the volume of
> data
> being read or written elsewhere on the array, however.  In other
words,
> if
> the common blocks are number 1000 - 2000, then reading and writing to
> blocks
> 10,000 and above should not cause the rate of failure of reads from
> blocks
> 1000 - 2000 to change.  Instead, what we see quite clearly in this
case
> is
> modest to high write and / or read rates in blocks 10,000 and above
> cause
> file creation events on blocks 1000 - 2000 to fail, while low data
> rates do
> not.  I also think it is probably significant that the journal is
> obviously
> written to by both file creations and file writes, yet only creations
> cause
> the failure.  Now if certain sections of the journal blocks are only
> for
> file creation, then why do read-only data rates affect the issue at
> all?
> 
> > While I agree your symptoms sound more like a software problem, my
> > experience with enterprise raid arrays and drives says I would not
> > rule hardware out as the trigger for the problem.
> 
> Nor have I done so.  At this point, I haven't ruled out anything.
It's
> taking better than a day to scan each drive using badblocks, so it's
> going
> to be about 2 weeks before I have scanned all 10 drives.  AFAIK, the
> badblocks routine itself has not triggered any read/write halts.
> 
> > That 20 minute hang sure sounds like an array ignoring the host.
> > With an enterprise array a 20 minute state like that is "normal"
> > and really makes us want to beat the storage guys severely.
> 
> I can't argue against that at this point, for certain.  What puzzles
me
> (among other things) is why do 5 of the drives show zero reads while 5
> of
> them show very low levels of read activity, and always the same 5
> drives?
> The main question, of course, is not so much what is happening, as
why,
> and
> of course how can it be avoided?
> 
> Fortunately the multi-minute hangs only occur once a month, when the
> array
> is resyncing.  Even so, however, the nearly continuous 40 second hangs
> are
> driving me mad.  I have a large number of videos to edit, and
> stretching
> what should be a 7 minute manual process into 20 minutes 4 or 5 times
a
> day
> is getting old fast.
> 
> > As was pointed out, there is a block layer "plug" when a device
> > says "I'm busy".  That requires the FS to issue an "unplug", but
> > if a code path doesn't have it... hang until some other path is
> > taken that does do the unplug.
> >
> > I suggest using blktrace to see what is happening between the
> > filesystem, block layer, and device.
> 
> Thanks!  I'll take a look after all the drives are scanned.
> 
> > But none of them will protect you from bad hardware.
> 
> No, of course not, but I believe I am pretty close to having a stable
> hardware set.  Before that gathers any flames, let me hasten to say
> that in
> no way means I am certain of it, or that I refuse to change out any
> suspect
> hardware.  Changing out non-suspect hardware, however, is just a means
> of
> introducing more possible failures into a system.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sorry for probably being the one that started the flames .. It was
unprofessional of me, I apologize for getting on my soapbox.  Anyway,
the blkcheck routine isn't really the best way to do this.  There is a
low-level command that can be executed on the drives that runs directly
in the disk firmware, and it will create a list of bad blocks that can
be programmatically read.  The up-side is that this can run in
background of disks, and all can run concurrently with minor performance
hit. (The test can be killed without harming anything if you think
impact is too high). 

There are also some embedded self-test routines that do not do full
media scan, but check things like electronics, cache buffers, random
seek/reads, etc, even destructive write tests.

Big caveat however, the chip that your cheap SATA controller uses
probably blocks all these commands, as most SATA controllers use SAS
bridge chips that make disks appear to BIOS as SCSI devices, and they do
a protocol conversion which does command translation as well.  Vast
majority of bridge chips don't do this correctly.

If the disks are Seagate then the self-tests can be downloaded for free
as part of seatools, just boot the system into windows.   Some SATA
controller BIOSs have embedded self test support, and controllers such
as 3WARE and LSI have various build-in capability depending on firmware
and model number.  hdparm may support the ATA embedded self-tests, I
don't know.  Lots of windows shareware you can find that support
self-tests also.  I expect WD has some stuff on their site as well.
There are commercial products as well, and to be up-front, my company
has some, but if you can get something for free, use them instead.

There is also a SMART error log reporting mechanism.  The disks have
ability to report the last 5 commands that errored on each disk, which
includes timestamp, op code, input parameters and reason for error.
This runs instantly, and it is possible that just having these last 5
errors will tell you exactly what problem is. The errors are also
non-volatile, so you can even move each disk to a PC with a "regular"
ATA/SATA controller so that your bridge chip doesn't block the commands.
 
If you care to contact me offline then I will be happy to point you to
right direction. I owe you that.   

David @ santools. Com





^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-07 18:18                       ` David Lethe
@ 2009-04-08 14:17                         ` Leslie Rhorer
  2009-04-08 14:30                           ` David Lethe
                                             ` (2 more replies)
  0 siblings, 3 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-08 14:17 UTC (permalink / raw)
  To: 'Linux RAID'

> There is also a SMART error log reporting mechanism.  The disks have
> ability to report the last 5 commands that errored on each disk, which
> includes timestamp, op code, input parameters and reason for error.

I pulled the logs from the drives.  Four showed no errors whatsoever, and
three showed 1, 2, and 3 errors, respectively.  The others ranged from 408
to 1871.  I then triggered a halt with a mv command, and checked the logs
again.  The values all remain the same, with no new errors reported
whatsoever.

> This runs instantly, and it is possible that just having these last 5
> errors will tell you exactly what problem is. The errors are also

I presume a sector remapping would be caught by the log file, yes?  If so,
bad sector mapping is pretty much completely eliminated as a cause for this
issue.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-08 14:17                         ` Leslie Rhorer
@ 2009-04-08 14:30                           ` David Lethe
  2009-04-09  4:52                             ` Leslie Rhorer
  2009-04-08 14:37                           ` Greg Freemyer
  2009-04-08 18:04                           ` Corey Hickey
  2 siblings, 1 reply; 84+ messages in thread
From: David Lethe @ 2009-04-08 14:30 UTC (permalink / raw)
  To: lrhorer, Linux RAID

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Leslie Rhorer
> Sent: Wednesday, April 08, 2009 9:18 AM
> To: 'Linux RAID'
> Subject: RE: RAID halting
> 
> > There is also a SMART error log reporting mechanism.  The disks have
> > ability to report the last 5 commands that errored on each disk,
> which
> > includes timestamp, op code, input parameters and reason for error.
> 
> I pulled the logs from the drives.  Four showed no errors whatsoever,
> and
> three showed 1, 2, and 3 errors, respectively.  The others ranged from
> 408
> to 1871.  I then triggered a halt with a mv command, and checked the
> logs
> again.  The values all remain the same, with no new errors reported
> whatsoever.
> 
> > This runs instantly, and it is possible that just having these last
5
> > errors will tell you exactly what problem is. The errors are also
> 
> I presume a sector remapping would be caught by the log file, yes?  If
> so,
> bad sector mapping is pretty much completely eliminated as a cause for
> this
> issue.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

EXACTLY -- what are the errors .(Also a halt will not create an error in
the internal log of the disk.   Now, if you had cut power in middle of a
huge I/O, or read block n+1 on a disk that only had n blocks, then you
would create an error.


- David


^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-08 14:30                           ` David Lethe
@ 2009-04-09  4:52                             ` Leslie Rhorer
  2009-04-09  6:45                               ` David Lethe
  0 siblings, 1 reply; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-09  4:52 UTC (permalink / raw)
  To: 'Linux RAID'

> EXACTLY -- what are the errors .(Also a halt will not create an error in
> the internal log of the disk.   Now, if you had cut power in middle of a
> huge I/O, or read block n+1 on a disk that only had n blocks, then you
> would create an error.

No, but you are implying a cause / effect the other way around: errors on
the disk are causing the halts.  None of the evidence so far supports the
notion well at all.

I had several more halts today, and these results are from right now.

Drives /dev/sda, /dev/sde, /dev/sdf/ and /dev/sdg all remain without errors.

These drive models are:

sda  WD10EACS-00D6B0
sde  WD10EACS-00D6B0
sdf  WD10EACS-00D6B1
sdg  WD10EACS-00D6B1

Not surprisingly, these are the most recently purchased of the set (early
November). 

The one odd Hitachi (sdh  HUA721010KLA330) was powered up in mid-January
2008, and the other five were all powered up in mid-December 2007.  This
places the last errors on any of the drives previous to mid-December 2008,
which is when the system was removed from the old chassis.  It's also not at
all surprising there were errors before the drives were removed from the old
chassis.  By these logs, there hasn't been an error reported by SMART on any
of these drives in over 3 months.


sdi  HDS721010KLA330

ATA Error Count: 1

Error 1 occurred at disk power-on lifetime: 8442 hours (351 days + 18 hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 2b 8e 40

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 90 00 2c 8e 40 08  12d+03:47:37.400  READ FPDMA QUEUED
  60 00 78 00 2b 8e 40 08  12d+03:47:37.400  READ FPDMA QUEUED
  60 00 30 00 2a 8e 40 08  12d+03:47:37.400  READ FPDMA QUEUED
  60 d0 18 30 29 8e 40 08  12d+03:47:37.400  READ FPDMA QUEUED
  60 08 10 28 29 8e 40 08  12d+03:47:37.400  READ FPDMA QUEUED


sdh  HUA721010KLA330

ATA Error Count: 2

Error 2 occurred at disk power-on lifetime: 7051 hours (293 days + 19 hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 c0 40 41 3d 4a

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 f8 00 08 42 3d 40 08   6d+00:00:26.900  READ FPDMA QUEUED
  60 08 08 00 42 3d 40 08   6d+00:00:26.900  READ FPDMA QUEUED
  60 00 b8 00 41 3d 40 08   6d+00:00:26.900  READ FPDMA QUEUED
  60 00 38 00 40 3d 40 08   6d+00:00:26.900  READ FPDMA QUEUED
  60 f0 d8 10 3f 3d 40 08   6d+00:00:26.900  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 6874 hours (286 days + 10 hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 f0 0f 44 54 45

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 10 88 f0 43 54 40 08   1d+13:20:47.600  WRITE FPDMA QUEUED
  61 c8 78 28 43 54 40 08   1d+13:20:47.600  WRITE FPDMA QUEUED
  61 88 68 a0 41 54 40 08   1d+13:20:47.500  WRITE FPDMA QUEUED
  61 58 60 40 41 54 40 08   1d+13:20:47.500  WRITE FPDMA QUEUED
  61 10 08 30 41 54 40 08   1d+13:20:47.500  WRITE FPDMA QUEUED


sdj  HDS721010KLA330

ATA Error Count: 3

Error 3 occurred at disk power-on lifetime: 8133 hours (338 days + 21 hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 80 80 2a 8e 40

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 a0 00 2c 8e 40 08  12d+03:47:39.300  READ FPDMA QUEUED
  60 00 88 00 2b 8e 40 08  12d+03:47:39.300  READ FPDMA QUEUED
  60 00 40 00 2a 8e 40 08  12d+03:47:39.300  READ FPDMA QUEUED
  60 d0 28 30 29 8e 40 08  12d+03:47:39.300  READ FPDMA QUEUED
  60 08 00 28 29 8e 40 08  12d+03:47:39.300  READ FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 7675 hours (319 days + 19 hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 c8 08 59 3e 41

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 d8 00 f8 58 3e 40 08   2d+03:42:43.800  READ FPDMA QUEUED
  60 08 60 f0 58 3e 40 08   2d+03:42:43.800  READ FPDMA QUEUED
  60 08 58 e8 58 3e 40 08   2d+03:42:43.800  READ FPDMA QUEUED
  60 08 50 e0 58 3e 40 08   2d+03:42:43.800  READ FPDMA QUEUED
  60 08 28 d8 58 3e 40 08   2d+03:42:43.800  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 7673 hours (319 days + 17 hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 28 d7 97 4e 40

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 68 60 98 97 4e 40 08   2d+01:26:53.900  READ FPDMA QUEUED
  61 00 58 00 81 ff 40 08   2d+01:26:53.800  WRITE FPDMA QUEUED
  61 10 10 00 80 fe 40 08   2d+01:26:53.800  WRITE FPDMA QUEUED
  61 f0 08 10 80 ff 40 08   2d+01:26:53.800  WRITE FPDMA QUEUED
  61 68 00 98 01 ff 40 08   2d+01:26:53.800  WRITE FPDMA QUEUED


sdc  HDS721010KLA330

ATA Error Count: 408 (device log contains only the most recent five errors)

Error 408 occurred at disk power-on lifetime: 8426 hours (351 days + 2
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 08 8f 87 d6 43

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 88 00 10 87 d6 40 08  12d+01:59:16.500  READ FPDMA QUEUED
  60 08 00 08 87 d6 40 08  12d+01:59:16.500  READ FPDMA QUEUED
  60 08 18 00 87 d6 40 08  12d+01:59:16.500  READ FPDMA QUEUED
  60 a8 10 58 86 d6 40 08  12d+01:59:16.500  READ FPDMA QUEUED
  60 58 00 00 86 d6 40 08  12d+01:59:16.500  READ FPDMA QUEUED

Error 407 occurred at disk power-on lifetime: 8426 hours (351 days + 2
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 d0 30 c3 a4 42

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 20 00 c5 a4 40 08  12d+01:47:43.600  READ FPDMA QUEUED
  60 00 18 00 c4 a4 40 08  12d+01:47:43.600  READ FPDMA QUEUED
  60 00 10 00 c3 a4 40 08  12d+01:47:43.600  READ FPDMA QUEUED
  60 b8 08 48 c2 a4 40 08  12d+01:47:43.600  READ FPDMA QUEUED
  60 48 00 00 c2 a4 40 08  12d+01:47:43.600  READ FPDMA QUEUED

Error 406 occurred at disk power-on lifetime: 8424 hours (351 days + 0
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 50 b0 8a c2 48

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 18 18 00 8b c2 40 08  12d+00:12:53.000  READ FPDMA QUEUED
  60 00 10 00 8a c2 40 08  12d+00:12:53.000  READ FPDMA QUEUED
  60 00 08 00 89 c2 40 08  12d+00:12:53.000  READ FPDMA QUEUED
  60 00 00 00 88 c2 40 08  12d+00:12:53.000  READ FPDMA QUEUED
  60 00 18 00 87 c2 40 08  12d+00:12:53.000  READ FPDMA QUEUED

Error 405 occurred at disk power-on lifetime: 8424 hours (351 days + 0
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 e0 1f 6a ec 46

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 30 10 00 6b ec 40 08  11d+23:53:47.200  READ FPDMA QUEUED
  60 00 08 00 6a ec 40 08  11d+23:53:47.200  READ FPDMA QUEUED
  60 f0 00 10 69 ec 40 08  11d+23:53:47.200  READ FPDMA QUEUED
  60 10 18 00 69 ec 40 08  11d+23:53:47.200  READ FPDMA QUEUED
  60 00 10 00 68 ec 40 08  11d+23:53:47.200  READ FPDMA QUEUED

Error 404 occurred at disk power-on lifetime: 8423 hours (350 days + 23
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 10 ef 19 e6 43

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 e0 00 20 19 e6 40 08  11d+23:13:38.800  READ FPDMA QUEUED
  60 20 20 00 19 e6 40 08  11d+23:13:38.800  READ FPDMA QUEUED
  60 00 18 00 18 e6 40 08  11d+23:13:38.800  READ FPDMA QUEUED
  60 e0 10 20 17 e6 40 08  11d+23:13:38.800  READ FPDMA QUEUED
  60 20 08 00 17 e6 40 08  11d+23:13:38.800  READ FPDMA QUEUED


sdd  HDS721010KLA330

ATA Error Count: 679 (device log contains only the most recent five errors)

Error 679 occurred at disk power-on lifetime: 8717 hours (363 days + 5
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 31 4f 63 87 4d

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 38 40 63 87 40 08  23d+22:03:52.500  WRITE FPDMA QUEUED
  61 c0 08 80 62 87 40 08  23d+22:03:52.500  WRITE FPDMA QUEUED
  61 f0 28 90 61 87 40 08  23d+22:03:52.500  WRITE FPDMA QUEUED
  61 88 20 00 61 87 40 08  23d+22:03:52.500  WRITE FPDMA QUEUED
  61 08 08 f8 5d 87 40 08  23d+22:03:52.500  WRITE FPDMA QUEUED

Error 678 occurred at disk power-on lifetime: 8717 hours (363 days + 5
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 40 90 48 1c 47

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 f0 28 90 4f 1c 40 08  23d+21:59:42.400  WRITE FPDMA QUEUED
  61 50 20 80 48 1c 40 08  23d+21:59:42.400  WRITE FPDMA QUEUED
  60 40 48 08 bb 72 40 08  23d+21:59:42.400  READ FPDMA QUEUED
  61 10 40 80 4e 1c 40 08  23d+21:59:42.400  WRITE FPDMA QUEUED
  61 78 30 08 4b 1c 40 08  23d+21:59:42.400  WRITE FPDMA QUEUED

Error 677 occurred at disk power-on lifetime: 8717 hours (363 days + 5
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 58 80 f1 8d 46

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 78 58 08 00 8e 40 08  23d+21:59:17.300  WRITE FPDMA QUEUED
  61 20 10 e0 f1 8d 40 08  23d+21:59:17.300  WRITE FPDMA QUEUED
  61 58 08 80 f0 8d 40 08  23d+21:59:17.300  WRITE FPDMA QUEUED
  61 08 00 60 ea 8d 40 08  23d+21:59:17.300  WRITE FPDMA QUEUED
  60 78 08 80 ed 8d 40 08  23d+21:59:17.300  READ FPDMA QUEUED

Error 676 occurred at disk power-on lifetime: 8717 hours (363 days + 5
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 70 b0 1c de 46

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 68 20 61 70 40 08  23d+21:58:42.100  READ FPDMA QUEUED
  61 08 48 f8 1e de 40 08  23d+21:58:42.100  WRITE FPDMA QUEUED
  61 d0 40 20 1d de 40 08  23d+21:58:42.100  WRITE FPDMA QUEUED
  61 a0 20 80 1c de 40 08  23d+21:58:42.100  WRITE FPDMA QUEUED
  61 80 08 00 1b de 40 08  23d+21:58:42.100  WRITE FPDMA QUEUED

Error 675 occurred at disk power-on lifetime: 8717 hours (363 days + 5
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 10 f0 03 de 46

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 88 c0 03 de 40 08  23d+21:58:41.100  WRITE FPDMA QUEUED
  61 28 80 00 e8 dd 40 08  23d+21:58:41.100  WRITE FPDMA QUEUED
  61 08 48 b8 03 de 40 08  23d+21:58:41.100  WRITE FPDMA QUEUED
  61 30 40 80 03 de 40 08  23d+21:58:41.100  WRITE FPDMA QUEUED
  61 10 38 68 03 de 40 08  23d+21:58:41.100  WRITE FPDMA QUEUED


sdb  HDS721010KLA330

ATA Error Count: 1871 (device log contains only the most recent five errors)

Error 1871 occurred at disk power-on lifetime: 8455 hours (352 days + 7
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 b1 57 f6 46 e4  Error: ICRC, ABRT 177 sectors at LBA = 0x0446f657 =
71759447

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 00 08 f3 46 e0 08  11d+11:10:29.700  WRITE DMA EXT
  35 00 08 00 f3 46 e0 08  11d+11:10:29.700  WRITE DMA EXT
  35 00 00 00 f0 46 e0 08  11d+11:10:29.600  WRITE DMA EXT
  35 00 00 00 ef 46 e0 08  11d+11:10:29.600  WRITE DMA EXT
  35 00 00 00 ee 46 e0 08  11d+11:10:29.600  WRITE DMA EXT

Error 1870 occurred at disk power-on lifetime: 8455 hours (352 days + 7
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 69 cf b6 dd e3  Error: ICRC, ABRT 105 sectors at LBA = 0x03ddb6cf =
64861903

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 d8 60 b6 dd e0 08  11d+11:04:15.600  WRITE DMA EXT
  35 00 08 58 b6 dd e0 08  11d+11:04:15.600  WRITE DMA EXT
  35 00 d0 88 b5 dd e0 08  11d+11:04:15.600  WRITE DMA EXT
  35 00 b0 d8 b3 dd e0 08  11d+11:04:15.500  WRITE DMA EXT
  35 00 50 88 b3 dd e0 08  11d+11:04:15.500  WRITE DMA EXT

Error 1869 occurred at disk power-on lifetime: 8455 hours (352 days + 7
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 71 8f d9 dc e3  Error: ICRC, ABRT 113 sectors at LBA = 0x03dcd98f =
64805263

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 00 00 d8 dc e0 08  11d+11:04:12.300  WRITE DMA EXT
  35 00 00 00 d7 dc e0 08  11d+11:04:12.300  WRITE DMA EXT
  35 00 00 00 d6 dc e0 08  11d+11:04:12.300  WRITE DMA EXT
  35 00 00 00 d4 dc e0 08  11d+11:04:12.200  WRITE DMA EXT
  35 00 00 00 d0 dc e0 08  11d+11:04:12.200  WRITE DMA EXT

Error 1868 occurred at disk power-on lifetime: 8455 hours (352 days + 7
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 09 ff 24 bc e3  Error: ICRC, ABRT 9 sectors at LBA = 0x03bc24ff =
62661887

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 08 00 24 bc e0 08  11d+11:02:16.600  WRITE DMA EXT
  35 00 00 00 20 bc e0 08  11d+11:02:16.600  WRITE DMA EXT
  35 00 f8 08 1f bc e0 08  11d+11:02:16.500  WRITE DMA EXT
  35 00 08 00 1f bc e0 08  11d+11:02:16.500  WRITE DMA EXT
  35 00 00 00 1c bc e0 08  11d+11:02:16.500  WRITE DMA EXT

Error 1867 occurred at disk power-on lifetime: 8455 hours (352 days + 7
hours)
  When the command that caused the error occurred, the device was active or
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 10 f0 fd 94 e3  Error: ICRC, ABRT 16 sectors at LBA = 0x0394fdf0 =
60095984

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 00 00 fb 94 e0 08  11d+10:59:58.100  WRITE DMA EXT
  35 00 f8 08 fa 94 e0 08  11d+10:59:58.100  WRITE DMA EXT
  35 00 08 00 fa 94 e0 08  11d+10:59:58.100  WRITE DMA EXT
  35 00 00 00 f8 94 e0 08  11d+10:59:58.000  WRITE DMA EXT
  35 00 00 00 f4 94 e0 08  11d+10:59:58.000  WRITE DMA EXT


^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-09  4:52                             ` Leslie Rhorer
@ 2009-04-09  6:45                               ` David Lethe
  0 siblings, 0 replies; 84+ messages in thread
From: David Lethe @ 2009-04-09  6:45 UTC (permalink / raw)
  To: lrhorer, Linux RAID



> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Leslie Rhorer
> Sent: Wednesday, April 08, 2009 11:53 PM
> To: 'Linux RAID'
> Subject: RE: RAID halting
> 
> > EXACTLY -- what are the errors .(Also a halt will not create an
error
> in
> > the internal log of the disk.   Now, if you had cut power in middle
> of a
> > huge I/O, or read block n+1 on a disk that only had n blocks, then
> you
> > would create an error.
> 
> No, but you are implying a cause / effect the other way around: errors
> on
> the disk are causing the halts.  None of the evidence so far supports
> the
> notion well at all.
> 
> I had several more halts today, and these results are from right now.
> 
> Drives /dev/sda, /dev/sde, /dev/sdf/ and /dev/sdg all remain without
> errors.
> 
> These drive models are:
> 
> sda  WD10EACS-00D6B0
> sde  WD10EACS-00D6B0
> sdf  WD10EACS-00D6B1
> sdg  WD10EACS-00D6B1
> 
> Not surprisingly, these are the most recently purchased of the set
> (early
> November).
> 
> The one odd Hitachi (sdh  HUA721010KLA330) was powered up in mid-
> January
> 2008, and the other five were all powered up in mid-December 2007.
> This
> places the last errors on any of the drives previous to mid-December
> 2008,
> which is when the system was removed from the old chassis.  It's also
> not at
> all surprising there were errors before the drives were removed from
> the old
> chassis.  By these logs, there hasn't been an error reported by SMART
> on any
> of these drives in over 3 months.
> 
> 
> sdi  HDS721010KLA330
> 
> ATA Error Count: 1
> 
> Error 1 occurred at disk power-on lifetime: 8442 hours (351 days + 18
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 00 00 2b 8e 40
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 08 90 00 2c 8e 40 08  12d+03:47:37.400  READ FPDMA QUEUED
>   60 00 78 00 2b 8e 40 08  12d+03:47:37.400  READ FPDMA QUEUED
>   60 00 30 00 2a 8e 40 08  12d+03:47:37.400  READ FPDMA QUEUED
>   60 d0 18 30 29 8e 40 08  12d+03:47:37.400  READ FPDMA QUEUED
>   60 08 10 28 29 8e 40 08  12d+03:47:37.400  READ FPDMA QUEUED
> 
> 
> sdh  HUA721010KLA330
> 
> ATA Error Count: 2
> 
> Error 2 occurred at disk power-on lifetime: 7051 hours (293 days + 19
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 c0 40 41 3d 4a
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 f8 00 08 42 3d 40 08   6d+00:00:26.900  READ FPDMA QUEUED
>   60 08 08 00 42 3d 40 08   6d+00:00:26.900  READ FPDMA QUEUED
>   60 00 b8 00 41 3d 40 08   6d+00:00:26.900  READ FPDMA QUEUED
>   60 00 38 00 40 3d 40 08   6d+00:00:26.900  READ FPDMA QUEUED
>   60 f0 d8 10 3f 3d 40 08   6d+00:00:26.900  READ FPDMA QUEUED
> 
> Error 1 occurred at disk power-on lifetime: 6874 hours (286 days + 10
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 f0 0f 44 54 45
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   61 10 88 f0 43 54 40 08   1d+13:20:47.600  WRITE FPDMA QUEUED
>   61 c8 78 28 43 54 40 08   1d+13:20:47.600  WRITE FPDMA QUEUED
>   61 88 68 a0 41 54 40 08   1d+13:20:47.500  WRITE FPDMA QUEUED
>   61 58 60 40 41 54 40 08   1d+13:20:47.500  WRITE FPDMA QUEUED
>   61 10 08 30 41 54 40 08   1d+13:20:47.500  WRITE FPDMA QUEUED
> 
> 
> sdj  HDS721010KLA330
> 
> ATA Error Count: 3
> 
> Error 3 occurred at disk power-on lifetime: 8133 hours (338 days + 21
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 80 80 2a 8e 40
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 08 a0 00 2c 8e 40 08  12d+03:47:39.300  READ FPDMA QUEUED
>   60 00 88 00 2b 8e 40 08  12d+03:47:39.300  READ FPDMA QUEUED
>   60 00 40 00 2a 8e 40 08  12d+03:47:39.300  READ FPDMA QUEUED
>   60 d0 28 30 29 8e 40 08  12d+03:47:39.300  READ FPDMA QUEUED
>   60 08 00 28 29 8e 40 08  12d+03:47:39.300  READ FPDMA QUEUED
> 
> Error 2 occurred at disk power-on lifetime: 7675 hours (319 days + 19
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 c8 08 59 3e 41
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 d8 00 f8 58 3e 40 08   2d+03:42:43.800  READ FPDMA QUEUED
>   60 08 60 f0 58 3e 40 08   2d+03:42:43.800  READ FPDMA QUEUED
>   60 08 58 e8 58 3e 40 08   2d+03:42:43.800  READ FPDMA QUEUED
>   60 08 50 e0 58 3e 40 08   2d+03:42:43.800  READ FPDMA QUEUED
>   60 08 28 d8 58 3e 40 08   2d+03:42:43.800  READ FPDMA QUEUED
> 
> Error 1 occurred at disk power-on lifetime: 7673 hours (319 days + 17
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 28 d7 97 4e 40
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 68 60 98 97 4e 40 08   2d+01:26:53.900  READ FPDMA QUEUED
>   61 00 58 00 81 ff 40 08   2d+01:26:53.800  WRITE FPDMA QUEUED
>   61 10 10 00 80 fe 40 08   2d+01:26:53.800  WRITE FPDMA QUEUED
>   61 f0 08 10 80 ff 40 08   2d+01:26:53.800  WRITE FPDMA QUEUED
>   61 68 00 98 01 ff 40 08   2d+01:26:53.800  WRITE FPDMA QUEUED
> 
> 
> sdc  HDS721010KLA330
> 
> ATA Error Count: 408 (device log contains only the most recent five
> errors)
> 
> Error 408 occurred at disk power-on lifetime: 8426 hours (351 days + 2
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 08 8f 87 d6 43
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 88 00 10 87 d6 40 08  12d+01:59:16.500  READ FPDMA QUEUED
>   60 08 00 08 87 d6 40 08  12d+01:59:16.500  READ FPDMA QUEUED
>   60 08 18 00 87 d6 40 08  12d+01:59:16.500  READ FPDMA QUEUED
>   60 a8 10 58 86 d6 40 08  12d+01:59:16.500  READ FPDMA QUEUED
>   60 58 00 00 86 d6 40 08  12d+01:59:16.500  READ FPDMA QUEUED
> 
> Error 407 occurred at disk power-on lifetime: 8426 hours (351 days + 2
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 d0 30 c3 a4 42
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 08 20 00 c5 a4 40 08  12d+01:47:43.600  READ FPDMA QUEUED
>   60 00 18 00 c4 a4 40 08  12d+01:47:43.600  READ FPDMA QUEUED
>   60 00 10 00 c3 a4 40 08  12d+01:47:43.600  READ FPDMA QUEUED
>   60 b8 08 48 c2 a4 40 08  12d+01:47:43.600  READ FPDMA QUEUED
>   60 48 00 00 c2 a4 40 08  12d+01:47:43.600  READ FPDMA QUEUED
> 
> Error 406 occurred at disk power-on lifetime: 8424 hours (351 days + 0
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 50 b0 8a c2 48
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 18 18 00 8b c2 40 08  12d+00:12:53.000  READ FPDMA QUEUED
>   60 00 10 00 8a c2 40 08  12d+00:12:53.000  READ FPDMA QUEUED
>   60 00 08 00 89 c2 40 08  12d+00:12:53.000  READ FPDMA QUEUED
>   60 00 00 00 88 c2 40 08  12d+00:12:53.000  READ FPDMA QUEUED
>   60 00 18 00 87 c2 40 08  12d+00:12:53.000  READ FPDMA QUEUED
> 
> Error 405 occurred at disk power-on lifetime: 8424 hours (351 days + 0
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 e0 1f 6a ec 46
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 30 10 00 6b ec 40 08  11d+23:53:47.200  READ FPDMA QUEUED
>   60 00 08 00 6a ec 40 08  11d+23:53:47.200  READ FPDMA QUEUED
>   60 f0 00 10 69 ec 40 08  11d+23:53:47.200  READ FPDMA QUEUED
>   60 10 18 00 69 ec 40 08  11d+23:53:47.200  READ FPDMA QUEUED
>   60 00 10 00 68 ec 40 08  11d+23:53:47.200  READ FPDMA QUEUED
> 
> Error 404 occurred at disk power-on lifetime: 8423 hours (350 days +
23
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 10 ef 19 e6 43
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 e0 00 20 19 e6 40 08  11d+23:13:38.800  READ FPDMA QUEUED
>   60 20 20 00 19 e6 40 08  11d+23:13:38.800  READ FPDMA QUEUED
>   60 00 18 00 18 e6 40 08  11d+23:13:38.800  READ FPDMA QUEUED
>   60 e0 10 20 17 e6 40 08  11d+23:13:38.800  READ FPDMA QUEUED
>   60 20 08 00 17 e6 40 08  11d+23:13:38.800  READ FPDMA QUEUED
> 
> 
> sdd  HDS721010KLA330
> 
> ATA Error Count: 679 (device log contains only the most recent five
> errors)
> 
> Error 679 occurred at disk power-on lifetime: 8717 hours (363 days + 5
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 31 4f 63 87 4d
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   61 40 38 40 63 87 40 08  23d+22:03:52.500  WRITE FPDMA QUEUED
>   61 c0 08 80 62 87 40 08  23d+22:03:52.500  WRITE FPDMA QUEUED
>   61 f0 28 90 61 87 40 08  23d+22:03:52.500  WRITE FPDMA QUEUED
>   61 88 20 00 61 87 40 08  23d+22:03:52.500  WRITE FPDMA QUEUED
>   61 08 08 f8 5d 87 40 08  23d+22:03:52.500  WRITE FPDMA QUEUED
> 
> Error 678 occurred at disk power-on lifetime: 8717 hours (363 days + 5
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 40 90 48 1c 47
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   61 f0 28 90 4f 1c 40 08  23d+21:59:42.400  WRITE FPDMA QUEUED
>   61 50 20 80 48 1c 40 08  23d+21:59:42.400  WRITE FPDMA QUEUED
>   60 40 48 08 bb 72 40 08  23d+21:59:42.400  READ FPDMA QUEUED
>   61 10 40 80 4e 1c 40 08  23d+21:59:42.400  WRITE FPDMA QUEUED
>   61 78 30 08 4b 1c 40 08  23d+21:59:42.400  WRITE FPDMA QUEUED
> 
> Error 677 occurred at disk power-on lifetime: 8717 hours (363 days + 5
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 58 80 f1 8d 46
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   61 78 58 08 00 8e 40 08  23d+21:59:17.300  WRITE FPDMA QUEUED
>   61 20 10 e0 f1 8d 40 08  23d+21:59:17.300  WRITE FPDMA QUEUED
>   61 58 08 80 f0 8d 40 08  23d+21:59:17.300  WRITE FPDMA QUEUED
>   61 08 00 60 ea 8d 40 08  23d+21:59:17.300  WRITE FPDMA QUEUED
>   60 78 08 80 ed 8d 40 08  23d+21:59:17.300  READ FPDMA QUEUED
> 
> Error 676 occurred at disk power-on lifetime: 8717 hours (363 days + 5
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 70 b0 1c de 46
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 08 68 20 61 70 40 08  23d+21:58:42.100  READ FPDMA QUEUED
>   61 08 48 f8 1e de 40 08  23d+21:58:42.100  WRITE FPDMA QUEUED
>   61 d0 40 20 1d de 40 08  23d+21:58:42.100  WRITE FPDMA QUEUED
>   61 a0 20 80 1c de 40 08  23d+21:58:42.100  WRITE FPDMA QUEUED
>   61 80 08 00 1b de 40 08  23d+21:58:42.100  WRITE FPDMA QUEUED
> 
> Error 675 occurred at disk power-on lifetime: 8717 hours (363 days + 5
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 10 f0 03 de 46
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   61 40 88 c0 03 de 40 08  23d+21:58:41.100  WRITE FPDMA QUEUED
>   61 28 80 00 e8 dd 40 08  23d+21:58:41.100  WRITE FPDMA QUEUED
>   61 08 48 b8 03 de 40 08  23d+21:58:41.100  WRITE FPDMA QUEUED
>   61 30 40 80 03 de 40 08  23d+21:58:41.100  WRITE FPDMA QUEUED
>   61 10 38 68 03 de 40 08  23d+21:58:41.100  WRITE FPDMA QUEUED
> 
> 
> sdb  HDS721010KLA330
> 
> ATA Error Count: 1871 (device log contains only the most recent five
> errors)
> 
> Error 1871 occurred at disk power-on lifetime: 8455 hours (352 days +
7
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 b1 57 f6 46 e4  Error: ICRC, ABRT 177 sectors at LBA =
> 0x0446f657 =
> 71759447
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   35 00 00 08 f3 46 e0 08  11d+11:10:29.700  WRITE DMA EXT
>   35 00 08 00 f3 46 e0 08  11d+11:10:29.700  WRITE DMA EXT
>   35 00 00 00 f0 46 e0 08  11d+11:10:29.600  WRITE DMA EXT
>   35 00 00 00 ef 46 e0 08  11d+11:10:29.600  WRITE DMA EXT
>   35 00 00 00 ee 46 e0 08  11d+11:10:29.600  WRITE DMA EXT
> 
> Error 1870 occurred at disk power-on lifetime: 8455 hours (352 days +
7
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 69 cf b6 dd e3  Error: ICRC, ABRT 105 sectors at LBA =
> 0x03ddb6cf =
> 64861903
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   35 00 d8 60 b6 dd e0 08  11d+11:04:15.600  WRITE DMA EXT
>   35 00 08 58 b6 dd e0 08  11d+11:04:15.600  WRITE DMA EXT
>   35 00 d0 88 b5 dd e0 08  11d+11:04:15.600  WRITE DMA EXT
>   35 00 b0 d8 b3 dd e0 08  11d+11:04:15.500  WRITE DMA EXT
>   35 00 50 88 b3 dd e0 08  11d+11:04:15.500  WRITE DMA EXT
> 
> Error 1869 occurred at disk power-on lifetime: 8455 hours (352 days +
7
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 71 8f d9 dc e3  Error: ICRC, ABRT 113 sectors at LBA =
> 0x03dcd98f =
> 64805263
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   35 00 00 00 d8 dc e0 08  11d+11:04:12.300  WRITE DMA EXT
>   35 00 00 00 d7 dc e0 08  11d+11:04:12.300  WRITE DMA EXT
>   35 00 00 00 d6 dc e0 08  11d+11:04:12.300  WRITE DMA EXT
>   35 00 00 00 d4 dc e0 08  11d+11:04:12.200  WRITE DMA EXT
>   35 00 00 00 d0 dc e0 08  11d+11:04:12.200  WRITE DMA EXT
> 
> Error 1868 occurred at disk power-on lifetime: 8455 hours (352 days +
7
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 09 ff 24 bc e3  Error: ICRC, ABRT 9 sectors at LBA =
0x03bc24ff
> =
> 62661887
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   35 00 08 00 24 bc e0 08  11d+11:02:16.600  WRITE DMA EXT
>   35 00 00 00 20 bc e0 08  11d+11:02:16.600  WRITE DMA EXT
>   35 00 f8 08 1f bc e0 08  11d+11:02:16.500  WRITE DMA EXT
>   35 00 08 00 1f bc e0 08  11d+11:02:16.500  WRITE DMA EXT
>   35 00 00 00 1c bc e0 08  11d+11:02:16.500  WRITE DMA EXT
> 
> Error 1867 occurred at disk power-on lifetime: 8455 hours (352 days +
7
> hours)
>   When the command that caused the error occurred, the device was
> active or
> idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   84 51 10 f0 fd 94 e3  Error: ICRC, ABRT 16 sectors at LBA =
> 0x0394fdf0 =
> 60095984
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   35 00 00 00 fb 94 e0 08  11d+10:59:58.100  WRITE DMA EXT
>   35 00 f8 08 fa 94 e0 08  11d+10:59:58.100  WRITE DMA EXT
>   35 00 08 00 fa 94 e0 08  11d+10:59:58.100  WRITE DMA EXT
>   35 00 00 00 f8 94 e0 08  11d+10:59:58.000  WRITE DMA EXT
>   35 00 00 00 f4 94 e0 08  11d+10:59:58.000  WRITE DMA EXT
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
Well, there you go.  Problem isn't physical (unless you want to count
firmware bugs).
I wouldn't worry about running block check programs, because if any of
the errors were due to bad blocks, then they would have shown up in the
log.

The disks are doing exactly what they are supposed to do, and the disks
themselves are not timing out. Unless it is disk firmware related, then
you can at least stop spending time looking at hardware.




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-08 14:17                         ` Leslie Rhorer
  2009-04-08 14:30                           ` David Lethe
@ 2009-04-08 14:37                           ` Greg Freemyer
  2009-04-08 16:29                             ` Andrew Burgess
  2009-04-08 18:04                           ` Corey Hickey
  2 siblings, 1 reply; 84+ messages in thread
From: Greg Freemyer @ 2009-04-08 14:37 UTC (permalink / raw)
  To: lrhorer; +Cc: Linux RAID

On Wed, Apr 8, 2009 at 10:17 AM, Leslie Rhorer <lrhorer@satx.rr.com> wrote:
>> There is also a SMART error log reporting mechanism.  The disks have
>> ability to report the last 5 commands that errored on each disk, which
>> includes timestamp, op code, input parameters and reason for error.
>
> I pulled the logs from the drives.  Four showed no errors whatsoever, and
> three showed 1, 2, and 3 errors, respectively.  The others ranged from 408
> to 1871.  I then triggered a halt with a mv command, and checked the logs
> again.  The values all remain the same, with no new errors reported
> whatsoever.
>
>> This runs instantly, and it is possible that just having these last 5
>> errors will tell you exactly what problem is. The errors are also
>
> I presume a sector remapping would be caught by the log file, yes?  If so,
> bad sector mapping is pretty much completely eliminated as a cause for this
> issue.

Leslie,

The next thing I would try to eliminate is the kernel i/o scheduler.
I assume you're using the default one (CFQ).

The noop one is far less complex and in particular I believe it does
not have all the "plugging" logic that CFQ has.

Plugging is used to ensure data is held back from the disk until all
of it is in the block layer buffers together, thus allowing
coalescing, and seek sorting (elevator logic) to most efficiently take
place.  The trouble is that if a unplug call is missed, the i/o can
stall waiting for it.  I don't know if it can cause a total i/o stall
like your seeing or not.

Since the noop scheduler does not do any of this fancyness, you
eliminate one the big potential causes of your problem by this test.
And since you said throughput is not a major concern for you, I would
leave the noop scheduler in place for the long run if it gives
satisfactory results.

I think the only way to switch schedulers is to reboot.  You invoke
the noop scheduler via a command line arg to the kernel at boot time.

You said your using debian, I don't know if they have a gui tool to
make that change for you or not.  Alternatively you can edit
/boot/grub/menu.lst manually.  (Assuming your using grub.)

Greg
-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-08 14:37                           ` Greg Freemyer
@ 2009-04-08 16:29                             ` Andrew Burgess
  2009-04-09  3:24                               ` Leslie Rhorer
  2009-04-10  3:02                               ` Leslie Rhorer
  0 siblings, 2 replies; 84+ messages in thread
From: Andrew Burgess @ 2009-04-08 16:29 UTC (permalink / raw)
  To: Greg Freemyer, linux raid mailing list

On Wed, 2009-04-08 at 10:37 -0400, Greg Freemyer wrote:

> I think the only way to switch schedulers is to reboot.  You invoke
> the noop scheduler via a command line arg to the kernel at boot time.

This can be done at any time, no reboot required:

 for f in /sys/block/*/queue/scheduler; do
    echo noop > $f
    echo $f "$(cat $f)"
  done

Just FYI, I don't think ionice and friends will work when using noop.

Leslie: I still think finding out what the kernel is doing during the
stall would be a HUGE hint to the problem. Did you look into oprofile or
ftrace?



^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-08 16:29                             ` Andrew Burgess
@ 2009-04-09  3:24                               ` Leslie Rhorer
  2009-04-10  3:02                               ` Leslie Rhorer
  1 sibling, 0 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-09  3:24 UTC (permalink / raw)
  To: 'Linux RAID'

> Leslie: I still think finding out what the kernel is doing during the
> stall would be a HUGE hint to the problem. Did you look into oprofile or
> ftrace?

Not yet, but I will.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-08 16:29                             ` Andrew Burgess
  2009-04-09  3:24                               ` Leslie Rhorer
@ 2009-04-10  3:02                               ` Leslie Rhorer
  2009-04-10  4:51                                 ` Leslie Rhorer
  2009-04-10  8:53                                 ` David Greaves
  1 sibling, 2 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-10  3:02 UTC (permalink / raw)
  To: 'Linux RAID'

> > I think the only way to switch schedulers is to reboot.  You invoke
> > the noop scheduler via a command line arg to the kernel at boot time.
> 
> This can be done at any time, no reboot required:
> 
>  for f in /sys/block/*/queue/scheduler; do
>     echo noop > $f
>     echo $f "$(cat $f)"
>   done

OK, I did this.  Two questions:

1.  The system responded in each case with this: "/sys/block/<block
device>/queue/scheduler [noop] anticipatory deadline cfq".  Is this as
expected?

2.  To switch back to the default scheduler, which I take it is CFQ, do I
simply issue the command above, replacing the string "noop" with "cfq"?

> Just FYI, I don't think ionice and friends will work when using noop.
> 
> Leslie: I still think finding out what the kernel is doing during the
> stall would be a HUGE hint to the problem. Did you look into oprofile or
> ftrace?

I couldn't find a Debian source for ftrace, but I did download oprofile.  I
am going to require some guidance, however.  I understand in general terms
what oprofile does, and vaguely what is involved with setup and
configuration, but it's quite a powerful and flexible tool, and I have
little or no idea of where to head with specifics.  If someone would care to
send a little guidance on what parameters I need to employ for setup and
configuration, and what processes I need to profile, I would be most
grateful.

Not knowing what to expect for an output, I may also require some help with
analysis.  If anyone wishes to help, but would prefer to take it offline,
please feel free to e-mail me directly.  Otherwise, please respond here.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-10  3:02                               ` Leslie Rhorer
@ 2009-04-10  4:51                                 ` Leslie Rhorer
  2009-04-10 12:50                                   ` jim owens
  2009-04-10 15:31                                   ` Bill Davidsen
  2009-04-10  8:53                                 ` David Greaves
  1 sibling, 2 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-10  4:51 UTC (permalink / raw)
  To: 'Linux RAID'

> >  for f in /sys/block/*/queue/scheduler; do
> >     echo noop > $f
> >     echo $f "$(cat $f)"
> >   done
> 
> OK, I did this.  Two questions:

It doesn't seem to have helped or hindered.  I still get halts, but under
moderate loads not every time.

> > Leslie: I still think finding out what the kernel is doing during the
> > stall would be a HUGE hint to the problem. Did you look into oprofile or
> > ftrace?
> 
> I couldn't find a Debian source for ftrace, but I did download oprofile.

Something very disturbing is happening now, however.  Just a few minutes
after loading oprofile, the system did a sudden total shutdown.  The file
systems were all left dirty, and power was suddenly cut to the main chassis.
This has never happened before.  I rebooted the system, and the file systems
replayed their journals.  Some data was lost, of course, but nothing
serious.  A few hours later, the exact same thing happened again:  A sudden
shut-down.  Nothing like this has ever happened before.  Of course the
system can issue a power shutdown from software, but it is supposed to clean
up the file systems first, and it's not supposed to just do it autonomously.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-10  4:51                                 ` Leslie Rhorer
@ 2009-04-10 12:50                                   ` jim owens
  2009-04-10 15:31                                   ` Bill Davidsen
  1 sibling, 0 replies; 84+ messages in thread
From: jim owens @ 2009-04-10 12:50 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

Leslie Rhorer wrote:
>>>  for f in /sys/block/*/queue/scheduler; do
>>>     echo noop > $f
>>>     echo $f "$(cat $f)"
>>>   done
>> OK, I did this.  Two questions:
> 
> It doesn't seem to have helped or hindered.  I still get halts, but under
> moderate loads not every time.
> 
>>> Leslie: I still think finding out what the kernel is doing during the
>>> stall would be a HUGE hint to the problem. Did you look into oprofile or
>>> ftrace?
>> I couldn't find a Debian source for ftrace, but I did download oprofile.
> 
> Something very disturbing is happening now, however.  Just a few minutes
> after loading oprofile, the system did a sudden total shutdown.  The file
> systems were all left dirty, and power was suddenly cut to the main chassis.
> This has never happened before.  I rebooted the system, and the file systems
> replayed their journals.  Some data was lost, of course, but nothing
> serious.  A few hours later, the exact same thing happened again:  A sudden
> shut-down.  Nothing like this has ever happened before.  Of course the
> system can issue a power shutdown from software, but it is supposed to clean
> up the file systems first, and it's not supposed to just do it autonomously.

There are some problems with oprofile on recent kernels and
various hardware platforms.  From the discussions I have seen,
it appears to be conflicts between the platform interrupt
handlers that manage things like power events and the CPU
performance counter non-maskable interrupts that are triggered
by oprofile.  The result is the system goes boom.

Your platform/distro is not where this was reported, but what
is happening to you sounds like the same problem.

Two approaches have been tried to work around this:

1) disable those platform management drivers.
2) run oprofile using the kernel clock (1000hz) to collect
    events instead of the hardware counters.

Since it is only very recently that the cause of this problem
was identified (and I was not really paying attention), I don't
know how successful either work around is or when fixes might
be available.

jim

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-10  4:51                                 ` Leslie Rhorer
  2009-04-10 12:50                                   ` jim owens
@ 2009-04-10 15:31                                   ` Bill Davidsen
  2009-04-11  1:37                                     ` Leslie Rhorer
  1 sibling, 1 reply; 84+ messages in thread
From: Bill Davidsen @ 2009-04-10 15:31 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

Leslie Rhorer wrote:
>>>  for f in /sys/block/*/queue/scheduler; do
>>>     echo noop > $f
>>>     echo $f "$(cat $f)"
>>>   done
>>>       
>> OK, I did this.  Two questions:
>>     
>
> It doesn't seem to have helped or hindered.  I still get halts, but under
> moderate loads not every time.
>
>   
>>> Leslie: I still think finding out what the kernel is doing during the
>>> stall would be a HUGE hint to the problem. Did you look into oprofile or
>>> ftrace?
>>>       
>> I couldn't find a Debian source for ftrace, but I did download oprofile.
>>     
>
> Something very disturbing is happening now, however.  Just a few minutes
> after loading oprofile, the system did a sudden total shutdown.  The file
> systems were all left dirty, and power was suddenly cut to the main chassis.
> This has never happened before.  I rebooted the system, and the file systems
> replayed their journals.  Some data was lost, of course, but nothing
> serious.  A few hours later, the exact same thing happened again:  A sudden
> shut-down.  Nothing like this has ever happened before.  Of course the
> system can issue a power shutdown from software, but it is supposed to clean
> up the file systems first, and it's not supposed to just do it autonomously.
>   

Just for grins, is this system by any chance on a UPS? Because I found 
an interesting failure mode if that's the case.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc

"You are disgraced professional losers. And by the way, give us our money back."
    - Representative Earl Pomeroy,  Democrat of North Dakota
on the A.I.G. executives who were paid bonuses  after a federal bailout.



^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-10 15:31                                   ` Bill Davidsen
@ 2009-04-11  1:37                                     ` Leslie Rhorer
  2009-04-11 13:02                                       ` Bill Davidsen
  0 siblings, 1 reply; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-11  1:37 UTC (permalink / raw)
  To: 'Linux RAID'

> > Something very disturbing is happening now, however.  Just a few minutes
> > after loading oprofile, the system did a sudden total shutdown.  The
> file
> > systems were all left dirty, and power was suddenly cut to the main
> chassis.
> > This has never happened before.  I rebooted the system, and the file
> systems
> > replayed their journals.  Some data was lost, of course, but nothing
> > serious.  A few hours later, the exact same thing happened again:  A
> sudden
> > shut-down.  Nothing like this has ever happened before.  Of course the
> > system can issue a power shutdown from software, but it is supposed to
> clean
> > up the file systems first, and it's not supposed to just do it
> autonomously.
> >
> 
> Just for grins, is this system by any chance on a UPS? Because I found
> an interesting failure mode if that's the case.

Yeah, you bet.  Indeed, the first thing I considered was a false Low Battery
indication causing the system to shut down, but in that case it should have
been a clean shut-down.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-11  1:37                                     ` Leslie Rhorer
@ 2009-04-11 13:02                                       ` Bill Davidsen
  0 siblings, 0 replies; 84+ messages in thread
From: Bill Davidsen @ 2009-04-11 13:02 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

Leslie Rhorer wrote:
>>> Something very disturbing is happening now, however.  Just a few minutes
>>> after loading oprofile, the system did a sudden total shutdown.  The
>>>       
>> file
>>     
>>> systems were all left dirty, and power was suddenly cut to the main
>>>       
>> chassis.
>>     
>>> This has never happened before.  I rebooted the system, and the file
>>>       
>> systems
>>     
>>> replayed their journals.  Some data was lost, of course, but nothing
>>> serious.  A few hours later, the exact same thing happened again:  A
>>>       
>> sudden
>>     
>>> shut-down.  Nothing like this has ever happened before.  Of course the
>>> system can issue a power shutdown from software, but it is supposed to
>>>       
>> clean
>>     
>>> up the file systems first, and it's not supposed to just do it
>>>       
>> autonomously.
>>     
>> Just for grins, is this system by any chance on a UPS? Because I found
>> an interesting failure mode if that's the case.
>>     
>
> Yeah, you bet.  Indeed, the first thing I considered was a false Low Battery
> indication causing the system to shut down, but in that case it should have
> been a clean shut-down.
>   

Actually what I saw was that the load exceeded the charge capacity, so 
the battery gradually discharged, and at some point the UPS just shut 
off (or the output voltage got so low the system crashed). The cause was 
not too much load, but a failure of a diode in the charging circuit, 
such that the full wave rectifier became a half wave rectifier, and in 
addition to the other problems there was a lot of noise on the output 
voltage.

Anyway, just a thought, if this falls over often I would take a chance 
on losing power and run on just a surge protector for a little bit to 
see if any UPS issue might be involved.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc

"You are disgraced professional losers. And by the way, give us our money back."
    - Representative Earl Pomeroy,  Democrat of North Dakota
on the A.I.G. executives who were paid bonuses  after a federal bailout.



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-10  3:02                               ` Leslie Rhorer
  2009-04-10  4:51                                 ` Leslie Rhorer
@ 2009-04-10  8:53                                 ` David Greaves
  1 sibling, 0 replies; 84+ messages in thread
From: David Greaves @ 2009-04-10  8:53 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

Leslie Rhorer wrote:
>>> I think the only way to switch schedulers is to reboot.  You invoke
>>> the noop scheduler via a command line arg to the kernel at boot time.
>> This can be done at any time, no reboot required:
>>
>>  for f in /sys/block/*/queue/scheduler; do
>>     echo noop > $f
>>     echo $f "$(cat $f)"
>>   done
> 
> OK, I did this.  Two questions:
> 
> 1.  The system responded in each case with this: "/sys/block/<block
> device>/queue/scheduler [noop] anticipatory deadline cfq".  Is this as
> expected?

Yes, it's the 2nd echo telling you what schedulers are available and which is
selected in a 'human readable form'

> 2.  To switch back to the default scheduler, which I take it is CFQ, do I
> simply issue the command above, replacing the string "noop" with "cfq"?

Yes

David

-- 
"Don't worry, you'll be fine; I saw it work in a cartoon once..."

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-08 14:17                         ` Leslie Rhorer
  2009-04-08 14:30                           ` David Lethe
  2009-04-08 14:37                           ` Greg Freemyer
@ 2009-04-08 18:04                           ` Corey Hickey
  2 siblings, 0 replies; 84+ messages in thread
From: Corey Hickey @ 2009-04-08 18:04 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

Did you see my response over on reiserfs-devel?

http://marc.info/?l=reiserfs-devel&m=123904875509162&w=2

I'm really curious whether you're running into the same issue I had.

-Corey

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-07 17:47                     ` Leslie Rhorer
  2009-04-07 18:18                       ` David Lethe
@ 2009-04-07 18:20                       ` Greg Freemyer
  2009-04-08  8:45                       ` John Robinson
  2 siblings, 0 replies; 84+ messages in thread
From: Greg Freemyer @ 2009-04-07 18:20 UTC (permalink / raw)
  To: lrhorer; +Cc: Linux RAID

Leslie,

I don't know if it is related, but if you have an hour or so you
should read bugzilla 12309.  There are 290 comments and tons of
significant attachments.

http://bugzilla.kernel.org/show_bug.cgi?id=12309

It's about very high latency experienced by users while a large i/o
activity is happening in the background.

I don't know if the background tasked is also being slowed down like
it is for you.

Greg
-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-07 17:47                     ` Leslie Rhorer
  2009-04-07 18:18                       ` David Lethe
  2009-04-07 18:20                       ` Greg Freemyer
@ 2009-04-08  8:45                       ` John Robinson
  2009-04-09  3:34                         ` Leslie Rhorer
  2 siblings, 1 reply; 84+ messages in thread
From: John Robinson @ 2009-04-08  8:45 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

On 07/04/2009 18:47, Leslie Rhorer wrote:
> [...]  What puzzles me
> (among other things) is why do 5 of the drives show zero reads while 5 of
> them show very low levels of read activity, and always the same 5 drives?

Can you identify how these drives are connected? If I recall correctly, 
you're using port multipliers (because you have to); is there anything 
like a pattern between the activity of drives and the PMPs or PMP 
channels they're connected to?

To borrow your expression, PMPs have left me with a bad taste in my 
mouth. Actually ReiserFS has too, though that was much longer ago.

Anyway, apologies for my last mini-flame (is that a smoulder?), and best 
of luck getting to the cause of your problem.

Cheers,

John.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-08  8:45                       ` John Robinson
@ 2009-04-09  3:34                         ` Leslie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-09  3:34 UTC (permalink / raw)
  To: 'Linux RAID'

> > [...]  What puzzles me
> > (among other things) is why do 5 of the drives show zero reads while 5
> of
> > them show very low levels of read activity, and always the same 5
> drives?
> 
> Can you identify how these drives are connected? If I recall correctly,
> you're using port multipliers (because you have to); is there anything
> like a pattern between the activity of drives and the PMPs or PMP
> channels they're connected to?

'First think of which I thought, of course.  I can't see one.  'Three
different PMPs, with both symptoms present on all 3.  'Both Hitachi and WD
drives with both symptoms.

> To borrow your expression, PMPs have left me with a bad taste in my
> mouth.

'Definitely not my favorite solution, either.  If nothing else, the loss of
a single PMP will at a minimum take the array offline.  At worst, it may
corrupt the drives unrecoverably.  I think this less likely with a
multi-lane controller.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-05  5:34     ` Lelsie Rhorer
  2009-04-05  7:16       ` Richard Scobie
@ 2009-04-05  7:33       ` Richard Scobie
  1 sibling, 0 replies; 84+ messages in thread
From: Richard Scobie @ 2009-04-05  7:33 UTC (permalink / raw)
  To: lrhorer; +Cc: linux-raid

Further to my last post regarding Seagate firmware, there are more 
details here, under "Seagate harddrives which time out FLUSH CACHE when 
NCQ is being used":

http://ata.wiki.kernel.org/index.php/Known_issues#Seagate_harddrives_which_time_out_FLUSH_CACHE_when_NCQ_is_being_used

Regards,

Richard

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-05  0:07 ` RAID halting Lelsie Rhorer
  2009-04-05  0:49   ` Greg Freemyer
@ 2009-04-05  0:57   ` Roger Heflin
  2009-04-05  6:30     ` Lelsie Rhorer
  1 sibling, 1 reply; 84+ messages in thread
From: Roger Heflin @ 2009-04-05  0:57 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

Lelsie Rhorer wrote:
>> If one of your disks was clearing bad sectors then things get messy
>> and when it hits one of these bad sectors that it can successfully
>> move you would get a delay almost every time.
> 
> Yes, but in that case two things would be true:
> 
> 1.  Any write of any sort could readily trigger an event.  The system quite
> regularly writes more than 5000 sectors / second, but never do any of these
> writes trigger an event except in the case where it is a file creation.
> Like I said, the drives have no idea whether the sector they are attempting
> to write is a new file or not, or part of a directory structure or not.

Writes don't trigger this sort of events, it is only the reads, and 
are you sure the data the you wrote is still readable?

> 
> 2.  The kernel would be reporting SMART errors.  It isn't.

Smart had never really worked as good as the disk makers claim.   I 
have tested smart on sets of >1000 drives, and  smarts accuracy for 
detecting bad sector issues with disks was almost useless, I had 50 
known bad drives in the set, smart flagged only 15 of them as bad, and 
on top of that smart flagged another 15-20 drives as bad that did not 
appear to fail at all after months of usage since smart had declared 
them bad.   Basically smart is useful, but it cannot really be 
trusted, if you don't believe me, see google's similar study on large 
numbers of drives.

> 
> Finally, as you said yourself, the situation would result in a delay almost
> every time, yet there are signifcant stretches of time when every single
> file creation works just fine.  Also, it doesn't take a drive 40 seconds,
> let alone 2 minutes, to mark a sector bad.  The array chassis I had
> previously had some sort of problem which made the drives think there were
> bad sectors, when there weren't.  It cause one drive to be marked with more
> than a million bad sectors.  It never paused like this, however.
>

And what I said if you read it carefully is, that *WHEN* you hit a bad 
sector it will cause a delay almost every time, not you will hit a 
delay every time you read the disk.

It will only result in a delay if you hit the magic bad sector.   And 
on reads it cannot mark the sector bad until it successfully reads the 
sector so it tries really hard and takes a long time trying, and once 
it reads that sector successfully it will rewrite it elsewhere and 
mark the sector bad.    When you hit the next bad sector the same 
thing will happen again.   How bad of issue that you have depends on 
if the number of bad sectors on the disk is growing...if you only have 
20 bad ones eventually they will all get reread (maybe) and relocated, 
if you have a few more showing up each day, things will never get any 
better.

When the array chassis had its issue, likely the chassis decided they 
were bad after getting a successful read, the read came back quickly 
and the chassis decided it was bad and marked it as such, the *DRIVE* 
has to think the sector is bad to get the delay, and in the array 
chassis case the drive knew the sector was just find and the array 
chassis misinterpreted what the drive was telling it and decided it 
was bad.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05  0:57   ` Roger Heflin
@ 2009-04-05  6:30     ` Lelsie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Lelsie Rhorer @ 2009-04-05  6:30 UTC (permalink / raw)
  To: linux-raid

> Lelsie Rhorer wrote:

Is that my error in spelling my name, or yours?  If mine, how do I fix it?

> Writes don't trigger this sort of events, it is only the reads, and
> are you sure the data the you wrote is still readable?

This data has been read and written, hundreds of gigabytes a day, for
months.  None of the files have experienced any noticeable problems.  I
can't vouch for every byte, of course, but no read out of the tens of
millions of blocks read has ever triggered a halt or produced a noticeable
error, AFAIK.  Typical read rates are between 5000 and 20,000 blocks per
second for one or two hours at a time.

> And what I said if you read it carefully is, that *WHEN* you hit a bad
> sector it will cause a delay almost every time, not you will hit a
> delay every time you read the disk.

So why is it that thousands of read blocks per second continuously over
hours at a time and spanning the entire drive space many times over have
never produced a single event, yet creating a 200 byte file under some
circumstances causes the failure in some cases every single time? The total
number of sectors involved with failure triggers has not exceeded 100
kilobytes, even allowing for file system overheard and the existence of
large numbers of superblocks.  The total number of bytes read, however, is
easily 200 - 300 terabytes, or more.

> It will only result in a delay if you hit the magic bad sector.   And
> on reads it cannot mark the sector bad until it successfully reads the
> sector so it tries really hard and takes a long time trying, and once
> it reads that sector successfully it will rewrite it elsewhere and
> mark the sector bad.

So why doesn't it happen when reading any file?  Why does it rarely, if ever
happen when low volumes of reading and writing are underway, but happen
extremely frequently when large numbers of reads, write, or both are
happening?  A bad sector doesn't care how many other sectors are being read
or written.  I have several times backed up the entire array, end to end, at
400+ Mbps, without a single burp.  Create one or two tiny files during the
process, and it comes to a screeching halt for 40 seconds.  Note the time is
highly regular.  Unless the array health check is underway, the halt is
always 40 seconds long, never 30 or 50.

> When you hit the next bad sector the same
> thing will happen again.

But how is it in a sea of some 19,531,250,000 sectors, multi gigabyte long
reads never hit any bad sectors, but hitting bad sectors with 1K long file
creations manage to find a bad sector sometimes 50% of the time or more?  If
the bad sectors were in the superblocks, then file reads and writes would
find them just as often as file creations.  If the errors are in the inodes,
how is it billions and billions of sectors read and written find no errors,
but the odd file creation can find them up to 50% of the time or more?  Why
is it the likelihood of hitting a bad sector in an inode or superblock - for
file creations only - is much more likely when other drive accesses are
going on?

> When the array chassis had its issue, likely the chassis decided they

The chassis didn't decide anything.  It (like the new one) was a dumb drive
chassis.  When I purchased it, it was a multilane chassis served by a
RocketRaid SAS PCI Express RAID adapter.  It had troubles from day 1, and
they grew exponentially as more drives were added to the array.  Finally,
when the array needed to grow beyond 8 drives, it required the addition of
an additional adapter.  I was never able to get two adapters to work in the
system, no matter what.  At that point, I switched to the SI adapter and
converted the chassis to port multipliers.  It would fail up to 4 drives a
day, completely trashing the RAID6 array numerous times.  Replacing the
chassis caused the reported errors to drop to virtually zero, and the system
has not failed a drive since, or even had to restart one, that I have seen.

> were bad after getting a successful read, the read came back quickly
> and the chassis decided it was bad and marked it as such, the *DRIVE*
> has to think the sector is bad to get the delay, and in the array
> chassis case the drive knew the sector was just find and the array
> chassis misinterpreted what the drive was telling it and decided it
> was bad.

Then why did SMART's reading of the sectors marked as errored on the drives
correspond closely to the reports in the kernel logs?

In any case, no offense, but this isn't really helping.  I need methods of
determining what is wrong, not hypotheses about what could be wrong,
especially when those hypotheses appear to be unsupported by the facts.  If
a drive in the array is bad, I need to know which one, and how to find it.
With ten drives, and the fact it takes over three days to rebuild the array,
I can't afford to just go around swapping drives.

^ permalink raw reply	[flat|nested] 84+ messages in thread

[parent not found: <49F2A193.8080807@sauce.co.nz>]

* RE: RAID halting
       [not found] <49F2A193.8080807@sauce.co.nz>
@ 2009-04-25  7:03 ` Leslie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-25  7:03 UTC (permalink / raw)
  To: 'Linux RAID'

> >> I grew a 3 disk RAID5 to 4 disks and the write performance was
> >> definitely inferior until this was done.
> >
> > Well, write performance is not generally a big issue with this array, as
> it
> > does vastly more reading than writing.  In any case, as long as the
> > parameters can be updated with a mount option, it's not a really big
> deal.
> > What about the read performance?  Is it badly impacted for pre-existing
> > files?  The files on this system are very static.  Once they are edited
> and
> > saved, they're pretty much permanent.
> 
> My understanding of growing the array by adding more disks, is that all
> files are rewritten across all disks, so there is

So there is...?  I think you accidentally deleted part of what you were
saying.  Let me see if I can guess what it was.  You were perhaps going to
say the stripe is re-organized at the array layer, so at the file system
layer the machine thinks the file was originally written that way?  While it
would seem to make sense, I'm going to have to think about it a bit to
convince myself it is true.
 
> The issue with writes is that if the filesystem knows the exact geometry
> of the array (no. of devices and stripe size), it can coalesce writes
> into n x stripe size chunks, which means that there is minimal
> read-modify-write penalty with parity having to be recalculated.

Yes, I follow that.

> If you grow the array and do not tell XFS that this chunk size has
> changed, then every single write will be based on the previous size and
> will require read-modify-write.

Yeah, I understood that, too.  That it can be re-configured in the mount
makes it a reasonable proposition.


^ permalink raw reply	[flat|nested] 84+ messages in thread

[parent not found: <49F21B75.7060705@sauce.co.nz>]

* RE: RAID halting
       [not found] <49F21B75.7060705@sauce.co.nz>
@ 2009-04-25  4:32 ` Leslie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-25  4:32 UTC (permalink / raw)
  To: 'Linux RAID'

> >> Quad core Xeon 2.8GHz md RAID6 of 16 x 750GB. Delete 20GB file:
> >>
> >> time rm -f dump
> >>
> >> real    0m1.849s
> >> user    0m0.000s
> >> sys     0m1.633s
> >
> > 3.2 GHz AMD Athlon 64 x 2 RAID6 of 10 x 1T:
> >
> > RAID-Server:/RAID/Recordings# ll sizetest.fil
> > -rw-rw-rw- 1 lrhorer users 27847620608 2009-04-24 03:21 sizetest.fil
> > RAID-Server:/RAID/Recordings# time rm sizetest.fil
> >
> > real    0m21.503s
> > user    0m0.000s
> > sys     0m6.852s
> >
> > See what I mean?  'Zero additional activity on the array other than the
> rm
> > task.  We'll see what happens with XFS.
> 
> Hi Leslie,
> 
> Regarding my previous post, I missed your comment about later growing
> the array, so what I said about mkfs.xfs automatically calculating the
> correct swidth/sunit sizes is correct initially, but once you grow the
> array, you will need to manaully calculate new values and these are
> applied as mount options.
> 
> I grew a 3 disk RAID5 to 4 disks and the write performance was
> definitely inferior until this was done.

Well, write performance is not generally a big issue with this array, as it
does vastly more reading than writing.  In any case, as long as the
parameters can be updated with a mount option, it's not a really big deal.
What about the read performance?  Is it badly impacted for pre-existing
files?  The files on this system are very static.  Once they are edited and
saved, they're pretty much permanent.


^ permalink raw reply	[flat|nested] 84+ messages in thread

[parent not found: <49D89515.3020800@computer.org>]

* RE: RAID halting
       [not found] <49D89515.3020800@computer.org>
@ 2009-04-05 18:40 ` Leslie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-05 18:40 UTC (permalink / raw)
  To: linux-raid

> > Is that my error in spelling my name, or yours?  If mine, how do I fix
> it?
> 
> Yours.
> 
> Configuration of your name in your email client.

Thanks.  'Got it.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* FW: RAID halting
@ 2009-04-05 14:22 David Lethe
  2009-04-05 14:53 ` David Lethe
  2009-04-05 20:33 ` Leslie Rhorer
  0 siblings, 2 replies; 84+ messages in thread
From: David Lethe @ 2009-04-05 14:22 UTC (permalink / raw)
  To: lrhorer, linux-raid

-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Lelsie Rhorer
Sent: Sunday, April 05, 2009 3:14 AM
To: linux-raid@vger.kernel.org
Subject: RE: RAID halting

> All of what you report is still consistent with delays caused by
having
> to remap bad blocks

I disagree.  If it happened with some frequency during ordinary reads,
then
I would agree.  If it happened without respect to the volume of reads
and
writes on the system, then I would be less inclined to disagree.

> The O/S will not report recovered errors, as this gets done internally
> by the disk drive, and the O/S never learns about it. (Queue depth

SMART is supposed to report this, and rarely the kernel log does report
a
block of sectors being marked bad by the controller.  I cannot speak to
the
notion SMART's reporting of relocated sectors and failed relocations may
not
be accurate, as I have no means to verify.

Actually, I should amend the first sentence, because while the ten
drives in
the array are almost never reporting any errors, there is another drive
in
the chassis which is chunking out error reports like a farm boy spitting
out
watermelon seeds.  I had a 320G drive in another system which was
behaving
erratically, so I moved it to the array chassis on this machine to
eliminate
it being a cable or the drive controller.  It's reporting blocks being
marked bad all over the place.

> Really, if this was my system I would run non-destructive read tests
on
> all blocks;

How does one do this?  Or rather, isn't this what the monthly mdadm
resync
does?

> along with the embedded self-test on the disk.  It is often

How does one do this?

> a lot easier and more productive to eliminate what ISN'T the problem
> rather than chase all of the potential reasons for the problem.

I agree, which is why I am asking for troubleshooting methods and
utilities.

The monthly RAID array resync started a few minutes ago, and it is
providing
some interesting results.  The number of blocks read / second is
consistently 13,000 - 24,000 on all ten drives.  There were no other
drive
accesses of any sort at the time, so the number of blocks written was
flat
zero on all drives in the array.  I copied the /etc/hosts file to the
RAID
array, and instantly the file system locked, but the array resync *DID
NOT*.
The number of blocks read and written per second continued to range from
13,000 to 24,000 blocks/second, with no apparent halt or slow-down at
all,
not even for one second.  So if it's a drive error, why are file system
reads halted almost completely, and writes halted altogether, yet drive
reads at the RAID array level continue unabated at an aggregate of more
than
130,000 blocks - 240,000 blocks (500 - 940 megabits) per second?  I
tried a
second copy and again the file system accesses to the drives halted
altogether.  The block reads (which had been alternating with writes
after
the new transfer proceses were implemented) again jumped to between
13,000
and 24,000.  This time I used a stopwatch, and the halt was 18 minutes
21
seconds - I believe the longest ever.  There is absolutely no way it
would
take a drive almost 20 minutes to mark a block bad.  The dirty blocks
grew
to more than 78 Megabytes.  I just did a 3rd cp of the /etc/hosts file
to
the array, and once again it locked the machine for what is likely to be
another 15 - 20 minutes.  I tried forcing a sync, but it also hung.

<Sigh>  The next three days are going to be Hell, again.  It's going to
be
all but impossible to edit a file until the RAID resync completes.  It's
often really bad under ordinary loads, but when the resync is underway,
it's
beyond absurd.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
======
Leslie: 
Respectfully, your statement, "SMART is supposed to report this" shows
you have no understanding of exactly what S.M.A.R.T. is and is not
supposed to report, nor do you know enough about hardware to make an
educated decision about what can and can not be contributing factors.
As such, you are not qualified to dismiss the necessity to run hardware
diagnostics.

A few other things - many SATA controller cards use poorly architected
bridge chips that spoof some of the ATA commands, so even if you *think*
you are kicking off one of the SMART subcommands, like the
SMART_IMMEDIATE_OFFLINE (op code d4h with the extended self test,
subcommand 2h), then it is possible, perhaps probable, they are never
getting run. -- yes, I am giving you the raw opcodes so you can look
them up and learn what they do.

You want to know how it is possible that frequency or size of reads can
be a factor? 
Do the math:
 * Look at the # of ECC bits you have on the disks (read the specs), and
compare that with the trillions of bytes you have.  How frequently can
you expect to have an unrecoverable ECC error based on that.
 * What percentage of your farm are you actually testing with the tests
you have run so far? Is it even close to being statistically
significant?
 * Do you know what physical blocks on each disk are being read/written
with the tests you mention? If you do not know, then how do you know
that the short tests are doing I/O on blocks that need to be repaired,
and subsequent tests run OK because those blocks were just repaired?
 * Did you look into firmware? Are the drives and/or firmware revisions
qualified by your controller vendor?  

I've been in the storage business for over 10 years, writing everything
from RAID firmware, configurators, disk diagnostics, test bench suites.
I even have my own company that writes storage diagnostics.  I think I
know a little more about diagnostics and what can and can not happen.
You said before that you do not agree with my statements earlier.  I
doubt that you will find any experienced storage professional that
wouldn't tell you to break it all down and run a full block-level DVT
before going further.  It could have all been done over the week-end if
you had the right setup, and then you would know a lot more than what
you know now.  

AT this point all you have done is tell people who suggest hardware is
the cause that they are wrong and then tell us why you think we are
wrong.  Frankly, be lazy and don't run diagnostics, you had just better
not be a government employee, or in charge of a database that contains
financial, medical, or other such information, and you have better be
running hot backups.

If you still refuse to run full block-level hardware test, then ask
yourself how much longer will you allow this to go on before you run
such a test, or are you just going to continue down this path waiting
for somebody to give you a magic command to type in that will fix
everything.

I am not the one who is putting my job on the line at best, and at
worst, is looking at a criminal violation for not taking appropriate
actions to protect certain data. I make no apology for beating you up on
this.  You need to hear it.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05 14:22 FW: " David Lethe
@ 2009-04-05 14:53 ` David Lethe
  2009-04-05 20:33 ` Leslie Rhorer
  1 sibling, 0 replies; 84+ messages in thread
From: David Lethe @ 2009-04-05 14:53 UTC (permalink / raw)
  To: David Lethe, lrhorer, linux-raid

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of David Lethe
> Sent: Sunday, April 05, 2009 9:23 AM
> To: lrhorer@satx.rr.com; linux-raid@vger.kernel.org
> Subject: FW: RAID halting
> 
> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org
> [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Lelsie Rhorer
> Sent: Sunday, April 05, 2009 3:14 AM
> To: linux-raid@vger.kernel.org
> Subject: RE: RAID halting
> 
> > All of what you report is still consistent with delays caused by
> having
> > to remap bad blocks
> 
> I disagree.  If it happened with some frequency during ordinary reads,
> then
> I would agree.  If it happened without respect to the volume of reads
> and
> writes on the system, then I would be less inclined to disagree.
> 
> > The O/S will not report recovered errors, as this gets done
> internally
> > by the disk drive, and the O/S never learns about it. (Queue depth
> 
> SMART is supposed to report this, and rarely the kernel log does
report
> a
> block of sectors being marked bad by the controller.  I cannot speak
to
> the
> notion SMART's reporting of relocated sectors and failed relocations
> may
> not
> be accurate, as I have no means to verify.
> 
> Actually, I should amend the first sentence, because while the ten
> drives in
> the array are almost never reporting any errors, there is another
drive
> in
> the chassis which is chunking out error reports like a farm boy
> spitting
> out
> watermelon seeds.  I had a 320G drive in another system which was
> behaving
> erratically, so I moved it to the array chassis on this machine to
> eliminate
> it being a cable or the drive controller.  It's reporting blocks being
> marked bad all over the place.
> 
> > Really, if this was my system I would run non-destructive read tests
> on
> > all blocks;
> 
> How does one do this?  Or rather, isn't this what the monthly mdadm
> resync
> does?
> 
> > along with the embedded self-test on the disk.  It is often
> 
> How does one do this?
> 
> > a lot easier and more productive to eliminate what ISN'T the problem
> > rather than chase all of the potential reasons for the problem.
> 
> I agree, which is why I am asking for troubleshooting methods and
> utilities.
> 
> The monthly RAID array resync started a few minutes ago, and it is
> providing
> some interesting results.  The number of blocks read / second is
> consistently 13,000 - 24,000 on all ten drives.  There were no other
> drive
> accesses of any sort at the time, so the number of blocks written was
> flat
> zero on all drives in the array.  I copied the /etc/hosts file to the
> RAID
> array, and instantly the file system locked, but the array resync *DID
> NOT*.
> The number of blocks read and written per second continued to range
> from
> 13,000 to 24,000 blocks/second, with no apparent halt or slow-down at
> all,
> not even for one second.  So if it's a drive error, why are file
system
> reads halted almost completely, and writes halted altogether, yet
drive
> reads at the RAID array level continue unabated at an aggregate of
more
> than
> 130,000 blocks - 240,000 blocks (500 - 940 megabits) per second?  I
> tried a
> second copy and again the file system accesses to the drives halted
> altogether.  The block reads (which had been alternating with writes
> after
> the new transfer proceses were implemented) again jumped to between
> 13,000
> and 24,000.  This time I used a stopwatch, and the halt was 18 minutes
> 21
> seconds - I believe the longest ever.  There is absolutely no way it
> would
> take a drive almost 20 minutes to mark a block bad.  The dirty blocks
> grew
> to more than 78 Megabytes.  I just did a 3rd cp of the /etc/hosts file
> to
> the array, and once again it locked the machine for what is likely to
> be
> another 15 - 20 minutes.  I tried forcing a sync, but it also hung.
> 
> <Sigh>  The next three days are going to be Hell, again.  It's going
to
> be
> all but impossible to edit a file until the RAID resync completes.
> It's
> often really bad under ordinary loads, but when the resync is
underway,
> it's
> beyond absurd.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> ======
> Leslie:
> Respectfully, your statement, "SMART is supposed to report this" shows
> you have no understanding of exactly what S.M.A.R.T. is and is not
> supposed to report, nor do you know enough about hardware to make an
> educated decision about what can and can not be contributing factors.
> As such, you are not qualified to dismiss the necessity to run
hardware
> diagnostics.
> 
> A few other things - many SATA controller cards use poorly architected
> bridge chips that spoof some of the ATA commands, so even if you
> *think*
> you are kicking off one of the SMART subcommands, like the
> SMART_IMMEDIATE_OFFLINE (op code d4h with the extended self test,
> subcommand 2h), then it is possible, perhaps probable, they are never
> getting run. -- yes, I am giving you the raw opcodes so you can look
> them up and learn what they do.
> 
> You want to know how it is possible that frequency or size of reads
can
> be a factor?
> Do the math:
>  * Look at the # of ECC bits you have on the disks (read the specs),
> and
> compare that with the trillions of bytes you have.  How frequently can
> you expect to have an unrecoverable ECC error based on that.
>  * What percentage of your farm are you actually testing with the
tests
> you have run so far? Is it even close to being statistically
> significant?
>  * Do you know what physical blocks on each disk are being
read/written
> with the tests you mention? If you do not know, then how do you know
> that the short tests are doing I/O on blocks that need to be repaired,
> and subsequent tests run OK because those blocks were just repaired?
>  * Did you look into firmware? Are the drives and/or firmware
revisions
> qualified by your controller vendor?
> 
> I've been in the storage business for over 10 years, writing
everything
> from RAID firmware, configurators, disk diagnostics, test bench
suites.
> I even have my own company that writes storage diagnostics.  I think I
> know a little more about diagnostics and what can and can not happen.
> You said before that you do not agree with my statements earlier.  I
> doubt that you will find any experienced storage professional that
> wouldn't tell you to break it all down and run a full block-level DVT
> before going further.  It could have all been done over the week-end
if
> you had the right setup, and then you would know a lot more than what
> you know now.
> 
> AT this point all you have done is tell people who suggest hardware is
> the cause that they are wrong and then tell us why you think we are
> wrong.  Frankly, be lazy and don't run diagnostics, you had just
better
> not be a government employee, or in charge of a database that contains
> financial, medical, or other such information, and you have better be
> running hot backups.
> 
> If you still refuse to run full block-level hardware test, then ask
> yourself how much longer will you allow this to go on before you run
> such a test, or are you just going to continue down this path waiting
> for somebody to give you a magic command to type in that will fix
> everything.
> 
> I am not the one who is putting my job on the line at best, and at
> worst, is looking at a criminal violation for not taking appropriate
> actions to protect certain data. I make no apology for beating you up
> on
> this.  You need to hear it.

P.S. Your other mistake was even assuming that your configuration would
ever "work" in the first place, even if every disk passed full
diagnostics.  How do you know there aren't any firmware bugs biting you
.. everything works fine individually, but combinations of the 2.1TB
overflow, TCQ, old drive firmware, configurable drive settings, etc..
all don't work together to create chaos above and beyond block read
errors.

Take the FS out of the equation, boot to a CDROM and do some reads of
the raw md devices, rather than mounted filesystem.

Maybe your config doesn't work now, is because it wouldn't have ever
worked in the first place.

-David



^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05 14:22 FW: " David Lethe
  2009-04-05 14:53 ` David Lethe
@ 2009-04-05 20:33 ` Leslie Rhorer
  2009-04-05 22:20   ` Peter Grandi
  2009-04-06  0:31   ` Doug Ledford
  1 sibling, 2 replies; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-05 20:33 UTC (permalink / raw)
  To: 'David Lethe', linux-raid

> Leslie:
> Respectfully, your statement, "SMART is supposed to report this" shows
> you have no understanding of exactly what S.M.A.R.T. is and is not
> supposed to report, nor do you know enough about hardware to make an
> educated decision about what can and can not be contributing factors.
> As such, you are not qualified to dismiss the necessity to run hardware
> diagnostics.

I am not dismissing anything.  From the very first post I have been asking
for methods of running diagnostics, including hardware diagnostics.  The
evidence so far does not strongly suggest a hardware issue, at least not a
drive issue, but this in no way conclusively eliminates the possibility.  I
never said it did.

> A few other things - many SATA controller cards use poorly architected
> bridge chips that spoof some of the ATA commands, so even if you *think*
> you are kicking off one of the SMART subcommands, like the
> SMART_IMMEDIATE_OFFLINE (op code d4h with the extended self test,
> subcommand 2h), then it is possible, perhaps probable, they are never
> getting run. -- yes, I am giving you the raw opcodes so you can look
> them up and learn what they do.

I know what they do.  Whether those codes are properly implemented by either
the OS or the SATA controller I have no way of knowing offhand.  All I can
tell you is what I have observed: the drive system previously reported tons
of sector remaps when the drives were in a different, clearly broken,
enclosure, and they continue to do so on the 320G drive with known issues.

> You want to know how it is possible that frequency or size of reads can
> be a factor?
> Do the math:
>  * Look at the # of ECC bits you have on the disks (read the specs), and
> compare that with the trillions of bytes you have.  How frequently can
> you expect to have an unrecoverable ECC error based on that.

Exceedingly rarely for file creations and very commonly for gigabit long
reads if the drive is bad.  So why does it happen sometimes 100% of the time
when creating files, but never at all when reading them?  I ma noty talking
about reads on static files which presumably may have encountered a bad
sector and had it remapped.  I am talking about massive volumes of brand new
data being written, read, erased, written, and read over and over.

>  * What percentage of your farm are you actually testing with the tests
> you have run so far? Is it even close to being statistically
> significant?

Yes, they are.  More than statistically significant, they cover a data space
larger than the entire free space region many times over.  It's not an
overly high probability even a single block has been missed.

>  * Do you know what physical blocks on each disk are being read/written
> with the tests you mention?

No, of course not.  Are you seriously suggesting I should investigate the
sector numbers of each and every one of the billions of reads going on?
They far exceed in number the total number of free blocks, which is more
than sufficient evidence.

> If you do not know, then how do you know
> that the short tests are doing I/O on blocks that need to be repaired,
> and subsequent tests run OK because those blocks were just repaired?

'Because there is no statistical or logical differentiation between a write
/ read pair on a file creation and billions of write / read pairs spooling
out on a new multi-gigabit file.  If there were halts during the reads of
the files after they were written, then I would agree, but each large file
is written, read, copied, read, written, read, and then read again at least
once.  Every 3G - 35G file has its bytes written to at least three different
locations (with modifications each time) on the array, and is read at least
four times, sometimes five or six times, along the way.  Once in its final
state, the file just sits there in perpetuity, only being read.  At this
point, the active sections of the array are only 610 Gigs in extent, and
that much data gets written, read, written over, and read in a matter of a
few days.  Yet for many months, not a single halt has been observed during
any of the trillions of blocks read, except when creating a file.  It's
moderately unlikely there is even a single block on the drive that has not
been written and subsequently read.

>  * Did you look into firmware? Are the drives and/or firmware revisions
> qualified by your controller vendor?

Yes.  I did that before purchasing the controller.  No, I did not look into
the drives.  The controller vendor does not qualify drives.  Controllers
don't get any more generic than the one I purchased (I don't recall the
brand at this time - it's based on the Silicon Image SiI3124 controller
chip).  More importantly, the fact the system ran for months without the
problem, and the problem only occurred after changing the array chassis and
the file system strongly suggests this is not the root of the issue.

> I've been in the storage business for over 10 years, writing everything
> from RAID firmware, configurators, disk diagnostics, test bench suites.

'And I started designing computer hardware more than 30 years ago, before
the IMB PC existed.  Neither your qualifications nor mine are of any
relevance.

> doubt that you will find any experienced storage professional that
> wouldn't tell you to break it all down and run a full block-level DVT
> before going further.

I am not a fool.  I have asked you more than once for the details of a
utility to handle a DVT.  I can't run an application I do not have in my
possession.  If it is part of the Linux distro, then I need to know what it
is.

> It could have all been done over the week-end if
> you had the right setup, and then you would know a lot more than what
> you know now.

> AT this point all you have done is tell people who suggest hardware is
> the cause that they are wrong and then tell us why you think we are
> wrong.

No, in addition to applying proper deductive reasoning to the data I do have
and concluding your hypotheses are unlikely, I have repeatedly asked for the
details behind such a setup.

> Frankly, be lazy and don't run diagnostics, you had just better
> not be a government employee, or in charge of a database that contains
> financial, medical, or other such information, and you have better be
> running hot backups.

My employment is not relevant, and only an idiot doesn't back up data.
Critical data gets backed up on multiple sites via multiple vectors.  None
of which has any relevance to my problem at hand or how to further diagnose
it.  My backup systems aren't having the problem.

> If you still refuse to run full block-level hardware test, then ask

Where do you get this?  I have never once refused to do anything.  Read my
lips: "HOW DO I RUN A FULL BLOCK-LEVEL HARDWARE TEST?"  Point me to a
website, a MAN page, or a phone number where I can obtain the utilities to
perform the tests.  I am a hardware expert, but a Linux neophyte.  I do not
run Linux on any of my professional systems, only my personal ones, and my
professional systems are not PCs.

> yourself how much longer will you allow this to go on before you run
> such a test, or are you just going to continue down this path waiting
> for somebody to give you a magic command to type in that will fix
> everything.
> 
> I am not the one who is putting my job on the line at best, and at
> worst, is looking at a criminal violation for not taking appropriate
> actions to protect certain data. I make no apology for beating you up on
> this.  You need to hear it.

Oh, brother.  Since I personally own outright every last nut, bolt, and
transistor in these systems, and since the data belongs entirely and
exclusively to me (with the exception of some copyright restrictions), for
use by me, exactly who is going to fire me or prosecute me?  Enough.  Please
cease the ad hominem attacks, and point me towards the utilities which will
allow me to further diagnose the issue.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05 20:33 ` Leslie Rhorer
@ 2009-04-05 22:20   ` Peter Grandi
  2009-04-06  0:31   ` Doug Ledford
  1 sibling, 0 replies; 84+ messages in thread
From: Peter Grandi @ 2009-04-05 22:20 UTC (permalink / raw)
  To: Linux RAID

> [ ... ] The evidence so far does not strongly suggest a
> hardware issue, at least not a drive issue, [ ... ]

> [ ... ] the drive system previously reported tons of sector
> remaps when the drives were in a different, clearly broken,
> enclosure, and they continue to do so on the 320G drive with
> known issues.

>> * Did you look into firmware? Are the drives and/or firmware
>>   revisions qualified by your controller vendor?

> Yes.  I did that before purchasing the controller.  No, I did not
> look into the drives.  The controller vendor does not qualify
> drives.  Controllers don't get any more generic than the one I
> purchased (I don't recall the brand at this time - it's based on
> the Silicon Image SiI3124 controller chip).

Uhhh, I'd invest in something else. Just in case. The SiL chips are
a bit low end, and most SiL based cards I have seeen were of the
cheap and cheerful variety, and those sometimes have fairly
marginal electrical/noise designs.

> More importantly, the fact the system ran for months without the
> problem, and the problem only occurred after changing the array
> chassis and the file system strongly suggests this is not the
> root of the issue.

Not necessarily: a different file system may trigger different bugs
in the host adapter fw and in the drive fw by doing operations in a
different sequence with different timing.

> [ ... ] "HOW DO I RUN A FULL BLOCK-LEVEL HARDWARE TEST?"

I agree that it seems unlikely that it is a physically defective
disk. More likely bad cabling, bad backplane, bad fw, bad
electrical/noise design.

Anyhow it is practically impossible on modern drives to run a full
black level hardware test on disk drives, which are more like block
servers, with several layers of interpolation between the command
level and the hardware.

However to run a *logical* block test, 'badblocks' from the
'e2fsprogs' package is the common choice.

But I'd leave running the CERN "silent corruption" daemon and other
checks/diagnostics and look carefully at the system logs for host
adapter errors.

For most people doing significant storage systems and self-built
systems of a certain size keeping current with the HEPiX workshops
<URL:https://WWW.HEPiX.org/> seems to me a good idea.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-05 20:33 ` Leslie Rhorer
  2009-04-05 22:20   ` Peter Grandi
@ 2009-04-06  0:31   ` Doug Ledford
  2009-04-06  1:53     ` Leslie Rhorer
  1 sibling, 1 reply; 84+ messages in thread
From: Doug Ledford @ 2009-04-06  0:31 UTC (permalink / raw)
  To: lrhorer; +Cc: 'David Lethe', linux-raid

[-- Attachment #1: Type: text/plain, Size: 1451 bytes --]

On Apr 5, 2009, at 4:33 PM, Leslie Rhorer wrote:
> Oh, brother.  Since I personally own outright every last nut, bolt,  
> and
> transistor in these systems, and since the data belongs entirely and
> exclusively to me (with the exception of some copyright  
> restrictions), for
> use by me, exactly who is going to fire me or prosecute me?   
> Enough.  Please
> cease the ad hominem attacks, and point me towards the utilities  
> which will
> allow me to further diagnose the issue.

Well said.  I found it particularly interesting to hear David talk of  
statistical probabilities as he studiously ignored the astronomical  
statistical improbability that sector remapping would strike only on  
file creation, and would simultaneously block a drive up for the  
purpose of file creation but not block it up for the purpose of raid  
sector checking.  David definitely went on a "short bus" rant, I'd  
just ignore the rant part and see if he manages to provide any useful  
data about how to check things as you requested.  Personally, I've  
only ever used badblocks for low level disk checking, but back when I  
used it for diagnosis drives were different than they are today in  
terms of firmware and you could actually trust that badblocks was  
doing something useful.

--

Doug Ledford <dledford@redhat.com>

GPG KeyID: CFBFF194
http://people.redhat.com/dledford

InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 203 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-06  0:31   ` Doug Ledford
@ 2009-04-06  1:53     ` Leslie Rhorer
  2009-04-06 12:37       ` Doug Ledford
  0 siblings, 1 reply; 84+ messages in thread
From: Leslie Rhorer @ 2009-04-06  1:53 UTC (permalink / raw)
  To: 'Linux RAID'

> Well said.  I found it particularly interesting to hear David talk of
> statistical probabilities as he studiously ignored the astronomical
> statistical improbability that sector remapping would strike only on
> file creation, and would simultaneously block a drive up for the
> purpose of file creation but not block it up for the purpose of raid
> sector checking.

I know.  I was apoplectic.  I truly didn't know how best to respond to such
a glaring incongruity.  It's likely the average sector in the free space
region has been read 50 times or more without a single instance of the
failure, yet under load every single creation causes a halt.  'Billions of
sectors read over, and over, and over again, yet perform a file create to
one or two from the very same sector space, and kerplewey!  It boggled my
mind.

> David definitely went on a "short bus" rant, I'd
> just ignore the rant

I was trying to, mostly.  I've never heard the term, "short bus", before.  I
presume it refers to what can happen on a computer bus with an electrical
short?  What really gets me is rather than going on and on about how
ignorant I was, all he had to do in his very first message was say, "Try the
badblocks command."

> Personally, I've
> only ever used badblocks for low level disk checking, but back when I
> used it for diagnosis drives were different than they are today in
> terms of firmware and you could actually trust that badblocks was
> doing something useful.

Am I mistaken in believing, per the discussion in this list, it should
trigger an event, provided the problem really is bad blocks on one or more
drives?  If so, then I need someone to explain a bit more what badblocks
does, and perhaps point me toward some low level test which will potentially
either rule out or convict the drive layer of being the source of issues.
I've never used it before, quite obviously.  I read the MAN page, of course,
but as is typical with MAN pages, it doesn't go into any detail under the
hood, as it were.

Oh, just BTW, I have the system set to notify me via e-mail of any events
passing through the Monitor daemon of mdadm.  Will this notify me if the
RAID device encounters any errors requiring recovery at the RAID level?  If
so, I have never received any such notifications since implementing mdadm.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-06  1:53     ` Leslie Rhorer
@ 2009-04-06 12:37       ` Doug Ledford
  0 siblings, 0 replies; 84+ messages in thread
From: Doug Ledford @ 2009-04-06 12:37 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

[-- Attachment #1: Type: text/plain, Size: 4978 bytes --]

On Apr 5, 2009, at 9:53 PM, Leslie Rhorer wrote:

>> Well said.  I found it particularly interesting to hear David talk of
>> statistical probabilities as he studiously ignored the astronomical
>> statistical improbability that sector remapping would strike only on
>> file creation, and would simultaneously block a drive up for the
>> purpose of file creation but not block it up for the purpose of raid
>> sector checking.
>
> I know.  I was apoplectic.  I truly didn't know how best to respond  
> to such
> a glaring incongruity.  It's likely the average sector in the free  
> space
> region has been read 50 times or more without a single instance of the
> failure, yet under load every single creation causes a halt.   
> 'Billions of
> sectors read over, and over, and over again, yet perform a file  
> create to
> one or two from the very same sector space, and kerplewey!  It  
> boggled my
> mind.
>
>> David definitely went on a "short bus" rant, I'd
>> just ignore the rant
>
> I was trying to, mostly.  I've never heard the term, "short bus",  
> before.  I
> presume it refers to what can happen on a computer bus with an  
> electrical
> short?

No, much more insulting and definitely not politically correct speech,  
hence my lack of further explanation.

>  What really gets me is rather than going on and on about how
> ignorant I was, all he had to do in his very first message was say,  
> "Try the
> badblocks command."
>
>> Personally, I've
>> only ever used badblocks for low level disk checking, but back when I
>> used it for diagnosis drives were different than they are today in
>> terms of firmware and you could actually trust that badblocks was
>> doing something useful.
>
> Am I mistaken in believing, per the discussion in this list, it should
> trigger an event, provided the problem really is bad blocks on one  
> or more
> drives?

It should, yes.  It merely attempts to read the entire block device,  
without any regards to filesystem layout or anything like that.  Since  
it reads the entire block device, it covers the metadata, the journal,  
and everything else. There shouldn't be anything that the filesystem  
touches that bad blocks doesn't.  In the old days, when bad blocks  
gave you a bad block number, it meant something, now a days it doesn't  
mean much due to changes in disk firmware.  So even if it doesn't give  
you what you need to manually map bad blocks out of the filesystem  
well, it should still replicate your hangs if the hangs are truly bad  
block remapping related.

>  If so, then I need someone to explain a bit more what badblocks
> does, and perhaps point me toward some low level test which will  
> potentially
> either rule out or convict the drive layer of being the source of  
> issues.
> I've never used it before, quite obviously.  I read the MAN page, of  
> course,
> but as is typical with MAN pages, it doesn't go into any detail  
> under the
> hood, as it were.

All it really does under the hood in the non-destructive case is read  
from block 0 to block (size-1) and see if any of them report errors  
via the OS.  In destructive write tests, it writes patterns and sees  
if they read back properly.  I think there is a non-destructive write  
test that's supposed to read/modify/read/restore or something like  
that, but obviously it can't be used on a live filesystem.  Only the  
read test is safe on a live filesystem.

>
> Oh, just BTW, I have the system set to notify me via e-mail of any  
> events
> passing through the Monitor daemon of mdadm.  Will this notify me if  
> the
> RAID device encounters any errors requiring recovery at the RAID  
> level?  If
> so, I have never received any such notifications since implementing  
> mdadm.

I don't think so.  The mdadm --monitor functionality simply watches  
the output of /proc/mdstat watching for changes in the array's listed  
state, such as a transition from active to degraded.  On those  
changes, it mails the admin.  However, if you are running a resync/ 
check, this is considered a good state like active is.  So mdadm would  
only send you a mail if the array encountered an unrecoverable problem  
that kicked the array from a good state to a degraded state.  The  
whole monitor capability of mdadm is probably due for a rewrite now  
that sysfs usage is pervasive.  It should probably ignore /proc/mdstat  
and instead use the sysfs files, and it should check for more things  
than just a transition from active->degraded, it should also check  
things like mismatch_cnt after a check/resync completes, things like  
that.

>
> --
> To unsubscribe from this list: send the line "unsubscribe linux- 
> raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--

Doug Ledford <dledford@redhat.com>

GPG KeyID: CFBFF194
http://people.redhat.com/dledford

InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 203 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE:
@ 2009-04-05  5:33 David Lethe
  2009-04-05  8:14 ` RAID halting Lelsie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: David Lethe @ 2009-04-05  5:33 UTC (permalink / raw)
  To: lrhorer, linux-raid

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Lelsie Rhorer
> Sent: Sunday, April 05, 2009 12:21 AM
> To: linux-raid@vger.kernel.org
> Subject: RE:
> 
> > You should note that the drive won't know a sector it just wrote is
> > bad until it reads it
> 
> Yes, but it also won't halt the write for 40 seconds because it was
> bad.
> More to  the point, there is no difference at the drive level between
a
> bad
> sector written for a 30Gb file and a 30 byte file.
> 
> > ....are you sure you actually successfully wrote all of that data
and
> that
> >it is still there?  Pretty sure, yeah.  There are no errors in the
> filesystem, and every file I have written works.  Again, however, the
> point
> is there is never a problem once the file is created, no matter how
> long it
> takes to write it out to disk.  The moment the file is created,
> however,
> there may be up to a 2 minute delay in writing its data to the drive.
> 
> > And it is not the writes that kill when you have a drive going bad,
> it
> > is the reads of the bad sectors.    And to create a file, a number
of
> > things will likely need to be read to finish the file creation, and
> if
> > one of those is a bad sector things get ugly.
> 
> Well, I agree to some extent, except that why would it be loosely
> related to
> the volume of drive activity, and why is it 5 drives stop reading
> altogether
> and 5 do not?  Furthermore, every single video file gets read, re-
> written,
> edited, re-written again, and finally read again at least once,
> sometimes
> several times, before being finally archived.  Why does the kernel log
> never
> report any errors of any sort?
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

All of what you report is still consistent with delays caused by having
to remap bad blocks
The O/S will not report recovered errors, as this gets done internally
by the disk drive, and the O/S never learns about it. (Queue depth
settings can account for some of the other "weirdness" you reported.

Really, if this was my system I would run non-destructive read tests on
all blocks; along with the embedded self-test on the disk.  It is often
a lot easier and more productive to eliminate what ISN'T the problem
rather than chase all of the potential reasons for the problem.  



^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-05  5:33 David Lethe
@ 2009-04-05  8:14 ` Lelsie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Lelsie Rhorer @ 2009-04-05  8:14 UTC (permalink / raw)
  To: linux-raid

> All of what you report is still consistent with delays caused by having
> to remap bad blocks

I disagree.  If it happened with some frequency during ordinary reads, then
I would agree.  If it happened without respect to the volume of reads and
writes on the system, then I would be less inclined to disagree.

> The O/S will not report recovered errors, as this gets done internally
> by the disk drive, and the O/S never learns about it. (Queue depth

SMART is supposed to report this, and rarely the kernel log does report a
block of sectors being marked bad by the controller.  I cannot speak to the
notion SMART's reporting of relocated sectors and failed relocations may not
be accurate, as I have no means to verify.

Actually, I should amend the first sentence, because while the ten drives in
the array are almost never reporting any errors, there is another drive in
the chassis which is chunking out error reports like a farm boy spitting out
watermelon seeds.  I had a 320G drive in another system which was behaving
erratically, so I moved it to the array chassis on this machine to eliminate
it being a cable or the drive controller.  It's reporting blocks being
marked bad all over the place.

> Really, if this was my system I would run non-destructive read tests on
> all blocks;

How does one do this?  Or rather, isn't this what the monthly mdadm resync
does?

> along with the embedded self-test on the disk.  It is often

How does one do this?

> a lot easier and more productive to eliminate what ISN'T the problem
> rather than chase all of the potential reasons for the problem.

I agree, which is why I am asking for troubleshooting methods and utilities.

The monthly RAID array resync started a few minutes ago, and it is providing
some interesting results.  The number of blocks read / second is
consistently 13,000 - 24,000 on all ten drives.  There were no other drive
accesses of any sort at the time, so the number of blocks written was flat
zero on all drives in the array.  I copied the /etc/hosts file to the RAID
array, and instantly the file system locked, but the array resync *DID NOT*.
The number of blocks read and written per second continued to range from
13,000 to 24,000 blocks/second, with no apparent halt or slow-down at all,
not even for one second.  So if it's a drive error, why are file system
reads halted almost completely, and writes halted altogether, yet drive
reads at the RAID array level continue unabated at an aggregate of more than
130,000 blocks - 240,000 blocks (500 - 940 megabits) per second?  I tried a
second copy and again the file system accesses to the drives halted
altogether.  The block reads (which had been alternating with writes after
the new transfer proceses were implemented) again jumped to between 13,000
and 24,000.  This time I used a stopwatch, and the halt was 18 minutes 21
seconds - I believe the longest ever.  There is absolutely no way it would
take a drive almost 20 minutes to mark a block bad.  The dirty blocks grew
to more than 78 Megabytes.  I just did a 3rd cp of the /etc/hosts file to
the array, and once again it locked the machine for what is likely to be
another 15 - 20 minutes.  I tried forcing a sync, but it also hung.

<Sigh>  The next three days are going to be Hell, again.  It's going to be
all but impossible to edit a file until the RAID resync completes.  It's
often really bad under ordinary loads, but when the resync is underway, it's
beyond absurd.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
@ 2009-04-04 17:05 Lelsie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Lelsie Rhorer @ 2009-04-04 17:05 UTC (permalink / raw)
  To: 'Linux RAID'

> One thing that can cause this sort of behaviour is if the filesystem is in
> the middle of a sync and has to complete it before the create can
> complete, and the sync is writing out many megabytes of data.
> 
> You can see if this is happening by running
> 
>      watch 'grep Dirty /proc/meminfo'

OK, I did this.

> if that is large when the hang starts, and drops down to zero, and the
> hang lets go when it hits (close to) zero, then this is the problem.

No, not really.  The value of course rises and falls erratically during
normal operation (anything from a few dozen K to 200 Megs), but it is not
necessarily very high at the event onset.  When the halt occurs it drops
from whatever value it may have (perhaps 256K or so) to 16K, and then slowly
rises to several hundred K until the event terminates.

> If that doesn't turn out to be the problem, then knowing how the
> "Dirty" count is behaving might still be useful, and I would probably
> look at what processes are in 'D' state, (ps axgu)

Well, nothing surprising, there.  The process(es) involved with the
transfer(s) are dirty (D+), as well as the trigger process (for testing, I
simply copy /etc/hosts over to a directory on the RAID array), and pdflush
had a D state (no plus), but that's all.

> and look at their stack (/proc/$PID/stack)..

Um, I thought I knew what you meant by this, but apparently not.  I tried to
`cat /proc/<PID of the process with a D status>/stack`, but the system
returns "cat: /proc/8005/stack: No such file or directory".  What did I do
wrong?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re:
@ 2009-04-02 13:35 Andrew Burgess
  2009-04-04  5:57 ` RAID halting Lelsie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: Andrew Burgess @ 2009-04-02 13:35 UTC (permalink / raw)
  To: lrhorer; +Cc: linux-raid

On Wed, 2009-04-01 at 23:16 -0500, Lelsie Rhorer wrote:

> The issue is the entire array will occasionally pause completely for about
> 40 seconds when a file is created. 

I had symptoms like this once. It turned out to be a defective disk. The
disk would never return a read or write error but just intermittently
took a really long time to respond.

I found it by running atop. All the other drives would be running at low
utilization and this one drive would be at 100% when the symptoms
occurred (which in atop gets colored red so it jumps out at you)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-02 13:35 Andrew Burgess
@ 2009-04-04  5:57 ` Lelsie Rhorer
  2009-04-04 13:01   ` Andrew Burgess
  0 siblings, 1 reply; 84+ messages in thread
From: Lelsie Rhorer @ 2009-04-04  5:57 UTC (permalink / raw)
  To: 'Linux RAID'

>> The issue is the entire array will occasionally pause completely for
about
>> 40 seconds when a file is created. 

>I had symptoms like this once. It turned out to be a defective disk. The
>disk would never return a read or write error but just intermittently
>took a really long time to respond.

>I found it by running atop. All the other drives would be running at low
>utilization and this one drive would be at 100% when the symptoms
>occurred (which in atop gets colored red so it jumps out at you)

Thanks.  I gave this a try, but not being at all familiar with atop, I'm not
sure what, if anything, the results mean in terms of any additional
diagnostic data.  Depending somewhat upon the I/O load on the RAID array,
atop sometimes reports the drive utilization on several or all of the drives
to be well in excess of 85% - occasionally even 99%, but never flat 100% at
any time.  Oddly, even under relatively light loads of 20 or 30 Mbps,
sometimes the RAID members would show utilization in the high 90s, usually
on all the drives on a multiplier channel. I don't know if this is ordinary
behavior for atop, but all the drives also periodically disappear from the
status display.  Additionally, while atop is running and I am using my usual
video editor, Video Redo, on a Windows workstation to stream video from the
server, every time atop updates, the video and audio skip when reading from
a drive not on the RAID array.  I did not notice the same behavior from the
RAID array.  Odd.

Anyway, on to the diagnostics.

I ran both `atop` and `watch iostat 1 2` concurrently and triggered several
events while under heavy load ( >450 Mbps, total ). In atop, drives sdb,
sdd, sde, sdg, and sdi consistently disappeared from atop entirely, and
writes for the other drives fell to dead zero.  Reads fell to a very small
number.  The iostat session returned information in agreement with atop:
both reads and writes for sdb, sdd, sde, sdg, sdi, and md0 all fell to dead
zero from nominal values frequently exceeding 20,000 reads / sec and 5000
writes / sec.  Meanwhile, writes to sda, sdc, sdf, sdh, and sdj also dropped
to dead zero, but reads only fell to between 230 and 256 reads/sec.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-04  5:57 ` RAID halting Lelsie Rhorer
@ 2009-04-04 13:01   ` Andrew Burgess
  2009-04-04 14:39     ` Lelsie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: Andrew Burgess @ 2009-04-04 13:01 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

On Sat, 2009-04-04 at 00:57 -0500, Lelsie Rhorer wrote:

> >> The issue is the entire array will occasionally pause completely for
> about 40 seconds when a file is created. 
> 
> >I had symptoms like this once. It turned out to be a defective disk. The
> >disk would never return a read or write error but just intermittently
> >took a really long time to respond.
> 
> >I found it by running atop. All the other drives would be running at low
> >utilization and this one drive would be at 100% when the symptoms
> >occurred (which in atop gets colored red so it jumps out at you)
> 
> Thanks.  I gave this a try, but not being at all familiar with atop, I'm not
> sure what, if anything, the results mean in terms of any additional
> diagnostic data.

It's the same info as iostat just in color

> Depending somewhat upon the I/O load on the RAID array,
> atop sometimes reports the drive utilization on several or all of the drives
> to be well in excess of 85% - occasionally even 99%, but never flat 100% at
> any time.  

High 90's is what I ment by 100% :-)

> Oddly, even under relatively light loads of 20 or 30 Mbps,
> sometimes the RAID members would show utilization in the high 90s, usually
> on all the drives on a multiplier channel.

I think that's the filesystem buffering and then writing all at once.
It's normal if it's periodic; they go briefly to ~100% and then back to
~0%?

Did you watch the atop display when the problem occurred?

> I don't know if this is ordinary
> behavior for atop, but all the drives also periodically disappear from the
> status display.

That's a config option (and I find the default annoying). atop also
sorts the drives by utilization every second which can be a little hard
to watch. But if you have the problem I had then that one drive stays at
the top of the list when the problem occurs.

> Additionally, while atop is running and I am using my usual
> video editor, Video Redo, on a Windows workstation to stream video from the
> server, every time atop updates, the video and audio skip when reading from
> a drive not on the RAID array.  I did not notice the same behavior from the
> RAID array.  Odd.

I think this is heavy /proc filesystem access which I have noticed can
screw up even realtime processes.

> Anyway, on to the diagnostics.
> 
> I ran both `atop` and `watch iostat 1 2` concurrently and triggered several
> events while under heavy load ( >450 Mbps, total ). In atop, drives sdb,
> sdd, sde, sdg, and sdi consistently disappeared from atop entirely, and
> writes for the other drives fell to dead zero.  Reads fell to a very small
> number.  The iostat session returned information in agreement with atop:
> both reads and writes for sdb, sdd, sde, sdg, sdi, and md0 all fell to dead
> zero from nominal values frequently exceeding 20,000 reads / sec and 5000
> writes / sec.  Meanwhile, writes to sda, sdc, sdf, sdh, and sdj also dropped
> to dead zero, but reads only fell to between 230 and 256 reads/sec.

I used:

  iostat -t -k -x 1 | egrep -v 'sd.[0-9]'

to get percent utilization and not show each partition but just whole
drives.

For atop you want the -f option to 'fixate' the number of lines so
drives with zero utilization don't disappear.

If you didn't get constant 100% utilization while the event occurred
then I guess you don't have the problem I had.

Does the sata multiplier have it's own driver and if so, is it the
latest? Any other complaints on the net about it? I would think a
problem there would show up as 100% utilization though...

And I think you already said the cpu usage is low when the event occurs?
No one core at near 100%? (atop would show this too...)



^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-04 13:01   ` Andrew Burgess
@ 2009-04-04 14:39     ` Lelsie Rhorer
  2009-04-04 15:04       ` Andrew Burgess
  0 siblings, 1 reply; 84+ messages in thread
From: Lelsie Rhorer @ 2009-04-04 14:39 UTC (permalink / raw)
  To: 'Linux RAID'

> I think that's the filesystem buffering and then writing all at once.
> It's normal if it's periodic; they go briefly to ~100% and then back to
> ~0%?

Yes.

> > I don't know if this is ordinary
> > behavior for atop, but all the drives also periodically disappear from
> the
> > status display.
> 
> That's a config option (and I find the default annoying).

Yeah, me, too.

> sorts the drives by utilization every second which can be a little hard
> to watch. But if you have the problem I had then that one drive stays at
> the top of the list when the problem occurs.

No.

> I used:
> 
>   iostat -t -k -x 1 | egrep -v 'sd.[0-9]'
> 
> to get percent utilization and not show each partition but just whole
> drives.

Since there are no partitions, it shouldn't make a difference.

> For atop you want the -f option to 'fixate' the number of lines so
> drives with zero utilization don't disappear.

Well, diagnostically, I think the situation is clear.  All ten drives stop
writing completely.  Five of the ten stop reading, and the other five slow
their reads to a dribble - always the same five drives.

> Does the sata multiplier have it's own driver and if so, is it the
> latest? Any other complaints on the net about it? I would think a
> problem there would show up as 100% utilization though...

Multipliers - three of them, and no, they require no driver.  The SI
controller's drivers are included in the distro.

> And I think you already said the cpu usage is low when the event occurs?
> No one core at near 100%? (atop would show this too...)

Nowhere near.  Typically both cores are running below 25%, depending upon
what processes are running, of course.  I have the Gnome system monitor up,
and the graphs don't spike when the event occurs.  Of course, if there is a
local drive access process which uses lots of CPU horsepower, such as
ffmpeg, then when the array halt occurs, the CPU utilization falls right
off.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-04 14:39     ` Lelsie Rhorer
@ 2009-04-04 15:04       ` Andrew Burgess
  2009-04-04 15:15         ` Lelsie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: Andrew Burgess @ 2009-04-04 15:04 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

On Sat, 2009-04-04 at 09:39 -0500, Lelsie Rhorer wrote:

> Well, diagnostically, I think the situation is clear.  All ten drives stop
> writing completely.  Five of the ten stop reading, and the other five slow
> their reads to a dribble - always the same five drives.

So the delay seems to be hiding in the kernel else the userspace tools
would see it (they see some kernel stuff too, like utilization)

Oprofile is supposed to be good for user and kernel profiling but I
don't know if it can find non-cpu bound stuff. There are also a bunch of
latency analysis tools in the kernel that were used for realtime tuning,
they might show where something is getting stuck. Andrew Morton did alot
of work in this area.

If the cpu was spinning somewhere it would show as system time so it
must be waiting for a timer or some other event (wild guessing). It's as
if the i/o completion never arrives but some timer eventually goes off
and maybe the i/o is retried and everything gets back on track? But that
should cause utilization to go up and you'd think some sort of
message... 

Perhaps the ide list would know of some diagnostic knobs to tweak.

It's a puzzler...

One last thing, the cpu goes toward 100% idle not wait?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-04 15:04       ` Andrew Burgess
@ 2009-04-04 15:15         ` Lelsie Rhorer
  2009-04-04 16:39           ` Andrew Burgess
  0 siblings, 1 reply; 84+ messages in thread
From: Lelsie Rhorer @ 2009-04-04 15:15 UTC (permalink / raw)
  To: 'Linux RAID'

> Oprofile is supposed to be good for user and kernel profiling but I
> don't know if it can find non-cpu bound stuff. There are also a bunch of
> latency analysis tools in the kernel that were used for realtime tuning,
> they might show where something is getting stuck. Andrew Morton did alot
> of work in this area.

Do you know if he subscribes to this list?  If not, how may I reach him?

> If the cpu was spinning somewhere it would show as system time so it
> must be waiting for a timer or some other event (wild guessing). It's as
> if the i/o completion never arrives but some timer eventually goes off
> and maybe the i/o is retried and everything gets back on track? But that
> should cause utilization to go up and you'd think some sort of
> message...

It's also puzzling why the issue is so much worse when the array diagnostic
is running.  Almost every file creation triggers a halt, and the halt time
extends to 2 minutes.  This is one thing which makes me tend to think it is
an interaction between the file system and the RAID system, rather than
either one alone.

> Perhaps the ide list would know of some diagnostic knobs to tweak.

> One last thing, the cpu goes toward 100% idle not wait?

Neither one.  If the drive access is tied to a local CPU intensive user
application, for example ffmpeg, then of course CPU utilization dips, but if
all the drive accesses are via network utilities (ftp, SAMBA, etc), I
haven't noticed any change in CPU utilization.  Simultaneous reads and
writes to other local drives or network drives continue without a hiccough. 

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-04 15:15         ` Lelsie Rhorer
@ 2009-04-04 16:39           ` Andrew Burgess
  0 siblings, 0 replies; 84+ messages in thread
From: Andrew Burgess @ 2009-04-04 16:39 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

On Sat, 2009-04-04 at 10:15 -0500, Lelsie Rhorer wrote:

> > Oprofile is supposed to be good for user and kernel profiling but I
> > don't know if it can find non-cpu bound stuff. There are also a bunch of
> > latency analysis tools in the kernel that were used for realtime tuning,
> > they might show where something is getting stuck. Andrew Morton did alot
> > of work in this area.
> 
> Do you know if he subscribes to this list?  If not, how may I reach him?

He's on the kernel list which I seldom read nowadays. His email used to
be akpm@something I think.

He worked on ftrace, documented in the kernel source
in /usr/src/linux/Documentation/ftrace.txt

An excerpt:

"Ftrace is an internal tracer designed to help out developers and
designers of systems to find what is going on inside the kernel.
It can be used for debugging or analyzing latencies and performance
issues that take place outside of user-space.

Although ftrace is the function tracer, it also includes an
infrastructure that allows for other types of tracing. Some of the
tracers that are currently in ftrace include a tracer to trace
context switches, the time it takes for a high priority task to
run after it was woken up, the time interrupts are disabled, and
more (ftrace allows for tracer plugins, which means that the list of
tracers can always grow)."

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re:
@ 2009-04-02  7:33 Peter Grandi
  2009-04-02 23:01 ` RAID halting Lelsie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: Peter Grandi @ 2009-04-02  7:33 UTC (permalink / raw)
  To: Linux RAID

> The issue is the entire array will occasionally pause completely
> for about 40 seconds when a file is created. [ ... ] During heavy
> file transfer activity, sometimes the system halts with every
> other file creation. [ ... ] There are other drives formatted
> with other file systems on the machine, but the issue has never
> been seen on any of the other drives.  When the array runs its
> regularly scheduled health check, the problem is much worse. [
> ... ]

Looks like that either you have hw issues (transfer errors, bad
blocks) or more likely the cache flusher and elevator settings have
not been tuned for a steady flow.

> How can I troubleshoot and more importantly resolve this issue?

Well, troubleshooting would require a good understanding of file
system design and storage subsystem design, and quite a bit of time.

However, for hardware errors check the kernel logs, and for cache
flusher and elevator settings check the 'bi'/'bo' numbers of
'vmstat 1' while the pause happens.

For a deeper profile of per-drive IO run 'watch iostat 1 2' while
this is happening. This might also help indicate drive errors (no
IO is happening) or flusher/elevator tuning issues (lots of IO is
happening suddenfly).

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-02  7:33 Peter Grandi
@ 2009-04-02 23:01 ` Lelsie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Lelsie Rhorer @ 2009-04-02 23:01 UTC (permalink / raw)
  To: 'Linux RAID'

>> The issue is the entire array will occasionally pause completely
>> for about 40 seconds when a file is created. [ ... ] During heavy
>> file transfer activity, sometimes the system halts with every
>> other file creation. [ ... ] There are other drives formatted
>> with other file systems on the machine, but the issue has never
>> been seen on any of the other drives.  When the array runs its
>> regularly scheduled health check, the problem is much worse. [
>> ... ]

>Looks like that either you have hw issues (transfer errors, bad
>blocks) or more likely the cache flusher and elevator settings have
>not been tuned for a steady flow.

That doesn't sound right.  I can read and write all day long at up to 450
Mbps in both directions continuously for hours at a time.  It's only when a
file is created, even a file of only a few bytes, that the issue occurs, and
then not always.  Indeed, earlier today I had transfers going with an
average throughput of more than 300 Mbps total and despite creating more
than 20 new files, not once did the transfers halt.

> How can I troubleshoot and more importantly resolve this issue?

>Well, troubleshooting would require a good understanding of file
>system design and storage subsystem design, and quite a bit of time.

>However, for hardware errors check the kernel logs, and for cache
>flusher and elevator settings check the 'bi'/'bo' numbers of
>'vmstat 1' while the pause happens.

I've already done that.  There are no errors of any sort in the kernel log.
Vmstat only tells me both bi and bo are zero, which we already knew.  I've
tried ps, iostat, vmstat, and top, and nothing provides anything of any
significance I can see except that resierfs is waiting on md, which we
already knew, and (as I recall - it's been a couple of weeks) the number of
bytes in and out of md0 falls to zero.

>For a deeper profile of per-drive IO run 'watch iostat 1 2' while
>this is happening. This might also help indicate drive errors (no
>IO is happening) or flusher/elevator tuning issues (lots of IO is
>happening suddenfly).

I'll give it a try.  I haven't been able to reproduce the issue today.
Usually it's pretty easy.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: your mail
@ 2009-04-02  6:56 Luca Berra
  2009-04-04  6:44 ` RAID halting Lelsie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: Luca Berra @ 2009-04-02  6:56 UTC (permalink / raw)
  To: linux-raid

On Wed, Apr 01, 2009 at 11:16:06PM -0500, Lelsie Rhorer wrote:
>8T of active space is formatted as a single Reiserfs file system.  The
....
>The issue is the entire array will occasionally pause completely for about
>40 seconds when a file is created.  This does not always happen, but the
>situation is easily reproducible.  The frequency at which the symptom
i wonder how costly b-tree operaton are for a 8Tb filesystem...

L.

-- 
Luca Berra -- bluca@comedia.it
         Communication Media & Services S.r.l.
  /"\
  \ /     ASCII RIBBON CAMPAIGN
   X        AGAINST HTML MAIL
  / \

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-02  6:56 your mail Luca Berra
@ 2009-04-04  6:44 ` Lelsie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Lelsie Rhorer @ 2009-04-04  6:44 UTC (permalink / raw)
  To: 'Linux RAID'

>>The issue is the entire array will occasionally pause completely for about
>>40 seconds when a file is created.  This does not always happen, but the
>>situation is easily reproducible.  The frequency at which the symptom
>i wonder how costly b-tree operaton are for a 8Tb filesystem...

I expect somewhat costly, but a 4 second to 2 minute halt just to create a
file of between 0 and 1000 bytes is ridiculous.  That, and as I said, it
doesn't always happen, by a long shot, not even when transfers in the 400
Mbps range are underway.  OTOH, I have had it happen when only a few Mbps
transfers were underway.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: Strange filesystem slowness with 8TB RAID6
@ 2009-04-02  4:38 NeilBrown
  2009-04-04  7:12 ` RAID halting Lelsie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: NeilBrown @ 2009-04-02  4:38 UTC (permalink / raw)
  To: lrhorer; +Cc: linux-raid

On Thu, April 2, 2009 3:16 pm, Lelsie Rhorer wrote:

> The issue is the entire array will occasionally pause completely for about
> 40 seconds when a file is created.  This does not always happen, but the
> situation is easily reproducible.  The frequency at which the symptom
> occurs seems to be related to the transfer load on the array.  If no other
> transfers are in process, then the failure seems somewhat more rare,
> perhaps accompanying less than 1 file creation in 10..  During heavy file
> transfer activity, sometimes the system halts with every other file
> creation.  Although I have observed many dozens of these events, I have
> never once observed it to happen except when a file creation occurs.
> Reading and writing existing files never triggers the event, although any
> read or write occurring during the event is halted for the duration.
...

> How can I troubleshoot and more importantly resolve this issue?

This sounds like a filesys problem rather than a RAID problem.

One thing that can cause this sort of behaviour is if the filesystem is in
the middle of a sync and has to complete it before the create can
complete, and the sync is writing out many megabytes of data.

You can see if this is happening by running

     watch 'grep Dirty /proc/meminfo'

if that is large when the hang starts, and drops down to zero, and the
hang lets go when it hits (close to) zero, then this is the problem.
The answer then is to keep that number low by writing a suitable
number into
   /proc/sys/vm/dirty_ratio   (a percentage of system RAM)
or
   /proc/sys/vm/dirty_bytes

If that doesn't turn out to be the problem, then knowing how the
"Dirty" count is behaving might still be useful, and I would probably
look at what processes are in 'D' state, (ps axgu) and look at their
stack (/proc/$PID/stack)..

You didn't say what kernel you are running.  It might make a difference.

NeilBrown

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: RAID halting
  2009-04-02  4:38 Strange filesystem slowness with 8TB RAID6 NeilBrown
@ 2009-04-04  7:12 ` Lelsie Rhorer
  2009-04-04 12:38   ` Roger Heflin
  0 siblings, 1 reply; 84+ messages in thread
From: Lelsie Rhorer @ 2009-04-04  7:12 UTC (permalink / raw)
  To: 'Linux RAID'

>This sounds like a filesys problem rather than a RAID problem.

I considered that.  It may well be.

>One thing that can cause this sort of behaviour is if the filesystem is in
>the middle of a sync and has to complete it before the create can
>complete, and the sync is writing out many megabytes of data.

For between 40 seconds and 2 minutes?  The drive subsystem can easily gulp
down 200 megabytes to 6000 megabytes in that period of time.  What synch
would be that large?  Secondly, the problem occurs also when there is
relatively little or no data being written to the array.  Finally, unless I
am misunderstanding at what layers iostat and atop are reporting the
traffic, the fact all drive writes invariably fall to dead zero during an
event and reads on precisely half the drives (and always the same drives)
drop to dead zero suggests to me this is not the case.

>You can see if this is happening by running

>     watch 'grep Dirty /proc/meminfo'

>if that is large when the hang starts, and drops down to zero, and the
>hang lets go when it hits (close to) zero, then this is the problem.

Thanks, I'll give it a try later today.  Right now I am dead tired, plus
there are some batches running I really don't want interrupted, and
triggering an event might halt them.

>The answer then is to keep that number low by writing a suitable
>number into
>   /proc/sys/vm/dirty_ratio   (a percentage of system RAM)
>or
>   /proc/sys/vm/dirty_bytes

Um, OK.  What would constitute suitable numbers, assuming it turns out to be
the issue?

>If that doesn't turn out to be the problem, then knowing how the
>"Dirty" count is behaving might still be useful, and I would probably
>look at what processes are in 'D' state, (ps axgu) and look at their
>stack (/proc/$PID/stack)..

I'll surely try that, too.

>You didn't say what kernel you are running.  It might make a difference.

>NeilBrown

Oh, sorry!  2.6.26-1-amd64  4G of RAM, with typically 600 - 800M in use.
The swap space is 4.7G, but the used swap space has never exceeded 200K.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: RAID halting
  2009-04-04  7:12 ` RAID halting Lelsie Rhorer
@ 2009-04-04 12:38   ` Roger Heflin
  0 siblings, 0 replies; 84+ messages in thread
From: Roger Heflin @ 2009-04-04 12:38 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

Lelsie Rhorer wrote:
>> This sounds like a filesys problem rather than a RAID problem.
> 
> I considered that.  It may well be.
> 
>> One thing that can cause this sort of behaviour is if the filesystem is in
>> the middle of a sync and has to complete it before the create can
>> complete, and the sync is writing out many megabytes of data.
> 
> For between 40 seconds and 2 minutes?  The drive subsystem can easily gulp
> down 200 megabytes to 6000 megabytes in that period of time.  What synch
> would be that large?  Secondly, the problem occurs also when there is
> relatively little or no data being written to the array.  Finally, unless I
> am misunderstanding at what layers iostat and atop are reporting the
> traffic, the fact all drive writes invariably fall to dead zero during an
> event and reads on precisely half the drives (and always the same drives)
> drop to dead zero suggests to me this is not the case.
> 
> 

If you have things setup such that you have lights on the disks, and 
can check the lights when the event is happening, often if a single 
drive is being slow it will be the only one with its lights on when 
things are going bad.

^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2009-05-03  2:23 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <49D7C19C.2050308@gmail.com>
2009-04-05  0:07 ` RAID halting Lelsie Rhorer
2009-04-05  0:49   ` Greg Freemyer
2009-04-05  5:34     ` Lelsie Rhorer
2009-04-05  7:16       ` Richard Scobie
2009-04-05  8:22         ` Lelsie Rhorer
2009-04-05 14:05           ` Drew
2009-04-05 18:54             ` Leslie Rhorer
2009-04-05 19:17               ` John Robinson
2009-04-05 20:00                 ` Greg Freemyer
2009-04-05 20:39                   ` Peter Grandi
2009-04-05 23:27                     ` Leslie Rhorer
2009-04-05 22:03                   ` Leslie Rhorer
2009-04-06 22:16                     ` Greg Freemyer
2009-04-07 18:22                       ` Leslie Rhorer
2009-04-24  4:52                   ` Leslie Rhorer
2009-04-24  6:50                     ` Richard Scobie
2009-04-24 10:03                       ` Leslie Rhorer
2009-04-28 19:36                         ` lrhorer
2009-04-24 15:24                     ` Andrew Burgess
2009-04-25  4:26                       ` Leslie Rhorer
2009-04-24 17:03                     ` Doug Ledford
2009-04-24 20:25                       ` Richard Scobie
2009-04-24 20:28                         ` CoolCold
2009-04-24 21:04                           ` Richard Scobie
2009-04-25  7:40                       ` Leslie Rhorer
2009-04-25  8:53                         ` Michał Przyłuski
2009-04-28 19:33                         ` Leslie Rhorer
2009-04-29 11:25                           ` John Robinson
2009-04-30  0:55                             ` Leslie Rhorer
2009-04-30 12:34                               ` John Robinson
2009-05-03  2:16                                 ` Leslie Rhorer
2009-05-03  2:23                           ` Leslie Rhorer
2009-04-24 20:25                     ` Greg Freemyer
2009-04-25  7:24                     ` Leslie Rhorer
2009-04-05 21:02                 ` Leslie Rhorer
2009-04-05 19:26               ` Richard Scobie
2009-04-05 20:40                 ` Leslie Rhorer
2009-04-05 20:57               ` Peter Grandi
2009-04-05 23:55                 ` Leslie Rhorer
2009-04-06 20:35                   ` jim owens
2009-04-07 17:47                     ` Leslie Rhorer
2009-04-07 18:18                       ` David Lethe
2009-04-08 14:17                         ` Leslie Rhorer
2009-04-08 14:30                           ` David Lethe
2009-04-09  4:52                             ` Leslie Rhorer
2009-04-09  6:45                               ` David Lethe
2009-04-08 14:37                           ` Greg Freemyer
2009-04-08 16:29                             ` Andrew Burgess
2009-04-09  3:24                               ` Leslie Rhorer
2009-04-10  3:02                               ` Leslie Rhorer
2009-04-10  4:51                                 ` Leslie Rhorer
2009-04-10 12:50                                   ` jim owens
2009-04-10 15:31                                   ` Bill Davidsen
2009-04-11  1:37                                     ` Leslie Rhorer
2009-04-11 13:02                                       ` Bill Davidsen
2009-04-10  8:53                                 ` David Greaves
2009-04-08 18:04                           ` Corey Hickey
2009-04-07 18:20                       ` Greg Freemyer
2009-04-08  8:45                       ` John Robinson
2009-04-09  3:34                         ` Leslie Rhorer
2009-04-05  7:33       ` Richard Scobie
2009-04-05  0:57   ` Roger Heflin
2009-04-05  6:30     ` Lelsie Rhorer
     [not found] <49F2A193.8080807@sauce.co.nz>
2009-04-25  7:03 ` Leslie Rhorer
     [not found] <49F21B75.7060705@sauce.co.nz>
2009-04-25  4:32 ` Leslie Rhorer
     [not found] <49D89515.3020800@computer.org>
2009-04-05 18:40 ` Leslie Rhorer
2009-04-05 14:22 FW: " David Lethe
2009-04-05 14:53 ` David Lethe
2009-04-05 20:33 ` Leslie Rhorer
2009-04-05 22:20   ` Peter Grandi
2009-04-06  0:31   ` Doug Ledford
2009-04-06  1:53     ` Leslie Rhorer
2009-04-06 12:37       ` Doug Ledford
  -- strict thread matches above, loose matches on Subject: below --
2009-04-05  5:33 David Lethe
2009-04-05  8:14 ` RAID halting Lelsie Rhorer
2009-04-04 17:05 Lelsie Rhorer
2009-04-02 13:35 Andrew Burgess
2009-04-04  5:57 ` RAID halting Lelsie Rhorer
2009-04-04 13:01   ` Andrew Burgess
2009-04-04 14:39     ` Lelsie Rhorer
2009-04-04 15:04       ` Andrew Burgess
2009-04-04 15:15         ` Lelsie Rhorer
2009-04-04 16:39           ` Andrew Burgess
2009-04-02  7:33 Peter Grandi
2009-04-02 23:01 ` RAID halting Lelsie Rhorer
2009-04-02  6:56 your mail Luca Berra
2009-04-04  6:44 ` RAID halting Lelsie Rhorer
2009-04-02  4:38 Strange filesystem slowness with 8TB RAID6 NeilBrown
2009-04-04  7:12 ` RAID halting Lelsie Rhorer
2009-04-04 12:38   ` Roger Heflin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).