linux-raid.vger.kernel.org archive mirror
* FW: RAID halting
@ 2009-04-05 14:22 David Lethe
  2009-04-05 14:53 ` David Lethe
  2009-04-05 20:33 ` Leslie Rhorer
  0 siblings, 2 replies; 84+ messages in thread
From: David Lethe @ 2009-04-05 14:22 UTC (permalink / raw)
  To: lrhorer, linux-raid

-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Lelsie Rhorer
Sent: Sunday, April 05, 2009 3:14 AM
To: linux-raid@vger.kernel.org
Subject: RE: RAID halting

> All of what you report is still consistent with delays caused by
> having to remap bad blocks

I disagree.  If it happened with some frequency during ordinary reads,
then I would agree.  If it happened without respect to the volume of
reads and writes on the system, then I would be less inclined to
disagree.

> The O/S will not report recovered errors, as this gets done internally
> by the disk drive, and the O/S never learns about it. (Queue depth

SMART is supposed to report this, and on rare occasions the kernel log
does report a block of sectors being marked bad by the controller.  I
cannot speak to the notion that SMART's reporting of relocated sectors
and failed relocations may be inaccurate, as I have no means to verify
it.

Actually, I should amend the first sentence: while the ten drives in
the array are almost never reporting any errors, there is another drive
in the chassis which is churning out error reports like a farm boy
spitting out watermelon seeds.  I had a 320G drive in another system
which was behaving erratically, so I moved it to the array chassis on
this machine to rule out the cable or the drive controller.  It's
reporting blocks being marked bad all over the place.
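
(For what it's worth, the reallocation counters can be pulled straight
from the drives.  Assuming smartmontools is installed and the drive is
/dev/sdb - substitute the real device name - something like

    smartctl -A /dev/sdb | egrep 'Reallocated|Pending|Uncorrect'

should show Reallocated_Sector_Ct, Current_Pending_Sector and
Offline_Uncorrectable, which is more detail than the kernel log will
ever give for errors the drive handles on its own.)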

> Really, if this was my system I would run non-destructive read tests
> on all blocks;

How does one do this?  Or rather, isn't this what the monthly mdadm
resync does?
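
(As far as I can tell, the monthly job just writes to md's sync_action
interface; assuming the array is /dev/md0, the manual equivalent would
be roughly

    echo check > /sys/block/md0/md/sync_action     # read and verify all stripes
    cat /proc/mdstat                               # watch progress
    cat /sys/block/md0/md/mismatch_cnt             # mismatches found so far

That reads every block through md and verifies parity, but it does not
run the drives' own internal self-tests.)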

> along with the embedded self-test on the disk.  It is often

How does one do this?

> a lot easier and more productive to eliminate what ISN'T the problem
> rather than chase all of the potential reasons for the problem.

I agree, which is why I am asking for troubleshooting methods and
utilities.

The monthly RAID array resync started a few minutes ago, and it is
providing some interesting results.  The number of blocks read per
second is consistently 13,000 - 24,000 on all ten drives.  There were
no other drive accesses of any sort at the time, so the number of
blocks written was flat zero on all drives in the array.  I copied the
/etc/hosts file to the RAID array, and instantly the file system
locked, but the array resync *DID NOT*.  The number of blocks read and
written per second continued to range from 13,000 to 24,000, with no
apparent halt or slow-down at all, not even for one second.  So if it's
a drive error, why are file system reads halted almost completely, and
writes halted altogether, yet drive reads at the RAID array level
continue unabated at an aggregate of more than 130,000 - 240,000 blocks
(500 - 940 megabits) per second?

I tried a second copy, and again the file system accesses to the drives
halted altogether.  The block reads (which had been alternating with
writes after the new transfer processes were implemented) again jumped
to between 13,000 and 24,000.  This time I used a stopwatch, and the
halt was 18 minutes 21 seconds - I believe the longest ever.  There is
absolutely no way it would take a drive almost 20 minutes to mark a
block bad.  The dirty blocks grew to more than 78 Megabytes.  I just
did a 3rd cp of the /etc/hosts file to the array, and once again it
locked the machine for what is likely to be another 15 - 20 minutes.  I
tried forcing a sync, but it also hung.

<Sigh>  The next three days are going to be Hell, again.  It's going to
be all but impossible to edit a file until the RAID resync completes.
It's often really bad under ordinary loads, but when the resync is
underway, it's beyond absurd.

======
Leslie: 
Respectfully, your statement, "SMART is supposed to report this" shows
you have no understanding of exactly what S.M.A.R.T. is and is not
supposed to report, nor do you know enough about hardware to make an
educated decision about what can and can not be contributing factors.
As such, you are not qualified to dismiss the necessity to run hardware
diagnostics.

A few other things - many SATA controller cards use poorly architected
bridge chips that spoof some of the ATA commands, so even if you
*think* you are kicking off one of the SMART subcommands, like SMART
EXECUTE OFF-LINE IMMEDIATE (feature code D4h, with subcommand 2h
selecting the extended self-test), it is possible, perhaps probable,
that they never actually get run -- yes, I am giving you the raw codes
so you can look them up and learn what they do.
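
(With smartmontools you can both kick these off and, importantly,
verify they actually ran - assuming the drive is /dev/sdb:

    smartctl -t long /dev/sdb       # start the extended self-test
    smartctl -l selftest /dev/sdb   # later: did it run and complete?
    smartctl -c /dev/sdb            # capabilities and polling time

If the self-test log never shows the test starting or completing, that
is a strong hint the controller or bridge chip is eating the command.)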

You want to know how it is possible that frequency or size of reads can
be a factor?
Do the math:
 * Look at the # of ECC bits you have on the disks (read the specs),
   and compare that with the trillions of bytes you have.  How
   frequently can you expect an unrecoverable ECC error based on that?
   (A rough worked example follows this list.)
 * What percentage of your farm are you actually testing with the tests
   you have run so far?  Is it even close to being statistically
   significant?
 * Do you know what physical blocks on each disk are being read/written
   with the tests you mention?  If you do not know, then how do you
   know that the short tests are doing I/O on blocks that need to be
   repaired, and that subsequent tests run OK only because those blocks
   were just repaired?
 * Did you look into firmware?  Are the drives and/or firmware
   revisions qualified by your controller vendor?
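
To make the first point concrete - a back-of-the-envelope sketch only,
assuming the commonly quoted consumer SATA spec of one unrecoverable
read error per 10^14 bits and roughly 8 TB read per full pass over the
array:

    echo 'scale=2; (8 * 10^12 * 8) / 10^14' | bc
    # ~0.64 expected unrecoverable errors per complete read of the array

In other words, a latent unreadable sector somewhere in a farm that
size is not an exotic event; it is close to an even-odds bet on every
full scan.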

I've been in the storage business for over 10 years, writing everything
from RAID firmware and configurators to disk diagnostics and test bench
suites.  I even have my own company that writes storage diagnostics.  I
think I know a little more about diagnostics and what can and cannot
happen.  You said before that you do not agree with my earlier
statements.  I doubt that you will find any experienced storage
professional who wouldn't tell you to break it all down and run a full
block-level DVT before going further.  It could all have been done over
the weekend if you had the right setup, and then you would know a lot
more than you know now.

At this point all you have done is tell the people who suggest hardware
is the cause that they are wrong, and then tell us why you think we are
wrong.  Frankly, be lazy and don't run diagnostics - you had just
better not be a government employee, or in charge of a database that
contains financial, medical, or other such information, and you had
better be running hot backups.

If you still refuse to run a full block-level hardware test, then ask
yourself how much longer you will allow this to go on before you run
such a test - or are you just going to continue down this path, waiting
for somebody to give you a magic command to type in that will fix
everything?

I am not the one who is putting my job on the line at best and, at
worst, looking at a criminal violation for not taking appropriate
actions to protect certain data.  I make no apology for beating you up
on this.  You need to hear it.




* RE:
@ 2009-04-05  5:33 David Lethe
  2009-04-05  8:14 ` RAID halting Lelsie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: David Lethe @ 2009-04-05  5:33 UTC (permalink / raw)
  To: lrhorer, linux-raid

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Lelsie Rhorer
> Sent: Sunday, April 05, 2009 12:21 AM
> To: linux-raid@vger.kernel.org
> Subject: RE:
> 
> > You should note that the drive won't know a sector it just wrote is
> > bad until it reads it
> 
> Yes, but it also won't halt the write for 40 seconds because it was
> bad.  More to the point, there is no difference at the drive level
> between a bad sector written for a 30 GB file and a 30 byte file.
> 
> > ....are you sure you actually successfully wrote all of that data
> > and that it is still there?
> 
> Pretty sure, yeah.  There are no errors in the filesystem, and every
> file I have written works.  Again, however, the point is there is
> never a problem once the file is created, no matter how long it takes
> to write it out to disk.  The moment the file is created, however,
> there may be up to a 2 minute delay in writing its data to the drive.
> 
> > And it is not the writes that kill when you have a drive going bad,
> > it is the reads of the bad sectors.  And to create a file, a number
> > of things will likely need to be read to finish the file creation,
> > and if one of those is a bad sector things get ugly.
> 
> Well, I agree to some extent, except: why would it be loosely related
> to the volume of drive activity, and why is it that 5 drives stop
> reading altogether and 5 do not?  Furthermore, every single video file
> gets read, re-written, edited, re-written again, and finally read
> again at least once, sometimes several times, before being finally
> archived.  Why does the kernel log never report any errors of any
> sort?
> 

All of what you report is still consistent with delays caused by having
to remap bad blocks.  The O/S will not report recovered errors, as this
gets done internally by the disk drive, and the O/S never learns about
it.  (Queue depth settings can account for some of the other
"weirdness" you reported.)

Really, if this were my system I would run non-destructive read tests
on all blocks, along with the embedded self-test on the disk.  It is
often a lot easier and more productive to eliminate what ISN'T the
problem rather than chase all of the potential reasons for the problem.
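
For example (a sketch only, assuming the drive is /dev/sdb and the
array is otherwise idle while it runs):

    badblocks -sv /dev/sdb          # read-only, non-destructive surface scan
    smartctl -t long /dev/sdb       # drive's embedded extended self-test
    smartctl -l selftest /dev/sdb   # check the self-test result afterwards

badblocks in its default mode only reads, so it is safe on a live disk,
but it takes hours per drive and will skew any timing measurements
while it runs.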



* RE: RAID halting
@ 2009-04-04 17:05 Lelsie Rhorer
  0 siblings, 0 replies; 84+ messages in thread
From: Lelsie Rhorer @ 2009-04-04 17:05 UTC (permalink / raw)
  To: 'Linux RAID'

> One thing that can cause this sort of behaviour is if the filesystem is in
> the middle of a sync and has to complete it before the create can
> complete, and the sync is writing out many megabytes of data.
> 
> You can see if this is happening by running
> 
>      watch 'grep Dirty /proc/meminfo'

OK, I did this.

> if that is large when the hang starts, and drops down to zero, and the
> hang lets go when it hits (close to) zero, then this is the problem.

No, not really.  The value of course rises and falls erratically during
normal operation (anything from a few dozen K to 200 Megs), but it is not
necessarily very high at the event onset.  When the halt occurs it drops
from whatever value it may have (perhaps 256K or so) to 16K, and then slowly
rises to several hundred K until the event terminates.

> If that doesn't turn out to be the problem, then knowing how the
> "Dirty" count is behaving might still be useful, and I would probably
> look at what processes are in 'D' state, (ps axgu)

Well, nothing surprising there.  The process(es) involved with the
transfer(s) are in D+ state, as is the trigger process (for testing, I
simply copy /etc/hosts over to a directory on the RAID array), and
pdflush was in a D state (no plus), but that's all.

> and look at their stack (/proc/$PID/stack)..

Um, I thought I knew what you meant by this, but apparently not.  I tried to
`cat /proc/<PID of the process with a D status>/stack`, but the system
returns "cat: /proc/8005/stack: No such file or directory".  What did I do
wrong?


* Re:
@ 2009-04-02 13:35 Andrew Burgess
  2009-04-04  5:57 ` RAID halting Lelsie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: Andrew Burgess @ 2009-04-02 13:35 UTC (permalink / raw)
  To: lrhorer; +Cc: linux-raid

On Wed, 2009-04-01 at 23:16 -0500, Lelsie Rhorer wrote:

> The issue is the entire array will occasionally pause completely for about
> 40 seconds when a file is created. 

I had symptoms like this once. It turned out to be a defective disk. The
disk would never return a read or write error but just intermittently
took a really long time to respond.

I found it by running atop.  All the other drives would be running at
low utilization and this one drive would be at 100% when the symptoms
occurred (which atop colors red, so it jumps out at you).



* Re:
@ 2009-04-02  7:33 Peter Grandi
  2009-04-02 23:01 ` RAID halting Lelsie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: Peter Grandi @ 2009-04-02  7:33 UTC (permalink / raw)
  To: Linux RAID


> The issue is the entire array will occasionally pause completely
> for about 40 seconds when a file is created. [ ... ] During heavy
> file transfer activity, sometimes the system halts with every
> other file creation. [ ... ] There are other drives formatted
> with other file systems on the machine, but the issue has never
> been seen on any of the other drives.  When the array runs its
> regularly scheduled health check, the problem is much worse. [
> ... ]

It looks like either you have hardware issues (transfer errors, bad
blocks) or, more likely, the cache flusher and elevator settings have
not been tuned for a steady flow.

> How can I troubleshoot and more importantly resolve this issue?

Well, troubleshooting would require a good understanding of file
system design and storage subsystem design, and quite a bit of time.

However, for hardware errors check the kernel logs, and for cache
flusher and elevator settings check the 'bi'/'bo' numbers from
'vmstat 1' while the pause happens.

For a deeper profile of per-drive IO run 'watch iostat 1 2' while this
is happening.  This might also help indicate drive errors (no IO is
happening) or flusher/elevator tuning issues (lots of IO is happening
suddenly).
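
A per-drive view with extended statistics can also make a slow drive
obvious; assuming the sysstat package is installed:

    iostat -x 1

and watch the await and %util columns while a hang is in progress - one
drive pegged near 100% utilization with a huge await while its peers
sit idle points at hardware rather than tuning.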

* Re: your mail
@ 2009-04-02  6:56 Luca Berra
  2009-04-04  6:44 ` RAID halting Lelsie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: Luca Berra @ 2009-04-02  6:56 UTC (permalink / raw)
  To: linux-raid

On Wed, Apr 01, 2009 at 11:16:06PM -0500, Lelsie Rhorer wrote:
>8T of active space is formatted as a single Reiserfs file system.  The
....
>The issue is the entire array will occasionally pause completely for about
>40 seconds when a file is created.  This does not always happen, but the
>situation is easily reproducible.  The frequency at which the symptom
I wonder how costly b-tree operations are for an 8 TB filesystem...

L.

-- 
Luca Berra -- bluca@comedia.it
         Communication Media & Services S.r.l.
  /"\
  \ /     ASCII RIBBON CAMPAIGN
   X        AGAINST HTML MAIL
  / \

* Re: Strange filesystem slowness with 8TB RAID6
@ 2009-04-02  4:38 NeilBrown
  2009-04-04  7:12 ` RAID halting Lelsie Rhorer
  0 siblings, 1 reply; 84+ messages in thread
From: NeilBrown @ 2009-04-02  4:38 UTC (permalink / raw)
  To: lrhorer; +Cc: linux-raid

On Thu, April 2, 2009 3:16 pm, Lelsie Rhorer wrote:

> The issue is the entire array will occasionally pause completely for about
> 40 seconds when a file is created.  This does not always happen, but the
> situation is easily reproducible.  The frequency at which the symptom
> occurs seems to be related to the transfer load on the array.  If no other
> transfers are in process, then the failure seems somewhat more rare,
> perhaps accompanying less than 1 file creation in 10..  During heavy file
> transfer activity, sometimes the system halts with every other file
> creation.  Although I have observed many dozens of these events, I have
> never once observed it to happen except when a file creation occurs.
> Reading and writing existing files never triggers the event, although any
> read or write occurring during the event is halted for the duration.
...

> How can I troubleshoot and more importantly resolve this issue?

This sounds like a filesystem problem rather than a RAID problem.

One thing that can cause this sort of behaviour is if the filesystem is in
the middle of a sync and has to complete it before the create can
complete, and the sync is writing out many megabytes of data.

You can see if this is happening by running

     watch 'grep Dirty /proc/meminfo'

if that is large when the hang starts, and drops down to zero, and the
hang lets go when it hits (close to) zero, then this is the problem.
The answer then is to keep that number low by writing a suitable
number into
   /proc/sys/vm/dirty_ratio   (a percentage of system RAM)
or
   /proc/sys/vm/dirty_bytes
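
For example (the numbers here are only illustrative, not a
recommendation):

   echo 5 > /proc/sys/vm/dirty_ratio           # allow at most 5% of RAM dirty
or, on kernels new enough to have it (2.6.29+):
   echo 268435456 > /proc/sys/vm/dirty_bytes   # cap dirty data at 256MB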

If that doesn't turn out to be the problem, then knowing how the
"Dirty" count is behaving might still be useful, and I would probably
look at what processes are in 'D' state, (ps axgu) and look at their
stack (/proc/$PID/stack)..
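
For example:

   ps axu | awk '$8 ~ /^D/'    # list processes stuck in D state
   cat /proc/<pid>/stack       # where each one is blocked in the kernel

(/proc/<pid>/stack needs a reasonably recent kernel (2.6.29+) built
with CONFIG_STACKTRACE; if it isn't there and sysrq is enabled,
'echo w > /proc/sysrq-trigger' dumps all blocked tasks to the kernel
log instead.)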

You didn't say what kernel you are running.  It might make a difference.

NeilBrown



Thread overview: 84+ messages
     [not found] <49D7C19C.2050308@gmail.com>
2009-04-05  0:07 ` RAID halting Lelsie Rhorer
2009-04-05  0:49   ` Greg Freemyer
2009-04-05  5:34     ` Lelsie Rhorer
2009-04-05  7:16       ` Richard Scobie
2009-04-05  8:22         ` Lelsie Rhorer
2009-04-05 14:05           ` Drew
2009-04-05 18:54             ` Leslie Rhorer
2009-04-05 19:17               ` John Robinson
2009-04-05 20:00                 ` Greg Freemyer
2009-04-05 20:39                   ` Peter Grandi
2009-04-05 23:27                     ` Leslie Rhorer
2009-04-05 22:03                   ` Leslie Rhorer
2009-04-06 22:16                     ` Greg Freemyer
2009-04-07 18:22                       ` Leslie Rhorer
2009-04-24  4:52                   ` Leslie Rhorer
2009-04-24  6:50                     ` Richard Scobie
2009-04-24 10:03                       ` Leslie Rhorer
2009-04-28 19:36                         ` lrhorer
2009-04-24 15:24                     ` Andrew Burgess
2009-04-25  4:26                       ` Leslie Rhorer
2009-04-24 17:03                     ` Doug Ledford
2009-04-24 20:25                       ` Richard Scobie
2009-04-24 20:28                         ` CoolCold
2009-04-24 21:04                           ` Richard Scobie
2009-04-25  7:40                       ` Leslie Rhorer
2009-04-25  8:53                         ` Michał Przyłuski
2009-04-28 19:33                         ` Leslie Rhorer
2009-04-29 11:25                           ` John Robinson
2009-04-30  0:55                             ` Leslie Rhorer
2009-04-30 12:34                               ` John Robinson
2009-05-03  2:16                                 ` Leslie Rhorer
2009-05-03  2:23                           ` Leslie Rhorer
2009-04-24 20:25                     ` Greg Freemyer
2009-04-25  7:24                     ` Leslie Rhorer
2009-04-05 21:02                 ` Leslie Rhorer
2009-04-05 19:26               ` Richard Scobie
2009-04-05 20:40                 ` Leslie Rhorer
2009-04-05 20:57               ` Peter Grandi
2009-04-05 23:55                 ` Leslie Rhorer
2009-04-06 20:35                   ` jim owens
2009-04-07 17:47                     ` Leslie Rhorer
2009-04-07 18:18                       ` David Lethe
2009-04-08 14:17                         ` Leslie Rhorer
2009-04-08 14:30                           ` David Lethe
2009-04-09  4:52                             ` Leslie Rhorer
2009-04-09  6:45                               ` David Lethe
2009-04-08 14:37                           ` Greg Freemyer
2009-04-08 16:29                             ` Andrew Burgess
2009-04-09  3:24                               ` Leslie Rhorer
2009-04-10  3:02                               ` Leslie Rhorer
2009-04-10  4:51                                 ` Leslie Rhorer
2009-04-10 12:50                                   ` jim owens
2009-04-10 15:31                                   ` Bill Davidsen
2009-04-11  1:37                                     ` Leslie Rhorer
2009-04-11 13:02                                       ` Bill Davidsen
2009-04-10  8:53                                 ` David Greaves
2009-04-08 18:04                           ` Corey Hickey
2009-04-07 18:20                       ` Greg Freemyer
2009-04-08  8:45                       ` John Robinson
2009-04-09  3:34                         ` Leslie Rhorer
2009-04-05  7:33       ` Richard Scobie
2009-04-05  0:57   ` Roger Heflin
2009-04-05  6:30     ` Lelsie Rhorer
     [not found] <49F2A193.8080807@sauce.co.nz>
2009-04-25  7:03 ` Leslie Rhorer
     [not found] <49F21B75.7060705@sauce.co.nz>
2009-04-25  4:32 ` Leslie Rhorer
     [not found] <49D89515.3020800@computer.org>
2009-04-05 18:40 ` Leslie Rhorer
2009-04-05 14:22 FW: " David Lethe
2009-04-05 14:53 ` David Lethe
2009-04-05 20:33 ` Leslie Rhorer
2009-04-05 22:20   ` Peter Grandi
2009-04-06  0:31   ` Doug Ledford
2009-04-06  1:53     ` Leslie Rhorer
2009-04-06 12:37       ` Doug Ledford
  -- strict thread matches above, loose matches on Subject: below --
2009-04-05  5:33 David Lethe
2009-04-05  8:14 ` RAID halting Lelsie Rhorer
2009-04-04 17:05 Lelsie Rhorer
2009-04-02 13:35 Andrew Burgess
2009-04-04  5:57 ` RAID halting Lelsie Rhorer
2009-04-04 13:01   ` Andrew Burgess
2009-04-04 14:39     ` Lelsie Rhorer
2009-04-04 15:04       ` Andrew Burgess
2009-04-04 15:15         ` Lelsie Rhorer
2009-04-04 16:39           ` Andrew Burgess
2009-04-02  7:33 Peter Grandi
2009-04-02 23:01 ` RAID halting Lelsie Rhorer
2009-04-02  6:56 your mail Luca Berra
2009-04-04  6:44 ` RAID halting Lelsie Rhorer
2009-04-02  4:38 Strange filesystem slowness with 8TB RAID6 NeilBrown
2009-04-04  7:12 ` RAID halting Lelsie Rhorer
2009-04-04 12:38   ` Roger Heflin
