* (unknown)
@ 2009-04-02 4:16 Lelsie Rhorer
  2009-04-02 4:22 ` David Lethe
                  ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread

From: Lelsie Rhorer @ 2009-04-02 4:16 UTC (permalink / raw)
To: linux-raid

I'm having a severe problem whose root cause I cannot determine.  I have a
RAID 6 array managed by mdadm running on Debian "Lenny" with a 3.2GHz AMD
Athlon 64 X2 processor and 8G of RAM.  There are ten 1 Terabyte SATA
drives, unpartitioned, fully allocated to the /dev/md0 device.  The drives
are served by 3 Silicon Image SATA port multipliers and a Silicon Image
4-port eSATA controller.  The /dev/md0 device is also unpartitioned, and
all 8T of active space is formatted as a single Reiserfs file system.  The
entire volume is mounted to /RAID.  Various directories on the volume are
shared using both NFS and SAMBA.

Performance of the RAID system is very good.  The array can read and write
at over 450 Mbps, and I don't know if the limit is the array itself or the
network, but since the performance is more than adequate I really am not
concerned which is the case.

The issue is that the entire array will occasionally pause completely for
about 40 seconds when a file is created.  This does not always happen, but
the situation is easily reproducible.  The frequency at which the symptom
occurs seems to be related to the transfer load on the array.  If no other
transfers are in process, then the failure seems somewhat more rare,
perhaps accompanying less than 1 file creation in 10.  During heavy file
transfer activity, sometimes the system halts with every other file
creation.  Although I have observed many dozens of these events, I have
never once observed it to happen except when a file creation occurs.
Reading and writing existing files never triggers the event, although any
read or write occurring during the event is halted for the duration.
(There is one cron job which runs every half-hour that creates a tiny
file; this is the most common failure vector.)  There are other drives
formatted with other file systems on the machine, but the issue has never
been seen on any of the other drives.  When the array runs its regularly
scheduled health check, the problem is much worse.  Not only does it lock
up with almost every single file creation, but the lock-up time is much
longer - sometimes in excess of 2 minutes.

Transfers via Linux based utilities (ftp, NFS, cp, mv, rsync, etc.) all
recover after the event, but SAMBA based transfers frequently fail, both
reads and writes.

How can I troubleshoot and, more importantly, resolve this issue?

^ permalink raw reply	[flat|nested] 24+ messages in thread
* RE: 2009-04-02 4:16 (unknown), Lelsie Rhorer @ 2009-04-02 4:22 ` David Lethe 2009-04-05 0:12 ` RE: Lelsie Rhorer 2009-04-02 4:38 ` Strange filesystem slowness with 8TB RAID6 NeilBrown ` (3 subsequent siblings) 4 siblings, 1 reply; 24+ messages in thread From: David Lethe @ 2009-04-02 4:22 UTC (permalink / raw) To: lrhorer, linux-raid > -----Original Message----- > From: linux-raid-owner@vger.kernel.org [mailto:linux-raid- > owner@vger.kernel.org] On Behalf Of Lelsie Rhorer > Sent: Wednesday, April 01, 2009 11:16 PM > To: linux-raid@vger.kernel.org > Subject: > > I'm having a severe problem whose root cause I cannot determine. I > have a > RAID 6 array managed by mdadm running on Debian "Lenny" with a 3.2GHz > AMD > Athlon 64 x 2 processor and 8G of RAM. There are ten 1 Terabyte SATA > drives, unpartitioned, fully allocated to the /dev/md0 device. The > drive > are served by 3 Silicon Image SATA port multipliers and a Silicon Image > 4 > port eSATA controller. The /dev/md0 device is also unpartitioned, and > all > 8T of active space is formatted as a single Reiserfs file system. The > entire volume is mounted to /RAID. Various directories on the volume > are > shared using both NFS and SAMBA. > > Performance of the RAID system is very good. The array can read and > write > at over 450 Mbps, and I don't know if the limit is the array itself or > the > network, but since the performance is more than adequate I really am > not > concerned which is the case. > > The issue is the entire array will occasionally pause completely for > about > 40 seconds when a file is created. This does not always happen, but > the > situation is easily reproducible. The frequency at which the symptom > occurs seems to be related to the transfer load on the array. If no > other > transfers are in process, then the failure seems somewhat more rare, > perhaps accompanying less than 1 file creation in 10.. During heavy > file > transfer activity, sometimes the system halts with every other file > creation. Although I have observed many dozens of these events, I have > never once observed it to happen except when a file creation occurs. > Reading and writing existing files never triggers the event, although > any > read or write occurring during the event is halted for the duration. > (There is one cron jog which runs every half-hour that creates a tiny > file; > this is the most common failure vector.) There are other drives > formatted > with other file systems on the machine, but the issue has never been > seen > on any of the other drives. When the array runs its regularly > scheduled > health check, the problem is much worse. Not only does it lock up with > almost every single file creation, but the lock-up time is much longer > - > sometimes in excess of 2 minutes. > > Transfers via Linux based utilities (ftp, NFS, cp, mv, rsync, etc) all > recover after the event, but SAMBA based transfers frequently fail, > both > reads and writes. > > How can I troubleshoot and more importantly resolve this issue? > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" > in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html I would try to first run hardware diagnostics. Maybe you will get "lucky" and one or more disks will fail diagnostics, which at least means it will be easy to repair the problem. This could very well be situation where you have a lot of bad blocks that have to get restriped, and parity has to be regenerated. 
Are these the cheap consumer SATA disk drives, or enterprise class disks? David ^ permalink raw reply [flat|nested] 24+ messages in thread
* RE:
  2009-04-02 4:22 ` David Lethe
@ 2009-04-05 0:12 ` Lelsie Rhorer
  2009-04-05 0:38   ` Greg Freemyer
  2009-04-05 0:45   ` Re: Roger Heflin
  0 siblings, 2 replies; 24+ messages in thread

From: Lelsie Rhorer @ 2009-04-05 0:12 UTC (permalink / raw)
To: linux-raid

> I would try to first run hardware diagnostics. Maybe you will get
> "lucky" and one or more disks will fail diagnostics, which at least
> means it will be easy to repair the problem.
>
> This could very well be a situation where you have a lot of bad blocks
> that have to get restriped, and parity has to be regenerated.  Are
> these the cheap consumer SATA disk drives, or enterprise class disks?

I don't buy that for a second.  First of all, restriping parity can and
does occur in the background.  Secondly, how is it the system writes many
terabytes of data post file creation, then chokes on a 0 byte file?

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re:
  2009-04-05 0:12 ` RE: Lelsie Rhorer
@ 2009-04-05 0:38 ` Greg Freemyer
  2009-04-05 5:05   ` Lelsie Rhorer
  2009-04-05 0:45 ` Re: Roger Heflin
  1 sibling, 1 reply; 24+ messages in thread

From: Greg Freemyer @ 2009-04-05 0:38 UTC (permalink / raw)
To: lrhorer; +Cc: linux-raid

On Sat, Apr 4, 2009 at 8:12 PM, Lelsie Rhorer <lrhorer@satx.rr.com> wrote:
>> I would try to first run hardware diagnostics. Maybe you will get
>> "lucky" and one or more disks will fail diagnostics, which at least
>> means it will be easy to repair the problem.
>>
>> This could very well be a situation where you have a lot of bad blocks
>> that have to get restriped, and parity has to be regenerated.  Are
>> these the cheap consumer SATA disk drives, or enterprise class disks?
>
> I don't buy that for a second.  First of all, restriping parity can and
> does occur in the background.  Secondly, how is it the system writes
> many terabytes of data post file creation, then chokes on a 0 byte file?

Alternate theory - serious fsync performance issue.

I don't know if it's related, but there is a lot of recent discussion
related to fsync causing large delays in ext3.  Linus is saying his
high-speed SSD is seeing multisecond delays.  He is very frustrated
because the SSD should be more or less instantaneous.

The current thread is http://markmail.org/message/adiyz3lri6tlueaf

In one of the other threads I saw someone saying that in one test they
had a fsync() call take minutes to return.  Apparently no one yet fully
understands what is going on.  Seems like something that could in some
way be related to what you are seeing.

Greg
--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

^ permalink raw reply	[flat|nested] 24+ messages in thread
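A quick way to test this theory on the array in question is to time a tiny
synchronous write while the usual bulk transfers are running.  The sketch
below only assumes GNU dd; the file names and sizes are placeholders:

    # Put the array under a streaming write load in the background.
    dd if=/dev/zero of=/RAID/bigfile bs=1M count=4096 &

    # Time creation plus fsync of a tiny file; conv=fsync makes dd call
    # fsync() before exiting, so any stall in the flush path shows up in
    # the elapsed time.
    time dd if=/dev/zero of=/RAID/tinyfile bs=1k count=1 conv=fsync

    # Repeat on a non-array filesystem for comparison.
    time dd if=/dev/zero of=/home/tinyfile bs=1k count=1 conv=fsync

If the second command takes tens of seconds only while the background load
is present, the stall is in the write-out/fsync path rather than in file
creation as such.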
* RE:
  2009-04-05 0:38 ` Greg Freemyer
@ 2009-04-05 5:05 ` Lelsie Rhorer
  2009-04-05 11:42  ` Greg Freemyer
  0 siblings, 1 reply; 24+ messages in thread

From: Lelsie Rhorer @ 2009-04-05 5:05 UTC (permalink / raw)
To: linux-raid

> Alternate theory - serious fsync performance issue.
>
> I don't know if it's related, but there is a lot of recent discussion
> related to fsync causing large delays in ext3.  Linus is saying his
> high-speed SSD is seeing multisecond delays.  He is very frustrated
> because the SSD should be more or less instantaneous.
>
> The current thread is http://markmail.org/message/adiyz3lri6tlueaf
>
> In one of the other threads I saw someone saying that in one test they
> had a fsync() call take minutes to return.  Apparently no one yet
> fully understands what is going on.  Seems like something that could
> in some way be related to what you are seeing.

Well, it could be.  I tried flushing the caches numerous times while
testing, but I never could see it made a difference one way or the other.

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: 2009-04-05 5:05 ` Lelsie Rhorer @ 2009-04-05 11:42 ` Greg Freemyer 0 siblings, 0 replies; 24+ messages in thread From: Greg Freemyer @ 2009-04-05 11:42 UTC (permalink / raw) To: lrhorer; +Cc: linux-raid On Sun, Apr 5, 2009 at 1:05 AM, Lelsie Rhorer <lrhorer@satx.rr.com> wrote: >> Alternate theory - serious fsync performance issue >> >> I don't know if it's related, but there is a lot of recent discussion >> related to fsync causing large delays in ext3. Linus is saying his >> highspeed SDD is seeing multisecond delays. He is very frustrated >> because the SDD should be more or less instantaneous. >> >> The current thread is http://markmail.org/message/adiyz3lri6tlueaf >> >> In one of the other threads I saw someone saying that in one test they >> had a fsync() call take minutes to return. Apparently no one yet >> fully understands what is going on. Seems like something that could >> in some way be related to what you are seeing. > > Well, it could be. I tried flushing the cashes numerous times while > testing, but I never could see it made a difference one way or the other. > In a separate thread you said it was reiser and what I have seen discussed is ext3, so you may be safe from that bug. As to flushing caches, I don't think that is the same thing. This bug specifically impacts fsyncs on a small file while a heavy i/o load is underway via other processes. The elevators were being discussed as part of the problem and fsync triggers different elevator logic than sync or drop_caches does. Greg -- Greg Freemyer Head of EDD Tape Extraction and Processing team Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer First 99 Days Litigation White Paper - http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 2009-04-05 0:12 ` RE: Lelsie Rhorer 2009-04-05 0:38 ` Greg Freemyer @ 2009-04-05 0:45 ` Roger Heflin 2009-04-05 5:21 ` Lelsie Rhorer 1 sibling, 1 reply; 24+ messages in thread From: Roger Heflin @ 2009-04-05 0:45 UTC (permalink / raw) To: lrhorer; +Cc: linux-raid Lelsie Rhorer wrote: >> I would try to first run hardware diagnostics. Maybe you will get >> "lucky" and one or more disks will fail diagnostics, which at least >> means it will be easy to repair the problem. >> >> This could very well be situation where you have a lot of bad blocks >> that have to get restriped, and parity has to be regenerated. Are >> these the cheap consumer SATA disk drives, or enterprise class disks? > > > I don't buy that for a second. First of all, restriping parity can and does > occur in the background. Secondly, how is it the system writes many > terrabytes of data post file creation, then chokes on a 0 byte file? > You should note that the drive won't know a sector it just wrote is bad until it reads it....are you sure you actually successfully wrote all of that data and that it is still there? And it is not the writes that kill when you have a drive going bad, it is the reads of the bad sectors. And to create a file, a number of things will likely need to be read to finish the file creation, and if one of those is a bad sector things get ugly. ^ permalink raw reply [flat|nested] 24+ messages in thread
* RE:
  2009-04-05 0:45 ` Re: Roger Heflin
@ 2009-04-05 5:21 ` Lelsie Rhorer
  2009-04-05 5:33   ` RE: David Lethe
  0 siblings, 1 reply; 24+ messages in thread

From: Lelsie Rhorer @ 2009-04-05 5:21 UTC (permalink / raw)
To: linux-raid

> You should note that the drive won't know a sector it just wrote is
> bad until it reads it

Yes, but it also won't halt the write for 40 seconds because it was bad.
More to the point, there is no difference at the drive level between a
bad sector written for a 30GB file and a 30 byte file.

> ... are you sure you actually successfully wrote all of that data and
> that it is still there?

Pretty sure, yeah.  There are no errors in the filesystem, and every file
I have written works.  Again, however, the point is there is never a
problem once the file is created, no matter how long it takes to write it
out to disk.  The moment the file is created, however, there may be up to
a 2 minute delay in writing its data to the drive.

> And it is not the writes that kill when you have a drive going bad, it
> is the reads of the bad sectors.  And to create a file, a number of
> things will likely need to be read to finish the file creation, and if
> one of those is a bad sector things get ugly.

Well, I agree to some extent, except why would it be loosely related to
the volume of drive activity, and why is it 5 drives stop reading
altogether and 5 do not?  Furthermore, every single video file gets read,
re-written, edited, re-written again, and finally read again at least
once, sometimes several times, before being finally archived.  Why does
the kernel log never report any errors of any sort?

^ permalink raw reply	[flat|nested] 24+ messages in thread
* RE:
  2009-04-05 5:21 ` Lelsie Rhorer
@ 2009-04-05 5:33 ` David Lethe
  2009-04-05 8:14   ` RAID halting Lelsie Rhorer
  0 siblings, 1 reply; 24+ messages in thread

From: David Lethe @ 2009-04-05 5:33 UTC (permalink / raw)
To: lrhorer, linux-raid

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org
> [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Lelsie Rhorer
> Sent: Sunday, April 05, 2009 12:21 AM
> To: linux-raid@vger.kernel.org
> Subject: RE:
>
> > You should note that the drive won't know a sector it just wrote is
> > bad until it reads it
>
> Yes, but it also won't halt the write for 40 seconds because it was bad.
> More to the point, there is no difference at the drive level between a
> bad sector written for a 30GB file and a 30 byte file.
>
> > ... are you sure you actually successfully wrote all of that data and
> > that it is still there?
>
> Pretty sure, yeah.  There are no errors in the filesystem, and every
> file I have written works.  Again, however, the point is there is never
> a problem once the file is created, no matter how long it takes to
> write it out to disk.  The moment the file is created, however, there
> may be up to a 2 minute delay in writing its data to the drive.
>
> > And it is not the writes that kill when you have a drive going bad, it
> > is the reads of the bad sectors.  And to create a file, a number of
> > things will likely need to be read to finish the file creation, and if
> > one of those is a bad sector things get ugly.
>
> Well, I agree to some extent, except why would it be loosely related to
> the volume of drive activity, and why is it 5 drives stop reading
> altogether and 5 do not?  Furthermore, every single video file gets
> read, re-written, edited, re-written again, and finally read again at
> least once, sometimes several times, before being finally archived.
> Why does the kernel log never report any errors of any sort?

All of what you report is still consistent with delays caused by having
to remap bad blocks.  The O/S will not report recovered errors, as this
gets done internally by the disk drive, and the O/S never learns about
it.  (Queue depth settings can account for some of the other "weirdness"
you reported.)

Really, if this was my system I would run non-destructive read tests on
all blocks, along with the embedded self-test on the disk.  It is often
a lot easier and more productive to eliminate what ISN'T the problem
rather than chase all of the potential reasons for the problem.

^ permalink raw reply	[flat|nested] 24+ messages in thread
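The checks described above can be run with smartmontools, badblocks, and
md's own scrub facility.  A sketch, assuming the ten members are /dev/sda
through /dev/sdj (adjust the device names to the actual setup; some
smartctl versions need '-d ata' or '-d sat' for SATA drives behind port
multipliers):

    # Kick off each drive's embedded (non-destructive) self-test, then
    # read back the results once the tests have finished.
    for d in /dev/sd[a-j]; do smartctl -t long "$d"; done
    smartctl -l selftest /dev/sda

    # SMART attributes most relevant to silent remapping.
    smartctl -A /dev/sda | egrep -i 'realloc|pending|uncorrect'

    # Read-only surface scan of one member; badblocks only writes if -w
    # or -n is given, so this does not touch the data.
    badblocks -sv /dev/sda

    # Have md itself read every block of every member and compare parity.
    echo check > /sys/block/md0/md/sync_action
    cat /proc/mdstat
    cat /sys/block/md0/md/mismatch_cnt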
* RE: RAID halting
  2009-04-05 5:33 ` RE: David Lethe
@ 2009-04-05 8:14 ` Lelsie Rhorer
  0 siblings, 0 replies; 24+ messages in thread

From: Lelsie Rhorer @ 2009-04-05 8:14 UTC (permalink / raw)
To: linux-raid

> All of what you report is still consistent with delays caused by having
> to remap bad blocks.

I disagree.  If it happened with some frequency during ordinary reads,
then I would agree.  If it happened without respect to the volume of
reads and writes on the system, then I would be less inclined to disagree.

> The O/S will not report recovered errors, as this gets done internally
> by the disk drive, and the O/S never learns about it.  (Queue depth

SMART is supposed to report this, and rarely the kernel log does report a
block of sectors being marked bad by the controller.  I cannot speak to
the notion that SMART's reporting of relocated sectors and failed
relocations may not be accurate, as I have no means to verify.  Actually,
I should amend the first sentence, because while the ten drives in the
array are almost never reporting any errors, there is another drive in
the chassis which is chunking out error reports like a farm boy spitting
out watermelon seeds.  I had a 320G drive in another system which was
behaving erratically, so I moved it to the array chassis on this machine
to eliminate it being a cable or the drive controller.  It's reporting
blocks being marked bad all over the place.

> Really, if this was my system I would run non-destructive read tests on
> all blocks,

How does one do this?  Or rather, isn't this what the monthly mdadm
resync does?

> along with the embedded self-test on the disk.  It is often

How does one do this?

> a lot easier and more productive to eliminate what ISN'T the problem
> rather than chase all of the potential reasons for the problem.

I agree, which is why I am asking for troubleshooting methods and
utilities.

The monthly RAID array resync started a few minutes ago, and it is
providing some interesting results.  The number of blocks read per second
is consistently 13,000 - 24,000 on all ten drives.  There were no other
drive accesses of any sort at the time, so the number of blocks written
was flat zero on all drives in the array.  I copied the /etc/hosts file
to the RAID array, and instantly the file system locked, but the array
resync *DID NOT*.  The number of blocks read and written per second
continued to range from 13,000 to 24,000 blocks/second, with no apparent
halt or slow-down at all, not even for one second.  So if it's a drive
error, why are file system reads halted almost completely, and writes
halted altogether, yet drive reads at the RAID array level continue
unabated at an aggregate of more than 130,000 - 240,000 blocks (500 - 940
megabits) per second?

I tried a second copy and again the file system accesses to the drives
halted altogether.  The block reads (which had been alternating with
writes after the new transfer processes were implemented) again jumped to
between 13,000 and 24,000.  This time I used a stopwatch, and the halt
was 18 minutes 21 seconds - I believe the longest ever.  There is
absolutely no way it would take a drive almost 20 minutes to mark a block
bad.  The dirty blocks grew to more than 78 Megabytes.  I just did a 3rd
cp of the /etc/hosts file to the array, and once again it locked the
machine for what is likely to be another 15 - 20 minutes.  I tried
forcing a sync, but it also hung.

<Sigh>  The next three days are going to be Hell, again.  It's going to
be all but impossible to edit a file until the RAID resync completes.
It's often really bad under ordinary loads, but when the resync is
underway, it's beyond absurd.

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: Strange filesystem slowness with 8TB RAID6 2009-04-02 4:16 (unknown), Lelsie Rhorer 2009-04-02 4:22 ` David Lethe @ 2009-04-02 4:38 ` NeilBrown 2009-04-04 7:12 ` RAID halting Lelsie Rhorer 2009-04-02 6:56 ` your mail Luca Berra ` (2 subsequent siblings) 4 siblings, 1 reply; 24+ messages in thread From: NeilBrown @ 2009-04-02 4:38 UTC (permalink / raw) To: lrhorer; +Cc: linux-raid On Thu, April 2, 2009 3:16 pm, Lelsie Rhorer wrote: > The issue is the entire array will occasionally pause completely for about > 40 seconds when a file is created. This does not always happen, but the > situation is easily reproducible. The frequency at which the symptom > occurs seems to be related to the transfer load on the array. If no other > transfers are in process, then the failure seems somewhat more rare, > perhaps accompanying less than 1 file creation in 10.. During heavy file > transfer activity, sometimes the system halts with every other file > creation. Although I have observed many dozens of these events, I have > never once observed it to happen except when a file creation occurs. > Reading and writing existing files never triggers the event, although any > read or write occurring during the event is halted for the duration. ... > How can I troubleshoot and more importantly resolve this issue? This sounds like a filesys problem rather than a RAID problem. One thing that can cause this sort of behaviour is if the filesystem is in the middle of a sync and has to complete it before the create can complete, and the sync is writing out many megabytes of data. You can see if this is happening by running watch 'grep Dirty /proc/meminfo' if that is large when the hang starts, and drops down to zero, and the hang lets go when it hits (close to) zero, then this is the problem. The answer then is to keep that number low by writing a suitable number into /proc/sys/vm/dirty_ratio (a percentage of system RAM) or /proc/sys/vm/dirty_bytes If that doesn't turn out to be the problem, then knowing how the "Dirty" count is behaving might still be useful, and I would probably look at what processes are in 'D' state, (ps axgu) and look at their stack (/proc/$PID/stack).. You didn't say what kernel you are running. It might make a difference. NeilBrown ^ permalink raw reply [flat|nested] 24+ messages in thread
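To make the suggestions above concrete, something like the following can
be used; the sysctl values are only illustrative starting points,
/proc/sys/vm/dirty_bytes only exists on kernels newer than the 2.6.26
mentioned later in the thread, and /proc/<pid>/stack requires a kernel
built with stack-trace support:

    # Watch how much dirty data is queued while triggering a file creation.
    watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'

    # If the hang tracks a large Dirty figure draining to zero, cap it
    # (percentages of RAM).
    echo 5 > /proc/sys/vm/dirty_ratio
    echo 2 > /proc/sys/vm/dirty_background_ratio

    # On kernels that support it, an absolute limit can be set instead:
    # echo 268435456 > /proc/sys/vm/dirty_bytes

    # During a hang, list processes stuck in D state and dump their
    # kernel stacks.
    ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /D/'
    for p in $(ps -eo pid,stat | awk '$2 ~ /D/ {print $1}'); do
        echo "== $p =="; cat /proc/$p/stack
    done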
* RE: RAID halting 2009-04-02 4:38 ` Strange filesystem slowness with 8TB RAID6 NeilBrown @ 2009-04-04 7:12 ` Lelsie Rhorer 2009-04-04 12:38 ` Roger Heflin 0 siblings, 1 reply; 24+ messages in thread From: Lelsie Rhorer @ 2009-04-04 7:12 UTC (permalink / raw) To: 'Linux RAID' >This sounds like a filesys problem rather than a RAID problem. I considered that. It may well be. >One thing that can cause this sort of behaviour is if the filesystem is in >the middle of a sync and has to complete it before the create can >complete, and the sync is writing out many megabytes of data. For between 40 seconds and 2 minutes? The drive subsystem can easily gulp down 200 megabytes to 6000 megabytes in that period of time. What synch would be that large? Secondly, the problem occurs also when there is relatively little or no data being written to the array. Finally, unless I am misunderstanding at what layers iostat and atop are reporting the traffic, the fact all drive writes invariably fall to dead zero during an event and reads on precisely half the drives (and always the same drives) drop to dead zero suggests to me this is not the case. >You can see if this is happening by running > watch 'grep Dirty /proc/meminfo' >if that is large when the hang starts, and drops down to zero, and the >hang lets go when it hits (close to) zero, then this is the problem. Thanks, I'll give it a try later today. Right now I am dead tired, plus there are some batches running I really don't want interrupted, and triggering an event might halt them. >The answer then is to keep that number low by writing a suitable >number into > /proc/sys/vm/dirty_ratio (a percentage of system RAM) >or > /proc/sys/vm/dirty_bytes Um, OK. What would constitute suitable numbers, assuming it turns out to be the issue? >If that doesn't turn out to be the problem, then knowing how the >"Dirty" count is behaving might still be useful, and I would probably >look at what processes are in 'D' state, (ps axgu) and look at their >stack (/proc/$PID/stack).. I'll surely try that, too. >You didn't say what kernel you are running. It might make a difference. >NeilBrown Oh, sorry! 2.6.26-1-amd64 4G of RAM, with typically 600 - 800M in use. The swap space is 4.7G, but the used swap space has never exceeded 200K. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: RAID halting 2009-04-04 7:12 ` RAID halting Lelsie Rhorer @ 2009-04-04 12:38 ` Roger Heflin 0 siblings, 0 replies; 24+ messages in thread From: Roger Heflin @ 2009-04-04 12:38 UTC (permalink / raw) To: lrhorer; +Cc: 'Linux RAID' Lelsie Rhorer wrote: >> This sounds like a filesys problem rather than a RAID problem. > > I considered that. It may well be. > >> One thing that can cause this sort of behaviour is if the filesystem is in >> the middle of a sync and has to complete it before the create can >> complete, and the sync is writing out many megabytes of data. > > For between 40 seconds and 2 minutes? The drive subsystem can easily gulp > down 200 megabytes to 6000 megabytes in that period of time. What synch > would be that large? Secondly, the problem occurs also when there is > relatively little or no data being written to the array. Finally, unless I > am misunderstanding at what layers iostat and atop are reporting the > traffic, the fact all drive writes invariably fall to dead zero during an > event and reads on precisely half the drives (and always the same drives) > drop to dead zero suggests to me this is not the case. > > If you have things setup such that you have lights on the disks, and can check the lights when the event is happening, often if a single drive is being slow it will be the only one with its lights on when things are going bad. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: your mail
  2009-04-02 4:16 (unknown), Lelsie Rhorer
  2009-04-02 4:22 ` David Lethe
  2009-04-02 4:38 ` Strange filesystem slowness with 8TB RAID6 NeilBrown
@ 2009-04-02 6:56 ` Luca Berra
  2009-04-04 6:44   ` RAID halting Lelsie Rhorer
  2009-04-02 7:33 ` Peter Grandi
  2009-04-02 13:35 ` Andrew Burgess
  4 siblings, 1 reply; 24+ messages in thread

From: Luca Berra @ 2009-04-02 6:56 UTC (permalink / raw)
To: linux-raid

On Wed, Apr 01, 2009 at 11:16:06PM -0500, Lelsie Rhorer wrote:
> 8T of active space is formatted as a single Reiserfs file system.  The
....
> The issue is the entire array will occasionally pause completely for about
> 40 seconds when a file is created.  This does not always happen, but the
> situation is easily reproducible.  The frequency at which the symptom

I wonder how costly b-tree operations are for an 8TB filesystem...

L.

--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \

^ permalink raw reply	[flat|nested] 24+ messages in thread
* RE: RAID halting
  2009-04-02 6:56 ` your mail Luca Berra
@ 2009-04-04 6:44 ` Lelsie Rhorer
  0 siblings, 0 replies; 24+ messages in thread

From: Lelsie Rhorer @ 2009-04-04 6:44 UTC (permalink / raw)
To: 'Linux RAID'

>> The issue is the entire array will occasionally pause completely for about
>> 40 seconds when a file is created.  This does not always happen, but the
>> situation is easily reproducible.  The frequency at which the symptom

> I wonder how costly b-tree operations are for an 8TB filesystem...

I expect somewhat costly, but a 40 second to 2 minute halt just to create
a file of between 0 and 1000 bytes is ridiculous.  That, and as I said,
it doesn't always happen, by a long shot, not even when transfers in the
400 Mbps range are underway.  OTOH, I have had it happen when only a few
Mbps transfers were underway.

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re:
  2009-04-02 4:16 (unknown), Lelsie Rhorer
                  ` (2 preceding siblings ...)
  2009-04-02 6:56 ` your mail Luca Berra
@ 2009-04-02 7:33 ` Peter Grandi
  2009-04-02 23:01  ` RAID halting Lelsie Rhorer
  2009-04-02 13:35 ` Andrew Burgess
  4 siblings, 1 reply; 24+ messages in thread

From: Peter Grandi @ 2009-04-02 7:33 UTC (permalink / raw)
To: Linux RAID

> The issue is the entire array will occasionally pause completely
> for about 40 seconds when a file is created. [ ... ] During heavy
> file transfer activity, sometimes the system halts with every
> other file creation. [ ... ] There are other drives formatted
> with other file systems on the machine, but the issue has never
> been seen on any of the other drives. When the array runs its
> regularly scheduled health check, the problem is much worse. [ ... ]

Looks like either you have hw issues (transfer errors, bad blocks) or,
more likely, the cache flusher and elevator settings have not been tuned
for a steady flow.

> How can I troubleshoot and more importantly resolve this issue?

Well, troubleshooting would require a good understanding of file system
design and storage subsystem design, and quite a bit of time.

However, for hardware errors check the kernel logs, and for cache
flusher and elevator settings check the 'bi'/'bo' numbers of 'vmstat 1'
while the pause happens.

For a deeper profile of per-drive IO run 'watch iostat 1 2' while this
is happening.  This might also help indicate drive errors (no IO is
happening) or flusher/elevator tuning issues (lots of IO is happening
suddenly).

^ permalink raw reply	[flat|nested] 24+ messages in thread
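One convenient way to apply that advice is to leave both tools logging to
files so the samples around a stall can be examined afterwards; the paths
and intervals below are arbitrary:

    # Per-drive extended statistics (including %util), timestamped.
    iostat -t -x -k 1 > /tmp/iostat.log &

    # System-wide bi/bo once a second.
    vmstat 1 > /tmp/vmstat.log &

    # Reproduce the hang, e.g. by creating a small file on the array.
    touch /RAID/newfile

    # Once the array unfreezes, stop the loggers and inspect the samples
    # taken around the stall.
    kill %1 %2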
* RE: RAID halting
  2009-04-02 7:33 ` Peter Grandi
@ 2009-04-02 23:01 ` Lelsie Rhorer
  0 siblings, 0 replies; 24+ messages in thread

From: Lelsie Rhorer @ 2009-04-02 23:01 UTC (permalink / raw)
To: 'Linux RAID'

>> The issue is the entire array will occasionally pause completely
>> for about 40 seconds when a file is created. [ ... ] During heavy
>> file transfer activity, sometimes the system halts with every
>> other file creation. [ ... ] There are other drives formatted
>> with other file systems on the machine, but the issue has never
>> been seen on any of the other drives. When the array runs its
>> regularly scheduled health check, the problem is much worse. [ ... ]

> Looks like either you have hw issues (transfer errors, bad blocks) or,
> more likely, the cache flusher and elevator settings have not been
> tuned for a steady flow.

That doesn't sound right.  I can read and write all day long at up to
450 Mbps in both directions continuously for hours at a time.  It's only
when a file is created, even a file of only a few bytes, that the issue
occurs, and then not always.  Indeed, earlier today I had transfers going
with an average throughput of more than 300 Mbps total and despite
creating more than 20 new files, not once did the transfers halt.

>> How can I troubleshoot and more importantly resolve this issue?

> Well, troubleshooting would require a good understanding of file
> system design and storage subsystem design, and quite a bit of time.
> However, for hardware errors check the kernel logs, and for cache
> flusher and elevator settings check the 'bi'/'bo' numbers of
> 'vmstat 1' while the pause happens.

I've already done that.  There are no errors of any sort in the kernel
log.  Vmstat only tells me both bi and bo are zero, which we already
knew.  I've tried ps, iostat, vmstat, and top, and nothing provides
anything of any significance I can see except that reiserfs is waiting
on md, which we already knew, and (as I recall - it's been a couple of
weeks) the number of bytes in and out of md0 falls to zero.

> For a deeper profile of per-drive IO run 'watch iostat 1 2' while
> this is happening.  This might also help indicate drive errors (no
> IO is happening) or flusher/elevator tuning issues (lots of IO is
> happening suddenly).

I'll give it a try.  I haven't been able to reproduce the issue today.
Usually it's pretty easy.

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: 2009-04-02 4:16 (unknown), Lelsie Rhorer ` (3 preceding siblings ...) 2009-04-02 7:33 ` Peter Grandi @ 2009-04-02 13:35 ` Andrew Burgess 2009-04-04 5:57 ` RAID halting Lelsie Rhorer 4 siblings, 1 reply; 24+ messages in thread From: Andrew Burgess @ 2009-04-02 13:35 UTC (permalink / raw) To: lrhorer; +Cc: linux-raid On Wed, 2009-04-01 at 23:16 -0500, Lelsie Rhorer wrote: > The issue is the entire array will occasionally pause completely for about > 40 seconds when a file is created. I had symptoms like this once. It turned out to be a defective disk. The disk would never return a read or write error but just intermittently took a really long time to respond. I found it by running atop. All the other drives would be running at low utilization and this one drive would be at 100% when the symptoms occurred (which in atop gets colored red so it jumps out at you) ^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: RAID halting 2009-04-02 13:35 ` Andrew Burgess @ 2009-04-04 5:57 ` Lelsie Rhorer 2009-04-04 13:01 ` Andrew Burgess 0 siblings, 1 reply; 24+ messages in thread From: Lelsie Rhorer @ 2009-04-04 5:57 UTC (permalink / raw) To: 'Linux RAID' >> The issue is the entire array will occasionally pause completely for about >> 40 seconds when a file is created. >I had symptoms like this once. It turned out to be a defective disk. The >disk would never return a read or write error but just intermittently >took a really long time to respond. >I found it by running atop. All the other drives would be running at low >utilization and this one drive would be at 100% when the symptoms >occurred (which in atop gets colored red so it jumps out at you) Thanks. I gave this a try, but not being at all familiar with atop, I'm not sure what, if anything, the results mean in terms of any additional diagnostic data. Depending somewhat upon the I/O load on the RAID array, atop sometimes reports the drive utilization on several or all of the drives to be well in excess of 85% - occasionally even 99%, but never flat 100% at any time. Oddly, even under relatively light loads of 20 or 30 Mbps, sometimes the RAID members would show utilization in the high 90s, usually on all the drives on a multiplier channel. I don't know if this is ordinary behavior for atop, but all the drives also periodically disappear from the status display. Additionally, while atop is running and I am using my usual video editor, Video Redo, on a Windows workstation to stream video from the server, every time atop updates, the video and audio skip when reading from a drive not on the RAID array. I did not notice the same behavior from the RAID array. Odd. Anyway, on to the diagnostics. I ran both `atop` and `watch iostat 1 2` concurrently and triggered several events while under heavy load ( >450 Mbps, total ). In atop, drives sdb, sdd, sde, sdg, and sdi consistently disappeared from atop entirely, and writes for the other drives fell to dead zero. Reads fell to a very small number. The iostat session returned information in agreement with atop: both reads and writes for sdb, sdd, sde, sdg, sdi, and md0 all fell to dead zero from nominal values frequently exceeding 20,000 reads / sec and 5000 writes / sec. Meanwhile, writes to sda, sdc, sdf, sdh, and sdj also dropped to dead zero, but reads only fell to between 230 and 256 reads/sec. ^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: RAID halting
  2009-04-04 5:57 ` RAID halting Lelsie Rhorer
@ 2009-04-04 13:01 ` Andrew Burgess
  2009-04-04 14:39  ` Lelsie Rhorer
  0 siblings, 1 reply; 24+ messages in thread

From: Andrew Burgess @ 2009-04-04 13:01 UTC (permalink / raw)
To: lrhorer; +Cc: 'Linux RAID'

On Sat, 2009-04-04 at 00:57 -0500, Lelsie Rhorer wrote:
> >> The issue is the entire array will occasionally pause completely for
> >> about 40 seconds when a file is created.
>
> > I had symptoms like this once. It turned out to be a defective disk.
> > The disk would never return a read or write error but just
> > intermittently took a really long time to respond.
>
> > I found it by running atop. All the other drives would be running at
> > low utilization and this one drive would be at 100% when the symptoms
> > occurred (which in atop gets colored red so it jumps out at you)
>
> Thanks.  I gave this a try, but not being at all familiar with atop,
> I'm not sure what, if anything, the results mean in terms of any
> additional diagnostic data.

It's the same info as iostat, just in color.

> Depending somewhat upon the I/O load on the RAID array, atop sometimes
> reports the drive utilization on several or all of the drives to be
> well in excess of 85% - occasionally even 99%, but never flat 100% at
> any time.

High 90's is what I meant by 100% :-)

> Oddly, even under relatively light loads of 20 or 30 Mbps, sometimes
> the RAID members would show utilization in the high 90s, usually on
> all the drives on a multiplier channel.

I think that's the filesystem buffering and then writing all at once.
It's normal if it's periodic; they go briefly to ~100% and then back to
~0%?

Did you watch the atop display when the problem occurred?

> I don't know if this is ordinary behavior for atop, but all the drives
> also periodically disappear from the status display.

That's a config option (and I find the default annoying).  atop also
sorts the drives by utilization every second, which can be a little
hard to watch.  But if you have the problem I had then that one drive
stays at the top of the list when the problem occurs.

> Additionally, while atop is running and I am using my usual video
> editor, Video Redo, on a Windows workstation to stream video from the
> server, every time atop updates, the video and audio skip when reading
> from a drive not on the RAID array.  I did not notice the same
> behavior from the RAID array.  Odd.

I think this is heavy /proc filesystem access, which I have noticed can
screw up even realtime processes.

> Anyway, on to the diagnostics.
>
> I ran both `atop` and `watch iostat 1 2` concurrently and triggered
> several events while under heavy load ( >450 Mbps, total ).  In atop,
> drives sdb, sdd, sde, sdg, and sdi consistently disappeared from atop
> entirely, and writes for the other drives fell to dead zero.  Reads
> fell to a very small number.  The iostat session returned information
> in agreement with atop: both reads and writes for sdb, sdd, sde, sdg,
> sdi, and md0 all fell to dead zero from nominal values frequently
> exceeding 20,000 reads / sec and 5000 writes / sec.  Meanwhile, writes
> to sda, sdc, sdf, sdh, and sdj also dropped to dead zero, but reads
> only fell to between 230 and 256 reads/sec.

I used:

iostat -t -k -x 1 | egrep -v 'sd.[0-9]'

to get percent utilization and not show each partition but just whole
drives.

For atop you want the -f option to 'fixate' the number of lines so
drives with zero utilization don't disappear.

If you didn't get constant 100% utilization while the event occurred
then I guess you don't have the problem I had.

Does the sata multiplier have its own driver and if so, is it the
latest?  Any other complaints on the net about it?  I would think a
problem there would show up as 100% utilization though...

And I think you already said the cpu usage is low when the event occurs?
No one core at near 100%?  (atop would show this too...)

^ permalink raw reply	[flat|nested] 24+ messages in thread
* RE: RAID halting 2009-04-04 13:01 ` Andrew Burgess @ 2009-04-04 14:39 ` Lelsie Rhorer 2009-04-04 15:04 ` Andrew Burgess 0 siblings, 1 reply; 24+ messages in thread From: Lelsie Rhorer @ 2009-04-04 14:39 UTC (permalink / raw) To: 'Linux RAID' > I think that's the filesystem buffering and then writing all at once. > It's normal if it's periodic; they go briefly to ~100% and then back to > ~0%? Yes. > > I don't know if this is ordinary > > behavior for atop, but all the drives also periodically disappear from > the > > status display. > > That's a config option (and I find the default annoying). Yeah, me, too. > sorts the drives by utilization every second which can be a little hard > to watch. But if you have the problem I had then that one drive stays at > the top of the list when the problem occurs. No. > I used: > > iostat -t -k -x 1 | egrep -v 'sd.[0-9]' > > to get percent utilization and not show each partition but just whole > drives. Since there are no partitions, it shouldn't make a difference. > For atop you want the -f option to 'fixate' the number of lines so > drives with zero utilization don't disappear. Well, diagnostically, I think the situation is clear. All ten drives stop writing completely. Five of the ten stop reading, and the other five slow their reads to a dribble - always the same five drives. > Does the sata multiplier have it's own driver and if so, is it the > latest? Any other complaints on the net about it? I would think a > problem there would show up as 100% utilization though... Multipliers - three of them, and no, they require no driver. The SI controller's drivers are included in the distro. > And I think you already said the cpu usage is low when the event occurs? > No one core at near 100%? (atop would show this too...) Nowhere near. Typically both cores are running below 25%, depending upon what processes are running, of course. I have the Gnome system monitor up, and the graphs don't spike when the event occurs. Of course, if there is a local drive access process which uses lots of CPU horsepower, such as ffmpeg, then when the array halt occurs, the CPU utilization falls right off. ^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: RAID halting
  2009-04-04 14:39 ` Lelsie Rhorer
@ 2009-04-04 15:04 ` Andrew Burgess
  2009-04-04 15:15  ` Lelsie Rhorer
  0 siblings, 1 reply; 24+ messages in thread

From: Andrew Burgess @ 2009-04-04 15:04 UTC (permalink / raw)
To: lrhorer; +Cc: 'Linux RAID'

On Sat, 2009-04-04 at 09:39 -0500, Lelsie Rhorer wrote:
> Well, diagnostically, I think the situation is clear.  All ten drives
> stop writing completely.  Five of the ten stop reading, and the other
> five slow their reads to a dribble - always the same five drives.

So the delay seems to be hiding in the kernel, else the userspace tools
would see it (they see some kernel stuff too, like utilization).

Oprofile is supposed to be good for user and kernel profiling but I
don't know if it can find non-cpu bound stuff.  There are also a bunch
of latency analysis tools in the kernel that were used for realtime
tuning; they might show where something is getting stuck.  Andrew Morton
did a lot of work in this area.

If the cpu was spinning somewhere it would show as system time, so it
must be waiting for a timer or some other event (wild guessing).  It's
as if the i/o completion never arrives but some timer eventually goes
off and maybe the i/o is retried and everything gets back on track?  But
that should cause utilization to go up, and you'd think some sort of
message...

Perhaps the ide list would know of some diagnostic knobs to tweak.

It's a puzzler...

One last thing: the cpu goes toward 100% idle, not wait?

^ permalink raw reply	[flat|nested] 24+ messages in thread
* RE: RAID halting 2009-04-04 15:04 ` Andrew Burgess @ 2009-04-04 15:15 ` Lelsie Rhorer 2009-04-04 16:39 ` Andrew Burgess 0 siblings, 1 reply; 24+ messages in thread From: Lelsie Rhorer @ 2009-04-04 15:15 UTC (permalink / raw) To: 'Linux RAID' > Oprofile is supposed to be good for user and kernel profiling but I > don't know if it can find non-cpu bound stuff. There are also a bunch of > latency analysis tools in the kernel that were used for realtime tuning, > they might show where something is getting stuck. Andrew Morton did alot > of work in this area. Do you know if he subscribes to this list? If not, how may I reach him? > If the cpu was spinning somewhere it would show as system time so it > must be waiting for a timer or some other event (wild guessing). It's as > if the i/o completion never arrives but some timer eventually goes off > and maybe the i/o is retried and everything gets back on track? But that > should cause utilization to go up and you'd think some sort of > message... It's also puzzling why the issue is so much worse when the array diagnostic is running. Almost every file creation triggers a halt, and the halt time extends to 2 minutes. This is one thing which makes me tend to think it is an interaction between the file system and the RAID system, rather than either one alone. > Perhaps the ide list would know of some diagnostic knobs to tweak. > One last thing, the cpu goes toward 100% idle not wait? Neither one. If the drive access is tied to a local CPU intensive user application, for example ffmpeg, then of course CPU utilization dips, but if all the drive accesses are via network utilities (ftp, SAMBA, etc), I haven't noticed any change in CPU utilization. Simultaneous reads and writes to other local drives or network drives continue without a hiccough. ^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: RAID halting 2009-04-04 15:15 ` Lelsie Rhorer @ 2009-04-04 16:39 ` Andrew Burgess 0 siblings, 0 replies; 24+ messages in thread From: Andrew Burgess @ 2009-04-04 16:39 UTC (permalink / raw) To: lrhorer; +Cc: 'Linux RAID' On Sat, 2009-04-04 at 10:15 -0500, Lelsie Rhorer wrote: > > Oprofile is supposed to be good for user and kernel profiling but I > > don't know if it can find non-cpu bound stuff. There are also a bunch of > > latency analysis tools in the kernel that were used for realtime tuning, > > they might show where something is getting stuck. Andrew Morton did alot > > of work in this area. > > Do you know if he subscribes to this list? If not, how may I reach him? He's on the kernel list which I seldom read nowadays. His email used to be akpm@something I think. He worked on ftrace, documented in the kernel source in /usr/src/linux/Documentation/ftrace.txt An excerpt: "Ftrace is an internal tracer designed to help out developers and designers of systems to find what is going on inside the kernel. It can be used for debugging or analyzing latencies and performance issues that take place outside of user-space. Although ftrace is the function tracer, it also includes an infrastructure that allows for other types of tracing. Some of the tracers that are currently in ftrace include a tracer to trace context switches, the time it takes for a high priority task to run after it was woken up, the time interrupts are disabled, and more (ftrace allows for tracer plugins, which means that the list of tracers can always grow)." ^ permalink raw reply [flat|nested] 24+ messages in thread
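On kernels new enough to include ftrace (it was merged after the 2.6.26
kernel this system is running), the latency tracers described above are
driven through debugfs; a rough sketch, assuming the default mount point
and tracer names:

    # Mount debugfs if it is not already mounted.
    mount -t debugfs none /sys/kernel/debug
    cd /sys/kernel/debug/tracing

    cat available_tracers            # e.g. wakeup irqsoff sched_switch nop

    # Trace worst-case scheduler wakeup latency while reproducing the hang.
    echo wakeup > current_tracer
    echo 1 > tracing_on              # very old versions use tracing_enabled
    touch /RAID/newfile              # trigger the stall
    echo 0 > tracing_on
    head -100 trace                  # shows the latency and the code path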