linux-raid.vger.kernel.org archive mirror

* performance problems with raid10,f2
@ 2008-03-14 23:11 Keld Jørn Simonsen
  2008-03-20 17:28 ` Keld Jørn Simonsen
  0 siblings, 1 reply; 10+ messages in thread
From: Keld Jørn Simonsen @ 2008-03-14 23:11 UTC (permalink / raw)
  To: linux-raid

Hi

I have a 4-drive array with 1 TB Hitachi disks, formatted as raid10,f2.

I had some strange observations:

1. While resyncing, I could get the array to deliver about 320 MB/s in
sequential read, which was good. After the resync was done, and with
all 4 drives active, I only get 115 MB/s.

2. While resyncing, the IO rate on each disk is about 27 MB/s, yet the
rate of each single disk is about 82 MB/s. Why is this?

best regards
keld

* Re: performance problems with raid10,f2
  2008-03-14 23:11 performance problems with raid10,f2 Keld Jørn Simonsen
@ 2008-03-20 17:28 ` Keld Jørn Simonsen
  2008-03-25  5:13   ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Keld Jørn Simonsen @ 2008-03-20 17:28 UTC (permalink / raw)
  To: linux-raid

On Sat, Mar 15, 2008 at 12:11:51AM +0100, Keld Jørn Simonsen wrote:
> Hi
> 
> I have a 4 drive array with 1 TB Hitachi disks, formatted as raid10,f2
> 
> I had some strange observations:
> 
> 1. while resyncing I could get the raid to give me about 320 MB/s in
> sequential read, which was good. After resync had been done, and with
> all 4 drives active, I only get 115 MB/s.

This was reproducible. I don't know what could be wrong.
I tried to enlarge my read-ahead, but the system did not allow me more
than a 2 MiB read-ahead - well, that should be OK for a 4-disk array
with 256 KiB chunks, shouldn't it?

I also tried a chunk size of 64 KiB - but no luck.

It seems like it is something that the resync process builds up.
What could it be?

Best regards
keld

* Re: performance problems with raid10,f2
  2008-03-20 17:28 ` Keld Jørn Simonsen
@ 2008-03-25  5:13   ` Neil Brown
  2008-03-25 10:36     ` Keld Jørn Simonsen
  0 siblings, 1 reply; 10+ messages in thread
From: Neil Brown @ 2008-03-25  5:13 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: linux-raid

On Thursday March 20, keld@dkuug.dk wrote:
> On Sat, Mar 15, 2008 at 12:11:51AM +0100, Keld Jørn Simonsen wrote:
> > Hi
> > 
> > I have a 4 drive array with 1 TB Hitachi disks, formatted as raid10,f2
> > 
> > I had some strange observations:
> > 
> > 1. while resyncing I could get the raid to give me about 320 MB/s in
> > sequential read, which was good. After resync had been done, and with
> > all 4 drives active, I only get 115 MB/s.
> 
> This was reproducible. I don't know what could be wrong.

Is this with, or without, your patch to avoid "read-balancing" for
raid10/far layouts?
It sounds like it is without that patch ????

NeilBrown


> I tried to enlarge my read-ahead, but the system did not allow me more
> than a 2 MiB read-ahead - well, that should be OK for a 4-disk array
> with 256 KiB chunks, shouldn't it?
> 
> I did try to have chunks of 64 kiB - but no luck.
> 
> It seemed like it is something that the resync process builds up.
> What could it be?
> 
> Best regards
> keld

* Re: performance problems with raid10,f2
  2008-03-25  5:13   ` Neil Brown
@ 2008-03-25 10:36     ` Keld Jørn Simonsen
  2008-03-25 13:22       ` Peter Grandi
  0 siblings, 1 reply; 10+ messages in thread
From: Keld Jørn Simonsen @ 2008-03-25 10:36 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

On Tue, Mar 25, 2008 at 04:13:28PM +1100, Neil Brown wrote:
> On Thursday March 20, keld@dkuug.dk wrote:
> > On Sat, Mar 15, 2008 at 12:11:51AM +0100, Keld Jørn Simonsen wrote:
> > > Hi
> > > 
> > > I have a 4 drive array with 1 TB Hitachi disks, formatted as raid10,f2
> > > 
> > > I had some strange observations:
> > > 
> > > 1. while resyncing I could get the raid to give me about 320 MB/s in
> > > sequential read, which was good. After resync had been done, and with
> > > all 4 drives active, I only get 115 MB/s.
> > 
> > This was reproducible. I don't know what could be wrong.
> 
> Is this with, or without, your patch to avoid "read-balancing" for
> raid10/far layouts?
> It sounds like it is without that patch ????
> 
> NeilBrown

I tried both with and without the patch, with almost the same result.
Is resync building some table, and could that be it?
Or could it be some kind of inode traffic?

best regards
keld

* Re: performance problems with raid10,f2
  2008-03-25 10:36     ` Keld Jørn Simonsen
@ 2008-03-25 13:22       ` Peter Grandi
  2008-04-02 21:13         ` Keld Jørn Simonsen
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Grandi @ 2008-03-25 13:22 UTC (permalink / raw)
  To: Linux RAID

>>>> I have a 4 drive array with 1 TB Hitachi disks, formatted
>>>> as raid10,f2 I had some strange observations: 1. while
>>>> resyncing I could get the raid to give me about 320 MB/s in
>>>> sequential read, which was good. After resync had been
>>>> done, and with all 4 drives active, I only get 115 MB/s.

[ ... ]

>> Is this with, or without, your patch to avoid "read-balancing"
>> for raid10/far layouts?  It sounds like it is without that
>> patch ????

> I tried both with and without the patch, with almost the
> same result.

That could be the usual issue with apparent pauses in the stream
of IO requests to the array component devices; the usual
workaround is to try 'blockdev --setra 65536 /dev/mdN' and see if
sequential reads improve.
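
For example, something along these lines (only a sketch; the device
name and file path are placeholders, and 65536 sectors is 32 MiB of
read-ahead):

  blockdev --getra /dev/md0        # current read-ahead, in 512-byte sectors
  blockdev --setra 65536 /dev/md0  # try a much larger value
  echo 3 > /proc/sys/vm/drop_caches               # drop the page cache first
  dd if=/some/large/file/on/the/array of=/dev/null bs=1M   # time a sequential read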

> Is resync building some table, and could that be it?  Or could
> it be some kind of inode traffic?

One good way to see what is actually happening is to use either
'watch iostat -k 1 2' and look at the load on the individual MD
array component devices, or use 'sysctl vm/block_dump=1' and look
at the addresses being read or written.
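
A sketch of both approaches (run as root):

  watch 'iostat -k 1 2'       # per-device kB/s and tps, refreshed continuously
  sysctl -w vm.block_dump=1   # log each read/write (device and sector) to the kernel log
  dmesg | tail -50            # inspect the addresses being read or written
  sysctl -w vm.block_dump=0   # switch the logging off again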

* Re: performance problems with raid10,f2
  2008-03-25 13:22       ` Peter Grandi
@ 2008-04-02 21:13         ` Keld Jørn Simonsen
  2008-04-03 20:20           ` Peter Grandi
  0 siblings, 1 reply; 10+ messages in thread
From: Keld Jørn Simonsen @ 2008-04-02 21:13 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

On Tue, Mar 25, 2008 at 01:22:03PM +0000, Peter Grandi wrote:
> >>>> I have a 4 drive array with 1 TB Hitachi disks, formatted
> >>>> as raid10,f2 I had some strange observations: 1. while
> >>>> resyncing I could get the raid to give me about 320 MB/s in
> >>>> sequential read, which was good. After resync had been
> >>>> done, and with all 4 drives active, I only get 115 MB/s.
> 
> [ ... ]
> 
> >> Is this with, or without, your patch to avoid "read-balancing"
> >> for raid10/far layouts?  It sounds like it is without that
> >> patch ????
> 
> > I tried both with and without the patch, with almost the
> > same result.
> 
> That could be the usual issue with apparent pauses in the stream
> of IO requests to the array component devices, with the usual
> workaround of trying 'blockdev --setra 65536 /dev/mdN' and see if
> sequential reads improve.

Yes, that did it! 

> > Is resync building some table, and could that be it?  Or could
> > it be some kind of inode traffic?
> 
> One good way to see what is actually happening is to use either
> 'watch iostat -k 1 2' and look at the load on the individual MD
> array component devices, or use 'sysctl vm/block_dump=1' and look
> at the addresses being read or written.

Good advice. I added your info to the wiki.

best regards
keld

* Re: performance problems with raid10,f2
  2008-04-02 21:13         ` Keld Jørn Simonsen
@ 2008-04-03 20:20           ` Peter Grandi
  2008-04-04  8:03             ` Keld Jørn Simonsen
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Grandi @ 2008-04-03 20:20 UTC (permalink / raw)
  To: Linux RAID

>>> On Wed, 2 Apr 2008 23:13:15 +0200, Keld Jørn Simonsen
>>> <keld@dkuug.dk> said:

[ ... slow RAID reading ... ]

>> That could be the usual issue with apparent pauses in the
>> stream of IO requests to the array component devices, with
>> the usual workaround of trying 'blockdev --setra 65536
>> /dev/mdN' and see if sequential reads improve.

keld> Yes, that did it! 

But that's, as usual, very wrong. Such a large read-ahead has
negative consequences, and most likely is the result of both
some terrible misdesign in the Linux block IO subsystem (from
some further experiments it is most likely related to "plugging")
and the way MD is integrated into it.

However, I have found that on relatively fast machines (I think)
much lower values of read-ahead still give reasonable speed,
with some values being much better than others. For example, with
another RAID10 I get pretty decent speed with a read-ahead of
128 on '/dev/md0' (but much worse with, say, 64 or 256). On other
arrays a read-ahead of 1000 sectors is good.

The read-ahead needed also depends a bit on the file system
type, so don't trust tests done on the block device itself.

So please experiment a bit to try and reduce it, at least until
I find the time to figure out the (surely embarrassing) reason
why it is needed and how to avoid it, or the Linux block IO and
MD maintainers confess (they almost surely already know why)
and/or fix it already.
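
A rough way to run that experiment (only a sketch; the value list,
device name and file path are placeholders - pick a file bigger than
RAM and adjust to taste):

  for ra in 256 512 1024 2048 4096 8192 16384 32768; do
      blockdev --setra $ra /dev/md0
      echo 3 > /proc/sys/vm/drop_caches
      echo -n "readahead $ra sectors: "
      dd if=/mnt/raid/bigfile of=/dev/null bs=1M 2>&1 | tail -1
  done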

* Re: performance problems with raid10,f2
  2008-04-03 20:20           ` Peter Grandi
@ 2008-04-04  8:03             ` Keld Jørn Simonsen
  2008-04-05 17:31               ` Peter Grandi
  0 siblings, 1 reply; 10+ messages in thread
From: Keld Jørn Simonsen @ 2008-04-04  8:03 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

On Thu, Apr 03, 2008 at 09:20:37PM +0100, Peter Grandi wrote:
> >>> On Wed, 2 Apr 2008 23:13:15 +0200, Keld Jørn Simonsen
> >>> <keld@dkuug.dk> said:
> 
> [ ... slow RAID reading ... ]
> 
> >> That could be the usual issue with apparent pauses in the
> >> stream of IO requests to the array component devices, with
> >> the usual workaround of trying 'blockdev --setra 65536
> >> /dev/mdN' and see if sequential reads improve.
> 
> keld> Yes, that did it! 
> 
> But that's, as usual, very wrong. Such a large read-ahead has
> negative consequences, and most likely is the result of both
> some terrible misdesign in the Linux block IO subsystem (from
> some further experiments it is most likely related to "plugging")
> and integration of MD into it.
> 
> However I have found that on relatively fast machines (I think)
> much lower values of read-ahead still give reasonable speed,
> with some values being much better than others. For example with
> another RAID10 I get pretty decent speed with a read-ahead of
> 128 on '/dev/md0' (but much worse with say 64 or 256). On others
> 1000 sectors read-ahead is good.
> 
> The read-ahead needed also depends a bit on the file system
> type, don't trust tests done on the block device itself.
> 
> So please experiment a bit to try and reduce it, at least until
> I find the time to figure out the (surely embarrassing) reason
> why it is needed and how to avoid it, or the Linux block IO and
> MD maintainers confess (they almost surely already know why)
> and/or fix it already.

I did experiment, and I noted that a 16 MiB read-ahead was sufficient.

And then I was wondering if this has negative consequences, e.g. on
random reads.

I then ran a test reading 1000 files concurrently, and some strange
things happened. Each drive was doing about 2000 transactions per
second (tps). Why? I thought a drive could only do about 150 tps, given
that it is a 7200 rpm drive.
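
(My rough arithmetic behind that figure: 7200 rpm is 120 revolutions/s,
i.e. about 4 ms average rotational latency; add a typical 8-9 ms average
seek and a random read costs roughly 12-13 ms, so on the order of 80-150
operations per second, depending on how far the arm has to move.)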

What is tps measuring?

Why is the fs not reading the chunk size for every IO operation?

Best regards
keld

* Re: performance problems with raid10,f2
  2008-04-04  8:03             ` Keld Jørn Simonsen
@ 2008-04-05 17:31               ` Peter Grandi
  2008-04-05 18:46                 ` Keld Jørn Simonsen
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Grandi @ 2008-04-05 17:31 UTC (permalink / raw)
  To: Linux RAID

>>> On Fri, 4 Apr 2008 10:03:59 +0200, Keld Jørn Simonsen
>>> <keld@dkuug.dk> said:

[ ...  slow software RAID in sequential access ... ]

> I did experiment and I noted that a 16 MiB readahead was
> sufficient.

That still sounds a bit high.

> And then I was wondering if this had negative consequences, eg
> on random reads.

It surely has large negative consequences, but not necessarily on
random reads. After all, that depends on when an operation completes,
and I suspect that read-ahead is at least partially asynchronous,
that is, the read of a block completes when it gets to memory, not
when the whole read-ahead is done. The problem is more likely to be
increased memory contention when the system is busy and, even
worse, increased disk arm contention.

Read-ahead not only loads memory with not-yet-needed blocks, it
also keeps the disk busier reading those not-yet-needed blocks.

> I then ran a test reading 1000 files concurrently, and some
> strange things happened. Each drive was doing about 2000
> transactions per second (tps). Why? I thought a drive could
> only do about 150 tps, given that it is a 7200 rpm drive.

RPM is not that closely related to transactions/s, however defined;
arm movement time and locality of access perhaps are.

> What is tps measuring?

That's pretty mysterious to me. It could mean anything, and anyhow
I have become even more disillusioned about the whole Linux IO
subsystem, which I now think is as badly misdesigned as the
Linux VM subsystem.

Just the idea of putting "plugging" at the block device level
demonstrates the level of its developers (amazingly some recent
tests I have done seem to show that at least in some cases it has
no influence on performance either way).

But then I was recently reading these wise words from a great
old man of OS design:

  http://CSG.CSAIL.MIT.edu/Users/dennis/essay.htm

   "During the 1980s things changed. Computer Science Departments
    had proliferated throughout the universities to meet the
    demand, primarily for programmers and software engineers, and
    the faculty assembled to teach the subjects was expected to do
    meaningful research.

    To manage the burgeoning flood of conference papers, program
    committees adopted a new strategy for papers in computer
    architecture: No more wild ideas; papers had to present
    quantitative results. The effect was to create a style of
    graduate research in computer architecture that remains the
    "conventional wisdom" of the community to the present day: Make
    a small, innovative, change to a commercially accepted design
    and evaluate it using standard benchmark programs.

    This style has stifled the exploration and publication of
    interesting architectural ideas that require more than a
    modicum of change from current practice.

    The practice of basing evaluations on standard benchmark codes
    neglects the potential benefits of architectural concepts that
    need a change in programming methodology to demonstrate their
    full benefit."

and around the same time I had a very depressing IRC conversation
with a well-known kernel developer about what I think are some
rather stupid aspects of the Linux VM subsystem, and he was quite
unrepentant, saying that in some tests they were of benefit...

> Why is the fs not reading the chunk size for every IO operation?

Why should it? The goal is to keep the disk busy in the cheapest
way. Keep the queue as long as you need to keep the disk busy
(back-to-back operations) and no more.

However, if you are really asking why the MD subsystem needs
read-ahead values hundreds or thousands of times larger than those of
the underlying devices, counterproductively, that is something I am
trying to figure out in my not-so-abundant spare time. If anybody
knows, please let the rest of us know.
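
To see the discrepancy on a given system, just compare the values
directly (device names are examples; 256 sectors, i.e. 128 KiB, is the
usual kernel default for a plain disk):

  blockdev --getra /dev/sda    # per-component read-ahead, in sectors
  blockdev --getra /dev/md0    # the MD array's read-ahead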

* Re: performance problems with raid10,f2
  2008-04-05 17:31               ` Peter Grandi
@ 2008-04-05 18:46                 ` Keld Jørn Simonsen
  0 siblings, 0 replies; 10+ messages in thread
From: Keld Jørn Simonsen @ 2008-04-05 18:46 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

On Sat, Apr 05, 2008 at 06:31:00PM +0100, Peter Grandi wrote:
> >>> On Fri, 4 Apr 2008 10:03:59 +0200, Keld Jørn Simonsen
> >>> <keld@dkuug.dk> said:
> 
> [ ...  slow software RAID in sequential access ... ]
> 
> > I did experiment and I noted that a 16 MiB readahead was
> > sufficient.
> 
> That still sounds a bit high.

Well... it was only 8 MiB...

> > And then I was wondering if this had negative consequences, eg
> > on random reads.
> 
> It surely has large negative consequences, but not necessarily on
> random reads. After all that depends when an operations completes,
> and I suspect that read-ahead is at least partially asynchronous,
> that is the read of a block completes when it gets to memory, not
> when the whole read-ahead is done. The problem is more likely to be
> increased memory contention when the system is busy, and even
> worse, increased disks arm contention.

Well, it looks like the bigger the chunk size, the better for random
reading with 1000 processes. I need to do some more tests.

> Read ahead not only loads memory with not-yet-needed blocks, it
> keeps the disk busier reading those not-yet-needed blocks.

But they will be needed, given that most processes read files
sequentially, which is my scenario.

The trick is to keep the data in memory until it is needed.

> > I then ran a test reading 1000 files concurrently, and some
> > strange things happened. Each drive was doing about 2000
> > transactions per second (tps). Why? I thought a drive could
> > only do about 150 tps, given that it is a 7200 rpm drive.

> RPM is not that related to transactions/s, however defined, perhaps
> arm movement time and locality of access are.

RPM is also related. Actually quite related.

> > What is tps measuring?
> 
> That's pretty mysterious to me. It could mean anything, and anyhow
> I have become even more disillusioned about the whole Linux IO
> subsystem, which I now think to be as poorly misdesigned as the
> Linux VM subsystem.

iostat -x actually gave a more plausible measurement.
It has two measures: the aggregated IO requests sent to the disk, and
the IO requests issued by programs.

> > Why is the fs not reading the chunk size for every IO operation?
> 
> Why should it? The goal is to keep the disk busy in the cheapest
> way. Keep the queue as long as you need to keep the disk busy
> (back-to-back operations) and no more.

I would like the disk to produce as much real data for processes as
possible. With about 150 requests per second, 256 KiB chunks would
produce about 37 MB/s - but my system only gives me around 15
MB/s per disk. Some room for improvement.
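
(For reference, the arithmetic: 150 requests/s x 256 KiB = 38,400 KiB/s,
i.e. roughly 37 MiB/s.)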

> However if you are really asking why the MD subsystem needs
> read-ahead values hundreds or thousands of times larger than the
> underlying devices, counterproductively, that's something that I am
> trying to figure out in my not so abundant spare time. If anybody
> knows please let the rest of us know.

Yes, quite strange.

Keld
