linux-raid.vger.kernel.org archive mirror
* *terrible* direct-write performance with raid5
@ 2005-02-22 17:39 Michael Tokarev
  2005-02-22 20:11 ` Peter T. Breuer
  2005-02-22 23:08 ` dean gaudet
  0 siblings, 2 replies; 7+ messages in thread
From: Michael Tokarev @ 2005-02-22 17:39 UTC (permalink / raw)
  To: linux-raid

When debugging some other problem, I noticed that
direct-io (O_DIRECT) write speed on a software raid5
is terribly slow.  Here's a small table just to show
the idea (not the numbers by themselves, as they vary
from system to system, but how they relate to each
other).  I measured "plain" single-drive performance
(sdX below), performance of a raid5 array composed of
5 sdX drives, and ext3 filesystem performance (the file
on the filesystem was pre-created during the tests).
Speed measurements were performed with an 8-Kbyte
buffer, i.e. write(fd, buf, 8192); units are MB/sec.

             write   read
  sdX         44.9   45.5
  md           1.7*  31.3
  fs on md     0.7*  26.3
  fs on sdX   44.7   45.3

The "absolute winner" is the filesystem on top of the raid5
array: 700 kilobytes/sec, sorta like a 300-megabyte IDE drive
from some 10 years ago...
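
For what it's worth, the measurement loop was essentially the
following (a minimal sketch, not the exact program I used; the device
path is a command-line argument, and O_DIRECT needs an aligned buffer,
which mmap provides):

```python
import mmap
import os
import time

def mb_per_sec(nbytes, seconds):
    """Convert a byte count and elapsed time into MB/sec."""
    return nbytes / seconds / 1e6

def direct_write_speed(path, bufsize=8192, total=64 * 1024 * 1024):
    """Time O_DIRECT writes of bufsize bytes until total bytes are written."""
    buf = mmap.mmap(-1, bufsize)  # anonymous mmap is page-aligned, as O_DIRECT requires
    fd = os.open(path, os.O_WRONLY | os.O_DIRECT)
    try:
        start = time.monotonic()
        written = 0
        while written < total:
            written += os.write(fd, buf)
        return mb_per_sec(written, time.monotonic() - start)
    finally:
        os.close(fd)

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:  # e.g. /dev/md0 or /dev/sdb2 -- whatever is under test
        print("%.1f MB/sec" % direct_write_speed(sys.argv[1]))
```

Running it against a raw drive and against the md device reproduces
the first two table rows.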

The raid5 array was built with mdadm with default options, i.e.
Layout = left-symmetric, Chunk Size = 64K.  The same test with
raid0 or raid1, for example, shows quite good performance (still
not perfect, but *much* better than with raid5).

It's quite interesting how different the I/O speed is for the
fs-on-md vs fs-on-sdX case: with fs on sdX, the filesystem code
adds almost nothing to the plain-partition speed, while it makes
a lot of difference when used on top of an md device.

Comments anyone? ;)

Thanks.

/mjt

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: *terrible* direct-write performance with raid5
  2005-02-22 17:39 *terrible* direct-write performance with raid5 Michael Tokarev
@ 2005-02-22 20:11 ` Peter T. Breuer
  2005-02-22 21:43   ` Michael Tokarev
  2005-02-22 23:08 ` dean gaudet
  1 sibling, 1 reply; 7+ messages in thread
From: Peter T. Breuer @ 2005-02-22 20:11 UTC (permalink / raw)
  To: linux-raid

Michael Tokarev <mjt@tls.msk.ru> wrote:
> When debugging some other problem, I noticed that
> direct-io (O_DIRECT) write speed on a software raid5

And normal write speed (over 10 times the size of ram)?

> is terribly slow.  Here's a small table just to show
> the idea (not numbers by itself as they vary from system
> to system but how they relate to each other).  I measured
> "plain" single-drive performance (sdX below), performance
> of a raid5 array composed from 5 sdX drives, and ext3
> filesystem (the file on the filesystem was pre-created

And ext2? You will be enormously hampered by using a journalling file
system, especially with the journal on the same device as the one you
are testing! At least put the journal elsewhere - and preferably leave
it off.

> during tests).  Speed measurements performed with an 8-Kbyte
> buffer, i.e. write(fd, buf, 8192); units are MB/sec.
> 
>            write   read
>    sdX      44.9   45.5
>    md        1.7*  31.3
> fs on md    0.7*  26.3
> fs on sdX  44.7   45.3
> 
> "Absolute winner" is a filesystem on top of a raid5 array:

I'm afraid there are too many influences to say much from it overall.
The "legitimate" (i.e. controlled) experiment there is between sdX and
md (over sdX), with O_DIRECT both times.  For reference I personally
would like to see the speed without O_DIRECT on those two.  And note
the transfer-size/RAM ratio - you want to transfer over ten times the
size of RAM when you run without O_DIRECT.

Then I would like to see a similar comparison made over hdX instead of
sdX.

You can forget the fs-based tests for the moment, in other words. You
already have plenty there to explain in the sdX/md comparison. And to
explain it I would like to see sdX replaced with hdX.

A time-wise graph of the instantaneous speed to disk would probably
also be instructive, but I guess you can't get that!

I would guess that you are seeing the results of one read and write to
two disks happening in sequence and without any great urgency.  Are the
writes sent to each of the mirror targets from raid without going
through the VMS too?  I'd suspect not - surely the requests are just
queued as normal by raid5 via the block device system.  I don't think
the O_DIRECT taint persists on the requests - surely it only exists on
the file/inode used for access.

Suppose the mirrored requests are NOT done directly - then I guess we
are seeing an interaction with the VMS, where priority inversion causes
the high-priority requests to the md device to wait on the fulfilment
of low-priority requests to the sdX devices below them.  The requests
to the sdX devices may never get treated until the buffers in question
age sufficiently, or until the kernel finds time for them.  When is
that?  Well, the kernel won't let your process run .. hmm.  I'd suspect
the raid code should be deliberately signalling the kernel to run the
request_fn of the mirror devices more often.

> Comments anyone? ;)

Random guesses above. Purely without data, of course.

Peter


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: *terrible* direct-write performance with raid5
  2005-02-22 20:11 ` Peter T. Breuer
@ 2005-02-22 21:43   ` Michael Tokarev
  2005-02-22 22:27     ` Peter T. Breuer
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Tokarev @ 2005-02-22 21:43 UTC (permalink / raw)
  To: linux-raid

Peter T. Breuer wrote:
> Michael Tokarev <mjt@tls.msk.ru> wrote:
> 
>>When debugging some other problem, I noticed that
>>direct-io (O_DIRECT) write speed on a software raid5
> 
> And normal write speed (over 10 times the size of ram)?

There's no such term as "normal write speed" in this context
in my dictionary, because there are just too many factors
influencing the speed of non-direct I/O operations (the I/O
scheduler aka elevator is the main factor, I guess).  Moreover,
when going through the buffer cache, "cache thrashing" plays a
significant role for the whole system (e.g., when just copying
a large amount of data with cp, the system becomes quite
unresponsive due to "cache thrashing", while the stuff being
copied should not be cached in the first place, for this task
anyway).  Also, I don't think the linux elevator is optimized
for this task (accessing large amounts of data).

I came across this issue when debugging a very slow database
(oracle10 in this case) which tries to do direct I/O where
possible because "it knows better" when/how to cache data.
If I "turn on" the vfs/block cache here, the system becomes
much slower (under normal conditions anyway, not counting this
md slowness) (ok ok, I know it isn't a good idea to place
database files on raid5... or wasn't some time ago, when
raid5 checksumming was the bottleneck anyway... but that's
a different story).

More to the point seems to be the same direct-io but in
larger chunks - e.g. a 1Mb or larger buffer instead of 8Kb.
And this indeed makes a lot of difference; the numbers look
much nicer.

>>is terribly slow.  Here's a small table just to show
>>the idea (not numbers by itself as they vary from system
>>to system but how they relate to each other).  I measured
>>"plain" single-drive performance (sdX below), performance
>>of a raid5 array composed from 5 sdX drives, and ext3
>>filesystem (the file on the filesystem was pre-created
> 
> And ext2? You will be enormously hampered by using a journalling file
> system, especially with journal on the same system as the one you are
> testing! At least put the journal elsewhere - and preferably leave it
> off.

This whole issue has exactly nothing to do with the journal.
I don't mount the fs with the data=journal option, and the
file I'm writing to is "preallocated" first (I create a file
of the given size before measuring re-writing speed).  In
this case, data never touches the ext3 journal.

>>during tests).  Speed measurements performed with an 8-Kbyte
>>buffer, i.e. write(fd, buf, 8192); units are MB/sec.
[]
>>"Absolute winner" is a filesystem on top of a raid5 array:
> 
> I'm afraid there are too many influences to say much from it overall.
> The "legitimate" (i.e.  controlled) experiment there is between sdX and
> md (over sdx), with o_direct both times.  For reference I personally
> would like to see the speed withut o_direct on those two.  And the
> size/ram of the transfer - you want to run over ten times size of ram
> when you run without o_direct.

I/O speed without O_DIRECT is very close to 44 MB/sec for sdX (it's
the speed of the drives, it seems), and md performs at about 80 MB/sec.
Those numbers are very close to the case with O_DIRECT and a large
block size (e.g. 1Mb).

There's much more to the block size, really.  I just used an 8Kb block
because I have a real problem with the performance of our database
server (we're trying oracle10 and it performs very badly, and now I
don't know whether the machine has always been like this (the numbers
above) and we just never noticed it with the different usage patterns
of previous oracle releases, or whether something else changed...).

> Then I would like to see a similar comparison made over hdX instead of
> sdX.

Sorry, no IDE drives here, and I don't see the point in trying them anyway.

> You can forget the fs-based tests for the moment, in other words. You
> already have plenty there to explain in the sdX/md comparison. And to
> explain it I would like to see sdX replaced with hdX.
> 
> A time-wise graph of the instantaneous speed to disk would probably
> also be instructive, but I guess you can't get that!
> 
> I would guess that you are seeing the results of one read and write to
> two disks happening in sequence and not happening with any great
> urgency.  Are the writes sent to each of the mirror targets from raid

Hmm point.

> without going through VMS too?  I'd suspect that - surely the requests
> are just queued as normal by raid5 via the block device system. I don't
> think the o_direct taint persists on the requests - surely it only
> exists on the file/inode used for access.

Well, O_DIRECT performs very, very similarly to O_SYNC here (in both
cases -- with and without a filesystem involved) in terms of speed.

I don't care much now whether it really performs direct I/O (from the
userspace buffer directly to the controller), esp. since it can't work
exactly this way with raid5 implemented in software (checksums must
be written too).  I just don't want to see unnecessary cache thrashing,
and I do want to know about I/O errors immediately.

> Suppose the mirrored requests are NOT done directly - then I guess we
> are seeing an interaction with the VMS, where priority inversion causes
> the high-priority requests to the md device to wait on the fulfilment of
> low priority requests to the sdX devices below them.  The sdX devices
> requests may not ever get treated until the buffers in question age
> sufficiently, or until the kernel finds time for them. When is that?
> Well, the kernel won't let your process run .. hmm. I'd suspect the
> raid code should be deliberately signalling the kernel to run the
> request_fn of the mirror devices more often.

I guess if that's the case, buffer size should not make much difference.

>>Comments anyone? ;)
> 
> Random guesses above. Purely without data, of course.

Heh.  Thanks anyway ;)

/mjt

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: *terrible* direct-write performance with raid5
  2005-02-22 21:43   ` Michael Tokarev
@ 2005-02-22 22:27     ` Peter T. Breuer
  0 siblings, 0 replies; 7+ messages in thread
From: Peter T. Breuer @ 2005-02-22 22:27 UTC (permalink / raw)
  To: linux-raid

Michael Tokarev <mjt@tls.msk.ru> wrote:
> Peter T. Breuer wrote:
> > Michael Tokarev <mjt@tls.msk.ru> wrote:
> > 
> >>When debugging some other problem, I noticied that
> >>direct-io (O_DIRECT) write speed on a software raid5
> > 
> > And normal write speed (over 10 times the size of ram)?
> 
> There's no such term as "normal write speed" in this context
> in my dictionary, because there are just too many factors
> influencing the speed of non-direct I/O operations (I/O

Well, I said to use over 10 times the size of RAM, so that we get a
good picture of an average sort of situation.

> scheduler aka elevator is the main factor I guess).  More,

I would only want to see the influence of the VMS.

> when going over the buffer cache, "cache thrashing" plays a

I doubt that it influences anything here. But it can be tested.

> More to the point seems to be the same direct-io but in
> larger chunks - eg 1Mb or more instead of 8Kb buffer.  And

Direct I/O is done in blocks and MAYBE in multiples of blocks.  You
should really perform all tests at a single block size first to get a
good picture, or check the kernel code to see whether splitting occurs.
I don't know if it does.

> this indeed makes alot of difference, the numbers looks

It seems irrelevant to the immediate problem, which is to explain the
discrepancy in your observed figures, not to find some situation in
which you get figures without the discrepancy!

> > And ext2? You will be enormously hampered by using a journalling file
> > system, especially with journal on the same system as the one you are
> > testing! At least put the journal elsewhere - and preferably leave it
> > off.
> 
> This whole issue has exactly nothing to do with journal.

Then you can leave it off :(.

> > I'm afraid there are too many influences to say much from it overall.
> > The "legitimate" (i.e.  controlled) experiment there is between sdX and
> > md (over sdx), with o_direct both times.  For reference I personally
> > would like to see the speed withut o_direct on those two.  And the
> > size/ram of the transfer - you want to run over ten times size of ram
> > when you run without o_direct.
> 
> I/O speed without O_DIRECT is very close to 44 MB/sec for sdX (it's
> the speed of the drives, it seems), and md performs at about 80 MB/sec.  Those

For large transfers? I don't see how MD can be faster than the raw
drive on write! I would suspect that the transfer was not large enough
to measure well.

> numbers are very close to the case with O_DIRECT and large block size
> (eg 1Mb).
> 
> There's much more to the block size really.  I just used 8Kb block because

You should really use 4KB to get a good picture of the problem.

> > Then I would like to see a similar comparison made over hdX instead of
> > sdX.
> 
> Sorry no IDE drives here, and i don't see the point in trying them anyway.

So that we can locate whether the problem is in the md driver or in the
sd driver. (i.e. "see the point" :-).


> > You can forget the fs-based tests for the moment, in other words. You
> > already have plenty there to explain in the sdX/md comparison. And to
> > explain it I would like to see sdX replaced with hdX.
> > 
> > A time-wise graph of the instantaneous speed to disk would probably
> > also be instructive, but I guess you can't get that!
> > 
> > I would guess that you are seeing the results of one read and write to
> > two disks happening in sequence and not happening with any great
> > urgency.  Are the writes sent to each of the mirror targets from raid
> 
> Hmm point.
> 
> > without going through VMS too?  I'd suspect that - surely the requests
> > are just queued as normal by raid5 via the block device system. I don't
> > think the o_direct taint persists on the requests - surely it only
> > exists on the file/inode used for access.
> 
> Well, O_DIRECT performs very-very similar to O_SYNC here (both cases --
> with and without a filesystem involved) in terms of speed.

I don't see the relevance of the remark...?  O_SYNC is not as
synchronous as O_DIRECT and in particular does not bypass the VMS.  If
you were to do an O_SYNC write in 8KB lumps it would be very like an
O_DIRECT write in 8KB lumps, however.  O_SYNC requires FS support as
far as I recall, but I may be wrong.

A considerable difference between the two would likely be visible if you
used much larger blocksize or wrote with two processes at once.

> I don't care much now whenever it relly performs direct I/O (from
> userspace buffer directly to controller), esp. since it can't work
> exactly this way with raid5 implemented in software (checksums must
> be written too).

Well, it does work rather in that direction.  Each userspace write
gives rise immediately to one read and two writes (or more?) aimed at
the disk controller.  My point was that those further requests probably
pass through the VMS, rather than being sent directly to the controller.
In particular, the read may come from VMS buffers filled by previous
readahead rather than directly from the disk.  And the writes may not
go to the controller immediately either, but instead go to the VMS and
then hang around until the kernel decides to tell the controller that
it has requests waiting that need attention.

I suggested that the raid5 driver may not be taking pains to inform
the mirror targets that they have work waiting NOW after sending off
the mirror requests.  As far as I recall it just does a make_request()
(unchecked!).  If it were making an effort to honour O_DIRECT it might
want to schedule itself out after the make_request, thus giving the
kernel a chance to run the controller's request function and handle the
requests just submitted.  Or it might want to signal disk_tq (or
whatever handles the request-function sweep nowadays).  There are
probably little things it can set on the buffers it creates to make
them age fast, too.

As it is, things might be a little stalemated at that point - not that I
know for sure, but I can imagine that it might be so.  There's an
opportunity for a lack of pressure to meet another lack of urgency, and
for the two to try and wait each other out ...


> I just don't want to see unnecessary cache trashing

I doubt the cache has much to do with it.  But what happens when you
vary the CPU cache size?  Does the relative difference become more or
less?  Do you have any direct evidence for cache effects?

> and do want to know about I/O errors immediately.
> 
> > Suppose the mirrored requests are NOT done directly - then I guess we
> > are seeing an interaction with the VMS, where priority inversion causes
> > the high-priority requests to the md device to wait on the fulfilment of
> > low priority requests to the sdX devices below them.  The sdX devices
> > requests may not ever get treated until the buffers in question age
> > sufficiently, or until the kernel finds time for them. When is that?
> > Well, the kernel won't let your process run .. hmm. I'd suspect the
> > raid code should be deliberately signalling the kernel to run the
> > request_fn of the mirror devices more often.
> 
> I guess if that's the case, buffer size should not make much difference.

If direct-I/O requests are not split into 4KB units in the kernel, then
any per-request delay will weigh relatively less the larger (and fewer)
the requests are.  So I don't see a rationale for that statement.
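
In toy numbers (a back-of-the-envelope model; the per-request delay
and raw transfer rate below are made-up figures, not measurements):

```python
def mb_per_sec_with_delay(total_bytes, req_bytes, per_req_delay_s, raw_mb_s):
    """Throughput when every request pays a fixed delay on top of the raw
    transfer time: fewer, larger requests amortize the delay better."""
    nreq = total_bytes / req_bytes
    xfer_s = total_bytes / (raw_mb_s * 1e6)
    return total_bytes / (xfer_s + nreq * per_req_delay_s) / 1e6

if __name__ == "__main__":
    # 64 MB at a raw 44 MB/sec, with a hypothetical 4 ms stall per request:
    # an 8 KB request size is crushed by the delay; 1 MB barely notices it.
    for req in (8 * 1024, 32 * 1024, 1024 * 1024):
        print(req // 1024, round(mb_per_sec_with_delay(64e6, req, 0.004, 44.0), 1))
```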

> >>Comments anyone? ;)
> > 
> > Random guesses above. Purely without data, of course.
> 
> Heh.  Thanks anyway ;)

No problemo.

Peter


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: *terrible* direct-write performance with raid5
  2005-02-22 17:39 *terrible* direct-write performance with raid5 Michael Tokarev
  2005-02-22 20:11 ` Peter T. Breuer
@ 2005-02-22 23:08 ` dean gaudet
  2005-02-23 17:38   ` Michael Tokarev
  1 sibling, 1 reply; 7+ messages in thread
From: dean gaudet @ 2005-02-22 23:08 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: linux-raid

On Tue, 22 Feb 2005, Michael Tokarev wrote:

> When debugging some other problem, I noticed that
> direct-io (O_DIRECT) write speed on a software raid5
> is terribly slow.  Here's a small table just to show
> the idea (not the numbers by themselves, as they vary from
> system to system but how they relate to each other).  I measured
> "plain" single-drive performance (sdX below), performance
> of a raid5 array composed of 5 sdX drives, and ext3
> filesystem (the file on the filesystem was pre-created
> during tests).  Speed measurements performed with an 8-Kbyte
> buffer, i.e. write(fd, buf, 8192); units are MB/sec.

with O_DIRECT you told the kernel it couldn't cache anything... you're 
managing the cache.  you should either be writing 64KiB or you should 
change your chunksize to 8KiB (if it goes that low).
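
the arithmetic behind that suggestion can be sketched as follows
(stripe geometry only, using the 5-disk, 64KiB-chunk array above):

```python
def full_stripe_bytes(ndisks, chunk_bytes):
    """One raid5 stripe holds (ndisks - 1) data chunks plus one parity chunk."""
    return (ndisks - 1) * chunk_bytes

def needs_read_modify_write(write_bytes, ndisks, chunk_bytes):
    """A write that doesn't cover whole stripes forces the driver to read
    old data (or old parity) before it can compute the new parity."""
    return write_bytes % full_stripe_bytes(ndisks, chunk_bytes) != 0

if __name__ == "__main__":
    # 5 drives, 64 KiB chunks: a full stripe carries 256 KiB of data,
    # so every 8 KiB O_DIRECT write pays the read-modify-write penalty.
    print(full_stripe_bytes(5, 64 * 1024))
    print(needs_read_modify_write(8 * 1024, 5, 64 * 1024))
```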

-dean

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: *terrible* direct-write performance with raid5
  2005-02-22 23:08 ` dean gaudet
@ 2005-02-23 17:38   ` Michael Tokarev
  2005-02-23 17:55     ` Peter T. Breuer
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Tokarev @ 2005-02-23 17:38 UTC (permalink / raw)
  To: linux-raid

dean gaudet wrote:
> On Tue, 22 Feb 2005, Michael Tokarev wrote:
> 
> 
>>When debugging some other problem, I noticed that
>>direct-io (O_DIRECT) write speed on a software raid5
>>is terribly slow.  Here's a small table just to show
>>the idea (not the numbers by themselves, as they vary from
>>system to system but how they relate to each other).  I measured
>>"plain" single-drive performance (sdX below), performance
>>of a raid5 array composed of 5 sdX drives, and ext3
>>filesystem (the file on the filesystem was pre-created
>>during tests).  Speed measurements performed with an 8-Kbyte
>>buffer, i.e. write(fd, buf, 8192); units are MB/sec.
> 
> with O_DIRECT you told the kernel it couldn't cache anything... you're 
> managing the cache.  you should either be writing 64KiB or you should 
> change your chunksize to 8KiB (if it goes that low).

The picture does not change at all when changing the raid chunk size.
With an 8kb chunk the speed is exactly the same as with a 64kb or
256kb chunk.

Yes, increasing the write buffer size helps a lot.  Here's the write
performance in MB/sec for direct writes to an md device (a raid5
array built from 5 drives), as a function of the write buffer size
(in kb):

  buffer  md raid5    sdX
            speed    speed
      1      0.2       14
      2      0.4       26
      4      0.9       41
      8      1.7       44
     16      3.9       44
     32     72.6       44
     64     84.6       ..
    128     97.1
    256     53.7
    512     64.1
   1024     74.5

I've no idea why there's a drop in speed after a 128kb blocksize, but
more important is the huge drop going from a 32kb to a 16kb blocksize.

The numbers are almost exactly the same with several chunk sizes --
256kb, 64kb (the default), 8kb and 4kb.

(Note raid5 performs faster than a single drive; that's to be
expected, as it is possible to write to several drives in parallel.)

The numbers also do not depend much on seeking -- obviously the
speed with seeking is worse than the sequential-write figures above,
but not much worse (about 10..20% worse, not >20 times as with 72 vs
4 MB/sec).

/mjt


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: *terrible* direct-write performance with raid5
  2005-02-23 17:38   ` Michael Tokarev
@ 2005-02-23 17:55     ` Peter T. Breuer
  0 siblings, 0 replies; 7+ messages in thread
From: Peter T. Breuer @ 2005-02-23 17:55 UTC (permalink / raw)
  To: linux-raid

Michael Tokarev <mjt@tls.msk.ru> wrote:
> (Note raid5 performs faster than a single drive; that's to be
> expected, as it is possible to write to several drives in parallel.)

Each raid5 write must include at least ONE write to a target.  I think
you're saying that the writes go to different targets from time to time
and that when the targets are the bottleneck you get faster-than-normal
response.

Hmmmm.  That's actually quite difficult to calculate, because if, say,
you have three raid disks, then every time you write to the array you
write to two of those three (forget the read, which will come via
readahead and buffers).  Suppose that's no slower than one write to one
disk; how could you get any speed INCREASE?

Well, only by writing to a different two of the three each time, or
nearly each time.  If you first write to AB, then to BC, then to CA,
and repeat, then you have written 3 times but only kept each disk busy
2/3 of the time, so I suppose there is some opportunity for pipelining.
Can anyone see where?

   A B C  A B C  ...
   B C A  B C A  ...
   1 2 3  1 2 3

Maybe like this:


   A1 A3  A1 A3  ...
   B1 B2  B1 B2  ...
   C2 C3  C2 C3  ...

Yes. That seems to preserve local order and go 50% faster.
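
The 50% figure can be checked with a tiny load-balancing model (a
sketch; it assumes every array write costs one time unit on each of
two disks, rotating through the pairs as above):

```python
def makespan(nwrites, ndisks=3):
    """Write i puts one unit of work on disks i%n and (i+1)%n (the
    rotating pairs AB, BC, CA, ...).  Disks drain their queues in
    parallel, so the batch finishes when the busiest disk finishes."""
    load = [0] * ndisks
    for i in range(nwrites):
        load[i % ndisks] += 1
        load[(i + 1) % ndisks] += 1
    return max(load)

def speedup_vs_single_disk(nwrites, ndisks=3):
    """A single disk would need nwrites time units for the same batch."""
    return nwrites / makespan(nwrites, ndisks)

if __name__ == "__main__":
    # 3 writes over pairs AB, BC, CA: each disk carries 2 units, so the
    # batch takes 2 units instead of 3 -- the 50% gain described above.
    print(speedup_vs_single_disk(3))  # 1.5
```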

Peter


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2005-02-23 17:55 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-02-22 17:39 *terrible* direct-write performance with raid5 Michael Tokarev
2005-02-22 20:11 ` Peter T. Breuer
2005-02-22 21:43   ` Michael Tokarev
2005-02-22 22:27     ` Peter T. Breuer
2005-02-22 23:08 ` dean gaudet
2005-02-23 17:38   ` Michael Tokarev
2005-02-23 17:55     ` Peter T. Breuer
