* parallelism of device use in md
From: Andy Smith @ 2006-01-17 12:09 UTC (permalink / raw)
To: linux-raid
I'm wondering: how well does md currently make use of the fact there
are multiple devices in the different (non-parity) RAID levels for
optimising reading and writing?
For example, are *writes* to a 2 device RAID-0 approaching twice as
fast as to a single device? If not, are they any faster at all?
Are reads from a 2 device RAID-1 twice as fast as from a single
device? If there are benefits, how quickly do they degrade to
nothing as disks are added?
What does the picture look like for reads and writes to a 4 device
RAID-10?
Sorry if my subject line isn't clear, but I couldn't think of a
better way to put it.
Thanks,
Andy

* Re: parallelism of device use in md
From: Neil Brown @ 2006-01-17 23:04 UTC (permalink / raw)
To: Andy Smith; +Cc: linux-raid

On Tuesday January 17, andy@lug.org.uk wrote:
> I'm wondering: how well does md currently make use of the fact there
> are multiple devices in the different (non-parity) RAID levels for
> optimising reading and writing?

It does the best it can. Every request from the filesystem goes
directly to the device it should. Of course, if all the blocks the
filesystem requests happen to be on the same drive, there isn't a lot
that md can do...

> For example, are *writes* to a 2 device RAID-0 approaching twice as
> fast as to a single device? If not, are they any faster at all?
> Are reads from a 2 device RAID-1 twice as fast as from a single
> device? If there are benefits, how quickly do they degrade to
> nothing as disks are added?

Yes. Under a reasonably heavy load, both reads and writes will be
close to twice as fast on a 2 device RAID-0 as on a single device
(provided the bus doesn't become a bottleneck).

> What does the picture look like for reads and writes to a 4 device
> RAID-10?

Much the same.

> Sorry if my subject line isn't clear, but I couldn't think of a
> better way to put it.

Clear enough.

NeilBrown
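
A minimal way to check the streaming numbers Neil describes, assuming
/dev/md0 is the 2 device RAID-0 and /dev/sda is one of its members (the
device names and sizes here are only illustrative, and iflag=direct
needs a reasonably recent GNU dd):

    # sequential read from the RAID-0, bypassing the page cache
    dd if=/dev/md0 of=/dev/null bs=1M count=2048 iflag=direct
    # the same read from a single member, for comparison
    dd if=/dev/sda of=/dev/null bs=1M count=2048 iflag=direct

A write comparison works the same way with of= pointing at a scratch
array, never at a device holding data you care about.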

* Re: parallelism of device use in md
From: Tim Moore @ 2006-01-18 0:23 UTC (permalink / raw)
To: linux-raid

Andy Smith wrote:
> ...
> For example, are *writes* to a 2 device RAID-0 approaching twice as
> fast as to a single device? If not, are they any faster at all?
> Are reads from a 2 device RAID-1 twice as fast as from a single
> device? If there are benefits, how quickly do they degrade to
> nothing as disks are added?

Server development where I work uses a 3-way mirror for the system
bits, but that would be a costly solution for any real amount of
storage and/or write performance.

Here is the tail end of a pair of 120GB WD SATA-I drives (SiI3112
chipset, sata_sil driver, 2.4.32 kernel). A three- or four-way stripe
should gain proportionally more, provided the drives are on separate
controller channels; of course, risk scales with performance.

[16:05] abit:~ > cat /proc/mdstat | head -7
Personalities : [raid0] [raid1] [raid5]
read_ahead 1024 sectors
md14 : active raid0 sdb13[1] sda13[0]
      114575616 blocks 32k chunks
md13 : active raid1 sdb12[1] sda12[0]
      20113216 blocks [2/2] [UU]

[16:06] abit:~ > hdparm -tT /dev/{md14,sd{a,b}13,md13,sd{a,b}12}

/dev/md14:
 Timing buffer-cache reads:  1908 MB in 2.00 seconds = 954.00 MB/sec
 Timing buffered disk reads:  272 MB in 3.01 seconds =  90.37 MB/sec

/dev/sda13:
 Timing buffer-cache reads:  1904 MB in 2.00 seconds = 952.00 MB/sec
 Timing buffered disk reads:  156 MB in 3.01 seconds =  51.83 MB/sec

/dev/sdb13:
 Timing buffer-cache reads:  1912 MB in 2.00 seconds = 956.00 MB/sec
 Timing buffered disk reads:  136 MB in 3.00 seconds =  45.33 MB/sec

/dev/md13:
 Timing buffer-cache reads:  1876 MB in 2.00 seconds = 938.00 MB/sec
 Timing buffered disk reads:  164 MB in 3.00 seconds =  54.67 MB/sec

/dev/sda12:
 Timing buffer-cache reads:  1904 MB in 2.00 seconds = 952.00 MB/sec
 Timing buffered disk reads:  166 MB in 3.02 seconds =  54.97 MB/sec

/dev/sdb12:
 Timing buffer-cache reads:  1892 MB in 2.00 seconds = 946.00 MB/sec
 Timing buffered disk reads:  146 MB in 3.00 seconds =  48.67 MB/sec

[16:08] abit:~ >
--
 | for direct mail add "private_" in front of user name

* Re: parallelism of device use in md
From: Mario 'BitKoenig' Holbe @ 2006-01-18 7:41 UTC (permalink / raw)
To: linux-raid

Tim Moore <linux-raid@nsr500.net> wrote:
> Andy Smith wrote:
>> Are reads from a 2 device RAID-1 twice as fast as from a single
> md14 : active raid0 sdb13[1] sda13[0]
> md13 : active raid1 sdb12[1] sda12[0]
>
> /dev/md14:
>  Timing buffered disk reads:  272 MB in 3.01 seconds = 90.37 MB/sec
> /dev/md13:
>  Timing buffered disk reads:  164 MB in 3.00 seconds = 54.67 MB/sec

And this is exactly the odd thing that I am also seeing and that has
been asked about many times on this list already, IIRC: why is the
single-stream read performance of a RAID1 so much worse than the read
performance of a RAID0?

A RAID1 should easily be able to match (or perhaps even exceed, since
it is not bound to chunk boundaries) the read performance of a RAID0.
As far as I can see, RAID1 only gets there when lots of read requests
are scheduled in parallel. Would it perhaps make sense to split a
single read across all mirrors that are currently idle?

regards
Mario
--
I've never been certain whether the moral of the Icarus story should
only be, as is generally accepted, "Don't try to fly too high," or
whether it might also be thought of as, "Forget the wax and feathers
and do a better job on the wings."  -- Stanley Kubrick
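
A quick way to see the parallel-read behaviour Mario describes, assuming
/dev/md0 is a 2 device RAID-1 with members sda and sdb (the device names,
sizes and the 8GiB offset are only illustrative):

    # two concurrent sequential readers, far apart on the array
    dd if=/dev/md0 of=/dev/null bs=1M count=1024 &
    dd if=/dev/md0 of=/dev/null bs=1M count=1024 skip=8192 &
    wait

Watching /proc/diskstats (or iostat, if sysstat is installed) while this
runs should show both members serving reads, whereas a single stream is
typically served from one member only.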

* Re: parallelism of device use in md
From: Mario 'BitKoenig' Holbe @ 2006-01-18 8:16 UTC (permalink / raw)
To: linux-raid

Mario 'BitKoenig' Holbe <Mario.Holbe@TU-Ilmenau.DE> wrote:
> are scheduled in parallel. Would it perhaps make sense to split a
> single read across all mirrors that are currently idle?

Ah, I got it from the other thread - seek times :)

Perhaps using some big (virtual) chunk size could do the trick. What
about using chunks so big that the data transfer takes longer than the
seek... assuming a data rate of 50MB/s, a 9ms average seek time would
call for chunks of at least 500kB, and a 14ms average seek time for
chunks of at least 750kB. However, since the blocks being read are
most likely fairly close together, it is not a typical average seek,
so even smaller chunks might do.

regards
Mario
--
<Sique> Huch? 802.1q? Was sucht das denn hier? Wie kommt das ans
TAGgeslicht?
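
As a back-of-the-envelope check of those figures (the 50MB/s rate and
the seek times are Mario's assumptions, not measurements): the chunk
needs to be large enough that transferring it takes at least as long as
the average seek spent reaching it, i.e.

    chunk size >= data rate * seek time
                = 50 MB/s * 0.009 s = 450 kB  (rounded up to ~500 kB)
                = 50 MB/s * 0.014 s = 700 kB  (rounded up to ~750 kB)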

* Re: parallelism of device use in md
From: Francois Barre @ 2006-01-18 17:55 UTC (permalink / raw)
To: linux-raid

2006/1/18, Mario 'BitKoenig' Holbe <Mario.Holbe@tu-ilmenau.de>:
> Mario 'BitKoenig' Holbe <Mario.Holbe@TU-Ilmenau.DE> wrote:
> > are scheduled in parallel. Would it perhaps make sense to split a
> > single read across all mirrors that are currently idle?
>
> Ah, I got it from the other thread - seek times :)
> Perhaps using some big (virtual) chunk size could do the trick. What
> about using chunks so big that the data transfer takes longer than the
> seek... assuming a data rate of 50MB/s, a 9ms average seek time would
> call for chunks of at least 500kB, and a 14ms average seek time for
> chunks of at least 750kB. However, since the blocks being read are
> most likely fairly close together, it is not a typical average seek,
> so even smaller chunks might do.
>
> regards
> Mario

Stop me if I'm wrong, but this is called... huge readahead. Instead of
reading 32k on drive0 and then 32k on drive1, you read a continuous
512k from drive0 (16*32k) and 512k from drive1, resulting in a 1M read.
Maybe for a single 4k page...

So my additional question would be: how well does md fit in with
Linux's and the filesystems' readahead policies?
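
For anyone who wants to see the readahead values involved, blockdev
(also mentioned later in this thread) reports and sets them per block
device; here /dev/md0, /dev/sda and /dev/sdb are only example names and
2048 is an arbitrary value (units are 512-byte sectors):

    blockdev --getra /dev/md0 /dev/sda /dev/sdb
    blockdev --setra 2048 /dev/md0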

* Re: parallelism of device use in md
From: Neil Brown @ 2006-01-18 23:34 UTC (permalink / raw)
To: Francois Barre; +Cc: linux-raid

On Wednesday January 18, francois.barre@gmail.com wrote:
> 2006/1/18, Mario 'BitKoenig' Holbe <Mario.Holbe@tu-ilmenau.de>:
> > Mario 'BitKoenig' Holbe <Mario.Holbe@TU-Ilmenau.DE> wrote:
> > > are scheduled in parallel. Would it perhaps make sense to split a
> > > single read across all mirrors that are currently idle?
> >
> > Ah, I got it from the other thread - seek times :)
> > Perhaps using some big (virtual) chunk size could do the trick. What
> > about using chunks so big that the data transfer takes longer than
> > the seek... assuming a data rate of 50MB/s, a 9ms average seek time
> > would call for chunks of at least 500kB, and a 14ms average seek
> > time for chunks of at least 750kB. However, since the blocks being
> > read are most likely fairly close together, it is not a typical
> > average seek, so even smaller chunks might do.
> >
> > regards
> > Mario
>
> Stop me if I'm wrong, but this is called... huge readahead. Instead of
> reading 32k on drive0 and then 32k on drive1, you read a continuous
> 512k from drive0 (16*32k) and 512k from drive1, resulting in a 1M
> read. Maybe for a single 4k page...
>
> So my additional question would be: how well does md fit in with
> Linux's and the filesystems' readahead policies?

The read balancing in raid1 is clunky at best. I've often thought
"there must be a better way", but I've never worked out what the
better way might be (though I haven't tried very hard).

If anyone would like to experiment with the read-balancing code,
suggest and test changes, it would be most welcome.

NeilBrown
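
For anyone taking Neil up on that, the RAID1 read balancing lives in the
raid1 personality; in a kernel source tree the routine can be located
with something like the following (the function name is from memory, so
check it against your tree):

    grep -n read_balance drivers/md/raid1.c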

* Re: parallelism of device use in md
From: Tuomas Leikola @ 2006-01-22 16:43 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid

On 1/19/06, Neil Brown <neilb@suse.de> wrote:
> The read balancing in raid1 is clunky at best. I've often thought
> "there must be a better way", but I've never worked out what the
> better way might be (though I haven't tried very hard).
>
> If anyone would like to experiment with the read-balancing code,
> suggest and test changes, it would be most welcome.

An interesting and desperately complex topic, intertwined with IO
schedulers in general. I'll follow up with my 2 cents.

The way I see it, there are two fundamentally different approaches:
optimize for throughput or optimize for latency.

When optimizing for latency, the balancer would always choose the
device that can serve a request in the shortest time. This is close to
what the current code does, although it doesn't seem to account for the
length of each device's pending request queue. (I'd estimate that for a
traditional ATA disk, 2-3 short-seek requests are worth one long seek,
because of spindle latency.) I'd assume "fair", in-order service for
the latency mode.

When optimizing for throughput, the balancer would choose the device
whose total queue completion time increases the least. This implies
reordering of requests and so on. For a queue depth of 1, the
throughput balancer would pick the "closest" available device as long
as the devices are idle; when they are all busy, it would leave the
requests in an array-wide queue until one of the devices becomes
available, and then dequeue the request that device can serve fastest
(or one whose deadline has been exceeded).

Both approaches become difficult once device queues are taken into
account. The throughput balancer, as described, could just estimate how
close the new request is to the others already queued on each device,
and pick the device whose queued work is nearby. The latency scheduler
is probably pretty much useless in this scenario, as its definition
breaks down once requests can push each other around. I'd expect it to
be useful in the common desktop configuration with no device queues,
though.

One thing I'd like to see is more powerful estimates of request cost
for a device. It's possible, if not practical, to profile devices for
things like spindle latency and sector locations. If this cost
estimation data is accurate enough, per-device queues become less
important as performance factors. As it is now, one can only hope that
requests which are near LBA-wise are also near time-wise, which is not
true for most devices.

Yes, I know it's mostly wishful thinking. Measurements would be tricky
and would produce complex maps for estimating costs, and (I think)
would be virtually impossible to get right for anything with device
queues. I'd expect that no drives on the market expose this kind of
latency estimation data to the controller or the OS. I'd also expect
that high-end storage vendors use the very same information in their
hardware RAID implementations to provide better queuing and load
balancing.

Both of the described balancer algorithms can be implemented fairly
easily and (I'd expect) will work reasonably well with common desktop
drives. They could be optional (like the IO schedulers currently are),
and different cost estimation algorithms could also be optional (and
tunable, if autotuning is out of the question).

Unfortunately my kernel hacking skills are too weak for most of this -
there needs to be someone else who's interested enough.
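
For reference on the "optional, like the IO schedulers" comparison:
recent 2.6 kernels already expose a per-device scheduler switch via
sysfs (the device name below is only an example):

    cat /sys/block/sda/queue/scheduler      # e.g. "noop anticipatory deadline [cfq]"
    echo deadline > /sys/block/sda/queue/scheduler

Something similar could presumably carry a per-array read-balance
policy, though as far as I know md exposes no such knob today.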

* Re: parallelism of device use in md
From: Mario 'BitKoenig' Holbe @ 2006-01-19 11:30 UTC (permalink / raw)
To: linux-raid

Francois Barre <francois.barre@gmail.com> wrote:
> 2006/1/18, Mario 'BitKoenig' Holbe <Mario.Holbe@tu-ilmenau.de>:
>> Mario 'BitKoenig' Holbe <Mario.Holbe@TU-Ilmenau.DE> wrote:
>> Perhaps using some big (virtual) chunk size could do the trick.
> Stop me if I'm wrong, but this is called... huge readahead. Instead of
> reading 32k on drive0 and then 32k on drive1, you read a continuous
> 512k from drive0 (16*32k) and 512k from drive1, resulting in a 1M read.
> Maybe for a single 4k page...

Yes, that would be the consequence. However, it would probably not be a
big issue, since:

a) the current default read-ahead for RAID1 is already 1024 (in
   512-byte sectors) anyway, and

b) in the hardware-RAID world, at least 3ware support recommends huge
   read-aheads for speeding up their RAID1s, too... AFAIK they
   recommend:

       vm.{min,max}-readahead=512
       blockdev --setra 6144 /dev/...

   which is far more than 1M. I don't know why they do so, but I could
   imagine they use a strategy similar to the one I suggested.

regards
Mario
--
Ho ho ho! I am Santa Claus of Borg. Nice assimilation all together!

* Re: parallelism of device use in md
From: Andy Smith @ 2006-01-18 9:50 UTC (permalink / raw)
To: linux-raid

On Tue, Jan 17, 2006 at 12:09:27PM +0000, Andy Smith wrote:
> I'm wondering: how well does md currently make use of the fact there
> are multiple devices in the different (non-parity) RAID levels for
> optimising reading and writing?

Thanks all for your answers.