Re: How about reversed (or offset) disk components?

From: "Alan D. Brunelle" <Alan.Brunelle@hp.com>
To: NeilBrown <neilb@suse.de>
Cc: for.poige+linux@gmail.com, linux-raid@vger.kernel.org
Subject: Re: How about reversed (or offset) disk components?
Date: Wed, 12 Nov 2008 15:12:46 -0500
Message-ID: <491B38BE.6070309@hp.com>
In-Reply-To: <62a7356ceb508b25c3951da7fc7d10b5.squirrel@neil.brown.name>

NeilBrown wrote:
> On Wed, November 12, 2008 5:40 am, Igor Podlesny wrote:
>> 	Hi!
>>
>> 	And I have one more idea: How about reversed (or offset) disk
>> components? -- It's known that linear read speed is decreasing when
>> reading from the beginning to end of HDD, thus leading to situation
>> when reading from RAID is good at its beginning and rather poor at its
>> ending. My suggestion (possibly) would make that speed almost constant
>> regardless of reading position. Examples:
>>
>> 	RAID5:
>>
>> 	disk1: 0123456789
>> 	disk2: 3456789012
>> 	disk3: 6789012345
>>
>> 	i.e., disk1's chunks aren't offset at all, and those of disks 2 and 3 are.
>>
>> 	RAID0:
>>
>> 	disk1: 0123456789
>> 	disk2: 9876543210
>>
>> 	Any drawbacks?
> 
> It is hard to be really sure without implementing the layout and making
> measurements.
> You could probably do this by partitioning each device into three partitions,
> combining those together with a linear array so they are in a different
> order, then combining the three linear arrays into a raid5.

I'm not quite sure I did _exactly_ this - but I have some graphs which
may help explain this a bit. Caveat emptor: _one_ type of storage and
_one_ set of results...

o  16-way AMD64 box (128GB RAM, Smart Array (CCISS) P800 w/ 24 x 300GB disks)

o  Linux 2.6.28-rc3

o  Used 4 disks behind the P800: each was partitioned into 4 pieces:

# parted /dev/cciss/c2d0 print

Model: Compaq Smart Array (cpqarray)
Disk /dev/cciss/c2d0: 300GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      17.4kB  75.0GB  75.0GB               primary
 2      75.0GB  150GB   75.0GB               primary
 3      150GB   225GB   75.0GB               primary
 4      225GB   300GB   75.0GB               primary

==============================================================

First, I did some asynchronous direct I/Os for each of the partitions:
doing random (4KiB) / sequential (512KiB) reads & writes. The graphs are at:

http://free.linux.hp.com/~adb/2008-11-12/rawdisks.pnm

(Each disk is a separate color, and each bunch of vertical bars
represents a specific partition.)

It shows that for random I/Os there's a _slight_ tendency to go slower
as one gets to the latter parts of the disk (last partition), but not
much - and there's a lot of variability. [The seek times probably swamp
the I/O transfer times here.]

For sequential I/Os there's a noticeable decline in performance for all
4 disks as one proceeds towards the end - 25-30% drops for both reads &
writes between the first & last parts of the disk.

==============================================================

Next I did two sets of runs with MD devices made out of these
partitioned disks - see

http://free.linux.hp.com/~adb/2008-11-12/standard.pnm

("Standard") Made 4 MDs, increasing the partition with each MD - thus
/dev/md1 was constructed out of /dev/cciss/c2d[0123]p1, /dev/md2 out of
/dev/cciss/c2d[0123]p2, ..., /dev/md4 out of /dev/cciss/c2d[0123]p4.
(/dev/md1 out of the "fastest" partition on each disk, and /dev/md4 out
of the "slowest" partition on each disk.) [[These are the black bars in
the graphs.]]

("Offset") Made 4 MDs, staggering the partitions as Neil suggested -
thus /dev/md1 had /dev/cciss/c2d0p1 + /dev/cciss/c2d1p2 +
/dev/cciss/c2d2p3 + /dev/cciss/c2d3p4 and /dev/md2 had d0p2 + d1p3 +
d2p4 + d3p1 and so on. [[These are the red bars in the graphs.]]
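
To make the two layouts concrete, here is a minimal Python sketch that
enumerates which partition of each disk ends up in each MD. The function
names are made up for illustration, and the wrap-around rule in
offset_layout() is just my reading of the "d0p2 + d1p3 + d2p4 + d3p1"
pattern above - it is not the script that built the arrays.

```python
# Illustrative sketch only: enumerate MD membership for the two layouts.
# Function names are hypothetical; device names follow the
# /dev/cciss/c2d[0123]p[1234] scheme used in this message.

N_DISKS = 4
N_PARTS = 4


def dev(disk, part):
    """Return the device path for a disk index (0-3) and partition number (1-4)."""
    return f"/dev/cciss/c2d{disk}p{part}"


def standard_layout(md):
    """Standard layout: /dev/mdN uses partition N on every disk."""
    return [dev(d, md) for d in range(N_DISKS)]


def offset_layout(md):
    """Offset layout: /dev/mdN starts at partition N on disk 0 and moves to
    the next partition (wrapping around) on each following disk."""
    return [dev(d, (md - 1 + d) % N_PARTS + 1) for d in range(N_DISKS)]


for md in range(1, N_PARTS + 1):
    print(f"md{md} standard:", " ".join(standard_layout(md)))
    print(f"md{md} offset:  ", " ".join(offset_layout(md)))
```

Note that under this rule every "Offset" MD contains exactly one p4
member, i.e. one piece from the slowest region of some disk.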

Strange results came out of this - granted it was one set of runs, so
some variability is to be expected.

For random reads/writes we again see seek times swamp any I/O transfer
peculiarities, and there is nothing to show that an "Offset"
configuration helps.

Anyways, the sequential read picture makes great sense: with the
"Standard" setup we see decreasing performance of the RAID0 sets as we
utilize "slower" partitions: ~470MiB/sec down to ~350MiB/sec. With the
"Offset" partitions, we are truly gated by the slowest partition - so we
get consistency, but overall slower performance: ~350MiB/sec across the
board.
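
A back-of-the-envelope model of that gating, as a sketch only: assume a
striped sequential read runs at roughly (number of members) x (speed of
the slowest member). The per-partition speeds below are NOT measured:
the p1 and p4 values are simply the ~470 and ~350MiB/sec aggregates
divided by 4, and the p2/p3 values are invented interpolations.

```python
# Rough model, not a measurement: a striped sequential read is assumed to
# run at (number of members) * (speed of the slowest member).
# Per-partition speeds in MiB/sec are ASSUMED: p1 and p4 are the ~470 and
# ~350 aggregates divided by 4; p2 and p3 are invented interpolations.
PART_SPEED = {1: 117.5, 2: 107.5, 3: 97.5, 4: 87.5}
N_DISKS = 4


def stripe_speed(parts):
    """Estimated throughput of a stripe whose members sit on these partitions."""
    return len(parts) * min(PART_SPEED[p] for p in parts)


def standard_parts(md):
    """Standard layout: every member of mdN is partition N."""
    return [md] * N_DISKS


def offset_parts(md):
    """Offset layout: partition number advances (with wrap-around) per disk."""
    return [(md - 1 + d) % len(PART_SPEED) + 1 for d in range(N_DISKS)]


for md in range(1, 5):
    print(f"md{md}: standard ~{stripe_speed(standard_parts(md)):.0f}, "
          f"offset ~{stripe_speed(offset_parts(md)):.0f} MiB/sec")
```

Under these assumptions the "Standard" MDs step down from ~470 to ~350
while every "Offset" MD comes out at ~350, since each one contains a p4
member. That matches the shape of the graph, but the model itself is
only a guess at the mechanism.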

The sequential write picture is kind of messy - the "Offset"
configuration again shows somewhat gated performance (all around
~325MiB/sec) but the "Standard" config goes up and down - I _think_ this
may just be an artifact of the write-caching on the P800. If need be,
I could disable that, and I'd _guess_ we'd see a picture more in line
with the sequential reads.

Alan
