From: Doug Ledford <dledford@redhat.com>
To: Steven Haigh <netwiz@crc.id.au>
Cc: linux-raid@vger.kernel.org
Subject: Re: Software RAID5 write issues
Date: Wed, 10 Jun 2009 21:46:32 -0400	[thread overview]
Message-ID: <1244684792.21347.231.camel@firewall.xsintricity.com> (raw)
In-Reply-To: <916523ED-3D03-4AB1-A64A-1797852ABE9B@crc.id.au>

On Thu, 2009-06-11 at 10:05 +1000, Steven Haigh wrote:
> Isn't the PCI bus limited to around 133MB/sec? If so, even with 3
> drives on the same controller, you would expect that, divided equally,
> each drive would get ~44MB/sec before overheads - not around 7MB/sec
> per drive. I know I'm not going to get phenomenal performance with my
> setup, but as most of the data is archival (and then copied to tape),
> I would like to get things at least up to a reasonable level instead
> of having a write speed of ~12% of the read speed.

It is, but that bandwidth is shared amongst all the cards on that
particular bus, and older motherboards in particular would daisy chain
busses such that a later bus only gets part of that bandwidth because
earlier busses are using it too.  Plus, the 133MB/s PCI figure is a
theoretical maximum; in practice you get less due to things like bus
arbitration overhead.
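
To put rough numbers on that, here's a back-of-the-envelope sketch in
Python (the 80% efficiency factor is just an assumption to stand in for
arbitration and protocol overhead, not a measurement of your board):

    # Rough per-drive bandwidth when several drives share one PCI bus.
    PCI_BUS_MB_S = 133.0   # 32-bit/33MHz PCI theoretical maximum
    EFFICIENCY = 0.80      # assumed overhead factor; real boards vary

    def per_drive_mb_s(drives_on_bus):
        usable = PCI_BUS_MB_S * EFFICIENCY
        return usable / drives_on_bus

    for n in (1, 2, 3):
        print(n, "drive(s): about", round(per_drive_mb_s(n), 1), "MB/s each")

Even that optimistic estimate only gets you ~35MB/s per drive with
three drives on the bus, and contention from anything else on the same
bus pulls it down further.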

> Hmmm - a very interesting read - but I am a little confused when it
> comes to PCI bandwidth. I would assume (maybe wrongly) that if I can
> READ from the array at 95MB/sec (as measured by bonnie++), then I
> should be able to write to the same array at somewhat better than
> 11MB/sec - as a read would usually read from 4 of 5 drives, whereas a
> write would go to all drives. That being said, I wouldn't expect one
> extra write to drop me to 12% of the read speed!

There are two factors involved in this (I'm speculating of course, but
here goes).

One, a read doesn't involve every drive in the array.  For any given
stripe, you will actually only read from 4 of the 5 drives.  Since 3 of
the drives are on the card, that means that for 3 out of 5 stripes, the
parity chunk lands on one of those drives, so that drive isn't used in
the read at all.  So, for 3 out of 5 stripes, you actually read from
two of the drives behind the card and two on the motherboard.  For the
other two stripes, you read from three of the drives behind the card
and one on the motherboard.  That accounts for a reasonable amount of
difference all by itself.  As an example, I have an external SATA drive
case here that holds 4 drives on a repeater and uses a single eSATA
cable to run the 4 drives.  When accessing a single drive, I get
132MB/s throughput.  When I access two drives, it drops to 60MB/s
throughput per drive.  When accessing three drives, it drops to 39MB/s
throughput per drive.  So you can see how, on read, not having to touch
all three drives behind the card can really help on specific stripes.
In other words, reading from only 4 drives at a time *helps* your
performance because, for most stripes, only two drives behind the PCI
card are in use, so they run faster and can keep up better with the two
drives on the motherboard.  Since writes always go to all 5 drives, you
always get the slower speed (and, on top of that, you are writing 25%
more data to disk relative to the amount of actual data transferred
than when you are reading).
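
If it helps to see the parity rotation spelled out, here's a small
sketch; it assumes a left-symmetric layout and that drives 0-2 are the
ones behind the PCI card, both of which are assumptions about your
particular array:

    # Which drives a full-stripe read touches in a 5-drive RAID5.
    # The parity chunk rotates, so each stripe reads only 4 drives.
    N_DRIVES = 5
    CARD_DRIVES = {0, 1, 2}   # assumed: the three drives behind the card

    for stripe in range(N_DRIVES):
        parity = (N_DRIVES - 1 - stripe) % N_DRIVES  # rotating parity
        data = [d for d in range(N_DRIVES) if d != parity]
        on_card = sum(1 for d in data if d in CARD_DRIVES)
        print("stripe %d: read drives %s (%d behind card, %d on board)"
              % (stripe, data, on_card, len(data) - on_card))

Run over five consecutive stripes, that prints 3 stripes touching only
two of the card drives and 2 stripes touching all three, which is where
the 3-out-of-5 figure above comes from.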

Two, you use a 1MB chunk size.  Given a 5-drive RAID5, that gives a 4MB
stripe width.  My guess is that your stripe width is large enough,
relative to your average write size, that your array is more often than
not performing a read/modify/write cycle for writing instead of a full
stripe write.  In a full stripe write, the md stack will write out all
four data chunks and a freshly calculated parity chunk without regard
to what the parity or data was before.  If, on the other hand, it
doesn't have enough data to write an entire stripe by the time it is
flushing things out, then it has to do a read/modify/write cycle.  The
particulars of what's most efficient in this case depend on how many
chunks are being overwritten in the stripe, but regardless it means
reading in parts of the stripe and parity first, then doing xor
operations, then writing new data and new parity back out.  This means
that at least some of the 5 drives are doing both reads and writes in a
single stripe operation.  So, I think read/modify/write cycles combined
with the poor performance of the drives behind the PCI card can easily
account for the drastic difference you are seeing between read and
write speeds.
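
As a rough illustration of the cost difference (assuming a 5-drive
array with your 1MB chunks, and a small write that dirties only a
single chunk and so takes the read/modify/write path - both are
assumptions, md decides per stripe based on how much of the stripe it
has in hand):

    # Disk I/O to update 1MB of data in a 5-drive RAID5 with 1MB chunks
    # (a 4MB stripe width), comparing the two write paths.
    CHUNK_MB = 1
    DATA_CHUNKS = 4   # 5 drives: 4 data chunks + 1 parity chunk per stripe

    # Full stripe write: write all data chunks plus fresh parity, read nothing.
    full_io = (DATA_CHUNKS + 1) * CHUNK_MB        # 5MB written for 4MB of data

    # Read/modify/write of one chunk: read old data + old parity,
    # xor, then write new data + new parity.
    rmw_io = (2 + 2) * CHUNK_MB                   # 4MB of I/O for 1MB of data

    print("full stripe: %dMB of I/O per %dMB of user data"
          % (full_io, DATA_CHUNKS * CHUNK_MB))
    print("r/m/w:       %dMB of I/O per %dMB of user data, half of it reads"
          % (rmw_io, CHUNK_MB))

The I/O per byte of user data is several times worse in the
read/modify/write case, and half of that I/O is reads that have to
complete before the new data and parity can go out, so the drives also
spend time seeking between the read and write phases.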

> The other thing I wonder is if it has something to do with the  
> sil_sata driver - as ALL the drives in the RAID5 are handled by that  
> kernel module. The boot RAID1 is on the ICH5 SATA controller - and  
> suffers no performance issues at all. It shows a good 40MB/sec+ read  
> AND write speeds per drive.

It's entirely possible that the driver plays a role in this, yes.  I
don't have any hardware that uses that driver so I couldn't say.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

