From: Joe Landman <joe.landman@gmail.com>
To: stan@hardwarefreak.com
Cc: Dan Williams <dan.j.williams@intel.com>, linux-raid@vger.kernel.org
Subject: Re: Software RAID checksum performance on 24 disks not even close to kernel reported
Date: Thu, 07 Jun 2012 10:40:56 -0400
Message-ID: <4FD0BD78.2050808@gmail.com>
In-Reply-To: <4FD028B2.1050306@hardwarefreak.com>
Not to interject too much here ...
On 06/07/2012 12:06 AM, Stan Hoeppner wrote:
> On 6/6/2012 11:09 AM, Dan Williams wrote:
>
>> Hardware raid ultimately does the same shuffling, outside of nvram an
>> advantage it has is that parity data does not traverse the bus...
>
> Are you referring to the host data bus(es)? I.e. HT/QPI and PCIe?
>
> With a 24 disk array, a full stripe write is only 1/12th parity data,
> less than 10%. And the buses (point to point actually) of 24 drive
> caliber systems will usually start at one way B/W of 4GB/s for PCIe 2.0
> x8 and with one way B/W from the PCIe controller to the CPU starting at
PCIe gen 2 is ~500 MB/s per lane in each direction, but there's roughly
14% protocol overhead on top of that, so your "sustained" streaming
performance is more along the lines of 430 MB/s per lane. For a PCIe
gen 2 x8 slot, this nets you about 3.4 GB/s in each direction.
> 10.4GB/s for AMD HT 3.0 systems. PCIe x8 is plenty to handle a 24 drive
> md RAID 6, using 7.2K SATA drives anyway.
Each drive is capable of streaming, say, 140 MB/s (for modern drives):
24 x 140 MB/s = 3.36 GB/s, call it 3.4 GB/s. This assumes pure
streaming, with no seeks other than those that are part of the stream.
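If you want to sanity check that arithmetic, here's the back-of-envelope
version in Python. The 14% overhead and 140 MB/s figures are our rules
of thumb, not spec numbers:

    # Rough PCIe gen 2 and drive-streaming bandwidth estimates.
    # Assumptions: ~14% protocol overhead on top of the 500 MB/s/lane
    # raw rate, and 140 MB/s sustained per modern 7.2K SATA drive.
    PCIE2_RAW_MB_PER_LANE = 500.0
    PROTOCOL_OVERHEAD = 0.14

    lanes = 8
    pcie_usable = lanes * PCIE2_RAW_MB_PER_LANE * (1.0 - PROTOCOL_OVERHEAD)
    print("PCIe gen 2 x8 usable: %.1f GB/s" % (pcie_usable / 1000))  # ~3.4

    drives, stream_mb = 24, 140.0
    print("24-drive streaming aggregate: %.1f GB/s"
          % (drives * stream_mb / 1000))                             # ~3.4

Note that the two numbers land right on top of each other, which is why
the x8 slot is "just enough" for this box.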
This said, this is *not* a design pattern you'd want to follow for a
number of reasons.
But for seek-heavy workloads, you aren't going to hit anything close to
140 MB/s. We've just done a brief study for a customer on what they
should expect to see (by measuring it and reporting on the measurement).
Assume close to an order of magnitude less for seekier loads.
Also, please note that iozone, dd, bonnie++, ... aren't great load
generators, especially if things are in cache. You tend to measure the
upper layers of the file system stack, and not the actual full stack
performance. fio does a better job if you set the right options. That
said, almost all of these tools suffer from measuring at the front end
of the stack; if you want to know what the disks are really doing, you
have to start poking your head into the kernel's /proc and /sys spaces.
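A minimal sketch of what I mean by poking into the kernel side: sample
/proc/diskstats around a benchmark run (sector counts there are always
in 512 byte units) and compare what the benchmark claims against what
the disks actually did. The device name is whatever your drives show
up as:

    # Compare benchmark-reported I/O with what the block layer saw.
    def disk_sectors(dev):
        """Return (sectors_read, sectors_written) for a block device."""
        with open("/proc/diskstats") as f:
            for line in f:
                parts = line.split()
                if parts[2] == dev:
                    # Per Documentation/iostats.txt: fields 3 and 7
                    # after the device name are sectors read/written.
                    return int(parts[5]), int(parts[9])
        raise ValueError("device %r not found" % dev)

    before = disk_sectors("sda")
    # ... run the benchmark here ...
    after = disk_sectors("sda")
    print("backend: read %.1f MB, wrote %.1f MB"
          % ((after[0] - before[0]) * 512 / 1e6,
             (after[1] - before[1]) * 512 / 1e6))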
What's interesting is that, of the tools mentioned, only fio appears to
eventually converge its reporting to what the backend hardware does.
The front-end measurements seem to do a pretty bad job of deciding when
an IO begins and when it is complete. It could be an fsync or similar
problem (discussed here in the past), but it's very annoying. End users
look at bonnie++ and other results and don't understand why their use
case performs so differently.
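A quick illustration of that front-end problem: time the same write
with and without an fsync and watch the page cache masquerade as disk
bandwidth. Point the path at the filesystem under test (a hypothetical
/mnt/test here), and keep the size well under RAM so the cache effect
shows:

    import os, time

    def timed_write(path, mb, sync):
        """Write mb megabytes; return the apparent MB/s."""
        buf = b"\0" * (1 << 20)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        t0 = time.time()
        for _ in range(mb):
            os.write(fd, buf)
        if sync:
            os.fsync(fd)   # force the data out of the page cache
        os.close(fd)
        return mb / (time.time() - t0)

    # The unsynced number measures the page cache, not the disks.
    print("no fsync: %4.0f MB/s" % timed_write("/mnt/test/t", 256, False))
    print("fsync:    %4.0f MB/s" % timed_write("/mnt/test/t", 256, True))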
> What is a bigger issue, and may actually be what you were referring to,
> is read-modify-write B/W, which will incur a full stripe read and write.
> For RMW heavy workloads, this is significant. HBA RAID does have a big
> advantage here, compared to one's md array possessing the aggregate
> performance to saturate the PCIe bus.
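To put rough numbers on the RMW cost: a sketch of a single 4 KB update
on a 24-drive RAID6, assuming a reconstruct-style write (read the
matching page from every other data drive, recompute P and Q, write the
new data plus both parities). An illustrative model, not a measurement:

    # Write amplification for a one-page update on a 24-drive RAID6.
    drives, parity = 24, 2
    data_drives = drives - parity            # 22
    page_kb = 4
    read_kb = (data_drives - 1) * page_kb    # 21 pages read back in
    write_kb = (1 + parity) * page_kb        # new data + P + Q
    amp = (read_kb + write_kb) / float(page_kb)
    print("4 KB write -> %d KB read, %d KB written (%.0fx amplification)"
          % (read_kb, write_kb, amp))        # 84 KB, 12 KB, 24x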
The big issues for most HBAs are the available bandwidth to the disks,
the quality/implementation of the controllers/drivers, etc. Hanging 24
drives off a single controller is a low-cost design, not a high
performance design. You will get contention (especially with expander
chips), and you will get sub-optimal performance.
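To quantify that contention: assume the common low-cost topology of a
single 4-lane 6 Gb/s SAS uplink between the HBA and the expander
(~600 MB/s per lane after 8b/10b encoding). The numbers here are
assumptions for illustration:

    # Oversubscription of one expander uplink by 24 streaming drives.
    lanes, lane_mb = 4, 600.0
    uplink_mb = lanes * lane_mb              # ~2.4 GB/s best case
    demand_mb = 24 * 140.0                   # ~3.4 GB/s streaming demand
    print("uplink %.1f GB/s vs demand %.1f GB/s -> %.0f%% oversubscribed"
          % (uplink_mb / 1000, demand_mb / 1000,
             100 * (demand_mb / uplink_mb - 1)))   # ~40%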
Checksumming speed on the CPU will not be the bottleneck in most of
these cases. Controller/driver performance and contention will be.
Back to your regularly scheduled thread ...
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615