From: Phil Turmel <philip@turmel.org>
To: Robert Kierski <rkierski@cray.com>,
Dallas Clement <dallas.a.clement@gmail.com>,
"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: RAID 5,6 sequential writing seems slower in newer kernels
Date: Thu, 3 Dec 2015 10:04:08 -0500 [thread overview]
Message-ID: <566059E8.60804@turmel.org> (raw)
In-Reply-To: <F7761B9B1D11B64BBB666019E9378117FDDF77@CFWEX01.americas.cray.com>
On 12/03/2015 09:19 AM, Robert Kierski wrote:
> Phil,
>
> I have a variety of testing tools that I use to corroborate the results of the others. So... IOR, XDD, fio, iozone, (and dd when I need something simple). Each of those can be run with a variety of options that simulate what an FS will submit to the block layer without adding the complexity, overhead, and uncertainty that an FS brings to the table. I've run the same tools through an FS, and found that at the bottom end of things, I can configure those tools to do exactly what the FS does... only when I'm looking at the traces, I don't have to scan past 100K lines while the FS is dealing with inodes, privileges, and other metadata.
Ok. Please cite the tool when you give a performance number.
> But to more precisely answer your question... as an example, if I'm using dd, I give this command:
>
> dd if=/dev/zero of=/dev/md0 bs=1M oflag=direct
Why oflag=direct?  And what do you get without it?
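For comparison (just a sketch, and the count is a placeholder), the
buffered equivalent would be something like:

  dd if=/dev/zero of=/dev/md0 bs=1M count=4096 conv=fdatasync

so the page cache and block layer get a chance to merge requests before
they reach md, and conv=fdatasync keeps the timing honest.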
> Where /dev/md0 is the raid device I've configured.
>
> I don't use bitmaps; I've configured my raid using "--bitmap=none" and confirmed that mdadm sees that there is no bitmap. I don't have alignment issues as my ramdisk has 512-byte sectors. If something is somehow aligning things off 512-byte boundaries when doing 1M writes... I would be surprised. Also... I verified that the data written to disk falls at the boundaries I'm expecting.
Ok. I wasn't concerned about sector size. I was concerned about writes
not filling complete stripes in a single IO. Writes to parity raid are
broken up into 4k blocks in the stripe cache for parity calculation.
Each block in that stripe is separated from its mates by the chunk size.
If you don't write to all of them before the state machine decides to
compute, the parity devices will be read to perform RMW cycles (or the
other data members will be read to recompute from scratch). Either way,
when the 4k blocks are then written from the stripe, they have to have a
chance to get merged again.
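A quick way to check (a sketch; the device name and geometry are
assumptions, adjust to your array):

  # full stripe = chunk size * data disks, e.g. 512K chunk * 3 data disks = 1536K
  mdadm --detail /dev/md0 | grep -Ei 'chunk|devices'
  cat /sys/block/md0/md/stripe_cache_size
  cat /sys/block/md0/md/stripe_cache_active   # watch this during the run
  # full-stripe-sized writes fill every block of a stripe before the compute
  dd if=/dev/zero of=/dev/md0 bs=1536k oflag=direct

If throughput recovers with full-stripe-sized writes, the RMW theory fits.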
> I tried RAID0 and got performance that is similar to what I was expecting -- 38 GB/s doing the writes.
Yep, those 1M writes are broken into chunk-sized writes for each member
and submitted as is. Raid456 breaks those down further for parity
calculation.
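If you want to confirm the split on the members, watching them with
something like

  iostat -x 1

during the run should show the average request size: roughly chunk-sized
on raid0, much smaller on raid456 when merging fails. (That's an
assumption; I'm not sure brd ramdisks report per-device stats the same
way real disks do.)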
So, you probably have found a bug in post-stripe merging. Possibly due
to the extreme low latency of a ramdisk. Possibly an O_DIRECT side
effect. There's been a lot of work on parity raid in the past couple
years, both fixing bugs and adding features.
Sounds like time to bisect to locate the patches that make step changes
in performance on your specific hardware.
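The mechanics are the usual (the endpoints below are placeholders; use
kernels you've actually measured):

  git bisect start
  git bisect bad v4.3       # a kernel that shows the slowdown
  git bisect good v3.18     # a kernel that performs as expected
  # build, boot, rerun the same test, then mark:
  git bisect good           # or: git bisect bad
  git bisect reset          # when finished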
Phil
Thread overview: 35+ messages
2015-12-01 23:02 RAID 5,6 sequential writing seems slower in newer kernels Dallas Clement
2015-12-02 1:07 ` keld
2015-12-02 14:18 ` Robert Kierski
2015-12-02 14:45 ` Phil Turmel
2015-12-02 15:28 ` Robert Kierski
2015-12-02 15:37 ` Phil Turmel
2015-12-02 15:44 ` Robert Kierski
2015-12-02 15:51 ` Phil Turmel
2015-12-02 19:50 ` Dallas Clement
2015-12-03 0:12 ` Dallas Clement
2015-12-03 2:18 ` Phil Turmel
2015-12-03 2:24 ` Dallas Clement
2015-12-03 2:33 ` Dallas Clement
2015-12-03 2:38 ` Phil Turmel
2015-12-03 2:51 ` Dallas Clement
2015-12-03 4:30 ` Phil Turmel
2015-12-03 4:49 ` Dallas Clement
2015-12-03 13:43 ` Robert Kierski
2015-12-03 14:37 ` Phil Turmel
2015-12-03 2:34 ` Phil Turmel
2015-12-03 14:19 ` Robert Kierski
2015-12-03 14:39 ` Dallas Clement
2015-12-03 15:04 ` Phil Turmel [this message]
2015-12-03 22:21 ` Weedy
2015-12-04 13:40 ` Robert Kierski
2015-12-04 16:08 ` Dallas Clement
2015-12-07 14:29 ` Robert Kierski
2015-12-08 19:38 ` Dallas Clement
2015-12-08 21:24 ` Robert Kierski
2015-12-04 18:51 ` Shaohua Li
2015-12-05 1:38 ` Dallas Clement
2015-12-07 14:18 ` Robert Kierski
2015-12-02 15:37 ` Robert Kierski
2015-12-02 5:22 ` Roman Mamedov
2015-12-02 14:15 ` Robert Kierski