linux-raid.vger.kernel.org archive mirror
From: Stan Hoeppner <stan@hardwarefreak.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: John Robinson <john.robinson@anonymous.org.uk>,
	linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org
Subject: Re: O_DIRECT to md raid 6 is slow
Date: Wed, 15 Aug 2012 17:00:33 -0500
Message-ID: <502C1C01.1040509@hardwarefreak.com>
In-Reply-To: <CALCETrX=mi92qwOAjt_7Qu-ho_Hdg_5SHX-_8nXYRer4JnzD0w@mail.gmail.com>

On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
> <john.robinson@anonymous.org.uk> wrote:
>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>
>>> If I do:
>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>
>> [...]
>>
>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>> I'm in O_DIRECT mode.
>>
>>
>> I see your md device is partitioned. Is the partition itself stripe-aligned?
> 
> Crud.
> 
> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>       11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
> 
> IIUC this means that I/O should be aligned on 2MB boundaries (512k
> chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
> (i.e. 1MB) boundary.
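
A minimal sketch of the alignment arithmetic quoted above, assuming 512-byte sectors and assuming md can only skip reads when a write covers an entire data stripe (chunk times data disks). The helpers `stripe_width` and `stripe_split` are hypothetical; they just count how many stripes one write covers fully or partially, e.g. one `bs=8M` write from the dd example landing at the partition's 1MB offset versus at an aligned offset.

```python
KIB = 1024
MIB = 1024 * KIB
SECTOR = 512  # assumption: 512-byte logical sectors


def stripe_width(chunk_bytes: int, total_disks: int, parity_disks: int = 2) -> int:
    """Bytes of data in one full stripe (parity chunks excluded)."""
    return chunk_bytes * (total_disks - parity_disks)


def stripe_split(offset: int, length: int, width: int) -> tuple[int, int]:
    """Return (full, partial) counts of stripes touched by one write."""
    end = offset + length
    first = offset // width
    last = (end - 1) // width  # index of the last stripe the write touches
    full = partial = 0
    for s in range(first, last + 1):
        start, stop = s * width, (s + 1) * width
        if offset <= start and stop <= end:
            full += 1      # write covers this whole stripe
        else:
            partial += 1   # only part of this stripe is written
    return full, partial


width = stripe_width(512 * KIB, total_disks=6)   # 2 MiB of data per stripe
part_start = 2048 * SECTOR                        # gdisk's 1 MiB default start
print(width // MIB, part_start % width == 0)
print(stripe_split(part_start, 8 * MIB, width))   # misaligned 8M write
print(stripe_split(0, 8 * MIB, width))            # aligned 8M write
```

Started at the 1MB partition offset, the 8MB write covers three full stripes and clips two partial ones; started on a 2MB boundary it covers four full stripes and no partial ones.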

It's time to blow away the array and start over.  You're already
misaligned, and a 512KB chunk is insanely unsuitable for parity RAID
for all but a handful of niche, all-streaming workloads with little or
no rewrite, such as video surveillance or DVR workloads.

Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
Deleting a single file changes only a few bytes of directory metadata.
With your 6-drive md/RAID6 and its 512KB chunk, you must read 3MB of data,
modify the directory block in question, calculate parity, then write out
3MB of data to rust.  So you consume 6MB of bandwidth to write less than
a dozen bytes.  With a 12-drive RAID6 that's 12MB of bandwidth to modify
a few bytes of metadata.  Yes, insane.
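
The figures above follow from a model in which a tiny update reads every chunk of the stripe (data and parity) and then writes every chunk back. A sketch of that arithmetic, assuming exactly that model; `rmw_traffic` is a hypothetical name, and md's actual I/O pattern for small writes depends on its stripe handling, which this does not model.

```python
KIB = 1024
MIB = 1024 * KIB


def rmw_traffic(chunk_bytes: int, drives: int) -> int:
    """Bytes moved to change a few bytes, if the whole stripe is read and rewritten.

    Model from the message: read every chunk in the stripe (data + parity),
    recompute parity, then write every chunk back out.
    """
    stripe_bytes = chunk_bytes * drives
    return 2 * stripe_bytes  # one full-stripe read plus one full-stripe write


print(rmw_traffic(512 * KIB, 6) // MIB)   # 6-drive RAID6, 512KB chunk
print(rmw_traffic(512 * KIB, 12) // MIB)  # 12-drive RAID6, 512KB chunk
```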

Parity RAID sucks in general because of read-modify-write (RMW), but it
is orders of magnitude worse when one also chooses an insane chunk size,
and especially so with a large drive count.
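
To see how the penalty grows with both knobs named above, the same full-stripe read-plus-write model can be swept across chunk sizes and drive counts. This is only a sketch of that model; `rmw_traffic_kib` is a hypothetical helper, and the 32/64/128/512 KiB chunk sizes are arbitrary sample points.

```python
def rmw_traffic_kib(chunk_kib: int, drives: int) -> int:
    """KiB read plus written per small update under the full-stripe RMW model."""
    return 2 * chunk_kib * drives


chunks = (32, 64, 128, 512)   # sample chunk sizes in KiB
drive_counts = (6, 12)        # sample array widths

# Header and rows use fixed-width cells so the columns line up.
print("chunk  " + "  ".join(f"{d:>4} drives" for d in drive_counts))
for c in chunks:
    row = "  ".join(f"{rmw_traffic_kib(c, d):>7} KiB" for d in drive_counts)
    print(f"{c:>4}K  {row}")
```

Traffic scales linearly with both chunk size and drive count, so going from a 32KB to a 512KB chunk multiplies it by 16 at any width, and doubling the drive count doubles it again.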

It seems people tend to use large chunk sizes because array
initialization is a bit faster, and running block transfer "tests" with
buffered sequential dd reads/writes makes their Levi's expand.  Then they
are confused when their actual workloads are horribly slow.

Recreate your array with the partition aligned to the stripe, and
manually specify a sane chunk size of something like 32KB.  You'll be
much happier with real workloads.
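
A sketch of the alignment check for the recommended layout, assuming a 6-drive RAID6 (4 data disks), a 32KB chunk and 512-byte sectors. The helpers `stripe_bytes` and `next_aligned_sector` are hypothetical illustrations, not part of mdadm or gdisk.

```python
SECTOR = 512  # assumption: 512-byte logical sectors
KIB = 1024


def stripe_bytes(chunk_kib: int, total_disks: int, parity_disks: int = 2) -> int:
    """Data bytes per full stripe: chunk size times the number of data disks."""
    return chunk_kib * KIB * (total_disks - parity_disks)


def next_aligned_sector(min_sector: int, stripe: int) -> int:
    """Smallest sector >= min_sector whose byte offset is a multiple of `stripe`."""
    if stripe % SECTOR:
        raise ValueError("stripe width must be a whole number of sectors")
    per_stripe = stripe // SECTOR
    return -(-min_sector // per_stripe) * per_stripe  # ceiling division


stripe = stripe_bytes(32, total_disks=6)   # 128 KiB of data per stripe
print(stripe // KIB)
print((2048 * SECTOR) % stripe == 0)       # is gdisk's 1 MiB default aligned?
print(next_aligned_sector(34, stripe))     # 34: first usable LBA on a typical GPT
```

With a 128KiB stripe, gdisk's default 2048-sector start already falls on a stripe boundary, whereas with the original 512k chunk (2MB stripe) the same start lands halfway into a stripe; `next_aligned_sector(2048, stripe_bytes(512, 6))` gives 4096 for that case.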

-- 
Stan

Thread overview: 24+ messages
2012-08-15  0:49 O_DIRECT to md raid 6 is slow Andy Lutomirski
2012-08-15  1:07 ` kedacomkernel
2012-08-15  1:12   ` Andy Lutomirski
2012-08-15  1:23     ` kedacomkernel
2012-08-15 11:50 ` John Robinson
2012-08-15 17:57   ` Andy Lutomirski
2012-08-15 22:00     ` Stan Hoeppner [this message]
2012-08-15 22:10       ` Andy Lutomirski
2012-08-15 23:50         ` Stan Hoeppner
2012-08-16  1:08           ` Andy Lutomirski
2012-08-16  6:41           ` Roman Mamedov
     [not found]     ` <201208152307.q7FN7hMR008630@xs8.xs4all.nl>
     [not found]       ` <502CD3F8.70001@hardwarefreak.com>
     [not found]         ` <502D6B0A.6090508@xs4all.net>
     [not found]           ` <502DF357.8090205@hardwarefreak.com>
     [not found]             ` <502E2817.8040306@xs4all.net>
2012-08-18  5:09               ` Stan Hoeppner
2012-08-18 10:08                 ` Michael Tokarev
2012-08-19  3:17                   ` Stan Hoeppner
2012-08-19 14:01                     ` David Brown
2012-08-19 23:34                       ` Stan Hoeppner
2012-08-20  0:01                         ` NeilBrown
2012-08-20  4:44                           ` Stan Hoeppner
2012-08-20  5:19                             ` Dave Chinner
2012-08-20  5:42                               ` Stan Hoeppner
2012-08-20  7:47                           ` David Brown
2012-08-21 14:51                         ` Miquel van Smoorenburg
2012-08-22  3:59                           ` Stan Hoeppner
2012-08-19 17:02                     ` Chris Murphy