From: Stan Hoeppner <stan@hardwarefreak.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: John Robinson <john.robinson@anonymous.org.uk>,
linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org
Subject: Re: O_DIRECT to md raid 6 is slow
Date: Wed, 15 Aug 2012 18:50:44 -0500
Message-ID: <502C35D4.6010804@hardwarefreak.com>
In-Reply-To: <CALCETrUTNV0r6xeF+mbqqw7w_StxoF2qFxzCLfb-LVH7ay_SHw@mail.gmail.com>

On 8/15/2012 5:10 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>>> <john.robinson@anonymous.org.uk> wrote:
>>>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>>>
>>>>> If I do:
>>>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>>
>>>> [...]
>>>>
>>>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>>>> I'm in O_DIRECT mode.
>>>>
>>>>
>>>> I see your md device is partitioned. Is the partition itself stripe-aligned?
>>>
>>> Crud.
>>>
>>> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>>> 11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
>>> [6/6] [UUUUUU]
>>>
>>> IIUC this means that I/O should be aligned on 2MB boundaries (512k
>>> chunk * 4 non-parity disks). gdisk put my partition on a 2048 sector
>>> (i.e. 1MB) boundary.
>>
>> It's time to blow away the array and start over. You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> but for a handful of niche all streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default. And yes, it is insane. Here's why:
>> Deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>> modify the directory block in question, calculate parity, then write out
>> 3MB of data to rust. So you consume 6MB of bandwidth to write less than
>> a dozen bytes. With a 12 drive RAID6 that's 12MB of bandwidth to modify
>> a few bytes of metadata. Yes, insane.
>
> Grr. I thought the bad old days of filesystem and related defaults
> sucking were over.

The previous md chunk default of 64KB wasn't horribly bad, though still
maybe a bit high for a lot of common workloads. I didn't have eyes/ears
on the discussion and/or testing process that led to the 'new' 512KB
default. Obviously something went horribly wrong here. 512KB isn't a
show stopper as a default for RAID 0/1/10, but is 8-16 times too large
for parity RAID, i.e. something in the 32-64KB range would be saner.

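The stripe arithmetic discussed in this thread can be sketched as follows.
This is only an illustration in Python, assuming 512-byte logical sectors;
the helper names (data_stripe_bytes, full_stripe_bytes, is_stripe_aligned)
are made up for this example and are not part of md or mdadm. It uses the
6-drive RAID6 geometry and the 2048-sector partition offset quoted above.

```python
KIB = 1024
SECTOR = 512  # bytes; assumption: 512-byte logical sectors


def data_stripe_bytes(n_disks, parity_disks, chunk_bytes):
    """Bytes of user data in one full stripe (parity chunks excluded)."""
    return (n_disks - parity_disks) * chunk_bytes


def full_stripe_bytes(n_disks, chunk_bytes):
    """Bytes in one full stripe across all members, parity included."""
    return n_disks * chunk_bytes


def is_stripe_aligned(offset_sectors, stripe_bytes):
    """True if a partition starting at offset_sectors begins on a stripe boundary."""
    return (offset_sectors * SECTOR) % stripe_bytes == 0


if __name__ == "__main__":
    # 6-drive RAID6 (2 parity chunks per stripe) with a 512 KiB chunk.
    stripe = data_stripe_bytes(6, 2, 512 * KIB)
    print("data stripe KiB:", stripe // KIB)
    print("2048-sector offset aligned:", is_stripe_aligned(2048, stripe))
    print("4096-sector offset aligned:", is_stripe_aligned(4096, stripe))

    # Compare candidate chunk sizes for the same 6-drive array.
    for chunk_kib in (32, 64, 512):
        data = data_stripe_bytes(6, 2, chunk_kib * KIB) // KIB
        full = full_stripe_bytes(6, chunk_kib * KIB) // KIB
        print(chunk_kib, "KiB chunk:", data, "KiB data,", full, "KiB full stripe")
```

With a 512KB chunk the data stripe is 2MB and the full stripe is 3MB, so a
partition starting at sector 2048 (1MB) sits in the middle of a stripe.
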
> cryptsetup aligns sanely these days, xfs is
> sensible, etc.

XFS won't align with the 512KB chunk default of metadata 1.2. The
largest XFS journal stripe unit (su, i.e. the chunk size) is 256KB, and
even that isn't recommended. Thus mkfs.xfs complains about the 512KB
stripe unit. See the md and xfs archives for more details, specifically
Dave Chinner's colorful comments on the md 512KB default.

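As a rough illustration of how the md geometry maps onto XFS stripe
settings, here is a small Python sketch. It assumes su equals the md chunk
size and sw equals the number of data (non-parity) disks, and it takes the
256KB journal stripe unit limit from the paragraph above. The function
names are hypothetical; only the "-d su=...,sw=..." option syntax belongs
to mkfs.xfs itself.

```python
LOG_SU_MAX_KIB = 256  # journal stripe unit limit as stated in this message


def xfs_geometry(n_disks, parity_disks, chunk_kib):
    """Derive XFS stripe unit/width from an md parity array's layout."""
    return {
        "su_kib": chunk_kib,             # stripe unit = md chunk size
        "sw": n_disks - parity_disks,    # stripe width = number of data disks
        "log_su_ok": chunk_kib <= LOG_SU_MAX_KIB,
    }


def mkfs_data_option(geom):
    """Format the data-section stripe option for an mkfs.xfs command line."""
    return "-d su={su_kib}k,sw={sw}".format(**geom)


if __name__ == "__main__":
    for chunk_kib in (64, 512):
        geom = xfs_geometry(6, 2, chunk_kib)
        print(mkfs_data_option(geom), "log su within limit:", geom["log_su_ok"])
```

For the 6-drive RAID6 in this thread, a 64KB chunk fits within the journal
stripe unit limit, while the 512KB default exceeds it.
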
> wtf? <rant>Why is there no sensible filesystem for
> huge disks? zfs can't cp --reflink and has all kinds of source
> availability and licensing issues, xfs can't dedupe at all, and btrfs
> isn't nearly stable enough.</rant>

Deduplication isn't a responsibility of a filesystem. TTBOMK there are
two, and only two, COW filesystems in existence: ZFS and BTRFS. And
these are the only two to offer a native dedupe capability. They did it
because they could, with COW, not necessarily because they *should*.
There are dozens of other single-node, cluster, and distributed
filesystems in use today, and none of them support COW, and thus none
support dedupe. So to *expect* a 'sensible' filesystem to include dedupe
is wishful thinking at best.

> Anyhow, I'll try the patch from Wu Fengguang. There's still a bug here...

Always one somewhere.

--
Stan