Re: Filesystem writes on RAID5 too slow

From: Dave Chinner <david@fromorbit.com>
To: Martin Boutin <martboutin@gmail.com>
Cc: Eric Sandeen <sandeen@redhat.com>,
	"Kernel.org-Linux-RAID" <linux-raid@vger.kernel.org>,
	xfs-oss <xfs@oss.sgi.com>,
	"Kernel.org-Linux-EXT4" <linux-ext4@vger.kernel.org>
Subject: Re: Filesystem writes on RAID5 too slow
Date: Thu, 21 Nov 2013 20:26:06 +1100
Message-ID: <20131121092606.GU11434@dastard>
In-Reply-To: <CACtJ3Ha3C7JNi5VZRnNMn+-okNheygmbj=j9AnUMvfzfZjNwug@mail.gmail.com>

On Thu, Nov 21, 2013 at 04:11:41AM -0500, Martin Boutin wrote:
> On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote:
> >> On 11/18/13, 10:02 AM, Martin Boutin wrote:
> >> > Dear list,
> >> >
> >> > I am writing about an apparent issue (or maybe it is normal, that's my
> >> > question) regarding filesystem write speed in a linux raid device.
> >> > More specifically, I have linux-3.10.10 running in an Intel Haswell
> >> > embedded system with 3 HDDs in a RAID-5 configuration.
> >> > The hard disks have 4k physical sectors which are reported as 512
> >> > logical size. I made sure the partitions underlying the raid device
> >> > start at sector 2048.
> >>
> >> (fixed cc: to xfs list)
> >>
> >> > The RAID device has version 1.2 metadata and 4k (bytes) of data
> >> > offset, therefore the data should also be 4k aligned. The raid chunk
> >> > size is 512K.
> >> >
> >> > I have the md0 raid device formatted as ext3 with a 4k block size, and
> >> > stride and stripes correctly chosen to match the raid chunk size, that
> >> > is, stride=128,stripe-width=256.
> >> >
> >> > While I was working on a small university project, I just noticed that
> >> > the write speeds when using a filesystem over raid are *much* slower
> >> > than when writing directly to the raid device (or even compared to
> >> > filesystem read speeds).
> >> >
> >> > The command line for measuring filesystem read and write speeds was:
> >> >
> >> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
> >> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> >> >
> >> > The command line for measuring raw read and write speeds was:
> >> >
> >> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
> >> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
> >> >
> >> > Here are some speed measures using dd (an average of 20 runs):
> >> >
> >> > device      raw/fs  mode   speed (MB/s)  slowdown (%)
> >> > /dev/md0    raw     read   207
> >> > /dev/md0    raw     write  209
> >> > /dev/md1    raw     read   214
> >> > /dev/md1    raw     write  212
> >
> > So, that's writing to the first 1GB of /dev/md0, and all the writes
> > are going to be aligned to the MD stripe.
> >
> >> > /dev/md0    xfs     read   188           9
> >> > /dev/md0    xfs     write  35            83
> >
> > And these will not be written to the first 1GB of the block device
> > but somewhere else. Most likely a region that hasn't otherwise been
> > used, and so isn't going to be overwriting the same blocks like the
> > /dev/md0 case is going to be. Perhaps there's some kind of stripe
> > caching effect going on here? Was the md device fully initialised
> > before you ran these tests?
> >
> >> >
> >> > /dev/md1    ext3    read   199           7
> >> > /dev/md1    ext3    write  36            83
> >> >
> >> > /dev/md0    ufs     read   212           0
> >> > /dev/md0    ufs     write  53            75
> >> >
> >> > /dev/md0    ext2    read   202           2
> >> > /dev/md0    ext2    write  34            84
> >
> > > I suspect what you are seeing here is either the latency introduced
> > > by having to allocate blocks before issuing the IO, or a file
> > > layout due to allocation that is not ideal. Single threaded direct IO is
> > latency bound, not bandwidth bound and, as such, is IO size
> > sensitive. Allocation for direct IO is also IO size sensitive -
> > there's typically an allocation per IO, so the more IO you have to
> > do, the more allocation that occurs.
> 
> I just did a few more tests, this time with ext4:
> 
> device      raw/fs  mode   speed (MB/s)  slowdown (%)
> /dev/md0    ext4    read   199           4
> /dev/md0    ext4    write  210           0
> 
> This time, no slowdown at all on ext4. I believe this is due to the
> multiblock allocation feature of ext4 (I'm using O_DIRECT, so it
> should be it). So I guess for the other filesystems, it was indeed
> the latency introduced by block allocation.

Except that XFS does extent based allocation as well, so that's not
likely the reason. The fact that ext4 doesn't see a slowdown like
every other filesystem really doesn't make a lot of sense to
me, either from an IO dispatch point of view or an IO alignment
point of view.
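
For reference when comparing the quoted tables, here is a rough
sketch of how the "slowdown (%)" column appears to be derived. The
formula is an assumption inferred from the posted numbers (the
thread never states it): the loss relative to the raw device speed
for the same mode, rounded, and clamped at zero when the filesystem
is faster than raw. The function name is made up for illustration.

```python
def slowdown_percent(raw_mb_s, fs_mb_s):
    """Throughput lost relative to the raw device, in whole percent.

    Assumed formula: (raw - fs) / raw * 100, rounded to the nearest
    integer and clamped at 0 when the filesystem beats the raw device.
    """
    loss = (raw_mb_s - fs_mb_s) / raw_mb_s * 100.0
    return max(0, round(loss))


# (filesystem, mode, raw MB/s, filesystem MB/s), /dev/md0 rows from the thread
rows = [
    ("xfs", "read", 207, 188),
    ("xfs", "write", 209, 35),
    ("ext4", "read", 207, 199),
    ("ext4", "write", 209, 210),
]

for fs, mode, raw, speed in rows:
    pct = slowdown_percent(raw, speed)
    print(f"{fs:5s} {mode:5s} {speed:4d} MB/s  slowdown {pct}%")
```

Under that assumption the xfs and ext4 rows reproduce the posted
values (9, 83, 4 and 0).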

Why? Because all the filesystems align identically to the underlying
device and all should be doing 4k block aligned IO, and XFS has
roughly the same allocation overhead for this workload as ext4.
Did you retest XFS or any of the other filesystems directly after
running the ext4 tests (i.e. confirm you are testing apples to
apples)?
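
To make the alignment argument concrete, the geometry described in
the original report (3 disks in RAID-5, 512K chunk, 4k filesystem
blocks) can be checked with a small sketch. The function name and
its parameters are illustrative only; the arithmetic is the usual
relationship between chunk size, stride and stripe width.

```python
def raid5_geometry(chunk_kib, n_disks, fs_block_kib=4):
    """Return (stride, stripe_width, full_stripe_kib) for a RAID-5 array.

    stride          filesystem blocks per chunk
    stripe_width    filesystem blocks per full data stripe
    full_stripe_kib data bytes (in KiB) covered by one full stripe
    """
    data_disks = n_disks - 1          # RAID-5 spends one chunk per stripe on parity
    stride = chunk_kib // fs_block_kib
    stripe_width = stride * data_disks
    full_stripe_kib = chunk_kib * data_disks
    return stride, stripe_width, full_stripe_kib


stride, stripe_width, full_stripe_kib = raid5_geometry(chunk_kib=512, n_disks=3)
print(f"stride={stride} stripe-width={stripe_width} full stripe={full_stripe_kib} KiB")
```

This gives stride=128 and stripe-width=256, matching the values
reported for the ext3 setup, with a full data stripe of 1024 KiB,
i.e. the same size as the bs=1M used by the dd runs.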

What we need in order to determine why other filesystems are slow
(and why ext4 is fast) is more information about your configuration
and block traces showing what is happening at the IO level, as was
requested in a previous email....
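
As a rough idea of what to look for in such traces, the sketch
below classifies writes by whether they cover whole stripes of the
md device. Everything here is an assumption for illustration: the
(start sector, sector count) pairs are made-up samples rather than
data from this thread, sectors are taken to be 512 bytes and
relative to the start of the md device, and the full stripe size is
the 2 x 512K implied by the reported geometry. It is only a
first-order hint, since md may still merge adjacent partial writes
before they reach the disks.

```python
SECTOR_BYTES = 512                      # assumed trace sector unit
FULL_STRIPE_BYTES = 2 * 512 * 1024      # 2 data disks x 512K chunk


def classify_write(start_sector, n_sectors):
    """Label a write 'full-stripe' or 'partial-stripe' for the md device.

    A write counts as full-stripe only if it starts on a stripe
    boundary and its length is a non-zero multiple of the stripe size.
    """
    offset = start_sector * SECTOR_BYTES
    length = n_sectors * SECTOR_BYTES
    aligned_start = offset % FULL_STRIPE_BYTES == 0
    whole_stripes = length > 0 and length % FULL_STRIPE_BYTES == 0
    return "full-stripe" if aligned_start and whole_stripes else "partial-stripe"


# Hypothetical sample writes: (start sector, number of sectors)
samples = [(0, 2048), (2048, 2048), (8, 2048), (4096, 8)]
for start, count in samples:
    print(start, count, classify_write(start, count))
```

Partial-stripe writes are the interesting ones on RAID-5, because
parity for the stripe has to be recomputed from data that the write
does not itself supply.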

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
