linux-ext4.vger.kernel.org archive mirror
* Filesystem writes on RAID5 too slow
@ 2013-11-18 16:02 Martin Boutin
  2013-11-18 18:28 ` Eric Sandeen
  2013-11-18 18:41 ` Roman Mamedov
  0 siblings, 2 replies; 19+ messages in thread
From: Martin Boutin @ 2013-11-18 16:02 UTC (permalink / raw)
  To: Kernel.org-Linux-RAID; +Cc: Kernel.org-Linux-XFS, Kernel.org-Linux-EXT4

Dear list,

I am writing about an apparent issue (or maybe it is normal, that's my
question) regarding filesystem write speed on a Linux RAID device.
More specifically, I have linux-3.10.10 running on an Intel Haswell
embedded system with 3 HDDs in a RAID-5 configuration.
The hard disks have 4k physical sectors which are reported as 512-byte
logical sectors. I made sure the partitions underlying the RAID device
start at sector 2048.

The RAID device has version 1.2 metadata and a 4k (byte) data
offset, therefore the data should also be 4k aligned. The RAID chunk
size is 512K.

I have the md0 RAID device formatted as ext3 with a 4k block size, and
stride and stripe-width chosen to match the RAID chunk size, that
is, stride=128,stripe-width=256.
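
Those numbers follow from the geometry: stride = chunk size / block size =
512K / 4K = 128, and stripe-width = stride * 2 data disks = 256. A mke2fs
invocation of roughly this shape gives that layout (a sketch, not the exact
command line I used):

$ mke2fs -t ext3 -b 4096 -E stride=128,stripe-width=256 /dev/md0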

While working on a small university project, I noticed that
write speeds when using a filesystem over RAID are *much* slower
than when writing directly to the RAID device (and also much slower
than filesystem read speeds).

The command line for measuring filesystem read and write speeds was:

$ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
$ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct

The command line for measuring raw read and write speeds was:

$ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
$ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct

Here are some speed measurements using dd (averaged over 20 runs):

device       raw/fs  mode   speed (MB/s)    slowdown (%)
/dev/md0    raw    read    207
/dev/md0    raw    write    209
/dev/md1    raw    read    214
/dev/md1    raw    write    212

/dev/md0    xfs    read    188    9
/dev/md0    xfs    write    35    83

/dev/md1    ext3    read    199    7
/dev/md1    ext3    write    36    83

/dev/md0    ufs    read    212    0
/dev/md0    ufs    write    53    75

/dev/md0    ext2    read    202    2
/dev/md0    ext2    write    34    84
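
(The slowdown column is computed against the corresponding raw figure,
e.g. for ext2 writes: (209 - 34) / 209 = ~84%.)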

Is it possible that the filesystem has such an enormous impact on
write speed? We are talking about a slowdown of more than 80%! Even a
filesystem as simple as UFS shows a 75% slowdown. What am I missing?

Thank you,
-- 
Martin Boutin


* Re: Filesystem writes on RAID5 too slow
  2013-11-18 16:02 Filesystem writes on RAID5 too slow Martin Boutin
@ 2013-11-18 18:28 ` Eric Sandeen
  2013-11-19  0:57   ` Dave Chinner
  2013-11-18 18:41 ` Roman Mamedov
  1 sibling, 1 reply; 19+ messages in thread
From: Eric Sandeen @ 2013-11-18 18:28 UTC (permalink / raw)
  To: Martin Boutin, Kernel.org-Linux-RAID; +Cc: xfs-oss, Kernel.org-Linux-EXT4

On 11/18/13, 10:02 AM, Martin Boutin wrote:
> Dear list,
> 
> I am writing about an apparent issue (or maybe it is normal, that's my
> question) regarding filesystem write speed in in a linux raid device.
> More specifically, I have linux-3.10.10 running in an Intel Haswell
> embedded system with 3 HDDs in a RAID-5 configuration.
> The hard disks have 4k physical sectors which are reported as 512
> logical size. I made sure the partitions underlying the raid device
> start at sector 2048.

(fixed cc: to xfs list)

> The RAID device has version 1.2 metadata and 4k (bytes) of data
> offset, therefore the data should also be 4k aligned. The raid chunk
> size is 512K.
> 
> I have the md0 raid device formatted as ext3 with a 4k block size, and
> stride and stripes correctly chosen to match the raid chunk size, that
> is, stride=128,stripe-width=256.
> 
> While I was working in a small university project, I just noticed that
> the write speeds when using a filesystem over raid are *much* slower
> than when writing directly to the raid device (or even compared to
> filesystem read speeds).
> 
> The command line for measuring filesystem read and write speeds was:
> 
> $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> 
> The command line for measuring raw read and write speeds was:
> 
> $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
> $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
> 
> Here are some speed measures using dd (an average of 20 runs).:
> 
> device       raw/fs  mode   speed (MB/s)    slowdown (%)
> /dev/md0    raw    read    207
> /dev/md0    raw    write    209
> /dev/md1    raw    read    214
> /dev/md1    raw    write    212
> 
> /dev/md0    xfs    read    188    9
> /dev/md0    xfs    write    35    83
> 
> /dev/md1    ext3    read    199    7
> /dev/md1    ext3    write    36    83
> 
> /dev/md0    ufs    read    212    0
> /dev/md0    ufs    write    53    75
> 
> /dev/md0    ext2    read    202    2
> /dev/md0    ext2    write    34    84
> 
> Is it possible that the filesystem has such enormous impact in the
> write speed? We are talking about a slowdown of 80%!!! Even a
> filesystem as simple as ufs has a slowdown of 75%! What am I missing?

One thing you're missing is enough info to debug this.

/proc/mdstat, kernel version, xfs_info output, mkfs commandlines used,
partition table details, etc.
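
For instance (a rough sketch; substitute your actual devices and mount point):

$ cat /proc/mdstat
$ mdadm --detail /dev/md0
$ uname -a
$ xfs_info /path/to/mountpoint
$ fdisk -l -u /dev/sdX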

If something is misaligned and you are doing RMW for these IOs it could
hurt a lot.

-Eric

> Thank you,
> 



* Re: Filesystem writes on RAID5 too slow
  2013-11-18 16:02 Filesystem writes on RAID5 too slow Martin Boutin
  2013-11-18 18:28 ` Eric Sandeen
@ 2013-11-18 18:41 ` Roman Mamedov
  2013-11-18 19:25   ` Roman Mamedov
  1 sibling, 1 reply; 19+ messages in thread
From: Roman Mamedov @ 2013-11-18 18:41 UTC (permalink / raw)
  To: Martin Boutin
  Cc: Kernel.org-Linux-RAID, Kernel.org-Linux-XFS,
	Kernel.org-Linux-EXT4


On Mon, 18 Nov 2013 11:02:15 -0500
Martin Boutin <martboutin@gmail.com> wrote:

> I have the md0 raid device formatted as ext3 with a 4k block size, and
> stride and stripes correctly chosen to match the raid chunk size, that
> is, stride=128,stripe-width=256.

What is your stripe cache size?
http://peterkieser.com/2009/11/29/raid-mdraid-stripe_cache_size-vs-write-transfer/
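
You can check and temporarily raise it via sysfs, e.g. (a sketch, assuming
the array is md0; the md default is 256):

$ cat /sys/block/md0/md/stripe_cache_size
$ echo 8192 > /sys/block/md0/md/stripe_cache_size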

> The command line for measuring filesystem read and write speeds was:
> 
> $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct

Try testing with "fdatasync" instead of "direct" here.

-- 
With respect,
Roman



* Re: Filesystem writes on RAID5 too slow
  2013-11-18 18:41 ` Roman Mamedov
@ 2013-11-18 19:25   ` Roman Mamedov
  0 siblings, 0 replies; 19+ messages in thread
From: Roman Mamedov @ 2013-11-18 19:25 UTC (permalink / raw)
  To: Martin Boutin
  Cc: Kernel.org-Linux-RAID, Kernel.org-Linux-XFS,
	Kernel.org-Linux-EXT4


On Tue, 19 Nov 2013 00:41:40 +0600
Roman Mamedov <rm@romanrm.net> wrote:

> > The command line for measuring filesystem read and write speeds was:
> > 
> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> 
> Try testing with "fdatasync" instead of "direct" here.

Sorry, "conv=fdatasync" instead of "oflag=direct".
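
That is, something like:

$ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 conv=fdatasync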

-- 
With respect,
Roman



* Re: Filesystem writes on RAID5 too slow
  2013-11-18 18:28 ` Eric Sandeen
@ 2013-11-19  0:57   ` Dave Chinner
  2013-11-21  9:11     ` Martin Boutin
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2013-11-19  0:57 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Martin Boutin, Kernel.org-Linux-RAID, Kernel.org-Linux-EXT4,
	xfs-oss

On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote:
> On 11/18/13, 10:02 AM, Martin Boutin wrote:
> > Dear list,
> > 
> > I am writing about an apparent issue (or maybe it is normal, that's my
> > question) regarding filesystem write speed in in a linux raid device.
> > More specifically, I have linux-3.10.10 running in an Intel Haswell
> > embedded system with 3 HDDs in a RAID-5 configuration.
> > The hard disks have 4k physical sectors which are reported as 512
> > logical size. I made sure the partitions underlying the raid device
> > start at sector 2048.
> 
> (fixed cc: to xfs list)
> 
> > The RAID device has version 1.2 metadata and 4k (bytes) of data
> > offset, therefore the data should also be 4k aligned. The raid chunk
> > size is 512K.
> > 
> > I have the md0 raid device formatted as ext3 with a 4k block size, and
> > stride and stripes correctly chosen to match the raid chunk size, that
> > is, stride=128,stripe-width=256.
> > 
> > While I was working in a small university project, I just noticed that
> > the write speeds when using a filesystem over raid are *much* slower
> > than when writing directly to the raid device (or even compared to
> > filesystem read speeds).
> > 
> > The command line for measuring filesystem read and write speeds was:
> > 
> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> > 
> > The command line for measuring raw read and write speeds was:
> > 
> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
> > 
> > Here are some speed measures using dd (an average of 20 runs).:
> > 
> > device       raw/fs  mode   speed (MB/s)    slowdown (%)
> > /dev/md0    raw    read    207
> > /dev/md0    raw    write    209
> > /dev/md1    raw    read    214
> > /dev/md1    raw    write    212

So, that's writing to the first 1GB of /dev/md0, and all the writes
are going to be aligned to the MD stripe.

> > /dev/md0    xfs    read    188    9
> > /dev/md0    xfs    write    35    83o

And these will not be written to the first 1GB of the block device
but somewhere else. Most likely a region that hasn't otherwise been
used, and so isn't going to be overwriting the same blocks like the
/dev/md0 case is going to be. Perhaps there's some kind of stripe
caching effect going on here? Was the md device fully initialised
before you ran these tests?

> > 
> > /dev/md1    ext3    read    199    7
> > /dev/md1    ext3    write    36    83
> > 
> > /dev/md0    ufs    read    212    0
> > /dev/md0    ufs    write    53    75
> > 
> > /dev/md0    ext2    read    202    2
> > /dev/md0    ext2    write    34    84

I suspect what you are seeing here is either the latency introduced
by having to allocate blocks before issuing the IO, or the file
layout resulting from allocation is not ideal. Single threaded direct IO is
latency bound, not bandwidth bound and, as such, is IO size
sensitive. Allocation for direct IO is also IO size sensitive -
there's typically an allocation per IO, so the more IO you have to
do, the more allocation that occurs.
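
One way to separate the allocation cost from the write itself (just a
sketch, reusing the file name from above) is to preallocate the file and
then overwrite it in place:

$ xfs_io -f -c "falloc 0 1g" /tmp/diskmnt/filewr.zero
$ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct conv=notrunc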

So, on XFS, what does "xfs_bmap -vvp /tmp/diskmnt/filewr.zero"
output for the file you wrote? Specifically, I'm interested whether
it aligned the allocations to the stripe unit boundary, and if so,
what offset into the device those extents sit at....

Also, you should run iostat and blktrace to determine if MD is
doing RMW cycles when being written to through the filesystem.
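
For example, while the dd is running (a minimal sketch; adjust the device
names to your setup):

$ iostat -dxk 5 /dev/sda /dev/sdb /dev/sdc
$ blktrace -d /dev/sda -o - | blkparse -i -

A sustained read rate on the member disks during a pure write workload is
the RMW signature.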

> > Is it possible that the filesystem has such enormous impact in the
> > write speed? We are talking about a slowdown of 80%!!! Even a
> > filesystem as simple as ufs has a slowdown of 75%! What am I missing?
> 
> One thing you're missing is enough info to debug this.
> 
> /proc/mdstat, kernel version, xfs_info output, mkfs commandlines used,
> partition table details, etc.

There's a good list here:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

> If something is misaligned and you are doing RMW for these IOs it could
> hurt a lot.
> 
> -Eric
> 
> > Thank you,
> > 
> 

-- 
Dave Chinner
david@fromorbit.com



* Re: Filesystem writes on RAID5 too slow
  2013-11-19  0:57   ` Dave Chinner
@ 2013-11-21  9:11     ` Martin Boutin
  2013-11-21  9:26       ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Martin Boutin @ 2013-11-21  9:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Sandeen, Kernel.org-Linux-RAID, xfs-oss,
	Kernel.org-Linux-EXT4

On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote:
>> On 11/18/13, 10:02 AM, Martin Boutin wrote:
>> > Dear list,
>> >
>> > I am writing about an apparent issue (or maybe it is normal, that's my
>> > question) regarding filesystem write speed in in a linux raid device.
>> > More specifically, I have linux-3.10.10 running in an Intel Haswell
>> > embedded system with 3 HDDs in a RAID-5 configuration.
>> > The hard disks have 4k physical sectors which are reported as 512
>> > logical size. I made sure the partitions underlying the raid device
>> > start at sector 2048.
>>
>> (fixed cc: to xfs list)
>>
>> > The RAID device has version 1.2 metadata and 4k (bytes) of data
>> > offset, therefore the data should also be 4k aligned. The raid chunk
>> > size is 512K.
>> >
>> > I have the md0 raid device formatted as ext3 with a 4k block size, and
>> > stride and stripes correctly chosen to match the raid chunk size, that
>> > is, stride=128,stripe-width=256.
>> >
>> > While I was working in a small university project, I just noticed that
>> > the write speeds when using a filesystem over raid are *much* slower
>> > than when writing directly to the raid device (or even compared to
>> > filesystem read speeds).
>> >
>> > The command line for measuring filesystem read and write speeds was:
>> >
>> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
>> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
>> >
>> > The command line for measuring raw read and write speeds was:
>> >
>> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
>> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
>> >
>> > Here are some speed measures using dd (an average of 20 runs).:
>> >
>> > device       raw/fs  mode   speed (MB/s)    slowdown (%)
>> > /dev/md0    raw    read    207
>> > /dev/md0    raw    write    209
>> > /dev/md1    raw    read    214
>> > /dev/md1    raw    write    212
>
> So, that's writing to the first 1GB of /dev/md0, and all the writes
> are going to be aligned to the MD stripe.
>
>> > /dev/md0    xfs    read    188    9
>> > /dev/md0    xfs    write    35    83o
>
> And these will not be written to the first 1GB of the block device
> but somewhere else. Most likely a region that hasn't otherwise been
> used, and so isn't going to be overwriting the same blocks like the
> /dev/md0 case is going to be. Perhaps there's some kind of stripe
> caching effect going on here? Was the md device fully initialised
> before you ran these tests?
>
>> >
>> > /dev/md1    ext3    read    199    7
>> > /dev/md1    ext3    write    36    83
>> >
>> > /dev/md0    ufs    read    212    0
>> > /dev/md0    ufs    write    53    75
>> >
>> > /dev/md0    ext2    read    202    2
>> > /dev/md0    ext2    write    34    84
>
> I suspect what you are seeing here is either the latency introduced
> by having to allocate blocks before issuing the IO, or the file
> layout due to allocation is not idea. Single threaded direct IO is
> latency bound, not bandwidth bound and, as such, is IO size
> sensitive. Allocation for direct IO is also IO size sensitive -
> there's typically an allocation per IO, so the more IO you have to
> do, the more allocation that occurs.

I just did a few more tests, this time with ext4:

device       raw/fs  mode   speed (MB/s)    slowdown (%)
/dev/md0    ext4    read    199    4%
/dev/md0    ext4    write    210    0%

This time, no slowdown at all on ext4. I believe this is due to the
multiblock allocation feature of ext4 (I'm using O_DIRECT, so that
should be what kicks in). So I guess for the other filesystems, it was
indeed the latency introduced by block allocation.
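
To compare the resulting layout with the XFS case, something like this
should show whether the ext4 extents end up stripe aligned (a sketch;
filefrag comes with e2fsprogs):

$ filefrag -v /tmp/diskmnt/filewr.zero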

Thanks,
- Martin

>
> So, on XFS, what does "xfs_bmap -vvp /tmp/diskmnt/filewr.zero"
> output for the file you wrote? Specifically, I'm interested whether
> it aligned the allocations to the stripe unit boundary, and if so,
> what offset into the device those extents sit at....
>
> Also, you should run iostat and blktrace to determine if MD is
> doing RMW cycles when being written to through the filesystem.
>
>> > Is it possible that the filesystem has such enormous impact in the
>> > write speed? We are talking about a slowdown of 80%!!! Even a
>> > filesystem as simple as ufs has a slowdown of 75%! What am I missing?
>>
>> One thing you're missing is enough info to debug this.
>>
>> /proc/mdstat, kernel version, xfs_info output, mkfs commandlines used,
>> partition table details, etc.
>
> THere's a good list here:
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
>> If something is misaligned and you are doing RMW for these IOs it could
>> hurt a lot.
>>
>> -Eric
>>
>> > Thank you,
>> >
>>
>
> --
> Dave Chinner
> david@fromorbit.com


* Re: Filesystem writes on RAID5 too slow
  2013-11-21  9:11     ` Martin Boutin
@ 2013-11-21  9:26       ` Dave Chinner
  2013-11-21  9:50         ` Martin Boutin
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2013-11-21  9:26 UTC (permalink / raw)
  To: Martin Boutin
  Cc: Eric Sandeen, Kernel.org-Linux-RAID, xfs-oss,
	Kernel.org-Linux-EXT4

On Thu, Nov 21, 2013 at 04:11:41AM -0500, Martin Boutin wrote:
> On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote:
> >> On 11/18/13, 10:02 AM, Martin Boutin wrote:
> >> > Dear list,
> >> >
> >> > I am writing about an apparent issue (or maybe it is normal, that's my
> >> > question) regarding filesystem write speed in in a linux raid device.
> >> > More specifically, I have linux-3.10.10 running in an Intel Haswell
> >> > embedded system with 3 HDDs in a RAID-5 configuration.
> >> > The hard disks have 4k physical sectors which are reported as 512
> >> > logical size. I made sure the partitions underlying the raid device
> >> > start at sector 2048.
> >>
> >> (fixed cc: to xfs list)
> >>
> >> > The RAID device has version 1.2 metadata and 4k (bytes) of data
> >> > offset, therefore the data should also be 4k aligned. The raid chunk
> >> > size is 512K.
> >> >
> >> > I have the md0 raid device formatted as ext3 with a 4k block size, and
> >> > stride and stripes correctly chosen to match the raid chunk size, that
> >> > is, stride=128,stripe-width=256.
> >> >
> >> > While I was working in a small university project, I just noticed that
> >> > the write speeds when using a filesystem over raid are *much* slower
> >> > than when writing directly to the raid device (or even compared to
> >> > filesystem read speeds).
> >> >
> >> > The command line for measuring filesystem read and write speeds was:
> >> >
> >> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
> >> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> >> >
> >> > The command line for measuring raw read and write speeds was:
> >> >
> >> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
> >> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
> >> >
> >> > Here are some speed measures using dd (an average of 20 runs).:
> >> >
> >> > device       raw/fs  mode   speed (MB/s)    slowdown (%)
> >> > /dev/md0    raw    read    207
> >> > /dev/md0    raw    write    209
> >> > /dev/md1    raw    read    214
> >> > /dev/md1    raw    write    212
> >
> > So, that's writing to the first 1GB of /dev/md0, and all the writes
> > are going to be aligned to the MD stripe.
> >
> >> > /dev/md0    xfs    read    188    9
> >> > /dev/md0    xfs    write    35    83o
> >
> > And these will not be written to the first 1GB of the block device
> > but somewhere else. Most likely a region that hasn't otherwise been
> > used, and so isn't going to be overwriting the same blocks like the
> > /dev/md0 case is going to be. Perhaps there's some kind of stripe
> > caching effect going on here? Was the md device fully initialised
> > before you ran these tests?
> >
> >> >
> >> > /dev/md1    ext3    read    199    7
> >> > /dev/md1    ext3    write    36    83
> >> >
> >> > /dev/md0    ufs    read    212    0
> >> > /dev/md0    ufs    write    53    75
> >> >
> >> > /dev/md0    ext2    read    202    2
> >> > /dev/md0    ext2    write    34    84
> >
> > I suspect what you are seeing here is either the latency introduced
> > by having to allocate blocks before issuing the IO, or the file
> > layout due to allocation is not idea. Single threaded direct IO is
> > latency bound, not bandwidth bound and, as such, is IO size
> > sensitive. Allocation for direct IO is also IO size sensitive -
> > there's typically an allocation per IO, so the more IO you have to
> > do, the more allocation that occurs.
> 
> I just did a few more tests, this time with ext4:
> 
> device       raw/fs  mode   speed (MB/s)    slowdown (%)
> /dev/md0    ext4    read    199    4%
> /dev/md0    ext4    write    210    0%
> 
> This time, no slowdown at all on ext4. I believe this is due to the
> multiblock allocation feature of ext4 (I'm using O_DIRECT, so it
> should be it). So I guess for the other filesystems, it was indeed
> the latency introduced by block allocation.

Except that XFS does extent based allocation as well, so that's not
likely the reason. The fact that ext4 doesn't see a slowdown like
every other filesystem really doesn't make a lot of sense to
me, either from an IO dispatch point of view or an IO alignment
point of view.

Why? Because all the filesystems align identically to the underlying
device and all should be doing 4k block aligned IO, and XFS has
roughly the same allocation overhead for this workload as ext4.
Did you retest XFS or any of the other filesystems directly after
running the ext4 tests (i.e. confirm you are testing apples to
apples)?

What we need to determine why other filesystems are slow (and why
ext4 is fast) is more information about your configuration and block
traces showing what is happening at the IO level, like was requested
in a previous email....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Filesystem writes on RAID5 too slow
  2013-11-21  9:26       ` Dave Chinner
@ 2013-11-21  9:50         ` Martin Boutin
  2013-11-21 13:31           ` Martin Boutin
  0 siblings, 1 reply; 19+ messages in thread
From: Martin Boutin @ 2013-11-21  9:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kernel.org-Linux-RAID, Eric Sandeen, Kernel.org-Linux-EXT4,
	xfs-oss

On Thu, Nov 21, 2013 at 4:26 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Nov 21, 2013 at 04:11:41AM -0500, Martin Boutin wrote:
>> On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote:
>> >> On 11/18/13, 10:02 AM, Martin Boutin wrote:
>> >> > Dear list,
>> >> >
>> >> > I am writing about an apparent issue (or maybe it is normal, that's my
>> >> > question) regarding filesystem write speed in in a linux raid device.
>> >> > More specifically, I have linux-3.10.10 running in an Intel Haswell
>> >> > embedded system with 3 HDDs in a RAID-5 configuration.
>> >> > The hard disks have 4k physical sectors which are reported as 512
>> >> > logical size. I made sure the partitions underlying the raid device
>> >> > start at sector 2048.
>> >>
>> >> (fixed cc: to xfs list)
>> >>
>> >> > The RAID device has version 1.2 metadata and 4k (bytes) of data
>> >> > offset, therefore the data should also be 4k aligned. The raid chunk
>> >> > size is 512K.
>> >> >
>> >> > I have the md0 raid device formatted as ext3 with a 4k block size, and
>> >> > stride and stripes correctly chosen to match the raid chunk size, that
>> >> > is, stride=128,stripe-width=256.
>> >> >
>> >> > While I was working in a small university project, I just noticed that
>> >> > the write speeds when using a filesystem over raid are *much* slower
>> >> > than when writing directly to the raid device (or even compared to
>> >> > filesystem read speeds).
>> >> >
>> >> > The command line for measuring filesystem read and write speeds was:
>> >> >
>> >> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
>> >> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
>> >> >
>> >> > The command line for measuring raw read and write speeds was:
>> >> >
>> >> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
>> >> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
>> >> >
>> >> > Here are some speed measures using dd (an average of 20 runs).:
>> >> >
>> >> > device       raw/fs  mode   speed (MB/s)    slowdown (%)
>> >> > /dev/md0    raw    read    207
>> >> > /dev/md0    raw    write    209
>> >> > /dev/md1    raw    read    214
>> >> > /dev/md1    raw    write    212
>> >
>> > So, that's writing to the first 1GB of /dev/md0, and all the writes
>> > are going to be aligned to the MD stripe.
>> >
>> >> > /dev/md0    xfs    read    188    9
>> >> > /dev/md0    xfs    write    35    83o
>> >
>> > And these will not be written to the first 1GB of the block device
>> > but somewhere else. Most likely a region that hasn't otherwise been
>> > used, and so isn't going to be overwriting the same blocks like the
>> > /dev/md0 case is going to be. Perhaps there's some kind of stripe
>> > caching effect going on here? Was the md device fully initialised
>> > before you ran these tests?
>> >
>> >> >
>> >> > /dev/md1    ext3    read    199    7
>> >> > /dev/md1    ext3    write    36    83
>> >> >
>> >> > /dev/md0    ufs    read    212    0
>> >> > /dev/md0    ufs    write    53    75
>> >> >
>> >> > /dev/md0    ext2    read    202    2
>> >> > /dev/md0    ext2    write    34    84
>> >
>> > I suspect what you are seeing here is either the latency introduced
>> > by having to allocate blocks before issuing the IO, or the file
>> > layout due to allocation is not idea. Single threaded direct IO is
>> > latency bound, not bandwidth bound and, as such, is IO size
>> > sensitive. Allocation for direct IO is also IO size sensitive -
>> > there's typically an allocation per IO, so the more IO you have to
>> > do, the more allocation that occurs.
>>
>> I just did a few more tests, this time with ext4:
>>
>> device       raw/fs  mode   speed (MB/s)    slowdown (%)
>> /dev/md0    ext4    read    199    4%
>> /dev/md0    ext4    write    210    0%
>>
>> This time, no slowdown at all on ext4. I believe this is due to the
>> multiblock allocation feature of ext4 (I'm using O_DIRECT, so it
>> should be it). So I guess for the other filesystems, it was indeed
>> the latency introduced by block allocation.
>
> Except that XFS does extent based allocation as well, so that's not
> likely the reason. The fact that ext4 doesn't see a slowdown like
> every other filesystem really doesn't make a lot of sense to
> me, either from an IO dispatch point of view or an IO alignment
> point of view.
>
> Why? Because all the filesystems align identically to the underlying
> device and all should be doing 4k block aligned IO, and XFS has
> roughly the same allocation overhead for this workload as ext4.
> Did you retest XFS or any of the other filesystems directly after
> running the ext4 tests (i.e. confirm you are testing apples to
> apples)?

Yes I did; the performance figures did not change for either XFS or ext3.
>
> What we need to determine why other filesystems are slow (and why
> ext4 is fast) is more information about your configuration and block
> traces showing what is happening at the IO level, like was requested
> in a previous email....

Ok, I'm going to try coming up with meaningful data. Thanks.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Martin Boutin



* Re: Filesystem writes on RAID5 too slow
  2013-11-21  9:50         ` Martin Boutin
@ 2013-11-21 13:31           ` Martin Boutin
  2013-11-21 16:35             ` Martin Boutin
  2013-11-21 23:41             ` Dave Chinner
  0 siblings, 2 replies; 19+ messages in thread
From: Martin Boutin @ 2013-11-21 13:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Sandeen, Kernel.org-Linux-RAID, xfs-oss,
	Kernel.org-Linux-EXT4

$ uname -a
Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
i686 GNU/Linux

$ xfs_repair -V
xfs_repair version 3.1.4

$ cat /proc/cpuinfo | grep processor
processor    : 0
processor    : 1

$ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
$ mount -t xfs /dev/md0 /tmp/diskmnt/
$ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s

$ cat /proc/meminfo
MemTotal:        1313956 kB
MemFree:         1099936 kB
Buffers:           13232 kB
Cached:           141452 kB
SwapCached:            0 kB
Active:           128960 kB
Inactive:          55936 kB
Active(anon):      30548 kB
Inactive(anon):     1096 kB
Active(file):      98412 kB
Inactive(file):    54840 kB
Unevictable:           0 kB
Mlocked:               0 kB
HighTotal:        626696 kB
HighFree:         452472 kB
LowTotal:         687260 kB
LowFree:          647464 kB
SwapTotal:         72256 kB
SwapFree:          72256 kB
Dirty:                 8 kB
Writeback:             0 kB
AnonPages:         30172 kB
Mapped:            15764 kB
Shmem:              1432 kB
Slab:              14720 kB
SReclaimable:       6632 kB
SUnreclaim:         8088 kB
KernelStack:        1792 kB
PageTables:         1176 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      729232 kB
Committed_AS:     734116 kB
VmallocTotal:     327680 kB
VmallocUsed:       10192 kB
VmallocChunk:     294904 kB
DirectMap4k:       12280 kB
DirectMap4M:      692224 kB

$ cat /proc/mounts
(...)
/dev/md0 /tmp/diskmnt xfs
rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0

$ cat /proc/partitions
major minor  #blocks  name

   8        0  976762584 sda
   8        1   10281600 sda1
   8        2  966479960 sda2
   8       16  976762584 sdb
   8       17   10281600 sdb1
   8       18  966479960 sdb2
   8       32  976762584 sdc
   8       33   10281600 sdc1
   8       34  966479960 sdc2
   (...)
   9        1   20560896 md1
   9        0 1932956672 md0

# same layout for other disks
$ fdisk -c -u /dev/sda

The device presents a logical sector size that is smaller than
the physical sector size. Aligning to a physical sector (or optimal
I/O) size boundary is recommended, or performance may be impacted.

Command (m for help): p

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048    20565247    10281600   83  Linux
/dev/sda2        20565248  1953525167   966479960   83  Linux

# unfortunately I had to reinitialize the array and recovery takes a
while... it does not impact performance much though.
$ cat /proc/mdstat
Personalities : [linear] [raid6] [raid5] [raid4]
md0 : active raid5 sda2[0] sdc2[3] sdb2[1]
      1932956672 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
      [>....................]  recovery =  2.4% (23588740/966478336)
finish=156.6min speed=100343K/sec
      bitmap: 0/1 pages [0KB], 2097152KB chunk


# sda sdb and sdc are the same model
$ hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
    Model Number:       HGST HCC541010A9E680
    (...)
    Firmware Revision:  JA0OA560
    Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II
Extensions, SATA Rev 2.5, SATA Rev 2.6; Revision: ATA8-AST T13 Project
D1697 Revision 0b
Standards:
    Used: unknown (minor revision code 0x0028)
    Supported: 8 7 6 5
    Likely used: 8
Configuration:
    Logical        max    current
    cylinders    16383    16383
    heads        16    16
    sectors/track    63    63
    --
    CHS current addressable sectors:   16514064
    LBA    user addressable sectors:  268435455
    LBA48  user addressable sectors: 1953525168
    Logical  Sector size:                   512 bytes
    Physical Sector size:                  4096 bytes
    Logical Sector-0 offset:                  0 bytes
    device size with M = 1024*1024:      953869 MBytes
    device size with M = 1000*1000:     1000204 MBytes (1000 GB)
    cache/buffer size  = 8192 KBytes (type=DualPortCache)
    Form Factor: 2.5 inch
    Nominal Media Rotation Rate: 5400
Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, no device specific minimum
    R/W multiple sector transfer: Max = 16    Current = 16
    Advanced power management level: 128
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4
         Cycle time: no flow control=120ns  IORDY flow control=120ns

$ hdparm -I /dev/sd{a,b,c} | grep "Write cache"
       *    Write cache
       *    Write cache
       *    Write cache
# therefore write cache is enabled in all drives

$ xfs_info /dev/md0
meta-data=/dev/md0               isize=256    agcount=32, agsize=15101312 blks
         =                       sectsz=4096  attr=2
data     =                       bsize=4096   blocks=483239168, imaxpct=5
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=8192, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

$ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
/tmp/diskmnt/filewr.zero:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
   0: [0..2047999]:    2049056..4097055  0 (2049056..4097055) 2048000 01111
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width
# this does not look good, does it?

# run while dd was executing; it looks like the members see almost half as
much read traffic as write traffic....
$ iostat -d -k 30 2 /dev/sda2 /dev/sdb2 /dev/sdc2
Linux 3.10.10 (haswell1)     11/21/2013     _i686_    (2 CPU)

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda2             13.75      6639.52       232.17   78863819    2757731
sdb2             13.74      6639.42       232.24   78862660    2758483
sdc2             13.68        55.86      6813.67     663443   80932375

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda2             78.27     11191.20     22556.07     335736     676682
sdb2             78.30     11175.73     22589.13     335272     677674
sdc2             78.30      5506.13     28258.47     165184     847754

Thanks
- Martin

On Thu, Nov 21, 2013 at 4:50 AM, Martin Boutin <martboutin@gmail.com> wrote:
> On Thu, Nov 21, 2013 at 4:26 AM, Dave Chinner <david@fromorbit.com> wrote:
>> On Thu, Nov 21, 2013 at 04:11:41AM -0500, Martin Boutin wrote:
>>> On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> > On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote:
>>> >> On 11/18/13, 10:02 AM, Martin Boutin wrote:
>>> >> > Dear list,
>>> >> >
>>> >> > I am writing about an apparent issue (or maybe it is normal, that's my
>>> >> > question) regarding filesystem write speed in in a linux raid device.
>>> >> > More specifically, I have linux-3.10.10 running in an Intel Haswell
>>> >> > embedded system with 3 HDDs in a RAID-5 configuration.
>>> >> > The hard disks have 4k physical sectors which are reported as 512
>>> >> > logical size. I made sure the partitions underlying the raid device
>>> >> > start at sector 2048.
>>> >>
>>> >> (fixed cc: to xfs list)
>>> >>
>>> >> > The RAID device has version 1.2 metadata and 4k (bytes) of data
>>> >> > offset, therefore the data should also be 4k aligned. The raid chunk
>>> >> > size is 512K.
>>> >> >
>>> >> > I have the md0 raid device formatted as ext3 with a 4k block size, and
>>> >> > stride and stripes correctly chosen to match the raid chunk size, that
>>> >> > is, stride=128,stripe-width=256.
>>> >> >
>>> >> > While I was working in a small university project, I just noticed that
>>> >> > the write speeds when using a filesystem over raid are *much* slower
>>> >> > than when writing directly to the raid device (or even compared to
>>> >> > filesystem read speeds).
>>> >> >
>>> >> > The command line for measuring filesystem read and write speeds was:
>>> >> >
>>> >> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
>>> >> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
>>> >> >
>>> >> > The command line for measuring raw read and write speeds was:
>>> >> >
>>> >> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
>>> >> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
>>> >> >
>>> >> > Here are some speed measures using dd (an average of 20 runs).:
>>> >> >
>>> >> > device       raw/fs  mode   speed (MB/s)    slowdown (%)
>>> >> > /dev/md0    raw    read    207
>>> >> > /dev/md0    raw    write    209
>>> >> > /dev/md1    raw    read    214
>>> >> > /dev/md1    raw    write    212
>>> >
>>> > So, that's writing to the first 1GB of /dev/md0, and all the writes
>>> > are going to be aligned to the MD stripe.
>>> >
>>> >> > /dev/md0    xfs    read    188    9
>>> >> > /dev/md0    xfs    write    35    83o
>>> >
>>> > And these will not be written to the first 1GB of the block device
>>> > but somewhere else. Most likely a region that hasn't otherwise been
>>> > used, and so isn't going to be overwriting the same blocks like the
>>> > /dev/md0 case is going to be. Perhaps there's some kind of stripe
>>> > caching effect going on here? Was the md device fully initialised
>>> > before you ran these tests?
>>> >
>>> >> >
>>> >> > /dev/md1    ext3    read    199    7
>>> >> > /dev/md1    ext3    write    36    83
>>> >> >
>>> >> > /dev/md0    ufs    read    212    0
>>> >> > /dev/md0    ufs    write    53    75
>>> >> >
>>> >> > /dev/md0    ext2    read    202    2
>>> >> > /dev/md0    ext2    write    34    84
>>> >
>>> > I suspect what you are seeing here is either the latency introduced
>>> > by having to allocate blocks before issuing the IO, or the file
>>> > layout due to allocation is not idea. Single threaded direct IO is
>>> > latency bound, not bandwidth bound and, as such, is IO size
>>> > sensitive. Allocation for direct IO is also IO size sensitive -
>>> > there's typically an allocation per IO, so the more IO you have to
>>> > do, the more allocation that occurs.
>>>
>>> I just did a few more tests, this time with ext4:
>>>
>>> device       raw/fs  mode   speed (MB/s)    slowdown (%)
>>> /dev/md0    ext4    read    199    4%
>>> /dev/md0    ext4    write    210    0%
>>>
>>> This time, no slowdown at all on ext4. I believe this is due to the
>>> multiblock allocation feature of ext4 (I'm using O_DIRECT, so it
>>> should be it). So I guess for the other filesystems, it was indeed
>>> the latency introduced by block allocation.
>>
>> Except that XFS does extent based allocation as well, so that's not
>> likely the reason. The fact that ext4 doesn't see a slowdown like
>> every other filesystem really doesn't make a lot of sense to
>> me, either from an IO dispatch point of view or an IO alignment
>> point of view.
>>
>> Why? Because all the filesystems align identically to the underlying
>> device and all should be doing 4k block aligned IO, and XFS has
>> roughly the same allocation overhead for this workload as ext4.
>> Did you retest XFS or any of the other filesystems directly after
>> running the ext4 tests (i.e. confirm you are testing apples to
>> apples)?
>
> Yes I did, the performance figures did not change for either XFS or ext3.
>>
>> What we need to determine why other filesystems are slow (and why
>> ext4 is fast) is more information about your configuration and block
>> traces showing what is happening at the IO level, like was requested
>> in a previous email....
>
> Ok, I'm going to try coming up with meaningful data. Thanks.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com
>
>
>
> --
> Martin Boutin


* Re: Filesystem writes on RAID5 too slow
  2013-11-21 13:31           ` Martin Boutin
@ 2013-11-21 16:35             ` Martin Boutin
  2013-11-21 23:41             ` Dave Chinner
  1 sibling, 0 replies; 19+ messages in thread
From: Martin Boutin @ 2013-11-21 16:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Sandeen, Kernel.org-Linux-RAID, xfs-oss,
	Kernel.org-Linux-EXT4

Sorry for the spam but I just noticed that the XFS stripe unit does not
match the stripe unit of the underlying RAID device. I tried to do a
mkfs.xfs with a stripe unit of 512KiB but mkfs.xfs complains that the
maximum stripe width is 256KiB.

So I recreated the RAID with a 256KiB chunk:
$ cat /proc/mdstat
Personalities : [linear] [raid6] [raid5] [raid4]
md0 : active raid5 sdc2[3] sdb2[1] sda2[0]
      1932957184 blocks super 1.2 level 5, 256k chunk, algorithm 2 [3/2] [UU_]
          resync=DELAYED
      bitmap: 1/1 pages [4KB], 2097152KB chunk

and called mkfs.xfs with matching parameters:
$ mkfs.xfs -d sunit=512,swidth=1024 -f -l size=32m /dev/md0
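
(Note: -d sunit/swidth are given in 512-byte sectors, so 512/1024 here
means 256KiB/512KiB. An equivalent byte-based form, assuming two data
disks, would be:)

$ mkfs.xfs -d su=256k,sw=2 -f -l size=32m /dev/md0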

Unfortunately the file is still created unaligned to the RAID stripe.
$ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
/tmp/diskmnt/filewr.zero:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET           TOTAL FLAGS
   0: [0..507903]:     2048544..2556447  0 (2048544..2556447) 507904 01111
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width

Now I'm out of ideas...

- Martin

On Thu, Nov 21, 2013 at 8:31 AM, Martin Boutin <martboutin@gmail.com> wrote:
> $ uname -a
> Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
> i686 GNU/Linux
>
> $ xfs_repair -V
> xfs_repair version 3.1.4
>
> $ cat /proc/cpuinfo | grep processor
> processor    : 0
> processor    : 1
>
> $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
> $ mount -t xfs /dev/md0 /tmp/diskmnt/
> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s
>
> $ cat /proc/meminfo
> MemTotal:        1313956 kB
> MemFree:         1099936 kB
> Buffers:           13232 kB
> Cached:           141452 kB
> SwapCached:            0 kB
> Active:           128960 kB
> Inactive:          55936 kB
> Active(anon):      30548 kB
> Inactive(anon):     1096 kB
> Active(file):      98412 kB
> Inactive(file):    54840 kB
> Unevictable:           0 kB
> Mlocked:               0 kB
> HighTotal:        626696 kB
> HighFree:         452472 kB
> LowTotal:         687260 kB
> LowFree:          647464 kB
> SwapTotal:         72256 kB
> SwapFree:          72256 kB
> Dirty:                 8 kB
> Writeback:             0 kB
> AnonPages:         30172 kB
> Mapped:            15764 kB
> Shmem:              1432 kB
> Slab:              14720 kB
> SReclaimable:       6632 kB
> SUnreclaim:         8088 kB
> KernelStack:        1792 kB
> PageTables:         1176 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:      729232 kB
> Committed_AS:     734116 kB
> VmallocTotal:     327680 kB
> VmallocUsed:       10192 kB
> VmallocChunk:     294904 kB
> DirectMap4k:       12280 kB
> DirectMap4M:      692224 kB
>
> $ cat /proc/mounts
> (...)
> /dev/md0 /tmp/diskmnt xfs
> rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0
>
> $ cat /proc/partitions
> major minor  #blocks  name
>
>    8        0  976762584 sda
>    8        1   10281600 sda1
>    8        2  966479960 sda2
>    8       16  976762584 sdb
>    8       17   10281600 sdb1
>    8       18  966479960 sdb2
>    8       32  976762584 sdc
>    8       33   10281600 sdc1
>    8       34  966479960 sdc2
>    (...)
>    9        1   20560896 md1
>    9        0 1932956672 md0
>
> # same layout for other disks
> $ fdisk -c -u /dev/sda
>
> The device presents a logical sector size that is smaller than
> the physical sector size. Aligning to a physical sector (or optimal
> I/O) size boundary is recommended, or performance may be impacted.
>
> Command (m for help): p
>
> Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 4096 bytes
> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
> Disk identifier: 0x00000000
>
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sda1            2048    20565247    10281600   83  Linux
> /dev/sda2        20565248  1953525167   966479960   83  Linux
>
> # unfortunately I had to reinitelize the array and recovery takes a
> while.. it does not impact performance much though.
> $ cat /proc/mdstat
> Personalities : [linear] [raid6] [raid5] [raid4]
> md0 : active raid5 sda2[0] sdc2[3] sdb2[1]
>       1932956672 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
>       [>....................]  recovery =  2.4% (23588740/966478336)
> finish=156.6min speed=100343K/sec
>       bitmap: 0/1 pages [0KB], 2097152KB chunk
>
>
> # sda sdb and sdc are the same model
> $ hdparm -I /dev/sda
>
> /dev/sda:
>
> ATA device, with non-removable media
>     Model Number:       HGST HCC541010A9E680
>     (...)
>     Firmware Revision:  JA0OA560
>     Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II
> Extensions, SATA Rev 2.5, SATA Rev 2.6; Revision: ATA8-AST T13 Project
> D1697 Revision 0b
> Standards:
>     Used: unknown (minor revision code 0x0028)
>     Supported: 8 7 6 5
>     Likely used: 8
> Configuration:
>     Logical        max    current
>     cylinders    16383    16383
>     heads        16    16
>     sectors/track    63    63
>     --
>     CHS current addressable sectors:   16514064
>     LBA    user addressable sectors:  268435455
>     LBA48  user addressable sectors: 1953525168
>     Logical  Sector size:                   512 bytes
>     Physical Sector size:                  4096 bytes
>     Logical Sector-0 offset:                  0 bytes
>     device size with M = 1024*1024:      953869 MBytes
>     device size with M = 1000*1000:     1000204 MBytes (1000 GB)
>     cache/buffer size  = 8192 KBytes (type=DualPortCache)
>     Form Factor: 2.5 inch
>     Nominal Media Rotation Rate: 5400
> Capabilities:
>     LBA, IORDY(can be disabled)
>     Queue depth: 32
>     Standby timer values: spec'd by Standard, no device specific minimum
>     R/W multiple sector transfer: Max = 16    Current = 16
>     Advanced power management level: 128
>     DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
>          Cycle time: min=120ns recommended=120ns
>     PIO: pio0 pio1 pio2 pio3 pio4
>          Cycle time: no flow control=120ns  IORDY flow control=120ns
>
> $ hdparm -I /dev/sd{a,b,c} | grep "Write cache"
>        *    Write cache
>        *    Write cache
>        *    Write cache
> # therefore write cache is enabled in all drives
>
> $ xfs_info /dev/md0
> meta-data=/dev/md0               isize=256    agcount=32, agsize=15101312 blks
>          =                       sectsz=4096  attr=2
> data     =                       bsize=4096   blocks=483239168, imaxpct=5
>          =                       sunit=128    swidth=256 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=8192, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
> /tmp/diskmnt/filewr.zero:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
>    0: [0..2047999]:    2049056..4097055  0 (2049056..4097055) 2048000 01111
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end   on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end   on stripe width
> # this does not look good, does it?
>
> # run while dd was executing, looks like we have almost the half
> writes as reads....
> $ iostat -d -k 30 2 /dev/sda2 /dev/sdb2 /dev/sdc2
> Linux 3.10.10 (haswell1)     11/21/2013     _i686_    (2 CPU)
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sda2             13.75      6639.52       232.17   78863819    2757731
> sdb2             13.74      6639.42       232.24   78862660    2758483
> sdc2             13.68        55.86      6813.67     663443   80932375
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sda2             78.27     11191.20     22556.07     335736     676682
> sdb2             78.30     11175.73     22589.13     335272     677674
> sdc2             78.30      5506.13     28258.47     165184     847754
>
> Thanks
> - Martin
>
> On Thu, Nov 21, 2013 at 4:50 AM, Martin Boutin <martboutin@gmail.com> wrote:
>> On Thu, Nov 21, 2013 at 4:26 AM, Dave Chinner <david@fromorbit.com> wrote:
>>> On Thu, Nov 21, 2013 at 04:11:41AM -0500, Martin Boutin wrote:
>>>> On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>> > On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote:
>>>> >> On 11/18/13, 10:02 AM, Martin Boutin wrote:
>>>> >> > Dear list,
>>>> >> >
>>>> >> > I am writing about an apparent issue (or maybe it is normal, that's my
>>>> >> > question) regarding filesystem write speed in in a linux raid device.
>>>> >> > More specifically, I have linux-3.10.10 running in an Intel Haswell
>>>> >> > embedded system with 3 HDDs in a RAID-5 configuration.
>>>> >> > The hard disks have 4k physical sectors which are reported as 512
>>>> >> > logical size. I made sure the partitions underlying the raid device
>>>> >> > start at sector 2048.
>>>> >>
>>>> >> (fixed cc: to xfs list)
>>>> >>
>>>> >> > The RAID device has version 1.2 metadata and 4k (bytes) of data
>>>> >> > offset, therefore the data should also be 4k aligned. The raid chunk
>>>> >> > size is 512K.
>>>> >> >
>>>> >> > I have the md0 raid device formatted as ext3 with a 4k block size, and
>>>> >> > stride and stripes correctly chosen to match the raid chunk size, that
>>>> >> > is, stride=128,stripe-width=256.
>>>> >> >
>>>> >> > While I was working in a small university project, I just noticed that
>>>> >> > the write speeds when using a filesystem over raid are *much* slower
>>>> >> > than when writing directly to the raid device (or even compared to
>>>> >> > filesystem read speeds).
>>>> >> >
>>>> >> > The command line for measuring filesystem read and write speeds was:
>>>> >> >
>>>> >> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
>>>> >> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
>>>> >> >
>>>> >> > The command line for measuring raw read and write speeds was:
>>>> >> >
>>>> >> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
>>>> >> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
>>>> >> >
>>>> >> > Here are some speed measures using dd (an average of 20 runs).:
>>>> >> >
>>>> >> > device       raw/fs  mode   speed (MB/s)    slowdown (%)
>>>> >> > /dev/md0    raw    read    207
>>>> >> > /dev/md0    raw    write    209
>>>> >> > /dev/md1    raw    read    214
>>>> >> > /dev/md1    raw    write    212
>>>> >
>>>> > So, that's writing to the first 1GB of /dev/md0, and all the writes
>>>> > are going to be aligned to the MD stripe.
>>>> >
>>>> >> > /dev/md0    xfs    read    188    9
>>>> >> > /dev/md0    xfs    write    35    83o
>>>> >
>>>> > And these will not be written to the first 1GB of the block device
>>>> > but somewhere else. Most likely a region that hasn't otherwise been
>>>> > used, and so isn't going to be overwriting the same blocks like the
>>>> > /dev/md0 case is going to be. Perhaps there's some kind of stripe
>>>> > caching effect going on here? Was the md device fully initialised
>>>> > before you ran these tests?
>>>> >
>>>> >> >
>>>> >> > /dev/md1    ext3    read    199    7
>>>> >> > /dev/md1    ext3    write    36    83
>>>> >> >
>>>> >> > /dev/md0    ufs    read    212    0
>>>> >> > /dev/md0    ufs    write    53    75
>>>> >> >
>>>> >> > /dev/md0    ext2    read    202    2
>>>> >> > /dev/md0    ext2    write    34    84
>>>> >
>>>> > I suspect what you are seeing here is either the latency introduced
>>>> > by having to allocate blocks before issuing the IO, or the file
>>>> > layout due to allocation is not idea. Single threaded direct IO is
>>>> > latency bound, not bandwidth bound and, as such, is IO size
>>>> > sensitive. Allocation for direct IO is also IO size sensitive -
>>>> > there's typically an allocation per IO, so the more IO you have to
>>>> > do, the more allocation that occurs.
>>>>
>>>> I just did a few more tests, this time with ext4:
>>>>
>>>> device       raw/fs  mode   speed (MB/s)    slowdown (%)
>>>> /dev/md0    ext4    read    199    4%
>>>> /dev/md0    ext4    write    210    0%
>>>>
>>>> This time, no slowdown at all on ext4. I believe this is due to the
>>>> multiblock allocation feature of ext4 (I'm using O_DIRECT, so it
>>>> should be it). So I guess for the other filesystems, it was indeed
>>>> the latency introduced by block allocation.
>>>
>>> Except that XFS does extent based allocation as well, so that's not
>>> likely the reason. The fact that ext4 doesn't see a slowdown like
>>> every other filesystem really doesn't make a lot of sense to
>>> me, either from an IO dispatch point of view or an IO alignment
>>> point of view.
>>>
>>> Why? Because all the filesystems align identically to the underlying
>>> device and all should be doing 4k block aligned IO, and XFS has
>>> roughly the same allocation overhead for this workload as ext4.
>>> Did you retest XFS or any of the other filesystems directly after
>>> running the ext4 tests (i.e. confirm you are testing apples to
>>> apples)?
>>
>> Yes I did, the performance figures did not change for either XFS or ext3.
>>>
>>> What we need to determine why other filesystems are slow (and why
>>> ext4 is fast) is more information about your configuration and block
>>> traces showing what is happening at the IO level, like was requested
>>> in a previous email....
>>
>> Ok, I'm going to try coming up with meaningful data. Thanks.
>>>
>>> Cheers,
>>>
>>> Dave.
>>> --
>>> Dave Chinner
>>> david@fromorbit.com
>>
>>
>>
>> --
>> Martin Boutin



-- 
Martin Boutin

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Filesystem writes on RAID5 too slow
  2013-11-21 13:31           ` Martin Boutin
  2013-11-21 16:35             ` Martin Boutin
@ 2013-11-21 23:41             ` Dave Chinner
  2013-11-22  9:21               ` Christoph Hellwig
                                 ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Dave Chinner @ 2013-11-21 23:41 UTC (permalink / raw)
  To: Martin Boutin
  Cc: Eric Sandeen, Kernel.org-Linux-RAID, xfs-oss,
	Kernel.org-Linux-EXT4

On Thu, Nov 21, 2013 at 08:31:38AM -0500, Martin Boutin wrote:
> $ uname -a
> Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
> i686 GNU/Linux

Oh, it's a 32-bit system. Things you don't know from the obfuscating
codenames everyone uses these days...

> $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
> $ mount -t xfs /dev/md0 /tmp/diskmnt/
> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s
....
> $ cat /proc/mounts
> (...)
> /dev/md0 /tmp/diskmnt xfs
> rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0

sunit/swidth is 512k/1MB
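
(The mount options report these in 512-byte units, so the arithmetic
is simply:)

$ echo $((1024 * 512)) $((2048 * 512))    # sunit, swidth in bytes
524288 1048576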

> # same layout for other disks
> $ fdisk -c -u /dev/sda
....
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sda1            2048    20565247    10281600   83  Linux

Aligned to 1 MB.

> /dev/sda2        20565248  1953525167   966479960   83  Linux

And that isn't aligned to 1MB. 20565248 / 2048 = 10041.625. It is
aligned to 4k, though, so there shouldn't be any hardware RMW
cycles.
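
The same check can be done from the shell, as a minimal sketch; the
divisors are the 1MB stripe width and the 4k sector size expressed in
512-byte sectors:

$ start=$(cat /sys/block/sda/sda2/start)
$ echo $((start % 2048))    # non-zero: not aligned to the 1MB stripe width
1280
$ echo $((start % 8))       # zero: still 4k aligned, so no hardware RMW
0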

> $ xfs_info /dev/md0
> meta-data=/dev/md0               isize=256    agcount=32, agsize=15101312 blks
>          =                       sectsz=4096  attr=2
> data     =                       bsize=4096   blocks=483239168, imaxpct=5
>          =                       sunit=12

sunit/swidth of 512k/1MB, so it matches the MD device.

> $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
> /tmp/diskmnt/filewr.zero:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
>    0: [0..2047999]:    2049056..4097055  0 (2049056..4097055) 2048000 01111
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end   on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end   on stripe width
> # this does not look good, does it?

Yup, looks broken.

/me digs through git. 

Yup, commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") broke
the code that sets stripe unit alignment for the initial allocation
way back in 3.2.

[ Hmmm, that would explain the very occasional failure that
generic/223 throws out (maybe once a month I see it fail). ]

Which means MD is doing RMW cycles for its parity calculations, and
that's where performance is going south.
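
A rough way to confirm that from userspace (the member device names
below are an assumption based on the fdisk output above) is to watch
for reads on the member disks while the benchmark does nothing but
writes:

$ iostat -x sda sdb sdc 1
# sustained r/s on the members during a pure O_DIRECT write workload
# points at MD reading back data/parity for its RMW updates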

Current code:

$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
   0: [0..2097151]:    1056..2098207     0 (1056..2098207)  2097152 11111
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1024 ops; 0:00:02.00 (343.815 MiB/sec and 268.6054 ops/sec)
$

Which indicates that even if we take direct IO based allocation out
of the picture, the allocation does not get aligned properly. This
is on a 3.5TB 12 SAS disk MD RAID6 with sunit=64k,swidth=640k.

With a fixed kernel:

$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
   0: [0..2097151]:    6293504..8390655  0 (6293504..8390655) 2097152 10000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (415.192 MiB/sec and 332.4779 ops/sec)
$

It's clear we have completely stripe-width (swidth) aligned allocation and it's 25% faster.

Take fallocate out of the picture so the direct IO does the
allocation:

$ xfs_io -fd -c "truncate 0" -c "pwrite 0 1g -b 1280k" -c "bmap -vvp" testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (368.241 MiB/sec and 294.8807 ops/sec)
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
   0: [0..2097151]:    2099200..4196351  0 (2099200..4196351) 2097152 00000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width

It's slower than with preallocation (no surprise - no allocation
overhead per write(2) call after preallocation is done) but the
allocation is still correctly aligned.

The patch below should fix the unaligned allocation problem you are
seeing, but because XFS defaults to stripe unit alignment for large
allocations, you might still see RMW cycles when it aligns to a
stripe unit that is not the first in a MD stripe. I'll have a quick
look at fixing that behaviour when the swalloc mount option is
specified....
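
For completeness, a minimal sketch of what that would look like
(untested here; whether it helps depends on the allocation sizes and
on the fix below being applied):

$ umount /tmp/diskmnt
$ mount -t xfs -o swalloc /dev/md0 /tmp/diskmnt/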

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: align initial file allocations correctly.

From: Dave Chinner <dchinner@redhat.com>

The function xfs_bmap_isaeof() is used to indicate that an
allocation is occurring at or past the end of file, and as such
should be aligned to the underlying storage geometry if possible.

Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
behaviour of this function for empty files - it turned off
allocation alignment for this case accidentally. Hence large initial
allocations from direct IO are not getting correctly aligned to the
underlying geometry, and that is causing write performance to drop in
alignment sensitive configurations.

Fix it by considering allocation into empty files as requiring
aligned allocation again.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 3ef11b2..8401f11 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -1635,7 +1635,7 @@ xfs_bmap_last_extent(
  * blocks at the end of the file which do not start at the previous data block,
  * we will try to align the new blocks at stripe unit boundaries.
  *
- * Returns 0 in bma->aeof if the file (fork) is empty as any new write will be
+ * Returns 1 in bma->aeof if the file (fork) is empty as any new write will be
  * at, or past the EOF.
  */
 STATIC int
@@ -1650,9 +1650,14 @@ xfs_bmap_isaeof(
 	bma->aeof = 0;
 	error = xfs_bmap_last_extent(NULL, bma->ip, whichfork, &rec,
 				     &is_empty);
-	if (error || is_empty)
+	if (error)
 		return error;
 
+	if (is_empty) {
+		bma->aeof = 1;
+		return 0;
+	}
+
 	/*
 	 * Check if we are allocation or past the last extent, or at least into
 	 * the last delayed allocated extent.

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: Filesystem writes on RAID5 too slow
  2013-11-21 23:41             ` Dave Chinner
@ 2013-11-22  9:21               ` Christoph Hellwig
  2013-11-22 22:40                 ` Dave Chinner
  2013-11-22 13:33               ` Martin Boutin
  2013-12-10 19:18               ` Christoph Hellwig
  2 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2013-11-22  9:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin Boutin, Kernel.org-Linux-RAID, Eric Sandeen,
	Kernel.org-Linux-EXT4, xfs-oss

> From: Dave Chinner <dchinner@redhat.com>
> 
> The function xfs_bmap_isaeof() is used to indicate that an
> allocation is occurring at or past the end of file, and as such
> should be aligned to the underlying storage geometry if possible.
> 
> Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> behaviour of this function for empty files - it turned off
> allocation alignment for this case accidentally. Hence large initial
> allocations from direct IO are not getting correctly aligned to the
> underlying geometry, and that is causing write performance to drop in
> alignment sensitive configurations.
> 
> Fix it by considering allocation into empty files as requiring
> aligned allocation again.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Ooops.  The fix looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


Might be worth cooking up a test for this, scsi_debug can expose
geometry, and we already have it wired to large sector size
testing in xfstests.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Filesystem writes on RAID5 too slow
  2013-11-21 23:41             ` Dave Chinner
  2013-11-22  9:21               ` Christoph Hellwig
@ 2013-11-22 13:33               ` Martin Boutin
  2013-12-10 19:18               ` Christoph Hellwig
  2 siblings, 0 replies; 19+ messages in thread
From: Martin Boutin @ 2013-11-22 13:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kernel.org-Linux-RAID, Eric Sandeen, Kernel.org-Linux-EXT4,
	xfs-oss

Dave, I just applied your patch in my vanilla 3.10.10 Linux. Here are
the new performance figures for XFS:

$ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.95292 s, 212 MB/s

: )
So things make more sense now... I hit a bug in XFS, and ext3 and ufs
do not support that kind of multiblock allocation.

Thank you all,
- Martin

On Thu, Nov 21, 2013 at 6:41 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Nov 21, 2013 at 08:31:38AM -0500, Martin Boutin wrote:
>> $ uname -a
>> Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
>> i686 GNU/Linux
>
> Oh, it's a 32-bit system. Things you don't know from the obfuscating
> codenames everyone uses these days...
>
>> $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
>> $ mount -t xfs /dev/md0 /tmp/diskmnt/
>> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
>> 1000+0 records in
>> 1000+0 records out
>> 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s
> ....
>> $ cat /proc/mounts
>> (...)
>> /dev/md0 /tmp/diskmnt xfs
>> rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0
>
> sunit/swidth is 512k/1MB
>
>> # same layout for other disks
>> $ fdisk -c -u /dev/sda
> ....
>>    Device Boot      Start         End      Blocks   Id  System
>> /dev/sda1            2048    20565247    10281600   83  Linux
>
> Aligned to 1 MB.
>
>> /dev/sda2        20565248  1953525167   966479960   83  Linux
>
> And that isn't aligned to 1MB. 20565248 / 2048 = 10041.625. It is
> aligned to 4k, though, so there shouldn't be any hardware RMW
> cycles.
>
>> $ xfs_info /dev/md0
>> meta-data=/dev/md0               isize=256    agcount=32, agsize=15101312 blks
>>          =                       sectsz=4096  attr=2
>> data     =                       bsize=4096   blocks=483239168, imaxpct=5
>>          =                       sunit=12
>
> sunit/swidth of 512k/1MB, so it matches the MD device.
>
>> $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
>> /tmp/diskmnt/filewr.zero:
>>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
>>    0: [0..2047999]:    2049056..4097055  0 (2049056..4097055) 2048000 01111
>>  FLAG Values:
>>     010000 Unwritten preallocated extent
>>     001000 Doesn't begin on stripe unit
>>     000100 Doesn't end   on stripe unit
>>     000010 Doesn't begin on stripe width
>>     000001 Doesn't end   on stripe width
>> # this does not look good, does it?
>
> Yup, looks broken.
>
> /me digs through git.
>
> Yup, commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") broke
> the code that sets stripe unit alignment for the initial allocation
> way back in 3.2.
>
> [ Hmmm, that would explain the very occasional failure that
> generic/223 throws out (maybe once a month I see it fail). ]
>
> Which means MD is doing RMW cycles for its parity calculations, and
> that's where performance is going south.
>
> Current code:
>
> $ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
> testfile:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
>    0: [0..2097151]:    1056..2098207     0 (1056..2098207)  2097152 11111
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end   on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end   on stripe width
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1024 ops; 0:00:02.00 (343.815 MiB/sec and 268.6054 ops/sec)
> $
>
> Which indicates that even if we take direct IO based allocation out
> of the picture, the allocation does not get aligned properly. This
> is on a 3.5TB 12 SAS disk MD RAID6 with sunit=64k,swidth=640k.
>
> With a fixed kernel:
>
> $ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
> testfile:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
>    0: [0..2097151]:    6293504..8390655  0 (6293504..8390655) 2097152 10000
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end   on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end   on stripe width
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 820 ops; 0:00:02.00 (415.192 MiB/sec and 332.4779 ops/sec)
> $
>
> It's clear we have completely stripe-width (swidth) aligned allocation and it's 25% faster.
>
> Take fallocate out of the picture so the direct IO does the
> allocation:
>
> $ xfs_io -fd -c "truncate 0" -c "pwrite 0 1g -b 1280k" -c "bmap -vvp" testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 820 ops; 0:00:02.00 (368.241 MiB/sec and 294.8807 ops/sec)
> testfile:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
>    0: [0..2097151]:    2099200..4196351  0 (2099200..4196351) 2097152 00000
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end   on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end   on stripe width
>
> It's slower than with preallocation (no surprise - no allocation
> overhead per write(2) call after preallocation is done) but the
> allocation is still correctly aligned.
>
> The patch below should fix the unaligned allocation problem you are
> seeing, but because XFS defaults to stripe unit alignment for large
> allocations, you might still see RMW cycles when it aligns to a
> stripe unit that is not the first in a MD stripe. I'll have a quick
> look at fixing that behaviour when the swalloc mount option is
> specified....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
> xfs: align initial file allocations correctly.
>
> From: Dave Chinner <dchinner@redhat.com>
>
> The function xfs_bmap_isaeof() is used to indicate that an
> allocation is occurring at or past the end of file, and as such
> should be aligned to the underlying storage geometry if possible.
>
> Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> behaviour of this function for empty files - it turned off
> allocation alignment for this case accidentally. Hence large initial
> allocations from direct IO are not getting correctly aligned to the
> underlying geometry, and that is causing write performance to drop in
> alignment sensitive configurations.
>
> Fix it by considering allocation into empty files as requiring
> aligned allocation again.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bmap.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
> index 3ef11b2..8401f11 100644
> --- a/fs/xfs/xfs_bmap.c
> +++ b/fs/xfs/xfs_bmap.c
> @@ -1635,7 +1635,7 @@ xfs_bmap_last_extent(
>   * blocks at the end of the file which do not start at the previous data block,
>   * we will try to align the new blocks at stripe unit boundaries.
>   *
> - * Returns 0 in bma->aeof if the file (fork) is empty as any new write will be
> + * Returns 1 in bma->aeof if the file (fork) is empty as any new write will be
>   * at, or past the EOF.
>   */
>  STATIC int
> @@ -1650,9 +1650,14 @@ xfs_bmap_isaeof(
>         bma->aeof = 0;
>         error = xfs_bmap_last_extent(NULL, bma->ip, whichfork, &rec,
>                                      &is_empty);
> -       if (error || is_empty)
> +       if (error)
>                 return error;
>
> +       if (is_empty) {
> +               bma->aeof = 1;
> +               return 0;
> +       }
> +
>         /*
>          * Check if we are allocation or past the last extent, or at least into
>          * the last delayed allocated extent.



-- 
Martin Boutin

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Filesystem writes on RAID5 too slow
  2013-11-22  9:21               ` Christoph Hellwig
@ 2013-11-22 22:40                 ` Dave Chinner
  2013-11-23  8:41                   ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2013-11-22 22:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin Boutin, Kernel.org-Linux-RAID, Eric Sandeen,
	Kernel.org-Linux-EXT4, xfs-oss

On Fri, Nov 22, 2013 at 01:21:36AM -0800, Christoph Hellwig wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The function xfs_bmap_isaeof() is used to indicate that an
> > allocation is occurring at or past the end of file, and as such
> > should be aligned to the underlying storage geometry if possible.
> > 
> > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> > behaviour of this function for empty files - it turned off
> > allocation alignment for this case accidentally. Hence large initial
> > allocations from direct IO are not getting correctly aligned to the
> > underlying geometry, and that is causing write performance to drop in
> > alignment sensitive configurations.
> > 
> > Fix it by considering allocation into empty files as requiring
> > aligned allocation again.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Ooops.  The fix looks good,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> 
> Might be worth cooking up a test for this, scsi_debug can expose
geometry, and we already have it wired to large sector size
> testing in xfstests.

We don't need to screw around with the sector size - that is
irrelevant to the problem, and we have an allocation alignment
test that is supposed to catch these issues: generic/223.

As I said, I have seen occasional failures of that test (once a
month, on average) as a result of this bug. It was simply not often
enough - running in a hard loop didn't increase the frequency of
failures - to be able debug it or to reach my "there's a regression
I need to look at" threshold. Perhaps we need to revisit that test
and see if we can make it more likely to trigger failures...
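
A rough way to hunt for it, assuming a normal xfstests checkout with
TEST_DEV and SCRATCH_DEV already configured, is just to loop the test
until it trips:

$ cd xfstests
$ while ./check generic/223; do :; done    # repeats until the first failure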

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Filesystem writes on RAID5 too slow
  2013-11-22 22:40                 ` Dave Chinner
@ 2013-11-23  8:41                   ` Christoph Hellwig
  2013-11-24 23:21                     ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2013-11-23  8:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Martin Boutin, Kernel.org-Linux-RAID,
	Eric Sandeen, Kernel.org-Linux-EXT4, xfs-oss

On Sat, Nov 23, 2013 at 09:40:38AM +1100, Dave Chinner wrote:
> > geometry, and we already have it wired to large sector size
> > testing in xfstests.
> 
> We don't need to screw around with the sector size - that is
> irrelevant to the problem, and we have an allocation alignment
> test that is supposed to catch these issues: generic/223.

It didn't imply we need large sector sizes, but the same mechanism
to expose a large sector size can also be used to present large
stripe units/width.
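
A sketch of that mechanism, with parameter values chosen purely for
illustration (physblk_exp=3 advertises 4k physical sectors behind
512-byte logical ones; opt_blks is the knob I'd expect to carry an
optimal I/O size hint, but that needs verifying on the kernel in
question):

$ modprobe scsi_debug dev_size_mb=2048 sector_size=512 physblk_exp=3 opt_blks=256
$ cat /sys/block/sdX/queue/physical_block_size    # sdX: whichever disk scsi_debug creates
$ cat /sys/block/sdX/queue/optimal_io_size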

> As I said, I have seen occasional failures of that test (once a
> month, on average) as a result of this bug. It was simply not often
> enough - running in a hard loop didn't increase the frequency of
> failures - to be able debug it or to reach my "there's a regression
> I need to look at" threshold. Perhaps we need to revisit that test
> and see if we can make it more likely to trigger failures...

Seems like 223 should have caught it regularly with the explicit
alignment options at mkfs time.  Maybe we also need a test mirroring
the plain dd more closely?

I've not seen 223 fail for a long time...

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Filesystem writes on RAID5 too slow
  2013-11-23  8:41                   ` Christoph Hellwig
@ 2013-11-24 23:21                     ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2013-11-24 23:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin Boutin, Kernel.org-Linux-RAID, Eric Sandeen,
	Kernel.org-Linux-EXT4, xfs-oss

On Sat, Nov 23, 2013 at 12:41:06AM -0800, Christoph Hellwig wrote:
> On Sat, Nov 23, 2013 at 09:40:38AM +1100, Dave Chinner wrote:
> > > geometry, and we already have it wired to to large sector size
> > > testing in xfstests.
> > 
> > We don't need to screw around with the sector size - that is
> > irrelevant to the problem, and we have an allocation alignment
> > test that is supposed to catch these issues: generic/223.
> 
> It didn't imply we need large sector sizes, but the same mechanism
> to expodse a large sector size can also be used to present large
> stripe units/width.
> 
> > As I said, I have seen occasional failures of that test (once a
> > month, on average) as a result of this bug. It was simply not often
> > enough - running in a hard loop didn't increase the frequency of
> > failures - to be able debug it or to reach my "there's a regression
> > I need to look at" threshold. Perhaps we need to revisit that test
> > and see if we can make it more likely to trigger failures...
> 
> Seems like 223 should have caught it regularly with the explicit
> alignment options at mkfs time.  Maybe we also need a test mirroring
> the plain dd more closely?

Preallocation showed the problem, too, so we probably don't even
need dd to check whether allocation alignment is working properly.
We should probably write a test that specifically checks all the
different alignment/extent size combinations we can use.

Preallocation should behave very similarly to direct IO, but I'm
pretty sure that it won't do things like round up allocations to
stripe unit/widths like direct IO does. The fact that we do
allocation sunit/swidth size alignment for direct IO outside the
allocator and sunit/swidth offset alignment inside the allocator is
kinda funky....
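
Something like the following sketch could form the core of such a
test (device and geometry values are placeholders; sunit/swidth for
mkfs.xfs are in 512-byte sectors):

$ mkfs.xfs -f -d sunit=128,swidth=256 /dev/sdX
$ mount /dev/sdX /mnt/scratch
$ xfs_io -f -c "falloc 0 64m" -c "bmap -vvp" /mnt/scratch/prealloc
# any "Doesn't begin on stripe unit/width" flags in the bmap output
# would mean the preallocated extent was not aligned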

> I've not seen 223 fail for a long time...

Not surprising, it is a one in several hundred test runs occurrence
here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Filesystem writes on RAID5 too slow
  2013-11-21 23:41             ` Dave Chinner
  2013-11-22  9:21               ` Christoph Hellwig
  2013-11-22 13:33               ` Martin Boutin
@ 2013-12-10 19:18               ` Christoph Hellwig
  2013-12-11  0:27                 ` Dave Chinner
  2 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2013-12-10 19:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin Boutin, Kernel.org-Linux-RAID, Eric Sandeen,
	Kernel.org-Linux-EXT4, xfs-oss

> xfs: align initial file allocations correctly.
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> The function xfs_bmap_isaeof() is used to indicate that an
> allocation is occurring at or past the end of file, and as such
> should be aligned to the underlying storage geometry if possible.
> 
> Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> behaviour of this function for empty files - it turned off
> allocation alignment for this case accidentally. Hence large initial
> allocations from direct IO are not getting correctly aligned to the
> underlying geometry, and that is causing write performance to drop in
> alignment sensitive configurations.
> 
> Fix it by considering allocation into empty files as requiring
> aligned allocation again.

Seems like this one didn't get picked up yet?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Filesystem writes on RAID5 too slow
  2013-12-10 19:18               ` Christoph Hellwig
@ 2013-12-11  0:27                 ` Dave Chinner
  2013-12-11 19:09                   ` Ben Myers
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2013-12-11  0:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin Boutin, Kernel.org-Linux-RAID, Eric Sandeen,
	Kernel.org-Linux-EXT4, xfs-oss

On Tue, Dec 10, 2013 at 11:18:03AM -0800, Christoph Hellwig wrote:
> > xfs: align initial file allocations correctly.
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The function xfs_bmap_isaeof() is used to indicate that an
> > allocation is occurring at or past the end of file, and as such
> > should be aligned to the underlying storage geometry if possible.
> > 
> > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> > behaviour of this function for empty files - it turned off
> > allocation alignment for this case accidentally. Hence large initial
> > allocations from direct IO are not getting correctly aligned to the
> > underlying geometry, and that is causing write performance to drop in
> > alignment sensitive configurations.
> > 
> > Fix it by considering allocation into empty files as requiring
> > aligned allocation again.
> 
> Seems like this one didn't get picked up yet?

I'm about to resend all my outstanding patches...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Filesystem writes on RAID5 too slow
  2013-12-11  0:27                 ` Dave Chinner
@ 2013-12-11 19:09                   ` Ben Myers
  0 siblings, 0 replies; 19+ messages in thread
From: Ben Myers @ 2013-12-11 19:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin Boutin, Kernel.org-Linux-RAID, Eric Sandeen, xfs-oss,
	Christoph Hellwig, Kernel.org-Linux-EXT4

Hi,

On Wed, Dec 11, 2013 at 11:27:53AM +1100, Dave Chinner wrote:
> On Tue, Dec 10, 2013 at 11:18:03AM -0800, Christoph Hellwig wrote:
> > > xfs: align initial file allocations correctly.
> > > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The function xfs_bmap_isaeof() is used to indicate that an
> > > allocation is occurring at or past the end of file, and as such
> > > should be aligned to the underlying storage geometry if possible.
> > > 
> > > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> > > behaviour of this function for empty files - it turned off
> > > allocation alignment for this case accidentally. Hence large initial
> > > allocations from direct IO are not getting correctly aligned to the
> > > underlying geometry, and that is causing write performance to drop in
> > > alignment sensitive configurations.
> > > 
> > > Fix it by considering allocation into empty files as requiring
> > > aligned allocation again.
> > 
> > Seems like this one didn't get picked up yet?
> 
> I'm about to resend all my outstanding patches...

Sorry I didn't see that one.  If you stick the keyword 'patch' in the subject I
tend to do a bit better.

Regards,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2013-12-11 19:09 UTC | newest]

Thread overview: 19+ messages
2013-11-18 16:02 Filesystem writes on RAID5 too slow Martin Boutin
2013-11-18 18:28 ` Eric Sandeen
2013-11-19  0:57   ` Dave Chinner
2013-11-21  9:11     ` Martin Boutin
2013-11-21  9:26       ` Dave Chinner
2013-11-21  9:50         ` Martin Boutin
2013-11-21 13:31           ` Martin Boutin
2013-11-21 16:35             ` Martin Boutin
2013-11-21 23:41             ` Dave Chinner
2013-11-22  9:21               ` Christoph Hellwig
2013-11-22 22:40                 ` Dave Chinner
2013-11-23  8:41                   ` Christoph Hellwig
2013-11-24 23:21                     ` Dave Chinner
2013-11-22 13:33               ` Martin Boutin
2013-12-10 19:18               ` Christoph Hellwig
2013-12-11  0:27                 ` Dave Chinner
2013-12-11 19:09                   ` Ben Myers
2013-11-18 18:41 ` Roman Mamedov
2013-11-18 19:25   ` Roman Mamedov
