* Disappointing performance of copy (MD raid + XFS)
@ 2009-12-10 0:39 Asdo
2009-12-10 0:57 ` Asdo
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Asdo @ 2009-12-10 0:39 UTC (permalink / raw)
To: xfs; +Cc: linux-raid
Hi all,
I'm copying a bazillion files (14TB) from a 26-disk MD RAID-6 array
to a 16-disk MD RAID-6 array.
Filesystems are XFS on both arrays.
The kernel is 2.6.31, Ubuntu generic-14.
Performance is very disappointing, going from 150MB/sec down to 22MB/sec,
apparently depending on the size of the files it encounters. 150MB/sec is
when files are 40-80MB in size, 22MB/sec is when files are 1MB in size
on average, and I think I have seen around 10MB/sec when they are about
500KB (that 10MB/sec transfer was running in parallel with another,
faster one, however).
Running multiple rsync transfers simultaneously on different parts of the
filesystem does increase the speed, but only up to a point: even when
launching 5 of them I am not able to bring it above 150MB/sec (that's
the average: it's actually very unstable).
I have already tried tweaking: stripe_cache_size, readahead, the elevator
type and its parameters, the elevator queue length, some parameters in
/proc/sys/fs/xfs (somewhat randomly, without understanding the XFS
parameters much), and the /proc/sys/vm/*dirty* parameters.
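For reference, the knobs in question are of roughly this kind (a sketch
only: the md0/sda device names and the values are illustrative, not the
settings actually used here):

  echo 8192 > /sys/block/md0/md/stripe_cache_size  # MD RAID5/6 stripe cache
  blockdev --setra 8192 /dev/md0                   # readahead on the array
  echo deadline > /sys/block/sda/queue/scheduler   # elevator type, per member disk
  echo 512 > /sys/block/sda/queue/nr_requests      # elevator queue length
  sysctl vm.dirty_background_ratio=5               # VM writeback tunables
  sysctl vm.dirty_ratio=20
  ls /proc/sys/fs/xfs/                             # the XFS sysctls live here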
Mount options for the destination were initially the defaults; then I
tried changing them via remount to rw,nodiratime,relatime,largeio, but
without much improvement.
The above are the best results I could obtain.
At first I tried copying with cp, then with rsync. There is not much
difference between the two.
Rsync is nicer to monitor because it splits into two processes: one only
reads, the other only writes.
So I have repeatedly catted /proc/<pid>/stack for the reader and writer
processes: the *writer* is the bottleneck, and 90% of the time it is
stuck in one of the following stack traces:
[<ffffffffa02ff41d>] xlog_state_get_iclog_space+0xed/0x2d0 [xfs]
[<ffffffffa02ff76c>] xlog_write+0x16c/0x630 [xfs]
[<ffffffffa02ffc6a>] xfs_log_write+0x3a/0x70 [xfs]
[<ffffffffa030b6d7>] _xfs_trans_commit+0x197/0x3b0 [xfs]
[<ffffffffa030ff15>] xfs_free_eofblocks+0x265/0x270 [xfs]
[<ffffffffa031090d>] xfs_release+0x10d/0x1c0 [xfs]
[<ffffffffa0318200>] xfs_file_release+0x10/0x20 [xfs]
[<ffffffff81120700>] __fput+0xf0/0x210
[<ffffffff8112083d>] fput+0x1d/0x30
[<ffffffff8111cab8>] filp_close+0x58/0x90
[<ffffffff8111cba9>] sys_close+0xb9/0x110
[<ffffffff81012002>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---------
[<ffffffff8107d6cc>] down+0x3c/0x50
[<ffffffffa03176ee>] xfs_buf_lock+0x1e/0x60 [xfs]
[<ffffffffa0317869>] _xfs_buf_find+0x139/0x230 [xfs]
[<ffffffffa03179bb>] xfs_buf_get_flags+0x5b/0x170 [xfs]
[<ffffffffa0317ae3>] xfs_buf_read_flags+0x13/0xa0 [xfs]
[<ffffffffa030c9d1>] xfs_trans_read_buf+0x1c1/0x300 [xfs]
[<ffffffffa02e26c9>] xfs_da_do_buf+0x279/0x6f0 [xfs]
[<ffffffffa02e2bb5>] xfs_da_read_buf+0x25/0x30 [xfs]
[<ffffffffa02e7157>] xfs_dir2_block_addname+0x47/0x970 [xfs]
[<ffffffffa02e5e9a>] xfs_dir_createname+0x13a/0x1b0 [xfs]
[<ffffffffa0309816>] xfs_rename+0x576/0x660 [xfs]
[<ffffffffa031add1>] xfs_vn_rename+0x61/0x70 [xfs]
[<ffffffff81128766>] vfs_rename_other+0xc6/0x100
[<ffffffff81129b29>] vfs_rename+0x109/0x280
[<ffffffff8112b722>] sys_renameat+0x252/0x280
[<ffffffff8112b766>] sys_rename+0x16/0x20
[<ffffffff81012002>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
----------
[<ffffffff8107d6cc>] down+0x3c/0x50
[<ffffffffa03176ee>] xfs_buf_lock+0x1e/0x60 [xfs]
[<ffffffffa0317869>] _xfs_buf_find+0x139/0x230 [xfs]
[<ffffffffa03179bb>] xfs_buf_get_flags+0x5b/0x170 [xfs]
[<ffffffffa0317ae3>] xfs_buf_read_flags+0x13/0xa0 [xfs]
[<ffffffffa030c9d1>] xfs_trans_read_buf+0x1c1/0x300 [xfs]
[<ffffffffa02e26c9>] xfs_da_do_buf+0x279/0x6f0 [xfs]
[<ffffffffa02e2bb5>] xfs_da_read_buf+0x25/0x30 [xfs]
[<ffffffffa02e960b>] xfs_dir2_leaf_addname+0x4b/0x8b0 [xfs]
[<ffffffffa02e5ee3>] xfs_dir_createname+0x183/0x1b0 [xfs]
[<ffffffffa030fa4b>] xfs_create+0x45b/0x5f0 [xfs]
[<ffffffffa031af4b>] xfs_vn_mknod+0xab/0x1c0 [xfs]
[<ffffffffa031b07b>] xfs_vn_create+0xb/0x10 [xfs]
[<ffffffff8112967f>] vfs_create+0xaf/0xd0
[<ffffffff8112975c>] __open_namei_create+0xbc/0x100
[<ffffffff8112ccd6>] do_filp_open+0x9e6/0xac0
[<ffffffff8111cc64>] do_sys_open+0x64/0x160
[<ffffffff8111cd8b>] sys_open+0x1b/0x20
[<ffffffff81012002>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
The xfs_buf_lock trace is more common (about 3 to 1) than the
xlog_state_get_iclog_space trace.
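A trivial sampling loop along these lines is enough to collect such
traces (a sketch; <writer-pid> stands in for the rsync writer's actual
pid):

  while sleep 1; do
      cat /proc/<writer-pid>/stack
      echo ---------
  done | tee writer-stacks.txt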
I don't really understand what these buffers mentioned in the last
stack traces (xfs_buf_*) are... would anybody care to explain? Is this
performance bottleneck really related to the disks, or is the contention
on buffer locking e.g. entirely in memory and stuck for some other
reason? Can I assign more memory to XFS so as to have more buffers? I
have 32GB of RAM and it's all free... I also have 8 cores, BTW.
The controllers I'm using are 3ware 9650SE, and word is going around
that they are not optimal in terms of latency, but I didn't expect them
to be SO bad. Also, I'm not sure latency is the bottleneck here, because
XFS could buffer writes and flush them only every several seconds, and
I'm pretty sure cp and rsync never issue fsync/fdatasync themselves.
Thanks in advance for any insight.
Asdo
* Re: Disappointing performance of copy (MD raid + XFS)
2009-12-10 0:39 Disappointing performance of copy (MD raid + XFS) Asdo
@ 2009-12-10 0:57 ` Asdo
2009-12-10 1:16 ` Asdo
2009-12-10 4:16 ` Eric Sandeen
` (2 subsequent siblings)
3 siblings, 1 reply; 11+ messages in thread
From: Asdo @ 2009-12-10 0:57 UTC (permalink / raw)
To: xfs; +Cc: linux-raid
Asdo wrote:
> and I think I have seen around 10MB/sec when they are about 500KB (that
> transfer at 10MB/sec was in parallel with another, faster one, however).
Yes, I definitely confirm it: right now I have just one rsync copy
running, it's in a zone where files are around 500KB on average, and
it's going at 9MB/sec.
The stack traces of the writer process match what I posted in my
previous email, even now that the writer is the only process using the
destination array.
* Re: Disappointing performance of copy (MD raid + XFS)
2009-12-10 0:57 ` Asdo
@ 2009-12-10 1:16 ` Asdo
0 siblings, 0 replies; 11+ messages in thread
From: Asdo @ 2009-12-10 1:16 UTC (permalink / raw)
To: xfs; +Cc: linux-raid
Asdo wrote:
> Asdo wrote:
>> and I think I have seen around 10MB/sec when they are about 500KB (that
>> transfer at 10MB/sec was in parallel with another, faster one, however).
> Yes, I definitely confirm it: right now I have just one rsync copy
> running, it's in a zone where files are around 500KB on average, and
> it's going at 9MB/sec.
> The stack traces of the writer process match what I posted in my
> previous email, even now that the writer is the only process using the
> destination array.
Excuse me, I am going nuts...
In this case of 9MB/sec for 500KB files, the stack traces on the writer
are indeed very similar to what I have posted, but the relative
frequency of the two types of stack traces is different:
20%: waiting on the reader (this almost never happened when using
multiple parallel rsyncs)
50%: xlog_state_get_iclog_space+0xed/0x2d0
30%: xfs_buf_lock+0x1e/0x60
The reader is waiting either in select() (on the writer, I guess) or in this:
[<ffffffff810da74d>] sync_page+0x3d/0x50
[<ffffffff810da769>] sync_page_killable+0x9/0x40
[<ffffffff810da682>] __lock_page_killable+0x62/0x70
[<ffffffff810db8be>] T.768+0x1ee/0x440
[<ffffffff810dbbc6>] generic_file_aio_read+0xb6/0x1d0
[<ffffffffa031cd95>] xfs_read+0x115/0x2a0 [xfs]
[<ffffffffa031832b>] xfs_file_aio_read+0x5b/0x70 [xfs]
[<ffffffff8111ec32>] do_sync_read+0xf2/0x130
[<ffffffff8111f215>] vfs_read+0xb5/0x1a0
[<ffffffff8111f81c>] sys_read+0x4c/0x80
[<ffffffff81012002>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
* Re: Disappointing performance of copy (MD raid + XFS)
2009-12-10 0:39 Disappointing performance of copy (MD raid + XFS) Asdo
2009-12-10 0:57 ` Asdo
@ 2009-12-10 4:16 ` Eric Sandeen
2009-12-11 1:41 ` Asdo
2009-12-10 7:28 ` Gabor Gombas
2009-12-10 9:44 ` Kristleifur Daðason
3 siblings, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2009-12-10 4:16 UTC (permalink / raw)
To: Asdo; +Cc: xfs, linux-raid
Asdo wrote:
> Hi all,
>
> I'm copying a bazillion files (14TB) from a 26-disk MD RAID-6 array
> to a 16-disk MD RAID-6 array.
> Filesystems are XFS on both arrays.
> The kernel is 2.6.31, Ubuntu generic-14.
> Performance is very disappointing, going from 150MB/sec down to 22MB/sec,
> apparently depending on the size of the files it encounters. 150MB/sec is
> when files are 40-80MB in size, 22MB/sec is when files are 1MB in size
> on average, and I think I have seen around 10MB/sec when they are about
> 500KB (that 10MB/sec transfer was running in parallel with another,
> faster one, however).
> Running multiple rsync transfers simultaneously on different parts of the
> filesystem does increase the speed, but only up to a point: even when
> launching 5 of them I am not able to bring it above 150MB/sec (that's
> the average: it's actually very unstable).
>
> I have already tried tweaking: stripe_cache_size, readahead, the elevator
> type and its parameters, the elevator queue length, some parameters in
> /proc/sys/fs/xfs (somewhat randomly, without understanding the XFS
> parameters much), and the /proc/sys/vm/*dirty* parameters.
> Mount options for the destination were initially the defaults; then I
> tried changing them via remount to rw,nodiratime,relatime,largeio, but
> without much improvement.
A few things come to mind.
For large filesystems such as this, xfs restricts inode locations such that
inode numbers will stay below 32 bits (the number reflects the disk location).
This has the tendency to skew inodes & data away from each other, and copying
a lot of little files will probably get plenty seeky for you. This may explain
why little files are so much slower than larger ones.
If you mount with -o inode64, new inodes are free to roam the filesystem, and
stay nearer to their data blocks. This won't help on the read side if
you've got an existing filesystem, though. Note that not all 32-bit
applications cope well with > 32-bit inode numbers.
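As a sketch (the device and mount point here are placeholders, not the
actual setup), enabling it amounts to remounting the destination
filesystem with the option:

  umount /mnt/dest
  mount -o inode64 /dev/mdX /mnt/dest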
Also, you need to be sure that your filesystem geometry is well-aligned to
the raid geometry, but if this is MD software raid, that should have happened
automatically.
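For reference, spelling the alignment out explicitly would look roughly
like the sketch below for a 16-disk RAID-6 with 1MB chunks (14 data
disks); mkfs.xfs normally picks this up from MD by itself, and the
device name is a placeholder:

  mkfs.xfs -d su=1024k,sw=14 /dev/mdX
  # or, overriding it at mount time (values in 512-byte sectors):
  mount -o sunit=2048,swidth=28672 /dev/mdX /mnt/dest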
You might also want to see how fragmented your source files are; if they
are highly fragmented, this would reduce performance as you seek around to
get to the pieces. xfs_bmap will tell you this info.
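A sketch of that check (the paths and device are placeholders):

  xfs_bmap -v /source/path/to/some/file   # list the extents of one file
  xfs_db -c frag -r /dev/mdX              # overall fragmentation factor, read-only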
You might try running blktrace on the source & target block devices to see
what your IOs look like; you can use seekwatcher to graph the results
(or just use seekwatcher to run the whole show). Nasty IO patterns could
certainly kill performance.
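A sketch of how that might be run (device and trace names are
placeholders):

  blktrace -d /dev/mdX -o copytrace &   # trace the target array while copying
  # ...let the copy run for a while, stop blktrace, then graph the trace:
  seekwatcher -t copytrace -o copytrace.png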
You might also try each piece; see how fast your reads can go, and your
writes, independently.
-Eric
> The above are the best results I could obtain.
>
> At first I tried copying with cp, then with rsync. There is not much
> difference between the two.
>
> Rsync is nicer to monitor because it splits into two processes: one only
> reads, the other only writes.
>
> So I have repeatedly catted /proc/<pid>/stack for the reader and writer
> processes: the *writer* is the bottleneck, and 90% of the time it is
> stuck in one of the following stack traces:
>
...
* Re: Disappointing performance of copy (MD raid + XFS)
2009-12-10 0:39 Disappointing performance of copy (MD raid + XFS) Asdo
2009-12-10 0:57 ` Asdo
2009-12-10 4:16 ` Eric Sandeen
@ 2009-12-10 7:28 ` Gabor Gombas
2009-12-10 9:44 ` Kristleifur Daðason
3 siblings, 0 replies; 11+ messages in thread
From: Gabor Gombas @ 2009-12-10 7:28 UTC (permalink / raw)
To: Asdo; +Cc: xfs, linux-raid
On Thu, Dec 10, 2009 at 01:39:16AM +0100, Asdo wrote:
> I'm copying a bazillion files (14TB) from a 26-disk MD RAID-6
> array to a 16-disk MD RAID-6 array.
> Filesystems are XFS on both arrays.
[...]
> So I have repeatedly catted /proc/<pid>/stack for the reader and
> writer processes: the *writer* is the bottleneck, and 90% of the
> time it is stuck in one of the following stack traces:
What does iostat say about disk utilization during the copy?
IMHO, copying a large number of relatively small files means lots of
metadata operations, which in turn means lots of small, scattered writes.
For parity RAID, that kind of load is a killer, since writes smaller than
the stripe size turn into "read all disks - modify the stripe -
write it back".
Use RAID10 if you want good performance for that kind of load.
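A sketch of what to watch for: a burst of reads on the member disks
during a purely write-side workload is the classic sign of
read-modify-write.

  iostat -x 5   # watch %util, await, and r/s on the md members during the copy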
Gabor
--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
* Re: Disappointing performance of copy (MD raid + XFS)
2009-12-10 0:39 Disappointing performance of copy (MD raid + XFS) Asdo
` (2 preceding siblings ...)
2009-12-10 7:28 ` Gabor Gombas
@ 2009-12-10 9:44 ` Kristleifur Daðason
3 siblings, 0 replies; 11+ messages in thread
From: Kristleifur Daðason @ 2009-12-10 9:44 UTC (permalink / raw)
To: Asdo; +Cc: xfs, linux-raid
On Thu, Dec 10, 2009 at 12:39 AM, Asdo <asdo@shiftmail.org> wrote:
> Hi all,
>
> I'm copying a bazillion files (14TB) from a 26-disk MD RAID-6 array to a
> 16-disk MD RAID-6 array.
> Filesystems are XFS on both arrays.
> The kernel is 2.6.31, Ubuntu generic-14.
> Performance is very disappointing, going from 150MB/sec down to 22MB/sec,
> apparently depending on the size of the files it encounters. 150MB/sec is when
> files are 40-80MB in size, 22MB/sec is when files are 1MB in size on
> average, and I think I have seen around 10MB/sec when they are about 500KB
> (that transfer at 10MB/sec was in parallel with another, faster one, however).
This may not be related, but I have had some general problems
regarding transfer rates in Ubuntu. Using a vanilla kernel seemed to
help: http://kernel.ubuntu.com/~kernel-ppa/mainline/
-- Kristleifur
* Re: Disappointing performance of copy (MD raid + XFS)
2009-12-10 4:16 ` Eric Sandeen
@ 2009-12-11 1:41 ` Asdo
2009-12-11 3:20 ` Eric Sandeen
2009-12-11 3:26 ` Dave Chinner
0 siblings, 2 replies; 11+ messages in thread
From: Asdo @ 2009-12-11 1:41 UTC (permalink / raw)
Cc: Eric Sandeen, xfs, linux-raid, Kristleifur Daðason,
Gabor Gombas
Eric Sandeen wrote:
Gabor Gombas wrote:
Kristleifur Daðason wrote:
[CUT]
Thank you guys for your help.
I have done further investigation.
I still have not checked how performance is with very small files and
multiple simultaneous rsyncs.
I have checked the other problem I mentioned: that I couldn't go above
150MB/sec even with large files and multiple simultaneous transfers.
I confirm this one, and I have narrowed down the problem: two XFS
defaults (optimizations) actually hurt performance.
The first and most important is the aligned writes: cat /proc/mounts
lists this (autodetected) stripe size: "sunit=2048,swidth=28672". My
chunk size is 1MB and I have 16 disks in RAID-6, so 14 data disks. Do you
think that's correct? xfs_info lists blocks as 4k, and its sunit and
swidth are in 4k blocks and have very different values. Please do not use
the same names "sunit"/"swidth" to mean two different things in two
different places; it can confuse the user (me!)
Anyway, that's not the problem: I have tried specifying other values in
my mount (in particular I tried the values sunit and swidth should have
had if they were in 4k blocks), but ANY aligned XFS mount kills
performance for me. I have to specify "noalign" in my mount to go fast.
(Also note this option cannot be changed with mount -o remount; I have
to unmount.)
The other default feature that kills performance for me is the
rotorstep. I have to max it out at 255 in order to get good
performance. Actually, it is reasonable that a higher rotorstep should
be faster... but why is 1 the default? Why does it even exist? With low
values the await (iostat -x 1) increases, I guess because of the seeks,
and stripe_cache_active stays higher, because there are fewer filled
stripes.
If I use noalign and rotorstep at 255, I am able to go at 325 MB/sec on
average (16 parallel transfers of 7MB files), while with the defaults I
go at about 90 MB/sec.
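In concrete terms, the workaround amounts to something like this (the
device and mount point are placeholders):

  sysctl fs.xfs.rotorstep=255          # or: echo 255 > /proc/sys/fs/xfs/rotorstep
  umount /mnt/dest
  mount -o noalign /dev/mdX /mnt/dest  # noalign cannot be set via -o remount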
Also, with noalign and rotorstep at 255, the stripe_cache_size usually
stays in the lower half (below 16000 out of 32000), while with the
defaults it's stuck at the maximum most of the time and processes are
stuck sleeping on MD locks for this reason.
Do you have any knowledge of the sunit/swidth alignment mechanism being
broken in 2.6.31, or more specifically in Ubuntu's 2.6.31 generic-14?
(Kristleifur, thank you, I have seen your mention of the Ubuntu vs.
vanilla kernel; I will try a vanilla one, but right now I can't. However,
now that I have narrowed down the problem, the XFS people might want to
look at the alignment issue more specifically.)
Regarding my previous post, I would still like to know what those stack
traces I posted are about: what are the functions
xlog_state_get_iclog_space+0xed/0x2d0 [xfs]
and
xfs_buf_lock+0x1e/0x60 [xfs]
and what are they waiting for?
These are still the places where processes get stuck, even after having
worked around the alignment/rotorstep problem...
And then a few questions on inode64:
- if I start using inode64, do I have to remember to use inode64 on
every subsequent mount for the life of that filesystem? Or does it
record in some filesystem info region that the option has been used
once, so that it applies inode64 by itself on subsequent mounts?
- if I use a 64-bit Linux distro, will ALL userland programs
automatically support 64-bit inodes, or do I have to continuously pay
attention and risk damaging my data?
Thanks for your help
* Re: Disappointing performance of copy (MD raid + XFS)
2009-12-11 1:41 ` Asdo
@ 2009-12-11 3:20 ` Eric Sandeen
2009-12-11 3:26 ` Dave Chinner
1 sibling, 0 replies; 11+ messages in thread
From: Eric Sandeen @ 2009-12-11 3:20 UTC (permalink / raw)
To: Asdo; +Cc: xfs, linux-raid, Kristleifur Daðason, Gabor Gombas
Asdo wrote:
> Eric Sandeen wrote:
> Gabor Gombas wrote:
> Kristleifur Daðason wrote:
> [CUT]
>
> Thank you guys for your help.
>
> I have done further investigation.
>
> I still have not checked how performance is with very small files and
> multiple simultaneous rsyncs.
>
> I have checked the other problem I mentioned: that I couldn't go above
> 150MB/sec even with large files and multiple simultaneous transfers.
> I confirm this one, and I have narrowed down the problem: two XFS
> defaults (optimizations) actually hurt performance.
>
> The first and most important is the aligned writes: cat /proc/mounts
> lists this (autodetected) stripe size: "sunit=2048,swidth=28672". My
> chunk size is 1MB and I have 16 disks in RAID-6, so 14 data disks. Do you
> think that's correct? xfs_info lists blocks as 4k, and its sunit and
> swidth are in 4k blocks and have very different values. Please do not use
> the same names "sunit"/"swidth" to mean two different things in two
> different places; it can confuse the user (me!)
Granted, this is confusing.
The /proc/mounts units are in 512-byte sectors, so 2048 sectors is 1MB,
and 28672/2048 is 14; that all looks right.
> Anyway, that's not the problem: I have tried specifying other values in
> my mount (in particular I tried the values sunit and swidth should have
> had if they were in 4k blocks), but ANY aligned XFS mount kills
> performance for me.
certainly any wrong alignment would ;)
> I have to specify "noalign" in my mount to go fast. (Also note
> this option cannot be changed on mount -o remount. I have to unmount.)
so noalign is faster than the defaults? hm.
> The other default feature that kills performance for me is the
> rotorstep. I have to max it out at 255 in order to get good
> performance. Actually, it is reasonable that a higher rotorstep should
> be faster... but why is 1 the default? Why does it even exist? With low
> values the await (iostat -x 1) increases, I guess because of the seeks,
> and stripe_cache_active stays higher, because there are fewer filled stripes.
This is related to the inode64 mount option I mentioned, which I guess
you haven't tested yet? rotorstep affects how often new AGs are chosen in
32-bit inode mode. I'm not sure why 1 is the default; perhaps
this should be changed.
> If I use noalign and rotorstep at 255 I am able to go at 325 MB/sec on
> average (16 parallel transfers of 7MB files) while with defaults I go at
> about 90 MB/sec.
It might be nice to do some blktracing to see what's actually
hitting the disk.
Are you running on the entire MD device, or is it partitioned?
If you have partitioned the device in a way that leaves the partitions
not stripe-aligned, maybe that throws everything off.
Could you post your partition info, if any, as well as the actual
RAID geometry as reported by md?
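A sketch of the commands that would show this (device names are
placeholders):

  mdadm --detail /dev/mdX   # chunk size, level, number of devices
  cat /proc/mdstat          # array layout at a glance
  fdisk -lu /dev/mdX        # partition table, if any, with sector offsets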
> Also with noalign and rotorstep at 255 the stripe_cache_size stays
> usually in the lower half (below 16000 out of 32000) while with defaults
> it's stuck for most of the time at the maximum and processes are stuck
> sleeping in MD locks for this reason.
>
> Do you have any knowledge of sunit/swidth alignment mechanism being
> broken on 2.6.31 or more specifically 2.6.31 ubuntu generic-14 ?
Nope, I don't use Ubuntu, and AFAIK stripe alignment is just fine upstream.
> (Kristleifur thank you I have seen your mention of the Ubuntu vs vanilla
> kernel, I will try a vanilla one but right now I can't. However now I
> have narrowed the problem so XFS people might want to watch at the
> alignment problem more specifically)
>
> Regarding my previous post, I would still like to know what those stack
> traces I posted are about: what are the functions
> xlog_state_get_iclog_space+0xed/0x2d0 [xfs] and
> xfs_buf_lock+0x1e/0x60 [xfs]
> and what are they waiting for?
> These are still the places where processes get stuck, even after having
> worked around the alignment/rotorstep problem...
>
> And then a few questions on inode64:
> - if I start using inode64, do I have to remember to use inode64 on
> every subsequent mount for the life of that filesystem? Or does it
> record in some filesystem info region that the option has been used
> once, so that it applies inode64 by itself on subsequent mounts?
Unfortunately it's not a superblock feature, though IMHO it should
be; so yes, you need to mount with it every time. Leaving it out won't
harm your filesystem; it'll just put you back in 32-bit-inode mode.
Just put it in your fstab.
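A sketch of such an fstab line (device and mount point are
placeholders):

  /dev/mdX   /mnt/dest   xfs   inode64   0   0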
> - if I use a 64-bit Linux distro, will ALL userland programs
> automatically support 64-bit inodes, or do I have to continuously pay
> attention and risk damaging my data?
all 64-bit applications should be just fine.
-Eric
> Thanks for your help
>
* Re: Disappointing performance of copy (MD raid + XFS)
2009-12-11 1:41 ` Asdo
2009-12-11 3:20 ` Eric Sandeen
@ 2009-12-11 3:26 ` Dave Chinner
2009-12-15 16:51 ` Kasper Sandberg
1 sibling, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2009-12-11 3:26 UTC (permalink / raw)
To: Asdo; +Cc: linux-raid, Kristleifur Daðason, Eric Sandeen, Gabor Gombas,
xfs
On Fri, Dec 11, 2009 at 02:41:32AM +0100, Asdo wrote:
> I have checked the other problem I mentioned: that I couldn't go above
> 150MB/sec even with large files and multiple simultaneous transfers.
> I confirm this one, and I have narrowed down the problem: two XFS
> defaults (optimizations) actually hurt performance.
>
> The first and most important is the aligned writes: cat /proc/mounts
> lists this (autodetected) stripe size: "sunit=2048,swidth=28672". My
> chunk size is 1MB and I have 16 disks in RAID-6, so 14 data disks. Do you
> think that's correct?
Yes. The units that mkfs/xfs_info use are not consistent - in this
case sunit/swidth are in 512-byte sectors, so the values are
effectively { sunit = 1MB, swidth = 14MB }, which matches your
RAID-6 configuration correctly.
> xfs_info lists blocks as 4k, and its sunit and swidth are
> in 4k blocks and have very different values. Please do not use the same
> names "sunit"/"swidth" to mean two different things in two different
> places; it can confuse the user (me!)
I know, and I agree that it is less than optimal. It's been like this
forever, and unfortunately, while changing it is relatively easy, the
knock-on effect of breaking most of the QA tests we have (scripts parse
the output of mkfs) makes it a much larger amount of effort to change
than it otherwise looks. Still, we should consider doing it....
> Anyway, that's not the problem: I have tried specifying other values in
> my mount (in particular I tried the values sunit and swidth should have
> had if they were in 4k blocks), but ANY aligned XFS mount kills
> performance for me. I have to specify "noalign" in my mount to go fast.
> (Also note this option cannot be changed with mount -o remount; I have
> to unmount.)
That sounds like the filesystem is not aligned to the underlying
RAID correctly. Allocation for a filesystem with sunit/swidth set
aligns to the start of stripe units, so allocation between large
files is sparse. noalign turns off the allocation alignment, so it
will leave far fewer holes in the writeback pattern, and that
reduces the impact of an unaligned filesystem....
Sorry if this has already been asked and answered - is your
filesystem straight on the MD RAID volume, or is there a
partition/LVM/DM configuration in between?
> The other default feature that kills performance for me is the
> rotorstep. I have to max it out at 255 in order to get good
> performance.
So you are not using inode64, then? If you have 64-bit systems, then
you probably should use inode64....
> Actually, it is reasonable that a higher rotorstep should
> be faster... but why is 1 the default? Why does it even exist? With low
> values the await (iostat -x 1) increases, I guess because of the seeks,
> and stripe_cache_active stays higher, because there are fewer filled stripes.
Rotorstep is used to determine how far apart to spread files in the
inode32 allocator. Basically, every new file created has the AG it
will be placed in selected by:
new_ag = (last_ag + rotorstep) % num_ags_in_filesystem;
By default it just picks the next AG (a linear progression). If you
have only a few AGs, then a value of 255 will effectively randomise
the AG being selected. For your workload, that must result in the
best distribution of IO for your storage subsystem. In general,
though, no matter how much you tweak inode32 with rotorstep, the
inode64 allocator usually performs better.
> If I use noalign and rotorstep at 255 I am able to go at 325 MB/sec on
> average (16 parallel transfers of 7MB files) while with defaults I go at
> about 90 MB/sec.
>
> Also with noalign and rotorstep at 255 the stripe_cache_size stays
> usually in the lower half (below 16000 out of 32000) while with defaults
> it's stuck for most of the time at the maximum and processes are stuck
> sleeping in MD locks for this reason.
That really does sound like a misaligned filesystem - the stripe
cache will grow larger the more RMW cycles that need to be
performed...
> Regarding my previous post, I would still like to know what those stack
> traces I posted are about: what are the functions
> xlog_state_get_iclog_space+0xed/0x2d0 [xfs] and
Waiting on IO completion on the log buffers so that
the current transaction can be written into the next log buffer
(there are 8 log buffers). Basically a sign of being IO bound during
metadata operations.
> xfs_buf_lock+0x1e/0x60 [xfs]
Generally you see this while waiting on IO completion to unlock the
buffer so that it can be locked into the current transaction for
further modification. Usually a result of the log tail being pushed
to make room for new transactions.
> And then a few questions on inode64:
> - if I start using inode64, do I have to remember to use inode64 on
> every subsequent mount for the life of that filesystem?
No, you are not forced to, but if you forget it will revert to
inode32 allocator behaviour.
> Or does it
> record in some filesystem info region that the option has been used
> once, so that it applies inode64 by itself on subsequent mounts?
No, it does not do this, but probably should.
> - if I use a 64-bit Linux distro, will ALL userland programs
> automatically support 64-bit inodes, or do I have to continuously pay
> attention and risk damaging my data?
If you use 64-bit applications, then they should all support 64-bit
inodes.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Disappointing performance of copy (MD raid + XFS)
2009-12-11 3:26 ` Dave Chinner
@ 2009-12-15 16:51 ` Kasper Sandberg
2009-12-15 16:53 ` Eric Sandeen
0 siblings, 1 reply; 11+ messages in thread
From: Kasper Sandberg @ 2009-12-15 16:51 UTC (permalink / raw)
To: Dave Chinner
Cc: Asdo, linux-raid, Kristleifur Daðason, Eric Sandeen,
Gabor Gombas, xfs
On Fri, 2009-12-11 at 14:26 +1100, Dave Chinner wrote:
> On Fri, Dec 11, 2009 at 02:41:32AM +0100, Asdo wrote:
> > I have checked the other problem I had which I was mentioning, that I
<snip>
> > Also with noalign and rotorstep at 255 the stripe_cache_size stays
> > usually in the lower half (below 16000 out of 32000) while with defaults
> > it's stuck for most of the time at the maximum and processes are stuck
> > sleeping in MD locks for this reason.
>
> That really does sound like a misaligned filesystem - the stripe
> cache will grow larger the more RMW cycles that need to be
> performed...
Sorry to interrupt; I would very much like to check whether my array/XFS
is properly aligned and all the other things that have been mentioned in
here. Could any of you please post what is required to check this? The
filesystem is straight on MD RAID-6, Debian Lenny.
>
<snip>
* Re: Disappointing performance of copy (MD raid + XFS)
2009-12-15 16:51 ` Kasper Sandberg
@ 2009-12-15 16:53 ` Eric Sandeen
0 siblings, 0 replies; 11+ messages in thread
From: Eric Sandeen @ 2009-12-15 16:53 UTC (permalink / raw)
To: Kasper Sandberg
Cc: Dave Chinner, Asdo, linux-raid, Kristleifur Daðason,
Gabor Gombas, xfs
Kasper Sandberg wrote:
> On Fri, 2009-12-11 at 14:26 +1100, Dave Chinner wrote:
>> On Fri, Dec 11, 2009 at 02:41:32AM +0100, Asdo wrote:
>>> I have checked the other problem I had which I was mentioning, that I
> <snip>
>>> Also with noalign and rotorstep at 255 the stripe_cache_size stays
>>> usually in the lower half (below 16000 out of 32000) while with defaults
>>> it's stuck for most of the time at the maximum and processes are stuck
>>> sleeping in MD locks for this reason.
>> That really does sound like a misaligned filesystem - the stripe
>> cache will grow larger the more RMW cycles that need to be
>> performed...
> Sorry to interrupt; I would very much like to check whether my array/XFS
> is properly aligned and all the other things that have been mentioned in
> here. Could any of you please post what is required to check this? The
> filesystem is straight on MD RAID-6, Debian Lenny.
>
>
> <snip>
>
Use xfs_info on your mount point, and compare the stripe unit and width
to what md tells you. The xfs_info stripe unit/width output is in
filesystem block units.
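A sketch of that comparison (mount point and device are placeholders):
sunit (in filesystem blocks) times the block size should equal the MD
chunk size, and swidth/sunit should equal the number of data disks.

  xfs_info /mnt/point | grep -E 'bsize|sunit|swidth'
  mdadm --detail /dev/mdX | grep -E 'Raid Level|Chunk Size|Raid Devices'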
-Eric