* ag selection
@ 2013-11-11 17:25 Bernd Schubert
2013-11-11 17:53 ` Carlos Maiolino
2013-11-11 20:55 ` Dave Chinner
0 siblings, 2 replies; 7+ messages in thread
From: Bernd Schubert @ 2013-11-11 17:25 UTC (permalink / raw)
To: linux-xfs
Hi all,
for streaming writes onto a raid6 the current round-robin ag selection
does not seem to be optimal. Writing 4 files from 4 threads into a
single directory we get 900 MB/s, writing 4 files in 4 different
directories we only get 700 MB/s (12 disks with hw megaraid-sas).
The current round-robin scheme seems to be optimized for linear raid0?
With small AGs one could also argue that choosing AGs which are not far
away from each other (with respect to the number of blocks) also adds more
parallel disk access for small and medium sized files.
Any objections against a patch to improve the AG selection?
Thanks,
Bernd
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: ag selection
2013-11-11 17:25 ag selection Bernd Schubert
@ 2013-11-11 17:53 ` Carlos Maiolino
2013-11-11 17:55 ` Carlos Maiolino
2013-11-11 20:55 ` Dave Chinner
1 sibling, 1 reply; 7+ messages in thread
From: Carlos Maiolino @ 2013-11-11 17:53 UTC (permalink / raw)
To: xfs, linux-xfs
On Mon, Nov 11, 2013 at 06:25:13PM +0100, Bernd Schubert wrote:
> Hi all,
>
> for streaming writes onto a raid6 the current round-robin ag
> selection does not seem to be optimal. Writing 4 files from 4
> threads into a single directory we get 900 MB/s, writing 4 files in
> 4 different directories we only get 700 MB/s (12 disks with hw
> megaraid-sas). The current round-robin scheme seems to be optimized
> for linear raid0? With small AGs one could also argue that choosing
> AGs which are not far away from each other (with respect to the number
> of blocks) also adds more parallel disk access for small and medium
> sized files.
>
> Any objections against a patch to improve the AG selection?
>
I wouldn't say it is optimized specifically for RAID 0 environments, but I
lack some knowledge on this choice. The main reason for the round-robin, IIRC,
was to avoid lock contention in a single AG, spreading different files along the
whole disk, and also making it possible to allocate them contiguously along the disk.
But I'm not sure what kind of optimization you have in mind, and I believe
other engineers will also need some extra information about it and about
what kind of tests you're doing (Direct I/O, buffered,
pre-allocation), etc. You'll also need to post filesystem configuration details like
FS alignment (su, sw options), etc.
For different write patterns, you might also want to take a look at the
rotor_step procfs option, and some other options dedicated to streaming writes,
that might help you in this case.
Just my $0.02
>
> Thanks,
> Bernd
>
--
Carlos
* Re: ag selection
2013-11-11 17:53 ` Carlos Maiolino
@ 2013-11-11 17:55 ` Carlos Maiolino
2013-11-11 18:23 ` Bernd Schubert
0 siblings, 1 reply; 7+ messages in thread
From: Carlos Maiolino @ 2013-11-11 17:55 UTC (permalink / raw)
To: xfs, linux-xfs
On Mon, Nov 11, 2013 at 03:53:14PM -0200, Carlos Maiolino wrote:
> On Mon, Nov 11, 2013 at 06:25:13PM +0100, Bernd Schubert wrote:
> > Hi all,
> >
> > for streaming writes onto a raid6 the current round-robin ag
> > selection does not seem to be optimal. Writing 4 files from 4
> > threads into a single directory we get 900 MB/s, writing 4 files in
> > 4 different directories we only get 700 MB/s (12 disks with hw
> > megaraid-sas). The current round-robin scheme seems to be optimized
> > for linear raid0? With small AGs one could also argue that choosing
> > AGs which are not far away from each other (with respect to the number
> > of blocks) also adds more parallel disk access for small and medium
> > sized files.
> >
> > Any objections against a patch to improve the AG selection?
> >
>
> I wouldn't say it is optimized specifically for RAID 0 environments, but I
> lack some knowledge on this choice. The main reason for the round-robin, IIRC,
> was to avoid lock contention in a single AG, spreading different files along the
> whole disk, and also making it possible to allocate them contiguously along the disk.
>
Lock contention in the inode and block B-trees, for example. Avoiding it improves
parallelism in the filesystem, but of course this might not be the optimal behavior for all
environments. That's why XFS has a long list of mkfs/mount tuning options :-)
> But I'm not sure what kind of optimization you have in mind, and I believe
> other engineers will also need some extra information about it and about
> what kind of tests you're doing (Direct I/O, buffered,
> pre-allocation), etc. You'll also need to post filesystem configuration details like
> FS alignment (su, sw options), etc.
>
> For different write patterns, you might also want to take a look at the
> rotor_step procfs option, and some other options dedicated to streaming writes,
> that might help you in this case.
>
> Just my $0.02
>
> >
Just complementing my previous comment.
--
Carlos
* Re: ag selection
2013-11-11 17:55 ` Carlos Maiolino
@ 2013-11-11 18:23 ` Bernd Schubert
2013-11-11 18:30 ` Eric Sandeen
0 siblings, 1 reply; 7+ messages in thread
From: Bernd Schubert @ 2013-11-11 18:23 UTC (permalink / raw)
To: xfs
On 11/11/2013 06:55 PM, Carlos Maiolino wrote:
> On Mon, Nov 11, 2013 at 03:53:14PM -0200, Carlos Maiolino wrote:
>> On Mon, Nov 11, 2013 at 06:25:13PM +0100, Bernd Schubert wrote:
>>> Hi all,
>>>
>>> for streaming writes onto a raid6 the current round-robin ag
>>> selection does not seem to be optimal. Writing 4 files from 4
>>> threads into a single directory we get 900 MB/s, writing 4 files in
>>> 4 different directories we only get 700 MB/s (12 disks with hw
>>> megaraid-sas). The current round-robin scheme seems to be optimized
>>> for linear raid0? With small AGs one could also argue that choosing
>>> AGs which are not far away from each other (with respect to the number
>>> of blocks) also adds more parallel disk access for small and medium
>>> sized files.
>>>
>>> Any objections against a patch to improve the AG selection?
>>>
>>
>> I wouldn't say it is optimized specifically for RAID 0 environments, but I
>> lack some knowledge on this choice. The main reason for the round-robin, IIRC,
>> was to avoid lock contention in a single AG, spreading different files along the
>> whole disk, and also making it possible to allocate them contiguously along the disk.
>>
> Lock contention in the inode and block B-trees, for example. Avoiding it improves
> parallelism in the filesystem, but of course this might not be the optimal behavior for all
Agreed, more locks help to avoid that.
> environments. That's why XFS has a long list of mkfs/mount tuning options :-)
>
>> But I'm not sure what kind of optimization you have in mind, and I believe
>> other engineers will also need some extra information about it and about
>> what kind of tests you're doing (Direct I/O, buffered,
>> pre-allocation), etc. You'll also need to post filesystem configuration details like
>> FS alignment (su, sw options), etc.
One of my colleagues benchmarked this on one of our fast systems and another
colleague currently needs this system for other tests, so I don't have the
exact parameters. However, it was for sure formatted with options like these:
mkfs.xfs -d su=256k,sw=10 -l version=2,su=256k -isize=512 /dev/sdX
and mounted with these options:
mount -onoatime,nodiratime,largeio,inode64,swalloc,allocsize=131072k,nobarrier /dev/sdX <mountpoint>
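As a sanity check on the geometry, the numbers in the mkfs and mount lines above can be multiplied out. This is only a sketch: it assumes the 12 disks form a single RAID6 set with two disks' worth of parity (which would match sw=10), and the function name is made up for illustration.

```python
def stripe_width_kib(su_kib: int, data_disks: int) -> int:
    """Full stripe width in KiB: stripe unit times number of data disks."""
    return su_kib * data_disks


total_disks = 12
parity_disks = 2                               # RAID6, assumed single array
data_disks = total_disks - parity_disks        # 10, matches sw=10
width_kib = stripe_width_kib(256, data_disks)  # su=256k

allocsize_kib = 131072                         # allocsize from the mount options
stripes_per_prealloc = allocsize_kib / width_kib

print(f"stripe width: {width_kib} KiB")
print(f"allocsize covers {stripes_per_prealloc} full stripes")
```

Under those assumptions a full stripe is 2560 KiB (2.5 MiB), and the 128 MiB allocsize spans 51.2 full stripes.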
>>
>> For different write patterns, you might also want to take a look at the
>> rotor_step procfs option, and some other options dedicated to streaming writes,
>> that might help you in this case.
Thanks, I didn't know about that knob, I'm going to look into it.
According to the comments it's for inode32 only, but I need
to read the xfs_alloc code first to see what it actually
does.
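For anyone wanting to check the current setting before experimenting, a small sketch follows. It assumes the knob mentioned above is the one exposed as the fs.xfs.rotorstep sysctl, i.e. the file /proc/sys/fs/xfs/rotorstep; the helper name is hypothetical.

```python
from pathlib import Path
from typing import Optional

# Assumed location of the knob (sysctl fs.xfs.rotorstep).
ROTORSTEP = Path("/proc/sys/fs/xfs/rotorstep")


def read_rotorstep(path: Path = ROTORSTEP) -> Optional[int]:
    """Return the current rotorstep value, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # File missing (XFS not loaded), unreadable, or not an integer.
        return None


print(read_rotorstep())
```

The function returns None rather than raising, so it can be run on a machine without XFS loaded.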
Thanks,
Bernd
* Re: ag selection
2013-11-11 18:23 ` Bernd Schubert
@ 2013-11-11 18:30 ` Eric Sandeen
2013-11-11 19:42 ` Bernd Schubert
0 siblings, 1 reply; 7+ messages in thread
From: Eric Sandeen @ 2013-11-11 18:30 UTC (permalink / raw)
To: Bernd Schubert, xfs
On 11/11/13, 12:23 PM, Bernd Schubert wrote:
> One of my colleagues benchmarked this on one of our fast systems and another
> colleague currently needs this system for other tests, so I don't have the
> exact parameters. However, it was for sure formatted with options like these:
>
> mkfs.xfs -d su=256k,sw=10 -l version=2,su=256k -isize=512 /dev/sdX
>
> and mounted with these options:
>
> mount -onoatime,nodiratime,largeio,inode64,swalloc,allocsize=131072k,nobarrier /dev/sdX <mountpoint>
With all due respect, this is excessive knob-twiddling. Slow down. ;)
* V2 logs are default already, so -l version=2 is redundant.
* noatime implies nodiratime, so specifying both is redundant.
* "largeio" only changes the st_blksize value reported (from default page size to, in your case, the total stripe width). Does that actually affect your application behavior?
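One way to see what an application would actually be told is to look at st_blksize for a path on the mounted filesystem. A minimal sketch using Python's os.stat; the helper name is made up, and "." is only a placeholder for a path on the filesystem under test.

```python
import os


def preferred_io_size(path: str) -> int:
    """Return st_blksize, the preferred I/O size reported for path."""
    return os.stat(path).st_blksize


# "." is a placeholder; point this at a file on the XFS mount being tested.
print(preferred_io_size("."))
```

Comparing the value with and without largeio in the mount options would show whether the option changes anything a given tool can observe.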
Backing up, what kernel & what userspace versions are you testing?
-Eric
* Re: ag selection
2013-11-11 18:30 ` Eric Sandeen
@ 2013-11-11 19:42 ` Bernd Schubert
0 siblings, 0 replies; 7+ messages in thread
From: Bernd Schubert @ 2013-11-11 19:42 UTC (permalink / raw)
To: Eric Sandeen, xfs
Hello Eric,
On 11/11/2013 07:30 PM, Eric Sandeen wrote:
> On 11/11/13, 12:23 PM, Bernd Schubert wrote:
>
>> One of my colleagues benchmarked this on one of our fast systems and another
>> colleague currently needs this system for other tests, so I don't have the
>> exact parameters. However, it was for sure formatted with options like these:
>>
>> mkfs.xfs -d su=256k,sw=10 -l version=2,su=256k -isize=512 /dev/sdX
>>
>> and mounted with these options:
>>
>> mount -onoatime,nodiratime,largeio,inode64,swalloc,allocsize=131072k,nobarrier /dev/sdX <mountpoint>
>
> With all due respect, this is excessive knob-twiddling. Slow down. ;)
>
> * V2 logs are default already, so -l version=2 is redundant.
Well, some of our customers are using fhgfs + xfs on old systems and we
don't want to create a list
"with kernel/xfsprogs version < xyz, you need knob abc..."
So better add it by default, as long as it does not hurt.
> * noatime implies nodiratime, so specifying both is redundant.
Ok, I didn't add it to our default options, but I also didn't care about
it.
> * "largeio" only changes the st_blksize value reported (from default page size to, in your case, the total stripe width). Does that actually affect your application behavior?
Fhgfs does not care about it, but for some backup tools (I think rsync
does/did? read it) it helps.
>
> Backing up, what kernel & what userspace versions are you testing?
xfsprogs-3.1.1-10.el6 + kernel 3.11.
Cheers,
Bernd
* Re: ag selection
2013-11-11 17:25 ag selection Bernd Schubert
2013-11-11 17:53 ` Carlos Maiolino
@ 2013-11-11 20:55 ` Dave Chinner
1 sibling, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2013-11-11 20:55 UTC (permalink / raw)
To: Bernd Schubert; +Cc: linux-xfs
On Mon, Nov 11, 2013 at 06:25:13PM +0100, Bernd Schubert wrote:
> Hi all,
>
> for streaming writes onto a raid6 the current round-robin ag
> selection does not seem to be optimal. Writing 4 files from 4
> threads into a single directory we get 900 MB/s,
IOWs, writing all 4 files into the same AG, interleaving them into
the same physical location on disk.
> writing 4 files in
> 4 different directories we only get 700 MB/s (12 disks with hw
> megaraid-sas).
And that writes the 4 files into 4 different AGs, separating them
into physically different regions of the disk. There are seeks between
the streams there, and often cheap RAID controllers have problems
with internal caching algorithms being unable to minimise seeks
between streams effectively.
> The current round-robin scheme seems to be optimized
> for linear raid0?
Not at all - sequential writes of large files are optimised to
maintain high sequential *read* rates of the data that is being
written. Also, RAID 0 and RAID 6 have exactly the same
characteristics for this workload, so the behaviour you are seeing
is more likely due to XFS writing to slower areas of the disks
when more streams are running in more AGs.
i.e. 900MB/s might be what you get at the outer edge of the disks,
but you might only get 500MB/s at the inner edges. When writing into
4 AGs at once, they are not all going to the outer edge, and hence
you see a much truer reflection of the speed of your storage than
the single AG case.
Keep in mind the inode64 AG selection algorithm is optimised to
spread the allocation load out over the entire filesystem address
space via rotating the directory structure. It does this to
increase allocation parallelism and reduce filesystem hotspots and
to improve individual locality of disparate sets of data, and in
general is significantly faster than any other AG selection
algorithm that anyone has managed to come up with.
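The two cases from the original report can be illustrated with a toy model of this behaviour. It is not the kernel's allocator: it only assumes that each new directory is placed in the next AG in rotation and that a new file starts out in its parent directory's AG. The class name, method names and the AG count of 4 are all made up for the sketch.

```python
from itertools import count


class ToyAllocator:
    """Toy model: new directories rotate across AGs, files follow their directory."""

    def __init__(self, agcount: int) -> None:
        self.agcount = agcount
        self._rotor = count()   # advances once per new directory
        self.dir_ag = {}        # directory name -> AG number

    def mkdir(self, name: str) -> int:
        ag = next(self._rotor) % self.agcount
        self.dir_ag[name] = ag
        return ag

    def create(self, directory: str) -> int:
        # A new file is placed in the AG of its parent directory.
        return self.dir_ag[directory]


# Case 1: four files in a single directory.
alloc = ToyAllocator(agcount=4)
alloc.mkdir("shared")
same_dir_ags = {alloc.create("shared") for _ in range(4)}

# Case 2: four files in four different directories.
alloc = ToyAllocator(agcount=4)
dirs = [f"d{i}" for i in range(4)]
for d in dirs:
    alloc.mkdir(d)
diff_dir_ags = {alloc.create(d) for d in dirs}

print(len(same_dir_ags), len(diff_dir_ags))  # 1 4
```

In the model the single-directory case touches one AG and the four-directory case touches four, which is the difference in on-disk placement described above.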
> With small AGs one could also argue that choosing
> AGs which are not far away from each other (with respect to the number
> of blocks) also adds more parallel disk access for small and medium
> sized files.
>
> Any objections against a patch to improve the AG selection?
Define "improve". I'm interested in hearing new ideas on how we might
be able to make different allocation decisions, but changing
algorithms is not just a matter of changing code.
At minimum, changing the way allocation is done will drastically
change the aging characteristics of the filesystem, and so what
might work really well for empty filesystems (like ext4's linear
allocation algorithms) really hurts performance as filesystems get
older and free space gets less contiguous....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com