* ag selection
From: Bernd Schubert @ 2013-11-11 17:25 UTC
To: linux-xfs

Hi all,

for streaming writes onto a raid6 the current round-robin ag
selection does not seem to be optimal. Writing 4 files from 4
threads into a single directory we get 900 MB/s, writing 4 files in
4 different directories we only get 700 MB/s (12 disks with hw
megaraid-sas). The current round-robin scheme seems to be optimized
for linear raid0? With small AGs one could also argue that choosing
AGs which are not far away from each other (with respect to the number
of blocks) also adds more parallel disk access for small and medium
sized files.

Any objections against a patch to improve the AG selection?

Thanks,
Bernd

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
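Editorial sketch: the comparison described above (4 writer threads into one directory versus 4 directories) could be reproduced roughly as follows. This is not the benchmark that produced the 900 MB/s and 700 MB/s figures; the helper names, the file names, and the tiny default sizes are assumptions made for illustration. Each stream is fsynced inside the timed region so the page cache does not hide the device speed.

```python
import os
import tempfile
import threading
import time


def write_stream(path, total_bytes, chunk_bytes=1 << 20):
    """Sequentially write total_bytes of zeros to path, then fsync."""
    chunk = b"\0" * chunk_bytes
    written = 0
    with open(path, "wb", buffering=0) as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            written += f.write(chunk[:n])
        os.fsync(f.fileno())
    return written


def layout(root, nfiles, separate_dirs):
    """Return nfiles target paths: all in one directory, or one directory each."""
    paths = []
    for i in range(nfiles):
        d = os.path.join(root, f"dir{i}" if separate_dirs else "shared")
        os.makedirs(d, exist_ok=True)
        paths.append(os.path.join(d, f"stream{i}.dat"))
    return paths


def run_case(paths, total_bytes):
    """Write one file per thread concurrently and return aggregate MB/s."""
    threads = [threading.Thread(target=write_stream, args=(p, total_bytes))
               for p in paths]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return len(paths) * total_bytes / elapsed / 1e6


if __name__ == "__main__":
    # Hypothetical default: a scratch directory. Point root at the XFS
    # mount under test and raise size to many GiB for meaningful numbers.
    root = tempfile.mkdtemp(prefix="agbench-")
    size = 4 << 20  # 4 MiB per stream, far too small for a real benchmark
    for separate in (False, True):
        rate = run_case(layout(root, 4, separate), size)
        label = "4 directories" if separate else "1 directory"
        print(f"{label}: {rate:.0f} MB/s")
```

Running both cases on a freshly made filesystem and again on an aged one would also show whether the gap depends on where the AGs sit on the disks.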
* Re: ag selection
From: Carlos Maiolino @ 2013-11-11 17:53 UTC
To: xfs, linux-xfs

On Mon, Nov 11, 2013 at 06:25:13PM +0100, Bernd Schubert wrote:
> Hi all,
>
> for streaming writes onto a raid6 the current round-robin ag
> selection does not seem to be optimal. Writing 4 files from 4
> threads into a single directory we get 900 MB/s, writing 4 files in
> 4 different directories we only get 700 MB/s (12 disks with hw
> megaraid-sas). The current round-robin scheme seems to be optimized
> for linear raid0? With small AGs one could also argue that choosing
> AGs which are not far away from each other (with respect to the number
> of blocks) also adds more parallel disk access for small and medium
> sized files.
>
> Any objections against a patch to improve the AG selection?
>

I wouldn't say it is optimized specifically for RAID 0 environments, but I
lack some knowledge on this choice. The main reason for the round-robin, IIRC,
was to avoid lock contention in a single AG by spreading different files across
the whole disk, while still allowing each of them to be allocated contiguously.

But I'm not sure what kind of optimization you have in mind, and I believe
other engineers will also need some extra information about it: what kind of
tests you're doing (direct I/O, buffered, pre-allocation), etc. You'll also
need to post filesystem configuration details like FS alignment (su, sw
options), etc.

For different write patterns, you might also want to take a look at the
rotor_step procfs option, and some other options dedicated to streaming writes,
that might help you in this case.

Just my $0.02

>
> Thanks,
> Bernd

--
Carlos
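Editorial sketch: the knob mentioned above can be inspected from user space. The kernel's XFS documentation lists the sysctl as fs.xfs.rotorstep, which is assumed here to be exposed at /proc/sys/fs/xfs/rotorstep; the helper name is hypothetical, and the function returns None when the file is absent (for example when XFS is not loaded).

```python
from pathlib import Path

# Assumed location of the sysctl (fs.xfs.rotorstep in the kernel's XFS
# documentation). The file only exists when the XFS module is loaded.
ROTORSTEP_PATH = Path("/proc/sys/fs/xfs/rotorstep")


def read_rotorstep(path=ROTORSTEP_PATH):
    """Return the current rotorstep value, or None if it is unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("fs.xfs.rotorstep =", read_rotorstep())
```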
* Re: ag selection
From: Carlos Maiolino @ 2013-11-11 17:55 UTC
To: xfs, linux-xfs

On Mon, Nov 11, 2013 at 03:53:14PM -0200, Carlos Maiolino wrote:
> On Mon, Nov 11, 2013 at 06:25:13PM +0100, Bernd Schubert wrote:
> > Hi all,
> >
> > for streaming writes onto a raid6 the current round-robin ag
> > selection does not seem to be optimal. Writing 4 files from 4
> > threads into a single directory we get 900 MB/s, writing 4 files in
> > 4 different directories we only get 700 MB/s (12 disks with hw
> > megaraid-sas). The current round-robin scheme seems to be optimized
> > for linear raid0? With small AGs one could also argue that choosing
> > AGs which are not far away from each other (with respect to the number
> > of blocks) also adds more parallel disk access for small and medium
> > sized files.
> >
> > Any objections against a patch to improve the AG selection?
> >
>
> I wouldn't say it is optimized specifically for RAID 0 environments, but I
> lack some knowledge on this choice. The main reason for the round-robin, IIRC,
> was to avoid lock contention in a single AG by spreading different files across
> the whole disk, while still allowing each of them to be allocated contiguously.
>

Lock contention in the inode and block B-trees, for example. This improves
parallelism in the filesystem, but of course it might not be the optimal
behavior for all environments. That's why XFS has a long list of mkfs/mount
tuning options :-)

> But I'm not sure what kind of optimization you have in mind, and I believe
> other engineers will also need some extra information about it: what kind of
> tests you're doing (direct I/O, buffered, pre-allocation), etc. You'll also
> need to post filesystem configuration details like FS alignment (su, sw
> options), etc.
>
> For different write patterns, you might also want to take a look at the
> rotor_step procfs option, and some other options dedicated to streaming writes,
> that might help you in this case.
>
> Just my $0.02
>

Just complementing my past comment

--
Carlos
* Re: ag selection
From: Bernd Schubert @ 2013-11-11 18:23 UTC
To: xfs

On 11/11/2013 06:55 PM, Carlos Maiolino wrote:
> On Mon, Nov 11, 2013 at 03:53:14PM -0200, Carlos Maiolino wrote:
>> On Mon, Nov 11, 2013 at 06:25:13PM +0100, Bernd Schubert wrote:
>>> Hi all,
>>>
>>> for streaming writes onto a raid6 the current round-robin ag
>>> selection does not seem to be optimal. Writing 4 files from 4
>>> threads into a single directory we get 900 MB/s, writing 4 files in
>>> 4 different directories we only get 700 MB/s (12 disks with hw
>>> megaraid-sas). The current round-robin scheme seems to be optimized
>>> for linear raid0? With small AGs one could also argue that choosing
>>> AGs which are not far away from each other (with respect to the number
>>> of blocks) also adds more parallel disk access for small and medium
>>> sized files.
>>>
>>> Any objections against a patch to improve the AG selection?
>>>
>>
>> I wouldn't say it is optimized specifically for RAID 0 environments, but I
>> lack some knowledge on this choice. The main reason for the round-robin, IIRC,
>> was to avoid lock contention in a single AG by spreading different files across
>> the whole disk, while still allowing each of them to be allocated contiguously.
>>
> Lock contention in the inode and block B-trees, for example. This improves
> parallelism in the filesystem, but of course it might not be the optimal
> behavior for all

Agreed, more locks help to avoid that.

> environments. That's why XFS has a long list of mkfs/mount tuning options :-)
>
>> But I'm not sure what kind of optimization you have in mind, and I believe
>> other engineers will also need some extra information about it: what kind of
>> tests you're doing (direct I/O, buffered, pre-allocation), etc. You'll also
>> need to post filesystem configuration details like FS alignment (su, sw
>> options), etc.

One of my colleagues benchmarked this on one of our fast systems and another
colleague currently needs this system for other tests, so I don't have the
exact parameters. However, it was for sure formatted with options like these:

    mkfs.xfs -d su=256k,sw=10 -l version=2,su=256k -isize=512 /dev/sdX

and mounted with these options:

    mount -onoatime,nodiratime,largeio,inode64,swalloc,allocsize=131072k,nobarrier /dev/sdX <mountpoint>

>>
>> For different write patterns, you might also want to take a look at the
>> rotor_step procfs option, and some other options dedicated to streaming writes,
>> that might help you in this case.

Thanks, I didn't know that knob, I'm going to look into it. According to the
comments it's for inode32 only, but I need to read the xfs_alloc code first to
see what it actually does.

Thanks,
Bernd
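Editorial sketch: the geometry implied by the options above can be checked with a little arithmetic. The values come straight from the mkfs and mount lines; reading sw=10 as the ten data disks of the 12-disk RAID 6 (12 minus 2 parity) is an assumption, as are the helper names.

```python
def stripe_width_kib(su_kib, sw):
    """Full stripe width in KiB: stripe unit times number of data disks."""
    return su_kib * sw


def full_stripes(allocsize_kib, su_kib, sw):
    """Return (whole stripes, leftover KiB) covered by one allocsize chunk."""
    return divmod(allocsize_kib, stripe_width_kib(su_kib, sw))


if __name__ == "__main__":
    su_kib = 256             # from "-d su=256k"
    sw = 12 - 2              # assumed: 12 disks minus 2 parity, matches "sw=10"
    allocsize_kib = 131072   # from "allocsize=131072k", i.e. 128 MiB

    width = stripe_width_kib(su_kib, sw)
    stripes, leftover = full_stripes(allocsize_kib, su_kib, sw)
    print(f"stripe width: {width} KiB")
    print(f"allocsize: {allocsize_kib // 1024} MiB = "
          f"{stripes} full stripes + {leftover} KiB")
```

With these numbers the stripe width is 2560 KiB, and one 128 MiB allocsize chunk is 51 full stripes plus 512 KiB, so the speculative preallocation size is not a whole multiple of the stripe width.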
* Re: ag selection
From: Eric Sandeen @ 2013-11-11 18:30 UTC
To: Bernd Schubert, xfs

On 11/11/13, 12:23 PM, Bernd Schubert wrote:

> One of my colleagues benchmarked this on one of our fast systems and another
> colleague currently needs this system for other tests, so I don't have the
> exact parameters. However, it was for sure formatted with options like these:
>
>     mkfs.xfs -d su=256k,sw=10 -l version=2,su=256k -isize=512 /dev/sdX
>
> and mounted with these options:
>
>     mount -onoatime,nodiratime,largeio,inode64,swalloc,allocsize=131072k,nobarrier /dev/sdX <mountpoint>

With all due respect, this is excessive knob-twiddling. Slow down. ;)

* V2 logs are default already, so -l version=2 is redundant.
* noatime implies nodiratime, so specifying both is redundant.
* "largeio" only changes the st_blksize value reported (from default page
  size to, in your case, the total stripe width). Does that actually affect
  your application behavior?

Backing up, what kernel & what userspace versions are you testing?

-Eric
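Editorial sketch: whether "largeio" matters depends on whether the application actually looks at st_blksize. A tool that does might size its I/O buffer as below; the helper names are hypothetical and the copy loop is only an illustration of the pattern, not how any particular backup tool is implemented.

```python
import os


def preferred_io_size(path):
    """Return st_blksize, the preferred I/O size the filesystem reports."""
    return os.stat(path).st_blksize


def copy_file(src, dst):
    """Copy src to dst, using the st_blksize of src as the buffer size."""
    bufsize = preferred_io_size(src)
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(bufsize)
            if not buf:
                break
            fout.write(buf)
            copied += len(buf)
    return copied


if __name__ == "__main__":
    # Compare this value on the same mount with and without "largeio".
    print("st_blksize of current directory:", preferred_io_size("."))
```

An application that uses a fixed buffer size and never calls stat for this purpose would see no difference from the mount option.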
* Re: ag selection
From: Bernd Schubert @ 2013-11-11 19:42 UTC
To: Eric Sandeen, xfs

Hello Eric,

On 11/11/2013 07:30 PM, Eric Sandeen wrote:
> On 11/11/13, 12:23 PM, Bernd Schubert wrote:
>
>> One of my colleagues benchmarked this on one of our fast systems and another
>> colleague currently needs this system for other tests, so I don't have the
>> exact parameters. However, it was for sure formatted with options like these:
>>
>>     mkfs.xfs -d su=256k,sw=10 -l version=2,su=256k -isize=512 /dev/sdX
>>
>> and mounted with these options:
>>
>>     mount -onoatime,nodiratime,largeio,inode64,swalloc,allocsize=131072k,nobarrier /dev/sdX <mountpoint>
>
> With all due respect, this is excessive knob-twiddling. Slow down. ;)
>
> * V2 logs are default already, so -l version=2 is redundant.

Well, some of our customers are using fhgfs + xfs on old systems and we don't
want to create a list "with kernel/xfsprogs version < xyz, you need knob
abc...". So better add it by default, as long as it does not hurt.

> * noatime implies nodiratime, so specifying both is redundant.

Ok, I didn't add it to our default options, but I also didn't care about it.

> * "largeio" only changes the st_blksize value reported (from default page
>   size to, in your case, the total stripe width). Does that actually affect
>   your application behavior?

Fhgfs does not care about it, but for some backup tools (I think rsync
does/did? read it) it helps.

>
> Backing up, what kernel & what userspace versions are you testing?

xfsprogs-3.1.1-10.el6 + kernel 3.11.

Cheers,
Bernd
* Re: ag selection
From: Dave Chinner @ 2013-11-11 20:55 UTC
To: Bernd Schubert; +Cc: linux-xfs

On Mon, Nov 11, 2013 at 06:25:13PM +0100, Bernd Schubert wrote:
> Hi all,
>
> for streaming writes onto a raid6 the current round-robin ag
> selection does not seem to be optimal. Writing 4 files from 4
> threads into a single directory we get 900 MB/s,

IOWs, writing all 4 files into the same AG, interleaving them into
the same physical location on disk.

> writing 4 files in
> 4 different directories we only get 700 MB/s (12 disks with hw
> megaraid-sas).

And that writes the 4 files into 4 different AGs, separating them
into physically different regions of the disk. There are seeks between
the streams there, and often cheap RAID controllers have problems
with internal caching algorithms being unable to minimise seeks
between streams effectively.

> The current round-robin scheme seems to be optimized
> for linear raid0?

Not at all - sequential writes of large files are optimised to
maintain high sequential *read* rates of the data that is being
written. Also, RAID 0 and RAID 6 have exactly the same
characteristics for this workload, so the behaviour you are seeing
is more likely due to XFS writing to slower areas of the disks
when more streams are running in more AGs. i.e. 900MB/s might be
what you get at the outer edge of the disks, but you might only get
500MB/s at the inner edges. When writing into 4 AGs at once, they
are not all going to the outer edge, and hence you see a much truer
reflection of the speed of your storage than the single AG case.

Keep in mind the inode64 AG selection algorithm is optimised to
spread the allocation load out over the entire filesystem address
space via rotating the directory structure. It does this to increase
allocation parallelism and reduce filesystem hotspots and to improve
individual locality of disparate sets of data, and in general it is
significantly faster than any other AG selection algorithm that
anyone has managed to come up with.

> With small AGs one could also argue that choosing
> AGs which are not far away from each other (with respect to the number
> of blocks) also adds more parallel disk access for small and medium
> sized files.
>
> Any objections against a patch to improve the AG selection?

Define "improve".

I'm interested in hearing new ideas on how we might be able to make
different allocation decisions, but changing algorithms is not just
a matter of changing code. At minimum, changing the way allocation
is done will drastically change the aging characteristics of the
filesystem, and so what might work really well for empty filesystems
(like ext4's linear allocation algorithms) really hurts performance
as filesystems get older and free space gets less contiguous....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
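Editorial sketch: the placement behaviour described above (new directories rotated across AGs, files written into their parent directory's AG) can be illustrated with a toy model. This is a deliberate simplification and not the kernel's algorithm, which also weighs free space and other factors; the class and method names are hypothetical.

```python
from itertools import count


class ToyAllocator:
    """Toy model: directories rotate across AGs, files follow their parent."""

    def __init__(self, ag_count):
        self.ag_count = ag_count
        self._rotor = count()   # advances once per new directory
        self.dir_ag = {}        # directory name -> AG index

    def mkdir(self, name):
        """Place a new directory in the next AG, round-robin."""
        ag = next(self._rotor) % self.ag_count
        self.dir_ag[name] = ag
        return ag

    def create(self, directory):
        """Place a new file in the same AG as its parent directory."""
        return self.dir_ag[directory]


if __name__ == "__main__":
    # Case 1: four files in a single directory share one AG.
    single = ToyAllocator(ag_count=4)
    single.mkdir("shared")
    print("one directory :", [single.create("shared") for _ in range(4)])

    # Case 2: four files in four directories land in four different AGs.
    multi = ToyAllocator(ag_count=4)
    names = [f"dir{i}" for i in range(4)]
    for n in names:
        multi.mkdir(n)
    print("four directories:", [multi.create(n) for n in names])
```

In this model the single-directory case concentrates all four streams in one region, while the four-directory case spreads them over the address space, which is the situation the two throughput figures in the thread are being compared under.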
Thread overview: 7+ messages
2013-11-11 17:25 ag selection Bernd Schubert
2013-11-11 17:53 ` Carlos Maiolino
2013-11-11 17:55   ` Carlos Maiolino
2013-11-11 18:23     ` Bernd Schubert
2013-11-11 18:30       ` Eric Sandeen
2013-11-11 19:42         ` Bernd Schubert
2013-11-11 20:55 ` Dave Chinner