* Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  From: Dave Chinner @ 2013-07-08 12:44 UTC
  To: xfs; +Cc: linux-fsdevel

[cc fsdevel because after all the XFS stuff I did some testing on
mmotm w.r.t. per-node LRU lock contention avoidance, and also some
scalability tests against ext4 and btrfs for comparison on some new
hardware. That bit ain't pretty. ]

On Mon, Jul 01, 2013 at 03:44:36PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Note: This is an RFC right now - it'll need to be broken up into
> several patches for final submission.
>
> The CIL insertion during transaction commit currently does multiple
> passes across the transaction objects and requires multiple memory
> allocations per object that is to be inserted into the CIL. It is
> quite inefficient, and as such xfs_log_commit_cil() and its
> children show up quite highly in profiles under metadata
> modification intensive workloads.
>
> The current insertion tries to minimise the number of times the
> xc_cil_lock is grabbed and the hold times via a couple of methods:
>
> 1. an initial loop across the transaction items outside the
>    lock to allocate log vectors, buffers and copy the data into
>    them.
> 2. a second pass across the log vectors that then inserts
>    them into the CIL, modifies the CIL state and frees the old
>    vectors.
>
> This is somewhat inefficient. While it minimises lock grabs, the
> hold time is still quite high because we are freeing objects with
> the spinlock held and so the hold times are much higher than they
> need to be.
>
> Optimisations that can be made:
.....
>
> The result is that my standard fsmark benchmark (8-way, 50m files)
> on my standard test VM (8-way, 4GB RAM, 4xSSD in RAID0, 100TB fs)
> gives the following results with an xfs-oss tree. No CRCs:
>
>                  vanilla        patched        Difference
> create (time)    483s           435s           -10.0% (faster)
>        (rate)    109k+/-6k      122k+/-7k      +11.9% (faster)
> walk             339s           335s           (noise)
>      (sys cpu)   1134s          1135s          (noise)
> unlink           692s           645s            -6.8% (faster)
>
> So it's significantly faster than the current code, and lock_stat
> reports lower contention on the xc_cil_lock, too. So, big win here.
>
> With CRCs:
>
>                  vanilla        patched        Difference
> create (time)    510s           460s            -9.8% (faster)
>        (rate)    105k+/-5.4k    117k+/-5k      +11.4% (faster)
> walk             494s           486s           (noise)
>      (sys cpu)   1324s          1290s          (noise)
> unlink           959s           889s            -7.3% (faster)
>
> Gains are of the same order, with walk and unlink still affected by
> VFS LRU lock contention. IOWs, with these changes, filesystems with
> CRCs enabled will still be faster than the old non-CRC kernels...
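
The optimisation being described boils down to a standard pattern: do the
unlinking of the old log vectors while the xc_cil_lock is held, but push
the expensive freeing outside the lock. A minimal sketch of that pattern -
with made-up names, not the actual xfs_log_commit_cil() code:

	/*
	 * Illustrative only: shorten the spinlock hold time by splicing
	 * the old vectors onto a private list under the lock and freeing
	 * them after the lock is dropped.
	 */
	#include <linux/list.h>
	#include <linux/slab.h>
	#include <linux/spinlock.h>

	struct old_lv {
		struct list_head	list;
		/* copied log item data would live here */
	};

	static void cil_free_old_vectors(spinlock_t *cil_lock,
					 struct list_head *old_lvs)
	{
		LIST_HEAD(to_free);
		struct old_lv *lv, *next;

		spin_lock(cil_lock);
		/* short critical section: unlink only, no kfree() here */
		list_splice_init(old_lvs, &to_free);
		spin_unlock(cil_lock);

		/* the expensive part now runs without the lock held */
		list_for_each_entry_safe(lv, next, &to_free, list)
			kfree(lv);
	}

Whether the final patch does exactly this or folds the work into a single
insertion pass, the effect is the same: the xc_cil_lock hold time no
longer includes the memory freeing.
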
FWIW, I have new hardware here that I'll be using for benchmarking
like this, so here's a quick baseline comparison using the same
8p/4GB RAM VM (just migrated across), the same SSD based storage
(physically moved) and the same 100TB filesystem. The disks are
behind a faster RAID controller w/ 1GB of BBWC, so random read and
write IOPS are higher and hence traversal times will be lower due to
the reduced IO latency.

Create times
                     wall time(s)               rate (files/s)
              vanilla  patched   diff     vanilla    patched     diff
Old system      483      435   -10.0%    109k+-6k   122k+-7k   +11.9%
New system      378      342    -9.5%    143k+-9k   158k+-8k   +10.5%
diff          -21.7%   -21.4%            +31.2%     +29.5%

Walk times
                     wall time(s)
              vanilla  patched   diff
Old system      339      335   (noise)
New system      194      197   (noise)
diff          -42.7%   -41.2%

Unlink times
                     wall time(s)
              vanilla  patched   diff
Old system      692      645    -7.3%
New system      457      405   -11.4%
diff          -34.0%   -37.2%

So, overall, the new system is 20-40% faster than the old one on a
comparative test. But I have a few more cores and a lot more memory
to play with, so a 16-way test on the same machine with the VM
expanded to 16p/16GB RAM and 4 fake NUMA nodes follows:

New system, patched kernel:

Threads      create                walk      unlink
          time(s)   rate         time(s)    time(s)
 8           342    158k+-8k        197        405
16           222    266k+-32k       170        295
diff       -35.1%   +68.4%        -13.7%     -27.2%

Create rates are much more variable because the memory reclaim
behaviour appears to be very harsh, pulling 4-6 million inodes out of
memory every 10s or so and thrashing on the LRU locks, and then doing
nothing until another large step occurs. Walk rates improve, but not
by much, because of lock contention. I added 8 CPU cores to the
workload, and I'm burning at least 4 of those cores on the inode LRU
lock.

-  30.61%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 65.33% _raw_spin_lock
         + 88.19% inode_add_lru
         +  7.31% dentry_lru_del
         +  1.07% shrink_dentry_list
         +  0.90% dput
         +  0.83% inode_sb_list_add
         +  0.59% evict
      + 27.79% do_raw_spin_lock
      +  4.03% do_raw_spin_trylock
      +  2.85% _raw_spin_trylock

The current mmotm (and hence probably 3.11) has the new per-node LRU
code in it, so this variance and contention should go away very soon.
Unlinks go lots faster because they don't cause inode LRU lock
contention, but we are still a long way from linear scalability from
8- to 16-way.

FWIW, the mmotm kernel (which has a fair bit of debug enabled, so not
quite comparative) doesn't have any LRU lock contention to speak of.
For create:

-   7.81%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 70.98% _raw_spin_lock
         + 97.55% xfs_log_commit_cil
         +  0.93% __d_instantiate
         +  0.58% inode_sb_list_add
      - 29.02% do_raw_spin_lock
         - _raw_spin_lock
            + 41.14% xfs_log_commit_cil
            +  8.29% _xfs_buf_find
            +  8.00% xfs_iflush_cluster

And the walk:

-  26.37%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 49.10% _raw_spin_lock
         - 50.65% evict
              dispose_list
              prune_icache_sb
              super_cache_scan
            + shrink_slab
         - 26.99% list_lru_add
            + 89.01% inode_add_lru
            + 10.95% dput
         +  7.03% __remove_inode_hash
      - 40.65% do_raw_spin_lock
         - _raw_spin_lock
            - 41.96% evict
                 dispose_list
                 prune_icache_sb
                 super_cache_scan
               + shrink_slab
            - 13.55% list_lru_add
                 84.33% inode_add_lru
                    iput
                    d_kill
                    shrink_dentry_list
                    prune_dcache_sb
                    super_cache_scan
                    shrink_slab
                 15.01% dput
                  0.66% xfs_buf_rele
            + 10.10% __remove_inode_hash
                 system_call_fastpath

There's quite a different pattern of contention - it has moved inward
to evict, which implies the inode_sb_list_lock is the next obvious
point of contention. I have patches in the works for that. Also, the
inode_hash_lock is causing some contention, even though we fake inode
hashing. I have a patch to fix that for XFS as well.

I also note an interesting behaviour of the per-node inode LRUs - the
contention is coming from the dentry shrinker on one node freeing
inodes allocated on a different node during reclaim. There's scope
for improvement there.
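
That cross-node behaviour falls out of how the per-node LRU picks its
list. Roughly - this is a paraphrase of the mmotm list_lru add path from
memory, so treat the details as approximate - the node is chosen from the
memory the object lives in, not from the CPU doing the add:

	bool list_lru_add(struct list_lru *lru, struct list_head *item)
	{
		/* node of the object's memory, not of the running CPU */
		int nid = page_to_nid(virt_to_page(item));
		struct list_lru_node *nlru = &lru->node[nid];

		spin_lock(&nlru->lock);
		if (list_empty(item)) {
			list_add_tail(item, &nlru->list);
			nlru->nr_items++;
			spin_unlock(&nlru->lock);
			return true;
		}
		spin_unlock(&nlru->lock);
		return false;
	}

So a dentry shrinker running on node 0 that drops the last reference to
an inode whose memory was allocated on node 1 ends up taking node 1's LRU
lock, which is exactly the cross-node contention described above.
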
But here's the interesting part:

Kernel        create                walk      unlink
           time(s)   rate         time(s)    time(s)
3.10-cil      222     266k+-32k      170        295
mmotm         251     222k+-16k      128        356

Even with all the debug enabled, the overall walk time dropped by 25%
to 128s. So performance in this workload has substantially improved
because of the per-node LRUs, and variability is down as well, as
predicted. Once I add all the tweaks I have in the 3.10-cil tree to
mmotm, I expect significant improvements to create and unlink
performance as well...

So, let's look at ext4 vs btrfs vs XFS at 16-way (this is on the
3.10-cil kernel I've been testing XFS on):

            create                walk      unlink
           time(s)   rate       time(s)    time(s)
xfs           222    266k+-32k     170        295
ext4          978     54k+- 2k     325       2053
btrfs        1223     47k+- 8k     366      12000(*)

(*) Estimate based on a removal rate of 18.5 minutes for the first
4.8 million inodes.

Basically, neither btrfs nor ext4 have any concurrency scaling to
demonstrate, and unlinks on btrfs are just plain woeful.

ext4 create rate is limited by the extent cache LRU locking:

-  41.81%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 60.67% _raw_spin_lock
         - 99.60% ext4_es_lru_add
            + 99.63% ext4_es_lookup_extent
      - 39.15% do_raw_spin_lock
         - _raw_spin_lock
            + 95.38% ext4_es_lru_add
              0.51% insert_inode_locked
                 __ext4_new_inode
-  16.20%  [kernel]  [k] native_read_tsc
   - native_read_tsc
      - 60.91% delay_tsc
           __delay
           do_raw_spin_lock
         + _raw_spin_lock
      - 39.09% __delay
           do_raw_spin_lock
         + _raw_spin_lock

Ext4 unlink is serialised on orphan list processing:

-  12.67%  [kernel]  [k] __mutex_unlock_slowpath
   - __mutex_unlock_slowpath
      - 99.95% mutex_unlock
         + 54.37% ext4_orphan_del
         + 43.26% ext4_orphan_add
+   5.33%  [kernel]  [k] __mutex_lock_slowpath

btrfs create has tree lock problems:

-  21.68%  [kernel]  [k] __write_lock_failed
   - __write_lock_failed
      - 99.93% do_raw_write_lock
         - _raw_write_lock
            - 79.04% btrfs_try_tree_write_lock
               - btrfs_search_slot
                  - 97.48% btrfs_insert_empty_items
                       99.82% btrfs_new_inode
                  +  2.52% btrfs_lookup_inode
            - 20.37% btrfs_tree_lock
               - 99.38% btrfs_search_slot
                    99.92% btrfs_insert_empty_items
                 0.52% btrfs_lock_root_node
                    btrfs_search_slot
                    btrfs_insert_empty_items
-  21.24%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 61.22% prepare_to_wait
         + 61.52% btrfs_tree_lock
         + 32.31% btrfs_tree_read_lock
           6.17% reserve_metadata_bytes
              btrfs_block_rsv_add

btrfs walk phase hammers the inode_hash_lock:

-  18.45%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 47.38% _raw_spin_lock
         + 42.99% iget5_locked
         + 15.17% __remove_inode_hash
         + 13.77% btrfs_get_delayed_node
         + 11.27% inode_tree_add
         +  9.32% btrfs_destroy_inode
         .....
      - 46.77% do_raw_spin_lock
         - _raw_spin_lock
            + 30.51% iget5_locked
            + 11.40% __remove_inode_hash
            + 11.38% btrfs_get_delayed_node
            +  9.45% inode_tree_add
            +  7.28% btrfs_destroy_inode
            .....

I have an RCU inode hash lookup patch floating around somewhere if
someone wants it...

And, well, the less said about btrfs unlinks the better:

+  37.14%  [kernel]  [k] _raw_spin_unlock_irqrestore
+  33.18%  [kernel]  [k] __write_lock_failed
+  17.96%  [kernel]  [k] __read_lock_failed
+   1.35%  [kernel]  [k] _raw_spin_unlock_irq
+   0.82%  [kernel]  [k] __do_softirq
+   0.53%  [kernel]  [k] btrfs_tree_lock
+   0.41%  [kernel]  [k] btrfs_tree_read_lock
+   0.41%  [kernel]  [k] do_raw_read_lock
+   0.39%  [kernel]  [k] do_raw_write_lock
+   0.38%  [kernel]  [k] btrfs_clear_lock_blocking_rw
+   0.37%  [kernel]  [k] free_extent_buffer
+   0.36%  [kernel]  [k] btrfs_tree_read_unlock
+   0.32%  [kernel]  [k] do_raw_write_unlock

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  From: Jan Kara @ 2013-07-08 13:59 UTC
  To: Dave Chinner; +Cc: linux-fsdevel, xfs

On Mon 08-07-13 22:44:53, Dave Chinner wrote:
<snipped some nice XFS results ;)>
> So, let's look at ext4 vs btrfs vs XFS at 16-way (this is on the
> 3.10-cil kernel I've been testing XFS on):
>
>             create                walk      unlink
>            time(s)   rate       time(s)    time(s)
> xfs           222    266k+-32k     170        295
> ext4          978     54k+- 2k     325       2053
> btrfs        1223     47k+- 8k     366      12000(*)
>
> (*) Estimate based on a removal rate of 18.5 minutes for the first
> 4.8 million inodes.
>
> Basically, neither btrfs nor ext4 have any concurrency scaling to
> demonstrate, and unlinks on btrfs are just plain woeful.

Thanks for posting the numbers. There isn't anyone seriously testing
ext4 SMP scalability AFAIK, so it's not surprising it sucks.

> ext4 create rate is limited by the extent cache LRU locking:
>
> -  41.81%  [kernel]  [k] __ticket_spin_trylock
>    - __ticket_spin_trylock
>       - 60.67% _raw_spin_lock
>          - 99.60% ext4_es_lru_add
>             + 99.63% ext4_es_lookup_extent

At least this should improve with the patches in 3.11-rc1.

>       - 39.15% do_raw_spin_lock
>          - _raw_spin_lock
>             + 95.38% ext4_es_lru_add
>               0.51% insert_inode_locked
>                  __ext4_new_inode
> -  16.20%  [kernel]  [k] native_read_tsc
>    - native_read_tsc
>       - 60.91% delay_tsc
>            __delay
>            do_raw_spin_lock
>          + _raw_spin_lock
>       - 39.09% __delay
>            do_raw_spin_lock
>          + _raw_spin_lock
>
> Ext4 unlink is serialised on orphan list processing:
>
> -  12.67%  [kernel]  [k] __mutex_unlock_slowpath
>    - __mutex_unlock_slowpath
>       - 99.95% mutex_unlock
>          + 54.37% ext4_orphan_del
>          + 43.26% ext4_orphan_add
> +   5.33%  [kernel]  [k] __mutex_lock_slowpath

ext4 can do better here, I'm sure. The current solution is pretty
simplistic. At least we could use a spinlock for the in-memory orphan
list and atomic ops for the on-disk one (as it's only a singly linked
list). But I'm not sure I'll find time to look into this in the
foreseeable future...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
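
A rough sketch of the split being suggested - purely illustrative, not
ext4 code: the in-memory orphan tracking sits under its own spinlock, so
concurrent unlinks only contend for the short list manipulation rather
than the whole mutex-protected orphan update:

	#include <linux/list.h>
	#include <linux/spinlock.h>

	struct orphan_state {
		spinlock_t		lock;	/* protects the in-memory list only */
		struct list_head	head;	/* inodes with pending orphan cleanup */
	};

	static void orphan_add_inmem(struct orphan_state *os,
				     struct list_head *entry)
	{
		spin_lock(&os->lock);
		list_add_tail(entry, &os->head);
		spin_unlock(&os->lock);
	}

	static void orphan_del_inmem(struct orphan_state *os,
				     struct list_head *entry)
	{
		spin_lock(&os->lock);
		list_del_init(entry);
		spin_unlock(&os->lock);
	}

The on-disk list - the singly linked chain Jan mentions - would still
need its own, much shorter exclusion or an atomic update, which is the
harder part of the idea.
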

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  From: Marco Stornelli @ 2013-07-08 15:22 UTC
  To: Jan Kara; +Cc: Dave Chinner, xfs, linux-fsdevel

On 08/07/2013 15:59, Jan Kara wrote:
> On Mon 08-07-13 22:44:53, Dave Chinner wrote:
> <snipped some nice XFS results ;)>
>> So, let's look at ext4 vs btrfs vs XFS at 16-way (this is on the
>> 3.10-cil kernel I've been testing XFS on):
>>
>>             create                walk      unlink
>>            time(s)   rate       time(s)    time(s)
>> xfs           222    266k+-32k     170        295
>> ext4          978     54k+- 2k     325       2053
>> btrfs        1223     47k+- 8k     366      12000(*)
>>
>> (*) Estimate based on a removal rate of 18.5 minutes for the first
>> 4.8 million inodes.
>>
>> Basically, neither btrfs nor ext4 have any concurrency scaling to
>> demonstrate, and unlinks on btrfs are just plain woeful.
> Thanks for posting the numbers. There isn't anyone seriously testing
> ext4 SMP scalability AFAIK, so it's not surprising it sucks.

Funny - if I remember correctly, the Google guys switched Android from
yaffs2 to ext4 due to its superiority on SMP :)

Marco

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  From: Jan Kara @ 2013-07-08 15:38 UTC
  To: Marco Stornelli; +Cc: linux-fsdevel, Jan Kara, xfs

On Mon 08-07-13 17:22:43, Marco Stornelli wrote:
> On 08/07/2013 15:59, Jan Kara wrote:
> > On Mon 08-07-13 22:44:53, Dave Chinner wrote:
> > <snipped some nice XFS results ;)>
> >> So, let's look at ext4 vs btrfs vs XFS at 16-way (this is on the
> >> 3.10-cil kernel I've been testing XFS on):
> >>
> >>             create                walk      unlink
> >>            time(s)   rate       time(s)    time(s)
> >> xfs           222    266k+-32k     170        295
> >> ext4          978     54k+- 2k     325       2053
> >> btrfs        1223     47k+- 8k     366      12000(*)
> >>
> >> (*) Estimate based on a removal rate of 18.5 minutes for the first
> >> 4.8 million inodes.
> >>
> >> Basically, neither btrfs nor ext4 have any concurrency scaling to
> >> demonstrate, and unlinks on btrfs are just plain woeful.
> > Thanks for posting the numbers. There isn't anyone seriously testing
> > ext4 SMP scalability AFAIK, so it's not surprising it sucks.
>
> Funny - if I remember correctly, the Google guys switched Android from
> yaffs2 to ext4 due to its superiority on SMP :)

Well, there's SMP and SMP. Ext4 is perfectly OK for the desktop kind of
SMP - that's what lots of people use. When we speak of heavy IO load
with 16 CPUs on enterprise grade storage, so that CPU (and not IO)
bottlenecks are actually visible, that hardware is not so easily
available and so we don't have serious performance work in that
direction...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  From: Dave Chinner @ 2013-07-09 0:15 UTC
  To: Jan Kara; +Cc: Marco Stornelli, xfs, linux-fsdevel

On Mon, Jul 08, 2013 at 05:38:07PM +0200, Jan Kara wrote:
> On Mon 08-07-13 17:22:43, Marco Stornelli wrote:
> > On 08/07/2013 15:59, Jan Kara wrote:
> > > On Mon 08-07-13 22:44:53, Dave Chinner wrote:
> > > <snipped some nice XFS results ;)>
> > >> So, let's look at ext4 vs btrfs vs XFS at 16-way (this is on the
> > >> 3.10-cil kernel I've been testing XFS on):
> > >>
> > >>             create                walk      unlink
> > >>            time(s)   rate       time(s)    time(s)
> > >> xfs           222    266k+-32k     170        295
> > >> ext4          978     54k+- 2k     325       2053
> > >> btrfs        1223     47k+- 8k     366      12000(*)
> > >>
> > >> (*) Estimate based on a removal rate of 18.5 minutes for the first
> > >> 4.8 million inodes.
> > >>
> > >> Basically, neither btrfs nor ext4 have any concurrency scaling to
> > >> demonstrate, and unlinks on btrfs are just plain woeful.
> > > Thanks for posting the numbers. There isn't anyone seriously testing
> > > ext4 SMP scalability AFAIK, so it's not surprising it sucks.

It's worse than that - nobody picked up on review that taking a
global lock on every extent lookup might be a scalability issue?

Scalability is not an afterthought anymore - new filesystem and
kernel features need to be designed from the ground up with this in
mind. We're living in a world where even phones have 4 CPU cores....

> > Funny - if I remember correctly, the Google guys switched Android from
> > yaffs2 to ext4 due to its superiority on SMP :)
> Well, there's SMP and SMP. Ext4 is perfectly OK for the desktop kind of
> SMP -

Barely. It tops out in parallelism at between 2-4 threads depending
on the metadata operations being done.

> that's what lots of people use. When we speak of heavy IO load with
> 16 CPUs on enterprise grade storage, so that CPU (and not IO)
> bottlenecks are actually visible, that hardware is not so easily
> available and so we don't have serious performance work in that
> direction...

I'm not testing with "enterprise grade" storage. The filesystem I'm
testing on is hosted on less than $300 of SSDs. The "enterprise" RAID
controller they sit behind is actually an IOPS bottleneck, not an
improvement.

My 2.5 year old desktop has a pair of cheap, no-name SandForce SSDs
in RAID0 and they can do at least 2x the read and write IOPS of the
new hardware I just tested. And yes, I run XFS on my desktop. And
then there's my 3 month old laptop, which has a recent SATA SSD in
it. It also has 8 threads, but twice the memory and about 1.5x the
IOPS and bandwidth of my desktop machine.

The bottlenecks showing up in ext4 and btrfs don't magically show up
at 16 threads - they are present and reproducible at 2-4 threads.
Indeed, I didn't bother testing at 32 threads - even though my new
server can do that - because that will just hammer the same
bottlenecks even harder. Fundamentally, I'm not testing anything you
can't test on a $2000 desktop PC....

FWIW, the SSDs are making ext4 and btrfs look good in these
workloads. XFS is creating >250k files/s doing about 1500 IOPS. ext4
is making 50k files/s at 23,000 IOPS. btrfs has peaks every 30s of
over 30,000 IOPS. Which filesystem is going to scale better on
desktops with spinning rust?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  From: Theodore Ts'o @ 2013-07-09 0:56 UTC
  To: Marco Stornelli; +Cc: Jan Kara, Dave Chinner, xfs, linux-fsdevel

On Mon, Jul 08, 2013 at 05:22:43PM +0200, Marco Stornelli wrote:
>
> Funny - if I remember correctly, the Google guys switched Android from
> yaffs2 to ext4 due to its superiority on SMP :)

The bigger reason was that raw NAND flash doesn't really make sense
any more; as the feature size of flash cells has shrunk, and with the
introduction of MLC and TLC, you really need to use hardware assist
to make flash sufficiently reliable. Modern flash storage uses
dynamic adjustment of voltage levels as the flash cells age, and
error correcting codes to compensate for flash reliability
challenges. This means accessing flash using eMMC, SATA, SAS, etc.,
and that rules out YAFFS2.

					- Ted

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  From: Zheng Liu @ 2013-07-09 0:43 UTC
  To: Dave Chinner; +Cc: xfs, linux-fsdevel

Hi Dave,

On Mon, Jul 08, 2013 at 10:44:53PM +1000, Dave Chinner wrote:
[...]
> So, let's look at ext4 vs btrfs vs XFS at 16-way (this is on the
> 3.10-cil kernel I've been testing XFS on):
>
>             create                walk      unlink
>            time(s)   rate       time(s)    time(s)
> xfs           222    266k+-32k     170        295
> ext4          978     54k+- 2k     325       2053
> btrfs        1223     47k+- 8k     366      12000(*)
>
> (*) Estimate based on a removal rate of 18.5 minutes for the first
> 4.8 million inodes.
>
> Basically, neither btrfs nor ext4 have any concurrency scaling to
> demonstrate, and unlinks on btrfs are just plain woeful.
>
> ext4 create rate is limited by the extent cache LRU locking:

I have a patch to fix this problem, and it has been applied for
3.11-rc1. The patch is (d3922a77):

    ext4: improve extent cache shrink mechanism to avoid to burn CPU time

I would really appreciate it if you could try your testing again
against this patch. I just want to make sure that this problem has
been fixed. At least in my own testing it looks fine.

Thanks,
						- Zheng

>
> -  41.81%  [kernel]  [k] __ticket_spin_trylock
>    - __ticket_spin_trylock
>       - 60.67% _raw_spin_lock
>          - 99.60% ext4_es_lru_add
>             + 99.63% ext4_es_lookup_extent
>       - 39.15% do_raw_spin_lock
>          - _raw_spin_lock
>             + 95.38% ext4_es_lru_add
>               0.51% insert_inode_locked
>                  __ext4_new_inode
> -  16.20%  [kernel]  [k] native_read_tsc
>    - native_read_tsc
>       - 60.91% delay_tsc
>            __delay
>            do_raw_spin_lock
>          + _raw_spin_lock
>       - 39.09% __delay
>            do_raw_spin_lock
>          + _raw_spin_lock
>
> Ext4 unlink is serialised on orphan list processing:
>
> -  12.67%  [kernel]  [k] __mutex_unlock_slowpath
>    - __mutex_unlock_slowpath
>       - 99.95% mutex_unlock
>          + 54.37% ext4_orphan_del
>          + 43.26% ext4_orphan_add
> +   5.33%  [kernel]  [k] __mutex_lock_slowpath
>
> btrfs create has tree lock problems:
>
> -  21.68%  [kernel]  [k] __write_lock_failed
>    - __write_lock_failed
>       - 99.93% do_raw_write_lock
>          - _raw_write_lock
>             - 79.04% btrfs_try_tree_write_lock
>                - btrfs_search_slot
>                   - 97.48% btrfs_insert_empty_items
>                        99.82% btrfs_new_inode
>                   +  2.52% btrfs_lookup_inode
>             - 20.37% btrfs_tree_lock
>                - 99.38% btrfs_search_slot
>                     99.92% btrfs_insert_empty_items
>                  0.52% btrfs_lock_root_node
>                     btrfs_search_slot
>                     btrfs_insert_empty_items
> -  21.24%  [kernel]  [k] _raw_spin_unlock_irqrestore
>    - _raw_spin_unlock_irqrestore
>       - 61.22% prepare_to_wait
>          + 61.52% btrfs_tree_lock
>          + 32.31% btrfs_tree_read_lock
>            6.17% reserve_metadata_bytes
>               btrfs_block_rsv_add
>
> btrfs walk phase hammers the inode_hash_lock:
>
> -  18.45%  [kernel]  [k] __ticket_spin_trylock
>    - __ticket_spin_trylock
>       - 47.38% _raw_spin_lock
>          + 42.99% iget5_locked
>          + 15.17% __remove_inode_hash
>          + 13.77% btrfs_get_delayed_node
>          + 11.27% inode_tree_add
>          +  9.32% btrfs_destroy_inode
>          .....
>       - 46.77% do_raw_spin_lock
>          - _raw_spin_lock
>             + 30.51% iget5_locked
>             + 11.40% __remove_inode_hash
>             + 11.38% btrfs_get_delayed_node
>             +  9.45% inode_tree_add
>             +  7.28% btrfs_destroy_inode
>             .....
>
> I have an RCU inode hash lookup patch floating around somewhere if
> someone wants it...
>
> And, well, the less said about btrfs unlinks the better:
>
> +  37.14%  [kernel]  [k] _raw_spin_unlock_irqrestore
> +  33.18%  [kernel]  [k] __write_lock_failed
> +  17.96%  [kernel]  [k] __read_lock_failed
> +   1.35%  [kernel]  [k] _raw_spin_unlock_irq
> +   0.82%  [kernel]  [k] __do_softirq
> +   0.53%  [kernel]  [k] btrfs_tree_lock
> +   0.41%  [kernel]  [k] btrfs_tree_read_lock
> +   0.41%  [kernel]  [k] do_raw_read_lock
> +   0.39%  [kernel]  [k] do_raw_write_lock
> +   0.38%  [kernel]  [k] btrfs_clear_lock_blocking_rw
> +   0.37%  [kernel]  [k] free_extent_buffer
> +   0.36%  [kernel]  [k] btrfs_tree_read_unlock
> +   0.32%  [kernel]  [k] do_raw_write_unlock
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  From: Dave Chinner @ 2013-07-09 1:23 UTC
  To: xfs, linux-fsdevel

On Tue, Jul 09, 2013 at 08:43:32AM +0800, Zheng Liu wrote:
> Hi Dave,
>
> On Mon, Jul 08, 2013 at 10:44:53PM +1000, Dave Chinner wrote:
> [...]
> > So, let's look at ext4 vs btrfs vs XFS at 16-way (this is on the
> > 3.10-cil kernel I've been testing XFS on):
> >
> >             create                walk      unlink
> >            time(s)   rate       time(s)    time(s)
> > xfs           222    266k+-32k     170        295
> > ext4          978     54k+- 2k     325       2053
> > btrfs        1223     47k+- 8k     366      12000(*)
> >
> > (*) Estimate based on a removal rate of 18.5 minutes for the first
> > 4.8 million inodes.
> >
> > Basically, neither btrfs nor ext4 have any concurrency scaling to
> > demonstrate, and unlinks on btrfs are just plain woeful.
> >
> > ext4 create rate is limited by the extent cache LRU locking:
>
> I have a patch to fix this problem, and it has been applied for
> 3.11-rc1. The patch is (d3922a77):
>
>     ext4: improve extent cache shrink mechanism to avoid to burn CPU time
>
> I would really appreciate it if you could try your testing again
> against this patch. I just want to make sure that this problem has
> been fixed. At least in my own testing it looks fine.

I'll redo them when 3.11-rc1 comes around. I'll let you know how much
better it is, and where the next ring of the onion lies.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  From: Chris Mason @ 2013-07-09 1:15 UTC
  To: Dave Chinner, xfs; +Cc: linux-fsdevel

Quoting Dave Chinner (2013-07-08 08:44:53)
> [cc fsdevel because after all the XFS stuff I did some testing on
> mmotm w.r.t. per-node LRU lock contention avoidance, and also some
> scalability tests against ext4 and btrfs for comparison on some new
> hardware. That bit ain't pretty. ]
>
> And, well, the less said about btrfs unlinks the better:
>
> +  37.14%  [kernel]  [k] _raw_spin_unlock_irqrestore
> +  33.18%  [kernel]  [k] __write_lock_failed
> +  17.96%  [kernel]  [k] __read_lock_failed
> +   1.35%  [kernel]  [k] _raw_spin_unlock_irq
> +   0.82%  [kernel]  [k] __do_softirq
> +   0.53%  [kernel]  [k] btrfs_tree_lock
> +   0.41%  [kernel]  [k] btrfs_tree_read_lock
> +   0.41%  [kernel]  [k] do_raw_read_lock
> +   0.39%  [kernel]  [k] do_raw_write_lock
> +   0.38%  [kernel]  [k] btrfs_clear_lock_blocking_rw
> +   0.37%  [kernel]  [k] free_extent_buffer
> +   0.36%  [kernel]  [k] btrfs_tree_read_unlock
> +   0.32%  [kernel]  [k] do_raw_write_unlock

Hi Dave,

Thanks for doing these runs. At least on Btrfs, the best way to
resolve the tree locking today is to break things up into more
subvolumes. I've got another run at the root lock contention in the
queue after I get the skiplists in place in a few other parts of the
Btrfs code.

-chris

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  From: Dave Chinner @ 2013-07-09 1:26 UTC
  To: Chris Mason; +Cc: linux-fsdevel, xfs

On Mon, Jul 08, 2013 at 09:15:33PM -0400, Chris Mason wrote:
> Quoting Dave Chinner (2013-07-08 08:44:53)
> > [cc fsdevel because after all the XFS stuff I did some testing on
> > mmotm w.r.t. per-node LRU lock contention avoidance, and also some
> > scalability tests against ext4 and btrfs for comparison on some new
> > hardware. That bit ain't pretty. ]
> >
> > And, well, the less said about btrfs unlinks the better:
> >
> > +  37.14%  [kernel]  [k] _raw_spin_unlock_irqrestore
> > +  33.18%  [kernel]  [k] __write_lock_failed
> > +  17.96%  [kernel]  [k] __read_lock_failed
> > +   1.35%  [kernel]  [k] _raw_spin_unlock_irq
> > +   0.82%  [kernel]  [k] __do_softirq
> > +   0.53%  [kernel]  [k] btrfs_tree_lock
> > +   0.41%  [kernel]  [k] btrfs_tree_read_lock
> > +   0.41%  [kernel]  [k] do_raw_read_lock
> > +   0.39%  [kernel]  [k] do_raw_write_lock
> > +   0.38%  [kernel]  [k] btrfs_clear_lock_blocking_rw
> > +   0.37%  [kernel]  [k] free_extent_buffer
> > +   0.36%  [kernel]  [k] btrfs_tree_read_unlock
> > +   0.32%  [kernel]  [k] do_raw_write_unlock
>
> Hi Dave,
>
> Thanks for doing these runs. At least on Btrfs, the best way to
> resolve the tree locking today is to break things up into more
> subvolumes.

Sure, but you can't do that for most workloads. Only on specialised
workloads (e.g. hashed directory tree based object stores) is this
really a viable option....

> I've got another run at the root lock contention in the queue after
> I get the skiplists in place in a few other parts of the Btrfs code.

It will be interesting to see how these new structures play out ;)

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

* Re: [BULK] Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  From: Chris Mason @ 2013-07-09 1:54 UTC
  To: Dave Chinner; +Cc: xfs, linux-fsdevel

Quoting Dave Chinner (2013-07-08 21:26:14)
> On Mon, Jul 08, 2013 at 09:15:33PM -0400, Chris Mason wrote:
> > Quoting Dave Chinner (2013-07-08 08:44:53)
> > > [cc fsdevel because after all the XFS stuff I did some testing on
> > > mmotm w.r.t. per-node LRU lock contention avoidance, and also some
> > > scalability tests against ext4 and btrfs for comparison on some new
> > > hardware. That bit ain't pretty. ]
> > >
> > > And, well, the less said about btrfs unlinks the better:
> > >
> > > +  37.14%  [kernel]  [k] _raw_spin_unlock_irqrestore
> > > +  33.18%  [kernel]  [k] __write_lock_failed
> > > +  17.96%  [kernel]  [k] __read_lock_failed
> > > +   1.35%  [kernel]  [k] _raw_spin_unlock_irq
> > > +   0.82%  [kernel]  [k] __do_softirq
> > > +   0.53%  [kernel]  [k] btrfs_tree_lock
> > > +   0.41%  [kernel]  [k] btrfs_tree_read_lock
> > > +   0.41%  [kernel]  [k] do_raw_read_lock
> > > +   0.39%  [kernel]  [k] do_raw_write_lock
> > > +   0.38%  [kernel]  [k] btrfs_clear_lock_blocking_rw
> > > +   0.37%  [kernel]  [k] free_extent_buffer
> > > +   0.36%  [kernel]  [k] btrfs_tree_read_unlock
> > > +   0.32%  [kernel]  [k] do_raw_write_unlock
> >
> > Hi Dave,
> >
> > Thanks for doing these runs. At least on Btrfs, the best way to
> > resolve the tree locking today is to break things up into more
> > subvolumes.
>
> Sure, but you can't do that for most workloads. Only on specialised
> workloads (e.g. hashed directory tree based object stores) is this
> really a viable option....

Yes and no. It makes a huge difference even when you have 8 procs
all working on the same 8 subvolumes. It's not perfect but it's all
I have ;)

> > I've got another run at the root lock contention in the queue after
> > I get the skiplists in place in a few other parts of the Btrfs code.
>
> It will be interesting to see how these new structures play out ;)

The skiplists don't translate well to the tree roots, so I'll
probably have to do something different there. But I'll get the
onion peeled one way or another.

-chris

* Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
  From: Dave Chinner @ 2013-07-09 8:26 UTC
  To: xfs; +Cc: linux-fsdevel

On Mon, Jul 08, 2013 at 10:44:53PM +1000, Dave Chinner wrote:
> [cc fsdevel because after all the XFS stuff I did some testing on
> mmotm w.r.t. per-node LRU lock contention avoidance, and also some
> scalability tests against ext4 and btrfs for comparison on some new
> hardware. That bit ain't pretty. ]

A quick follow-up on mmotm:

> FWIW, the mmotm kernel (which has a fair bit of debug enabled, so not
> quite comparative) doesn't have any LRU lock contention to speak of.
> For create:
>
> -   7.81%  [kernel]  [k] __ticket_spin_trylock
>    - __ticket_spin_trylock
>       - 70.98% _raw_spin_lock
>          + 97.55% xfs_log_commit_cil
>          +  0.93% __d_instantiate
>          +  0.58% inode_sb_list_add
>       - 29.02% do_raw_spin_lock
>          - _raw_spin_lock
>             + 41.14% xfs_log_commit_cil
>             +  8.29% _xfs_buf_find
>             +  8.00% xfs_iflush_cluster

So I just ported all my prototype sync and inode_sb_list_lock changes
across to mmotm, as well as the XFS CIL optimisations.

-   2.33%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 70.14% do_raw_spin_lock
         - _raw_spin_lock
            + 16.91% _xfs_buf_find
            + 15.20% list_lru_add
            + 12.83% xfs_log_commit_cil
            + 11.18% d_alloc
            +  7.43% dput
            +  4.56% __d_instantiate
            ....

Most of the spinlock contention has gone away.

> And the walk:
>
> -  26.37%  [kernel]  [k] __ticket_spin_trylock
>    - __ticket_spin_trylock
>       - 49.10% _raw_spin_lock
>          - 50.65% evict
...
>          - 26.99% list_lru_add
>             + 89.01% inode_add_lru
>             + 10.95% dput
>          +  7.03% __remove_inode_hash
>       - 40.65% do_raw_spin_lock
>          - _raw_spin_lock
>             - 41.96% evict
....
>             - 13.55% list_lru_add
>                  84.33% inode_add_lru
....
>             + 10.10% __remove_inode_hash
>                  system_call_fastpath

-  15.44%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 46.59% _raw_spin_lock
         + 69.40% list_lru_add
           17.65% list_lru_del
            5.70% list_lru_count_node
            2.44% shrink_dentry_list
               prune_dcache_sb
               super_cache_scan
               shrink_slab
            0.86% __page_check_address
      - 33.06% do_raw_spin_lock
         - _raw_spin_lock
            + 36.96% list_lru_add
            + 11.98% list_lru_del
            +  6.68% shrink_dentry_list
            +  6.43% d_alloc
            +  4.79% _xfs_buf_find
            .....
      + 11.48% do_raw_spin_trylock
      +  8.87% _raw_spin_trylock

So now we see that the CPU wasted on contention is down by 40%.
Observation shows that most of the list_lru_add/list_lru_del
contention occurs when reclaim is running - before memory filled up
the lookup rate was on the high side of 600,000 inodes/s, but it fell
back to about 425,000/s once reclaim started working.

> There's quite a different pattern of contention - it has moved inward
> to evict, which implies the inode_sb_list_lock is the next obvious
> point of contention. I have patches in the works for that. Also, the
> inode_hash_lock is causing some contention, even though we fake inode
> hashing. I have a patch to fix that for XFS as well.
>
> I also note an interesting behaviour of the per-node inode LRUs - the
> contention is coming from the dentry shrinker on one node freeing
> inodes allocated on a different node during reclaim. There's scope
> for improvement there.
>
> But here's the interesting part:
>
> Kernel        create                walk      unlink
>            time(s)   rate         time(s)    time(s)
> 3.10-cil      222     266k+-32k      170        295
> mmotm         251     222k+-16k      128        356
  mmotm-cil     225     258k+-26k      122        296

So even with all the debug on, the mmotm kernel with most of the mods
I was running in 3.10-cil, plus the s_inodes -> list_lru conversion,
gets the same throughput for create and unlink and has much better
walk times.

> Even with all the debug enabled, the overall walk time dropped by 25%
> to 128s. So performance in this workload has substantially improved
> because of the per-node LRUs, and variability is down as well, as
> predicted. Once I add all the tweaks I have in the 3.10-cil tree to
> mmotm, I expect significant improvements to create and unlink
> performance as well...
>
> So, let's look at ext4 vs btrfs vs XFS at 16-way (this is on the
> 3.10-cil kernel I've been testing XFS on):
>
>             create                walk      unlink
>            time(s)   rate       time(s)    time(s)
> xfs           222    266k+-32k     170        295
> ext4          978     54k+- 2k     325       2053
> btrfs        1223     47k+- 8k     366      12000(*)
>
> (*) Estimate based on a removal rate of 18.5 minutes for the first
> 4.8 million inodes.

So, let's run these again on my current mmotm tree - it has the ext4
extent tree fixes in it and my RCU inode hash lookup patch...

            create                walk      unlink
           time(s)   rate       time(s)    time(s)
xfs           225    258k+-26k     122        296
ext4          456    118k+- 4k     128       1632
btrfs        1122     51k+- 3k     281       3200(*)

(*) about 4.7 million inodes removed in 5 minutes.

ext4 is a lot healthier: create speed doubles from the extent cache
lock contention fixes, and the walk time halves due to the RCU inode
cache lookup. That said, it is still burning a huge amount of CPU on
the inode_hash_lock adding and removing inodes. Unlink perf is a bit
faster, but still slow. So, yeah, things will get better in the
not-too-distant future...

And for btrfs? Well, create is a tiny bit faster, the walk is 20%
faster thanks to the RCU hash lookups, and unlinks are markedly
faster (3x). Still not fast enough for me to hang around waiting for
them to complete, though.

FWIW, while the results are a lot better for ext4, let me just point
out how hard it is driving the storage to get that performance:

 load     |    create   |     walk     |         unlink
 IO type  |    write    |     read     |    read    |    write
          |  IOPS   BW  |   IOPS   BW  |  IOPS   BW |  IOPS    BW
 ---------+-------------+--------------+------------+-------------
 xfs      |   900   200 |  18000  140  |  7500   50 |   400     50
 ext4     | 23000   390 |  55000  200  |  2000   10 | 13000    160
 btrfs(*) | peaky    75 |  26000  100  | decay   10 | peaky  peaky

ext4 is hammering the SSDs far harder than XFS, both in terms of IOPS
and bandwidth. You do not want to run ext4 on your SSD if you have a
metadata intensive workload, as it will age the SSD much, much faster
than XFS with that sort of write behaviour.

(*) The btrfs create IO pattern is 5s peaks of write IOPS every 30s.
The baseline is about 500 IOPS, but the peaks reach upwards of 30,000
write IOPS. Unlink does this as well. There are also short bursts of
2-3,000 read IOPS just before the write IOPS bursts in the create
workload. For the unlink, it starts off at about 10,000 read IOPS and
decays exponentially down to about 2,000 read IOPS over 90s. Then it
hits some trigger and the cycle starts again. The trigger appears to
coincide with 1-2 million dentries being reclaimed.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
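
For reference, the RCU inode hash lookup change mentioned a couple of
times in this thread amounts to making the hash-chain walk lock-free on
the read side. A simplified sketch of what such a lookup fast path looks
like - illustrative only, not the actual patch:

	#include <linux/fs.h>
	#include <linux/rculist.h>

	static struct inode *find_inode_rcu(struct super_block *sb,
					    struct hlist_head *head,
					    unsigned long ino)
	{
		struct inode *inode;

		rcu_read_lock();
		hlist_for_each_entry_rcu(inode, head, i_hash) {
			if (inode->i_ino == ino && inode->i_sb == sb) {
				/*
				 * A real implementation still needs to take
				 * inode->i_lock here to check for I_FREEING
				 * and grab a reference before returning.
				 */
				rcu_read_unlock();
				return inode;
			}
		}
		rcu_read_unlock();
		return NULL;
	}

The win for the walk phase is that iget-style lookups no longer touch the
inode_hash_lock at all; hash insert and remove still take it, which is
why ext4 is described above as still burning CPU there.
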