* Debunking myths about metadata CRC overhead
@ 2013-06-03 7:44 Dave Chinner
2013-06-03 9:10 ` Emmanuel Florac
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Dave Chinner @ 2013-06-03 7:44 UTC (permalink / raw)
To: xfs
Hi folks,
There have been some assertions made recently that metadata CRCs have
too much overhead to always be enabled. So I'll run some quick
benchmarks to demonstrate that the "too much overhead" assertions are
completely unfounded.
These are some numbers from my usual performance test VM. Note that
as this is a VM, it's not using the hardware CRC instructions, so
I'm benchmarking the worst-case overhead here, i.e. the kernel's
software CRC32c algorithm.
The VM is 8p, 8GB RAM, 4 node fake-numa config with a 100TB XFS
filesystem being used for testing. The fs is backed by 4x64GB SSDs
sliced via LVM into a 160GB RAID0 device with an XFS filesystem on
it to host the sparse 100TB image file. KVM is using
virtio,cache=none to use direct IO to write to the image file, and
the host is running a 3.8.5 kernel.
Baseline CRC32c performance
---------------------------
The VM runs the xfsprogs selftest program in:
crc32c: tests passed, 225944 bytes in 212 usec
so it can calculate CRCs at roughly 1GB/s on small, random chunks of
data through the software algorithm according to this. Given the
fsmark create workload only drives around 100MB/s of metadata and
journal IO, the minimum CRC32c overhead we should see on a load
spread across 8 CPUs is roughly:
100MB/s / 1000MB/s / 8p * 100% = 1.25% per CPU
So, in a perfect world, that's what we should see from the kernel
profiles. It's not a perfect world, though, so it will never be
this low (4 cores all trying to use the same memory bus at the same
time, perhaps?), so if we get anywhere near that number I'd be very
happy.
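That back-of-envelope estimate is easy to check with the numbers quoted
above; a quick sketch (pure arithmetic, nothing new assumed):

```python
# Throughput implied by the xfsprogs crc32c selftest output:
# "225944 bytes in 212 usec" -- bytes/usec is numerically MB/s.
rate_mb_s = 225944 / 212
print(f"measured software CRC32c rate: ~{rate_mb_s:.0f} MB/s")  # ~1066 MB/s

# Ideal per-CPU CRC overhead for ~100MB/s of metadata and journal IO
# spread across 8 CPUs, using the rounded 1000 MB/s figure from the text:
overhead_pct = 100 / 1000 / 8 * 100
print(f"ideal per-CPU CRC overhead: {overhead_pct}%")  # 1.25%
```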
Note that a hardware implementation should be faster than the SSE
optimised RAID5/6 calculations on the CPU, which come in at:
[ 0.548004] raid6: sse2x4 7221 MB/s
which is a *lot* faster. So it's probably reasonable to assume
similar throughput for a hardware CRC32c implementation. Hence Intel
servers will have substantially lower CRC overhead than the software
CRC32c implementation being measured here.
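For reference, the algorithm being measured is CRC32c, i.e. the
Castagnoli polynomial (reflected form 0x82F63B78) that the hardware
crc32 instruction also implements. A minimal single-table sketch of the
software approach (not the kernel's faster slice-by-n implementation,
just an illustration of the same polynomial and check value):

```python
POLY = 0x82F63B78  # reflected CRC-32C (Castagnoli) polynomial

# Build the byte-at-a-time lookup table once.
table = []
for i in range(256):
    crc = i
    for _ in range(8):
        crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
    table.append(crc)

def crc32c(data: bytes) -> int:
    """Table-driven CRC-32C: init/xorout 0xFFFFFFFF, reflected."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ table[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

print(hex(crc32c(b"123456789")))  # standard CRC-32C check value: 0xe3069283
```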
fs_mark workload
----------------
$ sudo mkfs.xfs -f -m crc=1 -l size=512m,sunit=8 /dev/vdc
vs
$ sudo mkfs.xfs -f -l size=512m,sunit=8 /dev/vdc
8-way create of 50 million zero-length files, 8-way
find+stat of all the files, 8-way unlink of all the files:
                no CRCs       CRCs           Difference
create (time)   483s          510s           +5.2%  (slower)
       (rate)   109k+/-6k     105k+/-5.4k    -3.8%  (slower)
walk            339s          494s           -30.3% (slower)
     (sys cpu)  1134s         1324s          +14.4% (slower)
unlink          692s          959s           -27.8%(*) (slower)
(*) All the slowdown here is from the traversal slowdown as seen in
the walk phase. i.e. not related to the unlink operations.
On the surface, it looks like there's a huge impact on the walk and
unlink phases from CRC calculations, but these numbers don't tell
the whole story. Let's look deeper:
Create phase top CPU users (>1% total):
5.59% [kernel] [k] _xfs_buf_find
5.52% [kernel] [k] xfs_dir2_node_addname
4.58% [kernel] [k] memcpy
3.28% [kernel] [k] xfs_dir3_free_hdr_from_disk
3.05% [kernel] [k] __ticket_spin_trylock
2.94% [kernel] [k] __slab_alloc
1.96% [kernel] [k] xfs_log_commit_cil
1.93% [kernel] [k] __slab_free
1.90% [kernel] [k] kmem_cache_alloc
1.72% [kernel] [k] xfs_next_bit
1.65% [kernel] [k] __crc32c_le
1.52% [kernel] [k] _raw_spin_unlock_irqrestore
1.50% [kernel] [k] do_raw_spin_lock
1.42% [kernel] [k] kmem_cache_free
1.32% [kernel] [k] native_read_tsc
1.28% [kernel] [k] __kmalloc
1.17% [kernel] [k] xfs_buf_offset
1.14% [kernel] [k] delay_tsc
1.14% [kernel] [k] kfree
1.10% [kernel] [k] xfs_buf_item_format
1.06% [kernel] [k] xfs_btree_lookup
CRC overhead is at 1.65%, not much higher than the optimal 1.25%
overhead calculated above. So the overhead really isn't that
significant - it's far less overhead than, say, the 1.2 million
buffer lookups a second we are doing (_xfs_buf_find overhead) in
this workload...
Walk phase top CPU users:
6.64% [kernel] [k] __ticket_spin_trylock
6.05% [kernel] [k] _xfs_buf_find
5.58% [kernel] [k] _raw_spin_unlock_irqrestore
4.88% [kernel] [k] _raw_spin_unlock_irq
3.30% [kernel] [k] native_read_tsc
2.93% [kernel] [k] __crc32c_le
2.87% [kernel] [k] delay_tsc
2.32% [kernel] [k] do_raw_spin_lock
1.98% [kernel] [k] blk_flush_plug_list
1.79% [kernel] [k] __slab_alloc
1.76% [kernel] [k] __d_lookup_rcu
1.56% [kernel] [k] kmem_cache_alloc
1.25% [kernel] [k] kmem_cache_free
1.25% [kernel] [k] xfs_da_read_buf
1.11% [kernel] [k] xfs_dir2_leaf_search_hash
1.08% [kernel] [k] flat_send_IPI_mask
1.02% [kernel] [k] radix_tree_lookup_element
1.00% [kernel] [k] do_raw_spin_unlock
There's more CRC32c overhead indicating lower efficiency, but
there's an obvious cause for that - the CRC overhead is dwarfed by
something else new: lock contention. A quick 30s call graph profile
during the middle of the walk phase shows:
- 12.74% [kernel] [k] __ticket_spin_trylock
- __ticket_spin_trylock
- 60.49% _raw_spin_lock
+ 91.79% inode_add_lru >>> inode_lru_lock
+ 2.98% dentry_lru_del >>> dcache_lru_lock
+ 1.30% shrink_dentry_list
+ 0.71% evict
- 20.42% do_raw_spin_lock
- _raw_spin_lock
+ 13.41% inode_add_lru >>> inode_lru_lock
+ 10.55% evict
+ 8.26% dentry_lru_del >>> dcache_lru_lock
+ 7.62% __remove_inode_hash
....
- 10.37% do_raw_spin_trylock
- _raw_spin_trylock
+ 79.65% prune_icache_sb >>> inode_lru_lock
+ 11.04% shrink_dentry_list
+ 9.24% prune_dcache_sb >>> dcache_lru_lock
- 8.72% _raw_spin_trylock
+ 46.33% prune_icache_sb >>> inode_lru_lock
+ 46.08% shrink_dentry_list
+ 7.60% prune_dcache_sb >>> dcache_lru_lock
So the lock contention is variable - it's twice as high in this
short sample as in the overall profile measured above. It's also
pretty much all VFS cache LRU lock contention that is causing the
problems here. IOWs, the slowdowns are not related to the overhead
of CRC calculations; it's the change in memory access patterns that
is lowering the threshold of catastrophic lock contention. This
VFS LRU problem is being fixed independently by the generic
numa-aware LRU list patchset I've been doing with Glauber Costa.
Therefore, it is clear that the slowdown in this phase is not caused
by the overhead of CRCs, but by lock contention elsewhere in
the kernel. The unlink profiles show the same thing as the walk
profiles - additional lock contention on the lookup phase of the
unlink walk.
----
Dbench:
$ sudo mkfs.xfs -f -m crc=1 -l size=128m,sunit=8 /dev/vdc
vs
$ sudo mkfs.xfs -f -l size=128m,sunit=8 /dev/vdc
Running:
$ dbench -t 120 -D /mnt/scratch 8
no CRCs CRCs Difference
thruput 1098.06 MB/s 1229.65 MB/s +10% (faster)
latency (max) 22.385 ms 22.661 ms +1.3% (noise)
Well, now that's an interesting result, isn't it? CRC-enabled
filesystems are 10% faster than non-CRC filesystems. Again, let's
not take that number at face value, but ask ourselves why adding
CRCs improves performance (a.k.a. "know your benchmark")...
It's pretty obvious why - dbench uses xattrs, and performance is
sensitive to how many attributes can be stored inline in the inode.
CRCs increase the inode size to 512 bytes, meaning attributes are
probably never stored out of line. So, let's make it an even playing
field and compare:
$ sudo mkfs.xfs -f -m crc=1 -l size=128m,sunit=8 /dev/vdc
vs
$ sudo mkfs.xfs -f -i size=512 -l size=128m,sunit=8 /dev/vdc
no CRCs CRCs Difference
thruput 1273.22 MB/s 1229.65 MB/s -3.5% (slower)
latency (max) 25.455 ms 22.661 ms -12.4% (better)
So, we're back to the same relatively small difference seen in the
fsmark create phase, with similar CRC overhead being shown in the
profiles.
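The inline-attribute effect is easy to quantify. A rough sketch - the
inode core sizes here are my assumption (~96 bytes for the non-CRC v2
dinode core, ~176 bytes for the v3/CRC core), not numbers from the
measurements above - of the "literal area" left over for inline xattrs:

```python
# Hypothetical illustration: space left in an inode after the fixed
# dinode core.  Core sizes below are assumptions, not measured values.
V2_CORE = 96    # assumed v2 (non-CRC) dinode core size, bytes
V3_CORE = 176   # assumed v3 (CRC) dinode core size, bytes

def literal_area(inode_size: int, core_size: int) -> int:
    """Bytes available for inline attributes/extents."""
    return inode_size - core_size

print(literal_area(256, V2_CORE))  # default 256-byte v2 inode -> 160
print(literal_area(512, V3_CORE))  # 512-byte CRC inode -> 336
```

So even after the larger core, the 512-byte CRC inode leaves roughly
twice the inline space of a default 256-byte inode, which is why forcing
`-i size=512` on the non-CRC filesystem levels the playing field.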
----
Compilebench
Testing the same filesystems with 512 byte inodes as for dbench:
$ ./compilebench -D /mnt/scratch
using working directory /mnt/scratch, 30 intial dirs 100 runs
.....
test no CRCs CRCs
runs avg avg
==========================================================================
intial create 30 92.12 MB/s 90.24 MB/s
create 14 61.91 MB/s 61.13 MB/s
patch 15 41.04 MB/s 38.00 MB/s
compile 14 278.74 MB/s 262.00 MB/s
clean 10 1355.30 MB/s 1296.17 MB/s
read tree 11 25.68 MB/s 25.40 MB/s
read compiled tree 4 48.74 MB/s 48.65 MB/s
delete tree 10 2.97 seconds 3.05 seconds
delete compiled tree 4 2.96 seconds 3.05 seconds
stat tree 11 1.33 seconds 1.36 seconds
stat compiled tree 7 1.86 seconds 1.64 seconds
The numbers are so close that the differences are in the noise, and
the CRC overhead doesn't even show up in the ">1% usage" section
of the profile output.
----
Looking at these numbers realistically, dbench and compilebench
model two fairly common metadata intensive workloads - file servers
and code tree manipulations that developers tend to use all the
time. The difference that CRCs make to performance in these
workloads on equivalently configured filesystems varies between
0-5%, and for most operations they are small enough that they can
just about be considered to be noise.
Yes, we could argue over the fsmark walk/unlink phase results, but
the synthetic fsmark workload is designed to push the system to its
limits, and it's obvious that the addition of CRCs pushes the VFS into
lock contention hell. Further, we have to recognise that the same
workload on a 12p VM (run 12-way instead of 8-way) without CRCs hits
the same lock contention problem. IOWs, the slowdown is most
definitely not caused by the addition of CRC calculations to XFS
metadata.
The CPU overhead of CRCs is small and may be outweighed by other
changes for CRC filesystems that improve performance far more than
the cost of CRC calculations degrades it. The numbers above simply
don't support the assertion that metadata CRCs have "too much
overhead".
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Debunking myths about metadata CRC overhead
  2013-06-03  7:44 Debunking myths about metadata CRC overhead Dave Chinner
@ 2013-06-03  9:10 ` Emmanuel Florac
  2013-06-04  2:53   ` Dave Chinner
  2013-06-03 15:31 ` Troy McCorkell
  2013-06-03 20:00 ` Geoffrey Wehrman
  2 siblings, 1 reply; 16+ messages in thread
From: Emmanuel Florac @ 2013-06-03 9:10 UTC (permalink / raw)
  Cc: xfs

On Mon, 3 Jun 2013 17:44:52 +1000 you wrote:

> There has been some assertions made recently that metadata CRCs have
> too much overhead to always be enabled. So I'll run some quick
> benchmarks to demonstrate the "too much overhead" assertions are
> completely unfounded.

Just a quick question: what is the minimal kernel version and xfsprogs
version needed to run xfs with metadata CRC? I'd happily test it on
real hardware, I have a couple of storage servers in test in the 40 to
108 TB range.

-- 
------------------------------------------------------------------------
Emmanuel Florac | Direction technique | Intellique
                | <eflorac@intellique.com> | +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: Debunking myths about metadata CRC overhead
  2013-06-03  9:10 ` Emmanuel Florac
@ 2013-06-04  2:53   ` Dave Chinner
  2013-06-04 16:20     ` Ben Myers
  2013-06-04 18:38     ` Chandra Seetharaman
  0 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2013-06-04 2:53 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: xfs

On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> On Mon, 3 Jun 2013 17:44:52 +1000 you wrote:
>
> > There has been some assertions made recently that metadata CRCs have
> > too much overhead to always be enabled. So I'll run some quick
> > benchmarks to demonstrate the "too much overhead" assertions are
> > completely unfounded.
>
> Just a quick question: what is the minimal kernel version and xfsprogs
> version needed to run xfs with metadata CRC? I'd happily test it on
> real hardware, I have a couple of storage servers in test in the 40 to
> 108 TB range.

If the maintainers merge all the patches I send for the 3.10-rc
series, then the 3.10 release should be stable enough to use for
testing with data you don't care if you lose.

As for the userspace code - that is still just a patchset. I haven't
had any feedback from the maintainers about it in the past month, so
I've got no idea what they are doing with it. I'll post out a new
version in the next couple of days - it's 50-odd patches by now, so
it'd be nice to have it in the xfsprogs git tree so people could
just pull it and build it for testing purposes by the time that 3.10
releases....

Cheers,

Dave
-- 
Dave Chinner
david@fromorbit.com
* Re: Debunking myths about metadata CRC overhead
  2013-06-04  2:53 ` Dave Chinner
@ 2013-06-04 16:20   ` Ben Myers
  2013-06-04 22:06     ` Dave Chinner
  2013-06-04 18:38 ` Chandra Seetharaman
  1 sibling, 1 reply; 16+ messages in thread
From: Ben Myers @ 2013-06-04 16:20 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

Dave,

On Tue, Jun 04, 2013 at 12:53:07PM +1000, Dave Chinner wrote:
> On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> > On Mon, 3 Jun 2013 17:44:52 +1000 you wrote:
> >
> > > There has been some assertions made recently that metadata CRCs have
> > > too much overhead to always be enabled. So I'll run some quick
> > > benchmarks to demonstrate the "too much overhead" assertions are
> > > completely unfounded.
> >
> > Just a quick question: what is the minimal kernel version and xfsprogs
> > version needed to run xfs with metadata CRC? I'd happily test it on
> > real hardware, I have a couple of storage servers in test in the 40 to
> > 108 TB range.
>
> If the maintainers merge all the patches I send for the 3.10-rc
> series, then the 3.10 release should be stable enough to use for
> testing with data you don't care if you lose.
>
> As for the userspace code - that is still just a patchset. I haven't
> had any feedback from the maintainers about it in the past month, so
> I've got no idea what they are doing with it. I'll post out a new
> version in the next couple of days - it's 50-odd patches by now, so
> it'd be nice to have it in the xfsprogs git tree so people could
> just pull it and build it for testing purposes by the time that 3.10
> releases....

When it is reviewed and adequately tested we'll pull it in. Until then
Emmanuel will need to pull down the patchset. Right now the focus is on
3.10.

-Ben
* Re: Debunking myths about metadata CRC overhead
  2013-06-04 16:20 ` Ben Myers
@ 2013-06-04 22:06   ` Dave Chinner
  2013-06-04 22:09     ` Ben Myers
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2013-06-04 22:06 UTC (permalink / raw)
To: Ben Myers; +Cc: xfs

On Tue, Jun 04, 2013 at 11:20:30AM -0500, Ben Myers wrote:
> Dave,
>
> On Tue, Jun 04, 2013 at 12:53:07PM +1000, Dave Chinner wrote:
> > On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> > > On Mon, 3 Jun 2013 17:44:52 +1000 you wrote:
> > >
> > > > There has been some assertions made recently that metadata CRCs have
> > > > too much overhead to always be enabled. So I'll run some quick
> > > > benchmarks to demonstrate the "too much overhead" assertions are
> > > > completely unfounded.
> > >
> > > Just a quick question: what is the minimal kernel version and xfsprogs
> > > version needed to run xfs with metadata CRC? I'd happily test it on
> > > real hardware, I have a couple of storage servers in test in the 40 to
> > > 108 TB range.
> >
> > If the maintainers merge all the patches I send for the 3.10-rc
> > series, then the 3.10 release should be stable enough to use for
> > testing with data you don't care if you lose.
> >
> > As for the userspace code - that is still just a patchset. I haven't
> > had any feedback from the maintainers about it in the past month, so
> > I've got no idea what they are doing with it. I'll post out a new
> > version in the next couple of days - it's 50-odd patches by now, so
> > it'd be nice to have it in the xfsprogs git tree so people could
> > just pull it and build it for testing purposes by the time that 3.10
> > releases....
>
> When it is reviewed and adequately tested we'll pull it in. Until then
> Emmanuel will need to pull down the patchset. Right now the focus is on
> 3.10.

And when will that be? I've already been waiting the best part of a
month for anyone to even comment on it, and I've got 5 private pings
in the past 3 days asking about how to get the userspace code so
they can test the new kernel code....

How about this: I post an up-to-date patch set, and you guys commit
it to a "crc-dev" branch in the oss xfsprogs git tree. The branch
can be thrown away when the code is reviewed, but in the mean time
we can point early adopters and testers to that branch rather than
ask them to pull down and apply a 50 patch series to a git tree?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: Debunking myths about metadata CRC overhead
  2013-06-04 22:06 ` Dave Chinner
@ 2013-06-04 22:09   ` Ben Myers
  0 siblings, 0 replies; 16+ messages in thread
From: Ben Myers @ 2013-06-04 22:09 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On Wed, Jun 05, 2013 at 08:06:10AM +1000, Dave Chinner wrote:
> On Tue, Jun 04, 2013 at 11:20:30AM -0500, Ben Myers wrote:
> > Dave,
> >
> > On Tue, Jun 04, 2013 at 12:53:07PM +1000, Dave Chinner wrote:
> > > On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> > > > On Mon, 3 Jun 2013 17:44:52 +1000 you wrote:
> > > >
> > > > > There has been some assertions made recently that metadata CRCs have
> > > > > too much overhead to always be enabled. So I'll run some quick
> > > > > benchmarks to demonstrate the "too much overhead" assertions are
> > > > > completely unfounded.
> > > >
> > > > Just a quick question: what is the minimal kernel version and xfsprogs
> > > > version needed to run xfs with metadata CRC? I'd happily test it on
> > > > real hardware, I have a couple of storage servers in test in the 40 to
> > > > 108 TB range.
> > >
> > > If the maintainers merge all the patches I send for the 3.10-rc
> > > series, then the 3.10 release should be stable enough to use for
> > > testing with data you don't care if you lose.
> > >
> > > As for the userspace code - that is still just a patchset. I haven't
> > > had any feedback from the maintainers about it in the past month, so
> > > I've got no idea what they are doing with it. I'll post out a new
> > > version in the next couple of days - it's 50-odd patches by now, so
> > > it'd be nice to have it in the xfsprogs git tree so people could
> > > just pull it and build it for testing purposes by the time that 3.10
> > > releases....
> >
> > When it is reviewed and adequately tested we'll pull it in. Until then
> > Emmanuel will need to pull down the patchset. Right now the focus is on
> > 3.10.
>
> And when will that be? I've already been waiting the best part of a
> month for anyone to even comment on it, and I've got 5 private pings
> in the past 3 days asking about how to get the userspace code so
> they can test the new kernel code....
>
> How about this: I post an up-to-date patch set, and you guys commit
> it to a "crc-dev" branch in the oss xfsprogs git tree. The branch
> can be thrown away when the code is reviewed, but in the mean time
> we can point early adopters and testers to that branch rather than
> ask them to pull down and apply a 50 patch series to a git tree?

Sounds good to me.

-Ben
* Re: Debunking myths about metadata CRC overhead
  2013-06-04  2:53 ` Dave Chinner
  2013-06-04 16:20   ` Ben Myers
@ 2013-06-04 18:38   ` Chandra Seetharaman
  2013-06-04 22:08     ` Dave Chinner
  1 sibling, 1 reply; 16+ messages in thread
From: Chandra Seetharaman @ 2013-06-04 18:38 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On Tue, 2013-06-04 at 12:53 +1000, Dave Chinner wrote:
> On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> > On Mon, 3 Jun 2013 17:44:52 +1000 you wrote:
> >
> > > There has been some assertions made recently that metadata CRCs have
> > > too much overhead to always be enabled. So I'll run some quick
> > > benchmarks to demonstrate the "too much overhead" assertions are
> > > completely unfounded.
> >
> > Just a quick question: what is the minimal kernel version and xfsprogs
> > version needed to run xfs with metadata CRC? I'd happily test it on
> > real hardware, I have a couple of storage servers in test in the 40 to
> > 108 TB range.
>
> If the maintainers merge all the patches I send for the 3.10-rc
> series, then the 3.10 release should be stable enough to use for
> testing with data you don't care if you lose.
>
> As for the userspace code - that is still just a patchset. I haven't
> had any feedback from the maintainers about it in the past month, so
> I've got no idea what they are doing with it. I'll post out a new
> version in the next couple of days - it's 50-odd patches by now, so
> it'd be nice to have it in the xfsprogs git tree so people could
> just pull it and build it for testing purposes by the time that 3.10
> releases....

Dave,

I was of the impression that the user space changes will be released
sometime later (i.e. when CRC comes out of experimental). If we make the
user space changes to create V5 filesystems now, it will be an annoyance
for people that created V5 super blocks without my changes (getting rid
of the OQUOTA.* flags).

BTW, I am waiting for your response to do a final re-post on the kernel
changes, after which I will post my user space changes.

Regards,

Chandra
>
> Cheers,
>
> Dave
* Re: Debunking myths about metadata CRC overhead
  2013-06-04 18:38 ` Chandra Seetharaman
@ 2013-06-04 22:08   ` Dave Chinner
  2013-06-04 22:40     ` Chandra Seetharaman
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2013-06-04 22:08 UTC (permalink / raw)
To: Chandra Seetharaman; +Cc: xfs

On Tue, Jun 04, 2013 at 01:38:29PM -0500, Chandra Seetharaman wrote:
> On Tue, 2013-06-04 at 12:53 +1000, Dave Chinner wrote:
> > On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> > > On Mon, 3 Jun 2013 17:44:52 +1000 you wrote:
> > >
> > > > There has been some assertions made recently that metadata CRCs have
> > > > too much overhead to always be enabled. So I'll run some quick
> > > > benchmarks to demonstrate the "too much overhead" assertions are
> > > > completely unfounded.
> > >
> > > Just a quick question: what is the minimal kernel version and xfsprogs
> > > version needed to run xfs with metadata CRC? I'd happily test it on
> > > real hardware, I have a couple of storage servers in test in the 40 to
> > > 108 TB range.
> >
> > If the maintainers merge all the patches I send for the 3.10-rc
> > series, then the 3.10 release should be stable enough to use for
> > testing with data you don't care if you lose.
> >
> > As for the userspace code - that is still just a patchset. I haven't
> > had any feedback from the maintainers about it in the past month, so
> > I've got no idea what they are doing with it. I'll post out a new
> > version in the next couple of days - it's 50-odd patches by now, so
> > it'd be nice to have it in the xfsprogs git tree so people could
> > just pull it and build it for testing purposes by the time that 3.10
> > releases....
>
> Dave,
>
> I was of the impression that the user space changes will be released
> sometime later (i.e when CRC comes out of experimental). If we make the
> user space changes to create V5 filesystem now, it will be an annoyance
> for people that created V5 super blocks without my changes (getting rid
> of OQUOTA.* flags).

People still need access to the code to test it. I'm not talking
about an official release here at all, just getting it committed to
the git tree to make it easy for people to get the code and for
developers to build on top of it and fix bugs.

> BTW, I am waiting for your response to do a final re-post on the kernel
> changes, after which I will post my user space changes.

I must have missed your question, I'll go back and have a look for
it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: Debunking myths about metadata CRC overhead
  2013-06-04 22:08 ` Dave Chinner
@ 2013-06-04 22:40   ` Chandra Seetharaman
  2013-06-04 22:59     ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Chandra Seetharaman @ 2013-06-04 22:40 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On Wed, 2013-06-05 at 08:08 +1000, Dave Chinner wrote:
> On Tue, Jun 04, 2013 at 01:38:29PM -0500, Chandra Seetharaman wrote:
> > On Tue, 2013-06-04 at 12:53 +1000, Dave Chinner wrote:
> > > On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> > > > On Mon, 3 Jun 2013 17:44:52 +1000 you wrote:
> > > >
> > > > > There has been some assertions made recently that metadata CRCs have
> > > > > too much overhead to always be enabled. So I'll run some quick
> > > > > benchmarks to demonstrate the "too much overhead" assertions are
> > > > > completely unfounded.
> > > >
> > > > Just a quick question: what is the minimal kernel version and xfsprogs
> > > > version needed to run xfs with metadata CRC? I'd happily test it on
> > > > real hardware, I have a couple of storage servers in test in the 40 to
> > > > 108 TB range.
> > >
> > > If the maintainers merge all the patches I send for the 3.10-rc
> > > series, then the 3.10 release should be stable enough to use for
> > > testing with data you don't care if you lose.
> > >
> > > As for the userspace code - that is still just a patchset. I haven't
> > > had any feedback from the maintainers about it in the past month, so
> > > I've got no idea what they are doing with it. I'll post out a new
> > > version in the next couple of days - it's 50-odd patches by now, so
> > > it'd be nice to have it in the xfsprogs git tree so people could
> > > just pull it and build it for testing purposes by the time that 3.10
> > > releases....
> >
> > Dave,
> >
> > I was of the impression that the user space changes will be released
> > sometime later (i.e when CRC comes out of experimental). If we make the
> > user space changes to create V5 filesystem now, it will be an annoyance
> > for people that created V5 super blocks without my changes (getting rid
> > of OQUOTA.* flags).
>
> People still need access to the code to test it. I'm not talking
> about an official release here at all, just getting it committed to
> the git tree to make it easy for people to get the code and for
> developers to build on top of it and fix bugs.

Oh, I see. It is clear now.

Also, is there a git tree where I can pull your xfsprogs changes from?
I tried to apply xfsprogs-kern-sync-patchset-v2.tar.gz on top of the
xfsprogs git tree, it seems to have some problems.

Or is there a later version?

> > BTW, I am waiting for your response to do a final re-post on the kernel
> > changes, after which I will post my user space changes.
>
> I must have missed your question, I'll go back and have a look for
> it.

Thanks.

> Cheers,
>
> Dave.
* Re: Debunking myths about metadata CRC overhead
  2013-06-04 22:40 ` Chandra Seetharaman
@ 2013-06-04 22:59   ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2013-06-04 22:59 UTC (permalink / raw)
To: Chandra Seetharaman; +Cc: xfs

On Tue, Jun 04, 2013 at 05:40:16PM -0500, Chandra Seetharaman wrote:
> On Wed, 2013-06-05 at 08:08 +1000, Dave Chinner wrote:
> > On Tue, Jun 04, 2013 at 01:38:29PM -0500, Chandra Seetharaman wrote:
> > > On Tue, 2013-06-04 at 12:53 +1000, Dave Chinner wrote:
> > > > On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> > > > > On Mon, 3 Jun 2013 17:44:52 +1000 you wrote:
> > > > >
> > > > > > There has been some assertions made recently that metadata CRCs have
> > > > > > too much overhead to always be enabled. So I'll run some quick
> > > > > > benchmarks to demonstrate the "too much overhead" assertions are
> > > > > > completely unfounded.
> > > > >
> > > > > Just a quick question: what is the minimal kernel version and xfsprogs
> > > > > version needed to run xfs with metadata CRC? I'd happily test it on
> > > > > real hardware, I have a couple of storage servers in test in the 40 to
> > > > > 108 TB range.
> > > >
> > > > If the maintainers merge all the patches I send for the 3.10-rc
> > > > series, then the 3.10 release should be stable enough to use for
> > > > testing with data you don't care if you lose.
> > > >
> > > > As for the userspace code - that is still just a patchset. I haven't
> > > > had any feedback from the maintainers about it in the past month, so
> > > > I've got no idea what they are doing with it. I'll post out a new
> > > > version in the next couple of days - it's 50-odd patches by now, so
> > > > it'd be nice to have it in the xfsprogs git tree so people could
> > > > just pull it and build it for testing purposes by the time that 3.10
> > > > releases....
> > >
> > > Dave,
> > >
> > > I was of the impression that the user space changes will be released
> > > sometime later (i.e when CRC comes out of experimental). If we make the
> > > user space changes to create V5 filesystem now, it will be an annoyance
> > > for people that created V5 super blocks without my changes (getting rid
> > > of OQUOTA.* flags).
> >
> > People still need access to the code to test it. I'm not talking
> > about an official release here at all, just getting it committed to
> > the git tree to make it easy for people to get the code and for
> > developers to build on top of it and fix bugs.
>
> Oh, I see. It is clear now.
>
> Also, is there a git tree where I can pull your xfsprogs changes from ?
> I tried to apply xfsprogs-kern-sync-patchset-v2.tar.gz on top of
> xfsprogs git tree, it seems to have some problems.
>
> Or is there a later version ?

I'm cleaning up my current tree right now. I'll post it out as soon
as it is ready to go. Hopefully Ben can get that into a branch in
the xfsprogs git tree and you can pull from there.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: Debunking myths about metadata CRC overhead
  2013-06-03  7:44 Debunking myths about metadata CRC overhead Dave Chinner
  2013-06-03  9:10 ` Emmanuel Florac
@ 2013-06-03 15:31 ` Troy McCorkell
  2013-06-03 20:00 ` Geoffrey Wehrman
  2 siblings, 0 replies; 16+ messages in thread
From: Troy McCorkell @ 2013-06-03 15:31 UTC (permalink / raw)
To: xfs

On 06/03/2013 02:44 AM, Dave Chinner wrote:
> Hi folks,
>
> There has been some assertions made recently that metadata CRCs have
> too much overhead to always be enabled. So I'll run some quick
> benchmarks to demonstrate the "too much overhead" assertions are
> completely unfounded.
>

Dave,

Thanks for generating, gathering, and providing this data.

Thanks,
Troy
* Re: Debunking myths about metadata CRC overhead
From: Geoffrey Wehrman @ 2013-06-03 20:00 UTC
To: Dave Chinner; +Cc: xfs

On Mon, Jun 03, 2013 at 05:44:52PM +1000, Dave Chinner wrote:
| Hi folks,
|
| There has been some assertions made recently that metadata CRCs have
| too much overhead to always be enabled. So I'll run some quick
| benchmarks to demonstrate the "too much overhead" assertions are
| completely unfounded.

Thank you, much appreciated.

| fs_mark workload
| ----------------
...
| So the lock contention is variable - it's twice as high in this
| short sample as the overall profile I measured above. It's also
| pretty much all VFS cache LRU lock contention that is causing the
| problems here. IOWs, the slowdowns are not related to the overhead
| of CRC calculations; it's the change in memory access patterns that
| are lowering the threshold of catastrophic lock contention that is
| causing it. This VFS LRU problem is being fixed independently by the
| generic numa-aware LRU list patchset I've been doing with Glauber
| Costa.
|
| Therefore, it is clear that the slowdown in this phase is not caused
| by the overhead of CRCs, but that of lock contention elsewhere in
| the kernel. The unlink profiles show the same thing as the walk
| profiles - additional lock contention on the lookup phase of the
| unlink walk.

I get it that the slowdown is not caused by the numerical operations
to calculate the CRCs, but as an overall feature, I don't see how you
can say that CRCs are not responsible for the slowdown.
If CRCs are introducing lock contention, it doesn't matter whether
that lock contention is in XFS code or elsewhere in the kernel; it is
still a slowdown which can be attributed to the CRC feature. Spin it
as you like, it still appears to me that there's a huge impact on the
walk and unlink phases from CRC calculations.

| ----
|
| Dbench:
...
| Well, now that's an interesting result, isn't it. CRC enabled
| filesystems are 10% faster than non-crc filesystems. Again, let's
| not take that number at face value, but ask ourselves why adding
| CRCs improves performance (a.k.a. "know your benchmark")...
|
| It's pretty obvious why - dbench uses xattrs and performance is
| sensitive to how many attributes can be stored inline in the inode.
| And CRCs increase the inode size to 512 bytes meaning attributes are
| probably never out of line. So, let's make it an even playing field
| and compare:

CRC filesystems default to 512 byte inodes? I wasn't aware of that.
Sure, CRC filesystems are able to move more volume, but the metadata
is half the density it was before. I'm not a dbench expert, so I have
no idea what the ratio of metadata to data is here, so I really don't
know what conclusions to draw from the dbench results.

What really bothers me is the default of 512 byte inodes for CRCs.
That means my inodes take up twice as much space on disk, and will
require 2X the bandwidth to read from disk. This will have
significant impact on SGI's DMF managed filesystems. I know you don't
care about SGI's DMF, but this will also have a significant
performance impact on xfsdump, xfsrestore, and xfs_repair. These
performance benchmarks are just as important to me as dbench and
compilebench.

| ----
|
| Compilebench
|
| Testing the same filesystems with 512 byte inodes as for dbench:
|
| $ ./compilebench -D /mnt/scratch
| using working directory /mnt/scratch, 30 intial dirs 100 runs
| .....
|
| test                          no CRCs         CRCs
|                        runs     avg            avg
| ==========================================================================
| intial create            30    92.12 MB/s     90.24 MB/s
| create                   14    61.91 MB/s     61.13 MB/s
| patch                    15    41.04 MB/s     38.00 MB/s
| compile                  14   278.74 MB/s    262.00 MB/s
| clean                    10  1355.30 MB/s   1296.17 MB/s
| read tree                11    25.68 MB/s     25.40 MB/s
| read compiled tree        4    48.74 MB/s     48.65 MB/s
| delete tree              10     2.97 seconds   3.05 seconds
| delete compiled tree      4     2.96 seconds   3.05 seconds
| stat tree                11     1.33 seconds   1.36 seconds
| stat compiled tree        7     1.86 seconds   1.64 seconds
|
| The numbers are so close that the differences are in the noise, and
| the CRC overhead doesn't even show up in the ">1% usage" section
| of the profile output.

What really surprises me in these results is the hit that the compile
phase takes. That is a 6% performance drop in an area where I expect
the CRCs to have limited effect. To me, the results show a rather
consistent performance drop of up to 6%, which is sufficient to
support my assertion that the CRC overhead may outweigh the benefits.

| ----
|
| Looking at these numbers realistically, dbench and compilebench
| model two fairly common metadata intensive workloads - file servers
| and code tree manipulations that developers tend to use all the
| time. The difference that CRCs make to performance in these
| workloads on equivalently configured filesystems varies between
| 0-5%, and for most operations they are small enough that they can
| just about be considered to be noise.
|
| Yes, we could argue over the fsmark walk/unlink phase results, but
| the synthetic fsmark workload is designed to push the system to its
| limits and it's obvious that the addition of CRCs pushes the VFS into
| lock contention hell. Further, we have to recognise that the same
| workload on a 12p VM (run 12-way instead of 8-way) without CRCs hits
| the same lock contention problem.
| IOWs, the slowdown is most
| definitely not caused by the addition of CRC calculations to XFS
| metadata.
|
| The CPU overhead of CRCs is small and may be outweighed by other
| changes for CRC filesystems that improve performance far more than
| the cost of CRC calculations degrades it. The numbers above simply
| don't support the assertion that metadata CRCs have "too much
| overhead".

Do I want to take a 5% performance hit in filesystem performance and
double the size of my inodes for an unproved feature? I am still
unconvinced that CRCs are a feature that I want to use. Others may
see enough benefit in CRCs to accept the performance hit. All I want
is to ensure that I have the option going forward to choose not to
use CRCs without sacrificing other features introduced in XFS.

--
Geoffrey Wehrman                SGI, Building 10
Office: (651)683-5496           2750 Blue Water Road
Fax: (651)683-5098              Eagan, MN 55121
E-mail: gwehrman@sgi.com        http://www.sgi.com/products/storage/software/
* Re: Debunking myths about metadata CRC overhead
From: Dave Chinner @ 2013-06-04 2:43 UTC
To: Geoffrey Wehrman; +Cc: xfs

On Mon, Jun 03, 2013 at 03:00:53PM -0500, Geoffrey Wehrman wrote:
> On Mon, Jun 03, 2013 at 05:44:52PM +1000, Dave Chinner wrote:
> | Hi folks,
> |
> | There has been some assertions made recently that metadata CRCs have
> | too much overhead to always be enabled. So I'll run some quick
> | benchmarks to demonstrate the "too much overhead" assertions are
> | completely unfounded.
>
> Thank you, much appreciated.
>
> | fs_mark workload
> | ----------------
> ...
> | So the lock contention is variable - it's twice as high in this
> | short sample as the overall profile I measured above. It's also
> | pretty much all VFS cache LRU lock contention that is causing the
> | problems here. IOWs, the slowdowns are not related to the overhead
> | of CRC calculations; it's the change in memory access patterns that
> | are lowering the threshold of catastrophic lock contention that is
> | causing it. This VFS LRU problem is being fixed independently by the
> | generic numa-aware LRU list patchset I've been doing with Glauber
> | Costa.
> |
> | Therefore, it is clear that the slowdown in this phase is not caused
> | by the overhead of CRCs, but that of lock contention elsewhere in
> | the kernel. The unlink profiles show the same thing as the walk
> | profiles - additional lock contention on the lookup phase of the
> | unlink walk.
>
> I get it that the slowdown is not caused by the numerical operations to
> calculate the CRCs, but as an overall feature, I don't see how you can
> say that CRCs are not responsible for the slowdown.
I can trigger the VFS lock contention in a similar manner by running
a userspace application that memcpy()s a 128MB buffer repeatedly.
It's simply a case of increased memory bus traffic causing cacheline
bouncing that causes the lock contention to spiral out of control.

> If CRCs are
> introducing lock contention, it doesn't matter if that lock contention
> is in XFS code or elsewhere in the kernel, it is still a slowdown which
> can be attributed to the CRC feature. Spin it as you like, it still
> appears to me that there's a huge impact on the walk and unlink phases
> from CRC calculations.

So by that logic, userspace memcpy() causes lock contention in the
VFS, and so therefore the problem is the userspace application, not
the kernel code. And the solution is not to run that userspace code.

Three words: Root Cause Analysis.

We've known about the VFS lock contention problem a lot longer than
we've had the CRC code running. In case you hadn't been keeping up
with this stuff, here's a quick summary of the work I've been doing
with Glauber:

http://lwn.net/Articles/550463/
http://lwn.net/Articles/548092/

So, while CRCs might be a trigger that makes the system fall off the
cliff it is on the edge of, it is most certainly not a CRC problem,
it is not a problem we can solve by changing the CRC code and it is
not a problem we can solve by turning off CRCs. IOWs, CRCs are not
the root cause of the degradation in performance.

> | ----
> |
> | Dbench:
> ...
> | Well, now that's an interesting result, isn't it. CRC enabled
> | filesystems are 10% faster than non-crc filesystems. Again, let's
> | not take that number at face value, but ask ourselves why adding
> | CRCs improves performance (a.k.a. "know your benchmark")...
> |
> | It's pretty obvious why - dbench uses xattrs and performance is
> | sensitive to how many attributes can be stored inline in the inode.
> | And CRCs increase the inode size to 512 bytes meaning attributes are
> | probably never out of line.
> | So, let's make it an even playing field
> | and compare:
>
> CRC filesystems default to 512 byte inodes? I wasn't aware of that.

That's been the plan of record since 2008, as the increase in the
size of the inode core leaves 256 byte inodes with less literal area
space than attr=1 configurations.

> Sure, CRC filesystems are able to move more volume, but the metadata is
> half the density as it was before. I'm not a dbench expert, so I have
> no idea what the ratio of metadata to data is here, so I really don't
> know what conclusions to draw from the dbench results.

So perhaps you should trust someone who is an expert to analyse the
results for you? :)

FYI, dbench is log IO bound, not metadata or data IO bound.
Performance drops with out-of-line attributes because attribute block
IO steals IOPS from the log IO and hence processes block for longer
in fsync, and that lowers throughput and increases measured latency.
IOWs, the performance differential that inode sizes give is all due
to less IO being needed for attribute manipulations.

> What really bothers me is the default of 512 byte inodes for CRCs. That
> means my inodes take up twice as much space on disk, and will require
> 2X the bandwidth to read from disk.

Metadata read IO is latency bound, not bandwidth bound. The increase
in metadata IO bandwidth doesn't make any measurable difference on a
typical modern storage system.

> This will have significant impact
> on SGI's DMF managed filesystems.

You're concerned about bulkstat performance, then? Bulkstat will CRC
every inode it reads, so the increase in inode size is the least of
your worries....

But bulkstat scalability is an unrelated issue to the CRC work,
especially as bulkstat already needs application provided
parallelism to scale effectively.

> I know you don't care about SGI's
> DMF, but this will also have a significant performance impact on
> xfsdump, xfsrestore, and xfs_repair.
> These performance benchmarks are
> just as important to me as dbench and compilebench.

Sure. But the changes for SDM (self describing metadata) are not
introducing any new performance problems we don't already have. I'm
perfectly OK with that, and it's pretty clear that correcting any
such issues is not related to the implementation of SDM.

> | Compilebench
> |
> | Testing the same filesystems with 512 byte inodes as for dbench:
> |
> | $ ./compilebench -D /mnt/scratch
> | using working directory /mnt/scratch, 30 intial dirs 100 runs
> | .....
> |
> | test                          no CRCs         CRCs
> |                        runs     avg            avg
> | ==========================================================================
> | intial create            30    92.12 MB/s     90.24 MB/s
> | create                   14    61.91 MB/s     61.13 MB/s
> | patch                    15    41.04 MB/s     38.00 MB/s
> | compile                  14   278.74 MB/s    262.00 MB/s
> | clean                    10  1355.30 MB/s   1296.17 MB/s
> | read tree                11    25.68 MB/s     25.40 MB/s
> | read compiled tree        4    48.74 MB/s     48.65 MB/s
> | delete tree              10     2.97 seconds   3.05 seconds
> | delete compiled tree      4     2.96 seconds   3.05 seconds
> | stat tree                11     1.33 seconds   1.36 seconds
> | stat compiled tree        7     1.86 seconds   1.64 seconds
> |
> | The numbers are so close that the differences are in the noise, and
> | the CRC overhead doesn't even show up in the ">1% usage" section
> | of the profile output.
>
> What really surprises me in these results is the hit that the compile
> phase takes. That is a 6% performance drop in an area where I expect
> the CRCs to have limited effect. To me, the results show a rather
> consistent performance drop of up to 6%, and is sufficient to support my
> assertion that the CRCs overhead may outweigh the benefits.

You're making an assumption that 6% is actually meaningful. It's not.
Here are the raw numbers for that phase throughout the benchmark:

compile dir kernel-7 691MB in 1.98 seconds (349.29 MB/s)
compile dir kernel-14 680MB in 2.67 seconds (254.92 MB/s)
compile dir kernel-2 680MB in 1.81 seconds (376.04 MB/s)
compile dir kernel-2 691MB in 1.94 seconds (356.49 MB/s)
compile dir kernel-7 691MB in 2.16 seconds (320.18 MB/s)
compile dir kernel-2 691MB in 1.97 seconds (351.06 MB/s)
compile dir kernel-26 680MB in 3.13 seconds (217.46 MB/s)
compile dir kernel-14 691MB in 3.03 seconds (228.25 MB/s)
compile dir kernel-70151 691MB in 3.38 seconds (204.61 MB/s)
compile dir kernel-27 691MB in 4.14 seconds (167.05 MB/s)
compile dir kernel-18 680MB in 2.72 seconds (250.23 MB/s)
compile dir kernel-2 691MB in 2.25 seconds (307.38 MB/s)
compile dir kernel-17 680MB in 2.83 seconds (240.51 MB/s)

So, to summarise the numbers for the compile phase we have:

min:    167.05 MB/s
max:    376.04 MB/s
avg:    262.00 MB/s
stddev:  65 MB/s (25%!)

So, that difference of 16MB/s from run to run is well within the
standard deviation of the results of that phase. I just did another
run on a CRC enabled filesystem:

compile total runs 14 avg 291.30 MB/s (user 0.13s sys 0.77s)

Which is still within a single stddev of the above number and hence
is not significant. IOWs, there's a lot of variability within any
specific phase from run to run in this benchmark, and for this phase
a 6% difference is well within the noise.

Like I said - I use benchmarks that I understand. If I say that the
differences are "in the noise" I really do mean that they are "in
the noise". I don't play games with numbers - benchmarketing is one
of my pet peeves and it's something I do not do out of principle.

> Do I want to take a 5% performance hit in filesystem performance
> and double the size of my inodes for an unproved feature? I am
> still unconvinced that CRCs are a feature that I want to use.
> Others may see enough benefit in CRCs to accept the performance
> hit.
> All I want is to ensure that I have the option going forward to
> choose not to use CRCs without sacrificing other features
> introduced in XFS.

If you don't want to take the performance hit of SDM, then don't use
it. You have that choice right now - either choose performance (v4
superblocks) or reliability (v5 superblocks) at mkfs time.

If new features are introduced that you want that are dependent on
v5 superblocks and you want to stick with v4 superblocks for
performance reasons, then you have to make a hard choice unless you
address your concerns about v5 superblocks. Indeed, none of the
performance issues you've mentioned are unsolvable problems - you
just have to identify them and fix them before your customers need
v5 superblocks.

IOWs, you need to quantify the specific performance degradations you
are concerned about and help fix them. We may have different
priorities and goals, but that doesn't stop us from both being able
to help each other reach our goals. But any such discussion about
performance and problem areas needs to be based on quantified
information, not handwaving.

Geoffrey, can you start by identifying and quantifying two things on
current top-of-tree kernels?

1. exactly where the problems with larger inodes are (on v4
   superblocks)
2. workloads you care about where SDM significantly impacts
   performance (i.e. v4 vs v5 superblocks)

We can discuss each case you raise on its merits and determine
whether it needs to be addressed and, if so, how to address it.
But we need quantified data to make any progress here.

In the mean time, you can just use v4 superblocks like you currently
do, but when the time comes to switch to v5 superblocks we will have
corrected the identified problems and performance will not be an
issue that you need to be concerned about.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Debunking myths about metadata CRC overhead
From: Dave Chinner @ 2013-06-04 10:19 UTC
To: Geoffrey Wehrman; +Cc: xfs

On Tue, Jun 04, 2013 at 12:43:29PM +1000, Dave Chinner wrote:
> On Mon, Jun 03, 2013 at 03:00:53PM -0500, Geoffrey Wehrman wrote:
> > On Mon, Jun 03, 2013 at 05:44:52PM +1000, Dave Chinner wrote:
> > This will have significant impact
> > on SGI's DMF managed filesystems.
>
> You're concerned about bulkstat performance, then? Bulkstat will CRC
> every inode it reads, so the increase in inode size is the least of
> your worries....
>
> But bulkstat scalability is an unrelated issue to the CRC work,
> especially as bulkstat already needs application provided
> parallelism to scale effectively.

So, I just added a single threaded bulkstat pass to the fsmark
workload by passing xfs_fsr across the filesystem to test out what
impact it has. So, 50 million inodes in the directory structure:

                256 byte inodes,        512 byte inodes
                CRCs disabled           CRCs enabled
                ---------------------------------------
wall time       13m34.203s              14m15.266s
sys CPU          7m7.930s                8m52.050s
rate            61,425 inodes/s         58,479 inodes/s
efficiency      116,800 inodes/CPU/s    93,984 inodes/CPU/s

So, really it's not particularly significant in terms of performance
differential. Certainly there isn't any significant problem that
larger inodes cause.

For comparison, the 8-way find workloads:

                256 byte inodes,        512 byte inodes
                CRCs disabled           CRCs enabled
                ---------------------------------------
wall time       5m33.165s               8m18.256s
sys CPU         18m36.731s              22m2.277s
rate            150,055 inodes/s        100,400 inodes/s
efficiency      44,800 inodes/CPU/s     37,800 inodes/CPU/s

Which makes me think something is not right with this bulkstat pass
I've just done. It's way too slow if a find+stat is 2-2.5x faster.
Ah, xfs_fsr only bulkstats 64 inodes at a time.
That's right, last time I did this I used bstat out of xfstests. On
a CRC enabled fs:

ninodes         runtime   sys time   read bw (IOPS)
64              14m01s     8m37s
128             11m20s     7m58s      35MB/s (5000)
256              8m53s     7m24s      45MB/s (6000)
512              7m24s     6m28s      55MB/s (7000)
1024             6m37s     5m40s      65MB/s (8000)
2048            10m50s     6m51s      35MB/s (5000)
4096(default)   26m23s     8m38s

Ask bulkstat for too few or too many, and it all goes to hell. So if
we get the bulkstat config right, a single threaded bulkstat is
faster than the 8-way find, and a whole lot more efficient at it.
But, still, there is effectively no performance differential between
256 byte and 512 byte inodes worth talking about.

And, FWIW, I just hacked threading into bstat to run a thread per AG
and just scan a single AG per thread. It's not perfect - it counts
some inodes twice (threads*ninodes at most) before it detects it's
run into the next AG. This is on a 100TB filesystem, so it runs 100
threads. CRC enabled fs:

ninodes         runtime   sys time   read bw (IOPS)
64              1m53s     10m25s     220MB/s (27000)
256             1m52s     10m03s     220MB/s (27000)
1024            1m55s     10m08s     210MB/s (26000)

So when it's threaded, the small request size just doesn't matter -
there's enough IO to drive the system to being IOPS bound and that
limits performance.

Just to go full circle, the differences between 256 byte inodes, no
CRCs, and the CRC enabled filesystem for a single threaded bulkstat:

                256 byte inodes,        512 byte inodes
                CRCs disabled           CRCs enabled
                ---------------------------------------
ninodes         1024                    1024
wall time       5m22s                   6m37s
sys CPU         4m46s                   5m40s
bw (IOPS)       40MB/s (5000)           65MB/s (8000)
rate            155,300 inodes/s        126,000 inodes/s
efficiency      174,800 inodes/CPU/s    147,000 inodes/CPU/s

Both follow the same ninode profile, but there is less IO done for
the 256 byte inode filesystem and throughput is higher. There's no
big surprise there; what does surprise me is that the difference
isn't larger.
Let's drive it to being I/O bound with threading:

                256 byte inodes,        512 byte inodes
                CRCs disabled           CRCs enabled
                ---------------------------------------
ninodes         256                     256
wall time       1m02s                   1m52s
sys CPU         7m04s                   10m03s
bw/IOPS         210MB/s (27000)         220MB/s (27000)
rate            806,500 inodes/s        446,500 inodes/s
efficiency      117,900 inodes/CPU/s    82,900 inodes/CPU/s

The 256 byte inode test is completely CPU bound - it can't go any
faster than that, and it just so happens to be pretty close to IO
bound as well.

So, while there's double the throughput for 256 byte inodes, it
raises an interesting question: why are all the IOs only 8k in size?
That means the inode readahead that bulkstat is doing is not being
combined down in the elevator - it is either being cancelled because
there is too much, or it is being dispatched immediately and so we
are being IOPS limited long before we should be. i.e. there's still
500MB/s of bandwidth available on this filesystem and we're issuing
sequential adjacent 8k IO. Either way, it's not functioning as it
should.

<blktrace>

Yup, immediate, explicit unplug and dispatch. No readahead batching,
and the unplug is coming from _xfs_buf_ioapply(). Well, that is easy
to fix.

                256 byte inodes,        512 byte inodes
                CRCs disabled           CRCs enabled
                ---------------------------------------
ninodes         256                     256
wall time       1m02s                   1m08s
sys CPU         7m07s                   8m09s
bw/IOPS         210MB/s (13500)         360MB/s (14000)
rate            806,500 inodes/s        735,300 inodes/s
efficiency      117,100 inodes/CPU/s    102,200 inodes/CPU/s

So, the difference in performance pretty much goes away. We burn
more bandwidth, but now the multithreaded bulkstat is CPU limited
for both non-crc, 256 byte inodes and CRC enabled 512 byte inodes.

What this says to me is that there isn't a bulkstat performance
problem that we need to fix apart from the 3 lines of code for the
readahead IO plugging that I just added.
It's only limited by storage IOPS and available CPU power, yet the
bandwidth is sufficiently low that any storage system that SGI
installs for DMF is not going to be stressed by it. IOPS, yes.
Bandwidth, no.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Debunking myths about metadata CRC overhead
From: Geoffrey Wehrman @ 2013-06-04 21:27 UTC
To: Dave Chinner; +Cc: xfs

On Tue, Jun 04, 2013 at 12:43:29PM +1000, Dave Chinner wrote:
| On Mon, Jun 03, 2013 at 03:00:53PM -0500, Geoffrey Wehrman wrote:
| > On Mon, Jun 03, 2013 at 05:44:52PM +1000, Dave Chinner wrote:
| > | Hi folks,
| > |
| > | There has been some assertions made recently that metadata CRCs have
| > | too much overhead to always be enabled. So I'll run some quick
| > | benchmarks to demonstrate the "too much overhead" assertions are
| > | completely unfounded.
| >
| > Thank you, much appreciated.

| We've known about the VFS lock contention problem a lot longer than
| we've had the CRC code running. In case you hadn't been keeping up
| with this stuff, here's a quick summary of the work I've been doing
| with Glauber:
|
| http://lwn.net/Articles/550463/
| http://lwn.net/Articles/548092/
|
| So, while CRCs might be a trigger that makes the system fall off the
| cliff it is on the edge of, it is most certainly not a CRC problem,
| it is not a problem we can solve by changing the CRC code and it is
| not a problem we can solve by turning off CRCs. IOWs, CRCs are not
| the root cause of the degradation in performance.

Fair enough. It is good to know that the VFS lock contention problem
is being addressed. Thanks for the pointers to the summary of the
work you and Glauber have been doing.

| > Do I want to take a 5% performance hit in filesystem performance
| > and double the size of my inodes for an unproved feature? I am
| > still unconvinced that CRCs are a feature that I want to use.
| > Others may see enough benefit in CRCs to accept the performance
| > hit.
| > All I want is to ensure that I have the option going forward to
| > choose not to use CRCs without sacrificing other features
| > introduced in XFS.
|
| If you don't want to take the performance hit of SDM, then don't use
| it. You have that choice right now - either choose performance (v4
| superblocks) or reliability (v5 superblocks) at mkfs time.

That is exactly the capability I want.

| If new features are introduced that you want that are dependent on
| v5 superblocks and you want to stick with v4 superblocks for
| performance reasons, then you have to make a hard choice unless you
| address your concerns about v5 superblocks. Indeed, none of the
| performance issues you've mentioned are unsolvable problems - you
| just have to identify them and fix them before your customers need
| v5 superblocks.

This is the type of hard choice I want to avoid as much as possible.
My concern is that all future XFS features will be introduced as v5
superblock only features, regardless of whether they are directly
dependent on CRCs or not. I'm not expecting all future features to be
implemented for both v4 and v5 superblocks, but I would like to have
new features available for v4 superblocks when possible, at least
until the vast majority of systems deployed are v5 superblock
capable. Unfortunately this will take much longer than we would like.

| IOWs, you need to quantify the specific performance degradations you
| are concerned about and help fix them. We may have different
| priorities and goals, but that doesn't stop us from both being able
| to help each other reach our goals. But any such discussion about
| performance and problem areas needs to be based on quantified
| information, not handwaving.

I would love to be able to quantify and help fix the performance
degradations I am concerned about. Unfortunately there are just not
enough hours in a day. I will be honest, I am not an XFS developer.
I am an XFS consumer. The products I spend my time working on rely
on XFS as their foundation.
I don't even touch current XFS. I spend most of my time working with
XFS code that is a year old or more. Even then, I am not spending
much time with the XFS code itself, but rather with the code from the
products built on top of XFS. Call me an XFS consumer. It is like
buying an automobile. I don't review the CAD drawings of each part
used in the construction. I don't even examine the engine or
transmission. I don't take an automobile I'm looking at and hook it
up to a dyno to get a performance report. I rely on the manufacturer
to provide me with the performance information, and then I do my best
to analyze the data I have available.

| Geoffrey, can you start by identifying and quantifying two things on
| current top-of-tree kernels?
|
| 1. exactly where the problems with larger inodes are (on v4
|    superblocks)
| 2. workloads you care about where SDM significantly impacts
|    performance (i.e. v4 vs v5 superblocks)

I cannot identify and quantify any more than I already have. Bulk
scans are my primary concern, along with the potential doubling of
bandwidth required for a bulk scan. You address much of this in your
follow-up e-mail.

| We can discuss each case you raise on its merits and determine
| whether it needs to be addressed and, if so, how to address it.
| But we need quantified data to make any progress here.
|
| In the mean time, you can just use v4 superblocks like you currently
| do, but when the time comes to switch to v5 superblocks we will have
| corrected the identified problems and performance will not be an
| issue that you need to be concerned about.

I hope that is the case, and expect that it will be. I'm not
questioning your abilities. You are one of the best developers in the
community. I just want to be sure that I'm not forced into v5
superblocks before the identified problems have been resolved, and
that the work on v5 superblocks has minimal impact on my current use
of v4 superblocks.
The data you have provided has gone a long way to allay my concerns
about the metadata performance in XFS with CRCs. I'm not ready to
jump yet, but you have given me confidence that the jump should not
be as bad as I had expected.

On Tue, Jun 04, 2013 at 08:19:37PM +1000, Dave Chinner wrote:
| On Tue, Jun 04, 2013 at 12:43:29PM +1000, Dave Chinner wrote:
| > On Mon, Jun 03, 2013 at 03:00:53PM -0500, Geoffrey Wehrman wrote:
| > > On Mon, Jun 03, 2013 at 05:44:52PM +1000, Dave Chinner wrote:
| > > This will have significant impact
| > > on SGI's DMF managed filesystems.
| >
| > You're concerned about bulkstat performance, then? Bulkstat will CRC
| > every inode it reads, so the increase in inode size is the least of
| > your worries....
| >
| > But bulkstat scalability is an unrelated issue to the CRC work,
| > especially as bulkstat already needs application provided
| > parallelism to scale effectively.
...
| So, the difference in performance pretty much goes away. We burn
| more bandwidth, but now the multithreaded bulkstat is CPU limited
| for both non-crc, 256 byte inodes and CRC enabled 512 byte inodes.
|
| What this says to me is that there isn't a bulkstat performance
| problem that we need to fix apart from the 3 lines of code for the
| readahead IO plugging that I just added. It's only limited by
| storage IOPS and available CPU power, yet the bandwidth is
| sufficiently low that any storage system that SGI installs for DMF
| is not going to be stressed by it. IOPS, yes. Bandwidth, no.

What can I say but nice analysis. You've clearly shown that the
performance impact in bulk scans caused by CRCs can be easily offset
by changes elsewhere, improving bulkstat performance across the
board. I don't exactly follow what changes you made to
_xfs_buf_ioapply(), but expect that you will eventually post the
change.
-- 
Geoffrey Wehrman  651-683-5496  gwehrman@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Debunking myths about metadata CRC overhead
  2013-06-04 21:27 ` Geoffrey Wehrman
@ 2013-06-05  0:27   ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2013-06-05 0:27 UTC (permalink / raw)
  To: Geoffrey Wehrman; +Cc: xfs

On Tue, Jun 04, 2013 at 04:27:13PM -0500, Geoffrey Wehrman wrote:
> On Tue, Jun 04, 2013 at 12:43:29PM +1000, Dave Chinner wrote:
> | On Mon, Jun 03, 2013 at 03:00:53PM -0500, Geoffrey Wehrman wrote:
> | > On Mon, Jun 03, 2013 at 05:44:52PM +1000, Dave Chinner wrote:
> | > | Hi folks,
> | > |
> | > | There has been some assertions made recently that metadata CRCs have
> | > | too much overhead to always be enabled. So I'll run some quick
> | > | benchmarks to demonstrate the "too much overhead" assertions are
> | > | completely unfounded.
....
> | > Do I want to take a 5% performance hit in filesystem performance
> | > and double the size of my inodes for an unproved feature?  I am
> | > still unconvinced that CRCs are a feature that I want to use.
> | > Others may see enough benefit in CRCs to accept the performance
> | > hit.  All I want is to ensure that I have the option going forward
> | > to choose not to use CRCs without sacrificing other features
> | > introduced to XFS.
> |
> | If you don't want to take the performance hit of SDM, then don't use
> | it. You have that choice right now - either choose performance (v4
> | superblocks) or reliability (v5 superblocks) at mkfs time.
>
> That is exactly the capability I want.

And you have it, so I don't see what the fuss is all about.

> | If new features are introduced that you want that are dependent on
> | v5 superblocks and you want to stick with v4 superblocks for
> | performance reasons, then you have to make a hard choice unless you
> | address your concerns about v5 superblocks. Indeed, none of the
> | performance issues you've mentioned are unsolvable problems - you
> | just have to identify them and fix them before your customers need
> | v5 superblocks.
> This is the type of hard choice I want to avoid as much as possible.
> My concern is that all future XFS features will be introduced as v5
> superblock only features, regardless of whether they are directly
> dependent on CRCs or not.

No different to the v3->v4 transition. The old format was immediately
deprecated...

> I'm not expecting all future features to be
> implemented for both v4 and v5 superblocks, but I would like to have
> new features available for v4 superblocks when possible, at least
> until the vast majority of systems deployed are v5 superblock capable.
> Unfortunately this will take much longer than we would like.

v5 superblocks are the future, and upstream development is focussed
primarily on the future. Any new feature that requires an on-disk
format change is now going to be dependent on v5 superblocks. You're
welcome to backport such features to SGI supported kernels using v4
superblocks, but it's unrealistic to expect upstream to jump through
hoops to do this for you...

> | IOWs, you need to quantify the specific performance degradations you
> | are concerned about and help fix them. We may have different
> | priorities and goals, but that doesn't stop us from both being able
> | to help each other reach our goals. But any such discussion about
> | performance and problem areas needs to be based on quantified
> | information, not handwaving.
>
> I would love to be able to quantify and help fix the performance
> degradations I am concerned about. Unfortunately, there are just not
> enough hours in a day.

Delegate to your minions. ;)

> I will be honest, I am not an XFS developer. I am an XFS
> consumer. The products I spend my time working on rely on XFS as
> their foundation. I don't even touch current XFS. I spend most
> of my time working with XFS code that is a year old or more. Even
> then, I am not spending much time with the XFS code itself but
> rather with the code from the products built on top of XFS. Call me
> an XFS consumer.
> It is like buying an automobile.
> I don't review the CAD drawings of each part used in the
> construction. I don't even examine the engine or transmission. I
> don't take an automobile I'm looking at and hook it up to a dyno
> to get a performance report. I rely on the manufacturer to
> provide me with the performance information, and then I do my best
> to analyze the data I have available.

Yes, you are a downstream consumer, but that's a seriously bad
analogy. You aren't "buying an automobile" and relying on the
manufacturer to supply you with specifications and support for your
car. You are an expert mechanic who is getting a cheap car and a box
of parts from the local bazaar for nothing and giving it a starring
role in an episode of "Monster Garage".(*) You then sell that "new"
car and support it directly, because the original manufacturer doesn't
even recognise it anymore.

(*) The show where a team of expert mechanics and fabricators take
some standard vehicle, rip the guts out of it and rebuild it into some
entirely different contraption.

See, I can do bad car analogies, too. :/

Ignoring the bad analogies, my point still stands: if you want to make
claims about performance issues, you need to back them up with a
reproducible test case, numbers and analysis for them to be taken
seriously. Only after the problem has been demonstrated and reproduced
can we consider what changes *might* be necessary.

> I don't exactly follow what changes you made to _xfs_buf_ioapply(),
> but expect that you will eventually post the change.

None. I made changes to bulkstat. :)

And yes, I will post the patch in my next for-3.11 patchset....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2013-06-05  0:28 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-03  7:44 Debunking myths about metadata CRC overhead Dave Chinner
2013-06-03  9:10 ` Emmanuel Florac
2013-06-04  2:53 ` Dave Chinner
2013-06-04 16:20 ` Ben Myers
2013-06-04 22:06 ` Dave Chinner
2013-06-04 22:09 ` Ben Myers
2013-06-04 18:38 ` Chandra Seetharaman
2013-06-04 22:08 ` Dave Chinner
2013-06-04 22:40 ` Chandra Seetharaman
2013-06-04 22:59 ` Dave Chinner
2013-06-03 15:31 ` Troy McCorkell
2013-06-03 20:00 ` Geoffrey Wehrman
2013-06-04  2:43 ` Dave Chinner
2013-06-04 10:19 ` Dave Chinner
2013-06-04 21:27 ` Geoffrey Wehrman
2013-06-05  0:27 ` Dave Chinner