public inbox for linux-xfs@vger.kernel.org
* Debunking myths about metadata CRC overhead
@ 2013-06-03  7:44 Dave Chinner
  2013-06-03  9:10 ` Emmanuel Florac
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Dave Chinner @ 2013-06-03  7:44 UTC (permalink / raw)
  To: xfs

Hi folks,

There have been some assertions made recently that metadata CRCs have
too much overhead to always be enabled.  So I've run some quick
benchmarks to demonstrate the "too much overhead" assertions are
completely unfounded.

These are some numbers from my usual performance test VM. Note that
as this is a VM, it's not running the hardware CRC instructions so
I'm benchmarking the worst case overhead here. i.e. the kernel's
software CRC32c algorithm.

The VM is 8p, 8GB RAM, 4 node fake-numa config with a 100TB XFS
filesystem being used for testing. The fs is backed by 4x64GB SSDs
sliced via LVM into a 160GB RAID0 device with an XFS filesystem on
it to host the sparse 100TB image file. KVM is using
virtio,cache=none to use direct IO to write to the image file, and
the host is running a 3.8.5 kernel.

Baseline CRC32c performance
---------------------------

The VM runs the xfsprogs selftest program in:

crc32c: tests passed, 225944 bytes in 212 usec

so it can calculate CRCs at roughly 1GB/s on small, random chunks of
data through the software algorithm, according to this. Given the
fsmark create workload only drives around 100MB/s of metadata and
journal IO, the minimum CRC32c overhead we should see on a load
spread across 8 CPUs is roughly:

	100MB/s / 1000MB/s / 8p * 100% = 1.25% per CPU

So, in a perfect world, that's what we should see from the kernel
profiles. It's not a perfect world, though, so it will never be
this low (4 cores all trying to use the same memory bus at the same
time, perhaps?), so if we get anywhere near that number I'd be very
happy.
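
That back-of-the-envelope number can be reproduced from the selftest
output above (an illustrative script, not part of xfsprogs; the input
figures are from this email):

```python
# Estimate the worst-case software CRC32c overhead from the selftest
# output quoted above.
bytes_crced = 225944           # bytes hashed by the xfsprogs selftest
usec = 212                     # time taken, in microseconds
crc_rate = bytes_crced / usec  # bytes/usec is numerically MB/s: ~1066 MB/s

metadata_rate = 100            # MB/s of metadata + journal IO from fsmark
ncpus = 8                      # load is spread across 8 CPUs

overhead_pct = metadata_rate / crc_rate / ncpus * 100
print(f"CRC32c rate: {crc_rate:.0f} MB/s, "
      f"expected overhead: {overhead_pct:.2f}% per CPU")
```

Rounding the measured rate down to 1000MB/s gives the 1.25% figure
used in the text.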

Note that a hardware implementation should be faster than the SSE
optimised RAID5/6 calculations on the CPU, which come in at:

[    0.548004] raid6: sse2x4    7221 MB/s

which is a *lot* faster. So it's probably reasonable to assume
similar throughput for hardware CRC32c. Hence Intel
servers will have substantially lower CRC overhead than the software
CRC32c implementation being measured here.

fs_mark workload
----------------

$ sudo mkfs.xfs -f -m crc=1 -l size=512m,sunit=8 /dev/vdc

vs

$ sudo mkfs.xfs -f -l size=512m,sunit=8 /dev/vdc

8-way 50 million zero-length file create, 8-way
find+stat of all the files, 8-way unlink of all the files:

		no CRCs		CRCs		Difference
create	(time)	483s		510s		+5.2%	(slower)
	(rate)	109k+/-6k	105k+/-5.4k	-3.8%	(slower)

walk		339s		494s		-30.3%	(slower)
     (sys cpu)	1134s		1324s		+14.4%	(slower)

unlink		692s		959s		-27.8%(*) (slower)

(*) All the slowdown here is from the traversal slowdown as seen in
the walk phase. i.e. not related to the unlink operations.

On the surface, it looks like there's a huge impact on the walk and
unlink phases from CRC calculations, but these numbers don't tell
the whole story. Let's look deeper:

Create phase top CPU users (>1% total):

  5.59%  [kernel]  [k] _xfs_buf_find
  5.52%  [kernel]  [k] xfs_dir2_node_addname
  4.58%  [kernel]  [k] memcpy
  3.28%  [kernel]  [k] xfs_dir3_free_hdr_from_disk
  3.05%  [kernel]  [k] __ticket_spin_trylock
  2.94%  [kernel]  [k] __slab_alloc
  1.96%  [kernel]  [k] xfs_log_commit_cil
  1.93%  [kernel]  [k] __slab_free
  1.90%  [kernel]  [k] kmem_cache_alloc
  1.72%  [kernel]  [k] xfs_next_bit
  1.65%  [kernel]  [k] __crc32c_le
  1.52%  [kernel]  [k] _raw_spin_unlock_irqrestore
  1.50%  [kernel]  [k] do_raw_spin_lock
  1.42%  [kernel]  [k] kmem_cache_free
  1.32%  [kernel]  [k] native_read_tsc
  1.28%  [kernel]  [k] __kmalloc
  1.17%  [kernel]  [k] xfs_buf_offset
  1.14%  [kernel]  [k] delay_tsc
  1.14%  [kernel]  [k] kfree
  1.10%  [kernel]  [k] xfs_buf_item_format
  1.06%  [kernel]  [k] xfs_btree_lookup

CRC overhead is at 1.65%, not much higher than the optimum 1.25%
overhead calculated above. So the overhead really isn't that
significant - it's far less overhead than, say, the 1.2 million
buffer lookups a second we are doing (_xfs_buf_find overhead) in
this workload...

Walk phase top CPU users:

  6.64%  [kernel]  [k] __ticket_spin_trylock
  6.05%  [kernel]  [k] _xfs_buf_find
  5.58%  [kernel]  [k] _raw_spin_unlock_irqrestore
  4.88%  [kernel]  [k] _raw_spin_unlock_irq
  3.30%  [kernel]  [k] native_read_tsc
  2.93%  [kernel]  [k] __crc32c_le
  2.87%  [kernel]  [k] delay_tsc
  2.32%  [kernel]  [k] do_raw_spin_lock
  1.98%  [kernel]  [k] blk_flush_plug_list
  1.79%  [kernel]  [k] __slab_alloc
  1.76%  [kernel]  [k] __d_lookup_rcu
  1.56%  [kernel]  [k] kmem_cache_alloc
  1.25%  [kernel]  [k] kmem_cache_free
  1.25%  [kernel]  [k] xfs_da_read_buf
  1.11%  [kernel]  [k] xfs_dir2_leaf_search_hash
  1.08%  [kernel]  [k] flat_send_IPI_mask
  1.02%  [kernel]  [k] radix_tree_lookup_element
  1.00%  [kernel]  [k] do_raw_spin_unlock

There's more CRC32c overhead here, indicating lower efficiency, but
there's an obvious cause: the CRC overhead is dwarfed by something
else new - lock contention.  A quick 30s call graph profile during
the middle of the walk phase shows:

-  12.74%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 60.49% _raw_spin_lock
         + 91.79% inode_add_lru			>>> inode_lru_lock
         + 2.98% dentry_lru_del			>>> dcache_lru_lock
         + 1.30% shrink_dentry_list
         + 0.71% evict
      - 20.42% do_raw_spin_lock
         - _raw_spin_lock
            + 13.41% inode_add_lru		>>> inode_lru_lock
            + 10.55% evict
            + 8.26% dentry_lru_del		>>> dcache_lru_lock
            + 7.62% __remove_inode_hash
....
      - 10.37% do_raw_spin_trylock
         - _raw_spin_trylock
            + 79.65% prune_icache_sb		>>> inode_lru_lock
            + 11.04% shrink_dentry_list
            + 9.24% prune_dcache_sb		>>> dcache_lru_lock
      - 8.72% _raw_spin_trylock
         + 46.33% prune_icache_sb		>>> inode_lru_lock
         + 46.08% shrink_dentry_list
         + 7.60% prune_dcache_sb		>>> dcache_lru_lock

So the lock contention is variable - it's twice as high in this
short sample as the overall profile I measured above. It's also
pretty much all VFS cache LRU lock contention that is causing the
problems here. IOWs, the slowdowns are not related to the overhead
of CRC calculations; it's the change in memory access patterns
lowering the threshold of catastrophic lock contention that is
causing them. This VFS LRU problem is being fixed independently by the
generic numa-aware LRU list patchset I've been doing with Glauber
Costa.

Therefore, it is clear that the slowdown in this phase is not caused
by the overhead of CRCs, but that of lock contention elsewhere in
the kernel.  The unlink profiles show the same thing as the walk
profiles - additional lock contention on the lookup phase of the
unlink walk.

----

Dbench:

$ sudo mkfs.xfs -f -m crc=1 -l size=128m,sunit=8 /dev/vdc

vs

$ sudo mkfs.xfs -f -l size=128m,sunit=8 /dev/vdc

Running:

$ dbench -t 120 -D /mnt/scratch 8

		no CRCs		CRCs		Difference
thruput		1098.06 MB/s	1229.65 MB/s	+10% (faster)
latency (max)	22.385 ms	22.661 ms	+1.3% (noise)

Well, now that's an interesting result, isn't it? CRC enabled
filesystems are 10% faster than non-crc filesystems. Again, let's
not take that number at face value, but ask ourselves why adding
CRCs improves performance (a.k.a. "know your benchmark")...

It's pretty obvious why - dbench uses xattrs and performance is
sensitive to how many attributes can be stored inline in the inode.
And CRCs increase the inode size to 512 bytes meaning attributes are
probably never out of line. So, let's make it an even playing field
and compare:

$ sudo mkfs.xfs -f -m crc=1 -l size=128m,sunit=8 /dev/vdc

vs

$ sudo mkfs.xfs -f -i size=512 -l size=128m,sunit=8 /dev/vdc

		no CRCs		CRCs		Difference
thruput		1273.22 MB/s	1229.65 MB/s	-3.5% (slower)
latency (max)	25.455 ms	22.661 ms	-12.4% (better)
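
The percentage differences in that table can be recomputed taking the
no-CRC configuration as the base (a quick sketch; the posted figures
use slightly different bases, so they differ by a few tenths of a
percent):

```python
# Recompute the dbench comparison above, with the 512-byte-inode
# no-CRC filesystem as the base. Numbers are from the table in this
# email.
no_crc_tput, crc_tput = 1273.22, 1229.65   # MB/s
no_crc_lat, crc_lat = 25.455, 22.661       # ms (max latency)

tput_delta = (crc_tput - no_crc_tput) / no_crc_tput * 100  # slower
lat_delta = (crc_lat - no_crc_lat) / no_crc_lat * 100      # better
print(f"throughput: {tput_delta:+.1f}%, max latency: {lat_delta:+.1f}%")
```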

So, we're back to the same relatively small difference seen in the
fsmark create phase, with similar CRC overhead being shown in the
profiles.

----

Compilebench

Testing the same filesystems with 512 byte inodes as for dbench:

$ ./compilebench -D /mnt/scratch
using working directory /mnt/scratch, 30 intial dirs 100 runs
.....

test				no CRCs		CRCs
			runs	avg		avg
==========================================================================
intial create		30	92.12 MB/s	90.24 MB/s
create			14	61.91 MB/s	61.13 MB/s
patch			15	41.04 MB/s	38.00 MB/s
compile			14	278.74 MB/s	262.00 MB/s
clean			10	1355.30 MB/s	1296.17 MB/s
read tree		11	25.68 MB/s	25.40 MB/s
read compiled tree	4	48.74 MB/s	48.65 MB/s
delete tree		10	2.97 seconds	3.05 seconds
delete compiled tree	4	2.96 seconds	3.05 seconds
stat tree		11	1.33 seconds	1.36 seconds
stat compiled tree	7	1.86 seconds	1.64 seconds

The numbers are so close that the differences are in the noise, and
the CRC overhead doesn't even show up in the ">1% usage" section
of the profile output.

----

Looking at these numbers realistically, dbench and compilebench
model two fairly common metadata intensive workloads - file servers
and code tree manipulations that developers tend to use all the
time. The difference that CRCs make to performance in these
workloads on equivalently configured filesystems varies between
0-5%, and for most operations they are small enough that they can
just about be considered to be noise.

Yes, we could argue over the fsmark walk/unlink phase results, but
the synthetic fsmark workload is designed to push the system to its
limits and it's obvious that the addition of CRCs pushes the VFS into
lock contention hell. Further, we have to recognise that the same
workload on a 12p VM (run 12-way instead of 8-way) without CRCs hits
the same lock contention problem. IOWs, the slowdown is most
definitely not caused by the addition of CRC calculations to XFS
metadata.

The CPU overhead of CRCs is small and may be outweighed by other
changes for CRC filesystems that improve performance far more than
the cost of CRC calculations degrades it.  The numbers above simply
don't support the assertion that metadata CRCs have "too much
overhead".

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Debunking myths about metadata CRC overhead
  2013-06-03  7:44 Debunking myths about metadata CRC overhead Dave Chinner
@ 2013-06-03  9:10 ` Emmanuel Florac
  2013-06-04  2:53   ` Dave Chinner
  2013-06-03 15:31 ` Troy McCorkell
  2013-06-03 20:00 ` Geoffrey Wehrman
  2 siblings, 1 reply; 16+ messages in thread
From: Emmanuel Florac @ 2013-06-03  9:10 UTC (permalink / raw)
  Cc: xfs

On Mon, 3 Jun 2013 17:44:52 +1000, you wrote:

> There have been some assertions made recently that metadata CRCs have
> too much overhead to always be enabled.  So I've run some quick
> benchmarks to demonstrate the "too much overhead" assertions are
> completely unfounded.

Just a quick question: what is the minimal kernel version and xfsprogs
version needed to run xfs with metadata CRC? I'd happily test it on
real hardware, I have a couple of storage servers in test in the 40 to
108 TB range.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: Debunking myths about metadata CRC overhead
  2013-06-03  7:44 Debunking myths about metadata CRC overhead Dave Chinner
  2013-06-03  9:10 ` Emmanuel Florac
@ 2013-06-03 15:31 ` Troy McCorkell
  2013-06-03 20:00 ` Geoffrey Wehrman
  2 siblings, 0 replies; 16+ messages in thread
From: Troy McCorkell @ 2013-06-03 15:31 UTC (permalink / raw)
  To: xfs

On 06/03/2013 02:44 AM, Dave Chinner wrote:
> Hi folks,
>
> There have been some assertions made recently that metadata CRCs have
> too much overhead to always be enabled.  So I've run some quick
> benchmarks to demonstrate the "too much overhead" assertions are
> completely unfounded.
Dave,

Thanks for generating, gathering, and providing this data.

Thanks,
Troy


* Re: Debunking myths about metadata CRC overhead
  2013-06-03  7:44 Debunking myths about metadata CRC overhead Dave Chinner
  2013-06-03  9:10 ` Emmanuel Florac
  2013-06-03 15:31 ` Troy McCorkell
@ 2013-06-03 20:00 ` Geoffrey Wehrman
  2013-06-04  2:43   ` Dave Chinner
  2 siblings, 1 reply; 16+ messages in thread
From: Geoffrey Wehrman @ 2013-06-03 20:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Jun 03, 2013 at 05:44:52PM +1000, Dave Chinner wrote:
| Hi folks,
| 
| There have been some assertions made recently that metadata CRCs have
| too much overhead to always be enabled.  So I've run some quick
| benchmarks to demonstrate the "too much overhead" assertions are
| completely unfounded.

Thank you, much appreciated.

| fs_mark workload
| ----------------
...
| So the lock contention is variable - it's twice as high in this
| short sample as the overall profile I measured above. It's also
| pretty much all VFS cache LRU lock contention that is causing the
| problems here. IOWs, the slowdowns are not related to the overhead
| of CRC calculations; it's the change in memory access patterns that
| are lowering the threshold of catastrophic lock contention that is
| causing it. This VFS LRU problem is being fixed independently by the
| generic numa-aware LRU list patchset I've been doing with Glauber
| Costa.
| 
| Therefore, it is clear that the slowdown in this phase is not caused
| by the overhead of CRCs, but that of lock contention elsewhere in
| the kernel.  The unlink profiles show the same the thing as the walk
| profiles - additional lock contention on the lookup phase of the
| unlink walk.

I get it that the slowdown is not caused by the numerical operations to
calculate the CRCs, but as an overall feature, I don't see how you can
say that CRCs are not responsible for the slowdown.  If CRCs are
introducing lock contention, it doesn't matter if that lock contention
is in XFS code or elsewhere in the kernel, it is still a slowdown which
can be attributed to the CRC feature.  Spin it as you like, it still
appears to me that there's a huge impact on the walk and unlink phases
from CRC calculations.

| ----
| 
| Dbench:
...
| Well, now that's an interesting result, isn't it. CRC enabled
| filesystems are 10% faster than non-crc filesystems. Again, let's
| not take that number at face value, but ask ourselves why adding
| CRCs improves performance (a.k.a. "know your benchmark")...
| 
| It's pretty obvious why - dbench uses xattrs and performance is
| sensitive to how many attributes can be stored inline in the inode.
| And CRCs increase the inode size to 512 bytes meaning attributes are
| probably never out of line. So, let's make it an even playing field
| and compare:

CRC filesystems default to 512 byte inodes?  I wasn't aware of that.
Sure, CRC filesystems are able to move more volume, but the metadata is
half as dense as it was before.  I'm not a dbench expert, so I have
no idea what the ratio of metadata to data is here, so I really don't
know what conclusions to draw from the dbench results.

What really bothers me is the default of 512 byte inodes for CRCs.  That
means my inodes take up twice as much space on disk, and will require
2X the bandwidth to read from disk.  This will have significant impact
on SGI's DMF managed filesystems.  I know you don't care about SGI's
DMF, but this will also have a significant performance impact on
xfsdump, xfsrestore, and xfs_repair.  These performance benchmarks are
just as important to me as dbench and compilebench.

| ----
| 
| Compilebench
| 
| Testing the same filesystems with 512 byte inodes as for dbench:
| 
| $ ./compilebench -D /mnt/scratch
| using working directory /mnt/scratch, 30 intial dirs 100 runs
| .....
| 
| test				no CRCs		CRCs
| 			runs	avg		avg
| ==========================================================================
| intial create		30	92.12 MB/s	90.24 MB/s
| create			14	61.91 MB/s	61.13 MB/s
| patch			15	41.04 MB/s	38.00 MB/s
| compile			14	278.74 MB/s	262.00 MB/s
| clean			10	1355.30 MB/s	1296.17 MB/s
| read tree		11	25.68 MB/s	25.40 MB/s
| read compiled tree	4	48.74 MB/s	48.65 MB/s
| delete tree		10	2.97 seconds	3.05 seconds
| delete compiled tree	4	2.96 seconds	3.05 seconds
| stat tree		11	1.33 seconds	1.36 seconds
| stat compiled tree	7	1.86 seconds	1.64 seconds
| 
| The numbers are so close that the differences are in the noise, and
| the CRC overhead doesn't even show up in the ">1% usage" section
| of the profile output.

What really surprises me in these results is the hit that the compile
phase takes.  That is a 6% performance drop in an area where I expect
the CRCs to have limited effect.  To me, the results show a rather
consistent performance drop of up to 6%, which is sufficient to support
my assertion that the CRC overhead may outweigh the benefits.

| ----
| 
| Looking at these numbers realistically, dbench and compilebench
| model two fairly common metadata intensive workloads - file servers
| and code tree manipulations that developers tend to use all the
| time. The difference that CRCs make to performance in these
| workloads on equivalently configured filesystems varies between
| 0-5%, and for most operations they are small enough that they can
| just about be considered to be noise.
| 
| Yes, we could argue over the fsmark walk/unlink phase results, but
| the synthetic fsmark workload is designed to push the system to it's
| limits and it's obvious that the addition of CRCs pushes the VFS into
| lock contention hell. Further, we have to recognise that the same
| workload on a 12p VM (run 12-way instead of 8-way) without CRCs hits
| the same lock contention problem. IOWs, the slowdown is most
| definitely not caused by the addition of CRC calculations to XFS
| metadata.
| 
| The CPU overhead of CRCs is small and may be outweighed by other
| changes for CRC filesystems that improve performance far more than
| the cost of CRC calculations degrades it.  The numbers above simply
| don't support the assertion that metadata CRCs have "too much
| overhead".

Do I want to take a 5% hit in filesystem performance and double the
size of my inodes for an unproven feature?  I am still unconvinced
that CRCs are a feature that I want to use.  Others may see enough
benefit in CRCs to accept the performance hit.  All I want is to
ensure that I have the option going forward to choose not to use
CRCs without sacrificing other features introduced in XFS.


-- 
Geoffrey Wehrman
SGI Building 10                             Office: (651)683-5496
2750 Blue Water Road                           Fax: (651)683-5098
Eagan, MN 55121                             E-mail: gwehrman@sgi.com
	  http://www.sgi.com/products/storage/software/


* Re: Debunking myths about metadata CRC overhead
  2013-06-03 20:00 ` Geoffrey Wehrman
@ 2013-06-04  2:43   ` Dave Chinner
  2013-06-04 10:19     ` Dave Chinner
  2013-06-04 21:27     ` Geoffrey Wehrman
  0 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2013-06-04  2:43 UTC (permalink / raw)
  To: Geoffrey Wehrman; +Cc: xfs

On Mon, Jun 03, 2013 at 03:00:53PM -0500, Geoffrey Wehrman wrote:
> On Mon, Jun 03, 2013 at 05:44:52PM +1000, Dave Chinner wrote:
> | Hi folks,
> | 
> | There have been some assertions made recently that metadata CRCs have
> | too much overhead to always be enabled.  So I've run some quick
> | benchmarks to demonstrate the "too much overhead" assertions are
> | completely unfounded.
> 
> Thank you, much appreciated.
> 
> | fs_mark workload
> | ----------------
> ...
> | So the lock contention is variable - it's twice as high in this
> | short sample as the overall profile I measured above. It's also
> | pretty much all VFS cache LRU lock contention that is causing the
> | problems here. IOWs, the slowdowns are not related to the overhead
> | of CRC calculations; it's the change in memory access patterns that
> | are lowering the threshold of catastrophic lock contention that is
> | causing it. This VFS LRU problem is being fixed independently by the
> | generic numa-aware LRU list patchset I've been doing with Glauber
> | Costa.
> | 
> | Therefore, it is clear that the slowdown in this phase is not caused
> | by the overhead of CRCs, but that of lock contention elsewhere in
> | the kernel.  The unlink profiles show the same the thing as the walk
> | profiles - additional lock contention on the lookup phase of the
> | unlink walk.
> 
> I get it that the slowdown is not caused by the numerical operations to
> calculate the CRCs, but as a overall feature, I don't see how you can
> say that CRCs are not responsible for the slowdown.

I can trigger the VFS lock contention in a similar manner by running
a userspace application that memcpy()s a 128MB buffer repeatedly.
It's simply a case of increased memory bus traffic causing cacheline
bouncing, which makes the lock contention spiral out of control.
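
A minimal sketch of that kind of reproducer (a hypothetical
reconstruction, not the actual program Dave ran) looks like:

```python
# Sketch of the memory-bus-pressure reproducer described above:
# repeatedly copying a 128MB buffer generates sustained memory traffic,
# much like the kernel's software CRC32c code does under load.
SIZE = 128 * 1024 * 1024
src = bytearray(SIZE)

for _ in range(4):        # run long enough to contend for the memory bus
    dst = bytes(src)      # full 128MB copy each iteration
```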

> If CRCs are
> introducing lock contention, it doesn't matter if that lock contention
> is in XFS code or elsewhere in the kernel, it is still a slowdown which
> can be attributed to the CRC feature.  Spin it as you like, it still
> appears to me that there's a huge impact on the walk and unlink phases
> from CRC calculations.

So by that logic, userspace memcpy() causes lock contention in the
VFS, and so therefore the problem is the userspace application, not
the kernel code. And the solution is not to run that userspace code.

Three words: Root Cause Analysis.

We've known about the VFS lock contention problem for a lot longer
than the CRC code has been running.  In case you hadn't been
keeping up with this stuff, here's a quick summary of the work I've
been doing with Glauber:

http://lwn.net/Articles/550463/
http://lwn.net/Articles/548092/

So, while CRCs might be a trigger that makes the system fall off the
cliff it is on the edge of, it is most certainly not a CRC problem,
it is not a problem we can solve by changing the CRC code and it is
not a problem we can solve by turning off CRCs.  IOWs, CRCs are not
the root cause of the degradation in performance.

> | ----
> | 
> | Dbench:
> ...
> | Well, now that's an interesting result, isn't it. CRC enabled
> | filesystems are 10% faster than non-crc filesystems. Again, let's
> | not take that number at face value, but ask ourselves why adding
> | CRCs improves performance (a.k.a. "know your benchmark")...
> | 
> | It's pretty obvious why - dbench uses xattrs and performance is
> | sensitive to how many attributes can be stored inline in the inode.
> | And CRCs increase the inode size to 512 bytes meaning attributes are
> | probably never out of line. So, let's make it an even playing field
> | and compare:
> 
> CRC filesystems default to 512 byte inodes?  I wasn't aware of that.

That's been the plan of record since 2008, as the increase in the
size of the inode core leaves 256 byte inodes with less literal area
space than attr=1 configurations.

> Sure, CRC filesystems are able to move more volume, but the metadata is
> half the density as it was before.  I'm not a dbench expert, so I have
> no idea what the ratio of metadata to data is here, so I really don't
> know what conclusions to draw from the dbench results.

So perhaps you should trust someone who is an expert to analyse the
results for you? :)

FYI, dbench is log IO bound, not metadata or data IO bound.
Performance drops with out-of-line attributes because attribute
block IO steals IOPS from the log IO, so processes block for
longer in fsync, which lowers throughput and increases measured
latency. IOWs, the performance differential that inode sizes give
is all due to less IO being needed for attribute manipulations.


> What really bothers me is the default of 512 byte inodes for CRCs.  That
> means my inodes take up twice as much space on disk, and will require
> 2X the bandwidth to read from disk.

Metadata read IO is latency bound, not bandwidth bound.  The
increase in metadata IO bandwidth doesn't make any measurable
difference on a typical modern storage system.

> This will have significant impact
> on SGI's DMF managed filesystems.

You're concerned about bulkstat performance, then? Bulkstat will CRC
every inode it reads, so the increase in inode size is the least of
your worries....

But bulkstat scalability is an unrelated issue to the CRC work,
especially as bulkstat already needs application provided
parallelism to scale effectively.

> I know you don't care about SGI's
> DMF, but this will also have a significant performance impact on
> xfsdump, xfsrestore, and xfs_repair.  These performance benchmarks are
> just as important to me as dbench and compilebench.

Sure. But the changes for SDM (self describing metadata) are not
introducing any new performance problems we don't already have. I'm
perfectly OK with that, and it's pretty clear that correcting any
such issues is not related to the implementation of SDM.

> | Compilebench
> | 
> | Testing the same filesystems with 512 byte inodes as for dbench:
> | 
> | $ ./compilebench -D /mnt/scratch
> | using working directory /mnt/scratch, 30 intial dirs 100 runs
> | .....
> | 
> | test				no CRCs		CRCs
> | 			runs	avg		avg
> | ==========================================================================
> | intial create		30	92.12 MB/s	90.24 MB/s
> | create			14	61.91 MB/s	61.13 MB/s
> | patch			15	41.04 MB/s	38.00 MB/s
> | compile			14	278.74 MB/s	262.00 MB/s
> | clean			10	1355.30 MB/s	1296.17 MB/s
> | read tree		11	25.68 MB/s	25.40 MB/s
> | read compiled tree	4	48.74 MB/s	48.65 MB/s
> | delete tree		10	2.97 seconds	3.05 seconds
> | delete compiled tree	4	2.96 seconds	3.05 seconds
> | stat tree		11	1.33 seconds	1.36 seconds
> | stat compiled tree	7	1.86 seconds	1.64 seconds
> | 
> | The numbers are so close that the differences are in the noise, and
> | the CRC overhead doesn't even show up in the ">1% usage" section
> | of the profile output.
> 
> What really surprises me in these results is the hit that the compile
> phase takes.  That is a 6% performance drop in an area where I expect
> the CRCs to have limited effect.  To me, the results show a rather
> consistent performance drop of up to 6%, and is sufficient to support my
> assertion that the CRCs overhead may outweigh the benefits.

You're making an assumption that 6% is actually meaningful. It's
not.  Here are the raw numbers for that phase throughout the
benchmark:

compile dir kernel-7 691MB in 1.98 seconds (349.29 MB/s)
compile dir kernel-14 680MB in 2.67 seconds (254.92 MB/s)
compile dir kernel-2 680MB in 1.81 seconds (376.04 MB/s)
compile dir kernel-2 691MB in 1.94 seconds (356.49 MB/s)
compile dir kernel-7 691MB in 2.16 seconds (320.18 MB/s)
compile dir kernel-2 691MB in 1.97 seconds (351.06 MB/s)
compile dir kernel-26 680MB in 3.13 seconds (217.46 MB/s)
compile dir kernel-14 691MB in 3.03 seconds (228.25 MB/s)
compile dir kernel-70151 691MB in 3.38 seconds (204.61 MB/s)
compile dir kernel-27 691MB in 4.14 seconds (167.05 MB/s)
compile dir kernel-18 680MB in 2.72 seconds (250.23 MB/s)
compile dir kernel-2 691MB in 2.25 seconds (307.38 MB/s)
compile dir kernel-17 680MB in 2.83 seconds (240.51 MB/s)

So, to summarise the numbers for the compile phase we have:

	min:	167.05 MB/s
	max:	376.04 MB/s
	avg:	262.00 MB/s
	stddev: 65 MB/s (25%!)

So, that difference of 16MB/s from run to run is well within the
standard deviation of the results of that phase. I just did another
run on a CRC enabled filesystem:

compile total runs 14 avg 291.30 MB/s (user 0.13s sys 0.77s)

Which is still within a single stddev of the above number and hence
is not significant. IOWs, there's a lot of variability within any
specific phase from run to run in this benchmark and for this phase
a 6% difference is well within the noise.
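
Those summary statistics can be reproduced from the per-run output
quoted above (only 13 runs are listed even though compilebench reports
14, so the mean computed here differs from its 262.00 MB/s average;
min, max and stddev line up):

```python
import statistics

# Throughputs (MB/s) from the per-run compile output quoted above.
runs = [349.29, 254.92, 376.04, 356.49, 320.18, 351.06, 217.46,
        228.25, 204.61, 167.05, 250.23, 307.38, 240.51]

print(f"min: {min(runs):.2f} MB/s, max: {max(runs):.2f} MB/s")
# ~65 MB/s, i.e. roughly 25% of the mean - the spread quoted above
print(f"stddev: {statistics.pstdev(runs):.0f} MB/s")
```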

Like I said - I use benchmarks that I understand. If I say that the
differences are "in the noise" I really do mean that they are "in
the noise". I don't play games with numbers - benchmarketing is one
of my pet peeves and it's something I do not do on principle.

> Do I want to take a 5% performance hit in filesystem performance
> and double the size of my inodes for an unproved feature?  I am
> still unconvinced that CRCs are a feature that I want to use.
> Others may see enough benefit in CRCs to accept the performance
> hit.  All I want is to ensure that I the option going forward to
> chose not to use CRCs without sacrificing other features
> introduced XFS.

If you don't want to take the performance hit of SDM, then don't use
it. You have that choice right now - either choose performance (v4
superblocks) or reliability (v5 superblocks) at mkfs time.

If new features are introduced that you want that are dependent on
v5 superblocks and you want to stick with v4 superblocks for
performance reasons, then you have to make a hard choice unless you
address your concerns about v5 superblocks. Indeed, none of the
performance issues you've mentioned are unsolvable problems - you
just have to identify them and fix them before your customers need
v5 superblocks.

IOWs, you need to quantify the specific performance degradations you
are concerned about and help fix them. We may have different
priorities and goals, but that doesn't stop us from both being able
to help each other reach our goals. But any such discussion about
performance and problem areas needs to be based on quantified
information, not handwaving.

Geoffrey, can you start by identifying and quantifying two things on
current top-of-tree kernels?

	1. exactly where the problems with larger inodes are (on v4
	   superblocks)
	2. workloads you care about where SDM significantly impacts
	   performance (i.e. v4 vs v5 superblocks)

We can discuss each case you raise on their merits and determine
whether they need to be addressed and, if so, how to address them.
But we need quantified data to make any progress here.

In the mean time, you can just use v4 superblocks like you currently
do, but when the time comes to switch to v5 superblocks we will have
corrected the identified problems and performance will not be an
issue that you need to be concerned about.

Cheers,

Dave.


-- 
Dave Chinner
david@fromorbit.com


* Re: Debunking myths about metadata CRC overhead
  2013-06-03  9:10 ` Emmanuel Florac
@ 2013-06-04  2:53   ` Dave Chinner
  2013-06-04 16:20     ` Ben Myers
  2013-06-04 18:38     ` Chandra Seetharaman
  0 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2013-06-04  2:53 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: xfs

On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> On Mon, 3 Jun 2013 17:44:52 +1000, you wrote:
> 
> > There have been some assertions made recently that metadata CRCs have
> > too much overhead to always be enabled.  So I've run some quick
> > benchmarks to demonstrate the "too much overhead" assertions are
> > completely unfounded.
> 
> Just a quick question: what is the minimal kernel version and xfsprogs
> version needed to run xfs with metadata CRC? I'd happily test it on
> real hardware, I have a couple of storage servers in test in the 40 to
> 108 TB range.

If the maintainers merge all the patches I send for the 3.10-rc
series, then the 3.10 release should be stable enough to use for
testing with data you don't care if you lose.

As for the userspace code - that is still just a patchset. I haven't
had any feedback from the maintainers about it in the past month, so
I've got no idea what they are doing with it. I'll post out a new
version in the next couple of days - it's 50-odd patches by now, so
it'd be nice to have it in the xfsprogs git tree so people could
just pull it and build it for testing purposes by the time that 3.10
releases....

Cheers,

Dave
-- 
Dave Chinner
david@fromorbit.com


* Re: Debunking myths about metadata CRC overhead
  2013-06-04  2:43   ` Dave Chinner
@ 2013-06-04 10:19     ` Dave Chinner
  2013-06-04 21:27     ` Geoffrey Wehrman
  1 sibling, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2013-06-04 10:19 UTC (permalink / raw)
  To: Geoffrey Wehrman; +Cc: xfs

On Tue, Jun 04, 2013 at 12:43:29PM +1000, Dave Chinner wrote:
> On Mon, Jun 03, 2013 at 03:00:53PM -0500, Geoffrey Wehrman wrote:
> > On Mon, Jun 03, 2013 at 05:44:52PM +1000, Dave Chinner wrote:
> > This will have significant impact
> > on SGI's DMF managed filesystems.
> 
> You're concerned about bulkstat performance, then? Bulkstat will CRC
> every inode it reads, so the increase in inode size is the least of
> your worries....
> 
> But bulkstat scalability is an unrelated issue to the CRC work,
> especially as bulkstat already needs application provided
> parallelism to scale effectively.

So, I just added a single threaded bulkstat pass to the fsmark
workload by passing xfs_fsr across the filesystem to test out what
impact it has. So, 50 million inodes in the directory structure:

		256 byte inodes,	512 byte inodes
		CRCs disabled		CRCs enabled
		---------------------------------------
wall time	13m34.203s		14m15.266s
sys CPU		7m7.930s		8m52.050s
rate		61,425 inodes/s		58,479 inodes/s
efficiency	116,800 inodes/CPU/s	93,984 inodes/CPU/s

So, really, the performance differential is not particularly
significant. Certainly the larger inodes don't cause any significant
problem.  For comparison, the 8-way find workloads:

		256 byte inodes,	512 byte inodes
		CRCs disabled		CRCs enabled
		---------------------------------------
wall time	5m33.165s		8m18.256s
sys CPU		18m36.731s		22m2.277s
rate		150,055 inodes/s	100,400 inodes/s
efficiency	44,800 inodes/CPU/s	37,800 inodes/CPU/s

Which makes me think something is not right with the bulkstat pass
I've just done - it's way too slow if a find+stat is 2-2.5x faster.

Ah, xfs_fsr only bulkstats 64 inodes at a time. That's right,
last time I did this I used bstat out of xfstests. On a CRC enabled
fs:

ninodes		runtime		sys time	read bw(IOPS)
64		14m01s		8m37s
128		11m20s		7m58s		35MB/s(5000)
256		 8m53s		7m24s		45MB/s(6000)
512		 7m24s		6m28s		55MB/s(7000)
1024		 6m37s		5m40s		65MB/s(8000)
2048		10m50s		6m51s		35MB/s(5000)
4096(default)	26m23s		8m38s

Ask bulkstat for too few or too many inodes at a time, and it all goes to hell.  So
if we get the bulkstat config right, a single threaded bulkstat is
faster than the 8-way find, and a whole lot more efficient at it.
But still, there is effectively no performance differential worth
talking about between 256 byte and 512 byte inodes.

And, FWIW, I just hacked threading into bstat to run a thread per
AG, each scanning only its own AG. It's not perfect - it counts
some inodes twice (threads*ninodes at most) before it detects it's
run into the next AG. This is on a 100TB filesystem, so it runs 100
threads. CRC enabled fs:

ninodes		runtime		sys time	read bw(IOPS)
64		 1m53s		10m25s		220MB/s (27000)
256		 1m52s		10m03s		220MB/s (27000)
1024		 1m55s		10m08s		210MB/s (26000)

So when it's threaded, the small request size just doesn't matter -
there's enough IO to drive the system to being IOPS bound and that
limits performance.

Just to go full circle, the differences between 256 byte inodes, no
CRCs and the crc enabled filesystem for a single threaded bulkstat:

		256 byte inodes,	512 byte inodes
		CRCs disabled		CRCs enabled
		---------------------------------------
ninodes		   1024			     1024
wall time	   5m22s		     6m37s
sys CPU		   4m46s		     5m40s
bw(IOPS)	     40MB/s(5000)	     65MB/s(8000)
rate		155,300 inodes/s	126,000 inodes/s
efficiency	174,800 inodes/CPU/s	147,000 inodes/CPU/s

Both follow the same ninode profile, but there is less IO done for
the 256 byte inode filesystem and throughput is higher. There's no
big surprise there; what does surprise me is that the difference
isn't larger. Let's drive it to being I/O bound with threading:

		256 byte inodes,	512 byte inodes
		CRCs disabled		CRCs enabled
		---------------------------------------
ninodes		    256			    256
wall time	   1m02s		   1m52s
sys CPU		   7m04s		  10m03s
bw/IOPS		    210MB/s (27000)	    220MB/s (27000)
rate		806,500 inodes/s	446,500 inodes/s
efficiency	117,900 inodes/CPU/s	 82,900 inodes/CPU/s

The 256 byte inode test is completely CPU bound - it can't go any
faster than that, and it just so happens to be pretty close to IO
bound as well. So, while there's double the throughput for 256 byte
inodes, it raises an interesting question: why are all the IOs only
8k in size?

That means the inode readahead that bulkstat is doing is not being
combined down in the elevator - it is either being cancelled because
there is too much, or it is being dispatched immediately and so we
are being IOPS limited long before we should be. i.e. there's still
500MB of bandwidth available on this filesystem and we're issuing
sequential adjacent 8k IO.  Either way, it's not functioning as it
should.

<blktrace>

Yup, immediate, explicit unplug and dispatch. No readahead batching
and the unplug is coming from _xfs_buf_ioapply().  Well, that is
easy to fix.

		256 byte inodes,	512 byte inodes
		CRCs disabled		CRCs enabled
		---------------------------------------
ninodes		    256			    256
wall time	   1m02s		   1m08s
sys CPU		   7m07s		   8m09s
bw/IOPS		    210MB/s (13500)	    360MB/s (14000)
rate		806,500 inodes/s	735,300 inodes/s
efficiency	117,100 inodes/CPU/s	102,200 inodes/CPU/s

So, the difference in performance pretty much goes away. We burn
more bandwidth, but now the multithreaded bulkstat is CPU limited
for both non-crc, 256 byte inodes and CRC enabled 512 byte inodes.

What this says to me is that there isn't a bulkstat performance
problem that we need to fix apart from the 3 lines of code for the
readahead IO plugging that I just added.  It's only limited by
storage IOPS and available CPU power, yet the bandwidth is
sufficiently low that any storage system that SGI installs for DMF
is not going to be stressed by it. IOPS, yes. Bandwidth, no.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Debunking myths about metadata CRC overhead
  2013-06-04  2:53   ` Dave Chinner
@ 2013-06-04 16:20     ` Ben Myers
  2013-06-04 22:06       ` Dave Chinner
  2013-06-04 18:38     ` Chandra Seetharaman
  1 sibling, 1 reply; 16+ messages in thread
From: Ben Myers @ 2013-06-04 16:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Dave,

On Tue, Jun 04, 2013 at 12:53:07PM +1000, Dave Chinner wrote:
> On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> > Le Mon, 3 Jun 2013 17:44:52 +1000 vous écriviez:
> > 
> > > There has been some assertions made recently that metadata CRCs have
> > > too much overhead to always be enabled.  So I'll run some quick
> > > benchmarks to demonstrate the "too much overhead" assertions are
> > > completely unfounded.
> > 
> > Just a quick question: what is the minimal kernel version and xfsprogs
> > version needed to run xfs with metadata CRC? I'd happily test it on
> > real hardware, I have a couple of storage servers in test in the 40 to
> > 108 TB range.
> 
> If the maintainers merge all the patches I send for the 3.10-rc
> series, then the 3.10 release should be stable enough to use for
> testing with data you don't care if you lose.
> 
> As for the userspace code - that is still just a patchset. I haven't
> had any feedback from the maintainers about it in the past month, so
> I've got no idea what they are doing with it. I'll post out a new
> version in the next couple of days - it's 50-odd patches by now, so
> it'd be nice to have it in the xfsprogs git tree so people could
> just pull it and build it for testing purposes by the time that 3.10
> releases....

When it is reviewed and adequately tested we'll pull it in.  Until then
Emmanuel will need to pull down the patchset.  Right now the focus is on
3.10.

-Ben


* Re: Debunking myths about metadata CRC overhead
  2013-06-04  2:53   ` Dave Chinner
  2013-06-04 16:20     ` Ben Myers
@ 2013-06-04 18:38     ` Chandra Seetharaman
  2013-06-04 22:08       ` Dave Chinner
  1 sibling, 1 reply; 16+ messages in thread
From: Chandra Seetharaman @ 2013-06-04 18:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2013-06-04 at 12:53 +1000, Dave Chinner wrote:
> On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> > Le Mon, 3 Jun 2013 17:44:52 +1000 vous écriviez:
> > 
> > > There has been some assertions made recently that metadata CRCs have
> > > too much overhead to always be enabled.  So I'll run some quick
> > > benchmarks to demonstrate the "too much overhead" assertions are
> > > completely unfounded.
> > 
> > Just a quick question: what is the minimal kernel version and xfsprogs
> > version needed to run xfs with metadata CRC? I'd happily test it on
> > real hardware, I have a couple of storage servers in test in the 40 to
> > 108 TB range.
> 
> If the maintainers merge all the patches I send for the 3.10-rc
> series, then the 3.10 release should be stable enough to use for
> testing with data you don't care if you lose.
> 
> As for the userspace code - that is still just a patchset. I haven't
> had any feedback from the maintainers about it in the past month, so
> I've got no idea what they are doing with it. I'll post out a new
> version in the next couple of days - it's 50-odd patches by now, so
> it'd be nice to have it in the xfsprogs git tree so people could
> just pull it and build it for testing purposes by the time that 3.10
> releases....

Dave,

I was of the impression that the user space changes would be released
sometime later (i.e. when CRC comes out of experimental). If we make the
user space changes to create V5 filesystems now, it will be an annoyance
for people who created V5 superblocks without my changes (getting rid
of the OQUOTA.* flags).

BTW, I am waiting for your response to do a final re-post on the kernel
changes, after which I will post my user space changes.

Regards,

Chandra
> 
> Cheers,
> 
> Dave



* Re: Debunking myths about metadata CRC overhead
  2013-06-04  2:43   ` Dave Chinner
  2013-06-04 10:19     ` Dave Chinner
@ 2013-06-04 21:27     ` Geoffrey Wehrman
  2013-06-05  0:27       ` Dave Chinner
  1 sibling, 1 reply; 16+ messages in thread
From: Geoffrey Wehrman @ 2013-06-04 21:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Jun 04, 2013 at 12:43:29PM +1000, Dave Chinner wrote:
| On Mon, Jun 03, 2013 at 03:00:53PM -0500, Geoffrey Wehrman wrote:
| > On Mon, Jun 03, 2013 at 05:44:52PM +1000, Dave Chinner wrote:
| > | Hi folks,
| > | 
| > | There has been some assertions made recently that metadata CRCs have
| > | too much overhead to always be enabled.  So I'll run some quick
| > | benchmarks to demonstrate the "too much overhead" assertions are
| > | completely unfounded.
| > 
| > Thank you, much appreciated.

| We've known about the VFS lock contention problem a lot longer than
| the CRC code has been running.  In case you hadn't been
| keeping up with this stuff, here's a quick summary of the work I've
| been doing with Glauber:
| 
| http://lwn.net/Articles/550463/
| http://lwn.net/Articles/548092/
| 
| So, while CRCs might be a trigger that makes the system fall off the
| cliff it is on the edge of, it is most certainly not a CRC problem,
| it is not a problem we can solve by changing the CRC code and it is
| not a problem we can solve by turning off CRCs.  IOWs, CRCs are not
| the root cause of the degradation in performance.

Fair enough.  It is good to know that the VFS lock contention problem is
being addressed.  Thanks for the pointers to the summary of the work you
and Glauber have been doing.

| > Do I want to take a 5% performance hit in filesystem performance
| > and double the size of my inodes for an unproved feature?  I am
| > still unconvinced that CRCs are a feature that I want to use.
| > Others may see enough benefit in CRCs to accept the performance
| > hit.  All I want is to ensure that I have the option going
| > forward to choose not to use CRCs without sacrificing other
| > features introduced in XFS.
| 
| If you don't want to take the performance hit of SDM, then don't use
| it. You have that choice right now - either choose performance (v4
| superblocks) or reliability (v5 superblocks) at mkfs time.

That is exactly the capability I want.

| If new features are introduced that you want that are dependent on
| v5 superblocks and you want to stick with v4 superblocks for
| performance reasons, then you have to make a hard choice unless you
| address your concerns about v5 superblocks. Indeed, none of the
| performance issues you've mentioned are unsolvable problems - you
| just have to identify them and fix them before your customers need
| v5 superblocks.

This is the type of hard choice I want to avoid as much as possible.
My concern is that all future XFS features will be introduced as v5
superblock only features, regardless of whether they are directly
dependent on CRC or not.  I'm not expecting all future features to be
implemented for both v4 and v5 superblocks, but I would like to have new
features available for v4 superblocks available when possible, at least
until the vast majority of systems deployed are v5 superblock capable.
Unfortunately, this will take much longer than we would like.

| IOWs, you need to quantify the specific performance degradations you
| are concerned about and help fix them. We may have different
| priorities and goals, but that doesn't stop us from both being able
| to help each reach our goals. But any such discussion about
| performance and problem areas needs to be based on quantified
| information, not handwaving.

I would love to be able to quantify and help fix performance degradations
I am concerned about.  Unfortunately there are just not enough hours in a
day.  I will be honest, I am not an XFS developer.  I am an XFS consumer.
The products I spend my time working on rely on XFS as their foundation.
I don't even touch current XFS.  I spend most of my time working with XFS
code that is a year old or more.  Even then, I am not spending much time
with the XFS code itself but rather the code from the products built on
top of XFS.  It is like buying an automobile: I don't review the
CAD drawings of each part used in the construction.
I don't even examine the engine or transmission.  I don't take an
automobile I'm looking at and hook it up to a dyno to get a performance
report.  I rely on the manufacturer to provide me with the performance
information, and then I do my best to analyze the data I have available.

| Geoffrey, can you start by identifying and quantifying two things on
| current top-of-tree kernels?
| 
| 	1. exactly where the problems with larger inodes are (on v4
| 	   superblocks)
| 	2. workloads you care about where SDM significantly impacts
| 	   performance (i.e. v4 vs v5 superblocks)

I cannot identify and quantify any more than I already have.  Bulk scans
are my primary concern, along with potential doubling of bandwidth
required for a bulk scan.  You address much of this in your follow-up
e-mail.

| We can discuss each case you raise on their merits and determine
| whether they need to be addressed and, if so, how to address them.
| But we need quantified data to make any progress here.
| 
| In the mean time, you can just use v4 superblocks like you currently
| do, but when the time comes to switch to v5 superblocks we will have
| corrected the identified problems and performance will not be an
| issue that you need to be concerned about.

I hope that is the case, and expect that it will be.  I'm not questioning
your abilities.  You are one of the best developers in the community.
I just want to be sure that I'm not forced into v5 superblocks before
the identified problems have been resolved, and that the work on v5
superblocks has minimal impact on my current use of v4 superblocks.
The data you have provided has gone a long way to allay my concerns about
the metadata performance in XFS with CRCs.  I'm not ready to jump yet,
but you have given me confidence that the jump should not be as bad as I
had expected.

On Tue, Jun 04, 2013 at 08:19:37PM +1000, Dave Chinner wrote:
| On Tue, Jun 04, 2013 at 12:43:29PM +1000, Dave Chinner wrote:
| > On Mon, Jun 03, 2013 at 03:00:53PM -0500, Geoffrey Wehrman wrote:
| > > On Mon, Jun 03, 2013 at 05:44:52PM +1000, Dave Chinner wrote:
| > > This will have significant impact
| > > on SGI's DMF managed filesystems.
| > 
| > You're concerned about bulkstat performance, then? Bulkstat will CRC
| > every inode it reads, so the increase in inode size is the least of
| > your worries....
| > 
| > But bulkstat scalability is an unrelated issue to the CRC work,
| > especially as bulkstat already needs application provided
| > parallelism to scale effectively.
...
| So, the difference in performance pretty much goes away. We burn
| more bandwidth, but now the multithreaded bulkstat is CPU limited
| for both non-crc, 256 byte inodes and CRC enabled 512 byte inodes.
| 
| What this says to me is that there isn't a bulkstat performance
| problem that we need to fix apart from the 3 lines of code for the
| readahead IO plugging that I just added.  It's only limited by
| storage IOPS and available CPU power, yet the bandwidth is
| sufficiently low that any storage system that SGI installs for DMF
| is not going to be stressed by it. IOPS, yes. Bandwidth, no.

What can I say but nice analysis.  You've clearly shown that the
performance impact in bulk scan caused by CRCs can be easily offset by
changes elsewhere, improving bulkstat performance across the board.
I don't exactly follow what changes you made to _xfs_buf_ioapply(),
but expect that you will eventually post the change.


-- 
Geoffrey Wehrman  651-683-5496  gwehrman@sgi.com


* Re: Debunking myths about metadata CRC overhead
  2013-06-04 16:20     ` Ben Myers
@ 2013-06-04 22:06       ` Dave Chinner
  2013-06-04 22:09         ` Ben Myers
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2013-06-04 22:06 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Tue, Jun 04, 2013 at 11:20:30AM -0500, Ben Myers wrote:
> Dave,
> 
> On Tue, Jun 04, 2013 at 12:53:07PM +1000, Dave Chinner wrote:
> > On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> > > Le Mon, 3 Jun 2013 17:44:52 +1000 vous écriviez:
> > > 
> > > > There has been some assertions made recently that metadata CRCs have
> > > > too much overhead to always be enabled.  So I'll run some quick
> > > > benchmarks to demonstrate the "too much overhead" assertions are
> > > > completely unfounded.
> > > 
> > > Just a quick question: what is the minimal kernel version and xfsprogs
> > > version needed to run xfs with metadata CRC? I'd happily test it on
> > > real hardware, I have a couple of storage servers in test in the 40 to
> > > 108 TB range.
> > 
> > If the maintainers merge all the patches I send for the 3.10-rc
> > series, then the 3.10 release should be stable enough to use for
> > testing with data you don't care if you lose.
> > 
> > As for the userspace code - that is still just a patchset. I haven't
> > had any feedback from the maintainers about it in the past month, so
> > I've got no idea what they are doing with it. I'll post out a new
> > version in the next couple of days - it's 50-odd patches by now, so
> > it'd be nice to have it in the xfsprogs git tree so people could
> > just pull it and build it for testing purposes by the time that 3.10
> > releases....
> 
> When it is reviewed and adequately tested we'll pull it in.  Until then
> Emmanuel will need to pull down the patchset.  Right now the focus is on
> 3.10.

And when will that be? I've already been waiting the best part of a
month for anyone to even comment on it, and I've got 5 private pings
in the past 3 days asking about how to get the userspace code so
they can test the new kernel code....

How about this: I post an up-to-date patch set, and you guys commit
it to a "crc-dev" branch in the oss xfsprogs git tree. The branch
can be thrown away when the code is reviewed, but in the mean time
we can point early adopters and testers to that branch rather than
ask them to pull down and apply a 50 patch series to a git tree?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Debunking myths about metadata CRC overhead
  2013-06-04 18:38     ` Chandra Seetharaman
@ 2013-06-04 22:08       ` Dave Chinner
  2013-06-04 22:40         ` Chandra Seetharaman
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2013-06-04 22:08 UTC (permalink / raw)
  To: Chandra Seetharaman; +Cc: xfs

On Tue, Jun 04, 2013 at 01:38:29PM -0500, Chandra Seetharaman wrote:
> On Tue, 2013-06-04 at 12:53 +1000, Dave Chinner wrote:
> > On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> > > Le Mon, 3 Jun 2013 17:44:52 +1000 vous écriviez:
> > > 
> > > > There has been some assertions made recently that metadata CRCs have
> > > > too much overhead to always be enabled.  So I'll run some quick
> > > > benchmarks to demonstrate the "too much overhead" assertions are
> > > > completely unfounded.
> > > 
> > > Just a quick question: what is the minimal kernel version and xfsprogs
> > > version needed to run xfs with metadata CRC? I'd happily test it on
> > > real hardware, I have a couple of storage servers in test in the 40 to
> > > 108 TB range.
> > 
> > If the maintainers merge all the patches I send for the 3.10-rc
> > series, then the 3.10 release should be stable enough to use for
> > testing with data you don't care if you lose.
> > 
> > As for the userspace code - that is still just a patchset. I haven't
> > had any feedback from the maintainers about it in the past month, so
> > I've got no idea what they are doing with it. I'll post out a new
> > version in the next couple of days - it's 50-odd patches by now, so
> > it'd be nice to have it in the xfsprogs git tree so people could
> > just pull it and build it for testing purposes by the time that 3.10
> > releases....
> 
> Dave,
> 
> I was of the impression that the user space changes will be released
> sometime later (i.e when CRC comes out of experimental). If we make the
> user space changes to create V5 filesystem now, it will be an annoyance
> for people that created V5 super blocks without my changes (getting rid
> of OQUOTA.* flags). 

People still need access to the code to test it. I'm not talking
about an official release here at all, just getting it committed to
the git tree to make it easy for people to get the code and for
developers to build on top of it and fix bugs.

> BTW, I am waiting for your response to do a final re-post on the kernel
> changes, after which I will post my user space changes.

I must have missed your question, I'll go back and have a look for
it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Debunking myths about metadata CRC overhead
  2013-06-04 22:06       ` Dave Chinner
@ 2013-06-04 22:09         ` Ben Myers
  0 siblings, 0 replies; 16+ messages in thread
From: Ben Myers @ 2013-06-04 22:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Jun 05, 2013 at 08:06:10AM +1000, Dave Chinner wrote:
> On Tue, Jun 04, 2013 at 11:20:30AM -0500, Ben Myers wrote:
> > Dave,
> > 
> > On Tue, Jun 04, 2013 at 12:53:07PM +1000, Dave Chinner wrote:
> > > On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> > > > Le Mon, 3 Jun 2013 17:44:52 +1000 vous écriviez:
> > > > 
> > > > > There has been some assertions made recently that metadata CRCs have
> > > > > too much overhead to always be enabled.  So I'll run some quick
> > > > > benchmarks to demonstrate the "too much overhead" assertions are
> > > > > completely unfounded.
> > > > 
> > > > Just a quick question: what is the minimal kernel version and xfsprogs
> > > > version needed to run xfs with metadata CRC? I'd happily test it on
> > > > real hardware, I have a couple of storage servers in test in the 40 to
> > > > 108 TB range.
> > > 
> > > If the maintainers merge all the patches I send for the 3.10-rc
> > > series, then the 3.10 release should be stable enough to use for
> > > testing with data you don't care if you lose.
> > > 
> > > As for the userspace code - that is still just a patchset. I haven't
> > > had any feedback from the maintainers about it in the past month, so
> > > I've got no idea what they are doing with it. I'll post out a new
> > > version in the next couple of days - it's 50-odd patches by now, so
> > > it'd be nice to have it in the xfsprogs git tree so people could
> > > just pull it and build it for testing purposes by the time that 3.10
> > > releases....
> > 
> > When it is reviewed and adequately tested we'll pull it in.  Until then
> > Emmanuel will need to pull down the patchset.  Right now the focus is on
> > 3.10.
> 
> And when will that be? I've already been waiting the best part of a
> month for anyone to even comment on it, and I've got 5 private pings
> in the past 3 days asking about how to get the userspace code so
> they can test the new kernel code....
> 
> How about this: I post an up-to-date patch set, and you guys commit
> it to a "crc-dev" branch in the oss xfsprogs git tree. The branch
> can be thrown away when the code is reviewed, but in the mean time
> we can point early adopters and testers to that branch rather than
> ask them to pull down and apply a 50 patch series to a git tree?

Sounds good to me.

-Ben


* Re: Debunking myths about metadata CRC overhead
  2013-06-04 22:08       ` Dave Chinner
@ 2013-06-04 22:40         ` Chandra Seetharaman
  2013-06-04 22:59           ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Chandra Seetharaman @ 2013-06-04 22:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, 2013-06-05 at 08:08 +1000, Dave Chinner wrote:
> On Tue, Jun 04, 2013 at 01:38:29PM -0500, Chandra Seetharaman wrote:
> > On Tue, 2013-06-04 at 12:53 +1000, Dave Chinner wrote:
> > > On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> > > > Le Mon, 3 Jun 2013 17:44:52 +1000 vous écriviez:
> > > > 
> > > > > There has been some assertions made recently that metadata CRCs have
> > > > > too much overhead to always be enabled.  So I'll run some quick
> > > > > benchmarks to demonstrate the "too much overhead" assertions are
> > > > > completely unfounded.
> > > > 
> > > > Just a quick question: what is the minimal kernel version and xfsprogs
> > > > version needed to run xfs with metadata CRC? I'd happily test it on
> > > > real hardware, I have a couple of storage servers in test in the 40 to
> > > > 108 TB range.
> > > 
> > > If the maintainers merge all the patches I send for the 3.10-rc
> > > series, then the 3.10 release should be stable enough to use for
> > > testing with data you don't care if you lose.
> > > 
> > > As for the userspace code - that is still just a patchset. I haven't
> > > had any feedback from the maintainers about it in the past month, so
> > > I've got no idea what they are doing with it. I'll post out a new
> > > version in the next couple of days - it's 50-odd patches by now, so
> > > it'd be nice to have it in the xfsprogs git tree so people could
> > > just pull it and build it for testing purposes by the time that 3.10
> > > releases....
> > 
> > Dave,
> > 
> > I was of the impression that the user space changes will be released
> > sometime later (i.e when CRC comes out of experimental). If we make the
> > user space changes to create V5 filesystem now, it will be an annoyance
> > for people that created V5 super blocks without my changes (getting rid
> > of OQUOTA.* flags). 
> 
> People still need access to the code to test it. I'm not talking
> about an official release here at all, just getting it committed to
> the git tree to make it easy for people to get the code and for
> developers to build on top of it and fix bugs.
> 

Oh, I see. It is clear now.

Also, is there a git tree where I can pull your xfsprogs changes from ?
I tried to apply xfsprogs-kern-sync-patchset-v2.tar.gz on top of the
xfsprogs git tree, but it seems to have some problems.

Or is there a later version ?

> > BTW, I am waiting for your response to do a final re-post on the kernel
> > changes, after which I will post my user space changes.
> 
> I must have missed your question, I'll go back and have a look for
> it.
> 

Thanks.
> Cheers,
> 
> Dave.



* Re: Debunking myths about metadata CRC overhead
  2013-06-04 22:40         ` Chandra Seetharaman
@ 2013-06-04 22:59           ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2013-06-04 22:59 UTC (permalink / raw)
  To: Chandra Seetharaman; +Cc: xfs

On Tue, Jun 04, 2013 at 05:40:16PM -0500, Chandra Seetharaman wrote:
> On Wed, 2013-06-05 at 08:08 +1000, Dave Chinner wrote:
> > On Tue, Jun 04, 2013 at 01:38:29PM -0500, Chandra Seetharaman wrote:
> > > On Tue, 2013-06-04 at 12:53 +1000, Dave Chinner wrote:
> > > > On Mon, Jun 03, 2013 at 11:10:11AM +0200, Emmanuel Florac wrote:
> > > > > Le Mon, 3 Jun 2013 17:44:52 +1000 vous écriviez:
> > > > > 
> > > > > > There has been some assertions made recently that metadata CRCs have
> > > > > > too much overhead to always be enabled.  So I'll run some quick
> > > > > > benchmarks to demonstrate the "too much overhead" assertions are
> > > > > > completely unfounded.
> > > > > 
> > > > > Just a quick question: what is the minimal kernel version and xfsprogs
> > > > > version needed to run xfs with metadata CRC? I'd happily test it on
> > > > > real hardware, I have a couple of storage servers in test in the 40 to
> > > > > 108 TB range.
> > > > 
> > > > If the maintainers merge all the patches I send for the 3.10-rc
> > > > series, then the 3.10 release should be stable enough to use for
> > > > testing with data you don't care if you lose.
> > > > 
> > > > As for the userspace code - that is still just a patchset. I haven't
> > > > had any feedback from the maintainers about it in the past month, so
> > > > I've got no idea what they are doing with it. I'll post out a new
> > > > version in the next couple of days - it's 50-odd patches by now, so
> > > > it'd be nice to have it in the xfsprogs git tree so people could
> > > > just pull it and build it for testing purposes by the time that 3.10
> > > > releases....
> > > 
> > > Dave,
> > > 
> > > I was of the impression that the user space changes will be released
> > > sometime later (i.e when CRC comes out of experimental). If we make the
> > > user space changes to create V5 filesystem now, it will be an annoyance
> > > for people that created V5 super blocks without my changes (getting rid
> > > of OQUOTA.* flags). 
> > 
> > People still need access to the code to test it. I'm not talking
> > about an official release here at all, just getting it committed to
> > the git tree to make it easy for people to get the code and for
> > developers to build on top of it and fix bugs.
> > 
> 
> Oh, I see. It is clear now.
> 
> Also, is there a git tree where I can pull your xfsprogs changes from?
> I tried to apply xfsprogs-kern-sync-patchset-v2.tar.gz on top of the
> xfsprogs git tree, but it seems to have some problems.
> 
> Or is there a later version?

I'm cleaning up my current tree right now. I'll post it out as soon
as it is ready to go. Hopefully Ben can get that into a branch in
the xfsprogs git tree and you can pull from there.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: Debunking myths about metadata CRC overhead
  2013-06-04 21:27     ` Geoffrey Wehrman
@ 2013-06-05  0:27       ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2013-06-05  0:27 UTC (permalink / raw)
  To: Geoffrey Wehrman; +Cc: xfs

On Tue, Jun 04, 2013 at 04:27:13PM -0500, Geoffrey Wehrman wrote:
> On Tue, Jun 04, 2013 at 12:43:29PM +1000, Dave Chinner wrote:
> | On Mon, Jun 03, 2013 at 03:00:53PM -0500, Geoffrey Wehrman wrote:
> | > On Mon, Jun 03, 2013 at 05:44:52PM +1000, Dave Chinner wrote:
> | > | Hi folks,
> | > | 
> | > | There have been some assertions made recently that metadata CRCs have
> | > | too much overhead to always be enabled.  So I'll run some quick
> | > | benchmarks to demonstrate the "too much overhead" assertions are
> | > | completely unfounded.
....
> | > Do I want to take a 5% performance hit in filesystem performance
> | > and double the size of my inodes for an unproved feature?  I am
> | > still unconvinced that CRCs are a feature that I want to use.
> | > Others may see enough benefit in CRCs to accept the performance
> | > hit.  All I want is to ensure that I have the option going forward to
> | > choose not to use CRCs without sacrificing other features
> | > introduced in XFS.
> | 
> | If you don't want to take the performance hit of SDM, then don't use
> | it. You have that choice right now - either choose performance (v4
> | superblocks) or reliability (v5 superblocks) at mkfs time.
> 
> That is exactly the capability I want.

And you have it, so I don't see what the fuss is all about.

> | If new features are introduced that you want that are dependent on
> | v5 superblocks and you want to stick with v4 superblocks for
> | performance reasons, then you have to make a hard choice unless you
> | address your concerns about v5 superblocks. Indeed, none of the
> | performance issues you've mentioned are unsolvable problems - you
> | just have to identify them and fix them before your customers need
> | v5 superblocks.
> 
> This is the type of hard choice I want to avoid as much as possible.
> My concern is that all future XFS features will be introduced as v5
> superblock only features, regardless of whether they are directly
> dependent on CRC or not.

No different to the v3->v4 transition. The old format was immediately
deprecated...

> I'm not expecting all future features to be
> implemented for both v4 and v5 superblocks, but I would like to have new
> features available for v4 superblocks available when possible, at least
> until the vast majority of systems deployed are v5 superblock capable.
> Unfortunately this will take much longer than we like.

v5 superblocks are the future and upstream development is focussed
primarily on the future. Any new feature that requires an on-disk
format change is now going to be dependent on v5 superblocks.
You're welcome to backport such features to SGI supported kernels
using v4 superblocks, but it's unrealistic to expect upstream to
jump through hoops to do this for you...

> | IOWs, you need to quantify the specific performance degradations you
> | are concerned about and help fix them. We may have different
> | priorities and goals, but that doesn't stop us from both being able
> | to help each reach our goals. But any such discussion about
> | performance and problem areas needs to be based on quantified
> | information, not handwaving.
> 
> I would love to be able to quantify and help fix performance degradations
> I am concerned about.  Unfortunately there are just not enough hours in a
> day.

Delegate to your minions. ;)

> I will be honest, I am not an XFS developer.  I am an XFS
> consumer.  The products I spend my time working on rely on XFS as
> their foundation.  I don't even touch current XFS.  I spend most
> of my time working with XFS code that is a year old or more.  Even
> then, I am not spending much time with the XFS code itself but
> rather the code from the products built on top of XFS.  Call me an
> XFS consumer.  It is like buying an automobile.
> I don't review the cad drawings of each part used in the
> construction.  I don't even examine the engine or transmission.  I
> don't take an automobile I'm looking at and hook it up to a dyno
> to get a performance report.  I rely on the manufacturer to
> provide me with the performance information, and then I do my best
> to analyze the data I have available.

Yes, you are a downstream consumer, but that's a seriously bad
analogy.  You aren't "buying an automobile" and relying on the
manufacturer to supply you with specifications and support for your
car.

You are an expert mechanic who is getting a cheap car and a box of
parts from the local bazaar for nothing and treating it to a star
role in an episode of "Monster Garage".(*) You then sell that "new"
car and support it directly because the original manufacturer
doesn't even recognise it anymore.

(*) The show where a team of expert mechanics and fabricators take
some standard vehicle, rip the guts out of it and rebuild it into
some entirely different contraption.

See, I can do bad car analogies, too. :/

Ignoring the bad analogies, my point still stands. If you want to
make claims about performance issues, you need to back them up with
a reproducible test case, numbers and analysis for them to be taken
seriously. Only after the problem has been demonstrated and
reproduced can we consider what changes *might* be necessary.

> I don't exactly follow what changes you made to _xfs_buf_ioapply(),
> but expect that you will eventually post the change.

None. I made changes to bulkstat. :)

And yes, I will post the patch in my next for-3.11 patchset....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



end of thread, other threads: [~2013-06-05  0:28 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-03  7:44 Debunking myths about metadata CRC overhead Dave Chinner
2013-06-03  9:10 ` Emmanuel Florac
2013-06-04  2:53   ` Dave Chinner
2013-06-04 16:20     ` Ben Myers
2013-06-04 22:06       ` Dave Chinner
2013-06-04 22:09         ` Ben Myers
2013-06-04 18:38     ` Chandra Seetharaman
2013-06-04 22:08       ` Dave Chinner
2013-06-04 22:40         ` Chandra Seetharaman
2013-06-04 22:59           ` Dave Chinner
2013-06-03 15:31 ` Troy McCorkell
2013-06-03 20:00 ` Geoffrey Wehrman
2013-06-04  2:43   ` Dave Chinner
2013-06-04 10:19     ` Dave Chinner
2013-06-04 21:27     ` Geoffrey Wehrman
2013-06-05  0:27       ` Dave Chinner
