From: Stan Hoeppner <stan@hardwarefreak.com>
To: xfs@oss.sgi.com
Subject: Re: deleting 2TB lots of files with delaylog: sync helps?
Date: Thu, 02 Sep 2010 09:57:43 -0500
Message-ID: <4C7FBB67.7010404@hardwarefreak.com>
In-Reply-To: <20100902112920.GZ705@dastard>
Thanks Dave.
I don't normally top post, but I just wanted to quickly say I _really_
enjoyed reading your reply below. It was seriously educational. I
really enjoyed your note about the 24p Altix system. I've been a fan of
SGI NUMA machines since the Origin 2k, due to the uniqueness of this
scalable interconnect, though I've never been an SGI user. :(
Keep up the great work, and keep us educated, as you've done so very
well here. :)
--
Stan
Dave Chinner put forth on 9/2/2010 6:29 AM:
> On Thu, Sep 02, 2010 at 03:41:59AM -0500, Stan Hoeppner wrote:
>> Dave Chinner put forth on 9/2/2010 2:01 AM:
>>
>>> No, that's definitely not the case. A different kernel in the
>>> same 8p VM, 12x2TB SAS storage, w/ 4 threads, mount options "logbsize=262144"
>>>
>>> FSUse%        Count         Size    Files/sec     App Overhead
>>>      0       800000            0      39554.2          7590355
>>>
>>> 4 threads with mount options "logbsize=262144,delaylog"
>>>
>>> FSUse%        Count         Size    Files/sec     App Overhead
>>>      0       800000            0      67269.7          5697246
>>
>> What happens when you bump each of these to 8 threads, 1 per core? If
>
> FSUse%        Count         Size    Files/sec     App Overhead
>      0      1600000            0     127979.3         13156823
>
> So, 1 thread does 19k files/s, 2 threads do 37k files/s, 4 gets
> 67k, and 8 gets 128k. I'd say that's almost linear scaling and CPU
> bound at each load point ;)
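
The "almost linear" claim can be checked with a quick calculation. This is only a sketch: the 1- and 2-thread rates are the rounded "19k" and "37k" figures from the message, not exact measurements, and the function name is made up for illustration.

```python
# Throughput figures quoted above, in files/s. The 1- and 2-thread values
# are the rounded "19k" and "37k" from the message, not exact results.
files_per_sec = {1: 19_000, 2: 37_000, 4: 67_269.7, 8: 127_979.3}


def scaling_efficiency(results):
    """Return {threads: speedup / threads} relative to the 1-thread rate.

    1.0 means perfectly linear scaling; lower values mean each added
    thread contributes less than the first one did.
    """
    base = results[1]
    return {n: (rate / base) / n for n, rate in results.items()}


efficiency = scaling_efficiency(files_per_sec)
for threads, eff in sorted(efficiency.items()):
    print(f"{threads} threads: {eff:.0%} of ideal linear scaling")
```

With these inputs the 8-thread run comes out at a little over 80% of ideal, which is consistent with "almost linear" given how coarse the 1-thread figure is.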
>
>> the test consumes all cpus/cores, what instrumentation are you viewing
>> that tells you the cpu utilization _isn't_ due to memory b/w starvation?
>
> 1) profiling like 'perf top' or oprofile, using hardware counters to
> profile on cpu cycles, l1/l2 cache misses, etc
>
> 2) the delayed logging code uses significantly more memory bandwidth
> than the original code because it copies changed information twice
> (instead of once) before it is written to disk. Given that single
> threaded performance of delayed logging is identical to the original
> code and scalability from 1 to 8 cores is almost linear, it cannot
> be memory bandwidth bound....
>
> The code might be memory _latency_ bound (i.e. on cache misses), but
> it is certainly not stressing pure memory bandwidth.
>
>> A modern 64 bit 2 GHz core from AMD or Intel has an L1 instruction issue
>> rate of 8 bytes/cycle * 2,000 MHz = 16,000 MB/s = 16 GB/s per core. An
>> 8 core machine would therefore have an instruction issue rate of 8 * 16
>> GB/s = 128 GB/s. A modern dual socket system is going to top out at
>> 24-48 GB/s, well short of the instruction issue rate. Now, this doesn't
>> even take the b/w of data load/store operations into account, but I'm
>> guessing the data size per directory operation is smaller than the total
>> instruction sequence, which operates on the same variable(s).
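
The back-of-the-envelope arithmetic in the paragraph above can be reproduced as follows. All inputs are the figures stated there (8 bytes/cycle, 2,000 MHz, 8 cores, 24-48 GB/s); the variable names are illustrative only, and decimal units (1 GB = 1000 MB) are assumed.

```python
# Inputs taken from the paragraph above; names are illustrative only.
BYTES_PER_CYCLE = 8          # assumed L1 instruction issue width
CLOCK_MHZ = 2_000            # 2 GHz core
CORES = 8
MEM_BW_GBPS = (24, 48)       # quoted range for a dual socket system

# bytes/cycle * Mcycles/s = MB/s; divide by 1000 for GB/s (decimal units)
per_core_gbps = BYTES_PER_CYCLE * CLOCK_MHZ / 1_000
total_gbps = per_core_gbps * CORES

# How many times the aggregate issue rate exceeds available memory bandwidth
shortfall = [total_gbps / bw for bw in MEM_BW_GBPS]

print(f"per core: {per_core_gbps:.0f} GB/s, {CORES} cores: {total_gbps:.0f} GB/s")
print(f"issue rate is {shortfall[1]:.1f}x to {shortfall[0]:.1f}x memory bandwidth")
```

The gap of several times between issue rate and memory bandwidth is what motivates the question that follows about the working set living in cache.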
>>
>> So, if the CPUs are pegging, and we're not running out of memory b/w,
>> then this would lead me to believe that the hot kernel code, core
>> fs_mark code and the filesystem data are fully, or near fully, contained
>> in level 2 and 3 CPU caches. Is this correct, more or less?
>
> Probably.
>
> However (and it is a big however!), I generally don't care to
> analyse performance at this level because it's getting into
> micro-optimisation territory. Sure, it will get you a few percent
> here and there, but then you lose focus on improving the algorithms.
> An algorithmic change can provide an order of magnitude improvement,
> not a few percent. The delayed logging code is a clear example of
> that.
>
> Another example - perf top shows this on the above 8p load on
> a plain 2.6.36-rc3 kernel (and it gets about 40k files/s):
>
> 426043.00 27.4% _xfs_buf_find
> 87491.00 5.6% __ticket_spin_lock
> 67204.00 4.3% xfs_dir2_node_addname
> 60434.00 3.9% dso__find_symbol
> 48407.00 3.1% kmem_cache_alloc
> 37006.00 2.4% __d_lookup
> 31625.00 2.0% xfs_trans_buf_item_match
> 20036.00 1.3% xfs_log_commit_cil
> 18728.00 1.2% _raw_spin_unlock_irqrestore
> 18428.00 1.2% __memset
> 18001.00 1.2% __memcpy
> 17781.00 1.1% xfs_da_do_buf
> 17732.00 1.1% xfs_iflush_cluster
> 16831.00 1.1% kmem_cache_free
> 14836.00 1.0% __kmalloc
>
> It is clear that buffer lookup is consuming the most CPU of any
> operation. Why? Because the buffer hash table is too small. I've
> already posted patches for a short term solution (increase the size
> of the hash table) and the above 127k files/s result is using that
> patch. Hence it is clear that the micro-optimisation works, but at
> the cost of 16x increase in memory usage for the hash table. And
> that still isn't really large enough, because now the load is
> already pushing the limits of the enlarged hash table.
>
> As Christoph has already suggested, the correct way to fix the
> problem is to change the caching algorithm to something that is
> self-scaling (e.g. a rb-tree or a btree). That will keep memory
> usage low on small filesystems, yet scale efficiently to large
> numbers of buffers, something a hash cannot easily do.
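
A rough model of why a fixed-size hash table stops scaling while a tree does not. This is a sketch under stated assumptions, not the XFS code: the bucket count is hypothetical, the hash is assumed to spread entries uniformly, and the tree is treated as perfectly balanced (a real rb-tree can be up to about twice as deep).

```python
import math


def avg_chain_length(num_buffers, num_buckets):
    """Average entries per bucket in a fixed-size chained hash table,
    assuming a perfectly uniform hash."""
    return num_buffers / num_buckets


def balanced_tree_depth(num_buffers):
    """Lookup depth of a perfectly balanced binary tree holding
    num_buffers entries."""
    return math.ceil(math.log2(num_buffers + 1))


BUCKETS = 1024  # hypothetical table size, NOT the actual XFS value

for n in (1_000, 100_000, 10_000_000):
    small = avg_chain_length(n, BUCKETS)
    large = avg_chain_length(n, 16 * BUCKETS)  # the 16x enlarged table
    depth = balanced_tree_depth(n)
    print(f"{n:>10} buffers: chain {small:9.1f} | 16x table {large:8.1f} "
          f"| tree depth {depth}")
```

Enlarging the table divides the chain length by a constant, but the chain still grows linearly with the number of buffers, whereas the tree depth grows only logarithmically and costs no memory up front on a small filesystem.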
>
> IOWs, an algorithmic change will solve the problem far better for
> more situations than the micro-optimisation of tweaking the hash
> sizes. Reduced to the simplest argument, scalability is all
> about choosing the right algorithm so you don't have to care about
> minute details to obtain the performance you require.
>
>> I'll have to dig around. I've never even looked for the archives for
>> this list. It's hopefully mirrored in the usual places.
>>
>> Out of curiosity, have you ever run into memory b/w starvation before
>> peaking all CPUs while running this test?
>
> No. Last time I ran out of bandwidth doing an IO workload was doing
> 6-7GB/s of buffered writes to disk on a 24p ia64 Altix. The disk
> subsystem could handle 11GB/s, we got 10GB/s with direct IO, but
> buffered IO was limited by the cross-sectional memory bandwidth of
> the machine (25GB/s) because of the extra copy buffered IO
> requires....
>
>> I could see that maybe
>> occurring with dual 1GHz+ P3 class systems with their smallish caches
>> and lowly single channel PC100, back before the switch to DDR memory,
>> but those machines were probably gone before XFS was open sourced, IIRC,
>> so you may not have had the pleasure (if you could call it that).
>
> The ratio between CPU cycles and memory bandwidth really hasn't
> changed much since then. The CPUs weren't powerful enough then,
> either, to run enough metadata ops to get near memory bandwidth
> limits...
>
> Cheers,
>
> Dave.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs