linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Waiman Long <Waiman.Long@hpe.com>
Cc: Tejun Heo <tj@kernel.org>,
	Christoph Lameter <cl@linux-foundation.org>,
	linux-kernel@vger.kernel.org, xfs@oss.sgi.com,
	Scott J Norton <scott.norton@hpe.com>,
	Douglas Hatch <doug.hatch@hpe.com>
Subject: Re: [PATCH] percpu_counter: return precise count from __percpu_counter_compare()
Date: Sat, 3 Oct 2015 08:16:49 +1000	[thread overview]
Message-ID: <20151002221649.GL27164@dastard> (raw)
In-Reply-To: <1443806997-30792-1-git-send-email-Waiman.Long@hpe.com>

On Fri, Oct 02, 2015 at 01:29:57PM -0400, Waiman Long wrote:
> In __percpu_counter_compare(), if the current imprecise count is
> within (batch*nr_cpus) of the input value to be compared, a call
> to percpu_counter_sum() will be made to get the precise count. The
> percpu_counter_sum() call, however, can be expensive especially on
> large systems where there are a lot of CPUs. Large systems also make
> it more likely that percpu_counter_sum() will be called.
> 
> The xfs_mod_fdblocks() function calls __percpu_counter_compare()
> twice. First to see if a smaller batch size should be used for
> __percpu_counter_add() and the second call to compare the actual
> size needed. This can potentially lead to 2 calls to the expensive
> percpu_counter_sum() function.

There should not be that much overhead in __percpu_counter_compare()
through this path in normal operation. The slow path is only taken
as you near ENOSPC...

> This patch added an extra argument to __percpu_counter_compare()
> to return the precise count, if computed. The caller will need to
> initialize it to an invalid value that it can tell if the precise
> count is being returned.

This doesn't work. ENOSPC detection is a lockless algorithm that
requires absolute precision. Assuming the XFS_ALLOC_SET_ASIDE()
definition of ENOSPC is 0 blocks free, your change allows this race:

free space: 1 block

thread 1		thread 2			free space
allocate 1 block	allocate 1 block		  1
sample pcount = 1					  1
			sample pcount = 1		  1
add fdblocks, -1, 1)					  0
			add fdblocks, -1, 1)		  -1
if (pcount - 1 >= 0)	if (pcount - 1 >= 0)
   OK!			    OK!				  -1

So, we've just failed to detect ENOSPC correct. One of those two
threads should have returned ENOSPC and failed the allocation,
but instead we've just allowed XFS to allocate a block that doesn't
exist. Hence we have to resample the percpu counter after the
modification to ensure that we don't miss this race condition.

Sure, the curent code could race on the second comparisions and
return ENOSPC to both threads, but that is a perfectly OK thing
to do. It is vitally important that we don't oversubscribe
filesystem space, because that will lead to all sorts of other
problems (deadlocks, hangs, shutdowns, etc) that are very difficult
to identify the cause of.

FWIW, I'm guessing that you didn't run this patch through xfstests?
xfstests will find these ENOSPC accounting bugs, and usually quite
quickly...

> Running the AIM7 disk workload with XFS filesystem, the jobs/min
> on a 40-core 80-thread 4-socket Haswell-EX system increases from
> 3805k to 4276k (12% increase) with this patch applied. As measured
> by the perf tool, the %CPU cycle consumed by __percpu_counter_sum()
> decreases from 12.64% to 7.08%.

XFS should only hit the slow __percpu_counter_sum() path patch as
the fs gets close to ENOSPC, which for your system will be less
than:

threshold = num_online_cpus * XFS_FDBLOCKS_BATCH * 2 blocks
	  = 80 * 1024 * 2 blocks
	  = 160,000 blocks
	  = 640MB of disk space.

Having less than 1GB of free space in an XFS filesystem is
considered to be "almost ENOSPC" - when you have TB to PB of space,
less than 1GB really "moments before ENOSPC".

XFS trades off low overhead for fast path allocation  with slowdowns
as we near ENOSPC in allocation routines. It gets harder to find
contiguous free space, files get more fragmented, IO takes longer
because we seek more, etc. Hence we accept that performance slows
down as as the need for precision increases as we near ENOSPC.

I'd suggest you retry your benchmark with larger filesystems, and
see what happens...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  parent reply	other threads:[~2015-10-02 22:16 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-02 17:29 [PATCH] percpu_counter: return precise count from __percpu_counter_compare() Waiman Long
2015-10-02 18:04 ` kbuild test robot
2015-10-05 23:03   ` Waiman Long
2015-10-02 18:05 ` kbuild test robot
2015-10-02 18:12 ` kbuild test robot
2015-10-02 18:15 ` kbuild test robot
2015-10-02 22:16 ` Dave Chinner [this message]
2015-10-05 23:02   ` Waiman Long
2015-10-06  0:25     ` Dave Chinner
2015-10-06 17:33       ` Waiman Long
2015-10-06 21:30         ` Dave Chinner
2015-10-07 20:00           ` Waiman Long
2015-10-07 23:04             ` Dave Chinner
2015-10-07 23:20               ` Tejun Heo
2015-10-08  1:02                 ` Dave Chinner
2015-10-08  1:09                   ` Tejun Heo
2015-10-08 16:06                   ` Waiman Long
2015-10-08 16:01               ` Waiman Long

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20151002221649.GL27164@dastard \
    --to=david@fromorbit.com \
    --cc=Waiman.Long@hpe.com \
    --cc=cl@linux-foundation.org \
    --cc=doug.hatch@hpe.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=scott.norton@hpe.com \
    --cc=tj@kernel.org \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).