All of lore.kernel.org
 help / color / mirror / Atom feed
From: Waiman Long <waiman.long@hpe.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Scott J Norton <scott.norton@hpe.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	Douglas Hatch <doug.hatch@hpe.com>,
	linux-kernel@vger.kernel.org, xfs@oss.sgi.com,
	Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH] percpu_counter: return precise count from __percpu_counter_compare()
Date: Wed, 07 Oct 2015 16:00:42 -0400	[thread overview]
Message-ID: <561579EA.60507@hpe.com> (raw)
In-Reply-To: <20151006213023.GP27164@dastard>

On 10/06/2015 05:30 PM, Dave Chinner wrote:
> On Tue, Oct 06, 2015 at 01:33:21PM -0400, Waiman Long wrote:
> Yes, it may be, but that does not mean we should optimise for it.
> If you are doing filesystem scalability testing on small filesystems
> near capacity, then your testing methodology is needs fixing. Not
> the code.
>
>>>>> XFS trades off low overhead for fast path allocation  with slowdowns
>>>>> as we near ENOSPC in allocation routines. It gets harder to find
>>>>> contiguous free space, files get more fragmented, IO takes longer
>>>>> because we seek more, etc. Hence we accept that performance slows
>>>>> down as as the need for precision increases as we near ENOSPC.
>>>>>
>>>>> I'd suggest you retry your benchmark with larger filesystems, and
>>>>> see what happens...
>>>> I don't think I am going to see the slowdown that I observed on
>>>> larger filesystems with more free space.
>>> So there is no problem that needs fixing.... ;)
>> Well, I am still worrying that corner cases when the slowpath is
>> triggered. I would like to make it perform better in those cases.
> It's a pretty damn small slowdown in your somewhat extreme,
> artificial test. Show me a real world production system that runs
> small fileystems permanently at>99% filesystem capacity, and them
> maybe vwe've got something that needs changing.
>
>>>> gauge how far it is from ENOSPC.  So we don't really need to get
>>>> the precise count as long as number of CPUs are taken into
>>>> consideration in the comparison.
>>> I think you are looking in the wrong place. There is nothing
>>> wrong with XFS doing two compares here. If we are hitting the
>>> __percpu_counter_compare() slow path too much, then we should be
>>> understanding exactly why that slow path is being hit so hard so
>>> often. I don't see any analysis of the actual per-cpu counter
>>> behaviour and why the slow path is being taken so often....
>> I am thinking of making the following changes:
> No. Please test the change to the per-cpu counters that I suggested:
>
>>> /*
>>>   * Aggregate the per-cpu counter magazines back into the global
>>>   * counter. This avoids the need for repeated compare operations to
>>>   * run the slow path when the majority of the counter value is held
>>>   * in the per-cpu magazines. Folding them back into the global
>>>   * counter means we will continue to hit the fast
>>>   * percpu_counter_read() path until the counter value falls
>>>   * completely within the comparison limit passed to
>>>   * __percpu_counter_compare().
>>>   */
>>> static s64 percpu_counter_aggregate(struct percpu_counter *fbc)
>>> {
>>> 	s64 ret;
>>> 	int cpu;
>>> 	unsigned long flags;
>>>
>>> 	raw_spin_lock_irqsave(&fbc->lock, flags);
>>> 	ret = fbc->count;
>>> 	for_each_online_cpu(cpu) {
>>> 		s32 count = __this_cpu_read(*fbc->counters);
>>>                  ret += count;
>>> 		__this_cpu_sub(*fbc->counters, count)
>>> 	}
>>> 	fbc->count = ret;
>>> 	raw_spin_unlock_irqrestore(&fbc->lock, flags);
>>> 	return ret;
>>> }
>> I don't think that will work as some other CPUs may change the
>> percpu counters values between percpu_counter_aggregate() and
>> __percpu_counter_compare().  To be safe, the precise counter has to
>> be compted whenever the comparison value difference is less than
>> nr_cpus * batch size.
> Well, yes. Why do you think the above function does the same
> function as percpu_counter_sum()? So that the percpu_counter_sum()
> call *inside* __percpu_counter_compare() can be replaced by this
> call. i.e.
>
> 			return -1;
> 	}
> 	/* Need to use precise count */
> -	count = percpu_counter_sum(fbc);
> +	count = percpu_counter_aggregate(fbc);
> 	if (count>  rhs)
> 		return 1;
> 	else if (count<  rhs)
>
> Please think about what I'm saying rather than dismissing it without
> first understanding my suggestions.

I understood what you were saying. However, the per-cpu counter isn't 
protected by the spinlock. Reading it is OK, but writing may cause race 
if that counter is modified by a CPU other than its owning CPU.

The slow performance of percpu_counter_sum() is due to its need to 
access n different (likely cold) cachelines where n is the number of 
CPUs in the system. So the larger the system, the more problematic it 
will be. My main concern about xfs_mod_fdblocks() is that it can 
potentially call percpu_counter_sum() twice which is what I want to 
prevent. It is OK if you don't think that change is necessary. However, 
I will come back if I find more evidence that this can be an issue.

Cheers,
Longman



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

WARNING: multiple messages have this Message-ID (diff)
From: Waiman Long <waiman.long@hpe.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Tejun Heo <tj@kernel.org>,
	Christoph Lameter <cl@linux-foundation.org>,
	linux-kernel@vger.kernel.org, xfs@oss.sgi.com,
	Scott J Norton <scott.norton@hpe.com>,
	Douglas Hatch <doug.hatch@hpe.com>
Subject: Re: [PATCH] percpu_counter: return precise count from __percpu_counter_compare()
Date: Wed, 07 Oct 2015 16:00:42 -0400	[thread overview]
Message-ID: <561579EA.60507@hpe.com> (raw)
In-Reply-To: <20151006213023.GP27164@dastard>

On 10/06/2015 05:30 PM, Dave Chinner wrote:
> On Tue, Oct 06, 2015 at 01:33:21PM -0400, Waiman Long wrote:
> Yes, it may be, but that does not mean we should optimise for it.
> If you are doing filesystem scalability testing on small filesystems
> near capacity, then your testing methodology is needs fixing. Not
> the code.
>
>>>>> XFS trades off low overhead for fast path allocation  with slowdowns
>>>>> as we near ENOSPC in allocation routines. It gets harder to find
>>>>> contiguous free space, files get more fragmented, IO takes longer
>>>>> because we seek more, etc. Hence we accept that performance slows
>>>>> down as as the need for precision increases as we near ENOSPC.
>>>>>
>>>>> I'd suggest you retry your benchmark with larger filesystems, and
>>>>> see what happens...
>>>> I don't think I am going to see the slowdown that I observed on
>>>> larger filesystems with more free space.
>>> So there is no problem that needs fixing.... ;)
>> Well, I am still worrying that corner cases when the slowpath is
>> triggered. I would like to make it perform better in those cases.
> It's a pretty damn small slowdown in your somewhat extreme,
> artificial test. Show me a real world production system that runs
> small fileystems permanently at>99% filesystem capacity, and them
> maybe vwe've got something that needs changing.
>
>>>> gauge how far it is from ENOSPC.  So we don't really need to get
>>>> the precise count as long as number of CPUs are taken into
>>>> consideration in the comparison.
>>> I think you are looking in the wrong place. There is nothing
>>> wrong with XFS doing two compares here. If we are hitting the
>>> __percpu_counter_compare() slow path too much, then we should be
>>> understanding exactly why that slow path is being hit so hard so
>>> often. I don't see any analysis of the actual per-cpu counter
>>> behaviour and why the slow path is being taken so often....
>> I am thinking of making the following changes:
> No. Please test the change to the per-cpu counters that I suggested:
>
>>> /*
>>>   * Aggregate the per-cpu counter magazines back into the global
>>>   * counter. This avoids the need for repeated compare operations to
>>>   * run the slow path when the majority of the counter value is held
>>>   * in the per-cpu magazines. Folding them back into the global
>>>   * counter means we will continue to hit the fast
>>>   * percpu_counter_read() path until the counter value falls
>>>   * completely within the comparison limit passed to
>>>   * __percpu_counter_compare().
>>>   */
>>> static s64 percpu_counter_aggregate(struct percpu_counter *fbc)
>>> {
>>> 	s64 ret;
>>> 	int cpu;
>>> 	unsigned long flags;
>>>
>>> 	raw_spin_lock_irqsave(&fbc->lock, flags);
>>> 	ret = fbc->count;
>>> 	for_each_online_cpu(cpu) {
>>> 		s32 count = __this_cpu_read(*fbc->counters);
>>>                  ret += count;
>>> 		__this_cpu_sub(*fbc->counters, count)
>>> 	}
>>> 	fbc->count = ret;
>>> 	raw_spin_unlock_irqrestore(&fbc->lock, flags);
>>> 	return ret;
>>> }
>> I don't think that will work as some other CPUs may change the
>> percpu counters values between percpu_counter_aggregate() and
>> __percpu_counter_compare().  To be safe, the precise counter has to
>> be compted whenever the comparison value difference is less than
>> nr_cpus * batch size.
> Well, yes. Why do you think the above function does the same
> function as percpu_counter_sum()? So that the percpu_counter_sum()
> call *inside* __percpu_counter_compare() can be replaced by this
> call. i.e.
>
> 			return -1;
> 	}
> 	/* Need to use precise count */
> -	count = percpu_counter_sum(fbc);
> +	count = percpu_counter_aggregate(fbc);
> 	if (count>  rhs)
> 		return 1;
> 	else if (count<  rhs)
>
> Please think about what I'm saying rather than dismissing it without
> first understanding my suggestions.

I understood what you were saying. However, the per-cpu counter isn't 
protected by the spinlock. Reading it is OK, but writing may cause race 
if that counter is modified by a CPU other than its owning CPU.

The slow performance of percpu_counter_sum() is due to its need to 
access n different (likely cold) cachelines where n is the number of 
CPUs in the system. So the larger the system, the more problematic it 
will be. My main concern about xfs_mod_fdblocks() is that it can 
potentially call percpu_counter_sum() twice which is what I want to 
prevent. It is OK if you don't think that change is necessary. However, 
I will come back if I find more evidence that this can be an issue.

Cheers,
Longman




  reply	other threads:[~2015-10-07 20:01 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-02 17:29 [PATCH] percpu_counter: return precise count from __percpu_counter_compare() Waiman Long
2015-10-02 17:29 ` Waiman Long
2015-10-02 18:04 ` kbuild test robot
2015-10-02 18:04   ` kbuild test robot
2015-10-05 23:03   ` Waiman Long
2015-10-05 23:03     ` Waiman Long
2015-10-02 18:05 ` kbuild test robot
2015-10-02 18:05   ` kbuild test robot
2015-10-02 18:12 ` kbuild test robot
2015-10-02 18:12   ` kbuild test robot
2015-10-02 18:15 ` kbuild test robot
2015-10-02 18:15   ` kbuild test robot
2015-10-02 22:16 ` Dave Chinner
2015-10-02 22:16   ` Dave Chinner
2015-10-05 23:02   ` Waiman Long
2015-10-05 23:02     ` Waiman Long
2015-10-06  0:25     ` Dave Chinner
2015-10-06  0:25       ` Dave Chinner
2015-10-06 17:33       ` Waiman Long
2015-10-06 17:33         ` Waiman Long
2015-10-06 21:30         ` Dave Chinner
2015-10-06 21:30           ` Dave Chinner
2015-10-07 20:00           ` Waiman Long [this message]
2015-10-07 20:00             ` Waiman Long
2015-10-07 23:04             ` Dave Chinner
2015-10-07 23:04               ` Dave Chinner
2015-10-07 23:20               ` Tejun Heo
2015-10-07 23:20                 ` Tejun Heo
2015-10-08  1:02                 ` Dave Chinner
2015-10-08  1:02                   ` Dave Chinner
2015-10-08  1:09                   ` Tejun Heo
2015-10-08  1:09                     ` Tejun Heo
2015-10-08 16:06                   ` Waiman Long
2015-10-08 16:06                     ` Waiman Long
2015-10-08 16:01               ` Waiman Long
2015-10-08 16:01                 ` Waiman Long

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=561579EA.60507@hpe.com \
    --to=waiman.long@hpe.com \
    --cc=cl@linux-foundation.org \
    --cc=david@fromorbit.com \
    --cc=doug.hatch@hpe.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=scott.norton@hpe.com \
    --cc=tj@kernel.org \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.