* [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system
2016-03-05 2:51 [RFC PATCH 0/2] percpu_counter: Enable switching to global counter Waiman Long
@ 2016-03-05 2:51 ` Waiman Long
2016-03-07 18:24 ` Christoph Lameter
2016-03-05 2:51 ` [RFC PATCH 2/2] xfs: Allow degeneration of m_fdblocks/m_ifree to global counters Waiman Long
2016-03-05 6:34 ` [RFC PATCH 0/2] percpu_counter: Enable switching to global counter Dave Chinner
2 siblings, 1 reply; 7+ messages in thread
From: Waiman Long @ 2016-03-05 2:51 UTC (permalink / raw)
To: Tejun Heo, Christoph Lameter, Dave Chinner
Cc: Peter Zijlstra, Scott J Norton, linux-kernel, Waiman Long, xfs,
Ingo Molnar, Douglas Hatch
Per-cpu counters are used in quite a number of places within
the kernel. On a large system with many CPUs, however, a
percpu_counter_sum() call can be very expensive, as nr_cpus
cachelines must be read. In __percpu_counter_compare(), the chance of
having to call percpu_counter_sum() also grows with the number of CPUs
when the global counter value is relatively small.
On such large systems, a global counter protected by a lock may
actually be faster than the percpu_counter_sum() calls that
__percpu_counter_compare() would otherwise make frequently.
This patch provides a mechanism to selectively degenerate per-cpu
counters to global counters at per-cpu counter initialization time. The
following new API is added:
percpu_counter_set_limit(struct percpu_counter *fbc,
u32 percpu_limit)
The function should be called after percpu_counter_set(). It compares
the total limit (nr_cpus * percpu_limit) against the current counter
value. If the counter value does not exceed the total limit, the
per-cpu counters are disabled and only the global counter is used. At
run time, once the counter value grows past the total limit, the
per-cpu counters are enabled again.
Runtime disabling of per-cpu counters, however, is not currently
supported as it will slow down the per-cpu fast path.
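The add-side behaviour described above can be sketched as a minimal userspace model (the struct and function names are invented for illustration; this is not the kernel code, and the notional lock is omitted):

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4

/*
 * Simplified model of the degenerate-counter scheme: while 'limit' is
 * non-zero, every update goes to the global 'count' (under the counter
 * spinlock in the real code); once |count| grows past the limit, the
 * limit is cleared and updates revert to the per-cpu fast path.
 */
struct model_counter {
	long long count;		/* global counter */
	unsigned int limit;		/* 0 => normal per-cpu mode */
	int percpu[NR_CPUS];		/* stand-in for per-cpu deltas */
};

static void model_add(struct model_counter *c, int cpu, long long amount)
{
	if (c->limit) {
		c->count += amount;		/* global-only slow path */
		if (llabs(c->count) > c->limit)
			c->limit = 0;		/* revert to per-cpu mode */
		return;
	}
	c->percpu[cpu] += amount;		/* per-cpu fast path */
}

static long long model_sum(const struct model_counter *c)
{
	long long sum = c->count;

	for (int i = 0; i < NR_CPUS; i++)
		sum += c->percpu[i];
	return sum;
}
```

Note the one-way nature of the switch: crossing the limit clears it permanently, matching the patch's statement that runtime re-degeneration is unsupported to keep the per-cpu fast path cheap.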
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
include/linux/percpu_counter.h | 10 +++++
lib/percpu_counter.c | 72 +++++++++++++++++++++++++++++++++++++++-
2 files changed, 81 insertions(+), 1 deletions(-)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 84a1094..04a3783 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -16,8 +16,14 @@
#ifdef CONFIG_SMP
+/*
+ * The per-cpu counter will be degenerated into a global counter when a limit
+ * is set at initialization time. It will change back to a real per-cpu
+ * counter once the count exceeds the given limit.
+ */
struct percpu_counter {
raw_spinlock_t lock;
+ u32 limit;
s64 count;
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
@@ -42,6 +48,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
s64 __percpu_counter_sum(struct percpu_counter *fbc);
int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
+void percpu_counter_set_limit(struct percpu_counter *fbc, u32 percpu_limit);
static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
{
@@ -170,6 +177,9 @@ static inline int percpu_counter_initialized(struct percpu_counter *fbc)
return 1;
}
+static inline void percpu_counter_set_limit(struct percpu_counter *fbc,
+ u32 percpu_limit) { }
+
#endif /* CONFIG_SMP */
static inline void percpu_counter_inc(struct percpu_counter *fbc)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index f051d69..f101c06 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -75,11 +75,25 @@ EXPORT_SYMBOL(percpu_counter_set);
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
s64 count;
+ unsigned long flags;
+
+ if (fbc->limit) {
+ raw_spin_lock_irqsave(&fbc->lock, flags);
+ if (unlikely(!fbc->limit)) {
+ raw_spin_unlock_irqrestore(&fbc->lock, flags);
+ goto percpu_add;
+ }
+ fbc->count += amount;
+ if (abs(fbc->count) > fbc->limit)
+ fbc->limit = 0; /* Revert back to per-cpu counter */
+ raw_spin_unlock_irqrestore(&fbc->lock, flags);
+ return;
+ }
+percpu_add:
preempt_disable();
count = __this_cpu_read(*fbc->counters) + amount;
if (count >= batch || count <= -batch) {
- unsigned long flags;
raw_spin_lock_irqsave(&fbc->lock, flags);
fbc->count += count;
__this_cpu_sub(*fbc->counters, count - amount);
@@ -94,6 +108,8 @@ EXPORT_SYMBOL(__percpu_counter_add);
/*
* Add up all the per-cpu counts, return the result. This is a more accurate
* but much slower version of percpu_counter_read_positive()
+ *
+ * If a limit is set, the count can be returned directly without locking.
*/
s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
@@ -101,6 +117,9 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
int cpu;
unsigned long flags;
+ if (READ_ONCE(fbc->limit))
+ return READ_ONCE(fbc->count);
+
raw_spin_lock_irqsave(&fbc->lock, flags);
ret = fbc->count;
for_each_online_cpu(cpu) {
@@ -120,6 +139,7 @@ int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
raw_spin_lock_init(&fbc->lock);
lockdep_set_class(&fbc->lock, key);
fbc->count = amount;
+ fbc->limit = 0;
fbc->counters = alloc_percpu_gfp(s32, gfp);
if (!fbc->counters)
return -ENOMEM;
@@ -202,6 +222,9 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
s64 count;
count = percpu_counter_read(fbc);
+ if (READ_ONCE(fbc->limit))
+ goto compare;
+
/* Check to see if rough count will be sufficient for comparison */
if (abs(count - rhs) > (batch * num_online_cpus())) {
if (count > rhs)
@@ -211,6 +234,7 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
}
/* Need to use precise count */
count = percpu_counter_sum(fbc);
+compare:
if (count > rhs)
return 1;
else if (count < rhs)
@@ -220,6 +244,52 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
}
EXPORT_SYMBOL(__percpu_counter_compare);
+/*
+ * Set the limit if the count is less than the given per-cpu limit * # of cpus.
+ *
+ * This function should only be called at initialization time right after
+ * percpu_counter_set(). The limit will only be set if there are more than
+ * 32 CPUs in the system and the current counter value is not bigger than
+ * the limit. Once it is set, it can be cleared as soon as the counter
+ * value exceeds the given limit and real per-cpu counters are used again.
+ * However, switching from per-cpu counters back to global counter is not
+ * currently supported as that will slow down the per-cpu counter fastpath.
+ *
+ * The magic number 32 is chosen to be a compromise between the cost of
+ * reading all the per-cpu counters and that of locking. It can be changed
+ * if there is a better value.
+ */
+#define PERCPU_SET_LIMIT_CPU_THRESHOLD 32
+void percpu_counter_set_limit(struct percpu_counter *fbc, u32 percpu_limit)
+{
+ unsigned long flags;
+ int nrcpus = num_possible_cpus();
+ u32 limit;
+
+ if (nrcpus <= PERCPU_SET_LIMIT_CPU_THRESHOLD)
+ return;
+
+ if (!fbc->count) {
+ WARN(1, "percpu_counter_set_limit() called without an initial counter value!\n");
+ return;
+ }
+ /*
+ * Use default batch size if the given percpu limit is 0.
+ */
+ if (!percpu_limit)
+ percpu_limit = percpu_counter_batch;
+ limit = percpu_limit * nrcpus;
+
+ /*
+ * Limit will not be set if the count is large enough
+ */
+ raw_spin_lock_irqsave(&fbc->lock, flags);
+ if (abs(fbc->count) <= limit)
+ fbc->limit = limit;
+ raw_spin_unlock_irqrestore(&fbc->lock, flags);
+}
+EXPORT_SYMBOL(percpu_counter_set_limit);
+
static int __init percpu_counter_startup(void)
{
compute_batch_value();
--
1.7.1
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* [RFC PATCH 2/2] xfs: Allow degeneration of m_fdblocks/m_ifree to global counters
2016-03-05 2:51 [RFC PATCH 0/2] percpu_counter: Enable switching to global counter Waiman Long
2016-03-05 2:51 ` [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system Waiman Long
@ 2016-03-05 2:51 ` Waiman Long
2016-03-05 6:34 ` [RFC PATCH 0/2] percpu_counter: Enable switching to global counter Dave Chinner
2 siblings, 0 replies; 7+ messages in thread
From: Waiman Long @ 2016-03-05 2:51 UTC (permalink / raw)
To: Tejun Heo, Christoph Lameter, Dave Chinner
Cc: Peter Zijlstra, Scott J Norton, linux-kernel, Waiman Long, xfs,
Ingo Molnar, Douglas Hatch
Small XFS filesystems on systems with a large number of CPUs can incur
significant overhead due to excessive calls to the percpu_counter_sum()
function, which has to walk through a large number of different
cachelines.
This patch uses the newly added percpu_counter_set_limit() API to
potentially switch the m_fdblocks and m_ifree per-cpu counters to
a global counter with a lock at filesystem mount time if the
filesystem's size is small relative to the number of CPUs available.
A possible use case is the use of an NVDIMM as an application scratch
storage area for log files and other small files. Current battery-backed
NVDIMMs are pretty small, e.g. 8G per DIMM, so we cannot create
large filesystems on top of them.
On a 4-socket 80-thread system running 4.5-rc6 kernel, this patch can
improve the throughput of the AIM7 XFS disk workload by 25%. Before
the patch, the perf profile was:
18.68% 0.08% reaim [k] __percpu_counter_compare
18.05% 9.11% reaim [k] __percpu_counter_sum
0.37% 0.36% reaim [k] __percpu_counter_add
After the patch, the perf profile was:
0.73% 0.36% reaim [k] __percpu_counter_add
0.27% 0.27% reaim [k] __percpu_counter_compare
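The mount-time decision can be modelled in userspace as follows (a sketch combining patch 1's threshold logic with the per-cpu limit patch 2 passes for m_fdblocks; the function name is invented for illustration):

```c
#include <assert.h>
#include <stdlib.h>

#define PERCPU_SET_LIMIT_CPU_THRESHOLD	32
#define XFS_FDBLOCKS_BATCH		1024

/*
 * Model of the percpu_counter_set_limit() decision: returns the limit
 * that would be installed (switching the counter to global mode), or 0
 * if the counter stays in per-cpu mode.
 */
static unsigned int would_set_limit(long long count, int nr_cpus,
				    unsigned int percpu_limit)
{
	unsigned int limit;

	/* Small machines and uninitialized counters keep per-cpu mode */
	if (nr_cpus <= PERCPU_SET_LIMIT_CPU_THRESHOLD || !count)
		return 0;
	limit = percpu_limit * nr_cpus;

	/* Degenerate only while the count is within the total limit */
	return (llabs(count) <= limit) ? limit : 0;
}
```

On the 80-thread test machine with the patch's per-cpu limit of 4 * XFS_FDBLOCKS_BATCH, the total limit works out to 4096 * 80 = 327680 free blocks, so only filesystems with roughly 1.25GB or less free (at 4k blocks) would run in global-counter mode — larger filesystems are unaffected.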
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
fs/xfs/xfs_mount.c | 1 -
fs/xfs/xfs_mount.h | 5 +++++
fs/xfs/xfs_super.c | 6 ++++++
3 files changed, 11 insertions(+), 1 deletions(-)
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index bb753b3..fe74b91 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1163,7 +1163,6 @@ xfs_mod_ifree(
* a large batch count (1024) to minimise global counter updates except when
* we get near to ENOSPC and we have to be very accurate with our updates.
*/
-#define XFS_FDBLOCKS_BATCH 1024
int
xfs_mod_fdblocks(
struct xfs_mount *mp,
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index b570984..d9520f4 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -206,6 +206,11 @@ typedef struct xfs_mount {
#define XFS_WSYNC_WRITEIO_LOG 14 /* 16k */
/*
+ * FD blocks batch size for per-cpu compare
+ */
+#define XFS_FDBLOCKS_BATCH 1024
+
+/*
* Allow large block sizes to be reported to userspace programs if the
* "largeio" mount option is used.
*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 59c9b7b..c0b4f79 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1412,6 +1412,12 @@ xfs_reinit_percpu_counters(
percpu_counter_set(&mp->m_icount, mp->m_sb.sb_icount);
percpu_counter_set(&mp->m_ifree, mp->m_sb.sb_ifree);
percpu_counter_set(&mp->m_fdblocks, mp->m_sb.sb_fdblocks);
+
+ /*
+ * Use default batch size for m_ifree
+ */
+ percpu_counter_set_limit(&mp->m_ifree, 0);
+ percpu_counter_set_limit(&mp->m_fdblocks, 4 * XFS_FDBLOCKS_BATCH);
}
static void
--
1.7.1
* Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter
2016-03-05 2:51 [RFC PATCH 0/2] percpu_counter: Enable switching to global counter Waiman Long
2016-03-05 2:51 ` [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system Waiman Long
2016-03-05 2:51 ` [RFC PATCH 2/2] xfs: Allow degeneration of m_fdblocks/m_ifree to global counters Waiman Long
@ 2016-03-05 6:34 ` Dave Chinner
[not found] ` <56DDBCEB.8060307@hpe.com>
2 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2016-03-05 6:34 UTC (permalink / raw)
To: Waiman Long
Cc: Christoph Lameter, Peter Zijlstra, Scott J Norton, linux-kernel,
xfs, Ingo Molnar, Douglas Hatch, Tejun Heo
On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:
> This patchset allows the degeneration of per-cpu counters back to
> global counters when:
>
> 1) The number of CPUs in the system is large, hence a high cost for
> calling percpu_counter_sum().
> 2) The initial count value is small so that it has a high chance of
> excessive percpu_counter_sum() calls.
>
> When the above 2 conditions are true, this patchset allows the user of
> per-cpu counters to selectively degenerate them into global counters
> with lock. This is done by calling the new percpu_counter_set_limit()
> API after percpu_counter_set(). Without this call, there is no change
> in the behavior of the per-cpu counters.
>
> Patch 1 implements the new percpu_counter_set_limit() API.
>
> Patch 2 modifies XFS to call the new API for the m_ifree and m_fdblocks
> per-cpu counters.
>
> Waiman Long (2):
> percpu_counter: Allow falling back to global counter on large system
> xfs: Allow degeneration of m_fdblocks/m_ifree to global counters
NACK.
This change turns off the per-cpu free block counters for the XFS
free block counters on >32p machines. We proved 10 years ago that a
global lock for these counters was a massive scalability limitation
for concurrent buffered writes on 16p machines.
IOWs, this change is going to cause fast path concurrent sequential
write regressions for just about everyone, even on empty
filesystems.
The behaviour you are seeing only occurs when the filesystem is near
ENOSPC. As I asked you last time: if you want to make this
problem go away, please increase the size of the filesystem you are
running your massively concurrent benchmarks on.
IOWs, please stop trying to optimise a filesystem slow path that:
a) 99.9% of production workloads never execute,
b) where we expect performance to degrade as allocation gets
computationally expensive as we close in on ENOSPC,
c) we start to execute blocking data flush operations that
slow everything down massively, and
d) is indicative that the workload is about to suffer
from a fatal, unrecoverable error (i.e. ENOSPC)
Cheers,
Dave.
--
Dave Chinner
dchinner@redhat.com