* [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system
2016-03-05 2:51 [RFC PATCH 0/2] percpu_counter: Enable switching to global counter Waiman Long
@ 2016-03-05 2:51 ` Waiman Long
2016-03-07 18:24 ` Christoph Lameter
2016-03-05 2:51 ` [RFC PATCH 2/2] xfs: Allow degeneration of m_fdblocks/m_ifree to global counters Waiman Long
2016-03-05 6:34 ` [RFC PATCH 0/2] percpu_counter: Enable switching to global counter Dave Chinner
2 siblings, 1 reply; 7+ messages in thread
From: Waiman Long @ 2016-03-05 2:51 UTC (permalink / raw)
To: Tejun Heo, Christoph Lameter, Dave Chinner
Cc: Peter Zijlstra, Scott J Norton, linux-kernel, Waiman Long, xfs,
Ingo Molnar, Douglas Hatch
Per-cpu counters are used in quite a number of places within
the kernel. On a large system with many CPUs, however, a
percpu_counter_sum() call can be very expensive, as nr_cpus
cachelines must be read. In __percpu_counter_compare(), the chance of
having to call percpu_counter_sum() also grows with the number of CPUs
when the global counter value is relatively small.
On such large systems, a global counter protected by a lock may
actually be faster than the percpu_counter_sum() calls that
__percpu_counter_compare() would otherwise make frequently.
This patch provides a mechanism to selectively degenerate per-cpu
counters to global counters at per-cpu counter initialization time. The
following new API is added:
percpu_counter_set_limit(struct percpu_counter *fbc,
u32 percpu_limit)
The function should be called after percpu_counter_set(). It compares
the total limit (nr_cpus * percpu_limit) against the current counter
value. If the counter value does not exceed the total limit, the
per-cpu counters are disabled and only the global counter is used. At
run time, once the counter value grows past the total limit, the
per-cpu counters are enabled again.
Runtime disabling of per-cpu counters, however, is not currently
supported as it will slow down the per-cpu fast path.
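The add-side behaviour described above can be sketched as a minimal userspace model (the struct and function names are invented for illustration; this is not the kernel code, and the notional lock is omitted):

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4

/*
 * Simplified model of the degenerate-counter scheme: while 'limit' is
 * non-zero, every update goes to the global 'count' (under the counter
 * spinlock in the real code); once |count| grows past the limit, the
 * limit is cleared and updates revert to the per-cpu fast path.
 */
struct model_counter {
	long long count;		/* global counter */
	unsigned int limit;		/* 0 => normal per-cpu mode */
	int percpu[NR_CPUS];		/* stand-in for per-cpu deltas */
};

static void model_add(struct model_counter *c, int cpu, long long amount)
{
	if (c->limit) {
		c->count += amount;		/* global-only slow path */
		if (llabs(c->count) > c->limit)
			c->limit = 0;		/* revert to per-cpu mode */
		return;
	}
	c->percpu[cpu] += amount;		/* per-cpu fast path */
}

static long long model_sum(const struct model_counter *c)
{
	long long sum = c->count;

	for (int i = 0; i < NR_CPUS; i++)
		sum += c->percpu[i];
	return sum;
}
```

Note the one-way nature of the switch: crossing the limit clears it permanently, matching the patch's statement that runtime re-degeneration is unsupported to keep the per-cpu fast path cheap.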
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
include/linux/percpu_counter.h | 10 +++++
lib/percpu_counter.c | 72 +++++++++++++++++++++++++++++++++++++++-
2 files changed, 81 insertions(+), 1 deletions(-)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 84a1094..04a3783 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -16,8 +16,14 @@
#ifdef CONFIG_SMP
+/*
+ * The per-cpu counter will be degenerated into a global counter when a limit
+ * is set at initialization time. It will change back to a real per-cpu
+ * counter once the count exceeds the given limit.
+ */
struct percpu_counter {
raw_spinlock_t lock;
+ u32 limit;
s64 count;
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
@@ -42,6 +48,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
s64 __percpu_counter_sum(struct percpu_counter *fbc);
int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
+void percpu_counter_set_limit(struct percpu_counter *fbc, u32 percpu_limit);
static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
{
@@ -170,6 +177,9 @@ static inline int percpu_counter_initialized(struct percpu_counter *fbc)
return 1;
}
+static inline void percpu_counter_set_limit(struct percpu_counter *fbc,
+ u32 percpu_limit) { }
+
#endif /* CONFIG_SMP */
static inline void percpu_counter_inc(struct percpu_counter *fbc)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index f051d69..f101c06 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -75,11 +75,25 @@ EXPORT_SYMBOL(percpu_counter_set);
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
s64 count;
+ unsigned long flags;
+
+ if (fbc->limit) {
+ raw_spin_lock_irqsave(&fbc->lock, flags);
+ if (unlikely(!fbc->limit)) {
+ raw_spin_unlock_irqrestore(&fbc->lock, flags);
+ goto percpu_add;
+ }
+ fbc->count += amount;
+ if (abs(fbc->count) > fbc->limit)
+ fbc->limit = 0; /* Revert back to per-cpu counter */
+ raw_spin_unlock_irqrestore(&fbc->lock, flags);
+ return;
+ }
+percpu_add:
preempt_disable();
count = __this_cpu_read(*fbc->counters) + amount;
if (count >= batch || count <= -batch) {
- unsigned long flags;
raw_spin_lock_irqsave(&fbc->lock, flags);
fbc->count += count;
__this_cpu_sub(*fbc->counters, count - amount);
@@ -94,6 +108,8 @@ EXPORT_SYMBOL(__percpu_counter_add);
/*
* Add up all the per-cpu counts, return the result. This is a more accurate
* but much slower version of percpu_counter_read_positive()
+ *
+ * If a limit is set, the count can be returned directly without locking.
*/
s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
@@ -101,6 +117,9 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
int cpu;
unsigned long flags;
+ if (READ_ONCE(fbc->limit))
+ return READ_ONCE(fbc->count);
+
raw_spin_lock_irqsave(&fbc->lock, flags);
ret = fbc->count;
for_each_online_cpu(cpu) {
@@ -120,6 +139,7 @@ int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
raw_spin_lock_init(&fbc->lock);
lockdep_set_class(&fbc->lock, key);
fbc->count = amount;
+ fbc->limit = 0;
fbc->counters = alloc_percpu_gfp(s32, gfp);
if (!fbc->counters)
return -ENOMEM;
@@ -202,6 +222,9 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
s64 count;
count = percpu_counter_read(fbc);
+ if (READ_ONCE(fbc->limit))
+ goto compare;
+
/* Check to see if rough count will be sufficient for comparison */
if (abs(count - rhs) > (batch * num_online_cpus())) {
if (count > rhs)
@@ -211,6 +234,7 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
}
/* Need to use precise count */
count = percpu_counter_sum(fbc);
+compare:
if (count > rhs)
return 1;
else if (count < rhs)
@@ -220,6 +244,52 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
}
EXPORT_SYMBOL(__percpu_counter_compare);
+/*
+ * Set the limit if the count is less than the given per-cpu limit * # of cpus.
+ *
+ * This function should only be called at initialization time right after
+ * percpu_counter_set(). The limit will only be set if there are more than
+ * 32 CPUs in the system and the current counter value is not bigger than
+ * the limit. Once it is set, it can be cleared as soon as the counter
+ * value exceeds the given limit and real per-cpu counters are used again.
+ * However, switching from per-cpu counters back to global counter is not
+ * currently supported as that will slow down the per-cpu counter fastpath.
+ *
+ * The magic number 32 is chosen to be a compromise between the cost of
+ * reading all the per-cpu counters and that of locking. It can be changed
+ * if there is a better value.
+ */
+#define PERCPU_SET_LIMIT_CPU_THRESHOLD 32
+void percpu_counter_set_limit(struct percpu_counter *fbc, u32 percpu_limit)
+{
+ unsigned long flags;
+ int nrcpus = num_possible_cpus();
+ u32 limit;
+
+ if (nrcpus <= PERCPU_SET_LIMIT_CPU_THRESHOLD)
+ return;
+
+ if (!fbc->count) {
+ WARN(1, "percpu_counter_set_limit() called without an initial counter value!\n");
+ return;
+ }
+ /*
+ * Use default batch size if the given percpu limit is 0.
+ */
+ if (!percpu_limit)
+ percpu_limit = percpu_counter_batch;
+ limit = percpu_limit * nrcpus;
+
+ /*
+ * Limit will not be set if the count is large enough
+ */
+ raw_spin_lock_irqsave(&fbc->lock, flags);
+ if (abs(fbc->count) <= limit)
+ fbc->limit = limit;
+ raw_spin_unlock_irqrestore(&fbc->lock, flags);
+}
+EXPORT_SYMBOL(percpu_counter_set_limit);
+
static int __init percpu_counter_startup(void)
{
compute_batch_value();
--
1.7.1
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* [RFC PATCH 2/2] xfs: Allow degeneration of m_fdblocks/m_ifree to global counters
2016-03-05 2:51 [RFC PATCH 0/2] percpu_counter: Enable switching to global counter Waiman Long
2016-03-05 2:51 ` [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system Waiman Long
@ 2016-03-05 2:51 ` Waiman Long
2016-03-05 6:34 ` [RFC PATCH 0/2] percpu_counter: Enable switching to global counter Dave Chinner
2 siblings, 0 replies; 7+ messages in thread
From: Waiman Long @ 2016-03-05 2:51 UTC (permalink / raw)
To: Tejun Heo, Christoph Lameter, Dave Chinner
Cc: Peter Zijlstra, Scott J Norton, linux-kernel, Waiman Long, xfs,
Ingo Molnar, Douglas Hatch
Small XFS filesystems on systems with a large number of CPUs can incur
significant overhead due to excessive calls to the percpu_counter_sum()
function, which has to walk through a large number of different
cachelines.
This patch uses the newly added percpu_counter_set_limit() API to
potentially switch the m_fdblocks and m_ifree per-cpu counters to
a global counter with a lock at filesystem mount time if the
filesystem's size is small relative to the number of CPUs available.
A possible use case is the use of an NVDIMM as an application scratch
storage area for log files and other small files. Current battery-backed
NVDIMMs are pretty small, e.g. 8G per DIMM, so we cannot create
large filesystems on top of them.
On a 4-socket 80-thread system running 4.5-rc6 kernel, this patch can
improve the throughput of the AIM7 XFS disk workload by 25%. Before
the patch, the perf profile was:
18.68% 0.08% reaim [k] __percpu_counter_compare
18.05% 9.11% reaim [k] __percpu_counter_sum
0.37% 0.36% reaim [k] __percpu_counter_add
After the patch, the perf profile was:
0.73% 0.36% reaim [k] __percpu_counter_add
0.27% 0.27% reaim [k] __percpu_counter_compare
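The mount-time decision can be modelled in userspace as follows (a sketch combining patch 1's threshold logic with the per-cpu limit patch 2 passes for m_fdblocks; the function name is invented for illustration):

```c
#include <assert.h>
#include <stdlib.h>

#define PERCPU_SET_LIMIT_CPU_THRESHOLD	32
#define XFS_FDBLOCKS_BATCH		1024

/*
 * Model of the percpu_counter_set_limit() decision: returns the limit
 * that would be installed (switching the counter to global mode), or 0
 * if the counter stays in per-cpu mode.
 */
static unsigned int would_set_limit(long long count, int nr_cpus,
				    unsigned int percpu_limit)
{
	unsigned int limit;

	/* Small machines and uninitialized counters keep per-cpu mode */
	if (nr_cpus <= PERCPU_SET_LIMIT_CPU_THRESHOLD || !count)
		return 0;
	limit = percpu_limit * nr_cpus;

	/* Degenerate only while the count is within the total limit */
	return (llabs(count) <= limit) ? limit : 0;
}
```

On the 80-thread test machine with the patch's per-cpu limit of 4 * XFS_FDBLOCKS_BATCH, the total limit works out to 4096 * 80 = 327680 free blocks, so only filesystems with roughly 1.25GB or less free (at 4k blocks) would run in global-counter mode — larger filesystems are unaffected.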
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
fs/xfs/xfs_mount.c | 1 -
fs/xfs/xfs_mount.h | 5 +++++
fs/xfs/xfs_super.c | 6 ++++++
3 files changed, 11 insertions(+), 1 deletions(-)
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index bb753b3..fe74b91 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1163,7 +1163,6 @@ xfs_mod_ifree(
* a large batch count (1024) to minimise global counter updates except when
* we get near to ENOSPC and we have to be very accurate with our updates.
*/
-#define XFS_FDBLOCKS_BATCH 1024
int
xfs_mod_fdblocks(
struct xfs_mount *mp,
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index b570984..d9520f4 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -206,6 +206,11 @@ typedef struct xfs_mount {
#define XFS_WSYNC_WRITEIO_LOG 14 /* 16k */
/*
+ * FD blocks batch size for per-cpu compare
+ */
+#define XFS_FDBLOCKS_BATCH 1024
+
+/*
* Allow large block sizes to be reported to userspace programs if the
* "largeio" mount option is used.
*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 59c9b7b..c0b4f79 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1412,6 +1412,12 @@ xfs_reinit_percpu_counters(
percpu_counter_set(&mp->m_icount, mp->m_sb.sb_icount);
percpu_counter_set(&mp->m_ifree, mp->m_sb.sb_ifree);
percpu_counter_set(&mp->m_fdblocks, mp->m_sb.sb_fdblocks);
+
+ /*
+ * Use default batch size for m_ifree
+ */
+ percpu_counter_set_limit(&mp->m_ifree, 0);
+ percpu_counter_set_limit(&mp->m_fdblocks, 4 * XFS_FDBLOCKS_BATCH);
}
static void
--
1.7.1
* Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter
2016-03-05 2:51 [RFC PATCH 0/2] percpu_counter: Enable switching to global counter Waiman Long
2016-03-05 2:51 ` [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system Waiman Long
2016-03-05 2:51 ` [RFC PATCH 2/2] xfs: Allow degeneration of m_fdblocks/m_ifree to global counters Waiman Long
@ 2016-03-05 6:34 ` Dave Chinner
[not found] ` <56DDBCEB.8060307@hpe.com>
2 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2016-03-05 6:34 UTC (permalink / raw)
To: Waiman Long
Cc: Christoph Lameter, Peter Zijlstra, Scott J Norton, linux-kernel,
xfs, Ingo Molnar, Douglas Hatch, Tejun Heo
On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:
> This patchset allows the degeneration of per-cpu counters back to
> global counters when:
>
> 1) The number of CPUs in the system is large, hence a high cost for
> calling percpu_counter_sum().
> 2) The initial count value is small so that it has a high chance of
> excessive percpu_counter_sum() calls.
>
> When the above 2 conditions are true, this patchset allows the user of
> per-cpu counters to selectively degenerate them into global counters
> with lock. This is done by calling the new percpu_counter_set_limit()
> API after percpu_counter_set(). Without this call, there is no change
> in the behavior of the per-cpu counters.
>
> Patch 1 implements the new percpu_counter_set_limit() API.
>
> Patch 2 modifies XFS to call the new API for the m_ifree and m_fdblocks
> per-cpu counters.
>
> Waiman Long (2):
> percpu_counter: Allow falling back to global counter on large system
> xfs: Allow degeneration of m_fdblocks/m_ifree to global counters
NACK.
This change turns off the per-cpu free block counters for the XFS
free block counters on >32p machines. We proved 10 years ago that a
global lock for these counters was a massive scalability limitation
for concurrent buffered writes on 16p machines.
IOWs, this change is going to cause fast path concurrent sequential
write regressions for just about everyone, even on empty
filesystems.
The behaviour you are seeing only occurs when the filesystem is near
ENOSPC. As I asked you last time: if you want to make this
problem go away, please increase the size of the filesystem you are
running your massively concurrent benchmarks on.
IOWs, please stop trying to optimise a filesystem slow path that:
a) 99.9% of production workloads never execute,
b) where we expect performance to degrade as allocation gets
computationally expensive as we close in on ENOSPC,
c) we start to execute blocking data flush operations that
slow everything down massively, and
d) is indicative that the workload is about to suffer
from a fatal, unrecoverable error (i.e. ENOSPC)
Cheers,
Dave.
--
Dave Chinner
dchinner@redhat.com