linux-mm.kvack.org archive mirror
* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
       [not found] <20111012160202.GA18666@sgi.com>
@ 2011-10-12 19:01 ` Andrew Morton
  2011-10-12 19:57   ` Christoph Lameter
       [not found]   ` <CADE8fzrdMOBF1RyyEpMVi8aKcgOVKRQSKi0=c1Qvh3p6hHcXRA@mail.gmail.com>
  0 siblings, 2 replies; 25+ messages in thread
From: Andrew Morton @ 2011-10-12 19:01 UTC (permalink / raw)
  To: Dimitri Sivanich; +Cc: linux-kernel, linux-mm, Christoph Lameter

On Wed, 12 Oct 2011 11:02:02 -0500
Dimitri Sivanich <sivanich@sgi.com> wrote:

> Tmpfs I/O throughput testing on UV systems has shown writeback contention
> between multiple writer threads (even when each thread writes to a separate
> tmpfs mount point).
> 
> A large part of this is caused by cacheline contention reading the vm_stat
> array in the __vm_enough_memory check.
> 
> The attached test patch illustrates a possible avenue for improvement in this
> area.  By locally caching the values read from vm_stat (and refreshing the
> values after 2 seconds), I was able to improve tmpfs writeback performance from
> ~300 MB/sec to ~700 MB/sec with 120 threads writing data simultaneously to
> files on separate tmpfs mount points (tested on 3.1.0-rc9).
> 
> Note that this patch is simply to illustrate the gains that can be made here.
> What I'm looking for is some guidance on an acceptable way to accomplish the
> task of reducing contention in this area, either by caching these values in a
> way similar to the attached patch, or by some other mechanism if this is
> unacceptable.

Yes, the global vm_stat[] array is a problem - I'm surprised it's hung
around for this long.  Altering the sysctl_overcommit_memory mode will
hide the problem, but that's no good.

I think we've discussed switching vm_stat[] to a contention-avoiding
counter scheme.  Simply using <percpu_counter.h> would be the simplest
approach.  They'll introduce inaccuracies but hopefully any problems
from that will be minor for the global page counters.

otoh, I think we've been round this loop before and I don't recall why
nothing happened.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href="mailto:dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-12 19:01 ` [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory Andrew Morton
@ 2011-10-12 19:57   ` Christoph Lameter
  2011-10-13 15:06     ` Mel Gorman
  2011-10-13 15:23     ` Dimitri Sivanich
       [not found]   ` <CADE8fzrdMOBF1RyyEpMVi8aKcgOVKRQSKi0=c1Qvh3p6hHcXRA@mail.gmail.com>
  1 sibling, 2 replies; 25+ messages in thread
From: Christoph Lameter @ 2011-10-12 19:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Dimitri Sivanich, linux-kernel, linux-mm, Mel Gorman

On Wed, 12 Oct 2011, Andrew Morton wrote:

> > Note that this patch is simply to illustrate the gains that can be made here.
> > What I'm looking for is some guidance on an acceptable way to accomplish the
> > task of reducing contention in this area, either by caching these values in a
> > way similar to the attached patch, or by some other mechanism if this is
> > unacceptable.
>
> Yes, the global vm_stat[] array is a problem - I'm surprised it's hung
> around for this long.  Altering the sysctl_overcommit_memory mode will
> hide the problem, but that's no good.

The global vm_stat array keeps the state for the zones. It would be
even more expensive to calculate this at every point where we need such
data.

> I think we've discussed switching vm_stat[] to a contention-avoiding
> counter scheme.  Simply using <percpu_counter.h> would be the simplest
> approach.  They'll introduce inaccuracies but hopefully any problems
> from that will be minor for the global page counters.

We already have a contention avoiding scheme for counter updates in
vmstat.c. The problem here is that vm_stat is frequently read. Updates
from other cpus that fold counter updates in a deferred way into the
global statistics cause cacheline eviction. The updates occur too frequently
in this workload.

> otoh, I think we've been round this loop before and I don't recall why
> nothing happened.

The update behavior can be tuned using /proc/sys/vm/stat_interval.
Increase the interval to reduce the folding into the global counter (set
maybe to 10?). This will reduce contention. The other approach is to
increase the allowed delta per zone if frequent updates occur via the
overflow checks in vmstat.c. See calculate_*_threshold there.
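
[Editorial note: the interval Christoph mentions is a runtime sysctl;
the value 10 below is illustrative.]

```shell
# Fold per-cpu vmstat deltas into the global counters every 10 seconds
# instead of the default 1 second (requires root):
echo 10 > /proc/sys/vm/stat_interval
# equivalently:
sysctl -w vm.stat_interval=10
```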

Note that the deltas are currently reduced for memory pressure situations
(after recent patches by Mel). This will cause a significant increase in
vm_stat cacheline contention compared to earlier kernels.



* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
       [not found]   ` <CADE8fzrdMOBF1RyyEpMVi8aKcgOVKRQSKi0=c1Qvh3p6hHcXRA@mail.gmail.com>
@ 2011-10-13  0:07     ` Tim Chen
  2011-10-13 14:15       ` Christoph Lameter
  0 siblings, 1 reply; 25+ messages in thread
From: Tim Chen @ 2011-10-13  0:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dimitri Sivanich, linux-kernel, linux-mm, Christoph Lameter, ak

Andrew Morton wrote:

> Yes, the global vm_stat[] array is a problem - I'm surprised it's hung
> around for this long.  Altering the sysctl_overcommit_memory mode will
> hide the problem, but that's no good.
> 
> I think we've discussed switching vm_stat[] to a contention-avoiding
> counter scheme.  Simply using <percpu_counter.h> would be the simplest
> approach.  They'll introduce inaccuracies but hopefully any problems
> from that will be minor for the global page counters.
> 
> otoh, I think we've been round this loop before and I don't recall why
> nothing happened.

Yeah, we have had this discussion on vm_enough_memory before.  

https://lkml.org/lkml/2011/1/26/473

The current version of the per cpu counter was not really suitable because
the batch size is not appropriate.  I tried using a per cpu counter
with an adjusted batch size in my attempt.  Andrew has suggested having an
elastic batch size that's proportional to the size of the central
counter, but I haven't gotten around to trying that out.

Tim


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-13  0:07     ` Tim Chen
@ 2011-10-13 14:15       ` Christoph Lameter
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Lameter @ 2011-10-13 14:15 UTC (permalink / raw)
  To: Tim Chen; +Cc: Andrew Morton, Dimitri Sivanich, linux-kernel, linux-mm, ak

On Wed, 12 Oct 2011, Tim Chen wrote:

> Yeah, we have had this discussion on vm_enough_memory before.
>
> https://lkml.org/lkml/2011/1/26/473
>
> The current version of per cpu counter was not really suitable because
> the batch size is not appropriate.  I've tried to use per cpu counter
> with batch size adjusted in my attempt.  Andrew has suggested having an
> elastic batch size that's proportional to the size of the central
> counter but I haven't gotten around to try that out.

These counters are already managed as ZVC counters. It may be easiest to
adjust the batching parameters for those to solve this issue.


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-12 19:57   ` Christoph Lameter
@ 2011-10-13 15:06     ` Mel Gorman
  2011-10-13 15:59       ` Andi Kleen
  2011-10-13 15:23     ` Dimitri Sivanich
  1 sibling, 1 reply; 25+ messages in thread
From: Mel Gorman @ 2011-10-13 15:06 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, Dimitri Sivanich, linux-kernel, linux-mm

On Wed, Oct 12, 2011 at 02:57:53PM -0500, Christoph Lameter wrote:
> > I think we've discussed switching vm_stat[] to a contention-avoiding
> > counter scheme.  Simply using <percpu_counter.h> would be the simplest
> > approach.  They'll introduce inaccuracies but hopefully any problems
> > from that will be minor for the global page counters.
> 
> We already have a contention avoiding scheme for counter updates in
> vmstat.c. The problem here is that vm_stat is frequently read. Updates
> from other cpus that fold counter updates in a deferred way into the
> global statistics cause cacheline eviction. The updates occur too frequently
> in this workload.
> 

There is also a correctness issue to be concerned with. In the patch,
there is a two second window during which the counters are not being
read. This increases the risk that the system gets too overcommitted
when overcommit_memory == OVERCOMMIT_GUESS.

If vm_enough_memory is being heavily hit as well, it implies that this
workload is mmap-intensive which is pretty inefficient in itself. I
guess it would also apply to workloads that are malloc-intensive for
large buffers but I'd expect the cache line bounces to only dominate if
there was little or no computation on the resulting buffers.

As a result, I wonder how realistic this test workload is and how useful
fixing this problem is in general?

> > otoh, I think we've been round this loop before and I don't recall why
> > nothing happened.
> 
> The update behavior can be tuned using /proc/sys/vm/stat_interval.
> Increase the interval to reduce the folding into the global counter (set
> maybe to 10?). This will reduce contention.

Unless the thresholds for per-cpu drift are being hit. If they are
allocating and freeing pages in large numbers for example, we'll be
calling __mod_zone_page_state(NR_FREE_PAGES) in large batches,
overflowing the counters, calling zone_page_state_add() and dirtying the
global vm_stat that way. In that case, increasing stat_interval alone is
not the answer.

> The other approach is to
> increase the allowed delta per zone if frequent updates occur via the
> overflow checks in vmstat.c. See calculate_*_threshold there.
> 

If this approach is taken, be careful that the threshold is an s8, so it
is limited in size.

> Note that the deltas are currently reduced for memory pressure situations
> (after recent patches by Mel). This will cause a significant increase in
> vm_stat cacheline contention compared to earlier kernels.
> 

That statement is misleading. The thresholds are reduced while
kswapd is awake to avoid the possibility of all memory being allocated
and the machine livelocking. If the system is under enough pressure for
kswapd to be awake for prolonged periods of time, the overhead of cache
line bouncing while updating vm_stat is going to be a lesser concern.

I like the idea of the threshold being scaled under normal circumstances
depending on the size of the central counter. Conceivably it could be
done as part of refresh_cpu_vm_stats() using the old value of the
central counter while walking each per_cpu_pageset.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-12 19:57   ` Christoph Lameter
  2011-10-13 15:06     ` Mel Gorman
@ 2011-10-13 15:23     ` Dimitri Sivanich
  2011-10-13 15:54       ` Christoph Lameter
  1 sibling, 1 reply; 25+ messages in thread
From: Dimitri Sivanich @ 2011-10-13 15:23 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman

On Wed, Oct 12, 2011 at 02:57:53PM -0500, Christoph Lameter wrote:
> On Wed, 12 Oct 2011, Andrew Morton wrote:
> 
> > > Note that this patch is simply to illustrate the gains that can be made here.
> > > What I'm looking for is some guidance on an acceptable way to accomplish the
> > > task of reducing contention in this area, either by caching these values in a
> > > way similar to the attached patch, or by some other mechanism if this is
> > > unacceptable.
> >
> > Yes, the global vm_stat[] array is a problem - I'm surprised it's hung
> > around for this long.  Altering the sysctl_overcommit_memory mode will
> > hide the problem, but that's no good.
> 
> The global vm_stat array is keeping the state for the zone. It would be
> even more expensive to calculate this at every point where we need such
> data.
> 
> > I think we've discussed switching vm_stat[] to a contention-avoiding
> > counter scheme.  Simply using <percpu_counter.h> would be the simplest
> > approach.  They'll introduce inaccuracies but hopefully any problems
> > from that will be minor for the global page counters.
> 
> We already have a contention avoiding scheme for counter updates in
> vmstat.c. The problem here is that vm_stat is frequently read. Updates
> from other cpus that fold counter updates in a deferred way into the
> global statistics cause cacheline eviction. The updates occur too frequently
> in this workload.

The test I did reduced the frequency of reads by __vm_enough_memory by caching
the values and updating them every two seconds (in the OVERCOMMIT_GUESS area).
> 
> > otoh, I think we've been round this loop before and I don't recall why
> > nothing happened.
> 
> The update behavior can be tuned using /proc/sys/vm/stat_interval.
> Increase the interval to reduce the folding into the global counter (set
> maybe to 10?). This will reduce contention. The other approach is to

Increasing this interval to 10 (or even 100) had no effect on the vm_stat
contention on a 640 cpu test system, so vmstat_update() is not the culprit.

> increase the allowed delta per zone if frequent updates occur via the
> overflow checks in vmstat.c. See calculate_*_threshold there.

I tried changing the threshold in both directions, with slower throughput in
both cases.

> 
> Note that the deltas are currently reduced for memory pressure situations
> (after recent patches by Mel). This will cause a significant increase in
> vm_stat cacheline contention compared to earlier kernels.
> 


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-13 15:23     ` Dimitri Sivanich
@ 2011-10-13 15:54       ` Christoph Lameter
  2011-10-13 20:50         ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2011-10-13 15:54 UTC (permalink / raw)
  To: Dimitri Sivanich; +Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman

On Thu, 13 Oct 2011, Dimitri Sivanich wrote:

> > increase the allowed delta per zone if frequent updates occur via the
> > overflow checks in vmstat.c. See calculate_*_threshold there.
>
> I tried changing the threshold in both directions, with slower throughput in
> both cases.

If that is the case, check for the vm_stat cacheline being shared with
another hot kernel variable. Maybe that causes cacheline
eviction.

If there are no updates occurring for a while (due to increased deltas
and/or vmstat updates) then the vm_stat cacheline should be able to stay
in shared mode in multiple processors and the performance should increase.



* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-13 15:06     ` Mel Gorman
@ 2011-10-13 15:59       ` Andi Kleen
  0 siblings, 0 replies; 25+ messages in thread
From: Andi Kleen @ 2011-10-13 15:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, Andrew Morton, Dimitri Sivanich, linux-kernel,
	linux-mm

Mel Gorman <mel@csn.ul.ie> writes:
>
> If vm_enough_memory is being heavily hit as well, it implies that this
> workload is mmap-intensive which is pretty inefficient in itself. I

Saw it with tmpfs originally. No need to be mmap intensive. Just
do lots of IOs on tmpfs.

> guess it would also apply to workloads that are malloc-intensive for
> large buffers but I'd expect the cache line bounces to only dominate if
> there was little or no computation on the resulting buffers.

I think you severely underestimate the costs of bouncing cache lines
on >2S.

> As a result, I wonder how realistic this test workload is and how useful
> fixing this problem is in general?

It's kind of bad if tmpfs doesn't scale.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-13 15:54       ` Christoph Lameter
@ 2011-10-13 20:50         ` Andrew Morton
  2011-10-13 21:02           ` Christoph Lameter
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2011-10-13 20:50 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Dimitri Sivanich, linux-kernel, linux-mm, Mel Gorman

On Thu, 13 Oct 2011 10:54:30 -0500 (CDT)
Christoph Lameter <cl@gentwo.org> wrote:

> On Thu, 13 Oct 2011, Dimitri Sivanich wrote:
> 
> > > increase the allowed delta per zone if frequent updates occur via the
> > > overflow checks in vmstat.c. See calculate_*_threshold there.
> >
> > I tried changing the threshold in both directions, with slower throughput in
> > both cases.
> 
> If that is the case, check for the vm_stat cacheline being shared with
> another hot kernel variable. Maybe that causes cacheline
> eviction.

yup.  `nm -n vmlinux'.
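
[Editorial note: one way to eyeball this. The symbol table below is
fabricated for the demonstration; on a real build, replace the printf
with `nm -n vmlinux`.]

```shell
# nm -n sorts symbols by address, so the lines around vm_stat are its
# neighbours in memory; anything within the same 64-byte span shares
# its cacheline. Shown here on a made-up symbol table:
printf '%s\n' \
    'ffffffff81a00048 B some_hot_var' \
    'ffffffff81a00050 B vm_stat' \
    'ffffffff81a000c0 B unrelated_var' |
grep -B 1 -A 1 ' vm_stat$'
# some_hot_var (0x...48) and vm_stat (0x...50) both fall in the
# 64-byte line starting at 0x...40, so they would bounce together.
```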

> If there are no updates occurring for a while (due to increased deltas
> and/or vmstat updates) then the vm_stat cacheline should be able to stay
> in shared mode in multiple processors and the performance should increase.
> 

We could cacheline align vm_stat[].  But the thing is pretty small - we
could put each entry in its own cacheline.



* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-13 20:50         ` Andrew Morton
@ 2011-10-13 21:02           ` Christoph Lameter
  2011-10-13 21:24             ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2011-10-13 21:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Dimitri Sivanich, linux-kernel, linux-mm, Mel Gorman

On Thu, 13 Oct 2011, Andrew Morton wrote:

> > If there are no updates occurring for a while (due to increased deltas
> > and/or vmstat updates) then the vm_stat cacheline should be able to stay
> > in shared mode in multiple processors and the performance should increase.
> >
>
> We could cacheline align vm_stat[].  But the thing is pretty small - we
> could put each entry in its own cacheline.

Which in turn would increase the cache footprint of some key kernel
functions (because they need multiple vm_stat entries) and cause eviction
of other cachelines that then reduce overall system performance again.


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-13 21:02           ` Christoph Lameter
@ 2011-10-13 21:24             ` Andrew Morton
  2011-10-14 12:25               ` Dimitri Sivanich
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2011-10-13 21:24 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Dimitri Sivanich, linux-kernel, linux-mm, Mel Gorman

On Thu, 13 Oct 2011 16:02:58 -0500 (CDT)
Christoph Lameter <cl@gentwo.org> wrote:

> On Thu, 13 Oct 2011, Andrew Morton wrote:
> 
> > > If there are no updates occurring for a while (due to increased deltas
> > > and/or vmstat updates) then the vm_stat cacheline should be able to stay
> > > in shared mode in multiple processors and the performance should increase.
> > >
> >
> > We could cacheline align vm_stat[].  But the thing is pretty small - we
> > could put each entry in its own cacheline.
> 
> Which in turn would increase the cache footprint of some key kernel
> functions (because they need multiple vm_stat entries) and cause eviction
> of other cachelines that then reduce overall system performance again.

Sure, but we gain performance by not having different CPUs treading on
each other when they update different vmstat fields.  Sometimes one
effect will win and other times the other effect will win.  Some
engineering is needed..


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-13 21:24             ` Andrew Morton
@ 2011-10-14 12:25               ` Dimitri Sivanich
  2011-10-14 13:50                 ` Dimitri Sivanich
  0 siblings, 1 reply; 25+ messages in thread
From: Dimitri Sivanich @ 2011-10-14 12:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-kernel, linux-mm, Mel Gorman

On Thu, Oct 13, 2011 at 02:24:34PM -0700, Andrew Morton wrote:
> On Thu, 13 Oct 2011 16:02:58 -0500 (CDT)
> Christoph Lameter <cl@gentwo.org> wrote:
> 
> > On Thu, 13 Oct 2011, Andrew Morton wrote:
> > 
> > > > If there are no updates occurring for a while (due to increased deltas
> > > > and/or vmstat updates) then the vm_stat cacheline should be able to stay
> > > > in shared mode in multiple processors and the performance should increase.
> > > >
> > >
> > > We could cacheline align vm_stat[].  But the thing is pretty small - we
> > > could put each entry in its own cacheline.
> > 
> > Which in turn would increase the cache footprint of some key kernel
> > functions (because they need multiple vm_stat entries) and cause eviction
> > of other cachelines that then reduce overall system performance again.
> 
> Sure, but we gain performance by not having different CPUs treading on
> each other when they update different vmstat fields.  Sometimes one
> effect will win and other times the other effect will win.  Some
> engineering is needed..

I think the first step is to determine the role (if any) that false sharing
may be playing in this, since that's a simpler fix (cacheline align and pad
the array).

Then, if necessary, I will look at contention issues within the array.


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-14 12:25               ` Dimitri Sivanich
@ 2011-10-14 13:50                 ` Dimitri Sivanich
  2011-10-14 13:57                   ` Christoph Lameter
  0 siblings, 1 reply; 25+ messages in thread
From: Dimitri Sivanich @ 2011-10-14 13:50 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter, linux-kernel, linux-mm,
	Mel Gorman

On Fri, Oct 14, 2011 at 07:25:06AM -0500, Dimitri Sivanich wrote:
> On Thu, Oct 13, 2011 at 02:24:34PM -0700, Andrew Morton wrote:
> > On Thu, 13 Oct 2011 16:02:58 -0500 (CDT)
> > Christoph Lameter <cl@gentwo.org> wrote:
> > 
> > > On Thu, 13 Oct 2011, Andrew Morton wrote:
> > > 
> > > > > If there are no updates occurring for a while (due to increased deltas
> > > > > and/or vmstat updates) then the vm_stat cacheline should be able to stay
> > > > > in shared mode in multiple processors and the performance should increase.
> > > > >
> > > >
> > > > We could cacheline align vm_stat[].  But the thing is pretty small - we
> > > > could put each entry in its own cacheline.
> > > 
> > > Which in turn would increase the cache footprint of some key kernel
> > > functions (because they need multiple vm_stat entries) and cause eviction
> > > of other cachelines that then reduce overall system performance again.
> > 
> > Sure, but we gain performance by not having different CPUs treading on
> > each other when they update different vmstat fields.  Sometimes one
> > effect will win and other times the other effect will win.  Some
> > engineering is needed..
> 
> I think the first step is to determine the role (if any) that false
> sharing may be playing in this, since that's a simpler fix (cacheline
> align and pad the array).
>

Testing on a smaller machine with 46 writer threads in parallel (my original
test used 120).

Looks as though cache-aligning and padding the end of the vm_stat array
results in a ~150 MB/sec speedup.  This is a nice improvement for only 46
writer threads, though it's not the full ~250 MB/sec speedup I get from
setting OVERCOMMIT_NEVER.


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-14 13:50                 ` Dimitri Sivanich
@ 2011-10-14 13:57                   ` Christoph Lameter
  2011-10-14 14:19                     ` Dimitri Sivanich
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2011-10-14 13:57 UTC (permalink / raw)
  To: Dimitri Sivanich; +Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman

On Fri, 14 Oct 2011, Dimitri Sivanich wrote:

> Testing on a smaller machine with 46 writer threads in parallel (my original
> test used 120).
>
> Looks as though cache-aligning and padding the end of the vm_stat array
> results in a ~150 MB/sec speedup.  This is a nice improvement for only 46
> writer threads, though it's not the full ~250 MB/sec speedup I get from
> setting OVERCOMMIT_NEVER.

Add to this the increase in the deltas for the ZVCs and change the stat
interval to 10 sec?


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-14 13:57                   ` Christoph Lameter
@ 2011-10-14 14:19                     ` Dimitri Sivanich
  2011-10-14 14:34                       ` Christoph Lameter
  0 siblings, 1 reply; 25+ messages in thread
From: Dimitri Sivanich @ 2011-10-14 14:19 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman

On Fri, Oct 14, 2011 at 08:57:16AM -0500, Christoph Lameter wrote:
> On Fri, 14 Oct 2011, Dimitri Sivanich wrote:
> 
> > Testing on a smaller machine with 46 writer threads in parallel (my original
> > test used 120).
> >
> > Looks as though cache-aligning and padding the end of the vm_stat array
> > results in a ~150 MB/sec speedup.  This is a nice improvement for only 46
> > writer threads, though it's not the full ~250 MB/sec speedup I get from
> > setting OVERCOMMIT_NEVER.
> 
> Add to this the increase in the deltas for the ZVCs and change the stat
> interval to 10 sec?

Increasing the ZVC deltas (threshold value in calculate*threshold == 125)
does -seem- to give a small speedup in this case (maybe as much as 50 MB/sec?).

Changing the stat interval to 10 seconds still has no effect, with or without
the ZVC delta change.


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-14 14:19                     ` Dimitri Sivanich
@ 2011-10-14 14:34                       ` Christoph Lameter
  2011-10-14 15:18                         ` Christoph Lameter
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2011-10-14 14:34 UTC (permalink / raw)
  To: Dimitri Sivanich; +Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman

On Fri, 14 Oct 2011, Dimitri Sivanich wrote:

> Increasing the ZVC deltas (threshold value in calculate*threshold == 125)
> does -seem- to give a small speedup in this case (maybe as much as 50 MB/sec?).

Hmm... The question is how much the VM paths used for the critical path
increment the vmstat counters on average per second. If we end up with
hundreds of updates per second from each thread then we still have a
problem that can only be addressed by increasing the deltas beyond 125,
meaning the field width must be increased to support 16 bit counters.


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-14 14:34                       ` Christoph Lameter
@ 2011-10-14 15:18                         ` Christoph Lameter
  2011-10-14 16:16                           ` Dimitri Sivanich
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2011-10-14 15:18 UTC (permalink / raw)
  To: Dimitri Sivanich; +Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman

Also the whole thing could be optimized by concentrating updates to the
vm_stat array at one point in time. If any local per cpu differential
overflows then update all the counters in the same cacheline for which we have per cpu
differentials.

That will defer another acquisition of the cacheline for the next delta
overflowing. After an update all the per cpu differentials would be zero.

This could be added to zone_page_state_add....


Something like this patch? (Restriction of the updates to the same
cacheline missing. Just does everything and the zone_page_state may need
uninlining now)

---
 include/linux/vmstat.h |   19 ++++++++++++++++---
 mm/vmstat.c            |   10 ++++------
 2 files changed, 20 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h	2011-10-14 09:58:03.000000000 -0500
+++ linux-2.6/include/linux/vmstat.h	2011-10-14 10:08:00.000000000 -0500
@@ -90,10 +90,23 @@ static inline void vm_events_fold_cpu(in
 extern atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];

 static inline void zone_page_state_add(long x, struct zone *zone,
-				 enum zone_stat_item item)
+				 enum zone_stat_item item, s8 new_value)
 {
-	atomic_long_add(x, &zone->vm_stat[item]);
-	atomic_long_add(x, &vm_stat[item]);
+	enum zone_stat_item i;
+
+	for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
+		long y;
+
+		if (i == item)
+			y = this_cpu_xchg(zone->pageset->vm_stat_diff[i], new_value) + x;
+		else
+			y = this_cpu_xchg(zone->pageset->vm_stat_diff[i], 0);
+
+		if (y) {
+			atomic_long_add(y, &zone->vm_stat[item]);
+			atomic_long_add(y, &vm_stat[item]);
+		}
+	}
 }

 static inline unsigned long global_page_state(enum zone_stat_item item)
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2011-10-14 10:04:20.000000000 -0500
+++ linux-2.6/mm/vmstat.c	2011-10-14 10:08:39.000000000 -0500
@@ -221,7 +221,7 @@ void __mod_zone_page_state(struct zone *
 	t = __this_cpu_read(pcp->stat_threshold);

 	if (unlikely(x > t || x < -t)) {
-		zone_page_state_add(x, zone, item);
+		zone_page_state_add(x, zone, item, 0);
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
@@ -262,8 +262,7 @@ void __inc_zone_state(struct zone *zone,
 	if (unlikely(v > t)) {
 		s8 overstep = t >> 1;

-		zone_page_state_add(v + overstep, zone, item);
-		__this_cpu_write(*p, -overstep);
+		zone_page_state_add(v + overstep, zone, item, -overstep);
 	}
 }

@@ -284,8 +283,7 @@ void __dec_zone_state(struct zone *zone,
 	if (unlikely(v < - t)) {
 		s8 overstep = t >> 1;

-		zone_page_state_add(v - overstep, zone, item);
-		__this_cpu_write(*p, overstep);
+		zone_page_state_add(v - overstep, zone, item, overstep);
 	}
 }

@@ -343,7 +341,7 @@ static inline void mod_state(struct zone
 	} while (this_cpu_cmpxchg(*p, o, n) != o);

 	if (z)
-		zone_page_state_add(z, zone, item);
+		zone_page_state_add(z, zone, item, 0);
 }

 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-14 15:18                         ` Christoph Lameter
@ 2011-10-14 16:16                           ` Dimitri Sivanich
  2011-10-18 13:48                             ` Dimitri Sivanich
  0 siblings, 1 reply; 25+ messages in thread
From: Dimitri Sivanich @ 2011-10-14 16:16 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman

On Fri, Oct 14, 2011 at 10:18:24AM -0500, Christoph Lameter wrote:
> Also the whole thing could be optimized by concentrating updates to the
> vm_stat array at one point in time. If any local per cpu differential
> overflows then update all the counters in the same cacheline for which we have per cpu
> differentials.
> 
> That will defer another acquisition of the cacheline for the next delta
> overflowing. After an update all the per cpu differentials would be zero.
> 
> This could be added to zone_page_state_add....
> 
> 
> Something like this patch? (Restriction of the updates to the same
> cacheline missing. Just does everything and the zone_page_state may need
> uninlining now)

This patch doesn't have much, if any, effect, at least in the 46 writer thread
case (NR_VM_EVENT_ITEMS-->NR_VM_ZONE_STAT_ITEMS allowed it to boot :) ).
I applied this with the change to align vm_stat.

So far, cache alignment of vm_stat and increasing the ZVC delta have had the
greatest effect.

> [quoted patch snipped]


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-14 16:16                           ` Dimitri Sivanich
@ 2011-10-18 13:48                             ` Dimitri Sivanich
  2011-10-18 14:36                               ` Christoph Lameter
  2011-10-18 15:48                               ` Andi Kleen
  0 siblings, 2 replies; 25+ messages in thread
From: Dimitri Sivanich @ 2011-10-18 13:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm; +Cc: Christoph Lameter, Andrew Morton, Mel Gorman

On Fri, Oct 14, 2011 at 11:16:03AM -0500, Dimitri Sivanich wrote:
> On Fri, Oct 14, 2011 at 10:18:24AM -0500, Christoph Lameter wrote:
> > Also the whole thing could be optimized by concentrating updates to the
> > vm_stat array at one point in time. If any local per cpu differential
> > overflows then update all the counters in the same cacheline for which we have per cpu
> > differentials.
> > 
> > That will defer another acquisition of the cacheline for the next delta
> > overflowing. After an update all the per cpu differentials would be zero.
> > 
> > This could be added to zone_page_state_add....
> > 
> > 
> > Something like this patch? (Restriction of the updates to the same
> > cacheline missing. Just does everything and the zone_page_state may need
> > uninlining now)
> 
> This patch doesn't have much, if any, effect, at least in the 46 writer thread
> case (NR_VM_EVENT_ITEMS-->NR_VM_ZONE_STAT_ITEMS allowed it to boot :) ).
> I applied this with the change to align vm_stat.
> 
> So far, cache alignment of vm_stat and increasing the ZVC delta have had the
> greatest effect.

After further testing, substantial increases in ZVC delta along with cache alignment
of the vm_stat array bring the tmpfs writeback throughput numbers to about where
they are with vm.overcommit_memory==OVERCOMMIT_NEVER.  I still need to determine how
high the ZVC delta needs to be to achieve this performance, but it is greater than 125.

Would it make sense to have the ZVC delta be tuneable (via /proc/sys/vm?), keeping the
same default behavior as what we currently have?

If the thresholds get set higher, it could be that some values that don't normally have
as big a delta may not get updated frequently enough.  Should we maybe update all values
every time a threshold is hit, as the patch below was intending?

Note that having each counter in a separate cacheline does not have much, if any,
effect.

> 
> > [quoted patch snipped]


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-18 13:48                             ` Dimitri Sivanich
@ 2011-10-18 14:36                               ` Christoph Lameter
  2011-10-18 15:48                               ` Andi Kleen
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Lameter @ 2011-10-18 14:36 UTC (permalink / raw)
  To: Dimitri Sivanich; +Cc: linux-kernel, linux-mm, Andrew Morton, Mel Gorman

On Tue, 18 Oct 2011, Dimitri Sivanich wrote:

> After further testing, substantial increases in ZVC delta along with cache alignment
> of the vm_stat array bring the tmpfs writeback throughput numbers to about where
> they are with vm.overcommit_memory==OVERCOMMIT_NEVER.  I still need to determine how
> high the ZVC delta needs to be to achieve this performance, but it is greater than 125.

Sounds like this is the way to go then.

> Would it make sense to have the ZVC delta be tuneable (via /proc/sys/vm?), keeping the
> same default behavior as what we currently have?

I think so.

> If the thresholds get set higher, it could be that some values that don't normally have
> as big a delta may not get updated frequently enough.  Should we maybe update all values
> every time a threshold is hit, as the patch below was intending?

Mel can probably chime in on the accuracy needed for reclaim etc. We
already have an automatic reduction of the delta if the vm gets into
problems.

> Note that having each counter in a separate cacheline does not have much, if any,
> effect.

It may have a good effect if you group the counters according to their
uses into different cachelines.  Counters that are typically updated
together need to be close to each other.  Also, you could modify my patch to
only update counters in the same cacheline.  I think updating all counters
caused the problems with that patch, because we now touch multiple
cachelines and increase the cache footprint of critical vm functions.


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-18 13:48                             ` Dimitri Sivanich
  2011-10-18 14:36                               ` Christoph Lameter
@ 2011-10-18 15:48                               ` Andi Kleen
  2011-10-19  1:16                                 ` David Rientjes
  1 sibling, 1 reply; 25+ messages in thread
From: Andi Kleen @ 2011-10-18 15:48 UTC (permalink / raw)
  To: Dimitri Sivanich
  Cc: linux-kernel, linux-mm, Christoph Lameter, Andrew Morton,
	Mel Gorman

Dimitri Sivanich <sivanich@sgi.com> writes:
>
> Would it make sense to have the ZVC delta be tuneable (via /proc/sys/vm?), keeping the
> same default behavior as what we currently have?

Tunable is bad. We don't really want a "hundreds of lines magic shell script to
make large systems perform". Please find a way to auto tune.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-18 15:48                               ` Andi Kleen
@ 2011-10-19  1:16                                 ` David Rientjes
  2011-10-19 14:54                                   ` Dimitri Sivanich
  0 siblings, 1 reply; 25+ messages in thread
From: David Rientjes @ 2011-10-19  1:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dimitri Sivanich, linux-kernel, linux-mm, Christoph Lameter,
	Andrew Morton, Mel Gorman

On Tue, 18 Oct 2011, Andi Kleen wrote:

> > Would it make sense to have the ZVC delta be tuneable (via /proc/sys/vm?), keeping the
> > same default behavior as what we currently have?
> 
> Tunable is bad. We don't really want a "hundreds of lines magic shell script to
> make large systems perform". Please find a way to auto tune.
> 

Agreed, and I think even if we had a tunable it would result in 
potentially erratic VM performance, because some areas depend on "fairly 
accurate" ZVCs and it wouldn't be clear that you're trading off other unknown 
VM issues that will affect your workload because you've increased the 
deltas.  Let's try to avoid having to ask "what is your ZVC delta tunable 
set at?" when someone reports a bug about reclaim stopping prematurely.

That said, perhaps we need higher deltas by default and then hints in key 
areas in the form of sync_stats_if_delta_above(x) calls that would do 
zone_page_state_add() only when that kind of precision is actually needed.  
For public interfaces, that would be very easy to audit to see what the 
level of precision is when parsing the data.


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-19  1:16                                 ` David Rientjes
@ 2011-10-19 14:54                                   ` Dimitri Sivanich
  2011-10-19 15:31                                     ` Christoph Lameter
  0 siblings, 1 reply; 25+ messages in thread
From: Dimitri Sivanich @ 2011-10-19 14:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andi Kleen, linux-kernel, linux-mm, Christoph Lameter,
	Andrew Morton, Mel Gorman

On Tue, Oct 18, 2011 at 06:16:21PM -0700, David Rientjes wrote:
> On Tue, 18 Oct 2011, Andi Kleen wrote:
> 
> > > Would it make sense to have the ZVC delta be tuneable (via /proc/sys/vm?), keeping the
> > > same default behavior as what we currently have?
> > 
> > Tunable is bad. We don't really want a "hundreds of lines magic shell script to
> > make large systems perform". Please find a way to auto tune.
> > 
> 
> Agreed, and I think even if we had a tunable it would result in 
> potentially erratic VM performance, because some areas depend on "fairly 
> accurate" ZVCs and it wouldn't be clear that you're trading off other unknown 
> VM issues that will affect your workload because you've increased the 
> deltas.  Let's try to avoid having to ask "what is your ZVC delta tunable 
> set at?" when someone reports a bug about reclaim stopping prematurely.

Yes, I'm inclined to agree.

> 
> That said, perhaps we need higher deltas by default and then hints in key 
> areas in the form of sync_stats_if_delta_above(x) calls that would do 
> zone_page_state_add() only when that kind of precision is actually needed.  
> For public interfaces, that would be very easy to audit to see what the 
> level of precision is when parsing the data.

I did some manual tuning to see what deltas would be needed to achieve the
greatest tmpfs writeback performance on a system with 640 cpus and 64 nodes:

For 120 threads writing in parallel (each to its own mount point), the
threshold needs to be on the order of 1000.  At a threshold of 750, I
start to see a slowdown of 50-60 MB/sec.

For 400 threads writing in parallel, the threshold needs to be on the order
of 2000 (although we're off by about 40 MB/sec at that point).

The necessary deltas in these cases are quite a bit higher than the current
125 maximum (see calculate*threshold in mm/vmstat.c).

I like the idea of having certain areas triggering vm_stat sync, as long
as we know what those key areas are and how often they might be called.


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-19 14:54                                   ` Dimitri Sivanich
@ 2011-10-19 15:31                                     ` Christoph Lameter
  2011-10-24 14:59                                       ` Dimitri Sivanich
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2011-10-19 15:31 UTC (permalink / raw)
  To: Dimitri Sivanich
  Cc: David Rientjes, Andi Kleen, linux-kernel, linux-mm, Andrew Morton,
	Mel Gorman

On Wed, 19 Oct 2011, Dimitri Sivanich wrote:

> For 120 threads writing in parallel (each to its own mount point), the
> threshold needs to be on the order of 1000.  At a threshold of 750, I
> start to see a slowdown of 50-60 MB/sec.
>
> For 400 threads writing in parallel, the threshold needs to be on the order
> of 2000 (although we're off by about 40 MB/sec at that point).
>
> The necessary deltas in these cases are quite a bit higher than the current
> 125 maximum (see calculate*threshold in mm/vmstat.c).
>
> I like the idea of having certain areas triggering vm_stat sync, as long
> as we know what those key areas are and how often they might be called.

You could potentially reduce the maximum necessary by applying my earlier
patch (but please reduce the counters touched to the current cacheline).
That should reduce the number of updates in the global cacheline and allow
you to reduce the very high deltas that you have to deal with now.


* Re: [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory
  2011-10-19 15:31                                     ` Christoph Lameter
@ 2011-10-24 14:59                                       ` Dimitri Sivanich
  0 siblings, 0 replies; 25+ messages in thread
From: Dimitri Sivanich @ 2011-10-24 14:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Andi Kleen, linux-kernel, linux-mm, Andrew Morton,
	Mel Gorman

On Wed, Oct 19, 2011 at 10:31:54AM -0500, Christoph Lameter wrote:
> On Wed, 19 Oct 2011, Dimitri Sivanich wrote:
> 
> > For 120 threads writing in parallel (each to its own mount point), the
> > threshold needs to be on the order of 1000.  At a threshold of 750, I
> > start to see a slowdown of 50-60 MB/sec.
> >
> > For 400 threads writing in parallel, the threshold needs to be on the order
> > of 2000 (although we're off by about 40 MB/sec at that point).
> >
> > The necessary deltas in these cases are quite a bit higher than the current
> > 125 maximum (see calculate*threshold in mm/vmstat.c).
> >
> > I like the idea of having certain areas triggering vm_stat sync, as long
> > as we know what those key areas are and how often they might be called.
> 
> You could potentially reduce the maximum necessary by applying my earlier
> patch (but please reduce the counters touched to the current cacheline).
> That should reduce the number of updates in the global cacheline and allow
> you to reduce the very high deltas that you have to deal with now.

I tried updating whole, single vm_stat cachelines as you suggest, but that
made little if any difference in tmpfs writeback performance.  The same higher
threshold values were still necessary to significantly reduce the contention
seen in __vm_enough_memory.


end of thread, other threads:[~2011-10-24 14:59 UTC | newest]

Thread overview: 25+ messages
     [not found] <20111012160202.GA18666@sgi.com>
2011-10-12 19:01 ` [PATCH] Reduce vm_stat cacheline contention in __vm_enough_memory Andrew Morton
2011-10-12 19:57   ` Christoph Lameter
2011-10-13 15:06     ` Mel Gorman
2011-10-13 15:59       ` Andi Kleen
2011-10-13 15:23     ` Dimitri Sivanich
2011-10-13 15:54       ` Christoph Lameter
2011-10-13 20:50         ` Andrew Morton
2011-10-13 21:02           ` Christoph Lameter
2011-10-13 21:24             ` Andrew Morton
2011-10-14 12:25               ` Dimitri Sivanich
2011-10-14 13:50                 ` Dimitri Sivanich
2011-10-14 13:57                   ` Christoph Lameter
2011-10-14 14:19                     ` Dimitri Sivanich
2011-10-14 14:34                       ` Christoph Lameter
2011-10-14 15:18                         ` Christoph Lameter
2011-10-14 16:16                           ` Dimitri Sivanich
2011-10-18 13:48                             ` Dimitri Sivanich
2011-10-18 14:36                               ` Christoph Lameter
2011-10-18 15:48                               ` Andi Kleen
2011-10-19  1:16                                 ` David Rientjes
2011-10-19 14:54                                   ` Dimitri Sivanich
2011-10-19 15:31                                     ` Christoph Lameter
2011-10-24 14:59                                       ` Dimitri Sivanich
     [not found]   ` <CADE8fzrdMOBF1RyyEpMVi8aKcgOVKRQSKi0=c1Qvh3p6hHcXRA@mail.gmail.com>
2011-10-13  0:07     ` Tim Chen
2011-10-13 14:15       ` Christoph Lameter
