* [PATCH v6 7/8] mm: only IPI CPUs to drain local pages if they exist
@ 2012-01-08 16:27 Gilad Ben-Yossef
2012-01-09 16:35 ` Christoph Lameter
2012-01-11 7:04 ` Milton Miller
0 siblings, 2 replies; 7+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-08 16:27 UTC (permalink / raw)
To: linux-kernel
Cc: Gilad Ben-Yossef, Christoph Lameter, Chris Metcalf,
Peter Zijlstra, Frederic Weisbecker, Russell King, linux-mm,
Pekka Enberg, Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen,
Andrew Morton, Alexander Viro, linux-fsdevel, Avi Kivity,
Michal Nazarewicz
Calculate a cpumask of CPUs with per-cpu pages in any zone
and only send an IPI requesting CPUs to drain these pages
to the buddy allocator if they actually have pages when
asked to flush.
This patch saves 90%+ of the IPIs asking to drain per-cpu
pages in the case of severe memory pressure that leads
to OOM, since in these cases multiple, possibly concurrent,
allocation requests end up in the direct reclaim code
path. The per-cpu pages end up reclaimed on the first
allocation failure, so for most of the subsequent
allocation attempts, until the memory pressure is off
(possibly via the OOM killer), there are no per-cpu pages
on most CPUs (and there can easily be hundreds of them).
This also has the side effect of shortening the average
latency of direct reclaim by one or more orders of magnitude
since waiting for all the CPUs to ACK the IPI takes a
long time.
Tested by running "hackbench 400" on a 4 CPU x86 otherwise
idle VM and observing the difference between the number
of direct reclaim attempts that end up in drain_all_pages()
and those where more than 1/2 of the online CPUs had any
per-cpu page in them, using the vmstat counters introduced
in the next patch in the series and using /proc/interrupts.
In the test scenario, this saved around 500 global IPIs.
After triggering an OOM:
$ cat /proc/vmstat
...
pcp_global_drain 627
pcp_global_ipi_saved 578
I've also seen the number of drains reach 15k calls
with the saved percentage reaching 99% when there
are more tasks running during an OOM kill.
Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Christoph Lameter <cl@linux.com>
CC: Chris Metcalf <cmetcalf@tilera.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: linux-mm@kvack.org
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com>
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Rik van Riel <riel@redhat.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
CC: Avi Kivity <avi@redhat.com>
CC: Michal Nazarewicz <mina86@mina86.org>
---
mm/page_alloc.c | 26 +++++++++++++++++++++++++-
1 files changed, 25 insertions(+), 1 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bdc804c..dc97199 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -67,6 +67,14 @@ DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
#endif
+/*
+ * A global cpumask of CPUs with per-cpu pages that gets
+ * recomputed on each drain. We use a global cpumask
+ * here to avoid allocation on direct reclaim code path
+ * for CONFIG_CPUMASK_OFFSTACK=y
+ */
+static cpumask_var_t cpus_with_pcps;
+
#ifdef CONFIG_HAVE_MEMORYLESS_NODES
/*
* N.B., Do NOT reference the '_numa_mem_' per cpu variable directly.
@@ -1097,7 +1105,19 @@ void drain_local_pages(void *arg)
  */
 void drain_all_pages(void)
 {
-	on_each_cpu(drain_local_pages, NULL, 1);
+	int cpu;
+	struct per_cpu_pageset *pcp;
+	struct zone *zone;
+
+	for_each_online_cpu(cpu)
+		for_each_populated_zone(zone) {
+			pcp = per_cpu_ptr(zone->pageset, cpu);
+			if (pcp->pcp.count)
+				cpumask_set_cpu(cpu, cpus_with_pcps);
+			else
+				cpumask_clear_cpu(cpu, cpus_with_pcps);
+		}
+	on_each_cpu_mask(cpus_with_pcps, drain_local_pages, NULL, 1);
 }
 
 #ifdef CONFIG_HIBERNATION
@@ -3601,6 +3621,10 @@ static void setup_zone_pageset(struct zone *zone)
 void __init setup_per_cpu_pageset(void)
 {
 	struct zone *zone;
+	int ret;
+
+	ret = zalloc_cpumask_var(&cpus_with_pcps, GFP_KERNEL);
+	BUG_ON(!ret);
 
 	for_each_populated_zone(zone)
 		setup_zone_pageset(zone);
--
1.7.0.4
* Re: [PATCH v6 7/8] mm: only IPI CPUs to drain local pages if they exist
2012-01-08 16:27 [PATCH v6 7/8] mm: only IPI CPUs to drain local pages if they exist Gilad Ben-Yossef
@ 2012-01-09 16:35 ` Christoph Lameter
2012-01-09 16:47 ` Michal Nazarewicz
2012-01-11 7:04 ` Milton Miller
1 sibling, 1 reply; 7+ messages in thread
From: Christoph Lameter @ 2012-01-09 16:35 UTC (permalink / raw)
To: Gilad Ben-Yossef
Cc: linux-kernel, Chris Metcalf, Peter Zijlstra, Frederic Weisbecker,
Russell King, linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin,
Rik van Riel, Andi Kleen, Andrew Morton, Alexander Viro,
linux-fsdevel, Avi Kivity, Michal Nazarewicz
On Sun, 8 Jan 2012, Gilad Ben-Yossef wrote:
> @@ -67,6 +67,14 @@ DEFINE_PER_CPU(int, numa_node);
> EXPORT_PER_CPU_SYMBOL(numa_node);
> #endif
>
> +/*
> + * A global cpumask of CPUs with per-cpu pages that gets
> + * recomputed on each drain. We use a global cpumask
> + * here to avoid allocation on direct reclaim code path
> + * for CONFIG_CPUMASK_OFFSTACK=y
> + */
> +static cpumask_var_t cpus_with_pcps;
Move the static definition into drain_all_pages()?
> +
> #ifdef CONFIG_HAVE_MEMORYLESS_NODES
> /*
> * N.B., Do NOT reference the '_numa_mem_' per cpu variable directly.
> @@ -1097,7 +1105,19 @@ void drain_local_pages(void *arg)
> */
> void drain_all_pages(void)
> {
> - on_each_cpu(drain_local_pages, NULL, 1);
> + int cpu;
> + struct per_cpu_pageset *pcp;
> + struct zone *zone;
> +
> + for_each_online_cpu(cpu)
> + for_each_populated_zone(zone) {
> + pcp = per_cpu_ptr(zone->pageset, cpu);
> + if (pcp->pcp.count)
> + cpumask_set_cpu(cpu, cpus_with_pcps);
> + else
> + cpumask_clear_cpu(cpu, cpus_with_pcps);
> + }
> + on_each_cpu_mask(cpus_with_pcps, drain_local_pages, NULL, 1);
> }
>
> #ifdef CONFIG_HIBERNATION
> @@ -3601,6 +3621,10 @@ static void setup_zone_pageset(struct zone *zone)
> void __init setup_per_cpu_pageset(void)
> {
> struct zone *zone;
> + int ret;
> +
> + ret = zalloc_cpumask_var(&cpus_with_pcps, GFP_KERNEL);
> + BUG_ON(!ret);
>
> for_each_populated_zone(zone)
> setup_zone_pageset(zone);
>
* Re: [PATCH v6 7/8] mm: only IPI CPUs to drain local pages if they exist
2012-01-09 16:35 ` Christoph Lameter
@ 2012-01-09 16:47 ` Michal Nazarewicz
2012-01-10 12:43 ` Gilad Ben-Yossef
0 siblings, 1 reply; 7+ messages in thread
From: Michal Nazarewicz @ 2012-01-09 16:47 UTC (permalink / raw)
To: Gilad Ben-Yossef, Christoph Lameter
Cc: linux-kernel, Chris Metcalf, Peter Zijlstra, Frederic Weisbecker,
Russell King, linux-mm, Pekka Enberg, Matt Mackall, Sasha Levin,
Rik van Riel, Andi Kleen, Andrew Morton, Alexander Viro,
linux-fsdevel, Avi Kivity, Michal Nazarewicz
On Mon, 09 Jan 2012 17:35:26 +0100, Christoph Lameter <cl@linux.com> wrote:
> On Sun, 8 Jan 2012, Gilad Ben-Yossef wrote:
>
>> @@ -67,6 +67,14 @@ DEFINE_PER_CPU(int, numa_node);
>> EXPORT_PER_CPU_SYMBOL(numa_node);
>> #endif
>>
>> +/*
>> + * A global cpumask of CPUs with per-cpu pages that gets
>> + * recomputed on each drain. We use a global cpumask
>> + * here to avoid allocation on direct reclaim code path
>> + * for CONFIG_CPUMASK_OFFSTACK=y
>> + */
>> +static cpumask_var_t cpus_with_pcps;
>
> Move the static definition into drain_all_pages()?
This is initialised in setup_per_cpu_pageset() so it needs to be file scoped.
>> +
>> #ifdef CONFIG_HAVE_MEMORYLESS_NODES
>> /*
>> * N.B., Do NOT reference the '_numa_mem_' per cpu variable directly.
>> @@ -1097,7 +1105,19 @@ void drain_local_pages(void *arg)
>> */
>> void drain_all_pages(void)
>> {
>> - on_each_cpu(drain_local_pages, NULL, 1);
>> + int cpu;
>> + struct per_cpu_pageset *pcp;
>> + struct zone *zone;
>> +
>> + for_each_online_cpu(cpu)
>> + for_each_populated_zone(zone) {
>> + pcp = per_cpu_ptr(zone->pageset, cpu);
>> + if (pcp->pcp.count)
>> + cpumask_set_cpu(cpu, cpus_with_pcps);
>> + else
>> + cpumask_clear_cpu(cpu, cpus_with_pcps);
>> + }
>> + on_each_cpu_mask(cpus_with_pcps, drain_local_pages, NULL, 1);
>> }
>>
>> #ifdef CONFIG_HIBERNATION
>> @@ -3601,6 +3621,10 @@ static void setup_zone_pageset(struct zone *zone)
>> void __init setup_per_cpu_pageset(void)
>> {
>> struct zone *zone;
>> + int ret;
>> +
>> + ret = zalloc_cpumask_var(&cpus_with_pcps, GFP_KERNEL);
>> + BUG_ON(!ret);
>>
>> for_each_populated_zone(zone)
>> setup_zone_pageset(zone);
>>
>
--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michał “mina86” Nazarewicz (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--
* Re: [PATCH v6 7/8] mm: only IPI CPUs to drain local pages if they exist
2012-01-09 16:47 ` Michal Nazarewicz
@ 2012-01-10 12:43 ` Gilad Ben-Yossef
2012-01-10 12:48 ` Michal Nazarewicz
0 siblings, 1 reply; 7+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-10 12:43 UTC (permalink / raw)
To: Michal Nazarewicz
Cc: Christoph Lameter, linux-kernel, Chris Metcalf, Peter Zijlstra,
Frederic Weisbecker, Russell King, linux-mm, Pekka Enberg,
Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen,
Andrew Morton, Alexander Viro, linux-fsdevel, Avi Kivity
2012/1/9 Michal Nazarewicz <mina86@mina86.com>:
> On Mon, 09 Jan 2012 17:35:26 +0100, Christoph Lameter <cl@linux.com> wrote:
>
>> On Sun, 8 Jan 2012, Gilad Ben-Yossef wrote:
>>
>>> @@ -67,6 +67,14 @@ DEFINE_PER_CPU(int, numa_node);
>>> EXPORT_PER_CPU_SYMBOL(numa_node);
>>> #endif
>>>
>>> +/*
>>> + * A global cpumask of CPUs with per-cpu pages that gets
>>> + * recomputed on each drain. We use a global cpumask
>>> + * here to avoid allocation on direct reclaim code path
>>> + * for CONFIG_CPUMASK_OFFSTACK=y
>>> + */
>>> +static cpumask_var_t cpus_with_pcps;
>>
>>
>> Move the static definition into drain_all_pages()?
>
>
> This is initialised in setup_per_cpu_pageset() so it needs to be file
> scoped.
Yes. The cpumask_var_t abstraction is convenient and all but it does
make the allocation very non obvious when it does not happen in
proximity to the variable use - it doesn't *look* like a pointer.
"syntactic sugar causes cancer of the semicolon" and all that.
Gilad
--
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com
"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML
* Re: [PATCH v6 7/8] mm: only IPI CPUs to drain local pages if they exist
2012-01-10 12:43 ` Gilad Ben-Yossef
@ 2012-01-10 12:48 ` Michal Nazarewicz
0 siblings, 0 replies; 7+ messages in thread
From: Michal Nazarewicz @ 2012-01-10 12:48 UTC (permalink / raw)
To: Gilad Ben-Yossef
Cc: Christoph Lameter, linux-kernel, Chris Metcalf, Peter Zijlstra,
Frederic Weisbecker, Russell King, linux-mm, Pekka Enberg,
Matt Mackall, Sasha Levin, Rik van Riel, Andi Kleen,
Andrew Morton, Alexander Viro, linux-fsdevel, Avi Kivity
> 2012/1/9 Michal Nazarewicz <mina86@mina86.com>:
>> This is initialised in setup_per_cpu_pageset() so it needs to be file
>> scoped.
On Tue, 10 Jan 2012 13:43:21 +0100, Gilad Ben-Yossef <gilad@benyossef.com> wrote:
> Yes. The cpumask_var_t abstraction is convenient and all but it does
> make the allocation very non obvious when it does not happen in
> proximity to the variable use - it doesn't *look* like a pointer.
You can say that about any file scoped variable that needs non-const
initialisation, not only pointers.
--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michał “mina86” Nazarewicz (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--
* Re: [PATCH v6 7/8] mm: only IPI CPUs to drain local pages if they exist
2012-01-08 16:27 [PATCH v6 7/8] mm: only IPI CPUs to drain local pages if they exist Gilad Ben-Yossef
2012-01-09 16:35 ` Christoph Lameter
@ 2012-01-11 7:04 ` Milton Miller
2012-01-11 16:10 ` Gilad Ben-Yossef
1 sibling, 1 reply; 7+ messages in thread
From: Milton Miller @ 2012-01-11 7:04 UTC (permalink / raw)
To: Gilad Ben-Yossef, Christoph Lameter, Michal Nazarewicz,
Mel Gorman, KOSAKI Motohiro, linux-kernel
Cc: Chris Metcalf, Peter Zijlstra, Frederic Weisbecker, Russell King,
linux-mm, Pekka Enberg, Matt Mackall, Rik van Riel, Andi Kleen,
Sasha Levin, Andrew Morton, Alexander Viro, linux-fsdevel,
Avi Kivity
On Mon Jan 09 2012 about 11:47:36 EST, Michal Nazarewicz wrote:
> On Mon Jan 09 2012 11:35:37 +0100, Christoph Lameter <cl@linux.com> wrote:
> > On Sun, 8 Jan 2012, Gilad Ben-Yossef wrote:
> > >
> > > Calculate a cpumask of CPUs with per-cpu pages in any zone
> > > and only send an IPI requesting CPUs to drain these pages
> > > to the buddy allocator if they actually have pages when
> > > asked to flush.
> > >
> > > This patch saves 90%+ of the IPIs asking to drain per-cpu
> > > pages in the case of severe memory pressure that leads
> > > to OOM, since in these cases multiple, possibly concurrent,
> > > allocation requests end up in the direct reclaim code
> > > path. The per-cpu pages end up reclaimed on the first
> > > allocation failure, so for most of the subsequent
> > > allocation attempts, until the memory pressure is off
> > > (possibly via the OOM killer), there are no per-cpu pages
> > > on most CPUs (and there can easily be hundreds of them).
> > >
> > > This also has the side effect of shortening the average
> > > latency of direct reclaim by one or more orders of magnitude
> > > since waiting for all the CPUs to ACK the IPI takes a
> > > long time.
> > >
> > > Tested by running "hackbench 400" on a 4 CPU x86 otherwise
> > > idle VM and observing the difference between the number
> > > of direct reclaim attempts that end up in drain_all_pages()
> > > and those where more than 1/2 of the online CPUs had any
> > > per-cpu page in them, using the vmstat counters introduced
> > > in the next patch in the series and using /proc/interrupts.
> > >
> > > In the test scenario, this saved around 500 global IPIs.
> > > After triggering an OOM:
> > >
> > > $ cat /proc/vmstat
> > > ...
> > > pcp_global_drain 627
> > > pcp_global_ipi_saved 578
> > >
> > > I've also seen the number of drains reach 15k calls
> > > with the saved percentage reaching 99% when there
> > > are more tasks running during an OOM kill.
> > >
> > > Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
> > > Acked-by: Mel Gorman <mel@csn.ul.ie>
> > > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > ---
> > > mm/page_alloc.c | 26 +++++++++++++++++++++++++-
> > > 1 files changed, 25 insertions(+), 1 deletions(-)
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index bdc804c..dc97199 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -67,6 +67,14 @@ DEFINE_PER_CPU(int, numa_node);
> > > EXPORT_PER_CPU_SYMBOL(numa_node);
> > > #endif
> > >
> > > +/*
> > > + * A global cpumask of CPUs with per-cpu pages that gets
> > > + * recomputed on each drain. We use a global cpumask
> > > + * here to avoid allocation on direct reclaim code path
> > > + * for CONFIG_CPUMASK_OFFSTACK=y
> > > + */
> > > +static cpumask_var_t cpus_with_pcps;
> > > +
> > Move the static definition into drain_all_pages()?
>
> This is initialised in setup_per_cpu_pageset() so it needs to be file scoped.
Instead of allocating and initialising a cpumask_var_t at runtime for
the life of the kernel we could create a cpumask_t in bss. It would
trade the permanent kmalloc for some BSS (but we haven't merged the
nr_cpu_ids cpumask_size yet, so we aren't saving any memory).
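Roughly, the BSS alternative would look something like this (an
illustrative sketch only, not code from the patch):

	/* a plain cpumask_t lives in BSS; no runtime allocation is
	 * needed, even with CONFIG_CPUMASK_OFFSTACK=y */
	static cpumask_t cpus_with_pcps;

	/* ...and the call sites in drain_all_pages() would take the
	 * mask by address, e.g.: */
	cpumask_set_cpu(cpu, &cpus_with_pcps);
	on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);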
> > > #ifdef CONFIG_HAVE_MEMORYLESS_NODES
> > > /*
> > > * N.B., Do NOT reference the '_numa_mem_' per cpu variable directly.
> > > @@ -1097,7 +1105,19 @@ void drain_local_pages(void *arg)
> > > */
> > > void drain_all_pages(void)
> > > {
> > > - on_each_cpu(drain_local_pages, NULL, 1);
> > > + int cpu;
> > > + struct per_cpu_pageset *pcp;
> > > + struct zone *zone;
> > > +
> > > + for_each_online_cpu(cpu)
> > > + for_each_populated_zone(zone) {
> > > + pcp = per_cpu_ptr(zone->pageset, cpu);
> > > + if (pcp->pcp.count)
> > > + cpumask_set_cpu(cpu, cpus_with_pcps);
> > > + else
> > > + cpumask_clear_cpu(cpu, cpus_with_pcps);
> > > + }
> > > + on_each_cpu_mask(cpus_with_pcps, drain_local_pages, NULL, 1);
> >
This will only select cpus that have pcp pages in the last zone,
not pages in any populated zone.
Call cpumask_clear() before the for_each_online_cpu loop, and only
set cpus in the loop.
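Something like this (an illustrative sketch of the suggested fix for
the body of drain_all_pages(), not code from the thread):

	/* clear once up front, then only ever set bits in the loop */
	cpumask_clear(cpus_with_pcps);
	for_each_online_cpu(cpu)
		for_each_populated_zone(zone) {
			pcp = per_cpu_ptr(zone->pageset, cpu);
			if (pcp->pcp.count)
				cpumask_set_cpu(cpu, cpus_with_pcps);
		}
	on_each_cpu_mask(cpus_with_pcps, drain_local_pages, NULL, 1);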
Hmm, that means we actually need a lock for using the static mask.
-- what would happen without any lock?
[The last to store would likely see all the cpus and send them IPIs
but the earlier, racing callers would miss some cpus (It will test
our smp_call_function_many locking!), and would retry before all the
pages were drained from pcp pools]. The locking should be better
than racing -- one cpu will peek and pull all per-cpu data to itself,
then copy the mask to its smp_call_function_many element, then wait
for all the cpus it found to process their lists, pulling their
pcp data back to the owners. Not much sense in the racing cpus
fighting to bring the per-cpu data to themselves to write to the
now contended static mask while pulling the zone pcp data from the
owning cpus that are trying to push to the buddy lists.
The benefit numbers in the changelog need to be measured again
after this correction.
At first I was thinking raw_spinlock_t, but then I realized we have
filtered !__GFP_WAIT and have called cond_resched() several times, so we
can sleep. But I am not sure we really want to use a mutex: we could
end up scheduling every task trying to lock this mask in the middle
of the memory allocation slow path. What happens if we decide to oom
kill a random task waiting at this mutex, will we become unfair?
If we use an interruptible lock and are signalled, do we just skip waiting
for the per-cpu buckets to drain? What if we are trying to allocate
memory to deliver the signal? Of course, a mutex might lead us to
find a task that can make progress with the memory it already has.
Definitely could use some exploration.
I think we should start with a normal spinlock (not irq, we must respond
to an IPI here!) or a spin_trylock / cond_resched / cpu_relax loop.
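As a rough sketch of the trylock variant (illustrative only; the lock
name is made up and this is not code from the thread):

	static DEFINE_SPINLOCK(pcpu_drain_lock);	/* hypothetical */

	/* in drain_all_pages(): serialize users of the shared static
	 * mask without spinning with preemption disabled; allow
	 * rescheduling between attempts */
	while (!spin_trylock(&pcpu_drain_lock)) {
		cond_resched();
		cpu_relax();
	}
	/* ... build cpus_with_pcps and send the IPIs ... */
	spin_unlock(&pcpu_drain_lock);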
An alternative to clearing the mask at the beginning is to loop over
the zones, calculating the summary for each cpu, then make the set or
clear decision. Or letting the cpus take themselves out of the mask,
but I think both of those are worse. While it might hide the lock
I think it would only lead to cache contention on the mask and pcp.
Disabling preemption around the online loop and the call will prevent
races with cpus going offline, but we do not think that we care as the
offline notification will cause the notified cpu to drain that pool.
on_each_cpu_mask disables preemption as part of its processing.
If we document the o-e-c-m behaviour then we should consider putting a
comment here, or at least put the previous paragraph in the change log.
> > > }
> > >
> > > #ifdef CONFIG_HIBERNATION
> > > @@ -3601,6 +3621,10 @@ static void setup_zone_pageset(struct zone *zone)
> > > void __init setup_per_cpu_pageset(void)
> > > {
> > > struct zone *zone;
> > > + int ret;
> > > +
> > > + ret = zalloc_cpumask_var(&cpus_with_pcps, GFP_KERNEL);
> > > + BUG_ON(!ret);
> > >
Switching to cpumask_t will eliminate this hunk. Even if we decide
to keep it a cpumask_var_t we don't need to pre-zero it as we set
the entire mask so alloc_cpumask_var would be sufficient.
milton
* Re: [PATCH v6 7/8] mm: only IPI CPUs to drain local pages if they exist
2012-01-11 7:04 ` Milton Miller
@ 2012-01-11 16:10 ` Gilad Ben-Yossef
0 siblings, 0 replies; 7+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-11 16:10 UTC (permalink / raw)
To: Milton Miller
Cc: Christoph Lameter, Michal Nazarewicz, Mel Gorman, KOSAKI Motohiro,
linux-kernel, Chris Metcalf, Peter Zijlstra, Frederic Weisbecker,
Russell King, linux-mm, Pekka Enberg, Matt Mackall, Rik van Riel,
Andi Kleen, Sasha Levin, Andrew Morton, Alexander Viro,
linux-fsdevel, Avi Kivity
On Wed, Jan 11, 2012 at 9:04 AM, Milton Miller <miltonm@bga.com> wrote:
>> > > @@ -1097,7 +1105,19 @@ void drain_local_pages(void *arg)
>> > > */
>> > > void drain_all_pages(void)
>> > > {
>> > > - on_each_cpu(drain_local_pages, NULL, 1);
>> > > + int cpu;
>> > > + struct per_cpu_pageset *pcp;
>> > > + struct zone *zone;
>> > > +
>> > > + for_each_online_cpu(cpu)
>> > > + for_each_populated_zone(zone) {
>> > > + pcp = per_cpu_ptr(zone->pageset, cpu);
>> > > + if (pcp->pcp.count)
>> > > + cpumask_set_cpu(cpu, cpus_with_pcps);
>> > > + else
>> > > + cpumask_clear_cpu(cpu, cpus_with_pcps);
>> > > + }
>> > > + on_each_cpu_mask(cpus_with_pcps, drain_local_pages, NULL, 1);
>> >
>
> This will only select cpus that have pcp pages in the last zone,
> not pages in any populated zone.
Oy! you are very right.
>
> Call cpumask_clear() before the for_each_online_cpu loop, and only
> set cpus in the loop.
>
> Hmm, that means we actually need a lock for using the static mask.
We don't have to clear before the loop or lock. We can do something like this:
	int cpu;
	struct per_cpu_pageset *pcp;
	struct zone *zone;

	for_each_online_cpu(cpu) {
		bool has_pcps = false;
		for_each_populated_zone(zone) {
			pcp = per_cpu_ptr(zone->pageset, cpu);
			if (pcp->pcp.count) {
				has_pcps = true;
				break;
			}
		}
		if (has_pcps)
			cpumask_set_cpu(cpu, cpus_with_pcps);
		else
			cpumask_clear_cpu(cpu, cpus_with_pcps);
	}
	on_each_cpu_mask(cpus_with_pcps, drain_local_pages, NULL, 1);
>
> -- what would happen without any lock?
> [The last to store would likely see all the cpus and send them IPIs
> but the earlier, racing callers would miss some cpus (It will test
> our smp_call_function_many locking!), and would retry before all the
> pages were drained from pcp pools]. The locking should be better
> than racing -- one cpu will peek and pull all per-cpu data to itself,
> then copy the mask to its smp_call_function_many element, then wait
> for all the cpus it found to process their lists, pulling their
> pcp data back to the owners. Not much sense in the racing cpus
> fighting to bring the per-cpu data to themselves to write to the
> now contended static mask while pulling the zone pcp data from the
> owning cpus that are trying to push to the buddy lists.
>
That is a very good point and I guess you are right that the locking
saves cache bounces and probably also some IPIs.
I am not 100% sure it is a win though. The lockless approach is simpler
and has zero risk of deadlocks. I also fear that any non interruptible
lock (be it a spinlock or a non interruptible mutex) might delay an OOM
in progress.
Having the lock there is not a correctness issue - it is a performance
enhancement. Because this is not a fast path, I would go for simple
and less risk and not have the lock there at all.
> The benefit numbers in the changelog need to be measured again
> after this correction.
>
Yes, of course. Preliminary numbers with the lockless version above
still look good.
I am guessing though that Mel's patch will make all this moot anyway,
since if we're not doing a drain_all in direct reclaim we're left with
very rare code paths (memory error and memory migration) that call
drain_all, and they won't tend to do that concurrently the way direct
reclaim does.
>
>
> Disabling preemption around the online loop and the call will prevent
> races with cpus going offline, but we do not think that we care as the
> offline notification will cause the notified cpu to drain that pool.
> on_each_cpu_mask disables preemption as part of its processing.
>
> If we document the o-e-c-m behaviour then we should consider putting a
> comment here, or at least put the previous paragraph in the change log.
I noticed that Mel's patch already added get_online_cpus() in drain_all.
>
>> > > }
>> > >
>> > > #ifdef CONFIG_HIBERNATION
>> > > @@ -3601,6 +3621,10 @@ static void setup_zone_pageset(struct zone *zone)
>> > > void __init setup_per_cpu_pageset(void)
>> > > {
>> > > struct zone *zone;
>> > > + int ret;
>> > > +
>> > > + ret = zalloc_cpumask_var(&cpus_with_pcps, GFP_KERNEL);
>> > > + BUG_ON(!ret);
>> > >
>
> Switching to cpumask_t will eliminate this hunk. Even if we decide
> to keep it a cpumask_var_t we don't need to pre-zero it as we set
> the entire mask so alloc_cpumask_var would be sufficient.
>
Hmm... then we can make it a static local variable and it doesn't even
make the kernel image any bigger since it's in the BSS. Cool.
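For illustration, that would look something like this (a rough sketch
combining the lockless loop above with a function-scoped static mask;
not the final patch):

	void drain_all_pages(void)
	{
		/* a single instance in BSS, but only visible here */
		static cpumask_t cpus_with_pcps;
		int cpu;
		struct per_cpu_pageset *pcp;
		struct zone *zone;

		for_each_online_cpu(cpu) {
			bool has_pcps = false;
			for_each_populated_zone(zone) {
				pcp = per_cpu_ptr(zone->pageset, cpu);
				if (pcp->pcp.count) {
					has_pcps = true;
					break;
				}
			}
			if (has_pcps)
				cpumask_set_cpu(cpu, &cpus_with_pcps);
			else
				cpumask_clear_cpu(cpu, &cpus_with_pcps);
		}
		on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);
	}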
Thanks,
Gilad
--
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com
"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML