meminfo Committed_AS underflows

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* meminfo Committed_AS underflows
@ 2009-04-14 19:33 Dave Hansen
  2009-04-15  2:04 ` KOSAKI Motohiro
  2009-04-15  8:33 ` Alan Cox
  0 siblings, 2 replies; 12+ messages in thread
From: Dave Hansen @ 2009-04-14 19:33 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel@vger.kernel.org, Eric B Munson, Mel Gorman,
	Christoph Lameter

I have a set of ppc64 machines that seem to spontaneously get underflows
in /proc/meminfo's Committed_AS field:
        
        # while true; do cat /proc/meminfo  | grep _AS; sleep 1; done | uniq -c
              1 Committed_AS: 18446744073709323392 kB
             11 Committed_AS: 18446744073709455488 kB
              6 Committed_AS:    35136 kB
              5 Committed_AS: 18446744073709454400 kB
              7 Committed_AS:    35904 kB
              3 Committed_AS: 18446744073709453248 kB
              2 Committed_AS:    34752 kB
              9 Committed_AS: 18446744073709453248 kB
              8 Committed_AS:    34752 kB
              3 Committed_AS: 18446744073709320960 kB
              7 Committed_AS: 18446744073709454080 kB
              3 Committed_AS: 18446744073709320960 kB
              5 Committed_AS: 18446744073709454080 kB
              6 Committed_AS: 18446744073709320960 kB

As you can see, it bounces in and out of it.  I think the problem is
here:
        
        #define ACCT_THRESHOLD  max(16, NR_CPUS * 2)
        ...
        void vm_acct_memory(long pages)
        {
                long *local;
        
                preempt_disable();
                local = &__get_cpu_var(committed_space);
                *local += pages;
                if (*local > ACCT_THRESHOLD || *local < -ACCT_THRESHOLD) {
                        atomic_long_add(*local, &vm_committed_space);
                        *local = 0;
                }
                preempt_enable();
        }

Plus, some joker set CONFIG_NR_CPUS=1024.

nr_cpus (1024) * 2 * page_size (64k) = 128MB.  That means each cpu can
skew the counter by 128MB.  With 1024 CPUs that means that we can have
~128GB of outstanding percpu accounting that meminfo doesn't see.  Let's
say we do vm_acct_memory(128MB-1) on 1023 of the CPUs, then on the other
CPU, we do  vm_acct_memory(-128GB).

The 1023 cpus won't ever hit the ACCT_THRESHOLD.  The 1 CPU that did
will decrement the global 'vm_committed_space'  by ~128 GB.  Underflow.
Yay.  This happens on a much smaller scale now.

Should we be protecting meminfo so that it spits slightly more sane
numbers out to the user?

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: meminfo Committed_AS underflows
  2009-04-14 19:33 meminfo Committed_AS underflows Dave Hansen
@ 2009-04-15  2:04 ` KOSAKI Motohiro
  2009-04-15  3:34   ` Balbir Singh
  2009-04-15  4:12   ` KOSAKI Motohiro
  2009-04-15  8:33 ` Alan Cox
  1 sibling, 2 replies; 12+ messages in thread
From: KOSAKI Motohiro @ 2009-04-15  2:04 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kosaki.motohiro, linux-mm, linux-kernel@vger.kernel.org,
	Eric B Munson, Mel Gorman, Christoph Lameter

> I have a set of ppc64 machines that seem to spontaneously get underflows
> in /proc/meminfo's Committed_AS field:
>         
>         # while true; do cat /proc/meminfo  | grep _AS; sleep 1; done | uniq -c
>               1 Committed_AS: 18446744073709323392 kB
>              11 Committed_AS: 18446744073709455488 kB
>               6 Committed_AS:    35136 kB
>               5 Committed_AS: 18446744073709454400 kB
>               7 Committed_AS:    35904 kB
>               3 Committed_AS: 18446744073709453248 kB
>               2 Committed_AS:    34752 kB
>               9 Committed_AS: 18446744073709453248 kB
>               8 Committed_AS:    34752 kB
>               3 Committed_AS: 18446744073709320960 kB
>               7 Committed_AS: 18446744073709454080 kB
>               3 Committed_AS: 18446744073709320960 kB
>               5 Committed_AS: 18446744073709454080 kB
>               6 Committed_AS: 18446744073709320960 kB
> 
> As you can see, it bounces in and out of it.  I think the problem is
> here:
>         
>         #define ACCT_THRESHOLD  max(16, NR_CPUS * 2)
>         ...
>         void vm_acct_memory(long pages)
>         {
>                 long *local;
>         
>                 preempt_disable();
>                 local = &__get_cpu_var(committed_space);
>                 *local += pages;
>                 if (*local > ACCT_THRESHOLD || *local < -ACCT_THRESHOLD) {
>                         atomic_long_add(*local, &vm_committed_space);
>                         *local = 0;
>                 }
>                 preempt_enable();
>         }
> 
> Plus, some joker set CONFIG_NR_CPUS=1024.
> 
> nr_cpus (1024) * 2 * page_size (64k) = 128MB.  That means each cpu can
> skew the counter by 128MB.  With 1024 CPUs that means that we can have
> ~128GB of outstanding percpu accounting that meminfo doesn't see.  Let's
> say we do vm_acct_memory(128MB-1) on 1023 of the CPUs, then on the other
> CPU, we do  vm_acct_memory(-128GB).
> 
> The 1023 cpus won't ever hit the ACCT_THRESHOLD.  The 1 CPU that did
> will decrement the global 'vm_committed_space'  by ~128 GB.  Underflow.
> Yay.  This happens on a much smaller scale now.
> 
> Should we be protecting meminfo so that it spits slightly more sane
> numbers out to the user?

Can you try to this patch? (Oh well, I can't reproduce this underflow
on my small machine)

===============

Dave Hansen reported committed_AS field can underfolow.

>         # while true; do cat /proc/meminfo  | grep _AS; sleep 1; done | uniq -c
>               1 Committed_AS: 18446744073709323392 kB
>              11 Committed_AS: 18446744073709455488 kB
>               6 Committed_AS:    35136 kB
>               5 Committed_AS: 18446744073709454400 kB
>               7 Committed_AS:    35904 kB
>               3 Committed_AS: 18446744073709453248 kB
>               2 Committed_AS:    34752 kB
>               9 Committed_AS: 18446744073709453248 kB
>               8 Committed_AS:    34752 kB
>               3 Committed_AS: 18446744073709320960 kB
>               7 Committed_AS: 18446744073709454080 kB
>               3 Committed_AS: 18446744073709320960 kB
>               5 Committed_AS: 18446744073709454080 kB
>               6 Committed_AS: 18446744073709320960 kB

Because NR_CPU can be greater than 1000. and meminfo_proc_show()
doesn't have underflow check.

this patch have two change.

1. Change NR_CPU to nr_online_cpus()
   vm_acct_memory() isn't fast-path. then cpumask_weight() calculation
   isn't so expensive and the parameter for scalability issue should
   consider number of _physical_ cpus. not theoretical maximum number.
2. Add under-flow check to meminfo_proc_show().
   Almost field in /proc/meminfo have underflow check. but Committed_AS
   is significant exeption.
   it should do.

Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/meminfo.c |    2 ++
 mm/swap.c         |    2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: b/fs/proc/meminfo.c
===================================================================
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -36,6 +36,8 @@ static int meminfo_proc_show(struct seq_
 	si_meminfo(&i);
 	si_swapinfo(&i);
 	committed = atomic_long_read(&vm_committed_space);
+	if (committed < 0)
+		committed = 0;
 	allowed = ((totalram_pages - hugetlb_total_pages())
 		* sysctl_overcommit_ratio / 100) + total_swap_pages;
 
Index: b/mm/swap.c
===================================================================
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -519,7 +519,7 @@ EXPORT_SYMBOL(pagevec_lookup_tag);
  * We tolerate a little inaccuracy to avoid ping-ponging the counter between
  * CPUs
  */
-#define ACCT_THRESHOLD	max(16, NR_CPUS * 2)
+#define ACCT_THRESHOLD	max_t(long, 16, num_online_cpus() * 2)
 
 static DEFINE_PER_CPU(long, committed_space);
 




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: meminfo Committed_AS underflows
  2009-04-15  2:04 ` KOSAKI Motohiro
@ 2009-04-15  3:34   ` Balbir Singh
  2009-04-15  4:10     ` KOSAKI Motohiro
  2009-04-15  4:12   ` KOSAKI Motohiro
  1 sibling, 1 reply; 12+ messages in thread
From: Balbir Singh @ 2009-04-15  3:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Hansen, linux-mm, linux-kernel@vger.kernel.org,
	Eric B Munson, Mel Gorman, Christoph Lameter

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-04-15 11:04:59]:

>  	committed = atomic_long_read(&vm_committed_space);
> +	if (committed < 0)
> +		committed = 0;

Isn't this like pushing the problem under the rug?

>  	allowed = ((totalram_pages - hugetlb_total_pages())
>  		* sysctl_overcommit_ratio / 100) + total_swap_pages;
> 
> Index: b/mm/swap.c
> ===================================================================
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -519,7 +519,7 @@ EXPORT_SYMBOL(pagevec_lookup_tag);
>   * We tolerate a little inaccuracy to avoid ping-ponging the counter between
>   * CPUs
>   */
> -#define ACCT_THRESHOLD	max(16, NR_CPUS * 2)
> +#define ACCT_THRESHOLD	max_t(long, 16, num_online_cpus() * 2)
>

Hmm.. this is a one time expansion, free of CPU hotplug.

Should we use nr_cpu_ids or num_possible_cpus()?
 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: meminfo Committed_AS underflows
  2009-04-15  3:34   ` Balbir Singh
@ 2009-04-15  4:10     ` KOSAKI Motohiro
  2009-04-15  8:47       ` Balbir Singh
  0 siblings, 1 reply; 12+ messages in thread
From: KOSAKI Motohiro @ 2009-04-15  4:10 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, Dave Hansen, linux-mm,
	linux-kernel@vger.kernel.org, Eric B Munson, Mel Gorman,
	Christoph Lameter

> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-04-15 11:04:59]:
> 
> >  	committed = atomic_long_read(&vm_committed_space);
> > +	if (committed < 0)
> > +		committed = 0;
> 
> Isn't this like pushing the problem under the rug?

global_page_state() already has same logic.
IOW almost meminfo filed has this one (except Commited_AS).


> >  	allowed = ((totalram_pages - hugetlb_total_pages())
> >  		* sysctl_overcommit_ratio / 100) + total_swap_pages;
> > 
> > Index: b/mm/swap.c
> > ===================================================================
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -519,7 +519,7 @@ EXPORT_SYMBOL(pagevec_lookup_tag);
> >   * We tolerate a little inaccuracy to avoid ping-ponging the counter between
> >   * CPUs
> >   */
> > -#define ACCT_THRESHOLD	max(16, NR_CPUS * 2)
> > +#define ACCT_THRESHOLD	max_t(long, 16, num_online_cpus() * 2)
> >
> 
> Hmm.. this is a one time expansion, free of CPU hotplug.
> 
> Should we use nr_cpu_ids or num_possible_cpus()?

#define num_online_cpus()       cpumask_weight(cpu_online_mask)
#define num_possible_cpus()     cpumask_weight(cpu_possible_mask)

num_possible_cpus() have the same calculation cost.
nr_cpu_ids isn't proper value.
it point to valid cpu-id range, no related number of online nor possible cpus.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: meminfo Committed_AS underflows
  2009-04-15  4:10     ` KOSAKI Motohiro
@ 2009-04-15  8:47       ` Balbir Singh
  2009-04-27 20:27         ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Balbir Singh @ 2009-04-15  8:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Hansen, linux-mm, linux-kernel@vger.kernel.org,
	Eric B Munson, Mel Gorman, Christoph Lameter

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-04-15 13:10:06]:

> > * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-04-15 11:04:59]:
> > 
> > >  	committed = atomic_long_read(&vm_committed_space);
> > > +	if (committed < 0)
> > > +		committed = 0;
> > 
> > Isn't this like pushing the problem under the rug?
> 
> global_page_state() already has same logic.
> IOW almost meminfo filed has this one (except Commited_AS).
>

OK
 
> 
> > >  	allowed = ((totalram_pages - hugetlb_total_pages())
> > >  		* sysctl_overcommit_ratio / 100) + total_swap_pages;
> > > 
> > > Index: b/mm/swap.c
> > > ===================================================================
> > > --- a/mm/swap.c
> > > +++ b/mm/swap.c
> > > @@ -519,7 +519,7 @@ EXPORT_SYMBOL(pagevec_lookup_tag);
> > >   * We tolerate a little inaccuracy to avoid ping-ponging the counter between
> > >   * CPUs
> > >   */
> > > -#define ACCT_THRESHOLD	max(16, NR_CPUS * 2)
> > > +#define ACCT_THRESHOLD	max_t(long, 16, num_online_cpus() * 2)
> > >
> > 
> > Hmm.. this is a one time expansion, free of CPU hotplug.
> > 
> > Should we use nr_cpu_ids or num_possible_cpus()?
> 
> #define num_online_cpus()       cpumask_weight(cpu_online_mask)
> #define num_possible_cpus()     cpumask_weight(cpu_possible_mask)
> 
> num_possible_cpus() have the same calculation cost.
> nr_cpu_ids isn't proper value.
> it point to valid cpu-id range, no related number of online nor possible cpus.
>

Since the value is just a basis for thresholds, num_online_cpus()
might work. 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: meminfo Committed_AS underflows
  2009-04-15  8:47       ` Balbir Singh
@ 2009-04-27 20:27         ` Andrew Morton
  2009-04-28  3:07           ` KOSAKI Motohiro
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2009-04-27 20:27 UTC (permalink / raw)
  To: balbir; +Cc: kosaki.motohiro, dave, linux-mm, linux-kernel, ebmunson, mel, cl

On Wed, 15 Apr 2009 14:17:13 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-04-15 13:10:06]:
> 
> > > * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-04-15 11:04:59]:
> > > 
> > > >  	committed = atomic_long_read(&vm_committed_space);
> > > > +	if (committed < 0)
> > > > +		committed = 0;
> > > 

Is there a reason why we can't use a boring old percpu_counter for
vm_committed_space?  That way the meminfo code can just use
percpu_counter_read_positive().

Or perhaps just percpu_counter_read().  The percpu_counter code does a
better job of handling large cpu counts than the
mysteriously-duplicative open-coded stuff we have there.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: meminfo Committed_AS underflows
  2009-04-27 20:27         ` Andrew Morton
@ 2009-04-28  3:07           ` KOSAKI Motohiro
  2009-04-28  3:27             ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28  3:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, balbir, dave, linux-mm, linux-kernel, ebmunson,
	mel, cl

> On Wed, 15 Apr 2009 14:17:13 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-04-15 13:10:06]:
> > 
> > > > * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-04-15 11:04:59]:
> > > > 
> > > > >  	committed = atomic_long_read(&vm_committed_space);
> > > > > +	if (committed < 0)
> > > > > +		committed = 0;
> > > > 
> 
> Is there a reason why we can't use a boring old percpu_counter for
> vm_committed_space?  That way the meminfo code can just use
> percpu_counter_read_positive().
> 
> Or perhaps just percpu_counter_read().  The percpu_counter code does a
> better job of handling large cpu counts than the
> mysteriously-duplicative open-coded stuff we have there.

At that time, I thought smallest patch is better because it can send -stable
tree easily.
but maybe I was wrong. it made bikeshed discussion :(

ok, I'm going to right way.


=========================================
Subject: [PATCH] fix Committed_AS underfolow on large NR_CPUS environment

As reported by Dave Hansen, the Committed_AS field can underflow in certain
situations:

>         # while true; do cat /proc/meminfo  | grep _AS; sleep 1; done | uniq -c
>               1 Committed_AS: 18446744073709323392 kB
>              11 Committed_AS: 18446744073709455488 kB
>               6 Committed_AS:    35136 kB
>               5 Committed_AS: 18446744073709454400 kB
>               7 Committed_AS:    35904 kB
>               3 Committed_AS: 18446744073709453248 kB
>               2 Committed_AS:    34752 kB
>               9 Committed_AS: 18446744073709453248 kB
>               8 Committed_AS:    34752 kB
>               3 Committed_AS: 18446744073709320960 kB
>               7 Committed_AS: 18446744073709454080 kB
>               3 Committed_AS: 18446744073709320960 kB
>               5 Committed_AS: 18446744073709454080 kB
>               6 Committed_AS: 18446744073709320960 kB

Because NR_CPUS can be greater than 1000 and meminfo_proc_show() does not check
for underflow.

But NR_CPUS proportional isn't good calculation. In general, possibility of
lock contention is proportional to the number of online cpus, not theorical
maximum cpus (NR_CPUS).
the current kernel has generic percpu-counter stuff. using it is right way.
it makes code simplify and percpu_counter_read_positive() don't make underflow issue.


Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Eric B Munson <ebmunson@us.ibm.com>
---
 fs/proc/meminfo.c    |    2 +-
 include/linux/mman.h |    9 +++------
 mm/mmap.c            |   12 ++++++------
 mm/nommu.c           |   13 +++++++------
 mm/swap.c            |   46 ----------------------------------------------
 5 files changed, 17 insertions(+), 65 deletions(-)

Index: b/fs/proc/meminfo.c
===================================================================
--- a/fs/proc/meminfo.c	2009-04-28 11:36:43.000000000 +0900
+++ b/fs/proc/meminfo.c	2009-04-28 11:36:54.000000000 +0900
@@ -35,7 +35,7 @@ static int meminfo_proc_show(struct seq_
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	si_meminfo(&i);
 	si_swapinfo(&i);
-	committed = atomic_long_read(&vm_committed_space);
+	committed = percpu_counter_read_positive(&vm_committed_as);
 	allowed = ((totalram_pages - hugetlb_total_pages())
 		* sysctl_overcommit_ratio / 100) + total_swap_pages;
 
Index: b/mm/mmap.c
===================================================================
--- a/mm/mmap.c	2009-04-28 11:36:43.000000000 +0900
+++ b/mm/mmap.c	2009-04-28 11:36:54.000000000 +0900
@@ -85,7 +85,7 @@ EXPORT_SYMBOL(vm_get_page_prot);
 int sysctl_overcommit_memory = OVERCOMMIT_GUESS;  /* heuristic overcommit */
 int sysctl_overcommit_ratio = 50;	/* default is 50% */
 int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
-atomic_long_t vm_committed_space = ATOMIC_LONG_INIT(0);
+struct percpu_counter vm_committed_as;
 
 /*
  * Check that a process has enough memory to allocate a new virtual
@@ -179,11 +179,7 @@ int __vm_enough_memory(struct mm_struct 
 	if (mm)
 		allowed -= mm->total_vm / 32;
 
-	/*
-	 * cast `allowed' as a signed long because vm_committed_space
-	 * sometimes has a negative value
-	 */
-	if (atomic_long_read(&vm_committed_space) < (long)allowed)
+	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
 		return 0;
 error:
 	vm_unacct_memory(pages);
@@ -2481,4 +2477,8 @@ void mm_drop_all_locks(struct mm_struct 
  */
 void __init mmap_init(void)
 {
+	int ret;
+
+	ret = percpu_counter_init(&vm_committed_as, 0);
+	VM_BUG_ON(ret);
 }
Index: b/mm/nommu.c
===================================================================
--- a/mm/nommu.c	2009-04-28 11:36:43.000000000 +0900
+++ b/mm/nommu.c	2009-04-28 11:36:54.000000000 +0900
@@ -62,7 +62,7 @@ void *high_memory;
 struct page *mem_map;
 unsigned long max_mapnr;
 unsigned long num_physpages;
-atomic_long_t vm_committed_space = ATOMIC_LONG_INIT(0);
+struct percpu_counter vm_committed_as;
 int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
 int sysctl_overcommit_ratio = 50; /* default is 50% */
 int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT;
@@ -463,6 +463,10 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
  */
 void __init mmap_init(void)
 {
+	int ret;
+
+	ret = percpu_counter_init(&vm_committed_as, 0);
+	VM_BUG_ON(ret);
 	vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC);
 }
 
@@ -1847,12 +1851,9 @@ int __vm_enough_memory(struct mm_struct 
 	if (mm)
 		allowed -= mm->total_vm / 32;
 
-	/*
-	 * cast `allowed' as a signed long because vm_committed_space
-	 * sometimes has a negative value
-	 */
-	if (atomic_long_read(&vm_committed_space) < (long)allowed)
+	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
 		return 0;
+
 error:
 	vm_unacct_memory(pages);
 
Index: b/mm/swap.c
===================================================================
--- a/mm/swap.c	2009-04-28 11:36:43.000000000 +0900
+++ b/mm/swap.c	2009-04-28 11:36:54.000000000 +0900
@@ -491,49 +491,6 @@ unsigned pagevec_lookup_tag(struct pagev
 
 EXPORT_SYMBOL(pagevec_lookup_tag);
 
-#ifdef CONFIG_SMP
-/*
- * We tolerate a little inaccuracy to avoid ping-ponging the counter between
- * CPUs
- */
-#define ACCT_THRESHOLD	max(16, NR_CPUS * 2)
-
-static DEFINE_PER_CPU(long, committed_space);
-
-void vm_acct_memory(long pages)
-{
-	long *local;
-
-	preempt_disable();
-	local = &__get_cpu_var(committed_space);
-	*local += pages;
-	if (*local > ACCT_THRESHOLD || *local < -ACCT_THRESHOLD) {
-		atomic_long_add(*local, &vm_committed_space);
-		*local = 0;
-	}
-	preempt_enable();
-}
-
-#ifdef CONFIG_HOTPLUG_CPU
-
-/* Drop the CPU's cached committed space back into the central pool. */
-static int cpu_swap_callback(struct notifier_block *nfb,
-			     unsigned long action,
-			     void *hcpu)
-{
-	long *committed;
-
-	committed = &per_cpu(committed_space, (long)hcpu);
-	if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) {
-		atomic_long_add(*committed, &vm_committed_space);
-		*committed = 0;
-		drain_cpu_pagevecs((long)hcpu);
-	}
-	return NOTIFY_OK;
-}
-#endif /* CONFIG_HOTPLUG_CPU */
-#endif /* CONFIG_SMP */
-
 /*
  * Perform any setup for the swap system
  */
@@ -554,7 +511,4 @@ void __init swap_setup(void)
 	 * Right now other parts of the system means that we
 	 * _really_ don't want to cluster much more
 	 */
-#ifdef CONFIG_HOTPLUG_CPU
-	hotcpu_notifier(cpu_swap_callback, 0);
-#endif
 }
Index: b/include/linux/mman.h
===================================================================
--- a/include/linux/mman.h	2009-04-28 11:32:47.000000000 +0900
+++ b/include/linux/mman.h	2009-04-28 11:38:42.000000000 +0900
@@ -12,21 +12,18 @@
 
 #ifdef __KERNEL__
 #include <linux/mm.h>
+#include <linux/percpu_counter.h>
 
 #include <asm/atomic.h>
 
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
-extern atomic_long_t vm_committed_space;
+extern struct percpu_counter vm_committed_as;
 
-#ifdef CONFIG_SMP
-extern void vm_acct_memory(long pages);
-#else
 static inline void vm_acct_memory(long pages)
 {
-	atomic_long_add(pages, &vm_committed_space);
+	percpu_counter_add(&vm_committed_as, pages);
 }
-#endif
 
 static inline void vm_unacct_memory(long pages)
 {


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: meminfo Committed_AS underflows
  2009-04-28  3:07           ` KOSAKI Motohiro
@ 2009-04-28  3:27             ` Andrew Morton
  2009-04-28  4:25               ` KOSAKI Motohiro
  2009-04-28  8:17               ` Dave Hansen
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2009-04-28  3:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: balbir, dave, linux-mm, linux-kernel, ebmunson, mel, cl, stable

On Tue, 28 Apr 2009 12:07:59 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Wed, 15 Apr 2009 14:17:13 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-04-15 13:10:06]:
> > > 
> > > > > * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-04-15 11:04:59]:
> > > > > 
> > > > > >  	committed = atomic_long_read(&vm_committed_space);
> > > > > > +	if (committed < 0)
> > > > > > +		committed = 0;
> > > > > 
> > 
> > Is there a reason why we can't use a boring old percpu_counter for
> > vm_committed_space?  That way the meminfo code can just use
> > percpu_counter_read_positive().
> > 
> > Or perhaps just percpu_counter_read().  The percpu_counter code does a
> > better job of handling large cpu counts than the
> > mysteriously-duplicative open-coded stuff we have there.
> 
> At that time, I thought smallest patch is better because it can send -stable
> tree easily.
> but maybe I was wrong. it made bikeshed discussion :(

Yes, I know what you mean.  But otoh it's a good idea to keep -stable
in sync with mainline - it means that -stable can merge things which
have had a suitable amount of testing.

> ok, I'm going to right way.
> 
> 
> =========================================
> Subject: [PATCH] fix Committed_AS underfolow on large NR_CPUS environment
> 
> As reported by Dave Hansen, the Committed_AS field can underflow in certain
> situations:
> 
> >         # while true; do cat /proc/meminfo  | grep _AS; sleep 1; done | uniq -c
> >               1 Committed_AS: 18446744073709323392 kB
> >              11 Committed_AS: 18446744073709455488 kB
> >               6 Committed_AS:    35136 kB
> >               5 Committed_AS: 18446744073709454400 kB
> >               7 Committed_AS:    35904 kB
> >               3 Committed_AS: 18446744073709453248 kB
> >               2 Committed_AS:    34752 kB
> >               9 Committed_AS: 18446744073709453248 kB
> >               8 Committed_AS:    34752 kB
> >               3 Committed_AS: 18446744073709320960 kB
> >               7 Committed_AS: 18446744073709454080 kB
> >               3 Committed_AS: 18446744073709320960 kB
> >               5 Committed_AS: 18446744073709454080 kB
> >               6 Committed_AS: 18446744073709320960 kB
> 
> Because NR_CPUS can be greater than 1000 and meminfo_proc_show() does not check
> for underflow.
> 
> But NR_CPUS proportional isn't good calculation. In general, possibility of
> lock contention is proportional to the number of online cpus, not theorical
> maximum cpus (NR_CPUS).
> the current kernel has generic percpu-counter stuff. using it is right way.
> it makes code simplify and percpu_counter_read_positive() don't make underflow issue.
> 
> 
> Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Eric B Munson <ebmunson@us.ibm.com>
> ---
>  fs/proc/meminfo.c    |    2 +-
>  include/linux/mman.h |    9 +++------
>  mm/mmap.c            |   12 ++++++------
>  mm/nommu.c           |   13 +++++++------
>  mm/swap.c            |   46 ----------------------------------------------
>  5 files changed, 17 insertions(+), 65 deletions(-)

Well that was nice.

There's potential here for weird performance regressions, so I think
that if we do this in mainline, we should wait a while (a few weeks?)
before backporting it.

Do we know how long this bug has existed for?  Quite a while, I expect?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: meminfo Committed_AS underflows
  2009-04-28  3:27             ` Andrew Morton
@ 2009-04-28  4:25               ` KOSAKI Motohiro
  2009-04-28  8:17               ` Dave Hansen
  1 sibling, 0 replies; 12+ messages in thread
From: KOSAKI Motohiro @ 2009-04-28  4:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, balbir, dave, linux-mm, linux-kernel, ebmunson,
	mel, cl, stable

> > > > > > >  	committed = atomic_long_read(&vm_committed_space);
> > > > > > > +	if (committed < 0)
> > > > > > > +		committed = 0;
> > > > > > 
> > > 
> > > Is there a reason why we can't use a boring old percpu_counter for
> > > vm_committed_space?  That way the meminfo code can just use
> > > percpu_counter_read_positive().
> > > 
> > > Or perhaps just percpu_counter_read().  The percpu_counter code does a
> > > better job of handling large cpu counts than the
> > > mysteriously-duplicative open-coded stuff we have there.
> > 
> > At that time, I thought smallest patch is better because it can send -stable
> > tree easily.
> > but maybe I was wrong. it made bikeshed discussion :(
> 
> Yes, I know what you mean.  But otoh it's a good idea to keep -stable
> in sync with mainline - it means that -stable can merge things which
> have had a suitable amount of testing.

Agreed.


> > Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Cc: Eric B Munson <ebmunson@us.ibm.com>
> > ---
> >  fs/proc/meminfo.c    |    2 +-
> >  include/linux/mman.h |    9 +++------
> >  mm/mmap.c            |   12 ++++++------
> >  mm/nommu.c           |   13 +++++++------
> >  mm/swap.c            |   46 ----------------------------------------------
> >  5 files changed, 17 insertions(+), 65 deletions(-)
> 
> Well that was nice.
> 
> There's potential here for weird performance regressions, so I think
> that if we do this in mainline, we should wait a while (a few weeks?)
> before backporting it.
> 
> Do we know how long this bug has existed for?  Quite a while, I expect?

ACCT_THRESHOLD was introduced bit-keeper age.
powerpc default NR_CPUS is still less 128.

% grep NR_CPUS *
cell_defconfig:CONFIG_NR_CPUS=4
celleb_defconfig:CONFIG_NR_CPUS=4
chrp32_defconfig:CONFIG_NR_CPUS=4
g5_defconfig:CONFIG_NR_CPUS=4
iseries_defconfig:CONFIG_NR_CPUS=32
maple_defconfig:CONFIG_NR_CPUS=4
mpc86xx_defconfig:CONFIG_NR_CPUS=2
pasemi_defconfig:CONFIG_NR_CPUS=2
ppc64_defconfig:CONFIG_NR_CPUS=32
ps3_defconfig:CONFIG_NR_CPUS=2
pseries_defconfig:CONFIG_NR_CPUS=128

powerpc maximum NR_CPUS was increased following commit.
I'm not sure about one year is short or long.

==========================================================
commit 90035fe378c7459ba19c43c63d5f878284224ce4
Author: Tony Breeds <tony@bakeyournoodle.com>
Date:   Thu Apr 24 13:43:49 2008 +1000

    [POWERPC] Raise the upper limit of NR_CPUS and move the pacas into the BSS

    This adds the required functionality to fill in all pacas at runtime.

    With NR_CPUS=1024
    text    data     bss     dec     hex filename
     137 1704032       0 1704169  1a00e9 arch/powerpc/kernel/paca.o :Before
     121 1179744  524288 1704153  1a00d9 arch/powerpc/kernel/paca.o :After

    Also remove unneeded #includes from arch/powerpc/kernel/paca.c

    Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
    Signed-off-by: Paul Mackerras <paulus@samba.org>

diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 5fc7fac..f7efaa9 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -220,8 +220,8 @@ config SMP
          If you don't know what to do here, say N.

 config NR_CPUS
-       int "Maximum number of CPUs (2-128)"
-       range 2 128
+       int "Maximum number of CPUs (2-1024)"
+       range 2 1024
        depends on SMP
        default "32" if PPC64
        default "4"



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: meminfo Committed_AS underflows
  2009-04-28  3:27             ` Andrew Morton
  2009-04-28  4:25               ` KOSAKI Motohiro
@ 2009-04-28  8:17               ` Dave Hansen
  1 sibling, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2009-04-28  8:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, balbir, linux-mm, linux-kernel, ebmunson, mel,
	cl, stable

On Mon, 2009-04-27 at 20:27 -0700, Andrew Morton wrote:
> There's potential here for weird performance regressions, so I think
> that if we do this in mainline, we should wait a while (a few weeks?)
> before backporting it.
> 
> Do we know how long this bug has existed for?  Quite a while, I
> expect?

Yeah, we didn't notice it until a recent enterprise distro got a config
with NR_CPUS=1024.  That opened the window up in a major way because of
the way we calculated the threshold.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: meminfo Committed_AS underflows
  2009-04-15  2:04 ` KOSAKI Motohiro
  2009-04-15  3:34   ` Balbir Singh
@ 2009-04-15  4:12   ` KOSAKI Motohiro
  1 sibling, 0 replies; 12+ messages in thread
From: KOSAKI Motohiro @ 2009-04-15  4:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kosaki.motohiro, linux-mm, linux-kernel@vger.kernel.org,
	Eric B Munson, Mel Gorman, Christoph Lameter


fix stupid mistake.

Changelog:
	since v1
		- change the type of commited value from ulong to long


=========================
Subject: [PATCH v2] fix Commited_AS underflow

Dave Hansen reported committed_AS field can underfolow.

>         # while true; do cat /proc/meminfo  | grep _AS; sleep 1; done | uniq -c
>               1 Committed_AS: 18446744073709323392 kB
>              11 Committed_AS: 18446744073709455488 kB
>               6 Committed_AS:    35136 kB
>               5 Committed_AS: 18446744073709454400 kB
>               7 Committed_AS:    35904 kB
>               3 Committed_AS: 18446744073709453248 kB
>               2 Committed_AS:    34752 kB
>               9 Committed_AS: 18446744073709453248 kB
>               8 Committed_AS:    34752 kB
>               3 Committed_AS: 18446744073709320960 kB
>               7 Committed_AS: 18446744073709454080 kB
>               3 Committed_AS: 18446744073709320960 kB
>               5 Committed_AS: 18446744073709454080 kB
>               6 Committed_AS: 18446744073709320960 kB

Because NR_CPU can be greater than 1000. and meminfo_proc_show()
doesn't have underflow check.

this patch have two change.

1. Change NR_CPU to nr_online_cpus()
   vm_acct_memory() isn't fast-path. then cpumask_weight() calculation
   isn't so expensive and the parameter for scalability issue should
   consider number of _physical_ cpus. not theoretical maximum number.
2. Add under-flow check to meminfo_proc_show().
   Almost field in /proc/meminfo have underflow check. but Committed_AS
   is significant exeption.
   it should do.

Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/meminfo.c |    4 +++-
 mm/swap.c         |    2 +-
 2 files changed, 4 insertions(+), 2 deletions(-)

Index: b/fs/proc/meminfo.c
===================================================================
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -22,7 +22,7 @@ void __attribute__((weak)) arch_report_m
 static int meminfo_proc_show(struct seq_file *m, void *v)
 {
 	struct sysinfo i;
-	unsigned long committed;
+	long committed;
 	unsigned long allowed;
 	struct vmalloc_info vmi;
 	long cached;
@@ -36,6 +36,8 @@ static int meminfo_proc_show(struct seq_
 	si_meminfo(&i);
 	si_swapinfo(&i);
 	committed = atomic_long_read(&vm_committed_space);
+	if (committed < 0)
+		committed = 0;
 	allowed = ((totalram_pages - hugetlb_total_pages())
 		* sysctl_overcommit_ratio / 100) + total_swap_pages;
 
Index: b/mm/swap.c
===================================================================
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -519,7 +519,7 @@ EXPORT_SYMBOL(pagevec_lookup_tag);
  * We tolerate a little inaccuracy to avoid ping-ponging the counter between
  * CPUs
  */
-#define ACCT_THRESHOLD	max(16, NR_CPUS * 2)
+#define ACCT_THRESHOLD	max_t(long, 16, num_online_cpus() * 2)
 
 static DEFINE_PER_CPU(long, committed_space);
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: meminfo Committed_AS underflows
  2009-04-14 19:33 meminfo Committed_AS underflows Dave Hansen
  2009-04-15  2:04 ` KOSAKI Motohiro
@ 2009-04-15  8:33 ` Alan Cox
  1 sibling, 0 replies; 12+ messages in thread
From: Alan Cox @ 2009-04-15  8:33 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel@vger.kernel.org, Eric B Munson, Mel Gorman,
	Christoph Lameter

> The 1023 cpus won't ever hit the ACCT_THRESHOLD.  The 1 CPU that did
> will decrement the global 'vm_committed_space'  by ~128 GB.  Underflow.
> Yay.  This happens on a much smaller scale now.
> 
> Should we be protecting meminfo so that it spits slightly more sane
> numbers out to the user?

Yes. It used to be accurate but the changes were put in as the memory
accounting value was actually starting to show up in profiles as it
bounced around the CPUs - perhaps the constraint should instead be a
worst case error as percentage of total system memory ?

Alan

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2009-04-28  8:17 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-04-14 19:33 meminfo Committed_AS underflows Dave Hansen
2009-04-15  2:04 ` KOSAKI Motohiro
2009-04-15  3:34   ` Balbir Singh
2009-04-15  4:10     ` KOSAKI Motohiro
2009-04-15  8:47       ` Balbir Singh
2009-04-27 20:27         ` Andrew Morton
2009-04-28  3:07           ` KOSAKI Motohiro
2009-04-28  3:27             ` Andrew Morton
2009-04-28  4:25               ` KOSAKI Motohiro
2009-04-28  8:17               ` Dave Hansen
2009-04-15  4:12   ` KOSAKI Motohiro
2009-04-15  8:33 ` Alan Cox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).