* vmstat: use our own timer events
@ 2007-04-29 5:09 Christoph Lameter
2007-04-29 8:15 ` Andrew Morton
2007-05-02 18:13 ` Pallipadi, Venkatesh
0 siblings, 2 replies; 7+ messages in thread
From: Christoph Lameter @ 2007-04-29 5:09 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, Arjan van de Ven
vmstat currently uses the cache reaper to periodically bring the
statistics up to date. The cache reaper only exists in SLUB
as a way to provide compatibility with SLAB. This patch removes
the vmstat calls from the slab allocators and gives vmstat its
own timer handling.
The advantage is also that we can use a different frequency for the
updates. Refreshing vm stats is a pretty fast job, so we can run it
every second and stagger it by only one tick. This will lead to
some overlap in large systems. F.e. a system running at 250 HZ with
1024 processors will have 4 vm updates occurring at once.
However, the vm stats update only accesses per node information.
It is only necessary to stagger the vm statistics updates per
processor in each node. Vm counter updates occurring on distant
nodes will not cause cacheline contention.
We could implement an alternate approach that runs the first processor
on each node at the start of the second and then each of the other
processors on a node on a subsequent tick. That may be useful to keep
a large part of the second free of timer activity. Maybe the timer
folks will have some feedback on this one?
CC: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/vmstat.h | 3 ---
mm/slab.c | 1 -
mm/slub.c | 1 -
mm/vmstat.c | 39 +++++++++++++++++++++++++++++++++++----
4 files changed, 35 insertions(+), 9 deletions(-)
Index: slub/mm/slab.c
===================================================================
--- slub.orig/mm/slab.c 2007-04-28 19:35:19.000000000 -0700
+++ slub/mm/slab.c 2007-04-28 19:35:26.000000000 -0700
@@ -4155,7 +4155,6 @@ next:
check_irq_on();
mutex_unlock(&cache_chain_mutex);
next_reap_node();
- refresh_cpu_vm_stats(smp_processor_id());
out:
/* Set up the next iteration */
schedule_delayed_work(work, round_jiffies_relative(REAPTIMEOUT_CPUC));
Index: slub/mm/slub.c
===================================================================
--- slub.orig/mm/slub.c 2007-04-28 19:35:28.000000000 -0700
+++ slub/mm/slub.c 2007-04-28 19:36:12.000000000 -0700
@@ -2514,7 +2514,6 @@ static DEFINE_PER_CPU(struct delayed_wor
static void cache_reap(struct work_struct *unused)
{
next_reap_node();
- refresh_cpu_vm_stats(smp_processor_id());
schedule_delayed_work(&__get_cpu_var(reap_work),
REAPTIMEOUT_CPUC);
}
Index: slub/include/linux/vmstat.h
===================================================================
--- slub.orig/include/linux/vmstat.h 2007-04-28 19:30:01.000000000 -0700
+++ slub/include/linux/vmstat.h 2007-04-28 19:36:38.000000000 -0700
@@ -213,8 +213,6 @@ extern void dec_zone_state(struct zone *
extern void __dec_zone_state(struct zone *, enum zone_stat_item);
void refresh_cpu_vm_stats(int);
-void refresh_vm_stats(void);
-
#else /* CONFIG_SMP */
/*
@@ -261,7 +259,6 @@ static inline void __dec_zone_page_state
#define mod_zone_page_state __mod_zone_page_state
static inline void refresh_cpu_vm_stats(int cpu) { }
-static inline void refresh_vm_stats(void) { }
#endif
#endif /* _LINUX_VMSTAT_H */
Index: slub/mm/vmstat.c
===================================================================
--- slub.orig/mm/vmstat.c 2007-04-28 19:30:01.000000000 -0700
+++ slub/mm/vmstat.c 2007-04-28 19:36:38.000000000 -0700
@@ -640,6 +640,22 @@ const struct seq_operations vmstat_op =
#endif /* CONFIG_PROC_FS */
#ifdef CONFIG_SMP
+static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
+
+static void vmstat_update(struct work_struct *w)
+{
+ refresh_cpu_vm_stats(smp_processor_id());
+ schedule_delayed_work(&__get_cpu_var(vmstat_work), HZ);
+}
+
+static void __devinit start_cpu_timer(int cpu)
+{
+ struct delayed_work *vmstat_work = &per_cpu(vmstat_work, cpu);
+
+ INIT_DELAYED_WORK(vmstat_work, vmstat_update);
+ schedule_delayed_work_on(cpu, vmstat_work, HZ + cpu);
+}
+
/*
* Use the cpu notifier to insure that the thresholds are recalculated
* when necessary.
@@ -648,11 +664,21 @@ static int __cpuinit vmstat_cpuup_callba
unsigned long action,
void *hcpu)
{
+ long cpu = (long)hcpu;
+
switch (action) {
- case CPU_UP_PREPARE:
- case CPU_UP_PREPARE_FROZEN:
- case CPU_UP_CANCELED:
- case CPU_UP_CANCELED_FROZEN:
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ start_cpu_timer(cpu);
+ break;
+ case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
+ cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
+ per_cpu(vmstat_work, cpu).work.func = NULL;
+ break;
+ case CPU_DOWN_FAILED:
+ case CPU_DOWN_FAILED_FROZEN:
+ start_cpu_timer(cpu);
+ break;
case CPU_DEAD:
case CPU_DEAD_FROZEN:
refresh_zone_stat_thresholds();
@@ -668,8 +694,13 @@ static struct notifier_block __cpuinitda
int __init setup_vmstat(void)
{
+ int cpu;
+
refresh_zone_stat_thresholds();
register_cpu_notifier(&vmstat_notifier);
+
+ for_each_online_cpu(cpu)
+ start_cpu_timer(cpu);
return 0;
}
module_init(setup_vmstat)
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: vmstat: use our own timer events
2007-04-29 5:09 vmstat: use our own timer events Christoph Lameter
@ 2007-04-29 8:15 ` Andrew Morton
2007-04-30 4:44 ` Christoph Lameter
2007-05-02 18:13 ` Pallipadi, Venkatesh
1 sibling, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2007-04-29 8:15 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-kernel, Arjan van de Ven
On Sat, 28 Apr 2007 22:09:04 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> vmstat currently uses the cache reaper to periodically bring the
> statistics up to date. The cache reaper only exists in SLUB
> as a way to provide compatibility with SLAB. This patch removes
> the vmstat calls from the slab allocators and gives vmstat its
> own timer handling.
>
> The advantage is also that we can use a different frequency for the
> updates. Refreshing vm stats is a pretty fast job so we can run this
> every second and stagger this by only one tick. This will lead to
> some overlap in large systems. F.e a system running at 250 HZ with
> 1024 processors will have 4 vm updates occurring at once.
>
> However, the vm stats update only accesses per node information.
> It is only necessary to stagger the vm statistics updates per
> processor in each node. Vm counter updates occurring on distant
> nodes will not cause cacheline contention.
>
> We could implement an alternate approach that runs the first processor
> on each node at the start of the second and then each of the other
> processors on a node on a subsequent tick. That may be useful to keep
> a large part of the second free of timer activity. Maybe the timer
> folks will have some feedback on this one?
The one-per-second timer interrupt will upset the people who are really
aggressive about power consumption (eg, OLPC). Perhaps there isn't (yet)
an intersection between those people and SMP.
However a knob to set the frequency would be nice, if it's not too
expensive to implement. Presumably anyone who cares enough will come along
and add one, but then they have to wait for a long period for that change
to propagate out to their users, which is a bit sad for something which we
already knew about.
Having each CPU touch every zone looks a bit expensive - I'd have thought
that it would be showing up a little on your monster NUMA machines?
> @@ -648,11 +664,21 @@ static int __cpuinit vmstat_cpuup_callba
> unsigned long action,
> void *hcpu)
> {
> + long cpu = (long)hcpu;
> +
> switch (action) {
> - case CPU_UP_PREPARE:
> - case CPU_UP_PREPARE_FROZEN:
> - case CPU_UP_CANCELED:
> - case CPU_UP_CANCELED_FROZEN:
> + case CPU_ONLINE:
> + case CPU_ONLINE_FROZEN:
> + start_cpu_timer(cpu);
> + break;
> + case CPU_DOWN_PREPARE:
> + case CPU_DOWN_PREPARE_FROZEN:
> + cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
> + per_cpu(vmstat_work, cpu).work.func = NULL;
> + case CPU_DOWN_FAILED:
> + case CPU_DOWN_FAILED_FROZEN:
> + start_cpu_timer(cpu);
> + break;
> case CPU_DEAD:
> case CPU_DEAD_FROZEN:
> refresh_zone_stat_thresholds();
Oh dear. Some of these new notifier types are added by a patch which is a
few hundred patches later than slub. I can park this patch after that one,
but that introduces a risk that later slub patches will also get
disconnected.
Oh well, we'll see how things go.
* Re: vmstat: use our own timer events
2007-04-29 8:15 ` Andrew Morton
@ 2007-04-30 4:44 ` Christoph Lameter
2007-04-30 14:12 ` Arjan van de Ven
0 siblings, 1 reply; 7+ messages in thread
From: Christoph Lameter @ 2007-04-30 4:44 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Arjan van de Ven
On Sun, 29 Apr 2007, Andrew Morton wrote:
> > on each node at the start of the second and then each of the other
> > processors on a node on a subsequent tick. That may be useful to keep
> > a large part of the second free of timer activity. Maybe the timer
> > folks will have some feedback on this one?
>
> The one-per-second timer interrupt will upset the people who are really
> aggressive about power consumption (eg, OLPC). Perhaps there isn't (yet)
> an intersection between those people and SMP.
Well, the cache_reaper of SLAB hits them hard today anyway. This will
help if they switch to SLUB because the counter consolidation is much
lighter weight.
> However a knob to set the frequency would be nice, if it's not too
> expensive to implement. Presumably anyone who cares enough will come along
> and add one, but then they have to wait for a long period for that change
> to propagate out to their users, which is a bit sad for something which we
> already knew about.
Ok will do.
> Having each CPU touch every zone looks a bit expensive - I'd have thought
> that it would be showing up a little on your monster NUMA machines?
Vmstat updates only touch cachelines that are node local. Data for
remote zones may be updated, but the pcps of those remote zones belong
to this processor and have been placed local to the processor accessing
them. Thus no problem for monster NUMA.
> Oh dear. Some of these new notifier types are added by a patch which is a
> few hundred patches later than slub. I can park this patch after that one,
> but that introduces a risk that later slub patches will also get
> disconnected.
I am fine with delaying this. I just wanted the timer guys to have a
chance to shape this a bit. Not sure what they want. What they did to the
cache_reaper in 2.6.20/21 is bad.
* Re: vmstat: use our own timer events
2007-04-30 4:44 ` Christoph Lameter
@ 2007-04-30 14:12 ` Arjan van de Ven
2007-04-30 17:42 ` Christoph Lameter
0 siblings, 1 reply; 7+ messages in thread
From: Arjan van de Ven @ 2007-04-30 14:12 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel
Christoph Lameter wrote:
> On Sun, 29 Apr 2007, Andrew Morton wrote:
>
>>> on each node at the second and then each of the other processor on a
>>> node on a subsequent tick. That may be useful to keep a large amount
>>> of the second free of timer activity. Maybe the timer folks will have
>>> some feedback on this one?
>> The one-per-second timer interrupt will upset the people who are really
>> aggressive about power consumption (eg, OLPC). Perhaps there isn't (yet)
>> an intersection between those people and SMP.
>
> Well, the cache_reaper of SLAB hits them hard today anyway. This will
> help if they switch to SLUB because the counter consolidation is much
> lighter weight.
it's not about the weight. It's about waking up *at all*. I've been working
really hard to get a system with a reasonable average idle time (600ms+), but
I obviously had to patch the SLAB reaper to be at a different resolution...
>
> I am fine with delaying this. I just wanted the timer guys to have a
> chance to shape this a bit. Not sure what they want. What they did to the
> cache_reaper in 2.6.20/21 is bad.
HUH? The cache_reaper DID NOT CHANGE with the round_jiffies() change.
Before it had a 3 jiffies per cpu offset, after it has a 3 jiffies per cpu offset.
* Re: vmstat: use our own timer events
2007-04-30 14:12 ` Arjan van de Ven
@ 2007-04-30 17:42 ` Christoph Lameter
0 siblings, 0 replies; 7+ messages in thread
From: Christoph Lameter @ 2007-04-30 17:42 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: Andrew Morton, linux-kernel
On Mon, 30 Apr 2007, Arjan van de Ven wrote:
> > Well, the cache_reaper of SLAB hits them hard today anyway. This will
> > help if they switch to SLUB because the counter consolidation is much
> > lighter weight.
>
> it's not about the weight. It's about waking up *at all*. I've been working
> really hard to get a system with a reasonable average idle time (600ms+), but
> I obviously had to patch the SLAB reaper to be at a different resolution...
If you use SLUB then it's going to be gone.
> > I am fine with delaying this. I just wanted the timer guys to have a chance
> > to shape this a bit. Not sure what they want. What they did to the
> > cache_reaper in 2.6.20/21 is bad.
>
> HUH? The cache_reaper DID NOT CHANGE with the round_jiffies() change.
> Before it had a 3 jiffies per cpu offset, after it has a 3 jiffies per cpu
> offset.
It seems that it does use round_jiffies_relative() instead?
* RE: vmstat: use our own timer events
2007-04-29 5:09 vmstat: use our own timer events Christoph Lameter
2007-04-29 8:15 ` Andrew Morton
@ 2007-05-02 18:13 ` Pallipadi, Venkatesh
2007-05-02 18:29 ` Christoph Lameter
1 sibling, 1 reply; 7+ messages in thread
From: Pallipadi, Venkatesh @ 2007-05-02 18:13 UTC (permalink / raw)
To: Christoph Lameter, akpm; +Cc: linux-kernel, Arjan van de Ven
>-----Original Message-----
>From: linux-kernel-owner@vger.kernel.org
>[mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of
>Christoph Lameter
>Sent: Saturday, April 28, 2007 10:09 PM
>To: akpm@linux-foundation.org
>Cc: linux-kernel@vger.kernel.org; Arjan van de Ven
>Subject: vmstat: use our own timer events
>
>
>We could implement an alternate approach that runs the first processor
>on each node at the second and then each of the other processor on a
>node on a subsequent tick. That may be useful to keep a large amount
>of the second free of timer activity. Maybe the timer folks will have
>some feedback on this one?
>
Can this use a 'deferrable timer' along with round_jiffies()? That
will eliminate the issue of too-frequent interrupts when the CPU is idle.
>CC: Arjan van de Ven <arjan@linux.intel.com>
>Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
>Index: slub/mm/vmstat.c
>===================================================================
>--- slub.orig/mm/vmstat.c 2007-04-28 19:30:01.000000000 -0700
>+++ slub/mm/vmstat.c 2007-04-28 19:36:38.000000000 -0700
>@@ -640,6 +640,22 @@ const struct seq_operations vmstat_op =
> #endif /* CONFIG_PROC_FS */
>
> #ifdef CONFIG_SMP
>+static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
>+
>+static void vmstat_update(struct work_struct *w)
>+{
>+ refresh_cpu_vm_stats(smp_processor_id());
>+ schedule_delayed_work(&__get_cpu_var(vmstat_work), HZ);
>+}
>+
>+static void __devinit start_cpu_timer(int cpu)
>+{
>+ struct delayed_work *vmstat_work = &per_cpu(vmstat_work, cpu);
>+
>+ INIT_DELAYED_WORK(vmstat_work, vmstat_update);
This change alone should help.
INIT_DELAYED_WORK_DEFERRABLE(vmstat_work, vmstat_update);
>+ schedule_delayed_work_on(cpu, vmstat_work, HZ + cpu);
>+}
>+
Thanks,
Venki
* RE: vmstat: use our own timer events
2007-05-02 18:13 ` Pallipadi, Venkatesh
@ 2007-05-02 18:29 ` Christoph Lameter
0 siblings, 0 replies; 7+ messages in thread
From: Christoph Lameter @ 2007-05-02 18:29 UTC (permalink / raw)
To: Pallipadi, Venkatesh; +Cc: akpm, linux-kernel, Arjan van de Ven
On Wed, 2 May 2007, Pallipadi, Venkatesh wrote:
> Can this use 'deferrable timer' along with round_jiffies. That
> will eliminate the issue of too frequent interrupt when CPU is idle.
Yes I asked Arjan about this.
> >+ struct delayed_work *vmstat_work = &per_cpu(vmstat_work, cpu);
> >+
> >+ INIT_DELAYED_WORK(vmstat_work, vmstat_update);
>
> This change alone should help.
> INIT_DELAYED_WORK_DEFERRABLE(vmstat_work, vmstat_update);
>
Hmmm.. I need to check out what this does exactly.
end of thread, other threads:[~2007-05-02 18:29 UTC | newest]
Thread overview: 7+ messages
2007-04-29 5:09 vmstat: use our own timer events Christoph Lameter
2007-04-29 8:15 ` Andrew Morton
2007-04-30 4:44 ` Christoph Lameter
2007-04-30 14:12 ` Arjan van de Ven
2007-04-30 17:42 ` Christoph Lameter
2007-05-02 18:13 ` Pallipadi, Venkatesh
2007-05-02 18:29 ` Christoph Lameter