vmstat: On demand vmstat workers V4

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* vmstat: On demand vmstat workers V4
@ 2014-05-08 15:35 Christoph Lameter
  2014-05-08 21:29 ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2014-05-08 15:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gilad Ben-Yossef, Thomas Gleixner, Tejun Heo, John Stultz,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, Paul E. McKenney, linux-kernel, linux-mm,
	hughd, viresh.kumar

There were numerous requests for an update of this patch.


V3->V4:
- Make the shepherd task not deferrable. It runs on the tick cpu
  anyways. Deferral could get deltas too far out of sync if
  vmstat operations are disabled for a certain processor.

V2->V3:
- Introduce a new tick_get_housekeeping_cpu() function. Not sure
  if that is exactly what we want but it is a start. Thomas?
- Migrate the shepherd task if the output of
  tick_get_housekeeping_cpu() changes.
- Fixes recommended by Andrew.

V1->V2:
- Optimize the need_update check by using memchr_inv.
- Clean up.

vmstat workers are used for folding counter differentials into the
zone, per node and global counters at certain time intervals.
They currently run at defined intervals on all processors which will
cause some holdoff for processors that need minimal intrusion by the
OS.

The current vmstat_update mechanism depends on a deferrable timer
firing every other second by default which registers a work queue item
that runs on the local CPU, with the result that we have 1 interrupt
and one additional schedulable task on each CPU every 2 seconds
If a workload indeed causes VM activity or multiple tasks are running
on a CPU, then there are probably bigger issues to deal with.

However, some workloads dedicate a CPU for a single CPU bound task.
This is done in high performance computing, in high frequency
financial applications, in networking (Intel DPDK, EZchip NPS) and with
the advent of systems with more and more CPUs over time, this may become
more and more common to do since when one has enough CPUs
one cares less about efficiently sharing a CPU with other tasks and
more about efficiently monopolizing a CPU per task.

The difference of having this timer firing and workqueue kernel thread
scheduled per second can be enormous. An artificial test measuring the
worst case time to do a simple "i++" in an endless loop on a bare metal
system and under Linux on an isolated CPU with dynticks and with and
without this patch, have Linux match the bare metal performance
(~700 cycles) with this patch and loose by couple of orders of magnitude
(~200k cycles) without it[*].  The loss occurs for something that just
calculates statistics. For networking applications, for example, this
could be the difference between dropping packets or sustaining line rate.

Statistics are important and useful, but it would be great if there
would be a way to not cause statistics gathering produce a huge
performance difference. This patche does just that.

This patch creates a vmstat shepherd worker that monitors the
per cpu differentials on all processors. If there are differentials
on a processor then a vmstat worker local to the processors
with the differentials is created. That worker will then start
folding the diffs in regular intervals. Should the worker
find that there is no work to be done then it will make the shepherd
worker monitor the differentials again.

With this patch it is possible then to have periods longer than
2 seconds without any OS event on a "cpu" (hardware thread).

Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
Signed-off-by: Christoph Lameter <cl@linux.com>


Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c	2014-05-06 10:51:19.711239813 -0500
+++ linux/mm/vmstat.c	2014-05-06 11:17:28.228738828 -0500
@@ -7,6 +7,7 @@
  *  zoned VM statistics
  *  Copyright (C) 2006 Silicon Graphics, Inc.,
  *		Christoph Lameter <christoph@lameter.com>
+ *  Copyright (C) 2008-2014 Christoph Lameter
  */
 #include <linux/fs.h>
 #include <linux/mm.h>
@@ -14,12 +15,14 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/cpumask.h>
 #include <linux/vmstat.h>
 #include <linux/sched.h>
 #include <linux/math64.h>
 #include <linux/writeback.h>
 #include <linux/compaction.h>
 #include <linux/mm_inline.h>
+#include <linux/tick.h>

 #include "internal.h"

@@ -417,13 +420,22 @@
 EXPORT_SYMBOL(dec_zone_page_state);
 #endif

-static inline void fold_diff(int *diff)
+
+/*
+ * Fold a differential into the global counters.
+ * Returns the number of counters updated.
+ */
+static inline int fold_diff(int *diff)
 {
 	int i;
+	int changes = 0;

 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		if (diff[i])
+		if (diff[i]) {
 			atomic_long_add(diff[i], &vm_stat[i]);
+			changes++;
+	}
+	return changes;
 }

 /*
@@ -439,12 +451,15 @@
  * statistics in the remote zone struct as well as the global cachelines
  * with the global counters. These could cause remote node cache line
  * bouncing and will have to be only done when necessary.
+ *
+ * The function returns the number of global counters updated.
  */
-static void refresh_cpu_vm_stats(void)
+static int refresh_cpu_vm_stats(void)
 {
 	struct zone *zone;
 	int i;
 	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int changes = 0;

 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset __percpu *p = zone->pageset;
@@ -484,15 +499,17 @@
 			continue;
 		}

-
 		if (__this_cpu_dec_return(p->expire))
 			continue;

-		if (__this_cpu_read(p->pcp.count))
+		if (__this_cpu_read(p->pcp.count)) {
 			drain_zone_pages(zone, __this_cpu_ptr(&p->pcp));
+			changes++;
+		}
 #endif
 	}
-	fold_diff(global_diff);
+	changes += fold_diff(global_diff);
+	return changes;
 }

 /*
@@ -1222,12 +1239,15 @@
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
+static struct cpumask *monitored_cpus;

 static void vmstat_update(struct work_struct *w)
 {
-	refresh_cpu_vm_stats();
-	schedule_delayed_work(&__get_cpu_var(vmstat_work),
-		round_jiffies_relative(sysctl_stat_interval));
+	if (refresh_cpu_vm_stats())
+		schedule_delayed_work(this_cpu_ptr(&vmstat_work),
+			round_jiffies_relative(sysctl_stat_interval));
+	else
+		cpumask_set_cpu(smp_processor_id(), monitored_cpus);
 }

 static void start_cpu_timer(int cpu)
@@ -1235,7 +1255,69 @@
 	struct delayed_work *work = &per_cpu(vmstat_work, cpu);

 	INIT_DEFERRABLE_WORK(work, vmstat_update);
-	schedule_delayed_work_on(cpu, work, __round_jiffies_relative(HZ, cpu));
+	schedule_delayed_work_on(cpu, work,
+		__round_jiffies_relative(sysctl_stat_interval, cpu));
+}
+
+/*
+ * Check if the diffs for a certain cpu indicate that
+ * an update is needed.
+ */
+static bool need_update(int cpu)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
+
+		/*
+		 * The fast way of checking if there are any vmstat diffs.
+		 * This works because the diffs are byte sized items.
+		 */
+		if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
+			return true;
+	}
+	return false;
+}
+
+static void vmstat_shepherd(struct work_struct *w)
+{
+	int cpu;
+	int s = tick_get_housekeeping_cpu();
+	struct delayed_work *d = per_cpu_ptr(&vmstat_work, s);
+
+	refresh_cpu_vm_stats();
+	for_each_cpu(cpu, monitored_cpus)
+		if (need_update(cpu)) {
+			cpumask_clear_cpu(cpu, monitored_cpus);
+			start_cpu_timer(cpu);
+		}
+
+	if (s != smp_processor_id()) {
+		/* Timekeeping was moved. Move the shepherd worker */
+		cancel_delayed_work_sync(d);
+		cpumask_set_cpu(smp_processor_id(), monitored_cpus);
+		cpumask_clear_cpu(s, monitored_cpus);
+		INIT_DELAYED_WORK(d, vmstat_shepherd);
+	}
+
+	schedule_delayed_work_on(s, d,
+		__round_jiffies_relative(sysctl_stat_interval, s));
+
+}
+
+static void __init start_shepherd_timer(void)
+{
+	int cpu = tick_get_housekeeping_cpu();
+	struct delayed_work *d = per_cpu_ptr(&vmstat_work, cpu);
+
+	INIT_DELAYED_WORK(d, vmstat_shepherd);
+	monitored_cpus = kmalloc(BITS_TO_LONGS(nr_cpu_ids) * sizeof(long),
+			GFP_KERNEL);
+	cpumask_copy(monitored_cpus, cpu_online_mask);
+	cpumask_clear_cpu(cpu, monitored_cpus);
+	schedule_delayed_work_on(cpu, d,
+		__round_jiffies_relative(sysctl_stat_interval, cpu));
 }

 static void vmstat_cpu_dead(int node)
@@ -1266,17 +1348,19 @@
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 		refresh_zone_stat_thresholds();
-		start_cpu_timer(cpu);
 		node_set_state(cpu_to_node(cpu), N_CPU);
+		cpumask_set_cpu(cpu, monitored_cpus);
 		break;
 	case CPU_DOWN_PREPARE:
 	case CPU_DOWN_PREPARE_FROZEN:
-		cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+		if (!cpumask_test_cpu(cpu, monitored_cpus))
+			cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+		cpumask_clear_cpu(cpu, monitored_cpus);
 		per_cpu(vmstat_work, cpu).work.func = NULL;
 		break;
 	case CPU_DOWN_FAILED:
 	case CPU_DOWN_FAILED_FROZEN:
-		start_cpu_timer(cpu);
+		cpumask_set_cpu(cpu, monitored_cpus);
 		break;
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
@@ -1296,15 +1380,10 @@
 static int __init setup_vmstat(void)
 {
 #ifdef CONFIG_SMP
-	int cpu;
-
 	cpu_notifier_register_begin();
 	__register_cpu_notifier(&vmstat_notifier);

-	for_each_online_cpu(cpu) {
-		start_cpu_timer(cpu);
-		node_set_state(cpu_to_node(cpu), N_CPU);
-	}
+	start_shepherd_timer();
 	cpu_notifier_register_done();
 #endif
 #ifdef CONFIG_PROC_FS
Index: linux/kernel/time/tick-common.c
===================================================================
--- linux.orig/kernel/time/tick-common.c	2014-05-06 10:51:19.711239813 -0500
+++ linux/kernel/time/tick-common.c	2014-05-06 10:51:19.711239813 -0500
@@ -222,6 +222,24 @@
 		tick_setup_oneshot(newdev, handler, next_event);
 }

+/*
+ * Return a cpu number that may be used to run housekeeping
+ * tasks. This is usually the timekeeping cpu unless that
+ * is not available. Then we simply fall back to the current
+ * cpu.
+ */
+int tick_get_housekeeping_cpu(void)
+{
+	int cpu;
+
+	if (system_state < SYSTEM_RUNNING || tick_do_timer_cpu < 0)
+		cpu = raw_smp_processor_id();
+	else
+		cpu = tick_do_timer_cpu;
+
+	return cpu;
+}
+
 void tick_install_replacement(struct clock_event_device *newdev)
 {
 	struct tick_device *td = &__get_cpu_var(tick_cpu_device);
Index: linux/include/linux/tick.h
===================================================================
--- linux.orig/include/linux/tick.h	2014-05-06 10:51:19.711239813 -0500
+++ linux/include/linux/tick.h	2014-05-06 10:51:19.711239813 -0500
@@ -77,6 +77,7 @@
 extern void __init tick_init(void);
 extern int tick_is_oneshot_available(void);
 extern struct tick_device *tick_get_device(int cpu);
+extern int tick_get_housekeeping_cpu(void);

 # ifdef CONFIG_HIGH_RES_TIMERS
 extern int tick_init_highres(void);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-08 15:35 vmstat: On demand vmstat workers V4 Christoph Lameter
@ 2014-05-08 21:29 ` Andrew Morton
  2014-05-08 22:18   ` Thomas Gleixner
  2014-05-12 16:25   ` Christoph Lameter
  0 siblings, 2 replies; 20+ messages in thread
From: Andrew Morton @ 2014-05-08 21:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben-Yossef, Thomas Gleixner, Tejun Heo, John Stultz,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, Paul E. McKenney, linux-kernel, linux-mm,
	hughd, viresh.kumar

(tglx poke)

On Thu, 8 May 2014 10:35:15 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:

> There were numerous requests for an update of this patch.
> 
> 
> V3->V4:
> - Make the shepherd task not deferrable. It runs on the tick cpu
>   anyways. Deferral could get deltas too far out of sync if
>   vmstat operations are disabled for a certain processor.
> 
> V2->V3:
> - Introduce a new tick_get_housekeeping_cpu() function. Not sure
>   if that is exactly what we want but it is a start. Thomas?
> - Migrate the shepherd task if the output of
>   tick_get_housekeeping_cpu() changes.
> - Fixes recommended by Andrew.
> 
> V1->V2:
> - Optimize the need_update check by using memchr_inv.
> - Clean up.
> 
> vmstat workers are used for folding counter differentials into the
> zone, per node and global counters at certain time intervals.
> They currently run at defined intervals on all processors which will
> cause some holdoff for processors that need minimal intrusion by the
> OS.
> 
> The current vmstat_update mechanism depends on a deferrable timer
> firing every other second by default which registers a work queue item
> that runs on the local CPU, with the result that we have 1 interrupt
> and one additional schedulable task on each CPU every 2 seconds
> If a workload indeed causes VM activity or multiple tasks are running
> on a CPU, then there are probably bigger issues to deal with.
> 
> However, some workloads dedicate a CPU for a single CPU bound task.
> This is done in high performance computing, in high frequency
> financial applications, in networking (Intel DPDK, EZchip NPS) and with
> the advent of systems with more and more CPUs over time, this may become
> more and more common to do since when one has enough CPUs
> one cares less about efficiently sharing a CPU with other tasks and
> more about efficiently monopolizing a CPU per task.
> 
> The difference of having this timer firing and workqueue kernel thread
> scheduled per second can be enormous. An artificial test measuring the
> worst case time to do a simple "i++" in an endless loop on a bare metal
> system and under Linux on an isolated CPU with dynticks and with and
> without this patch, have Linux match the bare metal performance
> (~700 cycles) with this patch and loose by couple of orders of magnitude
> (~200k cycles) without it[*].  The loss occurs for something that just
> calculates statistics. For networking applications, for example, this
> could be the difference between dropping packets or sustaining line rate.
> 
> Statistics are important and useful, but it would be great if there
> would be a way to not cause statistics gathering produce a huge
> performance difference. This patche does just that.
> 
> This patch creates a vmstat shepherd worker that monitors the
> per cpu differentials on all processors. If there are differentials
> on a processor then a vmstat worker local to the processors
> with the differentials is created. That worker will then start
> folding the diffs in regular intervals. Should the worker
> find that there is no work to be done then it will make the shepherd
> worker monitor the differentials again.
> 
> With this patch it is possible then to have periods longer than
> 2 seconds without any OS event on a "cpu" (hardware thread).

Some explanation of the changes to kernel/time/tick-common.c would be
appropriate.

>
> ...
>
> +
> +/*
> + * Fold a differential into the global counters.
> + * Returns the number of counters updated.
> + */
> +static inline int fold_diff(int *diff)
>  {
>  	int i;
> +	int changes = 0;
> 
>  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
> -		if (diff[i])
> +		if (diff[i]) {
>  			atomic_long_add(diff[i], &vm_stat[i]);
> +			changes++;
> +	}
> +	return changes;
>  }

This is too large to be inlined.  It has a single callsite so the
compiler will presumably choose to inline it regardless of whether we
put "inline" in the definition.

>  /*
>
> ...
>
> @@ -1222,12 +1239,15 @@
>  #ifdef CONFIG_SMP
>  static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
>  int sysctl_stat_interval __read_mostly = HZ;
> +static struct cpumask *monitored_cpus;
> 
>  static void vmstat_update(struct work_struct *w)
>  {
> -	refresh_cpu_vm_stats();
> -	schedule_delayed_work(&__get_cpu_var(vmstat_work),
> -		round_jiffies_relative(sysctl_stat_interval));
> +	if (refresh_cpu_vm_stats())
> +		schedule_delayed_work(this_cpu_ptr(&vmstat_work),
> +			round_jiffies_relative(sysctl_stat_interval));
> +	else
> +		cpumask_set_cpu(smp_processor_id(), monitored_cpus);
>  }

This function is the core of this design and would be a good place to
document it all.  Where we decide to call schedule_delayed_work(), add
a comment explaining why.  Where we decide to call cpumask_set_cpu(),
add a comment explaining why.


>  static void start_cpu_timer(int cpu)
> @@ -1235,7 +1255,69 @@
>  	struct delayed_work *work = &per_cpu(vmstat_work, cpu);
> 
>  	INIT_DEFERRABLE_WORK(work, vmstat_update);
> -	schedule_delayed_work_on(cpu, work, __round_jiffies_relative(HZ, cpu));
> +	schedule_delayed_work_on(cpu, work,
> +		__round_jiffies_relative(sysctl_stat_interval, cpu));
> +}
> +
> +/*
> + * Check if the diffs for a certain cpu indicate that
> + * an update is needed.
> + */
> +static bool need_update(int cpu)
> +{
> +	struct zone *zone;
> +
> +	for_each_populated_zone(zone) {
> +		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
> +
> +		/*
> +		 * The fast way of checking if there are any vmstat diffs.
> +		 * This works because the diffs are byte sized items.
> +		 */

yikes.

I guess

		BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1);

will help address the obvious maintainability concern.

> +		if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
> +			return true;
> +	}
> +	return false;
> +}
> +
> +static void vmstat_shepherd(struct work_struct *w)

Let's document the design here also.  What does it do, why does it do
it, how does it do it.  We know how to do this.

> +{
> +	int cpu;
> +	int s = tick_get_housekeeping_cpu();
> +	struct delayed_work *d = per_cpu_ptr(&vmstat_work, s);
> +
> +	refresh_cpu_vm_stats();
> +	for_each_cpu(cpu, monitored_cpus)
> +		if (need_update(cpu)) {
> +			cpumask_clear_cpu(cpu, monitored_cpus);

Obvious race is obvious.  Let's either fix the race or tell readers
what the consequences are and why it's OK.

> +			start_cpu_timer(cpu);
> +		}
> +
> +	if (s != smp_processor_id()) {
> +		/* Timekeeping was moved. Move the shepherd worker */

You mean "move the shepherd worker to follow the timekeeping CPU.  We
do this because ..."

> +		cancel_delayed_work_sync(d);
> +		cpumask_set_cpu(smp_processor_id(), monitored_cpus);
> +		cpumask_clear_cpu(s, monitored_cpus);
> +		INIT_DELAYED_WORK(d, vmstat_shepherd);

INIT_DELAYED_WORK() seems inappropriate here.  It's generally used for
once-off initialisation of a freshly allocated work item.  Look at all
the stuff it does - do we really want to run debug_object_init()
against an active object?

> +	}
> +
> +	schedule_delayed_work_on(s, d,
> +		__round_jiffies_relative(sysctl_stat_interval, s));
> +
> +}
> +
> +static void __init start_shepherd_timer(void)
> +{
> +	int cpu = tick_get_housekeeping_cpu();
> +	struct delayed_work *d = per_cpu_ptr(&vmstat_work, cpu);
> +
> +	INIT_DELAYED_WORK(d, vmstat_shepherd);
> +	monitored_cpus = kmalloc(BITS_TO_LONGS(nr_cpu_ids) * sizeof(long),
> +			GFP_KERNEL);
> +	cpumask_copy(monitored_cpus, cpu_online_mask);
> +	cpumask_clear_cpu(cpu, monitored_cpus);
> +	schedule_delayed_work_on(cpu, d,
> +		__round_jiffies_relative(sysctl_stat_interval, cpu));
>  }
> 
>  static void vmstat_cpu_dead(int node)
> @@ -1266,17 +1348,19 @@
>  	case CPU_ONLINE:
>  	case CPU_ONLINE_FROZEN:
>  		refresh_zone_stat_thresholds();
> -		start_cpu_timer(cpu);
>  		node_set_state(cpu_to_node(cpu), N_CPU);
> +		cpumask_set_cpu(cpu, monitored_cpus);
>  		break;
>  	case CPU_DOWN_PREPARE:
>  	case CPU_DOWN_PREPARE_FROZEN:
> -		cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> +		if (!cpumask_test_cpu(cpu, monitored_cpus))

This test is inverted isn't it?

> +			cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> +		cpumask_clear_cpu(cpu, monitored_cpus);
>  		per_cpu(vmstat_work, cpu).work.func = NULL;
>  		break;
>  	case CPU_DOWN_FAILED:
>  	case CPU_DOWN_FAILED_FROZEN:
> -		start_cpu_timer(cpu);
> +		cpumask_set_cpu(cpu, monitored_cpus);
>  		break;
>  	case CPU_DEAD:
>  	case CPU_DEAD_FROZEN:
>
> ...
>
> --- linux.orig/kernel/time/tick-common.c	2014-05-06 10:51:19.711239813 -0500
> +++ linux/kernel/time/tick-common.c	2014-05-06 10:51:19.711239813 -0500
> @@ -222,6 +222,24 @@
>  		tick_setup_oneshot(newdev, handler, next_event);
>  }
> 
> +/*
> + * Return a cpu number that may be used to run housekeeping
> + * tasks. This is usually the timekeeping cpu unless that
> + * is not available. Then we simply fall back to the current
> + * cpu.
> + */

This comment is unusably vague.  What the heck is a "housekeeping
task"?  Why would anyone call this and what is special about the CPU
number it returns?


> +int tick_get_housekeeping_cpu(void)
> +{
> +	int cpu;
> +
> +	if (system_state < SYSTEM_RUNNING || tick_do_timer_cpu < 0)
> +		cpu = raw_smp_processor_id();
> +	else
> +		cpu = tick_do_timer_cpu;
> +
> +	return cpu;
> +}
> +
>
> ...
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-08 21:29 ` Andrew Morton
@ 2014-05-08 22:18   ` Thomas Gleixner
  2014-05-09 14:53     ` Christoph Lameter
  2014-05-12 16:25   ` Christoph Lameter
  1 sibling, 1 reply; 20+ messages in thread
From: Thomas Gleixner @ 2014-05-08 22:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Gilad Ben-Yossef, Tejun Heo, John Stultz,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, Paul E. McKenney, linux-kernel, linux-mm,
	hughd, viresh.kumar

On Thu, 8 May 2014, Andrew Morton wrote:
> On Thu, 8 May 2014 10:35:15 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:
> > --- linux.orig/kernel/time/tick-common.c	2014-05-06 10:51:19.711239813 -0500
> > +++ linux/kernel/time/tick-common.c	2014-05-06 10:51:19.711239813 -0500
> > @@ -222,6 +222,24 @@
> >  		tick_setup_oneshot(newdev, handler, next_event);
> >  }
> > 
> > +/*
> > + * Return a cpu number that may be used to run housekeeping
> > + * tasks. This is usually the timekeeping cpu unless that
> > + * is not available. Then we simply fall back to the current
> > + * cpu.
> > + */
> 
> This comment is unusably vague.  What the heck is a "housekeeping
> task"?  Why would anyone call this and what is special about the CPU
> number it returns?
> 
> 
> > +int tick_get_housekeeping_cpu(void)
> > +{
> > +	int cpu;
> > +
> > +	if (system_state < SYSTEM_RUNNING || tick_do_timer_cpu < 0)
> > +		cpu = raw_smp_processor_id();

That's completely bogus. The system state check is pointless and
tick_do_timer_cpu even more so because if you call that code from a
worker thread tick_do_timer_cpu should be assigned to some cpu.

Aside of that I'm having a hard time to understand why this stuff
wants to move around at all.

I think we agreed long ago, that for the whole HPC FULL_NOHZ stuff you
have to sacrify at least one CPU for housekeeping purposes of all
kinds, timekeeping, statistics and whatever.

So if you have a housekeeper, then it makes absolutely no sense at all
to move it around in circles.

Can you please enlighten me why we need this at all?

Thanks,

	tglx

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-08 22:18   ` Thomas Gleixner
@ 2014-05-09 14:53     ` Christoph Lameter
  2014-05-09 15:05       ` Thomas Gleixner
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2014-05-09 14:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, Gilad Ben-Yossef, Tejun Heo, John Stultz,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, Paul E. McKenney, linux-kernel, linux-mm,
	hughd, viresh.kumar

On Fri, 9 May 2014, Thomas Gleixner wrote:

> > > +/*
> > > + * Return a cpu number that may be used to run housekeeping
> > > + * tasks. This is usually the timekeeping cpu unless that
> > > + * is not available. Then we simply fall back to the current
> > > + * cpu.
> > > + */
> >
> > This comment is unusably vague.  What the heck is a "housekeeping
> > task"?  Why would anyone call this and what is special about the CPU
> > number it returns?

I just need a processor that keeps watch over the vmstat workers in the
system. The processor that does timekeeping is an obvious choice. I am
open to other suggestions.

Typically our system have processors that are used for OS processing and
processor that are focused on app services. Those need to be as
undisturbed as possible.

 > >
> >
> > > +int tick_get_housekeeping_cpu(void)
> > > +{
> > > +	int cpu;
> > > +
> > > +	if (system_state < SYSTEM_RUNNING || tick_do_timer_cpu < 0)
> > > +		cpu = raw_smp_processor_id();
>
> That's completely bogus. The system state check is pointless and
> tick_do_timer_cpu even more so because if you call that code from a
> worker thread tick_do_timer_cpu should be assigned to some cpu.
>
> Aside of that I'm having a hard time to understand why this stuff
> wants to move around at all.
>
> I think we agreed long ago, that for the whole HPC FULL_NOHZ stuff you
> have to sacrify at least one CPU for housekeeping purposes of all
> kinds, timekeeping, statistics and whatever.

Ok how do I figure out that cpu? I'd rather have a specific cpu that
never changes.

> So if you have a housekeeper, then it makes absolutely no sense at all
> to move it around in circles.
>
> Can you please enlighten me why we need this at all?

The vmstat kworker thread checks every 2 seconds if there are vmstat
updates that need to be folded into the global statistics. This is not
necessary if the application is running and no OS services are being used.
Thus we could switch off vmstat updates and avoid taking the processor
away from the application.

This has also been noted by multiple other people at was brought up at the
mm summit by others who noted the same issues.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-09 14:53     ` Christoph Lameter
@ 2014-05-09 15:05       ` Thomas Gleixner
  2014-05-09 15:28         ` Christoph Lameter
  0 siblings, 1 reply; 20+ messages in thread
From: Thomas Gleixner @ 2014-05-09 15:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Gilad Ben-Yossef, Tejun Heo, John Stultz,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, Paul E. McKenney, linux-kernel, linux-mm,
	hughd, viresh.kumar

On Fri, 9 May 2014, Christoph Lameter wrote:
> On Fri, 9 May 2014, Thomas Gleixner wrote:
> > I think we agreed long ago, that for the whole HPC FULL_NOHZ stuff you
> > have to sacrify at least one CPU for housekeeping purposes of all
> > kinds, timekeeping, statistics and whatever.
> 
> Ok how do I figure out that cpu? I'd rather have a specific cpu that
> never changes.

I followed the full nohz development only losely, but back then when
all started here at my place with frederic, we had a way to define the
housekeeper cpu. I think we lazily had it hardwired to 0 :)

That probably changed, but I'm sure there is still a way to define a
housekeeper. And we should simply force the timekeeping on that
housekeeper. That comes with the price, that the housekeeper is not
allowed to go deep idle, but I bet that in HPC scenarios this does not
matter at all simply because the whole machine is under full load.

Frederic?

> > So if you have a housekeeper, then it makes absolutely no sense at all
> > to move it around in circles.
> >
> > Can you please enlighten me why we need this at all?
> 
> The vmstat kworker thread checks every 2 seconds if there are vmstat
> updates that need to be folded into the global statistics. This is not
> necessary if the application is running and no OS services are being used.
> Thus we could switch off vmstat updates and avoid taking the processor
> away from the application.
> 
> This has also been noted by multiple other people at was brought up at the
> mm summit by others who noted the same issues.

I understand why you want to get this done by a housekeeper, I just
did not understand why we need this whole move it around business is
required.
 
Thanks,

	tglx

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-09 15:05       ` Thomas Gleixner
@ 2014-05-09 15:28         ` Christoph Lameter
  2014-05-09 22:57           ` Thomas Gleixner
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2014-05-09 15:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, Gilad Ben-Yossef, Tejun Heo, John Stultz,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, Paul E. McKenney, linux-kernel, linux-mm,
	hughd, viresh.kumar

On Fri, 9 May 2014, Thomas Gleixner wrote:

> > Ok how do I figure out that cpu? I'd rather have a specific cpu that
> > never changes.
>
> I followed the full nohz development only losely, but back then when
> all started here at my place with frederic, we had a way to define the
> housekeeper cpu. I think we lazily had it hardwired to 0 :)

Yes that would be the easiest and simplest. We dedicate cpu 0 to OS
services around
here anyways.

> That probably changed, but I'm sure there is still a way to define a
> housekeeper. And we should simply force the timekeeping on that
> housekeeper. That comes with the price, that the housekeeper is not
> allowed to go deep idle, but I bet that in HPC scenarios this does not
> matter at all simply because the whole machine is under full load.

Excellent. Yes. Good.

> >
> > The vmstat kworker thread checks every 2 seconds if there are vmstat
> > updates that need to be folded into the global statistics. This is not
> > necessary if the application is running and no OS services are being used.
> > Thus we could switch off vmstat updates and avoid taking the processor
> > away from the application.
> >
> > This has also been noted by multiple other people at was brought up at the
> > mm summit by others who noted the same issues.
>
> I understand why you want to get this done by a housekeeper, I just
> did not understand why we need this whole move it around business is
> required.

This came about because of another objection against having it simply
fixed to a processor. After all that processor may be disabled etc etc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-09 15:28         ` Christoph Lameter
@ 2014-05-09 22:57           ` Thomas Gleixner
  2014-05-09 23:47             ` Paul E. McKenney
  2014-05-10  0:34             ` Frederic Weisbecker
  0 siblings, 2 replies; 20+ messages in thread
From: Thomas Gleixner @ 2014-05-09 22:57 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Gilad Ben-Yossef, Tejun Heo, Mike Frysinger,
	Minchan Kim, Hakan Akkan, Max Krasnyansky, Frederic Weisbecker,
	Paul E. McKenney, LKML, linux-mm, hughd, viresh.kumar,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, John Stultz

On Fri, 9 May 2014, Christoph Lameter wrote:
> On Fri, 9 May 2014, Thomas Gleixner wrote:
> > I understand why you want to get this done by a housekeeper, I just
> > did not understand why we need this whole move it around business is
> > required.
> 
> This came about because of another objection against having it simply
> fixed to a processor. After all that processor may be disabled etc etc.

I really regret that I did not pay more attention (though my cycle
constraints simply do not allow it).

This is the typical overengineering failure: 

  Before we even have a working proof that we can solve the massive
  complex basic problem with the price of a dedicated housekeeper, we
  try to make the housekeeper itself a moving target with the price of
  making the problem exponential(unknown) instead of simply unknown.

I really cannot figure out why a moving housekeeper would be a
brilliant idea at all, but I'm sure there is some magic use case in
some other disjunct universe.

Whoever complained and came up with the NOT SO brilliant idea to make
the housekeeper a moving target, come please forth and explain:

- How this can be done without having a working solution with a
  dedicated housekeeper in the first place

- How this can be done without knowing what implication it has w/o
  seing the complexity of a dedicated housekeeper upfront.

Keep it simple has always been and still is the best engineering
principle.

We all know that we can do large scale overhauls in a very controlled
way if the need arises. But going for the most complex solution while
not knowing whether the least complex solution is feasible at all is
outright stupid or beyond.

Unless someone comes up with a reasonable explantion for all of this I
put a general NAK on patches which are directed to kernel/time/*

Correction:

I'm taking patches right away which undo any damage which has been
applied w/o me noticing because I trusted the responsible developers /
maintainers.

Preferrably those patches arrive before my return from LinuxCon Japan.

Thanks,

	tglx

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-09 22:57           ` Thomas Gleixner
@ 2014-05-09 23:47             ` Paul E. McKenney
  2014-05-10  0:48               ` Frederic Weisbecker
  2014-05-10 12:20               ` Thomas Gleixner
  2014-05-10  0:34             ` Frederic Weisbecker
  1 sibling, 2 replies; 20+ messages in thread
From: Paul E. McKenney @ 2014-05-09 23:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Lameter, Andrew Morton, Gilad Ben-Yossef, Tejun Heo,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, LKML, linux-mm, hughd, viresh.kumar,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, John Stultz

On Sat, May 10, 2014 at 12:57:15AM +0200, Thomas Gleixner wrote:
> On Fri, 9 May 2014, Christoph Lameter wrote:
> > On Fri, 9 May 2014, Thomas Gleixner wrote:
> > > I understand why you want to get this done by a housekeeper, I just
> > > did not understand why we need this whole move it around business is
> > > required.
> > 
> > This came about because of another objection against having it simply
> > fixed to a processor. After all that processor may be disabled etc etc.
> 
> I really regret that I did not pay more attention (though my cycle
> constraints simply do not allow it).

As far as I can see, the NO_HZ_FULL timekeeping CPU is always zero.  If it
can change in NO_HZ_FULL kernels, RCU will do some very strange things!

One possible issue here is that Christoph's patch is unconditional.
It takes effect for both NO_HZ_FULL and !NO_HZ_FULL.  If I recall
correctly, the timekeeping CPU -can- change in !NO_HZ_FULL kernels,
which might be what Christoph was trying to take into account.

> This is the typical overengineering failure: 
> 
>   Before we even have a working proof that we can solve the massive
>   complex basic problem with the price of a dedicated housekeeper, we
>   try to make the housekeeper itself a moving target with the price of
>   making the problem exponential(unknown) instead of simply unknown.
> 
> I really cannot figure out why a moving housekeeper would be a
> brilliant idea at all, but I'm sure there is some magic use case in
> some other disjunct universe.
> 
> Whoever complained and came up with the NOT SO brilliant idea to make
> the housekeeper a moving target, come please forth and explain:
> 
> - How this can be done without having a working solution with a
>   dedicated housekeeper in the first place
> 
> - How this can be done without knowing what implication it has w/o
>   seing the complexity of a dedicated housekeeper upfront.
> 
> Keep it simple has always been and still is the best engineering
> principle.

If someone decides to make tick_do_timer_cpu non-constant in NO_HZ_FULL
CPUs, they will break unless/until I make RCU deal with that sort
of thing, at least for NO_HZ_FULL_SYSIDLE kernels.  ;-)

> We all know that we can do large scale overhauls in a very controlled
> way if the need arises. But going for the most complex solution while
> not knowing whether the least complex solution is feasible at all is
> outright stupid or beyond.
> 
> Unless someone comes up with a reasonable explantion for all of this I
> put a general NAK on patches which are directed to kernel/time/*
> 
> Correction:
> 
> I'm taking patches right away which undo any damage which has been
> applied w/o me noticing because I trusted the responsible developers /
> maintainers.
> 
> Preferrably those patches arrive before my return from LinuxCon Japan.

I could easily have missed something, but as far as I know, there is
nothing in the current kernel that allows tick_do_timer_cpu to move in
NO_HZ_FULL kernels.

Hmmm...  Well, I -do- have a gratuitous ACCESS_ONCE() around one fetch
of tick_do_timer_cpu that happens only in NO_HZ_FULL kernels.  I will
remove it.

							Thanx, Paul

> Thanks,
> 
> 	tglx
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-09 23:47             ` Paul E. McKenney
@ 2014-05-10  0:48               ` Frederic Weisbecker
  2014-05-10 12:31                 ` Thomas Gleixner
  2014-05-10 12:20               ` Thomas Gleixner
  1 sibling, 1 reply; 20+ messages in thread
From: Frederic Weisbecker @ 2014-05-10  0:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Christoph Lameter, Andrew Morton,
	Gilad Ben-Yossef, Tejun Heo, Mike Frysinger, Minchan Kim,
	Hakan Akkan, Max Krasnyansky, LKML, linux-mm, hughd, viresh.kumar,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, John Stultz

On Fri, May 09, 2014 at 04:47:45PM -0700, Paul E. McKenney wrote:
> On Sat, May 10, 2014 at 12:57:15AM +0200, Thomas Gleixner wrote:
> If someone decides to make tick_do_timer_cpu non-constant in NO_HZ_FULL
> CPUs, they will break unless/until I make RCU deal with that sort
> of thing, at least for NO_HZ_FULL_SYSIDLE kernels.  ;-)
> 
> > We all know that we can do large scale overhauls in a very controlled
> > way if the need arises. But going for the most complex solution while
> > not knowing whether the least complex solution is feasible at all is
> > outright stupid or beyond.
> > 
> > Unless someone comes up with a reasonable explantion for all of this I
> > put a general NAK on patches which are directed to kernel/time/*
> > 
> > Correction:
> > 
> > I'm taking patches right away which undo any damage which has been
> > applied w/o me noticing because I trusted the responsible developers /
> > maintainers.
> > 
> > Preferrably those patches arrive before my return from LinuxCon Japan.
> 
> I could easily have missed something, but as far as I know, there is
> nothing in the current kernel that allows tick_do_timer_cpu to move in
> NO_HZ_FULL kernels.

Right.

So we agree that housekeeping/timekeeping is going to stay CPU 0 for now.

But I still have the plan to make the timekeeper use the full sysidle
facility in order to adaptively get to dynticks idle.

Reminder for others: in NO_HZ_FULL, the timekeeper (always CPU 0) stays
completely periodic. It can't enter in dynticks idle mode because it
must maintain timekeeping on behalf of full dynticks CPUs. So that's
a power issue.

But Paul has a feature in RCU that lets us know when all CPUs are idle
and the timekeeper can finally sleep. Then when a full nohz CPU wakes
up from idle, it sends an IPI to the timekeeper if needed so the latter
restarts timekeeping maintainance.

It's not complicated to add to the timer code.
Most of the code is already there, in RCU, for a while already.

Are we keeping that direction? 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-10  0:48               ` Frederic Weisbecker
@ 2014-05-10 12:31                 ` Thomas Gleixner
  2014-05-10 13:14                   ` Frederic Weisbecker
  0 siblings, 1 reply; 20+ messages in thread
From: Thomas Gleixner @ 2014-05-10 12:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Christoph Lameter, Andrew Morton,
	Gilad Ben-Yossef, Tejun Heo, Mike Frysinger, Minchan Kim,
	Hakan Akkan, Max Krasnyansky, LKML, linux-mm, hughd, viresh.kumar,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, John Stultz

On Sat, 10 May 2014, Frederic Weisbecker wrote:
> But I still have the plan to make the timekeeper use the full sysidle
> facility in order to adaptively get to dynticks idle.
> 
> Reminder for others: in NO_HZ_FULL, the timekeeper (always CPU 0) stays
> completely periodic. It can't enter in dynticks idle mode because it
> must maintain timekeeping on behalf of full dynticks CPUs. So that's
> a power issue.
> 
> But Paul has a feature in RCU that lets us know when all CPUs are idle
> and the timekeeper can finally sleep. Then when a full nohz CPU wakes
> up from idle, it sends an IPI to the timekeeper if needed so the latter
> restarts timekeeping maintainance.
>
> It's not complicated to add to the timer code.
> Most of the code is already there, in RCU, for a while already.
> 
> Are we keeping that direction? 

So the idea is that the timekeeper stays on cpu0, but if everything is
idle it is allowed to take a long nap as well. So if some other cpu
wakes up it updates timekeeping without taking over the time keeper
duty and if it has work to do, it kicks cpu0 into gear. If it just
goes back to sleep, then nothing to do.

No objections from my side.

Thanks,

	tglx

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-10 12:31                 ` Thomas Gleixner
@ 2014-05-10 13:14                   ` Frederic Weisbecker
  2014-05-11  1:17                     ` Paul E. McKenney
  0 siblings, 1 reply; 20+ messages in thread
From: Frederic Weisbecker @ 2014-05-10 13:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul E. McKenney, Christoph Lameter, Andrew Morton,
	Gilad Ben-Yossef, Tejun Heo, Mike Frysinger, Minchan Kim,
	Hakan Akkan, Max Krasnyansky, LKML, linux-mm, hughd, viresh.kumar,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, John Stultz

On Sat, May 10, 2014 at 02:31:28PM +0200, Thomas Gleixner wrote:
> On Sat, 10 May 2014, Frederic Weisbecker wrote:
> > But I still have the plan to make the timekeeper use the full sysidle
> > facility in order to adaptively get to dynticks idle.
> > 
> > Reminder for others: in NO_HZ_FULL, the timekeeper (always CPU 0) stays
> > completely periodic. It can't enter in dynticks idle mode because it
> > must maintain timekeeping on behalf of full dynticks CPUs. So that's
> > a power issue.
> > 
> > But Paul has a feature in RCU that lets us know when all CPUs are idle
> > and the timekeeper can finally sleep. Then when a full nohz CPU wakes
> > up from idle, it sends an IPI to the timekeeper if needed so the latter
> > restarts timekeeping maintainance.
> >
> > It's not complicated to add to the timer code.
> > Most of the code is already there, in RCU, for a while already.
> > 
> > Are we keeping that direction? 
> 
> So the idea is that the timekeeper stays on cpu0, but if everything is
> idle it is allowed to take a long nap as well. So if some other cpu
> wakes up it updates timekeeping without taking over the time keeper
> duty and if it has work to do, it kicks cpu0 into gear. If it just
> goes back to sleep, then nothing to do.

Exactly! Except perhaps the last sentence "If it just goes back to sleep,
then nothing to do.", I didn't think about that although this special case
is quite frequent indeed when an interrupt fires on idle but no task is woken up.

Maybe I should move the code that fires the IPI to cpu0, if it is sleeping,
on irq exit (the plan was to do it right away on irq enter) and fire it
only if need_resched().

> 
> No objections from my side.

Great! Thanks for checking that!

> 
> Thanks,
> 
> 	tglx
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-10 13:14                   ` Frederic Weisbecker
@ 2014-05-11  1:17                     ` Paul E. McKenney
  2014-05-11  1:30                       ` Frederic Weisbecker
  0 siblings, 1 reply; 20+ messages in thread
From: Paul E. McKenney @ 2014-05-11  1:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, Christoph Lameter, Andrew Morton,
	Gilad Ben-Yossef, Tejun Heo, Mike Frysinger, Minchan Kim,
	Hakan Akkan, Max Krasnyansky, LKML, linux-mm, hughd, viresh.kumar,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, John Stultz

On Sat, May 10, 2014 at 03:14:25PM +0200, Frederic Weisbecker wrote:
> On Sat, May 10, 2014 at 02:31:28PM +0200, Thomas Gleixner wrote:
> > On Sat, 10 May 2014, Frederic Weisbecker wrote:
> > > But I still have the plan to make the timekeeper use the full sysidle
> > > facility in order to adaptively get to dynticks idle.
> > > 
> > > Reminder for others: in NO_HZ_FULL, the timekeeper (always CPU 0) stays
> > > completely periodic. It can't enter in dynticks idle mode because it
> > > must maintain timekeeping on behalf of full dynticks CPUs. So that's
> > > a power issue.
> > > 
> > > But Paul has a feature in RCU that lets us know when all CPUs are idle
> > > and the timekeeper can finally sleep. Then when a full nohz CPU wakes
> > > up from idle, it sends an IPI to the timekeeper if needed so the latter
> > > restarts timekeeping maintainance.
> > >
> > > It's not complicated to add to the timer code.
> > > Most of the code is already there, in RCU, for a while already.
> > > 
> > > Are we keeping that direction? 
> > 
> > So the idea is that the timekeeper stays on cpu0, but if everything is
> > idle it is allowed to take a long nap as well. So if some other cpu
> > wakes up it updates timekeeping without taking over the time keeper
> > duty and if it has work to do, it kicks cpu0 into gear. If it just
> > goes back to sleep, then nothing to do.

Hmmm...  If RCU is supposed to ignore the fact that one of the other
CPUs woke up momentarily, we will need to adjust things a bit.

> Exactly! Except perhaps the last sentence "If it just goes back to sleep,
> then nothing to do.", I didn't think about that although this special case
> is quite frequent indeed when an interrupt fires on idle but no task is woken up.
> 
> Maybe I should move the code that fires the IPI to cpu0, if it is sleeping,
> on irq exit (the plan was to do it right away on irq enter) and fire it
> only if need_resched().

And of course if that code path contains any RCU read-side critical
sections, RCU absolutely cannot ignore that CPU's momentary wakeup.

							Thanx, Paul

> > No objections from my side.
> 
> Great! Thanks for checking that!
> 
> > 
> > Thanks,
> > 
> > 	tglx
> > 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-11  1:17                     ` Paul E. McKenney
@ 2014-05-11  1:30                       ` Frederic Weisbecker
  2014-05-11  1:38                         ` Paul E. McKenney
  0 siblings, 1 reply; 20+ messages in thread
From: Frederic Weisbecker @ 2014-05-11  1:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Christoph Lameter, Andrew Morton,
	Gilad Ben-Yossef, Tejun Heo, Mike Frysinger, Minchan Kim,
	Hakan Akkan, Max Krasnyansky, LKML, linux-mm, hughd, viresh.kumar,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, John Stultz

On Sat, May 10, 2014 at 06:17:08PM -0700, Paul E. McKenney wrote:
> On Sat, May 10, 2014 at 03:14:25PM +0200, Frederic Weisbecker wrote:
> > On Sat, May 10, 2014 at 02:31:28PM +0200, Thomas Gleixner wrote:
> > > On Sat, 10 May 2014, Frederic Weisbecker wrote:
> > > > But I still have the plan to make the timekeeper use the full sysidle
> > > > facility in order to adaptively get to dynticks idle.
> > > > 
> > > > Reminder for others: in NO_HZ_FULL, the timekeeper (always CPU 0) stays
> > > > completely periodic. It can't enter in dynticks idle mode because it
> > > > must maintain timekeeping on behalf of full dynticks CPUs. So that's
> > > > a power issue.
> > > > 
> > > > But Paul has a feature in RCU that lets us know when all CPUs are idle
> > > > and the timekeeper can finally sleep. Then when a full nohz CPU wakes
> > > > up from idle, it sends an IPI to the timekeeper if needed so the latter
> > > > restarts timekeeping maintainance.
> > > >
> > > > It's not complicated to add to the timer code.
> > > > Most of the code is already there, in RCU, for a while already.
> > > > 
> > > > Are we keeping that direction? 
> > > 
> > > So the idea is that the timekeeper stays on cpu0, but if everything is
> > > idle it is allowed to take a long nap as well. So if some other cpu
> > > wakes up it updates timekeeping without taking over the time keeper
> > > duty and if it has work to do, it kicks cpu0 into gear. If it just
> > > goes back to sleep, then nothing to do.
> 
> Hmmm...  If RCU is supposed to ignore the fact that one of the other
> CPUs woke up momentarily, we will need to adjust things a bit.

Maybe not that much actually.

> 
> > Exactly! Except perhaps the last sentence "If it just goes back to sleep,
> > then nothing to do.", I didn't think about that although this special case
> > is quite frequent indeed when an interrupt fires on idle but no task is woken up.
> > 
> > Maybe I should move the code that fires the IPI to cpu0, if it is sleeping,
> > on irq exit (the plan was to do it right away on irq enter) and fire it
> > only if need_resched().
> 
> And of course if that code path contains any RCU read-side critical
> sections, RCU absolutely cannot ignore that CPU's momentary wakeup.

Sure the core RCU still needs to know that the CPU went out of dynticks the
time of the irq, so we keep the rcu_irq_enter/rcu_irq_exit calls.

But if the CPU only wakes up to serve an IRQ, it doesn't need to tell the RCU
sysidle detection about it. The irq entry fixup jiffies on dynticks idle mode,
this should be enough.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-11  1:30                       ` Frederic Weisbecker
@ 2014-05-11  1:38                         ` Paul E. McKenney
  0 siblings, 0 replies; 20+ messages in thread
From: Paul E. McKenney @ 2014-05-11  1:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, Christoph Lameter, Andrew Morton,
	Gilad Ben-Yossef, Tejun Heo, Mike Frysinger, Minchan Kim,
	Hakan Akkan, Max Krasnyansky, LKML, linux-mm, hughd, viresh.kumar,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, John Stultz

On Sun, May 11, 2014 at 03:30:31AM +0200, Frederic Weisbecker wrote:
> On Sat, May 10, 2014 at 06:17:08PM -0700, Paul E. McKenney wrote:
> > On Sat, May 10, 2014 at 03:14:25PM +0200, Frederic Weisbecker wrote:
> > > On Sat, May 10, 2014 at 02:31:28PM +0200, Thomas Gleixner wrote:
> > > > On Sat, 10 May 2014, Frederic Weisbecker wrote:
> > > > > But I still have the plan to make the timekeeper use the full sysidle
> > > > > facility in order to adaptively get to dynticks idle.
> > > > > 
> > > > > Reminder for others: in NO_HZ_FULL, the timekeeper (always CPU 0) stays
> > > > > completely periodic. It can't enter in dynticks idle mode because it
> > > > > must maintain timekeeping on behalf of full dynticks CPUs. So that's
> > > > > a power issue.
> > > > > 
> > > > > But Paul has a feature in RCU that lets us know when all CPUs are idle
> > > > > and the timekeeper can finally sleep. Then when a full nohz CPU wakes
> > > > > up from idle, it sends an IPI to the timekeeper if needed so the latter
> > > > > restarts timekeeping maintainance.
> > > > >
> > > > > It's not complicated to add to the timer code.
> > > > > Most of the code is already there, in RCU, for a while already.
> > > > > 
> > > > > Are we keeping that direction? 
> > > > 
> > > > So the idea is that the timekeeper stays on cpu0, but if everything is
> > > > idle it is allowed to take a long nap as well. So if some other cpu
> > > > wakes up it updates timekeeping without taking over the time keeper
> > > > duty and if it has work to do, it kicks cpu0 into gear. If it just
> > > > goes back to sleep, then nothing to do.
> > 
> > Hmmm...  If RCU is supposed to ignore the fact that one of the other
> > CPUs woke up momentarily, we will need to adjust things a bit.
> 
> Maybe not that much actually.
> 
> > 
> > > Exactly! Except perhaps the last sentence "If it just goes back to sleep,
> > > then nothing to do.", I didn't think about that although this special case
> > > is quite frequent indeed when an interrupt fires on idle but no task is woken up.
> > > 
> > > Maybe I should move the code that fires the IPI to cpu0, if it is sleeping,
> > > on irq exit (the plan was to do it right away on irq enter) and fire it
> > > only if need_resched().
> > 
> > And of course if that code path contains any RCU read-side critical
> > sections, RCU absolutely cannot ignore that CPU's momentary wakeup.
> 
> Sure the core RCU still needs to know that the CPU went out of dynticks the
> time of the irq, so we keep the rcu_irq_enter/rcu_irq_exit calls.
> 
> But if the CPU only wakes up to serve an IRQ, it doesn't need to tell the RCU
> sysidle detection about it. The irq entry fixup jiffies on dynticks idle mode,
> this should be enough.

As long as you pass me in a hint so that RCU knows which case it is
dealing with.  ;-)

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-09 23:47             ` Paul E. McKenney
  2014-05-10  0:48               ` Frederic Weisbecker
@ 2014-05-10 12:20               ` Thomas Gleixner
  2014-05-11  1:12                 ` Paul E. McKenney
  1 sibling, 1 reply; 20+ messages in thread
From: Thomas Gleixner @ 2014-05-10 12:20 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Christoph Lameter, Andrew Morton, Gilad Ben-Yossef, Tejun Heo,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, LKML, linux-mm, hughd, viresh.kumar,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, John Stultz

On Fri, 9 May 2014, Paul E. McKenney wrote:

> On Sat, May 10, 2014 at 12:57:15AM +0200, Thomas Gleixner wrote:
> > On Fri, 9 May 2014, Christoph Lameter wrote:
> > > On Fri, 9 May 2014, Thomas Gleixner wrote:
> > > > I understand why you want to get this done by a housekeeper, I just
> > > > did not understand why we need this whole move it around business is
> > > > required.
> > > 
> > > This came about because of another objection against having it simply
> > > fixed to a processor. After all that processor may be disabled etc etc.
> > 
> > I really regret that I did not pay more attention (though my cycle
> > constraints simply do not allow it).
> 
> As far as I can see, the NO_HZ_FULL timekeeping CPU is always zero.  If it
> can change in NO_HZ_FULL kernels, RCU will do some very strange things!

Good. I seriously hope it stays that way.

> One possible issue here is that Christoph's patch is unconditional.
> It takes effect for both NO_HZ_FULL and !NO_HZ_FULL.  If I recall
> correctly, the timekeeping CPU -can- change in !NO_HZ_FULL kernels,
> which might be what Christoph was trying to take into account.

Ok. Sorry, I was just in a lousy mood after wasting half a day in
reviewing even lousier patches related to that NO_HZ* muck.

So, right with NO_HZ_IDLE the time keeper can move around and
housekeeping stuff might want to move around as well.

But it's not necessary a good idea to bundle that with the timekeeper,
as under certain conditions the timekeeper duty can move around fast
and left unassigned again when the system is fully idle.

And we really do not want a gazillion of sites which implement a
metric ton of different ways to connect some random housekeeping jobs
with the timekeeper.

So the proper solution to this is to have either a thread or a
dedicated housekeeping worker, which is placed by the scheduler
depending on the system configuration and workload.

That way it can be kept at cpu0 for the nohz=off and the nohz_full
case. In the nohz_idle case we can have different placement
algorithms. On a big/little ARM machine you probably want to keep it
on the first cpu of one or the other cluster. And there might be other
constraints on servers.

So we are way better of with a generic facility, where the various
housekeeping jobs can be queued.

Does that make sense?

Thanks,

	tglx

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-10 12:20               ` Thomas Gleixner
@ 2014-05-11  1:12                 ` Paul E. McKenney
  0 siblings, 0 replies; 20+ messages in thread
From: Paul E. McKenney @ 2014-05-11  1:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Lameter, Andrew Morton, Gilad Ben-Yossef, Tejun Heo,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, LKML, linux-mm, hughd, viresh.kumar,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, John Stultz

On Sat, May 10, 2014 at 02:20:36PM +0200, Thomas Gleixner wrote:
> On Fri, 9 May 2014, Paul E. McKenney wrote:
> 
> > On Sat, May 10, 2014 at 12:57:15AM +0200, Thomas Gleixner wrote:
> > > On Fri, 9 May 2014, Christoph Lameter wrote:
> > > > On Fri, 9 May 2014, Thomas Gleixner wrote:
> > > > > I understand why you want to get this done by a housekeeper, I just
> > > > > did not understand why we need this whole move it around business is
> > > > > required.
> > > > 
> > > > This came about because of another objection against having it simply
> > > > fixed to a processor. After all that processor may be disabled etc etc.
> > > 
> > > I really regret that I did not pay more attention (though my cycle
> > > constraints simply do not allow it).
> > 
> > As far as I can see, the NO_HZ_FULL timekeeping CPU is always zero.  If it
> > can change in NO_HZ_FULL kernels, RCU will do some very strange things!
> 
> Good. I seriously hope it stays that way.

Unless and until systems end up with so many CPUs that a single CPU
cannot keep up with all the housekeeping tasks.  But we should wait to
burn that bridge until after we drive off it.  ;-)

> > One possible issue here is that Christoph's patch is unconditional.
> > It takes effect for both NO_HZ_FULL and !NO_HZ_FULL.  If I recall
> > correctly, the timekeeping CPU -can- change in !NO_HZ_FULL kernels,
> > which might be what Christoph was trying to take into account.
> 
> Ok. Sorry, I was just in a lousy mood after wasting half a day in
> reviewing even lousier patches related to that NO_HZ* muck.

I can relate...

> So, right with NO_HZ_IDLE the time keeper can move around and
> housekeeping stuff might want to move around as well.
> 
> But it's not necessary a good idea to bundle that with the timekeeper,
> as under certain conditions the timekeeper duty can move around fast
> and left unassigned again when the system is fully idle.
> 
> And we really do not want a gazillion of sites which implement a
> metric ton of different ways to connect some random housekeeping jobs
> with the timekeeper.
> 
> So the proper solution to this is to have either a thread or a
> dedicated housekeeping worker, which is placed by the scheduler
> depending on the system configuration and workload.
> 
> That way it can be kept at cpu0 for the nohz=off and the nohz_full
> case. In the nohz_idle case we can have different placement
> algorithms. On a big/little ARM machine you probably want to keep it
> on the first cpu of one or the other cluster. And there might be other
> constraints on servers.
> 
> So we are way better of with a generic facility, where the various
> housekeeping jobs can be queued.
> 
> Does that make sense?

It might well.

Here is what I currently do for RCU:

1.	If !NO_HZ_FULL, I let the grace-period kthreads run wherever
	the scheduler wants them to.

2.	If NO_HZ_FULL, I bind the grace-period kthreads to the
	timekeeping CPU.

But if I could just mark it as a housekeeping kthread and have something
take care of it.

So let's see...

Your nohz=off case recognizes a real-time setup, correct?  In which
case it does make sense to get the housekeeping out of the way of the
worker CPUs.  I would look pretty silly arguing against the nohz_full
case, since that is what RCU does.  Right now I just pay attention to
the Kconfig parameter, but perhaps it would make sense to also look at
the boot parameters.  Especially since some distros seem to be setting
NO_HZ_FULL by default.  ;-)

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-09 22:57           ` Thomas Gleixner
  2014-05-09 23:47             ` Paul E. McKenney
@ 2014-05-10  0:34             ` Frederic Weisbecker
  2014-05-10 12:22               ` Thomas Gleixner
  2014-05-11  1:14               ` Paul E. McKenney
  1 sibling, 2 replies; 20+ messages in thread
From: Frederic Weisbecker @ 2014-05-10  0:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Lameter, Andrew Morton, Gilad Ben-Yossef, Tejun Heo,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Paul E. McKenney, LKML, linux-mm, hughd, viresh.kumar,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, John Stultz

On Sat, May 10, 2014 at 12:57:15AM +0200, Thomas Gleixner wrote:
> On Fri, 9 May 2014, Christoph Lameter wrote:
> > On Fri, 9 May 2014, Thomas Gleixner wrote:
> > > I understand why you want to get this done by a housekeeper, I just
> > > did not understand why we need this whole move it around business is
> > > required.
> > 
> > This came about because of another objection against having it simply
> > fixed to a processor. After all that processor may be disabled etc etc.
> 
> I really regret that I did not pay more attention (though my cycle
> constraints simply do not allow it).
> 
> This is the typical overengineering failure: 
> 
>   Before we even have a working proof that we can solve the massive
>   complex basic problem with the price of a dedicated housekeeper, we
>   try to make the housekeeper itself a moving target with the price of
>   making the problem exponential(unknown) instead of simply unknown.
> 
> I really cannot figure out why a moving housekeeper would be a
> brilliant idea at all, but I'm sure there is some magic use case in
> some other disjunct universe.
> 
> Whoever complained and came up with the NOT SO brilliant idea to make
> the housekeeper a moving target, come please forth and explain:
> 
> - How this can be done without having a working solution with a
>   dedicated housekeeper in the first place
> 
> - How this can be done without knowing what implication it has w/o
>   seing the complexity of a dedicated housekeeper upfront.
> 
> Keep it simple has always been and still is the best engineering
> principle.
> 
> We all know that we can do large scale overhauls in a very controlled
> way if the need arises. But going for the most complex solution while
> not knowing whether the least complex solution is feasible at all is
> outright stupid or beyond.
> 
> Unless someone comes up with a reasonable explantion for all of this I
> put a general NAK on patches which are directed to kernel/time/*
> 
> Correction:
> 
> I'm taking patches right away which undo any damage which has been
> applied w/o me noticing because I trusted the responsible developers /
> maintainers.
> 
> Preferrably those patches arrive before my return from LinuxCon Japan.

Yeah my plan was to have a variable housekeeping CPU. In fact the reason
was more about power optimization: having all non-full-nohz CPUs able
to handle the timekeeping duty (and hence housekeeping) could help
further to balance timekeeping.

But then I sent a patchset with that in mind (https://lkml.org/lkml/2013/12/17/708)
but it was too complicated. Doing it correctly is too hard for now.

Anyway I agree that was overengineering at this stage.

Fortunately nothing has been applied. I was too busy with cleanups and workqueues
affinity.

So the "only" damage is on bad directions given to Christoph. But you know
how I use GPS...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-10  0:34             ` Frederic Weisbecker
@ 2014-05-10 12:22               ` Thomas Gleixner
  2014-05-11  1:14               ` Paul E. McKenney
  1 sibling, 0 replies; 20+ messages in thread
From: Thomas Gleixner @ 2014-05-10 12:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, Andrew Morton, Gilad Ben-Yossef, Tejun Heo,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Paul E. McKenney, LKML, linux-mm, hughd, viresh.kumar,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, John Stultz

On Sat, 10 May 2014, Frederic Weisbecker wrote:
> Anyway I agree that was overengineering at this stage.
> 
> Fortunately nothing has been applied. I was too busy with cleanups and workqueues
> affinity.

Good.
 
> So the "only" damage is on bad directions given to Christoph. But you know
> how I use GPS...

Yeah, especially when it's switched to 'Frederic universe' mode :)

Thanks,

	tglx

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-10  0:34             ` Frederic Weisbecker
  2014-05-10 12:22               ` Thomas Gleixner
@ 2014-05-11  1:14               ` Paul E. McKenney
  1 sibling, 0 replies; 20+ messages in thread
From: Paul E. McKenney @ 2014-05-11  1:14 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, Christoph Lameter, Andrew Morton,
	Gilad Ben-Yossef, Tejun Heo, Mike Frysinger, Minchan Kim,
	Hakan Akkan, Max Krasnyansky, LKML, linux-mm, hughd, viresh.kumar,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, John Stultz

On Sat, May 10, 2014 at 02:34:49AM +0200, Frederic Weisbecker wrote:

[ . . . ]

> So the "only" damage is on bad directions given to Christoph. But you know
> how I use GPS...

Well, my redundant ACCESS_ONCE() around tick_do_timer_cpu was also
quite misleading...  :-/

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: vmstat: On demand vmstat workers V4
  2014-05-08 21:29 ` Andrew Morton
  2014-05-08 22:18   ` Thomas Gleixner
@ 2014-05-12 16:25   ` Christoph Lameter
  1 sibling, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2014-05-12 16:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gilad Ben-Yossef, Thomas Gleixner, Tejun Heo, John Stultz,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, Paul E. McKenney, linux-kernel, linux-mm,
	hughd, viresh.kumar

On Thu, 8 May 2014, Andrew Morton wrote:

> Some explanation of the changes to kernel/time/tick-common.c would be
> appropriate.

I dropped those after the discussion related to housekeepig cpus.

> > +		cancel_delayed_work_sync(d);
> > +		cpumask_set_cpu(smp_processor_id(), monitored_cpus);
> > +		cpumask_clear_cpu(s, monitored_cpus);
> > +		INIT_DELAYED_WORK(d, vmstat_shepherd);
>
> INIT_DELAYED_WORK() seems inappropriate here.  It's generally used for
> once-off initialisation of a freshly allocated work item.  Look at all
> the stuff it does - do we really want to run debug_object_init()
> against an active object?

Well this function is the one off initialization. INIT_DEFERRABLE_WORK in
vmstat_shepherd() is a case of repeatedly initializing the per cpu
structure when the worker thread is started again. In order to remove
that I would have to do a loop initializing the structures at startup
time. In V4 there were different function depending on the processor
and they could change. With the housekeeping processor fixed that is no
longer the case. Should I loop over the whole structure and set the
functions at init time?

> >  	case CPU_DOWN_PREPARE_FROZEN:
> > -		cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
> > +		if (!cpumask_test_cpu(cpu, monitored_cpus))
>
> This test is inverted isn't it?

If the monitoring cpu bit is not set then the worker thread is active
and needs to be cancelled. There is a race here so I used test_and_clear
here in the new revision.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2014-05-12 16:25 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-05-08 15:35 vmstat: On demand vmstat workers V4 Christoph Lameter
2014-05-08 21:29 ` Andrew Morton
2014-05-08 22:18   ` Thomas Gleixner
2014-05-09 14:53     ` Christoph Lameter
2014-05-09 15:05       ` Thomas Gleixner
2014-05-09 15:28         ` Christoph Lameter
2014-05-09 22:57           ` Thomas Gleixner
2014-05-09 23:47             ` Paul E. McKenney
2014-05-10  0:48               ` Frederic Weisbecker
2014-05-10 12:31                 ` Thomas Gleixner
2014-05-10 13:14                   ` Frederic Weisbecker
2014-05-11  1:17                     ` Paul E. McKenney
2014-05-11  1:30                       ` Frederic Weisbecker
2014-05-11  1:38                         ` Paul E. McKenney
2014-05-10 12:20               ` Thomas Gleixner
2014-05-11  1:12                 ` Paul E. McKenney
2014-05-10  0:34             ` Frederic Weisbecker
2014-05-10 12:22               ` Thomas Gleixner
2014-05-11  1:14               ` Paul E. McKenney
2014-05-12 16:25   ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).