linux-kernel.vger.kernel.org archive mirror

* [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 15:42 [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats Frederic Weisbecker
@ 2013-08-16 15:42 ` Frederic Weisbecker
  2013-08-16 16:02   ` Oleg Nesterov
  2013-08-18 16:54   ` Oleg Nesterov
  0 siblings, 2 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 15:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Frederic Weisbecker, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Arjan van de Ven

When a call site uses get_cpu_*_time_us() to read a sleeptime
stat, it derives the total sleeptime by adding the pending time
to the last sleeptime snapshot if the target CPU is idle.

Namely this sums up to:

       sleeptime = ts($CPU)->idle_sleeptime;
       if (ts($CPU)->idle_active)
               sleeptime += NOW() - ts($CPU)->idle_entrytime;

But this only works if idle_sleeptime, idle_entrytime and idle_active are
read and updated in a disciplined order.

Let's consider the following scenario:

             CPU 0                                            CPU 1

  (seq 1)    ts(CPU 0)->idle_active = 1
             ts(CPU 0)->idle_entrytime = NOW()

  (seq 2)    sleeptime = NOW() - ts(CPU 0)->idle_entrytime
             ts(CPU 0)->idle_sleeptime += sleeptime           sleeptime = ts(CPU 0)->idle_sleeptime;
                                                              if (ts(CPU 0)->idle_active)
             ts(CPU 0)->idle_entrytime = NOW()                    sleeptime += NOW() - ts(CPU 0)->idle_entrytime

The resulting value of sleeptime on CPU 1 can vary depending on the
ordering:

* If it sees the value of idle_entrytime after seq 1 and the value of idle_sleeptime
after seq 2, the value of sleeptime will be buggy because it accounts for the delta
twice, so it will be too high.

* If it sees the value of idle_entrytime after seq 2 and the value of idle_sleeptime
after seq 1, the value of sleeptime will be buggy because it misses the delta, so it
will be too low.

* If it sees both idle_entrytime and idle_sleeptime as they were after seq 1, or both
as they were after seq 2, the value will be correct.

Trickier scenarios can also happen if the idle_active value is read from a former sequence.

Hence we must honour the following constraints:

- idle_sleeptime, idle_active and idle_entrytime must be updated and read
under correctly enforced SMP ordering.

- The three values as read by CPU 1 must belong to the same update
sequence from CPU 0. The update sequences must be delimited such that the
three values left after a sequence completes produce a coherent result
when read from CPU 1.

- Readers must be prevented from fetching mid-sequence values.

The ideal solution to implement this synchronization is a seqcount. Let's
use one here around these three values to enforce sequencing between
updates and reads.
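
In pseudo-code, the two sides of the pattern pair up as follows (a minimal
sketch distilled from the diff below, using the sleeptime_seq field it
introduces):

       /* update side: idle enter / exit */
       write_seqcount_begin(&ts->sleeptime_seq);
       /* ... update idle_sleeptime, idle_entrytime, idle_active ... */
       write_seqcount_end(&ts->sleeptime_seq);

       /* read side: get_cpu_*_time_us() */
       do {
               seq = read_seqcount_begin(&ts->sleeptime_seq);
               /* ... compute sleeptime from a coherent snapshot ... */
       } while (read_seqcount_retry(&ts->sleeptime_seq, seq));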

This fixes a reported bug where non-monotonic sleeptime stats are returned by /proc/stat
when it is frequently read, as well as potential cpufreq governor bugs.

Reported-by: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/tick.h     |    2 ++
 kernel/time/tick-sched.c |   37 +++++++++++++++++++++++++------------
 2 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 62bd8b7..49f9720 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
 #include <linux/irqflags.h>
 #include <linux/percpu.h>
 #include <linux/hrtimer.h>
+#include <linux/seqlock.h>
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 
@@ -70,6 +71,7 @@ struct tick_sched {
 	unsigned long			next_jiffies;
 	ktime_t				idle_expires;
 	int				do_timer_last;
+	seqcount_t			sleeptime_seq;
 };
 
 extern void __init tick_init(void);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index ede0405..f7fc27b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -414,6 +414,7 @@ static void tick_nohz_stop_idle(int cpu, ktime_t now)
 	ktime_t delta;
 
 	/* Updates the per cpu time idle statistics counters */
+	write_seqcount_begin(&ts->sleeptime_seq);
 	delta = ktime_sub(now, ts->idle_entrytime);
 	if (nr_iowait_cpu(cpu) > 0)
 		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
@@ -421,6 +422,7 @@ static void tick_nohz_stop_idle(int cpu, ktime_t now)
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
 	ts->idle_entrytime = now;
 	ts->idle_active = 0;
+	write_seqcount_end(&ts->sleeptime_seq);
 
 	sched_clock_idle_wakeup_event(0);
 }
@@ -429,8 +431,11 @@ static ktime_t tick_nohz_start_idle(int cpu, struct tick_sched *ts)
 {
 	ktime_t now = ktime_get();
 
+	write_seqcount_begin(&ts->sleeptime_seq);
 	ts->idle_entrytime = now;
 	ts->idle_active = 1;
+	write_seqcount_end(&ts->sleeptime_seq);
+
 	sched_clock_idle_sleep_event();
 	return now;
 }
@@ -453,6 +458,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	ktime_t now, idle;
+	unsigned int seq;
 
 	if (!tick_nohz_enabled)
 		return -1;
@@ -461,12 +467,15 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
-	if (ts->idle_active && !nr_iowait_cpu(cpu)) {
-		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-		idle = ktime_add(ts->idle_sleeptime, delta);
-	} else {
-		idle = ts->idle_sleeptime;
-	}
+	do {
+		seq = read_seqcount_begin(&ts->sleeptime_seq);
+		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+			idle = ktime_add(ts->idle_sleeptime, delta);
+		} else {
+			idle = ts->idle_sleeptime;
+		}
+	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
 
 	return ktime_to_us(idle);
 
@@ -491,6 +500,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	ktime_t now, iowait;
+	unsigned int seq;
 
 	if (!tick_nohz_enabled)
 		return -1;
@@ -499,12 +509,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
-	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-		iowait = ktime_add(ts->iowait_sleeptime, delta);
-	} else {
-		iowait = ts->iowait_sleeptime;
-	}
+	do {
+		seq = read_seqcount_begin(&ts->sleeptime_seq);
+		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+			iowait = ktime_add(ts->iowait_sleeptime, delta);
+		} else {
+			iowait = ts->iowait_sleeptime;
+		}
+	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
 
 	return ktime_to_us(iowait);
 }
-- 
1.7.5.4



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 15:42 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Frederic Weisbecker
@ 2013-08-16 16:02   ` Oleg Nesterov
  2013-08-16 16:20     ` Frederic Weisbecker
  2013-08-16 16:32     ` Frederic Weisbecker
  2013-08-18 16:54   ` Oleg Nesterov
  1 sibling, 2 replies; 68+ messages in thread
From: Oleg Nesterov @ 2013-08-16 16:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

Thanks Frederic!

I'll try to read this series carefully later. Not that I think
I can help, you certainly understand this much better.

Just one question below,

On 08/16, Frederic Weisbecker wrote:
>
> @@ -499,12 +509,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
>  	if (last_update_time)
>  		*last_update_time = ktime_to_us(now);
>  
> -	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> -		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> -		iowait = ktime_add(ts->iowait_sleeptime, delta);
> -	} else {
> -		iowait = ts->iowait_sleeptime;
> -	}
> +	do {
> +		seq = read_seqcount_begin(&ts->sleeptime_seq);
> +		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> +			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> +			iowait = ktime_add(ts->iowait_sleeptime, delta);
> +		} else {
> +			iowait = ts->iowait_sleeptime;
> +		}
> +	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));

Unless I misread this patch, this is still racy a bit.

Suppose it is called on CPU_0 and cpu == 1. Suppose that
ts->idle_active == T and nr_iowait_cpu(cpu) == 1.

So we return iowait_sleeptime + delta.

Suppose that we call get_cpu_iowait_time_us() again. By this time
the task which incremented ->nr_iowait can be woken up on another
CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
we return iowait_sleeptime, and this is not monotonic again.

No?

Oleg.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:02   ` Oleg Nesterov
@ 2013-08-16 16:20     ` Frederic Weisbecker
  2013-08-16 16:26       ` Oleg Nesterov
  2013-08-16 16:32     ` Frederic Weisbecker
  1 sibling, 1 reply; 68+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 16:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
> Thanks Frederic!
> 
> I'll try to read this series carefully later. Not that I think
> I can help, you certainly understand this much better.
> 
> Just one question below,
> 
> On 08/16, Frederic Weisbecker wrote:
> >
> > @@ -499,12 +509,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
> >  	if (last_update_time)
> >  		*last_update_time = ktime_to_us(now);
> >  
> > -	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> > -		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> > -		iowait = ktime_add(ts->iowait_sleeptime, delta);
> > -	} else {
> > -		iowait = ts->iowait_sleeptime;
> > -	}
> > +	do {
> > +		seq = read_seqcount_begin(&ts->sleeptime_seq);
> > +		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> > +			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> > +			iowait = ktime_add(ts->iowait_sleeptime, delta);
> > +		} else {
> > +			iowait = ts->iowait_sleeptime;
> > +		}
> > +	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
> 
> Unless I misread this patch, this is still racy a bit.
> 
> Suppose it is called on CPU_0 and cpu == 1. Suppose that
> ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
> 
> So we return iowait_sleeptime + delta.
> 
> Suppose that we call get_cpu_iowait_time_us() again. By this time
> the task which incremented ->nr_iowait can be woken up on another
> CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
> we return iowait_sleeptime, and this is not monotonic again.

Hmm, by the time it decrements nr_iowait, it has returned from schedule() and
so idle has flushed the pending iowait sleeptime.

Maybe you have some scenario in mind that I'm missing?

> 
> No?
> 
> Oleg.
> 


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:20     ` Frederic Weisbecker
@ 2013-08-16 16:26       ` Oleg Nesterov
  2013-08-16 16:46         ` Frederic Weisbecker
  0 siblings, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2013-08-16 16:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On 08/16, Frederic Weisbecker wrote:
>
> On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
> > > +	do {
> > > +		seq = read_seqcount_begin(&ts->sleeptime_seq);
> > > +		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> > > +			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> > > +			iowait = ktime_add(ts->iowait_sleeptime, delta);
> > > +		} else {
> > > +			iowait = ts->iowait_sleeptime;
> > > +		}
> > > +	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
> >
> > Unless I misread this patch, this is still racy a bit.
> >
> > Suppose it is called on CPU_0 and cpu == 1. Suppose that
> > ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
> >
> > So we return iowait_sleeptime + delta.
> >
> > Suppose that we call get_cpu_iowait_time_us() again. By this time
> > the task which incremented ->nr_iowait can be woken up on another
> > CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
> > we return iowait_sleeptime, and this is not monotonic again.
>
> Hmm, by the time it decrements nr_iowait, it has returned from schedule() and
> so idle has flushed the pending iowait sleeptime.

Suppose a task does io_schedule() on CPU_0, and increments the counter.
This CPU becomes idle and nr_iowait_cpu(0) == 1.

Then this task is woken up, but try_to_wake_up() selects another CPU != 0.

It returns from schedule() and decrements the same counter, it doesn't
do raw_rq/etc again. nr_iowait_cpu(0) becomes 0.

In fact the task can even migrate to another CPU right after raw_rq().
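
Put as a timeline, the interleaving just described looks roughly like this
(a sketch; rq0 is CPU 0's runqueue, captured via raw_rq() before sleeping):

             CPU 0 (task in io_schedule())        CPU 1

             rq0 = raw_rq();
             atomic_inc(&rq0->nr_iowait);
             schedule();
             /* CPU 0 idle, nr_iowait_cpu(0) == 1 */
                                                  /* try_to_wake_up() picks
                                                     CPU 1, task resumes: */
                                                  atomic_dec(&rq0->nr_iowait);
             /* nr_iowait_cpu(0) == 0 while
                CPU 0 is still idle */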

> Maybe you have some scenario in mind that I'm missing?

Or me...

Oleg.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:02   ` Oleg Nesterov
  2013-08-16 16:20     ` Frederic Weisbecker
@ 2013-08-16 16:32     ` Frederic Weisbecker
  2013-08-16 16:33       ` Oleg Nesterov
  2013-08-16 16:37       ` Frederic Weisbecker
  1 sibling, 2 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 16:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
> Thanks Frederic!
> 
> I'll try to read this series carefully later. Not that I think
> I can help, you certainly understand this much better.
> 
> Just one question below,
> 
> On 08/16, Frederic Weisbecker wrote:
> >
> > @@ -499,12 +509,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
> >  	if (last_update_time)
> >  		*last_update_time = ktime_to_us(now);
> >  
> > -	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> > -		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> > -		iowait = ktime_add(ts->iowait_sleeptime, delta);
> > -	} else {
> > -		iowait = ts->iowait_sleeptime;
> > -	}
> > +	do {
> > +		seq = read_seqcount_begin(&ts->sleeptime_seq);
> > +		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> > +			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> > +			iowait = ktime_add(ts->iowait_sleeptime, delta);
> > +		} else {
> > +			iowait = ts->iowait_sleeptime;
> > +		}
> > +	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
> 
> Unless I misread this patch, this is still racy a bit.
> 
> Suppose it is called on CPU_0 and cpu == 1. Suppose that
> ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
> 
> So we return iowait_sleeptime + delta.
> 
> Suppose that we call get_cpu_iowait_time_us() again. By this time
> the task which incremented ->nr_iowait can be woken up on another
> CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
> we return iowait_sleeptime, and this is not monotonic again.
> 
> No?

OTOH, io_schedule() does:

      atomic_inc(&rq->nr_iowait);
      schedule();
      atomic_dec(&rq->nr_iowait);

How do we handle that when the task is migrated after it goes to sleep? I don't
see nr_iowait being decreased on the src CPU and increased on the dest CPU.

Nor do I see that iowait tasks can't be migrated.


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:32     ` Frederic Weisbecker
@ 2013-08-16 16:33       ` Oleg Nesterov
  2013-08-16 16:49         ` Frederic Weisbecker
  2013-08-16 16:37       ` Frederic Weisbecker
  1 sibling, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2013-08-16 16:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On 08/16, Frederic Weisbecker wrote:
>
> On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
> >
> > Unless I misread this patch, this is still racy a bit.
> >
> > Suppose it is called on CPU_0 and cpu == 1. Suppose that
> > ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
> >
> > So we return iowait_sleeptime + delta.
> >
> > Suppose that we call get_cpu_iowait_time_us() again. By this time
> > the task which incremented ->nr_iowait can be woken up on another
> > CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
> > we return iowait_sleeptime, and this is not monotonic again.
> >
> > No?
>
> OTOH, io_schedule() does:
>
>       atomic_inc(&rq->nr_iowait);
>       schedule();
>       atomic_dec(&rq->nr_iowait);
>
> How do we handle that when the task is migrated after it goes to sleep?

or even before it goes to sleep. This is what I meant.

> Nor do I see that iowait tasks can't be migrated.

But probably this is fine? This is just the non-precise accounting.

But otoh, I agree. The whole idea about per-cpu nr_iowait looks a
bit strange.

Oleg.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:32     ` Frederic Weisbecker
  2013-08-16 16:33       ` Oleg Nesterov
@ 2013-08-16 16:37       ` Frederic Weisbecker
  1 sibling, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 16:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

2013/8/16 Frederic Weisbecker <fweisbec@gmail.com>:
> On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
>> Thanks Frederic!
>>
>> I'll try to read this series carefully later. Not that I think
>> I can help, you certainly understand this much better.
>>
>> Just one question below,
>>
>> On 08/16, Frederic Weisbecker wrote:
>> >
>> > @@ -499,12 +509,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
>> >     if (last_update_time)
>> >             *last_update_time = ktime_to_us(now);
>> >
>> > -   if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
>> > -           ktime_t delta = ktime_sub(now, ts->idle_entrytime);
>> > -           iowait = ktime_add(ts->iowait_sleeptime, delta);
>> > -   } else {
>> > -           iowait = ts->iowait_sleeptime;
>> > -   }
>> > +   do {
>> > +           seq = read_seqcount_begin(&ts->sleeptime_seq);
>> > +           if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
>> > +                   ktime_t delta = ktime_sub(now, ts->idle_entrytime);
>> > +                   iowait = ktime_add(ts->iowait_sleeptime, delta);
>> > +           } else {
>> > +                   iowait = ts->iowait_sleeptime;
>> > +           }
>> > +   } while (read_seqcount_retry(&ts->sleeptime_seq, seq));
>>
>> Unless I misread this patch, this is still racy a bit.
>>
>> Suppose it is called on CPU_0 and cpu == 1. Suppose that
>> ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
>>
>> So we return iowait_sleeptime + delta.
>>
>> Suppose that we call get_cpu_iowait_time_us() again. By this time
>> the task which incremented ->nr_iowait can be woken up on another
>> CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
>> we return iowait_sleeptime, and this is not monotonic again.
>>
>> No?
>
> OTOH, io_schedule() does:
>
>       atomic_inc(&rq->nr_iowait);
>       schedule();
>       atomic_dec(&rq->nr_iowait);
>
> How do we handle that when the task is migrated after it goes to sleep? I don't
> see nr_iowait being decreased on the src CPU and increased on the dest CPU.
>
> Nor do I see that iowait tasks can't be migrated.

My bad, the decrement happens on the src CPU anyway.


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:26       ` Oleg Nesterov
@ 2013-08-16 16:46         ` Frederic Weisbecker
  2013-08-16 16:49           ` Oleg Nesterov
                             ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 16:46 UTC (permalink / raw)
  To: Oleg Nesterov, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Fri, Aug 16, 2013 at 06:26:54PM +0200, Oleg Nesterov wrote:
> On 08/16, Frederic Weisbecker wrote:
> >
> > On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
> > > > +	do {
> > > > +		seq = read_seqcount_begin(&ts->sleeptime_seq);
> > > > +		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> > > > +			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> > > > +			iowait = ktime_add(ts->iowait_sleeptime, delta);
> > > > +		} else {
> > > > +			iowait = ts->iowait_sleeptime;
> > > > +		}
> > > > +	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
> > >
> > > Unless I misread this patch, this is still racy a bit.
> > >
> > > Suppose it is called on CPU_0 and cpu == 1. Suppose that
> > > ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
> > >
> > > So we return iowait_sleeptime + delta.
> > >
> > > Suppose that we call get_cpu_iowait_time_us() again. By this time
> > > the task which incremented ->nr_iowait can be woken up on another
> > > CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
> > > we return iowait_sleeptime, and this is not monotonic again.
> >
> > Hmm, by the time it decrements nr_iowait, it has returned from schedule() and
> > so idle has flushed the pending iowait sleeptime.
> 
> Suppose a task does io_schedule() on CPU_0, and increments the counter.
> This CPU becomes idle and nr_iowait_cpu(0) == 1.
> 
> Then this task is woken up, but try_to_wake_up() selects another CPU != 0.
> 
> It returns from schedule() and decrements the same counter, it doesn't
> do raw_rq/etc again. nr_iowait_cpu(0) becomes 0.
> 
> In fact the task can even migrate to another CPU right after raw_rq().

Ah I see now. So that is indeed yet another race.

Should we flush that iowait to the src CPU? But then it means we must handle
concurrent updates to iowait_sleeptime, idle_sleeptime from the migration
code and from idle enter / exit.

So I fear we need a seqlock.

Or we can live with that and still account the whole idle time slept until
tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
All we need is just a new field in ts-> that records on which state we entered
idle.

What do you think?

Ingo, Thomas?


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:46         ` Frederic Weisbecker
@ 2013-08-16 16:49           ` Oleg Nesterov
  2013-08-16 17:12             ` Frederic Weisbecker
  2013-08-19 11:10           ` Peter Zijlstra
  2013-08-20  6:21           ` Fernando Luis Vázquez Cao
  2 siblings, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2013-08-16 16:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Thomas Gleixner, LKML, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On 08/16, Frederic Weisbecker wrote:
>
> tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
> All we need is just a new field in ts-> that records on which state we entered
> idle.

Or we can turn ->idle_active into an enum. And all other nr_iowait_cpu's
in this code should go away.

Personally I am fine either way.

Oleg.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:33       ` Oleg Nesterov
@ 2013-08-16 16:49         ` Frederic Weisbecker
  0 siblings, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 16:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Fri, Aug 16, 2013 at 06:33:55PM +0200, Oleg Nesterov wrote:
> On 08/16, Frederic Weisbecker wrote:
> >
> > On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
> > >
> > > Unless I misread this patch, this is still racy a bit.
> > >
> > > Suppose it is called on CPU_0 and cpu == 1. Suppose that
> > > ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
> > >
> > > So we return iowait_sleeptime + delta.
> > >
> > > Suppose that we call get_cpu_iowait_time_us() again. By this time
> > > the task which incremented ->nr_iowait can be woken up on another
> > > CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
> > > we return iowait_sleeptime, and this is not monotonic again.
> > >
> > > No?
> >
> > OTOH, io_schedule() does:
> >
> >       atomic_inc(&rq->nr_iowait);
> >       schedule();
> >       atomic_dec(&rq->nr_iowait);
> >
> > How do we handle that when the task is migrated after it goes to sleep?
> 
> or even before it goes to sleep. This is what I meant.
> 
> > Nor do I see that iowait tasks can't be migrated.
> 
> But probably this is fine? This is just the non-precise accounting.
> 
> But otoh, I agree. The whole idea about per-cpu nr_iowait looks a
> bit strange.


My bad, I thought it was doing:

   inc(this_rq->nr_iowait)
   schedule()
   dec(this_rq->nr_iowait)

But it actually uses the src CPU all along.


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:49           ` Oleg Nesterov
@ 2013-08-16 17:12             ` Frederic Weisbecker
  2013-08-18 16:36               ` Oleg Nesterov
  2013-08-19 10:58               ` Peter Zijlstra
  0 siblings, 2 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 17:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Thomas Gleixner, LKML, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Fri, Aug 16, 2013 at 06:49:22PM +0200, Oleg Nesterov wrote:
> On 08/16, Frederic Weisbecker wrote:
> >
> > tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
> > All we need is just a new field in ts-> that records on which state we entered
> > idle.
> 
> Or we can turn ->idle_active into an enum. And all other nr_iowait_cpu's
> in this code should go away.
> 
> Personally I am fine either way.

Me too.

So my proposition is that we can keep the existing patches as they fix other distinct races
(and we add fixes on what peterz just reported) and send them to Ingo. Ah and I'll wait for
your review first.

Then if all goes well on the pull request we describe the nr_iowait race to him and we let him
choose what to do with that nr_iowait migration race: either we ignore the
migration and always account to what we saw on idle start, or we flush that time accounting
on iowait migration, but that requires seqlocks on the idle path.

Or maybe Peter could tell us as well. Peter, do you have a preference?

Thanks.

> 
> Oleg.
> 


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 17:12             ` Frederic Weisbecker
@ 2013-08-18 16:36               ` Oleg Nesterov
  2013-08-18 21:25                 ` Frederic Weisbecker
  2013-08-19 10:58               ` Peter Zijlstra
  1 sibling, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2013-08-18 16:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Thomas Gleixner, LKML, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On 08/16, Frederic Weisbecker wrote:
>
> On Fri, Aug 16, 2013 at 06:49:22PM +0200, Oleg Nesterov wrote:
> >
> > Personally I am fine either way.
>
> Me too.
>
> So my proposition is that we can keep the existing patches as they fix other distinct races

perhaps... but it would be nice to fix the "goes backward" problem.

This "race" is not theoretical, at least for get_cpu_iowait_time_us().
nr_iowait_cpu() can change from !0 to 0 very easily.

And just in case, note that get_cpu_idle_time_us() has the same problem
too, because it can also change from 0 to !0 although this case is much
less likely.

However, right now I do not see a simple solution. It seems that, if
get_cpu_idle_time_us() does ktime_add(ts->idle_sleeptime) it should
actually update ts->idle_sleeptime/entrytime to avoid these races
(the same for ->iowait_sleeptime), but then we need the locking.

> (and we add fixes on what peterz just reported)

Do you mean his comments about 4/4, or did I miss something else?

> Ah and I'll wait for
> your review first.

If only I could understand this code ;) It looks really simple and
I hope I can understand what it does. But not why. I simply can't
understand why idle/iowait are "mutually exclusive".

And if we do this,  then perhaps io_schedule() should do
"if (atomic_dec(&rq->nr_iowait)) update_iowait_sleeptime()" but
this means the locking again.

> Then if all goes well on the pull request we describe the nr_iowait race to him and we let him
> choose what to do with that nr_iowait migration race:

OK, agreed.

> Or maybe Peter could tell us as well. Peter, do you have a preference?

Yes, it would be nice to know what Peter thinks ;)

Oleg.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 15:42 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Frederic Weisbecker
  2013-08-16 16:02   ` Oleg Nesterov
@ 2013-08-18 16:54   ` Oleg Nesterov
  2013-08-18 21:40     ` Frederic Weisbecker
  1 sibling, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2013-08-18 16:54 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On 08/16, Frederic Weisbecker wrote:
>
> This fixes a reported bug where non-monotonic sleeptime stats are returned by /proc/stat
> when it is frequently read.

Plus it fixes the problems with 32bit machines reading u64 values.

But perhaps the changelog should mention other races we discussed.

Oleg.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-18 16:36               ` Oleg Nesterov
@ 2013-08-18 21:25                 ` Frederic Weisbecker
  0 siblings, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-08-18 21:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Thomas Gleixner, LKML, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Sun, Aug 18, 2013 at 06:36:39PM +0200, Oleg Nesterov wrote:
> On 08/16, Frederic Weisbecker wrote:
> >
> > On Fri, Aug 16, 2013 at 06:49:22PM +0200, Oleg Nesterov wrote:
> > >
> > > Personally I am fine either way.
> >
> > Me too.
> >
> > So my proposition is that we can keep the existing patches as they fix other distinct races
> 
> perhaps... but it would be nice to fix the "goes backward" problem.
> 
> This "race" is not theoretical, at least for get_cpu_iowait_time_us().
> nr_iowait_cpu() can change from !0 to 0 very easily.

Actually, yes. Now I think we should fix the iowait race in the same changeset, because
the bugs I'm fixing in this patchset were probably lowering the iowait issue in some
cases.

> 
> And just in case, note that get_cpu_idle_time_us() has the same problem
> too, because it can also change from 0 to !0 although this case is much
> less likely.

Right.

> 
> However, right now I do not see a simple solution. It seems that, if
> get_cpu_idle_time_us() does ktime_add(ts->idle_sleeptime) it should
> actually update ts->idle_sleeptime/entrytime to avoid these races
> (the same for ->iowait_sleeptime), but then we need the locking.

It seems that wouldn't solve the issue. Imagine that task A waits for
IO on CPU 0. It waits 10 seconds. Then it's finally woken on CPU 1.
A further call to get_cpu_idle_time_us() on CPU 0, say 5 seconds later,
will miss the io sleeptime part.

For now I can't figure out how to avoid flushing the io sleeptime when
the iowait task is moved to another CPU. This requires a lock (moving from
seqcount to seqlock) and may involve quite some cache issues in the idle
path, resulting in higher latencies on wakeup. We really need another solution.

> 
> > (and we add fixes on what peterz just reported)
> 
> Do you mean his comments about 4/4 or I missed something else?

Yep, just removing the cpu arguments from the functions that only
use ts locally.

> > Ah and I'll wait for
> > your review first.
> 
> If only I could understand this code ;) It looks really simple and
> I hope I can understand what it does. But not why. I simply can't
> understand why idle/iowait are "mutually exclusive".

I believe this can be changed such that iowait is included in the idle
sleeptime. We can do that if that's needed. 

> 
> And if we do this,  then perhaps io_schedule() should do
> "if (atomic_dec(&rq->nr_iowait)) update_iowait_sleeptime()" but
> this means the locking again.

Yep.


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-18 16:54   ` Oleg Nesterov
@ 2013-08-18 21:40     ` Frederic Weisbecker
  0 siblings, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-08-18 21:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Sun, Aug 18, 2013 at 06:54:00PM +0200, Oleg Nesterov wrote:
> On 08/16, Frederic Weisbecker wrote:
> >
> > This fixes a reported bug where non-monotonic sleeptime stats are returned by /proc/stat
> > when it is frequently read.
> 
> Plus it fixes the problems with 32bit machines reading u64 values.

Ah indeed.

> 
> But perhaps the changelog should mention other races we discussed.

Agreed, I'll detail that.

thanks.

> 
> Oleg.
> 


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 17:12             ` Frederic Weisbecker
  2013-08-18 16:36               ` Oleg Nesterov
@ 2013-08-19 10:58               ` Peter Zijlstra
  2013-08-19 15:44                 ` Arjan van de Ven
  2013-08-19 15:47                 ` Arjan van de Ven
  1 sibling, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2013-08-19 10:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Ingo Molnar, Thomas Gleixner, LKML,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Andrew Morton,
	Arjan van de Ven

On Fri, Aug 16, 2013 at 07:12:09PM +0200, Frederic Weisbecker wrote:
> Or maybe Peter could tell us as well. Peter, do you have a preference?

Still trying to wrap my head around it, but conceptually
get_cpu_iowait_time_us() doesn't make any kind of sense. iowait isn't
per cpu since effectively tasks that aren't running aren't assigned a
cpu (as Oleg already pointed out).

The fact that cpufreq 'needs' this just means that cpufreq is broken --
but I think I've said as much previously; cpufreq needs to stop living
in the partitioned-mp era and get dragged (kicking and screaming) into
the smp era.

I'm also not entirely clear on the 'desired' semantics here. Do we count
iowait time as idle or not?


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:46         ` Frederic Weisbecker
  2013-08-16 16:49           ` Oleg Nesterov
@ 2013-08-19 11:10           ` Peter Zijlstra
  2013-08-19 11:15             ` Peter Zijlstra
  2013-08-20  6:59             ` Fernando Luis Vázquez Cao
  2013-08-20  6:21           ` Fernando Luis Vázquez Cao
  2 siblings, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2013-08-19 11:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Ingo Molnar, Thomas Gleixner, LKML,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Andrew Morton,
	Arjan van de Ven

On Fri, Aug 16, 2013 at 06:46:28PM +0200, Frederic Weisbecker wrote:

Option A:

> Should we flush that iowait to the src CPU? But then it means we must handle
> concurrent updates to iowait_sleeptime, idle_sleeptime from the migration
> code and from idle enter / exit.
> 
> So I fear we need a seqlock.

Option B:

> Or we can live with that and still account the whole idle time slept until
> tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
> All we need is just a new field in ts-> that records on which state we entered
> idle.
> 
> What do you think?

I think option B is unworkable. Afaict it could basically cause
unlimited iowait time. Suppose we have a load-balancer that tries its
bestestest to sort-left (ie. run a task on the lowest 'free' cpu
possible) -- the power aware folks are pondering such schemes.

Now suppose we have a small burst of activity and the rightmost cpu gets
to run something that goes to sleep on iowait.

We'd accrue iowait on that cpu until it wakes up, which could be days
from now if the load stays low enough, even though the task got to run
almost instantly on another cpu.

So no, if we need per-cpu iowait time we have to do A.

Since we already have atomics in the io_schedule*() paths, please
replace those with (seq)locks. Also see if you can place the entire
iowait accounting thing in a separate cacheline.

That said, I'm still not sure if iowait time counts as both idle and
iowait or only iowait.
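
For illustration, a rough shape of option A could be the following (a sketch
only; the iowait_stats struct, the iostats field and their placement are
hypothetical, not existing kernel code):

	/* Per-rq iowait accounting kept in its own cacheline, with a
	 * seqlock replacing the two atomics in io_schedule(). */
	struct iowait_stats {
		seqlock_t	lock;
		int		nr_iowait;
		u64		iowait_sleeptime;
	} ____cacheline_aligned;

	/* io_schedule() side: counter and sleeptime updated under the lock */
	write_seqlock(&rq->iostats.lock);
	rq->iostats.nr_iowait++;
	write_sequnlock(&rq->iostats.lock);

	/* reader side: retry until a consistent snapshot is observed */
	do {
		seq = read_seqbegin(&rq->iostats.lock);
		sleeptime = rq->iostats.iowait_sleeptime;
	} while (read_seqretry(&rq->iostats.lock, seq));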


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-19 11:10           ` Peter Zijlstra
@ 2013-08-19 11:15             ` Peter Zijlstra
  2013-08-20  6:59             ` Fernando Luis Vázquez Cao
  1 sibling, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2013-08-19 11:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Ingo Molnar, Thomas Gleixner, LKML,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Andrew Morton,
	Arjan van de Ven

On Mon, Aug 19, 2013 at 01:10:26PM +0200, Peter Zijlstra wrote:
> So no, if we need per-cpu iowait time we have to do A.
> 
> Since we already have atomics in the io_schedule*() paths, please
> replace those with (seq)locks. Also see if you can place the entire
> iowait accounting thing in a separate cacheline.

Also, it completely blows to have these extra atomics in the IO paths,
I'm sure that the Mega-IOP/s people would be much pleased if we could
remove these.




* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-19 10:58               ` Peter Zijlstra
@ 2013-08-19 15:44                 ` Arjan van de Ven
  2013-08-19 15:47                 ` Arjan van de Ven
  1 sibling, 0 replies; 68+ messages in thread
From: Arjan van de Ven @ 2013-08-19 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Oleg Nesterov, Ingo Molnar, Thomas Gleixner,
	LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Andrew Morton

On 8/19/2013 3:58 AM, Peter Zijlstra wrote:
> I'm also not entirely clear on the 'desired' semantics here. Do we count
> iowait time as idle or not?

cpufreq only looks at this because it does not want to count this as idle...
other than that it could not care less.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-19 10:58               ` Peter Zijlstra
  2013-08-19 15:44                 ` Arjan van de Ven
@ 2013-08-19 15:47                 ` Arjan van de Ven
  1 sibling, 0 replies; 68+ messages in thread
From: Arjan van de Ven @ 2013-08-19 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Oleg Nesterov, Ingo Molnar, Thomas Gleixner,
	LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Andrew Morton

On 8/19/2013 3:58 AM, Peter Zijlstra wrote:
> On Fri, Aug 16, 2013 at 07:12:09PM +0200, Frederic Weisbecker wrote:
>> Or maybe Peter could tell us as well. Peter, do you have a preference?
>
> Still trying to wrap my head around it, but conceptually
> get_cpu_iowait_time_us() doesn't make any kind of sense. iowait isn't
> per cpu since effectively tasks that aren't running aren't assigned a
> cpu (as Oleg already pointed out).
>
> The fact that cpufreq 'needs' this just means that cpufreq is broken --
> but I think I've said as much previously; cpufreq needs to stop living
> in the partitioned-mp era and get dragged (kicking and screaming) into
> the smp era.
>
> I'm also not entirely clear on the 'desired' semantics here. Do we count
> iowait time as idle or not?

fwiw only some Intel cpu's use this, and those by and large no longer use cpufreq
so this can just go away as far as I am concerned.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:46         ` Frederic Weisbecker
  2013-08-16 16:49           ` Oleg Nesterov
  2013-08-19 11:10           ` Peter Zijlstra
@ 2013-08-20  6:21           ` Fernando Luis Vázquez Cao
  2013-08-20 21:55             ` Frederic Weisbecker
  2 siblings, 1 reply; 68+ messages in thread
From: Fernando Luis Vázquez Cao @ 2013-08-20  6:21 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven

(2013-08-17 01:46), Frederic Weisbecker wrote:
> On Fri, Aug 16, 2013 at 06:26:54PM +0200, Oleg Nesterov wrote:
>> On 08/16, Frederic Weisbecker wrote:
>>> On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
>>>>> +	do {
>>>>> +		seq = read_seqcount_begin(&ts->sleeptime_seq);
>>>>> +		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
>>>>> +			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
>>>>> +			iowait = ktime_add(ts->iowait_sleeptime, delta);
>>>>> +		} else {
>>>>> +			iowait = ts->iowait_sleeptime;
>>>>> +		}
>>>>> +	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
>>>> Unless I misread this patch, this is still racy a bit.
>>>>
>>>> Suppose it is called on CPU_0 and cpu == 1. Suppose that
>>>> ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
>>>>
>>>> So we return iowait_sleeptime + delta.
>>>>
>>>> Suppose that we call get_cpu_iowait_time_us() again. By this time
>>>> the task which incremented ->nr_iowait can be woken up on another
>>>> CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
>>>> we return iowait_sleeptime, and this is not monotonic again.
>>> Hmm, by the time it decrements nr_iowait, it has returned from schedule() and
>>> so idle has flushed the pending iowait sleeptime.
>> Suppose a task does io_schedule() on CPU_0, and increments the counter.
>> This CPU becomes idle and nr_iowait_cpu(0) == 1.
>>
>> Then this task is woken up, but try_to_wake_up() selects another CPU != 0.
>>
>> It returns from schedule() and decrements the same counter, it doesn't
>> do raw_rq/etc again. nr_iowait_cpu(0) becomes 0.
>>
>> In fact the task can even migrate to another CPU right after raw_rq().
> Ah I see now. So that indeed yet another race.

I am sorry for chiming in late.

That is precisely the race I described here:

https://lkml.org/lkml/2013/7/2/3

I should have been more concise in my explanation. I apologize.

> Should we flush that iowait to the src CPU? But then it means we must handle
> concurrent updates to iowait_sleeptime, idle_sleeptime from the migration
> code and from idle enter / exit.
>
> So I fear we need a seqlock.
>
> Or we can live with that and still account the whole idle time slept until
> tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
> All we need is just a new field in ts-> that records on which state we entered
> idle.

Another approach could be to shadow ->iowait_sleeptime as
suggested here: https://lkml.org/lkml/2013/7/2/165


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-19 11:10           ` Peter Zijlstra
  2013-08-19 11:15             ` Peter Zijlstra
@ 2013-08-20  6:59             ` Fernando Luis Vázquez Cao
  2013-08-20  8:44               ` Peter Zijlstra
  1 sibling, 1 reply; 68+ messages in thread
From: Fernando Luis Vázquez Cao @ 2013-08-20  6:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Oleg Nesterov, Ingo Molnar, Thomas Gleixner,
	LKML, Tetsuo Handa, Andrew Morton, Arjan van de Ven

(2013-08-19 20:10), Peter Zijlstra wrote:
> On Fri, Aug 16, 2013 at 06:46:28PM +0200, Frederic Weisbecker wrote:
>
> Option A:
>
>> Should we flush that iowait to the src CPU? But then it means we must handle
>> concurrent updates to iowait_sleeptime, idle_sleeptime from the migration
>> code and from idle enter / exit.
>>
>> So I fear we need a seqlock.
> Option B:
>
>> Or we can live with that and still account the whole idle time slept until
>> tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
>> All we need is just a new field in ts-> that records on which state we entered
>> idle.
>>
>> What do you think?
> I think option B is unworkable. Afaict it could basically cause
> unlimited iowait time. Suppose we have a load-balancer that tries its
> bestestest to sort-left (ie. run a task on the lowest 'free' cpu
> possible) -- the power aware folks are pondering such schemes.
>
> Now suppose we have a small burst of activity and the rightmost cpu gets
> to run something that goes to sleep on iowait.
>
> We'd accrue iowait on that cpu until it wakes up, which could be days
> from now if the load stays low enough, even though the task got to run
> almost instantly on another cpu.
>
> So no, if we need per-cpu iowait time we have to do A.
>
> Since we already have atomics in the io_schedule*() paths, please
> replace those with (seq)locks. Also see if you can place the entire
> iowait accounting thing in a separate cacheline.

I considered option A for a while but, fearing it would be
considered overkill, took a different approach: create a
shadow copy of ->iowait_sleeptime that is always kept
monotonic (artificially in some cases) and use that to
compute the values exported through /proc.

That said, if deemed acceptable, option A is the one I would
choose.
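
The shadow-copy idea amounts to clamping the exported value so that it never
goes backward, roughly like this (a sketch; the iowait_shadow field and the
compute_iowait_sleeptime() helper are hypothetical):

	/* Export max(last exported value, freshly computed value) */
	cur = compute_iowait_sleeptime(ts);	/* may go backward, as discussed */
	if (ktime_compare(cur, ts->iowait_shadow) > 0)
		ts->iowait_shadow = cur;
	return ktime_to_us(ts->iowait_shadow);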


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20  6:59             ` Fernando Luis Vázquez Cao
@ 2013-08-20  8:44               ` Peter Zijlstra
  2013-08-20 15:29                 ` Frederic Weisbecker
  2013-08-20 15:31                 ` Arjan van de Ven
  0 siblings, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2013-08-20  8:44 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Frederic Weisbecker, Oleg Nesterov, Ingo Molnar, Thomas Gleixner,
	LKML, Tetsuo Handa, Andrew Morton, Arjan van de Ven

On Tue, Aug 20, 2013 at 03:59:36PM +0900, Fernando Luis Vázquez Cao wrote:
> That said, if deemed acceptable, option A is the one I would
> choose.

Right, so I think we can do A without much extra cost mostly because we
already have 2 atomics in the io_schedule() path. If we replace those
two atomic operations with locks and modify nr_iowait and the other
stats under the same lock, and ensure all those variables (including the
lock) live in the same cacheline we should have the same cost we have
now.

Of course, if we can get away with completely removing all of that
(which I think Arjan suggested was a real possibility) then that would
be ever so much better still :-)


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20  8:44               ` Peter Zijlstra
@ 2013-08-20 15:29                 ` Frederic Weisbecker
  2013-08-20 15:33                   ` Arjan van de Ven
  2013-08-20 15:31                 ` Arjan van de Ven
  1 sibling, 1 reply; 68+ messages in thread
From: Frederic Weisbecker @ 2013-08-20 15:29 UTC (permalink / raw)
  To: Peter Zijlstra, Arjan van de Ven, Rafael J. Wysocki, Viresh Kumar
  Cc: Fernando Luis Vázquez Cao, Oleg Nesterov, Ingo Molnar,
	Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Aug 20, 2013 at 10:44:05AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 20, 2013 at 03:59:36PM +0900, Fernando Luis Vázquez Cao wrote:
> > That said, if deemed acceptable, option A is the one I would
> > choose.
> 
> Right, so I think we can do A without much extra cost mostly because we
> already have 2 atomics in the io_schedule() path. If we replace those
> two atomic operations with locks and modify nr_iowait and the other
> stats under the same lock, and ensure all those variables (including the
> lock) live in the same cacheline we should have the same cost we have
> now.

I can try that :-)

> 
> Of course, if we can get away with completely removing all of that
> (which I think Arjan suggested was a real possibility) then that would
> be ever so much better still :-)

Would be lovely. But I don't know much about cpufreq; I hope somebody who's
familiar with that code can handle this. Then once there are no more users
of get_cpu_iowait_sleep_time() I can simply zap and clean the tick/time related
code.

Surely the overhead that this mess brings to io_schedule() (and it's going
to be worse with a seqlock, whether in the same cacheline as nr_iowait or
not) may be a good motivation to poke at that cpufreq code part.

Thanks.


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20  8:44               ` Peter Zijlstra
  2013-08-20 15:29                 ` Frederic Weisbecker
@ 2013-08-20 15:31                 ` Arjan van de Ven
  2013-08-20 16:01                   ` Peter Zijlstra
  1 sibling, 1 reply; 68+ messages in thread
From: Arjan van de Ven @ 2013-08-20 15:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Fernando Luis Vázquez Cao, Frederic Weisbecker,
	Oleg Nesterov, Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa,
	Andrew Morton

On 8/20/2013 1:44 AM, Peter Zijlstra wrote:
> On Tue, Aug 20, 2013 at 03:59:36PM +0900, Fernando Luis Vázquez Cao wrote:
>> That said, if deemed acceptable, option A is the one I would
>> choose.
>
> Right, so I think we can do A without much extra cost mostly because we
> already have 2 atomics in the io_schedule() path. If we replace those
> two atomic operations with locks and modify nr_iowait and the other
> stats under the same lock, and ensure all those variables (including the
> lock) live in the same cacheline we should have the same cost we have
> now.
>
> Of course, if we can get away with completely removing all of that
> (which I think Arjan suggested was a real possibility) then that would
> be ever so much better still :-)

I'm quite ok with removing that.

However, note that "top" also reports per-cpu iowait...
and that's a userspace expectation

>



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 15:29                 ` Frederic Weisbecker
@ 2013-08-20 15:33                   ` Arjan van de Ven
  2013-08-20 15:35                     ` Frederic Weisbecker
  0 siblings, 1 reply; 68+ messages in thread
From: Arjan van de Ven @ 2013-08-20 15:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Fernando Luis Vázquez Cao, Oleg Nesterov, Ingo Molnar,
	Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On 8/20/2013 8:29 AM, Frederic Weisbecker wrote:

>>
>> Of course, if we can get away with completely removing all of that
>> (which I think Arjan suggested was a real possibility) then that would
>> be ever so much better still :-)
>
> Would be lovely. But I don't know much about cpufreq, I hope somebody who's
> familiar with that code can handle this. Then once there are no more users
> of get_cpu_iowait_sleep_time() I can simply zap and clean the tick/time related
> code.

it's just doing the "idle = 100 - busy% - iowait%" calculation.
(with the latter part only for Intel cpus iirc)

in a perfect world the scheduler would be doing that calculation in the first place ;-)

removing the latter part will impact performance some on specific workloads,
but most Intel cpus that this applies to should not be using cpufreq anymore
anyway.
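
Going from memory, the governor side boils down to something like this
(simplified, not the actual ondemand code):

	u64 wall, idle, iowait;
	unsigned int load;

	idle = get_cpu_idle_time_us(cpu, &wall);
	iowait = get_cpu_iowait_time_us(cpu, NULL);

	if (io_is_busy)			/* the Intel-only part */
		idle -= iowait;		/* treat iowait as busy time */

	load = div64_u64(100 * (wall - idle), wall);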




* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 15:33                   ` Arjan van de Ven
@ 2013-08-20 15:35                     ` Frederic Weisbecker
  2013-08-20 15:41                       ` Arjan van de Ven
  0 siblings, 1 reply; 68+ messages in thread
From: Frederic Weisbecker @ 2013-08-20 15:35 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Fernando Luis Vázquez Cao, Oleg Nesterov, Ingo Molnar,
	Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Aug 20, 2013 at 08:33:50AM -0700, Arjan van de Ven wrote:
> On 8/20/2013 8:29 AM, Frederic Weisbecker wrote:
> 
> >>
> >>Of course, if we can get away with completely removing all of that
> >>(which I think Arjan suggested was a real possibility) then that would
> >>be ever so much better still :-)
> >
> >Would be lovely. But I don't know much about cpufreq, I hope somebody who's
> >familiar with that code can handle this. Then once there are no more users
> >of get_cpu_iowait_sleep_time() I can simply zap and clean the tick/time related
> >code.
> 
> it's just doing the "idle = 100 - busy% - iowait%" calculation.
> (with the latter part only for Intel cpus iirc)
> 
> in a perfect world the scheduler would be doing that calculation in the first place ;-)
> 
> removing the latter part will impact performance some on specific workloads,
> but most Intel cpus that this applies to should not be using cpufreq anymore
> anyway.

Are there other users than intel?


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 15:35                     ` Frederic Weisbecker
@ 2013-08-20 15:41                       ` Arjan van de Ven
  0 siblings, 0 replies; 68+ messages in thread
From: Arjan van de Ven @ 2013-08-20 15:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Fernando Luis Vázquez Cao, Oleg Nesterov, Ingo Molnar,
	Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On 8/20/2013 8:35 AM, Frederic Weisbecker wrote:
> On Tue, Aug 20, 2013 at 08:33:50AM -0700, Arjan van de Ven wrote:
>> On 8/20/2013 8:29 AM, Frederic Weisbecker wrote:
>>
>>>>
>>>> Of course, if we can get away with completely removing all of that
>>>> (which I think Arjan suggested was a real possibility) then that would
>>>> be ever so much better still :-)
>>>
>>> Would be lovely. But I don't know much about cpufreq, I hope somebody who's
>>> familiar with that code can handle this. Then once there are no more users
>>> of get_cpu_iowait_sleep_time() I can simply zap and clean the tick/time related
>>> code.
>>
>> it's just doing the "idle = 100 - busy% - iowait%" calculation.
>> (with the latter part only for Intel cpus iirc)
>>
>> in a perfect world the scheduler would be doing that calculation in the first place ;-)
>>
>> removing the latter part will impact performance some on specific workloads,
>> but most Intel cpus that this applies to should not be using cpufreq anymore
>> anyway.
>
> Are there other users than intel?
>

http://lxr.linux.no/linux+v3.10.7/drivers/cpufreq/cpufreq_ondemand.c#L69

nope


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 15:31                 ` Arjan van de Ven
@ 2013-08-20 16:01                   ` Peter Zijlstra
  2013-08-20 16:33                     ` Oleg Nesterov
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2013-08-20 16:01 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Fernando Luis Vázquez Cao, Frederic Weisbecker,
	Oleg Nesterov, Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa,
	Andrew Morton

On Tue, Aug 20, 2013 at 08:31:53AM -0700, Arjan van de Ven wrote:
> On 8/20/2013 1:44 AM, Peter Zijlstra wrote:

> >Of course, if we can get away with completely removing all of that
> >(which I think Arjan suggested was a real possibility) then that would
> >be ever so much better still :-)
> 
> I'm quite ok with removing that.
> 
> however note that "top" also reports per cpu iowait...
> and that's a userspace expectation

Right, broken as that may be :/ OK, that looks like CPUTIME_IOWAIT which
is tick based and not the ns based accounting.

Still, it needs the per-cpu nr_iowait accounting, which pretty much
requires the atomics, so no big gains there.
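
(For reference, the tick side is roughly this, in account_idle_time();
nr_iowait is only read there, but it does have to be the per-cpu value:)

	u64 *cpustat = kcpustat_this_cpu->cpustat;
	struct rq *rq = this_rq();

	if (atomic_read(&rq->nr_iowait) > 0)
		cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
	else
		cpustat[CPUTIME_IDLE] += (__force u64) cputime;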

Which means that if Frederic can make the ns thing no more expensive than
the existing atomics we might as well keep the ns thing too.


Hmm, would something like the patch below make sense? I suppose this can be
done even for the ns case; you'd have to duplicate all stats though.

At the very least the patch below reduces the number of atomics. Not entirely
sure it all matters much though; some benchmarking would be in order I
suppose.

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2167,8 +2167,12 @@ unsigned long nr_iowait(void)
 {
 	unsigned long i, sum = 0;
 
-	for_each_possible_cpu(i)
-		sum += atomic_read(&cpu_rq(i)->nr_iowait);
+	for_each_possible_cpu(i) {
+		struct rq *rq = cpu_rq(i);
+
+		sum += rq->nr_iowait_local;
+		sum += atomic_read(&rq->nr_iowait_remote);
+	}
 
 	return sum;
 }
@@ -2176,7 +2180,7 @@ unsigned long nr_iowait(void)
 unsigned long nr_iowait_cpu(int cpu)
 {
 	struct rq *this = cpu_rq(cpu);
-	return atomic_read(&this->nr_iowait);
+	return atomic_read(&this->nr_iowait_remote) + this->nr_iowait_local;
 }
 
 #ifdef CONFIG_SMP
@@ -4086,31 +4090,49 @@ EXPORT_SYMBOL_GPL(yield_to);
  */
 void __sched io_schedule(void)
 {
-	struct rq *rq = raw_rq();
+	struct rq *rq;
 
 	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
 	blk_flush_plug(current);
+
+	preempt_disable();
+	rq = this_rq();
+	rq->nr_iowait_local++;
 	current->in_iowait = 1;
-	schedule();
+	schedule_preempt_disabled();
 	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
+	if (likely(task_cpu(current) == cpu_of(rq)))
+		rq->nr_iowait_local--;
+	else
+		atomic_dec(&rq->nr_iowait_remote);
+	preempt_enable();
+
 	delayacct_blkio_end();
 }
 EXPORT_SYMBOL(io_schedule);
 
 long __sched io_schedule_timeout(long timeout)
 {
-	struct rq *rq = raw_rq();
+	struct rq *rq;
 	long ret;
 
 	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
 	blk_flush_plug(current);
+
+	preempt_disable();
+	rq = this_rq();
+	rq->nr_iowait_local++;
 	current->in_iowait = 1;
+	preempt_enable_no_resched();
 	ret = schedule_timeout(timeout);
+	preempt_disable();
 	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
+	if (likely(task_cpu(current) == cpu_of(rq)))
+		rq->nr_iowait_local--;
+	else
+		atomic_dec(&rq->nr_iowait_remote);
+	preempt_enable();
+
 	delayacct_blkio_end();
 	return ret;
 }
@@ -6650,7 +6672,8 @@ void __init sched_init(void)
 #endif
 #endif
 		init_rq_hrtick(rq);
-		atomic_set(&rq->nr_iowait, 0);
+		rq->nr_iowait_local = 0;
+		atomic_set(&rq->nr_iowait_remote, 0);
 	}
 
 	set_load_weight(&init_task);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -453,7 +453,8 @@ struct rq {
 	u64 clock;
 	u64 clock_task;
 
-	atomic_t nr_iowait;
+	int nr_iowait_local;
+	atomic_t nr_iowait_remote;
 
 #ifdef CONFIG_SMP
 	struct root_domain *rd;



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 16:01                   ` Peter Zijlstra
@ 2013-08-20 16:33                     ` Oleg Nesterov
  2013-08-20 17:54                       ` Peter Zijlstra
  2013-08-20 22:18                       ` Frederic Weisbecker
  0 siblings, 2 replies; 68+ messages in thread
From: Oleg Nesterov @ 2013-08-20 16:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On 08/20, Peter Zijlstra wrote:
>
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -453,7 +453,8 @@ struct rq {
>  	u64 clock;
>  	u64 clock_task;
>  
> -	atomic_t nr_iowait;
> +	int nr_iowait_local;
> +	atomic_t nr_iowait_remote;

I am wondering how bad the extra lock(rq)/unlock(rq) in schedule() is
compared to atomic_dec.

IOW, what if we simply make rq->nr_iowait "int" and change schedule()
to update it? Something like below. Just curious.

As for nr_iowait_local + nr_iowait_remote, this doesn't look safe...
in theory nr_iowait_cpu() or even nr_iowait() can return a negative
number: say, a reader observes the atomic decrement of ->nr_iowait_remote
before the plain increment of ->nr_iowait_local becomes visible, so the
sum goes transiently negative.

Oleg.


--- x/kernel/sched/core.c
+++ x/kernel/sched/core.c
@@ -2435,6 +2435,9 @@ need_resched:
 		rq->curr = next;
 		++*switch_count;
 
+		if (unlikely(prev->in_iowait))
+			rq->nr_iowait++;
+
 		context_switch(rq, prev, next); /* unlocks the rq */
 		/*
 		 * The context switch have flipped the stack from under us
@@ -2442,6 +2445,12 @@ need_resched:
 		 * this task called schedule() in the past. prev == current
 		 * is still correct, but it can be moved to another cpu/rq.
 		 */
+		if (unlikely(prev->in_iowait)) {
+			raw_spin_lock_irq(&rq->lock);
+			rq->nr_iowait--;
+			raw_spin_unlock_irq(&rq->lock);
+		}
+
 		cpu = smp_processor_id();
 		rq = cpu_rq(cpu);
 	} else
@@ -3939,31 +3948,24 @@ EXPORT_SYMBOL_GPL(yield_to);
  */
 void __sched io_schedule(void)
 {
-	struct rq *rq = raw_rq();
-
 	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
 	blk_flush_plug(current);
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
 	delayacct_blkio_end();
 }
 EXPORT_SYMBOL(io_schedule);
 
 long __sched io_schedule_timeout(long timeout)
 {
-	struct rq *rq = raw_rq();
 	long ret;
 
 	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
 	blk_flush_plug(current);
 	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
 	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
 	delayacct_blkio_end();
 	return ret;
 }



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 16:33                     ` Oleg Nesterov
@ 2013-08-20 17:54                       ` Peter Zijlstra
  2013-08-20 18:25                         ` Oleg Nesterov
  2013-08-20 22:18                       ` Frederic Weisbecker
  1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2013-08-20 17:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On Tue, Aug 20, 2013 at 06:33:12PM +0200, Oleg Nesterov wrote:
> On 08/20, Peter Zijlstra wrote:

> I am wondering how bad the extra lock(rq)/unlock(rq) in schedule() is
> compared to atomic_dec.
> 
> IOW, what if we simply make rq->nr_iowait "int" and change schedule()
> to update it? Something like below. Just curious.
> 
> As for nr_iowait_local + nr_iowait_remote, this doesn't look safe...
> in theory nr_iowait_cpu() or even nr_iowait() can return a negative
> number.

I'm too tired to see how it's different, I'll think on it :-)


That said:

> --- x/kernel/sched/core.c
> +++ x/kernel/sched/core.c
> @@ -2435,6 +2435,9 @@ need_resched:
>  		rq->curr = next;
>  		++*switch_count;
>  
> +		if (unlikely(prev->in_iowait))
> +			rq->nr_iowait++;
> +
>  		context_switch(rq, prev, next); /* unlocks the rq */
>  		/*
>  		 * The context switch have flipped the stack from under us
> @@ -2442,6 +2445,12 @@ need_resched:
>  		 * this task called schedule() in the past. prev == current
>  		 * is still correct, but it can be moved to another cpu/rq.
>  		 */
> +		if (unlikely(prev->in_iowait)) {
> +			raw_spin_lock_irq(&rq->lock);
> +			rq->nr_iowait--;
> +			raw_spin_unlock_irq(&rq->lock);
> +		}

This seems like the wrong place, this is where you return from
schedule() running another task, not where the task you just sent to
sleep wakes up.

If you want to do this, the nr_iowait decrement needs to be in the
wakeup path someplace.

>  		cpu = smp_processor_id();
>  		rq = cpu_rq(cpu);
>  	} else
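
To illustrate, a sketch of what I mean (not actual code): the decrement
would go somewhere in try_to_wake_up(), against the rq the task was
accounted on, before we pick a new cpu for it:

	if (p->in_iowait) {
		/* task_rq(p) is still the rq it went to sleep on */
		struct rq *rq = task_rq(p);

		raw_spin_lock(&rq->lock);
		rq->nr_iowait--;
		raw_spin_unlock(&rq->lock);
	}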


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 17:54                       ` Peter Zijlstra
@ 2013-08-20 18:25                         ` Oleg Nesterov
  2013-08-21  8:31                           ` Peter Zijlstra
  0 siblings, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2013-08-20 18:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On 08/20, Peter Zijlstra wrote:
>
> On Tue, Aug 20, 2013 at 06:33:12PM +0200, Oleg Nesterov wrote:
> > --- x/kernel/sched/core.c
> > +++ x/kernel/sched/core.c
> > @@ -2435,6 +2435,9 @@ need_resched:
> >  		rq->curr = next;
> >  		++*switch_count;
> >  
> > +		if (unlikely(prev->in_iowait))
> > +			rq->nr_iowait++;
> > +
> >  		context_switch(rq, prev, next); /* unlocks the rq */
> >  		/*
> >  		 * The context switch have flipped the stack from under us
> > @@ -2442,6 +2445,12 @@ need_resched:
> >  		 * this task called schedule() in the past. prev == current
> >  		 * is still correct, but it can be moved to another cpu/rq.
> >  		 */
> > +		if (unlikely(prev->in_iowait)) {
> > +			raw_spin_lock_irq(&rq->lock);
> > +			rq->nr_iowait--;
> > +			raw_spin_unlock_irq(&rq->lock);
> > +		}
>
> This seems like the wrong place, this is where you return from
> schedule() running another task,

Yes, but prev is current, and rq should be "correct" for
rq->nr_iowait-- ?

This local var should be equal to its value when this task called
context_switch() in the past.

Like any other variable, like "rq = raw_rq()" in io_schedule().

> not where the task you just sent to
> sleep wakes up.

sure, but currently io_schedule() does the same.



Btw. Whatever we do, can't we unify io_schedule/io_schedule_timeout?

Oleg.


--- x/kernel/sched/core.c
+++ x/kernel/sched/core.c
@@ -3939,16 +3939,7 @@ EXPORT_SYMBOL_GPL(yield_to);
  */
 void __sched io_schedule(void)
 {
-	struct rq *rq = raw_rq();
-
-	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
-	current->in_iowait = 1;
-	schedule();
-	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
-	delayacct_blkio_end();
+	io_schedule_timeout(MAX_SCHEDULE_TIMEOUT);
 }
 EXPORT_SYMBOL(io_schedule);
 



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20  6:21           ` Fernando Luis Vázquez Cao
@ 2013-08-20 21:55             ` Frederic Weisbecker
  0 siblings, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-08-20 21:55 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Oleg Nesterov, Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Tue, Aug 20, 2013 at 03:21:11PM +0900, Fernando Luis Vázquez Cao wrote:
> (2013-08-17 01:46), Frederic Weisbecker wrote:
> >On Fri, Aug 16, 2013 at 06:26:54PM +0200, Oleg Nesterov wrote:
> >>On 08/16, Frederic Weisbecker wrote:
> >>>On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
> >>>>>+	do {
> >>>>>+		seq = read_seqcount_begin(&ts->sleeptime_seq);
> >>>>>+		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> >>>>>+			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> >>>>>+			iowait = ktime_add(ts->iowait_sleeptime, delta);
> >>>>>+		} else {
> >>>>>+			iowait = ts->iowait_sleeptime;
> >>>>>+		}
> >>>>>+	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
> >>>>Unless I missread this patch, this is still racy a bit.
> >>>>
> >>>>Suppose it is called on CPU_0 and cpu == 1. Suppose that
> >>>>ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
> >>>>
> >>>>So we return iowait_sleeptime + delta.
> >>>>
> >>>>Suppose that we call get_cpu_iowait_time_us() again. By this time
> >>>>the task which incremented ->nr_iowait can be woken up on another
> >>>>CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
> >>>>we return iowait_sleeptime, and this is not monotonic again.
> >>>Hmm, by the time it decrements nr_iowait, it returned from schedule() and
> >>>so idle had flushed the pending iowait sleeptime.
> >>Suppose a task does io_schedule() on CPU_0, and increments the counter.
> >>This CPU becomes idle and nr_iowait_cpu(0) == 1.
> >>
> >>Then this task is woken up, but try_to_wake_up() selects another CPU != 0.
> >>
> >>It returns from schedule() and decrements the same counter, it doesn't
> >>do raw_rq/etc again. nr_iowait_cpu(0) becomes 0.
> >>
> >>In fact the task can even migrate to another CPU right after raw_rq().
> >Ah I see now. So that indeed yet another race.
> 
> I am sorry for chiming in late.
> 
> That's precisely the race I described here:
> 
> https://lkml.org/lkml/2013/7/2/3
> 
> I should have been more concise in my explanation. I apologize.

Reading that again, the description was good. It's rather me who didn't
read it carefully enough :)

> 
> >Should we flush that iowait to the src CPU? But then it means we must handle
> >concurrent updates to iowait_sleeptime, idle_sleeptime from the migration
> >code and from idle enter / exit.
> >
> >So I fear we need a seqlock.
> >
> >Or we can live with that and still account the whole idle time slept until
> >tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
> All we need is just a new field in ts-> that records in which state we entered
> idle.
> 
> Another approach could be to shadow ->iowait_sleeptime as
> suggested here: https://lkml.org/lkml/2013/7/2/165

Hmm, yeah, it seems that it would enforce the monotonicity, but the wrong
forward jumps due to bad ordering could still happen.

Oh, and I realize you already suggested accounting the iowait time on task
migration more than a month ago. Grr, I should really sit down before reading
emails.


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 16:33                     ` Oleg Nesterov
  2013-08-20 17:54                       ` Peter Zijlstra
@ 2013-08-20 22:18                       ` Frederic Weisbecker
  2013-08-21 11:49                         ` Oleg Nesterov
  1 sibling, 1 reply; 68+ messages in thread
From: Frederic Weisbecker @ 2013-08-20 22:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Aug 20, 2013 at 06:33:12PM +0200, Oleg Nesterov wrote:
> On 08/20, Peter Zijlstra wrote:
> >
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -453,7 +453,8 @@ struct rq {
> >  	u64 clock;
> >  	u64 clock_task;
> >  
> > -	atomic_t nr_iowait;
> > +	int nr_iowait_local;
> > +	atomic_t nr_iowait_remote;
> 
> I am wondering how bad the extra lock(rq)/unlock(rq) in schedule() is
> compared to atomic_dec.
> 
> IOW, what if we simply make rq->nr_iowait "int" and change schedule()
> to update it? Something like below. Just curious.
> 
> As for nr_iowait_local + nr_iowait_remote, this doesn't look safe...
> in theory nr_iowait_cpu() or even nr_iowait() can return a negative
> number.
> 
> Oleg.
> 
> 
> --- x/kernel/sched/core.c
> +++ x/kernel/sched/core.c
> @@ -2435,6 +2435,9 @@ need_resched:
>  		rq->curr = next;
>  		++*switch_count;
>  
> +		if (unlikely(prev->in_iowait))
> +			rq->nr_iowait++;
> +
>  		context_switch(rq, prev, next); /* unlocks the rq */
>  		/*
>  		 * The context switch have flipped the stack from under us
> @@ -2442,6 +2445,12 @@ need_resched:
>  		 * this task called schedule() in the past. prev == current
>  		 * is still correct, but it can be moved to another cpu/rq.
>  		 */
> +		if (unlikely(prev->in_iowait)) {
> +			raw_spin_lock_irq(&rq->lock);
> +			rq->nr_iowait--;
> +			raw_spin_unlock_irq(&rq->lock);
> +		}
> +

It seems that with this solution rq->nr_iowait is only ever modified locally.
Can't we just disable irqs for rq->nr_iowait-- ?

Also, if this is only updated locally, there is no need for a lock to update
ts->iowait_sleep_time anymore; we can flush the io sleeptime in place. Great thing.

Now, just in case, it might be worth noting as well that the time elapsing
from the task's enqueue on the rq until it is finally running on the CPU is
not accounted there anymore. This can make a little difference if the task
is woken up but a higher-priority task, possibly SCHED_FIFO, is already
running on the CPU. Now do we care...

Thanks.

 
>  		cpu = smp_processor_id();
>  		rq = cpu_rq(cpu);
>  	} else
> @@ -3939,31 +3948,24 @@ EXPORT_SYMBOL_GPL(yield_to);
>   */
>  void __sched io_schedule(void)
>  {
> -	struct rq *rq = raw_rq();
> -
>  	delayacct_blkio_start();
> -	atomic_inc(&rq->nr_iowait);
>  	blk_flush_plug(current);
>  	current->in_iowait = 1;
>  	schedule();
>  	current->in_iowait = 0;
> -	atomic_dec(&rq->nr_iowait);
>  	delayacct_blkio_end();
>  }
>  EXPORT_SYMBOL(io_schedule);
>  
>  long __sched io_schedule_timeout(long timeout)
>  {
> -	struct rq *rq = raw_rq();
>  	long ret;
>  
>  	delayacct_blkio_start();
> -	atomic_inc(&rq->nr_iowait);
>  	blk_flush_plug(current);
>  	current->in_iowait = 1;
>  	ret = schedule_timeout(timeout);
>  	current->in_iowait = 0;
> -	atomic_dec(&rq->nr_iowait);
>  	delayacct_blkio_end();
>  	return ret;
>  }
> 


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 18:25                         ` Oleg Nesterov
@ 2013-08-21  8:31                           ` Peter Zijlstra
  2013-08-21 11:35                             ` Oleg Nesterov
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2013-08-21  8:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On Tue, Aug 20, 2013 at 08:25:53PM +0200, Oleg Nesterov wrote:
> On 08/20, Peter Zijlstra wrote:
> >
> > On Tue, Aug 20, 2013 at 06:33:12PM +0200, Oleg Nesterov wrote:

> > > +		if (unlikely(prev->in_iowait)) {
> > > +			raw_spin_lock_irq(&rq->lock);
> > > +			rq->nr_iowait--;
> > > +			raw_spin_unlock_irq(&rq->lock);
> > > +		}
> >
> > This seems like the wrong place, this is where you return from
> > schedule() running another task,
> 
> Yes, but prev is current, and rq should be "correct" for
> rq->nr_iowait-- ?

Yes, it's the right rq, but the wrong time.

> This local var should be equal to its value when this task called
> context_switch() in the past.
> 
> Like any other variable, like "rq = raw_rq()" in io_schedule().
> 
> > not where the task you just sent to
> > sleep wakes up.
> 
> sure, but currently io_schedule() does the same.

No it doesn't. It only does the decrement when the task is woken back
up. Not right after it switches out.

> Btw. Whatever we do, can't we unify io_schedule/io_schedule_timeout?

I suppose we could, a timeout of MAX_SCHEDULE_TIMEOUT will act like a
regular schedule, but it gets all the overhead of doing
schedule_timeout(). So I don't think it's a win.


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21  8:31                           ` Peter Zijlstra
@ 2013-08-21 11:35                             ` Oleg Nesterov
  2013-08-21 12:33                               ` Peter Zijlstra
  2013-08-21 12:48                               ` Peter Zijlstra
  0 siblings, 2 replies; 68+ messages in thread
From: Oleg Nesterov @ 2013-08-21 11:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On 08/21, Peter Zijlstra wrote:
>
> On Tue, Aug 20, 2013 at 08:25:53PM +0200, Oleg Nesterov wrote:
> > On 08/20, Peter Zijlstra wrote:
> > >
> > > On Tue, Aug 20, 2013 at 06:33:12PM +0200, Oleg Nesterov wrote:
>
> > > > +		if (unlikely(prev->in_iowait)) {
> > > > +			raw_spin_lock_irq(&rq->lock);
> > > > +			rq->nr_iowait--;
> > > > +			raw_spin_unlock_irq(&rq->lock);
> > > > +		}
> > >
> > > This seems like the wrong place, this is where you return from
> > > schedule() running another task,
> >
> > Yes, but prev is current, and rq should be "correct" for
> > rq->nr_iowait-- ?
>
> Yes, it's the right rq, but the wrong time.

Hmm. Just in case, it is not that I think this patch really makes sense,
but I'd like to understand why you think it is wrong.

> > This local var should be equal to its value when this task called
> > context_switch() in the past.
> >
> > Like any other variable, like "rq = raw_rq()" in io_schedule().
> >
> > > not where the task you just sent to
> > > sleep wakes up.
> >
> > sure, but currently io_schedule() does the same.
>
> No it doesn't. It only does the decrement when the task is woken back
> up. Not right after it switches out.

But it is not "after it switches out", it is after it switched back.

Lets ignore the locking,

	if (prev->in_iowait)
		rq->nr_iowait++;

	context_switch(prev, next);

	if (prev->in_iowait)
		rq->nr_iowait--;

From the task_struct's (current's) pov prev/rq are the same, before or
after context_switch().

But from the CPU's pov they differ. And, ignoring more details, on UP the
code above is equivalent to

	if (prev->in_iowait)
		rq->nr_iowait++;

	if (next->in_iowait)
		rq->nr_iowait--;

	context_switch(prev, next);

No?

Yes, need_resched()/preemption can trigger more inc/dec's than io_schedule()
does, but I don't think this was your concern.

> > Btw. Whatever we do, can't we unify io_schedule/io_schedule_timeout?
>
> I suppose we could, a timeout of MAX_SCHEDULE_TIMEOUT will act like a
> regular schedule, but it gets all the overhead of doing
> schedule_timeout(). So I don't think it's a win.

Well, the only overhead is "if(to == MAX_SCHEDULE_TIMEOUT)" at the start.
I don't think it makes sense to copy-and-paste the identical code to
avoid it. But please ignore, this is really minor and off-topic.

Oleg.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 22:18                       ` Frederic Weisbecker
@ 2013-08-21 11:49                         ` Oleg Nesterov
  0 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2013-08-21 11:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On 08/21, Frederic Weisbecker wrote:
>
> On Tue, Aug 20, 2013 at 06:33:12PM +0200, Oleg Nesterov wrote:
> >
> > --- x/kernel/sched/core.c
> > +++ x/kernel/sched/core.c
> > @@ -2435,6 +2435,9 @@ need_resched:
> >  		rq->curr = next;
> >  		++*switch_count;
> >
> > +		if (unlikely(prev->in_iowait))
> > +			rq->nr_iowait++;
> > +
> >  		context_switch(rq, prev, next); /* unlocks the rq */
> >  		/*
> >  		 * The context switch have flipped the stack from under us
> > @@ -2442,6 +2445,12 @@ need_resched:
> >  		 * this task called schedule() in the past. prev == current
> >  		 * is still correct, but it can be moved to another cpu/rq.
> >  		 */
> > +		if (unlikely(prev->in_iowait)) {
> > +			raw_spin_lock_irq(&rq->lock);
> > +			rq->nr_iowait--;
> > +			raw_spin_unlock_irq(&rq->lock);
> > +		}
> > +
>
> It seems that with this solution rq->nr_iowait is only ever modified locally.
> Can't we just disable irqs for rq->nr_iowait-- ?

Not really. rq holds the old value, which was used when this task
called context_switch() in the past.

IOW, if a task T does io_schedule() on CPU_0, then cpu_of(rq) is still 0
after context_switch(), even if T runs on another cpu.

> Also if this is only updated locally,

Unfortunately, I don't see how we can do this.

Oleg.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 11:35                             ` Oleg Nesterov
@ 2013-08-21 12:33                               ` Peter Zijlstra
  2013-08-21 14:23                                 ` Peter Zijlstra
  2013-08-21 12:48                               ` Peter Zijlstra
  1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2013-08-21 12:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On Wed, Aug 21, 2013 at 01:35:51PM +0200, Oleg Nesterov wrote:
> On 08/21, Peter Zijlstra wrote:

> > Yes, it's the right rq, but the wrong time.
> 
> > Hmm. Just in case, it is not that I think this patch really makes sense,
> > but I'd like to understand why you think it is wrong.
 
> But it is not "after it switches out", it is after it switched back.

D'uh I was being particularly dense it seems :/

Yes I think it is correct. You're now trading two atomic ops on a
different cacheline than rq->lock for one atomic op on the rq->lock.

Might be a win, esp. since hopefully there's a fair chance it's the same
runqueue.




* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 11:35                             ` Oleg Nesterov
  2013-08-21 12:33                               ` Peter Zijlstra
@ 2013-08-21 12:48                               ` Peter Zijlstra
  2013-08-21 17:09                                 ` Oleg Nesterov
  1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2013-08-21 12:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On Wed, Aug 21, 2013 at 01:35:51PM +0200, Oleg Nesterov wrote:
> > > Btw. Whatever we do, can't we unify io_schedule/io_schedule_timeout?
> >
> > I suppose we could, a timeout of MAX_SCHEDULE_TIMEOUT will act like a
> > regular schedule, but it gets all the overhead of doing
> > schedule_timeout(). So I don't think it's a win.
> 
> Well, the only overhead is "if(to == MAX_SCHEDULE_TIMEOUT)" at the start.
> I don't think it makes sense to copy-and-paste the identical code to
> avoid it. But please ignore, this is really minor and off-topic.

Ah, so the 'problem' is that schedule_timeout() doesn't live in
kernel/sched/core.c and we thus get an extra function call (which is
somewhat expensive on some archs).

That said, I suppose we could do something like the below, which should
allow the compiler to DTRT.

---
 include/linux/sched.h |   35 ++++++++++++++++++++++---
 kernel/sched/core.c   |   22 +++++-----------
 kernel/timer.c        |   67 ++++++++++--------------------------------------
 3 files changed, 52 insertions(+), 72 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 50d04b9..112960c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -303,10 +303,37 @@ extern char __sched_text_start[], __sched_text_end[];
 extern int in_sched_functions(unsigned long addr);
 
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
-extern signed long schedule_timeout(signed long timeout);
-extern signed long schedule_timeout_interruptible(signed long timeout);
-extern signed long schedule_timeout_killable(signed long timeout);
-extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern long __schedule_timeout(long timeout);
+
+static inline long schedule_timeout(long timeout)
+{
+	if (timeout == MAX_SCHEDULE_TIMEOUT) {
+		schedule();
+		return timeout;
+	}
+	return __schedule_timeout(timeout);
+}
+
+/*
+ * We can use __set_current_state() here because schedule_timeout() calls
+ * schedule() unconditionally.
+ */
+static inline long schedule_timeout_interruptible(long timeout)
+{
+	__set_current_state(TASK_INTERRUPTIBLE);
+	return schedule_timeout(timeout);
+}
+static inline long schedule_timeout_killable(long timeout)
+{
+	__set_current_state(TASK_KILLABLE);
+	return schedule_timeout(timeout);
+}
+static inline long schedule_timeout_uninterruptible(long timeout)
+{
+	__set_current_state(TASK_UNINTERRUPTIBLE);
+	return schedule_timeout(timeout);
+}
+
 asmlinkage void schedule(void);
 extern void schedule_preempt_disabled(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f737871..ef2cddd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3937,21 +3937,6 @@ EXPORT_SYMBOL_GPL(yield_to);
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
  */
-void __sched io_schedule(void)
-{
-	struct rq *rq = raw_rq();
-
-	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
-	current->in_iowait = 1;
-	schedule();
-	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
-	delayacct_blkio_end();
-}
-EXPORT_SYMBOL(io_schedule);
-
 long __sched io_schedule_timeout(long timeout)
 {
 	struct rq *rq = raw_rq();
@@ -3968,6 +3953,13 @@ long __sched io_schedule_timeout(long timeout)
 	return ret;
 }
 
+void __sched io_schedule(void)
+{
+	io_schedule_timeout(MAX_SCHEDULE_TIMEOUT);
+}
+EXPORT_SYMBOL(io_schedule);
+
+
 /**
  * sys_sched_get_priority_max - return maximum RT priority.
  * @policy: scheduling class.
diff --git a/kernel/timer.c b/kernel/timer.c
index 15bc1b4..8244ce0 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1429,38 +1429,24 @@ static void process_timeout(unsigned long __data)
  *
  * In all cases the return value is guaranteed to be non-negative.
  */
-signed long __sched schedule_timeout(signed long timeout)
+signed long __sched __schedule_timeout(signed long timeout)
 {
 	struct timer_list timer;
 	unsigned long expire;
 
-	switch (timeout)
-	{
-	case MAX_SCHEDULE_TIMEOUT:
-		/*
-		 * These two special cases are useful to be comfortable
-		 * in the caller. Nothing more. We could take
-		 * MAX_SCHEDULE_TIMEOUT from one of the negative value
-		 * but I' d like to return a valid offset (>=0) to allow
-		 * the caller to do everything it want with the retval.
-		 */
-		schedule();
-		goto out;
-	default:
-		/*
-		 * Another bit of PARANOID. Note that the retval will be
-		 * 0 since no piece of kernel is supposed to do a check
-		 * for a negative retval of schedule_timeout() (since it
-		 * should never happens anyway). You just have the printk()
-		 * that will tell you if something is gone wrong and where.
-		 */
-		if (timeout < 0) {
-			printk(KERN_ERR "schedule_timeout: wrong timeout "
+	/*
+	 * Another bit of PARANOID. Note that the retval will be 0 since no
+	 * piece of kernel is supposed to do a check for a negative retval of
+	 * schedule_timeout() (since it should never happens anyway). You just
+	 * have the printk() that will tell you if something is gone wrong and
+	 * where.
+	 */
+	if (timeout < 0) {
+		printk(KERN_ERR "schedule_timeout: wrong timeout "
 				"value %lx\n", timeout);
-			dump_stack();
-			current->state = TASK_RUNNING;
-			goto out;
-		}
+		dump_stack();
+		current->state = TASK_RUNNING;
+		goto out;
 	}
 
 	expire = timeout + jiffies;
@@ -1478,32 +1464,7 @@ signed long __sched schedule_timeout(signed long timeout)
  out:
 	return timeout < 0 ? 0 : timeout;
 }
-EXPORT_SYMBOL(schedule_timeout);
-
-/*
- * We can use __set_current_state() here because schedule_timeout() calls
- * schedule() unconditionally.
- */
-signed long __sched schedule_timeout_interruptible(signed long timeout)
-{
-	__set_current_state(TASK_INTERRUPTIBLE);
-	return schedule_timeout(timeout);
-}
-EXPORT_SYMBOL(schedule_timeout_interruptible);
-
-signed long __sched schedule_timeout_killable(signed long timeout)
-{
-	__set_current_state(TASK_KILLABLE);
-	return schedule_timeout(timeout);
-}
-EXPORT_SYMBOL(schedule_timeout_killable);
-
-signed long __sched schedule_timeout_uninterruptible(signed long timeout)
-{
-	__set_current_state(TASK_UNINTERRUPTIBLE);
-	return schedule_timeout(timeout);
-}
-EXPORT_SYMBOL(schedule_timeout_uninterruptible);
+EXPORT_SYMBOL(__schedule_timeout);
 
 static int __cpuinit init_timers_cpu(int cpu)
 {



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 12:33                               ` Peter Zijlstra
@ 2013-08-21 14:23                                 ` Peter Zijlstra
  2013-08-21 16:41                                   ` Oleg Nesterov
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2013-08-21 14:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On Wed, Aug 21, 2013 at 02:33:11PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 21, 2013 at 01:35:51PM +0200, Oleg Nesterov wrote:
> > On 08/21, Peter Zijlstra wrote:
> 
> > > Yes, it's the right rq, but the wrong time.
> > 
> > Hmm. Just in case, it is not that I think this patch really makes sense,
> > but I'd like to understand why you think it is wrong.
>  
> > But it is not "after it switches out", it is after it switched back.
> 
> D'uh I was being particularly dense it seems :/
> 
> Yes I think it is correct. You're now trading two atomic ops on a
> different cacheline than rq->lock for one atomic op on the rq->lock.
> 
> Might be a win, esp. since hopefully there's a fair chance its the same
> runqueue.

The other consideration is that this adds two branches to the normal
schedule path. I really don't know what the regular ratio between
schedule() and io_schedule() is -- and I suspect it can very much depend
on workload -- but it might be a net loss due to that, even if it makes
io_schedule() 'lots' cheaper.




* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 14:23                                 ` Peter Zijlstra
@ 2013-08-21 16:41                                   ` Oleg Nesterov
  2013-10-01 14:05                                     ` Frederic Weisbecker
  0 siblings, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2013-08-21 16:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On 08/21, Peter Zijlstra wrote:
>
> The other consideration is that this adds two branches to the normal
> schedule path. I really don't know what the regular ratio between
> schedule() and io_schedule() is -- and I suspect it can very much depend
> on workload -- but it might be a net loss due to that, even if it makes
> io_schedule() 'lots' cheaper.

Yes, agreed. Please ignore it for now, I didn't try to actually suggest
this change. And even if this is fine performance wise, this needs some
benchmarking.

Well, actually I have a vague feeling that _perhaps_ this change can
help to solve other problems we are discussing, but I am not sure and
right now I can't even explain the idea to myself.

In short: please ignore ;)

Oleg.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 12:48                               ` Peter Zijlstra
@ 2013-08-21 17:09                                 ` Oleg Nesterov
  2013-08-21 18:31                                   ` Peter Zijlstra
  0 siblings, 1 reply; 68+ messages in thread
From: Oleg Nesterov @ 2013-08-21 17:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On 08/21, Peter Zijlstra wrote:
>
> On Wed, Aug 21, 2013 at 01:35:51PM +0200, Oleg Nesterov wrote:
> >
> > Well, the only overhead is "if(to == MAX_SCHEDULE_TIMEOUT)" at the start.
> > I don't think it makes sense to copy-and-paste the identical code to
> > avoid it. But please ignore, this is really minor and off-topic.
>
> Ah, so the 'problem' is that schedule_timeout() doesn't live in
> kernel/sched/core.c and we thus get an extra function call (which are
> somewhat expensive on some archs).

So, unlike me, you like -O2 more than -Os ;)

> +static inline long schedule_timeout(long timeout)
> +{
> +	if (timeout == MAX_SCHEDULE_TIMEOUT) {
> +		schedule();
> +		return timeout;
> +	}
> +	return __schedule_timeout(timeout);
> +}

Well, this means that every caller will do the MAX_SCHEDULE_TIMEOUT
check inline, and this case is unlikely. And you are also going
to make schedule_timeout_*() inline...

But,

> +static inline long schedule_timeout_interruptible(long timeout)
> +{
> +	__set_current_state(TASK_INTERRUPTIBLE);
> +	return schedule_timeout(timeout);
> +}
> +static inline long schedule_timeout_killable(long timeout)
> +{
> +	__set_current_state(TASK_KILLABLE);
> +	return schedule_timeout(timeout);
> +}
> +static inline long schedule_timeout_uninterruptible(long timeout)
> +{
> +	__set_current_state(TASK_UNINTERRUPTIBLE);
> +	return schedule_timeout(timeout);
> +}
> ...
> -signed long __sched schedule_timeout_interruptible(signed long timeout)
> -{
> -	__set_current_state(TASK_INTERRUPTIBLE);
> -	return schedule_timeout(timeout);
> -}
> -EXPORT_SYMBOL(schedule_timeout_interruptible);
> -
> -signed long __sched schedule_timeout_killable(signed long timeout)
> -{
> -	__set_current_state(TASK_KILLABLE);
> -	return schedule_timeout(timeout);
> -}
> -EXPORT_SYMBOL(schedule_timeout_killable);
> -
> -signed long __sched schedule_timeout_uninterruptible(signed long timeout)
> -{
> -	__set_current_state(TASK_UNINTERRUPTIBLE);
> -	return schedule_timeout(timeout);
> -}
> -EXPORT_SYMBOL(schedule_timeout_uninterruptible);
> +EXPORT_SYMBOL(__schedule_timeout);

Personally I think this particular change is fine.

Or we can export a single schedule_timeout_state(timeout, state) to
factor out get_current().

But of course I won't argue in any case.

Oleg.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 17:09                                 ` Oleg Nesterov
@ 2013-08-21 18:31                                   ` Peter Zijlstra
  2013-08-21 18:32                                     ` Oleg Nesterov
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2013-08-21 18:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On Wed, Aug 21, 2013 at 07:09:27PM +0200, Oleg Nesterov wrote:
> So, unlike me, you like -02 more than -Os ;)

I haven't checked the actual flags they enable in a while, but I think I
prefer something in the middle.

Esp. -freorder-blocks and the various -falign flags are something you
really want with -Os.

> > +static inline long schedule_timeout(long timeout)
> > +{
> > +	if (timeout == MAX_SCHEDULE_TIMEOUT) {
> > +		schedule();
> > +		return timeout;
> > +	}
> > +	return __schedule_timeout(timeout);
> > +}
> 
> Well this means that every caller will do the MAX_SCHEDULE_TIMEOUT
> check inline, and this case is unlikely. 

OK, so do not remove the MAX_SCHEDULE_TIMEOUT check from
__schedule_timeout() and change the above to:

static __always_inline long schedule_timeout(long timeout)
{
	if (__builtin_constant_p(timeout) && timeout == MAX_SCHEDULE_TIMEOUT) {
		schedule();
		return timeout;
	}
	return __schedule_timeout(timeout);
}

That should avoid extra code generation for the runtime sites while
still allowing what we set out to do.


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 18:31                                   ` Peter Zijlstra
@ 2013-08-21 18:32                                     ` Oleg Nesterov
  0 siblings, 0 replies; 68+ messages in thread
From: Oleg Nesterov @ 2013-08-21 18:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On 08/21, Peter Zijlstra wrote:
>
> OK, so do not remove the MAX_SCHEDULE_TIMEOUT check from
> __schedule_timeout() and change the above to:
>
> static __always_inline long schedule_timeout(long timeout)
> {
> 	if (__builtin_constant_p(timeout) && timeout == MAX_SCHEDULE_TIMEOUT) {
> 		schedule();
> 		return timeout;
> 	}
> 	return __schedule_timeout(timeout);
> }

Agreed, this looks nice.

Hmm. And this can simplify some changes in linux/wait.h I am going
to resend.

Oleg.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 16:41                                   ` Oleg Nesterov
@ 2013-10-01 14:05                                     ` Frederic Weisbecker
  2013-10-01 14:26                                       ` Frederic Weisbecker
                                                         ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-10-01 14:05 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Peter Zijlstra, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Wed, Aug 21, 2013 at 06:41:46PM +0200, Oleg Nesterov wrote:
> On 08/21, Peter Zijlstra wrote:
> >
> > The other consideration is that this adds two branches to the normal
> > schedule path. I really don't know what the regular ratio between
> > schedule() and io_schedule() is -- and I suspect it can very much depend
> > on workload -- but it might be a net loss due to that, even if it makes
> > io_schedule() 'lots' cheaper.
> 
> Yes, agreed. Please ignore it for now, I didn't try to actually suggest
> this change. And even if this is fine performance wise, this needs some
> benchmarking.
> 
> Well, actually I have a vague feeling that _perhaps_ this change can
> help to solve other problems we are discussing, but I am not sure and
> right now I can't even explain the idea to myself.
> 
> In short: please ignore ;)
> 
> Oleg.
> 

Ok, the discussion is hard to synthesize but I think I now have more
clues to send a better new iteration.

So we have the choice between keeping atomics or using rq->lock. It seems
that using rq->lock is going to be worrisome for several reasons. So let's
stick to atomics (but tell me if you prefer the other way).

So the idea for the next try is to do something along the lines of:

struct cpu_idletime {
       nr_iowait,
       seqlock,
       idle_start,
       idle_time,
       iowait_time,
} __cacheline_aligned_in_smp;

DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);
       
io_schedule()
{
        struct cpu_idletime *prev_cpu_idletime;
	
        preempt_disable();
        prev_cpu_idletime = __this_cpu_ptr(&cpu_idletime);
        atomic_inc(prev_cpu_idletime->nr_iowait);
        WARN_ON_ONCE(is_idle_task(current));
        preempt_enable_no_resched();

        schedule();

        write_seqlock(prev_cpu_idletime->seqlock)
	if (!atomic_dec_return(prev_cpu_idletime->nr_iowait))
           flush_cpu_idle_time(prev_cpu_idletime, 1)
        write_sequnlock(prev_cpu_idletime->seqlock)

}

flush_cpu_idle_time(cpu_idletime, nr_iowait)
{
       if (!cpu_idletime->idle_start)
            return;

       if (nr_iowait)
            cpu_idletime->iowait_time += NOW() - cpu_idletime->idle_start;
       else
            cpu_idletime->idle_time += NOW() - cpu_idletime->idle_start;
}

idle_entry()
{
        write_seqlock(this_cpu_idletime->seqlock)
        this_cpu_idletime->idle_start = NOW();
        write_sequnlock(this_cpu_idletime->seqlock)
}

idle_exit()
{
        write_seqlock(this_cpu_idletime->seqlock)
        flush_cpu_idle_time(this_cpu_idletime, atomic_read(&this_cpu_idletime->nr_iowait));
	this_cpu_idletime->idle_start = 0;
        write_sequnlock(this_cpu_idletime->seqlock)
}


Now this all relies on the fact that atomic_inc(cpu_idletime->nr_iowait) can't happen
on a CPU that is already idle. So it can't happen between idle_entry() and idle_exit().
Hence the WARN_ON_ONCE(is_idle_task(current)) right after the inc.

Hmm?
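
And the reader side would follow the same scheme, e.g. (same pseudocode,
same invented names as above):

get_cpu_iowait_time_us(cpu)
{
        it = &per_cpu(cpu_idletime, cpu);

        do {
                seq = read_seqbegin(it->seqlock);
                iowait = it->iowait_time;
                if (it->idle_start && atomic_read(it->nr_iowait))
                        iowait += NOW() - it->idle_start;
        } while (read_seqretry(it->seqlock, seq));

        return iowait;
}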


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 14:05                                     ` Frederic Weisbecker
@ 2013-10-01 14:26                                       ` Frederic Weisbecker
  2013-10-01 14:27                                         ` Frederic Weisbecker
  2013-10-01 14:49                                         ` Frederic Weisbecker
  2013-10-01 15:00                                       ` Peter Zijlstra
  2013-10-01 15:56                                       ` Peter Zijlstra
  2 siblings, 2 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-10-01 14:26 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao, Ingo Molnar,
	Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

2013/10/1 Frederic Weisbecker <fweisbec@gmail.com>:
> On Wed, Aug 21, 2013 at 06:41:46PM +0200, Oleg Nesterov wrote:
>> On 08/21, Peter Zijlstra wrote:
>> >
>> > The other consideration is that this adds two branches to the normal
>> > schedule path. I really don't know what the regular ratio between
>> > schedule() and io_schedule() is -- and I suspect it can very much depend
>> > on workload -- but it might be a net loss due to that, even if it makes
>> > io_schedule() 'lots' cheaper.
>>
>> Yes, agreed. Please ignore it for now, I didn't try to actually suggest
>> this change. And even if this is fine performance wise, this needs some
>> benchmarking.
>>
>> Well, actually I have a vague feeling that _perhaps_ this change can
>> help to solve other problems we are discussing, but I am not sure and
>> right now I can't even explain the idea to myself.
>>
>> In short: please ignore ;)
>>
>> Oleg.
>>
>
> Ok, the discussion is hard to synthesize but I think I now have more
> clues to send a better new iteration.
>
> So we have the choice between keeping atomics or using rq->lock. It seems
> that using rq->lock is going to be worrisome for several reasons. So let's
> stick to atomics (but tell me if you prefer the other way).
>
> So the idea for the next try is to do something along the lines of:
>
> struct cpu_idletime {
>        nr_iowait,
>        seqlock,
>        idle_start,
>        idle_time,
>        iowait_time,
> } __cacheline_aligned_in_smp;
>
> DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);
>
> io_schedule()
> {
>         struct cpu_idletime *prev_cpu_idletime;
>
>         preempt_disable();
>         prev_cpu_idletime = __this_cpu_ptr(&cpu_idletime);
>         atomic_inc(prev_cpu_idletime->nr_iowait);
>         WARN_ON_ONCE(is_idle_task(current));
>         preempt_enable_no_resched();
>
>         schedule();
>
>         write_seqlock(prev_cpu_idletime->seqlock)
>         if (!atomic_dec_return(prev_cpu_idletime->nr_iowait))
>            flush_cpu_idle_time(prev_cpu_idletime, 1)

I forgot...
              cpu_idletime->idle_start;

after the update.

Also now I wonder if we actually should lock the inc part. Otherwise
it may be hard to get the readers right...

Thanks.

>         write_sequnlock(prev_cpu_idletime->seqlock)
>
> }
>
> flush_cpu_idle_time(cpu_idletime, nr_iowait)
> {
>        if (!cpu_idletime->idle_start)
>             return;
>
>        if (nr_iowait)
>             cpu_idletime->iowait_time += NOW() - cpu_idletime->idle_start;
>        else
>             cpu_idletime->idle_time += NOW() - cpu_idletime->idle_start;
> }
>
> idle_entry()
> {
>         write_seqlock(this_cpu_idletime->seqlock)
>         this_cpu_idletime->idle_start = NOW();
>         write_sequnlock(this_cpu_idletime->seqlock)
> }
>
> idle_exit()
> {
>         write_seqlock(this_cpu_idletime->seqlock)
>         flush_cpu_idle_time(this_cpu_idletime, atomic_read(&this_cpu_idletime->nr_iowait));
>         this_cpu_idletime->idle_start = 0;
>         write_sequnlock(this_cpu_idletime->seqlock)
> }
>
>
> Now this all relies on the fact that atomic_inc(cpu_idletime->nr_iowait) can't happen
> on a CPU that is already idle. So it can't happen between idle_entry() and idle_exit().
> Hence the WARN_ON_ONCE(is_idle_task(current)) right after the inc.
>
> Hmm?


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 14:26                                       ` Frederic Weisbecker
@ 2013-10-01 14:27                                         ` Frederic Weisbecker
  2013-10-01 14:49                                         ` Frederic Weisbecker
  1 sibling, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-10-01 14:27 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao, Ingo Molnar,
	Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

2013/10/1 Frederic Weisbecker <fweisbec@gmail.com>:
> I forgot...
>               cpu_idletime->idle_start;
                 cpu_idletime->idle_start = NOW();

grrr.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 14:26                                       ` Frederic Weisbecker
  2013-10-01 14:27                                         ` Frederic Weisbecker
@ 2013-10-01 14:49                                         ` Frederic Weisbecker
  1 sibling, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-10-01 14:49 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao, Ingo Molnar,
	Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

2013/10/1 Frederic Weisbecker <fweisbec@gmail.com>:
> 2013/10/1 Frederic Weisbecker <fweisbec@gmail.com>:
>> On Wed, Aug 21, 2013 at 06:41:46PM +0200, Oleg Nesterov wrote:
>>> On 08/21, Peter Zijlstra wrote:
>>> >
>>> > The other consideration is that this adds two branches to the normal
>>> > schedule path. I really don't know what the regular ratio between
>>> > schedule() and io_schedule() is -- and I suspect it can very much depend
>>> > on workload -- but it might be a net loss due to that, even if it makes
>>> > io_schedule() 'lots' cheaper.
>>>
>>> Yes, agreed. Please ignore it for now, I didn't try to actually suggest
>>> this change. And even if this is fine performance-wise, this needs some
>>> benchmarking.
>>>
>>> Well, actually I have a vague feeling that _perhaps_ this change can
>>> help to solve other problems we are discussing, but I am not sure and
>>> right now I can't even explain the idea to myself.
>>>
>>> In short: please ignore ;)
>>>
>>> Oleg.
>>>
>>
>> Ok, the discussion is hard to summarize, but I think I now have more
>> clues to send a better new iteration.
>>
>> So we have the choice between keeping atomics or using rq->lock. It seems
>> that using rq->lock is going to be worrisome for several reasons. So let's
>> stick to atomics (but tell me if you prefer the other way).
>>
>> So the idea for the next try is to do something along the lines of:
>>
>> struct cpu_idletime {
>>        nr_iowait,
>>        seqlock,
>>        idle_start,
>>        idle_time,
>>        iowait_time,
>> } __cacheline_aligned_in_smp;
>>
>> DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);
>>
>> io_schedule()
>> {
>>         struct cpu_idletime *prev_cpu_idletime;
>>
>>         preempt_disable();
>>         prev_cpu_idletime = __this_cpu_ptr(&cpu_idletime);
>>         atomic_inc(prev_cpu_idletime->nr_iowait);
>>         WARN_ON_ONCE(is_idle_task(current));
>>         preempt_enable_no_resched();
>>
>>         schedule();
>>
>>         write_seqlock(prev_cpu_idletime->seqlock)
>>         if (!atomic_dec_return(prev_cpu_idletime->nr_iowait))
>>            flush_cpu_idle_time(prev_cpu_idletime, 1)
>
> I forgot...
>               cpu_idletime->idle_start;
>
> after the update.
>
> Also now I wonder if we actually should lock the inc part. Otherwise
> it may be hard to get the readers right...
>
> Thanks.
>
>>         write_sequnlock(prev_cpu_idletime->seqlock)
>>
>> }
>>
>> flush_cpu_idle_time(cpu_idletime, iowait)
>> {
>>        if (!cpu_idletime->idle_start)
>>             return;
>>
>>        if (nr_iowait)
>>             cpu_idletime->iowait_time = NOW() - cpu_idletime->idle_start;
>>        else
>>             cpu_idletime->idle_time = NOW() - cpu_idletime->idle_start;
>> }
>>
>> idle_entry()
>> {
>>         write_seqlock(this_cpu_idletime->seqlock)
>>         this_cpu_idletime->idle_start = NOW();
>>         write_sequnlock(this_cpu_idletime->seqlock)
>> }
>>
>> idle_exit()
>> {
>>         write_seqlock(this_cpu_idletime->seqlock)
>>         flush_cpu_idle_time(this_cpu_idletime, atomic_read(&this_cpu_idletime->nr_iowait));
>>         this_cpu_idletime->idle_start = 0;
>>         write_sequnlock(this_cpu_idletime->seqlock)
>> }
>>
>>
>> Now this all relies on the fact that atomic_inc(cpu_idletime->nr_iowait) can't happen
>> in a CPU that is already idle. So it can't happen between idle_entry() and idle_exit().
>> Hence the WARN_ON(is_idle_task(current)) above, right after the inc.
>>
>> Hmm?

Actually if we record the source of the idle entry time (whether we
enter idle as iowait or plain idle) and protect it inside the seqlock,
we can probably make the readers safe. So they would rely on that
instead of nr_iowait, which would only be used by the updaters.

I'm probably forgetting something obvious. Anyway I'll try to make a
patchset out of that (see the sketch below), unless somebody points out
the mistakes I'm missing here.
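
For reference, recording the entry reason under the seqlock could look roughly
like this, in the same pseudocode style as above (a sketch; the idle_type field
and the IDLE_PLAIN/IDLE_IOWAIT names are illustrative, not from this thread):

idle_entry()
{
        write_seqlock(this_cpu_idletime->seqlock)
        this_cpu_idletime->idle_start = NOW();
        /* Snapshot why this CPU goes idle, so readers never look at nr_iowait */
        this_cpu_idletime->idle_type =
                atomic_read(&this_cpu_idletime->nr_iowait) ? IDLE_IOWAIT : IDLE_PLAIN;
        write_sequnlock(this_cpu_idletime->seqlock)
}

This is essentially the direction the later "nohz: Fix idle/iowait counts going
backwards" patch takes with its TRUE_IDLE/IOWAIT_IDLE states.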

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 14:05                                     ` Frederic Weisbecker
  2013-10-01 14:26                                       ` Frederic Weisbecker
@ 2013-10-01 15:00                                       ` Peter Zijlstra
  2013-10-01 15:21                                         ` Frederic Weisbecker
  2013-10-01 15:56                                       ` Peter Zijlstra
  2 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2013-10-01 15:00 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Oct 01, 2013 at 04:05:27PM +0200, Frederic Weisbecker wrote:
 
> struct cpu_idletime {
>        nr_iowait,
>        seqlock,
>        idle_start,
>        idle_time,
>        iowait_time,
> } __cacheline_aligned_in_smp;
> 
> DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);
>        
> io_schedule()
> {
>         struct cpu_idletime *prev_cpu_idletime;
> 	
>         preempt_disable();
>         prev_cpu_idletime = __this_cpu_ptr(&cpu_idletime);
>         atomic_inc(prev_cpu_idletime->nr_iowait);
>         WARN_ON_ONCE(is_idle_task(current));
>         preempt_enable_no_resched();
> 
>         schedule();
> 
>         write_seqlock(prev_cpu_idletime->seqlock)
> 	if (!atomic_dec_return(prev_cpu_idletime->nr_iowait))
>            flush_cpu_idle_time(prev_cpu_idletime, 1)
>         write_sequnlock(prev_cpu_idletime->seqlock)
> 
> }

This is at least 3 atomic ops and a whole bunch of branches extra. It
used to be 2 atomics and no branches.

What again are we solving and why?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 15:00                                       ` Peter Zijlstra
@ 2013-10-01 15:21                                         ` Frederic Weisbecker
  0 siblings, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-10-01 15:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Oct 01, 2013 at 05:00:37PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 04:05:27PM +0200, Frederic Weisbecker wrote:
>  
> > struct cpu_idletime {
> >        nr_iowait,
> >        seqlock,
> >        idle_start,
> >        idle_time,
> >        iowait_time,
> > } __cacheline_aligned_in_smp;
> > 
> > DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);
> >        
> > io_schedule()
> > {
> >         struct cpu_idletime *prev_cpu_idletime;
> > 	
> >         preempt_disable();
> >         prev_cpu_idletime = __this_cpu_ptr(&cpu_idletime);
> >         atomic_inc(prev_cpu_idletime->nr_iowait);
> >         WARN_ON_ONCE(is_idle_task(current));
> >         preempt_enable_no_resched();
> > 
> >         schedule();
> > 
> >         write_seqlock(prev_cpu_idletime->seqlock)
> > 	if (!atomic_dec_return(prev_cpu_idletime->nr_iowait))
> >            flush_cpu_idle_time(prev_cpu_idletime, 1)
> >         write_sequnlock(prev_cpu_idletime->seqlock)
> > 
> > }
> 
> This is at least 3 atomic ops and a whole bunch of branches extra. It
> used to be 2 atomics and no branches.

Yeah. If somebody has a better proposal, I'm all for it.
Of course the best would be to remove these stats if we can. I think we
already concluded that the idea of per CPU iowait stats is broken since
sleeping tasks aren't assigned a particular CPU.

> 
> What again are we solving and why?

Fernando summarized it better than I could
	 * https://lkml.org/lkml/2013/3/18/962
	 * and http://marc.info/?l=linux-kernel&m=137273800916899&w=2

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 14:05                                     ` Frederic Weisbecker
  2013-10-01 14:26                                       ` Frederic Weisbecker
  2013-10-01 15:00                                       ` Peter Zijlstra
@ 2013-10-01 15:56                                       ` Peter Zijlstra
  2013-10-01 16:47                                         ` Frederic Weisbecker
  2 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2013-10-01 15:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton


So what's wrong with something like:

struct cpu_idletime {
       seqlock_t seqlock;
       unsigned long nr_iowait;
       u64 start;
       u64 idle_time;
       u64 iowait_time;
} __cacheline_aligned_in_smp;

DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);

void io_schedule(void)
{
	struct cpu_idletime *it = &__raw_get_cpu_var(cpu_idletime);

	write_seqlock(&it->seqlock);
	if (!it->nr_iowait++)
		it->start = local_clock();
	write_sequnlock(&it->seqlock);

	current->in_iowait = 1;
	schedule();
	current->in_iowait = 0;

	write_seqlock(&it->seqlock);
	if (!--it->nr_iowait)
		it->iowait_time += local_clock() - it->start;
	write_sequnlock(&it->seqlock);
}

Afaict you don't need the preempt disable and atomic muck at all.

It will all get a little more complicated to deal with overlapping idle
and iowait times, but the idea is the same.
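
For completeness, a matching reader for this layout might look roughly like the
sketch below (not part of the original mail; get_cpu_iowait_time() is a
hypothetical name, and comparing a remote CPU's local_clock() timestamps is
only approximate):

u64 get_cpu_iowait_time(int cpu)
{
	struct cpu_idletime *it = &per_cpu(cpu_idletime, cpu);
	unsigned int seq;
	u64 iowait;

	do {
		seq = read_seqbegin(&it->seqlock);
		iowait = it->iowait_time;
		/* An iowait period is still running: add the pending delta */
		if (it->nr_iowait)
			iowait += local_clock() - it->start;
	} while (read_seqretry(&it->seqlock, seq));

	return iowait;
}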

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 15:56                                       ` Peter Zijlstra
@ 2013-10-01 16:47                                         ` Frederic Weisbecker
  2013-10-01 16:59                                           ` Peter Zijlstra
  0 siblings, 1 reply; 68+ messages in thread
From: Frederic Weisbecker @ 2013-10-01 16:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Oct 01, 2013 at 05:56:33PM +0200, Peter Zijlstra wrote:
> 
> So what's wrong with something like:
> 
> struct cpu_idletime {
>        seqlock_t seqlock;
>        unsigned long nr_iowait;
>        u64 start;
>        u64 idle_time;
>        u64 iowait_time;
> } __cacheline_aligned_in_smp;
> 
> DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);
> 
> void io_schedule(void)
> {
> 	struct cpu_idletime *it = &__raw_get_cpu_var(cpu_idletime);
> 
> 	write_seqlock(&it->seqlock);
> 	if (!it->nr_iowait++)
> 		it->start = local_clock();
> 	write_sequnlock(&it->seqlock);
> 
> 	current->in_iowait = 1;
> 	schedule();
> 	current->in_iowait = 0;
> 
> 	write_seqlock(&it->seqlock);
> 	if (!--it->nr_iowait)
> 		it->iowait_time += local_clock() - it->start;
> 	write_sequnlock(&it->seqlock);
> }
> 
> Afaict you don't need the preempt disable and atomic muck at all.

Yeah, thinking more about it, the preempt disable was probably not
necessary. Now that's trading 2 atomics + 1 Lock/Unlock for 2 Lock/Unlock.

OTOH it computes iowait time separately from idle time, so we probably don't
need to lock the idle time anymore.

Plus this solution is much, much simpler. So if nobody sees a flaw there,
I'll try this.

Thanks.

> 
> It will all get a little more complicated to deal with overlapping idle
> and iowait times, but the idea is the same.

Probably no big deal.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 16:47                                         ` Frederic Weisbecker
@ 2013-10-01 16:59                                           ` Peter Zijlstra
  2013-10-02 12:45                                             ` Frederic Weisbecker
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2013-10-01 16:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Oct 01, 2013 at 06:47:10PM +0200, Frederic Weisbecker wrote:
> Yeah, thinking more about it, the preempt disable was probably not
> necessary. Now that's trading 2 atomics + 1 Lock/Unlock for 2 Lock/Unlock.

It trades the current 2 atomics for 2 LOCK/UNLOCK. And on x86_64 that's
2 atomics.

So all we get is some extra branches, which we must hope don't
actually mess things up too badly.

Ohh.. wait a sec.. we also call local_clock() which includes another
atomic :/ Bugger.. 



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 16:59                                           ` Peter Zijlstra
@ 2013-10-02 12:45                                             ` Frederic Weisbecker
  2013-10-02 12:50                                               ` Peter Zijlstra
  2013-10-02 14:35                                               ` Arjan van de Ven
  0 siblings, 2 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-10-02 12:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Oct 01, 2013 at 06:59:57PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 06:47:10PM +0200, Frederic Weisbecker wrote:
> > Yeah, thinking more about it, the preempt disable was probably not
> > necessary. Now that's trading 2 atomics + 1 Lock/Unlock for 2 Lock/Unlock.
> 
> It trades the current 2 atomics for 2 LOCK/UNLOCK. And on x86_64 that's
> 2 atomics.

Do you mean 2 atomics for LOCK/UNLOCK? Or is that pair optimized somehow
in x86?

> 
> So all we get is some extra branches, which we must hope don't
> actually mess things up too badly.
> 
> Ohh.. wait a sec.. we also call local_clock() which includes another
> atomic :/ Bugger.. 

Yeah. Anyway, I'm going to try something on top of that. Maybe we'll get
fresher minds and ideas on how to optimize it all afterwards.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-02 12:45                                             ` Frederic Weisbecker
@ 2013-10-02 12:50                                               ` Peter Zijlstra
  2013-10-02 14:35                                               ` Arjan van de Ven
  1 sibling, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2013-10-02 12:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Wed, Oct 02, 2013 at 02:45:41PM +0200, Frederic Weisbecker wrote:
> On Tue, Oct 01, 2013 at 06:59:57PM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 01, 2013 at 06:47:10PM +0200, Frederic Weisbecker wrote:
> > > Yeah, thinking more about it, the preempt disable was probably not
> > > necessary. Now that's trading 2 atomics + 1 Lock/Unlock for 2 Lock/Unlock.
> > 
> > It trades the current 2 atomics for 2 LOCK/UNLOCK. And on x86_64 that's
> > 2 atomics.
> 
> Do you mean 2 atomics for LOCK/UNLOCK? Or is that pair optimized somehow
> in x86?

Unlock isn't an atomic on x86_64; it can be on certain i386 builds. See
UNLOCK_LOCK_PREFIX.

> > 
> > So all we get is some extra branches, which we must hope don't
> > actually mess things up too badly.
> > 
> > Ohh.. wait a sec.. we also call local_clock() which includes another
> > atomic :/ Bugger.. 
> 
> Yeah. Anyway, I'm going to try something on top of that. Maybe we'll get
> fresher minds and ideas on how to optimize it all afterwards.

Fair enough.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-02 12:45                                             ` Frederic Weisbecker
  2013-10-02 12:50                                               ` Peter Zijlstra
@ 2013-10-02 14:35                                               ` Arjan van de Ven
  2013-10-02 16:01                                                 ` Frederic Weisbecker
  1 sibling, 1 reply; 68+ messages in thread
From: Arjan van de Ven @ 2013-10-02 14:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Oleg Nesterov, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On 10/2/2013 5:45 AM, Frederic Weisbecker wrote:
> On Tue, Oct 01, 2013 at 06:59:57PM +0200, Peter Zijlstra wrote:
>> On Tue, Oct 01, 2013 at 06:47:10PM +0200, Frederic Weisbecker wrote:
>>> Yeah, thinking more about it, the preempt disable was probably not
>>> necessary. Now that's trading 2 atomics + 1 Lock/Unlock for 2 Lock/Unlock.
>>
>> It trades the current 2 atomics for 2 LOCK/UNLOCK. And on x86_64 that's
>> 2 atomics.
>
> Do you mean 2 atomics for LOCK/UNLOCK? Or is that pair optimized somehow
> in x86?

unlock is not actually an atomic.

and on some modern machines, neither is the lock, for the uncontended case ;-)


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-02 14:35                                               ` Arjan van de Ven
@ 2013-10-02 16:01                                                 ` Frederic Weisbecker
  0 siblings, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2013-10-02 16:01 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Peter Zijlstra, Oleg Nesterov, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Wed, Oct 02, 2013 at 07:35:49AM -0700, Arjan van de Ven wrote:
> On 10/2/2013 5:45 AM, Frederic Weisbecker wrote:
> >On Tue, Oct 01, 2013 at 06:59:57PM +0200, Peter Zijlstra wrote:
> >>On Tue, Oct 01, 2013 at 06:47:10PM +0200, Frederic Weisbecker wrote:
> >>>Yeah, thinking more about it, the preempt disable was probably not
> >>>necessary. Now that's trading 2 atomics + 1 Lock/Unlock for 2 Lock/Unlock.
> >>
> >>It trades the current 2 atomics for 2 LOCK/UNLOCK. And on x86_64 that's
> >>2 atomics.
> >
> >Do you mean 2 atomics for LOCK/UNLOCK? Or is that pair optimized somehow
> >in x86?
> 
> unlock is not actually an atomic.
> 
> and on some modern machines, neither is the lock, for the uncontended case ;-)
> 

Yeah, we never run short of black magic as time goes by :)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2014-04-24 18:45 [PATCH 1/4] nohz: Only update sleeptime stats locally Denys Vlasenko
@ 2014-04-24 18:45 ` Denys Vlasenko
  0 siblings, 0 replies; 68+ messages in thread
From: Denys Vlasenko @ 2014-04-24 18:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov

This patch is adapted from a very similar patch
by Frederic Weisbecker. The commit message text is also mostly
from Frederic's patch.

When some call site uses get_cpu_*_time_us() to read a sleeptime
stat, it deduces the total sleeptime by adding the pending time
to the last sleeptime snapshot if the CPU target is idle.

Namely this sums up to:

       sleeptime = ts($CPU)->idle_sleeptime;
       if (ts($CPU)->idle_active)
       	  sleeptime += NOW() - ts($CPU)->idle_entrytime

But this only works if idle_sleeptime, idle_entrytime and idle_active are
read and updated in a disciplined order.

Let's consider the following scenario:

             CPU 0                                            CPU 1

  (seq 1)    ts(CPU 0)->idle_active = 1
             ts(CPU 0)->idle_entrytime = NOW()

  (seq 2)    sleeptime = NOW() - ts(CPU 0)->idle_entrytime
             ts(CPU 0)->idle_sleeptime += sleeptime           sleeptime = ts(CPU 0)->idle_sleeptime;
                                                              if (ts(CPU 0)->idle_active)
             ts(CPU 0)->idle_entrytime = NOW()                    sleeptime += NOW() - ts(CPU 0)->idle_entrytime

The resulting value of sleeptime on CPU 1 can vary depending on the
ordering scenario:

* If it sees the value of idle_entrytime after seq 1 and the value of idle_sleeptime
after seq 2, the value of sleeptime will be buggy because it accounts for the delta twice,
so it will be too high.

* If it sees the value of idle_entrytime after seq 2 and the value of idle_sleeptime
after seq 1, the value of sleeptime will be buggy because it misses the delta, so it
will be too low.

* If it sees the value of idle_entrytime and idle_sleeptime, both as seen after seq 1 or 2,
the value will be correct.

Trickier scenarios can also happen if the idle_active value is read from a former sequence.

Hence we must honour the following constraints:

- idle_sleeptime, idle_active and idle_entrytime must be updated and read
under some correctly enforced SMP ordering

- The three variable values as read by CPU 1 must belong to the same update
sequence from CPU 0. The update sequences must be delimited such that the
resulting three values after a sequence completion produce a coherent result
together when read from CPU 1.

- We need to prevent readers from fetching middle-of-sequence values.

The ideal solution to implement this synchronization is to use a seqcount. Let's
use one here around these three values to enforce sequence synchronization between
updates and reads.

This fixes potential cpufreq governor bugs.
It also works towards fixing the reported bug where non-monotonic sleeptime stats
are returned by /proc/stat when it is read frequently.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/tick.h     |  1 +
 kernel/time/tick-sched.c | 37 ++++++++++++++++++++++++++-----------
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index b84773c..4de1f9e 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -67,6 +67,7 @@ struct tick_sched {
 	ktime_t				idle_exittime;
 	ktime_t				idle_sleeptime;
 	ktime_t				iowait_sleeptime;
+	seqcount_t			idle_sleeptime_seq;
 	ktime_t				sleep_length;
 	unsigned long			last_jiffies;
 	unsigned long			next_jiffies;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7d86a54..8f0f2ee 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -411,12 +411,14 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 	ktime_t delta;
 
 	/* Updates the per cpu time idle statistics counters */
+	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	delta = ktime_sub(now, ts->idle_entrytime);
 	if (nr_iowait_cpu(smp_processor_id()) > 0)
 		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
 	else
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
 	ts->idle_active = 0;
+	write_seqcount_end(&ts->idle_sleeptime_seq);
 
 	sched_clock_idle_wakeup_event(0);
 }
@@ -425,9 +427,13 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 {
 	ktime_t now = ktime_get();
 
+	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	ts->idle_entrytime = now;
 	ts->idle_active = 1;
+	write_seqcount_end(&ts->idle_sleeptime_seq);
+
 	sched_clock_idle_sleep_event();
+
 	return now;
 }
 
@@ -449,6 +455,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	ktime_t now, idle;
+	unsigned int seq;
 
 	if (!tick_nohz_active)
 		return -1;
@@ -457,15 +464,18 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
-	if (ts->idle_active && !nr_iowait_cpu(cpu)) {
-		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-		idle = ktime_add(ts->idle_sleeptime, delta);
-	} else {
+	do {
+		ktime_t delta;
+
+		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 		idle = ts->idle_sleeptime;
-	}
+		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+			delta = ktime_sub(now, ts->idle_entrytime);
+			idle = ktime_add(idle, delta);
+		}
+	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
 	return ktime_to_us(idle);
-
 }
 EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
 
@@ -487,6 +497,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	ktime_t now, iowait;
+	unsigned int seq;
 
 	if (!tick_nohz_active)
 		return -1;
@@ -495,12 +506,16 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
-	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-		iowait = ktime_add(ts->iowait_sleeptime, delta);
-	} else {
+	do {
+		ktime_t delta;
+
+		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 		iowait = ts->iowait_sleeptime;
-	}
+		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+			delta = ktime_sub(now, ts->idle_entrytime);
+			iowait = ktime_add(ts->iowait_sleeptime, delta);
+		}
+	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
 	return ktime_to_us(iowait);
 }
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 1/4] nohz: Only update sleeptime stats locally
@ 2014-05-07 13:41 Denys Vlasenko
  2014-05-07 13:41 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Denys Vlasenko
                   ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Denys Vlasenko @ 2014-05-07 13:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Frederic Weisbecker, Denys Vlasenko, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov

From: Frederic Weisbecker <fweisbec@gmail.com>

The idle and io sleeptime stats can be updated concurrently from callers
of get_cpu_idle_time_us(), get_cpu_iowait_time_us() and
tick_nohz_stop_idle().

Updaters can easily race and mess up the internal data coherency,
for example when a governor calls a get_cpu_*_time_us() API and the
target CPU exits idle at the same time, because no locking whatsoever
is there to protect against this concurrency.

To fix this, let's only update the sleeptime stats locally when the CPU
exits from idle. This is the only place where a real update is truly
needed. The callers of get_cpu_*_time_us() can simply add up the pending
sleep time delta to the last sleeptime snapshot in order to get a coherent
result. There is no need for them to also update the stats.

Rebased to Linus' tree by Denys Vlasenko.

Reported-by: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 kernel/time/tick-sched.c | 63 ++++++++++++++++--------------------------------
 1 file changed, 21 insertions(+), 42 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6558b7a..7d86a54 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -406,31 +406,16 @@ static void tick_nohz_update_jiffies(ktime_t now)
 	touch_softlockup_watchdog();
 }
 
-/*
- * Updates the per cpu time idle statistics counters
- */
-static void
-update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
+static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
 	ktime_t delta;
 
-	if (ts->idle_active) {
-		delta = ktime_sub(now, ts->idle_entrytime);
-		if (nr_iowait_cpu(cpu) > 0)
-			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
-		else
-			ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
-		ts->idle_entrytime = now;
-	}
-
-	if (last_update_time)
-		*last_update_time = ktime_to_us(now);
-
-}
-
-static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
-{
-	update_ts_time_stats(smp_processor_id(), ts, now, NULL);
+	/* Updates the per cpu time idle statistics counters */
+	delta = ktime_sub(now, ts->idle_entrytime);
+	if (nr_iowait_cpu(smp_processor_id()) > 0)
+		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
+	else
+		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
 	ts->idle_active = 0;
 
 	sched_clock_idle_wakeup_event(0);
@@ -469,17 +454,14 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 		return -1;
 
 	now = ktime_get();
-	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
-		idle = ts->idle_sleeptime;
-	} else {
-		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+	if (last_update_time)
+		*last_update_time = ktime_to_us(now);
 
-			idle = ktime_add(ts->idle_sleeptime, delta);
-		} else {
-			idle = ts->idle_sleeptime;
-		}
+	if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+		idle = ktime_add(ts->idle_sleeptime, delta);
+	} else {
+		idle = ts->idle_sleeptime;
 	}
 
 	return ktime_to_us(idle);
@@ -510,17 +492,14 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 		return -1;
 
 	now = ktime_get();
-	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
-		iowait = ts->iowait_sleeptime;
-	} else {
-		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+	if (last_update_time)
+		*last_update_time = ktime_to_us(now);
 
-			iowait = ktime_add(ts->iowait_sleeptime, delta);
-		} else {
-			iowait = ts->iowait_sleeptime;
-		}
+	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+		iowait = ktime_add(ts->iowait_sleeptime, delta);
+	} else {
+		iowait = ts->iowait_sleeptime;
 	}
 
 	return ktime_to_us(iowait);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2014-05-07 13:41 [PATCH 1/4] nohz: Only update sleeptime stats locally Denys Vlasenko
@ 2014-05-07 13:41 ` Denys Vlasenko
  2014-05-07 13:41 ` [PATCH 3/4 v2] nohz: Fix idle/iowait counts going backwards Denys Vlasenko
  2014-05-07 13:41 ` [PATCH 4/4 v2] nohz: Fix iowait overcounting if iowait task migrates Denys Vlasenko
  2 siblings, 0 replies; 68+ messages in thread
From: Denys Vlasenko @ 2014-05-07 13:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov

This patch is adapted from a very similar patch
by Frederic Weisbecker. The commit message text is also mostly
from Frederic's patch.

When some call site uses get_cpu_*_time_us() to read a sleeptime
stat, it deduces the total sleeptime by adding the pending time
to the last sleeptime snapshot if the CPU target is idle.

Namely this sums up to:

       sleeptime = ts($CPU)->idle_sleeptime;
       if (ts($CPU)->idle_active)
       	  sleeptime += NOW() - ts($CPU)->idle_entrytime

But this only works if idle_sleeptime, idle_entrytime and idle_active are
read and updated in a disciplined order.

Let's consider the following scenario:

             CPU 0                                            CPU 1

  (seq 1)    ts(CPU 0)->idle_active = 1
             ts(CPU 0)->idle_entrytime = NOW()

  (seq 2)    sleeptime = NOW() - ts(CPU 0)->idle_entrytime
             ts(CPU 0)->idle_sleeptime += sleeptime           sleeptime = ts(CPU 0)->idle_sleeptime;
                                                              if (ts(CPU 0)->idle_active)
             ts(CPU 0)->idle_entrytime = NOW()                    sleeptime += NOW() - ts(CPU 0)->idle_entrytime

The resulting value of sleeptime on CPU 1 can vary depending on the
ordering scenario:

* If it sees the value of idle_entrytime after seq 1 and the value of idle_sleeptime
after seq 2, the value of sleeptime will be buggy because it accounts for the delta twice,
so it will be too high.

* If it sees the value of idle_entrytime after seq 2 and the value of idle_sleeptime
after seq 1, the value of sleeptime will be buggy because it misses the delta, so it
will be too low.

* If it sees the value of idle_entrytime and idle_sleeptime, both as seen after seq 1 or 2,
the value will be correct.

Trickier scenarios can also happen if the idle_active value is read from a former sequence.

Hence we must honour the following constraints:

- idle_sleeptime, idle_active and idle_entrytime must be updated and read
under some correctly enforced SMP ordering

- The three variable values as read by CPU 1 must belong to the same update
sequence from CPU 0. The update sequences must be delimited such that the
resulting three values after a sequence completion produce a coherent result
together when read from CPU 1.

- We need to prevent readers from fetching middle-of-sequence values.

The ideal solution to implement this synchronization is to use a seqcount. Let's
use one here around these three values to enforce sequence synchronization between
updates and reads.

This fixes potential cpufreq governor bugs.
It also works towards fixing the reported bug where non-monotonic sleeptime stats
are returned by /proc/stat when it is read frequently.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/tick.h     |  1 +
 kernel/time/tick-sched.c | 37 ++++++++++++++++++++++++++-----------
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index b84773c..4de1f9e 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -67,6 +67,7 @@ struct tick_sched {
 	ktime_t				idle_exittime;
 	ktime_t				idle_sleeptime;
 	ktime_t				iowait_sleeptime;
+	seqcount_t			idle_sleeptime_seq;
 	ktime_t				sleep_length;
 	unsigned long			last_jiffies;
 	unsigned long			next_jiffies;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7d86a54..8f0f2ee 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -411,12 +411,14 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 	ktime_t delta;
 
 	/* Updates the per cpu time idle statistics counters */
+	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	delta = ktime_sub(now, ts->idle_entrytime);
 	if (nr_iowait_cpu(smp_processor_id()) > 0)
 		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
 	else
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
 	ts->idle_active = 0;
+	write_seqcount_end(&ts->idle_sleeptime_seq);
 
 	sched_clock_idle_wakeup_event(0);
 }
@@ -425,9 +427,13 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 {
 	ktime_t now = ktime_get();
 
+	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	ts->idle_entrytime = now;
 	ts->idle_active = 1;
+	write_seqcount_end(&ts->idle_sleeptime_seq);
+
 	sched_clock_idle_sleep_event();
+
 	return now;
 }
 
@@ -449,6 +455,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	ktime_t now, idle;
+	unsigned int seq;
 
 	if (!tick_nohz_active)
 		return -1;
@@ -457,15 +464,18 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
-	if (ts->idle_active && !nr_iowait_cpu(cpu)) {
-		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-		idle = ktime_add(ts->idle_sleeptime, delta);
-	} else {
+	do {
+		ktime_t delta;
+
+		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 		idle = ts->idle_sleeptime;
-	}
+		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+			delta = ktime_sub(now, ts->idle_entrytime);
+			idle = ktime_add(idle, delta);
+		}
+	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
 	return ktime_to_us(idle);
-
 }
 EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
 
@@ -487,6 +497,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	ktime_t now, iowait;
+	unsigned int seq;
 
 	if (!tick_nohz_active)
 		return -1;
@@ -495,12 +506,16 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
-	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-		iowait = ktime_add(ts->iowait_sleeptime, delta);
-	} else {
+	do {
+		ktime_t delta;
+
+		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 		iowait = ts->iowait_sleeptime;
-	}
+		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+			delta = ktime_sub(now, ts->idle_entrytime);
+			iowait = ktime_add(ts->iowait_sleeptime, delta);
+		}
+	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
 	return ktime_to_us(iowait);
 }
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 3/4 v2] nohz: Fix idle/iowait counts going backwards
  2014-05-07 13:41 [PATCH 1/4] nohz: Only update sleeptime stats locally Denys Vlasenko
  2014-05-07 13:41 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Denys Vlasenko
@ 2014-05-07 13:41 ` Denys Vlasenko
  2014-05-07 14:23   ` Peter Zijlstra
  2014-05-07 13:41 ` [PATCH 4/4 v2] nohz: Fix iowait overcounting if iowait task migrates Denys Vlasenko
  2 siblings, 1 reply; 68+ messages in thread
From: Denys Vlasenko @ 2014-05-07 13:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov

With this change, "iowait-ness" of every idle period is decided
at the moment it starts:
if this CPU's run-queue had tasks waiting on I/O, then this idle
period's duration will be added to iowait_sleeptime.

This fixes the bug where iowait and/or idle counts could go backwards,
but iowait accounting is not precise (it can show more iowait
than there really is).

v2: Use symbolic constants for TRUE_IDLE and IOWAIT_IDLE.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 kernel/time/tick-sched.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 8f0f2ee..7d0e14a 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -406,6 +406,11 @@ static void tick_nohz_update_jiffies(ktime_t now)
 	touch_softlockup_watchdog();
 }
 
+enum {
+	TRUE_IDLE = 1,
+	IOWAIT_IDLE = 2,
+};
+
 static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
 	ktime_t delta;
@@ -413,7 +418,7 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 	/* Updates the per cpu time idle statistics counters */
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	delta = ktime_sub(now, ts->idle_entrytime);
-	if (nr_iowait_cpu(smp_processor_id()) > 0)
+	if (ts->idle_active == IOWAIT_IDLE)
 		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
 	else
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
@@ -429,7 +434,7 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	ts->idle_entrytime = now;
-	ts->idle_active = 1;
+	ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? IOWAIT_IDLE : TRUE_IDLE;
 	write_seqcount_end(&ts->idle_sleeptime_seq);
 
 	sched_clock_idle_sleep_event();
@@ -469,7 +474,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 
 		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 		idle = ts->idle_sleeptime;
-		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+		if (ts->idle_active == TRUE_IDLE) {
 			delta = ktime_sub(now, ts->idle_entrytime);
 			idle = ktime_add(idle, delta);
 		}
@@ -511,7 +516,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 
 		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 		iowait = ts->iowait_sleeptime;
-		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+		if (ts->idle_active == IOWAIT_IDLE) {
 			delta = ktime_sub(now, ts->idle_entrytime);
 			iowait = ktime_add(ts->iowait_sleeptime, delta);
 		}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 4/4 v2] nohz: Fix iowait overcounting if iowait task migrates
  2014-05-07 13:41 [PATCH 1/4] nohz: Only update sleeptime stats locally Denys Vlasenko
  2014-05-07 13:41 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Denys Vlasenko
  2014-05-07 13:41 ` [PATCH 3/4 v2] nohz: Fix idle/iowait counts going backwards Denys Vlasenko
@ 2014-05-07 13:41 ` Denys Vlasenko
  2 siblings, 0 replies; 68+ messages in thread
From: Denys Vlasenko @ 2014-05-07 13:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov

Before this change, if the last IO-blocked task wakes up
on a different CPU, the original CPU may stay idle for much longer,
and the entire time it stays idle is accounted as iowait time.

This change adds struct tick_sched::iowait_exittime member.
On entry to idle, it is set to KTIME_MAX.
The last IO-blocked task, if migrated, sets it to the current time.
Note that this can happen only once per idle period:
new iowaiting tasks can't magically appear on an idle CPU's rq.

If iowait_exittime is set, then (iowait_exittime - idle_entrytime)
gets accounted as iowait, and the remaining (now - iowait_exittime)
as "true" idle.

v2:
* Made iowait_exittime atomic64_t. This way, no locking
  is necessary when setting/accessing it.
* Do more paranoid checking of iowait_exittime before using it:
  now the code checks that idle_entrytime <= iowait_exittime
  and iowait_exittime <= now, and uses iowait_exittime only
  if both are true.

Run-tested: /proc/stat counters no longer go backwards.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/tick.h     |  2 ++
 kernel/sched/core.c      | 19 ++++++++--
 kernel/time/tick-sched.c | 91 +++++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 98 insertions(+), 14 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 4de1f9e..49f8b29 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -67,6 +67,7 @@ struct tick_sched {
 	ktime_t				idle_exittime;
 	ktime_t				idle_sleeptime;
 	ktime_t				iowait_sleeptime;
+	atomic64_t			iowait_exittime;
 	seqcount_t			idle_sleeptime_seq;
 	ktime_t				sleep_length;
 	unsigned long			last_jiffies;
@@ -140,6 +141,7 @@ extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
+extern void tick_nohz_iowait_to_idle(int cpu);
 
 # else /* !CONFIG_NO_HZ_COMMON */
 static inline int tick_nohz_tick_stopped(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 268a45e..3137980 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4208,6 +4208,21 @@ EXPORT_SYMBOL_GPL(yield_to);
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
  */
+#ifdef CONFIG_NO_HZ_COMMON
+static __sched void io_wait_end(struct rq *rq)
+{
+	if (atomic_dec_and_test(&rq->nr_iowait)) {
+		if (raw_smp_processor_id() != cpu_of(rq))
+			tick_nohz_iowait_to_idle(cpu_of(rq));
+	}
+}
+#else
+static inline void io_wait_end(struct rq *rq)
+{
+	atomic_dec(&rq->nr_iowait);
+}
+#endif
+
 void __sched io_schedule(void)
 {
 	struct rq *rq = raw_rq();
@@ -4218,7 +4233,7 @@ void __sched io_schedule(void)
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
+	io_wait_end(rq);
 	delayacct_blkio_end();
 }
 EXPORT_SYMBOL(io_schedule);
@@ -4234,7 +4249,7 @@ long __sched io_schedule_timeout(long timeout)
 	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
 	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
+	io_wait_end(rq);
 	delayacct_blkio_end();
 	return ret;
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7d0e14a..f3b214d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -411,17 +411,41 @@ enum {
 	IOWAIT_IDLE = 2,
 };
 
+/* We always access iowait_exittime atomically. */
+static inline ktime_t fetch_iowait_exittime(struct tick_sched *ts)
+{
+	ktime_t v;
+
+	v.tv64 = atomic64_read(&ts->iowait_exittime);
+	return v;
+}
+
 static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
-	ktime_t delta;
+	ktime_t start, delta, end;
 
 	/* Updates the per cpu time idle statistics counters */
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
-	delta = ktime_sub(now, ts->idle_entrytime);
-	if (ts->idle_active == IOWAIT_IDLE)
+	start = ts->idle_entrytime;
+	delta = ktime_sub(now, start);
+
+	if (ts->idle_active == IOWAIT_IDLE) {
+		/*
+		 * If last iowaiting task on our rq wakes up on another
+		 * CPU, it sets iowait_exittime.
+		 * It's the only case it can satisfy "start <= end <= now".
+		 */
+		end = fetch_iowait_exittime(ts);
+		if (ktime_compare(start, end) <= 0 && ktime_compare(end, now) <= 0) {
+			/* [end, now] goes to "true idle" counter */
+			delta = ktime_sub(now, end);
+			ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+			delta = ktime_sub(end, start);
+		}
 		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
-	else
+	} else {
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+	}
 	ts->idle_active = 0;
 	write_seqcount_end(&ts->idle_sleeptime_seq);
 
@@ -435,6 +459,7 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	ts->idle_entrytime = now;
 	ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? IOWAIT_IDLE : TRUE_IDLE;
+	atomic64_set(&ts->iowait_exittime, KTIME_MAX);
 	write_seqcount_end(&ts->idle_sleeptime_seq);
 
 	sched_clock_idle_sleep_event();
@@ -442,6 +467,14 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 	return now;
 }
 
+void tick_nohz_iowait_to_idle(int cpu)
+{
+	struct tick_sched *ts = tick_get_tick_sched(cpu);
+	ktime_t now = ktime_get();
+
+	atomic64_set(&ts->iowait_exittime, now.tv64);
+}
+
 /**
  * get_cpu_idle_time_us - get the total idle time of a cpu
  * @cpu: CPU number to query
@@ -458,7 +491,7 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
  */
 u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 {
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	struct tick_sched *ts;
 	ktime_t now, idle;
 	unsigned int seq;
 
@@ -469,14 +502,35 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
+	ts = &per_cpu(tick_cpu_sched, cpu);
+
 	do {
-		ktime_t delta;
+		ktime_t start, delta, iowait_exit;
 
 		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 		idle = ts->idle_sleeptime;
-		if (ts->idle_active == TRUE_IDLE) {
-			delta = ktime_sub(now, ts->idle_entrytime);
+
+		if (ts->idle_active /* either IOWAIT_IDLE or TRUE_IDLE */) {
+			start = ts->idle_entrytime;
+
+			if (ts->idle_active == IOWAIT_IDLE) {
+				/* This idle period started as iowait */
+
+				iowait_exit = fetch_iowait_exittime(ts);
+				if (ktime_compare(start, iowait_exit) > 0 ||
+				    ktime_compare(iowait_exit, now) > 0) {
+					/* And it still is (iowait_exit isn't set) */
+					goto skip;
+				}
+				/*
+				 * This CPU used to be "iowait idle", but iowait task
+				 * has migrated. The rest of idle time is "true idle":
+				 */
+				start = iowait_exit;
+			}
+			delta = ktime_sub(now, start);
 			idle = ktime_add(idle, delta);
+ skip: ;
 		}
 	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
@@ -500,7 +554,7 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
  */
 u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 {
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	struct tick_sched *ts;
 	ktime_t now, iowait;
 	unsigned int seq;
 
@@ -511,14 +565,27 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
+	ts = &per_cpu(tick_cpu_sched, cpu);
+
 	do {
-		ktime_t delta;
+		ktime_t start, delta, iowait_exit;
 
 		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 		iowait = ts->iowait_sleeptime;
+
 		if (ts->idle_active == IOWAIT_IDLE) {
-			delta = ktime_sub(now, ts->idle_entrytime);
-			iowait = ktime_add(ts->iowait_sleeptime, delta);
+			start = ts->idle_entrytime;
+			iowait_exit = fetch_iowait_exittime(ts);
+			/*
+			 * Did last iowaiting task on our rq wake up on other CPU
+			 * sometime in the past, and updated ts->iowait_exittime?
+			 */
+			if (ktime_compare(start, iowait_exit) > 0 ||
+			    ktime_compare(iowait_exit, now) > 0) {
+				iowait_exit = now; /* no */
+			}
+			delta = ktime_sub(iowait_exit, start);
+			iowait = ktime_add(iowait, delta);
 		}
 	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/4 v2] nohz: Fix idle/iowait counts going backwards
  2014-05-07 13:41 ` [PATCH 3/4 v2] nohz: Fix idle/iowait counts going backwards Denys Vlasenko
@ 2014-05-07 14:23   ` Peter Zijlstra
  2014-05-07 16:49     ` Denys Vlasenko
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2014-05-07 14:23 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Oleg Nesterov

On Wed, May 07, 2014 at 03:41:33PM +0200, Denys Vlasenko wrote:
> With this change, "iowait-ness" of every idle period is decided
> at the moment it starts:
> if this CPU's run-queue had tasks waiting on I/O, then this idle
> period's duration will be added to iowait_sleeptime.
> 
> This fixes the bug where iowait and/or idle counts could go backwards,
> but iowait accounting is not precise (it can show more iowait
> than there really is).
> 

NAK on this, the thing going backwards is a symptom of the bug, not an
actual bug itself.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/4 v2] nohz: Fix idle/iowait counts going backwards
  2014-05-07 14:23   ` Peter Zijlstra
@ 2014-05-07 16:49     ` Denys Vlasenko
  2014-05-07 16:56       ` Peter Zijlstra
  0 siblings, 1 reply; 68+ messages in thread
From: Denys Vlasenko @ 2014-05-07 16:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Oleg Nesterov

On 05/07/2014 04:23 PM, Peter Zijlstra wrote:
> On Wed, May 07, 2014 at 03:41:33PM +0200, Denys Vlasenko wrote:
>> With this change, "iowait-ness" of every idle period is decided
>> at the moment it starts:
>> if this CPU's run-queue had tasks waiting on I/O, then this idle
>> period's duration will be added to iowait_sleeptime.
>>
>> This fixes the bug where iowait and/or idle counts could go backwards,
>> but iowait accounting is not precise (it can show more iowait
>> than there really is).
>>
> 
> NAK on this, the thing going backwards is a symptom of the bug, not an
> actual bug itself.

This patch does fix that bug.

The bug is that in nohz_stop_idle(),
we base the decision to add accumulated time to
ts->iowait_sleeptime or to ts->idle_sleeptime
on nr_iowait_cpu(cpu):

        if (nr_iowait_cpu(smp_processor_id()) > 0)
                ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
        else
                ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);

and we use the same nr_iowait_cpu(cpu)
to calculate result of get_cpu_iowait_time_us():

               if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
                       delta = ktime_sub(now, ts->idle_entrytime);
                       iowait = ktime_add(ts->iowait_sleeptime, delta);
               }

This is wrong because nr_iowait_cpu(cpu) is not stable.
It could be != 0 in get_cpu_iowait_time_us() but later
become 0 when we are in nohz_stop_idle().

We must use consistent logic in these two functions.
If, based on nr_iowait_cpu(), the reader added the pending delta
to the iowait counter, the same should be done in nohz_stop_idle().

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/4 v2] nohz: Fix idle/iowait counts going backwards
  2014-05-07 16:49     ` Denys Vlasenko
@ 2014-05-07 16:56       ` Peter Zijlstra
  2014-05-07 18:24         ` Denys Vlasenko
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2014-05-07 16:56 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Oleg Nesterov

On Wed, May 07, 2014 at 06:49:47PM +0200, Denys Vlasenko wrote:
> On 05/07/2014 04:23 PM, Peter Zijlstra wrote:
> > On Wed, May 07, 2014 at 03:41:33PM +0200, Denys Vlasenko wrote:
> >> With this change, "iowait-ness" of every idle period is decided
> >> at the moment it starts:
> >> if this CPU's run-queue had tasks waiting on I/O, then this idle
> >> period's duration will be added to iowait_sleeptime.
> >>
> >> This fixes the bug where iowait and/or idle counts could go backwards,
> >> but iowait accounting is not precise (it can show more iowait
> >> than there really is).
> >>
> > 
> > NAK on this, the thing going backwards is a symptom of the bug, not an
> > actual bug itself.
> 
> This patch does fix that bug.

Which bug? There are two here:

 1) that NOHZ and !NOHZ iowait accounting aren't identical
 2) that iowait accounting in general is a steaming pile of crap

Those are the only two bugs around. If you want easy, you get to fix 1;
if you want hard, you try and fix 2.

iowait going backwards is a symptom of NOHZ iowait accounting being
broken -- as compared to !NOHZ.

Just stopping it from going backwards is not a fix for anything except
papering over symptoms, and if you think that's a good approach to
software engineering then I'll welcome you to my killfile.



* Re: [PATCH 3/4 v2] nohz: Fix idle/iowait counts going backwards
  2014-05-07 16:56       ` Peter Zijlstra
@ 2014-05-07 18:24         ` Denys Vlasenko
  2014-05-07 19:06           ` Peter Zijlstra
  0 siblings, 1 reply; 68+ messages in thread
From: Denys Vlasenko @ 2014-05-07 18:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Oleg Nesterov

On 05/07/2014 06:56 PM, Peter Zijlstra wrote:
> On Wed, May 07, 2014 at 06:49:47PM +0200, Denys Vlasenko wrote:
>> On 05/07/2014 04:23 PM, Peter Zijlstra wrote:
>>> On Wed, May 07, 2014 at 03:41:33PM +0200, Denys Vlasenko wrote:
>>>> With this change, "iowait-ness" of every idle period is decided
>>>> at the moment it starts:
>>>> if this CPU's run-queue had tasks waiting on I/O, then this idle
>>>> period's duration will be added to iowait_sleeptime.
>>>>
>>>> This fixes the bug where iowait and/or idle counts could go backwards,
>>>> but iowait accounting is not precise (it can show more iowait
>>>> than there really is).
>>>>
>>>
>>> NAK on this, the thing going backwards is a symptom of the bug, not an
>>> actual bug itself.
>>
>> This patch does fix that bug.
> 
> Which bug? There are two here:
> 
>  1) that NOHZ and !NOHZ iowait accounting aren't identical

They can hardly be identical, considering how different these modes are.

And they don't have to be identical, in fact.
It is enough if they give similar numbers for similar usage scenarios.
E.g. if I run dd </dev/sda >/dev/null, I expect the iowait
counter to increase while the idle counter stands almost still
(on the CPU where dd runs, of course).

>  2) that iowait accounting in general is a steaming pile of crap

If you want to nuke iowait (for example, make its counter constant 0),
I personally won't object. Can't guarantee others won't...


* Re: [PATCH 3/4 v2] nohz: Fix idle/iowait counts going backwards
  2014-05-07 18:24         ` Denys Vlasenko
@ 2014-05-07 19:06           ` Peter Zijlstra
  0 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2014-05-07 19:06 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Oleg Nesterov


On Wed, May 07, 2014 at 08:24:25PM +0200, Denys Vlasenko wrote:
> On 05/07/2014 06:56 PM, Peter Zijlstra wrote:
> > On Wed, May 07, 2014 at 06:49:47PM +0200, Denys Vlasenko wrote:
> >> On 05/07/2014 04:23 PM, Peter Zijlstra wrote:
> >>> On Wed, May 07, 2014 at 03:41:33PM +0200, Denys Vlasenko wrote:
> >>>> With this change, "iowait-ness" of every idle period is decided
> >>>> at the moment it starts:
> >>>> if this CPU's run-queue had tasks waiting on I/O, then this idle
> >>>> period's duration will be added to iowait_sleeptime.
> >>>>
> >>>> This fixes the bug where iowait and/or idle counts could go backwards,
> >>>> but iowait accounting is not precise (it can show more iowait
> >>>> than there really is).
> >>>>
> >>>
> >>> NAK on this, the thing going backwards is a symptom of the bug, not an
> >>> actual bug itself.
> >>
> >> This patch does fix that bug.
> > 
> > Which bug? There are two here:
> > 
> >  1) that NOHZ and !NOHZ iowait accounting aren't identical
> 
> They can hardly be identical, considering how different these modes are.

They can; we've managed it for pretty much everything else, although it's
not always easy.

And if you look at the patch I sent, it provides the exact moment the
task wakes up, so you can round that to the nearest jiffy boundary and
account appropriately, as if it were accounted on the per-cpu timer tick.
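
For illustration, a sketch of that rounding under stated assumptions
(the helper below is made up, not an existing kernel API; TICK_NSEC is
the real tick period in nanoseconds): count the tick boundaries that
fall inside the sleep interval, since a !NOHZ kernel would have sampled
iowait at each of them.

        /*
         * Hypothetical helper, for illustration only: how many per-cpu
         * tick boundaries lie inside [sleep_start_ns, wakeup_ns]?
         */
        static inline u64 ticks_in_sleep_interval(u64 sleep_start_ns, u64 wakeup_ns)
        {
                u64 first_tick = sleep_start_ns / TICK_NSEC + 1; /* first boundary after sleep start */
                u64 last_tick  = wakeup_ns / TICK_NSEC;          /* last boundary before wakeup */

                return last_tick >= first_tick ? last_tick - first_tick + 1 : 0;
        }

Each counted boundary would then get one tick's worth of iowait
attributed, mirroring the per-cpu sampling a periodic kernel does.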

Now, there are likely fun corner cases which need more TLC; see
kernel/sched/proc.c for the fun times we had with the global load avg.

> And they don't have to be identical, in fact.

Yes they have to, by definition. CONFIG_NOHZ should have no user-visible
difference (except, of course, the obvious ones: fewer interrupts and
ideally lower energy usage).

> >  2) that iowait accounting in general is a steaming pile of crap
> 
> If you want to nuke iowait (for example, make its counter constant 0),
> I personally won't object. Can't guarantee others won't...

I won't object to a constant 0, but then we have to do it irrespective
of NOHZ. It's not necessarily hopeless, though: I think we can have a
coherent definition of iowait, just most likely not a per-cpu one.

So for UP we have the very simple definition that any idle cycle while
there is a task waiting for io is accounted to iowait.

This definition can be 'trivially' extended to a global iowait,
though it is expensive to compute.

However, one can argue it's not correct to do that trivial extension,
since if there's only 1 task waiting for io, it could at most tie up 1
CPU's worth of idle time (but very emphatically not a specific cpu).
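
To make that constraint concrete, a purely illustrative sketch of the
rate such a global definition would integrate over time (names made up;
maintaining these two counts without a global serialization point is
exactly the expensive part):

        /*
         * Illustrative only: global iowait accrues at a rate of
         * min(idle cpus, tasks blocked on I/O), so one blocked task
         * ties up at most one cpu's worth of idle time.
         */
        static u64 iowait_accrued_ns(u64 interval_ns,
                                     unsigned int nr_idle_cpus,
                                     unsigned int nr_iowait_tasks)
        {
                unsigned int rate = min(nr_idle_cpus, nr_iowait_tasks);

                return (u64)rate * interval_ns; /* cpu-nanoseconds of iowait */
        }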

So somewhere in that space is, I think, a viable way to account iowait,
but the straightforward implementation (insofar as the eventual
definition will be straightforward to begin with) will likely be
prohibitively expensive to compute.

Then again, if you look at kernel/sched/proc.c (again) and at the
bloody mess we had to make for the global load avg accounting to work
with NOHZ, there might (or might just not) be a glimmer of hope that we
can pull this off in a scalable manner.

Like I said, a 'fun' problem :-)



end of thread

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
2014-05-07 13:41 [PATCH 1/4] nohz: Only update sleeptime stats locally Denys Vlasenko
2014-05-07 13:41 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Denys Vlasenko
2014-05-07 13:41 ` [PATCH 3/4 v2] nohz: Fix idle/iowait counts going backwards Denys Vlasenko
2014-05-07 14:23   ` Peter Zijlstra
2014-05-07 16:49     ` Denys Vlasenko
2014-05-07 16:56       ` Peter Zijlstra
2014-05-07 18:24         ` Denys Vlasenko
2014-05-07 19:06           ` Peter Zijlstra
2014-05-07 13:41 ` [PATCH 4/4 v2] nohz: Fix iowait overcounting if iowait task migrates Denys Vlasenko
  -- strict thread matches above, loose matches on Subject: below --
2014-04-24 18:45 [PATCH 1/4] nohz: Only update sleeptime stats locally Denys Vlasenko
2014-04-24 18:45 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Denys Vlasenko
2013-08-16 15:42 [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats Frederic Weisbecker
2013-08-16 15:42 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Frederic Weisbecker
2013-08-16 16:02   ` Oleg Nesterov
2013-08-16 16:20     ` Frederic Weisbecker
2013-08-16 16:26       ` Oleg Nesterov
2013-08-16 16:46         ` Frederic Weisbecker
2013-08-16 16:49           ` Oleg Nesterov
2013-08-16 17:12             ` Frederic Weisbecker
2013-08-18 16:36               ` Oleg Nesterov
2013-08-18 21:25                 ` Frederic Weisbecker
2013-08-19 10:58               ` Peter Zijlstra
2013-08-19 15:44                 ` Arjan van de Ven
2013-08-19 15:47                 ` Arjan van de Ven
2013-08-19 11:10           ` Peter Zijlstra
2013-08-19 11:15             ` Peter Zijlstra
2013-08-20  6:59             ` Fernando Luis Vázquez Cao
2013-08-20  8:44               ` Peter Zijlstra
2013-08-20 15:29                 ` Frederic Weisbecker
2013-08-20 15:33                   ` Arjan van de Ven
2013-08-20 15:35                     ` Frederic Weisbecker
2013-08-20 15:41                       ` Arjan van de Ven
2013-08-20 15:31                 ` Arjan van de Ven
2013-08-20 16:01                   ` Peter Zijlstra
2013-08-20 16:33                     ` Oleg Nesterov
2013-08-20 17:54                       ` Peter Zijlstra
2013-08-20 18:25                         ` Oleg Nesterov
2013-08-21  8:31                           ` Peter Zijlstra
2013-08-21 11:35                             ` Oleg Nesterov
2013-08-21 12:33                               ` Peter Zijlstra
2013-08-21 14:23                                 ` Peter Zijlstra
2013-08-21 16:41                                   ` Oleg Nesterov
2013-10-01 14:05                                     ` Frederic Weisbecker
2013-10-01 14:26                                       ` Frederic Weisbecker
2013-10-01 14:27                                         ` Frederic Weisbecker
2013-10-01 14:49                                         ` Frederic Weisbecker
2013-10-01 15:00                                       ` Peter Zijlstra
2013-10-01 15:21                                         ` Frederic Weisbecker
2013-10-01 15:56                                       ` Peter Zijlstra
2013-10-01 16:47                                         ` Frederic Weisbecker
2013-10-01 16:59                                           ` Peter Zijlstra
2013-10-02 12:45                                             ` Frederic Weisbecker
2013-10-02 12:50                                               ` Peter Zijlstra
2013-10-02 14:35                                               ` Arjan van de Ven
2013-10-02 16:01                                                 ` Frederic Weisbecker
2013-08-21 12:48                               ` Peter Zijlstra
2013-08-21 17:09                                 ` Oleg Nesterov
2013-08-21 18:31                                   ` Peter Zijlstra
2013-08-21 18:32                                     ` Oleg Nesterov
2013-08-20 22:18                       ` Frederic Weisbecker
2013-08-21 11:49                         ` Oleg Nesterov
2013-08-20  6:21           ` Fernando Luis Vázquez Cao
2013-08-20 21:55             ` Frederic Weisbecker
2013-08-16 16:32     ` Frederic Weisbecker
2013-08-16 16:33       ` Oleg Nesterov
2013-08-16 16:49         ` Frederic Weisbecker
2013-08-16 16:37       ` Frederic Weisbecker
2013-08-18 16:54   ` Oleg Nesterov
2013-08-18 21:40     ` Frederic Weisbecker
2013-08-09  0:54 [PATCH 0/4] nohz: Fix racy sleeptime stats Frederic Weisbecker
2013-08-09  0:54 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Frederic Weisbecker
