[PATCH] sched/cputime: fix clock_nanosleep/clock

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
@ 2014-11-12 10:29 Stanislaw Gruszka
  2014-11-12 11:15 ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Stanislaw Gruszka @ 2014-11-12 10:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro, Oleg Nesterov,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar

Commit d670ec13178d "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
test case in cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
of Y time being smaller than X time.

Below is full reproducer (tst-cpuclock2.c) :

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <stdint.h>
#include <inttypes.h>

/* Parameters for the Linux kernel ABI for CPU clocks.  */
#define CPUCLOCK_SCHED          2
#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
        ((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

static pthread_barrier_t barrier;

/* Help advance the clock.  */
static void *chew_cpu(void *arg)
{
	pthread_barrier_wait(&barrier);
	while (1) ;

	return NULL;
}

/* Don't use the glibc wrapper.  */
static int do_nanosleep(int flags, const struct timespec *req)
{
	clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

	return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
}

static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
{
	int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
	int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

	return after_i - before_i;
}

int main(void)
{
	int result = 0;
	pthread_t th;

	pthread_barrier_init(&barrier, NULL, 2);

	if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
		perror("pthread_create");
		return 1;
	}

	pthread_barrier_wait(&barrier);

	/* The test.  */
	struct timespec before, after, sleeptimeabs;
	int64_t sleepdiff, diffabs;
	const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };

	/* The relative nanosleep.  Not sure why this is needed, but its presence
	   seems to make it easier to reproduce the problem.  */
	if (do_nanosleep(0, &sleeptime) != 0) {
		perror("clock_nanosleep");
		return 1;
	}

	/* Get the current time.  */
	if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
		perror("clock_gettime[2]");
		return 1;
	}

	/* Compute the absolute sleep time based on the current time.  */
	uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
	sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
	sleeptimeabs.tv_nsec = nsec % 1000000000;

	/* Sleep for the computed time.  */
	if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
		perror("absolute clock_nanosleep");
		return 1;
	}

	/* Get the time after the sleep.  */
	if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
		perror("clock_gettime[3]");
		return 1;
	}

	/* The time after sleep should always be equal to or after the absolute sleep
	   time passed to clock_nanosleep.  */
	sleepdiff = tsdiff(&sleeptimeabs, &after);
	if (sleepdiff < 0) {
		printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
		result = 1;

		printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
		printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
		printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
	}

	/* The difference between the timestamps taken before and after the
	   clock_nanosleep call should be equal to or more than the duration of the
	   sleep.  */
	diffabs = tsdiff(&before, &after);
	if (diffabs < sleeptime.tv_nsec) {
		printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
		result = 1;
	}

	pthread_cancel(th);

	return result;
}

It can be compiled and ran by:

gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
while ./tst-cpuclock2 ; do : ; done

Issue happens because on start in thread_group_cputimer() we initialize
sum_exec_runtime of cputimer with threads runtime not yet accounted and
then add the threads runtime again on scheduler tick. When cputimer
finish, it's sum_exec_runtime value is bigger than current sum counted
by iterating over the threads in thread_group_cputime().

KOSAKI Motohiro posted fix for this problem, but that patch was never
applied: https://lkml.org/lkml/2013/5/26/191.

This patch takes different approach. It revert change from d670ec13178d,
but to assure process times are in sync with thread times, before
accounting the times it calls update_curr() to update current thread
sum_exec_runtime. This fixes the test case from commit d670ec13178d and
also make things like they were before i.e. process cpu times can be
(nr_threads - 1) * TICK_NSEC behind the clock (this is a drawback
unfortunately). Good news is that patch improve performance of
thread_group_cputime(), i.e. we do not need optimizations from commit
911b289 "sched: Optimize task_sched_runtime()"

Note I'm not familiar with scheduler subsystem, hence I'm not sure if
calling update_curr() will not affect scheduler functionality somehow.

Cc: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 include/linux/kernel_stat.h    |  5 +--
 kernel/sched/core.c            | 69 +++++-------------------------------------
 kernel/sched/cputime.c         |  2 +-
 kernel/sched/deadline.c        |  2 ++
 kernel/sched/fair.c            |  7 +++++
 kernel/sched/rt.c              |  2 ++
 kernel/sched/sched.h           |  2 ++
 kernel/time/posix-cpu-timers.c |  7 +++--
 8 files changed, 27 insertions(+), 69 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 8422b4e..d5bf373 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -77,10 +77,7 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
 	return kstat_cpu(cpu).irqs_sum;
 }
 
-/*
- * Lock/unlock the current runqueue - to extract task statistics:
- */
-extern unsigned long long task_delta_exec(struct task_struct *);
+extern void scheduler_update_curr(struct task_struct *);
 
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 240157c..117c852 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2475,75 +2475,20 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
 
 /*
- * Return any ns on the sched_clock that have not yet been accounted in
- * @p in case that task is currently running.
- *
- * Called with task_rq_lock() held on @rq.
+ * If the task is currently running, update the statistics. Main purpose
+ * of this function is assure that the accounted runtime is updated.
  */
-static u64 do_task_delta_exec(struct task_struct *p, struct rq *rq)
-{
-	u64 ns = 0;
-
-	/*
-	 * Must be ->curr _and_ ->on_rq.  If dequeued, we would
-	 * project cycles that may never be accounted to this
-	 * thread, breaking clock_gettime().
-	 */
-	if (task_current(rq, p) && task_on_rq_queued(p)) {
-		update_rq_clock(rq);
-		ns = rq_clock_task(rq) - p->se.exec_start;
-		if ((s64)ns < 0)
-			ns = 0;
-	}
-
-	return ns;
-}
-
-unsigned long long task_delta_exec(struct task_struct *p)
+void scheduler_update_curr(struct task_struct *p)
 {
 	unsigned long flags;
 	struct rq *rq;
-	u64 ns = 0;
 
 	rq = task_rq_lock(p, &flags);
-	ns = do_task_delta_exec(p, rq);
-	task_rq_unlock(rq, p, &flags);
-
-	return ns;
-}
-
-/*
- * Return accounted runtime for the task.
- * In case the task is currently running, return the runtime plus current's
- * pending runtime that have not been accounted yet.
- */
-unsigned long long task_sched_runtime(struct task_struct *p)
-{
-	unsigned long flags;
-	struct rq *rq;
-	u64 ns = 0;
-
-#if defined(CONFIG_64BIT) && defined(CONFIG_SMP)
-	/*
-	 * 64-bit doesn't need locks to atomically read a 64bit value.
-	 * So we have a optimization chance when the task's delta_exec is 0.
-	 * Reading ->on_cpu is racy, but this is ok.
-	 *
-	 * If we race with it leaving cpu, we'll take a lock. So we're correct.
-	 * If we race with it entering cpu, unaccounted time is 0. This is
-	 * indistinguishable from the read occurring a few cycles earlier.
-	 * If we see ->on_cpu without ->on_rq, the task is leaving, and has
-	 * been accounted, so we're correct here as well.
-	 */
-	if (!p->on_cpu || !task_on_rq_queued(p))
-		return p->se.sum_exec_runtime;
-#endif
-
-	rq = task_rq_lock(p, &flags);
-	ns = p->se.sum_exec_runtime + do_task_delta_exec(p, rq);
+	if (task_current(rq, p) && task_on_rq_queued(p)) {
+		update_rq_clock(rq);
+		p->sched_class->update_curr(rq);
+	}
 	task_rq_unlock(rq, p, &flags);
-
-	return ns;
 }
 
 /*
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 8394b1e..afcf114 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -305,7 +305,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
-			times->sum_exec_runtime += task_sched_runtime(t);
+			times->sum_exec_runtime += t->se.sum_exec_runtime;
 		}
 		/* If lockless access failed, take the lock. */
 		nextseq = 1;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5285332..28fa9d9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1701,4 +1701,6 @@ const struct sched_class dl_sched_class = {
 	.prio_changed           = prio_changed_dl,
 	.switched_from		= switched_from_dl,
 	.switched_to		= switched_to_dl,
+
+	.update_curr		= update_curr_dl,
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 34baa60..99995e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -726,6 +726,11 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
+static void update_curr_rq(struct rq *rq)
+{
+	update_curr(&rq->cfs);
+}
+
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -7949,6 +7954,8 @@ const struct sched_class fair_sched_class = {
 
 	.get_rr_interval	= get_rr_interval_fair,
 
+	.update_curr		= update_curr_rq,
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	.task_move_group	= task_move_group_fair,
 #endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d024e6c..20bca39 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2128,6 +2128,8 @@ const struct sched_class rt_sched_class = {
 
 	.prio_changed		= prio_changed_rt,
 	.switched_to		= switched_to_rt,
+
+	.update_curr		= update_curr_rt,
 };
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24156c84..2df8ef0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1135,6 +1135,8 @@ struct sched_class {
 	unsigned int (*get_rr_interval) (struct rq *rq,
 					 struct task_struct *task);
 
+	void (*update_curr) (struct rq *rq);
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	void (*task_move_group) (struct task_struct *p, int on_rq);
 #endif
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 492b986..c2a5401 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -190,7 +190,8 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
 		*sample = virt_ticks(p);
 		break;
 	case CPUCLOCK_SCHED:
-		*sample = task_sched_runtime(p);
+		scheduler_update_curr(p);
+		*sample = p->se.sum_exec_runtime;
 		break;
 	}
 	return 0;
@@ -221,6 +222,7 @@ void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times)
 		 * to synchronize the timer to the clock every time we start
 		 * it.
 		 */
+		scheduler_update_curr(tsk);
 		thread_group_cputime(tsk, &sum);
 		raw_spin_lock_irqsave(&cputimer->lock, flags);
 		cputimer->running = 1;
@@ -254,6 +256,7 @@ static int cpu_clock_sample_group(const clockid_t which_clock,
 		*sample = cputime_to_expires(cputime.utime);
 		break;
 	case CPUCLOCK_SCHED:
+		scheduler_update_curr(p);
 		thread_group_cputime(p, &cputime);
 		*sample = cputime.sum_exec_runtime;
 		break;
@@ -553,7 +556,7 @@ static int cpu_timer_sample_group(const clockid_t which_clock,
 		*sample = cputime_to_expires(cputime.utime);
 		break;
 	case CPUCLOCK_SCHED:
-		*sample = cputime.sum_exec_runtime + task_delta_exec(p);
+		*sample = cputime.sum_exec_runtime;
 		break;
 	}
 	return 0;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
  2014-11-12 10:29 [PATCH] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency Stanislaw Gruszka
@ 2014-11-12 11:15 ` Peter Zijlstra
  2014-11-12 11:37   ` Peter Zijlstra
  2014-11-12 12:21   ` [PATCH] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency Stanislaw Gruszka
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-11-12 11:15 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-kernel, Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro,
	Oleg Nesterov, Thomas Gleixner, Ingo Molnar

On Wed, Nov 12, 2014 at 11:29:28AM +0100, Stanislaw Gruszka wrote:
> Commit d670ec13178d "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
> test case in cost of breaking another one. After that commit, calling
> clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
> of Y time being smaller than X time.

<snip>

> Issue happens because on start in thread_group_cputimer() we initialize
> sum_exec_runtime of cputimer with threads runtime not yet accounted and
> then add the threads runtime again on scheduler tick. When cputimer
> finish, it's sum_exec_runtime value is bigger than current sum counted
> by iterating over the threads in thread_group_cputime().
> 
> KOSAKI Motohiro posted fix for this problem, but that patch was never
> applied: https://lkml.org/lkml/2013/5/26/191.
> 
> This patch takes different approach. It revert change from d670ec13178d,
> but to assure process times are in sync with thread times, before
> accounting the times it calls update_curr() to update current thread
> sum_exec_runtime. This fixes the test case from commit d670ec13178d and
> also make things like they were before i.e. process cpu times can be
> (nr_threads - 1) * TICK_NSEC behind the clock (this is a drawback
> unfortunately). 

> Good news is that patch improve performance of
> thread_group_cputime(), i.e. we do not need optimizations from commit
> 911b289 "sched: Optimize task_sched_runtime()"

You're inconsistent with your SHA1 lengths, use 12 chars.

So I'm not seeing how you've not reintroduced the issue from
d670ec13178d, afaict you simply do not take the delta of all the other
threads into account, which is why you're not taking as many locks.

Which also re-introduces the 32bit data race mentioned in that
changelog.

> Note I'm not familiar with scheduler subsystem, hence I'm not sure if
> calling update_curr() will not affect scheduler functionality somehow.

Well, it will, it has to. But we call it at any odd point anyhow so
there's really nothing too weird about that.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 240157c..117c852 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2475,75 +2475,20 @@ EXPORT_PER_CPU_SYMBOL(kstat);
>  EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
>  
>  /*
> + * If the task is currently running, update the statistics. Main purpose
> + * of this function is assure that the accounted runtime is updated.
>   */
> +void scheduler_update_curr(struct task_struct *p)
>  {
>  	unsigned long flags;
>  	struct rq *rq;
>  
>  	rq = task_rq_lock(p, &flags);
> +	if (task_current(rq, p) && task_on_rq_queued(p)) {
> +		update_rq_clock(rq);
> +		p->sched_class->update_curr(rq);
> +	}
>  	task_rq_unlock(rq, p, &flags);
>  }

Right so that name stinks, it updates 'p' not curr, and it really should
have that optimization in, there is no point in taking that lock if p
isn't actually on a cpu atm.

> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 8394b1e..afcf114 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -305,7 +305,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  			task_cputime(t, &utime, &stime);
>  			times->utime += utime;
>  			times->stime += stime;
> +			times->sum_exec_runtime += t->se.sum_exec_runtime;
>  		}
>  		/* If lockless access failed, take the lock. */
>  		nextseq = 1;

This re-introduces the 32bit fail.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 34baa60..99995e0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -726,6 +726,11 @@ static void update_curr(struct cfs_rq *cfs_rq)
>  	account_cfs_rq_runtime(cfs_rq, delta_exec);
>  }
>  
> +static void update_curr_rq(struct rq *rq)
> +{
> +	update_curr(&rq->cfs);
> +}
> +
>  static inline void
>  update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> @@ -7949,6 +7954,8 @@ const struct sched_class fair_sched_class = {
>  
>  	.get_rr_interval	= get_rr_interval_fair,
>  
> +	.update_curr		= update_curr_rq,
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	.task_move_group	= task_move_group_fair,
>  #endif

For consistency sake:

  s/update_curr_rq/update_curr_fair/g


> diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
> index 492b986..c2a5401 100644
> --- a/kernel/time/posix-cpu-timers.c
> +++ b/kernel/time/posix-cpu-timers.c
> @@ -190,7 +190,8 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
>  		*sample = virt_ticks(p);
>  		break;
>  	case CPUCLOCK_SCHED:
> -		*sample = task_sched_runtime(p);
> +		scheduler_update_curr(p);
> +		*sample = p->se.sum_exec_runtime;
>  		break;
>  	}
>  	return 0;

Same 32bit fail

> @@ -221,6 +222,7 @@ void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times)
>  		 * to synchronize the timer to the clock every time we start
>  		 * it.
>  		 */
> +		scheduler_update_curr(tsk);
>  		thread_group_cputime(tsk, &sum);
>  		raw_spin_lock_irqsave(&cputimer->lock, flags);
>  		cputimer->running = 1;

So here you only update the current thread, not the entire thread group.
But you've failed to explain how that does not re-introduce problems.

> @@ -254,6 +256,7 @@ static int cpu_clock_sample_group(const clockid_t which_clock,
>  		*sample = cputime_to_expires(cputime.utime);
>  		break;
>  	case CPUCLOCK_SCHED:
> +		scheduler_update_curr(p);
>  		thread_group_cputime(p, &cputime);
>  		*sample = cputime.sum_exec_runtime;
>  		break;

Why only these two thread_group_cputime() calls and not all others?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
  2014-11-12 11:15 ` Peter Zijlstra
@ 2014-11-12 11:37   ` Peter Zijlstra
  2014-11-12 11:45     ` Peter Zijlstra
  2014-11-16  9:50     ` [tip:sched/urgent] sched/cputime: Fix cpu_timer_sample_group() double accounting tip-bot for Peter Zijlstra
  2014-11-12 12:21   ` [PATCH] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency Stanislaw Gruszka
  1 sibling, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-11-12 11:37 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-kernel, Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro,
	Oleg Nesterov, Thomas Gleixner, Ingo Molnar

On Wed, Nov 12, 2014 at 12:15:53PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2014 at 11:29:28AM +0100, Stanislaw Gruszka wrote:

> > Issue happens because on start in thread_group_cputimer() we initialize
> > sum_exec_runtime of cputimer with threads runtime not yet accounted and
> > then add the threads runtime again on scheduler tick. When cputimer
> > finish, it's sum_exec_runtime value is bigger than current sum counted
> > by iterating over the threads in thread_group_cputime().

I'm not seeing how that can happen. We iterate each task once, for each
task we grab sum_exec_runtime + whatever delta.

It doesnt matter if a later interrupt or whatever folds the delta into
sum_exec_runtime, we never look at it again.

What I did found is that we appear to add the delta for the calling task
twice, through:

  cpu_timer_sample_group()
    thread_group_cputimer()
      thread_group_cputime()
        times->sum_exec_runtime += task_sched_runtime();

    *sample = cputime.sum_exec_runtime + task_delta_exec();

Which would make the sample run ahead, making the sleep short. So would
something like the below not cure things?

---
 include/linux/kernel_stat.h    |  5 -----
 kernel/sched/core.c            | 13 -------------
 kernel/time/posix-cpu-timers.c |  2 +-
 3 files changed, 1 insertion(+), 19 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 8422b4ed6882..b9376cd5a187 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -77,11 +77,6 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
 	return kstat_cpu(cpu).irqs_sum;
 }
 
-/*
- * Lock/unlock the current runqueue - to extract task statistics:
- */
-extern unsigned long long task_delta_exec(struct task_struct *);
-
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
 extern void account_steal_time(cputime_t);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5df22f1da07d..85ff99db2591 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2481,19 +2481,6 @@ static u64 do_task_delta_exec(struct task_struct *p, struct rq *rq)
 	return ns;
 }
 
-unsigned long long task_delta_exec(struct task_struct *p)
-{
-	unsigned long flags;
-	struct rq *rq;
-	u64 ns = 0;
-
-	rq = task_rq_lock(p, &flags);
-	ns = do_task_delta_exec(p, rq);
-	task_rq_unlock(rq, p, &flags);
-
-	return ns;
-}
-
 /*
  * Return accounted runtime for the task.
  * In case the task is currently running, return the runtime plus current's
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 492b986195d5..a16b67859e2a 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -553,7 +553,7 @@ static int cpu_timer_sample_group(const clockid_t which_clock,
 		*sample = cputime_to_expires(cputime.utime);
 		break;
 	case CPUCLOCK_SCHED:
-		*sample = cputime.sum_exec_runtime + task_delta_exec(p);
+		*sample = cputime.sum_exec_runtime;
 		break;
 	}
 	return 0;

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
  2014-11-12 11:37   ` Peter Zijlstra
@ 2014-11-12 11:45     ` Peter Zijlstra
  2014-11-12 12:27       ` Stanislaw Gruszka
  2014-11-16  9:50     ` [tip:sched/urgent] sched/cputime: Fix cpu_timer_sample_group() double accounting tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-11-12 11:45 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-kernel, Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro,
	Oleg Nesterov, Thomas Gleixner, Ingo Molnar

On Wed, Nov 12, 2014 at 12:37:37PM +0100, Peter Zijlstra wrote:

> Which would make the sample run ahead, making the sleep short. So would
> something like the below not cure things?

Before anyone asks, yes I tried running that 'reproducer' it doesn't.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
  2014-11-12 11:45     ` Peter Zijlstra
@ 2014-11-12 12:27       ` Stanislaw Gruszka
  2014-11-12 12:52         ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Stanislaw Gruszka @ 2014-11-12 12:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro,
	Oleg Nesterov, Thomas Gleixner, Ingo Molnar

On Wed, Nov 12, 2014 at 12:45:38PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2014 at 12:37:37PM +0100, Peter Zijlstra wrote:
> 
> > Which would make the sample run ahead, making the sleep short. So would
> > something like the below not cure things?
> 
> Before anyone asks, yes I tried running that 'reproducer' it doesn't.

I'd be surprised if the patch would help. Issue here happen at start of
cputimer. We set sum_sched_runtime value of cputimer using not yet
accounted threads runtime and then add that runtime values again to
running cputimer on tick, making it's sum_exec_runtime bigger than
actual threads runtime.

Stanislaw

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
  2014-11-12 12:27       ` Stanislaw Gruszka
@ 2014-11-12 12:52         ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-11-12 12:52 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-kernel, Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro,
	Oleg Nesterov, Thomas Gleixner, Ingo Molnar

On Wed, Nov 12, 2014 at 01:27:53PM +0100, Stanislaw Gruszka wrote:
> On Wed, Nov 12, 2014 at 12:45:38PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 12, 2014 at 12:37:37PM +0100, Peter Zijlstra wrote:
> > 
> > > Which would make the sample run ahead, making the sleep short. So would
> > > something like the below not cure things?
> > 
> > Before anyone asks, yes I tried running that 'reproducer' it doesn't.
> 
> I'd be surprised if the patch would help. Issue here happen at start of
> cputimer. We set sum_sched_runtime value of cputimer using not yet
> accounted threads runtime and then add that runtime values again to
> running cputimer on tick, making it's sum_exec_runtime bigger than
> actual threads runtime.

Ah yes, reading is hard.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [tip:sched/urgent] sched/cputime: Fix cpu_timer_sample_group() double accounting
  2014-11-12 11:37   ` Peter Zijlstra
  2014-11-12 11:45     ` Peter Zijlstra
@ 2014-11-16  9:50     ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-11-16  9:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, peterz, cl, tglx, fweisbec, mingo, hpa,
	kosaki.motohiro, riel, oleg, sgruszka, tj, torvalds

Commit-ID:  23cfa361f3e54a3e184a5e126bbbdd95f984881a
Gitweb:     http://git.kernel.org/tip/23cfa361f3e54a3e184a5e126bbbdd95f984881a
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 12 Nov 2014 12:37:37 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 16 Nov 2014 10:04:18 +0100

sched/cputime: Fix cpu_timer_sample_group() double accounting

While looking over the cpu-timer code I found that we appear to add
the delta for the calling task twice, through:

  cpu_timer_sample_group()
    thread_group_cputimer()
      thread_group_cputime()
        times->sum_exec_runtime += task_sched_runtime();

    *sample = cputime.sum_exec_runtime + task_delta_exec();

Which would make the sample run ahead, making the sleep short.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/kernel_stat.h    |  5 -----
 kernel/sched/core.c            | 13 -------------
 kernel/time/posix-cpu-timers.c |  2 +-
 3 files changed, 1 insertion(+), 19 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 8422b4e..b9376cd 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -77,11 +77,6 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
 	return kstat_cpu(cpu).irqs_sum;
 }
 
-/*
- * Lock/unlock the current runqueue - to extract task statistics:
- */
-extern unsigned long long task_delta_exec(struct task_struct *);
-
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
 extern void account_steal_time(cputime_t);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5f12ca6..797a6c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2499,19 +2499,6 @@ static u64 do_task_delta_exec(struct task_struct *p, struct rq *rq)
 	return ns;
 }
 
-unsigned long long task_delta_exec(struct task_struct *p)
-{
-	unsigned long flags;
-	struct rq *rq;
-	u64 ns = 0;
-
-	rq = task_rq_lock(p, &flags);
-	ns = do_task_delta_exec(p, rq);
-	task_rq_unlock(rq, p, &flags);
-
-	return ns;
-}
-
 /*
  * Return accounted runtime for the task.
  * In case the task is currently running, return the runtime plus current's
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 492b986..a16b678 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -553,7 +553,7 @@ static int cpu_timer_sample_group(const clockid_t which_clock,
 		*sample = cputime_to_expires(cputime.utime);
 		break;
 	case CPUCLOCK_SCHED:
-		*sample = cputime.sum_exec_runtime + task_delta_exec(p);
+		*sample = cputime.sum_exec_runtime;
 		break;
 	}
 	return 0;

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
  2014-11-12 11:15 ` Peter Zijlstra
  2014-11-12 11:37   ` Peter Zijlstra
@ 2014-11-12 12:21   ` Stanislaw Gruszka
  2014-11-12 12:51     ` Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Stanislaw Gruszka @ 2014-11-12 12:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro,
	Oleg Nesterov, Thomas Gleixner, Ingo Molnar

On Wed, Nov 12, 2014 at 12:15:53PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2014 at 11:29:28AM +0100, Stanislaw Gruszka wrote:
> > Commit d670ec13178d "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
> > test case in cost of breaking another one. After that commit, calling
> > clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
> > of Y time being smaller than X time.
> 
> <snip>
> 
> > Issue happens because on start in thread_group_cputimer() we initialize
> > sum_exec_runtime of cputimer with threads runtime not yet accounted and
> > then add the threads runtime again on scheduler tick. When cputimer
> > finish, it's sum_exec_runtime value is bigger than current sum counted
> > by iterating over the threads in thread_group_cputime().
> > 
> > KOSAKI Motohiro posted fix for this problem, but that patch was never
> > applied: https://lkml.org/lkml/2013/5/26/191.
> > 
> > This patch takes different approach. It revert change from d670ec13178d,
> > but to assure process times are in sync with thread times, before
> > accounting the times it calls update_curr() to update current thread
> > sum_exec_runtime. This fixes the test case from commit d670ec13178d and
> > also make things like they were before i.e. process cpu times can be
> > (nr_threads - 1) * TICK_NSEC behind the clock (this is a drawback
> > unfortunately). 
> 
> > Good news is that patch improve performance of
> > thread_group_cputime(), i.e. we do not need optimizations from commit
> > 911b289 "sched: Optimize task_sched_runtime()"
> 
> You're inconsistent with your SHA1 lengths, use 12 chars.
> 
> So I'm not seeing how you've not reintroduced the issue from
> d670ec13178d, afaict you simply do not take the delta of all the other
> threads into account, which is why you're not taking as many locks.
> 
> Which also re-introduces the 32bit data race mentioned in that
> changelog.

Right, I missed the 32 bit issue. On 64 bit's this patch really
helps with d670ec13178d test case. Issue there was that thread accounted
cpu time was bigger than cpu time of proccess which include that thread.
It happened because on cpu_clock_sample() we return task sum_exec_runtime
plus not yet accounted runtime, but on cpu_clock_sample_group() we count
only sum_exec_runtime of that task. With this patch we count only the
task sum_exec_runtime on both cpu_clock_sample() and
cpu_clock_sample_group() (but sum_exec_runtime is updated for current
task before we use it in those functions).

> > @@ -254,6 +256,7 @@ static int cpu_clock_sample_group(const clockid_t which_clock,
> >  		*sample = cputime_to_expires(cputime.utime);
> >  		break;
> >  	case CPUCLOCK_SCHED:
> > +		scheduler_update_curr(p);
> >  		thread_group_cputime(p, &cputime);
> >  		*sample = cputime.sum_exec_runtime;
> >  		break;
> 
> Why only these two thread_group_cputime() calls and not all others?

On other cases here we do not utilize sum_exec_runtime, but only utime
and stime, which are not updated by update_curr().

I can fix 32 bit problem by introducing seq_lock around reading/writing 
task sum_exec_runtime on 32 bits arches. And fix other issues you pointed.
Though I'm not sure now, if this approach is better than KOSAKI Motohiro 
patch: http://marc.info/?l=linux-kernel&m=136960420627094&w=2
That patch can be furhter simplified - we do not have to pass bool value
up to task_sched_runtime(), juts add it only to thread_group_cputime().
What do you think ?

Stanislaw


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
  2014-11-12 12:21   ` [PATCH] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency Stanislaw Gruszka
@ 2014-11-12 12:51     ` Peter Zijlstra
  2014-11-12 15:58       ` [PATCH v2] " Stanislaw Gruszka
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-11-12 12:51 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-kernel, Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro,
	Oleg Nesterov, Thomas Gleixner, Ingo Molnar

On Wed, Nov 12, 2014 at 01:21:56PM +0100, Stanislaw Gruszka wrote:
> > Why only these two thread_group_cputime() calls and not all others?
> 
> On other cases here we do not utilize sum_exec_runtime, but only utime
> and stime, which are not updated by update_curr().

That's one fragile ass interface. That's just asking for trouble, the
next person to want to use sum_exec_runtime will forget to add this call
and then we'll spend another few months/years until we figure out wtf
happened.

> I can fix 32 bit problem by introducing seq_lock around reading/writing 
> task sum_exec_runtime on 32 bits arches. And fix other issues you pointed.

No, that makes updating the sum_exec_runtime far more expensive than it
already is, and 99.9% of the time nobody cares.

> What do you think ?

I don't particularly like running the entirety of update_curr remotely,
but the NOHZ_FULL people need the same thing so we might as well do it
here.

Something like the below should then cure your issue without doing
horrible things.

---
 include/linux/kernel_stat.h    |  5 -----
 kernel/sched/core.c            | 49 +++++++++---------------------------------
 kernel/sched/deadline.c        |  2 ++
 kernel/sched/fair.c            |  7 ++++++
 kernel/sched/rt.c              |  2 ++
 kernel/sched/sched.h           |  2 ++
 kernel/time/posix-cpu-timers.c |  2 +-
 7 files changed, 24 insertions(+), 45 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 8422b4ed6882..b9376cd5a187 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -77,11 +77,6 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
 	return kstat_cpu(cpu).irqs_sum;
 }
 
-/*
- * Lock/unlock the current runqueue - to extract task statistics:
- */
-extern unsigned long long task_delta_exec(struct task_struct *);
-
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
 extern void account_steal_time(cputime_t);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5df22f1da07d..b708c4b1a02a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2457,44 +2457,6 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
 
 /*
- * Return any ns on the sched_clock that have not yet been accounted in
- * @p in case that task is currently running.
- *
- * Called with task_rq_lock() held on @rq.
- */
-static u64 do_task_delta_exec(struct task_struct *p, struct rq *rq)
-{
-	u64 ns = 0;
-
-	/*
-	 * Must be ->curr _and_ ->on_rq.  If dequeued, we would
-	 * project cycles that may never be accounted to this
-	 * thread, breaking clock_gettime().
-	 */
-	if (task_current(rq, p) && task_on_rq_queued(p)) {
-		update_rq_clock(rq);
-		ns = rq_clock_task(rq) - p->se.exec_start;
-		if ((s64)ns < 0)
-			ns = 0;
-	}
-
-	return ns;
-}
-
-unsigned long long task_delta_exec(struct task_struct *p)
-{
-	unsigned long flags;
-	struct rq *rq;
-	u64 ns = 0;
-
-	rq = task_rq_lock(p, &flags);
-	ns = do_task_delta_exec(p, rq);
-	task_rq_unlock(rq, p, &flags);
-
-	return ns;
-}
-
-/*
  * Return accounted runtime for the task.
  * In case the task is currently running, return the runtime plus current's
  * pending runtime that have not been accounted yet.
@@ -2522,7 +2484,16 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 #endif
 
 	rq = task_rq_lock(p, &flags);
-	ns = p->se.sum_exec_runtime + do_task_delta_exec(p, rq);
+	/*
+	 * Must be ->curr _and_ ->on_rq.  If dequeued, we would
+	 * project cycles that may never be accounted to this
+	 * thread, breaking clock_gettime().
+	 */
+	if (task_current(rq, p) && task_on_rq_queued(p)) {
+		update_rq_clock(rq);
+		p->sched_class->update_curr(rq);
+	}
+	ns = p->se.sum_exec_runtime;
 	task_rq_unlock(rq, p, &flags);
 
 	return ns;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f3d7776656ee..b0911797422f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1747,6 +1747,8 @@ const struct sched_class dl_sched_class = {
 	.prio_changed           = prio_changed_dl,
 	.switched_from		= switched_from_dl,
 	.switched_to		= switched_to_dl,
+
+	.update_curr		= update_curr_dl,
 };
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 826fdf326683..faa482ed264c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -726,6 +726,11 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
+static void update_curr_fair(struct rq *rq)
+{
+	update_curr(&rq->cfs);
+}
+
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -8137,6 +8142,8 @@ const struct sched_class fair_sched_class = {
 
 	.get_rr_interval	= get_rr_interval_fair,
 
+	.update_curr		= update_curr_fair,
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	.task_move_group	= task_move_group_fair,
 #endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3d14312db7ea..f1bb92fcc532 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2134,6 +2134,8 @@ const struct sched_class rt_sched_class = {
 
 	.prio_changed		= prio_changed_rt,
 	.switched_to		= switched_to_rt,
+
+	.update_curr		= update_curr_rt,
 };
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 31f1e4d2996a..9a2a45c970e7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1177,6 +1177,8 @@ struct sched_class {
 	unsigned int (*get_rr_interval) (struct rq *rq,
 					 struct task_struct *task);
 
+	void (*update_curr) (struct rq *rq);
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	void (*task_move_group) (struct task_struct *p, int on_rq);
 #endif
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 492b986195d5..a16b67859e2a 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -553,7 +553,7 @@ static int cpu_timer_sample_group(const clockid_t which_clock,
 		*sample = cputime_to_expires(cputime.utime);
 		break;
 	case CPUCLOCK_SCHED:
-		*sample = cputime.sum_exec_runtime + task_delta_exec(p);
+		*sample = cputime.sum_exec_runtime;
 		break;
 	}
 	return 0;

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
  2014-11-12 12:51     ` Peter Zijlstra
@ 2014-11-12 15:58       ` Stanislaw Gruszka
  2014-11-12 16:06         ` Peter Zijlstra
                           ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Stanislaw Gruszka @ 2014-11-12 15:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro,
	Oleg Nesterov, Thomas Gleixner, Ingo Molnar

Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
test case in cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
of Y time being smaller than X time.

Below is full reproducer (tst-cpuclock2.c) :

/* Parameters for the Linux kernel ABI for CPU clocks.  */
        ((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

static pthread_barrier_t barrier;

/* Help advance the clock.  */
static void *chew_cpu(void *arg)
{
	pthread_barrier_wait(&barrier);
	while (1) ;

	return NULL;
}

/* Don't use the glibc wrapper.  */
static int do_nanosleep(int flags, const struct timespec *req)
{
	clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

	return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
}

static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
{
	int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
	int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

	return after_i - before_i;
}

int main(void)
{
	int result = 0;
	pthread_t th;

	pthread_barrier_init(&barrier, NULL, 2);

	if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
		perror("pthread_create");
		return 1;
	}

	pthread_barrier_wait(&barrier);

	/* The test.  */
	struct timespec before, after, sleeptimeabs;
	int64_t sleepdiff, diffabs;
	const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };

	/* The relative nanosleep.  Not sure why this is needed, but its presence
	   seems to make it easier to reproduce the problem.  */
	if (do_nanosleep(0, &sleeptime) != 0) {
		perror("clock_nanosleep");
		return 1;
	}

	/* Get the current time.  */
	if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
		perror("clock_gettime[2]");
		return 1;
	}

	/* Compute the absolute sleep time based on the current time.  */
	uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
	sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
	sleeptimeabs.tv_nsec = nsec % 1000000000;

	/* Sleep for the computed time.  */
	if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
		perror("absolute clock_nanosleep");
		return 1;
	}

	/* Get the time after the sleep.  */
	if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
		perror("clock_gettime[3]");
		return 1;
	}

	/* The time after sleep should always be equal to or after the absolute sleep
	   time passed to clock_nanosleep.  */
	sleepdiff = tsdiff(&sleeptimeabs, &after);
	if (sleepdiff < 0) {
		printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
		result = 1;

		printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
		printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
		printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
	}

	/* The difference between the timestamps taken before and after the
	   clock_nanosleep call should be equal to or more than the duration of the
	   sleep.  */
	diffabs = tsdiff(&before, &after);
	if (diffabs < sleeptime.tv_nsec) {
		printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
		result = 1;
	}

	pthread_cancel(th);

	return result;
}

It can be compiled and ran by:

gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
while ./tst-cpuclock2 ; do : ; done

Issue happens because on start in thread_group_cputimer() we initialize
sum_exec_runtime of cputimer with threads runtime not yet accounted and
then add the threads runtime to running cputimer again on scheduler
tick, making it's sum_exec_runtime bigger than actual threads runtime.

KOSAKI Motohiro posted fix for this problem, but that patch was never
applied: https://lkml.org/lkml/2013/5/26/191 .

This patch takes different approach to cure the problem. It calls
update_curr() when cputimer starts, that assure we will have updated
stats of running threads and on the next schedule tick we will account
only the runtime that elapsed from cputimer start. That also assure we
have consistent state between cpu times of individual threads and cpu
time of the process consisted by those threads.

Cc: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Originally-from: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
v1 -> v2: post Peter version

Peter, if you want post your patch by yourself please do so (you can
add me with Reported-and-tested-by: tag then). I'm posting this in case
you don't want to write the changelog :-)

 include/linux/kernel_stat.h    |  5 -----
 kernel/sched/core.c            | 51 +++++++++---------------------------------
 kernel/sched/deadline.c        |  2 ++
 kernel/sched/fair.c            |  7 ++++++
 kernel/sched/rt.c              |  2 ++
 kernel/sched/sched.h           |  2 ++
 kernel/time/posix-cpu-timers.c |  2 +-
 7 files changed, 25 insertions(+), 46 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 8422b4e..b9376cd 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -77,11 +77,6 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
 	return kstat_cpu(cpu).irqs_sum;
 }
 
-/*
- * Lock/unlock the current runqueue - to extract task statistics:
- */
-extern unsigned long long task_delta_exec(struct task_struct *);
-
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
 extern void account_steal_time(cputime_t);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5df22f1da..960f704 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2457,44 +2457,6 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
 
 /*
- * Return any ns on the sched_clock that have not yet been accounted in
- * @p in case that task is currently running.
- *
- * Called with task_rq_lock() held on @rq.
- */
-static u64 do_task_delta_exec(struct task_struct *p, struct rq *rq)
-{
-	u64 ns = 0;
-
-	/*
-	 * Must be ->curr _and_ ->on_rq.  If dequeued, we would
-	 * project cycles that may never be accounted to this
-	 * thread, breaking clock_gettime().
-	 */
-	if (task_current(rq, p) && task_on_rq_queued(p)) {
-		update_rq_clock(rq);
-		ns = rq_clock_task(rq) - p->se.exec_start;
-		if ((s64)ns < 0)
-			ns = 0;
-	}
-
-	return ns;
-}
-
-unsigned long long task_delta_exec(struct task_struct *p)
-{
-	unsigned long flags;
-	struct rq *rq;
-	u64 ns = 0;
-
-	rq = task_rq_lock(p, &flags);
-	ns = do_task_delta_exec(p, rq);
-	task_rq_unlock(rq, p, &flags);
-
-	return ns;
-}
-
-/*
  * Return accounted runtime for the task.
  * In case the task is currently running, return the runtime plus current's
  * pending runtime that have not been accounted yet.
@@ -2503,7 +2465,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 {
 	unsigned long flags;
 	struct rq *rq;
-	u64 ns = 0;
+	u64 ns;
 
 #if defined(CONFIG_64BIT) && defined(CONFIG_SMP)
 	/*
@@ -2522,7 +2484,16 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 #endif
 
 	rq = task_rq_lock(p, &flags);
-	ns = p->se.sum_exec_runtime + do_task_delta_exec(p, rq);
+	/*
+	 * Must be ->curr _and_ ->on_rq.  If dequeued, we would
+	 * project cycles that may never be accounted to this
+	 * thread, breaking clock_gettime().
+	 */
+	if (task_current(rq, p) && task_on_rq_queued(p)) {
+		update_rq_clock(rq);
+		p->sched_class->update_curr(rq);
+	}
+	ns = p->se.sum_exec_runtime;
 	task_rq_unlock(rq, p, &flags);
 
 	return ns;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f3d7776..b091179 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1747,6 +1747,8 @@ const struct sched_class dl_sched_class = {
 	.prio_changed           = prio_changed_dl,
 	.switched_from		= switched_from_dl,
 	.switched_to		= switched_to_dl,
+
+	.update_curr		= update_curr_dl,
 };
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 826fdf3..faa482e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -726,6 +726,11 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
+static void update_curr_fair(struct rq *rq)
+{
+	update_curr(&rq->cfs);
+}
+
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -8137,6 +8142,8 @@ const struct sched_class fair_sched_class = {
 
 	.get_rr_interval	= get_rr_interval_fair,
 
+	.update_curr		= update_curr_fair,
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	.task_move_group	= task_move_group_fair,
 #endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3d14312..f1bb92f 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2134,6 +2134,8 @@ const struct sched_class rt_sched_class = {
 
 	.prio_changed		= prio_changed_rt,
 	.switched_to		= switched_to_rt,
+
+	.update_curr		= update_curr_rt,
 };
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 31f1e4d..9a2a45c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1177,6 +1177,8 @@ struct sched_class {
 	unsigned int (*get_rr_interval) (struct rq *rq,
 					 struct task_struct *task);
 
+	void (*update_curr) (struct rq *rq);
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	void (*task_move_group) (struct task_struct *p, int on_rq);
 #endif
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 492b986..a16b678 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -553,7 +553,7 @@ static int cpu_timer_sample_group(const clockid_t which_clock,
 		*sample = cputime_to_expires(cputime.utime);
 		break;
 	case CPUCLOCK_SCHED:
-		*sample = cputime.sum_exec_runtime + task_delta_exec(p);
+		*sample = cputime.sum_exec_runtime;
 		break;
 	}
 	return 0;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
  2014-11-12 15:58       ` [PATCH v2] " Stanislaw Gruszka
@ 2014-11-12 16:06         ` Peter Zijlstra
  2014-11-12 16:17           ` Stanislaw Gruszka
  2014-11-12 17:12           ` Peter Zijlstra
  2014-11-12 16:45         ` Peter Zijlstra
                           ` (3 subsequent siblings)
  4 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-11-12 16:06 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-kernel, Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro,
	Oleg Nesterov, Thomas Gleixner, Ingo Molnar

On Wed, Nov 12, 2014 at 04:58:44PM +0100, Stanislaw Gruszka wrote:
> v1 -> v2: post Peter version
> 
> Peter, if you want post your patch by yourself please do so (you can
> add me with Reported-and-tested-by: tag then). I'm posting this in case
> you don't want to write the changelog :-)

Ah, but did you verify? That test case didn't trigger for me (I let it
run for several minutes before I got bored and shot it).

I think I might split this in two patches, the one killing the
task_delta_exec() and the other doing this. But I can do that if you can
confirm this all does indeed work.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
  2014-11-12 16:06         ` Peter Zijlstra
@ 2014-11-12 16:17           ` Stanislaw Gruszka
  2014-11-12 17:12           ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Stanislaw Gruszka @ 2014-11-12 16:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro,
	Oleg Nesterov, Thomas Gleixner, Ingo Molnar

On Wed, Nov 12, 2014 at 05:06:03PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2014 at 04:58:44PM +0100, Stanislaw Gruszka wrote:
> > v1 -> v2: post Peter version
> > 
> > Peter, if you want post your patch by yourself please do so (you can
> > add me with Reported-and-tested-by: tag then). I'm posting this in case
> > you don't want to write the changelog :-)
> 
> Ah, but did you verify? That test case didn't trigger for me (I let it
> run for several minutes before I got bored and shot it).
> 
> I think I might split this in two patches, the one killing the
> task_delta_exec() and the other doing this. But I can do that if you can
> confirm this all does indeed work.

Patch does work. I run test for few hours with the patch and the problem
did not show. Without the patch it's matter of minutes and the test
fail here.

Stanislaw

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
  2014-11-12 16:06         ` Peter Zijlstra
  2014-11-12 16:17           ` Stanislaw Gruszka
@ 2014-11-12 17:12           ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-11-12 17:12 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-kernel, Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro,
	Oleg Nesterov, Thomas Gleixner, Ingo Molnar

On Wed, Nov 12, 2014 at 05:06:03PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2014 at 04:58:44PM +0100, Stanislaw Gruszka wrote:
> > v1 -> v2: post Peter version
> > 
> > Peter, if you want post your patch by yourself please do so (you can
> > add me with Reported-and-tested-by: tag then). I'm posting this in case
> > you don't want to write the changelog :-)
> 
> Ah, but did you verify? That test case didn't trigger for me (I let it
> run for several minutes before I got bored and shot it).
> 
> I think I might split this in two patches, the one killing the
> task_delta_exec() and the other doing this. But I can do that if you can
> confirm this all does indeed work.

The below should include the two patches; I'll let WU's robot crawl the
thing and post when nothing funny happens.

https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/log/?h=sched/urgent

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
  2014-11-12 15:58       ` [PATCH v2] " Stanislaw Gruszka
  2014-11-12 16:06         ` Peter Zijlstra
@ 2014-11-12 16:45         ` Peter Zijlstra
  2014-11-12 16:53         ` Stanislaw Gruszka
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-11-12 16:45 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-kernel, Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro,
	Oleg Nesterov, Thomas Gleixner, Ingo Molnar

On Wed, Nov 12, 2014 at 04:58:44PM +0100, Stanislaw Gruszka wrote:
> +static void update_curr_fair(struct rq *rq)
> +{
> +	update_curr(&rq->cfs);

+       update_curr(cfs_rq_of(&rq->curr.se));

> +}

I did the above change. While looking it over I realized we'd not
actually update the task when cgroups are in use.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
  2014-11-12 15:58       ` [PATCH v2] " Stanislaw Gruszka
  2014-11-12 16:06         ` Peter Zijlstra
  2014-11-12 16:45         ` Peter Zijlstra
@ 2014-11-12 16:53         ` Stanislaw Gruszka
  2014-11-12 16:56         ` Peter Zijlstra
  2014-11-16  9:50         ` [tip:sched/urgent] sched/cputime: Fix clock_nanosleep()/ clock_gettime() inconsistency tip-bot for Stanislaw Gruszka
  4 siblings, 0 replies; 17+ messages in thread
From: Stanislaw Gruszka @ 2014-11-12 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro,
	Oleg Nesterov, Thomas Gleixner, Ingo Molnar

On Wed, Nov 12, 2014 at 04:58:43PM +0100, Stanislaw Gruszka wrote:
> Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
> test case in cost of breaking another one. After that commit, calling
> clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
> of Y time being smaller than X time.
> 
> Below is full reproducer (tst-cpuclock2.c) :
> 
> /* Parameters for the Linux kernel ABI for CPU clocks.  */
>         ((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

Looks like # started lines were eaten. Here is reproducer that compile:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <stdint.h>
#include <inttypes.h>

/* Parameters for the Linux kernel ABI for CPU clocks.  */
#define CPUCLOCK_SCHED          2
#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
        ((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

static pthread_barrier_t barrier;

/* Help advance the clock.  */
static void *chew_cpu(void *arg)
{
	pthread_barrier_wait(&barrier);
	while (1) ;

	return NULL;
}

/* Don't use the glibc wrapper.  */
static int do_nanosleep(int flags, const struct timespec *req)
{
	clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

	return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
}

static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
{
	int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
	int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

	return after_i - before_i;
}

int main(void)
{
	int result = 0;
	pthread_t th;

	pthread_barrier_init(&barrier, NULL, 2);

	if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
		perror("pthread_create");
		return 1;
	}

	pthread_barrier_wait(&barrier);

	/* The test.  */
	struct timespec before, after, sleeptimeabs;
	int64_t sleepdiff, diffabs;
	const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };

	/* The relative nanosleep.  Not sure why this is needed, but its presence
	   seems to make it easier to reproduce the problem.  */
	if (do_nanosleep(0, &sleeptime) != 0) {
		perror("clock_nanosleep");
		return 1;
	}

	/* Get the current time.  */
	if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
		perror("clock_gettime[2]");
		return 1;
	}

	/* Compute the absolute sleep time based on the current time.  */
	uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
	sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
	sleeptimeabs.tv_nsec = nsec % 1000000000;

	/* Sleep for the computed time.  */
	if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
		perror("absolute clock_nanosleep");
		return 1;
	}

	/* Get the time after the sleep.  */
	if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
		perror("clock_gettime[3]");
		return 1;
	}

	/* The time after sleep should always be equal to or after the absolute sleep
	   time passed to clock_nanosleep.  */
	sleepdiff = tsdiff(&sleeptimeabs, &after);
	if (sleepdiff < 0) {
		printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
		result = 1;

		printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
		printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
		printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
	}

	/* The difference between the timestamps taken before and after the
	   clock_nanosleep call should be equal to or more than the duration of the
	   sleep.  */
	diffabs = tsdiff(&before, &after);
	if (diffabs < sleeptime.tv_nsec) {
		printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
		result = 1;
	}

	pthread_cancel(th);

	return result;
}



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency
  2014-11-12 15:58       ` [PATCH v2] " Stanislaw Gruszka
                           ` (2 preceding siblings ...)
  2014-11-12 16:53         ` Stanislaw Gruszka
@ 2014-11-12 16:56         ` Peter Zijlstra
  2014-11-16  9:50         ` [tip:sched/urgent] sched/cputime: Fix clock_nanosleep()/ clock_gettime() inconsistency tip-bot for Stanislaw Gruszka
  4 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-11-12 16:56 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-kernel, Rik van Riel, Frederic Weisbecker, KOSAKI Motohiro,
	Oleg Nesterov, Thomas Gleixner, Ingo Molnar

On Wed, Nov 12, 2014 at 04:58:44PM +0100, Stanislaw Gruszka wrote:
> Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
> test case in cost of breaking another one. After that commit, calling
> clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
> of Y time being smaller than X time.
> 
> Below is full reproducer (tst-cpuclock2.c) :
> 
> /* Parameters for the Linux kernel ABI for CPU clocks.  */
>         ((~(clockid_t) (pid) << 3) | (clockid_t) (clock))
> 

You lost all the #include and #define stmts there, restored them from
the previous posting.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [tip:sched/urgent] sched/cputime: Fix clock_nanosleep()/ clock_gettime() inconsistency
  2014-11-12 15:58       ` [PATCH v2] " Stanislaw Gruszka
                           ` (3 preceding siblings ...)
  2014-11-12 16:56         ` Peter Zijlstra
@ 2014-11-16  9:50         ` tip-bot for Stanislaw Gruszka
  4 siblings, 0 replies; 17+ messages in thread
From: tip-bot for Stanislaw Gruszka @ 2014-11-16  9:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: kosaki.motohiro, hpa, torvalds, fweisbec, riel, linux-kernel,
	peterz, mingo, tglx, oleg, sgruszka

Commit-ID:  6e998916dfe327e785e7c2447959b2c1a3ea4930
Gitweb:     http://git.kernel.org/tip/6e998916dfe327e785e7c2447959b2c1a3ea4930
Author:     Stanislaw Gruszka <sgruszka@redhat.com>
AuthorDate: Wed, 12 Nov 2014 16:58:44 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 16 Nov 2014 10:04:20 +0100

sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency

Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
test case in cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
of Y time being smaller than X time.

Reproducer/tester can be found further below, it can be compiled and ran by:

	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
	while ./tst-cpuclock2 ; do : ; done

This reproducer, when running on a buggy kernel, will complain
about "clock_gettime difference too small".

Issue happens because on start in thread_group_cputimer() we initialize
sum_exec_runtime of cputimer with threads runtime not yet accounted and
then add the threads runtime to running cputimer again on scheduler
tick, making it's sum_exec_runtime bigger than actual threads runtime.

KOSAKI Motohiro posted a fix for this problem, but that patch was never
applied: https://lkml.org/lkml/2013/5/26/191 .

This patch takes different approach to cure the problem. It calls
update_curr() when cputimer starts, that assure we will have updated
stats of running threads and on the next schedule tick we will account
only the runtime that elapsed from cputimer start. That also assure we
have consistent state between cpu times of individual threads and cpu
time of the process consisted by those threads.

Full reproducer (tst-cpuclock2.c):

	#define _GNU_SOURCE
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <stdio.h>
	#include <time.h>
	#include <pthread.h>
	#include <stdint.h>
	#include <inttypes.h>

	/* Parameters for the Linux kernel ABI for CPU clocks.  */
	#define CPUCLOCK_SCHED          2
	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
		((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

	static pthread_barrier_t barrier;

	/* Help advance the clock.  */
	static void *chew_cpu(void *arg)
	{
		pthread_barrier_wait(&barrier);
		while (1) ;

		return NULL;
	}

	/* Don't use the glibc wrapper.  */
	static int do_nanosleep(int flags, const struct timespec *req)
	{
		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
	}

	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
	{
		int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
		int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

		return after_i - before_i;
	}

	int main(void)
	{
		int result = 0;
		pthread_t th;

		pthread_barrier_init(&barrier, NULL, 2);

		if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
			perror("pthread_create");
			return 1;
		}

		pthread_barrier_wait(&barrier);

		/* The test.  */
		struct timespec before, after, sleeptimeabs;
		int64_t sleepdiff, diffabs;
		const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };

		/* The relative nanosleep.  Not sure why this is needed, but its presence
		   seems to make it easier to reproduce the problem.  */
		if (do_nanosleep(0, &sleeptime) != 0) {
			perror("clock_nanosleep");
			return 1;
		}

		/* Get the current time.  */
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
			perror("clock_gettime[2]");
			return 1;
		}

		/* Compute the absolute sleep time based on the current time.  */
		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
		sleeptimeabs.tv_nsec = nsec % 1000000000;

		/* Sleep for the computed time.  */
		if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
			perror("absolute clock_nanosleep");
			return 1;
		}

		/* Get the time after the sleep.  */
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
			perror("clock_gettime[3]");
			return 1;
		}

		/* The time after sleep should always be equal to or after the absolute sleep
		   time passed to clock_nanosleep.  */
		sleepdiff = tsdiff(&sleeptimeabs, &after);
		if (sleepdiff < 0) {
			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
			result = 1;

			printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
			printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
			printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
		}

		/* The difference between the timestamps taken before and after the
		   clock_nanosleep call should be equal to or more than the duration of the
		   sleep.  */
		diffabs = tsdiff(&before, &after);
		if (diffabs < sleeptime.tv_nsec) {
			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
			result = 1;
		}

		pthread_cancel(th);

		return result;
	}

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c     | 38 +++++++++++---------------------------
 kernel/sched/deadline.c |  2 ++
 kernel/sched/fair.c     |  7 +++++++
 kernel/sched/rt.c       |  2 ++
 kernel/sched/sched.h    |  2 ++
 5 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 797a6c8..24beb9b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2475,31 +2475,6 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
 
 /*
- * Return any ns on the sched_clock that have not yet been accounted in
- * @p in case that task is currently running.
- *
- * Called with task_rq_lock() held on @rq.
- */
-static u64 do_task_delta_exec(struct task_struct *p, struct rq *rq)
-{
-	u64 ns = 0;
-
-	/*
-	 * Must be ->curr _and_ ->on_rq.  If dequeued, we would
-	 * project cycles that may never be accounted to this
-	 * thread, breaking clock_gettime().
-	 */
-	if (task_current(rq, p) && task_on_rq_queued(p)) {
-		update_rq_clock(rq);
-		ns = rq_clock_task(rq) - p->se.exec_start;
-		if ((s64)ns < 0)
-			ns = 0;
-	}
-
-	return ns;
-}
-
-/*
  * Return accounted runtime for the task.
  * In case the task is currently running, return the runtime plus current's
  * pending runtime that have not been accounted yet.
@@ -2508,7 +2483,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 {
 	unsigned long flags;
 	struct rq *rq;
-	u64 ns = 0;
+	u64 ns;
 
 #if defined(CONFIG_64BIT) && defined(CONFIG_SMP)
 	/*
@@ -2527,7 +2502,16 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 #endif
 
 	rq = task_rq_lock(p, &flags);
-	ns = p->se.sum_exec_runtime + do_task_delta_exec(p, rq);
+	/*
+	 * Must be ->curr _and_ ->on_rq.  If dequeued, we would
+	 * project cycles that may never be accounted to this
+	 * thread, breaking clock_gettime().
+	 */
+	if (task_current(rq, p) && task_on_rq_queued(p)) {
+		update_rq_clock(rq);
+		p->sched_class->update_curr(rq);
+	}
+	ns = p->se.sum_exec_runtime;
 	task_rq_unlock(rq, p, &flags);
 
 	return ns;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5285332..28fa9d9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1701,4 +1701,6 @@ const struct sched_class dl_sched_class = {
 	.prio_changed           = prio_changed_dl,
 	.switched_from		= switched_from_dl,
 	.switched_to		= switched_to_dl,
+
+	.update_curr		= update_curr_dl,
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3af3d1e..ef2b104 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -726,6 +726,11 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
+static void update_curr_fair(struct rq *rq)
+{
+	update_curr(cfs_rq_of(&rq->curr->se));
+}
+
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -7956,6 +7961,8 @@ const struct sched_class fair_sched_class = {
 
 	.get_rr_interval	= get_rr_interval_fair,
 
+	.update_curr		= update_curr_fair,
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	.task_move_group	= task_move_group_fair,
 #endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d024e6c..20bca39 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2128,6 +2128,8 @@ const struct sched_class rt_sched_class = {
 
 	.prio_changed		= prio_changed_rt,
 	.switched_to		= switched_to_rt,
+
+	.update_curr		= update_curr_rt,
 };
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24156c84..2df8ef0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1135,6 +1135,8 @@ struct sched_class {
 	unsigned int (*get_rr_interval) (struct rq *rq,
 					 struct task_struct *task);
 
+	void (*update_curr) (struct rq *rq);
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	void (*task_move_group) (struct task_struct *p, int on_rq);
 #endif

^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2014-11-16  9:51 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-11-12 10:29 [PATCH] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency Stanislaw Gruszka
2014-11-12 11:15 ` Peter Zijlstra
2014-11-12 11:37   ` Peter Zijlstra
2014-11-12 11:45     ` Peter Zijlstra
2014-11-12 12:27       ` Stanislaw Gruszka
2014-11-12 12:52         ` Peter Zijlstra
2014-11-16  9:50     ` [tip:sched/urgent] sched/cputime: Fix cpu_timer_sample_group() double accounting tip-bot for Peter Zijlstra
2014-11-12 12:21   ` [PATCH] sched/cputime: fix clock_nanosleep/clock_gettime inconsistency Stanislaw Gruszka
2014-11-12 12:51     ` Peter Zijlstra
2014-11-12 15:58       ` [PATCH v2] " Stanislaw Gruszka
2014-11-12 16:06         ` Peter Zijlstra
2014-11-12 16:17           ` Stanislaw Gruszka
2014-11-12 17:12           ` Peter Zijlstra
2014-11-12 16:45         ` Peter Zijlstra
2014-11-12 16:53         ` Stanislaw Gruszka
2014-11-12 16:56         ` Peter Zijlstra
2014-11-16  9:50         ` [tip:sched/urgent] sched/cputime: Fix clock_nanosleep()/ clock_gettime() inconsistency tip-bot for Stanislaw Gruszka

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.