All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
To: Stanislaw Gruszka <sgruszka@redhat.com>, linux-kernel@vger.kernel.org
Cc: "Ingo Molnar" <mingo@elte.hu>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Spencer Candland" <spencer@bluehost.com>,
	"Américo Wang" <xiyou.wangcong@gmail.com>,
	"Oleg Nesterov" <oleg@redhat.com>,
	"Balbir Singh" <balbir@in.ibm.com>
Subject: [PATCH -v2 2/2] sched, cputime: introduce thread_group_times()
Date: Wed, 02 Dec 2009 17:28:07 +0900	[thread overview]
Message-ID: <4B162517.8040909@jp.fujitsu.com> (raw)
In-Reply-To: <20091123101612.GC25978@dhcp-lab-161.englab.brq.redhat.com>

This is a real fix for problem of utime/stime values decreasing
described in the thread:
   http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

 - {u,s}time in task_struct are increased every time when the thread
   is interrupted by a tick (timer interrupt).

 - When a thread exits, its {u,s}time are added to signal->{u,s}time,
   after adjusted by task_times().

 - When all threads in a thread_group exits, accumulated {u,s}time
   (and also c{u,s}time) in signal struct are added to c{u,s}time
   in signal struct of the group's parent.

So {u,s}time in task struct are "raw" tick count, while {u,s}time
and c{u,s}time in signal struct are "adjusted" values.

And accounted values are used by:

 - task_times(), to get cputime of a thread:
   This function returns adjusted values that originates from raw
   {u,s}time and scaled by sum_exec_runtime that accounted by CFS.

 - thread_group_cputime(), to get cputime of a thread group:
   This function returns sum of all {u,s}time of living threads in
   the group, plus {u,s}time in the signal struct that is sum of
   adjusted cputimes of all exited threads belonged to the group.

The problem is the return value of thread_group_cputime(), because
it is mixed sum of "raw" value and "adjusted" value:

  group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

This misbehavior can break {u,s}time monotonicity.
Assume that if there is a thread that have raw values greater than
adjusted values (e.g. interrupted by 1000Hz ticks 50 times but only
runs 45ms) and if it exits, cputime will decrease (e.g. -5ms).

To fix this, we could do:

  group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

But task_times() contains hard divisions, so applying it for every
thread should be avoided.

This patch fixes the above problem in the following way:

 - Modify thread's exit (= __exit_signal()) not to use task_times().
   It means {u,s}time in signal struct accumulates raw values instead
   of adjusted values.  As the result it makes thread_group_cputime()
   to return pure sum of "raw" values.

 - Introduce a new function thread_group_times(*task, *utime, *stime)
   that converts "raw" values of thread_group_cputime() to "adjusted"
   values, in same calculation procedure as task_times().

 - Modify group's exit (= wait_task_zombie()) to use this introduced
   thread_group_times().  It make c{u,s}time in signal struct to
   have adjusted values like before this patch.

 - Replace some thread_group_cputime() by thread_group_times().
   This replacements are only applied where conveys the "adjusted"
   cputime to users, and where already uses task_times() near by it.
   (i.e. sys_times(), getrusage(), and /proc/<PID>/stat.)

This patch have a positive side effect:

 - Before this patch, if a group contains many short-life threads
   (e.g. runs 0.9ms and not interrupted by ticks), the group's
   cputime could be invisible since thread's cputime was accumulated
   after adjusted: imagine adjustment function as adj(ticks, runtime),
     {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
   After this patch it will not happen because the adjustment is
   applied after accumulated.

v2:
 - remove if()s, put new variables into signal_struct.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 fs/proc/array.c       |    5 +----
 include/linux/sched.h |    4 ++++
 kernel/exit.c         |   23 ++++++++++++-----------
 kernel/fork.c         |    3 +++
 kernel/sched.c        |   41 +++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c          |   18 ++++++++----------
 6 files changed, 69 insertions(+), 25 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index b37121a..9e5d173 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -495,7 +495,6 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 
 		/* add up live thread stats at the group level */
 		if (whole) {
-			struct task_cputime cputime;
 			struct task_struct *t = task;
 			do {
 				min_flt += t->min_flt;
@@ -506,9 +505,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 
 			min_flt += sig->min_flt;
 			maj_flt += sig->maj_flt;
-			thread_group_cputime(task, &cputime);
-			utime = cputime.utime;
-			stime = cputime.stime;
+			thread_group_times(task, &utime, &stime);
 			gtime = cputime_add(gtime, sig->gtime);
 		}
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 12dbb5c..7ae1274 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -628,6 +628,9 @@ struct signal_struct {
 	cputime_t utime, stime, cutime, cstime;
 	cputime_t gtime;
 	cputime_t cgtime;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+	cputime_t prev_utime, prev_stime;
+#endif
 	unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
 	unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
 	unsigned long inblock, oublock, cinblock, coublock;
@@ -1723,6 +1726,7 @@ static inline void put_task_struct(struct task_struct *t)
 }
 
 extern void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st);
+extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st);
 
 /*
  * Per process flags
diff --git a/kernel/exit.c b/kernel/exit.c
index 2eaf68b..b221ad6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -91,8 +91,6 @@ static void __exit_signal(struct task_struct *tsk)
 	if (atomic_dec_and_test(&sig->count))
 		posix_cpu_timers_exit_group(tsk);
 	else {
-		cputime_t utime, stime;
-
 		/*
 		 * If there is any task waiting for the group exit
 		 * then notify it:
@@ -112,9 +110,8 @@ static void __exit_signal(struct task_struct *tsk)
 		 * We won't ever get here for the group leader, since it
 		 * will have been the last reference on the signal_struct.
 		 */
-		task_times(tsk, &utime, &stime);
-		sig->utime = cputime_add(sig->utime, utime);
-		sig->stime = cputime_add(sig->stime, stime);
+		sig->utime = cputime_add(sig->utime, tsk->utime);
+		sig->stime = cputime_add(sig->stime, tsk->stime);
 		sig->gtime = cputime_add(sig->gtime, tsk->gtime);
 		sig->min_flt += tsk->min_flt;
 		sig->maj_flt += tsk->maj_flt;
@@ -1208,6 +1205,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		struct signal_struct *psig;
 		struct signal_struct *sig;
 		unsigned long maxrss;
+		cputime_t tgutime, tgstime;
 
 		/*
 		 * The resource counters for the group leader are in its
@@ -1223,20 +1221,23 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		 * need to protect the access to parent->signal fields,
 		 * as other threads in the parent group can be right
 		 * here reaping other children at the same time.
+		 *
+		 * We use thread_group_times() to get times for the thread
+		 * group, which consolidates times for all threads in the
+		 * group including the group leader.
 		 */
+		thread_group_times(p, &tgutime, &tgstime);
 		spin_lock_irq(&p->real_parent->sighand->siglock);
 		psig = p->real_parent->signal;
 		sig = p->signal;
 		psig->cutime =
 			cputime_add(psig->cutime,
-			cputime_add(p->utime,
-			cputime_add(sig->utime,
-				    sig->cutime)));
+			cputime_add(tgutime,
+				    sig->cutime));
 		psig->cstime =
 			cputime_add(psig->cstime,
-			cputime_add(p->stime,
-			cputime_add(sig->stime,
-				    sig->cstime)));
+			cputime_add(tgstime,
+				    sig->cstime));
 		psig->cgtime =
 			cputime_add(psig->cgtime,
 			cputime_add(p->gtime,
diff --git a/kernel/fork.c b/kernel/fork.c
index ad7cb6d..3d6f121 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -884,6 +884,9 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero;
 	sig->gtime = cputime_zero;
 	sig->cgtime = cputime_zero;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+	sig->prev_utime = sig->prev_stime = cputime_zero;
+#endif
 	sig->nvcsw = sig->nivcsw = sig->cnvcsw = sig->cnivcsw = 0;
 	sig->min_flt = sig->maj_flt = sig->cmin_flt = sig->cmaj_flt = 0;
 	sig->inblock = sig->oublock = sig->cinblock = sig->coublock = 0;
diff --git a/kernel/sched.c b/kernel/sched.c
index 3674f77..aa733d5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5167,6 +5167,16 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
 	*ut = p->utime;
 	*st = p->stime;
 }
+
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+	struct task_cputime cputime;
+
+	thread_group_cputime(p, &cputime);
+
+	*ut = cputime.utime;
+	*st = cputime.stime;
+}
 #else
 
 #ifndef nsecs_to_cputime
@@ -5200,6 +5210,37 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
 	*ut = p->prev_utime;
 	*st = p->prev_stime;
 }
+
+/*
+ * Must be called with siglock held.
+ */
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+	struct signal_struct *sig = p->signal;
+	struct task_cputime cputime;
+	cputime_t rtime, utime, total;
+
+	thread_group_cputime(p, &cputime);
+
+	total = cputime_add(cputime.utime, cputime.stime);
+	rtime = nsecs_to_cputime(cputime.sum_exec_runtime);
+
+	if (total) {
+		u64 temp;
+
+		temp = (u64)(rtime * cputime.utime);
+		do_div(temp, total);
+		utime = (cputime_t)temp;
+	} else
+		utime = rtime;
+
+	sig->prev_utime = max(sig->prev_utime, utime);
+	sig->prev_stime = max(sig->prev_stime,
+			      cputime_sub(rtime, sig->prev_utime));
+
+	*ut = sig->prev_utime;
+	*st = sig->prev_stime;
+}
 #endif
 
 /*
diff --git a/kernel/sys.c b/kernel/sys.c
index d988abe..9968c5f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -911,16 +911,15 @@ change_okay:
 
 void do_sys_times(struct tms *tms)
 {
-	struct task_cputime cputime;
-	cputime_t cutime, cstime;
+	cputime_t tgutime, tgstime, cutime, cstime;
 
 	spin_lock_irq(&current->sighand->siglock);
-	thread_group_cputime(current, &cputime);
+	thread_group_times(current, &tgutime, &tgstime);
 	cutime = current->signal->cutime;
 	cstime = current->signal->cstime;
 	spin_unlock_irq(&current->sighand->siglock);
-	tms->tms_utime = cputime_to_clock_t(cputime.utime);
-	tms->tms_stime = cputime_to_clock_t(cputime.stime);
+	tms->tms_utime = cputime_to_clock_t(tgutime);
+	tms->tms_stime = cputime_to_clock_t(tgstime);
 	tms->tms_cutime = cputime_to_clock_t(cutime);
 	tms->tms_cstime = cputime_to_clock_t(cstime);
 }
@@ -1338,8 +1337,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
 {
 	struct task_struct *t;
 	unsigned long flags;
-	cputime_t utime, stime;
-	struct task_cputime cputime;
+	cputime_t tgutime, tgstime, utime, stime;
 	unsigned long maxrss = 0;
 
 	memset((char *) r, 0, sizeof *r);
@@ -1372,9 +1370,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
 				break;
 
 		case RUSAGE_SELF:
-			thread_group_cputime(p, &cputime);
-			utime = cputime_add(utime, cputime.utime);
-			stime = cputime_add(stime, cputime.stime);
+			thread_group_times(p, &tgutime, &tgstime);
+			utime = cputime_add(utime, tgutime);
+			stime = cputime_add(stime, tgstime);
 			r->ru_nvcsw += p->signal->nvcsw;
 			r->ru_nivcsw += p->signal->nivcsw;
 			r->ru_minflt += p->signal->min_flt;
-- 
1.6.5.3



  parent reply	other threads:[~2009-12-02  8:28 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-11-04  0:23 utime/stime decreasing on thread exit Spencer Candland
2009-11-04  6:49 ` Hidetoshi Seto
2009-11-05  5:24   ` Hidetoshi Seto
2009-11-09 14:49     ` Peter Zijlstra
2009-11-09 17:20       ` Oleg Nesterov
2009-11-09 17:27         ` Oleg Nesterov
2009-11-09 17:31         ` Peter Zijlstra
2009-11-09 19:23           ` Oleg Nesterov
2009-11-09 19:32             ` Peter Zijlstra
2009-11-10 10:44             ` Stanislaw Gruszka
2009-11-10 17:40               ` Oleg Nesterov
2009-11-10 18:24                 ` Stanislaw Gruszka
2009-11-10 19:23                   ` Oleg Nesterov
2009-11-17 12:48                     ` Stanislaw Gruszka
2009-11-17 12:57                       ` [PATCH] posix-cpu-timers: reset expire cache when no timer is running Stanislaw Gruszka
2009-11-10  5:42       ` utime/stime decreasing on thread exit Hidetoshi Seto
2009-11-10  5:47         ` [PATCH] fix granularity of task_u/stime() Hidetoshi Seto
2009-11-11 12:11           ` Stanislaw Gruszka
2009-11-12  0:00             ` Hidetoshi Seto
2009-11-12  2:49               ` Hidetoshi Seto
2009-11-12  2:55                 ` Américo Wang
2009-11-12  4:16                   ` Hidetoshi Seto
2009-11-12  4:33                     ` [PATCH] fix granularity of task_u/stime(), v2 Hidetoshi Seto
2009-11-12 14:15                       ` Peter Zijlstra
2009-11-12 14:49                       ` Stanislaw Gruszka
2009-11-12 15:00                         ` Peter Zijlstra
2009-11-12 15:40                           ` Stanislaw Gruszka
2009-11-13 12:42                             ` [PATCH] sys_times: fix utime/stime decreasing on thread exit Stanislaw Gruszka
2009-11-13 13:16                               ` Peter Zijlstra
2009-11-13 14:12                                 ` Balbir Singh
2009-11-13 15:36                                 ` Stanislaw Gruszka
2009-11-13 17:05                                   ` Peter Zijlstra
2009-11-16 19:32                             ` [PATCH] fix granularity of task_u/stime(), v2 Spencer Candland
2009-11-17 13:08                               ` Stanislaw Gruszka
2009-11-17 13:24                                 ` Peter Zijlstra
2009-11-19 18:17                                   ` Stanislaw Gruszka
2009-11-20  2:00                                     ` Hidetoshi Seto
2009-11-23 10:09                                       ` Stanislaw Gruszka
2009-11-23 10:16                                         ` [PATCH] cputime: avoid do_sys_times() races with __exit_signal() Stanislaw Gruszka
2009-11-30  9:20                                           ` [PATCH 1/2] cputime: remove prev_{u,s}time if VIRT_CPU_ACCOUNTING Hidetoshi Seto
2009-11-30  9:21                                           ` [PATCH 2/2] cputime: introduce thread_group_times() Hidetoshi Seto
2009-11-30 14:54                                             ` Stanislaw Gruszka
2009-12-01  1:02                                               ` Hidetoshi Seto
2009-12-02  8:26                                           ` [PATCH -v2 1/2] sched, cputime: cleanups related to task_times() Hidetoshi Seto
2009-12-02 15:17                                             ` Peter Zijlstra
2009-12-02 15:29                                               ` Balbir Singh
2009-12-03  0:21                                                 ` Hidetoshi Seto
2009-12-02 15:57                                             ` Peter Zijlstra
2009-12-02 17:33                                             ` [tip:sched/core] sched, cputime: Cleanups " tip-bot for Hidetoshi Seto
2009-12-02  8:28                                           ` Hidetoshi Seto [this message]
2009-12-02 15:58                                             ` [PATCH -v2 2/2] sched, cputime: introduce thread_group_times() Peter Zijlstra
2009-12-02 17:33                                             ` [tip:sched/core] sched, cputime: Introduce thread_group_times() tip-bot for Hidetoshi Seto
2009-12-02  8:29                                           ` reproducer: utime decreasing Hidetoshi Seto
2009-12-02  8:32                                           ` reproducer: invisible utime Hidetoshi Seto
2009-11-23 10:25                                         ` [PATCH] fix granularity of task_u/stime(), v2 Balbir Singh
2009-11-23 10:46                                           ` Stanislaw Gruszka
2009-11-24  5:33                                         ` Hidetoshi Seto
2009-11-18 22:38                                 ` Spencer Candland
2009-11-23  9:52                         ` Stanislaw Gruszka
2009-11-12 18:12                       ` [tip:sched/core] sched: Fix granularity of task_u/stime() tip-bot for Hidetoshi Seto
2009-11-13  9:40                         ` Stanislaw Gruszka
2009-11-13 23:09                         ` Ingo Molnar
2009-11-16  2:44                           ` Hidetoshi Seto

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4B162517.8040909@jp.fujitsu.com \
    --to=seto.hidetoshi@jp.fujitsu.com \
    --cc=balbir@in.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    --cc=sgruszka@redhat.com \
    --cc=spencer@bluehost.com \
    --cc=tglx@linutronix.de \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.