* [patch 0/4] taskstats: Improve cumulative time accounting
@ 2010-11-19 20:11 Michael Holzheu
2010-11-19 20:11 ` [patch 1/4] taskstats: Introduce "struct cdata" Michael Holzheu
` (4 more replies)
0 siblings, 5 replies; 24+ messages in thread
From: Michael Holzheu @ 2010-11-19 20:11 UTC (permalink / raw)
To: Oleg Nesterov, Shailabh Nagar, Andrew Morton, Peter Zijlstra,
John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
Heiko Carstens, Roland McGrath
Cc: linux-kernel, linux-s390
Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
to the cumulative time of the parents, if the parents ignore SIGCHLD
or have set SA_NOCLDWAIT. This behaviour has the major drawback that
it is not possible to calculate all consumed CPU time of a system by
looking at the current tasks. CPU time can be lost.
To solve this problem, this patch set duplicates the cumulative accounting
data in the signal_struct. In the second set (cdata_acct) the complete
cumulative resource counters are stored. The new cumulative CPU time (utime
and stime) is then exported via the taskstats interface.
PATCH SET OVERVIEW
------------------
Patches apply on:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
[1] Introduce "struct cdata"
[2] Introduce __account_cdata() function
[3] Introduce cdata_acct for complete cumulative accounting
[4] Export "cdata_acct" with taskstats
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 1/4] taskstats: Introduce "struct cdata"
2010-11-19 20:11 [patch 0/4] taskstats: Improve cumulative time accounting Michael Holzheu
@ 2010-11-19 20:11 ` Michael Holzheu
2010-11-25 12:29 ` Balbir Singh
2010-11-25 14:23 ` Oleg Nesterov
2010-11-19 20:11 ` [patch 2/4] taskstats: Introduce __account_cdata() function Michael Holzheu
` (3 subsequent siblings)
4 siblings, 2 replies; 24+ messages in thread
From: Michael Holzheu @ 2010-11-19 20:11 UTC (permalink / raw)
To: Oleg Nesterov, Shailabh Nagar, Andrew Morton, Peter Zijlstra,
John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
Heiko Carstens, Roland McGrath
Cc: linux-kernel, linux-s390
[-- Attachment #1: 01-taskstats-top-improve-ctime-add-cdata-struct.patch --]
[-- Type: text/plain, Size: 12061 bytes --]
From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
This patch introduces a new structure "struct cdata" that is used to
store cumulative resource counters for dead child processes and threads.
Note that there is one asymmetry:
For "struct task_io_accounting" (ioc) there is no extra accounting field for
dead threads. One field is used for both, dead processes and threads.
This patch introduces no functional change.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
fs/binfmt_elf.c | 4 +-
fs/exec.c | 2 -
fs/proc/array.c | 16 ++++----
include/linux/sched.h | 22 +++++++----
kernel/exit.c | 86 ++++++++++++++++++++++++----------------------
kernel/posix-cpu-timers.c | 12 +++---
kernel/sys.c | 44 ++++++++++++-----------
7 files changed, 100 insertions(+), 86 deletions(-)
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1296,8 +1296,8 @@ static void fill_prstatus(struct elf_prs
cputime_to_timeval(p->utime, &prstatus->pr_utime);
cputime_to_timeval(p->stime, &prstatus->pr_stime);
}
- cputime_to_timeval(p->signal->cutime, &prstatus->pr_cutime);
- cputime_to_timeval(p->signal->cstime, &prstatus->pr_cstime);
+ cputime_to_timeval(p->signal->cdata_wait.utime, &prstatus->pr_cutime);
+ cputime_to_timeval(p->signal->cdata_wait.stime, &prstatus->pr_cstime);
}
static int fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p,
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -894,7 +894,7 @@ static int de_thread(struct task_struct
no_thread_group:
if (current->mm)
- setmax_mm_hiwater_rss(&sig->maxrss, current->mm);
+ setmax_mm_hiwater_rss(&sig->cdata_threads.maxrss, current->mm);
exit_itimers(sig);
flush_itimer_signals();
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -413,11 +413,11 @@ static int do_task_stat(struct seq_file
num_threads = get_nr_threads(task);
collect_sigign_sigcatch(task, &sigign, &sigcatch);
- cmin_flt = sig->cmin_flt;
- cmaj_flt = sig->cmaj_flt;
- cutime = sig->cutime;
- cstime = sig->cstime;
- cgtime = sig->cgtime;
+ cmin_flt = sig->cdata_wait.min_flt;
+ cmaj_flt = sig->cdata_wait.maj_flt;
+ cutime = sig->cdata_wait.utime;
+ cstime = sig->cdata_wait.stime;
+ cgtime = sig->cdata_wait.gtime;
rsslim = ACCESS_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur);
/* add up live thread stats at the group level */
@@ -430,10 +430,10 @@ static int do_task_stat(struct seq_file
t = next_thread(t);
} while (t != task);
- min_flt += sig->min_flt;
- maj_flt += sig->maj_flt;
+ min_flt += sig->cdata_threads.min_flt;
+ maj_flt += sig->cdata_threads.maj_flt;
thread_group_times(task, &utime, &stime);
- gtime = cputime_add(gtime, sig->gtime);
+ gtime = cputime_add(gtime, sig->cdata_threads.gtime);
}
sid = task_session_nr_ns(task, ns);
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -510,6 +510,17 @@ struct thread_group_cputimer {
};
/*
+ * Cumulative resource counters
+ */
+struct cdata {
+ cputime_t utime, stime, gtime;
+ unsigned long nvcsw, nivcsw;
+ unsigned long min_flt, maj_flt;
+ unsigned long inblock, oublock;
+ unsigned long maxrss;
+};
+
+/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
* implies a shared sighand_struct, so locking
@@ -582,17 +593,12 @@ struct signal_struct {
* Live threads maintain their own counters and add to these
* in __exit_signal, except for the group leader.
*/
- cputime_t utime, stime, cutime, cstime;
- cputime_t gtime;
- cputime_t cgtime;
+ struct cdata cdata_wait;
+ struct cdata cdata_threads;
+ struct task_io_accounting ioac;
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
cputime_t prev_utime, prev_stime;
#endif
- unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
- unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
- unsigned long inblock, oublock, cinblock, coublock;
- unsigned long maxrss, cmaxrss;
- struct task_io_accounting ioac;
/*
* Cumulative ns of schedule CPU time fo dead threads in the
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -95,6 +95,7 @@ static void __exit_signal(struct task_st
tty = sig->tty;
sig->tty = NULL;
} else {
+ struct cdata *tcd = &sig->cdata_threads;
/*
* This can only happen if the caller is de_thread().
* FIXME: this is the temporary hack, we should teach
@@ -122,15 +123,15 @@ static void __exit_signal(struct task_st
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
- sig->utime = cputime_add(sig->utime, tsk->utime);
- sig->stime = cputime_add(sig->stime, tsk->stime);
- sig->gtime = cputime_add(sig->gtime, tsk->gtime);
- sig->min_flt += tsk->min_flt;
- sig->maj_flt += tsk->maj_flt;
- sig->nvcsw += tsk->nvcsw;
- sig->nivcsw += tsk->nivcsw;
- sig->inblock += task_io_get_inblock(tsk);
- sig->oublock += task_io_get_oublock(tsk);
+ tcd->utime = cputime_add(tcd->utime, tsk->utime);
+ tcd->stime = cputime_add(tcd->stime, tsk->stime);
+ tcd->gtime = cputime_add(tcd->gtime, tsk->gtime);
+ tcd->min_flt += tsk->min_flt;
+ tcd->maj_flt += tsk->maj_flt;
+ tcd->nvcsw += tsk->nvcsw;
+ tcd->nivcsw += tsk->nivcsw;
+ tcd->inblock += task_io_get_inblock(tsk);
+ tcd->oublock += task_io_get_oublock(tsk);
task_io_accounting_add(&sig->ioac, &tsk->ioac);
sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
}
@@ -960,10 +961,11 @@ NORET_TYPE void do_exit(long code)
sync_mm_rss(tsk, tsk->mm);
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
+ struct cdata *tcd = &tsk->signal->cdata_threads;
hrtimer_cancel(&tsk->signal->real_timer);
exit_itimers(tsk->signal);
if (tsk->mm)
- setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm);
+ setmax_mm_hiwater_rss(&tcd->maxrss, tsk->mm);
}
acct_collect(code, group_dead);
if (group_dead)
@@ -1225,8 +1227,7 @@ static int wait_task_zombie(struct wait_
* !task_detached() to filter out sub-threads.
*/
if (likely(!traced) && likely(!task_detached(p))) {
- struct signal_struct *psig;
- struct signal_struct *sig;
+ struct cdata *cd, *pcd, *tcd;
unsigned long maxrss;
cputime_t tgutime, tgstime;
@@ -1251,40 +1252,43 @@ static int wait_task_zombie(struct wait_
*/
thread_group_times(p, &tgutime, &tgstime);
spin_lock_irq(&p->real_parent->sighand->siglock);
- psig = p->real_parent->signal;
- sig = p->signal;
- psig->cutime =
- cputime_add(psig->cutime,
+ pcd = &p->real_parent->signal->cdata_wait;
+ tcd = &p->signal->cdata_threads;
+ cd = &p->signal->cdata_wait;
+
+ pcd->utime =
+ cputime_add(pcd->utime,
cputime_add(tgutime,
- sig->cutime));
- psig->cstime =
- cputime_add(psig->cstime,
+ cd->utime));
+ pcd->stime =
+ cputime_add(pcd->stime,
cputime_add(tgstime,
- sig->cstime));
- psig->cgtime =
- cputime_add(psig->cgtime,
+ cd->stime));
+ pcd->gtime =
+ cputime_add(pcd->gtime,
cputime_add(p->gtime,
- cputime_add(sig->gtime,
- sig->cgtime)));
- psig->cmin_flt +=
- p->min_flt + sig->min_flt + sig->cmin_flt;
- psig->cmaj_flt +=
- p->maj_flt + sig->maj_flt + sig->cmaj_flt;
- psig->cnvcsw +=
- p->nvcsw + sig->nvcsw + sig->cnvcsw;
- psig->cnivcsw +=
- p->nivcsw + sig->nivcsw + sig->cnivcsw;
- psig->cinblock +=
+ cputime_add(tcd->gtime,
+ cd->gtime)));
+ pcd->min_flt +=
+ p->min_flt + tcd->min_flt + cd->min_flt;
+ pcd->maj_flt +=
+ p->maj_flt + tcd->maj_flt + cd->maj_flt;
+ pcd->nvcsw +=
+ p->nvcsw + tcd->nvcsw + cd->nvcsw;
+ pcd->nivcsw +=
+ p->nivcsw + tcd->nivcsw + cd->nivcsw;
+ pcd->inblock +=
task_io_get_inblock(p) +
- sig->inblock + sig->cinblock;
- psig->coublock +=
+ tcd->inblock + cd->inblock;
+ pcd->oublock +=
task_io_get_oublock(p) +
- sig->oublock + sig->coublock;
- maxrss = max(sig->maxrss, sig->cmaxrss);
- if (psig->cmaxrss < maxrss)
- psig->cmaxrss = maxrss;
- task_io_accounting_add(&psig->ioac, &p->ioac);
- task_io_accounting_add(&psig->ioac, &sig->ioac);
+ tcd->oublock + cd->oublock;
+ maxrss = max(tcd->maxrss, cd->maxrss);
+ if (pcd->maxrss < maxrss)
+ pcd->maxrss = maxrss;
+ task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac);
+ task_io_accounting_add(&p->real_parent->signal->ioac,
+ &p->signal->ioac);
spin_unlock_irq(&p->real_parent->sighand->siglock);
}
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -232,12 +232,12 @@ static int cpu_clock_sample(const clocki
void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
{
- struct signal_struct *sig = tsk->signal;
+ struct cdata *tcd = &tsk->signal->cdata_threads;
struct task_struct *t;
- times->utime = sig->utime;
- times->stime = sig->stime;
- times->sum_exec_runtime = sig->sum_sched_runtime;
+ times->utime = tcd->utime;
+ times->stime = tcd->stime;
+ times->sum_exec_runtime = tsk->signal->sum_sched_runtime;
rcu_read_lock();
/* make sure we can trust tsk->thread_group list */
@@ -516,8 +516,8 @@ void posix_cpu_timers_exit_group(struct
struct signal_struct *const sig = tsk->signal;
cleanup_timers(tsk->signal->cpu_timers,
- cputime_add(tsk->utime, sig->utime),
- cputime_add(tsk->stime, sig->stime),
+ cputime_add(tsk->utime, sig->cdata_threads.utime),
+ cputime_add(tsk->stime, sig->cdata_threads.stime),
tsk->se.sum_exec_runtime + sig->sum_sched_runtime);
}
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -884,8 +884,8 @@ void do_sys_times(struct tms *tms)
spin_lock_irq(¤t->sighand->siglock);
thread_group_times(current, &tgutime, &tgstime);
- cutime = current->signal->cutime;
- cstime = current->signal->cstime;
+ cutime = current->signal->cdata_wait.utime;
+ cstime = current->signal->cdata_wait.stime;
spin_unlock_irq(¤t->sighand->siglock);
tms->tms_utime = cputime_to_clock_t(tgutime);
tms->tms_stime = cputime_to_clock_t(tgstime);
@@ -1490,6 +1490,7 @@ static void k_getrusage(struct task_stru
unsigned long flags;
cputime_t tgutime, tgstime, utime, stime;
unsigned long maxrss = 0;
+ struct cdata *cd;
memset((char *) r, 0, sizeof *r);
utime = stime = cputime_zero;
@@ -1497,7 +1498,7 @@ static void k_getrusage(struct task_stru
if (who == RUSAGE_THREAD) {
task_times(current, &utime, &stime);
accumulate_thread_rusage(p, r);
- maxrss = p->signal->maxrss;
+ maxrss = p->signal->cdata_threads.maxrss;
goto out;
}
@@ -1507,31 +1508,34 @@ static void k_getrusage(struct task_stru
switch (who) {
case RUSAGE_BOTH:
case RUSAGE_CHILDREN:
- utime = p->signal->cutime;
- stime = p->signal->cstime;
- r->ru_nvcsw = p->signal->cnvcsw;
- r->ru_nivcsw = p->signal->cnivcsw;
- r->ru_minflt = p->signal->cmin_flt;
- r->ru_majflt = p->signal->cmaj_flt;
- r->ru_inblock = p->signal->cinblock;
- r->ru_oublock = p->signal->coublock;
- maxrss = p->signal->cmaxrss;
+ cd = &p->signal->cdata_wait;
+ utime = cd->utime;
+ stime = cd->stime;
+ r->ru_nvcsw = cd->nvcsw;
+ r->ru_nivcsw = cd->nivcsw;
+ r->ru_minflt = cd->min_flt;
+ r->ru_majflt = cd->maj_flt;
+ r->ru_inblock = cd->inblock;
+ r->ru_oublock = cd->oublock;
+ maxrss = cd->maxrss;
if (who == RUSAGE_CHILDREN)
break;
case RUSAGE_SELF:
+ cd = &p->signal->cdata_threads;
thread_group_times(p, &tgutime, &tgstime);
utime = cputime_add(utime, tgutime);
stime = cputime_add(stime, tgstime);
- r->ru_nvcsw += p->signal->nvcsw;
- r->ru_nivcsw += p->signal->nivcsw;
- r->ru_minflt += p->signal->min_flt;
- r->ru_majflt += p->signal->maj_flt;
- r->ru_inblock += p->signal->inblock;
- r->ru_oublock += p->signal->oublock;
- if (maxrss < p->signal->maxrss)
- maxrss = p->signal->maxrss;
+ r->ru_nvcsw += cd->nvcsw;
+ r->ru_nivcsw += cd->nivcsw;
+ r->ru_minflt += cd->min_flt;
+ r->ru_majflt += cd->maj_flt;
+ r->ru_inblock += cd->inblock;
+ r->ru_oublock += cd->oublock;
+ if (maxrss < cd->maxrss)
+ maxrss = cd->maxrss;
+
t = p;
do {
accumulate_thread_rusage(t, r);
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 2/4] taskstats: Introduce __account_cdata() function
2010-11-19 20:11 [patch 0/4] taskstats: Improve cumulative time accounting Michael Holzheu
2010-11-19 20:11 ` [patch 1/4] taskstats: Introduce "struct cdata" Michael Holzheu
@ 2010-11-19 20:11 ` Michael Holzheu
2010-11-19 20:11 ` [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting Michael Holzheu
` (2 subsequent siblings)
4 siblings, 0 replies; 24+ messages in thread
From: Michael Holzheu @ 2010-11-19 20:11 UTC (permalink / raw)
To: Oleg Nesterov, Shailabh Nagar, Andrew Morton, Peter Zijlstra,
John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
Heiko Carstens, Roland McGrath
Cc: linux-kernel, linux-s390
[-- Attachment #1: 02-taskstats-top-improve-ctime-add_cdata-func.patch --]
[-- Type: text/plain, Size: 5417 bytes --]
From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
This patch introduces the function __account_cdata() that does the
cummulative resource accounting for dead processes in sys_wait().
No functional changes are done. This patch is a preparation for cdata_acct
accounting.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
kernel/exit.c | 133 +++++++++++++++++++++++++++++-----------------------------
1 file changed, 68 insertions(+), 65 deletions(-)
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -74,6 +74,72 @@ static void __unhash_process(struct task
list_del_rcu(&p->thread_group);
}
+static void __account_cdata(struct task_struct *p)
+{
+ struct cdata *cd, *pcd, *tcd;
+ unsigned long maxrss;
+ cputime_t tgutime, tgstime;
+
+ /*
+ * The resource counters for the group leader are in its
+ * own task_struct. Those for dead threads in the group
+ * are in its signal_struct, as are those for the child
+ * processes it has previously reaped. All these
+ * accumulate in the parent's signal_struct c* fields.
+ *
+ * We don't bother to take a lock here to protect these
+ * p->signal fields, because they are only touched by
+ * __exit_signal, which runs with tasklist_lock
+ * write-locked anyway, and so is excluded here. We do
+ * need to protect the access to parent->signal fields,
+ * as other threads in the parent group can be right
+ * here reaping other children at the same time.
+ *
+ * We use thread_group_times() to get times for the thread
+ * group, which consolidates times for all threads in the
+ * group including the group leader.
+ */
+ thread_group_times(p, &tgutime, &tgstime);
+ spin_lock_irq(&p->real_parent->sighand->siglock);
+ pcd = &p->real_parent->signal->cdata_wait;
+ tcd = &p->signal->cdata_threads;
+ cd = &p->signal->cdata_wait;
+
+ pcd->utime =
+ cputime_add(pcd->utime,
+ cputime_add(tgutime,
+ cd->utime));
+ pcd->stime =
+ cputime_add(pcd->stime,
+ cputime_add(tgstime,
+ cd->stime));
+ pcd->gtime =
+ cputime_add(pcd->gtime,
+ cputime_add(p->gtime,
+ cputime_add(tcd->gtime,
+ cd->gtime)));
+ pcd->min_flt +=
+ p->min_flt + tcd->min_flt + cd->min_flt;
+ pcd->maj_flt +=
+ p->maj_flt + tcd->maj_flt + cd->maj_flt;
+ pcd->nvcsw +=
+ p->nvcsw + tcd->nvcsw + cd->nvcsw;
+ pcd->nivcsw +=
+ p->nivcsw + tcd->nivcsw + cd->nivcsw;
+ pcd->inblock +=
+ task_io_get_inblock(p) +
+ tcd->inblock + cd->inblock;
+ pcd->oublock +=
+ task_io_get_oublock(p) +
+ tcd->oublock + cd->oublock;
+ maxrss = max(tcd->maxrss, cd->maxrss);
+ if (pcd->maxrss < maxrss)
+ pcd->maxrss = maxrss;
+ task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac);
+ task_io_accounting_add(&p->real_parent->signal->ioac, &p->signal->ioac);
+ spin_unlock_irq(&p->real_parent->sighand->siglock);
+}
+
/*
* This function expects the tasklist_lock write-locked.
*/
@@ -1226,71 +1292,8 @@ static int wait_task_zombie(struct wait_
* It can be ptraced but not reparented, check
* !task_detached() to filter out sub-threads.
*/
- if (likely(!traced) && likely(!task_detached(p))) {
- struct cdata *cd, *pcd, *tcd;
- unsigned long maxrss;
- cputime_t tgutime, tgstime;
-
- /*
- * The resource counters for the group leader are in its
- * own task_struct. Those for dead threads in the group
- * are in its signal_struct, as are those for the child
- * processes it has previously reaped. All these
- * accumulate in the parent's signal_struct c* fields.
- *
- * We don't bother to take a lock here to protect these
- * p->signal fields, because they are only touched by
- * __exit_signal, which runs with tasklist_lock
- * write-locked anyway, and so is excluded here. We do
- * need to protect the access to parent->signal fields,
- * as other threads in the parent group can be right
- * here reaping other children at the same time.
- *
- * We use thread_group_times() to get times for the thread
- * group, which consolidates times for all threads in the
- * group including the group leader.
- */
- thread_group_times(p, &tgutime, &tgstime);
- spin_lock_irq(&p->real_parent->sighand->siglock);
- pcd = &p->real_parent->signal->cdata_wait;
- tcd = &p->signal->cdata_threads;
- cd = &p->signal->cdata_wait;
-
- pcd->utime =
- cputime_add(pcd->utime,
- cputime_add(tgutime,
- cd->utime));
- pcd->stime =
- cputime_add(pcd->stime,
- cputime_add(tgstime,
- cd->stime));
- pcd->gtime =
- cputime_add(pcd->gtime,
- cputime_add(p->gtime,
- cputime_add(tcd->gtime,
- cd->gtime)));
- pcd->min_flt +=
- p->min_flt + tcd->min_flt + cd->min_flt;
- pcd->maj_flt +=
- p->maj_flt + tcd->maj_flt + cd->maj_flt;
- pcd->nvcsw +=
- p->nvcsw + tcd->nvcsw + cd->nvcsw;
- pcd->nivcsw +=
- p->nivcsw + tcd->nivcsw + cd->nivcsw;
- pcd->inblock +=
- task_io_get_inblock(p) +
- tcd->inblock + cd->inblock;
- pcd->oublock +=
- task_io_get_oublock(p) +
- tcd->oublock + cd->oublock;
- maxrss = max(tcd->maxrss, cd->maxrss);
- if (pcd->maxrss < maxrss)
- pcd->maxrss = maxrss;
- task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac);
- task_io_accounting_add(&p->real_parent->signal->ioac,
- &p->signal->ioac);
- spin_unlock_irq(&p->real_parent->sighand->siglock);
- }
+ if (likely(!traced) && likely(!task_detached(p)))
+ __account_cdata(p);
/*
* Now we are sure this task is interesting, and no other
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting
2010-11-19 20:11 [patch 0/4] taskstats: Improve cumulative time accounting Michael Holzheu
2010-11-19 20:11 ` [patch 1/4] taskstats: Introduce "struct cdata" Michael Holzheu
2010-11-19 20:11 ` [patch 2/4] taskstats: Introduce __account_cdata() function Michael Holzheu
@ 2010-11-19 20:11 ` Michael Holzheu
2010-11-23 16:59 ` Oleg Nesterov
2010-11-19 20:11 ` [patch 4/4] taskstats: Export "cdata_acct" with taskstats Michael Holzheu
2010-11-19 20:19 ` [patch 0/4] taskstats: Improve cumulative time accounting Peter Zijlstra
4 siblings, 1 reply; 24+ messages in thread
From: Michael Holzheu @ 2010-11-19 20:11 UTC (permalink / raw)
To: Oleg Nesterov, Shailabh Nagar, Andrew Morton, Peter Zijlstra,
John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
Heiko Carstens, Roland McGrath
Cc: linux-kernel, linux-s390
[-- Attachment #1: 03-taskstats-top-improve-ctime-account-cdata_acct.patch --]
[-- Type: text/plain, Size: 4388 bytes --]
From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Currently the cumulative time accounting in Linux is not complete.
Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
to the cumulative time of the parents, if the parents ignore SIGCHLD
or have set SA_NOCLDWAIT. This behaviour has the major drawback that
it is not possible to calculate all consumed CPU time of a system by
looking at the current tasks. CPU time can be lost.
This patch adds a new set of cumulative time counters. We then have two
cumulative counter sets:
* cdata_wait: Traditional cumulative time used e.g. by getrusage.
* cdata_acct: Cumulative time that also includes dead processes with
parents that ignore SIGCHLD or have set SA_NOCLDWAIT.
cdata_acct will be exported by taskstats.
TODO:
-----
With this patch we take the siglock twice. First for the dead task
and second for the parent of the dead task. This give the following
lockdep warning (probably a lockdep annotation is needed here):
=============================================
[ INFO: possible recursive locking detected ]
2.6.37-rc1-00116-g151f52f-dirty #19
---------------------------------------------
kworker/u:0/15 is trying to acquire lock:
(&(&sighand->siglock)->rlock){......}, at: [<000000000014a426>] __account_cdata+0x6e/0x444
but task is already holding lock:
(&(&sighand->siglock)->rlock){......}, at: [<000000000014b634>] release_task+0x160/0x6a0
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
include/linux/sched.h | 2 ++
kernel/exit.c | 36 +++++++++++++++++++++++++-----------
2 files changed, 27 insertions(+), 11 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -595,6 +595,8 @@ struct signal_struct {
*/
struct cdata cdata_wait;
struct cdata cdata_threads;
+ struct cdata cdata_acct;
+ struct task_io_accounting ioac_acct;
struct task_io_accounting ioac;
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
cputime_t prev_utime, prev_stime;
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -74,10 +74,10 @@ static void __unhash_process(struct task
list_del_rcu(&p->thread_group);
}
-static void __account_cdata(struct task_struct *p)
+static void __account_cdata(struct task_struct *p, int wait)
{
struct cdata *cd, *pcd, *tcd;
- unsigned long maxrss;
+ unsigned long maxrss, flags;
cputime_t tgutime, tgstime;
/*
@@ -100,11 +100,16 @@ static void __account_cdata(struct task_
* group including the group leader.
*/
thread_group_times(p, &tgutime, &tgstime);
- spin_lock_irq(&p->real_parent->sighand->siglock);
- pcd = &p->real_parent->signal->cdata_wait;
- tcd = &p->signal->cdata_threads;
- cd = &p->signal->cdata_wait;
-
+ spin_lock_irqsave(&p->real_parent->sighand->siglock, flags);
+ if (wait) {
+ pcd = &p->real_parent->signal->cdata_wait;
+ tcd = &p->signal->cdata_threads;
+ cd = &p->signal->cdata_wait;
+ } else {
+ pcd = &p->real_parent->signal->cdata_acct;
+ tcd = &p->signal->cdata_threads;
+ cd = &p->signal->cdata_acct;
+ }
pcd->utime =
cputime_add(pcd->utime,
cputime_add(tgutime,
@@ -135,9 +140,17 @@ static void __account_cdata(struct task_
maxrss = max(tcd->maxrss, cd->maxrss);
if (pcd->maxrss < maxrss)
pcd->maxrss = maxrss;
- task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac);
- task_io_accounting_add(&p->real_parent->signal->ioac, &p->signal->ioac);
- spin_unlock_irq(&p->real_parent->sighand->siglock);
+ if (wait) {
+ task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac);
+ task_io_accounting_add(&p->real_parent->signal->ioac,
+ &p->signal->ioac);
+ } else {
+ task_io_accounting_add(&p->real_parent->signal->ioac_acct,
+ &p->ioac);
+ task_io_accounting_add(&p->real_parent->signal->ioac_acct,
+ &p->signal->ioac_acct);
+ }
+ spin_unlock_irqrestore(&p->real_parent->sighand->siglock, flags);
}
/*
@@ -157,6 +170,7 @@ static void __exit_signal(struct task_st
posix_cpu_timers_exit(tsk);
if (group_dead) {
+ __account_cdata(tsk, 0);
posix_cpu_timers_exit_group(tsk);
tty = sig->tty;
sig->tty = NULL;
@@ -1293,7 +1307,7 @@ static int wait_task_zombie(struct wait_
* !task_detached() to filter out sub-threads.
*/
if (likely(!traced) && likely(!task_detached(p)))
- __account_cdata(p);
+ __account_cdata(p, 1);
/*
* Now we are sure this task is interesting, and no other
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 4/4] taskstats: Export "cdata_acct" with taskstats
2010-11-19 20:11 [patch 0/4] taskstats: Improve cumulative time accounting Michael Holzheu
` (2 preceding siblings ...)
2010-11-19 20:11 ` [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting Michael Holzheu
@ 2010-11-19 20:11 ` Michael Holzheu
2010-11-25 13:26 ` Oleg Nesterov
2010-11-25 16:57 ` Balbir Singh
2010-11-19 20:19 ` [patch 0/4] taskstats: Improve cumulative time accounting Peter Zijlstra
4 siblings, 2 replies; 24+ messages in thread
From: Michael Holzheu @ 2010-11-19 20:11 UTC (permalink / raw)
To: Oleg Nesterov, Shailabh Nagar, Andrew Morton, Peter Zijlstra,
John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
Heiko Carstens, Roland McGrath
Cc: linux-kernel, linux-s390
[-- Attachment #1: 04-taskstats-top-improve-ctime-taskstats.patch --]
[-- Type: text/plain, Size: 1637 bytes --]
From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
With this patch the (full) cumulative CPU time is added to "struct taskstats".
The CPU time is only returned for the thread group leader.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
include/linux/taskstats.h | 7 ++++++-
kernel/tsacct.c | 7 +++++++
2 files changed, 13 insertions(+), 1 deletion(-)
--- a/include/linux/taskstats.h
+++ b/include/linux/taskstats.h
@@ -33,7 +33,7 @@
*/
-#define TASKSTATS_VERSION 7
+#define TASKSTATS_VERSION 8
#define TS_COMM_LEN 32 /* should be >= TASK_COMM_LEN
* in linux/sched.h */
@@ -163,6 +163,11 @@ struct taskstats {
/* Delay waiting for memory reclaim */
__u64 freepages_count;
__u64 freepages_delay_total;
+ /* version 7 ends here */
+
+ /* All cumulative CPU time of dead children */
+ __u64 ac_cutime_acct; /* User CPU time [usec] */
+ __u64 ac_cstime_acct; /* System CPU time [usec] */
};
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -31,6 +31,7 @@ void bacct_add_tsk(struct taskstats *sta
const struct cred *tcred;
struct timespec uptime, ts;
u64 ac_etime;
+ unsigned long flags;
BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);
@@ -71,6 +72,12 @@ void bacct_add_tsk(struct taskstats *sta
stats->ac_majflt = tsk->maj_flt;
strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm));
+ if (tsk->tgid == tsk->pid && lock_task_sighand(tsk, &flags)) {
+ struct cdata *cd = &tsk->signal->cdata_acct;
+ stats->ac_cutime_acct = cputime_to_usecs(cd->utime);
+ stats->ac_cstime_acct = cputime_to_usecs(cd->stime);
+ unlock_task_sighand(tsk, &flags);
+ }
}
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 0/4] taskstats: Improve cumulative time accounting
2010-11-19 20:11 [patch 0/4] taskstats: Improve cumulative time accounting Michael Holzheu
` (3 preceding siblings ...)
2010-11-19 20:11 ` [patch 4/4] taskstats: Export "cdata_acct" with taskstats Michael Holzheu
@ 2010-11-19 20:19 ` Peter Zijlstra
2010-11-20 15:17 ` Oleg Nesterov
2010-11-22 11:03 ` Michael Holzheu
4 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2010-11-19 20:19 UTC (permalink / raw)
To: Michael Holzheu
Cc: Oleg Nesterov, Shailabh Nagar, Andrew Morton, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
On Fri, 2010-11-19 at 21:11 +0100, Michael Holzheu wrote:
> Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
> to the cumulative time of the parents, if the parents ignore SIGCHLD
> or have set SA_NOCLDWAIT. This behaviour has the major drawback that
> it is not possible to calculate all consumed CPU time of a system by
> looking at the current tasks. CPU time can be lost.
>
> To solve this problem, this patch set duplicates the cumulative accounting
> data in the signal_struct. In the second set (cdata_acct) the complete
> cumulative resource counters are stored. The new cumulative CPU time (utime
> and stime) is then exported via the taskstats interface.
Maybe this has been treated earlier in the threads and I missed it, but
the obvious solution doesn't get mentioned:
What would break if we violate this silly POSIX rule and account time of
childs regardless of SIGCHLD/SA_NOCLDWAIT?
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 0/4] taskstats: Improve cumulative time accounting
2010-11-19 20:19 ` [patch 0/4] taskstats: Improve cumulative time accounting Peter Zijlstra
@ 2010-11-20 15:17 ` Oleg Nesterov
2010-11-22 7:21 ` Balbir Singh
2010-11-22 11:03 ` Michael Holzheu
1 sibling, 1 reply; 24+ messages in thread
From: Oleg Nesterov @ 2010-11-20 15:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
On 11/19, Peter Zijlstra wrote:
>
> On Fri, 2010-11-19 at 21:11 +0100, Michael Holzheu wrote:
> > Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
> > to the cumulative time of the parents, if the parents ignore SIGCHLD
> > or have set SA_NOCLDWAIT. This behaviour has the major drawback that
> > it is not possible to calculate all consumed CPU time of a system by
> > looking at the current tasks. CPU time can be lost.
> >
> > To solve this problem, this patch set duplicates the cumulative accounting
> > data in the signal_struct. In the second set (cdata_acct) the complete
> > cumulative resource counters are stored. The new cumulative CPU time (utime
> > and stime) is then exported via the taskstats interface.
>
> Maybe this has been treated earlier in the threads and I missed it, but
> the obvious solution doesn't get mentioned:
IIRC, the first version did this.
And it was me who spoiled this approach. But! only because I wasn't sure
this user-visible change is acceptable, and because there was some
misunderstanding. See http://marc.info/?l=linux-kernel&m=128552495203050&w=2
But,
> What would break if we
say, any test-case which does getrusage() after fork() with ignored
SIGCHLD/SA_NOCLDWAIT?.
> violate this silly POSIX rule and account time of
> childs regardless of SIGCHLD/SA_NOCLDWAIT?
+1.
Personally, I'd certainly prefer this way, because I don't care about
POSIX at all ;)
Still. Once again, this breaks the current rules, and we never do
this without strong reason.
I think we should ask Roland. If he thinks this is OK, I'd certainly
agree.
Oleg.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 0/4] taskstats: Improve cumulative time accounting
2010-11-20 15:17 ` Oleg Nesterov
@ 2010-11-22 7:21 ` Balbir Singh
0 siblings, 0 replies; 24+ messages in thread
From: Balbir Singh @ 2010-11-22 7:21 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Peter Zijlstra, Michael Holzheu, Shailabh Nagar, Andrew Morton,
John stultz, Thomas Gleixner, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
* Oleg Nesterov <oleg@redhat.com> [2010-11-20 16:17:11]:
> On 11/19, Peter Zijlstra wrote:
> >
> > On Fri, 2010-11-19 at 21:11 +0100, Michael Holzheu wrote:
> > > Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
> > > to the cumulative time of the parents, if the parents ignore SIGCHLD
> > > or have set SA_NOCLDWAIT. This behaviour has the major drawback that
> > > it is not possible to calculate all consumed CPU time of a system by
> > > looking at the current tasks. CPU time can be lost.
> > >
> > > To solve this problem, this patch set duplicates the cumulative accounting
> > > data in the signal_struct. In the second set (cdata_acct) the complete
> > > cumulative resource counters are stored. The new cumulative CPU time (utime
> > > and stime) is then exported via the taskstats interface.
> >
> > Maybe this has been treated earlier in the threads and I missed it, but
> > the obvious solution doesn't get mentioned:
>
> IIRC, the first version did this.
>
> And it was me who spoiled this approach. But! only because I wasn't sure
> this user-visible change is acceptable, and because there was some
> misunderstanding. See http://marc.info/?l=linux-kernel&m=128552495203050&w=2
>
> But,
>
> > What would break if we
>
> say, any test-case which does getrusage() after fork() with ignored
> SIGCHLD/SA_NOCLDWAIT?.
>
> > violate this silly POSIX rule and account time of
> > childs regardless of SIGCHLD/SA_NOCLDWAIT?
>
> +1.
>
> Personally, I'd certainly prefer this way, because I don't care about
> POSIX at all ;)
>
>
> Still. Once again, this breaks the current rules, and we never do
> this without strong reason.
>
> I think we should ask Roland. If he thinks this is OK, I'd certainly
> agree.
>
The Linux man page states
In Linux kernel versions before 2.6.9, if the disposition of SIGCHLD
is set to SIG_IGN then the resource usages of child processes are
automatically included in the value returned by RUSAGE_CHILDREN,
although POSIX.1-2001 explicitly prohibits this. This nonconformance
is rectified in Linux 2.6.9 and later.
--
Three Cheers,
Balbir
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 0/4] taskstats: Improve cumulative time accounting
2010-11-19 20:19 ` [patch 0/4] taskstats: Improve cumulative time accounting Peter Zijlstra
2010-11-20 15:17 ` Oleg Nesterov
@ 2010-11-22 11:03 ` Michael Holzheu
2010-11-22 12:47 ` Michael Holzheu
1 sibling, 1 reply; 24+ messages in thread
From: Michael Holzheu @ 2010-11-22 11:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Oleg Nesterov, Shailabh Nagar, Andrew Morton, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
On Fri, 2010-11-19 at 21:19 +0100, Peter Zijlstra wrote:
> On Fri, 2010-11-19 at 21:11 +0100, Michael Holzheu wrote:
> > Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
> > to the cumulative time of the parents, if the parents ignore SIGCHLD
> > or have set SA_NOCLDWAIT. This behaviour has the major drawback that
> > it is not possible to calculate all consumed CPU time of a system by
> > looking at the current tasks. CPU time can be lost.
> >
> > To solve this problem, this patch set duplicates the cumulative accounting
> > data in the signal_struct. In the second set (cdata_acct) the complete
> > cumulative resource counters are stored. The new cumulative CPU time (utime
> > and stime) is then exported via the taskstats interface.
>
> Maybe this has been treated earlier in the threads and I missed it, but
> the obvious solution doesn't get mentioned:
>
> What would break if we violate this silly POSIX rule and account time of
> childs regardless of SIGCHLD/SA_NOCLDWAIT?
Or maybe we could add a sysctl that allows to switch between the two
semantics.
Michael
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 0/4] taskstats: Improve cumulative time accounting
2010-11-22 11:03 ` Michael Holzheu
@ 2010-11-22 12:47 ` Michael Holzheu
2010-11-22 18:11 ` Valdis.Kletnieks
0 siblings, 1 reply; 24+ messages in thread
From: Michael Holzheu @ 2010-11-22 12:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Oleg Nesterov, Shailabh Nagar, Andrew Morton, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
On Mon, 2010-11-22 at 12:03 +0100, Michael Holzheu wrote:
> Or maybe we could add a sysctl that allows to switch between the two
> semantics.
Then patch 03/04 would be something like the following:
---
Subject: taskstats: Introduce complete cumulative accounting
From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Currently the cumulative time accounting in Linux is not complete.
Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
to the cumulative time of the parents, if the parents ignore SIGCHLD
or have set SA_NOCLDWAIT. This behaviour has the major drawback that
it is not possible to calculate all consumed CPU time of a system by
looking at the current tasks. CPU time can be lost.
This patch adds a new sysctl "kernel.full_cdata" that allows to switch
between the POSIX behavior and complete cumulative accounting.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
include/linux/sched.h | 1 +
kernel/exit.c | 12 ++++++++----
kernel/sysctl.c | 7 +++++++
3 files changed, 16 insertions(+), 4 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1907,6 +1907,7 @@ enum sched_tunable_scaling {
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
+extern unsigned int full_cdata_enabled;
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -57,6 +57,8 @@
#include <asm/pgtable.h>
#include <asm/mmu_context.h>
+unsigned int full_cdata_enabled = 1;
+
static void exit_mm(struct task_struct * tsk);
static void __unhash_process(struct task_struct *p, bool group_dead)
@@ -77,7 +79,7 @@ static void __unhash_process(struct task
static void __account_cdata(struct task_struct *p)
{
struct cdata *cd, *pcd, *tcd;
- unsigned long maxrss;
+ unsigned long maxrss, flags;
cputime_t tgutime, tgstime;
/*
@@ -100,7 +102,7 @@ static void __account_cdata(struct task_
* group including the group leader.
*/
thread_group_times(p, &tgutime, &tgstime);
- spin_lock_irq(&p->real_parent->sighand->siglock);
+ spin_lock_irqsave(&p->real_parent->sighand->siglock, flags);
pcd = &p->real_parent->signal->cdata_wait;
tcd = &p->signal->cdata_threads;
cd = &p->signal->cdata_wait;
@@ -137,7 +139,7 @@ static void __account_cdata(struct task_
pcd->maxrss = maxrss;
task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac);
task_io_accounting_add(&p->real_parent->signal->ioac, &p->signal->ioac);
- spin_unlock_irq(&p->real_parent->sighand->siglock);
+ spin_unlock_irqrestore(&p->real_parent->sighand->siglock, flags);
}
/*
@@ -157,6 +159,8 @@ static void __exit_signal(struct task_st
posix_cpu_timers_exit(tsk);
if (group_dead) {
+ if (full_cdata_enabled)
+ __account_cdata(tsk);
posix_cpu_timers_exit_group(tsk);
tty = sig->tty;
sig->tty = NULL;
@@ -1292,7 +1296,7 @@ static int wait_task_zombie(struct wait_
* It can be ptraced but not reparented, check
* !task_detached() to filter out sub-threads.
*/
- if (likely(!traced) && likely(!task_detached(p)))
+ if (likely(!traced) && likely(!task_detached(p)) && !full_cdata_enabled)
__account_cdata(p);
/*
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -963,6 +963,13 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
#endif
+ {
+ .procname = "full_cdata",
+ .data = &full_cdata_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 0/4] taskstats: Improve cumulative time accounting
2010-11-22 12:47 ` Michael Holzheu
@ 2010-11-22 18:11 ` Valdis.Kletnieks
0 siblings, 0 replies; 24+ messages in thread
From: Valdis.Kletnieks @ 2010-11-22 18:11 UTC (permalink / raw)
To: holzheu
Cc: Peter Zijlstra, Oleg Nesterov, Shailabh Nagar, Andrew Morton,
John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
Heiko Carstens, Roland McGrath, linux-kernel, linux-s390
[-- Attachment #1: Type: text/plain, Size: 761 bytes --]
On Mon, 22 Nov 2010 13:47:55 +0100, Michael Holzheu said:
> Currently the cumulative time accounting in Linux is not complete.
> Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
> to the cumulative time of the parents, if the parents ignore SIGCHLD
> or have set SA_NOCLDWAIT. This behaviour has the major drawback that
> it is not possible to calculate all consumed CPU time of a system by
> looking at the current tasks. CPU time can be lost.
>
> This patch adds a new sysctl "kernel.full_cdata" that allows to switch
> between the POSIX behavior and complete cumulative accounting.
> +unsigned int full_cdata_enabled = 1;
This probably needs to default to "current kernel behavior" but allow
sysadmins to change to the new behavior.
[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting
2010-11-19 20:11 ` [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting Michael Holzheu
@ 2010-11-23 16:59 ` Oleg Nesterov
2010-11-25 9:40 ` Michael Holzheu
0 siblings, 1 reply; 24+ messages in thread
From: Oleg Nesterov @ 2010-11-23 16:59 UTC (permalink / raw)
To: Michael Holzheu
Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
On 11/19, Michael Holzheu wrote:
>
> Currently the cumulative time accounting in Linux is not complete.
> Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
> to the cumulative time of the parents, if the parents ignore SIGCHLD
> or have set SA_NOCLDWAIT. This behaviour has the major drawback that
> it is not possible to calculate all consumed CPU time of a system by
> looking at the current tasks. CPU time can be lost.
>
> This patch adds a new set of cumulative time counters. We then have two
> cumulative counter sets:
>
> * cdata_wait: Traditional cumulative time used e.g. by getrusage.
> * cdata_acct: Cumulative time that also includes dead processes with
> parents that ignore SIGCHLD or have set SA_NOCLDWAIT.
> cdata_acct will be exported by taskstats.
Looks correct at first glance. A couple of nits below.
> TODO:
> -----
> With this patch we take the siglock twice. First for the dead task
> and second for the parent of the dead task. This give the following
> lockdep warning (probably a lockdep annotation is needed here):
And we already discussed this ;) We do not need 2 siglock's, only
parent's. Just move the callsite in __exit_signal() down, under
another (lockless) group_dead check.
Or I missed something?
> @@ -595,6 +595,8 @@ struct signal_struct {
> */
> struct cdata cdata_wait;
> struct cdata cdata_threads;
> + struct cdata cdata_acct;
> + struct task_io_accounting ioac_acct;
> struct task_io_accounting ioac;
Given that task_io_accounting is Linux specific, perhaps we can use
signal->ioac in both cases?
Yes, this is a user-visible change anyway. But, at least we can
forget about POSIX.
> + spin_lock_irqsave(&p->real_parent->sighand->siglock, flags);
> + if (wait) {
> + pcd = &p->real_parent->signal->cdata_wait;
> + tcd = &p->signal->cdata_threads;
> + cd = &p->signal->cdata_wait;
> + } else {
> + pcd = &p->real_parent->signal->cdata_acct;
> + tcd = &p->signal->cdata_threads;
> + cd = &p->signal->cdata_acct;
> + }
We can do this before taking ->siglock. Not that I think this really
matters, but otherwise this looks a bit confusing imho, as if we need
parent's ->siglock to pin something.
And thanks for splitting these changes. It was much, much easier to
read now.
Oleg.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting
2010-11-23 16:59 ` Oleg Nesterov
@ 2010-11-25 9:40 ` Michael Holzheu
2010-11-25 13:21 ` Oleg Nesterov
0 siblings, 1 reply; 24+ messages in thread
From: Michael Holzheu @ 2010-11-25 9:40 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
Hello Oleg,
On Tue, 2010-11-23 at 17:59 +0100, Oleg Nesterov wrote:
> On 11/19, Michael Holzheu wrote:
> > TODO:
> > -----
> > With this patch we take the siglock twice. First for the dead task
> > and second for the parent of the dead task. This give the following
> > lockdep warning (probably a lockdep annotation is needed here):
>
> And we already discussed this ;) We do not need 2 siglock's, only
> parent's. Just move the callsite in __exit_signal() down, under
> another (lockless) group_dead check.
>
> Or I missed something?
The problem with moving this down to the second group_dead check is that
after __unhash_process() is called, pid_alive(tsk) which is checked in
thread_group_cputime() returns false. Therefore we always get zero CPU
times.
So I probably have to introduce a second group_dead check at the
beginning of __exit_signal():
@@ -150,6 +153,9 @@ static void __exit_signal(struct task_st
struct sighand_struct *sighand;
struct tty_struct *uninitialized_var(tty);
+ if (group_dead)
+ __account_cdata(...);
+
sighand = rcu_dereference_check(tsk->sighand,
rcu_read_lock_held() ||
> We can do this before taking ->siglock. Not that I think this really
> matters, but otherwise this looks a bit confusing imho, as if we need
> parent's ->siglock to pin something.
ok
>
>
> And thanks for splitting these changes. It was much, much easier to
> read now.
My personal feeling is that probably the only acceptable thing would be
to make the new behavior configurable with a sysctl and define the
default as it currently is (POSIX compliant).
This would only introduce two additional checks in __exit_signal() and
wait_task_zombie() and would not add any new fields to the
signal_struct.
Michael
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 1/4] taskstats: Introduce "struct cdata"
2010-11-19 20:11 ` [patch 1/4] taskstats: Introduce "struct cdata" Michael Holzheu
@ 2010-11-25 12:29 ` Balbir Singh
2010-11-25 14:23 ` Oleg Nesterov
1 sibling, 0 replies; 24+ messages in thread
From: Balbir Singh @ 2010-11-25 12:29 UTC (permalink / raw)
To: Michael Holzheu
Cc: Oleg Nesterov, Shailabh Nagar, Andrew Morton, Peter Zijlstra,
John stultz, Thomas Gleixner, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
* Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-11-19 21:11:09]:
> From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
>
> This patch introduces a new structure "struct cdata" that is used to
> store cumulative resource counters for dead child processes and threads.
>
> Note that there is one asymmetry:
> For "struct task_io_accounting" (ioc) there is no extra accounting field for
> dead threads. One field is used for both, dead processes and threads.
>
> This patch introduces no functional change.
>
> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> ---
> fs/binfmt_elf.c | 4 +-
> fs/exec.c | 2 -
> fs/proc/array.c | 16 ++++----
> include/linux/sched.h | 22 +++++++----
> kernel/exit.c | 86 ++++++++++++++++++++++++----------------------
> kernel/posix-cpu-timers.c | 12 +++---
> kernel/sys.c | 44 ++++++++++++-----------
> 7 files changed, 100 insertions(+), 86 deletions(-)
Looks good to me
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
--
Three Cheers,
Balbir
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting
2010-11-25 9:40 ` Michael Holzheu
@ 2010-11-25 13:21 ` Oleg Nesterov
2010-11-25 17:45 ` Michael Holzheu
0 siblings, 1 reply; 24+ messages in thread
From: Oleg Nesterov @ 2010-11-25 13:21 UTC (permalink / raw)
To: Michael Holzheu
Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
On 11/25, Michael Holzheu wrote:
>
> Hello Oleg,
>
> On Tue, 2010-11-23 at 17:59 +0100, Oleg Nesterov wrote:
> > On 11/19, Michael Holzheu wrote:
> > > TODO:
> > > -----
> > > With this patch we take the siglock twice. First for the dead task
> > > and second for the parent of the dead task. This give the following
> > > lockdep warning (probably a lockdep annotation is needed here):
> >
> > And we already discussed this ;) We do not need 2 siglock's, only
> > parent's. Just move the callsite in __exit_signal() down, under
> > another (lockless) group_dead check.
> >
> > Or I missed something?
>
> The problem with moving this down to the second group_dead check is that
> after __unhash_process() is called, pid_alive(tsk) which is checked in
> thread_group_cputime() returns false. Therefore we always get zero CPU
> times.
I see, thanks.
> So I probably have to introduce a second group_dead check at the
> beginning of __exit_signal():
Probably...
But in fact this reminds we should cleanup this code somehow.
By the time we call thread_group_times() there are no other
threads.
> My personal feeling is that probably the only acceptable thing would be
> to make the new behavior configurable with a sysctl and define the
> default as it currently is (POSIX compliant).
>
> This would only introduce two additional checks in __exit_signal() and
> wait_task_zombie() and would not add any new fields to the
> signal_struct.
Yeah, it would be nice to avoid new fields.
Hmm. Somehow I forgot about 4/4, please see another email...
Oleg.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 4/4] taskstats: Export "cdata_acct" with taskstats
2010-11-19 20:11 ` [patch 4/4] taskstats: Export "cdata_acct" with taskstats Michael Holzheu
@ 2010-11-25 13:26 ` Oleg Nesterov
2010-11-25 17:21 ` Michael Holzheu
2010-11-25 16:57 ` Balbir Singh
1 sibling, 1 reply; 24+ messages in thread
From: Oleg Nesterov @ 2010-11-25 13:26 UTC (permalink / raw)
To: Michael Holzheu
Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
On 11/19, Michael Holzheu wrote:
>
> From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
>
> With this patch the (full) cumulative CPU time is added to "struct taskstats".
> The CPU time is only returned for the thread group leader.
>
> ...
>
> + if (tsk->tgid == tsk->pid
thread_group_leader() ?
> && lock_task_sighand(tsk, &flags)) {
Do you really need ->siglock? Starting from 2.6.35 it is always
safe to access ->signal.
Oleg.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 1/4] taskstats: Introduce "struct cdata"
2010-11-19 20:11 ` [patch 1/4] taskstats: Introduce "struct cdata" Michael Holzheu
2010-11-25 12:29 ` Balbir Singh
@ 2010-11-25 14:23 ` Oleg Nesterov
2010-11-25 16:38 ` Michael Holzheu
1 sibling, 1 reply; 24+ messages in thread
From: Oleg Nesterov @ 2010-11-25 14:23 UTC (permalink / raw)
To: Michael Holzheu
Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
On 11/19, Michael Holzheu wrote:
>
> From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
>
> This patch introduces a new structure "struct cdata" that is used to
> store cumulative resource counters for dead child processes and threads.
>
> Note that there is one asymmetry:
> For "struct task_io_accounting" (ioc) there is no extra accounting field for
> dead threads. One field is used for both, dead processes and threads.
>
> This patch introduces no functional change.
>
> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> ---
> fs/binfmt_elf.c | 4 +-
> fs/exec.c | 2 -
> fs/proc/array.c | 16 ++++----
> include/linux/sched.h | 22 +++++++----
> kernel/exit.c | 86 ++++++++++++++++++++++++----------------------
> kernel/posix-cpu-timers.c | 12 +++---
> kernel/sys.c | 44 ++++++++++++-----------
> 7 files changed, 100 insertions(+), 86 deletions(-)
Looks good. In fact, to me it looks like a cleanup.
But. You seem to forgot to change kernel/signal.c, no?
And cosmetic nit,
> void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> {
> - struct signal_struct *sig = tsk->signal;
> + struct cdata *tcd = &tsk->signal->cdata_threads;
> struct task_struct *t;
>
> - times->utime = sig->utime;
> - times->stime = sig->stime;
> - times->sum_exec_runtime = sig->sum_sched_runtime;
> + times->utime = tcd->utime;
> + times->stime = tcd->stime;
> + times->sum_exec_runtime = tsk->signal->sum_sched_runtime;
Feel free to ignore, but I don't understand why you removed "sig".
Afaics,
- times->utime = sig->utime;
- times->stime = sig->stime;
+ times->utime = sig->cdata_threads->utime;
+ times->stime = sig->cdata_threads->stime;
looks a bit better.
Oleg.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 1/4] taskstats: Introduce "struct cdata"
2010-11-25 14:23 ` Oleg Nesterov
@ 2010-11-25 16:38 ` Michael Holzheu
0 siblings, 0 replies; 24+ messages in thread
From: Michael Holzheu @ 2010-11-25 16:38 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
Hello Oleg,
On Thu, 2010-11-25 at 15:23 +0100, Oleg Nesterov wrote:
> On 11/19, Michael Holzheu wrote:
> > Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> > ---
> > fs/binfmt_elf.c | 4 +-
> > fs/exec.c | 2 -
> > fs/proc/array.c | 16 ++++----
> > include/linux/sched.h | 22 +++++++----
> > kernel/exit.c | 86 ++++++++++++++++++++++++----------------------
> > kernel/posix-cpu-timers.c | 12 +++---
> > kernel/sys.c | 44 ++++++++++++-----------
> > 7 files changed, 100 insertions(+), 86 deletions(-)
>
> Looks good. In fact, to me it looks like a cleanup.
>
> But. You seem to forgot to change kernel/signal.c, no?
Yes, you are right.
>
> And cosmetic nit,
>
> > void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> > {
> > - struct signal_struct *sig = tsk->signal;
> > + struct cdata *tcd = &tsk->signal->cdata_threads;
> > struct task_struct *t;
> >
> > - times->utime = sig->utime;
> > - times->stime = sig->stime;
> > - times->sum_exec_runtime = sig->sum_sched_runtime;
> > + times->utime = tcd->utime;
> > + times->stime = tcd->stime;
> > + times->sum_exec_runtime = tsk->signal->sum_sched_runtime;
>
> Feel free to ignore, but I don't understand why you removed "sig".
> Afaics,
>
> - times->utime = sig->utime;
> - times->stime = sig->stime;
> + times->utime = sig->cdata_threads->utime;
> + times->stime = sig->cdata_threads->stime;
>
> looks a bit better.
Yes, this looks better.
I attached the corrected patch.
Michael
---
Subject: taskstats: Introduce "struct cdata"
From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
This patch introduces a new structure "struct cdata" that is used to
store cumulative resource counters for dead child processes and threads.
Note that there is one asymmetry:
For "struct task_io_accounting" (ioc) there is no extra accounting field for
dead threads. One field is used for both dead processes and threads.
This patch introduces no functional change.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
fs/binfmt_elf.c | 4 +-
fs/exec.c | 2 -
fs/proc/array.c | 16 ++++----
include/linux/sched.h | 22 +++++++----
kernel/exit.c | 86 ++++++++++++++++++++++++----------------------
kernel/posix-cpu-timers.c | 8 ++--
kernel/signal.c | 4 +-
kernel/sys.c | 44 ++++++++++++-----------
8 files changed, 100 insertions(+), 86 deletions(-)
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1296,8 +1296,8 @@ static void fill_prstatus(struct elf_prs
cputime_to_timeval(p->utime, &prstatus->pr_utime);
cputime_to_timeval(p->stime, &prstatus->pr_stime);
}
- cputime_to_timeval(p->signal->cutime, &prstatus->pr_cutime);
- cputime_to_timeval(p->signal->cstime, &prstatus->pr_cstime);
+ cputime_to_timeval(p->signal->cdata_wait.utime, &prstatus->pr_cutime);
+ cputime_to_timeval(p->signal->cdata_wait.stime, &prstatus->pr_cstime);
}
static int fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p,
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -894,7 +894,7 @@ static int de_thread(struct task_struct
no_thread_group:
if (current->mm)
- setmax_mm_hiwater_rss(&sig->maxrss, current->mm);
+ setmax_mm_hiwater_rss(&sig->cdata_threads.maxrss, current->mm);
exit_itimers(sig);
flush_itimer_signals();
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -413,11 +413,11 @@ static int do_task_stat(struct seq_file
num_threads = get_nr_threads(task);
collect_sigign_sigcatch(task, &sigign, &sigcatch);
- cmin_flt = sig->cmin_flt;
- cmaj_flt = sig->cmaj_flt;
- cutime = sig->cutime;
- cstime = sig->cstime;
- cgtime = sig->cgtime;
+ cmin_flt = sig->cdata_wait.min_flt;
+ cmaj_flt = sig->cdata_wait.maj_flt;
+ cutime = sig->cdata_wait.utime;
+ cstime = sig->cdata_wait.stime;
+ cgtime = sig->cdata_wait.gtime;
rsslim = ACCESS_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur);
/* add up live thread stats at the group level */
@@ -430,10 +430,10 @@ static int do_task_stat(struct seq_file
t = next_thread(t);
} while (t != task);
- min_flt += sig->min_flt;
- maj_flt += sig->maj_flt;
+ min_flt += sig->cdata_threads.min_flt;
+ maj_flt += sig->cdata_threads.maj_flt;
thread_group_times(task, &utime, &stime);
- gtime = cputime_add(gtime, sig->gtime);
+ gtime = cputime_add(gtime, sig->cdata_threads.gtime);
}
sid = task_session_nr_ns(task, ns);
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -510,6 +510,17 @@ struct thread_group_cputimer {
};
/*
+ * Cumulative resource counters
+ */
+struct cdata {
+ cputime_t utime, stime, gtime;
+ unsigned long nvcsw, nivcsw;
+ unsigned long min_flt, maj_flt;
+ unsigned long inblock, oublock;
+ unsigned long maxrss;
+};
+
+/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
* implies a shared sighand_struct, so locking
@@ -582,17 +593,12 @@ struct signal_struct {
* Live threads maintain their own counters and add to these
* in __exit_signal, except for the group leader.
*/
- cputime_t utime, stime, cutime, cstime;
- cputime_t gtime;
- cputime_t cgtime;
+ struct cdata cdata_wait;
+ struct cdata cdata_threads;
+ struct task_io_accounting ioac;
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
cputime_t prev_utime, prev_stime;
#endif
- unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
- unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
- unsigned long inblock, oublock, cinblock, coublock;
- unsigned long maxrss, cmaxrss;
- struct task_io_accounting ioac;
/*
* Cumulative ns of schedule CPU time fo dead threads in the
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -95,6 +95,7 @@ static void __exit_signal(struct task_st
tty = sig->tty;
sig->tty = NULL;
} else {
+ struct cdata *tcd = &sig->cdata_threads;
/*
* This can only happen if the caller is de_thread().
* FIXME: this is the temporary hack, we should teach
@@ -122,15 +123,15 @@ static void __exit_signal(struct task_st
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
- sig->utime = cputime_add(sig->utime, tsk->utime);
- sig->stime = cputime_add(sig->stime, tsk->stime);
- sig->gtime = cputime_add(sig->gtime, tsk->gtime);
- sig->min_flt += tsk->min_flt;
- sig->maj_flt += tsk->maj_flt;
- sig->nvcsw += tsk->nvcsw;
- sig->nivcsw += tsk->nivcsw;
- sig->inblock += task_io_get_inblock(tsk);
- sig->oublock += task_io_get_oublock(tsk);
+ tcd->utime = cputime_add(tcd->utime, tsk->utime);
+ tcd->stime = cputime_add(tcd->stime, tsk->stime);
+ tcd->gtime = cputime_add(tcd->gtime, tsk->gtime);
+ tcd->min_flt += tsk->min_flt;
+ tcd->maj_flt += tsk->maj_flt;
+ tcd->nvcsw += tsk->nvcsw;
+ tcd->nivcsw += tsk->nivcsw;
+ tcd->inblock += task_io_get_inblock(tsk);
+ tcd->oublock += task_io_get_oublock(tsk);
task_io_accounting_add(&sig->ioac, &tsk->ioac);
sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
}
@@ -960,10 +961,11 @@ NORET_TYPE void do_exit(long code)
sync_mm_rss(tsk, tsk->mm);
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
+ struct cdata *tcd = &tsk->signal->cdata_threads;
hrtimer_cancel(&tsk->signal->real_timer);
exit_itimers(tsk->signal);
if (tsk->mm)
- setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm);
+ setmax_mm_hiwater_rss(&tcd->maxrss, tsk->mm);
}
acct_collect(code, group_dead);
if (group_dead)
@@ -1225,8 +1227,7 @@ static int wait_task_zombie(struct wait_
* !task_detached() to filter out sub-threads.
*/
if (likely(!traced) && likely(!task_detached(p))) {
- struct signal_struct *psig;
- struct signal_struct *sig;
+ struct cdata *cd, *pcd, *tcd;
unsigned long maxrss;
cputime_t tgutime, tgstime;
@@ -1251,40 +1252,43 @@ static int wait_task_zombie(struct wait_
*/
thread_group_times(p, &tgutime, &tgstime);
spin_lock_irq(&p->real_parent->sighand->siglock);
- psig = p->real_parent->signal;
- sig = p->signal;
- psig->cutime =
- cputime_add(psig->cutime,
+ pcd = &p->real_parent->signal->cdata_wait;
+ tcd = &p->signal->cdata_threads;
+ cd = &p->signal->cdata_wait;
+
+ pcd->utime =
+ cputime_add(pcd->utime,
cputime_add(tgutime,
- sig->cutime));
- psig->cstime =
- cputime_add(psig->cstime,
+ cd->utime));
+ pcd->stime =
+ cputime_add(pcd->stime,
cputime_add(tgstime,
- sig->cstime));
- psig->cgtime =
- cputime_add(psig->cgtime,
+ cd->stime));
+ pcd->gtime =
+ cputime_add(pcd->gtime,
cputime_add(p->gtime,
- cputime_add(sig->gtime,
- sig->cgtime)));
- psig->cmin_flt +=
- p->min_flt + sig->min_flt + sig->cmin_flt;
- psig->cmaj_flt +=
- p->maj_flt + sig->maj_flt + sig->cmaj_flt;
- psig->cnvcsw +=
- p->nvcsw + sig->nvcsw + sig->cnvcsw;
- psig->cnivcsw +=
- p->nivcsw + sig->nivcsw + sig->cnivcsw;
- psig->cinblock +=
+ cputime_add(tcd->gtime,
+ cd->gtime)));
+ pcd->min_flt +=
+ p->min_flt + tcd->min_flt + cd->min_flt;
+ pcd->maj_flt +=
+ p->maj_flt + tcd->maj_flt + cd->maj_flt;
+ pcd->nvcsw +=
+ p->nvcsw + tcd->nvcsw + cd->nvcsw;
+ pcd->nivcsw +=
+ p->nivcsw + tcd->nivcsw + cd->nivcsw;
+ pcd->inblock +=
task_io_get_inblock(p) +
- sig->inblock + sig->cinblock;
- psig->coublock +=
+ tcd->inblock + cd->inblock;
+ pcd->oublock +=
task_io_get_oublock(p) +
- sig->oublock + sig->coublock;
- maxrss = max(sig->maxrss, sig->cmaxrss);
- if (psig->cmaxrss < maxrss)
- psig->cmaxrss = maxrss;
- task_io_accounting_add(&psig->ioac, &p->ioac);
- task_io_accounting_add(&psig->ioac, &sig->ioac);
+ tcd->oublock + cd->oublock;
+ maxrss = max(tcd->maxrss, cd->maxrss);
+ if (pcd->maxrss < maxrss)
+ pcd->maxrss = maxrss;
+ task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac);
+ task_io_accounting_add(&p->real_parent->signal->ioac,
+ &p->signal->ioac);
spin_unlock_irq(&p->real_parent->sighand->siglock);
}
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -235,8 +235,8 @@ void thread_group_cputime(struct task_st
struct signal_struct *sig = tsk->signal;
struct task_struct *t;
- times->utime = sig->utime;
- times->stime = sig->stime;
+ times->utime = sig->cdata_threads.utime;
+ times->stime = sig->cdata_threads.stime;
times->sum_exec_runtime = sig->sum_sched_runtime;
rcu_read_lock();
@@ -516,8 +516,8 @@ void posix_cpu_timers_exit_group(struct
struct signal_struct *const sig = tsk->signal;
cleanup_timers(tsk->signal->cpu_timers,
- cputime_add(tsk->utime, sig->utime),
- cputime_add(tsk->stime, sig->stime),
+ cputime_add(tsk->utime, sig->cdata_threads.utime),
+ cputime_add(tsk->stime, sig->cdata_threads.stime),
tsk->se.sum_exec_runtime + sig->sum_sched_runtime);
}
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1476,9 +1476,9 @@ int do_notify_parent(struct task_struct
rcu_read_unlock();
info.si_utime = cputime_to_clock_t(cputime_add(tsk->utime,
- tsk->signal->utime));
+ tsk->signal->cdata_threads.utime));
info.si_stime = cputime_to_clock_t(cputime_add(tsk->stime,
- tsk->signal->stime));
+ tsk->signal->cdata_threads.stime));
info.si_status = tsk->exit_code & 0x7f;
if (tsk->exit_code & 0x80)
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -884,8 +884,8 @@ void do_sys_times(struct tms *tms)
spin_lock_irq(¤t->sighand->siglock);
thread_group_times(current, &tgutime, &tgstime);
- cutime = current->signal->cutime;
- cstime = current->signal->cstime;
+ cutime = current->signal->cdata_wait.utime;
+ cstime = current->signal->cdata_wait.stime;
spin_unlock_irq(¤t->sighand->siglock);
tms->tms_utime = cputime_to_clock_t(tgutime);
tms->tms_stime = cputime_to_clock_t(tgstime);
@@ -1490,6 +1490,7 @@ static void k_getrusage(struct task_stru
unsigned long flags;
cputime_t tgutime, tgstime, utime, stime;
unsigned long maxrss = 0;
+ struct cdata *cd;
memset((char *) r, 0, sizeof *r);
utime = stime = cputime_zero;
@@ -1497,7 +1498,7 @@ static void k_getrusage(struct task_stru
if (who == RUSAGE_THREAD) {
task_times(current, &utime, &stime);
accumulate_thread_rusage(p, r);
- maxrss = p->signal->maxrss;
+ maxrss = p->signal->cdata_threads.maxrss;
goto out;
}
@@ -1507,31 +1508,34 @@ static void k_getrusage(struct task_stru
switch (who) {
case RUSAGE_BOTH:
case RUSAGE_CHILDREN:
- utime = p->signal->cutime;
- stime = p->signal->cstime;
- r->ru_nvcsw = p->signal->cnvcsw;
- r->ru_nivcsw = p->signal->cnivcsw;
- r->ru_minflt = p->signal->cmin_flt;
- r->ru_majflt = p->signal->cmaj_flt;
- r->ru_inblock = p->signal->cinblock;
- r->ru_oublock = p->signal->coublock;
- maxrss = p->signal->cmaxrss;
+ cd = &p->signal->cdata_wait;
+ utime = cd->utime;
+ stime = cd->stime;
+ r->ru_nvcsw = cd->nvcsw;
+ r->ru_nivcsw = cd->nivcsw;
+ r->ru_minflt = cd->min_flt;
+ r->ru_majflt = cd->maj_flt;
+ r->ru_inblock = cd->inblock;
+ r->ru_oublock = cd->oublock;
+ maxrss = cd->maxrss;
if (who == RUSAGE_CHILDREN)
break;
case RUSAGE_SELF:
+ cd = &p->signal->cdata_threads;
thread_group_times(p, &tgutime, &tgstime);
utime = cputime_add(utime, tgutime);
stime = cputime_add(stime, tgstime);
- r->ru_nvcsw += p->signal->nvcsw;
- r->ru_nivcsw += p->signal->nivcsw;
- r->ru_minflt += p->signal->min_flt;
- r->ru_majflt += p->signal->maj_flt;
- r->ru_inblock += p->signal->inblock;
- r->ru_oublock += p->signal->oublock;
- if (maxrss < p->signal->maxrss)
- maxrss = p->signal->maxrss;
+ r->ru_nvcsw += cd->nvcsw;
+ r->ru_nivcsw += cd->nivcsw;
+ r->ru_minflt += cd->min_flt;
+ r->ru_majflt += cd->maj_flt;
+ r->ru_inblock += cd->inblock;
+ r->ru_oublock += cd->oublock;
+ if (maxrss < cd->maxrss)
+ maxrss = cd->maxrss;
+
t = p;
do {
accumulate_thread_rusage(t, r);
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 4/4] taskstats: Export "cdata_acct" with taskstats
2010-11-19 20:11 ` [patch 4/4] taskstats: Export "cdata_acct" with taskstats Michael Holzheu
2010-11-25 13:26 ` Oleg Nesterov
@ 2010-11-25 16:57 ` Balbir Singh
1 sibling, 0 replies; 24+ messages in thread
From: Balbir Singh @ 2010-11-25 16:57 UTC (permalink / raw)
To: Michael Holzheu
Cc: Oleg Nesterov, Shailabh Nagar, Andrew Morton, Peter Zijlstra,
John stultz, Thomas Gleixner, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
* Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-11-19 21:11:12]:
> From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
>
> With this patch the (full) cumulative CPU time is added to "struct taskstats".
> The CPU time is only returned for the thread group leader.
>
> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> ---
> include/linux/taskstats.h | 7 ++++++-
> kernel/tsacct.c | 7 +++++++
> 2 files changed, 13 insertions(+), 1 deletion(-)
>
> --- a/include/linux/taskstats.h
> +++ b/include/linux/taskstats.h
> @@ -33,7 +33,7 @@
> */
>
>
> -#define TASKSTATS_VERSION 7
> +#define TASKSTATS_VERSION 8
> #define TS_COMM_LEN 32 /* should be >= TASK_COMM_LEN
> * in linux/sched.h */
>
> @@ -163,6 +163,11 @@ struct taskstats {
> /* Delay waiting for memory reclaim */
> __u64 freepages_count;
> __u64 freepages_delay_total;
> + /* version 7 ends here */
> +
> + /* All cumulative CPU time of dead children */
> + __u64 ac_cutime_acct; /* User CPU time [usec] */
> + __u64 ac_cstime_acct; /* System CPU time [usec] */
> };
>
>
> --- a/kernel/tsacct.c
> +++ b/kernel/tsacct.c
> @@ -31,6 +31,7 @@ void bacct_add_tsk(struct taskstats *sta
> const struct cred *tcred;
> struct timespec uptime, ts;
> u64 ac_etime;
> + unsigned long flags;
>
> BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);
>
> @@ -71,6 +72,12 @@ void bacct_add_tsk(struct taskstats *sta
> stats->ac_majflt = tsk->maj_flt;
>
> strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm));
> + if (tsk->tgid == tsk->pid && lock_task_sighand(tsk, &flags)) {
I don't think referring to tgid and pid is a good idea in the context
of namespaces, thread_group_leader makes more sense like Oleg pointed
out.
> + struct cdata *cd = &tsk->signal->cdata_acct;
> + stats->ac_cutime_acct = cputime_to_usecs(cd->utime);
> + stats->ac_cstime_acct = cputime_to_usecs(cd->stime);
> + unlock_task_sighand(tsk, &flags);
> + }
> }
>
>
>
--
Three Cheers,
Balbir
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 4/4] taskstats: Export "cdata_acct" with taskstats
2010-11-25 13:26 ` Oleg Nesterov
@ 2010-11-25 17:21 ` Michael Holzheu
2010-11-29 16:43 ` Oleg Nesterov
0 siblings, 1 reply; 24+ messages in thread
From: Michael Holzheu @ 2010-11-25 17:21 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
Hello Oleg,
On Thu, 2010-11-25 at 14:26 +0100, Oleg Nesterov wrote:
> On 11/19, Michael Holzheu wrote:
> >
> > From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> >
> > With this patch the (full) cumulative CPU time is added to "struct taskstats".
> > The CPU time is only returned for the thread group leader.
> >
> > ...
> >
> > + if (tsk->tgid == tsk->pid
>
> thread_group_leader() ?
Yes, that's better.
> > && lock_task_sighand(tsk, &flags)) {
>
> Do you really need ->siglock? Starting from 2.6.35 it is always
> safe to access ->signal.
Hmmm, if you say that...
I just did it like it is done in e.g. fs/proc/base.c (proc_pid_limits).
Can we remove the locking there, too?
Michael
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting
2010-11-25 13:21 ` Oleg Nesterov
@ 2010-11-25 17:45 ` Michael Holzheu
0 siblings, 0 replies; 24+ messages in thread
From: Michael Holzheu @ 2010-11-25 17:45 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
On Thu, 2010-11-25 at 14:21 +0100, Oleg Nesterov wrote:
> On 11/25, Michael Holzheu wrote:
> But in fact this reminds we should cleanup this code somehow.
> By the time we call thread_group_times() there are no other
> threads.
Couldn't we just replace thread_group_times() with the following:
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -101,12 +101,13 @@ static void __account_cdata(struct task_
* group, which consolidates times for all threads in the
* group including the group leader.
*/
- thread_group_times(p, &tgutime, &tgstime);
pcd = &p->real_parent->signal->cdata_wait;
tcd = &p->signal->cdata_threads;
cd = &p->signal->cdata_wait;
spin_lock_irqsave(&p->real_parent->sighand->siglock, flags);
+ tgutime = cputime_add(tcd->utime, p->utime);
+ tgstime = cputime_add(tcd->stime, p->stime);
pcd->utime =
cputime_add(pcd->utime,
cputime_add(tgutime,
Then we can move the __account_cdata() invocation down to the second
group_dead check.
Michael
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 4/4] taskstats: Export "cdata_acct" with taskstats
2010-11-25 17:21 ` Michael Holzheu
@ 2010-11-29 16:43 ` Oleg Nesterov
2010-11-29 16:58 ` Michael Holzheu
0 siblings, 1 reply; 24+ messages in thread
From: Oleg Nesterov @ 2010-11-29 16:43 UTC (permalink / raw)
To: Michael Holzheu
Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
On 11/25, Michael Holzheu wrote:
>
> Hello Oleg,
>
> On Thu, 2010-11-25 at 14:26 +0100, Oleg Nesterov wrote:
> > On 11/19, Michael Holzheu wrote:
> > >
> > > From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> > >
> > > With this patch the (full) cumulative CPU time is added to "struct taskstats".
> > > The CPU time is only returned for the thread group leader.
> > >
> > > ...
> > >
> > > + if (tsk->tgid == tsk->pid
> >
> > thread_group_leader() ?
>
> Yes, that's better.
>
> > > && lock_task_sighand(tsk, &flags)) {
> >
> > Do you really need ->siglock? Starting from 2.6.35 it is always
> > safe to access ->signal.
>
> Hmmm, if you say that...
>
> I just did it like it is done in e.g. fs/proc/base.c (proc_pid_limits).
> Can we remove the locking there, too?
We can certainly remove more siglock's which were previously
needed to access ->signal.
This particular one is just wrong. We need task_lock(group_leader)
to read signal->rlim atomically. However, it is not trivial to do
this correctly. Probably we should ignore this minor problem.
In any case, this ->siglock buys nothing today.
Oleg.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 4/4] taskstats: Export "cdata_acct" with taskstats
2010-11-29 16:43 ` Oleg Nesterov
@ 2010-11-29 16:58 ` Michael Holzheu
2010-11-29 18:08 ` Oleg Nesterov
0 siblings, 1 reply; 24+ messages in thread
From: Michael Holzheu @ 2010-11-29 16:58 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
Hello Oleg,
On Mon, 2010-11-29 at 17:43 +0100, Oleg Nesterov wrote:
> > I just did it like it is done in e.g. fs/proc/base.c (proc_pid_limits).
> > Can we remove the locking there, too?
>
> We can certainly remove more siglock's which were previously
> needed to access ->signal.
>
> This particular one is just wrong. We need task_lock(group_leader)
> to read signal->rlim atomically. However, it is not trivial to do
> this correctly. Probably we should ignore this minor problem.
>
> In any case, this ->siglock buys nothing today.
But at least to get the two values cutime and cstime consistent we need
the siglock? There could be a parallel update for
tsk->signal->cutime/cstime, while the taskstats are created, no?
Michael
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 4/4] taskstats: Export "cdata_acct" with taskstats
2010-11-29 16:58 ` Michael Holzheu
@ 2010-11-29 18:08 ` Oleg Nesterov
0 siblings, 0 replies; 24+ messages in thread
From: Oleg Nesterov @ 2010-11-29 18:08 UTC (permalink / raw)
To: Michael Holzheu
Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz,
Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens,
Roland McGrath, linux-kernel, linux-s390
On 11/29, Michael Holzheu wrote:
>
> Hello Oleg,
>
> On Mon, 2010-11-29 at 17:43 +0100, Oleg Nesterov wrote:
> > > I just did it like it is done in e.g. fs/proc/base.c (proc_pid_limits).
> > > Can we remove the locking there, too?
> >
> > We can certainly remove more siglock's which were previously
> > needed to access ->signal.
> >
> > This particular one is just wrong. We need task_lock(group_leader)
> > to read signal->rlim atomically. However, it is not trivial to do
> > this correctly. Probably we should ignore this minor problem.
> >
> > In any case, this ->siglock buys nothing today.
>
> But at least to get the two values cutime and cstime consistent we need
> the siglock? There could be a parallel update for
> tsk->signal->cutime/cstime, while the taskstats are created, no?
I guess you mean 4/4 you sent. Probably you are right, I'll try
to look tomorrow.
Oleg.
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2010-11-29 18:16 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-11-19 20:11 [patch 0/4] taskstats: Improve cumulative time accounting Michael Holzheu
2010-11-19 20:11 ` [patch 1/4] taskstats: Introduce "struct cdata" Michael Holzheu
2010-11-25 12:29 ` Balbir Singh
2010-11-25 14:23 ` Oleg Nesterov
2010-11-25 16:38 ` Michael Holzheu
2010-11-19 20:11 ` [patch 2/4] taskstats: Introduce __account_cdata() function Michael Holzheu
2010-11-19 20:11 ` [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting Michael Holzheu
2010-11-23 16:59 ` Oleg Nesterov
2010-11-25 9:40 ` Michael Holzheu
2010-11-25 13:21 ` Oleg Nesterov
2010-11-25 17:45 ` Michael Holzheu
2010-11-19 20:11 ` [patch 4/4] taskstats: Export "cdata_acct" with taskstats Michael Holzheu
2010-11-25 13:26 ` Oleg Nesterov
2010-11-25 17:21 ` Michael Holzheu
2010-11-29 16:43 ` Oleg Nesterov
2010-11-29 16:58 ` Michael Holzheu
2010-11-29 18:08 ` Oleg Nesterov
2010-11-25 16:57 ` Balbir Singh
2010-11-19 20:19 ` [patch 0/4] taskstats: Improve cumulative time accounting Peter Zijlstra
2010-11-20 15:17 ` Oleg Nesterov
2010-11-22 7:21 ` Balbir Singh
2010-11-22 11:03 ` Michael Holzheu
2010-11-22 12:47 ` Michael Holzheu
2010-11-22 18:11 ` Valdis.Kletnieks
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox