* [patch 0/4] taskstats: Improve cumulative time accounting
@ 2010-11-19 20:11 Michael Holzheu
2010-11-19 20:11 ` [patch 1/4] taskstats: Introduce "struct cdata" Michael Holzheu
` (4 more replies)
0 siblings, 5 replies; 24+ messages in thread
From: Michael Holzheu @ 2010-11-19 20:11 UTC (permalink / raw)
To: Oleg Nesterov, Shailabh Nagar, Andrew Morton, Peter Zijlstra,
John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky,
Heiko Carstens, Roland McGrath
Cc: linux-kernel, linux-s390
Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
to the cumulative time of the parents, if the parents ignore SIGCHLD
or have set SA_NOCLDWAIT. This behaviour has the major drawback that
it is not possible to calculate all consumed CPU time of a system by
looking at the current tasks. CPU time can be lost.
To solve this problem, this patch set duplicates the cumulative accounting
data in the signal_struct. In the second set (cdata_acct) the complete
cumulative resource counters are stored. The new cumulative CPU time (utime
and stime) is then exported via the taskstats interface.
PATCH SET OVERVIEW
------------------
Patches apply on:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
[1] Introduce "struct cdata"
[2] Introduce __account_cdata() function
[3] Introduce cdata_acct for complete cumulative accounting
[4] Export "cdata_acct" with taskstats
^ permalink raw reply [flat|nested] 24+ messages in thread* [patch 1/4] taskstats: Introduce "struct cdata" 2010-11-19 20:11 [patch 0/4] taskstats: Improve cumulative time accounting Michael Holzheu @ 2010-11-19 20:11 ` Michael Holzheu 2010-11-25 12:29 ` Balbir Singh 2010-11-25 14:23 ` Oleg Nesterov 2010-11-19 20:11 ` [patch 2/4] taskstats: Introduce __account_cdata() function Michael Holzheu ` (3 subsequent siblings) 4 siblings, 2 replies; 24+ messages in thread From: Michael Holzheu @ 2010-11-19 20:11 UTC (permalink / raw) To: Oleg Nesterov, Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath Cc: linux-kernel, linux-s390 [-- Attachment #1: 01-taskstats-top-improve-ctime-add-cdata-struct.patch --] [-- Type: text/plain, Size: 12061 bytes --] From: Michael Holzheu <holzheu@linux.vnet.ibm.com> This patch introduces a new structure "struct cdata" that is used to store cumulative resource counters for dead child processes and threads. Note that there is one asymmetry: For "struct task_io_accounting" (ioc) there is no extra accounting field for dead threads. One field is used for both, dead processes and threads. This patch introduces no functional change. Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> --- fs/binfmt_elf.c | 4 +- fs/exec.c | 2 - fs/proc/array.c | 16 ++++---- include/linux/sched.h | 22 +++++++---- kernel/exit.c | 86 ++++++++++++++++++++++++---------------------- kernel/posix-cpu-timers.c | 12 +++--- kernel/sys.c | 44 ++++++++++++----------- 7 files changed, 100 insertions(+), 86 deletions(-) --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -1296,8 +1296,8 @@ static void fill_prstatus(struct elf_prs cputime_to_timeval(p->utime, &prstatus->pr_utime); cputime_to_timeval(p->stime, &prstatus->pr_stime); } - cputime_to_timeval(p->signal->cutime, &prstatus->pr_cutime); - cputime_to_timeval(p->signal->cstime, &prstatus->pr_cstime); + cputime_to_timeval(p->signal->cdata_wait.utime, &prstatus->pr_cutime); + cputime_to_timeval(p->signal->cdata_wait.stime, &prstatus->pr_cstime); } static int fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p, --- a/fs/exec.c +++ b/fs/exec.c @@ -894,7 +894,7 @@ static int de_thread(struct task_struct no_thread_group: if (current->mm) - setmax_mm_hiwater_rss(&sig->maxrss, current->mm); + setmax_mm_hiwater_rss(&sig->cdata_threads.maxrss, current->mm); exit_itimers(sig); flush_itimer_signals(); --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -413,11 +413,11 @@ static int do_task_stat(struct seq_file num_threads = get_nr_threads(task); collect_sigign_sigcatch(task, &sigign, &sigcatch); - cmin_flt = sig->cmin_flt; - cmaj_flt = sig->cmaj_flt; - cutime = sig->cutime; - cstime = sig->cstime; - cgtime = sig->cgtime; + cmin_flt = sig->cdata_wait.min_flt; + cmaj_flt = sig->cdata_wait.maj_flt; + cutime = sig->cdata_wait.utime; + cstime = sig->cdata_wait.stime; + cgtime = sig->cdata_wait.gtime; rsslim = ACCESS_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur); /* add up live thread stats at the group level */ @@ -430,10 +430,10 @@ static int do_task_stat(struct seq_file t = next_thread(t); } while (t != task); - min_flt += sig->min_flt; - maj_flt += sig->maj_flt; + min_flt += sig->cdata_threads.min_flt; + maj_flt += sig->cdata_threads.maj_flt; thread_group_times(task, &utime, &stime); - gtime = cputime_add(gtime, sig->gtime); + gtime = cputime_add(gtime, sig->cdata_threads.gtime); } sid = task_session_nr_ns(task, ns); --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -510,6 +510,17 @@ struct thread_group_cputimer { }; /* + * Cumulative resource counters + */ +struct cdata { + cputime_t utime, stime, gtime; + unsigned long nvcsw, nivcsw; + unsigned long min_flt, maj_flt; + unsigned long inblock, oublock; + unsigned long maxrss; +}; + +/* * NOTE! "signal_struct" does not have it's own * locking, because a shared signal_struct always * implies a shared sighand_struct, so locking @@ -582,17 +593,12 @@ struct signal_struct { * Live threads maintain their own counters and add to these * in __exit_signal, except for the group leader. */ - cputime_t utime, stime, cutime, cstime; - cputime_t gtime; - cputime_t cgtime; + struct cdata cdata_wait; + struct cdata cdata_threads; + struct task_io_accounting ioac; #ifndef CONFIG_VIRT_CPU_ACCOUNTING cputime_t prev_utime, prev_stime; #endif - unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw; - unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt; - unsigned long inblock, oublock, cinblock, coublock; - unsigned long maxrss, cmaxrss; - struct task_io_accounting ioac; /* * Cumulative ns of schedule CPU time fo dead threads in the --- a/kernel/exit.c +++ b/kernel/exit.c @@ -95,6 +95,7 @@ static void __exit_signal(struct task_st tty = sig->tty; sig->tty = NULL; } else { + struct cdata *tcd = &sig->cdata_threads; /* * This can only happen if the caller is de_thread(). * FIXME: this is the temporary hack, we should teach @@ -122,15 +123,15 @@ static void __exit_signal(struct task_st * We won't ever get here for the group leader, since it * will have been the last reference on the signal_struct. */ - sig->utime = cputime_add(sig->utime, tsk->utime); - sig->stime = cputime_add(sig->stime, tsk->stime); - sig->gtime = cputime_add(sig->gtime, tsk->gtime); - sig->min_flt += tsk->min_flt; - sig->maj_flt += tsk->maj_flt; - sig->nvcsw += tsk->nvcsw; - sig->nivcsw += tsk->nivcsw; - sig->inblock += task_io_get_inblock(tsk); - sig->oublock += task_io_get_oublock(tsk); + tcd->utime = cputime_add(tcd->utime, tsk->utime); + tcd->stime = cputime_add(tcd->stime, tsk->stime); + tcd->gtime = cputime_add(tcd->gtime, tsk->gtime); + tcd->min_flt += tsk->min_flt; + tcd->maj_flt += tsk->maj_flt; + tcd->nvcsw += tsk->nvcsw; + tcd->nivcsw += tsk->nivcsw; + tcd->inblock += task_io_get_inblock(tsk); + tcd->oublock += task_io_get_oublock(tsk); task_io_accounting_add(&sig->ioac, &tsk->ioac); sig->sum_sched_runtime += tsk->se.sum_exec_runtime; } @@ -960,10 +961,11 @@ NORET_TYPE void do_exit(long code) sync_mm_rss(tsk, tsk->mm); group_dead = atomic_dec_and_test(&tsk->signal->live); if (group_dead) { + struct cdata *tcd = &tsk->signal->cdata_threads; hrtimer_cancel(&tsk->signal->real_timer); exit_itimers(tsk->signal); if (tsk->mm) - setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm); + setmax_mm_hiwater_rss(&tcd->maxrss, tsk->mm); } acct_collect(code, group_dead); if (group_dead) @@ -1225,8 +1227,7 @@ static int wait_task_zombie(struct wait_ * !task_detached() to filter out sub-threads. */ if (likely(!traced) && likely(!task_detached(p))) { - struct signal_struct *psig; - struct signal_struct *sig; + struct cdata *cd, *pcd, *tcd; unsigned long maxrss; cputime_t tgutime, tgstime; @@ -1251,40 +1252,43 @@ static int wait_task_zombie(struct wait_ */ thread_group_times(p, &tgutime, &tgstime); spin_lock_irq(&p->real_parent->sighand->siglock); - psig = p->real_parent->signal; - sig = p->signal; - psig->cutime = - cputime_add(psig->cutime, + pcd = &p->real_parent->signal->cdata_wait; + tcd = &p->signal->cdata_threads; + cd = &p->signal->cdata_wait; + + pcd->utime = + cputime_add(pcd->utime, cputime_add(tgutime, - sig->cutime)); - psig->cstime = - cputime_add(psig->cstime, + cd->utime)); + pcd->stime = + cputime_add(pcd->stime, cputime_add(tgstime, - sig->cstime)); - psig->cgtime = - cputime_add(psig->cgtime, + cd->stime)); + pcd->gtime = + cputime_add(pcd->gtime, cputime_add(p->gtime, - cputime_add(sig->gtime, - sig->cgtime))); - psig->cmin_flt += - p->min_flt + sig->min_flt + sig->cmin_flt; - psig->cmaj_flt += - p->maj_flt + sig->maj_flt + sig->cmaj_flt; - psig->cnvcsw += - p->nvcsw + sig->nvcsw + sig->cnvcsw; - psig->cnivcsw += - p->nivcsw + sig->nivcsw + sig->cnivcsw; - psig->cinblock += + cputime_add(tcd->gtime, + cd->gtime))); + pcd->min_flt += + p->min_flt + tcd->min_flt + cd->min_flt; + pcd->maj_flt += + p->maj_flt + tcd->maj_flt + cd->maj_flt; + pcd->nvcsw += + p->nvcsw + tcd->nvcsw + cd->nvcsw; + pcd->nivcsw += + p->nivcsw + tcd->nivcsw + cd->nivcsw; + pcd->inblock += task_io_get_inblock(p) + - sig->inblock + sig->cinblock; - psig->coublock += + tcd->inblock + cd->inblock; + pcd->oublock += task_io_get_oublock(p) + - sig->oublock + sig->coublock; - maxrss = max(sig->maxrss, sig->cmaxrss); - if (psig->cmaxrss < maxrss) - psig->cmaxrss = maxrss; - task_io_accounting_add(&psig->ioac, &p->ioac); - task_io_accounting_add(&psig->ioac, &sig->ioac); + tcd->oublock + cd->oublock; + maxrss = max(tcd->maxrss, cd->maxrss); + if (pcd->maxrss < maxrss) + pcd->maxrss = maxrss; + task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac); + task_io_accounting_add(&p->real_parent->signal->ioac, + &p->signal->ioac); spin_unlock_irq(&p->real_parent->sighand->siglock); } --- a/kernel/posix-cpu-timers.c +++ b/kernel/posix-cpu-timers.c @@ -232,12 +232,12 @@ static int cpu_clock_sample(const clocki void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times) { - struct signal_struct *sig = tsk->signal; + struct cdata *tcd = &tsk->signal->cdata_threads; struct task_struct *t; - times->utime = sig->utime; - times->stime = sig->stime; - times->sum_exec_runtime = sig->sum_sched_runtime; + times->utime = tcd->utime; + times->stime = tcd->stime; + times->sum_exec_runtime = tsk->signal->sum_sched_runtime; rcu_read_lock(); /* make sure we can trust tsk->thread_group list */ @@ -516,8 +516,8 @@ void posix_cpu_timers_exit_group(struct struct signal_struct *const sig = tsk->signal; cleanup_timers(tsk->signal->cpu_timers, - cputime_add(tsk->utime, sig->utime), - cputime_add(tsk->stime, sig->stime), + cputime_add(tsk->utime, sig->cdata_threads.utime), + cputime_add(tsk->stime, sig->cdata_threads.stime), tsk->se.sum_exec_runtime + sig->sum_sched_runtime); } --- a/kernel/sys.c +++ b/kernel/sys.c @@ -884,8 +884,8 @@ void do_sys_times(struct tms *tms) spin_lock_irq(¤t->sighand->siglock); thread_group_times(current, &tgutime, &tgstime); - cutime = current->signal->cutime; - cstime = current->signal->cstime; + cutime = current->signal->cdata_wait.utime; + cstime = current->signal->cdata_wait.stime; spin_unlock_irq(¤t->sighand->siglock); tms->tms_utime = cputime_to_clock_t(tgutime); tms->tms_stime = cputime_to_clock_t(tgstime); @@ -1490,6 +1490,7 @@ static void k_getrusage(struct task_stru unsigned long flags; cputime_t tgutime, tgstime, utime, stime; unsigned long maxrss = 0; + struct cdata *cd; memset((char *) r, 0, sizeof *r); utime = stime = cputime_zero; @@ -1497,7 +1498,7 @@ static void k_getrusage(struct task_stru if (who == RUSAGE_THREAD) { task_times(current, &utime, &stime); accumulate_thread_rusage(p, r); - maxrss = p->signal->maxrss; + maxrss = p->signal->cdata_threads.maxrss; goto out; } @@ -1507,31 +1508,34 @@ static void k_getrusage(struct task_stru switch (who) { case RUSAGE_BOTH: case RUSAGE_CHILDREN: - utime = p->signal->cutime; - stime = p->signal->cstime; - r->ru_nvcsw = p->signal->cnvcsw; - r->ru_nivcsw = p->signal->cnivcsw; - r->ru_minflt = p->signal->cmin_flt; - r->ru_majflt = p->signal->cmaj_flt; - r->ru_inblock = p->signal->cinblock; - r->ru_oublock = p->signal->coublock; - maxrss = p->signal->cmaxrss; + cd = &p->signal->cdata_wait; + utime = cd->utime; + stime = cd->stime; + r->ru_nvcsw = cd->nvcsw; + r->ru_nivcsw = cd->nivcsw; + r->ru_minflt = cd->min_flt; + r->ru_majflt = cd->maj_flt; + r->ru_inblock = cd->inblock; + r->ru_oublock = cd->oublock; + maxrss = cd->maxrss; if (who == RUSAGE_CHILDREN) break; case RUSAGE_SELF: + cd = &p->signal->cdata_threads; thread_group_times(p, &tgutime, &tgstime); utime = cputime_add(utime, tgutime); stime = cputime_add(stime, tgstime); - r->ru_nvcsw += p->signal->nvcsw; - r->ru_nivcsw += p->signal->nivcsw; - r->ru_minflt += p->signal->min_flt; - r->ru_majflt += p->signal->maj_flt; - r->ru_inblock += p->signal->inblock; - r->ru_oublock += p->signal->oublock; - if (maxrss < p->signal->maxrss) - maxrss = p->signal->maxrss; + r->ru_nvcsw += cd->nvcsw; + r->ru_nivcsw += cd->nivcsw; + r->ru_minflt += cd->min_flt; + r->ru_majflt += cd->maj_flt; + r->ru_inblock += cd->inblock; + r->ru_oublock += cd->oublock; + if (maxrss < cd->maxrss) + maxrss = cd->maxrss; + t = p; do { accumulate_thread_rusage(t, r); ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 1/4] taskstats: Introduce "struct cdata" 2010-11-19 20:11 ` [patch 1/4] taskstats: Introduce "struct cdata" Michael Holzheu @ 2010-11-25 12:29 ` Balbir Singh 2010-11-25 14:23 ` Oleg Nesterov 1 sibling, 0 replies; 24+ messages in thread From: Balbir Singh @ 2010-11-25 12:29 UTC (permalink / raw) To: Michael Holzheu Cc: Oleg Nesterov, Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 * Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-11-19 21:11:09]: > From: Michael Holzheu <holzheu@linux.vnet.ibm.com> > > This patch introduces a new structure "struct cdata" that is used to > store cumulative resource counters for dead child processes and threads. > > Note that there is one asymmetry: > For "struct task_io_accounting" (ioc) there is no extra accounting field for > dead threads. One field is used for both, dead processes and threads. > > This patch introduces no functional change. > > Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> > --- > fs/binfmt_elf.c | 4 +- > fs/exec.c | 2 - > fs/proc/array.c | 16 ++++---- > include/linux/sched.h | 22 +++++++---- > kernel/exit.c | 86 ++++++++++++++++++++++++---------------------- > kernel/posix-cpu-timers.c | 12 +++--- > kernel/sys.c | 44 ++++++++++++----------- > 7 files changed, 100 insertions(+), 86 deletions(-) Looks good to me Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> -- Three Cheers, Balbir ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 1/4] taskstats: Introduce "struct cdata" 2010-11-19 20:11 ` [patch 1/4] taskstats: Introduce "struct cdata" Michael Holzheu 2010-11-25 12:29 ` Balbir Singh @ 2010-11-25 14:23 ` Oleg Nesterov 2010-11-25 16:38 ` Michael Holzheu 1 sibling, 1 reply; 24+ messages in thread From: Oleg Nesterov @ 2010-11-25 14:23 UTC (permalink / raw) To: Michael Holzheu Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 On 11/19, Michael Holzheu wrote: > > From: Michael Holzheu <holzheu@linux.vnet.ibm.com> > > This patch introduces a new structure "struct cdata" that is used to > store cumulative resource counters for dead child processes and threads. > > Note that there is one asymmetry: > For "struct task_io_accounting" (ioc) there is no extra accounting field for > dead threads. One field is used for both, dead processes and threads. > > This patch introduces no functional change. > > Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> > --- > fs/binfmt_elf.c | 4 +- > fs/exec.c | 2 - > fs/proc/array.c | 16 ++++---- > include/linux/sched.h | 22 +++++++---- > kernel/exit.c | 86 ++++++++++++++++++++++++---------------------- > kernel/posix-cpu-timers.c | 12 +++--- > kernel/sys.c | 44 ++++++++++++----------- > 7 files changed, 100 insertions(+), 86 deletions(-) Looks good. In fact, to me it looks like a cleanup. But. You seem to forgot to change kernel/signal.c, no? And cosmetic nit, > void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times) > { > - struct signal_struct *sig = tsk->signal; > + struct cdata *tcd = &tsk->signal->cdata_threads; > struct task_struct *t; > > - times->utime = sig->utime; > - times->stime = sig->stime; > - times->sum_exec_runtime = sig->sum_sched_runtime; > + times->utime = tcd->utime; > + times->stime = tcd->stime; > + times->sum_exec_runtime = tsk->signal->sum_sched_runtime; Feel free to ignore, but I don't understand why you removed "sig". Afaics, - times->utime = sig->utime; - times->stime = sig->stime; + times->utime = sig->cdata_threads->utime; + times->stime = sig->cdata_threads->stime; looks a bit better. Oleg. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 1/4] taskstats: Introduce "struct cdata" 2010-11-25 14:23 ` Oleg Nesterov @ 2010-11-25 16:38 ` Michael Holzheu 0 siblings, 0 replies; 24+ messages in thread From: Michael Holzheu @ 2010-11-25 16:38 UTC (permalink / raw) To: Oleg Nesterov Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 Hello Oleg, On Thu, 2010-11-25 at 15:23 +0100, Oleg Nesterov wrote: > On 11/19, Michael Holzheu wrote: > > Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> > > --- > > fs/binfmt_elf.c | 4 +- > > fs/exec.c | 2 - > > fs/proc/array.c | 16 ++++---- > > include/linux/sched.h | 22 +++++++---- > > kernel/exit.c | 86 ++++++++++++++++++++++++---------------------- > > kernel/posix-cpu-timers.c | 12 +++--- > > kernel/sys.c | 44 ++++++++++++----------- > > 7 files changed, 100 insertions(+), 86 deletions(-) > > Looks good. In fact, to me it looks like a cleanup. > > But. You seem to forgot to change kernel/signal.c, no? Yes, you are right. > > And cosmetic nit, > > > void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times) > > { > > - struct signal_struct *sig = tsk->signal; > > + struct cdata *tcd = &tsk->signal->cdata_threads; > > struct task_struct *t; > > > > - times->utime = sig->utime; > > - times->stime = sig->stime; > > - times->sum_exec_runtime = sig->sum_sched_runtime; > > + times->utime = tcd->utime; > > + times->stime = tcd->stime; > > + times->sum_exec_runtime = tsk->signal->sum_sched_runtime; > > Feel free to ignore, but I don't understand why you removed "sig". > Afaics, > > - times->utime = sig->utime; > - times->stime = sig->stime; > + times->utime = sig->cdata_threads->utime; > + times->stime = sig->cdata_threads->stime; > > looks a bit better. Yes, this looks better. I attached the corrected patch. Michael --- Subject: taskstats: Introduce "struct cdata" From: Michael Holzheu <holzheu@linux.vnet.ibm.com> This patch introduces a new structure "struct cdata" that is used to store cumulative resource counters for dead child processes and threads. Note that there is one asymmetry: For "struct task_io_accounting" (ioc) there is no extra accounting field for dead threads. One field is used for both dead processes and threads. This patch introduces no functional change. Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> --- fs/binfmt_elf.c | 4 +- fs/exec.c | 2 - fs/proc/array.c | 16 ++++---- include/linux/sched.h | 22 +++++++---- kernel/exit.c | 86 ++++++++++++++++++++++++---------------------- kernel/posix-cpu-timers.c | 8 ++-- kernel/signal.c | 4 +- kernel/sys.c | 44 ++++++++++++----------- 8 files changed, 100 insertions(+), 86 deletions(-) --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -1296,8 +1296,8 @@ static void fill_prstatus(struct elf_prs cputime_to_timeval(p->utime, &prstatus->pr_utime); cputime_to_timeval(p->stime, &prstatus->pr_stime); } - cputime_to_timeval(p->signal->cutime, &prstatus->pr_cutime); - cputime_to_timeval(p->signal->cstime, &prstatus->pr_cstime); + cputime_to_timeval(p->signal->cdata_wait.utime, &prstatus->pr_cutime); + cputime_to_timeval(p->signal->cdata_wait.stime, &prstatus->pr_cstime); } static int fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p, --- a/fs/exec.c +++ b/fs/exec.c @@ -894,7 +894,7 @@ static int de_thread(struct task_struct no_thread_group: if (current->mm) - setmax_mm_hiwater_rss(&sig->maxrss, current->mm); + setmax_mm_hiwater_rss(&sig->cdata_threads.maxrss, current->mm); exit_itimers(sig); flush_itimer_signals(); --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -413,11 +413,11 @@ static int do_task_stat(struct seq_file num_threads = get_nr_threads(task); collect_sigign_sigcatch(task, &sigign, &sigcatch); - cmin_flt = sig->cmin_flt; - cmaj_flt = sig->cmaj_flt; - cutime = sig->cutime; - cstime = sig->cstime; - cgtime = sig->cgtime; + cmin_flt = sig->cdata_wait.min_flt; + cmaj_flt = sig->cdata_wait.maj_flt; + cutime = sig->cdata_wait.utime; + cstime = sig->cdata_wait.stime; + cgtime = sig->cdata_wait.gtime; rsslim = ACCESS_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur); /* add up live thread stats at the group level */ @@ -430,10 +430,10 @@ static int do_task_stat(struct seq_file t = next_thread(t); } while (t != task); - min_flt += sig->min_flt; - maj_flt += sig->maj_flt; + min_flt += sig->cdata_threads.min_flt; + maj_flt += sig->cdata_threads.maj_flt; thread_group_times(task, &utime, &stime); - gtime = cputime_add(gtime, sig->gtime); + gtime = cputime_add(gtime, sig->cdata_threads.gtime); } sid = task_session_nr_ns(task, ns); --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -510,6 +510,17 @@ struct thread_group_cputimer { }; /* + * Cumulative resource counters + */ +struct cdata { + cputime_t utime, stime, gtime; + unsigned long nvcsw, nivcsw; + unsigned long min_flt, maj_flt; + unsigned long inblock, oublock; + unsigned long maxrss; +}; + +/* * NOTE! "signal_struct" does not have it's own * locking, because a shared signal_struct always * implies a shared sighand_struct, so locking @@ -582,17 +593,12 @@ struct signal_struct { * Live threads maintain their own counters and add to these * in __exit_signal, except for the group leader. */ - cputime_t utime, stime, cutime, cstime; - cputime_t gtime; - cputime_t cgtime; + struct cdata cdata_wait; + struct cdata cdata_threads; + struct task_io_accounting ioac; #ifndef CONFIG_VIRT_CPU_ACCOUNTING cputime_t prev_utime, prev_stime; #endif - unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw; - unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt; - unsigned long inblock, oublock, cinblock, coublock; - unsigned long maxrss, cmaxrss; - struct task_io_accounting ioac; /* * Cumulative ns of schedule CPU time fo dead threads in the --- a/kernel/exit.c +++ b/kernel/exit.c @@ -95,6 +95,7 @@ static void __exit_signal(struct task_st tty = sig->tty; sig->tty = NULL; } else { + struct cdata *tcd = &sig->cdata_threads; /* * This can only happen if the caller is de_thread(). * FIXME: this is the temporary hack, we should teach @@ -122,15 +123,15 @@ static void __exit_signal(struct task_st * We won't ever get here for the group leader, since it * will have been the last reference on the signal_struct. */ - sig->utime = cputime_add(sig->utime, tsk->utime); - sig->stime = cputime_add(sig->stime, tsk->stime); - sig->gtime = cputime_add(sig->gtime, tsk->gtime); - sig->min_flt += tsk->min_flt; - sig->maj_flt += tsk->maj_flt; - sig->nvcsw += tsk->nvcsw; - sig->nivcsw += tsk->nivcsw; - sig->inblock += task_io_get_inblock(tsk); - sig->oublock += task_io_get_oublock(tsk); + tcd->utime = cputime_add(tcd->utime, tsk->utime); + tcd->stime = cputime_add(tcd->stime, tsk->stime); + tcd->gtime = cputime_add(tcd->gtime, tsk->gtime); + tcd->min_flt += tsk->min_flt; + tcd->maj_flt += tsk->maj_flt; + tcd->nvcsw += tsk->nvcsw; + tcd->nivcsw += tsk->nivcsw; + tcd->inblock += task_io_get_inblock(tsk); + tcd->oublock += task_io_get_oublock(tsk); task_io_accounting_add(&sig->ioac, &tsk->ioac); sig->sum_sched_runtime += tsk->se.sum_exec_runtime; } @@ -960,10 +961,11 @@ NORET_TYPE void do_exit(long code) sync_mm_rss(tsk, tsk->mm); group_dead = atomic_dec_and_test(&tsk->signal->live); if (group_dead) { + struct cdata *tcd = &tsk->signal->cdata_threads; hrtimer_cancel(&tsk->signal->real_timer); exit_itimers(tsk->signal); if (tsk->mm) - setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm); + setmax_mm_hiwater_rss(&tcd->maxrss, tsk->mm); } acct_collect(code, group_dead); if (group_dead) @@ -1225,8 +1227,7 @@ static int wait_task_zombie(struct wait_ * !task_detached() to filter out sub-threads. */ if (likely(!traced) && likely(!task_detached(p))) { - struct signal_struct *psig; - struct signal_struct *sig; + struct cdata *cd, *pcd, *tcd; unsigned long maxrss; cputime_t tgutime, tgstime; @@ -1251,40 +1252,43 @@ static int wait_task_zombie(struct wait_ */ thread_group_times(p, &tgutime, &tgstime); spin_lock_irq(&p->real_parent->sighand->siglock); - psig = p->real_parent->signal; - sig = p->signal; - psig->cutime = - cputime_add(psig->cutime, + pcd = &p->real_parent->signal->cdata_wait; + tcd = &p->signal->cdata_threads; + cd = &p->signal->cdata_wait; + + pcd->utime = + cputime_add(pcd->utime, cputime_add(tgutime, - sig->cutime)); - psig->cstime = - cputime_add(psig->cstime, + cd->utime)); + pcd->stime = + cputime_add(pcd->stime, cputime_add(tgstime, - sig->cstime)); - psig->cgtime = - cputime_add(psig->cgtime, + cd->stime)); + pcd->gtime = + cputime_add(pcd->gtime, cputime_add(p->gtime, - cputime_add(sig->gtime, - sig->cgtime))); - psig->cmin_flt += - p->min_flt + sig->min_flt + sig->cmin_flt; - psig->cmaj_flt += - p->maj_flt + sig->maj_flt + sig->cmaj_flt; - psig->cnvcsw += - p->nvcsw + sig->nvcsw + sig->cnvcsw; - psig->cnivcsw += - p->nivcsw + sig->nivcsw + sig->cnivcsw; - psig->cinblock += + cputime_add(tcd->gtime, + cd->gtime))); + pcd->min_flt += + p->min_flt + tcd->min_flt + cd->min_flt; + pcd->maj_flt += + p->maj_flt + tcd->maj_flt + cd->maj_flt; + pcd->nvcsw += + p->nvcsw + tcd->nvcsw + cd->nvcsw; + pcd->nivcsw += + p->nivcsw + tcd->nivcsw + cd->nivcsw; + pcd->inblock += task_io_get_inblock(p) + - sig->inblock + sig->cinblock; - psig->coublock += + tcd->inblock + cd->inblock; + pcd->oublock += task_io_get_oublock(p) + - sig->oublock + sig->coublock; - maxrss = max(sig->maxrss, sig->cmaxrss); - if (psig->cmaxrss < maxrss) - psig->cmaxrss = maxrss; - task_io_accounting_add(&psig->ioac, &p->ioac); - task_io_accounting_add(&psig->ioac, &sig->ioac); + tcd->oublock + cd->oublock; + maxrss = max(tcd->maxrss, cd->maxrss); + if (pcd->maxrss < maxrss) + pcd->maxrss = maxrss; + task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac); + task_io_accounting_add(&p->real_parent->signal->ioac, + &p->signal->ioac); spin_unlock_irq(&p->real_parent->sighand->siglock); } --- a/kernel/posix-cpu-timers.c +++ b/kernel/posix-cpu-timers.c @@ -235,8 +235,8 @@ void thread_group_cputime(struct task_st struct signal_struct *sig = tsk->signal; struct task_struct *t; - times->utime = sig->utime; - times->stime = sig->stime; + times->utime = sig->cdata_threads.utime; + times->stime = sig->cdata_threads.stime; times->sum_exec_runtime = sig->sum_sched_runtime; rcu_read_lock(); @@ -516,8 +516,8 @@ void posix_cpu_timers_exit_group(struct struct signal_struct *const sig = tsk->signal; cleanup_timers(tsk->signal->cpu_timers, - cputime_add(tsk->utime, sig->utime), - cputime_add(tsk->stime, sig->stime), + cputime_add(tsk->utime, sig->cdata_threads.utime), + cputime_add(tsk->stime, sig->cdata_threads.stime), tsk->se.sum_exec_runtime + sig->sum_sched_runtime); } --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1476,9 +1476,9 @@ int do_notify_parent(struct task_struct rcu_read_unlock(); info.si_utime = cputime_to_clock_t(cputime_add(tsk->utime, - tsk->signal->utime)); + tsk->signal->cdata_threads.utime)); info.si_stime = cputime_to_clock_t(cputime_add(tsk->stime, - tsk->signal->stime)); + tsk->signal->cdata_threads.stime)); info.si_status = tsk->exit_code & 0x7f; if (tsk->exit_code & 0x80) --- a/kernel/sys.c +++ b/kernel/sys.c @@ -884,8 +884,8 @@ void do_sys_times(struct tms *tms) spin_lock_irq(¤t->sighand->siglock); thread_group_times(current, &tgutime, &tgstime); - cutime = current->signal->cutime; - cstime = current->signal->cstime; + cutime = current->signal->cdata_wait.utime; + cstime = current->signal->cdata_wait.stime; spin_unlock_irq(¤t->sighand->siglock); tms->tms_utime = cputime_to_clock_t(tgutime); tms->tms_stime = cputime_to_clock_t(tgstime); @@ -1490,6 +1490,7 @@ static void k_getrusage(struct task_stru unsigned long flags; cputime_t tgutime, tgstime, utime, stime; unsigned long maxrss = 0; + struct cdata *cd; memset((char *) r, 0, sizeof *r); utime = stime = cputime_zero; @@ -1497,7 +1498,7 @@ static void k_getrusage(struct task_stru if (who == RUSAGE_THREAD) { task_times(current, &utime, &stime); accumulate_thread_rusage(p, r); - maxrss = p->signal->maxrss; + maxrss = p->signal->cdata_threads.maxrss; goto out; } @@ -1507,31 +1508,34 @@ static void k_getrusage(struct task_stru switch (who) { case RUSAGE_BOTH: case RUSAGE_CHILDREN: - utime = p->signal->cutime; - stime = p->signal->cstime; - r->ru_nvcsw = p->signal->cnvcsw; - r->ru_nivcsw = p->signal->cnivcsw; - r->ru_minflt = p->signal->cmin_flt; - r->ru_majflt = p->signal->cmaj_flt; - r->ru_inblock = p->signal->cinblock; - r->ru_oublock = p->signal->coublock; - maxrss = p->signal->cmaxrss; + cd = &p->signal->cdata_wait; + utime = cd->utime; + stime = cd->stime; + r->ru_nvcsw = cd->nvcsw; + r->ru_nivcsw = cd->nivcsw; + r->ru_minflt = cd->min_flt; + r->ru_majflt = cd->maj_flt; + r->ru_inblock = cd->inblock; + r->ru_oublock = cd->oublock; + maxrss = cd->maxrss; if (who == RUSAGE_CHILDREN) break; case RUSAGE_SELF: + cd = &p->signal->cdata_threads; thread_group_times(p, &tgutime, &tgstime); utime = cputime_add(utime, tgutime); stime = cputime_add(stime, tgstime); - r->ru_nvcsw += p->signal->nvcsw; - r->ru_nivcsw += p->signal->nivcsw; - r->ru_minflt += p->signal->min_flt; - r->ru_majflt += p->signal->maj_flt; - r->ru_inblock += p->signal->inblock; - r->ru_oublock += p->signal->oublock; - if (maxrss < p->signal->maxrss) - maxrss = p->signal->maxrss; + r->ru_nvcsw += cd->nvcsw; + r->ru_nivcsw += cd->nivcsw; + r->ru_minflt += cd->min_flt; + r->ru_majflt += cd->maj_flt; + r->ru_inblock += cd->inblock; + r->ru_oublock += cd->oublock; + if (maxrss < cd->maxrss) + maxrss = cd->maxrss; + t = p; do { accumulate_thread_rusage(t, r); ^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 2/4] taskstats: Introduce __account_cdata() function 2010-11-19 20:11 [patch 0/4] taskstats: Improve cumulative time accounting Michael Holzheu 2010-11-19 20:11 ` [patch 1/4] taskstats: Introduce "struct cdata" Michael Holzheu @ 2010-11-19 20:11 ` Michael Holzheu 2010-11-19 20:11 ` [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting Michael Holzheu ` (2 subsequent siblings) 4 siblings, 0 replies; 24+ messages in thread From: Michael Holzheu @ 2010-11-19 20:11 UTC (permalink / raw) To: Oleg Nesterov, Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath Cc: linux-kernel, linux-s390 [-- Attachment #1: 02-taskstats-top-improve-ctime-add_cdata-func.patch --] [-- Type: text/plain, Size: 5417 bytes --] From: Michael Holzheu <holzheu@linux.vnet.ibm.com> This patch introduces the function __account_cdata() that does the cummulative resource accounting for dead processes in sys_wait(). No functional changes are done. This patch is a preparation for cdata_acct accounting. Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> --- kernel/exit.c | 133 +++++++++++++++++++++++++++++----------------------------- 1 file changed, 68 insertions(+), 65 deletions(-) --- a/kernel/exit.c +++ b/kernel/exit.c @@ -74,6 +74,72 @@ static void __unhash_process(struct task list_del_rcu(&p->thread_group); } +static void __account_cdata(struct task_struct *p) +{ + struct cdata *cd, *pcd, *tcd; + unsigned long maxrss; + cputime_t tgutime, tgstime; + + /* + * The resource counters for the group leader are in its + * own task_struct. Those for dead threads in the group + * are in its signal_struct, as are those for the child + * processes it has previously reaped. All these + * accumulate in the parent's signal_struct c* fields. + * + * We don't bother to take a lock here to protect these + * p->signal fields, because they are only touched by + * __exit_signal, which runs with tasklist_lock + * write-locked anyway, and so is excluded here. We do + * need to protect the access to parent->signal fields, + * as other threads in the parent group can be right + * here reaping other children at the same time. + * + * We use thread_group_times() to get times for the thread + * group, which consolidates times for all threads in the + * group including the group leader. + */ + thread_group_times(p, &tgutime, &tgstime); + spin_lock_irq(&p->real_parent->sighand->siglock); + pcd = &p->real_parent->signal->cdata_wait; + tcd = &p->signal->cdata_threads; + cd = &p->signal->cdata_wait; + + pcd->utime = + cputime_add(pcd->utime, + cputime_add(tgutime, + cd->utime)); + pcd->stime = + cputime_add(pcd->stime, + cputime_add(tgstime, + cd->stime)); + pcd->gtime = + cputime_add(pcd->gtime, + cputime_add(p->gtime, + cputime_add(tcd->gtime, + cd->gtime))); + pcd->min_flt += + p->min_flt + tcd->min_flt + cd->min_flt; + pcd->maj_flt += + p->maj_flt + tcd->maj_flt + cd->maj_flt; + pcd->nvcsw += + p->nvcsw + tcd->nvcsw + cd->nvcsw; + pcd->nivcsw += + p->nivcsw + tcd->nivcsw + cd->nivcsw; + pcd->inblock += + task_io_get_inblock(p) + + tcd->inblock + cd->inblock; + pcd->oublock += + task_io_get_oublock(p) + + tcd->oublock + cd->oublock; + maxrss = max(tcd->maxrss, cd->maxrss); + if (pcd->maxrss < maxrss) + pcd->maxrss = maxrss; + task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac); + task_io_accounting_add(&p->real_parent->signal->ioac, &p->signal->ioac); + spin_unlock_irq(&p->real_parent->sighand->siglock); +} + /* * This function expects the tasklist_lock write-locked. */ @@ -1226,71 +1292,8 @@ static int wait_task_zombie(struct wait_ * It can be ptraced but not reparented, check * !task_detached() to filter out sub-threads. */ - if (likely(!traced) && likely(!task_detached(p))) { - struct cdata *cd, *pcd, *tcd; - unsigned long maxrss; - cputime_t tgutime, tgstime; - - /* - * The resource counters for the group leader are in its - * own task_struct. Those for dead threads in the group - * are in its signal_struct, as are those for the child - * processes it has previously reaped. All these - * accumulate in the parent's signal_struct c* fields. - * - * We don't bother to take a lock here to protect these - * p->signal fields, because they are only touched by - * __exit_signal, which runs with tasklist_lock - * write-locked anyway, and so is excluded here. We do - * need to protect the access to parent->signal fields, - * as other threads in the parent group can be right - * here reaping other children at the same time. - * - * We use thread_group_times() to get times for the thread - * group, which consolidates times for all threads in the - * group including the group leader. - */ - thread_group_times(p, &tgutime, &tgstime); - spin_lock_irq(&p->real_parent->sighand->siglock); - pcd = &p->real_parent->signal->cdata_wait; - tcd = &p->signal->cdata_threads; - cd = &p->signal->cdata_wait; - - pcd->utime = - cputime_add(pcd->utime, - cputime_add(tgutime, - cd->utime)); - pcd->stime = - cputime_add(pcd->stime, - cputime_add(tgstime, - cd->stime)); - pcd->gtime = - cputime_add(pcd->gtime, - cputime_add(p->gtime, - cputime_add(tcd->gtime, - cd->gtime))); - pcd->min_flt += - p->min_flt + tcd->min_flt + cd->min_flt; - pcd->maj_flt += - p->maj_flt + tcd->maj_flt + cd->maj_flt; - pcd->nvcsw += - p->nvcsw + tcd->nvcsw + cd->nvcsw; - pcd->nivcsw += - p->nivcsw + tcd->nivcsw + cd->nivcsw; - pcd->inblock += - task_io_get_inblock(p) + - tcd->inblock + cd->inblock; - pcd->oublock += - task_io_get_oublock(p) + - tcd->oublock + cd->oublock; - maxrss = max(tcd->maxrss, cd->maxrss); - if (pcd->maxrss < maxrss) - pcd->maxrss = maxrss; - task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac); - task_io_accounting_add(&p->real_parent->signal->ioac, - &p->signal->ioac); - spin_unlock_irq(&p->real_parent->sighand->siglock); - } + if (likely(!traced) && likely(!task_detached(p))) + __account_cdata(p); /* * Now we are sure this task is interesting, and no other ^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting 2010-11-19 20:11 [patch 0/4] taskstats: Improve cumulative time accounting Michael Holzheu 2010-11-19 20:11 ` [patch 1/4] taskstats: Introduce "struct cdata" Michael Holzheu 2010-11-19 20:11 ` [patch 2/4] taskstats: Introduce __account_cdata() function Michael Holzheu @ 2010-11-19 20:11 ` Michael Holzheu 2010-11-23 16:59 ` Oleg Nesterov 2010-11-19 20:11 ` [patch 4/4] taskstats: Export "cdata_acct" with taskstats Michael Holzheu 2010-11-19 20:19 ` [patch 0/4] taskstats: Improve cumulative time accounting Peter Zijlstra 4 siblings, 1 reply; 24+ messages in thread From: Michael Holzheu @ 2010-11-19 20:11 UTC (permalink / raw) To: Oleg Nesterov, Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath Cc: linux-kernel, linux-s390 [-- Attachment #1: 03-taskstats-top-improve-ctime-account-cdata_acct.patch --] [-- Type: text/plain, Size: 4388 bytes --] From: Michael Holzheu <holzheu@linux.vnet.ibm.com> Currently the cumulative time accounting in Linux is not complete. Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted to the cumulative time of the parents, if the parents ignore SIGCHLD or have set SA_NOCLDWAIT. This behaviour has the major drawback that it is not possible to calculate all consumed CPU time of a system by looking at the current tasks. CPU time can be lost. This patch adds a new set of cumulative time counters. We then have two cumulative counter sets: * cdata_wait: Traditional cumulative time used e.g. by getrusage. * cdata_acct: Cumulative time that also includes dead processes with parents that ignore SIGCHLD or have set SA_NOCLDWAIT. cdata_acct will be exported by taskstats. TODO: ----- With this patch we take the siglock twice. First for the dead task and second for the parent of the dead task. This give the following lockdep warning (probably a lockdep annotation is needed here): ============================================= [ INFO: possible recursive locking detected ] 2.6.37-rc1-00116-g151f52f-dirty #19 --------------------------------------------- kworker/u:0/15 is trying to acquire lock: (&(&sighand->siglock)->rlock){......}, at: [<000000000014a426>] __account_cdata+0x6e/0x444 but task is already holding lock: (&(&sighand->siglock)->rlock){......}, at: [<000000000014b634>] release_task+0x160/0x6a0 Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> --- include/linux/sched.h | 2 ++ kernel/exit.c | 36 +++++++++++++++++++++++++----------- 2 files changed, 27 insertions(+), 11 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -595,6 +595,8 @@ struct signal_struct { */ struct cdata cdata_wait; struct cdata cdata_threads; + struct cdata cdata_acct; + struct task_io_accounting ioac_acct; struct task_io_accounting ioac; #ifndef CONFIG_VIRT_CPU_ACCOUNTING cputime_t prev_utime, prev_stime; --- a/kernel/exit.c +++ b/kernel/exit.c @@ -74,10 +74,10 @@ static void __unhash_process(struct task list_del_rcu(&p->thread_group); } -static void __account_cdata(struct task_struct *p) +static void __account_cdata(struct task_struct *p, int wait) { struct cdata *cd, *pcd, *tcd; - unsigned long maxrss; + unsigned long maxrss, flags; cputime_t tgutime, tgstime; /* @@ -100,11 +100,16 @@ static void __account_cdata(struct task_ * group including the group leader. */ thread_group_times(p, &tgutime, &tgstime); - spin_lock_irq(&p->real_parent->sighand->siglock); - pcd = &p->real_parent->signal->cdata_wait; - tcd = &p->signal->cdata_threads; - cd = &p->signal->cdata_wait; - + spin_lock_irqsave(&p->real_parent->sighand->siglock, flags); + if (wait) { + pcd = &p->real_parent->signal->cdata_wait; + tcd = &p->signal->cdata_threads; + cd = &p->signal->cdata_wait; + } else { + pcd = &p->real_parent->signal->cdata_acct; + tcd = &p->signal->cdata_threads; + cd = &p->signal->cdata_acct; + } pcd->utime = cputime_add(pcd->utime, cputime_add(tgutime, @@ -135,9 +140,17 @@ static void __account_cdata(struct task_ maxrss = max(tcd->maxrss, cd->maxrss); if (pcd->maxrss < maxrss) pcd->maxrss = maxrss; - task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac); - task_io_accounting_add(&p->real_parent->signal->ioac, &p->signal->ioac); - spin_unlock_irq(&p->real_parent->sighand->siglock); + if (wait) { + task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac); + task_io_accounting_add(&p->real_parent->signal->ioac, + &p->signal->ioac); + } else { + task_io_accounting_add(&p->real_parent->signal->ioac_acct, + &p->ioac); + task_io_accounting_add(&p->real_parent->signal->ioac_acct, + &p->signal->ioac_acct); + } + spin_unlock_irqrestore(&p->real_parent->sighand->siglock, flags); } /* @@ -157,6 +170,7 @@ static void __exit_signal(struct task_st posix_cpu_timers_exit(tsk); if (group_dead) { + __account_cdata(tsk, 0); posix_cpu_timers_exit_group(tsk); tty = sig->tty; sig->tty = NULL; @@ -1293,7 +1307,7 @@ static int wait_task_zombie(struct wait_ * !task_detached() to filter out sub-threads. */ if (likely(!traced) && likely(!task_detached(p))) - __account_cdata(p); + __account_cdata(p, 1); /* * Now we are sure this task is interesting, and no other ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting 2010-11-19 20:11 ` [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting Michael Holzheu @ 2010-11-23 16:59 ` Oleg Nesterov 2010-11-25 9:40 ` Michael Holzheu 0 siblings, 1 reply; 24+ messages in thread From: Oleg Nesterov @ 2010-11-23 16:59 UTC (permalink / raw) To: Michael Holzheu Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 On 11/19, Michael Holzheu wrote: > > Currently the cumulative time accounting in Linux is not complete. > Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted > to the cumulative time of the parents, if the parents ignore SIGCHLD > or have set SA_NOCLDWAIT. This behaviour has the major drawback that > it is not possible to calculate all consumed CPU time of a system by > looking at the current tasks. CPU time can be lost. > > This patch adds a new set of cumulative time counters. We then have two > cumulative counter sets: > > * cdata_wait: Traditional cumulative time used e.g. by getrusage. > * cdata_acct: Cumulative time that also includes dead processes with > parents that ignore SIGCHLD or have set SA_NOCLDWAIT. > cdata_acct will be exported by taskstats. Looks correct at first glance. A couple of nits below. > TODO: > ----- > With this patch we take the siglock twice. First for the dead task > and second for the parent of the dead task. This give the following > lockdep warning (probably a lockdep annotation is needed here): And we already discussed this ;) We do not need 2 siglock's, only parent's. Just move the callsite in __exit_signal() down, under another (lockless) group_dead check. Or I missed something? > @@ -595,6 +595,8 @@ struct signal_struct { > */ > struct cdata cdata_wait; > struct cdata cdata_threads; > + struct cdata cdata_acct; > + struct task_io_accounting ioac_acct; > struct task_io_accounting ioac; Given that task_io_accounting is Linux specific, perhaps we can use signal->ioac in both cases? Yes, this is a user-visible change anyway. But, at least we can forget about POSIX. > + spin_lock_irqsave(&p->real_parent->sighand->siglock, flags); > + if (wait) { > + pcd = &p->real_parent->signal->cdata_wait; > + tcd = &p->signal->cdata_threads; > + cd = &p->signal->cdata_wait; > + } else { > + pcd = &p->real_parent->signal->cdata_acct; > + tcd = &p->signal->cdata_threads; > + cd = &p->signal->cdata_acct; > + } We can do this before taking ->siglock. Not that I think this really matters, but otherwise this looks a bit confusing imho, as if we need parent's ->siglock to pin something. And thanks for splitting these changes. It was much, much easier to read now. Oleg. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting 2010-11-23 16:59 ` Oleg Nesterov @ 2010-11-25 9:40 ` Michael Holzheu 2010-11-25 13:21 ` Oleg Nesterov 0 siblings, 1 reply; 24+ messages in thread From: Michael Holzheu @ 2010-11-25 9:40 UTC (permalink / raw) To: Oleg Nesterov Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 Hello Oleg, On Tue, 2010-11-23 at 17:59 +0100, Oleg Nesterov wrote: > On 11/19, Michael Holzheu wrote: > > TODO: > > ----- > > With this patch we take the siglock twice. First for the dead task > > and second for the parent of the dead task. This give the following > > lockdep warning (probably a lockdep annotation is needed here): > > And we already discussed this ;) We do not need 2 siglock's, only > parent's. Just move the callsite in __exit_signal() down, under > another (lockless) group_dead check. > > Or I missed something? The problem with moving this down to the second group_dead check is that after __unhash_process() is called, pid_alive(tsk) which is checked in thread_group_cputime() returns false. Therefore we always get zero CPU times. So I probably have to introduce a second group_dead check at the beginning of __exit_signal(): @@ -150,6 +153,9 @@ static void __exit_signal(struct task_st struct sighand_struct *sighand; struct tty_struct *uninitialized_var(tty); + if (group_dead) + __account_cdata(...); + sighand = rcu_dereference_check(tsk->sighand, rcu_read_lock_held() || > We can do this before taking ->siglock. Not that I think this really > matters, but otherwise this looks a bit confusing imho, as if we need > parent's ->siglock to pin something. ok > > > And thanks for splitting these changes. It was much, much easier to > read now. My personal feeling is that probably the only acceptable thing would be to make the new behavior configurable with a sysctl and define the default as it currently is (POSIX compliant). This would only introduce two additional checks in __exit_signal() and wait_task_zombie() and would not add any new fields to the signal_struct. Michael ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting 2010-11-25 9:40 ` Michael Holzheu @ 2010-11-25 13:21 ` Oleg Nesterov 2010-11-25 17:45 ` Michael Holzheu 0 siblings, 1 reply; 24+ messages in thread From: Oleg Nesterov @ 2010-11-25 13:21 UTC (permalink / raw) To: Michael Holzheu Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 On 11/25, Michael Holzheu wrote: > > Hello Oleg, > > On Tue, 2010-11-23 at 17:59 +0100, Oleg Nesterov wrote: > > On 11/19, Michael Holzheu wrote: > > > TODO: > > > ----- > > > With this patch we take the siglock twice. First for the dead task > > > and second for the parent of the dead task. This give the following > > > lockdep warning (probably a lockdep annotation is needed here): > > > > And we already discussed this ;) We do not need 2 siglock's, only > > parent's. Just move the callsite in __exit_signal() down, under > > another (lockless) group_dead check. > > > > Or I missed something? > > The problem with moving this down to the second group_dead check is that > after __unhash_process() is called, pid_alive(tsk) which is checked in > thread_group_cputime() returns false. Therefore we always get zero CPU > times. I see, thanks. > So I probably have to introduce a second group_dead check at the > beginning of __exit_signal(): Probably... But in fact this reminds we should cleanup this code somehow. By the time we call thread_group_times() there are no other threads. > My personal feeling is that probably the only acceptable thing would be > to make the new behavior configurable with a sysctl and define the > default as it currently is (POSIX compliant). > > This would only introduce two additional checks in __exit_signal() and > wait_task_zombie() and would not add any new fields to the > signal_struct. Yeah, it would be nice to avoid new fields. Hmm. Somehow I forgot about 4/4, please see another email... Oleg. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting 2010-11-25 13:21 ` Oleg Nesterov @ 2010-11-25 17:45 ` Michael Holzheu 0 siblings, 0 replies; 24+ messages in thread From: Michael Holzheu @ 2010-11-25 17:45 UTC (permalink / raw) To: Oleg Nesterov Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 On Thu, 2010-11-25 at 14:21 +0100, Oleg Nesterov wrote: > On 11/25, Michael Holzheu wrote: > But in fact this reminds we should cleanup this code somehow. > By the time we call thread_group_times() there are no other > threads. Couldn't we just replace thread_group_times() with the following: --- a/kernel/exit.c +++ b/kernel/exit.c @@ -101,12 +101,13 @@ static void __account_cdata(struct task_ * group, which consolidates times for all threads in the * group including the group leader. */ - thread_group_times(p, &tgutime, &tgstime); pcd = &p->real_parent->signal->cdata_wait; tcd = &p->signal->cdata_threads; cd = &p->signal->cdata_wait; spin_lock_irqsave(&p->real_parent->sighand->siglock, flags); + tgutime = cputime_add(tcd->utime, p->utime); + tgstime = cputime_add(tcd->stime, p->stime); pcd->utime = cputime_add(pcd->utime, cputime_add(tgutime, Then we can move the __account_cdata() invocation down to the second group_dead check. Michael ^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 4/4] taskstats: Export "cdata_acct" with taskstats 2010-11-19 20:11 [patch 0/4] taskstats: Improve cumulative time accounting Michael Holzheu ` (2 preceding siblings ...) 2010-11-19 20:11 ` [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting Michael Holzheu @ 2010-11-19 20:11 ` Michael Holzheu 2010-11-25 13:26 ` Oleg Nesterov 2010-11-25 16:57 ` Balbir Singh 2010-11-19 20:19 ` [patch 0/4] taskstats: Improve cumulative time accounting Peter Zijlstra 4 siblings, 2 replies; 24+ messages in thread From: Michael Holzheu @ 2010-11-19 20:11 UTC (permalink / raw) To: Oleg Nesterov, Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath Cc: linux-kernel, linux-s390 [-- Attachment #1: 04-taskstats-top-improve-ctime-taskstats.patch --] [-- Type: text/plain, Size: 1637 bytes --] From: Michael Holzheu <holzheu@linux.vnet.ibm.com> With this patch the (full) cumulative CPU time is added to "struct taskstats". The CPU time is only returned for the thread group leader. Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> --- include/linux/taskstats.h | 7 ++++++- kernel/tsacct.c | 7 +++++++ 2 files changed, 13 insertions(+), 1 deletion(-) --- a/include/linux/taskstats.h +++ b/include/linux/taskstats.h @@ -33,7 +33,7 @@ */ -#define TASKSTATS_VERSION 7 +#define TASKSTATS_VERSION 8 #define TS_COMM_LEN 32 /* should be >= TASK_COMM_LEN * in linux/sched.h */ @@ -163,6 +163,11 @@ struct taskstats { /* Delay waiting for memory reclaim */ __u64 freepages_count; __u64 freepages_delay_total; + /* version 7 ends here */ + + /* All cumulative CPU time of dead children */ + __u64 ac_cutime_acct; /* User CPU time [usec] */ + __u64 ac_cstime_acct; /* System CPU time [usec] */ }; --- a/kernel/tsacct.c +++ b/kernel/tsacct.c @@ -31,6 +31,7 @@ void bacct_add_tsk(struct taskstats *sta const struct cred *tcred; struct timespec uptime, ts; u64 ac_etime; + unsigned long flags; BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN); @@ -71,6 +72,12 @@ void bacct_add_tsk(struct taskstats *sta stats->ac_majflt = tsk->maj_flt; strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm)); + if (tsk->tgid == tsk->pid && lock_task_sighand(tsk, &flags)) { + struct cdata *cd = &tsk->signal->cdata_acct; + stats->ac_cutime_acct = cputime_to_usecs(cd->utime); + stats->ac_cstime_acct = cputime_to_usecs(cd->stime); + unlock_task_sighand(tsk, &flags); + } } ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 4/4] taskstats: Export "cdata_acct" with taskstats 2010-11-19 20:11 ` [patch 4/4] taskstats: Export "cdata_acct" with taskstats Michael Holzheu @ 2010-11-25 13:26 ` Oleg Nesterov 2010-11-25 17:21 ` Michael Holzheu 2010-11-25 16:57 ` Balbir Singh 1 sibling, 1 reply; 24+ messages in thread From: Oleg Nesterov @ 2010-11-25 13:26 UTC (permalink / raw) To: Michael Holzheu Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 On 11/19, Michael Holzheu wrote: > > From: Michael Holzheu <holzheu@linux.vnet.ibm.com> > > With this patch the (full) cumulative CPU time is added to "struct taskstats". > The CPU time is only returned for the thread group leader. > > ... > > + if (tsk->tgid == tsk->pid thread_group_leader() ? > && lock_task_sighand(tsk, &flags)) { Do you really need ->siglock? Starting from 2.6.35 it is always safe to access ->signal. Oleg. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 4/4] taskstats: Export "cdata_acct" with taskstats 2010-11-25 13:26 ` Oleg Nesterov @ 2010-11-25 17:21 ` Michael Holzheu 2010-11-29 16:43 ` Oleg Nesterov 0 siblings, 1 reply; 24+ messages in thread From: Michael Holzheu @ 2010-11-25 17:21 UTC (permalink / raw) To: Oleg Nesterov Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 Hello Oleg, On Thu, 2010-11-25 at 14:26 +0100, Oleg Nesterov wrote: > On 11/19, Michael Holzheu wrote: > > > > From: Michael Holzheu <holzheu@linux.vnet.ibm.com> > > > > With this patch the (full) cumulative CPU time is added to "struct taskstats". > > The CPU time is only returned for the thread group leader. > > > > ... > > > > + if (tsk->tgid == tsk->pid > > thread_group_leader() ? Yes, that's better. > > && lock_task_sighand(tsk, &flags)) { > > Do you really need ->siglock? Starting from 2.6.35 it is always > safe to access ->signal. Hmmm, if you say that... I just did it like it is done in e.g. fs/proc/base.c (proc_pid_limits). Can we remove the locking there, too? Michael ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 4/4] taskstats: Export "cdata_acct" with taskstats 2010-11-25 17:21 ` Michael Holzheu @ 2010-11-29 16:43 ` Oleg Nesterov 2010-11-29 16:58 ` Michael Holzheu 0 siblings, 1 reply; 24+ messages in thread From: Oleg Nesterov @ 2010-11-29 16:43 UTC (permalink / raw) To: Michael Holzheu Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 On 11/25, Michael Holzheu wrote: > > Hello Oleg, > > On Thu, 2010-11-25 at 14:26 +0100, Oleg Nesterov wrote: > > On 11/19, Michael Holzheu wrote: > > > > > > From: Michael Holzheu <holzheu@linux.vnet.ibm.com> > > > > > > With this patch the (full) cumulative CPU time is added to "struct taskstats". > > > The CPU time is only returned for the thread group leader. > > > > > > ... > > > > > > + if (tsk->tgid == tsk->pid > > > > thread_group_leader() ? > > Yes, that's better. > > > > && lock_task_sighand(tsk, &flags)) { > > > > Do you really need ->siglock? Starting from 2.6.35 it is always > > safe to access ->signal. > > Hmmm, if you say that... > > I just did it like it is done in e.g. fs/proc/base.c (proc_pid_limits). > Can we remove the locking there, too? We can certainly remove more siglock's which were previously needed to access ->signal. This particular one is just wrong. We need task_lock(group_leader) to read signal->rlim atomically. However, it is not trivial to do this correctly. Probably we should ignore this minor problem. In any case, this ->siglock buys nothing today. Oleg. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 4/4] taskstats: Export "cdata_acct" with taskstats 2010-11-29 16:43 ` Oleg Nesterov @ 2010-11-29 16:58 ` Michael Holzheu 2010-11-29 18:08 ` Oleg Nesterov 0 siblings, 1 reply; 24+ messages in thread From: Michael Holzheu @ 2010-11-29 16:58 UTC (permalink / raw) To: Oleg Nesterov Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 Hello Oleg, On Mon, 2010-11-29 at 17:43 +0100, Oleg Nesterov wrote: > > I just did it like it is done in e.g. fs/proc/base.c (proc_pid_limits). > > Can we remove the locking there, too? > > We can certainly remove more siglock's which were previously > needed to access ->signal. > > This particular one is just wrong. We need task_lock(group_leader) > to read signal->rlim atomically. However, it is not trivial to do > this correctly. Probably we should ignore this minor problem. > > In any case, this ->siglock buys nothing today. But at least to get the two values cutime and cstime consistent we need the siglock? There could be a parallel update for tsk->signal->cutime/cstime, while the taskstats are created, no? Michael ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 4/4] taskstats: Export "cdata_acct" with taskstats 2010-11-29 16:58 ` Michael Holzheu @ 2010-11-29 18:08 ` Oleg Nesterov 0 siblings, 0 replies; 24+ messages in thread From: Oleg Nesterov @ 2010-11-29 18:08 UTC (permalink / raw) To: Michael Holzheu Cc: Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 On 11/29, Michael Holzheu wrote: > > Hello Oleg, > > On Mon, 2010-11-29 at 17:43 +0100, Oleg Nesterov wrote: > > > I just did it like it is done in e.g. fs/proc/base.c (proc_pid_limits). > > > Can we remove the locking there, too? > > > > We can certainly remove more siglock's which were previously > > needed to access ->signal. > > > > This particular one is just wrong. We need task_lock(group_leader) > > to read signal->rlim atomically. However, it is not trivial to do > > this correctly. Probably we should ignore this minor problem. > > > > In any case, this ->siglock buys nothing today. > > But at least to get the two values cutime and cstime consistent we need > the siglock? There could be a parallel update for > tsk->signal->cutime/cstime, while the taskstats are created, no? I guess you mean 4/4 you sent. Probably you are right, I'll try to look tomorrow. Oleg. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 4/4] taskstats: Export "cdata_acct" with taskstats 2010-11-19 20:11 ` [patch 4/4] taskstats: Export "cdata_acct" with taskstats Michael Holzheu 2010-11-25 13:26 ` Oleg Nesterov @ 2010-11-25 16:57 ` Balbir Singh 1 sibling, 0 replies; 24+ messages in thread From: Balbir Singh @ 2010-11-25 16:57 UTC (permalink / raw) To: Michael Holzheu Cc: Oleg Nesterov, Shailabh Nagar, Andrew Morton, Peter Zijlstra, John stultz, Thomas Gleixner, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 * Michael Holzheu <holzheu@linux.vnet.ibm.com> [2010-11-19 21:11:12]: > From: Michael Holzheu <holzheu@linux.vnet.ibm.com> > > With this patch the (full) cumulative CPU time is added to "struct taskstats". > The CPU time is only returned for the thread group leader. > > Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> > --- > include/linux/taskstats.h | 7 ++++++- > kernel/tsacct.c | 7 +++++++ > 2 files changed, 13 insertions(+), 1 deletion(-) > > --- a/include/linux/taskstats.h > +++ b/include/linux/taskstats.h > @@ -33,7 +33,7 @@ > */ > > > -#define TASKSTATS_VERSION 7 > +#define TASKSTATS_VERSION 8 > #define TS_COMM_LEN 32 /* should be >= TASK_COMM_LEN > * in linux/sched.h */ > > @@ -163,6 +163,11 @@ struct taskstats { > /* Delay waiting for memory reclaim */ > __u64 freepages_count; > __u64 freepages_delay_total; > + /* version 7 ends here */ > + > + /* All cumulative CPU time of dead children */ > + __u64 ac_cutime_acct; /* User CPU time [usec] */ > + __u64 ac_cstime_acct; /* System CPU time [usec] */ > }; > > > --- a/kernel/tsacct.c > +++ b/kernel/tsacct.c > @@ -31,6 +31,7 @@ void bacct_add_tsk(struct taskstats *sta > const struct cred *tcred; > struct timespec uptime, ts; > u64 ac_etime; > + unsigned long flags; > > BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN); > > @@ -71,6 +72,12 @@ void bacct_add_tsk(struct taskstats *sta > stats->ac_majflt = tsk->maj_flt; > > strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm)); > + if (tsk->tgid == tsk->pid && lock_task_sighand(tsk, &flags)) { I don't think referring to tgid and pid is a good idea in the context of namespaces, thread_group_leader makes more sense like Oleg pointed out. > + struct cdata *cd = &tsk->signal->cdata_acct; > + stats->ac_cutime_acct = cputime_to_usecs(cd->utime); > + stats->ac_cstime_acct = cputime_to_usecs(cd->stime); > + unlock_task_sighand(tsk, &flags); > + } > } > > > -- Three Cheers, Balbir ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 0/4] taskstats: Improve cumulative time accounting 2010-11-19 20:11 [patch 0/4] taskstats: Improve cumulative time accounting Michael Holzheu ` (3 preceding siblings ...) 2010-11-19 20:11 ` [patch 4/4] taskstats: Export "cdata_acct" with taskstats Michael Holzheu @ 2010-11-19 20:19 ` Peter Zijlstra 2010-11-20 15:17 ` Oleg Nesterov 2010-11-22 11:03 ` Michael Holzheu 4 siblings, 2 replies; 24+ messages in thread From: Peter Zijlstra @ 2010-11-19 20:19 UTC (permalink / raw) To: Michael Holzheu Cc: Oleg Nesterov, Shailabh Nagar, Andrew Morton, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 On Fri, 2010-11-19 at 21:11 +0100, Michael Holzheu wrote: > Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted > to the cumulative time of the parents, if the parents ignore SIGCHLD > or have set SA_NOCLDWAIT. This behaviour has the major drawback that > it is not possible to calculate all consumed CPU time of a system by > looking at the current tasks. CPU time can be lost. > > To solve this problem, this patch set duplicates the cumulative accounting > data in the signal_struct. In the second set (cdata_acct) the complete > cumulative resource counters are stored. The new cumulative CPU time (utime > and stime) is then exported via the taskstats interface. Maybe this has been treated earlier in the threads and I missed it, but the obvious solution doesn't get mentioned: What would break if we violate this silly POSIX rule and account time of childs regardless of SIGCHLD/SA_NOCLDWAIT? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 0/4] taskstats: Improve cumulative time accounting 2010-11-19 20:19 ` [patch 0/4] taskstats: Improve cumulative time accounting Peter Zijlstra @ 2010-11-20 15:17 ` Oleg Nesterov 2010-11-22 7:21 ` Balbir Singh 2010-11-22 11:03 ` Michael Holzheu 1 sibling, 1 reply; 24+ messages in thread From: Oleg Nesterov @ 2010-11-20 15:17 UTC (permalink / raw) To: Peter Zijlstra Cc: Michael Holzheu, Shailabh Nagar, Andrew Morton, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 On 11/19, Peter Zijlstra wrote: > > On Fri, 2010-11-19 at 21:11 +0100, Michael Holzheu wrote: > > Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted > > to the cumulative time of the parents, if the parents ignore SIGCHLD > > or have set SA_NOCLDWAIT. This behaviour has the major drawback that > > it is not possible to calculate all consumed CPU time of a system by > > looking at the current tasks. CPU time can be lost. > > > > To solve this problem, this patch set duplicates the cumulative accounting > > data in the signal_struct. In the second set (cdata_acct) the complete > > cumulative resource counters are stored. The new cumulative CPU time (utime > > and stime) is then exported via the taskstats interface. > > Maybe this has been treated earlier in the threads and I missed it, but > the obvious solution doesn't get mentioned: IIRC, the first version did this. And it was me who spoiled this approach. But! only because I wasn't sure this user-visible change is acceptable, and because there was some misunderstanding. See http://marc.info/?l=linux-kernel&m=128552495203050&w=2 But, > What would break if we say, any test-case which does getrusage() after fork() with ignored SIGCHLD/SA_NOCLDWAIT?. > violate this silly POSIX rule and account time of > childs regardless of SIGCHLD/SA_NOCLDWAIT? +1. Personally, I'd certainly prefer this way, because I don't care about POSIX at all ;) Still. Once again, this breaks the current rules, and we never do this without strong reason. I think we should ask Roland. If he thinks this is OK, I'd certainly agree. Oleg. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 0/4] taskstats: Improve cumulative time accounting 2010-11-20 15:17 ` Oleg Nesterov @ 2010-11-22 7:21 ` Balbir Singh 0 siblings, 0 replies; 24+ messages in thread From: Balbir Singh @ 2010-11-22 7:21 UTC (permalink / raw) To: Oleg Nesterov Cc: Peter Zijlstra, Michael Holzheu, Shailabh Nagar, Andrew Morton, John stultz, Thomas Gleixner, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 * Oleg Nesterov <oleg@redhat.com> [2010-11-20 16:17:11]: > On 11/19, Peter Zijlstra wrote: > > > > On Fri, 2010-11-19 at 21:11 +0100, Michael Holzheu wrote: > > > Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted > > > to the cumulative time of the parents, if the parents ignore SIGCHLD > > > or have set SA_NOCLDWAIT. This behaviour has the major drawback that > > > it is not possible to calculate all consumed CPU time of a system by > > > looking at the current tasks. CPU time can be lost. > > > > > > To solve this problem, this patch set duplicates the cumulative accounting > > > data in the signal_struct. In the second set (cdata_acct) the complete > > > cumulative resource counters are stored. The new cumulative CPU time (utime > > > and stime) is then exported via the taskstats interface. > > > > Maybe this has been treated earlier in the threads and I missed it, but > > the obvious solution doesn't get mentioned: > > IIRC, the first version did this. > > And it was me who spoiled this approach. But! only because I wasn't sure > this user-visible change is acceptable, and because there was some > misunderstanding. See http://marc.info/?l=linux-kernel&m=128552495203050&w=2 > > But, > > > What would break if we > > say, any test-case which does getrusage() after fork() with ignored > SIGCHLD/SA_NOCLDWAIT?. > > > violate this silly POSIX rule and account time of > > childs regardless of SIGCHLD/SA_NOCLDWAIT? > > +1. > > Personally, I'd certainly prefer this way, because I don't care about > POSIX at all ;) > > > Still. Once again, this breaks the current rules, and we never do > this without strong reason. > > I think we should ask Roland. If he thinks this is OK, I'd certainly > agree. > The Linux man page states In Linux kernel versions before 2.6.9, if the disposition of SIGCHLD is set to SIG_IGN then the resource usages of child processes are automatically included in the value returned by RUSAGE_CHILDREN, although POSIX.1-2001 explicitly prohibits this. This nonconformance is rectified in Linux 2.6.9 and later. -- Three Cheers, Balbir ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 0/4] taskstats: Improve cumulative time accounting 2010-11-19 20:19 ` [patch 0/4] taskstats: Improve cumulative time accounting Peter Zijlstra 2010-11-20 15:17 ` Oleg Nesterov @ 2010-11-22 11:03 ` Michael Holzheu 2010-11-22 12:47 ` Michael Holzheu 1 sibling, 1 reply; 24+ messages in thread From: Michael Holzheu @ 2010-11-22 11:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Oleg Nesterov, Shailabh Nagar, Andrew Morton, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 On Fri, 2010-11-19 at 21:19 +0100, Peter Zijlstra wrote: > On Fri, 2010-11-19 at 21:11 +0100, Michael Holzheu wrote: > > Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted > > to the cumulative time of the parents, if the parents ignore SIGCHLD > > or have set SA_NOCLDWAIT. This behaviour has the major drawback that > > it is not possible to calculate all consumed CPU time of a system by > > looking at the current tasks. CPU time can be lost. > > > > To solve this problem, this patch set duplicates the cumulative accounting > > data in the signal_struct. In the second set (cdata_acct) the complete > > cumulative resource counters are stored. The new cumulative CPU time (utime > > and stime) is then exported via the taskstats interface. > > Maybe this has been treated earlier in the threads and I missed it, but > the obvious solution doesn't get mentioned: > > What would break if we violate this silly POSIX rule and account time of > childs regardless of SIGCHLD/SA_NOCLDWAIT? Or maybe we could add a sysctl that allows to switch between the two semantics. Michael ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 0/4] taskstats: Improve cumulative time accounting 2010-11-22 11:03 ` Michael Holzheu @ 2010-11-22 12:47 ` Michael Holzheu 2010-11-22 18:11 ` Valdis.Kletnieks 0 siblings, 1 reply; 24+ messages in thread From: Michael Holzheu @ 2010-11-22 12:47 UTC (permalink / raw) To: Peter Zijlstra Cc: Oleg Nesterov, Shailabh Nagar, Andrew Morton, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 On Mon, 2010-11-22 at 12:03 +0100, Michael Holzheu wrote: > Or maybe we could add a sysctl that allows to switch between the two > semantics. Then patch 03/04 would be something like the following: --- Subject: taskstats: Introduce complete cumulative accounting From: Michael Holzheu <holzheu@linux.vnet.ibm.com> Currently the cumulative time accounting in Linux is not complete. Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted to the cumulative time of the parents, if the parents ignore SIGCHLD or have set SA_NOCLDWAIT. This behaviour has the major drawback that it is not possible to calculate all consumed CPU time of a system by looking at the current tasks. CPU time can be lost. This patch adds a new sysctl "kernel.full_cdata" that allows to switch between the POSIX behavior and complete cumulative accounting. Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> --- include/linux/sched.h | 1 + kernel/exit.c | 12 ++++++++---- kernel/sysctl.c | 7 +++++++ 3 files changed, 16 insertions(+), 4 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1907,6 +1907,7 @@ enum sched_tunable_scaling { }; extern enum sched_tunable_scaling sysctl_sched_tunable_scaling; +extern unsigned int full_cdata_enabled; #ifdef CONFIG_SCHED_DEBUG extern unsigned int sysctl_sched_migration_cost; extern unsigned int sysctl_sched_nr_migrate; --- a/kernel/exit.c +++ b/kernel/exit.c @@ -57,6 +57,8 @@ #include <asm/pgtable.h> #include <asm/mmu_context.h> +unsigned int full_cdata_enabled = 1; + static void exit_mm(struct task_struct * tsk); static void __unhash_process(struct task_struct *p, bool group_dead) @@ -77,7 +79,7 @@ static void __unhash_process(struct task static void __account_cdata(struct task_struct *p) { struct cdata *cd, *pcd, *tcd; - unsigned long maxrss; + unsigned long maxrss, flags; cputime_t tgutime, tgstime; /* @@ -100,7 +102,7 @@ static void __account_cdata(struct task_ * group including the group leader. */ thread_group_times(p, &tgutime, &tgstime); - spin_lock_irq(&p->real_parent->sighand->siglock); + spin_lock_irqsave(&p->real_parent->sighand->siglock, flags); pcd = &p->real_parent->signal->cdata_wait; tcd = &p->signal->cdata_threads; cd = &p->signal->cdata_wait; @@ -137,7 +139,7 @@ static void __account_cdata(struct task_ pcd->maxrss = maxrss; task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac); task_io_accounting_add(&p->real_parent->signal->ioac, &p->signal->ioac); - spin_unlock_irq(&p->real_parent->sighand->siglock); + spin_unlock_irqrestore(&p->real_parent->sighand->siglock, flags); } /* @@ -157,6 +159,8 @@ static void __exit_signal(struct task_st posix_cpu_timers_exit(tsk); if (group_dead) { + if (full_cdata_enabled) + __account_cdata(tsk); posix_cpu_timers_exit_group(tsk); tty = sig->tty; sig->tty = NULL; @@ -1292,7 +1296,7 @@ static int wait_task_zombie(struct wait_ * It can be ptraced but not reparented, check * !task_detached() to filter out sub-threads. */ - if (likely(!traced) && likely(!task_detached(p))) + if (likely(!traced) && likely(!task_detached(p)) && !full_cdata_enabled) __account_cdata(p); /* --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -963,6 +963,13 @@ static struct ctl_table kern_table[] = { .proc_handler = proc_dointvec, }, #endif + { + .procname = "full_cdata", + .data = &full_cdata_enabled, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, /* * NOTE: do not add new entries to this table unless you have read * Documentation/sysctl/ctl_unnumbered.txt ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 0/4] taskstats: Improve cumulative time accounting 2010-11-22 12:47 ` Michael Holzheu @ 2010-11-22 18:11 ` Valdis.Kletnieks 0 siblings, 0 replies; 24+ messages in thread From: Valdis.Kletnieks @ 2010-11-22 18:11 UTC (permalink / raw) To: holzheu Cc: Peter Zijlstra, Oleg Nesterov, Shailabh Nagar, Andrew Morton, John stultz, Thomas Gleixner, Balbir Singh, Martin Schwidefsky, Heiko Carstens, Roland McGrath, linux-kernel, linux-s390 [-- Attachment #1: Type: text/plain, Size: 761 bytes --] On Mon, 22 Nov 2010 13:47:55 +0100, Michael Holzheu said: > Currently the cumulative time accounting in Linux is not complete. > Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted > to the cumulative time of the parents, if the parents ignore SIGCHLD > or have set SA_NOCLDWAIT. This behaviour has the major drawback that > it is not possible to calculate all consumed CPU time of a system by > looking at the current tasks. CPU time can be lost. > > This patch adds a new sysctl "kernel.full_cdata" that allows to switch > between the POSIX behavior and complete cumulative accounting. > +unsigned int full_cdata_enabled = 1; This probably needs to default to "current kernel behavior" but allow sysadmins to change to the new behavior. [-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --] ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2010-11-29 18:16 UTC | newest] Thread overview: 24+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-11-19 20:11 [patch 0/4] taskstats: Improve cumulative time accounting Michael Holzheu 2010-11-19 20:11 ` [patch 1/4] taskstats: Introduce "struct cdata" Michael Holzheu 2010-11-25 12:29 ` Balbir Singh 2010-11-25 14:23 ` Oleg Nesterov 2010-11-25 16:38 ` Michael Holzheu 2010-11-19 20:11 ` [patch 2/4] taskstats: Introduce __account_cdata() function Michael Holzheu 2010-11-19 20:11 ` [patch 3/4] taskstats: Introduce cdata_acct for complete cumulative accounting Michael Holzheu 2010-11-23 16:59 ` Oleg Nesterov 2010-11-25 9:40 ` Michael Holzheu 2010-11-25 13:21 ` Oleg Nesterov 2010-11-25 17:45 ` Michael Holzheu 2010-11-19 20:11 ` [patch 4/4] taskstats: Export "cdata_acct" with taskstats Michael Holzheu 2010-11-25 13:26 ` Oleg Nesterov 2010-11-25 17:21 ` Michael Holzheu 2010-11-29 16:43 ` Oleg Nesterov 2010-11-29 16:58 ` Michael Holzheu 2010-11-29 18:08 ` Oleg Nesterov 2010-11-25 16:57 ` Balbir Singh 2010-11-19 20:19 ` [patch 0/4] taskstats: Improve cumulative time accounting Peter Zijlstra 2010-11-20 15:17 ` Oleg Nesterov 2010-11-22 7:21 ` Balbir Singh 2010-11-22 11:03 ` Michael Holzheu 2010-11-22 12:47 ` Michael Holzheu 2010-11-22 18:11 ` Valdis.Kletnieks
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox