public inbox for linux-kernel@vger.kernel.org
* hackbench [pthread mode] regression with 2.6.29-rc3
@ 2009-02-01  7:30 Zhang, Yanmin
       [not found] ` <d3f22a0902010026q1db36381j36cb1c9803d48431@mail.gmail.com>
  0 siblings, 1 reply; 13+ messages in thread
From: Zhang, Yanmin @ 2009-02-01  7:30 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML

Compared with 2.6.29-rc2's result, the hackbench [pthread mode] result is increased by about
50%~100% with 2.6.29-rc3 on my 4 quad-core processor Tigerton machine and my 4 quad-core
Montvale Itanium machine. The smaller the result, the better the performance.

Command to run hackbench:
#./hackbench 100 thread 2000

Bisecting located the patch below.
commit 490dea45d00f01847ebebd007685d564aaf2cd98
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon Nov 24 17:06:57 2008 +0100

    itimers: remove the per-cpu-ish-ness

    Either we bounce once cacheline per cpu per tick, yielding n^2 bounces
    or we just bounce a single..

    Also, using per-cpu allocations for the thread-groups complicates the
    per-cpu allocator in that its currently aimed to be a fixed sized
    allocator and the only possible extention to that would be vmap based,
    which is seriously constrained on 32 bit archs.


After the above patch is reverted, the hackbench result is restored.

yanmin



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: hackbench [pthread mode] regression with 2.6.29-rc3
       [not found] ` <d3f22a0902010026q1db36381j36cb1c9803d48431@mail.gmail.com>
@ 2009-02-01  8:29   ` Lin Ming
  2009-02-01  9:17     ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Lin Ming @ 2009-02-01  8:29 UTC (permalink / raw)
  To: peterz; +Cc: Zhang, Yanmin, linux-kernel


> Bisect located below patch.
> commit 490dea45d00f01847ebebd007685d564aaf2cd98
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Mon Nov 24 17:06:57 2008 +0100
> 
>    itimers: remove the per-cpu-ish-ness
> 
>    Either we bounce once cacheline per cpu per tick, yielding n^2 bounces
>    or we just bounce a single..
> 
>    Also, using per-cpu allocations for the thread-groups complicates the
>    per-cpu allocator in that its currently aimed to be a fixed sized
>    allocator and the only possible extention to that would be vmap based,
>    which is seriously constrained on 32 bit archs.
> 
> 
> After above patch is reverted, hackbench result is restored.

oltp has a ~3% regression with 2.6.29-rc3 on a 4-core*2P Stoakley machine.
After the above patch was reverted, the regression disappeared.

Lin Ming



* Re: hackbench [pthread mode] regression with 2.6.29-rc3
  2009-02-01  8:29   ` Lin Ming
@ 2009-02-01  9:17     ` Peter Zijlstra
  2009-02-01  9:57       ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2009-02-01  9:17 UTC (permalink / raw)
  To: Lin Ming; +Cc: Zhang, Yanmin, linux-kernel

On Sun, 2009-02-01 at 16:29 +0800, Lin Ming wrote:
> > Bisect located below patch.
> > commit 490dea45d00f01847ebebd007685d564aaf2cd98
> > Author: Peter Zijlstra <peterz@infradead.org>
> > Date:   Mon Nov 24 17:06:57 2008 +0100
> > 
> >    itimers: remove the per-cpu-ish-ness
> > 
> >    Either we bounce once cacheline per cpu per tick, yielding n^2 bounces
> >    or we just bounce a single..
> > 
> >    Also, using per-cpu allocations for the thread-groups complicates the
> >    per-cpu allocator in that its currently aimed to be a fixed sized
> >    allocator and the only possible extention to that would be vmap based,
> >    which is seriously constrained on 32 bit archs.
> > 
> > 
> > After above patch is reverted, hackbench result is restored.
> 
> oltp has ~3% regression with 2.6.29-rc3 on 4core*2p stokley machine.
> After above patch reverted, the regression disappeared.

*sigh*, did they gain anything with the introduction of the per-cpu crap?

f06febc96ba8e0af80bcc3eaec0a109e88275fac
5ce73a4a5a4893a1aa4cdeed1b1a5a6de42c43b6
bb34d92f643086d546b49cef680f6f305ed84414
ad133ba3dc283300e5b62b5b7211d2f39fbf6ee7
ce394471d13bf071939a9a0b48c64c297676d233





* Re: hackbench [pthread mode] regression with 2.6.29-rc3
  2009-02-01  9:17     ` Peter Zijlstra
@ 2009-02-01  9:57       ` Peter Zijlstra
  2009-02-01 10:04         ` Ingo Molnar
  2009-02-02  1:12         ` Zhang, Yanmin
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2009-02-01  9:57 UTC (permalink / raw)
  To: Lin Ming; +Cc: Zhang, Yanmin, linux-kernel

On Sun, 2009-02-01 at 10:17 +0100, Peter Zijlstra wrote:
> On Sun, 2009-02-01 at 16:29 +0800, Lin Ming wrote:
> > > Bisect located below patch.
> > > commit 490dea45d00f01847ebebd007685d564aaf2cd98
> > > Author: Peter Zijlstra <peterz@infradead.org>
> > > Date:   Mon Nov 24 17:06:57 2008 +0100
> > > 
> > >    itimers: remove the per-cpu-ish-ness
> > > 
> > >    Either we bounce once cacheline per cpu per tick, yielding n^2 bounces
> > >    or we just bounce a single..
> > > 
> > >    Also, using per-cpu allocations for the thread-groups complicates the
> > >    per-cpu allocator in that its currently aimed to be a fixed sized
> > >    allocator and the only possible extention to that would be vmap based,
> > >    which is seriously constrained on 32 bit archs.
> > > 
> > > 
> > > After above patch is reverted, hackbench result is restored.
> > 
> > oltp has ~3% regression with 2.6.29-rc3 on 4core*2p stokley machine.
> > After above patch reverted, the regression disappeared.
> 
> *sigh*, did they gain anything with introduction of the per-cpu crap?

No, it wouldn't have; I just missed something obvious... :-(

I wish we never merged that crap...



* Re: hackbench [pthread mode] regression with 2.6.29-rc3
  2009-02-01  9:57       ` Peter Zijlstra
@ 2009-02-01 10:04         ` Ingo Molnar
  2009-02-02  1:12         ` Zhang, Yanmin
  1 sibling, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2009-02-01 10:04 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Lin Ming, Zhang, Yanmin, linux-kernel


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, 2009-02-01 at 10:17 +0100, Peter Zijlstra wrote:
> > On Sun, 2009-02-01 at 16:29 +0800, Lin Ming wrote:
> > > > Bisect located below patch.
> > > > commit 490dea45d00f01847ebebd007685d564aaf2cd98
> > > > Author: Peter Zijlstra <peterz@infradead.org>
> > > > Date:   Mon Nov 24 17:06:57 2008 +0100
> > > > 
> > > >    itimers: remove the per-cpu-ish-ness
> > > > 
> > > >    Either we bounce once cacheline per cpu per tick, yielding n^2 bounces
> > > >    or we just bounce a single..
> > > > 
> > > >    Also, using per-cpu allocations for the thread-groups complicates the
> > > >    per-cpu allocator in that its currently aimed to be a fixed sized
> > > >    allocator and the only possible extention to that would be vmap based,
> > > >    which is seriously constrained on 32 bit archs.
> > > > 
> > > > 
> > > > After above patch is reverted, hackbench result is restored.
> > > 
> > > oltp has ~3% regression with 2.6.29-rc3 on 4core*2p stokley machine.
> > > After above patch reverted, the regression disappeared.
> > 
> > *sigh*, did they gain anything with introduction of the per-cpu crap?
> 
> No it wouldn't have, I just missed something obvious,.. :-(
> 
> I wish we never merged that crap...

oh, it certainly had its use: it highlighted that we have had crappy 
threading+timers code (on hackbench_pth) since the beginning of the Linux 
SMP era.

	Ingo


* Re: hackbench [pthread mode] regression with 2.6.29-rc3
  2009-02-01  9:57       ` Peter Zijlstra
  2009-02-01 10:04         ` Ingo Molnar
@ 2009-02-02  1:12         ` Zhang, Yanmin
  2009-02-02  8:53           ` Peter Zijlstra
  1 sibling, 1 reply; 13+ messages in thread
From: Zhang, Yanmin @ 2009-02-02  1:12 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Lin Ming, linux-kernel

On Sun, 2009-02-01 at 10:57 +0100, Peter Zijlstra wrote:
> On Sun, 2009-02-01 at 10:17 +0100, Peter Zijlstra wrote:
> > On Sun, 2009-02-01 at 16:29 +0800, Lin Ming wrote:
> > > > Bisect located below patch.
> > > > commit 490dea45d00f01847ebebd007685d564aaf2cd98
> > > > Author: Peter Zijlstra <peterz@infradead.org>
> > > > Date:   Mon Nov 24 17:06:57 2008 +0100
> > > > 
> > > >    itimers: remove the per-cpu-ish-ness
> > > > 
> > > >    Either we bounce once cacheline per cpu per tick, yielding n^2 bounces
> > > >    or we just bounce a single..
> > > > 
> > > >    Also, using per-cpu allocations for the thread-groups complicates the
> > > >    per-cpu allocator in that its currently aimed to be a fixed sized
> > > >    allocator and the only possible extention to that would be vmap based,
> > > >    which is seriously constrained on 32 bit archs.
> > > > 
> > > > 
> > > > After above patch is reverted, hackbench result is restored.
> > > 
> > > oltp has ~3% regression with 2.6.29-rc3 on 4core*2p stokley machine.
> > > After above patch reverted, the regression disappeared.
> > 
> > *sigh*, did they gain anything with introduction of the per-cpu crap?
> 
> No it wouldn't have, I just missed something obvious,.. :-(
> 
> I wish we never merged that crap...
The process timer (set by setitimer) isn't good. Isn't the per-cpu itimer meant to improve it?

Prior to kernel 2.6.28, the process timer implementation was poor. When the process timer
is near expiry, threads sum the time from all threads of the thread group
over and over again. If there are thousands of threads, the kernel might eventually hang.
I ran into it a couple of months ago when I tried to enable SPECweb2005 (Apache+PHP
thread mode).

So the per-cpu itimer could improve this situation when the thread count is far bigger than
the cpu count. I didn't retry SPECweb2005 with 2.6.28.

-yanmin




* Re: hackbench [pthread mode] regression with 2.6.29-rc3
  2009-02-02  1:12         ` Zhang, Yanmin
@ 2009-02-02  8:53           ` Peter Zijlstra
  2009-02-02 17:49             ` Bryon Roche
  2009-02-03 11:56             ` [RFC] process wide itimer cruft Peter Zijlstra
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2009-02-02  8:53 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Lin Ming, linux-kernel

On Mon, 2009-02-02 at 09:12 +0800, Zhang, Yanmin wrote:

> process timer (by setitimer) isn't good. Is per-cpu itimer to improve it?

> So the per-cpu itimer could improve this situation when thread number is far bigger than
> cpu number. I didn't retry specweb2005 with 2.6.28.

1) process-wide itimers are rubbish, and 2) per-cpu itimers are rubbish too,
for the very simple reason that they waste gobs of memory for sane programs.

I would rather go back to the old model where we iterate over all threads,
and find a way so that programs with too many threads for their own good
don't lock up the kernel, but instead get poor service.

Now the problem appears to be that I overlooked that we keep the itimer
clock running at all times, not only when we have a pending timer, and I
suppose that the standard mandates that behaviour :/

Anyway, I will try to sort something out; at worst we'll have to revert
to the .28 state and live with the per-cpu crap for another release.



* Re: hackbench [pthread mode] regression with 2.6.29-rc3
  2009-02-02  8:53           ` Peter Zijlstra
@ 2009-02-02 17:49             ` Bryon Roche
  2009-02-02 20:50               ` Peter Zijlstra
  2009-02-03 11:56             ` [RFC] process wide itimer cruft Peter Zijlstra
  1 sibling, 1 reply; 13+ messages in thread
From: Bryon Roche @ 2009-02-02 17:49 UTC (permalink / raw)
  To: linux-kernel

On Mon, 02 Feb 2009 09:53:38 +0100, Peter Zijlstra wrote:

> On Mon, 2009-02-02 at 09:12 +0800, Zhang, Yanmin wrote:
> 
> I would rather go back to the old model where we iterate all threads,
> and find a way to not make programs with too many threads for their own
> good lock up the kernel, but instead get poor service.

Now, there's an interesting question: what is the definition of too many
threads for a program's own good? When evaluating this, please assume that
you do actually have enough RAM to keep thread stacks and other userspace
threading resources in core.

/B



* Re: hackbench [pthread mode] regression with 2.6.29-rc3
  2009-02-02 17:49             ` Bryon Roche
@ 2009-02-02 20:50               ` Peter Zijlstra
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2009-02-02 20:50 UTC (permalink / raw)
  To: Bryon Roche; +Cc: linux-kernel

On Mon, 2009-02-02 at 17:49 +0000, Bryon Roche wrote:

don't drop CCs

> On Mon, 02 Feb 2009 09:53:38 +0100, Peter Zijlstra wrote:
> 
> > On Mon, 2009-02-02 at 09:12 +0800, Zhang, Yanmin wrote:
> > 
> > I would rather go back to the old model where we iterate all threads,
> > and find a way to not make programs with too many threads for their own
> > good lock up the kernel, but instead get poor service.
> 
> Now, there's an interesting question, what is the definition of too many
> threads for a program's own good?  When evaluating this, please assume 
> that you
> do actually have enough RAM to keep thread stacks/other userspace 
> threading
> resources in-core.

In my book a program has too many threads when it outnumbers the number
of cpus available to it by an order of magnitude or so.





* [RFC] process wide itimer cruft
  2009-02-02  8:53           ` Peter Zijlstra
  2009-02-02 17:49             ` Bryon Roche
@ 2009-02-03 11:56             ` Peter Zijlstra
  2009-02-03 17:23               ` Oleg Nesterov
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2009-02-03 11:56 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Lin Ming, linux-kernel, Ingo Molnar, Oleg Nesterov

On Mon, 2009-02-02 at 09:53 +0100, Peter Zijlstra wrote:

> Now the problems appears to be that I overlooked that we keep the itimer
> clock running at all times, not only when we have a pending timer, and I
> suppose that the standard mandates that behaviour :/

> I would rather go back to the old model where we iterate all threads,
> and find a way to not make programs with too many threads for their own
> good lock up the kernel, but instead get poor service.

OK, so I attempted something like that (it sort-of boots), but exit.c makes
my head hurt.

I'm punting the sum-all-threads work off to a workqueue; now the problem
is that __exit_signal() needs assurance that the worklet isn't
outstanding, or must delay the freeing of the signal struct until the
worklet is done.

We cannot use cancel_delayed_work_sync() because that can sleep, so I
tried the other approach, but it seems my put_signal() below is flawed
in that release_task() can be called by a third party in the case of ptrace.

The remaining option is to make the signal struct itself RCU-freed, but
before I do that, I thought I'd run this code by some folks.

Any comments?

Not-Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/init_task.h |    7 +++-
 include/linux/sched.h     |   10 ++++
 kernel/exit.c             |   33 ++++++++------
 kernel/sched.c            |  107 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched_fair.c       |    1 -
 kernel/sched_rt.c         |    1 -
 kernel/sched_stats.h      |   91 --------------------------------------
 7 files changed, 138 insertions(+), 112 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index fa50bdb..538c89e 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -14,6 +14,8 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
+extern void do_thread_group_cputime_update(struct work_struct *work);
+
 #define INIT_KIOCTX(name, which_mm) \
 {							\
 	.users		= ATOMIC_INIT(1),		\
@@ -53,7 +55,10 @@ extern struct fs_struct init_fs;
 		.stime = cputime_zero,					\
 		.sum_exec_runtime = 0,					\
 		.lock = __SPIN_LOCK_UNLOCKED(sig.cputime.totals.lock),	\
-	}, },								\
+	},								\
+		.work = __DELAYED_WORK_INITIALIZER(sig.cputime.work,	\
+				      do_thread_group_cputime_update),	\
+	}, 								\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4cdec0d..1ced38c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -478,6 +478,7 @@ struct task_cputime {
  */
 struct thread_group_cputime {
 	struct task_cputime totals;
+	struct delayed_work work;
 };
 
 /*
@@ -567,6 +568,7 @@ struct signal_struct {
 	 * in __exit_signal, except for the group leader.
 	 */
 	cputime_t cutime, cstime;
+	u64 cruntime;
 	cputime_t gtime;
 	cputime_t cgtime;
 	unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
@@ -1949,6 +1951,7 @@ extern void exit_thread(void);
 extern void exit_files(struct task_struct *);
 extern void __cleanup_signal(struct signal_struct *);
 extern void __cleanup_sighand(struct sighand_struct *);
+extern void put_signal(struct signal_struct *sig);
 
 extern void exit_itimers(struct signal_struct *);
 extern void flush_itimer_signals(void);
@@ -2210,6 +2213,9 @@ static inline int spin_needbreak(spinlock_t *lock)
  * Thread group CPU time accounting.
  */
 
+extern void __thread_group_cputime_update(struct task_struct *p);
+extern void do_thread_group_cputime_update(struct work_struct *work);
+
 static inline
 void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 {
@@ -2217,6 +2223,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	unsigned long flags;
 
 	spin_lock_irqsave(&totals->lock, flags);
+	__thread_group_cputime_update(tsk);
 	*times = *totals;
 	spin_unlock_irqrestore(&totals->lock, flags);
 }
@@ -2230,6 +2237,9 @@ static inline void thread_group_cputime_init(struct signal_struct *sig)
 	};
 
 	spin_lock_init(&sig->cputime.totals.lock);
+
+	INIT_DELAYED_WORK(&sig->cputime.work,
+			     do_thread_group_cputime_update);
 }
 
 static inline void thread_group_cputime_free(struct signal_struct *sig)
diff --git a/kernel/exit.c b/kernel/exit.c
index a1b18c0..899749e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -81,6 +81,15 @@ static void __unhash_process(struct task_struct *p)
 	list_del_init(&p->sibling);
 }
 
+void put_signal(struct signal_struct *sig)
+{
+	if (atomic_dec_and_test(&sig->count)) {
+		flush_sigqueue(&sig->shared_pending);
+		taskstats_tgid_free(sig);
+		__cleanup_signal(sig);
+	}
+}
+
 /*
  * This function expects the tasklist_lock write-locked.
  */
@@ -96,14 +105,16 @@ static void __exit_signal(struct task_struct *tsk)
 	spin_lock(&sighand->siglock);
 
 	posix_cpu_timers_exit(tsk);
-	if (atomic_dec_and_test(&sig->count))
+	if (!atomic_read(&sig->live)) {
 		posix_cpu_timers_exit_group(tsk);
-	else {
+		sig->curr_target = NULL;
+	} else {
 		/*
 		 * If there is any task waiting for the group exit
 		 * then notify it:
 		 */
-		if (sig->group_exit_task && atomic_read(&sig->count) == sig->notify_count)
+		if (sig->group_exit_task &&
+				atomic_read(&sig->live) == sig->notify_count)
 			wake_up_process(sig->group_exit_task);
 
 		if (tsk == sig->curr_target)
@@ -126,7 +137,6 @@ static void __exit_signal(struct task_struct *tsk)
 		sig->inblock += task_io_get_inblock(tsk);
 		sig->oublock += task_io_get_oublock(tsk);
 		task_io_accounting_add(&sig->ioac, &tsk->ioac);
-		sig = NULL; /* Marker for below. */
 	}
 
 	__unhash_process(tsk);
@@ -142,17 +152,9 @@ static void __exit_signal(struct task_struct *tsk)
 	spin_unlock(&sighand->siglock);
 
 	__cleanup_sighand(sighand);
-	clear_tsk_thread_flag(tsk,TIF_SIGPENDING);
-	if (sig) {
-		flush_sigqueue(&sig->shared_pending);
-		taskstats_tgid_free(sig);
-		/*
-		 * Make sure ->signal can't go away under rq->lock,
-		 * see account_group_exec_runtime().
-		 */
-		task_rq_unlock_wait(tsk);
-		__cleanup_signal(sig);
-	}
+	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
+
+	put_signal(sig);
 }
 
 static void delayed_put_task_struct(struct rcu_head *rhp)
@@ -1329,6 +1331,7 @@ static int wait_task_zombie(struct task_struct *p, int options,
 			cputime_add(psig->cstime,
 			cputime_add(cputime.stime,
 				    sig->cstime));
+		psig->cruntime += cputime.sum_exec_runtime + sig->cruntime;
 		psig->cgtime =
 			cputime_add(psig->cgtime,
 			cputime_add(p->gtime,
diff --git a/kernel/sched.c b/kernel/sched.c
index 96439a4..2c21c12 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4338,7 +4338,6 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 	/* Add user time to process. */
 	p->utime = cputime_add(p->utime, cputime);
 	p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
-	account_group_user_time(p, cputime);
 
 	/* Add user time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
@@ -4367,7 +4366,6 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 	/* Add guest time to process. */
 	p->utime = cputime_add(p->utime, cputime);
 	p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
-	account_group_user_time(p, cputime);
 	p->gtime = cputime_add(p->gtime, cputime);
 
 	/* Add guest time to cpustat. */
@@ -4396,7 +4394,6 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	/* Add system time to process. */
 	p->stime = cputime_add(p->stime, cputime);
 	p->stimescaled = cputime_add(p->stimescaled, cputime_scaled);
-	account_group_system_time(p, cputime);
 
 	/* Add system time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
@@ -4442,6 +4439,108 @@ void account_idle_time(cputime_t cputime)
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 
 /*
+ * Must be called with IRQs disabled, or from hardirq context.
+ */
+void __thread_group_cputime_update(struct task_struct *p)
+{
+	struct rq *rq = this_rq();
+	struct signal_struct *sig;
+
+	/*
+	 * Early boot might not have workqueues running yet.
+	 */
+	if (unlikely(system_state != SYSTEM_RUNNING))
+		return;
+
+	/*
+	 * Nobody cares about group cputimes for idle.
+	 */
+	if (unlikely(rq->idle == p))
+		return;
+
+	/*
+	 * Dead man walking.
+	 */
+	if (unlikely(p->exit_state))
+		return;
+
+	sig = p->signal;
+	/*
+	 * We raced with __exit_signal() and lost.
+	 */
+	if (unlikely(!sig))
+		return;
+
+	/*
+	 * __exit_signal() is ran with IRQs disabled, so when we have
+	 * a signal pointer, it must still be valid.
+	 */
+	if (schedule_delayed_work(&sig->cputime.work,
+				ilog2(atomic_read(&sig->live) / nr_cpu_ids)))
+		atomic_inc(&sig->count);
+}
+
+void do_thread_group_cputime_update(struct work_struct *work)
+{
+	struct task_cputime *totals;
+	struct signal_struct *sig;
+	struct task_struct *t, *n;
+	unsigned long flags;
+
+	cputime_t utime = cputime_zero, stime = cputime_zero;
+	u64 runtime = 0;
+
+	sig = container_of(work, struct signal_struct, cputime.work.work);
+
+	rcu_read_lock();
+	n = t = sig->curr_target;
+	if (n) do {
+		/*
+		 * Dead tasks come later.
+		 */
+		if (unlikely(n->exit_state))
+			goto next;
+
+		/*
+		 * Add time for each task.
+		 */
+		stime = cputime_add(stime, n->stime);
+		utime = cputime_add(utime, n->utime);
+		runtime += n->se.sum_exec_runtime;
+
+next:
+		n = next_thread(n);
+	} while (n != t);
+
+	/*
+	 * Add accumulated time of dead tasks.
+	 *
+	 * There is a risk we missed a task which was dying but hasn't
+	 * been added to the accumulated time yet, too bad, better luck
+	 * next time we try.
+	 */
+	stime = cputime_add(stime, sig->cstime);
+	utime = cputime_add(utime, sig->cutime);
+	runtime += sig->cruntime;
+
+	/*
+	 * Ensure time goes forward.
+	 */
+	totals = &sig->cputime.totals;
+	spin_lock_irqsave(&totals->lock, flags);
+	if (cputime_gt(stime, totals->stime))
+		totals->stime = stime;
+	if (cputime_gt(utime, totals->utime))
+		totals->utime = utime;
+	if (runtime > totals->sum_exec_runtime)
+		totals->sum_exec_runtime = runtime;
+	spin_unlock_irqrestore(&totals->lock, flags);
+	rcu_read_unlock();
+
+	put_signal(sig);
+}
+
+/*
  * Account a single tick of cpu time.
  * @p: the process that the cpu time gets accounted to
  * @user_tick: indicates if the tick is a user or a system tick
@@ -4459,6 +4558,8 @@ void account_process_tick(struct task_struct *p, int user_tick)
 				    one_jiffy_scaled);
 	else
 		account_idle_time(one_jiffy);
+
+	__thread_group_cputime_update(p);
 }
 
 /*
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index bc1563e..5fe3b3d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -499,7 +499,6 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		struct task_struct *curtask = task_of(curr);
 
 		cpuacct_charge(curtask, delta_exec);
-		account_group_exec_runtime(curtask, delta_exec);
 	}
 }
 
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 299d012..607951e 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -581,7 +581,6 @@ static void update_curr_rt(struct rq *rq)
 	schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
-	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq->clock;
 	cpuacct_charge(curr, delta_exec);
diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h
index 8ab0cef..9831d8c 100644
--- a/kernel/sched_stats.h
+++ b/kernel/sched_stats.h
@@ -276,94 +276,3 @@ sched_info_switch(struct task_struct *prev, struct task_struct *next)
 #define sched_info_dequeued(t)			do { } while (0)
 #define sched_info_switch(t, next)		do { } while (0)
 #endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */
-
-/*
- * The following are functions that support scheduler-internal time accounting.
- * These functions are generally called at the timer tick.  None of this depends
- * on CONFIG_SCHEDSTATS.
- */
-
-/**
- * account_group_user_time - Maintain utime for a thread group.
- *
- * @tsk:	Pointer to task structure.
- * @cputime:	Time value by which to increment the utime field of the
- *		thread_group_cputime structure.
- *
- * If thread group time is being maintained, get the structure for the
- * running CPU and update the utime field there.
- */
-static inline void account_group_user_time(struct task_struct *tsk,
-					   cputime_t cputime)
-{
-	struct task_cputime *times;
-	struct signal_struct *sig;
-
-	/* tsk == current, ensure it is safe to use ->signal */
-	if (unlikely(tsk->exit_state))
-		return;
-
-	sig = tsk->signal;
-	times = &sig->cputime.totals;
-
-	spin_lock(&times->lock);
-	times->utime = cputime_add(times->utime, cputime);
-	spin_unlock(&times->lock);
-}
-
-/**
- * account_group_system_time - Maintain stime for a thread group.
- *
- * @tsk:	Pointer to task structure.
- * @cputime:	Time value by which to increment the stime field of the
- *		thread_group_cputime structure.
- *
- * If thread group time is being maintained, get the structure for the
- * running CPU and update the stime field there.
- */
-static inline void account_group_system_time(struct task_struct *tsk,
-					     cputime_t cputime)
-{
-	struct task_cputime *times;
-	struct signal_struct *sig;
-
-	/* tsk == current, ensure it is safe to use ->signal */
-	if (unlikely(tsk->exit_state))
-		return;
-
-	sig = tsk->signal;
-	times = &sig->cputime.totals;
-
-	spin_lock(&times->lock);
-	times->stime = cputime_add(times->stime, cputime);
-	spin_unlock(&times->lock);
-}
-
-/**
- * account_group_exec_runtime - Maintain exec runtime for a thread group.
- *
- * @tsk:	Pointer to task structure.
- * @ns:		Time value by which to increment the sum_exec_runtime field
- *		of the thread_group_cputime structure.
- *
- * If thread group time is being maintained, get the structure for the
- * running CPU and update the sum_exec_runtime field there.
- */
-static inline void account_group_exec_runtime(struct task_struct *tsk,
-					      unsigned long long ns)
-{
-	struct task_cputime *times;
-	struct signal_struct *sig;
-
-	sig = tsk->signal;
-	/* see __exit_signal()->task_rq_unlock_wait() */
-	barrier();
-	if (unlikely(!sig))
-		return;
-
-	times = &sig->cputime.totals;
-
-	spin_lock(&times->lock);
-	times->sum_exec_runtime += ns;
-	spin_unlock(&times->lock);
-}




* Re: [RFC] process wide itimer cruft
  2009-02-03 11:56             ` [RFC] process wide itimer cruft Peter Zijlstra
@ 2009-02-03 17:23               ` Oleg Nesterov
  2009-02-03 17:51                 ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2009-02-03 17:23 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Zhang, Yanmin, Lin Ming, linux-kernel, Ingo Molnar

On 02/03, Peter Zijlstra wrote:
>
> On Mon, 2009-02-02 at 09:53 +0100, Peter Zijlstra wrote:
>
> I'm punting the sum-all-threads work off to a workqueue,

I don't really understand how this works, but I didn't try to read
this part carefully. For example, when we call thread_group_cputime()
we don't really get the "group" statistics immediately? But this looks
very interesting anyway.

Unfortunately, I think we need some changes with ->signal first.

> The remaining option is to make signal struct itself rcu freed, but
> before I do that, I thought I'd run this code by some folks.

I think we should follow Ingo's suggestion: we should make ->signal
refcountable, we should never clear task->signal, and it should be freed
by __put_task_struct()'s path.

In fact I was going to make these patches last week; I will try to
do it this week. But we need another counter for that; we can't use
signal->count. And we should fix some users which check tsk->signal != NULL
to ensure the task was not released; this is easy.

This bloats signal_struct a bit, but OTOH with this change we can
move some fields (for example, ->group_leader) to signal_struct.
And we can do many simplifications. Just as an example, __sched_setscheduler()
takes ->siglock just to read signal->rlim[].

> @@ -96,14 +105,16 @@ static void __exit_signal(struct task_struct *tsk)
>  	spin_lock(&sighand->siglock);
>
>  	posix_cpu_timers_exit(tsk);
> -	if (atomic_dec_and_test(&sig->count))
> +	if (!atomic_read(&sig->live)) {
>  		posix_cpu_timers_exit_group(tsk);

This doesn't look exactly right, but I don't see any "real" problems
with this change.

We can have a lot of threads which haven't even passed exit_notify(),
and another process can attach a cpu timer to us once we drop the
locks. OK, no real problems afaics, because each sub-thread will
in turn do posix_cpu_timers_exit_group() later.

But this looks a bit too early. It is better to continue accounting
these threads; they can consume a lot of cpu. Anyway, this is a very
minor issue.

> -	else {
> +		sig->curr_target = NULL;

complete_signal() can crash if it hits ->curr_target = NULL, and
we are still "visible" to signals even if sig->live == 0.

> +	} else {
>  		/*
>  		 * If there is any task waiting for the group exit
>  		 * then notify it:
>  		 */
> -		if (sig->group_exit_task && atomic_read(&sig->count) == sig->notify_count)
> +		if (sig->group_exit_task &&
> +				atomic_read(&sig->live) == sig->notify_count)

This looks wrong. de_thread() can hang forever; put_signal() doesn't
wake up ->group_exit_task.

I think we really need another counter, at least for now.

Oleg.



* Re: [RFC] process wide itimer cruft
  2009-02-03 17:23               ` Oleg Nesterov
@ 2009-02-03 17:51                 ` Peter Zijlstra
  2009-02-03 18:22                   ` Oleg Nesterov
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2009-02-03 17:51 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Zhang, Yanmin, Lin Ming, linux-kernel, Ingo Molnar

On Tue, 2009-02-03 at 18:23 +0100, Oleg Nesterov wrote:
> On 02/03, Peter Zijlstra wrote:
> >
> > On Mon, 2009-02-02 at 09:53 +0100, Peter Zijlstra wrote:
> >
> > I'm punting the sum-all-threads work off to a workqueue,
> 
> I don't really understand how this works, but I didn't try to read
> this part carefully. For example, when we call thread_group_cputime()
> we don't really get the "group" statistics immediately? But this looks
> very interesting anyway.

Because our thread group can be extremely large and take longer than a
jiffy to sum up -- this is the situation that started all this itimer
tinkering.

However, Ingo spoke to me on IRC and suggested another approach, which
I'm currently working on -- hopefully done tomorrow.

> > The remaining option is to make signal struct itself rcu freed, but
> > before I do that, I thought I'd run this code by some folks.
> 
> I think we should follow the Ingo's suggestion: we should make ->signal
> refcountable, we should never clear task->signal, it should be freed
> by __put_task_struct()'s path.

Right, that'd make a lot of sense.

> In fact I was going to make this patches the previous week, will try
> to do this week. But we need another counter for that, we can't use
> signal->count.

I'm not quite sure I understand all that code quite yet, although I've
been staring at it for the past day or so.

->live  -- the number of associated tasks,
->count -- not quite a refcount?

I can see adding a 3rd counter for reference counting could solve
things, but can we start by clarifying the exact semantics of these two?
If only for future readers..

> This blows signal_struct a bit, but otoh with this change we can
> move some fields (for example, ->group_leader) to signal_struct.
> And we can do many simplifications. Just for example, __sched_setscheduler()
> takes ->siglock just to read signal->rlim[].

Could you shed a bit of light on the distinction between sighand and
signal?

> > @@ -96,14 +105,16 @@ static void __exit_signal(struct task_struct *tsk)
> >  	spin_lock(&sighand->siglock);
> >
> >  	posix_cpu_timers_exit(tsk);
> > -	if (atomic_dec_and_test(&sig->count))
> > +	if (!atomic_read(&sig->live)) {
> >  		posix_cpu_timers_exit_group(tsk);
> 
> This doesn't look exactly right, but I don't see the "real" problems
> with this change.
> 
> We can have a lot of threads which didn't even pass exit_notify(),
> another process can attach the cpu timer to us once we drop the
> locks. OK, no real problems afaics, because each sub-thread will
> in turn do posix_cpu_timers_exit_group() later.

Yeah, you can get multiple invocations of the
posix_cpu_timers_exit_group() stuff, and less summing of dead tasks;
the latter might be an issue.

> But this looks a bit too early. It is better to continue to account
> these threads, they can consume a lot of cpu. Anyway, this is a very
> minor issue.

Agreed.

> > -	else {
> > +		sig->curr_target = NULL;
> 
> complete_signal() can crash if it hits ->curr_target == NULL, and
> we are still "visible" to signals even if sig->live == 0.

Ooh, missed that. Good catch indeed.

> > +	} else {
> >  		/*
> >  		 * If there is any task waiting for the group exit
> >  		 * then notify it:
> >  		 */
> > -		if (sig->group_exit_task && atomic_read(&sig->count) == sig->notify_count)
> > +		if (sig->group_exit_task &&
> > +				atomic_read(&sig->live) == sig->notify_count)
> 
> This looks wrong. de_thread() can hang forever, put_signal() doesn't
> wake up ->group_exit_task.
> 
> I think we really need another counter, at least for now.

Don't rush on my account, Ingo's proposed solution doesn't need this. 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] process wide itimer cruft
  2009-02-03 17:51                 ` Peter Zijlstra
@ 2009-02-03 18:22                   ` Oleg Nesterov
  0 siblings, 0 replies; 13+ messages in thread
From: Oleg Nesterov @ 2009-02-03 18:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Zhang, Yanmin, Lin Ming, linux-kernel, Ingo Molnar

On 02/03, Peter Zijlstra wrote:
>
> ->live  -- the number of associated tasks,
> ->count -- not quite a refcount?

No, ->count is not a refcount.

Basically, ->count means how many threads didn't pass release_task() yet.
Well, actually __exit_signal(), but this doesn't matter. The thread becomes
"really dead" after that. Until then, it is still visible to, say,
find_task_by_vpid, signals, etc.

But if we have a zombie group leader, it may stay zombie "forever", and
->count doesn't go to zero. So we also have signal->live, when it is zero
we know that all sub-threads at least entered do_exit(). For example,
we can safely do exit_itimers() when ->live == 0, no other thread can
do sys_timer_create() (or any syscall of course).

> Could you shed a bit of light on the distinction between sighand and
> signal?

->signal is protected by ->sighand->siglock, and they are both cleared
"atomically" under ->siglock in __exit_signal(). I guess, the only
reason for 2 structures is CLONE_SIGHAND which can be used without
CLONE_THREAD.

Now, let's look at arch/ia64/kernel/ptrace.c:ptrace_attach_sync_user_rbs()

	read_lock(&tasklist_lock);
	if (child->signal) {
		... this task is alive, we can proceed ...

This is correct, but if we want to make ->signal refcountable, we
should turn the above check into

	if (child->sighand) {

This is the same, but allows us to never clear task->signal.

I'll try to send the patch which does this today; we should also
change posix-cpu-timers.c and that's all, if my grepping was right.

> > I think we really need another counter, at least for now.
>
> Don't rush on my account, Ingo's proposed solution doesn't need this.

OK.

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2009-02-03 18:25 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-01  7:30 hackbench [pthread mode] regression with 2.6.29-rc3 Zhang, Yanmin
     [not found] ` <d3f22a0902010026q1db36381j36cb1c9803d48431@mail.gmail.com>
2009-02-01  8:29   ` Lin Ming
2009-02-01  9:17     ` Peter Zijlstra
2009-02-01  9:57       ` Peter Zijlstra
2009-02-01 10:04         ` Ingo Molnar
2009-02-02  1:12         ` Zhang, Yanmin
2009-02-02  8:53           ` Peter Zijlstra
2009-02-02 17:49             ` Bryon Roche
2009-02-02 20:50               ` Peter Zijlstra
2009-02-03 11:56             ` [RFC] process wide itimer cruft Peter Zijlstra
2009-02-03 17:23               ` Oleg Nesterov
2009-02-03 17:51                 ` Peter Zijlstra
2009-02-03 18:22                   ` Oleg Nesterov
