* [RFC,PATCH 1/2] cputimers/proc: do_task_stat()->task_times() can race with getrusage()
@ 2010-03-24 20:45 Oleg Nesterov
2010-03-26 3:53 ` Balbir Singh
0 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-24 20:45 UTC
To: Andrew Morton, Americo Wang, Balbir Singh, Eric W. Biederman,
Hidetoshi Seto, Ingo Molnar, Peter Zijlstra, Roland McGrath,
Spencer Candland, Stanislaw Gruszka
Cc: linux-kernel
do_task_stat()->task_times() can race with getrusage(); both can
try to update task->prev_Xtime at the same time.
Remove this bit of d180c5bc "sched: Introduce task_times() to replace
task_{u,s}time()".
See also the next patch.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/proc/array.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- 34-rc1/fs/proc/array.c~PROC_4_DTS_TASK_TIMES_IS_RACY 2010-03-24 19:53:23.000000000 +0100
+++ 34-rc1/fs/proc/array.c 2010-03-24 19:57:37.000000000 +0100
@@ -449,7 +449,8 @@ static int do_task_stat(struct seq_file
if (!whole) {
min_flt = task->min_flt;
maj_flt = task->maj_flt;
- task_times(task, &utime, &stime);
+ utime = task->utime;
+ stime = task->stime;
gtime = task->gtime;
}
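The race the patch removes can be made concrete with a small user-space model. This is an illustrative sketch, not kernel code: the names (prev_utime, rmw_load(), replay_lost_update()) are hypothetical, and the two racing CPUs are replayed as explicit interleaved steps of the unsynchronized "prev_Xtime = max(prev_Xtime, sample)" update:

```c
#include <assert.h>

typedef unsigned long long cputime_t;

static cputime_t prev_utime;            /* shared, updated without a lock */

static cputime_t rmw_load(void) { return prev_utime; }

static void rmw_store(cputime_t loaded, cputime_t sample)
{
    /* prev_utime = max(prev_utime, sample), but based on a stale load */
    if (sample > loaded)
        prev_utime = sample;
}

/* Replay one adversarial interleaving: both updaters load 0, the one
 * with the newer sample (200) stores first, then the stale one (100)
 * overwrites it. A later reader sees the reported time go backwards. */
static cputime_t replay_lost_update(void)
{
    prev_utime = 0;
    cputime_t a = rmw_load();           /* do_task_stat() path loads 0 */
    cputime_t b = rmw_load();           /* getrusage() path loads 0    */
    rmw_store(b, 200);                  /* getrusage() publishes 200   */
    rmw_store(a, 100);                  /* stale store overwrites it   */
    return prev_utime;                  /* 100: monotonicity broken    */
}
```

With locking (or with only one updater, as this patch arranges for do_task_stat()), the 200 could never be overwritten by the stale 100.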
* Re: [RFC,PATCH 1/2] cputimers/proc: do_task_stat()->task_times() can race with getrusage()
2010-03-24 20:45 [RFC,PATCH 1/2] cputimers/proc: do_task_stat()->task_times() can race with getrusage() Oleg Nesterov
@ 2010-03-26 3:53 ` Balbir Singh
2010-03-26 7:37 ` Stanislaw Gruszka
2010-03-26 21:49 ` Oleg Nesterov
0 siblings, 2 replies; 15+ messages in thread
From: Balbir Singh @ 2010-03-26 3:53 UTC
To: Oleg Nesterov
Cc: Andrew Morton, Americo Wang, Eric W. Biederman, Hidetoshi Seto,
Ingo Molnar, Peter Zijlstra, Roland McGrath, Spencer Candland,
Stanislaw Gruszka, linux-kernel
* Oleg Nesterov <oleg@redhat.com> [2010-03-24 21:45:50]:
> do_task_stat()->task_times() can race with getrusage(), they both can
> try to update task->prev_Xtime at the same time.
>
> Remove this bit of d180c5bc "sched: Introduce task_times() to replace
> task_{u,s}time()".
One of the reasons for adding this accuracy was to avoid sampling-based
noise and errors that occur with utime and stime.
As long as there is no preemption during the assignment, I think we
should be OK.
I see two options
1. Disable preemption around assignment
2. Remove task_times() from getrusage()
--
Three Cheers,
Balbir
* Re: [RFC,PATCH 1/2] cputimers/proc: do_task_stat()->task_times() can race with getrusage()
2010-03-26 3:53 ` Balbir Singh
@ 2010-03-26 7:37 ` Stanislaw Gruszka
2010-03-26 16:12 ` Stanislaw Gruszka
2010-03-26 21:49 ` Oleg Nesterov
1 sibling, 1 reply; 15+ messages in thread
From: Stanislaw Gruszka @ 2010-03-26 7:37 UTC
To: Balbir Singh
Cc: Oleg Nesterov, Andrew Morton, Americo Wang, Eric W. Biederman,
Hidetoshi Seto, Ingo Molnar, Peter Zijlstra, Roland McGrath,
Spencer Candland, linux-kernel
On Fri, Mar 26, 2010 at 09:23:44AM +0530, Balbir Singh wrote:
> * Oleg Nesterov <oleg@redhat.com> [2010-03-24 21:45:50]:
>
> > do_task_stat()->task_times() can race with getrusage(), they both can
> > try to update task->prev_Xtime at the same time.
> >
> > Remove this bit of d180c5bc "sched: Introduce task_times() to replace
> > task_{u,s}time()".
>
> One of the reasons for adding this accuracy was to avoid sampling
> based noise and errors that occur with utime and stime.
>
> As long as there is no preemption during the assignment, I think we
> should be OK.
>
> I see two options
>
> 1. Disable preemption around assignment
That one sounds very good.
> 2. Remove task_times() from getrusage()
We want more accurate and, more importantly, monotonic values
in getrusage().
Stanislaw
* Re: [RFC,PATCH 1/2] cputimers/proc: do_task_stat()->task_times() can race with getrusage()
2010-03-26 7:37 ` Stanislaw Gruszka
@ 2010-03-26 16:12 ` Stanislaw Gruszka
0 siblings, 0 replies; 15+ messages in thread
From: Stanislaw Gruszka @ 2010-03-26 16:12 UTC
To: Balbir Singh
Cc: Oleg Nesterov, Andrew Morton, Americo Wang, Eric W. Biederman,
Hidetoshi Seto, Ingo Molnar, Peter Zijlstra, Roland McGrath,
Spencer Candland, linux-kernel
On Fri, Mar 26, 2010 at 08:37:23AM +0100, Stanislaw Gruszka wrote:
> On Fri, Mar 26, 2010 at 09:23:44AM +0530, Balbir Singh wrote:
> > * Oleg Nesterov <oleg@redhat.com> [2010-03-24 21:45:50]:
> >
> > > do_task_stat()->task_times() can race with getrusage(), they both can
> > > try to update task->prev_Xtime at the same time.
> > >
> > > Remove this bit of d180c5bc "sched: Introduce task_times() to replace
> > > task_{u,s}time()".
> >
> > One of the reasons for adding this accuracy was to avoid sampling
> > based noise and errors that occur with utime and stime.
> >
> > As long as there is no preemption during the assignment, I think we
> > should be OK.
> >
> > I see two options
> >
> > 1. Disable preemption around assignment
> That one sounds very good.
No. Preemption will not help here: the two updaters can run on
different CPUs, so preempt_disable() does not serialize them.
Stanislaw
* Re: [RFC,PATCH 1/2] cputimers/proc: do_task_stat()->task_times() can race with getrusage()
2010-03-26 3:53 ` Balbir Singh
2010-03-26 7:37 ` Stanislaw Gruszka
@ 2010-03-26 21:49 ` Oleg Nesterov
2010-03-29 11:17 ` Stanislaw Gruszka
1 sibling, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-26 21:49 UTC
To: Balbir Singh
Cc: Andrew Morton, Americo Wang, Eric W. Biederman, Hidetoshi Seto,
Ingo Molnar, Peter Zijlstra, Roland McGrath, Spencer Candland,
Stanislaw Gruszka, linux-kernel
On 03/26, Balbir Singh wrote:
>
> * Oleg Nesterov <oleg@redhat.com> [2010-03-24 21:45:50]:
>
> > do_task_stat()->task_times() can race with getrusage(), they both can
> > try to update task->prev_Xtime at the same time.
> >
> > Remove this bit of d180c5bc "sched: Introduce task_times() to replace
> > task_{u,s}time()".
>
> One of the reasons for adding this accuracy was to avoid sampling
> based noise and errors that occur with utime and stime.
>
> As long as there is no preemption during the assignment, I think we
> should be OK.
I don't think preempt_disable() can help. Probably we can use task_lock().
As for do_task_stat()->thread_group_times(), I think we can make it
rcu-safe without breaking /bin/top.
1. add spin_lock_irqsave(&sig->cputimer.lock) around
sig->prev_Xtime = max(...)
2. Add a couple of barriers into thread_group_cputime()
and __exit_signal() so that without ->siglock we can
never overestimate utime/stime if we race with exit.
If we underestimate these values, this should be fine:
- the error can't be "systematic", the next read from
/proc/pid/stat will see the updated values
- the prev_Xtime logic in thread_group_times() ensures
the reported time can never go back.
IOW: at worst, cat /proc/pid/stat can miss the time
which the exited thread spent on CPU after the previous
read of /proc/pid/stat. This looks absolutely harmless,
the next read will see this time.
Probably we can even detect this case if we look at
sig->nr_threads and retry.
I'll try to make patches unless someone has a better idea.
I just can't accept the fact that we are doing while_each_thread()
under ->siglock here ;)
Oleg.
* Re: [RFC,PATCH 1/2] cputimers/proc: do_task_stat()->task_times() can race with getrusage()
2010-03-26 21:49 ` Oleg Nesterov
@ 2010-03-29 11:17 ` Stanislaw Gruszka
2010-03-29 12:54 ` Oleg Nesterov
0 siblings, 1 reply; 15+ messages in thread
From: Stanislaw Gruszka @ 2010-03-29 11:17 UTC
To: Oleg Nesterov
Cc: Balbir Singh, Andrew Morton, Americo Wang, Eric W. Biederman,
Hidetoshi Seto, Ingo Molnar, Peter Zijlstra, Roland McGrath,
Spencer Candland, linux-kernel
On Fri, Mar 26, 2010 at 10:49:06PM +0100, Oleg Nesterov wrote:
> As for do_task_stat()->thread_group_times(), I think we can make it
> rc-safe without breaking /bin/top.
>
> 1. add spin_lock_irqsave(&sig->cputimer.lock) around
> sig->prev_Xtime = max(...)
The easiest way to avoid these races is to move all calls to task_times()
and thread_group_times() inside ->siglock, but that's a bit crappy.
There is also another possible race here. On 32-bit machines
reading/writing sum_exec_runtime is not atomic; IIRC ->siglock
protects against that as well.
> 2. Add a couple of barriers into thread_group_cputime()
> and __exit_signal() so that without ->siglock we can
> never overestimate utime/stime if we race with exit.
>
> If we underestimate these values, this should be fine:
>
> - the error can't be "systematic", the next read from
> /prod/pid/stat will see the updated values
>
> - the prev_Xtime logic in thread_group_times() ensures
> the reported time can never go back.
>
> IOW: at worse, cat /proc/pid/stat can miss the time
> which the exited thread spent on CPU after the previous
> read of /proc/pid/stat. This looks absolutely harmless,
> the next read will see this time.
>
> Probably we can even detect this case if we look at
> sig->nr_threads and retry.
Races with __exit_signal() can lead to counting Xtime values twice:
first in tsk->Xtime, and again, after the task exits, in sig->Xtime.
Now that we have sig->prev_Xtime this can no longer break monotonicity,
but accounting times twice can still be problematic. For example,
assume we have many threads and the one which consumes 90% of the
process's cpu-time exits and is accounted twice. Then, for a long
period, sig->prev_Xtime will show a value that is much bigger than the
estimated value should be.
> I'll try to make patches unless someone has a better idea.
>
> I just can't accept the fact that we are doing while_each_thread()
> under ->siglock here ;)
The problem is not only in do_task_stat(). We have a couple of other
places where we iterate over all threads with ->siglock taken. Maybe we
can somehow redesign things to avoid that. Maybe incrementing
tsk->signal->Xtime together with tsk->Xtime is not such a bad idea.
I'm going to think about that.
Stanislaw
* Re: [RFC,PATCH 1/2] cputimers/proc: do_task_stat()->task_times() can race with getrusage()
2010-03-29 11:17 ` Stanislaw Gruszka
@ 2010-03-29 12:54 ` Oleg Nesterov
2010-03-29 18:12 ` [PATCH -mm 0/4] cputimers/proc: do_task_stat: don't walk through the thread list under ->siglock Oleg Nesterov
0 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-29 12:54 UTC
To: Stanislaw Gruszka
Cc: Balbir Singh, Andrew Morton, Americo Wang, Eric W. Biederman,
Hidetoshi Seto, Ingo Molnar, Peter Zijlstra, Roland McGrath,
Spencer Candland, linux-kernel
On 03/29, Stanislaw Gruszka wrote:
>
> On Fri, Mar 26, 2010 at 10:49:06PM +0100, Oleg Nesterov wrote:
> > As for do_task_stat()->thread_group_times(), I think we can make it
> > rc-safe without breaking /bin/top.
> >
> > 1. add spin_lock_irqsave(&sig->cputimer.lock) around
> > sig->prev_Xtime = max(...)
> The easiest way to avoid that races is move all calls to task_times()
> and thread_group_times() inside ->siglock, but that's a bit crappy.
Yes, and we should avoid overloading ->siglock if possible.
> There is also another impossible race here. On 32-bit machines
> reading/writing sum_exec_runtime is not atomic,
Sure,
> IIRC ->siglock
> protect about that as well.
I don't think so. update_curr() etc. update t->se.sum_exec_runtime without
->siglock, so it can't help to read u64 values atomically.
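The torn-read hazard under discussion can be sketched as a toy user-space model. This is illustrative only, not kernel code: the struct and function names are hypothetical, and the interleaving is replayed deterministically in a single thread:

```c
#include <stdint.h>

/* Toy model of a torn 64-bit read on a 32-bit machine: the counter
 * (think sum_exec_runtime) is stored as two 32-bit halves, and a
 * reader can sample between the writer's two half-stores. */
struct split64 { uint32_t lo, hi; };

static uint64_t join(const struct split64 *v)
{
    return ((uint64_t)v->hi << 32) | v->lo;
}

/* The writer advances the counter from 0x00000000FFFFFFFF to
 * 0x0000000100000000; the low half lands first, so a reader sampling
 * in between sees 0 -- a value the counter never actually held. */
static uint64_t torn_read_demo(void)
{
    struct split64 runtime = { .lo = 0xFFFFFFFFu, .hi = 0 };

    runtime.lo = 0;                       /* writer: first half-store  */
    uint64_t sampled = join(&runtime);    /* concurrent reader samples */
    runtime.hi = 1;                       /* writer: second half-store */
    return sampled;
}
```

On a real 32-bit SMP machine the same effect needs two CPUs; a lock (or a sequence counter) around the 64-bit access is what rules it out.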
> > 2. Add a couple of barriers into thread_group_cputime()
> > and __exit_signal() so that without ->siglock we can
> > never overestimate utime/stime if we race with exit.
> >
> > If we underestimate these values, this should be fine:
> >
> > - the error can't be "systematic", the next read from
> > /prod/pid/stat will see the updated values
> >
> > - the prev_Xtime logic in thread_group_times() ensures
> > the reported time can never go back.
> >
> > IOW: at worse, cat /proc/pid/stat can miss the time
> > which the exited thread spent on CPU after the previous
> > read of /proc/pid/stat. This looks absolutely harmless,
> > the next read will see this time.
> >
> > Probably we can even detect this case if we look at
> > sig->nr_threads and retry.
> Races with __exit_signal() can lead to count Xtime values twice,
> first: in tsk->Xtime, second: after task exits, in sig->Xtime.
Please see above. This is what should be avoided.
> > I'll try to make patches unless someone has a better idea.
> >
> > I just can't accept the fact that we are doing while_each_thread()
> > under ->siglock here ;)
> Problem is not only in do_task_stat(). We have couple other places
> where we iterate over all threads with ->siglock taken.
Yes sure. I dislike the do_task_stat() case because we always do this,
even if this info is not needed, say, for /bin/ps.
Oleg.
* [PATCH -mm 0/4] cputimers/proc: do_task_stat: don't walk through the thread list under ->siglock
2010-03-29 12:54 ` Oleg Nesterov
@ 2010-03-29 18:12 ` Oleg Nesterov
2010-03-29 18:12 ` [PATCH -mm 1/4] cputimers: thread_group_cputime: cleanup rcu/signal stuff Oleg Nesterov
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-29 18:12 UTC
To: Andrew Morton
Cc: Balbir Singh, Americo Wang, Eric W. Biederman, Hidetoshi Seto,
Ingo Molnar, Peter Zijlstra, Roland McGrath, Spencer Candland,
Stanislaw Gruszka, linux-kernel
On 03/29, Oleg Nesterov wrote:
>
> Yes sure. I dislike the do_task_stat() case because we always do this,
> even if this info is not needed, say, for /bin/ps.
Note also that nobody else in fs/proc needs ->siglock, except
do_io_accounting(), but in that case user-space explicitly asks for
this info.
OK, this is V2. Still RFC, although I think 1/4 makes sense in any case.
Please comment. Again, I am more or less sure these changes are "correct",
but I don't know what /bin/top will think ;)
I don't really like the fact that thread_group_times() takes cputimer.lock,
but imho lock_task_sighand() in do_task_stat() is much worse.
Oleg.
* [PATCH -mm 1/4] cputimers: thread_group_cputime: cleanup rcu/signal stuff
2010-03-29 18:12 ` [PATCH -mm 0/4] cputimers/proc: do_task_stat: don't walk through the thread list under ->siglock Oleg Nesterov
@ 2010-03-29 18:12 ` Oleg Nesterov
2010-03-29 18:13 ` [PATCH -mm 2/4] cputimers: make sure thread_group_cputime() can't count the same thread twice lockless Oleg Nesterov
` (2 subsequent siblings)
3 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-29 18:12 UTC
To: Andrew Morton
Cc: Balbir Singh, Americo Wang, Eric W. Biederman, Hidetoshi Seto,
Ingo Molnar, Peter Zijlstra, Roland McGrath, Spencer Candland,
Stanislaw Gruszka, linux-kernel
- With the recent changes tsk->signal is stable, so we can read it
first and avoid the initialization from INIT_CPUTIME.
- Even if tsk->signal is always valid, we still have to check that it
is safe to use next_thread() under rcu_read_lock(). Currently
the code checks ->sighand != NULL; change it to use pid_alive(),
which is commonly used to ensure the task wasn't unhashed before
we took rcu_read_lock().
- Change the main loop to use the while_each_thread() helper.
This change itself doesn't add functional changes; currently
thread_group_cputime() is always called under tasklist/siglock or when
we know that this thread group is already dead (wait, coredump).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/posix-cpu-timers.c | 21 +++++++--------------
1 file changed, 7 insertions(+), 14 deletions(-)
--- 34-rc1/kernel/posix-cpu-timers.c~cpuacct_1_thread_group_cputime_cleanup_rcu 2010-03-29 18:08:14.000000000 +0200
+++ 34-rc1/kernel/posix-cpu-timers.c 2010-03-29 18:09:15.000000000 +0200
@@ -233,31 +233,24 @@ static int cpu_clock_sample(const clocki
void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
{
- struct sighand_struct *sighand;
- struct signal_struct *sig;
+ struct signal_struct *sig = tsk->signal;
struct task_struct *t;
- *times = INIT_CPUTIME;
+ times->utime = sig->utime;
+ times->stime = sig->stime;
+ times->sum_exec_runtime = sig->sum_sched_runtime;
rcu_read_lock();
- sighand = rcu_dereference(tsk->sighand);
- if (!sighand)
+ /* make sure we can trust tsk->thread_group list */
+ if (!likely(pid_alive(tsk)))
goto out;
- sig = tsk->signal;
-
t = tsk;
do {
times->utime = cputime_add(times->utime, t->utime);
times->stime = cputime_add(times->stime, t->stime);
times->sum_exec_runtime += t->se.sum_exec_runtime;
-
- t = next_thread(t);
- } while (t != tsk);
-
- times->utime = cputime_add(times->utime, sig->utime);
- times->stime = cputime_add(times->stime, sig->stime);
- times->sum_exec_runtime += sig->sum_sched_runtime;
+ } while_each_thread(tsk, t);
out:
rcu_read_unlock();
}
* [PATCH -mm 2/4] cputimers: make sure thread_group_cputime() can't count the same thread twice lockless
2010-03-29 18:12 ` [PATCH -mm 0/4] cputimers/proc: do_task_stat: don't walk through the thread list under ->siglock Oleg Nesterov
2010-03-29 18:12 ` [PATCH -mm 1/4] cputimers: thread_group_cputime: cleanup rcu/signal stuff Oleg Nesterov
@ 2010-03-29 18:13 ` Oleg Nesterov
2010-03-30 11:01 ` Stanislaw Gruszka
2010-03-29 18:13 ` [PATCH -mm 3/4] cputimers: thread_group_times: make it rcu-safe Oleg Nesterov
2010-03-29 18:14 ` [PATCH -mm 1/4] cputimers: do_task_stat: avoid ->siglock for while_each_thread() Oleg Nesterov
3 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-29 18:13 UTC
To: Andrew Morton
Cc: Balbir Singh, Americo Wang, Eric W. Biederman, Hidetoshi Seto,
Ingo Molnar, Peter Zijlstra, Roland McGrath, Spencer Candland,
Stanislaw Gruszka, linux-kernel
- change __exit_signal() to do __unhash_process() before we accumulate
the counters in ->signal
- add a couple of barriers into thread_group_cputime() and __exit_signal()
to make sure thread_group_cputime() can never account the same thread
twice if it races with exit.
If any thread T was already accounted in ->signal, next_thread() or
pid_alive() must see the result of __unhash_process(T).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/exit.c | 14 +++++++++-----
kernel/posix-cpu-timers.c | 6 ++++++
2 files changed, 15 insertions(+), 5 deletions(-)
--- 34-rc1/kernel/exit.c~cpuacct_2_thread_group_cputime_rcu_safe 2010-03-29 18:03:17.000000000 +0200
+++ 34-rc1/kernel/exit.c 2010-03-29 18:29:35.000000000 +0200
@@ -88,6 +88,8 @@ static void __exit_signal(struct task_st
rcu_read_lock_held() ||
lockdep_is_held(&tasklist_lock));
spin_lock(&sighand->siglock);
+ __unhash_process(tsk, group_dead);
+ sig->nr_threads--;
posix_cpu_timers_exit(tsk);
if (group_dead) {
@@ -111,9 +113,14 @@ static void __exit_signal(struct task_st
* The group leader stays around as a zombie as long
* as there are other threads. When it gets reaped,
* the exit.c code will add its counts into these totals.
- * We won't ever get here for the group leader, since it
- * will have been the last reference on the signal_struct.
+ *
+ * Make sure that this thread can't be accounted twice
+ * by thread_group_cputime() under rcu. If it sees the
+ * result of the accounting below, it must see the result
+ * of __unhash_process()->__list_del(thread_group) above.
*/
+ smp_wmb();
+
sig->utime = cputime_add(sig->utime, tsk->utime);
sig->stime = cputime_add(sig->stime, tsk->stime);
sig->gtime = cputime_add(sig->gtime, tsk->gtime);
@@ -127,9 +134,6 @@ static void __exit_signal(struct task_st
sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
}
- sig->nr_threads--;
- __unhash_process(tsk, group_dead);
-
/*
* Do this under ->siglock, we can race with another thread
* doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals.
--- 34-rc1/kernel/posix-cpu-timers.c~cpuacct_2_thread_group_cputime_rcu_safe 2010-03-29 18:09:15.000000000 +0200
+++ 34-rc1/kernel/posix-cpu-timers.c 2010-03-29 18:29:35.000000000 +0200
@@ -239,6 +239,12 @@ void thread_group_cputime(struct task_st
times->utime = sig->utime;
times->stime = sig->stime;
times->sum_exec_runtime = sig->sum_sched_runtime;
+ /*
+ * This pairs with wmb() in __exit_signal(). If any thread was
+ * already accounted in tsk->signal, while_each_thread() must
+ * not see it.
+ */
+ smp_rmb();
rcu_read_lock();
/* make sure we can trust tsk->thread_group list */
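The ordering this patch establishes can be replayed in a toy user-space model. This is an illustrative sketch, not kernel code (the variable and function names are hypothetical, and the racy "midpoint" is replayed deterministically in one thread); on real SMP hardware the smp_wmb()/smp_rmb() pair is additionally needed so the stores are observed in this order across CPUs:

```c
/* An exiting thread contributes t_time; it is either still on the
 * thread list or already folded into the signal totals. The reader
 * mimics thread_group_cputime(): sig totals plus live threads. */
static int reader_sum(int sig_total, int on_list, int t_time)
{
    int sum = sig_total;        /* times->utime = sig->utime        */
    if (on_list)                /* while_each_thread() still sees T */
        sum += t_time;
    return sum;
}

/* Old order: accumulate into ->signal first, unhash later. A reader
 * running in between counts the thread twice. */
static int old_order_midpoint(void)
{
    int sig_total = 0, on_list = 1, t_time = 10;

    sig_total += t_time;        /* step 1: fold into sig->Xtime */
    /* step 2, __unhash_process(), has not happened yet */
    return reader_sum(sig_total, on_list, t_time);  /* doubled */
}

/* New order (this patch): __unhash_process() first, accumulate later.
 * A reader in the window misses the thread's 10 units, which is the
 * acceptable underestimate the changelogs describe. */
static int new_order_midpoint(void)
{
    int sig_total = 0, on_list = 1, t_time = 10;

    on_list = 0;                /* step 1: __unhash_process() */
    /* step 2, folding into sig totals, has not happened yet */
    return reader_sum(sig_total, on_list, t_time);  /* undercount */
}
```

The old midpoint reports 20 (double-counted), the new one 0 (a transient miss the next read repairs); no interleaving of the new order can report more than the true 10.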
* [PATCH -mm 3/4] cputimers: thread_group_times: make it rcu-safe
2010-03-29 18:12 ` [PATCH -mm 0/4] cputimers/proc: do_task_stat: don't walk through the thread list under ->siglock Oleg Nesterov
2010-03-29 18:12 ` [PATCH -mm 1/4] cputimers: thread_group_cputime: cleanup rcu/signal stuff Oleg Nesterov
2010-03-29 18:13 ` [PATCH -mm 2/4] cputimers: make sure thread_group_cputime() can't count the same thread twice lockless Oleg Nesterov
@ 2010-03-29 18:13 ` Oleg Nesterov
2010-03-29 18:14 ` [PATCH -mm 1/4] cputimers: do_task_stat: avoid ->siglock for while_each_thread() Oleg Nesterov
3 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-29 18:13 UTC
To: Andrew Morton
Cc: Balbir Singh, Americo Wang, Eric W. Biederman, Hidetoshi Seto,
Ingo Molnar, Peter Zijlstra, Roland McGrath, Spencer Candland,
Stanislaw Gruszka, linux-kernel
thread_group_times() relies on ->siglock being held by the caller when it
updates the sig->prev_Xtime members. Change the code to take cputimer.lock
around the assignment.
This makes it rcu-safe and fixes the theoretical race: wait_task_zombie()
calls it locklessly and thus can race with do_task_stat().
Note: we are taking cputimer.lock under ->siglock but this dependency
is not new, thread_group_cputimer() already does this.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/sched.c | 3 +++
1 file changed, 3 insertions(+)
--- 34-rc1/kernel/sched.c~cpuacct_3_thread_group_times_dont_rely_on_siglock 2010-03-29 19:44:25.000000000 +0200
+++ 34-rc1/kernel/sched.c 2010-03-29 19:46:21.000000000 +0200
@@ -3449,6 +3449,7 @@ void thread_group_times(struct task_stru
struct signal_struct *sig = p->signal;
struct task_cputime cputime;
cputime_t rtime, utime, total;
+ unsigned long flags;
thread_group_cputime(p, &cputime);
@@ -3464,9 +3465,11 @@ void thread_group_times(struct task_stru
} else
utime = rtime;
+ spin_lock_irqsave(&sig->cputimer.lock, flags);
sig->prev_utime = max(sig->prev_utime, utime);
sig->prev_stime = max(sig->prev_stime,
cputime_sub(rtime, sig->prev_utime));
+ spin_unlock_irqrestore(&sig->cputimer.lock, flags);
*ut = sig->prev_utime;
*st = sig->prev_stime;
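The shape of this fix can be sketched in user space. This is illustrative only, not the kernel code: a pthread mutex stands in for cputimer.lock, and the names (prev_utime_guarded, publish_utime()) are hypothetical. The point is that the read-max-write is now one critical section, so a concurrent updater can never overwrite a newer value with a stale one:

```c
#include <pthread.h>

static pthread_mutex_t prev_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long long prev_utime_guarded;

/* Serialized equivalent of the patched thread_group_times() update:
 * prev = max(prev, sample), done entirely under the lock, so the
 * value handed back to the caller is monotonically non-decreasing. */
static unsigned long long publish_utime(unsigned long long sample)
{
    unsigned long long out;

    pthread_mutex_lock(&prev_lock);     /* plays the cputimer.lock role */
    if (sample > prev_utime_guarded)
        prev_utime_guarded = sample;
    out = prev_utime_guarded;           /* reported *ut, never goes back */
    pthread_mutex_unlock(&prev_lock);
    return out;
}
```

A stale sample (50 after 100 has been published) leaves the guarded value untouched, which is exactly the monotonicity /bin/top depends on.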
* [PATCH -mm 1/4] cputimers: do_task_stat: avoid ->siglock for while_each_thread()
2010-03-29 18:12 ` [PATCH -mm 0/4] cputimers/proc: do_task_stat: don't walk through the thread list under ->siglock Oleg Nesterov
` (2 preceding siblings ...)
2010-03-29 18:13 ` [PATCH -mm 3/4] cputimers: thread_group_times: make it rcu-safe Oleg Nesterov
@ 2010-03-29 18:14 ` Oleg Nesterov
2010-03-29 18:16 ` Oleg Nesterov
3 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-29 18:14 UTC
To: Andrew Morton
Cc: Balbir Singh, Americo Wang, Eric W. Biederman, Hidetoshi Seto,
Ingo Molnar, Peter Zijlstra, Roland McGrath, Spencer Candland,
Stanislaw Gruszka, linux-kernel
Change do_task_stat() to walk through the ->thread_group list without
->siglock; currently we are doing while_each_thread() twice even if the
"whole" info is not necessarily needed, say, for /bin/ps.
We can rely on previous changes which made thread_group_times() rcu-
safe and move the "if (whole)" code from ->siglock to rcu_read_lock().
Note: do_task_stat() needs more cleanups, this series only cares about
thread_group_times() issues.
This is a user visible change. Without ->siglock we can't get the "whole"
info atomically, and if we race with exit() we can miss the exiting thread.
However, I hope this is OK for /bin/top. The next read from /proc/pid/stat
will see the updated info, we can never overestimate the reported numbers,
and they can never go back.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/proc/array.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
--- 34-rc1/fs/proc/array.c~cpuacct_4_do_task_stat_walk_tg_under_rcu 2010-03-29 19:44:24.000000000 +0200
+++ 34-rc1/fs/proc/array.c 2010-03-29 19:47:03.000000000 +0200
@@ -421,8 +421,14 @@ static int do_task_stat(struct seq_file
cgtime = sig->cgtime;
rsslim = ACCESS_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur);
+ sid = task_session_nr_ns(task, ns);
+ ppid = task_tgid_nr_ns(task->real_parent, ns);
+ pgid = task_pgrp_nr_ns(task, ns);
+ unlock_task_sighand(task, &flags);
+
/* add up live thread stats at the group level */
- if (whole) {
+ rcu_read_lock();
+ if (whole && pid_alive(task)) {
struct task_struct *t = task;
do {
min_flt += t->min_flt;
@@ -433,15 +439,11 @@ static int do_task_stat(struct seq_file
min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- thread_group_times(task, &utime, &stime);
gtime = cputime_add(gtime, sig->gtime);
- }
-
- sid = task_session_nr_ns(task, ns);
- ppid = task_tgid_nr_ns(task->real_parent, ns);
- pgid = task_pgrp_nr_ns(task, ns);
- unlock_task_sighand(task, &flags);
+ thread_group_times(task, &utime, &stime);
+ }
+ rcu_read_unlock();
}
if (permitted && (!whole || num_threads < 2))
* Re: [PATCH -mm 1/4] cputimers: do_task_stat: avoid ->siglock for while_each_thread()
2010-03-29 18:14 ` [PATCH -mm 1/4] cputimers: do_task_stat: avoid ->siglock for while_each_thread() Oleg Nesterov
@ 2010-03-29 18:16 ` Oleg Nesterov
0 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-29 18:16 UTC
To: Andrew Morton
Cc: Balbir Singh, Americo Wang, Eric W. Biederman, Hidetoshi Seto,
Ingo Molnar, Peter Zijlstra, Roland McGrath, Spencer Candland,
Stanislaw Gruszka, linux-kernel
> [PATCH -mm 1/4]
Oh, sorry, this one is 4/4.
Oleg.
* Re: [PATCH -mm 2/4] cputimers: make sure thread_group_cputime() can't count the same thread twice lockless
2010-03-29 18:13 ` [PATCH -mm 2/4] cputimers: make sure thread_group_cputime() can't count the same thread twice lockless Oleg Nesterov
@ 2010-03-30 11:01 ` Stanislaw Gruszka
2010-03-30 13:43 ` Oleg Nesterov
0 siblings, 1 reply; 15+ messages in thread
From: Stanislaw Gruszka @ 2010-03-30 11:01 UTC
To: Oleg Nesterov
Cc: Andrew Morton, Balbir Singh, Americo Wang, Eric W. Biederman,
Hidetoshi Seto, Ingo Molnar, Peter Zijlstra, Roland McGrath,
Spencer Candland, linux-kernel
On Mon, Mar 29, 2010 at 08:13:29PM +0200, Oleg Nesterov wrote:
> - change __exit_signal() to do __unhash_process() before we accumulate
> the counters in ->signal
>
> - add a couple of barriers into thread_group_cputime() and __exit_signal()
> to make sure thread_group_cputime() can never account the same thread
> twice if it races with exit.
>
> If any thread T was already accounted in ->signal, next_thread() or
> pid_alive() must see the result of __unhash_process(T).
Memory barriers prevent accounting times twice, but since we write
sig->{u,s}time and sig->sum_sched_runtime on one cpu and read them
on another cpu, without a lock, this patch makes it theoretically
possible that some accumulated values in struct task_cputime include
an exited task's values while others do not. For example, times->utime
may include values from 10 threads while times->{stime,sum_exec_runtime}
include values from only 9 threads, because the local cpu has updated
sig->utime but not the two other values.
This can make the scaling in thread_group_times() incorrect. I'm not
sure how big a drawback that is.
Stanislaw
* Re: [PATCH -mm 2/4] cputimers: make sure thread_group_cputime() can't count the same thread twice lockless
2010-03-30 11:01 ` Stanislaw Gruszka
@ 2010-03-30 13:43 ` Oleg Nesterov
0 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-30 13:43 UTC
To: Stanislaw Gruszka
Cc: Andrew Morton, Balbir Singh, Americo Wang, Eric W. Biederman,
Hidetoshi Seto, Ingo Molnar, Peter Zijlstra, Roland McGrath,
Spencer Candland, linux-kernel
On 03/30, Stanislaw Gruszka wrote:
>
> On Mon, Mar 29, 2010 at 08:13:29PM +0200, Oleg Nesterov wrote:
> > - change __exit_signal() to do __unhash_process() before we accumulate
> > the counters in ->signal
> >
> > - add a couple of barriers into thread_group_cputime() and __exit_signal()
> > to make sure thread_group_cputime() can never account the same thread
> > twice if it races with exit.
> >
> > If any thread T was already accounted in ->signal, next_thread() or
> > pid_alive() must see the result of __unhash_process(T).
>
> Memory barriers prevents to account times twice, but as we write
> sig->{u,s}time and sig->sum_sched_runtime on one cpu and read them
> on another cpu, without a lock, this patch make theoretically possible
> that some accumulated values of struct task_cputime may include exited
> task values and some not. In example times->utime will include values
> from 10 threads, and times->{stime,sum_exec_runtime} values form 9
> threads, because local cpu updates sig->utime but not two other values.
Yes sure. From 4/4 changelog:
This is a user visible change. Without ->siglock we can't get the "whole"
info atomically, and if we race with exit() we can miss the exiting thread.
this also means it is possible that we don't miss the exiting thread
completely, but miss only its, say, ->stime.
> This can make scaling in thread_group_times() not correct.
Again, see above. The changelog explicitly explains that this patch
assumes it is OK to return info which is not atomic/accurate (as
it is seen under ->siglock).
> I'm not sure
> how big drawback is that.
Neither am I; that is why I am asking for comments.
To me this looks acceptable, but I can't judge.
Oleg.