* [PATCH -mm 1/4] cputimers: thread_group_cputime: cleanup rcu/signal stuff
2010-03-29 18:12 ` [PATCH -mm 0/4] cputimers/proc: do_task_stat: don't walk through the thread list under ->siglock Oleg Nesterov
@ 2010-03-29 18:12 ` Oleg Nesterov
2010-03-29 18:13 ` [PATCH -mm 2/4] cputimers: make sure thread_group_cputime() can't count the same thread twice lockless Oleg Nesterov
` (2 subsequent siblings)
3 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-29 18:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Balbir Singh, Americo Wang, Eric W. Biederman, Hidetoshi Seto,
Ingo Molnar, Peter Zijlstra, Roland McGrath, Spencer Candland,
Stanislaw Gruszka, linux-kernel
- With the recent changes tsk->signal is stable, so we can read it
first and avoid the initialization from INIT_CPUTIME.
- Even though tsk->signal is always valid, we still have to check that
it is safe to use next_thread() under rcu_read_lock(). Currently
the code checks ->sighand != NULL; change it to use pid_alive(),
which is commonly used to ensure the task wasn't unhashed before
we took rcu_read_lock().
- Change the main loop to use the while_each_thread() helper.
This patch itself doesn't make any functional change; currently
thread_group_cputime() is always called under tasklist_lock/->siglock, or
when we know that this thread group is already dead (wait, coredump).
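As a rough userspace sketch of this pattern (plain C; the simplified structures and names here are mine, not the kernel's), while_each_thread() is just a walk of the circular ->thread_group list that stops when it gets back to the starting task:

```c
#include <stddef.h>

/* Hypothetical, much-simplified stand-in for task_struct. */
struct task {
	unsigned long utime, stime;
	struct task *next;		/* circular ->thread_group list */
};

/* Analogue of the kernel's while_each_thread(g, t) helper. */
#define while_each_thread(g, t) \
	while ((t = (t)->next) != (g))

/* Sum per-thread times over the whole group, as the patched
 * thread_group_cputime() loop does. */
static void group_cputime(struct task *tsk,
			  unsigned long *ut, unsigned long *st)
{
	struct task *t = tsk;

	*ut = *st = 0;
	do {
		*ut += t->utime;
		*st += t->stime;
	} while_each_thread(tsk, t);
}
```

The do/while shape matters: the starting task itself is accounted exactly once before the walk advances.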
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/posix-cpu-timers.c | 21 +++++++--------------
1 file changed, 7 insertions(+), 14 deletions(-)
--- 34-rc1/kernel/posix-cpu-timers.c~cpuacct_1_thread_group_cputime_cleanup_rcu 2010-03-29 18:08:14.000000000 +0200
+++ 34-rc1/kernel/posix-cpu-timers.c 2010-03-29 18:09:15.000000000 +0200
@@ -233,31 +233,24 @@ static int cpu_clock_sample(const clocki
void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
{
- struct sighand_struct *sighand;
- struct signal_struct *sig;
+ struct signal_struct *sig = tsk->signal;
struct task_struct *t;
- *times = INIT_CPUTIME;
+ times->utime = sig->utime;
+ times->stime = sig->stime;
+ times->sum_exec_runtime = sig->sum_sched_runtime;
rcu_read_lock();
- sighand = rcu_dereference(tsk->sighand);
- if (!sighand)
+ /* make sure we can trust tsk->thread_group list */
+ if (!likely(pid_alive(tsk)))
goto out;
- sig = tsk->signal;
-
t = tsk;
do {
times->utime = cputime_add(times->utime, t->utime);
times->stime = cputime_add(times->stime, t->stime);
times->sum_exec_runtime += t->se.sum_exec_runtime;
-
- t = next_thread(t);
- } while (t != tsk);
-
- times->utime = cputime_add(times->utime, sig->utime);
- times->stime = cputime_add(times->stime, sig->stime);
- times->sum_exec_runtime += sig->sum_sched_runtime;
+ } while_each_thread(tsk, t);
out:
rcu_read_unlock();
}
^ permalink raw reply	[flat|nested] 15+ messages in thread
* [PATCH -mm 2/4] cputimers: make sure thread_group_cputime() can't count the same thread twice lockless
2010-03-29 18:12 ` [PATCH -mm 0/4] cputimers/proc: do_task_stat: don't walk through the thread list under ->siglock Oleg Nesterov
2010-03-29 18:12 ` [PATCH -mm 1/4] cputimers: thread_group_cputime: cleanup rcu/signal stuff Oleg Nesterov
@ 2010-03-29 18:13 ` Oleg Nesterov
2010-03-30 11:01 ` Stanislaw Gruszka
2010-03-29 18:13 ` [PATCH -mm 3/4] cputimers: thread_group_times: make it rcu-safe Oleg Nesterov
2010-03-29 18:14 ` [PATCH -mm 1/4] cputimers: do_task_stat: avoid ->siglock for while_each_thread() Oleg Nesterov
3 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-29 18:13 UTC (permalink / raw)
To: Andrew Morton
Cc: Balbir Singh, Americo Wang, Eric W. Biederman, Hidetoshi Seto,
Ingo Molnar, Peter Zijlstra, Roland McGrath, Spencer Candland,
Stanislaw Gruszka, linux-kernel
- change __exit_signal() to do __unhash_process() before we accumulate
the counters in ->signal
- add a couple of barriers into thread_group_cputime() and __exit_signal()
to make sure thread_group_cputime() can never account the same thread
twice if it races with exit.
If any thread T was already accounted in ->signal, next_thread() or
pid_alive() must see the result of __unhash_process(T).
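In portable C11 terms, the pairing these two barriers establish can be sketched like this (a simplification with hypothetical names; the kernel uses smp_wmb()/smp_rmb(), modelled here with C11 fences):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins: "hashed" models membership of the
 * thread_group list, sig_utime models sig->utime. */
static _Atomic bool hashed = true;
static _Atomic unsigned long sig_utime;

/* Writer side, analogous to __exit_signal(): unhash first,
 * then publish the accumulated counter. */
static void exit_signal_sketch(unsigned long tsk_utime)
{
	atomic_store_explicit(&hashed, false, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* kernel: smp_wmb() */
	atomic_fetch_add_explicit(&sig_utime, tsk_utime,
				  memory_order_relaxed);
}

/* Reader side, analogous to thread_group_cputime(): read the
 * counter, then re-check that the thread is still hashed before
 * also accounting it via the thread-list walk. */
static unsigned long cputime_sketch(unsigned long tsk_utime)
{
	unsigned long sum = atomic_load_explicit(&sig_utime,
						 memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);	/* kernel: smp_rmb() */
	if (atomic_load_explicit(&hashed, memory_order_relaxed))
		sum += tsk_utime;	/* not yet accounted in ->signal */
	return sum;
}
```

If the reader's load of sig_utime observes the writer's post-fence store, the fences synchronize and the reader is guaranteed to also see hashed == false, so the exiting thread is never added twice.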
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/exit.c | 14 +++++++++-----
kernel/posix-cpu-timers.c | 6 ++++++
2 files changed, 15 insertions(+), 5 deletions(-)
--- 34-rc1/kernel/exit.c~cpuacct_2_thread_group_cputime_rcu_safe 2010-03-29 18:03:17.000000000 +0200
+++ 34-rc1/kernel/exit.c 2010-03-29 18:29:35.000000000 +0200
@@ -88,6 +88,8 @@ static void __exit_signal(struct task_st
rcu_read_lock_held() ||
lockdep_is_held(&tasklist_lock));
spin_lock(&sighand->siglock);
+ __unhash_process(tsk, group_dead);
+ sig->nr_threads--;
posix_cpu_timers_exit(tsk);
if (group_dead) {
@@ -111,9 +113,14 @@ static void __exit_signal(struct task_st
* The group leader stays around as a zombie as long
* as there are other threads. When it gets reaped,
* the exit.c code will add its counts into these totals.
- * We won't ever get here for the group leader, since it
- * will have been the last reference on the signal_struct.
+ *
+ * Make sure that this thread can't be accounted twice
+ * by thread_group_cputime() under rcu. If it sees
+ * the result of the accounting below, it must see the result
+ * of __unhash_process()->__list_del(thread_group) above.
*/
+ smp_wmb();
+
sig->utime = cputime_add(sig->utime, tsk->utime);
sig->stime = cputime_add(sig->stime, tsk->stime);
sig->gtime = cputime_add(sig->gtime, tsk->gtime);
@@ -127,9 +134,6 @@ static void __exit_signal(struct task_st
sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
}
- sig->nr_threads--;
- __unhash_process(tsk, group_dead);
-
/*
* Do this under ->siglock, we can race with another thread
* doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals.
--- 34-rc1/kernel/posix-cpu-timers.c~cpuacct_2_thread_group_cputime_rcu_safe 2010-03-29 18:09:15.000000000 +0200
+++ 34-rc1/kernel/posix-cpu-timers.c 2010-03-29 18:29:35.000000000 +0200
@@ -239,6 +239,12 @@ void thread_group_cputime(struct task_st
times->utime = sig->utime;
times->stime = sig->stime;
times->sum_exec_runtime = sig->sum_sched_runtime;
+ /*
+ * This pairs with wmb() in __exit_signal(). If any thread was
+ * already accounted in tsk->signal, while_each_thread() must
+ * not see it.
+ */
+ smp_rmb();
rcu_read_lock();
/* make sure we can trust tsk->thread_group list */
* Re: [PATCH -mm 2/4] cputimers: make sure thread_group_cputime() can't count the same thread twice lockless
2010-03-29 18:13 ` [PATCH -mm 2/4] cputimers: make sure thread_group_cputime() can't count the same thread twice lockless Oleg Nesterov
@ 2010-03-30 11:01 ` Stanislaw Gruszka
2010-03-30 13:43 ` Oleg Nesterov
0 siblings, 1 reply; 15+ messages in thread
From: Stanislaw Gruszka @ 2010-03-30 11:01 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Balbir Singh, Americo Wang, Eric W. Biederman,
Hidetoshi Seto, Ingo Molnar, Peter Zijlstra, Roland McGrath,
Spencer Candland, linux-kernel
On Mon, Mar 29, 2010 at 08:13:29PM +0200, Oleg Nesterov wrote:
> - change __exit_signal() to do __unhash_process() before we accumulate
> the counters in ->signal
>
> - add a couple of barriers into thread_group_cputime() and __exit_signal()
> to make sure thread_group_cputime() can never account the same thread
> twice if it races with exit.
>
> If any thread T was already accounted in ->signal, next_thread() or
> pid_alive() must see the result of __unhash_process(T).
Memory barriers prevent accounting times twice, but since we write
sig->{u,s}time and sig->sum_sched_runtime on one cpu and read them
on another cpu, without a lock, this patch makes it theoretically
possible that some of the accumulated values in struct task_cputime
include exited task values while others do not. For example,
times->utime will include values from 10 threads while
times->{stime,sum_exec_runtime} include values from 9 threads,
because the local cpu has updated sig->utime but not the two other
values. This can make the scaling in thread_group_times() incorrect.
I'm not sure how big a drawback that is.
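A deterministic, single-threaded simulation of the window described above (all names and numbers are made up for illustration): the lockless reader samples between the writer's two field updates and ends up with utime from 10 threads but stime from only 9:

```c
struct totals {
	unsigned long utime, stime;
};

/* ->signal totals after 9 threads have already exited. */
static struct totals sig = { 9 * 100, 9 * 10 };

/* The exiting 10th thread's accounting, as two separate stores
 * with a window in between. */
static void account_utime(void) { sig.utime += 100; }
static void account_stime(void) { sig.stime += 10; }

/* Lockless reader: plain copies of the fields, no ->siglock. */
static struct totals snapshot(void) { return sig; }
```

Sampling between account_utime() and account_stime() yields a snapshot whose two fields correspond to different thread counts, which is exactly the inconsistency that can skew the scaling.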
Stanislaw
* Re: [PATCH -mm 2/4] cputimers: make sure thread_group_cputime() can't count the same thread twice lockless
2010-03-30 11:01 ` Stanislaw Gruszka
@ 2010-03-30 13:43 ` Oleg Nesterov
0 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-30 13:43 UTC (permalink / raw)
To: Stanislaw Gruszka
Cc: Andrew Morton, Balbir Singh, Americo Wang, Eric W. Biederman,
Hidetoshi Seto, Ingo Molnar, Peter Zijlstra, Roland McGrath,
Spencer Candland, linux-kernel
On 03/30, Stanislaw Gruszka wrote:
>
> On Mon, Mar 29, 2010 at 08:13:29PM +0200, Oleg Nesterov wrote:
> > - change __exit_signal() to do __unhash_process() before we accumulate
> > the counters in ->signal
> >
> > - add a couple of barriers into thread_group_cputime() and __exit_signal()
> > to make sure thread_group_cputime() can never account the same thread
> > twice if it races with exit.
> >
> > If any thread T was already accounted in ->signal, next_thread() or
> > pid_alive() must see the result of __unhash_process(T).
>
> Memory barriers prevent accounting times twice, but since we write
> sig->{u,s}time and sig->sum_sched_runtime on one cpu and read them
> on another cpu, without a lock, this patch makes it theoretically
> possible that some of the accumulated values in struct task_cputime
> include exited task values while others do not. For example,
> times->utime will include values from 10 threads while
> times->{stime,sum_exec_runtime} include values from 9 threads,
> because the local cpu has updated sig->utime but not the two other
> values.
Yes, sure. From the 4/4 changelog:
This is a user-visible change. Without ->siglock we can't get the "whole"
info atomically, and if we race with exit() we can miss the exiting thread.
This also means it is possible that we don't miss the exiting thread
"completely", but we do miss, say, its ->stime.
> This can make the scaling in thread_group_times() incorrect.
Again, see above. The changelog explicitly explains that this patch
assumes it is OK to return info which is not atomic/accurate (as it
would be seen under ->siglock).
> I'm not sure how big a drawback that is.
Neither am I; that is why I am asking for comments.
To me this looks acceptable, but I can't judge.
Oleg.
* [PATCH -mm 3/4] cputimers: thread_group_times: make it rcu-safe
2010-03-29 18:12 ` [PATCH -mm 0/4] cputimers/proc: do_task_stat: don't walk through the thread list under ->siglock Oleg Nesterov
2010-03-29 18:12 ` [PATCH -mm 1/4] cputimers: thread_group_cputime: cleanup rcu/signal stuff Oleg Nesterov
2010-03-29 18:13 ` [PATCH -mm 2/4] cputimers: make sure thread_group_cputime() can't count the same thread twice lockless Oleg Nesterov
@ 2010-03-29 18:13 ` Oleg Nesterov
2010-03-29 18:14 ` [PATCH -mm 1/4] cputimers: do_task_stat: avoid ->siglock for while_each_thread() Oleg Nesterov
3 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-29 18:13 UTC (permalink / raw)
To: Andrew Morton
Cc: Balbir Singh, Americo Wang, Eric W. Biederman, Hidetoshi Seto,
Ingo Molnar, Peter Zijlstra, Roland McGrath, Spencer Candland,
Stanislaw Gruszka, linux-kernel
thread_group_times() relies on ->siglock being held by the caller when
it updates the sig->prev_Xtime members. Change the code to take
cputimer.lock around the assignment.
This makes it rcu-safe and fixes a theoretical race: wait_task_zombie()
calls it locklessly and thus can race with do_task_stat().
Note: we now take cputimer.lock under ->siglock, but this lock-ordering
dependency is not new; thread_group_cputimer() already does this.
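The lock-ordering point can be sketched with pthread mutexes standing in for ->siglock and ->cputimer.lock (the names and structure here are mine): the inner lock is only ever taken either alone or nested inside the outer one, so the established order is preserved:

```c
#include <pthread.h>

/* Hypothetical stand-ins for ->siglock and ->cputimer.lock. */
static pthread_mutex_t siglock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cputimer_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long prev_utime;

/* A path that already holds siglock and nests cputimer_lock
 * inside it: this fixes the order as siglock -> cputimer_lock. */
static void update_under_siglock(unsigned long ut)
{
	pthread_mutex_lock(&siglock);
	pthread_mutex_lock(&cputimer_lock);
	if (ut > prev_utime)
		prev_utime = ut;
	pthread_mutex_unlock(&cputimer_lock);
	pthread_mutex_unlock(&siglock);
}

/* A wait_task_zombie()-style path takes only the inner lock. A
 * lock that is only ever innermost can't deadlock against the
 * nested path above. */
static unsigned long update_lockless_path(unsigned long ut)
{
	unsigned long ret;

	pthread_mutex_lock(&cputimer_lock);
	if (ut > prev_utime)
		prev_utime = ut;
	ret = prev_utime;
	pthread_mutex_unlock(&cputimer_lock);
	return ret;
}
```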
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/sched.c | 3 +++
1 file changed, 3 insertions(+)
--- 34-rc1/kernel/sched.c~cpuacct_3_thread_group_times_dont_rely_on_siglock 2010-03-29 19:44:25.000000000 +0200
+++ 34-rc1/kernel/sched.c 2010-03-29 19:46:21.000000000 +0200
@@ -3449,6 +3449,7 @@ void thread_group_times(struct task_stru
struct signal_struct *sig = p->signal;
struct task_cputime cputime;
cputime_t rtime, utime, total;
+ unsigned long flags;
thread_group_cputime(p, &cputime);
@@ -3464,9 +3465,11 @@ void thread_group_times(struct task_stru
} else
utime = rtime;
+ spin_lock_irqsave(&sig->cputimer.lock, flags);
sig->prev_utime = max(sig->prev_utime, utime);
sig->prev_stime = max(sig->prev_stime,
cputime_sub(rtime, sig->prev_utime));
+ spin_unlock_irqrestore(&sig->cputimer.lock, flags);
*ut = sig->prev_utime;
*st = sig->prev_stime;
* [PATCH -mm 1/4] cputimers: do_task_stat: avoid ->siglock for while_each_thread()
2010-03-29 18:12 ` [PATCH -mm 0/4] cputimers/proc: do_task_stat: don't walk through the thread list under ->siglock Oleg Nesterov
` (2 preceding siblings ...)
2010-03-29 18:13 ` [PATCH -mm 3/4] cputimers: thread_group_times: make it rcu-safe Oleg Nesterov
@ 2010-03-29 18:14 ` Oleg Nesterov
2010-03-29 18:16 ` Oleg Nesterov
3 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-29 18:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Balbir Singh, Americo Wang, Eric W. Biederman, Hidetoshi Seto,
Ingo Molnar, Peter Zijlstra, Roland McGrath, Spencer Candland,
Stanislaw Gruszka, linux-kernel
Change do_task_stat() to walk through the ->thread_group list without
->siglock; currently we are doing while_each_thread() twice even if the
"whole" info is not necessarily needed, say, for /bin/ps.
We can rely on the previous changes, which made thread_group_times()
rcu-safe, and move the "if (whole)" code from under ->siglock to
rcu_read_lock().
Note: do_task_stat() needs more cleanups, this series only cares about
thread_group_times() issues.
This is a user-visible change. Without ->siglock we can't get the "whole"
info atomically, and if we race with exit() we can miss the exiting thread.
However, I hope this is OK for /bin/top. The next read from /proc/pid/stat
will see the updated info, we can never overestimate the reported numbers,
and they can never go back.
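The "can never go back" property follows from the max() updates in thread_group_times(); a minimal sketch (hypothetical names):

```c
/* Mirrors sig->prev_utime: it only ever moves forward under the
 * max() update in thread_group_times(). */
static unsigned long prev_utime;

static unsigned long report_utime(unsigned long sampled)
{
	if (sampled > prev_utime)
		prev_utime = sampled;
	return prev_utime;	/* a stale, racy sample can't lower this */
}
```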
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/proc/array.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
--- 34-rc1/fs/proc/array.c~cpuacct_4_do_task_stat_walk_tg_under_rcu 2010-03-29 19:44:24.000000000 +0200
+++ 34-rc1/fs/proc/array.c 2010-03-29 19:47:03.000000000 +0200
@@ -421,8 +421,14 @@ static int do_task_stat(struct seq_file
cgtime = sig->cgtime;
rsslim = ACCESS_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur);
+ sid = task_session_nr_ns(task, ns);
+ ppid = task_tgid_nr_ns(task->real_parent, ns);
+ pgid = task_pgrp_nr_ns(task, ns);
+ unlock_task_sighand(task, &flags);
+
/* add up live thread stats at the group level */
- if (whole) {
+ rcu_read_lock();
+ if (whole && pid_alive(task)) {
struct task_struct *t = task;
do {
min_flt += t->min_flt;
@@ -433,15 +439,11 @@ static int do_task_stat(struct seq_file
min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- thread_group_times(task, &utime, &stime);
gtime = cputime_add(gtime, sig->gtime);
- }
-
- sid = task_session_nr_ns(task, ns);
- ppid = task_tgid_nr_ns(task->real_parent, ns);
- pgid = task_pgrp_nr_ns(task, ns);
- unlock_task_sighand(task, &flags);
+ thread_group_times(task, &utime, &stime);
+ }
+ rcu_read_unlock();
}
if (permitted && (!whole || num_threads < 2))
* Re: [PATCH -mm 1/4] cputimers: do_task_stat: avoid ->siglock for while_each_thread()
2010-03-29 18:14 ` [PATCH -mm 1/4] cputimers: do_task_stat: avoid ->siglock for while_each_thread() Oleg Nesterov
@ 2010-03-29 18:16 ` Oleg Nesterov
0 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2010-03-29 18:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Balbir Singh, Americo Wang, Eric W. Biederman, Hidetoshi Seto,
Ingo Molnar, Peter Zijlstra, Roland McGrath, Spencer Candland,
Stanislaw Gruszka, linux-kernel
> [PATCH -mm 1/4]
Oh, sorry, this one is 4/4.
Oleg.