* [PATCH 1/2] introduce for_each_thread() to replace the buggy while_each_thread()
2013-12-02 15:24 [PATCH 0/2] initial while_each_thread() fixes Oleg Nesterov
@ 2013-12-02 15:24 ` Oleg Nesterov
2013-12-03 19:24 ` Sameer Nanda
2013-12-02 15:24 ` [PATCH 2/2] oom_kill: change oom_kill.c to use for_each_thread() Oleg Nesterov
` (3 subsequent siblings)
4 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2013-12-02 15:24 UTC (permalink / raw)
To: Andrew Morton
Cc: David Rientjes, Frederic Weisbecker, Mandeep Singh Baines,
Ma, Xindong, Michal Hocko, Sameer Nanda, Sergey Dyasly,
Tu, Xiaobing, linux-kernel
while_each_thread() and next_thread() should die; almost every
lockless usage is wrong.
1. Unless g == current, the lockless while_each_thread() is not safe.
while_each_thread(g, t) can loop forever if g exits: next_thread()
can't reach the unhashed thread in this case. Note that this can
happen even if g is the group leader; it can exec.
2. Even if while_each_thread() itself were correct, people often use
it wrongly.
It was never safe to just take rcu_read_lock() and loop unless
you verify that pid_alive(g) == T; even the first next_thread()
can point to already freed/reused memory.
This patch adds signal_struct->thread_head and task->thread_node
to create the normal rcu-safe list with the stable head. The new
for_each_thread(g, t) helper is always safe under rcu_read_lock()
as long as this task_struct can't go away.
Note: of course it is ugly to have both task_struct->thread_node
and the old task_struct->thread_group; we will kill the latter later,
after we change the users of while_each_thread() to use for_each_thread().
Perhaps we can kill it even before we convert all users; we can
reimplement next_thread(t) using the new thread_head/thread_node.
But we can't do this right now, because it would lead to subtle
behavioural changes. For example, do/while_each_thread() always
sees at least one task, while for_each_thread() can do nothing if
the whole thread group has died. Or thread_group_empty(): currently
its semantics are not clear unless thread_group_leader(p), and we
need to audit the callers before we can change it.
So this patch adds the new interface, which has to coexist with the
old one for some time; hopefully the next changes will be more or
less straightforward and the old one will go away soon.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/init_task.h | 2 ++
include/linux/sched.h | 12 ++++++++++++
kernel/exit.c | 1 +
kernel/fork.c | 7 +++++++
4 files changed, 22 insertions(+), 0 deletions(-)
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index b0ed422..668aae0 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -40,6 +40,7 @@ extern struct fs_struct init_fs;
#define INIT_SIGNALS(sig) { \
.nr_threads = 1, \
+ .thread_head = LIST_HEAD_INIT(init_task.thread_node), \
.wait_chldexit = __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
.shared_pending = { \
.list = LIST_HEAD_INIT(sig.shared_pending.list), \
@@ -213,6 +214,7 @@ extern struct task_group root_task_group;
[PIDTYPE_SID] = INIT_PID_LINK(PIDTYPE_SID), \
}, \
.thread_group = LIST_HEAD_INIT(tsk.thread_group), \
+ .thread_node = LIST_HEAD_INIT(init_signals.thread_head), \
INIT_IDS \
INIT_PERF_EVENTS(tsk) \
INIT_TRACE_IRQFLAGS \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dc48056..0550083 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -487,6 +487,7 @@ struct signal_struct {
atomic_t sigcnt;
atomic_t live;
int nr_threads;
+ struct list_head thread_head;
wait_queue_head_t wait_chldexit; /* for wait4() */
@@ -1162,6 +1163,7 @@ struct task_struct {
/* PID/PID hash table linkage. */
struct pid_link pids[PIDTYPE_MAX];
struct list_head thread_group;
+ struct list_head thread_node;
struct completion *vfork_done; /* for vfork() */
int __user *set_child_tid; /* CLONE_CHILD_SETTID */
@@ -2225,6 +2227,16 @@ extern bool current_is_single_threaded(void);
#define while_each_thread(g, t) \
while ((t = next_thread(t)) != g)
+#define __for_each_thread(signal, t) \
+ list_for_each_entry_rcu(t, &(signal)->thread_head, thread_node)
+
+#define for_each_thread(p, t) \
+ __for_each_thread((p)->signal, t)
+
+/* Careful: this is a double loop, 'break' won't work as expected. */
+#define for_each_process_thread(p, t) \
+ for_each_process(p) for_each_thread(p, t)
+
static inline int get_nr_threads(struct task_struct *tsk)
{
return tsk->signal->nr_threads;
diff --git a/kernel/exit.c b/kernel/exit.c
index a949819..1e77fc6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -74,6 +74,7 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
__this_cpu_dec(process_counts);
}
list_del_rcu(&p->thread_group);
+ list_del_rcu(&p->thread_node);
}
/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 9be5154..b84bef7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1034,6 +1034,11 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->nr_threads = 1;
atomic_set(&sig->live, 1);
atomic_set(&sig->sigcnt, 1);
+
+ /* list_add(thread_node, thread_head) without INIT_LIST_HEAD() */
+ sig->thread_head = (struct list_head)LIST_HEAD_INIT(tsk->thread_node);
+ tsk->thread_node = (struct list_head)LIST_HEAD_INIT(sig->thread_head);
+
init_waitqueue_head(&sig->wait_chldexit);
sig->curr_target = tsk;
init_sigpending(&sig->shared_pending);
@@ -1470,6 +1475,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
atomic_inc(&current->signal->sigcnt);
list_add_tail_rcu(&p->thread_group,
&p->group_leader->thread_group);
+ list_add_tail_rcu(&p->thread_node,
+ &p->signal->thread_head);
}
attach_pid(p, PIDTYPE_PID);
nr_threads++;
--
1.5.5.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] introduce for_each_thread() to replace the buggy while_each_thread()
2013-12-02 15:24 ` [PATCH 1/2] introduce for_each_thread() to replace the buggy while_each_thread() Oleg Nesterov
@ 2013-12-03 19:24 ` Sameer Nanda
0 siblings, 0 replies; 15+ messages in thread
From: Sameer Nanda @ 2013-12-03 19:24 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, David Rientjes, Frederic Weisbecker,
Mandeep Singh Baines, Ma, Xindong, Michal Hocko, Sergey Dyasly,
Tu, Xiaobing, linux-kernel
On Mon, Dec 2, 2013 at 7:24 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> while_each_thread() and next_thread() should die, almost every
> lockless usage is wrong.
>
> 1. Unless g == current, the lockless while_each_thread() is not safe.
>
> while_each_thread(g, t) can loop forever if g exits, next_thread()
> can't reach the unhashed thread in this case. Note that this can
> happen even if g is the group leader, it can exec.
>
> 2. Even if while_each_thread() itself was correct, people often use
> it wrongly.
>
> It was never safe to just take rcu_read_lock() and loop unless
> you verify that pid_alive(g) == T, even the first next_thread()
> can point to the already freed/reused memory.
>
> This patch adds signal_struct->thread_head and task->thread_node
> to create the normal rcu-safe list with the stable head. The new
> for_each_thread(g, t) helper is always safe under rcu_read_lock()
> as long as this task_struct can't go away.
>
> Note: of course it is ugly to have both task_struct->thread_node
> and the old task_struct->thread_group, we will kill it later, after
> we change the users of while_each_thread() to use for_each_thread().
>
> Perhaps we can kill it even before we convert all users, we can
> reimplement next_thread(t) using the new thread_head/thread_node.
> But we can't do this right now because this will lead to subtle
> behavioural changes. For example, do/while_each_thread() always
> sees at least one task, while for_each_thread() can do nothing if
> the whole thread group has died. Or thread_group_empty(), currently
> its semantics is not clear unless thread_group_leader(p) and we
> need to audit the callers before we can change it.
>
> So this patch adds the new interface which has to coexist with the
> old one for some time, hopefully the next changes will be more or
> less straightforward and the old one will go away soon.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Sameer Nanda <snanda@chromium.org>
> ---
> include/linux/init_task.h | 2 ++
> include/linux/sched.h | 12 ++++++++++++
> kernel/exit.c | 1 +
> kernel/fork.c | 7 +++++++
> 4 files changed, 22 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index b0ed422..668aae0 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -40,6 +40,7 @@ extern struct fs_struct init_fs;
>
> #define INIT_SIGNALS(sig) { \
> .nr_threads = 1, \
> + .thread_head = LIST_HEAD_INIT(init_task.thread_node), \
> .wait_chldexit = __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
> .shared_pending = { \
> .list = LIST_HEAD_INIT(sig.shared_pending.list), \
> @@ -213,6 +214,7 @@ extern struct task_group root_task_group;
> [PIDTYPE_SID] = INIT_PID_LINK(PIDTYPE_SID), \
> }, \
> .thread_group = LIST_HEAD_INIT(tsk.thread_group), \
> + .thread_node = LIST_HEAD_INIT(init_signals.thread_head), \
> INIT_IDS \
> INIT_PERF_EVENTS(tsk) \
> INIT_TRACE_IRQFLAGS \
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index dc48056..0550083 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -487,6 +487,7 @@ struct signal_struct {
> atomic_t sigcnt;
> atomic_t live;
> int nr_threads;
> + struct list_head thread_head;
>
> wait_queue_head_t wait_chldexit; /* for wait4() */
>
> @@ -1162,6 +1163,7 @@ struct task_struct {
> /* PID/PID hash table linkage. */
> struct pid_link pids[PIDTYPE_MAX];
> struct list_head thread_group;
> + struct list_head thread_node;
>
> struct completion *vfork_done; /* for vfork() */
> int __user *set_child_tid; /* CLONE_CHILD_SETTID */
> @@ -2225,6 +2227,16 @@ extern bool current_is_single_threaded(void);
> #define while_each_thread(g, t) \
> while ((t = next_thread(t)) != g)
>
> +#define __for_each_thread(signal, t) \
> + list_for_each_entry_rcu(t, &(signal)->thread_head, thread_node)
> +
> +#define for_each_thread(p, t) \
> + __for_each_thread((p)->signal, t)
> +
> +/* Careful: this is a double loop, 'break' won't work as expected. */
> +#define for_each_process_thread(p, t) \
> + for_each_process(p) for_each_thread(p, t)
> +
> static inline int get_nr_threads(struct task_struct *tsk)
> {
> return tsk->signal->nr_threads;
> diff --git a/kernel/exit.c b/kernel/exit.c
> index a949819..1e77fc6 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -74,6 +74,7 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
> __this_cpu_dec(process_counts);
> }
> list_del_rcu(&p->thread_group);
> + list_del_rcu(&p->thread_node);
> }
>
> /*
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9be5154..b84bef7 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1034,6 +1034,11 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
> sig->nr_threads = 1;
> atomic_set(&sig->live, 1);
> atomic_set(&sig->sigcnt, 1);
> +
> + /* list_add(thread_node, thread_head) without INIT_LIST_HEAD() */
> + sig->thread_head = (struct list_head)LIST_HEAD_INIT(tsk->thread_node);
> + tsk->thread_node = (struct list_head)LIST_HEAD_INIT(sig->thread_head);
> +
> init_waitqueue_head(&sig->wait_chldexit);
> sig->curr_target = tsk;
> init_sigpending(&sig->shared_pending);
> @@ -1470,6 +1475,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> atomic_inc(&current->signal->sigcnt);
> list_add_tail_rcu(&p->thread_group,
> &p->group_leader->thread_group);
> + list_add_tail_rcu(&p->thread_node,
> + &p->signal->thread_head);
> }
> attach_pid(p, PIDTYPE_PID);
> nr_threads++;
> --
> 1.5.5.1
>
--
Sameer
* [PATCH 2/2] oom_kill: change oom_kill.c to use for_each_thread()
2013-12-02 15:24 [PATCH 0/2] initial while_each_thread() fixes Oleg Nesterov
2013-12-02 15:24 ` [PATCH 1/2] introduce for_each_thread() to replace the buggy while_each_thread() Oleg Nesterov
@ 2013-12-02 15:24 ` Oleg Nesterov
2013-12-03 18:57 ` Sameer Nanda
2013-12-02 15:34 ` [PATCH 0/2] initial while_each_thread() fixes Oleg Nesterov
` (2 subsequent siblings)
4 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2013-12-02 15:24 UTC (permalink / raw)
To: Andrew Morton
Cc: David Rientjes, Frederic Weisbecker, Mandeep Singh Baines,
Ma, Xindong, Michal Hocko, Sameer Nanda, Sergey Dyasly,
Tu, Xiaobing, linux-kernel
1. Change oom_kill.c to use for_each_thread() rather than the racy
while_each_thread(), which can loop forever if we race with exit.
Note also that most users were buggy even if while_each_thread()
was fine; the task can exit even _before_ rcu_read_lock().
Fortunately the new for_each_thread() only requires a stable
task_struct, so this change fixes both problems.
2. At least out_of_memory() calls has_intersects_mems_allowed()
without even rcu_read_lock(), which is obviously buggy.
Add the necessary rcu_read_lock(). This means that we cannot
simply return from the loop; we need "bool ret" and "break".
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 37 ++++++++++++++++++++-----------------
1 files changed, 20 insertions(+), 17 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1e4a600..47dd4ce 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -54,12 +54,14 @@ static DEFINE_SPINLOCK(zone_scan_lock);
* shares the same mempolicy nodes as current if it is bound by such a policy
* and whether or not it has the same set of allowed cpuset nodes.
*/
-static bool has_intersects_mems_allowed(struct task_struct *tsk,
+static bool has_intersects_mems_allowed(struct task_struct *start,
const nodemask_t *mask)
{
- struct task_struct *start = tsk;
+ struct task_struct *tsk;
+ bool ret = false;
- do {
+ rcu_read_lock();
+ for_each_thread(start, tsk) {
if (mask) {
/*
* If this is a mempolicy constrained oom, tsk's
@@ -67,19 +69,20 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
* mempolicy intersects current, otherwise it may be
* needlessly killed.
*/
- if (mempolicy_nodemask_intersects(tsk, mask))
- return true;
+ ret = mempolicy_nodemask_intersects(tsk, mask);
} else {
/*
* This is not a mempolicy constrained oom, so only
* check the mems of tsk's cpuset.
*/
- if (cpuset_mems_allowed_intersects(current, tsk))
- return true;
+ ret = cpuset_mems_allowed_intersects(current, tsk);
}
- } while_each_thread(start, tsk);
+ if (ret)
+ break;
+ }
+ rcu_read_unlock();
- return false;
+ return ret;
}
#else
static bool has_intersects_mems_allowed(struct task_struct *tsk,
@@ -97,14 +100,14 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
*/
struct task_struct *find_lock_task_mm(struct task_struct *p)
{
- struct task_struct *t = p;
+ struct task_struct *t;
- do {
+ for_each_thread(p, t) {
task_lock(t);
if (likely(t->mm))
return t;
task_unlock(t);
- } while_each_thread(p, t);
+ }
return NULL;
}
@@ -301,7 +304,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
unsigned long chosen_points = 0;
rcu_read_lock();
- do_each_thread(g, p) {
+ for_each_process_thread(g, p) {
unsigned int points;
switch (oom_scan_process_thread(p, totalpages, nodemask,
@@ -323,7 +326,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
chosen = p;
chosen_points = points;
}
- } while_each_thread(g, p);
+ }
if (chosen)
get_task_struct(chosen);
rcu_read_unlock();
@@ -406,7 +409,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
{
struct task_struct *victim = p;
struct task_struct *child;
- struct task_struct *t = p;
+ struct task_struct *t;
struct mm_struct *mm;
unsigned int victim_points = 0;
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
@@ -437,7 +440,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* still freeing memory.
*/
read_lock(&tasklist_lock);
- do {
+ for_each_thread(p, t) {
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;
@@ -455,7 +458,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
get_task_struct(victim);
}
}
- } while_each_thread(p, t);
+ }
read_unlock(&tasklist_lock);
rcu_read_lock();
--
1.5.5.1
* Re: [PATCH 2/2] oom_kill: change oom_kill.c to use for_each_thread()
2013-12-02 15:24 ` [PATCH 2/2] oom_kill: change oom_kill.c to use for_each_thread() Oleg Nesterov
@ 2013-12-03 18:57 ` Sameer Nanda
2013-12-03 20:05 ` Oleg Nesterov
2013-12-04 12:57 ` Oleg Nesterov
0 siblings, 2 replies; 15+ messages in thread
From: Sameer Nanda @ 2013-12-03 18:57 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, David Rientjes, Frederic Weisbecker,
Mandeep Singh Baines, Ma, Xindong, Michal Hocko, Sergey Dyasly,
Tu, Xiaobing, linux-kernel
Thanks for helping get this fixed! A couple of comments below.
On Mon, Dec 2, 2013 at 7:24 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 1. Change oom_kill.c to use for_each_thread() rather than the racy
> while_each_thread() which can loop forever if we race with exit.
>
> Note also that most users were buggy even if while_each_thread()
> was fine, the task can exit even _before_ rcu_read_lock().
>
> Fortunately the new for_each_thread() only requires the stable
> task_struct, so this change fixes both problems.
>
> 2. At least out_of_memory() calls has_intersects_mems_allowed()
> without even rcu_read_lock(), this is obviously buggy.
>
> Add the necessary rcu_read_lock(). This means that we can not
> simply return from the loop, we need "bool ret" and "break".
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Sameer Nanda <snanda@chromium.org>
> ---
> mm/oom_kill.c | 37 ++++++++++++++++++++-----------------
> 1 files changed, 20 insertions(+), 17 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1e4a600..47dd4ce 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -54,12 +54,14 @@ static DEFINE_SPINLOCK(zone_scan_lock);
> * shares the same mempolicy nodes as current if it is bound by such a policy
> * and whether or not it has the same set of allowed cpuset nodes.
> */
> -static bool has_intersects_mems_allowed(struct task_struct *tsk,
> +static bool has_intersects_mems_allowed(struct task_struct *start,
Comment block needs to be fixed up due to variable name change from
tsk to start.
> const nodemask_t *mask)
> {
> - struct task_struct *start = tsk;
> + struct task_struct *tsk;
> + bool ret = false;
>
> - do {
> + rcu_read_lock();
> + for_each_thread(start, tsk) {
> if (mask) {
> /*
> * If this is a mempolicy constrained oom, tsk's
> @@ -67,19 +69,20 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
> * mempolicy intersects current, otherwise it may be
> * needlessly killed.
> */
> - if (mempolicy_nodemask_intersects(tsk, mask))
> - return true;
> + ret = mempolicy_nodemask_intersects(tsk, mask);
> } else {
> /*
> * This is not a mempolicy constrained oom, so only
> * check the mems of tsk's cpuset.
> */
> - if (cpuset_mems_allowed_intersects(current, tsk))
> - return true;
> + ret = cpuset_mems_allowed_intersects(current, tsk);
> }
> - } while_each_thread(start, tsk);
> + if (ret)
> + break;
> + }
> + rcu_read_unlock();
>
> - return false;
> + return ret;
> }
> #else
> static bool has_intersects_mems_allowed(struct task_struct *tsk,
> @@ -97,14 +100,14 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
> */
> struct task_struct *find_lock_task_mm(struct task_struct *p)
> {
> - struct task_struct *t = p;
> + struct task_struct *t;
>
> - do {
> + for_each_thread(p, t) {
> task_lock(t);
> if (likely(t->mm))
> return t;
> task_unlock(t);
> - } while_each_thread(p, t);
> + }
>
> return NULL;
> }
> @@ -301,7 +304,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> unsigned long chosen_points = 0;
>
> rcu_read_lock();
> - do_each_thread(g, p) {
> + for_each_process_thread(g, p) {
> unsigned int points;
>
> switch (oom_scan_process_thread(p, totalpages, nodemask,
> @@ -323,7 +326,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> chosen = p;
> chosen_points = points;
> }
> - } while_each_thread(g, p);
> + }
> if (chosen)
> get_task_struct(chosen);
> rcu_read_unlock();
> @@ -406,7 +409,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> {
> struct task_struct *victim = p;
> struct task_struct *child;
> - struct task_struct *t = p;
> + struct task_struct *t;
> struct mm_struct *mm;
> unsigned int victim_points = 0;
> static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> @@ -437,7 +440,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> * still freeing memory.
> */
> read_lock(&tasklist_lock);
This can be an rcu_read_lock now, I think?
> - do {
> + for_each_thread(p, t) {
> list_for_each_entry(child, &t->children, sibling) {
> unsigned int child_points;
>
> @@ -455,7 +458,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> get_task_struct(victim);
> }
> }
> - } while_each_thread(p, t);
> + }
> read_unlock(&tasklist_lock);
>
> rcu_read_lock();
> --
> 1.5.5.1
>
--
Sameer
* Re: [PATCH 2/2] oom_kill: change oom_kill.c to use for_each_thread()
2013-12-03 18:57 ` Sameer Nanda
@ 2013-12-03 20:05 ` Oleg Nesterov
2013-12-03 20:50 ` Oleg Nesterov
2013-12-04 12:57 ` Oleg Nesterov
1 sibling, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2013-12-03 20:05 UTC (permalink / raw)
To: Sameer Nanda
Cc: Andrew Morton, David Rientjes, Frederic Weisbecker,
Mandeep Singh Baines, Ma, Xindong, Michal Hocko, Sergey Dyasly,
Tu, Xiaobing, linux-kernel
On 12/03, Sameer Nanda wrote:
>
> Reviewed-by: Sameer Nanda <snanda@chromium.org>
Thanks!
> > -static bool has_intersects_mems_allowed(struct task_struct *tsk,
> > +static bool has_intersects_mems_allowed(struct task_struct *start,
> Comment block needs to be fixed up due to variable name change from
> tsk to start.
thanks again, I'll update the patch.
Oleg.
* Re: [PATCH 2/2] oom_kill: change oom_kill.c to use for_each_thread()
2013-12-03 20:05 ` Oleg Nesterov
@ 2013-12-03 20:50 ` Oleg Nesterov
0 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2013-12-03 20:50 UTC (permalink / raw)
To: Sameer Nanda
Cc: Andrew Morton, David Rientjes, Frederic Weisbecker,
Mandeep Singh Baines, Ma, Xindong, Michal Hocko, Sergey Dyasly,
Tu, Xiaobing, linux-kernel
On 12/03, Oleg Nesterov wrote:
>
> > > -static bool has_intersects_mems_allowed(struct task_struct *tsk,
> > > +static bool has_intersects_mems_allowed(struct task_struct *start,
> > Comment block needs to be fixed up due to variable name change from
> > tsk to start.
>
> thanks again, I'll update the patch.
Yes, I'll redo and resend.
I decided to split it into 2 changes; "add rcu_read_lock() into
has_intersects_mems_allowed()" should come as a separate change,
because it seems that oom_kill.c needs more rcu_read_lock's: at
least find_lock_task_mm() can be called without the rcu lock.
Oleg.
* Re: [PATCH 2/2] oom_kill: change oom_kill.c to use for_each_thread()
2013-12-03 18:57 ` Sameer Nanda
2013-12-03 20:05 ` Oleg Nesterov
@ 2013-12-04 12:57 ` Oleg Nesterov
1 sibling, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2013-12-04 12:57 UTC (permalink / raw)
To: Sameer Nanda
Cc: Andrew Morton, David Rientjes, Frederic Weisbecker,
Mandeep Singh Baines, Ma, Xindong, Michal Hocko, Sergey Dyasly,
Tu, Xiaobing, linux-kernel
Sameer, I didn't notice this part yesterday.
On 12/03, Sameer Nanda wrote:
>
> > @@ -406,7 +409,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > {
> > struct task_struct *victim = p;
> > struct task_struct *child;
> > - struct task_struct *t = p;
> > + struct task_struct *t;
> > struct mm_struct *mm;
> > unsigned int victim_points = 0;
> > static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > @@ -437,7 +440,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > * still freeing memory.
> > */
> > read_lock(&tasklist_lock);
> This can be a rcu_read_lock now, I think?
No, we need the tasklist lock for list_for_each_entry(t->children);
it is not rcu-safe.
Oleg.
* Re: [PATCH 0/2] initial while_each_thread() fixes
2013-12-02 15:24 [PATCH 0/2] initial while_each_thread() fixes Oleg Nesterov
2013-12-02 15:24 ` [PATCH 1/2] introduce for_each_thread() to replace the buggy while_each_thread() Oleg Nesterov
2013-12-02 15:24 ` [PATCH 2/2] oom_kill: change oom_kill.c to use for_each_thread() Oleg Nesterov
@ 2013-12-02 15:34 ` Oleg Nesterov
2013-12-03 15:46 ` Sergey Dyasly
2013-12-03 16:53 ` William Dauchy
4 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2013-12-02 15:34 UTC (permalink / raw)
To: Andrew Morton
Cc: David Rientjes, Frederic Weisbecker, Mandeep Singh Baines,
Ma, Xindong, Michal Hocko, Sameer Nanda, Sergey Dyasly,
Tu, Xiaobing, linux-kernel
On 12/02, Oleg Nesterov wrote:
>
> The lockless while_each_thread() is racy and broken, almost
> every user can loop forever.
> ...
> Another problem with while_each_thread() is that it is very easy
> to use it wrongly, and oom_kill.c is the good example.
Forgot to mention: it is not necessarily safe even under the task
list lock if it is used wrongly. Again, oom_kill.c is a good
example; oom_kill_process() takes read_lock(&tasklist_lock) but
doesn't verify that p is still alive.
The new for_each_thread() is much simpler in this respect, it
only needs the stable task_struct.
Oleg.
* Re: [PATCH 0/2] initial while_each_thread() fixes
2013-12-02 15:24 [PATCH 0/2] initial while_each_thread() fixes Oleg Nesterov
` (2 preceding siblings ...)
2013-12-02 15:34 ` [PATCH 0/2] initial while_each_thread() fixes Oleg Nesterov
@ 2013-12-03 15:46 ` Sergey Dyasly
2013-12-03 18:52 ` Oleg Nesterov
2013-12-03 19:01 ` Oleg Nesterov
2013-12-03 16:53 ` William Dauchy
4 siblings, 2 replies; 15+ messages in thread
From: Sergey Dyasly @ 2013-12-03 15:46 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, David Rientjes, Frederic Weisbecker,
Mandeep Singh Baines, Ma, Xindong, Michal Hocko, Sameer Nanda,
Tu, Xiaobing, linux-kernel
Hi Oleg,
I was waiting for this one!
On Mon, 2 Dec 2013 16:24:23 +0100
Oleg Nesterov <oleg@redhat.com> wrote:
> Hello.
>
> This was reported several times, I believe the first report is
> http://marc.info/?l=linux-kernel&m=127688978121665. Hmm, 3 years
> ago. The lockless while_each_thread() is racy and broken, almost
> every user can loop forever.
>
> Recently people started to report they actually hit this problem in
> oom_kill.c. This doesn't really matter and I can be wrong, but in
> fact I do not think they really hit this race, it is very unlikely.
The race is very easy to catch if you have a process with several threads,
all of which allocate memory simultaneously. This leads to:
1) OOMk selects and sends SIGKILL to one of the threads
2) another thread invokes OOMk and the first thread gets selected,
but it gets unhashed before while_each_thread...
> Another problem with while_each_thread() is that it is very easy
> to use it wrongly, and oom_kill.c is the good example.
>
> I came to conclusion that it is practically impossible to send a
> single series which fixes all problems, too many different users.
>
> So 1/2 adds the new for_each_thread() interface, and 2/2 fixes oom
> kill as an example.
>
> We obviously need a lot more changes like 2/2 before we can kill
> while_each_thread() and task_struct->thread_group, but I hope they
> will be straightforward. And in fact I hope that task->thread_group
> can go away before we change all users of while_each_thread().
>
> David, et al, I didn't actually test 2/2, I do not know how. Please
> review, although it looks simple.
The patches look correct and my test case no longer hangs, so
Reviewed-and-Tested-by: Sergey Dyasly <dserrg@gmail.com>
>
> Oleg.
>
> include/linux/init_task.h | 2 ++
> include/linux/sched.h | 12 ++++++++++++
> kernel/exit.c | 1 +
> kernel/fork.c | 7 +++++++
> mm/oom_kill.c | 37 ++++++++++++++++++++-----------------
> 5 files changed, 42 insertions(+), 17 deletions(-)
>
--
Sergey Dyasly <dserrg@gmail.com>
* Re: [PATCH 0/2] initial while_each_thread() fixes
2013-12-03 15:46 ` Sergey Dyasly
@ 2013-12-03 18:52 ` Oleg Nesterov
2013-12-03 19:01 ` Oleg Nesterov
1 sibling, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2013-12-03 18:52 UTC (permalink / raw)
To: Sergey Dyasly
Cc: Andrew Morton, David Rientjes, Frederic Weisbecker,
Mandeep Singh Baines, Ma, Xindong, Michal Hocko, Sameer Nanda,
Tu, Xiaobing, linux-kernel
Hi Sergey,
On 12/03, Sergey Dyasly wrote:
>
> I was waiting for this one!
I know ;) Sorry for delay. and thanks for review!
Oleg.
* Re: [PATCH 0/2] initial while_each_thread() fixes
2013-12-03 15:46 ` Sergey Dyasly
2013-12-03 18:52 ` Oleg Nesterov
@ 2013-12-03 19:01 ` Oleg Nesterov
1 sibling, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2013-12-03 19:01 UTC (permalink / raw)
To: Sergey Dyasly
Cc: Andrew Morton, David Rientjes, Frederic Weisbecker,
Mandeep Singh Baines, Ma, Xindong, Michal Hocko, Sameer Nanda,
Tu, Xiaobing, linux-kernel
argh, sorry, I didn't finish my email...
On 12/03, Sergey Dyasly wrote:
>
> On Mon, 2 Dec 2013 16:24:23 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > Recently people started to report they actually hit this problem in
> > oom_kill.c. This doesn't really matter and I can be wrong, but in
> > fact I do not think they really hit this race, it is very unlikely.
>
> The race is very easy to catch if you have a process with several threads,
> all of which allocates memory simultaneously. This leads to:
>
> 1) OOMk selects and sends SIGKILL to one of the threads
>
> 2) another thread invokes OOMk and the first thread gets selected,
> but it gets unhashed before while_each_thread...
Yes, but this is what I meant.
It was unhashed even before while_each_thread(g), and it should
never be used unless you ensure that g is still alive.
But this doesn't matter. while_each_thread() was buggy anyway. And
(perhaps even more importantly) it was not easy to use it correctly,
so I finally decided to add another helper which only needs the
stable task_struct.
> The patches look correct and my test case no longer hangs, so
>
> Reviewed-and-Tested-by: Sergey Dyasly <dserrg@gmail.com>
Thanks!
Oleg.
* Re: [PATCH 0/2] initial while_each_thread() fixes
2013-12-02 15:24 [PATCH 0/2] initial while_each_thread() fixes Oleg Nesterov
` (3 preceding siblings ...)
2013-12-03 15:46 ` Sergey Dyasly
@ 2013-12-03 16:53 ` William Dauchy
2013-12-03 20:22 ` Oleg Nesterov
4 siblings, 1 reply; 15+ messages in thread
From: William Dauchy @ 2013-12-03 16:53 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, David Rientjes, Frederic Weisbecker,
Mandeep Singh Baines, Ma, Xindong, Michal Hocko, Sameer Nanda,
Sergey Dyasly, Tu, Xiaobing, linux-kernel@vger.kernel.org
Hello Oleg,
On Mon, Dec 2, 2013 at 4:24 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> This was reported several times, I believe the first report is
> http://marc.info/?l=linux-kernel&m=127688978121665. Hmm, 3 years
> ago. The lockless while_each_thread() is racy and broken, almost
> every user can loop forever.
>
> Recently people started to report that they actually hit this problem
> in oom_kill.c. This doesn't really matter and I can be wrong, but in
> fact I do not think they really hit this race; it is very unlikely.
> Another problem with while_each_thread() is that it is very easy
> to use wrongly, and oom_kill.c is a good example.
>
> I came to the conclusion that it is practically impossible to send a
> single series which fixes all the problems; there are too many different users.
>
> So 1/2 adds the new for_each_thread() interface, and 2/2 fixes oom
> kill as an example.
>
> We obviously need a lot more changes like 2/2 before we can kill
> while_each_thread() and task_struct->thread_group, but I hope they
> will be straightforward. And in fact I hope that task->thread_group
> can go away before we change all users of while_each_thread().
>
> David, et al, I didn't actually test 2/2, I do not know how. Please
> review, although it looks simple.
I was wondering if this patch was also targeted for the stable branch?
Before this patch, I was testing this one
https://lkml.org/lkml/2013/11/13/336 which is fixing my oom issues.
I applied the two patches on top of a 3.10.x and got some tasks
stalled after the first OOM:
php5-fpm invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
php5-fpm cpuset=VM_X mems_allowed=0-1
CPU: 21 PID: 28256 Comm: php5-fpm Not tainted 3.10.22 #1
Hardware name: Dell Inc. PowerEdge C8220/0TDN55, BIOS 1.1.17 01/09/2013
ffffc9001e3c9000 ffff881fbacbfbe8 ffffffff81569220 ffff881fbacbfc88
ffffffff815662f1 ffff882011bb4800 0000000000020000 ffff881fbacbfc28
ffffffff8156d133 0000000000000206 0000000000000206 ffff881f00000000
Call Trace:
[<ffffffff81569220>] dump_stack+0x19/0x21
[<ffffffff815662f1>] dump_header+0x7a/0x26c
[<ffffffff8156d133>] ? preempt_schedule+0x33/0x50
[<ffffffff8156e597>] ? _raw_spin_unlock_irqrestore+0x67/0x70
[<ffffffff8124b026>] ? ___ratelimit+0xa6/0x130
[<ffffffff810cb2b0>] oom_kill_process+0x2a0/0x440
[<ffffffff8104e0f0>] ? has_capability+0x20/0x20
[<ffffffff8111c8bd>] mem_cgroup_oom_synchronize+0x59d/0x5c0
[<ffffffff8111bbc0>] ? mem_cgroup_charge_common+0xa0/0xa0
[<ffffffff810cbc03>] pagefault_out_of_memory+0x13/0x90
[<ffffffff81564022>] mm_fault_error+0xb8/0x19b
[<ffffffff8102b073>] __do_page_fault+0x543/0x5b0
[<ffffffff8156c6ac>] ? __schedule+0x3dc/0xd90
[<ffffffff8107c970>] ? hrtick_update+0x70/0x70
[<ffffffff8109be27>] ? SyS_futex+0x97/0x2f0
[<ffffffff8156f21a>] ? retint_swapgs_pax+0xd/0x12
[<ffffffff81252621>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[<ffffffff8102b119>] do_page_fault+0x9/0x20
[<ffffffff8156f4c8>] page_fault+0x38/0x40
Task in /lxc/VM_X killed as a result of limit of /lxc/VM_X
memory: usage 254644kB, limit 262144kB, failcnt 38594
memory+swap: usage 524288kB, limit 524288kB, failcnt 803
kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
Memory cgroup stats for /lxc/VM_X: cache:196KB rss:252116KB
rss_huge:4096KB mapped_file:164KB swap:274236KB inactive_anon:124788KB
active_anon:124332KB inactive_file:0KB active_file:0KB unevictable:0KB
[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[16017] 0 16017 4444 336 14 94 0 paas-start
[18020] 5101 18020 64126 2113 57 4645 -1000 mysqld
[19162] 5000 19162 87663 463 122 1396 0 php5-fpm
[19216] 5001 19216 24156 473 51 444 0 apache2
[20257] 5001 20257 176880 656 122 2600 0 apache2
[20353] 0 20353 1023 108 8 21 0 sleep
[27746] 5000 27746 90454 1025 139 4316 0 php5-fpm
[28176] 5000 28176 348410 59568 547 53877 0 php5-fpm
INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 21,
t=15014 jiffies, g=65569, c=65568, q=6537)
INFO: Stall ended before state dump start
INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 21,
t=60019 jiffies, g=65569, c=65568, q=15632)
INFO: Stall ended before state dump start
INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 21,
t=105024 jiffies, g=65569, c=65568, q=32303)
INFO: Stall ended before state dump start
INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 21,
t=150029 jiffies, g=65569, c=65568, q=43552)
INFO: Stall ended before state dump start
INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 5,
t=195034 jiffies, g=65569, c=65568, q=194557)
INFO: Stall ended before state dump start
INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 21,
t=240039 jiffies, g=65569, c=65568, q=205866)
INFO: Stall ended before state dump start
INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 21,
t=285044 jiffies, g=65569, c=65568, q=238233)
INFO: Stall ended before state dump start
INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 21,
t=330049 jiffies, g=65569, c=65568, q=274925)
INFO: Stall ended before state dump start
INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 5,
t=375054 jiffies, g=65569, c=65568, q=297436)
INFO: Stall ended before state dump start
--
William
* Re: [PATCH 0/2] initial while_each_thread() fixes
2013-12-03 16:53 ` William Dauchy
@ 2013-12-03 20:22 ` Oleg Nesterov
2013-12-03 20:28 ` William Dauchy
0 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2013-12-03 20:22 UTC (permalink / raw)
To: William Dauchy
Cc: Andrew Morton, David Rientjes, Frederic Weisbecker,
Mandeep Singh Baines, Ma, Xindong, Michal Hocko, Sameer Nanda,
Sergey Dyasly, Tu, Xiaobing, linux-kernel@vger.kernel.org
Hi William,
On 12/03, William Dauchy wrote:
>
> I was wondering if this patch was also targeted for stable branch?
Unlikely... but we will see.
> Before this patch, I was testing this one
> https://lkml.org/lkml/2013/11/13/336
perhaps this patch makes more sense for stable.
But, to clarify just in case, it is not needed after this series.
> which is fixing my oom issues.
Yes, but it doesn't fix all problems even in mm/oom_kill.c, and
we need to fix while_each_thread() anyway.
> I applied the two patches on top of a 3.10.x and got some tasks
> stalled after the first OOM:
So you are saying that this was introduced by this series?
Could you retest with a recent kernel?
> INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 21,
> t=15014 jiffies, g=65569, c=65568, q=6537)
This series does not expand the rcu-locked sections, except that it
adds rcu_read_lock() into has_intersects_mems_allowed(), but this is
an obvious bugfix.
So far I _think_ that this series should not be blamed for that, but
I'll try to recheck.
Oleg.
* Re: [PATCH 0/2] initial while_each_thread() fixes
2013-12-03 20:22 ` Oleg Nesterov
@ 2013-12-03 20:28 ` William Dauchy
0 siblings, 0 replies; 15+ messages in thread
From: William Dauchy @ 2013-12-03 20:28 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, David Rientjes, Frederic Weisbecker,
Mandeep Singh Baines, Ma, Xindong, Michal Hocko, Sameer Nanda,
Sergey Dyasly, Tu, Xiaobing, linux-kernel@vger.kernel.org
On Tue, Dec 3, 2013 at 9:22 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>> Before this patch, I was testing this one
>> https://lkml.org/lkml/2013/11/13/336
>
> perhaps this patch makes more sense for stable.
I guess we should consider it.
> But, to clarify just in case, it is not needed after this series.
>
>> which is fixing my oom issues.
>
> Yes, but it doesn't fix all problems even in mm/oom_kill.c, and
> we need to fix while_each_thread() anyway.
>
>> I applied the two patches on top of a 3.10.x and got some tasks
>> stalled after the first OOM:
>
> So you are saying that this was introduced by this series?
Yes, without these two patches I can't reproduce my issue.
> Could you retest with the recent kernel?
I'll see if I find time for it; will keep you up to date.
Thanks,
--
William