* [PATCH 1/2] introduce __next_thread(), fix next_tid() vs exec() race
2023-08-24 14:31 [PATCH 0/2] introduce __next_thread(), change next_thread() Oleg Nesterov
@ 2023-08-24 14:31 ` Oleg Nesterov
2023-08-24 14:32 ` [PATCH 2/2] change next_thread() to use __next_thread() ?: group_leader Oleg Nesterov
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: Oleg Nesterov @ 2023-08-24 14:31 UTC (permalink / raw)
To: Andrew Morton; +Cc: Eric W. Biederman, Linus Torvalds, peterz, linux-kernel
next_tid(start) does:
rcu_read_lock();
if (pid_alive(start)) {
pos = next_thread(start);
if (thread_group_leader(pos))
pos = NULL;
else
get_task_struct(pos);
it should return pos = NULL when next_thread() wraps to the 1st thread
in the thread group, group leader, and the thread_group_leader() check
tries to detect this case.
But this can race with exec. To simplify, suppose we have a main thread
M and a single sub-thread T, next_tid(T) should return NULL.
Now suppose that T execs. If next_tid(T) is called after T changes the
leadership and before it does release_task() which removes the old leader
from list, then next_thread() returns M and thread_group_leader(M) = F.
Lockless use of next_thread() should be avoided. After this change only
task_group_seq_get_next() does this, and I believe it should be changed
as well.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/proc/base.c | 6 ++----
include/linux/sched/signal.h | 11 +++++++++++
2 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 69dbb03ad55b..b9fb36cd5e9c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3838,10 +3838,8 @@ static struct task_struct *next_tid(struct task_struct *start)
struct task_struct *pos = NULL;
rcu_read_lock();
if (pid_alive(start)) {
- pos = next_thread(start);
- if (thread_group_leader(pos))
- pos = NULL;
- else
+ pos = __next_thread(start);
+ if (pos)
get_task_struct(pos);
}
rcu_read_unlock();
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 0014d3adaf84..7fb34b8cda54 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -715,6 +715,17 @@ bool same_thread_group(struct task_struct *p1, struct task_struct *p2)
return p1->signal == p2->signal;
}
+/*
+ * returns NULL if p is the last thread in the thread group
+ */
+static inline struct task_struct *__next_thread(struct task_struct *p)
+{
+ return list_next_or_null_rcu(&p->signal->thread_head,
+ &p->thread_node,
+ struct task_struct,
+ thread_node);
+}
+
static inline struct task_struct *next_thread(const struct task_struct *p)
{
return list_entry_rcu(p->thread_group.next,
--
2.25.1.362.g51ebf55
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH 2/2] change next_thread() to use __next_thread() ?: group_leader
2023-08-24 14:31 [PATCH 0/2] introduce __next_thread(), change next_thread() Oleg Nesterov
2023-08-24 14:31 ` [PATCH 1/2] introduce __next_thread(), fix next_tid() vs exec() race Oleg Nesterov
@ 2023-08-24 14:32 ` Oleg Nesterov
2023-08-24 14:40 ` [PATCH 0/2] introduce __next_thread(), change next_thread() Oleg Nesterov
2023-08-24 15:02 ` Linus Torvalds
3 siblings, 0 replies; 9+ messages in thread
From: Oleg Nesterov @ 2023-08-24 14:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: Eric W. Biederman, Linus Torvalds, peterz, linux-kernel
This relies on fact that group leader is always the 1st entry in the
signal->thread_head list.
With or without this change, if the lockless next_thread(last_thread)
races with exec it can return the old or the new leader.
We are almost ready to kill task->thread_group, after this change its
only user is thread_group_empty().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/sched/signal.h | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 7fb34b8cda54..cffc882d367f 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -726,10 +726,9 @@ static inline struct task_struct *__next_thread(struct task_struct *p)
thread_node);
}
-static inline struct task_struct *next_thread(const struct task_struct *p)
+static inline struct task_struct *next_thread(struct task_struct *p)
{
- return list_entry_rcu(p->thread_group.next,
- struct task_struct, thread_group);
+ return __next_thread(p) ?: p->group_leader;
}
static inline int thread_group_empty(struct task_struct *p)
--
2.25.1.362.g51ebf55
^ permalink raw reply related [flat|nested] 9+ messages in thread* Re: [PATCH 0/2] introduce __next_thread(), change next_thread()
2023-08-24 14:31 [PATCH 0/2] introduce __next_thread(), change next_thread() Oleg Nesterov
2023-08-24 14:31 ` [PATCH 1/2] introduce __next_thread(), fix next_tid() vs exec() race Oleg Nesterov
2023-08-24 14:32 ` [PATCH 2/2] change next_thread() to use __next_thread() ?: group_leader Oleg Nesterov
@ 2023-08-24 14:40 ` Oleg Nesterov
2023-08-24 15:02 ` Linus Torvalds
3 siblings, 0 replies; 9+ messages in thread
From: Oleg Nesterov @ 2023-08-24 14:40 UTC (permalink / raw)
To: Andrew Morton
Cc: Eric W. Biederman, Linus Torvalds, Peter Zijlstra, linux-kernel
Damn.
Peter Zijlstra's email was wrong. Fix it.
Peter, sorry, you can find this short series on kernel.org,
https://lore.kernel.org/all/20230824143112.GA31208@redhat.com/
or I can resend with your email fixed.
On 08/24, Oleg Nesterov wrote:
>
> Hello,
>
> After document-while_each_thread-change-first_tid-to-use-for_each_thread.patch
> in mm tree + this series
>
> 1. We have only one lockless user of next_thread(), task_group_seq_get_next().
> I think it should be changed too.
>
> 2. We have only one user of task_struct->thread_group, thread_group_empty().
> The next patches will change thread_group_empty() and kill ->thread_group.
>
> Oleg.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] introduce __next_thread(), change next_thread()
2023-08-24 14:31 [PATCH 0/2] introduce __next_thread(), change next_thread() Oleg Nesterov
` (2 preceding siblings ...)
2023-08-24 14:40 ` [PATCH 0/2] introduce __next_thread(), change next_thread() Oleg Nesterov
@ 2023-08-24 15:02 ` Linus Torvalds
2023-08-24 15:47 ` Oleg Nesterov
2023-08-25 13:00 ` Eric W. Biederman
3 siblings, 2 replies; 9+ messages in thread
From: Linus Torvalds @ 2023-08-24 15:02 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Andrew Morton, Eric W. Biederman, peterz, linux-kernel
On Thu, 24 Aug 2023 at 07:32, Oleg Nesterov <oleg@redhat.com> wrote:
>
> After document-while_each_thread-change-first_tid-to-use-for_each_thread.patch
> in mm tree + this series
Looking at your patch 2/2, I started looking at users ("Maybe we
*want* NULL for the end case, and make next_thread() and __next_thread
be the same?").
One of the main users is while_each_thread(), which certainly wants
that NULL case, both for an easier loop condition, but also because
the only user that uses the 't' pointer after the loop is
fs/proc/base.c, which wants it to be NULL.
And kernel/bpf/task_iter.c seems to *expect* NULL at the end?
End result: if you're changing next_thread() anyway, please just
change it to be a completely new thing that returns NULL at the end,
which is what everybody really seems to want, and don't add a new
__next_thread() helper. Ok?
Linus
^ permalink raw reply [flat|nested] 9+ messages in thread* Re: [PATCH 0/2] introduce __next_thread(), change next_thread()
2023-08-24 15:02 ` Linus Torvalds
@ 2023-08-24 15:47 ` Oleg Nesterov
2023-08-24 15:53 ` Oleg Nesterov
2023-08-25 13:00 ` Eric W. Biederman
1 sibling, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2023-08-24 15:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Eric W. Biederman, Peter Zijlstra, linux-kernel
On 08/24, Linus Torvalds wrote:
>
> On Thu, 24 Aug 2023 at 07:32, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > After document-while_each_thread-change-first_tid-to-use-for_each_thread.patch
> > in mm tree + this series
>
> Looking at your patch 2/2, I started looking at users ("Maybe we
> *want* NULL for the end case, and make next_thread() and __next_thread
> be the same?").
Yes, but see below.
> One of the main users is while_each_thread(), which certainly wants
> that NULL case, both for an easier loop condition,
No. Please note that, say,
do {
do_something(t);
} while_each_thread(current, t);
differs from for_each_thread() in that it loops starting from current,
not current->parent. I guess in most cases the order doesn't matter,
and I am going to audit the users and change them to use
for_each_thread() when possible.
Or,
while_each_thread(current, t)
do_something(t);
means do_something for every thread except current. And this have a
couple of valid users (say, zap_other_threads), but perhaps we can
change them too.
> but also because
> the only user that uses the 't' pointer after the loop is
> fs/proc/base.c, which wants it to be NULL.
Do you mean first_tid() ? Not only it is the only user that uses
the 't' pointer after the loop, it is the only user of lockless
while_each_thread() which (in general) is NOT rcu-safe.
But I have already changed it to use for_each_thread(), see
https://lore.kernel.org/all/20230823170806.GA11724@redhat.com/
This is
document-while_each_thread-change-first_tid-to-use-for_each_thread.patch
in mm tree.
> And kernel/bpf/task_iter.c seems to *expect* NULL at the end?
Yes! I think the same and I even documented this in 1/2.
To me this code looks simply wrong, but so far I don't understand
it enough. Currently I am trying to push the initial cleanups into
this code. See the
https://lore.kernel.org/all/20230821150909.GA2431@redhat.com/
thread.
> End result: if you're changing next_thread() anyway, please just
> change it to be a completely new thing that returns NULL at the end,
See above.
I'd prefer to audit/change the current users of while_each_thread()
and next_thread(), then (perhaps) kill while_each_thread() and/or
next_thread().
Oleg.
^ permalink raw reply [flat|nested] 9+ messages in thread* Re: [PATCH 0/2] introduce __next_thread(), change next_thread()
2023-08-24 15:47 ` Oleg Nesterov
@ 2023-08-24 15:53 ` Oleg Nesterov
0 siblings, 0 replies; 9+ messages in thread
From: Oleg Nesterov @ 2023-08-24 15:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Eric W. Biederman, Peter Zijlstra, linux-kernel
Argh...
sorry for noise,
On 08/24, Oleg Nesterov wrote:
>
> No. Please note that, say,
>
> do {
> do_something(t);
> } while_each_thread(current, t);
>
> differs from for_each_thread() in that it loops starting from current,
> not current->parent.
^^^^^^
just in case, of course I meant current->group_leader
Oleg.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] introduce __next_thread(), change next_thread()
2023-08-24 15:02 ` Linus Torvalds
2023-08-24 15:47 ` Oleg Nesterov
@ 2023-08-25 13:00 ` Eric W. Biederman
2023-08-25 13:37 ` Oleg Nesterov
1 sibling, 1 reply; 9+ messages in thread
From: Eric W. Biederman @ 2023-08-25 13:00 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Oleg Nesterov, Andrew Morton, peterz, linux-kernel
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Thu, 24 Aug 2023 at 07:32, Oleg Nesterov <oleg@redhat.com> wrote:
>>
>> After document-while_each_thread-change-first_tid-to-use-for_each_thread.patch
>> in mm tree + this series
>
> Looking at your patch 2/2, I started looking at users ("Maybe we
> *want* NULL for the end case, and make next_thread() and __next_thread
> be the same?").
>
> One of the main users is while_each_thread(), which certainly wants
> that NULL case, both for an easier loop condition, but also because
> the only user that uses the 't' pointer after the loop is
> fs/proc/base.c, which wants it to be NULL.
Sort of.
I have found 3 loops that want to loop through all of the threads of
a process starting with the current thread.
The loop in do_wait.
The loop finding the thread to signal in complete_signal.
The loop in retarget_shared_pending finding which threads
to wake up.
For the signal case that is just quality of implementation,
and starting somewhere else would just decrease that quality.
For the loop in do_wait it is a correctness issue that the code
starts first with the threads own children before looking for
children of other threads.
There are 4 users of next_thread outside of while_each_thread.
- next_tid -- wants NULL
- task_group_seq_get_next -- same as next_tid
- __exit_signal -- wants any thread on the list after __unhash_process
- complete_signal -- wants the same loop as do_wait.
> And kernel/bpf/task_iter.c seems to *expect* NULL at the end?
>
> End result: if you're changing next_thread() anyway, please just
> change it to be a completely new thing that returns NULL at the end,
> which is what everybody really seems to want, and don't add a new
> __next_thread() helper. Ok?
So I would say Oleg please build the helper that do_wait wants
and use it in do_wait, complete_signal, and retarget_shared_pending.
Change the rest of the loops can use for_each_thread (skipping
the current task if needed) or for_each_process_thread.
Change __exit_signal to use signal->group_leader instead of next_thread.
Change next_thread to be your __next_thread, and update the 2 callers
appropriately.
Eric
^ permalink raw reply [flat|nested] 9+ messages in thread* Re: [PATCH 0/2] introduce __next_thread(), change next_thread()
2023-08-25 13:00 ` Eric W. Biederman
@ 2023-08-25 13:37 ` Oleg Nesterov
0 siblings, 0 replies; 9+ messages in thread
From: Oleg Nesterov @ 2023-08-25 13:37 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Linus Torvalds, Andrew Morton, peterz, linux-kernel
On 08/25, Eric W. Biederman wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> > One of the main users is while_each_thread(), which certainly wants
> > that NULL case, both for an easier loop condition, but also because
> > the only user that uses the 't' pointer after the loop is
> > fs/proc/base.c, which wants it to be NULL.
>
> Sort of.
>
> I have found 3 loops that want to loop through all of the threads of
> a process starting with the current thread.
>
> The loop in do_wait.
> The loop finding the thread to signal in complete_signal.
> The loop in retarget_shared_pending finding which threads
> to wake up.
Yes, plus check_unsafe_exec() and zap_other_threads() which want to
skip the initial thread.
> > And kernel/bpf/task_iter.c seems to *expect* NULL at the end?
Yes. I'll (try to) send the patches today. This code needs cleanups
first.
> > End result: if you're changing next_thread() anyway, please just
> > change it to be a completely new thing that returns NULL at the end,
> > which is what everybody really seems to want, and don't add a new
> > __next_thread() helper. Ok?
>
> So I would say Oleg please build the helper that do_wait wants
> and use it in do_wait, complete_signal, and retarget_shared_pending.
Later. But so far I am not 100% sure this makes sense... I guess we
will need to discuss this again.
> Change the rest of the loops can use for_each_thread (skipping
> the current task if needed) or for_each_process_thread.
Yes, I was going to do this.
> Change next_thread to be your __next_thread, and update the 2 callers
> appropriately.
But I can't do this until I change the current users of next_thread()
and while_each_thread().
Oleg.
^ permalink raw reply [flat|nested] 9+ messages in thread