* [PATCH 1/2] fix the wrong comment on task_lock() nesting with tasklist_lock
@ 2025-09-14 11:09 Oleg Nesterov
2025-09-14 11:09 ` [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths Oleg Nesterov
2025-09-15 12:09 ` [PATCH v2 " Oleg Nesterov
0 siblings, 2 replies; 7+ messages in thread
From: Oleg Nesterov @ 2025-09-14 11:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christian Brauner, Jiri Slaby, Mateusz Guzik, linux-kernel
The ancient comment above task_lock() states that it can be nested outside
of read_lock(&tasklist_lock), but this is no longer true:
    CPU_0                   CPU_1                   CPU_2
    task_lock()             read_lock(tasklist)
                                                    write_lock_irq(tasklist)
    read_lock(tasklist)     task_lock()
Unless CPU_0 calls read_lock() in IRQ context, queued_read_lock_slowpath()
won't take the lock immediately; it will spin waiting for the pending writer
on CPU_2, resulting in a deadlock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/sched/task.h | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ea41795a352b..8ff98b18b24b 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -210,9 +210,8 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
* pins the final release of task.io_context. Also protects ->cpuset and
* ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
*
- * Nests both inside and outside of read_lock(&tasklist_lock).
- * It must not be nested with write_lock_irq(&tasklist_lock),
- * neither inside nor outside.
+ * Nests inside of read_lock(&tasklist_lock). It must not be nested with
+ * write_lock_irq(&tasklist_lock), neither inside nor outside.
*/
static inline void task_lock(struct task_struct *p)
{
--
2.25.1.362.g51ebf55
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
2025-09-14 11:09 [PATCH 1/2] fix the wrong comment on task_lock() nesting with tasklist_lock Oleg Nesterov
@ 2025-09-14 11:09 ` Oleg Nesterov
2025-09-14 17:48 ` Mateusz Guzik
2025-09-15 12:09 ` [PATCH v2 " Oleg Nesterov
1 sibling, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2025-09-14 11:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christian Brauner, Jiri Slaby, Mateusz Guzik, linux-kernel
The usage of task_lock(tsk->group_leader) in sys_prlimit64()->do_prlimit()
path is very broken.
sys_prlimit64() does get_task_struct(tsk) but this only protects task_struct
itself. If tsk != current and tsk is not a leader, this process can exit/exec
and task_lock(tsk->group_leader) may use the already freed task_struct.
Another problem is that sys_prlimit64() can race with mt-exec which changes
->group_leader. In this case do_prlimit() may take the wrong lock, or (worse)
->group_leader may change between task_lock() and task_unlock().
Change sys_prlimit64() to take tasklist_lock when necessary. This is not
nice, but I don't see a better fix for -stable.
Cc: stable@vger.kernel.org
Fixes: c022a0acad53 ("rlimits: implement prlimit64 syscall")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/sys.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/kernel/sys.c b/kernel/sys.c
index 1e28b40053ce..36d66ff41611 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1734,6 +1734,7 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
struct rlimit old, new;
struct task_struct *tsk;
unsigned int checkflags = 0;
+ bool need_tasklist;
int ret;
if (old_rlim)
@@ -1760,8 +1761,25 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
get_task_struct(tsk);
rcu_read_unlock();
- ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
- old_rlim ? &old : NULL);
+ need_tasklist = !same_thread_group(tsk, current);
+ if (need_tasklist) {
+ /*
+ * Ensure we can't race with group exit or de_thread(),
+ * so tsk->group_leader can't be freed or changed until
+ * read_unlock(tasklist_lock) below.
+ */
+ read_lock(&tasklist_lock);
+ if (!pid_alive(tsk))
+ ret = -ESRCH;
+ }
+
+ if (!ret) {
+ ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
+ old_rlim ? &old : NULL);
+ }
+
+ if (need_tasklist)
+ read_unlock(&tasklist_lock);
if (!ret && old_rlim) {
rlim_to_rlim64(&old, &old64);
--
2.25.1.362.g51ebf55
* Re: [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
2025-09-14 11:09 ` [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths Oleg Nesterov
@ 2025-09-14 17:48 ` Mateusz Guzik
2025-09-14 19:01 ` Oleg Nesterov
2025-09-15 12:05 ` Oleg Nesterov
0 siblings, 2 replies; 7+ messages in thread
From: Mateusz Guzik @ 2025-09-14 17:48 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Andrew Morton, Christian Brauner, Jiri Slaby, linux-kernel
On Sun, Sep 14, 2025 at 1:11 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> The usage of task_lock(tsk->group_leader) in sys_prlimit64()->do_prlimit()
> path is very broken.
>
> sys_prlimit64() does get_task_struct(tsk) but this only protects task_struct
> itself. If tsk != current and tsk is not a leader, this process can exit/exec
> and task_lock(tsk->group_leader) may use the already freed task_struct.
>
> Another problem is that sys_prlimit64() can race with mt-exec which changes
> ->group_leader. In this case do_prlimit() may take the wrong lock, or (worse)
> ->group_leader may change between task_lock() and task_unlock().
>
> Change sys_prlimit64() to take tasklist_lock when necessary. This is not
> nice, but I don't see a better fix for -stable.
>
> Cc: stable@vger.kernel.org
> Fixes: c022a0acad53 ("rlimits: implement prlimit64 syscall")
I think this is more accurate:
Fixes: 18c91bb2d872 ("prlimit: do not grab the tasklist_lock")
Unfortunately this syscall is used by glibc to get/set limits; the
good news is that almost all real-world calls (AFAICS) are made with
the calling task as the target. Performance-wise, this should not be
a regression, and I agree it is more than adequate for stable.
As for something more longterm, what would you think about
synchronizing changes with a lock within ->signal? Preferably for
reading (the most common use case) this would use sequence counters.
Bonus points for avoiding any task ref/lock manipulation if task ==
current (again the most common case in real-world usage).
signal_struct already has holes, so things can be rearranged so that
the struct would not grow above what it is now.
I had a patch somewhere to that effect that I could not be bothered to
finish; if this sounds like a plan to you, I may get around to it.
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
> kernel/sys.c | 22 ++++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 1e28b40053ce..36d66ff41611 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1734,6 +1734,7 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
> struct rlimit old, new;
> struct task_struct *tsk;
> unsigned int checkflags = 0;
> + bool need_tasklist;
> int ret;
>
> if (old_rlim)
> @@ -1760,8 +1761,25 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
> get_task_struct(tsk);
> rcu_read_unlock();
>
> - ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
> - old_rlim ? &old : NULL);
> + need_tasklist = !same_thread_group(tsk, current);
> + if (need_tasklist) {
> + /*
> + * Ensure we can't race with group exit or de_thread(),
> + * so tsk->group_leader can't be freed or changed until
> + * read_unlock(tasklist_lock) below.
> + */
> + read_lock(&tasklist_lock);
> + if (!pid_alive(tsk))
> + ret = -ESRCH;
> + }
> +
> + if (!ret) {
> + ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
> + old_rlim ? &old : NULL);
> + }
> +
> + if (need_tasklist)
> + read_unlock(&tasklist_lock);
>
> if (!ret && old_rlim) {
> rlim_to_rlim64(&old, &old64);
> --
> 2.25.1.362.g51ebf55
>
>
* Re: [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
2025-09-14 17:48 ` Mateusz Guzik
@ 2025-09-14 19:01 ` Oleg Nesterov
2025-09-15 12:05 ` Oleg Nesterov
1 sibling, 0 replies; 7+ messages in thread
From: Oleg Nesterov @ 2025-09-14 19:01 UTC (permalink / raw)
To: Mateusz Guzik; +Cc: Andrew Morton, Christian Brauner, Jiri Slaby, linux-kernel
Too late for me, I'll write another email tomorrow, just one note for now...
On 09/14, Mateusz Guzik wrote:
>
> On Sun, Sep 14, 2025 at 1:11 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Fixes: c022a0acad53 ("rlimits: implement prlimit64 syscall")
>
> I think this is more accurate:
> Fixes: 18c91bb2d872 ("prlimit: do not grab the tasklist_lock")
I'll re-check, but at first glance you are right... Thanks for taking
a look and correcting me!
Oleg.
* Re: [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
2025-09-14 17:48 ` Mateusz Guzik
2025-09-14 19:01 ` Oleg Nesterov
@ 2025-09-15 12:05 ` Oleg Nesterov
2025-09-15 13:19 ` Mateusz Guzik
1 sibling, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2025-09-15 12:05 UTC (permalink / raw)
To: Mateusz Guzik; +Cc: Andrew Morton, Christian Brauner, Jiri Slaby, linux-kernel
On 09/14, Mateusz Guzik wrote:
>
> On Sun, Sep 14, 2025 at 1:11 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Change sys_prlimit64() to take tasklist_lock when necessary. This is not
> > nice, but I don't see a better fix for -stable.
> >
> > Cc: stable@vger.kernel.org
> > Fixes: c022a0acad53 ("rlimits: implement prlimit64 syscall")
>
> I think this is more accurate:
> Fixes: 18c91bb2d872 ("prlimit: do not grab the tasklist_lock")
Yes, thanks again.
> Unfortunately this syscall is used by glibc to get/set limits; the
> good news is that almost all real-world calls (AFAICS) are made with
> the calling task as the target. Performance-wise, this should not be
> a regression, and I agree it is more than adequate for stable.
OK, good, I'll send v2 with the corrected "Fixes" tag.
> As for something more longterm, what would you think about
> synchronizing changes with a lock within ->signal?
Agreed, we should probably change the locking, but I am not sure a new
lock to protect just signal->rlim makes a lot of sense...
We can probably reuse signal->stats_lock, but it needs to disable IRQs.
Or ->siglock, but it is already overused.
I dunno. In any case we need to clean up the usage of ->group_leader;
it seems there are more buggy users. I'll try to take another look
this week. And probably it and ->real_parent should be moved to
signal_struct.
Thanks!
Oleg.
* [PATCH v2 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
2025-09-14 11:09 [PATCH 1/2] fix the wrong comment on task_lock() nesting with tasklist_lock Oleg Nesterov
2025-09-14 11:09 ` [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths Oleg Nesterov
@ 2025-09-15 12:09 ` Oleg Nesterov
1 sibling, 0 replies; 7+ messages in thread
From: Oleg Nesterov @ 2025-09-15 12:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Christian Brauner, Jiri Slaby, Mateusz Guzik, linux-kernel,
Barret Rhoden
The usage of task_lock(tsk->group_leader) in sys_prlimit64()->do_prlimit()
path is very broken.
sys_prlimit64() does get_task_struct(tsk) but this only protects task_struct
itself. If tsk != current and tsk is not a leader, this process can exit/exec
and task_lock(tsk->group_leader) may use the already freed task_struct.
Another problem is that sys_prlimit64() can race with mt-exec which changes
->group_leader. In this case do_prlimit() may take the wrong lock, or (worse)
->group_leader may change between task_lock() and task_unlock().
Change sys_prlimit64() to take tasklist_lock when necessary. This is not
nice, but I don't see a better fix for -stable.
Cc: stable@vger.kernel.org
Fixes: 18c91bb2d872 ("prlimit: do not grab the tasklist_lock")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/sys.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/kernel/sys.c b/kernel/sys.c
index 1e28b40053ce..36d66ff41611 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1734,6 +1734,7 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
struct rlimit old, new;
struct task_struct *tsk;
unsigned int checkflags = 0;
+ bool need_tasklist;
int ret;
if (old_rlim)
@@ -1760,8 +1761,25 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
get_task_struct(tsk);
rcu_read_unlock();
- ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
- old_rlim ? &old : NULL);
+ need_tasklist = !same_thread_group(tsk, current);
+ if (need_tasklist) {
+ /*
+ * Ensure we can't race with group exit or de_thread(),
+ * so tsk->group_leader can't be freed or changed until
+ * read_unlock(tasklist_lock) below.
+ */
+ read_lock(&tasklist_lock);
+ if (!pid_alive(tsk))
+ ret = -ESRCH;
+ }
+
+ if (!ret) {
+ ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
+ old_rlim ? &old : NULL);
+ }
+
+ if (need_tasklist)
+ read_unlock(&tasklist_lock);
if (!ret && old_rlim) {
rlim_to_rlim64(&old, &old64);
--
2.25.1.362.g51ebf55
* Re: [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
2025-09-15 12:05 ` Oleg Nesterov
@ 2025-09-15 13:19 ` Mateusz Guzik
0 siblings, 0 replies; 7+ messages in thread
From: Mateusz Guzik @ 2025-09-15 13:19 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Andrew Morton, Christian Brauner, Jiri Slaby, linux-kernel
On Mon, Sep 15, 2025 at 2:07 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 09/14, Mateusz Guzik wrote:
> > As for something more longterm, what would you think about
> > synchronizing changes with a lock within ->signal?
>
> Agreed, we should probably change the locking, but I am not a new lock
> to protect just signal->rlim makes a lot of sense...
>
> We can probably reuse signal->stats_lock, but it needs to disable IRQs.
> Or ->siglock, but it is already overused.
>
The woes from a lock from a different structure aside, I find it
pretty weird that reading rlimits requires taking a lock to begin
with.
I'm not going to argue for a new dedicated lock. I do argue for
sequence counters to read this. With this in place even the
irq-shafted lock won't be much of a problem.
Again I can write a patch to that effect, but no ETA.
> I dunno. In any case we need to cleanup the usage of ->group_leader,
> it seems there are more buggy users. I'll try to take another look
> this week. And probably it and ->real_parent should be moved to
> signal_struct.
>
I got myself into some vfs-related stuff, so I'm going to spare a rant
about linux not having a 'struct process' or similar. ;->