public inbox for linux-kernel@vger.kernel.org
* [PATCH 1/2] fix the wrong comment on task_lock() nesting with tasklist_lock
@ 2025-09-14 11:09 Oleg Nesterov
  2025-09-14 11:09 ` [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths Oleg Nesterov
  2025-09-15 12:09 ` [PATCH v2 " Oleg Nesterov
  0 siblings, 2 replies; 7+ messages in thread
From: Oleg Nesterov @ 2025-09-14 11:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christian Brauner, Jiri Slaby, Mateusz Guzik, linux-kernel

The ancient comment above task_lock() states that it can be nested outside
of read_lock(&tasklist_lock), but this is no longer true:

  CPU_0			CPU_1			CPU_2

  task_lock()		read_lock(tasklist)
  						write_lock_irq(tasklist)
  read_lock(tasklist)	task_lock()

Unless CPU_0 calls read_lock() in IRQ context, queued_read_lock_slowpath()
won't take the lock immediately; it will spin waiting for the pending writer
on CPU_2, resulting in a deadlock.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/sched/task.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ea41795a352b..8ff98b18b24b 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -210,9 +210,8 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
  * pins the final release of task.io_context.  Also protects ->cpuset and
  * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
  *
- * Nests both inside and outside of read_lock(&tasklist_lock).
- * It must not be nested with write_lock_irq(&tasklist_lock),
- * neither inside nor outside.
+ * Nests inside of read_lock(&tasklist_lock). It must not be nested with
+ * write_lock_irq(&tasklist_lock), neither inside nor outside.
  */
 static inline void task_lock(struct task_struct *p)
 {
-- 
2.25.1.362.g51ebf55



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
  2025-09-14 11:09 [PATCH 1/2] fix the wrong comment on task_lock() nesting with tasklist_lock Oleg Nesterov
@ 2025-09-14 11:09 ` Oleg Nesterov
  2025-09-14 17:48   ` Mateusz Guzik
  2025-09-15 12:09 ` [PATCH v2 " Oleg Nesterov
  1 sibling, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2025-09-14 11:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christian Brauner, Jiri Slaby, Mateusz Guzik, linux-kernel

The usage of task_lock(tsk->group_leader) in sys_prlimit64()->do_prlimit()
path is very broken.

sys_prlimit64() does get_task_struct(tsk) but this only protects task_struct
itself. If tsk != current and tsk is not a leader, this process can exit/exec
and task_lock(tsk->group_leader) may use the already freed task_struct.

Another problem is that sys_prlimit64() can race with mt-exec which changes
->group_leader. In this case do_prlimit() may take the wrong lock, or (worse)
->group_leader may change between task_lock() and task_unlock().

Change sys_prlimit64() to take tasklist_lock when necessary. This is not
nice, but I don't see a better fix for -stable.

Cc: stable@vger.kernel.org
Fixes: c022a0acad53 ("rlimits: implement prlimit64 syscall")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sys.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 1e28b40053ce..36d66ff41611 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1734,6 +1734,7 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
 	struct rlimit old, new;
 	struct task_struct *tsk;
 	unsigned int checkflags = 0;
+	bool need_tasklist;
 	int ret;
 
 	if (old_rlim)
@@ -1760,8 +1761,25 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
 	get_task_struct(tsk);
 	rcu_read_unlock();
 
-	ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
-			old_rlim ? &old : NULL);
+	need_tasklist = !same_thread_group(tsk, current);
+	if (need_tasklist) {
+		/*
+		 * Ensure we can't race with group exit or de_thread(),
+		 * so tsk->group_leader can't be freed or changed until
+		 * read_unlock(tasklist_lock) below.
+		 */
+		read_lock(&tasklist_lock);
+		if (!pid_alive(tsk))
+			ret = -ESRCH;
+	}
+
+	if (!ret) {
+		ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
+				old_rlim ? &old : NULL);
+	}
+
+	if (need_tasklist)
+		read_unlock(&tasklist_lock);
 
 	if (!ret && old_rlim) {
 		rlim_to_rlim64(&old, &old64);
-- 
2.25.1.362.g51ebf55




* Re: [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
  2025-09-14 11:09 ` [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths Oleg Nesterov
@ 2025-09-14 17:48   ` Mateusz Guzik
  2025-09-14 19:01     ` Oleg Nesterov
  2025-09-15 12:05     ` Oleg Nesterov
  0 siblings, 2 replies; 7+ messages in thread
From: Mateusz Guzik @ 2025-09-14 17:48 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andrew Morton, Christian Brauner, Jiri Slaby, linux-kernel

On Sun, Sep 14, 2025 at 1:11 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> The usage of task_lock(tsk->group_leader) in sys_prlimit64()->do_prlimit()
> path is very broken.
>
> sys_prlimit64() does get_task_struct(tsk) but this only protects task_struct
> itself. If tsk != current and tsk is not a leader, this process can exit/exec
> and task_lock(tsk->group_leader) may use the already freed task_struct.
>
> Another problem is that sys_prlimit64() can race with mt-exec which changes
> ->group_leader. In this case do_prlimit() may take the wrong lock, or (worse)
> ->group_leader may change between task_lock() and task_unlock().
>
> Change sys_prlimit64() to take tasklist_lock when necessary. This is not
> nice, but I don't see a better fix for -stable.
>
> Cc: stable@vger.kernel.org
> Fixes: c022a0acad53 ("rlimits: implement prlimit64 syscall")

I think this is more accurate:
Fixes: 18c91bb2d872 ("prlimit: do not grab the tasklist_lock")

Unfortunately this syscall is used by glibc to get/set limits, the
good news is that almost all real-world calls (AFAICS) with the
calling task as the target. As in, performance-wise, this should not
be a regression and I agree it is more than adequate for stable.

As for something more long-term, what would you think about
synchronizing changes with a lock within ->signal? Preferably,
reading (the most common use case) would use sequence counters.
Bonus points for avoiding any task ref/lock manipulation if task ==
current (again the most common case in real-world usage).
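Roughly, the seqcount scheme could look like the userspace sketch below.
All names are hypothetical (this is not existing kernel API), and the
memory ordering is simplified relative to the kernel's seqlock_t:

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Sketch of seqcount-protected rlimits: writers (still serialized by
 * some lock, e.g. ->siglock) bump the counter around the update;
 * readers retry if they may have observed a torn value.
 */
struct toy_rlimit { long rlim_cur, rlim_max; };

struct toy_signal {
	atomic_uint	  rlim_seq;	/* even = stable, odd = write in progress */
	struct toy_rlimit rlim;
};

static void toy_write_rlimit(struct toy_signal *sig, struct toy_rlimit new)
{
	atomic_fetch_add_explicit(&sig->rlim_seq, 1, memory_order_release);
	sig->rlim = new;
	atomic_fetch_add_explicit(&sig->rlim_seq, 1, memory_order_release);
}

static struct toy_rlimit toy_read_rlimit(struct toy_signal *sig)
{
	struct toy_rlimit val;
	unsigned int seq;

	do {
		seq = atomic_load_explicit(&sig->rlim_seq,
					   memory_order_acquire);
		val = sig->rlim;
		/* retry on an odd (in-progress) or changed counter */
	} while ((seq & 1) ||
		 seq != atomic_load_explicit(&sig->rlim_seq,
					     memory_order_acquire));
	return val;
}
```

With something like this, the common read path takes no lock at all, so
even an irq-disabling write-side lock would only be paid on the rare
setrlimit path.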

signal_struct already has holes, so things can be rearranged such that
the struct does not grow beyond its current size.

I had a patch somewhere to that effect that I could not be bothered to
finish; if this sounds like a plan to you, I may get around to it.


> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  kernel/sys.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 1e28b40053ce..36d66ff41611 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1734,6 +1734,7 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
>         struct rlimit old, new;
>         struct task_struct *tsk;
>         unsigned int checkflags = 0;
> +       bool need_tasklist;
>         int ret;
>
>         if (old_rlim)
> @@ -1760,8 +1761,25 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
>         get_task_struct(tsk);
>         rcu_read_unlock();
>
> -       ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
> -                       old_rlim ? &old : NULL);
> +       need_tasklist = !same_thread_group(tsk, current);
> +       if (need_tasklist) {
> +               /*
> +                * Ensure we can't race with group exit or de_thread(),
> +                * so tsk->group_leader can't be freed or changed until
> +                * read_unlock(tasklist_lock) below.
> +                */
> +               read_lock(&tasklist_lock);
> +               if (!pid_alive(tsk))
> +                       ret = -ESRCH;
> +       }
> +
> +       if (!ret) {
> +               ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
> +                               old_rlim ? &old : NULL);
> +       }
> +
> +       if (need_tasklist)
> +               read_unlock(&tasklist_lock);
>
>         if (!ret && old_rlim) {
>                 rlim_to_rlim64(&old, &old64);
> --
> 2.25.1.362.g51ebf55
>
>


* Re: [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
  2025-09-14 17:48   ` Mateusz Guzik
@ 2025-09-14 19:01     ` Oleg Nesterov
  2025-09-15 12:05     ` Oleg Nesterov
  1 sibling, 0 replies; 7+ messages in thread
From: Oleg Nesterov @ 2025-09-14 19:01 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: Andrew Morton, Christian Brauner, Jiri Slaby, linux-kernel

Too late for me, I'll write another email tomorrow, just one note for now...

On 09/14, Mateusz Guzik wrote:
>
> On Sun, Sep 14, 2025 at 1:11 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Fixes: c022a0acad53 ("rlimits: implement prlimit64 syscall")
>
> I think this is more accurate:
> Fixes: 18c91bb2d872 ("prlimit: do not grab the tasklist_lock")

I'll re-check, but at first glance you are right... Thanks for taking
a look and correcting me!

Oleg.



* Re: [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
  2025-09-14 17:48   ` Mateusz Guzik
  2025-09-14 19:01     ` Oleg Nesterov
@ 2025-09-15 12:05     ` Oleg Nesterov
  2025-09-15 13:19       ` Mateusz Guzik
  1 sibling, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2025-09-15 12:05 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: Andrew Morton, Christian Brauner, Jiri Slaby, linux-kernel

On 09/14, Mateusz Guzik wrote:
>
> On Sun, Sep 14, 2025 at 1:11 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Change sys_prlimit64() to take tasklist_lock when necessary. This is not
> > nice, but I don't see a better fix for -stable.
> >
> > Cc: stable@vger.kernel.org
> > Fixes: c022a0acad53 ("rlimits: implement prlimit64 syscall")
>
> I think this is more accurate:
> Fixes: 18c91bb2d872 ("prlimit: do not grab the tasklist_lock")

Yes, thanks again.

> Unfortunately this syscall is used by glibc to get/set limits, the
> good news is that almost all real-world calls (AFAICS) with the
> calling task as the target. As in, performance-wise, this should not
> be a regression and I agree it is more than adequate for stable.

OK, good, I'll send v2 with the corrected "Fixes" tag.

> As for something more longterm, what would you think about
> synchronizing changes with a lock within ->signal?

Agreed, we should probably change the locking, but I am not sure a new
lock to protect just signal->rlim makes a lot of sense...

We can probably reuse signal->stats_lock, but it needs to disable IRQs.
Or ->siglock, but it is already overused.

I dunno. In any case we need to clean up the usage of ->group_leader;
it seems there are more buggy users. I'll try to take another look
this week. And probably it and ->real_parent should be moved to
signal_struct.

Thanks!

Oleg.



* [PATCH v2 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
  2025-09-14 11:09 [PATCH 1/2] fix the wrong comment on task_lock() nesting with tasklist_lock Oleg Nesterov
  2025-09-14 11:09 ` [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths Oleg Nesterov
@ 2025-09-15 12:09 ` Oleg Nesterov
  1 sibling, 0 replies; 7+ messages in thread
From: Oleg Nesterov @ 2025-09-15 12:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Brauner, Jiri Slaby, Mateusz Guzik, linux-kernel,
	Barret Rhoden

The usage of task_lock(tsk->group_leader) in sys_prlimit64()->do_prlimit()
path is very broken.

sys_prlimit64() does get_task_struct(tsk) but this only protects task_struct
itself. If tsk != current and tsk is not a leader, this process can exit/exec
and task_lock(tsk->group_leader) may use the already freed task_struct.

Another problem is that sys_prlimit64() can race with mt-exec which changes
->group_leader. In this case do_prlimit() may take the wrong lock, or (worse)
->group_leader may change between task_lock() and task_unlock().

Change sys_prlimit64() to take tasklist_lock when necessary. This is not
nice, but I don't see a better fix for -stable.

Cc: stable@vger.kernel.org
Fixes: 18c91bb2d872 ("prlimit: do not grab the tasklist_lock")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sys.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 1e28b40053ce..36d66ff41611 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1734,6 +1734,7 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
 	struct rlimit old, new;
 	struct task_struct *tsk;
 	unsigned int checkflags = 0;
+	bool need_tasklist;
 	int ret;
 
 	if (old_rlim)
@@ -1760,8 +1761,25 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
 	get_task_struct(tsk);
 	rcu_read_unlock();
 
-	ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
-			old_rlim ? &old : NULL);
+	need_tasklist = !same_thread_group(tsk, current);
+	if (need_tasklist) {
+		/*
+		 * Ensure we can't race with group exit or de_thread(),
+		 * so tsk->group_leader can't be freed or changed until
+		 * read_unlock(tasklist_lock) below.
+		 */
+		read_lock(&tasklist_lock);
+		if (!pid_alive(tsk))
+			ret = -ESRCH;
+	}
+
+	if (!ret) {
+		ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
+				old_rlim ? &old : NULL);
+	}
+
+	if (need_tasklist)
+		read_unlock(&tasklist_lock);
 
 	if (!ret && old_rlim) {
 		rlim_to_rlim64(&old, &old64);
-- 
2.25.1.362.g51ebf55




* Re: [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
  2025-09-15 12:05     ` Oleg Nesterov
@ 2025-09-15 13:19       ` Mateusz Guzik
  0 siblings, 0 replies; 7+ messages in thread
From: Mateusz Guzik @ 2025-09-15 13:19 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andrew Morton, Christian Brauner, Jiri Slaby, linux-kernel

On Mon, Sep 15, 2025 at 2:07 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 09/14, Mateusz Guzik wrote:
> > As for something more longterm, what would you think about
> > synchronizing changes with a lock within ->signal?
>
> Agreed, we should probably change the locking, but I am not sure a new
> lock to protect just signal->rlim makes a lot of sense...
>
> We can probably reuse signal->stats_lock, but it needs to disable IRQs.
> Or ->siglock, but it is already overused.
>

The woes of borrowing a lock from a different structure aside, I find
it pretty weird that reading rlimits requires taking a lock to begin
with.

I'm not going to argue for a new dedicated lock. I do argue for
sequence counters to read this. With this in place, even the
irq-shafted lock won't be much of a problem.

Again, I can write a patch to that effect, but no ETA.

> I dunno. In any case we need to cleanup the usage of ->group_leader,
> it seems there are more buggy users. I'll try to take another look
> this week. And probably it and ->real_parent should be moved to
> signal_struct.
>

I got myself into some vfs-related stuff, so I'm going to spare a rant
about linux not having a 'struct process' or similar. ;->

