From: Oleg Nesterov <oleg@redhat.com>
To: Andrew Morton
Cc: Christian Brauner, Jiri Slaby, Mateusz Guzik, linux-kernel
Date: 2025-09-14 11:09 UTC
Subject: [PATCH 1/2] fix the wrong comment on task_lock() nesting with tasklist_lock

The ancient comment above task_lock() states that it can be nested outside
of read_lock(&tasklist_lock), but this is no longer true:

	CPU_0			CPU_1			CPU_2

	task_lock()
				read_lock(tasklist)
							write_lock_irq(tasklist)
	read_lock(tasklist)
				task_lock()

Unless CPU_0 calls read_lock() in IRQ context, queued_read_lock_slowpath()
won't get the lock immediately; it will spin waiting for the pending writer
on CPU_2, resulting in a deadlock.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/sched/task.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ea41795a352b..8ff98b18b24b 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -210,9 +210,8 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
  * pins the final release of task.io_context. Also protects ->cpuset and
  * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
  *
- * Nests both inside and outside of read_lock(&tasklist_lock).
- * It must not be nested with write_lock_irq(&tasklist_lock),
- * neither inside nor outside.
+ * Nests inside of read_lock(&tasklist_lock). It must not be nested with
+ * write_lock_irq(&tasklist_lock), neither inside nor outside.
  */
 static inline void task_lock(struct task_struct *p)
 {
-- 
2.25.1.362.g51ebf55
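[Editorial note: the three-CPU race in the changelog above hinges on queued rwlock fairness — once a writer is pending, new readers queue behind it unless they run in IRQ context. A toy model (plain Python, not kernel code; all class and variable names are invented for illustration) makes the resulting wait cycle explicit:]

```python
# Toy model of a fair (queued) rwlock plus a spinlock, mimicking only the
# admission rules relevant to the deadlock, not the real kernel structures.

class FairRWLock:
    def __init__(self):
        self.readers = set()        # CPUs currently holding the read lock
        self.writer_pending = False

    def try_read_lock(self, cpu, in_irq=False):
        # Like queued_read_lock_slowpath(): a new reader may only bypass
        # a pending writer when it runs in IRQ context.
        if self.writer_pending and not in_irq:
            return False            # must spin behind the queued writer
        self.readers.add(cpu)
        return True

    def request_write_lock(self):
        # The writer queues; it is granted only once all readers are gone.
        self.writer_pending = True
        return not self.readers

class SpinLock:
    def __init__(self):
        self.owner = None
    def try_lock(self, cpu):
        if self.owner is None:
            self.owner = cpu
            return True
        return False

tasklist = FairRWLock()
task = SpinLock()

assert task.try_lock("CPU_0")               # CPU_0: task_lock()
assert tasklist.try_read_lock("CPU_1")      # CPU_1: read_lock(tasklist)
assert not tasklist.request_write_lock()    # CPU_2: write_lock_irq() queues
assert not tasklist.try_read_lock("CPU_0")  # CPU_0 spins: writer pending
assert not task.try_lock("CPU_1")           # CPU_1 spins: CPU_0 holds task_lock
```

[CPU_0 waits on CPU_2's pending write, CPU_2 waits on CPU_1's read lock, and CPU_1 waits on CPU_0's task_lock() — a cycle, hence the deadlock, unless CPU_0's read_lock() runs in IRQ context and takes the unfair path.]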
From: Oleg Nesterov <oleg@redhat.com>
To: Andrew Morton
Cc: Christian Brauner, Jiri Slaby, Mateusz Guzik, linux-kernel
Date: 2025-09-14 11:09 UTC
Subject: [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths

The usage of task_lock(tsk->group_leader) in the sys_prlimit64()->do_prlimit()
path is very broken.

sys_prlimit64() does get_task_struct(tsk) but this only protects task_struct
itself. If tsk != current and tsk is not a leader, this process can exit/exec
and task_lock(tsk->group_leader) may use the already freed task_struct.

Another problem is that sys_prlimit64() can race with mt-exec which changes
->group_leader. In this case do_prlimit() may take the wrong lock, or (worse)
->group_leader may change between task_lock() and task_unlock().

Change sys_prlimit64() to take tasklist_lock when necessary. This is not
nice, but I don't see a better fix for -stable.

Cc: stable@vger.kernel.org
Fixes: c022a0acad53 ("rlimits: implement prlimit64 syscall")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sys.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 1e28b40053ce..36d66ff41611 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1734,6 +1734,7 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
 	struct rlimit old, new;
 	struct task_struct *tsk;
 	unsigned int checkflags = 0;
+	bool need_tasklist;
 	int ret;
 
 	if (old_rlim)
@@ -1760,8 +1761,25 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
 	get_task_struct(tsk);
 	rcu_read_unlock();
 
-	ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
-			old_rlim ? &old : NULL);
+	need_tasklist = !same_thread_group(tsk, current);
+	if (need_tasklist) {
+		/*
+		 * Ensure we can't race with group exit or de_thread(),
+		 * so tsk->group_leader can't be freed or changed until
+		 * read_unlock(tasklist_lock) below.
+		 */
+		read_lock(&tasklist_lock);
+		if (!pid_alive(tsk))
+			ret = -ESRCH;
+	}
+
+	if (!ret) {
+		ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
+				old_rlim ? &old : NULL);
+	}
+
+	if (need_tasklist)
+		read_unlock(&tasklist_lock);
 
 	if (!ret && old_rlim) {
 		rlim_to_rlim64(&old, &old64);
-- 
2.25.1.362.g51ebf55
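[Editorial note: the changelog's first problem — get_task_struct(tsk) pinning only tsk itself, never tsk->group_leader — can be sketched with a toy refcount model (illustrative Python, not kernel code; the helpers only mimic the lifetime rules):]

```python
# Toy model: a task holds a plain pointer to its group leader, so pinning
# the task does nothing for the leader's lifetime.

class Task:
    def __init__(self, pid, leader=None):
        self.pid = pid
        self.refcount = 1
        self.freed = False
        self.group_leader = leader or self  # plain pointer, not a reference

def get_task_struct(t):
    t.refcount += 1

def put_task_struct(t):
    t.refcount -= 1
    if t.refcount == 0:
        t.freed = True

leader = Task(100)
thread = Task(101, leader=leader)

get_task_struct(thread)           # sys_prlimit64() pins the target thread...
put_task_struct(leader)           # ...but group exit/exec can still drop the
                                  # last reference on the old leader
assert not thread.freed           # our reference keeps the thread alive
assert thread.group_leader.freed  # yet task_lock(tsk->group_leader) would be
                                  # a use-after-free
```

[Taking read_lock(&tasklist_lock) plus the pid_alive() check, as the patch does, rules out exactly this window: neither group exit nor de_thread() can free or change the leader while the lock is held.]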
From: Mateusz Guzik
To: Oleg Nesterov
Cc: Andrew Morton, Christian Brauner, Jiri Slaby, linux-kernel
Date: 2025-09-14 17:48 UTC
Subject: Re: [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths

On Sun, Sep 14, 2025 at 1:11 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> The usage of task_lock(tsk->group_leader) in the sys_prlimit64()->do_prlimit()
> path is very broken.
>
> sys_prlimit64() does get_task_struct(tsk) but this only protects task_struct
> itself. If tsk != current and tsk is not a leader, this process can exit/exec
> and task_lock(tsk->group_leader) may use the already freed task_struct.
>
> Another problem is that sys_prlimit64() can race with mt-exec which changes
> ->group_leader. In this case do_prlimit() may take the wrong lock, or (worse)
> ->group_leader may change between task_lock() and task_unlock().
>
> Change sys_prlimit64() to take tasklist_lock when necessary. This is not
> nice, but I don't see a better fix for -stable.
>
> Cc: stable@vger.kernel.org
> Fixes: c022a0acad53 ("rlimits: implement prlimit64 syscall")

I think this is more accurate:
Fixes: 18c91bb2d872 ("prlimit: do not grab the tasklist_lock")

Unfortunately this syscall is used by glibc to get/set limits; the good
news is that almost all real-world calls (AFAICS) are made with the
calling task as the target. As in, performance-wise this should not be
a regression, and I agree it is more than adequate for stable.

As for something more long-term, what would you think about synchronizing
changes with a lock within ->signal? Preferably, reading (the most common
use case) would use sequence counters. Bonus points for avoiding any task
ref/lock manipulation if task == current (again the most common case in
real-world usage). signal_struct already has holes, so things can be
rearranged so that the struct would not grow above what it is now.

I had a patch somewhere to that extent I could not be bothered to finish;
if this sounds like a plan to you, I may get around to it.

> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
> [... patch quoted in full, snipped ...]
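[Editorial note: the sequence-counter idea Mateusz floats above can be illustrated with a minimal seqcount-style reader/writer (a hedged sketch of the general pattern in Python, not the kernel's seqlock API; all names are invented):]

```python
# Sketch of the seqlock pattern: writers bump an even/odd sequence around
# rlimit updates; lockless readers retry if they observed a torn update.

import threading

class SeqCount:
    def __init__(self):
        self.seq = 0
        self._write_lock = threading.Lock()  # stand-in for the ->signal lock

    def write(self, fn):
        with self._write_lock:
            self.seq += 1          # odd: update in progress
            fn()
            self.seq += 1          # even again: update complete

    def read(self, fn):
        while True:
            start = self.seq
            if start & 1:
                continue           # writer in progress, retry
            val = fn()
            if self.seq == start:  # no writer raced us, value is consistent
                return val

rlim = {"cur": 1024, "max": 4096}
sc = SeqCount()

def bump():
    rlim["cur"] = 2048

sc.write(bump)
assert sc.read(lambda: dict(rlim)) == {"cur": 2048, "max": 4096}
```

[With this shape, the common read path takes no lock at all, which is why the IRQ-disabling cost of whichever write-side lock is chosen matters much less.]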
From: Oleg Nesterov
To: Mateusz Guzik
Cc: Andrew Morton, Christian Brauner, Jiri Slaby, linux-kernel
Date: 2025-09-14 19:01 UTC
Subject: Re: [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths

Too late for me, I'll write another email tomorrow, just one note for now...

On 09/14, Mateusz Guzik wrote:
>
> On Sun, Sep 14, 2025 at 1:11 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Fixes: c022a0acad53 ("rlimits: implement prlimit64 syscall")
>
> I think this is more accurate:
> Fixes: 18c91bb2d872 ("prlimit: do not grab the tasklist_lock")

I'll re-check, but at first glance you are right... Thanks for taking
a look and correcting me!

Oleg.
From: Oleg Nesterov
To: Mateusz Guzik
Cc: Andrew Morton, Christian Brauner, Jiri Slaby, linux-kernel
Date: 2025-09-15 12:05 UTC
Subject: Re: [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths

On 09/14, Mateusz Guzik wrote:
>
> On Sun, Sep 14, 2025 at 1:11 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Change sys_prlimit64() to take tasklist_lock when necessary. This is not
> > nice, but I don't see a better fix for -stable.
> >
> > Cc: stable@vger.kernel.org
> > Fixes: c022a0acad53 ("rlimits: implement prlimit64 syscall")
>
> I think this is more accurate:
> Fixes: 18c91bb2d872 ("prlimit: do not grab the tasklist_lock")

Yes, thanks again.

> Unfortunately this syscall is used by glibc to get/set limits; the good
> news is that almost all real-world calls (AFAICS) are made with the
> calling task as the target. As in, performance-wise this should not be
> a regression, and I agree it is more than adequate for stable.

OK, good, I'll send v2 with the corrected "Fixes" tag.

> As for something more long-term, what would you think about
> synchronizing changes with a lock within ->signal?

Agreed, we should probably change the locking, but I am not sure a new
lock to protect just signal->rlim makes a lot of sense...

We can probably reuse signal->stats_lock, but it needs to disable IRQs.
Or ->siglock, but it is already overused. I dunno.

In any case we need to clean up the usage of ->group_leader, it seems
there are more buggy users. I'll try to take another look this week.
And probably it and ->real_parent should be moved to signal_struct.

Thanks!

Oleg.
From: Mateusz Guzik
To: Oleg Nesterov
Cc: Andrew Morton, Christian Brauner, Jiri Slaby, linux-kernel
Date: 2025-09-15 13:19 UTC
Subject: Re: [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths

On Mon, Sep 15, 2025 at 2:07 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 09/14, Mateusz Guzik wrote:
> > As for something more long-term, what would you think about
> > synchronizing changes with a lock within ->signal?
>
> Agreed, we should probably change the locking, but I am not sure a new
> lock to protect just signal->rlim makes a lot of sense...
>
> We can probably reuse signal->stats_lock, but it needs to disable IRQs.
> Or ->siglock, but it is already overused.

The woes of a lock from a different structure aside, I find it pretty
weird that reading rlimits requires taking a lock to begin with.

I'm not going to argue for a new dedicated lock. I do argue for sequence
counters to read this. With those in place, even the irq-shafted lock
won't be much of a problem. Again, I can write a patch to that extent,
but no ETA.

> I dunno. In any case we need to clean up the usage of ->group_leader,
> it seems there are more buggy users. I'll try to take another look
> this week. And probably it and ->real_parent should be moved to
> signal_struct.

I got myself into some vfs-related stuff, so I'm going to spare you a
rant about Linux not having a 'struct process' or similar. ;->
From: Oleg Nesterov <oleg@redhat.com>
To: Andrew Morton
Cc: Christian Brauner, Jiri Slaby, Mateusz Guzik, linux-kernel, Barret Rhoden
Date: 2025-09-15 12:09 UTC
Subject: [PATCH v2 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths

The usage of task_lock(tsk->group_leader) in the sys_prlimit64()->do_prlimit()
path is very broken.

sys_prlimit64() does get_task_struct(tsk) but this only protects task_struct
itself. If tsk != current and tsk is not a leader, this process can exit/exec
and task_lock(tsk->group_leader) may use the already freed task_struct.

Another problem is that sys_prlimit64() can race with mt-exec which changes
->group_leader. In this case do_prlimit() may take the wrong lock, or (worse)
->group_leader may change between task_lock() and task_unlock().

Change sys_prlimit64() to take tasklist_lock when necessary. This is not
nice, but I don't see a better fix for -stable.

Cc: stable@vger.kernel.org
Fixes: 18c91bb2d872 ("prlimit: do not grab the tasklist_lock")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sys.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 1e28b40053ce..36d66ff41611 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1734,6 +1734,7 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
 	struct rlimit old, new;
 	struct task_struct *tsk;
 	unsigned int checkflags = 0;
+	bool need_tasklist;
 	int ret;
 
 	if (old_rlim)
@@ -1760,8 +1761,25 @@ SYSCALL_DEFINE4(prlimit64, pid_t, pid, unsigned int, resource,
 	get_task_struct(tsk);
 	rcu_read_unlock();
 
-	ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
-			old_rlim ? &old : NULL);
+	need_tasklist = !same_thread_group(tsk, current);
+	if (need_tasklist) {
+		/*
+		 * Ensure we can't race with group exit or de_thread(),
+		 * so tsk->group_leader can't be freed or changed until
+		 * read_unlock(tasklist_lock) below.
+		 */
+		read_lock(&tasklist_lock);
+		if (!pid_alive(tsk))
+			ret = -ESRCH;
+	}
+
+	if (!ret) {
+		ret = do_prlimit(tsk, resource, new_rlim ? &new : NULL,
+				old_rlim ? &old : NULL);
+	}
+
+	if (need_tasklist)
+		read_unlock(&tasklist_lock);
 
 	if (!ret && old_rlim) {
 		rlim_to_rlim64(&old, &old64);
-- 
2.25.1.362.g51ebf55
Thread overview: 7+ messages (newest: 2025-09-15 13:19 UTC)

2025-09-14 11:09 [PATCH 1/2] fix the wrong comment on task_lock() nesting with tasklist_lock -- Oleg Nesterov
2025-09-14 11:09 ` [PATCH 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths -- Oleg Nesterov
2025-09-14 17:48   ` Re: -- Mateusz Guzik
2025-09-14 19:01     ` Re: -- Oleg Nesterov
2025-09-15 12:05     ` Re: -- Oleg Nesterov
2025-09-15 13:19       ` Re: -- Mateusz Guzik
2025-09-15 12:09 ` [PATCH v2 2/2] fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths -- Oleg Nesterov