[PATCH] kernel/sys: Optimize do_prlimit lock scope to reduce contention

public inbox for linux-kernel@vger.kernel.org

* [PATCH] kernel/sys: Optimize do_prlimit lock scope to reduce contention
@ 2024-11-20 13:21 Zhen Ni
  2024-11-28  1:45 ` Andrew Morton
  0 siblings, 1 reply; 5+ messages in thread
From: Zhen Ni @ 2024-11-20 13:21 UTC (permalink / raw)
  To: akpm, viro, oleg, catalin.marinas, brauner, zev; +Cc: linux-kernel, Zhen Ni

Narrow the lock scope in do_prlimit() to reduce contention on
task_lock(tsk->group_leader). The lock now protects only the sections
that access or modify the shared rlim data. The permission check
(capable()) and the security check (security_task_setrlimit()) are
moved outside the lock, as they do not modify rlim and are independent
of shared data protection.

security_task_setrlimit() is a Linux Security Module (LSM) hook that
evaluates resource limit changes against security policies. It does
not alter the rlim data structure, as confirmed by the existing LSM
implementations (e.g., SELinux and AppArmor). The hook therefore does
not need to be called under the lock, and calling it outside the lock
improves concurrency without affecting correctness.

Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
---
 kernel/sys.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index c4c701c6f0b4..ef99b654e8d8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1481,18 +1481,20 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,

 	/* Holding a refcount on tsk protects tsk->signal from disappearing. */
 	rlim = tsk->signal->rlim + resource;
-	task_lock(tsk->group_leader);
 	if (new_rlim) {
 		/*
 		 * Keep the capable check against init_user_ns until cgroups can
 		 * contain all limits.
 		 */
 		if (new_rlim->rlim_max > rlim->rlim_max &&
-				!capable(CAP_SYS_RESOURCE))
-			retval = -EPERM;
-		if (!retval)
-			retval = security_task_setrlimit(tsk, resource, new_rlim);
+		    !capable(CAP_SYS_RESOURCE))
+			return -EPERM;
+		retval = security_task_setrlimit(tsk, resource, new_rlim);
+		if (retval)
+			return retval;
 	}
+
+	task_lock(tsk->group_leader);
 	if (!retval) {
 		if (old_rlim)
 			*old_rlim = *rlim;
-- 
2.20.1
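
The lock-narrowing idea described in the commit message can be sketched in
userspace C with a pthread mutex standing in for task_lock(). This is only
an analogy under stated assumptions, not kernel code: struct limit,
struct limit_owner, policy_check() and set_limit() are hypothetical names,
and policy_check() merely stands in for a hook that inspects its input
without touching the shared limit. The sketch differs from the patch in one
respect: the comparison against the current maximum stays inside the lock,
so that the sketch contains no unlocked read of shared state.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the soft/hard limit pair. */
struct limit {
	unsigned long cur;
	unsigned long max;
};

/* Hypothetical owner of the shared limit; lock plays the role of task_lock(). */
struct limit_owner {
	pthread_mutex_t lock;
	struct limit lim;
};

/* Hypothetical policy hook: looks only at the request, never at shared state. */
static int policy_check(const struct limit *new_lim)
{
	return new_lim->cur > new_lim->max ? -EINVAL : 0;
}

static int set_limit(struct limit_owner *owner, const struct limit *new_lim,
		     struct limit *old_lim, bool privileged)
{
	int ret = 0;

	assert(owner != NULL);

	/* Checks that depend only on the caller's input run without the lock. */
	if (new_lim) {
		ret = policy_check(new_lim);
		if (ret)
			return ret;
	}

	/* Everything that reads or writes owner->lim runs under the lock. */
	pthread_mutex_lock(&owner->lock);
	if (new_lim && new_lim->max > owner->lim.max && !privileged)
		ret = -EPERM;
	if (!ret) {
		if (old_lim)
			*old_lim = owner->lim;
		if (new_lim)
			owner->lim = *new_lim;
	}
	pthread_mutex_unlock(&owner->lock);

	return ret;
}
```

A caller would initialize the owner with PTHREAD_MUTEX_INITIALIZER and a
starting limit, then call set_limit() with a new limit, an optional buffer
for the old one, and a flag saying whether raising the maximum is allowed.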


* Re: [PATCH] kernel/sys: Optimize do_prlimit lock scope to reduce contention
  2024-11-20 13:21 [PATCH] kernel/sys: Optimize do_prlimit lock scope to reduce contention Zhen Ni
@ 2024-11-28  1:45 ` Andrew Morton
  2024-11-28  7:13   ` Oleg Nesterov
  0 siblings, 1 reply; 5+ messages in thread
From: Andrew Morton @ 2024-11-28  1:45 UTC (permalink / raw)
  To: Zhen Ni
  Cc: viro, oleg, catalin.marinas, brauner, zev, linux-kernel,
	linux-security-module

On Wed, 20 Nov 2024 21:21:56 +0800 Zhen Ni <zhen.ni@easystack.cn> wrote:

> Narrow the lock scope in do_prlimit() to reduce contention on
> task_lock(tsk->group_leader). The lock now protects only the sections
> that access or modify the shared rlim data. The permission check
> (capable()) and the security check (security_task_setrlimit()) are
> moved outside the lock, as they do not modify rlim and are independent
> of shared data protection.

Let's cc linux-security-module@vger.kernel.org, as we're proposing
altering their locking environment!

> security_task_setrlimit() is a Linux Security Module (LSM) hook that
> evaluates resource limit changes against security policies. It does
> not alter the rlim data structure, as confirmed by the existing LSM
> implementations (e.g., SELinux and AppArmor). The hook therefore does
> not need to be called under the lock, and calling it outside the lock
> improves concurrency without affecting correctness.

Seems sane.

Does any code call do_prlimit() frequently enough for this to matter?

> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1481,18 +1481,20 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
>  
>  	/* Holding a refcount on tsk protects tsk->signal from disappearing. */
>  	rlim = tsk->signal->rlim + resource;
> -	task_lock(tsk->group_leader);
>  	if (new_rlim) {
>  		/*
>  		 * Keep the capable check against init_user_ns until cgroups can
>  		 * contain all limits.
>  		 */
>  		if (new_rlim->rlim_max > rlim->rlim_max &&
> -				!capable(CAP_SYS_RESOURCE))
> -			retval = -EPERM;
> -		if (!retval)
> -			retval = security_task_setrlimit(tsk, resource, new_rlim);
> +		    !capable(CAP_SYS_RESOURCE))
> +			return -EPERM;
> +		retval = security_task_setrlimit(tsk, resource, new_rlim);
> +		if (retval)
> +			return retval;
>  	}
> +
> +	task_lock(tsk->group_leader);
>  	if (!retval) {
>  		if (old_rlim)
>  			*old_rlim = *rlim;
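
For context on the question of who calls do_prlimit(): the userspace route
named later in this thread is sys_prlimit64(), which is reached through the
prlimit(2) wrapper. The following minimal sketch shows such a call on the
calling process. It does not measure or claim anything about call frequency.
The helper name roundtrip_nofile() is hypothetical; prlimit() itself is the
glibc wrapper and needs _GNU_SOURCE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/types.h>

/*
 * Read the RLIMIT_NOFILE pair of the calling process, then write the same
 * values back. A pid of 0 means "the calling process". Returns 0 on
 * success and -1 on failure, with errno set by prlimit().
 */
static int roundtrip_nofile(struct rlimit *seen)
{
	struct rlimit old;

	assert(seen != NULL);

	/* new_limit == NULL: query only, nothing is modified. */
	if (prlimit(0, RLIMIT_NOFILE, NULL, &old) != 0)
		return -1;

	/* Setting unchanged values does not raise the hard limit. */
	if (prlimit(0, RLIMIT_NOFILE, &old, seen) != 0)
		return -1;

	return 0;
}
```

Passing a nonzero pid instead of 0 targets another process, which is the
tsk != current case discussed in the replies.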



* Re: [PATCH] kernel/sys: Optimize do_prlimit lock scope to reduce contention
  2024-11-28  1:45 ` Andrew Morton
@ 2024-11-28  7:13   ` Oleg Nesterov
  2024-11-28  7:39     ` Oleg Nesterov
  0 siblings, 1 reply; 5+ messages in thread
From: Oleg Nesterov @ 2024-11-28  7:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Zhen Ni, viro, catalin.marinas, brauner, zev, linux-kernel,
	linux-security-module

On 11/27, Andrew Morton wrote:
>
> On Wed, 20 Nov 2024 21:21:56 +0800 Zhen Ni <zhen.ni@easystack.cn> wrote:
>
> > security_task_setrlimit() is a Linux Security Module (LSM) hook that
> > evaluates resource limit changes against security policies. It does
> > not alter the rlim data structure, as confirmed by the existing LSM
> > implementations (e.g., SELinux and AppArmor). The hook therefore does
> > not need to be called under the lock, and calling it outside the lock
> > improves concurrency without affecting correctness.
>
> Seems sane.
>
> Does any code call do_prlimit() frequently enough for this to matter?

I have the same question...

> > -	task_lock(tsk->group_leader);
> >  	if (new_rlim) {
> >  		/*
> >  		 * Keep the capable check against init_user_ns until cgroups can
> >  		 * contain all limits.
> >  		 */
> >  		if (new_rlim->rlim_max > rlim->rlim_max &&
> > -				!capable(CAP_SYS_RESOURCE))
> > -			retval = -EPERM;
> > -		if (!retval)
> > -			retval = security_task_setrlimit(tsk, resource, new_rlim);
> > +		    !capable(CAP_SYS_RESOURCE))
> > +			return -EPERM;
> > +		retval = security_task_setrlimit(tsk, resource, new_rlim);
> > +		if (retval)
> > +			return retval;
> >  	}
> > +
> > +	task_lock(tsk->group_leader);

The problem is that task_lock(tsk->group_leader) doesn't look right with or
without this patch. I'll try to make a fix on the weekend.

If the caller is sys_prlimit64() and tsk != current, then ->group_leader is
not stable: do_prlimit() can race with a multithreaded (mt) exec and take
the wrong lock.

Oleg.



* Re: [PATCH] kernel/sys: Optimize do_prlimit lock scope to reduce contention
  2024-11-28  7:13   ` Oleg Nesterov
@ 2024-11-28  7:39     ` Oleg Nesterov
  2024-11-28  8:08       ` Oleg Nesterov
  0 siblings, 1 reply; 5+ messages in thread
From: Oleg Nesterov @ 2024-11-28  7:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Zhen Ni, viro, catalin.marinas, brauner, zev, linux-kernel,
	linux-security-module

On 11/28, Oleg Nesterov wrote:
>
> The problem is that task_lock(tsk->group_leader) doesn't look right with or
> without this patch. I'll try to make a fix on the weekend.
>
> If the caller is sys_prlimit64() and tsk != current, then ->group_leader is
> not stable: do_prlimit() can race with a multithreaded (mt) exec and take
> the wrong lock.

... and task_unlock(tsk->group_leader) is simply unsafe.

Perhaps something like the patch below, but it doesn't look nice; I'll try
to think about it more. I'll also grep, since there may be more lockless
users of tsk->group_leader when tsk != current.

Oleg.

--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1464,6 +1464,7 @@ SYSCALL_DEFINE2(setdomainname, char __user *, name, int, len)
 static int do_prlimit(struct task_struct *tsk, unsigned int resource,
 		      struct rlimit *new_rlim, struct rlimit *old_rlim)
 {
+	struct task_struct *leader;
 	struct rlimit *rlim;
 	int retval = 0;
 
@@ -1481,7 +1482,14 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
 
 	/* Holding a refcount on tsk protects tsk->signal from disappearing. */
 	rlim = tsk->signal->rlim + resource;
-	task_lock(tsk->group_leader);
+
+	if (tsk != current)
+		read_lock(&tasklist_lock);
+	leader = READ_ONCE(tsk->group_leader);
+	task_lock(leader);
+	if (tsk != current)
+		read_unlock(&tasklist_lock);
+
 	if (new_rlim) {
 		/*
 		 * Keep the capable check against init_user_ns until cgroups can
@@ -1499,7 +1507,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
 		if (new_rlim)
 			*rlim = *new_rlim;
 	}
-	task_unlock(tsk->group_leader);
+	task_unlock(leader);
 
 	/*
 	 * RLIMIT_CPU handling. Arm the posix CPU timer if the limit is not
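
The unlock hazard mentioned above (a lock taken through a pointer that can
change before the matching unlock) can be illustrated in userspace C. This
is a single-threaded simulation under stated assumptions: struct node,
struct member and second_read_matches() are hypothetical names, and the
pointer change is performed inline rather than by a concurrent exec. It
only shows why a second read of the pointer may name a different lock. It
does not claim to be a fix for the kernel race.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a task that carries a lock. */
struct node {
	pthread_mutex_t lock;
};

/* Hypothetical stand-in for a thread; leader plays the role of ->group_leader. */
struct member {
	struct node *leader;
};

/*
 * Lock through m->leader, retarget the pointer while the lock is held, and
 * report whether a second read of m->leader still names the locked node.
 */
static bool second_read_matches(struct member *m, struct node *new_leader)
{
	struct node *locked = m->leader;	/* first read, remembered */
	bool same;
	int rc;

	pthread_mutex_lock(&locked->lock);

	m->leader = new_leader;			/* simulated leader change */

	/* An unlock through m->leader would now use whatever it points to. */
	same = (m->leader == locked);

	/* The node that was locked is still held at this point. */
	rc = pthread_mutex_trylock(&locked->lock);
	assert(rc == EBUSY);
	(void)rc;

	/* Release through the remembered pointer, not through a re-read. */
	pthread_mutex_unlock(&locked->lock);

	return same;
}
```

When new_leader differs from the original leader, the function returns
false: the re-read pointer no longer identifies the mutex that is held.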



* Re: [PATCH] kernel/sys: Optimize do_prlimit lock scope to reduce contention
  2024-11-28  7:39     ` Oleg Nesterov
@ 2024-11-28  8:08       ` Oleg Nesterov
  0 siblings, 0 replies; 5+ messages in thread
From: Oleg Nesterov @ 2024-11-28  8:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Zhen Ni, viro, catalin.marinas, brauner, zev, linux-kernel,
	linux-security-module

On 11/28, Oleg Nesterov wrote:
>
> On 11/28, Oleg Nesterov wrote:
> >
> > The problem is that task_lock(tsk->group_leader) doesn't look right with or
> > without this patch. I'll try to make a fix on the weekend.
> >
> > If the caller is sys_prlimit64() and tsk != current, then ->group_leader is
> > not stable: do_prlimit() can race with a multithreaded (mt) exec and take
> > the wrong lock.
>
> ... and task_unlock(tsk->group_leader) is simply unsafe.
>
> perhaps something like below,

No, this is wrong too.

> I'll try to think more.

Yes...

Oleg.

