* [PATCH v2 0/5] ucounts: RLIMIT_NPROC fixes [not found] ` <87o83e2mbu.fsf@email.froward.int.ebiederm.org> @ 2022-02-16 15:56 ` Eric W. Biederman 2022-02-16 15:58 ` [PATCH v2 1/5] rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user Eric W. Biederman ` (5 more replies) 0 siblings, 6 replies; 25+ messages in thread From: Eric W. Biederman @ 2022-02-16 15:56 UTC (permalink / raw) To: linux-kernel Cc: Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, Linux Containers, Michal Koutný, linux-api Michal Koutný recently found some bugs in the enforcement of RLIMIT_NPROC in the recent ucount rlimit implementation. I saw some additional bugs and some cleaner ways to fix the problem so instead of starting with his fixes these are my own. I have rewritten about half my fixes since the last time this was posted. There is this notion (not entirely wrong) that the code should be consistent and make sense. When I dug in I discovered that has not been the case for the last 20 years. Fixing the long standing inconsistencies is something that seems to warrent wider vetting on linux-api. So with this set of patches I have developed a very conservative approach changing only what is necessary to fix the bugs that I can see clearly. Cleanups and anything that is making the code more consistent can follow after we have the code working as it has historically. Anyone who can please take a look and tell me if I am doing something silly. Eric W. Biederman (5): rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1 ucounts: Base set_cred_ucounts changes on the real user ucounts: Move RLIMIT_NPROC handling after set_user ucounts: Handle wrapping in is_ucounts_overlimit kernel/cred.c | 9 ++------- kernel/fork.c | 10 +++++----- kernel/sys.c | 20 ++++++++++++++------ kernel/ucount.c | 3 ++- 4 files changed, 23 insertions(+), 19 deletions(-) Eric ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2 1/5] rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user 2022-02-16 15:56 ` [PATCH v2 0/5] ucounts: RLIMIT_NPROC fixes Eric W. Biederman @ 2022-02-16 15:58 ` Eric W. Biederman 2022-02-16 17:42 ` Solar Designer 2022-02-16 15:58 ` [PATCH v2 2/5] ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1 Eric W. Biederman ` (4 subsequent siblings) 5 siblings, 1 reply; 25+ messages in thread From: Eric W. Biederman @ 2022-02-16 15:58 UTC (permalink / raw) To: linux-kernel Cc: Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, containers, Michal Koutný, linux-api, Eric W. Biederman, stable Solar Designer <solar@openwall.com> wrote: > I'm not aware of anyone actually running into this issue and reporting > it. The systems that I personally know use suexec along with rlimits > still run older/distro kernels, so would not yet be affected. > > So my mention was based on my understanding of how suexec works, and > code review. Specifically, Apache httpd has the setting RLimitNPROC, > which makes it set RLIMIT_NPROC: > > https://httpd.apache.org/docs/2.4/mod/core.html#rlimitnproc > > The above documentation for it includes: > > "This applies to processes forked from Apache httpd children servicing > requests, not the Apache httpd children themselves. This includes CGI > scripts and SSI exec commands, but not any processes forked from the > Apache httpd parent, such as piped logs." > > In code, there are: > > ./modules/generators/mod_cgid.c: ( (cgid_req.limits.limit_nproc_set) && ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, > ./modules/generators/mod_cgi.c: ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, > ./modules/filters/mod_ext_filter.c: rv = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, conf->limit_nproc); > > For example, in mod_cgi.c this is in run_cgi_child(). > > I think this means an httpd child sets RLIMIT_NPROC shortly before it > execs suexec, which is a SUID root program. suexec then switches to the > target user and execs the CGI script. > > Before 2863643fb8b9, the setuid() in suexec would set the flag, and the > target user's process count would be checked against RLIMIT_NPROC on > execve(). After 2863643fb8b9, the setuid() in suexec wouldn't set the > flag because setuid() is (naturally) called when the process is still > running as root (thus, has those limits bypass capabilities), and > accordingly execve() would not check the target user's process count > against RLIMIT_NPROC. In commit 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds") capable calls were added to set_user to make it more consistent with fork. Unfortunately because of call site differences those capables calls were checking the credentials of the user before set*id() instead of after set*id(). This breaks enforcement of RLIMIT_NPROC for applications that set the rlimit and then call set*id() while holding a full set of capabilities. The capabilities are only changed in the new credential in security_task_fix_setuid(). The code in apache suexec appears to follow this pattern. Commit 909cc4ae86f3 ("[PATCH] Fix two bugs with process limits (RLIMIT_NPROC)") where this check was added describes the targes of this capability check as: 2/ When a root-owned process (e.g. cgiwrap) sets up process limits and then calls setuid, the setuid should fail if the user would then be running more than rlim_cur[RLIMIT_NPROC] processes, but it doesn't. This patch adds an appropriate test. With this patch, and per-user process limit imposed in cgiwrap really works. So the original use case also of this check also appears to match the broken pattern. Restore the enforcement of RLIMIT_NPROC by removing the bad capable checks added in set_user. This unfortunately restores the inconsistencies state the code has been in for the last 11 years, but dealing with the inconsistencies looks like a larger problem. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20210907213042.GA22626@openwall.com/ Link: https://lkml.kernel.org/r/20220212221412.GA29214@openwall.com Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds") History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- kernel/sys.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/sys.c b/kernel/sys.c index ecc4cf019242..8dd938a3d2bf 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -480,8 +480,7 @@ static int set_user(struct cred *new) * failure to the execve() stage. */ if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) && - new_user != INIT_USER && - !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) + new_user != INIT_USER) current->flags |= PF_NPROC_EXCEEDED; else current->flags &= ~PF_NPROC_EXCEEDED; -- 2.29.2 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v2 1/5] rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user 2022-02-16 15:58 ` [PATCH v2 1/5] rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user Eric W. Biederman @ 2022-02-16 17:42 ` Solar Designer 0 siblings, 0 replies; 25+ messages in thread From: Solar Designer @ 2022-02-16 17:42 UTC (permalink / raw) To: Eric W. Biederman Cc: linux-kernel, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Ran Xiaokai, containers, Michal Koutn??, linux-api, stable On Wed, Feb 16, 2022 at 09:58:28AM -0600, Eric W. Biederman wrote: > Solar Designer <solar@openwall.com> wrote: > > I'm not aware of anyone actually running into this issue and reporting > > it. The systems that I personally know use suexec along with rlimits > > still run older/distro kernels, so would not yet be affected. > > > > So my mention was based on my understanding of how suexec works, and > > code review. Specifically, Apache httpd has the setting RLimitNPROC, > > which makes it set RLIMIT_NPROC: > > > > https://httpd.apache.org/docs/2.4/mod/core.html#rlimitnproc > > > > The above documentation for it includes: > > > > "This applies to processes forked from Apache httpd children servicing > > requests, not the Apache httpd children themselves. This includes CGI > > scripts and SSI exec commands, but not any processes forked from the > > Apache httpd parent, such as piped logs." > > > > In code, there are: > > > > ./modules/generators/mod_cgid.c: ( (cgid_req.limits.limit_nproc_set) && ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, > > ./modules/generators/mod_cgi.c: ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, > > ./modules/filters/mod_ext_filter.c: rv = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, conf->limit_nproc); > > > > For example, in mod_cgi.c this is in run_cgi_child(). > > > > I think this means an httpd child sets RLIMIT_NPROC shortly before it > > execs suexec, which is a SUID root program. suexec then switches to the > > target user and execs the CGI script. > > > > Before 2863643fb8b9, the setuid() in suexec would set the flag, and the > > target user's process count would be checked against RLIMIT_NPROC on > > execve(). After 2863643fb8b9, the setuid() in suexec wouldn't set the > > flag because setuid() is (naturally) called when the process is still > > running as root (thus, has those limits bypass capabilities), and > > accordingly execve() would not check the target user's process count > > against RLIMIT_NPROC. > > In commit 2863643fb8b9 ("set_user: add capability check when > rlimit(RLIMIT_NPROC) exceeds") capable calls were added to set_user to > make it more consistent with fork. Unfortunately because of call site > differences those capables calls were checking the credentials of the s/capables/capable/ > user before set*id() instead of after set*id(). > > This breaks enforcement of RLIMIT_NPROC for applications that set the > rlimit and then call set*id() while holding a full set of > capabilities. The capabilities are only changed in the new credential > in security_task_fix_setuid(). > > The code in apache suexec appears to follow this pattern. > > Commit 909cc4ae86f3 ("[PATCH] Fix two bugs with process limits > (RLIMIT_NPROC)") where this check was added describes the targes of this > capability check as: > > 2/ When a root-owned process (e.g. cgiwrap) sets up process limits and then > calls setuid, the setuid should fail if the user would then be running > more than rlim_cur[RLIMIT_NPROC] processes, but it doesn't. This patch > adds an appropriate test. With this patch, and per-user process limit > imposed in cgiwrap really works. > > So the original use case also of this check also appears to match the broken > pattern. Duplicate "also" - drop one. > Restore the enforcement of RLIMIT_NPROC by removing the bad capable > checks added in set_user. This unfortunately restores the > inconsistencies state the code has been in for the last 11 years, but s/inconsistencies/inconsistent/ > dealing with the inconsistencies looks like a larger problem. > > Cc: stable@vger.kernel.org > Link: https://lore.kernel.org/all/20210907213042.GA22626@openwall.com/ > Link: https://lkml.kernel.org/r/20220212221412.GA29214@openwall.com > Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds") > History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > --- > kernel/sys.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/kernel/sys.c b/kernel/sys.c > index ecc4cf019242..8dd938a3d2bf 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -480,8 +480,7 @@ static int set_user(struct cred *new) > * failure to the execve() stage. > */ > if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) && > - new_user != INIT_USER && > - !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) > + new_user != INIT_USER) > current->flags |= PF_NPROC_EXCEEDED; > else > current->flags &= ~PF_NPROC_EXCEEDED; Reviewed-by: Solar Designer <solar@openwall.com> Alexander ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2 2/5] ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1 2022-02-16 15:56 ` [PATCH v2 0/5] ucounts: RLIMIT_NPROC fixes Eric W. Biederman 2022-02-16 15:58 ` [PATCH v2 1/5] rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user Eric W. Biederman @ 2022-02-16 15:58 ` Eric W. Biederman 2022-02-16 15:58 ` [PATCH v2 3/5] ucounts: Base set_cred_ucounts changes on the real user Eric W. Biederman ` (3 subsequent siblings) 5 siblings, 0 replies; 25+ messages in thread From: Eric W. Biederman @ 2022-02-16 15:58 UTC (permalink / raw) To: linux-kernel Cc: Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, containers, Michal Koutný, linux-api, Eric W. Biederman, stable Michal Koutný <mkoutny@suse.com> wrote: > It was reported that v5.14 behaves differently when enforcing > RLIMIT_NPROC limit, namely, it allows one more task than previously. > This is consequence of the commit 21d1c5e386bc ("Reimplement > RLIMIT_NPROC on top of ucounts") that missed the sharpness of > equality in the forking path. This can be fixed either by fixing the test or by moving the increment to be before the test. Fix it my moving copy_creds which contains the increment before is_ucounts_overlimit. In the case of CLONE_NEWUSER the ucounts in the task_cred changes. The function is_ucounts_overlimit needs to use the final version of the ucounts for the new process. Which means moving the is_ucounts_overlimit test after copy_creds is necessary. Both the test in fork and the test in set_user were semantically changed when the code moved to ucounts. The change of the test in fork was bad because it was before the increment. The test in set_user was wrong and the change to ucounts fixed it. So this fix only restores the old behavior in one lcation not two. Link: https://lkml.kernel.org/r/20220204181144.24462-1-mkoutny@suse.com Cc: stable@vger.kernel.org Reported-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- kernel/fork.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index d75a528f7b21..17d8a8c85e3b 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2021,18 +2021,18 @@ static __latent_entropy struct task_struct *copy_process( #ifdef CONFIG_PROVE_LOCKING DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled); #endif + retval = copy_creds(p, clone_flags); + if (retval < 0) + goto bad_fork_free; + retval = -EAGAIN; if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) { if (p->real_cred->user != INIT_USER && !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) - goto bad_fork_free; + goto bad_fork_cleanup_count; } current->flags &= ~PF_NPROC_EXCEEDED; - retval = copy_creds(p, clone_flags); - if (retval < 0) - goto bad_fork_free; - /* * If multiple threads are within copy_process(), then this check * triggers too late. This doesn't hurt, the check is only there -- 2.29.2 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 3/5] ucounts: Base set_cred_ucounts changes on the real user 2022-02-16 15:56 ` [PATCH v2 0/5] ucounts: RLIMIT_NPROC fixes Eric W. Biederman 2022-02-16 15:58 ` [PATCH v2 1/5] rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user Eric W. Biederman 2022-02-16 15:58 ` [PATCH v2 2/5] ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1 Eric W. Biederman @ 2022-02-16 15:58 ` Eric W. Biederman 2022-02-16 15:58 ` [PATCH v2 4/5] ucounts: Move RLIMIT_NPROC handling after set_user Eric W. Biederman ` (2 subsequent siblings) 5 siblings, 0 replies; 25+ messages in thread From: Eric W. Biederman @ 2022-02-16 15:58 UTC (permalink / raw) To: linux-kernel Cc: Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, containers, Michal Koutný, linux-api, Eric W. Biederman, stable Michal Koutný <mkoutny@suse.com> wrote: > Tasks are associated to multiple users at once. Historically and as per > setrlimit(2) RLIMIT_NPROC is enforce based on real user ID. > > The commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") > made the accounting structure "indexed" by euid and hence potentially > account tasks differently. > > The effective user ID may be different e.g. for setuid programs but > those are exec'd into already existing task (i.e. below limit), so > different accounting is moot. > > Some special setresuid(2) users may notice the difference, justifying > this fix. I looked at cred->ucount and it is only used for rlimit operations that were previously stored in cred->user. Making the fact cred->ucount can refer to a different user from cred->user a bug, affecting all uses of cred->ulimit not just RLIMIT_NPROC. Fix set_cred_ucounts to always use the real uid not the effective uid. Further simplify set_cred_ucounts by noticing that set_cred_ucounts somehow retained a draft version of the check to see if alloc_ucounts was needed that checks the new->user and new->user_ns against the current_real_cred(). Remove that draft version of the check. All that matters for setting the cred->ucounts are the user_ns and uid fields in the cred. Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220207121800.5079-4-mkoutny@suse.com Reported-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- kernel/cred.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/kernel/cred.c b/kernel/cred.c index 473d17c431f3..933155c96922 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -665,21 +665,16 @@ EXPORT_SYMBOL(cred_fscmp); int set_cred_ucounts(struct cred *new) { - struct task_struct *task = current; - const struct cred *old = task->real_cred; struct ucounts *new_ucounts, *old_ucounts = new->ucounts; - if (new->user == old->user && new->user_ns == old->user_ns) - return 0; - /* * This optimization is needed because alloc_ucounts() uses locks * for table lookups. */ - if (old_ucounts->ns == new->user_ns && uid_eq(old_ucounts->uid, new->euid)) + if (old_ucounts->ns == new->user_ns && uid_eq(old_ucounts->uid, new->uid)) return 0; - if (!(new_ucounts = alloc_ucounts(new->user_ns, new->euid))) + if (!(new_ucounts = alloc_ucounts(new->user_ns, new->uid))) return -EAGAIN; new->ucounts = new_ucounts; -- 2.29.2 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 4/5] ucounts: Move RLIMIT_NPROC handling after set_user 2022-02-16 15:56 ` [PATCH v2 0/5] ucounts: RLIMIT_NPROC fixes Eric W. Biederman ` (2 preceding siblings ...) 2022-02-16 15:58 ` [PATCH v2 3/5] ucounts: Base set_cred_ucounts changes on the real user Eric W. Biederman @ 2022-02-16 15:58 ` Eric W. Biederman 2022-02-16 15:58 ` [PATCH v2 5/5] ucounts: Handle wrapping in is_ucounts_overlimit Eric W. Biederman 2022-02-18 15:34 ` [GIT PULL] ucounts: RLIMIT_NPROC fixes for v5.17 Eric W. Biederman 5 siblings, 0 replies; 25+ messages in thread From: Eric W. Biederman @ 2022-02-16 15:58 UTC (permalink / raw) To: linux-kernel Cc: Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, containers, Michal Koutný, linux-api, Eric W. Biederman, stable During set*id() which cred->ucounts to charge the the current process to is not known until after set_cred_ucounts. So move the RLIMIT_NPROC checking into a new helper flag_nproc_exceeded and call flag_nproc_exceeded after set_cred_ucounts. This is very much an arbitrary subset of the places where we currently change the RLIMIT_NPROC accounting, designed to preserve the existing logic. Fixing the existing logic will be the subject of another series of changes. Cc: stable@vger.kernel.org Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- kernel/sys.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/kernel/sys.c b/kernel/sys.c index 8dd938a3d2bf..97dc9e5d6bf9 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -472,6 +472,16 @@ static int set_user(struct cred *new) if (!new_user) return -EAGAIN; + free_uid(new->user); + new->user = new_user; + return 0; +} + +static void flag_nproc_exceeded(struct cred *new) +{ + if (new->ucounts == current_ucounts()) + return; + /* * We don't fail in case of NPROC limit excess here because too many * poorly written programs don't check set*uid() return code, assuming @@ -480,14 +490,10 @@ static int set_user(struct cred *new) * failure to the execve() stage. */ if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) && - new_user != INIT_USER) + new->user != INIT_USER) current->flags |= PF_NPROC_EXCEEDED; else current->flags &= ~PF_NPROC_EXCEEDED; - - free_uid(new->user); - new->user = new_user; - return 0; } /* @@ -562,6 +568,7 @@ long __sys_setreuid(uid_t ruid, uid_t euid) if (retval < 0) goto error; + flag_nproc_exceeded(new); return commit_creds(new); error: @@ -624,6 +631,7 @@ long __sys_setuid(uid_t uid) if (retval < 0) goto error; + flag_nproc_exceeded(new); return commit_creds(new); error: @@ -703,6 +711,7 @@ long __sys_setresuid(uid_t ruid, uid_t euid, uid_t suid) if (retval < 0) goto error; + flag_nproc_exceeded(new); return commit_creds(new); error: -- 2.29.2 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 5/5] ucounts: Handle wrapping in is_ucounts_overlimit 2022-02-16 15:56 ` [PATCH v2 0/5] ucounts: RLIMIT_NPROC fixes Eric W. Biederman ` (3 preceding siblings ...) 2022-02-16 15:58 ` [PATCH v2 4/5] ucounts: Move RLIMIT_NPROC handling after set_user Eric W. Biederman @ 2022-02-16 15:58 ` Eric W. Biederman 2022-02-16 17:28 ` Shuah Khan 2022-02-18 15:34 ` [GIT PULL] ucounts: RLIMIT_NPROC fixes for v5.17 Eric W. Biederman 5 siblings, 1 reply; 25+ messages in thread From: Eric W. Biederman @ 2022-02-16 15:58 UTC (permalink / raw) To: linux-kernel Cc: Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, containers, Michal Koutný, linux-api, Eric W. Biederman, stable While examining is_ucounts_overlimit and reading the various messages I realized that is_ucounts_overlimit fails to deal with counts that may have wrapped. Being wrapped should be a transitory state for counts and they should never be wrapped for long, but it can happen so handle it. Cc: stable@vger.kernel.org Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- kernel/ucount.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/ucount.c b/kernel/ucount.c index 65b597431c86..06ea04d44685 100644 --- a/kernel/ucount.c +++ b/kernel/ucount.c @@ -350,7 +350,8 @@ bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsign if (rlimit > LONG_MAX) max = LONG_MAX; for (iter = ucounts; iter; iter = iter->ns->ucounts) { - if (get_ucounts_value(iter, type) > max) + long val = get_ucounts_value(iter, type); + if (val < 0 || val > max) return true; max = READ_ONCE(iter->ns->ucount_max[type]); } -- 2.29.2 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v2 5/5] ucounts: Handle wrapping in is_ucounts_overlimit 2022-02-16 15:58 ` [PATCH v2 5/5] ucounts: Handle wrapping in is_ucounts_overlimit Eric W. Biederman @ 2022-02-16 17:28 ` Shuah Khan 0 siblings, 0 replies; 25+ messages in thread From: Shuah Khan @ 2022-02-16 17:28 UTC (permalink / raw) To: Eric W. Biederman, linux-kernel Cc: Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, containers, Michal Koutný, linux-api, stable, Shuah Khan On 2/16/22 8:58 AM, Eric W. Biederman wrote: > While examining is_ucounts_overlimit and reading the various messages > I realized that is_ucounts_overlimit fails to deal with counts that > may have wrapped. > > Being wrapped should be a transitory state for counts and they should > never be wrapped for long, but it can happen so handle it. > > Cc: stable@vger.kernel.org > Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > --- > kernel/ucount.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/kernel/ucount.c b/kernel/ucount.c > index 65b597431c86..06ea04d44685 100644 > --- a/kernel/ucount.c > +++ b/kernel/ucount.c > @@ -350,7 +350,8 @@ bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsign > if (rlimit > LONG_MAX) > max = LONG_MAX; > for (iter = ucounts; iter; iter = iter->ns->ucounts) { > - if (get_ucounts_value(iter, type) > max) > + long val = get_ucounts_value(iter, type); > + if (val < 0 || val > max) > return true; > max = READ_ONCE(iter->ns->ucount_max[type]); > } > This addresses the concerns about overflow. Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> thanks, -- Shuah ^ permalink raw reply [flat|nested] 25+ messages in thread
* [GIT PULL] ucounts: RLIMIT_NPROC fixes for v5.17 2022-02-16 15:56 ` [PATCH v2 0/5] ucounts: RLIMIT_NPROC fixes Eric W. Biederman ` (4 preceding siblings ...) 2022-02-16 15:58 ` [PATCH v2 5/5] ucounts: Handle wrapping in is_ucounts_overlimit Eric W. Biederman @ 2022-02-18 15:34 ` Eric W. Biederman 2022-02-20 19:05 ` pr-tracker-bot 2022-03-03 0:12 ` [GIT PULL] ucounts: Regression fix " Eric W. Biederman 5 siblings, 2 replies; 25+ messages in thread From: Eric W. Biederman @ 2022-02-18 15:34 UTC (permalink / raw) To: Linus Torvalds Cc: linux-kernel, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, Linux Containers, Michal Koutný, linux-api Linus, Please pull the ucount-rlimit-fixes-for-v5.17 branch from the git tree: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git ucount-rlimit-fixes-for-v5.17 HEAD: 0cbae9e24fa7d6c6e9f828562f084da82217a0c5 ucounts: Handle wrapping in is_ucounts_overlimit Michal Koutný recently found some bugs in the enforcement of RLIMIT_NPROC in the recent ucount rlimit implementation. I saw some additional bugs and some cleaner ways to fix the problem so instead of starting with his fixes these are my own. I have rewritten about half my fixes since the last time this was posted. There is this notion (not entirely wrong) that the code should be consistent and make sense. When I dug in I discovered that has not been the case for the last 20 years. Fixing the long standing inconsistencies is something that seems to warrent wider vetting on linux-api. So with this set of patches I have developed a very conservative approach changing only what is necessary to fix the bugs that I can see clearly. Cleanups and anything that is making the code more consistent can follow after we have the code working as it has historically. I had hoped to let this sit in linux-next for a few days just to be doubly certain all is well. But these patches are all trivial and linux-next is on holiday. v2: https://lkml.kernel.org/r/87ilteiz4a.fsf_-_@email.froward.int.ebiederm.org> v1: https://lkml.kernel.org/r/87o83e2mbu.fsf@email.froward.int.ebiederm.org> Eric W. Biederman (5): rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1 ucounts: Base set_cred_ucounts changes on the real user ucounts: Move RLIMIT_NPROC handling after set_user ucounts: Handle wrapping in is_ucounts_overlimit kernel/cred.c | 9 ++------- kernel/fork.c | 10 +++++----- kernel/sys.c | 20 ++++++++++++++------ kernel/ucount.c | 3 ++- 4 files changed, 23 insertions(+), 19 deletions(-) p.s. I should say that the problem is not so much inconsistencies (although those exist) but that it is very difficult to figure out what the code should be doing in the case of RLIMIT_NPROC. All other rlimits are only enforced where the resource is acquired (allocated). RLIMIT_NPROC by necessity needs to be enforced in an additional location, and our current implementation stumbled it's way into that implementation. Eric ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [GIT PULL] ucounts: RLIMIT_NPROC fixes for v5.17 2022-02-18 15:34 ` [GIT PULL] ucounts: RLIMIT_NPROC fixes for v5.17 Eric W. Biederman @ 2022-02-20 19:05 ` pr-tracker-bot 2022-03-03 0:12 ` [GIT PULL] ucounts: Regression fix " Eric W. Biederman 1 sibling, 0 replies; 25+ messages in thread From: pr-tracker-bot @ 2022-02-20 19:05 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, linux-kernel, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, Linux Containers, Michal Koutný, linux-api The pull request you sent on Fri, 18 Feb 2022 09:34:24 -0600: > git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git ucount-rlimit-fixes-for-v5.17 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/2d3409ebc87f4bc4ed23bd39e78db9ffc29eec44 Thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/prtracker.html ^ permalink raw reply [flat|nested] 25+ messages in thread
* [GIT PULL] ucounts: Regression fix for v5.17 2022-02-18 15:34 ` [GIT PULL] ucounts: RLIMIT_NPROC fixes for v5.17 Eric W. Biederman 2022-02-20 19:05 ` pr-tracker-bot @ 2022-03-03 0:12 ` Eric W. Biederman 2022-03-03 0:30 ` pr-tracker-bot 1 sibling, 1 reply; 25+ messages in thread From: Eric W. Biederman @ 2022-03-03 0:12 UTC (permalink / raw) To: Linus Torvalds Cc: linux-kernel, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, Linux Containers, Michal Koutný, linux-api Linus, Please pull the ucount-rlimit-fixes-for-v5.17 branch from the git tree: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git ucount-rlimit-fixes-for-v5.17 HEAD: 0ac983f512033cb7b5e210c9589768ad25b1e36b ucounts: Fix systemd LimitNPROC with private users regression Etienne Dechamps recently found a regression caused by enforcing RLIMIT_NPROC for root where the rlimit was not previously enforced. Michal Koutný had previously pointed out the inconsistency in enforcing the RLIMIT_NPROC that had been on the root owned process after the root user creates a user namespace. Which makes the fix for the regression simply removing the inconsistency. From: "Eric W. Biederman" <ebiederm@xmission.com> Date: Thu, 24 Feb 2022 08:32:28 -0600 Subject: [PATCH] ucounts: Fix systemd LimitNPROC with private users regression Long story short recursively enforcing RLIMIT_NPROC when it is not enforced on the process that creates a new user namespace, causes currently working code to fail. There is no reason to enforce RLIMIT_NPROC recursively when we don't enforce it normally so update the code to detect this case. I would like to simply use capable(CAP_SYS_RESOURCE) to detect when RLIMIT_NPROC is not enforced upon the caller. Unfortunately because RLIMIT_NPROC is charged and checked for enforcement based upon the real uid, using capable() which is euid based is inconsistent with reality. Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by testing for when the real uid would match the conditions when CAP_SYS_RESOURCE would be present if the real uid was the effective uid. Reported-by: Etienne Dechamps <etienne@edechamps.fr> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596 Link: https://lkml.kernel.org/r/e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr Link: https://lkml.kernel.org/r/87sfs8jmpz.fsf_-_@email.froward.int.ebiederm.org Cc: stable@vger.kernel.org Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- kernel/user_namespace.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 6b2e3ca7ee99..5481ba44a8d6 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -58,6 +58,18 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) cred->user_ns = user_ns; } +static unsigned long enforced_nproc_rlimit(void) +{ + unsigned long limit = RLIM_INFINITY; + + /* Is RLIMIT_NPROC currently enforced? */ + if (!uid_eq(current_uid(), GLOBAL_ROOT_UID) || + (current_user_ns() != &init_user_ns)) + limit = rlimit(RLIMIT_NPROC); + + return limit; +} + /* * Create a new user namespace, deriving the creator from the user in the * passed credentials, and replacing that user with the new root user for the @@ -122,7 +134,7 @@ int create_user_ns(struct cred *new) for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) { ns->ucount_max[i] = INT_MAX; } - set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)); + set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, enforced_nproc_rlimit()); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE)); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING)); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK)); -- 2.29.2 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [GIT PULL] ucounts: Regression fix for v5.17 2022-03-03 0:12 ` [GIT PULL] ucounts: Regression fix " Eric W. Biederman @ 2022-03-03 0:30 ` pr-tracker-bot 0 siblings, 0 replies; 25+ messages in thread From: pr-tracker-bot @ 2022-03-03 0:30 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, linux-kernel, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, Linux Containers, Michal Koutný, linux-api The pull request you sent on Wed, 02 Mar 2022 18:12:40 -0600: > git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git ucount-rlimit-fixes-for-v5.17 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/5859a2b1991101d6b978f3feb5325dad39421f29 Thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/prtracker.html ^ permalink raw reply [flat|nested] 25+ messages in thread
[parent not found: <e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr>]
[parent not found: <20220215101150.GD21589@blackbody.suse.cz>]
[parent not found: <87zgmi5rhm.fsf@email.froward.int.ebiederm.org>]
* How should rlimits, suid exec, and capabilities interact? [not found] ` <87zgmi5rhm.fsf@email.froward.int.ebiederm.org> @ 2022-02-23 18:00 ` Eric W. Biederman 2022-02-23 19:44 ` Andy Lutomirski 2022-02-23 19:50 ` Linus Torvalds 0 siblings, 2 replies; 25+ messages in thread From: Eric W. Biederman @ 2022-02-23 18:00 UTC (permalink / raw) To: linux-api Cc: Etienne Dechamps, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, linux-kernel, linux-kselftest, Linux Containers, Michal Koutný, security, Neil Brown, NeilBrown, Serge E. Hallyn, Kees Cook, Jann Horn [CC'd the security list because I really don't know who the right people are to drag into this discussion] While looking at some issues that have cropped up with making it so that RLIMIT_NPROC cannot be escaped by creating a user namespace I have stumbled upon a very old issue of how rlimits and suid exec interact poorly. This specific saga starts with commit 909cc4ae86f3 ("[PATCH] Fix two bugs with process limits (RLIMIT_NPROC)") from https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git which essentially replaced a capable() check with a an open-coded implementation of suser(), for RLIMIT_NPROC. The description from Neil Brown was: 1/ If a setuid process swaps it's real and effective uids and then forks, the fork fails if the new realuid has more processes than the original process was limited to. This is particularly a problem if a user with a process limit (e.g. 256) runs a setuid-root program which does setuid() + fork() (e.g. lprng) while root already has more than 256 process (which is quite possible). The root problem here is that a limit which should be a per-user limit is being implemented as a per-process limit with per-process (e.g. CAP_SYS_RESOURCE) controls. Being a per-user limit, it should be that the root-user can over-ride it, not just some process with CAP_SYS_RESOURCE. This patch adds a test to ignore process limits if the real user is root. The test to see if the real user is root was: if (p->real_cred->user != INIT_USER) ... which persists to this day in fs/fork.c:copy_process(). The practical problem with this test is that it works like nothing else in the kernel, and so does not look like what it is. Saying: if (!uid_eq(p->real_cred->uid, GLOBAL_ROOT_USER)) ... would at least be more recognizable. Really this entire test should be if (!capable(CAP_SYS_RESOURCE) because CAP_SYS_RESOURCE is the capability that controls if you are allowed to exceed your rlimits. Which brings us to the practical issues of how all of these things are wired together today. The per-user rlimits are accounted based upon a processes real user, not the effective user. All other permission checks are based upon the effective user. This has the practical effect that uids are swapped as above that the processes are charged to root, but use the permissions of an ordinary user. The problems get worse when you realize that suid exec does not reset any of the rlimits except for RLIMIT_STACK. The rlimits that are particularly affected and are per-user are: RLIMIT_NPROC, RLIMIT_MSGQUEUE, RLIMIT_SIGPENDING, RLIMIT_MEMLOCK. But I think failing to reset rlimits during exec has the potential to effect any suid exec. Does anyone have any historical knowledge or sense of how this should work? Right now it feels like we have coded ourselves into a corner and will have to risk breaking userspace to get out of it. AKA I think we need a policy of reseting rlimits on suid exec, and I think we need to store global rlimits based upon the effective user not the real user. Those changes should allow making capable calls where they belong, and removing the much too magic user == INIT_USER test for RLIMIT_NPROC. Eric ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: How should rlimits, suid exec, and capabilities interact? 2022-02-23 18:00 ` How should rlimits, suid exec, and capabilities interact? Eric W. Biederman @ 2022-02-23 19:44 ` Andy Lutomirski 2022-02-23 21:28 ` Willy Tarreau 2022-02-23 19:50 ` Linus Torvalds 1 sibling, 1 reply; 25+ messages in thread From: Andy Lutomirski @ 2022-02-23 19:44 UTC (permalink / raw) To: Eric W. Biederman Cc: linux-api, Etienne Dechamps, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, linux-kernel, linux-kselftest, Linux Containers, Michal Koutný, security, Neil Brown, NeilBrown, Serge E. Hallyn, Jann Horn On Wed, Feb 23, 2022 at 10:00 AM Eric W. Biederman <ebiederm@xmission.com> wrote: > > > [CC'd the security list because I really don't know who the right people > are to drag into this discussion] > > While looking at some issues that have cropped up with making it so > that RLIMIT_NPROC cannot be escaped by creating a user namespace I have > stumbled upon a very old issue of how rlimits and suid exec interact > poorly. Once upon a time, these resource limits were effectively the only way to control memory consumption and consumption of historically limited resources like processes. (The scheduler used to have serious issues with too many processes -- this is not so true any more. And without cgroups, too many processes could use too much CPU collectively.) This all worked pretty poorly. Now we have cgroups, fancy memory accounting, etc. So I'm wondering if NPROC is even useful anymore. I don't have a brilliant idea of how to deprecate it, but I think it wouldn't be entirely nuts to take it much less seriously and maybe even eventually get rid of it. I doubt there is much existing userspace that would break if a previously failing fork() started succeeding. --Andy] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: How should rlimits, suid exec, and capabilities interact? 2022-02-23 19:44 ` Andy Lutomirski @ 2022-02-23 21:28 ` Willy Tarreau 0 siblings, 0 replies; 25+ messages in thread From: Willy Tarreau @ 2022-02-23 21:28 UTC (permalink / raw) To: Andy Lutomirski Cc: Eric W. Biederman, linux-api, Etienne Dechamps, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, linux-kernel, linux-kselftest, Linux Containers, Michal Koutný, security, Neil Brown, NeilBrown, Serge E. Hallyn, Jann Horn Hi Andy, On Wed, Feb 23, 2022 at 11:44:51AM -0800, Andy Lutomirski wrote: > On Wed, Feb 23, 2022 at 10:00 AM Eric W. Biederman > <ebiederm@xmission.com> wrote: > > > > > > [CC'd the security list because I really don't know who the right people > > are to drag into this discussion] > > > > While looking at some issues that have cropped up with making it so > > that RLIMIT_NPROC cannot be escaped by creating a user namespace I have > > stumbled upon a very old issue of how rlimits and suid exec interact > > poorly. > > Once upon a time, these resource limits were effectively the only way > to control memory consumption and consumption of historically limited > resources like processes. (The scheduler used to have serious issues > with too many processes -- this is not so true any more. And without > cgroups, too many processes could use too much CPU collectively.) > This all worked pretty poorly. Now we have cgroups, fancy memory > accounting, etc. So I'm wondering if NPROC is even useful anymore. I > don't have a brilliant idea of how to deprecate it, but I think it > wouldn't be entirely nuts to take it much less seriously and maybe > even eventually get rid of it. > > I doubt there is much existing userspace that would break if a > previously failing fork() started succeeding. I strongly disagree. I've been using it for a long time as a security measure. Setting NPROC to 0 after daemonizing remains a particularly effective and portable method to mitigate the possible consequences of an in-process intrusion. While I wouldn't care about approximate non-zero values, for me it would be a significant security regression to drop the inability to fork() when the limit is zero. Thus at least I do want to keep that feature when NPROC is zero. Willy ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: How should rlimits, suid exec, and capabilities interact? 2022-02-23 18:00 ` How should rlimits, suid exec, and capabilities interact? Eric W. Biederman 2022-02-23 19:44 ` Andy Lutomirski @ 2022-02-23 19:50 ` Linus Torvalds 2022-02-24 1:24 ` Eric W. Biederman 2022-02-24 1:32 ` Eric W. Biederman 1 sibling, 2 replies; 25+ messages in thread From: Linus Torvalds @ 2022-02-23 19:50 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux API, Etienne Dechamps, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, Linux Kernel Mailing List, open list:KERNEL SELFTEST FRAMEWORK, Linux Containers, Michal Koutný, Security Officers, Neil Brown, NeilBrown, Serge E. Hallyn, Jann Horn On Wed, Feb 23, 2022 at 10:00 AM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Which brings us to the practical issues of how all of these things are > wired together today. I honestly think you should treat the limits as "approximate". We do that for a number of reasons: - sometimes we have racy tests because we don't want to do excessive locking just for a limit: nobody cares if you can go a couple of entries past a limit because you were lucky, it's important that you can't go *much* past the limit. - sometimes the limits themselves are fuzzy (example: time. it's incremented by "ticks", but it's simply not that precise, and it depends a bit when the ticks happen) - sometimes it's ambiguous who we're talking about. I think suid execs tend to fall in that third category. Be generous. If the limit doesn't trigger at the suid exec, nobody cares. You want to make sure it triggers eventually. For example, let's say that you are the admin, and you made a mistake, and you had a runaway fork() bomb that was caught by the limits. Optimally, you still want to be able to be able to log in (one process that was root when it did the fork(), and did a 'setresuid()' or similar to drop the things, and then one process that does 'sudo' to get privileges to kill the darn fork bomb). See how that 'user' technically went over the limit, and that was A-OK! Basic rule: it's better to be too lenient than to be too strict. Linus ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: How should rlimits, suid exec, and capabilities interact? 2022-02-23 19:50 ` Linus Torvalds @ 2022-02-24 1:24 ` Eric W. Biederman 2022-02-24 1:41 ` Linus Torvalds 2022-02-24 1:32 ` Eric W. Biederman 1 sibling, 1 reply; 25+ messages in thread From: Eric W. Biederman @ 2022-02-24 1:24 UTC (permalink / raw) To: Linus Torvalds Cc: Linux API, Etienne Dechamps, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, Linux Kernel Mailing List, open list:KERNEL SELFTEST FRAMEWORK, Linux Containers, Michal Koutný, Security Officers, Neil Brown, NeilBrown, Serge E. Hallyn, Jann Horn, Andy Lutomirski, Willy Tarreau Linus Torvalds <linus@torvalds.org> writes: > Basic rule: it's better to be too lenient than to be too strict. Thank you. With that guideline I can explore the space of what is possible. Question: Running a suid program today charges the activity of that program to the user who ran that program, not to the user the program runs as. Does anyone see a problem with charging the user the program runs as? The reason I want to change who is charged with a process (besides it making more sense in my head) is so that capable(CAP_SYS_RESOURCE) can be used instead of the magic incantation (cred->user == INIT_USER). An accidental experiment happened in v5.14-rc1 in July when the ucount rlimit code was merged. It was only this last week when after Michal Koutný discovered the discrepency through code inspect a bug fix was merged. This changes the behavior that has existed in some form since Linux v1.0 when per user process limits were added. The original code in v1.0 looked like: > static int find_empty_process(void) > { > int free_task; > int i, tasks_free; > int this_user_tasks; > > repeat: > if ((++last_pid) & 0xffff8000) > last_pid=1; > this_user_tasks = 0; > tasks_free = 0; > free_task = -EAGAIN; > i = NR_TASKS; > while (--i > 0) { > if (!task[i]) { > free_task = i; > tasks_free++; > continue; > } > if (task[i]->uid == current->uid) > this_user_tasks++; > if (task[i]->pid == last_pid || task[i]->pgrp == last_pid || > task[i]->session == last_pid) > goto repeat; > } > if (tasks_free <= MIN_TASKS_LEFT_FOR_ROOT || > this_user_tasks > MAX_TASKS_PER_USER) > if (current->uid) > return -EAGAIN; > return free_task; > } Having tracked the use of real uid in limits back this far my guess is that it was an accident of the implementation and real uid vs effective uid had not be considered. Does anyone know if choosing the real uid was a deliberate decision anywhere in the history of Linux? Linus you were talking about making it possible to login as I think a non-root user to be able to use sudo and kill a fork bomb. The counter case is apache having a dedicated user for running cgi-scripts and using RLIMIT_NPROC to limit how many of those processes can exist. Unless I am misunderstanding something that looks exactly like your login as non-root so you can run sudo to kill a fork-bomb. A comment from an in-process cleanup patch explains this as best I can: /* * In general rlimits are only enforced when a new resource * is acquired. That would be during fork for RLIMIT_NPROC. * That is insufficient for RLIMIT_NPROC as many attributes of * a new process must be set between fork and exec. * * A case where this matter is when apache runs forks a process * and calls setuid to run cgi-scripts as a different user. * Generating those processes through a code sequence like: * * fork() * setrlimit(RLIMIT_NPROC, ...) * execve() -- suid wrapper * setuid() * execve() -- cgi script * * The cgi-scripts are unlikely to fork on their own so unless * RLIMIT_NPROC is checked after the user change and before * the cgi-script starts, RLIMIT_NPROC simply will not be enforced * for the cgi-scripts. * * So the code tracks if between fork and exec if an operation * occurs that could cause the RLIMIT_NPROC check to fail. If * such an operation has happened re-check RLIMIT_NPROC. */ Answered-Question: I was trying to ask if anyone knows of a reason why we can't just sanitize the rlimits of the process during suid exec? Linus your guideline would appear to allow that behavior. Unfortunately that looks like it would break current usage of apache suexec. Eric ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: How should rlimits, suid exec, and capabilities interact? 2022-02-24 1:24 ` Eric W. Biederman @ 2022-02-24 1:41 ` Linus Torvalds 2022-02-24 2:12 ` Eric W. Biederman 2022-02-24 3:00 ` How should rlimits, suid exec, and capabilities interact? David Laight 0 siblings, 2 replies; 25+ messages in thread From: Linus Torvalds @ 2022-02-24 1:41 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux API, Etienne Dechamps, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, Linux Kernel Mailing List, open list:KERNEL SELFTEST FRAMEWORK, Linux Containers, Michal Koutný, Security Officers, Neil Brown, NeilBrown, Serge E. Hallyn, Jann Horn, Andy Lutomirski, Willy Tarreau On Wed, Feb 23, 2022 at 5:24 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Question: Running a suid program today charges the activity of that > program to the user who ran that program, not to the user the program > runs as. Does anyone see a problem with charging the user the program > runs as? So I think that there's actually two independent issues with limits when you have situations like this where the actual user might be ambiguous. - the "who to charge" question - the "how do we *check* the limit" question and honestly, I think that when it comes to suid binaries, the first question is fundamentally ambiguous, because it almost certainly depends on the user. Which to me implies that there probably isn't an answer that is always right, and that what you should look at is that second option. So I would actually suggest that the "execute a suid binary" should charge the real user, but *because* it is suid, it should then not check the limit (or, perhaps, should check the hard limit?). You have to charge somebody, but at that point it's a bit ambiguous whether it should be allowed. Exactly so that if you're over a process limit (or something similar - think "too many files open" or whatever because you screwed up and opened everything) you could still log in as yourself (ssh/login charges some admin thing, which probably has high limits or is unlimited), and hopefully get shell access, and then be able to "exec sudo" to actually get admin access that should be disabled from the network. The above is just one (traditional) example of a fork/open bomb case where a user isn't really able to no longer function as himself, but wants to fix things (maybe the user has another terminal open, but then he can hopefully use a shell-buiiltin 'kill' instead). And I'm not saying it's "the thing that needs to work". I'm more making up an example. So I'm only saying that the above actually has two examples to the two sides of the coin: "login" lowering privileges to a user that may be over some limit - and succeeding despite that - and 'suid' succeeding despite the original user perhaps being over-committed. So it's intended exactly as an example of "picking the new or the old user would be wrong in either case if you check limits at the transition point". Hmm? Linus ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: How should rlimits, suid exec, and capabilities interact? 2022-02-24 1:41 ` Linus Torvalds @ 2022-02-24 2:12 ` Eric W. Biederman 2022-02-24 15:41 ` [PATCH] ucounts: Fix systemd LimigtNPROC with private users regression Eric W. Biederman 2022-02-24 3:00 ` How should rlimits, suid exec, and capabilities interact? David Laight 1 sibling, 1 reply; 25+ messages in thread From: Eric W. Biederman @ 2022-02-24 2:12 UTC (permalink / raw) To: Linus Torvalds Cc: Linux API, Etienne Dechamps, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, Linux Kernel Mailing List, open list:KERNEL SELFTEST FRAMEWORK, Linux Containers, Michal Koutný, Security Officers, Neil Brown, NeilBrown, Serge E. Hallyn, Jann Horn, Andy Lutomirski, Willy Tarreau Linus Torvalds <linus@torvalds.org> writes: > On Wed, Feb 23, 2022 at 5:24 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> Question: Running a suid program today charges the activity of that >> program to the user who ran that program, not to the user the program >> runs as. Does anyone see a problem with charging the user the program >> runs as? > > So I think that there's actually two independent issues with limits > when you have situations like this where the actual user might be > ambiguous. > > - the "who to charge" question > > - the "how do we *check* the limit" question > > and honestly, I think that when it comes to suid binaries, the first > question is fundamentally ambiguous, because it almost certainly > depends on the user. > > Which to me implies that there probably isn't an answer that is always > right, and that what you should look at is that second option. > > So I would actually suggest that the "execute a suid binary" should > charge the real user, but *because* it is suid, it should then not > check the limit (or, perhaps, should check the hard limit?). > > You have to charge somebody, but at that point it's a bit ambiguous > whether it should be allowed. > > Exactly so that if you're over a process limit (or something similar - > think "too many files open" or whatever because you screwed up and > opened everything) you could still log in as yourself (ssh/login > charges some admin thing, which probably has high limits or is > unlimited), and hopefully get shell access, and then be able to "exec > sudo" to actually get admin access that should be disabled from the > network. > > The above is just one (traditional) example of a fork/open bomb case > where a user isn't really able to no longer function as himself, but > wants to fix things (maybe the user has another terminal open, but > then he can hopefully use a shell-buiiltin 'kill' instead). > > And I'm not saying it's "the thing that needs to work". I'm more > making up an example. > > So I'm only saying that the above actually has two examples to the two > sides of the coin: "login" lowering privileges to a user that may be > over some limit - and succeeding despite that - and 'suid' succeeding > despite the original user perhaps being over-committed. > > So it's intended exactly as an example of "picking the new or the old > user would be wrong in either case if you check limits at the > transition point". > > Hmm? That doesn't really clarify anything for me. We have two checks one in fork and one in exec and you seem to be talking about the check in exec. The check I have problems with for a suid executable is the check in fork. If the new process is accounted to the previous user and we use the permissions of the effective user for checking it that does not make sense to me. If we can sort out that the check in fork. I think I have clarity about the other cases. The check in exec while clumsy and needing cleaning up seems to make sense to me. We have a transition that starts with fork and ends with exec and has operations like setuid in between. If something like setuid() is called before exec we check in exec. The case the check in exec is aimed at supporting are processes spawned from a parent that have a different user (than the parent) and will never call fork again. Those processes would be fundamentally immune to RLIMIT_NPROC if we don't check somewhere besides fork. There is existing code in apache to use RLIMIT_NPROC this way. For your login case I have no problems with it in principle. In practice I think you have to login as root to deal with a fork bomb that hits RLIMIT_NPROC and does not die gracefully. What I don't see about your login example is how it is practically different from the apache cgi script case, that the code has supported for 20 years, and that would be a regression if stopped supporting. If we want to stop supporting that case we can just remove all of the RLIMIT_NPROC tests everywhere except for fork, a nice cleanup. That still leaves me with mismatched effective vs real uid checks in fork when the effective and real uids don't match. Which means testing for root with "capable(CAP_SYS_ADMIN)" does not work. Which today is make the code a bit of a challenge to understand and work with. Eric ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH] ucounts: Fix systemd LimigtNPROC with private users regression 2022-02-24 2:12 ` Eric W. Biederman @ 2022-02-24 15:41 ` Eric W. Biederman 2022-02-24 16:28 ` Kees Cook 0 siblings, 1 reply; 25+ messages in thread From: Eric W. Biederman @ 2022-02-24 15:41 UTC (permalink / raw) To: linux-kernel Cc: Linux API, Etienne Dechamps, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, open list:KERNEL SELFTEST FRAMEWORK, Linux Containers, Michal Koutný, Security Officers, Neil Brown, NeilBrown, Serge E. Hallyn, Jann Horn, Andy Lutomirski, Willy Tarreau, Linus Torvalds Long story short recursively enforcing RLIMIT_NPROC when it is not enforced on the process that creates a new user namespace, causes currently working code to fail. There is no reason to enforce RLIMIT_NPROC recursively when we don't enforce it normally so update the code to detect this case. I would like to simply use capable(CAP_SYS_RESOURCE) to detect when RLIMIT_NPROC is not enforced upon the caller. Unfortunately because RLIMIT_NPROC is charged and checked for enforcement based upon the real uid, using capable() wich is euid based is inconsistent with reality. Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by testing for when the real uid would match the conditions when CAP_SYS_RESOURCE would be present if the real uid was the effective uid. Reported-by: Etienne Dechamps <etienne@edechamps.fr> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596 Link: https://lkml.kernel.org/r/e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- The previous conversation has given me enough clarity that I can see which tests I am comfortable with use for this pending regression fix. I have tested this and it works for me. Does anyone have any concerns with this change? kernel/user_namespace.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 6b2e3ca7ee99..5481ba44a8d6 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -58,6 +58,18 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) cred->user_ns = user_ns; } +static unsigned long enforced_nproc_rlimit(void) +{ + unsigned long limit = RLIM_INFINITY; + + /* Is RLIMIT_NPROC currently enforced? */ + if (!uid_eq(current_uid(), GLOBAL_ROOT_UID) || + (current_user_ns() != &init_user_ns)) + limit = rlimit(RLIMIT_NPROC); + + return limit; +} + /* * Create a new user namespace, deriving the creator from the user in the * passed credentials, and replacing that user with the new root user for the @@ -122,7 +134,7 @@ int create_user_ns(struct cred *new) for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) { ns->ucount_max[i] = INT_MAX; } - set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)); + set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, enforced_nproc_rlimit()); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE)); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING)); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK)); -- 2.29.2 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH] ucounts: Fix systemd LimigtNPROC with private users regression 2022-02-24 15:41 ` [PATCH] ucounts: Fix systemd LimigtNPROC with private users regression Eric W. Biederman @ 2022-02-24 16:28 ` Kees Cook 2022-02-24 18:53 ` Michal Koutný 2022-02-25 0:29 ` Eric W. Biederman 0 siblings, 2 replies; 25+ messages in thread From: Kees Cook @ 2022-02-24 16:28 UTC (permalink / raw) To: Eric W. Biederman Cc: linux-kernel, Linux API, Etienne Dechamps, Alexey Gladkov, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, open list:KERNEL SELFTEST FRAMEWORK, Linux Containers, Michal Koutný, Security Officers, Neil Brown, NeilBrown, Serge E. Hallyn, Jann Horn, Andy Lutomirski, Willy Tarreau, Linus Torvalds typo: Subject's LimigtNPROC -> LimitNPROC On Thu, Feb 24, 2022 at 09:41:44AM -0600, Eric W. Biederman wrote: > > Long story short recursively enforcing RLIMIT_NPROC when it is not > enforced on the process that creates a new user namespace, causes > currently working code to fail. There is no reason to enforce > RLIMIT_NPROC recursively when we don't enforce it normally so update > the code to detect this case. > > I would like to simply use capable(CAP_SYS_RESOURCE) to detect when > RLIMIT_NPROC is not enforced upon the caller. Unfortunately because > RLIMIT_NPROC is charged and checked for enforcement based upon the > real uid, using capable() wich is euid based is inconsistent with reality. typo: wich -> which > Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by > testing for when the real uid would match the conditions when > CAP_SYS_RESOURCE would be present if the real uid was the effective > uid. > > Reported-by: Etienne Dechamps <etienne@edechamps.fr> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596 > Link: https://lkml.kernel.org/r/e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr > Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > --- > > The previous conversation has given me enough clarity that I can see > which tests I am comfortable with use for this pending regression fix. > > I have tested this and it works for me. Does anyone have any concerns > with this change? I'd really love some kind of selftest that exercises the edge cases; do you have your tests in some form that could be converted? But otherwise, yes, this looks like the best option here. Reviewed-by: Kees Cook <keescook@chromium.org> > > kernel/user_namespace.c | 14 +++++++++++++- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > index 6b2e3ca7ee99..5481ba44a8d6 100644 > --- a/kernel/user_namespace.c > +++ b/kernel/user_namespace.c > @@ -58,6 +58,18 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) > cred->user_ns = user_ns; > } > > +static unsigned long enforced_nproc_rlimit(void) > +{ > + unsigned long limit = RLIM_INFINITY; > + > + /* Is RLIMIT_NPROC currently enforced? */ > + if (!uid_eq(current_uid(), GLOBAL_ROOT_UID) || > + (current_user_ns() != &init_user_ns)) > + limit = rlimit(RLIMIT_NPROC); > + > + return limit; > +} > + > /* > * Create a new user namespace, deriving the creator from the user in the > * passed credentials, and replacing that user with the new root user for the > @@ -122,7 +134,7 @@ int create_user_ns(struct cred *new) > for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) { > ns->ucount_max[i] = INT_MAX; > } > - set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)); > + set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, enforced_nproc_rlimit()); > set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE)); > set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING)); > set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK)); > -- > 2.29.2 > -- Kees Cook ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] ucounts: Fix systemd LimigtNPROC with private users regression 2022-02-24 16:28 ` Kees Cook @ 2022-02-24 18:53 ` Michal Koutný 2022-02-25 0:29 ` Eric W. Biederman 1 sibling, 0 replies; 25+ messages in thread From: Michal Koutný @ 2022-02-24 18:53 UTC (permalink / raw) To: Kees Cook Cc: Eric W. Biederman, linux-kernel, Linux API, Etienne Dechamps, Alexey Gladkov, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, open list:KERNEL SELFTEST FRAMEWORK, Linux Containers, Security Officers, Neil Brown, NeilBrown, Serge E. Hallyn, Jann Horn, Andy Lutomirski, Willy Tarreau, Linus Torvalds On Thu, Feb 24, 2022 at 08:28:41AM -0800, Kees Cook <keescook@chromium.org> wrote: > I'd really love some kind of selftest that exercises the edge cases; do > you have your tests in some form that could be converted? There's the original tools/testing/selftests/rlimits/rlimits-per-userns.c selftest. I've been rewriting it to cover more situations, I'm sending it as one monster patch (I'd need spend more time reordering my commits into some logical patch order) if anyone wishes to try it. I've tried it on 5c1ee569660d4a205dced9cb4d0306b907fb7599 + this Eric's patch. The test rlimit-per-userns-root passes - together with that I claim this patch Reviewed-by: Michal Koutný <mkoutny@suse.com> The test rlimit-per-userns-nonroot fails. It's similar off-by-one mistake as was in the fork path, but it's in the do_execveat_common(): if ((current->flags & PF_NPROC_EXCEEDED) && is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) { retval = -EAGAIN; goto out_ret; } (If RLIMIT_NPROC should be strictly honored, setuid+execve should fail when given uid's ucount is at the limit already.) Funnily, the original tools/testing/selftests/rlimits/rlimits-per-userns.c passes thanks to the off-by-one check even though it should not pass because unshare(2) is called after setuid(2). Michal -- >8 -- From be67d903f1f179f585bf302f6c2d2446f24263d7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Michal=20Koutn=C3=BD?= <mkoutny@suse.com> Date: Thu, 20 Jan 2022 19:32:54 +0100 Subject: [RFC PATCH] selftests: Rewrite RLIMIT_NPROC checks (in user namespaces) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This adds two test programs: - rlimit-per-userns-root -- creates user namespaces owned by root, - rlimit-per-userns-nonroot -- creates user namespaces owned by non-root. The forking tree: main [init_user_ns] ` service [user_ns_1] ` worker 1 ` worker 2 ... ` worker k ... ` service [user_ns_n] ` worker 1 ` worker 2 ... ` worker k Expectations rlimit-per-userns-root: n > RLIMIT_NPROC privileged user can spawn mutliple services in different user namespaces (k+1) <= RLIMIT_NPROC limit is honored within user namespace k >= RLIMIT_NPROC-1 separate per-user namespace counters Expectations rlimit-per-userns-nonroot: n <= RLIMIT_NPROC global RLIMIT_NPROC is honored (k+1) <= RLIMIT_NPROC limit is honored within user namespace Signed-off-by: Michal Koutný <mkoutny@suse.com> --- tools/testing/selftests/rlimits/Makefile | 6 +- .../rlimits/rlimits-per-userns-nonroot.c | 37 ++ .../rlimits/rlimits-per-userns-root.c | 34 ++ .../selftests/rlimits/rlimits-per-userns.c | 161 ------- .../selftests/rlimits/service_common.c | 400 ++++++++++++++++++ .../selftests/rlimits/service_common.h | 24 ++ 6 files changed, 500 insertions(+), 162 deletions(-) create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns-nonroot.c create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns-root.c delete mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c create mode 100644 tools/testing/selftests/rlimits/service_common.c create mode 100644 tools/testing/selftests/rlimits/service_common.h diff --git a/tools/testing/selftests/rlimits/Makefile b/tools/testing/selftests/rlimits/Makefile index 03aadb406212..8ccb92020206 100644 --- a/tools/testing/selftests/rlimits/Makefile +++ b/tools/testing/selftests/rlimits/Makefile @@ -1,6 +1,10 @@ # SPDX-License-Identifier: GPL-2.0-or-later CFLAGS += -Wall -O2 -g -TEST_GEN_PROGS := rlimits-per-userns +TEST_GEN_PROGS := rlimits-per-userns-root +TEST_GEN_PROGS += rlimits-per-userns-nonroot include ../lib.mk + +$(OUTPUT)/rlimits-per-userns-root: service_common.c +$(OUTPUT)/rlimits-per-userns-nonroot: service_common.c diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns-nonroot.c b/tools/testing/selftests/rlimits/rlimits-per-userns-nonroot.c new file mode 100644 index 000000000000..ccf021769f88 --- /dev/null +++ b/tools/testing/selftests/rlimits/rlimits-per-userns-nonroot.c @@ -0,0 +1,37 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Alexey Gladkov <gladkov.alexey@gmail.com> + * Author: Michal Koutný <mkoutny@suse.com> + */ +#include <err.h> +#include <sys/types.h> +#include <unistd.h> + +#include "../kselftest.h" +#include "service_common.h" + +int main(int argc, char **argv) +{ + struct services_ctx *ctx; + pid = getpid(); + + if (getenv(ENV_PARAM)) + return run_service(atoi(getenv(ENV_PARAM))); + + if (getuid() > 0) + errx(KSFT_SKIP, "This selftest must start as (global) root user."); + + warnx("(pid=%d) Starting testcase", pid); + + ctx = start_services(argv[0], UM_UNSHARE); + stop_services(ctx); + + if (count_services(ctx) > THE_LIMIT) + errx(KSFT_FAIL, "(pid=%d): Test failed, exec'd services > RLIMIT_NPROC", pid); + + if (check_services(ctx) < count_services(ctx)) + errx(KSFT_FAIL, "(pid=%d): Test failed, failed services", pid); + + warnx("(pid=%d): Test passed", pid); + exit(KSFT_PASS); +} diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns-root.c b/tools/testing/selftests/rlimits/rlimits-per-userns-root.c new file mode 100644 index 000000000000..3bf0149ac93d --- /dev/null +++ b/tools/testing/selftests/rlimits/rlimits-per-userns-root.c @@ -0,0 +1,34 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Alexey Gladkov <gladkov.alexey@gmail.com> + * Author: Michal Koutný <mkoutny@suse.com> + */ +#include <err.h> +#include <sys/types.h> +#include <unistd.h> + +#include "../kselftest.h" +#include "service_common.h" + +int main(int argc, char **argv) +{ + struct services_ctx *ctx; + pid = getpid(); + + if (getenv(ENV_PARAM)) + return run_service(atoi(getenv(ENV_PARAM))); + + if (getuid() > 0) + errx(KSFT_SKIP, "This selftest must start as (global) root user."); + + warnx("(pid=%d) Starting testcase", pid); + + ctx = start_services(argv[0], UM_CLONE_NEWUSER); + stop_services(ctx); + + if (check_services(ctx) != NR_SERVICES) + errx(KSFT_FAIL, "(pid=%d): Test failed, unexpected terminations", pid); + + warnx("(pid=%d): Test passed", pid); + exit(KSFT_PASS); +} diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c b/tools/testing/selftests/rlimits/rlimits-per-userns.c deleted file mode 100644 index 26dc949e93ea..000000000000 --- a/tools/testing/selftests/rlimits/rlimits-per-userns.c +++ /dev/null @@ -1,161 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-or-later -/* - * Author: Alexey Gladkov <gladkov.alexey@gmail.com> - */ -#define _GNU_SOURCE -#include <sys/types.h> -#include <sys/wait.h> -#include <sys/time.h> -#include <sys/resource.h> -#include <sys/prctl.h> -#include <sys/stat.h> - -#include <unistd.h> -#include <stdlib.h> -#include <stdio.h> -#include <string.h> -#include <sched.h> -#include <signal.h> -#include <limits.h> -#include <fcntl.h> -#include <errno.h> -#include <err.h> - -#define NR_CHILDS 2 - -static char *service_prog; -static uid_t user = 60000; -static uid_t group = 60000; - -static void setrlimit_nproc(rlim_t n) -{ - pid_t pid = getpid(); - struct rlimit limit = { - .rlim_cur = n, - .rlim_max = n - }; - - warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n); - - if (setrlimit(RLIMIT_NPROC, &limit) < 0) - err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid); -} - -static pid_t fork_child(void) -{ - pid_t pid = fork(); - - if (pid < 0) - err(EXIT_FAILURE, "fork"); - - if (pid > 0) - return pid; - - pid = getpid(); - - warnx("(pid=%d): New process starting ...", pid); - - if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0) - err(EXIT_FAILURE, "(pid=%d): prctl(PR_SET_PDEATHSIG)", pid); - - signal(SIGUSR1, SIG_DFL); - - warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group); - - if (setgid(group) < 0) - err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group); - if (setuid(user) < 0) - err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user); - - warnx("(pid=%d): Service running ...", pid); - - warnx("(pid=%d): Unshare user namespace", pid); - if (unshare(CLONE_NEWUSER) < 0) - err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)"); - - char *const argv[] = { "service", NULL }; - char *const envp[] = { "I_AM_SERVICE=1", NULL }; - - warnx("(pid=%d): Executing real service ...", pid); - - execve(service_prog, argv, envp); - err(EXIT_FAILURE, "(pid=%d): execve", pid); -} - -int main(int argc, char **argv) -{ - size_t i; - pid_t child[NR_CHILDS]; - int wstatus[NR_CHILDS]; - int childs = NR_CHILDS; - pid_t pid; - - if (getenv("I_AM_SERVICE")) { - pause(); - exit(EXIT_SUCCESS); - } - - service_prog = argv[0]; - pid = getpid(); - - warnx("(pid=%d) Starting testcase", pid); - - /* - * This rlimit is not a problem for root because it can be exceeded. - */ - setrlimit_nproc(1); - - for (i = 0; i < NR_CHILDS; i++) { - child[i] = fork_child(); - wstatus[i] = 0; - usleep(250000); - } - - while (1) { - for (i = 0; i < NR_CHILDS; i++) { - if (child[i] <= 0) - continue; - - errno = 0; - pid_t ret = waitpid(child[i], &wstatus[i], WNOHANG); - - if (!ret || (!WIFEXITED(wstatus[i]) && !WIFSIGNALED(wstatus[i]))) - continue; - - if (ret < 0 && errno != ECHILD) - warn("(pid=%d): waitpid(%d)", pid, child[i]); - - child[i] *= -1; - childs -= 1; - } - - if (!childs) - break; - - usleep(250000); - - for (i = 0; i < NR_CHILDS; i++) { - if (child[i] <= 0) - continue; - kill(child[i], SIGUSR1); - } - } - - for (i = 0; i < NR_CHILDS; i++) { - if (WIFEXITED(wstatus[i])) - warnx("(pid=%d): pid %d exited, status=%d", - pid, -child[i], WEXITSTATUS(wstatus[i])); - else if (WIFSIGNALED(wstatus[i])) - warnx("(pid=%d): pid %d killed by signal %d", - pid, -child[i], WTERMSIG(wstatus[i])); - - if (WIFSIGNALED(wstatus[i]) && WTERMSIG(wstatus[i]) == SIGUSR1) - continue; - - warnx("(pid=%d): Test failed", pid); - exit(EXIT_FAILURE); - } - - warnx("(pid=%d): Test passed", pid); - exit(EXIT_SUCCESS); -} diff --git a/tools/testing/selftests/rlimits/service_common.c b/tools/testing/selftests/rlimits/service_common.c new file mode 100644 index 000000000000..043c59828a03 --- /dev/null +++ b/tools/testing/selftests/rlimits/service_common.c @@ -0,0 +1,400 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Alexey Gladkov <gladkov.alexey@gmail.com> + * Author: Michal Koutný <mkoutny@suse.com> + */ +/* + * The forking tree: + * + * main [init_user_ns] + * ` service [user_ns_1] + * ` worker 1 + * ` worker 2 + * ... + * ` worker k + * ... + * ` service [user_ns_n] + * ` worker 1 + * ` worker 2 + * ... + * ` worker k + * + * Sequence (synchronization) diagram: + * main service + * ---- ------- + * setrlimit() + * service=clone([CLONE_NEWUSER]) + * define_maps() + * MAP_DEFINE -> + * setuid() + * [unshare(CLONE_NEWUSER)] + * <- UNSHARE + * rlimit_restore() + * RLIMIT_RESTORE -> + * execve() + * POST_EXEC -> + * + * Expectations UM_UNSHARE: + * + * n <= RLIMIT_NPROC global RLIMIT_NPROC is honored + * (k+1) <= RLIMIT_NPROC limit is honored within user namespace + * + * Expectations UM_CLONE_NEWUSER: + * + * n > RLIMIT_NPROC privileged user can spawn mutliple services in different user namespaces + * k >= RLIMIT_NPROC-1 separate per-user namespace counters + * (k+1) <= RLIMIT_NPROC limit is honored within user namespace + */ + +#define _GNU_SOURCE +#include <assert.h> +#include <err.h> +#include <errno.h> +#include <fcntl.h> +#include <limits.h> +#include <sched.h> +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/prctl.h> +#include <sys/resource.h> +#include <sys/socket.h> +#include <sys/stat.h> +#include <sys/time.h> +#include <sys/wait.h> +#include <unistd.h> + +#include "../kselftest.h" +#include "service_common.h" + +#define STACK_SIZE (2 * (1<<20)) +#define SERVICE_RUNTIME 250000 /* μs */ + +static_assert(NR_SERVICES > THE_LIMIT, "Services must exceed THE_LIMIT for effective test."); +static_assert(NR_WORKERS > THE_LIMIT-1, "Need enough workers to challenge THE_LIIMT."); + +static struct services_ctx { + int control_fd[NR_SERVICES]; + pid_t child[NR_SERVICES]; + int wstatus[NR_SERVICES]; + int fork_ed; + int exec_ed; +} services_ctx; + +static uid_t user = 60000; +static uid_t group = 60000; +static struct rlimit saved_limit; + +static struct { + int control_fd; + const char *pathname; + enum userns_mode um; +} child_args; + +pid_t pid; + +static void setrlimit_nproc(rlim_t n) +{ + struct rlimit limit = { + .rlim_cur = n, + .rlim_max = n + }; + if (getrlimit(RLIMIT_NPROC, &saved_limit) < 0) + err(KSFT_FAIL, "(pid=%d): getrlimit(RLIMIT_NPROC)", pid); + + if (setrlimit(RLIMIT_NPROC, &limit) < 0) + err(KSFT_FAIL, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid); + + warnx("(pid=%d): Set RLIMIT_NPROC=%ld", pid, n); +} + +static void restore_rlimit_nproc(void) +{ + if (setrlimit(RLIMIT_NPROC, &saved_limit) < 0) + err(KSFT_FAIL, "(pid=%d): setrlimit(RLIMIT_NPROC, saved)", pid); + warnx("(pid=%d) Restored RLIMIT_NPROC", pid); +} + +enum msg_sync { + MAP_DEFINE, + UNSHARE, + RLIMIT_RESTORE, + POST_EXEC, +}; + +static int _sync_notify(int fd, enum msg_sync m) +{ + char tmp = m; + + return write(fd, &tmp, 1); +} +static void sync_notify(int fd, enum msg_sync m) +{ + if (_sync_notify(fd, m) < 0) + warnx("(pid=%d): failed sync-write", pid); +} + +static void sync_wait(int fd, enum msg_sync m) +{ + char tmp; + + if (read(fd, &tmp, 1) < 0) + warn("(pid=%d): failed sync-read", pid); + else if (tmp != m) + warnx("(pid=%d): unexpected sync", pid); +} + +static int define_maps(pid_t child_pid, enum userns_mode um) +{ + FILE *f; + char filename[PATH_MAX]; + + if (um != UM_CLONE_NEWUSER) + return 0; + + snprintf(filename, PATH_MAX, "/proc/%i/uid_map", child_pid); + f = fopen(filename, "w"); + if (fprintf(f, "%i %i 1\n", user, user) < 0) + return -1; + fclose(f); + + snprintf(filename, PATH_MAX, "/proc/%i/gid_map", child_pid); + f = fopen(filename, "w"); + if (fprintf(f, "%i %i 1\n", group, group) < 0) + return -1; + fclose(f); + + return 0; +} + +static int setup_and_exec(void *arg) +{ + int control_fd = child_args.control_fd; + + pid = getpid(); + warnx("(pid=%d): New process starting ...", pid); + + signal(SIGUSR1, SIG_DFL); + + sync_wait(control_fd, MAP_DEFINE); + warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group); + + if (setgid(group) < 0) + err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group); + if (setuid(user) < 0) + err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user); + + warnx("(pid=%d): Service running ...", pid); + + if (child_args.um == UM_UNSHARE) { + warnx("(pid=%d): Unshare user namespace", pid); + if (unshare(CLONE_NEWUSER) < 0) + err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)"); + } + + sync_notify(control_fd, UNSHARE); + sync_wait(control_fd, RLIMIT_RESTORE); + + char *param = NULL; + asprintf(¶m, ENV_PARAM "=%i", child_args.um); + char *const argv[] = { "service", NULL }; + char *const envp[] = { param, NULL }; + + warnx("(pid=%d): Executing real service ...", pid); + + execve(child_args.pathname, argv, envp); + + /* stay around until parent notifies/signals */ + warn("(pid=%d): execve failed", pid); + sync_wait(control_fd, POST_EXEC); + pause(); + return 0; +} + +static pid_t start_child(const char *pathname, int control_fd, enum userns_mode um) +{ + char *stack = malloc(STACK_SIZE); + int flags = um == UM_CLONE_NEWUSER ? CLONE_NEWUSER : 0; + pid_t new_pid; + + /* Pass via global variable to child */ + child_args.control_fd = control_fd; + child_args.pathname = pathname; + child_args.um = um; + + new_pid = clone(setup_and_exec, stack+STACK_SIZE-1, flags, NULL); + + free(stack); + close(control_fd); + return new_pid; +} + +static void dump_context(size_t n_workers) +{ + struct rlimit limit; + char user_ns[PATH_MAX]; + ssize_t len; + + if (getrlimit(RLIMIT_NPROC, &limit) < 0) + err(EXIT_FAILURE, "(pid=%d) failed getrlimit", pid); + if ((len = readlink("/proc/self/ns/user", user_ns, PATH_MAX)) < 0) + err(EXIT_FAILURE, "(pid=%d) failed readlink", pid); + user_ns[len] = 0; + + warnx("(pid=%d) Service instance attempts %lu workers, limit %lu:%lu, ns=%s", + pid, n_workers, limit.rlim_cur, limit.rlim_max, user_ns); +} + +int run_service(enum userns_mode um) +{ + size_t i; + pid_t worker[NR_WORKERS]; + int ret = EXIT_SUCCESS; + + dump_context(NR_WORKERS); + + /* test RLIMIT_NPROC inside the service, last worker should fail because of service itself */ + for (i = 0; i < NR_WORKERS; i++) { + worker[i] = fork(); + if (worker[i] == 0) { + /* service worker */ + pause(); + exit(EXIT_SUCCESS); + } + if (worker[i] < 0) { + warn("(pid=%d) service fork %lu failed", pid, i+1); + if (um == UM_CLONE_NEWUSER && !(i >= (THE_LIMIT-1) && errno == EAGAIN)) + ret = EXIT_FAILURE; + } else if (i >= (THE_LIMIT-1)) { + warnx("(pid=%d) RLIMIT_NPROC not honored", pid); + ret = EXIT_FAILURE; + } + } + + /* service cleanup */ + for (i = 0; i < NR_WORKERS; i++) + if (worker[i] > 0) + kill(worker[i], SIGUSR1); + + for (i = 0; i < NR_WORKERS; i++) + if (worker[i] > 0) + waitpid(worker[i], NULL, WNOHANG); + + if (ret) { + warnx("(pid=%d) service failed, ret=%i", pid, ret); + return ret; + } + /* we must get here before SERVICE_RUNTIME elapses */ + pause(); + return EXIT_FAILURE; +} + +struct services_ctx *start_services(const char *pathname, enum userns_mode um) +{ + size_t i; + int sockets[2]; + struct services_ctx *ctx = &services_ctx; + + signal(SIGPIPE, SIG_IGN); + setrlimit_nproc(THE_LIMIT); + ctx->fork_ed = 0; + ctx->exec_ed = 0; + for (i = 0; i < NR_SERVICES; i++) { + if (socketpair(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0, sockets) < 0) + err(KSFT_FAIL, "(pid=%d) socketpair failed", pid); + ctx->control_fd[i] = sockets[0]; + ctx->child[i] = start_child(pathname, sockets[1], um); + ctx->wstatus[i] = 0; + if (ctx->child[i] < 0) + continue; + ctx->fork_ed++; + + if (define_maps(ctx->child[i], um) < 0) + err(KSFT_FAIL, "(pid=%d) user_ns maps definition failed", pid); + sync_notify(ctx->control_fd[i], MAP_DEFINE); + } + + for (i = 0; i < NR_SERVICES; i++) + sync_wait(ctx->control_fd[i], UNSHARE); + restore_rlimit_nproc(); + + for (i = 0; i < NR_SERVICES; i++) { + sync_notify(ctx->control_fd[i], RLIMIT_RESTORE); + } + + return ctx; +} + +void stop_services(struct services_ctx *ctx) +{ + size_t i; + int children = ctx->fork_ed; + + /* Well behaved service would pause() and wait for our SIGUSR1, if it + * failed check it early. + */ + while (1) { + for (i = 0; i < NR_SERVICES; i++) { + if (ctx->child[i] <= 0) + continue; + + errno = 0; + pid_t ret = waitpid(ctx->child[i], &ctx->wstatus[i], WNOHANG | __WALL); + + if (!ret) + continue; + + if (ret < 0 && errno != ECHILD) + warn("(pid=%d): waitpid(%d)", pid, ctx->child[i]); + + ctx->child[i] *= -1; + children -= 1; + } + + if (!children) + break; + + usleep(SERVICE_RUNTIME); + + for (i = 0; i < NR_SERVICES; i++) { + if (ctx->child[i] <= 0) + continue; + if (_sync_notify(ctx->control_fd[i], POST_EXEC) < 0 && + (errno == EPIPE || errno == ECONNREFUSED)) + ctx->exec_ed++; + close(ctx->control_fd[i]); + kill(ctx->child[i], SIGUSR1); + } + } + + warnx("(pid=%d): stats: fork_ed=%i exec_ed=%i", pid, ctx->fork_ed, ctx->exec_ed); +} + +int count_services(struct services_ctx *ctx) +{ + return ctx->exec_ed; +} + +int check_services(struct services_ctx *ctx) +{ + size_t i; + int correct = 0; + + for (i = 0; i < NR_SERVICES; i++) { + if (WIFEXITED(ctx->wstatus[i])) + warnx("(pid=%d): pid %d exited, status=%d", + pid, -ctx->child[i], WEXITSTATUS(ctx->wstatus[i])); + else if (WIFSIGNALED(ctx->wstatus[i])) + warnx("(pid=%d): pid %d killed by signal %d", + pid, -ctx->child[i], WTERMSIG(ctx->wstatus[i])); + + /* The only acceptable service termination */ + if (WIFSIGNALED(ctx->wstatus[i]) && WTERMSIG(ctx->wstatus[i]) == SIGUSR1) + correct++; + } + + return correct; +} + + diff --git a/tools/testing/selftests/rlimits/service_common.h b/tools/testing/selftests/rlimits/service_common.h new file mode 100644 index 000000000000..4a3cd929d865 --- /dev/null +++ b/tools/testing/selftests/rlimits/service_common.h @@ -0,0 +1,24 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +#include <sys/types.h> + +#define THE_LIMIT 4 +#define NR_SERVICES 5 +#define NR_WORKERS 5 + +#define ENV_PARAM "I_AM_SERVICE" + +enum userns_mode { + UM_UNSHARE, /* setrlimit,clone(0),setuid,unshare,execve */ + UM_CLONE_NEWUSER, /* setrlimit,clone(NEWUSER),setuid,execve */ +}; + +struct services_ctx; + +/* Cache current pid */ +extern pid_t pid; + +int run_service(enum userns_mode um); +struct services_ctx *start_services(const char *pathname, enum userns_mode um); +void stop_services(struct services_ctx *ctx); +int count_services(struct services_ctx *ctx); +int check_services(struct services_ctx *ctx); -- 2.34.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH] ucounts: Fix systemd LimigtNPROC with private users regression 2022-02-24 16:28 ` Kees Cook 2022-02-24 18:53 ` Michal Koutný @ 2022-02-25 0:29 ` Eric W. Biederman 1 sibling, 0 replies; 25+ messages in thread From: Eric W. Biederman @ 2022-02-25 0:29 UTC (permalink / raw) To: Kees Cook Cc: linux-kernel, Linux API, Etienne Dechamps, Alexey Gladkov, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, open list:KERNEL SELFTEST FRAMEWORK, Linux Containers, Michal Koutný, Security Officers, Neil Brown, NeilBrown, Serge E. Hallyn, Jann Horn, Andy Lutomirski, Willy Tarreau, Linus Torvalds Kees Cook <keescook@chromium.org> writes: > typo: Subject's LimigtNPROC -> LimitNPROC > > On Thu, Feb 24, 2022 at 09:41:44AM -0600, Eric W. Biederman wrote: >> >> Long story short recursively enforcing RLIMIT_NPROC when it is not >> enforced on the process that creates a new user namespace, causes >> currently working code to fail. There is no reason to enforce >> RLIMIT_NPROC recursively when we don't enforce it normally so update >> the code to detect this case. >> >> I would like to simply use capable(CAP_SYS_RESOURCE) to detect when >> RLIMIT_NPROC is not enforced upon the caller. Unfortunately because >> RLIMIT_NPROC is charged and checked for enforcement based upon the >> real uid, using capable() wich is euid based is inconsistent with reality. > > typo: wich -> which Ahh... Typos. >> Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by >> testing for when the real uid would match the conditions when >> CAP_SYS_RESOURCE would be present if the real uid was the effective >> uid. >> >> Reported-by: Etienne Dechamps <etienne@edechamps.fr> >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596 >> Link: https://lkml.kernel.org/r/e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr >> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> >> --- >> >> The previous conversation has given me enough clarity that I can see >> which tests I am comfortable with use for this pending regression fix. >> >> I have tested this and it works for me. Does anyone have any concerns >> with this change? > > I'd really love some kind of selftest that exercises the edge cases; do > you have your tests in some form that could be converted? > > But otherwise, yes, this looks like the best option here. Let's start with Michal Koutný tests. I keep forgetting to look at them. This cold has really been kicking my butt. For this issue the test case was a systemd unit file. Which is simple and demonstrates the real-world regression but not really minimal in the way a kernel selftest should be. > Reviewed-by: Kees Cook <keescook@chromium.org> > >> >> kernel/user_namespace.c | 14 +++++++++++++- >> 1 file changed, 13 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c >> index 6b2e3ca7ee99..5481ba44a8d6 100644 >> --- a/kernel/user_namespace.c >> +++ b/kernel/user_namespace.c >> @@ -58,6 +58,18 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) >> cred->user_ns = user_ns; >> } >> >> +static unsigned long enforced_nproc_rlimit(void) >> +{ >> + unsigned long limit = RLIM_INFINITY; >> + >> + /* Is RLIMIT_NPROC currently enforced? */ >> + if (!uid_eq(current_uid(), GLOBAL_ROOT_UID) || >> + (current_user_ns() != &init_user_ns)) >> + limit = rlimit(RLIMIT_NPROC); >> + >> + return limit; >> +} >> + >> /* >> * Create a new user namespace, deriving the creator from the user in the >> * passed credentials, and replacing that user with the new root user for the >> @@ -122,7 +134,7 @@ int create_user_ns(struct cred *new) >> for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) { >> ns->ucount_max[i] = INT_MAX; >> } >> - set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)); >> + set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, enforced_nproc_rlimit()); >> set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE)); >> set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING)); >> set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK)); >> -- >> 2.29.2 >> Eric ^ permalink raw reply [flat|nested] 25+ messages in thread
* RE: How should rlimits, suid exec, and capabilities interact? 2022-02-24 1:41 ` Linus Torvalds 2022-02-24 2:12 ` Eric W. Biederman @ 2022-02-24 3:00 ` David Laight 1 sibling, 0 replies; 25+ messages in thread From: David Laight @ 2022-02-24 3:00 UTC (permalink / raw) To: 'Linus Torvalds', Eric W. Biederman Cc: Linux API, Etienne Dechamps, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, Linux Kernel Mailing List, open list:KERNEL SELFTEST FRAMEWORK, Linux Containers, Michal Koutný, Security Officers, Neil Brown, NeilBrown, Serge E. Hallyn, Jann Horn, Andy Lutomirski, Willy Tarreau From: Linus Torvalds > Sent: 24 February 2022 01:42 > > On Wed, Feb 23, 2022 at 5:24 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > > > Question: Running a suid program today charges the activity of that > > program to the user who ran that program, not to the user the program > > runs as. Does anyone see a problem with charging the user the program > > runs as? > > So I think that there's actually two independent issues with limits > when you have situations like this where the actual user might be > ambiguous. > > - the "who to charge" question > > - the "how do we *check* the limit" question > > and honestly, I think that when it comes to suid binaries, the first > question is fundamentally ambiguous, because it almost certainly > depends on the user. Doesn't the rlimit check happen during the fork. At which time you don't know that a suid exec might follow? The problem with changing the uid is that when the process exits you need to "uncharge" the correct uid. So either you need to remember the original uid or setuid has to transfer the charge (whichever uid is used). If you transfer the charge then the setuid system call can't fail. But a later exec can fail. Any check will always be done against the process's own rlimit value. Set that to zero and fork should fail regardless of which uid's process count is checked. Now a normal suid program only changes the effective uid. So keeping the process charged against the real uid makes sense. If a process changes its real uid you could change the charged uid but you can't error if over the rlimit value. OTOH during a later exec you can test things and exec can fail. At least one unix I've used has three uids for each process. The 'real uid', 'effective uid' and 'saved by exec uid'. I suspect the process is always "charged" against the latter. I think that exec compares the 'real' and 'saved by exec' uids and, if different, moves the charge to the real uid (which will check rlimit) then sets the 'saved by exec uid' to the real uid. So an exec after a setuid() can be allowed to fail if the real user has too many processes. But in all other cases exec just works regardless of the process count for any user. > > Which to me implies that there probably isn't an answer that is always > right, and that what you should look at is that second option. > > So I would actually suggest that the "execute a suid binary" should > charge the real user, but *because* it is suid, it should then not > check the limit (or, perhaps, should check the hard limit?). > > You have to charge somebody, but at that point it's a bit ambiguous > whether it should be allowed. > > Exactly so that if you're over a process limit (or something similar - > think "too many files open" or whatever because you screwed up and > opened everything) you could still log in as yourself (ssh/login > charges some admin thing, which probably has high limits or is > unlimited), and hopefully get shell access, and then be able to "exec > sudo" to actually get admin access that should be disabled from the > network. You usually have to use 'rsh machine sh -i' to avoid the shell running all its startup scripts. But I doubt that will get you past a fork bomb. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: How should rlimits, suid exec, and capabilities interact? 2022-02-23 19:50 ` Linus Torvalds 2022-02-24 1:24 ` Eric W. Biederman @ 2022-02-24 1:32 ` Eric W. Biederman 1 sibling, 0 replies; 25+ messages in thread From: Eric W. Biederman @ 2022-02-24 1:32 UTC (permalink / raw) To: Linus Torvalds Cc: Linux API, Etienne Dechamps, Alexey Gladkov, Kees Cook, Shuah Khan, Christian Brauner, Solar Designer, Ran Xiaokai, Linux Kernel Mailing List, open list:KERNEL SELFTEST FRAMEWORK, Linux Containers, Michal Koutný, Security Officers, Neil Brown, NeilBrown, Serge E. Hallyn, Jann Horn, Andy Lutomirski, Willy Tarreau Linus Torvalds <linus@torvalds.org> writes: > Basic rule: it's better to be too lenient than to be too strict. Thank you. With that guideline I can explore the space of what is possible. Question: Running a suid program today charges the activity of that program to the user who ran that program, not to the user the program runs as. Does anyone see a problem with charging the user the program runs as? The reason I want to change which user is charged with a process (besides it making more sense in my head) is so that "capable(CAP_SYS_RESOURCE)" can be used instead of the magic incantation "(cred->user == INIT_USER)". Today "capable(CAP_SYS_RESOURCE)" with respect to RLIMIT_NPROC is effectively meaningless for suid programs because the of the mismatch of charging the real user with the effective users credentials. An accidental experiment happened in v5.14-rc1 in July when the ucount rlimit code was merged. It was only this last week when after Michal Koutný discovered the discrepancy through code inspection I merged a bug fix because the code was not preserving the existing behavior as intended. This behavior has existed in some form since Linux v1.0 when per user process limits were added. The original code in v1.0 was: > static int find_empty_process(void) > { > int free_task; > int i, tasks_free; > int this_user_tasks; > > repeat: > if ((++last_pid) & 0xffff8000) > last_pid=1; > this_user_tasks = 0; > tasks_free = 0; > free_task = -EAGAIN; > i = NR_TASKS; > while (--i > 0) { > if (!task[i]) { > free_task = i; > tasks_free++; > continue; > } > if (task[i]->uid == current->uid) > this_user_tasks++; > if (task[i]->pid == last_pid || task[i]->pgrp == last_pid || > task[i]->session == last_pid) > goto repeat; > } > if (tasks_free <= MIN_TASKS_LEFT_FOR_ROOT || > this_user_tasks > MAX_TASKS_PER_USER) > if (current->uid) > return -EAGAIN; > return free_task; > } Having tracked the use of real uid in limits back this far my guess is that it was an accident of the implementation and real uid vs effective uid had not be considered. Does anyone know if choosing the real uid vs the effective uid for accounting a users processes was a deliberate decision anywhere in the history of Linux? Linus you were talking about making it possible to login as I think a non-root user to be able to use sudo and kill a fork bomb. The counter case is apache having a dedicated user for running cgi-scripts and using RLIMIT_NPROC to limit how many of those processes can exist. Unless I am misunderstanding something that looks exactly like your login as non-root so you can run sudo to kill a fork-bomb. A comment from an in-process cleanup patch explains this as best I can: /* * In general rlimits are only enforced when a new resource * is acquired. That would be during fork for RLIMIT_NPROC. * That is insufficient for RLIMIT_NPROC as many attributes of * a new process must be set between fork and exec. * * A case where this matter is when apache runs forks a process * and calls setuid to run cgi-scripts as a different user. * Generating those processes through a code sequence like: * * fork() * setrlimit(RLIMIT_NPROC, ...) * execve() -- suid wrapper * setuid() * execve() -- cgi script * * The cgi-scripts are unlikely to fork on their own so unless * RLIMIT_NPROC is checked after the user change and before * the cgi-script starts, RLIMIT_NPROC simply will not be enforced * for the cgi-scripts. * * So the code tracks if between fork and exec if an operation * occurs that could cause the RLIMIT_NPROC check to fail. If * such an operation has happened re-check RLIMIT_NPROC. */ Answered-Question: I was trying to ask if anyone knows of a reason why we can't just sanitize the rlimits of the process during suid exec? Linus your guideline would appear to allow that behavior. Unfortunately that looks like it would break current usage of apache suexec. Eric ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2022-03-03 0:30 UTC | newest] Thread overview: 25+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <20220207121800.5079-1-mkoutny@suse.com> [not found] ` <87o83e2mbu.fsf@email.froward.int.ebiederm.org> 2022-02-16 15:56 ` [PATCH v2 0/5] ucounts: RLIMIT_NPROC fixes Eric W. Biederman 2022-02-16 15:58 ` [PATCH v2 1/5] rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user Eric W. Biederman 2022-02-16 17:42 ` Solar Designer 2022-02-16 15:58 ` [PATCH v2 2/5] ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1 Eric W. Biederman 2022-02-16 15:58 ` [PATCH v2 3/5] ucounts: Base set_cred_ucounts changes on the real user Eric W. Biederman 2022-02-16 15:58 ` [PATCH v2 4/5] ucounts: Move RLIMIT_NPROC handling after set_user Eric W. Biederman 2022-02-16 15:58 ` [PATCH v2 5/5] ucounts: Handle wrapping in is_ucounts_overlimit Eric W. Biederman 2022-02-16 17:28 ` Shuah Khan 2022-02-18 15:34 ` [GIT PULL] ucounts: RLIMIT_NPROC fixes for v5.17 Eric W. Biederman 2022-02-20 19:05 ` pr-tracker-bot 2022-03-03 0:12 ` [GIT PULL] ucounts: Regression fix " Eric W. Biederman 2022-03-03 0:30 ` pr-tracker-bot [not found] ` <e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr> [not found] ` <20220215101150.GD21589@blackbody.suse.cz> [not found] ` <87zgmi5rhm.fsf@email.froward.int.ebiederm.org> 2022-02-23 18:00 ` How should rlimits, suid exec, and capabilities interact? Eric W. Biederman 2022-02-23 19:44 ` Andy Lutomirski 2022-02-23 21:28 ` Willy Tarreau 2022-02-23 19:50 ` Linus Torvalds 2022-02-24 1:24 ` Eric W. Biederman 2022-02-24 1:41 ` Linus Torvalds 2022-02-24 2:12 ` Eric W. Biederman 2022-02-24 15:41 ` [PATCH] ucounts: Fix systemd LimigtNPROC with private users regression Eric W. Biederman 2022-02-24 16:28 ` Kees Cook 2022-02-24 18:53 ` Michal Koutný 2022-02-25 0:29 ` Eric W. Biederman 2022-02-24 3:00 ` How should rlimits, suid exec, and capabilities interact? David Laight 2022-02-24 1:32 ` Eric W. Biederman
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).