From: ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman)
To: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
Subject: Re: [RFC PATCH] rlimit: Account nproc per-usernamespace/per-user
Date: Tue, 01 Nov 2016 10:01:57 -0500 [thread overview]
Message-ID: <8760o7tfa2.fsf@xmission.com> (raw)
In-Reply-To: <1477485627-16177-1-git-send-email-kernel-6AxghH7DbtA@public.gmane.org> (Nikolay Borisov's message of "Wed, 26 Oct 2016 15:40:27 +0300")
Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org> writes:
> There are container setups which map the same kuids to
> different containers. In such situation what will happen is
> that same uid's in different containers will map to the same
> underlying user on the matchine (e.g. same struct user). One
> implication of this is that the number of processes for that
> particular user are going to be shared among all the same uids
> in the container. This is problematic, as it means a user in
> containerA can potentially exhaust the process limit such that
> a user in containerB cannot spawn any processes.
>
> Fix this by utilising the newly introduced UCOUNT infrastructure,
> so that process counts are now being accounted per-userns/per-user.
> In case when a machine is running in non-container setup then
> the accounting semantics is virtually the same, given that there is
> going to be one ucount-per-user struct.
>
> Signed-off-by: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
> ---
> Hello Eric,
>
> Now that ucount has been merged I took the liberty of
> experimenting what the nproc-per-userns approach would
> look like and here is the result. In my previous [0] you
> expressed concerns regarding not having hierarchical limits,
> essentially allowing users to circumvent their limits. So
> this version doesn't do anything to rectify this issue.
> However, I do think the idea of the rlimit is not to
> forbid the user of spawning more processes, since currently
> the limit is being set on a per-process basis, so a malicious
> process can in fact increase that limits and spawn a large number
> of processes. Given this I believe the current patch doesn't make
> the situation any worse in that regard.
I think conceptually this can work.
Reading through the code I don't see anything capturing the current
processes RLIMIT_NPROC when creating a new user namespace. Which is
critical if the limits are going to be enforced hierarchically.
I have a small concern that we toss the per user accounting entirely as
that means we loose a little in ensuring one uid does not have too many
processes. But that is comparatively minor.
I am buried with Kernel Summit and Plumbers this week, so I will be slow
on review etc.
But overall I think this a viable approach assuming there are no
performance issues in structuring things this way.
Eric
> [0] https://lists.linux-foundation.org/pipermail/containers/2015-September/036229.html
>
> fs/exec.c | 2 +-
> include/linux/cred.h | 1 +
> include/linux/sched.h | 1 -
> include/linux/user_namespace.h | 8 ++++++++
> kernel/cred.c | 18 +++++++++++-------
> kernel/exit.c | 2 +-
> kernel/fork.c | 7 +++++--
> kernel/sys.c | 2 +-
> kernel/ucount.c | 16 ++++++++++++++++
> kernel/user.c | 1 -
> 10 files changed, 44 insertions(+), 14 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 6fcfb3f7b137..8126e00a8d3e 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1651,7 +1651,7 @@ static int do_execveat_common(int fd, struct filename *filename,
> * whether NPROC limit is still exceeded.
> */
> if ((current->flags & PF_NPROC_EXCEEDED) &&
> - atomic_read(¤t_user()->processes) > rlimit(RLIMIT_NPROC)) {
> + atomic_read(¤t_cred()->ucounts->ulimit[ULIMIT_NPROC]) > rlimit(RLIMIT_NPROC)) {
> retval = -EAGAIN;
> goto out_ret;
> }
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index f0e70a1bb3ac..c3bb4d20ff16 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -141,6 +141,7 @@ struct cred {
> void *security; /* subjective LSM security */
> #endif
> struct user_struct *user; /* real user ID subscription */
> + struct ucounts *ucounts; /* container for per-userns/per-counts */
> struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
> struct group_info *group_info; /* supplementary groups for euid/fsgid */
> struct rcu_head rcu; /* RCU deletion hook */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 348f51b0ec92..3709d70fe2de 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -842,7 +842,6 @@ static inline int signal_group_exit(const struct signal_struct *sig)
> */
> struct user_struct {
> atomic_t __count; /* reference count */
> - atomic_t processes; /* How many processes does this user have? */
> atomic_t sigpending; /* How many pending signals does this user have? */
> #ifdef CONFIG_INOTIFY_USER
> atomic_t inotify_watches; /* How many inotify watches does this user have? */
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index eb209d4523f5..d4fc31ae9d85 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -35,6 +35,11 @@ enum ucount_type {
> UCOUNT_COUNTS,
> };
>
> +enum ulimit_type {
> + ULIMIT_NPROC,
> + ULIMIT_COUNTS,
> +};
> +
> struct user_namespace {
> struct uid_gid_map uid_map;
> struct uid_gid_map gid_map;
> @@ -67,6 +72,7 @@ struct ucounts {
> kuid_t uid;
> atomic_t count;
> atomic_t ucount[UCOUNT_COUNTS];
> + atomic_t ulimit[ULIMIT_COUNTS];
> };
>
> extern struct user_namespace init_user_ns;
> @@ -75,6 +81,8 @@ bool setup_userns_sysctls(struct user_namespace *ns);
> void retire_userns_sysctls(struct user_namespace *ns);
> struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
> void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
> +struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid, enum ulimit_type type);
> +void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type);
>
> #ifdef CONFIG_USER_NS
>
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 5f264fb5737d..dd9ffc0396bb 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -330,13 +330,16 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
> #endif
> clone_flags & CLONE_THREAD
> ) {
> - p->real_cred = get_cred(p->cred);
> + struct cred *c;
> + struct ucounts *uc = inc_ulimit(p->cred->user_ns, p->cred->uid,
> + ULIMIT_NPROC);
> + p->real_cred = c = get_cred(p->cred);
> get_cred(p->cred);
> alter_cred_subscribers(p->cred, 2);
> kdebug("share_creds(%p{%d,%d})",
> p->cred, atomic_read(&p->cred->usage),
> read_cred_subscribers(p->cred));
> - atomic_inc(&p->cred->user->processes);
> + c->ucounts = uc;
> return 0;
> }
>
> @@ -369,7 +372,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
> }
> #endif
>
> - atomic_inc(&new->user->processes);
> + new->ucounts = inc_ulimit(new->user_ns, new->uid,
> + ULIMIT_NPROC);
> p->cred = p->real_cred = get_cred(new);
> alter_cred_subscribers(new, 2);
> validate_creds(new);
> @@ -461,12 +465,12 @@ int commit_creds(struct cred *new)
> * in set_user().
> */
> alter_cred_subscribers(new, 2);
> - if (new->user != old->user)
> - atomic_inc(&new->user->processes);
> + if (new->user != old->user || old->user_ns != new->user_ns)
> + new->ucounts = inc_ulimit(new->user_ns, new->uid, ULIMIT_NPROC);
> rcu_assign_pointer(task->real_cred, new);
> rcu_assign_pointer(task->cred, new);
> - if (new->user != old->user)
> - atomic_dec(&old->user->processes);
> + if (new->user != old->user || old->user_ns != new->user_ns)
> + dec_ulimit(old->ucounts, ULIMIT_NPROC);
> alter_cred_subscribers(old, -2);
>
> /* send notifications */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 9d68c45ebbe3..7ac8954230bd 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -173,7 +173,7 @@ void release_task(struct task_struct *p)
> /* don't need to get the RCU readlock here - the process is dead and
> * can't be modifying its own credentials. But shut RCU-lockdep up */
> rcu_read_lock();
> - atomic_dec(&__task_cred(p)->user->processes);
> + dec_ulimit(__task_cred(p)->ucounts, ULIMIT_NPROC);
> rcu_read_unlock();
>
> proc_flush_task(p);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 623259fc794d..0b16c9d920b1 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -425,6 +425,7 @@ int arch_task_struct_size __read_mostly;
> void __init fork_init(void)
> {
> int i;
> + struct cred *c = init_task.cred;
> #ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
> #ifndef ARCH_MIN_TASKALIGN
> #define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
> @@ -448,6 +449,8 @@ void __init fork_init(void)
> for (i = 0; i < UCOUNT_COUNTS; i++) {
> init_user_ns.ucount_max[i] = max_threads/2;
> }
> +
> + c->ucounts = inc_ulimit(&init_user_ns, make_kuid(&init_user_ns, 0), ULIMIT_NPROC);
> }
>
> int __weak arch_dup_task_struct(struct task_struct *dst,
> @@ -1515,7 +1518,7 @@ static __latent_entropy struct task_struct *copy_process(
> DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
> #endif
> retval = -EAGAIN;
> - if (atomic_read(&p->real_cred->user->processes) >=
> + if (atomic_read(&p->real_cred->ucounts->ulimit[ULIMIT_NPROC]) >=
> task_rlimit(p, RLIMIT_NPROC)) {
> if (p->real_cred->user != INIT_USER &&
> !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
> @@ -1859,7 +1862,7 @@ static __latent_entropy struct task_struct *copy_process(
> #endif
> delayacct_tsk_free(p);
> bad_fork_cleanup_count:
> - atomic_dec(&p->cred->user->processes);
> + dec_ulimit(p->cred->ucounts, ULIMIT_NPROC);
> exit_creds(p);
> bad_fork_free:
> put_task_stack(p);
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 89d5be418157..143d04ed9da5 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -433,7 +433,7 @@ static int set_user(struct cred *new)
> * for programs doing set*uid()+execve() by harmlessly deferring the
> * failure to the execve() stage.
> */
> - if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
> + if (atomic_read(&new->ucounts->ulimit[ULIMIT_NPROC]) >= rlimit(RLIMIT_NPROC) &&
> new_user != INIT_USER)
> current->flags |= PF_NPROC_EXCEEDED;
> else
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 9d20d5dd298a..d8f5362f6e17 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -181,6 +181,22 @@ static inline bool atomic_inc_below(atomic_t *v, int u)
> }
> }
>
> +struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid,
> + enum ulimit_type type)
> +{
> + struct ucounts *ucounts = get_ucounts(ns, uid);
> + atomic_inc(&ucounts->ulimit[type]);
> +
> + return ucounts;
> +}
> +
> +void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type)
> +{
> + atomic_dec(&ucounts->ulimit[type]);
> + WARN_ON(atomic_read(&ucounts->ulimit[type]) < 0);
> + put_ucounts(ucounts);
> +}
> +
> struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
> enum ucount_type type)
> {
> diff --git a/kernel/user.c b/kernel/user.c
> index b069ccbfb0b0..ce1ba1fb96c0 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -90,7 +90,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
> /* root_user.__count is 1, for init task cred */
> struct user_struct root_user = {
> .__count = ATOMIC_INIT(1),
> - .processes = ATOMIC_INIT(1),
> .sigpending = ATOMIC_INIT(0),
> .locked_shm = 0,
> .uid = GLOBAL_ROOT_UID,
next prev parent reply other threads:[~2016-11-01 15:01 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-10-26 12:40 [RFC PATCH] rlimit: Account nproc per-usernamespace/per-user Nikolay Borisov
[not found] ` <1477485627-16177-1-git-send-email-kernel-6AxghH7DbtA@public.gmane.org>
2016-10-26 17:25 ` Serge E. Hallyn
[not found] ` <20161026172541.GA12228-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2016-10-27 7:01 ` Nikolay Borisov
[not found] ` <0e584ff0-3622-231c-5da2-960cbee698c1-6AxghH7DbtA@public.gmane.org>
2016-10-27 14:37 ` Serge E. Hallyn
[not found] ` <20161027143715.GA23294-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2016-11-03 13:49 ` Nikolay Borisov
[not found] ` <69716b99-dbd3-e4e3-f650-908474cb0b14-6AxghH7DbtA@public.gmane.org>
2016-11-07 16:56 ` Serge E. Hallyn
2016-11-01 15:01 ` Eric W. Biederman [this message]
[not found] ` <8760o7tfa2.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2016-11-03 14:59 ` Nikolay Borisov
[not found] ` <1b593b0d-3e61-ff26-f023-303dcc2debfc-6AxghH7DbtA@public.gmane.org>
2016-11-07 17:28 ` Eric W. Biederman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8760o7tfa2.fsf@xmission.com \
--to=ebiederm-as9lmozglivwk0htik3j/w@public.gmane.org \
--cc=containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org \
--cc=kernel-6AxghH7DbtA@public.gmane.org \
--cc=luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox