All of lore.kernel.org
 help / color / mirror / Atom feed
From: ebiederm@xmission.com (Eric W. Biederman)
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>,
	Kernel Hardening <kernel-hardening@lists.openwall.com>,
	Linux Containers <containers@lists.linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Alexey Gladkov <legion@kernel.org>,
	Christian Brauner <christian@brauner.io>
Subject: Re: [RFC PATCH v2 0/8] Count rlimits in each user namespace
Date: Mon, 11 Jan 2021 14:17:19 -0600	[thread overview]
Message-ID: <87a6tfp6sw.fsf@x220.int.ebiederm.org> (raw)
In-Reply-To: <CAHk-=wgXZmRu762bjSeK80+T_LTo+UP9y5rP-uvym1vquSxmBw@mail.gmail.com> (Linus Torvalds's message of "Sun, 10 Jan 2021 10:46:05 -0800")

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sun, Jan 10, 2021 at 9:34 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>>
>> To address the problem, we bind rlimit counters to each user namespace. The
>> result is a tree of rlimit counters with the biggest value at the root (aka
>> init_user_ns). The rlimit counter increment/decrement occurs in the current and
>> all parent user namespaces.
>
> I'm not seeing why this is necessary.
>
> Maybe it's the right approach, but none of the patches (or this cover
> letter email) really explain it to me.
>
> I understand why you might want the _limits_ themselves would form a
> tree like this - with the "master limit" limiting the limits in the
> user namespaces under it.
>
> But I don't understand why the _counts_ should do that. The 'struct
> user_struct' should be shared across even user namespaces for the same
> user.
>
> IOW, the very example of the problem you quote seems to argue against this:
>
>> For example, there are two containers (A and B) created by one user. The
>> container A sets RLIMIT_NPROC=1 and starts one process. Everything is fine, but
>> when container B tries to do the same it will fail because the number of
>> processes is counted globally for each user and user has one process already.
>
> Note how the problem was _not_ that the _count_ was global. That part
> was fine and all good.

The problem is fundamentally that the per process RLIMIT_NPROC was
compared against the user_struct->processes.

I have only heard the problem described but I believe it is either the
RLIMIT_NPROC test in fork or at the beginning of do_execveat_common that
is failing.

From fs/exec.c line 1866:
> 	/*
> 	 * We move the actual failure in case of RLIMIT_NPROC excess from
> 	 * set*uid() to execve() because too many poorly written programs
> 	 * don't check setuid() return code.  Here we additionally recheck
> 	 * whether NPROC limit is still exceeded.
> 	 */
> 	if ((current->flags & PF_NPROC_EXCEEDED) &&
> 	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
> 		retval = -EAGAIN;
> 		goto out_ret;
> 	}

From fs/fork.c line 1966:
> 	retval = -EAGAIN;
> 	if (atomic_read(&p->real_cred->user->processes) >=
> 			task_rlimit(p, RLIMIT_NPROC)) {
> 		if (p->real_cred->user != INIT_USER &&
> 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
> 			goto bad_fork_free;
> 	}
> 	current->flags &= ~PF_NPROC_EXCEEDED;

In both the cases the RLIMIT_NPROC value comes from
task->signal->rlim[RLIMIT_NPROC] and the count of processes
comes from task->cred->user->processes.

> No, the problem was that the _limit_ in container A also ended up
> affecting container B.

The description I have is that both containers run the same service
that set it's RLIMIT_NPROC to 1 in both containers.

> So to me, that says that it would make sense to continue to use the
> resource counts in 'struct user_struct' (because if user A has a hard
> limit of X, then creating a new namespace shouldn't expand that
> limit), but then have the ability to make per-container changes to the
> resource limits (as long as they are within the bounds of the parent
> user namespace resource limit).

I agree that needs to work as well.

> Maybe there is some reason for this ucounts approach, but if so, I
> feel it was not explained at all.

Let me see if I can starte the example a litle more clearly.

Suppose there is a service never_fork that sets RLIMIT_NPROC runs as
never_fork_user and sets RLIMIT_NPROC to 1 in it's systemd service file.

Further suppose there is a user bob who has two containers he wants to
run: container1 and container2.  Both containers start the never_fork
service.

Bob first starts container1 and inside it the never_fork service starts.
Bob starts container2 and the never_fork service fails to start.

Does that make it clear that it is the count of the processes that would
exceed 1 if both instances of the never_fork service starts that would
be the problem?

Eric
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

WARNING: multiple messages have this Message-ID (diff)
From: ebiederm@xmission.com (Eric W. Biederman)
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>,
	 LKML <linux-kernel@vger.kernel.org>,
	 Linux Containers <containers@lists.linux-foundation.org>,
	 Kernel Hardening <kernel-hardening@lists.openwall.com>,
	 Alexey Gladkov <legion@kernel.org>,
	 Kees Cook <keescook@chromium.org>,
	 Christian Brauner <christian@brauner.io>
Subject: Re: [RFC PATCH v2 0/8] Count rlimits in each user namespace
Date: Mon, 11 Jan 2021 14:17:19 -0600	[thread overview]
Message-ID: <87a6tfp6sw.fsf@x220.int.ebiederm.org> (raw)
In-Reply-To: <CAHk-=wgXZmRu762bjSeK80+T_LTo+UP9y5rP-uvym1vquSxmBw@mail.gmail.com> (Linus Torvalds's message of "Sun, 10 Jan 2021 10:46:05 -0800")

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sun, Jan 10, 2021 at 9:34 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>>
>> To address the problem, we bind rlimit counters to each user namespace. The
>> result is a tree of rlimit counters with the biggest value at the root (aka
>> init_user_ns). The rlimit counter increment/decrement occurs in the current and
>> all parent user namespaces.
>
> I'm not seeing why this is necessary.
>
> Maybe it's the right approach, but none of the patches (or this cover
> letter email) really explain it to me.
>
> I understand why you might want the _limits_ themselves would form a
> tree like this - with the "master limit" limiting the limits in the
> user namespaces under it.
>
> But I don't understand why the _counts_ should do that. The 'struct
> user_struct' should be shared across even user namespaces for the same
> user.
>
> IOW, the very example of the problem you quote seems to argue against this:
>
>> For example, there are two containers (A and B) created by one user. The
>> container A sets RLIMIT_NPROC=1 and starts one process. Everything is fine, but
>> when container B tries to do the same it will fail because the number of
>> processes is counted globally for each user and user has one process already.
>
> Note how the problem was _not_ that the _count_ was global. That part
> was fine and all good.

The problem is fundamentally that the per process RLIMIT_NPROC was
compared against the user_struct->processes.

I have only heard the problem described but I believe it is either the
RLIMIT_NPROC test in fork or at the beginning of do_execveat_common that
is failing.

From fs/exec.c line 1866:
> 	/*
> 	 * We move the actual failure in case of RLIMIT_NPROC excess from
> 	 * set*uid() to execve() because too many poorly written programs
> 	 * don't check setuid() return code.  Here we additionally recheck
> 	 * whether NPROC limit is still exceeded.
> 	 */
> 	if ((current->flags & PF_NPROC_EXCEEDED) &&
> 	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
> 		retval = -EAGAIN;
> 		goto out_ret;
> 	}

From fs/fork.c line 1966:
> 	retval = -EAGAIN;
> 	if (atomic_read(&p->real_cred->user->processes) >=
> 			task_rlimit(p, RLIMIT_NPROC)) {
> 		if (p->real_cred->user != INIT_USER &&
> 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
> 			goto bad_fork_free;
> 	}
> 	current->flags &= ~PF_NPROC_EXCEEDED;

In both the cases the RLIMIT_NPROC value comes from
task->signal->rlim[RLIMIT_NPROC] and the count of processes
comes from task->cred->user->processes.

> No, the problem was that the _limit_ in container A also ended up
> affecting container B.

The description I have is that both containers run the same service
that set it's RLIMIT_NPROC to 1 in both containers.

> So to me, that says that it would make sense to continue to use the
> resource counts in 'struct user_struct' (because if user A has a hard
> limit of X, then creating a new namespace shouldn't expand that
> limit), but then have the ability to make per-container changes to the
> resource limits (as long as they are within the bounds of the parent
> user namespace resource limit).

I agree that needs to work as well.

> Maybe there is some reason for this ucounts approach, but if so, I
> feel it was not explained at all.

Let me see if I can starte the example a litle more clearly.

Suppose there is a service never_fork that sets RLIMIT_NPROC runs as
never_fork_user and sets RLIMIT_NPROC to 1 in it's systemd service file.

Further suppose there is a user bob who has two containers he wants to
run: container1 and container2.  Both containers start the never_fork
service.

Bob first starts container1 and inside it the never_fork service starts.
Bob starts container2 and the never_fork service fails to start.

Does that make it clear that it is the count of the processes that would
exceed 1 if both instances of the never_fork service starts that would
be the problem?

Eric

  reply	other threads:[~2021-01-11 20:18 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-10 17:33 [RFC PATCH v2 0/8] Count rlimits in each user namespace Alexey Gladkov
2021-01-10 17:33 ` Alexey Gladkov
2021-01-10 17:33 ` [RFC PATCH v2 1/8] Use atomic type for ucounts reference counting Alexey Gladkov
2021-01-10 17:33   ` Alexey Gladkov
2021-01-13 16:31   ` Eric W. Biederman
2021-01-13 16:31     ` Eric W. Biederman
2021-01-13 18:01     ` Kees Cook
2021-01-13 18:01       ` Kees Cook
2021-01-10 17:33 ` [RFC PATCH v2 2/8] Add a reference to ucounts for each user Alexey Gladkov
2021-01-10 17:33   ` Alexey Gladkov
2021-01-13  6:33   ` 59ebc79722: kernel_BUG_at_kernel/cred.c kernel test robot
2021-01-13  6:33     ` kernel test robot
2021-01-13  6:33     ` kernel test robot
2021-01-13 16:25   ` [RFC PATCH v2 2/8] Add a reference to ucounts for each user Eric W. Biederman
2021-01-13 16:25     ` Eric W. Biederman
2021-01-10 17:33 ` [RFC PATCH v2 3/8] Increase size of ucounts to atomic_long_t Alexey Gladkov
2021-01-10 17:33   ` Alexey Gladkov
2021-01-10 17:33 ` [RFC PATCH v2 4/8] Move RLIMIT_NPROC counter to ucounts Alexey Gladkov
2021-01-10 17:33   ` Alexey Gladkov
2021-01-10 17:33 ` [RFC PATCH v2 5/8] Move RLIMIT_MSGQUEUE " Alexey Gladkov
2021-01-10 17:33   ` Alexey Gladkov
2021-01-10 17:33 ` [RFC PATCH v2 6/8] Move RLIMIT_SIGPENDING " Alexey Gladkov
2021-01-10 17:33   ` Alexey Gladkov
2021-01-10 17:33 ` [RFC PATCH v2 7/8] Move RLIMIT_MEMLOCK " Alexey Gladkov
2021-01-10 17:33   ` Alexey Gladkov
2021-01-10 17:33 ` [RFC PATCH v2 8/8] Move RLIMIT_NPROC check to the place where we increment the counter Alexey Gladkov
2021-01-10 17:33   ` Alexey Gladkov
2021-01-10 18:46 ` [RFC PATCH v2 0/8] Count rlimits in each user namespace Linus Torvalds
2021-01-10 18:46   ` Linus Torvalds
2021-01-11 20:17   ` Eric W. Biederman [this message]
2021-01-11 20:17     ` Eric W. Biederman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87a6tfp6sw.fsf@x220.int.ebiederm.org \
    --to=ebiederm@xmission.com \
    --cc=christian@brauner.io \
    --cc=containers@lists.linux-foundation.org \
    --cc=keescook@chromium.org \
    --cc=kernel-hardening@lists.openwall.com \
    --cc=legion@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.