netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Serge Hallyn <serge@hallyn.com>
To: linux-kernel@vger.kernel.org
Cc: dhowells@redhat.com, ebiederm@xmission.com,
	containers@lists.linux-foundation.org, netdev@vger.kernel.org,
	akpm@osdl.org, "Serge E. Hallyn" <serge.hallyn@canonical.com>
Subject: [PATCH 01/14] add Documentation/namespaces/user_namespace.txt
Date: Tue, 26 Jul 2011 18:58:24 +0000	[thread overview]
Message-ID: <1311706717-7398-2-git-send-email-serge@hallyn.com> (raw)
In-Reply-To: <1311706717-7398-1-git-send-email-serge@hallyn.com>

From: Serge E. Hallyn <serge.hallyn@canonical.com>

This will hold some info about the design.  Currently it contains
future todos, issues and questions.

Changelog:
   jul 26: incorporate feed back from David Howells.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
---
 Documentation/namespaces/user_namespace.txt |  107 +++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/namespaces/user_namespace.txt

diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
new file mode 100644
index 0000000..7e50517
--- /dev/null
+++ b/Documentation/namespaces/user_namespace.txt
@@ -0,0 +1,107 @@
+Description
+===========
+
+Traditionally, each task is owned by a user ID (UID) and belongs to one or more
+groups (GID).  Both are simple numeric IDs, though userspace usually translates
+them to names.  The user namespace allows tasks to have different views of the
+UIDs and GIDs associated with tasks and other resources.  (See 'UID mapping'
+below for more)
+
+The user namespace is a simple hierarchical one.  The system starts with all
+tasks belonging to the initial user namespace.  A task creates a new user
+namespace by passing the CLONE_NEWUSER flag to clone(2).  This requires the
+creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities,
+but it does not need to be running as root.  The clone(2) call will result in a
+new task which to itself appears to be running as UID and GID 0, but to its
+creator seems to have the creator's credentials.
+
+Any task in or resource belonging to the initial user namespace will, to this
+new task, appear to belong to UID and GID -1 - which is usually known as
+'nobody'.  Permission to open such files will be granted according to world
+access permissions.  UID comparisons and group membership checks will return
+false, and privilege will be denied.
+
+When a task belonging to (for example) userid 500 in the initial user namespace
+creates a new user namespace, even though the new task will see itself as
+belonging to UID 0, any task in the initial user namespace will see it as
+belonging to UID 500.  Therefore, UID 500 in the initial user namespace will be
+able to kill the new task.  Files created by the new user will (eventually) be
+seen by tasks in its own user namespace as belonging to UID 0, but to tasks in
+the initial user namespace as belonging to UID 500.
+
+Note that this userid mapping for the VFS is not yet implemented, though the
+lkml and containers mailing list archives will show several previous
+prototypes.  In the end, those got hung up waiting on the concept of targeted
+capabilities to be developed, which, thanks to the insight of Eric Biederman,
+they finally did.
+
+Relationship between the User namespace and other namespaces
+============================================================
+
+Other namespaces, such as UTS and network, are owned by a user namespace.  When
+such a namespace is created, it is assigned to the user namespace of the task
+by which it was created.  Therefore, attempts to exercise privilege to
+resources in, for instance, a particular network namespace, can be properly
+validated by checking whether the caller has the needed privilege (i.e.
+CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace.
+This is done using the ns_capable() function.
+
+As an example, if a new task is cloned with a private user namespace but
+no private network namespace, then the task's network namespace is owned
+by the parent user namespace.  The new task has no privilege to the
+parent user namespace, so it will not be able to create or configure
+network devices.  If, instead, the task were cloned with both private
+user and network namespaces, then the private network namespace is owned
+by the private user namespace, and so root in the new user namespace
+will have privilege targeted to the network namespace.  It will be able
+to create and configure network devices.
+
+UID Mapping
+===========
+The current plan (see 'flexible UID mapping' at
+https://wiki.ubuntu.com/UserNamespace) is:
+
+The UID/GID stored on disk will be that in the init_user_ns.  Most likely
+UID/GID in other namespaces will be stored in xattrs.  But Eric was advocating
+(a few years ago) leaving the details up to filesystems while providing a lib/
+stock implementation.  See the thread around here
+http://www.mail-archive.com/devel@openvz.org/msg09331.html
+
+
+Working notes
+=============
+Capability checks for actions related to syslog must be against the
+init_user_ns until syslog is containerized.
+
+Same is true for reboot and power, control groups, devices, and time.
+
+Perf actions (kernel/event/core.c for instance) will always be constrained to
+init_user_ns.
+
+Q:
+Is accounting considered properly containerized wrt pidns?  (it appears to be).
+If so, then we can change the capable() check in kernel/acct.c to
+'ns_capable(current_pid_ns()->user_ns, CAP_PACCT)'
+
+Q:
+For things like nice and schedaffinity, we could allow root in a container to
+control those, and leave only cgroups to constrain the container.  I'm not sure
+whether that is right, or whether it violates admin expectations.
+
+I deferred some of commoncap.c.  I'm punting on xattr stuff as they take
+dentries, not inodes.
+
+For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of
+them) target the capability checks at the user_ns owning the tty.  That will
+have to wait until we get userns owning files straightened out.
+
+We need to figure out how to label devices.  Should we just toss a user_ns
+right into struct device?
+
+capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless
+some day LSMs were to be containerized, near zero chance.
+
+inode_owner_or_capable() should probably take an optional ns and cap parameter.
+If cap is 0, then CAP_FOWNER is checked.  If ns is NULL, we derive the ns from
+inode.  But if ns is provided, then callers who need to derive
+inode_userns(inode) anyway can save a few cycles.
-- 
1.7.4.1

  reply	other threads:[~2011-07-26 18:58 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-07-26 18:58 [PATCH 0/14] user namespaces v2: continue targetting capabilities Serge Hallyn
2011-07-26 18:58 ` Serge Hallyn [this message]
2011-07-26 20:22   ` [PATCH 01/14] add Documentation/namespaces/user_namespace.txt Randy Dunlap
2011-07-27 15:38     ` Serge E. Hallyn
2011-07-27 16:02       ` Randy Dunlap
2011-07-26 20:29   ` David Howells
2011-07-29 17:25     ` [PATCH 01/14] add Documentation/namespaces/user_namespace.txt (v3) Serge E. Hallyn
2011-07-26 18:58 ` [PATCH 02/14] allow root in container to copy namespaces Serge Hallyn
2011-07-27 23:14   ` Eric W. Biederman
2011-07-28  2:13     ` Serge E. Hallyn
2011-07-29 17:27     ` [PATCH 02/14] allow root in container to copy namespaces (v3) Serge E. Hallyn
2011-08-01 22:25       ` Eric W. Biederman
     [not found]         ` <m1ei146a6t.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2011-08-02 14:08           ` Serge E. Hallyn
2011-08-02 22:03             ` Eric W. Biederman
2011-08-04 22:01               ` Serge E. Hallyn
2011-07-26 18:58 ` [PATCH 03/14] keyctl: check capabilities against key's user_ns Serge Hallyn
2011-07-26 18:58 ` [PATCH 04/14] user_ns: convert fs/attr.c to targeted capabilities Serge Hallyn
2011-07-26 18:58 ` [PATCH 05/14] userns: clamp down users of cap_raised Serge Hallyn
2011-07-28 23:23   ` Vasiliy Kulikov
2011-07-28 23:51     ` Serge E. Hallyn
2011-07-26 18:58 ` [PATCH 06/14] user namespace: make each net (net_ns) belong to a user_ns Serge Hallyn
2011-07-26 18:58 ` [PATCH 07/14] user namespace: use net->user_ns for some capable calls under net/ Serge Hallyn
2011-07-26 18:58 ` [PATCH 08/14] af_netlink.c: make netlink_capable userns-aware Serge Hallyn
2011-07-26 18:58 ` [PATCH 09/14] user ns: convert ipv6 to targeted capabilities Serge Hallyn
2011-07-26 18:58 ` [PATCH 10/14] net/core/scm.c: target capable() calls to user_ns owning the net_ns Serge Hallyn
2011-08-04 22:06   ` Serge E. Hallyn
2011-07-26 18:58 ` [PATCH 11/14] userns: make some net-sysfs capable calls targeted Serge Hallyn
2011-07-26 18:58 ` [PATCH 12/14] user_ns: target af_key capability check Serge Hallyn
2011-07-26 18:58 ` [PATCH 13/14] userns: net: make many network capable calls targeted Serge Hallyn
2011-07-26 18:58 ` [PATCH 14/14] net: pass user_ns to cap_netlink_recv() Serge Hallyn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1311706717-7398-2-git-send-email-serge@hallyn.com \
    --to=serge@hallyn.com \
    --cc=akpm@osdl.org \
    --cc=containers@lists.linux-foundation.org \
    --cc=dhowells@redhat.com \
    --cc=ebiederm@xmission.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=serge.hallyn@canonical.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).