[PATCH] doc: give a more thorough id handling explanation

From: Christian Brauner <brauner@kernel.org>
To: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>, Al Viro <viro@zeniv.linux.org.uk>,
	Seth Forshee <sforshee@gmail.com>,
	Aleksa Sarai <cyphar@cyphar.com>,
	Christian Brauner <christian.brauner@ubuntu.com>
Subject: [PATCH] doc: give a more thorough id handling explanation
Date: Fri, 23 Jul 2021 14:51:50 +0200
Message-ID: <20210723125150.334206-1-brauner@kernel.org>

From: Christian Brauner <christian.brauner@ubuntu.com>

Currently there's no document explaining how idmappings work at all.
Add a document that gives an introduction and also goes into a bit more
detail for more advanced use-cases.

With-input-from: Seth Forshee <sforshee@kernel.org>
With-input-from: Aleksa Sarai <cyphar@cyphar.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
(I'll likely be slow to respond since I'm going to be away over the
 weekend.)
---
 Documentation/filesystems/idmappings.rst | 1008 ++++++++++++++++++++++
 Documentation/filesystems/index.rst      |    1 +
 2 files changed, 1009 insertions(+)
 create mode 100644 Documentation/filesystems/idmappings.rst

diff --git a/Documentation/filesystems/idmappings.rst b/Documentation/filesystems/idmappings.rst
new file mode 100644
index 000000000000..d543ff9529b6
--- /dev/null
+++ b/Documentation/filesystems/idmappings.rst
@@ -0,0 +1,1008 @@
+Idmappings
+==========
+
+Most filesystem developers will have encountered idmappings. They have to be
+used when reading from or writing ownership to disk, reporting ownership to
+userspace, or for permission checking. This document is aimed at filesystem
+developers that want to know how idmappings work.
+
+Formal notes
+------------
+
+An idmapping is essentially a translation of a range of ids into another or the
+same range of ids. The notational convention for idmappings that is widely used
+in userspace is::
+
+ x:y:K
+
+The ``K`` parameter indicates the range of the idmapping, i.e. how many ids are
+mapped. More generally, ``x`` is an element of the upper idmapset ``X`` and
+``y`` is an element of the lower idmapset ``Y``.
+
+To see what this looks like in practice, let's take the following idmapping::
+
+ 22:10000:3
+
+and write down the mappings it will generate::
+
+ 22 -> 10000
+ 23 -> 10001
+ 24 -> 10002
+
+From a mathematical viewpoint ``X`` and ``Y`` are well-ordered sets and an
+idmapping is an order isomorphism from ``X`` into ``Y``. So ``X`` and ``Y`` are
+order isomorphic. In fact, ``X`` and ``Y`` are always well-ordered subsets of
+the set of all possible ids usable on a given system.
+
+Looking at this briefly from a mathematical angle helps us highlight some
+properties that make it easier to understand how we can translate between
+idmappings. For example, we know that the inverse idmapping is an order
+isomorphism as well::
+
+ 10000 -> 22
+ 10001 -> 23
+ 10002 -> 24
+
+Given that we are dealing with order isomorphisms plus the fact that we're
+dealing with subsets we can embed idmappings into each other, i.e. we can
+sensibly translate between different idmappings. For example, assume we've been
+given the three idmappings::
+
+ 1. 0:10000:10000
+ 2. 0:20000:10000
+ 3. 0:30000:10000
+
+and we're given the id ``11000`` which has been generated by the first
+idmapping by mapping ``1000 -> 11000`` down from the upper into the lower
+idmapset.
+
+Because we're dealing with order isomorphic subsets it is meaningful to ask
+what id ``11000`` corresponds to in the second or third idmapping. The
+straightforward algorithm to use is to apply the inverse of the first idmapping
+``11000 -> 1000`` and then use the second idmapping ``1000 -> 21000`` or the
+third idmapping ``1000 -> 31000``. If we were given the same task for the
+following three idmappings::
+
+ 1. 0:10000:10000
+ 2. 0:20000:200
+ 3. 0:30000:300
+
+we would fail to translate as the sets aren't order isomorphic over the full
+range of the first idmapping anymore. (However, they are order isomorphic over
+the full range of the second idmapping.) Neither the second nor the third
+idmapping contains id ``1000`` in the upper idmapset ``X``. This is equivalent
+to not having an id mapped, so ``1000`` is an unmapped id in the second and
+third idmapping. The kernel will report unmapped ids as the overflowuid
+``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace.
+
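As a rough illustration of the translation just described, the following Python sketch applies the inverse of one idmapping and then a second idmapping. It is a userspace model only: the ``(x, y, K)`` tuple and the ``translate()`` helper are hypothetical names chosen for this sketch, and ``None`` stands in for an unmapped id.

```python
from typing import Optional, Tuple

Idmap = Tuple[int, int, int]  # (x, y, K), a hypothetical representation


def translate(src: Idmap, dst: Idmap, lower_id: int) -> Optional[int]:
    """Apply the inverse of src, then dst. None means the id is unmapped."""
    sx, sy, sk = src
    if not sy <= lower_id < sy + sk:
        return None                      # not in the lower idmapset of src
    upper_id = lower_id - sy + sx        # inverse of the source idmapping
    dx, dy, dk = dst
    if not dx <= upper_id < dx + dk:
        return None                      # dst does not cover upper_id
    return upper_id - dx + dy            # map down in the target idmapping


first = (0, 10000, 10000)
print(translate(first, (0, 20000, 10000), 11000))  # second idmapping
print(translate(first, (0, 30000, 10000), 11000))  # third idmapping
print(translate(first, (0, 20000, 200), 11000))    # range too small
```

With the full-range idmappings the id translates cleanly; with the shortened range the intermediate id ``1000`` falls outside the upper idmapset and the sketch reports it as unmapped.
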
+The algorithm to calculate what a given id maps to is pretty simple. First, we
+need to verify that the range can contain our target id. We will skip this step
+for simplicity. After that if we want to know what the id ``id`` maps to we can
+do simple calculations:
+
+- If we want to map from left to right::
+
+   x:y:K
+   id - x + y = z
+
+- If we want to map from right to left::
+
+   x:y:K
+   id - y + x = z
+
+Instead of "left to right" we can also say "down" and instead of "right to
+left" we can also say "up". Obviously mapping down and up invert each other.
+
+To see whether the simple formulas above work, consider the following two
+idmappings::
+
+ 1. 0:20000:10000
+ 2. 500:30000:10000
+
+Assume we are given the id ``21000`` in the lower idmapset of the first
+idmapping. We want to know what id this was mapped from in the upper idmapset
+of the first idmapping. So we're mapping up in the first idmapping::
+
+ id    - y     + x = z
+ 21000 - 20000 + 0 = 1000
+
+Now assume we are given the id ``1100`` in the upper idmapset of the second
+idmapping and we want to know what this id maps down to in the lower idmapset
+of the second idmapping. This means we're mapping down in the second idmapping::
+
+ id   - x   + y     = z
+ 1100 - 500 + 30000 = 30600
+
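The two formulas can be sketched in a few lines of Python. This is an illustration rather than kernel code: ``Idmap``, ``map_down()`` and ``map_up()`` are hypothetical names, and unlike the text above the sketch does perform the range check before calculating.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    """Hypothetical model of an x:y:K idmapping."""
    x: int  # first id of the upper idmapset
    y: int  # first id of the lower idmapset
    k: int  # number of ids mapped


def map_down(m: Idmap, id_: int) -> Optional[int]:
    """Map left to right (upper -> lower): id - x + y."""
    if not m.x <= id_ < m.x + m.k:
        return None  # id is not in the upper idmapset
    return id_ - m.x + m.y


def map_up(m: Idmap, id_: int) -> Optional[int]:
    """Map right to left (lower -> upper): id - y + x."""
    if not m.y <= id_ < m.y + m.k:
        return None  # id is not in the lower idmapset
    return id_ - m.y + m.x


first = Idmap(0, 20000, 10000)
second = Idmap(500, 30000, 10000)

print(map_up(first, 21000))    # mapping up in the first idmapping
print(map_down(second, 1100))  # mapping down in the second idmapping
```

Running it reproduces the two worked calculations, ``1000`` and ``30600``.
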
+General notes
+-------------
+
+In the context of the kernel an idmapping can be interpreted as mapping a range
+of userspace ids into a range of kernel ids::
+
+ userspace-id:kernel-id:range
+
+A userspace id is always an element in the source idmapset of an idmapping of
+type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the target
+idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
+"userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
+types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.
+
+The kernel is mostly concerned with kernel ids. They are used when performing
+permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` fields.
+A userspace id on the other hand is an id that is reported to userspace by the
+kernel, or is passed by userspace to the kernel, or a raw device id that is
+written to or read from disk.
+
+Note that we are only concerned with idmappings as the kernel stores them, not
+how userspace would specify them.
+
+A kernel id is always created by an idmapping. Such idmappings are associated
+with user namespaces. Since we mainly care about how idmappings work we're not
+going to be concerned with how idmappings are created nor how they are used
+outside of the filesystem context. This is best left to an explanation of user
+namespaces.
+
+The initial user namespace is special. It always has an idmapping of the
+following form::
+
+ 0:0:4294967295
+
+which is an identity idmapping over the full range of ids available on this
+system.
+
+Other user namespaces usually have non-identity idmappings such as::
+
+ 0:10000:10000
+
+When a process creates or wants to change ownership of a file, or when the
+ownership of a file is read from disk by a filesystem, the userspace id is
+immediately translated into a kernel id according to the idmapping associated
+with the relevant user namespace.
+
+For instance, a file that is stored on disk by a filesystem as being owned by
+userspace id ``1000``:
+
+- If a filesystem were to be mounted in the initial user namespace (as most
+  filesystems are) then the initial idmapping will be used. As we saw this is
+  simply the identity idmapping. This would mean the userspace id ``1000`` read
+  from disk would be mapped to kernel id ``1000``. So a VFS inode's ``i_uid``
+  and ``i_gid`` fields would contain kernel id ``1000``.
+
+- If a filesystem were to be mounted in a user namespace with an idmapping of
+  ``0:10000:10000`` then the userspace id ``1000`` read from disk would be
+  mapped to kernel id ``11000``. So a VFS inode's ``i_uid`` and ``i_gid`` would
+  contain ``11000``.
+
+An idmapping ``0:10000:10000`` consists of a set of userspace ids or "userspace
+idmapset" and a set of kernel ids or "kernel idmapset". This distinction is
+important when translating between different idmappings.
+
+Translation algorithms
+----------------------
+
+We've already seen briefly that it is possible to translate between different
+idmappings. We'll now take a closer look at how that works.
+
+Crossmapping
+~~~~~~~~~~~~
+
+This translation algorithm is used by the kernel in quite a few places. For
+example, it is used when reporting back the ownership of a file to userspace
+via the ``stat()`` system call family.
+
+If we've been given a kernel id ``11000`` from one idmapping we can map that id
+up in another idmapping. In order for this to work both idmappings need to
+contain the same kernel id in their kernel idmapsets. For example, consider the
+following idmappings::
+
+ 1. 0:10000:10000
+ 2. 20000:10000:10000
+
+and we are mapping the userspace id ``1000`` according to the first idmapping
+``1000 -> 11000``. We can translate the kernel id ``11000`` into a userspace id
+in the second idmapping using the kernel idmapset of the second idmapping::
+
+ /* Map the kernel id up into a userspace id in the second idmapping. */
+ from_kuid(20000:10000:10000, 11000) = 21000
+
+Note how we can get back to the kernel id in the first idmapping by inverting
+the algorithm::
+
+ /* Map the userspace id down into a kernel id in the second idmapping. */
+ make_kuid(20000:10000:10000, 21000) = 11000
+
+ /* Map the kernel id up into a userspace id in the first idmapping. */
+ from_kuid(0:10000:10000, 11000) = 1000
+
+This algorithm allows us to answer the question of what userspace id a given
+kernel id corresponds to in a given idmapping. In order to be able to answer
+this question both idmappings need to contain the same kernel id in their
+respective kernel idmapsets.
+
+For example, when the kernel reads a raw userspace id from disk it maps it into
+a kernel id according to the idmapping associated with the filesystem. Let's
+assume the filesystem was mounted with an idmapping of ``0:20000:10000`` and it
+reads a file owned by userspace id ``1000`` from disk. This means userspace id
+``1000`` will be mapped to kernel id ``21000`` which is what will be stored in
+the VFS inode's ``i_uid`` and ``i_gid`` fields.
+
+When someone in userspace calls ``stat()`` or a related function to get
+ownership information of the file the kernel can't simply map the id back up
+according to the filesystem's idmapping as this would give the wrong owner.
+Instead, the kernel will map the id back up in the idmapping of the caller.
+Let's assume the caller has the slightly unconventional idmapping
+``3000:20000:10000``. Then the kernel id ``21000`` would map back up to
+userspace id ``4000`` in this idmapping and consequently the user would see
+that this file is owned by userspace id ``4000`` according to their idmapping.
+
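The crossmapping steps above can be replayed with a small Python model. The ``make_kuid()`` and ``from_kuid()`` functions here are toy stand-ins that follow this document's notation (an idmapping written as a ``(userspace id, kernel id, range)`` tuple); they are not the kernel's functions, and ``crossmap()`` is a hypothetical helper name.

```python
from typing import Optional, Tuple

Idmap = Tuple[int, int, int]  # (userspace id, kernel id, range)


def make_kuid(m: Idmap, uid: int) -> Optional[int]:
    """Toy model: map a userspace id down into a kernel id."""
    u, k, n = m
    return uid - u + k if u <= uid < u + n else None


def from_kuid(m: Idmap, kuid: int) -> Optional[int]:
    """Toy model: map a kernel id up into a userspace id."""
    u, k, n = m
    return kuid - k + u if k <= kuid < k + n else None


def crossmap(down_in: Idmap, up_in: Idmap, uid: int) -> Optional[int]:
    """Map uid down in one idmapping, then up in another."""
    kuid = make_kuid(down_in, uid)
    return None if kuid is None else from_kuid(up_in, kuid)


filesystem = (0, 20000, 10000)
caller = (3000, 20000, 10000)
on_disk = 1000

print(make_kuid(filesystem, on_disk))         # kernel id kept in i_uid
print(crossmap(filesystem, caller, on_disk))  # what the caller would see
```

The sketch yields kernel id ``21000`` and the reported userspace id ``4000``, matching the ``stat()`` walk-through.
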
+Remapping
+~~~~~~~~~
+
+It is possible to translate the id from one idmapping to another one via the
+userspace idmapset of the two idmappings. This is equivalent to remapping an
+id.
+
+Let's look at an example. We are given the following two idmappings::
+
+ 1. 0:10000:10000
+ 2. 0:20000:10000
+
+and we are given the kernel id ``11000`` in the first idmapping. In order to
+translate this kernel id in the first idmapping into a kernel id in the second
+idmapping we need to perform two steps:
+
+1. Map the kernel id up into a userspace id in the first idmapping::
+
+    /* Map the kernel id up into a userspace id in the first idmapping. */
+    from_kuid(0:10000:10000, 11000) = 1000
+
+2. Map the userspace id down into a kernel id in the second idmapping::
+
+    /* Map the userspace id down into a kernel id in the second idmapping. */
+    make_kuid(0:20000:10000, 1000) = 21000
+
+As you can see we used the userspace idmapset in both idmappings to translate
+the kernel id in one idmapping to a kernel id in another idmapping.
+
+This allows us to answer the question what kernel id we would need to use to
+get the same userspace id in another idmapping. In order to be able to answer
+this question both idmappings need to contain the same userspace id in their
+respective userspace idmapsets.
+
+Note how we can easily get back to the kernel id in the first idmapping by
+inverting the algorithm:
+
+1. Map the kernel id up into a userspace id in the second idmapping::
+
+    /* Map the kernel id up into a userspace id in the second idmapping. */
+    from_kuid(0:20000:10000, 21000) = 1000
+
+2. Map the userspace id down into a kernel id in the first idmapping::
+
+    /* Map the userspace id down into a kernel id in the first idmapping. */
+    make_kuid(0:10000:10000, 1000) = 11000
+
+Another way to look at this translation is to treat it as undoing an already
+active idmapping and applying another idmapping. This will come in handy when
+working with idmapped mounts.
+
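A minimal Python sketch of remapping, under the same assumptions as the notation used above: idmappings are ``(userspace id, kernel id, range)`` tuples, ``make_kuid()`` and ``from_kuid()`` are toy models rather than the kernel functions, and ``remap()`` is a hypothetical helper name.

```python
from typing import Optional, Tuple

Idmap = Tuple[int, int, int]  # (userspace id, kernel id, range)


def make_kuid(m: Idmap, uid: int) -> Optional[int]:
    """Toy model: map a userspace id down into a kernel id."""
    u, k, n = m
    return uid - u + k if u <= uid < u + n else None


def from_kuid(m: Idmap, kuid: int) -> Optional[int]:
    """Toy model: map a kernel id up into a userspace id."""
    u, k, n = m
    return kuid - k + u if k <= kuid < k + n else None


def remap(src: Idmap, dst: Idmap, kuid: int) -> Optional[int]:
    """Undo src (map up), then apply dst (map down)."""
    uid = from_kuid(src, kuid)
    return None if uid is None else make_kuid(dst, uid)


first = (0, 10000, 10000)
second = (0, 20000, 10000)

there = remap(first, second, 11000)
back = remap(second, first, there)
print(there, back)
```

Swapping the two arguments inverts the translation, which is the property the idmapped mount helpers discussed later rely on.
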
+Invalid translations
+~~~~~~~~~~~~~~~~~~~~
+
+It is never valid to use an id in the kernel idmapset of one idmapping as the
+id in the userspace idmapset of another or the same idmapping. While the kernel
+idmapset always indicates an idmapset in the kernel id space the userspace
+idmapset indicates a userspace id. So the following translations are forbidden::
+
+ /* Map the userspace id down into a kernel id in the first idmapping. */
+ make_kuid(0:10000:10000, 1000) = 11000
+
+ /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
+ make_kuid(10000:20000:10000, 11000) = 21000
+
+and equally wrong::
+
+ /* Map the kernel id up into a userspace id in the first idmapping. */
+ from_kuid(0:10000:10000, 11000) = 1000
+
+ /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
+ from_kuid(20000:0:10000, 1000) = 21000
+
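One way to make such mistakes hard to write is to give userspace ids and kernel ids distinct types, in the same spirit as ``uid_t`` and ``kuid_t`` being different types. The Python sketch below is only an illustration of that idea: the ``Uid`` and ``Kuid`` wrapper classes and the toy ``make_kuid()`` and ``from_kuid()`` are hypothetical and do not mirror the kernel's definitions.

```python
from dataclasses import dataclass
from typing import Tuple

Idmap = Tuple[int, int, int]  # (userspace id, kernel id, range)


@dataclass(frozen=True)
class Uid:
    """A userspace id (hypothetical wrapper type)."""
    val: int


@dataclass(frozen=True)
class Kuid:
    """A kernel id (hypothetical wrapper type)."""
    val: int


def make_kuid(m: Idmap, uid: Uid) -> Kuid:
    """Toy model: only a userspace id may be mapped down."""
    if not isinstance(uid, Uid):
        raise TypeError("make_kuid() takes a userspace id")
    u, k, n = m
    if not u <= uid.val < u + n:
        raise ValueError("unmapped userspace id")
    return Kuid(uid.val - u + k)


def from_kuid(m: Idmap, kuid: Kuid) -> Uid:
    """Toy model: only a kernel id may be mapped up."""
    if not isinstance(kuid, Kuid):
        raise TypeError("from_kuid() takes a kernel id")
    u, k, n = m
    if not k <= kuid.val < k + n:
        raise ValueError("unmapped kernel id")
    return Uid(kuid.val - k + u)


kuid = make_kuid((0, 10000, 10000), Uid(1000))
print(kuid)  # Kuid(val=11000)

try:
    # INVALID: feeding a kernel id into make_kuid() of another idmapping.
    make_kuid((10000, 20000, 10000), kuid)
except TypeError as err:
    print("rejected:", err)
```

The first forbidden translation from the text is rejected because a ``Kuid`` is passed where a ``Uid`` is required, even though the arithmetic alone would have produced a number.
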
+Idmappings when creating filesystem objects
+-------------------------------------------
+
+The concepts of mapping an id down or mapping an id up are expressed in the two
+kernel functions filesystem developers are rather familiar with::
+
+ /* Map the userspace id down into a kernel id. */
+ make_kuid(idmapping, uid)
+
+ /* Map the kernel id up into a userspace id. */
+ from_kuid(idmapping, kuid)
+
+We will take an abbreviated look into how idmappings figure into creating
+filesystem objects. For simplicity we will only look at what happens when the
+VFS has already completed path lookup right before it calls into the filesystem
+itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
+called. We will also assume that the directory we're creating filesystem
+objects in is readable and writable for everyone.
+
+When creating a filesystem object the kernel will look at the caller's
+filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids
+but they are exclusively used when determining file ownership which is why they
+are called "filesystem ids". They are usually identical to the uid and gid of
+the caller but can differ. We will just assume they are always identical to not
+get lost in too many details.
+
+When the caller enters the kernel two things happen:
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping.
+   (To be precise, the kernel will simply look at the kernel ids stashed in the
+   credentials of the current task but for our education we'll pretend this
+   translation happens just in time.)
+2. Verify that the caller's kernel ids can be mapped to userspace ids in the
+   filesystem's idmapping.
+
+The second step is important as regular filesystems will ultimately need to
+translate the kernel id back into a raw userspace id when writing to disk.
+So with the second step the kernel guarantees that a valid userspace id can be
+written to disk. If it can't the kernel will refuse the creation request to not
+even remotely risk filesystem corruption.
+
+Example 1
+~~~~~~~~~
+
+::
+
+ caller userspace id:  1000
+ caller idmapping:     0:0:4294967295
+ filesystem idmapping: 0:0:4294967295
+
+Both the caller and the filesystem use the identity idmapping:
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
+
+    make_kuid(0:0:4294967295, 1000) = 1000
+
+2. Verify that the caller's kernel ids can be mapped to userspace ids in the
+   filesystem's idmapping.
+
+   For this second step the kernel will call the function
+   ``fsuidgid_has_mapping()`` which ultimately boils down to calling
+   ``from_kuid()``::
+
+    from_kuid(0:0:4294967295, 1000) = 1000
+
+The astute reader will have realized that this is simply a variation of the
+crossmapping algorithm we introduced in a previous section. First, the
+kernel maps the caller's userspace id down into a kernel id according to the
+caller's idmapping and then maps that kernel id up according to the
+filesystem's idmapping. In this example both idmappings are the same so there's
+nothing exciting going on. Ultimately the userspace id that lands on disk will
+be ``1000``.
+
+Example 2
+~~~~~~~~~
+
+::
+
+ caller userspace id:  1000
+ caller idmapping:     0:10000:10000
+ filesystem idmapping: 0:20000:10000
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
+
+    make_kuid(0:10000:10000, 1000) = 11000
+
+2. Verify that the caller's kernel ids can be mapped to userspace ids in the
+   filesystem's idmapping::
+
+    from_kuid(0:20000:10000, 11000) = -1
+
+It's immediately clear that while the caller's userspace id could be
+successfully mapped down into kernel ids in the caller's idmapping the kernel
+ids could not be mapped up according to the filesystem's idmapping. So the
+kernel will deny this creation request.
+
+Note that while this example is less common, because most filesystems can't be
+mounted with non-initial idmappings, this is a general problem.
+
+Example 3
+~~~~~~~~~
+
+::
+
+ caller userspace id:  1000
+ caller idmapping:     0:10000:10000
+ filesystem idmapping: 0:0:4294967295
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
+
+    make_kuid(0:10000:10000, 1000) = 11000
+
+2. Verify that the caller's kernel ids can be mapped to userspace ids in the
+   filesystem's idmapping::
+
+    from_kuid(0:0:4294967295, 11000) = 11000
+
+We can see that the translation always succeeds. The userspace id that the
+filesystem will ultimately put to disk will always be identical to the value of
+the kernel id that was created in the caller's idmapping. In this example
+``11000``. This has mainly two consequences.
+
+First, that we can't allow a caller to ultimately write to disk with another
+userspace id. We could only do this if we were to mount the whole filesystem
+with the caller's or another idmapping. But as we've seen that is limited to
+a few filesystems and not very flexible. But this is a use-case that is pretty
+important in containerized workloads.
+
+Second, the caller will usually not be able to create any files or access
+directories that have stricter permissions because none of the filesystem's
+kernel ids map up into valid userspace ids in the caller's idmapping:
+
+1. Map raw userspace ids into kernel ids in the filesystem's idmapping::
+
+    make_kuid(0:0:4294967295, 1000) = 1000
+
+2. Map kernel ids into userspace ids in the caller's idmapping::
+
+    from_kuid(0:10000:10000, 1000) = -1
+
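The two-step creation check from Examples 1 to 3 can be replayed with a short Python model. As before this is a sketch under assumptions: idmappings are ``(userspace id, kernel id, range)`` tuples, ``make_kuid()`` and ``from_kuid()`` are toy models of the notation used in this document, and ``disk_id_on_create()`` is a hypothetical helper that returns ``None`` where the kernel would refuse the request.

```python
from typing import Optional, Tuple

Idmap = Tuple[int, int, int]  # (userspace id, kernel id, range)


def make_kuid(m: Idmap, uid: int) -> Optional[int]:
    """Toy model: map a userspace id down into a kernel id."""
    u, k, n = m
    return uid - u + k if u <= uid < u + n else None


def from_kuid(m: Idmap, kuid: int) -> Optional[int]:
    """Toy model: map a kernel id up into a userspace id."""
    u, k, n = m
    return kuid - k + u if k <= kuid < k + n else None


def disk_id_on_create(caller: Idmap, fs: Idmap, uid: int) -> Optional[int]:
    """Return the userspace id that would land on disk, or None if refused."""
    kuid = make_kuid(caller, uid)      # step 1: caller's idmapping
    if kuid is None:
        return None
    return from_kuid(fs, kuid)         # step 2: filesystem's idmapping


initial = (0, 0, 4294967295)
examples = {
    "Example 1": (initial, initial),
    "Example 2": ((0, 10000, 10000), (0, 20000, 10000)),
    "Example 3": ((0, 10000, 10000), initial),
}
results = {name: disk_id_on_create(caller, fs, 1000)
           for name, (caller, fs) in examples.items()}
print(results)
```

Example 1 writes ``1000``, Example 2 is refused, and Example 3 writes ``11000``, which is the caller's kernel id rather than the id the caller sees for itself.
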
+Example 4
+~~~~~~~~~
+
+::
+
+ file userspace id:    1000
+ caller idmapping:     0:10000:10000
+ filesystem idmapping: 0:0:4294967295
+
+In order to report ownership to userspace the kernel uses the crossmapping
+algorithm introduced in a previous section:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+   idmapping::
+
+    make_kuid(0:0:4294967295, 1000) = 1000
+
+2. Map the kernel id up into a userspace id in the caller's idmapping::
+
+    from_kuid(0:10000:10000, 1000) = -1
+
+The crossmapping algorithm fails in this case because the kernel id in the
+filesystem idmapping cannot be mapped to a userspace id in the caller's
+idmapping. Thus, the kernel will report the ownership of this file as the
+overflowid.
+
+Example 5
+~~~~~~~~~
+
+::
+
+ file userspace id:    1000
+ caller idmapping:     0:10000:10000
+ filesystem idmapping: 0:20000:10000
+
+In order to report ownership to userspace the kernel uses the crossmapping
+algorithm introduced in a previous section:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+   idmapping::
+
+    make_kuid(0:20000:10000, 1000) = 21000
+
+2. Map the kernel id up into a userspace id in the caller's idmapping::
+
+    from_kuid(0:10000:10000, 21000) = -1
+
+Again, the crossmapping algorithm fails in this case because the kernel id in
+the filesystem idmapping cannot be mapped to a userspace id in the caller's
+idmapping. Thus, the kernel will report the ownership of this file as the
+overflowid.
+
+Note how in the last two examples things would be simple if the caller were
+using the initial idmapping. For a filesystem mounted with the initial
+idmapping it would be trivial. So we only consider a filesystem with an
+idmapping of ``0:20000:10000``:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+   idmapping::
+
+    make_kuid(0:20000:10000, 1000) = 21000
+
+2. Map the kernel id up into a userspace id in the caller's idmapping::
+
+    from_kuid(0:0:4294967295, 21000) = 21000
+
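The ownership-reporting cases in Examples 4 and 5 can be sketched the same way. In this Python model ``reported_owner()`` is a hypothetical helper and ``UNMAPPED`` is simply the ``-1`` this document writes for a failed translation, standing in for the overflowid. The toy ``make_kuid()`` and ``from_kuid()`` follow the notation above and are not the kernel functions.

```python
from typing import Optional, Tuple

Idmap = Tuple[int, int, int]  # (userspace id, kernel id, range)
UNMAPPED = -1                 # the document's notation for "no mapping"


def make_kuid(m: Idmap, uid: int) -> Optional[int]:
    """Toy model: map a userspace id down into a kernel id."""
    u, k, n = m
    return uid - u + k if u <= uid < u + n else None


def from_kuid(m: Idmap, kuid: int) -> Optional[int]:
    """Toy model: map a kernel id up into a userspace id."""
    u, k, n = m
    return kuid - k + u if k <= kuid < k + n else None


def reported_owner(fs: Idmap, caller: Idmap, disk_uid: int) -> int:
    """Crossmap the on-disk id into the caller's idmapping."""
    kuid = make_kuid(fs, disk_uid)
    uid = None if kuid is None else from_kuid(caller, kuid)
    return UNMAPPED if uid is None else uid


initial = (0, 0, 4294967295)
container = (0, 10000, 10000)

print(reported_owner(initial, container, 1000))            # Example 4
print(reported_owner((0, 20000, 10000), container, 1000))  # Example 5
print(reported_owner((0, 20000, 10000), initial, 1000))    # initial caller
```

Both container cases come back unmapped, while a caller in the initial idmapping sees the filesystem's kernel id ``21000`` unchanged.
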
+Idmappings on idmapped mounts
+-----------------------------
+
+The examples we've seen in the previous section where the caller's idmapping
+and the filesystem's idmapping are incompatible cause various issues for
+workloads. For a more complex but common example, consider two containers
+started on the host. To completely prevent the two containers from affecting
+each other, an administrator may often use different non-overlapping
+idmappings for the two containers::
+
+ container1 idmapping:  0:10000:10000
+ container2 idmapping:  0:20000:10000
+ filesystem idmapping:  0:30000:10000
+
+An administrator wanting to provide easy read-write access to the following set
+of files::
+
+ dir userspace id:       0
+ dir/file1 userspace id: 1000
+ dir/file2 userspace id: 2000
+
+to both containers currently can't.
+
+Of course the administrator has the option to recursively change ownership via
+``chown()``. For example, they could change ownership so that ``dir`` and all
+files below it can be crossmapped from the filesystem's into the container's
+idmapping. Let's assume they change ownership so it is compatible with the
+first container's idmapping::
+
+ dir userspace id:       10000
+ dir/file1 userspace id: 11000
+ dir/file2 userspace id: 12000
+
+This would still leave ``dir`` rather useless to the second container. In fact,
+``dir`` and all files below it would continue to appear owned by the overflowid
+for the second container.
+
+Or consider another increasingly popular example. Some service managers such as
+systemd implement a concept called "portable home directories". A user may want
+to use their home directories on different machines where they are assigned
+different login userspace ids. Most users will have ``1000`` as the login id on
+their machine at home and all files in their home directory will usually be
+owned by id ``1000``. At uni or at work they may have another login id such as
+``1125``. This makes it rather difficult to interact with their home directory
+on the work machine.
+
+In both cases changing ownership recursively has grave implications. The most
+obvious one is that ownership is changed globally and permanently. In the home
+directory case this change in ownership would even need to happen every time
+the user switches from their home to their work machine. For really large sets
+of files this becomes increasingly costly.
+
+If the user is lucky, they are dealing with a filesystem that is mountable
+inside user namespaces. But this would also change ownership globally and the
+change in ownership is tied to the lifetime of the filesystem mount, i.e. the
+superblock. The only way to change ownership is to completely unmount the
+filesystem and mount it again in another user namespace. This is usually
+impossible because it would mean that all users currently accessing the
+filesystem can't anymore. And it means that ``dir`` still can't be shared
+between two containers with different idmappings.
+But usually the user doesn't even have this option since most filesystems
+aren't mountable inside containers. And not having them mountable might be
+desirable as it doesn't require the filesystem to deal with malicious
+filesystem images.
+
+But the use-cases mentioned above and more can be handled by idmapped mounts.
+They allow exposing the same set of dentries with different ownership at
+different mounts. This is achieved by marking the mounts with a user namespace
+through the ``mount_setattr()`` system call. The idmapping associated with it
+is then used to translate from the caller's idmapping to the filesystem's
+idmapping and vice versa using the remapping algorithm we introduced above.
+
+In contrast, idmapped mounts make it possible to change ownership in
+a temporary and localized way. The ownership changes are restricted to
+a specific mount and the ownership changes are tied to the lifetime of the
+mount. All other users and locations where the filesystem is exposed are
+unaffected.
+
+Filesystems that support idmapped mounts don't have any real reason to support
+being mountable inside user namespaces. A filesystem could be exposed
+completely under an idmapped mount to get the same effect. This has the
+advantage that filesystems can leave the creation of the superblock to
+privileged users in the initial user namespace.
+
+However, it is perfectly possible to combine idmapped mounts with filesystems
+mountable inside user namespaces. We will touch on this further below.
+
+Idmapping functions were added that translate between idmappings. They make use
+of the remapping algorithm we've introduced earlier. We're going to look at
+two:
+
+- ``mapped_fsuid()`` and ``mapped_fsgid()``
+
+  The ``mapped_fs*id()`` functions translate the caller's kernel ids into
+  kernel ids in the filesystem's idmapping. This translation is achieved by
+  remapping the caller's kernel ids using the mount's idmapping::
+
+   /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */
+   uid = from_kuid(mount, id)
+
+   /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
+   kuid = make_kuid(filesystem, uid)
+
+- ``i_uid_into_mnt()`` and ``i_gid_into_mnt()``
+
+  The ``i_*id_into_mnt()`` functions translate the filesystem's kernel ids into
+  kernel ids in the mount's idmapping::
+
+   /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */
+   uid = from_kuid(filesystem, id)
+
+   /* Map the filesystem's userspace id down into a kernel id in the mount's idmapping. */
+   kuid = make_kuid(mount, uid)
+
+Note that these two functions invert each other. Consider the following
+idmappings::
+
+ caller idmapping:     0:10000:10000
+ filesystem idmapping: 0:20000:10000
+ mount idmapping:      0:10000:10000
+
+Assume a file with userspace id ``1000`` is read from disk. The filesystem maps
+this userspace id into kernel id ``21000`` according to its idmapping. This is
+what is stored in the inode's ``i_uid`` and ``i_gid`` fields.
+
+When the caller queries the ownership of this file via ``stat()`` the kernel
+would usually simply use the crossmapping algorithm and map the filesystem's
+kernel id up to a userspace id in the caller's idmapping.
+
+But when the caller is accessing the file on an idmapped mount the kernel will
+first call ``i_uid_into_mnt()`` thereby translating the filesystem's kernel id
+into a kernel id in the mount's idmapping::
+
+ i_uid_into_mnt(21000):
+   /* Map the filesystem's kernel id up into a userspace id. */
+   1000 = from_kuid(0:20000:10000, 21000)
+
+   /* Map the filesystem's userspace id down into a kernel id in the mount's idmapping. */
+   11000 = make_kuid(0:10000:10000, 1000)
+
+Finally, when the kernel reports the owner to the caller it will turn the
+kernel id in the mount's idmapping into a userspace id in the caller's
+idmapping::
+
+  1000 = from_kuid(0:10000:10000, 11000)
+
+We can test whether this algorithm really works by verifying what happens when
+we create a new file. Let's say the user is creating a file with filesystem
+userspace id ``1000``.
+
+The kernel maps this to kernel id ``11000`` in the caller's idmapping. Usually
+the kernel would now apply the crossmapping, verifying that the kernel id
+``11000`` can be mapped to a userspace id in the filesystem's idmapping and
+ultimately write that userspace id to disk.
+
+But when the caller is accessing the file on an idmapped mount the kernel will
+first call ``mapped_fs*id()`` thereby translating the caller's kernel id into
+a kernel id according to the mount's idmapping::
+
+ mapped_fsuid(11000):
+    /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */
+    1000 = from_kuid(0:10000:10000, 11000)
+
+    /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
+    21000 = make_kuid(0:20000:10000, 1000)
+
+When finally writing to disk the kernel will then map the kernel id ``21000``
+up into a userspace id in the filesystem's idmapping::
+
+   1000 = from_kuid(0:20000:10000, 21000)
+
+As we can see, we end up with an invertible and information preserving
+algorithm. A file created from userspace id ``1000`` from an idmapped mount
+will also be reported as being owned by userspace id ``1000`` and vice versa.
+
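The round trip through an idmapped mount can be checked with a small Python model. The signatures below are simplified for illustration and are assumptions of this sketch: the toy ``mapped_fsuid()`` and ``i_uid_into_mnt()`` take the mount's and the filesystem's idmapping as explicit ``(userspace id, kernel id, range)`` tuples, which is not how the kernel functions are declared.

```python
from typing import Optional, Tuple

Idmap = Tuple[int, int, int]  # (userspace id, kernel id, range)


def make_kuid(m: Idmap, uid: int) -> Optional[int]:
    """Toy model: map a userspace id down into a kernel id."""
    u, k, n = m
    return uid - u + k if u <= uid < u + n else None


def from_kuid(m: Idmap, kuid: int) -> Optional[int]:
    """Toy model: map a kernel id up into a userspace id."""
    u, k, n = m
    return kuid - k + u if k <= kuid < k + n else None


def mapped_fsuid(mnt: Idmap, fs: Idmap, kuid: int) -> Optional[int]:
    """Caller's kernel id -> kernel id in the filesystem's idmapping."""
    uid = from_kuid(mnt, kuid)
    return None if uid is None else make_kuid(fs, uid)


def i_uid_into_mnt(mnt: Idmap, fs: Idmap, kuid: int) -> Optional[int]:
    """Filesystem's kernel id -> kernel id in the mount's idmapping."""
    uid = from_kuid(fs, kuid)
    return None if uid is None else make_kuid(mnt, uid)


caller = mount = (0, 10000, 10000)
filesystem = (0, 20000, 10000)

# Reporting ownership: the inode holds kernel id 21000.
mnt_kuid = i_uid_into_mnt(mount, filesystem, 21000)
reported = from_kuid(caller, mnt_kuid)

# Creating a file: the caller's userspace id is 1000.
fs_kuid = mapped_fsuid(mount, filesystem, make_kuid(caller, 1000))
on_disk = from_kuid(filesystem, fs_kuid)

print(mnt_kuid, reported, fs_kuid, on_disk)
```

In this model the two helpers undo each other: ``21000`` becomes ``11000`` and back, so both the reported owner and the id written to disk are ``1000``.
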
+Let's now briefly reconsider the failing examples from earlier in the context
+of idmapped mounts.
+
+Example 2 reconsidered
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ caller userspace id:  1000
+ caller idmapping:     0:10000:10000
+ filesystem idmapping: 0:20000:10000
+ mount idmapping:      0:10000:10000
+
+When the caller is using a non-initial idmapping the common case is to attach
+the same idmapping to the mount. We now perform three steps:
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
+
+    make_kuid(0:10000:10000, 1000) = 11000
+
+2. Translate the caller's kernel id into a kernel id in the filesystem's
+   idmapping::
+
+    mapped_fsuid(11000):
+      /* Map the kernel id up into a userspace id in the mount's idmapping. */
+      from_kuid(0:10000:10000, 11000) = 1000
+
+      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
+      make_kuid(0:20000:10000, 1000) = 21000
+
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
+   filesystem's idmapping::
+
+    from_kuid(0:20000:10000, 21000) = 1000
+
+So the ownership that lands on disk will be the userspace id ``1000``.
+
+Example 3 reconsidered
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ caller userspace id:  1000
+ caller idmapping:     0:10000:10000
+ filesystem idmapping: 0:0:4294967295
+ mount idmapping:      0:10000:10000
+
+The same translation algorithm works with the third example.
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
+
+    make_kuid(0:10000:10000, 1000) = 11000
+
+2. Translate the caller's kernel id into a kernel id in the filesystem's
+   idmapping::
+
+    mapped_fsuid(11000):
+      /* Map the kernel id up into a userspace id in the mount's idmapping. */
+      from_kuid(0:10000:10000, 11000) = 1000
+
+      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
+      make_kuid(0:0:4294967295, 1000) = 1000
+
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
+   filesystem's idmapping::
+
+    from_kuid(0:0:4294967295, 1000) = 1000
+
+So the ownership that lands on disk will be the userspace id ``1000``.
+
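The creation path used in the two examples above can be sketched in C. This is only an illustration under stated assumptions: ``struct idmap`` is a hypothetical single-range ``u:k:r`` triple, ``make_kuid()`` and ``from_kuid()`` are simplified stand-ins that take it by value (the kernel helpers operate on ``struct user_namespace`` and ``kuid_t``), and ``INVALID_ID`` and ``owner_on_disk()`` are names made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-range idmapping "u:k:r" as written in this document. */
struct idmap {
	uint32_t u; /* first userspace id */
	uint32_t k; /* first kernel id */
	uint32_t r; /* number of ids in the range */
};

#define INVALID_ID UINT32_MAX /* assumed stand-in for "no mapping" */

/* Map a userspace id down into a kernel id in idmapping m. */
static inline uint32_t make_kuid(struct idmap m, uint32_t uid)
{
	if (uid == INVALID_ID || uid < m.u || uid - m.u >= m.r)
		return INVALID_ID;
	return m.k + (uid - m.u);
}

/* Map a kernel id up into a userspace id in idmapping m. */
static inline uint32_t from_kuid(struct idmap m, uint32_t kuid)
{
	if (kuid == INVALID_ID || kuid < m.k || kuid - m.k >= m.r)
		return INVALID_ID;
	return m.u + (kuid - m.k);
}

/* Userspace id that lands on disk when caller_uid creates a file. */
static inline uint32_t owner_on_disk(struct idmap caller, struct idmap mnt,
				     struct idmap fs, uint32_t caller_uid)
{
	/* Step 1: userspace id -> kernel id in the caller's idmapping. */
	uint32_t kuid = make_kuid(caller, caller_uid);
	/* Step 2: up through the mount's idmapping, down into the filesystem's. */
	uint32_t uid = from_kuid(mnt, kuid);

	kuid = make_kuid(fs, uid);
	/* Step 3: the kernel id must map back up in the filesystem's idmapping. */
	return from_kuid(fs, kuid);
}
```

With the caller and mount idmapping ``0:10000:10000``, ``owner_on_disk()`` yields ``1000`` for both the ``0:20000:10000`` and the ``0:0:4294967295`` filesystem idmapping, matching the two walkthroughs.
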
+Example 4 reconsidered
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ file userspace id:    1000
+ caller idmapping:     0:10000:10000
+ filesystem idmapping: 0:0:4294967295
+ mount idmapping:      0:10000:10000
+
+In order to report ownership to userspace the kernel now does three steps with
+a translation algorithm we introduced earlier:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+   idmapping::
+
+    make_kuid(0:0:4294967295, 1000) = 1000
+
+2. Translate the kernel id into a kernel id in the mount's idmapping::
+
+    i_uid_into_mnt(1000):
+      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
+      from_kuid(0:0:4294967295, 1000) = 1000
+
+      /* Map the userspace id down into a kernel id in the mount's idmapping. */
+      make_kuid(0:10000:10000, 1000) = 11000
+
+3. Map the kernel id up into a userspace id in the caller's idmapping::
+
+    from_kuid(0:10000:10000, 11000) = 1000
+
+Earlier, the file's kernel id couldn't be crossmapped between the filesystem's
+and the caller's idmapping. With the idmapped mount in place it can now be
+crossmapped via the mount's idmapping. The file will now be reported as owned
+by userspace id ``1000`` according to the mount's idmapping.
+
+Example 5 reconsidered
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ file userspace id:    1000
+ caller idmapping:     0:10000:10000
+ filesystem idmapping: 0:20000:10000
+ mount idmapping:      0:10000:10000
+
+Again, in order to report ownership to userspace the kernel now does three
+steps with a translation algorithm we introduced earlier:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+   idmapping::
+
+    make_kuid(0:20000:10000, 1000) = 21000
+
+2. Translate the kernel id into a kernel id in the mount's idmapping::
+
+    i_uid_into_mnt(21000):
+      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
+      from_kuid(0:20000:10000, 21000) = 1000
+
+      /* Map the userspace id down into a kernel id in the mount's idmapping. */
+      make_kuid(0:10000:10000, 1000) = 11000
+
+3. Map the kernel id up into a userspace id in the caller's idmapping::
+
+    from_kuid(0:10000:10000, 11000) = 1000
+
+Earlier, the file's kernel id couldn't be crossmapped between the filesystem's
+and the caller's idmapping. With the idmapped mount in place it can now be
+crossmapped via the mount's idmapping. The file is now reported as owned by
+userspace id ``1000`` according to the mount's idmapping.
+
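The reporting path of the last two examples can be sketched the same way. This is a minimal illustration, assuming a hypothetical single-range ``struct idmap`` and simplified by-value versions of ``make_kuid()`` and ``from_kuid()``; ``INVALID_ID`` and ``reported_owner()`` are names invented for this sketch, not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-range idmapping "u:k:r" as written in this document. */
struct idmap {
	uint32_t u; /* first userspace id */
	uint32_t k; /* first kernel id */
	uint32_t r; /* number of ids in the range */
};

#define INVALID_ID UINT32_MAX /* assumed stand-in for "no mapping" */

/* Map a userspace id down into a kernel id in idmapping m. */
static inline uint32_t make_kuid(struct idmap m, uint32_t uid)
{
	if (uid == INVALID_ID || uid < m.u || uid - m.u >= m.r)
		return INVALID_ID;
	return m.k + (uid - m.u);
}

/* Map a kernel id up into a userspace id in idmapping m. */
static inline uint32_t from_kuid(struct idmap m, uint32_t kuid)
{
	if (kuid == INVALID_ID || kuid < m.k || kuid - m.k >= m.r)
		return INVALID_ID;
	return m.u + (kuid - m.k);
}

/* Userspace id reported to the caller for a file stored with disk_uid. */
static inline uint32_t reported_owner(struct idmap caller, struct idmap mnt,
				      struct idmap fs, uint32_t disk_uid)
{
	/* Step 1: on-disk id -> kernel id in the filesystem's idmapping. */
	uint32_t kuid = make_kuid(fs, disk_uid);
	/* Step 2: up through the filesystem's idmapping, down into the mount's. */
	uint32_t uid = from_kuid(fs, kuid);

	kuid = make_kuid(mnt, uid);
	/* Step 3: kernel id -> userspace id in the caller's idmapping. */
	return from_kuid(caller, kuid);
}
```

For both filesystem idmappings used above, ``reported_owner()`` yields ``1000``. Skipping step 2, that is computing ``from_kuid(caller, make_kuid(fs, 1000))`` directly, yields ``INVALID_ID`` in both cases, because the kernel ids ``1000`` and ``21000`` fall outside the caller's range of ``10000`` to ``19999``.
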
+Changing ownership on a home directory
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We've seen above how idmapped mounts can be used to translate between
+idmappings when either the caller, the filesystem, or both use a non-initial
+idmapping. A wide range of use cases exists when the caller is using
+a non-initial idmapping. This mostly happens in the context of containerized
+workloads. The consequence, as we have seen, is that for both filesystems
+mounted with the initial idmapping and filesystems mounted with non-initial
+idmappings, access to the filesystem fails because the kernel ids can't be
+crossmapped between the caller's and the filesystem's idmapping.
+
+As we've seen above, idmapped mounts provide a solution to this by translating
+between the caller's and the filesystem's idmapping.
+
+Aside from containerized workloads, idmapped mounts have the advantage that
+they also work when both the caller and the filesystem use the initial
+idmapping, which means users on the host can change the ownership of dentries
+on a per-mount basis.
+
+Consider our previous example where a user has their home directory on portable
+storage. At home they have id ``1000`` and all files in their home directory
+are owned by id ``1000`` whereas at uni or work they have login id ``1125``.
+
+Taking their home directory with them becomes problematic. They can't easily
+access their files, they might not be able to write to disk without applying
+lax permissions or ACLs and even if they can, they will end up with an annoying
+mix of files and directories owned by id ``1000`` and id ``1125``.
+
+Idmapped mounts solve this problem. A user can create an idmapped
+mount for their home directory on their work computer or their computer at home
+depending on what ownership they would prefer to end up on the portable storage
+itself.
+
+Let's assume they want all files on disk to belong to userspace id ``1000``.
+When the user plugs in their portable storage at their workstation they can
+set up a job that creates an idmapped mount with the minimal idmapping
+``1000:1125:1``. So now when they create a file the kernel performs the
+following steps we already know from above:
+
+::
+
+ caller userspace id:  1125
+ caller idmapping:     0:0:4294967295
+ filesystem idmapping: 0:0:4294967295
+ mount idmapping:      1000:1125:1
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
+
+    make_kuid(0:0:4294967295, 1125) = 1125
+
+2. Translate the caller's kernel id into a kernel id in the filesystem's
+   idmapping::
+
+    mapped_fsuid(1125):
+      /* Map the kernel id up into a userspace id in the mount's idmapping. */
+      from_kuid(1000:1125:1, 1125) = 1000
+
+      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
+      make_kuid(0:0:4294967295, 1000) = 1000
+
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
+   filesystem's idmapping::
+
+    from_kuid(0:0:4294967295, 1000) = 1000
+
+So ultimately the file will be created with userspace id ``1000`` on disk.
+
+Now let's briefly look at what ownership the caller with id ``1125`` will see
+on their work computer:
+
+::
+
+ file userspace id:    1000
+ caller idmapping:     0:0:4294967295
+ filesystem idmapping: 0:0:4294967295
+ mount idmapping:      1000:1125:1
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+   idmapping::
+
+    make_kuid(0:0:4294967295, 1000) = 1000
+
+2. Translate the kernel id into a kernel id in the mount's idmapping::
+
+    i_uid_into_mnt(1000):
+      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
+      from_kuid(0:0:4294967295, 1000) = 1000
+
+      /* Map the userspace id down into a kernel id in the mount's idmapping. */
+      make_kuid(1000:1125:1, 1000) = 1125
+
+3. Map the kernel id up into a userspace id in the caller's idmapping::
+
+    from_kuid(0:0:4294967295, 1125) = 1125
+
+So ultimately the file will be reported to the caller as belonging to
+userspace id ``1125``, which is the caller's userspace id on their workstation
+in our example.
+
+The raw userspace id that is put on disk is ``1000``. When the user takes their
+home directory back to their home computer, where they have userspace id
+``1000`` in the initial idmapping, and mounts the filesystem with the initial
+idmapping, they will see all those files as belonging to id ``1000``.
+
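The home directory round trip can be sketched in C as well. This is a rough illustration under stated assumptions: ``struct idmap`` is a hypothetical single-range ``u:k:r`` triple, ``make_kuid()`` and ``from_kuid()`` are simplified by-value stand-ins for the kernel helpers, and ``INVALID_ID``, ``owner_on_disk()`` and ``reported_owner()`` are names invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-range idmapping "u:k:r" as written in this document. */
struct idmap {
	uint32_t u, k, r; /* first userspace id, first kernel id, range */
};

#define INVALID_ID UINT32_MAX /* assumed stand-in for "no mapping" */

/* Map a userspace id down into a kernel id in idmapping m. */
static inline uint32_t make_kuid(struct idmap m, uint32_t uid)
{
	if (uid == INVALID_ID || uid < m.u || uid - m.u >= m.r)
		return INVALID_ID;
	return m.k + (uid - m.u);
}

/* Map a kernel id up into a userspace id in idmapping m. */
static inline uint32_t from_kuid(struct idmap m, uint32_t kuid)
{
	if (kuid == INVALID_ID || kuid < m.k || kuid - m.k >= m.r)
		return INVALID_ID;
	return m.u + (kuid - m.k);
}

/* Creation: caller's userspace id -> userspace id written to disk. */
static inline uint32_t owner_on_disk(struct idmap caller, struct idmap mnt,
				     struct idmap fs, uint32_t caller_uid)
{
	uint32_t kuid = make_kuid(caller, caller_uid);
	uint32_t uid = from_kuid(mnt, kuid); /* up via the mount */

	kuid = make_kuid(fs, uid);           /* down into the filesystem */
	return from_kuid(fs, kuid);
}

/* Reporting: userspace id on disk -> userspace id shown to the caller. */
static inline uint32_t reported_owner(struct idmap caller, struct idmap mnt,
				      struct idmap fs, uint32_t disk_uid)
{
	uint32_t kuid = make_kuid(fs, disk_uid);
	uint32_t uid = from_kuid(fs, kuid);  /* up via the filesystem */

	kuid = make_kuid(mnt, uid);          /* down into the mount */
	return from_kuid(caller, kuid);
}
```

With the initial idmapping ``0:0:4294967295`` for caller and filesystem and the mount idmapping ``1000:1125:1``, ``owner_on_disk()`` turns ``1125`` into ``1000`` and ``reported_owner()`` turns ``1000`` back into ``1125``. Because the mount idmapping covers a single id, any other caller id, for example ``1126``, comes out as ``INVALID_ID`` in this sketch.
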
+Shortcircuiting
+---------------
+
+Currently, the implementation of idmapped mounts enforces that the filesystem
+is mounted with the initial idmapping. The reason is simply that none of the
+filesystems that we targeted were mountable with a non-initial idmapping. But
+that might change soon enough. As we've seen above, thanks to the properties of
+idmappings the translation works for both filesystems mounted with the initial
+idmapping and filesystems mounted with non-initial idmappings.
+
+Based on this current restriction to filesystems mounted with the initial
+idmapping, two notable shortcuts have been taken:
+
+1. We always stash a reference to the initial user namespace in ``struct
+   vfsmount``. Idmapped mounts are thus mounts that have a non-initial user
+   namespace attached to them.
+
+   To support idmapped mounts on filesystems mounted with a non-initial
+   idmapping this needs to change. Instead of the initial user namespace, the
+   user namespace the filesystem was mounted with must be stashed. An idmapped
+   mount is then any mount that has a different user namespace attached than
+   the filesystem was mounted with. This has no user-visible consequences.
+
+2. The translation algorithms in ``mapped_fs*id()`` and ``i_*id_into_mnt()``
+   are simplified.
+
+   Let's consider ``mapped_fs*id()`` first. This function translates the
+   caller's kernel id into a kernel id in the filesystem's idmapping via
+   a mount's idmapping. The full algorithm is::
+
+    mapped_fsuid():
+      /* Map the kernel id up into a userspace id in the mount's idmapping. */
+      uid_t uid = from_kuid(mount-idmapping, id)
+
+      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
+      kuid_t kuid = make_kuid(filesystem-idmapping, uid)
+
+   We know that the filesystem is always mounted with the initial idmapping as
+   we enforce this in ``mount_setattr()``. So this can be shortened to::
+
+    mapped_fsuid():
+      /* Map the kernel id up into a userspace id in the mount's idmapping. */
+      uid_t uid = from_kuid(mount-idmapping, id)
+
+      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
+      kuid_t kuid = KUIDT_INIT(uid)
+
+   Similarly, ``i_*id_into_mnt()`` translates the filesystem's kernel id into
+   a mount's kernel id. The full algorithm is::
+
+    i_uid_into_mnt():
+      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
+      uid_t uid = from_kuid(filesystem-idmapping, id)
+
+      /* Map the userspace id down into a kernel id in the mount's idmapping. */
+      kuid_t kuid = make_kuid(mount-idmapping, uid)
+
+   Again, we know that the filesystem is always mounted with the initial
+   idmapping as we enforce this in ``mount_setattr()``. So this can be
+   shortened to::
+
+    i_uid_into_mnt():
+      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
+      uid_t uid = __kuid_val(id)
+
+      /* Map the userspace id down into a kernel id in the mount's idmapping. */
+      kuid_t kuid = make_kuid(mount-idmapping, uid)
+
+Handling filesystems mounted with non-initial idmappings requires that the
+translation functions be converted to their full form. They can still be
+shortcircuited on non-idmapped mounts. This has no user-visible consequences.
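Why the shortened forms are only valid for the initial idmapping can be checked with a small C sketch. It assumes a hypothetical single-range ``struct idmap`` and simplified by-value ``make_kuid()`` and ``from_kuid()``; the ``*_full()`` and ``*_short()`` function names and ``INVALID_ID`` are invented here for comparison and are not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-range idmapping "u:k:r" as written in this document. */
struct idmap {
	uint32_t u, k, r; /* first userspace id, first kernel id, range */
};

#define INVALID_ID UINT32_MAX /* assumed stand-in for "no mapping" */

/* Map a userspace id down into a kernel id in idmapping m. */
static inline uint32_t make_kuid(struct idmap m, uint32_t uid)
{
	if (uid == INVALID_ID || uid < m.u || uid - m.u >= m.r)
		return INVALID_ID;
	return m.k + (uid - m.u);
}

/* Map a kernel id up into a userspace id in idmapping m. */
static inline uint32_t from_kuid(struct idmap m, uint32_t kuid)
{
	if (kuid == INVALID_ID || kuid < m.k || kuid - m.k >= m.r)
		return INVALID_ID;
	return m.u + (kuid - m.k);
}

/* Full form: up via the mount, down via the filesystem's idmapping. */
static inline uint32_t mapped_fsuid_full(struct idmap mnt, struct idmap fs,
					 uint32_t kuid)
{
	return make_kuid(fs, from_kuid(mnt, kuid));
}

/* Shortened form: the userspace id is used as the kernel id unchanged. */
static inline uint32_t mapped_fsuid_short(struct idmap mnt, uint32_t kuid)
{
	return from_kuid(mnt, kuid);
}

/* Full form: up via the filesystem's idmapping, down via the mount. */
static inline uint32_t i_uid_into_mnt_full(struct idmap mnt, struct idmap fs,
					   uint32_t kuid)
{
	return make_kuid(mnt, from_kuid(fs, kuid));
}

/* Shortened form: the kernel id's raw value is used as the userspace id. */
static inline uint32_t i_uid_into_mnt_short(struct idmap mnt, uint32_t kuid)
{
	return make_kuid(mnt, kuid);
}
```

With the filesystem idmapping ``0:0:4294967295`` both forms agree, since mapping through the initial idmapping leaves the value unchanged. With a filesystem idmapping such as ``0:20000:10000`` the full form of ``mapped_fsuid()`` maps kernel id ``11000`` to ``21000`` while the shortened form would return ``1000``, which is why the full form is required there.
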
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 246af51b277a..f97ea4b18523 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -34,6 +34,7 @@ algorithms work.
    quota
    seq_file
    sharedsubtree
+   idmappings
 
    automount-support
 

base-commit: 2734d6c1b1a089fb593ef6a23d4b70903526fe0c
-- 
2.30.2
