From: Christian Brauner <brauner@kernel.org>
To: linux-fsdevel@vger.kernel.org
Cc: "Alexander Viro" <viro@zeniv.linux.org.uk>,
"Jan Kara" <jack@suse.cz>, "Jeff Layton" <jlayton@kernel.org>,
"Amir Goldstein" <amir73il@gmail.com>,
"Lennart Poettering" <lennart@poettering.net>,
"Zbigniew Jędrzejewski-Szmek" <zbyszek@in.waw.pl>,
"Josef Bacik" <josef@toxicpanda.com>,
"Christian Brauner" <brauner@kernel.org>
Subject: [PATCH 3/3] fs: add immutable rootfs
Date: Fri, 02 Jan 2026 15:36:24 +0100
Message-ID: <20260102-work-immutable-rootfs-v1-3-f2073b2d1602@kernel.org>
In-Reply-To: <20260102-work-immutable-rootfs-v1-0-f2073b2d1602@kernel.org>
Currently pivot_root() doesn't work on the real rootfs because it
cannot be unmounted. Userspace has to do a recursive removal of the
initramfs contents manually before continuing the boot.
Really all we want from the real rootfs is to serve as the parent mount
for anything that is actually useful such as the tmpfs or ramfs for
initramfs unpacking or the rootfs itself. There's no need for the real
rootfs to actually be anything meaningful or useful. Add an immutable
rootfs that can be selected via the "immutable_rootfs" kernel command
line option.
The kernel will mount a tmpfs/ramfs on top of it, unpack the initramfs
and fire up userspace which mounts the rootfs and can then just do:
    chdir(rootfs);
    pivot_root(".", ".");
    umount2(".", MNT_DETACH);
and be done with it. (Ofc, userspace can also choose to retain the
initramfs contents by using something like pivot_root(".", "/initramfs")
without unmounting it.)
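For illustration, the userspace sequence above could look roughly like
the following. This is only a sketch: switch_to_rootfs() and
sys_pivot_root() are hypothetical helper names, it assumes the new root
filesystem is already mounted at new_root, and it assumes the caller has
CAP_SYS_ADMIN. The raw syscall is used because glibc does not provide a
pivot_root() wrapper.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper around the raw pivot_root(2) syscall. */
static int sys_pivot_root(const char *new_root, const char *put_old)
{
	return (int)syscall(SYS_pivot_root, new_root, put_old);
}

/*
 * Hypothetical helper: make new_root the root of the mount namespace
 * and drop the old root. Returns 0 on success, -1 with errno set on
 * failure.
 */
static int switch_to_rootfs(const char *new_root)
{
	/* Enter the new root so that "." refers to it. */
	if (chdir(new_root) < 0)
		return -1;

	/* Stack the old root on top of the new root at the same path. */
	if (sys_pivot_root(".", ".") < 0)
		return -1;

	/* Lazily detach the old root that now covers the new root. */
	if (umount2(".", MNT_DETACH) < 0)
		return -1;

	return 0;
}
```

Each step bails out early, so a caller can inspect errno to tell which
stage failed (for example ENOENT from chdir() when new_root is missing).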
Technically this also means that the rootfs mount in unprivileged
namespaces doesn't need to become MNT_LOCKED anymore as it's guaranteed
that the immutable rootfs remains permanently empty so there cannot be
anything revealed by unmounting the covering mount.
In the future this will also allow us to create completely empty mount
namespaces without the risk of leaking anything.
systemd already handles this all correctly as it tries to pivot_root()
first and falls back to MS_MOVE only when that fails.
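The try-then-fall-back pattern described here could be sketched as
below. This is not systemd's code; enter_rootfs() is a hypothetical
name, and the fallback branch simply mirrors the MS_MOVE plus chroot
sequence that the kernel itself uses in prepare_namespace(). It assumes
the new root is already mounted at new_root.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: enter new_root, preferring pivot_root() and
 * falling back to MS_MOVE only when pivot_root() itself fails.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int enter_rootfs(const char *new_root)
{
	if (chdir(new_root) < 0)
		return -1;

	/* Preferred path: pivot, then detach the old root. */
	if (syscall(SYS_pivot_root, ".", ".") == 0)
		return umount2(".", MNT_DETACH);

	/* Fallback: move the new root over "/" and chroot into it. */
	if (mount(".", "/", NULL, MS_MOVE, NULL) < 0)
		return -1;
	if (chroot(".") < 0)
		return -1;

	return chdir("/");
}
```

The fallback is only taken when pivot_root() fails; a failure of the
later umount2() is reported to the caller rather than retried through
MS_MOVE, since the root has already been switched at that point.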
This goes back to various discussions in previous years and an LPC 2024
presentation about this very topic.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/Makefile | 2 +-
fs/mount.h | 1 +
fs/namespace.c | 78 ++++++++++++++++++++++++++++++++++++++++------
fs/rootfs.c | 65 ++++++++++++++++++++++++++++++++++++++
include/uapi/linux/magic.h | 1 +
init/do_mounts.c | 13 ++++++--
init/do_mounts.h | 1 +
7 files changed, 149 insertions(+), 12 deletions(-)
diff --git a/fs/Makefile b/fs/Makefile
index a04274a3c854..d31b56b7c4d5 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -16,7 +16,7 @@ obj-y := open.o read_write.o file_table.o super.o \
stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
fs_dirent.o fs_context.o fs_parser.o fsopen.o init.o \
kernel_read_file.o mnt_idmapping.o remap_range.o pidfs.o \
- file_attr.o
+ file_attr.o rootfs.o
obj-$(CONFIG_BUFFER_HEAD) += buffer.o mpage.o
obj-$(CONFIG_PROC_FS) += proc_namespace.o
diff --git a/fs/mount.h b/fs/mount.h
index 2d28ef2a3aed..c3e0d9dbfaa4 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -5,6 +5,7 @@
#include <linux/ns_common.h>
#include <linux/fs_pin.h>
+extern struct file_system_type immutable_rootfs_fs_type;
extern struct list_head notify_list;
struct mnt_namespace {
diff --git a/fs/namespace.c b/fs/namespace.c
index 9261f56ccc81..30597f4610fd 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -75,6 +75,17 @@ static int __init initramfs_options_setup(char *str)
__setup("initramfs_options=", initramfs_options_setup);
+bool immutable_rootfs = false;
+
+static int __init immutable_rootfs_setup(char *str)
+{
+ if (*str)
+ return 0;
+ immutable_rootfs = true;
+ return 1;
+}
+__setup("immutable_rootfs", immutable_rootfs_setup);
+
static u64 event;
static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
static DEFINE_IDA(mnt_group_ida);
@@ -5976,24 +5987,73 @@ struct mnt_namespace init_mnt_ns = {
static void __init init_mount_tree(void)
{
- struct vfsmount *mnt;
- struct mount *m;
+ struct vfsmount *mnt, *immutable_mnt;
+ struct mount *mnt_root;
struct path root;
+ /*
+ * When the immutable rootfs is used, we create two mounts:
+ *
+ * (1) immutable rootfs with mount id 1
+ * (2) mutable rootfs with mount id 2
+ *
+ * with (2) mounted on top of (1).
+ */
+ if (immutable_rootfs) {
+ immutable_mnt = vfs_kern_mount(&immutable_rootfs_fs_type, 0,
+ "rootfs", NULL);
+ if (IS_ERR(immutable_mnt))
+ panic("VFS: Failed to create immutable rootfs");
+ }
+
mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", initramfs_options);
if (IS_ERR(mnt))
panic("Can't create rootfs");
- m = real_mount(mnt);
- init_mnt_ns.root = m;
- init_mnt_ns.nr_mounts = 1;
- mnt_add_to_ns(&init_mnt_ns, m);
+ if (immutable_rootfs) {
+ VFS_WARN_ON_ONCE(real_mount(immutable_mnt)->mnt_id != 1);
+ VFS_WARN_ON_ONCE(real_mount(mnt)->mnt_id != 2);
+
+ /* The namespace root is the immutable rootfs. */
+ mnt_root = real_mount(immutable_mnt);
+ init_mnt_ns.root = mnt_root;
+
+ /* Mount mutable rootfs on top of the immutable rootfs. */
+ root.mnt = immutable_mnt;
+ root.dentry = immutable_mnt->mnt_root;
+
+ LOCK_MOUNT_EXACT(mp, &root);
+ if (unlikely(IS_ERR(mp.parent)))
+ panic("VFS: Failed to setup immutable rootfs");
+ scoped_guard(mount_writer)
+ attach_mnt(real_mount(mnt), mp.parent, mp.mp);
+
+ pr_info("VFS: Finished setting up immutable rootfs\n");
+ } else {
+ VFS_WARN_ON_ONCE(real_mount(mnt)->mnt_id != 1);
+
+ /* The namespace root is the mutable rootfs. */
+ mnt_root = real_mount(mnt);
+ init_mnt_ns.root = mnt_root;
+ }
+
+ /*
+ * We've dropped all locks here but that's fine. Not just are we
+ * the only task that's running, there's no other mount
+ * namespace in existence and the initial mount namespace is
+ * completely empty until we add the mounts we just created.
+ */
+ for (struct mount *p = mnt_root; p; p = next_mnt(p, mnt_root)) {
+ mnt_add_to_ns(&init_mnt_ns, p);
+ init_mnt_ns.nr_mounts++;
+ }
+
init_task.nsproxy->mnt_ns = &init_mnt_ns;
get_mnt_ns(&init_mnt_ns);
- root.mnt = mnt;
- root.dentry = mnt->mnt_root;
-
+ /* The root and pwd always point to the mutable rootfs. */
+ root.mnt = mnt;
+ root.dentry = mnt->mnt_root;
set_fs_pwd(current->fs, &root);
set_fs_root(current->fs, &root);
diff --git a/fs/rootfs.c b/fs/rootfs.c
new file mode 100644
index 000000000000..b82b73bb8bb2
--- /dev/null
+++ b/fs/rootfs.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2026 Christian Brauner <brauner@kernel.org> */
+#include <linux/fs/super_types.h>
+#include <linux/fs_context.h>
+#include <linux/magic.h>
+
+static const struct super_operations rootfs_super_operations = {
+ .statfs = simple_statfs,
+};
+
+static int rootfs_fs_fill_super(struct super_block *s, struct fs_context *fc)
+{
+ struct inode *inode;
+
+ s->s_maxbytes = MAX_LFS_FILESIZE;
+ s->s_blocksize = PAGE_SIZE;
+ s->s_blocksize_bits = PAGE_SHIFT;
+ s->s_magic = ROOT_FS_MAGIC;
+ s->s_op = &rootfs_super_operations;
+ s->s_export_op = NULL;
+ s->s_xattr = NULL;
+ s->s_time_gran = 1;
+ s->s_d_flags = 0;
+
+ inode = new_inode(s);
+ if (!inode)
+ return -ENOMEM;
+
+ /* The real rootfs is permanently empty... */
+ make_empty_dir_inode(inode);
+ simple_inode_init_ts(inode);
+ inode->i_ino = 1;
+ /* ... and immutable. */
+ inode->i_flags |= S_IMMUTABLE;
+
+ s->s_root = d_make_root(inode);
+ if (!s->s_root)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static int rootfs_fs_get_tree(struct fs_context *fc)
+{
+ return get_tree_single(fc, rootfs_fs_fill_super);
+}
+
+static const struct fs_context_operations rootfs_fs_context_ops = {
+ .get_tree = rootfs_fs_get_tree,
+};
+
+static int rootfs_init_fs_context(struct fs_context *fc)
+{
+ fc->ops = &rootfs_fs_context_ops;
+ fc->global = true;
+ fc->sb_flags = SB_NOUSER;
+ fc->s_iflags = SB_I_NOEXEC | SB_I_NODEV;
+ return 0;
+}
+
+struct file_system_type immutable_rootfs_fs_type = {
+ .name = "rootfs",
+ .init_fs_context = rootfs_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 638ca21b7a90..1a3a5a5b785a 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -104,5 +104,6 @@
#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
#define PID_FS_MAGIC 0x50494446 /* "PIDF" */
#define GUEST_MEMFD_MAGIC 0x474d454d /* "GMEM" */
+#define ROOT_FS_MAGIC 0x524F4F54 /* "ROOT" */
#endif /* __LINUX_MAGIC_H__ */
diff --git a/init/do_mounts.c b/init/do_mounts.c
index defbbf1d55f7..e245e5e4e954 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -492,8 +492,17 @@ void __init prepare_namespace(void)
mount_root(saved_root_name);
out:
devtmpfs_mount();
- init_mount(".", "/", NULL, MS_MOVE, NULL);
- init_chroot(".");
+
+ if (immutable_rootfs) {
+ if (init_pivot_root(".", "."))
+ pr_err("VFS: Failed to pivot into new rootfs\n");
+ if (init_umount(".", MNT_DETACH))
+ pr_err("VFS: Failed to unmount old rootfs\n");
+ pr_info("VFS: Pivoted into new rootfs\n");
+ } else {
+ init_mount(".", "/", NULL, MS_MOVE, NULL);
+ init_chroot(".");
+ }
}
static bool is_tmpfs;
diff --git a/init/do_mounts.h b/init/do_mounts.h
index 6069ea3eb80d..d05870fcb662 100644
--- a/init/do_mounts.h
+++ b/init/do_mounts.h
@@ -15,6 +15,7 @@
void mount_root_generic(char *name, char *pretty_name, int flags);
void mount_root(char *root_device_name);
extern int root_mountflags;
+extern bool immutable_rootfs;
static inline __init int create_dev(char *name, dev_t dev)
{
--
2.47.3