[RFC 0/4] per-namespace allowed filesystems list

linux-fsdevel.vger.kernel.org archive mirror

* [RFC 0/4] per-namespace allowed filesystems list
@ 2012-01-23 16:56 Glauber Costa
  2012-01-23 16:56 ` [RFC 2/4] " Glauber Costa
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Glauber Costa @ 2012-01-23 16:56 UTC (permalink / raw)
  To: cgroups
  Cc: linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt, mzxreary,
	xemul, James.Bottomley, tj, eric.dumazet

This patch set creates a per-namespace list of allowed filesystems.
The goal is to prevent users inside a container, even root, from
mounting filesystems that are not allowed by the admin of the host.

My two main motivations for pursuing this are:
 1) We want to present a certain tailored view of some virtual
    filesystems, for example by bind-mounting files with
    userspace-generated data into /proc. The ability to mount /proc
    inside the container works against this effort, while disallowing
    it via capabilities would have the effect of disallowing other
    mounts as well.

 2) Some filesystems are known not to behave well in a container
    environment. They require changes to work safely. We can
    whitelist only the filesystems we want.

This works as a whitelist: only filesystems in the list may be
mounted. A blacklist would create problems when, say, a module is
loaded later, because a newly registered filesystem would be allowed
by default. The whitelist is only checked once it has been enabled,
so any setup that already works will keep working, and anyone who is
not interested in limiting filesystem mounts does not need to bother
with it.
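
To make the intended semantics concrete, here is a minimal userspace
sketch of the check. It is only a model under stated assumptions, not the
kernel code from this series: struct ns_model, model_fs_allowed() and
model_add() are hypothetical names, and the list is a fixed-size array of
strings rather than an RCU-protected list of file_system_type pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_ALLOWED 16

/* Hypothetical userspace model of the per-namespace state. */
struct ns_model {
	bool enabled;				/* list is consulted only when set */
	const char *allowed[MODEL_MAX_ALLOWED];	/* permitted filesystem type names */
	size_t count;
};

/* Returns true if a filesystem of type fstype may be mounted in ns. */
bool model_fs_allowed(const struct ns_model *ns, const char *fstype)
{
	size_t i;

	assert(ns != NULL && fstype != NULL);

	if (!ns->enabled)
		return true;	/* list disabled: existing setups keep working */

	for (i = 0; i < ns->count; i++)
		if (strcmp(ns->allowed[i], fstype) == 0)
			return true;

	return false;		/* list enabled and type not listed */
}

/* Adds fstype to the list; returns 0 on success, -1 if disabled or full. */
int model_add(struct ns_model *ns, const char *fstype)
{
	assert(ns != NULL && fstype != NULL);

	if (!ns->enabled || ns->count == MODEL_MAX_ALLOWED)
		return -1;
	if (model_fs_allowed(ns, fstype))
		return 0;	/* already present, nothing to do */

	ns->allowed[ns->count++] = fstype;
	return 0;
}
```

Note that in this model an enabled but empty list forbids every mount,
which is the behaviour a whitelist implies.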

Please let me know what you guys think about it.

Glauber Costa (4):
  move /proc/filesystems inside /proc/self
  per-namespace allowed filesystems list
  show only allowed filesystems in /proc/filesystems
  fslist netlink interface

 fs/Kconfig                     |    9 +++
 fs/Makefile                    |    1 +
 fs/filesystems.c               |  108 ++++++++++++++++++++++++------
 fs/fsnetlink.c                 |  145 ++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c                 |    5 +-
 fs/proc/base.c                 |   64 +++++++++++++++---
 fs/proc/root.c                 |    1 +
 include/linux/fs.h             |   11 +++
 include/linux/fslist_netlink.h |   35 ++++++++++
 include/linux/mnt_namespace.h  |   20 ++++++
 10 files changed, 368 insertions(+), 31 deletions(-)
 create mode 100644 fs/fsnetlink.c
 create mode 100644 include/linux/fslist_netlink.h

-- 
1.7.7.4

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC 2/4] per-namespace allowed filesystems list
  2012-01-23 16:56 [RFC 0/4] per-namespace allowed filesystems list Glauber Costa
@ 2012-01-23 16:56 ` Glauber Costa
  2012-01-23 16:56 ` [RFC 3/4] show only allowed filesystems in /proc/filesystems Glauber Costa
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Glauber Costa @ 2012-01-23 16:56 UTC (permalink / raw)
  To: cgroups
  Cc: linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt, mzxreary,
	xemul, James.Bottomley, tj, eric.dumazet, Glauber Costa

This patch creates a per-namespace list of allowed filesystems.
The goal is to prevent users inside a container, even root, from
mounting filesystems that are not allowed by the admin of the host.

My two main motivations for pursuing this are:
 1) We want to present a certain tailored view of some virtual
    filesystems, for example by bind-mounting files with
    userspace-generated data into /proc. The ability to mount /proc
    inside the container works against this effort, while disallowing
    it via capabilities would have the effect of disallowing other
    mounts as well.

 2) Some filesystems are known not to behave well in a container
    environment. They require changes to work safely. We can
    whitelist only the filesystems we want.

This works as a whitelist: only filesystems in the list may be
mounted. A blacklist would create problems when, say, a module is
loaded later, because a newly registered filesystem would be allowed
by default. The whitelist is only checked once it has been enabled,
so any setup that already works will keep working, and anyone who is
not interested in limiting filesystem mounts does not need to bother
with it.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/filesystems.c              |   83 +++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c                |    5 ++-
 include/linux/fs.h            |    9 ++++
 include/linux/mnt_namespace.h |   20 ++++++++++
 4 files changed, 116 insertions(+), 1 deletions(-)

diff --git a/fs/filesystems.c b/fs/filesystems.c
index 458d120..118d0d6 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/slab.h>
+#include <linux/mnt_namespace.h>
 #include <asm/uaccess.h>
 
 /*
@@ -218,6 +219,26 @@ int __init get_filesystem_list(char *buf)
 	return len;
 }
 
+static bool fs_allowed(struct file_system_type *fs, struct mnt_namespace *mnt)
+{
+	struct fs_allowed *p;
+	bool ret = true;
+
+	if (!fslist_is_enabled(mnt))
+		goto out;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(p, &mnt->fs_allowed, list)
+		if (p->fstype == fs)
+			goto out_rcu;
+
+	ret = false;
+out_rcu:
+	rcu_read_unlock();
+out:
+	return ret;
+}
+
 #ifdef CONFIG_PROC_FS
 int filesystems_proc_show(struct seq_file *m, void *v)
 {
@@ -265,4 +286,66 @@ struct file_system_type *get_fs_type(const char *name)
 	return fs;
 }
 
+void destroy_filesystems_list(struct mnt_namespace *mnt)
+{
+	struct fs_allowed *fs, *tmp;
+
+	WARN_ON(!mnt);
+
+	if (!fslist_is_enabled(mnt))
+		return;
+	mutex_lock(&mnt->fs_list_mutex);
+	synchronize_rcu();
+	/* entries are freed during the walk, so use the _safe iterator */
+	list_for_each_entry_safe(fs, tmp, &mnt->fs_allowed, list) {
+		list_del(&fs->list);
+		kfree(fs);
+	}
+	mutex_unlock(&mnt->fs_list_mutex);
+}
+
+void enable_filesystems_list(struct mnt_namespace *mnt)
+{
+	mnt->fs_list_enabled = true;
+}
+
+int add_filesystem_list(const char *name, struct mnt_namespace *mnt)
+{
+	struct file_system_type **fstype;
+	struct fs_allowed *fs;
+
+	if (!fslist_is_enabled(mnt))
+		return -EINVAL;
+
+	fstype = find_filesystem(name, strlen(name));
+	if (!fstype)
+		return -EINVAL;
+
+	if (fs_allowed(*fstype, mnt))
+		return 0;
+
+	fs = kmalloc(sizeof(*fs), GFP_KERNEL);
+	if (!fs)
+		return -ENOMEM;
+
+	fs->fstype = *fstype;
+
+	mutex_lock(&mnt->fs_list_mutex);
+	list_add_rcu(&fs->list, &mnt->fs_allowed);
+	mutex_unlock(&mnt->fs_list_mutex);
+
+	return 0;
+}
+
+struct file_system_type *get_fs_type_ns(const char *name,
+					struct mnt_namespace *mnt)
+{
+	struct file_system_type *fs = get_fs_type(name);
+
+	if (fs && mnt && !fs_allowed(fs, mnt)) {
+		put_filesystem(fs);
+		fs = NULL;
+	}
+	return fs;
+}
 EXPORT_SYMBOL(get_fs_type);
diff --git a/fs/namespace.c b/fs/namespace.c
index cfc6d44..e897985 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1958,7 +1958,8 @@ static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
 struct vfsmount *
 do_kern_mount(const char *fstype, int flags, const char *name, void *data)
 {
-	struct file_system_type *type = get_fs_type(fstype);
+	struct file_system_type *type = get_fs_type_ns(fstype,
+					current->nsproxy->mnt_ns);
 	struct vfsmount *mnt;
 	if (!type)
 		return ERR_PTR(-ENODEV);
@@ -2365,6 +2366,7 @@ static struct mnt_namespace *alloc_mnt_ns(void)
 	INIT_LIST_HEAD(&new_ns->list);
 	init_waitqueue_head(&new_ns->poll);
 	new_ns->event = 0;
+	init_fslist(new_ns);
 	return new_ns;
 }
 
@@ -2745,6 +2747,7 @@ void put_mnt_ns(struct mnt_namespace *ns)
 	br_write_unlock(vfsmount_lock);
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
+	destroy_filesystems_list(ns);
 	kfree(ns);
 }
 EXPORT_SYMBOL(put_mnt_ns);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3286d74..ab3633a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2531,6 +2531,15 @@ extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
 extern int filesystems_proc_show(struct seq_file *m, void *v);
 
+
+struct mnt_namespace;
+extern struct file_system_type *get_fs_type_ns(const char *name,
+					       struct mnt_namespace *mnt);
+extern void enable_filesystems_list(struct mnt_namespace *ns);
+extern void destroy_filesystems_list(struct mnt_namespace *ns);
+extern int add_filesystem_list(const char *name, struct mnt_namespace *ns);
+extern int del_filesystem_list(char *name, struct mnt_namespace *ns);
+
 extern struct super_block *get_super(struct block_device *);
 extern struct super_block *get_active_super(struct block_device *bdev);
 extern struct super_block *user_get_super(dev_t);
diff --git a/include/linux/mnt_namespace.h b/include/linux/mnt_namespace.h
index 2930485..4138fb4 100644
--- a/include/linux/mnt_namespace.h
+++ b/include/linux/mnt_namespace.h
@@ -6,12 +6,20 @@
 #include <linux/seq_file.h>
 #include <linux/wait.h>
 
+struct fs_allowed {
+	struct list_head	list;
+	struct file_system_type *fstype;
+};
+
 struct mnt_namespace {
 	atomic_t		count;
 	struct vfsmount *	root;
 	struct list_head	list;
 	wait_queue_head_t poll;
 	int event;
+	struct list_head	fs_allowed;
+	struct mutex		fs_list_mutex;
+	bool			fs_list_enabled;
 };
 
 struct proc_mounts {
@@ -22,6 +30,18 @@ struct proc_mounts {
 
 struct fs_struct;
 
+static inline bool fslist_is_enabled(struct mnt_namespace *mnt)
+{
+	return mnt->fs_list_enabled;
+}
+
+static inline void init_fslist(struct mnt_namespace *ns)
+{
+	ns->fs_list_enabled = false;
+	INIT_LIST_HEAD(&ns->fs_allowed);
+	mutex_init(&ns->fs_list_mutex);
+}
+
 extern struct mnt_namespace *create_mnt_ns(struct vfsmount *mnt);
 extern struct mnt_namespace *copy_mnt_ns(unsigned long, struct mnt_namespace *,
 		struct fs_struct *);
-- 
1.7.7.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC 3/4] show only allowed filesystems in /proc/filesystems
  2012-01-23 16:56 [RFC 0/4] per-namespace allowed filesystems list Glauber Costa
  2012-01-23 16:56 ` [RFC 2/4] " Glauber Costa
@ 2012-01-23 16:56 ` Glauber Costa
       [not found] ` <1327337772-1972-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2012-01-24  0:04 ` Eric W. Biederman
  3 siblings, 0 replies; 16+ messages in thread
From: Glauber Costa @ 2012-01-23 16:56 UTC (permalink / raw)
  To: cgroups
  Cc: linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt, mzxreary,
	xemul, James.Bottomley, tj, eric.dumazet, Glauber Costa

Now that a namespace can have a list of filesystems that differs from
the default, show only the allowed ones in /proc/filesystems.
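
As a rough illustration of the resulting output, the sketch below filters a
list of registered filesystems through an allowed list and formats it the
way /proc/filesystems does ("nodev" in the first column for types that need
no block device). This is a userspace model only: struct fs_model and
model_show_filesystems() are hypothetical names, and passing allowed == NULL
stands in for "list not enabled".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a registered filesystem type. */
struct fs_model {
	const char *name;
	bool requires_dev;	/* models FS_REQUIRES_DEV */
};

/*
 * Writes the filtered listing into buf and returns the number of bytes
 * written (excluding the trailing NUL). allowed == NULL means the list
 * is disabled, so every registered type is shown.
 */
size_t model_show_filesystems(char *buf, size_t len,
			      const struct fs_model *all, size_t n_all,
			      const char *const *allowed, size_t n_allowed)
{
	size_t used = 0, i, j;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (i = 0; i < n_all; i++) {
		bool ok = (allowed == NULL);
		int w;

		for (j = 0; !ok && j < n_allowed; j++)
			ok = strcmp(allowed[j], all[i].name) == 0;
		if (!ok)
			continue;	/* not in the allowed list: hide it */

		w = snprintf(buf + used, len - used, "%s\t%s\n",
			     all[i].requires_dev ? "" : "nodev", all[i].name);
		if (w < 0 || (size_t)w >= len - used) {
			buf[used] = '\0';	/* drop the partial line */
			break;
		}
		used += (size_t)w;
	}
	return used;
}
```

The order of the output follows the registration order, not the order of
the allowed list, which matches how the patch walks file_systems.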

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/filesystems.c |    4 +++-
 fs/proc/base.c   |   51 +++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/fs/filesystems.c b/fs/filesystems.c
index 118d0d6..b797cda 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -243,11 +243,13 @@ out:
 int filesystems_proc_show(struct seq_file *m, void *v)
 {
 	struct file_system_type * tmp;
+	struct mnt_namespace *ns = m->private;
 
 	read_lock(&file_systems_lock);
 	tmp = file_systems;
 	while (tmp) {
-		seq_printf(m, "%s\t%s\n",
+		if (!ns || fs_allowed(tmp, ns))
+			seq_printf(m, "%s\t%s\n",
 			(tmp->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev",
 			tmp->name);
 		tmp = tmp->next;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 2a6e2c7..2a88a47 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -627,6 +627,44 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
 	return 0;
 }
 
+struct mnt_namespace *mnt_ns_from_task(struct task_struct *task)
+{
+	struct nsproxy *nsp;
+	struct mnt_namespace *ns = NULL;
+
+
+	rcu_read_lock();
+	nsp = task_nsproxy(task);
+	if (nsp) {
+		ns = nsp->mnt_ns;
+		if (ns)
+			get_mnt_ns(ns);
+	}
+	rcu_read_unlock();
+	return ns;
+}
+
+struct mnt_namespace *mnt_ns_from_inode(struct inode *inode)
+{
+	struct task_struct *task = get_proc_task(inode);
+	struct path root;
+	struct mnt_namespace *ns = NULL;
+
+	if (!task)
+		return NULL;
+
+	ns = mnt_ns_from_task(task);
+
+	if (ns && get_task_root(task, &root) != 0) {
+		put_mnt_ns(ns);
+		ns = NULL;
+	}
+	if (ns)		/* root holds a reference only if get_task_root() succeeded */
+		path_put(&root);
+	put_task_struct(task);
+	return ns;
+}
+
 static const struct inode_operations proc_def_inode_operations = {
 	.setattr	= proc_setattr,
 };
@@ -635,21 +673,13 @@ static int mounts_open_common(struct inode *inode, struct file *file,
 			      const struct seq_operations *op)
 {
 	struct task_struct *task = get_proc_task(inode);
-	struct nsproxy *nsp;
 	struct mnt_namespace *ns = NULL;
 	struct path root;
 	struct proc_mounts *p;
 	int ret = -EINVAL;
 
 	if (task) {
-		rcu_read_lock();
-		nsp = task_nsproxy(task);
-		if (nsp) {
-			ns = nsp->mnt_ns;
-			if (ns)
-				get_mnt_ns(ns);
-		}
-		rcu_read_unlock();
+		ns = mnt_ns_from_task(task);
 		if (ns && get_task_root(task, &root) == 0)
 			ret = 0;
 		put_task_struct(task);
@@ -2875,7 +2905,8 @@ static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *de
 
 static int filesystems_proc_open(struct inode *inode, struct file *file)
 {
-	return single_open(file, filesystems_proc_show, NULL);
+	struct mnt_namespace *ns = mnt_ns_from_inode(inode);
+	return single_open(file, filesystems_proc_show, ns);
 }
 
 static const struct file_operations filesystems_proc_fops = {
-- 
1.7.7.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

[parent not found: <1327337772-1972-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]

* [RFC 1/4] move /proc/filesystems inside /proc/self
       [not found] ` <1327337772-1972-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-01-23 16:56   ` Glauber Costa
  2012-01-23 16:56   ` [RFC 4/4] fslist netlink interface Glauber Costa
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Glauber Costa @ 2012-01-23 16:56 UTC (permalink / raw)
  To: cgroups
  Cc: linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt, mzxreary,
	xemul, James.Bottomley, tj, eric.dumazet, Glauber Costa

This simple patch prepares for what is to come. It moves the list of
available filesystems into /proc/self/ and makes /proc/filesystems a
symlink to it.

It effectively means that each process may have a different view of which
filesystems are available in the system, depending on the namespace it
lives in.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/filesystems.c   |   21 +--------------------
 fs/proc/base.c     |   15 +++++++++++++++
 fs/proc/root.c     |    1 +
 include/linux/fs.h |    2 ++
 4 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/fs/filesystems.c b/fs/filesystems.c
index 0845f84..458d120 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -219,7 +219,7 @@ int __init get_filesystem_list(char *buf)
 }
 
 #ifdef CONFIG_PROC_FS
-static int filesystems_proc_show(struct seq_file *m, void *v)
+int filesystems_proc_show(struct seq_file *m, void *v)
 {
 	struct file_system_type * tmp;
 
@@ -234,25 +234,6 @@ static int filesystems_proc_show(struct seq_file *m, void *v)
 	read_unlock(&file_systems_lock);
 	return 0;
 }
-
-static int filesystems_proc_open(struct inode *inode, struct file *file)
-{
-	return single_open(file, filesystems_proc_show, NULL);
-}
-
-static const struct file_operations filesystems_proc_fops = {
-	.open		= filesystems_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
-
-static int __init proc_filesystems_init(void)
-{
-	proc_create("filesystems", 0, NULL, &filesystems_proc_fops);
-	return 0;
-}
-module_init(proc_filesystems_init);
 #endif
 
 static struct file_system_type *__get_fs_type(const char *name, int len)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 851ba3d..2a6e2c7 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2768,6 +2768,7 @@ static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
  */
 static const struct file_operations proc_task_operations;
 static const struct inode_operations proc_task_inode_operations;
+static const struct file_operations filesystems_proc_fops;
 
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
@@ -2851,6 +2852,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_HARDWALL
 	INF("hardwall",   S_IRUGO, proc_pid_hardwall),
 #endif
+	REG("filesystems", S_IRUGO, filesystems_proc_fops),
 };
 
 static int proc_tgid_base_readdir(struct file * filp,
@@ -2871,6 +2873,18 @@ static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *de
 				  tgid_base_stuff, ARRAY_SIZE(tgid_base_stuff));
 }
 
+static int filesystems_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, filesystems_proc_show, NULL);
+}
+
+static const struct file_operations filesystems_proc_fops = {
+	.open		= filesystems_proc_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static const struct inode_operations proc_tgid_base_inode_operations = {
 	.lookup		= proc_tgid_base_lookup,
 	.getattr	= pid_getattr,
@@ -3193,6 +3207,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_HARDWALL
 	INF("hardwall",   S_IRUGO, proc_pid_hardwall),
 #endif
+	REG("filesystems", S_IRUGO, filesystems_proc_fops),
 };
 
 static int proc_tid_base_readdir(struct file * filp,
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 03102d9..87bf2e3 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -104,6 +104,7 @@ void __init proc_root_init(void)
 	}
 
 	proc_symlink("mounts", NULL, "self/mounts");
+	proc_symlink("filesystems", NULL, "self/filesystems");
 
 	proc_net_init();
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e0bc4ff..3286d74 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2529,6 +2529,8 @@ extern int generic_block_fiemap(struct inode *inode,
 extern void get_filesystem(struct file_system_type *fs);
 extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
+extern int filesystems_proc_show(struct seq_file *m, void *v);
+
 extern struct super_block *get_super(struct block_device *);
 extern struct super_block *get_active_super(struct block_device *bdev);
 extern struct super_block *user_get_super(dev_t);
-- 
1.7.7.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC 4/4] fslist netlink interface
       [not found] ` <1327337772-1972-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2012-01-23 16:56   ` [RFC 1/4] move /proc/filesystems inside /proc/self Glauber Costa
@ 2012-01-23 16:56   ` Glauber Costa
  2012-01-23 19:20   ` [RFC 0/4] per-namespace allowed filesystems list Eric W. Biederman
  2012-01-23 21:12   ` Al Viro
  3 siblings, 0 replies; 16+ messages in thread
From: Glauber Costa @ 2012-01-23 16:56 UTC (permalink / raw)
  To: cgroups
  Cc: linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt, mzxreary,
	xemul, James.Bottomley, tj, eric.dumazet, Glauber Costa

This patch provides a netlink interface for the namespace filesystem list.
It is quite simple, and it still needs at least one more operation (query).

Also, keep in mind that although I wrote it believing it is a nice
interface for manipulating the list, I don't feel strongly about the
interface per se. So feel free to suggest something better.
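
For readers who want the request semantics without reading the genetlink
plumbing, the following is a small userspace model of how the three
operations are dispatched. It is a sketch under assumptions: only the
FSLIST_OP_* values come from this patch, while struct fslist_model and
model_cmd_ask() are hypothetical, and the list itself is reduced to a
counter. The function's return value models the netlink-level result and
*reply models the status sent back in the reply message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Operation codes, as defined in the proposed fslist_netlink.h. */
enum {
	FSLIST_OP_RESET = 1,
	FSLIST_OP_ADD,
	FSLIST_OP_QUERY,
};

/* Hypothetical model of the namespace state touched by the interface. */
struct fslist_model {
	bool enabled;
	size_t count;	/* number of entries added so far */
};

int model_cmd_ask(struct fslist_model *ns, int op, const char *arg, int *reply)
{
	assert(ns != NULL && reply != NULL);
	*reply = 0;

	switch (op) {
	case FSLIST_OP_RESET:
		ns->enabled = true;	/* from now on the list is enforced */
		return 0;
	case FSLIST_OP_ADD:
		if (arg == NULL)
			return -EINVAL;	/* ADD needs a filesystem name */
		if (!ns->enabled)
			*reply = -EINVAL;	/* adding to a disabled list is refused */
		else
			ns->count++;
		return 0;
	case FSLIST_OP_QUERY:
		return -ENOSYS;		/* not implemented in this RFC */
	default:
		return -EINVAL;
	}
}
```

In other words, a caller is expected to send RESET first and then one ADD
per filesystem type it wants to allow.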

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/Kconfig                     |    9 +++
 fs/Makefile                    |    1 +
 fs/fsnetlink.c                 |  145 ++++++++++++++++++++++++++++++++++++++++
 include/linux/fslist_netlink.h |   35 ++++++++++
 4 files changed, 190 insertions(+), 0 deletions(-)
 create mode 100644 fs/fsnetlink.c
 create mode 100644 include/linux/fslist_netlink.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 440d189..842dcc4 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -78,6 +78,15 @@ config GENERIC_ACL
 	bool
 	select FS_POSIX_ACL
 
+config FSLIST_NETLINK
+	bool "Filesystem Lists Netlink"
+	help
+	  This option allows userspace to select a sublist of the available
+	  filesystems that are mountable by a particular namespace. It provides
+	  a netlink interface through which one can manage such a set.
+
+
+
 menu "Caches"
 
 source "fs/fscache/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 57d446d..675f613 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_FS_MBCACHE)	+= mbcache.o
 obj-$(CONFIG_FS_POSIX_ACL)	+= posix_acl.o xattr_acl.o
 obj-$(CONFIG_NFS_COMMON)	+= nfs_common/
 obj-$(CONFIG_GENERIC_ACL)	+= generic_acl.o
+obj-$(CONFIG_FSLIST_NETLINK)	+= fsnetlink.o
 
 obj-$(CONFIG_FHANDLE)		+= fhandle.o
 
diff --git a/fs/fsnetlink.c b/fs/fsnetlink.c
new file mode 100644
index 0000000..619fad1
--- /dev/null
+++ b/fs/fsnetlink.c
@@ -0,0 +1,145 @@
+#include <net/genetlink.h>
+#include <linux/fslist_netlink.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/nsproxy.h>
+#include <linux/mnt_namespace.h>
+#include <linux/fs_struct.h>
+#include <linux/dcache.h>
+
+static struct nla_policy fslist_genl_policy[FSLIST_A_MAX + 1] = {
+	[FSLIST_A_OP] = { .type = NLA_U32 },
+	[FSLIST_A_OP_ARG] = { .type = NLA_STRING, .len = 200 },
+};
+
+static struct genl_family fslist_gnl_family = {
+	.id		= GENL_ID_GENERATE,
+	.name		= FSLIST_GENL_NAME,
+	.version	= FSLIST_GENL_VERSION,
+	.maxattr	= FSLIST_A_MAX,
+};
+
+static int msg_reply(int ret, struct genl_info *info)
+{
+	struct sk_buff *skb;
+	void *reply;
+	size_t size;
+	struct genlmsghdr *genlhdr;
+	void *data;
+
+	/* the reply carries a single u32 attribute (FSLIST_A_OP) */
+	size = nla_total_size(sizeof(u32));
+
+	skb = genlmsg_new(size, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	reply = genlmsg_put_reply(skb, info, &fslist_gnl_family, 0, FSLIST_CMD_REPLY);
+	if (reply == NULL) {
+		nlmsg_free(skb);
+		return -EINVAL;
+	}
+
+	nla_put_u32(skb, FSLIST_A_OP, ret);
+
+	genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+	data = genlmsg_data(genlhdr);
+
+	ret = genlmsg_end(skb, data);
+	if (ret < 0) {
+		nlmsg_free(skb);
+		return ret;
+	}
+
+	return genlmsg_reply(skb, info);
+}
+
+
+static int cmd_ask(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr *na;
+	int op = 0;
+	char *data = NULL;
+	int ret = 0;
+	struct mnt_namespace *mnt = current->nsproxy->mnt_ns;
+	struct dentry *curr_root;
+	int msg_ret = 0;
+
+	/*
+	 * Once a process is contained by a chroot environment,
+	 * we don't allow the list to grow further, or be by
+	 * any means modified. 
+	 *
+	 * It should still work after pivot_root, though.
+	 */
+	curr_root = current->fs->root.dentry;
+	if (curr_root->d_parent != curr_root)
+		return -EINVAL;
+
+	na = info->attrs[FSLIST_A_OP];
+	if (na) 
+		op = nla_get_u32(na);
+
+	na = info->attrs[FSLIST_A_OP_ARG];
+	if (na) {
+		int len = nla_len(na);
+
+		data = kmalloc(len + 1, GFP_KERNEL);
+		if (!data)
+			return -ENOMEM;
+
+		nla_strlcpy(data, na, len + 1);
+	}
+
+	switch (op) {
+	case FSLIST_OP_RESET:
+		enable_filesystems_list(mnt);
+		break;
+	case FSLIST_OP_ADD:
+		if (!data) {
+			ret = -EINVAL;
+			break;
+		}
+		msg_ret = add_filesystem_list(data, mnt);
+		break;
+	case FSLIST_OP_QUERY:
+		ret = -ENOSYS;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	
+	kfree(data);
+	if (!ret)
+		msg_reply(msg_ret, info);
+	return ret;
+}
+
+static struct genl_ops fslist_nl_ops = {
+	.cmd		= FSLIST_CMD_ASK,
+	.doit		= cmd_ask,
+	.policy 	= fslist_genl_policy,
+};
+
+int __init fslist_netlink_init(void)
+{
+	int ret;
+	ret = genl_register_family(&fslist_gnl_family);
+	if (ret) 
+		return ret;
+	
+	ret = genl_register_ops(&fslist_gnl_family, &fslist_nl_ops);
+	if (ret)
+		goto fail_unregister;
+
+	return 0;
+
+fail_unregister:
+	genl_unregister_family(&fslist_gnl_family);
+	return ret;
+}
+
+fs_initcall(fslist_netlink_init);
+
diff --git a/include/linux/fslist_netlink.h b/include/linux/fslist_netlink.h
new file mode 100644
index 0000000..926760d
--- /dev/null
+++ b/include/linux/fslist_netlink.h
@@ -0,0 +1,35 @@
+#ifndef _FSLIST_NETLINK_H
+#define _FSLIST_NETLINK_H
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#endif
+
+enum {
+	FSLIST_A_UNSPEC,
+	FSLIST_A_OP,
+	FSLIST_A_OP_ARG,
+	__FSLIST_A_MAX,
+};
+
+#define FSLIST_A_MAX (__FSLIST_A_MAX - 1)
+
+enum {
+	FSLIST_CMD_UNSPEC = 0,
+	FSLIST_CMD_ASK,		/* user->kernel */
+	FSLIST_CMD_REPLY,	/* kernel->user */
+	__FSLIST_CMD_MAX,
+};
+
+#define FSLIST_CMD_MAX (__FSLIST_CMD_MAX - 1)
+
+#define FSLIST_GENL_VERSION 0x1
+#define FSLIST_GENL_NAME "FSLIST"
+
+enum {
+	FSLIST_OP_RESET = 1,
+	FSLIST_OP_ADD,
+	FSLIST_OP_QUERY,
+};
+
+#endif /* _FSLIST_NETLINK_H */
-- 
1.7.7.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC 0/4] per-namespace allowed filesystems list
       [not found] ` <1327337772-1972-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2012-01-23 16:56   ` [RFC 1/4] move /proc/filesystems inside /proc/self Glauber Costa
  2012-01-23 16:56   ` [RFC 4/4] fslist netlink interface Glauber Costa
@ 2012-01-23 19:20   ` Eric W. Biederman
  2012-01-23 21:12   ` Al Viro
  3 siblings, 0 replies; 16+ messages in thread
From: Eric W. Biederman @ 2012-01-23 19:20 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups, linux-fsdevel, serge, daniel.lezcano, pjt, mzxreary,
	xemul, James.Bottomley, tj, eric.dumazet

Glauber Costa <glommer@parallels.com> writes:

> This patch creates a list of allowed filesystems per-namespace.
> The goal is to prevent users inside a container, even root,
> to mount filesystems that are not allowed by the main box admin.
>
> My main two motivators to pursue this are:
>  1) We want to prevent a certain tailored view of some virtual
>     filesystems, for example, by bind-mounting files with userspace
>     generated data into /proc. The ability of mounting /proc inside
>     the container works against this effort, while disallowing it
>     via capabilities would have the effect of disallowing other
>     mounts as well.
>
> 2) Some filesystems are known not to behave well under a container
>    environment. They require changes to work in a safe-way. We can
>    whitelist only the filesystems we want.
>
> This works as a whitelist. Only filesystems in the list are allowed
> to be mounted. Doing a blacklist would create problems when, say,
> a module is loaded. The whitelist is only checked if it is enabled first.
> So any setup that was already working, will keep working. And whoever
> is not interested in limiting filesystem mount, does not need
> to bother about it.

My first impression is that this looks like a hack to avoid finishing
the user namespace.

Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/4] per-namespace allowed filesystems list
       [not found] ` <1327337772-1972-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (2 preceding siblings ...)
  2012-01-23 19:20   ` [RFC 0/4] per-namespace allowed filesystems list Eric W. Biederman
@ 2012-01-23 21:12   ` Al Viro
       [not found]     ` <20120123211218.GF23916-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  3 siblings, 1 reply; 16+ messages in thread
From: Al Viro @ 2012-01-23 21:12 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups, linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt,
	mzxreary, xemul, James.Bottomley, tj, eric.dumazet

On Mon, Jan 23, 2012 at 08:56:08PM +0400, Glauber Costa wrote:
> This patch creates a list of allowed filesystems per-namespace.
> The goal is to prevent users inside a container, even root,
> to mount filesystems that are not allowed by the main box admin.
> 
> My main two motivators to pursue this are:
>  1) We want to prevent a certain tailored view of some virtual
>     filesystems, for example, by bind-mounting files with userspace
>     generated data into /proc. The ability of mounting /proc inside
>     the container works against this effort, while disallowing it
>     via capabilities would have the effect of disallowing other
>     mounts as well.

Translation, please.

> 2) Some filesystems are known not to behave well under a container
>    environment. They require changes to work in a safe-way. We can
>    whitelist only the filesystems we want.

So fix them.
 
> This works as a whitelist. Only filesystems in the list are allowed
> to be mounted. Doing a blacklist would create problems when, say,
> a module is loaded. The whitelist is only checked if it is enabled first.
> So any setup that was already working, will keep working. And whoever
> is not interested in limiting filesystem mount, does not need
> to bother about it.
> 
> Please let me know what you guys think about it.

NAKed-by: Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
NAKed-because: too fucking ugly

This is bloody ridiculous; if you want to prevent a luser admin from playing
with the set of mounts you've given it, the right way to go is not to mess with
the "which fs types are allowed" but to add a per-namespace "immutable" flag.
And add a new clone(2)/unshare(2) flag, used only along with CLONE_NEWNS,
that sets "immutable" on the copied namespace.
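
For comparison with the whitelist approach, the alternative described above
could be modelled roughly as follows. This is purely an illustrative
userspace sketch: no such clone/unshare flag is defined in this thread, so
MODEL_NEWNS, MODEL_IMMUTABLE, struct mnt_ns_model and both functions are
hypothetical names chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; MODEL_NEWNS plays the role of CLONE_NEWNS. */
#define MODEL_NEWNS	0x1u
#define MODEL_IMMUTABLE	0x2u

/* Hypothetical model of a mount namespace. */
struct mnt_ns_model {
	bool immutable;		/* once set, the mount set cannot change */
	unsigned int n_mounts;
};

/* Copies parent into child; the immutable flag is valid only with NEWNS. */
int model_copy_ns(const struct mnt_ns_model *parent, unsigned int flags,
		  struct mnt_ns_model *child)
{
	assert(parent != NULL && child != NULL);

	if (!(flags & MODEL_NEWNS))
		return -EINVAL;

	*child = *parent;
	child->immutable = parent->immutable || (flags & MODEL_IMMUTABLE) != 0;
	return 0;
}

/* Any change to the mount set is refused in an immutable namespace. */
int model_mount(struct mnt_ns_model *ns)
{
	assert(ns != NULL);

	if (ns->immutable)
		return -EPERM;
	ns->n_mounts++;
	return 0;
}
```

Unlike the whitelist, this model makes no distinction between filesystem
types: the copied namespace either accepts every mount or none.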

^ permalink raw reply	[flat|nested] 16+ messages in thread

[parent not found: <20120123211218.GF23916-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>]

* Re: [RFC 0/4] per-namespace allowed filesystems list
       [not found]     ` <20120123211218.GF23916-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2012-01-23 23:04       ` Kirill A. Shutemov
       [not found]         ` <20120123230457.GA14347-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>
  2012-01-24 10:22       ` Glauber Costa
  1 sibling, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2012-01-23 23:04 UTC (permalink / raw)
  To: Al Viro
  Cc: Glauber Costa, cgroups, linux-fsdevel, ebiederm, serge,
	daniel.lezcano, pjt, mzxreary, xemul, James.Bottomley, tj,
	eric.dumazet

On Mon, Jan 23, 2012 at 09:12:19PM +0000, Al Viro wrote:
> This is bloody ridiculous; if you want to prevent a luser adming playing with
> the set of mounts you've given it, the right way to go is not to mess with the
> "which fs types are allowed" but to add a per-namespace "immutable" flag.
> And add a new clone(2)/unshare(2) flag, used only along with the CLONE_NEWNS
> and setting the "immutable" on the copied namespace.

How will it work if we want to allow a namespaced environment to mount block
devices, but not, let's say, debugfs?

Differentiation between filesystem type and source is one of the broken things
in the Unix API. I don't see an easy way to fix it. Only plan9. :)

-- 
 Kirill A. Shutemov

* Re: [RFC 0/4] per-namespace allowed filesystems list
       [not found]         ` <20120123230457.GA14347-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>
@ 2012-01-23 23:12           ` Al Viro
  2012-01-24  7:17             ` Kirill A. Shutemov
  2012-01-24 10:32           ` Glauber Costa
  1 sibling, 1 reply; 16+ messages in thread
From: Al Viro @ 2012-01-23 23:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Glauber Costa, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, serge-A9i7LUbDfNHQT0dZR+AlfA,
	daniel.lezcano-GANU6spQydw, pjt-hpIqsD4AKlfQT0dZR+AlfA,
	mzxreary-uLTowLwuiw4b1SvskN2V4Q, xemul-bzQdu9zFT3WakBO8gow8eQ,
	James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk,
	tj-DgEjT+Ai2ygdnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w

On Tue, Jan 24, 2012 at 01:04:57AM +0200, Kirill A. Shutemov wrote:
> On Mon, Jan 23, 2012 at 09:12:19PM +0000, Al Viro wrote:
> > This is bloody ridiculous; if you want to prevent a luser adming playing with
> > the set of mounts you've given it, the right way to go is not to mess with the
> > "which fs types are allowed" but to add a per-namespace "immutable" flag.
> > And add a new clone(2)/unshare(2) flag, used only along with the CLONE_NEWNS
> > and setting the "immutable" on the copied namespace.
> 
> How will it work if we want to allow namespaced environment to mount block
> devices, but not, let say, debugfs?
> 
> Differentiation between filesystem type and source is one of broken things
> in Unix API.

Translation, please?

> I don't see an easy way to fix it. Only plan9. :)

Huh?  Plan 9 does *not* contain anything of that kind.  And their '#<letter>'
convention for in-kernel filesystems is one of the uglier things about their
API, IMO...

* Re: [RFC 0/4] per-namespace allowed filesystems list
  2012-01-23 23:12           ` Al Viro
@ 2012-01-24  7:17             ` Kirill A. Shutemov
  0 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2012-01-24  7:17 UTC (permalink / raw)
  To: Al Viro
  Cc: Glauber Costa, cgroups, linux-fsdevel, ebiederm, serge,
	daniel.lezcano, pjt, mzxreary, xemul, James.Bottomley, tj,
	eric.dumazet

On Mon, Jan 23, 2012 at 11:12:39PM +0000, Al Viro wrote:
> On Tue, Jan 24, 2012 at 01:04:57AM +0200, Kirill A. Shutemov wrote:
> > On Mon, Jan 23, 2012 at 09:12:19PM +0000, Al Viro wrote:
> > > This is bloody ridiculous; if you want to prevent a luser adming playing with
> > > the set of mounts you've given it, the right way to go is not to mess with the
> > > "which fs types are allowed" but to add a per-namespace "immutable" flag.
> > > And add a new clone(2)/unshare(2) flag, used only along with the CLONE_NEWNS
> > > and setting the "immutable" on the copied namespace.
> > 
> > How will it work if we want to allow namespaced environment to mount block
> > devices, but not, let say, debugfs?
> > 
> > Differentiation between filesystem type and source is one of broken things
> > in Unix API.
> 
> Translation, please?

mount(2) should take a file descriptor and mount it on a mountpoint. The file
descriptor provides a known interface at that point (9p?). It doesn't matter
what the actual filesystem type or the source is (no fake names for the
sources of pseudo-filesystems).

> > I don't see an easy way to fix it. Only plan9. :)
> 
> Huh?  Plan 9 does *not* contain anything of that kind. And their '#<letter>'
> convention for in-kernel filesystems is one of the uglier things about their
> API, IMO...

Yes, it's ugly. But in plan9 it can be fixed quite straightforwardly:
- the kernel provides a way to boot into an early userspace environment
  without access to any media (initrd);
- the kernel provides *one* '#<letter>' fs which contains handles to create
  any other in-kernel filesystem. This special fs can be mounted only
  once. After that you have (about) all resources as files and can
  construct any other environment using mount and bind.

In fact, the list of available (pseudo-)filesystems is yet another resource
namespace. With the approach above we can get rid of it in plan9.

-- 
 Kirill A. Shutemov

* Re: [RFC 0/4] per-namespace allowed filesystems list
       [not found]         ` <20120123230457.GA14347-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>
  2012-01-23 23:12           ` Al Viro
@ 2012-01-24 10:32           ` Glauber Costa
  1 sibling, 0 replies; 16+ messages in thread
From: Glauber Costa @ 2012-01-24 10:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Al Viro, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, serge-A9i7LUbDfNHQT0dZR+AlfA,
	daniel.lezcano-GANU6spQydw, pjt-hpIqsD4AKlfQT0dZR+AlfA,
	mzxreary-uLTowLwuiw4b1SvskN2V4Q, xemul-bzQdu9zFT3WakBO8gow8eQ,
	James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk,
	tj-DgEjT+Ai2ygdnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w

On 01/24/2012 03:04 AM, Kirill A. Shutemov wrote:
> On Mon, Jan 23, 2012 at 09:12:19PM +0000, Al Viro wrote:
>> This is bloody ridiculous; if you want to prevent a luser adming playing with
>> the set of mounts you've given it, the right way to go is not to mess with the
>> "which fs types are allowed" but to add a per-namespace "immutable" flag.
>> And add a new clone(2)/unshare(2) flag, used only along with the CLONE_NEWNS
>> and setting the "immutable" on the copied namespace.
>
> How will it work if we want to allow namespaced environment to mount block
> devices, but not, let say, debugfs?
>

For the record, that is more or less what I have in mind. But my main 
use case is /proc. I guess the case for debugfs is the same.

* Re: [RFC 0/4] per-namespace allowed filesystems list
       [not found]     ` <20120123211218.GF23916-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2012-01-23 23:04       ` Kirill A. Shutemov
@ 2012-01-24 10:22       ` Glauber Costa
  1 sibling, 0 replies; 16+ messages in thread
From: Glauber Costa @ 2012-01-24 10:22 UTC (permalink / raw)
  To: Al Viro
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, serge-A9i7LUbDfNHQT0dZR+AlfA,
	daniel.lezcano-GANU6spQydw, pjt-hpIqsD4AKlfQT0dZR+AlfA,
	mzxreary-uLTowLwuiw4b1SvskN2V4Q, xemul-bzQdu9zFT3WakBO8gow8eQ,
	James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk,
	tj-DgEjT+Ai2ygdnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w

On 01/24/2012 01:12 AM, Al Viro wrote:
> On Mon, Jan 23, 2012 at 08:56:08PM +0400, Glauber Costa wrote:
>> This patch creates a list of allowed filesystems per-namespace.
>> The goal is to prevent users inside a container, even root,
>> to mount filesystems that are not allowed by the main box admin.
>>
>> My main two motivators to pursue this are:
>>   1) We want to prevent a certain tailored view of some virtual
>>      filesystems, for example, by bind-mounting files with userspace
>>      generated data into /proc. The ability of mounting /proc inside
>>      the container works against this effort, while disallowing it
>>      via capabilities would have the effect of disallowing other
>>      mounts as well.
>
> Translation, please.
>
>> 2) Some filesystems are known not to behave well under a container
>>     environment. They require changes to work in a safe-way. We can
>>     whitelist only the filesystems we want.
>
> So fix them.
>
>> This works as a whitelist. Only filesystems in the list are allowed
>> to be mounted. Doing a blacklist would create problems when, say,
>> a module is loaded. The whitelist is only checked if it is enabled first.
>> So any setup that was already working, will keep working. And whoever
>> is not interested in limiting filesystem mount, does not need
>> to bother about it.
>>
>> Please let me know what you guys think about it.
>
> NAKed-by: Al Viro<viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
> NAKed-because: too fucking ugly
>
> This is bloody ridiculous; if you want to prevent a luser adming playing with
> the set of mounts you've given it, the right way to go is not to mess with the
> "which fs types are allowed" but to add a per-namespace "immutable" flag.
> And add a new clone(2)/unshare(2) flag, used only along with the CLONE_NEWNS
> and setting the "immutable" on the copied namespace.

Okay, now that I have laid out the problem, I am happy to pursue whatever
solution we think is better. But let me develop it a bit more first.

An immutable flag does not work, because I don't want to prevent a luser
(loved that) from messing with the mounts they are given. In general, it
is perfectly fine for them to mount things inside the container as time
goes on.

But some mounts I don't consider fine. Let me elaborate on the /proc
example I gave: much of the information living in /proc is really global,
rather than per-container. The parts pertaining to the pid namespace and
other namespaces are already per-namespace, so they are fine. But there is
more: some of the things /proc tracks, like cpu usage, memory, and the
like, are resource-constrained by other entities, for instance cgroups. In
some cases, like /proc/stat, the information exists in cgroups, but comes
from more than one cgroup. All of them are independent in nature, making
it hard to come up with a coherent view.

Furthermore, there is no connection between namespaces and cgroups, so
it is not obvious at all (there were discussions before) which
information the process should see. Unlike with namespaces, the mere fact
that a process lives in a cgroup does not really mean it is isolated
from the system in this sense.

One of the solutions is to do it all in userspace, from outside the
container, and bind mount the files inside the container's /proc. But it
only works if we can prevent the user from remounting the real /proc
somewhere. Not because it would screw up his system, which I don't care
about, but because it would give him information about the global state
of the system.

An immutable flag fixes this, but then it prevents all further
legitimate mounts.
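
To make the difference between the two proposals concrete, here is a toy
model in Python (not kernel code). The names MountNamespace, immutable,
whitelist_enabled, allowed_fstypes and may_mount are all made up for
illustration; none of them is an existing kernel interface. The sketch only
assumes the semantics described in this thread: an immutable namespace
refuses every new mount, while the RFC's list is consulted only once it has
been enabled.

```python
from dataclasses import dataclass, field


@dataclass
class MountNamespace:
    """Toy stand-in for per-namespace mount policy (hypothetical names)."""
    immutable: bool = False          # models the proposed "immutable" flag
    whitelist_enabled: bool = False  # the RFC list is checked only if enabled
    allowed_fstypes: set = field(default_factory=set)


def may_mount(ns: MountNamespace, fstype: str) -> bool:
    """Return True if this toy policy would let `fstype` be mounted in `ns`."""
    if ns.immutable:
        # An immutable namespace rejects every new mount, legitimate or not.
        return False
    if ns.whitelist_enabled:
        # With the list enabled, only listed filesystem types pass.
        return fstype in ns.allowed_fstypes
    # List not enabled: existing setups keep working unchanged.
    return True


legacy = MountNamespace()
locked = MountNamespace(immutable=True)
container = MountNamespace(whitelist_enabled=True,
                           allowed_fstypes={"ext4", "tmpfs"})

for name, ns in [("legacy", legacy), ("immutable", locked),
                 ("whitelist", container)]:
    print(name, {fs: may_mount(ns, fs) for fs in ("ext4", "proc")})
```

Under these assumptions, only the whitelist case lets ext4 through while
refusing proc, which is the behaviour asked for above.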

* Re: [RFC 0/4] per-namespace allowed filesystems list
  2012-01-23 16:56 [RFC 0/4] per-namespace allowed filesystems list Glauber Costa
                   ` (2 preceding siblings ...)
       [not found] ` <1327337772-1972-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-01-24  0:04 ` Eric W. Biederman
       [not found]   ` <m1vco2m0eh.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  3 siblings, 1 reply; 16+ messages in thread
From: Eric W. Biederman @ 2012-01-24  0:04 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups, linux-fsdevel, serge, daniel.lezcano, pjt, mzxreary,
	xemul, James.Bottomley, tj, eric.dumazet

Glauber Costa <glommer@parallels.com> writes:

> This patch creates a list of allowed filesystems per-namespace.
> The goal is to prevent users inside a container, even root,
> to mount filesystems that are not allowed by the main box admin.
>
> My main two motivators to pursue this are:
>  1) We want to prevent a certain tailored view of some virtual
>     filesystems, for example, by bind-mounting files with userspace
>     generated data into /proc. The ability of mounting /proc inside
>     the container works against this effort, while disallowing it
>     via capabilities would have the effect of disallowing other
>     mounts as well.
>
> 2) Some filesystems are known not to behave well under a container
>    environment. They require changes to work in a safe-way. We can
>    whitelist only the filesystems we want.
>
> This works as a whitelist. Only filesystems in the list are allowed
> to be mounted. Doing a blacklist would create problems when, say,
> a module is loaded. The whitelist is only checked if it is enabled first.
> So any setup that was already working, will keep working. And whoever
> is not interested in limiting filesystem mount, does not need
> to bother about it.

My first impression is that this looks like a hack to avoid finishing
the user namespace.

This is a terrible way to go about implementing unprivileged mounts.

If there are technical reasons why it is unsafe to mount filesystems,
then we need to whitelist/blacklist filesystems in the kernel, where we
can check things.

Why in the world would anyone want the ability to not mount a specific
filesystem type?

Using netlink as an interface when you are talking filesystems to
filesystem people is pretty horrid.  Netlink is great for networking
developers; they get networking. But filesystem people understand
filesystems, and you want to use netlink?

Eric

* Re: [RFC 0/4] per-namespace allowed filesystems list
       [not found]   ` <m1vco2m0eh.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2012-01-24 10:31     ` Glauber Costa
       [not found]       ` <4F1E886A.7000107-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Glauber Costa @ 2012-01-24 10:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, daniel.lezcano-GANU6spQydw,
	pjt-hpIqsD4AKlfQT0dZR+AlfA, mzxreary-uLTowLwuiw4b1SvskN2V4Q,
	xemul-bzQdu9zFT3WakBO8gow8eQ,
	James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk,
	tj-DgEjT+Ai2ygdnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w

On 01/24/2012 04:04 AM, Eric W. Biederman wrote:
> Glauber Costa<glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>  writes:
>
>> This patch creates a list of allowed filesystems per-namespace.
>> The goal is to prevent users inside a container, even root,
>> to mount filesystems that are not allowed by the main box admin.
>>
>> My main two motivators to pursue this are:
>>   1) We want to prevent a certain tailored view of some virtual
>>      filesystems, for example, by bind-mounting files with userspace
>>      generated data into /proc. The ability of mounting /proc inside
>>      the container works against this effort, while disallowing it
>>      via capabilities would have the effect of disallowing other
>>      mounts as well.
>>
>> 2) Some filesystems are known not to behave well under a container
>>     environment. They require changes to work in a safe-way. We can
>>     whitelist only the filesystems we want.
>>
>> This works as a whitelist. Only filesystems in the list are allowed
>> to be mounted. Doing a blacklist would create problems when, say,
>> a module is loaded. The whitelist is only checked if it is enabled first.
>> So any setup that was already working, will keep working. And whoever
>> is not interested in limiting filesystem mount, does not need
>> to bother about it.
>
> My first impression is that this looks like a hack to avoid finishing
> the user namespace.
>
> This is a terrible way to go about implementing unprivileged mounts.
>
> If there are technical reasons why it is unsafe to mount filesystems
> that we need to whitelist/blacklist filesystems in the kernel where we
> can check things.
>
> Why in the world would anyone want the ability to not mount a specific
> filesystem type?

See my reply to Al. So again, to avoid steering the discussion to
details I myself don't consider central (since this is a first post
anyway), let's focus on the /proc container case. It is a privileged
user as far as the container goes, and we'd like to allow it to mount
filesystems. But disallowing it from mounting /proc can guarantee that
the user is provided with a version of /proc that is safe, and that he
can't escape it.

Ideally, userspace wouldn't even get involved with this, and a process
mounting /proc would see the right things, depending on where it came
from. But it turns out that cgroups-controlled resources are a lot
harder than namespaces-controlled resources for this.

> Using netlink as an interface when you are talking filesystems to
> filesystem is pretty horrid.  Netlink is great for networking developers
> they get networking, but filesystem people understand filesystems and
> you want to use netlink?
>
Well, I am not doing it for filesystem people, but for people who are
neither, aka whoever wants to use this interface. But that said, I don't
want to keep the discussion around this. My main reason was to have a
quick way to communicate this list to the kernel, so I could test it and
post a PoC for you guys to comment on. Even if everybody liked it, I was
prepared from the start to redesign the interface.

* Re: [RFC 0/4] per-namespace allowed filesystems list
       [not found]       ` <4F1E886A.7000107-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-01-24 11:17         ` Eric W. Biederman
  2012-01-24 11:24           ` Glauber Costa
  0 siblings, 1 reply; 16+ messages in thread
From: Eric W. Biederman @ 2012-01-24 11:17 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, daniel.lezcano-GANU6spQydw,
	pjt-hpIqsD4AKlfQT0dZR+AlfA, mzxreary-uLTowLwuiw4b1SvskN2V4Q,
	xemul-bzQdu9zFT3WakBO8gow8eQ,
	James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk,
	tj-DgEjT+Ai2ygdnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w

Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

> On 01/24/2012 04:04 AM, Eric W. Biederman wrote:
>> My first impression is that this looks like a hack to avoid finishing
>> the user namespace.
>
> See my reply to Al. So again, to avoid steering the discussions to details I
> myself don't consider central (since this is a first post anyway), let's focus
> on the /proc container case. It is a privileged user as far as the container
> goes, and we'd like to allow it to mount filesystems. But disallowing it to
> mount /proc, can guarantee that the user will be provided with a version of
> /proc that is safe, and that he can't escape this.

The key thing is that to the rest of the system you want this user to
look like an unprivileged user.  Aka user namespace.

> Ideally, userspace wouldn't even get involved with this, and a process mounting
> /proc would see the right things, depending on where it came from. But turns out
> that the cgroups-controlled resources are a lot harder than the
> namespaces-controlled resources for this.

There are a couple of sides to this.

If you trust the root user in your container all you have to say is:
"Don't do that then."

There are things like /proc/cpuinfo that a lot of processes use to
figure out how many threads are wise to use.  That is a problem that
deserves a proper solution not a hack.
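
As an illustration of the /proc/cpuinfo point, here is a minimal Python
sketch of what such a process typically does. The helper name
count_processors and the sample text are invented for this example; the
assumption is the common layout in which each logical CPU has a line
starting with "processor". Nothing in this parsing knows about cgroup
limits, so a container reading the host's file would size its thread pool
for the whole machine.

```python
def count_processors(cpuinfo_text: str) -> int:
    """Count logical CPUs by counting 'processor' entries in cpuinfo text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, _value = line.partition(":")
        if sep and key.strip() == "processor":
            count += 1
    return count


# Invented sample in the usual x86 layout; not taken from a real machine.
SAMPLE = (
    "processor\t: 0\n"
    "model name\t: Example CPU\n"
    "\n"
    "processor\t: 1\n"
    "model name\t: Example CPU\n"
)

threads = count_processors(SAMPLE)
print("would start", threads, "worker threads")
```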

There are the global tunables under /proc like
/proc/sys/kernel/panic_on_oops that you don't want people touching.

There are potential security issues with people mounting block devices
when they can control the filesystem data before mounting the
filesystem.  That mostly deserves fixing the filesystems but in the
unprivileged mount context that probably deserves a whitelist.

Then there are problems with mounting cgroup filesystems inside of a
container, and wondering why they don't work.  That is a design
limitation in the cgroup filesystem and code that needs to be fixed.

Is there a case you are worried about that I have not covered?

Eric

* Re: [RFC 0/4] per-namespace allowed filesystems list
  2012-01-24 11:17         ` Eric W. Biederman
@ 2012-01-24 11:24           ` Glauber Costa
  0 siblings, 0 replies; 16+ messages in thread
From: Glauber Costa @ 2012-01-24 11:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: cgroups, linux-fsdevel, serge, daniel.lezcano, pjt, mzxreary,
	xemul, James.Bottomley, tj, eric.dumazet

On 01/24/2012 03:17 PM, Eric W. Biederman wrote:
> Glauber Costa<glommer@parallels.com>  writes:
>
>> On 01/24/2012 04:04 AM, Eric W. Biederman wrote:
>>> My first impression is that this looks like a hack to avoid finishing
>>> the user namespace.
>>
>> See my reply to Al. So again, to avoid steering the discussions to details I
>> myself don't consider central (since this is a first post anyway), let's focus
>> on the /proc container case. It is a privileged user as far as the container
>> goes, and we'd like to allow it to mount filesystems. But disallowing it to
>> mount /proc, can guarantee that the user will be provided with a version of
>> /proc that is safe, and that he can't escape this.
>
> The key things are that to the rest of the system you want this user to
> look like an unprivileged user.  Aka user namespace.
>
>> Ideally, userspace wouldn't even get involved with this, and a process mounting
>> /proc would see the right things, depending on where it came from. But turns out
>> that the cgroups-controlled resources are a lot harder than the
>> namespaces-controlled resources for this.
>
> There are a couple of sides to this.
>
> If you trust the root user in your container all you have to say is:
> "Don't do that then."

Of course he may not obey, and then mess with the *other* containers
in the system. (If he messes with himself, I don't care.) Note that in
this context, "messing" can be as simple as figuring out information
that you'd not like the container to see.

> There are things like /proc/cpuinfo that a lot of processes use to
> figure out how many threads are wise to use.  That is a problem that
> deserves a proper solution not a hack.

Agreed. This can be either in the kernel or in userspace. If it is in 
userspace, maybe we'd like to guarantee that this view will be 
consistent, and not replaced by the systemwide version.

> There are the global tunables under /proc like
> /proc/sys/kernel/panic_on_oops that you don't want people touching.
>
> There are potential security issues with people mounting block devices
> when they can control the filesystem data before mounting the
> filesystem.  That mostly deserves fixing the filesystems but in the
> unprivileged mount context that probably deserves a whitelist.
>
> Then the are problems with mounting cgroup filesystems inside of a
> container, and wondering why they don't work.  That is a design
> limitation in the cgroup filesystem and code that needs to be fixed.
>
> Is there a case you are worried about that I have not covered?
>

The ones I've listed here in this mail, mostly. I am now wondering if
Kirill has any around debugfs?
