* [RFC 0/4] per-namespace allowed filesystems list
From: Glauber Costa @ 2012-01-23 16:56 UTC
To: cgroups
Cc: linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt, mzxreary, xemul, James.Bottomley, tj, eric.dumazet

This patch creates a list of allowed filesystems per-namespace.
The goal is to prevent users inside a container, even root,
to mount filesystems that are not allowed by the main box admin.

My main two motivators to pursue this are:
1) We want to prevent a certain tailored view of some virtual
   filesystems, for example, by bind-mounting files with userspace
   generated data into /proc. The ability of mounting /proc inside
   the container works against this effort, while disallowing it
   via capabilities would have the effect of disallowing other
   mounts as well.

2) Some filesystems are known not to behave well under a container
   environment. They require changes to work in a safe-way. We can
   whitelist only the filesystems we want.

This works as a whitelist. Only filesystems in the list are allowed
to be mounted. Doing a blacklist would create problems when, say,
a module is loaded. The whitelist is only checked if it is enabled first.
So any setup that was already working, will keep working. And whoever
is not interested in limiting filesystem mount, does not need
to bother about it.

Please let me know what you guys think about it.

Glauber Costa (4):
  move /proc/filesystems inside /proc/self
  per-namespace allowed filesystems list
  show only allowed filesystems in /proc/filesystems
  fslist netlink interface

 fs/Kconfig                     |    9 +++
 fs/Makefile                    |    1 +
 fs/filesystems.c               |  108 ++++++++++++++++++++++++------
 fs/fsnetlink.c                 |  145 ++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c                 |    5 +-
 fs/proc/base.c                 |   64 +++++++++++++++---
 fs/proc/root.c                 |    1 +
 include/linux/fs.h             |   11 +++
 include/linux/fslist_netlink.h |   35 ++++++++++
 include/linux/mnt_namespace.h  |   20 ++++++
 10 files changed, 368 insertions(+), 31 deletions(-)
 create mode 100644 fs/fsnetlink.c
 create mode 100644 include/linux/fslist_netlink.h

-- 
1.7.7.4

* [RFC 2/4] per-namespace allowed filesystems list
From: Glauber Costa @ 2012-01-23 16:56 UTC
To: cgroups
Cc: linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt, mzxreary, xemul, James.Bottomley, tj, eric.dumazet, Glauber Costa

This patch creates a list of allowed filesystems per-namespace.
The goal is to prevent users inside a container, even root,
to mount filesystems that are not allowed by the main box admin.

My main two motivators to pursue this are:
1) We want to prevent a certain tailored view of some virtual
   filesystems, for example, by bind-mounting files with userspace
   generated data into /proc. The ability of mounting /proc inside
   the container works against this effort, while disallowing it
   via capabilities would have the effect of disallowing other
   mounts as well.

2) Some filesystems are known not to behave well under a container
   environment. They require changes to work in a safe-way. We can
   whitelist only the filesystems we want.

This works as a whitelist. Only filesystems in the list are allowed
to be mounted. Doing a blacklist would create problems when, say,
a module is loaded. The whitelist is only checked if it is enabled first.
So any setup that was already working, will keep working. And whoever
is not interested in limiting filesystem mount, does not need
to bother about it.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/filesystems.c              |   83 +++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c                |    5 ++-
 include/linux/fs.h            |    9 ++++
 include/linux/mnt_namespace.h |   20 ++++++++++
 4 files changed, 116 insertions(+), 1 deletions(-)

diff --git a/fs/filesystems.c b/fs/filesystems.c
index 458d120..118d0d6 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/slab.h>
+#include <linux/mnt_namespace.h>
 #include <asm/uaccess.h>
 
 /*
@@ -218,6 +219,26 @@ int __init get_filesystem_list(char *buf)
 	return len;
 }
 
+static bool fs_allowed(struct file_system_type *fs, struct mnt_namespace *mnt)
+{
+	struct fs_allowed *p;
+	bool ret = true;
+
+	if (!fslist_is_enabled(mnt))
+		goto out;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(p, &mnt->fs_allowed, list)
+		if (p->fstype == fs)
+			goto out_rcu;
+
+	ret = false;
+out_rcu:
+	rcu_read_unlock();
+out:
+	return ret;
+}
+
 #ifdef CONFIG_PROC_FS
 int filesystems_proc_show(struct seq_file *m, void *v)
 {
@@ -265,4 +286,66 @@ struct file_system_type *get_fs_type(const char *name)
 	return fs;
 }
 
+void destroy_filesystems_list(struct mnt_namespace *mnt)
+{
+	struct fs_allowed *fs;
+
+	WARN_ON(!mnt);
+
+	if (!fslist_is_enabled(mnt))
+		return;
+	mutex_lock(&mnt->fs_list_mutex);
+	synchronize_rcu();
+
+	list_for_each_entry(fs, &mnt->fs_allowed, list) {
+		list_del(&fs->list);
+		kfree(fs);
+	}
+	mutex_unlock(&mnt->fs_list_mutex);
+}
+
+void enable_filesystems_list(struct mnt_namespace *mnt)
+{
+	mnt->fs_list_enabled = true;
+}
+
+int add_filesystem_list(const char *name, struct mnt_namespace *mnt)
+{
+	struct file_system_type **fstype;
+	struct fs_allowed *fs;
+
+	if (!fslist_is_enabled(mnt))
+		return -EINVAL;
+
+	fstype = find_filesystem(name, strlen(name));
+	if (!fstype)
+		return -EINVAL;
+
+	if (fs_allowed(*fstype, mnt))
+		return 0;
+
+	fs = kmalloc(sizeof(*fs), GFP_KERNEL);
+	if (!fs)
+		return -ENOMEM;
+
+	fs->fstype = *fstype;
+
+	mutex_lock(&mnt->fs_list_mutex);
+	list_add_rcu(&fs->list, &mnt->fs_allowed);
+	mutex_unlock(&mnt->fs_list_mutex);
+
+	return 0;
+}
+
+struct file_system_type *get_fs_type_ns(const char *name,
+					struct mnt_namespace *mnt)
+{
+	struct file_system_type *fs = get_fs_type(name);
+
+	if (fs && mnt && !fs_allowed(fs, mnt)) {
+		put_filesystem(fs);
+		fs = NULL;
+	}
+	return fs;
+}
 EXPORT_SYMBOL(get_fs_type);
diff --git a/fs/namespace.c b/fs/namespace.c
index cfc6d44..e897985 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1958,7 +1958,8 @@ static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
 struct vfsmount *
 do_kern_mount(const char *fstype, int flags, const char *name, void *data)
 {
-	struct file_system_type *type = get_fs_type(fstype);
+	struct file_system_type *type = get_fs_type_ns(fstype,
+						current->nsproxy->mnt_ns);
 	struct vfsmount *mnt;
 	if (!type)
 		return ERR_PTR(-ENODEV);
@@ -2365,6 +2366,7 @@ static struct mnt_namespace *alloc_mnt_ns(void)
 	INIT_LIST_HEAD(&new_ns->list);
 	init_waitqueue_head(&new_ns->poll);
 	new_ns->event = 0;
+	init_fslist(new_ns);
 	return new_ns;
 }
 
@@ -2745,6 +2747,7 @@ void put_mnt_ns(struct mnt_namespace *ns)
 	br_write_unlock(vfsmount_lock);
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
+	destroy_filesystems_list(ns);
 	kfree(ns);
 }
 EXPORT_SYMBOL(put_mnt_ns);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3286d74..ab3633a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2531,6 +2531,15 @@ extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
 extern int filesystems_proc_show(struct seq_file *m, void *v);
 
+
+struct mnt_namespace;
+extern struct file_system_type *get_fs_type_ns(const char *name,
+					struct mnt_namespace *mnt);
+extern void enable_filesystems_list(struct mnt_namespace *ns);
+extern void destroy_filesystems_list(struct mnt_namespace *ns);
+extern int add_filesystem_list(const char *name, struct mnt_namespace *ns);
+extern int del_filesystem_list(char *name, struct mnt_namespace *ns);
+
 extern struct super_block *get_super(struct block_device *);
 extern struct super_block *get_active_super(struct block_device *bdev);
 extern struct super_block *user_get_super(dev_t);
diff --git a/include/linux/mnt_namespace.h b/include/linux/mnt_namespace.h
index 2930485..4138fb4 100644
--- a/include/linux/mnt_namespace.h
+++ b/include/linux/mnt_namespace.h
@@ -6,12 +6,20 @@
 #include <linux/seq_file.h>
 #include <linux/wait.h>
 
+struct fs_allowed {
+	struct list_head list;
+	struct file_system_type *fstype;
+};
+
 struct mnt_namespace {
 	atomic_t		count;
 	struct vfsmount *	root;
 	struct list_head	list;
 	wait_queue_head_t poll;
 	int event;
+	struct list_head fs_allowed;
+	struct mutex fs_list_mutex;
+	bool fs_list_enabled;
 };
 
 struct proc_mounts {
@@ -22,6 +30,18 @@ struct proc_mounts {
 
 struct fs_struct;
 
+static inline bool fslist_is_enabled(struct mnt_namespace *mnt)
+{
+	return mnt->fs_list_enabled;
+}
+
+static inline void init_fslist(struct mnt_namespace *ns)
+{
+	ns->fs_list_enabled = false;
+	INIT_LIST_HEAD(&ns->fs_allowed);
+	mutex_init(&ns->fs_list_mutex);
+}
+
 extern struct mnt_namespace *create_mnt_ns(struct vfsmount *mnt);
 extern struct mnt_namespace *copy_mnt_ns(unsigned long, struct mnt_namespace *,
 		struct fs_struct *);
-- 
1.7.7.4

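The whitelist semantics of this patch can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: struct fs_ns, struct allowed_fs and the fs_ns_* helpers are invented names, filesystem types are compared by name instead of by struct file_system_type pointer, and there is no locking or RCU. It also shows one point a reviewer might raise about destroy_filesystems_list(): list_for_each_entry() reads the current node's next pointer after the loop body has run, so freeing the node inside the body is normally done with the list_for_each_entry_safe() variant, which saves the successor first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model; none of these names come from the patch. */
struct allowed_fs {
	struct allowed_fs *next;
	char name[32];
};

struct fs_ns {
	bool enabled;             /* mirrors fs_list_enabled */
	struct allowed_fs *head;  /* mirrors the fs_allowed list */
};

/* With the list disabled every type is allowed, as in fs_allowed(). */
static bool fs_ns_allowed(const struct fs_ns *ns, const char *name)
{
	const struct allowed_fs *p;

	if (!ns->enabled)
		return true;
	for (p = ns->head; p; p = p->next)
		if (strcmp(p->name, name) == 0)
			return true;
	return false;
}

/* Returns 0 on success or if the entry already exists, -1 on error. */
static int fs_ns_add(struct fs_ns *ns, const char *name)
{
	struct allowed_fs *e;

	if (!ns->enabled || strlen(name) >= sizeof(e->name))
		return -1;
	if (fs_ns_allowed(ns, name))
		return 0;
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	strcpy(e->name, name);
	e->next = ns->head;
	ns->head = e;
	return 0;
}

/* Save the successor before freeing: the "_safe" iteration pattern. */
static void fs_ns_destroy(struct fs_ns *ns)
{
	struct allowed_fs *p, *next;

	assert(ns != NULL);  /* stands in for WARN_ON(!mnt) */
	for (p = ns->head; p; p = next) {
		next = p->next;
		free(p);
	}
	ns->head = NULL;
}
```

In this model a namespace that never enables its list behaves exactly as before, which is the backwards-compatibility property the commit message describes.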
* [RFC 3/4] show only allowed filesystems in /proc/filesystems
From: Glauber Costa @ 2012-01-23 16:56 UTC
To: cgroups
Cc: linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt, mzxreary, xemul, James.Bottomley, tj, eric.dumazet, Glauber Costa

Now that a namespace can have a different than default list
of filesystems, only show the allowed ones in /proc/filesystems.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/filesystems.c |    4 +++-
 fs/proc/base.c   |   51 +++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/fs/filesystems.c b/fs/filesystems.c
index 118d0d6..b797cda 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -243,11 +243,13 @@ out:
 int filesystems_proc_show(struct seq_file *m, void *v)
 {
 	struct file_system_type * tmp;
+	struct mnt_namespace *ns = m->private;
 
 	read_lock(&file_systems_lock);
 	tmp = file_systems;
 	while (tmp) {
-		seq_printf(m, "%s\t%s\n",
+		if (fs_allowed(tmp, ns))
+			seq_printf(m, "%s\t%s\n",
 			(tmp->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev",
 			tmp->name);
 		tmp = tmp->next;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 2a6e2c7..2a88a47 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -627,6 +627,44 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
 	return 0;
 }
 
+struct mnt_namespace *mnt_ns_from_task(struct task_struct *task)
+{
+	struct nsproxy *nsp;
+	struct mnt_namespace *ns = NULL;
+
+
+	rcu_read_lock();
+	nsp = task_nsproxy(task);
+	if (nsp) {
+		ns = nsp->mnt_ns;
+		if (ns)
+			get_mnt_ns(ns);
+	}
+	rcu_read_unlock();
+	return ns;
+}
+
+struct mnt_namespace *mnt_ns_from_inode(struct inode *inode)
+{
+	struct task_struct *task = get_proc_task(inode);
+	struct path root;
+	struct mnt_namespace *ns = NULL;
+
+	if (!task)
+		return NULL;
+
+	ns = mnt_ns_from_task(task);
+
+	if (ns && get_task_root(task, &root) != 0) {
+		put_mnt_ns(ns);
+		ns = NULL;
+	}
+
+	path_put(&root);
+	put_task_struct(task);
+	return ns;
+}
+
 static const struct inode_operations proc_def_inode_operations = {
 	.setattr	= proc_setattr,
 };
@@ -635,21 +673,13 @@ static int mounts_open_common(struct inode *inode, struct file *file,
 			      const struct seq_operations *op)
 {
 	struct task_struct *task = get_proc_task(inode);
-	struct nsproxy *nsp;
 	struct mnt_namespace *ns = NULL;
 	struct path root;
 	struct proc_mounts *p;
 	int ret = -EINVAL;
 
 	if (task) {
-		rcu_read_lock();
-		nsp = task_nsproxy(task);
-		if (nsp) {
-			ns = nsp->mnt_ns;
-			if (ns)
-				get_mnt_ns(ns);
-		}
-		rcu_read_unlock();
+		ns = mnt_ns_from_task(task);
 		if (ns && get_task_root(task, &root) == 0)
 			ret = 0;
 		put_task_struct(task);
@@ -2875,7 +2905,8 @@ static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *de
 
 static int filesystems_proc_open(struct inode *inode, struct file *file)
 {
-	return single_open(file, filesystems_proc_show, NULL);
+	struct mnt_namespace *ns = mnt_ns_from_inode(inode);
+	return single_open(file, filesystems_proc_show, ns);
 }
 
 static const struct file_operations filesystems_proc_fops = {
-- 
1.7.7.4

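The filtering that filesystems_proc_show() gains here can be sketched in userspace as follows. This is a model, not the kernel code: struct fs_type, is_allowed() and show_filesystems() are hypothetical names, a NULL allow list stands in for "list not enabled", and output goes to a caller-supplied buffer instead of a seq_file. One thing a reviewer might also check in the patch itself: mnt_ns_from_task() takes a reference with get_mnt_ns(), the fops still use single_release(), and mnt_ns_from_inode() can return NULL, so it is not obvious from the diff where that reference is dropped or how a NULL namespace is handled by fs_allowed().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace model of a registered filesystem type. */
struct fs_type {
	const char *name;
	bool requires_dev;  /* stands in for FS_REQUIRES_DEV */
};

/* A NULL allow list models "list not enabled": everything is shown. */
static bool is_allowed(const char *name, const char *const *allow, size_t n)
{
	size_t i;

	if (!allow)
		return true;
	for (i = 0; i < n; i++)
		if (strcmp(allow[i], name) == 0)
			return true;
	return false;
}

/*
 * Write one "<nodev-or-empty>\t<name>\n" line per allowed type into buf.
 * Returns the number of bytes written, or -1 if buf is too small.
 */
static int show_filesystems(char *buf, size_t size,
			    const struct fs_type *types, size_t ntypes,
			    const char *const *allow, size_t nallow)
{
	size_t off = 0, i;

	assert(buf != NULL && size > 0);
	buf[0] = '\0';
	for (i = 0; i < ntypes; i++) {
		int len;

		if (!is_allowed(types[i].name, allow, nallow))
			continue;
		len = snprintf(buf + off, size - off, "%s\t%s\n",
			       types[i].requires_dev ? "" : "nodev",
			       types[i].name);
		if (len < 0 || (size_t)len >= size - off)
			return -1;
		off += (size_t)len;
	}
	return (int)off;
}
```

The line format matches the "%s\t%s\n" string used by seq_printf() in the patch, so a restricted namespace simply sees a shorter list.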
* [RFC 1/4] move /proc/filesystems inside /proc/self
From: Glauber Costa @ 2012-01-23 16:56 UTC
To: cgroups
Cc: linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt, mzxreary, xemul, James.Bottomley, tj, eric.dumazet, Glauber Costa

This simple patch is a preparation for what is to come.
It moves the list of available filesystems inside /proc/self/,
linking /proc/filesystems to it. It effectively means that
each process may have a different view of which filesystems
are available in the system, depending on the namespace it lives.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/filesystems.c   |   21 +--------------------
 fs/proc/base.c     |   15 +++++++++++++++
 fs/proc/root.c     |    1 +
 include/linux/fs.h |    2 ++
 4 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/fs/filesystems.c b/fs/filesystems.c
index 0845f84..458d120 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -219,7 +219,7 @@ int __init get_filesystem_list(char *buf)
 }
 
 #ifdef CONFIG_PROC_FS
-static int filesystems_proc_show(struct seq_file *m, void *v)
+int filesystems_proc_show(struct seq_file *m, void *v)
 {
 	struct file_system_type * tmp;
 
@@ -234,25 +234,6 @@ static int filesystems_proc_show(struct seq_file *m, void *v)
 	read_unlock(&file_systems_lock);
 	return 0;
 }
-
-static int filesystems_proc_open(struct inode *inode, struct file *file)
-{
-	return single_open(file, filesystems_proc_show, NULL);
-}
-
-static const struct file_operations filesystems_proc_fops = {
-	.open		= filesystems_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
-
-static int __init proc_filesystems_init(void)
-{
-	proc_create("filesystems", 0, NULL, &filesystems_proc_fops);
-	return 0;
-}
-module_init(proc_filesystems_init);
 #endif
 
 static struct file_system_type *__get_fs_type(const char *name, int len)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 851ba3d..2a6e2c7 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2768,6 +2768,7 @@ static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
  */
 static const struct file_operations proc_task_operations;
 static const struct inode_operations proc_task_inode_operations;
+static const struct file_operations filesystems_proc_fops;
 
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
@@ -2851,6 +2852,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_HARDWALL
 	INF("hardwall",   S_IRUGO, proc_pid_hardwall),
 #endif
+	REG("filesystems", S_IRUGO, filesystems_proc_fops),
 };
 
 static int proc_tgid_base_readdir(struct file * filp,
@@ -2871,6 +2873,18 @@ static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *de
 				  tgid_base_stuff, ARRAY_SIZE(tgid_base_stuff));
 }
 
+static int filesystems_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, filesystems_proc_show, NULL);
+}
+
+static const struct file_operations filesystems_proc_fops = {
+	.open		= filesystems_proc_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static const struct inode_operations proc_tgid_base_inode_operations = {
 	.lookup		= proc_tgid_base_lookup,
 	.getattr	= pid_getattr,
@@ -3193,6 +3207,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_HARDWALL
 	INF("hardwall",  S_IRUGO, proc_pid_hardwall),
 #endif
+	REG("filesystems", S_IRUGO, filesystems_proc_fops),
 };
 
 static int proc_tid_base_readdir(struct file * filp,
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 03102d9..87bf2e3 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -104,6 +104,7 @@ void __init proc_root_init(void)
 	}
 
 	proc_symlink("mounts", NULL, "self/mounts");
+	proc_symlink("filesystems", NULL, "self/filesystems");
 
 	proc_net_init();
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e0bc4ff..3286d74 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2529,6 +2529,8 @@ extern int generic_block_fiemap(struct inode *inode,
 extern void get_filesystem(struct file_system_type *fs);
 extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
+extern int filesystems_proc_show(struct seq_file *m, void *v);
+
 extern struct super_block *get_super(struct block_device *);
 extern struct super_block *get_active_super(struct block_device *bdev);
 extern struct super_block *user_get_super(dev_t);
-- 
1.7.7.4

* [RFC 4/4] fslist netlink interface
From: Glauber Costa @ 2012-01-23 16:56 UTC
To: cgroups
Cc: linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt, mzxreary, xemul, James.Bottomley, tj, eric.dumazet, Glauber Costa

This patch provides a netlink interface for the namespace
filesystem list. It is quite simple, and I need at least one
more operation (query).

Also, keep in mind that although I wrote it believing it is a nice
interface to manipulate the list, I don't feel strongly about the
interface per-se. So feel free to suggest something better.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/Kconfig                     |    9 +++
 fs/Makefile                    |    1 +
 fs/fsnetlink.c                 |  145 ++++++++++++++++++++++++++++++++++++++++
 include/linux/fslist_netlink.h |   35 ++++++++++
 4 files changed, 190 insertions(+), 0 deletions(-)
 create mode 100644 fs/fsnetlink.c
 create mode 100644 include/linux/fslist_netlink.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 440d189..842dcc4 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -78,6 +78,15 @@ config GENERIC_ACL
 	bool
 	select FS_POSIX_ACL
 
+config FSLIST_NETLINK
+	bool "Filesystem Lists Netlink"
+	help
+	  This option allows userspace to select a sublist of the available
+	  filesystems that are mountable by a particular namespace. It provides
+	  a netlink interface through which one can manage such a set.
+
+
+
 menu "Caches"
 
 source "fs/fscache/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 57d446d..675f613 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_FS_MBCACHE)	+= mbcache.o
 obj-$(CONFIG_FS_POSIX_ACL)	+= posix_acl.o xattr_acl.o
 obj-$(CONFIG_NFS_COMMON)	+= nfs_common/
 obj-$(CONFIG_GENERIC_ACL)	+= generic_acl.o
+obj-$(CONFIG_FSLIST_NETLINK)	+= fsnetlink.o
 
 obj-$(CONFIG_FHANDLE)		+= fhandle.o
 
diff --git a/fs/fsnetlink.c b/fs/fsnetlink.c
new file mode 100644
index 0000000..619fad1
--- /dev/null
+++ b/fs/fsnetlink.c
@@ -0,0 +1,145 @@
+#include <net/genetlink.h>
+#include <linux/fslist_netlink.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/nsproxy.h>
+#include <linux/mnt_namespace.h>
+#include <linux/fs_struct.h>
+#include <linux/dcache.h>
+
+static struct nla_policy fslist_genl_policy[FSLIST_A_MAX + 1] = {
+	[FSLIST_A_OP]		= { .type = NLA_U32 },
+	[FSLIST_A_OP_ARG]	= { .type = NLA_STRING, .len = 200 },
+};
+
+static struct genl_family fslist_gnl_family = {
+	.id		= GENL_ID_GENERATE,
+	.name		= FSLIST_GENL_NAME,
+	.version	= FSLIST_GENL_VERSION,
+	.maxattr	= FSLIST_A_MAX,
+};
+
+static int msg_reply(int ret, struct genl_info *info)
+{
+	struct sk_buff *skb;
+	void *reply;
+	size_t size;
+	struct genlmsghdr *genlhdr;
+	void *data;
+
+	size = nla_total_size(1) +
+		nla_total_size(0);
+
+	skb = genlmsg_new(size, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	reply = genlmsg_put_reply(skb, info, &fslist_gnl_family, 0, FSLIST_CMD_REPLY);
+	if (reply == NULL) {
+		nlmsg_free(skb);
+		return -EINVAL;
+	}
+
+	nla_put_u32(skb, FSLIST_A_OP, ret);
+
+	genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+	data = genlmsg_data(genlhdr);
+
+	ret = genlmsg_end(skb, data);
+	if (ret < 0) {
+		nlmsg_free(skb);
+		return ret;
+	}
+
+	return genlmsg_reply(skb, info);
+}
+
+
+static int cmd_ask(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr *na;
+	int op = 0;
+	char *data = NULL;
+	int ret = 0;
+	struct mnt_namespace *mnt = current->nsproxy->mnt_ns;
+	struct dentry *curr_root;
+	int msg_ret = 0;
+
+	/*
+	 * Once a process is contained by a chroot environment,
+	 * we don't allow the list to grow further, or be by
+	 * any means modified.
+	 *
+	 * It should still work after pivot_root, though.
+	 */
+	curr_root = current->fs->root.dentry;
+	if (curr_root->d_parent != curr_root)
+		return -EINVAL;
+
+	na = info->attrs[FSLIST_A_OP];
+	if (na)
+		op = nla_get_u32(na);
+
+	na = info->attrs[FSLIST_A_OP_ARG];
+	if (na) {
+		int len = nla_len(na);
+
+		data = kmalloc(len, GFP_KERNEL);
+		if (!data)
+			return -ENOMEM;
+
+		nla_strlcpy(data, na, len);
+	}
+
+	switch (op) {
+	case FSLIST_OP_RESET:
+		enable_filesystems_list(mnt);
+		break;
+	case FSLIST_OP_ADD:
+		if (!data) {
+			ret = -EINVAL;
+			break;
+		}
+		msg_ret = add_filesystem_list(data, mnt);
+		break;
+	case FSLIST_OP_QUERY:
+		ret = -ENOSYS;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	kfree(data);
+	if (!ret)
+		msg_reply(msg_ret, info);
+	return ret;
+}
+
+static struct genl_ops fslist_nl_ops = {
+	.cmd		= FSLIST_CMD_ASK,
+	.doit		= cmd_ask,
+	.policy		= fslist_genl_policy,
+};
+
+int __init fslist_netlink_init(void)
+{
+	int ret;
+	ret = genl_register_family(&fslist_gnl_family);
+	if (ret)
+		return ret;
+
+	ret = genl_register_ops(&fslist_gnl_family, &fslist_nl_ops);
+	if (ret)
+		goto fail_unregister;
+
+	return 0;
+
+fail_unregister:
+	genl_unregister_family(&fslist_gnl_family);
+	return ret;
+}
+
+fs_initcall(fslist_netlink_init);
+
diff --git a/include/linux/fslist_netlink.h b/include/linux/fslist_netlink.h
new file mode 100644
index 0000000..926760d
--- /dev/null
+++ b/include/linux/fslist_netlink.h
@@ -0,0 +1,35 @@
+#ifndef _FSLIST_NETLINK_H
+#define _FSLIST_NETLINK_H
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#endif
+
+enum {
+	FSLIST_A_UNSPEC,
+	FSLIST_A_OP,
+	FSLIST_A_OP_ARG,
+	__FSLIST_A_MAX,
+};
+
+#define FSLIST_A_MAX (__FSLIST_A_MAX - 1)
+
+enum {
+	FSLIST_CMD_UNSPEC = 0,
+	FSLIST_CMD_ASK,		/* user->kernel */
+	FSLIST_CMD_REPLY,	/* kernel->user */
+	__FSLIST_CMD_MAX,
+};
+
+#define FSLIST_CMD_MAX (__FSLIST_CMD_MAX - 1)
+
+#define FSLIST_GENL_VERSION 0x1
+#define FSLIST_GENL_NAME "FSLIST"
+
+enum {
+	FSLIST_OP_RESET = 1,
+	FSLIST_OP_ADD,
+	FSLIST_OP_QUERY,
+};
+
+#endif /* _FSLIST_NETLINK_H */
-- 
1.7.7.4

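A possible review note on cmd_ask(): the buffer is sized with nla_len() and the same length is passed to nla_strlcpy(). Assuming nla_strlcpy() follows the usual strlcpy convention (copy at most dstsize - 1 bytes, always terminate), a filesystem name sent without a trailing NUL would lose its last character. The userspace sketch below models that convention; copy_attr_string() and dup_attr_string() are hypothetical helpers written for this illustration, not kernel API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical model of a strlcpy-style attribute copy: src holds srclen
 * payload bytes that may or may not end in NUL. At most dstsize - 1 bytes
 * are copied and dst is always terminated. Returns the payload length
 * without any trailing NUL.
 */
static size_t copy_attr_string(char *dst, const char *src, size_t srclen,
			       size_t dstsize)
{
	assert(src != NULL);
	if (srclen > 0 && src[srclen - 1] == '\0')
		srclen--;
	if (dstsize > 0) {
		size_t len = srclen >= dstsize ? dstsize - 1 : srclen;

		memset(dst, 0, dstsize);
		memcpy(dst, src, len);
	}
	return srclen;
}

/* Allocate payload length + 1 so an unterminated payload still fits. */
static char *dup_attr_string(const char *src, size_t srclen)
{
	char *dst = malloc(srclen + 1);

	if (!dst)
		return NULL;
	copy_attr_string(dst, src, srclen, srclen + 1);
	return dst;
}
```

Under this model, copying the four bytes "ext4" into a four-byte buffer yields "ext", while the len + 1 allocation keeps the whole name whether or not the sender included the terminator.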
* Re: [RFC 0/4] per-namespace allowed filesystems list
From: Eric W. Biederman @ 2012-01-23 19:20 UTC
To: Glauber Costa
Cc: cgroups, linux-fsdevel, serge, daniel.lezcano, pjt, mzxreary, xemul, James.Bottomley, tj, eric.dumazet

Glauber Costa <glommer@parallels.com> writes:

> This patch creates a list of allowed filesystems per-namespace.
> The goal is to prevent users inside a container, even root,
> to mount filesystems that are not allowed by the main box admin.
>
> My main two motivators to pursue this are:
> 1) We want to prevent a certain tailored view of some virtual
>    filesystems, for example, by bind-mounting files with userspace
>    generated data into /proc. The ability of mounting /proc inside
>    the container works against this effort, while disallowing it
>    via capabilities would have the effect of disallowing other
>    mounts as well.
>
> 2) Some filesystems are known not to behave well under a container
>    environment. They require changes to work in a safe-way. We can
>    whitelist only the filesystems we want.
>
> This works as a whitelist. Only filesystems in the list are allowed
> to be mounted. Doing a blacklist would create problems when, say,
> a module is loaded. The whitelist is only checked if it is enabled first.
> So any setup that was already working, will keep working. And whoever
> is not interested in limiting filesystem mount, does not need
> to bother about it.

My first impression is that this looks like a hack to avoid finishing
the user namespace.

Eric

* Re: [RFC 0/4] per-namespace allowed filesystems list
From: Al Viro @ 2012-01-23 21:12 UTC
To: Glauber Costa
Cc: cgroups, linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt, mzxreary, xemul, James.Bottomley, tj, eric.dumazet

On Mon, Jan 23, 2012 at 08:56:08PM +0400, Glauber Costa wrote:
> This patch creates a list of allowed filesystems per-namespace.
> The goal is to prevent users inside a container, even root,
> to mount filesystems that are not allowed by the main box admin.
>
> My main two motivators to pursue this are:
> 1) We want to prevent a certain tailored view of some virtual
>    filesystems, for example, by bind-mounting files with userspace
>    generated data into /proc. The ability of mounting /proc inside
>    the container works against this effort, while disallowing it
>    via capabilities would have the effect of disallowing other
>    mounts as well.

Translation, please.

> 2) Some filesystems are known not to behave well under a container
>    environment. They require changes to work in a safe-way. We can
>    whitelist only the filesystems we want.

So fix them.

> This works as a whitelist. Only filesystems in the list are allowed
> to be mounted. Doing a blacklist would create problems when, say,
> a module is loaded. The whitelist is only checked if it is enabled first.
> So any setup that was already working, will keep working. And whoever
> is not interested in limiting filesystem mount, does not need
> to bother about it.
>
> Please let me know what you guys think about it.

NAKed-by: Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
NAKed-because: too fucking ugly

This is bloody ridiculous; if you want to prevent a luser admin playing with
the set of mounts you've given it, the right way to go is not to mess with the
"which fs types are allowed" but to add a per-namespace "immutable" flag.
And add a new clone(2)/unshare(2) flag, used only along with the CLONE_NEWNS
and setting the "immutable" on the copied namespace.

* Re: [RFC 0/4] per-namespace allowed filesystems list
From: Kirill A. Shutemov @ 2012-01-23 23:04 UTC
To: Al Viro
Cc: Glauber Costa, cgroups, linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt, mzxreary, xemul, James.Bottomley, tj, eric.dumazet

On Mon, Jan 23, 2012 at 09:12:19PM +0000, Al Viro wrote:
> This is bloody ridiculous; if you want to prevent a luser admin playing with
> the set of mounts you've given it, the right way to go is not to mess with the
> "which fs types are allowed" but to add a per-namespace "immutable" flag.
> And add a new clone(2)/unshare(2) flag, used only along with the CLONE_NEWNS
> and setting the "immutable" on the copied namespace.

How will it work if we want to allow namespaced environment to mount block
devices, but not, let's say, debugfs?

Differentiation between filesystem type and source is one of broken things
in Unix API.

I don't see an easy way to fix it. Only plan9. :)

-- 
 Kirill A. Shutemov

* Re: [RFC 0/4] per-namespace allowed filesystems list
From: Al Viro @ 2012-01-23 23:12 UTC
To: Kirill A. Shutemov
Cc: Glauber Costa, cgroups, linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt, mzxreary, xemul, James.Bottomley, tj, eric.dumazet

On Tue, Jan 24, 2012 at 01:04:57AM +0200, Kirill A. Shutemov wrote:
> On Mon, Jan 23, 2012 at 09:12:19PM +0000, Al Viro wrote:
> > This is bloody ridiculous; if you want to prevent a luser admin playing with
> > the set of mounts you've given it, the right way to go is not to mess with the
> > "which fs types are allowed" but to add a per-namespace "immutable" flag.
> > And add a new clone(2)/unshare(2) flag, used only along with the CLONE_NEWNS
> > and setting the "immutable" on the copied namespace.
>
> How will it work if we want to allow namespaced environment to mount block
> devices, but not, let's say, debugfs?
>
> Differentiation between filesystem type and source is one of broken things
> in Unix API.

Translation, please?

> I don't see an easy way to fix it. Only plan9. :)

Huh?  Plan 9 does *not* contain anything of that kind.  And their '#<letter>'
convention for in-kernel filesystems is one of the uglier things about their
API, IMO...

* Re: [RFC 0/4] per-namespace allowed filesystems list 2012-01-23 23:12 ` Al Viro @ 2012-01-24 7:17 ` Kirill A. Shutemov 0 siblings, 0 replies; 16+ messages in thread From: Kirill A. Shutemov @ 2012-01-24 7:17 UTC (permalink / raw) To: Al Viro Cc: Glauber Costa, cgroups, linux-fsdevel, ebiederm, serge, daniel.lezcano, pjt, mzxreary, xemul, James.Bottomley, tj, eric.dumazet On Mon, Jan 23, 2012 at 11:12:39PM +0000, Al Viro wrote: > On Tue, Jan 24, 2012 at 01:04:57AM +0200, Kirill A. Shutemov wrote: > > On Mon, Jan 23, 2012 at 09:12:19PM +0000, Al Viro wrote: > > > This is bloody ridiculous; if you want to prevent a luser adming playing with > > > the set of mounts you've given it, the right way to go is not to mess with the > > > "which fs types are allowed" but to add a per-namespace "immutable" flag. > > > And add a new clone(2)/unshare(2) flag, used only along with the CLONE_NEWNS > > > and setting the "immutable" on the copied namespace. > > > > How will it work if we want to allow namespaced environment to mount block > > devices, but not, let say, debugfs? > > > > Differentiation between filesystem type and source is one of broken things > > in Unix API. > > Translation, please? mount(2) should take a file descriptor and mount it to a mountpoint. File descriptor provides known interface at that point (9p?). It doesn't matter here what's the actual type of filesystem or what's the source (no fake names for sources of pseudo-fs). > > I don't see an easy way to fix it. Only plan9. :) > > Huh? Plan 9 does *not* contain anything of that kind. And their '#<letter>' > convention for in-kernel filesystems is one of the uglier things about their > API, IMO... Yes, it's ugly. But in plan9 it can be fixed quite straight forward: - kernel provides a way to boot to an early userspace environment without access to any media (initrd); - kernel provides *one* '#<letter>' fs which contains handles to create any other in-kernel filesystems. 
This special fs can be mounted only once. After that you have (about) all resources as files and can construct any other environment using mount and bind. In fact, the list of available (pseudo-)filesystems is yet another resource namespace. With the approach above we can get rid of it in plan9. -- Kirill A. Shutemov
* Re: [RFC 0/4] per-namespace allowed filesystems list [not found] ` <20120123230457.GA14347-oKw7cIdHH8eLwutG50LtGA@public.gmane.org> 2012-01-23 23:12 ` Al Viro @ 2012-01-24 10:32 ` Glauber Costa 1 sibling, 0 replies; 16+ messages in thread From: Glauber Costa @ 2012-01-24 10:32 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Al Viro, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, ebiederm-aS9lmoZGLiVWk0Htik3J/w, serge-A9i7LUbDfNHQT0dZR+AlfA, daniel.lezcano-GANU6spQydw, pjt-hpIqsD4AKlfQT0dZR+AlfA, mzxreary-uLTowLwuiw4b1SvskN2V4Q, xemul-bzQdu9zFT3WakBO8gow8eQ, James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk, tj-DgEjT+Ai2ygdnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w On 01/24/2012 03:04 AM, Kirill A. Shutemov wrote: > On Mon, Jan 23, 2012 at 09:12:19PM +0000, Al Viro wrote: >> This is bloody ridiculous; if you want to prevent a luser adming playing with >> the set of mounts you've given it, the right way to go is not to mess with the >> "which fs types are allowed" but to add a per-namespace "immutable" flag. >> And add a new clone(2)/unshare(2) flag, used only along with the CLONE_NEWNS >> and setting the "immutable" on the copied namespace. > > How will it work if we want to allow namespaced environment to mount block > devices, but not, let say, debugfs? > For the record, that is more or less what I have in mind. But my main use case is /proc. I guess the case for debugfs is the same. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC 0/4] per-namespace allowed filesystems list [not found] ` <20120123211218.GF23916-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> 2012-01-23 23:04 ` Kirill A. Shutemov @ 2012-01-24 10:22 ` Glauber Costa 1 sibling, 0 replies; 16+ messages in thread From: Glauber Costa @ 2012-01-24 10:22 UTC (permalink / raw) To: Al Viro Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, ebiederm-aS9lmoZGLiVWk0Htik3J/w, serge-A9i7LUbDfNHQT0dZR+AlfA, daniel.lezcano-GANU6spQydw, pjt-hpIqsD4AKlfQT0dZR+AlfA, mzxreary-uLTowLwuiw4b1SvskN2V4Q, xemul-bzQdu9zFT3WakBO8gow8eQ, James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk, tj-DgEjT+Ai2ygdnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w On 01/24/2012 01:12 AM, Al Viro wrote: > On Mon, Jan 23, 2012 at 08:56:08PM +0400, Glauber Costa wrote: >> This patch creates a list of allowed filesystems per-namespace. >> The goal is to prevent users inside a container, even root, >> to mount filesystems that are not allowed by the main box admin. >> >> My main two motivators to pursue this are: >> 1) We want to prevent a certain tailored view of some virtual >> filesystems, for example, by bind-mounting files with userspace >> generated data into /proc. The ability of mounting /proc inside >> the container works against this effort, while disallowing it >> via capabilities would have the effect of disallowing other >> mounts as well. > > Translation, please. > >> 2) Some filesystems are known not to behave well under a container >> environment. They require changes to work in a safe-way. We can >> whitelist only the filesystems we want. > > So fix them. > >> This works as a whitelist. Only filesystems in the list are allowed >> to be mounted. Doing a blacklist would create problems when, say, >> a module is loaded. The whitelist is only checked if it is enabled first. >> So any setup that was already working, will keep working. 
And whoever >> is not interested in limiting filesystem mount, does not need >> to bother about it. >> >> Please let me know what you guys think about it. > > NAKed-by: Al Viro<viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> > NAKed-because: too fucking ugly > > This is bloody ridiculous; if you want to prevent a luser adming playing with > the set of mounts you've given it, the right way to go is not to mess with the > "which fs types are allowed" but to add a per-namespace "immutable" flag. > And add a new clone(2)/unshare(2) flag, used only along with the CLONE_NEWNS > and setting the "immutable" on the copied namespace.

Okay, now that I have laid out the problem, I am happy to pursue whatever solution we think is better. But let me develop it a bit more first. An immutable flag does not work, because I don't want to prevent a luser (loved that) from changing the mounts they are given. In general, it is perfectly fine for them to mount things inside the container as time goes on. Some mounts, though, I don't consider acceptable.

Let me elaborate on the /proc example I gave: much of the information living in /proc is really global rather than per-container. The parts pertaining to the pid namespace and other namespaces are already per-namespace, so they are fine. But there is more: some of the things /proc tracks, like cpu usage, memory, and the like, are resource-constrained by other entities, for instance, cgroups. In some cases, like /proc/stat, the information exists in cgroups, but comes from more than one cgroup. All of them are independent in nature, making it hard to come up with a coherent view. Furthermore, there is no connection between namespaces and cgroups, so it is not at all obvious (there were discussions before) which information the process should see. Unlike with namespaces, the mere fact that a process lives in a cgroup does not really mean it is isolated from the system in this sense.
One solution is to do it all in userspace, from outside the container, and bind mount the files into the container's /proc. But that only works if we can prevent the user from mounting the real /proc again somewhere else. Not because it would screw up his system, which I don't care about, but because it would give him information about the global state of the system. An immutable flag fixes this, but then it also prevents all further legitimate mounts.
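For readers following along, the userspace approach described in this message (bind mounting a generated file over an entry in the container's /proc) can be sketched roughly as below. This is only an illustration and not code from the patch series: the helper name `bind_over_proc` and any paths passed to it are hypothetical, and the only real interface used is mount(2) with `MS_BIND`.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: bind-mount a userspace-generated file over an
 * entry in the container's /proc. Returns 0 on success and -1 on
 * failure, in which case mount(2) has set errno.
 */
static int bind_over_proc(const char *generated, const char *proc_entry)
{
    assert(generated != NULL && proc_entry != NULL);

    /* With MS_BIND the filesystem type and data arguments are ignored. */
    if (mount(generated, proc_entry, NULL, MS_BIND, NULL) != 0) {
        fprintf(stderr, "bind %s -> %s failed: %s\n",
                generated, proc_entry, strerror(errno));
        return -1;
    }
    return 0;
}
```

The call needs mount privileges in the namespace where it runs. As the message points out, the bind mount alone does not help if the container can later mount a fresh procfs at another path.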
* Re: [RFC 0/4] per-namespace allowed filesystems list 2012-01-23 16:56 [RFC 0/4] per-namespace allowed filesystems list Glauber Costa ` (2 preceding siblings ...) [not found] ` <1327337772-1972-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-01-24 0:04 ` Eric W. Biederman [not found] ` <m1vco2m0eh.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org> 3 siblings, 1 reply; 16+ messages in thread From: Eric W. Biederman @ 2012-01-24 0:04 UTC (permalink / raw) To: Glauber Costa Cc: cgroups, linux-fsdevel, serge, daniel.lezcano, pjt, mzxreary, xemul, James.Bottomley, tj, eric.dumazet Glauber Costa <glommer@parallels.com> writes: > This patch creates a list of allowed filesystems per-namespace. > The goal is to prevent users inside a container, even root, > to mount filesystems that are not allowed by the main box admin. > > My main two motivators to pursue this are: > 1) We want to prevent a certain tailored view of some virtual > filesystems, for example, by bind-mounting files with userspace > generated data into /proc. The ability of mounting /proc inside > the container works against this effort, while disallowing it > via capabilities would have the effect of disallowing other > mounts as well. > > 2) Some filesystems are known not to behave well under a container > environment. They require changes to work in a safe-way. We can > whitelist only the filesystems we want. > > This works as a whitelist. Only filesystems in the list are allowed > to be mounted. Doing a blacklist would create problems when, say, > a module is loaded. The whitelist is only checked if it is enabled first. > So any setup that was already working, will keep working. And whoever > is not interested in limiting filesystem mount, does not need > to bother about it. My first impression is that this looks like a hack to avoid finishing the user namespace. This is a terrible way to go about implementing unprivileged mounts. 
If there are technical reasons why it is unsafe to mount certain filesystems, then we need to whitelist/blacklist filesystems in the kernel, where we can check things. Why in the world would anyone want the ability to not mount a specific filesystem type? Using netlink as an interface when you are talking filesystems to filesystem people is pretty horrid. Netlink is great for networking developers, since they get networking, but filesystem people understand filesystems, and you want to use netlink? Eric
* Re: [RFC 0/4] per-namespace allowed filesystems list [not found] ` <m1vco2m0eh.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org> @ 2012-01-24 10:31 ` Glauber Costa [not found] ` <4F1E886A.7000107-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 16+ messages in thread From: Glauber Costa @ 2012-01-24 10:31 UTC (permalink / raw) To: Eric W. Biederman Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, serge-A9i7LUbDfNHQT0dZR+AlfA, daniel.lezcano-GANU6spQydw, pjt-hpIqsD4AKlfQT0dZR+AlfA, mzxreary-uLTowLwuiw4b1SvskN2V4Q, xemul-bzQdu9zFT3WakBO8gow8eQ, James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk, tj-DgEjT+Ai2ygdnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w On 01/24/2012 04:04 AM, Eric W. Biederman wrote: > Glauber Costa<glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes: > >> This patch creates a list of allowed filesystems per-namespace. >> The goal is to prevent users inside a container, even root, >> to mount filesystems that are not allowed by the main box admin. >> >> My main two motivators to pursue this are: >> 1) We want to prevent a certain tailored view of some virtual >> filesystems, for example, by bind-mounting files with userspace >> generated data into /proc. The ability of mounting /proc inside >> the container works against this effort, while disallowing it >> via capabilities would have the effect of disallowing other >> mounts as well. >> >> 2) Some filesystems are known not to behave well under a container >> environment. They require changes to work in a safe-way. We can >> whitelist only the filesystems we want. >> >> This works as a whitelist. Only filesystems in the list are allowed >> to be mounted. Doing a blacklist would create problems when, say, >> a module is loaded. The whitelist is only checked if it is enabled first. >> So any setup that was already working, will keep working. And whoever >> is not interested in limiting filesystem mount, does not need >> to bother about it. 
> > My first impression is that this looks like a hack to avoid finishing > the user namespace. > > This is a terrible way to go about implementing unprivileged mounts. > > If there are technical reasons why it is unsafe to mount filesystems > that we need to whitelist/blacklist filesystems in the kernel where we > can check things. > > Why in the world would anyone want the ability to not mount a specific > filesystem type?

See my reply to Al. So again, to avoid steering the discussion toward details I myself don't consider central (since this is a first post anyway), let's focus on the /proc container case. The user is privileged as far as the container goes, and we'd like to allow it to mount filesystems. But forbidding it from mounting /proc guarantees that the user is provided with a version of /proc that is safe, and that he can't escape it. Ideally, userspace wouldn't even get involved with this, and a process mounting /proc would see the right things, depending on where it came from. But it turns out that cgroup-controlled resources are a lot harder than namespace-controlled resources in this respect.

> Using netlink as an interface when you are talking filesystems to > filesystem is pretty horrid. Netlink is great for networking developers > they get networking, but filesystem people understand filesystems and > you want to use netlink? >

Well, I am not doing it for filesystem people, but for people who are neither, that is, whoever wants to use this interface. That said, I don't want to center the discussion on this. My main reason was to have a quick way to communicate this list to the kernel, so I could test it and post a PoC for you guys to comment on. Even if everybody liked it, I was prepared from the start to redesign the interface.
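The whitelist semantics quoted from the cover letter (the list is consulted only once it has been enabled, and then only listed filesystem types may be mounted) can be modelled in a few lines of userspace C. This is a sketch for discussion, not the code from the patches: `struct fs_whitelist` and `fs_mount_allowed` are names assumed here for illustration, and the real implementation in fs/filesystems.c may look quite different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model of a per-namespace filesystem whitelist. */
struct fs_whitelist {
    bool enabled;              /* the list is consulted only when true */
    const char *const *types;  /* allowed filesystem type names */
    size_t ntypes;             /* number of entries in types */
};

/* Return true if a mount of fstype would be permitted under wl. */
static bool fs_mount_allowed(const struct fs_whitelist *wl, const char *fstype)
{
    assert(wl != NULL && fstype != NULL);

    /* Whitelist not enabled: existing setups keep working unchanged. */
    if (!wl->enabled)
        return true;

    /* Whitelist enabled: only types present in the list are allowed. */
    for (size_t i = 0; i < wl->ntypes; i++) {
        if (strcmp(wl->types[i], fstype) == 0)
            return true;
    }
    return false;
}
```

With an enabled list containing only "ext4" and "tmpfs", a request for "proc" is refused while "tmpfs" is accepted. A filesystem type registered later by a module load is also refused until someone adds it, which is the reason given for preferring a whitelist over a blacklist.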
* Re: [RFC 0/4] per-namespace allowed filesystems list [not found] ` <4F1E886A.7000107-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-01-24 11:17 ` Eric W. Biederman 2012-01-24 11:24 ` Glauber Costa 0 siblings, 1 reply; 16+ messages in thread From: Eric W. Biederman @ 2012-01-24 11:17 UTC (permalink / raw) To: Glauber Costa Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, serge-A9i7LUbDfNHQT0dZR+AlfA, daniel.lezcano-GANU6spQydw, pjt-hpIqsD4AKlfQT0dZR+AlfA, mzxreary-uLTowLwuiw4b1SvskN2V4Q, xemul-bzQdu9zFT3WakBO8gow8eQ, James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk, tj-DgEjT+Ai2ygdnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes: > On 01/24/2012 04:04 AM, Eric W. Biederman wrote: >> My first impression is that this looks like a hack to avoid finishing >> the user namespace. > > See my reply to Al. So again, to avoid steering the discussions to details I > myself don't consider central (since this is a first post anyway), let's focus > on the /proc container case. It is a privileged user as far as the container > goes, and we'd like to allow it to mount filesystems. But disallowing it to > mount /proc, can guarantee that the user will be provided with a version of > /proc that is safe, and that he can't escape this. The key things are that to the rest of the system you want this user to look like an unprivileged user. Aka user namespace. > Ideally, userspace wouldn't even get involved with this, and a process mounting > /proc would see the right things, depending on where it came from. But turns out > that the cgroups-controlled resources are a lot harder than the > namespaces-controlled resources for this. There are a couple of sides to this. If you trust the root user in your container all you have to say is: "Don't do that then." There are things like /proc/cpuinfo that a lot of processes use to figure out how many threads are wise to use. 
That is a problem that deserves a proper solution, not a hack. There are the global tunables under /proc, like /proc/sys/kernel/panic_on_oops, that you don't want people touching. There are potential security issues with people mounting block devices when they can control the filesystem data before mounting the filesystem. That mostly deserves fixing the filesystems, but in the unprivileged mount context it probably deserves a whitelist. Then there are problems with people mounting cgroup filesystems inside of a container and wondering why they don't work. That is a design limitation in the cgroup filesystem and code that needs to be fixed. Is there a case you are worried about that I have not covered? Eric
* Re: [RFC 0/4] per-namespace allowed filesystems list 2012-01-24 11:17 ` Eric W. Biederman @ 2012-01-24 11:24 ` Glauber Costa 0 siblings, 0 replies; 16+ messages in thread From: Glauber Costa @ 2012-01-24 11:24 UTC (permalink / raw) To: Eric W. Biederman Cc: cgroups, linux-fsdevel, serge, daniel.lezcano, pjt, mzxreary, xemul, James.Bottomley, tj, eric.dumazet On 01/24/2012 03:17 PM, Eric W. Biederman wrote: > Glauber Costa<glommer@parallels.com> writes: > >> On 01/24/2012 04:04 AM, Eric W. Biederman wrote: >>> My first impression is that this looks like a hack to avoid finishing >>> the user namespace. >> >> See my reply to Al. So again, to avoid steering the discussions to details I >> myself don't consider central (since this is a first post anyway), let's focus >> on the /proc container case. It is a privileged user as far as the container >> goes, and we'd like to allow it to mount filesystems. But disallowing it to >> mount /proc, can guarantee that the user will be provided with a version of >> /proc that is safe, and that he can't escape this. > > The key things are that to the rest of the system you want this user to > look like an unprivileged user. Aka user namespace. > >> Ideally, userspace wouldn't even get involved with this, and a process mounting >> /proc would see the right things, depending on where it came from. But turns out >> that the cgroups-controlled resources are a lot harder than the >> namespaces-controlled resources for this. > > There are a couple of sides to this. > > If you trust the root user in your container all you have to say is: > "Don't do that then." Of course he may not obey. And then mess up with the *other* containers in the system. (If he messes with himself, I don't care). Note that in this context, "messing" can be as simple as figuring out information that you'd not like the container to see. > There are things like /proc/cpuinfo that a lot of processes use to > figure out how many threads are wise to use. 
That is a problem that > deserves a proper solution not a hack. Agreed. This can be either in the kernel or in userspace. If it is in userspace, maybe we'd like to guarantee that this view will be consistent, and not replaced by the systemwide version. > There are the global tunables under /proc like > /proc/sys/kernel/panic_on_oops that you don't want people touching. > > There are potential security issues with people mounting block devices > when they can control the filesystem data before mounting the > filesystem. That mostly deserves fixing the filesystems but in the > unprivileged mount context that probably deserves a whitelist. > > Then the are problems with mounting cgroup filesystems inside of a > container, and wondering why they don't work. That is a design > limitation in the cgroup filesystem and code that needs to be fixed. > > Is there a case you are worried about that I have not covered? > The ones I've listed here in this mail, mostly. I am now wondering if Kirill has any around debugfs ? ^ permalink raw reply [flat|nested] 16+ messages in thread