* [PATCH 0/4] Devices accessibility control group (v2)
@ 2008-01-08 9:02 Pavel Emelyanov
[not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
0 siblings, 1 reply; 42+ messages in thread
From: Pavel Emelyanov @ 2008-01-08 9:02 UTC (permalink / raw)
To: Serge Hallyn, Oren Laadan; +Cc: Linux Containers, Paul Menage
The first version was posted long ago
(http://openvz.org/pipermail/devel/2007-September/007647.html)
and since then there are many (good I hope) changes:
* Added the block devices support :) It turned out to
be a bit simpler than the char one (or I missed
something significant);
* Now we can enable/disable not just individual devices,
but the whole major with all its minors (see the TODO
list beyond as well);
* Added the ability to restrict the read/write permissions
to devices, not just visible/invisible state.
That is - the main features I wished to implement right
after the v1 was sent. Some minor changes are:
* I merged the devices.char and devices.block files into
one - devices.permissions;
* As the result of the change above - the strings passed
to this file has changed. Now they are
[bc] <major>:{<minor>|*} [r-][w-]
E.g. b 5:2 r- will grant the read permissions to the
block 5:2 device and c 3:* -w will grant the write-only
access to all the character devices with the major 5.
However, there are some things to be done:
* Make the /proc/devices show relevant info depending on
who is reading it. This seems to be easy to do, since
I already have the support to dump similar info into the
devices.permissions file, but I haven't tried to use
this in /proc/devices yet;
* Add the support for devices ranges. I.e. someone might
wish to tell smth like b 5:[0-10] r- to this subsystem.
Currently this is not supported and I'm afraid that if we
start support minor ranges we'll have smth similar to
VMA-s or FLOCK-s ranges management in one more place in the
kernel.
* One more question is - are there any other permissions to
work with? E.g. in OpenVZ we have a separate bit for
quota management, maybe we can invent some more...
Currently I didn't pay much attention to split this set well,
so this will most likely won't work with git-bisect, but I
think this is OK for now. I will sure split it better when I
send the v3 and further.
The set is prepared against the 2.6.24-rc5-mm1.
All this is minimally tested and seems to work. Hope to hear
you comments, wishes and patches soon :)
To play with it - run a standard procedure:
# mount -t container none /cont/devs -o devices
# mkdir /cont/devs/0
# echo -n $$ > /cont/devs/0/tasks
and tune device permissions.
Thanks,
Pavel
^ permalink raw reply [flat|nested] 42+ messages in thread[parent not found: <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]
* [PATCH 1/4] Some changes in the kobject mapper [not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> @ 2008-01-08 9:07 ` Pavel Emelyanov [not found] ` <47833D43.3090703-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 2008-01-08 9:12 ` [PATCH 2/4] The character devices layer changes Pavel Emelyanov ` (4 subsequent siblings) 5 siblings, 1 reply; 42+ messages in thread From: Pavel Emelyanov @ 2008-01-08 9:07 UTC (permalink / raw) To: Serge Hallyn, Oren Laadan; +Cc: Linux Containers, Paul Menage The main thing that I want from the kobj mapper is to add the mode_t on the struct kobj_map that reflects with permissions are associated with this particular map. This mode is to be returned via the kobj_lookup. I use the FMODE_XXX flags to handle the permissions bits, as I will compare these ones to the file->f_mode later. By default all bits are set (for the initial container). The additional things I need are kobj_remap() to change that permission and kobj_iterate() to walk the map. The kobj_map_fini() is the roll-back for the kobj_map_init(). Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> --- diff --git a/drivers/base/map.c b/drivers/base/map.c index e87017f..1aa2b58 100644 --- a/drivers/base/map.c +++ b/drivers/base/map.c @@ -15,11 +15,13 @@ #include <linux/kdev_t.h> #include <linux/kobject.h> #include <linux/kobj_map.h> +#include <linux/fs.h> struct kobj_map { struct probe { struct probe *next; dev_t dev; + mode_t mode; unsigned long range; struct module *owner; kobj_probe_t *get; @@ -29,9 +31,9 @@ struct kobj_map { struct mutex *lock; }; -int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range, - struct module *module, kobj_probe_t *probe, - int (*lock)(dev_t, void *), void *data) +static int __kobj_map(struct kobj_map *domain, dev_t dev, mode_t mode, + unsigned long range, struct module *module, + kobj_probe_t *probe, int (*lock)(dev_t, void *), void *data) { unsigned n = MAJOR(dev + range - 1) - MAJOR(dev) + 1; unsigned index = MAJOR(dev); @@ -53,8 +55,10 @@ int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range, p->dev = dev; p->range = range; p->data = data; + /* we allow these ones always by default */ + p->mode = mode | FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; } - mutex_lock(domain->lock); + for (i = 0, p -= n; i < n; i++, p++, index++) { struct probe **s = &domain->probes[index % 255]; while (*s && (*s)->range < range) @@ -62,10 +66,57 @@ int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range, p->next = *s; *s = p; } - mutex_unlock(domain->lock); return 0; } +int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range, + struct module *module, kobj_probe_t *probe, + int (*lock)(dev_t, void *), void *data) +{ + int err; + + mutex_lock(domain->lock); + err = __kobj_map(domain, dev, FMODE_READ | FMODE_WRITE, range, + module, probe, lock, data); + mutex_unlock(domain->lock); + return err; +} + +#ifdef CONFIG_CGROUP_DEVS +int kobj_remap(struct kobj_map *domain, dev_t dev, mode_t mode, + unsigned long range, struct module *module, + kobj_probe_t *probe, int (*lock)(dev_t, void *), void *data) +{ + unsigned n = MAJOR(dev + range - 1) - MAJOR(dev) + 1; + unsigned index = MAJOR(dev); + unsigned i; + int err = -ESRCH; + + if (n > 255) + n = 255; + + mutex_lock(domain->lock); + for (i = 0; i < n; i++, index++) { + struct probe **s; + for (s = &domain->probes[index % 255]; *s; s = &(*s)->next) { + struct probe *p = *s; + if (p->dev == dev) { + p->mode = mode | FMODE_LSEEK | + FMODE_PREAD | FMODE_PWRITE; + err = 0; + break; + } + } + } + + if (err) + err = __kobj_map(domain, dev, mode, range, module, + probe, lock, data); + mutex_unlock(domain->lock); + return err; +} +#endif + void kobj_unmap(struct kobj_map *domain, dev_t dev, unsigned long range) { unsigned n = MAJOR(dev + range - 1) - MAJOR(dev) + 1; @@ -93,7 +144,8 @@ void kobj_unmap(struct kobj_map *domain, dev_t dev, unsigned long range) kfree(found); } -struct kobject *kobj_lookup(struct kobj_map *domain, dev_t dev, int *index) +struct kobject *kobj_lookup(struct kobj_map *domain, dev_t dev, mode_t *mode, + int *index) { struct kobject *kobj; struct probe *p; @@ -125,14 +177,46 @@ retry: kobj = probe(dev, index, data); /* Currently ->owner protects _only_ ->probe() itself. */ module_put(owner); - if (kobj) + if (kobj) { + if (mode) + *mode = p->mode; return kobj; + } goto retry; } mutex_unlock(domain->lock); return NULL; } +#ifdef CONFIG_CGROUP_DEVS +void kobj_map_iterate(struct kobj_map *domain, + int (*fn)(dev_t, int, mode_t, void *), void *arg) +{ + int i; + struct probe *p; + dev_t skip = MKDEV(0, 0); + + mutex_lock(domain->lock); + for (i = 0; i < 255; i++) { + p = domain->probes[i]; + while (p != NULL) { + if (p->dev == skip) + goto next; + if (p->dev == MKDEV(0, 1)) + goto next; + + skip = p->dev; + if (fn(p->dev, p->range, p->mode, arg)) + goto done; +next: + p = p->next; + } + } +done: + mutex_unlock(domain->lock); +} +#endif + struct kobj_map *kobj_map_init(kobj_probe_t *base_probe, struct mutex *lock) { struct kobj_map *p = kmalloc(sizeof(struct kobj_map), GFP_KERNEL); @@ -153,3 +237,21 @@ struct kobj_map *kobj_map_init(kobj_probe_t *base_probe, struct mutex *lock) p->lock = lock; return p; } + +void kobj_map_fini(struct kobj_map *map) +{ + int i; + struct probe *p, *next; + + for (i = 0; i < 256; i++) { + p = map->probes[i]; + while (p->next != NULL) { + next = p->next; + kfree(p); + p = next; + } + } + + kfree(p); + kfree(map); +} diff --git a/include/linux/kobj_map.h b/include/linux/kobj_map.h index bafe178..ecfe772 100644 --- a/include/linux/kobj_map.h +++ b/include/linux/kobj_map.h @@ -7,8 +7,13 @@ struct kobj_map; int kobj_map(struct kobj_map *, dev_t, unsigned long, struct module *, kobj_probe_t *, int (*)(dev_t, void *), void *); +int kobj_remap(struct kobj_map *, dev_t, mode_t, unsigned long, struct module *, + kobj_probe_t *, int (*)(dev_t, void *), void *); void kobj_unmap(struct kobj_map *, dev_t, unsigned long); -struct kobject *kobj_lookup(struct kobj_map *, dev_t, int *); +struct kobject *kobj_lookup(struct kobj_map *, dev_t, mode_t *, int *); +void kobj_map_iterate(struct kobj_map *, int (*fn)(dev_t, int, mode_t, void *), + void *); struct kobj_map *kobj_map_init(kobj_probe_t *, struct mutex *); +void kobj_map_fini(struct kobj_map *); #endif ^ permalink raw reply related [flat|nested] 42+ messages in thread
[parent not found: <47833D43.3090703-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 1/4] Some changes in the kobject mapper [not found] ` <47833D43.3090703-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> @ 2008-01-08 18:36 ` Daniel Hokka Zakrisson [not found] ` <4783C2B4.7000501-nym3zxDgnZcAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Daniel Hokka Zakrisson @ 2008-01-08 18:36 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage Pavel Emelyanov wrote: > The main thing that I want from the kobj mapper is to add > the mode_t on the struct kobj_map that reflects with > permissions are associated with this particular map. This > mode is to be returned via the kobj_lookup. > > I use the FMODE_XXX flags to handle the permissions bits, > as I will compare these ones to the file->f_mode later. > By default all bits are set (for the initial container). > > The additional things I need are kobj_remap() to change > that permission and kobj_iterate() to walk the map. > > The kobj_map_fini() is the roll-back for the kobj_map_init(). > > Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> > > ... > @@ -153,3 +237,21 @@ struct kobj_map *kobj_map_init(kobj_probe_t *base_probe, struct mutex *lock) > p->lock = lock; > return p; > } > + > +void kobj_map_fini(struct kobj_map *map) > +{ > + int i; > + struct probe *p, *next; > + > + for (i = 0; i < 256; i++) { This should be 255, shouldn't it? > + p = map->probes[i]; > + while (p->next != NULL) { > + next = p->next; > + kfree(p); > + p = next; > + } > + } > + > + kfree(p); > + kfree(map); > +} > diff --git a/include/linux/kobj_map.h b/include/linux/kobj_map.h > index bafe178..ecfe772 100644 > --- a/include/linux/kobj_map.h > +++ b/include/linux/kobj_map.h > @@ -7,8 +7,13 @@ struct kobj_map; > > int kobj_map(struct kobj_map *, dev_t, unsigned long, struct module *, > kobj_probe_t *, int (*)(dev_t, void *), void *); > +int kobj_remap(struct kobj_map *, dev_t, mode_t, unsigned long, struct module *, > + kobj_probe_t *, int (*)(dev_t, void *), void *); > void kobj_unmap(struct kobj_map *, dev_t, unsigned long); > -struct kobject *kobj_lookup(struct kobj_map *, dev_t, int *); > +struct kobject *kobj_lookup(struct kobj_map *, dev_t, mode_t *, int *); > +void kobj_map_iterate(struct kobj_map *, int (*fn)(dev_t, int, mode_t, void *), > + void *); > struct kobj_map *kobj_map_init(kobj_probe_t *, struct mutex *); > +void kobj_map_fini(struct kobj_map *); > > #endif > > _______________________________________________ > Containers mailing list > Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org > https://lists.linux-foundation.org/mailman/listinfo/containers -- Daniel Hokka Zakrisson ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <4783C2B4.7000501-nym3zxDgnZcAvxtiuMwx3w@public.gmane.org>]
* Re: [PATCH 1/4] Some changes in the kobject mapper [not found] ` <4783C2B4.7000501-nym3zxDgnZcAvxtiuMwx3w@public.gmane.org> @ 2008-01-08 19:17 ` Dave Hansen 0 siblings, 0 replies; 42+ messages in thread From: Dave Hansen @ 2008-01-08 19:17 UTC (permalink / raw) To: Daniel Hokka Zakrisson; +Cc: Linux Containers, Paul Menage, Pavel Emelyanov On Tue, 2008-01-08 at 19:36 +0100, Daniel Hokka Zakrisson wrote: > > > +void kobj_map_fini(struct kobj_map *map) > > +{ > > + int i; > > + struct probe *p, *next; > > + > > + for (i = 0; i < 256; i++) { > > This should be 255, shouldn't it? Neither, it should be a named constant. ;) -- Dave ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH 2/4] The character devices layer changes [not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 2008-01-08 9:07 ` [PATCH 1/4] Some changes in the kobject mapper Pavel Emelyanov @ 2008-01-08 9:12 ` Pavel Emelyanov [not found] ` <47833E93.6010108-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 2008-01-08 9:15 ` [PATCH 3/4] The block " Pavel Emelyanov ` (3 subsequent siblings) 5 siblings, 1 reply; 42+ messages in thread From: Pavel Emelyanov @ 2008-01-08 9:12 UTC (permalink / raw) To: Serge Hallyn, Oren Laadan; +Cc: Linux Containers, Paul Menage These changes include the API for the control group to map/remap/unmap the devices with their permissions and one important thing. The fact is that the struct cdev is cached in the inode for faster access, so once we looked one up we go through the fast path and omit the kobj_lookup() call. This is no longer good when we restrict the access to cdevs. To address this issue, I store the last_perm and last(_map) fields on the struct cdev (and protect them with the cdev_lock) and force the re-lookup in the kobj mappings if needed. I know, this might be slow, but I have two points for it: 1. The re-lookup happens on open() only which is not a fast-path. Besides, this is so for block layer and nobody complains; 2. On a well-isolated setup, when each container has its own filesystem this is no longer a problem - each cgroup will cache the cdev on its inode and work good. Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> --- diff --git a/fs/char_dev.c b/fs/char_dev.c index c3bfa76..2b821ef 100644 --- a/fs/char_dev.c +++ b/fs/char_dev.c @@ -22,6 +22,8 @@ #include <linux/mutex.h> #include <linux/backing-dev.h> +#include <linux/devscontrol.h> + #ifdef CONFIG_KMOD #include <linux/kmod.h> #endif @@ -362,17 +364,25 @@ int chrdev_open(struct inode * inode, struct file * filp) struct cdev *p; struct cdev *new = NULL; int ret = 0; + struct kobj_map *map; + mode_t mode; + + map = task_cdev_map(current); + if (map == NULL) + map = cdev_map; spin_lock(&cdev_lock); p = inode->i_cdev; - if (!p) { + if (!p || p->last != map) { struct kobject *kobj; int idx; + spin_unlock(&cdev_lock); - kobj = kobj_lookup(cdev_map, inode->i_rdev, &idx); + kobj = kobj_lookup(map, inode->i_rdev, &mode, &idx); if (!kobj) return -ENXIO; new = container_of(kobj, struct cdev, kobj); + BUG_ON(p != NULL && p != new); spin_lock(&cdev_lock); p = inode->i_cdev; if (!p) { @@ -382,12 +392,24 @@ int chrdev_open(struct inode * inode, struct file * filp) new = NULL; } else if (!cdev_get(p)) ret = -ENXIO; + else { + p->last = map; + p->last_mode = mode; + } } else if (!cdev_get(p)) ret = -ENXIO; + else + mode = p->last_mode; spin_unlock(&cdev_lock); cdev_put(new); if (ret) return ret; + + if ((filp->f_mode & mode) != filp->f_mode) { + cdev_put(p); + return -EACCES; + } + filp->f_op = fops_get(p->ops); if (!filp->f_op) { cdev_put(p); @@ -461,6 +483,64 @@ int cdev_add(struct cdev *p, dev_t dev, unsigned count) return kobj_map(cdev_map, dev, count, NULL, exact_match, exact_lock, p); } +#ifdef CONFIG_CGROUP_DEVS +static inline void cdev_map_reset(struct kobj_map *map, struct cdev *c) +{ + spin_lock(&cdev_lock); + if (c->last == map) + c->last = NULL; + spin_unlock(&cdev_lock); +} + +int cdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode) +{ + int tmp; + struct kobject *k; + struct cdev *c; + + k = kobj_lookup(cdev_map, dev, NULL, &tmp); + if (k == NULL) + return -ENODEV; + + c = container_of(k, struct cdev, kobj); + tmp = kobj_remap(map, dev, mode, all ? MINORMASK : 1, NULL, + exact_match, exact_lock, c); + if (tmp < 0) { + cdev_put(c); + return tmp; + } + + cdev_map_reset(map, c); + return 0; +} + +int cdev_del_from_map(struct kobj_map *map, dev_t dev, int all) +{ + int tmp; + struct kobject *k; + struct cdev *c; + + k = kobj_lookup(cdev_map, dev, NULL, &tmp); + if (k == NULL) + return -ENODEV; + + c = container_of(k, struct cdev, kobj); + kobj_unmap(map, dev, all ? MINORMASK : 1); + + cdev_map_reset(map, c); + + cdev_put(c); + cdev_put(c); + return 0; +} + +void cdev_iterate_map(struct kobj_map *map, + int (*fn)(dev_t, int, mode_t, void *), void *x) +{ + kobj_map_iterate(map, fn, x); +} +#endif + static void cdev_unmap(dev_t dev, unsigned count) { kobj_unmap(cdev_map, dev, count); @@ -542,9 +622,19 @@ static struct kobject *base_probe(dev_t dev, int *part, void *data) return NULL; } +struct kobj_map *cdev_map_init(void) +{ + return kobj_map_init(base_probe, &chrdevs_lock); +} + +void cdev_map_fini(struct kobj_map *map) +{ + kobj_map_fini(map); +} + void __init chrdev_init(void) { - cdev_map = kobj_map_init(base_probe, &chrdevs_lock); + cdev_map = cdev_map_init(); bdi_init(&directly_mappable_cdev_bdi); } diff --git a/include/linux/cdev.h b/include/linux/cdev.h index 1e29b13..d72a2a1 100644 --- a/include/linux/cdev.h +++ b/include/linux/cdev.h @@ -9,6 +9,7 @@ struct file_operations; struct inode; struct module; +struct kobj_map; struct cdev { struct kobject kobj; @@ -17,6 +18,8 @@ struct cdev { struct list_head list; dev_t dev; unsigned int count; + struct kobj_map *last; + mode_t last_mode; }; void cdev_init(struct cdev *, const struct file_operations *); @@ -33,5 +36,11 @@ void cd_forget(struct inode *); extern struct backing_dev_info directly_mappable_cdev_bdi; +int cdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode); +int cdev_del_from_map(struct kobj_map *map, dev_t dev, int all); +struct kobj_map *cdev_map_init(void); +void cdev_map_fini(struct kobj_map *map); +void cdev_iterate_map(struct kobj_map *, + int (*fn)(dev_t, int, mode_t, void *), void *); #endif #endif ^ permalink raw reply related [flat|nested] 42+ messages in thread
[parent not found: <47833E93.6010108-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 2/4] The character devices layer changes [not found] ` <47833E93.6010108-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> @ 2008-01-14 17:03 ` Serge E. Hallyn [not found] ` <20080114170333.GA15077-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Serge E. Hallyn @ 2008-01-14 17:03 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org): > These changes include the API for the control group > to map/remap/unmap the devices with their permissions > and one important thing. > > The fact is that the struct cdev is cached in the inode > for faster access, so once we looked one up we go through > the fast path and omit the kobj_lookup() call. This is no > longer good when we restrict the access to cdevs. > > To address this issue, I store the last_perm and last(_map) > fields on the struct cdev (and protect them with the cdev_lock) > and force the re-lookup in the kobj mappings if needed. > > I know, this might be slow, but I have two points for it: > 1. The re-lookup happens on open() only which is not > a fast-path. Besides, this is so for block layer and > nobody complains; > 2. On a well-isolated setup, when each container has its > own filesystem this is no longer a problem - each > cgroup will cache the cdev on its inode and work good. What about simply returning -EPERM when open()ing a cdev with ->map!=task_cdev_map(current)? Shouldn't be a problem for ttys, since the container init already has the tty open, right? Otherwise, the patchset looks good to me. Want to look through this one a little more (i think that'd be easier with the -EPERM approach) and scrutinize patch 4, but overall it makes sense. If I understand right, we're taking 14k per cgroup for kobjmaps? Do we consider that a problem? thanks, -serge > Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> > > --- > > diff --git a/fs/char_dev.c b/fs/char_dev.c > index c3bfa76..2b821ef 100644 > --- a/fs/char_dev.c > +++ b/fs/char_dev.c > @@ -22,6 +22,8 @@ > #include <linux/mutex.h> > #include <linux/backing-dev.h> > > +#include <linux/devscontrol.h> > + > #ifdef CONFIG_KMOD > #include <linux/kmod.h> > #endif > @@ -362,17 +364,25 @@ int chrdev_open(struct inode * inode, struct file * filp) > struct cdev *p; > struct cdev *new = NULL; > int ret = 0; > + struct kobj_map *map; > + mode_t mode; > + > + map = task_cdev_map(current); > + if (map == NULL) > + map = cdev_map; > > spin_lock(&cdev_lock); > p = inode->i_cdev; > - if (!p) { > + if (!p || p->last != map) { > struct kobject *kobj; > int idx; > + > spin_unlock(&cdev_lock); > - kobj = kobj_lookup(cdev_map, inode->i_rdev, &idx); > + kobj = kobj_lookup(map, inode->i_rdev, &mode, &idx); > if (!kobj) > return -ENXIO; > new = container_of(kobj, struct cdev, kobj); > + BUG_ON(p != NULL && p != new); > spin_lock(&cdev_lock); > p = inode->i_cdev; > if (!p) { > @@ -382,12 +392,24 @@ int chrdev_open(struct inode * inode, struct file * filp) > new = NULL; > } else if (!cdev_get(p)) > ret = -ENXIO; > + else { > + p->last = map; > + p->last_mode = mode; > + } > } else if (!cdev_get(p)) > ret = -ENXIO; > + else > + mode = p->last_mode; > spin_unlock(&cdev_lock); > cdev_put(new); > if (ret) > return ret; > + > + if ((filp->f_mode & mode) != filp->f_mode) { > + cdev_put(p); > + return -EACCES; > + } > + > filp->f_op = fops_get(p->ops); > if (!filp->f_op) { > cdev_put(p); > @@ -461,6 +483,64 @@ int cdev_add(struct cdev *p, dev_t dev, unsigned count) > return kobj_map(cdev_map, dev, count, NULL, exact_match, exact_lock, p); > } > > +#ifdef CONFIG_CGROUP_DEVS > +static inline void cdev_map_reset(struct kobj_map *map, struct cdev *c) > +{ > + spin_lock(&cdev_lock); > + if (c->last == map) > + c->last = NULL; > + spin_unlock(&cdev_lock); > +} > + > +int cdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode) > +{ > + int tmp; > + struct kobject *k; > + struct cdev *c; > + > + k = kobj_lookup(cdev_map, dev, NULL, &tmp); > + if (k == NULL) > + return -ENODEV; > + > + c = container_of(k, struct cdev, kobj); > + tmp = kobj_remap(map, dev, mode, all ? MINORMASK : 1, NULL, > + exact_match, exact_lock, c); > + if (tmp < 0) { > + cdev_put(c); > + return tmp; > + } > + > + cdev_map_reset(map, c); > + return 0; > +} > + > +int cdev_del_from_map(struct kobj_map *map, dev_t dev, int all) > +{ > + int tmp; > + struct kobject *k; > + struct cdev *c; > + > + k = kobj_lookup(cdev_map, dev, NULL, &tmp); > + if (k == NULL) > + return -ENODEV; > + > + c = container_of(k, struct cdev, kobj); > + kobj_unmap(map, dev, all ? MINORMASK : 1); > + > + cdev_map_reset(map, c); > + > + cdev_put(c); > + cdev_put(c); > + return 0; > +} > + > +void cdev_iterate_map(struct kobj_map *map, > + int (*fn)(dev_t, int, mode_t, void *), void *x) > +{ > + kobj_map_iterate(map, fn, x); > +} > +#endif > + > static void cdev_unmap(dev_t dev, unsigned count) > { > kobj_unmap(cdev_map, dev, count); > @@ -542,9 +622,19 @@ static struct kobject *base_probe(dev_t dev, int *part, void *data) > return NULL; > } > > +struct kobj_map *cdev_map_init(void) > +{ > + return kobj_map_init(base_probe, &chrdevs_lock); > +} > + > +void cdev_map_fini(struct kobj_map *map) > +{ > + kobj_map_fini(map); > +} > + > void __init chrdev_init(void) > { > - cdev_map = kobj_map_init(base_probe, &chrdevs_lock); > + cdev_map = cdev_map_init(); > bdi_init(&directly_mappable_cdev_bdi); > } > > diff --git a/include/linux/cdev.h b/include/linux/cdev.h > index 1e29b13..d72a2a1 100644 > --- a/include/linux/cdev.h > +++ b/include/linux/cdev.h > @@ -9,6 +9,7 @@ > struct file_operations; > struct inode; > struct module; > +struct kobj_map; > > struct cdev { > struct kobject kobj; > @@ -17,6 +18,8 @@ struct cdev { > struct list_head list; > dev_t dev; > unsigned int count; > + struct kobj_map *last; > + mode_t last_mode; > }; > > void cdev_init(struct cdev *, const struct file_operations *); > @@ -33,5 +36,11 @@ void cd_forget(struct inode *); > > extern struct backing_dev_info directly_mappable_cdev_bdi; > > +int cdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode); > +int cdev_del_from_map(struct kobj_map *map, dev_t dev, int all); > +struct kobj_map *cdev_map_init(void); > +void cdev_map_fini(struct kobj_map *map); > +void cdev_iterate_map(struct kobj_map *, > + int (*fn)(dev_t, int, mode_t, void *), void *); > #endif > #endif ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <20080114170333.GA15077-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>]
* Re: [PATCH 2/4] The character devices layer changes [not found] ` <20080114170333.GA15077-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org> @ 2008-01-15 8:05 ` Pavel Emelyanov [not found] ` <478C6942.4050903-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Pavel Emelyanov @ 2008-01-15 8:05 UTC (permalink / raw) To: Serge E. Hallyn; +Cc: Linux Containers, Paul Menage Serge E. Hallyn wrote: > Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org): >> These changes include the API for the control group >> to map/remap/unmap the devices with their permissions >> and one important thing. >> >> The fact is that the struct cdev is cached in the inode >> for faster access, so once we looked one up we go through >> the fast path and omit the kobj_lookup() call. This is no >> longer good when we restrict the access to cdevs. >> >> To address this issue, I store the last_perm and last(_map) >> fields on the struct cdev (and protect them with the cdev_lock) >> and force the re-lookup in the kobj mappings if needed. >> >> I know, this might be slow, but I have two points for it: >> 1. The re-lookup happens on open() only which is not >> a fast-path. Besides, this is so for block layer and >> nobody complains; >> 2. On a well-isolated setup, when each container has its >> own filesystem this is no longer a problem - each >> cgroup will cache the cdev on its inode and work good. > > What about simply returning -EPERM when open()ing a cdev > with ->map!=task_cdev_map(current)? In this case it will HAVE to setup isolated filesystem for each cgroup. I thought that this flexibility doesn't hurt. > Shouldn't be a problem for ttys, since the container init > already has the tty open, right? Yup, but this is not the case for /dev/null or /dev/zero. > Otherwise, the patchset looks good to me. Want to look > through this one a little more (i think that'd be easier > with the -EPERM approach) and scrutinize patch 4, but > overall it makes sense. OK, thanks. > If I understand right, we're taking 14k per cgroup for > kobjmaps? Do we consider that a problem? 14k? I allocate the struct kobj_map which is only 256 pointers (i.e. - 2K) and the struct probe that is 32 bytes. I.e. 4k or a single page. I think this is OK. > thanks, > -serge > [snip] ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <478C6942.4050903-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 2/4] The character devices layer changes [not found] ` <478C6942.4050903-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> @ 2008-01-15 14:54 ` Serge E. Hallyn 0 siblings, 0 replies; 42+ messages in thread From: Serge E. Hallyn @ 2008-01-15 14:54 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org): > Serge E. Hallyn wrote: > > Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org): > >> These changes include the API for the control group > >> to map/remap/unmap the devices with their permissions > >> and one important thing. > >> > >> The fact is that the struct cdev is cached in the inode > >> for faster access, so once we looked one up we go through > >> the fast path and omit the kobj_lookup() call. This is no > >> longer good when we restrict the access to cdevs. > >> > >> To address this issue, I store the last_perm and last(_map) > >> fields on the struct cdev (and protect them with the cdev_lock) > >> and force the re-lookup in the kobj mappings if needed. > >> > >> I know, this might be slow, but I have two points for it: > >> 1. The re-lookup happens on open() only which is not > >> a fast-path. Besides, this is so for block layer and > >> nobody complains; > >> 2. On a well-isolated setup, when each container has its > >> own filesystem this is no longer a problem - each > >> cgroup will cache the cdev on its inode and work good. > > > > What about simply returning -EPERM when open()ing a cdev > > with ->map!=task_cdev_map(current)? > > In this case it will HAVE to setup isolated filesystem for > each cgroup. I thought that this flexibility doesn't hurt. The cost and effort of setting up a private /dev seems so minimal to me it seems worth not dealing with the inode->map switching around. But maybe that's just me. > > Shouldn't be a problem for ttys, since the container init > > already has the tty open, right? > > Yup, but this is not the case for /dev/null or /dev/zero. > > > Otherwise, the patchset looks good to me. Want to look > > through this one a little more (i think that'd be easier > > with the -EPERM approach) and scrutinize patch 4, but > > overall it makes sense. > > OK, thanks. > > > If I understand right, we're taking 14k per cgroup for > > kobjmaps? Do we consider that a problem? > > 14k? I allocate the struct kobj_map which is only 256 pointers > (i.e. - 2K) and the struct probe that is 32 bytes. I.e. 4k > or a single page. I think this is OK. Oops, I was thinking the probes were all pre-allocated. Sorry. > > thanks, > > -serge > > > > [snip] ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH 3/4] The block devices layer changes [not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 2008-01-08 9:07 ` [PATCH 1/4] Some changes in the kobject mapper Pavel Emelyanov 2008-01-08 9:12 ` [PATCH 2/4] The character devices layer changes Pavel Emelyanov @ 2008-01-08 9:15 ` Pavel Emelyanov 2008-01-08 9:18 ` [PATCH 4/4] The control group itself Pavel Emelyanov ` (2 subsequent siblings) 5 siblings, 0 replies; 42+ messages in thread From: Pavel Emelyanov @ 2008-01-08 9:15 UTC (permalink / raw) To: Serge Hallyn, Oren Laadan; +Cc: Linux Containers, Paul Menage They are the same as for the character layer, but the good news is that there are no caching in this case. So this patch is smaller and easier to understand as compared to the previous one. Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> --- diff --git a/block/genhd.c b/block/genhd.c index 5e4ab4b..6f9ef48 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -8,6 +8,7 @@ #include <linux/kdev_t.h> #include <linux/kernel.h> #include <linux/blkdev.h> +#include <linux/devscontrol.h> #include <linux/init.h> #include <linux/spinlock.h> #include <linux/seq_file.h> @@ -195,6 +196,57 @@ void unlink_gendisk(struct gendisk *disk) disk->minors); } +#ifdef CONFIG_CGROUP_DEVS +int bdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode) +{ + int tmp; + struct kobject *kobj; + struct device *d; + struct gendisk *disk; + + kobj = kobj_lookup(bdev_map, dev, NULL, &tmp); + if (kobj == NULL) + return -ENODEV; + + d = kobj_to_dev(kobj); + disk = dev_to_disk(d); + tmp = kobj_remap(map, dev, mode, all ? MINORBITS : 1, NULL, + exact_match, exact_lock, disk); + if (tmp < 0) { + put_disk(disk); + return tmp; + } + + return 0; +} + +int bdev_del_from_map(struct kobj_map *map, dev_t dev, int all) +{ + int tmp; + struct kobject *kobj; + struct device *d; + struct gendisk *disk; + + kobj = kobj_lookup(bdev_map, dev, NULL, &tmp); + if (kobj == NULL) + return -ENODEV; + + d = kobj_to_dev(kobj); + disk = dev_to_disk(d); + kobj_unmap(map, dev, all ? MINORBITS : 1); + + put_disk(disk); + put_disk(disk); + return 0; +} + +void bdev_iterate_map(struct kobj_map *map, + int (*fn)(dev_t, int, mode_t, void *), void *x) +{ + kobj_map_iterate(map, fn, x); +} +#endif + /** * get_gendisk - get partitioning information for a given device * @dev: device to get partitioning information for @@ -202,10 +254,18 @@ void unlink_gendisk(struct gendisk *disk) * This function gets the structure containing partitioning * information for the given device @dev. */ -struct gendisk *get_gendisk(dev_t devt, int *part) +struct gendisk *get_gendisk(dev_t devt, mode_t *mode, int *part) { - struct kobject *kobj = kobj_lookup(bdev_map, devt, part); - struct device *dev = kobj_to_dev(kobj); + struct kobj_map *map; + struct kobject *kobj; + struct device *dev; + + map = task_bdev_map(current); + if (map == NULL) + map = bdev_map; + + kobj = kobj_lookup(map, devt, mode, part); + dev = kobj_to_dev(kobj); return kobj ? dev_to_disk(dev) : NULL; } @@ -356,10 +416,20 @@ static struct kobject *base_probe(dev_t devt, int *part, void *data) return NULL; } +struct kobj_map *bdev_map_init(void) +{ + return kobj_map_init(base_probe, &block_class_lock); +} + +void bdev_map_fini(struct kobj_map *map) +{ + kobj_map_fini(map); +} + static int __init genhd_device_init(void) { class_register(&block_class); - bdev_map = kobj_map_init(base_probe, &block_class_lock); + bdev_map = bdev_map_init(); blk_dev_init(); #ifndef CONFIG_SYSFS_DEPRECATED diff --git a/fs/block_dev.c b/fs/block_dev.c index 55295a4..03b1b5e 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1129,16 +1129,25 @@ static int do_open(struct block_device *bdev, struct file *file, int for_part) struct module *owner = NULL; struct gendisk *disk; int ret = -ENXIO; + mode_t mode; int part; file->f_mapping = bdev->bd_inode->i_mapping; lock_kernel(); - disk = get_gendisk(bdev->bd_dev, &part); + disk = get_gendisk(bdev->bd_dev, &mode, &part); if (!disk) { unlock_kernel(); bdput(bdev); return ret; } + + if ((file->f_mode & mode) != file->f_mode) { + unlock_kernel(); + bdput(bdev); + put_disk(disk); + return -EACCES; + } + owner = disk->fops->owner; mutex_lock_nested(&bdev->bd_mutex, for_part); diff --git a/include/linux/genhd.h b/include/linux/genhd.h index dc710a0..b2d7b52 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -239,7 +239,15 @@ extern int get_blkdev_list(char *, int); extern void add_disk(struct gendisk *disk); extern void del_gendisk(struct gendisk *gp); extern void unlink_gendisk(struct gendisk *gp); -extern struct gendisk *get_gendisk(dev_t dev, int *part); +extern struct gendisk *get_gendisk(dev_t dev, mode_t *mode, int *part); + +struct kobj_map; +extern int bdev_add_to_map(struct kobj_map *, dev_t dev, int all, mode_t mode); +extern int bdev_del_from_map(struct kobj_map *map, dev_t dev, int all); +extern void bdev_iterate_map(struct kobj_map *map, + int (*fn)(dev_t, int, mode_t, void *), void *x); +extern struct kobj_map *bdev_map_init(void); +extern void bdev_map_fini(struct kobj_map *map); extern void set_device_ro(struct block_device *bdev, int flag); extern void set_disk_ro(struct gendisk *disk, int flag); ^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 4/4] The control group itself [not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> ` (2 preceding siblings ...) 2008-01-08 9:15 ` [PATCH 3/4] The block " Pavel Emelyanov @ 2008-01-08 9:18 ` Pavel Emelyanov [not found] ` <47833FF6.6060901-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 2008-01-12 21:20 ` [PATCH 0/4] Devices accessibility control group (v2) sukadev-r/Jw6+rmf7HQT0dZR+AlfA 2008-01-14 21:18 ` Paul Menage 5 siblings, 1 reply; 42+ messages in thread From: Pavel Emelyanov @ 2008-01-08 9:18 UTC (permalink / raw) To: Serge Hallyn, Oren Laadan; +Cc: Linux Containers, Paul Menage Each new group will have its own maps for char and block layers. The devices access list is tuned via the devices.permissions file. One may read from the file to get the configured state. E.g. # cat /cont/devices/0/devices.permissions c 1:* rw b 8:1 rw The top container isn't initialized, so that the char and block layers will use the global maps to lookup their devices. Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> --- diff --git a/fs/Makefile b/fs/Makefile index 82b6ae1..a085706 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -63,6 +63,8 @@ obj-y += devpts/ obj-$(CONFIG_PROFILING) += dcookies.o obj-$(CONFIG_DLM) += dlm/ + +obj-$(CONFIG_CGROUP_DEVS) += devscontrol.o # Do not add any filesystems before this line obj-$(CONFIG_REISERFS_FS) += reiserfs/ diff --git a/fs/devscontrol.c b/fs/devscontrol.c new file mode 100644 index 0000000..ea282f3 --- /dev/null +++ b/fs/devscontrol.c @@ -0,0 +1,291 @@ +/* + * devscontrol.c - Device Controller + * + * Copyright 2007 OpenVZ SWsoft Inc + * Author: Pavel Emelyanov <xemul at openvz.org> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include <linux/cgroup.h> +#include <linux/cdev.h> +#include <linux/err.h> +#include <linux/devscontrol.h> +#include <linux/uaccess.h> +#include <linux/fs.h> +#include <linux/genhd.h> + +struct devs_cgroup { + struct cgroup_subsys_state css; + + struct kobj_map *cdev_map; + struct kobj_map *bdev_map; +}; + +static inline +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css) +{ + return container_of(css, struct devs_cgroup, css); +} + +static inline +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont) +{ + return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id)); +} + +struct kobj_map *task_cdev_map(struct task_struct *tsk) +{ + struct cgroup_subsys_state *css; + + css = task_subsys_state(tsk, devs_subsys_id); + if (css->cgroup->parent == NULL) + return NULL; + else + return css_to_devs(css)->cdev_map; +} + +struct kobj_map *task_bdev_map(struct task_struct *tsk) +{ + struct cgroup_subsys_state *css; + + css = task_subsys_state(tsk, devs_subsys_id); + if (css->cgroup->parent == NULL) + return NULL; + else + return css_to_devs(css)->bdev_map; +} + +static struct cgroup_subsys_state * +devs_create(struct cgroup_subsys *ss, struct cgroup *cont) +{ + struct devs_cgroup *devs; + + devs = kzalloc(sizeof(struct devs_cgroup), GFP_KERNEL); + if (devs == NULL) + goto out; + + devs->cdev_map = cdev_map_init(); + if (devs->cdev_map == NULL) + goto out_free; + + devs->bdev_map = bdev_map_init(); + if (devs->bdev_map == NULL) + goto out_free_cdev; + + return &devs->css; + +out_free_cdev: + cdev_map_fini(devs->cdev_map); +out_free: + kfree(devs); +out: + return ERR_PTR(-ENOMEM); +} + +static void devs_destroy(struct cgroup_subsys *ss, struct cgroup *cont) +{ + struct devs_cgroup *devs; + + devs = cgroup_to_devs(cont); + bdev_map_fini(devs->bdev_map); + cdev_map_fini(devs->cdev_map); + kfree(devs); +} + +static int decode_perms_str(char *buf, int *chrdev, dev_t *dev, + int *all, mode_t *mode) +{ + unsigned int major, minor; + char *end; + mode_t tmp; + + /* [cb] <major>:<minor> [r-][w-] */ + + if (buf[0] == 'c') + *chrdev = 1; + else if (buf[0] == 'b') + *chrdev = 0; + else + return -EINVAL; + + if (buf[1] != ' ') + return -EINVAL; + + major = simple_strtoul(buf + 2, &end, 10); + if (*end != ':') + return -EINVAL; + + if (end[1] == '*') { + if (end[2] != ' ') + return -EINVAL; + + *all = 1; + minor = 0; + end += 2; + } else { + minor = simple_strtoul(end + 1, &end, 10); + if (*end != ' ') + return -EINVAL; + + *all = 0; + } + + tmp = 0; + + if (end[1] == 'r') + tmp |= FMODE_READ; + else if (end[1] != '-') + return -EINVAL; + if (end[2] == 'w') + tmp |= FMODE_WRITE; + else if (end[2] != '-') + return -EINVAL; + + *dev = MKDEV(major, minor); + *mode = tmp; + return 0; +} + +static int encode_perms_str(char *buf, int len, int chrdev, dev_t dev, + int all, mode_t mode) +{ + int ret; + + ret = snprintf(buf, len, "%c %d:", chrdev ? 'c' : 'b', MAJOR(dev)); + if (all) + ret += snprintf(buf + ret, len - ret, "*"); + else + ret += snprintf(buf + ret, len - ret, "%d", MINOR(dev)); + + ret += snprintf(buf + ret, len - ret, " %c%c\n", + (mode & FMODE_READ) ? 'r' : '-', + (mode & FMODE_WRITE) ? 'w' : '-'); + + return ret + 1; +} + +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft, + struct file *f, const char __user *ubuf, + size_t nbytes, loff_t *pos) +{ + int err, all, chrdev; + dev_t dev; + char buf[64]; + struct devs_cgroup *devs; + mode_t mode; + + if (copy_from_user(buf, ubuf, sizeof(buf))) + return -EFAULT; + + buf[sizeof(buf) - 1] = 0; + err = decode_perms_str(buf, &chrdev, &dev, &all, &mode); + if (err < 0) + return err; + + devs = cgroup_to_devs(cont); + + if (mode == 0) { + if (chrdev) + err = cdev_del_from_map(devs->cdev_map, dev, all); + else + err = bdev_del_from_map(devs->bdev_map, dev, all); + + if (err < 0) + return err; + + css_put(&devs->css); + } else { + if (chrdev) + err = cdev_add_to_map(devs->cdev_map, dev, all, mode); + else + err = bdev_add_to_map(devs->bdev_map, dev, all, mode); + + if (err < 0) + return err; + + css_get(&devs->css); + } + + return nbytes; +} + +struct devs_dump_arg { + char *buf; + int pos; + int chrdev; +}; + +static int devs_dump_one(dev_t dev, int range, mode_t mode, void *x) +{ + struct devs_dump_arg *arg = x; + char tmp[64]; + int len; + + len = encode_perms_str(tmp, sizeof(tmp), arg->chrdev, dev, + range != 1, mode); + + if (arg->pos >= PAGE_SIZE - len) + return 1; + + memcpy(arg->buf + arg->pos, tmp, len); + arg->pos += len; + return 0; +} + +static ssize_t devs_read(struct cgroup *cont, struct cftype *cft, + struct file *f, char __user *ubuf, size_t nbytes, loff_t *pos) +{ + struct devs_dump_arg arg; + struct devs_cgroup *devs; + ssize_t ret; + + arg.buf = (char *)__get_free_page(GFP_KERNEL); + if (arg.buf == NULL) + return -ENOMEM; + + devs = cgroup_to_devs(cont); + arg.pos = 0; + + arg.chrdev = 1; + cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg); + + arg.chrdev = 0; + bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg); + + ret = simple_read_from_buffer(ubuf, nbytes, pos, + arg.buf, arg.pos); + + free_page((unsigned long)arg.buf); + return ret; +} + +static struct cftype devs_files[] = { + { + .name = "permissions", + .write = devs_write, + .read = devs_read, + }, +}; + +static int devs_populate(struct cgroup_subsys *ss, struct cgroup *cont) +{ + return cgroup_add_files(cont, ss, + devs_files, ARRAY_SIZE(devs_files)); +} + +struct cgroup_subsys devs_subsys = { + .name = "devices", + .subsys_id = devs_subsys_id, + .create = devs_create, + .destroy = devs_destroy, + .populate = devs_populate, +}; diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h index 228235c..9c0cd2c 100644 --- a/include/linux/cgroup_subsys.h +++ b/include/linux/cgroup_subsys.h @@ -42,3 +42,9 @@ SUBSYS(mem_cgroup) #endif /* */ + +#ifdef CONFIG_CGROUP_DEVS +SUBSYS(devs) +#endif + +/* */ diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h new file mode 100644 index 0000000..5f574e0 --- /dev/null +++ b/include/linux/devscontrol.h @@ -0,0 +1,20 @@ +#ifndef __DEVS_CONTROL_H__ +#define __DEVS_CONTROL_H__ +struct kobj_map; +struct task_struct; + +#ifdef CONFIG_CGROUP_DEVS +struct kobj_map *task_cdev_map(struct task_struct *); +struct kobj_map *task_bdev_map(struct task_struct *); +#else +static inline kobj_map *task_cdev_map(struct task_struct *tsk) +{ + return NULL; +} + +static inline kobj_map *task_bdev_map(struct task_struct *tsk) +{ + return NULL; +} +#endif +#endif diff --git a/init/Kconfig b/init/Kconfig index b886ac9..4dd53d6 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -283,6 +283,12 @@ config CGROUP_DEBUG Say N if unsure +config CGROUP_DEVS + bool "Devices cgroup subsystem" + depends on CGROUPS + help + Controlls the visibility of devices + config CGROUP_NS bool "Namespace cgroup subsystem" depends on CGROUPS ^ permalink raw reply related [flat|nested] 42+ messages in thread
[parent not found: <47833FF6.6060901-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <47833FF6.6060901-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> @ 2008-01-14 17:40 ` Serge E. Hallyn [not found] ` <20080114174056.GB15077-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org> 2008-01-14 21:54 ` Paul Menage 1 sibling, 1 reply; 42+ messages in thread From: Serge E. Hallyn @ 2008-01-14 17:40 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org): > Each new group will have its own maps for char and block > layers. The devices access list is tuned via the > devices.permissions file. > > One may read from the file to get the configured > state. E.g. > > # cat /cont/devices/0/devices.permissions > c 1:* rw > b 8:1 rw > > The top container isn't initialized, so that the char > and block layers will use the global maps to lookup > their devices. > > Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> > > --- > > diff --git a/fs/Makefile b/fs/Makefile > index 82b6ae1..a085706 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -63,6 +63,8 @@ obj-y += devpts/ > > obj-$(CONFIG_PROFILING) += dcookies.o > obj-$(CONFIG_DLM) += dlm/ > + > +obj-$(CONFIG_CGROUP_DEVS) += devscontrol.o > > # Do not add any filesystems before this line > obj-$(CONFIG_REISERFS_FS) += reiserfs/ > diff --git a/fs/devscontrol.c b/fs/devscontrol.c > new file mode 100644 > index 0000000..ea282f3 > --- /dev/null > +++ b/fs/devscontrol.c > @@ -0,0 +1,291 @@ > +/* > + * devscontrol.c - Device Controller > + * > + * Copyright 2007 OpenVZ SWsoft Inc > + * Author: Pavel Emelyanov <xemul at openvz.org> > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or > + * (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + */ > + > +#include <linux/cgroup.h> > +#include <linux/cdev.h> > +#include <linux/err.h> > +#include <linux/devscontrol.h> > +#include <linux/uaccess.h> > +#include <linux/fs.h> > +#include <linux/genhd.h> > + > +struct devs_cgroup { > + struct cgroup_subsys_state css; > + > + struct kobj_map *cdev_map; > + struct kobj_map *bdev_map; > +}; > + > +static inline > +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css) > +{ > + return container_of(css, struct devs_cgroup, css); > +} > + > +static inline > +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont) > +{ > + return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id)); > +} > + > +struct kobj_map *task_cdev_map(struct task_struct *tsk) > +{ > + struct cgroup_subsys_state *css; > + > + css = task_subsys_state(tsk, devs_subsys_id); > + if (css->cgroup->parent == NULL) > + return NULL; > + else > + return css_to_devs(css)->cdev_map; > +} > + > +struct kobj_map *task_bdev_map(struct task_struct *tsk) > +{ > + struct cgroup_subsys_state *css; > + > + css = task_subsys_state(tsk, devs_subsys_id); > + if (css->cgroup->parent == NULL) > + return NULL; > + else > + return css_to_devs(css)->bdev_map; > +} > + > +static struct cgroup_subsys_state * > +devs_create(struct cgroup_subsys *ss, struct cgroup *cont) > +{ > + struct devs_cgroup *devs; > + > + devs = kzalloc(sizeof(struct devs_cgroup), GFP_KERNEL); > + if (devs == NULL) > + goto out; > + > + devs->cdev_map = cdev_map_init(); > + if (devs->cdev_map == NULL) > + goto out_free; > + > + devs->bdev_map = bdev_map_init(); > + if (devs->bdev_map == NULL) > + goto out_free_cdev; > + > + return &devs->css; > + > +out_free_cdev: > + cdev_map_fini(devs->cdev_map); > +out_free: > + kfree(devs); > +out: > + return ERR_PTR(-ENOMEM); > +} Thanks for working on this, Pavel. My only question with this patch is - so if I create a devs cgroup which only has access to, say /dev/loop0 and /dev/tty3, and someone in that cgroup manages to create a new cgroup, the new cgroup will have all the default permissions again, rather than inherit the permissions from this cgroup, right? > + > +static void devs_destroy(struct cgroup_subsys *ss, struct cgroup *cont) > +{ > + struct devs_cgroup *devs; > + > + devs = cgroup_to_devs(cont); > + bdev_map_fini(devs->bdev_map); > + cdev_map_fini(devs->cdev_map); > + kfree(devs); > +} > + > +static int decode_perms_str(char *buf, int *chrdev, dev_t *dev, > + int *all, mode_t *mode) > +{ > + unsigned int major, minor; > + char *end; > + mode_t tmp; > + > + /* [cb] <major>:<minor> [r-][w-] */ > + > + if (buf[0] == 'c') > + *chrdev = 1; > + else if (buf[0] == 'b') > + *chrdev = 0; > + else > + return -EINVAL; > + > + if (buf[1] != ' ') > + return -EINVAL; > + > + major = simple_strtoul(buf + 2, &end, 10); > + if (*end != ':') > + return -EINVAL; > + > + if (end[1] == '*') { > + if (end[2] != ' ') > + return -EINVAL; > + > + *all = 1; > + minor = 0; > + end += 2; > + } else { > + minor = simple_strtoul(end + 1, &end, 10); > + if (*end != ' ') > + return -EINVAL; > + > + *all = 0; > + } > + > + tmp = 0; > + > + if (end[1] == 'r') > + tmp |= FMODE_READ; > + else if (end[1] != '-') > + return -EINVAL; > + if (end[2] == 'w') > + tmp |= FMODE_WRITE; > + else if (end[2] != '-') > + return -EINVAL; > + > + *dev = MKDEV(major, minor); > + *mode = tmp; > + return 0; > +} > + > +static int encode_perms_str(char *buf, int len, int chrdev, dev_t dev, > + int all, mode_t mode) > +{ > + int ret; > + > + ret = snprintf(buf, len, "%c %d:", chrdev ? 'c' : 'b', MAJOR(dev)); > + if (all) > + ret += snprintf(buf + ret, len - ret, "*"); > + else > + ret += snprintf(buf + ret, len - ret, "%d", MINOR(dev)); > + > + ret += snprintf(buf + ret, len - ret, " %c%c\n", > + (mode & FMODE_READ) ? 'r' : '-', > + (mode & FMODE_WRITE) ? 'w' : '-'); > + > + return ret + 1; > +} > + > +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft, > + struct file *f, const char __user *ubuf, > + size_t nbytes, loff_t *pos) > +{ > + int err, all, chrdev; > + dev_t dev; > + char buf[64]; > + struct devs_cgroup *devs; > + mode_t mode; (Of course this will require some privilege, i assume that's a detail you'll add next time around) > + > + if (copy_from_user(buf, ubuf, sizeof(buf))) > + return -EFAULT; > + > + buf[sizeof(buf) - 1] = 0; > + err = decode_perms_str(buf, &chrdev, &dev, &all, &mode); > + if (err < 0) > + return err; > + > + devs = cgroup_to_devs(cont); > + > + if (mode == 0) { > + if (chrdev) > + err = cdev_del_from_map(devs->cdev_map, dev, all); > + else > + err = bdev_del_from_map(devs->bdev_map, dev, all); > + > + if (err < 0) > + return err; > + > + css_put(&devs->css); > + } else { > + if (chrdev) > + err = cdev_add_to_map(devs->cdev_map, dev, all, mode); > + else > + err = bdev_add_to_map(devs->bdev_map, dev, all, mode); > + > + if (err < 0) > + return err; > + > + css_get(&devs->css); > + } > + > + return nbytes; > +} > + > +struct devs_dump_arg { > + char *buf; > + int pos; > + int chrdev; > +}; > + > +static int devs_dump_one(dev_t dev, int range, mode_t mode, void *x) > +{ > + struct devs_dump_arg *arg = x; > + char tmp[64]; > + int len; > + > + len = encode_perms_str(tmp, sizeof(tmp), arg->chrdev, dev, > + range != 1, mode); > + > + if (arg->pos >= PAGE_SIZE - len) > + return 1; > + > + memcpy(arg->buf + arg->pos, tmp, len); > + arg->pos += len; > + return 0; > +} > + > +static ssize_t devs_read(struct cgroup *cont, struct cftype *cft, > + struct file *f, char __user *ubuf, size_t nbytes, loff_t *pos) > +{ > + struct devs_dump_arg arg; > + struct devs_cgroup *devs; > + ssize_t ret; > + > + arg.buf = (char *)__get_free_page(GFP_KERNEL); > + if (arg.buf == NULL) > + return -ENOMEM; > + > + devs = cgroup_to_devs(cont); > + arg.pos = 0; > + > + arg.chrdev = 1; > + cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg); > + > + arg.chrdev = 0; > + bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg); > + > + ret = simple_read_from_buffer(ubuf, nbytes, pos, > + arg.buf, arg.pos); > + > + free_page((unsigned long)arg.buf); > + return ret; > +} > + > +static struct cftype devs_files[] = { > + { > + .name = "permissions", > + .write = devs_write, > + .read = devs_read, > + }, > +}; > + > +static int devs_populate(struct cgroup_subsys *ss, struct cgroup *cont) > +{ > + return cgroup_add_files(cont, ss, > + devs_files, ARRAY_SIZE(devs_files)); > +} > + > +struct cgroup_subsys devs_subsys = { > + .name = "devices", > + .subsys_id = devs_subsys_id, > + .create = devs_create, > + .destroy = devs_destroy, > + .populate = devs_populate, > +}; > diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h > index 228235c..9c0cd2c 100644 > --- a/include/linux/cgroup_subsys.h > +++ b/include/linux/cgroup_subsys.h > @@ -42,3 +42,9 @@ SUBSYS(mem_cgroup) > #endif > > /* */ > + > +#ifdef CONFIG_CGROUP_DEVS > +SUBSYS(devs) > +#endif > + > +/* */ > diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h > new file mode 100644 > index 0000000..5f574e0 > --- /dev/null > +++ b/include/linux/devscontrol.h > @@ -0,0 +1,20 @@ > +#ifndef __DEVS_CONTROL_H__ > +#define __DEVS_CONTROL_H__ > +struct kobj_map; > +struct task_struct; > + > +#ifdef CONFIG_CGROUP_DEVS > +struct kobj_map *task_cdev_map(struct task_struct *); > +struct kobj_map *task_bdev_map(struct task_struct *); > +#else > +static inline kobj_map *task_cdev_map(struct task_struct *tsk) > +{ > + return NULL; > +} > + > +static inline kobj_map *task_bdev_map(struct task_struct *tsk) > +{ > + return NULL; > +} > +#endif > +#endif > diff --git a/init/Kconfig b/init/Kconfig > index b886ac9..4dd53d6 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -283,6 +283,12 @@ config CGROUP_DEBUG > > Say N if unsure > > +config CGROUP_DEVS > + bool "Devices cgroup subsystem" > + depends on CGROUPS > + help > + Controlls the visibility of devices > + > config CGROUP_NS > bool "Namespace cgroup subsystem" > depends on CGROUPS ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <20080114174056.GB15077-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <20080114174056.GB15077-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org> @ 2008-01-15 7:53 ` Pavel Emelyanov [not found] ` <478C6669.7070705-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Pavel Emelyanov @ 2008-01-15 7:53 UTC (permalink / raw) To: Serge E. Hallyn; +Cc: Linux Containers, Paul Menage [snip] > Thanks for working on this, Pavel. > > My only question with this patch is - so if I create a devs > cgroup which only has access to, say /dev/loop0 and /dev/tty3, > and someone in that cgroup manages to create a new cgroup, the > new cgroup will have all the default permissions again, rather > than inherit the permissions from this cgroup, right? Right. When you create a new cgroup you have an empty perms set. Maybe it's worth inheriting the perms from the parent container, but I think that empty set is better as you will reconfigure it anyway. [snip] >> +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft, >> + struct file *f, const char __user *ubuf, >> + size_t nbytes, loff_t *pos) >> +{ >> + int err, all, chrdev; >> + dev_t dev; >> + char buf[64]; >> + struct devs_cgroup *devs; >> + mode_t mode; > > (Of course this will require some privilege, i assume that's a detail > you'll add next time around) Hm... I though that privileges are governed at the cgroup level.... No? [snip] ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <478C6669.7070705-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <478C6669.7070705-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> @ 2008-01-15 14:44 ` Serge E. Hallyn [not found] ` <20080115144440.GE4453-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Serge E. Hallyn @ 2008-01-15 14:44 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org): > [snip] > > > Thanks for working on this, Pavel. > > > > My only question with this patch is - so if I create a devs > > cgroup which only has access to, say /dev/loop0 and /dev/tty3, > > and someone in that cgroup manages to create a new cgroup, the > > new cgroup will have all the default permissions again, rather > > than inherit the permissions from this cgroup, right? > > Right. When you create a new cgroup you have an empty perms > set. Maybe it's worth inheriting the perms from the parent > container, but I think that empty set is better as you will > reconfigure it anyway. > > [snip] > > >> +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft, > >> + struct file *f, const char __user *ubuf, > >> + size_t nbytes, loff_t *pos) > >> +{ > >> + int err, all, chrdev; > >> + dev_t dev; > >> + char buf[64]; > >> + struct devs_cgroup *devs; > >> + mode_t mode; > > > > (Of course this will require some privilege, i assume that's a detail > > you'll add next time around) > > Hm... I though that privileges are governed at the cgroup level.... No? I don't think so... Wouldn't really make sense for the cgroup infrastructure to presume to know what to enforce, and I don't see any checks around the _write functions in cgroup.c, and no capable() calls at all. -serge ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <20080115144440.GE4453-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <20080115144440.GE4453-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org> @ 2008-01-15 16:13 ` Paul Menage [not found] ` <6599ad830801150813s6a5a7374qd25b6d6206d5896a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Paul Menage @ 2008-01-15 16:13 UTC (permalink / raw) To: Serge E. Hallyn; +Cc: Linux Containers, Pavel Emelyanov On Jan 15, 2008 6:44 AM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > I don't think so... Wouldn't really make sense for the cgroup > infrastructure to presume to know what to enforce, and I don't see any > checks around the _write functions in cgroup.c, and no capable() calls > at all. The cgroup filesystem can provide simple unix-level permissions on any given file. Am I right in thinking that having an entry in the mapper doesn't automatically give privileges for a device to the members of the cgroup, but they also have to have sufficient privilege in their own right? If so, that might be sufficient. One other thought - should the parse/print routines themselves do a translation based on the device mappings for the writer/reader's cgroup? That way you could safely give a VE full permission to write to its children's device maps, but it would only be able to add/remap device targets that it could address itself. Paul ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <6599ad830801150813s6a5a7374qd25b6d6206d5896a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <6599ad830801150813s6a5a7374qd25b6d6206d5896a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-01-15 17:49 ` Serge E. Hallyn [not found] ` <20080115174941.GA11638-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Serge E. Hallyn @ 2008-01-15 17:49 UTC (permalink / raw) To: Paul Menage; +Cc: Linux Containers, Pavel Emelyanov Quoting Paul Menage (menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org): > On Jan 15, 2008 6:44 AM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > > > I don't think so... Wouldn't really make sense for the cgroup > > infrastructure to presume to know what to enforce, and I don't see any > > checks around the _write functions in cgroup.c, and no capable() calls > > at all. > > The cgroup filesystem can provide simple unix-level permissions on any > given file. Am I right in thinking that having an entry in the mapper > doesn't automatically give privileges for a device to the members of > the cgroup, but they also have to have sufficient privilege in their > own right? If so, that might be sufficient. Oh, well actually I think what we'd want is to require both CAP_NS_OVERRIDE and either CAP_MKNOD or CAP_SYS_ADMIN. So it's probably fine to leave this as is for now, and after I resend the patchset which pushes CAP_NS_OVERRIDE (which is in a 4-patch userns patchset I've been sitting on) the extra checks can be added. > One other thought - should the parse/print routines themselves do a > translation based on the device mappings for the writer/reader's > cgroup? That way you could safely give a VE full permission to write > to its children's device maps, but it would only be able to add/remap > device targets that it could address itself. Oh, well if we do this then we can just as well use the translation functions to not allow a VE to add to its own set of devices, right? Then maybe capable(CAP_NS_OVERRIDE|CAP_SYS_ADMIN) would only be required to add devices. Though there *is* some bit of danger to removing devices from a privileged daemon, isn't there? Though I can't think of examples just now. (Sorry, piercing headache, can't think quite right, will think about this later) -serge ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <20080115174941.GA11638-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <20080115174941.GA11638-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org> @ 2008-01-15 17:54 ` Paul Menage [not found] ` <6599ad830801150954w7e1b6db0p4dd737730f407348-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Paul Menage @ 2008-01-15 17:54 UTC (permalink / raw) To: Serge E. Hallyn; +Cc: Linux Containers, Pavel Emelyanov On Jan 15, 2008 9:49 AM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > One other thought - should the parse/print routines themselves do a > > translation based on the device mappings for the writer/reader's > > cgroup? That way you could safely give a VE full permission to write > > to its children's device maps, but it would only be able to add/remap > > device targets that it could address itself. > > Oh, well if we do this then we can just as well use the translation > functions to not allow a VE to add to its own set of devices, right? Right. > > Then maybe capable(CAP_NS_OVERRIDE|CAP_SYS_ADMIN) would only be required > to add devices. Or simply require that they be added by someone who already has access to that device via their own control group? The root cgroup would have access to all devices. Paul ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <6599ad830801150954w7e1b6db0p4dd737730f407348-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <6599ad830801150954w7e1b6db0p4dd737730f407348-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-01-15 18:17 ` Serge E. Hallyn 0 siblings, 0 replies; 42+ messages in thread From: Serge E. Hallyn @ 2008-01-15 18:17 UTC (permalink / raw) To: Paul Menage; +Cc: Linux Containers, Pavel Emelyanov Quoting Paul Menage (menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org): > On Jan 15, 2008 9:49 AM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > > One other thought - should the parse/print routines themselves do a > > > translation based on the device mappings for the writer/reader's > > > cgroup? That way you could safely give a VE full permission to write > > > to its children's device maps, but it would only be able to add/remap > > > device targets that it could address itself. > > > > Oh, well if we do this then we can just as well use the translation > > functions to not allow a VE to add to its own set of devices, right? > > Right. > > > > > Then maybe capable(CAP_NS_OVERRIDE|CAP_SYS_ADMIN) would only be required > > to add devices. > > Or simply require that they be added by someone who already has access > to that device via their own control group? The root cgroup would have > access to all devices. Where by 'have access' you mean access to create the device? That sounds good. thanks, -serge ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] The control group itself [not found] ` <47833FF6.6060901-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 2008-01-14 17:40 ` Serge E. Hallyn @ 2008-01-14 21:54 ` Paul Menage [not found] ` <6599ad830801141354p5b165cdao8d6184adb9ab61b6-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 42+ messages in thread From: Paul Menage @ 2008-01-14 21:54 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Linux Containers On Jan 8, 2008 1:18 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote: > Each new group will have its own maps for char and block > layers. The devices access list is tuned via the > devices.permissions file. > > One may read from the file to get the configured > state. E.g. > > # cat /cont/devices/0/devices.permissions > c 1:* rw > b 8:1 rw > > The top container isn't initialized, so that the char > and block layers will use the global maps to lookup > their devices. > > Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> > > --- > > diff --git a/fs/Makefile b/fs/Makefile > index 82b6ae1..a085706 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -63,6 +63,8 @@ obj-y += devpts/ > > obj-$(CONFIG_PROFILING) += dcookies.o > obj-$(CONFIG_DLM) += dlm/ > + > +obj-$(CONFIG_CGROUP_DEVS) += devscontrol.o > > # Do not add any filesystems before this line > obj-$(CONFIG_REISERFS_FS) += reiserfs/ > diff --git a/fs/devscontrol.c b/fs/devscontrol.c > new file mode 100644 > index 0000000..ea282f3 > --- /dev/null > +++ b/fs/devscontrol.c > @@ -0,0 +1,291 @@ > +/* > + * devscontrol.c - Device Controller > + * > + * Copyright 2007 OpenVZ SWsoft Inc > + * Author: Pavel Emelyanov <xemul at openvz.org> > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or > + * (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + */ > + > +#include <linux/cgroup.h> > +#include <linux/cdev.h> > +#include <linux/err.h> > +#include <linux/devscontrol.h> > +#include <linux/uaccess.h> > +#include <linux/fs.h> > +#include <linux/genhd.h> > + > +struct devs_cgroup { > + struct cgroup_subsys_state css; > + > + struct kobj_map *cdev_map; > + struct kobj_map *bdev_map; > +}; > + > +static inline > +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css) > +{ > + return container_of(css, struct devs_cgroup, css); > +} > + > +static inline > +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont) > +{ > + return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id)); > +} > + > +struct kobj_map *task_cdev_map(struct task_struct *tsk) > +{ > + struct cgroup_subsys_state *css; > + > + css = task_subsys_state(tsk, devs_subsys_id); > + if (css->cgroup->parent == NULL) > + return NULL; > + else > + return css_to_devs(css)->cdev_map; > +} For this (and task_bdev_map) it would be cleaner to just leave the cdev_map and bdev_map in the root cgroup as NULL, so you could just make this struct kobj_map *task_cdev_map(struct task_struct *tsk) { return css_to_devs(task_subsys_state(tsk, devs_subsys_id))->cdev_map; } > + > +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft, > + struct file *f, const char __user *ubuf, > + size_t nbytes, loff_t *pos) > +{ > + int err, all, chrdev; > + dev_t dev; > + char buf[64]; > + struct devs_cgroup *devs; > + mode_t mode; > + > + if (copy_from_user(buf, ubuf, sizeof(buf))) > + return -EFAULT; > + > + buf[sizeof(buf) - 1] = 0; > + err = decode_perms_str(buf, &chrdev, &dev, &all, &mode); > + if (err < 0) > + return err; > + > + devs = cgroup_to_devs(cont); > + > + if (mode == 0) { > + if (chrdev) > + err = cdev_del_from_map(devs->cdev_map, dev, all); > + else > + err = bdev_del_from_map(devs->bdev_map, dev, all); There's no locking involved on these calls, other than the cgroups code guaranteeing that the subsystem objects themselves won't go away. A quick look over the kobj_map calls suggests that these calls may be thread-safe - can you confirm that. (And maybe comment at the top of the function?) > + arg.buf = (char *)__get_free_page(GFP_KERNEL); > + if (arg.buf == NULL) > + return -ENOMEM; > + > + devs = cgroup_to_devs(cont); > + arg.pos = 0; > + > + arg.chrdev = 1; > + cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg); > + > + arg.chrdev = 0; > + bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg); Is there any chance of this overflowing the buffer page? Paul ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <6599ad830801141354p5b165cdao8d6184adb9ab61b6-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <6599ad830801141354p5b165cdao8d6184adb9ab61b6-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-01-15 7:58 ` Pavel Emelyanov 0 siblings, 0 replies; 42+ messages in thread From: Pavel Emelyanov @ 2008-01-15 7:58 UTC (permalink / raw) To: Paul Menage; +Cc: Linux Containers Paul Menage wrote: > On Jan 8, 2008 1:18 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote: >> Each new group will have its own maps for char and block >> layers. The devices access list is tuned via the >> devices.permissions file. >> >> One may read from the file to get the configured >> state. E.g. >> >> # cat /cont/devices/0/devices.permissions >> c 1:* rw >> b 8:1 rw >> >> The top container isn't initialized, so that the char >> and block layers will use the global maps to lookup >> their devices. >> >> Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> >> >> --- >> >> diff --git a/fs/Makefile b/fs/Makefile >> index 82b6ae1..a085706 100644 >> --- a/fs/Makefile >> +++ b/fs/Makefile >> @@ -63,6 +63,8 @@ obj-y += devpts/ >> >> obj-$(CONFIG_PROFILING) += dcookies.o >> obj-$(CONFIG_DLM) += dlm/ >> + >> +obj-$(CONFIG_CGROUP_DEVS) += devscontrol.o >> >> # Do not add any filesystems before this line >> obj-$(CONFIG_REISERFS_FS) += reiserfs/ >> diff --git a/fs/devscontrol.c b/fs/devscontrol.c >> new file mode 100644 >> index 0000000..ea282f3 >> --- /dev/null >> +++ b/fs/devscontrol.c >> @@ -0,0 +1,291 @@ >> +/* >> + * devscontrol.c - Device Controller >> + * >> + * Copyright 2007 OpenVZ SWsoft Inc >> + * Author: Pavel Emelyanov <xemul at openvz.org> >> + * >> + * This program is free software; you can redistribute it and/or modify >> + * it under the terms of the GNU General Public License as published by >> + * the Free Software Foundation; either version 2 of the License, or >> + * (at your option) any later version. >> + * >> + * This program is distributed in the hope that it will be useful, >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the >> + * GNU General Public License for more details. >> + */ >> + >> +#include <linux/cgroup.h> >> +#include <linux/cdev.h> >> +#include <linux/err.h> >> +#include <linux/devscontrol.h> >> +#include <linux/uaccess.h> >> +#include <linux/fs.h> >> +#include <linux/genhd.h> >> + >> +struct devs_cgroup { >> + struct cgroup_subsys_state css; >> + >> + struct kobj_map *cdev_map; >> + struct kobj_map *bdev_map; >> +}; >> + >> +static inline >> +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css) >> +{ >> + return container_of(css, struct devs_cgroup, css); >> +} >> + >> +static inline >> +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont) >> +{ >> + return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id)); >> +} >> + >> +struct kobj_map *task_cdev_map(struct task_struct *tsk) >> +{ >> + struct cgroup_subsys_state *css; >> + >> + css = task_subsys_state(tsk, devs_subsys_id); >> + if (css->cgroup->parent == NULL) >> + return NULL; >> + else >> + return css_to_devs(css)->cdev_map; >> +} > > For this (and task_bdev_map) it would be cleaner to just leave the > cdev_map and bdev_map in the root cgroup as NULL, so you could just > make this > > struct kobj_map *task_cdev_map(struct task_struct *tsk) > { > return css_to_devs(task_subsys_state(tsk, devs_subsys_id))->cdev_map; > } Ok, thanks, I'll make it in the next version. >> + >> +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft, >> + struct file *f, const char __user *ubuf, >> + size_t nbytes, loff_t *pos) >> +{ >> + int err, all, chrdev; >> + dev_t dev; >> + char buf[64]; >> + struct devs_cgroup *devs; >> + mode_t mode; >> + >> + if (copy_from_user(buf, ubuf, sizeof(buf))) >> + return -EFAULT; >> + >> + buf[sizeof(buf) - 1] = 0; >> + err = decode_perms_str(buf, &chrdev, &dev, &all, &mode); >> + if (err < 0) >> + return err; >> + >> + devs = cgroup_to_devs(cont); >> + >> + if (mode == 0) { >> + if (chrdev) >> + err = cdev_del_from_map(devs->cdev_map, dev, all); >> + else >> + err = bdev_del_from_map(devs->bdev_map, dev, all); > > There's no locking involved on these calls, other than the cgroups > code guaranteeing that the subsystem objects themselves won't go away. > A quick look over the kobj_map calls suggests that these calls may be > thread-safe - can you confirm that. (And maybe comment at the top of > the function?) Yup. The locking inside this call is unnecessary as maps are atomic by themselves. I'll add the comment about it. >> + arg.buf = (char *)__get_free_page(GFP_KERNEL); >> + if (arg.buf == NULL) >> + return -ENOMEM; >> + >> + devs = cgroup_to_devs(cont); >> + arg.pos = 0; >> + >> + arg.chrdev = 1; >> + cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg); >> + >> + arg.chrdev = 0; >> + bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg); > > Is there any chance of this overflowing the buffer page? Well, I have checks that we've filled all the page and do not go on dumping the info in this case. So the only bad thing user can have is that he doesn't see the full perms list, but the kernel won't OOPs ;) Since we have from 8 to 16 bytes per perm line, we may have from 256 to 256 permissions set for cgroup. Not that much, but that's enough for a first time (I haven't seen any OpenVZ user who sets more than 50 devperms). But sure this is one of TODO-s for the next patch versions. > Paul > ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] Devices accessibility control group (v2) [not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> ` (3 preceding siblings ...) 2008-01-08 9:18 ` [PATCH 4/4] The control group itself Pavel Emelyanov @ 2008-01-12 21:20 ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA [not found] ` <20080112212014.GA12085-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2008-01-14 21:18 ` Paul Menage 5 siblings, 1 reply; 42+ messages in thread From: sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2008-01-12 21:20 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage Pavel Emelianov [xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org] wrote: | The first version was posted long ago | (http://openvz.org/pipermail/devel/2007-September/007647.html) | and since then there are many (good I hope) changes: | | * Added the block devices support :) It turned out to | be a bit simpler than the char one (or I missed | something significant); | * Now we can enable/disable not just individual devices, | but the whole major with all its minors (see the TODO | list beyond as well); | * Added the ability to restrict the read/write permissions | to devices, not just visible/invisible state. | | That is - the main features I wished to implement right | after the v1 was sent. Some minor changes are: | | * I merged the devices.char and devices.block files into | one - devices.permissions; | * As the result of the change above - the strings passed | to this file has changed. Now they are | [bc] <major>:{<minor>|*} [r-][w-] | E.g. b 5:2 r- will grant the read permissions to the | block 5:2 device and c 3:* -w will grant the write-only | access to all the character devices with the major 5. | | However, there are some things to be done: | | * Make the /proc/devices show relevant info depending on | who is reading it. This seems to be easy to do, since | I already have the support to dump similar info into the | devices.permissions file, but I haven't tried to use | this in /proc/devices yet; | * Add the support for devices ranges. I.e. someone might | wish to tell smth like b 5:[0-10] r- to this subsystem. | Currently this is not supported and I'm afraid that if we | start support minor ranges we'll have smth similar to | VMA-s or FLOCK-s ranges management in one more place in the | kernel. | * One more question is - are there any other permissions to | work with? E.g. in OpenVZ we have a separate bit for | quota management, maybe we can invent some more... | | Currently I didn't pay much attention to split this set well, | so this will most likely won't work with git-bisect, but I | think this is OK for now. I will sure split it better when I | send the v3 and further. | | The set is prepared against the 2.6.24-rc5-mm1. | | All this is minimally tested and seems to work. Hope to hear | you comments, wishes and patches soon :) | | To play with it - run a standard procedure: | | # mount -t container none /cont/devs -o devices This should be '-t cgroup' | # mkdir /cont/devs/0 | # echo -n $$ > /cont/devs/0/tasks | | and tune device permissions. I started playing with this and noticed that even if I try to enable read access to device [c, 1:3] it also grants access to device [c, 1:5]. i.e the access restrictions seem to apply to all devices with a given major number. Is that really the intent ? Both devices accessible here: # hexdump /dev/null # hexdump /dev/zero 0000000 0000 0000 0000 0000 0000 0000 0000 0000 * ^C Neither device accessible: # echo $$ > /container/devs/0/tasks # hexdump /dev/zero hexdump: /dev/zero: No such device or address hexdump: /dev/zero: Bad file descriptor # hexdump /dev/null hexdump: /dev/null: No such device or address hexdump: /dev/null: Bad file descriptor Grant read access to /dev/null, but /dev/zero is also readable # echo c 1:3 r- > /container/devs/0/devices.permissions # hexdump /dev/null # hexdump /dev/zero 0000000 0000 0000 0000 0000 0000 0000 0000 0000 * ^C Remove read access to /dev/null, but /dev/zero is also not readable. # echo c 1:3 -- > /container/devs/0/devices.permissions # hexdump /dev/zero hexdump: /dev/zero: No such device or address hexdump: /dev/zero: Bad file descriptor BTW, a question about cgroups: If we 'echo $$ > /container/devs/0/tasks' is there a way to remove/undo it later (so that the process has access as before) ? ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <20080112212014.GA12085-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 0/4] Devices accessibility control group (v2) [not found] ` <20080112212014.GA12085-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2008-01-14 7:52 ` Pavel Emelyanov [not found] ` <478B14DB.4000106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Pavel Emelyanov @ 2008-01-14 7:52 UTC (permalink / raw) To: sukadev-r/Jw6+rmf7HQT0dZR+AlfA; +Cc: Linux Containers, Paul Menage sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org wrote: > Pavel Emelianov [xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org] wrote: > | The first version was posted long ago > | (http://openvz.org/pipermail/devel/2007-September/007647.html) > | and since then there are many (good I hope) changes: > | > | * Added the block devices support :) It turned out to > | be a bit simpler than the char one (or I missed > | something significant); > | * Now we can enable/disable not just individual devices, > | but the whole major with all its minors (see the TODO > | list beyond as well); > | * Added the ability to restrict the read/write permissions > | to devices, not just visible/invisible state. > | > | That is - the main features I wished to implement right > | after the v1 was sent. Some minor changes are: > | > | * I merged the devices.char and devices.block files into > | one - devices.permissions; > | * As the result of the change above - the strings passed > | to this file has changed. Now they are > | [bc] <major>:{<minor>|*} [r-][w-] > | E.g. b 5:2 r- will grant the read permissions to the > | block 5:2 device and c 3:* -w will grant the write-only > | access to all the character devices with the major 5. > | > | However, there are some things to be done: > | > | * Make the /proc/devices show relevant info depending on > | who is reading it. This seems to be easy to do, since > | I already have the support to dump similar info into the > | devices.permissions file, but I haven't tried to use > | this in /proc/devices yet; > | * Add the support for devices ranges. I.e. someone might > | wish to tell smth like b 5:[0-10] r- to this subsystem. > | Currently this is not supported and I'm afraid that if we > | start support minor ranges we'll have smth similar to > | VMA-s or FLOCK-s ranges management in one more place in the > | kernel. > | * One more question is - are there any other permissions to > | work with? E.g. in OpenVZ we have a separate bit for > | quota management, maybe we can invent some more... > | > | Currently I didn't pay much attention to split this set well, > | so this will most likely won't work with git-bisect, but I > | think this is OK for now. I will sure split it better when I > | send the v3 and further. > | > | The set is prepared against the 2.6.24-rc5-mm1. > | > | All this is minimally tested and seems to work. Hope to hear > | you comments, wishes and patches soon :) > | > | To play with it - run a standard procedure: > | > | # mount -t container none /cont/devs -o devices > > This should be '-t cgroup' Right :) Thank you for the feedback. Serge, Oren, do you have smth to tell about this set? I planned to show it to Andrew this week, hope he will find time to look at it :) > | # mkdir /cont/devs/0 > | # echo -n $$ > /cont/devs/0/tasks > | > | and tune device permissions. > > I started playing with this and noticed that even if I try to > enable read access to device [c, 1:3] it also grants access > to device [c, 1:5]. Hm... I can't reproduce this: # /bin/echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions # /bin/echo -n $$ > /cnt/dev/0/tasks # cat /cnt/dev/0/devices.permissions c 1:3 r- # hexdump /dev/null # hexdump /dev/zero hexdump: /dev/zero: No such device or address hexdump: /dev/zero: Bad file descriptor Maybe you have played with devs cgroups before getting this? Can you show what's the contents of the devices.permissions file in your case? > i.e the access restrictions seem to apply to all devices with > a given major number. Is that really the intent ? > > Both devices accessible here: > # hexdump /dev/null > # hexdump /dev/zero > 0000000 0000 0000 0000 0000 0000 0000 0000 0000 > * > ^C > > Neither device accessible: > > # echo $$ > /container/devs/0/tasks > # hexdump /dev/zero > hexdump: /dev/zero: No such device or address > hexdump: /dev/zero: Bad file descriptor > # hexdump /dev/null > hexdump: /dev/null: No such device or address > hexdump: /dev/null: Bad file descriptor > > Grant read access to /dev/null, but /dev/zero is also readable > > # echo c 1:3 r- > /container/devs/0/devices.permissions > # hexdump /dev/null > # hexdump /dev/zero > 0000000 0000 0000 0000 0000 0000 0000 0000 0000 > * > ^C > > Remove read access to /dev/null, but /dev/zero is also not > readable. > > # echo c 1:3 -- > /container/devs/0/devices.permissions > # hexdump /dev/zero > hexdump: /dev/zero: No such device or address > hexdump: /dev/zero: Bad file descriptor > > BTW, a question about cgroups: If we 'echo $$ > /container/devs/0/tasks' > is there a way to remove/undo it later (so that the process has access > as before) ? I've always thought that it's to move the task to top cgrop, i.e. echo $$ > /container/devs/tasks Thanks, Pavel > ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <478B14DB.4000106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 0/4] Devices accessibility control group (v2) [not found] ` <478B14DB.4000106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> @ 2008-01-14 17:42 ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA [not found] ` <20080114174220.GA17825-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2008-01-14 17:42 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage | > I started playing with this and noticed that even if I try to | > enable read access to device [c, 1:3] it also grants access | > to device [c, 1:5]. | | Hm... I can't reproduce this: | | # /bin/echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions | # /bin/echo -n $$ > /cnt/dev/0/tasks | # cat /cnt/dev/0/devices.permissions | c 1:3 r- | # hexdump /dev/null | # hexdump /dev/zero | hexdump: /dev/zero: No such device or address | hexdump: /dev/zero: Bad file descriptor | | Maybe you have played with devs cgroups before getting this? | Can you show what's the contents of the devices.permissions file | in your case? Here is the repro again. I even tried after a reboot. Basically, granting access to /dev/null is also granting access to /dev/zero. # cat devices.permissions # hexdump /dev/zero hexdump: /dev/zero: No such device or address hexdump: /dev/zero: Bad file descriptor # hexdump /dev/null hexdump: /dev/null: No such device or address hexdump: /dev/null: Bad file descriptor # echo 'c 1:3 r-' > devices.permissions # hexdump /dev/null # hexdump /dev/zero 0000000 0000 0000 0000 0000 0000 0000 0000 0000 * ^C # cat tasks 3279 22266 # ps PID TTY TIME CMD 3279 pts/0 00:00:00 bash 22267 pts/0 00:00:00 ps ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <20080114174220.GA17825-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 0/4] Devices accessibility control group (v2) [not found] ` <20080114174220.GA17825-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2008-01-15 8:22 ` Pavel Emelyanov [not found] ` <478C6D2B.6020904-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Pavel Emelyanov @ 2008-01-15 8:22 UTC (permalink / raw) To: sukadev-r/Jw6+rmf7HQT0dZR+AlfA; +Cc: Linux Containers, Paul Menage sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org wrote: > | > I started playing with this and noticed that even if I try to > | > enable read access to device [c, 1:3] it also grants access > | > to device [c, 1:5]. > | > | Hm... I can't reproduce this: > | > | # /bin/echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions > | # /bin/echo -n $$ > /cnt/dev/0/tasks > | # cat /cnt/dev/0/devices.permissions > | c 1:3 r- > | # hexdump /dev/null > | # hexdump /dev/zero > | hexdump: /dev/zero: No such device or address > | hexdump: /dev/zero: Bad file descriptor > | > | Maybe you have played with devs cgroups before getting this? > | Can you show what's the contents of the devices.permissions file > | in your case? > > Here is the repro again. I even tried after a reboot. Basically, > granting access to /dev/null is also granting access to /dev/zero. > > # cat devices.permissions > # hexdump /dev/zero > hexdump: /dev/zero: No such device or address > hexdump: /dev/zero: Bad file descriptor > # hexdump /dev/null > hexdump: /dev/null: No such device or address > hexdump: /dev/null: Bad file descriptor > # echo 'c 1:3 r-' > devices.permissions > # hexdump /dev/null > # hexdump /dev/zero > 0000000 0000 0000 0000 0000 0000 0000 0000 0000 > * > ^C > # cat tasks > 3279 > 22266 > # ps > PID TTY TIME CMD > 3279 pts/0 00:00:00 bash > 22267 pts/0 00:00:00 ps > This all looks completely incomprehensible :( Here's my test: # mount -t cgroup none /cnt/dev/ -o devices # mkdir /cnt/dev/0 # /bin/echo -n $$ > /cnt/dev/0/tasks # cat /cnt/dev/0/devices.permissions # hexdump /dev/zero hexdump: /dev/zero: No such device or address hexdump: /dev/zero: Bad file descriptor # hexdump /dev/null hexdump: /dev/null: No such device or address hexdump: /dev/null: Bad file descriptor # echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions # cat /cnt/dev/0/devices.permissions c 1:3 r- # hexdump /dev/null # hexdump /dev/zero hexdump: /dev/zero: No such device or address hexdump: /dev/zero: Bad file descriptor Sukadev, could you please try to track the problem as you seem to be the only person who's experiencing problems with that. Thanks, Pavel ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <478C6D2B.6020904-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 0/4] Devices accessibility control group (v2) [not found] ` <478C6D2B.6020904-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> @ 2008-01-17 6:26 ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA [not found] ` <20080117062605.GA24475-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2008-01-17 6:26 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage Pavel Emelianov [xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org] wrote: | sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org wrote: | > | > I started playing with this and noticed that even if I try to | > | > enable read access to device [c, 1:3] it also grants access | > | > to device [c, 1:5]. | > | | > | Hm... I can't reproduce this: | > | | > | # /bin/echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions | > | # /bin/echo -n $$ > /cnt/dev/0/tasks | > | # cat /cnt/dev/0/devices.permissions | > | c 1:3 r- | > | # hexdump /dev/null | > | # hexdump /dev/zero | > | hexdump: /dev/zero: No such device or address | > | hexdump: /dev/zero: Bad file descriptor | > | | > | Maybe you have played with devs cgroups before getting this? | > | Can you show what's the contents of the devices.permissions file | > | in your case? | > | > Here is the repro again. I even tried after a reboot. Basically, | > granting access to /dev/null is also granting access to /dev/zero. | > | > # cat devices.permissions | > # hexdump /dev/zero | > hexdump: /dev/zero: No such device or address | > hexdump: /dev/zero: Bad file descriptor | > # hexdump /dev/null | > hexdump: /dev/null: No such device or address | > hexdump: /dev/null: Bad file descriptor | > # echo 'c 1:3 r-' > devices.permissions | > # hexdump /dev/null | > # hexdump /dev/zero | > 0000000 0000 0000 0000 0000 0000 0000 0000 0000 | > * | > ^C | > # cat tasks | > 3279 | > 22266 | > # ps | > PID TTY TIME CMD | > 3279 pts/0 00:00:00 bash | > 22267 pts/0 00:00:00 ps | > | | This all looks completely incomprehensible :( | | Here's my test: | # mount -t cgroup none /cnt/dev/ -o devices | # mkdir /cnt/dev/0 | # /bin/echo -n $$ > /cnt/dev/0/tasks | # cat /cnt/dev/0/devices.permissions | # hexdump /dev/zero | hexdump: /dev/zero: No such device or address | hexdump: /dev/zero: Bad file descriptor Can you try this sequence: - grant access to /dev/zero, - hexdump /dev/zero - revoke access to /dev/zero - hexdump /dev/null - hexdump /dev/zero. | # hexdump /dev/null | hexdump: /dev/null: No such device or address | hexdump: /dev/null: Bad file descriptor | # echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions | # cat /cnt/dev/0/devices.permissions | c 1:3 r- | # hexdump /dev/null | # hexdump /dev/zero | hexdump: /dev/zero: No such device or address | hexdump: /dev/zero: Bad file descriptor | | | Sukadev, could you please try to track the problem as you | seem to be the only person who's experiencing problems | with that. I suspect the 'caching' of the last_mode (that you introduce in PATCH 2/4) combined with the fact that /dev/zero, /dev/null, /dev/kmem etc share a _SINGLE_ 'struct cdev' leads to the problem I am running into with /dev/zero and /dev/null. Here is a what I suspect is happening (sorry, for low-level details) Following sequence seems to repro it consistently for me: $ mount -t cgroup none /container/devs/ -o devices $ mkdir /container/devs/0 $ cd !$ cd /container/devs/0 $ echo $$ > tasks $ hexdump /dev/zero hexdump: /dev/zero: No such device or address hexdump: /dev/zero: Bad file descriptor $ hexdump /dev/null hexdump: /dev/null: No such device or address hexdump: /dev/null: Bad file descriptor $ echo 'c 1:3 r-' > devices.permissions $ hexdump /dev/null $ hexdump /dev/zero hexdump: /dev/zero: No such device or address hexdump: /dev/zero: Bad file descriptor No surprise so far. $ echo 'c 1:5 r-' > devices.permissions $ hexdump /dev/zero 0000000 0000 0000 0000 0000 0000 0000 0000 0000 * ^C Now grant read access to /dev/zero and more importantly, create a properly initialized inode for it. $ echo 'c 1:5 --' > devices.permissions Then remove access to /dev/zero. This removes the kobject for /dev/zero from map. Also cdev_map_reset() sets cdev->last to NULL. $ hdz hexdump: /dev/zero: No such device or address hexdump: /dev/zero: Bad file descriptor Since cdev->last is NULL, chrdev_open() calls kobj_lookup() which returns a NULL kobj and the open fails. $ hexdump /dev/null # XXX Again, since cdev->last is NULL, kobj_lookup() is called, this time for /dev/null. This succeeds and cdev->last is correctly initialized. Eventually this open of /dev/null succeeds. $ hexdump /dev/zero 0000000 0000 0000 0000 0000 0000 0000 0000 0000 Now the open of /dev/zero also succeeds ! I suspect that the reason is that when we first successfully read /dev/zero, we created/initialized an inode for it. This inode has the inode->i_cdev set correctly. By reading /dev/null (marked XXX above), cdev->last is also correctly set. But since /dev/zero and /dev/null _SHARE_ a 'struct cdev', when we call chrdev_open() for /dev/zero, we check the permissions of this common cdev and grant /dev/zero the same permissions as /dev/null. I suspect we will get this behavior with all devices implemented by the 'mem' driver in drivers/char/mem.c. I was able to repro with /dev/full [c, 1:7]) Sukadev ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <20080117062605.GA24475-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 0/4] Devices accessibility control group (v2) [not found] ` <20080117062605.GA24475-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2008-01-21 8:31 ` Pavel Emelyanov 0 siblings, 0 replies; 42+ messages in thread From: Pavel Emelyanov @ 2008-01-21 8:31 UTC (permalink / raw) To: sukadev-r/Jw6+rmf7HQT0dZR+AlfA; +Cc: Linux Containers, Paul Menage sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org wrote: > Pavel Emelianov [xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org] wrote: > | sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org wrote: > | > | > I started playing with this and noticed that even if I try to > | > | > enable read access to device [c, 1:3] it also grants access > | > | > to device [c, 1:5]. > | > | > | > | Hm... I can't reproduce this: > | > | > | > | # /bin/echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions > | > | # /bin/echo -n $$ > /cnt/dev/0/tasks > | > | # cat /cnt/dev/0/devices.permissions > | > | c 1:3 r- > | > | # hexdump /dev/null > | > | # hexdump /dev/zero > | > | hexdump: /dev/zero: No such device or address > | > | hexdump: /dev/zero: Bad file descriptor > | > | > | > | Maybe you have played with devs cgroups before getting this? > | > | Can you show what's the contents of the devices.permissions file > | > | in your case? > | > > | > Here is the repro again. I even tried after a reboot. Basically, > | > granting access to /dev/null is also granting access to /dev/zero. > | > > | > # cat devices.permissions > | > # hexdump /dev/zero > | > hexdump: /dev/zero: No such device or address > | > hexdump: /dev/zero: Bad file descriptor > | > # hexdump /dev/null > | > hexdump: /dev/null: No such device or address > | > hexdump: /dev/null: Bad file descriptor > | > # echo 'c 1:3 r-' > devices.permissions > | > # hexdump /dev/null > | > # hexdump /dev/zero > | > 0000000 0000 0000 0000 0000 0000 0000 0000 0000 > | > * > | > ^C > | > # cat tasks > | > 3279 > | > 22266 > | > # ps > | > PID TTY TIME CMD > | > 3279 pts/0 00:00:00 bash > | > 22267 pts/0 00:00:00 ps > | > > | > | This all looks completely incomprehensible :( > | > | Here's my test: > | # mount -t cgroup none /cnt/dev/ -o devices > | # mkdir /cnt/dev/0 > | # /bin/echo -n $$ > /cnt/dev/0/tasks > | # cat /cnt/dev/0/devices.permissions > | # hexdump /dev/zero > | hexdump: /dev/zero: No such device or address > | hexdump: /dev/zero: Bad file descriptor > > Can you try this sequence: > > - grant access to /dev/zero, > - hexdump /dev/zero > - revoke access to /dev/zero > - hexdump /dev/null > - hexdump /dev/zero. OK, I'll try it, thanks. > | # hexdump /dev/null > | hexdump: /dev/null: No such device or address > | hexdump: /dev/null: Bad file descriptor > | # echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions > | # cat /cnt/dev/0/devices.permissions > | c 1:3 r- > | # hexdump /dev/null > | # hexdump /dev/zero > | hexdump: /dev/zero: No such device or address > | hexdump: /dev/zero: Bad file descriptor > | > | > | Sukadev, could you please try to track the problem as you > | seem to be the only person who's experiencing problems > | with that. > > > I suspect the 'caching' of the last_mode (that you introduce in PATCH 2/4) > combined with the fact that /dev/zero, /dev/null, /dev/kmem etc share > a _SINGLE_ 'struct cdev' leads to the problem I am running into with > /dev/zero and /dev/null. > > Here is a what I suspect is happening (sorry, for low-level details) > > Following sequence seems to repro it consistently for me: > > $ mount -t cgroup none /container/devs/ -o devices > $ mkdir /container/devs/0 > $ cd !$ > cd /container/devs/0 > $ echo $$ > tasks > > $ hexdump /dev/zero > hexdump: /dev/zero: No such device or address > hexdump: /dev/zero: Bad file descriptor > > $ hexdump /dev/null > hexdump: /dev/null: No such device or address > hexdump: /dev/null: Bad file descriptor > > $ echo 'c 1:3 r-' > devices.permissions > > $ hexdump /dev/null > > $ hexdump /dev/zero > hexdump: /dev/zero: No such device or address > hexdump: /dev/zero: Bad file descriptor > > No surprise so far. > > $ echo 'c 1:5 r-' > devices.permissions > $ hexdump /dev/zero > 0000000 0000 0000 0000 0000 0000 0000 0000 0000 > * > ^C > > Now grant read access to /dev/zero and more importantly, create a properly > initialized inode for it. > > $ echo 'c 1:5 --' > devices.permissions > > Then remove access to /dev/zero. This removes the kobject for /dev/zero from > map. Also cdev_map_reset() sets cdev->last to NULL. > > $ hdz > hexdump: /dev/zero: No such device or address > hexdump: /dev/zero: Bad file descriptor > > Since cdev->last is NULL, chrdev_open() calls kobj_lookup() which returns a > NULL kobj and the open fails. > > $ hexdump /dev/null # XXX > > Again, since cdev->last is NULL, kobj_lookup() is called, this time for > /dev/null. This succeeds and cdev->last is correctly initialized. > Eventually this open of /dev/null succeeds. > > $ hexdump /dev/zero > 0000000 0000 0000 0000 0000 0000 0000 0000 0000 > > Now the open of /dev/zero also succeeds ! Hm... The analysis looks correct. Thanks, Sukadev, I'll try to resolve this issue. > I suspect that the reason is that when we first successfully read /dev/zero, > we created/initialized an inode for it. This inode has the inode->i_cdev set > correctly. > > By reading /dev/null (marked XXX above), cdev->last is also correctly set. > > But since /dev/zero and /dev/null _SHARE_ a 'struct cdev', when we call > chrdev_open() for /dev/zero, we check the permissions of this common cdev > and grant /dev/zero the same permissions as /dev/null. > > I suspect we will get this behavior with all devices implemented by > the 'mem' driver in drivers/char/mem.c. I was able to repro with > /dev/full [c, 1:7]) > > Sukadev > ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] Devices accessibility control group (v2) [not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> ` (4 preceding siblings ...) 2008-01-12 21:20 ` [PATCH 0/4] Devices accessibility control group (v2) sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2008-01-14 21:18 ` Paul Menage [not found] ` <6599ad830801141318h121a6a80h9af68c52431c48b8-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 5 siblings, 1 reply; 42+ messages in thread From: Paul Menage @ 2008-01-14 21:18 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Linux Containers On Jan 8, 2008 1:02 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote: > * Add the support for devices ranges. I.e. someone might > wish to tell smth like b 5:[0-10] r- to this subsystem. I'd be inclined to leave support for that in userspace, rather than adding fancier parsing to the kernel API. Paul ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <6599ad830801141318h121a6a80h9af68c52431c48b8-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 0/4] Devices accessibility control group (v2) [not found] ` <6599ad830801141318h121a6a80h9af68c52431c48b8-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-01-15 8:06 ` Pavel Emelyanov 0 siblings, 0 replies; 42+ messages in thread From: Pavel Emelyanov @ 2008-01-15 8:06 UTC (permalink / raw) To: Paul Menage; +Cc: Linux Containers Paul Menage wrote: > On Jan 8, 2008 1:02 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote: >> * Add the support for devices ranges. I.e. someone might >> wish to tell smth like b 5:[0-10] r- to this subsystem. > > I'd be inclined to leave support for that in userspace, rather than > adding fancier parsing to the kernel API. Yup. I agree with that. > Paul > Thanks, Pavel ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH 0/4] Devices accessibility control group (v3, release candidate)
@ 2008-02-07 12:56 Pavel Emelyanov
[not found] ` <47AAFFF2.9030804-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
0 siblings, 1 reply; 42+ messages in thread
From: Pavel Emelyanov @ 2008-02-07 12:56 UTC (permalink / raw)
To: Serge Hallyn, Sukadev Bhattiprolu; +Cc: Linux Containers, Paul Menage
Changes from v2:
* Fixed problems pointed out by Sukadev with permissions
revoke. Now we have to perform kobject re-lookup on
each char device open, just like for block ones, so I
think this is OK.
The /proc/devices tune is still in TODO list, as I have
problems with getting majors _in_a_simple_manner_ from a
map, that contains a mix of major/minor pairs in
arbitrary order.
The second version is here:
http://openvz.org/pipermail/devel/2008-January/010160.html
Changes from v1:
* Added the block devices support :) It turned out to
be a bit simpler than the char one (or I missed
something significant);
* Now we can enable/disable not just individual devices,
but the whole major with all its minors (see the TODO
list beyond as well);
* Added the ability to restrict the read/write permissions
to devices, not just visible/invisible state.
The first version was here:
http://openvz.org/pipermail/devel/2007-September/007647.html
I still don't pay much attention to split this set well, so
this will most likely won't work with git-bisect, but I think
this is OK for now. I will sure split it better when I send
it to Andrew.
The set is prepared against the 2.6.24-rc8-mm1.
To play with it - run a standard procedure:
# mount -t container none /cont/devs -o devices
# mkdir /cont/devs/0
# echo -n $$ > /cont/devs/0/tasks
and tune device permissions.
Thanks,
Pavel
^ permalink raw reply [flat|nested] 42+ messages in thread[parent not found: <47AAFFF2.9030804-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]
* [PATCH 4/4] The control group itself [not found] ` <47AAFFF2.9030804-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> @ 2008-02-07 13:01 ` Pavel Emelyanov [not found] ` <47AB013B.8060502-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Pavel Emelyanov @ 2008-02-07 13:01 UTC (permalink / raw) To: Serge Hallyn, Sukadev Bhattiprolu; +Cc: Linux Containers, Paul Menage Each new group will have its own maps for char and block layers. The devices access list is tuned via the devices.permissions file. One may read from the file to get the configured state. The top container isn't initialized, so that the char and block layers will use the global maps to lookup their devices. I did that not to export the static maps to the outer world. Good news is that this patch now contains more comments and Documentation file :) Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> --- diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt new file mode 100644 index 0000000..dbd0c7a --- /dev/null +++ b/Documentation/controllers/devices.txt @@ -0,0 +1,61 @@ + + Devices visibility controller + +This controller allows to tune the devices accessibility by tasks, +i.e. grant full access for /dev/null, /dev/zero etc, grant read-only +access to IDE devices and completely hide SCSI disks. + +Tasks still can call mknod to create device files, regardless of +whether the particular device is visible or accessible, but they +may not be able to open it later. + +This one hides under CONFIG_CGROUP_DEVS option. + + +Configuring + +The controller provides a single file to configure itself -- the +devices.permissions one. To change the accessibility level for some +device write the following string into it: + +[cb] <major>:(<minor>|*) [r-][w-] + ^ ^ ^ + | | | + | | +--- access rights (1) + | | + | +-- device major and minor numbers (2) + | + +-- device type (character / block) + +1) The access rights set to '--' remove the device from the group's +access list, so that it will not even be shown in this file later. + +2) Setting the minor to '*' grants access to all the minors for +particular major. + +When reading from it, one may see something like + + c 1:5 rw + b 8:* r- + +Security issues, concerning who may grant access to what are governed +at the cgroup infrastructure level. + + +Examples: + +1. Grand full access to /dev/null + # echo c 1:3 rw > /cgroups/<id>/devices.permissions + +2. Grant the read-only access to /dev/sda and partitions + # echo b 8:* r- > ... + +3. Change the /dev/null access to write-only + # echo c 1:3 -w > ... + +4. Revoke access to /dev/sda + # echo b 8:* -- > ... + + + Written by Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> + diff --git a/fs/Makefile b/fs/Makefile index 7996220..5ad03be 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -64,6 +64,8 @@ obj-y += devpts/ obj-$(CONFIG_PROFILING) += dcookies.o obj-$(CONFIG_DLM) += dlm/ + +obj-$(CONFIG_CGROUP_DEVS) += devscontrol.o # Do not add any filesystems before this line obj-$(CONFIG_REISERFS_FS) += reiserfs/ diff --git a/fs/devscontrol.c b/fs/devscontrol.c new file mode 100644 index 0000000..48c5f69 --- /dev/null +++ b/fs/devscontrol.c @@ -0,0 +1,314 @@ +/* + * devscontrol.c - Device Controller + * + * Copyright 2007 OpenVZ SWsoft Inc + * Author: Pavel Emelyanov <xemul at openvz dot org> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include <linux/cgroup.h> +#include <linux/cdev.h> +#include <linux/err.h> +#include <linux/devscontrol.h> +#include <linux/uaccess.h> +#include <linux/fs.h> +#include <linux/genhd.h> + +struct devs_cgroup { + /* + * The subsys state to build into cgrous infrastructure + */ + struct cgroup_subsys_state css; + + /* + * The maps of character and block devices. They provide a + * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c + * and block/genhd.c to find out how the ->open() callbacks + * work when opening a device. + * + * Each group will have its onw maps, and at the open() + * time code will lookup in this map to get the device + * and permissions by its dev_t. + */ + struct kobj_map *cdev_map; + struct kobj_map *bdev_map; +}; + +static inline +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css) +{ + return container_of(css, struct devs_cgroup, css); +} + +static inline +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont) +{ + return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id)); +} + +struct kobj_map *task_cdev_map(struct task_struct *tsk) +{ + struct cgroup_subsys_state *css; + + css = task_subsys_state(tsk, devs_subsys_id); + if (css->cgroup->parent == NULL) + return NULL; + else + return css_to_devs(css)->cdev_map; +} + +struct kobj_map *task_bdev_map(struct task_struct *tsk) +{ + struct cgroup_subsys_state *css; + + css = task_subsys_state(tsk, devs_subsys_id); + if (css->cgroup->parent == NULL) + return NULL; + else + return css_to_devs(css)->bdev_map; +} + +static struct cgroup_subsys_state * +devs_create(struct cgroup_subsys *ss, struct cgroup *cont) +{ + struct devs_cgroup *devs; + + devs = kzalloc(sizeof(struct devs_cgroup), GFP_KERNEL); + if (devs == NULL) + goto out; + + devs->cdev_map = cdev_map_init(); + if (devs->cdev_map == NULL) + goto out_free; + + devs->bdev_map = bdev_map_init(); + if (devs->bdev_map == NULL) + goto out_free_cdev; + + return &devs->css; + +out_free_cdev: + cdev_map_fini(devs->cdev_map); +out_free: + kfree(devs); +out: + return ERR_PTR(-ENOMEM); +} + +static void devs_destroy(struct cgroup_subsys *ss, struct cgroup *cont) +{ + struct devs_cgroup *devs; + + devs = cgroup_to_devs(cont); + bdev_map_fini(devs->bdev_map); + cdev_map_fini(devs->cdev_map); + kfree(devs); +} + +/* + * The devices.permissions file read/write functionality + * + * The following two routines parse and print the strings like + * [cb] <major>:(<minor>|*) [r-][w-] + */ + +static int decode_perms_str(char *buf, int *chrdev, dev_t *dev, + int *all, mode_t *mode) +{ + unsigned int major, minor; + char *end; + mode_t tmp; + + if (buf[0] == 'c') + *chrdev = 1; + else if (buf[0] == 'b') + *chrdev = 0; + else + return -EINVAL; + + if (buf[1] != ' ') + return -EINVAL; + + major = simple_strtoul(buf + 2, &end, 10); + if (*end != ':') + return -EINVAL; + + if (end[1] == '*') { + if (end[2] != ' ') + return -EINVAL; + + *all = 1; + minor = 0; + end += 2; + } else { + minor = simple_strtoul(end + 1, &end, 10); + if (*end != ' ') + return -EINVAL; + + *all = 0; + } + + tmp = 0; + + if (end[1] == 'r') + tmp |= FMODE_READ; + else if (end[1] != '-') + return -EINVAL; + if (end[2] == 'w') + tmp |= FMODE_WRITE; + else if (end[2] != '-') + return -EINVAL; + + *dev = MKDEV(major, minor); + *mode = tmp; + return 0; +} + +static int encode_perms_str(char *buf, int len, int chrdev, dev_t dev, + int all, mode_t mode) +{ + int ret; + + ret = snprintf(buf, len, "%c %d:", chrdev ? 'c' : 'b', MAJOR(dev)); + if (all) + ret += snprintf(buf + ret, len - ret, "*"); + else + ret += snprintf(buf + ret, len - ret, "%d", MINOR(dev)); + + ret += snprintf(buf + ret, len - ret, " %c%c\n", + (mode & FMODE_READ) ? 'r' : '-', + (mode & FMODE_WRITE) ? 'w' : '-'); + + return ret + 1; +} + +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft, + struct file *f, const char __user *ubuf, + size_t nbytes, loff_t *pos) +{ + int err, all, chrdev; + dev_t dev; + char buf[64]; + struct devs_cgroup *devs; + mode_t mode; + + if (copy_from_user(buf, ubuf, sizeof(buf))) + return -EFAULT; + + buf[sizeof(buf) - 1] = 0; + err = decode_perms_str(buf, &chrdev, &dev, &all, &mode); + if (err < 0) + return err; + + devs = cgroup_to_devs(cont); + + /* + * No locking here is required - all that we need + * is provided inside the kobject mapping code + */ + + if (mode == 0) { + if (chrdev) + err = cdev_del_from_map(devs->cdev_map, dev, all); + else + err = bdev_del_from_map(devs->bdev_map, dev, all); + + if (err < 0) + return err; + + css_put(&devs->css); + } else { + if (chrdev) + err = cdev_add_to_map(devs->cdev_map, dev, all, mode); + else + err = bdev_add_to_map(devs->bdev_map, dev, all, mode); + + if (err < 0) + return err; + + css_get(&devs->css); + } + + return nbytes; +} + +struct devs_dump_arg { + char *buf; + int pos; + int chrdev; +}; + +static int devs_dump_one(dev_t dev, int range, mode_t mode, void *x) +{ + struct devs_dump_arg *arg = x; + char tmp[64]; + int len; + + len = encode_perms_str(tmp, sizeof(tmp), arg->chrdev, dev, + range != 1, mode); + + if (arg->pos >= PAGE_SIZE - len) + return 1; + + memcpy(arg->buf + arg->pos, tmp, len); + arg->pos += len; + return 0; +} + +static ssize_t devs_read(struct cgroup *cont, struct cftype *cft, + struct file *f, char __user *ubuf, size_t nbytes, loff_t *pos) +{ + struct devs_dump_arg arg; + struct devs_cgroup *devs; + ssize_t ret; + + arg.buf = (char *)__get_free_page(GFP_KERNEL); + if (arg.buf == NULL) + return -ENOMEM; + + devs = cgroup_to_devs(cont); + arg.pos = 0; + + arg.chrdev = 1; + cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg); + + arg.chrdev = 0; + bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg); + + ret = simple_read_from_buffer(ubuf, nbytes, pos, + arg.buf, arg.pos); + + free_page((unsigned long)arg.buf); + return ret; +} + +static struct cftype devs_files[] = { + { + .name = "permissions", + .write = devs_write, + .read = devs_read, + }, +}; + +static int devs_populate(struct cgroup_subsys *ss, struct cgroup *cont) +{ + return cgroup_add_files(cont, ss, + devs_files, ARRAY_SIZE(devs_files)); +} + +struct cgroup_subsys devs_subsys = { + .name = "devices", + .subsys_id = devs_subsys_id, + .create = devs_create, + .destroy = devs_destroy, + .populate = devs_populate, +}; diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h index 228235c..9c0cd2c 100644 --- a/include/linux/cgroup_subsys.h +++ b/include/linux/cgroup_subsys.h @@ -42,3 +42,9 @@ SUBSYS(mem_cgroup) #endif /* */ + +#ifdef CONFIG_CGROUP_DEVS +SUBSYS(devs) +#endif + +/* */ diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h new file mode 100644 index 0000000..38057b9 --- /dev/null +++ b/include/linux/devscontrol.h @@ -0,0 +1,26 @@ +#ifndef __DEVS_CONTROL_H__ +#define __DEVS_CONTROL_H__ +struct kobj_map; +struct task_struct; + +/* + * task_[cb]dev_map - get a map from task. Both calls may return + * NULL, to indicate, that task doesn't belong to any group and + * that the global map is to be used. + */ + +#ifdef CONFIG_CGROUP_DEVS +struct kobj_map *task_cdev_map(struct task_struct *); +struct kobj_map *task_bdev_map(struct task_struct *); +#else +static inline kobj_map *task_cdev_map(struct task_struct *tsk) +{ + return NULL; +} + +static inline kobj_map *task_bdev_map(struct task_struct *tsk) +{ + return NULL; +} +#endif +#endif diff --git a/init/Kconfig b/init/Kconfig index 732a1c2..f9a1b4f 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -283,6 +283,19 @@ config CGROUP_DEBUG Say N if unsure +config CGROUP_DEVS + bool "Devices cgroup subsystem" + depends on CGROUPS + help + Controlls the access rights to devices, i.e. you may hide + some of them from tasks, so that they will not be able + to open them, or you may grant a read-only access to some + of them. + + See Documentation/controllers/devices.txt for details. + + This is harmless to say N here, so do it if unsure. + config CGROUP_NS bool "Namespace cgroup subsystem" depends on CGROUPS ^ permalink raw reply related [flat|nested] 42+ messages in thread
[parent not found: <47AB013B.8060502-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <47AB013B.8060502-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> @ 2008-02-11 17:38 ` Serge E. Hallyn [not found] ` <20080211173830.GA22160-6s5zFf/epYL1ENwx4SLHqw@public.gmane.org> 2008-02-12 7:42 ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA 2008-02-21 20:47 ` Paul Menage 2 siblings, 1 reply; 42+ messages in thread From: Serge E. Hallyn @ 2008-02-11 17:38 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Paul Menage, Linux Containers Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org): > Each new group will have its own maps for char and block > layers. The devices access list is tuned via the > devices.permissions file. One may read from the file to get > the configured state. > > The top container isn't initialized, so that the char > and block layers will use the global maps to lookup > their devices. I did that not to export the static maps > to the outer world. > > Good news is that this patch now contains more comments > and Documentation file :) Seems to work as advertised :) I can't reproduce Suka's null/zero bug. You're relying fully on uid-0 to stop writes into the devices.permissions files. Would you mind adding a check for CAP_SYS_ADMIN (or CAP_NS_OVERRIDE+CAP_MKNOD)? Or were you really counting on using the filesystem visibility cgroup to stop a container from making changes to its device access whitelist? thanks, -serge > Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> > > --- > > diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt > new file mode 100644 > index 0000000..dbd0c7a > --- /dev/null > +++ b/Documentation/controllers/devices.txt > @@ -0,0 +1,61 @@ > + > + Devices visibility controller > + > +This controller allows to tune the devices accessibility by tasks, > +i.e. grant full access for /dev/null, /dev/zero etc, grant read-only > +access to IDE devices and completely hide SCSI disks. > + > +Tasks still can call mknod to create device files, regardless of > +whether the particular device is visible or accessible, but they > +may not be able to open it later. > + > +This one hides under CONFIG_CGROUP_DEVS option. > + > + > +Configuring > + > +The controller provides a single file to configure itself -- the > +devices.permissions one. To change the accessibility level for some > +device write the following string into it: > + > +[cb] <major>:(<minor>|*) [r-][w-] > + ^ ^ ^ > + | | | > + | | +--- access rights (1) > + | | > + | +-- device major and minor numbers (2) > + | > + +-- device type (character / block) > + > +1) The access rights set to '--' remove the device from the group's > +access list, so that it will not even be shown in this file later. > + > +2) Setting the minor to '*' grants access to all the minors for > +particular major. > + > +When reading from it, one may see something like > + > + c 1:5 rw > + b 8:* r- > + > +Security issues, concerning who may grant access to what are governed > +at the cgroup infrastructure level. > + > + > +Examples: > + > +1. Grand full access to /dev/null > + # echo c 1:3 rw > /cgroups/<id>/devices.permissions > + > +2. Grant the read-only access to /dev/sda and partitions > + # echo b 8:* r- > ... > + > +3. Change the /dev/null access to write-only > + # echo c 1:3 -w > ... > + > +4. Revoke access to /dev/sda > + # echo b 8:* -- > ... > + > + > + Written by Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> > + > diff --git a/fs/Makefile b/fs/Makefile > index 7996220..5ad03be 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -64,6 +64,8 @@ obj-y += devpts/ > > obj-$(CONFIG_PROFILING) += dcookies.o > obj-$(CONFIG_DLM) += dlm/ > + > +obj-$(CONFIG_CGROUP_DEVS) += devscontrol.o > > # Do not add any filesystems before this line > obj-$(CONFIG_REISERFS_FS) += reiserfs/ > diff --git a/fs/devscontrol.c b/fs/devscontrol.c > new file mode 100644 > index 0000000..48c5f69 > --- /dev/null > +++ b/fs/devscontrol.c > @@ -0,0 +1,314 @@ > +/* > + * devscontrol.c - Device Controller > + * > + * Copyright 2007 OpenVZ SWsoft Inc > + * Author: Pavel Emelyanov <xemul at openvz dot org> > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or > + * (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + */ > + > +#include <linux/cgroup.h> > +#include <linux/cdev.h> > +#include <linux/err.h> > +#include <linux/devscontrol.h> > +#include <linux/uaccess.h> > +#include <linux/fs.h> > +#include <linux/genhd.h> > + > +struct devs_cgroup { > + /* > + * The subsys state to build into cgrous infrastructure > + */ > + struct cgroup_subsys_state css; > + > + /* > + * The maps of character and block devices. They provide a > + * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c > + * and block/genhd.c to find out how the ->open() callbacks > + * work when opening a device. > + * > + * Each group will have its onw maps, and at the open() > + * time code will lookup in this map to get the device > + * and permissions by its dev_t. > + */ > + struct kobj_map *cdev_map; > + struct kobj_map *bdev_map; > +}; > + > +static inline > +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css) > +{ > + return container_of(css, struct devs_cgroup, css); > +} > + > +static inline > +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont) > +{ > + return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id)); > +} > + > +struct kobj_map *task_cdev_map(struct task_struct *tsk) > +{ > + struct cgroup_subsys_state *css; > + > + css = task_subsys_state(tsk, devs_subsys_id); > + if (css->cgroup->parent == NULL) > + return NULL; > + else > + return css_to_devs(css)->cdev_map; > +} > + > +struct kobj_map *task_bdev_map(struct task_struct *tsk) > +{ > + struct cgroup_subsys_state *css; > + > + css = task_subsys_state(tsk, devs_subsys_id); > + if (css->cgroup->parent == NULL) > + return NULL; > + else > + return css_to_devs(css)->bdev_map; > +} > + > +static struct cgroup_subsys_state * > +devs_create(struct cgroup_subsys *ss, struct cgroup *cont) > +{ > + struct devs_cgroup *devs; > + > + devs = kzalloc(sizeof(struct devs_cgroup), GFP_KERNEL); > + if (devs == NULL) > + goto out; > + > + devs->cdev_map = cdev_map_init(); > + if (devs->cdev_map == NULL) > + goto out_free; > + > + devs->bdev_map = bdev_map_init(); > + if (devs->bdev_map == NULL) > + goto out_free_cdev; > + > + return &devs->css; > + > +out_free_cdev: > + cdev_map_fini(devs->cdev_map); > +out_free: > + kfree(devs); > +out: > + return ERR_PTR(-ENOMEM); > +} > + > +static void devs_destroy(struct cgroup_subsys *ss, struct cgroup *cont) > +{ > + struct devs_cgroup *devs; > + > + devs = cgroup_to_devs(cont); > + bdev_map_fini(devs->bdev_map); > + cdev_map_fini(devs->cdev_map); > + kfree(devs); > +} > + > +/* > + * The devices.permissions file read/write functionality > + * > + * The following two routines parse and print the strings like > + * [cb] <major>:(<minor>|*) [r-][w-] > + */ > + > +static int decode_perms_str(char *buf, int *chrdev, dev_t *dev, > + int *all, mode_t *mode) > +{ > + unsigned int major, minor; > + char *end; > + mode_t tmp; > + > + if (buf[0] == 'c') > + *chrdev = 1; > + else if (buf[0] == 'b') > + *chrdev = 0; > + else > + return -EINVAL; > + > + if (buf[1] != ' ') > + return -EINVAL; > + > + major = simple_strtoul(buf + 2, &end, 10); > + if (*end != ':') > + return -EINVAL; > + > + if (end[1] == '*') { > + if (end[2] != ' ') > + return -EINVAL; > + > + *all = 1; > + minor = 0; > + end += 2; > + } else { > + minor = simple_strtoul(end + 1, &end, 10); > + if (*end != ' ') > + return -EINVAL; > + > + *all = 0; > + } > + > + tmp = 0; > + > + if (end[1] == 'r') > + tmp |= FMODE_READ; > + else if (end[1] != '-') > + return -EINVAL; > + if (end[2] == 'w') > + tmp |= FMODE_WRITE; > + else if (end[2] != '-') > + return -EINVAL; > + > + *dev = MKDEV(major, minor); > + *mode = tmp; > + return 0; > +} > + > +static int encode_perms_str(char *buf, int len, int chrdev, dev_t dev, > + int all, mode_t mode) > +{ > + int ret; > + > + ret = snprintf(buf, len, "%c %d:", chrdev ? 'c' : 'b', MAJOR(dev)); > + if (all) > + ret += snprintf(buf + ret, len - ret, "*"); > + else > + ret += snprintf(buf + ret, len - ret, "%d", MINOR(dev)); > + > + ret += snprintf(buf + ret, len - ret, " %c%c\n", > + (mode & FMODE_READ) ? 'r' : '-', > + (mode & FMODE_WRITE) ? 'w' : '-'); > + > + return ret + 1; > +} > + > +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft, > + struct file *f, const char __user *ubuf, > + size_t nbytes, loff_t *pos) > +{ > + int err, all, chrdev; > + dev_t dev; > + char buf[64]; > + struct devs_cgroup *devs; > + mode_t mode; > + > + if (copy_from_user(buf, ubuf, sizeof(buf))) > + return -EFAULT; > + > + buf[sizeof(buf) - 1] = 0; > + err = decode_perms_str(buf, &chrdev, &dev, &all, &mode); > + if (err < 0) > + return err; > + > + devs = cgroup_to_devs(cont); > + > + /* > + * No locking here is required - all that we need > + * is provided inside the kobject mapping code > + */ > + > + if (mode == 0) { > + if (chrdev) > + err = cdev_del_from_map(devs->cdev_map, dev, all); > + else > + err = bdev_del_from_map(devs->bdev_map, dev, all); > + > + if (err < 0) > + return err; > + > + css_put(&devs->css); > + } else { > + if (chrdev) > + err = cdev_add_to_map(devs->cdev_map, dev, all, mode); > + else > + err = bdev_add_to_map(devs->bdev_map, dev, all, mode); > + > + if (err < 0) > + return err; > + > + css_get(&devs->css); > + } > + > + return nbytes; > +} > + > +struct devs_dump_arg { > + char *buf; > + int pos; > + int chrdev; > +}; > + > +static int devs_dump_one(dev_t dev, int range, mode_t mode, void *x) > +{ > + struct devs_dump_arg *arg = x; > + char tmp[64]; > + int len; > + > + len = encode_perms_str(tmp, sizeof(tmp), arg->chrdev, dev, > + range != 1, mode); > + > + if (arg->pos >= PAGE_SIZE - len) > + return 1; > + > + memcpy(arg->buf + arg->pos, tmp, len); > + arg->pos += len; > + return 0; > +} > + > +static ssize_t devs_read(struct cgroup *cont, struct cftype *cft, > + struct file *f, char __user *ubuf, size_t nbytes, loff_t *pos) > +{ > + struct devs_dump_arg arg; > + struct devs_cgroup *devs; > + ssize_t ret; > + > + arg.buf = (char *)__get_free_page(GFP_KERNEL); > + if (arg.buf == NULL) > + return -ENOMEM; > + > + devs = cgroup_to_devs(cont); > + arg.pos = 0; > + > + arg.chrdev = 1; > + cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg); > + > + arg.chrdev = 0; > + bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg); > + > + ret = simple_read_from_buffer(ubuf, nbytes, pos, > + arg.buf, arg.pos); > + > + free_page((unsigned long)arg.buf); > + return ret; > +} > + > +static struct cftype devs_files[] = { > + { > + .name = "permissions", > + .write = devs_write, > + .read = devs_read, > + }, > +}; > + > +static int devs_populate(struct cgroup_subsys *ss, struct cgroup *cont) > +{ > + return cgroup_add_files(cont, ss, > + devs_files, ARRAY_SIZE(devs_files)); > +} > + > +struct cgroup_subsys devs_subsys = { > + .name = "devices", > + .subsys_id = devs_subsys_id, > + .create = devs_create, > + .destroy = devs_destroy, > + .populate = devs_populate, > +}; > diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h > index 228235c..9c0cd2c 100644 > --- a/include/linux/cgroup_subsys.h > +++ b/include/linux/cgroup_subsys.h > @@ -42,3 +42,9 @@ SUBSYS(mem_cgroup) > #endif > > /* */ > + > +#ifdef CONFIG_CGROUP_DEVS > +SUBSYS(devs) > +#endif > + > +/* */ > diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h > new file mode 100644 > index 0000000..38057b9 > --- /dev/null > +++ b/include/linux/devscontrol.h > @@ -0,0 +1,26 @@ > +#ifndef __DEVS_CONTROL_H__ > +#define __DEVS_CONTROL_H__ > +struct kobj_map; > +struct task_struct; > + > +/* > + * task_[cb]dev_map - get a map from task. Both calls may return > + * NULL, to indicate, that task doesn't belong to any group and > + * that the global map is to be used. > + */ > + > +#ifdef CONFIG_CGROUP_DEVS > +struct kobj_map *task_cdev_map(struct task_struct *); > +struct kobj_map *task_bdev_map(struct task_struct *); > +#else > +static inline kobj_map *task_cdev_map(struct task_struct *tsk) > +{ > + return NULL; > +} > + > +static inline kobj_map *task_bdev_map(struct task_struct *tsk) > +{ > + return NULL; > +} > +#endif > +#endif > diff --git a/init/Kconfig b/init/Kconfig > index 732a1c2..f9a1b4f 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -283,6 +283,19 @@ config CGROUP_DEBUG > > Say N if unsure > > +config CGROUP_DEVS > + bool "Devices cgroup subsystem" > + depends on CGROUPS > + help > + Controlls the access rights to devices, i.e. you may hide > + some of them from tasks, so that they will not be able > + to open them, or you may grant a read-only access to some > + of them. > + > + See Documentation/controllers/devices.txt for details. > + > + This is harmless to say N here, so do it if unsure. > + > config CGROUP_NS > bool "Namespace cgroup subsystem" > depends on CGROUPS ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <20080211173830.GA22160-6s5zFf/epYL1ENwx4SLHqw@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <20080211173830.GA22160-6s5zFf/epYL1ENwx4SLHqw@public.gmane.org> @ 2008-02-12 10:28 ` Pavel Emelyanov [not found] ` <47B174B2.5010500-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Pavel Emelyanov @ 2008-02-12 10:28 UTC (permalink / raw) To: Serge E. Hallyn; +Cc: Paul Menage, Linux Containers Serge E. Hallyn wrote: > Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org): >> Each new group will have its own maps for char and block >> layers. The devices access list is tuned via the >> devices.permissions file. One may read from the file to get >> the configured state. >> >> The top container isn't initialized, so that the char >> and block layers will use the global maps to lookup >> their devices. I did that not to export the static maps >> to the outer world. >> >> Good news is that this patch now contains more comments >> and Documentation file :) > > Seems to work as advertised :) I can't reproduce Suka's null/zero > bug. > > You're relying fully on uid-0 to stop writes into the > devices.permissions files. Would you mind adding a check for > CAP_SYS_ADMIN (or CAP_NS_OVERRIDE+CAP_MKNOD)? Or were you really > counting on using the filesystem visibility cgroup to stop a container > from making changes to its device access whitelist? Yup. I strongly believe that a controller itself should not bring any security policy of its own, but the infrastructure should take care of this. However, I'm open to change my mind if I see good explanation of why it is wrong. > thanks, > -serge > >> Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> >> >> --- >> >> diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt >> new file mode 100644 >> index 0000000..dbd0c7a >> --- /dev/null >> +++ b/Documentation/controllers/devices.txt >> @@ -0,0 +1,61 @@ >> + >> + Devices visibility controller >> + >> +This controller allows to tune the devices accessibility by tasks, >> +i.e. grant full access for /dev/null, /dev/zero etc, grant read-only >> +access to IDE devices and completely hide SCSI disks. >> + >> +Tasks still can call mknod to create device files, regardless of >> +whether the particular device is visible or accessible, but they >> +may not be able to open it later. >> + >> +This one hides under CONFIG_CGROUP_DEVS option. >> + >> + >> +Configuring >> + >> +The controller provides a single file to configure itself -- the >> +devices.permissions one. To change the accessibility level for some >> +device write the following string into it: >> + >> +[cb] <major>:(<minor>|*) [r-][w-] >> + ^ ^ ^ >> + | | | >> + | | +--- access rights (1) >> + | | >> + | +-- device major and minor numbers (2) >> + | >> + +-- device type (character / block) >> + >> +1) The access rights set to '--' remove the device from the group's >> +access list, so that it will not even be shown in this file later. >> + >> +2) Setting the minor to '*' grants access to all the minors for >> +particular major. >> + >> +When reading from it, one may see something like >> + >> + c 1:5 rw >> + b 8:* r- >> + >> +Security issues, concerning who may grant access to what are governed >> +at the cgroup infrastructure level. >> + >> + >> +Examples: >> + >> +1. Grand full access to /dev/null >> + # echo c 1:3 rw > /cgroups/<id>/devices.permissions >> + >> +2. Grant the read-only access to /dev/sda and partitions >> + # echo b 8:* r- > ... >> + >> +3. Change the /dev/null access to write-only >> + # echo c 1:3 -w > ... >> + >> +4. Revoke access to /dev/sda >> + # echo b 8:* -- > ... >> + >> + >> + Written by Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> >> + >> diff --git a/fs/Makefile b/fs/Makefile >> index 7996220..5ad03be 100644 >> --- a/fs/Makefile >> +++ b/fs/Makefile >> @@ -64,6 +64,8 @@ obj-y += devpts/ >> >> obj-$(CONFIG_PROFILING) += dcookies.o >> obj-$(CONFIG_DLM) += dlm/ >> + >> +obj-$(CONFIG_CGROUP_DEVS) += devscontrol.o >> >> # Do not add any filesystems before this line >> obj-$(CONFIG_REISERFS_FS) += reiserfs/ >> diff --git a/fs/devscontrol.c b/fs/devscontrol.c >> new file mode 100644 >> index 0000000..48c5f69 >> --- /dev/null >> +++ b/fs/devscontrol.c >> @@ -0,0 +1,314 @@ >> +/* >> + * devscontrol.c - Device Controller >> + * >> + * Copyright 2007 OpenVZ SWsoft Inc >> + * Author: Pavel Emelyanov <xemul at openvz dot org> >> + * >> + * This program is free software; you can redistribute it and/or modify >> + * it under the terms of the GNU General Public License as published by >> + * the Free Software Foundation; either version 2 of the License, or >> + * (at your option) any later version. >> + * >> + * This program is distributed in the hope that it will be useful, >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the >> + * GNU General Public License for more details. >> + */ >> + >> +#include <linux/cgroup.h> >> +#include <linux/cdev.h> >> +#include <linux/err.h> >> +#include <linux/devscontrol.h> >> +#include <linux/uaccess.h> >> +#include <linux/fs.h> >> +#include <linux/genhd.h> >> + >> +struct devs_cgroup { >> + /* >> + * The subsys state to build into cgrous infrastructure >> + */ >> + struct cgroup_subsys_state css; >> + >> + /* >> + * The maps of character and block devices. They provide a >> + * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c >> + * and block/genhd.c to find out how the ->open() callbacks >> + * work when opening a device. >> + * >> + * Each group will have its onw maps, and at the open() >> + * time code will lookup in this map to get the device >> + * and permissions by its dev_t. >> + */ >> + struct kobj_map *cdev_map; >> + struct kobj_map *bdev_map; >> +}; >> + >> +static inline >> +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css) >> +{ >> + return container_of(css, struct devs_cgroup, css); >> +} >> + >> +static inline >> +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont) >> +{ >> + return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id)); >> +} >> + >> +struct kobj_map *task_cdev_map(struct task_struct *tsk) >> +{ >> + struct cgroup_subsys_state *css; >> + >> + css = task_subsys_state(tsk, devs_subsys_id); >> + if (css->cgroup->parent == NULL) >> + return NULL; >> + else >> + return css_to_devs(css)->cdev_map; >> +} >> + >> +struct kobj_map *task_bdev_map(struct task_struct *tsk) >> +{ >> + struct cgroup_subsys_state *css; >> + >> + css = task_subsys_state(tsk, devs_subsys_id); >> + if (css->cgroup->parent == NULL) >> + return NULL; >> + else >> + return css_to_devs(css)->bdev_map; >> +} >> + >> +static struct cgroup_subsys_state * >> +devs_create(struct cgroup_subsys *ss, struct cgroup *cont) >> +{ >> + struct devs_cgroup *devs; >> + >> + devs = kzalloc(sizeof(struct devs_cgroup), GFP_KERNEL); >> + if (devs == NULL) >> + goto out; >> + >> + devs->cdev_map = cdev_map_init(); >> + if (devs->cdev_map == NULL) >> + goto out_free; >> + >> + devs->bdev_map = bdev_map_init(); >> + if (devs->bdev_map == NULL) >> + goto out_free_cdev; >> + >> + return &devs->css; >> + >> +out_free_cdev: >> + cdev_map_fini(devs->cdev_map); >> +out_free: >> + kfree(devs); >> +out: >> + return ERR_PTR(-ENOMEM); >> +} >> + >> +static void devs_destroy(struct cgroup_subsys *ss, struct cgroup *cont) >> +{ >> + struct devs_cgroup *devs; >> + >> + devs = cgroup_to_devs(cont); >> + bdev_map_fini(devs->bdev_map); >> + cdev_map_fini(devs->cdev_map); >> + kfree(devs); >> +} >> + >> +/* >> + * The devices.permissions file read/write functionality >> + * >> + * The following two routines parse and print the strings like >> + * [cb] <major>:(<minor>|*) [r-][w-] >> + */ >> + >> +static int decode_perms_str(char *buf, int *chrdev, dev_t *dev, >> + int *all, mode_t *mode) >> +{ >> + unsigned int major, minor; >> + char *end; >> + mode_t tmp; >> + >> + if (buf[0] == 'c') >> + *chrdev = 1; >> + else if (buf[0] == 'b') >> + *chrdev = 0; >> + else >> + return -EINVAL; >> + >> + if (buf[1] != ' ') >> + return -EINVAL; >> + >> + major = simple_strtoul(buf + 2, &end, 10); >> + if (*end != ':') >> + return -EINVAL; >> + >> + if (end[1] == '*') { >> + if (end[2] != ' ') >> + return -EINVAL; >> + >> + *all = 1; >> + minor = 0; >> + end += 2; >> + } else { >> + minor = simple_strtoul(end + 1, &end, 10); >> + if (*end != ' ') >> + return -EINVAL; >> + >> + *all = 0; >> + } >> + >> + tmp = 0; >> + >> + if (end[1] == 'r') >> + tmp |= FMODE_READ; >> + else if (end[1] != '-') >> + return -EINVAL; >> + if (end[2] == 'w') >> + tmp |= FMODE_WRITE; >> + else if (end[2] != '-') >> + return -EINVAL; >> + >> + *dev = MKDEV(major, minor); >> + *mode = tmp; >> + return 0; >> +} >> + >> +static int encode_perms_str(char *buf, int len, int chrdev, dev_t dev, >> + int all, mode_t mode) >> +{ >> + int ret; >> + >> + ret = snprintf(buf, len, "%c %d:", chrdev ? 'c' : 'b', MAJOR(dev)); >> + if (all) >> + ret += snprintf(buf + ret, len - ret, "*"); >> + else >> + ret += snprintf(buf + ret, len - ret, "%d", MINOR(dev)); >> + >> + ret += snprintf(buf + ret, len - ret, " %c%c\n", >> + (mode & FMODE_READ) ? 'r' : '-', >> + (mode & FMODE_WRITE) ? 'w' : '-'); >> + >> + return ret + 1; >> +} >> + >> +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft, >> + struct file *f, const char __user *ubuf, >> + size_t nbytes, loff_t *pos) >> +{ >> + int err, all, chrdev; >> + dev_t dev; >> + char buf[64]; >> + struct devs_cgroup *devs; >> + mode_t mode; >> + >> + if (copy_from_user(buf, ubuf, sizeof(buf))) >> + return -EFAULT; >> + >> + buf[sizeof(buf) - 1] = 0; >> + err = decode_perms_str(buf, &chrdev, &dev, &all, &mode); >> + if (err < 0) >> + return err; >> + >> + devs = cgroup_to_devs(cont); >> + >> + /* >> + * No locking here is required - all that we need >> + * is provided inside the kobject mapping code >> + */ >> + >> + if (mode == 0) { >> + if (chrdev) >> + err = cdev_del_from_map(devs->cdev_map, dev, all); >> + else >> + err = bdev_del_from_map(devs->bdev_map, dev, all); >> + >> + if (err < 0) >> + return err; >> + >> + css_put(&devs->css); >> + } else { >> + if (chrdev) >> + err = cdev_add_to_map(devs->cdev_map, dev, all, mode); >> + else >> + err = bdev_add_to_map(devs->bdev_map, dev, all, mode); >> + >> + if (err < 0) >> + return err; >> + >> + css_get(&devs->css); >> + } >> + >> + return nbytes; >> +} >> + >> +struct devs_dump_arg { >> + char *buf; >> + int pos; >> + int chrdev; >> +}; >> + >> +static int devs_dump_one(dev_t dev, int range, mode_t mode, void *x) >> +{ >> + struct devs_dump_arg *arg = x; >> + char tmp[64]; >> + int len; >> + >> + len = encode_perms_str(tmp, sizeof(tmp), arg->chrdev, dev, >> + range != 1, mode); >> + >> + if (arg->pos >= PAGE_SIZE - len) >> + return 1; >> + >> + memcpy(arg->buf + arg->pos, tmp, len); >> + arg->pos += len; >> + return 0; >> +} >> + >> +static ssize_t devs_read(struct cgroup *cont, struct cftype *cft, >> + struct file *f, char __user *ubuf, size_t nbytes, loff_t *pos) >> +{ >> + struct devs_dump_arg arg; >> + struct devs_cgroup *devs; >> + ssize_t ret; >> + >> + arg.buf = (char *)__get_free_page(GFP_KERNEL); >> + if (arg.buf == NULL) >> + return -ENOMEM; >> + >> + devs = cgroup_to_devs(cont); >> + arg.pos = 0; >> + >> + arg.chrdev = 1; >> + cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg); >> + >> + arg.chrdev = 0; >> + bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg); >> + >> + ret = simple_read_from_buffer(ubuf, nbytes, pos, >> + arg.buf, arg.pos); >> + >> + free_page((unsigned long)arg.buf); >> + return ret; >> +} >> + >> +static struct cftype devs_files[] = { >> + { >> + .name = "permissions", >> + .write = devs_write, >> + .read = devs_read, >> + }, >> +}; >> + >> +static int devs_populate(struct cgroup_subsys *ss, struct cgroup *cont) >> +{ >> + return cgroup_add_files(cont, ss, >> + devs_files, ARRAY_SIZE(devs_files)); >> +} >> + >> +struct cgroup_subsys devs_subsys = { >> + .name = "devices", >> + .subsys_id = devs_subsys_id, >> + .create = devs_create, >> + .destroy = devs_destroy, >> + .populate = devs_populate, >> +}; >> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h >> index 228235c..9c0cd2c 100644 >> --- a/include/linux/cgroup_subsys.h >> +++ b/include/linux/cgroup_subsys.h >> @@ -42,3 +42,9 @@ SUBSYS(mem_cgroup) >> #endif >> >> /* */ >> + >> +#ifdef CONFIG_CGROUP_DEVS >> +SUBSYS(devs) >> +#endif >> + >> +/* */ >> diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h >> new file mode 100644 >> index 0000000..38057b9 >> --- /dev/null >> +++ b/include/linux/devscontrol.h >> @@ -0,0 +1,26 @@ >> +#ifndef __DEVS_CONTROL_H__ >> +#define __DEVS_CONTROL_H__ >> +struct kobj_map; >> +struct task_struct; >> + >> +/* >> + * task_[cb]dev_map - get a map from task. Both calls may return >> + * NULL, to indicate, that task doesn't belong to any group and >> + * that the global map is to be used. >> + */ >> + >> +#ifdef CONFIG_CGROUP_DEVS >> +struct kobj_map *task_cdev_map(struct task_struct *); >> +struct kobj_map *task_bdev_map(struct task_struct *); >> +#else >> +static inline kobj_map *task_cdev_map(struct task_struct *tsk) >> +{ >> + return NULL; >> +} >> + >> +static inline kobj_map *task_bdev_map(struct task_struct *tsk) >> +{ >> + return NULL; >> +} >> +#endif >> +#endif >> diff --git a/init/Kconfig b/init/Kconfig >> index 732a1c2..f9a1b4f 100644 >> --- a/init/Kconfig >> +++ b/init/Kconfig >> @@ -283,6 +283,19 @@ config CGROUP_DEBUG >> >> Say N if unsure >> >> +config CGROUP_DEVS >> + bool "Devices cgroup subsystem" >> + depends on CGROUPS >> + help >> + Controlls the access rights to devices, i.e. you may hide >> + some of them from tasks, so that they will not be able >> + to open them, or you may grant a read-only access to some >> + of them. >> + >> + See Documentation/controllers/devices.txt for details. >> + >> + This is harmless to say N here, so do it if unsure. >> + >> config CGROUP_NS >> bool "Namespace cgroup subsystem" >> depends on CGROUPS > ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <47B174B2.5010500-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <47B174B2.5010500-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> @ 2008-02-12 17:21 ` Serge E. Hallyn [not found] ` <20080212172134.GA12177-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Serge E. Hallyn @ 2008-02-12 17:21 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Paul Menage, Linux Containers Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org): > Serge E. Hallyn wrote: > > Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org): > >> Each new group will have its own maps for char and block > >> layers. The devices access list is tuned via the > >> devices.permissions file. One may read from the file to get > >> the configured state. > >> > >> The top container isn't initialized, so that the char > >> and block layers will use the global maps to lookup > >> their devices. I did that not to export the static maps > >> to the outer world. > >> > >> Good news is that this patch now contains more comments > >> and Documentation file :) > > > > Seems to work as advertised :) I can't reproduce Suka's null/zero > > bug. > > > > You're relying fully on uid-0 to stop writes into the > > devices.permissions files. Would you mind adding a check for > > CAP_SYS_ADMIN (or CAP_NS_OVERRIDE+CAP_MKNOD)? Or were you really > > counting on using the filesystem visibility cgroup to stop a container > > from making changes to its device access whitelist? > > Yup. I strongly believe that a controller itself should not bring > any security policy of its own, but the infrastructure should > take care of this. That would be ok if the controller gave the infrastructure some way of knowing what sort of thing the controller does. I.e. I wouldn't mind having the cgroup infrastructe check for capabilities, but I suspect some cgroups will really be best represented by different capabilities. Paul (actually both Menage and Jackson :) do you have an opinion on this? Are there sites which eg do 'chown -R some_user_id /cgroup/cpusets/' to have some non-root user be able to dole out cpusets? Is there any way it would be ok to have cgroup_file_write() check for CAP_SYS_ADMIN? > However, I'm open to change my mind if I see good explanation of > why it is wrong. Well the thing is that it currently leaves no way to keep root in another namespace from manipulating it. I don't think we can even use SELinux unless we're willing to prevent containers from having write access to everything under the cgroup - which is only ok depending on what is composed with the devices cgroup. We have the same kind of problem with the /proc/sys/filesystems/<fs>/fs_safe flag. There are a few possibilities, and your fs visibility cgroup (plus splitting /proc/sys/filesystems into its own fs) is one. More complicated things pop into my head, but I think until we get more comfortable with all this the simplest way is the best. But really imo the simplest way is to have a capable(CAP_SYS_ADMIN) check inside your _write function :) Ideally if you didn't mind I would float a patch (cousin to the userns 4-patch set) defining CAP_NS_OVERRIDE and requiring both CAP_NS_OVERRIDE and CAP_SYS_ADMIN to access _write. -serge > > thanks, > > -serge > > > >> Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> > >> > >> --- > >> > >> diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt > >> new file mode 100644 > >> index 0000000..dbd0c7a > >> --- /dev/null > >> +++ b/Documentation/controllers/devices.txt > >> @@ -0,0 +1,61 @@ > >> + > >> + Devices visibility controller > >> + > >> +This controller allows to tune the devices accessibility by tasks, > >> +i.e. grant full access for /dev/null, /dev/zero etc, grant read-only > >> +access to IDE devices and completely hide SCSI disks. > >> + > >> +Tasks still can call mknod to create device files, regardless of > >> +whether the particular device is visible or accessible, but they > >> +may not be able to open it later. > >> + > >> +This one hides under CONFIG_CGROUP_DEVS option. > >> + > >> + > >> +Configuring > >> + > >> +The controller provides a single file to configure itself -- the > >> +devices.permissions one. To change the accessibility level for some > >> +device write the following string into it: > >> + > >> +[cb] <major>:(<minor>|*) [r-][w-] > >> + ^ ^ ^ > >> + | | | > >> + | | +--- access rights (1) > >> + | | > >> + | +-- device major and minor numbers (2) > >> + | > >> + +-- device type (character / block) > >> + > >> +1) The access rights set to '--' remove the device from the group's > >> +access list, so that it will not even be shown in this file later. > >> + > >> +2) Setting the minor to '*' grants access to all the minors for > >> +particular major. > >> + > >> +When reading from it, one may see something like > >> + > >> + c 1:5 rw > >> + b 8:* r- > >> + > >> +Security issues, concerning who may grant access to what are governed > >> +at the cgroup infrastructure level. > >> + > >> + > >> +Examples: > >> + > >> +1. Grand full access to /dev/null > >> + # echo c 1:3 rw > /cgroups/<id>/devices.permissions > >> + > >> +2. Grant the read-only access to /dev/sda and partitions > >> + # echo b 8:* r- > ... > >> + > >> +3. Change the /dev/null access to write-only > >> + # echo c 1:3 -w > ... > >> + > >> +4. Revoke access to /dev/sda > >> + # echo b 8:* -- > ... > >> + > >> + > >> + Written by Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> > >> + > >> diff --git a/fs/Makefile b/fs/Makefile > >> index 7996220..5ad03be 100644 > >> --- a/fs/Makefile > >> +++ b/fs/Makefile > >> @@ -64,6 +64,8 @@ obj-y += devpts/ > >> > >> obj-$(CONFIG_PROFILING) += dcookies.o > >> obj-$(CONFIG_DLM) += dlm/ > >> + > >> +obj-$(CONFIG_CGROUP_DEVS) += devscontrol.o > >> > >> # Do not add any filesystems before this line > >> obj-$(CONFIG_REISERFS_FS) += reiserfs/ > >> diff --git a/fs/devscontrol.c b/fs/devscontrol.c > >> new file mode 100644 > >> index 0000000..48c5f69 > >> --- /dev/null > >> +++ b/fs/devscontrol.c > >> @@ -0,0 +1,314 @@ > >> +/* > >> + * devscontrol.c - Device Controller > >> + * > >> + * Copyright 2007 OpenVZ SWsoft Inc > >> + * Author: Pavel Emelyanov <xemul at openvz dot org> > >> + * > >> + * This program is free software; you can redistribute it and/or modify > >> + * it under the terms of the GNU General Public License as published by > >> + * the Free Software Foundation; either version 2 of the License, or > >> + * (at your option) any later version. > >> + * > >> + * This program is distributed in the hope that it will be useful, > >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of > >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > >> + * GNU General Public License for more details. > >> + */ > >> + > >> +#include <linux/cgroup.h> > >> +#include <linux/cdev.h> > >> +#include <linux/err.h> > >> +#include <linux/devscontrol.h> > >> +#include <linux/uaccess.h> > >> +#include <linux/fs.h> > >> +#include <linux/genhd.h> > >> + > >> +struct devs_cgroup { > >> + /* > >> + * The subsys state to build into cgrous infrastructure > >> + */ > >> + struct cgroup_subsys_state css; > >> + > >> + /* > >> + * The maps of character and block devices. They provide a > >> + * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c > >> + * and block/genhd.c to find out how the ->open() callbacks > >> + * work when opening a device. > >> + * > >> + * Each group will have its onw maps, and at the open() > >> + * time code will lookup in this map to get the device > >> + * and permissions by its dev_t. > >> + */ > >> + struct kobj_map *cdev_map; > >> + struct kobj_map *bdev_map; > >> +}; > >> + > >> +static inline > >> +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css) > >> +{ > >> + return container_of(css, struct devs_cgroup, css); > >> +} > >> + > >> +static inline > >> +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont) > >> +{ > >> + return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id)); > >> +} > >> + > >> +struct kobj_map *task_cdev_map(struct task_struct *tsk) > >> +{ > >> + struct cgroup_subsys_state *css; > >> + > >> + css = task_subsys_state(tsk, devs_subsys_id); > >> + if (css->cgroup->parent == NULL) > >> + return NULL; > >> + else > >> + return css_to_devs(css)->cdev_map; > >> +} > >> + > >> +struct kobj_map *task_bdev_map(struct task_struct *tsk) > >> +{ > >> + struct cgroup_subsys_state *css; > >> + > >> + css = task_subsys_state(tsk, devs_subsys_id); > >> + if (css->cgroup->parent == NULL) > >> + return NULL; > >> + else > >> + return css_to_devs(css)->bdev_map; > >> +} > >> + > >> +static struct cgroup_subsys_state * > >> +devs_create(struct cgroup_subsys *ss, struct cgroup *cont) > >> +{ > >> + struct devs_cgroup *devs; > >> + > >> + devs = kzalloc(sizeof(struct devs_cgroup), GFP_KERNEL); > >> + if (devs == NULL) > >> + goto out; > >> + > >> + devs->cdev_map = cdev_map_init(); > >> + if (devs->cdev_map == NULL) > >> + goto out_free; > >> + > >> + devs->bdev_map = bdev_map_init(); > >> + if (devs->bdev_map == NULL) > >> + goto out_free_cdev; > >> + > >> + return &devs->css; > >> + > >> +out_free_cdev: > >> + cdev_map_fini(devs->cdev_map); > >> +out_free: > >> + kfree(devs); > >> +out: > >> + return ERR_PTR(-ENOMEM); > >> +} > >> + > >> +static void devs_destroy(struct cgroup_subsys *ss, struct cgroup *cont) > >> +{ > >> + struct devs_cgroup *devs; > >> + > >> + devs = cgroup_to_devs(cont); > >> + bdev_map_fini(devs->bdev_map); > >> + cdev_map_fini(devs->cdev_map); > >> + kfree(devs); > >> +} > >> + > >> +/* > >> + * The devices.permissions file read/write functionality > >> + * > >> + * The following two routines parse and print the strings like > >> + * [cb] <major>:(<minor>|*) [r-][w-] > >> + */ > >> + > >> +static int decode_perms_str(char *buf, int *chrdev, dev_t *dev, > >> + int *all, mode_t *mode) > >> +{ > >> + unsigned int major, minor; > >> + char *end; > >> + mode_t tmp; > >> + > >> + if (buf[0] == 'c') > >> + *chrdev = 1; > >> + else if (buf[0] == 'b') > >> + *chrdev = 0; > >> + else > >> + return -EINVAL; > >> + > >> + if (buf[1] != ' ') > >> + return -EINVAL; > >> + > >> + major = simple_strtoul(buf + 2, &end, 10); > >> + if (*end != ':') > >> + return -EINVAL; > >> + > >> + if (end[1] == '*') { > >> + if (end[2] != ' ') > >> + return -EINVAL; > >> + > >> + *all = 1; > >> + minor = 0; > >> + end += 2; > >> + } else { > >> + minor = simple_strtoul(end + 1, &end, 10); > >> + if (*end != ' ') > >> + return -EINVAL; > >> + > >> + *all = 0; > >> + } > >> + > >> + tmp = 0; > >> + > >> + if (end[1] == 'r') > >> + tmp |= FMODE_READ; > >> + else if (end[1] != '-') > >> + return -EINVAL; > >> + if (end[2] == 'w') > >> + tmp |= FMODE_WRITE; > >> + else if (end[2] != '-') > >> + return -EINVAL; > >> + > >> + *dev = MKDEV(major, minor); > >> + *mode = tmp; > >> + return 0; > >> +} > >> + > >> +static int encode_perms_str(char *buf, int len, int chrdev, dev_t dev, > >> + int all, mode_t mode) > >> +{ > >> + int ret; > >> + > >> + ret = snprintf(buf, len, "%c %d:", chrdev ? 'c' : 'b', MAJOR(dev)); > >> + if (all) > >> + ret += snprintf(buf + ret, len - ret, "*"); > >> + else > >> + ret += snprintf(buf + ret, len - ret, "%d", MINOR(dev)); > >> + > >> + ret += snprintf(buf + ret, len - ret, " %c%c\n", > >> + (mode & FMODE_READ) ? 'r' : '-', > >> + (mode & FMODE_WRITE) ? 'w' : '-'); > >> + > >> + return ret + 1; > >> +} > >> + > >> +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft, > >> + struct file *f, const char __user *ubuf, > >> + size_t nbytes, loff_t *pos) > >> +{ > >> + int err, all, chrdev; > >> + dev_t dev; > >> + char buf[64]; > >> + struct devs_cgroup *devs; > >> + mode_t mode; > >> + > >> + if (copy_from_user(buf, ubuf, sizeof(buf))) > >> + return -EFAULT; > >> + > >> + buf[sizeof(buf) - 1] = 0; > >> + err = decode_perms_str(buf, &chrdev, &dev, &all, &mode); > >> + if (err < 0) > >> + return err; > >> + > >> + devs = cgroup_to_devs(cont); > >> + > >> + /* > >> + * No locking here is required - all that we need > >> + * is provided inside the kobject mapping code > >> + */ > >> + > >> + if (mode == 0) { > >> + if (chrdev) > >> + err = cdev_del_from_map(devs->cdev_map, dev, all); > >> + else > >> + err = bdev_del_from_map(devs->bdev_map, dev, all); > >> + > >> + if (err < 0) > >> + return err; > >> + > >> + css_put(&devs->css); > >> + } else { > >> + if (chrdev) > >> + err = cdev_add_to_map(devs->cdev_map, dev, all, mode); > >> + else > >> + err = bdev_add_to_map(devs->bdev_map, dev, all, mode); > >> + > >> + if (err < 0) > >> + return err; > >> + > >> + css_get(&devs->css); > >> + } > >> + > >> + return nbytes; > >> +} > >> + > >> +struct devs_dump_arg { > >> + char *buf; > >> + int pos; > >> + int chrdev; > >> +}; > >> + > >> +static int devs_dump_one(dev_t dev, int range, mode_t mode, void *x) > >> +{ > >> + struct devs_dump_arg *arg = x; > >> + char tmp[64]; > >> + int len; > >> + > >> + len = encode_perms_str(tmp, sizeof(tmp), arg->chrdev, dev, > >> + range != 1, mode); > >> + > >> + if (arg->pos >= PAGE_SIZE - len) > >> + return 1; > >> + > >> + memcpy(arg->buf + arg->pos, tmp, len); > >> + arg->pos += len; > >> + return 0; > >> +} > >> + > >> +static ssize_t devs_read(struct cgroup *cont, struct cftype *cft, > >> + struct file *f, char __user *ubuf, size_t nbytes, loff_t *pos) > >> +{ > >> + struct devs_dump_arg arg; > >> + struct devs_cgroup *devs; > >> + ssize_t ret; > >> + > >> + arg.buf = (char *)__get_free_page(GFP_KERNEL); > >> + if (arg.buf == NULL) > >> + return -ENOMEM; > >> + > >> + devs = cgroup_to_devs(cont); > >> + arg.pos = 0; > >> + > >> + arg.chrdev = 1; > >> + cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg); > >> + > >> + arg.chrdev = 0; > >> + bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg); > >> + > >> + ret = simple_read_from_buffer(ubuf, nbytes, pos, > >> + arg.buf, arg.pos); > >> + > >> + free_page((unsigned long)arg.buf); > >> + return ret; > >> +} > >> + > >> +static struct cftype devs_files[] = { > >> + { > >> + .name = "permissions", > >> + .write = devs_write, > >> + .read = devs_read, > >> + }, > >> +}; > >> + > >> +static int devs_populate(struct cgroup_subsys *ss, struct cgroup *cont) > >> +{ > >> + return cgroup_add_files(cont, ss, > >> + devs_files, ARRAY_SIZE(devs_files)); > >> +} > >> + > >> +struct cgroup_subsys devs_subsys = { > >> + .name = "devices", > >> + .subsys_id = devs_subsys_id, > >> + .create = devs_create, > >> + .destroy = devs_destroy, > >> + .populate = devs_populate, > >> +}; > >> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h > >> index 228235c..9c0cd2c 100644 > >> --- a/include/linux/cgroup_subsys.h > >> +++ b/include/linux/cgroup_subsys.h > >> @@ -42,3 +42,9 @@ SUBSYS(mem_cgroup) > >> #endif > >> > >> /* */ > >> + > >> +#ifdef CONFIG_CGROUP_DEVS > >> +SUBSYS(devs) > >> +#endif > >> + > >> +/* */ > >> diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h > >> new file mode 100644 > >> index 0000000..38057b9 > >> --- /dev/null > >> +++ b/include/linux/devscontrol.h > >> @@ -0,0 +1,26 @@ > >> +#ifndef __DEVS_CONTROL_H__ > >> +#define __DEVS_CONTROL_H__ > >> +struct kobj_map; > >> +struct task_struct; > >> + > >> +/* > >> + * task_[cb]dev_map - get a map from task. Both calls may return > >> + * NULL, to indicate, that task doesn't belong to any group and > >> + * that the global map is to be used. > >> + */ > >> + > >> +#ifdef CONFIG_CGROUP_DEVS > >> +struct kobj_map *task_cdev_map(struct task_struct *); > >> +struct kobj_map *task_bdev_map(struct task_struct *); > >> +#else > >> +static inline kobj_map *task_cdev_map(struct task_struct *tsk) > >> +{ > >> + return NULL; > >> +} > >> + > >> +static inline kobj_map *task_bdev_map(struct task_struct *tsk) > >> +{ > >> + return NULL; > >> +} > >> +#endif > >> +#endif > >> diff --git a/init/Kconfig b/init/Kconfig > >> index 732a1c2..f9a1b4f 100644 > >> --- a/init/Kconfig > >> +++ b/init/Kconfig > >> @@ -283,6 +283,19 @@ config CGROUP_DEBUG > >> > >> Say N if unsure > >> > >> +config CGROUP_DEVS > >> + bool "Devices cgroup subsystem" > >> + depends on CGROUPS > >> + help > >> + Controlls the access rights to devices, i.e. you may hide > >> + some of them from tasks, so that they will not be able > >> + to open them, or you may grant a read-only access to some > >> + of them. > >> + > >> + See Documentation/controllers/devices.txt for details. > >> + > >> + This is harmless to say N here, so do it if unsure. > >> + > >> config CGROUP_NS > >> bool "Namespace cgroup subsystem" > >> depends on CGROUPS > > ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <20080212172134.GA12177-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <20080212172134.GA12177-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org> @ 2008-02-13 2:17 ` Paul Menage [not found] ` <6599ad830802121817n7713fa85h51aedf4df74aa764-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2008-02-13 2:32 ` Paul Jackson 1 sibling, 1 reply; 42+ messages in thread From: Paul Menage @ 2008-02-13 2:17 UTC (permalink / raw) To: Serge E. Hallyn; +Cc: Linux Containers, Pavel Emelyanov On Feb 12, 2008 9:21 AM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > Paul (actually both Menage and Jackson :) do you have an opinion on > this? Are there sites which eg do 'chown -R some_user_id /cgroup/cpusets/' > to have some non-root user be able to dole out cpusets? We (Google) currently only do that for the tasks file, to allow users to move their processes between cpusets/groups that they own. > Is there any > way it would be ok to have cgroup_file_write() check for CAP_SYS_ADMIN? Sure, we could have flags in the subsys object for this kind of thing that let particular subsystems request this. Paul ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <6599ad830802121817n7713fa85h51aedf4df74aa764-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <6599ad830802121817n7713fa85h51aedf4df74aa764-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-02-13 2:42 ` Paul Jackson [not found] ` <20080212204215.2eca689f.pj-sJ/iWh9BUns@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Paul Jackson @ 2008-02-13 2:42 UTC (permalink / raw) To: Paul Menage Cc: containers-qjLDD68F18O7TbgM5vRIOg, xemul-GEFAQzZX7r8dnm+yROfE0A Paul M wrote: > > Is there any > > way it would be ok to have cgroup_file_write() check for CAP_SYS_ADMIN? > > Sure, we could have flags in the subsys object for this kind of thing > that let particular subsystems request this. That sounds like it would work fine ... the cpuset subsystem would just not request this. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj-sJ/iWh9BUns@public.gmane.org> 1.940.382.4214 ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <20080212204215.2eca689f.pj-sJ/iWh9BUns@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <20080212204215.2eca689f.pj-sJ/iWh9BUns@public.gmane.org> @ 2008-02-14 17:17 ` Serge E. Hallyn 0 siblings, 0 replies; 42+ messages in thread From: Serge E. Hallyn @ 2008-02-14 17:17 UTC (permalink / raw) To: Paul Jackson Cc: containers-qjLDD68F18O7TbgM5vRIOg, Paul Menage, xemul-GEFAQzZX7r8dnm+yROfE0A Quoting Paul Jackson (pj-sJ/iWh9BUns@public.gmane.org): > Paul M wrote: > > > Is there any > > > way it would be ok to have cgroup_file_write() check for CAP_SYS_ADMIN? > > > > Sure, we could have flags in the subsys object for this kind of thing > > that let particular subsystems request this. > > That sounds like it would work fine ... the cpuset subsystem would > just not request this. Cool I'll keep it in mind then - though really right now it seems like it just amounts to slightly more complicated approach than just having the subsystems do the check in _write... thanks, -serge ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] The control group itself [not found] ` <20080212172134.GA12177-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org> 2008-02-13 2:17 ` Paul Menage @ 2008-02-13 2:32 ` Paul Jackson [not found] ` <20080212203215.fb636900.pj-sJ/iWh9BUns@public.gmane.org> 1 sibling, 1 reply; 42+ messages in thread From: Paul Jackson @ 2008-02-13 2:32 UTC (permalink / raw) To: Serge E. Hallyn Cc: containers-qjLDD68F18O7TbgM5vRIOg, menage-hpIqsD4AKlfQT0dZR+AlfA, xemul-GEFAQzZX7r8dnm+yROfE0A Serge wrote: > Paul (actually both Menage and Jackson :) do you have an opinion on > this? Are there sites which eg do 'chown -R some_user_id /cgroup/cpusets/' > to have some non-root user be able to dole out cpusets? Is there any > way it would be ok to have cgroup_file_write() check for CAP_SYS_ADMIN? I don't know what my users actually do here ... I'm a couple layers removed from that reality. But certainly I've recommended that they sometimes do things like having the batch scheduler chown the files of each jobs cpuset to the uid of the user running that job, so that the job can manipulate its own cpuset allocate resources in finer detail. One of the more elaborate ways of doing this nests a pair of cpusets, with the parent owned by the batch scheduler confining the child owned by the individual job. The job can actually do things like write its own cpus and mems files, but is confined by the parent cpuset to only specify cpus and mems assigned to that job. As to how this affects your question ... I'm not sure. However I suspect that an added requirement for CAP_SYS_ADMIN would cause breakage and not be a good idea. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj-sJ/iWh9BUns@public.gmane.org> 1.940.382.4214 ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <20080212203215.fb636900.pj-sJ/iWh9BUns@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <20080212203215.fb636900.pj-sJ/iWh9BUns@public.gmane.org> @ 2008-02-14 17:18 ` Serge E. Hallyn 0 siblings, 0 replies; 42+ messages in thread From: Serge E. Hallyn @ 2008-02-14 17:18 UTC (permalink / raw) To: Paul Jackson Cc: containers-qjLDD68F18O7TbgM5vRIOg, menage-hpIqsD4AKlfQT0dZR+AlfA, xemul-GEFAQzZX7r8dnm+yROfE0A Quoting Paul Jackson (pj-sJ/iWh9BUns@public.gmane.org): > Serge wrote: > > Paul (actually both Menage and Jackson :) do you have an opinion on > > this? Are there sites which eg do 'chown -R some_user_id /cgroup/cpusets/' > > to have some non-root user be able to dole out cpusets? Is there any > > way it would be ok to have cgroup_file_write() check for CAP_SYS_ADMIN? > > I don't know what my users actually do here ... I'm a couple layers > removed from that reality. But certainly I've recommended that they > sometimes do things like having the batch scheduler chown the files > of each jobs cpuset to the uid of the user running that job, so that > the job can manipulate its own cpuset allocate resources in finer > detail. > > One of the more elaborate ways of doing this nests a pair of cpusets, > with the parent owned by the batch scheduler confining the child > owned by the individual job. The job can actually do things like > write its own cpus and mems files, but is confined by the parent > cpuset to only specify cpus and mems assigned to that job. > > As to how this affects your question ... I'm not sure. However I > suspect that an added requirement for CAP_SYS_ADMIN would cause > breakage and not be a good idea. Yeah, I guess a more general mechanism to couple a user namespace's connection to a mount is the right way to go. If we can just specify that root in this namespace is not root in that namespace (or any other userid we've chowned the files to), we've got what we need. thanks, -serge ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] The control group itself [not found] ` <47AB013B.8060502-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 2008-02-11 17:38 ` Serge E. Hallyn @ 2008-02-12 7:42 ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA [not found] ` <20080212074217.GA15992-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2008-02-21 20:47 ` Paul Menage 2 siblings, 1 reply; 42+ messages in thread From: sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2008-02-12 7:42 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage This patchset does fix the problem I was having before with null and zero devices. Overall, it looks like pretty good. I am still reviewing the patches. Just some nits I came across: Pavel Emelianov [xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org] wrote: | Each new group will have its own maps for char and block | layers. The devices access list is tuned via the | devices.permissions file. One may read from the file to get | the configured state. | | The top container isn't initialized, so that the char | and block layers will use the global maps to lookup | their devices. I did that not to export the static maps | to the outer world. | | Good news is that this patch now contains more comments | and Documentation file :) | | Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> | | --- | | diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt | new file mode 100644 | index 0000000..dbd0c7a | --- /dev/null | +++ b/Documentation/controllers/devices.txt | @@ -0,0 +1,61 @@ | + | + Devices visibility controller | + | +This controller allows to tune the devices accessibility by tasks, | +i.e. grant full access for /dev/null, /dev/zero etc, grant read-only | +access to IDE devices and completely hide SCSI disks. | + | +Tasks still can call mknod to create device files, regardless of | +whether the particular device is visible or accessible, but they | +may not be able to open it later. | + | +This one hides under CONFIG_CGROUP_DEVS option. | + | + | +Configuring | + | +The controller provides a single file to configure itself -- the | +devices.permissions one. To change the accessibility level for some | +device write the following string into it: | + | +[cb] <major>:(<minor>|*) [r-][w-] | + ^ ^ ^ | + | | | | + | | +--- access rights (1) | + | | | + | +-- device major and minor numbers (2) | + | | + +-- device type (character / block) | + | +1) The access rights set to '--' remove the device from the group's | +access list, so that it will not even be shown in this file later. | + | +2) Setting the minor to '*' grants access to all the minors for | +particular major. | + | +When reading from it, one may see something like | + | + c 1:5 rw | + b 8:* r- | + | +Security issues, concerning who may grant access to what are governed | +at the cgroup infrastructure level. | + | + | +Examples: | + | +1. Grand full access to /dev/null Grant. | + # echo c 1:3 rw > /cgroups/<id>/devices.permissions | + | +2. Grant the read-only access to /dev/sda and partitions | + # echo b 8:* r- > ... This grants access to all scsi disks, sda..sdp and not just 'sda' right ? | + | +3. Change the /dev/null access to write-only | + # echo c 1:3 -w > ... | + | +4. Revoke access to /dev/sda | + # echo b 8:* -- > ... | + | + | + Written by Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> | + | diff --git a/fs/Makefile b/fs/Makefile | index 7996220..5ad03be 100644 | --- a/fs/Makefile | +++ b/fs/Makefile | @@ -64,6 +64,8 @@ obj-y += devpts/ | | obj-$(CONFIG_PROFILING) += dcookies.o | obj-$(CONFIG_DLM) += dlm/ | + | +obj-$(CONFIG_CGROUP_DEVS) += devscontrol.o | | # Do not add any filesystems before this line | obj-$(CONFIG_REISERFS_FS) += reiserfs/ | diff --git a/fs/devscontrol.c b/fs/devscontrol.c | new file mode 100644 | index 0000000..48c5f69 | --- /dev/null | +++ b/fs/devscontrol.c | @@ -0,0 +1,314 @@ | +/* | + * devscontrol.c - Device Controller | + * | + * Copyright 2007 OpenVZ SWsoft Inc | + * Author: Pavel Emelyanov <xemul at openvz dot org> | + * | + * This program is free software; you can redistribute it and/or modify | + * it under the terms of the GNU General Public License as published by | + * the Free Software Foundation; either version 2 of the License, or | + * (at your option) any later version. | + * | + * This program is distributed in the hope that it will be useful, | + * but WITHOUT ANY WARRANTY; without even the implied warranty of | + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | + * GNU General Public License for more details. | + */ | + | +#include <linux/cgroup.h> | +#include <linux/cdev.h> | +#include <linux/err.h> | +#include <linux/devscontrol.h> | +#include <linux/uaccess.h> | +#include <linux/fs.h> | +#include <linux/genhd.h> | + | +struct devs_cgroup { | + /* | + * The subsys state to build into cgrous infrastructure | + */ ... into cgroups | + struct cgroup_subsys_state css; | + | + /* | + * The maps of character and block devices. They provide a | + * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c | + * and block/genhd.c to find out how the ->open() callbacks | + * work when opening a device. | + * | + * Each group will have its onw maps, and at the open() own maps | + * time code will lookup in this map to get the device | + * and permissions by its dev_t. | + */ | + struct kobj_map *cdev_map; | + struct kobj_map *bdev_map; | +}; | + | +static inline | +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css) | +{ | + return container_of(css, struct devs_cgroup, css); | +} 'devs' as prefix/suffix does not look very clear. How about css_to_devs_cg() ? Similarly below for dev_cg_create(), dev_cg_destroy() ? ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <20080212074217.GA15992-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <20080212074217.GA15992-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2008-02-12 7:51 ` Pavel Emelyanov 0 siblings, 0 replies; 42+ messages in thread From: Pavel Emelyanov @ 2008-02-12 7:51 UTC (permalink / raw) To: sukadev-r/Jw6+rmf7HQT0dZR+AlfA; +Cc: Linux Containers, Paul Menage sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org wrote: > This patchset does fix the problem I was having before with null and > zero devices. Overall, it looks like pretty good. > > I am still reviewing the patches. Just some nits I came across: > > > Pavel Emelianov [xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org] wrote: > | Each new group will have its own maps for char and block > | layers. The devices access list is tuned via the > | devices.permissions file. One may read from the file to get > | the configured state. > | > | The top container isn't initialized, so that the char > | and block layers will use the global maps to lookup > | their devices. I did that not to export the static maps > | to the outer world. > | > | Good news is that this patch now contains more comments > | and Documentation file :) > | > | Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> > | > | --- > | > | diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt > | new file mode 100644 > | index 0000000..dbd0c7a > | --- /dev/null > | +++ b/Documentation/controllers/devices.txt > | @@ -0,0 +1,61 @@ > | + > | + Devices visibility controller > | + > | +This controller allows to tune the devices accessibility by tasks, > | +i.e. grant full access for /dev/null, /dev/zero etc, grant read-only > | +access to IDE devices and completely hide SCSI disks. > | + > | +Tasks still can call mknod to create device files, regardless of > | +whether the particular device is visible or accessible, but they > | +may not be able to open it later. > | + > | +This one hides under CONFIG_CGROUP_DEVS option. > | + > | + > | +Configuring > | + > | +The controller provides a single file to configure itself -- the > | +devices.permissions one. To change the accessibility level for some > | +device write the following string into it: > | + > | +[cb] <major>:(<minor>|*) [r-][w-] > | + ^ ^ ^ > | + | | | > | + | | +--- access rights (1) > | + | | > | + | +-- device major and minor numbers (2) > | + | > | + +-- device type (character / block) > | + > | +1) The access rights set to '--' remove the device from the group's > | +access list, so that it will not even be shown in this file later. > | + > | +2) Setting the minor to '*' grants access to all the minors for > | +particular major. > | + > | +When reading from it, one may see something like > | + > | + c 1:5 rw > | + b 8:* r- > | + > | +Security issues, concerning who may grant access to what are governed > | +at the cgroup infrastructure level. > | + > | + > | +Examples: > | + > | +1. Grand full access to /dev/null > > Grant. :) > | + # echo c 1:3 rw > /cgroups/<id>/devices.permissions > | + > | +2. Grant the read-only access to /dev/sda and partitions > | + # echo b 8:* r- > ... > > This grants access to all scsi disks, sda..sdp and not just 'sda' right ? Well, yes. I'll fix the comment like ;Grant the RO access to scsi disks. > | + > | +3. Change the /dev/null access to write-only > | + # echo c 1:3 -w > ... > | + > | +4. Revoke access to /dev/sda > | + # echo b 8:* -- > ... > | + > | + > | + Written by Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> > | + > | diff --git a/fs/Makefile b/fs/Makefile > | index 7996220..5ad03be 100644 > | --- a/fs/Makefile > | +++ b/fs/Makefile > | @@ -64,6 +64,8 @@ obj-y += devpts/ > | > | obj-$(CONFIG_PROFILING) += dcookies.o > | obj-$(CONFIG_DLM) += dlm/ > | + > | +obj-$(CONFIG_CGROUP_DEVS) += devscontrol.o > | > | # Do not add any filesystems before this line > | obj-$(CONFIG_REISERFS_FS) += reiserfs/ > | diff --git a/fs/devscontrol.c b/fs/devscontrol.c > | new file mode 100644 > | index 0000000..48c5f69 > | --- /dev/null > | +++ b/fs/devscontrol.c > | @@ -0,0 +1,314 @@ > | +/* > | + * devscontrol.c - Device Controller > | + * > | + * Copyright 2007 OpenVZ SWsoft Inc > | + * Author: Pavel Emelyanov <xemul at openvz dot org> > | + * > | + * This program is free software; you can redistribute it and/or modify > | + * it under the terms of the GNU General Public License as published by > | + * the Free Software Foundation; either version 2 of the License, or > | + * (at your option) any later version. > | + * > | + * This program is distributed in the hope that it will be useful, > | + * but WITHOUT ANY WARRANTY; without even the implied warranty of > | + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > | + * GNU General Public License for more details. > | + */ > | + > | +#include <linux/cgroup.h> > | +#include <linux/cdev.h> > | +#include <linux/err.h> > | +#include <linux/devscontrol.h> > | +#include <linux/uaccess.h> > | +#include <linux/fs.h> > | +#include <linux/genhd.h> > | + > | +struct devs_cgroup { > | + /* > | + * The subsys state to build into cgrous infrastructure > | + */ > > ... into cgroups > > | + struct cgroup_subsys_state css; > | + > | + /* > | + * The maps of character and block devices. They provide a > | + * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c > | + * and block/genhd.c to find out how the ->open() callbacks > | + * work when opening a device. > | + * > | + * Each group will have its onw maps, and at the open() > > own maps > > | + * time code will lookup in this map to get the device > | + * and permissions by its dev_t. > | + */ > | + struct kobj_map *cdev_map; > | + struct kobj_map *bdev_map; > | +}; > | + > | +static inline > | +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css) > | +{ > | + return container_of(css, struct devs_cgroup, css); > | +} > > 'devs' as prefix/suffix does not look very clear. > > How about css_to_devs_cg() ? Similarly below for dev_cg_create(), > dev_cg_destroy() ? These names are internal to devscontrol.c, so I'd like to keep them as short as possible. Thanks, Pavel ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] The control group itself [not found] ` <47AB013B.8060502-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 2008-02-11 17:38 ` Serge E. Hallyn 2008-02-12 7:42 ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2008-02-21 20:47 ` Paul Menage [not found] ` <6599ad830802211247t21fdc4e4hfe637fcffd98ded7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2 siblings, 1 reply; 42+ messages in thread From: Paul Menage @ 2008-02-21 20:47 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Linux Containers On Thu, Feb 7, 2008 at 5:01 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote: > + > +[cb] <major>:(<minor>|*) [r-][w-] > + ^ ^ ^ > + | | | > + | | +--- access rights (1) > + | | > + | +-- device major and minor numbers (2) > + | > + +-- device type (character / block) > ... > +When reading from it, one may see something like > + > + c 1:5 rw > + b 8:* r- > + In the interest of avoiding proliferating cgroup control file formats, I'm wondering if we can abstract out the general form of the data being presented here and maybe simplify it in such a way that we can hopefully reuse the format for other control files in the future? For example, one way to represent this would be a map from device strings such c:1:5 to permission strings such as rw. Or maybe numerical device ids to numerical permission values. The alternative might be to accept that there are two kinds of control files - those which are likely to be programmatically read (e.g. resource usage values), and those that are likely to be programmatically written but only actually read by humans for debugging purposes (like this one) and make it clear up-front when a control file is added which type they're considered to be. We could then ignore the API consistency requirements for the debugging-readable files. On a separate note, can you document the recommended way to completely overwrite the set of device permissions for a cgroup? Does this involves writing a "--" permission for every device that you don't want in the cgroup? Paul ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <6599ad830802211247t21fdc4e4hfe637fcffd98ded7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <6599ad830802211247t21fdc4e4hfe637fcffd98ded7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-02-22 8:12 ` Pavel Emelyanov [not found] ` <47BE83FD.7060908-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Pavel Emelyanov @ 2008-02-22 8:12 UTC (permalink / raw) To: Paul Menage; +Cc: Linux Containers Paul Menage wrote: > On Thu, Feb 7, 2008 at 5:01 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote: >> + >> +[cb] <major>:(<minor>|*) [r-][w-] >> + ^ ^ ^ >> + | | | >> + | | +--- access rights (1) >> + | | >> + | +-- device major and minor numbers (2) >> + | >> + +-- device type (character / block) >> ... >> +When reading from it, one may see something like >> + >> + c 1:5 rw >> + b 8:* r- >> + > > In the interest of avoiding proliferating cgroup control file formats, > I'm wondering if we can abstract out the general form of the data > being presented here and maybe simplify it in such a way that we can > hopefully reuse the format for other control files in the future? > > For example, one way to represent this would be a map from device > strings such c:1:5 to permission strings such as rw. Or maybe > numerical device ids to numerical permission values. You mean smth like <some_device_id><space><some_permissions_string>? Well, I don't mind, but AFAIK the <major>:<minor> form is very common for specifying the device. So I agree with the 'c:1:5 rw' form. > The alternative might be to accept that there are two kinds of control > files - those which are likely to be programmatically read (e.g. > resource usage values), and those that are likely to be > programmatically written but only actually read by humans for > debugging purposes (like this one) and make it clear up-front when a > control file is added which type they're considered to be. We could > then ignore the API consistency requirements for the > debugging-readable files. Hmm, you mean make them a binary files? I thought that filesystem-based API should be human readable and writable as much as possible... > On a separate note, can you document the recommended way to completely > overwrite the set of device permissions for a cgroup? Does this There's not way to flush all the permissions in this implementation, but I though about one. Maybe 'echo 0 > devices.permissions' would be good? > involves writing a "--" permission for every device that you don't > want in the cgroup? Currently - yes. > Paul > ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <47BE83FD.7060908-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <47BE83FD.7060908-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> @ 2008-02-23 23:12 ` Paul Menage [not found] ` <6599ad830802231512t20343cabq738df3039c8a1d1f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 42+ messages in thread From: Paul Menage @ 2008-02-23 23:12 UTC (permalink / raw) To: Pavel Emelyanov; +Cc: Linux Containers On Fri, Feb 22, 2008 at 12:12 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote: > > Hmm, you mean make them a binary files? No, not by default. But I'm working on a plan to have an optional binary API to cgroups, that would allow multiple control files to be read in a binary format with a single system call. The existing API would still be available as well, of course. The idea would be that monitoring programs that frequently read lots of values from a single cgroup (or even multiple cgroups) would be able to do so more cheaply than by making multiple different reads on different files. In order for this to work, CGroups needs to know the data type of a given control file - so this would only be available for the control files that use typed cgroup output methods rather than the raw file output interface. > I thought that filesystem-based > API should be human readable and writable as much as possible... > Yes, but even without a binary API it makes sense for values that are likely to be parsed by programs be in a consistent format. But after thinking more about this, I think that the devices permission control file output doesn't really fall under this category - from a programmatic point of view, I suspect it's write-only, and only humans will be reading the output, for debugging. Paul ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <6599ad830802231512t20343cabq738df3039c8a1d1f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 4/4] The control group itself [not found] ` <6599ad830802231512t20343cabq738df3039c8a1d1f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-02-26 7:54 ` Pavel Emelyanov 0 siblings, 0 replies; 42+ messages in thread From: Pavel Emelyanov @ 2008-02-26 7:54 UTC (permalink / raw) To: Paul Menage; +Cc: Linux Containers Paul Menage wrote: > On Fri, Feb 22, 2008 at 12:12 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote: >> Hmm, you mean make them a binary files? > > No, not by default. But I'm working on a plan to have an optional > binary API to cgroups, that would allow multiple control files to be > read in a binary format with a single system call. The existing API > would still be available as well, of course. The idea would be that > monitoring programs that frequently read lots of values from a single > cgroup (or even multiple cgroups) would be able to do so more cheaply > than by making multiple different reads on different files. > > In order for this to work, CGroups needs to know the data type of a > given control file - so this would only be available for the control > files that use typed cgroup output methods rather than the raw file > output interface. Sounds reasonable. >> I thought that filesystem-based >> API should be human readable and writable as much as possible... >> > > Yes, but even without a binary API it makes sense for values that are > likely to be parsed by programs be in a consistent format. > > But after thinking more about this, I think that the devices > permission control file output doesn't really fall under this category > - from a programmatic point of view, I suspect it's write-only, and > only humans will be reading the output, for debugging. Yup. So, if you're fine with the proposed API, I think I will prepare this set and send it to Andrew this week. > Paul > Thanks, Pavel ^ permalink raw reply [flat|nested] 42+ messages in thread
end of thread, other threads:[~2008-02-26 7:54 UTC | newest]
Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-08 9:02 [PATCH 0/4] Devices accessibility control group (v2) Pavel Emelyanov
[not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-08 9:07 ` [PATCH 1/4] Some changes in the kobject mapper Pavel Emelyanov
[not found] ` <47833D43.3090703-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-08 18:36 ` Daniel Hokka Zakrisson
[not found] ` <4783C2B4.7000501-nym3zxDgnZcAvxtiuMwx3w@public.gmane.org>
2008-01-08 19:17 ` Dave Hansen
2008-01-08 9:12 ` [PATCH 2/4] The character devices layer changes Pavel Emelyanov
[not found] ` <47833E93.6010108-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-14 17:03 ` Serge E. Hallyn
[not found] ` <20080114170333.GA15077-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2008-01-15 8:05 ` Pavel Emelyanov
[not found] ` <478C6942.4050903-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-15 14:54 ` Serge E. Hallyn
2008-01-08 9:15 ` [PATCH 3/4] The block " Pavel Emelyanov
2008-01-08 9:18 ` [PATCH 4/4] The control group itself Pavel Emelyanov
[not found] ` <47833FF6.6060901-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-14 17:40 ` Serge E. Hallyn
[not found] ` <20080114174056.GB15077-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2008-01-15 7:53 ` Pavel Emelyanov
[not found] ` <478C6669.7070705-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-15 14:44 ` Serge E. Hallyn
[not found] ` <20080115144440.GE4453-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2008-01-15 16:13 ` Paul Menage
[not found] ` <6599ad830801150813s6a5a7374qd25b6d6206d5896a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-01-15 17:49 ` Serge E. Hallyn
[not found] ` <20080115174941.GA11638-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2008-01-15 17:54 ` Paul Menage
[not found] ` <6599ad830801150954w7e1b6db0p4dd737730f407348-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-01-15 18:17 ` Serge E. Hallyn
2008-01-14 21:54 ` Paul Menage
[not found] ` <6599ad830801141354p5b165cdao8d6184adb9ab61b6-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-01-15 7:58 ` Pavel Emelyanov
2008-01-12 21:20 ` [PATCH 0/4] Devices accessibility control group (v2) sukadev-r/Jw6+rmf7HQT0dZR+AlfA
[not found] ` <20080112212014.GA12085-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-01-14 7:52 ` Pavel Emelyanov
[not found] ` <478B14DB.4000106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-14 17:42 ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
[not found] ` <20080114174220.GA17825-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-01-15 8:22 ` Pavel Emelyanov
[not found] ` <478C6D2B.6020904-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-17 6:26 ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
[not found] ` <20080117062605.GA24475-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-01-21 8:31 ` Pavel Emelyanov
2008-01-14 21:18 ` Paul Menage
[not found] ` <6599ad830801141318h121a6a80h9af68c52431c48b8-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-01-15 8:06 ` Pavel Emelyanov
-- strict thread matches above, loose matches on Subject: below --
2008-02-07 12:56 [PATCH 0/4] Devices accessibility control group (v3, release candidate) Pavel Emelyanov
[not found] ` <47AAFFF2.9030804-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-02-07 13:01 ` [PATCH 4/4] The control group itself Pavel Emelyanov
[not found] ` <47AB013B.8060502-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-02-11 17:38 ` Serge E. Hallyn
[not found] ` <20080211173830.GA22160-6s5zFf/epYL1ENwx4SLHqw@public.gmane.org>
2008-02-12 10:28 ` Pavel Emelyanov
[not found] ` <47B174B2.5010500-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-02-12 17:21 ` Serge E. Hallyn
[not found] ` <20080212172134.GA12177-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2008-02-13 2:17 ` Paul Menage
[not found] ` <6599ad830802121817n7713fa85h51aedf4df74aa764-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-02-13 2:42 ` Paul Jackson
[not found] ` <20080212204215.2eca689f.pj-sJ/iWh9BUns@public.gmane.org>
2008-02-14 17:17 ` Serge E. Hallyn
2008-02-13 2:32 ` Paul Jackson
[not found] ` <20080212203215.fb636900.pj-sJ/iWh9BUns@public.gmane.org>
2008-02-14 17:18 ` Serge E. Hallyn
2008-02-12 7:42 ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
[not found] ` <20080212074217.GA15992-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-02-12 7:51 ` Pavel Emelyanov
2008-02-21 20:47 ` Paul Menage
[not found] ` <6599ad830802211247t21fdc4e4hfe637fcffd98ded7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-02-22 8:12 ` Pavel Emelyanov
[not found] ` <47BE83FD.7060908-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-02-23 23:12 ` Paul Menage
[not found] ` <6599ad830802231512t20343cabq738df3039c8a1d1f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-02-26 7:54 ` Pavel Emelyanov
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.