All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] Devices accessibility control group (v2)
@ 2008-01-08  9:02 Pavel Emelyanov
       [not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Pavel Emelyanov @ 2008-01-08  9:02 UTC (permalink / raw)
  To: Serge Hallyn, Oren Laadan; +Cc: Linux Containers, Paul Menage

The first version was posted long ago 
(http://openvz.org/pipermail/devel/2007-September/007647.html)
and since then there are many (good I hope) changes:

* Added the block devices support :) It turned out to
  be a bit simpler than the char one (or I missed
  something significant);
* Now we can enable/disable not just individual devices,
  but the whole major with all its minors (see the TODO
  list beyond as well);
* Added the ability to restrict the read/write permissions
  to devices, not just visible/invisible state.

That is - the main features I wished to implement right
after the v1 was sent. Some minor changes are:

* I merged the devices.char and devices.block files into
  one - devices.permissions;
* As the result of the change above - the strings passed
  to this file has changed. Now they are
         [bc] <major>:{<minor>|*} [r-][w-]
  E.g. b 5:2 r- will grant the read permissions to the
  block 5:2 device and c 3:* -w will grant the write-only
  access to all the character devices with the major 5.

However, there are some things to be done:

* Make the /proc/devices show relevant info depending on
  who is reading it. This seems to be easy to do, since
  I already have the support to dump similar info into the
  devices.permissions file, but I haven't tried to use
  this in /proc/devices yet;
* Add the support for devices ranges. I.e. someone might
  wish to tell smth like b 5:[0-10] r- to this subsystem.
  Currently this is not supported and I'm afraid that if we
  start support minor ranges we'll have smth similar to
  VMA-s or FLOCK-s ranges management in one more place in the
  kernel.
* One more question is - are there any other permissions to
  work with? E.g. in OpenVZ we have a separate bit for 
  quota management, maybe we can invent some more...

Currently I didn't pay much attention to split this set well,
so this will most likely won't work with git-bisect, but I 
think this is OK for now. I will sure split it better when I
send the v3 and further.

The set is prepared against the 2.6.24-rc5-mm1.

All this is minimally tested and seems to work. Hope to hear
you comments, wishes and patches soon :)

To play with it - run a standard procedure:

 # mount -t container none /cont/devs -o devices
 # mkdir /cont/devs/0
 # echo -n $$ > /cont/devs/0/tasks

and tune device permissions.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/4] Some changes in the kobject mapper
       [not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2008-01-08  9:07   ` Pavel Emelyanov
       [not found]     ` <47833D43.3090703-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  2008-01-08  9:12   ` [PATCH 2/4] The character devices layer changes Pavel Emelyanov
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 42+ messages in thread
From: Pavel Emelyanov @ 2008-01-08  9:07 UTC (permalink / raw)
  To: Serge Hallyn, Oren Laadan; +Cc: Linux Containers, Paul Menage

The main thing that I want from the kobj mapper is to add 
the mode_t on the struct kobj_map that reflects with 
permissions are associated with this particular map. This 
mode is to be returned via the kobj_lookup.

I use the FMODE_XXX flags to handle the permissions bits, 
as I will compare these ones to the file->f_mode later.
By default all bits are set (for the initial container).

The additional things I need are kobj_remap() to change 
that permission and kobj_iterate() to walk the map.

The kobj_map_fini() is the roll-back for the kobj_map_init().

Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>

---

diff --git a/drivers/base/map.c b/drivers/base/map.c
index e87017f..1aa2b58 100644
--- a/drivers/base/map.c
+++ b/drivers/base/map.c
@@ -15,11 +15,13 @@
 #include <linux/kdev_t.h>
 #include <linux/kobject.h>
 #include <linux/kobj_map.h>
+#include <linux/fs.h>
 
 struct kobj_map {
 	struct probe {
 		struct probe *next;
 		dev_t dev;
+		mode_t mode;
 		unsigned long range;
 		struct module *owner;
 		kobj_probe_t *get;
@@ -29,9 +31,9 @@ struct kobj_map {
 	struct mutex *lock;
 };
 
-int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
-	     struct module *module, kobj_probe_t *probe,
-	     int (*lock)(dev_t, void *), void *data)
+static int __kobj_map(struct kobj_map *domain, dev_t dev, mode_t mode,
+		unsigned long range, struct module *module,
+		kobj_probe_t *probe, int (*lock)(dev_t, void *), void *data)
 {
 	unsigned n = MAJOR(dev + range - 1) - MAJOR(dev) + 1;
 	unsigned index = MAJOR(dev);
@@ -53,8 +55,10 @@ int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
 		p->dev = dev;
 		p->range = range;
 		p->data = data;
+		/* we allow these ones always by default */
+		p->mode = mode | FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
 	}
-	mutex_lock(domain->lock);
+
 	for (i = 0, p -= n; i < n; i++, p++, index++) {
 		struct probe **s = &domain->probes[index % 255];
 		while (*s && (*s)->range < range)
@@ -62,10 +66,57 @@ int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
 		p->next = *s;
 		*s = p;
 	}
-	mutex_unlock(domain->lock);
 	return 0;
 }
 
+int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
+	     struct module *module, kobj_probe_t *probe,
+	     int (*lock)(dev_t, void *), void *data)
+{
+	int err;
+
+	mutex_lock(domain->lock);
+	err = __kobj_map(domain, dev, FMODE_READ | FMODE_WRITE, range,
+			module, probe, lock, data);
+	mutex_unlock(domain->lock);
+	return err;
+}
+
+#ifdef CONFIG_CGROUP_DEVS
+int kobj_remap(struct kobj_map *domain, dev_t dev, mode_t mode,
+		unsigned long range, struct module *module,
+		kobj_probe_t *probe, int (*lock)(dev_t, void *), void *data)
+{
+	unsigned n = MAJOR(dev + range - 1) - MAJOR(dev) + 1;
+	unsigned index = MAJOR(dev);
+	unsigned i;
+	int err = -ESRCH;
+
+	if (n > 255)
+		n = 255;
+
+	mutex_lock(domain->lock);
+	for (i = 0; i < n; i++, index++) {
+		struct probe **s;
+		for (s = &domain->probes[index % 255]; *s; s = &(*s)->next) {
+			struct probe *p = *s;
+			if (p->dev == dev) {
+				p->mode = mode | FMODE_LSEEK |
+					FMODE_PREAD | FMODE_PWRITE;
+				err = 0;
+				break;
+			}
+		}
+	}
+
+	if (err)
+		err = __kobj_map(domain, dev, mode, range, module,
+				probe, lock, data);
+	mutex_unlock(domain->lock);
+	return err;
+}
+#endif
+
 void kobj_unmap(struct kobj_map *domain, dev_t dev, unsigned long range)
 {
 	unsigned n = MAJOR(dev + range - 1) - MAJOR(dev) + 1;
@@ -93,7 +144,8 @@ void kobj_unmap(struct kobj_map *domain, dev_t dev, unsigned long range)
 	kfree(found);
 }
 
-struct kobject *kobj_lookup(struct kobj_map *domain, dev_t dev, int *index)
+struct kobject *kobj_lookup(struct kobj_map *domain, dev_t dev, mode_t *mode,
+		int *index)
 {
 	struct kobject *kobj;
 	struct probe *p;
@@ -125,14 +177,46 @@ retry:
 		kobj = probe(dev, index, data);
 		/* Currently ->owner protects _only_ ->probe() itself. */
 		module_put(owner);
-		if (kobj)
+		if (kobj) {
+			if (mode)
+				*mode = p->mode;
 			return kobj;
+		}
 		goto retry;
 	}
 	mutex_unlock(domain->lock);
 	return NULL;
 }
 
+#ifdef CONFIG_CGROUP_DEVS
+void kobj_map_iterate(struct kobj_map *domain,
+		int (*fn)(dev_t, int, mode_t, void *), void *arg)
+{
+	int i;
+	struct probe *p;
+	dev_t skip = MKDEV(0, 0);
+
+	mutex_lock(domain->lock);
+	for (i = 0; i < 255; i++) {
+		p = domain->probes[i];
+		while (p != NULL) {
+			if (p->dev == skip)
+				goto next;
+			if (p->dev == MKDEV(0, 1))
+				goto next;
+
+			skip = p->dev;
+			if (fn(p->dev, p->range, p->mode, arg))
+				goto done;
+next:
+			p = p->next;
+		}
+	}
+done:
+	mutex_unlock(domain->lock);
+}
+#endif
+
 struct kobj_map *kobj_map_init(kobj_probe_t *base_probe, struct mutex *lock)
 {
 	struct kobj_map *p = kmalloc(sizeof(struct kobj_map), GFP_KERNEL);
@@ -153,3 +237,21 @@ struct kobj_map *kobj_map_init(kobj_probe_t *base_probe, struct mutex *lock)
 	p->lock = lock;
 	return p;
 }
+
+void kobj_map_fini(struct kobj_map *map)
+{
+	int i;
+	struct probe *p, *next;
+
+	for (i = 0; i < 256; i++) {
+		p = map->probes[i];
+		while (p->next != NULL) {
+			next = p->next;
+			kfree(p);
+			p = next;
+		}
+	}
+
+	kfree(p);
+	kfree(map);
+}
diff --git a/include/linux/kobj_map.h b/include/linux/kobj_map.h
index bafe178..ecfe772 100644
--- a/include/linux/kobj_map.h
+++ b/include/linux/kobj_map.h
@@ -7,8 +7,13 @@ struct kobj_map;
 
 int kobj_map(struct kobj_map *, dev_t, unsigned long, struct module *,
 	     kobj_probe_t *, int (*)(dev_t, void *), void *);
+int kobj_remap(struct kobj_map *, dev_t, mode_t, unsigned long, struct module *,
+	     kobj_probe_t *, int (*)(dev_t, void *), void *);
 void kobj_unmap(struct kobj_map *, dev_t, unsigned long);
-struct kobject *kobj_lookup(struct kobj_map *, dev_t, int *);
+struct kobject *kobj_lookup(struct kobj_map *, dev_t, mode_t *, int *);
+void kobj_map_iterate(struct kobj_map *, int (*fn)(dev_t, int, mode_t, void *),
+		void *);
 struct kobj_map *kobj_map_init(kobj_probe_t *, struct mutex *);
+void kobj_map_fini(struct kobj_map *);
 
 #endif

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 2/4] The character devices layer changes
       [not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  2008-01-08  9:07   ` [PATCH 1/4] Some changes in the kobject mapper Pavel Emelyanov
@ 2008-01-08  9:12   ` Pavel Emelyanov
       [not found]     ` <47833E93.6010108-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  2008-01-08  9:15   ` [PATCH 3/4] The block " Pavel Emelyanov
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 42+ messages in thread
From: Pavel Emelyanov @ 2008-01-08  9:12 UTC (permalink / raw)
  To: Serge Hallyn, Oren Laadan; +Cc: Linux Containers, Paul Menage

These changes include the API for the control group
to map/remap/unmap the devices with their permissions
and one important thing.

The fact is that the struct cdev is cached in the inode
for faster access, so once we looked one up we go through
the fast path and omit the kobj_lookup() call. This is no
longer good when we restrict the access to cdevs.

To address this issue, I store the last_perm and last(_map)
fields on the struct cdev (and protect them with the cdev_lock)
and force the re-lookup in the kobj mappings if needed.

I know, this might be slow, but I have two points for it:
1. The re-lookup happens on open() only which is not
   a fast-path. Besides, this is so for block layer and
   nobody complains;
2. On a well-isolated setup, when each container has its
   own filesystem this is no longer a problem - each
   cgroup will cache the cdev on its inode and work good.

Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>

---

diff --git a/fs/char_dev.c b/fs/char_dev.c
index c3bfa76..2b821ef 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -22,6 +22,8 @@
 #include <linux/mutex.h>
 #include <linux/backing-dev.h>
 
+#include <linux/devscontrol.h>
+
 #ifdef CONFIG_KMOD
 #include <linux/kmod.h>
 #endif
@@ -362,17 +364,25 @@ int chrdev_open(struct inode * inode, struct file * filp)
 	struct cdev *p;
 	struct cdev *new = NULL;
 	int ret = 0;
+	struct kobj_map *map;
+	mode_t mode;
+
+	map = task_cdev_map(current);
+	if (map == NULL)
+		map = cdev_map;
 
 	spin_lock(&cdev_lock);
 	p = inode->i_cdev;
-	if (!p) {
+	if (!p || p->last != map) {
 		struct kobject *kobj;
 		int idx;
+
 		spin_unlock(&cdev_lock);
-		kobj = kobj_lookup(cdev_map, inode->i_rdev, &idx);
+		kobj = kobj_lookup(map, inode->i_rdev, &mode, &idx);
 		if (!kobj)
 			return -ENXIO;
 		new = container_of(kobj, struct cdev, kobj);
+		BUG_ON(p != NULL && p != new);
 		spin_lock(&cdev_lock);
 		p = inode->i_cdev;
 		if (!p) {
@@ -382,12 +392,24 @@ int chrdev_open(struct inode * inode, struct file * filp)
 			new = NULL;
 		} else if (!cdev_get(p))
 			ret = -ENXIO;
+		else {
+			p->last = map;
+			p->last_mode = mode;
+		}
 	} else if (!cdev_get(p))
 		ret = -ENXIO;
+	else
+		mode = p->last_mode;
 	spin_unlock(&cdev_lock);
 	cdev_put(new);
 	if (ret)
 		return ret;
+
+	if ((filp->f_mode & mode) != filp->f_mode) {
+		cdev_put(p);
+		return -EACCES;
+	}
+
 	filp->f_op = fops_get(p->ops);
 	if (!filp->f_op) {
 		cdev_put(p);
@@ -461,6 +483,64 @@ int cdev_add(struct cdev *p, dev_t dev, unsigned count)
 	return kobj_map(cdev_map, dev, count, NULL, exact_match, exact_lock, p);
 }
 
+#ifdef CONFIG_CGROUP_DEVS
+static inline void cdev_map_reset(struct kobj_map *map, struct cdev *c)
+{
+	spin_lock(&cdev_lock);
+	if (c->last == map)
+		c->last = NULL;
+	spin_unlock(&cdev_lock);
+}
+
+int cdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode)
+{
+	int tmp;
+	struct kobject *k;
+	struct cdev *c;
+
+	k = kobj_lookup(cdev_map, dev, NULL, &tmp);
+	if (k == NULL)
+		return -ENODEV;
+
+	c = container_of(k, struct cdev, kobj);
+	tmp = kobj_remap(map, dev, mode, all ? MINORMASK : 1, NULL,
+			exact_match, exact_lock, c);
+	if (tmp < 0) {
+		cdev_put(c);
+		return tmp;
+	}
+
+	cdev_map_reset(map, c);
+	return 0;
+}
+
+int cdev_del_from_map(struct kobj_map *map, dev_t dev, int all)
+{
+	int tmp;
+	struct kobject *k;
+	struct cdev *c;
+
+	k = kobj_lookup(cdev_map, dev, NULL, &tmp);
+	if (k == NULL)
+		return -ENODEV;
+
+	c = container_of(k, struct cdev, kobj);
+	kobj_unmap(map, dev, all ? MINORMASK : 1);
+
+	cdev_map_reset(map, c);
+
+	cdev_put(c);
+	cdev_put(c);
+	return 0;
+}
+
+void cdev_iterate_map(struct kobj_map *map,
+		int (*fn)(dev_t, int, mode_t, void *), void *x)
+{
+	kobj_map_iterate(map, fn, x);
+}
+#endif
+
 static void cdev_unmap(dev_t dev, unsigned count)
 {
 	kobj_unmap(cdev_map, dev, count);
@@ -542,9 +622,19 @@ static struct kobject *base_probe(dev_t dev, int *part, void *data)
 	return NULL;
 }
 
+struct kobj_map *cdev_map_init(void)
+{
+	return kobj_map_init(base_probe, &chrdevs_lock);
+}
+
+void cdev_map_fini(struct kobj_map *map)
+{
+	kobj_map_fini(map);
+}
+
 void __init chrdev_init(void)
 {
-	cdev_map = kobj_map_init(base_probe, &chrdevs_lock);
+	cdev_map = cdev_map_init();
 	bdi_init(&directly_mappable_cdev_bdi);
 }
 
diff --git a/include/linux/cdev.h b/include/linux/cdev.h
index 1e29b13..d72a2a1 100644
--- a/include/linux/cdev.h
+++ b/include/linux/cdev.h
@@ -9,6 +9,7 @@
 struct file_operations;
 struct inode;
 struct module;
+struct kobj_map;
 
 struct cdev {
 	struct kobject kobj;
@@ -17,6 +18,8 @@ struct cdev {
 	struct list_head list;
 	dev_t dev;
 	unsigned int count;
+	struct kobj_map *last;
+	mode_t last_mode;
 };
 
 void cdev_init(struct cdev *, const struct file_operations *);
@@ -33,5 +36,11 @@ void cd_forget(struct inode *);
 
 extern struct backing_dev_info directly_mappable_cdev_bdi;
 
+int cdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode);
+int cdev_del_from_map(struct kobj_map *map, dev_t dev, int all);
+struct kobj_map *cdev_map_init(void);
+void cdev_map_fini(struct kobj_map *map);
+void cdev_iterate_map(struct kobj_map *,
+		int (*fn)(dev_t, int, mode_t, void *), void *);
 #endif
 #endif

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 3/4] The block devices layer changes
       [not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  2008-01-08  9:07   ` [PATCH 1/4] Some changes in the kobject mapper Pavel Emelyanov
  2008-01-08  9:12   ` [PATCH 2/4] The character devices layer changes Pavel Emelyanov
@ 2008-01-08  9:15   ` Pavel Emelyanov
  2008-01-08  9:18   ` [PATCH 4/4] The control group itself Pavel Emelyanov
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 42+ messages in thread
From: Pavel Emelyanov @ 2008-01-08  9:15 UTC (permalink / raw)
  To: Serge Hallyn, Oren Laadan; +Cc: Linux Containers, Paul Menage

They are the same as for the character layer, but
the good news is that there are no caching in this
case.

So this patch is smaller and easier to understand
as compared to the previous one.

Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>

---

diff --git a/block/genhd.c b/block/genhd.c
index 5e4ab4b..6f9ef48 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -8,6 +8,7 @@
 #include <linux/kdev_t.h>
 #include <linux/kernel.h>
 #include <linux/blkdev.h>
+#include <linux/devscontrol.h>
 #include <linux/init.h>
 #include <linux/spinlock.h>
 #include <linux/seq_file.h>
@@ -195,6 +196,57 @@ void unlink_gendisk(struct gendisk *disk)
 			      disk->minors);
 }
 
+#ifdef CONFIG_CGROUP_DEVS
+int bdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode)
+{
+	int tmp;
+	struct kobject *kobj;
+	struct device *d;
+	struct gendisk *disk;
+
+	kobj = kobj_lookup(bdev_map, dev, NULL, &tmp);
+	if (kobj == NULL)
+		return -ENODEV;
+
+	d = kobj_to_dev(kobj);
+	disk = dev_to_disk(d);
+	tmp = kobj_remap(map, dev, mode, all ? MINORBITS : 1, NULL,
+			exact_match, exact_lock, disk);
+	if (tmp < 0) {
+		put_disk(disk);
+		return tmp;
+	}
+
+	return 0;
+}
+
+int bdev_del_from_map(struct kobj_map *map, dev_t dev, int all)
+{
+	int tmp;
+	struct kobject *kobj;
+	struct device *d;
+	struct gendisk *disk;
+
+	kobj = kobj_lookup(bdev_map, dev, NULL, &tmp);
+	if (kobj == NULL)
+		return -ENODEV;
+
+	d = kobj_to_dev(kobj);
+	disk = dev_to_disk(d);
+	kobj_unmap(map, dev, all ? MINORBITS : 1);
+
+	put_disk(disk);
+	put_disk(disk);
+	return 0;
+}
+
+void bdev_iterate_map(struct kobj_map *map,
+		int (*fn)(dev_t, int, mode_t, void *), void *x)
+{
+	kobj_map_iterate(map, fn, x);
+}
+#endif
+
 /**
  * get_gendisk - get partitioning information for a given device
  * @dev: device to get partitioning information for
@@ -202,10 +254,18 @@ void unlink_gendisk(struct gendisk *disk)
  * This function gets the structure containing partitioning
  * information for the given device @dev.
  */
-struct gendisk *get_gendisk(dev_t devt, int *part)
+struct gendisk *get_gendisk(dev_t devt, mode_t *mode, int *part)
 {
-	struct kobject *kobj = kobj_lookup(bdev_map, devt, part);
-	struct device *dev = kobj_to_dev(kobj);
+	struct kobj_map *map;
+	struct kobject *kobj;
+	struct device *dev;
+
+	map = task_bdev_map(current);
+	if (map == NULL)
+		map = bdev_map;
+
+	kobj = kobj_lookup(map, devt, mode, part);
+	dev = kobj_to_dev(kobj);
 
 	return  kobj ? dev_to_disk(dev) : NULL;
 }
@@ -356,10 +416,20 @@ static struct kobject *base_probe(dev_t devt, int *part, void *data)
 	return NULL;
 }
 
+struct kobj_map *bdev_map_init(void)
+{
+	return kobj_map_init(base_probe, &block_class_lock);
+}
+
+void bdev_map_fini(struct kobj_map *map)
+{
+	kobj_map_fini(map);
+}
+
 static int __init genhd_device_init(void)
 {
 	class_register(&block_class);
-	bdev_map = kobj_map_init(base_probe, &block_class_lock);
+	bdev_map = bdev_map_init();
 	blk_dev_init();
 
 #ifndef CONFIG_SYSFS_DEPRECATED
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 55295a4..03b1b5e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1129,16 +1129,25 @@ static int do_open(struct block_device *bdev, struct file *file, int for_part)
 	struct module *owner = NULL;
 	struct gendisk *disk;
 	int ret = -ENXIO;
+	mode_t mode;
 	int part;
 
 	file->f_mapping = bdev->bd_inode->i_mapping;
 	lock_kernel();
-	disk = get_gendisk(bdev->bd_dev, &part);
+	disk = get_gendisk(bdev->bd_dev, &mode, &part);
 	if (!disk) {
 		unlock_kernel();
 		bdput(bdev);
 		return ret;
 	}
+
+	if ((file->f_mode & mode) != file->f_mode) {
+		unlock_kernel();
+		bdput(bdev);
+		put_disk(disk);
+		return -EACCES;
+	}
+
 	owner = disk->fops->owner;
 
 	mutex_lock_nested(&bdev->bd_mutex, for_part);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index dc710a0..b2d7b52 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -239,7 +239,15 @@ extern int get_blkdev_list(char *, int);
 extern void add_disk(struct gendisk *disk);
 extern void del_gendisk(struct gendisk *gp);
 extern void unlink_gendisk(struct gendisk *gp);
-extern struct gendisk *get_gendisk(dev_t dev, int *part);
+extern struct gendisk *get_gendisk(dev_t dev, mode_t *mode, int *part);
+
+struct kobj_map;
+extern int bdev_add_to_map(struct kobj_map *, dev_t dev, int all, mode_t mode);
+extern int bdev_del_from_map(struct kobj_map *map, dev_t dev, int all);
+extern void bdev_iterate_map(struct kobj_map *map,
+		int (*fn)(dev_t, int, mode_t, void *), void *x);
+extern struct kobj_map *bdev_map_init(void);
+extern void bdev_map_fini(struct kobj_map *map);
 
 extern void set_device_ro(struct block_device *bdev, int flag);
 extern void set_disk_ro(struct gendisk *disk, int flag);

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 4/4] The control group itself
       [not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (2 preceding siblings ...)
  2008-01-08  9:15   ` [PATCH 3/4] The block " Pavel Emelyanov
@ 2008-01-08  9:18   ` Pavel Emelyanov
       [not found]     ` <47833FF6.6060901-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  2008-01-12 21:20   ` [PATCH 0/4] Devices accessibility control group (v2) sukadev-r/Jw6+rmf7HQT0dZR+AlfA
  2008-01-14 21:18   ` Paul Menage
  5 siblings, 1 reply; 42+ messages in thread
From: Pavel Emelyanov @ 2008-01-08  9:18 UTC (permalink / raw)
  To: Serge Hallyn, Oren Laadan; +Cc: Linux Containers, Paul Menage

Each new group will have its own maps for char and block
layers. The devices access list is tuned via the 
devices.permissions file.

One may read from the file to get the configured
state. E.g.

# cat /cont/devices/0/devices.permissions
c 1:* rw
b 8:1 rw

The top container isn't initialized, so that the char 
and block layers will use the global maps to lookup 
their devices.

Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>

---

diff --git a/fs/Makefile b/fs/Makefile
index 82b6ae1..a085706 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -63,6 +63,8 @@ obj-y				+= devpts/
 
 obj-$(CONFIG_PROFILING)		+= dcookies.o
 obj-$(CONFIG_DLM)		+= dlm/
+
+obj-$(CONFIG_CGROUP_DEVS)	+= devscontrol.o
  
 # Do not add any filesystems before this line
 obj-$(CONFIG_REISERFS_FS)	+= reiserfs/
diff --git a/fs/devscontrol.c b/fs/devscontrol.c
new file mode 100644
index 0000000..ea282f3
--- /dev/null
+++ b/fs/devscontrol.c
@@ -0,0 +1,291 @@
+/*
+ * devscontrol.c - Device Controller
+ *
+ * Copyright 2007 OpenVZ SWsoft Inc
+ * Author: Pavel Emelyanov <xemul at openvz.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cdev.h>
+#include <linux/err.h>
+#include <linux/devscontrol.h>
+#include <linux/uaccess.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+
+struct devs_cgroup {
+	struct cgroup_subsys_state css;
+
+	struct kobj_map *cdev_map;
+	struct kobj_map *bdev_map;
+};
+
+static inline
+struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct devs_cgroup, css);
+}
+
+static inline
+struct devs_cgroup *cgroup_to_devs(struct cgroup *cont)
+{
+	return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id));
+}
+
+struct kobj_map *task_cdev_map(struct task_struct *tsk)
+{
+	struct cgroup_subsys_state *css;
+
+	css = task_subsys_state(tsk, devs_subsys_id);
+	if (css->cgroup->parent == NULL)
+		return NULL;
+	else
+		return css_to_devs(css)->cdev_map;
+}
+
+struct kobj_map *task_bdev_map(struct task_struct *tsk)
+{
+	struct cgroup_subsys_state *css;
+
+	css = task_subsys_state(tsk, devs_subsys_id);
+	if (css->cgroup->parent == NULL)
+		return NULL;
+	else
+		return css_to_devs(css)->bdev_map;
+}
+
+static struct cgroup_subsys_state *
+devs_create(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	struct devs_cgroup *devs;
+
+	devs = kzalloc(sizeof(struct devs_cgroup), GFP_KERNEL);
+	if (devs == NULL)
+		goto out;
+
+	devs->cdev_map = cdev_map_init();
+	if (devs->cdev_map == NULL)
+		goto out_free;
+
+	devs->bdev_map = bdev_map_init();
+	if (devs->bdev_map == NULL)
+		goto out_free_cdev;
+
+	return &devs->css;
+
+out_free_cdev:
+	cdev_map_fini(devs->cdev_map);
+out_free:
+	kfree(devs);
+out:
+	return ERR_PTR(-ENOMEM);
+}
+
+static void devs_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	struct devs_cgroup *devs;
+
+	devs = cgroup_to_devs(cont);
+	bdev_map_fini(devs->bdev_map);
+	cdev_map_fini(devs->cdev_map);
+	kfree(devs);
+}
+
+static int decode_perms_str(char *buf, int *chrdev, dev_t *dev,
+		int *all, mode_t *mode)
+{
+	unsigned int major, minor;
+	char *end;
+	mode_t tmp;
+
+	/* [cb] <major>:<minor> [r-][w-] */
+
+	if (buf[0] == 'c')
+		*chrdev = 1;
+	else if (buf[0] == 'b')
+		*chrdev = 0;
+	else
+		return -EINVAL;
+
+	if (buf[1] != ' ')
+		return -EINVAL;
+
+	major = simple_strtoul(buf + 2, &end, 10);
+	if (*end != ':')
+		return -EINVAL;
+
+	if (end[1] == '*') {
+		if (end[2] != ' ')
+			return -EINVAL;
+
+		*all = 1;
+		minor = 0;
+		end += 2;
+	} else {
+		minor = simple_strtoul(end + 1, &end, 10);
+		if (*end != ' ')
+			return -EINVAL;
+
+		*all = 0;
+	}
+
+	tmp = 0;
+
+	if (end[1] == 'r')
+		tmp |= FMODE_READ;
+	else if (end[1] != '-')
+		return -EINVAL;
+	if (end[2] == 'w')
+		tmp |= FMODE_WRITE;
+	else if (end[2] != '-')
+		return -EINVAL;
+
+	*dev = MKDEV(major, minor);
+	*mode = tmp;
+	return 0;
+}
+
+static int encode_perms_str(char *buf, int len, int chrdev, dev_t dev,
+		int all, mode_t mode)
+{
+	int ret;
+
+	ret = snprintf(buf, len, "%c %d:", chrdev ? 'c' : 'b', MAJOR(dev));
+	if (all)
+		ret += snprintf(buf + ret, len - ret, "*");
+	else
+		ret += snprintf(buf + ret, len - ret, "%d", MINOR(dev));
+
+	ret += snprintf(buf + ret, len - ret, " %c%c\n",
+			(mode & FMODE_READ) ? 'r' : '-',
+			(mode & FMODE_WRITE) ? 'w' : '-');
+
+	return ret + 1;
+}
+
+static ssize_t devs_write(struct cgroup *cont, struct cftype *cft,
+		struct file *f, const char __user *ubuf,
+		size_t nbytes, loff_t *pos)
+{
+	int err, all, chrdev;
+	dev_t dev;
+	char buf[64];
+	struct devs_cgroup *devs;
+	mode_t mode;
+
+	if (copy_from_user(buf, ubuf, sizeof(buf)))
+		return -EFAULT;
+
+	buf[sizeof(buf) - 1] = 0;
+	err = decode_perms_str(buf, &chrdev, &dev, &all, &mode);
+	if (err < 0)
+		return err;
+
+	devs = cgroup_to_devs(cont);
+
+	if (mode == 0) {
+		if (chrdev)
+			err = cdev_del_from_map(devs->cdev_map, dev, all);
+		else
+			err = bdev_del_from_map(devs->bdev_map, dev, all);
+
+		if (err < 0)
+			return err;
+
+		css_put(&devs->css);
+	} else {
+		if (chrdev)
+			err = cdev_add_to_map(devs->cdev_map, dev, all, mode);
+		else
+			err = bdev_add_to_map(devs->bdev_map, dev, all, mode);
+
+		if (err < 0)
+			return err;
+
+		css_get(&devs->css);
+	}
+
+	return nbytes;
+}
+
+struct devs_dump_arg {
+	char *buf;
+	int pos;
+	int chrdev;
+};
+
+static int devs_dump_one(dev_t dev, int range, mode_t mode, void *x)
+{
+	struct devs_dump_arg *arg = x;
+	char tmp[64];
+	int len;
+
+	len = encode_perms_str(tmp, sizeof(tmp), arg->chrdev, dev,
+			range != 1, mode);
+
+	if (arg->pos >= PAGE_SIZE - len)
+		return 1;
+
+	memcpy(arg->buf + arg->pos, tmp, len);
+	arg->pos += len;
+	return 0;
+}
+
+static ssize_t devs_read(struct cgroup *cont, struct cftype *cft,
+		struct file *f, char __user *ubuf, size_t nbytes, loff_t *pos)
+{
+	struct devs_dump_arg arg;
+	struct devs_cgroup *devs;
+	ssize_t ret;
+
+	arg.buf = (char *)__get_free_page(GFP_KERNEL);
+	if (arg.buf == NULL)
+		return -ENOMEM;
+
+	devs = cgroup_to_devs(cont);
+	arg.pos = 0;
+
+	arg.chrdev = 1;
+	cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg);
+
+	arg.chrdev = 0;
+	bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg);
+
+	ret = simple_read_from_buffer(ubuf, nbytes, pos,
+			arg.buf, arg.pos);
+
+	free_page((unsigned long)arg.buf);
+	return ret;
+}
+
+static struct cftype devs_files[] = {
+	{
+		.name = "permissions",
+		.write = devs_write,
+		.read = devs_read,
+	},
+};
+
+static int devs_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	return cgroup_add_files(cont, ss,
+			devs_files, ARRAY_SIZE(devs_files));
+}
+
+struct cgroup_subsys devs_subsys = {
+	.name = "devices",
+	.subsys_id = devs_subsys_id,
+	.create = devs_create,
+	.destroy = devs_destroy,
+	.populate = devs_populate,
+};
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 228235c..9c0cd2c 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -42,3 +42,9 @@ SUBSYS(mem_cgroup)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_DEVS
+SUBSYS(devs)
+#endif
+
+/* */
diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
new file mode 100644
index 0000000..5f574e0
--- /dev/null
+++ b/include/linux/devscontrol.h
@@ -0,0 +1,20 @@
+#ifndef __DEVS_CONTROL_H__
+#define __DEVS_CONTROL_H__
+struct kobj_map;
+struct task_struct;
+
+#ifdef CONFIG_CGROUP_DEVS
+struct kobj_map *task_cdev_map(struct task_struct *);
+struct kobj_map *task_bdev_map(struct task_struct *);
+#else
+static inline kobj_map *task_cdev_map(struct task_struct *tsk)
+{
+	return NULL;
+}
+
+static inline kobj_map *task_bdev_map(struct task_struct *tsk)
+{
+	return NULL;
+}
+#endif
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index b886ac9..4dd53d6 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -283,6 +283,12 @@ config CGROUP_DEBUG
 
 	  Say N if unsure
 
+config CGROUP_DEVS
+	bool "Devices cgroup subsystem"
+	depends on CGROUPS
+	help
+	  Controlls the visibility of devices
+
 config CGROUP_NS
         bool "Namespace cgroup subsystem"
         depends on CGROUPS

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/4] Some changes in the kobject mapper
       [not found]     ` <47833D43.3090703-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2008-01-08 18:36       ` Daniel Hokka Zakrisson
       [not found]         ` <4783C2B4.7000501-nym3zxDgnZcAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Daniel Hokka Zakrisson @ 2008-01-08 18:36 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage

Pavel Emelyanov wrote:
> The main thing that I want from the kobj mapper is to add 
> the mode_t on the struct kobj_map that reflects with 
> permissions are associated with this particular map. This 
> mode is to be returned via the kobj_lookup.
> 
> I use the FMODE_XXX flags to handle the permissions bits, 
> as I will compare these ones to the file->f_mode later.
> By default all bits are set (for the initial container).
> 
> The additional things I need are kobj_remap() to change 
> that permission and kobj_iterate() to walk the map.
> 
> The kobj_map_fini() is the roll-back for the kobj_map_init().
> 
> Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> 
> ...
> @@ -153,3 +237,21 @@ struct kobj_map *kobj_map_init(kobj_probe_t *base_probe, struct mutex *lock)
>  	p->lock = lock;
>  	return p;
>  }
> +
> +void kobj_map_fini(struct kobj_map *map)
> +{
> +	int i;
> +	struct probe *p, *next;
> +
> +	for (i = 0; i < 256; i++) {

This should be 255, shouldn't it?

> +		p = map->probes[i];
> +		while (p->next != NULL) {
> +			next = p->next;
> +			kfree(p);
> +			p = next;
> +		}
> +	}
> +
> +	kfree(p);
> +	kfree(map);
> +}
> diff --git a/include/linux/kobj_map.h b/include/linux/kobj_map.h
> index bafe178..ecfe772 100644
> --- a/include/linux/kobj_map.h
> +++ b/include/linux/kobj_map.h
> @@ -7,8 +7,13 @@ struct kobj_map;
>  
>  int kobj_map(struct kobj_map *, dev_t, unsigned long, struct module *,
>  	     kobj_probe_t *, int (*)(dev_t, void *), void *);
> +int kobj_remap(struct kobj_map *, dev_t, mode_t, unsigned long, struct module *,
> +	     kobj_probe_t *, int (*)(dev_t, void *), void *);
>  void kobj_unmap(struct kobj_map *, dev_t, unsigned long);
> -struct kobject *kobj_lookup(struct kobj_map *, dev_t, int *);
> +struct kobject *kobj_lookup(struct kobj_map *, dev_t, mode_t *, int *);
> +void kobj_map_iterate(struct kobj_map *, int (*fn)(dev_t, int, mode_t, void *),
> +		void *);
>  struct kobj_map *kobj_map_init(kobj_probe_t *, struct mutex *);
> +void kobj_map_fini(struct kobj_map *);
>  
>  #endif
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers


-- 
Daniel Hokka Zakrisson

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/4] Some changes in the kobject mapper
       [not found]         ` <4783C2B4.7000501-nym3zxDgnZcAvxtiuMwx3w@public.gmane.org>
@ 2008-01-08 19:17           ` Dave Hansen
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Hansen @ 2008-01-08 19:17 UTC (permalink / raw)
  To: Daniel Hokka Zakrisson; +Cc: Linux Containers, Paul Menage, Pavel Emelyanov

On Tue, 2008-01-08 at 19:36 +0100, Daniel Hokka Zakrisson wrote:
> 
> > +void kobj_map_fini(struct kobj_map *map)
> > +{
> > +     int i;
> > +     struct probe *p, *next;
> > +
> > +     for (i = 0; i < 256; i++) {
> 
> This should be 255, shouldn't it?

Neither, it should be a named constant. ;)

-- Dave

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] Devices accessibility control group (v2)
       [not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (3 preceding siblings ...)
  2008-01-08  9:18   ` [PATCH 4/4] The control group itself Pavel Emelyanov
@ 2008-01-12 21:20   ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
       [not found]     ` <20080112212014.GA12085-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2008-01-14 21:18   ` Paul Menage
  5 siblings, 1 reply; 42+ messages in thread
From: sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2008-01-12 21:20 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage

Pavel Emelianov [xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org] wrote:
| The first version was posted long ago 
| (http://openvz.org/pipermail/devel/2007-September/007647.html)
| and since then there are many (good I hope) changes:
| 
| * Added the block devices support :) It turned out to
|   be a bit simpler than the char one (or I missed
|   something significant);
| * Now we can enable/disable not just individual devices,
|   but the whole major with all its minors (see the TODO
|   list beyond as well);
| * Added the ability to restrict the read/write permissions
|   to devices, not just visible/invisible state.
| 
| That is - the main features I wished to implement right
| after the v1 was sent. Some minor changes are:
| 
| * I merged the devices.char and devices.block files into
|   one - devices.permissions;
| * As the result of the change above - the strings passed
|   to this file has changed. Now they are
|          [bc] <major>:{<minor>|*} [r-][w-]
|   E.g. b 5:2 r- will grant the read permissions to the
|   block 5:2 device and c 3:* -w will grant the write-only
|   access to all the character devices with the major 5.
| 
| However, there are some things to be done:
| 
| * Make the /proc/devices show relevant info depending on
|   who is reading it. This seems to be easy to do, since
|   I already have the support to dump similar info into the
|   devices.permissions file, but I haven't tried to use
|   this in /proc/devices yet;
| * Add the support for devices ranges. I.e. someone might
|   wish to tell smth like b 5:[0-10] r- to this subsystem.
|   Currently this is not supported and I'm afraid that if we
|   start support minor ranges we'll have smth similar to
|   VMA-s or FLOCK-s ranges management in one more place in the
|   kernel.
| * One more question is - are there any other permissions to
|   work with? E.g. in OpenVZ we have a separate bit for 
|   quota management, maybe we can invent some more...
| 
| Currently I didn't pay much attention to split this set well,
| so this will most likely won't work with git-bisect, but I 
| think this is OK for now. I will sure split it better when I
| send the v3 and further.
| 
| The set is prepared against the 2.6.24-rc5-mm1.
| 
| All this is minimally tested and seems to work. Hope to hear
| you comments, wishes and patches soon :)
| 
| To play with it - run a standard procedure:
| 
|  # mount -t container none /cont/devs -o devices

This should be '-t cgroup'

|  # mkdir /cont/devs/0
|  # echo -n $$ > /cont/devs/0/tasks
| 
| and tune device permissions.

I started playing with this and noticed that even if I try to
enable read access to device [c, 1:3] it also grants access
to device [c, 1:5]. 

i.e the access restrictions seem to apply to all devices with
a given major number. Is that really the intent ?

Both devices accessible here:
	# hexdump /dev/null
	# hexdump /dev/zero
	0000000 0000 0000 0000 0000 0000 0000 0000 0000
	*
	^C

Neither device accessible:

	# echo $$ > /container/devs/0/tasks
	# hexdump /dev/zero
	hexdump: /dev/zero: No such device or address
	hexdump: /dev/zero: Bad file descriptor
	# hexdump /dev/null
	hexdump: /dev/null: No such device or address
	hexdump: /dev/null: Bad file descriptor

Grant read access to /dev/null, but /dev/zero is also readable

	# echo c 1:3 r- > /container/devs/0/devices.permissions
	# hexdump /dev/null
	# hexdump /dev/zero
	0000000 0000 0000 0000 0000 0000 0000 0000 0000
	*
	^C

Remove read access to /dev/null, but /dev/zero is also not
readable.

	# echo c 1:3 -- > /container/devs/0/devices.permissions
	# hexdump /dev/zero
	hexdump: /dev/zero: No such device or address
	hexdump: /dev/zero: Bad file descriptor

BTW, a question about cgroups: If we 'echo $$ > /container/devs/0/tasks'
is there a way to remove/undo it later  (so that the process has access
as before) ?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] Devices accessibility control group (v2)
       [not found]     ` <20080112212014.GA12085-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-01-14  7:52       ` Pavel Emelyanov
       [not found]         ` <478B14DB.4000106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Pavel Emelyanov @ 2008-01-14  7:52 UTC (permalink / raw)
  To: sukadev-r/Jw6+rmf7HQT0dZR+AlfA; +Cc: Linux Containers, Paul Menage

sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org wrote:
> Pavel Emelianov [xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org] wrote:
> | The first version was posted long ago 
> | (http://openvz.org/pipermail/devel/2007-September/007647.html)
> | and since then there are many (good I hope) changes:
> | 
> | * Added the block devices support :) It turned out to
> |   be a bit simpler than the char one (or I missed
> |   something significant);
> | * Now we can enable/disable not just individual devices,
> |   but the whole major with all its minors (see the TODO
> |   list beyond as well);
> | * Added the ability to restrict the read/write permissions
> |   to devices, not just visible/invisible state.
> | 
> | That is - the main features I wished to implement right
> | after the v1 was sent. Some minor changes are:
> | 
> | * I merged the devices.char and devices.block files into
> |   one - devices.permissions;
> | * As the result of the change above - the strings passed
> |   to this file has changed. Now they are
> |          [bc] <major>:{<minor>|*} [r-][w-]
> |   E.g. b 5:2 r- will grant the read permissions to the
> |   block 5:2 device and c 3:* -w will grant the write-only
> |   access to all the character devices with the major 5.
> | 
> | However, there are some things to be done:
> | 
> | * Make the /proc/devices show relevant info depending on
> |   who is reading it. This seems to be easy to do, since
> |   I already have the support to dump similar info into the
> |   devices.permissions file, but I haven't tried to use
> |   this in /proc/devices yet;
> | * Add the support for devices ranges. I.e. someone might
> |   wish to tell smth like b 5:[0-10] r- to this subsystem.
> |   Currently this is not supported and I'm afraid that if we
> |   start support minor ranges we'll have smth similar to
> |   VMA-s or FLOCK-s ranges management in one more place in the
> |   kernel.
> | * One more question is - are there any other permissions to
> |   work with? E.g. in OpenVZ we have a separate bit for 
> |   quota management, maybe we can invent some more...
> | 
> | Currently I didn't pay much attention to split this set well,
> | so this will most likely won't work with git-bisect, but I 
> | think this is OK for now. I will sure split it better when I
> | send the v3 and further.
> | 
> | The set is prepared against the 2.6.24-rc5-mm1.
> | 
> | All this is minimally tested and seems to work. Hope to hear
> | you comments, wishes and patches soon :)
> | 
> | To play with it - run a standard procedure:
> | 
> |  # mount -t container none /cont/devs -o devices
> 
> This should be '-t cgroup'

Right :)

Thank you for the feedback. Serge, Oren, do you have smth
to tell about this set? I planned to show it to Andrew this
week, hope he will find time to look at it :)

> |  # mkdir /cont/devs/0
> |  # echo -n $$ > /cont/devs/0/tasks
> | 
> | and tune device permissions.
> 
> I started playing with this and noticed that even if I try to
> enable read access to device [c, 1:3] it also grants access
> to device [c, 1:5]. 

Hm... I can't reproduce this:

# /bin/echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions
# /bin/echo -n $$ > /cnt/dev/0/tasks
# cat /cnt/dev/0/devices.permissions 
c 1:3 r-
# hexdump /dev/null 
# hexdump /dev/zero 
hexdump: /dev/zero: No such device or address
hexdump: /dev/zero: Bad file descriptor

Maybe you have played with devs cgroups before getting this?
Can you show what's the contents of the devices.permissions file
in your case?

> i.e the access restrictions seem to apply to all devices with
> a given major number. Is that really the intent ?
> 
> Both devices accessible here:
> 	# hexdump /dev/null
> 	# hexdump /dev/zero
> 	0000000 0000 0000 0000 0000 0000 0000 0000 0000
> 	*
> 	^C
> 
> Neither device accessible:
> 
> 	# echo $$ > /container/devs/0/tasks
> 	# hexdump /dev/zero
> 	hexdump: /dev/zero: No such device or address
> 	hexdump: /dev/zero: Bad file descriptor
> 	# hexdump /dev/null
> 	hexdump: /dev/null: No such device or address
> 	hexdump: /dev/null: Bad file descriptor
> 
> Grant read access to /dev/null, but /dev/zero is also readable
> 
> 	# echo c 1:3 r- > /container/devs/0/devices.permissions
> 	# hexdump /dev/null
> 	# hexdump /dev/zero
> 	0000000 0000 0000 0000 0000 0000 0000 0000 0000
> 	*
> 	^C
> 
> Remove read access to /dev/null, but /dev/zero is also not
> readable.
> 
> 	# echo c 1:3 -- > /container/devs/0/devices.permissions
> 	# hexdump /dev/zero
> 	hexdump: /dev/zero: No such device or address
> 	hexdump: /dev/zero: Bad file descriptor
> 
> BTW, a question about cgroups: If we 'echo $$ > /container/devs/0/tasks'
> is there a way to remove/undo it later  (so that the process has access
> as before) ?

I've always thought that it's to move the task to top cgrop, i.e.
echo $$ > /container/devs/tasks

Thanks,
Pavel

> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] The character devices layer changes
       [not found]     ` <47833E93.6010108-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2008-01-14 17:03       ` Serge E. Hallyn
       [not found]         ` <20080114170333.GA15077-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Serge E. Hallyn @ 2008-01-14 17:03 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage

Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org):
> These changes include the API for the control group
> to map/remap/unmap the devices with their permissions
> and one important thing.
> 
> The fact is that the struct cdev is cached in the inode
> for faster access, so once we looked one up we go through
> the fast path and omit the kobj_lookup() call. This is no
> longer good when we restrict the access to cdevs.
> 
> To address this issue, I store the last_perm and last(_map)
> fields on the struct cdev (and protect them with the cdev_lock)
> and force the re-lookup in the kobj mappings if needed.
> 
> I know, this might be slow, but I have two points for it:
> 1. The re-lookup happens on open() only which is not
>    a fast-path. Besides, this is so for block layer and
>    nobody complains;
> 2. On a well-isolated setup, when each container has its
>    own filesystem this is no longer a problem - each
>    cgroup will cache the cdev on its inode and work good.

What about simply returning -EPERM when open()ing a cdev
with ->map!=task_cdev_map(current)?

Shouldn't be a problem for ttys, since the container init
already has the tty open, right?

Otherwise, the patchset looks good to me.  Want to look
through this one a little more (i think that'd be easier
with the -EPERM approach) and scrutinize patch 4, but
overall it makes sense.

If I understand right, we're taking 14k per cgroup for
kobjmaps?  Do we consider that a problem?

thanks,
-serge

> Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> 
> ---
> 
> diff --git a/fs/char_dev.c b/fs/char_dev.c
> index c3bfa76..2b821ef 100644
> --- a/fs/char_dev.c
> +++ b/fs/char_dev.c
> @@ -22,6 +22,8 @@
>  #include <linux/mutex.h>
>  #include <linux/backing-dev.h>
> 
> +#include <linux/devscontrol.h>
> +
>  #ifdef CONFIG_KMOD
>  #include <linux/kmod.h>
>  #endif
> @@ -362,17 +364,25 @@ int chrdev_open(struct inode * inode, struct file * filp)
>  	struct cdev *p;
>  	struct cdev *new = NULL;
>  	int ret = 0;
> +	struct kobj_map *map;
> +	mode_t mode;
> +
> +	map = task_cdev_map(current);
> +	if (map == NULL)
> +		map = cdev_map;
> 
>  	spin_lock(&cdev_lock);
>  	p = inode->i_cdev;
> -	if (!p) {
> +	if (!p || p->last != map) {
>  		struct kobject *kobj;
>  		int idx;
> +
>  		spin_unlock(&cdev_lock);
> -		kobj = kobj_lookup(cdev_map, inode->i_rdev, &idx);
> +		kobj = kobj_lookup(map, inode->i_rdev, &mode, &idx);
>  		if (!kobj)
>  			return -ENXIO;
>  		new = container_of(kobj, struct cdev, kobj);
> +		BUG_ON(p != NULL && p != new);
>  		spin_lock(&cdev_lock);
>  		p = inode->i_cdev;
>  		if (!p) {
> @@ -382,12 +392,24 @@ int chrdev_open(struct inode * inode, struct file * filp)
>  			new = NULL;
>  		} else if (!cdev_get(p))
>  			ret = -ENXIO;
> +		else {
> +			p->last = map;
> +			p->last_mode = mode;
> +		}
>  	} else if (!cdev_get(p))
>  		ret = -ENXIO;
> +	else
> +		mode = p->last_mode;
>  	spin_unlock(&cdev_lock);
>  	cdev_put(new);
>  	if (ret)
>  		return ret;
> +
> +	if ((filp->f_mode & mode) != filp->f_mode) {
> +		cdev_put(p);
> +		return -EACCES;
> +	}
> +
>  	filp->f_op = fops_get(p->ops);
>  	if (!filp->f_op) {
>  		cdev_put(p);
> @@ -461,6 +483,64 @@ int cdev_add(struct cdev *p, dev_t dev, unsigned count)
>  	return kobj_map(cdev_map, dev, count, NULL, exact_match, exact_lock, p);
>  }
> 
> +#ifdef CONFIG_CGROUP_DEVS
> +static inline void cdev_map_reset(struct kobj_map *map, struct cdev *c)
> +{
> +	spin_lock(&cdev_lock);
> +	if (c->last == map)
> +		c->last = NULL;
> +	spin_unlock(&cdev_lock);
> +}
> +
> +int cdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode)
> +{
> +	int tmp;
> +	struct kobject *k;
> +	struct cdev *c;
> +
> +	k = kobj_lookup(cdev_map, dev, NULL, &tmp);
> +	if (k == NULL)
> +		return -ENODEV;
> +
> +	c = container_of(k, struct cdev, kobj);
> +	tmp = kobj_remap(map, dev, mode, all ? MINORMASK : 1, NULL,
> +			exact_match, exact_lock, c);
> +	if (tmp < 0) {
> +		cdev_put(c);
> +		return tmp;
> +	}
> +
> +	cdev_map_reset(map, c);
> +	return 0;
> +}
> +
> +int cdev_del_from_map(struct kobj_map *map, dev_t dev, int all)
> +{
> +	int tmp;
> +	struct kobject *k;
> +	struct cdev *c;
> +
> +	k = kobj_lookup(cdev_map, dev, NULL, &tmp);
> +	if (k == NULL)
> +		return -ENODEV;
> +
> +	c = container_of(k, struct cdev, kobj);
> +	kobj_unmap(map, dev, all ? MINORMASK : 1);
> +
> +	cdev_map_reset(map, c);
> +
> +	cdev_put(c);
> +	cdev_put(c);
> +	return 0;
> +}
> +
> +void cdev_iterate_map(struct kobj_map *map,
> +		int (*fn)(dev_t, int, mode_t, void *), void *x)
> +{
> +	kobj_map_iterate(map, fn, x);
> +}
> +#endif
> +
>  static void cdev_unmap(dev_t dev, unsigned count)
>  {
>  	kobj_unmap(cdev_map, dev, count);
> @@ -542,9 +622,19 @@ static struct kobject *base_probe(dev_t dev, int *part, void *data)
>  	return NULL;
>  }
> 
> +struct kobj_map *cdev_map_init(void)
> +{
> +	return kobj_map_init(base_probe, &chrdevs_lock);
> +}
> +
> +void cdev_map_fini(struct kobj_map *map)
> +{
> +	kobj_map_fini(map);
> +}
> +
>  void __init chrdev_init(void)
>  {
> -	cdev_map = kobj_map_init(base_probe, &chrdevs_lock);
> +	cdev_map = cdev_map_init();
>  	bdi_init(&directly_mappable_cdev_bdi);
>  }
> 
> diff --git a/include/linux/cdev.h b/include/linux/cdev.h
> index 1e29b13..d72a2a1 100644
> --- a/include/linux/cdev.h
> +++ b/include/linux/cdev.h
> @@ -9,6 +9,7 @@
>  struct file_operations;
>  struct inode;
>  struct module;
> +struct kobj_map;
> 
>  struct cdev {
>  	struct kobject kobj;
> @@ -17,6 +18,8 @@ struct cdev {
>  	struct list_head list;
>  	dev_t dev;
>  	unsigned int count;
> +	struct kobj_map *last;
> +	mode_t last_mode;
>  };
> 
>  void cdev_init(struct cdev *, const struct file_operations *);
> @@ -33,5 +36,11 @@ void cd_forget(struct inode *);
> 
>  extern struct backing_dev_info directly_mappable_cdev_bdi;
> 
> +int cdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode);
> +int cdev_del_from_map(struct kobj_map *map, dev_t dev, int all);
> +struct kobj_map *cdev_map_init(void);
> +void cdev_map_fini(struct kobj_map *map);
> +void cdev_iterate_map(struct kobj_map *,
> +		int (*fn)(dev_t, int, mode_t, void *), void *);
>  #endif
>  #endif

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]     ` <47833FF6.6060901-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2008-01-14 17:40       ` Serge E. Hallyn
       [not found]         ` <20080114174056.GB15077-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
  2008-01-14 21:54       ` Paul Menage
  1 sibling, 1 reply; 42+ messages in thread
From: Serge E. Hallyn @ 2008-01-14 17:40 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage

Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org):
> Each new group will have its own maps for char and block
> layers. The devices access list is tuned via the 
> devices.permissions file.
> 
> One may read from the file to get the configured
> state. E.g.
> 
> # cat /cont/devices/0/devices.permissions
> c 1:* rw
> b 8:1 rw
> 
> The top container isn't initialized, so that the char 
> and block layers will use the global maps to lookup 
> their devices.
> 
> Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> 
> ---
> 
> diff --git a/fs/Makefile b/fs/Makefile
> index 82b6ae1..a085706 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -63,6 +63,8 @@ obj-y				+= devpts/
> 
>  obj-$(CONFIG_PROFILING)		+= dcookies.o
>  obj-$(CONFIG_DLM)		+= dlm/
> +
> +obj-$(CONFIG_CGROUP_DEVS)	+= devscontrol.o
>   
>  # Do not add any filesystems before this line
>  obj-$(CONFIG_REISERFS_FS)	+= reiserfs/
> diff --git a/fs/devscontrol.c b/fs/devscontrol.c
> new file mode 100644
> index 0000000..ea282f3
> --- /dev/null
> +++ b/fs/devscontrol.c
> @@ -0,0 +1,291 @@
> +/*
> + * devscontrol.c - Device Controller
> + *
> + * Copyright 2007 OpenVZ SWsoft Inc
> + * Author: Pavel Emelyanov <xemul at openvz.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/cdev.h>
> +#include <linux/err.h>
> +#include <linux/devscontrol.h>
> +#include <linux/uaccess.h>
> +#include <linux/fs.h>
> +#include <linux/genhd.h>
> +
> +struct devs_cgroup {
> +	struct cgroup_subsys_state css;
> +
> +	struct kobj_map *cdev_map;
> +	struct kobj_map *bdev_map;
> +};
> +
> +static inline
> +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css)
> +{
> +	return container_of(css, struct devs_cgroup, css);
> +}
> +
> +static inline
> +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont)
> +{
> +	return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id));
> +}
> +
> +struct kobj_map *task_cdev_map(struct task_struct *tsk)
> +{
> +	struct cgroup_subsys_state *css;
> +
> +	css = task_subsys_state(tsk, devs_subsys_id);
> +	if (css->cgroup->parent == NULL)
> +		return NULL;
> +	else
> +		return css_to_devs(css)->cdev_map;
> +}
> +
> +struct kobj_map *task_bdev_map(struct task_struct *tsk)
> +{
> +	struct cgroup_subsys_state *css;
> +
> +	css = task_subsys_state(tsk, devs_subsys_id);
> +	if (css->cgroup->parent == NULL)
> +		return NULL;
> +	else
> +		return css_to_devs(css)->bdev_map;
> +}
> +
> +static struct cgroup_subsys_state *
> +devs_create(struct cgroup_subsys *ss, struct cgroup *cont)
> +{
> +	struct devs_cgroup *devs;
> +
> +	devs = kzalloc(sizeof(struct devs_cgroup), GFP_KERNEL);
> +	if (devs == NULL)
> +		goto out;
> +
> +	devs->cdev_map = cdev_map_init();
> +	if (devs->cdev_map == NULL)
> +		goto out_free;
> +
> +	devs->bdev_map = bdev_map_init();
> +	if (devs->bdev_map == NULL)
> +		goto out_free_cdev;
> +
> +	return &devs->css;
> +
> +out_free_cdev:
> +	cdev_map_fini(devs->cdev_map);
> +out_free:
> +	kfree(devs);
> +out:
> +	return ERR_PTR(-ENOMEM);
> +}

Thanks for working on this, Pavel.

My only question with this patch is - so if I create a devs
cgroup which only has access to, say /dev/loop0 and /dev/tty3,
and someone in that cgroup manages to create a new cgroup, the
new cgroup will have all the default permissions again, rather
than inherit the permissions from this cgroup, right?

> +
> +static void devs_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
> +{
> +	struct devs_cgroup *devs;
> +
> +	devs = cgroup_to_devs(cont);
> +	bdev_map_fini(devs->bdev_map);
> +	cdev_map_fini(devs->cdev_map);
> +	kfree(devs);
> +}
> +
> +static int decode_perms_str(char *buf, int *chrdev, dev_t *dev,
> +		int *all, mode_t *mode)
> +{
> +	unsigned int major, minor;
> +	char *end;
> +	mode_t tmp;
> +
> +	/* [cb] <major>:<minor> [r-][w-] */
> +
> +	if (buf[0] == 'c')
> +		*chrdev = 1;
> +	else if (buf[0] == 'b')
> +		*chrdev = 0;
> +	else
> +		return -EINVAL;
> +
> +	if (buf[1] != ' ')
> +		return -EINVAL;
> +
> +	major = simple_strtoul(buf + 2, &end, 10);
> +	if (*end != ':')
> +		return -EINVAL;
> +
> +	if (end[1] == '*') {
> +		if (end[2] != ' ')
> +			return -EINVAL;
> +
> +		*all = 1;
> +		minor = 0;
> +		end += 2;
> +	} else {
> +		minor = simple_strtoul(end + 1, &end, 10);
> +		if (*end != ' ')
> +			return -EINVAL;
> +
> +		*all = 0;
> +	}
> +
> +	tmp = 0;
> +
> +	if (end[1] == 'r')
> +		tmp |= FMODE_READ;
> +	else if (end[1] != '-')
> +		return -EINVAL;
> +	if (end[2] == 'w')
> +		tmp |= FMODE_WRITE;
> +	else if (end[2] != '-')
> +		return -EINVAL;
> +
> +	*dev = MKDEV(major, minor);
> +	*mode = tmp;
> +	return 0;
> +}
> +
> +static int encode_perms_str(char *buf, int len, int chrdev, dev_t dev,
> +		int all, mode_t mode)
> +{
> +	int ret;
> +
> +	ret = snprintf(buf, len, "%c %d:", chrdev ? 'c' : 'b', MAJOR(dev));
> +	if (all)
> +		ret += snprintf(buf + ret, len - ret, "*");
> +	else
> +		ret += snprintf(buf + ret, len - ret, "%d", MINOR(dev));
> +
> +	ret += snprintf(buf + ret, len - ret, " %c%c\n",
> +			(mode & FMODE_READ) ? 'r' : '-',
> +			(mode & FMODE_WRITE) ? 'w' : '-');
> +
> +	return ret + 1;
> +}
> +
> +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft,
> +		struct file *f, const char __user *ubuf,
> +		size_t nbytes, loff_t *pos)
> +{
> +	int err, all, chrdev;
> +	dev_t dev;
> +	char buf[64];
> +	struct devs_cgroup *devs;
> +	mode_t mode;

(Of course this will require some privilege, i assume that's a detail
you'll add next time around)

> +
> +	if (copy_from_user(buf, ubuf, sizeof(buf)))
> +		return -EFAULT;
> +
> +	buf[sizeof(buf) - 1] = 0;
> +	err = decode_perms_str(buf, &chrdev, &dev, &all, &mode);
> +	if (err < 0)
> +		return err;
> +
> +	devs = cgroup_to_devs(cont);
> +
> +	if (mode == 0) {
> +		if (chrdev)
> +			err = cdev_del_from_map(devs->cdev_map, dev, all);
> +		else
> +			err = bdev_del_from_map(devs->bdev_map, dev, all);
> +
> +		if (err < 0)
> +			return err;
> +
> +		css_put(&devs->css);
> +	} else {
> +		if (chrdev)
> +			err = cdev_add_to_map(devs->cdev_map, dev, all, mode);
> +		else
> +			err = bdev_add_to_map(devs->bdev_map, dev, all, mode);
> +
> +		if (err < 0)
> +			return err;
> +
> +		css_get(&devs->css);
> +	}
> +
> +	return nbytes;
> +}
> +
> +struct devs_dump_arg {
> +	char *buf;
> +	int pos;
> +	int chrdev;
> +};
> +
> +static int devs_dump_one(dev_t dev, int range, mode_t mode, void *x)
> +{
> +	struct devs_dump_arg *arg = x;
> +	char tmp[64];
> +	int len;
> +
> +	len = encode_perms_str(tmp, sizeof(tmp), arg->chrdev, dev,
> +			range != 1, mode);
> +
> +	if (arg->pos >= PAGE_SIZE - len)
> +		return 1;
> +
> +	memcpy(arg->buf + arg->pos, tmp, len);
> +	arg->pos += len;
> +	return 0;
> +}
> +
> +static ssize_t devs_read(struct cgroup *cont, struct cftype *cft,
> +		struct file *f, char __user *ubuf, size_t nbytes, loff_t *pos)
> +{
> +	struct devs_dump_arg arg;
> +	struct devs_cgroup *devs;
> +	ssize_t ret;
> +
> +	arg.buf = (char *)__get_free_page(GFP_KERNEL);
> +	if (arg.buf == NULL)
> +		return -ENOMEM;
> +
> +	devs = cgroup_to_devs(cont);
> +	arg.pos = 0;
> +
> +	arg.chrdev = 1;
> +	cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg);
> +
> +	arg.chrdev = 0;
> +	bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg);
> +
> +	ret = simple_read_from_buffer(ubuf, nbytes, pos,
> +			arg.buf, arg.pos);
> +
> +	free_page((unsigned long)arg.buf);
> +	return ret;
> +}
> +
> +static struct cftype devs_files[] = {
> +	{
> +		.name = "permissions",
> +		.write = devs_write,
> +		.read = devs_read,
> +	},
> +};
> +
> +static int devs_populate(struct cgroup_subsys *ss, struct cgroup *cont)
> +{
> +	return cgroup_add_files(cont, ss,
> +			devs_files, ARRAY_SIZE(devs_files));
> +}
> +
> +struct cgroup_subsys devs_subsys = {
> +	.name = "devices",
> +	.subsys_id = devs_subsys_id,
> +	.create = devs_create,
> +	.destroy = devs_destroy,
> +	.populate = devs_populate,
> +};
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 228235c..9c0cd2c 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -42,3 +42,9 @@ SUBSYS(mem_cgroup)
>  #endif
> 
>  /* */
> +
> +#ifdef CONFIG_CGROUP_DEVS
> +SUBSYS(devs)
> +#endif
> +
> +/* */
> diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
> new file mode 100644
> index 0000000..5f574e0
> --- /dev/null
> +++ b/include/linux/devscontrol.h
> @@ -0,0 +1,20 @@
> +#ifndef __DEVS_CONTROL_H__
> +#define __DEVS_CONTROL_H__
> +struct kobj_map;
> +struct task_struct;
> +
> +#ifdef CONFIG_CGROUP_DEVS
> +struct kobj_map *task_cdev_map(struct task_struct *);
> +struct kobj_map *task_bdev_map(struct task_struct *);
> +#else
> +static inline kobj_map *task_cdev_map(struct task_struct *tsk)
> +{
> +	return NULL;
> +}
> +
> +static inline kobj_map *task_bdev_map(struct task_struct *tsk)
> +{
> +	return NULL;
> +}
> +#endif
> +#endif
> diff --git a/init/Kconfig b/init/Kconfig
> index b886ac9..4dd53d6 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -283,6 +283,12 @@ config CGROUP_DEBUG
> 
>  	  Say N if unsure
> 
> +config CGROUP_DEVS
> +	bool "Devices cgroup subsystem"
> +	depends on CGROUPS
> +	help
> +	  Controlls the visibility of devices
> +
>  config CGROUP_NS
>          bool "Namespace cgroup subsystem"
>          depends on CGROUPS

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] Devices accessibility control group (v2)
       [not found]         ` <478B14DB.4000106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2008-01-14 17:42           ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
       [not found]             ` <20080114174220.GA17825-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2008-01-14 17:42 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage

| > I started playing with this and noticed that even if I try to
| > enable read access to device [c, 1:3] it also grants access
| > to device [c, 1:5]. 
| 
| Hm... I can't reproduce this:
| 
| # /bin/echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions
| # /bin/echo -n $$ > /cnt/dev/0/tasks
| # cat /cnt/dev/0/devices.permissions 
| c 1:3 r-
| # hexdump /dev/null 
| # hexdump /dev/zero 
| hexdump: /dev/zero: No such device or address
| hexdump: /dev/zero: Bad file descriptor
| 
| Maybe you have played with devs cgroups before getting this?
| Can you show what's the contents of the devices.permissions file
| in your case?

Here is the repro again. I even tried after a reboot. Basically,
granting access to /dev/null is also granting access to /dev/zero.

	# cat devices.permissions
	# hexdump /dev/zero
	hexdump: /dev/zero: No such device or address
	hexdump: /dev/zero: Bad file descriptor
	# hexdump /dev/null
	hexdump: /dev/null: No such device or address
	hexdump: /dev/null: Bad file descriptor
	# echo 'c 1:3 r-' > devices.permissions
	# hexdump /dev/null
	# hexdump /dev/zero
	0000000 0000 0000 0000 0000 0000 0000 0000 0000
	*
	^C
	# cat tasks
	3279
	22266
	# ps
	  PID TTY          TIME CMD
	 3279 pts/0    00:00:00 bash
	22267 pts/0    00:00:00 ps

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] Devices accessibility control group (v2)
       [not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
                     ` (4 preceding siblings ...)
  2008-01-12 21:20   ` [PATCH 0/4] Devices accessibility control group (v2) sukadev-r/Jw6+rmf7HQT0dZR+AlfA
@ 2008-01-14 21:18   ` Paul Menage
       [not found]     ` <6599ad830801141318h121a6a80h9af68c52431c48b8-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  5 siblings, 1 reply; 42+ messages in thread
From: Paul Menage @ 2008-01-14 21:18 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers

On Jan 8, 2008 1:02 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
> * Add the support for devices ranges. I.e. someone might
>   wish to tell smth like b 5:[0-10] r- to this subsystem.

I'd be inclined to leave support for that in userspace, rather than
adding fancier parsing to the kernel API.

Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]     ` <47833FF6.6060901-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  2008-01-14 17:40       ` Serge E. Hallyn
@ 2008-01-14 21:54       ` Paul Menage
       [not found]         ` <6599ad830801141354p5b165cdao8d6184adb9ab61b6-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 42+ messages in thread
From: Paul Menage @ 2008-01-14 21:54 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers

On Jan 8, 2008 1:18 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
> Each new group will have its own maps for char and block
> layers. The devices access list is tuned via the
> devices.permissions file.
>
> One may read from the file to get the configured
> state. E.g.
>
> # cat /cont/devices/0/devices.permissions
> c 1:* rw
> b 8:1 rw
>
> The top container isn't initialized, so that the char
> and block layers will use the global maps to lookup
> their devices.
>
> Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
>
> ---
>
> diff --git a/fs/Makefile b/fs/Makefile
> index 82b6ae1..a085706 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -63,6 +63,8 @@ obj-y                         += devpts/
>
>  obj-$(CONFIG_PROFILING)                += dcookies.o
>  obj-$(CONFIG_DLM)              += dlm/
> +
> +obj-$(CONFIG_CGROUP_DEVS)      += devscontrol.o
>
>  # Do not add any filesystems before this line
>  obj-$(CONFIG_REISERFS_FS)      += reiserfs/
> diff --git a/fs/devscontrol.c b/fs/devscontrol.c
> new file mode 100644
> index 0000000..ea282f3
> --- /dev/null
> +++ b/fs/devscontrol.c
> @@ -0,0 +1,291 @@
> +/*
> + * devscontrol.c - Device Controller
> + *
> + * Copyright 2007 OpenVZ SWsoft Inc
> + * Author: Pavel Emelyanov <xemul at openvz.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/cdev.h>
> +#include <linux/err.h>
> +#include <linux/devscontrol.h>
> +#include <linux/uaccess.h>
> +#include <linux/fs.h>
> +#include <linux/genhd.h>
> +
> +struct devs_cgroup {
> +       struct cgroup_subsys_state css;
> +
> +       struct kobj_map *cdev_map;
> +       struct kobj_map *bdev_map;
> +};
> +
> +static inline
> +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css)
> +{
> +       return container_of(css, struct devs_cgroup, css);
> +}
> +
> +static inline
> +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont)
> +{
> +       return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id));
> +}
> +
> +struct kobj_map *task_cdev_map(struct task_struct *tsk)
> +{
> +       struct cgroup_subsys_state *css;
> +
> +       css = task_subsys_state(tsk, devs_subsys_id);
> +       if (css->cgroup->parent == NULL)
> +               return NULL;
> +       else
> +               return css_to_devs(css)->cdev_map;
> +}

For this (and task_bdev_map) it would be cleaner to just leave the
cdev_map and bdev_map in the root cgroup as NULL, so you could just
make this

struct kobj_map *task_cdev_map(struct task_struct *tsk)
{
  return css_to_devs(task_subsys_state(tsk, devs_subsys_id))->cdev_map;
}
> +
> +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft,
> +               struct file *f, const char __user *ubuf,
> +               size_t nbytes, loff_t *pos)
> +{
> +       int err, all, chrdev;
> +       dev_t dev;
> +       char buf[64];
> +       struct devs_cgroup *devs;
> +       mode_t mode;
> +
> +       if (copy_from_user(buf, ubuf, sizeof(buf)))
> +               return -EFAULT;
> +
> +       buf[sizeof(buf) - 1] = 0;
> +       err = decode_perms_str(buf, &chrdev, &dev, &all, &mode);
> +       if (err < 0)
> +               return err;
> +
> +       devs = cgroup_to_devs(cont);
> +
> +       if (mode == 0) {
> +               if (chrdev)
> +                       err = cdev_del_from_map(devs->cdev_map, dev, all);
> +               else
> +                       err = bdev_del_from_map(devs->bdev_map, dev, all);

There's no locking involved on these calls, other than the cgroups
code guaranteeing that the subsystem objects themselves won't go away.
A quick look over the kobj_map calls suggests that these calls may be
thread-safe - can you confirm that. (And maybe comment at the top of
the function?)

> +       arg.buf = (char *)__get_free_page(GFP_KERNEL);
> +       if (arg.buf == NULL)
> +               return -ENOMEM;
> +
> +       devs = cgroup_to_devs(cont);
> +       arg.pos = 0;
> +
> +       arg.chrdev = 1;
> +       cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg);
> +
> +       arg.chrdev = 0;
> +       bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg);

Is there any chance of this overflowing the buffer page?

Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]         ` <20080114174056.GB15077-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
@ 2008-01-15  7:53           ` Pavel Emelyanov
       [not found]             ` <478C6669.7070705-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Pavel Emelyanov @ 2008-01-15  7:53 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers, Paul Menage

[snip]

> Thanks for working on this, Pavel.
> 
> My only question with this patch is - so if I create a devs
> cgroup which only has access to, say /dev/loop0 and /dev/tty3,
> and someone in that cgroup manages to create a new cgroup, the
> new cgroup will have all the default permissions again, rather
> than inherit the permissions from this cgroup, right?

Right. When you create a new cgroup you have an empty perms
set. Maybe it's worth inheriting the perms from the parent
container, but I think that empty set is better as you will
reconfigure it anyway.

[snip]

>> +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft,
>> +		struct file *f, const char __user *ubuf,
>> +		size_t nbytes, loff_t *pos)
>> +{
>> +	int err, all, chrdev;
>> +	dev_t dev;
>> +	char buf[64];
>> +	struct devs_cgroup *devs;
>> +	mode_t mode;
> 
> (Of course this will require some privilege, i assume that's a detail
> you'll add next time around)

Hm... I though that privileges are governed at the cgroup level.... No?

[snip]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]         ` <6599ad830801141354p5b165cdao8d6184adb9ab61b6-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-01-15  7:58           ` Pavel Emelyanov
  0 siblings, 0 replies; 42+ messages in thread
From: Pavel Emelyanov @ 2008-01-15  7:58 UTC (permalink / raw)
  To: Paul Menage; +Cc: Linux Containers

Paul Menage wrote:
> On Jan 8, 2008 1:18 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
>> Each new group will have its own maps for char and block
>> layers. The devices access list is tuned via the
>> devices.permissions file.
>>
>> One may read from the file to get the configured
>> state. E.g.
>>
>> # cat /cont/devices/0/devices.permissions
>> c 1:* rw
>> b 8:1 rw
>>
>> The top container isn't initialized, so that the char
>> and block layers will use the global maps to lookup
>> their devices.
>>
>> Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
>>
>> ---
>>
>> diff --git a/fs/Makefile b/fs/Makefile
>> index 82b6ae1..a085706 100644
>> --- a/fs/Makefile
>> +++ b/fs/Makefile
>> @@ -63,6 +63,8 @@ obj-y                         += devpts/
>>
>>  obj-$(CONFIG_PROFILING)                += dcookies.o
>>  obj-$(CONFIG_DLM)              += dlm/
>> +
>> +obj-$(CONFIG_CGROUP_DEVS)      += devscontrol.o
>>
>>  # Do not add any filesystems before this line
>>  obj-$(CONFIG_REISERFS_FS)      += reiserfs/
>> diff --git a/fs/devscontrol.c b/fs/devscontrol.c
>> new file mode 100644
>> index 0000000..ea282f3
>> --- /dev/null
>> +++ b/fs/devscontrol.c
>> @@ -0,0 +1,291 @@
>> +/*
>> + * devscontrol.c - Device Controller
>> + *
>> + * Copyright 2007 OpenVZ SWsoft Inc
>> + * Author: Pavel Emelyanov <xemul at openvz.org>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + */
>> +
>> +#include <linux/cgroup.h>
>> +#include <linux/cdev.h>
>> +#include <linux/err.h>
>> +#include <linux/devscontrol.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/fs.h>
>> +#include <linux/genhd.h>
>> +
>> +struct devs_cgroup {
>> +       struct cgroup_subsys_state css;
>> +
>> +       struct kobj_map *cdev_map;
>> +       struct kobj_map *bdev_map;
>> +};
>> +
>> +static inline
>> +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css)
>> +{
>> +       return container_of(css, struct devs_cgroup, css);
>> +}
>> +
>> +static inline
>> +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont)
>> +{
>> +       return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id));
>> +}
>> +
>> +struct kobj_map *task_cdev_map(struct task_struct *tsk)
>> +{
>> +       struct cgroup_subsys_state *css;
>> +
>> +       css = task_subsys_state(tsk, devs_subsys_id);
>> +       if (css->cgroup->parent == NULL)
>> +               return NULL;
>> +       else
>> +               return css_to_devs(css)->cdev_map;
>> +}
> 
> For this (and task_bdev_map) it would be cleaner to just leave the
> cdev_map and bdev_map in the root cgroup as NULL, so you could just
> make this
> 
> struct kobj_map *task_cdev_map(struct task_struct *tsk)
> {
>   return css_to_devs(task_subsys_state(tsk, devs_subsys_id))->cdev_map;
> }

Ok, thanks, I'll make it in the next version.

>> +
>> +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft,
>> +               struct file *f, const char __user *ubuf,
>> +               size_t nbytes, loff_t *pos)
>> +{
>> +       int err, all, chrdev;
>> +       dev_t dev;
>> +       char buf[64];
>> +       struct devs_cgroup *devs;
>> +       mode_t mode;
>> +
>> +       if (copy_from_user(buf, ubuf, sizeof(buf)))
>> +               return -EFAULT;
>> +
>> +       buf[sizeof(buf) - 1] = 0;
>> +       err = decode_perms_str(buf, &chrdev, &dev, &all, &mode);
>> +       if (err < 0)
>> +               return err;
>> +
>> +       devs = cgroup_to_devs(cont);
>> +
>> +       if (mode == 0) {
>> +               if (chrdev)
>> +                       err = cdev_del_from_map(devs->cdev_map, dev, all);
>> +               else
>> +                       err = bdev_del_from_map(devs->bdev_map, dev, all);
> 
> There's no locking involved on these calls, other than the cgroups
> code guaranteeing that the subsystem objects themselves won't go away.
> A quick look over the kobj_map calls suggests that these calls may be
> thread-safe - can you confirm that. (And maybe comment at the top of
> the function?)

Yup. The locking inside this call is unnecessary as maps
are atomic by themselves. I'll add the comment about it.

>> +       arg.buf = (char *)__get_free_page(GFP_KERNEL);
>> +       if (arg.buf == NULL)
>> +               return -ENOMEM;
>> +
>> +       devs = cgroup_to_devs(cont);
>> +       arg.pos = 0;
>> +
>> +       arg.chrdev = 1;
>> +       cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg);
>> +
>> +       arg.chrdev = 0;
>> +       bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg);
> 
> Is there any chance of this overflowing the buffer page?

Well, I have checks that we've filled all the page and do not
go on dumping the info in this case. So the only bad thing user
can have is that he doesn't see the full perms list, but the
kernel won't OOPs ;) Since we have from 8 to 16 bytes per perm
line, we may have from 256 to 256 permissions set for cgroup. 

Not that much, but that's enough for a first time (I haven't seen 
any OpenVZ user who sets more than 50 devperms). But sure this 
is one of TODO-s for the next patch versions.

> Paul
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] The character devices layer changes
       [not found]         ` <20080114170333.GA15077-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
@ 2008-01-15  8:05           ` Pavel Emelyanov
       [not found]             ` <478C6942.4050903-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Pavel Emelyanov @ 2008-01-15  8:05 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers, Paul Menage

Serge E. Hallyn wrote:
> Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org):
>> These changes include the API for the control group
>> to map/remap/unmap the devices with their permissions
>> and one important thing.
>>
>> The fact is that the struct cdev is cached in the inode
>> for faster access, so once we looked one up we go through
>> the fast path and omit the kobj_lookup() call. This is no
>> longer good when we restrict the access to cdevs.
>>
>> To address this issue, I store the last_perm and last(_map)
>> fields on the struct cdev (and protect them with the cdev_lock)
>> and force the re-lookup in the kobj mappings if needed.
>>
>> I know, this might be slow, but I have two points for it:
>> 1. The re-lookup happens on open() only which is not
>>    a fast-path. Besides, this is so for block layer and
>>    nobody complains;
>> 2. On a well-isolated setup, when each container has its
>>    own filesystem this is no longer a problem - each
>>    cgroup will cache the cdev on its inode and work good.
> 
> What about simply returning -EPERM when open()ing a cdev
> with ->map!=task_cdev_map(current)?

In this case it will HAVE to setup isolated filesystem for
each cgroup. I thought that this flexibility doesn't hurt.

> Shouldn't be a problem for ttys, since the container init
> already has the tty open, right?

Yup, but this is not the case for /dev/null or /dev/zero.

> Otherwise, the patchset looks good to me.  Want to look
> through this one a little more (i think that'd be easier
> with the -EPERM approach) and scrutinize patch 4, but
> overall it makes sense.

OK, thanks.

> If I understand right, we're taking 14k per cgroup for
> kobjmaps?  Do we consider that a problem?

14k? I allocate the struct kobj_map which is only 256 pointers
(i.e. - 2K) and the struct probe that is 32 bytes. I.e. 4k
or a single page. I think this is OK.

> thanks,
> -serge
> 

[snip]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] Devices accessibility control group (v2)
       [not found]     ` <6599ad830801141318h121a6a80h9af68c52431c48b8-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-01-15  8:06       ` Pavel Emelyanov
  0 siblings, 0 replies; 42+ messages in thread
From: Pavel Emelyanov @ 2008-01-15  8:06 UTC (permalink / raw)
  To: Paul Menage; +Cc: Linux Containers

Paul Menage wrote:
> On Jan 8, 2008 1:02 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
>> * Add the support for devices ranges. I.e. someone might
>>   wish to tell smth like b 5:[0-10] r- to this subsystem.
> 
> I'd be inclined to leave support for that in userspace, rather than
> adding fancier parsing to the kernel API.

Yup. I agree with that.

> Paul
> 

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] Devices accessibility control group (v2)
       [not found]             ` <20080114174220.GA17825-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-01-15  8:22               ` Pavel Emelyanov
       [not found]                 ` <478C6D2B.6020904-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Pavel Emelyanov @ 2008-01-15  8:22 UTC (permalink / raw)
  To: sukadev-r/Jw6+rmf7HQT0dZR+AlfA; +Cc: Linux Containers, Paul Menage

sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org wrote:
> | > I started playing with this and noticed that even if I try to
> | > enable read access to device [c, 1:3] it also grants access
> | > to device [c, 1:5]. 
> | 
> | Hm... I can't reproduce this:
> | 
> | # /bin/echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions
> | # /bin/echo -n $$ > /cnt/dev/0/tasks
> | # cat /cnt/dev/0/devices.permissions 
> | c 1:3 r-
> | # hexdump /dev/null 
> | # hexdump /dev/zero 
> | hexdump: /dev/zero: No such device or address
> | hexdump: /dev/zero: Bad file descriptor
> | 
> | Maybe you have played with devs cgroups before getting this?
> | Can you show what's the contents of the devices.permissions file
> | in your case?
> 
> Here is the repro again. I even tried after a reboot. Basically,
> granting access to /dev/null is also granting access to /dev/zero.
> 
> 	# cat devices.permissions
> 	# hexdump /dev/zero
> 	hexdump: /dev/zero: No such device or address
> 	hexdump: /dev/zero: Bad file descriptor
> 	# hexdump /dev/null
> 	hexdump: /dev/null: No such device or address
> 	hexdump: /dev/null: Bad file descriptor
> 	# echo 'c 1:3 r-' > devices.permissions
> 	# hexdump /dev/null
> 	# hexdump /dev/zero
> 	0000000 0000 0000 0000 0000 0000 0000 0000 0000
> 	*
> 	^C
> 	# cat tasks
> 	3279
> 	22266
> 	# ps
> 	  PID TTY          TIME CMD
> 	 3279 pts/0    00:00:00 bash
> 	22267 pts/0    00:00:00 ps
> 

This all looks completely incomprehensible :( 

Here's my test:
# mount -t cgroup none /cnt/dev/ -o devices
# mkdir /cnt/dev/0
# /bin/echo -n $$ > /cnt/dev/0/tasks 
# cat /cnt/dev/0/devices.permissions 
# hexdump /dev/zero 
hexdump: /dev/zero: No such device or address
hexdump: /dev/zero: Bad file descriptor
# hexdump /dev/null 
hexdump: /dev/null: No such device or address
hexdump: /dev/null: Bad file descriptor
# echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions
# cat /cnt/dev/0/devices.permissions 
c 1:3 r-
# hexdump /dev/null 
# hexdump /dev/zero 
hexdump: /dev/zero: No such device or address
hexdump: /dev/zero: Bad file descriptor


Sukadev, could you please try to track the problem as you
seem to be the only person who's experiencing problems
with that.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]             ` <478C6669.7070705-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2008-01-15 14:44               ` Serge E. Hallyn
       [not found]                 ` <20080115144440.GE4453-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Serge E. Hallyn @ 2008-01-15 14:44 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage

Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org):
> [snip]
> 
> > Thanks for working on this, Pavel.
> > 
> > My only question with this patch is - so if I create a devs
> > cgroup which only has access to, say /dev/loop0 and /dev/tty3,
> > and someone in that cgroup manages to create a new cgroup, the
> > new cgroup will have all the default permissions again, rather
> > than inherit the permissions from this cgroup, right?
> 
> Right. When you create a new cgroup you have an empty perms
> set. Maybe it's worth inheriting the perms from the parent
> container, but I think that empty set is better as you will
> reconfigure it anyway.
> 
> [snip]
> 
> >> +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft,
> >> +		struct file *f, const char __user *ubuf,
> >> +		size_t nbytes, loff_t *pos)
> >> +{
> >> +	int err, all, chrdev;
> >> +	dev_t dev;
> >> +	char buf[64];
> >> +	struct devs_cgroup *devs;
> >> +	mode_t mode;
> > 
> > (Of course this will require some privilege, i assume that's a detail
> > you'll add next time around)
> 
> Hm... I though that privileges are governed at the cgroup level.... No?

I don't think so...  Wouldn't really make sense for the cgroup
infrastructure to presume to know what to enforce, and I don't see any
checks around the _write functions in cgroup.c, and no capable() calls
at all.

-serge

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] The character devices layer changes
       [not found]             ` <478C6942.4050903-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2008-01-15 14:54               ` Serge E. Hallyn
  0 siblings, 0 replies; 42+ messages in thread
From: Serge E. Hallyn @ 2008-01-15 14:54 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage

Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org):
> Serge E. Hallyn wrote:
> > Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org):
> >> These changes include the API for the control group
> >> to map/remap/unmap the devices with their permissions
> >> and one important thing.
> >>
> >> The fact is that the struct cdev is cached in the inode
> >> for faster access, so once we looked one up we go through
> >> the fast path and omit the kobj_lookup() call. This is no
> >> longer good when we restrict the access to cdevs.
> >>
> >> To address this issue, I store the last_perm and last(_map)
> >> fields on the struct cdev (and protect them with the cdev_lock)
> >> and force the re-lookup in the kobj mappings if needed.
> >>
> >> I know, this might be slow, but I have two points for it:
> >> 1. The re-lookup happens on open() only which is not
> >>    a fast-path. Besides, this is so for block layer and
> >>    nobody complains;
> >> 2. On a well-isolated setup, when each container has its
> >>    own filesystem this is no longer a problem - each
> >>    cgroup will cache the cdev on its inode and work good.
> > 
> > What about simply returning -EPERM when open()ing a cdev
> > with ->map!=task_cdev_map(current)?
> 
> In this case it will HAVE to setup isolated filesystem for
> each cgroup. I thought that this flexibility doesn't hurt.

The cost and effort of setting up a private /dev seems so minimal to me
it seems worth not dealing with the inode->map switching around.

But maybe that's just me.

> > Shouldn't be a problem for ttys, since the container init
> > already has the tty open, right?
> 
> Yup, but this is not the case for /dev/null or /dev/zero.
> 
> > Otherwise, the patchset looks good to me.  Want to look
> > through this one a little more (i think that'd be easier
> > with the -EPERM approach) and scrutinize patch 4, but
> > overall it makes sense.
> 
> OK, thanks.
> 
> > If I understand right, we're taking 14k per cgroup for
> > kobjmaps?  Do we consider that a problem?
> 
> 14k? I allocate the struct kobj_map which is only 256 pointers
> (i.e. - 2K) and the struct probe that is 32 bytes. I.e. 4k
> or a single page. I think this is OK.

Oops, I was thinking the probes were all pre-allocated.  Sorry.

> > thanks,
> > -serge
> > 
> 
> [snip]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]                 ` <20080115144440.GE4453-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
@ 2008-01-15 16:13                   ` Paul Menage
       [not found]                     ` <6599ad830801150813s6a5a7374qd25b6d6206d5896a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Paul Menage @ 2008-01-15 16:13 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers, Pavel Emelyanov

On Jan 15, 2008 6:44 AM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:
>
> I don't think so...  Wouldn't really make sense for the cgroup
> infrastructure to presume to know what to enforce, and I don't see any
> checks around the _write functions in cgroup.c, and no capable() calls
> at all.

The cgroup filesystem can provide simple unix-level permissions on any
given file. Am I right in thinking that having an entry in the mapper
doesn't automatically give privileges for a device to the members of
the cgroup, but they also have to have sufficient privilege in their
own right? If so, that might be sufficient.

One other thought - should the parse/print routines themselves do a
translation based on the device mappings for the writer/reader's
cgroup? That way you could safely give a VE full permission to write
to its children's device maps, but it would only be able to add/remap
device targets that it could address itself.

Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]                     ` <6599ad830801150813s6a5a7374qd25b6d6206d5896a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-01-15 17:49                       ` Serge E. Hallyn
       [not found]                         ` <20080115174941.GA11638-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Serge E. Hallyn @ 2008-01-15 17:49 UTC (permalink / raw)
  To: Paul Menage; +Cc: Linux Containers, Pavel Emelyanov

Quoting Paul Menage (menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> On Jan 15, 2008 6:44 AM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:
> >
> > I don't think so...  Wouldn't really make sense for the cgroup
> > infrastructure to presume to know what to enforce, and I don't see any
> > checks around the _write functions in cgroup.c, and no capable() calls
> > at all.
> 
> The cgroup filesystem can provide simple unix-level permissions on any
> given file. Am I right in thinking that having an entry in the mapper
> doesn't automatically give privileges for a device to the members of
> the cgroup, but they also have to have sufficient privilege in their
> own right? If so, that might be sufficient.

Oh, well actually I think what we'd want is to require both
CAP_NS_OVERRIDE and either CAP_MKNOD or CAP_SYS_ADMIN.  So it's probably
fine to leave this as is for now, and after I resend the patchset which
pushes CAP_NS_OVERRIDE (which is in a 4-patch userns patchset I've
been sitting on) the extra checks can be added.

> One other thought - should the parse/print routines themselves do a
> translation based on the device mappings for the writer/reader's
> cgroup? That way you could safely give a VE full permission to write
> to its children's device maps, but it would only be able to add/remap
> device targets that it could address itself.

Oh, well if we do this then we can just as well use the translation
functions to not allow a VE to add to its own set of devices, right?

Then maybe capable(CAP_NS_OVERRIDE|CAP_SYS_ADMIN) would only be required
to add devices.

Though there *is* some bit of danger to removing devices from a
privileged daemon, isn't there?  Though I can't think of examples
just now.  (Sorry, piercing headache, can't think quite right, will
think about this later)

-serge

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]                         ` <20080115174941.GA11638-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
@ 2008-01-15 17:54                           ` Paul Menage
       [not found]                             ` <6599ad830801150954w7e1b6db0p4dd737730f407348-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Paul Menage @ 2008-01-15 17:54 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers, Pavel Emelyanov

On Jan 15, 2008 9:49 AM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:
> > One other thought - should the parse/print routines themselves do a
> > translation based on the device mappings for the writer/reader's
> > cgroup? That way you could safely give a VE full permission to write
> > to its children's device maps, but it would only be able to add/remap
> > device targets that it could address itself.
>
> Oh, well if we do this then we can just as well use the translation
> functions to not allow a VE to add to its own set of devices, right?

Right.

>
> Then maybe capable(CAP_NS_OVERRIDE|CAP_SYS_ADMIN) would only be required
> to add devices.

Or simply require that they be added by someone who already has access
to that device via their own control group? The root cgroup would have
access to all devices.

Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]                             ` <6599ad830801150954w7e1b6db0p4dd737730f407348-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-01-15 18:17                               ` Serge E. Hallyn
  0 siblings, 0 replies; 42+ messages in thread
From: Serge E. Hallyn @ 2008-01-15 18:17 UTC (permalink / raw)
  To: Paul Menage; +Cc: Linux Containers, Pavel Emelyanov

Quoting Paul Menage (menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> On Jan 15, 2008 9:49 AM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:
> > > One other thought - should the parse/print routines themselves do a
> > > translation based on the device mappings for the writer/reader's
> > > cgroup? That way you could safely give a VE full permission to write
> > > to its children's device maps, but it would only be able to add/remap
> > > device targets that it could address itself.
> >
> > Oh, well if we do this then we can just as well use the translation
> > functions to not allow a VE to add to its own set of devices, right?
> 
> Right.
> 
> >
> > Then maybe capable(CAP_NS_OVERRIDE|CAP_SYS_ADMIN) would only be required
> > to add devices.
> 
> Or simply require that they be added by someone who already has access
> to that device via their own control group? The root cgroup would have
> access to all devices.

Where by 'have access' you mean access to create the device?  That
sounds good.

thanks,
-serge

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] Devices accessibility control group (v2)
       [not found]                 ` <478C6D2B.6020904-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2008-01-17  6:26                   ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
       [not found]                     ` <20080117062605.GA24475-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2008-01-17  6:26 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage

Pavel Emelianov [xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org] wrote:
| sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org wrote:
| > | > I started playing with this and noticed that even if I try to
| > | > enable read access to device [c, 1:3] it also grants access
| > | > to device [c, 1:5]. 
| > | 
| > | Hm... I can't reproduce this:
| > | 
| > | # /bin/echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions
| > | # /bin/echo -n $$ > /cnt/dev/0/tasks
| > | # cat /cnt/dev/0/devices.permissions 
| > | c 1:3 r-
| > | # hexdump /dev/null 
| > | # hexdump /dev/zero 
| > | hexdump: /dev/zero: No such device or address
| > | hexdump: /dev/zero: Bad file descriptor
| > | 
| > | Maybe you have played with devs cgroups before getting this?
| > | Can you show what's the contents of the devices.permissions file
| > | in your case?
| > 
| > Here is the repro again. I even tried after a reboot. Basically,
| > granting access to /dev/null is also granting access to /dev/zero.
| > 
| > 	# cat devices.permissions
| > 	# hexdump /dev/zero
| > 	hexdump: /dev/zero: No such device or address
| > 	hexdump: /dev/zero: Bad file descriptor
| > 	# hexdump /dev/null
| > 	hexdump: /dev/null: No such device or address
| > 	hexdump: /dev/null: Bad file descriptor
| > 	# echo 'c 1:3 r-' > devices.permissions
| > 	# hexdump /dev/null
| > 	# hexdump /dev/zero
| > 	0000000 0000 0000 0000 0000 0000 0000 0000 0000
| > 	*
| > 	^C
| > 	# cat tasks
| > 	3279
| > 	22266
| > 	# ps
| > 	  PID TTY          TIME CMD
| > 	 3279 pts/0    00:00:00 bash
| > 	22267 pts/0    00:00:00 ps
| > 
| 
| This all looks completely incomprehensible :( 
| 
| Here's my test:
| # mount -t cgroup none /cnt/dev/ -o devices
| # mkdir /cnt/dev/0
| # /bin/echo -n $$ > /cnt/dev/0/tasks 
| # cat /cnt/dev/0/devices.permissions 
| # hexdump /dev/zero 
| hexdump: /dev/zero: No such device or address
| hexdump: /dev/zero: Bad file descriptor

Can you try this sequence:

	- grant access to /dev/zero,
	- hexdump /dev/zero
	- revoke access to /dev/zero
	- hexdump /dev/null
	- hexdump /dev/zero.

| # hexdump /dev/null 
| hexdump: /dev/null: No such device or address
| hexdump: /dev/null: Bad file descriptor
| # echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions
| # cat /cnt/dev/0/devices.permissions 
| c 1:3 r-
| # hexdump /dev/null 
| # hexdump /dev/zero 
| hexdump: /dev/zero: No such device or address
| hexdump: /dev/zero: Bad file descriptor
| 
| 
| Sukadev, could you please try to track the problem as you
| seem to be the only person who's experiencing problems
| with that.


I suspect the 'caching' of the last_mode (that you introduce in PATCH 2/4)
combined with the fact that /dev/zero, /dev/null, /dev/kmem etc share
a _SINGLE_ 'struct cdev' leads to the problem I am running into with
/dev/zero and /dev/null.

Here is a what I suspect is happening (sorry, for low-level details)

Following sequence seems to repro it consistently for me:

	$ mount -t cgroup none /container/devs/ -o devices
	$ mkdir /container/devs/0
	$ cd !$
	cd /container/devs/0
	$ echo $$ > tasks

	$ hexdump /dev/zero
	hexdump: /dev/zero: No such device or address
	hexdump: /dev/zero: Bad file descriptor

	$ hexdump /dev/null
	hexdump: /dev/null: No such device or address
	hexdump: /dev/null: Bad file descriptor

	$ echo 'c 1:3 r-' > devices.permissions

	$ hexdump /dev/null

	$ hexdump /dev/zero
	hexdump: /dev/zero: No such device or address
	hexdump: /dev/zero: Bad file descriptor

No surprise so far.

	$ echo 'c 1:5 r-' > devices.permissions
	$ hexdump /dev/zero
	0000000 0000 0000 0000 0000 0000 0000 0000 0000
	*
	^C

Now grant read access to /dev/zero and more importantly, create a properly
initialized inode for it.

	$ echo 'c 1:5 --' > devices.permissions

Then remove access to /dev/zero. This removes the kobject for /dev/zero from
map.  Also cdev_map_reset() sets cdev->last to NULL.

	$ hdz
	hexdump: /dev/zero: No such device or address
	hexdump: /dev/zero: Bad file descriptor

Since cdev->last is NULL, chrdev_open() calls kobj_lookup() which returns a
NULL kobj and the open fails.

	$ hexdump /dev/null	# XXX 

Again, since cdev->last is NULL, kobj_lookup() is called, this time for
/dev/null.  This succeeds and cdev->last is correctly initialized.
Eventually this open of /dev/null succeeds.

	$ hexdump /dev/zero
	0000000 0000 0000 0000 0000 0000 0000 0000 0000

Now the open of /dev/zero also succeeds !

I suspect that the reason is that when we first successfully read /dev/zero,
we created/initialized an inode for it. This inode has the inode->i_cdev set
correctly.

By reading /dev/null (marked XXX above), cdev->last is also correctly set. 

But since /dev/zero and /dev/null _SHARE_ a 'struct cdev', when we call
chrdev_open() for /dev/zero, we check the permissions of this common cdev
and grant /dev/zero the same permissions as /dev/null.

I suspect we will get this behavior with all devices implemented by
the 'mem' driver in drivers/char/mem.c. I was able to repro with
/dev/full [c, 1:7])

Sukadev

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] Devices accessibility control group (v2)
       [not found]                     ` <20080117062605.GA24475-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-01-21  8:31                       ` Pavel Emelyanov
  0 siblings, 0 replies; 42+ messages in thread
From: Pavel Emelyanov @ 2008-01-21  8:31 UTC (permalink / raw)
  To: sukadev-r/Jw6+rmf7HQT0dZR+AlfA; +Cc: Linux Containers, Paul Menage

sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org wrote:
> Pavel Emelianov [xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org] wrote:
> | sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org wrote:
> | > | > I started playing with this and noticed that even if I try to
> | > | > enable read access to device [c, 1:3] it also grants access
> | > | > to device [c, 1:5]. 
> | > | 
> | > | Hm... I can't reproduce this:
> | > | 
> | > | # /bin/echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions
> | > | # /bin/echo -n $$ > /cnt/dev/0/tasks
> | > | # cat /cnt/dev/0/devices.permissions 
> | > | c 1:3 r-
> | > | # hexdump /dev/null 
> | > | # hexdump /dev/zero 
> | > | hexdump: /dev/zero: No such device or address
> | > | hexdump: /dev/zero: Bad file descriptor
> | > | 
> | > | Maybe you have played with devs cgroups before getting this?
> | > | Can you show what's the contents of the devices.permissions file
> | > | in your case?
> | > 
> | > Here is the repro again. I even tried after a reboot. Basically,
> | > granting access to /dev/null is also granting access to /dev/zero.
> | > 
> | > 	# cat devices.permissions
> | > 	# hexdump /dev/zero
> | > 	hexdump: /dev/zero: No such device or address
> | > 	hexdump: /dev/zero: Bad file descriptor
> | > 	# hexdump /dev/null
> | > 	hexdump: /dev/null: No such device or address
> | > 	hexdump: /dev/null: Bad file descriptor
> | > 	# echo 'c 1:3 r-' > devices.permissions
> | > 	# hexdump /dev/null
> | > 	# hexdump /dev/zero
> | > 	0000000 0000 0000 0000 0000 0000 0000 0000 0000
> | > 	*
> | > 	^C
> | > 	# cat tasks
> | > 	3279
> | > 	22266
> | > 	# ps
> | > 	  PID TTY          TIME CMD
> | > 	 3279 pts/0    00:00:00 bash
> | > 	22267 pts/0    00:00:00 ps
> | > 
> | 
> | This all looks completely incomprehensible :( 
> | 
> | Here's my test:
> | # mount -t cgroup none /cnt/dev/ -o devices
> | # mkdir /cnt/dev/0
> | # /bin/echo -n $$ > /cnt/dev/0/tasks 
> | # cat /cnt/dev/0/devices.permissions 
> | # hexdump /dev/zero 
> | hexdump: /dev/zero: No such device or address
> | hexdump: /dev/zero: Bad file descriptor
> 
> Can you try this sequence:
> 
> 	- grant access to /dev/zero,
> 	- hexdump /dev/zero
> 	- revoke access to /dev/zero
> 	- hexdump /dev/null
> 	- hexdump /dev/zero.

OK, I'll try it, thanks.

> | # hexdump /dev/null 
> | hexdump: /dev/null: No such device or address
> | hexdump: /dev/null: Bad file descriptor
> | # echo 'c 1:3 r-' > /cnt/dev/0/devices.permissions
> | # cat /cnt/dev/0/devices.permissions 
> | c 1:3 r-
> | # hexdump /dev/null 
> | # hexdump /dev/zero 
> | hexdump: /dev/zero: No such device or address
> | hexdump: /dev/zero: Bad file descriptor
> | 
> | 
> | Sukadev, could you please try to track the problem as you
> | seem to be the only person who's experiencing problems
> | with that.
> 
> 
> I suspect the 'caching' of the last_mode (that you introduce in PATCH 2/4)
> combined with the fact that /dev/zero, /dev/null, /dev/kmem etc share
> a _SINGLE_ 'struct cdev' leads to the problem I am running into with
> /dev/zero and /dev/null.
> 
> Here is a what I suspect is happening (sorry, for low-level details)
> 
> Following sequence seems to repro it consistently for me:
> 
> 	$ mount -t cgroup none /container/devs/ -o devices
> 	$ mkdir /container/devs/0
> 	$ cd !$
> 	cd /container/devs/0
> 	$ echo $$ > tasks
> 
> 	$ hexdump /dev/zero
> 	hexdump: /dev/zero: No such device or address
> 	hexdump: /dev/zero: Bad file descriptor
> 
> 	$ hexdump /dev/null
> 	hexdump: /dev/null: No such device or address
> 	hexdump: /dev/null: Bad file descriptor
> 
> 	$ echo 'c 1:3 r-' > devices.permissions
> 
> 	$ hexdump /dev/null
> 
> 	$ hexdump /dev/zero
> 	hexdump: /dev/zero: No such device or address
> 	hexdump: /dev/zero: Bad file descriptor
> 
> No surprise so far.
> 
> 	$ echo 'c 1:5 r-' > devices.permissions
> 	$ hexdump /dev/zero
> 	0000000 0000 0000 0000 0000 0000 0000 0000 0000
> 	*
> 	^C
> 
> Now grant read access to /dev/zero and more importantly, create a properly
> initialized inode for it.
> 
> 	$ echo 'c 1:5 --' > devices.permissions
> 
> Then remove access to /dev/zero. This removes the kobject for /dev/zero from
> map.  Also cdev_map_reset() sets cdev->last to NULL.
> 
> 	$ hdz
> 	hexdump: /dev/zero: No such device or address
> 	hexdump: /dev/zero: Bad file descriptor
> 
> Since cdev->last is NULL, chrdev_open() calls kobj_lookup() which returns a
> NULL kobj and the open fails.
> 
> 	$ hexdump /dev/null	# XXX 
> 
> Again, since cdev->last is NULL, kobj_lookup() is called, this time for
> /dev/null.  This succeeds and cdev->last is correctly initialized.
> Eventually this open of /dev/null succeeds.
> 
> 	$ hexdump /dev/zero
> 	0000000 0000 0000 0000 0000 0000 0000 0000 0000
> 
> Now the open of /dev/zero also succeeds !

Hm... The analysis looks correct. Thanks, Sukadev, I'll try
to resolve this issue.

> I suspect that the reason is that when we first successfully read /dev/zero,
> we created/initialized an inode for it. This inode has the inode->i_cdev set
> correctly.
> 
> By reading /dev/null (marked XXX above), cdev->last is also correctly set. 
> 
> But since /dev/zero and /dev/null _SHARE_ a 'struct cdev', when we call
> chrdev_open() for /dev/zero, we check the permissions of this common cdev
> and grant /dev/zero the same permissions as /dev/null.
> 
> I suspect we will get this behavior with all devices implemented by
> the 'mem' driver in drivers/char/mem.c. I was able to repro with
> /dev/full [c, 1:7])
> 
> Sukadev
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 4/4] The control group itself
       [not found] ` <47AAFFF2.9030804-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2008-02-07 13:01   ` Pavel Emelyanov
       [not found]     ` <47AB013B.8060502-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Pavel Emelyanov @ 2008-02-07 13:01 UTC (permalink / raw)
  To: Serge Hallyn, Sukadev Bhattiprolu; +Cc: Linux Containers, Paul Menage

Each new group will have its own maps for char and block
layers. The devices access list is tuned via the 
devices.permissions file. One may read from the file to get 
the configured state.

The top container isn't initialized, so that the char 
and block layers will use the global maps to lookup 
their devices. I did that not to export the static maps
to the outer world.

Good news is that this patch now contains more comments 
and Documentation file :)

Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>

---

diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt
new file mode 100644
index 0000000..dbd0c7a
--- /dev/null
+++ b/Documentation/controllers/devices.txt
@@ -0,0 +1,61 @@
+
+	Devices visibility controller
+
+This controller allows to tune the devices accessibility by tasks,
+i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
+access to IDE devices and completely hide SCSI disks.
+
+Tasks still can call mknod to create device files, regardless of
+whether the particular device is visible or accessible, but they
+may not be able to open it later.
+
+This one hides under CONFIG_CGROUP_DEVS option.
+
+
+Configuring
+
+The controller provides a single file to configure itself -- the
+devices.permissions one. To change the accessibility level for some
+device write the following string into it:
+
+[cb] <major>:(<minor>|*) [r-][w-]
+ ^          ^               ^
+ |          |               |
+ |          |               +--- access rights (1)
+ |          |
+ |          +-- device major and minor numbers (2)
+ |
+ +-- device type (character / block)
+
+1) The access rights set to '--' remove the device from the group's
+access list, so that it will not even be shown in this file later.
+
+2) Setting the minor to '*' grants access to all the minors for
+particular major.
+
+When reading from it, one may see something like
+
+	c 1:5 rw
+	b 8:* r-
+
+Security issues, concerning who may grant access to what are governed
+at the cgroup infrastructure level.
+
+
+Examples:
+
+1. Grand full access to /dev/null
+	# echo c 1:3 rw > /cgroups/<id>/devices.permissions
+
+2. Grant the read-only access to /dev/sda and partitions
+	# echo b 8:* r- > ...
+
+3. Change the /dev/null access to write-only
+	# echo c 1:3 -w > ...
+
+4. Revoke access to /dev/sda
+	# echo b 8:* -- > ...
+
+
+		Written by Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
+
diff --git a/fs/Makefile b/fs/Makefile
index 7996220..5ad03be 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -64,6 +64,8 @@ obj-y				+= devpts/
 
 obj-$(CONFIG_PROFILING)		+= dcookies.o
 obj-$(CONFIG_DLM)		+= dlm/
+
+obj-$(CONFIG_CGROUP_DEVS)	+= devscontrol.o
  
 # Do not add any filesystems before this line
 obj-$(CONFIG_REISERFS_FS)	+= reiserfs/
diff --git a/fs/devscontrol.c b/fs/devscontrol.c
new file mode 100644
index 0000000..48c5f69
--- /dev/null
+++ b/fs/devscontrol.c
@@ -0,0 +1,314 @@
+/*
+ * devscontrol.c - Device Controller
+ *
+ * Copyright 2007 OpenVZ SWsoft Inc
+ * Author: Pavel Emelyanov <xemul at openvz dot org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cdev.h>
+#include <linux/err.h>
+#include <linux/devscontrol.h>
+#include <linux/uaccess.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+
+struct devs_cgroup {
+	/*
+	 * The subsys state to build into cgrous infrastructure
+	 */
+	struct cgroup_subsys_state css;
+
+	/*
+	 * The maps of character and block devices. They provide a
+	 * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c
+	 * and block/genhd.c to find out how the ->open() callbacks
+	 * work when opening a device.
+	 *
+	 * Each group will have its onw maps, and at the open()
+	 * time code will lookup in this map to get the device
+	 * and permissions by its dev_t.
+	 */
+	struct kobj_map *cdev_map;
+	struct kobj_map *bdev_map;
+};
+
+static inline
+struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct devs_cgroup, css);
+}
+
+static inline
+struct devs_cgroup *cgroup_to_devs(struct cgroup *cont)
+{
+	return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id));
+}
+
+struct kobj_map *task_cdev_map(struct task_struct *tsk)
+{
+	struct cgroup_subsys_state *css;
+
+	css = task_subsys_state(tsk, devs_subsys_id);
+	if (css->cgroup->parent == NULL)
+		return NULL;
+	else
+		return css_to_devs(css)->cdev_map;
+}
+
+struct kobj_map *task_bdev_map(struct task_struct *tsk)
+{
+	struct cgroup_subsys_state *css;
+
+	css = task_subsys_state(tsk, devs_subsys_id);
+	if (css->cgroup->parent == NULL)
+		return NULL;
+	else
+		return css_to_devs(css)->bdev_map;
+}
+
+static struct cgroup_subsys_state *
+devs_create(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	struct devs_cgroup *devs;
+
+	devs = kzalloc(sizeof(struct devs_cgroup), GFP_KERNEL);
+	if (devs == NULL)
+		goto out;
+
+	devs->cdev_map = cdev_map_init();
+	if (devs->cdev_map == NULL)
+		goto out_free;
+
+	devs->bdev_map = bdev_map_init();
+	if (devs->bdev_map == NULL)
+		goto out_free_cdev;
+
+	return &devs->css;
+
+out_free_cdev:
+	cdev_map_fini(devs->cdev_map);
+out_free:
+	kfree(devs);
+out:
+	return ERR_PTR(-ENOMEM);
+}
+
+static void devs_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	struct devs_cgroup *devs;
+
+	devs = cgroup_to_devs(cont);
+	bdev_map_fini(devs->bdev_map);
+	cdev_map_fini(devs->cdev_map);
+	kfree(devs);
+}
+
+/*
+ * The devices.permissions file read/write functionality
+ *
+ * The following two routines parse and print the strings like
+ * [cb] <major>:(<minor>|*) [r-][w-]
+ */
+
+static int decode_perms_str(char *buf, int *chrdev, dev_t *dev,
+		int *all, mode_t *mode)
+{
+	unsigned int major, minor;
+	char *end;
+	mode_t tmp;
+
+	if (buf[0] == 'c')
+		*chrdev = 1;
+	else if (buf[0] == 'b')
+		*chrdev = 0;
+	else
+		return -EINVAL;
+
+	if (buf[1] != ' ')
+		return -EINVAL;
+
+	major = simple_strtoul(buf + 2, &end, 10);
+	if (*end != ':')
+		return -EINVAL;
+
+	if (end[1] == '*') {
+		if (end[2] != ' ')
+			return -EINVAL;
+
+		*all = 1;
+		minor = 0;
+		end += 2;
+	} else {
+		minor = simple_strtoul(end + 1, &end, 10);
+		if (*end != ' ')
+			return -EINVAL;
+
+		*all = 0;
+	}
+
+	tmp = 0;
+
+	if (end[1] == 'r')
+		tmp |= FMODE_READ;
+	else if (end[1] != '-')
+		return -EINVAL;
+	if (end[2] == 'w')
+		tmp |= FMODE_WRITE;
+	else if (end[2] != '-')
+		return -EINVAL;
+
+	*dev = MKDEV(major, minor);
+	*mode = tmp;
+	return 0;
+}
+
+static int encode_perms_str(char *buf, int len, int chrdev, dev_t dev,
+		int all, mode_t mode)
+{
+	int ret;
+
+	ret = snprintf(buf, len, "%c %d:", chrdev ? 'c' : 'b', MAJOR(dev));
+	if (all)
+		ret += snprintf(buf + ret, len - ret, "*");
+	else
+		ret += snprintf(buf + ret, len - ret, "%d", MINOR(dev));
+
+	ret += snprintf(buf + ret, len - ret, " %c%c\n",
+			(mode & FMODE_READ) ? 'r' : '-',
+			(mode & FMODE_WRITE) ? 'w' : '-');
+
+	return ret + 1;
+}
+
+static ssize_t devs_write(struct cgroup *cont, struct cftype *cft,
+		struct file *f, const char __user *ubuf,
+		size_t nbytes, loff_t *pos)
+{
+	int err, all, chrdev;
+	dev_t dev;
+	char buf[64];
+	struct devs_cgroup *devs;
+	mode_t mode;
+
+	if (copy_from_user(buf, ubuf, sizeof(buf)))
+		return -EFAULT;
+
+	buf[sizeof(buf) - 1] = 0;
+	err = decode_perms_str(buf, &chrdev, &dev, &all, &mode);
+	if (err < 0)
+		return err;
+
+	devs = cgroup_to_devs(cont);
+
+	/*
+	 * No locking here is required - all that we need
+	 * is provided inside the kobject mapping code
+	 */
+
+	if (mode == 0) {
+		if (chrdev)
+			err = cdev_del_from_map(devs->cdev_map, dev, all);
+		else
+			err = bdev_del_from_map(devs->bdev_map, dev, all);
+
+		if (err < 0)
+			return err;
+
+		css_put(&devs->css);
+	} else {
+		if (chrdev)
+			err = cdev_add_to_map(devs->cdev_map, dev, all, mode);
+		else
+			err = bdev_add_to_map(devs->bdev_map, dev, all, mode);
+
+		if (err < 0)
+			return err;
+
+		css_get(&devs->css);
+	}
+
+	return nbytes;
+}
+
+struct devs_dump_arg {
+	char *buf;
+	int pos;
+	int chrdev;
+};
+
+static int devs_dump_one(dev_t dev, int range, mode_t mode, void *x)
+{
+	struct devs_dump_arg *arg = x;
+	char tmp[64];
+	int len;
+
+	len = encode_perms_str(tmp, sizeof(tmp), arg->chrdev, dev,
+			range != 1, mode);
+
+	if (arg->pos >= PAGE_SIZE - len)
+		return 1;
+
+	memcpy(arg->buf + arg->pos, tmp, len);
+	arg->pos += len;
+	return 0;
+}
+
+static ssize_t devs_read(struct cgroup *cont, struct cftype *cft,
+		struct file *f, char __user *ubuf, size_t nbytes, loff_t *pos)
+{
+	struct devs_dump_arg arg;
+	struct devs_cgroup *devs;
+	ssize_t ret;
+
+	arg.buf = (char *)__get_free_page(GFP_KERNEL);
+	if (arg.buf == NULL)
+		return -ENOMEM;
+
+	devs = cgroup_to_devs(cont);
+	arg.pos = 0;
+
+	arg.chrdev = 1;
+	cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg);
+
+	arg.chrdev = 0;
+	bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg);
+
+	ret = simple_read_from_buffer(ubuf, nbytes, pos,
+			arg.buf, arg.pos);
+
+	free_page((unsigned long)arg.buf);
+	return ret;
+}
+
+static struct cftype devs_files[] = {
+	{
+		.name = "permissions",
+		.write = devs_write,
+		.read = devs_read,
+	},
+};
+
+static int devs_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	return cgroup_add_files(cont, ss,
+			devs_files, ARRAY_SIZE(devs_files));
+}
+
+struct cgroup_subsys devs_subsys = {
+	.name = "devices",
+	.subsys_id = devs_subsys_id,
+	.create = devs_create,
+	.destroy = devs_destroy,
+	.populate = devs_populate,
+};
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 228235c..9c0cd2c 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -42,3 +42,9 @@ SUBSYS(mem_cgroup)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_DEVS
+SUBSYS(devs)
+#endif
+
+/* */
diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
new file mode 100644
index 0000000..38057b9
--- /dev/null
+++ b/include/linux/devscontrol.h
@@ -0,0 +1,26 @@
+#ifndef __DEVS_CONTROL_H__
+#define __DEVS_CONTROL_H__
+struct kobj_map;
+struct task_struct;
+
+/*
+ * task_[cb]dev_map - get a map from task. Both calls may return
+ * NULL, to indicate, that task doesn't belong to any group and
+ * that the global map is to be used.
+ */
+
+#ifdef CONFIG_CGROUP_DEVS
+struct kobj_map *task_cdev_map(struct task_struct *);
+struct kobj_map *task_bdev_map(struct task_struct *);
+#else
+static inline kobj_map *task_cdev_map(struct task_struct *tsk)
+{
+	return NULL;
+}
+
+static inline kobj_map *task_bdev_map(struct task_struct *tsk)
+{
+	return NULL;
+}
+#endif
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 732a1c2..f9a1b4f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -283,6 +283,19 @@ config CGROUP_DEBUG
 
 	  Say N if unsure
 
+config CGROUP_DEVS
+	bool "Devices cgroup subsystem"
+	depends on CGROUPS
+	help
+	  Controlls the access rights to devices, i.e. you may hide
+	  some of them from tasks, so that they will not be able
+	  to open them, or you may grant a read-only access to some
+	  of them.
+
+	  See Documentation/controllers/devices.txt for details.
+
+	  This is harmless to say N here, so do it if unsure.
+
 config CGROUP_NS
         bool "Namespace cgroup subsystem"
         depends on CGROUPS

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]     ` <47AB013B.8060502-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2008-02-11 17:38       ` Serge E. Hallyn
       [not found]         ` <20080211173830.GA22160-6s5zFf/epYL1ENwx4SLHqw@public.gmane.org>
  2008-02-12  7:42       ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
  2008-02-21 20:47       ` Paul Menage
  2 siblings, 1 reply; 42+ messages in thread
From: Serge E. Hallyn @ 2008-02-11 17:38 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Paul Menage, Linux Containers

Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org):
> Each new group will have its own maps for char and block
> layers. The devices access list is tuned via the 
> devices.permissions file. One may read from the file to get 
> the configured state.
> 
> The top container isn't initialized, so that the char 
> and block layers will use the global maps to lookup 
> their devices. I did that not to export the static maps
> to the outer world.
> 
> Good news is that this patch now contains more comments 
> and Documentation file :)

Seems to work as advertised  :)  I can't reproduce Suka's null/zero
bug.

You're relying fully on uid-0 to stop writes into the
devices.permissions files.  Would you mind adding a check for
CAP_SYS_ADMIN (or CAP_NS_OVERRIDE+CAP_MKNOD)?  Or were you really
counting on using the filesystem visibility cgroup to stop a container
from making changes to its device access whitelist?

thanks,
-serge

> Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> 
> ---
> 
> diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt
> new file mode 100644
> index 0000000..dbd0c7a
> --- /dev/null
> +++ b/Documentation/controllers/devices.txt
> @@ -0,0 +1,61 @@
> +
> +	Devices visibility controller
> +
> +This controller allows to tune the devices accessibility by tasks,
> +i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
> +access to IDE devices and completely hide SCSI disks.
> +
> +Tasks still can call mknod to create device files, regardless of
> +whether the particular device is visible or accessible, but they
> +may not be able to open it later.
> +
> +This one hides under CONFIG_CGROUP_DEVS option.
> +
> +
> +Configuring
> +
> +The controller provides a single file to configure itself -- the
> +devices.permissions one. To change the accessibility level for some
> +device write the following string into it:
> +
> +[cb] <major>:(<minor>|*) [r-][w-]
> + ^          ^               ^
> + |          |               |
> + |          |               +--- access rights (1)
> + |          |
> + |          +-- device major and minor numbers (2)
> + |
> + +-- device type (character / block)
> +
> +1) The access rights set to '--' remove the device from the group's
> +access list, so that it will not even be shown in this file later.
> +
> +2) Setting the minor to '*' grants access to all the minors for
> +particular major.
> +
> +When reading from it, one may see something like
> +
> +	c 1:5 rw
> +	b 8:* r-
> +
> +Security issues, concerning who may grant access to what are governed
> +at the cgroup infrastructure level.
> +
> +
> +Examples:
> +
> +1. Grand full access to /dev/null
> +	# echo c 1:3 rw > /cgroups/<id>/devices.permissions
> +
> +2. Grant the read-only access to /dev/sda and partitions
> +	# echo b 8:* r- > ...
> +
> +3. Change the /dev/null access to write-only
> +	# echo c 1:3 -w > ...
> +
> +4. Revoke access to /dev/sda
> +	# echo b 8:* -- > ...
> +
> +
> +		Written by Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> +
> diff --git a/fs/Makefile b/fs/Makefile
> index 7996220..5ad03be 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -64,6 +64,8 @@ obj-y				+= devpts/
> 
>  obj-$(CONFIG_PROFILING)		+= dcookies.o
>  obj-$(CONFIG_DLM)		+= dlm/
> +
> +obj-$(CONFIG_CGROUP_DEVS)	+= devscontrol.o
>   
>  # Do not add any filesystems before this line
>  obj-$(CONFIG_REISERFS_FS)	+= reiserfs/
> diff --git a/fs/devscontrol.c b/fs/devscontrol.c
> new file mode 100644
> index 0000000..48c5f69
> --- /dev/null
> +++ b/fs/devscontrol.c
> @@ -0,0 +1,314 @@
> +/*
> + * devscontrol.c - Device Controller
> + *
> + * Copyright 2007 OpenVZ SWsoft Inc
> + * Author: Pavel Emelyanov <xemul at openvz dot org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/cdev.h>
> +#include <linux/err.h>
> +#include <linux/devscontrol.h>
> +#include <linux/uaccess.h>
> +#include <linux/fs.h>
> +#include <linux/genhd.h>
> +
> +struct devs_cgroup {
> +	/*
> +	 * The subsys state to build into cgrous infrastructure
> +	 */
> +	struct cgroup_subsys_state css;
> +
> +	/*
> +	 * The maps of character and block devices. They provide a
> +	 * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c
> +	 * and block/genhd.c to find out how the ->open() callbacks
> +	 * work when opening a device.
> +	 *
> +	 * Each group will have its onw maps, and at the open()
> +	 * time code will lookup in this map to get the device
> +	 * and permissions by its dev_t.
> +	 */
> +	struct kobj_map *cdev_map;
> +	struct kobj_map *bdev_map;
> +};
> +
> +static inline
> +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css)
> +{
> +	return container_of(css, struct devs_cgroup, css);
> +}
> +
> +static inline
> +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont)
> +{
> +	return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id));
> +}
> +
> +struct kobj_map *task_cdev_map(struct task_struct *tsk)
> +{
> +	struct cgroup_subsys_state *css;
> +
> +	css = task_subsys_state(tsk, devs_subsys_id);
> +	if (css->cgroup->parent == NULL)
> +		return NULL;
> +	else
> +		return css_to_devs(css)->cdev_map;
> +}
> +
> +struct kobj_map *task_bdev_map(struct task_struct *tsk)
> +{
> +	struct cgroup_subsys_state *css;
> +
> +	css = task_subsys_state(tsk, devs_subsys_id);
> +	if (css->cgroup->parent == NULL)
> +		return NULL;
> +	else
> +		return css_to_devs(css)->bdev_map;
> +}
> +
> +static struct cgroup_subsys_state *
> +devs_create(struct cgroup_subsys *ss, struct cgroup *cont)
> +{
> +	struct devs_cgroup *devs;
> +
> +	devs = kzalloc(sizeof(struct devs_cgroup), GFP_KERNEL);
> +	if (devs == NULL)
> +		goto out;
> +
> +	devs->cdev_map = cdev_map_init();
> +	if (devs->cdev_map == NULL)
> +		goto out_free;
> +
> +	devs->bdev_map = bdev_map_init();
> +	if (devs->bdev_map == NULL)
> +		goto out_free_cdev;
> +
> +	return &devs->css;
> +
> +out_free_cdev:
> +	cdev_map_fini(devs->cdev_map);
> +out_free:
> +	kfree(devs);
> +out:
> +	return ERR_PTR(-ENOMEM);
> +}
> +
> +static void devs_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
> +{
> +	struct devs_cgroup *devs;
> +
> +	devs = cgroup_to_devs(cont);
> +	bdev_map_fini(devs->bdev_map);
> +	cdev_map_fini(devs->cdev_map);
> +	kfree(devs);
> +}
> +
> +/*
> + * The devices.permissions file read/write functionality
> + *
> + * The following two routines parse and print the strings like
> + * [cb] <major>:(<minor>|*) [r-][w-]
> + */
> +
> +static int decode_perms_str(char *buf, int *chrdev, dev_t *dev,
> +		int *all, mode_t *mode)
> +{
> +	unsigned int major, minor;
> +	char *end;
> +	mode_t tmp;
> +
> +	if (buf[0] == 'c')
> +		*chrdev = 1;
> +	else if (buf[0] == 'b')
> +		*chrdev = 0;
> +	else
> +		return -EINVAL;
> +
> +	if (buf[1] != ' ')
> +		return -EINVAL;
> +
> +	major = simple_strtoul(buf + 2, &end, 10);
> +	if (*end != ':')
> +		return -EINVAL;
> +
> +	if (end[1] == '*') {
> +		if (end[2] != ' ')
> +			return -EINVAL;
> +
> +		*all = 1;
> +		minor = 0;
> +		end += 2;
> +	} else {
> +		minor = simple_strtoul(end + 1, &end, 10);
> +		if (*end != ' ')
> +			return -EINVAL;
> +
> +		*all = 0;
> +	}
> +
> +	tmp = 0;
> +
> +	if (end[1] == 'r')
> +		tmp |= FMODE_READ;
> +	else if (end[1] != '-')
> +		return -EINVAL;
> +	if (end[2] == 'w')
> +		tmp |= FMODE_WRITE;
> +	else if (end[2] != '-')
> +		return -EINVAL;
> +
> +	*dev = MKDEV(major, minor);
> +	*mode = tmp;
> +	return 0;
> +}
> +
> +static int encode_perms_str(char *buf, int len, int chrdev, dev_t dev,
> +		int all, mode_t mode)
> +{
> +	int ret;
> +
> +	ret = snprintf(buf, len, "%c %d:", chrdev ? 'c' : 'b', MAJOR(dev));
> +	if (all)
> +		ret += snprintf(buf + ret, len - ret, "*");
> +	else
> +		ret += snprintf(buf + ret, len - ret, "%d", MINOR(dev));
> +
> +	ret += snprintf(buf + ret, len - ret, " %c%c\n",
> +			(mode & FMODE_READ) ? 'r' : '-',
> +			(mode & FMODE_WRITE) ? 'w' : '-');
> +
> +	return ret + 1;
> +}
> +
> +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft,
> +		struct file *f, const char __user *ubuf,
> +		size_t nbytes, loff_t *pos)
> +{
> +	int err, all, chrdev;
> +	dev_t dev;
> +	char buf[64];
> +	struct devs_cgroup *devs;
> +	mode_t mode;
> +
> +	if (copy_from_user(buf, ubuf, sizeof(buf)))
> +		return -EFAULT;
> +
> +	buf[sizeof(buf) - 1] = 0;
> +	err = decode_perms_str(buf, &chrdev, &dev, &all, &mode);
> +	if (err < 0)
> +		return err;
> +
> +	devs = cgroup_to_devs(cont);
> +
> +	/*
> +	 * No locking here is required - all that we need
> +	 * is provided inside the kobject mapping code
> +	 */
> +
> +	if (mode == 0) {
> +		if (chrdev)
> +			err = cdev_del_from_map(devs->cdev_map, dev, all);
> +		else
> +			err = bdev_del_from_map(devs->bdev_map, dev, all);
> +
> +		if (err < 0)
> +			return err;
> +
> +		css_put(&devs->css);
> +	} else {
> +		if (chrdev)
> +			err = cdev_add_to_map(devs->cdev_map, dev, all, mode);
> +		else
> +			err = bdev_add_to_map(devs->bdev_map, dev, all, mode);
> +
> +		if (err < 0)
> +			return err;
> +
> +		css_get(&devs->css);
> +	}
> +
> +	return nbytes;
> +}
> +
> +struct devs_dump_arg {
> +	char *buf;
> +	int pos;
> +	int chrdev;
> +};
> +
> +static int devs_dump_one(dev_t dev, int range, mode_t mode, void *x)
> +{
> +	struct devs_dump_arg *arg = x;
> +	char tmp[64];
> +	int len;
> +
> +	len = encode_perms_str(tmp, sizeof(tmp), arg->chrdev, dev,
> +			range != 1, mode);
> +
> +	if (arg->pos >= PAGE_SIZE - len)
> +		return 1;
> +
> +	memcpy(arg->buf + arg->pos, tmp, len);
> +	arg->pos += len;
> +	return 0;
> +}
> +
> +static ssize_t devs_read(struct cgroup *cont, struct cftype *cft,
> +		struct file *f, char __user *ubuf, size_t nbytes, loff_t *pos)
> +{
> +	struct devs_dump_arg arg;
> +	struct devs_cgroup *devs;
> +	ssize_t ret;
> +
> +	arg.buf = (char *)__get_free_page(GFP_KERNEL);
> +	if (arg.buf == NULL)
> +		return -ENOMEM;
> +
> +	devs = cgroup_to_devs(cont);
> +	arg.pos = 0;
> +
> +	arg.chrdev = 1;
> +	cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg);
> +
> +	arg.chrdev = 0;
> +	bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg);
> +
> +	ret = simple_read_from_buffer(ubuf, nbytes, pos,
> +			arg.buf, arg.pos);
> +
> +	free_page((unsigned long)arg.buf);
> +	return ret;
> +}
> +
> +static struct cftype devs_files[] = {
> +	{
> +		.name = "permissions",
> +		.write = devs_write,
> +		.read = devs_read,
> +	},
> +};
> +
> +static int devs_populate(struct cgroup_subsys *ss, struct cgroup *cont)
> +{
> +	return cgroup_add_files(cont, ss,
> +			devs_files, ARRAY_SIZE(devs_files));
> +}
> +
> +struct cgroup_subsys devs_subsys = {
> +	.name = "devices",
> +	.subsys_id = devs_subsys_id,
> +	.create = devs_create,
> +	.destroy = devs_destroy,
> +	.populate = devs_populate,
> +};
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 228235c..9c0cd2c 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -42,3 +42,9 @@ SUBSYS(mem_cgroup)
>  #endif
> 
>  /* */
> +
> +#ifdef CONFIG_CGROUP_DEVS
> +SUBSYS(devs)
> +#endif
> +
> +/* */
> diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
> new file mode 100644
> index 0000000..38057b9
> --- /dev/null
> +++ b/include/linux/devscontrol.h
> @@ -0,0 +1,26 @@
> +#ifndef __DEVS_CONTROL_H__
> +#define __DEVS_CONTROL_H__
> +struct kobj_map;
> +struct task_struct;
> +
> +/*
> + * task_[cb]dev_map - get a map from task. Both calls may return
> + * NULL, to indicate, that task doesn't belong to any group and
> + * that the global map is to be used.
> + */
> +
> +#ifdef CONFIG_CGROUP_DEVS
> +struct kobj_map *task_cdev_map(struct task_struct *);
> +struct kobj_map *task_bdev_map(struct task_struct *);
> +#else
> +static inline kobj_map *task_cdev_map(struct task_struct *tsk)
> +{
> +	return NULL;
> +}
> +
> +static inline kobj_map *task_bdev_map(struct task_struct *tsk)
> +{
> +	return NULL;
> +}
> +#endif
> +#endif
> diff --git a/init/Kconfig b/init/Kconfig
> index 732a1c2..f9a1b4f 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -283,6 +283,19 @@ config CGROUP_DEBUG
> 
>  	  Say N if unsure
> 
> +config CGROUP_DEVS
> +	bool "Devices cgroup subsystem"
> +	depends on CGROUPS
> +	help
> +	  Controlls the access rights to devices, i.e. you may hide
> +	  some of them from tasks, so that they will not be able
> +	  to open them, or you may grant a read-only access to some
> +	  of them.
> +
> +	  See Documentation/controllers/devices.txt for details.
> +
> +	  This is harmless to say N here, so do it if unsure.
> +
>  config CGROUP_NS
>          bool "Namespace cgroup subsystem"
>          depends on CGROUPS

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]     ` <47AB013B.8060502-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  2008-02-11 17:38       ` Serge E. Hallyn
@ 2008-02-12  7:42       ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
       [not found]         ` <20080212074217.GA15992-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2008-02-21 20:47       ` Paul Menage
  2 siblings, 1 reply; 42+ messages in thread
From: sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2008-02-12  7:42 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers, Paul Menage

This patchset does fix the problem I was having before with null and
zero devices. Overall, it looks like pretty good.

I am still reviewing the patches.  Just some nits I came across:


Pavel Emelianov [xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org] wrote:
| Each new group will have its own maps for char and block
| layers. The devices access list is tuned via the 
| devices.permissions file. One may read from the file to get 
| the configured state.
| 
| The top container isn't initialized, so that the char 
| and block layers will use the global maps to lookup 
| their devices. I did that not to export the static maps
| to the outer world.
| 
| Good news is that this patch now contains more comments 
| and Documentation file :)
| 
| Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
| 
| ---
| 
| diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt
| new file mode 100644
| index 0000000..dbd0c7a
| --- /dev/null
| +++ b/Documentation/controllers/devices.txt
| @@ -0,0 +1,61 @@
| +
| +	Devices visibility controller
| +
| +This controller allows to tune the devices accessibility by tasks,
| +i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
| +access to IDE devices and completely hide SCSI disks.
| +
| +Tasks still can call mknod to create device files, regardless of
| +whether the particular device is visible or accessible, but they
| +may not be able to open it later.
| +
| +This one hides under CONFIG_CGROUP_DEVS option.
| +
| +
| +Configuring
| +
| +The controller provides a single file to configure itself -- the
| +devices.permissions one. To change the accessibility level for some
| +device write the following string into it:
| +
| +[cb] <major>:(<minor>|*) [r-][w-]
| + ^          ^               ^
| + |          |               |
| + |          |               +--- access rights (1)
| + |          |
| + |          +-- device major and minor numbers (2)
| + |
| + +-- device type (character / block)
| +
| +1) The access rights set to '--' remove the device from the group's
| +access list, so that it will not even be shown in this file later.
| +
| +2) Setting the minor to '*' grants access to all the minors for
| +particular major.
| +
| +When reading from it, one may see something like
| +
| +	c 1:5 rw
| +	b 8:* r-
| +
| +Security issues, concerning who may grant access to what are governed
| +at the cgroup infrastructure level.
| +
| +
| +Examples:
| +
| +1. Grand full access to /dev/null

Grant.

| +	# echo c 1:3 rw > /cgroups/<id>/devices.permissions
| +
| +2. Grant the read-only access to /dev/sda and partitions
| +	# echo b 8:* r- > ...

This grants access to all scsi disks, sda..sdp and not just 'sda' right ?

| +
| +3. Change the /dev/null access to write-only
| +	# echo c 1:3 -w > ...
| +
| +4. Revoke access to /dev/sda
| +	# echo b 8:* -- > ...
| +
| +
| +		Written by Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
| +
| diff --git a/fs/Makefile b/fs/Makefile
| index 7996220..5ad03be 100644
| --- a/fs/Makefile
| +++ b/fs/Makefile
| @@ -64,6 +64,8 @@ obj-y				+= devpts/
| 
|  obj-$(CONFIG_PROFILING)		+= dcookies.o
|  obj-$(CONFIG_DLM)		+= dlm/
| +
| +obj-$(CONFIG_CGROUP_DEVS)	+= devscontrol.o
|   
|  # Do not add any filesystems before this line
|  obj-$(CONFIG_REISERFS_FS)	+= reiserfs/
| diff --git a/fs/devscontrol.c b/fs/devscontrol.c
| new file mode 100644
| index 0000000..48c5f69
| --- /dev/null
| +++ b/fs/devscontrol.c
| @@ -0,0 +1,314 @@
| +/*
| + * devscontrol.c - Device Controller
| + *
| + * Copyright 2007 OpenVZ SWsoft Inc
| + * Author: Pavel Emelyanov <xemul at openvz dot org>
| + *
| + * This program is free software; you can redistribute it and/or modify
| + * it under the terms of the GNU General Public License as published by
| + * the Free Software Foundation; either version 2 of the License, or
| + * (at your option) any later version.
| + *
| + * This program is distributed in the hope that it will be useful,
| + * but WITHOUT ANY WARRANTY; without even the implied warranty of
| + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
| + * GNU General Public License for more details.
| + */
| +
| +#include <linux/cgroup.h>
| +#include <linux/cdev.h>
| +#include <linux/err.h>
| +#include <linux/devscontrol.h>
| +#include <linux/uaccess.h>
| +#include <linux/fs.h>
| +#include <linux/genhd.h>
| +
| +struct devs_cgroup {
| +	/*
| +	 * The subsys state to build into cgrous infrastructure
| +	 */

... into cgroups

| +	struct cgroup_subsys_state css;
| +
| +	/*
| +	 * The maps of character and block devices. They provide a
| +	 * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c
| +	 * and block/genhd.c to find out how the ->open() callbacks
| +	 * work when opening a device.
| +	 *
| +	 * Each group will have its onw maps, and at the open()

own maps

| +	 * time code will lookup in this map to get the device
| +	 * and permissions by its dev_t.
| +	 */
| +	struct kobj_map *cdev_map;
| +	struct kobj_map *bdev_map;
| +};
| +
| +static inline
| +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css)
| +{
| +	return container_of(css, struct devs_cgroup, css);
| +}

'devs' as prefix/suffix does not look very clear.

How about css_to_devs_cg() ?  Similarly below for dev_cg_create(),
dev_cg_destroy() ?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]         ` <20080212074217.GA15992-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-02-12  7:51           ` Pavel Emelyanov
  0 siblings, 0 replies; 42+ messages in thread
From: Pavel Emelyanov @ 2008-02-12  7:51 UTC (permalink / raw)
  To: sukadev-r/Jw6+rmf7HQT0dZR+AlfA; +Cc: Linux Containers, Paul Menage

sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org wrote:
> This patchset does fix the problem I was having before with null and
> zero devices. Overall, it looks like pretty good.
> 
> I am still reviewing the patches.  Just some nits I came across:
> 
> 
> Pavel Emelianov [xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org] wrote:
> | Each new group will have its own maps for char and block
> | layers. The devices access list is tuned via the 
> | devices.permissions file. One may read from the file to get 
> | the configured state.
> | 
> | The top container isn't initialized, so that the char 
> | and block layers will use the global maps to lookup 
> | their devices. I did that not to export the static maps
> | to the outer world.
> | 
> | Good news is that this patch now contains more comments 
> | and Documentation file :)
> | 
> | Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> | 
> | ---
> | 
> | diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt
> | new file mode 100644
> | index 0000000..dbd0c7a
> | --- /dev/null
> | +++ b/Documentation/controllers/devices.txt
> | @@ -0,0 +1,61 @@
> | +
> | +	Devices visibility controller
> | +
> | +This controller allows to tune the devices accessibility by tasks,
> | +i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
> | +access to IDE devices and completely hide SCSI disks.
> | +
> | +Tasks still can call mknod to create device files, regardless of
> | +whether the particular device is visible or accessible, but they
> | +may not be able to open it later.
> | +
> | +This one hides under CONFIG_CGROUP_DEVS option.
> | +
> | +
> | +Configuring
> | +
> | +The controller provides a single file to configure itself -- the
> | +devices.permissions one. To change the accessibility level for some
> | +device write the following string into it:
> | +
> | +[cb] <major>:(<minor>|*) [r-][w-]
> | + ^          ^               ^
> | + |          |               |
> | + |          |               +--- access rights (1)
> | + |          |
> | + |          +-- device major and minor numbers (2)
> | + |
> | + +-- device type (character / block)
> | +
> | +1) The access rights set to '--' remove the device from the group's
> | +access list, so that it will not even be shown in this file later.
> | +
> | +2) Setting the minor to '*' grants access to all the minors for
> | +particular major.
> | +
> | +When reading from it, one may see something like
> | +
> | +	c 1:5 rw
> | +	b 8:* r-
> | +
> | +Security issues, concerning who may grant access to what are governed
> | +at the cgroup infrastructure level.
> | +
> | +
> | +Examples:
> | +
> | +1. Grand full access to /dev/null
> 
> Grant.

:)

> | +	# echo c 1:3 rw > /cgroups/<id>/devices.permissions
> | +
> | +2. Grant the read-only access to /dev/sda and partitions
> | +	# echo b 8:* r- > ...
> 
> This grants access to all scsi disks, sda..sdp and not just 'sda' right ?

Well, yes. I'll fix the comment like ;Grant the RO access to scsi disks.

> | +
> | +3. Change the /dev/null access to write-only
> | +	# echo c 1:3 -w > ...
> | +
> | +4. Revoke access to /dev/sda
> | +	# echo b 8:* -- > ...
> | +
> | +
> | +		Written by Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> | +
> | diff --git a/fs/Makefile b/fs/Makefile
> | index 7996220..5ad03be 100644
> | --- a/fs/Makefile
> | +++ b/fs/Makefile
> | @@ -64,6 +64,8 @@ obj-y				+= devpts/
> | 
> |  obj-$(CONFIG_PROFILING)		+= dcookies.o
> |  obj-$(CONFIG_DLM)		+= dlm/
> | +
> | +obj-$(CONFIG_CGROUP_DEVS)	+= devscontrol.o
> |   
> |  # Do not add any filesystems before this line
> |  obj-$(CONFIG_REISERFS_FS)	+= reiserfs/
> | diff --git a/fs/devscontrol.c b/fs/devscontrol.c
> | new file mode 100644
> | index 0000000..48c5f69
> | --- /dev/null
> | +++ b/fs/devscontrol.c
> | @@ -0,0 +1,314 @@
> | +/*
> | + * devscontrol.c - Device Controller
> | + *
> | + * Copyright 2007 OpenVZ SWsoft Inc
> | + * Author: Pavel Emelyanov <xemul at openvz dot org>
> | + *
> | + * This program is free software; you can redistribute it and/or modify
> | + * it under the terms of the GNU General Public License as published by
> | + * the Free Software Foundation; either version 2 of the License, or
> | + * (at your option) any later version.
> | + *
> | + * This program is distributed in the hope that it will be useful,
> | + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> | + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> | + * GNU General Public License for more details.
> | + */
> | +
> | +#include <linux/cgroup.h>
> | +#include <linux/cdev.h>
> | +#include <linux/err.h>
> | +#include <linux/devscontrol.h>
> | +#include <linux/uaccess.h>
> | +#include <linux/fs.h>
> | +#include <linux/genhd.h>
> | +
> | +struct devs_cgroup {
> | +	/*
> | +	 * The subsys state to build into cgrous infrastructure
> | +	 */
> 
> ... into cgroups
> 
> | +	struct cgroup_subsys_state css;
> | +
> | +	/*
> | +	 * The maps of character and block devices. They provide a
> | +	 * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c
> | +	 * and block/genhd.c to find out how the ->open() callbacks
> | +	 * work when opening a device.
> | +	 *
> | +	 * Each group will have its onw maps, and at the open()
> 
> own maps
> 
> | +	 * time code will lookup in this map to get the device
> | +	 * and permissions by its dev_t.
> | +	 */
> | +	struct kobj_map *cdev_map;
> | +	struct kobj_map *bdev_map;
> | +};
> | +
> | +static inline
> | +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css)
> | +{
> | +	return container_of(css, struct devs_cgroup, css);
> | +}
> 
> 'devs' as prefix/suffix does not look very clear.
> 
> How about css_to_devs_cg() ?  Similarly below for dev_cg_create(),
> dev_cg_destroy() ?

These names are internal to devscontrol.c, so I'd like to keep them
as short as possible.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]         ` <20080211173830.GA22160-6s5zFf/epYL1ENwx4SLHqw@public.gmane.org>
@ 2008-02-12 10:28           ` Pavel Emelyanov
       [not found]             ` <47B174B2.5010500-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Pavel Emelyanov @ 2008-02-12 10:28 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Paul Menage, Linux Containers

Serge E. Hallyn wrote:
> Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org):
>> Each new group will have its own maps for char and block
>> layers. The devices access list is tuned via the 
>> devices.permissions file. One may read from the file to get 
>> the configured state.
>>
>> The top container isn't initialized, so that the char 
>> and block layers will use the global maps to lookup 
>> their devices. I did that not to export the static maps
>> to the outer world.
>>
>> Good news is that this patch now contains more comments 
>> and Documentation file :)
> 
> Seems to work as advertised  :)  I can't reproduce Suka's null/zero
> bug.
> 
> You're relying fully on uid-0 to stop writes into the
> devices.permissions files.  Would you mind adding a check for
> CAP_SYS_ADMIN (or CAP_NS_OVERRIDE+CAP_MKNOD)?  Or were you really
> counting on using the filesystem visibility cgroup to stop a container
> from making changes to its device access whitelist?

Yup. I strongly believe that a controller itself should not bring 
any security policy of its own, but the infrastructure should 
take care of this.

However, I'm open to change my mind if I see good explanation of
why it is wrong.

> thanks,
> -serge
> 
>> Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
>>
>> ---
>>
>> diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt
>> new file mode 100644
>> index 0000000..dbd0c7a
>> --- /dev/null
>> +++ b/Documentation/controllers/devices.txt
>> @@ -0,0 +1,61 @@
>> +
>> +	Devices visibility controller
>> +
>> +This controller allows to tune the devices accessibility by tasks,
>> +i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
>> +access to IDE devices and completely hide SCSI disks.
>> +
>> +Tasks still can call mknod to create device files, regardless of
>> +whether the particular device is visible or accessible, but they
>> +may not be able to open it later.
>> +
>> +This one hides under CONFIG_CGROUP_DEVS option.
>> +
>> +
>> +Configuring
>> +
>> +The controller provides a single file to configure itself -- the
>> +devices.permissions one. To change the accessibility level for some
>> +device write the following string into it:
>> +
>> +[cb] <major>:(<minor>|*) [r-][w-]
>> + ^          ^               ^
>> + |          |               |
>> + |          |               +--- access rights (1)
>> + |          |
>> + |          +-- device major and minor numbers (2)
>> + |
>> + +-- device type (character / block)
>> +
>> +1) The access rights set to '--' remove the device from the group's
>> +access list, so that it will not even be shown in this file later.
>> +
>> +2) Setting the minor to '*' grants access to all the minors for
>> +particular major.
>> +
>> +When reading from it, one may see something like
>> +
>> +	c 1:5 rw
>> +	b 8:* r-
>> +
>> +Security issues, concerning who may grant access to what are governed
>> +at the cgroup infrastructure level.
>> +
>> +
>> +Examples:
>> +
>> +1. Grand full access to /dev/null
>> +	# echo c 1:3 rw > /cgroups/<id>/devices.permissions
>> +
>> +2. Grant the read-only access to /dev/sda and partitions
>> +	# echo b 8:* r- > ...
>> +
>> +3. Change the /dev/null access to write-only
>> +	# echo c 1:3 -w > ...
>> +
>> +4. Revoke access to /dev/sda
>> +	# echo b 8:* -- > ...
>> +
>> +
>> +		Written by Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
>> +
>> diff --git a/fs/Makefile b/fs/Makefile
>> index 7996220..5ad03be 100644
>> --- a/fs/Makefile
>> +++ b/fs/Makefile
>> @@ -64,6 +64,8 @@ obj-y				+= devpts/
>>
>>  obj-$(CONFIG_PROFILING)		+= dcookies.o
>>  obj-$(CONFIG_DLM)		+= dlm/
>> +
>> +obj-$(CONFIG_CGROUP_DEVS)	+= devscontrol.o
>>   
>>  # Do not add any filesystems before this line
>>  obj-$(CONFIG_REISERFS_FS)	+= reiserfs/
>> diff --git a/fs/devscontrol.c b/fs/devscontrol.c
>> new file mode 100644
>> index 0000000..48c5f69
>> --- /dev/null
>> +++ b/fs/devscontrol.c
>> @@ -0,0 +1,314 @@
>> +/*
>> + * devscontrol.c - Device Controller
>> + *
>> + * Copyright 2007 OpenVZ SWsoft Inc
>> + * Author: Pavel Emelyanov <xemul at openvz dot org>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + */
>> +
>> +#include <linux/cgroup.h>
>> +#include <linux/cdev.h>
>> +#include <linux/err.h>
>> +#include <linux/devscontrol.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/fs.h>
>> +#include <linux/genhd.h>
>> +
>> +struct devs_cgroup {
>> +	/*
>> +	 * The subsys state to build into cgrous infrastructure
>> +	 */
>> +	struct cgroup_subsys_state css;
>> +
>> +	/*
>> +	 * The maps of character and block devices. They provide a
>> +	 * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c
>> +	 * and block/genhd.c to find out how the ->open() callbacks
>> +	 * work when opening a device.
>> +	 *
>> +	 * Each group will have its onw maps, and at the open()
>> +	 * time code will lookup in this map to get the device
>> +	 * and permissions by its dev_t.
>> +	 */
>> +	struct kobj_map *cdev_map;
>> +	struct kobj_map *bdev_map;
>> +};
>> +
>> +static inline
>> +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css)
>> +{
>> +	return container_of(css, struct devs_cgroup, css);
>> +}
>> +
>> +static inline
>> +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont)
>> +{
>> +	return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id));
>> +}
>> +
>> +struct kobj_map *task_cdev_map(struct task_struct *tsk)
>> +{
>> +	struct cgroup_subsys_state *css;
>> +
>> +	css = task_subsys_state(tsk, devs_subsys_id);
>> +	if (css->cgroup->parent == NULL)
>> +		return NULL;
>> +	else
>> +		return css_to_devs(css)->cdev_map;
>> +}
>> +
>> +struct kobj_map *task_bdev_map(struct task_struct *tsk)
>> +{
>> +	struct cgroup_subsys_state *css;
>> +
>> +	css = task_subsys_state(tsk, devs_subsys_id);
>> +	if (css->cgroup->parent == NULL)
>> +		return NULL;
>> +	else
>> +		return css_to_devs(css)->bdev_map;
>> +}
>> +
>> +static struct cgroup_subsys_state *
>> +devs_create(struct cgroup_subsys *ss, struct cgroup *cont)
>> +{
>> +	struct devs_cgroup *devs;
>> +
>> +	devs = kzalloc(sizeof(struct devs_cgroup), GFP_KERNEL);
>> +	if (devs == NULL)
>> +		goto out;
>> +
>> +	devs->cdev_map = cdev_map_init();
>> +	if (devs->cdev_map == NULL)
>> +		goto out_free;
>> +
>> +	devs->bdev_map = bdev_map_init();
>> +	if (devs->bdev_map == NULL)
>> +		goto out_free_cdev;
>> +
>> +	return &devs->css;
>> +
>> +out_free_cdev:
>> +	cdev_map_fini(devs->cdev_map);
>> +out_free:
>> +	kfree(devs);
>> +out:
>> +	return ERR_PTR(-ENOMEM);
>> +}
>> +
>> +static void devs_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
>> +{
>> +	struct devs_cgroup *devs;
>> +
>> +	devs = cgroup_to_devs(cont);
>> +	bdev_map_fini(devs->bdev_map);
>> +	cdev_map_fini(devs->cdev_map);
>> +	kfree(devs);
>> +}
>> +
>> +/*
>> + * The devices.permissions file read/write functionality
>> + *
>> + * The following two routines parse and print the strings like
>> + * [cb] <major>:(<minor>|*) [r-][w-]
>> + */
>> +
>> +static int decode_perms_str(char *buf, int *chrdev, dev_t *dev,
>> +		int *all, mode_t *mode)
>> +{
>> +	unsigned int major, minor;
>> +	char *end;
>> +	mode_t tmp;
>> +
>> +	if (buf[0] == 'c')
>> +		*chrdev = 1;
>> +	else if (buf[0] == 'b')
>> +		*chrdev = 0;
>> +	else
>> +		return -EINVAL;
>> +
>> +	if (buf[1] != ' ')
>> +		return -EINVAL;
>> +
>> +	major = simple_strtoul(buf + 2, &end, 10);
>> +	if (*end != ':')
>> +		return -EINVAL;
>> +
>> +	if (end[1] == '*') {
>> +		if (end[2] != ' ')
>> +			return -EINVAL;
>> +
>> +		*all = 1;
>> +		minor = 0;
>> +		end += 2;
>> +	} else {
>> +		minor = simple_strtoul(end + 1, &end, 10);
>> +		if (*end != ' ')
>> +			return -EINVAL;
>> +
>> +		*all = 0;
>> +	}
>> +
>> +	tmp = 0;
>> +
>> +	if (end[1] == 'r')
>> +		tmp |= FMODE_READ;
>> +	else if (end[1] != '-')
>> +		return -EINVAL;
>> +	if (end[2] == 'w')
>> +		tmp |= FMODE_WRITE;
>> +	else if (end[2] != '-')
>> +		return -EINVAL;
>> +
>> +	*dev = MKDEV(major, minor);
>> +	*mode = tmp;
>> +	return 0;
>> +}
>> +
>> +static int encode_perms_str(char *buf, int len, int chrdev, dev_t dev,
>> +		int all, mode_t mode)
>> +{
>> +	int ret;
>> +
>> +	ret = snprintf(buf, len, "%c %d:", chrdev ? 'c' : 'b', MAJOR(dev));
>> +	if (all)
>> +		ret += snprintf(buf + ret, len - ret, "*");
>> +	else
>> +		ret += snprintf(buf + ret, len - ret, "%d", MINOR(dev));
>> +
>> +	ret += snprintf(buf + ret, len - ret, " %c%c\n",
>> +			(mode & FMODE_READ) ? 'r' : '-',
>> +			(mode & FMODE_WRITE) ? 'w' : '-');
>> +
>> +	return ret + 1;
>> +}
>> +
>> +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft,
>> +		struct file *f, const char __user *ubuf,
>> +		size_t nbytes, loff_t *pos)
>> +{
>> +	int err, all, chrdev;
>> +	dev_t dev;
>> +	char buf[64];
>> +	struct devs_cgroup *devs;
>> +	mode_t mode;
>> +
>> +	if (copy_from_user(buf, ubuf, sizeof(buf)))
>> +		return -EFAULT;
>> +
>> +	buf[sizeof(buf) - 1] = 0;
>> +	err = decode_perms_str(buf, &chrdev, &dev, &all, &mode);
>> +	if (err < 0)
>> +		return err;
>> +
>> +	devs = cgroup_to_devs(cont);
>> +
>> +	/*
>> +	 * No locking here is required - all that we need
>> +	 * is provided inside the kobject mapping code
>> +	 */
>> +
>> +	if (mode == 0) {
>> +		if (chrdev)
>> +			err = cdev_del_from_map(devs->cdev_map, dev, all);
>> +		else
>> +			err = bdev_del_from_map(devs->bdev_map, dev, all);
>> +
>> +		if (err < 0)
>> +			return err;
>> +
>> +		css_put(&devs->css);
>> +	} else {
>> +		if (chrdev)
>> +			err = cdev_add_to_map(devs->cdev_map, dev, all, mode);
>> +		else
>> +			err = bdev_add_to_map(devs->bdev_map, dev, all, mode);
>> +
>> +		if (err < 0)
>> +			return err;
>> +
>> +		css_get(&devs->css);
>> +	}
>> +
>> +	return nbytes;
>> +}
>> +
>> +struct devs_dump_arg {
>> +	char *buf;
>> +	int pos;
>> +	int chrdev;
>> +};
>> +
>> +static int devs_dump_one(dev_t dev, int range, mode_t mode, void *x)
>> +{
>> +	struct devs_dump_arg *arg = x;
>> +	char tmp[64];
>> +	int len;
>> +
>> +	len = encode_perms_str(tmp, sizeof(tmp), arg->chrdev, dev,
>> +			range != 1, mode);
>> +
>> +	if (arg->pos >= PAGE_SIZE - len)
>> +		return 1;
>> +
>> +	memcpy(arg->buf + arg->pos, tmp, len);
>> +	arg->pos += len;
>> +	return 0;
>> +}
>> +
>> +static ssize_t devs_read(struct cgroup *cont, struct cftype *cft,
>> +		struct file *f, char __user *ubuf, size_t nbytes, loff_t *pos)
>> +{
>> +	struct devs_dump_arg arg;
>> +	struct devs_cgroup *devs;
>> +	ssize_t ret;
>> +
>> +	arg.buf = (char *)__get_free_page(GFP_KERNEL);
>> +	if (arg.buf == NULL)
>> +		return -ENOMEM;
>> +
>> +	devs = cgroup_to_devs(cont);
>> +	arg.pos = 0;
>> +
>> +	arg.chrdev = 1;
>> +	cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg);
>> +
>> +	arg.chrdev = 0;
>> +	bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg);
>> +
>> +	ret = simple_read_from_buffer(ubuf, nbytes, pos,
>> +			arg.buf, arg.pos);
>> +
>> +	free_page((unsigned long)arg.buf);
>> +	return ret;
>> +}
>> +
>> +static struct cftype devs_files[] = {
>> +	{
>> +		.name = "permissions",
>> +		.write = devs_write,
>> +		.read = devs_read,
>> +	},
>> +};
>> +
>> +static int devs_populate(struct cgroup_subsys *ss, struct cgroup *cont)
>> +{
>> +	return cgroup_add_files(cont, ss,
>> +			devs_files, ARRAY_SIZE(devs_files));
>> +}
>> +
>> +struct cgroup_subsys devs_subsys = {
>> +	.name = "devices",
>> +	.subsys_id = devs_subsys_id,
>> +	.create = devs_create,
>> +	.destroy = devs_destroy,
>> +	.populate = devs_populate,
>> +};
>> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
>> index 228235c..9c0cd2c 100644
>> --- a/include/linux/cgroup_subsys.h
>> +++ b/include/linux/cgroup_subsys.h
>> @@ -42,3 +42,9 @@ SUBSYS(mem_cgroup)
>>  #endif
>>
>>  /* */
>> +
>> +#ifdef CONFIG_CGROUP_DEVS
>> +SUBSYS(devs)
>> +#endif
>> +
>> +/* */
>> diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
>> new file mode 100644
>> index 0000000..38057b9
>> --- /dev/null
>> +++ b/include/linux/devscontrol.h
>> @@ -0,0 +1,26 @@
>> +#ifndef __DEVS_CONTROL_H__
>> +#define __DEVS_CONTROL_H__
>> +struct kobj_map;
>> +struct task_struct;
>> +
>> +/*
>> + * task_[cb]dev_map - get a map from task. Both calls may return
>> + * NULL, to indicate, that task doesn't belong to any group and
>> + * that the global map is to be used.
>> + */
>> +
>> +#ifdef CONFIG_CGROUP_DEVS
>> +struct kobj_map *task_cdev_map(struct task_struct *);
>> +struct kobj_map *task_bdev_map(struct task_struct *);
>> +#else
>> +static inline kobj_map *task_cdev_map(struct task_struct *tsk)
>> +{
>> +	return NULL;
>> +}
>> +
>> +static inline kobj_map *task_bdev_map(struct task_struct *tsk)
>> +{
>> +	return NULL;
>> +}
>> +#endif
>> +#endif
>> diff --git a/init/Kconfig b/init/Kconfig
>> index 732a1c2..f9a1b4f 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -283,6 +283,19 @@ config CGROUP_DEBUG
>>
>>  	  Say N if unsure
>>
>> +config CGROUP_DEVS
>> +	bool "Devices cgroup subsystem"
>> +	depends on CGROUPS
>> +	help
>> +	  Controlls the access rights to devices, i.e. you may hide
>> +	  some of them from tasks, so that they will not be able
>> +	  to open them, or you may grant a read-only access to some
>> +	  of them.
>> +
>> +	  See Documentation/controllers/devices.txt for details.
>> +
>> +	  This is harmless to say N here, so do it if unsure.
>> +
>>  config CGROUP_NS
>>          bool "Namespace cgroup subsystem"
>>          depends on CGROUPS
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]             ` <47B174B2.5010500-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2008-02-12 17:21               ` Serge E. Hallyn
       [not found]                 ` <20080212172134.GA12177-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Serge E. Hallyn @ 2008-02-12 17:21 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Paul Menage, Linux Containers

Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org):
> Serge E. Hallyn wrote:
> > Quoting Pavel Emelyanov (xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org):
> >> Each new group will have its own maps for char and block
> >> layers. The devices access list is tuned via the 
> >> devices.permissions file. One may read from the file to get 
> >> the configured state.
> >>
> >> The top container isn't initialized, so that the char 
> >> and block layers will use the global maps to lookup 
> >> their devices. I did that not to export the static maps
> >> to the outer world.
> >>
> >> Good news is that this patch now contains more comments 
> >> and Documentation file :)
> > 
> > Seems to work as advertised  :)  I can't reproduce Suka's null/zero
> > bug.
> > 
> > You're relying fully on uid-0 to stop writes into the
> > devices.permissions files.  Would you mind adding a check for
> > CAP_SYS_ADMIN (or CAP_NS_OVERRIDE+CAP_MKNOD)?  Or were you really
> > counting on using the filesystem visibility cgroup to stop a container
> > from making changes to its device access whitelist?
> 
> Yup. I strongly believe that a controller itself should not bring 
> any security policy of its own, but the infrastructure should 
> take care of this.

That would be ok if the controller gave the infrastructure some way of
knowing what sort of thing the controller does.  I.e. I wouldn't mind
having the cgroup infrastructe check for capabilities, but I suspect
some cgroups will really be best represented by different capabilities.

Paul (actually both Menage and Jackson :) do you have an opinion on
this?  Are there sites which eg do 'chown -R some_user_id /cgroup/cpusets/'
to have some non-root user be able to dole out cpusets?  Is there any
way it would be ok to have cgroup_file_write() check for CAP_SYS_ADMIN?

> However, I'm open to change my mind if I see good explanation of
> why it is wrong.

Well the thing is that it currently leaves no way to keep root in
another namespace from manipulating it.  I don't think we can even use
SELinux unless we're willing to prevent containers from having write
access to everything under the cgroup - which is only ok depending on
what is composed with the devices cgroup.

We have the same kind of problem with the
/proc/sys/filesystems/<fs>/fs_safe flag.  There are a few possibilities,
and your fs visibility cgroup (plus splitting /proc/sys/filesystems into
its own fs) is one.  More complicated things pop into my head, but I
think until we get more comfortable with all this the simplest way is
the best.

But really imo the simplest way is to have a capable(CAP_SYS_ADMIN)
check inside your _write function :)

Ideally if you didn't mind I would float a patch (cousin to the userns
4-patch set) defining CAP_NS_OVERRIDE and requiring both CAP_NS_OVERRIDE
and CAP_SYS_ADMIN to access _write.

-serge

> > thanks,
> > -serge
> > 
> >> Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> >>
> >> ---
> >>
> >> diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt
> >> new file mode 100644
> >> index 0000000..dbd0c7a
> >> --- /dev/null
> >> +++ b/Documentation/controllers/devices.txt
> >> @@ -0,0 +1,61 @@
> >> +
> >> +	Devices visibility controller
> >> +
> >> +This controller allows to tune the devices accessibility by tasks,
> >> +i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
> >> +access to IDE devices and completely hide SCSI disks.
> >> +
> >> +Tasks still can call mknod to create device files, regardless of
> >> +whether the particular device is visible or accessible, but they
> >> +may not be able to open it later.
> >> +
> >> +This one hides under CONFIG_CGROUP_DEVS option.
> >> +
> >> +
> >> +Configuring
> >> +
> >> +The controller provides a single file to configure itself -- the
> >> +devices.permissions one. To change the accessibility level for some
> >> +device write the following string into it:
> >> +
> >> +[cb] <major>:(<minor>|*) [r-][w-]
> >> + ^          ^               ^
> >> + |          |               |
> >> + |          |               +--- access rights (1)
> >> + |          |
> >> + |          +-- device major and minor numbers (2)
> >> + |
> >> + +-- device type (character / block)
> >> +
> >> +1) The access rights set to '--' remove the device from the group's
> >> +access list, so that it will not even be shown in this file later.
> >> +
> >> +2) Setting the minor to '*' grants access to all the minors for
> >> +particular major.
> >> +
> >> +When reading from it, one may see something like
> >> +
> >> +	c 1:5 rw
> >> +	b 8:* r-
> >> +
> >> +Security issues, concerning who may grant access to what are governed
> >> +at the cgroup infrastructure level.
> >> +
> >> +
> >> +Examples:
> >> +
> >> +1. Grand full access to /dev/null
> >> +	# echo c 1:3 rw > /cgroups/<id>/devices.permissions
> >> +
> >> +2. Grant the read-only access to /dev/sda and partitions
> >> +	# echo b 8:* r- > ...
> >> +
> >> +3. Change the /dev/null access to write-only
> >> +	# echo c 1:3 -w > ...
> >> +
> >> +4. Revoke access to /dev/sda
> >> +	# echo b 8:* -- > ...
> >> +
> >> +
> >> +		Written by Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> >> +
> >> diff --git a/fs/Makefile b/fs/Makefile
> >> index 7996220..5ad03be 100644
> >> --- a/fs/Makefile
> >> +++ b/fs/Makefile
> >> @@ -64,6 +64,8 @@ obj-y				+= devpts/
> >>
> >>  obj-$(CONFIG_PROFILING)		+= dcookies.o
> >>  obj-$(CONFIG_DLM)		+= dlm/
> >> +
> >> +obj-$(CONFIG_CGROUP_DEVS)	+= devscontrol.o
> >>   
> >>  # Do not add any filesystems before this line
> >>  obj-$(CONFIG_REISERFS_FS)	+= reiserfs/
> >> diff --git a/fs/devscontrol.c b/fs/devscontrol.c
> >> new file mode 100644
> >> index 0000000..48c5f69
> >> --- /dev/null
> >> +++ b/fs/devscontrol.c
> >> @@ -0,0 +1,314 @@
> >> +/*
> >> + * devscontrol.c - Device Controller
> >> + *
> >> + * Copyright 2007 OpenVZ SWsoft Inc
> >> + * Author: Pavel Emelyanov <xemul at openvz dot org>
> >> + *
> >> + * This program is free software; you can redistribute it and/or modify
> >> + * it under the terms of the GNU General Public License as published by
> >> + * the Free Software Foundation; either version 2 of the License, or
> >> + * (at your option) any later version.
> >> + *
> >> + * This program is distributed in the hope that it will be useful,
> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >> + * GNU General Public License for more details.
> >> + */
> >> +
> >> +#include <linux/cgroup.h>
> >> +#include <linux/cdev.h>
> >> +#include <linux/err.h>
> >> +#include <linux/devscontrol.h>
> >> +#include <linux/uaccess.h>
> >> +#include <linux/fs.h>
> >> +#include <linux/genhd.h>
> >> +
> >> +struct devs_cgroup {
> >> +	/*
> >> +	 * The subsys state to build into cgrous infrastructure
> >> +	 */
> >> +	struct cgroup_subsys_state css;
> >> +
> >> +	/*
> >> +	 * The maps of character and block devices. They provide a
> >> +	 * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c
> >> +	 * and block/genhd.c to find out how the ->open() callbacks
> >> +	 * work when opening a device.
> >> +	 *
> >> +	 * Each group will have its onw maps, and at the open()
> >> +	 * time code will lookup in this map to get the device
> >> +	 * and permissions by its dev_t.
> >> +	 */
> >> +	struct kobj_map *cdev_map;
> >> +	struct kobj_map *bdev_map;
> >> +};
> >> +
> >> +static inline
> >> +struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css)
> >> +{
> >> +	return container_of(css, struct devs_cgroup, css);
> >> +}
> >> +
> >> +static inline
> >> +struct devs_cgroup *cgroup_to_devs(struct cgroup *cont)
> >> +{
> >> +	return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id));
> >> +}
> >> +
> >> +struct kobj_map *task_cdev_map(struct task_struct *tsk)
> >> +{
> >> +	struct cgroup_subsys_state *css;
> >> +
> >> +	css = task_subsys_state(tsk, devs_subsys_id);
> >> +	if (css->cgroup->parent == NULL)
> >> +		return NULL;
> >> +	else
> >> +		return css_to_devs(css)->cdev_map;
> >> +}
> >> +
> >> +struct kobj_map *task_bdev_map(struct task_struct *tsk)
> >> +{
> >> +	struct cgroup_subsys_state *css;
> >> +
> >> +	css = task_subsys_state(tsk, devs_subsys_id);
> >> +	if (css->cgroup->parent == NULL)
> >> +		return NULL;
> >> +	else
> >> +		return css_to_devs(css)->bdev_map;
> >> +}
> >> +
> >> +static struct cgroup_subsys_state *
> >> +devs_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >> +{
> >> +	struct devs_cgroup *devs;
> >> +
> >> +	devs = kzalloc(sizeof(struct devs_cgroup), GFP_KERNEL);
> >> +	if (devs == NULL)
> >> +		goto out;
> >> +
> >> +	devs->cdev_map = cdev_map_init();
> >> +	if (devs->cdev_map == NULL)
> >> +		goto out_free;
> >> +
> >> +	devs->bdev_map = bdev_map_init();
> >> +	if (devs->bdev_map == NULL)
> >> +		goto out_free_cdev;
> >> +
> >> +	return &devs->css;
> >> +
> >> +out_free_cdev:
> >> +	cdev_map_fini(devs->cdev_map);
> >> +out_free:
> >> +	kfree(devs);
> >> +out:
> >> +	return ERR_PTR(-ENOMEM);
> >> +}
> >> +
> >> +static void devs_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
> >> +{
> >> +	struct devs_cgroup *devs;
> >> +
> >> +	devs = cgroup_to_devs(cont);
> >> +	bdev_map_fini(devs->bdev_map);
> >> +	cdev_map_fini(devs->cdev_map);
> >> +	kfree(devs);
> >> +}
> >> +
> >> +/*
> >> + * The devices.permissions file read/write functionality
> >> + *
> >> + * The following two routines parse and print the strings like
> >> + * [cb] <major>:(<minor>|*) [r-][w-]
> >> + */
> >> +
> >> +static int decode_perms_str(char *buf, int *chrdev, dev_t *dev,
> >> +		int *all, mode_t *mode)
> >> +{
> >> +	unsigned int major, minor;
> >> +	char *end;
> >> +	mode_t tmp;
> >> +
> >> +	if (buf[0] == 'c')
> >> +		*chrdev = 1;
> >> +	else if (buf[0] == 'b')
> >> +		*chrdev = 0;
> >> +	else
> >> +		return -EINVAL;
> >> +
> >> +	if (buf[1] != ' ')
> >> +		return -EINVAL;
> >> +
> >> +	major = simple_strtoul(buf + 2, &end, 10);
> >> +	if (*end != ':')
> >> +		return -EINVAL;
> >> +
> >> +	if (end[1] == '*') {
> >> +		if (end[2] != ' ')
> >> +			return -EINVAL;
> >> +
> >> +		*all = 1;
> >> +		minor = 0;
> >> +		end += 2;
> >> +	} else {
> >> +		minor = simple_strtoul(end + 1, &end, 10);
> >> +		if (*end != ' ')
> >> +			return -EINVAL;
> >> +
> >> +		*all = 0;
> >> +	}
> >> +
> >> +	tmp = 0;
> >> +
> >> +	if (end[1] == 'r')
> >> +		tmp |= FMODE_READ;
> >> +	else if (end[1] != '-')
> >> +		return -EINVAL;
> >> +	if (end[2] == 'w')
> >> +		tmp |= FMODE_WRITE;
> >> +	else if (end[2] != '-')
> >> +		return -EINVAL;
> >> +
> >> +	*dev = MKDEV(major, minor);
> >> +	*mode = tmp;
> >> +	return 0;
> >> +}
> >> +
> >> +static int encode_perms_str(char *buf, int len, int chrdev, dev_t dev,
> >> +		int all, mode_t mode)
> >> +{
> >> +	int ret;
> >> +
> >> +	ret = snprintf(buf, len, "%c %d:", chrdev ? 'c' : 'b', MAJOR(dev));
> >> +	if (all)
> >> +		ret += snprintf(buf + ret, len - ret, "*");
> >> +	else
> >> +		ret += snprintf(buf + ret, len - ret, "%d", MINOR(dev));
> >> +
> >> +	ret += snprintf(buf + ret, len - ret, " %c%c\n",
> >> +			(mode & FMODE_READ) ? 'r' : '-',
> >> +			(mode & FMODE_WRITE) ? 'w' : '-');
> >> +
> >> +	return ret + 1;
> >> +}
> >> +
> >> +static ssize_t devs_write(struct cgroup *cont, struct cftype *cft,
> >> +		struct file *f, const char __user *ubuf,
> >> +		size_t nbytes, loff_t *pos)
> >> +{
> >> +	int err, all, chrdev;
> >> +	dev_t dev;
> >> +	char buf[64];
> >> +	struct devs_cgroup *devs;
> >> +	mode_t mode;
> >> +
> >> +	if (copy_from_user(buf, ubuf, sizeof(buf)))
> >> +		return -EFAULT;
> >> +
> >> +	buf[sizeof(buf) - 1] = 0;
> >> +	err = decode_perms_str(buf, &chrdev, &dev, &all, &mode);
> >> +	if (err < 0)
> >> +		return err;
> >> +
> >> +	devs = cgroup_to_devs(cont);
> >> +
> >> +	/*
> >> +	 * No locking here is required - all that we need
> >> +	 * is provided inside the kobject mapping code
> >> +	 */
> >> +
> >> +	if (mode == 0) {
> >> +		if (chrdev)
> >> +			err = cdev_del_from_map(devs->cdev_map, dev, all);
> >> +		else
> >> +			err = bdev_del_from_map(devs->bdev_map, dev, all);
> >> +
> >> +		if (err < 0)
> >> +			return err;
> >> +
> >> +		css_put(&devs->css);
> >> +	} else {
> >> +		if (chrdev)
> >> +			err = cdev_add_to_map(devs->cdev_map, dev, all, mode);
> >> +		else
> >> +			err = bdev_add_to_map(devs->bdev_map, dev, all, mode);
> >> +
> >> +		if (err < 0)
> >> +			return err;
> >> +
> >> +		css_get(&devs->css);
> >> +	}
> >> +
> >> +	return nbytes;
> >> +}
> >> +
> >> +struct devs_dump_arg {
> >> +	char *buf;
> >> +	int pos;
> >> +	int chrdev;
> >> +};
> >> +
> >> +static int devs_dump_one(dev_t dev, int range, mode_t mode, void *x)
> >> +{
> >> +	struct devs_dump_arg *arg = x;
> >> +	char tmp[64];
> >> +	int len;
> >> +
> >> +	len = encode_perms_str(tmp, sizeof(tmp), arg->chrdev, dev,
> >> +			range != 1, mode);
> >> +
> >> +	if (arg->pos >= PAGE_SIZE - len)
> >> +		return 1;
> >> +
> >> +	memcpy(arg->buf + arg->pos, tmp, len);
> >> +	arg->pos += len;
> >> +	return 0;
> >> +}
> >> +
> >> +static ssize_t devs_read(struct cgroup *cont, struct cftype *cft,
> >> +		struct file *f, char __user *ubuf, size_t nbytes, loff_t *pos)
> >> +{
> >> +	struct devs_dump_arg arg;
> >> +	struct devs_cgroup *devs;
> >> +	ssize_t ret;
> >> +
> >> +	arg.buf = (char *)__get_free_page(GFP_KERNEL);
> >> +	if (arg.buf == NULL)
> >> +		return -ENOMEM;
> >> +
> >> +	devs = cgroup_to_devs(cont);
> >> +	arg.pos = 0;
> >> +
> >> +	arg.chrdev = 1;
> >> +	cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg);
> >> +
> >> +	arg.chrdev = 0;
> >> +	bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg);
> >> +
> >> +	ret = simple_read_from_buffer(ubuf, nbytes, pos,
> >> +			arg.buf, arg.pos);
> >> +
> >> +	free_page((unsigned long)arg.buf);
> >> +	return ret;
> >> +}
> >> +
> >> +static struct cftype devs_files[] = {
> >> +	{
> >> +		.name = "permissions",
> >> +		.write = devs_write,
> >> +		.read = devs_read,
> >> +	},
> >> +};
> >> +
> >> +static int devs_populate(struct cgroup_subsys *ss, struct cgroup *cont)
> >> +{
> >> +	return cgroup_add_files(cont, ss,
> >> +			devs_files, ARRAY_SIZE(devs_files));
> >> +}
> >> +
> >> +struct cgroup_subsys devs_subsys = {
> >> +	.name = "devices",
> >> +	.subsys_id = devs_subsys_id,
> >> +	.create = devs_create,
> >> +	.destroy = devs_destroy,
> >> +	.populate = devs_populate,
> >> +};
> >> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> >> index 228235c..9c0cd2c 100644
> >> --- a/include/linux/cgroup_subsys.h
> >> +++ b/include/linux/cgroup_subsys.h
> >> @@ -42,3 +42,9 @@ SUBSYS(mem_cgroup)
> >>  #endif
> >>
> >>  /* */
> >> +
> >> +#ifdef CONFIG_CGROUP_DEVS
> >> +SUBSYS(devs)
> >> +#endif
> >> +
> >> +/* */
> >> diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
> >> new file mode 100644
> >> index 0000000..38057b9
> >> --- /dev/null
> >> +++ b/include/linux/devscontrol.h
> >> @@ -0,0 +1,26 @@
> >> +#ifndef __DEVS_CONTROL_H__
> >> +#define __DEVS_CONTROL_H__
> >> +struct kobj_map;
> >> +struct task_struct;
> >> +
> >> +/*
> >> + * task_[cb]dev_map - get a map from task. Both calls may return
> >> + * NULL, to indicate, that task doesn't belong to any group and
> >> + * that the global map is to be used.
> >> + */
> >> +
> >> +#ifdef CONFIG_CGROUP_DEVS
> >> +struct kobj_map *task_cdev_map(struct task_struct *);
> >> +struct kobj_map *task_bdev_map(struct task_struct *);
> >> +#else
> >> +static inline kobj_map *task_cdev_map(struct task_struct *tsk)
> >> +{
> >> +	return NULL;
> >> +}
> >> +
> >> +static inline kobj_map *task_bdev_map(struct task_struct *tsk)
> >> +{
> >> +	return NULL;
> >> +}
> >> +#endif
> >> +#endif
> >> diff --git a/init/Kconfig b/init/Kconfig
> >> index 732a1c2..f9a1b4f 100644
> >> --- a/init/Kconfig
> >> +++ b/init/Kconfig
> >> @@ -283,6 +283,19 @@ config CGROUP_DEBUG
> >>
> >>  	  Say N if unsure
> >>
> >> +config CGROUP_DEVS
> >> +	bool "Devices cgroup subsystem"
> >> +	depends on CGROUPS
> >> +	help
> >> +	  Controlls the access rights to devices, i.e. you may hide
> >> +	  some of them from tasks, so that they will not be able
> >> +	  to open them, or you may grant a read-only access to some
> >> +	  of them.
> >> +
> >> +	  See Documentation/controllers/devices.txt for details.
> >> +
> >> +	  This is harmless to say N here, so do it if unsure.
> >> +
> >>  config CGROUP_NS
> >>          bool "Namespace cgroup subsystem"
> >>          depends on CGROUPS
> > 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]                 ` <20080212172134.GA12177-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
@ 2008-02-13  2:17                   ` Paul Menage
       [not found]                     ` <6599ad830802121817n7713fa85h51aedf4df74aa764-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2008-02-13  2:32                   ` Paul Jackson
  1 sibling, 1 reply; 42+ messages in thread
From: Paul Menage @ 2008-02-13  2:17 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers, Pavel Emelyanov

On Feb 12, 2008 9:21 AM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:
>
> Paul (actually both Menage and Jackson :) do you have an opinion on
> this?  Are there sites which eg do 'chown -R some_user_id /cgroup/cpusets/'
> to have some non-root user be able to dole out cpusets?

We (Google) currently only do that for the tasks file, to allow users
to move their processes between cpusets/groups that they own.

> Is there any
> way it would be ok to have cgroup_file_write() check for CAP_SYS_ADMIN?

Sure, we could have flags in the subsys object for this kind of thing
that let particular subsystems request this.

Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]                 ` <20080212172134.GA12177-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
  2008-02-13  2:17                   ` Paul Menage
@ 2008-02-13  2:32                   ` Paul Jackson
       [not found]                     ` <20080212203215.fb636900.pj-sJ/iWh9BUns@public.gmane.org>
  1 sibling, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2008-02-13  2:32 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: containers-qjLDD68F18O7TbgM5vRIOg, menage-hpIqsD4AKlfQT0dZR+AlfA,
	xemul-GEFAQzZX7r8dnm+yROfE0A

Serge wrote:
> Paul (actually both Menage and Jackson :) do you have an opinion on
> this?  Are there sites which eg do 'chown -R some_user_id /cgroup/cpusets/'
> to have some non-root user be able to dole out cpusets?  Is there any
> way it would be ok to have cgroup_file_write() check for CAP_SYS_ADMIN?

I don't know what my users actually do here ... I'm a couple layers
removed from that reality.  But certainly I've recommended that they
sometimes do things like having the batch scheduler chown the files
of each jobs cpuset to the uid of the user running that job, so that
the job can manipulate its own cpuset allocate resources in finer
detail.

One of the more elaborate ways of doing this nests a pair of cpusets,
with the parent owned by the batch scheduler confining the child
owned by the individual job.  The job can actually do things like
write its own cpus and mems files, but is confined by the parent
cpuset to only specify cpus and mems assigned to that job.

As to how this affects your question ... I'm not sure.  However I
suspect that an added requirement for CAP_SYS_ADMIN would cause
breakage and not be a good idea.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj-sJ/iWh9BUns@public.gmane.org> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]                     ` <6599ad830802121817n7713fa85h51aedf4df74aa764-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-02-13  2:42                       ` Paul Jackson
       [not found]                         ` <20080212204215.2eca689f.pj-sJ/iWh9BUns@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2008-02-13  2:42 UTC (permalink / raw)
  To: Paul Menage
  Cc: containers-qjLDD68F18O7TbgM5vRIOg, xemul-GEFAQzZX7r8dnm+yROfE0A

Paul M wrote:
> > Is there any
> > way it would be ok to have cgroup_file_write() check for CAP_SYS_ADMIN?
> 
> Sure, we could have flags in the subsys object for this kind of thing
> that let particular subsystems request this.

That sounds like it would work fine ... the cpuset subsystem would
just not request this.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj-sJ/iWh9BUns@public.gmane.org> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]                         ` <20080212204215.2eca689f.pj-sJ/iWh9BUns@public.gmane.org>
@ 2008-02-14 17:17                           ` Serge E. Hallyn
  0 siblings, 0 replies; 42+ messages in thread
From: Serge E. Hallyn @ 2008-02-14 17:17 UTC (permalink / raw)
  To: Paul Jackson
  Cc: containers-qjLDD68F18O7TbgM5vRIOg, Paul Menage,
	xemul-GEFAQzZX7r8dnm+yROfE0A

Quoting Paul Jackson (pj-sJ/iWh9BUns@public.gmane.org):
> Paul M wrote:
> > > Is there any
> > > way it would be ok to have cgroup_file_write() check for CAP_SYS_ADMIN?
> > 
> > Sure, we could have flags in the subsys object for this kind of thing
> > that let particular subsystems request this.
> 
> That sounds like it would work fine ... the cpuset subsystem would
> just not request this.

Cool I'll keep it in mind then - though really right now it seems like
it just amounts to slightly more complicated approach than just having
the subsystems do the check in _write...

thanks,
-serge

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]                     ` <20080212203215.fb636900.pj-sJ/iWh9BUns@public.gmane.org>
@ 2008-02-14 17:18                       ` Serge E. Hallyn
  0 siblings, 0 replies; 42+ messages in thread
From: Serge E. Hallyn @ 2008-02-14 17:18 UTC (permalink / raw)
  To: Paul Jackson
  Cc: containers-qjLDD68F18O7TbgM5vRIOg, menage-hpIqsD4AKlfQT0dZR+AlfA,
	xemul-GEFAQzZX7r8dnm+yROfE0A

Quoting Paul Jackson (pj-sJ/iWh9BUns@public.gmane.org):
> Serge wrote:
> > Paul (actually both Menage and Jackson :) do you have an opinion on
> > this?  Are there sites which eg do 'chown -R some_user_id /cgroup/cpusets/'
> > to have some non-root user be able to dole out cpusets?  Is there any
> > way it would be ok to have cgroup_file_write() check for CAP_SYS_ADMIN?
> 
> I don't know what my users actually do here ... I'm a couple layers
> removed from that reality.  But certainly I've recommended that they
> sometimes do things like having the batch scheduler chown the files
> of each jobs cpuset to the uid of the user running that job, so that
> the job can manipulate its own cpuset allocate resources in finer
> detail.
> 
> One of the more elaborate ways of doing this nests a pair of cpusets,
> with the parent owned by the batch scheduler confining the child
> owned by the individual job.  The job can actually do things like
> write its own cpus and mems files, but is confined by the parent
> cpuset to only specify cpus and mems assigned to that job.
> 
> As to how this affects your question ... I'm not sure.  However I
> suspect that an added requirement for CAP_SYS_ADMIN would cause
> breakage and not be a good idea.

Yeah, I guess a more general mechanism to couple a user namespace's
connection to a mount is the right way to go.  If we can just specify
that root in this namespace is not root in that namespace (or any other
userid we've chowned the files to), we've got what we need.

thanks,
-serge

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]     ` <47AB013B.8060502-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  2008-02-11 17:38       ` Serge E. Hallyn
  2008-02-12  7:42       ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
@ 2008-02-21 20:47       ` Paul Menage
       [not found]         ` <6599ad830802211247t21fdc4e4hfe637fcffd98ded7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 1 reply; 42+ messages in thread
From: Paul Menage @ 2008-02-21 20:47 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers

On Thu, Feb 7, 2008 at 5:01 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
>  +
>  +[cb] <major>:(<minor>|*) [r-][w-]
>  + ^          ^               ^
>  + |          |               |
>  + |          |               +--- access rights (1)
>  + |          |
>  + |          +-- device major and minor numbers (2)
>  + |
>  + +-- device type (character / block)
> ...
>  +When reading from it, one may see something like
>  +
>  +       c 1:5 rw
>  +       b 8:* r-
>  +

In the interest of avoiding proliferating cgroup control file formats,
I'm wondering if we can abstract out the general form of the data
being presented here and maybe simplify it in such a way that we can
hopefully reuse the format for other control files in the future?

For example, one way to represent this would be a map from device
strings such c:1:5 to permission strings such as rw. Or maybe
numerical device ids to numerical permission values.

The alternative might be to accept that there are two kinds of control
files - those which are likely to be programmatically read (e.g.
resource usage values), and those that are likely to be
programmatically written but only actually read by humans for
debugging purposes (like this one) and make it clear up-front when a
control file is added which type they're considered to be. We could
then ignore the API consistency requirements for the
debugging-readable files.

On a separate note, can you document the recommended way to completely
overwrite the set of device permissions for a cgroup? Does this
involves writing a "--" permission for every device that you don't
want in the cgroup?

Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]         ` <6599ad830802211247t21fdc4e4hfe637fcffd98ded7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-02-22  8:12           ` Pavel Emelyanov
       [not found]             ` <47BE83FD.7060908-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Pavel Emelyanov @ 2008-02-22  8:12 UTC (permalink / raw)
  To: Paul Menage; +Cc: Linux Containers

Paul Menage wrote:
> On Thu, Feb 7, 2008 at 5:01 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
>>  +
>>  +[cb] <major>:(<minor>|*) [r-][w-]
>>  + ^          ^               ^
>>  + |          |               |
>>  + |          |               +--- access rights (1)
>>  + |          |
>>  + |          +-- device major and minor numbers (2)
>>  + |
>>  + +-- device type (character / block)
>> ...
>>  +When reading from it, one may see something like
>>  +
>>  +       c 1:5 rw
>>  +       b 8:* r-
>>  +
> 
> In the interest of avoiding proliferating cgroup control file formats,
> I'm wondering if we can abstract out the general form of the data
> being presented here and maybe simplify it in such a way that we can
> hopefully reuse the format for other control files in the future?
> 
> For example, one way to represent this would be a map from device
> strings such c:1:5 to permission strings such as rw. Or maybe
> numerical device ids to numerical permission values.

You mean smth like <some_device_id><space><some_permissions_string>?

Well, I don't mind, but AFAIK the <major>:<minor> form is very common
for specifying the device. So I agree with the 'c:1:5 rw' form.

> The alternative might be to accept that there are two kinds of control
> files - those which are likely to be programmatically read (e.g.
> resource usage values), and those that are likely to be
> programmatically written but only actually read by humans for
> debugging purposes (like this one) and make it clear up-front when a
> control file is added which type they're considered to be. We could
> then ignore the API consistency requirements for the
> debugging-readable files.

Hmm, you mean make them a binary files? I thought that filesystem-based
API should be human readable and writable as much as possible...

> On a separate note, can you document the recommended way to completely
> overwrite the set of device permissions for a cgroup? Does this

There's not way to flush all the permissions in this implementation, but 
I though about one. Maybe 'echo 0 > devices.permissions' would be good?

> involves writing a "--" permission for every device that you don't
> want in the cgroup?

Currently - yes.

> Paul
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]             ` <47BE83FD.7060908-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2008-02-23 23:12               ` Paul Menage
       [not found]                 ` <6599ad830802231512t20343cabq738df3039c8a1d1f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 42+ messages in thread
From: Paul Menage @ 2008-02-23 23:12 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers

On Fri, Feb 22, 2008 at 12:12 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
>
>  Hmm, you mean make them a binary files?

No, not by default. But I'm working on a plan to have an optional
binary API to cgroups, that would allow multiple control files to be
read in a binary format with a single system call. The existing API
would still be available as well, of course. The idea would be that
monitoring programs that frequently read lots of values from a single
cgroup (or even multiple cgroups) would be able to do so more cheaply
than by making multiple different reads on different files.

In order for this to work, CGroups needs to know the data type of a
given control file - so this would only be available for the control
files that use typed cgroup output methods rather than the raw file
output interface.

>  I thought that filesystem-based
>  API should be human readable and writable as much as possible...
>

Yes, but even without a binary API it makes sense for values that are
likely to be parsed by programs be in a consistent format.

But after thinking more about this, I think that the devices
permission control file output doesn't really fall under this category
- from a programmatic point of view, I suspect it's write-only, and
only humans will be reading the output, for debugging.

Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/4] The control group itself
       [not found]                 ` <6599ad830802231512t20343cabq738df3039c8a1d1f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-02-26  7:54                   ` Pavel Emelyanov
  0 siblings, 0 replies; 42+ messages in thread
From: Pavel Emelyanov @ 2008-02-26  7:54 UTC (permalink / raw)
  To: Paul Menage; +Cc: Linux Containers

Paul Menage wrote:
> On Fri, Feb 22, 2008 at 12:12 AM, Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
>>  Hmm, you mean make them a binary files?
> 
> No, not by default. But I'm working on a plan to have an optional
> binary API to cgroups, that would allow multiple control files to be
> read in a binary format with a single system call. The existing API
> would still be available as well, of course. The idea would be that
> monitoring programs that frequently read lots of values from a single
> cgroup (or even multiple cgroups) would be able to do so more cheaply
> than by making multiple different reads on different files.
> 
> In order for this to work, CGroups needs to know the data type of a
> given control file - so this would only be available for the control
> files that use typed cgroup output methods rather than the raw file
> output interface.

Sounds reasonable.

>>  I thought that filesystem-based
>>  API should be human readable and writable as much as possible...
>>
> 
> Yes, but even without a binary API it makes sense for values that are
> likely to be parsed by programs be in a consistent format.
> 
> But after thinking more about this, I think that the devices
> permission control file output doesn't really fall under this category
> - from a programmatic point of view, I suspect it's write-only, and
> only humans will be reading the output, for debugging.

Yup. So, if you're fine with the proposed API, I think I will prepare
this set and send it to Andrew this week.

> Paul
> 

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2008-02-26  7:54 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-08  9:02 [PATCH 0/4] Devices accessibility control group (v2) Pavel Emelyanov
     [not found] ` <47833C3A.8090106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-08  9:07   ` [PATCH 1/4] Some changes in the kobject mapper Pavel Emelyanov
     [not found]     ` <47833D43.3090703-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-08 18:36       ` Daniel Hokka Zakrisson
     [not found]         ` <4783C2B4.7000501-nym3zxDgnZcAvxtiuMwx3w@public.gmane.org>
2008-01-08 19:17           ` Dave Hansen
2008-01-08  9:12   ` [PATCH 2/4] The character devices layer changes Pavel Emelyanov
     [not found]     ` <47833E93.6010108-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-14 17:03       ` Serge E. Hallyn
     [not found]         ` <20080114170333.GA15077-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2008-01-15  8:05           ` Pavel Emelyanov
     [not found]             ` <478C6942.4050903-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-15 14:54               ` Serge E. Hallyn
2008-01-08  9:15   ` [PATCH 3/4] The block " Pavel Emelyanov
2008-01-08  9:18   ` [PATCH 4/4] The control group itself Pavel Emelyanov
     [not found]     ` <47833FF6.6060901-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-14 17:40       ` Serge E. Hallyn
     [not found]         ` <20080114174056.GB15077-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2008-01-15  7:53           ` Pavel Emelyanov
     [not found]             ` <478C6669.7070705-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-15 14:44               ` Serge E. Hallyn
     [not found]                 ` <20080115144440.GE4453-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2008-01-15 16:13                   ` Paul Menage
     [not found]                     ` <6599ad830801150813s6a5a7374qd25b6d6206d5896a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-01-15 17:49                       ` Serge E. Hallyn
     [not found]                         ` <20080115174941.GA11638-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2008-01-15 17:54                           ` Paul Menage
     [not found]                             ` <6599ad830801150954w7e1b6db0p4dd737730f407348-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-01-15 18:17                               ` Serge E. Hallyn
2008-01-14 21:54       ` Paul Menage
     [not found]         ` <6599ad830801141354p5b165cdao8d6184adb9ab61b6-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-01-15  7:58           ` Pavel Emelyanov
2008-01-12 21:20   ` [PATCH 0/4] Devices accessibility control group (v2) sukadev-r/Jw6+rmf7HQT0dZR+AlfA
     [not found]     ` <20080112212014.GA12085-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-01-14  7:52       ` Pavel Emelyanov
     [not found]         ` <478B14DB.4000106-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-14 17:42           ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
     [not found]             ` <20080114174220.GA17825-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-01-15  8:22               ` Pavel Emelyanov
     [not found]                 ` <478C6D2B.6020904-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-01-17  6:26                   ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
     [not found]                     ` <20080117062605.GA24475-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-01-21  8:31                       ` Pavel Emelyanov
2008-01-14 21:18   ` Paul Menage
     [not found]     ` <6599ad830801141318h121a6a80h9af68c52431c48b8-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-01-15  8:06       ` Pavel Emelyanov
  -- strict thread matches above, loose matches on Subject: below --
2008-02-07 12:56 [PATCH 0/4] Devices accessibility control group (v3, release candidate) Pavel Emelyanov
     [not found] ` <47AAFFF2.9030804-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-02-07 13:01   ` [PATCH 4/4] The control group itself Pavel Emelyanov
     [not found]     ` <47AB013B.8060502-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-02-11 17:38       ` Serge E. Hallyn
     [not found]         ` <20080211173830.GA22160-6s5zFf/epYL1ENwx4SLHqw@public.gmane.org>
2008-02-12 10:28           ` Pavel Emelyanov
     [not found]             ` <47B174B2.5010500-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-02-12 17:21               ` Serge E. Hallyn
     [not found]                 ` <20080212172134.GA12177-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2008-02-13  2:17                   ` Paul Menage
     [not found]                     ` <6599ad830802121817n7713fa85h51aedf4df74aa764-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-02-13  2:42                       ` Paul Jackson
     [not found]                         ` <20080212204215.2eca689f.pj-sJ/iWh9BUns@public.gmane.org>
2008-02-14 17:17                           ` Serge E. Hallyn
2008-02-13  2:32                   ` Paul Jackson
     [not found]                     ` <20080212203215.fb636900.pj-sJ/iWh9BUns@public.gmane.org>
2008-02-14 17:18                       ` Serge E. Hallyn
2008-02-12  7:42       ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
     [not found]         ` <20080212074217.GA15992-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-02-12  7:51           ` Pavel Emelyanov
2008-02-21 20:47       ` Paul Menage
     [not found]         ` <6599ad830802211247t21fdc4e4hfe637fcffd98ded7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-02-22  8:12           ` Pavel Emelyanov
     [not found]             ` <47BE83FD.7060908-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-02-23 23:12               ` Paul Menage
     [not found]                 ` <6599ad830802231512t20343cabq738df3039c8a1d1f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-02-26  7:54                   ` Pavel Emelyanov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.