* [RFC PATCH v2 0/3] i2c: show and change bus frequency via sysfs
From: Octavian Purdila @ 2014-10-14 14:48 UTC (permalink / raw)
To: wsa-z923LK4zBo2bacvFa/9K2g
Cc: linux-0h96xk9xTtrk1uMJSBkQmQ, johan-DgEjT+Ai2ygdnm+yROfE0A,
linux-i2c-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Octavian Purdila
This second version gets rid of the frequency table and introduces a
min and max range. It also changes the ABI to allow the driver to
round the given frequency to one that is supported, since many drivers
seems to work this way (i2c-stu300, i2c-s3c2410, i2c-kempld).
Octavian Purdila (3):
i2c: document the existing i2c sysfs ABI
i2c: document struct i2c_adapter
i2c: show and change bus frequency via sysfs
Documentation/ABI/testing/sysfs-bus-i2c | 75 ++++++++++++++++++++++++++
drivers/i2c/i2c-core.c | 94 +++++++++++++++++++++++++++++++++
include/linux/i2c.h | 32 +++++++++--
3 files changed, 198 insertions(+), 3 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-bus-i2c
--
1.9.1
^ permalink raw reply
* Re: [RFC PATCH 3/3] i2c: show and change bus frequency via sysfs
From: Octavian Purdila @ 2014-10-14 9:24 UTC (permalink / raw)
To: Mark Roszko
Cc: linux-i2c, linux-api-u79uwXL29TY76Z2rM5mHXA, lkml, Johan Hovold,
Wolfram Sang
In-Reply-To: <CAJjB1qJsmFebE2u_wMe2gdrQYcEwY1-yDLpufK97EXjEoR6omQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Tue, Oct 14, 2014 at 4:53 AM, Mark Roszko <mark.roszko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
> > If this limitations exists
> >they are not introduced by this patch. This patch just exposes the
> >frequency so that it can be read or changed in userspace.
>
> Ah, well right now you can have an i2c bus with driver 1 and 2. Say
> the i2c bus is configured for 60khz in kernel space which normally
> can't be changed. Driver 1 talks to a slave that cannot go above
> 100khz. Now the userspace interface is added to the i2c bus. Some
> userspace application decides to reconfigure the bus for 400khz and do
> its communication to some slave device. Now the kernel tries to do
> some background talking to the drive 1 slave and suddenly finds it can
> no longer communicate with it. Right now with the kernel space only
> configuration, the system is safe from being messed up easily. It's
> more of a sanity of configuration issue.
You need privileges to change the bus frequency, so this is a
configuration issue. Which you still have today, since can still set
the i2c frequency on some busses via a module parameter.
>
>
> >On a different not, I have noticed that a fixed set of frequencies
> >might not be the best API, since multiple drivers rather support a
> >rather large set of frequencies in a range. A better API might be to
> >expose a min-max range and let the bus driver adjust the requested
> >frequency. I will follow up with a second version that does that.
>
> I was actually thinking you could eliminate the table of supported
> frequencies and just have the bus driver handle the set frequency
> decision itself and just return an error code if it's invalid. There
> are legitiamate drivers that cannot do more than a list of frequencies
> already as well. One example is here:
> http://lxr.free-electrons.com/source/drivers/i2c/busses/i2c-bcm-kona.c#L717
Yes, I will do that for v2 and in addition I will also add min and max
attributes to make it easier to determine if a bus supports fast mode,
fast plus, high, etc.
^ permalink raw reply
* [PATCH net] net: filter: move common defines into bpf_common.h
From: Alexei Starovoitov @ 2014-10-14 9:08 UTC (permalink / raw)
To: David S. Miller
Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
Daniel Borkmann, Hannes Frederic Sowa, Chema Gonzalez,
Eric Dumazet, Peter Zijlstra, H. Peter Anvin, Andrew Morton,
Kees Cook, linux-api-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
userspace programs that use eBPF instruction macros need to include two files:
uapi/linux/filter.h and uapi/linux/bpf.h
Move common macro definitions that are shared between classic BPF and eBPF
into uapi/linux/bpf_common.h, so that user app can include only one bpf.h file
Cc: Daniel Borkmann <dborkman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
---
Daniel believes that this patch has to be done for this merge window.
I think it can wait till next, but it won't hurt now either.
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/bpf.h | 1 +
include/uapi/linux/bpf_common.h | 55 ++++++++++++++++++++++++++++++++++++++
include/uapi/linux/filter.h | 56 +--------------------------------------
4 files changed, 58 insertions(+), 55 deletions(-)
create mode 100644 include/uapi/linux/bpf_common.h
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 70e150e..1115d68 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -68,6 +68,7 @@ header-y += binfmts.h
header-y += blkpg.h
header-y += blktrace_api.h
header-y += bpf.h
+header-y += bpf_common.h
header-y += bpqether.h
header-y += bsg.h
header-y += btrfs.h
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 31b0ac2..d18316f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -8,6 +8,7 @@
#define _UAPI__LINUX_BPF_H__
#include <linux/types.h>
+#include <linux/bpf_common.h>
/* Extended instruction set based on top of classic BPF */
diff --git a/include/uapi/linux/bpf_common.h b/include/uapi/linux/bpf_common.h
new file mode 100644
index 0000000..a5c220e
--- /dev/null
+++ b/include/uapi/linux/bpf_common.h
@@ -0,0 +1,55 @@
+#ifndef _UAPI__LINUX_BPF_COMMON_H__
+#define _UAPI__LINUX_BPF_COMMON_H__
+
+/* Instruction classes */
+#define BPF_CLASS(code) ((code) & 0x07)
+#define BPF_LD 0x00
+#define BPF_LDX 0x01
+#define BPF_ST 0x02
+#define BPF_STX 0x03
+#define BPF_ALU 0x04
+#define BPF_JMP 0x05
+#define BPF_RET 0x06
+#define BPF_MISC 0x07
+
+/* ld/ldx fields */
+#define BPF_SIZE(code) ((code) & 0x18)
+#define BPF_W 0x00
+#define BPF_H 0x08
+#define BPF_B 0x10
+#define BPF_MODE(code) ((code) & 0xe0)
+#define BPF_IMM 0x00
+#define BPF_ABS 0x20
+#define BPF_IND 0x40
+#define BPF_MEM 0x60
+#define BPF_LEN 0x80
+#define BPF_MSH 0xa0
+
+/* alu/jmp fields */
+#define BPF_OP(code) ((code) & 0xf0)
+#define BPF_ADD 0x00
+#define BPF_SUB 0x10
+#define BPF_MUL 0x20
+#define BPF_DIV 0x30
+#define BPF_OR 0x40
+#define BPF_AND 0x50
+#define BPF_LSH 0x60
+#define BPF_RSH 0x70
+#define BPF_NEG 0x80
+#define BPF_MOD 0x90
+#define BPF_XOR 0xa0
+
+#define BPF_JA 0x00
+#define BPF_JEQ 0x10
+#define BPF_JGT 0x20
+#define BPF_JGE 0x30
+#define BPF_JSET 0x40
+#define BPF_SRC(code) ((code) & 0x08)
+#define BPF_K 0x00
+#define BPF_X 0x08
+
+#ifndef BPF_MAXINSNS
+#define BPF_MAXINSNS 4096
+#endif
+
+#endif /* _UAPI__LINUX_BPF_COMMON_H__ */
diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
index 253b4d4..47785d5 100644
--- a/include/uapi/linux/filter.h
+++ b/include/uapi/linux/filter.h
@@ -7,7 +7,7 @@
#include <linux/compiler.h>
#include <linux/types.h>
-
+#include <linux/bpf_common.h>
/*
* Current version of the filter code architecture.
@@ -32,56 +32,6 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
struct sock_filter __user *filter;
};
-/*
- * Instruction classes
- */
-
-#define BPF_CLASS(code) ((code) & 0x07)
-#define BPF_LD 0x00
-#define BPF_LDX 0x01
-#define BPF_ST 0x02
-#define BPF_STX 0x03
-#define BPF_ALU 0x04
-#define BPF_JMP 0x05
-#define BPF_RET 0x06
-#define BPF_MISC 0x07
-
-/* ld/ldx fields */
-#define BPF_SIZE(code) ((code) & 0x18)
-#define BPF_W 0x00
-#define BPF_H 0x08
-#define BPF_B 0x10
-#define BPF_MODE(code) ((code) & 0xe0)
-#define BPF_IMM 0x00
-#define BPF_ABS 0x20
-#define BPF_IND 0x40
-#define BPF_MEM 0x60
-#define BPF_LEN 0x80
-#define BPF_MSH 0xa0
-
-/* alu/jmp fields */
-#define BPF_OP(code) ((code) & 0xf0)
-#define BPF_ADD 0x00
-#define BPF_SUB 0x10
-#define BPF_MUL 0x20
-#define BPF_DIV 0x30
-#define BPF_OR 0x40
-#define BPF_AND 0x50
-#define BPF_LSH 0x60
-#define BPF_RSH 0x70
-#define BPF_NEG 0x80
-#define BPF_MOD 0x90
-#define BPF_XOR 0xa0
-
-#define BPF_JA 0x00
-#define BPF_JEQ 0x10
-#define BPF_JGT 0x20
-#define BPF_JGE 0x30
-#define BPF_JSET 0x40
-#define BPF_SRC(code) ((code) & 0x08)
-#define BPF_K 0x00
-#define BPF_X 0x08
-
/* ret - BPF_K and BPF_X also apply */
#define BPF_RVAL(code) ((code) & 0x18)
#define BPF_A 0x10
@@ -91,10 +41,6 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
#define BPF_TAX 0x00
#define BPF_TXA 0x80
-#ifndef BPF_MAXINSNS
-#define BPF_MAXINSNS 4096
-#endif
-
/*
* Macros for filter block array initializers.
*/
--
1.7.9.5
^ permalink raw reply related
* Re: [PATCH v9 net-next 2/4] net: filter: split filter.h and expose eBPF to user space
From: Alexei Starovoitov @ 2014-10-14 8:43 UTC (permalink / raw)
To: Daniel Borkmann
Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
Steven Rostedt, Hannes Frederic Sowa, Chema Gonzalez,
Eric Dumazet, Peter Zijlstra, H. Peter Anvin, Andrew Morton,
Kees Cook, Linux API, Network Development, LKML
In-Reply-To: <543CD206.709@redhat.com>
On Tue, Oct 14, 2014 at 12:34 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> On 10/13/2014 11:49 PM, Alexei Starovoitov wrote:
>>
>> On Mon, Oct 13, 2014 at 10:21 AM, Daniel Borkmann <dborkman@redhat.com>
>> wrote:
>>>
>>> On 09/03/2014 05:46 PM, Daniel Borkmann wrote:
>>> ...
>>>>
>>>>
>>>> Ok, given you post the remaining two RFCs later on this window as
>>>> you indicate, I have no objections:
>>>>
>>>> Acked-by: Daniel Borkmann <dborkman@redhat.com>
>>>
>>>
>>> Ping, Alexei, are you still sending the patch for bpf_common.h or
>>> do you want me to take care of this?
>>
>>
>> It's not forgotten.
>> I'm not sending it only because net-next is closed
>> and it seems to be -next material.
>
>
> Well, the point was since it's UAPI you're modifying, that it needs
> to be shipped before it first gets exposed to user land ...
>
> I think that should be reason enough ... there's no point in doing
> this at a later point in time.
Moving common #defines from filter.h into bpf_common.h can
be done at any point in time. For the sake of argument if
there is an app that includes both filter.h and bpf.h, it will
continue to work just fine.
Anyway, since you insist, I will send it now, though it definitely
can wait.
^ permalink raw reply
* Re: [PATCH v9 net-next 2/4] net: filter: split filter.h and expose eBPF to user space
From: Daniel Borkmann @ 2014-10-14 7:34 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
Steven Rostedt, Hannes Frederic Sowa, Chema Gonzalez,
Eric Dumazet, Peter Zijlstra, H. Peter Anvin, Andrew Morton,
Kees Cook, Linux API, Network Development, LKML
In-Reply-To: <CAMEtUuw-iAGiYqmcN0AOkku0+Wto7810JJo9HEQvQN-2F=baTA@mail.gmail.com>
On 10/13/2014 11:49 PM, Alexei Starovoitov wrote:
> On Mon, Oct 13, 2014 at 10:21 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
>> On 09/03/2014 05:46 PM, Daniel Borkmann wrote:
>> ...
>>>
>>> Ok, given you post the remaining two RFCs later on this window as
>>> you indicate, I have no objections:
>>>
>>> Acked-by: Daniel Borkmann <dborkman@redhat.com>
>>
>> Ping, Alexei, are you still sending the patch for bpf_common.h or
>> do you want me to take care of this?
>
> It's not forgotten.
> I'm not sending it only because net-next is closed
> and it seems to be -next material.
Well, the point was since it's UAPI you're modifying, that it needs
to be shipped before it first gets exposed to user land ...
I think that should be reason enough ... there's no point in doing
this at a later point in time.
^ permalink raw reply
* Re: [RFC PATCH 3/3] i2c: show and change bus frequency via sysfs
From: Mark Roszko @ 2014-10-14 1:53 UTC (permalink / raw)
To: Octavian Purdila; +Cc: linux-i2c, linux-api, lkml, Johan Hovold, Wolfram Sang
In-Reply-To: <CAE1zotL4Pb5etHzakCxeDLvZR112XGGaHomZ_dbh=AQVc0wvPg@mail.gmail.com>
> If this limitations exists
>they are not introduced by this patch. This patch just exposes the
>frequency so that it can be read or changed in userspace.
Ah, well right now you can have an i2c bus with driver 1 and 2. Say
the i2c bus is configured for 60khz in kernel space which normally
can't be changed. Driver 1 talks to a slave that cannot go above
100khz. Now the userspace interface is added to the i2c bus. Some
userspace application decides to reconfigure the bus for 400khz and do
its communication to some slave device. Now the kernel tries to do
some background talking to the drive 1 slave and suddenly finds it can
no longer communicate with it. Right now with the kernel space only
configuration, the system is safe from being messed up easily. It's
more of a sanity of configuration issue.
>On a different not, I have noticed that a fixed set of frequencies
>might not be the best API, since multiple drivers rather support a
>rather large set of frequencies in a range. A better API might be to
>expose a min-max range and let the bus driver adjust the requested
>frequency. I will follow up with a second version that does that.
I was actually thinking you could eliminate the table of supported
frequencies and just have the bus driver handle the set frequency
decision itself and just return an error code if it's invalid. There
are legitiamate drivers that cannot do more than a list of frequencies
already as well. One example is here:
http://lxr.free-electrons.com/source/drivers/i2c/busses/i2c-bcm-kona.c#L717
^ permalink raw reply
* Re: [PATCH v9 net-next 2/4] net: filter: split filter.h and expose eBPF to user space
From: Alexei Starovoitov @ 2014-10-13 21:49 UTC (permalink / raw)
To: Daniel Borkmann
Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
Steven Rostedt, Hannes Frederic Sowa, Chema Gonzalez,
Eric Dumazet, Peter Zijlstra, H. Peter Anvin, Andrew Morton,
Kees Cook, Linux API, Network Development, LKML
In-Reply-To: <543C0A2D.5040908@redhat.com>
On Mon, Oct 13, 2014 at 10:21 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> On 09/03/2014 05:46 PM, Daniel Borkmann wrote:
> ...
>>
>> Ok, given you post the remaining two RFCs later on this window as
>> you indicate, I have no objections:
>>
>> Acked-by: Daniel Borkmann <dborkman@redhat.com>
>
>
> Ping, Alexei, are you still sending the patch for bpf_common.h or
> do you want me to take care of this?
It's not forgotten.
I'm not sending it only because net-next is closed
and it seems to be -next material.
^ permalink raw reply
* [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
To: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel, linux-api,
mingo
Cc: containers, jnagal, Aditya Kali
In-Reply-To: <1413235430-22944-1-git-send-email-adityakali@google.com>
This patch enables cgroup mounting inside userns when a process
as appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs api is added to lookup the
dentry for the cgroupns-root.
Signed-off-by: Aditya Kali <adityakali@google.com>
---
fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/kernfs.h | 2 ++
kernel/cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 95 insertions(+), 2 deletions(-)
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..e334f45 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
}
+/**
+ * kernfs_make_root - create new root dentry for the given kernfs_node.
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can used used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn)
+{
+ struct dentry *dentry;
+ struct inode *inode;
+
+ BUG_ON(sb->s_op != &kernfs_sops);
+
+ /* inode for the given kernfs_node should already exist. */
+ inode = ilookup(sb, kn->ino);
+ if (!inode) {
+ pr_debug("kernfs: could not get inode for '");
+ pr_cont_kernfs_path(kn);
+ pr_cont("'.\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ /* instantiate and link root dentry */
+ dentry = d_obtain_root(inode);
+ if (!dentry) {
+ pr_debug("kernfs: could not get dentry for '");
+ pr_cont_kernfs_path(kn);
+ pr_cont("'.\n");
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* If this is a new dentry, set it up. We need kernfs_mutex because this
+ * may be called by callers other than kernfs_fill_super. */
+ mutex_lock(&kernfs_mutex);
+ if (!dentry->d_fsdata) {
+ kernfs_get(kn);
+ dentry->d_fsdata = kn;
+ } else {
+ WARN_ON(dentry->d_fsdata != kn);
+ }
+ mutex_unlock(&kernfs_mutex);
+
+ return dentry;
+}
+
static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
{
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn);
struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
unsigned int flags, void *priv);
void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2fc0dfa..ef27dc4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
memset(opts, 0, sizeof(*opts));
+ /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
+ * namespace.
+ */
+ if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
+ opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
+ }
+
while ((token = strsep(&o, ",")) != NULL) {
nr_opts++;
@@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
- if (nr_opts != 1) {
+ if (nr_opts > 1) {
pr_err("sane_behavior: no other mount options allowed\n");
return -EINVAL;
}
@@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
}
+struct dentry *cgroupns_get_root(struct super_block *sb,
+ struct cgroup_namespace *ns)
+{
+ struct dentry *nsdentry;
+
+ nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+ return nsdentry;
+}
+
static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
{
LIST_HEAD(tmp_links);
@@ -1684,6 +1700,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int ret;
int i;
bool new_sb;
+ struct cgroup_namespace *ns =
+ get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+ /* Check if the caller has permission to mount. */
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+ put_cgroup_ns(ns);
+ return ERR_PTR(-EPERM);
+ }
/*
* The first time anyone tries to mount a cgroup, enable the list
@@ -1816,11 +1840,28 @@ out_free:
kfree(opts.release_agent);
kfree(opts.name);
- if (ret)
+ if (ret) {
+ put_cgroup_ns(ns);
return ERR_PTR(ret);
+ }
dentry = kernfs_mount(fs_type, flags, root->kf_root,
CGROUP_SUPER_MAGIC, &new_sb);
+
+ if (!IS_ERR(dentry)) {
+ /* If this mount is for a non-init cgroup namespace, then
+ * Instead of root's dentry, we return the dentry specific to
+ * the cgroupns->root_cgrp.
+ */
+ if (ns != &init_cgroup_ns) {
+ struct dentry *nsdentry;
+
+ nsdentry = cgroupns_get_root(dentry->d_sb, ns);
+ dput(dentry);
+ dentry = nsdentry;
+ }
+ }
+
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);
@@ -1833,6 +1874,7 @@ out_free:
deactivate_super(pinned_sb);
}
+ put_cgroup_ns(ns);
return dentry;
}
@@ -1861,6 +1903,7 @@ static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};
static struct kobject *cgroup_kobj;
--
2.1.0.rc2.206.gedb03e5
^ permalink raw reply related
* [PATCHv1 7/8] cgroup: cgroup namespace setns support
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
jnagal-hpIqsD4AKlfQT0dZR+AlfA, Aditya Kali
In-Reply-To: <1413235430-22944-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
setns on a cgroup namespace is allowed only if
* task has CAP_SYS_ADMIN in its current user-namespace and
over the user-namespace associated with target cgroupns.
* task's current cgroup is descendent of the target cgroupns-root
cgroup.
* target cgroupns-root is same as or deeper than task's current
cgroupns-root. This is so that the task cannot escape out of its
cgroupns-root. This also ensures that setns() only makes the task
get restricted to a deeper cgroup hierarchy.
Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 42 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
index c16604f..c612946 100644
--- a/kernel/cgroup_namespace.c
+++ b/kernel/cgroup_namespace.c
@@ -80,8 +80,48 @@ err_out:
static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
{
- pr_info("setns not supported for cgroup namespace");
- return -EINVAL;
+ struct cgroup_namespace *cgroup_ns = ns;
+ struct task_struct *task = current;
+ struct cgroup *cgrp = NULL;
+ int err = 0;
+
+ if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+ !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* Prevent cgroup changes for this task. */
+ threadgroup_lock(task);
+
+ cgrp = get_task_cgroup(task);
+
+ err = -EINVAL;
+ if (!cgroup_on_dfl(cgrp))
+ goto out_unlock;
+
+ /* Allow switch only if the task's current cgroup is descendant of the
+ * target cgroup_ns->root_cgrp.
+ */
+ if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
+ goto out_unlock;
+
+ /* Only allow setns to a cgroupns root-ed deeper than task's current
+ * cgroupns-root. This will make sure that tasks cannot escape their
+ * cgroupns by attaching to parent cgroupns.
+ */
+ if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
+ task_cgroupns_root(task)))
+ goto out_unlock;
+
+ err = 0;
+ get_cgroup_ns(cgroup_ns);
+ put_cgroup_ns(nsproxy->cgroup_ns);
+ nsproxy->cgroup_ns = cgroup_ns;
+
+out_unlock:
+ threadgroup_unlock(current);
+ if (cgrp)
+ cgroup_put(cgrp);
+ return err;
}
static void *cgroupns_get(struct task_struct *task)
--
2.1.0.rc2.206.gedb03e5
^ permalink raw reply related
* [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
In-Reply-To: <1413235430-22944-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Restrict following operations within the calling tasks:
* cgroup_mkdir & cgroup_rmdir
* cgroup_attach_task
* writes to cgroup files outside of task's cgroupns-root
Also, read of /proc/<pid>/cgroup file is now restricted only
to tasks under same cgroupns-root. If a task tries to look
at cgroup of another task outside of its cgroupns-root, then
it won't be able to see anything for the default hierarchy.
This is same as if the cgroups are not mounted.
Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f8099b4..2fc0dfa 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
struct task_struct *task;
int ret;
+ /* Only allow changing cgroups accessible within task's cgroup
+ * namespace. i.e. 'dst_cgrp' should be a descendant of task's
+ * cgroupns->root_cgrp. */
+ if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+ return -EPERM;
+
/* look up all src csets */
down_read(&css_set_rwsem);
rcu_read_lock();
@@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
struct cgroup_subsys_state *css;
int ret;
+ /* Reject writes to cgroup files outside of task's cgroupns-root. */
+ if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
+ return -EINVAL;
+
if (cft->write)
return cft->write(of, buf, nbytes, off);
@@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
parent = cgroup_kn_lock_live(parent_kn);
if (!parent)
return -ENODEV;
+
+ /* Allow mkdir only within process's cgroup namespace root. */
+ if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
+ ret = -EPERM;
+ goto out_unlock;
+ }
+
root = parent->root;
/* allocate the cgroup and its ID, 0 is reserved for the root */
@@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
if (!cgrp)
return 0;
+ /* Allow rmdir only within process's cgroup namespace root.
+ * The process can't delete its own root anyways. */
+ if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
+ cgroup_kn_unlock(kn);
+ return -EPERM;
+ }
+
ret = cgroup_destroy_locked(cgrp);
cgroup_kn_unlock(kn);
@@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
continue;
+ cgrp = task_cgroup_from_root(tsk, root);
+
+ /* The cgroup path on default hierarchy is shown only if it
+ * falls under current task's cgroupns-root.
+ */
+ if (root == &cgrp_dfl_root &&
+ !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
+ continue;
+
seq_printf(m, "%d:", root->hierarchy_id);
for_each_subsys(ss, ssid)
if (root->subsys_mask & (1 << ssid))
@@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
seq_printf(m, "%sname=%s", count ? "," : "",
root->name);
seq_putc(m, ':');
- cgrp = task_cgroup_from_root(tsk, root);
path = cgroup_path(cgrp, buf, PATH_MAX);
if (!path) {
retval = -ENAMETOOLONG;
--
2.1.0.rc2.206.gedb03e5
^ permalink raw reply related
* [PATCHv1 5/8] cgroup: introduce cgroup namespaces
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
In-Reply-To: <1413235430-22944-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
of creation of the cgroup namespace. The task that creates the new
cgroup namespace and all its future children will now be restricted only
to the cgroup hierarchy under this root_cgrp.
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root.
This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
to create completely virtualized containers without leaking system
level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.
Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
fs/proc/namespaces.c | 3 +
include/linux/cgroup.h | 18 +++++-
include/linux/cgroup_namespace.h | 62 +++++++++++++++++++
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 4 ++
init/Kconfig | 9 +++
kernel/Makefile | 1 +
kernel/cgroup.c | 11 ++++
kernel/cgroup_namespace.c | 128 +++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 2 +-
kernel/nsproxy.c | 19 +++++-
11 files changed, 255 insertions(+), 4 deletions(-)
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..e04ed4b 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
#endif
&mntns_operations,
+#ifdef CONFIG_CGROUP_NS
+ &cgroupns_operations,
+#endif
};
static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
#include <linux/seq_file.h>
#include <linux/kernfs.h>
#include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
#ifdef CONFIG_CGROUPS
@@ -460,6 +462,13 @@ struct cftype {
#endif
};
+struct cgroup_namespace {
+ atomic_t count;
+ unsigned int proc_inum;
+ struct user_namespace *user_ns;
+ struct cgroup *root_cgrp;
+};
+
extern struct cgroup_root cgrp_dfl_root;
extern struct css_set init_css_set;
@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
}
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+ struct cgroup *cgrp, char *buf,
+ size_t buflen)
+{
+ return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
size_t buflen)
{
- return kernfs_path(cgrp->kn, buf, buflen);
+ return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
}
static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..9f637fe
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,62 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
+{
+ return tsk->nsproxy->cgroup_ns->root_cgrp;
+}
+
+#ifdef CONFIG_CGROUP_NS
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+ struct cgroup_namespace *ns)
+{
+ if (ns)
+ atomic_inc(&ns->count);
+ return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+ if (ns && atomic_dec_and_test(&ns->count))
+ free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns);
+
+#else /* CONFIG_CGROUP_NS */
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+ struct cgroup_namespace *ns)
+{
+ return &init_cgroup_ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+}
+
+static inline struct cgroup_namespace *copy_cgroup_ns(
+ unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns) {
+ if (flags & CLONE_NEWCGROUP)
+ return ERR_PTR(-EINVAL);
+
+ return old_ns;
+}
+
+#endif /* CONFIG_CGROUP_NS */
+
+#endif /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
struct uts_namespace;
struct ipc_namespace;
struct pid_namespace;
+struct cgroup_namespace;
struct fs_struct;
/*
@@ -33,6 +34,7 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
+ struct cgroup_namespace *cgroup_ns;
};
extern struct nsproxy init_nsproxy;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@
struct pid_namespace;
struct nsproxy;
+struct task_struct;
+struct inode;
struct proc_ns_operations {
const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
extern const struct proc_ns_operations pidns_operations;
extern const struct proc_ns_operations userns_operations;
extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;
/*
* We always define these enumerators
@@ -37,6 +40,7 @@ enum {
PROC_UTS_INIT_INO = 0xEFFFFFFEU,
PROC_USER_INIT_INO = 0xEFFFFFFDU,
PROC_PID_INIT_INO = 0xEFFFFFFCU,
+ PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
};
#ifdef CONFIG_PROC_FS
diff --git a/init/Kconfig b/init/Kconfig
index e84c642..c3be001 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1144,6 +1144,15 @@ config DEBUG_BLK_CGROUP
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.
+config CGROUP_NS
+ bool "CGroup Namespaces"
+ default n
+ help
+ This options enables CGroup Namespaces which can be used to isolate
+ cgroup paths. This feature is only useful when unified cgroup
+ hierarchy is in use (i.e. cgroups are mounted with sane_behavior
+ option).
+
endif # CGROUPS
config CHECKPOINT_RESTORE
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..75334f8 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2b3e9f9..f8099b4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
#include <linux/kthread.h>
#include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>
#include <linux/atomic.h>
@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
bool is_add);
+struct cgroup_namespace init_cgroup_ns = {
+ .count = {
+ .counter = 1,
+ },
+ .proc_inum = PROC_CGROUP_INIT_INO,
+ .user_ns = &init_user_ns,
+ .root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
/* IDR wrappers which synchronize using cgroup_idr_lock */
static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
gfp_t gfp_mask)
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..c16604f
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,128 @@
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+ struct cgroup_namespace *new_ns;
+
+ new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+ if (new_ns)
+ atomic_set(&new_ns->count, 1);
+ return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+ cgroup_put(ns->root_cgrp);
+ put_user_ns(ns->user_ns);
+ proc_free_inum(ns->proc_inum);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns)
+{
+ struct cgroup_namespace *new_ns = NULL;
+ struct cgroup *cgrp = NULL;
+ int err;
+
+ BUG_ON(!old_ns);
+
+ if (!(flags & CLONE_NEWCGROUP))
+ return get_cgroup_ns(old_ns);
+
+ /* Allow only sysadmin to create cgroup namespace. */
+ err = -EPERM;
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+ goto err_out;
+
+ /* Prevent cgroup changes for this task. */
+ threadgroup_lock(current);
+
+ cgrp = get_task_cgroup(current);
+
+ /* Creating new CGROUPNS is supported only when unified hierarchy is in
+ * use. */
+ err = -EINVAL;
+ if (!cgroup_on_dfl(cgrp))
+ goto err_out_unlock;
+
+ err = -ENOMEM;
+ new_ns = alloc_cgroup_ns();
+ if (!new_ns)
+ goto err_out_unlock;
+
+ err = proc_alloc_inum(&new_ns->proc_inum);
+ if (err)
+ goto err_out_unlock;
+
+ new_ns->user_ns = get_user_ns(user_ns);
+ new_ns->root_cgrp = cgrp;
+
+ threadgroup_unlock(current);
+
+ return new_ns;
+
+err_out_unlock:
+ threadgroup_unlock(current);
+err_out:
+ if (cgrp)
+ cgroup_put(cgrp);
+ kfree(new_ns);
+ return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+ pr_info("setns not supported for cgroup namespace");
+ return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+ struct cgroup_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ rcu_read_lock();
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = nsproxy->cgroup_ns;
+ get_cgroup_ns(ns);
+ }
+ rcu_read_unlock();
+
+ return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+ put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+ struct cgroup_namespace *cgroup_ns = ns;
+
+ return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+ .name = "cgroup",
+ .type = CLONE_NEWCGROUP,
+ .get = cgroupns_get,
+ .put = cgroupns_put,
+ .install = cgroupns_install,
+ .inum = cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+ return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 0cf9cdb..cc06851 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1790,7 +1790,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWUSER|CLONE_NEWPID))
+ CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
return -EINVAL;
/*
* Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
#include <linux/proc_ns.h>
#include <linux/file.h>
#include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>
static struct kmem_cache *nsproxy_cachep;
@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
#ifdef CONFIG_NET
.net_ns = &init_net,
#endif
+ .cgroup_ns = &init_cgroup_ns,
};
static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
goto out_pid;
}
+ new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+ tsk->nsproxy->cgroup_ns);
+ if (IS_ERR(new_nsp->cgroup_ns)) {
+ err = PTR_ERR(new_nsp->cgroup_ns);
+ goto out_cgroup;
+ }
+
new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
if (IS_ERR(new_nsp->net_ns)) {
err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
return new_nsp;
out_net:
+ if (new_nsp->cgroup_ns)
+ put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
if (new_nsp->pid_ns_for_children)
put_pid_ns(new_nsp->pid_ns_for_children);
out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
struct nsproxy *new_ns;
if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWPID | CLONE_NEWNET)))) {
+ CLONE_NEWPID | CLONE_NEWNET |
+ CLONE_NEWCGROUP)))) {
get_nsproxy(old_ns);
return 0;
}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
put_ipc_ns(ns->ipc_ns);
if (ns->pid_ns_for_children)
put_pid_ns(ns->pid_ns_for_children);
+ if (ns->cgroup_ns)
+ put_cgroup_ns(ns->cgroup_ns);
put_net(ns->net_ns);
kmem_cache_free(nsproxy_cachep, ns);
}
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
int err = 0;
if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWNET | CLONE_NEWPID)))
+ CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
return 0;
user_ns = new_cred ? new_cred->user_ns : current_user_ns();
--
2.1.0.rc2.206.gedb03e5
^ permalink raw reply related
* [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
In-Reply-To: <1413235430-22944-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
move cgroup_get() and cgroup_put() into cgroup.h so that
they can be called from other places.
Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
include/linux/cgroup.h | 22 ++++++++++++++++++++++
kernel/cgroup.c | 22 ----------------------
2 files changed, 22 insertions(+), 22 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 80ed6e0..4a0eb2d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
return cgrp->root == &cgrp_dfl_root;
}
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+ return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+ WARN_ON_ONCE(cgroup_is_dead(cgrp));
+ css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+ return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+ css_put(&cgrp->self);
+}
+
/* no synchronization, the result can only be used as a hint */
static inline bool cgroup_has_tasks(struct cgroup *cgrp)
{
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 56d507b..2b3e9f9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
return cgroup_css(cgrp, ss);
}
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
- return !(cgrp->self.flags & CSS_ONLINE);
-}
-
struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
{
struct cgroup *cgrp = of->kn->parent->priv;
@@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
return mode;
}
-static void cgroup_get(struct cgroup *cgrp)
-{
- WARN_ON_ONCE(cgroup_is_dead(cgrp));
- css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
- return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
- css_put(&cgrp->self);
-}
-
/**
* cgroup_refresh_child_subsys_mask - update child_subsys_mask
* @cgrp: the target cgroup
--
2.1.0.rc2.206.gedb03e5
^ permalink raw reply related
* [PATCHv1 3/8] cgroup: add function to get task's cgroup on default hierarchy
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
To: tj, lizefan, serge.hallyn, luto, cgroups, linux-kernel, linux-api,
mingo
Cc: containers, jnagal, Aditya Kali
In-Reply-To: <1413235430-22944-1-git-send-email-adityakali@google.com>
get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.
Signed-off-by: Aditya Kali <adityakali@google.com>
---
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 25 +++++++++++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1d51968..80ed6e0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
}
char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index cab7dc4..56d507b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1916,6 +1916,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
}
EXPORT_SYMBOL_GPL(task_cgroup_path);
+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+ struct cgroup *cgrp;
+
+ mutex_lock(&cgroup_mutex);
+ down_read(&css_set_rwsem);
+
+ cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+ cgroup_get(cgrp);
+
+ up_read(&css_set_rwsem);
+ mutex_unlock(&cgroup_mutex);
+ return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
/* used to track tasks and other necessary states during migration */
struct cgroup_taskset {
/* the src and dst cset list running through cset->mg_node */
--
2.1.0.rc2.206.gedb03e5
^ permalink raw reply related
* [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
In-Reply-To: <1413235430-22944-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CLONE_NEWCGROUP will be used to create new cgroup namespace.
Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
include/uapi/linux/sched.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
#define CLONE_DETACHED 0x00400000 /* Unused, ignored */
#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
- and is now available for re-use. */
+#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
#define CLONE_NEWUTS 0x04000000 /* New utsname group? */
#define CLONE_NEWIPC 0x08000000 /* New ipcs */
#define CLONE_NEWUSER 0x10000000 /* New user namespace */
--
2.1.0.rc2.206.gedb03e5
^ permalink raw reply related
* [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
jnagal-hpIqsD4AKlfQT0dZR+AlfA, Aditya Kali
In-Reply-To: <1413235430-22944-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
The new function kernfs_path_from_node() generates and returns
kernfs path of a given kernfs_node relative to a given parent
kernfs_node.
Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
fs/kernfs/dir.c | 53 ++++++++++++++++++++++++++++++++++++++++----------
include/linux/kernfs.h | 3 +++
2 files changed, 46 insertions(+), 10 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index a693f5b..8655485 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,14 +44,24 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
}
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
- size_t buflen)
+static char * __must_check kernfs_path_from_node_locked(
+ struct kernfs_node *kn_root,
+ struct kernfs_node *kn,
+ char *buf,
+ size_t buflen)
{
char *p = buf + buflen;
int len;
+ BUG_ON(!buflen);
+
*--p = '\0';
+ if (kn == kn_root) {
+ *--p = '/';
+ return p;
+ }
+
do {
len = strlen(kn->name);
if (p - buf < len + 1) {
@@ -63,6 +73,8 @@ static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
memcpy(p, kn->name, len);
*--p = '/';
kn = kn->parent;
+ if (kn == kn_root)
+ break;
} while (kn && kn->parent);
return p;
@@ -92,26 +104,47 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
}
/**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
* @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
* @buflen: size of @buf
*
- * Builds and returns the full path of @kn in @buf of @buflen bytes. The
- * path is built from the end of @buf so the returned pointer usually
+ * Builds and returns @kn's path relative to @kn_root. @kn_root is expected to
+ * be parent of @kn at some level. If this is not true or if @kn_root is NULL,
+ * then full path of @kn is returned.
+ * The path is built from the end of @buf so the returned pointer usually
* doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
* and %NULL is returned.
*/
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+ char *buf, size_t buflen)
{
unsigned long flags;
char *p;
spin_lock_irqsave(&kernfs_rename_lock, flags);
- p = kernfs_path_locked(kn, buf, buflen);
+ p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
spin_unlock_irqrestore(&kernfs_rename_lock, flags);
return p;
}
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes. The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+ return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
EXPORT_SYMBOL_GPL(kernfs_path);
/**
@@ -145,8 +178,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
spin_lock_irqsave(&kernfs_rename_lock, flags);
- p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
- sizeof(kernfs_pr_cont_buf));
+ p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+ sizeof(kernfs_pr_cont_buf));
if (p)
pr_cont("%s", p);
else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 30faf79..3c2be75 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
}
int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
+char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
+ struct kernfs_node *kn, char *buf,
+ size_t buflen);
char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
size_t buflen);
void pr_cont_kernfs_name(struct kernfs_node *kn);
--
2.1.0.rc2.206.gedb03e5
^ permalink raw reply related
* [PATCHv1 0/8] CGroup Namespaces
From: Aditya Kali @ 2014-10-13 21:23 UTC (permalink / raw)
To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
In-Reply-To: <adityakali-cgroupns>
Second take at the Cgroup Namespace patch-set.
Major changes form RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
your cgroupns-root.
More details in the writeup below.
Background
Cgroups and Namespaces are used together to create “virtual”
containers that isolates the host environment from the processes
running in container. But since cgroups themselves are not
“virtualized”, the task is always able to see global cgroups view
through cgroupfs mount and via /proc/self/cgroup file.
$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
This exposure of cgroup names to the processes running inside a
container results in some problems:
(1) The container names are typically host-container-management-agent
(systemd, docker/libcontainer, etc.) data and leaking its name (or
leaking the hierarchy) reveals too much information about the host
system.
(2) It makes the container migration across machines (CRIU) more
difficult as the container names need to be unique across the
machines in the migration domain.
(3) It makes it difficult to run container management tools (like
docker/libcontainer, lmctfy, etc.) within virtual containers
without adding dependency on some state/agent present outside the
container.
Note that the feature proposed here is completely different than the
“ns cgroup” feature which existed in the linux kernel until recently.
The ns cgroup also attempted to connect cgroups and namespaces by
creating a new cgroup every time a new namespace was created. It did
not solve any of the above mentioned problems and was later dropped
from the kernel. Incidentally though, it used the same config option
name CONFIG_CGROUP_NS as used in my prototype!
Introducing CGroup Namespaces
With unified cgroup hierarchy
(Documentation/cgroups/unified-hierarchy.txt), the containers can now
have a much more coherent cgroup view and its easy to associate a
container with a single cgroup. This also allows us to virtualize the
cgroup view for tasks inside the container.
The new CGroup Namespace allows a process to “unshare” its cgroup
hierarchy starting from the cgroup its currently in.
For Ex:
$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
$ ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
$ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
[ns]$ ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup ->
cgroup:[4026532183]
# From within new cgroupns, process sees that its in the root cgroup
[ns]$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
# From global cgroupns:
$ cat /proc/<pid>/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
# Unshare cgroupns along with userns and mountns
# Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
# sets up uid/gid map and exec’s /bin/bash
$ ~/unshare -c -u -m
# Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
# hierarchy.
[ns]$ mount -t cgroup cgroup /tmp/cgroup
[ns]$ ls -l /tmp/cgroup
total 0
-r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
-r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
-rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
-rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
filesystem root for the namespace specific cgroupfs mount.
The virtualization of /proc/self/cgroup file combined with restricting
the view of cgroup hierarchy by namespace-private cgroupfs mount
should provide a completely isolated cgroup view inside the container.
In its current form, the cgroup namespaces patcheset provides following
behavior:
(1) The “root” cgroup for a cgroup namespace is the cgroup in which
the process calling unshare is running.
For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
For the init_cgroup_ns, this is the real root (“/”) cgroup
(identified in code as cgrp_dfl_root.cgrp).
(2) The cgroupns-root cgroup does not change even if the namespace
creator process later moves to a different cgroup.
$ ~/unshare -c # unshare cgroupns in some cgroup
[ns]$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
[ns]$ mkdir sub_cgrp_1
[ns]$ echo 0 > sub_cgrp_1/cgroup.procs
[ns]$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
(3) Each process gets its CGROUPNS specific view of
/proc/<pid>/cgroup.
(a) Processes running inside the cgroup namespace will be able to see
cgroup paths (in /proc/self/cgroup) only inside their root cgroup
[ns]$ sleep 100000 & # From within unshared cgroupns
[1] 7353
[ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
[ns]$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
(b) From global cgroupns, the real cgroup path will be visible:
$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
(c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
path will be visible:
# ns2's cgroupns-root is at '/batchjobs/c_job_id2'
[ns2]$ cat /proc/7353/cgroup
[ns2]$
This is same as when cgroup hierarchy is not mounted at all.
(In correct container setup though, it should not be possible to
access PIDs in another container in the first place.)
(4) Processes inside a cgroupns are not allowed to move out of the
cgroupns-root. This is true even if a privileged process in global
cgroupns tries to move the process out of its cgroupns-root.
# From global cgroupns
$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
# cgroupns-root for 7353 is /batchjobs/c_job_id1
$ echo 7353 > batchjobs/c_job_id2/cgroup.procs
-bash: echo: write error: Operation not permitted
(5) Setns to another cgroup namespace is allowed only when:
(a) process has CAP_SYS_ADMIN in its current userns
(b) process has CAP_SYS_ADMIN in the target cgroupns' userns
(c) the process's current cgroup is a descendant cgroupns-root of the
target namespace.
(d) the target cgroupns-root is descendant of current cgroupns-root..
The last check (d) prevents processes from escaping their cgroupns-root by
attaching to parent cgroupns. Thus, setns is allowed only when the process
is trying to restrict itself to a deeper cgroup hierarchy.
(6) When some thread from a multi-threaded process unshares its
cgroup-namespace, the new cgroupns gets applied to the entire
process (all the threads). This should be OK since
unified-hierarchy only allows process-level containerization. So
all the threads in the process will have the same cgroup. And both
- changing cgroups and unsharing namespaces - are protected under
threadgroup_lock(task).
(7) The cgroup namespace is alive as long as there is atleast 1
process inside it. When the last process exits, the cgroup
namespace is destroyed. The cgroupns-root and the actual cgroups
remain though.
(8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
the unified cgroup hierarchy with cgroupns-root as the filesystem root.
The process needs CAP_SYS_ADMIN in its userns and mntns. This allows the
container management tools to be run inside the containers transparently.
Implementation
The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
branch). Its fairly non-intrusive and provides above mentioned
features.
Possible extensions of CGROUPNS:
(1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
capabilities to restrict cgroups to administrative users. CGroup
namespaces could be of help here. With cgroup namespaces, it might
be possible to delegate administration of sub-cgroups under a
cgroupns-root to the cgroupns owner.
---
fs/kernfs/dir.c | 53 +++++++++---
fs/kernfs/mount.c | 48 +++++++++++
fs/proc/namespaces.c | 3 +
include/linux/cgroup.h | 41 +++++++++-
include/linux/cgroup_namespace.h | 62 +++++++++++++++
include/linux/kernfs.h | 5 ++
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 4 +
include/uapi/linux/sched.h | 3 +-
init/Kconfig | 9 +++
kernel/Makefile | 1 +
kernel/cgroup.c | 139 ++++++++++++++++++++++++++------
kernel/cgroup_namespace.c | 168 +++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 2 +-
kernel/nsproxy.c | 19 ++++-
15 files changed, 518 insertions(+), 41 deletions(-)
create mode 100644 include/linux/cgroup_namespace.h
create mode 100644 kernel/cgroup_namespace.c
[PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
[PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
[PATCHv1 3/8] cgroup: add function to get task's cgroup on default
[PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
[PATCHv1 5/8] cgroup: introduce cgroup namespaces
[PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
[PATCHv1 7/8] cgroup: cgroup namespace setns support
[PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers
^ permalink raw reply
* Re: [PATCH v9 net-next 2/4] net: filter: split filter.h and expose eBPF to user space
From: Daniel Borkmann @ 2014-10-13 17:21 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
Steven Rostedt, Hannes Frederic Sowa, Chema Gonzalez,
Eric Dumazet, Peter Zijlstra, H. Peter Anvin, Andrew Morton,
Kees Cook, linux-api, netdev, linux-kernel
In-Reply-To: <540737DF.1010801@redhat.com>
On 09/03/2014 05:46 PM, Daniel Borkmann wrote:
...
> Ok, given you post the remaining two RFCs later on this window as
> you indicate, I have no objections:
>
> Acked-by: Daniel Borkmann <dborkman@redhat.com>
Ping, Alexei, are you still sending the patch for bpf_common.h or
do you want me to take care of this?
Cheers,
Daniel
^ permalink raw reply
* Re: [PATCH] driver core: amba: add device binding path 'driver_override'
From: Kim Phillips @ 2014-10-13 13:23 UTC (permalink / raw)
To: Antonios Motakis
Cc: Russell King, open list:ABI/API, open list, kvmarm, iommu,
alex.williamson, tech, christoffer.dall, eric.auger, marc.zyngier
In-Reply-To: <1413205672-6236-1-git-send-email-a.motakis@virtualopensystems.com>
On Mon, 13 Oct 2014 15:07:34 +0200
Antonios Motakis <a.motakis@virtualopensystems.com> wrote:
> As already demonstrated with PCI [1] and the platform bus [2], a
> driver_override property in sysfs can be used to bypass the id matching
> of a device to a AMBA driver. This can be used by VFIO to bind to any AMBA
> device requested by the user.
>
> [1] http://lists-archives.com/linux-kernel/28030441-pci-introduce-new-device-binding-path-using-pci_dev-driver_override.html
> [2] https://www.redhat.com/archives/libvir-list/2014-April/msg00382.html
>
> Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
> ---
Reviewed-by: Kim Phillips <kim.phillips@freescale.com>
Kim
^ permalink raw reply
* [PATCH v8 04/18] vfio: amba: VFIO support for AMBA devices
From: Antonios Motakis @ 2014-10-13 13:10 UTC (permalink / raw)
To: kvmarm-FPEHb7Xf0XXUo1n7N8X6UoWGPAHP3yOg,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
alex.williamson-H+wXaHxf7aLQT0dZR+AlfA
Cc: open list:VFIO DRIVER, eric.auger-QSEj5FYQhm4dnm+yROfE0A,
marc.zyngier-5wv7dgnIgG8, open list:ABI/API,
will.deacon-5wv7dgnIgG8, open list, Antonios Motakis,
tech-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J,
christoffer.dall-QSEj5FYQhm4dnm+yROfE0A
In-Reply-To: <1413205825-6370-1-git-send-email-a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
Add support for discovering AMBA devices with VFIO and handle them
similarly to Linux platform devices.
Signed-off-by: Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
---
drivers/vfio/platform/vfio_amba.c | 108 ++++++++++++++++++++++++++++++++++++++
include/uapi/linux/vfio.h | 1 +
2 files changed, 109 insertions(+)
create mode 100644 drivers/vfio/platform/vfio_amba.c
diff --git a/drivers/vfio/platform/vfio_amba.c b/drivers/vfio/platform/vfio_amba.c
new file mode 100644
index 0000000..b16a40d
--- /dev/null
+++ b/drivers/vfio/platform/vfio_amba.c
@@ -0,0 +1,108 @@
+/*
+ * Copyright (C) 2013 - Virtual Open Systems
+ * Author: Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/io.h>
+#include <linux/irq.h>
+#include <linux/amba/bus.h>
+
+#include "vfio_platform_private.h"
+
+#define DRIVER_VERSION "0.8"
+#define DRIVER_AUTHOR "Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>"
+#define DRIVER_DESC "VFIO for AMBA devices - User Level meta-driver"
+
+/* probing devices from the AMBA bus */
+
+static struct resource *get_amba_resource(struct vfio_platform_device *vdev,
+ int i)
+{
+ struct amba_device *adev = (struct amba_device *) vdev->opaque;
+
+ if (i == 0)
+ return &adev->res;
+
+ return NULL;
+}
+
+static int get_amba_irq(struct vfio_platform_device *vdev, int i)
+{
+ struct amba_device *adev = (struct amba_device *) vdev->opaque;
+
+ if (i < AMBA_NR_IRQS)
+ return adev->irq[i];
+
+ return 0;
+}
+
+static int vfio_amba_probe(struct amba_device *adev, const struct amba_id *id)
+{
+
+ struct vfio_platform_device *vdev;
+ int ret;
+
+ vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+ if (!vdev)
+ return -ENOMEM;
+
+ vdev->opaque = (void *) adev;
+ vdev->name = "vfio-amba-dev";
+ vdev->flags = VFIO_DEVICE_FLAGS_AMBA;
+ vdev->get_resource = get_amba_resource;
+ vdev->get_irq = get_amba_irq;
+
+ ret = vfio_platform_probe_common(vdev, &adev->dev);
+ if (ret)
+ kfree(vdev);
+
+ return ret;
+}
+
+static int vfio_amba_remove(struct amba_device *adev)
+{
+ return vfio_platform_remove_common(&adev->dev);
+}
+
+static struct amba_id pl330_ids[] = {
+ { 0, 0 },
+};
+
+MODULE_DEVICE_TABLE(amba, pl330_ids);
+
+static struct amba_driver vfio_amba_driver = {
+ .probe = vfio_amba_probe,
+ .remove = vfio_amba_remove,
+ .id_table = pl330_ids,
+ .drv = {
+ .name = "vfio-amba",
+ .owner = THIS_MODULE,
+ },
+};
+
+module_amba_driver(vfio_amba_driver);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index aca6d3e..1a5986c 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -159,6 +159,7 @@ struct vfio_device_info {
#define VFIO_DEVICE_FLAGS_RESET (1 << 0) /* Device supports reset */
#define VFIO_DEVICE_FLAGS_PCI (1 << 1) /* vfio-pci device */
#define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2) /* vfio-platform device */
+#define VFIO_DEVICE_FLAGS_AMBA (1 << 3) /* vfio-amba device */
__u32 num_regions; /* Max region index + 1 */
__u32 num_irqs; /* Max IRQ index + 1 */
};
--
2.1.1
^ permalink raw reply related
* [PATCH v8 02/18] vfio: platform: probe to devices on the platform bus
From: Antonios Motakis @ 2014-10-13 13:10 UTC (permalink / raw)
To: kvmarm-FPEHb7Xf0XXUo1n7N8X6UoWGPAHP3yOg,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
alex.williamson-H+wXaHxf7aLQT0dZR+AlfA
Cc: open list:VFIO DRIVER, eric.auger-QSEj5FYQhm4dnm+yROfE0A,
marc.zyngier-5wv7dgnIgG8, open list:ABI/API,
will.deacon-5wv7dgnIgG8, open list, Antonios Motakis,
tech-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J,
christoffer.dall-QSEj5FYQhm4dnm+yROfE0A
In-Reply-To: <1413205825-6370-1-git-send-email-a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
Driver to bind to Linux platform devices, and callbacks to discover their
resources to be used by the main VFIO PLATFORM code.
Signed-off-by: Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
---
drivers/vfio/platform/vfio_platform.c | 107 ++++++++++++++++++++++++++++++++++
include/uapi/linux/vfio.h | 1 +
2 files changed, 108 insertions(+)
create mode 100644 drivers/vfio/platform/vfio_platform.c
diff --git a/drivers/vfio/platform/vfio_platform.c b/drivers/vfio/platform/vfio_platform.c
new file mode 100644
index 0000000..baeb7da
--- /dev/null
+++ b/drivers/vfio/platform/vfio_platform.c
@@ -0,0 +1,107 @@
+/*
+ * Copyright (C) 2013 - Virtual Open Systems
+ * Author: Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/io.h>
+#include <linux/platform_device.h>
+#include <linux/irq.h>
+
+#include "vfio_platform_private.h"
+
+#define DRIVER_VERSION "0.8"
+#define DRIVER_AUTHOR "Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>"
+#define DRIVER_DESC "VFIO for platform devices - User Level meta-driver"
+
+/* probing devices from the linux platform bus */
+
+static struct resource *get_platform_resource(struct vfio_platform_device *vdev,
+ int num)
+{
+ struct platform_device *dev = (struct platform_device *) vdev->opaque;
+ int i;
+
+ for (i = 0; i < dev->num_resources; i++) {
+ struct resource *r = &dev->resource[i];
+
+ if (resource_type(r) & (IORESOURCE_MEM|IORESOURCE_IO)) {
+ num--;
+
+ if (!num)
+ return r;
+ }
+ }
+ return NULL;
+}
+
+static int get_platform_irq(struct vfio_platform_device *vdev, int i)
+{
+ struct platform_device *pdev = (struct platform_device *) vdev->opaque;
+
+ return platform_get_irq(pdev, i);
+}
+
+
+static int vfio_platform_probe(struct platform_device *pdev)
+{
+ struct vfio_platform_device *vdev;
+ int ret;
+
+ vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+ if (!vdev)
+ return -ENOMEM;
+
+ vdev->opaque = (void *) pdev;
+ vdev->name = pdev->name;
+ vdev->flags = VFIO_DEVICE_FLAGS_PLATFORM;
+ vdev->get_resource = get_platform_resource;
+ vdev->get_irq = get_platform_irq;
+
+ ret = vfio_platform_probe_common(vdev, &pdev->dev);
+ if (ret)
+ kfree(vdev);
+
+ return ret;
+}
+
+static int vfio_platform_remove(struct platform_device *pdev)
+{
+ return vfio_platform_remove_common(&pdev->dev);
+}
+
+static struct platform_driver vfio_platform_driver = {
+ .probe = vfio_platform_probe,
+ .remove = vfio_platform_remove,
+ .driver = {
+ .name = "vfio-platform",
+ .owner = THIS_MODULE,
+ },
+};
+
+module_platform_driver(vfio_platform_driver);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 111b5e8..aca6d3e 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -158,6 +158,7 @@ struct vfio_device_info {
__u32 flags;
#define VFIO_DEVICE_FLAGS_RESET (1 << 0) /* Device supports reset */
#define VFIO_DEVICE_FLAGS_PCI (1 << 1) /* vfio-pci device */
+#define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2) /* vfio-platform device */
__u32 num_regions; /* Max region index + 1 */
__u32 num_irqs; /* Max IRQ index + 1 */
};
--
2.1.1
^ permalink raw reply related
* [PATCH 2/5] vfio: introduce the VFIO_DMA_MAP_FLAG_NOEXEC flag
From: Antonios Motakis @ 2014-10-13 13:09 UTC (permalink / raw)
To: kvmarm-FPEHb7Xf0XXUo1n7N8X6UoWGPAHP3yOg,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
alex.williamson-H+wXaHxf7aLQT0dZR+AlfA
Cc: open list:VFIO DRIVER, eric.auger-QSEj5FYQhm4dnm+yROfE0A,
marc.zyngier-5wv7dgnIgG8, open list:ABI/API,
will.deacon-5wv7dgnIgG8, open list, Antonios Motakis,
tech-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J,
christoffer.dall-QSEj5FYQhm4dnm+yROfE0A
In-Reply-To: <1413205748-6300-1-git-send-email-a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
We introduce the VFIO_DMA_MAP_FLAG_NOEXEC flag to the VFIO dma map call,
and expose its availability via the capability VFIO_DMA_NOEXEC_IOMMU.
This way the user can control whether the XN flag will be set on the
requested mappings. The IOMMU_NOEXEC flag needs to be available for all
the IOMMUs of the container used.
Signed-off-by: Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
---
include/uapi/linux/vfio.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 6612974..111b5e8 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -29,6 +29,7 @@
* capability is subject to change as groups are added or removed.
*/
#define VFIO_DMA_CC_IOMMU 4
+#define VFIO_DMA_NOEXEC_IOMMU 5
/* Check if EEH is supported */
#define VFIO_EEH 5
@@ -401,6 +402,7 @@ struct vfio_iommu_type1_dma_map {
__u32 flags;
#define VFIO_DMA_MAP_FLAG_READ (1 << 0) /* readable from device */
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1) /* writable from device */
+#define VFIO_DMA_MAP_FLAG_NOEXEC (1 << 2) /* not executable from device */
__u64 vaddr; /* Process virtual address */
__u64 iova; /* IO virtual address */
__u64 size; /* Size of mapping (bytes) */
--
2.1.1
^ permalink raw reply related
* [PATCH] driver core: amba: add device binding path 'driver_override'
From: Antonios Motakis @ 2014-10-13 13:07 UTC (permalink / raw)
To: Russell King, Antonios Motakis, open list:ABI/API, open list
Cc: eric.auger-QSEj5FYQhm4dnm+yROfE0A, marc.zyngier-5wv7dgnIgG8,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
tech-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J,
kvmarm-FPEHb7Xf0XXUo1n7N8X6UoWGPAHP3yOg,
christoffer.dall-QSEj5FYQhm4dnm+yROfE0A
As already demonstrated with PCI [1] and the platform bus [2], a
driver_override property in sysfs can be used to bypass the id matching
of a device to a AMBA driver. This can be used by VFIO to bind to any AMBA
device requested by the user.
[1] http://lists-archives.com/linux-kernel/28030441-pci-introduce-new-device-binding-path-using-pci_dev-driver_override.html
[2] https://www.redhat.com/archives/libvir-list/2014-April/msg00382.html
Signed-off-by: Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
---
Documentation/ABI/testing/sysfs-bus-amba | 20 ++++++++++++++
drivers/amba/bus.c | 47 ++++++++++++++++++++++++++++++++
include/linux/amba/bus.h | 1 +
3 files changed, 68 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-bus-amba
diff --git a/Documentation/ABI/testing/sysfs-bus-amba b/Documentation/ABI/testing/sysfs-bus-amba
new file mode 100644
index 0000000..e7b5467
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-amba
@@ -0,0 +1,20 @@
+What: /sys/bus/amba/devices/.../driver_override
+Date: September 2014
+Contact: Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
+Description:
+ This file allows the driver for a device to be specified which
+ will override standard OF, ACPI, ID table, and name matching.
+ When specified, only a driver with a name matching the value
+ written to driver_override will have an opportunity to bind to
+ the device. The override is specified by writing a string to the
+ driver_override file (echo vfio-amba > driver_override) and may
+ be cleared with an empty string (echo > driver_override).
+ This returns the device to standard matching rules binding.
+ Writing to driver_override does not automatically unbind the
+ device from its current driver or make any attempt to
+ automatically load the specified driver. If no driver with a
+ matching name is currently loaded in the kernel, the device will
+ not bind to any driver. This also allows devices to opt-out of
+ driver binding using a driver_override name such as "none".
+ Only a single driver may be specified in the override, there is
+ no support for parsing delimiters.
diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
index 3cf61a1..c7aa448 100644
--- a/drivers/amba/bus.c
+++ b/drivers/amba/bus.c
@@ -17,6 +17,7 @@
#include <linux/pm_runtime.h>
#include <linux/amba/bus.h>
#include <linux/sizes.h>
+#include <linux/limits.h>
#include <asm/irq.h>
@@ -42,6 +43,10 @@ static int amba_match(struct device *dev, struct device_driver *drv)
struct amba_device *pcdev = to_amba_device(dev);
struct amba_driver *pcdrv = to_amba_driver(drv);
+ /* When driver_override is set, only bind to the matching driver */
+ if (pcdev->driver_override)
+ return !strcmp(pcdev->driver_override, drv->name);
+
return amba_lookup(pcdrv->id_table, pcdev) != NULL;
}
@@ -58,6 +63,47 @@ static int amba_uevent(struct device *dev, struct kobj_uevent_env *env)
return retval;
}
+static ssize_t driver_override_show(struct device *_dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct amba_device *dev = to_amba_device(_dev);
+
+ if (!dev->driver_override)
+ return 0;
+
+ return sprintf(buf, "%s\n", dev->driver_override);
+}
+
+static ssize_t driver_override_store(struct device *_dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct amba_device *dev = to_amba_device(_dev);
+ char *driver_override, *old = dev->driver_override, *cp;
+
+ if (count > PATH_MAX)
+ return -EINVAL;
+
+ driver_override = kstrndup(buf, count, GFP_KERNEL);
+ if (!driver_override)
+ return -ENOMEM;
+
+ cp = strchr(driver_override, '\n');
+ if (cp)
+ *cp = '\0';
+
+ if (strlen(driver_override)) {
+ dev->driver_override = driver_override;
+ } else {
+ kfree(driver_override);
+ dev->driver_override = NULL;
+ }
+
+ kfree(old);
+
+ return count;
+}
+
#define amba_attr_func(name,fmt,arg...) \
static ssize_t name##_show(struct device *_dev, \
struct device_attribute *attr, char *buf) \
@@ -80,6 +126,7 @@ amba_attr_func(resource, "\t%016llx\t%016llx\t%016lx\n",
static struct device_attribute amba_dev_attrs[] = {
__ATTR_RO(id),
__ATTR_RO(resource),
+ __ATTR_RW(driver_override),
__ATTR_NULL,
};
diff --git a/include/linux/amba/bus.h b/include/linux/amba/bus.h
index fdd7e1b..7c011e7 100644
--- a/include/linux/amba/bus.h
+++ b/include/linux/amba/bus.h
@@ -32,6 +32,7 @@ struct amba_device {
struct clk *pclk;
unsigned int periphid;
unsigned int irq[AMBA_NR_IRQS];
+ char *driver_override;
};
struct amba_driver {
--
2.1.1
^ permalink raw reply related
* Re: [RFC PATCH 3/3] i2c: show and change bus frequency via sysfs
From: Octavian Purdila @ 2014-10-13 10:32 UTC (permalink / raw)
To: Guenter Roeck
Cc: Mark Roszko, linux-i2c, linux-api-u79uwXL29TY76Z2rM5mHXA, lkml,
Johan Hovold, Wolfram Sang
In-Reply-To: <20141012170637.GA12591-0h96xk9xTtrk1uMJSBkQmQ@public.gmane.org>
On Sun, Oct 12, 2014 at 8:06 PM, Guenter Roeck <linux-0h96xk9xTtrk1uMJSBkQmQ@public.gmane.org> wrote:
> On Sun, Oct 12, 2014 at 12:32:56PM +0300, Octavian Purdila wrote:
>> On Sat, Oct 11, 2014 at 11:14 PM, Mark Roszko <mark.roszko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>
>> > This seems limiting to arches with peripherals that can support a range of
>> > frequencies rather than fixed numbers.
>> > Also it creates some portability quirkiness between platforms when all the
>> > i2c bus drivers have different supported freq lists and you have to match
>> > exactly the right frequency. I.e. one guy does 60khz but another only has
>> > 80khz.
>>
>> Sorry, I don't understand your points here. If this limitations exists
>> they are not introduced by this patch. This patch just exposes the
>> frequency so that it can be read or changed in userspace.
>>
>> > Another issue is in systems where you have i2c devices on the same bus as
>> > the sysfs user space driver. User space could set a bus frequency that
>> > prevents operation with a system i2c device.
>>
>> Changing the frequency is limited to root. Also, bus drivers do not
>> have to implement set_freq if it is thought not to be safe.
>>
>> On a different not, I have noticed that a fixed set of frequencies
>> might not be the best API, since multiple drivers rather support a
>> rather large set of frequencies in a range. A better API might be to
>> expose a min-max range and let the bus driver adjust the requested
>> frequency. I will follow up with a second version that does that.
>
> As two separate sysfs attributes, maybe ? sysfs is supposed to provide
> one value per attribute.
>
Yes, I was thinking of using three attributes: bus_frequency,
bus_min_frequency and bus_max_frequency.
> For the patch itself, I would find it better if you used is_visible to
> determine if the new attributes should be visible (and/or writable) instead
> of returning -EOPNOTSUPP.
>
OK, that sounds better. I will switch to that model in v2.
^ permalink raw reply
* Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
From: Michael S. Tsirkin @ 2014-10-13 7:16 UTC (permalink / raw)
To: Jason Wang; +Cc: kvm, netdev, linux-kernel, virtualization, linux-api
In-Reply-To: <543B6F94.2090107@redhat.com>
On Mon, Oct 13, 2014 at 02:22:12PM +0800, Jason Wang wrote:
> On 10/12/2014 05:27 PM, Michael S. Tsirkin wrote:
> > On Sat, Oct 11, 2014 at 03:16:44PM +0800, Jason Wang wrote:
> >> Below should be useful for some experiments Jason is doing.
> >> I thought I'd send it out for early review/feedback.
> >>
> >> event idx feature allows us to defer interrupts until
> >> a specific # of descriptors were used.
> >> Sometimes it might be useful to get an interrupt after
> >> a specific descriptor, regardless.
> >> This adds a descriptor flag for this, and an API
> >> to create an urgent output descriptor.
> >> This is still an RFC:
> >> we'll need a feature bit for drivers to detect this,
> >> but we've run out of feature bits for virtio 0.X.
> >> For experimentation purposes, drivers can assume
> >> this is set, or add a driver-specific feature bit.
> >>
> >> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >> Signed-off-by: Jason Wang <jasowang@redhat.com>
> > I see that as compared to my original patch, you have
> > added a new flag: VRING_AVAIL_F_NO_URGENT_INTERRUPT
> > I don't think it's necessary, see below.
> >
> > As such, I think this patch should be split:
> > - original patch adding support for urgent descriptors
> > - a patch adding virtqueue_enable/disable_cb_urgent(_prepare)?
>
> Not sure this is a good idea, since the api of first patch is in-completed.
> >
> >> ---
> >> drivers/virtio/virtio_ring.c | 75 +++++++++++++++++++++++++++++++++++++---
> >> include/linux/virtio.h | 14 ++++++++
> >> include/uapi/linux/virtio_ring.h | 5 ++-
> >> 3 files changed, 89 insertions(+), 5 deletions(-)
> >>
> [...]
> >>
> >> +unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *_vq)
> >> +{
> >> + struct vring_virtqueue *vq = to_vvq(_vq);
> >> + u16 last_used_idx;
> >> +
> >> + START_USE(vq);
> >> + vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_URGENT_INTERRUPT;
> >> + last_used_idx = vq->last_used_idx;
> >> + END_USE(vq);
> >> + return last_used_idx;
> >> +}
> >> +EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare_urgent);
> >> +
> > You can implement virtqueue_enable_cb_prepare_urgent
> > simply by clearing ~VRING_AVAIL_F_NO_INTERRUPT.
> >
> > The effect is same: host sends interrupts only if there
> > is an urgent descriptor.
>
> Seems not, consider the case when event index was disabled. This will
> turn on all interrupts.
This means that a legacy device without event index support
will get more interrupts.
Sounds reasonable.
--
MST
^ permalink raw reply
* Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
From: Jason Wang @ 2014-10-13 6:22 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: kvm, netdev, linux-kernel, virtualization, linux-api
In-Reply-To: <20141012092744.GA9567@redhat.com>
On 10/12/2014 05:27 PM, Michael S. Tsirkin wrote:
> On Sat, Oct 11, 2014 at 03:16:44PM +0800, Jason Wang wrote:
>> Below should be useful for some experiments Jason is doing.
>> I thought I'd send it out for early review/feedback.
>>
>> event idx feature allows us to defer interrupts until
>> a specific # of descriptors were used.
>> Sometimes it might be useful to get an interrupt after
>> a specific descriptor, regardless.
>> This adds a descriptor flag for this, and an API
>> to create an urgent output descriptor.
>> This is still an RFC:
>> we'll need a feature bit for drivers to detect this,
>> but we've run out of feature bits for virtio 0.X.
>> For experimentation purposes, drivers can assume
>> this is set, or add a driver-specific feature bit.
>>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> I see that as compared to my original patch, you have
> added a new flag: VRING_AVAIL_F_NO_URGENT_INTERRUPT
> I don't think it's necessary, see below.
>
> As such, I think this patch should be split:
> - original patch adding support for urgent descriptors
> - a patch adding virtqueue_enable/disable_cb_urgent(_prepare)?
Not sure this is a good idea, since the api of first patch is in-completed.
>
>> ---
>> drivers/virtio/virtio_ring.c | 75 +++++++++++++++++++++++++++++++++++++---
>> include/linux/virtio.h | 14 ++++++++
>> include/uapi/linux/virtio_ring.h | 5 ++-
>> 3 files changed, 89 insertions(+), 5 deletions(-)
>>
[...]
>>
>> +unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *_vq)
>> +{
>> + struct vring_virtqueue *vq = to_vvq(_vq);
>> + u16 last_used_idx;
>> +
>> + START_USE(vq);
>> + vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_URGENT_INTERRUPT;
>> + last_used_idx = vq->last_used_idx;
>> + END_USE(vq);
>> + return last_used_idx;
>> +}
>> +EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare_urgent);
>> +
> You can implement virtqueue_enable_cb_prepare_urgent
> simply by clearing ~VRING_AVAIL_F_NO_INTERRUPT.
>
> The effect is same: host sends interrupts only if there
> is an urgent descriptor.
Seems not, consider the case when event index was disabled. This will
turn on all interrupts.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox