* Re: [RFC][PATCH] IP address restricting cgroup subsystem
[not found] ` <20090106230554.GB25228-IaEwMO9oKu/77SC2UrCW1JJg/dWx8T/9@public.gmane.org>
@ 2009-01-07 6:01 ` Li Zefan
[not found] ` <49644526.8030205-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
0 siblings, 1 reply; 13+ messages in thread
From: Li Zefan @ 2009-01-07 6:01 UTC (permalink / raw)
To: Grzegorz Nosek
Cc: containers-qjLDD68F18O7TbgM5vRIOg, netdev-u79uwXL29TY76Z2rM5mHXA
CC: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
I'll review the cgroup part if this patch is regarded as useful.
Grzegorz Nosek wrote:
> This is a very simple cgroup subsystem to restrict IP addresses used
> by member processes. Currently it is limited to IPv4 only but IPv6 (or
> other protocols) should be easy to implement.
>
> IP addresses are write-once (via /cgroup/.../ipaddr.ipv4 in dotted-quad
Why they should be write-once ?
> format) and are inherited by descendant cgroups, so a process once
> restricted should never be able to get rid of the limits. Any address
> may be specified in multiple cgroups. No verification is done to ensure
> the addresses are actually configured on the machine, which has its
> advantages (may add the addresses later) and disadvantages (if you enter
> the wrong address, the cgroup will be effectively cut off from the
> network).
>
> Whenever a process inside a restricted cgroup calls bind(2), the address
> is checked like this:
> - INADDR_LOOPBACK is explicitly allowed (a special case)
> - INADDR_ANY is remapped to _the_ IP address
> - _the_ IP address is passed through unharmed
> - everything else causes -EPERM
>
> When a process calls connect(2), this subsystem calls bind(_the_IP_)
> quietly behind its back, while preserving the original bound port (if
> any).
>
> Rationale (or when/why would you want it):
> The use case for ipaddr_cgroup doesn't overlap with network namespaces,
> which also allow IP address restrictions, because it aims to be much
> lighter due to its limited scope (hopefully able to easily support
> hundreds or possibly thousands of distinct cgroups). It does not attempt
> to hide the existence of other IP addresses from the user.
>
> Signed-off-by: Grzegorz Nosek <root-AfQBxy1nhrQ00sYp1HPQUA@public.gmane.org>
> ---
>
> This is more of an RFC than a finished patch so any and all comments are
> appreciated.
>
> The patch is based to a significant extent on the device_cgroup code,
> including bypassing the security infrastructure and hooking directly
> into the networking code.
>
> I'd also love to hear your opinion about locking--I have a version of this
> patch that uses a seqlock to protect the IP address but I'm not sure this
> is the Right Way to do it (and raw non-atomic lockless access looks scary,
> regardless of how rarely would the address be changed, i.e. at most
> once).
>
> And of course, if the whole idea is stupid, let me know.
>
> include/linux/cgroup_subsys.h | 6 ++
> include/linux/ipaddr_cgroup.h | 23 +++++
> init/Kconfig | 7 ++
> net/socket.c | 16 +++-
> security/Makefile | 1 +
> security/ipaddr_cgroup.c | 200 +++++++++++++++++++++++++++++++++++++++++
> 6 files changed, 250 insertions(+), 3 deletions(-)
> create mode 100644 include/linux/ipaddr_cgroup.h
> create mode 100644 security/ipaddr_cgroup.c
>
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 9c22396..70dd375 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -54,3 +54,9 @@ SUBSYS(freezer)
> #endif
>
> /* */
> +
> +#ifdef CONFIG_CGROUP_IPADDR
> +SUBSYS(ipaddr)
> +#endif
> +
> +/* */
> diff --git a/include/linux/ipaddr_cgroup.h b/include/linux/ipaddr_cgroup.h
> new file mode 100644
> index 0000000..19dc382
> --- /dev/null
> +++ b/include/linux/ipaddr_cgroup.h
> @@ -0,0 +1,23 @@
> +#ifndef HAVE_IPADDR_CGROUP_H
> +#define HAVE_IPADDR_CGROUP_H
> +
> +struct socket;
> +struct sockaddr;
> +
> +#ifdef CONFIG_CGROUP_IPADDR
> +int ipaddr_cgroup_connect(struct socket *sock, struct sockaddr *address, int addrlen);
> +int ipaddr_cgroup_bind(struct socket *sock, struct sockaddr *address, int addrlen);
> +
> +#else
> +static inline int ipaddr_cgroup_connect(struct socket *sock, struct sockaddr *address, int addrlen)
> +{
> + return 0;
> +}
> +
> +static inline int ipaddr_cgroup_bind(struct socket *sock, struct sockaddr *address, int addrlen)
> +{
> + return 0;
> +}
> +
> +#endif /* CONFIG_CGROUP_IPADDR */
> +#endif /* HAVE_IPADDR_CGROUP_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 35d87b9..db43344 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -338,6 +338,13 @@ config CGROUP_DEVICE
> Provides a cgroup implementing whitelists for devices which
> a process in the cgroup can mknod or open.
>
> +config CGROUP_IPADDR
> + bool "IP address controller for cgroups"
> + depends on CGROUPS && EXPERIMENTAL
> + help
> + Provides a cgroup restricting IP addresses its member processes
> + can use.
> +
> config CPUSETS
> bool "Cpuset support"
> depends on SMP && CGROUPS
> diff --git a/net/socket.c b/net/socket.c
> index 3e8d4e3..3bd8c08 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -87,6 +87,7 @@
> #include <linux/audit.h>
> #include <linux/wireless.h>
> #include <linux/nsproxy.h>
> +#include <linux/ipaddr_cgroup.h>
>
> #include <asm/uaccess.h>
> #include <asm/unistd.h>
> @@ -1375,9 +1376,13 @@ asmlinkage long sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen)
> if (sock) {
> err = move_addr_to_kernel(umyaddr, addrlen, (struct sockaddr *)&address);
> if (err >= 0) {
> - err = security_socket_bind(sock,
> - (struct sockaddr *)&address,
> - addrlen);
> + err = ipaddr_cgroup_bind(sock,
> + (struct sockaddr *)&address,
> + addrlen);
> + if (!err)
> + err = security_socket_bind(sock,
> + (struct sockaddr *)&address,
> + addrlen);
> if (!err)
> err = sock->ops->bind(sock,
> (struct sockaddr *)
> @@ -1600,6 +1605,11 @@ asmlinkage long sys_connect(int fd, struct sockaddr __user *uservaddr,
> goto out_put;
>
> err =
> + ipaddr_cgroup_connect(sock, (struct sockaddr *)&address, addrlen);
> + if (err)
> + goto out_put;
> +
> + err =
> security_socket_connect(sock, (struct sockaddr *)&address, addrlen);
> if (err)
> goto out_put;
> diff --git a/security/Makefile b/security/Makefile
> index f654260..aaf225e 100644
> --- a/security/Makefile
> +++ b/security/Makefile
> @@ -16,3 +16,4 @@ obj-$(CONFIG_SECURITY_SELINUX) += selinux/built-in.o
> obj-$(CONFIG_SECURITY_SMACK) += smack/built-in.o
> obj-$(CONFIG_SECURITY_ROOTPLUG) += root_plug.o
> obj-$(CONFIG_CGROUP_DEVICE) += device_cgroup.o
> +obj-$(CONFIG_CGROUP_IPADDR) += ipaddr_cgroup.o
> diff --git a/security/ipaddr_cgroup.c b/security/ipaddr_cgroup.c
> new file mode 100644
> index 0000000..96ccf27
> --- /dev/null
> +++ b/security/ipaddr_cgroup.c
> @@ -0,0 +1,200 @@
> +/*
> + * IP address cgroup subsystem
> + */
> +
> +#include <linux/ipaddr_cgroup.h>
> +
> +#include <linux/cgroup.h>
> +#include <linux/err.h>
> +#include <linux/in.h>
> +#include <linux/inet.h>
> +#include <linux/seq_file.h>
> +#include <linux/socket.h>
> +
> +#include <net/inet_sock.h>
> +
> +struct ipaddr_cgroup {
> + struct cgroup_subsys_state css;
> + u32 ipv4_addr;
> +};
> +
> +static inline struct ipaddr_cgroup *css_to_ipcgroup(struct cgroup_subsys_state *s)
> +{
> + return container_of(s, struct ipaddr_cgroup, css);
> +}
> +
> +static inline struct ipaddr_cgroup *cgroup_to_ipcgroup(struct cgroup *cgroup)
> +{
> + return css_to_ipcgroup(cgroup_subsys_state(cgroup, ipaddr_subsys_id));
> +}
> +
> +static inline struct ipaddr_cgroup *task_ipcgroup(struct task_struct *task)
> +{
> + return css_to_ipcgroup(task_subsys_state(task, ipaddr_subsys_id));
> +}
> +
> +struct cgroup_subsys ipaddr_subsys;
> +
> +static int ipcgroup_can_attach(struct cgroup_subsys *ss,
> + struct cgroup *new_cgroup, struct task_struct *task)
> +{
> + struct ipaddr_cgroup *old_ipcgroup, *new_ipcgroup;
> + u32 old_ipv4;
> +
> + if (current != task && !capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + old_ipcgroup = task_ipcgroup(task);
> + new_ipcgroup = cgroup_to_ipcgroup(new_cgroup);
> + old_ipv4 = old_ipcgroup->ipv4_addr;
> +
> + if (old_ipv4 != INADDR_ANY && old_ipv4 != new_ipcgroup->ipv4_addr)
> + return -EPERM;
> +
> + return 0;
> +}
> +
> +static struct cgroup_subsys_state *ipcgroup_create(struct cgroup_subsys *ss,
> + struct cgroup *cgroup)
> +{
> + struct ipaddr_cgroup *ipcgroup, *parent_ipcgroup;
> + struct cgroup *parent_cgroup;
> +
> + ipcgroup = kzalloc(sizeof(*ipcgroup), GFP_KERNEL);
> + if (!ipcgroup)
> + return ERR_PTR(-ENOMEM);
> + parent_cgroup = cgroup->parent;
> +
> + if (parent_cgroup == NULL) {
> + ipcgroup->ipv4_addr = htonl(INADDR_ANY);
> + } else {
> + parent_ipcgroup = cgroup_to_ipcgroup(parent_cgroup);
> + ipcgroup->ipv4_addr = parent_ipcgroup->ipv4_addr;
> + }
> +
> + return &ipcgroup->css;
> +}
> +
> +static void ipcgroup_destroy(struct cgroup_subsys *ss,
> + struct cgroup *cgroup)
> +{
> + struct ipaddr_cgroup *ipcgroup;
> +
> + ipcgroup = cgroup_to_ipcgroup(cgroup);
> + kfree(ipcgroup);
> +}
> +
> +static int ipcgroup_write_ipv4(struct cgroup *cgrp, struct cftype *cft,
> + const char *buffer)
> +{
> + u32 new_addr;
> + struct ipaddr_cgroup *ipcgroup;
> + int ret;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + ipcgroup = cgroup_to_ipcgroup(cgrp);
> + if (ipcgroup->ipv4_addr != htonl(INADDR_ANY))
> + return -EPERM;
> +
> + ret = in4_pton(buffer, -1, (u8 *)&new_addr, '\0', NULL);
> + if (!ret)
> + return -EINVAL;
> +
> + /* already network-endian */
> + ipcgroup->ipv4_addr = new_addr;
> + return 0;
> +}
> +
> +static int ipcgroup_read_ipv4(struct cgroup *cgrp, struct cftype *cft,
> + struct seq_file *m)
> +{
> + struct ipaddr_cgroup *ipcgroup;
> +
> + ipcgroup = cgroup_to_ipcgroup(cgrp);
> + seq_printf(m, NIPQUAD_FMT "\n", NIPQUAD(ipcgroup->ipv4_addr));
> + return 0;
> +}
> +
> +static struct cftype ipaddr_cgroup_files[] = {
> + {
> + .name = "ipv4",
> + .write_string = ipcgroup_write_ipv4,
> + .read_seq_string = ipcgroup_read_ipv4,
> + },
> +};
> +
> +static int ipcgroup_populate(struct cgroup_subsys *ss,
> + struct cgroup *cgroup)
> +{
> + return cgroup_add_files(cgroup, ss, ipaddr_cgroup_files,
> + ARRAY_SIZE(ipaddr_cgroup_files));
> +}
> +
> +struct cgroup_subsys ipaddr_subsys = {
> + .name = "ipaddr",
> + .can_attach = ipcgroup_can_attach,
> + .create = ipcgroup_create,
> + .destroy = ipcgroup_destroy,
> + .populate = ipcgroup_populate,
> + .subsys_id = ipaddr_subsys_id
> +};
> +
> +int ipaddr_cgroup_connect(struct socket *sock, struct sockaddr *address, int addrlen)
> +{
> + struct sockaddr_in sa_in;
> + struct ipaddr_cgroup *ipcgroup;
> + struct inet_sock *inet;
> + int err;
> +
> + if (address->sa_family != AF_INET)
> + return 0;
> +
> + ipcgroup = task_ipcgroup(current);
> + if (ipcgroup->ipv4_addr == htonl(INADDR_ANY))
> + return 0;
> +
> + inet = inet_sk(sock->sk);
> +
> + sa_in.sin_family = AF_INET;
> + sa_in.sin_addr.s_addr = ipcgroup->ipv4_addr;
> + sa_in.sin_port = inet->sport;
> +
> + err = security_socket_bind(sock, (struct sockaddr *)&sa_in, sizeof(sa_in));
> + if (err)
> + return err;
> +
> + err = sock->ops->bind(sock, (struct sockaddr *)&sa_in, sizeof(sa_in));
> +
> + return err;
> +}
> +
> +int ipaddr_cgroup_bind(struct socket *sock, struct sockaddr *address, int addrlen)
> +{
> + struct sockaddr_in *sa_in;
> + struct ipaddr_cgroup *ipcgroup;
> +
> + if (address->sa_family != AF_INET)
> + return 0;
> +
> + ipcgroup = task_ipcgroup(current);
> + if (ipcgroup->ipv4_addr == htonl(INADDR_ANY))
> + return 0;
> +
> + sa_in = (struct sockaddr_in *) address;
> +
> + /* remap INADDR_ANY to cgroup IP address */
> + if (sa_in->sin_addr.s_addr == htonl(INADDR_ANY))
> + sa_in->sin_addr.s_addr = ipcgroup->ipv4_addr;
> +
> + /* a very special case */
> + if (sa_in->sin_addr.s_addr == htonl(INADDR_LOOPBACK))
> + return 0;
> +
> + if (sa_in->sin_addr.s_addr == ipcgroup->ipv4_addr)
> + return 0;
> +
> + return -EPERM;
> +}
> +
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC][PATCH] IP address restricting cgroup subsystem
[not found] ` <49644526.8030205-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-01-07 7:38 ` Grzegorz Nosek
2009-01-07 8:36 ` Li Zefan
0 siblings, 1 reply; 13+ messages in thread
From: Grzegorz Nosek @ 2009-01-07 7:38 UTC (permalink / raw)
To: Li Zefan; +Cc: containers-qjLDD68F18O7TbgM5vRIOg, netdev-u79uwXL29TY76Z2rM5mHXA
On śro, sty 07, 2009 at 02:01:10 +0800, Li Zefan wrote:
> CC: netdev@vger.kernel.org
>
> I'll review the cgroup part if this patch is regarded as useful.
>
> Grzegorz Nosek wrote:
> > This is a very simple cgroup subsystem to restrict IP addresses used
> > by member processes. Currently it is limited to IPv4 only but IPv6 (or
> > other protocols) should be easy to implement.
> >
> > IP addresses are write-once (via /cgroup/.../ipaddr.ipv4 in dotted-quad
>
> Why they should be write-once ?
No real (technical) reason. Making it read-write would be fine with me.
I wanted to make the restriction a one-way road but I guess I can police
that in userspace (simply don't write anything to the file twice).
However, I think that the restriction should be inherited, so that if
CG1 is bound to e.g. 10.0.0.1, CG1/CG2 must be bound to the same
address. But what would I do then with descendant cgroups? Leave them as
is (breaking the inheritance)? Find them all and change their bound
address behind their back (do we have an API for that?)?
I guess I have the same problem right now, anyway (only once instead of
multiple times), so I'd really appreciate your input on this.
Best regards,
Grzegorz Nosek
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC][PATCH] IP address restricting cgroup subsystem
2009-01-07 7:38 ` Grzegorz Nosek
@ 2009-01-07 8:36 ` Li Zefan
2009-01-07 9:16 ` Grzegorz Nosek
0 siblings, 1 reply; 13+ messages in thread
From: Li Zefan @ 2009-01-07 8:36 UTC (permalink / raw)
To: Grzegorz Nosek; +Cc: containers, netdev
Grzegorz Nosek wrote:
> On śro, sty 07, 2009 at 02:01:10 +0800, Li Zefan wrote:
>> CC: netdev@vger.kernel.org
>>
>> I'll review the cgroup part if this patch is regarded as useful.
>>
>> Grzegorz Nosek wrote:
>>> This is a very simple cgroup subsystem to restrict IP addresses used
>>> by member processes. Currently it is limited to IPv4 only but IPv6 (or
>>> other protocols) should be easy to implement.
>>>
>>> IP addresses are write-once (via /cgroup/.../ipaddr.ipv4 in dotted-quad
>> Why they should be write-once ?
>
> No real (technical) reason. Making it read-write would be fine with me.
> I wanted to make the restriction a one-way road but I guess I can police
> that in userspace (simply don't write anything to the file twice).
>
But seems the patch makes it impossible to re-allow a restricted task to
be binded to INADDR_ANY.
> However, I think that the restriction should be inherited, so that if
> CG1 is bound to e.g. 10.0.0.1, CG1/CG2 must be bound to the same
> address. But what would I do then with descendant cgroups? Leave them as
> is (breaking the inheritance)? Find them all and change their bound
> address behind their back (do we have an API for that?)?
>
Firstly, is inheritance necessary ?
If yes, then how about:
The root cgroup is read-only, so the tasks in it always bind to INADDR_ANY.
For other cgroups, write is allowed only if it has no children and the
parent is INADDR_ANY.
> I guess I have the same problem right now, anyway (only once instead of
> multiple times), so I'd really appreciate your input on this.
>
> Best regards,
> Grzegorz Nosek
>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC][PATCH] IP address restricting cgroup subsystem
2009-01-07 8:36 ` Li Zefan
@ 2009-01-07 9:16 ` Grzegorz Nosek
2009-01-07 9:33 ` Li Zefan
0 siblings, 1 reply; 13+ messages in thread
From: Grzegorz Nosek @ 2009-01-07 9:16 UTC (permalink / raw)
To: Li Zefan; +Cc: Grzegorz Nosek, containers, netdev
On śro, sty 07, 2009 at 04:36:35 +0800, Li Zefan wrote:
> Grzegorz Nosek wrote:
> >>> IP addresses are write-once (via /cgroup/.../ipaddr.ipv4 in dotted-quad
> >> Why they should be write-once ?
> >
> > No real (technical) reason. Making it read-write would be fine with me.
> > I wanted to make the restriction a one-way road but I guess I can police
> > that in userspace (simply don't write anything to the file twice).
> >
>
> But seems the patch makes it impossible to re-allow a restricted task to
> be binded to INADDR_ANY.
Yes, my goal is to disallow that but I don't insist to do that in the
kernel (I'm not currently planning to let untrusted root loose in a
container).
> Firstly, is inheritance necessary ?
It would be nice to have when the container's root is untrusted but
might want to subdivide the container's cgroup for other purposes.
Without inheritance, they would be able to circumvent the IP address
restriction. One could argue that a full untrusted-root container would
need a proper network namespace anyway (and giving CAP_SYS_ADMIN there
is probably a very bad idea), but still, I'd feel uneasy.
> If yes, then how about:
>
> The root cgroup is read-only, so the tasks in it always bind to INADDR_ANY.
> For other cgroups, write is allowed only if it has no children and the
> parent is INADDR_ANY.
Yes, I like that. Will update the patch. I assume that I must check
list_empty(&cgroup->children)? Should I use cgroup_lock()/cgroup_unlock()
or other locking? I think it will be safe to do without locks but would
rather get some expert advice.
Thanks a lot for your comments.
Best regards,
Grzegorz Nosek
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC][PATCH] IP address restricting cgroup subsystem
2009-01-07 9:16 ` Grzegorz Nosek
@ 2009-01-07 9:33 ` Li Zefan
2009-01-07 9:37 ` Grzegorz Nosek
2009-01-09 21:38 ` [Devel] " Paul Menage
0 siblings, 2 replies; 13+ messages in thread
From: Li Zefan @ 2009-01-07 9:33 UTC (permalink / raw)
To: Grzegorz Nosek; +Cc: containers, netdev
Grzegorz Nosek wrote:
> On śro, sty 07, 2009 at 04:36:35 +0800, Li Zefan wrote:
>> Grzegorz Nosek wrote:
>>>>> IP addresses are write-once (via /cgroup/.../ipaddr.ipv4 in dotted-quad
>>>> Why they should be write-once ?
>>> No real (technical) reason. Making it read-write would be fine with me.
>>> I wanted to make the restriction a one-way road but I guess I can police
>>> that in userspace (simply don't write anything to the file twice).
>>>
>> But seems the patch makes it impossible to re-allow a restricted task to
>> be binded to INADDR_ANY.
>
> Yes, my goal is to disallow that but I don't insist to do that in the
> kernel (I'm not currently planning to let untrusted root loose in a
> container).
>
>> Firstly, is inheritance necessary ?
>
> It would be nice to have when the container's root is untrusted but
> might want to subdivide the container's cgroup for other purposes.
> Without inheritance, they would be able to circumvent the IP address
> restriction. One could argue that a full untrusted-root container would
> need a proper network namespace anyway (and giving CAP_SYS_ADMIN there
> is probably a very bad idea), but still, I'd feel uneasy.
>
>> If yes, then how about:
>>
>> The root cgroup is read-only, so the tasks in it always bind to INADDR_ANY.
>> For other cgroups, write is allowed only if it has no children and the
>> parent is INADDR_ANY.
>
> Yes, I like that. Will update the patch. I assume that I must check
> list_empty(&cgroup->children)?
Yes.
> Should I use cgroup_lock()/cgroup_unlock()
Yes.
> or other locking? I think it will be safe to do without locks but would
> rather get some expert advice.
>
No. Without locks, it races with mkdir.
=============
//cgroup_lock();
if (list_empty(&cgrp->children) &&
parent->ipv4_addr == INADDR_ANY)
<--- mkdir()
ipcgroup->ipv4_addr = new_addr;
//cgroup_unlock();
==============
In the above case, ipcgroup->ipv4_addr = new_addr,
but child_cgroup->ipv4_addr == INADDR_ANY, which is not expected.
> Thanks a lot for your comments.
>
> Best regards,
> Grzegorz Nosek
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC][PATCH] IP address restricting cgroup subsystem
2009-01-07 9:33 ` Li Zefan
@ 2009-01-07 9:37 ` Grzegorz Nosek
2009-01-09 21:38 ` [Devel] " Paul Menage
1 sibling, 0 replies; 13+ messages in thread
From: Grzegorz Nosek @ 2009-01-07 9:37 UTC (permalink / raw)
To: Li Zefan; +Cc: Grzegorz Nosek, containers, netdev
On śro, sty 07, 2009 at 05:33:49 +0800, Li Zefan wrote:
> >> The root cgroup is read-only, so the tasks in it always bind to INADDR_ANY.
> >> For other cgroups, write is allowed only if it has no children and the
> >> parent is INADDR_ANY.
> >
> > Yes, I like that. Will update the patch. I assume that I must check
> > list_empty(&cgroup->children)?
>
> Yes.
>
> > Should I use cgroup_lock()/cgroup_unlock()
>
> Yes.
>
> > or other locking? I think it will be safe to do without locks but would
> > rather get some expert advice.
> >
>
> No. Without locks, it races with mkdir.
>
> =============
>
> //cgroup_lock();
>
> if (list_empty(&cgrp->children) &&
> parent->ipv4_addr == INADDR_ANY)
> <--- mkdir()
> ipcgroup->ipv4_addr = new_addr;
>
> //cgroup_unlock();
>
> ==============
>
> In the above case, ipcgroup->ipv4_addr = new_addr,
> but child_cgroup->ipv4_addr == INADDR_ANY, which is not expected.
I see. Thanks a lot!
Best regards,
Grzegorz Nosek
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Devel] Re: [RFC][PATCH] IP address restricting cgroup subsystem
2009-01-07 9:33 ` Li Zefan
2009-01-07 9:37 ` Grzegorz Nosek
@ 2009-01-09 21:38 ` Paul Menage
2009-01-10 4:50 ` Li Zefan
1 sibling, 1 reply; 13+ messages in thread
From: Paul Menage @ 2009-01-09 21:38 UTC (permalink / raw)
To: Li Zefan; +Cc: Grzegorz Nosek, containers, netdev
On Wed, Jan 7, 2009 at 1:33 AM, Li Zefan <lizf@cn.fujitsu.com> wrote:
>>
>> Yes, I like that. Will update the patch. I assume that I must check
>> list_empty(&cgroup->children)?
>
> Yes.
>
>> Should I use cgroup_lock()/cgroup_unlock()
>
> Yes.
For checking the "children" list, you can just lock
ipaddr_subsys.hierarchy_mutex.
Paul
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Devel] Re: [RFC][PATCH] IP address restricting cgroup subsystem
2009-01-09 21:38 ` [Devel] " Paul Menage
@ 2009-01-10 4:50 ` Li Zefan
2009-01-10 16:14 ` Paul Menage
0 siblings, 1 reply; 13+ messages in thread
From: Li Zefan @ 2009-01-10 4:50 UTC (permalink / raw)
To: Paul Menage; +Cc: Grzegorz Nosek, containers, netdev
Paul Menage wrote:
> On Wed, Jan 7, 2009 at 1:33 AM, Li Zefan <lizf@cn.fujitsu.com> wrote:
>>> Yes, I like that. Will update the patch. I assume that I must check
>>> list_empty(&cgroup->children)?
>> Yes.
>>
>>> Should I use cgroup_lock()/cgroup_unlock()
>> Yes.
>
> For checking the "children" list, you can just lock
> ipaddr_subsys.hierarchy_mutex.
>
Unfortunately hierarchy_mutex can't be used here, since hierarchy_mutex
doesn't protect subsys's create() method, and the create() will access
parent cgroup's data.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Devel] Re: [RFC][PATCH] IP address restricting cgroup subsystem
2009-01-10 4:50 ` Li Zefan
@ 2009-01-10 16:14 ` Paul Menage
2009-01-12 2:20 ` Li Zefan
0 siblings, 1 reply; 13+ messages in thread
From: Paul Menage @ 2009-01-10 16:14 UTC (permalink / raw)
To: Li Zefan; +Cc: Grzegorz Nosek, containers, netdev
On Fri, Jan 9, 2009 at 8:50 PM, Li Zefan <lizf@cn.fujitsu.com> wrote:
>>
>> For checking the "children" list, you can just lock
>> ipaddr_subsys.hierarchy_mutex.
>>
>
> Unfortunately hierarchy_mutex can't be used here, since hierarchy_mutex
> doesn't protect subsys's create() method, and the create() will access
> parent cgroup's data.
>
But that can be solved by putting a spinlock in the ipaddr_cgroup
structure and taking it in the write handler (and the connect/bind
handlers, which should also be using RCU), and taking the parent
structure's lock before copying from it in the create callback. No
need for something as heavy as cgroup_lock().
Paul
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Devel] Re: [RFC][PATCH] IP address restricting cgroup subsystem
2009-01-10 16:14 ` Paul Menage
@ 2009-01-12 2:20 ` Li Zefan
2009-01-14 2:07 ` Paul Menage
0 siblings, 1 reply; 13+ messages in thread
From: Li Zefan @ 2009-01-12 2:20 UTC (permalink / raw)
To: Paul Menage; +Cc: Grzegorz Nosek, containers, netdev
Paul Menage wrote:
> On Fri, Jan 9, 2009 at 8:50 PM, Li Zefan <lizf@cn.fujitsu.com> wrote:
>>> For checking the "children" list, you can just lock
>>> ipaddr_subsys.hierarchy_mutex.
>>>
>> Unfortunately hierarchy_mutex can't be used here, since hierarchy_mutex
>> doesn't protect subsys's create() method, and the create() will access
>> parent cgroup's data.
>>
>
> But that can be solved by putting a spinlock in the ipaddr_cgroup
> structure and taking it in the write handler (and the connect/bind
> handlers, which should also be using RCU), and taking the parent
> structure's lock before copying from it in the create callback. No
> need for something as heavy as cgroup_lock().
>
Still won't work. :(
================
lock(ipaddr_subsys.hierarchy_mutex);
if (list_empty(&cgrp->children) &&
parent->ipv4_addr == INADDR_ANY)
create()
parent->spin_lock();
cgrp->ipv4_addr = parent->ipv4_addr;
parent->spin_unlock();
ipcgroup->spin_lock();
ipcgroup->ipv4_addr = new_addr;
ipcgroup->spin_unlock();
unlock(ipaddr_subsys.hierarchy_mutex);
list_add(&cgrp->sibling, &cgrp->parent->children);
================
And neither can it work to hold hierarchy_mutex in subsys's create().
I think the only way to make hierarchy_mutex works for this issue is:
@@ -2403,16 +2403,18 @@ static long cgroup_create(struct cgroup *parent, struct
if (notify_on_release(parent))
set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
+ cgroup_lock_hierarchy(root);
+
for_each_subsys(root, ss) {
struct cgroup_subsys_state *css = ss->create(ss, cgrp);
if (IS_ERR(css)) {
+ cgroup_unlock_hierarchy(root);
err = PTR_ERR(css);
goto err_destroy;
}
init_cgroup_css(css, ss, cgrp);
}
- cgroup_lock_hierarchy(root);
list_add(&cgrp->sibling, &cgrp->parent->children);
cgroup_unlock_hierarchy(root);
root->number_of_cgroups++;
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Devel] Re: [RFC][PATCH] IP address restricting cgroup subsystem
2009-01-12 2:20 ` Li Zefan
@ 2009-01-14 2:07 ` Paul Menage
2009-01-14 2:47 ` Li Zefan
0 siblings, 1 reply; 13+ messages in thread
From: Paul Menage @ 2009-01-14 2:07 UTC (permalink / raw)
To: Li Zefan; +Cc: Grzegorz Nosek, containers, netdev
>
> I think the only way to make hierarchy_mutex works for this issue is:
>
> @@ -2403,16 +2403,18 @@ static long cgroup_create(struct cgroup *parent, struct
> if (notify_on_release(parent))
> set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
>
> + cgroup_lock_hierarchy(root);
> +
> for_each_subsys(root, ss) {
> struct cgroup_subsys_state *css = ss->create(ss, cgrp);
> if (IS_ERR(css)) {
> + cgroup_unlock_hierarchy(root);
> err = PTR_ERR(css);
> goto err_destroy;
> }
> init_cgroup_css(css, ss, cgrp);
> }
>
> - cgroup_lock_hierarchy(root);
> list_add(&cgrp->sibling, &cgrp->parent->children);
> cgroup_unlock_hierarchy(root);
> root->number_of_cgroups++;
>
That would be possible, but I'm not sure that extending
hierarchy_mutex across all the create calls is a good idea - it's
meant to be very lightweight.
OK, an alternative way to avoid cgroup_lock() is for the
spinlock-protected state in ipcgroup to be the address and the count
of active children.
create() does:
lock parent
css->addr = parent->addr
parent->child_count++;
unlock parent
and write does:
lock css
if (!css->child_count) {
css->addr = new_addr
} else {
report error;
}
unlock css
Paul
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Devel] Re: [RFC][PATCH] IP address restricting cgroup subsystem
2009-01-14 2:07 ` Paul Menage
@ 2009-01-14 2:47 ` Li Zefan
2009-01-14 2:50 ` Paul Menage
0 siblings, 1 reply; 13+ messages in thread
From: Li Zefan @ 2009-01-14 2:47 UTC (permalink / raw)
To: Paul Menage; +Cc: Grzegorz Nosek, containers, netdev
> That would be possible, but I'm not sure that extending
> hierarchy_mutex across all the create calls is a good idea - it's
> meant to be very lightweight.
>
agree
> OK, an alternative way to avoid cgroup_lock() is for the
> spinlock-protected state in ipcgroup to be the address and the count
> of active children.
>
This works. But:
- we put extra burden on subsystem developers.
- hierarchy_mutex can't do what we expect, and it's a bit subtle.
- there won't be performance problem or potential lock issue to use
cgroup_mutex in subsys' simple write functions, so I don't think
we have to avoid cgroup_mutex here.
In memcg, both mem_cgroup_hierarchy_write() and mem_cgroup_swappiness_write()
check cgrp->children list, similar to ipv4_write() here. And I'm going
to fix swappiness_write() for it doesn't hold cgroup_lock(), but if
avoiding cgroup_lock() is the direction, then I have to use this
alternative way you sugguested.
> create() does:
>
> lock parent
> css->addr = parent->addr
> parent->child_count++;
> unlock parent
>
> and write does:
>
> lock css
> if (!css->child_count) {
> css->addr = new_addr
> } else {
> report error;
> }
> unlock css
>
> Paul
>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Devel] Re: [RFC][PATCH] IP address restricting cgroup subsystem
2009-01-14 2:47 ` Li Zefan
@ 2009-01-14 2:50 ` Paul Menage
0 siblings, 0 replies; 13+ messages in thread
From: Paul Menage @ 2009-01-14 2:50 UTC (permalink / raw)
To: Li Zefan; +Cc: Grzegorz Nosek, containers, netdev
On Tue, Jan 13, 2009 at 6:47 PM, Li Zefan <lizf@cn.fujitsu.com> wrote:
>
> - we put extra burden on subsystem developers.
Better than putting extra burden on cgroup_lock() :-)
Anyway, if this turns out to be a common operation, we can provide
support for it in the framework rather than having each subsystem
duplicate the effort.
Paul
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2009-01-14 2:50 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20090106230554.GB25228@eskarina.localdomain.pl>
[not found] ` <20090106230554.GB25228-IaEwMO9oKu/77SC2UrCW1JJg/dWx8T/9@public.gmane.org>
2009-01-07 6:01 ` [RFC][PATCH] IP address restricting cgroup subsystem Li Zefan
[not found] ` <49644526.8030205-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-01-07 7:38 ` Grzegorz Nosek
2009-01-07 8:36 ` Li Zefan
2009-01-07 9:16 ` Grzegorz Nosek
2009-01-07 9:33 ` Li Zefan
2009-01-07 9:37 ` Grzegorz Nosek
2009-01-09 21:38 ` [Devel] " Paul Menage
2009-01-10 4:50 ` Li Zefan
2009-01-10 16:14 ` Paul Menage
2009-01-12 2:20 ` Li Zefan
2009-01-14 2:07 ` Paul Menage
2009-01-14 2:47 ` Li Zefan
2009-01-14 2:50 ` Paul Menage
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).