* [PATCH 1/2] connector: add cgroup release event report to proc connector
@ 2015-05-26 22:07 Dimitri John Ledkov
From: Dimitri John Ledkov @ 2015-05-26 22:07 UTC (permalink / raw)
To: netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA
Add a kernel API to send a proc connector notification that a cgroup
has become empty. A userspace daemon can then act upon such
information, typically by cleaning up and removing the group, as it is
no longer needed.
Currently there are two other ways (one for the current and one for
the unified cgroup hierarchy) to receive such notifications, but they
involve either spawning a userspace helper or monitoring a lot of
files. This patch instead provides a firehose of all such events from
a single place.
In the current cgroup hierarchy, the way to get notifications is to
set `release_agent' and enable `notify_on_release' for a given cgroup
hierarchy. The kernel then spawns the userspace helper with the
removed cgroup as an argument. It has been acknowledged that this is
expensive, especially in exit-heavy workloads. In userspace this is
currently used by systemd and CGmanager, as far as I know; both agents
establish a connection to their long-running daemon and pass the
message to it. As a courtesy to other processes, such an event is
sometimes forwarded further on, e.g. systemd forwards it to the system
DBus.
In the future/unified cgroup hierarchy, support for `release_agent' is
removed without a direct replacement. However, there is a new
`cgroup.populated' file exposed that recursively reports whether there
are any tasks in a given cgroup hierarchy. It is a very good flag for
quickly/lazily scanning for empty cgroups, but one would need to
establish an inotify watch on each and every cgroup.populated file at
cgroup setup time (ideally before any pids enter said cgroup). Thus,
again, nobody but the original creator of a given cgroup has a chance
to reliably monitor the cgroup becoming empty (since there is no
reliable recursive inotify watch).
Hence, the addition to the proc connector firehose. Multiple listeners
(albeit each requiring CAP_NET_ADMIN in the init pid/user namespace)
can connect and monitor cgroup release notifications. In a way, this
repeats udev history: at first it was a userspace helper, which later
became a netlink socket. And I hope that the proc connector is a
naturally good fit for this notification type.
For precisely when cgroups emit this event, see the next patch against
kernel/cgroup.c.
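
To make the intended usage concrete, below is a minimal userspace
listener sketch (illustration only, not part of this patch). It
subscribes to the proc connector with PROC_CN_MCAST_LISTEN (which
requires CAP_NET_ADMIN) and prints cgroup release events; the
PROC_EVENT_CGROUP_RELEASE value and the cgroup_release layout are
taken from the uapi change in this series and defined locally, since
they are not in released kernel headers:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>

#define PROC_EVENT_CGROUP_RELEASE 0x00000400	/* value proposed in this patch */

/* layout proposed in this patch (not a mainline uapi struct) */
struct cgroup_release_event {
	char cgroup_root[64];
	char cgroup_path[];
};

int main(void)
{
	struct sockaddr_nl sa = {
		.nl_family = AF_NETLINK,
		.nl_groups = CN_IDX_PROC,
		.nl_pid    = getpid(),
	};
	char buf[4096] __attribute__((aligned(NLMSG_ALIGNTO)));
	int sock;

	sock = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
	if (sock < 0 || bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		return 1;

	/* subscribe to cn_proc multicast; requires CAP_NET_ADMIN */
	struct {
		struct nlmsghdr nlh;
		struct cn_msg cn;
		enum proc_cn_mcast_op op;
	} __attribute__((packed)) sub = {
		.nlh = { .nlmsg_len = sizeof(sub), .nlmsg_type = NLMSG_DONE,
			 .nlmsg_pid = getpid() },
		.cn  = { .id  = { .idx = CN_IDX_PROC, .val = CN_VAL_PROC },
			 .len = sizeof(enum proc_cn_mcast_op) },
		.op  = PROC_CN_MCAST_LISTEN,
	};
	if (send(sock, &sub, sizeof(sub), 0) < 0)
		return 1;

	for (;;) {
		ssize_t len = recv(sock, buf, sizeof(buf), 0);
		if (len <= 0)
			break;

		struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
		struct cn_msg *cn = NLMSG_DATA(nlh);
		struct proc_event *ev = (struct proc_event *)cn->data;

		if (ev->what == PROC_EVENT_CGROUP_RELEASE) {
			struct cgroup_release_event *cg =
				(struct cgroup_release_event *)&ev->event_data;
			/* cgroup_path is not NUL-terminated on the wire;
			 * derive its length from the connector message. */
			int plen = (int)cn->len - (int)sizeof(*ev);

			printf("cgroup released: root=%.64s path=%.*s\n",
			       cg->cgroup_root,
			       plen > 0 ? plen : 0, cg->cgroup_path);
		}
	}
	return 0;
}

The subscription uses the standard PROC_CN_MCAST_LISTEN handshake, so
existing proc connector listeners only need the extra `what' case to
pick up the new event.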
Signed-off-by: Dimitri John Ledkov <dimitri.j.ledkov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/connector/cn_proc.c | 56 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/cn_proc.h | 6 +++++
include/uapi/linux/cn_proc.h | 8 ++++++-
3 files changed, 69 insertions(+), 1 deletion(-)
diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c
index 15d06fc..ef71bd9 100644
--- a/drivers/connector/cn_proc.c
+++ b/drivers/connector/cn_proc.c
@@ -31,6 +31,7 @@
#include <linux/ptrace.h>
#include <linux/atomic.h>
#include <linux/pid_namespace.h>
+#include <linux/cgroup.h>
#include <linux/cn_proc.h>
@@ -244,6 +245,61 @@ void proc_comm_connector(struct task_struct *task)
cn_netlink_send(msg, 0, CN_IDX_PROC, GFP_KERNEL);
}
+void proc_cgrelease_connector(struct cgroup *cgrp)
+{
+ struct cn_msg *msg;
+ struct proc_event *ev;
+ char *path_buffer, *path;
+ __u8 *buffer;
+ __u8 length;
+
+ if (atomic_read(&proc_event_num_listeners) < 1)
+ return;
+
+ /* ENOMEM */
+ path_buffer = kmalloc(PATH_MAX, GFP_KERNEL);
+ if (!path_buffer)
+ return;
+
+ /* ENAMETOOLONG */
+ path = cgroup_path(cgrp, path_buffer, PATH_MAX);
+ if (!path)
+ goto out_path_buffer;
+
+ length = strlen(path);
+
+ buffer = kmalloc(CN_PROC_MSG_SIZE + length, GFP_KERNEL);
+ if (!buffer)
+ goto out_path_buffer;
+
+ msg = buffer_to_cn_msg(buffer);
+ ev = (struct proc_event *)msg->data;
+ memset(&ev->event_data, 0, sizeof(ev->event_data) + length);
+ get_seq(&msg->seq, &ev->cpu);
+ ev->timestamp_ns = ktime_get_ns();
+ ev->what = PROC_EVENT_CGROUP_RELEASE;
+
+ /* If MAX_CGROUP_ROOT_NAMELEN is ever increased,
+ * ./include/uapi/linux/cn_proc.h will need an update */
+ BUILD_BUG_ON(MAX_CGROUP_ROOT_NAMELEN > 64);
+ memcpy(ev->event_data.cgroup_release.cgroup_root,
+ cgrp->root->name,
+ MAX_CGROUP_ROOT_NAMELEN);
+
+ memcpy(ev->event_data.cgroup_release.cgroup_path, path, length);
+
+ memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id));
+ msg->ack = 0; /* not used */
+ msg->len = sizeof(*ev) + length;
+ msg->flags = 0; /* not used */
+ cn_netlink_send(msg, 0, CN_IDX_PROC, GFP_KERNEL);
+
+ kfree(buffer);
+
+out_path_buffer:
+ kfree(path_buffer);
+}
+
void proc_coredump_connector(struct task_struct *task)
{
struct cn_msg *msg;
diff --git a/include/linux/cn_proc.h b/include/linux/cn_proc.h
index 1d5b02a..cf7cc56 100644
--- a/include/linux/cn_proc.h
+++ b/include/linux/cn_proc.h
@@ -19,6 +19,8 @@
#include <uapi/linux/cn_proc.h>
+struct cgroup;
+
#ifdef CONFIG_PROC_EVENTS
void proc_fork_connector(struct task_struct *task);
void proc_exec_connector(struct task_struct *task);
@@ -28,6 +30,7 @@ void proc_ptrace_connector(struct task_struct *task, int which_id);
void proc_comm_connector(struct task_struct *task);
void proc_coredump_connector(struct task_struct *task);
void proc_exit_connector(struct task_struct *task);
+void proc_cgrelease_connector(struct cgroup *cgrp);
#else
static inline void proc_fork_connector(struct task_struct *task)
{}
@@ -54,5 +57,8 @@ static inline void proc_coredump_connector(struct task_struct *task)
static inline void proc_exit_connector(struct task_struct *task)
{}
+
+static inline void proc_cgrelease_connector(struct cgroup *cgrp)
+{}
#endif /* CONFIG_PROC_EVENTS */
#endif /* CN_PROC_H */
diff --git a/include/uapi/linux/cn_proc.h b/include/uapi/linux/cn_proc.h
index f6c2710..bed7b6b 100644
--- a/include/uapi/linux/cn_proc.h
+++ b/include/uapi/linux/cn_proc.h
@@ -55,7 +55,8 @@ struct proc_event {
PROC_EVENT_SID = 0x00000080,
PROC_EVENT_PTRACE = 0x00000100,
PROC_EVENT_COMM = 0x00000200,
- /* "next" should be 0x00000400 */
+ PROC_EVENT_CGROUP_RELEASE = 0x00000400,
+ /* "next" should be 0x00000800 */
/* "last" is the last process event: exit,
* while "next to last" is coredumping event */
PROC_EVENT_COREDUMP = 0x40000000,
@@ -123,6 +124,11 @@ struct proc_event {
__u32 exit_code, exit_signal;
} exit;
+ struct cgroup_release_proc_event {
+ char cgroup_root[64];
+ char cgroup_path[];
+ } cgroup_release;
+
} event_data;
};
--
2.1.4
* [PATCH 2/2] cgroup: report cgroup release event to proc connector
From: Dimitri John Ledkov @ 2015-05-26 22:07 UTC (permalink / raw)
To: netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA

This adds a call to proc_cgrelease_connector() at check_for_release()
time. It is done when a cgroup becomes empty, regardless of its
notify_on_release status. It is therefore compatible with both the
current and the unified cgroup hierarchies, and the decision of which
cgroups to emit events for is offloaded to the proc connector API.
Specifically, if there are no listeners, no events are emitted. If
only certain events are desired, the userspace proc connector listener
can filter them in userspace or install a BPF filter on the socket to
ignore events it does not care about.

Signed-off-by: Dimitri John Ledkov <dimitri.j.ledkov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 kernel/cgroup.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 469dd54..c52e584 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,7 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
+#include <linux/cn_proc.h>

 #include <linux/atomic.h>
@@ -5307,9 +5308,12 @@ void cgroup_exit(struct task_struct *tsk)

 static void check_for_release(struct cgroup *cgrp)
 {
-	if (notify_on_release(cgrp) && !cgroup_has_tasks(cgrp) &&
-	    !css_has_online_children(&cgrp->self) && !cgroup_is_dead(cgrp))
-		schedule_work(&cgrp->release_agent_work);
+	if (!cgroup_has_tasks(cgrp) &&
+	    !css_has_online_children(&cgrp->self) && !cgroup_is_dead(cgrp)) {
+		proc_cgrelease_connector(cgrp);
+		if (notify_on_release(cgrp))
+			schedule_work(&cgrp->release_agent_work);
+	}
 }

 /*
--
2.1.4
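
As an aside (not part of either patch), the in-kernel filtering
mentioned in the message above could be done with a classic BPF socket
filter along these lines. The offset assumes the payload seen by the
filter is nlmsghdr (16 bytes) + cn_msg (20 bytes) + proc_event, and
0x00000400 is the value proposed in patch 1/2; both are assumptions
that would need re-checking against the headers actually in use:

#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/filter.h>

#ifndef SO_ATTACH_FILTER
#define SO_ATTACH_FILTER 26	/* from asm-generic/socket.h */
#endif

#define CGROUP_RELEASE_WHAT	0x00000400u	/* PROC_EVENT_CGROUP_RELEASE */
#define PROC_EVENT_WHAT_OFF	(16 + 20)	/* assumed offset of proc_event.what */

int attach_cgrelease_filter(int sock)
{
	/* classic BPF loads words in network byte order, hence htonl() */
	struct sock_filter insns[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, PROC_EVENT_WHAT_OFF),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
			 htonl(CGROUP_RELEASE_WHAT), 0, 1),
		BPF_STMT(BPF_RET | BPF_K, 0xffffffff),	/* pass whole packet */
		BPF_STMT(BPF_RET | BPF_K, 0),		/* drop everything else */
	};
	struct sock_fprog prog = {
		.len    = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};

	return setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER,
			  &prog, sizeof(prog));
}

Dropping unwanted messages in the kernel keeps the firehose cheap for
listeners that only care about cgroup release notifications.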
* Re: [PATCH 1/2] connector: add cgroup release event report to proc connector
From: Zefan Li @ 2015-05-27 11:22 UTC (permalink / raw)
To: Dimitri John Ledkov
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA

On 2015/5/27 6:07, Dimitri John Ledkov wrote:
> Add a kernel API to send a proc connector notification that a cgroup
> has become empty. A userspace daemon can then act upon such
> information, typically by cleaning up and removing the group, as it is
> no longer needed.
>
> [...]
>
> For precisely when cgroups emit this event, see the next patch against
> kernel/cgroup.c.

We really don't want yet another way for cgroup notification.

Systemd is happy with the cgroup.populated interface. Do you have any
real use case in mind that can't be satisfied with an inotify watch?
* Re: [PATCH 1/2] connector: add cgroup release event report to proc connector
From: Dimitri John Ledkov @ 2015-05-27 12:37 UTC (permalink / raw)
To: Zefan Li
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA

On 27 May 2015 at 12:22, Zefan Li <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> wrote:
> On 2015/5/27 6:07, Dimitri John Ledkov wrote:
>> [...]
>
> We really don't want yet another way for cgroup notification.

We do have multiple information sources for similar events in other
places... e.g. fork events can be tracked with ptrace and with the
proc connector, ditto other things.

> Systemd is happy with the cgroup.populated interface. Do you have any
> real use case in mind that can't be satisfied with an inotify watch?

cgroup.populated is not implemented in systemd and would require a lot
of inotify watches. Also it is only available in the unified hierarchy
and not exposed in the current one.

Also it will not allow anybody else to establish an inotify watch in a
timely manner, so anyone external to the cgroup's creator will not be
able to monitor cgroup.populated at the right time. With the proc
connector I was thinking that processes entering cgroups would be a
useful event as well, but I don't have a use case for that yet, so I'm
not sure what such an event should look like.

Would cgroup.populated be exposed on the legacy cgroup hierarchy? At
the moment I see about ~20ms of my ~200ms boot wasted on spawning the
cgroups agent, and I would like to get rid of that as soon as
possible. This patch solves it for me. (I have a matching patch to
connect to the proc connector and then feed notifications to systemd
via systemd's private API endpoint.)

Exposing cgroup.populated irrespective of the cgroup mount options
would be great, but would result in many watches being established,
each waiting for a once-in-a-lifecycle condition of a cgroup. IMHO
this is wasteful, but it would nonetheless be much better than
spawning the agent.

Would a patch that exposes cgroup.populated on the legacy cgroup
hierarchy be accepted? It is forward-compatible after all... or no?

--
Regards,

Dimitri.
Pura Vida!

https://clearlinux.org
Open Source Technology Center
Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ.
* Re: [PATCH 1/2] connector: add cgroup release event report to proc connector
From: Zefan Li @ 2015-05-28 3:30 UTC (permalink / raw)
To: Dimitri John Ledkov
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA

On 2015/5/27 20:37, Dimitri John Ledkov wrote:
> [...]
>
> cgroup.populated is not implemented in systemd and would require a lot
> of inotify watches.

I believe systemd will use cgroup.populated, though I don't know its
roadmap. Maybe it's waiting for the kernel to remove the experimental
flag from the unified hierarchy.

> Also it is only available in the unified hierarchy and not exposed in
> the current one.
>
> Also it will not allow anybody else to establish an inotify watch in a
> timely manner, so anyone external to the cgroup's creator will not be
> able to monitor cgroup.populated at the right time.

I guess this isn't a problem, as you can watch the IN_CREATE event,
and then you'll get notified when a cgroup is created.

> Exposing cgroup.populated irrespective of the cgroup mount options
> would be great, but would result in many watches being established,
> each waiting for a once-in-a-lifecycle condition of a cgroup. IMHO
> this is wasteful, but it would nonetheless be much better than
> spawning the agent.

Each inotify watch will consume a little memory, which should be
acceptable.

> Would a patch that exposes cgroup.populated on the legacy cgroup
> hierarchy be accepted? It is forward-compatible after all... or no?

I'm afraid not... All new features are done in the unified hierarchy,
and we've been refraining from adding them to the legacy hierarchy.
* Re: [PATCH 1/2] connector: add cgroup release event report to proc connector
From: Dimitri John Ledkov @ 2015-05-28 8:54 UTC (permalink / raw)
To: Zefan Li
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, Evgeniy Polyakov

On 28 May 2015 at 04:30, Zefan Li <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> wrote:
> On 2015/5/27 20:37, Dimitri John Ledkov wrote:
>> [...]
>
> I believe systemd will use cgroup.populated, though I don't know its
> roadmap. Maybe it's waiting for the kernel to remove the experimental
> flag from the unified hierarchy.

There is no code in master to support the unified hierarchy in systemd
that I can see. And more and more things rely on the current
hierarchy, especially around container-like technologies.

> I guess this isn't a problem, as you can watch the IN_CREATE event,
> and then you'll get notified when a cgroup is created.

It is a problem; there is no effective way to establish race-free
inotify watches, which is well known. Having a watch on
/sys/fs/cgroup, one has to establish an inotify watch on a directory
created there, and then another watch on cgroup.populated within it.
By that time a process could have already entered, run and exited.

> Each inotify watch will consume a little memory, which should be
> acceptable.
>
> I'm afraid not... All new features are done in the unified hierarchy,
> and we've been refraining from adding them to the legacy hierarchy.

Am I right to say it's been a year with little movement in the unified
hierarchy?! What's the current status on unmarking it experimental,
and/or what else needs doing in the kernel and/or userspace?

What you are saying is that we have an inefficient notification
mechanism that hammers everyone's boot time significantly, and no
current path to resolve it. What can I do to get us efficient cgroup
release notifications soon?

This patch set is a no-op if one doesn't subscribe from userspace, has
no other side effects that I can trivially see, and is very similar in
spirit to other notifications that the proc connector generates.
E.g. /proc/pid/comm is exposed as a file, yet there is a proc
connector notification about comm name changes as well. Maybe Evgeniy
can chip in on whether such a notification would be beneficial to the
proc connector.

--
Regards,

Dimitri.
Pura Vida!

https://clearlinux.org
Open Source Technology Center
Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ.
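
To make the race described above concrete, a minimal inotify-based
watcher might look like the sketch below (illustration only; it
assumes a unified hierarchy mounted at /sys/fs/cgroup and watches only
the top level). The window between the IN_CREATE event and the second
inotify_add_watch() call is where a short-lived task can enter and
leave the new cgroup unnoticed:

#include <stdio.h>
#include <limits.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(void)
{
	char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
		__attribute__((aligned(4)));
	int fd = inotify_init1(IN_CLOEXEC);
	ssize_t len;

	/* step 1: learn about new cgroup directories as they appear */
	inotify_add_watch(fd, "/sys/fs/cgroup", IN_CREATE);

	/* for brevity, only the first event of each read is handled */
	while ((len = read(fd, buf, sizeof(buf))) > 0) {
		struct inotify_event *ev = (struct inotify_event *)buf;

		if ((ev->mask & IN_CREATE) && (ev->mask & IN_ISDIR)) {
			char path[PATH_MAX];

			/*
			 * step 2: only now can the cgroup.populated watch be
			 * added -- a short-lived task may already have entered
			 * and left the new cgroup before this call returns.
			 */
			snprintf(path, sizeof(path),
				 "/sys/fs/cgroup/%s/cgroup.populated", ev->name);
			inotify_add_watch(fd, path, IN_MODIFY);
		}
	}
	return 0;
}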
* Re: [PATCH 1/2] connector: add cgroup release event report to proc connector
From: Evgeniy Polyakov @ 2015-06-11 15:28 UTC (permalink / raw)
To: Dimitri John Ledkov, Zefan Li
Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Hi

28.05.2015, 11:54, "Dimitri John Ledkov" <dimitri.j.ledkov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>:
> What you are saying is that we have an inefficient notification
> mechanism that hammers everyone's boot time significantly, and no
> current path to resolve it. What can I do to get us efficient cgroup
> release notifications soon?
>
> [...]

I understand your need for new notifications related to cgroups,
although I would rather put them into a separate module than the proc
connector - I'm pretty sure there will be quite a lot of extensions to
this module in the future. But if you do want to extend the proc
connector module, I'm OK with it, but it should go via its maintainer.
Thread overview: 7+ messages (end of thread; newest: 2015-06-11 15:28 UTC)
2015-05-26 22:07 [PATCH 1/2] connector: add cgroup release event report to proc connector Dimitri John Ledkov
2015-05-26 22:07 ` [PATCH 2/2] cgroup: report cgroup release event " Dimitri John Ledkov
2015-05-27 11:22 ` [PATCH 1/2] connector: add cgroup release event report " Zefan Li
2015-05-27 12:37 ` Dimitri John Ledkov
2015-05-28 3:30 ` Zefan Li
2015-05-28 8:54 ` Dimitri John Ledkov
2015-06-11 15:28 ` Evgeniy Polyakov