* [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
@ 2013-06-04 2:13 Tejun Heo
[not found] ` <20130604021302.GH29989-9pTldWuhBndy/B6EtB590w@public.gmane.org>
0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2013-06-04 2:13 UTC (permalink / raw)
To: Li Zefan
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Vivek Goyal,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
Some resources controlled by cgroup aren't per-task and cgroup core
allowing threads of a single thread_group to be in different cgroups
forced memcg to explicitly find the group leader and use it. This is
gonna be nasty when transitioning to unified hierarchy and in general
we don't want and won't support granularity finer than processes.
Mark "tasks" with CFTYPE_INSANE.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Cc: Balbir Singh <bsingharora-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
kernel/cgroup.c | 1 +
1 file changed, 1 insertion(+)
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4037,6 +4037,7 @@ static int cgroup_clone_children_write(s
static struct cftype files[] = {
{
.name = "tasks",
+ .flags = CFTYPE_INSANE, /* use "procs" instead */
.open = cgroup_tasks_open,
.write_u64 = cgroup_tasks_write,
.release = cgroup_pidlist_release,
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH cgroup/for-3.11 2/3] cgroup: mark "notify_on_release" and "release_agent" cgroup files insane
[not found] ` <20130604021302.GH29989-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2013-06-04 2:13 ` Tejun Heo
[not found] ` <20130604021355.GI29989-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-06-04 10:43 ` [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane Li Zefan
` (3 subsequent siblings)
4 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2013-06-04 2:13 UTC (permalink / raw)
To: Li Zefan
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Vivek Goyal,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
The empty cgroup notification mechanism currently implemented in
cgroup is tragically outdated. Forking and execing a userland process
stopped being a viable notification mechanism more than a decade ago.
We're gonna have a saner mechanism. Let's make it clear that this
abomination is going away.
Mark "notify_on_release" and "release_agent" with CFTYPE_INSANE.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
kernel/cgroup.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4052,6 +4052,7 @@ static struct cftype files[] = {
},
{
.name = "notify_on_release",
+ .flags = CFTYPE_INSANE,
.read_u64 = cgroup_read_notify_on_release,
.write_u64 = cgroup_write_notify_on_release,
},
@@ -4073,7 +4074,7 @@ static struct cftype files[] = {
},
{
.name = "release_agent",
- .flags = CFTYPE_ONLY_ON_ROOT,
+ .flags = CFTYPE_INSANE | CFTYPE_ONLY_ON_ROOT,
.read_seq_string = cgroup_release_agent_show,
.write_string = cgroup_release_agent_write,
.max_write_len = PATH_MAX,
* [PATCH cgroup/for-3.11 3/3] cgroup: clean up the cftype array for the base cgroup files
[not found] ` <20130604021355.GI29989-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2013-06-04 2:14 ` Tejun Heo
[not found] ` <20130604021434.GJ29989-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-06-04 10:47 ` [PATCH cgroup/for-3.11 2/3] cgroup: mark "notify_on_release" and "release_agent" cgroup files insane Li Zefan
1 sibling, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2013-06-04 2:14 UTC (permalink / raw)
To: Li Zefan
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Vivek Goyal,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
* Rename it from files[] (really?) to cgroup_base_files[].
* Drop CGROUP_FILE_GENERIC_PREFIX which was defined as "cgroup." and
used inconsistently. Just use "cgroup." directly.
* Collect insane files at the end. Note that only the insane ones are
missing "cgroup." prefix.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
kernel/cgroup.c | 47 ++++++++++++++++++++++++-----------------------
1 file changed, 24 insertions(+), 23 deletions(-)
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4029,35 +4029,16 @@ static int cgroup_clone_children_write(s
return 0;
}
-/*
- * for the common functions, 'private' gives the type of file
- */
-/* for hysterical raisins, we can't put this on the older files */
-#define CGROUP_FILE_GENERIC_PREFIX "cgroup."
-static struct cftype files[] = {
+static struct cftype cgroup_base_files[] = {
{
- .name = "tasks",
- .flags = CFTYPE_INSANE, /* use "procs" instead */
- .open = cgroup_tasks_open,
- .write_u64 = cgroup_tasks_write,
- .release = cgroup_pidlist_release,
- .mode = S_IRUGO | S_IWUSR,
- },
- {
- .name = CGROUP_FILE_GENERIC_PREFIX "procs",
+ .name = "cgroup.procs",
.open = cgroup_procs_open,
.write_u64 = cgroup_procs_write,
.release = cgroup_pidlist_release,
.mode = S_IRUGO | S_IWUSR,
},
{
- .name = "notify_on_release",
- .flags = CFTYPE_INSANE,
- .read_u64 = cgroup_read_notify_on_release,
- .write_u64 = cgroup_write_notify_on_release,
- },
- {
- .name = CGROUP_FILE_GENERIC_PREFIX "event_control",
+ .name = "cgroup.event_control",
.write_string = cgroup_write_event_control,
.mode = S_IWUGO,
},
@@ -4072,6 +4053,26 @@ static struct cftype files[] = {
.flags = CFTYPE_ONLY_ON_ROOT,
.read_seq_string = cgroup_sane_behavior_show,
},
+
+ /*
+ * Historical crazy stuff. These don't have "cgroup." prefix and
+ * don't exist if sane_behavior. If you're depending on these, be
+ * prepared to be burned.
+ */
+ {
+ .name = "tasks",
+ .flags = CFTYPE_INSANE, /* use "procs" instead */
+ .open = cgroup_tasks_open,
+ .write_u64 = cgroup_tasks_write,
+ .release = cgroup_pidlist_release,
+ .mode = S_IRUGO | S_IWUSR,
+ },
+ {
+ .name = "notify_on_release",
+ .flags = CFTYPE_INSANE,
+ .read_u64 = cgroup_read_notify_on_release,
+ .write_u64 = cgroup_write_notify_on_release,
+ },
{
.name = "release_agent",
.flags = CFTYPE_INSANE | CFTYPE_ONLY_ON_ROOT,
@@ -4095,7 +4096,7 @@ static int cgroup_populate_dir(struct cg
struct cgroup_subsys *ss;
if (base_files) {
- err = cgroup_addrm_files(cgrp, NULL, files, true);
+ err = cgroup_addrm_files(cgrp, NULL, cgroup_base_files, true);
if (err < 0)
return err;
}
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130604021302.GH29989-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-06-04 2:13 ` [PATCH cgroup/for-3.11 2/3] cgroup: mark "notify_on_release" and "release_agent" cgroup files insane Tejun Heo
@ 2013-06-04 10:43 ` Li Zefan
2013-06-04 11:15 ` Daniel P. Berrange
` (2 subsequent siblings)
4 siblings, 0 replies; 33+ messages in thread
From: Li Zefan @ 2013-06-04 10:43 UTC (permalink / raw)
To: Tejun Heo
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Vivek Goyal,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
On 2013/6/4 10:13, Tejun Heo wrote:
> Some resources controlled by cgroup aren't per-task and cgroup core
> allowing threads of a single thread_group to be in different cgroups
> forced memcg to explicitly find the group leader and use it. This is
> gonna be nasty when transitioning to unified hierarchy and in general
> we don't want and won't support granularity finer than processes.
>
> Mark "tasks" with CFTYPE_INSANE.
>
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> Cc: Balbir Singh <bsingharora-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
* Re: [PATCH cgroup/for-3.11 2/3] cgroup: mark "notify_on_release" and "release_agent" cgroup files insane
[not found] ` <20130604021355.GI29989-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-06-04 2:14 ` [PATCH cgroup/for-3.11 3/3] cgroup: clean up the cftype array for the base cgroup files Tejun Heo
@ 2013-06-04 10:47 ` Li Zefan
[not found] ` <51ADC5D3.5070108-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
1 sibling, 1 reply; 33+ messages in thread
From: Li Zefan @ 2013-06-04 10:47 UTC (permalink / raw)
To: Tejun Heo
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Vivek Goyal,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
On 2013/6/4 10:13, Tejun Heo wrote:
> The empty cgroup notification mechanism currently implemented in
> cgroup is tragically outdated. Forking and execing a userland process
> stopped being a viable notification mechanism more than a decade ago.
> We're gonna have a saner mechanism. Let's make it clear that this
> abomination is going away.
>
> Mark "notify_on_release" and "release_agent" with CFTYPE_INSANE.
>
Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Do you plan to implement the alternative mechanism for 3.11?
* Re: [PATCH cgroup/for-3.11 3/3] cgroup: clean up the cftype array for the base cgroup files
[not found] ` <20130604021434.GJ29989-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2013-06-04 10:49 ` Li Zefan
0 siblings, 0 replies; 33+ messages in thread
From: Li Zefan @ 2013-06-04 10:49 UTC (permalink / raw)
To: Tejun Heo
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Vivek Goyal,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
On 2013/6/4 10:14, Tejun Heo wrote:
> * Rename it from files[] (really?) to cgroup_base_files[].
>
> * Drop CGROUP_FILE_GENERIC_PREFIX which was defined as "cgroup." and
> used inconsistently. Just use "cgroup." directly.
>
> * Collect insane files at the end. Note that only the insane ones are
> missing "cgroup." prefix.
>
> This patch doesn't introduce any functional changes.
>
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130604021302.GH29989-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-06-04 2:13 ` [PATCH cgroup/for-3.11 2/3] cgroup: mark "notify_on_release" and "release_agent" cgroup files insane Tejun Heo
2013-06-04 10:43 ` [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane Li Zefan
@ 2013-06-04 11:15 ` Daniel P. Berrange
[not found] ` <20130604111556.GA4963-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-06-04 11:21 ` Michal Hocko
2013-06-05 19:03 ` Tejun Heo
4 siblings, 1 reply; 33+ messages in thread
From: Daniel P. Berrange @ 2013-06-04 11:15 UTC (permalink / raw)
To: Tejun Heo
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal
On Mon, Jun 03, 2013 at 07:13:02PM -0700, Tejun Heo wrote:
> Some resources controlled by cgroup aren't per-task and cgroup core
> allowing threads of a single thread_group to be in different cgroups
> forced memcg to explicitly find the group leader and use it. This is
> gonna be nasty when transitioning to unified hierarchy and in general
> we don't want and won't support granularity finer than processes.
With libvirt and KVM we require the ability to put different threads
in different cgroups for the "cpu", "cpuset" & "cpuacct" controllers.
This is to allow us to control scheduler tunables / placement for
QEMU vCPU threads, independently of limits for QEMU I/O threads. So
requiring all threads of a process to be in the same cgroup isn't
sufficiently flexible for our needs.
We don't care about this for the other non-CPU related controllers
though.
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130604021302.GH29989-9pTldWuhBndy/B6EtB590w@public.gmane.org>
` (2 preceding siblings ...)
2013-06-04 11:15 ` Daniel P. Berrange
@ 2013-06-04 11:21 ` Michal Hocko
[not found] ` <20130604112139.GD31242-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-06-05 19:03 ` Tejun Heo
4 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2013-06-04 11:21 UTC (permalink / raw)
To: Tejun Heo
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Vivek Goyal,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
On Mon 03-06-13 19:13:02, Tejun Heo wrote:
> Some resources controlled by cgroup aren't per-task and cgroup core
> allowing threads of a single thread_group to be in different cgroups
> forced memcg to explicitly find the group leader and use it. This is
> gonna be nasty when transitioning to unified hierarchy and in general
> we don't want and won't support granularity finer than processes.
>
> Mark "tasks" with CFTYPE_INSANE.
Hmm, I wasn't aware that procs is a better interface to work with
entities in the group, so I was using tasks, which worked well for memcg.
I am afraid I am not the only one. Can we get a warning when somebody
opens the file?
That being said, I do not object to the removal; please just add a
warning to let people know that procs is the preferred interface.
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> Cc: Balbir Singh <bsingharora-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
> kernel/cgroup.c | 1 +
> 1 file changed, 1 insertion(+)
>
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -4037,6 +4037,7 @@ static int cgroup_clone_children_write(s
> static struct cftype files[] = {
> {
> .name = "tasks",
> + .flags = CFTYPE_INSANE, /* use "procs" instead */
> .open = cgroup_tasks_open,
> .write_u64 = cgroup_tasks_write,
> .release = cgroup_pidlist_release,
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Michal Hocko
SUSE Labs
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130604111556.GA4963-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-06-04 14:34 ` Vivek Goyal
[not found] ` <20130604143444.GI4799-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-06-04 20:19 ` Tejun Heo
1 sibling, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2013-06-04 14:34 UTC (permalink / raw)
To: Daniel P. Berrange
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA
On Tue, Jun 04, 2013 at 12:15:56PM +0100, Daniel P. Berrange wrote:
> On Mon, Jun 03, 2013 at 07:13:02PM -0700, Tejun Heo wrote:
> > Some resources controlled by cgroup aren't per-task and cgroup core
> > allowing threads of a single thread_group to be in different cgroups
> > forced memcg to explicitly find the group leader and use it. This is
> > gonna be nasty when transitioning to unified hierarchy and in general
> > we don't want and won't support granularity finer than processes.
>
> With libvirt and KVM we require the ability to put different threads
> in different cgroups for the "cpu", "cpuset" & "cpuacct" controllers.
> This is to allow us to control scheduler tunables / placement for
> QEMU vCPU threads, independently of limits for QEMU I/O threads. So
> requiring all threads of a process to be in the same cgroup isn't
> sufficiently flexible for our needs.
For placement of vCPU threads, can we set per thread cpu affinity
(sched_setaffinity()), instead of using cgroups for that purpose.
Apart from cpu affinity, what scheduling parameters we want different
between different threads.
Thanks
Vivek
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130604143444.GI4799-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-06-04 14:50 ` Daniel P. Berrange
[not found] ` <20130604145008.GV4963-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 33+ messages in thread
From: Daniel P. Berrange @ 2013-06-04 14:50 UTC (permalink / raw)
To: Vivek Goyal
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA
On Tue, Jun 04, 2013 at 10:34:44AM -0400, Vivek Goyal wrote:
> On Tue, Jun 04, 2013 at 12:15:56PM +0100, Daniel P. Berrange wrote:
> > On Mon, Jun 03, 2013 at 07:13:02PM -0700, Tejun Heo wrote:
> > > Some resources controlled by cgroup aren't per-task and cgroup core
> > > allowing threads of a single thread_group to be in different cgroups
> > > forced memcg to explicitly find the group leader and use it. This is
> > > gonna be nasty when transitioning to unified hierarchy and in general
> > > we don't want and won't support granularity finer than processes.
> >
> > With libvirt and KVM we require the ability to put different threads
> > in different cgroups for the "cpu", "cpuset" & "cpuacct" controllers.
> > This is to allow us to control scheduler tunables / placement for
> > QEMU vCPU threads, independently of limits for QEMU I/O threads. So
> > requiring all threads of a process to be in the same cgroup isn't
> > sufficiently flexible for our needs.
>
> For placement of vCPU threads, can we set per thread cpu affinity
> (sched_setaffinity()), instead of using cgroups for that purpose.
sched_setaffinity can't override affinity already set in the
cgroup. So this won't allow for disjoint affinity sets between
threads. ie if you use cgroups to bind the process to pCPU 1
(to apply all possible non-vCPU threads) and then want to bind
vCPU threads to pCPU 2 you can't do it.
eg for cpu/cpuacct/cpuset controllers we have a setup
<domain cgroup> 0 threads
|
+- vcpu0 1 thread
+- vcpu1 1 thread
+- emulator n threads
and want complete independence in settings for each of these child
cgroups.
> Apart from cpu affinity, what scheduling parameters we want different
> between different threads.
Placement isn't the big deal - it is really the cpu.cfs_period_us,
cpu.cfs_quota_us and cpu.shares settings that are important ones,
along with cpuacct.{stat,usage,usage_percpu} to track utilization
across multiple threads.
For cpuacct, if we only had 1 cgroup for all threads, we'd have to
read the process's overall usage and then subtract usage of individual
threads. This would really be a step backwards, throwing away the
benefits that cgroups brought in allowing setup of arbitrary groupings of
tasks :-(
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130604145008.GV4963-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-06-04 15:12 ` Vivek Goyal
[not found] ` <20130604151236.GA7555-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 33+ messages in thread
From: Vivek Goyal @ 2013-06-04 15:12 UTC (permalink / raw)
To: Daniel P. Berrange
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA
On Tue, Jun 04, 2013 at 03:50:08PM +0100, Daniel P. Berrange wrote:
> On Tue, Jun 04, 2013 at 10:34:44AM -0400, Vivek Goyal wrote:
> > On Tue, Jun 04, 2013 at 12:15:56PM +0100, Daniel P. Berrange wrote:
> > > On Mon, Jun 03, 2013 at 07:13:02PM -0700, Tejun Heo wrote:
> > > > Some resources controlled by cgroup aren't per-task and cgroup core
> > > > allowing threads of a single thread_group to be in different cgroups
> > > > forced memcg to explicitly find the group leader and use it. This is
> > > > gonna be nasty when transitioning to unified hierarchy and in general
> > > > we don't want and won't support granularity finer than processes.
> > >
> > > With libvirt and KVM we require the ability to put different threads
> > > in different cgroups for the "cpu", "cpuset" & "cpuacct" controllers.
> > > This is to allow us to control scheduler tunables / placement for
> > > QEMU vCPU threads, independently of limits for QEMU I/O threads. So
> > > requiring all threads of a process to be in the same cgroup isn't
> > > sufficiently flexible for our needs.
> >
> > For placement of vCPU threads, can we set per thread cpu affinity
> > (sched_setaffinity()), instead of using cgroups for that purpose.
>
> > sched_setaffinity can't override affinity already set in the
> cgroup. So this won't allow for disjoint affinity sets between
> threads. ie if you use cgroups to bind the process to pCPU 1
> (to apply all possible non-vCPU threads) and then want to bind
> vCPU threads to pCPU 2 you can't do it.
>
I thought we don't have to override affinity set in cgroup. Instead
subdivide that among its child tasks as needed.
So in the above example, we would allow the cgroup to have both pcpu1 and pcpu2
and then set affinity for vcpu threads as well as non-vcpu threads.
> eg for cpu/cpuacct/cpuset controllers we have a setup
>
> <domain cgroup> 0 threads
> |
> +- vcpu0 1 thread
> +- vcpu1 1 thread
> +- emulator n threads
>
> and want complete independence in settings for each of these child
> cgroups.
I guess this will not work with a single hierarchy, as controllers like
blkio don't support putting threads of a process in separate groups. All
threads of a process share an iocontext, and an iocontext is associated
with a cgroup.
>
> > Apart from cpu affinity, what scheduling parameters we want different
> > between different threads.
>
> Placement isn't the big deal - it is really the cpu.cfs_period_us,
> cpu.cfs_quota_us and cpu.shares settings that are important ones,
> along with cpuacct.{stat,usage,usage_percpu} to track utilization
> across multiple threads.
Yes, upper-limiting cpu usage will become unavailable at the thread level
if we make this change. I guess customers don't care, but libvirt might
internally want to upper-limit cpu usage of a group of threads. Don't
know why, though. And we don't have this feature available per thread.
I am hoping there is a way to set priority per thread, and that should
be able to emulate cpu.shares at a thread level.
>
> For cpuacct, if we only had 1 cgroup for all threads, we'd have to
> read the process's overall usage and then subtract usage of individual
> threads. This would really be a step backwards, throwing away the
> benefits that cgroups brought in allowing setup arbitrary grouping of
> tasks :-(
So these per-thread utilization stats are exported to the user. Curious, in
general how is this per-thread/group_of_some_threads data useful?
Thanks
Vivek
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130604151236.GA7555-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-06-04 15:25 ` Daniel P. Berrange
0 siblings, 0 replies; 33+ messages in thread
From: Daniel P. Berrange @ 2013-06-04 15:25 UTC (permalink / raw)
To: Vivek Goyal
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA
On Tue, Jun 04, 2013 at 11:12:36AM -0400, Vivek Goyal wrote:
> On Tue, Jun 04, 2013 at 03:50:08PM +0100, Daniel P. Berrange wrote:
> > On Tue, Jun 04, 2013 at 10:34:44AM -0400, Vivek Goyal wrote:
> > > On Tue, Jun 04, 2013 at 12:15:56PM +0100, Daniel P. Berrange wrote:
> > > > On Mon, Jun 03, 2013 at 07:13:02PM -0700, Tejun Heo wrote:
> > > > > Some resources controlled by cgroup aren't per-task and cgroup core
> > > > > allowing threads of a single thread_group to be in different cgroups
> > > > > forced memcg to explicitly find the group leader and use it. This is
> > > > > gonna be nasty when transitioning to unified hierarchy and in general
> > > > > we don't want and won't support granularity finer than processes.
> > > >
> > > > With libvirt and KVM we require the ability to put different threads
> > > > in different cgroups for the "cpu", "cpuset" & "cpuacct" controllers.
> > > > This is to allow us to control scheduler tunables / placement for
> > > > QEMU vCPU threads, independently of limits for QEMU I/O threads. So
> > > > requiring all threads of a process to be in the same cgroup isn't
> > > > sufficiently flexible for our needs.
> > >
> > > For placement of vCPU threads, can we set per thread cpu affinity
> > > (sched_setaffinity()), instead of using cgroups for that purpose.
> >
> > sched_setaffinity can't override affinity already set in the
> > cgroup. So this won't allow for disjoint affinity sets between
> > threads. ie if you use cgroups to bind the process to pCPU 1
> > (to apply all possible non-vCPU threads) and then want to bind
> > vCPU threads to pCPU 2 you can't do it.
> >
>
> I thought we don't have to override affinity set in cgroup. Instead
> subdivide that among its child tasks as needed.
>
> So in the above example, we would allow the cgroup to have both pcpu1 and pcpu2
> and then set affinity for vcpu threads as well as non-vcpu threads.
> > eg for cpu/cpuacct/cpuset controllers we have a setup
> >
> > <domain cgroup> 0 threads
> > |
> > +- vcpu0 1 thread
> > +- vcpu1 1 thread
> > +- emulator n threads
> >
> > and want complete independence in settings for each of these child
> > cgroups.
>
> I guess this will not work with a single hierarchy, as controllers like
> blkio don't support putting threads of a process in separate groups. All
> threads of a process share an iocontext, and an iocontext is associated
> with a cgroup.
IIUC, even with the unified hierarchy, we're not going to be co-mounting all
controllers at the same mount point. The hierarchy libvirt creates is
the same structure across all controllers, it just has a couple of
extra leaves at the bottom in the cpu,cpuacct,cpuset controllers.
> > > Apart from cpu affinity, what scheduling parameters we want different
> > > between different threads.
> >
> > Placement isn't the big deal - it is really the cpu.cfs_period_us,
> > cpu.cfs_quota_us and cpu.shares settings that are important ones,
> > along with cpuacct.{stat,usage,usage_percpu} to track utilization
> > across multiple threads.
>
> Yes, upper-limiting cpu usage will become unavailable at the thread level
> if we make this change. I guess customers don't care, but libvirt might
> internally want to upper-limit cpu usage of a group of threads. Don't
> know why, though. And we don't have this feature available per thread.
>
> I am hoping there is a way to set priority per thread, and that should
> be able to emulate cpu.shares at a thread level.
As described, we already need to set priority on groups of threads to
be able to control all QEMU non-VCPU threads as a single set. Per-thread
settings are too fine and per-process settings are too coarse.
> > For cpuacct, if we only had 1 cgroup for all threads, we'd have to
> > read the process's overall usage and then subtract usage of individual
> > threads. This would really be a step backwards, throwing away the
> benefits that cgroups brought in allowing setup of arbitrary groupings of
> > tasks :-(
>
> So these per-thread utilization stats are exported to the user. Curious, in
> general how is this per-thread/group_of_some_threads data useful?
It is all about distinguishing utilization of non-vCPU threads in QEMU
from vCPU threads. We have users who make use of this feature in their
mgmt tools.
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
* Re: [PATCH cgroup/for-3.11 2/3] cgroup: mark "notify_on_release" and "release_agent" cgroup files insane
[not found] ` <51ADC5D3.5070108-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
@ 2013-06-04 20:00 ` Tejun Heo
[not found] ` <20130604200003.GC14916-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2013-06-04 20:00 UTC (permalink / raw)
To: Li Zefan
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
cgroups-u79uwXL29TY76Z2rM5mHXA, Johannes Weiner, Michal Hocko,
Balbir Singh, KAMEZAWA Hiroyuki, Vivek Goyal,
kay.sievers-tD+1rO4QERM, lennart-mdGvqq1h2p+GdvJs77BJ7Q
On Tue, Jun 04, 2013 at 06:47:47PM +0800, Li Zefan wrote:
> On 2013/6/4 10:13, Tejun Heo wrote:
> > The empty cgroup notification mechanism currently implemented in
> > cgroup is tragically outdated. Forking and execing a userland process
> > stopped being a viable notification mechanism more than a decade ago.
> > We're gonna have a saner mechanism. Let's make it clear that this
> > abomination is going away.
> >
> > Mark "notify_on_release" and "release_agent" with CFTYPE_INSANE.
> >
>
> Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>
> Do you plan to implement the alternative mechanism for 3.11?
No idea about the timeframe at this moment. Hoping to do it this year.
Thanks.
--
tejun
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130604112139.GD31242-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-06-04 20:01 ` Tejun Heo
[not found] ` <20130604200149.GD14916-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2013-06-04 20:01 UTC (permalink / raw)
To: Michal Hocko
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Vivek Goyal,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
On Tue, Jun 04, 2013 at 01:21:39PM +0200, Michal Hocko wrote:
> On Mon 03-06-13 19:13:02, Tejun Heo wrote:
> > Some resources controlled by cgroup aren't per-task and cgroup core
> > allowing threads of a single thread_group to be in different cgroups
> > forced memcg to explicitly find the group leader and use it. This is
> > gonna be nasty when transitioning to unified hierarchy and in general
> > we don't want and won't support granularity finer than processes.
> >
> > Mark "tasks" with CFTYPE_INSANE.
>
> Hmm, I wasn't aware that procs is a better interface to work with
> entities in the group so I was using tasks which worked well for memcg.
> I am afraid I am not the only one. Can we get a warning when somebody
> opens the file?
>
> That being said, I do not object against removal, please just add a
> warning to let people know that procs is a preferred interface.
Hmmm... I don't know. For users of multiple hierarchies, tasks are
fine. It's only gonna be an issue when we transition to unified
hierarchy where a lot of other things would change too. I'm not sure
whether it'd be worthwhile to generate a warning now for everyone.
Li, what do you think?
Thanks.
--
tejun
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130604111556.GA4963-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-06-04 14:34 ` Vivek Goyal
@ 2013-06-04 20:19 ` Tejun Heo
[not found] ` <20130604201947.GE14916-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
1 sibling, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2013-06-04 20:19 UTC (permalink / raw)
To: Daniel P. Berrange
Cc: Li Zefan, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Vivek Goyal,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
Hey, Daniel.
On Tue, Jun 04, 2013 at 12:15:56PM +0100, Daniel P. Berrange wrote:
> With libvirt and KVM we require the ability to put different threads
I really don't think cgroup has ever been intended (if there were ever
any such overall intent) or suited for something as fine grained
as in-process resource management. There already are existing
per-thread interfaces for that. Please use them instead. cgroup
simply doesn't fit.
> in different cgroups for the "cpu", "cpuset" & "cpuacct" controllers.
cpu and cpuacct are in the process of being merged. The scheduler
people hate the duplicate accounting the separation causes and cpuacct
is generally considered a mistake that we shouldn't repeat. So, umm,
you're really depending on a lot of things which are considered big
mistakes in cgroup.
> > This is to allow us to control scheduler tunables / placement for
> > QEMU vCPU threads, independently of limits for QEMU I/O threads. So
> requiring all threads of a process to be in the same cgroup isn't
> sufficiently flexible for our needs.
It was never suited to that level of flexibility and it will never be
and things like that will be clearly forbidden rather than being left
in the current "not fully supported but kinda works" state. The
existing stuff won't break but new things won't keep the support. If
you're fine with staying with the old interface, which will be around
for the foreseeable future, that's fine too, but if you intend to move
onto the new interface when it finally becomes ready, whenever that
is, please move on.
Thanks.
--
tejun
* Re: [PATCH cgroup/for-3.11 2/3] cgroup: mark "notify_on_release" and "release_agent" cgroup files insane
[not found] ` <20130604200003.GC14916-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2013-06-05 6:47 ` Glauber Costa
0 siblings, 0 replies; 33+ messages in thread
From: Glauber Costa @ 2013-06-05 6:47 UTC (permalink / raw)
To: Tejun Heo
Cc: Li Zefan, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Vivek Goyal,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
On 06/05/2013 12:00 AM, Tejun Heo wrote:
> On Tue, Jun 04, 2013 at 06:47:47PM +0800, Li Zefan wrote:
>> On 2013/6/4 10:13, Tejun Heo wrote:
>>> The empty cgroup notification mechanism currently implemented in
>>> cgroup is tragically outdated. Forking and execing a userland process
>>> stopped being a viable notification mechanism more than a decade ago.
>>> We're gonna have a saner mechanism. Let's make it clear that this
>>> abomination is going away.
>>>
>>> Mark "notify_on_release" and "release_agent" with CFTYPE_INSANE.
>>>
>>
>> Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>
>> Do you plan to implement the alternative mechanism for 3.11?
>
> No idea about the timeframe at this moment. Hoping to do it this
> year.
>
A notification mechanism would be very welcome for us. The problem we
have now is that if a container halts itself, we have no way to do any
cleanup. Doing it would involve having a daemon scan for
no-longer-alive groups and shut them down.
The current mechanism proved unusable for us as well, because it is
global and used by systemd. We would like to register our own handler
(it would be fine if systemd allowed us to do so while remaining the
only one hooked to the global handler).
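For readers following along, the mechanism being deprecated works roughly as below. This is a minimal sketch: the mount point and handler path are illustrative, and the script is dry-run by default, so it only prints the writes it would make to cgroupfs.

```shell
# Sketch of the fork/exec-based "release" notification under discussion.
# DRYRUN=1 (the default) prints each action instead of touching cgroupfs.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

CG=/sys/fs/cgroup/memory
# One global handler per hierarchy -- the limitation described above:
run sh -c "echo /usr/local/bin/cleanup-cgroup > $CG/release_agent"
# Opt a group in; when its last task exits, the kernel fork+execs the
# handler with the group's relative path as argv[1].
run sh -c "echo 1 > $CG/mygroup/notify_on_release"
```

Because the handler is a single hierarchy-wide setting, a second consumer running alongside systemd cannot register its own handler without displacing the existing one.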
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130604200149.GD14916-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2013-06-05 9:49 ` Li Zefan
0 siblings, 0 replies; 33+ messages in thread
From: Li Zefan @ 2013-06-05 9:49 UTC (permalink / raw)
To: Tejun Heo
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Vivek Goyal,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
On 2013/6/5 4:01, Tejun Heo wrote:
> On Tue, Jun 04, 2013 at 01:21:39PM +0200, Michal Hocko wrote:
>> On Mon 03-06-13 19:13:02, Tejun Heo wrote:
>>> Some resources controlled by cgroup aren't per-task and cgroup core
>>> allowing threads of a single thread_group to be in different cgroups
>>> forced memcg to explicitly find the group leader and use it. This is
>>> gonna be nasty when transitioning to unified hierarchy and in general
>>> we don't want and won't support granularity finer than processes.
>>>
>>> Mark "tasks" with CFTYPE_INSANE.
>>
>> Hmm, I wasn't aware that procs is a better interface to work with
>> entities in the group so I was using tasks which worked well for memcg.
>> I am afraid I am not the only one. Can we get a warning when somebody
>> opens the file?
>>
>> That being said, I do not object against removal, please just add a
>> warning to let people know that procs is a preferred interface.
>
> Hmmm... I don't know. For users of multiple hierarchies, tasks are
> fine. It's only gonna be an issue when we transition to unified
> hierarchy where a lot of other things would change too. I'm not sure
> whether it'd be worthwhile to generate a warning now for everyone.
> Li, what do you think?
>
I think users who use cgroup for serious work should already know that
"tasks" can be used to attach a thread only, and the "cgroup.procs" file
is documented in Documentation/cgroups/cgroups.txt.
As we'll keep allowing multiple hierarchies for the foreseeable future, and
we don't have a timeline for forbidding them, I don't think we want to
print a warning for now.
And we don't generate warnings for other "insane" behaviors currently.
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130604201947.GE14916-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2013-06-05 18:52 ` Tejun Heo
2013-06-06 7:48 ` Li Zefan
2013-06-06 9:20 ` Daniel P. Berrange
2 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2013-06-05 18:52 UTC (permalink / raw)
To: Daniel P. Berrange
Cc: Li Zefan, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Vivek Goyal,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
Hey, again.
On Tue, Jun 04, 2013 at 01:19:47PM -0700, Tejun Heo wrote:
> It was never suited to that level of flexibility and it will never be
> and things like that will be clearly forbidden rather than being left
> in the current "not fully supported but kinda works" state.
I'm gonna commit these three patches. I hope that the existing
in-process usages can move on to other mechanisms but recognize that
it's an actual use case and we may eventually decide to allow further
nesting of cgroups at the leaf and allow separating threads of the same
process with only per-task controllers. I *strongly* hope it doesn't
come to that tho.
As __DEVEL__sane_behavior is currently the playground for testing new
stuff, I'm gonna commit these for now. If it doesn't pan out, we can
add it back, but preferably under a new name "cgroup.tasks" and with a
whole lot more restrictions.
Thanks.
--
tejun
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130604021302.GH29989-9pTldWuhBndy/B6EtB590w@public.gmane.org>
` (3 preceding siblings ...)
2013-06-04 11:21 ` Michal Hocko
@ 2013-06-05 19:03 ` Tejun Heo
4 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2013-06-05 19:03 UTC (permalink / raw)
To: Li Zefan
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
cgroups-u79uwXL29TY76Z2rM5mHXA, Johannes Weiner, Michal Hocko,
Balbir Singh, KAMEZAWA Hiroyuki, Vivek Goyal,
kay.sievers-tD+1rO4QERM, lennart-mdGvqq1h2p+GdvJs77BJ7Q
On Mon, Jun 03, 2013 at 07:13:02PM -0700, Tejun Heo wrote:
> Some resources controlled by cgroup aren't per-task and cgroup core
> allowing threads of a single thread_group to be in different cgroups
> forced memcg to explicitly find the group leader and use it. This is
> gonna be nasty when transitioning to unified hierarchy and in general
> we don't want and won't support granularity finer than processes.
>
> Mark "tasks" with CFTYPE_INSANE.
Applied 1-3 to cgroup/for-3.11.
Thanks.
--
tejun
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130604201947.GE14916-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2013-06-05 18:52 ` Tejun Heo
@ 2013-06-06 7:48 ` Li Zefan
2013-06-06 9:20 ` Daniel P. Berrange
2 siblings, 0 replies; 33+ messages in thread
From: Li Zefan @ 2013-06-06 7:48 UTC (permalink / raw)
To: Tejun Heo
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal
On 2013/6/5 4:19, Tejun Heo wrote:
> Hey, Daniel.
>
> On Tue, Jun 04, 2013 at 12:15:56PM +0100, Daniel P. Berrange wrote:
>> With libvirt and KVM we require the ability to put different threads
>
> I really don't think cgroup has ever been intended (if there were ever
> any such overall intent) or suited for something as fine grained
> as in-process resource management. There already are existing
> per-thread interfaces for that. Please use them instead. cgroup
> simply doesn't fit.
>
>> in different cgroups for the "cpu", "cpuset" & "cpuacct" controllers.
>
> cpu and cpuacct are in the process of being merged. The scheduler
> people hate the duplicate accounting the separation causes and cpuacct
> is generally considered a mistake that we shouldn't repeat. So, umm,
> you're really depending on a lot of things which are considered big
> mistakes in cgroup.
>
>> This is to allow us to control scheduler tunables / placement for
>> QEMU vCPU threads, independently of limits for QEMU I/O threads. So
>> requiring all threads of a process to be in the same cgroup isn't
>> sufficiently flexible for our needs.
>
> It was never suited to that level of flexibility and it will never be
> and things like that will be clearly forbidden rather than being left
> in the current "not fully supported but kinda works" state. The
> existing stuff won't break but new things won't keep the support. If
> you're fine with staying with the old interface, which will be around
> for the foreseeable future, that's fine too,
I don't think this can be an option for libvirt. If some other user
applications (systemd, for example) use the new unified hierarchy interface,
then the old interface is no longer available to libvirt.
> but if you intend to move
> onto the new interface when it finally becomes ready, whenever that
> is, please move on.
>
> Thanks.
>
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130604201947.GE14916-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2013-06-05 18:52 ` Tejun Heo
2013-06-06 7:48 ` Li Zefan
@ 2013-06-06 9:20 ` Daniel P. Berrange
[not found] ` <20130606092055.GF30217-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2 siblings, 1 reply; 33+ messages in thread
From: Daniel P. Berrange @ 2013-06-06 9:20 UTC (permalink / raw)
To: Tejun Heo
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal
On Tue, Jun 04, 2013 at 01:19:47PM -0700, Tejun Heo wrote:
> Hey, Daniel.
>
> On Tue, Jun 04, 2013 at 12:15:56PM +0100, Daniel P. Berrange wrote:
> > With libvirt and KVM we require the ability to put different threads
>
> I really don't think cgroup has ever been intended (if there were ever
> any such overall intent) or suited for something as fine grained
> as in-process resource management. There already are existing
> per-thread interfaces for that. Please use them instead. cgroup
> simply doesn't fit.
Unless I'm mistaken there is no alternative that can work. With QEMU
we need to apply scheduling controls to
1. Individual vCPU threads
2. All non-vCPU threads (ie QEMU's I/O threads)
We can use per-thread APIs for 1, but for 2 we require something that
applies to the group of threads as a whole, without also impacting the
controls set for the vCPU threads. AFAIK, nothing except cgroups as
we use them today can satisfy that requirement? Am I wrong? Is there
something else that can achieve this same setup?
> > in different cgroups for the "cpu", "cpuset" & "cpuacct" controllers.
>
> cpu and cpuacct are in the process of being merged. The scheduler
> people hate the duplicate accounting the separation causes and cpuacct
> is generally considered a mistake that we shouldn't repeat. So, umm,
> you're really depending on a lot of things which are considered big
> mistakes in cgroup.
Merging cpu + cpuacct together is not a problem - they're already
co-mounted by systemd. What I'm saying is that for cpu, cpuset
and cpuacct we create
/some/path/
|
+- domain-cgroup
|
+- vcpu0 - thread for cpu 0
+- vcpu1 - thread for cpu 1
+- emulator - all other non-vCPU threads
We can't leave the non-vcpu threads at the higher level, because
then limits applied at the 'domain-cgroup' level would impact
the vcpu threads.
while for all other controllers (memory, blkio, etc) we create
/some/path
|
+- domain-cgroup - all threads
The directory structure is the same in all controllers, except that
with the cpu, cpuset + cpuacct controllers, we create 2 further leaf
nodes.
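The layout described above can be sketched directly against cgroupfs. The mount point, $QEMU_PID, and the vCPU TID variables are placeholders, and the script is dry-run by default, only printing the operations it would perform:

```shell
# Sketch of libvirt's per-domain layout on a v1 "cpu" hierarchy.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

D=/sys/fs/cgroup/cpu/machine/domain-cgroup
run mkdir -p "$D/vcpu0" "$D/vcpu1" "$D/emulator"

# vCPU threads get their own leaves via the thread-granular "tasks" file;
# every remaining thread of the QEMU process lands in "emulator".
run sh -c "echo \$VCPU0_TID > $D/vcpu0/tasks"
run sh -c "echo \$VCPU1_TID > $D/vcpu1/tasks"
run sh -c "for t in /proc/\$QEMU_PID/task/*; do echo \${t##*/} > $D/emulator/tasks; done"
```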
I understand that having wildly distinct hierarchies across different
controllers causes a lot of pain for the kernel. Libvirt doesn't
actually require that full level of flexibility though. Our needs
are very much simpler. We're happy with the same core hierarchy
across all controllers. We just want to be able to create an extra
leaf node in some controllers to move threads about.
It would be fine with us if the kernel required that the same directory
hierarchy exists in all controllers, and mandated that threads can only
be moved to a directory immediately below where the process is initially
placed.
> > This is to allow us to control scheduler tunables / placement for
> > QEMU vCPU threads, independently of limits for QEMU I/O threads. So
> > requiring all threads of a process to be in the same cgroup isn't
> > sufficiently flexible for our needs.
>
> It was never suited to that level of flexibility and it will never be
> and things like that will be clearly forbidden rather than being left
> in the current "not fully supported but kinda works" state. The
> existing stuff won't break but new things won't keep the support. If
> you're fine with staying with the old interface, which will be around
> for the foreseeable future, that's fine too, but if you intend to move
> onto the new interface when it finally becomes ready, whenever that
> is, please move on.
You say the old interface will be around for the foreseeable future, but
if systemd starts applying a different setup to comply with your new
scheme, then libvirt doesn't get any option to continue to use the
old scheme. So even if you leave old interfaces around, we're going to
be forced to change. That's not really a back-compatibility story that
works for applications.
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130606092055.GF30217-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-06-06 21:14 ` Tejun Heo
[not found] ` <20130606211410.GF5045-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2013-06-06 21:14 UTC (permalink / raw)
To: Daniel P. Berrange
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal
Hello,
On Thu, Jun 06, 2013 at 10:20:55AM +0100, Daniel P. Berrange wrote:
> Unless I'm mistaken there is no alternative that can work. With QEMU
> we need to apply scheduling controls to
>
> 1. Individual vCPU threads
> 2. All non-vCPU threads (ie QEMU's I/O threads)
>
> We can use per-thread APIs for 1, but for 2 we require something that
> applies to the group of threads as a whole, without also impacting the
> controls set for the vCPU threads. AFAIK, nothing except cgroups as
> we use them today can satisfy that requirement? Am I wrong? Is there
> something else that can achieve this same setup?
Can you please explain more about your requirements on !vCPU threads?
> I understand that having wildly distinct hierarchies across different
> controllers causes a lot of pain for the kernel. Libvirt doesn't
> actually require that full level of flexibility though. Our needs
> are very much simpler. We're happy with the same core hierarchy
> across all controllers. We just want to be able to create an extra
> leaf node in some controllers to move threads about.
>
> It would be fine with us if the kernel required that the same directory
> hierarchy exists in all controllers, and mandated that threads can only
> be moved to a directory immediately below where the process is initially
> placed.
The problem is that it doesn't make any sense to split threads of the
same process for at least two major controllers, and you end up with a
situation where you can't identify a resource as belonging to a
certain cgroup because that level of granularity is simply undefined.
As I wrote before, we can special case certain controllers but I'm
extremely reluctant. If you need it, please convince me.
> > It was never suited to that level of flexibility and it will never be
> > and things like that will be clearly forbidden rather than being left
> > in the current "not fully supported but kinda works" state. The
> > existing stuff won't break but new things won't keep the support. If
> > you're fine with staying with the old interface, which will be around
> > for the foreseeable future, that's fine too, but if you intend to move
> > onto the new interface when it finally becomes ready, whenever that
> > is, please move on.
>
> You say the old interface will be around for the foreseeable future, but
> if systemd starts applying a different setup to comply with your new
> scheme, then libvirt doesn't get any option to continue to use the
> old scheme. So even if you leave old interfaces around, we're going to
> be forced to change. That's not really a back-compatibility story that
> works for applications.
Even after unified hierarchy, it will be possible to tell systemd
not to do that and the system should be able to continue using
multiple hierarchies as before. Nothing really should change in that
respect. It may not be the prettiest but is still a workable
compatibility.
Thanks.
--
tejun
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130606211410.GF5045-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2013-06-07 5:10 ` Lennart Poettering
[not found] ` <20130607051040.GA1364-kS5D54t9nk0aINubkmmoJbNAH6kLmebB@public.gmane.org>
2013-06-07 10:12 ` Daniel P. Berrange
1 sibling, 1 reply; 33+ messages in thread
From: Lennart Poettering @ 2013-06-07 5:10 UTC (permalink / raw)
To: Tejun Heo
Cc: Daniel P. Berrange, Li Zefan,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Vivek Goyal,
Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA
On Thu, 06.06.13 14:14, Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org) wrote:
> > You say the old interface will be around for the foreseeable future, but
> > if systemd starts applying a different setup to comply with your new
> > scheme, then libvirt doesn't get any option to continue to use the
> > old scheme. So even if you leave old interfaces around, we're going to
> > be forced to change. That's not really a back-compatibility story that
> > works for applications.
>
> Even after unified hierarchy, it will be possible to tell systemd
> not to do that and the system should be able to continue using
> multiple hierarchies as before. Nothing really should change in that
> respect. It may not be the prettiest but is still a workable
> compatibility.
Uhm. So I don't think we will support two ways to set this up for long
in systemd (if at all). And even if we did, the distributions would pick
one or the other as default, and if you are unlucky, then you couldn't
run libvirt on it without reconfiguration and rebooting to get the other
cgroup setup logic...
Lennart
--
Lennart Poettering - Red Hat, Inc.
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130607051040.GA1364-kS5D54t9nk0aINubkmmoJbNAH6kLmebB@public.gmane.org>
@ 2013-06-07 9:30 ` Daniel P. Berrange
[not found] ` <20130607093050.GA10742-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-06-07 20:03 ` Tejun Heo
1 sibling, 1 reply; 33+ messages in thread
From: Daniel P. Berrange @ 2013-06-07 9:30 UTC (permalink / raw)
To: Lennart Poettering
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Johannes Weiner, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal
On Fri, Jun 07, 2013 at 07:10:40AM +0200, Lennart Poettering wrote:
> On Thu, 06.06.13 14:14, Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org) wrote:
>
> > > You say the old interface will be around for the foreseeable future, but
> > > if systemd starts applying a different setup to comply with your new
> > > scheme, then libvirt doesn't get any option to continue to use the
> > > old scheme. So even if you leave old interfaces around, we're going to
> > > be forced to change. That's not really a back-compatibility story that
> > > works for applications.
> >
> > Even after unified hierarchy, it will be possible to tell systemd
> > not to do that and the system should be able to continue using
> > multiple hierarchies as before. Nothing really should change in that
> > respect. It may not be the prettiest but is still a workable
> > compatibility.
>
> Uhm. So I don't think we will support two ways to set this up for long
> in systemd (if at all). And even if we did, the distributions would pick
> one or the other as default, and if you are unlucky, then you couldn't
> run libvirt on it without reconfiguration and rebooting to get the other
> cgroup setup logic...
Yep, that's exactly what concerns me. If systemd introduced the option
for a new style setup, then it would almost certainly be adopted by
default by Fedora (otherwise there's no point adding it). At which
point libvirt is doomed. We can't require people to reconfigure systemd
and reboot to use virt.
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130606211410.GF5045-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2013-06-07 5:10 ` Lennart Poettering
@ 2013-06-07 10:12 ` Daniel P. Berrange
[not found] ` <20130607101220.GE10742-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
1 sibling, 1 reply; 33+ messages in thread
From: Daniel P. Berrange @ 2013-06-07 10:12 UTC (permalink / raw)
To: Tejun Heo
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal
On Thu, Jun 06, 2013 at 02:14:10PM -0700, Tejun Heo wrote:
> Hello,
>
> On Thu, Jun 06, 2013 at 10:20:55AM +0100, Daniel P. Berrange wrote:
> > Unless I'm mistaken there is no alternative that can work. With QEMU
> > we need to apply scheduling controls to
> >
> > 1. Individual vCPU threads
> > 2. All non-vCPU threads (ie QEMU's I/O threads)
> >
> > We can use per-thread APIs for 1, but for 2 we require something that
> > applies to the group of threads as a whole, without also impacting the
> > controls set for the vCPU threads. AFAIK, nothing except cgroups as
> > we use them today can satisfy that requirement? Am I wrong? Is there
> > something else that can achieve this same setup?
>
> Can you please explain more about your requirements on !vCPU threads?
Well we pretty much need the tunables available in the cpu, cpuset
and cpuacct controllers to be available for the set of non-vCPU threads
as a group. eg, cpu_shares, cfs_period_us, cfs_quota_us, cpuacct.usage,
cpuacct.usage_percpu, cpuset.cpus, cpuset.mems.
CPU/memory affinity could possibly be done with a combination of
sched_setaffinity + libnuma, but I'm not sure that it has quite
the same semantics. IIUC, with the cpuset cgroup, changing affinity
will cause the kernel to migrate existing memory allocations to
the newly specified node masks, but this isn't done if you just
use sched_setaffinity/libnuma.
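For reference, in the v1 cpuset controller that page migration is opt-in rather than automatic: the cpuset.memory_migrate flag has to be set before rewriting cpuset.mems moves existing allocations. A sketch with illustrative paths, dry-run by default:

```shell
# Enable page migration for a cpuset, then retarget its memory nodes.
# DRYRUN=1 (the default) prints the writes instead of performing them.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

CS=/sys/fs/cgroup/cpuset/machine/domain-cgroup
# With memory_migrate enabled, rewriting cpuset.mems moves the group's
# existing pages to the new nodes; plain sched_setaffinity/mbind on a
# running thread has no equivalent group-wide effect.
run sh -c "echo 1 > $CS/cpuset.memory_migrate"
run sh -c "echo 0-1 > $CS/cpuset.mems"
```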
For cpu accounting, you'd have to look at the overall cgroup usage
and then subtract the usage accounted to each vcpu thread to get
the non-vCPU thread group total. Possible, but slightly tedious and
more inaccurate, since you will have timing delays getting info
from each thread's /proc files.
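The subtraction works out as below; the numbers are hard-coded samples standing in for a live cpuacct.usage read and per-vCPU-thread runtimes:

```shell
# Derive non-vCPU CPU time by subtracting per-vCPU usage from the
# cgroup total. Real values would come from the group's cpuacct.usage
# and each vCPU thread's scheduler stats; these are illustrative.
total_ns=5000000000              # whole-cgroup usage, nanoseconds
vcpu_ns="1200000000 1800000000"  # per-vCPU thread runtimes

sum=0
for v in $vcpu_ns; do sum=$((sum + v)); done
echo "non-vCPU usage: $((total_ns - sum)) ns"   # -> 2000000000 ns
```

The inaccuracy Daniel mentions comes from the gap between reading the total and reading each thread's figure; the threads keep running in between.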
I don't see any way to do cpu_shares, cfs_period_us and cfs_quota_us
for the group of non-vCPU threads as a whole. You can't set these
at the per-thread level since that is semantically very different.
You can't set these for the process as a whole at the cgroup level,
since that'll confine vCPU threads at the same time which is also
semantically very different.
We need completely separate scheduler tuning for the set of non-vCPU
threads, vs each vCPU thread. AFAICT the only way to do this is with
cgroups, grouping together subsets of threads within the process.
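Concretely, the group-wide knob that has no per-thread equivalent looks like this on a v1 hierarchy (hypothetical path, dry-run by default):

```shell
# Cap the non-vCPU ("emulator") thread group at 25% of one CPU without
# touching the vCPU leaves; sched_setscheduler/nice cannot express a
# quota shared across a set of threads.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

E=/sys/fs/cgroup/cpu/machine/domain-cgroup/emulator
run sh -c "echo 100000 > $E/cpu.cfs_period_us"   # 100 ms period
run sh -c "echo 25000 > $E/cpu.cfs_quota_us"     # 25 ms of CPU per period
```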
> > I understand that having wildly distinct hierarchies across different
> > controllers causes a lot of pain for the kernel. Libvirt doesn't
> > actually require that full level of flexibility though. Our needs
> > are very much simpler. We're happy with the same core hierarchy
> > across all controllers. We just want to be able to create an extra
> > leaf node in some controllers to move threads about.
> >
> > It would be fine with us if the kernel required that the same directory
> > hierarchy exists in all controllers, and mandated that threads can only
> > be moved to a directory immediately below where the process is initially
> > placed.
>
> The problem is that it doesn't make any sense to split threads of the
> same process for at least two major controllers, and you end up with a
> situation where you can't identify a resource as belonging to a
> certain cgroup because that level of granularity is simply undefined.
> As I wrote before, we can special case certain controllers but I'm
> extremely reluctant. If you need it, please convince me.
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130607101220.GE10742-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-06-07 10:21 ` Kay Sievers
2013-06-07 10:28 ` Daniel P. Berrange
2013-06-07 10:32 ` Glauber Costa
2013-06-07 20:23 ` Tejun Heo
2 siblings, 1 reply; 33+ messages in thread
From: Kay Sievers @ 2013-06-07 10:21 UTC (permalink / raw)
To: Daniel P. Berrange
Cc: Tejun Heo, Li Zefan,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Michal Hocko, Vivek Goyal, lennart-mdGvqq1h2p+GdvJs77BJ7Q,
Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA
On Fri, Jun 7, 2013 at 12:12 PM, Daniel P. Berrange <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Well we pretty much need the tunables available in the cpu, cpuset
> and cpuacct controllers to be available for the set of non-vCPU threads
> as a group. eg, cpu_shares, cfs_period_us, cfs_quota_us, cpuacct.usage,
> cpuacct.usage_percpu, cpuset.cpus, cpuset.mems.
The cpu, cpuset, cpuacct controllers need different configuration per
controller, or do you refer to them as one entity?
Kay
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
2013-06-07 10:21 ` Kay Sievers
@ 2013-06-07 10:28 ` Daniel P. Berrange
0 siblings, 0 replies; 33+ messages in thread
From: Daniel P. Berrange @ 2013-06-07 10:28 UTC (permalink / raw)
To: Kay Sievers
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Michal Hocko, lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal
On Fri, Jun 07, 2013 at 12:21:19PM +0200, Kay Sievers wrote:
> On Fri, Jun 7, 2013 at 12:12 PM, Daniel P. Berrange <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> > Well we pretty much need the tunables available in the cpu, cpuset
> > and cpuacct controllers to be available for the set of non-vCPU threads
> > as a group. eg, cpu_shares, cfs_period_us, cfs_quota_us, cpuacct.usage,
> > cpuacct.usage_percpu, cpuset.cpus, cpuset.mems.
>
> The cpu, cpuset, cpuacct controllers need different configuration per
> controller, or do you refer them as one entity?
They can all be co-mounted/merged without trouble, as they're all
treated the same way.
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130607101220.GE10742-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-06-07 10:21 ` Kay Sievers
@ 2013-06-07 10:32 ` Glauber Costa
[not found] ` <51B1B6C2.7000304-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-06-07 20:23 ` Tejun Heo
2 siblings, 1 reply; 33+ messages in thread
From: Glauber Costa @ 2013-06-07 10:32 UTC (permalink / raw)
To: Daniel P. Berrange
Cc: Tejun Heo, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal
On 06/07/2013 02:12 PM, Daniel P. Berrange wrote:
>> The problem is that it doesn't make any sense to split threads of the
>> same process for at least two major controllers, and you end up with a
>> situation where you can't identify a resource as belonging to a
>> certain cgroup because such a level of granularity is simply undefined.
>> As I wrote before, we can special-case certain controllers but I'm
>> extremely reluctant. If you need it, please convince me.
It seems quite valid to me to split high-priority threads in a process
and give them an 80% timeslice guarantee, while leaving only 20% for
low-prio threads (for instance).
In general, I don't think that it would hurt to allow separation at the
thread level *for the leaves*, specifically for the cpu-related
controllers.
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130607051040.GA1364-kS5D54t9nk0aINubkmmoJbNAH6kLmebB@public.gmane.org>
2013-06-07 9:30 ` Daniel P. Berrange
@ 2013-06-07 20:03 ` Tejun Heo
[not found] ` <20130607200307.GA14781-9pTldWuhBndy/B6EtB590w@public.gmane.org>
1 sibling, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2013-06-07 20:03 UTC (permalink / raw)
To: Lennart Poettering
Cc: Daniel P. Berrange, Li Zefan,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Vivek Goyal,
Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA
Hello, Lennart.
On Fri, Jun 07, 2013 at 07:10:40AM +0200, Lennart Poettering wrote:
> Uhm. So I don't think we will support two ways to set this up for long
> in systemd (if at all). And even if we did, the distributions would pick
> one or the other as default, and if you are unlucky, then you couldn't
> run libvirt on it without reconfiguration and rebooting to get the other
> cgroup setup logic...
Hmmm... would that mean that people wouldn't be able to boot older
kernels w/o unified hierarchy with new systemd fairly soon too? We can
make the new features which matter for management available for named
mounts (w/o unified hierarchy) so that systemd can operate as if none
of the actual controllers are enabled, which it should be able to
handle anyway. That wouldn't solve the problem of booting old kernels
with new systemd, but it should at least allow workable backward
compatibility from the userland side as long as the kernel keeps the
compatibility, right?
Thanks.
--
tejun
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130607093050.GA10742-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-06-07 20:05 ` Tejun Heo
0 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2013-06-07 20:05 UTC (permalink / raw)
To: Daniel P. Berrange
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Lennart Poettering,
Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal
Hello, Daniel.
On Fri, Jun 07, 2013 at 10:30:50AM +0100, Daniel P. Berrange wrote:
> > Uhm. So I don't think we will support two ways to set this up for long
> > in systemd (if at all). And even if we did, the distributions would pick
> > one or the other as default, and if you are unlucky, then you couldn't
> > run libvirt on it without reconfiguration and rebooting to get the other
> > cgroup setup logic...
>
> Yep, that's exactly what concerns me. If systemd introduced the option
I don't think you two are speaking about the same problem. Lennart is
saying that the backward compat option won't stay for long, and you're
worrying about the method for enabling compatibility mode.
> for a new style setup, then it would almost certainly be adopted by
> default by Fedora (otherwise there's no point adding it). At which
> point libvirt is doomed. We can't require people to reconfigure systemd
> and reboot to use virt.
Yeah, it's cumbersome, but if it's pressing, we can make the release of
controllers from the unified hierarchy dynamic, I suppose. I'll think
more about it.
Thanks.
--
tejun
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130607101220.GE10742-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-06-07 10:21 ` Kay Sievers
2013-06-07 10:32 ` Glauber Costa
@ 2013-06-07 20:23 ` Tejun Heo
2 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2013-06-07 20:23 UTC (permalink / raw)
To: Daniel P. Berrange
Cc: Li Zefan, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Vivek Goyal,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
Hello,
On Fri, Jun 07, 2013 at 11:12:20AM +0100, Daniel P. Berrange wrote:
> Well, we pretty much need the tunables available in the cpu, cpuset
> and cpuacct controllers to be available for the set of non-vCPU threads
> as a group, e.g. cpu_shares, cfs_period_us, cfs_quota_us, cpuacct.usage,
> cpuacct.usage_percpu, cpuset.cpus, cpuset.mems.
>
> CPU/memory affinity could possibly be done with a combination of
> sched_setaffinity + libnuma, but I'm not sure that it has quite
> the same semantics. IIUC, with cpuset cgroup changing affinity
> will cause the kernel to migrate existing memory allocations to
> the newly specified node masks, but this isn't done if you just
> use sched_setaffinity/libnuma.
The features mostly overlap. The exact details might differ, but is
that really an overriding concern? With autonuma, the kernel will do
migration automatically anyway, and it's more likely to show better
overall behavior by performing the migration gradually. One of the
problems with some of cgroup's features is that they're crutches or
hackery to achieve pretty specific goals, and I want to at least
discourage such usages. If something can be done with the usual
programming API, it's better to stick with that. Those APIs tend to be
much better thought through, engineered and given a lot more attention.
> For cpu accounting, you'd have to look at the overall cgroup usage
> and then subtract the usage accounted to each vCPU thread to get
> the non-vCPU thread group total. Possible, but slightly tedious and
> more inaccurate, since you will have timing delays getting info
> from each thread's /proc files.
But you would want to keep track of how much cpu time each vcpu
consumes anyway, right? That's a logical thing to do.
> I don't see any way to do cpu_shares, cfs_period_us and cfs_quota_us
> for the group of non-vCPU threads as a whole. You can't set these
> at the per-thread level since that is semantically very different.
> You can't set these for the process as a whole at the cgroup level,
> since that'll confine vCPU threads at the same time which is also
> semantically very different.
The above doesn't really explain why you need them. It just describes
what you're doing right now and how that doesn't map exactly to the
usual interface. What effect are you achieving by tuning those
scheduler params, which BTW are much more deeply connected to scheduler
internals and thus a lot more volatile? Exploiting such features is
dangerous for both userland and the kernel - userland is much more
likely to be affected by kernel implementation changes, and in turn the
kernel is locked into an interface which it never intended to make
widely available and gets restricted in what it can do to improve the
implementation.
While quite unlikely, people regularly talk about alternative scheduler
implementations, and things like the above mean that any such
alternative implementation would have to emulate, of course
imperfectly, those knobs, which may have no meaning whatsoever to the
new scheduler.
I'm sure you had good reasons for adding those, but the fact that
you're depending on them at all instead of the usual scheduler API is
not a good thing and, if at all possible, they should be removed. This
is the same theme I talked about above. It looks like a good feature
on the surface but is also something which is detrimental in the long
term. We want to limit access to such implementation details as
closely as possible.
So, can you please explain what benefits you're getting out of tuning
those scheduler-specific knobs which can't be obtained via the normal
scheduler API, and how much difference that actually makes? Because if
it's something legitimately necessary, it should really be accessible
in a generic manner so that lay programs can make use of it too.
Thanks.
--
tejun
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <51B1B6C2.7000304-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-06-07 20:32 ` Tejun Heo
0 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2013-06-07 20:32 UTC (permalink / raw)
To: Glauber Costa
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko,
lennart-mdGvqq1h2p+GdvJs77BJ7Q, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal
Hello, Glauber.
On Fri, Jun 07, 2013 at 02:32:34PM +0400, Glauber Costa wrote:
> It seems quite valid to me to split high-priority threads in a process
> and give them an 80% timeslice guarantee, while leaving only 20% for
> low-prio threads (for instance).
>
> In general, I don't think that it would hurt to allow separation at the
> thread level *for the leaves*, specifically for the cpu-related
> controllers.
But if you look at the larger picture, it actually is detrimental,
because allowing different threads of the same process to sit in
different cgroups almost implies that the program binary itself would
be involved in interacting with - accessing and tuning - cgroups. Back
to the sysctl analogy: it's one thing to expose them to base system
tools which are tightly coupled with the kernel so that they can
configure it, but allowing lay programs to embed sysctl tuning inside
their binaries is a completely different thing and something which must
not happen.
While not always designated explicitly, things exposed from the kernel
have differing levels of exposure, and we definitely want to avoid
exposing too many implementation details, for the sake of both the
kernel and userland. In some cases such levels are determined as the
use cases develop. If you ask me, cgroup is a good example of things
going horribly wrong.
So, yeah, while I agree that per-thread usages could make sense for
scheduler related controllers, I'd very much like to avoid that if at
all possible. It's not a good road to be on.
Thanks.
--
tejun
* Re: [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane
[not found] ` <20130607200307.GA14781-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2013-06-10 11:08 ` Lennart Poettering
0 siblings, 0 replies; 33+ messages in thread
From: Lennart Poettering @ 2013-06-10 11:08 UTC (permalink / raw)
To: Tejun Heo
Cc: Daniel P. Berrange, Li Zefan,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
kay.sievers-tD+1rO4QERM, Michal Hocko, Vivek Goyal,
Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA
On Fri, 07.06.13 13:03, Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org) wrote:
Heya,
> On Fri, Jun 07, 2013 at 07:10:40AM +0200, Lennart Poettering wrote:
> > Uhm. So I don't think we will support two ways to set this up for long
> > in systemd (if at all). And even if we did, the distributions would pick
> > one or the other as default, and if you are unlucky, then you couldn't
> > run libvirt on it without reconfiguration and rebooting to get the other
> > cgroup setup logic...
>
> Hmmm... would that mean that people wouldn't be able to boot older
> kernels w/o unified hierarchy with new systemd fairly soon too?
Yeah, I guess. Or maybe only to the level we currently support
cgroup-less kernels: i.e. we print a big warning at boot, wait for a
few seconds, and then proceed, but things might be very confused
because we cannot keep track of services anymore.
> We can make the new features which matter for management available for
> named mounts (w/o unified hierarchy) so that systemd can operate as if
> none of the actual controllers are enabled, which it should be able to
> handle anyway. That wouldn't solve booting old kernels with new
> systemd problem but it should at least allow workable backward
> compatibility from userland side as long as the kernel keeps the
> compatibility, right?
The closer the kernel stays to its old interfaces, the easier it will
be for us to stay compatible for a while. But there are many levels of
compatibility possible here: as mentioned above, the one where we warn
and just don't make use of cgroups, or where we just use cgroups for
organizing things but drop support for resource management, and so on.
Lennart
--
Lennart Poettering - Red Hat, Inc.
end of thread, other threads:[~2013-06-10 11:08 UTC | newest]
Thread overview: 33+ messages
-- links below jump to the message on this page --
2013-06-04 2:13 [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane Tejun Heo
[not found] ` <20130604021302.GH29989-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-06-04 2:13 ` [PATCH cgroup/for-3.11 2/3] cgroup: mark "notify_on_release" and "release_agent" cgroup files insane Tejun Heo
[not found] ` <20130604021355.GI29989-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-06-04 2:14 ` [PATCH cgroup/for-3.11 3/3] cgroup: clean up the cftype array for the base cgroup files Tejun Heo
[not found] ` <20130604021434.GJ29989-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-06-04 10:49 ` Li Zefan
2013-06-04 10:47 ` [PATCH cgroup/for-3.11 2/3] cgroup: mark "notify_on_release" and "release_agent" cgroup files insane Li Zefan
[not found] ` <51ADC5D3.5070108-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-06-04 20:00 ` Tejun Heo
[not found] ` <20130604200003.GC14916-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2013-06-05 6:47 ` Glauber Costa
2013-06-04 10:43 ` [PATCH cgroup/for-3.11 1/3] cgroup: mark "tasks" cgroup file as insane Li Zefan
2013-06-04 11:15 ` Daniel P. Berrange
[not found] ` <20130604111556.GA4963-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-06-04 14:34 ` Vivek Goyal
[not found] ` <20130604143444.GI4799-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-06-04 14:50 ` Daniel P. Berrange
[not found] ` <20130604145008.GV4963-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-06-04 15:12 ` Vivek Goyal
[not found] ` <20130604151236.GA7555-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-06-04 15:25 ` Daniel P. Berrange
2013-06-04 20:19 ` Tejun Heo
[not found] ` <20130604201947.GE14916-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2013-06-05 18:52 ` Tejun Heo
2013-06-06 7:48 ` Li Zefan
2013-06-06 9:20 ` Daniel P. Berrange
[not found] ` <20130606092055.GF30217-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-06-06 21:14 ` Tejun Heo
[not found] ` <20130606211410.GF5045-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2013-06-07 5:10 ` Lennart Poettering
[not found] ` <20130607051040.GA1364-kS5D54t9nk0aINubkmmoJbNAH6kLmebB@public.gmane.org>
2013-06-07 9:30 ` Daniel P. Berrange
[not found] ` <20130607093050.GA10742-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-06-07 20:05 ` Tejun Heo
2013-06-07 20:03 ` Tejun Heo
[not found] ` <20130607200307.GA14781-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-06-10 11:08 ` Lennart Poettering
2013-06-07 10:12 ` Daniel P. Berrange
[not found] ` <20130607101220.GE10742-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-06-07 10:21 ` Kay Sievers
2013-06-07 10:28 ` Daniel P. Berrange
2013-06-07 10:32 ` Glauber Costa
[not found] ` <51B1B6C2.7000304-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-06-07 20:32 ` Tejun Heo
2013-06-07 20:23 ` Tejun Heo
2013-06-04 11:21 ` Michal Hocko
[not found] ` <20130604112139.GD31242-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-06-04 20:01 ` Tejun Heo
[not found] ` <20130604200149.GD14916-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2013-06-05 9:49 ` Li Zefan
2013-06-05 19:03 ` Tejun Heo