* [RFC PATCH 0/2] CPU controller statistics - v5 @ 2009-02-25 10:57 Bharata B Rao 2009-02-25 10:58 ` [RFC PATCH 1/2] New cgroup subsystem API (->initialize()) Bharata B Rao 2009-02-25 10:59 ` [RFC PATCH 2/2] Add per-cgroup CPU controller statistics Bharata B Rao 0 siblings, 2 replies; 14+ messages in thread From: Bharata B Rao @ 2009-02-25 10:57 UTC (permalink / raw) To: linux-kernel Cc: Balaji Rao, Dhaval Giani, Balbir Singh, Li Zefan, Paul Menage, Andrew Morton, Ingo Molnar, Peter Zijlstra Hi, Last year, Balaji posted a patch to collect CPU controller statistics. After a few initial versions, I couldn't see any activity. Here I am posting the next version of the patch for review. Changes for v5: - Updated to 2.6.29-rc6. - Separated cgroup modifications into a different patch. - Changed the prototype of the ->initialize() subsystem API from cgroup_subsys->initialize(int early) to cgroup_subsys->initialize(void) and calling it only from cgroup_init() and not from cgroup_init_early(). - Updated documentation for new API. - Added guest time metric as another cpu controller statistic. - Free percpu statistics counters when the cgroup is brought down. - Account irq and softirq time also as system time for cgroup accounting. - Separate out stats collection code under CONFIG_CGROUP_SCHED to a helper function to reduce ifdefs. v4: - http://lkml.org/lkml/2008/5/11/163 I have tried to address most of the comments given for the 4th version of the patch. A few questions still remain: - percpu counters are used for stats collection. Since percpu counters aren't usable during cgroup_init_early(), we have to allocate percpu stats counters for init_task_group separately later during cgroup_init(). Because of this, in the stats collection code, we end up having an 'is stats counter allocated?' check. - This patch collects per cgroup cpu controller stats. Does it make sense to account stime and utime hierarchically ?
If so, we would probably be duplicating what the cpuacct controller already does. - Is steal time a useful per-cgroup metric ? Perhaps in container-based virtualized environments ? Regards, Bharata. ^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC PATCH 1/2] New cgroup subsystem API (->initialize()) 2009-02-25 10:57 [RFC PATCH 0/2] CPU controller statistics - v5 Bharata B Rao @ 2009-02-25 10:58 ` Bharata B Rao 2009-02-26 2:55 ` Li Zefan 2009-02-25 10:59 ` [RFC PATCH 2/2] Add per-cgroup CPU controller statistics Bharata B Rao 1 sibling, 1 reply; 14+ messages in thread From: Bharata B Rao @ 2009-02-25 10:58 UTC (permalink / raw) To: linux-kernel Cc: Balaji Rao, Dhaval Giani, Balbir Singh, Li Zefan, Paul Menage, Andrew Morton, Ingo Molnar, Peter Zijlstra From: Balaji Rao <balajirrao@gmail.com> cgroup: Add ->initialize() to cgroup_subsys structure Some cgroup subsystems (like cpu controller) would need subsystem specific initialization. Such subsystems can define ->initialize() which gets called during cgroup_init() (and not cgroup_init_early()). Signed-off-by: Balaji Rao <balajirrao@gmail.com> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> --- Documentation/cgroups/cgroups.txt | 6 ++++++ include/linux/cgroup.h | 1 + kernel/cgroup.c | 2 ++ 3 files changed, 9 insertions(+) --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -534,6 +534,12 @@ and root cgroup. Currently this will onl the default hierarchy (which never has sub-cgroups) and a hierarchy that is being created/destroyed (and hence has no sub-cgroups). +void initialize(void) + +Called during cgroup initialization (cgroup_init()) to do any subsystem +specific initialization. This is not called during early cgroup +initialization (cgroup_init_early()). + 4.
Questions ============ --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -359,6 +359,7 @@ struct cgroup_subsys { struct cgroup *cgrp); void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp); void (*bind)(struct cgroup_subsys *ss, struct cgroup *root); + void (*initialize)(void); int subsys_id; int active; --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -2707,6 +2707,8 @@ int __init cgroup_init(void) struct cgroup_subsys *ss = subsys[i]; if (!ss->early_init) cgroup_init_subsys(ss); + if (ss->initialize) + ss->initialize(); } /* Add init_css_set to the hash table */ ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] New cgroup subsystem API (->initialize()) 2009-02-25 10:58 ` [RFC PATCH 1/2] New cgroup subsystem API (->initialize()) Bharata B Rao @ 2009-02-26 2:55 ` Li Zefan 2009-02-26 7:52 ` Bharata B Rao 0 siblings, 1 reply; 14+ messages in thread From: Li Zefan @ 2009-02-26 2:55 UTC (permalink / raw) To: bharata Cc: linux-kernel, Balaji Rao, Dhaval Giani, Balbir Singh, Paul Menage, Andrew Morton, Ingo Molnar, Peter Zijlstra Bharata B Rao wrote: > From: Balaji Rao <balajirrao@gmail.com> > > cgroup: Add ->initialize() to cgroup_subsys structure > > Some cgroup subsystems (like cpu controller) would need subsystem > specific initialization. Such subsystems can define ->initialize() > which gets called during cgroup_init() (and not cgroup_init_early()). > I think it's better to avoid adding this. It would be best if we can add a hook to initialize init_task_group.stat where kmalloc is available but account_xxx_time() hasn't been called. Otherwise, we have to check (tg->stat == NULL) in account_task_group_time(), then why not add a hook in smp_init_smp() to do initialization? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] New cgroup subsystem API (->initialize()) 2009-02-26 2:55 ` Li Zefan @ 2009-02-26 7:52 ` Bharata B Rao 2009-02-26 8:11 ` Li Zefan 2009-02-26 8:48 ` Dhaval Giani 0 siblings, 2 replies; 14+ messages in thread From: Bharata B Rao @ 2009-02-26 7:52 UTC (permalink / raw) To: Li Zefan Cc: linux-kernel, Balaji Rao, Dhaval Giani, Balbir Singh, Paul Menage, Andrew Morton, Ingo Molnar, Peter Zijlstra On Thu, Feb 26, 2009 at 10:55:54AM +0800, Li Zefan wrote: > Bharata B Rao wrote: > > From: Balaji Rao <balajirrao@gmail.com> > > > > cgroup: Add ->initialize() to cgroup_subsys structure > > > > Some cgroup subsystems (like cpu controller) would need subsystem > > specific initialization. Such subsystems can define ->initialize() > > which gets called during cgroup_init() (and not cgroup_init_early()). > > > > I think it's better to avoid adding this. > > It would be best if we can add a hook to initialize init_task_group.stat where > kmalloc is available but acount_xxx_time() hasn't been called. Otherwise, we > have to check (tg->stat == NULL) in account_task_group_time(), then why not add > a hook in smp_init_smp() to do initialization? account_xxx_time() is called from scheduler ticks and AFAICS they end up getting called much before kmalloc is available. In any case, I would think any hook to just initialize stats for init_task_group would be very very (cpu controller) subsystem specific. Isn't that bad ? Another solution I see which can prevent all this is not to collect stats for init_task_group at all, with the understanding that the system wide stime/utime accounting (which is already present) is essentially the accounting for init_task_group because init_task_group comprises all the tasks in the system. But this would necessitate us to make collection of cpu controller stats hierarchical. This was one of the questions I asked in my 0/2 thread. Shouldn't we be doing hierarchical accounting for the cpu controller ?
Another thing that could be done is to enhance the already existing cpuacct controller to do stime/utime accounting also. The cpuacct controller exists precisely for doing per-cgroup accounting, so is there any reason why these stats shouldn't be part of the cpuacct controller? If we do this, users would be forced to use the cpu controller and cpuacct controller together. Is that a problem ? Regards, Bharata. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] New cgroup subsystem API (->initialize()) 2009-02-26 7:52 ` Bharata B Rao @ 2009-02-26 8:11 ` Li Zefan 2009-02-26 8:20 ` KAMEZAWA Hiroyuki 2009-02-26 10:12 ` Bharata B Rao 1 sibling, 2 replies; 14+ messages in thread From: Li Zefan @ 2009-02-26 8:11 UTC (permalink / raw) To: bharata Cc: linux-kernel, Balaji Rao, Dhaval Giani, Balbir Singh, Paul Menage, Andrew Morton, Ingo Molnar, Peter Zijlstra Bharata B Rao wrote: > On Thu, Feb 26, 2009 at 10:55:54AM +0800, Li Zefan wrote: >> Bharata B Rao wrote: >>> From: Balaji Rao <balajirrao@gmail.com> >>> >>> cgroup: Add ->initialize() to cgroup_subsys structure >>> >>> Some cgroup subsystems (like cpu controller) would need subsystem >>> specific initialization. Such subsystems can define ->initialize() >>> which gets called during cgroup_init() (and not cgroup_init_early()). >>> >> I think it's better to avoid adding this. >> >> It would be best if we can add a hook to initialize init_task_group.stat where >> kmalloc is available but acount_xxx_time() hasn't been called. Otherwise, we >> have to check (tg->stat == NULL) in account_task_group_time(), then why not add >> a hook in smp_init_smp() to do initialization? > > account_xxx_time() is called from scheduler ticks and AFAICS they end up > getting called much before kmalloc is available. In any case, I would think > any hook to just initialize stats for init_task_group would be > very very (cpu controller) subsytem specific. Isn't that bad ? > Since it's very much cpu subsystem specific, it's better to use its own hook (and because the initialize() API is not so elegant...) > Another solution I see which can prevent all this is not to collect > stats for init_task_group at all with the understanding that system wide This came to my mind too.
But this would necessiate us to make collection > of cpu controller stats hierarchial. This was one of the questions I asked > in my 0/2 thread. Shouldn't we be doing hierarchial accounting for > cpu controller ? > Don't know. I have no strong opinion about this. I'm a bit doubtful about how useful this is. > Another thing that could be done is to enhance already existing > cpuacct controller to do stime/utime accouting also. cpuacct controller > exists precisely for doing per-cgroup accounting and is there any reason > why these stats shouldn't be part of cpuacct controller. If we do this, > users would be forced to use cpu controller and cpuacct controller > together. Is that a problem ? > I wondered why these stats are part of the cpu subsystem and not cpuacct. And I don't see any problem with using these 2 subsystems together. Actually this is one of the advantages of cgroups. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] New cgroup subsystem API (->initialize()) 2009-02-26 8:11 ` Li Zefan @ 2009-02-26 8:20 ` KAMEZAWA Hiroyuki 2009-02-26 10:12 ` Bharata B Rao 1 sibling, 0 replies; 14+ messages in thread From: KAMEZAWA Hiroyuki @ 2009-02-26 8:20 UTC (permalink / raw) To: Li Zefan Cc: bharata, linux-kernel, Balaji Rao, Dhaval Giani, Balbir Singh, Paul Menage, Andrew Morton, Ingo Molnar, Peter Zijlstra On Thu, 26 Feb 2009 16:11:36 +0800 Li Zefan <lizf@cn.fujitsu.com> wrote: > Bharata B Rao wrote: > > On Thu, Feb 26, 2009 at 10:55:54AM +0800, Li Zefan wrote: > >> Bharata B Rao wrote: > >>> From: Balaji Rao <balajirrao@gmail.com> > > Another thing that could be done is to enhance already existing > > cpuacct controller to do stime/utime accouting also. cpuacct controller > > exists precisely for doing per-cgroup accounting and is there any reason > > why these stats shouldn't be part of cpuacct controller. If we do this, > > users would be forced to use cpu controller and cpuacct controller > > together. Is that a problem ? > > > > I wondered why these stats is part of cpu subsystem but not cpuacct. > And I don't see any problem to use these 2 subsystems together. Actually > this is one of the advantage of cgroup. > IMHO, the cpuacct subsystem should count utime/stime. The current statistics are not so useful. BTW, I wonder why the cpuacct subsys walks up to root at charging... I think we can gather all subtree information at read... Thanks -Kame ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] New cgroup subsystem API (->initialize()) 2009-02-26 8:11 ` Li Zefan 2009-02-26 8:20 ` KAMEZAWA Hiroyuki @ 2009-02-26 10:12 ` Bharata B Rao 1 sibling, 0 replies; 14+ messages in thread From: Bharata B Rao @ 2009-02-26 10:12 UTC (permalink / raw) To: Li Zefan Cc: linux-kernel, Balaji Rao, Dhaval Giani, Balbir Singh, Paul Menage, Andrew Morton, Ingo Molnar, Peter Zijlstra On Thu, Feb 26, 2009 at 04:11:36PM +0800, Li Zefan wrote: > Bharata B Rao wrote: > > On Thu, Feb 26, 2009 at 10:55:54AM +0800, Li Zefan wrote: > >> Bharata B Rao wrote: > >>> From: Balaji Rao <balajirrao@gmail.com> > >>> > >>> cgroup: Add ->initialize() to cgroup_subsys structure > >>> > >>> Some cgroup subsystems (like cpu controller) would need subsystem > >>> specific initialization. Such subsystems can define ->initialize() > >>> which gets called during cgroup_init() (and not cgroup_init_early()). > >>> > >> I think it's better to avoid adding this. > >> > >> It would be best if we can add a hook to initialize init_task_group.stat where > >> kmalloc is available but acount_xxx_time() hasn't been called. Otherwise, we > >> have to check (tg->stat == NULL) in account_task_group_time(), then why not add > >> a hook in smp_init_smp() to do initialization? > > > > account_xxx_time() is called from scheduler ticks and AFAICS they end up > > getting called much before kmalloc is available. In any case, I would think > > any hook to just initialize stats for init_task_group would be > > very very (cpu controller) subsytem specific. Isn't that bad ? > > > > Since it's very very cpu subsystem specific, so it's better to use it's own hook. > (and because the initialize() API is not so elegant..) > > > Another solution I see which can prevent all this is not to collect > > stats for init_task_group at all with the understanding that system wide > > This came to my mind too. 
;) > > > stime/utime accounting (which is already present) is essentially the > > accounting for init_task_group because init_task_group comprises of all > > the tasks in the system. But this would necessiate us to make collection > > of cpu controller stats hierarchial. This was one of the questions I asked > > in my 0/2 thread. Shouldn't we be doing hierarchial accounting for > > cpu controller ? > > > > Don't know. I have no strong opinion about this. I'm a bit doubt how useful > this is. > > > Another thing that could be done is to enhance already existing > > cpuacct controller to do stime/utime accouting also. cpuacct controller > > exists precisely for doing per-cgroup accounting and is there any reason > > why these stats shouldn't be part of cpuacct controller. If we do this, > > users would be forced to use cpu controller and cpuacct controller > > together. Is that a problem ? > > > > I wondered why these stats is part of cpu subsystem but not cpuacct. > And I don't see any problem to use these 2 subsystems together. Actually > this is one of the advantage of cgroup. Ok, so if there are no objections, my next version would move these stats to cpuacct subsystem. Regards, Bharata. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] New cgroup subsystem API (->initialize()) 2009-02-26 7:52 ` Bharata B Rao 2009-02-26 8:11 ` Li Zefan @ 2009-02-26 8:48 ` Dhaval Giani 2009-02-26 8:52 ` Li Zefan 1 sibling, 1 reply; 14+ messages in thread From: Dhaval Giani @ 2009-02-26 8:48 UTC (permalink / raw) To: Bharata B Rao Cc: Li Zefan, linux-kernel, Balaji Rao, Balbir Singh, Paul Menage, Andrew Morton, Ingo Molnar, Peter Zijlstra On Thu, Feb 26, 2009 at 01:22:59PM +0530, Bharata B Rao wrote: > On Thu, Feb 26, 2009 at 10:55:54AM +0800, Li Zefan wrote: > > Bharata B Rao wrote: > > > From: Balaji Rao <balajirrao@gmail.com> > > > > > > cgroup: Add ->initialize() to cgroup_subsys structure > > > > > > Some cgroup subsystems (like cpu controller) would need subsystem > > > specific initialization. Such subsystems can define ->initialize() > > > which gets called during cgroup_init() (and not cgroup_init_early()). > > > > > > > I think it's better to avoid adding this. > > > > It would be best if we can add a hook to initialize init_task_group.stat where > > kmalloc is available but acount_xxx_time() hasn't been called. Otherwise, we > > have to check (tg->stat == NULL) in account_task_group_time(), then why not add > > a hook in smp_init_smp() to do initialization? > > account_xxx_time() is called from scheduler ticks and AFAICS they end up > getting called much before kmalloc is available. In any case, I would think > any hook to just initialize stats for init_task_group would be > very very (cpu controller) subsytem specific. Isn't that bad ? > > Another solution I see which can prevent all this is not to collect > stats for init_task_group at all with the understanding that system wide > stime/utime accounting (which is already present) is essentially the > accounting for init_task_group because init_task_group comprises of all > the tasks in the system. But this would necessiate us to make collection > of cpu controller stats hierarchial. 
This was one of the questions I asked > in my 0/2 thread. Shouldn't we be doing hierarchial accounting for > cpu controller ? > I believe hierarchical accounting is natural. Since the CPU controller divides its bandwidth as per hierarchy, it makes sense for the accounting to also be hierarchical. I guess this is a reasonable compromise. But how do you still check if the group you are accounting for is the root group or not? (Unless I am missing something obvious) thanks, -- regards, Dhaval ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] New cgroup subsystem API (->initialize()) 2009-02-26 8:48 ` Dhaval Giani @ 2009-02-26 8:52 ` Li Zefan 0 siblings, 0 replies; 14+ messages in thread From: Li Zefan @ 2009-02-26 8:52 UTC (permalink / raw) To: Dhaval Giani Cc: Bharata B Rao, linux-kernel, Balaji Rao, Balbir Singh, Paul Menage, Andrew Morton, Ingo Molnar, Peter Zijlstra > I believe hierarchical accounting is natural. Since the CPU controller > divides its bandwidth as per hierarchy, it makes sense for the > accounting to also be hierarchical. I guess this is a reasonable > compromise. But how do you still check if the group you are accounting > for is the root group or not? (Unless I am missing something obvious) > As easy as: if (cgrp->parent == NULL) or if (cgrp == cgrp->top_cgroup) ^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC PATCH 2/2] Add per-cgroup CPU controller statistics 2009-02-25 10:57 [RFC PATCH 0/2] CPU controller statistics - v5 Bharata B Rao 2009-02-25 10:58 ` [RFC PATCH 1/2] New cgroup subsystem API (->initialize()) Bharata B Rao @ 2009-02-25 10:59 ` Bharata B Rao 2009-02-25 11:04 ` Peter Zijlstra 1 sibling, 1 reply; 14+ messages in thread From: Bharata B Rao @ 2009-02-25 10:59 UTC (permalink / raw) To: linux-kernel Cc: Balaji Rao, Dhaval Giani, Balbir Singh, Li Zefan, Paul Menage, Andrew Morton, Ingo Molnar, Peter Zijlstra From: Balaji Rao <balajirrao@gmail.com> sched: Add cpu controller statistics Add per-cgroup cpu controller statistics like system time and user time consumed by groups of tasks. Signed-off-by: Balaji Rao <balajirrao@gmail.com> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> --- kernel/sched.c | 106 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 106 insertions(+) --- a/kernel/sched.c +++ b/kernel/sched.c @@ -263,10 +263,25 @@ struct cfs_rq; static LIST_HEAD(task_groups); +#ifdef CONFIG_CGROUP_SCHED +enum cpu_cgroup_stat_index { + CPU_CGROUP_STAT_UTIME, /* User time of the task group */ + CPU_CGROUP_STAT_STIME, /* Kernel time of the task group */ + CPU_CGROUP_STAT_GTIME, /* Guest time of the task group */ + + CPU_CGROUP_STAT_NSTATS, +}; + +struct cpu_cgroup_stat { + struct percpu_counter cpustat[CPU_CGROUP_STAT_NSTATS]; +}; +#endif + /* task group related information */ struct task_group { #ifdef CONFIG_CGROUP_SCHED struct cgroup_subsys_state css; + struct cpu_cgroup_stat *stat; #endif #ifdef CONFIG_USER_SCHED @@ -4159,6 +4174,19 @@ unsigned long long task_delta_exec(struc return ns; } +#ifdef CONFIG_CGROUP_SCHED +static void account_task_group_time(struct task_struct *p, + enum cpu_cgroup_stat_index idx, int val) +{ + struct task_group *tg = task_group(p); + + if (likely(tg->stat)) + percpu_counter_add(&tg->stat->cpustat[idx], val); +} +#else +#define account_task_group_time(x, y, z) { 0; } +#endif + /* * 
Account user cpu time to a process. * @p: the process that the cpu time gets accounted to @@ -4182,6 +4210,10 @@ void account_user_time(struct task_struc cpustat->nice = cputime64_add(cpustat->nice, tmp); else cpustat->user = cputime64_add(cpustat->user, tmp); + + account_task_group_time(p, CPU_CGROUP_STAT_UTIME, + cputime_to_msecs(cputime)); + /* Account for user time used */ acct_update_integrals(p); } @@ -4206,6 +4238,9 @@ static void account_guest_time(struct ta account_group_user_time(p, cputime); p->gtime = cputime_add(p->gtime, cputime); + account_task_group_time(p, CPU_CGROUP_STAT_GTIME, + cputime_to_msecs(cputime)); + /* Add guest time to cpustat. */ cpustat->user = cputime64_add(cpustat->user, tmp); cpustat->guest = cputime64_add(cpustat->guest, tmp); @@ -4243,6 +4278,8 @@ void account_system_time(struct task_str else cpustat->system = cputime64_add(cpustat->system, tmp); + account_task_group_time(p, CPU_CGROUP_STAT_STIME, + cputime_to_msecs(cputime)); /* Account for system time used */ acct_update_integrals(p); } @@ -9290,6 +9327,7 @@ static struct cgroup_subsys_state * cpu_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp) { struct task_group *tg, *parent; + int i; if (!cgrp->parent) { /* This is early initialization for the top cgroup */ @@ -9301,6 +9339,15 @@ cpu_cgroup_create(struct cgroup_subsys * if (IS_ERR(tg)) return ERR_PTR(-ENOMEM); + tg->stat = kmalloc(sizeof(struct cpu_cgroup_stat), GFP_KERNEL); + if (!tg->stat) { + sched_destroy_group(tg); + return ERR_PTR(-ENOMEM); + } + + for (i = 0; i < CPU_CGROUP_STAT_NSTATS; i++) + percpu_counter_init(&tg->stat->cpustat[i], 0); + return &tg->css; } @@ -9308,6 +9355,13 @@ static void cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp) { struct task_group *tg = cgroup_tg(cgrp); + int i; + + if (tg->stat) { + for (i = 0; i < CPU_CGROUP_STAT_NSTATS; i++) + percpu_counter_destroy(&tg->stat->cpustat[i]); + kfree(tg->stat); + } sched_destroy_group(tg); } @@ -9336,6 +9390,53 @@ 
cpu_cgroup_attach(struct cgroup_subsys * sched_move_task(tsk); } +static void cpu_cgroup_initialize(void) +{ + int i; + struct cpu_cgroup_stat *stat; + + stat = kmalloc(sizeof(struct cpu_cgroup_stat) , GFP_KERNEL); + if (!stat) + return; + + for (i = 0; i < CPU_CGROUP_STAT_NSTATS; i++) + percpu_counter_init(&stat->cpustat[i], 0); + init_task_group.stat = stat; +} + +static s64 cpu_cgroup_read_stat(struct cpu_cgroup_stat *stat, + enum cpu_cgroup_stat_index idx) +{ + if (stat) + return percpu_counter_read(&stat->cpustat[idx]); + return 0; +} + +static const struct cpu_cgroup_stat_desc { + const char *msg; + u64 unit; +} cpu_cgroup_stat_desc[] = { + [CPU_CGROUP_STAT_UTIME] = { "utime", 1, }, + [CPU_CGROUP_STAT_STIME] = { "stime", 1, }, + [CPU_CGROUP_STAT_GTIME] = { "gtime", 1, }, +}; + +static int cpu_cgroup_stats_show(struct cgroup *cgrp, struct cftype *cft, + struct cgroup_map_cb *cb) +{ + struct task_group *tg = cgroup_tg(cgrp); + struct cpu_cgroup_stat *stat = tg->stat; + int i; + + for (i = 0; i < CPU_CGROUP_STAT_NSTATS; i++) { + s64 val; + val = cpu_cgroup_read_stat(stat, i); + val *= cpu_cgroup_stat_desc[i].unit; + cb->fill(cb, cpu_cgroup_stat_desc[i].msg, val); + } + return 0; +} + #ifdef CONFIG_FAIR_GROUP_SCHED static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype, u64 shareval) @@ -9395,6 +9496,10 @@ static struct cftype cpu_files[] = { .write_u64 = cpu_rt_period_write_uint, }, #endif + { + .name = "stat", + .read_map = cpu_cgroup_stats_show, + }, }; static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont) @@ -9409,6 +9514,7 @@ struct cgroup_subsys cpu_cgroup_subsys = .can_attach = cpu_cgroup_can_attach, .attach = cpu_cgroup_attach, .populate = cpu_cgroup_populate, + .initialize = cpu_cgroup_initialize, .subsys_id = cpu_cgroup_subsys_id, .early_init = 1, }; ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] Add per-cgroup CPU controller statistics 2009-02-25 10:59 ` [RFC PATCH 2/2] Add per-cgroup CPU controller statistics Bharata B Rao @ 2009-02-25 11:04 ` Peter Zijlstra 2009-02-25 11:20 ` Bharata B Rao 0 siblings, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2009-02-25 11:04 UTC (permalink / raw) To: bharata Cc: linux-kernel, Balaji Rao, Dhaval Giani, Balbir Singh, Li Zefan, Paul Menage, Andrew Morton, Ingo Molnar On Wed, 2009-02-25 at 16:29 +0530, Bharata B Rao wrote: > From: Balaji Rao <balajirrao@gmail.com> > > sched: Add cpu controller statistics > > Add per-cgroup cpu controller statistics like system time and user time > consumed by groups of tasks. Do we want this unconditionally? > Signed-off-by: Balaji Rao <balajirrao@gmail.com> > Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> > --- > kernel/sched.c | 106 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 106 insertions(+) > > --- a/kernel/sched.c > +++ b/kernel/sched.c > @@ -263,10 +263,25 @@ struct cfs_rq; > > static LIST_HEAD(task_groups); > > +#ifdef CONFIG_CGROUP_SCHED > +enum cpu_cgroup_stat_index { > + CPU_CGROUP_STAT_UTIME, /* User time of the task group */ > + CPU_CGROUP_STAT_STIME, /* Kernel time of the task group */ > + CPU_CGROUP_STAT_GTIME, /* Guest time of the task group */ > + > + CPU_CGROUP_STAT_NSTATS, > +}; > + > +struct cpu_cgroup_stat { > + struct percpu_counter cpustat[CPU_CGROUP_STAT_NSTATS]; > +}; > +#endif > + > /* task group related information */ > struct task_group { > #ifdef CONFIG_CGROUP_SCHED > struct cgroup_subsys_state css; > + struct cpu_cgroup_stat *stat; > #endif > > #ifdef CONFIG_USER_SCHED > @@ -4159,6 +4174,19 @@ unsigned long long task_delta_exec(struc > return ns; > } > > +#ifdef CONFIG_CGROUP_SCHED > +static void account_task_group_time(struct task_struct *p, > + enum cpu_cgroup_stat_index idx, int val) > +{ > + struct task_group *tg = task_group(p); > + > + if (likely(tg->stat)) > + 
percpu_counter_add(&tg->stat->cpustat[idx], val); > +} > +#else > +#define account_task_group_time(x, y, z) { 0; } inline please, so we get argument validation. > +#endif > + > /* > * Account user cpu time to a process. > * @p: the process that the cpu time gets accounted to > @@ -4182,6 +4210,10 @@ void account_user_time(struct task_struc > cpustat->nice = cputime64_add(cpustat->nice, tmp); > else > cpustat->user = cputime64_add(cpustat->user, tmp); > + > + account_task_group_time(p, CPU_CGROUP_STAT_UTIME, > + cputime_to_msecs(cputime)); > + > /* Account for user time used */ > acct_update_integrals(p); > } > @@ -4206,6 +4238,9 @@ static void account_guest_time(struct ta > account_group_user_time(p, cputime); > p->gtime = cputime_add(p->gtime, cputime); > > + account_task_group_time(p, CPU_CGROUP_STAT_GTIME, > + cputime_to_msecs(cputime)); > + > /* Add guest time to cpustat. */ > cpustat->user = cputime64_add(cpustat->user, tmp); > cpustat->guest = cputime64_add(cpustat->guest, tmp); > @@ -4243,6 +4278,8 @@ void account_system_time(struct task_str > else > cpustat->system = cputime64_add(cpustat->system, tmp); > > + account_task_group_time(p, CPU_CGROUP_STAT_STIME, > + cputime_to_msecs(cputime)); > /* Account for system time used */ > acct_update_integrals(p); > } > @@ -9290,6 +9327,7 @@ static struct cgroup_subsys_state * > cpu_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp) > { > struct task_group *tg, *parent; > + int i; > > if (!cgrp->parent) { > /* This is early initialization for the top cgroup */ > @@ -9301,6 +9339,15 @@ cpu_cgroup_create(struct cgroup_subsys * > if (IS_ERR(tg)) > return ERR_PTR(-ENOMEM); > > + tg->stat = kmalloc(sizeof(struct cpu_cgroup_stat), GFP_KERNEL); > + if (!tg->stat) { > + sched_destroy_group(tg); > + return ERR_PTR(-ENOMEM); > + } > + > + for (i = 0; i < CPU_CGROUP_STAT_NSTATS; i++) > + percpu_counter_init(&tg->stat->cpustat[i], 0); percpu_counter_init() can fail with -ENOMEM. 
> return &tg->css; > } > > @@ -9308,6 +9355,13 @@ static void > cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp) > { > struct task_group *tg = cgroup_tg(cgrp); > + int i; > + > + if (tg->stat) { > + for (i = 0; i < CPU_CGROUP_STAT_NSTATS; i++) > + percpu_counter_destroy(&tg->stat->cpustat[i]); > + kfree(tg->stat); > + } > > sched_destroy_group(tg); > } > @@ -9336,6 +9390,53 @@ cpu_cgroup_attach(struct cgroup_subsys * > sched_move_task(tsk); > } > > +static void cpu_cgroup_initialize(void) > +{ > + int i; > + struct cpu_cgroup_stat *stat; > + > + stat = kmalloc(sizeof(struct cpu_cgroup_stat) , GFP_KERNEL); > + if (!stat) > + return; > + > + for (i = 0; i < CPU_CGROUP_STAT_NSTATS; i++) > + percpu_counter_init(&stat->cpustat[i], 0); > + init_task_group.stat = stat; > +} > + > +static s64 cpu_cgroup_read_stat(struct cpu_cgroup_stat *stat, > + enum cpu_cgroup_stat_index idx) > +{ > + if (stat) > + return percpu_counter_read(&stat->cpustat[idx]); > + return 0; > +} > + > +static const struct cpu_cgroup_stat_desc { > + const char *msg; > + u64 unit; > +} cpu_cgroup_stat_desc[] = { > + [CPU_CGROUP_STAT_UTIME] = { "utime", 1, }, > + [CPU_CGROUP_STAT_STIME] = { "stime", 1, }, > + [CPU_CGROUP_STAT_GTIME] = { "gtime", 1, }, > +}; > + > +static int cpu_cgroup_stats_show(struct cgroup *cgrp, struct cftype *cft, > + struct cgroup_map_cb *cb) > +{ > + struct task_group *tg = cgroup_tg(cgrp); > + struct cpu_cgroup_stat *stat = tg->stat; > + int i; > + > + for (i = 0; i < CPU_CGROUP_STAT_NSTATS; i++) { > + s64 val; > + val = cpu_cgroup_read_stat(stat, i); > + val *= cpu_cgroup_stat_desc[i].unit; > + cb->fill(cb, cpu_cgroup_stat_desc[i].msg, val); > + } > + return 0; > +} > + > #ifdef CONFIG_FAIR_GROUP_SCHED > static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype, > u64 shareval) > @@ -9395,6 +9496,10 @@ static struct cftype cpu_files[] = { > .write_u64 = cpu_rt_period_write_uint, > }, > #endif > + { > + .name = "stat", > + .read_map = 
cpu_cgroup_stats_show, > + }, > }; > > static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont) > @@ -9409,6 +9514,7 @@ struct cgroup_subsys cpu_cgroup_subsys = > .can_attach = cpu_cgroup_can_attach, > .attach = cpu_cgroup_attach, > .populate = cpu_cgroup_populate, > + .initialize = cpu_cgroup_initialize, > .subsys_id = cpu_cgroup_subsys_id, > .early_init = 1, > }; ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] Add per-cgroup CPU controller statistics 2009-02-25 11:04 ` Peter Zijlstra @ 2009-02-25 11:20 ` Bharata B Rao 2009-02-25 11:24 ` Peter Zijlstra 0 siblings, 1 reply; 14+ messages in thread From: Bharata B Rao @ 2009-02-25 11:20 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Balaji Rao, Dhaval Giani, Balbir Singh, Li Zefan, Paul Menage, Andrew Morton, Ingo Molnar On Wed, Feb 25, 2009 at 12:04:10PM +0100, Peter Zijlstra wrote: > On Wed, 2009-02-25 at 16:29 +0530, Bharata B Rao wrote: > > From: Balaji Rao <balajirrao@gmail.com> > > > > sched: Add cpu controller statistics > > > > Add per-cgroup cpu controller statistics like system time and user time > > consumed by groups of tasks. > > Do we want this unconditionally? Not sure if I understand you correctly here, but we are collecting stats only when CONFIG_CGROUP_SCHED is enabled. Or do you mean we need a config option to enable stats collection ? If so, the memory controller already provides stats unconditionally. > > > > +#ifdef CONFIG_CGROUP_SCHED > > +static void account_task_group_time(struct task_struct *p, > > + enum cpu_cgroup_stat_index idx, int val) > > +{ > > + struct task_group *tg = task_group(p); > > + > > + if (likely(tg->stat)) > > + percpu_counter_add(&tg->stat->cpustat[idx], val); > > +} > > +#else > > +#define account_task_group_time(x, y, z) { 0; } > > inline please, so we get argument validation. Actually wanted to inline, but one of the arguments (cpu_cgroup_stat_index) is defined only under CONFIG_CGROUP_SCHED, hence couldn't have argument validation here. > > + > > + for (i = 0; i < CPU_CGROUP_STAT_NSTATS; i++) > > + percpu_counter_init(&tg->stat->cpustat[i], 0); > > percpu_counter_init() can fail with -ENOMEM. Ok, will fix. Thanks for your review, Peter. Regards, Bharata. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] Add per-cgroup CPU controller statistics
  2009-02-25 11:20 ` Bharata B Rao
@ 2009-02-25 11:24 ` Peter Zijlstra
  2009-02-25 11:41 ` Bharata B Rao
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2009-02-25 11:24 UTC (permalink / raw)
To: bharata
Cc: linux-kernel, Balaji Rao, Dhaval Giani, Balbir Singh, Li Zefan,
    Paul Menage, Andrew Morton, Ingo Molnar

On Wed, 2009-02-25 at 16:50 +0530, Bharata B Rao wrote:
> On Wed, Feb 25, 2009 at 12:04:10PM +0100, Peter Zijlstra wrote:
> > On Wed, 2009-02-25 at 16:29 +0530, Bharata B Rao wrote:
> > > From: Balaji Rao <balajirrao@gmail.com>
> > >
> > > sched: Add cpu controller statistics
> > >
> > > Add per-cgroup cpu controller statistics like system time and user time
> > > consumed by groups of tasks.
> >
> > Do we want this unconditionally?
>
> Not sure if I understand you correctly here, but we collect stats
> only when CONFIG_CGROUP_SCHED is enabled.
>
> Or do you mean we need a config option to enable stats collection?
> If so, note that the memory controller already provides stats
> unconditionally.

OK, I guess so ;-)

> > > +#ifdef CONFIG_CGROUP_SCHED
> > > +static void account_task_group_time(struct task_struct *p,
> > > +		enum cpu_cgroup_stat_index idx, int val)
> > > +{
> > > +	struct task_group *tg = task_group(p);
> > > +
> > > +	if (likely(tg->stat))
> > > +		percpu_counter_add(&tg->stat->cpustat[idx], val);
> > > +}
> > > +#else
> > > +#define account_task_group_time(x, y, z) { 0; }
> >
> > inline please, so we get argument validation.
>
> I actually wanted to inline it, but the type of one of the arguments
> (enum cpu_cgroup_stat_index) is defined only under CONFIG_CGROUP_SCHED.
> Hence we can't get argument validation here.

Just pull the enum outside of the ifdef?

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] Add per-cgroup CPU controller statistics
  2009-02-25 11:24 ` Peter Zijlstra
@ 2009-02-25 11:41 ` Bharata B Rao
  0 siblings, 0 replies; 14+ messages in thread
From: Bharata B Rao @ 2009-02-25 11:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Balaji Rao, Dhaval Giani, Balbir Singh, Li Zefan,
    Paul Menage, Andrew Morton, Ingo Molnar

On Wed, Feb 25, 2009 at 12:24:58PM +0100, Peter Zijlstra wrote:
> On Wed, 2009-02-25 at 16:50 +0530, Bharata B Rao wrote:
> > On Wed, Feb 25, 2009 at 12:04:10PM +0100, Peter Zijlstra wrote:
> > > On Wed, 2009-02-25 at 16:29 +0530, Bharata B Rao wrote:
> > > >
> > > > +#ifdef CONFIG_CGROUP_SCHED
> > > > +static void account_task_group_time(struct task_struct *p,
> > > > +		enum cpu_cgroup_stat_index idx, int val)
> > > > +{
> > > > +	struct task_group *tg = task_group(p);
> > > > +
> > > > +	if (likely(tg->stat))
> > > > +		percpu_counter_add(&tg->stat->cpustat[idx], val);
> > > > +}
> > > > +#else
> > > > +#define account_task_group_time(x, y, z) { 0; }
> > >
> > > inline please, so we get argument validation.
> >
> > I actually wanted to inline it, but the type of one of the arguments
> > (enum cpu_cgroup_stat_index) is defined only under CONFIG_CGROUP_SCHED.
> > Hence we can't get argument validation here.
>
> Just pull the enum outside of the ifdef?

Ok, will do it in the next iteration if that's acceptable and argument
validation is important.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 14+ messages in thread
end of thread, other threads:[~2009-02-26 10:11 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2009-02-25 10:57 [RFC PATCH 0/2] CPU controller statistics - v5 Bharata B Rao
2009-02-25 10:58 ` [RFC PATCH 1/2] New cgroup subsystem API (->initialize()) Bharata B Rao
2009-02-26  2:55   ` Li Zefan
2009-02-26  7:52     ` Bharata B Rao
2009-02-26  8:11       ` Li Zefan
2009-02-26  8:20         ` KAMEZAWA Hiroyuki
2009-02-26 10:12           ` Bharata B Rao
2009-02-26  8:48   ` Dhaval Giani
2009-02-26  8:52     ` Li Zefan
2009-02-25 10:59 ` [RFC PATCH 2/2] Add per-cgroup CPU controller statistics Bharata B Rao
2009-02-25 11:04   ` Peter Zijlstra
2009-02-25 11:20     ` Bharata B Rao
2009-02-25 11:24       ` Peter Zijlstra
2009-02-25 11:41         ` Bharata B Rao