* [PATCH 0/4] Enable cpu frequency and power tracking for cpuacct cgroup
@ 2010-05-19 1:30 Mike Chan
2010-05-19 1:30 ` [PATCH 1/4] scheduler: cpuacct: Enable platform hooks to track cpuusage for CPU frequencies Mike Chan
` (3 more replies)
0 siblings, 4 replies; 21+ messages in thread
From: Mike Chan @ 2010-05-19 1:30 UTC (permalink / raw)
Cc: khilman, menage, balbir, cpufreq, linux-kernel, linux-omap,
Mike Chan
Rebased onto linux-next.
This patch series introduces cpu frequency and power tracking for cpuacct
cgroups. A similar patch set was discussed a while back, and it was concluded
that, due to varying architectures (ppc, x86 with overboost), you cannot account
for frequencies and their power consumption generically in sched.c; thus we
have platform specific hooks that cpuacct can call into (if available).
I've implemented all the necessary hooks for OMAP architecture as an example.
For the OMAP folks: I'm not sure what the state of the OPP API in mainline is,
as most of my original work was based on the omap-pm branch.
Mike Chan (4):
scheduler: cpuacct: Enable platform hooks to track cpuusage for CPU
frequencies
omap: cpu: Implement callbacks for cpu frequency tracking in cpuacct
scheduler: cpuacct: Enable platform callbacks for cpuacct power
tracking
omap: cpu: Power tracking support for cgroup cpuacct
Documentation/cgroups/cpuacct.txt | 7 ++
arch/arm/plat-omap/cpu-omap.c | 87 ++++++++++++++++++++++++++++-
arch/arm/plat-omap/include/plat/omap-pm.h | 1 +
include/linux/cpuacct.h | 43 ++++++++++++++
kernel/sched.c | 69 +++++++++++++++++++++++
5 files changed, 206 insertions(+), 1 deletions(-)
create mode 100644 include/linux/cpuacct.h
Signed-off-by: Mike Chan <mike@android.com>
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 1/4] scheduler: cpuacct: Enable platform hooks to track cpuusage for CPU frequencies
2010-05-19 1:30 [PATCH 0/4] Enable cpu frequency and power tracking for cpuacct cgroup Mike Chan
@ 2010-05-19 1:30 ` Mike Chan
2010-05-19 1:30 ` [PATCH 2/4] omap: cpu: Implement callbacks for cpu frequency tracking in cpuacct Mike Chan
` (2 subsequent siblings)
3 siblings, 0 replies; 21+ messages in thread
From: Mike Chan @ 2010-05-19 1:30 UTC (permalink / raw)
Cc: khilman, menage, balbir, cpufreq, linux-kernel, linux-omap,
Mike Chan
Introduce new platform callback hooks for cpuacct for tracking CPU frequencies.
Not all platforms / architectures have a CPU_FREQ_TABLE defined
for CPU transition speeds. In order to track time spent at various
CPU frequencies, we enable platform callbacks from cpuacct for this accounting.
Architectures that support overclock boosting, or don't have pre-defined
frequency tables, can implement their own bucketing system that makes sense
given their cpufreq scaling abilities.
New file:
cpuacct.cpufreq reports the CPU time (in nanoseconds) spent at each CPU
frequency.
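For illustration, a platform would wire its accounting into these hooks
roughly as follows (a minimal sketch, not part of this series; NR_FREQS,
cur_freq_index and freq_khz[] are hypothetical placeholders -- the real
OMAP implementation follows in patch 2/4):

static void my_freq_init(void **cpuacct_data)
{
	/* one zeroed u64 time bucket per supported frequency */
	*cpuacct_data = kzalloc(NR_FREQS * sizeof(u64), GFP_KERNEL);
}

/* called from cpuacct_charge(), i.e. with rq->lock held */
static void my_freq_charge(void *cpuacct_data, u64 cputime, unsigned int cpu)
{
	u64 *buckets = cpuacct_data;

	buckets[cur_freq_index] += cputime;
}

static void my_freq_show(void *cpuacct_data, struct cgroup_map_cb *cb)
{
	u64 *buckets = cpuacct_data;
	char key[16];
	int i;

	/* one "frequency: time" pair per bucket in cpuacct.cpufreq */
	for (i = 0; i < NR_FREQS; i++) {
		snprintf(key, sizeof(key), "%u", freq_khz[i]);
		cb->fill(cb, key, buckets[i]);
	}
}

static struct cpuacct_cpufreq_calls my_cpuacct_calls = {
	.init	= my_freq_init,
	.charge	= my_freq_charge,
	.show	= my_freq_show,
};

static int __init my_cpuacct_init(void)
{
	/* register early so the root cgroup gets initialized too */
	cpuacct_register_cpufreq(&my_cpuacct_calls);
	return 0;
}
early_initcall(my_cpuacct_init);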
Signed-off-by: Mike Chan <mike@android.com>
---
Documentation/cgroups/cpuacct.txt | 4 +++
include/linux/cpuacct.h | 41 +++++++++++++++++++++++++++++++
kernel/sched.c | 49 +++++++++++++++++++++++++++++++++++++
3 files changed, 94 insertions(+), 0 deletions(-)
create mode 100644 include/linux/cpuacct.h
diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
index 8b93094..600d2d0 100644
--- a/Documentation/cgroups/cpuacct.txt
+++ b/Documentation/cgroups/cpuacct.txt
@@ -40,6 +40,10 @@ system: Time spent by tasks of the cgroup in kernel mode.
user and system are in USER_HZ unit.
+cpuacct.cpufreq file gives CPU time (in nanoseconds) spent at each CPU
+frequency. Platform hooks must be implemented in order to properly track
+time at each CPU frequency.
+
cpuacct controller uses percpu_counter interface to collect user and
system times. This has two side effects:
diff --git a/include/linux/cpuacct.h b/include/linux/cpuacct.h
new file mode 100644
index 0000000..9ff479e
--- /dev/null
+++ b/include/linux/cpuacct.h
@@ -0,0 +1,41 @@
+/* include/linux/cpuacct.h
+ *
+ * Copyright (C) 2010 Google, Inc.
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#ifndef _CPUACCT_H_
+#define _CPUACCT_H_
+
+#include <linux/cgroup.h>
+
+#ifdef CONFIG_CGROUPS
+
+/*
+ * Platform specific CPU frequency hooks for cpuacct. These functions are
+ * called from the scheduler.
+ */
+struct cpuacct_cpufreq_calls {
+ /*
+ * Platforms can take advantage of this data and use
+ * per-cpu allocations if necessary.
+ */
+ void (*init) (void **cpuacct_data);
+ void (*charge) (void *cpuacct_data, u64 cputime, unsigned int cpu);
+ void (*show) (void *cpuacct_data, struct cgroup_map_cb *cb);
+};
+
+int cpuacct_register_cpufreq(struct cpuacct_cpufreq_calls *fn);
+
+#endif /* CONFIG_CGROUPS */
+
+#endif /* _CPUACCT_H_ */
diff --git a/kernel/sched.c b/kernel/sched.c
index 34a9722..6b6c45a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -72,6 +72,7 @@
#include <linux/ctype.h>
#include <linux/ftrace.h>
#include <linux/slab.h>
+#include <linux/cpuacct.h>
#include <asm/tlb.h>
#include <asm/irq_regs.h>
@@ -8638,8 +8639,30 @@ struct cpuacct {
u64 __percpu *cpuusage;
struct percpu_counter cpustat[CPUACCT_STAT_NSTATS];
struct cpuacct *parent;
+ struct cpuacct_cpufreq_calls *cpufreq_fn;
+ void *cpuacct_data;
};
+static struct cpuacct *cpuacct_root;
+
+/* Default calls for cpufreq accounting */
+static struct cpuacct_cpufreq_calls *cpuacct_cpufreq;
+int cpuacct_register_cpufreq(struct cpuacct_cpufreq_calls *fn)
+{
+ cpuacct_cpufreq = fn;
+
+ /*
+ * Root node is created before platform can register callbacks,
+ * initialize here.
+ */
+ if (cpuacct_root && fn) {
+ cpuacct_root->cpufreq_fn = fn;
+ if (fn->init)
+ fn->init(&cpuacct_root->cpuacct_data);
+ }
+ return 0;
+}
+
struct cgroup_subsys cpuacct_subsys;
/* return cpu accounting group corresponding to this container */
@@ -8674,8 +8697,16 @@ static struct cgroup_subsys_state *cpuacct_create(
if (percpu_counter_init(&ca->cpustat[i], 0))
goto out_free_counters;
+ ca->cpufreq_fn = cpuacct_cpufreq;
+
+ /* If available, have platform code initialize cpu frequency table */
+ if (ca->cpufreq_fn && ca->cpufreq_fn->init)
+ ca->cpufreq_fn->init(&ca->cpuacct_data);
+
if (cgrp->parent)
ca->parent = cgroup_ca(cgrp->parent);
+ else
+ cpuacct_root = ca;
return &ca->css;
@@ -8803,6 +8834,16 @@ static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
return 0;
}
+static int cpuacct_cpufreq_show(struct cgroup *cgrp, struct cftype *cft,
+ struct cgroup_map_cb *cb)
+{
+ struct cpuacct *ca = cgroup_ca(cgrp);
+ if (ca->cpufreq_fn && ca->cpufreq_fn->show)
+ ca->cpufreq_fn->show(ca->cpuacct_data, cb);
+
+ return 0;
+}
+
static struct cftype files[] = {
{
.name = "usage",
@@ -8817,6 +8858,10 @@ static struct cftype files[] = {
.name = "stat",
.read_map = cpuacct_stats_show,
},
+ {
+ .name = "cpufreq",
+ .read_map = cpuacct_cpufreq_show,
+ },
};
static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
@@ -8846,6 +8891,10 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
for (; ca; ca = ca->parent) {
u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
*cpuusage += cputime;
+
+ /* Call back into platform code to account for CPU speeds */
+ if (ca->cpufreq_fn && ca->cpufreq_fn->charge)
+ ca->cpufreq_fn->charge(ca->cpuacct_data, cputime, cpu);
}
rcu_read_unlock();
--
1.7.0.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 2/4] omap: cpu: Implement callbacks for cpu frequency tracking in cpuacct
2010-05-19 1:30 [PATCH 0/4] Enable cpu frequency and power tracking for cpuacct cgroup Mike Chan
2010-05-19 1:30 ` [PATCH 1/4] scheduler: cpuacct: Enable platform hooks to track cpuusage for CPU frequencies Mike Chan
@ 2010-05-19 1:30 ` Mike Chan
2010-05-19 1:30 ` [PATCH 3/4] scheduler: cpuacct: Enable platform callbacks for cpuacct power tracking Mike Chan
2010-05-19 1:30 ` [PATCH 4/4] omap: cpu: Power tracking support for cgroup cpuacct Mike Chan
3 siblings, 0 replies; 21+ messages in thread
From: Mike Chan @ 2010-05-19 1:30 UTC (permalink / raw)
Cc: khilman, menage, balbir, cpufreq, linux-kernel, linux-omap,
Mike Chan
Implement OMAP platform specific scheduler callbacks for tracking
cpu frequencies per cpuacct cgroup.
Signed-off-by: Mike Chan <mike@android.com>
---
arch/arm/plat-omap/cpu-omap.c | 66 ++++++++++++++++++++++++++++++++++++++++-
1 files changed, 65 insertions(+), 1 deletions(-)
diff --git a/arch/arm/plat-omap/cpu-omap.c b/arch/arm/plat-omap/cpu-omap.c
index 6d3d333..d27234d 100644
--- a/arch/arm/plat-omap/cpu-omap.c
+++ b/arch/arm/plat-omap/cpu-omap.c
@@ -21,6 +21,7 @@
#include <linux/err.h>
#include <linux/clk.h>
#include <linux/io.h>
+#include <linux/cpuacct.h>
#include <mach/hardware.h>
#include <plat/clock.h>
@@ -38,6 +39,10 @@ static struct cpufreq_frequency_table *freq_table;
static struct clk *mpu_clk;
+#ifdef CONFIG_CGROUP_CPUACCT
+static int freq_index;
+#endif
+
/* TODO: Add support for SDRAM timing changes */
int omap_verify_speed(struct cpufreq_policy *policy)
@@ -96,6 +101,11 @@ static int omap_target(struct cpufreq_policy *policy,
freqs.old, freqs.new);
#endif
ret = clk_set_rate(mpu_clk, freqs.new * 1000);
+#ifdef CONFIG_CGROUP_CPUACCT
+ /* Update freq_index before cpufreq transition post notification. */
+ cpufreq_frequency_table_target(policy, freq_table, freqs.new,
+ CPUFREQ_RELATION_L, &freq_index);
+#endif
cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
return ret;
@@ -125,7 +135,14 @@ static int __init omap_cpu_init(struct cpufreq_policy *policy)
policy->cpuinfo.max_freq = clk_round_rate(mpu_clk,
VERY_HI_RATE) / 1000;
}
-
+#ifdef CONFIG_CGROUP_CPUACCT
+ /*
+ * Update freq_index. Since we are using the exact frequency,
+ * we can use any relation.
+ */
+ cpufreq_frequency_table_target(policy, freq_table, policy->cur,
+ CPUFREQ_RELATION_L, &freq_index);
+#endif
/* FIXME: what's the actual transition time? */
policy->cpuinfo.transition_latency = 300 * 1000;
@@ -169,3 +186,50 @@ arch_initcall(omap_cpufreq_init);
* cpufreq_frequency_table_put_attr()
*/
+#ifdef CONFIG_CGROUP_CPUACCT
+/*
+ * Omap platform calls for cpuacct frequency accounting.
+ */
+
+static void omap_cpuacct_freq_init(void **cpuacct_data)
+{
+ *cpuacct_data = kzalloc(sizeof(u64) * MAX_VDD1_OPP,
+ GFP_KERNEL);
+}
+
+/* Called with rcu_read_lock() held. */
+static void omap_cpuacct_freq_charge(void *cpuacct_data, u64 cputime, unsigned int cpu)
+{
+ u64 *cpuacct_freq = cpuacct_data;
+ if (freq_index < 0)
+ return;
+
+ cpuacct_freq[freq_index] += cputime;
+}
+
+static void omap_cpuacct_freq_show(void *cpuacct_data, struct cgroup_map_cb *cb)
+{
+ int i;
+ char buf[32];
+ u64 *cpuacct_freq = cpuacct_data;
+ for (i = 0; freq_table[i].frequency != CPUFREQ_TABLE_END; i++) {
+ snprintf(buf, sizeof(buf), "%u", freq_table[i].frequency);
+ cb->fill(cb, buf, cpuacct_freq[i]);
+ }
+}
+
+static struct cpuacct_cpufreq_calls omap_cpuacct_cpufreq = {
+ .init = omap_cpuacct_freq_init,
+ .charge = omap_cpuacct_freq_charge,
+ .show = omap_cpuacct_freq_show,
+};
+
+static int __init omap_cpuacct_init(void)
+{
+ freq_index = -1;
+ cpuacct_register_cpufreq(&omap_cpuacct_cpufreq);
+ return 0;
+}
+
+early_initcall(omap_cpuacct_init);
+#endif
--
1.7.0.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 3/4] scheduler: cpuacct: Enable platform callbacks for cpuacct power tracking
2010-05-19 1:30 [PATCH 0/4] Enable cpu frequency and power tracking for cpuacct cgroup Mike Chan
2010-05-19 1:30 ` [PATCH 1/4] scheduler: cpuacct: Enable platform hooks to track cpuusage for CPU frequencies Mike Chan
2010-05-19 1:30 ` [PATCH 2/4] omap: cpu: Implement callbacks for cpu frequency tracking in cpuacct Mike Chan
@ 2010-05-19 1:30 ` Mike Chan
2010-05-19 9:30 ` [PATCH] scheduler: Extract cgroups_cpuaccount code from sched.c into own file Thomas Renninger
2010-05-19 1:30 ` [PATCH 4/4] omap: cpu: Power tracking support for cgroup cpuacct Mike Chan
3 siblings, 1 reply; 21+ messages in thread
From: Mike Chan @ 2010-05-19 1:30 UTC (permalink / raw)
Cc: khilman, menage, balbir, cpufreq, linux-kernel, linux-omap,
Mike Chan
Platforms must register a cpu power function that returns power in
milliWatt seconds.
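As a rough sketch of the callback contract (again with NR_FREQS and
power_mw[] as hypothetical placeholders; the real OMAP implementation
follows in patch 4/4), the platform folds its per-frequency time buckets
(in nanoseconds) into a single energy total:

/* returns milliWatt seconds: sum of time[ns] * power[mW] / NSEC_PER_SEC */
static u64 my_power_usage(void *cpuacct_data)
{
	u64 *buckets = cpuacct_data;
	u64 total = 0;
	int i;

	for (i = 0; i < NR_FREQS; i++)
		total += buckets[i] * power_mw[i];

	return div64_u64(total, NSEC_PER_SEC);
}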
Signed-off-by: Mike Chan <mike@android.com>
---
Documentation/cgroups/cpuacct.txt | 3 +++
include/linux/cpuacct.h | 4 +++-
kernel/sched.c | 24 ++++++++++++++++++++++--
3 files changed, 28 insertions(+), 3 deletions(-)
diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
index 600d2d0..84e471b 100644
--- a/Documentation/cgroups/cpuacct.txt
+++ b/Documentation/cgroups/cpuacct.txt
@@ -44,6 +44,9 @@ cpuacct.cpufreq file gives CPU time (in nanoseconds) spent at each CPU
frequency. Platform hooks must be implemented in order to properly track
time at each CPU frequency.
+cpuacct.power file gives CPU power consumed (in milliWatt seconds). The
+platform must provide and implement the power callback function.
+
cpuacct controller uses percpu_counter interface to collect user and
system times. This has two side effects:
diff --git a/include/linux/cpuacct.h b/include/linux/cpuacct.h
index 9ff479e..effe842 100644
--- a/include/linux/cpuacct.h
+++ b/include/linux/cpuacct.h
@@ -31,7 +31,9 @@ struct cpuacct_cpufreq_calls {
*/
void (*init) (void **cpuacct_data);
void (*charge) (void *cpuacct_data, u64 cputime, unsigned int cpu);
- void (*show) (void *cpuacct_data, struct cgroup_map_cb *cb);
+ void (*cpufreq_show) (void *cpuacct_data, struct cgroup_map_cb *cb);
+ /* Returns power consumed in milliWatt seconds */
+ u64 (*power_usage) (void *cpuacct_data);
};
int cpuacct_register_cpufreq(struct cpuacct_cpufreq_calls *fn);
diff --git a/kernel/sched.c b/kernel/sched.c
index 6b6c45a..d55d8af 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8838,12 +8838,28 @@ static int cpuacct_cpufreq_show(struct cgroup *cgrp, struct cftype *cft,
struct cgroup_map_cb *cb)
{
struct cpuacct *ca = cgroup_ca(cgrp);
- if (ca->cpufreq_fn && ca->cpufreq_fn->show)
- ca->cpufreq_fn->show(ca->cpuacct_data, cb);
+ if (ca->cpufreq_fn && ca->cpufreq_fn->cpufreq_show)
+ ca->cpufreq_fn->cpufreq_show(ca->cpuacct_data, cb);
return 0;
}
+/* return total cpu power usage (milliWatt second) of a group */
+static u64 cpuacct_powerusage_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ int i;
+ struct cpuacct *ca = cgroup_ca(cgrp);
+ u64 totalpower = 0;
+
+ if (ca->cpufreq_fn && ca->cpufreq_fn->power_usage)
+ for_each_present_cpu(i) {
+ totalpower += ca->cpufreq_fn->power_usage(
+ ca->cpuacct_data);
+ }
+
+ return totalpower;
+}
+
static struct cftype files[] = {
{
.name = "usage",
@@ -8862,6 +8878,10 @@ static struct cftype files[] = {
.name = "cpufreq",
.read_map = cpuacct_cpufreq_show,
},
+ {
+ .name = "power",
+ .read_u64 = cpuacct_powerusage_read
+ },
};
static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
--
1.7.0.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 4/4] omap: cpu: Power tracking support for cgroup cpuacct
2010-05-19 1:30 [PATCH 0/4] Enable cpu frequency and power tracking for cpuacct cgroup Mike Chan
` (2 preceding siblings ...)
2010-05-19 1:30 ` [PATCH 3/4] scheduler: cpuacct: Enable platform callbacks for cpuacct power tracking Mike Chan
@ 2010-05-19 1:30 ` Mike Chan
2010-05-19 13:11 ` Nishanth Menon
3 siblings, 1 reply; 21+ messages in thread
From: Mike Chan @ 2010-05-19 1:30 UTC (permalink / raw)
Cc: khilman, menage, balbir, cpufreq, linux-kernel, linux-omap,
Mike Chan
Specify a new power field in struct omap_opp, which is
the power exported in milliWatts.
The power_usage function gives the power consumed in milliWatt seconds.
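For example, 2 s (2 * 10^9 ns) spent at an OPP whose power field is 600 mW
contributes 2 * 10^9 * 600 / 10^9 = 1200 milliWatt seconds to the total.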
Signed-off-by: Mike Chan <mike@android.com>
---
arch/arm/plat-omap/cpu-omap.c | 23 ++++++++++++++++++++++-
arch/arm/plat-omap/include/plat/omap-pm.h | 1 +
2 files changed, 23 insertions(+), 1 deletions(-)
diff --git a/arch/arm/plat-omap/cpu-omap.c b/arch/arm/plat-omap/cpu-omap.c
index d27234d..1381539 100644
--- a/arch/arm/plat-omap/cpu-omap.c
+++ b/arch/arm/plat-omap/cpu-omap.c
@@ -22,6 +22,7 @@
#include <linux/clk.h>
#include <linux/io.h>
#include <linux/cpuacct.h>
+#include <linux/math64.h>
#include <mach/hardware.h>
#include <plat/clock.h>
@@ -218,10 +219,30 @@ static void omap_cpuacct_freq_show(void *cpuacct_data, struct cgroup_map_cb *cb)
}
}
+
+/* Returns power (in milliWatt seconds) for cgroup */
+static u64 omap_cpuacct_power_usage(void *cpuacct_data)
+{
+ int i;
+ u64 *cpuacct_freq = cpuacct_data;
+ u64 totalpower = 0;
+
+ /* mpu_opps table starts at 1 */
+ for (i = 0; freq_table[i].frequency != CPUFREQ_TABLE_END; i++)
+ totalpower += cpuacct_freq[i] * mpu_opps[i + 1].power;
+
+ /* milliwatt seconds */
+ totalpower = div64_u64(totalpower, NSEC_PER_SEC);
+
+ return totalpower;
+}
+
+
static struct cpuacct_cpufreq_calls omap_cpuacct_cpufreq = {
.init = omap_cpuacct_freq_init,
.charge = omap_cpuacct_freq_charge,
- .show = omap_cpuacct_freq_show,
+ .cpufreq_show = omap_cpuacct_freq_show,
+ .power_usage = omap_cpuacct_power_usage,
};
static int __init omap_cpuacct_init(void)
diff --git a/arch/arm/plat-omap/include/plat/omap-pm.h b/arch/arm/plat-omap/include/plat/omap-pm.h
index 3ee41d7..8f2db6f 100644
--- a/arch/arm/plat-omap/include/plat/omap-pm.h
+++ b/arch/arm/plat-omap/include/plat/omap-pm.h
@@ -31,6 +31,7 @@ struct omap_opp {
unsigned long rate;
u8 opp_id;
u16 min_vdd;
+ unsigned long power; /* power consumed running at OPP in milliWatts */
};
extern struct omap_opp *mpu_opps;
--
1.7.0.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH] scheduler: Extract cgroups_cpuaccount code from sched.c into own file
2010-05-19 1:30 ` [PATCH 3/4] scheduler: cpuacct: Enable platform callbacks for cpuacct power tracking Mike Chan
@ 2010-05-19 9:30 ` Thomas Renninger
2010-05-19 9:37 ` Peter Zijlstra
2010-05-19 19:06 ` [PATCH] scheduler: Extract cgroups_cpuaccount code from sched.c into own file Mike Chan
0 siblings, 2 replies; 21+ messages in thread
From: Thomas Renninger @ 2010-05-19 9:30 UTC (permalink / raw)
To: Mike Chan, khilman
Cc: menage, balbir, cpufreq, linux-kernel, linux-omap, mingo, peterz,
lizf, containers
Hi,
thread topic was:
Re: [PATCH 3/4] scheduler: cpuacct: Enable platform callbacks for cpuacct power tracking
I bundled this patch together with a minor comment on Mike's patch.
This way the interested people in CC are kept in the loop.
Peter/Ingo: Can you pick up this cleanup if appropriate, please?
Shall I resend separately, or could you cut out the comments below?
On Wednesday 19 May 2010 03:30:19 Mike Chan wrote:
> Platforms must register a cpu power function that returns power in
> milliWatt seconds.
>
> Signed-off-by: Mike Chan <mike@android.com>
> ---
> Documentation/cgroups/cpuacct.txt | 3 +++
> include/linux/cpuacct.h | 4 +++-
> kernel/sched.c | 24 ++++++++++++++++++++++--
> 3 files changed, 28 insertions(+), 3 deletions(-)
...
> diff --git a/include/linux/cpuacct.h b/include/linux/cpuacct.h
> index 9ff479e..effe842 100644
> --- a/include/linux/cpuacct.h
> +++ b/include/linux/cpuacct.h
> @@ -31,7 +31,9 @@ struct cpuacct_cpufreq_calls {
This is a general cpuacct_charge interface, not cpufreq specific?
I'd call it "struct cpuacct_charge_calls".
Platforms can account C-states, frequency, power, whatever they like?
The latter two are implemented with your patches.
> */
> void (*init) (void **cpuacct_data);
> void (*charge) (void *cpuacct_data, u64 cputime, unsigned int cpu);
> - void (*show) (void *cpuacct_data, struct cgroup_map_cb *cb);
> + void (*cpufreq_show) (void *cpuacct_data, struct cgroup_map_cb *cb);
> + /* Returns power consumed in milliWatt seconds */
> + u64 (*power_usage) (void *cpuacct_data);
> };
> int cpuacct_register_cpufreq(struct cpuacct_cpufreq_calls *fn);
Same here, why not name it cpuacct_register_charge?
Possibly at other places too.
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 6b6c45a..d55d8af 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
Nothing to do with this patch, but I wonder why this is all in kernel/sched.c.
I tried a quick cleanup... it works.
What about the cleanup below?
Not sure which tree this should go through, but if the below is accepted,
would you mind rebasing your things against it?
Then it would already show up in cgroup_cpuaccount.c git history.
Thomas
This is a cleanup against current linux-2.6 Linus tree.
Having CONFIG_CGROUP_CPUACCT code in kernel/sched.c looks wrong.
Move this out to kernel/cgroup_cpuaccount.c
Test compiled with and without CONFIG_CGROUP_CPUACCT set on x86_64.
Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: linux-kernel@vger.kernel.org
CC: mike@android.com
CC: menage@google.com
CC: lizf@cn.fujitsu.com
CC: containers@lists.linux-foundation.org
CC: mingo@elte.hu
CC: peterz@infradead.org
---
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8f78073..6e2c88a 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -609,6 +609,24 @@ bool css_is_ancestor(struct cgroup_subsys_state *cg,
unsigned short css_id(struct cgroup_subsys_state *css);
unsigned short css_depth(struct cgroup_subsys_state *css);
+/* Time spent by the tasks of the cpu accounting group executing in ... */
+enum cpuacct_stat_index {
+ CPUACCT_STAT_USER, /* ... user mode */
+ CPUACCT_STAT_SYSTEM, /* ... kernel mode */
+
+ CPUACCT_STAT_NSTATS,
+};
+
+#ifdef CONFIG_CGROUP_CPUACCT
+void cpuacct_charge(struct task_struct *tsk, u64 cputime);
+void cpuacct_update_stats(struct task_struct *tsk,
+ enum cpuacct_stat_index idx, cputime_t val);
+#else
+static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
+static inline void cpuacct_update_stats(struct task_struct *tsk,
+ enum cpuacct_stat_index idx, cputime_t val) {}
+#endif
+
#else /* !CONFIG_CGROUPS */
static inline int cgroup_init_early(void) { return 0; }
diff --git a/kernel/Makefile b/kernel/Makefile
index 149e18e..1df6e53 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUP_CPUACCT) += cgroup_cpuaccount.o
obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_CGROUP_NS) += ns_cgroup.o
diff --git a/kernel/cgroup_cpuaccount.c b/kernel/cgroup_cpuaccount.c
new file mode 100644
index 0000000..d32b927
--- /dev/null
+++ b/kernel/cgroup_cpuaccount.c
@@ -0,0 +1,284 @@
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/cgroup.h>
+#include <linux/srcu.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+
+#include <asm/cputime.h>
+
+/*
+ * CPU accounting code for task groups.
+ *
+ * Based on the work by Paul Menage (menage@google.com) and Balbir Singh
+ * (balbir@in.ibm.com).
+ */
+
+/* track cpu usage of a group of tasks and its child groups */
+struct cpuacct {
+ struct cgroup_subsys_state css;
+ /* cpuusage holds pointer to a u64-type object on every cpu */
+ u64 __percpu *cpuusage;
+ struct percpu_counter cpustat[CPUACCT_STAT_NSTATS];
+ struct cpuacct *parent;
+};
+
+struct cgroup_subsys cpuacct_subsys;
+
+/* return cpu accounting group corresponding to this container */
+static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, cpuacct_subsys_id),
+ struct cpuacct, css);
+}
+
+/* return cpu accounting group to which this task belongs */
+static inline struct cpuacct *task_ca(struct task_struct *tsk)
+{
+ return container_of(task_subsys_state(tsk, cpuacct_subsys_id),
+ struct cpuacct, css);
+}
+
+/* create a new cpu accounting group */
+static struct cgroup_subsys_state *cpuacct_create(
+ struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct cpuacct *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
+ int i;
+
+ if (!ca)
+ goto out;
+
+ ca->cpuusage = alloc_percpu(u64);
+ if (!ca->cpuusage)
+ goto out_free_ca;
+
+ for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
+ if (percpu_counter_init(&ca->cpustat[i], 0))
+ goto out_free_counters;
+
+ if (cgrp->parent)
+ ca->parent = cgroup_ca(cgrp->parent);
+
+ return &ca->css;
+
+out_free_counters:
+ while (--i >= 0)
+ percpu_counter_destroy(&ca->cpustat[i]);
+ free_percpu(ca->cpuusage);
+out_free_ca:
+ kfree(ca);
+out:
+ return ERR_PTR(-ENOMEM);
+}
+
+/* destroy an existing cpu accounting group */
+static void
+cpuacct_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct cpuacct *ca = cgroup_ca(cgrp);
+ int i;
+
+ for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
+ percpu_counter_destroy(&ca->cpustat[i]);
+ free_percpu(ca->cpuusage);
+ kfree(ca);
+}
+
+static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
+{
+ u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+ u64 data;
+
+#ifndef CONFIG_64BIT
+ /*
+ * Take rq->lock to make 64-bit read safe on 32-bit platforms.
+ */
+ raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ data = *cpuusage;
+ raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#else
+ data = *cpuusage;
+#endif
+
+ return data;
+}
+
+static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
+{
+ u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+
+#ifndef CONFIG_64BIT
+ /*
+ * Take rq->lock to make 64-bit write safe on 32-bit platforms.
+ */
+ raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ *cpuusage = val;
+ raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#else
+ *cpuusage = val;
+#endif
+}
+
+/* return total cpu usage (in nanoseconds) of a group */
+static u64 cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct cpuacct *ca = cgroup_ca(cgrp);
+ u64 totalcpuusage = 0;
+ int i;
+
+ for_each_present_cpu(i)
+ totalcpuusage += cpuacct_cpuusage_read(ca, i);
+
+ return totalcpuusage;
+}
+
+static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
+ u64 reset)
+{
+ struct cpuacct *ca = cgroup_ca(cgrp);
+ int err = 0;
+ int i;
+
+ if (reset) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ for_each_present_cpu(i)
+ cpuacct_cpuusage_write(ca, i, 0);
+
+out:
+ return err;
+}
+
+static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct cpuacct *ca = cgroup_ca(cgroup);
+ u64 percpu;
+ int i;
+
+ for_each_present_cpu(i) {
+ percpu = cpuacct_cpuusage_read(ca, i);
+ seq_printf(m, "%llu ", (unsigned long long) percpu);
+ }
+ seq_printf(m, "\n");
+ return 0;
+}
+
+static const char *cpuacct_stat_desc[] = {
+ [CPUACCT_STAT_USER] = "user",
+ [CPUACCT_STAT_SYSTEM] = "system",
+};
+
+static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
+ struct cgroup_map_cb *cb)
+{
+ struct cpuacct *ca = cgroup_ca(cgrp);
+ int i;
+
+ for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
+ s64 val = percpu_counter_read(&ca->cpustat[i]);
+ val = cputime64_to_clock_t(val);
+ cb->fill(cb, cpuacct_stat_desc[i], val);
+ }
+ return 0;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "usage",
+ .read_u64 = cpuusage_read,
+ .write_u64 = cpuusage_write,
+ },
+ {
+ .name = "usage_percpu",
+ .read_seq_string = cpuacct_percpu_seq_read,
+ },
+ {
+ .name = "stat",
+ .read_map = cpuacct_stats_show,
+ },
+};
+
+static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
+}
+
+/*
+ * charge this task's execution time to its accounting group.
+ *
+ * called with rq->lock held.
+ */
+void cpuacct_charge(struct task_struct *tsk, u64 cputime)
+{
+ struct cpuacct *ca;
+ int cpu;
+
+ if (unlikely(!cpuacct_subsys.active))
+ return;
+
+ cpu = task_cpu(tsk);
+
+ rcu_read_lock();
+
+ ca = task_ca(tsk);
+
+ for (; ca; ca = ca->parent) {
+ u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+ *cpuusage += cputime;
+ }
+
+ rcu_read_unlock();
+}
+
+/*
+ * When CONFIG_VIRT_CPU_ACCOUNTING is enabled one jiffy can be very large
+ * in cputime_t units. As a result, cpuacct_update_stats calls
+ * percpu_counter_add with values large enough to always overflow the
+ * per cpu batch limit causing bad SMP scalability.
+ *
+ * To fix this we scale percpu_counter_batch by cputime_one_jiffy so we
+ * batch the same amount of time with CONFIG_VIRT_CPU_ACCOUNTING disabled
+ * and enabled. We cap it at INT_MAX which is the largest allowed batch value.
+ */
+#ifdef CONFIG_SMP
+#define CPUACCT_BATCH \
+ min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX)
+#else
+#define CPUACCT_BATCH 0
+#endif
+
+/*
+ * Charge the system/user time to the task's accounting group.
+ */
+void cpuacct_update_stats(struct task_struct *tsk,
+ enum cpuacct_stat_index idx, cputime_t val)
+{
+ struct cpuacct *ca;
+ int batch = CPUACCT_BATCH;
+
+ if (unlikely(!cpuacct_subsys.active))
+ return;
+
+ rcu_read_lock();
+ ca = task_ca(tsk);
+
+ do {
+ __percpu_counter_add(&ca->cpustat[idx], val, batch);
+ ca = ca->parent;
+ } while (ca);
+ rcu_read_unlock();
+}
+
+struct cgroup_subsys cpuacct_subsys = {
+ .name = "cpuacct",
+ .create = cpuacct_create,
+ .destroy = cpuacct_destroy,
+ .populate = cpuacct_populate,
+ .subsys_id = cpuacct_subsys_id,
+};
diff --git a/kernel/sched.c b/kernel/sched.c
index 1d93cd0..45d60dd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1394,24 +1394,6 @@ static const u32 prio_to_wmult[40] = {
/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};
-/* Time spent by the tasks of the cpu accounting group executing in ... */
-enum cpuacct_stat_index {
- CPUACCT_STAT_USER, /* ... user mode */
- CPUACCT_STAT_SYSTEM, /* ... kernel mode */
-
- CPUACCT_STAT_NSTATS,
-};
-
-#ifdef CONFIG_CGROUP_CPUACCT
-static void cpuacct_charge(struct task_struct *tsk, u64 cputime);
-static void cpuacct_update_stats(struct task_struct *tsk,
- enum cpuacct_stat_index idx, cputime_t val);
-#else
-static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
-static inline void cpuacct_update_stats(struct task_struct *tsk,
- enum cpuacct_stat_index idx, cputime_t val) {}
-#endif
-
static inline void inc_cpu_load(struct rq *rq, unsigned long load)
{
update_load_add(&rq->load, load);
@@ -8617,283 +8599,6 @@ struct cgroup_subsys cpu_cgroup_subsys = {
#endif /* CONFIG_CGROUP_SCHED */
-#ifdef CONFIG_CGROUP_CPUACCT
-
-/*
- * CPU accounting code for task groups.
- *
- * Based on the work by Paul Menage (menage@google.com) and Balbir Singh
- * (balbir@in.ibm.com).
- */
-
-/* track cpu usage of a group of tasks and its child groups */
-struct cpuacct {
- struct cgroup_subsys_state css;
- /* cpuusage holds pointer to a u64-type object on every cpu */
- u64 __percpu *cpuusage;
- struct percpu_counter cpustat[CPUACCT_STAT_NSTATS];
- struct cpuacct *parent;
-};
-
-struct cgroup_subsys cpuacct_subsys;
-
-/* return cpu accounting group corresponding to this container */
-static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
-{
- return container_of(cgroup_subsys_state(cgrp, cpuacct_subsys_id),
- struct cpuacct, css);
-}
-
-/* return cpu accounting group to which this task belongs */
-static inline struct cpuacct *task_ca(struct task_struct *tsk)
-{
- return container_of(task_subsys_state(tsk, cpuacct_subsys_id),
- struct cpuacct, css);
-}
-
-/* create a new cpu accounting group */
-static struct cgroup_subsys_state *cpuacct_create(
- struct cgroup_subsys *ss, struct cgroup *cgrp)
-{
- struct cpuacct *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
- int i;
-
- if (!ca)
- goto out;
-
- ca->cpuusage = alloc_percpu(u64);
- if (!ca->cpuusage)
- goto out_free_ca;
-
- for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
- if (percpu_counter_init(&ca->cpustat[i], 0))
- goto out_free_counters;
-
- if (cgrp->parent)
- ca->parent = cgroup_ca(cgrp->parent);
-
- return &ca->css;
-
-out_free_counters:
- while (--i >= 0)
- percpu_counter_destroy(&ca->cpustat[i]);
- free_percpu(ca->cpuusage);
-out_free_ca:
- kfree(ca);
-out:
- return ERR_PTR(-ENOMEM);
-}
-
-/* destroy an existing cpu accounting group */
-static void
-cpuacct_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
-{
- struct cpuacct *ca = cgroup_ca(cgrp);
- int i;
-
- for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
- percpu_counter_destroy(&ca->cpustat[i]);
- free_percpu(ca->cpuusage);
- kfree(ca);
-}
-
-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
-{
- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
- u64 data;
-
-#ifndef CONFIG_64BIT
- /*
- * Take rq->lock to make 64-bit read safe on 32-bit platforms.
- */
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
- data = *cpuusage;
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
- data = *cpuusage;
-#endif
-
- return data;
-}
-
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
-{
- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-
-#ifndef CONFIG_64BIT
- /*
- * Take rq->lock to make 64-bit write safe on 32-bit platforms.
- */
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
- *cpuusage = val;
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
- *cpuusage = val;
-#endif
-}
-
-/* return total cpu usage (in nanoseconds) of a group */
-static u64 cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
-{
- struct cpuacct *ca = cgroup_ca(cgrp);
- u64 totalcpuusage = 0;
- int i;
-
- for_each_present_cpu(i)
- totalcpuusage += cpuacct_cpuusage_read(ca, i);
-
- return totalcpuusage;
-}
-
-static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
- u64 reset)
-{
- struct cpuacct *ca = cgroup_ca(cgrp);
- int err = 0;
- int i;
-
- if (reset) {
- err = -EINVAL;
- goto out;
- }
-
- for_each_present_cpu(i)
- cpuacct_cpuusage_write(ca, i, 0);
-
-out:
- return err;
-}
-
-static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
- struct seq_file *m)
-{
- struct cpuacct *ca = cgroup_ca(cgroup);
- u64 percpu;
- int i;
-
- for_each_present_cpu(i) {
- percpu = cpuacct_cpuusage_read(ca, i);
- seq_printf(m, "%llu ", (unsigned long long) percpu);
- }
- seq_printf(m, "\n");
- return 0;
-}
-
-static const char *cpuacct_stat_desc[] = {
- [CPUACCT_STAT_USER] = "user",
- [CPUACCT_STAT_SYSTEM] = "system",
-};
-
-static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
- struct cgroup_map_cb *cb)
-{
- struct cpuacct *ca = cgroup_ca(cgrp);
- int i;
-
- for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
- s64 val = percpu_counter_read(&ca->cpustat[i]);
- val = cputime64_to_clock_t(val);
- cb->fill(cb, cpuacct_stat_desc[i], val);
- }
- return 0;
-}
-
-static struct cftype files[] = {
- {
- .name = "usage",
- .read_u64 = cpuusage_read,
- .write_u64 = cpuusage_write,
- },
- {
- .name = "usage_percpu",
- .read_seq_string = cpuacct_percpu_seq_read,
- },
- {
- .name = "stat",
- .read_map = cpuacct_stats_show,
- },
-};
-
-static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
-{
- return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
-}
-
-/*
- * charge this task's execution time to its accounting group.
- *
- * called with rq->lock held.
- */
-static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
-{
- struct cpuacct *ca;
- int cpu;
-
- if (unlikely(!cpuacct_subsys.active))
- return;
-
- cpu = task_cpu(tsk);
-
- rcu_read_lock();
-
- ca = task_ca(tsk);
-
- for (; ca; ca = ca->parent) {
- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
- *cpuusage += cputime;
- }
-
- rcu_read_unlock();
-}
-
-/*
- * When CONFIG_VIRT_CPU_ACCOUNTING is enabled one jiffy can be very large
- * in cputime_t units. As a result, cpuacct_update_stats calls
- * percpu_counter_add with values large enough to always overflow the
- * per cpu batch limit causing bad SMP scalability.
- *
- * To fix this we scale percpu_counter_batch by cputime_one_jiffy so we
- * batch the same amount of time with CONFIG_VIRT_CPU_ACCOUNTING disabled
- * and enabled. We cap it at INT_MAX which is the largest allowed batch value.
- */
-#ifdef CONFIG_SMP
-#define CPUACCT_BATCH \
- min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX)
-#else
-#define CPUACCT_BATCH 0
-#endif
-
-/*
- * Charge the system/user time to the task's accounting group.
- */
-static void cpuacct_update_stats(struct task_struct *tsk,
- enum cpuacct_stat_index idx, cputime_t val)
-{
- struct cpuacct *ca;
- int batch = CPUACCT_BATCH;
-
- if (unlikely(!cpuacct_subsys.active))
- return;
-
- rcu_read_lock();
- ca = task_ca(tsk);
-
- do {
- __percpu_counter_add(&ca->cpustat[idx], val, batch);
- ca = ca->parent;
- } while (ca);
- rcu_read_unlock();
-}
-
-struct cgroup_subsys cpuacct_subsys = {
- .name = "cpuacct",
- .create = cpuacct_create,
- .destroy = cpuacct_destroy,
- .populate = cpuacct_populate,
- .subsys_id = cpuacct_subsys_id,
-};
-#endif /* CONFIG_CGROUP_CPUACCT */
-
#ifndef CONFIG_SMP
void synchronize_sched_expedited(void)
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH] scheduler: Extract cgroups_cpuaccount code from sched.c into own file
2010-05-19 9:30 ` [PATCH] scheduler: Extract cgroups_cpuaccount code from sched.c into own file Thomas Renninger
@ 2010-05-19 9:37 ` Peter Zijlstra
2010-05-19 10:27 ` Peter Zijlstra
2010-05-19 19:06 ` [PATCH] scheduler: Extract cgroups_cpuaccount code from sched.c into own file Mike Chan
1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-05-19 9:37 UTC (permalink / raw)
To: Thomas Renninger
Cc: Mike Chan, khilman, menage, balbir, cpufreq, linux-kernel,
linux-omap, mingo, lizf, containers
On Wed, 2010-05-19 at 11:30 +0200, Thomas Renninger wrote:
>
> This is a cleanup against current linux-2.6 Linus tree.
>
> Having CONFIG_CGROUP_CPUACCT code in kernel/sched.c looks wrong.
> Move this out to kernel/cgroup_cpuaccount.c
>
> Test compiled with and without CONFIG_CGROUP_CPUACCT set on x86_64.
Sure, works for me, less code to look at :-)
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] scheduler: Extract cgroups_cpuaccount code from sched.c into own file
2010-05-19 9:37 ` Peter Zijlstra
@ 2010-05-19 10:27 ` Peter Zijlstra
2010-05-19 18:58 ` scheduler: cleanup sched.c and extract cgroup_cpuaccount stuff into separate file Thomas Renninger
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Peter Zijlstra @ 2010-05-19 10:27 UTC (permalink / raw)
To: Thomas Renninger
Cc: Mike Chan, khilman, menage, balbir, cpufreq, linux-kernel,
linux-omap, mingo, lizf, containers
On Wed, 2010-05-19 at 11:37 +0200, Peter Zijlstra wrote:
> On Wed, 2010-05-19 at 11:30 +0200, Thomas Renninger wrote:
> >
> > This is a cleanup against current linux-2.6 Linus tree.
> >
> > Having CONFIG_CGROUP_CPUACCT code in kernel/sched.c looks wrong.
> > Move this out to kernel/cgroup_cpuaccount.c
> >
> > Test compiled with and without CONFIG_CGROUP_CPUACCT set on x86_64.
>
> Sure, works for me, less code to look at :-)
In file included from /mnt/build/linux-2.6/kernel/sched.c:1908:
/mnt/build/linux-2.6/kernel/sched_fair.c: In function ‘update_curr’:
/mnt/build/linux-2.6/kernel/sched_fair.c:544: error: implicit declaration of function ‘cpuacct_charge’
CC kernel/sysctl_binary.o
/mnt/build/linux-2.6/kernel/sched.c: In function ‘account_user_time’:
/mnt/build/linux-2.6/kernel/sched.c:3265: error: implicit declaration of function ‘cpuacct_update_stats’
/mnt/build/linux-2.6/kernel/sched.c:3265: error: ‘CPUACCT_STAT_USER’ undeclared (first use in this function)
/mnt/build/linux-2.6/kernel/sched.c:3265: error: (Each undeclared identifier is reported only once
/mnt/build/linux-2.6/kernel/sched.c:3265: error: for each function it appears in.)
/mnt/build/linux-2.6/kernel/sched.c: In function ‘account_system_time’:
/mnt/build/linux-2.6/kernel/sched.c:3332: error: ‘CPUACCT_STAT_SYSTEM’ undeclared (first use in this function)
make[2]: *** [kernel/sched.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [kernel/] Error 2
make: *** [sub-make] Error 2
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/4] omap: cpu: Power tracking support for cgroup cpuacct
2010-05-19 1:30 ` [PATCH 4/4] omap: cpu: Power tracking support for cgroup cpuacct Mike Chan
@ 2010-05-19 13:11 ` Nishanth Menon
2010-05-19 15:34 ` Thomas Renninger
0 siblings, 1 reply; 21+ messages in thread
From: Nishanth Menon @ 2010-05-19 13:11 UTC (permalink / raw)
To: Mike Chan
Cc: khilman@deeprootsystems.com, menage@google.com, balbir@in.ibm.com,
cpufreq@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-omap@vger.kernel.org
Mike Chan had written, on 05/18/2010 08:30 PM, the following:
> Specify a new power field in struct omap_opp, which is
> the power exported in milliWatts.
>
> The power_usage function gives the power consumed in milliWatt seconds.
>
> Signed-off-by: Mike Chan <mike@android.com>
> ---
> arch/arm/plat-omap/cpu-omap.c | 23 ++++++++++++++++++++++-
> arch/arm/plat-omap/include/plat/omap-pm.h | 1 +
> 2 files changed, 23 insertions(+), 1 deletions(-)
>
> diff --git a/arch/arm/plat-omap/cpu-omap.c b/arch/arm/plat-omap/cpu-omap.c
> index d27234d..1381539 100644
> --- a/arch/arm/plat-omap/cpu-omap.c
> +++ b/arch/arm/plat-omap/cpu-omap.c
> @@ -22,6 +22,7 @@
> #include <linux/clk.h>
> #include <linux/io.h>
> #include <linux/cpuacct.h>
> +#include <linux/math64.h>
>
> #include <mach/hardware.h>
> #include <plat/clock.h>
> @@ -218,10 +219,30 @@ static void omap_cpuacct_freq_show(void *cpuacct_data, struct cgroup_map_cb *cb)
> }
> }
>
> +
> +/* Returns power (in milliWatt seconds) for cgroup */
> +static u64 omap_cpuacct_power_usage(void *cpuacct_data)
> +{
> + int i;
> + u64 *cpuacct_freq = cpuacct_data;
> + u64 totalpower = 0;
> +
> + /* mpu_opps table starts at 1 */
> + for (i = 0; freq_table[i].frequency != CPUFREQ_TABLE_END; i++)
> + totalpower += cpuacct_freq[i] * mpu_opps[i + 1].power;
> +
> + /* milliwatt seconds */
> + totalpower = div64_u64(totalpower, NSEC_PER_SEC);
> +
> + return totalpower;
> +}
> +
> +
> static struct cpuacct_cpufreq_calls omap_cpuacct_cpufreq = {
> .init = omap_cpuacct_freq_init,
> .charge = omap_cpuacct_freq_charge,
> - .show = omap_cpuacct_freq_show,
> + .cpufreq_show = omap_cpuacct_freq_show,
> + .power_usage = omap_cpuacct_power_usage,
> };
>
> static int __init omap_cpuacct_init(void)
> diff --git a/arch/arm/plat-omap/include/plat/omap-pm.h b/arch/arm/plat-omap/include/plat/omap-pm.h
> index 3ee41d7..8f2db6f 100644
> --- a/arch/arm/plat-omap/include/plat/omap-pm.h
> +++ b/arch/arm/plat-omap/include/plat/omap-pm.h
> @@ -31,6 +31,7 @@ struct omap_opp {
> unsigned long rate;
> u8 opp_id;
> u16 min_vdd;
> + unsigned long power; /* power consumed running at OPP in milliWatts */
this conflicts with the OPP layer implementation. this structure will
disappear for good.
> };
>
> extern struct omap_opp *mpu_opps;
--
Regards,
Nishanth Menon
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/4] omap: cpu: Power tracking support for cgroup cpuacct
2010-05-19 13:11 ` Nishanth Menon
@ 2010-05-19 15:34 ` Thomas Renninger
2010-05-19 18:56 ` Mike Chan
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Renninger @ 2010-05-19 15:34 UTC (permalink / raw)
To: Nishanth Menon
Cc: Mike Chan, khilman@deeprootsystems.com, menage@google.com,
balbir@in.ibm.com, cpufreq@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-omap@vger.kernel.org
On Wednesday 19 May 2010 15:11:03 Nishanth Menon wrote:
> Mike Chan had written, on 05/18/2010 08:30 PM, the following:
> > Specify a new power field in struct omap_opp, which is
> > the power exported in milliWatts.
...
> > + totalpower += cpuacct_freq[i] * mpu_opps[i + 1].power;
...
> > + unsigned long power; /* power consumed running at OPP in milliWatts */
>
> this conflicts with the OPP layer implementation. this structure will
> disappear for good.
Ah yes, another question:
The power variable in struct omap_opp never gets initialized
with a sane value; something is wrong...
Thomas
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/4] omap: cpu: Power tracking support for cgroup cpuacct
2010-05-19 15:34 ` Thomas Renninger
@ 2010-05-19 18:56 ` Mike Chan
2010-05-19 19:00 ` Nishanth Menon
0 siblings, 1 reply; 21+ messages in thread
From: Mike Chan @ 2010-05-19 18:56 UTC (permalink / raw)
To: Thomas Renninger
Cc: Nishanth Menon, khilman@deeprootsystems.com, menage@google.com,
balbir@in.ibm.com, cpufreq@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-omap@vger.kernel.org
On Wed, May 19, 2010 at 8:34 AM, Thomas Renninger <trenn@suse.de> wrote:
> On Wednesday 19 May 2010 15:11:03 Nishanth Menon wrote:
>> Mike Chan had written, on 05/18/2010 08:30 PM, the following:
>> > Specify a new power field in struct omap_opp, which is
>> > the power exported in milliWatts.
> ...
>> > + totalpower += cpuacct_freq[i] * mpu_opps[i + 1].power;
> ...
>> > + unsigned long power; /* power consumed running at OPP in milliWatts */
>>
>> this conflicts with the OPP layer implementation. this structure will
>> disappear for good.
Okay, I wasn't sure what to use. The omap-pm branch uses omap_opp for
the mpu frequencies. However, it looks like struct prcm_config is used
in linux-next for mpu frequencies, but there are deprecated comments
around this struct definition, so I'm not sure which struct I
should be using.
> Ah yes, another question:
> The power variable in struct omap_opp never gets initialized
> with a sane value, something is wrong...
>
You're right, this last patch won't work on linux-next; it seems quite
a bit has changed in the omap-specific code. Previously, board files could
register their OPP table, which is an array of struct omap_opp; that is
exactly what I do for Droid in our android-omap branch.
I'm hoping some omap folks can offer some suggestions for what needs
to be changed.
-- Mike
> Thomas
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* scheduler: cleanup sched.c and extract cgroup_cpuaccount stuff into separate file
2010-05-19 10:27 ` Peter Zijlstra
@ 2010-05-19 18:58 ` Thomas Renninger
2010-05-19 18:58 ` [PATCH 1/2] scheduler: Extract cgroups_cpuaccount code from sched.c into own file V2 Thomas Renninger
2010-05-19 18:58 ` [PATCH 2/2] scheduler: cgroups cpuaccouting: Make cpuusage atomic Thomas Renninger
2 siblings, 0 replies; 21+ messages in thread
From: Thomas Renninger @ 2010-05-19 18:58 UTC (permalink / raw)
To: trenn, peterz; +Cc: mike, linux-kernel, lizf, menage, containers, mingo
The first patch should only move code.
To be able to access cpu_rq(cpu)->lock from outside sched.c,
lock_runqueue(unsigned int cpu) and unlock_runqueue(unsigned int cpu)
wrappers are added and kernel/sched.h is created.
The locks and sched.h are removed again in the second
patch; this should be the only "real" change.
There, atomic64_t is introduced in a per_cpu_ptr variable
to avoid the lock.
I am not sure that allocating atomic64_t like that is
correct. Review is appreciated.
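Roughly, the second patch amounts to this (a minimal sketch under my
assumptions, since the atomic64_t conversion itself is not quoted here):

/* cpuusage becomes a per-cpu atomic64_t instead of a bare u64 */
struct cpuacct {
	struct cgroup_subsys_state css;
	atomic64_t __percpu *cpuusage;
	struct percpu_counter cpustat[CPUACCT_STAT_NSTATS];
	struct cpuacct *parent;
};

/* alloc_percpu(atomic64_t) zeroes the memory, matching ATOMIC64_INIT(0) */

static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
{
	/* atomic64_read() is tear-free even on 32-bit, so no rq->lock needed */
	return atomic64_read(per_cpu_ptr(ca->cpuusage, cpu));
}

static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
{
	atomic64_set(per_cpu_ptr(ca->cpuusage, cpu), val);
}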
Both patches are compile tested on various archs and configs.
Thomas
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 1/2] scheduler: Extract cgroups_cpuaccount code from sched.c into own file V2
2010-05-19 10:27 ` Peter Zijlstra
2010-05-19 18:58 ` scheduler: cleanup sched.c and extract cgroup_cpuaccount stuff into separate file Thomas Renninger
@ 2010-05-19 18:58 ` Thomas Renninger
2010-05-19 18:58 ` [PATCH 2/2] scheduler: cgroups cpuaccouting: Make cpuusage atomic Thomas Renninger
2 siblings, 0 replies; 21+ messages in thread
From: Thomas Renninger @ 2010-05-19 18:58 UTC (permalink / raw)
To: trenn, peterz; +Cc: mike, linux-kernel, lizf, menage, containers, mingo
This is a cleanup against current linux-2.6 Linus tree.
Having CONFIG_CGROUP_CPUACCT code in kernel/sched.c looks wrong.
Move this out to kernel/cgroup_cpuaccount.c
Test compiled on several archs (IA64, ppc, x86(_64), s390)
with different configs.
Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: linux-kernel@vger.kernel.org
CC: mike@android.com
CC: menage@google.com
CC: lizf@cn.fujitsu.com
CC: containers@lists.linux-foundation.org
CC: mingo@elte.hu
CC: peterz@infradead.org
---
include/linux/cgroup.h | 29 ++++
kernel/Makefile | 1 +
kernel/cgroup_cpuaccount.c | 287 ++++++++++++++++++++++++++++++++++++++++
kernel/sched.c | 309 ++------------------------------------------
kernel/sched.h | 7 +
5 files changed, 336 insertions(+), 297 deletions(-)
create mode 100644 kernel/cgroup_cpuaccount.c
create mode 100644 kernel/sched.h
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8f78073..1c8f09d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -609,8 +609,37 @@ bool css_is_ancestor(struct cgroup_subsys_state *cg,
unsigned short css_id(struct cgroup_subsys_state *css);
unsigned short css_depth(struct cgroup_subsys_state *css);
+/* Time spent by the tasks of the cpu accounting group executing in ... */
+enum cpuacct_stat_index {
+ CPUACCT_STAT_USER, /* ... user mode */
+ CPUACCT_STAT_SYSTEM, /* ... kernel mode */
+
+ CPUACCT_STAT_NSTATS,
+};
+
+#ifdef CONFIG_CGROUP_CPUACCT
+void cpuacct_charge(struct task_struct *tsk, u64 cputime);
+void cpuacct_update_stats(struct task_struct *tsk,
+ enum cpuacct_stat_index idx, cputime_t val);
+#else
+static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
+static inline void cpuacct_update_stats(struct task_struct *tsk,
+ enum cpuacct_stat_index idx, cputime_t val) {}
+#endif
+
#else /* !CONFIG_CGROUPS */
+/* TBD: Make these double declarations (see above) disappear */
+enum cpuacct_stat_index {
+ CPUACCT_STAT_USER, /* ... user mode */
+ CPUACCT_STAT_SYSTEM, /* ... kernel mode */
+
+ CPUACCT_STAT_NSTATS,
+};
+static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
+static inline void cpuacct_update_stats(struct task_struct *tsk,
+ enum cpuacct_stat_index idx, cputime_t val) {}
+
static inline int cgroup_init_early(void) { return 0; }
static inline int cgroup_init(void) { return 0; }
static inline void cgroup_fork(struct task_struct *p) {}
diff --git a/kernel/Makefile b/kernel/Makefile
index 149e18e..1df6e53 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUP_CPUACCT) += cgroup_cpuaccount.o
obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_CGROUP_NS) += ns_cgroup.o
diff --git a/kernel/cgroup_cpuaccount.c b/kernel/cgroup_cpuaccount.c
new file mode 100644
index 0000000..0ad356a
--- /dev/null
+++ b/kernel/cgroup_cpuaccount.c
@@ -0,0 +1,287 @@
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/cgroup.h>
+#include <linux/srcu.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/err.h>
+
+#include <asm/cputime.h>
+
+#include "sched.h"
+
+/*
+ * CPU accounting code for task groups.
+ *
+ * Based on the work by Paul Menage (menage@google.com) and Balbir Singh
+ * (balbir@in.ibm.com).
+ */
+
+/* track cpu usage of a group of tasks and its child groups */
+struct cpuacct {
+ struct cgroup_subsys_state css;
+ /* cpuusage holds pointer to a u64-type object on every cpu */
+ u64 __percpu *cpuusage;
+ struct percpu_counter cpustat[CPUACCT_STAT_NSTATS];
+ struct cpuacct *parent;
+};
+
+struct cgroup_subsys cpuacct_subsys;
+
+/* return cpu accounting group corresponding to this container */
+static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, cpuacct_subsys_id),
+ struct cpuacct, css);
+}
+
+/* return cpu accounting group to which this task belongs */
+static inline struct cpuacct *task_ca(struct task_struct *tsk)
+{
+ return container_of(task_subsys_state(tsk, cpuacct_subsys_id),
+ struct cpuacct, css);
+}
+
+/* create a new cpu accounting group */
+static struct cgroup_subsys_state *cpuacct_create(
+ struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct cpuacct *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
+ int i;
+
+ if (!ca)
+ goto out;
+
+ ca->cpuusage = alloc_percpu(u64);
+ if (!ca->cpuusage)
+ goto out_free_ca;
+
+ for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
+ if (percpu_counter_init(&ca->cpustat[i], 0))
+ goto out_free_counters;
+
+ if (cgrp->parent)
+ ca->parent = cgroup_ca(cgrp->parent);
+
+ return &ca->css;
+
+out_free_counters:
+ while (--i >= 0)
+ percpu_counter_destroy(&ca->cpustat[i]);
+ free_percpu(ca->cpuusage);
+out_free_ca:
+ kfree(ca);
+out:
+ return ERR_PTR(-ENOMEM);
+}
+
+/* destroy an existing cpu accounting group */
+static void
+cpuacct_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct cpuacct *ca = cgroup_ca(cgrp);
+ int i;
+
+ for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
+ percpu_counter_destroy(&ca->cpustat[i]);
+ free_percpu(ca->cpuusage);
+ kfree(ca);
+}
+
+static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
+{
+ u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+ u64 data;
+
+#ifndef CONFIG_64BIT
+ /*
+ * Take rq->lock to make 64-bit read safe on 32-bit platforms.
+ */
+ lock_runqueue(cpu);
+ data = *cpuusage;
+ unlock_runqueue(cpu);
+#else
+ data = *cpuusage;
+#endif
+
+ return data;
+}
+
+static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
+{
+ u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+
+#ifndef CONFIG_64BIT
+ /*
+ * Take rq->lock to make 64-bit write safe on 32-bit platforms.
+ */
+ lock_runqueue(cpu);
+ *cpuusage = val;
+ unlock_runqueue(cpu);
+#else
+ *cpuusage = val;
+#endif
+}
+
+/* return total cpu usage (in nanoseconds) of a group */
+static u64 cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct cpuacct *ca = cgroup_ca(cgrp);
+ u64 totalcpuusage = 0;
+ int i;
+
+ for_each_present_cpu(i)
+ totalcpuusage += cpuacct_cpuusage_read(ca, i);
+
+ return totalcpuusage;
+}
+
+static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
+ u64 reset)
+{
+ struct cpuacct *ca = cgroup_ca(cgrp);
+ int err = 0;
+ int i;
+
+ if (reset) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ for_each_present_cpu(i)
+ cpuacct_cpuusage_write(ca, i, 0);
+
+out:
+ return err;
+}
+
+static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct cpuacct *ca = cgroup_ca(cgroup);
+ u64 percpu;
+ int i;
+
+ for_each_present_cpu(i) {
+ percpu = cpuacct_cpuusage_read(ca, i);
+ seq_printf(m, "%llu ", (unsigned long long) percpu);
+ }
+ seq_printf(m, "\n");
+ return 0;
+}
+
+static const char *cpuacct_stat_desc[] = {
+ [CPUACCT_STAT_USER] = "user",
+ [CPUACCT_STAT_SYSTEM] = "system",
+};
+
+static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
+ struct cgroup_map_cb *cb)
+{
+ struct cpuacct *ca = cgroup_ca(cgrp);
+ int i;
+
+ for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
+ s64 val = percpu_counter_read(&ca->cpustat[i]);
+ val = cputime64_to_clock_t(val);
+ cb->fill(cb, cpuacct_stat_desc[i], val);
+ }
+ return 0;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "usage",
+ .read_u64 = cpuusage_read,
+ .write_u64 = cpuusage_write,
+ },
+ {
+ .name = "usage_percpu",
+ .read_seq_string = cpuacct_percpu_seq_read,
+ },
+ {
+ .name = "stat",
+ .read_map = cpuacct_stats_show,
+ },
+};
+
+static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
+}
+
+/*
+ * charge this task's execution time to its accounting group.
+ *
+ * called with rq->lock held.
+ */
+void cpuacct_charge(struct task_struct *tsk, u64 cputime)
+{
+ struct cpuacct *ca;
+ int cpu;
+
+ if (unlikely(!cpuacct_subsys.active))
+ return;
+
+ cpu = task_cpu(tsk);
+
+ rcu_read_lock();
+
+ ca = task_ca(tsk);
+
+ for (; ca; ca = ca->parent) {
+ u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+ *cpuusage += cputime;
+ }
+
+ rcu_read_unlock();
+}
+
+/*
+ * When CONFIG_VIRT_CPU_ACCOUNTING is enabled one jiffy can be very large
+ * in cputime_t units. As a result, cpuacct_update_stats calls
+ * percpu_counter_add with values large enough to always overflow the
+ * per cpu batch limit causing bad SMP scalability.
+ *
+ * To fix this we scale percpu_counter_batch by cputime_one_jiffy so we
+ * batch the same amount of time with CONFIG_VIRT_CPU_ACCOUNTING disabled
+ * and enabled. We cap it at INT_MAX which is the largest allowed batch value.
+ */
+#ifdef CONFIG_SMP
+#define CPUACCT_BATCH \
+ min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX)
+#else
+#define CPUACCT_BATCH 0
+#endif
+
+/*
+ * Charge the system/user time to the task's accounting group.
+ */
+void cpuacct_update_stats(struct task_struct *tsk,
+ enum cpuacct_stat_index idx, cputime_t val)
+{
+ struct cpuacct *ca;
+ int batch = CPUACCT_BATCH;
+
+ if (unlikely(!cpuacct_subsys.active))
+ return;
+
+ rcu_read_lock();
+ ca = task_ca(tsk);
+
+ do {
+ __percpu_counter_add(&ca->cpustat[idx], val, batch);
+ ca = ca->parent;
+ } while (ca);
+ rcu_read_unlock();
+}
+
+struct cgroup_subsys cpuacct_subsys = {
+ .name = "cpuacct",
+ .create = cpuacct_create,
+ .destroy = cpuacct_destroy,
+ .populate = cpuacct_populate,
+ .subsys_id = cpuacct_subsys_id,
+};
diff --git a/kernel/sched.c b/kernel/sched.c
index 1d93cd0..fc93cbd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -72,11 +72,13 @@
#include <linux/ctype.h>
#include <linux/ftrace.h>
#include <linux/slab.h>
+#include <linux/cgroup.h>
#include <asm/tlb.h>
#include <asm/irq_regs.h>
#include "sched_cpupri.h"
+#include "sched.h"
#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
@@ -236,8 +238,6 @@ static DEFINE_MUTEX(sched_domains_mutex);
#ifdef CONFIG_CGROUP_SCHED
-#include <linux/cgroup.h>
-
struct cfs_rq;
static LIST_HEAD(task_groups);
@@ -642,6 +642,16 @@ static inline int cpu_of(struct rq *rq)
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() (&__raw_get_cpu_var(runqueues))
+void lock_runqueue(unsigned int cpu)
+{
+ raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+}
+
+void unlock_runqueue(unsigned int cpu)
+{
+ raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+}
+
inline void update_rq_clock(struct rq *rq)
{
if (!rq->skip_clock_update)
@@ -1394,24 +1404,6 @@ static const u32 prio_to_wmult[40] = {
/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};
-/* Time spent by the tasks of the cpu accounting group executing in ... */
-enum cpuacct_stat_index {
- CPUACCT_STAT_USER, /* ... user mode */
- CPUACCT_STAT_SYSTEM, /* ... kernel mode */
-
- CPUACCT_STAT_NSTATS,
-};
-
-#ifdef CONFIG_CGROUP_CPUACCT
-static void cpuacct_charge(struct task_struct *tsk, u64 cputime);
-static void cpuacct_update_stats(struct task_struct *tsk,
- enum cpuacct_stat_index idx, cputime_t val);
-#else
-static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
-static inline void cpuacct_update_stats(struct task_struct *tsk,
- enum cpuacct_stat_index idx, cputime_t val) {}
-#endif
-
static inline void inc_cpu_load(struct rq *rq, unsigned long load)
{
update_load_add(&rq->load, load);
@@ -8617,283 +8609,6 @@ struct cgroup_subsys cpu_cgroup_subsys = {
#endif /* CONFIG_CGROUP_SCHED */
-#ifdef CONFIG_CGROUP_CPUACCT
-
-/*
- * CPU accounting code for task groups.
- *
- * Based on the work by Paul Menage (menage@google.com) and Balbir Singh
- * (balbir@in.ibm.com).
- */
-
-/* track cpu usage of a group of tasks and its child groups */
-struct cpuacct {
- struct cgroup_subsys_state css;
- /* cpuusage holds pointer to a u64-type object on every cpu */
- u64 __percpu *cpuusage;
- struct percpu_counter cpustat[CPUACCT_STAT_NSTATS];
- struct cpuacct *parent;
-};
-
-struct cgroup_subsys cpuacct_subsys;
-
-/* return cpu accounting group corresponding to this container */
-static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
-{
- return container_of(cgroup_subsys_state(cgrp, cpuacct_subsys_id),
- struct cpuacct, css);
-}
-
-/* return cpu accounting group to which this task belongs */
-static inline struct cpuacct *task_ca(struct task_struct *tsk)
-{
- return container_of(task_subsys_state(tsk, cpuacct_subsys_id),
- struct cpuacct, css);
-}
-
-/* create a new cpu accounting group */
-static struct cgroup_subsys_state *cpuacct_create(
- struct cgroup_subsys *ss, struct cgroup *cgrp)
-{
- struct cpuacct *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
- int i;
-
- if (!ca)
- goto out;
-
- ca->cpuusage = alloc_percpu(u64);
- if (!ca->cpuusage)
- goto out_free_ca;
-
- for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
- if (percpu_counter_init(&ca->cpustat[i], 0))
- goto out_free_counters;
-
- if (cgrp->parent)
- ca->parent = cgroup_ca(cgrp->parent);
-
- return &ca->css;
-
-out_free_counters:
- while (--i >= 0)
- percpu_counter_destroy(&ca->cpustat[i]);
- free_percpu(ca->cpuusage);
-out_free_ca:
- kfree(ca);
-out:
- return ERR_PTR(-ENOMEM);
-}
-
-/* destroy an existing cpu accounting group */
-static void
-cpuacct_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
-{
- struct cpuacct *ca = cgroup_ca(cgrp);
- int i;
-
- for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
- percpu_counter_destroy(&ca->cpustat[i]);
- free_percpu(ca->cpuusage);
- kfree(ca);
-}
-
-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
-{
- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
- u64 data;
-
-#ifndef CONFIG_64BIT
- /*
- * Take rq->lock to make 64-bit read safe on 32-bit platforms.
- */
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
- data = *cpuusage;
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
- data = *cpuusage;
-#endif
-
- return data;
-}
-
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
-{
- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-
-#ifndef CONFIG_64BIT
- /*
- * Take rq->lock to make 64-bit write safe on 32-bit platforms.
- */
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
- *cpuusage = val;
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
- *cpuusage = val;
-#endif
-}
-
-/* return total cpu usage (in nanoseconds) of a group */
-static u64 cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
-{
- struct cpuacct *ca = cgroup_ca(cgrp);
- u64 totalcpuusage = 0;
- int i;
-
- for_each_present_cpu(i)
- totalcpuusage += cpuacct_cpuusage_read(ca, i);
-
- return totalcpuusage;
-}
-
-static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
- u64 reset)
-{
- struct cpuacct *ca = cgroup_ca(cgrp);
- int err = 0;
- int i;
-
- if (reset) {
- err = -EINVAL;
- goto out;
- }
-
- for_each_present_cpu(i)
- cpuacct_cpuusage_write(ca, i, 0);
-
-out:
- return err;
-}
-
-static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
- struct seq_file *m)
-{
- struct cpuacct *ca = cgroup_ca(cgroup);
- u64 percpu;
- int i;
-
- for_each_present_cpu(i) {
- percpu = cpuacct_cpuusage_read(ca, i);
- seq_printf(m, "%llu ", (unsigned long long) percpu);
- }
- seq_printf(m, "\n");
- return 0;
-}
-
-static const char *cpuacct_stat_desc[] = {
- [CPUACCT_STAT_USER] = "user",
- [CPUACCT_STAT_SYSTEM] = "system",
-};
-
-static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
- struct cgroup_map_cb *cb)
-{
- struct cpuacct *ca = cgroup_ca(cgrp);
- int i;
-
- for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
- s64 val = percpu_counter_read(&ca->cpustat[i]);
- val = cputime64_to_clock_t(val);
- cb->fill(cb, cpuacct_stat_desc[i], val);
- }
- return 0;
-}
-
-static struct cftype files[] = {
- {
- .name = "usage",
- .read_u64 = cpuusage_read,
- .write_u64 = cpuusage_write,
- },
- {
- .name = "usage_percpu",
- .read_seq_string = cpuacct_percpu_seq_read,
- },
- {
- .name = "stat",
- .read_map = cpuacct_stats_show,
- },
-};
-
-static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
-{
- return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
-}
-
-/*
- * charge this task's execution time to its accounting group.
- *
- * called with rq->lock held.
- */
-static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
-{
- struct cpuacct *ca;
- int cpu;
-
- if (unlikely(!cpuacct_subsys.active))
- return;
-
- cpu = task_cpu(tsk);
-
- rcu_read_lock();
-
- ca = task_ca(tsk);
-
- for (; ca; ca = ca->parent) {
- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
- *cpuusage += cputime;
- }
-
- rcu_read_unlock();
-}
-
-/*
- * When CONFIG_VIRT_CPU_ACCOUNTING is enabled one jiffy can be very large
- * in cputime_t units. As a result, cpuacct_update_stats calls
- * percpu_counter_add with values large enough to always overflow the
- * per cpu batch limit causing bad SMP scalability.
- *
- * To fix this we scale percpu_counter_batch by cputime_one_jiffy so we
- * batch the same amount of time with CONFIG_VIRT_CPU_ACCOUNTING disabled
- * and enabled. We cap it at INT_MAX which is the largest allowed batch value.
- */
-#ifdef CONFIG_SMP
-#define CPUACCT_BATCH \
- min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX)
-#else
-#define CPUACCT_BATCH 0
-#endif
-
-/*
- * Charge the system/user time to the task's accounting group.
- */
-static void cpuacct_update_stats(struct task_struct *tsk,
- enum cpuacct_stat_index idx, cputime_t val)
-{
- struct cpuacct *ca;
- int batch = CPUACCT_BATCH;
-
- if (unlikely(!cpuacct_subsys.active))
- return;
-
- rcu_read_lock();
- ca = task_ca(tsk);
-
- do {
- __percpu_counter_add(&ca->cpustat[idx], val, batch);
- ca = ca->parent;
- } while (ca);
- rcu_read_unlock();
-}
-
-struct cgroup_subsys cpuacct_subsys = {
- .name = "cpuacct",
- .create = cpuacct_create,
- .destroy = cpuacct_destroy,
- .populate = cpuacct_populate,
- .subsys_id = cpuacct_subsys_id,
-};
-#endif /* CONFIG_CGROUP_CPUACCT */
-
#ifndef CONFIG_SMP
void synchronize_sched_expedited(void)
diff --git a/kernel/sched.h b/kernel/sched.h
new file mode 100644
index 0000000..2fc20e0
--- /dev/null
+++ b/kernel/sched.h
@@ -0,0 +1,7 @@
+#ifndef _LINUX_SCHED_LOCAL_H
+#define _LINUX_SCHED_LOCAL_H
+
+void lock_runqueue(unsigned int cpu);
+void unlock_runqueue(unsigned int cpu);
+
+#endif
--
1.6.3
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 2/2] scheduler: cgroups cpuaccounting: Make cpuusage atomic
2010-05-19 10:27 ` Peter Zijlstra
2010-05-19 18:58 ` scheduler: cleanup sched.c and extract cgroup_cpuaccount stuff into separate file Thomas Renninger
2010-05-19 18:58 ` [PATCH 1/2] scheduler: Extract cgroups_cpuaccount code from sched.c into own file V2 Thomas Renninger
@ 2010-05-19 18:58 ` Thomas Renninger
2010-05-19 19:02 ` Peter Zijlstra
2010-05-20 0:43 ` KAMEZAWA Hiroyuki
2 siblings, 2 replies; 21+ messages in thread
From: Thomas Renninger @ 2010-05-19 18:58 UTC (permalink / raw)
To: trenn, peterz; +Cc: mike, linux-kernel, lizf, menage, containers, mingo
and avoid locking on 32 bit.
This resolves an ugly dependency in cgroup_cpuaccount.c
on the per-cpu runqueue lock variable.
Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: linux-kernel@vger.kernel.org
CC: mike@android.com
CC: menage@google.com
CC: lizf@cn.fujitsu.com
CC: containers@lists.linux-foundation.org
CC: mingo@elte.hu
CC: peterz@infradead.org
---
kernel/cgroup_cpuaccount.c | 39 +++++++++------------------------------
kernel/sched.c | 11 -----------
kernel/sched.h | 7 -------
3 files changed, 9 insertions(+), 48 deletions(-)
delete mode 100644 kernel/sched.h
diff --git a/kernel/cgroup_cpuaccount.c b/kernel/cgroup_cpuaccount.c
index 0ad356a..0a53487 100644
--- a/kernel/cgroup_cpuaccount.c
+++ b/kernel/cgroup_cpuaccount.c
@@ -10,8 +10,6 @@
#include <asm/cputime.h>
-#include "sched.h"
-
/*
* CPU accounting code for task groups.
*
@@ -23,7 +21,7 @@
struct cpuacct {
struct cgroup_subsys_state css;
/* cpuusage holds pointer to a u64-type object on every cpu */
- u64 __percpu *cpuusage;
+ atomic64_t __percpu *cpuusage;
struct percpu_counter cpustat[CPUACCT_STAT_NSTATS];
struct cpuacct *parent;
};
@@ -54,7 +52,7 @@ static struct cgroup_subsys_state *cpuacct_create(
if (!ca)
goto out;
- ca->cpuusage = alloc_percpu(u64);
+ ca->cpuusage = alloc_percpu(atomic64_t);
if (!ca->cpuusage)
goto out_free_ca;
@@ -92,37 +90,18 @@ cpuacct_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
{
- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+ atomic64_t *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
u64 data;
-#ifndef CONFIG_64BIT
- /*
- * Take rq->lock to make 64-bit read safe on 32-bit platforms.
- */
- lock_runqueue(cpu);
- data = *cpuusage;
- unlock_runqueue(cpu);
-#else
- data = *cpuusage;
-#endif
-
+ data = atomic64_read(cpuusage);
return data;
}
static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
{
- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-
-#ifndef CONFIG_64BIT
- /*
- * Take rq->lock to make 64-bit write safe on 32-bit platforms.
- */
- lock_runqueue(cpu);
- *cpuusage = val;
- unlock_runqueue(cpu);
-#else
- *cpuusage = val;
-#endif
+ atomic64_t *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+
+ atomic64_set(cpuusage, val);
}
/* return total cpu usage (in nanoseconds) of a group */
@@ -232,8 +211,8 @@ void cpuacct_charge(struct task_struct *tsk, u64 cputime)
ca = task_ca(tsk);
for (; ca; ca = ca->parent) {
- u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
- *cpuusage += cputime;
+ atomic64_t *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+ atomic64_add(cputime, cpuusage);
}
rcu_read_unlock();
diff --git a/kernel/sched.c b/kernel/sched.c
index fc93cbd..e1caba2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -78,7 +78,6 @@
#include <asm/irq_regs.h>
#include "sched_cpupri.h"
-#include "sched.h"
#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
@@ -642,16 +641,6 @@ static inline int cpu_of(struct rq *rq)
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() (&__raw_get_cpu_var(runqueues))
-void lock_runqueue(unsigned int cpu)
-{
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-}
-
-void unlock_runqueue(unsigned int cpu)
-{
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-}
-
inline void update_rq_clock(struct rq *rq)
{
if (!rq->skip_clock_update)
diff --git a/kernel/sched.h b/kernel/sched.h
deleted file mode 100644
index 2fc20e0..0000000
--- a/kernel/sched.h
+++ /dev/null
@@ -1,7 +0,0 @@
-#ifndef _LINUX_SCHED_LOCAL_H
-#define _LINUX_SCHED_LOCAL_H
-
-void lock_runqueue(unsigned int cpu);
-void unlock_runqueue(unsigned int cpu);
-
-#endif
--
1.6.3
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 4/4] omap: cpu: Power tracking support for cgroup cpuacct
2010-05-19 18:56 ` Mike Chan
@ 2010-05-19 19:00 ` Nishanth Menon
0 siblings, 0 replies; 21+ messages in thread
From: Nishanth Menon @ 2010-05-19 19:00 UTC (permalink / raw)
To: Mike Chan, Kevin Hilman
Cc: Thomas Renninger, menage@google.com, balbir@in.ibm.com,
cpufreq@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-omap@vger.kernel.org
Mike Chan had written, on 05/19/2010 01:56 PM, the following:
> On Wed, May 19, 2010 at 8:34 AM, Thomas Renninger <trenn@suse.de> wrote:
>> On Wednesday 19 May 2010 15:11:03 Nishanth Menon wrote:
>>> Mike Chan had written, on 05/18/2010 08:30 PM, the following:
>>>> Specify new power field in struct omap_opp, which is
>>>> power exported in milliWatt.
>> ...
>>>> + totalpower += cpuacct_freq[i] * mpu_opps[i + 1].power;
>> ...
>>>> + unsigned long power; /* power consumed running at OPP in milliWatts */
>>> this conflicts with the OPP layer implementation. this structure will
>>> disappear for good.
>
> Okay, I wasn't sure what to use. The omap-pm branch uses omap_opp for
> the mpu frequencies. However, it looks like struct prcm_config is used
> in linux-next for mpu frequencies, but there are deprecation comments
> around this struct definition, so I'm not sure which struct I should
> be using.
Kevin can probably comment on opp upstream plans.
>
>> Ah yes, another question:
>> The power variable in struct omap_opp never gets initialized
>> with a sane value, something is wrong...
>>
>
> You're right, this last patch won't work on linux-next; it seems quite
> a bit has changed in the omap-specific code. Previously board files could
> register their OPP table, which is an array of struct omap_opp, and that
> is exactly what I do for Droid in our android-omap branch.
>
> I'm hoping some omap folks can offer some suggestions for what needs
> to be changed.
>
> -- Mike
>
>> Thomas
>>
--
Regards,
Nishanth Menon
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] scheduler: cgroups cpuaccounting: Make cpuusage atomic
2010-05-19 18:58 ` [PATCH 2/2] scheduler: cgroups cpuaccounting: Make cpuusage atomic Thomas Renninger
@ 2010-05-19 19:02 ` Peter Zijlstra
2010-05-19 19:13 ` Thomas Renninger
2010-05-20 10:53 ` Thomas Renninger
2010-05-20 0:43 ` KAMEZAWA Hiroyuki
1 sibling, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2010-05-19 19:02 UTC (permalink / raw)
To: Thomas Renninger; +Cc: mike, linux-kernel, lizf, menage, containers, mingo
On Wed, 2010-05-19 at 20:58 +0200, Thomas Renninger wrote:
> and avoid locking on 32 bit.
> This resolves an ugly dependency in cgroup_cpuaccount.c
> on the per-cpu runqueue lock variable.
While I totally agree with the sentiments here, atomic64_t isn't
available on all 32bit archs and is _terribly_ expensive on ARCH=i386.
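For readers unfamiliar with the 32-bit situation: archs without native
64-bit atomics fall back to a generic spinlock-based implementation, and
i386 needs a cmpxchg8b loop even for a plain atomic64_read(). A simplified
sketch of the generic fallback, modeled loosely on lib/atomic64.c; the
function names are hypothetical, and the real code hashes over an array of
cacheline-padded locks rather than using a single global lock:

#include <linux/spinlock.h>
#include <asm/atomic.h>

static DEFINE_RAW_SPINLOCK(a64_lock);

/* Even a plain read must take the lock: a 64-bit load is not
 * single-copy atomic on 32-bit hardware. */
static long long sketch_atomic64_read(const atomic64_t *v)
{
        unsigned long flags;
        long long val;

        raw_spin_lock_irqsave(&a64_lock, flags);
        val = v->counter;
        raw_spin_unlock_irqrestore(&a64_lock, flags);
        return val;
}

static void sketch_atomic64_add(long long a, atomic64_t *v)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&a64_lock, flags);
        v->counter += a;
        raw_spin_unlock_irqrestore(&a64_lock, flags);
}

This serializes every hot-path charge through a shared lock, which is the
scalability cost being objected to.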
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] scheduler: Extract cgroups_cpuaccount code from sched.c into own file
2010-05-19 9:30 ` [PATCH] scheduler: Extract cgroups_cpuaccount code from sched.c into own file Thomas Renninger
2010-05-19 9:37 ` Peter Zijlstra
@ 2010-05-19 19:06 ` Mike Chan
1 sibling, 0 replies; 21+ messages in thread
From: Mike Chan @ 2010-05-19 19:06 UTC (permalink / raw)
To: Thomas Renninger
Cc: khilman, menage, balbir, cpufreq, linux-kernel, linux-omap, mingo,
peterz, lizf, containers
2010/5/19 Thomas Renninger <trenn@suse.de>:
> Hi,
>
> thread topic was:
> Re: [PATCH 3/4] scheduler: cpuacct: Enable platform callbacks for cpuacct power tracking
>
> I bundled this patch together with a minor comment on Mike's patch.
> This way the interested people in CC are kept in the loop.
>
> Peter/Ingo: Can you pick up this cleanup if appropriate, please.
> Shall I resend separately or could you cut out comments below?
>
> On Wednesday 19 May 2010 03:30:19 Mike Chan wrote:
>> Platforms must register a cpu power function that returns power in
>> milliWatt seconds.
>>
>> Signed-off-by: Mike Chan <mike@android.com>
>> ---
>> Documentation/cgroups/cpuacct.txt | 3 +++
>> include/linux/cpuacct.h | 4 +++-
>> kernel/sched.c | 24 ++++++++++++++++++++++--
>> 3 files changed, 28 insertions(+), 3 deletions(-)
> ...
>> diff --git a/include/linux/cpuacct.h b/include/linux/cpuacct.h
>> index 9ff479e..effe842 100644
>> --- a/include/linux/cpuacct.h
>> +++ b/include/linux/cpuacct.h
>> @@ -31,7 +31,9 @@ struct cpuacct_cpufreq_calls {
> This is a general cpuacct_charge interface, not cpufreq specific?
> I'd call it "struct cpuacct_charge_calls".
> Platforms could then account for C-states, frequency, power, whatever
> they like; the latter two are implemented with your patches.
I'm ok with the name change if that's what people prefer.
>> */
>> void (*init) (void **cpuacct_data);
>> void (*charge) (void *cpuacct_data, u64 cputime, unsigned int cpu);
>> - void (*show) (void *cpuacct_data, struct cgroup_map_cb *cb);
>> + void (*cpufreq_show) (void *cpuacct_data, struct cgroup_map_cb *cb);
>> + /* Returns power consumed in milliWatt seconds */
>> + u64 (*power_usage) (void *cpuacct_data);
>> };
>> int cpuacct_register_cpufreq(struct cpuacct_cpufreq_calls *fn);
> Same here, why not name it cpuacct_register_charge?
> Possibly at other places as well.
>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 6b6c45a..d55d8af 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
> Nothing to do with this patch, but I wonder why this is all in kernel/sched.c.
> I tried a quick cleanup... it works.
> What about the cleanup below?
> Not sure through which tree this should go in, but if it is accepted,
> would you mind rebasing your things against it?
> Then your changes would already show up in the cgroup_cpuaccount.c git
> history.
Sounds reasonable
-- Mike
>
> Thomas
>
> This is a cleanup against the current linux-2.6 Linus tree.
>
> Having CONFIG_CGROUP_CPUACCT code in kernel/sched.c looks wrong.
> Move it out to kernel/cgroup_cpuaccount.c.
>
> Compile tested with and without CONFIG_CGROUP_CPUACCT set on x86_64.
>
> Signed-off-by: Thomas Renninger <trenn@suse.de>
> CC: linux-kernel@vger.kernel.org
> CC: mike@android.com
> CC: menage@google.com
> CC: lizf@cn.fujitsu.com
> CC: containers@lists.linux-foundation.org
> CC: mingo@elte.hu
> CC: peterz@infradead.org
>
> ---
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 8f78073..6e2c88a 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -609,6 +609,24 @@ bool css_is_ancestor(struct cgroup_subsys_state *cg,
> unsigned short css_id(struct cgroup_subsys_state *css);
> unsigned short css_depth(struct cgroup_subsys_state *css);
>
> +/* Time spent by the tasks of the cpu accounting group executing in ... */
> +enum cpuacct_stat_index {
> + CPUACCT_STAT_USER, /* ... user mode */
> + CPUACCT_STAT_SYSTEM, /* ... kernel mode */
> +
> + CPUACCT_STAT_NSTATS,
> +};
> +
> +#ifdef CONFIG_CGROUP_CPUACCT
> +void cpuacct_charge(struct task_struct *tsk, u64 cputime);
> +void cpuacct_update_stats(struct task_struct *tsk,
> + enum cpuacct_stat_index idx, cputime_t val);
> +#else
> +static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
> +static inline void cpuacct_update_stats(struct task_struct *tsk,
> + enum cpuacct_stat_index idx, cputime_t val) {}
> +#endif
> +
> #else /* !CONFIG_CGROUPS */
>
> static inline int cgroup_init_early(void) { return 0; }
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 149e18e..1df6e53 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -60,6 +60,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
> obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
> obj-$(CONFIG_COMPAT) += compat.o
> obj-$(CONFIG_CGROUPS) += cgroup.o
> +obj-$(CONFIG_CGROUP_CPUACCT) += cgroup_cpuaccount.o
> obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
> obj-$(CONFIG_CPUSETS) += cpuset.o
> obj-$(CONFIG_CGROUP_NS) += ns_cgroup.o
> diff --git a/kernel/cgroup_cpuaccount.c b/kernel/cgroup_cpuaccount.c
> new file mode 100644
> index 0000000..d32b927
> --- /dev/null
> +++ b/kernel/cgroup_cpuaccount.c
> @@ -0,0 +1,284 @@
> +#include <linux/kernel.h>
> +#include <linux/percpu.h>
> +#include <linux/spinlock.h>
> +#include <linux/sched.h>
> +#include <linux/cgroup.h>
> +#include <linux/srcu.h>
> +#include <linux/slab.h>
> +#include <linux/seq_file.h>
> +
> +#include <asm/cputime.h>
> +
> +/*
> + * CPU accounting code for task groups.
> + *
> + * Based on the work by Paul Menage (menage@google.com) and Balbir Singh
> + * (balbir@in.ibm.com).
> + */
> +
> +/* track cpu usage of a group of tasks and its child groups */
> +struct cpuacct {
> + struct cgroup_subsys_state css;
> + /* cpuusage holds pointer to a u64-type object on every cpu */
> + u64 __percpu *cpuusage;
> + struct percpu_counter cpustat[CPUACCT_STAT_NSTATS];
> + struct cpuacct *parent;
> +};
> +
> +struct cgroup_subsys cpuacct_subsys;
> +
> +/* return cpu accounting group corresponding to this container */
> +static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
> +{
> + return container_of(cgroup_subsys_state(cgrp, cpuacct_subsys_id),
> + struct cpuacct, css);
> +}
> +
> +/* return cpu accounting group to which this task belongs */
> +static inline struct cpuacct *task_ca(struct task_struct *tsk)
> +{
> + return container_of(task_subsys_state(tsk, cpuacct_subsys_id),
> + struct cpuacct, css);
> +}
> +
> +/* create a new cpu accounting group */
> +static struct cgroup_subsys_state *cpuacct_create(
> + struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + struct cpuacct *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
> + int i;
> +
> + if (!ca)
> + goto out;
> +
> + ca->cpuusage = alloc_percpu(u64);
> + if (!ca->cpuusage)
> + goto out_free_ca;
> +
> + for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
> + if (percpu_counter_init(&ca->cpustat[i], 0))
> + goto out_free_counters;
> +
> + if (cgrp->parent)
> + ca->parent = cgroup_ca(cgrp->parent);
> +
> + return &ca->css;
> +
> +out_free_counters:
> + while (--i >= 0)
> + percpu_counter_destroy(&ca->cpustat[i]);
> + free_percpu(ca->cpuusage);
> +out_free_ca:
> + kfree(ca);
> +out:
> + return ERR_PTR(-ENOMEM);
> +}
> +
> +/* destroy an existing cpu accounting group */
> +static void
> +cpuacct_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + struct cpuacct *ca = cgroup_ca(cgrp);
> + int i;
> +
> + for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
> + percpu_counter_destroy(&ca->cpustat[i]);
> + free_percpu(ca->cpuusage);
> + kfree(ca);
> +}
> +
> +static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
> +{
> + u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
> + u64 data;
> +
> +#ifndef CONFIG_64BIT
> + /*
> + * Take rq->lock to make 64-bit read safe on 32-bit platforms.
> + */
> + raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> + data = *cpuusage;
> + raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> +#else
> + data = *cpuusage;
> +#endif
> +
> + return data;
> +}
> +
> +static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
> +{
> + u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
> +
> +#ifndef CONFIG_64BIT
> + /*
> + * Take rq->lock to make 64-bit write safe on 32-bit platforms.
> + */
> + raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> + *cpuusage = val;
> + raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> +#else
> + *cpuusage = val;
> +#endif
> +}
> +
> +/* return total cpu usage (in nanoseconds) of a group */
> +static u64 cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> + struct cpuacct *ca = cgroup_ca(cgrp);
> + u64 totalcpuusage = 0;
> + int i;
> +
> + for_each_present_cpu(i)
> + totalcpuusage += cpuacct_cpuusage_read(ca, i);
> +
> + return totalcpuusage;
> +}
> +
> +static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
> + u64 reset)
> +{
> + struct cpuacct *ca = cgroup_ca(cgrp);
> + int err = 0;
> + int i;
> +
> + if (reset) {
> + err = -EINVAL;
> + goto out;
> + }
> +
> + for_each_present_cpu(i)
> + cpuacct_cpuusage_write(ca, i, 0);
> +
> +out:
> + return err;
> +}
> +
> +static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
> + struct seq_file *m)
> +{
> + struct cpuacct *ca = cgroup_ca(cgroup);
> + u64 percpu;
> + int i;
> +
> + for_each_present_cpu(i) {
> + percpu = cpuacct_cpuusage_read(ca, i);
> + seq_printf(m, "%llu ", (unsigned long long) percpu);
> + }
> + seq_printf(m, "\n");
> + return 0;
> +}
> +
> +static const char *cpuacct_stat_desc[] = {
> + [CPUACCT_STAT_USER] = "user",
> + [CPUACCT_STAT_SYSTEM] = "system",
> +};
> +
> +static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
> + struct cgroup_map_cb *cb)
> +{
> + struct cpuacct *ca = cgroup_ca(cgrp);
> + int i;
> +
> + for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
> + s64 val = percpu_counter_read(&ca->cpustat[i]);
> + val = cputime64_to_clock_t(val);
> + cb->fill(cb, cpuacct_stat_desc[i], val);
> + }
> + return 0;
> +}
> +
> +static struct cftype files[] = {
> + {
> + .name = "usage",
> + .read_u64 = cpuusage_read,
> + .write_u64 = cpuusage_write,
> + },
> + {
> + .name = "usage_percpu",
> + .read_seq_string = cpuacct_percpu_seq_read,
> + },
> + {
> + .name = "stat",
> + .read_map = cpuacct_stats_show,
> + },
> +};
> +
> +static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
> +}
> +
> +/*
> + * charge this task's execution time to its accounting group.
> + *
> + * called with rq->lock held.
> + */
> +void cpuacct_charge(struct task_struct *tsk, u64 cputime)
> +{
> + struct cpuacct *ca;
> + int cpu;
> +
> + if (unlikely(!cpuacct_subsys.active))
> + return;
> +
> + cpu = task_cpu(tsk);
> +
> + rcu_read_lock();
> +
> + ca = task_ca(tsk);
> +
> + for (; ca; ca = ca->parent) {
> + u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
> + *cpuusage += cputime;
> + }
> +
> + rcu_read_unlock();
> +}
> +
> +/*
> + * When CONFIG_VIRT_CPU_ACCOUNTING is enabled one jiffy can be very large
> + * in cputime_t units. As a result, cpuacct_update_stats calls
> + * percpu_counter_add with values large enough to always overflow the
> + * per cpu batch limit causing bad SMP scalability.
> + *
> + * To fix this we scale percpu_counter_batch by cputime_one_jiffy so we
> + * batch the same amount of time with CONFIG_VIRT_CPU_ACCOUNTING disabled
> + * and enabled. We cap it at INT_MAX which is the largest allowed batch value.
> + */
> +#ifdef CONFIG_SMP
> +#define CPUACCT_BATCH \
> + min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX)
> +#else
> +#define CPUACCT_BATCH 0
> +#endif
> +
> +/*
> + * Charge the system/user time to the task's accounting group.
> + */
> +void cpuacct_update_stats(struct task_struct *tsk,
> + enum cpuacct_stat_index idx, cputime_t val)
> +{
> + struct cpuacct *ca;
> + int batch = CPUACCT_BATCH;
> +
> + if (unlikely(!cpuacct_subsys.active))
> + return;
> +
> + rcu_read_lock();
> + ca = task_ca(tsk);
> +
> + do {
> + __percpu_counter_add(&ca->cpustat[idx], val, batch);
> + ca = ca->parent;
> + } while (ca);
> + rcu_read_unlock();
> +}
> +
> +struct cgroup_subsys cpuacct_subsys = {
> + .name = "cpuacct",
> + .create = cpuacct_create,
> + .destroy = cpuacct_destroy,
> + .populate = cpuacct_populate,
> + .subsys_id = cpuacct_subsys_id,
> +};
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 1d93cd0..45d60dd 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1394,24 +1394,6 @@ static const u32 prio_to_wmult[40] = {
> /* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
> };
>
> -/* Time spent by the tasks of the cpu accounting group executing in ... */
> -enum cpuacct_stat_index {
> - CPUACCT_STAT_USER, /* ... user mode */
> - CPUACCT_STAT_SYSTEM, /* ... kernel mode */
> -
> - CPUACCT_STAT_NSTATS,
> -};
> -
> -#ifdef CONFIG_CGROUP_CPUACCT
> -static void cpuacct_charge(struct task_struct *tsk, u64 cputime);
> -static void cpuacct_update_stats(struct task_struct *tsk,
> - enum cpuacct_stat_index idx, cputime_t val);
> -#else
> -static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
> -static inline void cpuacct_update_stats(struct task_struct *tsk,
> - enum cpuacct_stat_index idx, cputime_t val) {}
> -#endif
> -
> static inline void inc_cpu_load(struct rq *rq, unsigned long load)
> {
> update_load_add(&rq->load, load);
> @@ -8617,283 +8599,6 @@ struct cgroup_subsys cpu_cgroup_subsys = {
>
> #endif /* CONFIG_CGROUP_SCHED */
>
> -#ifdef CONFIG_CGROUP_CPUACCT
> -
> -/*
> - * CPU accounting code for task groups.
> - *
> - * Based on the work by Paul Menage (menage@google.com) and Balbir Singh
> - * (balbir@in.ibm.com).
> - */
> -
> -/* track cpu usage of a group of tasks and its child groups */
> -struct cpuacct {
> - struct cgroup_subsys_state css;
> - /* cpuusage holds pointer to a u64-type object on every cpu */
> - u64 __percpu *cpuusage;
> - struct percpu_counter cpustat[CPUACCT_STAT_NSTATS];
> - struct cpuacct *parent;
> -};
> -
> -struct cgroup_subsys cpuacct_subsys;
> -
> -/* return cpu accounting group corresponding to this container */
> -static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
> -{
> - return container_of(cgroup_subsys_state(cgrp, cpuacct_subsys_id),
> - struct cpuacct, css);
> -}
> -
> -/* return cpu accounting group to which this task belongs */
> -static inline struct cpuacct *task_ca(struct task_struct *tsk)
> -{
> - return container_of(task_subsys_state(tsk, cpuacct_subsys_id),
> - struct cpuacct, css);
> -}
> -
> -/* create a new cpu accounting group */
> -static struct cgroup_subsys_state *cpuacct_create(
> - struct cgroup_subsys *ss, struct cgroup *cgrp)
> -{
> - struct cpuacct *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
> - int i;
> -
> - if (!ca)
> - goto out;
> -
> - ca->cpuusage = alloc_percpu(u64);
> - if (!ca->cpuusage)
> - goto out_free_ca;
> -
> - for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
> - if (percpu_counter_init(&ca->cpustat[i], 0))
> - goto out_free_counters;
> -
> - if (cgrp->parent)
> - ca->parent = cgroup_ca(cgrp->parent);
> -
> - return &ca->css;
> -
> -out_free_counters:
> - while (--i >= 0)
> - percpu_counter_destroy(&ca->cpustat[i]);
> - free_percpu(ca->cpuusage);
> -out_free_ca:
> - kfree(ca);
> -out:
> - return ERR_PTR(-ENOMEM);
> -}
> -
> -/* destroy an existing cpu accounting group */
> -static void
> -cpuacct_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
> -{
> - struct cpuacct *ca = cgroup_ca(cgrp);
> - int i;
> -
> - for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
> - percpu_counter_destroy(&ca->cpustat[i]);
> - free_percpu(ca->cpuusage);
> - kfree(ca);
> -}
> -
> -static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
> -{
> - u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
> - u64 data;
> -
> -#ifndef CONFIG_64BIT
> - /*
> - * Take rq->lock to make 64-bit read safe on 32-bit platforms.
> - */
> - raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> - data = *cpuusage;
> - raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> -#else
> - data = *cpuusage;
> -#endif
> -
> - return data;
> -}
> -
> -static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
> -{
> - u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
> -
> -#ifndef CONFIG_64BIT
> - /*
> - * Take rq->lock to make 64-bit write safe on 32-bit platforms.
> - */
> - raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> - *cpuusage = val;
> - raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> -#else
> - *cpuusage = val;
> -#endif
> -}
> -
> -/* return total cpu usage (in nanoseconds) of a group */
> -static u64 cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
> -{
> - struct cpuacct *ca = cgroup_ca(cgrp);
> - u64 totalcpuusage = 0;
> - int i;
> -
> - for_each_present_cpu(i)
> - totalcpuusage += cpuacct_cpuusage_read(ca, i);
> -
> - return totalcpuusage;
> -}
> -
> -static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
> - u64 reset)
> -{
> - struct cpuacct *ca = cgroup_ca(cgrp);
> - int err = 0;
> - int i;
> -
> - if (reset) {
> - err = -EINVAL;
> - goto out;
> - }
> -
> - for_each_present_cpu(i)
> - cpuacct_cpuusage_write(ca, i, 0);
> -
> -out:
> - return err;
> -}
> -
> -static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
> - struct seq_file *m)
> -{
> - struct cpuacct *ca = cgroup_ca(cgroup);
> - u64 percpu;
> - int i;
> -
> - for_each_present_cpu(i) {
> - percpu = cpuacct_cpuusage_read(ca, i);
> - seq_printf(m, "%llu ", (unsigned long long) percpu);
> - }
> - seq_printf(m, "\n");
> - return 0;
> -}
> -
> -static const char *cpuacct_stat_desc[] = {
> - [CPUACCT_STAT_USER] = "user",
> - [CPUACCT_STAT_SYSTEM] = "system",
> -};
> -
> -static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
> - struct cgroup_map_cb *cb)
> -{
> - struct cpuacct *ca = cgroup_ca(cgrp);
> - int i;
> -
> - for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
> - s64 val = percpu_counter_read(&ca->cpustat[i]);
> - val = cputime64_to_clock_t(val);
> - cb->fill(cb, cpuacct_stat_desc[i], val);
> - }
> - return 0;
> -}
> -
> -static struct cftype files[] = {
> - {
> - .name = "usage",
> - .read_u64 = cpuusage_read,
> - .write_u64 = cpuusage_write,
> - },
> - {
> - .name = "usage_percpu",
> - .read_seq_string = cpuacct_percpu_seq_read,
> - },
> - {
> - .name = "stat",
> - .read_map = cpuacct_stats_show,
> - },
> -};
> -
> -static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
> -{
> - return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
> -}
> -
> -/*
> - * charge this task's execution time to its accounting group.
> - *
> - * called with rq->lock held.
> - */
> -static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
> -{
> - struct cpuacct *ca;
> - int cpu;
> -
> - if (unlikely(!cpuacct_subsys.active))
> - return;
> -
> - cpu = task_cpu(tsk);
> -
> - rcu_read_lock();
> -
> - ca = task_ca(tsk);
> -
> - for (; ca; ca = ca->parent) {
> - u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
> - *cpuusage += cputime;
> - }
> -
> - rcu_read_unlock();
> -}
> -
> -/*
> - * When CONFIG_VIRT_CPU_ACCOUNTING is enabled one jiffy can be very large
> - * in cputime_t units. As a result, cpuacct_update_stats calls
> - * percpu_counter_add with values large enough to always overflow the
> - * per cpu batch limit causing bad SMP scalability.
> - *
> - * To fix this we scale percpu_counter_batch by cputime_one_jiffy so we
> - * batch the same amount of time with CONFIG_VIRT_CPU_ACCOUNTING disabled
> - * and enabled. We cap it at INT_MAX which is the largest allowed batch value.
> - */
> -#ifdef CONFIG_SMP
> -#define CPUACCT_BATCH \
> - min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX)
> -#else
> -#define CPUACCT_BATCH 0
> -#endif
> -
> -/*
> - * Charge the system/user time to the task's accounting group.
> - */
> -static void cpuacct_update_stats(struct task_struct *tsk,
> - enum cpuacct_stat_index idx, cputime_t val)
> -{
> - struct cpuacct *ca;
> - int batch = CPUACCT_BATCH;
> -
> - if (unlikely(!cpuacct_subsys.active))
> - return;
> -
> - rcu_read_lock();
> - ca = task_ca(tsk);
> -
> - do {
> - __percpu_counter_add(&ca->cpustat[idx], val, batch);
> - ca = ca->parent;
> - } while (ca);
> - rcu_read_unlock();
> -}
> -
> -struct cgroup_subsys cpuacct_subsys = {
> - .name = "cpuacct",
> - .create = cpuacct_create,
> - .destroy = cpuacct_destroy,
> - .populate = cpuacct_populate,
> - .subsys_id = cpuacct_subsys_id,
> -};
> -#endif /* CONFIG_CGROUP_CPUACCT */
> -
> #ifndef CONFIG_SMP
>
> void synchronize_sched_expedited(void)
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] scheduler: cgroups cpuaccounting: Make cpuusage atomic
2010-05-19 19:02 ` Peter Zijlstra
@ 2010-05-19 19:13 ` Thomas Renninger
2010-05-19 19:31 ` Mike Chan
2010-05-20 10:53 ` Thomas Renninger
1 sibling, 1 reply; 21+ messages in thread
From: Thomas Renninger @ 2010-05-19 19:13 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: mike, linux-kernel, lizf, menage, containers, mingo
On Wednesday 19 May 2010 09:02:23 pm Peter Zijlstra wrote:
> On Wed, 2010-05-19 at 20:58 +0200, Thomas Renninger wrote:
> > and avoid locking on 32 bit.
> > This resolves an ugly dependency in cgroup_cpuaccount.c
> > on the per-cpu runqueue lock variable.
>
> While I totally agree with the sentiments here, atomic64_t isn't
> available on all 32bit archs and is _terribly_ expensive on ARCH=i386.
Speed on i386 shouldn't matter that much here.
Big machines run x86_64, and cgroup cpu accounting doesn't make much
sense on old i386 single-core machines.
It would still be great to see the first patch go in then.
The idea of splitting was to:
a) show the real change
b) let at least the first one go in if the second does not work out
Thanks,
Thomas
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] scheduler: cgroups cpuaccounting: Make cpuusage atomic
2010-05-19 19:13 ` Thomas Renninger
@ 2010-05-19 19:31 ` Mike Chan
0 siblings, 0 replies; 21+ messages in thread
From: Mike Chan @ 2010-05-19 19:31 UTC (permalink / raw)
To: Thomas Renninger
Cc: Peter Zijlstra, linux-kernel, lizf, menage, containers, mingo
On Wed, May 19, 2010 at 12:13 PM, Thomas Renninger <trenn@suse.de> wrote:
> On Wednesday 19 May 2010 09:02:23 pm Peter Zijlstra wrote:
>> On Wed, 2010-05-19 at 20:58 +0200, Thomas Renninger wrote:
>> > and avoid locking on 32 bit.
>> > This resolves an ugly dependency in cgroup_cpuaccount.c
>> > on the per-cpu runqueue lock variable.
>>
>> While I totally agree with the sentiments here, atomic64_t isn't
>> available on all 32bit archs and is _terribly_ expensive on ARCH=i386.
> Speed on i386 shouldn't matter that much here.
> Big machines run x86_64, and cgroup cpu accounting doesn't make much
> sense on old i386 single-core machines.
>
cpuacct can still be useful on single-core machines; at least in
Android it tracks the cpu usage of your apps. On ARM I'm not sure
(maybe someone more knowledgeable can chime in) how much overhead this
will incur relative to the scheduler / cpuacct overhead that's already
there.
Since this is a per-cpu value already, is it not sufficient to just
disable interrupts on that cpu while doing a normal u64 add for both
MP and UP systems?
-- Mike
> It would still be great to see the first patch go in then.
> The idea of splitting was to:
> a) show the real change
> b) let at least the first one go in if the second does not work out
>
> Thanks,
>
> Thomas
>
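To make Mike's question above concrete, a hypothetical sketch of the
irq-disable approach; cpuacct_charge_local is an illustrative name, this
was not a posted patch, and it assumes the struct cpuacct from the patches
earlier in the thread. One caveat: disabling interrupts protects the
read-modify-write against interrupt context on the owning CPU, but on
32 bit a reader on another CPU could still observe a torn 64-bit value,
which is exactly what the rq->lock in the sysfs path was guarding against.

/* Hypothetical sketch: charge the local CPU's counter with irqs off. */
static void cpuacct_charge_local(struct cpuacct *ca, u64 cputime)
{
        unsigned long flags;
        u64 *cpuusage;

        local_irq_save(flags);
        /*
         * Safe against interrupt context on this CPU only; a concurrent
         * 64-bit read from another CPU can still tear on 32 bit.
         */
        cpuusage = per_cpu_ptr(ca->cpuusage, smp_processor_id());
        *cpuusage += cputime;
        local_irq_restore(flags);
}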
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] scheduler: cgroups cpuaccounting: Make cpuusage atomic
2010-05-19 18:58 ` [PATCH 2/2] scheduler: cgroups cpuaccounting: Make cpuusage atomic Thomas Renninger
2010-05-19 19:02 ` Peter Zijlstra
@ 2010-05-20 0:43 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-20 0:43 UTC (permalink / raw)
To: Thomas Renninger
Cc: peterz, mike, linux-kernel, lizf, menage, containers, mingo
On Wed, 19 May 2010 20:58:59 +0200
Thomas Renninger <trenn@suse.de> wrote:
> and avoid locking on 32 bit.
> This resolves an ugly dependency in cgroup_cpuaccount.c
> on the per-cpu runqueue lock variable.
>
> Signed-off-by: Thomas Renninger <trenn@suse.de>
> CC: linux-kernel@vger.kernel.org
> CC: mike@android.com
> CC: menage@google.com
> CC: lizf@cn.fujitsu.com
> CC: containers@lists.linux-foundation.org
> CC: mingo@elte.hu
> CC: peterz@infradead.org
> ---
> kernel/cgroup_cpuaccount.c | 39 +++++++++------------------------------
> kernel/sched.c | 11 -----------
> kernel/sched.h | 7 -------
> 3 files changed, 9 insertions(+), 48 deletions(-)
> delete mode 100644 kernel/sched.h
>
> diff --git a/kernel/cgroup_cpuaccount.c b/kernel/cgroup_cpuaccount.c
> index 0ad356a..0a53487 100644
> --- a/kernel/cgroup_cpuaccount.c
> +++ b/kernel/cgroup_cpuaccount.c
> @@ -10,8 +10,6 @@
>
> #include <asm/cputime.h>
>
> -#include "sched.h"
> -
> /*
> * CPU accounting code for task groups.
> *
> @@ -23,7 +21,7 @@
> struct cpuacct {
> struct cgroup_subsys_state css;
> /* cpuusage holds pointer to a u64-type object on every cpu */
> - u64 __percpu *cpuusage;
> + atomic64_t __percpu *cpuusage;
Hmmm... how about using a seqcount for 32 bit rather than atomic64_t?
Thanks,
-Kame
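A hypothetical sketch of the seqcount idea, for illustration only; the
struct and function names are made up and this was not a posted patch.
The per-cpu writer, already serialized with irqs off under rq->lock,
brackets the 64-bit update with a write_seqcount pair, and a cross-CPU
reader retries until it observes a stable value. This keeps the hot write
path lock-free on 32 bit at the cost of two extra stores per update.

#include <linux/seqlock.h>

struct cpuacct_usage {
        seqcount_t seq;         /* seqcount_init() in cpuacct_create() */
        u64 val;
};

/* Writer: runs on the owning CPU with interrupts disabled. */
static void cpuusage_add(struct cpuacct_usage *cu, u64 cputime)
{
        write_seqcount_begin(&cu->seq);
        cu->val += cputime;
        write_seqcount_end(&cu->seq);
}

/* Reader: may run on any CPU; loops until the value is stable. */
static u64 cpuusage_get(struct cpuacct_usage *cu)
{
        unsigned int seq;
        u64 val;

        do {
                seq = read_seqcount_begin(&cu->seq);
                val = cu->val;
        } while (read_seqcount_retry(&cu->seq, seq));

        return val;
}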
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] scheduler: cgroups cpuaccounting: Make cpuusage atomic
2010-05-19 19:02 ` Peter Zijlstra
2010-05-19 19:13 ` Thomas Renninger
@ 2010-05-20 10:53 ` Thomas Renninger
1 sibling, 0 replies; 21+ messages in thread
From: Thomas Renninger @ 2010-05-20 10:53 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: mike, linux-kernel, lizf, menage, containers, mingo
On Wednesday 19 May 2010 21:02:23 Peter Zijlstra wrote:
> On Wed, 2010-05-19 at 20:58 +0200, Thomas Renninger wrote:
> > and avoid locking on 32 bit.
> > This resolves an ugly dependency in cgroup_cpuaccount.c
> > on the per-cpu runqueue lock variable.
>
> While I totally agree with the sentiments here, atomic64_t isn't
> available on all 32bit archs and is _terribly_ expensive on ARCH=i386.
Please ignore the 2nd patch.
If nobody objects to the creation of kernel/sched.h, which IMO should get
used for further cleanups, the first patch should go in.
I thought a bit about it, but taking the per-cpu rq lock is the sanest
thing I could come up with.
Here is a fix (the bug was always there), to be applied on top of the first
patch. It should also be possible to fix this by exchanging CONFIG_64BIT
for CONFIG_X86_64, but removing the #ifdefs seems preferable, since the
lock is only grabbed via sysfs access anyway...
Thanks,
Thomas
scheduler: cgroup_cpuaccount - Fix race on cpuusage
A u64 access is not atomic on all 64-bit archs.
In particular, this:
*cpuusage += cputime;
is not necessarily atomic on non-x86_64 archs.
As cpuacct_cpuusage_{read,write} are only used during
sysfs access, the lock is acquired rather rarely,
and the code is nicer as well.
Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: linux-kernel@vger.kernel.org
CC: mike@android.com
CC: menage@google.com
CC: lizf@cn.fujitsu.com
CC: containers@lists.linux-foundation.org
CC: mingo@elte.hu
CC: peterz@infradead.org
diff --git a/kernel/cgroup_cpuaccount.c b/kernel/cgroup_cpuaccount.c
index 0ad356a..a9e0345 100644
--- a/kernel/cgroup_cpuaccount.c
+++ b/kernel/cgroup_cpuaccount.c
@@ -95,16 +95,12 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
u64 data;
-#ifndef CONFIG_64BIT
/*
- * Take rq->lock to make 64-bit read safe on 32-bit platforms.
+ * Take rq->lock to make 64-bit read safe
*/
lock_runqueue(cpu);
data = *cpuusage;
unlock_runqueue(cpu);
-#else
- data = *cpuusage;
-#endif
return data;
}
@@ -113,16 +109,12 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
{
u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-#ifndef CONFIG_64BIT
/*
- * Take rq->lock to make 64-bit write safe on 32-bit platforms.
+ * Take rq->lock to make 64-bit write safe
*/
lock_runqueue(cpu);
*cpuusage = val;
unlock_runqueue(cpu);
-#else
- *cpuusage = val;
-#endif
}
/* return total cpu usage (in nanoseconds) of a group */
^ permalink raw reply related [flat|nested] 21+ messages in thread
end of thread
Thread overview: 21+ messages
2010-05-19 1:30 [PATCH 0/4] Enable cpu frequency and power tracking for cpuacct cgroup Mike Chan
2010-05-19 1:30 ` [PATCH 1/4] scheduler: cpuacct: Enable platform hooks to track cpuusage for CPU frequencies Mike Chan
2010-05-19 1:30 ` [PATCH 2/4] omap: cpu: Implement callbacks for cpu frequency tracking in cpuacct Mike Chan
2010-05-19 1:30 ` [PATCH 3/4] scheduler: cpuacct: Enable platform callbacks for cpuacct power tracking Mike Chan
2010-05-19 9:30 ` [PATCH] scheduler: Extract cgroups_cpuaccount code from sched.c into own file Thomas Renninger
2010-05-19 9:37 ` Peter Zijlstra
2010-05-19 10:27 ` Peter Zijlstra
2010-05-19 18:58 ` scheduler: cleanup sched.c and extract cgroup_cpuaccount stuff into separate file Thomas Renninger
2010-05-19 18:58 ` [PATCH 1/2] scheduler: Extract cgroups_cpuaccount code from sched.c into own file V2 Thomas Renninger
2010-05-19 18:58 ` [PATCH 2/2] scheduler: cgroups cpuaccounting: Make cpuusage atomic Thomas Renninger
2010-05-19 19:02 ` Peter Zijlstra
2010-05-19 19:13 ` Thomas Renninger
2010-05-19 19:31 ` Mike Chan
2010-05-20 10:53 ` Thomas Renninger
2010-05-20 0:43 ` KAMEZAWA Hiroyuki
2010-05-19 19:06 ` [PATCH] scheduler: Extract cgroups_cpuaccount code from sched.c into own file Mike Chan
2010-05-19 1:30 ` [PATCH 4/4] omap: cpu: Power tracking support for cgroup cpuacct Mike Chan
2010-05-19 13:11 ` Nishanth Menon
2010-05-19 15:34 ` Thomas Renninger
2010-05-19 18:56 ` Mike Chan
2010-05-19 19:00 ` Nishanth Menon