* [PATCH 0/4] Provide cpuacct functionality in cpu cgroup
From: Glauber Costa @ 2011-11-15 15:59 UTC
To: linux-kernel
Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
cgroups
Hi,
This is an excerpt of the last patch set I sent regarding the cpu cgroup.
It is mostly focused on cleaning up what we have now, so it can be
considered largely preparatory. As a user of the new organization, the
last patch in the series adds the cpuacct stats functionality to the
cpu cgroup. The files related to cpuusage will be sent in a follow-up
series once this one is merged.
Let me know if there is anything you'd like me to address.
Glauber Costa (4):
Change cpustat fields to an array.
split kernel stat in two
Keep scheduler statistics per cgroup
provide a version of cpuacct statistics inside cpu cgroup
arch/s390/appldata/appldata_os.c | 18 ++-
arch/x86/include/asm/i387.h | 2 +-
drivers/cpufreq/cpufreq_conservative.c | 33 +++--
drivers/cpufreq/cpufreq_ondemand.c | 33 +++--
drivers/macintosh/rack-meter.c | 8 +-
fs/proc/stat.c | 67 +++++-----
fs/proc/uptime.c | 9 +-
include/linux/kernel_stat.h | 54 ++++++--
kernel/sched.c | 228 +++++++++++++++++++++++++-------
9 files changed, 315 insertions(+), 137 deletions(-)
--
1.7.6.4
* [PATCH 1/4] Change cpustat fields to an array.
From: Glauber Costa @ 2011-11-15 15:59 UTC
To: linux-kernel
Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
cgroups, Glauber Costa
This gives us a bit more flexibility in dealing with the fields of
this structure. It is a preparation patch for the later patches in
this series.
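To illustrate the shape of the change (a minimal sketch, not taken
verbatim from the hunks below; account_field() is a hypothetical helper
shown only to demonstrate the indexed access pattern):

/* Before: one named cputime64_t field per statistic. */
struct cpu_usage_stat {
	cputime64_t user;
	cputime64_t nice;
	/* ... one field per statistic ... */
};

/* After: an enum-indexed u64 array. Callers can now pass an index
 * around, or iterate from 0 to NR_STATS, instead of naming a field. */
enum cpu_usage_stat { USER, NICE, /* ... */ NR_STATS };

static inline void account_field(u64 *cpustat, int index, u64 delta)
{
	cpustat[index] += delta;	/* e.g. cpustat[NICE] += tmp */
}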
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Paul Turner <pjt@google.com>
---
arch/s390/appldata/appldata_os.c | 16 ++++----
arch/x86/include/asm/i387.h | 2 +-
drivers/cpufreq/cpufreq_conservative.c | 23 +++++-----
drivers/cpufreq/cpufreq_ondemand.c | 23 +++++-----
drivers/macintosh/rack-meter.c | 6 +-
fs/proc/stat.c | 63 +++++++++++++---------------
fs/proc/uptime.c | 4 +-
include/linux/kernel_stat.h | 30 +++++++------
kernel/sched.c | 71 ++++++++++++++++----------------
9 files changed, 117 insertions(+), 121 deletions(-)
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 92f1cb7..3d6b672 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -115,21 +115,21 @@ static void appldata_get_os_data(void *data)
j = 0;
for_each_online_cpu(i) {
os_data->os_cpu[j].per_cpu_user =
- cputime_to_jiffies(kstat_cpu(i).cpustat.user);
+ cputime_to_jiffies(kstat_cpu(i).cpustat[USER]);
os_data->os_cpu[j].per_cpu_nice =
- cputime_to_jiffies(kstat_cpu(i).cpustat.nice);
+ cputime_to_jiffies(kstat_cpu(i).cpustat[NICE]);
os_data->os_cpu[j].per_cpu_system =
- cputime_to_jiffies(kstat_cpu(i).cpustat.system);
+ cputime_to_jiffies(kstat_cpu(i).cpustat[SYSTEM]);
os_data->os_cpu[j].per_cpu_idle =
- cputime_to_jiffies(kstat_cpu(i).cpustat.idle);
+ cputime_to_jiffies(kstat_cpu(i).cpustat[IDLE]);
os_data->os_cpu[j].per_cpu_irq =
- cputime_to_jiffies(kstat_cpu(i).cpustat.irq);
+ cputime_to_jiffies(kstat_cpu(i).cpustat[IRQ]);
os_data->os_cpu[j].per_cpu_softirq =
- cputime_to_jiffies(kstat_cpu(i).cpustat.softirq);
+ cputime_to_jiffies(kstat_cpu(i).cpustat[SOFTIRQ]);
os_data->os_cpu[j].per_cpu_iowait =
- cputime_to_jiffies(kstat_cpu(i).cpustat.iowait);
+ cputime_to_jiffies(kstat_cpu(i).cpustat[IOWAIT]);
os_data->os_cpu[j].per_cpu_steal =
- cputime_to_jiffies(kstat_cpu(i).cpustat.steal);
+ cputime_to_jiffies(kstat_cpu(i).cpustat[STEAL]);
os_data->os_cpu[j].cpu_id = i;
j++;
}
diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index c9e09ea..56fa4d7 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -218,7 +218,7 @@ static inline void fpu_fxsave(struct fpu *fpu)
#ifdef CONFIG_SMP
#define safe_address (__per_cpu_offset[0])
#else
-#define safe_address (kstat_cpu(0).cpustat.user)
+#define safe_address (kstat_cpu(0).cpustat[USER])
#endif
/*
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index c97b468..2ab538f 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -103,13 +103,13 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
cputime64_t busy_time;
cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
- busy_time = cputime64_add(kstat_cpu(cpu).cpustat.user,
- kstat_cpu(cpu).cpustat.system);
+ busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
+ kstat_cpu(cpu).cpustat[SYSTEM]);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.irq);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.softirq);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.steal);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.nice);
+ busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
+ busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
+ busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
+ busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
idle_time = cputime64_sub(cur_wall_time, busy_time);
if (wall)
@@ -272,7 +272,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
&dbs_info->prev_cpu_wall);
if (dbs_tuners_ins.ignore_nice)
- dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
+ dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
}
return count;
}
@@ -365,7 +365,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
cputime64_t cur_nice;
unsigned long cur_nice_jiffies;
- cur_nice = cputime64_sub(kstat_cpu(j).cpustat.nice,
+ cur_nice = cputime64_sub(kstat_cpu(j).cpustat[NICE],
j_dbs_info->prev_cpu_nice);
/*
* Assumption: nice time between sampling periods will
@@ -374,7 +374,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
cur_nice_jiffies = (unsigned long)
cputime64_to_jiffies64(cur_nice);
- j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
+ j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
idle_time += jiffies_to_usecs(cur_nice_jiffies);
}
@@ -501,10 +501,9 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
&j_dbs_info->prev_cpu_wall);
- if (dbs_tuners_ins.ignore_nice) {
+ if (dbs_tuners_ins.ignore_nice)
j_dbs_info->prev_cpu_nice =
- kstat_cpu(j).cpustat.nice;
- }
+ kstat_cpu(j).cpustat[NICE];
}
this_dbs_info->down_skip = 0;
this_dbs_info->requested_freq = policy->cur;
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index fa8af4e..45d8e17 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -127,13 +127,13 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
cputime64_t busy_time;
cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
- busy_time = cputime64_add(kstat_cpu(cpu).cpustat.user,
- kstat_cpu(cpu).cpustat.system);
+ busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
+ kstat_cpu(cpu).cpustat[SYSTEM]);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.irq);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.softirq);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.steal);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.nice);
+ busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
+ busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
+ busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
+ busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
idle_time = cputime64_sub(cur_wall_time, busy_time);
if (wall)
@@ -345,7 +345,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
&dbs_info->prev_cpu_wall);
if (dbs_tuners_ins.ignore_nice)
- dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
+ dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
}
return count;
@@ -458,7 +458,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
cputime64_t cur_nice;
unsigned long cur_nice_jiffies;
- cur_nice = cputime64_sub(kstat_cpu(j).cpustat.nice,
+ cur_nice = cputime64_sub(kstat_cpu(j).cpustat[NICE],
j_dbs_info->prev_cpu_nice);
/*
* Assumption: nice time between sampling periods will
@@ -467,7 +467,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
cur_nice_jiffies = (unsigned long)
cputime64_to_jiffies64(cur_nice);
- j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
+ j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
idle_time += jiffies_to_usecs(cur_nice_jiffies);
}
@@ -646,10 +646,9 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
&j_dbs_info->prev_cpu_wall);
- if (dbs_tuners_ins.ignore_nice) {
+ if (dbs_tuners_ins.ignore_nice)
j_dbs_info->prev_cpu_nice =
- kstat_cpu(j).cpustat.nice;
- }
+ kstat_cpu(j).cpustat[NICE];
}
this_dbs_info->cpu = cpu;
this_dbs_info->rate_mult = 1;
diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index 2637c13..c80e49a 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -83,11 +83,11 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu)
{
cputime64_t retval;
- retval = cputime64_add(kstat_cpu(cpu).cpustat.idle,
- kstat_cpu(cpu).cpustat.iowait);
+ retval = cputime64_add(kstat_cpu(cpu).cpustat[IDLE],
+ kstat_cpu(cpu).cpustat[IOWAIT]);
if (rackmeter_ignore_nice)
- retval = cputime64_add(retval, kstat_cpu(cpu).cpustat.nice);
+ retval = cputime64_add(retval, kstat_cpu(cpu).cpustat[NICE]);
return retval;
}
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 42b274d..b7b74ad 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -22,29 +22,27 @@
#define arch_idle_time(cpu) 0
#endif
-static cputime64_t get_idle_time(int cpu)
+static u64 get_idle_time(int cpu)
{
- u64 idle_time = get_cpu_idle_time_us(cpu, NULL);
- cputime64_t idle;
+ u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
if (idle_time == -1ULL) {
/* !NO_HZ so we can rely on cpustat.idle */
- idle = kstat_cpu(cpu).cpustat.idle;
- idle = cputime64_add(idle, arch_idle_time(cpu));
+ idle = kstat_cpu(cpu).cpustat[IDLE];
+ idle += arch_idle_time(cpu);
} else
idle = usecs_to_cputime(idle_time);
return idle;
}
-static cputime64_t get_iowait_time(int cpu)
+static u64 get_iowait_time(int cpu)
{
- u64 iowait_time = get_cpu_iowait_time_us(cpu, NULL);
- cputime64_t iowait;
+ u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
if (iowait_time == -1ULL)
/* !NO_HZ so we can rely on cpustat.iowait */
- iowait = kstat_cpu(cpu).cpustat.iowait;
+ iowait = kstat_cpu(cpu).cpustat[IOWAIT];
else
iowait = usecs_to_cputime(iowait_time);
@@ -55,33 +53,30 @@ static int show_stat(struct seq_file *p, void *v)
{
int i, j;
unsigned long jif;
- cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;
- cputime64_t guest, guest_nice;
+ u64 user, nice, system, idle, iowait, irq, softirq, steal;
+ u64 guest, guest_nice;
u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
struct timespec boottime;
user = nice = system = idle = iowait =
- irq = softirq = steal = cputime64_zero;
- guest = guest_nice = cputime64_zero;
+ irq = softirq = steal = 0;
+ guest = guest_nice = 0;
getboottime(&boottime);
jif = boottime.tv_sec;
for_each_possible_cpu(i) {
- user = cputime64_add(user, kstat_cpu(i).cpustat.user);
- nice = cputime64_add(nice, kstat_cpu(i).cpustat.nice);
- system = cputime64_add(system, kstat_cpu(i).cpustat.system);
- idle = cputime64_add(idle, get_idle_time(i));
- iowait = cputime64_add(iowait, get_iowait_time(i));
- irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
- softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
- steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
- guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest);
- guest_nice = cputime64_add(guest_nice,
- kstat_cpu(i).cpustat.guest_nice);
- sum += kstat_cpu_irqs_sum(i);
- sum += arch_irq_stat_cpu(i);
+ user += kstat_cpu(i).cpustat[USER];
+ nice += kstat_cpu(i).cpustat[NICE];
+ system += kstat_cpu(i).cpustat[SYSTEM];
+ idle += get_idle_time(i);
+ iowait += get_iowait_time(i);
+ irq += kstat_cpu(i).cpustat[IRQ];
+ softirq += kstat_cpu(i).cpustat[SOFTIRQ];
+ steal += kstat_cpu(i).cpustat[STEAL];
+ guest += kstat_cpu(i).cpustat[GUEST];
+ guest_nice += kstat_cpu(i).cpustat[GUEST_NICE];
for (j = 0; j < NR_SOFTIRQS; j++) {
unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
@@ -106,16 +101,16 @@ static int show_stat(struct seq_file *p, void *v)
(unsigned long long)cputime64_to_clock_t(guest_nice));
for_each_online_cpu(i) {
/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
- user = kstat_cpu(i).cpustat.user;
- nice = kstat_cpu(i).cpustat.nice;
- system = kstat_cpu(i).cpustat.system;
+ user = kstat_cpu(i).cpustat[USER];
+ nice = kstat_cpu(i).cpustat[NICE];
+ system = kstat_cpu(i).cpustat[SYSTEM];
idle = get_idle_time(i);
iowait = get_iowait_time(i);
- irq = kstat_cpu(i).cpustat.irq;
- softirq = kstat_cpu(i).cpustat.softirq;
- steal = kstat_cpu(i).cpustat.steal;
- guest = kstat_cpu(i).cpustat.guest;
- guest_nice = kstat_cpu(i).cpustat.guest_nice;
+ irq = kstat_cpu(i).cpustat[IRQ];
+ softirq = kstat_cpu(i).cpustat[SOFTIRQ];
+ steal = kstat_cpu(i).cpustat[STEAL];
+ guest = kstat_cpu(i).cpustat[GUEST];
+ guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
seq_printf(p,
"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
"%llu\n",
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index 766b1d4..76737bc 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -12,10 +12,10 @@ static int uptime_proc_show(struct seq_file *m, void *v)
struct timespec uptime;
struct timespec idle;
int i;
- cputime_t idletime = cputime_zero;
+ u64 idletime = 0;
for_each_possible_cpu(i)
- idletime = cputime64_add(idletime, kstat_cpu(i).cpustat.idle);
+ idletime += kstat_cpu(i).cpustat[IDLE];
do_posix_clock_monotonic_gettime(&uptime);
monotonic_to_bootbased(&uptime);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 0cce2db..7bfd0fe 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -6,6 +6,7 @@
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/interrupt.h>
+#include <linux/sched.h>
#include <asm/irq.h>
#include <asm/cputime.h>
@@ -15,21 +16,22 @@
* used by rstatd/perfmeter
*/
-struct cpu_usage_stat {
- cputime64_t user;
- cputime64_t nice;
- cputime64_t system;
- cputime64_t softirq;
- cputime64_t irq;
- cputime64_t idle;
- cputime64_t iowait;
- cputime64_t steal;
- cputime64_t guest;
- cputime64_t guest_nice;
+enum cpu_usage_stat {
+ USER,
+ NICE,
+ SYSTEM,
+ SOFTIRQ,
+ IRQ,
+ IDLE,
+ IOWAIT,
+ STEAL,
+ GUEST,
+ GUEST_NICE,
+ NR_STATS,
};
struct kernel_stat {
- struct cpu_usage_stat cpustat;
+ u64 cpustat[NR_STATS];
#ifndef CONFIG_GENERIC_HARDIRQS
unsigned int irqs[NR_IRQS];
#endif
@@ -39,9 +41,9 @@ struct kernel_stat {
DECLARE_PER_CPU(struct kernel_stat, kstat);
-#define kstat_cpu(cpu) per_cpu(kstat, cpu)
/* Must have preemption disabled for this to be meaningful. */
-#define kstat_this_cpu __get_cpu_var(kstat)
+#define kstat_this_cpu (&__get_cpu_var(kstat))
+#define kstat_cpu(cpu) per_cpu(kstat, cpu)
extern unsigned long long nr_context_switches(void);
diff --git a/kernel/sched.c b/kernel/sched.c
index 594ea22..7ac5aa6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2158,14 +2158,14 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
static int irqtime_account_hi_update(void)
{
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+ u64 *cpustat = kstat_this_cpu->cpustat;
unsigned long flags;
u64 latest_ns;
int ret = 0;
local_irq_save(flags);
latest_ns = this_cpu_read(cpu_hardirq_time);
- if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->irq))
+ if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[IRQ]))
ret = 1;
local_irq_restore(flags);
return ret;
@@ -2173,14 +2173,14 @@ static int irqtime_account_hi_update(void)
static int irqtime_account_si_update(void)
{
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+ u64 *cpustat = kstat_this_cpu->cpustat;
unsigned long flags;
u64 latest_ns;
int ret = 0;
local_irq_save(flags);
latest_ns = this_cpu_read(cpu_softirq_time);
- if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->softirq))
+ if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[SOFTIRQ]))
ret = 1;
local_irq_restore(flags);
return ret;
@@ -3866,8 +3866,8 @@ unsigned long long task_sched_runtime(struct task_struct *p)
void account_user_time(struct task_struct *p, cputime_t cputime,
cputime_t cputime_scaled)
{
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
- cputime64_t tmp;
+ u64 *cpustat = kstat_this_cpu->cpustat;
+ u64 tmp;
/* Add user time to process. */
p->utime = cputime_add(p->utime, cputime);
@@ -3876,10 +3876,11 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
/* Add user time to cpustat. */
tmp = cputime_to_cputime64(cputime);
+
if (TASK_NICE(p) > 0)
- cpustat->nice = cputime64_add(cpustat->nice, tmp);
+ cpustat[NICE] += tmp;
else
- cpustat->user = cputime64_add(cpustat->user, tmp);
+ cpustat[USER] += tmp;
cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
/* Account for user time used */
@@ -3895,8 +3896,8 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
static void account_guest_time(struct task_struct *p, cputime_t cputime,
cputime_t cputime_scaled)
{
- cputime64_t tmp;
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+ u64 tmp;
+ u64 *cpustat = kstat_this_cpu->cpustat;
tmp = cputime_to_cputime64(cputime);
@@ -3908,11 +3909,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
/* Add guest time to cpustat. */
if (TASK_NICE(p) > 0) {
- cpustat->nice = cputime64_add(cpustat->nice, tmp);
- cpustat->guest_nice = cputime64_add(cpustat->guest_nice, tmp);
+ cpustat[NICE] += tmp;
+ cpustat[GUEST_NICE] += tmp;
} else {
- cpustat->user = cputime64_add(cpustat->user, tmp);
- cpustat->guest = cputime64_add(cpustat->guest, tmp);
+ cpustat[USER] += tmp;
+ cpustat[GUEST] += tmp;
}
}
@@ -3925,9 +3926,9 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
*/
static inline
void __account_system_time(struct task_struct *p, cputime_t cputime,
- cputime_t cputime_scaled, cputime64_t *target_cputime64)
+ cputime_t cputime_scaled, u64 *target_cputime64)
{
- cputime64_t tmp = cputime_to_cputime64(cputime);
+ u64 tmp = cputime_to_cputime64(cputime);
/* Add system time to process. */
p->stime = cputime_add(p->stime, cputime);
@@ -3935,7 +3936,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
account_group_system_time(p, cputime);
/* Add system time to cpustat. */
- *target_cputime64 = cputime64_add(*target_cputime64, tmp);
+ *target_cputime64 += tmp;
cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
/* Account for system time used */
@@ -3952,8 +3953,8 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
void account_system_time(struct task_struct *p, int hardirq_offset,
cputime_t cputime, cputime_t cputime_scaled)
{
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
- cputime64_t *target_cputime64;
+ u64 *cpustat = kstat_this_cpu->cpustat;
+ u64 *target_cputime64;
if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
account_guest_time(p, cputime, cputime_scaled);
@@ -3961,11 +3962,11 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
}
if (hardirq_count() - hardirq_offset)
- target_cputime64 = &cpustat->irq;
+ target_cputime64 = &cpustat[IRQ];
else if (in_serving_softirq())
- target_cputime64 = &cpustat->softirq;
+ target_cputime64 = &cpustat[SOFTIRQ];
else
- target_cputime64 = &cpustat->system;
+ target_cputime64 = &cpustat[SYSTEM];
__account_system_time(p, cputime, cputime_scaled, target_cputime64);
}
@@ -3976,10 +3977,10 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
*/
void account_steal_time(cputime_t cputime)
{
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
- cputime64_t cputime64 = cputime_to_cputime64(cputime);
+ u64 *cpustat = kstat_this_cpu->cpustat;
+ u64 cputime64 = cputime_to_cputime64(cputime);
- cpustat->steal = cputime64_add(cpustat->steal, cputime64);
+ cpustat[STEAL] += cputime64;
}
/*
@@ -3988,14 +3989,14 @@ void account_steal_time(cputime_t cputime)
*/
void account_idle_time(cputime_t cputime)
{
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
- cputime64_t cputime64 = cputime_to_cputime64(cputime);
+ u64 *cpustat = kstat_this_cpu->cpustat;
+ u64 cputime64 = cputime_to_cputime64(cputime);
struct rq *rq = this_rq();
if (atomic_read(&rq->nr_iowait) > 0)
- cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
+ cpustat[IOWAIT] += cputime64;
else
- cpustat->idle = cputime64_add(cpustat->idle, cputime64);
+ cpustat[IDLE] += cputime64;
}
static __always_inline bool steal_account_process_tick(void)
@@ -4045,16 +4046,16 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
struct rq *rq)
{
cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
- cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+ u64 tmp = cputime_to_cputime64(cputime_one_jiffy);
+ u64 *cpustat = kstat_this_cpu->cpustat;
if (steal_account_process_tick())
return;
if (irqtime_account_hi_update()) {
- cpustat->irq = cputime64_add(cpustat->irq, tmp);
+ cpustat[IRQ] += tmp;
} else if (irqtime_account_si_update()) {
- cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
+ cpustat[SOFTIRQ] += tmp;
} else if (this_cpu_ksoftirqd() == p) {
/*
* ksoftirqd time do not get accounted in cpu_softirq_time.
@@ -4062,7 +4063,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
* Also, p->stime needs to be updated for ksoftirqd.
*/
__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
- &cpustat->softirq);
+ &cpustat[SOFTIRQ]);
} else if (user_tick) {
account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
} else if (p == rq->idle) {
@@ -4071,7 +4072,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
} else {
__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
- &cpustat->system);
+ &cpustat[SYSTEM]);
}
}
--
1.7.6.4
* [PATCH 2/4] split kernel stat in two
From: Glauber Costa @ 2011-11-15 15:59 UTC
To: linux-kernel
Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
cgroups, Glauber Costa
In a later patch, we will use the cpustat information per task group.
However, some of its fields are naturally global, such as the irq
counters, and there is no need to impose the task-group overhead on
them. So it is better to separate the two.
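Conceptually, the result of the split is the following (a sketch; the
irq counter members are elided here, see the kernel_stat.h hunk below
for the real definitions):

/* Usage counters: the part a later patch will make per task group. */
struct kernel_cpustat {
	u64 cpustat[NR_STATS];
};

/* struct kernel_stat keeps only the naturally global irq counters,
 * still reached via kstat_cpu(); kcpustat_cpu() is introduced for
 * the usage counters. */
#define kstat_cpu(cpu)		per_cpu(kstat, cpu)
#define kcpustat_cpu(cpu)	per_cpu(kernel_cpustat, cpu)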
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Paul Turner <pjt@google.com>
---
arch/s390/appldata/appldata_os.c | 16 +++++++-------
arch/x86/include/asm/i387.h | 2 +-
drivers/cpufreq/cpufreq_conservative.c | 20 ++++++++--------
drivers/cpufreq/cpufreq_ondemand.c | 20 ++++++++--------
drivers/macintosh/rack-meter.c | 6 ++--
fs/proc/stat.c | 36 ++++++++++++++++----------------
include/linux/kernel_stat.h | 8 ++++++-
kernel/sched.c | 18 ++++++++-------
8 files changed, 67 insertions(+), 59 deletions(-)
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 3d6b672..695388a 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -115,21 +115,21 @@ static void appldata_get_os_data(void *data)
j = 0;
for_each_online_cpu(i) {
os_data->os_cpu[j].per_cpu_user =
- cputime_to_jiffies(kstat_cpu(i).cpustat[USER]);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[USER]);
os_data->os_cpu[j].per_cpu_nice =
- cputime_to_jiffies(kstat_cpu(i).cpustat[NICE]);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[NICE]);
os_data->os_cpu[j].per_cpu_system =
- cputime_to_jiffies(kstat_cpu(i).cpustat[SYSTEM]);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[SYSTEM]);
os_data->os_cpu[j].per_cpu_idle =
- cputime_to_jiffies(kstat_cpu(i).cpustat[IDLE]);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[IDLE]);
os_data->os_cpu[j].per_cpu_irq =
- cputime_to_jiffies(kstat_cpu(i).cpustat[IRQ]);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[IRQ]);
os_data->os_cpu[j].per_cpu_softirq =
- cputime_to_jiffies(kstat_cpu(i).cpustat[SOFTIRQ]);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[SOFTIRQ]);
os_data->os_cpu[j].per_cpu_iowait =
- cputime_to_jiffies(kstat_cpu(i).cpustat[IOWAIT]);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[IOWAIT]);
os_data->os_cpu[j].per_cpu_steal =
- cputime_to_jiffies(kstat_cpu(i).cpustat[STEAL]);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[STEAL]);
os_data->os_cpu[j].cpu_id = i;
j++;
}
diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 56fa4d7..1f1b536 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -218,7 +218,7 @@ static inline void fpu_fxsave(struct fpu *fpu)
#ifdef CONFIG_SMP
#define safe_address (__per_cpu_offset[0])
#else
-#define safe_address (kstat_cpu(0).cpustat[USER])
+#define safe_address (__get_cpu_var(kernel_cpustat).cpustat[USER])
#endif
/*
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index 2ab538f..a3a739f 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -103,13 +103,13 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
cputime64_t busy_time;
cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
- busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
- kstat_cpu(cpu).cpustat[SYSTEM]);
+ busy_time = cputime64_add(kcpustat_cpu(cpu).cpustat[USER],
+ kcpustat_cpu(cpu).cpustat[SYSTEM]);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
+ busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[IRQ]);
+ busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[SOFTIRQ]);
+ busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[STEAL]);
+ busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[NICE]);
idle_time = cputime64_sub(cur_wall_time, busy_time);
if (wall)
@@ -272,7 +272,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
&dbs_info->prev_cpu_wall);
if (dbs_tuners_ins.ignore_nice)
- dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
+ dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
}
return count;
}
@@ -365,7 +365,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
cputime64_t cur_nice;
unsigned long cur_nice_jiffies;
- cur_nice = cputime64_sub(kstat_cpu(j).cpustat[NICE],
+ cur_nice = cputime64_sub(kcpustat_cpu(j).cpustat[NICE],
j_dbs_info->prev_cpu_nice);
/*
* Assumption: nice time between sampling periods will
@@ -374,7 +374,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
cur_nice_jiffies = (unsigned long)
cputime64_to_jiffies64(cur_nice);
- j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
+ j_dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
idle_time += jiffies_to_usecs(cur_nice_jiffies);
}
@@ -503,7 +503,7 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
&j_dbs_info->prev_cpu_wall);
if (dbs_tuners_ins.ignore_nice)
j_dbs_info->prev_cpu_nice =
- kstat_cpu(j).cpustat[NICE];
+ kcpustat_cpu(j).cpustat[NICE];
}
this_dbs_info->down_skip = 0;
this_dbs_info->requested_freq = policy->cur;
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 45d8e17..46e89663 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -127,13 +127,13 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
cputime64_t busy_time;
cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
- busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
- kstat_cpu(cpu).cpustat[SYSTEM]);
+ busy_time = cputime64_add(kcpustat_cpu(cpu).cpustat[USER],
+ kcpustat_cpu(cpu).cpustat[SYSTEM]);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
+ busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[IRQ]);
+ busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[SOFTIRQ]);
+ busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[STEAL]);
+ busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[NICE]);
idle_time = cputime64_sub(cur_wall_time, busy_time);
if (wall)
@@ -345,7 +345,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
&dbs_info->prev_cpu_wall);
if (dbs_tuners_ins.ignore_nice)
- dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
+ dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
}
return count;
@@ -458,7 +458,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
cputime64_t cur_nice;
unsigned long cur_nice_jiffies;
- cur_nice = cputime64_sub(kstat_cpu(j).cpustat[NICE],
+ cur_nice = cputime64_sub(kcpustat_cpu(j).cpustat[NICE],
j_dbs_info->prev_cpu_nice);
/*
* Assumption: nice time between sampling periods will
@@ -467,7 +467,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
cur_nice_jiffies = (unsigned long)
cputime64_to_jiffies64(cur_nice);
- j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
+ j_dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
idle_time += jiffies_to_usecs(cur_nice_jiffies);
}
@@ -648,7 +648,7 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
&j_dbs_info->prev_cpu_wall);
if (dbs_tuners_ins.ignore_nice)
j_dbs_info->prev_cpu_nice =
- kstat_cpu(j).cpustat[NICE];
+ kcpustat_cpu(j).cpustat[NICE];
}
this_dbs_info->cpu = cpu;
this_dbs_info->rate_mult = 1;
diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index c80e49a..c8e67b0 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -83,11 +83,11 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu)
{
cputime64_t retval;
- retval = cputime64_add(kstat_cpu(cpu).cpustat[IDLE],
- kstat_cpu(cpu).cpustat[IOWAIT]);
+ retval = cputime64_add(kcpustat_cpu(cpu).cpustat[IDLE],
+ kcpustat_cpu(cpu).cpustat[IOWAIT]);
if (rackmeter_ignore_nice)
- retval = cputime64_add(retval, kstat_cpu(cpu).cpustat[NICE]);
+ retval = cputime64_add(retval, kcpustat_cpu(cpu).cpustat[NICE]);
return retval;
}
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index b7b74ad..6ab20db 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -28,7 +28,7 @@ static u64 get_idle_time(int cpu)
if (idle_time == -1ULL) {
/* !NO_HZ so we can rely on cpustat.idle */
- idle = kstat_cpu(cpu).cpustat[IDLE];
+ idle = kcpustat_cpu(cpu).cpustat[IDLE];
idle += arch_idle_time(cpu);
} else
idle = usecs_to_cputime(idle_time);
@@ -42,7 +42,7 @@ static u64 get_iowait_time(int cpu)
if (iowait_time == -1ULL)
/* !NO_HZ so we can rely on cpustat.iowait */
- iowait = kstat_cpu(cpu).cpustat[IOWAIT];
+ iowait = kcpustat_cpu(cpu).cpustat[IOWAIT];
else
iowait = usecs_to_cputime(iowait_time);
@@ -67,16 +67,16 @@ static int show_stat(struct seq_file *p, void *v)
jif = boottime.tv_sec;
for_each_possible_cpu(i) {
- user += kstat_cpu(i).cpustat[USER];
- nice += kstat_cpu(i).cpustat[NICE];
- system += kstat_cpu(i).cpustat[SYSTEM];
+ user += kcpustat_cpu(i).cpustat[USER];
+ nice += kcpustat_cpu(i).cpustat[NICE];
+ system += kcpustat_cpu(i).cpustat[SYSTEM];
idle += get_idle_time(i);
iowait += get_iowait_time(i);
- irq += kstat_cpu(i).cpustat[IRQ];
- softirq += kstat_cpu(i).cpustat[SOFTIRQ];
- steal += kstat_cpu(i).cpustat[STEAL];
- guest += kstat_cpu(i).cpustat[GUEST];
- guest_nice += kstat_cpu(i).cpustat[GUEST_NICE];
+ irq += kcpustat_cpu(i).cpustat[IRQ];
+ softirq += kcpustat_cpu(i).cpustat[SOFTIRQ];
+ steal += kcpustat_cpu(i).cpustat[STEAL];
+ guest += kcpustat_cpu(i).cpustat[GUEST];
+ guest_nice += kcpustat_cpu(i).cpustat[GUEST_NICE];
for (j = 0; j < NR_SOFTIRQS; j++) {
unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
@@ -101,16 +101,16 @@ static int show_stat(struct seq_file *p, void *v)
(unsigned long long)cputime64_to_clock_t(guest_nice));
for_each_online_cpu(i) {
/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
- user = kstat_cpu(i).cpustat[USER];
- nice = kstat_cpu(i).cpustat[NICE];
- system = kstat_cpu(i).cpustat[SYSTEM];
+ user = kcpustat_cpu(i).cpustat[USER];
+ nice = kcpustat_cpu(i).cpustat[NICE];
+ system = kcpustat_cpu(i).cpustat[SYSTEM];
idle = get_idle_time(i);
iowait = get_iowait_time(i);
- irq = kstat_cpu(i).cpustat[IRQ];
- softirq = kstat_cpu(i).cpustat[SOFTIRQ];
- steal = kstat_cpu(i).cpustat[STEAL];
- guest = kstat_cpu(i).cpustat[GUEST];
- guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
+ irq = kcpustat_cpu(i).cpustat[IRQ];
+ softirq = kcpustat_cpu(i).cpustat[SOFTIRQ];
+ steal = kcpustat_cpu(i).cpustat[STEAL];
+ guest = kcpustat_cpu(i).cpustat[GUEST];
+ guest_nice = kcpustat_cpu(i).cpustat[GUEST_NICE];
seq_printf(p,
"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
"%llu\n",
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 7bfd0fe..f0e31a9 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -30,8 +30,11 @@ enum cpu_usage_stat {
NR_STATS,
};
-struct kernel_stat {
+struct kernel_cpustat {
u64 cpustat[NR_STATS];
+};
+
+struct kernel_stat {
#ifndef CONFIG_GENERIC_HARDIRQS
unsigned int irqs[NR_IRQS];
#endif
@@ -40,10 +43,13 @@ struct kernel_stat {
};
DECLARE_PER_CPU(struct kernel_stat, kstat);
+DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
/* Must have preemption disabled for this to be meaningful. */
#define kstat_this_cpu (&__get_cpu_var(kstat))
+#define kcpustat_this_cpu (&__get_cpu_var(kernel_cpustat))
#define kstat_cpu(cpu) per_cpu(kstat, cpu)
+#define kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu)
extern unsigned long long nr_context_switches(void);
diff --git a/kernel/sched.c b/kernel/sched.c
index 7ac5aa6..efdd4d8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2158,7 +2158,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
static int irqtime_account_hi_update(void)
{
- u64 *cpustat = kstat_this_cpu->cpustat;
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
unsigned long flags;
u64 latest_ns;
int ret = 0;
@@ -2173,7 +2173,7 @@ static int irqtime_account_hi_update(void)
static int irqtime_account_si_update(void)
{
- u64 *cpustat = kstat_this_cpu->cpustat;
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
unsigned long flags;
u64 latest_ns;
int ret = 0;
@@ -3803,8 +3803,10 @@ unlock:
#endif
DEFINE_PER_CPU(struct kernel_stat, kstat);
+DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
EXPORT_PER_CPU_SYMBOL(kstat);
+EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
/*
* Return any ns on the sched_clock that have not yet been accounted in
@@ -3866,7 +3868,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
void account_user_time(struct task_struct *p, cputime_t cputime,
cputime_t cputime_scaled)
{
- u64 *cpustat = kstat_this_cpu->cpustat;
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
u64 tmp;
/* Add user time to process. */
@@ -3897,7 +3899,7 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
cputime_t cputime_scaled)
{
u64 tmp;
- u64 *cpustat = kstat_this_cpu->cpustat;
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
tmp = cputime_to_cputime64(cputime);
@@ -3953,7 +3955,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
void account_system_time(struct task_struct *p, int hardirq_offset,
cputime_t cputime, cputime_t cputime_scaled)
{
- u64 *cpustat = kstat_this_cpu->cpustat;
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
u64 *target_cputime64;
if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
@@ -3977,7 +3979,7 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
*/
void account_steal_time(cputime_t cputime)
{
- u64 *cpustat = kstat_this_cpu->cpustat;
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
u64 cputime64 = cputime_to_cputime64(cputime);
cpustat[STEAL] += cputime64;
@@ -3989,7 +3991,7 @@ void account_steal_time(cputime_t cputime)
*/
void account_idle_time(cputime_t cputime)
{
- u64 *cpustat = kstat_this_cpu->cpustat;
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
u64 cputime64 = cputime_to_cputime64(cputime);
struct rq *rq = this_rq();
@@ -4047,7 +4049,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
{
cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
u64 tmp = cputime_to_cputime64(cputime_one_jiffy);
- u64 *cpustat = kstat_this_cpu->cpustat;
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
if (steal_account_process_tick())
return;
--
1.7.6.4
* [PATCH 3/4] Keep scheduler statistics per cgroup
From: Glauber Costa @ 2011-11-15 15:59 UTC
To: linux-kernel
Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
cgroups, Glauber Costa
This patch makes the scheduler statistics, such as user ticks and
system ticks, per-cgroup. With this information, we can display the
same data the cpuacct cgroup currently provides, but from within the
normal cpu cgroup.
For all cgroups other than the root, these statistics are collected
only when the top-level file sched_stats is set to 1. This guarantees
that the overhead of the patchset is negligible when we are not
collecting statistics, even with all of the code compiled in.
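The heart of the series is the accounting path: every charge still hits
the global (root) per-cpu cpustat, and the task-group hierarchy is only
walked when the jump label is enabled. Simplified from the sched.c hunk
below:

static inline void task_group_account_field(struct task_struct *p,
					    u64 tmp, int index)
{
	/* The root is always charged, whether or not cgroups are in use. */
	__get_cpu_var(kernel_cpustat).cpustat[index] += tmp;

	if (static_branch(&sched_cgroup_enabled)) {
		struct task_group *tg;

		rcu_read_lock();
		for (tg = task_group(p); tg && tg != &root_task_group;
		     tg = tg->parent)
			this_cpu_ptr(tg->cpustat)->cpustat[index] += tmp;
		rcu_read_unlock();
	}
}

Readers pair this with kstat_lock()/kstat_unlock(), which map to
rcu_read_lock()/rcu_read_unlock() when CONFIG_CGROUP_SCHED is set (and
to nothing otherwise), so a task group's cpustat cannot disappear while
it is being read.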
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Paul Turner <pjt@google.com>
---
arch/s390/appldata/appldata_os.c | 2 +
drivers/cpufreq/cpufreq_conservative.c | 16 +++-
drivers/cpufreq/cpufreq_ondemand.c | 16 +++-
drivers/macintosh/rack-meter.c | 2 +
fs/proc/stat.c | 40 ++++----
fs/proc/uptime.c | 7 +-
include/linux/kernel_stat.h | 20 ++++-
kernel/sched.c | 170 ++++++++++++++++++++++++--------
8 files changed, 207 insertions(+), 66 deletions(-)
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 695388a..0612a7c 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -114,6 +114,7 @@ static void appldata_get_os_data(void *data)
j = 0;
for_each_online_cpu(i) {
+ kstat_lock();
os_data->os_cpu[j].per_cpu_user =
cputime_to_jiffies(kcpustat_cpu(i).cpustat[USER]);
os_data->os_cpu[j].per_cpu_nice =
@@ -131,6 +132,7 @@ static void appldata_get_os_data(void *data)
os_data->os_cpu[j].per_cpu_steal =
cputime_to_jiffies(kcpustat_cpu(i).cpustat[STEAL]);
os_data->os_cpu[j].cpu_id = i;
+ kstat_unlock();
j++;
}
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index a3a739f..ca98530 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -102,6 +102,7 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
cputime64_t cur_wall_time;
cputime64_t busy_time;
+ kstat_lock();
cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
busy_time = cputime64_add(kcpustat_cpu(cpu).cpustat[USER],
kcpustat_cpu(cpu).cpustat[SYSTEM]);
@@ -110,6 +111,7 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[SOFTIRQ]);
busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[STEAL]);
busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[NICE]);
+ kstat_unlock();
idle_time = cputime64_sub(cur_wall_time, busy_time);
if (wall)
@@ -271,8 +273,11 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
dbs_info = &per_cpu(cs_cpu_dbs_info, j);
dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
&dbs_info->prev_cpu_wall);
- if (dbs_tuners_ins.ignore_nice)
+ if (dbs_tuners_ins.ignore_nice) {
+ kstat_lock();
dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
+ kstat_unlock();
+ }
}
return count;
}
@@ -365,8 +370,10 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
cputime64_t cur_nice;
unsigned long cur_nice_jiffies;
+ kstat_lock();
cur_nice = cputime64_sub(kcpustat_cpu(j).cpustat[NICE],
j_dbs_info->prev_cpu_nice);
+ kstat_unlock();
/*
* Assumption: nice time between sampling periods will
* be less than 2^32 jiffies for 32 bit sys
@@ -374,7 +381,9 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
cur_nice_jiffies = (unsigned long)
cputime64_to_jiffies64(cur_nice);
+ kstat_lock();
j_dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
+ kstat_unlock();
idle_time += jiffies_to_usecs(cur_nice_jiffies);
}
@@ -501,9 +510,12 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
&j_dbs_info->prev_cpu_wall);
- if (dbs_tuners_ins.ignore_nice)
+ if (dbs_tuners_ins.ignore_nice) {
+ kstat_lock();
j_dbs_info->prev_cpu_nice =
kcpustat_cpu(j).cpustat[NICE];
+ kstat_unlock();
+ }
}
this_dbs_info->down_skip = 0;
this_dbs_info->requested_freq = policy->cur;
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 46e89663..4076453 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -126,6 +126,7 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
cputime64_t cur_wall_time;
cputime64_t busy_time;
+ kstat_lock();
cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
busy_time = cputime64_add(kcpustat_cpu(cpu).cpustat[USER],
kcpustat_cpu(cpu).cpustat[SYSTEM]);
@@ -134,6 +135,7 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[SOFTIRQ]);
busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[STEAL]);
busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[NICE]);
+ kstat_unlock();
idle_time = cputime64_sub(cur_wall_time, busy_time);
if (wall)
@@ -344,8 +346,11 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
dbs_info = &per_cpu(od_cpu_dbs_info, j);
dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
&dbs_info->prev_cpu_wall);
- if (dbs_tuners_ins.ignore_nice)
+ if (dbs_tuners_ins.ignore_nice) {
+ kstat_lock();
dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
+ kstat_unlock();
+ }
}
return count;
@@ -458,8 +463,10 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
cputime64_t cur_nice;
unsigned long cur_nice_jiffies;
+ kstat_lock();
cur_nice = cputime64_sub(kcpustat_cpu(j).cpustat[NICE],
j_dbs_info->prev_cpu_nice);
+ kstat_unlock();
/*
* Assumption: nice time between sampling periods will
* be less than 2^32 jiffies for 32 bit sys
@@ -467,7 +474,9 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
cur_nice_jiffies = (unsigned long)
cputime64_to_jiffies64(cur_nice);
+ kstat_lock();
j_dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
+ kstat_unlock();
idle_time += jiffies_to_usecs(cur_nice_jiffies);
}
@@ -646,9 +655,12 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
&j_dbs_info->prev_cpu_wall);
- if (dbs_tuners_ins.ignore_nice)
+ if (dbs_tuners_ins.ignore_nice) {
+ kstat_lock();
j_dbs_info->prev_cpu_nice =
kcpustat_cpu(j).cpustat[NICE];
+ kstat_unlock();
+ }
}
this_dbs_info->cpu = cpu;
this_dbs_info->rate_mult = 1;
diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index c8e67b0..196244f 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -83,11 +83,13 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu)
{
cputime64_t retval;
+ kstat_lock();
retval = cputime64_add(kcpustat_cpu(cpu).cpustat[IDLE],
kcpustat_cpu(cpu).cpustat[IOWAIT]);
if (rackmeter_ignore_nice)
retval = cputime64_add(retval, kcpustat_cpu(cpu).cpustat[NICE]);
+ kstat_unlock();
return retval;
}
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 6ab20db..ee01403 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -28,7 +28,7 @@ static u64 get_idle_time(int cpu)
if (idle_time == -1ULL) {
/* !NO_HZ so we can rely on cpustat.idle */
- idle = kcpustat_cpu(cpu).cpustat[IDLE];
+ idle = root_kcpustat_cpu(cpu).cpustat[IDLE];
idle += arch_idle_time(cpu);
} else
idle = usecs_to_cputime(idle_time);
@@ -42,7 +42,7 @@ static u64 get_iowait_time(int cpu)
if (iowait_time == -1ULL)
/* !NO_HZ so we can rely on cpustat.iowait */
- iowait = kcpustat_cpu(cpu).cpustat[IOWAIT];
+ iowait = root_kcpustat_cpu(cpu).cpustat[IOWAIT];
else
iowait = usecs_to_cputime(iowait_time);
@@ -67,16 +67,18 @@ static int show_stat(struct seq_file *p, void *v)
jif = boottime.tv_sec;
for_each_possible_cpu(i) {
- user += kcpustat_cpu(i).cpustat[USER];
- nice += kcpustat_cpu(i).cpustat[NICE];
- system += kcpustat_cpu(i).cpustat[SYSTEM];
+ kstat_lock();
+ user += root_kcpustat_cpu(i).cpustat[USER];
+ nice += root_kcpustat_cpu(i).cpustat[NICE];
+ system += root_kcpustat_cpu(i).cpustat[SYSTEM];
idle += get_idle_time(i);
iowait += get_iowait_time(i);
- irq += kcpustat_cpu(i).cpustat[IRQ];
- softirq += kcpustat_cpu(i).cpustat[SOFTIRQ];
- steal += kcpustat_cpu(i).cpustat[STEAL];
- guest += kcpustat_cpu(i).cpustat[GUEST];
- guest_nice += kcpustat_cpu(i).cpustat[GUEST_NICE];
+ irq += root_kcpustat_cpu(i).cpustat[IRQ];
+ softirq += root_kcpustat_cpu(i).cpustat[SOFTIRQ];
+ steal += root_kcpustat_cpu(i).cpustat[STEAL];
+ guest += root_kcpustat_cpu(i).cpustat[GUEST];
+ guest_nice += root_kcpustat_cpu(i).cpustat[GUEST_NICE];
+ kstat_unlock();
for (j = 0; j < NR_SOFTIRQS; j++) {
unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
@@ -100,17 +102,19 @@ static int show_stat(struct seq_file *p, void *v)
(unsigned long long)cputime64_to_clock_t(guest),
(unsigned long long)cputime64_to_clock_t(guest_nice));
for_each_online_cpu(i) {
+ kstat_lock();
/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
- user = kcpustat_cpu(i).cpustat[USER];
- nice = kcpustat_cpu(i).cpustat[NICE];
- system = kcpustat_cpu(i).cpustat[SYSTEM];
+ user = root_kcpustat_cpu(i).cpustat[USER];
+ nice = root_kcpustat_cpu(i).cpustat[NICE];
+ system = root_kcpustat_cpu(i).cpustat[SYSTEM];
idle = get_idle_time(i);
iowait = get_iowait_time(i);
- irq = kcpustat_cpu(i).cpustat[IRQ];
- softirq = kcpustat_cpu(i).cpustat[SOFTIRQ];
- steal = kcpustat_cpu(i).cpustat[STEAL];
- guest = kcpustat_cpu(i).cpustat[GUEST];
- guest_nice = kcpustat_cpu(i).cpustat[GUEST_NICE];
+ irq = root_kcpustat_cpu(i).cpustat[IRQ];
+ softirq = root_kcpustat_cpu(i).cpustat[SOFTIRQ];
+ steal = root_kcpustat_cpu(i).cpustat[STEAL];
+ guest = root_kcpustat_cpu(i).cpustat[GUEST];
+ guest_nice = root_kcpustat_cpu(i).cpustat[GUEST_NICE];
+ kstat_unlock();
seq_printf(p,
"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
"%llu\n",
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index 76737bc..acb9ba8 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -14,8 +14,11 @@ static int uptime_proc_show(struct seq_file *m, void *v)
int i;
u64 idletime = 0;
- for_each_possible_cpu(i)
- idletime += kstat_cpu(i).cpustat[IDLE];
+ for_each_possible_cpu(i) {
+ kstat_lock();
+ idletime += kcpustat_cpu(i).cpustat[IDLE];
+ kstat_unlock();
+ }
do_posix_clock_monotonic_gettime(&uptime);
monotonic_to_bootbased(&uptime);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index f0e31a9..4c8ff41 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -45,11 +45,27 @@ struct kernel_stat {
DECLARE_PER_CPU(struct kernel_stat, kstat);
DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
-/* Must have preemption disabled for this to be meaningful. */
#define kstat_this_cpu (&__get_cpu_var(kstat))
-#define kcpustat_this_cpu (&__get_cpu_var(kernel_cpustat))
#define kstat_cpu(cpu) per_cpu(kstat, cpu)
+
+#ifdef CONFIG_CGROUP_SCHED
+struct kernel_cpustat *task_group_kstat(struct task_struct *p);
+
+#define kcpustat_this_cpu this_cpu_ptr(task_group_kstat(current))
+#define kcpustat_cpu(cpu) (*per_cpu_ptr(task_group_kstat(current), cpu))
+#define kstat_lock() rcu_read_lock()
+#define kstat_unlock() rcu_read_unlock()
+#else
+#define kcpustat_this_cpu (&__get_cpu_var(kernel_cpustat))
#define kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu)
+#define kstat_lock()
+#define kstat_unlock()
+#endif
+/*
+ * This makes sure the root cgroup is the one we read from when cpu
+ * cgroup is on, and is just equivalent to kcpustat_cpu when it is off
+ */
+#define root_kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu)
extern unsigned long long nr_context_switches(void);
diff --git a/kernel/sched.c b/kernel/sched.c
index efdd4d8..934f631 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -301,6 +301,7 @@ struct task_group {
#endif
struct cfs_bandwidth cfs_bandwidth;
+ struct kernel_cpustat __percpu *cpustat;
};
/* task_group_lock serializes the addition/removal of task groups */
@@ -740,6 +741,12 @@ static inline int cpu_of(struct rq *rq)
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() (&__raw_get_cpu_var(runqueues))
+DEFINE_PER_CPU(struct kernel_stat, kstat);
+DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
+
+EXPORT_PER_CPU_SYMBOL(kstat);
+EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
+
#ifdef CONFIG_CGROUP_SCHED
/*
@@ -763,6 +770,21 @@ static inline struct task_group *task_group(struct task_struct *p)
return autogroup_task_group(p, tg);
}
+static struct jump_label_key sched_cgroup_enabled;
+static int sched_has_sched_stats = 0;
+
+struct kernel_cpustat *task_group_kstat(struct task_struct *p)
+{
+ if (static_branch(&sched_cgroup_enabled)) {
+ struct task_group *tg;
+ tg = task_group(p);
+ return tg->cpustat;
+ }
+
+ return &kernel_cpustat;
+}
+EXPORT_SYMBOL(task_group_kstat);
+
/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
@@ -784,9 +806,36 @@ static inline struct task_group *task_group(struct task_struct *p)
{
return NULL;
}
-
#endif /* CONFIG_CGROUP_SCHED */
+static inline void task_group_account_field(struct task_struct *p,
+ u64 tmp, int index)
+{
+ /*
+ * Since all updates are sure to touch the root cgroup, we
+ * get ourselves ahead and touch it first. If the root cgroup
+ * is the only cgroup, then nothing else should be necessary.
+ *
+ */
+ __get_cpu_var(kernel_cpustat).cpustat[index] += tmp;
+
+#ifdef CONFIG_CGROUP_SCHED
+ if (static_branch(&sched_cgroup_enabled)) {
+ struct kernel_cpustat *kcpustat;
+ struct task_group *tg;
+
+ rcu_read_lock();
+ tg = task_group(p);
+ while (tg && (tg != &root_task_group)) {
+ kcpustat = this_cpu_ptr(tg->cpustat);
+ kcpustat->cpustat[index] += tmp;
+ tg = tg->parent;
+ }
+ rcu_read_unlock();
+ }
+#endif
+}
+
static void update_rq_clock_task(struct rq *rq, s64 delta);
static void update_rq_clock(struct rq *rq)
@@ -2158,30 +2207,36 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
static int irqtime_account_hi_update(void)
{
- u64 *cpustat = kcpustat_this_cpu->cpustat;
unsigned long flags;
u64 latest_ns;
+ u64 *cpustat;
int ret = 0;
local_irq_save(flags);
latest_ns = this_cpu_read(cpu_hardirq_time);
+ kstat_lock();
+ cpustat = kcpustat_this_cpu->cpustat;
if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[IRQ]))
ret = 1;
+ kstat_unlock();
local_irq_restore(flags);
return ret;
}
static int irqtime_account_si_update(void)
{
- u64 *cpustat = kcpustat_this_cpu->cpustat;
+ u64 *cpustat;
unsigned long flags;
u64 latest_ns;
int ret = 0;
local_irq_save(flags);
latest_ns = this_cpu_read(cpu_softirq_time);
+ kstat_lock();
+ cpustat = kcpustat_this_cpu->cpustat;
if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[SOFTIRQ]))
ret = 1;
+ kstat_unlock();
local_irq_restore(flags);
return ret;
}
@@ -3802,12 +3857,6 @@ unlock:
#endif
-DEFINE_PER_CPU(struct kernel_stat, kstat);
-DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
-
-EXPORT_PER_CPU_SYMBOL(kstat);
-EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
-
/*
* Return any ns on the sched_clock that have not yet been accounted in
* @p in case that task is currently running.
@@ -3868,7 +3917,6 @@ unsigned long long task_sched_runtime(struct task_struct *p)
void account_user_time(struct task_struct *p, cputime_t cputime,
cputime_t cputime_scaled)
{
- u64 *cpustat = kcpustat_this_cpu->cpustat;
u64 tmp;
/* Add user time to process. */
@@ -3880,9 +3928,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
tmp = cputime_to_cputime64(cputime);
if (TASK_NICE(p) > 0)
- cpustat[NICE] += tmp;
+ task_group_account_field(p, tmp, NICE);
else
- cpustat[USER] += tmp;
+ task_group_account_field(p, tmp, USER);
cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
/* Account for user time used */
@@ -3899,7 +3947,6 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
cputime_t cputime_scaled)
{
u64 tmp;
- u64 *cpustat = kcpustat_this_cpu->cpustat;
tmp = cputime_to_cputime64(cputime);
@@ -3911,11 +3958,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
/* Add guest time to cpustat. */
if (TASK_NICE(p) > 0) {
- cpustat[NICE] += tmp;
- cpustat[GUEST_NICE] += tmp;
+ task_group_account_field(p, tmp, NICE);
+ task_group_account_field(p, tmp, GUEST_NICE);
} else {
- cpustat[USER] += tmp;
- cpustat[GUEST] += tmp;
+ task_group_account_field(p, tmp, USER);
+ task_group_account_field(p, tmp, GUEST);
}
}
@@ -3928,7 +3975,7 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
*/
static inline
void __account_system_time(struct task_struct *p, cputime_t cputime,
- cputime_t cputime_scaled, u64 *target_cputime64)
+ cputime_t cputime_scaled, int index)
{
u64 tmp = cputime_to_cputime64(cputime);
@@ -3938,7 +3985,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
account_group_system_time(p, cputime);
/* Add system time to cpustat. */
- *target_cputime64 += tmp;
+ task_group_account_field(p, tmp, index);
cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
/* Account for system time used */
@@ -3955,8 +4002,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
void account_system_time(struct task_struct *p, int hardirq_offset,
cputime_t cputime, cputime_t cputime_scaled)
{
- u64 *cpustat = kcpustat_this_cpu->cpustat;
- u64 *target_cputime64;
+ int index;
if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
account_guest_time(p, cputime, cputime_scaled);
@@ -3964,13 +4010,13 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
}
if (hardirq_count() - hardirq_offset)
- target_cputime64 = &cpustat[IRQ];
+ index = IRQ;
else if (in_serving_softirq())
- target_cputime64 = &cpustat[SOFTIRQ];
+ index = SOFTIRQ;
else
- target_cputime64 = &cpustat[SYSTEM];
+ index = SYSTEM;
- __account_system_time(p, cputime, cputime_scaled, target_cputime64);
+ __account_system_time(p, cputime, cputime_scaled, index);
}
/*
@@ -3979,10 +4025,8 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
*/
void account_steal_time(cputime_t cputime)
{
- u64 *cpustat = kcpustat_this_cpu->cpustat;
u64 cputime64 = cputime_to_cputime64(cputime);
-
- cpustat[STEAL] += cputime64;
+ __get_cpu_var(kernel_cpustat).cpustat[STEAL] += cputime64;
}
/*
@@ -3991,14 +4035,19 @@ void account_steal_time(cputime_t cputime)
*/
void account_idle_time(cputime_t cputime)
{
- u64 *cpustat = kcpustat_this_cpu->cpustat;
+ struct kernel_cpustat *kcpustat;
u64 cputime64 = cputime_to_cputime64(cputime);
struct rq *rq = this_rq();
+ kstat_lock();
+ kcpustat = kcpustat_this_cpu;
+
if (atomic_read(&rq->nr_iowait) > 0)
- cpustat[IOWAIT] += cputime64;
+ kcpustat->cpustat[IOWAIT] += cputime64;
else
- cpustat[IDLE] += cputime64;
+ /* idle is always accounted to the root cgroup */
+ __get_cpu_var(kernel_cpustat).cpustat[IDLE] += cputime64;
+ kstat_unlock();
}
static __always_inline bool steal_account_process_tick(void)
@@ -4045,27 +4094,26 @@ static __always_inline bool steal_account_process_tick(void)
* softirq as those do not count in task exec_runtime any more.
*/
static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
- struct rq *rq)
+ struct rq *rq)
{
cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
u64 tmp = cputime_to_cputime64(cputime_one_jiffy);
- u64 *cpustat = kcpustat_this_cpu->cpustat;
if (steal_account_process_tick())
return;
if (irqtime_account_hi_update()) {
- cpustat[IRQ] += tmp;
+ task_group_account_field(p, tmp, IRQ);
} else if (irqtime_account_si_update()) {
- cpustat[SOFTIRQ] += tmp;
+ task_group_account_field(p, tmp, SOFTIRQ);
} else if (this_cpu_ksoftirqd() == p) {
/*
* ksoftirqd time do not get accounted in cpu_softirq_time.
* So, we have to handle it separately here.
* Also, p->stime needs to be updated for ksoftirqd.
*/
- __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
- &cpustat[SOFTIRQ]);
+ __account_system_time(p, cputime_one_jiffy,
+ one_jiffy_scaled, SOFTIRQ);
} else if (user_tick) {
account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
} else if (p == rq->idle) {
@@ -4073,8 +4121,8 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
} else if (p->flags & PF_VCPU) { /* System time or guest time */
account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
} else {
- __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
- &cpustat[SYSTEM]);
+ __account_system_time(p, cputime_one_jiffy,
+ one_jiffy_scaled, SYSTEM);
}
}
@@ -8237,6 +8285,8 @@ void __init sched_init(void)
INIT_LIST_HEAD(&root_task_group.children);
INIT_LIST_HEAD(&root_task_group.siblings);
autogroup_init(&init_task);
+
+ root_task_group.cpustat = &kernel_cpustat;
#endif /* CONFIG_CGROUP_SCHED */
for_each_possible_cpu(i) {
@@ -8674,6 +8724,7 @@ static void free_sched_group(struct task_group *tg)
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
+ free_percpu(tg->cpustat);
kfree(tg);
}
@@ -8693,6 +8744,10 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;
+ tg->cpustat = alloc_percpu(struct kernel_cpustat);
+ if (!tg->cpustat)
+ goto err;
+
spin_lock_irqsave(&task_group_lock, flags);
list_add_rcu(&tg->list, &task_groups);
@@ -9437,6 +9492,23 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
}
#endif /* CONFIG_RT_GROUP_SCHED */
+static u64 cpu_has_sched_stats(struct cgroup *cgrp, struct cftype *cft)
+{
+ return sched_has_sched_stats;
+}
+
+static int cpu_set_sched_stats(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+ if (!val && sched_has_sched_stats)
+ jump_label_dec(&sched_cgroup_enabled);
+
+ if (val && !sched_has_sched_stats)
+ jump_label_inc(&sched_cgroup_enabled);
+
+ sched_has_sched_stats = !!val;
+ return 0;
+}
+
static struct cftype cpu_files[] = {
#ifdef CONFIG_FAIR_GROUP_SCHED
{
@@ -9475,9 +9547,27 @@ static struct cftype cpu_files[] = {
#endif
};
+/*
+ * Files appearing here will be shown at the top level only. Although we could
+ * show them unconditionally and then return an error when read/written from
+ * non-root cgroups, hiding them here is less confusing for users.
+ */
+static struct cftype cpu_root_files[] = {
+ {
+ .name = "sched_stats",
+ .read_u64 = cpu_has_sched_stats,
+ .write_u64 = cpu_set_sched_stats,
+ },
+};
+
static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
{
- return cgroup_add_files(cont, ss, cpu_files, ARRAY_SIZE(cpu_files));
+ int ret;
+ ret = cgroup_add_files(cont, ss, cpu_files, ARRAY_SIZE(cpu_files));
+ if (!ret)
+ ret = cgroup_add_files(cont, ss, cpu_root_files,
+ ARRAY_SIZE(cpu_root_files));
+ return ret;
}
struct cgroup_subsys cpu_cgroup_subsys = {
--
1.7.6.4
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 4/4] provide a version of cpuacct statistics inside cpu cgroup
2011-11-15 15:59 [PATCH 0/4] Provide cpuacct functionality in cpu cgroup Glauber Costa
` (2 preceding siblings ...)
2011-11-15 15:59 ` [PATCH 3/4] Keep scheduler statistics per cgroup Glauber Costa
@ 2011-11-15 15:59 ` Glauber Costa
2011-11-17 7:12 ` Balbir Singh
2011-11-16 0:57 ` [PATCH 0/4] Provide cpuacct functionality in " KAMEZAWA Hiroyuki
4 siblings, 1 reply; 17+ messages in thread
From: Glauber Costa @ 2011-11-15 15:59 UTC (permalink / raw)
To: linux-kernel
Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
cgroups, Glauber Costa, Balbir Singh
For users interested in using the information currently displayed
at cpuacct.stat, we provide it inside the cpu cgroup.
This has the advantage of accounting the information only once,
instead of twice, when there is no need for an independent group
of tasks for control and accounting, which means a lot less overhead.
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Balbir Singh <bsingharora@gmail.com>
---
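For illustration, cpu.acct_stat uses the same key/value map format as
cpuacct.stat, so reading it should produce output along these lines (the
mount point and values here are hypothetical; units are USER_HZ ticks, via
cputime64_to_clock_t):

	# cat /sys/fs/cgroup/cpu/<group>/cpu.acct_stat
	user 4321
	system 1234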
kernel/sched.c | 43 ++++++++++++++++++++++++++++++++++++++-----
1 files changed, 38 insertions(+), 5 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 934f631..99e806b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9151,6 +9151,11 @@ int sched_rt_handler(struct ctl_table *table, int write,
return ret;
}
+static const char *cpuacct_stat_desc[] = {
+ [CPUACCT_STAT_USER] = "user",
+ [CPUACCT_STAT_SYSTEM] = "system",
+};
+
#ifdef CONFIG_CGROUP_SCHED
/* return corresponding task_group object of a cgroup */
@@ -9509,6 +9514,35 @@ static int cpu_set_sched_stats(struct cgroup *cgrp, struct cftype *cft, u64 val)
return 0;
}
+static int cpu_cgroup_stats_show(struct cgroup *cgrp, struct cftype *cft,
+ struct cgroup_map_cb *cb)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+ int cpu;
+ s64 val = 0;
+
+ for_each_present_cpu(cpu) {
+ struct kernel_cpustat *kcpustat = per_cpu_ptr(tg->cpustat, cpu);
+ val += kcpustat->cpustat[USER];
+ val += kcpustat->cpustat[NICE];
+ }
+ val = cputime64_to_clock_t(val);
+ cb->fill(cb, cpuacct_stat_desc[CPUACCT_STAT_USER], val);
+
+ val = 0;
+ for_each_online_cpu(cpu) {
+ struct kernel_cpustat *kcpustat = per_cpu_ptr(tg->cpustat, cpu);
+ val += kcpustat->cpustat[SYSTEM];
+ val += kcpustat->cpustat[IRQ];
+ val += kcpustat->cpustat[SOFTIRQ];
+ }
+
+ val = cputime64_to_clock_t(val);
+ cb->fill(cb, cpuacct_stat_desc[CPUACCT_STAT_SYSTEM], val);
+
+ return 0;
+}
+
static struct cftype cpu_files[] = {
#ifdef CONFIG_FAIR_GROUP_SCHED
{
@@ -9545,6 +9579,10 @@ static struct cftype cpu_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
#endif
+ {
+ .name = "acct_stat",
+ .read_map = cpu_cgroup_stats_show,
+ },
};
/*
@@ -9746,11 +9784,6 @@ static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
return 0;
}
-static const char *cpuacct_stat_desc[] = {
- [CPUACCT_STAT_USER] = "user",
- [CPUACCT_STAT_SYSTEM] = "system",
-};
-
static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
struct cgroup_map_cb *cb)
{
--
1.7.6.4
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 0/4] Provide cpuacct functionality in cpu cgroup
2011-11-15 15:59 [PATCH 0/4] Provide cpuacct functionality in cpu cgroup Glauber Costa
` (3 preceding siblings ...)
2011-11-15 15:59 ` [PATCH 4/4] provide a version of cpuacct statistics inside cpu cgroup Glauber Costa
@ 2011-11-16 0:57 ` KAMEZAWA Hiroyuki
2011-11-23 10:29 ` Glauber Costa
4 siblings, 1 reply; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-16 0:57 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
jbottomley, pjt, cgroups
On Tue, 15 Nov 2011 13:59:13 -0200
Glauber Costa <glommer@parallels.com> wrote:
> Hi,
>
> This is an excerpt of the last patches I sent regarding cpu cgroup.
> It is mostly focused on cleaning up what we have now, so it can
> be considered largely preparation. As a user of the new organization
> of things, I am including cpuacct stats functionality in the end of
> the series. The files related to cpuusage are left to be sent in
> an upcoming series after this one is included.
>
> Let me know if there is anything you'd like me to address.
>
I'm sorry, but let me ask several questions.
Why the cpu cgroup rather than the cpuacct cgroup?
If scheduler stats are reported via the cpu cgroup, that makes sense.
But reporting user accounting information via the cpu cgroup feels strange.
Do you want to make the cpuacct cgroup obsolete and merge the two cgroups?
What's the relationship between these new counters and the cpuacct cgroup?
Won't users be confused?
Please provide documentation.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/4] Change cpustat fields to an array.
2011-11-15 15:59 ` [PATCH 1/4] Change cpustat fields to an array Glauber Costa
@ 2011-11-16 5:58 ` Paul Turner
2011-11-16 11:25 ` Glauber Costa
0 siblings, 1 reply; 17+ messages in thread
From: Paul Turner @ 2011-11-16 5:58 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
jbottomley, cgroups
On 11/15/2011 07:59 AM, Glauber Costa wrote:
> This will give us a bit more flexibility to deal with the
> fields in this structure. This is a preparation patch for
> later patches in this series.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Paul Turner <pjt@google.com>
> ---
> arch/s390/appldata/appldata_os.c | 16 ++++----
> arch/x86/include/asm/i387.h | 2 +-
> drivers/cpufreq/cpufreq_conservative.c | 23 +++++-----
> drivers/cpufreq/cpufreq_ondemand.c | 23 +++++-----
> drivers/macintosh/rack-meter.c | 6 +-
> fs/proc/stat.c | 63 +++++++++++++---------------
> fs/proc/uptime.c | 4 +-
> include/linux/kernel_stat.h | 30 +++++++------
> kernel/sched.c | 71 ++++++++++++++++----------------
> 9 files changed, 117 insertions(+), 121 deletions(-)
>
> diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
> index 92f1cb7..3d6b672 100644
> --- a/arch/s390/appldata/appldata_os.c
> +++ b/arch/s390/appldata/appldata_os.c
> @@ -115,21 +115,21 @@ static void appldata_get_os_data(void *data)
> j = 0;
> for_each_online_cpu(i) {
> os_data->os_cpu[j].per_cpu_user =
> - cputime_to_jiffies(kstat_cpu(i).cpustat.user);
> + cputime_to_jiffies(kstat_cpu(i).cpustat[USER]);
> os_data->os_cpu[j].per_cpu_nice =
> - cputime_to_jiffies(kstat_cpu(i).cpustat.nice);
> + cputime_to_jiffies(kstat_cpu(i).cpustat[NICE]);
> os_data->os_cpu[j].per_cpu_system =
> - cputime_to_jiffies(kstat_cpu(i).cpustat.system);
> + cputime_to_jiffies(kstat_cpu(i).cpustat[SYSTEM]);
> os_data->os_cpu[j].per_cpu_idle =
> - cputime_to_jiffies(kstat_cpu(i).cpustat.idle);
> + cputime_to_jiffies(kstat_cpu(i).cpustat[IDLE]);
> os_data->os_cpu[j].per_cpu_irq =
> - cputime_to_jiffies(kstat_cpu(i).cpustat.irq);
> + cputime_to_jiffies(kstat_cpu(i).cpustat[IRQ]);
> os_data->os_cpu[j].per_cpu_softirq =
> - cputime_to_jiffies(kstat_cpu(i).cpustat.softirq);
> + cputime_to_jiffies(kstat_cpu(i).cpustat[SOFTIRQ]);
> os_data->os_cpu[j].per_cpu_iowait =
> - cputime_to_jiffies(kstat_cpu(i).cpustat.iowait);
> + cputime_to_jiffies(kstat_cpu(i).cpustat[IOWAIT]);
> os_data->os_cpu[j].per_cpu_steal =
> - cputime_to_jiffies(kstat_cpu(i).cpustat.steal);
> + cputime_to_jiffies(kstat_cpu(i).cpustat[STEAL]);
> os_data->os_cpu[j].cpu_id = i;
> j++;
> }
> diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
> index c9e09ea..56fa4d7 100644
> --- a/arch/x86/include/asm/i387.h
> +++ b/arch/x86/include/asm/i387.h
> @@ -218,7 +218,7 @@ static inline void fpu_fxsave(struct fpu *fpu)
> #ifdef CONFIG_SMP
> #define safe_address (__per_cpu_offset[0])
> #else
> -#define safe_address (kstat_cpu(0).cpustat.user)
> +#define safe_address (kstat_cpu(0).cpustat[USER])
> #endif
>
> /*
> diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
> index c97b468..2ab538f 100644
> --- a/drivers/cpufreq/cpufreq_conservative.c
> +++ b/drivers/cpufreq/cpufreq_conservative.c
> @@ -103,13 +103,13 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
> cputime64_t busy_time;
>
> cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
> - busy_time = cputime64_add(kstat_cpu(cpu).cpustat.user,
> - kstat_cpu(cpu).cpustat.system);
> + busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
> + kstat_cpu(cpu).cpustat[SYSTEM]);
>
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.irq);
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.softirq);
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.steal);
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.nice);
> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
>
> idle_time = cputime64_sub(cur_wall_time, busy_time);
> if (wall)
> @@ -272,7 +272,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
> dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
> &dbs_info->prev_cpu_wall);
> if (dbs_tuners_ins.ignore_nice)
> - dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
> + dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
> }
> return count;
> }
> @@ -365,7 +365,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
> cputime64_t cur_nice;
> unsigned long cur_nice_jiffies;
>
> - cur_nice = cputime64_sub(kstat_cpu(j).cpustat.nice,
> + cur_nice = cputime64_sub(kstat_cpu(j).cpustat[NICE],
> j_dbs_info->prev_cpu_nice);
> /*
> * Assumption: nice time between sampling periods will
> @@ -374,7 +374,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
> cur_nice_jiffies = (unsigned long)
> cputime64_to_jiffies64(cur_nice);
>
> - j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
> + j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
> idle_time += jiffies_to_usecs(cur_nice_jiffies);
> }
>
> @@ -501,10 +501,9 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
>
> j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
> &j_dbs_info->prev_cpu_wall);
> - if (dbs_tuners_ins.ignore_nice) {
> + if (dbs_tuners_ins.ignore_nice)
> j_dbs_info->prev_cpu_nice =
> - kstat_cpu(j).cpustat.nice;
> - }
> + kstat_cpu(j).cpustat[NICE];
> }
> this_dbs_info->down_skip = 0;
> this_dbs_info->requested_freq = policy->cur;
> diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
> index fa8af4e..45d8e17 100644
> --- a/drivers/cpufreq/cpufreq_ondemand.c
> +++ b/drivers/cpufreq/cpufreq_ondemand.c
> @@ -127,13 +127,13 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
> cputime64_t busy_time;
>
> cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
> - busy_time = cputime64_add(kstat_cpu(cpu).cpustat.user,
> - kstat_cpu(cpu).cpustat.system);
> + busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
> + kstat_cpu(cpu).cpustat[SYSTEM]);
>
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.irq);
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.softirq);
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.steal);
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.nice);
> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
>
> idle_time = cputime64_sub(cur_wall_time, busy_time);
> if (wall)
> @@ -345,7 +345,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
> dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
> &dbs_info->prev_cpu_wall);
> if (dbs_tuners_ins.ignore_nice)
> - dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
> + dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
>
> }
> return count;
> @@ -458,7 +458,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
> cputime64_t cur_nice;
> unsigned long cur_nice_jiffies;
>
> - cur_nice = cputime64_sub(kstat_cpu(j).cpustat.nice,
> + cur_nice = cputime64_sub(kstat_cpu(j).cpustat[NICE],
> j_dbs_info->prev_cpu_nice);
> /*
> * Assumption: nice time between sampling periods will
> @@ -467,7 +467,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
> cur_nice_jiffies = (unsigned long)
> cputime64_to_jiffies64(cur_nice);
>
> - j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
> + j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
> idle_time += jiffies_to_usecs(cur_nice_jiffies);
> }
>
> @@ -646,10 +646,9 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
>
> j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
> &j_dbs_info->prev_cpu_wall);
> - if (dbs_tuners_ins.ignore_nice) {
> + if (dbs_tuners_ins.ignore_nice)
> j_dbs_info->prev_cpu_nice =
> - kstat_cpu(j).cpustat.nice;
> - }
> + kstat_cpu(j).cpustat[NICE];
> }
> this_dbs_info->cpu = cpu;
> this_dbs_info->rate_mult = 1;
> diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
> index 2637c13..c80e49a 100644
> --- a/drivers/macintosh/rack-meter.c
> +++ b/drivers/macintosh/rack-meter.c
> @@ -83,11 +83,11 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu)
> {
> cputime64_t retval;
>
> - retval = cputime64_add(kstat_cpu(cpu).cpustat.idle,
> - kstat_cpu(cpu).cpustat.iowait);
> + retval = cputime64_add(kstat_cpu(cpu).cpustat[IDLE],
> + kstat_cpu(cpu).cpustat[IOWAIT]);
>
> if (rackmeter_ignore_nice)
> - retval = cputime64_add(retval, kstat_cpu(cpu).cpustat.nice);
> + retval = cputime64_add(retval, kstat_cpu(cpu).cpustat[NICE]);
>
> return retval;
> }
> diff --git a/fs/proc/stat.c b/fs/proc/stat.c
> index 42b274d..b7b74ad 100644
> --- a/fs/proc/stat.c
> +++ b/fs/proc/stat.c
> @@ -22,29 +22,27 @@
> #define arch_idle_time(cpu) 0
> #endif
>
> -static cputime64_t get_idle_time(int cpu)
> +static u64 get_idle_time(int cpu)
> {
> - u64 idle_time = get_cpu_idle_time_us(cpu, NULL);
> - cputime64_t idle;
> + u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
>
> if (idle_time == -1ULL) {
> /* !NO_HZ so we can rely on cpustat.idle */
> - idle = kstat_cpu(cpu).cpustat.idle;
> - idle = cputime64_add(idle, arch_idle_time(cpu));
> + idle = kstat_cpu(cpu).cpustat[IDLE];
> + idle += arch_idle_time(cpu);
> } else
> idle = usecs_to_cputime(idle_time);
>
> return idle;
> }
>
> -static cputime64_t get_iowait_time(int cpu)
> +static u64 get_iowait_time(int cpu)
> {
> - u64 iowait_time = get_cpu_iowait_time_us(cpu, NULL);
> - cputime64_t iowait;
> + u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
>
> if (iowait_time == -1ULL)
> /* !NO_HZ so we can rely on cpustat.iowait */
> - iowait = kstat_cpu(cpu).cpustat.iowait;
> + iowait = kstat_cpu(cpu).cpustat[IOWAIT];
> else
> iowait = usecs_to_cputime(iowait_time);
>
> @@ -55,33 +53,30 @@ static int show_stat(struct seq_file *p, void *v)
> {
> int i, j;
> unsigned long jif;
> - cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;
> - cputime64_t guest, guest_nice;
> + u64 user, nice, system, idle, iowait, irq, softirq, steal;
> + u64 guest, guest_nice;
> u64 sum = 0;
> u64 sum_softirq = 0;
> unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
> struct timespec boottime;
>
> user = nice = system = idle = iowait =
> - irq = softirq = steal = cputime64_zero;
> - guest = guest_nice = cputime64_zero;
> + irq = softirq = steal = 0;
> + guest = guest_nice = 0;
> getboottime(&boottime);
> jif = boottime.tv_sec;
>
> for_each_possible_cpu(i) {
> - user = cputime64_add(user, kstat_cpu(i).cpustat.user);
> - nice = cputime64_add(nice, kstat_cpu(i).cpustat.nice);
> - system = cputime64_add(system, kstat_cpu(i).cpustat.system);
> - idle = cputime64_add(idle, get_idle_time(i));
> - iowait = cputime64_add(iowait, get_iowait_time(i));
> - irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
> - softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
> - steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
> - guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest);
> - guest_nice = cputime64_add(guest_nice,
> - kstat_cpu(i).cpustat.guest_nice);
> - sum += kstat_cpu_irqs_sum(i);
> - sum += arch_irq_stat_cpu(i);
> + user += kstat_cpu(i).cpustat[USER];
Half the time cputime64_add is preserved; half the time this patch converts it
to a naked '+='. Admittedly no one seems to define cputime64_add as anything
other than plain addition, but why the conversion / inconsistency?
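(For reference -- a sketch, not a claim about every architecture: the
asm-generic definition of this era reduces to plain addition,

	#define cputime64_add(__a, __b)	((__a) + (__b))

so with u64 cpustat fields the two spellings behave identically and the
question is purely one of consistency.)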
> + nice += kstat_cpu(i).cpustat[NICE];
> + system += kstat_cpu(i).cpustat[SYSTEM];
> + idle += get_idle_time(i);
> + iowait += get_iowait_time(i);
> + irq += kstat_cpu(i).cpustat[IRQ];
> + softirq += kstat_cpu(i).cpustat[SOFTIRQ];
> + steal += kstat_cpu(i).cpustat[STEAL];
> + guest += kstat_cpu(i).cpustat[GUEST];
> + guest_nice += kstat_cpu(i).cpustat[GUEST_NICE];
>
> for (j = 0; j < NR_SOFTIRQS; j++) {
> unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
> @@ -106,16 +101,16 @@ static int show_stat(struct seq_file *p, void *v)
> (unsigned long long)cputime64_to_clock_t(guest_nice));
> for_each_online_cpu(i) {
> /* Copy values here to work around gcc-2.95.3, gcc-2.96 */
> - user = kstat_cpu(i).cpustat.user;
> - nice = kstat_cpu(i).cpustat.nice;
> - system = kstat_cpu(i).cpustat.system;
> + user = kstat_cpu(i).cpustat[USER];
> + nice = kstat_cpu(i).cpustat[NICE];
> + system = kstat_cpu(i).cpustat[SYSTEM];
> idle = get_idle_time(i);
> iowait = get_iowait_time(i);
> - irq = kstat_cpu(i).cpustat.irq;
> - softirq = kstat_cpu(i).cpustat.softirq;
> - steal = kstat_cpu(i).cpustat.steal;
> - guest = kstat_cpu(i).cpustat.guest;
> - guest_nice = kstat_cpu(i).cpustat.guest_nice;
> + irq = kstat_cpu(i).cpustat[IRQ];
> + softirq = kstat_cpu(i).cpustat[SOFTIRQ];
> + steal = kstat_cpu(i).cpustat[STEAL];
> + guest = kstat_cpu(i).cpustat[GUEST];
> + guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
> seq_printf(p,
> "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
> "%llu\n",
> diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
> index 766b1d4..76737bc 100644
> --- a/fs/proc/uptime.c
> +++ b/fs/proc/uptime.c
> @@ -12,10 +12,10 @@ static int uptime_proc_show(struct seq_file *m, void *v)
> struct timespec uptime;
> struct timespec idle;
> int i;
> - cputime_t idletime = cputime_zero;
> + u64 idletime = 0;
>
> for_each_possible_cpu(i)
> - idletime = cputime64_add(idletime, kstat_cpu(i).cpustat.idle);
> + idletime += kstat_cpu(i).cpustat[IDLE];
>
> do_posix_clock_monotonic_gettime(&uptime);
> monotonic_to_bootbased(&uptime);
> diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
> index 0cce2db..7bfd0fe 100644
> --- a/include/linux/kernel_stat.h
> +++ b/include/linux/kernel_stat.h
> @@ -6,6 +6,7 @@
> #include <linux/percpu.h>
> #include <linux/cpumask.h>
> #include <linux/interrupt.h>
> +#include <linux/sched.h>
> #include <asm/irq.h>
> #include <asm/cputime.h>
>
> @@ -15,21 +16,22 @@
> * used by rstatd/perfmeter
> */
>
> -struct cpu_usage_stat {
> - cputime64_t user;
> - cputime64_t nice;
> - cputime64_t system;
> - cputime64_t softirq;
> - cputime64_t irq;
> - cputime64_t idle;
> - cputime64_t iowait;
> - cputime64_t steal;
> - cputime64_t guest;
> - cputime64_t guest_nice;
> +enum cpu_usage_stat {
> + USER,
> + NICE,
> + SYSTEM,
> + SOFTIRQ,
> + IRQ,
> + IDLE,
> + IOWAIT,
> + STEAL,
> + GUEST,
> + GUEST_NICE,
> + NR_STATS,
> };
I suspect we want a more descriptive prefix here, e.g. CPUTIME_USER
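Something like this, perhaps (a sketch of the suggested renaming only; the
exact names are illustrative):

	enum cpu_usage_stat {
		CPUTIME_USER,
		CPUTIME_NICE,
		CPUTIME_SYSTEM,
		CPUTIME_SOFTIRQ,
		CPUTIME_IRQ,
		CPUTIME_IDLE,
		CPUTIME_IOWAIT,
		CPUTIME_STEAL,
		CPUTIME_GUEST,
		CPUTIME_GUEST_NICE,
		NR_STATS,
	};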
>
> struct kernel_stat {
> - struct cpu_usage_stat cpustat;
> + u64 cpustat[NR_STATS];
> #ifndef CONFIG_GENERIC_HARDIRQS
> unsigned int irqs[NR_IRQS];
> #endif
> @@ -39,9 +41,9 @@ struct kernel_stat {
>
> DECLARE_PER_CPU(struct kernel_stat, kstat);
>
> -#define kstat_cpu(cpu) per_cpu(kstat, cpu)
> /* Must have preemption disabled for this to be meaningful. */
> -#define kstat_this_cpu __get_cpu_var(kstat)
> +#define kstat_this_cpu (&__get_cpu_var(kstat))
> +#define kstat_cpu(cpu) per_cpu(kstat, cpu)
>
> extern unsigned long long nr_context_switches(void);
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 594ea22..7ac5aa6 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2158,14 +2158,14 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> #ifdef CONFIG_IRQ_TIME_ACCOUNTING
> static int irqtime_account_hi_update(void)
> {
> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> + u64 *cpustat = kstat_this_cpu->cpustat;
> unsigned long flags;
> u64 latest_ns;
> int ret = 0;
>
> local_irq_save(flags);
> latest_ns = this_cpu_read(cpu_hardirq_time);
> - if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->irq))
> + if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[IRQ]))
> ret = 1;
> local_irq_restore(flags);
> return ret;
> @@ -2173,14 +2173,14 @@ static int irqtime_account_hi_update(void)
>
> static int irqtime_account_si_update(void)
> {
> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> + u64 *cpustat = kstat_this_cpu->cpustat;
> unsigned long flags;
> u64 latest_ns;
> int ret = 0;
>
> local_irq_save(flags);
> latest_ns = this_cpu_read(cpu_softirq_time);
> - if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->softirq))
> + if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[SOFTIRQ]))
> ret = 1;
> local_irq_restore(flags);
> return ret;
> @@ -3866,8 +3866,8 @@ unsigned long long task_sched_runtime(struct task_struct *p)
> void account_user_time(struct task_struct *p, cputime_t cputime,
> cputime_t cputime_scaled)
> {
> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> - cputime64_t tmp;
> + u64 *cpustat = kstat_this_cpu->cpustat;
> + u64 tmp;
>
> /* Add user time to process. */
> p->utime = cputime_add(p->utime, cputime);
> @@ -3876,10 +3876,11 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
>
> /* Add user time to cpustat. */
> tmp = cputime_to_cputime64(cputime);
> +
> if (TASK_NICE(p) > 0)
Now that these are actually array indices this could be:
field = TASK_NICE(p) > 0 ? CPUTIME_NICE : CPUTIME_USER;
> - cpustat->nice = cputime64_add(cpustat->nice, tmp);
> + cpustat[NICE] += tmp;
> else
> - cpustat->user = cputime64_add(cpustat->user, tmp);
> + cpustat[USER] += tmp;
>
> cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
> /* Account for user time used */
> @@ -3895,8 +3896,8 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
> static void account_guest_time(struct task_struct *p, cputime_t cputime,
> cputime_t cputime_scaled)
> {
> - cputime64_t tmp;
> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> + u64 tmp;
> + u64 *cpustat = kstat_this_cpu->cpustat;
>
> tmp = cputime_to_cputime64(cputime);
>
> @@ -3908,11 +3909,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
>
> /* Add guest time to cpustat. */
> if (TASK_NICE(p) > 0) {
> - cpustat->nice = cputime64_add(cpustat->nice, tmp);
> - cpustat->guest_nice = cputime64_add(cpustat->guest_nice, tmp);
> + cpustat[NICE] += tmp;
> + cpustat[GUEST_NICE] += tmp;
> } else {
> - cpustat->user = cputime64_add(cpustat->user, tmp);
> - cpustat->guest = cputime64_add(cpustat->guest, tmp);
> + cpustat[USER] += tmp;
> + cpustat[GUEST] += tmp;
> }
> }
>
> @@ -3925,9 +3926,9 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
> */
> static inline
> void __account_system_time(struct task_struct *p, cputime_t cputime,
> - cputime_t cputime_scaled, cputime64_t *target_cputime64)
> + cputime_t cputime_scaled, u64 *target_cputime64)
Having cpustat be an array means we can drop the pointer here and pass the id.
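Roughly like this (a sketch of the index-based variant; later patches in this
series adopt this shape, with the cpuacct/vtime calls trimmed here for
brevity):

	static inline
	void __account_system_time(struct task_struct *p, cputime_t cputime,
				   cputime_t cputime_scaled, int index)
	{
		u64 *cpustat = kstat_this_cpu->cpustat;

		/* Add system time to process. */
		p->stime = cputime_add(p->stime, cputime);
		p->stimescaled = cputime_add(p->stimescaled, cputime_scaled);
		account_group_system_time(p, cputime);

		/* Index into the array instead of chasing a caller pointer. */
		cpustat[index] += cputime_to_cputime64(cputime);
	}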
> {
> - cputime64_t tmp = cputime_to_cputime64(cputime);
> + u64 tmp = cputime_to_cputime64(cputime);
>
> /* Add system time to process. */
> p->stime = cputime_add(p->stime, cputime);
> @@ -3935,7 +3936,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
> account_group_system_time(p, cputime);
>
> /* Add system time to cpustat. */
> - *target_cputime64 = cputime64_add(*target_cputime64, tmp);
> + *target_cputime64 += tmp;
> cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
>
> /* Account for system time used */
> @@ -3952,8 +3953,8 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
> void account_system_time(struct task_struct *p, int hardirq_offset,
> cputime_t cputime, cputime_t cputime_scaled)
> {
> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> - cputime64_t *target_cputime64;
> + u64 *cpustat = kstat_this_cpu->cpustat;
> + u64 *target_cputime64;
>
> if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
> account_guest_time(p, cputime, cputime_scaled);
> @@ -3961,11 +3962,11 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
> }
>
> if (hardirq_count() - hardirq_offset)
> - target_cputime64 = &cpustat->irq;
> + target_cputime64 = &cpustat[IRQ];
> else if (in_serving_softirq())
> - target_cputime64 = &cpustat->softirq;
> + target_cputime64 = &cpustat[SOFTIRQ];
> else
> - target_cputime64 = &cpustat->system;
> + target_cputime64 = &cpustat[SYSTEM];
>
> __account_system_time(p, cputime, cputime_scaled, target_cputime64);
> }
> @@ -3976,10 +3977,10 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
> */
> void account_steal_time(cputime_t cputime)
> {
> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> - cputime64_t cputime64 = cputime_to_cputime64(cputime);
> + u64 *cpustat = kstat_this_cpu->cpustat;
> + u64 cputime64 = cputime_to_cputime64(cputime);
>
> - cpustat->steal = cputime64_add(cpustat->steal, cputime64);
> + cpustat[STEAL] += cputime64;
> }
>
> /*
> @@ -3988,14 +3989,14 @@ void account_steal_time(cputime_t cputime)
> */
> void account_idle_time(cputime_t cputime)
> {
> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> - cputime64_t cputime64 = cputime_to_cputime64(cputime);
> + u64 *cpustat = kstat_this_cpu->cpustat;
> + u64 cputime64 = cputime_to_cputime64(cputime);
> struct rq *rq = this_rq();
>
> if (atomic_read(&rq->nr_iowait) > 0)
> - cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
> + cpustat[IOWAIT] += cputime64;
> else
> - cpustat->idle = cputime64_add(cpustat->idle, cputime64);
> + cpustat[IDLE] += cputime64;
> }
>
> static __always_inline bool steal_account_process_tick(void)
> @@ -4045,16 +4046,16 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
> struct rq *rq)
> {
> cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
> - cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> + u64 tmp = cputime_to_cputime64(cputime_one_jiffy);
> + u64 *cpustat = kstat_this_cpu->cpustat;
>
> if (steal_account_process_tick())
> return;
>
> if (irqtime_account_hi_update()) {
> - cpustat->irq = cputime64_add(cpustat->irq, tmp);
> + cpustat[IRQ] += tmp;
> } else if (irqtime_account_si_update()) {
> - cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
> + cpustat[SOFTIRQ] += tmp;
> } else if (this_cpu_ksoftirqd() == p) {
> /*
> * ksoftirqd time do not get accounted in cpu_softirq_time.
> @@ -4062,7 +4063,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
> * Also, p->stime needs to be updated for ksoftirqd.
> */
> __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
> - &cpustat->softirq);
> + &cpustat[SOFTIRQ]);
> } else if (user_tick) {
> account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
> } else if (p == rq->idle) {
> @@ -4071,7 +4072,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
> account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
> } else {
> __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
> - &cpustat->system);
> + &cpustat[SYSTEM]);
> }
> }
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] split kernel stat in two
2011-11-15 15:59 ` [PATCH 2/4] split kernel stat in two Glauber Costa
@ 2011-11-16 6:12 ` Paul Turner
2011-11-16 11:34 ` Glauber Costa
0 siblings, 1 reply; 17+ messages in thread
From: Paul Turner @ 2011-11-16 6:12 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
jbottomley, cgroups
On 11/15/2011 07:59 AM, Glauber Costa wrote:
> In a later patch, we will use cpustat information per-task group.
> However, some of its fields are naturally global, such as the irq
> counters. There is no need to impose the task group overhead to them
> in this case. So better separate them.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Paul Turner <pjt@google.com>
> ---
> arch/s390/appldata/appldata_os.c | 16 +++++++-------
> arch/x86/include/asm/i387.h | 2 +-
> drivers/cpufreq/cpufreq_conservative.c | 20 ++++++++--------
> drivers/cpufreq/cpufreq_ondemand.c | 20 ++++++++--------
> drivers/macintosh/rack-meter.c | 6 ++--
> fs/proc/stat.c | 36 ++++++++++++++++----------------
> include/linux/kernel_stat.h | 8 ++++++-
> kernel/sched.c | 18 ++++++++-------
> 8 files changed, 67 insertions(+), 59 deletions(-)
>
> diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
> index 3d6b672..695388a 100644
> --- a/arch/s390/appldata/appldata_os.c
> +++ b/arch/s390/appldata/appldata_os.c
> @@ -115,21 +115,21 @@ static void appldata_get_os_data(void *data)
> j = 0;
> for_each_online_cpu(i) {
> os_data->os_cpu[j].per_cpu_user =
> - cputime_to_jiffies(kstat_cpu(i).cpustat[USER]);
> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[USER]);
> os_data->os_cpu[j].per_cpu_nice =
> - cputime_to_jiffies(kstat_cpu(i).cpustat[NICE]);
> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[NICE]);
> os_data->os_cpu[j].per_cpu_system =
> - cputime_to_jiffies(kstat_cpu(i).cpustat[SYSTEM]);
> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[SYSTEM]);
> os_data->os_cpu[j].per_cpu_idle =
> - cputime_to_jiffies(kstat_cpu(i).cpustat[IDLE]);
> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[IDLE]);
> os_data->os_cpu[j].per_cpu_irq =
> - cputime_to_jiffies(kstat_cpu(i).cpustat[IRQ]);
> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[IRQ]);
> os_data->os_cpu[j].per_cpu_softirq =
> - cputime_to_jiffies(kstat_cpu(i).cpustat[SOFTIRQ]);
> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[SOFTIRQ]);
> os_data->os_cpu[j].per_cpu_iowait =
> - cputime_to_jiffies(kstat_cpu(i).cpustat[IOWAIT]);
> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[IOWAIT]);
> os_data->os_cpu[j].per_cpu_steal =
> - cputime_to_jiffies(kstat_cpu(i).cpustat[STEAL]);
> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[STEAL]);
> os_data->os_cpu[j].cpu_id = i;
> j++;
> }
> diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
> index 56fa4d7..1f1b536 100644
> --- a/arch/x86/include/asm/i387.h
> +++ b/arch/x86/include/asm/i387.h
> @@ -218,7 +218,7 @@ static inline void fpu_fxsave(struct fpu *fpu)
> #ifdef CONFIG_SMP
> #define safe_address (__per_cpu_offset[0])
> #else
> -#define safe_address (kstat_cpu(0).cpustat[USER])
> +#define safe_address (__get_cpu_var(kernel_cpustat).cpustat[USER])
> #endif
>
> /*
> diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
> index 2ab538f..a3a739f 100644
> --- a/drivers/cpufreq/cpufreq_conservative.c
> +++ b/drivers/cpufreq/cpufreq_conservative.c
> @@ -103,13 +103,13 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
> cputime64_t busy_time;
>
> cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
> - busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
> - kstat_cpu(cpu).cpustat[SYSTEM]);
> + busy_time = cputime64_add(kcpustat_cpu(cpu).cpustat[USER],
> + kcpustat_cpu(cpu).cpustat[SYSTEM]);
>
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
> + busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[IRQ]);
> + busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[SOFTIRQ]);
> + busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[STEAL]);
> + busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[NICE]);
>
This clobbers almost *all* the same lines as the last patch. There has to be a
more readable way of structuring these 2 patches.
> idle_time = cputime64_sub(cur_wall_time, busy_time);
> if (wall)
> @@ -272,7 +272,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
> dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
> &dbs_info->prev_cpu_wall);
> if (dbs_tuners_ins.ignore_nice)
> - dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
> + dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
> }
> return count;
> }
> @@ -365,7 +365,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
> cputime64_t cur_nice;
> unsigned long cur_nice_jiffies;
>
> - cur_nice = cputime64_sub(kstat_cpu(j).cpustat[NICE],
> + cur_nice = cputime64_sub(kcpustat_cpu(j).cpustat[NICE],
> j_dbs_info->prev_cpu_nice);
> /*
> * Assumption: nice time between sampling periods will
> @@ -374,7 +374,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
> cur_nice_jiffies = (unsigned long)
> cputime64_to_jiffies64(cur_nice);
>
> - j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
> + j_dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
> idle_time += jiffies_to_usecs(cur_nice_jiffies);
> }
>
> @@ -503,7 +503,7 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
> &j_dbs_info->prev_cpu_wall);
> if (dbs_tuners_ins.ignore_nice)
> j_dbs_info->prev_cpu_nice =
> - kstat_cpu(j).cpustat[NICE];
> + kcpustat_cpu(j).cpustat[NICE];
> }
> this_dbs_info->down_skip = 0;
> this_dbs_info->requested_freq = policy->cur;
> diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
> index 45d8e17..46e89663 100644
> --- a/drivers/cpufreq/cpufreq_ondemand.c
> +++ b/drivers/cpufreq/cpufreq_ondemand.c
> @@ -127,13 +127,13 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
> cputime64_t busy_time;
>
> cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
> - busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
> - kstat_cpu(cpu).cpustat[SYSTEM]);
> + busy_time = cputime64_add(kcpustat_cpu(cpu).cpustat[USER],
> + kcpustat_cpu(cpu).cpustat[SYSTEM]);
>
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
> + busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[IRQ]);
> + busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[SOFTIRQ]);
> + busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[STEAL]);
> + busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[NICE]);
>
> idle_time = cputime64_sub(cur_wall_time, busy_time);
> if (wall)
> @@ -345,7 +345,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
> dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
> &dbs_info->prev_cpu_wall);
> if (dbs_tuners_ins.ignore_nice)
> - dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
> + dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
>
> }
> return count;
> @@ -458,7 +458,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
> cputime64_t cur_nice;
> unsigned long cur_nice_jiffies;
>
> - cur_nice = cputime64_sub(kstat_cpu(j).cpustat[NICE],
> + cur_nice = cputime64_sub(kcpustat_cpu(j).cpustat[NICE],
> j_dbs_info->prev_cpu_nice);
> /*
> * Assumption: nice time between sampling periods will
> @@ -467,7 +467,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
> cur_nice_jiffies = (unsigned long)
> cputime64_to_jiffies64(cur_nice);
>
> - j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
> + j_dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
> idle_time += jiffies_to_usecs(cur_nice_jiffies);
> }
>
> @@ -648,7 +648,7 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
> &j_dbs_info->prev_cpu_wall);
> if (dbs_tuners_ins.ignore_nice)
> j_dbs_info->prev_cpu_nice =
> - kstat_cpu(j).cpustat[NICE];
> + kcpustat_cpu(j).cpustat[NICE];
> }
> this_dbs_info->cpu = cpu;
> this_dbs_info->rate_mult = 1;
> diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
> index c80e49a..c8e67b0 100644
> --- a/drivers/macintosh/rack-meter.c
> +++ b/drivers/macintosh/rack-meter.c
> @@ -83,11 +83,11 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu)
> {
> cputime64_t retval;
>
> - retval = cputime64_add(kstat_cpu(cpu).cpustat[IDLE],
> - kstat_cpu(cpu).cpustat[IOWAIT]);
> + retval = cputime64_add(kcpustat_cpu(cpu).cpustat[IDLE],
> + kcpustat_cpu(cpu).cpustat[IOWAIT]);
>
> if (rackmeter_ignore_nice)
> - retval = cputime64_add(retval, kstat_cpu(cpu).cpustat[NICE]);
> + retval = cputime64_add(retval, kcpustat_cpu(cpu).cpustat[NICE]);
>
> return retval;
> }
> diff --git a/fs/proc/stat.c b/fs/proc/stat.c
> index b7b74ad..6ab20db 100644
> --- a/fs/proc/stat.c
> +++ b/fs/proc/stat.c
> @@ -28,7 +28,7 @@ static u64 get_idle_time(int cpu)
>
> if (idle_time == -1ULL) {
> /* !NO_HZ so we can rely on cpustat.idle */
> - idle = kstat_cpu(cpu).cpustat[IDLE];
> + idle = kcpustat_cpu(cpu).cpustat[IDLE];
> idle += arch_idle_time(cpu);
> } else
> idle = usecs_to_cputime(idle_time);
> @@ -42,7 +42,7 @@ static u64 get_iowait_time(int cpu)
>
> if (iowait_time == -1ULL)
> /* !NO_HZ so we can rely on cpustat.iowait */
> - iowait = kstat_cpu(cpu).cpustat[IOWAIT];
> + iowait = kcpustat_cpu(cpu).cpustat[IOWAIT];
> else
> iowait = usecs_to_cputime(iowait_time);
>
> @@ -67,16 +67,16 @@ static int show_stat(struct seq_file *p, void *v)
> jif = boottime.tv_sec;
>
> for_each_possible_cpu(i) {
> - user += kstat_cpu(i).cpustat[USER];
> - nice += kstat_cpu(i).cpustat[NICE];
> - system += kstat_cpu(i).cpustat[SYSTEM];
> + user += kcpustat_cpu(i).cpustat[USER];
> + nice += kcpustat_cpu(i).cpustat[NICE];
> + system += kcpustat_cpu(i).cpustat[SYSTEM];
> idle += get_idle_time(i);
> iowait += get_iowait_time(i);
> - irq += kstat_cpu(i).cpustat[IRQ];
> - softirq += kstat_cpu(i).cpustat[SOFTIRQ];
> - steal += kstat_cpu(i).cpustat[STEAL];
> - guest += kstat_cpu(i).cpustat[GUEST];
> - guest_nice += kstat_cpu(i).cpustat[GUEST_NICE];
> + irq += kcpustat_cpu(i).cpustat[IRQ];
> + softirq += kcpustat_cpu(i).cpustat[SOFTIRQ];
> + steal += kcpustat_cpu(i).cpustat[STEAL];
> + guest += kcpustat_cpu(i).cpustat[GUEST];
> + guest_nice += kcpustat_cpu(i).cpustat[GUEST_NICE];
>
> for (j = 0; j < NR_SOFTIRQS; j++) {
> unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
> @@ -101,16 +101,16 @@ static int show_stat(struct seq_file *p, void *v)
> (unsigned long long)cputime64_to_clock_t(guest_nice));
> for_each_online_cpu(i) {
> /* Copy values here to work around gcc-2.95.3, gcc-2.96 */
> - user = kstat_cpu(i).cpustat[USER];
> - nice = kstat_cpu(i).cpustat[NICE];
> - system = kstat_cpu(i).cpustat[SYSTEM];
> + user = kcpustat_cpu(i).cpustat[USER];
> + nice = kcpustat_cpu(i).cpustat[NICE];
> + system = kcpustat_cpu(i).cpustat[SYSTEM];
> idle = get_idle_time(i);
> iowait = get_iowait_time(i);
> - irq = kstat_cpu(i).cpustat[IRQ];
> - softirq = kstat_cpu(i).cpustat[SOFTIRQ];
> - steal = kstat_cpu(i).cpustat[STEAL];
> - guest = kstat_cpu(i).cpustat[GUEST];
> - guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
> + irq = kcpustat_cpu(i).cpustat[IRQ];
> + softirq = kcpustat_cpu(i).cpustat[SOFTIRQ];
> + steal = kcpustat_cpu(i).cpustat[STEAL];
> + guest = kcpustat_cpu(i).cpustat[GUEST];
> + guest_nice = kcpustat_cpu(i).cpustat[GUEST_NICE];
> seq_printf(p,
> "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
> "%llu\n",
> diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
> index 7bfd0fe..f0e31a9 100644
> --- a/include/linux/kernel_stat.h
> +++ b/include/linux/kernel_stat.h
> @@ -30,8 +30,11 @@ enum cpu_usage_stat {
> NR_STATS,
> };
>
> -struct kernel_stat {
> +struct kernel_cpustat {
> u64 cpustat[NR_STATS];
> +};
> +
> +struct kernel_stat {
> #ifndef CONFIG_GENERIC_HARDIRQS
> unsigned int irqs[NR_IRQS];
> #endif
> @@ -40,10 +43,13 @@ struct kernel_stat {
> };
>
> DECLARE_PER_CPU(struct kernel_stat, kstat);
> +DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
>
> /* Must have preemption disabled for this to be meaningful. */
> #define kstat_this_cpu (&__get_cpu_var(kstat))
> +#define kcpustat_this_cpu (&__get_cpu_var(kernel_cpustat))
> #define kstat_cpu(cpu) per_cpu(kstat, cpu)
> +#define kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu)
>
> extern unsigned long long nr_context_switches(void);
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 7ac5aa6..efdd4d8 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2158,7 +2158,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> #ifdef CONFIG_IRQ_TIME_ACCOUNTING
> static int irqtime_account_hi_update(void)
> {
> - u64 *cpustat = kstat_this_cpu->cpustat;
> + u64 *cpustat = kcpustat_this_cpu->cpustat;
> unsigned long flags;
> u64 latest_ns;
> int ret = 0;
> @@ -2173,7 +2173,7 @@ static int irqtime_account_hi_update(void)
>
> static int irqtime_account_si_update(void)
> {
> - u64 *cpustat = kstat_this_cpu->cpustat;
> + u64 *cpustat = kcpustat_this_cpu->cpustat;
> unsigned long flags;
> u64 latest_ns;
> int ret = 0;
> @@ -3803,8 +3803,10 @@ unlock:
> #endif
>
> DEFINE_PER_CPU(struct kernel_stat, kstat);
> +DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
>
> EXPORT_PER_CPU_SYMBOL(kstat);
> +EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
This would want a big fat comment explaining the difference.
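Something along these lines, perhaps (wording illustrative):

	/*
	 * kstat:          per-cpu interrupt statistics (hard/soft irq
	 *                 counts); these remain global in any case.
	 * kernel_cpustat: per-cpu cputime buckets (user, nice, system, ...);
	 *                 split out because later patches in the series also
	 *                 keep an instance of it per task group.
	 */
	DEFINE_PER_CPU(struct kernel_stat, kstat);
	DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);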
>
> /*
> * Return any ns on the sched_clock that have not yet been accounted in
> @@ -3866,7 +3868,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
> void account_user_time(struct task_struct *p, cputime_t cputime,
> cputime_t cputime_scaled)
> {
> - u64 *cpustat = kstat_this_cpu->cpustat;
> + u64 *cpustat = kcpustat_this_cpu->cpustat;
> u64 tmp;
>
> /* Add user time to process. */
> @@ -3897,7 +3899,7 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
> cputime_t cputime_scaled)
> {
> u64 tmp;
> - u64 *cpustat = kstat_this_cpu->cpustat;
> + u64 *cpustat = kcpustat_this_cpu->cpustat;
>
> tmp = cputime_to_cputime64(cputime);
>
> @@ -3953,7 +3955,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
> void account_system_time(struct task_struct *p, int hardirq_offset,
> cputime_t cputime, cputime_t cputime_scaled)
> {
> - u64 *cpustat = kstat_this_cpu->cpustat;
> + u64 *cpustat = kcpustat_this_cpu->cpustat;
> u64 *target_cputime64;
>
> if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
> @@ -3977,7 +3979,7 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
> */
> void account_steal_time(cputime_t cputime)
> {
> - u64 *cpustat = kstat_this_cpu->cpustat;
> + u64 *cpustat = kcpustat_this_cpu->cpustat;
> u64 cputime64 = cputime_to_cputime64(cputime);
>
> cpustat[STEAL] += cputime64;
> @@ -3989,7 +3991,7 @@ void account_steal_time(cputime_t cputime)
> */
> void account_idle_time(cputime_t cputime)
> {
> - u64 *cpustat = kstat_this_cpu->cpustat;
> + u64 *cpustat = kcpustat_this_cpu->cpustat;
> u64 cputime64 = cputime_to_cputime64(cputime);
> struct rq *rq = this_rq();
>
> @@ -4047,7 +4049,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
> {
> cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
> u64 tmp = cputime_to_cputime64(cputime_one_jiffy);
> - u64 *cpustat = kstat_this_cpu->cpustat;
> + u64 *cpustat = kcpustat_this_cpu->cpustat;
>
> if (steal_account_process_tick())
> return;
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 3/4] Keep scheduler statistics per cgroup
2011-11-15 15:59 ` [PATCH 3/4] Keep scheduler statistics per cgroup Glauber Costa
@ 2011-11-16 7:02 ` Paul Turner
2011-11-16 11:56 ` Glauber Costa
0 siblings, 1 reply; 17+ messages in thread
From: Paul Turner @ 2011-11-16 7:02 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
jbottomley, cgroups
On 11/15/2011 07:59 AM, Glauber Costa wrote:
> This patch makes the scheduler statistics, such as user ticks,
> system ticks, etc, per-cgroup. With this information, we are
> able to display the same information we currently do in cpuacct
> cgroup, but within the normal cpu cgroup.
>
Hmm,
So this goes a little beyond the existing stats exported by cpuacct.
Currently we have:
CPUACCT_STAT_USER
CPUACCT_STAT_SYSTEM (in cpuacct.info)
and
cpuacct.usage / cpuacct.usage_per_cpu
Arguably the last two stats are the *most* useful information exported by
cpuacct (and the ones we get for free from existing sched_entity accounting).
But their functionality is not maintained.
As proposed in: https://lkml.org/lkml/2011/11/11/265
I'm not sure we really want to bring the other stats /within/ the CPU controller.
Furthermore, given your stated goal of virtualizing some of the /proc
interfaces using this export, it definitely seems like these fields (and any
future behavioral changes they may enable) should be independent from core cpu.
(/me ... reads through patch then continues thoughts at bottom.)
> For all cgroups other than the root, those statistics are only
> collected when the top level file sched_stats is set to 1. This
> guarantees that the overhead of the patchset is negligible if we're
> not collecting statistics, even if all the bits are in.
>
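A minimal sketch of the gating described above (task_group_account_field()
is introduced by this patch; the reconstruction below assumes the jump label
API the series already uses, and hypothetical field names):

	static struct jump_label_key sched_cgroup_enabled;

	static inline void task_group_account_field(struct task_struct *p,
						    u64 tmp, int index)
	{
		/* The root group feeds /proc/stat; always account it. */
		__get_cpu_var(kernel_cpustat).cpustat[index] += tmp;

		/* Per-cgroup buckets only when sched_stats is set to 1. */
		if (static_branch(&sched_cgroup_enabled))
			per_cpu_ptr(task_group(p)->cpustat,
				    smp_processor_id())->cpustat[index] += tmp;
	}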
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Paul Turner <pjt@google.com>
> ---
> arch/s390/appldata/appldata_os.c | 2 +
> drivers/cpufreq/cpufreq_conservative.c | 16 +++-
> drivers/cpufreq/cpufreq_ondemand.c | 16 +++-
> drivers/macintosh/rack-meter.c | 2 +
> fs/proc/stat.c | 40 ++++----
> fs/proc/uptime.c | 7 +-
> include/linux/kernel_stat.h | 20 ++++-
> kernel/sched.c | 170 ++++++++++++++++++++++++--------
> 8 files changed, 207 insertions(+), 66 deletions(-)
>
> diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
> index 695388a..0612a7c 100644
> --- a/arch/s390/appldata/appldata_os.c
> +++ b/arch/s390/appldata/appldata_os.c
> @@ -114,6 +114,7 @@ static void appldata_get_os_data(void *data)
>
> j = 0;
> for_each_online_cpu(i) {
> + kstat_lock();
> os_data->os_cpu[j].per_cpu_user =
> cputime_to_jiffies(kcpustat_cpu(i).cpustat[USER]);
> os_data->os_cpu[j].per_cpu_nice =
> @@ -131,6 +132,7 @@ static void appldata_get_os_data(void *data)
> os_data->os_cpu[j].per_cpu_steal =
> cputime_to_jiffies(kcpustat_cpu(i).cpustat[STEAL]);
> os_data->os_cpu[j].cpu_id = i;
> + kstat_unlock();
> j++;
> }
>
> diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
> index a3a739f..ca98530 100644
> --- a/drivers/cpufreq/cpufreq_conservative.c
> +++ b/drivers/cpufreq/cpufreq_conservative.c
> @@ -102,6 +102,7 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
> cputime64_t cur_wall_time;
> cputime64_t busy_time;
>
> + kstat_lock();
> cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
> busy_time = cputime64_add(kcpustat_cpu(cpu).cpustat[USER],
> kcpustat_cpu(cpu).cpustat[SYSTEM]);
> @@ -110,6 +111,7 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
> busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[SOFTIRQ]);
> busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[STEAL]);
> busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[NICE]);
> + kstat_unlock();
>
> idle_time = cputime64_sub(cur_wall_time, busy_time);
> if (wall)
> @@ -271,8 +273,11 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
> dbs_info = &per_cpu(cs_cpu_dbs_info, j);
> dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
> &dbs_info->prev_cpu_wall);
> - if (dbs_tuners_ins.ignore_nice)
> + if (dbs_tuners_ins.ignore_nice) {
> + kstat_lock();
> dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
> + kstat_unlock();
> + }
> }
> return count;
> }
> @@ -365,8 +370,10 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
> cputime64_t cur_nice;
> unsigned long cur_nice_jiffies;
>
> + kstat_lock();
> cur_nice = cputime64_sub(kcpustat_cpu(j).cpustat[NICE],
> j_dbs_info->prev_cpu_nice);
> + kstat_unlock();
> /*
> * Assumption: nice time between sampling periods will
> * be less than 2^32 jiffies for 32 bit sys
> @@ -374,7 +381,9 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
> cur_nice_jiffies = (unsigned long)
> cputime64_to_jiffies64(cur_nice);
>
> + kstat_lock();
> j_dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
> + kstat_unlock();
> idle_time += jiffies_to_usecs(cur_nice_jiffies);
> }
>
> @@ -501,9 +510,12 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
>
> j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
> &j_dbs_info->prev_cpu_wall);
> - if (dbs_tuners_ins.ignore_nice)
> + if (dbs_tuners_ins.ignore_nice) {
> + kstat_lock();
> j_dbs_info->prev_cpu_nice =
> kcpustat_cpu(j).cpustat[NICE];
> + kstat_unlock();
> + }
> }
> this_dbs_info->down_skip = 0;
> this_dbs_info->requested_freq = policy->cur;
> diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
> index 46e89663..4076453 100644
> --- a/drivers/cpufreq/cpufreq_ondemand.c
> +++ b/drivers/cpufreq/cpufreq_ondemand.c
> @@ -126,6 +126,7 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
> cputime64_t cur_wall_time;
> cputime64_t busy_time;
>
> + kstat_lock();
> cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
> busy_time = cputime64_add(kcpustat_cpu(cpu).cpustat[USER],
> kcpustat_cpu(cpu).cpustat[SYSTEM]);
> @@ -134,6 +135,7 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
> busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[SOFTIRQ]);
> busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[STEAL]);
> busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[NICE]);
> + kstat_unlock();
>
> idle_time = cputime64_sub(cur_wall_time, busy_time);
> if (wall)
> @@ -344,8 +346,11 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
> dbs_info = &per_cpu(od_cpu_dbs_info, j);
> dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
> &dbs_info->prev_cpu_wall);
> - if (dbs_tuners_ins.ignore_nice)
> + if (dbs_tuners_ins.ignore_nice) {
> + kstat_lock();
> dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
> + kstat_unlock();
> + }
>
> }
> return count;
> @@ -458,8 +463,10 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
> cputime64_t cur_nice;
> unsigned long cur_nice_jiffies;
>
> + kstat_lock();
> cur_nice = cputime64_sub(kcpustat_cpu(j).cpustat[NICE],
> j_dbs_info->prev_cpu_nice);
> + kstat_unlock();
> /*
> * Assumption: nice time between sampling periods will
> * be less than 2^32 jiffies for 32 bit sys
> @@ -467,7 +474,9 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
> cur_nice_jiffies = (unsigned long)
> cputime64_to_jiffies64(cur_nice);
>
> + kstat_lock();
> j_dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
> + kstat_unlock();
> idle_time += jiffies_to_usecs(cur_nice_jiffies);
> }
>
> @@ -646,9 +655,12 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
>
> j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
> &j_dbs_info->prev_cpu_wall);
> - if (dbs_tuners_ins.ignore_nice)
> + if (dbs_tuners_ins.ignore_nice) {
> + kstat_lock();
> j_dbs_info->prev_cpu_nice =
> kcpustat_cpu(j).cpustat[NICE];
> + kstat_unlock();
> + }
> }
> this_dbs_info->cpu = cpu;
> this_dbs_info->rate_mult = 1;
> diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
> index c8e67b0..196244f 100644
> --- a/drivers/macintosh/rack-meter.c
> +++ b/drivers/macintosh/rack-meter.c
> @@ -83,11 +83,13 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu)
> {
> cputime64_t retval;
>
> + kstat_lock();
> retval = cputime64_add(kcpustat_cpu(cpu).cpustat[IDLE],
> kcpustat_cpu(cpu).cpustat[IOWAIT]);
>
> if (rackmeter_ignore_nice)
> retval = cputime64_add(retval, kcpustat_cpu(cpu).cpustat[NICE]);
> + kstat_unlock();
>
> return retval;
> }
> diff --git a/fs/proc/stat.c b/fs/proc/stat.c
> index 6ab20db..ee01403 100644
> --- a/fs/proc/stat.c
> +++ b/fs/proc/stat.c
> @@ -28,7 +28,7 @@ static u64 get_idle_time(int cpu)
>
> if (idle_time == -1ULL) {
> /* !NO_HZ so we can rely on cpustat.idle */
> - idle = kcpustat_cpu(cpu).cpustat[IDLE];
> + idle = root_kcpustat_cpu(cpu).cpustat[IDLE];
> idle += arch_idle_time(cpu);
> } else
> idle = usecs_to_cputime(idle_time);
> @@ -42,7 +42,7 @@ static u64 get_iowait_time(int cpu)
>
> if (iowait_time == -1ULL)
> /* !NO_HZ so we can rely on cpustat.iowait */
> - iowait = kcpustat_cpu(cpu).cpustat[IOWAIT];
> + iowait = root_kcpustat_cpu(cpu).cpustat[IOWAIT];
> else
> iowait = usecs_to_cputime(iowait_time);
>
> @@ -67,16 +67,18 @@ static int show_stat(struct seq_file *p, void *v)
> jif = boottime.tv_sec;
>
> for_each_possible_cpu(i) {
> - user += kcpustat_cpu(i).cpustat[USER];
> - nice += kcpustat_cpu(i).cpustat[NICE];
> - system += kcpustat_cpu(i).cpustat[SYSTEM];
> + kstat_lock();
> + user += root_kcpustat_cpu(i).cpustat[USER];
> + nice += root_kcpustat_cpu(i).cpustat[NICE];
> + system += root_kcpustat_cpu(i).cpustat[SYSTEM];
> idle += get_idle_time(i);
> iowait += get_iowait_time(i);
> - irq += kcpustat_cpu(i).cpustat[IRQ];
> - softirq += kcpustat_cpu(i).cpustat[SOFTIRQ];
> - steal += kcpustat_cpu(i).cpustat[STEAL];
> - guest += kcpustat_cpu(i).cpustat[GUEST];
> - guest_nice += kcpustat_cpu(i).cpustat[GUEST_NICE];
> + irq += root_kcpustat_cpu(i).cpustat[IRQ];
> + softirq += root_kcpustat_cpu(i).cpustat[SOFTIRQ];
> + steal += root_kcpustat_cpu(i).cpustat[STEAL];
> + guest += root_kcpustat_cpu(i).cpustat[GUEST];
> + guest_nice += root_kcpustat_cpu(i).cpustat[GUEST_NICE];
> + kstat_unlock();
>
> for (j = 0; j < NR_SOFTIRQS; j++) {
> unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
> @@ -100,17 +102,19 @@ static int show_stat(struct seq_file *p, void *v)
> (unsigned long long)cputime64_to_clock_t(guest),
> (unsigned long long)cputime64_to_clock_t(guest_nice));
> for_each_online_cpu(i) {
> + kstat_lock();
> /* Copy values here to work around gcc-2.95.3, gcc-2.96 */
> - user = kcpustat_cpu(i).cpustat[USER];
> - nice = kcpustat_cpu(i).cpustat[NICE];
> - system = kcpustat_cpu(i).cpustat[SYSTEM];
> + user = root_kcpustat_cpu(i).cpustat[USER];
> + nice = root_kcpustat_cpu(i).cpustat[NICE];
> + system = root_kcpustat_cpu(i).cpustat[SYSTEM];
> idle = get_idle_time(i);
> iowait = get_iowait_time(i);
> - irq = kcpustat_cpu(i).cpustat[IRQ];
> - softirq = kcpustat_cpu(i).cpustat[SOFTIRQ];
> - steal = kcpustat_cpu(i).cpustat[STEAL];
> - guest = kcpustat_cpu(i).cpustat[GUEST];
> - guest_nice = kcpustat_cpu(i).cpustat[GUEST_NICE];
> + irq = root_kcpustat_cpu(i).cpustat[IRQ];
> + softirq = root_kcpustat_cpu(i).cpustat[SOFTIRQ];
> + steal = root_kcpustat_cpu(i).cpustat[STEAL];
> + guest = root_kcpustat_cpu(i).cpustat[GUEST];
> + guest_nice = root_kcpustat_cpu(i).cpustat[GUEST_NICE];
> + kstat_unlock();
> seq_printf(p,
> "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
> "%llu\n",
> diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
> index 76737bc..acb9ba8 100644
> --- a/fs/proc/uptime.c
> +++ b/fs/proc/uptime.c
> @@ -14,8 +14,11 @@ static int uptime_proc_show(struct seq_file *m, void *v)
> int i;
> u64 idletime = 0;
>
> - for_each_possible_cpu(i)
> - idletime += kstat_cpu(i).cpustat[IDLE];
> + for_each_possible_cpu(i) {
> + kstat_lock();
> + idletime += kcpustat_cpu(i).cpustat[IDLE];
> + kstat_unlock();
> + }
>
> do_posix_clock_monotonic_gettime(&uptime);
> monotonic_to_bootbased(&uptime);
> diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
> index f0e31a9..4c8ff41 100644
> --- a/include/linux/kernel_stat.h
> +++ b/include/linux/kernel_stat.h
> @@ -45,11 +45,27 @@ struct kernel_stat {
> DECLARE_PER_CPU(struct kernel_stat, kstat);
> DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
>
> -/* Must have preemption disabled for this to be meaningful. */
> #define kstat_this_cpu (&__get_cpu_var(kstat))
> -#define kcpustat_this_cpu (&__get_cpu_var(kernel_cpustat))
> #define kstat_cpu(cpu) per_cpu(kstat, cpu)
> +
> +#ifdef CONFIG_CGROUP_SCHED
> +struct kernel_cpustat *task_group_kstat(struct task_struct *p);
> +
> +#define kcpustat_this_cpu this_cpu_ptr(task_group_kstat(current))
> +#define kcpustat_cpu(cpu) (*per_cpu_ptr(task_group_kstat(current), cpu))
> +#define kstat_lock() rcu_read_lock()
> +#define kstat_unlock() rcu_read_unlock()
> +#else
> +#define kcpustat_this_cpu (&__get_cpu_var(kernel_cpustat))
> #define kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu)
> +#define kstat_lock()
> +#define kstat_unlock()
> +#endif
> +/*
> + * This makes sure the root cgroup is the one we read from when cpu
> + * cgroup is on, and is just equivalent to kcpustat_cpu when it is off
> + */
> +#define root_kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu)
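(For illustration, the resulting split for readers — a sketch based on the
fs/proc hunks above; system-wide consumers take the root accessor, while
cgroup-aware consumers take the RCU-guarded one:)

	/* system-wide view, regardless of the caller's cgroup */
	user += root_kcpustat_cpu(i).cpustat[USER];

	/* view of current's task group; needs the rcu guard */
	kstat_lock();
	idle += kcpustat_cpu(i).cpustat[IDLE];
	kstat_unlock();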
>
> extern unsigned long long nr_context_switches(void);
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index efdd4d8..934f631 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -301,6 +301,7 @@ struct task_group {
> #endif
>
> struct cfs_bandwidth cfs_bandwidth;
> + struct kernel_cpustat __percpu *cpustat;
> };
>
> /* task_group_lock serializes the addition/removal of task groups */
> @@ -740,6 +741,12 @@ static inline int cpu_of(struct rq *rq)
> #define cpu_curr(cpu) (cpu_rq(cpu)->curr)
> #define raw_rq() (&__raw_get_cpu_var(runqueues))
>
> +DEFINE_PER_CPU(struct kernel_stat, kstat);
> +DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
> +
> +EXPORT_PER_CPU_SYMBOL(kstat);
> +EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
> +
> #ifdef CONFIG_CGROUP_SCHED
>
> /*
> @@ -763,6 +770,21 @@ static inline struct task_group *task_group(struct task_struct *p)
> return autogroup_task_group(p, tg);
> }
>
> +static struct jump_label_key sched_cgroup_enabled;
This name does not really suggest what this jump-label is used for.
Something like task_group_sched_stats_enabled is much clearer.
> +static int sched_has_sched_stats = 0;
> +
> +struct kernel_cpustat *task_group_kstat(struct task_struct *p)
> +{
> + if (static_branch(&sched_cgroup_enabled)) {
> + struct task_group *tg;
> + tg = task_group(p);
> + return tg->cpustat;
> + }
> +
> + return &kernel_cpustat;
> +}
> +EXPORT_SYMBOL(task_group_kstat);
> +
> /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
> static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
> {
> @@ -784,9 +806,36 @@ static inline struct task_group *task_group(struct task_struct *p)
> {
> return NULL;
> }
> -
> #endif /* CONFIG_CGROUP_SCHED */
>
> +static inline void task_group_account_field(struct task_struct *p,
> + u64 tmp, int index)
> +{
> + /*
> + * Since all updates are sure to touch the root cgroup, we
> + * go ahead and touch it first. If the root cgroup
> + * is the only cgroup, then nothing else should be necessary.
> + *
> + */
> + __get_cpu_var(kernel_cpustat).cpustat[index] += tmp;
> +
> +#ifdef CONFIG_CGROUP_SCHED
> + if (static_branch(&sched_cgroup_enabled)) {
> + struct kernel_cpustat *kcpustat;
> + struct task_group *tg;
> +
> + rcu_read_lock();
> + tg = task_group(p);
> + while (tg && (tg != &root_task_group)) {
You could use for_each_entity starting from &p->se here.
> + kcpustat = this_cpu_ptr(tg->cpustat);
This is going to have to do the this_cpu_ptr work at every level; we already
know what cpu we're on and can reference it directly.
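For illustration, the suggested walk would look roughly like this (a sketch
only — for_each_sched_entity is the existing se->parent walker, though the
root still has to be skipped, as the reply below notes):

	/* resolve the cpu once; each level is then a plain
	 * per_cpu_ptr() dereference instead of a this_cpu_ptr() lookup */
	int cpu = smp_processor_id();
	struct task_group *tg;

	rcu_read_lock();
	for (tg = task_group(p); tg && tg != &root_task_group; tg = tg->parent)
		per_cpu_ptr(tg->cpustat, cpu)->cpustat[index] += tmp;
	rcu_read_unlock();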
> + kcpustat->cpustat[index] += tmp;
> + tg = tg->parent;
> + }
> + rcu_read_unlock();
> + }
> +#endif
> +}
> +
> static void update_rq_clock_task(struct rq *rq, s64 delta);
>
> static void update_rq_clock(struct rq *rq)
> @@ -2158,30 +2207,36 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> #ifdef CONFIG_IRQ_TIME_ACCOUNTING
> static int irqtime_account_hi_update(void)
> {
> - u64 *cpustat = kcpustat_this_cpu->cpustat;
> unsigned long flags;
> u64 latest_ns;
> + u64 *cpustat;
> int ret = 0;
>
> local_irq_save(flags);
> latest_ns = this_cpu_read(cpu_hardirq_time);
> + kstat_lock();
This protects what?
> + cpustat = kcpustat_this_cpu->cpustat;
> if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[IRQ]))
> ret = 1;
> + kstat_unlock();
> local_irq_restore(flags);
> return ret;
> }
>
> static int irqtime_account_si_update(void)
> {
> - u64 *cpustat = kcpustat_this_cpu->cpustat;
> + u64 *cpustat;
> unsigned long flags;
> u64 latest_ns;
> int ret = 0;
>
> local_irq_save(flags);
> latest_ns = this_cpu_read(cpu_softirq_time);
> + kstat_lock();
> + cpustat = kcpustat_this_cpu->cpustat;
> if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[SOFTIRQ]))
> ret = 1;
> + kstat_unlock();
> local_irq_restore(flags);
> return ret;
> }
> @@ -3802,12 +3857,6 @@ unlock:
>
> #endif
>
> -DEFINE_PER_CPU(struct kernel_stat, kstat);
> -DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
> -
> -EXPORT_PER_CPU_SYMBOL(kstat);
> -EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
> -
> /*
> * Return any ns on the sched_clock that have not yet been accounted in
> * @p in case that task is currently running.
> @@ -3868,7 +3917,6 @@ unsigned long long task_sched_runtime(struct task_struct *p)
> void account_user_time(struct task_struct *p, cputime_t cputime,
> cputime_t cputime_scaled)
> {
> - u64 *cpustat = kcpustat_this_cpu->cpustat;
> u64 tmp;
>
> /* Add user time to process. */
> @@ -3880,9 +3928,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
> tmp = cputime_to_cputime64(cputime);
>
> if (TASK_NICE(p) > 0)
> - cpustat[NICE] += tmp;
> + task_group_account_field(p, tmp, NICE);
> else
> - cpustat[USER] += tmp;
> + task_group_account_field(p, tmp, USER);
>
> cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
> /* Account for user time used */
> @@ -3899,7 +3947,6 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
> cputime_t cputime_scaled)
> {
> u64 tmp;
> - u64 *cpustat = kcpustat_this_cpu->cpustat;
>
> tmp = cputime_to_cputime64(cputime);
>
> @@ -3911,11 +3958,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
>
> /* Add guest time to cpustat. */
> if (TASK_NICE(p) > 0) {
> - cpustat[NICE] += tmp;
> - cpustat[GUEST_NICE] += tmp;
> + task_group_account_field(p, tmp, NICE);
> + task_group_account_field(p, tmp, GUEST_NICE);
> } else {
> - cpustat[USER] += tmp;
> - cpustat[GUEST] += tmp;
> + task_group_account_field(p, tmp, USER);
> + task_group_account_field(p, tmp, GUEST);
> }
> }
>
> @@ -3928,7 +3975,7 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
> */
> static inline
> void __account_system_time(struct task_struct *p, cputime_t cputime,
> - cputime_t cputime_scaled, u64 *target_cputime64)
> + cputime_t cputime_scaled, int index)
> {
> u64 tmp = cputime_to_cputime64(cputime);
>
> @@ -3938,7 +3985,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
> account_group_system_time(p, cputime);
>
> /* Add system time to cpustat. */
> - *target_cputime64 += tmp;
> + task_group_account_field(p, tmp, index);
> cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
>
> /* Account for system time used */
> @@ -3955,8 +4002,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
> void account_system_time(struct task_struct *p, int hardirq_offset,
> cputime_t cputime, cputime_t cputime_scaled)
> {
> - u64 *cpustat = kcpustat_this_cpu->cpustat;
> - u64 *target_cputime64;
> + int index;
>
> if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
> account_guest_time(p, cputime, cputime_scaled);
> @@ -3964,13 +4010,13 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
> }
>
> if (hardirq_count() - hardirq_offset)
> - target_cputime64 = &cpustat[IRQ];
> + index = IRQ;
> else if (in_serving_softirq())
> - target_cputime64 = &cpustat[SOFTIRQ];
> + index = SOFTIRQ;
> else
> - target_cputime64 = &cpustat[SYSTEM];
> + index = SYSTEM;
>
> - __account_system_time(p, cputime, cputime_scaled, target_cputime64);
> + __account_system_time(p, cputime, cputime_scaled, index);
> }
>
> /*
> @@ -3979,10 +4025,8 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
> */
> void account_steal_time(cputime_t cputime)
> {
> - u64 *cpustat = kcpustat_this_cpu->cpustat;
> u64 cputime64 = cputime_to_cputime64(cputime);
> -
> - cpustat[STEAL] += cputime64;
> + __get_cpu_var(kernel_cpustat).cpustat[STEAL] += cputime64;
> }
>
> /*
> @@ -3991,14 +4035,19 @@ void account_steal_time(cputime_t cputime)
> */
> void account_idle_time(cputime_t cputime)
> {
> - u64 *cpustat = kcpustat_this_cpu->cpustat;
> + struct kernel_cpustat *kcpustat;
> u64 cputime64 = cputime_to_cputime64(cputime);
> struct rq *rq = this_rq();
>
> + kstat_lock();
> + kcpustat = kcpustat_this_cpu;
> +
> if (atomic_read(&rq->nr_iowait) > 0)
> - cpustat[IOWAIT] += cputime64;
> + kcpustat->cpustat[IOWAIT] += cputime64;
> else
> - cpustat[IDLE] += cputime64;
> + /* idle is always accounted to the root cgroup */
> + __get_cpu_var(kernel_cpustat).cpustat[IDLE] += cputime64;
> + kstat_unlock();
> }
>
> static __always_inline bool steal_account_process_tick(void)
> @@ -4045,27 +4094,26 @@ static __always_inline bool steal_account_process_tick(void)
> * softirq as those do not count in task exec_runtime any more.
> */
> static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
> - struct rq *rq)
> + struct rq *rq)
> {
> cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
> u64 tmp = cputime_to_cputime64(cputime_one_jiffy);
> - u64 *cpustat = kcpustat_this_cpu->cpustat;
>
> if (steal_account_process_tick())
> return;
>
> if (irqtime_account_hi_update()) {
> - cpustat[IRQ] += tmp;
> + task_group_account_field(p, tmp, IRQ);
> } else if (irqtime_account_si_update()) {
> - cpustat[SOFTIRQ] += tmp;
> + task_group_account_field(p, tmp, SOFTIRQ);
> } else if (this_cpu_ksoftirqd() == p) {
> /*
> * ksoftirqd time do not get accounted in cpu_softirq_time.
> * So, we have to handle it separately here.
> * Also, p->stime needs to be updated for ksoftirqd.
> */
> - __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
> - &cpustat[SOFTIRQ]);
> + __account_system_time(p, cputime_one_jiffy,
> + one_jiffy_scaled, SOFTIRQ);
> } else if (user_tick) {
> account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
> } else if (p == rq->idle) {
> @@ -4073,8 +4121,8 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
> } else if (p->flags & PF_VCPU) { /* System time or guest time */
> account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
> } else {
> - __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
> - &cpustat[SYSTEM]);
> + __account_system_time(p, cputime_one_jiffy,
> + one_jiffy_scaled, SYSTEM);
> }
> }
>
> @@ -8237,6 +8285,8 @@ void __init sched_init(void)
> INIT_LIST_HEAD(&root_task_group.children);
> INIT_LIST_HEAD(&root_task_group.siblings);
> autogroup_init(&init_task);
> +
> + root_task_group.cpustat = &kernel_cpustat;
> #endif /* CONFIG_CGROUP_SCHED */
>
> for_each_possible_cpu(i) {
> @@ -8674,6 +8724,7 @@ static void free_sched_group(struct task_group *tg)
> free_fair_sched_group(tg);
> free_rt_sched_group(tg);
> autogroup_free(tg);
> + free_percpu(tg->cpustat);
> kfree(tg);
> }
>
> @@ -8693,6 +8744,10 @@ struct task_group *sched_create_group(struct task_group *parent)
> if (!alloc_rt_sched_group(tg, parent))
> goto err;
>
> + tg->cpustat = alloc_percpu(struct kernel_cpustat);
> + if (!tg->cpustat)
> + goto err;
> +
> spin_lock_irqsave(&task_group_lock, flags);
> + list_add_rcu(&tg->list, &task_groups);
>
> @@ -9437,6 +9492,23 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
> }
> #endif /* CONFIG_RT_GROUP_SCHED */
>
> +static u64 cpu_has_sched_stats(struct cgroup *cgrp, struct cftype *cft)
> +{
> + return sched_has_sched_stats;
> +}
> +
> +static int cpu_set_sched_stats(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> + if (!val && sched_has_sched_stats)
> + jump_label_dec(&sched_cgroup_enabled);
> +
> + if (val && !sched_has_sched_stats)
> + jump_label_inc(&sched_cgroup_enabled);
> +
> + sched_has_sched_stats = !!val;
> + return 0;
> +}
> +
> static struct cftype cpu_files[] = {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> {
> @@ -9475,9 +9547,27 @@ static struct cftype cpu_files[] = {
> #endif
> };
>
> +/*
> + * Files appearing here will be shown at the top level only. Although we could
> + * show them unconditionally, and then return an error when read/written from
> + * non-root cgroups, this is less confusing for users
> + */
> +static struct cftype cpu_root_files[] = {
> + {
> + .name = "sched_stats",
> + .read_u64 = cpu_has_sched_stats,
> + .write_u64 = cpu_set_sched_stats,
> + },
> +};
> +
> static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
> {
> - return cgroup_add_files(cont, ss, cpu_files, ARRAY_SIZE(cpu_files));
> + int ret;
> + ret = cgroup_add_files(cont, ss, cpu_files, ARRAY_SIZE(cpu_files));
> + if (!ret)
> + ret = cgroup_add_files(cont, ss, cpu_root_files,
> + ARRAY_SIZE(cpu_root_files));
> + return ret;
> }
>
> struct cgroup_subsys cpu_cgroup_subsys = {
So I'm not seeing any touch-points that intrinsically benefit from being a part
of the cpu sub-system. The hierarchy walk in task_group_account_field() is
completely independent from the rest of the controller.
The argument for merging {usage, usage_per_cpu} into cpu is almost entirely
performance based -- the stats are very useful from a management perspective and
we already maintain (hidden) versions of them in cpu. Whereas, as it stands
this really would seem not to suffer at all from being in its own
controller. I previously suggested that this might want to be a "co-controller"
(e.g. one that only ever exists mounted adjacent to cpu so that we could
leverage the existing hierarchy without over-loading the core of "cpu"). But
I'm not even sure that is required or beneficial given that this isn't going to
add value or make anything cheaper.
From that perspective, perhaps what you're looking for *really* is best served
just by greatly extending the stats exported by cpuacct (as above).
We'll still pull {usage, usage_per_cpu} into "cpu" for the common case but those
who really want everything else could continue using "cpuacct".
Reasonable?
- Paul
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/4] Change cpustat fields to an array.
2011-11-16 5:58 ` Paul Turner
@ 2011-11-16 11:25 ` Glauber Costa
2011-11-16 11:31 ` Glauber Costa
0 siblings, 1 reply; 17+ messages in thread
From: Glauber Costa @ 2011-11-16 11:25 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
jbottomley, cgroups
On 11/16/2011 03:58 AM, Paul Turner wrote:
> On 11/15/2011 07:59 AM, Glauber Costa wrote:
>> This will give us a bit more flexibility to deal with the
>> fields in this structure. This is a preparation patch for
>> later patches in this series.
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: Paul Turner <pjt@google.com>
>> ---
>> arch/s390/appldata/appldata_os.c | 16 ++++----
>> arch/x86/include/asm/i387.h | 2 +-
>> drivers/cpufreq/cpufreq_conservative.c | 23 +++++-----
>> drivers/cpufreq/cpufreq_ondemand.c | 23 +++++-----
>> drivers/macintosh/rack-meter.c | 6 +-
>> fs/proc/stat.c | 63 +++++++++++++---------------
>> fs/proc/uptime.c | 4 +-
>> include/linux/kernel_stat.h | 30 +++++++------
>> kernel/sched.c | 71 ++++++++++++++++----------------
>> 9 files changed, 117 insertions(+), 121 deletions(-)
>>
>> diff --git a/arch/s390/appldata/appldata_os.c
>> b/arch/s390/appldata/appldata_os.c
>> index 92f1cb7..3d6b672 100644
>> --- a/arch/s390/appldata/appldata_os.c
>> +++ b/arch/s390/appldata/appldata_os.c
>> @@ -115,21 +115,21 @@ static void appldata_get_os_data(void *data)
>> j = 0;
>> for_each_online_cpu(i) {
>> os_data->os_cpu[j].per_cpu_user =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat.user);
>> + cputime_to_jiffies(kstat_cpu(i).cpustat[USER]);
>> os_data->os_cpu[j].per_cpu_nice =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat.nice);
>> + cputime_to_jiffies(kstat_cpu(i).cpustat[NICE]);
>> os_data->os_cpu[j].per_cpu_system =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat.system);
>> + cputime_to_jiffies(kstat_cpu(i).cpustat[SYSTEM]);
>> os_data->os_cpu[j].per_cpu_idle =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat.idle);
>> + cputime_to_jiffies(kstat_cpu(i).cpustat[IDLE]);
>> os_data->os_cpu[j].per_cpu_irq =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat.irq);
>> + cputime_to_jiffies(kstat_cpu(i).cpustat[IRQ]);
>> os_data->os_cpu[j].per_cpu_softirq =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat.softirq);
>> + cputime_to_jiffies(kstat_cpu(i).cpustat[SOFTIRQ]);
>> os_data->os_cpu[j].per_cpu_iowait =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat.iowait);
>> + cputime_to_jiffies(kstat_cpu(i).cpustat[IOWAIT]);
>> os_data->os_cpu[j].per_cpu_steal =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat.steal);
>> + cputime_to_jiffies(kstat_cpu(i).cpustat[STEAL]);
>> os_data->os_cpu[j].cpu_id = i;
>> j++;
>> }
>> diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
>> index c9e09ea..56fa4d7 100644
>> --- a/arch/x86/include/asm/i387.h
>> +++ b/arch/x86/include/asm/i387.h
>> @@ -218,7 +218,7 @@ static inline void fpu_fxsave(struct fpu *fpu)
>> #ifdef CONFIG_SMP
>> #define safe_address (__per_cpu_offset[0])
>> #else
>> -#define safe_address (kstat_cpu(0).cpustat.user)
>> +#define safe_address (kstat_cpu(0).cpustat[USER])
>> #endif
>>
>> /*
>> diff --git a/drivers/cpufreq/cpufreq_conservative.c
>> b/drivers/cpufreq/cpufreq_conservative.c
>> index c97b468..2ab538f 100644
>> --- a/drivers/cpufreq/cpufreq_conservative.c
>> +++ b/drivers/cpufreq/cpufreq_conservative.c
>> @@ -103,13 +103,13 @@ static inline cputime64_t
>> get_cpu_idle_time_jiffy(unsigned int cpu,
>> cputime64_t busy_time;
>>
>> cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
>> - busy_time = cputime64_add(kstat_cpu(cpu).cpustat.user,
>> - kstat_cpu(cpu).cpustat.system);
>> + busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
>> + kstat_cpu(cpu).cpustat[SYSTEM]);
>>
>> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.irq);
>> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.softirq);
>> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.steal);
>> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.nice);
>> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
>> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
>> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
>> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
>>
>> idle_time = cputime64_sub(cur_wall_time, busy_time);
>> if (wall)
>> @@ -272,7 +272,7 @@ static ssize_t store_ignore_nice_load(struct
>> kobject *a, struct attribute *b,
>> dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
>> &dbs_info->prev_cpu_wall);
>> if (dbs_tuners_ins.ignore_nice)
>> - dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
>> + dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
>> }
>> return count;
>> }
>> @@ -365,7 +365,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s
>> *this_dbs_info)
>> cputime64_t cur_nice;
>> unsigned long cur_nice_jiffies;
>>
>> - cur_nice = cputime64_sub(kstat_cpu(j).cpustat.nice,
>> + cur_nice = cputime64_sub(kstat_cpu(j).cpustat[NICE],
>> j_dbs_info->prev_cpu_nice);
>> /*
>> * Assumption: nice time between sampling periods will
>> @@ -374,7 +374,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s
>> *this_dbs_info)
>> cur_nice_jiffies = (unsigned long)
>> cputime64_to_jiffies64(cur_nice);
>>
>> - j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
>> + j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
>> idle_time += jiffies_to_usecs(cur_nice_jiffies);
>> }
>>
>> @@ -501,10 +501,9 @@ static int cpufreq_governor_dbs(struct
>> cpufreq_policy *policy,
>>
>> j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
>> &j_dbs_info->prev_cpu_wall);
>> - if (dbs_tuners_ins.ignore_nice) {
>> + if (dbs_tuners_ins.ignore_nice)
>> j_dbs_info->prev_cpu_nice =
>> - kstat_cpu(j).cpustat.nice;
>> - }
>> + kstat_cpu(j).cpustat[NICE];
>> }
>> this_dbs_info->down_skip = 0;
>> this_dbs_info->requested_freq = policy->cur;
>> diff --git a/drivers/cpufreq/cpufreq_ondemand.c
>> b/drivers/cpufreq/cpufreq_ondemand.c
>> index fa8af4e..45d8e17 100644
>> --- a/drivers/cpufreq/cpufreq_ondemand.c
>> +++ b/drivers/cpufreq/cpufreq_ondemand.c
>> @@ -127,13 +127,13 @@ static inline cputime64_t
>> get_cpu_idle_time_jiffy(unsigned int cpu,
>> cputime64_t busy_time;
>>
>> cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
>> - busy_time = cputime64_add(kstat_cpu(cpu).cpustat.user,
>> - kstat_cpu(cpu).cpustat.system);
>> + busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
>> + kstat_cpu(cpu).cpustat[SYSTEM]);
>>
>> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.irq);
>> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.softirq);
>> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.steal);
>> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.nice);
>> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
>> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
>> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
>> + busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
>>
>> idle_time = cputime64_sub(cur_wall_time, busy_time);
>> if (wall)
>> @@ -345,7 +345,7 @@ static ssize_t store_ignore_nice_load(struct
>> kobject *a, struct attribute *b,
>> dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
>> &dbs_info->prev_cpu_wall);
>> if (dbs_tuners_ins.ignore_nice)
>> - dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
>> + dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
>>
>> }
>> return count;
>> @@ -458,7 +458,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s
>> *this_dbs_info)
>> cputime64_t cur_nice;
>> unsigned long cur_nice_jiffies;
>>
>> - cur_nice = cputime64_sub(kstat_cpu(j).cpustat.nice,
>> + cur_nice = cputime64_sub(kstat_cpu(j).cpustat[NICE],
>> j_dbs_info->prev_cpu_nice);
>> /*
>> * Assumption: nice time between sampling periods will
>> @@ -467,7 +467,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s
>> *this_dbs_info)
>> cur_nice_jiffies = (unsigned long)
>> cputime64_to_jiffies64(cur_nice);
>>
>> - j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
>> + j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
>> idle_time += jiffies_to_usecs(cur_nice_jiffies);
>> }
>>
>> @@ -646,10 +646,9 @@ static int cpufreq_governor_dbs(struct
>> cpufreq_policy *policy,
>>
>> j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
>> &j_dbs_info->prev_cpu_wall);
>> - if (dbs_tuners_ins.ignore_nice) {
>> + if (dbs_tuners_ins.ignore_nice)
>> j_dbs_info->prev_cpu_nice =
>> - kstat_cpu(j).cpustat.nice;
>> - }
>> + kstat_cpu(j).cpustat[NICE];
>> }
>> this_dbs_info->cpu = cpu;
>> this_dbs_info->rate_mult = 1;
>> diff --git a/drivers/macintosh/rack-meter.c
>> b/drivers/macintosh/rack-meter.c
>> index 2637c13..c80e49a 100644
>> --- a/drivers/macintosh/rack-meter.c
>> +++ b/drivers/macintosh/rack-meter.c
>> @@ -83,11 +83,11 @@ static inline cputime64_t
>> get_cpu_idle_time(unsigned int cpu)
>> {
>> cputime64_t retval;
>>
>> - retval = cputime64_add(kstat_cpu(cpu).cpustat.idle,
>> - kstat_cpu(cpu).cpustat.iowait);
>> + retval = cputime64_add(kstat_cpu(cpu).cpustat[IDLE],
>> + kstat_cpu(cpu).cpustat[IOWAIT]);
>>
>> if (rackmeter_ignore_nice)
>> - retval = cputime64_add(retval, kstat_cpu(cpu).cpustat.nice);
>> + retval = cputime64_add(retval, kstat_cpu(cpu).cpustat[NICE]);
>>
>> return retval;
>> }
>> diff --git a/fs/proc/stat.c b/fs/proc/stat.c
>> index 42b274d..b7b74ad 100644
>> --- a/fs/proc/stat.c
>> +++ b/fs/proc/stat.c
>> @@ -22,29 +22,27 @@
>> #define arch_idle_time(cpu) 0
>> #endif
>>
>> -static cputime64_t get_idle_time(int cpu)
>> +static u64 get_idle_time(int cpu)
>> {
>> - u64 idle_time = get_cpu_idle_time_us(cpu, NULL);
>> - cputime64_t idle;
>> + u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
>>
>> if (idle_time == -1ULL) {
>> /* !NO_HZ so we can rely on cpustat.idle */
>> - idle = kstat_cpu(cpu).cpustat.idle;
>> - idle = cputime64_add(idle, arch_idle_time(cpu));
>> + idle = kstat_cpu(cpu).cpustat[IDLE];
>> + idle += arch_idle_time(cpu);
>> } else
>> idle = usecs_to_cputime(idle_time);
>>
>> return idle;
>> }
>>
>> -static cputime64_t get_iowait_time(int cpu)
>> +static u64 get_iowait_time(int cpu)
>> {
>> - u64 iowait_time = get_cpu_iowait_time_us(cpu, NULL);
>> - cputime64_t iowait;
>> + u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
>>
>> if (iowait_time == -1ULL)
>> /* !NO_HZ so we can rely on cpustat.iowait */
>> - iowait = kstat_cpu(cpu).cpustat.iowait;
>> + iowait = kstat_cpu(cpu).cpustat[IOWAIT];
>> else
>> iowait = usecs_to_cputime(iowait_time);
>>
>> @@ -55,33 +53,30 @@ static int show_stat(struct seq_file *p, void *v)
>> {
>> int i, j;
>> unsigned long jif;
>> - cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;
>> - cputime64_t guest, guest_nice;
>> + u64 user, nice, system, idle, iowait, irq, softirq, steal;
>> + u64 guest, guest_nice;
>> u64 sum = 0;
>> u64 sum_softirq = 0;
>> unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
>> struct timespec boottime;
>>
>> user = nice = system = idle = iowait =
>> - irq = softirq = steal = cputime64_zero;
>> - guest = guest_nice = cputime64_zero;
>> + irq = softirq = steal = 0;
>> + guest = guest_nice = 0;
>> getboottime(&boottime);
>> jif = boottime.tv_sec;
>>
>> for_each_possible_cpu(i) {
>> - user = cputime64_add(user, kstat_cpu(i).cpustat.user);
>> - nice = cputime64_add(nice, kstat_cpu(i).cpustat.nice);
>> - system = cputime64_add(system, kstat_cpu(i).cpustat.system);
>> - idle = cputime64_add(idle, get_idle_time(i));
>> - iowait = cputime64_add(iowait, get_iowait_time(i));
>> - irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
>> - softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
>> - steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
>> - guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest);
>> - guest_nice = cputime64_add(guest_nice,
>> - kstat_cpu(i).cpustat.guest_nice);
>> - sum += kstat_cpu_irqs_sum(i);
>> - sum += arch_irq_stat_cpu(i);
>> + user += kstat_cpu(i).cpustat[USER];
>
> Half the time cputime64_add is preserved, half the time this patch
> converts it to a naked '+='. Admittedly no one seems to usefully define
> cputime64_add but why the conversion / inconsistency?
Because at least in this patchset of mine, cputime conversion is not a
goal, but a side effect. I wanted cpustat to be an array of u64, so I
converted users. There are some places in the code that still use
cputime64 for other variables, and those I kept untouched.
That's to make the patch focused. The other variables can be easily
converted if we see value in it.
>> + nice += kstat_cpu(i).cpustat[NICE];
>> + system += kstat_cpu(i).cpustat[SYSTEM];
>> + idle += get_idle_time(i);
>> + iowait += get_iowait_time(i);
>> + irq += kstat_cpu(i).cpustat[IRQ];
>> + softirq += kstat_cpu(i).cpustat[SOFTIRQ];
>> + steal += kstat_cpu(i).cpustat[STEAL];
>> + guest += kstat_cpu(i).cpustat[GUEST];
>> + guest_nice += kstat_cpu(i).cpustat[GUEST_NICE];
>>
>> for (j = 0; j < NR_SOFTIRQS; j++) {
>> unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
>> @@ -106,16 +101,16 @@ static int show_stat(struct seq_file *p, void *v)
>> (unsigned long long)cputime64_to_clock_t(guest_nice));
>> for_each_online_cpu(i) {
>> /* Copy values here to work around gcc-2.95.3, gcc-2.96 */
>> - user = kstat_cpu(i).cpustat.user;
>> - nice = kstat_cpu(i).cpustat.nice;
>> - system = kstat_cpu(i).cpustat.system;
>> + user = kstat_cpu(i).cpustat[USER];
>> + nice = kstat_cpu(i).cpustat[NICE];
>> + system = kstat_cpu(i).cpustat[SYSTEM];
>> idle = get_idle_time(i);
>> iowait = get_iowait_time(i);
>> - irq = kstat_cpu(i).cpustat.irq;
>> - softirq = kstat_cpu(i).cpustat.softirq;
>> - steal = kstat_cpu(i).cpustat.steal;
>> - guest = kstat_cpu(i).cpustat.guest;
>> - guest_nice = kstat_cpu(i).cpustat.guest_nice;
>> + irq = kstat_cpu(i).cpustat[IRQ];
>> + softirq = kstat_cpu(i).cpustat[SOFTIRQ];
>> + steal = kstat_cpu(i).cpustat[STEAL];
>> + guest = kstat_cpu(i).cpustat[GUEST];
>> + guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
>> seq_printf(p,
>> "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
>> "%llu\n",
>> diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
>> index 766b1d4..76737bc 100644
>> --- a/fs/proc/uptime.c
>> +++ b/fs/proc/uptime.c
>> @@ -12,10 +12,10 @@ static int uptime_proc_show(struct seq_file *m,
>> void *v)
>> struct timespec uptime;
>> struct timespec idle;
>> int i;
>> - cputime_t idletime = cputime_zero;
>> + u64 idletime = 0;
>>
>> for_each_possible_cpu(i)
>> - idletime = cputime64_add(idletime, kstat_cpu(i).cpustat.idle);
>> + idletime += kstat_cpu(i).cpustat[IDLE];
>>
>> do_posix_clock_monotonic_gettime(&uptime);
>> monotonic_to_bootbased(&uptime);
>> diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
>> index 0cce2db..7bfd0fe 100644
>> --- a/include/linux/kernel_stat.h
>> +++ b/include/linux/kernel_stat.h
>> @@ -6,6 +6,7 @@
>> #include <linux/percpu.h>
>> #include <linux/cpumask.h>
>> #include <linux/interrupt.h>
>> +#include <linux/sched.h>
>> #include <asm/irq.h>
>> #include <asm/cputime.h>
>>
>> @@ -15,21 +16,22 @@
>> * used by rstatd/perfmeter
>> */
>>
>> -struct cpu_usage_stat {
>> - cputime64_t user;
>> - cputime64_t nice;
>> - cputime64_t system;
>> - cputime64_t softirq;
>> - cputime64_t irq;
>> - cputime64_t idle;
>> - cputime64_t iowait;
>> - cputime64_t steal;
>> - cputime64_t guest;
>> - cputime64_t guest_nice;
>> +enum cpu_usage_stat {
>> + USER,
>> + NICE,
>> + SYSTEM,
>> + SOFTIRQ,
>> + IRQ,
>> + IDLE,
>> + IOWAIT,
>> + STEAL,
>> + GUEST,
>> + GUEST_NICE,
>> + NR_STATS,
>> };
>
> I suspect we want a more descriptive prefix here, e.g. CPUTIME_USER
>
>>
>> struct kernel_stat {
>> - struct cpu_usage_stat cpustat;
>> + u64 cpustat[NR_STATS];
>> #ifndef CONFIG_GENERIC_HARDIRQS
>> unsigned int irqs[NR_IRQS];
>> #endif
>> @@ -39,9 +41,9 @@ struct kernel_stat {
>>
>> DECLARE_PER_CPU(struct kernel_stat, kstat);
>>
>> -#define kstat_cpu(cpu) per_cpu(kstat, cpu)
>> /* Must have preemption disabled for this to be meaningful. */
>> -#define kstat_this_cpu __get_cpu_var(kstat)
>> +#define kstat_this_cpu (&__get_cpu_var(kstat))
>> +#define kstat_cpu(cpu) per_cpu(kstat, cpu)
>>
>> extern unsigned long long nr_context_switches(void);
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 594ea22..7ac5aa6 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -2158,14 +2158,14 @@ static void update_rq_clock_task(struct rq
>> *rq, s64 delta)
>> #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>> static int irqtime_account_hi_update(void)
>> {
>> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
>> + u64 *cpustat = kstat_this_cpu->cpustat;
>> unsigned long flags;
>> u64 latest_ns;
>> int ret = 0;
>>
>> local_irq_save(flags);
>> latest_ns = this_cpu_read(cpu_hardirq_time);
>> - if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->irq))
>> + if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[IRQ]))
>> ret = 1;
>> local_irq_restore(flags);
>> return ret;
>> @@ -2173,14 +2173,14 @@ static int irqtime_account_hi_update(void)
>>
>> static int irqtime_account_si_update(void)
>> {
>> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
>> + u64 *cpustat = kstat_this_cpu->cpustat;
>> unsigned long flags;
>> u64 latest_ns;
>> int ret = 0;
>>
>> local_irq_save(flags);
>> latest_ns = this_cpu_read(cpu_softirq_time);
>> - if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->softirq))
>> + if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[SOFTIRQ]))
>> ret = 1;
>> local_irq_restore(flags);
>> return ret;
>> @@ -3866,8 +3866,8 @@ unsigned long long task_sched_runtime(struct
>> task_struct *p)
>> void account_user_time(struct task_struct *p, cputime_t cputime,
>> cputime_t cputime_scaled)
>> {
>> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
>> - cputime64_t tmp;
>> + u64 *cpustat = kstat_this_cpu->cpustat;
>> + u64 tmp;
>>
>> /* Add user time to process. */
>> p->utime = cputime_add(p->utime, cputime);
>> @@ -3876,10 +3876,11 @@ void account_user_time(struct task_struct *p,
>> cputime_t cputime,
>>
>> /* Add user time to cpustat. */
>> tmp = cputime_to_cputime64(cputime);
>> +
>> if (TASK_NICE(p) > 0)
>
> Now that these are actually fields this could be:
> field = TASK_NICE(p) > 0 ? CPUTIME_NICE : CPUTIME_USER;
>
>> - cpustat->nice = cputime64_add(cpustat->nice, tmp);
>> + cpustat[NICE] += tmp;
>> else
>> - cpustat->user = cputime64_add(cpustat->user, tmp);
>> + cpustat[USER] += tmp;
>>
>> cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
>> /* Account for user time used */
>> @@ -3895,8 +3896,8 @@ void account_user_time(struct task_struct *p,
>> cputime_t cputime,
>> static void account_guest_time(struct task_struct *p, cputime_t cputime,
>> cputime_t cputime_scaled)
>> {
>> - cputime64_t tmp;
>> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
>> + u64 tmp;
>> + u64 *cpustat = kstat_this_cpu->cpustat;
>>
>> tmp = cputime_to_cputime64(cputime);
>>
>> @@ -3908,11 +3909,11 @@ static void account_guest_time(struct
>> task_struct *p, cputime_t cputime,
>>
>> /* Add guest time to cpustat. */
>> if (TASK_NICE(p) > 0) {
>> - cpustat->nice = cputime64_add(cpustat->nice, tmp);
>> - cpustat->guest_nice = cputime64_add(cpustat->guest_nice, tmp);
>> + cpustat[NICE] += tmp;
>> + cpustat[GUEST_NICE] += tmp;
>> } else {
>> - cpustat->user = cputime64_add(cpustat->user, tmp);
>> - cpustat->guest = cputime64_add(cpustat->guest, tmp);
>> + cpustat[USER] += tmp;
>> + cpustat[GUEST] += tmp;
>> }
>> }
>>
>> @@ -3925,9 +3926,9 @@ static void account_guest_time(struct
>> task_struct *p, cputime_t cputime,
>> */
>> static inline
>> void __account_system_time(struct task_struct *p, cputime_t cputime,
>> - cputime_t cputime_scaled, cputime64_t *target_cputime64)
>> + cputime_t cputime_scaled, u64 *target_cputime64)
>
> Having cpustat be an array means we can drop the pointer here and pass
> the id.
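For illustration, that would look roughly like this (a sketch only; it is in
fact what a later patch in this series does via task_group_account_field):

	static inline
	void __account_system_time(struct task_struct *p, cputime_t cputime,
				   cputime_t cputime_scaled, int index)
	{
		u64 tmp = cputime_to_cputime64(cputime);

		/* Add system time to process. */
		p->stime = cputime_add(p->stime, cputime);
		p->stimescaled = cputime_add(p->stimescaled, cputime_scaled);
		account_group_system_time(p, cputime);

		/* Index into the cpustat array instead of dereferencing
		 * a passed-in slot pointer. */
		kstat_this_cpu->cpustat[index] += tmp;
		cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);

		/* Account for system time used */
		acct_update_integrals(p);
	}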
>
>> {
>> - cputime64_t tmp = cputime_to_cputime64(cputime);
>> + u64 tmp = cputime_to_cputime64(cputime);
>>
>> /* Add system time to process. */
>> p->stime = cputime_add(p->stime, cputime);
>> @@ -3935,7 +3936,7 @@ void __account_system_time(struct task_struct
>> *p, cputime_t cputime,
>> account_group_system_time(p, cputime);
>>
>> /* Add system time to cpustat. */
>> - *target_cputime64 = cputime64_add(*target_cputime64, tmp);
>> + *target_cputime64 += tmp;
>> cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
>>
>> /* Account for system time used */
>> @@ -3952,8 +3953,8 @@ void __account_system_time(struct task_struct
>> *p, cputime_t cputime,
>> void account_system_time(struct task_struct *p, int hardirq_offset,
>> cputime_t cputime, cputime_t cputime_scaled)
>> {
>> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
>> - cputime64_t *target_cputime64;
>> + u64 *cpustat = kstat_this_cpu->cpustat;
>> + u64 *target_cputime64;
>>
>> if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
>> account_guest_time(p, cputime, cputime_scaled);
>> @@ -3961,11 +3962,11 @@ void account_system_time(struct task_struct
>> *p, int hardirq_offset,
>> }
>>
>> if (hardirq_count() - hardirq_offset)
>> - target_cputime64 = &cpustat->irq;
>> + target_cputime64 = &cpustat[IRQ];
>> else if (in_serving_softirq())
>> - target_cputime64 = &cpustat->softirq;
>> + target_cputime64 = &cpustat[SOFTIRQ];
>> else
>> - target_cputime64 = &cpustat->system;
>> + target_cputime64 = &cpustat[SYSTEM];
>>
>> __account_system_time(p, cputime, cputime_scaled, target_cputime64);
>> }
>> @@ -3976,10 +3977,10 @@ void account_system_time(struct task_struct
>> *p, int hardirq_offset,
>> */
>> void account_steal_time(cputime_t cputime)
>> {
>> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
>> - cputime64_t cputime64 = cputime_to_cputime64(cputime);
>> + u64 *cpustat = kstat_this_cpu->cpustat;
>> + u64 cputime64 = cputime_to_cputime64(cputime);
>>
>> - cpustat->steal = cputime64_add(cpustat->steal, cputime64);
>> + cpustat[STEAL] += cputime64;
>> }
>>
>> /*
>> @@ -3988,14 +3989,14 @@ void account_steal_time(cputime_t cputime)
>> */
>> void account_idle_time(cputime_t cputime)
>> {
>> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
>> - cputime64_t cputime64 = cputime_to_cputime64(cputime);
>> + u64 *cpustat = kstat_this_cpu->cpustat;
>> + u64 cputime64 = cputime_to_cputime64(cputime);
>> struct rq *rq = this_rq();
>>
>> if (atomic_read(&rq->nr_iowait) > 0)
>> - cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
>> + cpustat[IOWAIT] += cputime64;
>> else
>> - cpustat->idle = cputime64_add(cpustat->idle, cputime64);
>> + cpustat[IDLE] += cputime64;
>> }
>>
>> static __always_inline bool steal_account_process_tick(void)
>> @@ -4045,16 +4046,16 @@ static void
>> irqtime_account_process_tick(struct task_struct *p, int user_tick,
>> struct rq *rq)
>> {
>> cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
>> - cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
>> - struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
>> + u64 tmp = cputime_to_cputime64(cputime_one_jiffy);
>> + u64 *cpustat = kstat_this_cpu->cpustat;
>>
>> if (steal_account_process_tick())
>> return;
>>
>> if (irqtime_account_hi_update()) {
>> - cpustat->irq = cputime64_add(cpustat->irq, tmp);
>> + cpustat[IRQ] += tmp;
>> } else if (irqtime_account_si_update()) {
>> - cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
>> + cpustat[SOFTIRQ] += tmp;
>> } else if (this_cpu_ksoftirqd() == p) {
>> /*
>> * ksoftirqd time do not get accounted in cpu_softirq_time.
>> @@ -4062,7 +4063,7 @@ static void irqtime_account_process_tick(struct
>> task_struct *p, int user_tick,
>> * Also, p->stime needs to be updated for ksoftirqd.
>> */
>> __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
>> - &cpustat->softirq);
>> + &cpustat[SOFTIRQ]);
>> } else if (user_tick) {
>> account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
>> } else if (p == rq->idle) {
>> @@ -4071,7 +4072,7 @@ static void irqtime_account_process_tick(struct
>> task_struct *p, int user_tick,
>> account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
>> } else {
>> __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
>> - &cpustat->system);
>> + &cpustat[SYSTEM]);
>> }
>> }
>>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/4] Change cpustat fields to an array.
2011-11-16 11:25 ` Glauber Costa
@ 2011-11-16 11:31 ` Glauber Costa
0 siblings, 0 replies; 17+ messages in thread
From: Glauber Costa @ 2011-11-16 11:31 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
jbottomley, cgroups
Ok, I missed the other comments in this patch.
Here it goes:
On 11/16/2011 09:25 AM, Glauber Costa wrote:
>>> for_each_possible_cpu(i) {
>>> - user = cputime64_add(user, kstat_cpu(i).cpustat.user);
>>> - nice = cputime64_add(nice, kstat_cpu(i).cpustat.nice);
>>> - system = cputime64_add(system, kstat_cpu(i).cpustat.system);
>>> - idle = cputime64_add(idle, get_idle_time(i));
>>> - iowait = cputime64_add(iowait, get_iowait_time(i));
>>> - irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
>>> - softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
>>> - steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
>>> - guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest);
>>> - guest_nice = cputime64_add(guest_nice,
>>> - kstat_cpu(i).cpustat.guest_nice);
>>> - sum += kstat_cpu_irqs_sum(i);
>>> - sum += arch_irq_stat_cpu(i);
>>> + user += kstat_cpu(i).cpustat[USER];
>>
>> Half the time cputime64_add is preserved, half the time this patch
>> converts it to a naked '+='. Admittedly no one seems to usefully define
>> cputime64_add but why the conversion / inconsistency?
>
> Because at least in this patchset of mine, cputime conversion is not a
> goal, but a side effect. I wanted cpustat to be an array of u64, so I
> converted users. There are some places in the code that still use
> cputime64 for other variables, and those I kept untouched.
>
> That's to make the patch focused. The other variables can be easily
> converted if we see value in it.
Outside of sched.c I did miss some. Those I can chase and convert.
>>> +enum cpu_usage_stat {
>>> + USER,
>>> + NICE,
>>> + SYSTEM,
>>> + SOFTIRQ,
>>> + IRQ,
>>> + IDLE,
>>> + IOWAIT,
>>> + STEAL,
>>> + GUEST,
>>> + GUEST_NICE,
>>> + NR_STATS,
>>> };
>>
>> I suspect we want a more descriptive prefix here, e.g. CPUTIME_USER
Ok.
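(Presumably something along these lines, then — the CPUTIME_ names being the
hypothetical result of that rename:)

	enum cpu_usage_stat {
		CPUTIME_USER,
		CPUTIME_NICE,
		CPUTIME_SYSTEM,
		CPUTIME_SOFTIRQ,
		CPUTIME_IRQ,
		CPUTIME_IDLE,
		CPUTIME_IOWAIT,
		CPUTIME_STEAL,
		CPUTIME_GUEST,
		CPUTIME_GUEST_NICE,
		NR_STATS,
	};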
>>>
>>> /* Add user time to cpustat. */
>>> tmp = cputime_to_cputime64(cputime);
>>> +
>>> if (TASK_NICE(p) > 0)
>>
>> Now that these are actually fields this could be:
>> field = TASK_NICE(p) > 0 ? CPUTIME_NICE : CPUTIME_USER;
Yes, absolutely.
>>> @@ -3925,9 +3926,9 @@ static void account_guest_time(struct
>>> task_struct *p, cputime_t cputime,
>>> */
>>> static inline
>>> void __account_system_time(struct task_struct *p, cputime_t cputime,
>>> - cputime_t cputime_scaled, cputime64_t *target_cputime64)
>>> + cputime_t cputime_scaled, u64 *target_cputime64)
>>
>> Having cpustat be an array means we can drop the pointer here and pass
>> the id.
And that's precisely what I do in a later patch, with account_field. I
can move that code here if you prefer.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] split kernel stat in two
2011-11-16 6:12 ` Paul Turner
@ 2011-11-16 11:34 ` Glauber Costa
0 siblings, 0 replies; 17+ messages in thread
From: Glauber Costa @ 2011-11-16 11:34 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
jbottomley, cgroups
On 11/16/2011 04:12 AM, Paul Turner wrote:
> On 11/15/2011 07:59 AM, Glauber Costa wrote:
>> In a later patch, we will use cpustat information per-task group.
>> However, some of its fields are naturally global, such as the irq
>> counters. There is no need to impose the task group overhead to them
>> in this case. So better separate them.
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: Paul Turner <pjt@google.com>
>> ---
>> arch/s390/appldata/appldata_os.c | 16 +++++++-------
>> arch/x86/include/asm/i387.h | 2 +-
>> drivers/cpufreq/cpufreq_conservative.c | 20 ++++++++--------
>> drivers/cpufreq/cpufreq_ondemand.c | 20 ++++++++--------
>> drivers/macintosh/rack-meter.c | 6 ++--
>> fs/proc/stat.c | 36 ++++++++++++++++----------------
>> include/linux/kernel_stat.h | 8 ++++++-
>> kernel/sched.c | 18 ++++++++-------
>> 8 files changed, 67 insertions(+), 59 deletions(-)
>>
>> diff --git a/arch/s390/appldata/appldata_os.c
>> b/arch/s390/appldata/appldata_os.c
>> index 3d6b672..695388a 100644
>> --- a/arch/s390/appldata/appldata_os.c
>> +++ b/arch/s390/appldata/appldata_os.c
>> @@ -115,21 +115,21 @@ static void appldata_get_os_data(void *data)
>> j = 0;
>> for_each_online_cpu(i) {
>> os_data->os_cpu[j].per_cpu_user =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat[USER]);
>> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[USER]);
>> os_data->os_cpu[j].per_cpu_nice =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat[NICE]);
>> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[NICE]);
>> os_data->os_cpu[j].per_cpu_system =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat[SYSTEM]);
>> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[SYSTEM]);
>> os_data->os_cpu[j].per_cpu_idle =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat[IDLE]);
>> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[IDLE]);
>> os_data->os_cpu[j].per_cpu_irq =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat[IRQ]);
>> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[IRQ]);
>> os_data->os_cpu[j].per_cpu_softirq =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat[SOFTIRQ]);
>> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[SOFTIRQ]);
>> os_data->os_cpu[j].per_cpu_iowait =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat[IOWAIT]);
>> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[IOWAIT]);
>> os_data->os_cpu[j].per_cpu_steal =
>> - cputime_to_jiffies(kstat_cpu(i).cpustat[STEAL]);
>> + cputime_to_jiffies(kcpustat_cpu(i).cpustat[STEAL]);
>> os_data->os_cpu[j].cpu_id = i;
>> j++;
>> }
>> diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
>> index 56fa4d7..1f1b536 100644
>> --- a/arch/x86/include/asm/i387.h
>> +++ b/arch/x86/include/asm/i387.h
>> @@ -218,7 +218,7 @@ static inline void fpu_fxsave(struct fpu *fpu)
>> #ifdef CONFIG_SMP
>> #define safe_address (__per_cpu_offset[0])
>> #else
>> -#define safe_address (kstat_cpu(0).cpustat[USER])
>> +#define safe_address (__get_cpu_var(kernel_cpustat).cpustat[USER])
>> #endif
>>
>> /*
>> diff --git a/drivers/cpufreq/cpufreq_conservative.c
>> b/drivers/cpufreq/cpufreq_conservative.c
>> index 2ab538f..a3a739f 100644
>> --- a/drivers/cpufreq/cpufreq_conservative.c
>> +++ b/drivers/cpufreq/cpufreq_conservative.c
>> @@ -103,13 +103,13 @@ static inline cputime64_t
>> get_cpu_idle_time_jiffy(unsigned int cpu,
>> cputime64_t busy_time;
>>
>> cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
>> - busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
>> - kstat_cpu(cpu).cpustat[SYSTEM]);
>> + busy_time = cputime64_add(kcpustat_cpu(cpu).cpustat[USER],
>> + kcpustat_cpu(cpu).cpustat[SYSTEM]);
>>
>> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
>> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
>> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
>> - busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
>> + busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[IRQ]);
>> + busy_time = cputime64_add(busy_time,
>> kcpustat_cpu(cpu).cpustat[SOFTIRQ]);
>> + busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[STEAL]);
>> + busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[NICE]);
>>
>
> This clobbers almost *all* the same lines as the last patch. There has
> to be a more readable way of structuring these 2 patches.
Yes, there is: merging them into the same patch. But I preferred to keep
them logically separated. One of them brings in index access, the other
changes the macro name. I think this is more important than the eventual
clobber, but let me know your preference.
>> -struct kernel_stat {
>> +struct kernel_cpustat {
>> u64 cpustat[NR_STATS];
>> +};
>> +
>> +struct kernel_stat {
>> #ifndef CONFIG_GENERIC_HARDIRQS
>> unsigned int irqs[NR_IRQS];
>> #endif
>> @@ -40,10 +43,13 @@ struct kernel_stat {
>> };
>>
>> DECLARE_PER_CPU(struct kernel_stat, kstat);
>> +DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
>>
>> /* Must have preemption disabled for this to be meaningful. */
>> #define kstat_this_cpu (&__get_cpu_var(kstat))
>> +#define kcpustat_this_cpu (&__get_cpu_var(kernel_cpustat))
>> #define kstat_cpu(cpu) per_cpu(kstat, cpu)
>> +#define kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu)
>>
>> extern unsigned long long nr_context_switches(void);
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 7ac5aa6..efdd4d8 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -2158,7 +2158,7 @@ static void update_rq_clock_task(struct rq *rq,
>> s64 delta)
>> #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>> static int irqtime_account_hi_update(void)
>> {
>> - u64 *cpustat = kstat_this_cpu->cpustat;
>> + u64 *cpustat = kcpustat_this_cpu->cpustat;
>> unsigned long flags;
>> u64 latest_ns;
>> int ret = 0;
>> @@ -2173,7 +2173,7 @@ static int irqtime_account_hi_update(void)
>>
>> static int irqtime_account_si_update(void)
>> {
>> - u64 *cpustat = kstat_this_cpu->cpustat;
>> + u64 *cpustat = kcpustat_this_cpu->cpustat;
>> unsigned long flags;
>> u64 latest_ns;
>> int ret = 0;
>> @@ -3803,8 +3803,10 @@ unlock:
>> #endif
>>
>> DEFINE_PER_CPU(struct kernel_stat, kstat);
>> +DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
>>
>> EXPORT_PER_CPU_SYMBOL(kstat);
>> +EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
>
> This would want a big fat comment explaining the difference.
>
Fair.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 3/4] Keep scheduler statistics per cgroup
2011-11-16 7:02 ` Paul Turner
@ 2011-11-16 11:56 ` Glauber Costa
0 siblings, 0 replies; 17+ messages in thread
From: Glauber Costa @ 2011-11-16 11:56 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
jbottomley, cgroups
On 11/16/2011 05:02 AM, Paul Turner wrote:
> On 11/15/2011 07:59 AM, Glauber Costa wrote:
>> This patch makes the scheduler statistics, such as user ticks,
>> system ticks, etc, per-cgroup. With this information, we are
>> able to display the same information we currently do in cpuacct
>> cgroup, but within the normal cpu cgroup.
>>
>
> Hmm,
>
> So this goes a little beyond the existing stats exported by cpuacct.
>
> Currently we have:
>
> CPUACCT_STAT_USER
> CPUACCT_STAT_SYSTEM (in cpuacct.stat)
> and
> cpuacct.usage / cpuacct.usage_per_cpu
>
> Arguably the last two stats are the *most* useful information exported
> by cpuacct (and the ones we get for free from existing sched_entity
> accounting). But their functionality is not maintained.
Of course it is. But not in *this* patchset. If you look at the last one
I sent, with all the functionality, before a split attempt, you will see
that this is indeed used.
What I tried to achieve here is a minimal preparation set.
> As proposed in: https://lkml.org/lkml/2011/11/11/265
> I'm not sure we really want to bring the other stats /within/ the CPU
> controller.
>
> Furthermore, given your stated goal of virtualizing some of the
> /proc interfaces using this export, it definitely seems like these fields
> (and any future behavioral changes they may enable) should be independent
> from core cpu.
>
> (/me ... reads through patch then continues thoughts at bottom.)
>> +static struct jump_label_key sched_cgroup_enabled;
>
> This name does not really suggest what this jump-label is used for.
>
> Something like task_group_sched_stats_enabled is much clearer.
OK.
>> +static int sched_has_sched_stats = 0;
>> +
>> +struct kernel_cpustat *task_group_kstat(struct task_struct *p)
>> +{
>> + if (static_branch(&sched_cgroup_enabled)) {
>> + struct task_group *tg;
>> + tg = task_group(p);
>> + return tg->cpustat;
>> + }
>> +
>> + return &kernel_cpustat;
>> +}
>> +EXPORT_SYMBOL(task_group_kstat);
>> +
>> /* Change a task's cfs_rq and parent entity if it moves across
>> CPUs/groups */
>> static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
>> {
>> @@ -784,9 +806,36 @@ static inline struct task_group
>> *task_group(struct task_struct *p)
>> {
>> return NULL;
>> }
>> -
>> #endif /* CONFIG_CGROUP_SCHED */
>>
>> +static inline void task_group_account_field(struct task_struct *p,
>> + u64 tmp, int index)
>> +{
>> + /*
>> + * Since all updates are sure to touch the root cgroup, we
>> + * go ahead and touch it first. If the root cgroup
>> + * is the only cgroup, then nothing else should be necessary.
>> + *
>> + */
>> + __get_cpu_var(kernel_cpustat).cpustat[index] += tmp;
>
>> +
>> +#ifdef CONFIG_CGROUP_SCHED
>> + if (static_branch(&sched_cgroup_enabled)) {
>> + struct kernel_cpustat *kcpustat;
>> + struct task_group *tg;
>> +
>> + rcu_read_lock();
>> + tg = task_group(p);
>> + while (tg && (tg != &root_task_group)) {
>
> You could use for_each_entity starting from &p->se here.
Yes, but note that I explicitly need to skip the root. So in the end I
think that could do more operations than this.
>> + kcpustat = this_cpu_ptr(tg->cpustat);
>
> This is going to have to do the this_cpu_ptr work at every level; we
> already know what cpu we're on and can reference it directly.
How exactly? I thought this_cpu_ptr was always needed in the case of
dynamically allocated percpu variables.
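(Schematically, the percpu relation pjt seems to be pointing at — hedged,
assuming the usual percpu API:)

	/* per_cpu_ptr() also accepts alloc_percpu() pointers;
	 * this_cpu_ptr(ptr) is in effect per_cpu_ptr(ptr, smp_processor_id()),
	 * so the cpu lookup can be hoisted out of the hierarchy walk: */
	int cpu = smp_processor_id();
	struct kernel_cpustat *kcpustat = per_cpu_ptr(tg->cpustat, cpu);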
>> local_irq_save(flags);
>> latest_ns = this_cpu_read(cpu_hardirq_time);
>> + kstat_lock();
>
> This protects what?
This is basically an RCU read-side section (or nothing in the disabled
case) that guarantees that the task group won't go away while we read it.
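(i.e., schematically, the guard pins the task_group — and hence its percpu
cpustat area — for the duration of the access, as in the hunk above:)

	kstat_lock();				/* rcu_read_lock() with CGROUP_SCHED */
	cpustat = kcpustat_this_cpu->cpustat;	/* may point into tg->cpustat */
	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[IRQ]))
		ret = 1;
	kstat_unlock();				/* tg can only be freed after this */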
>> struct cgroup_subsys cpu_cgroup_subsys = {
>
> So I'm not seeing any touch-points that intrinsically benefit from being
> a part of the cpu sub-system. The hierarchy walk in
> task_group_account_field() is completely independent from the rest of
> the controller.
Indeed we gain more in cpuusage, which is left out of this patchset due
to your earlier request.
>
> The argument for merging {usage, usage_per_cpu} into cpu is almost
> entirely performance based -- the stats are very useful from a
> management perspective and we already maintain (hidden) versions of them
> in cpu. Whereas, as it stands, this really would seem not to suffer
> at all from being in its own controller. I previously suggested that
> this might want to be a "co-controller" (e.g. one that only ever exists
> mounted adjacent to cpu so that we could leverage the existing hierarchy
> without over-loading the core of "cpu"). But I'm not even sure that is
> required or beneficial given that this isn't going to add value or make
> anything cheaper.
Please take a look at how I address the cpuusage problem in my original
patchset, and see if you still hold this opinion. (continued below)
> From that perspective, perhaps what you're looking for *really* is best
> served just by greatly extending the stats exported by cpuacct (as above).
I myself don't care in which cgroup this lives, as long as I can export
the cpustat information easily.
But in the end, I still believe having one sched-related cgroup instead
of two is:
1) simpler, since the grouping already exists in task_groups;
2) cheaper, especially once we account cpuusage.
You seem to put more stress on the fact that statistics are logically
separate from the controlling itself, which still works for me. But
looking at it from another perspective, there is no harm in letting the
statistics be collected close to their natural grouping.
> We'll still pull {usage, usage_per_cpu} into "cpu" for the common case
> but those who really want everything else could continue using "cpuacct".
>
> Reasonable?
I don't think so, unless we can rip cpuusage out of cpuacct. If you think
about it, the common case will really be to have both enabled. And then
the performance hit is there anyway, which defeats the point of moving
cpuusage to the cpu cgroup in the first place.
The way I see it, there are two possible ways to do it:
1) Everything in the cpu cgroup, including cpuusage.
2) All the stats in cpuacct, including cpuusage.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 4/4] provide a version of cpuacct statistics inside cpu cgroup
2011-11-15 15:59 ` [PATCH 4/4] provide a version of cpuacct statistics inside cpu cgroup Glauber Costa
@ 2011-11-17 7:12 ` Balbir Singh
0 siblings, 0 replies; 17+ messages in thread
From: Balbir Singh @ 2011-11-17 7:12 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
jbottomley, pjt, cgroups
On Tue, Nov 15, 2011 at 9:29 PM, Glauber Costa <glommer@parallels.com> wrote:
> For users interested in using the information currently displayed
> at cpuacct.stat, we provide it inside the cpu cgroup.
I presume I need to mount cpu to see these stats, and cpu implies
control today, no? Can I monitor the statistics alone, without also
enabling control? Please see the other thread as well.
Balbir
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/4] Provide cpuacct functionality in cpu cgroup
2011-11-16 0:57 ` [PATCH 0/4] Provide cpuacct functionality in " KAMEZAWA Hiroyuki
@ 2011-11-23 10:29 ` Glauber Costa
0 siblings, 0 replies; 17+ messages in thread
From: Glauber Costa @ 2011-11-23 10:29 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
jbottomley, pjt, cgroups
On 11/15/2011 10:57 PM, KAMEZAWA Hiroyuki wrote:
> On Tue, 15 Nov 2011 13:59:13 -0200
> Glauber Costa<glommer@parallels.com> wrote:
>
>> Hi,
>>
>> This is an excerpt of the last patches I sent regarding cpu cgroup.
>> It is mostly focused on cleaning up what we have now, so it can
>> be considered largely preparation. As a user of the new organization
>> of things, I am including cpuacct stats functionality in the end of
>> the series. The files related to cpuusage are left to be sent in
>> an upcoming series after this one is included.
>>
>> Let me know if there is anything you'd like me to address.
>>
>
> I'm sorry, but let me ask several questions.
>
> Why the cpu cgroup rather than the cpuacct cgroup?
> If scheduler stats are reported via the cpu cgroup, that makes sense.
> If user accounting information is reported via the cpu cgroup, it feels strange.
> Do you want to make the cpuacct cgroup obsolete and merge the two cgroups?
>
> What is the relationship between these new counters and the cpuacct cgroup?
> Won't users be confused?
>
> Please provide documentation.
Hi Kame,
Just so you know, I intend to withdraw this. The idea came from a
discussion with Peter, where we were thinking about ways to avoid
collecting statistics more than once, and walking structures more than
once as well (cpuusage, for instance, was already being collected for
the task_groups in the scheduler).
Now, Balbir pointed out that the use case for this really needs to keep
the statistics orthogonal to scheduler entities. So with that in mind,
merging them would complicate things more than it would simplify them.
I think I can go back to my original version, where /proc/stat data is
shown according to data collected in cpuacct.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/4] Change cpustat fields to an array.
2011-11-25 1:33 [PATCH 0/4] cpuacct cleanup Glauber Costa
@ 2011-11-25 1:33 ` Glauber Costa
2011-11-25 2:33 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 17+ messages in thread
From: Glauber Costa @ 2011-11-25 1:33 UTC (permalink / raw)
To: linux-kernel
Cc: lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt, cgroups,
devel, Glauber Costa
This patch changes the fields in cpustat from a structure to a
u64 array. The math gets easier, and the code is more flexible.
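As an illustration of the flexibility (a hypothetical helper, not part
of this patch), summing every accounted category for a CPU becomes a
plain loop over the array:

	static u64 kcpustat_total(int cpu)
	{
		u64 sum = 0;
		int i;

		/* kcpustat_cpu() and NR_STATS are introduced below */
		for (i = 0; i < NR_STATS; i++)
			sum += kcpustat_cpu(cpu).cpustat[i];
		return sum;
	}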
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Paul Turner <pjt@google.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
arch/s390/appldata/appldata_os.c | 16 +++---
arch/x86/include/asm/i387.h | 2 +-
drivers/cpufreq/cpufreq_conservative.c | 38 +++++++--------
drivers/cpufreq/cpufreq_ondemand.c | 38 +++++++--------
drivers/macintosh/rack-meter.c | 8 ++--
fs/proc/stat.c | 63 ++++++++++++--------------
fs/proc/uptime.c | 4 +-
include/linux/kernel_stat.h | 36 +++++++++------
kernel/sched.c | 78 ++++++++++++++++---------------
9 files changed, 142 insertions(+), 141 deletions(-)
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 92f1cb7..4de031d 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -115,21 +115,21 @@ static void appldata_get_os_data(void *data)
j = 0;
for_each_online_cpu(i) {
os_data->os_cpu[j].per_cpu_user =
- cputime_to_jiffies(kstat_cpu(i).cpustat.user);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_USER]);
os_data->os_cpu[j].per_cpu_nice =
- cputime_to_jiffies(kstat_cpu(i).cpustat.nice);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_NICE]);
os_data->os_cpu[j].per_cpu_system =
- cputime_to_jiffies(kstat_cpu(i).cpustat.system);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM]);
os_data->os_cpu[j].per_cpu_idle =
- cputime_to_jiffies(kstat_cpu(i).cpustat.idle);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IDLE]);
os_data->os_cpu[j].per_cpu_irq =
- cputime_to_jiffies(kstat_cpu(i).cpustat.irq);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IRQ]);
os_data->os_cpu[j].per_cpu_softirq =
- cputime_to_jiffies(kstat_cpu(i).cpustat.softirq);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]);
os_data->os_cpu[j].per_cpu_iowait =
- cputime_to_jiffies(kstat_cpu(i).cpustat.iowait);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IOWAIT]);
os_data->os_cpu[j].per_cpu_steal =
- cputime_to_jiffies(kstat_cpu(i).cpustat.steal);
+ cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_STEAL]);
os_data->os_cpu[j].cpu_id = i;
j++;
}
diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index c9e09ea..6919e93 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -218,7 +218,7 @@ static inline void fpu_fxsave(struct fpu *fpu)
#ifdef CONFIG_SMP
#define safe_address (__per_cpu_offset[0])
#else
-#define safe_address (kstat_cpu(0).cpustat.user)
+#define safe_address (__get_cpu_var(kernel_cpustat).cpustat[CPUTIME_USER])
#endif
/*
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index c97b468..118bff7 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -95,27 +95,26 @@ static struct dbs_tuners {
.freq_step = 5,
};
-static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
- cputime64_t *wall)
+static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall)
{
- cputime64_t idle_time;
+ u64 idle_time;
cputime64_t cur_wall_time;
- cputime64_t busy_time;
+ u64 busy_time;
cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
- busy_time = cputime64_add(kstat_cpu(cpu).cpustat.user,
- kstat_cpu(cpu).cpustat.system);
+ busy_time = kcpustat_cpu(cpu).cpustat[CPUTIME_USER] +
+ kcpustat_cpu(cpu).cpustat[CPUTIME_SYSTEM];
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.irq);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.softirq);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.steal);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.nice);
+ busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_IRQ];
+ busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_SOFTIRQ];
+ busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL];
+ busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
idle_time = cputime64_sub(cur_wall_time, busy_time);
if (wall)
- *wall = (cputime64_t)jiffies_to_usecs(cur_wall_time);
+ *wall = jiffies_to_usecs(cur_wall_time);
- return (cputime64_t)jiffies_to_usecs(idle_time);
+ return jiffies_to_usecs(idle_time);
}
static inline cputime64_t get_cpu_idle_time(unsigned int cpu, cputime64_t *wall)
@@ -272,7 +271,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
&dbs_info->prev_cpu_wall);
if (dbs_tuners_ins.ignore_nice)
- dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
+ dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
}
return count;
}
@@ -362,11 +361,11 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
j_dbs_info->prev_cpu_idle = cur_idle_time;
if (dbs_tuners_ins.ignore_nice) {
- cputime64_t cur_nice;
+ u64 cur_nice;
unsigned long cur_nice_jiffies;
- cur_nice = cputime64_sub(kstat_cpu(j).cpustat.nice,
- j_dbs_info->prev_cpu_nice);
+ cur_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE] -
+ j_dbs_info->prev_cpu_nice;
/*
* Assumption: nice time between sampling periods will
* be less than 2^32 jiffies for 32 bit sys
@@ -374,7 +373,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
cur_nice_jiffies = (unsigned long)
cputime64_to_jiffies64(cur_nice);
- j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
+ j_dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
idle_time += jiffies_to_usecs(cur_nice_jiffies);
}
@@ -501,10 +500,9 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
&j_dbs_info->prev_cpu_wall);
- if (dbs_tuners_ins.ignore_nice) {
+ if (dbs_tuners_ins.ignore_nice)
j_dbs_info->prev_cpu_nice =
- kstat_cpu(j).cpustat.nice;
- }
+ kcpustat_cpu(j).cpustat[CPUTIME_NICE];
}
this_dbs_info->down_skip = 0;
this_dbs_info->requested_freq = policy->cur;
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index fa8af4e..f3d327c 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -119,27 +119,26 @@ static struct dbs_tuners {
.powersave_bias = 0,
};
-static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
- cputime64_t *wall)
+static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall)
{
- cputime64_t idle_time;
+ u64 idle_time;
cputime64_t cur_wall_time;
- cputime64_t busy_time;
+ u64 busy_time;
cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
- busy_time = cputime64_add(kstat_cpu(cpu).cpustat.user,
- kstat_cpu(cpu).cpustat.system);
+ busy_time = kcpustat_cpu(cpu).cpustat[CPUTIME_USER] +
+ kcpustat_cpu(cpu).cpustat[CPUTIME_SYSTEM];
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.irq);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.softirq);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.steal);
- busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.nice);
+ busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_IRQ];
+ busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_SOFTIRQ];
+ busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL];
+ busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
idle_time = cputime64_sub(cur_wall_time, busy_time);
if (wall)
- *wall = (cputime64_t)jiffies_to_usecs(cur_wall_time);
+ *wall = jiffies_to_usecs(cur_wall_time);
- return (cputime64_t)jiffies_to_usecs(idle_time);
+ return jiffies_to_usecs(idle_time);
}
static inline cputime64_t get_cpu_idle_time(unsigned int cpu, cputime64_t *wall)
@@ -345,7 +344,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
&dbs_info->prev_cpu_wall);
if (dbs_tuners_ins.ignore_nice)
- dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
+ dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
}
return count;
@@ -455,11 +454,11 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
j_dbs_info->prev_cpu_iowait = cur_iowait_time;
if (dbs_tuners_ins.ignore_nice) {
- cputime64_t cur_nice;
+ u64 cur_nice;
unsigned long cur_nice_jiffies;
- cur_nice = cputime64_sub(kstat_cpu(j).cpustat.nice,
- j_dbs_info->prev_cpu_nice);
+ cur_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE] -
+ j_dbs_info->prev_cpu_nice;
/*
* Assumption: nice time between sampling periods will
* be less than 2^32 jiffies for 32 bit sys
@@ -467,7 +466,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
cur_nice_jiffies = (unsigned long)
cputime64_to_jiffies64(cur_nice);
- j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
+ j_dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
idle_time += jiffies_to_usecs(cur_nice_jiffies);
}
@@ -646,10 +645,9 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
&j_dbs_info->prev_cpu_wall);
- if (dbs_tuners_ins.ignore_nice) {
+ if (dbs_tuners_ins.ignore_nice)
j_dbs_info->prev_cpu_nice =
- kstat_cpu(j).cpustat.nice;
- }
+ kcpustat_cpu(j).cpustat[CPUTIME_NICE];
}
this_dbs_info->cpu = cpu;
this_dbs_info->rate_mult = 1;
diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index 2637c13..66d7f1c7 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -81,13 +81,13 @@ static int rackmeter_ignore_nice;
*/
static inline cputime64_t get_cpu_idle_time(unsigned int cpu)
{
- cputime64_t retval;
+ u64 retval;
- retval = cputime64_add(kstat_cpu(cpu).cpustat.idle,
- kstat_cpu(cpu).cpustat.iowait);
+ retval = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE] +
+ kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
if (rackmeter_ignore_nice)
- retval = cputime64_add(retval, kstat_cpu(cpu).cpustat.nice);
+ retval += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
return retval;
}
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 42b274d..8a6ab66 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -22,29 +22,27 @@
#define arch_idle_time(cpu) 0
#endif
-static cputime64_t get_idle_time(int cpu)
+static u64 get_idle_time(int cpu)
{
- u64 idle_time = get_cpu_idle_time_us(cpu, NULL);
- cputime64_t idle;
+ u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
if (idle_time == -1ULL) {
/* !NO_HZ so we can rely on cpustat.idle */
- idle = kstat_cpu(cpu).cpustat.idle;
- idle = cputime64_add(idle, arch_idle_time(cpu));
+ idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
+ idle += arch_idle_time(cpu);
} else
idle = usecs_to_cputime(idle_time);
return idle;
}
-static cputime64_t get_iowait_time(int cpu)
+static u64 get_iowait_time(int cpu)
{
- u64 iowait_time = get_cpu_iowait_time_us(cpu, NULL);
- cputime64_t iowait;
+ u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
if (iowait_time == -1ULL)
/* !NO_HZ so we can rely on cpustat.iowait */
- iowait = kstat_cpu(cpu).cpustat.iowait;
+ iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
else
iowait = usecs_to_cputime(iowait_time);
@@ -55,33 +53,30 @@ static int show_stat(struct seq_file *p, void *v)
{
int i, j;
unsigned long jif;
- cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;
- cputime64_t guest, guest_nice;
+ u64 user, nice, system, idle, iowait, irq, softirq, steal;
+ u64 guest, guest_nice;
u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
struct timespec boottime;
user = nice = system = idle = iowait =
- irq = softirq = steal = cputime64_zero;
- guest = guest_nice = cputime64_zero;
+ irq = softirq = steal = 0;
+ guest = guest_nice = 0;
getboottime(&boottime);
jif = boottime.tv_sec;
for_each_possible_cpu(i) {
- user = cputime64_add(user, kstat_cpu(i).cpustat.user);
- nice = cputime64_add(nice, kstat_cpu(i).cpustat.nice);
- system = cputime64_add(system, kstat_cpu(i).cpustat.system);
- idle = cputime64_add(idle, get_idle_time(i));
- iowait = cputime64_add(iowait, get_iowait_time(i));
- irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
- softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
- steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
- guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest);
- guest_nice = cputime64_add(guest_nice,
- kstat_cpu(i).cpustat.guest_nice);
- sum += kstat_cpu_irqs_sum(i);
- sum += arch_irq_stat_cpu(i);
+ user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
+ nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
+ system += kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
+ idle += get_idle_time(i);
+ iowait += get_iowait_time(i);
+ irq += kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
+ softirq += kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
+ steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
+ guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
+ guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
for (j = 0; j < NR_SOFTIRQS; j++) {
unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
@@ -106,16 +101,16 @@ static int show_stat(struct seq_file *p, void *v)
(unsigned long long)cputime64_to_clock_t(guest_nice));
for_each_online_cpu(i) {
/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
- user = kstat_cpu(i).cpustat.user;
- nice = kstat_cpu(i).cpustat.nice;
- system = kstat_cpu(i).cpustat.system;
+ user = kcpustat_cpu(i).cpustat[CPUTIME_USER];
+ nice = kcpustat_cpu(i).cpustat[CPUTIME_NICE];
+ system = kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
idle = get_idle_time(i);
iowait = get_iowait_time(i);
- irq = kstat_cpu(i).cpustat.irq;
- softirq = kstat_cpu(i).cpustat.softirq;
- steal = kstat_cpu(i).cpustat.steal;
- guest = kstat_cpu(i).cpustat.guest;
- guest_nice = kstat_cpu(i).cpustat.guest_nice;
+ irq = kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
+ softirq = kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
+ steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
+ guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
+ guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
seq_printf(p,
"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
"%llu\n",
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index 766b1d4..0fb22e4 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -12,10 +12,10 @@ static int uptime_proc_show(struct seq_file *m, void *v)
struct timespec uptime;
struct timespec idle;
int i;
- cputime_t idletime = cputime_zero;
+ u64 idletime = 0;
for_each_possible_cpu(i)
- idletime = cputime64_add(idletime, kstat_cpu(i).cpustat.idle);
+ idletime += kcpustat_cpu(i).cpustat[CPUTIME_IDLE];
do_posix_clock_monotonic_gettime(&uptime);
monotonic_to_bootbased(&uptime);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 0cce2db..2fbd905 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -6,6 +6,7 @@
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/interrupt.h>
+#include <linux/sched.h>
#include <asm/irq.h>
#include <asm/cputime.h>
@@ -15,21 +16,25 @@
* used by rstatd/perfmeter
*/
-struct cpu_usage_stat {
- cputime64_t user;
- cputime64_t nice;
- cputime64_t system;
- cputime64_t softirq;
- cputime64_t irq;
- cputime64_t idle;
- cputime64_t iowait;
- cputime64_t steal;
- cputime64_t guest;
- cputime64_t guest_nice;
+enum cpu_usage_stat {
+ CPUTIME_USER,
+ CPUTIME_NICE,
+ CPUTIME_SYSTEM,
+ CPUTIME_SOFTIRQ,
+ CPUTIME_IRQ,
+ CPUTIME_IDLE,
+ CPUTIME_IOWAIT,
+ CPUTIME_STEAL,
+ CPUTIME_GUEST,
+ CPUTIME_GUEST_NICE,
+ NR_STATS,
+};
+
+struct kernel_cpustat {
+ u64 cpustat[NR_STATS];
};
struct kernel_stat {
- struct cpu_usage_stat cpustat;
#ifndef CONFIG_GENERIC_HARDIRQS
unsigned int irqs[NR_IRQS];
#endif
@@ -38,10 +43,13 @@ struct kernel_stat {
};
DECLARE_PER_CPU(struct kernel_stat, kstat);
+DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
-#define kstat_cpu(cpu) per_cpu(kstat, cpu)
/* Must have preemption disabled for this to be meaningful. */
-#define kstat_this_cpu __get_cpu_var(kstat)
+#define kstat_this_cpu (&__get_cpu_var(kstat))
+#define kcpustat_this_cpu (&__get_cpu_var(kernel_cpustat))
+#define kstat_cpu(cpu) per_cpu(kstat, cpu)
+#define kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu)
extern unsigned long long nr_context_switches(void);
diff --git a/kernel/sched.c b/kernel/sched.c
index d87c6e5..2e57942 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2158,14 +2158,14 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
static int irqtime_account_hi_update(void)
{
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
unsigned long flags;
u64 latest_ns;
int ret = 0;
local_irq_save(flags);
latest_ns = this_cpu_read(cpu_hardirq_time);
- if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->irq))
+ if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[CPUTIME_IRQ]))
ret = 1;
local_irq_restore(flags);
return ret;
@@ -2173,14 +2173,14 @@ static int irqtime_account_hi_update(void)
static int irqtime_account_si_update(void)
{
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
unsigned long flags;
u64 latest_ns;
int ret = 0;
local_irq_save(flags);
latest_ns = this_cpu_read(cpu_softirq_time);
- if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->softirq))
+ if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[CPUTIME_SOFTIRQ]))
ret = 1;
local_irq_restore(flags);
return ret;
@@ -3803,8 +3803,10 @@ unlock:
#endif
DEFINE_PER_CPU(struct kernel_stat, kstat);
+DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
EXPORT_PER_CPU_SYMBOL(kstat);
+EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
/*
* Return any ns on the sched_clock that have not yet been accounted in
@@ -3866,8 +3868,9 @@ unsigned long long task_sched_runtime(struct task_struct *p)
void account_user_time(struct task_struct *p, cputime_t cputime,
cputime_t cputime_scaled)
{
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
- cputime64_t tmp;
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
+ u64 tmp;
+ int index;
/* Add user time to process. */
p->utime = cputime_add(p->utime, cputime);
@@ -3876,10 +3879,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
/* Add user time to cpustat. */
tmp = cputime_to_cputime64(cputime);
- if (TASK_NICE(p) > 0)
- cpustat->nice = cputime64_add(cpustat->nice, tmp);
- else
- cpustat->user = cputime64_add(cpustat->user, tmp);
+
+ index = (TASK_NICE(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
+ cpustat[index] += tmp;
cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
/* Account for user time used */
@@ -3895,8 +3897,8 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
static void account_guest_time(struct task_struct *p, cputime_t cputime,
cputime_t cputime_scaled)
{
- cputime64_t tmp;
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+ u64 tmp;
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
tmp = cputime_to_cputime64(cputime);
@@ -3908,11 +3910,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
/* Add guest time to cpustat. */
if (TASK_NICE(p) > 0) {
- cpustat->nice = cputime64_add(cpustat->nice, tmp);
- cpustat->guest_nice = cputime64_add(cpustat->guest_nice, tmp);
+ cpustat[CPUTIME_NICE] += tmp;
+ cpustat[CPUTIME_GUEST_NICE] += tmp;
} else {
- cpustat->user = cputime64_add(cpustat->user, tmp);
- cpustat->guest = cputime64_add(cpustat->guest, tmp);
+ cpustat[CPUTIME_USER] += tmp;
+ cpustat[CPUTIME_GUEST] += tmp;
}
}
@@ -3925,9 +3927,10 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
*/
static inline
void __account_system_time(struct task_struct *p, cputime_t cputime,
- cputime_t cputime_scaled, cputime64_t *target_cputime64)
+ cputime_t cputime_scaled, int index)
{
- cputime64_t tmp = cputime_to_cputime64(cputime);
+ u64 tmp = cputime_to_cputime64(cputime);
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
/* Add system time to process. */
p->stime = cputime_add(p->stime, cputime);
@@ -3935,7 +3938,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
account_group_system_time(p, cputime);
/* Add system time to cpustat. */
- *target_cputime64 = cputime64_add(*target_cputime64, tmp);
+ cpustat[index] += tmp;
cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
/* Account for system time used */
@@ -3952,8 +3955,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
void account_system_time(struct task_struct *p, int hardirq_offset,
cputime_t cputime, cputime_t cputime_scaled)
{
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
- cputime64_t *target_cputime64;
+ int index;
if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
account_guest_time(p, cputime, cputime_scaled);
@@ -3961,13 +3963,13 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
}
if (hardirq_count() - hardirq_offset)
- target_cputime64 = &cpustat->irq;
+ index = CPUTIME_IRQ;
else if (in_serving_softirq())
- target_cputime64 = &cpustat->softirq;
+ index = CPUTIME_SOFTIRQ;
else
- target_cputime64 = &cpustat->system;
+ index = CPUTIME_SYSTEM;
- __account_system_time(p, cputime, cputime_scaled, target_cputime64);
+ __account_system_time(p, cputime, cputime_scaled, index);
}
/*
@@ -3976,10 +3978,10 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
*/
void account_steal_time(cputime_t cputime)
{
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
- cputime64_t cputime64 = cputime_to_cputime64(cputime);
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
+ u64 cputime64 = cputime_to_cputime64(cputime);
- cpustat->steal = cputime64_add(cpustat->steal, cputime64);
+ cpustat[CPUTIME_STEAL] += cputime64;
}
/*
@@ -3988,14 +3990,14 @@ void account_steal_time(cputime_t cputime)
*/
void account_idle_time(cputime_t cputime)
{
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
- cputime64_t cputime64 = cputime_to_cputime64(cputime);
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
+ u64 cputime64 = cputime_to_cputime64(cputime);
struct rq *rq = this_rq();
if (atomic_read(&rq->nr_iowait) > 0)
- cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
+ cpustat[CPUTIME_IOWAIT] += cputime64;
else
- cpustat->idle = cputime64_add(cpustat->idle, cputime64);
+ cpustat[CPUTIME_IDLE] += cputime64;
}
static __always_inline bool steal_account_process_tick(void)
@@ -4045,16 +4047,16 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
struct rq *rq)
{
cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
- cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
- struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+ u64 tmp = cputime_to_cputime64(cputime_one_jiffy);
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
if (steal_account_process_tick())
return;
if (irqtime_account_hi_update()) {
- cpustat->irq = cputime64_add(cpustat->irq, tmp);
+ cpustat[CPUTIME_IRQ] += tmp;
} else if (irqtime_account_si_update()) {
- cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
+ cpustat[CPUTIME_SOFTIRQ] += tmp;
} else if (this_cpu_ksoftirqd() == p) {
/*
* ksoftirqd time do not get accounted in cpu_softirq_time.
@@ -4062,7 +4064,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
* Also, p->stime needs to be updated for ksoftirqd.
*/
__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
- &cpustat->softirq);
+ CPUTIME_SOFTIRQ);
} else if (user_tick) {
account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
} else if (p == rq->idle) {
@@ -4071,7 +4073,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
} else {
__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
- &cpustat->system);
+ CPUTIME_SYSTEM);
}
}
--
1.7.6.4
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/4] Change cpustat fields to an array.
2011-11-25 1:33 ` [PATCH 1/4] Change cpustat fields to an array Glauber Costa
@ 2011-11-25 2:33 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-25 2:33 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
cgroups, devel
On Thu, 24 Nov 2011 23:33:23 -0200
Glauber Costa <glommer@parallels.com> wrote:
> This patch changes fields in cpustat from a structure, to an
> u64 array. Math gets easier, and the code is more flexible.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Paul Turner <pjt@google.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
I like this change.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2011-11-25 2:34 UTC | newest]
Thread overview: 17+ messages
-- links below jump to the message on this page --
2011-11-15 15:59 [PATCH 0/4] Provide cpuacct functionality in cpu cgroup Glauber Costa
2011-11-15 15:59 ` [PATCH 1/4] Change cpustat fields to an array Glauber Costa
2011-11-16 5:58 ` Paul Turner
2011-11-16 11:25 ` Glauber Costa
2011-11-16 11:31 ` Glauber Costa
2011-11-15 15:59 ` [PATCH 2/4] split kernel stat in two Glauber Costa
2011-11-16 6:12 ` Paul Turner
2011-11-16 11:34 ` Glauber Costa
2011-11-15 15:59 ` [PATCH 3/4] Keep scheduler statistics per cgroup Glauber Costa
2011-11-16 7:02 ` Paul Turner
2011-11-16 11:56 ` Glauber Costa
2011-11-15 15:59 ` [PATCH 4/4] provide a version of cpuacct statistics inside cpu cgroup Glauber Costa
2011-11-17 7:12 ` Balbir Singh
2011-11-16 0:57 ` [PATCH 0/4] Provide cpuacct functionality in " KAMEZAWA Hiroyuki
2011-11-23 10:29 ` Glauber Costa
-- strict thread matches above, loose matches on Subject: below --
2011-11-25 1:33 [PATCH 0/4] cpuacct cleanup Glauber Costa
2011-11-25 1:33 ` [PATCH 1/4] Change cpustat fields to an array Glauber Costa
2011-11-25 2:33 ` KAMEZAWA Hiroyuki