public inbox for linux-kernel@vger.kernel.org
* [PATCH 00/10] Per-cgroup /proc/stat information
@ 2011-10-02 19:21 Glauber Costa
  2011-10-02 19:21 ` [PATCH 01/10] trivial: initialize root cgroup's sibling list Glauber Costa
                   ` (9 more replies)
  0 siblings, 10 replies; 34+ messages in thread
From: Glauber Costa @ 2011-10-02 19:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley

Hi guys,

After two rounds of initial comments, this is my first
real shot at it. I have tried to address all of the previous
comments; if I missed any, just let me know.

cpustat is now just an array of u64s. I am also proposing
that we deprecate CPUACCT based on this work, but that is
totally open to discussion. cpuacct guys, please chime in.

Hope you enjoy

Glauber Costa (10):
  trivial: initialize root cgroup's sibling list
  Change cpustat fields to an array.
  Move /proc/stat logic inside sched.c
  Display /proc/stat information per cgroup
  Make total_forks per-cgroup
  per-cgroup boot time
  Report steal time for cgroup
  provide a version of cpuacct statistics inside cpu cgroup
  provide a version of cpuusage statistics inside cpu cgroup
  Change CPUACCT to default n

 Documentation/feature-removal-schedule.txt |    8 +
 fs/proc/stat.c                             |  117 +------
 fs/proc/uptime.c                           |    2 +-
 include/linux/kernel_stat.h                |   50 ++-
 include/linux/sched.h                      |    5 +
 init/Kconfig                               |    1 +
 kernel/fork.c                              |    7 +-
 kernel/sched.c                             |  603 ++++++++++++++++++++++------
 kernel/sched_fair.c                        |    2 +-
 kernel/sched_rt.c                          |    2 +-
 10 files changed, 538 insertions(+), 259 deletions(-)

-- 
1.7.6


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 01/10] trivial: initialize root cgroup's sibling list
  2011-10-02 19:21 [PATCH 00/10] Per-cgroup /proc/stat information Glauber Costa
@ 2011-10-02 19:21 ` Glauber Costa
  2011-10-02 19:21 ` [PATCH 02/10] Change cpustat fields to an array Glauber Costa
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Glauber Costa @ 2011-10-02 19:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley,
	Glauber Costa

Even though the root task group currently has no siblings,
its sibling list should be initialized so that it does not
contain bogus values.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Paul Menage <paul@paulmenage.org>
---
 kernel/sched.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ccacdbd..3ed4107 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8082,6 +8082,7 @@ void __init sched_init(void)
 #ifdef CONFIG_CGROUP_SCHED
 	list_add(&root_task_group.list, &task_groups);
 	INIT_LIST_HEAD(&root_task_group.children);
+	INIT_LIST_HEAD(&root_task_group.siblings);
 	autogroup_init(&init_task);
 #endif /* CONFIG_CGROUP_SCHED */
 
-- 
1.7.6



* [PATCH 02/10] Change cpustat fields to an array.
  2011-10-02 19:21 [PATCH 00/10] Per-cgroup /proc/stat information Glauber Costa
  2011-10-02 19:21 ` [PATCH 01/10] trivial: initialize root cgroup's sibling list Glauber Costa
@ 2011-10-02 19:21 ` Glauber Costa
  2011-10-02 19:21 ` [PATCH 03/10] Move /proc/stat logic inside sched.c Glauber Costa
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Glauber Costa @ 2011-10-02 19:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley,
	Glauber Costa

This gives us a bit more flexibility to deal with the
fields in this structure. It is a preparation patch for
later patches in this series.

I tried to keep the accessor macros unchanged, so we don't
need to patch all users. At some point I gave up on
kstat_this_cpu; but to be fair, that one is not used outside
of sched.c, so it is not a big deal.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/proc/stat.c              |   53 ++++++++++++++++----------------
 fs/proc/uptime.c            |    2 +-
 include/linux/kernel_stat.h |   38 ++++++++++++-----------
 kernel/sched.c              |   71 ++++++++++++++++++++++---------------------
 4 files changed, 83 insertions(+), 81 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 9758b65..e8e2b39 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -25,32 +25,31 @@ static int show_stat(struct seq_file *p, void *v)
 {
 	int i, j;
 	unsigned long jif;
-	cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;
-	cputime64_t guest, guest_nice;
+	u64 user, nice, system, idle, iowait, irq, softirq, steal;
+	u64 guest, guest_nice;
 	u64 sum = 0;
 	u64 sum_softirq = 0;
 	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
 	struct timespec boottime;
 
 	user = nice = system = idle = iowait =
-		irq = softirq = steal = cputime64_zero;
-	guest = guest_nice = cputime64_zero;
+		irq = softirq = steal = 0;
+	guest = guest_nice = 0;
 	getboottime(&boottime);
 	jif = boottime.tv_sec;
 
 	for_each_possible_cpu(i) {
-		user = cputime64_add(user, kstat_cpu(i).cpustat.user);
-		nice = cputime64_add(nice, kstat_cpu(i).cpustat.nice);
-		system = cputime64_add(system, kstat_cpu(i).cpustat.system);
-		idle = cputime64_add(idle, kstat_cpu(i).cpustat.idle);
-		idle = cputime64_add(idle, arch_idle_time(i));
-		iowait = cputime64_add(iowait, kstat_cpu(i).cpustat.iowait);
-		irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
-		softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
-		steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
-		guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest);
-		guest_nice = cputime64_add(guest_nice,
-			kstat_cpu(i).cpustat.guest_nice);
+		user += kstat_cpu(i).cpustat[USER];
+		nice += kstat_cpu(i).cpustat[NICE];
+		system += kstat_cpu(i).cpustat[SYSTEM];
+		idle += kstat_cpu(i).cpustat[IDLE];
+		idle += arch_idle_time(i);
+		iowait += kstat_cpu(i).cpustat[IOWAIT];
+		irq += kstat_cpu(i).cpustat[IRQ];
+		softirq += kstat_cpu(i).cpustat[SOFTIRQ];
+		steal += kstat_cpu(i).cpustat[STEAL];
+		guest += kstat_cpu(i).cpustat[GUEST];
+		guest_nice += kstat_cpu(i).cpustat[GUEST_NICE];
 		sum += kstat_cpu_irqs_sum(i);
 		sum += arch_irq_stat_cpu(i);
 
@@ -78,17 +77,17 @@ static int show_stat(struct seq_file *p, void *v)
 	for_each_online_cpu(i) {
 
 		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kstat_cpu(i).cpustat.user;
-		nice = kstat_cpu(i).cpustat.nice;
-		system = kstat_cpu(i).cpustat.system;
-		idle = kstat_cpu(i).cpustat.idle;
-		idle = cputime64_add(idle, arch_idle_time(i));
-		iowait = kstat_cpu(i).cpustat.iowait;
-		irq = kstat_cpu(i).cpustat.irq;
-		softirq = kstat_cpu(i).cpustat.softirq;
-		steal = kstat_cpu(i).cpustat.steal;
-		guest = kstat_cpu(i).cpustat.guest;
-		guest_nice = kstat_cpu(i).cpustat.guest_nice;
+		user = kstat_cpu(i).cpustat[USER];
+		nice = kstat_cpu(i).cpustat[NICE];
+		system = kstat_cpu(i).cpustat[SYSTEM];
+		idle = kstat_cpu(i).cpustat[IDLE];
+		idle += arch_idle_time(i);
+		iowait = kstat_cpu(i).cpustat[IOWAIT];
+		irq = kstat_cpu(i).cpustat[IRQ];
+		softirq = kstat_cpu(i).cpustat[SOFTIRQ];
+		steal = kstat_cpu(i).cpustat[STEAL];
+		guest = kstat_cpu(i).cpustat[GUEST];
+		guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
 		seq_printf(p,
 			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
 			"%llu\n",
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index 766b1d4..b0e053d 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -15,7 +15,7 @@ static int uptime_proc_show(struct seq_file *m, void *v)
 	cputime_t idletime = cputime_zero;
 
 	for_each_possible_cpu(i)
-		idletime = cputime64_add(idletime, kstat_cpu(i).cpustat.idle);
+		idletime = cputime64_add(idletime, kstat_cpu(i).cpustat[IDLE]);
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	monotonic_to_bootbased(&uptime);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 0cce2db..897eabf 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -6,6 +6,7 @@
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
 #include <linux/interrupt.h>
+#include <linux/sched.h>
 #include <asm/irq.h>
 #include <asm/cputime.h>
 
@@ -15,21 +16,22 @@
  * used by rstatd/perfmeter
  */
 
-struct cpu_usage_stat {
-	cputime64_t user;
-	cputime64_t nice;
-	cputime64_t system;
-	cputime64_t softirq;
-	cputime64_t irq;
-	cputime64_t idle;
-	cputime64_t iowait;
-	cputime64_t steal;
-	cputime64_t guest;
-	cputime64_t guest_nice;
+enum cpu_usage_stat {
+	USER,
+	NICE,
+	SYSTEM,
+	SOFTIRQ,
+	IRQ,
+	IDLE,
+	IOWAIT,
+	STEAL,
+	GUEST,
+	GUEST_NICE,
+	NR_STATS,
 };
 
 struct kernel_stat {
-	struct cpu_usage_stat	cpustat;
+	u64 cpustat[NR_STATS];
 #ifndef CONFIG_GENERIC_HARDIRQS
        unsigned int irqs[NR_IRQS];
 #endif
@@ -39,9 +41,9 @@ struct kernel_stat {
 
 DECLARE_PER_CPU(struct kernel_stat, kstat);
 
-#define kstat_cpu(cpu)	per_cpu(kstat, cpu)
 /* Must have preemption disabled for this to be meaningful. */
-#define kstat_this_cpu	__get_cpu_var(kstat)
+#define kstat_this_cpu (&__get_cpu_var(kstat))
+#define kstat_cpu(cpu) per_cpu(kstat, cpu)
 
 extern unsigned long long nr_context_switches(void);
 
@@ -52,8 +54,8 @@ struct irq_desc;
 static inline void kstat_incr_irqs_this_cpu(unsigned int irq,
 					    struct irq_desc *desc)
 {
-	__this_cpu_inc(kstat.irqs[irq]);
-	__this_cpu_inc(kstat.irqs_sum);
+	kstat_this_cpu->irqs[irq]++;
+	kstat_this_cpu->irqs_sum++;
 }
 
 static inline unsigned int kstat_irqs_cpu(unsigned int irq, int cpu)
@@ -67,14 +69,14 @@ extern unsigned int kstat_irqs_cpu(unsigned int irq, int cpu);
 #define kstat_incr_irqs_this_cpu(irqno, DESC)		\
 do {							\
 	__this_cpu_inc(*(DESC)->kstat_irqs);		\
-	__this_cpu_inc(kstat.irqs_sum);			\
+	kstat_this_cpu->irqs_sum++;			\
 } while (0)
 
 #endif
 
 static inline void kstat_incr_softirqs_this_cpu(unsigned int irq)
 {
-	__this_cpu_inc(kstat.softirqs[irq]);
+	kstat_this_cpu->softirqs[irq]++;
 }
 
 static inline unsigned int kstat_softirqs_cpu(unsigned int irq, int cpu)
diff --git a/kernel/sched.c b/kernel/sched.c
index 3ed4107..8f0fa05 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2004,14 +2004,14 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 static int irqtime_account_hi_update(void)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	u64 *cpustat = kstat_this_cpu->cpustat;
 	unsigned long flags;
 	u64 latest_ns;
 	int ret = 0;
 
 	local_irq_save(flags);
 	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->irq))
+	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[IRQ]))
 		ret = 1;
 	local_irq_restore(flags);
 	return ret;
@@ -2019,14 +2019,14 @@ static int irqtime_account_hi_update(void)
 
 static int irqtime_account_si_update(void)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	u64 *cpustat = kstat_this_cpu->cpustat;
 	unsigned long flags;
 	u64 latest_ns;
 	int ret = 0;
 
 	local_irq_save(flags);
 	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->softirq))
+	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[SOFTIRQ]))
 		ret = 1;
 	local_irq_restore(flags);
 	return ret;
@@ -3757,8 +3757,8 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
 void account_user_time(struct task_struct *p, cputime_t cputime,
 		       cputime_t cputime_scaled)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
-	cputime64_t tmp;
+	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 tmp;
 
 	/* Add user time to process. */
 	p->utime = cputime_add(p->utime, cputime);
@@ -3767,10 +3767,11 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 
 	/* Add user time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
+
 	if (TASK_NICE(p) > 0)
-		cpustat->nice = cputime64_add(cpustat->nice, tmp);
+		cpustat[NICE] += tmp;
 	else
-		cpustat->user = cputime64_add(cpustat->user, tmp);
+		cpustat[USER] += tmp;
 
 	cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
 	/* Account for user time used */
@@ -3786,8 +3787,8 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 static void account_guest_time(struct task_struct *p, cputime_t cputime,
 			       cputime_t cputime_scaled)
 {
-	cputime64_t tmp;
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	u64 tmp;
+	u64 *cpustat = kstat_this_cpu->cpustat;
 
 	tmp = cputime_to_cputime64(cputime);
 
@@ -3799,11 +3800,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 
 	/* Add guest time to cpustat. */
 	if (TASK_NICE(p) > 0) {
-		cpustat->nice = cputime64_add(cpustat->nice, tmp);
-		cpustat->guest_nice = cputime64_add(cpustat->guest_nice, tmp);
+		cpustat[NICE] += tmp;
+		cpustat[GUEST_NICE] += tmp;
 	} else {
-		cpustat->user = cputime64_add(cpustat->user, tmp);
-		cpustat->guest = cputime64_add(cpustat->guest, tmp);
+		cpustat[USER] += tmp;
+		cpustat[GUEST] += tmp;
 	}
 }
 
@@ -3816,9 +3817,9 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
  */
 static inline
 void __account_system_time(struct task_struct *p, cputime_t cputime,
-			cputime_t cputime_scaled, cputime64_t *target_cputime64)
+			cputime_t cputime_scaled, u64 *target_cputime64)
 {
-	cputime64_t tmp = cputime_to_cputime64(cputime);
+	u64 tmp = cputime_to_cputime64(cputime);
 
 	/* Add system time to process. */
 	p->stime = cputime_add(p->stime, cputime);
@@ -3826,7 +3827,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
 	account_group_system_time(p, cputime);
 
 	/* Add system time to cpustat. */
-	*target_cputime64 = cputime64_add(*target_cputime64, tmp);
+	*target_cputime64 += tmp;
 	cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
 
 	/* Account for system time used */
@@ -3843,8 +3844,8 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
 void account_system_time(struct task_struct *p, int hardirq_offset,
 			 cputime_t cputime, cputime_t cputime_scaled)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
-	cputime64_t *target_cputime64;
+	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 *target_cputime64;
 
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
 		account_guest_time(p, cputime, cputime_scaled);
@@ -3852,11 +3853,11 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	}
 
 	if (hardirq_count() - hardirq_offset)
-		target_cputime64 = &cpustat->irq;
+		target_cputime64 = &cpustat[IRQ];
 	else if (in_serving_softirq())
-		target_cputime64 = &cpustat->softirq;
+		target_cputime64 = &cpustat[SOFTIRQ];
 	else
-		target_cputime64 = &cpustat->system;
+		target_cputime64 = &cpustat[SYSTEM];
 
 	__account_system_time(p, cputime, cputime_scaled, target_cputime64);
 }
@@ -3867,10 +3868,10 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
  */
 void account_steal_time(cputime_t cputime)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
-	cputime64_t cputime64 = cputime_to_cputime64(cputime);
+	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 cputime64 = cputime_to_cputime64(cputime);
 
-	cpustat->steal = cputime64_add(cpustat->steal, cputime64);
+	cpustat[STEAL] += cputime64;
 }
 
 /*
@@ -3879,14 +3880,14 @@ void account_steal_time(cputime_t cputime)
  */
 void account_idle_time(cputime_t cputime)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
-	cputime64_t cputime64 = cputime_to_cputime64(cputime);
+	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 cputime64 = cputime_to_cputime64(cputime);
 	struct rq *rq = this_rq();
 
 	if (atomic_read(&rq->nr_iowait) > 0)
-		cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
+		cpustat[IOWAIT] += cputime64;
 	else
-		cpustat->idle = cputime64_add(cpustat->idle, cputime64);
+		cpustat[IDLE] += cputime64;
 }
 
 static __always_inline bool steal_account_process_tick(void)
@@ -3936,16 +3937,16 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 						struct rq *rq)
 {
 	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
-	cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	u64 tmp = cputime_to_cputime64(cputime_one_jiffy);
+	u64 *cpustat = kstat_this_cpu->cpustat;
 
 	if (steal_account_process_tick())
 		return;
 
 	if (irqtime_account_hi_update()) {
-		cpustat->irq = cputime64_add(cpustat->irq, tmp);
+		cpustat[IRQ] += tmp;
 	} else if (irqtime_account_si_update()) {
-		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
+		cpustat[SOFTIRQ] += tmp;
 	} else if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
@@ -3953,7 +3954,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 		 * Also, p->stime needs to be updated for ksoftirqd.
 		 */
 		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
-					&cpustat->softirq);
+					&cpustat[SOFTIRQ]);
 	} else if (user_tick) {
 		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else if (p == rq->idle) {
@@ -3962,7 +3963,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 		account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else {
 		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
-					&cpustat->system);
+					&cpustat[SYSTEM]);
 	}
 }
 
-- 
1.7.6



* [PATCH 03/10] Move /proc/stat logic inside sched.c
  2011-10-02 19:21 [PATCH 00/10] Per-cgroup /proc/stat information Glauber Costa
  2011-10-02 19:21 ` [PATCH 01/10] trivial: initialize root cgroup's sibling list Glauber Costa
  2011-10-02 19:21 ` [PATCH 02/10] Change cpustat fields to an array Glauber Costa
@ 2011-10-02 19:21 ` Glauber Costa
  2011-10-02 19:21 ` [PATCH 04/10] Display /proc/stat information per cgroup Glauber Costa
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Glauber Costa @ 2011-10-02 19:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley,
	Glauber Costa

This patch moves all of the /proc/stat display code inside
sched.c. The goal is to later have a different version of it
per cgroup. In container environments, this is useful for
giving each container its own independent view of the
statistics displayed in this file.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/proc/stat.c        |  116 +-----------------------------------------------
 include/linux/sched.h |    1 +
 kernel/sched.c        |  119 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 121 insertions(+), 115 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e8e2b39..6b10387 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -11,123 +11,9 @@
 #include <linux/irqnr.h>
 #include <asm/cputime.h>
 
-#ifndef arch_irq_stat_cpu
-#define arch_irq_stat_cpu(cpu) 0
-#endif
-#ifndef arch_irq_stat
-#define arch_irq_stat() 0
-#endif
-#ifndef arch_idle_time
-#define arch_idle_time(cpu) 0
-#endif
-
 static int show_stat(struct seq_file *p, void *v)
 {
-	int i, j;
-	unsigned long jif;
-	u64 user, nice, system, idle, iowait, irq, softirq, steal;
-	u64 guest, guest_nice;
-	u64 sum = 0;
-	u64 sum_softirq = 0;
-	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
-	struct timespec boottime;
-
-	user = nice = system = idle = iowait =
-		irq = softirq = steal = 0;
-	guest = guest_nice = 0;
-	getboottime(&boottime);
-	jif = boottime.tv_sec;
-
-	for_each_possible_cpu(i) {
-		user += kstat_cpu(i).cpustat[USER];
-		nice += kstat_cpu(i).cpustat[NICE];
-		system += kstat_cpu(i).cpustat[SYSTEM];
-		idle += kstat_cpu(i).cpustat[IDLE];
-		idle += arch_idle_time(i);
-		iowait += kstat_cpu(i).cpustat[IOWAIT];
-		irq += kstat_cpu(i).cpustat[IRQ];
-		softirq += kstat_cpu(i).cpustat[SOFTIRQ];
-		steal += kstat_cpu(i).cpustat[STEAL];
-		guest += kstat_cpu(i).cpustat[GUEST];
-		guest_nice += kstat_cpu(i).cpustat[GUEST_NICE];
-		sum += kstat_cpu_irqs_sum(i);
-		sum += arch_irq_stat_cpu(i);
-
-		for (j = 0; j < NR_SOFTIRQS; j++) {
-			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
-
-			per_softirq_sums[j] += softirq_stat;
-			sum_softirq += softirq_stat;
-		}
-	}
-	sum += arch_irq_stat();
-
-	seq_printf(p, "cpu  %llu %llu %llu %llu %llu %llu %llu %llu %llu "
-		"%llu\n",
-		(unsigned long long)cputime64_to_clock_t(user),
-		(unsigned long long)cputime64_to_clock_t(nice),
-		(unsigned long long)cputime64_to_clock_t(system),
-		(unsigned long long)cputime64_to_clock_t(idle),
-		(unsigned long long)cputime64_to_clock_t(iowait),
-		(unsigned long long)cputime64_to_clock_t(irq),
-		(unsigned long long)cputime64_to_clock_t(softirq),
-		(unsigned long long)cputime64_to_clock_t(steal),
-		(unsigned long long)cputime64_to_clock_t(guest),
-		(unsigned long long)cputime64_to_clock_t(guest_nice));
-	for_each_online_cpu(i) {
-
-		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kstat_cpu(i).cpustat[USER];
-		nice = kstat_cpu(i).cpustat[NICE];
-		system = kstat_cpu(i).cpustat[SYSTEM];
-		idle = kstat_cpu(i).cpustat[IDLE];
-		idle += arch_idle_time(i);
-		iowait = kstat_cpu(i).cpustat[IOWAIT];
-		irq = kstat_cpu(i).cpustat[IRQ];
-		softirq = kstat_cpu(i).cpustat[SOFTIRQ];
-		steal = kstat_cpu(i).cpustat[STEAL];
-		guest = kstat_cpu(i).cpustat[GUEST];
-		guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
-		seq_printf(p,
-			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
-			"%llu\n",
-			i,
-			(unsigned long long)cputime64_to_clock_t(user),
-			(unsigned long long)cputime64_to_clock_t(nice),
-			(unsigned long long)cputime64_to_clock_t(system),
-			(unsigned long long)cputime64_to_clock_t(idle),
-			(unsigned long long)cputime64_to_clock_t(iowait),
-			(unsigned long long)cputime64_to_clock_t(irq),
-			(unsigned long long)cputime64_to_clock_t(softirq),
-			(unsigned long long)cputime64_to_clock_t(steal),
-			(unsigned long long)cputime64_to_clock_t(guest),
-			(unsigned long long)cputime64_to_clock_t(guest_nice));
-	}
-	seq_printf(p, "intr %llu", (unsigned long long)sum);
-
-	/* sum again ? it could be updated? */
-	for_each_irq_nr(j)
-		seq_printf(p, " %u", kstat_irqs(j));
-
-	seq_printf(p,
-		"\nctxt %llu\n"
-		"btime %lu\n"
-		"processes %lu\n"
-		"procs_running %lu\n"
-		"procs_blocked %lu\n",
-		nr_context_switches(),
-		(unsigned long)jif,
-		total_forks,
-		nr_running(),
-		nr_iowait());
-
-	seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);
-
-	for (i = 0; i < NR_SOFTIRQS; i++)
-		seq_printf(p, " %u", per_softirq_sums[i]);
-	seq_putc(p, '\n');
-
-	return 0;
+	return cpu_cgroup_proc_stat(p);
 }
 
 static int stat_open(struct inode *inode, struct file *file)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4ac2c05..25658d8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2712,6 +2712,7 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+int cpu_cgroup_proc_stat(struct seq_file *p);
 #endif /* __KERNEL__ */
 
 #endif
diff --git a/kernel/sched.c b/kernel/sched.c
index 8f0fa05..482e645 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9083,6 +9083,125 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 
 #endif	/* CONFIG_CGROUP_SCHED */
 
+#ifndef arch_irq_stat_cpu
+#define arch_irq_stat_cpu(cpu) 0
+#endif
+#ifndef arch_irq_stat
+#define arch_irq_stat() 0
+#endif
+#ifndef arch_idle_time
+#define arch_idle_time(cpu) 0
+#endif
+
+int cpu_cgroup_proc_stat(struct seq_file *p)
+{
+	int i, j;
+	unsigned long jif;
+	u64 user, nice, system, idle, iowait, irq, softirq, steal;
+	u64 guest, guest_nice;
+	u64 sum = 0;
+	u64 sum_softirq = 0;
+	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
+	struct timespec boottime;
+
+	user = nice = system = idle = iowait =
+		irq = softirq = steal = 0;
+	guest = guest_nice = 0;
+	getboottime(&boottime);
+	jif = boottime.tv_sec;
+
+	for_each_possible_cpu(i) {
+		user += kstat_cpu(i).cpustat[USER];
+		nice += kstat_cpu(i).cpustat[NICE];
+		system += kstat_cpu(i).cpustat[SYSTEM];
+		idle += kstat_cpu(i).cpustat[IDLE];
+		idle += arch_idle_time(i);
+		iowait += kstat_cpu(i).cpustat[IOWAIT];
+		irq += kstat_cpu(i).cpustat[IRQ];
+		softirq += kstat_cpu(i).cpustat[SOFTIRQ];
+		steal += kstat_cpu(i).cpustat[STEAL];
+		guest += kstat_cpu(i).cpustat[GUEST];
+		guest_nice += kstat_cpu(i).cpustat[GUEST_NICE];
+		sum += kstat_cpu_irqs_sum(i);
+		sum += arch_irq_stat_cpu(i);
+
+		for (j = 0; j < NR_SOFTIRQS; j++) {
+			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
+
+			per_softirq_sums[j] += softirq_stat;
+			sum_softirq += softirq_stat;
+		}
+	}
+	sum += arch_irq_stat();
+
+	seq_printf(p, "cpu  %llu %llu %llu %llu %llu %llu %llu %llu %llu "
+		"%llu\n",
+		(unsigned long long)cputime64_to_clock_t(user),
+		(unsigned long long)cputime64_to_clock_t(nice),
+		(unsigned long long)cputime64_to_clock_t(system),
+		(unsigned long long)cputime64_to_clock_t(idle),
+		(unsigned long long)cputime64_to_clock_t(iowait),
+		(unsigned long long)cputime64_to_clock_t(irq),
+		(unsigned long long)cputime64_to_clock_t(softirq),
+		(unsigned long long)cputime64_to_clock_t(steal),
+		(unsigned long long)cputime64_to_clock_t(guest),
+		(unsigned long long)cputime64_to_clock_t(guest_nice));
+	for_each_online_cpu(i) {
+
+		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
+		user = kstat_cpu(i).cpustat[USER];
+		nice = kstat_cpu(i).cpustat[NICE];
+		system = kstat_cpu(i).cpustat[SYSTEM];
+		idle = kstat_cpu(i).cpustat[IDLE];
+		idle += arch_idle_time(i);
+		iowait = kstat_cpu(i).cpustat[IOWAIT];
+		irq = kstat_cpu(i).cpustat[IRQ];
+		softirq = kstat_cpu(i).cpustat[SOFTIRQ];
+		steal = kstat_cpu(i).cpustat[STEAL];
+		guest = kstat_cpu(i).cpustat[GUEST];
+		guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
+		seq_printf(p,
+			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
+			"%llu\n",
+			i,
+			(unsigned long long)cputime64_to_clock_t(user),
+			(unsigned long long)cputime64_to_clock_t(nice),
+			(unsigned long long)cputime64_to_clock_t(system),
+			(unsigned long long)cputime64_to_clock_t(idle),
+			(unsigned long long)cputime64_to_clock_t(iowait),
+			(unsigned long long)cputime64_to_clock_t(irq),
+			(unsigned long long)cputime64_to_clock_t(softirq),
+			(unsigned long long)cputime64_to_clock_t(steal),
+			(unsigned long long)cputime64_to_clock_t(guest),
+			(unsigned long long)cputime64_to_clock_t(guest_nice));
+	}
+	seq_printf(p, "intr %llu", (unsigned long long)sum);
+
+	/* sum again ? it could be updated? */
+	for_each_irq_nr(j)
+		seq_printf(p, " %u", kstat_irqs(j));
+
+	seq_printf(p,
+		"\nctxt %llu\n"
+		"btime %lu\n"
+		"processes %lu\n"
+		"procs_running %lu\n"
+		"procs_blocked %lu\n",
+		nr_context_switches(),
+		(unsigned long)jif,
+		total_forks,
+		nr_running(),
+		nr_iowait());
+
+	seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);
+
+	for (i = 0; i < NR_SOFTIRQS; i++)
+		seq_printf(p, " %u", per_softirq_sums[i]);
+	seq_putc(p, '\n');
+
+	return 0;
+}
+
 #ifdef CONFIG_CGROUP_CPUACCT
 
 /*
-- 
1.7.6



* [PATCH 04/10] Display /proc/stat information per cgroup
  2011-10-02 19:21 [PATCH 00/10] Per-cgroup /proc/stat information Glauber Costa
                   ` (2 preceding siblings ...)
  2011-10-02 19:21 ` [PATCH 03/10] Move /proc/stat logic inside sched.c Glauber Costa
@ 2011-10-02 19:21 ` Glauber Costa
  2011-10-05  8:56   ` Peter Zijlstra
  2011-10-05  9:04   ` Peter Zijlstra
  2011-10-02 19:21 ` [PATCH 05/10] Make total_forks per-cgroup Glauber Costa
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 34+ messages in thread
From: Glauber Costa @ 2011-10-02 19:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley,
	Glauber Costa

Each cgroup gets its own file, cpu.proc.stat, that displays
the exact same format as /proc/stat. Users who want a
per-cgroup version of this information can query this file
for that purpose.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/proc/stat.c              |    2 +-
 include/linux/kernel_stat.h |   11 ++-
 include/linux/sched.h       |    5 +-
 kernel/sched.c              |  202 +++++++++++++++++++++++++++++++------------
 4 files changed, 160 insertions(+), 60 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 6b10387..c9b2ae9 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -13,7 +13,7 @@
 
 static int show_stat(struct seq_file *p, void *v)
 {
-	return cpu_cgroup_proc_stat(p);
+	return cpu_cgroup_proc_stat(NULL, NULL, p);
 }
 
 static int stat_open(struct inode *inode, struct file *file)
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 897eabf..71a69a0 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -27,6 +27,8 @@ enum cpu_usage_stat {
 	STEAL,
 	GUEST,
 	GUEST_NICE,
+	IDLE_BASE,
+	IOWAIT_BASE,
 	NR_STATS,
 };
 
@@ -39,11 +41,18 @@ struct kernel_stat {
 	unsigned int softirqs[NR_SOFTIRQS];
 };
 
-DECLARE_PER_CPU(struct kernel_stat, kstat);
+#ifdef CONFIG_CGROUP_SCHED
+struct kernel_stat *task_group_kstat(struct task_struct *p);
 
 /* Must have preemption disabled for this to be meaningful. */
+#define kstat_this_cpu	this_cpu_ptr(task_group_kstat(current))
+#define kstat_cpu(cpu) (*per_cpu_ptr(task_group_kstat(current), cpu))
+#else
+DECLARE_PER_CPU(struct kernel_stat, kstat);
+
 #define kstat_this_cpu (&__get_cpu_var(kstat))
 #define kstat_cpu(cpu) per_cpu(kstat, cpu)
+#endif
 
 extern unsigned long long nr_context_switches(void);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 25658d8..64c5ba5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2712,7 +2712,10 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
-int cpu_cgroup_proc_stat(struct seq_file *p);
+struct cgroup;
+struct cftype;
+int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
+			 struct seq_file *p);
 #endif /* __KERNEL__ */
 
 #endif
diff --git a/kernel/sched.c b/kernel/sched.c
index 482e645..89d2248 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -278,6 +278,7 @@ struct task_group {
 #ifdef CONFIG_SCHED_AUTOGROUP
 	struct autogroup *autogroup;
 #endif
+	struct kernel_stat __percpu *cpustat;
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -631,6 +632,17 @@ static inline struct task_group *task_group(struct task_struct *p)
 	return autogroup_task_group(p, tg);
 }
 
+struct kernel_stat *task_group_kstat(struct task_struct *p)
+{
+	struct task_group *tg;
+	struct kernel_stat *kstat;
+
+	rcu_read_lock();
+	tg = task_group(p);
+	kstat = tg->cpustat;
+	rcu_read_unlock();
+	return kstat;
+}
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 {
@@ -645,6 +657,22 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 #endif
 }
 
+static inline void task_group_account_field(struct task_struct *p,
+					     u64 tmp, int index)
+{
+	struct kernel_stat *kstat;
+	struct task_group *tg;
+
+	rcu_read_lock();
+	tg = task_group(p);
+	do {
+		kstat = this_cpu_ptr(tg->cpustat);
+		kstat->cpustat[index] += tmp;
+		tg = tg->parent;
+	} while (tg);
+	rcu_read_unlock();
+}
+
 #else /* CONFIG_CGROUP_SCHED */
 
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
@@ -653,6 +681,14 @@ static inline struct task_group *task_group(struct task_struct *p)
 	return NULL;
 }
 
+DEFINE_PER_CPU(struct kernel_stat, kstat);
+EXPORT_PER_CPU_SYMBOL(kstat);
+
+static inline void task_group_account_field(struct task_struct *p,
+					     u64 tmp, int index)
+{
+	__this_cpu_add(kstat.cpustat[index], tmp);
+}
 #endif /* CONFIG_CGROUP_SCHED */
 
 static void update_rq_clock_task(struct rq *rq, s64 delta);
@@ -3669,10 +3705,6 @@ unlock:
 
 #endif
 
-DEFINE_PER_CPU(struct kernel_stat, kstat);
-
-EXPORT_PER_CPU_SYMBOL(kstat);
-
 /*
  * Return any ns on the sched_clock that have not yet been accounted in
  * @p in case that task is currently running.
@@ -3757,7 +3789,6 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
 void account_user_time(struct task_struct *p, cputime_t cputime,
 		       cputime_t cputime_scaled)
 {
-	u64 *cpustat = kstat_this_cpu->cpustat;
 	u64 tmp;
 
 	/* Add user time to process. */
@@ -3769,9 +3800,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 	tmp = cputime_to_cputime64(cputime);
 
 	if (TASK_NICE(p) > 0)
-		cpustat[NICE] += tmp;
+		task_group_account_field(p, tmp, NICE);
 	else
-		cpustat[USER] += tmp;
+		task_group_account_field(p, tmp, USER);
 
 	cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
 	/* Account for user time used */
@@ -3788,7 +3819,6 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 			       cputime_t cputime_scaled)
 {
 	u64 tmp;
-	u64 *cpustat = kstat_this_cpu->cpustat;
 
 	tmp = cputime_to_cputime64(cputime);
 
@@ -3800,11 +3830,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 
 	/* Add guest time to cpustat. */
 	if (TASK_NICE(p) > 0) {
-		cpustat[NICE] += tmp;
-		cpustat[GUEST_NICE] += tmp;
+		task_group_account_field(p, tmp, NICE);
+		task_group_account_field(p, tmp, GUEST_NICE);
 	} else {
-		cpustat[USER] += tmp;
-		cpustat[GUEST] += tmp;
+		task_group_account_field(p, tmp, USER);
+		task_group_account_field(p, tmp, GUEST);
 	}
 }
 
@@ -3817,7 +3847,7 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
  */
 static inline
 void __account_system_time(struct task_struct *p, cputime_t cputime,
-			cputime_t cputime_scaled, u64 *target_cputime64)
+			cputime_t cputime_scaled, int index)
 {
 	u64 tmp = cputime_to_cputime64(cputime);
 
@@ -3827,7 +3857,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
 	account_group_system_time(p, cputime);
 
 	/* Add system time to cpustat. */
-	*target_cputime64 += tmp;
+	task_group_account_field(p, tmp, index);
 	cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
 
 	/* Account for system time used */
@@ -3844,8 +3874,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
 void account_system_time(struct task_struct *p, int hardirq_offset,
 			 cputime_t cputime, cputime_t cputime_scaled)
 {
-	u64 *cpustat = kstat_this_cpu->cpustat;
-	u64 *target_cputime64;
+	int index;
 
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
 		account_guest_time(p, cputime, cputime_scaled);
@@ -3853,13 +3882,13 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	}
 
 	if (hardirq_count() - hardirq_offset)
-		target_cputime64 = &cpustat[IRQ];
+		index = IRQ;
 	else if (in_serving_softirq())
-		target_cputime64 = &cpustat[SOFTIRQ];
+		index = SOFTIRQ;
 	else
-		target_cputime64 = &cpustat[SYSTEM];
+		index = SYSTEM;
 
-	__account_system_time(p, cputime, cputime_scaled, target_cputime64);
+	__account_system_time(p, cputime, cputime_scaled, index);
 }
 
 /*
@@ -3868,10 +3897,14 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
  */
 void account_steal_time(cputime_t cputime)
 {
-	u64 *cpustat = kstat_this_cpu->cpustat;
+	struct kernel_stat *kstat;
 	u64 cputime64 = cputime_to_cputime64(cputime);
-
-	cpustat[STEAL] += cputime64;
+#ifdef CONFIG_CGROUP_SCHED
+	kstat = this_cpu_ptr(root_task_group.cpustat);
+#else
+	kstat = &__get_cpu_var(kstat);
+#endif
+	kstat->cpustat[STEAL] += cputime64;
 }
 
 /*
@@ -3880,14 +3913,18 @@ void account_steal_time(cputime_t cputime)
  */
 void account_idle_time(cputime_t cputime)
 {
-	u64 *cpustat = kstat_this_cpu->cpustat;
+	struct kernel_stat *kstat;
 	u64 cputime64 = cputime_to_cputime64(cputime);
 	struct rq *rq = this_rq();
-
+#ifdef CONFIG_CGROUP_SCHED
+	kstat = this_cpu_ptr(root_task_group.cpustat);
+#else
+	kstat = &__get_cpu_var(kstat);
+#endif
 	if (atomic_read(&rq->nr_iowait) > 0)
-		cpustat[IOWAIT] += cputime64;
+		kstat->cpustat[IOWAIT] += cputime64;
 	else
-		cpustat[IDLE] += cputime64;
+		kstat->cpustat[IDLE] += cputime64;
 }
 
 static __always_inline bool steal_account_process_tick(void)
@@ -3934,27 +3971,26 @@ static __always_inline bool steal_account_process_tick(void)
  * softirq as those do not count in task exec_runtime any more.
  */
 static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
-						struct rq *rq)
+					 struct rq *rq)
 {
 	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
 	u64 tmp = cputime_to_cputime64(cputime_one_jiffy);
-	u64 *cpustat = kstat_this_cpu->cpustat;
 
 	if (steal_account_process_tick())
 		return;
 
 	if (irqtime_account_hi_update()) {
-		cpustat[IRQ] += tmp;
+		task_group_account_field(p, tmp, IRQ);
 	} else if (irqtime_account_si_update()) {
-		cpustat[SOFTIRQ] += tmp;
+		task_group_account_field(p, tmp, SOFTIRQ);
 	} else if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
 		 * Also, p->stime needs to be updated for ksoftirqd.
 		 */
-		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
-					&cpustat[SOFTIRQ]);
+		__account_system_time(p, cputime_one_jiffy,
+				      one_jiffy_scaled, SOFTIRQ);
 	} else if (user_tick) {
 		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else if (p == rq->idle) {
@@ -3962,8 +3998,8 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 	} else if (p->flags & PF_VCPU) { /* System time or guest time */
 		account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else {
-		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
-					&cpustat[SYSTEM]);
+		__account_system_time(p, cputime_one_jiffy,
+				      one_jiffy_scaled, SYSTEM);
 	}
 }
 
@@ -8085,6 +8121,10 @@ void __init sched_init(void)
 	INIT_LIST_HEAD(&root_task_group.children);
 	INIT_LIST_HEAD(&root_task_group.siblings);
 	autogroup_init(&init_task);
+
+	root_task_group.cpustat = alloc_percpu(struct kernel_stat);
+	/* Failing an allocation this early means we're screwed anyway */
+	BUG_ON(!root_task_group.cpustat);
 #endif /* CONFIG_CGROUP_SCHED */
 
 	for_each_possible_cpu(i) {
@@ -8519,6 +8559,7 @@ static void free_sched_group(struct task_group *tg)
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
+	free_percpu(tg->cpustat);
 	kfree(tg);
 }
 
@@ -8527,6 +8568,7 @@ struct task_group *sched_create_group(struct task_group *parent)
 {
 	struct task_group *tg;
 	unsigned long flags;
+	int i;
 
 	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
 	if (!tg)
@@ -8538,6 +8580,19 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	tg->cpustat = alloc_percpu(struct kernel_stat);
+	if (!tg->cpustat)
+		goto err;
+
+	for_each_possible_cpu(i) {
+		struct kernel_stat *kstat, *root_kstat;
+
+		kstat = per_cpu_ptr(tg->cpustat, i);
+		root_kstat = per_cpu_ptr(root_task_group.cpustat, i);
+		kstat->cpustat[IDLE_BASE]   = root_kstat->cpustat[IDLE];
+		kstat->cpustat[IOWAIT_BASE] = root_kstat->cpustat[IOWAIT];
+	}
+
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_add_rcu(&tg->list, &task_groups);
 
@@ -9062,6 +9117,10 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+	{
+		.name = "proc.stat",
+		.read_seq_string = cpu_cgroup_proc_stat,
+	},
 };
 
 static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -9093,7 +9152,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 #define arch_idle_time(cpu) 0
 #endif
 
-int cpu_cgroup_proc_stat(struct seq_file *p)
+int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
+			 struct seq_file *p)
 {
 	int i, j;
 	unsigned long jif;
@@ -9103,6 +9163,14 @@ int cpu_cgroup_proc_stat(struct seq_file *p)
 	u64 sum_softirq = 0;
 	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
 	struct timespec boottime;
+#ifdef CONFIG_CGROUP_SCHED
+	struct task_group *tg;
+
+	if (cgrp)
+		tg = cgroup_tg(cgrp);
+	else
+		tg = &root_task_group;
+#endif
 
 	user = nice = system = idle = iowait =
 		irq = softirq = steal = 0;
@@ -9111,17 +9179,28 @@ int cpu_cgroup_proc_stat(struct seq_file *p)
 	jif = boottime.tv_sec;
 
 	for_each_possible_cpu(i) {
-		user += kstat_this_cpu->cpustat[USER];
-		nice += kstat_this_cpu->cpustat[NICE];
-		system += kstat_this_cpu->cpustat[SYSTEM];
-		idle += kstat_this_cpu->cpustat[IDLE];
+		struct kernel_stat *kstat, *idle_kstat;
+#ifdef CONFIG_CGROUP_SCHED
+		kstat = per_cpu_ptr(tg->cpustat, i);
+		idle_kstat = per_cpu_ptr(root_task_group.cpustat, i);
+#else
+		kstat = &per_cpu(kstat, i);
+		idle_kstat = kstat;
+#endif
+
+		user += kstat->cpustat[USER];
+		nice += kstat->cpustat[NICE];
+		system += kstat->cpustat[SYSTEM];
+		idle += idle_kstat->cpustat[IDLE];
 		idle += arch_idle_time(i);
-		iowait += kstat_this_cpu->cpustat[IOWAIT];
-		irq += kstat_this_cpu->cpustat[IRQ];
-		softirq += kstat_this_cpu->cpustat[SOFTIRQ];
-		steal += kstat_this_cpu->cpustat[STEAL];
-		guest += kstat_this_cpu->cpustat[GUEST];
-		guest_nice += kstat_this_cpu->cpustat[GUEST_NICE];
+		idle -= kstat->cpustat[IDLE_BASE];
+		iowait += idle_kstat->cpustat[IOWAIT];
+		iowait -= kstat->cpustat[IOWAIT_BASE];
+		irq += kstat->cpustat[IRQ];
+		softirq += kstat->cpustat[SOFTIRQ];
+		steal += kstat->cpustat[STEAL];
+		guest += kstat->cpustat[GUEST];
+		guest_nice += kstat->cpustat[GUEST_NICE];
 		sum += kstat_cpu_irqs_sum(i);
 		sum += arch_irq_stat_cpu(i);
 
@@ -9147,19 +9226,28 @@ int cpu_cgroup_proc_stat(struct seq_file *p)
 		(unsigned long long)cputime64_to_clock_t(guest),
 		(unsigned long long)cputime64_to_clock_t(guest_nice));
 	for_each_online_cpu(i) {
-
+		struct kernel_stat *kstat, *idle_kstat;
+#ifdef CONFIG_CGROUP_SCHED
+		kstat = per_cpu_ptr(tg->cpustat, i);
+		idle_kstat = per_cpu_ptr(root_task_group.cpustat, i);
+#else
+		kstat = &per_cpu(kstat, i);
+		idle_kstat = kstat;
+#endif
 		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kstat_this_cpu->cpustat[USER];
-		nice = kstat_this_cpu->cpustat[NICE];
-		system = kstat_this_cpu->cpustat[SYSTEM];
-		idle = kstat_this_cpu->cpustat[IDLE];
+		user = kstat->cpustat[USER];
+		nice = kstat->cpustat[NICE];
+		system = kstat->cpustat[SYSTEM];
+		idle = idle_kstat->cpustat[IDLE];
 		idle += arch_idle_time(i);
-		iowait = kstat_this_cpu->cpustat[IOWAIT];
-		irq = kstat_this_cpu->cpustat[IRQ];
-		softirq = kstat_this_cpu->cpustat[SOFTIRQ];
-		steal = kstat_this_cpu->cpustat[STEAL];
-		guest = kstat_this_cpu->cpustat[GUEST];
-		guest_nice = kstat_this_cpu->cpustat[GUEST_NICE];
+		idle -= kstat->cpustat[IDLE_BASE];
+		iowait = idle_kstat->cpustat[IOWAIT];
+		iowait -= kstat->cpustat[IOWAIT_BASE];
+		irq = kstat->cpustat[IRQ];
+		softirq = kstat->cpustat[SOFTIRQ];
+		steal = kstat->cpustat[STEAL];
+		guest = kstat->cpustat[GUEST];
+		guest_nice = kstat->cpustat[GUEST_NICE];
 		seq_printf(p,
 			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
 			"%llu\n",
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 05/10] Make total_forks per-cgroup
  2011-10-02 19:21 [PATCH 00/10] Per-cgroup /proc/stat information Glauber Costa
                   ` (3 preceding siblings ...)
  2011-10-02 19:21 ` [PATCH 04/10] Display /proc/stat information per cgroup Glauber Costa
@ 2011-10-02 19:21 ` Glauber Costa
  2011-10-05  9:05   ` Peter Zijlstra
  2011-10-02 19:21 ` [PATCH 06/10] per-cgroup boot time Glauber Costa
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Glauber Costa @ 2011-10-02 19:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley,
	Glauber Costa

This patch counts the total number of forks per cgroup.
The count is propagated to every ancestor, so the root
cgroup's value is the total number of forks in the system.

To achieve that, total_forks is made per-cpu. It does not
strictly need to be, but doing so lets us bundle it into
the cpustat array already present.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 include/linux/kernel_stat.h |    1 +
 include/linux/sched.h       |    1 +
 kernel/fork.c               |    7 ++-----
 kernel/sched.c              |    9 ++++++++-
 4 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 71a69a0..9fb0dda 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -29,6 +29,7 @@ enum cpu_usage_stat {
 	GUEST_NICE,
 	IDLE_BASE,
 	IOWAIT_BASE,
+	TOTAL_FORKS,
 	NR_STATS,
 };
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 64c5ba5..4ba9dde 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2716,6 +2716,7 @@ struct cgroup;
 struct cftype;
 int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 			 struct seq_file *p);
+void task_group_new_fork(struct task_struct *p);
 #endif /* __KERNEL__ */
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..ec2b729 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -76,10 +76,6 @@
 
 #include <trace/events/sched.h>
 
-/*
- * Protected counters by write_lock_irq(&tasklist_lock)
- */
-unsigned long total_forks;	/* Handle normal Linux uptimes. */
 int nr_threads;			/* The idle threads do not count.. */
 
 int max_threads;		/* tunable limit on nr_threads */
@@ -1372,7 +1368,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		nr_threads++;
 	}
 
-	total_forks++;
+	task_group_new_fork(p);
+
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
diff --git a/kernel/sched.c b/kernel/sched.c
index 89d2248..5a8181e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -691,6 +691,11 @@ static inline void task_group_account_field(struct task_struct *p,
 }
 #endif /* CONFIG_CGROUP_SCHED */
 
+void task_group_new_fork(struct task_struct *p)
+{
+	task_group_account_field(p, 1, TOTAL_FORKS);
+}
+
 static void update_rq_clock_task(struct rq *rq, s64 delta);
 
 static void update_rq_clock(struct rq *rq)
@@ -9161,6 +9166,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 	u64 guest, guest_nice;
 	u64 sum = 0;
 	u64 sum_softirq = 0;
+	u64 total_forks = 0;
 	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
 	struct timespec boottime;
 #ifdef CONFIG_CGROUP_SCHED
@@ -9201,6 +9207,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		steal += kstat->cpustat[STEAL];
 		guest += kstat->cpustat[GUEST];
 		guest_nice += kstat->cpustat[GUEST_NICE];
+		total_forks += kstat->cpustat[TOTAL_FORKS];
 		sum += kstat_cpu_irqs_sum(i);
 		sum += arch_irq_stat_cpu(i);
 
@@ -9272,7 +9279,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 	seq_printf(p,
 		"\nctxt %llu\n"
 		"btime %lu\n"
-		"processes %lu\n"
+		"processes %llu\n"
 		"procs_running %lu\n"
 		"procs_blocked %lu\n",
 		nr_context_switches(),
-- 
1.7.6



* [PATCH 06/10] per-cgroup boot time
  2011-10-02 19:21 [PATCH 00/10] Per-cgroup /proc/stat information Glauber Costa
                   ` (4 preceding siblings ...)
  2011-10-02 19:21 ` [PATCH 05/10] Make total_forks per-cgroup Glauber Costa
@ 2011-10-02 19:21 ` Glauber Costa
  2011-10-02 19:21 ` [PATCH 07/10] Report steal time for cgroup Glauber Costa
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Glauber Costa @ 2011-10-02 19:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley,
	Glauber Costa

Record the time at which the cgroup was created. This can
be used to provide more accurate boottime information in
cpu.proc.stat.
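
The adjustment this enables is simple arithmetic. A small sketch
(plain C; `ts` is a stand-in for struct timespec) of how the reported
btime becomes the host boot time shifted by the group's creation
time, as the timespec_add() call in the reader does:

```c
#include <assert.h>

#define NSEC_PER_SEC 1000000000L

struct ts {
	long long tv_sec;
	long tv_nsec;
};

/* Equivalent of timespec_add(boottime, tg->start_time): the group's
 * boot time is the host's, offset by when the group was created. */
static struct ts group_btime(struct ts boottime, struct ts start)
{
	struct ts t;

	t.tv_sec  = boottime.tv_sec + start.tv_sec;
	t.tv_nsec = boottime.tv_nsec + start.tv_nsec;
	if (t.tv_nsec >= NSEC_PER_SEC) {
		t.tv_sec++;
		t.tv_nsec -= NSEC_PER_SEC;
	}
	return t;
}
```

A group created 50 seconds after boot thus reports a btime 50 seconds
later than the host's; the root group, with a zero start_time, reports
the host's btime unchanged.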

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 5a8181e..a2554b7 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -279,6 +279,7 @@ struct task_group {
 	struct autogroup *autogroup;
 #endif
 	struct kernel_stat __percpu *cpustat;
+	struct timespec start_time;
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -8127,6 +8128,7 @@ void __init sched_init(void)
 	INIT_LIST_HEAD(&root_task_group.siblings);
 	autogroup_init(&init_task);
 
+	root_task_group.start_time = (struct timespec){0, 0};
 	root_task_group.cpustat = alloc_percpu(struct kernel_stat);
 	/* Failing an allocation this early means we're screwed anyway */
 	BUG_ON(!root_task_group.cpustat);
@@ -8598,6 +8600,8 @@ struct task_group *sched_create_group(struct task_group *parent)
 		kstat->cpustat[IOWAIT_BASE] = root_kstat->cpustat[IOWAIT];
 	}
 
+	get_monotonic_boottime(&tg->start_time);
+
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_add_rcu(&tg->list, &task_groups);
 
@@ -9182,6 +9186,9 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		irq = softirq = steal = 0;
 	guest = guest_nice = 0;
 	getboottime(&boottime);
+#ifdef CONFIG_CGROUP_SCHED
+	boottime = timespec_add(boottime, tg->start_time);
+#endif
 	jif = boottime.tv_sec;
 
 	for_each_possible_cpu(i) {
-- 
1.7.6



* [PATCH 07/10] Report steal time for cgroup
  2011-10-02 19:21 [PATCH 00/10] Per-cgroup /proc/stat information Glauber Costa
                   ` (5 preceding siblings ...)
  2011-10-02 19:21 ` [PATCH 06/10] per-cgroup boot time Glauber Costa
@ 2011-10-02 19:21 ` Glauber Costa
  2011-10-02 19:21 ` [PATCH 08/10] provide a version of cpuacct statistics inside cpu cgroup Glauber Costa
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Glauber Costa @ 2011-10-02 19:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley,
	Glauber Costa

This patch introduces a functionality commonly found in
hypervisors: steal time.

For those not particularly familiar with it, steal time
is defined as any time in which a virtual machine (or container)
wanted to perform cpu work, but could not due to another
VM/container being scheduled in its place. Note that idle
time is never defined as steal time.

Assuming each container lives in its own cgroup, we can
easily calculate steal time as the sum of all user/system
time recorded in our sibling cgroups.
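
A standalone sketch of that definition (types and helper are
illustrative, not kernel code): steal for a group is its own steal
plus the busy time of its siblings, the same sum the patch computes
with list_for_each_entry() over tg->siblings:

```c
#include <assert.h>

enum { USER, SYSTEM, IRQ, SOFTIRQ, STEAL, NR_STATS };

struct grp {
	unsigned long long cpustat[NR_STATS];
};

/* Steal for one cpu: the group's own STEAL (e.g. host-level steal)
 * plus all busy time its siblings accumulated while this group
 * could have been running instead. Idle time is never counted. */
static unsigned long long group_steal(const struct grp *self,
				      const struct grp *siblings[],
				      int nr)
{
	unsigned long long steal = self->cpustat[STEAL];
	int i;

	for (i = 0; i < nr; i++)
		steal += siblings[i]->cpustat[USER] +
			 siblings[i]->cpustat[SYSTEM] +
			 siblings[i]->cpustat[IRQ] +
			 siblings[i]->cpustat[SOFTIRQ];
	return steal;
}
```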

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a2554b7..dec73c4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9175,6 +9175,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 	struct timespec boottime;
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group *tg;
+	struct task_group *sib;
 
 	if (cgrp)
 		tg = cgroup_tg(cgrp);
@@ -9215,6 +9216,20 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		guest += kstat->cpustat[GUEST];
 		guest_nice += kstat->cpustat[GUEST_NICE];
 		total_forks += kstat->cpustat[TOTAL_FORKS];
+		steal += kstat->cpustat[STEAL];
+#ifdef CONFIG_CGROUP_SCHED
+		rcu_read_lock();
+		list_for_each_entry(sib, &tg->siblings, siblings) {
+			struct kernel_stat *ksib;
+			if (!sib)
+				continue;
+
+			ksib = per_cpu_ptr(sib->cpustat, i);
+			steal += ksib->cpustat[USER] + ksib->cpustat[SYSTEM] +
+				 ksib->cpustat[IRQ] + ksib->cpustat[SOFTIRQ];
+		}
+		rcu_read_unlock();
+#endif
 		sum += kstat_cpu_irqs_sum(i);
 		sum += arch_irq_stat_cpu(i);
 
@@ -9260,6 +9275,19 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		irq = kstat->cpustat[IRQ];
 		softirq = kstat->cpustat[SOFTIRQ];
 		steal = kstat->cpustat[STEAL];
+#ifdef CONFIG_CGROUP_SCHED
+		rcu_read_lock();
+		list_for_each_entry(sib, &tg->siblings, siblings) {
+			struct kernel_stat *ksib;
+			if (!sib)
+				continue;
+
+			ksib = per_cpu_ptr(sib->cpustat, i);
+			steal += ksib->cpustat[USER] + ksib->cpustat[SYSTEM] +
+				 ksib->cpustat[IRQ] + ksib->cpustat[SOFTIRQ];
+		}
+		rcu_read_unlock();
+#endif
 		guest = kstat->cpustat[GUEST];
 		guest_nice = kstat->cpustat[GUEST_NICE];
 		seq_printf(p,
-- 
1.7.6



* [PATCH 08/10] provide a version of cpuacct statistics inside cpu cgroup
  2011-10-02 19:21 [PATCH 00/10] Per-cgroup /proc/stat information Glauber Costa
                   ` (6 preceding siblings ...)
  2011-10-02 19:21 ` [PATCH 07/10] Report steal time for cgroup Glauber Costa
@ 2011-10-02 19:21 ` Glauber Costa
  2011-10-05  9:10   ` Peter Zijlstra
  2011-10-02 19:21 ` [PATCH 09/10] provide a version of cpuusage " Glauber Costa
  2011-10-02 19:21 ` [PATCH 10/10] Change CPUACCT to default n Glauber Costa
  9 siblings, 1 reply; 34+ messages in thread
From: Glauber Costa @ 2011-10-02 19:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley,
	Glauber Costa, Balbir Singh

For users interested in using the information currently displayed
at cpuacct.stat, we provide it inside the cpu cgroup.

This has the advantage of accounting the information only
once instead of twice when there is no need for an independent
group of tasks for control and accounting, leading to much
less overhead.
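
The bucket folding the new cpu.stat file performs can be shown in
isolation (helpers and enum order are illustrative, not kernel API):
"user" is USER plus NICE, and "system" is SYSTEM plus IRQ plus
SOFTIRQ, matching cpu_cgroup_stats_show() in the diff below:

```c
#include <assert.h>

enum { USER, NICE, SYSTEM, IRQ, SOFTIRQ, NR_STATS };

/* cpu.stat's "user" line folds in niced user time. */
static unsigned long long stat_user(const unsigned long long *cpustat)
{
	return cpustat[USER] + cpustat[NICE];
}

/* cpu.stat's "system" line folds in hard- and soft-irq time. */
static unsigned long long stat_system(const unsigned long long *cpustat)
{
	return cpustat[SYSTEM] + cpustat[IRQ] + cpustat[SOFTIRQ];
}
```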

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Balbir Singh <bsingharora@gmail.com>
---
 kernel/sched.c |   43 ++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index dec73c4..3f7c1fd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9106,6 +9106,40 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+static const char *cpuacct_stat_desc[] = {
+	[CPUACCT_STAT_USER] = "user",
+	[CPUACCT_STAT_SYSTEM] = "system",
+};
+
+static int cpu_cgroup_stats_show(struct cgroup *cgrp, struct cftype *cft,
+		struct cgroup_map_cb *cb)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	int cpu;
+	s64 val = 0;
+
+	for_each_online_cpu(cpu) {
+		struct kernel_stat *kstat = per_cpu_ptr(tg->cpustat, cpu);
+		val += kstat->cpustat[USER];
+		val += kstat->cpustat[NICE];
+	}
+	val = cputime64_to_clock_t(val);
+	cb->fill(cb, cpuacct_stat_desc[CPUACCT_STAT_USER], val);
+
+	val = 0;
+	for_each_online_cpu(cpu) {
+		struct kernel_stat *kstat = per_cpu_ptr(tg->cpustat, cpu);
+		val += kstat->cpustat[SYSTEM];
+		val += kstat->cpustat[IRQ];
+		val += kstat->cpustat[SOFTIRQ];
+	}
+
+	val = cputime64_to_clock_t(val);
+	cb->fill(cb, cpuacct_stat_desc[CPUACCT_STAT_SYSTEM], val);
+
+	return 0;
+}
+
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -9130,6 +9164,10 @@ static struct cftype cpu_files[] = {
 		.name = "proc.stat",
 		.read_seq_string = cpu_cgroup_proc_stat,
 	},
+	{
+		.name = "stat",
+		.read_map = cpu_cgroup_stats_show,
+	},
 };
 
 static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -9494,11 +9532,6 @@ static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
 	return 0;
 }
 
-static const char *cpuacct_stat_desc[] = {
-	[CPUACCT_STAT_USER] = "user",
-	[CPUACCT_STAT_SYSTEM] = "system",
-};
-
 static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
 		struct cgroup_map_cb *cb)
 {
-- 
1.7.6



* [PATCH 09/10] provide a version of cpuusage statistics inside cpu cgroup
  2011-10-02 19:21 [PATCH 00/10] Per-cgroup /proc/stat information Glauber Costa
                   ` (7 preceding siblings ...)
  2011-10-02 19:21 ` [PATCH 08/10] provide a version of cpuacct statistics inside cpu cgroup Glauber Costa
@ 2011-10-02 19:21 ` Glauber Costa
  2011-10-05  9:14   ` Peter Zijlstra
  2011-10-05  9:46   ` Peter Zijlstra
  2011-10-02 19:21 ` [PATCH 10/10] Change CPUACCT to default n Glauber Costa
  9 siblings, 2 replies; 34+ messages in thread
From: Glauber Costa @ 2011-10-02 19:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley,
	Glauber Costa, Balbir Singh

For users interested in the information currently displayed
in cpuacct.usage and cpuacct.usage_percpu, we provide it
inside the cpu cgroup.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Balbir Singh <bsingharora@gmail.com>
---
 kernel/sched.c      |  221 +++++++++++++++++++++++++++++++++++----------------
 kernel/sched_fair.c |    2 +-
 kernel/sched_rt.c   |    2 +-
 3 files changed, 156 insertions(+), 69 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3f7c1fd..8ed9dd7 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -279,6 +279,7 @@ struct task_group {
 	struct autogroup *autogroup;
 #endif
 	struct kernel_stat __percpu *cpustat;
+	u64 __percpu *cpuusage;
 	struct timespec start_time;
 };
 
@@ -2080,6 +2081,8 @@ static int irqtime_account_si_update(void)
 
 #endif
 
+static void cpuusage_charge(struct task_struct *tsk, u64 cputime);
+
 #include "sched_idletask.c"
 #include "sched_fair.c"
 #include "sched_rt.c"
@@ -8130,8 +8133,10 @@ void __init sched_init(void)
 
 	root_task_group.start_time = (struct timespec){0, 0};
 	root_task_group.cpustat = alloc_percpu(struct kernel_stat);
+	root_task_group.cpuusage = alloc_percpu(u64);
 	/* Failing an allocation this early means we're screwed anyway */
 	BUG_ON(!root_task_group.cpustat);
+	BUG_ON(!root_task_group.cpuusage);
 #endif /* CONFIG_CGROUP_SCHED */
 
 	for_each_possible_cpu(i) {
@@ -8566,7 +8571,10 @@ static void free_sched_group(struct task_group *tg)
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
-	free_percpu(tg->cpustat);
+	if (tg->cpustat)
+		free_percpu(tg->cpustat);
+	if (tg->cpuusage)
+		free_percpu(tg->cpuusage);
 	kfree(tg);
 }
 
@@ -8587,6 +8595,10 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	tg->cpuusage = alloc_percpu(u64);
+	if (!tg->cpuusage)
+		goto err;
+
 	tg->cpustat = alloc_percpu(struct kernel_stat);
 	if (!tg->cpustat)
 		goto err;
@@ -9106,6 +9118,88 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+static u64 cpuacct_cpuusage_read(u64 *cpuusage, int cpu)
+{
+	u64 data;
+
+#ifndef CONFIG_64BIT
+	/*
+	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
+	 */
+	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	data = *cpuusage;
+	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#else
+	data = *cpuusage;
+#endif
+
+	return data;
+}
+
+static void cpuacct_cpuusage_write(u64 *cpuusage, int cpu, u64 val)
+{
+#ifndef CONFIG_64BIT
+	/*
+	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
+	 */
+	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	*cpuusage = val;
+	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#else
+	*cpuusage = val;
+#endif
+}
+
+static u64 cpu_cgroup_cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	u64 totalcpuusage = 0;
+	int i;
+
+	for_each_present_cpu(i) {
+		u64 *cpuusage = per_cpu_ptr(tg->cpuusage, i);
+		totalcpuusage += cpuacct_cpuusage_read(cpuusage, i);
+	}
+
+	return totalcpuusage;
+}
+
+static int cpu_cgroup_cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
+								u64 reset)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	int err = 0;
+	int i;
+
+	if (reset) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	for_each_present_cpu(i) {
+		u64 *cpuusage = per_cpu_ptr(tg->cpuusage, i);
+		cpuacct_cpuusage_write(cpuusage, i, 0);
+	}
+
+out:
+	return err;
+}
+
+static int cpu_cgroup_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
+				      struct seq_file *m)
+{
+	struct task_group *tg = cgroup_tg(cgroup);
+	int i;
+
+	for_each_present_cpu(i) {
+		u64 *cpuusage = per_cpu_ptr(tg->cpuusage, i);
+		seq_printf(m, "%llu ", (unsigned long long)cpuacct_cpuusage_read(cpuusage, i));
+	}
+	seq_printf(m, "\n");
+	return 0;
+}
+
+
 static const char *cpuacct_stat_desc[] = {
 	[CPUACCT_STAT_USER] = "user",
 	[CPUACCT_STAT_SYSTEM] = "system",
@@ -9168,6 +9262,15 @@ static struct cftype cpu_files[] = {
 		.name = "stat",
 		.read_map = cpu_cgroup_stats_show,
 	},
+	{
+		.name = "usage",
+		.read_u64 = cpu_cgroup_cpuusage_read,
+		.write_u64 = cpu_cgroup_cpuusage_write,
+	},
+	{
+		.name = "usage_percpu",
+		.read_seq_string = cpu_cgroup_percpu_seq_read,
+	},
 };
 
 static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -9450,41 +9553,6 @@ cpuacct_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 	kfree(ca);
 }
 
-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
-{
-	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-	u64 data;
-
-#ifndef CONFIG_64BIT
-	/*
-	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
-	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-	data = *cpuusage;
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-	data = *cpuusage;
-#endif
-
-	return data;
-}
-
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
-{
-	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-
-#ifndef CONFIG_64BIT
-	/*
-	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
-	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-	*cpuusage = val;
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-	*cpuusage = val;
-#endif
-}
-
 /* return total cpu usage (in nanoseconds) of a group */
 static u64 cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
 {
@@ -9492,8 +9560,10 @@ static u64 cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
 	u64 totalcpuusage = 0;
 	int i;
 
-	for_each_present_cpu(i)
-		totalcpuusage += cpuacct_cpuusage_read(ca, i);
+	for_each_present_cpu(i) {
+		u64 *cpuusage = per_cpu_ptr(ca->cpuusage, i);
+		totalcpuusage += cpuacct_cpuusage_read(cpuusage, i);
+	}
 
 	return totalcpuusage;
 }
@@ -9510,8 +9580,10 @@ static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
 		goto out;
 	}
 
-	for_each_present_cpu(i)
-		cpuacct_cpuusage_write(ca, i, 0);
+	for_each_present_cpu(i) {
+		u64 *cpuusage = per_cpu_ptr(ca->cpuusage, i);
+		cpuacct_cpuusage_write(cpuusage, i, 0);
+	}
 
 out:
 	return err;
@@ -9568,33 +9640,6 @@ static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
 }
 
 /*
- * charge this task's execution time to its accounting group.
- *
- * called with rq->lock held.
- */
-static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
-{
-	struct cpuacct *ca;
-	int cpu;
-
-	if (unlikely(!cpuacct_subsys.active))
-		return;
-
-	cpu = task_cpu(tsk);
-
-	rcu_read_lock();
-
-	ca = task_ca(tsk);
-
-	for (; ca; ca = ca->parent) {
-		u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-		*cpuusage += cputime;
-	}
-
-	rcu_read_unlock();
-}
-
-/*
  * When CONFIG_VIRT_CPU_ACCOUNTING is enabled one jiffy can be very large
  * in cputime_t units. As a result, cpuacct_update_stats calls
  * percpu_counter_add with values large enough to always overflow the
@@ -9642,3 +9687,45 @@ struct cgroup_subsys cpuacct_subsys = {
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
 
+/*
+ * charge this task's execution time to its accounting group.
+ *
+ * called with rq->lock held.
+ */
+static void cpuusage_charge(struct task_struct *tsk, u64 cputime)
+{
+	int cpu;
+
+#ifdef CONFIG_CGROUP_CPUACCT
+	struct cpuacct *ca;
+#endif
+#ifdef CONFIG_CGROUP_SCHED
+	struct task_group *tg;
+#endif
+	cpu = task_cpu(tsk);
+
+	rcu_read_lock();
+
+#ifdef CONFIG_CGROUP_CPUACCT
+	ca = task_ca(tsk);
+
+	if (unlikely(!cpuacct_subsys.active))
+		goto no_cpuacct;
+
+	for (; ca; ca = ca->parent) {
+		u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+		*cpuusage += cputime;
+	}
+no_cpuacct:
+#endif
+
+#ifdef CONFIG_CGROUP_SCHED
+	tg = task_group(tsk);
+	for (; tg; tg = tg->parent) {
+		u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+		*cpuusage += cputime;
+	}
+#endif
+	rcu_read_unlock();
+}
+
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index bc8ee99..38b4549 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -580,7 +580,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		struct task_struct *curtask = task_of(curr);
 
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
-		cpuacct_charge(curtask, delta_exec);
+		cpuusage_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
 }
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 97540f0..a21b58e 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -676,7 +676,7 @@ static void update_curr_rt(struct rq *rq)
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq->clock_task;
-	cpuacct_charge(curr, delta_exec);
+	cpuusage_charge(curr, delta_exec);
 
 	sched_rt_avg_update(rq, delta_exec);
 
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 10/10] Change CPUACCT to default n
  2011-10-02 19:21 [PATCH 00/10] Per-cgroup /proc/stat information Glauber Costa
                   ` (8 preceding siblings ...)
  2011-10-02 19:21 ` [PATCH 09/10] provide a version of cpuusage " Glauber Costa
@ 2011-10-02 19:21 ` Glauber Costa
  9 siblings, 0 replies; 34+ messages in thread
From: Glauber Costa @ 2011-10-02 19:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley,
	Glauber Costa, Balbir Singh

Now that we provide all of cpuacct's capabilities in the cpu cgroup,
default CPUACCT to n. We keep it for compatibility with anyone who
needs an independent set of tasks to be managed by it, relative to
the cpu cgroup, but encourage use of the cpu cgroup instead.

Proposed deprecation schedule: 3.5.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Balbir Singh <bsingharora@gmail.com>
---
 Documentation/feature-removal-schedule.txt |    8 ++++++++
 init/Kconfig                               |    1 +
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index c4a6e14..08adcd4 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -592,3 +592,11 @@ Why:    In 3.0, we can now autodetect internal 3G device and already have
 	interface that was used by acer-wmi driver. It will replaced by
 	information log when acer-wmi initial.
 Who:    Lee, Chun-Yi <jlee@novell.com>
+
+----------------------------
+What:	cpuacct cgroup
+When:	3.5
+Why:	The same functionality is provided by CGROUP_SCHED, with a lower
+	footprint.
+Who:	Glauber Costa <glommer@parallels.com>
+
diff --git a/init/Kconfig b/init/Kconfig
index d627783..891d6c1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -626,6 +626,7 @@ config PROC_PID_CPUSET
 
 config CGROUP_CPUACCT
 	bool "Simple CPU accounting cgroup subsystem"
+	default n
 	help
 	  Provides a simple Resource Controller for monitoring the
 	  total CPU consumed by the tasks in a cgroup.
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/10] Display /proc/stat information per cgroup
  2011-10-02 19:21 ` [PATCH 04/10] Display /proc/stat information per cgroup Glauber Costa
@ 2011-10-05  8:56   ` Peter Zijlstra
  2011-10-05 12:10     ` Glauber Costa
  2011-10-05  9:04   ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-10-05  8:56 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
> +struct kernel_stat *task_group_kstat(struct task_struct *p)
> +{
> +       struct task_group *tg;
> +       struct kernel_stat *kstat;
> +
> +       rcu_read_lock();
> +       tg = task_group(p);
> +       kstat = tg->cpustat;
> +       rcu_read_unlock();
> +       return kstat;
> +} 

Who keeps tg alive and kicking while you poke at its (cpustat) member?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/10] Display /proc/stat information per cgroup
  2011-10-02 19:21 ` [PATCH 04/10] Display /proc/stat information per cgroup Glauber Costa
  2011-10-05  8:56   ` Peter Zijlstra
@ 2011-10-05  9:04   ` Peter Zijlstra
  2011-10-05 12:11     ` Glauber Costa
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-10-05  9:04 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
> @@ -3868,10 +3897,14 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
>   */
>  void account_steal_time(cputime_t cputime)
>  {
> -       u64 *cpustat = kstat_this_cpu->cpustat;
> +       struct kernel_stat *kstat;
>         u64 cputime64 = cputime_to_cputime64(cputime);
> -
> -       cpustat[STEAL] += cputime64;
> +#ifdef CONFIG_CGROUP_SCHED
> +       kstat = this_cpu_ptr(root_task_group.cpustat);
> +#else
> +       kstat = __get_cpu_var(kstat);
> +#endif
> +       kstat->cpustat[STEAL] += cputime64;
>  }

> @@ -9111,17 +9179,28 @@ int cpu_cgroup_proc_stat(struct seq_file *p)
>         jif = boottime.tv_sec;
>  
>         for_each_possible_cpu(i) {
> -               user += kstat_this_cpu->cpustat[USER];
> -               nice += kstat_this_cpu->cpustat[NICE];
> -               system += kstat_this_cpu->cpustat[SYSTEM];
> -               idle += kstat_this_cpu->cpustat[IDLE];
> +               struct kernel_stat *kstat, *idle_kstat;
> +#ifdef CONFIG_CGROUP_SCHED
> +               kstat = per_cpu_ptr(tg->cpustat, i);
> +               idle_kstat = per_cpu_ptr(root_task_group.cpustat, i);
> +#else
> +               kstat = per_cpu(kstat, i);
> +               idle_kstat = kstat;
> +#endif
> +
> +               user += kstat->cpustat[USER];
> +               nice += kstat->cpustat[NICE];
> +               system += kstat->cpustat[SYSTEM]; 

That if-deffery just begs to be killed by some helper function. Both
variants appear multiple times I think.
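A helper along the lines Peter suggests would hide the ifdeffery in one place. The following is a non-compiling sketch against the structures used in this series; the helper name and exact shape are made up for illustration, not taken from the patches:

```c
/*
 * Hypothetical helper centralizing the CONFIG_CGROUP_SCHED selection
 * that is currently open-coded in account_steal_time() and
 * cpu_cgroup_proc_stat(). Callers pass root_task_group for the
 * system-wide (idle) statistics.
 */
static inline struct kernel_stat *tg_kstat_cpu(struct task_group *tg, int cpu)
{
#ifdef CONFIG_CGROUP_SCHED
	return per_cpu_ptr(tg->cpustat, cpu);
#else
	/* tg is unused here; there is only the global per-cpu kstat */
	return &per_cpu(kstat, cpu);
#endif
}
```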

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/10] Make total_forks per-cgroup
  2011-10-02 19:21 ` [PATCH 05/10] Make total_forks per-cgroup Glauber Costa
@ 2011-10-05  9:05   ` Peter Zijlstra
  2011-10-05 12:12     ` Glauber Costa
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-10-05  9:05 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley,
	Frederic Weisbecker

On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
> This patch counts the total number of forks per-cgroup.
> The information is propagated to the parent, so the total
> number of forks in the system, is the parent cgroup's one.
> 
> To achieve that, total_forks is made per-cpu. There is no
> particular reason to do that, but by doing this, we are
> able to bundle it inside the cpustat structure already
> present.

I think fweisbec is also doing something with forks and cgroups.

> Signed-off-by: Glauber Costa <glommer@parallels.com>
> ---
>  include/linux/kernel_stat.h |    1 +
>  include/linux/sched.h       |    1 +
>  kernel/fork.c               |    7 ++-----
>  kernel/sched.c              |    9 ++++++++-
>  4 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
> index 71a69a0..9fb0dda 100644
> --- a/include/linux/kernel_stat.h
> +++ b/include/linux/kernel_stat.h
> @@ -29,6 +29,7 @@ enum cpu_usage_stat {
>  	GUEST_NICE,
>  	IDLE_BASE,
>  	IOWAIT_BASE,
> +	TOTAL_FORKS,
>  	NR_STATS,
>  };
>  
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 64c5ba5..4ba9dde 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2716,6 +2716,7 @@ struct cgroup;
>  struct cftype;
>  int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
>  			 struct seq_file *p);
> +void task_group_new_fork(struct task_struct *p);
>  #endif /* __KERNEL__ */
>  
>  #endif
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 8e6b6f4..ec2b729 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -76,10 +76,6 @@
>  
>  #include <trace/events/sched.h>
>  
> -/*
> - * Protected counters by write_lock_irq(&tasklist_lock)
> - */
> -unsigned long total_forks;	/* Handle normal Linux uptimes. */
>  int nr_threads;			/* The idle threads do not count.. */
>  
>  int max_threads;		/* tunable limit on nr_threads */
> @@ -1372,7 +1368,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  		nr_threads++;
>  	}
>  
> -	total_forks++;
> +	task_group_new_fork(p);
> +
>  	spin_unlock(&current->sighand->siglock);
>  	write_unlock_irq(&tasklist_lock);
>  	proc_fork_connector(p);
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 89d2248..5a8181e 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -691,6 +691,11 @@ static inline void task_group_account_field(struct task_struct *p,
>  }
>  #endif /* CONFIG_CGROUP_SCHED */
>  
> +void task_group_new_fork(struct task_struct *p)
> +{
> +	task_group_account_field(p, 1, TOTAL_FORKS);
> +}
> +
>  static void update_rq_clock_task(struct rq *rq, s64 delta);
>  
>  static void update_rq_clock(struct rq *rq)
> @@ -9161,6 +9166,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
>  	u64 guest, guest_nice;
>  	u64 sum = 0;
>  	u64 sum_softirq = 0;
> +	u64 total_forks = 0;
>  	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
>  	struct timespec boottime;
>  #ifdef CONFIG_CGROUP_SCHED
> @@ -9201,6 +9207,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
>  		steal += kstat->cpustat[STEAL];
>  		guest += kstat->cpustat[GUEST];
>  		guest_nice += kstat->cpustat[GUEST_NICE];
> +		total_forks += kstat->cpustat[TOTAL_FORKS];
>  		sum += kstat_cpu_irqs_sum(i);
>  		sum += arch_irq_stat_cpu(i);
>  
> @@ -9272,7 +9279,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
>  	seq_printf(p,
>  		"\nctxt %llu\n"
>  		"btime %lu\n"
> -		"processes %lu\n"
> +		"processes %llu\n"
>  		"procs_running %lu\n"
>  		"procs_blocked %lu\n",
>  		nr_context_switches(),


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 08/10] provide a version of cpuacct statistics inside cpu cgroup
  2011-10-02 19:21 ` [PATCH 08/10] provide a version of cpuacct statistics inside cpu cgroup Glauber Costa
@ 2011-10-05  9:10   ` Peter Zijlstra
  2011-10-05 12:16     ` Glauber Costa
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-10-05  9:10 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley,
	Balbir Singh

On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
> +       for_each_online_cpu(cpu) {
> +               struct kernel_stat *kstat = per_cpu_ptr(tg->cpustat, cpu);
> +               val += kstat->cpustat[USER];
> +               val += kstat->cpustat[NICE];
> +       } 

by using online_cpu, you fail to sum whatever data was accrued on
unplugged cpus, is that a problem?



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 09/10] provide a version of cpuusage statistics inside cpu cgroup
  2011-10-02 19:21 ` [PATCH 09/10] provide a version of cpuusage " Glauber Costa
@ 2011-10-05  9:14   ` Peter Zijlstra
  2011-10-05 12:17     ` Glauber Costa
  2011-10-05  9:46   ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-10-05  9:14 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley,
	Balbir Singh

On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
> +       if (tg->cpustat)
> +               free_percpu(tg->cpustat);
> +       if (tg->cpuusage)
> +               free_percpu(tg->cpuusage); 

void free_percpu(void __percpu *ptr)
{
        void *addr;
        struct pcpu_chunk *chunk;
        unsigned long flags;
        int off;

        if (!ptr)
                return;
	...
}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 09/10] provide a version of cpuusage statistics inside cpu cgroup
  2011-10-02 19:21 ` [PATCH 09/10] provide a version of cpuusage " Glauber Costa
  2011-10-05  9:14   ` Peter Zijlstra
@ 2011-10-05  9:46   ` Peter Zijlstra
  2011-10-05 12:22     ` Glauber Costa
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-10-05  9:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley,
	Balbir Singh, Paul Turner

On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
> +/*
> + * charge this task's execution time to its accounting group.
> + *
> + * called with rq->lock held.
> + */
> +static void cpuusage_charge(struct task_struct *tsk, u64 cputime)
> +{
> +       int cpu;
> +
> +#ifdef CONFIG_CGROUP_CPUACCT
> +       struct cpuacct *ca;
> +#endif
> +#ifdef CONFIG_CGROUP_SCHED
> +       struct task_group *tg;
> +#endif
> +       cpu = task_cpu(tsk);
> +
> +       rcu_read_lock();
> +
> +#ifdef CONFIG_CGROUP_CPUACCT
> +       ca = task_ca(tsk);
> +
> +       if (unlikely(!cpuacct_subsys.active))
> +               goto no_cpuacct;
> +
> +       for (; ca; ca = ca->parent) {
> +               u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
> +               *cpuusage += cputime;
> +       }
> +no_cpuacct:
> +#endif
> +
> +#ifdef CONFIG_CGROUP_SCHED
> +       tg = task_group(tsk);
> +       for (; tg; tg = tg->parent) {
> +               u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
> +               *cpuusage += cputime;
> +       }
> +#endif
> +       rcu_read_unlock();
> +} 

I don't actually think we need to do this. tg->se[cpu]->sum_exec_runtime
should contain the same information.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/10] Display /proc/stat information per cgroup
  2011-10-05  8:56   ` Peter Zijlstra
@ 2011-10-05 12:10     ` Glauber Costa
  2011-10-05 12:38       ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Glauber Costa @ 2011-10-05 12:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On 10/05/2011 12:56 PM, Peter Zijlstra wrote:
> On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
>> +struct kernel_stat *task_group_kstat(struct task_struct *p)
>> +{
>> +       struct task_group *tg;
>> +       struct kernel_stat *kstat;
>> +
>> +       rcu_read_lock();
>> +       tg = task_group(p);
>> +       kstat = tg->cpustat;
>> +       rcu_read_unlock();
>> +       return kstat;
>> +}
>
> Who keeps tg alive and kicking while you poke at its (cpustat) member?

* All calls to this function currently pass current as a parameter 
(Okay, maybe it is too generic and we should pass nothing at all, and 
grab current within it)
* rcu_read_lock() guarantees that current will exist during this call, 
and task_group won't change. (right?)



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/10] Display /proc/stat information per cgroup
  2011-10-05  9:04   ` Peter Zijlstra
@ 2011-10-05 12:11     ` Glauber Costa
  0 siblings, 0 replies; 34+ messages in thread
From: Glauber Costa @ 2011-10-05 12:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On 10/05/2011 01:04 PM, Peter Zijlstra wrote:
> On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
>> @@ -3868,10 +3897,14 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
>>    */
>>   void account_steal_time(cputime_t cputime)
>>   {
>> -       u64 *cpustat = kstat_this_cpu->cpustat;
>> +       struct kernel_stat *kstat;
>>          u64 cputime64 = cputime_to_cputime64(cputime);
>> -
>> -       cpustat[STEAL] += cputime64;
>> +#ifdef CONFIG_CGROUP_SCHED
>> +       kstat = this_cpu_ptr(root_task_group.cpustat);
>> +#else
>> +       kstat = __get_cpu_var(kstat);
>> +#endif
>> +       kstat->cpustat[STEAL] += cputime64;
>>   }
>
>> @@ -9111,17 +9179,28 @@ int cpu_cgroup_proc_stat(struct seq_file *p)
>>          jif = boottime.tv_sec;
>>
>>          for_each_possible_cpu(i) {
>> -               user += kstat_this_cpu->cpustat[USER];
>> -               nice += kstat_this_cpu->cpustat[NICE];
>> -               system += kstat_this_cpu->cpustat[SYSTEM];
>> -               idle += kstat_this_cpu->cpustat[IDLE];
>> +               struct kernel_stat *kstat, *idle_kstat;
>> +#ifdef CONFIG_CGROUP_SCHED
>> +               kstat = per_cpu_ptr(tg->cpustat, i);
>> +               idle_kstat = per_cpu_ptr(root_task_group.cpustat, i);
>> +#else
>> +               kstat = per_cpu(kstat, i);
>> +               idle_kstat = kstat;
>> +#endif
>> +
>> +               user += kstat->cpustat[USER];
>> +               nice += kstat->cpustat[NICE];
>> +               system += kstat->cpustat[SYSTEM];
>
> That if-deffery just begs to be killed by some helper function. Both
> variants appear multiple times I think.
Okay, I will clean this up for next submission.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/10] Make total_forks per-cgroup
  2011-10-05  9:05   ` Peter Zijlstra
@ 2011-10-05 12:12     ` Glauber Costa
  2011-10-11 23:45       ` Frederic Weisbecker
  0 siblings, 1 reply; 34+ messages in thread
From: Glauber Costa @ 2011-10-05 12:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley,
	Frederic Weisbecker

On 10/05/2011 01:05 PM, Peter Zijlstra wrote:
> On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
>> This patch counts the total number of forks per-cgroup.
>> The information is propagated to the parent, so the total
>> number of forks in the system, is the parent cgroup's one.
>>
>> To achieve that, total_forks is made per-cpu. There is no
>> particular reason to do that, but by doing this, we are
>> able to bundle it inside the cpustat structure already
>> present.
>
> I think fweisbec is also doing something with forks and cgroups.

I am all ears...

Frederic, does it conflict with what you're doing ?

>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> ---
>>   include/linux/kernel_stat.h |    1 +
>>   include/linux/sched.h       |    1 +
>>   kernel/fork.c               |    7 ++-----
>>   kernel/sched.c              |    9 ++++++++-
>>   4 files changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
>> index 71a69a0..9fb0dda 100644
>> --- a/include/linux/kernel_stat.h
>> +++ b/include/linux/kernel_stat.h
>> @@ -29,6 +29,7 @@ enum cpu_usage_stat {
>>   	GUEST_NICE,
>>   	IDLE_BASE,
>>   	IOWAIT_BASE,
>> +	TOTAL_FORKS,
>>   	NR_STATS,
>>   };
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 64c5ba5..4ba9dde 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2716,6 +2716,7 @@ struct cgroup;
>>   struct cftype;
>>   int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
>>   			 struct seq_file *p);
>> +void task_group_new_fork(struct task_struct *p);
>>   #endif /* __KERNEL__ */
>>
>>   #endif
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 8e6b6f4..ec2b729 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -76,10 +76,6 @@
>>
>>   #include<trace/events/sched.h>
>>
>> -/*
>> - * Protected counters by write_lock_irq(&tasklist_lock)
>> - */
>> -unsigned long total_forks;	/* Handle normal Linux uptimes. */
>>   int nr_threads;			/* The idle threads do not count.. */
>>
>>   int max_threads;		/* tunable limit on nr_threads */
>> @@ -1372,7 +1368,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>   		nr_threads++;
>>   	}
>>
>> -	total_forks++;
>> +	task_group_new_fork(p);
>> +
>>   	spin_unlock(&current->sighand->siglock);
>>   	write_unlock_irq(&tasklist_lock);
>>   	proc_fork_connector(p);
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 89d2248..5a8181e 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -691,6 +691,11 @@ static inline void task_group_account_field(struct task_struct *p,
>>   }
>>   #endif /* CONFIG_CGROUP_SCHED */
>>
>> +void task_group_new_fork(struct task_struct *p)
>> +{
>> +	task_group_account_field(p, 1, TOTAL_FORKS);
>> +}
>> +
>>   static void update_rq_clock_task(struct rq *rq, s64 delta);
>>
>>   static void update_rq_clock(struct rq *rq)
>> @@ -9161,6 +9166,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
>>   	u64 guest, guest_nice;
>>   	u64 sum = 0;
>>   	u64 sum_softirq = 0;
>> +	u64 total_forks = 0;
>>   	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
>>   	struct timespec boottime;
>>   #ifdef CONFIG_CGROUP_SCHED
>> @@ -9201,6 +9207,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
>>   		steal += kstat->cpustat[STEAL];
>>   		guest += kstat->cpustat[GUEST];
>>   		guest_nice += kstat->cpustat[GUEST_NICE];
>> +		total_forks += kstat->cpustat[TOTAL_FORKS];
>>   		sum += kstat_cpu_irqs_sum(i);
>>   		sum += arch_irq_stat_cpu(i);
>>
>> @@ -9272,7 +9279,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
>>   	seq_printf(p,
>>   		"\nctxt %llu\n"
>>   		"btime %lu\n"
>> -		"processes %lu\n"
>> +		"processes %llu\n"
>>   		"procs_running %lu\n"
>>   		"procs_blocked %lu\n",
>>   		nr_context_switches(),
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 08/10] provide a version of cpuacct statistics inside cpu cgroup
  2011-10-05  9:10   ` Peter Zijlstra
@ 2011-10-05 12:16     ` Glauber Costa
  0 siblings, 0 replies; 34+ messages in thread
From: Glauber Costa @ 2011-10-05 12:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley,
	Balbir Singh

On 10/05/2011 01:10 PM, Peter Zijlstra wrote:
> On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
>> +       for_each_online_cpu(cpu) {
>> +               struct kernel_stat *kstat = per_cpu_ptr(tg->cpustat, cpu);
>> +               val += kstat->cpustat[USER];
>> +               val += kstat->cpustat[NICE];
>> +       }
>
> by using online_cpu, you fail to sum whatever data was accrued on
> unplugged cpus, is that a problem?
>
>

Humm... I guess it is. We're effectively replacing a percpu_counter,
which should accumulate anything left on cpus that get unplugged. I'll
have to change that.

Thanks

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 09/10] provide a version of cpuusage statistics inside cpu cgroup
  2011-10-05  9:14   ` Peter Zijlstra
@ 2011-10-05 12:17     ` Glauber Costa
  0 siblings, 0 replies; 34+ messages in thread
From: Glauber Costa @ 2011-10-05 12:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley,
	Balbir Singh

On 10/05/2011 01:14 PM, Peter Zijlstra wrote:
> On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
>> +       if (tg->cpustat)
>> +               free_percpu(tg->cpustat);
>> +       if (tg->cpuusage)
>> +               free_percpu(tg->cpuusage);
>
> void free_percpu(void __percpu *ptr)
> {
>          void *addr;
>          struct pcpu_chunk *chunk;
>          unsigned long flags;
>          int off;
>
>          if (!ptr)
>                  return;
> 	...
> }
Silly me.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 09/10] provide a version of cpuusage statistics inside cpu cgroup
  2011-10-05  9:46   ` Peter Zijlstra
@ 2011-10-05 12:22     ` Glauber Costa
  2011-10-05 12:31       ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Glauber Costa @ 2011-10-05 12:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley,
	Balbir Singh, Paul Turner

On 10/05/2011 01:46 PM, Peter Zijlstra wrote:
> On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
>> +/*
>> + * charge this task's execution time to its accounting group.
>> + *
>> + * called with rq->lock held.
>> + */
>> +static void cpuusage_charge(struct task_struct *tsk, u64 cputime)
>> +{
>> +       int cpu;
>> +
>> +#ifdef CONFIG_CGROUP_CPUACCT
>> +       struct cpuacct *ca;
>> +#endif
>> +#ifdef CONFIG_CGROUP_SCHED
>> +       struct task_group *tg;
>> +#endif
>> +       cpu = task_cpu(tsk);
>> +
>> +       rcu_read_lock();
>> +
>> +#ifdef CONFIG_CGROUP_CPUACCT
>> +       ca = task_ca(tsk);
>> +
>> +       if (unlikely(!cpuacct_subsys.active))
>> +               goto no_cpuacct;
>> +
>> +       for (; ca; ca = ca->parent) {
>> +               u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
>> +               *cpuusage += cputime;
>> +       }
>> +no_cpuacct:
>> +#endif
>> +
>> +#ifdef CONFIG_CGROUP_SCHED
>> +       tg = task_group(tsk);
>> +       for (; tg; tg = tg->parent) {
>> +               u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
>> +               *cpuusage += cputime;
>> +       }
>> +#endif
>> +       rcu_read_unlock();
>> +}
>
> I don't actually think we need to do this. tg->se[cpu]->sum_exec_runtime
> should contain the same information.

Well,

We should at least sum up se and rt_se, right?
Otherwise... I will re-read the code, but at first sight, I think
you are right.




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 09/10] provide a version of cpuusage statistics inside cpu cgroup
  2011-10-05 12:22     ` Glauber Costa
@ 2011-10-05 12:31       ` Peter Zijlstra
  2011-10-05 15:05         ` Glauber Costa
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-10-05 12:31 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley,
	Balbir Singh, Paul Turner

On Wed, 2011-10-05 at 16:22 +0400, Glauber Costa wrote:
> 
> We should at least sum up se and rt_se, right?

Ah yes indeed.

> Otherwise... I will re-read the code, but from a first sight, I think 
> you are right. 

Good, less hierarchy walks -> *win*

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/10] Display /proc/stat information per cgroup
  2011-10-05 12:10     ` Glauber Costa
@ 2011-10-05 12:38       ` Peter Zijlstra
  2011-10-05 12:43         ` Glauber Costa
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-10-05 12:38 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On Wed, 2011-10-05 at 16:10 +0400, Glauber Costa wrote:
> On 10/05/2011 12:56 PM, Peter Zijlstra wrote:
> > On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
> >> +struct kernel_stat *task_group_kstat(struct task_struct *p)
> >> +{
> >> +       struct task_group *tg;
> >> +       struct kernel_stat *kstat;
> >> +
> >> +       rcu_read_lock();
> >> +       tg = task_group(p);
> >> +       kstat = tg->cpustat;
> >> +       rcu_read_unlock();
> >> +       return kstat;
> >> +}
> >
> > Who keeps tg alive and kicking while you poke at its (cpustat) member?
> 
> * All calls to this function currently pass current as a parameter 
> (Okay, maybe it is too generic and we should pass nothing at all, and 
> grab current within it)
> * rcu_read_lock() guarantees that current will exist during this call, 
> and task_group won't change. (right?)

The thing I worry about is:

A (pid n)				B

kstat = task_group_kstat()
					echo n > /cgroup/something-else/pid
					rmdir /cgroup/group-that-had-A
<timer interrupt>
  RCU complete
  <softirq>
    kfree(tg) etc..

kstat->foo++; <-- *BOOM*


The only way to avoid someone moving you around is by holding some
cgroup lock, task->alloc_lock, task->pi_lock or the rq->lock where task
runs. Alternatively keep rcu_read_lock() around the entire kstat usage.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/10] Display /proc/stat information per cgroup
  2011-10-05 12:38       ` Peter Zijlstra
@ 2011-10-05 12:43         ` Glauber Costa
  0 siblings, 0 replies; 34+ messages in thread
From: Glauber Costa @ 2011-10-05 12:43 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On 10/05/2011 04:38 PM, Peter Zijlstra wrote:
> On Wed, 2011-10-05 at 16:10 +0400, Glauber Costa wrote:
>> On 10/05/2011 12:56 PM, Peter Zijlstra wrote:
>>> On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
>>>> +struct kernel_stat *task_group_kstat(struct task_struct *p)
>>>> +{
>>>> +       struct task_group *tg;
>>>> +       struct kernel_stat *kstat;
>>>> +
>>>> +       rcu_read_lock();
>>>> +       tg = task_group(p);
>>>> +       kstat = tg->cpustat;
>>>> +       rcu_read_unlock();
>>>> +       return kstat;
>>>> +}
>>>
>>> Who keeps tg alive and kicking while you poke at its (cpustat) member?
>>
>> * All calls to this function currently pass current as a parameter
>> (Okay, maybe it is too generic and we should pass nothing at all, and
>> grab current within it)
>> * rcu_read_lock() guarantees that current will exist during this call,
>> and task_group won't change. (right?)
>
> The thing I worry about is:
>
> A (pid n)				B
>
> kstat = task_group_kstat()
> 					echo n>  /cgroup/something-else/pid
> 					rmdir /cgroup/group-that-had-A
> <timer interrupt>
>    RCU complete
>    <softirq>
>      kfree(tg) etc..
>
> kstat->foo++;<-- *BOOM*
>
>
> The only way to avoid someone moving you around is by holding some
> cgroup lock, task->alloc_lock, task->pi_lock or the rq->lock where task
> runs. Alternatively keep rcu_read_lock() around the entire kstat usage.
>
>
Well, if I understand it correctly, we'd have to hold the lock around
the entire kstat usage as well, right?

Otherwise it can just explode. So rcu does seem the way to go. I do, 
however, see the problem you are describing.

Maybe we can remove the rcu_read_lock() call and replace it with an RCU
validation check, then patch all callers (including, of course, the
current users of kstat_cpu() all over). What do you think?



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 09/10] provide a version of cpuusage statistics inside cpu cgroup
  2011-10-05 12:31       ` Peter Zijlstra
@ 2011-10-05 15:05         ` Glauber Costa
  0 siblings, 0 replies; 34+ messages in thread
From: Glauber Costa @ 2011-10-05 15:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley,
	Balbir Singh, Paul Turner

On 10/05/2011 04:31 PM, Peter Zijlstra wrote:
> On Wed, 2011-10-05 at 16:22 +0400, Glauber Costa wrote:
>>
>> We should at least sum up se and rt_se, right?
>
> Ah yes indeed.

sched_rt.c:

         curr->se.sum_exec_runtime += delta_exec;

Where did I get this crazy idea about rt_se from? I swear I did not do 
any drugs today...

>> Otherwise... I will re-read the code, but from a first sight, I think
>> you are right.
>
> Good, less hierarchy walks ->  *win*


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/10] Make total_forks per-cgroup
  2011-10-05 12:12     ` Glauber Costa
@ 2011-10-11 23:45       ` Frederic Weisbecker
  2011-10-12  7:35         ` Glauber Costa
  0 siblings, 1 reply; 34+ messages in thread
From: Frederic Weisbecker @ 2011-10-11 23:45 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Peter Zijlstra, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley

On Wed, Oct 05, 2011 at 04:12:00PM +0400, Glauber Costa wrote:
> On 10/05/2011 01:05 PM, Peter Zijlstra wrote:
> >On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
> >>This patch counts the total number of forks per-cgroup.
> >>The information is propagated to the parent, so the total
> >>number of forks in the system, is the parent cgroup's one.
> >>
> >>To achieve that, total_forks is made per-cpu. There is no
> >>particular reason to do that, but by doing this, we are
> >>able to bundle it inside the cpustat structure already
> >>present.
> >
> >I think fweisbec is also doing something with forks and cgroups.
> 
> I am all ears...
> 
> Frederic, does it conflict with what you're doing ?

I don't know if that really conflicts, but I'm working
on a cgroup subsystem that is able to control the number
of tasks running in a cgroup.

It consists of two new files:

* tasks.usage
* tasks.limit

The subsystem rejects any new fork or migration into the
cgroup when tasks.usage > tasks.limit.

So tasks.usage can inform you about the number of tasks
running in the cgroup. It's not strictly the number
of forks because it also counts the tasks that have been
attached to the cgroup.

But something like a tasks.fork file could be implemented
in that subsystem as well.

It depends on what you need.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/10] Make total_forks per-cgroup
  2011-10-11 23:45       ` Frederic Weisbecker
@ 2011-10-12  7:35         ` Glauber Costa
  2011-10-12 12:59           ` Frederic Weisbecker
  0 siblings, 1 reply; 34+ messages in thread
From: Glauber Costa @ 2011-10-12  7:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley

On 10/12/2011 03:45 AM, Frederic Weisbecker wrote:
> On Wed, Oct 05, 2011 at 04:12:00PM +0400, Glauber Costa wrote:
>> On 10/05/2011 01:05 PM, Peter Zijlstra wrote:
>>> On Sun, 2011-10-02 at 23:21 +0400, Glauber Costa wrote:
>>>> This patch counts the total number of forks per-cgroup.
>>>> The information is propagated to the parent, so the total
>>>> number of forks in the system, is the parent cgroup's one.
>>>>
>>>> To achieve that, total_forks is made per-cpu. There is no
>>>> particular reason to do that, but by doing this, we are
>>>> able to bundle it inside the cpustat structure already
>>>> present.
>>>
>>> I think fweisbec is also doing something with forks and cgroups.
>>
>> I am all ears...
>>
>> Frederic, does it conflict with what you're doing ?
>
> I don't know if that really conflicts but I'm working
> on a cgroup subsystem that is able to control the number
> of tasks running in a subsystem.
>
> It consists in two new files added:
>
> * tasks.usage
> * tasks.limit
>
> The subsystem rejects any new fork or migration into the
> cgroup when tasks.usage > tasks.limit
>
> So tasks.usage can inform you about the number of tasks
> running into the cgroup. It's not strictly the number
> of forks because it also counts the tasks that have been
> attached to the cgroup.
>
> But something like a tasks.fork file could be implemented
> in that subsystem as well.
>
> It depends on what you need.

So the specific piece I am working on is to display /proc/stat 
information per-cgroup. One of the many fields it has is total_forks
(it is actually just a small part of the series).
So instead of tracking how many forks the system has in total, I'll
track it per cpu cgroup.

So I don't think we conflict at all. At the very least, IIUC, you are 
planning to account and check *before* a fork happens, right? This 
particular stat is incremented after it has already succeeded.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/10] Make total_forks per-cgroup
  2011-10-12  7:35         ` Glauber Costa
@ 2011-10-12 12:59           ` Frederic Weisbecker
  2011-10-12 12:59             ` Glauber Costa
  0 siblings, 1 reply; 34+ messages in thread
From: Frederic Weisbecker @ 2011-10-12 12:59 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Peter Zijlstra, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley

On Wed, Oct 12, 2011 at 11:35:50AM +0400, Glauber Costa wrote:
> On 10/12/2011 03:45 AM, Frederic Weisbecker wrote:
> >[...]
> 
> So the specific piece I am working on, is to display /proc/stat
> information per-cgroup. One of the many fields it has, is
> total_forks.
> (it is actually just a small part of the series)
> So instead of tracking how many forks the system has in total, I'll
> track it per-cpucgroup.
> 
> So I don't think we conflict at all. At the very least, IIUC, you
> are planning to account and check *before* a fork happens, right?
> This particular stat is incremented after it already succeeded.

That doesn't make much difference, since the accounting is cancelled
if the fork is finally rejected.

But a simple accounting like yours probably involves less
overhead than the whole task counter subsystem.

Is your counting propagated to the parents in a hierarchy?
For example if A is parent cgroup of B and C, does A account the
forks happening in B and C?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/10] Make total_forks per-cgroup
  2011-10-12 12:59           ` Frederic Weisbecker
@ 2011-10-12 12:59             ` Glauber Costa
  2011-10-12 13:03               ` Frederic Weisbecker
  0 siblings, 1 reply; 34+ messages in thread
From: Glauber Costa @ 2011-10-12 12:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley

On 10/12/2011 04:59 PM, Frederic Weisbecker wrote:
> On Wed, Oct 12, 2011 at 11:35:50AM +0400, Glauber Costa wrote:
>> [...]
>
> That doesn't make much difference since the accounting is cancelled
> in case the fork is finally rejected.
>
> But probably having a simple accouting like you do involves less
> overhead than the whole task counter subsystem.
>
> Is your counting propagated to the parents in a hierarchy?
> For example if A is parent cgroup of B and C, does A account the
> forks happening in B and C?

Yes.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/10] Make total_forks per-cgroup
  2011-10-12 12:59             ` Glauber Costa
@ 2011-10-12 13:03               ` Frederic Weisbecker
  2011-10-12 13:03                 ` Glauber Costa
  0 siblings, 1 reply; 34+ messages in thread
From: Frederic Weisbecker @ 2011-10-12 13:03 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Peter Zijlstra, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley

On Wed, Oct 12, 2011 at 04:59:07PM +0400, Glauber Costa wrote:
> On 10/12/2011 04:59 PM, Frederic Weisbecker wrote:
> >[...]
> >Is your counting propagated to the parents in a hierarchy?
> >For example if A is parent cgroup of B and C, does A account the
> >forks happening in B and C?
> 
> Yes.

But only to the first parent or also all ancestors?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/10] Make total_forks per-cgroup
  2011-10-12 13:03               ` Frederic Weisbecker
@ 2011-10-12 13:03                 ` Glauber Costa
  2011-10-12 14:03                   ` Frederic Weisbecker
  0 siblings, 1 reply; 34+ messages in thread
From: Glauber Costa @ 2011-10-12 13:03 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley

On 10/12/2011 05:03 PM, Frederic Weisbecker wrote:
> On Wed, Oct 12, 2011 at 04:59:07PM +0400, Glauber Costa wrote:
>> [...]
>>> Is your counting propagated to the parents in a hierarchy?
>>> For example if A is parent cgroup of B and C, does A account the
>>> forks happening in B and C?
>>
>> Yes.
>
> But only to the first parent or also all ancestors?
It keeps going until it reaches the root task group.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/10] Make total_forks per-cgroup
  2011-10-12 13:03                 ` Glauber Costa
@ 2011-10-12 14:03                   ` Frederic Weisbecker
  0 siblings, 0 replies; 34+ messages in thread
From: Frederic Weisbecker @ 2011-10-12 14:03 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Peter Zijlstra, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley

On Wed, Oct 12, 2011 at 05:03:52PM +0400, Glauber Costa wrote:
> On 10/12/2011 05:03 PM, Frederic Weisbecker wrote:
> >On Wed, Oct 12, 2011 at 04:59:07PM +0400, Glauber Costa wrote:
> >>On 10/12/2011 04:59 PM, Frederic Weisbecker wrote:
> >>>On Wed, Oct 12, 2011 at 11:35:50AM +0400, Glauber Costa wrote:
> >>>>On 10/12/2011 03:45 AM, Frederic Weisbecker wrote:
> >>>Is your counting propagated to the parents in a hierarchy?
> >>>For example if A is parent cgroup of B and C, does A account the
> >>>forks happening in B and C?
> >>
> >>Yes.
> >
> >But only to the first parent or also all ancestors?
> it keeps going until it reaches the root task group.

Using the task counter subsystem for that would involve
a bit more complication, like adding a new counter to it
that would need to be incremented in parallel but differently.
You would also need to bind this subsystem any time you
want this statistic.

I don't think it's worth doing this. You would also get the
overhead in the exit path, because the subsystem also uncharges
one of its counters across the whole hierarchy.

OTOH, it's scary to see that this new feature will walk
the entire hierarchy on every fork when CONFIG_CGROUP_SCHED=y.
And I guess that today this config is easily selected by
distros.

IIUC that walk also happens on every timer interrupt for
the user/system/idle cpu time accounting?

Maybe this should reside in a separate config option, so that those
who don't care about the per-cgroup stats avoid that
overhead?
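
[Editor's sketch of what such a separate config option might look like: a
hypothetical Kconfig fragment, with the symbol name and wording invented
here for illustration.]

```
config CGROUP_PROC_STAT
	bool "Per-cgroup /proc/stat statistics"
	depends on CGROUP_SCHED
	default n
	help
	  Walk the cgroup hierarchy on every fork and on every tick to
	  keep per-cgroup cpustat counters. Say N to avoid this overhead
	  if you do not need per-cgroup /proc/stat information.
```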

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2011-10-12 14:03 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-02 19:21 [PATCH 00/10] Per-cgroup /proc/stat information Glauber Costa
2011-10-02 19:21 ` [PATCH 01/10] trivial: initialize root cgroup's sibling list Glauber Costa
2011-10-02 19:21 ` [PATCH 02/10] Change cpustat fields to an array Glauber Costa
2011-10-02 19:21 ` [PATCH 03/10] Move /proc/stat logic inside sched.c Glauber Costa
2011-10-02 19:21 ` [PATCH 04/10] Display /proc/stat information per cgroup Glauber Costa
2011-10-05  8:56   ` Peter Zijlstra
2011-10-05 12:10     ` Glauber Costa
2011-10-05 12:38       ` Peter Zijlstra
2011-10-05 12:43         ` Glauber Costa
2011-10-05  9:04   ` Peter Zijlstra
2011-10-05 12:11     ` Glauber Costa
2011-10-02 19:21 ` [PATCH 05/10] Make total_forks per-cgroup Glauber Costa
2011-10-05  9:05   ` Peter Zijlstra
2011-10-05 12:12     ` Glauber Costa
2011-10-11 23:45       ` Frederic Weisbecker
2011-10-12  7:35         ` Glauber Costa
2011-10-12 12:59           ` Frederic Weisbecker
2011-10-12 12:59             ` Glauber Costa
2011-10-12 13:03               ` Frederic Weisbecker
2011-10-12 13:03                 ` Glauber Costa
2011-10-12 14:03                   ` Frederic Weisbecker
2011-10-02 19:21 ` [PATCH 06/10] per-cgroup boot time Glauber Costa
2011-10-02 19:21 ` [PATCH 07/10] Report steal time for cgroup Glauber Costa
2011-10-02 19:21 ` [PATCH 08/10] provide a version of cpuacct statistics inside cpu cgroup Glauber Costa
2011-10-05  9:10   ` Peter Zijlstra
2011-10-05 12:16     ` Glauber Costa
2011-10-02 19:21 ` [PATCH 09/10] provide a version of cpuusage " Glauber Costa
2011-10-05  9:14   ` Peter Zijlstra
2011-10-05 12:17     ` Glauber Costa
2011-10-05  9:46   ` Peter Zijlstra
2011-10-05 12:22     ` Glauber Costa
2011-10-05 12:31       ` Peter Zijlstra
2011-10-05 15:05         ` Glauber Costa
2011-10-02 19:21 ` [PATCH 10/10] Change CPUACCT to default n Glauber Costa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox