On 05/26/2012 08:44 AM, Paul Turner wrote:
> On 04/09/2012 03:25 PM, Glauber Costa wrote:
>> In the interest of providing a per-cgroup figure of common statistics,
>> this patch adds a nr_switches counter to each group runqueue (both cfs
>> and rt).
>>
>> To avoid impact on schedule(), we don't walk the tree at stat gather
>> time. This is because schedule() is called much more frequently than
>> the tick functions, in which we do walk the tree.
>>
>> When this figure needs to be read (different patch), we will
>> aggregate them at read time.
>>

Paul,

How about the following patch instead? It still uses the cfs_rq and
rt_rq structures (this code actually only touches fair.c as a PoC; rt
would be similar).

Tasks in the root cgroup (without an se->parent) will do a branch and
exit. For the others, we accumulate here and simplify the reader.

My reasoning for this is based on the fact that all the se->parent
relations should be cached by our recent call to put_prev_task (well,
unless of course we have a really big chain).

This would incur a slightly higher context switch time for tasks inside
a cgroup.

The reader (in a different patch) would then be the same as the others:

+static u64 tg_nr_switches(struct task_group *tg, int cpu)
+{
+	if (tg != &root_task_group)
+		return rt_rq(rt_nr_switches, tg, cpu) +
+		       fair_rq(nr_switches, tg, cpu);
+
+	return cpu_rq(cpu)->nr_switches;
+}

I plan to measure this today, but an extra branch cost for the common
case of a task in the root cgroup, plus O(depth) for tasks inside
cgroups, may be acceptable given the simplification it brings.

Let me know what you think.
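For illustration only, here is a minimal userspace sketch of the
accumulation scheme described above: at context switch time, walk the
se->parent chain and bump a per-group counter, with an early exit for
tasks in the root group (whose count already lives in
rq->nr_switches). The names here (struct entity, account_switch,
rq_nr_switches) are stand-ins for the real scheduler structures, not
the actual patch:

#include <stdio.h>

struct entity {
	struct entity *parent;      /* NULL for tasks in the root group */
	unsigned long nr_switches;  /* stands in for cfs_rq->nr_switches */
};

/* Stands in for cpu_rq(cpu)->nr_switches, already kept by schedule(). */
static unsigned long rq_nr_switches;

static void account_switch(struct entity *se)
{
	rq_nr_switches++;           /* rq-wide count, as schedule() does */

	if (!se->parent)            /* root group: one branch, then exit */
		return;

	/* O(depth) walk, accumulating into each ancestor group */
	for (se = se->parent; se; se = se->parent)
		se->nr_switches++;
}

int main(void)
{
	struct entity root_task = { .parent = NULL };
	struct entity group_a   = { .parent = NULL };      /* top-level cgroup */
	struct entity group_b   = { .parent = &group_a };  /* nested cgroup */
	struct entity task_in_b = { .parent = &group_b };

	account_switch(&root_task);  /* only the rq-wide counter moves */
	account_switch(&task_in_b);  /* bumps group_b and group_a */

	printf("rq=%lu a=%lu b=%lu\n",
	       rq_nr_switches, group_a.nr_switches, group_b.nr_switches);
	return 0;                    /* prints: rq=2 a=1 b=1 */
}

The trade-off matches the reasoning above: O(depth) work in the switch
path buys an O(1), single-counter read per group on the stat side.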