Date: Thu, 8 Sep 2011 20:45:07 +0530
From: Srivatsa Vaddagiri
To: Peter Zijlstra
Cc: Paul Turner, Kamalesh Babulal, Vladimir Davydov,
    "linux-kernel@vger.kernel.org", Bharata B Rao, Dhaval Giani,
    Vaidyanathan Srinivasan, Ingo Molnar, Pavel Emelianov
Subject: Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
Message-ID: <20110908151433.GB6587@linux.vnet.ibm.com>
References: <20110607154542.GA2991@linux.vnet.ibm.com>
    <1307529966.4928.8.camel@dhcp-10-30-22-158.sw.ru>
    <20110608163234.GA23031@linux.vnet.ibm.com>
    <20110610181719.GA30330@linux.vnet.ibm.com>
    <20110615053716.GA390@linux.vnet.ibm.com>
    <20110907152009.GA3868@linux.vnet.ibm.com>
    <1315423342.11101.25.camel@twins>
In-Reply-To: <1315423342.11101.25.camel@twins>

* Peter Zijlstra [2011-09-07 21:22:22]:

> On Wed, 2011-09-07 at 20:50 +0530, Srivatsa Vaddagiri wrote:
> >
> > Fix excessive idle time reported when cgroups are capped.
>
> Where from? The whole idea of bandwidth caps is to introduce idle time,
> so what's excessive and where does it come from?

We have set up the cgroups and their hard limits such that, in theory, they
should consume the entire capacity available on the machine, leading to 0%
idle time. That's not what we see. A more detailed description of the setup
and the problem is here:

https://lkml.org/lkml/2011/6/7/352

but to quickly summarize it, the machine and the test case are as below:

Machine : 16 CPUs (2 quad-core packages w/ HT enabled)
Cgroups : 5 in number (C1-C5), having {2, 2, 4, 8, 16} tasks respectively.
          Further, each task is placed in its own (sub-)cgroup with a
          capped usage of 50% CPU.

        /C1/C1_1/Task1   -> capped at 50% cpu usage
        /C1/C1_2/Task2   -> capped at 50% cpu usage
        /C2/C2_1/Task3   -> capped at 50% cpu usage
        /C2/C2_2/Task4   -> capped at 50% cpu usage
        /C3/C3_1/Task5   -> capped at 50% cpu usage
        /C3/C3_2/Task6   -> capped at 50% cpu usage
        /C3/C3_3/Task7   -> capped at 50% cpu usage
        /C3/C3_4/Task8   -> capped at 50% cpu usage
        ...
        /C5/C5_16/Task32 -> capped at 50% cpu usage

So we have 32 tasks, each capped at 50% CPU usage, running on a 16-CPU
system. One would expect 0% idle time in this scenario, which was found not
to be the case. With early versions of CFS hard limits, up to ~20% idle time
was seen; with the current version in tip, we see up to ~10% idle time (when
cfs.period = 100ms), which goes down to ~5% when cfs.period is set to 500ms.

From what I could find out, the "excess" idle time crops up because the
load balancer is not perfect. For example, there are instances when a CPU
has just 1 task on its runqueue (rather than the ideal 2 tasks/cpu). When
that lone task exceeds its 50% limit, the CPU is forced to become idle.
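Since each task sits alone in its sub-cgroup, the 50% cap is simply quota =
half the period on that group's cpu controller. As a concrete illustration
(a minimal standalone sketch, not part of the test harness: the /cgroup/cpu
mount point and the choice of C1/C1_1 are assumptions; the cpu.cfs_period_us
and cpu.cfs_quota_us files are the bandwidth-control interface, and the
100ms period matches the runs above):

#include <stdio.h>

/* Write a single integer value into a cgroup control file. */
static int write_u64(const char *path, unsigned long long val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%llu\n", val);
        return fclose(f);
}

int main(void)
{
        /* 50ms of runtime every 100ms period => 50% of one CPU for this group. */
        write_u64("/cgroup/cpu/C1/C1_1/cpu.cfs_period_us", 100000);
        write_u64("/cgroup/cpu/C1/C1_1/cpu.cfs_quota_us", 50000);
        return 0;
}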
> > (or "grace") time which is the surplus > > time/bandwidth each cgroup is allowed to consume, subject to a maximum > > steal time (sched_cfs_max_steal_time_us). Cgroups are allowed this "steal" > > or "grace" time when the lone task running on a cpu is about to be throttled. > > Ok, so this is a solution to an unstated problem. Why is it a good > solution? I am not sure if there are any "good" solutions to this problem! One possibility is to make the idle load balancer become aggressive in pulling tasks across sched-domain boundaries i.e when a CPU becomes idle (after a task got throttled) and invokes the idle load balancer, it should try "harder" at pulling a task from far-off cpus (across package/node boundaries)? > Also, another tunable, yay! - vatsa