Date: Thu, 14 Oct 2010 15:10:14 +0530
From: Bharata B Rao <bharata@linux.vnet.ibm.com>
To: Paul Turner
Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, Dhaval Giani,
 Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
 Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
 Avi Kivity, Chris Friesen, Paul Menage, Mike Waychison, Nikhil Rao
Subject: Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
Message-ID: <20101014094014.GE3874@in.ibm.com>
In-Reply-To: <1287047996.29097.173.camel@twins>

On Thu, Oct 14, 2010 at 02:27:02AM -0700, Paul Turner wrote:
> On Thu, Oct 14, 2010 at 2:19 AM, Peter Zijlstra wrote:
> > On Tue, 2010-10-12 at 13:21 +0530, Bharata B Rao wrote:
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> +       {
> >> +               .procname       = "sched_cfs_bandwidth_slice_us",
> >> +               .data           = &sysctl_sched_cfs_bandwidth_slice,
> >> +               .maxlen         = sizeof(unsigned int),
> >> +               .mode           = 0644,
> >> +               .proc_handler   = proc_dointvec_minmax,
> >> +               .extra1         = &one,
> >> +       },
> >> +#endif
> >
> > So this is basically your scalability knob..
> > the larger this value, the less frequently we have to access global
> > state, but the less parallelism is possible due to fewer CPUs depleting
> > the total quota, leaving nothing for the others.
>
> Exactly
>
> > I guess one could go try and play load-balancer games to try and
> > mitigate this by pulling this group's tasks to the CPU(s) that have more
> > bandwidth for that group, but balancing that against the regular
> > load-balancer goal of balancing load well will undoubtedly be
> > 'interesting'...
>
> I considered this approach as an alternative previously, but I don't
> think it can be enacted effectively:
>
> Since quota will likely expire in a staggered fashion, you're going to
> get a funnel-herd effect as everything is crowded onto the CPUs with
> remaining quota.
>
> It's much more easily avoided by keeping the slice small enough
> (relative to the bandwidth period) that we're not potentially
> stranding a significant percentage of our quota. The potential for
> abuse could be eliminated/reduced here by making the slice size a
> constant ratio relative to the period length. This would also make
> possible parallelism more deterministic.

You can see from the numbers I posted in 0/7 how the current default of a
10ms slice can lead to a large amount of stranded quota, and how that can
affect the runtime obtained by the tasks. So reducing the slice size should
help, but it will still be a problem on a large system with a huge number
of CPUs, where each CPU claims a slice and does not use it fully.

Regards,
Bharata.