public inbox for linux-kernel@vger.kernel.org
From: Chris Snook <csnook@redhat.com>
To: Chris Snook <csnook@redhat.com>
Cc: Tong Li <tong.n.li@intel.com>,
	mingo@elte.hu, linux-kernel@vger.kernel.org
Subject: Re: [RFC] scheduler: improve SMP fairness in CFS
Date: Tue, 24 Jul 2007 04:07:59 -0400
Message-ID: <46A5B35F.90103@redhat.com>
In-Reply-To: <46A53C88.6060006@redhat.com>

Chris Snook wrote:
> Tong Li wrote:
>> This patch extends CFS to achieve better fairness for SMPs. For 
>> example, with 10 tasks (same priority) on 8 CPUs, it enables each task 
>> to receive equal CPU time (80%). The code works on top of CFS and 
>> provides SMP fairness at a coarser time granularity; locally, on each 
>> CPU, it relies on CFS to provide fine-grained fairness and good 
>> interactivity.
>>
>> The code is based on the distributed weighted round-robin (DWRR) 
>> algorithm. It keeps two RB trees on each CPU: one is the original 
>> cfs_rq, referred to as active, and one is a new cfs_rq, called 
>> round-expired. Each CPU keeps a round number, initially zero. The 
>> scheduler works exactly the same way as in CFS, but only runs tasks 
>> from the active tree. Each task is assigned a round slice, equal to 
>> its weight times a system constant (e.g., 100ms), controlled by 
>> sysctl_base_round_slice. When a task uses up its round slice, it moves 
>> to the round-expired tree on the same CPU and stops running. Thus, at 
>> any time on each CPU, the active tree contains all tasks that are 
>> running in the current round, while tasks in round-expired have all 
>> finished the current round and await the start of the next round. When an 
>> active tree becomes empty, it calls idle_balance() to grab tasks of 
>> the same round from other CPUs. If none can be moved over, it switches 
>> its active and round-expired trees, thus unleashing round-expired 
>> tasks and advancing the local round number by one. An invariant it 
>> maintains is that the round numbers of any two CPUs in the system 
>> differ by at most one. This property ensures fairness across CPUs. The 
>> variable sysctl_base_round_slice controls fairness-performance 
>> tradeoffs: a smaller value leads to better cross-CPU fairness at the 
>> potential cost of performance; on the other hand, the larger the value 
>> is, the closer the system behavior is to the default CFS without the 
>> patch.
>>
>> Any comments and suggestions would be highly appreciated.
> 
> This patch is massive overkill.  Maybe you're not seeing the overhead on 
> your 8-way box, but I bet we'd see it on a 4096-way NUMA box with a 
> partially-RT workload.  Do you have any data justifying the need for 
> this patch?
> 
> Doing anything globally is expensive, and should be avoided at all 
> costs.  The scheduler already rebalances when a CPU is idle, so you're 
> really just rebalancing the overload here.  On a server workload, we 
> don't necessarily want to do that, since the overload may be multiple 
> threads spawned to service a single request, and could be sharing a lot 
> of data.
> 
> Instead of an explicit system-wide fairness invariant (which will get 
> very hard to enforce when you throw SCHED_FIFO processes into the mix 
> and the scheduler isn't running on some CPUs), try a simpler invariant.  
> If we guarantee that the load on CPU X does not differ from the load on 
> CPU (X+1)%N by more than some small constant, then we know that the 
> system is fairly balanced.  We can achieve global fairness with local 
> balancing, and avoid all this overhead.  This has the added advantage of 
> keeping most of the migrations core/socket/node-local on 
> SMT/multicore/NUMA systems.
> 
>     -- Chris
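
For readers who want the quoted proposal in concrete form, here is a rough
sketch of the per-CPU round bookkeeping it describes: an active and a
round-expired set, a round number, and a round slice equal to weight times a
base constant. It is a toy model, not the patch. The names Task, CpuQueues,
charge(), start_next_round() and ideal_share() are made up for illustration,
weights are assumed to be small integers, the 100ms base comes from the
example in the description, and the cross-CPU idle_balance() step is left out.

```python
from dataclasses import dataclass, field

# Stand-in for sysctl_base_round_slice (the description suggests e.g. 100ms).
BASE_ROUND_SLICE_MS = 100


@dataclass
class Task:
    name: str
    weight: int = 1   # hypothetical small-integer weight
    used_ms: int = 0  # CPU time consumed in the current round

    @property
    def round_slice_ms(self):
        # Round slice = weight * base round slice.
        return self.weight * BASE_ROUND_SLICE_MS


@dataclass
class CpuQueues:
    active: list = field(default_factory=list)
    expired: list = field(default_factory=list)
    round_no: int = 0

    def charge(self, task, ran_ms):
        """Account ran_ms to an active task; expire it once its slice is used."""
        task.used_ms += ran_ms
        if task.used_ms >= task.round_slice_ms:
            self.active.remove(task)
            self.expired.append(task)

    def start_next_round(self):
        """Swap the sets once active is empty and nothing could be pulled over."""
        if self.active:
            raise RuntimeError("active set is not empty")
        for task in self.expired:
            task.used_ms = 0
        self.active, self.expired = self.expired, self.active
        self.round_no += 1


def ideal_share(ntasks, ncpus):
    """Fair CPU fraction per task when all tasks have equal weight."""
    return min(1.0, ncpus / ntasks)
```

With equal weights, ideal_share(10, 8) gives the 80% figure from the
description: 8 CPUs' worth of time divided among 10 tasks.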

To clarify, I'm not suggesting that the "balance with cpu (x+1)%n only" 
algorithm is the only way to do this.  Rather, I'm pointing out that 
even an extremely simple algorithm can give you fair loading when you 
already have CFS managing the runqueues.  There are countless more 
sophisticated ways we could do this without using global locking, or 
possibly without any locking at all, other than the locking we already 
use during migration.
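
As a toy illustration of the neighbour-only idea (not kernel code), the
sketch below treats each CPU's load as a plain task count and keeps moving
one task at a time between CPU x and CPU (x+1)%n until no adjacent pair
differs by more than max_diff. The function name ring_balance and the
max_diff parameter are made up for this example, and a real balancer would
of course use weighted load and run per CPU rather than in one loop.

```python
def ring_balance(loads, max_diff=1):
    """Even out task counts using only comparisons with the ring neighbour.

    loads[x] is the number of runnable tasks on CPU x. Returns a new list in
    which every CPU x and its neighbour (x + 1) % n differ by at most max_diff.
    """
    if max_diff < 1:
        # With integer task counts, a bound of 0 could bounce a task forever.
        raise ValueError("max_diff must be at least 1")
    loads = list(loads)
    n = len(loads)
    moved = True
    while moved:
        moved = False
        for x in range(n):
            y = (x + 1) % n
            # Move one task from the heavier CPU to the lighter one whenever
            # the pair violates the local invariant.
            if loads[x] - loads[y] > max_diff:
                loads[x] -= 1
                loads[y] += 1
                moved = True
            elif loads[y] - loads[x] > max_diff:
                loads[y] -= 1
                loads[x] += 1
                moved = True
    return loads
```

Each move only touches two adjacent CPUs, so no global state is needed. The
loop terminates because every move strictly reduces the sum of squared loads.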

	-- Chris

