Re: scheduler scalability - cgroups, cpusets and load-balancing

public inbox for linux-kernel@vger.kernel.org

From: Paul Jackson <pj@sgi.com>
To: "Gregory Haskins" <ghaskins@novell.com>
Cc: a.p.zijlstra@chello.nl, mingo@elte.hu,
	dmitry.adamushko@gmail.com, rostedt@goodmis.org,
	menage@google.com, rientjes@google.com, tong.n.li@intel.com,
	tglx@linutronix.de, akpm@linux-foundation.org,
	dhaval@linux.vnet.ibm.com, vatsa@linux.vnet.ibm.com,
	sgrubb@redhat.com, linux-kernel@vger.kernel.org,
	ebiederm@xmission.com, nickpiggin@yahoo.com.au
Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing
Date: Tue, 29 Jan 2008 10:28:36 -0600
Message-ID: <20080129102836.be614579.pj@sgi.com>
In-Reply-To: <479F01AF.BA47.005A.0@novell.com>

Gregory wrote:
>   I am a bit confused as to why you disable load-balancing in the
>   RT cpuset?  It shouldn't be strictly necessary in order for the
>   RT scheduler to do its job (unless I am misunderstanding what you
>   are trying to accomplish?).  Do you do this because you *have*
>   to in order to make real-time deadlines, or because its just a
>   further optimization?

My primary motivation for cpusets originally, and for the
sched_load_balance flag now, was not realtime, but "soft partitioning"
of big NUMA systems, especially for batch schedulers.  They sometimes
have large cpusets which are only being used to hold smaller, per-job,
cpusets.  It is a waste of time (CPU cycles in the kernel sched code)
to load balance those large cpusets.  Load balancing doesn't scale
easily to high CPU counts, and it's nice to avoid doing that where
not needed.

See the following lkml message for a fuller explanation:

  http://lkml.org/lkml/2008/1/29/85

As a secondary motivation, I thought that disabling load balancing on
the RT cpuset was the right thing to do for RT needs, but I make no
claim to knowing much about RT.

I just now realized that you added a 'root_domain' in a patch in
late Nov and early Dec.   I was on the road then, moving from
California to Texas, and not paying much attention to Linux.

A couple of questions on that patch, both involving a comment it adds
to kernel/sched.c:

/*
 * We add the notion of a root-domain which will be used to define per-domain
 * variables. Each exclusive cpuset essentially defines an island domain by
 * fully partitioning the member cpus from any other cpuset. Whenever a new
 * exclusive cpuset is created, we also create and attach a new root-domain
 * object.
 */

1) What are 'per-domain' variables?

2) The mention of 'exclusive cpuset' is no longer correct.

   With the patch 'remove sched domain hooks from cpusets' cpusets
   no longer defines sched domains using the cpu_exclusive flag.

   With the subsequent sched_load_balance patch (see
   http://lkml.org/lkml/2007/10/6/19) cpusets uses a new per-cpuset
   flag 'sched_load_balance' to define sched domains.

The following revised comment might be more accurate:

/*
 * We add the notion of a root-domain which will be used to define per-domain
 * variables.  Each non-overlapping sched domain defines an island domain by
 * fully partitioning the member cpus from any other cpuset.  Whenever such
 * a sched domain is created, we also create and attach a new root-domain
 * object.  These non-overlapping sched domains are determined by the cpuset
 * configuration, via a call to partition_sched_domains().
 */

It sounds like you (Gregory, others) want your RT CPUs to be in a sched
domain, unlike the current arrangement, in which my cpuset code
carefully avoids setting up a sched domain for those CPUs.  However, I
still need, in the batch scheduler case explained above, to have
some CPUs not in any sched domain.
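
To make the "some CPUs not in any sched domain" case concrete, here is a
minimal user-space sketch of how non-overlapping partitions could be
derived from the per-cpuset sched_load_balance flag.  This is not the
kernel's code: struct cpuset_model, build_partitions() and
cpus_without_domain() are hypothetical names invented for illustration,
and the 64-bit mask stands in for a real cpumask.  The assumption is only
that cpusets with the flag set which share CPUs get merged into one
partition, and that cpusets with the flag clear contribute nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a cpuset: a CPU bitmask plus its flag. */
struct cpuset_model {
    uint64_t cpus;           /* bit i set => CPU i is in this cpuset */
    int sched_load_balance;  /* nonzero => balance load over these CPUs */
};

/*
 * Merge the masks of all load-balanced cpusets into non-overlapping
 * partitions.  Stores the partitions in out[] (capacity max_out) and
 * returns how many were produced.  CPUs that appear only in cpusets
 * with the flag clear end up in no partition.
 */
size_t build_partitions(const struct cpuset_model *cs, size_t n,
                        uint64_t *out, size_t max_out)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (!cs[i].sched_load_balance || cs[i].cpus == 0)
            continue;

        uint64_t merged = cs[i].cpus;
        size_t j = 0;

        /* Absorb every existing partition that shares a CPU with us. */
        while (j < count) {
            if (out[j] & merged) {
                merged |= out[j];
                out[j] = out[--count];  /* drop it; recheck slot j */
            } else {
                j++;
            }
        }

        assert(count < max_out);
        out[count++] = merged;
    }
    return count;
}

/* CPUs in all_cpus that are covered by no partition. */
uint64_t cpus_without_domain(uint64_t all_cpus, const uint64_t *parts,
                             size_t count)
{
    uint64_t covered = 0;

    for (size_t i = 0; i < count; i++)
        covered |= parts[i];
    return all_cpus & ~covered;
}
```

In this model, an 8-CPU system whose top cpuset has the flag clear, with
flagged job cpusets on CPUs 0-1, 2-3 and 1-2, yields a single partition
covering CPUs 0-3; CPUs 4-7 are left in no partition, which is the batch
scheduler situation described above.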

If you require these RT sched domains to be set up differently somehow,
in some way that is visible to partition_sched_domains, then that
apparently means we need a per-cpuset flag to mark those RT cpusets.

If you just want an ordinary sched domain setup (just so long as it
contains only the intended RT CPUs, not others) then I guess we don't
technically need any more per-cpuset flags, but I'm worried, because
the API we're presenting to users for this has just gone from subtle to
bizarre.  I suspect I'll want to add a flag anyway, if by doing so, I
can make the kernel-user API, via cpusets, easier to understand.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
