From: Paul Jackson <pj@sgi.com>
To: dino@in.ibm.com
Cc: nickpiggin@yahoo.com.au, akpm@osdl.org, mbligh@google.com,
menage@google.com, Simon.Derr@bull.net,
linux-kernel@vger.kernel.org, rohitseth@google.com, holt@sgi.com,
dipankar@in.ibm.com, suresh.b.siddha@intel.com
Subject: Re: [RFC] cpuset: add interface to isolated cpus
Date: Mon, 23 Oct 2006 13:47:30 -0700
Message-ID: <20061023134730.62e791a2.pj@sgi.com>
In-Reply-To: <20061023195011.GB1542@in.ibm.com>
Dinakar wrote:
> This as far as I can tell is the only problem with the current code.
> So I dont see why we need a major rewrite that involves a complete change
> in the approach to the dynamic sched domain implementation.
Nick and I agree that if we can get an adequate automatic partition of
sched domains, without any such explicit 'sched_domain' API, then that
would be better.
Nick keeps hoping we can do this automatically, and I have been fading
in and out of agreement. I have doubts we can do an automatic
partition that is adequate.
Last night, Nick suggested we could do this by partitioning based on
the cpus_allowed masks of the tasks in the system, instead of based on
selected cpusets in the system. We could create a partition any place
that didn't cut across some task's cpus_allowed. This would seem to have
a better chance than basing it on the cpus masks in cpusets - for
example, a task in the top cpuset that was pinned to a single CPU (many
kernel threads fit this description) would no longer impede
partitioning.
Right now, I am working on a posting that spells out an algorithm
to compute such a partitioning, based on all task cpus_allowed masks
in the system. I think that's doable.
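For illustration only, here is a rough user-space sketch (in Python, not kernel code; all names are mine, not an existing interface) of one way such a partitioning could be computed: treat each task's cpus_allowed mask as a constraint that its CPUs must land in the same partition, merge with union-find, and emit the connected components as candidate sched domain partitions.

```python
# Sketch (hypothetical, not kernel code): partition CPUs so that no
# task's cpus_allowed mask spans two partitions.  Each mask forces all
# of its CPUs into one component; the components are the partitions.

def partition_cpus(num_cpus, task_masks):
    """task_masks: iterable of sets of CPU ids (each task's cpus_allowed)."""
    parent = list(range(num_cpus))

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]   # path halving
            c = parent[c]
        return c

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for mask in task_masks:
        cpus = sorted(mask)
        for c in cpus[1:]:
            union(cpus[0], c)   # all CPUs in one mask must share a partition

    groups = {}
    for c in range(num_cpus):
        groups.setdefault(find(c), set()).add(c)
    return sorted(groups.values(), key=min)
```

Note how this sketch exhibits the very weakness described above: a single task allowed to run anywhere (the default for most tasks) collapses everything into one partition, so useful splits appear only on systems already carved up carefully.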
But I doubt it works. I'm afraid it will result in useful partitions
only in the cases we seem to need them least, on systems using cpusets
to nicely carve up a system.
==
Why the heck are we doing this partitioning in the first place?
Putting aside for a moment the specialized needs of the real-time folks
who want to isolate some nodes from any source of jitter, aren't these
sched domain partitions just a workaround for performance issues that
arise from trying to load balance across big sched domains?
Granted, there may be no better way to address this performance issue.
And granted, it may well be my own employer's big honkin' NUMA boxes
that are most in need of this, and I just don't realize it.
But could someone whack me upside the head with a clear-cut instance
where we know we need this partitioning?
In particular, if (extreme example) I have 1024 threads on a 1024 CPU
system, each compute bound and each pinned to a separate CPU, and
nothing else, then do I still need these sched partitions? Or does the
scheduler efficiently handle this case, quickly recognizing that it has
no useful balancing work worth doing?
==
As it stands right now, if I had to place my "final answer" in Jeopardy
on this, I'd vote for something like the patch you describe, which I
take it is much like the sched_domain patch with which I started this
scrum a week ago, minus the 'sched_domain_enabled' flag that I had in
for backwards compatibility. I suspect we agree that we can do without
that flag, and that a single clean long term API outweighs perfect
backward compatibility, in this case.
==
The only twist to your patch I would like you to consider - instead
of a 'sched_domain' flag marking where the partitions go, how about
a flag that tells the kernel it is ok not to load balance tasks in
a cpuset?
Then lower level cpusets could set such a flag, without immediate and
brutal effects on the partitioning of all their parent cpusets. But if
the big top level non-overlapping cpusets were so marked, then we could
partition all the way down to where we were no longer able to do so,
because we hit a cpuset that didn't have this flag set.
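To make the proposed semantics concrete, here is a toy model (Python, hypothetical names; this is a sketch of the idea in this email, not any existing kernel interface): walk the cpuset tree from the top, splitting along child cpusets only where the "ok not to load balance" flag is set, and stopping at any cpuset that lacks the flag.

```python
# Toy model of the proposed "ok not to load balance" flag (names are
# hypothetical).  A cpuset with the flag set may be split into its
# non-overlapping children; one without it must stay a single
# load-balanced partition, no matter what its descendants set.

class Cpuset:
    def __init__(self, cpus, no_balance_ok=False, children=()):
        self.cpus = set(cpus)
        self.no_balance_ok = no_balance_ok
        self.children = list(children)

def sched_partitions(cs):
    """Return the CPU sets the scheduler would balance separately."""
    if not cs.no_balance_ok or not cs.children:
        return [cs.cpus]            # must keep balancing this whole set
    parts = []
    covered = set()
    for child in cs.children:
        parts.extend(sched_partitions(child))
        covered |= child.cpus
    leftover = cs.cpus - covered    # CPUs in no child form one partition
    if leftover:
        parts.append(leftover)
    return parts
```

In this model a lower-level cpuset setting the flag has no effect until its ancestors also set it, which is exactly the "no immediate and brutal effects on parents" property described above.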
I think such a "ok not to load balance tasks in this cpuset" flag
better fits what the users see here. They are being asked to let us
turn off some automatic load balancing, in return for which they get
better performance. I doubt that the phrase "dynamic scheduler domain
partitions" is in the vocabulary of most of our users. More of them
will understand the concept of load balancing automatically moving
tasks to underutilized CPUs, and more of them would be prepared to
trade off load balancing in some top cpusets for better kernel
scheduler performance.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401