From: Rik van Riel <riel@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, mgorman@suse.de,
chegu_vinod@hp.com, mingo@kernel.org, efault@gmx.de,
vincent.guittot@linaro.org
Subject: Re: [PATCH RFC 4/5] sched,numa: calculate node scores in complex NUMA topologies
Date: Mon, 13 Oct 2014 03:15:15 -0400
Message-ID: <543B7C03.1080002@redhat.com>
In-Reply-To: <20141012145358.GH3015@worktop>

On 10/12/2014 10:53 AM, Peter Zijlstra wrote:
> On Wed, Oct 08, 2014 at 03:37:29PM -0400, riel@redhat.com wrote:
>> From: Rik van Riel <riel@redhat.com>
>>
>> In order to do task placement on systems with complex NUMA topologies,
>> it is necessary to count the faults on nodes nearby the node that is
>> being examined for a potential move.
>>
>> In case of a system with a backplane interconnect, we are dealing with
>> groups of NUMA nodes; each of the nodes within a group is the same number
>> of hops away from nodes in other groups in the system. Optimal placement
>> on this topology is achieved by counting all nearby nodes equally. When
>> comparing nodes A and B at distance N, nearby nodes are those at distances
>> smaller than N from nodes A or B.
>>
>> Placement strategy on a system with a glueless mesh NUMA topology needs
>> to be different, because there are no natural groups of nodes determined
>> by the hardware. Instead, when dealing with two nodes A and B at distance
>> N, N >= 2, there will be intermediate nodes at distance < N from both nodes
>> A and B. Good placement can be achieved by right shifting the faults on
>> nearby nodes by the number of hops from the node being scored. In this
>> context, a nearby node is any node less than the maximum distance in the
>> system away from the node. Nodes at the maximum distance are skipped for
>> efficiency reasons; there is no real policy reason to do so.
>
>
>> +/* Handle placement on systems where not all nodes are directly connected. */
>> +static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
>> +					int hoplimit, bool task)
>> +{
>> +	unsigned long score = 0;
>> +	int node;
>> +
>> +	/*
>> +	 * All nodes are directly connected, and the same distance
>> +	 * from each other. No need for fancy placement algorithms.
>> +	 */
>> +	if (sched_numa_topology_type == NUMA_DIRECT)
>> +		return 0;
>> +
>> +	for_each_online_node(node) {
>
>> +	}
>> +
>> +	return score;
>> +}
>
>> @@ -944,6 +1003,8 @@ static inline unsigned long task_weight(struct task_struct *p, int nid,
>>  		return 0;
>>
>>  	faults = task_faults(p, nid);
>> +	faults += score_nearby_nodes(p, nid, hops, true);
>> +
>>  	return 1000 * faults / total_faults;
>>  }
>
>> @@ -961,6 +1022,8 @@ static inline unsigned long group_weight(struct task_struct *p, int nid,
>>  		return 0;
>>
>>  	faults = group_faults(p, nid);
>> +	faults += score_nearby_nodes(p, nid, hops, false);
>> +
>>  	return 1000 * faults / total_faults;
>>  }
>
> So this makes {task,group}_weight() O(nr_nodes), and we call these
> function from O(nr_nodes) loops, giving a total of O(nr_nodes^2)
> computational complexity, right?
>
> Seems important to mention; I realize this is only for !DIRECT, but
> still, I bet the real large people (those same 512 nodes we had
> previous) would not really appreciate this.
>

If desired, we can set up a nodemask for each node that
covers only the nodes that are less than the maximum
distance away from it.

Even on the very large systems, that is likely to be
fewer than a dozen nodes.

I am not sure whether to add that complexity now, or
whether that should be considered premature optimization :)
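
For illustration only, here is a rough userspace sketch of the per-node
nodemask idea. It is not kernel code and not part of the patch. Everything
in it is an assumption made for the example: the names (hops, max_hops,
nearby_mask, build_ring, build_nearby_masks, score_nearby), the 8-node ring
topology, and the use of a plain uint64_t bitmask in place of a real
nodemask_t. The scoring step uses the right shift by hop count that the
changelog describes for glueless mesh topologies.

```c
#include <assert.h>
#include <stdint.h>

#define NR_NODES 8	/* hypothetical node count; must fit in a uint64_t mask */

/* Hypothetical hop table: hops[a][b] is the hop count between nodes a and b. */
static int hops[NR_NODES][NR_NODES];
static int max_hops;

/* One bitmask per node: bit b is set if node b is "nearby" that node. */
static uint64_t nearby_mask[NR_NODES];

/* Fill the hop table for a simple ring, purely as example input. */
static void build_ring(void)
{
	max_hops = 0;
	for (int a = 0; a < NR_NODES; a++) {
		for (int b = 0; b < NR_NODES; b++) {
			int d = a > b ? a - b : b - a;

			if (d > NR_NODES - d)
				d = NR_NODES - d;
			hops[a][b] = d;
			if (d > max_hops)
				max_hops = d;
		}
	}
}

/*
 * Done once, at topology setup time: for each node, remember which other
 * nodes are less than the maximum distance away from it.
 */
static void build_nearby_masks(void)
{
	for (int a = 0; a < NR_NODES; a++) {
		nearby_mask[a] = 0;
		for (int b = 0; b < NR_NODES; b++) {
			if (b != a && hops[a][b] < max_hops)
				nearby_mask[a] |= UINT64_C(1) << b;
		}
	}
}

/*
 * Score the nodes near nid by walking only the bits set in its mask.
 * Faults on a nearby node are right shifted by its hop count from nid.
 */
static unsigned long score_nearby(int nid, const unsigned long *faults)
{
	unsigned long score = 0;
	uint64_t mask;

	assert(nid >= 0 && nid < NR_NODES);
	mask = nearby_mask[nid];

	for (int node = 0; mask; node++, mask >>= 1) {
		if (mask & 1)
			score += faults[node] >> hops[nid][node];
	}
	return score;
}
```

In this toy ring, node 0 ends up with six nearby nodes (mask 0xEE), since
only node 4 is at the maximum distance of 4 hops. The per-call cost then
depends on the number of nearby nodes rather than on the number of online
nodes, which is the point of the nodemask.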