Re: [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org, Mel Gorman <mgorman@suse.de>
Subject: Re: [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node
Date: Tue, 16 Jun 2015 22:49:59 +0530	[thread overview]
Message-ID: <20150616171959.GA1976@linux.vnet.ibm.com> (raw)
In-Reply-To: <55803898.6030007@redhat.com>

> > @@ -1503,7 +1503,7 @@ static int task_numa_migrate(struct task_struct *p)
> >  			/* Only consider nodes where both task and groups benefit */
> >  			taskimp = task_weight(p, nid, dist) - taskweight;
> >  			groupimp = group_weight(p, nid, dist) - groupweight;
> > -			if (taskimp < 0 && groupimp < 0)
> > +			if (taskimp < 0 || groupimp < 0)
> >  				continue;
>
> NAK
>
> Here is the simplest possible example of a workload where
> the above change breaks things.

In which case, shouldnt we be updating the comment above.
The comment says. "/* Only consider nodes where both task and groups
benefit */". From the comment, I felt taskimp and groupimp should be
positive.

The current code is a bit confusing. There are 3 possible cases where
the current code allows us to do further updates.
1. task and group faults improving which is a good case.
2. task faults improve, group faults decrease.
3. task faults decrease, group faults increase. (mostly a good case)

case 2 and case 3 are somewhat contradictory. But our actions seem to be
the same for both. Shouldnt we have given preference to groupimp here ..
i.e shouldnt we be not updating in case 2.

Would a below check be okay?
if (groupimp < 0 || (taskimp < 0 && !groupimp))

<snipped>
>
> However, because a majority of memory accesses for
> each thread are local, your above change will result
> in the kernel not evaluating the best node for the
> process (due to a negative taskimp).
>
> Your patch works fine with the autonuma benchmark
> tests, because they do essentially all shared
> faults, not a more realistic mix of shared and private
> faults like you would see in eg. SPECjbb2005, or in a
> database or virtual machine workload.

IIUC numa02 has only private accesses while numa01 has mostly shared
faults.  I plan to run SPECjbb2005 and see how they vary with the
changes.

>
> > @@ -1519,16 +1519,9 @@ static int task_numa_migrate(struct task_struct *p)
> >  	 * and is migrating into one of the workload's active nodes, remember
> >  	 * this node as the task's preferred numa node, so the workload can
> >  	 * settle down.
> > -	 * A task that migrated to a second choice node will be better off
> > -	 * trying for a better one later. Do not set the preferred node here.
> >  	 */
> >  	if (p->numa_group) {
> > -		if (env.best_cpu == -1)
> > -			nid = env.src_nid;
> > -		else
> > -			nid = env.dst_nid;
> > -
> > -		if (node_isset(nid, p->numa_group->active_nodes))
> > +		if (env.dst_nid != p->numa_preferred_nid)
> >  			sched_setnuma(p, env.dst_nid);
> >  	}
>
> NAK
>
> This can also break group convergence, by setting the
> task's preferred nid to the node of a CPU it may not
> even be migrating to.

I havent changed the parameters to sched_setnuma. Previously also
sched_setnuma was getting set to env.dst_nid. Whats changed is
evaluating nid based on env.best_cpu and using the nid to see if we
should be setting env.dst_nid as preferred node.

With this change, we are setting the preferred node more often, but we
are not changing the logic of which node gets set as preferred node.

I did wonder if the intent was to set nid as the preferred node. i.e

+			sched_setnuma(p, env.dst_nid);
-			sched_setnuma(p, nid);

>
> Setting the preferred nid to a node the task should
> not be migrating to (but may, because the better ones
> are currently overloaded) prevents the task from moving
> to its preferred nid at a later time (when the good
> nodes are no longer overloaded).
>
> Have you tested this patch with any workload that does
> not consist of tasks that are running at 100% cpu time
> for the duration of the test?
>

Okay, I will try to see if I can run some workloads and get back to you.

--
Thanks and Regards
Srikar Dronamraju

next prev parent reply	other threads:[~2015-06-16 17:21 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-16 11:55 [PATCH v2 0/4] Improve numa load balancing Srikar Dronamraju
2015-06-16 11:55 ` [PATCH v2 1/4] sched/tip:Prefer numa hotness over cache hotness Srikar Dronamraju
2015-07-06 15:50   ` [tip:sched/core] sched/numa: Prefer NUMA " tip-bot for Srikar Dronamraju
2015-07-07  0:19     ` Srikar Dronamraju
2015-07-08 13:31       ` Srikar Dronamraju
2015-07-07  6:49   ` tip-bot for Srikar Dronamraju
2015-06-16 11:56 ` [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity Srikar Dronamraju
2015-06-16 14:39   ` Rik van Riel
2015-06-22 16:29     ` Srikar Dronamraju
2015-06-23  1:18       ` Rik van Riel
2015-06-23  8:10       ` Ingo Molnar
2015-06-23 13:01         ` Srikar Dronamraju
2015-06-23 14:36           ` Ingo Molnar
2015-07-06 15:50   ` [tip:sched/core] sched/numa: Consider 'imbalance_pct' when comparing loads in numa_has_capacity() tip-bot for Srikar Dronamraju
2015-07-07  6:49   ` tip-bot for Srikar Dronamraju
2015-06-16 11:56 ` [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node Srikar Dronamraju
2015-06-16 14:54   ` Rik van Riel
2015-06-16 17:19     ` Srikar Dronamraju [this message]
2015-06-16 18:25       ` Rik van Riel
2015-06-16 17:18   ` Rik van Riel
2015-06-16 11:56 ` [PATCH v2 4/4] sched:Use correct nid while evaluating task weights Srikar Dronamraju
2015-06-16 15:00   ` Rik van Riel
2015-06-16 17:26     ` Srikar Dronamraju

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150616171959.GA1976@linux.vnet.ibm.com \
    --to=srikar@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox