[PATCH 4.12 23/27] sched/numa: Implement NUMA node level wake_affine()

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Rik van Riel <riel@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mel Gorman <mgorman@suse.de>, Mike Galbraith <efault@gmx.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	jhladky@redhat.com, Ingo Molnar <mingo@kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH 4.12 23/27] sched/numa: Implement NUMA node level wake_affine()
Date: Mon, 10 Jul 2017 19:10:20 +0200	[thread overview]
Message-ID: <20170710170104.863724549@linuxfoundation.org> (raw)
In-Reply-To: <20170710170103.831324799@linuxfoundation.org>

4.12-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Rik van Riel <riel@redhat.com>

commit 3fed382b46baac83703130fe4cd3d9147f427fb9 upstream.

Since select_idle_sibling() can place a task anywhere on a socket,
comparing loads between individual CPU cores makes no real sense
for deciding whether to do an affine wakeup across sockets, either.

Instead, compare the load between the sockets in a similar way the
load balancer and the numa balancing code do.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-4-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 kernel/sched/fair.c |  130 ++++++++++++++++++++++++++++------------------------
 1 file changed, 71 insertions(+), 59 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2585,6 +2585,60 @@ void task_tick_numa(struct rq *rq, struc
 		}
 	}
 }
+
+/*
+ * Can a task be moved from prev_cpu to this_cpu without causing a load
+ * imbalance that would trigger the load balancer?
+ */
+static inline bool numa_wake_affine(struct sched_domain *sd,
+				    struct task_struct *p, int this_cpu,
+				    int prev_cpu, int sync)
+{
+	struct numa_stats prev_load, this_load;
+	s64 this_eff_load, prev_eff_load;
+
+	update_numa_stats(&prev_load, cpu_to_node(prev_cpu));
+	update_numa_stats(&this_load, cpu_to_node(this_cpu));
+
+	/*
+	 * If sync wakeup then subtract the (maximum possible)
+	 * effect of the currently running task from the load
+	 * of the current CPU:
+	 */
+	if (sync) {
+		unsigned long current_load = task_h_load(current);
+
+		if (this_load.load > current_load)
+			this_load.load -= current_load;
+		else
+			this_load.load = 0;
+	}
+
+	/*
+	 * In low-load situations, where this_cpu's node is idle due to the
+	 * sync cause above having dropped this_load.load to 0, move the task.
+	 * Moving to an idle socket will not create a bad imbalance.
+	 *
+	 * Otherwise check if the nodes are near enough in load to allow this
+	 * task to be woken on this_cpu's node.
+	 */
+	if (this_load.load > 0) {
+		unsigned long task_load = task_h_load(p);
+
+		this_eff_load = 100;
+		this_eff_load *= prev_load.compute_capacity;
+
+		prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+		prev_eff_load *= this_load.compute_capacity;
+
+		this_eff_load *= this_load.load + task_load;
+		prev_eff_load *= prev_load.load - task_load;
+
+		return this_eff_load <= prev_eff_load;
+	}
+
+	return true;
+}
 #else
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
@@ -2597,6 +2651,13 @@ static inline void account_numa_enqueue(
 static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
 {
 }
+
+static inline bool numa_wake_affine(struct sched_domain *sd,
+				    struct task_struct *p, int this_cpu,
+				    int prev_cpu, int sync)
+{
+	return true;
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
@@ -5386,74 +5447,25 @@ static int wake_wide(struct task_struct
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int prev_cpu, int sync)
 {
-	s64 this_load, load;
-	s64 this_eff_load, prev_eff_load;
-	int idx, this_cpu;
-	struct task_group *tg;
-	unsigned long weight;
-	int balanced;
-
-	idx	  = sd->wake_idx;
-	this_cpu  = smp_processor_id();
-	load	  = source_load(prev_cpu, idx);
-	this_load = target_load(this_cpu, idx);
+	int this_cpu = smp_processor_id();
+	bool affine = false;
 
 	/*
 	 * Common case: CPUs are in the same socket, and select_idle_sibling()
 	 * will do its thing regardless of what we return:
 	 */
 	if (cpus_share_cache(prev_cpu, this_cpu))
-		return true;
-
-	/*
-	 * If sync wakeup then subtract the (maximum possible)
-	 * effect of the currently running task from the load
-	 * of the current CPU:
-	 */
-	if (sync) {
-		tg = task_group(current);
-		weight = current->se.avg.load_avg;
-
-		this_load += effective_load(tg, this_cpu, -weight, -weight);
-		load += effective_load(tg, prev_cpu, 0, -weight);
-	}
-
-	tg = task_group(p);
-	weight = p->se.avg.load_avg;
-
-	/*
-	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
-	 * due to the sync cause above having dropped this_load to 0, we'll
-	 * always have an imbalance, but there's really nothing you can do
-	 * about that, so that's good too.
-	 *
-	 * Otherwise check if either cpus are near enough in load to allow this
-	 * task to be woken on this_cpu.
-	 */
-	this_eff_load = 100;
-	this_eff_load *= capacity_of(prev_cpu);
-
-	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
-	prev_eff_load *= capacity_of(this_cpu);
-
-	if (this_load > 0) {
-		this_eff_load *= this_load +
-			effective_load(tg, this_cpu, weight, weight);
-
-		prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
-	}
-
-	balanced = this_eff_load <= prev_eff_load;
+		affine = true;
+	else
+		affine = numa_wake_affine(sd, p, this_cpu, prev_cpu, sync);
 
 	schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
+	if (affine) {
+		schedstat_inc(sd->ttwu_move_affine);
+		schedstat_inc(p->se.statistics.nr_wakeups_affine);
+	}
 
-	if (!balanced)
-		return 0;
-
-	schedstat_inc(sd->ttwu_move_affine);
-	schedstat_inc(p->se.statistics.nr_wakeups_affine);
-
-	return 1;
+	return affine;
 }
 
 static inline int task_util(struct task_struct *p);

next prev parent reply	other threads:[~2017-07-10 17:30 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-10 17:09 [PATCH 4.12 00/27] 4.12.1-stable review Greg Kroah-Hartman
2017-07-10 17:09 ` [PATCH 4.12 01/27] driver core: platform: fix race condition with driver_override Greg Kroah-Hartman
2017-07-10 17:09 ` [PATCH 4.12 02/27] RDMA/uverbs: Check port number supplied by user verbs cmds Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 03/27] usb: dwc3: replace %p with %pK Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 04/27] USB: serial: cp210x: add ID for CEL EM3588 USB ZigBee stick Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 05/27] usb: usbip: set buffer pointers to NULL after free Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 06/27] Add USB quirk for HVR-950q to avoid intermittent device resets Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 07/27] usb: Fix typo in the definition of Endpoint[out]Request Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 08/27] USB: core: fix device node leak Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 11/27] xhci: Limit USB2 port wake support for AMD Promontory hosts Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 12/27] gfs2: Fix glock rhashtable rcu bug Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 13/27] Add "shutdown" to "struct class" Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 14/27] tpm: Issue a TPM2_Shutdown for TPM2 devices Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 15/27] tpm: fix a kernel memory leak in tpm-sysfs.c Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 16/27] powerpc/powernv: Fix CPU_HOTPLUG=n idle.c compile error Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 17/27] x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 18/27] sched/fair, cpumask: Export for_each_cpu_wrap() Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 19/27] sched/core: Implement new approach to scale select_idle_cpu() Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 20/27] sched/numa: Use down_read_trylock() for the mmap_sem Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 21/27] sched/numa: Override part of migrate_degrades_locality() when idle balancing Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 22/27] sched/fair: Simplify wake_affine() for the single socket case Greg Kroah-Hartman
2017-07-10 17:10 ` Greg Kroah-Hartman [this message]
2017-07-10 17:10 ` [PATCH 4.12 24/27] sched/fair: Remove effective_load() Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 25/27] sched/numa: Hide numa_wake_affine() from UP build Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 26/27] xen: avoid deadlock in xenbus driver Greg Kroah-Hartman
2017-07-10 17:10 ` [PATCH 4.12 27/27] crypto: drbg - Fixes panic in wait_for_completion call Greg Kroah-Hartman
2017-07-11  1:21 ` [PATCH 4.12 00/27] 4.12.1-stable review Guenter Roeck
2017-07-11  9:47   ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170710170104.863724549@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=efault@gmx.de \
    --cc=jhladky@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox