* [PATCH 00/26] Performance-related backports for 4.12.2
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
This is a second round of performance-related backports, based on low-hanging
fruit from the 4.13 merge window, applied on top of 4.12.2.
As before, these have only been tested on 4.12-stable. While they may
merge against older kernels, I have no data on how they behave there and
cannot guarantee it's a good idea, so I don't recommend it. There will also
be some major conflicts that are not trivial to resolve.
For most of the tests I conducted, the impact is marginal, but the
first two sets of patches are important for large machines and for users
of nohz_full. The load balancing patch is fairly specific but measurable.
The removal of unnecessary IRQ disabling/enabling is borderline in terms of
performance, but the patches are trivial and avoiding unnecessary expensive
operations is always a plus.
Patches 1-17 resolve a number of topology problems in the scheduler that
primarily impact NUMA machines with a ring topology. There are
more patches in there than strictly necessary, but one adds very helpful
comments explaining how the code works, and a few bring the naming of
functions in line with 4.13, which makes the series a bit easier to follow.
Others shuffle comments around and restructure the code; that could
have been avoided, but then the backported patches would not look
like their upstream equivalents. While some of the extra patches are
outside the scope of -stable, they remove the delta when comparing
the 4.12-stable and 4.13 scheduler; I can drop them if necessary.
The performance impact on UMA and fully-connected machines is marginal,
with minor gains/losses across multiple machines that are mostly
within the noise, but other reports indicate that the impact on
ring topologies is substantial. In particular, the full machine
will be properly utilised instead of saturating a subset of nodes
for workloads with lots of threads or processes.
Patches 18-22 are more about accounting than performance. The bug affects
workloads running on nohz_full+isolcpus configurations. If two or more
processes running on an isolated CPU are 100% userspace-bound
and normal processes are running on other CPUs, then the isolated
processes report a mix of userspace and system CPU usage. It can
be up to 100% system CPU usage even though in reality no time is
being spent in the kernel. This misaccounting is confusing
when analysing workloads.
For normal workloads, there is no measurable difference.
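To make the symptom concrete, here is a minimal userspace sketch
(illustration only, not part of the series; the /proc field positions
follow proc(5)). It spins entirely in userspace and samples its own
utime/stime every few seconds; pinned with taskset to an affected isolated
CPU, stime climbs between samples even though the loop never enters the
kernel:

	#include <stdio.h>
	#include <string.h>
	#include <time.h>

	/* Read utime/stime (clock ticks) from /proc/self/stat, see proc(5) */
	static int sample(unsigned long *utime, unsigned long *stime)
	{
		char buf[1024], *p;
		int ret = -1;
		FILE *f = fopen("/proc/self/stat", "r");

		if (!f)
			return -1;
		/* Skip pid and (comm); comm may itself contain spaces */
		if (fgets(buf, sizeof(buf), f) && (p = strrchr(buf, ')')))
			ret = sscanf(p + 2, "%*c %*d %*d %*d %*d %*d "
				     "%*u %*u %*u %*u %*u %lu %lu",
				     utime, stime) == 2 ? 0 : -1;
		fclose(f);
		return ret;
	}

	int main(void)
	{
		unsigned long ut, st;
		volatile unsigned long spin = 0;
		time_t next = time(NULL) + 5;

		for (;;) {
			spin++;			/* 100% userspace work */
			if (time(NULL) < next)	/* vDSO call on most configs */
				continue;
			next += 5;
			if (sample(&ut, &st) == 0)
				printf("utime=%lu stime=%lu (ticks)\n", ut, st);
		}
	}

With correct accounting, stime should stay essentially flat for such a loop.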
Patch 23 fixes a scheduler load balancing issue where an imbalanced domain
is considered balanced when some tasks are pinned for affinity.
Again, for many workloads the impact is marginal, but it was a small
boost (1-2%, barely outside the noise) for a specjbb configuration
that pinned JVMs. It may be coincidence, but the patch is
straightforward.
Patches 24-25 avoid unnecessary IRQ disabling/enabling while updating
writeback stats. In many cases this will not be noticeable because it
happens out-of-band and the cost of stats updates is often negligible
compared to the overall cost of writeback. However, unnecessary
IRQ disabling is never a good thing and it may be noticeable during
writeback to ultra-fast storage.
Patch 26 avoids an IRQ disable/enable in the fork path. It's noticeable
on fork-intensive workloads, with, for example, a 1-3% boost on
hackbench that is just outside the noise.
--
2.13.1
* [PATCH 01/26] sched/topology: Refactor function build_overlap_sched_groups()
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Lauro Ramos Venancio <lvenanci@redhat.com>
commit 8c0334697dc37eb3d6d7632304d3a3662248daac upstream.
Create functions build_group_from_child_sched_domain() and
init_overlap_sched_group(). No functional change.
Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1492091769-19879-2-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/topology.c | 62 ++++++++++++++++++++++++++++++++++---------------
1 file changed, 43 insertions(+), 19 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1b0b4fb12837..d786d45c44d9 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -513,6 +513,47 @@ int group_balance_cpu(struct sched_group *sg)
return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
}
+static struct sched_group *
+build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
+{
+ struct sched_group *sg;
+ struct cpumask *sg_span;
+
+ sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
+ GFP_KERNEL, cpu_to_node(cpu));
+
+ if (!sg)
+ return NULL;
+
+ sg_span = sched_group_cpus(sg);
+ if (sd->child)
+ cpumask_copy(sg_span, sched_domain_span(sd->child));
+ else
+ cpumask_copy(sg_span, sched_domain_span(sd));
+
+ return sg;
+}
+
+static void init_overlap_sched_group(struct sched_domain *sd,
+ struct sched_group *sg, int cpu)
+{
+ struct sd_data *sdd = sd->private;
+ struct cpumask *sg_span;
+
+ sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
+ if (atomic_inc_return(&sg->sgc->ref) == 1)
+ build_group_mask(sd, sg);
+
+ /*
+ * Initialize sgc->capacity such that even if we mess up the
+ * domains and no possible iteration will get us here, we won't
+ * die on a /0 trap.
+ */
+ sg_span = sched_group_cpus(sg);
+ sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
+ sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+}
+
static int
build_overlap_sched_groups(struct sched_domain *sd, int cpu)
{
@@ -537,31 +578,14 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
continue;
- sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
- GFP_KERNEL, cpu_to_node(cpu));
-
+ sg = build_group_from_child_sched_domain(sibling, cpu);
if (!sg)
goto fail;
sg_span = sched_group_cpus(sg);
- if (sibling->child)
- cpumask_copy(sg_span, sched_domain_span(sibling->child));
- else
- cpumask_set_cpu(i, sg_span);
-
cpumask_or(covered, covered, sg_span);
- sg->sgc = *per_cpu_ptr(sdd->sgc, i);
- if (atomic_inc_return(&sg->sgc->ref) == 1)
- build_group_mask(sd, sg);
-
- /*
- * Initialize sgc->capacity such that even if we mess up the
- * domains and no possible iteration will get us here, we won't
- * die on a /0 trap.
- */
- sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
- sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+ init_overlap_sched_group(sd, sg, i);
/*
* Make sure the first group of this domain contains the
--
2.13.1
* [PATCH 02/26] sched/topology: Fix building of overlapping sched-groups
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
commit 0372dd2736e02672ac6e189c31f7d8c02ad543cd upstream.
When building the overlapping groups, we very obviously should start
with the previous domain of _this_ @cpu, not CPU-0.
This can be readily demonstrated with a topology like:
  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10
Where (for example) CPU1 ends up generating the following nonsensical groups:
[] CPU1 attaching sched-domain:
[] domain 0: span 0-2 level NUMA
[] groups: 1 2 0
[] domain 1: span 0-3 level NUMA
[] groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)
Where the fact that domain 1 doesn't include a group with span 0-2 is
the obvious fail.
With patch this looks like:
[] CPU1 attaching sched-domain:
[] domain 0: span 0-2 level NUMA
[] groups: 1 0 2
[] domain 1: span 0-3 level NUMA
[] groups: 0-2 (cpu_capacity = 3072) 0,2-3 (cpu_capacity = 3072)
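For illustration only (plain bitmasks standing in for struct cpumask), a
userspace sketch of the iteration order the one-line fix relies on:
for_each_cpu_wrap(i, span, cpu) visits the set bits of @span starting at
@cpu and wrapping around, whereas for_each_cpu() always starts at bit 0:

	#include <stdio.h>

	/* Visit the set bits of @span starting at @start, wrapping around */
	static void visit_wrap(unsigned long span, int nbits, int start)
	{
		int off, i;

		for (off = 0; off < nbits; off++) {
			i = (start + off) % nbits;
			if (span & (1UL << i))
				printf("%d ", i);
		}
		printf("\n");
	}

	int main(void)
	{
		unsigned long span = 0x7;	/* CPUs 0-2 set */

		visit_wrap(span, 4, 0);		/* from CPU0: 0 1 2 */
		visit_wrap(span, 4, 1);		/* from CPU1: 1 2 0 */
		return 0;
	}

Starting the walk at _this_ CPU is what lets the first group generated for
each domain be based on that CPU's own sibling domain, which later patches
in the series build on.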
Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: e3589f6c81e4 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/topology.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d786d45c44d9..921dedde2ee1 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -566,7 +566,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
cpumask_clear(covered);
- for_each_cpu(i, span) {
+ for_each_cpu_wrap(i, span, cpu) {
struct cpumask *sg_span;
if (cpumask_test_cpu(i, covered))
--
2.13.1
* [PATCH 03/26] sched/topology: Simplify build_overlap_sched_groups()
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
commit 91eaed0d61319f58a9f8e43d41a8cbb069b4f73d upstream.
Now that the first group will always be the previous domain of this
@cpu this can be simplified.
In fact, writing the code now removed should've been a big clue I was
doing it wrong :/
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/topology.c | 13 ++-----------
1 file changed, 2 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 921dedde2ee1..6b10e0a956c7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -557,7 +557,7 @@ static void init_overlap_sched_group(struct sched_domain *sd,
static int
build_overlap_sched_groups(struct sched_domain *sd, int cpu)
{
- struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg;
+ struct sched_group *first = NULL, *last = NULL, *sg;
const struct cpumask *span = sched_domain_span(sd);
struct cpumask *covered = sched_domains_tmpmask;
struct sd_data *sdd = sd->private;
@@ -587,15 +587,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
init_overlap_sched_group(sd, sg, i);
- /*
- * Make sure the first group of this domain contains the
- * canonical balance CPU. Otherwise the sched_domain iteration
- * breaks. See update_sg_lb_stats().
- */
- if ((!groups && cpumask_test_cpu(cpu, sg_span)) ||
- group_balance_cpu(sg) == cpu)
- groups = sg;
-
if (!first)
first = sg;
if (last)
@@ -603,7 +594,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
last = sg;
last->next = first;
}
- sd->groups = groups;
+ sd->groups = first;
return 0;
--
2.13.1
* [PATCH 04/26] sched/debug: Print the scheduler topology group mask
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
commit b0151c25548cacc50771a7930475727c6c8ee869 upstream.
In order to determine the balance_cpu (for should_we_balance()) we need
the sched_group_mask() for overlapping domains.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/topology.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6b10e0a956c7..3d50ee38b8fb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -82,12 +82,22 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
printk(KERN_CONT " %*pbl",
cpumask_pr_args(sched_group_cpus(group)));
+
+ if ((sd->flags & SD_OVERLAP) && !cpumask_full(sched_group_mask(group))) {
+ printk(KERN_CONT " (mask: %*pbl)",
+ cpumask_pr_args(sched_group_mask(group)));
+ }
+
if (group->sgc->capacity != SCHED_CAPACITY_SCALE) {
- printk(KERN_CONT " (cpu_capacity = %lu)",
+ printk(KERN_CONT " (cpu_capacity: %lu)",
group->sgc->capacity);
}
group = group->next;
+
+ if (group != sd->groups)
+ printk(KERN_CONT ",");
+
} while (group != sd->groups);
printk(KERN_CONT "\n");
--
2.13.1
* [PATCH 05/26] sched/topology: Verify the first group matches the child domain
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
commit a420b0630362c2c451060e6187e36d72df827134 upstream.
We want sched_groups to be sibling child domains (or individual CPUs
when there are no child domains). Furthermore, since the first group
of a domain should include the CPU of that domain, the first group of
each domain should match the child domain.
Verify this is indeed so.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/topology.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 3d50ee38b8fb..81c82031ed95 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -93,6 +93,12 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
group->sgc->capacity);
}
+ if (group == sd->groups && sd->child &&
+ !cpumask_equal(sched_domain_span(sd->child),
+ sched_group_cpus(group))) {
+ printk(KERN_ERR "ERROR: domain->groups does not match domain->child\n");
+ }
+
group = group->next;
if (group != sd->groups)
--
2.13.1
* [PATCH 06/26] sched/topology: Optimize build_group_mask()
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Lauro Ramos Venancio <lvenanci@redhat.com>
commit f32d782e31bf079f600dcec126ed117b0577e85c upstream.
The group mask is always used in intersection with the group CPUs. So,
when building the group mask, we don't have to care about CPUs that are
not part of the group.
Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/topology.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 81c82031ed95..5a4d9aeda258 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -506,12 +506,12 @@ enum s_alloc {
*/
static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
{
- const struct cpumask *span = sched_domain_span(sd);
+ const struct cpumask *sg_span = sched_group_cpus(sg);
struct sd_data *sdd = sd->private;
struct sched_domain *sibling;
int i;
- for_each_cpu(i, span) {
+ for_each_cpu(i, sg_span) {
sibling = *per_cpu_ptr(sdd->sd, i);
if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
continue;
--
2.13.1
* [PATCH 07/26] sched/topology: Move comment about asymmetric node setups
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Lauro Ramos Venancio <lvenanci@redhat.com>
commit c20e1ea4b61c3d99a354d912f2d74822fd2a001d upstream.
Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1492717903-5195-4-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/topology.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5a4d9aeda258..c10f44a1ab2d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -495,14 +495,6 @@ enum s_alloc {
/*
* Build an iteration mask that can exclude certain CPUs from the upwards
* domain traversal.
- *
- * Asymmetric node setups can result in situations where the domain tree is of
- * unequal depth, make sure to skip domains that already cover the entire
- * range.
- *
- * In that case build_sched_domains() will have terminated the iteration early
- * and our sibling sd spans will be empty. Domains should always include the
- * CPU they're built on, so check that.
*/
static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
{
@@ -590,7 +582,16 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
sibling = *per_cpu_ptr(sdd->sd, i);
- /* See the comment near build_group_mask(). */
+ /*
+ * Asymmetric node setups can result in situations where the
+ * domain tree is of unequal depth, make sure to skip domains
+ * that already cover the entire range.
+ *
+ * In that case build_sched_domains() will have terminated the
+ * iteration early and our sibling sd spans will be empty.
+ * Domains should always include the CPU they're built on, so
+ * check that.
+ */
if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
continue;
--
2.13.1
* [PATCH 08/26] sched/topology: Remove FORCE_SD_OVERLAP
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
commit af85596c74de2fd9abb87501ae280038ac28a3f4 upstream.
It's an obsolete debug mechanism and future code wants to rely on
properties this undermines.
Namely, it would be good to assume that SD_OVERLAP domains have
children, but if we build the entire hierarchy with SD_OVERLAP this is
obviously false.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/features.h | 1 -
kernel/sched/topology.c | 2 +-
2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ce7b4b6ac733..d3fb15555291 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -77,7 +77,6 @@ SCHED_FEAT(WARN_DOUBLE_CLOCK, false)
SCHED_FEAT(RT_PUSH_IPI, true)
#endif
-SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
SCHED_FEAT(ATTACH_AGE_LOAD, true)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index c10f44a1ab2d..21efacf547d4 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1439,7 +1439,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
sd = build_sched_domain(tl, cpu_map, attr, sd, i);
if (tl == sched_domain_topology)
*per_cpu_ptr(d.sd, i) = sd;
- if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
+ if (tl->flags & SDTL_OVERLAP)
sd->flags |= SD_OVERLAP;
if (cpumask_equal(cpu_map, sched_domain_span(sd)))
break;
--
2.13.1
* [PATCH 09/26] sched/topology: Fix overlapping sched_group_mask
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
commit 73bb059f9b8a00c5e1bf2f7ca83138c05d05e600 upstream.
The point of sched_group_mask is to select those CPUs from
sched_group_cpus that can actually arrive at this balance domain.
The current code gets it wrong, as can be readily demonstrated with a
topology like:
  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10
Where (for example) domain 1 on CPU1 ends up with a mask that includes
CPU0:
[] CPU1 attaching sched-domain:
[] domain 0: span 0-2 level NUMA
[] groups: 1 (mask: 1), 2, 0
[] domain 1: span 0-3 level NUMA
[] groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
This causes sched_balance_cpu() to compute the wrong CPU and
consequently should_we_balance() will terminate early resulting in
missed load-balance opportunities.
The fixed topology looks like:
[] CPU1 attaching sched-domain:
[] domain 0: span 0-2 level NUMA
[] groups: 1 (mask: 1), 2, 0
[] domain 1: span 0-3 level NUMA
[] groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
(note: this relies on OVERLAP domains to always have children, this is
true because the regular topology domains are still here -- this is
before degenerate trimming)
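To sketch the consequence with plain bitmasks (not kernel code):
group_balance_cpu() returns the first CPU in the intersection of the
group's CPUs and its mask, and should_we_balance() only lets that CPU
continue balancing. With the buggy mask 0-2, the balance CPU of CPU1's
domain-1 group computes as CPU0; with the fixed mask it is CPU1:

	#include <stdio.h>
	#include <strings.h>	/* ffs() */

	/* First CPU in (cpus & mask), or -1 if the intersection is empty */
	static int balance_cpu(unsigned int cpus, unsigned int mask)
	{
		unsigned int both = cpus & mask;

		return both ? ffs(both) - 1 : -1;
	}

	int main(void)
	{
		unsigned int group = 0x7;	/* CPU1's domain-1 group: 0-2 */

		printf("buggy mask 0-2: balance cpu %d\n",
		       balance_cpu(group, 0x7));
		printf("fixed mask {1}: balance cpu %d\n",
		       balance_cpu(group, 0x2));
		return 0;
	}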
Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: e3589f6c81e4 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/topology.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 21efacf547d4..09563c1d1d5b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -495,6 +495,9 @@ enum s_alloc {
/*
* Build an iteration mask that can exclude certain CPUs from the upwards
* domain traversal.
+ *
+ * Only CPUs that can arrive at this group should be considered to continue
+ * balancing.
*/
static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
{
@@ -505,11 +508,24 @@ static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
for_each_cpu(i, sg_span) {
sibling = *per_cpu_ptr(sdd->sd, i);
- if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
+
+ /*
+ * Can happen in the asymmetric case, where these siblings are
+ * unused. The mask will not be empty because those CPUs that
+ * do have the top domain _should_ span the domain.
+ */
+ if (!sibling->child)
+ continue;
+
+ /* If we would not end up here, we can't continue from here */
+ if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
continue;
cpumask_set_cpu(i, sched_group_mask(sg));
}
+
+ /* We must not have empty masks here */
+ WARN_ON_ONCE(cpumask_empty(sched_group_mask(sg)));
}
/*
--
2.13.1
* [PATCH 10/26] sched/topology: Small cleanup
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
commit 8d5dc5126bb2bbcebf0b1e061cca2fc02c935620 upstream.
Move the allocation of topology specific cpumasks into the topology
code.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/core.c | 4 +---
kernel/sched/sched.h | 4 +---
kernel/sched/topology.c | 7 +++++--
3 files changed, 7 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 326d4f88e2b1..3223905b96de 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5958,7 +5958,6 @@ void __init sched_init_smp(void)
cpumask_var_t non_isolated_cpus;
alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL);
- alloc_cpumask_var(&fallback_doms, GFP_KERNEL);
sched_init_numa();
@@ -5968,7 +5967,7 @@ void __init sched_init_smp(void)
* happen.
*/
mutex_lock(&sched_domains_mutex);
- init_sched_domains(cpu_active_mask);
+ sched_init_domains(cpu_active_mask);
cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map);
if (cpumask_empty(non_isolated_cpus))
cpumask_set_cpu(smp_processor_id(), non_isolated_cpus);
@@ -6199,7 +6198,6 @@ void __init sched_init(void)
calc_load_update = jiffies + LOAD_FREQ;
#ifdef CONFIG_SMP
- zalloc_cpumask_var(&sched_domains_tmpmask, GFP_NOWAIT);
/* May be allocated at isolcpus cmdline parse time */
if (cpu_isolated_map == NULL)
zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6dda2aab731e..6e1eae717a24 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -606,11 +606,9 @@ struct root_domain {
extern struct root_domain def_root_domain;
extern struct mutex sched_domains_mutex;
-extern cpumask_var_t fallback_doms;
-extern cpumask_var_t sched_domains_tmpmask;
extern void init_defrootdomain(void);
-extern int init_sched_domains(const struct cpumask *cpu_map);
+extern int sched_init_domains(const struct cpumask *cpu_map);
extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
#endif /* CONFIG_SMP */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 09563c1d1d5b..a4b868c76f3c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1526,7 +1526,7 @@ static struct sched_domain_attr *dattr_cur;
* cpumask) fails, then fallback to a single sched domain,
* as determined by the single cpumask fallback_doms.
*/
-cpumask_var_t fallback_doms;
+static cpumask_var_t fallback_doms;
/*
* arch_update_cpu_topology lets virtualized architectures update the
@@ -1568,10 +1568,13 @@ void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms)
* For now this just excludes isolated CPUs, but could be used to
* exclude other special cases in the future.
*/
-int init_sched_domains(const struct cpumask *cpu_map)
+int sched_init_domains(const struct cpumask *cpu_map)
{
int err;
+ zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
+ zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
+
arch_update_cpu_topology();
ndoms_cur = 1;
doms_cur = alloc_sched_domains(ndoms_cur);
--
2.13.1
* [PATCH 11/26] sched/topology: Add sched_group_capacity debugging
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
commit 005f874dd2843116e2ea079e3679f4f318f12fee upstream.
Add sgc::id to make it easier to spot domain construction issues.
Take the opportunity to slightly rework the group printing, because
adding more "(id: %d)" strings makes the entire thing very hard to
read. Also the individual groups are very hard to separate, so add
explicit visual grouping, which allows replacing all the "(%s: %d)"
format things with shorter "%s=%d" variants.
Then fix up some inconsistencies in surrounding prints for domains.
The end result looks like:
[] CPU0 attaching sched-domain(s):
[] domain-0: span=0,4 level=DIE
[] groups: 0:{ span=0 }, 4:{ span=4 }
[] domain-1: span=0-1,3-5,7 level=NUMA
[] groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
[] domain-2: span=0-7 level=NUMA
[] groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/sched.h | 4 ++++
kernel/sched/topology.c | 25 +++++++++++++++----------
2 files changed, 19 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6e1eae717a24..4312b2adfb02 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1023,6 +1023,10 @@ struct sched_group_capacity {
unsigned long next_update;
int imbalance; /* XXX unrelated to capacity but shared group state */
+#ifdef CONFIG_SCHED_DEBUG
+ int id;
+#endif
+
unsigned long cpumask[0]; /* iteration mask */
};
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a4b868c76f3c..12af4b157928 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -35,7 +35,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
cpumask_clear(groupmask);
- printk(KERN_DEBUG "%*s domain %d: ", level, "", level);
+ printk(KERN_DEBUG "%*s domain-%d: ", level, "", level);
if (!(sd->flags & SD_LOAD_BALANCE)) {
printk("does not load-balance\n");
@@ -45,7 +45,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
return -1;
}
- printk(KERN_CONT "span %*pbl level %s\n",
+ printk(KERN_CONT "span=%*pbl level=%s\n",
cpumask_pr_args(sched_domain_span(sd)), sd->name);
if (!cpumask_test_cpu(cpu, sched_domain_span(sd))) {
@@ -80,18 +80,17 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
cpumask_or(groupmask, groupmask, sched_group_cpus(group));
- printk(KERN_CONT " %*pbl",
- cpumask_pr_args(sched_group_cpus(group)));
+ printk(KERN_CONT " %d:{ span=%*pbl",
+ group->sgc->id,
+ cpumask_pr_args(sched_group_cpus(group)));
if ((sd->flags & SD_OVERLAP) && !cpumask_full(sched_group_mask(group))) {
- printk(KERN_CONT " (mask: %*pbl)",
+ printk(KERN_CONT " mask=%*pbl",
cpumask_pr_args(sched_group_mask(group)));
}
- if (group->sgc->capacity != SCHED_CAPACITY_SCALE) {
- printk(KERN_CONT " (cpu_capacity: %lu)",
- group->sgc->capacity);
- }
+ if (group->sgc->capacity != SCHED_CAPACITY_SCALE)
+ printk(KERN_CONT " cap=%lu", group->sgc->capacity);
if (group == sd->groups && sd->child &&
!cpumask_equal(sched_domain_span(sd->child),
@@ -99,6 +98,8 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
printk(KERN_ERR "ERROR: domain->groups does not match domain->child\n");
}
+ printk(KERN_CONT " }");
+
group = group->next;
if (group != sd->groups)
@@ -129,7 +130,7 @@ static void sched_domain_debug(struct sched_domain *sd, int cpu)
return;
}
- printk(KERN_DEBUG "CPU%d attaching sched-domain:\n", cpu);
+ printk(KERN_DEBUG "CPU%d attaching sched-domain(s):\n", cpu);
for (;;) {
if (sched_domain_debug_one(sd, cpu, level, sched_domains_tmpmask))
@@ -1356,6 +1357,10 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
if (!sgc)
return -ENOMEM;
+#ifdef CONFIG_SCHED_DEBUG
+ sgc->id = j;
+#endif
+
*per_cpu_ptr(sdd->sgc, j) = sgc;
}
}
--
2.13.1
* [PATCH 12/26] sched/topology: Fix overlapping sched_group_capacity
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
commit 1676330ecfa840113a37b25a49afda068380d19c upstream.
When building the overlapping groups we need to attach a consistent
sched_group_capacity structure. That is, all 'identical' sched_group's
should have the _same_ sched_group_capacity.
This can (once again) be demonstrated with a topology like:
  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10
But we need at least 2 CPUs per node for this to show up, after all,
if there is only one CPU per node, our CPU @i is per definition a
unique CPU that reaches this domain (aka balance-cpu).
Given the above NUMA topo and 2 CPUs per node:
[] CPU0 attaching sched-domain(s):
[] domain-0: span=0,4 level=DIE
[] groups: 0:{ span=0 }, 4:{ span=4 }
[] domain-1: span=0-1,3-5,7 level=NUMA
[] groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
[] domain-2: span=0-7 level=NUMA
[] groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }
[] CPU1 attaching sched-domain(s):
[] domain-0: span=1,5 level=DIE
[] groups: 1:{ span=1 }, 5:{ span=5 }
[] domain-1: span=0-2,4-6 level=NUMA
[] groups: 1:{ span=1,5 mask=1,5 cap=2048 }, 2:{ span=2,6 mask=2,6 cap=2048 }, 4:{ span=0,4 mask=0,4 cap=2048 }
[] domain-2: span=0-7 level=NUMA
[] groups: 1:{ span=0-2,4-6 mask=1,5 cap=6144 }, 3:{ span=0,2-4,6-7 mask=3,7 cap=6144 }
Observe how CPU0-domain1-group0 and CPU1-domain1-group4 are the
'same' but have a different id (0 vs 4).
To fix this, use the group balance CPU to select the SGC. This means
we have to compute the full mask for each CPU and require a second
temporary mask to store the group mask in (it otherwise lives in the
SGC).
The fixed topology looks like:
[] CPU0 attaching sched-domain(s):
[] domain-0: span=0,4 level=DIE
[] groups: 0:{ span=0 }, 4:{ span=4 }
[] domain-1: span=0-1,3-5,7 level=NUMA
[] groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
[] domain-2: span=0-7 level=NUMA
[] groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }
[] CPU1 attaching sched-domain(s):
[] domain-0: span=1,5 level=DIE
[] groups: 1:{ span=1 }, 5:{ span=5 }
[] domain-1: span=0-2,4-6 level=NUMA
[] groups: 1:{ span=1,5 mask=1,5 cap=2048 }, 2:{ span=2,6 mask=2,6 cap=2048 }, 0:{ span=0,4 mask=0,4 cap=2048 }
[] domain-2: span=0-7 level=NUMA
[] groups: 1:{ span=0-2,4-6 mask=1,5 cap=6144 }, 3:{ span=0,2-4,6-7 mask=3,7 cap=6144 }
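The selection rule can be sketched in userspace (hypothetical structures;
plain bitmasks stand in for cpumasks): every instance of the 'same' group
computes the same balance CPU from the intersection of its CPUs and its
mask, and therefore indexes the same per-CPU sched_group_capacity slot:

	#include <stdio.h>
	#include <strings.h>	/* ffs() */

	struct sgc { long capacity; };
	static struct sgc per_cpu_sgc[8];	/* stand-in for sdd->sgc */

	/* Select the shared capacity via the group balance CPU */
	static struct sgc *pick_sgc(unsigned int cpus, unsigned int mask)
	{
		int balance_cpu = ffs(cpus & mask) - 1;

		return &per_cpu_sgc[balance_cpu];
	}

	int main(void)
	{
		/* CPU0's and CPU1's trees both build a group spanning {0,4} */
		struct sgc *a = pick_sgc(0x11, 0x11);
		struct sgc *b = pick_sgc(0x11, 0x11);

		printf("shared sgc: %s\n", a == b ? "yes" : "no");	/* yes */
		return 0;
	}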
Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: e3589f6c81e4 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/topology.c | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 12af4b157928..4f6fa7553d92 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -10,6 +10,7 @@ DEFINE_MUTEX(sched_domains_mutex);
/* Protected by sched_domains_mutex: */
cpumask_var_t sched_domains_tmpmask;
+cpumask_var_t sched_domains_tmpmask2;
#ifdef CONFIG_SCHED_DEBUG
@@ -500,13 +501,16 @@ enum s_alloc {
* Only CPUs that can arrive at this group should be considered to continue
* balancing.
*/
-static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
+static void
+build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
{
const struct cpumask *sg_span = sched_group_cpus(sg);
struct sd_data *sdd = sd->private;
struct sched_domain *sibling;
int i;
+ cpumask_clear(mask);
+
for_each_cpu(i, sg_span) {
sibling = *per_cpu_ptr(sdd->sd, i);
@@ -522,11 +526,11 @@ static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
continue;
- cpumask_set_cpu(i, sched_group_mask(sg));
+ cpumask_set_cpu(i, mask);
}
/* We must not have empty masks here */
- WARN_ON_ONCE(cpumask_empty(sched_group_mask(sg)));
+ WARN_ON_ONCE(cpumask_empty(mask));
}
/*
@@ -560,14 +564,19 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
}
static void init_overlap_sched_group(struct sched_domain *sd,
- struct sched_group *sg, int cpu)
+ struct sched_group *sg)
{
+ struct cpumask *mask = sched_domains_tmpmask2;
struct sd_data *sdd = sd->private;
struct cpumask *sg_span;
+ int cpu;
+
+ build_group_mask(sd, sg, mask);
+ cpu = cpumask_first_and(sched_group_cpus(sg), mask);
sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
if (atomic_inc_return(&sg->sgc->ref) == 1)
- build_group_mask(sd, sg);
+ cpumask_copy(sched_group_mask(sg), mask);
/*
* Initialize sgc->capacity such that even if we mess up the
@@ -619,7 +628,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
sg_span = sched_group_cpus(sg);
cpumask_or(covered, covered, sg_span);
- init_overlap_sched_group(sd, sg, i);
+ init_overlap_sched_group(sd, sg);
if (!first)
first = sg;
@@ -1578,6 +1587,7 @@ int sched_init_domains(const struct cpumask *cpu_map)
int err;
zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
+ zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
arch_update_cpu_topology();
--
2.13.1
* [PATCH 13/26] sched/topology: Add a few comments
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
commit 35a566e6e8a18c3bc16229abeac146a707b8f216 upstream.
Try and describe what this code is about..
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/topology.c | 200 ++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 193 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 4f6fa7553d92..b2790830e184 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -495,11 +495,127 @@ enum s_alloc {
};
/*
+ * Return the canonical balance CPU for this group, this is the first CPU
+ * of this group that's also in the iteration mask.
+ *
+ * The iteration mask are all those CPUs that could actually end up at this
+ * group. See build_group_mask().
+ *
+ * Also see should_we_balance().
+ */
+int group_balance_cpu(struct sched_group *sg)
+{
+ return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
+}
+
+
+/*
+ * NUMA topology (first read the regular topology blurb below)
+ *
+ * Given a node-distance table, for example:
+ *
+ * node 0 1 2 3
+ * 0: 10 20 30 20
+ * 1: 20 10 20 30
+ * 2: 30 20 10 20
+ * 3: 20 30 20 10
+ *
+ * which represents a 4 node ring topology like:
+ *
+ * 0 ----- 1
+ * | |
+ * | |
+ * | |
+ * 3 ----- 2
+ *
+ * We want to construct domains and groups to represent this. The way we go
+ * about doing this is to build the domains on 'hops'. For each NUMA level we
+ * construct the mask of all nodes reachable in @level hops.
+ *
+ * For the above NUMA topology that gives 3 levels:
+ *
+ * NUMA-2 0-3 0-3 0-3 0-3
+ * groups: {0-1,3},{1-3} {0-2},{0,2-3} {1-3},{0-1,3} {0,2-3},{0-2}
+ *
+ * NUMA-1 0-1,3 0-2 1-3 0,2-3
+ * groups: {0},{1},{3} {0},{1},{2} {1},{2},{3} {0},{2},{3}
+ *
+ * NUMA-0 0 1 2 3
+ *
+ *
+ * As can be seen; things don't nicely line up as with the regular topology.
+ * When we iterate a domain in child domain chunks some nodes can be
+ * represented multiple times -- hence the "overlap" naming for this part of
+ * the topology.
+ *
+ * In order to minimize this overlap, we only build enough groups to cover the
+ * domain. For instance Node-0 NUMA-2 would only get groups: 0-1,3 and 1-3.
+ *
+ * Because:
+ *
+ * - the first group of each domain is its child domain; this
+ * gets us the first 0-1,3
+ * - the only uncovered node is 2, who's child domain is 1-3.
+ *
+ * However, because of the overlap, computing a unique CPU for each group is
+ * more complicated. Consider for instance the groups of NODE-1 NUMA-2, both
+ * groups include the CPUs of Node-0, while those CPUs would not in fact ever
+ * end up at those groups (they would end up in group: 0-1,3).
+ *
+ * To correct this we have to introduce the group iteration mask. This mask
+ * will contain those CPUs in the group that can reach this group given the
+ * (child) domain tree.
+ *
+ * With this we can once again compute balance_cpu and sched_group_capacity
+ * relations.
+ *
+ * XXX include words on how balance_cpu is unique and therefore can be
+ * used for sched_group_capacity links.
+ *
+ *
+ * Another 'interesting' topology is:
+ *
+ * node 0 1 2 3
+ * 0: 10 20 20 30
+ * 1: 20 10 20 20
+ * 2: 20 20 10 20
+ * 3: 30 20 20 10
+ *
+ * Which looks a little like:
+ *
+ * 0 ----- 1
+ * | / |
+ * | / |
+ * | / |
+ * 2 ----- 3
+ *
+ * This topology is asymmetric, nodes 1,2 are fully connected, but nodes 0,3
+ * are not.
+ *
+ * This leads to a few particularly weird cases where the sched_domain's are
+ * not of the same number for each cpu. Consider:
+ *
+ * NUMA-2 0-3 0-3
+ * groups: {0-2},{1-3} {1-3},{0-2}
+ *
+ * NUMA-1 0-2 0-3 0-3 1-3
+ *
+ * NUMA-0 0 1 2 3
+ *
+ */
+
+
+/*
* Build an iteration mask that can exclude certain CPUs from the upwards
* domain traversal.
*
* Only CPUs that can arrive at this group should be considered to continue
* balancing.
+ *
+ * We do this during the group creation pass, therefore the group information
+ * isn't complete yet, however since each group represents a (child) domain we
+ * can fully construct this using the sched_domain bits (which are already
+ * complete).
*/
static void
build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
@@ -534,14 +650,10 @@ build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask
}
/*
- * Return the canonical balance CPU for this group, this is the first CPU
- * of this group that's also in the iteration mask.
+ * XXX: This creates per-node group entries; since the load-balancer will
+ * immediately access remote memory to construct this group's load-balance
+ * statistics having the groups node local is of dubious benefit.
*/
-int group_balance_cpu(struct sched_group *sg)
-{
- return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
-}
-
static struct sched_group *
build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
{
@@ -577,6 +689,8 @@ static void init_overlap_sched_group(struct sched_domain *sd,
sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
if (atomic_inc_return(&sg->sgc->ref) == 1)
cpumask_copy(sched_group_mask(sg), mask);
+ else
+ WARN_ON_ONCE(!cpumask_equal(sched_group_mask(sg), mask));
/*
* Initialize sgc->capacity such that even if we mess up the
@@ -647,6 +761,78 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
return -ENOMEM;
}
+
+/*
+ * Package topology (also see the load-balance blurb in fair.c)
+ *
+ * The scheduler builds a tree structure to represent a number of important
+ * topology features. By default (default_topology[]) these include:
+ *
+ * - Simultaneous multithreading (SMT)
+ * - Multi-Core Cache (MC)
+ * - Package (DIE)
+ *
+ * Where the last one more or less denotes everything up to a NUMA node.
+ *
+ * The tree consists of 3 primary data structures:
+ *
+ * sched_domain -> sched_group -> sched_group_capacity
+ * ^ ^ ^ ^
+ * `-' `-'
+ *
+ * The sched_domains are per-cpu and have a two way link (parent & child) and
+ * denote the ever growing mask of CPUs belonging to that level of topology.
+ *
+ * Each sched_domain has a circular (double) linked list of sched_group's, each
+ * denoting the domains of the level below (or individual CPUs in case of the
+ * first domain level). The sched_group linked by a sched_domain includes the
+ * CPU of that sched_domain [*].
+ *
+ * Take for instance a 2 threaded, 2 core, 2 cache cluster part:
+ *
+ * CPU 0 1 2 3 4 5 6 7
+ *
+ * DIE [ ]
+ * MC [ ] [ ]
+ * SMT [ ] [ ] [ ] [ ]
+ *
+ * - or -
+ *
+ * DIE 0-7 0-7 0-7 0-7 0-7 0-7 0-7 0-7
+ * MC 0-3 0-3 0-3 0-3 4-7 4-7 4-7 4-7
+ * SMT 0-1 0-1 2-3 2-3 4-5 4-5 6-7 6-7
+ *
+ * CPU 0 1 2 3 4 5 6 7
+ *
+ * One way to think about it is: sched_domain moves you up and down among these
+ * topology levels, while sched_group moves you sideways through it, at child
+ * domain granularity.
+ *
+ * sched_group_capacity ensures each unique sched_group has shared storage.
+ *
+ * There are two related construction problems, both require a CPU that
+ * uniquely identify each group (for a given domain):
+ *
+ * - The first is the balance_cpu (see should_we_balance() and the
+ * load-balance blub in fair.c); for each group we only want 1 CPU to
+ * continue balancing at a higher domain.
+ *
+ * - The second is the sched_group_capacity; we want all identical groups
+ * to share a single sched_group_capacity.
+ *
+ * Since these topologies are exclusive by construction. That is, its
+ * impossible for an SMT thread to belong to multiple cores, and cores to
+ * be part of multiple caches. There is a very clear and unique location
+ * for each CPU in the hierarchy.
+ *
+ * Therefore computing a unique CPU for each group is trivial (the iteration
+ * mask is redundant and set all 1s; all CPUs in a group will end up at _that_
+ * group), we can simply pick the first CPU in each group.
+ *
+ *
+ * [*] in other words, the first group of each domain is its child domain.
+ */
+
static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
{
struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
--
2.13.1
* [PATCH 14/26] sched/topology: Rewrite get_group()
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
commit 0c0e776a9b0f21ac41d4c6982c57493928524dba upstream.
We want to attain:
sg_cpus() & sg_mask() == sg_mask()
for this to be so we must initialize sg_mask() to sg_cpus() for the
!overlap case (it's currently cpumask_setall()).
Since the code makes my head hurt bad, rewrite it into a simpler form,
inspired by the now fixed overlap code.
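The target invariant can be checked with plain integers (a sketch, not
kernel code): a mask copied from the group's CPUs satisfies the subset
property, while the old cpumask_setall() breaks it on any machine where
the group does not span all CPUs:

	#include <stdio.h>

	int main(void)
	{
		unsigned int sg_cpus = 0x3;	/* group: CPUs 0-1 */
		unsigned int setall = 0xff;	/* old: cpumask_setall() */
		unsigned int copied = sg_cpus;	/* new: copy of sg_cpus */

		printf("setall: %s\n",
		       (sg_cpus & setall) == setall ? "holds" : "broken");
		printf("copied: %s\n",
		       (sg_cpus & copied) == copied ? "holds" : "broken");
		return 0;
	}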
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/topology.c | 46 ++++++++++++++++++++++------------------------
1 file changed, 22 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b2790830e184..dea1950b42a5 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -833,23 +833,34 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
* [*] in other words, the first group of each domain is its child domain.
*/
-static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
+static struct sched_group *get_group(int cpu, struct sd_data *sdd)
{
struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
struct sched_domain *child = sd->child;
+ struct sched_group *sg;
if (child)
cpu = cpumask_first(sched_domain_span(child));
- if (sg) {
- *sg = *per_cpu_ptr(sdd->sg, cpu);
- (*sg)->sgc = *per_cpu_ptr(sdd->sgc, cpu);
+ sg = *per_cpu_ptr(sdd->sg, cpu);
+ sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
+
+ /* For claim_allocations: */
+ atomic_inc(&sg->ref);
+ atomic_inc(&sg->sgc->ref);
- /* For claim_allocations: */
- atomic_set(&(*sg)->sgc->ref, 1);
+ if (child) {
+ cpumask_copy(sched_group_cpus(sg), sched_domain_span(child));
+ cpumask_copy(sched_group_mask(sg), sched_group_cpus(sg));
+ } else {
+ cpumask_set_cpu(cpu, sched_group_cpus(sg));
+ cpumask_set_cpu(cpu, sched_group_mask(sg));
}
- return cpu;
+ sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_cpus(sg));
+ sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+
+ return sg;
}
/*
@@ -868,34 +879,20 @@ build_sched_groups(struct sched_domain *sd, int cpu)
struct cpumask *covered;
int i;
- get_group(cpu, sdd, &sd->groups);
- atomic_inc(&sd->groups->ref);
-
- if (cpu != cpumask_first(span))
- return 0;
-
lockdep_assert_held(&sched_domains_mutex);
covered = sched_domains_tmpmask;
cpumask_clear(covered);
- for_each_cpu(i, span) {
+ for_each_cpu_wrap(i, span, cpu) {
struct sched_group *sg;
- int group, j;
if (cpumask_test_cpu(i, covered))
continue;
- group = get_group(i, sdd, &sg);
- cpumask_setall(sched_group_mask(sg));
+ sg = get_group(i, sdd);
- for_each_cpu(j, span) {
- if (get_group(j, sdd, NULL) != group)
- continue;
-
- cpumask_set_cpu(j, covered);
- cpumask_set_cpu(j, sched_group_cpus(sg));
- }
+ cpumask_or(covered, covered, sched_group_cpus(sg));
if (!first)
first = sg;
@@ -904,6 +901,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
last = sg;
}
last->next = first;
+ sd->groups = first;
return 0;
}
--
2.13.1
* [PATCH 15/26] sched/topology: Simplify sched_group_mask() usage
From: Mel Gorman @ 2017-07-20 21:21 UTC
To: Linux-Stable; +Cc: Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
commit af218122b103900fa33d408aea0c2468791e698c upstream.
While writing the comments, it occurred to me that:
sg_cpus & sg_mask == sg_mask
at least conceptually; the !overlap case sets the all 1s mask. If we
correct that we can simplify things and directly use sg_mask.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 7 +++----
kernel/sched/topology.c | 5 +++--
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bafa4e04b850..713a3106693e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7915,7 +7915,7 @@ static int active_load_balance_cpu_stop(void *data);
static int should_we_balance(struct lb_env *env)
{
struct sched_group *sg = env->sd->groups;
- struct cpumask *sg_cpus, *sg_mask;
+ struct cpumask *sg_mask;
int cpu, balance_cpu = -1;
/*
@@ -7925,11 +7925,10 @@ static int should_we_balance(struct lb_env *env)
if (env->idle == CPU_NEWLY_IDLE)
return 1;
- sg_cpus = sched_group_cpus(sg);
sg_mask = sched_group_mask(sg);
/* Try to find first idle cpu */
- for_each_cpu_and(cpu, sg_cpus, env->cpus) {
- if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
+ for_each_cpu_and(cpu, sg_mask, env->cpus) {
+ if (!idle_cpu(cpu))
continue;
balance_cpu = cpu;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dea1950b42a5..bf53a99eb511 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -85,7 +85,8 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
group->sgc->id,
cpumask_pr_args(sched_group_cpus(group)));
- if ((sd->flags & SD_OVERLAP) && !cpumask_full(sched_group_mask(group))) {
+ if ((sd->flags & SD_OVERLAP) &&
+ !cpumask_equal(sched_group_mask(group), sched_group_cpus(group))) {
printk(KERN_CONT " mask=%*pbl",
cpumask_pr_args(sched_group_mask(group)));
}
@@ -505,7 +506,7 @@ enum s_alloc {
*/
int group_balance_cpu(struct sched_group *sg)
{
- return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
+ return cpumask_first(sched_group_mask(sg));
}
--
2.13.1
* [PATCH 16/26] sched/topology: Rename sched_group_mask()
From: Mel Gorman @ 2017-07-20 21:21 UTC (permalink / raw)
To: Linux-Stable; +Cc: Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
commit e5c14b1fb89213ff718261e6fb1bb29c5ffbbe99 upstream.
Since sched_group_mask() is now an independent cpumask (it no longer
masks sched_group_cpus()), rename the thing.
Suggested-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 4 +---
kernel/sched/sched.h | 7 +++----
kernel/sched/topology.c | 33 +++++++++++++++------------------
3 files changed, 19 insertions(+), 25 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 713a3106693e..c2472fb846c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7915,7 +7915,6 @@ static int active_load_balance_cpu_stop(void *data);
static int should_we_balance(struct lb_env *env)
{
struct sched_group *sg = env->sd->groups;
- struct cpumask *sg_mask;
int cpu, balance_cpu = -1;
/*
@@ -7925,9 +7924,8 @@ static int should_we_balance(struct lb_env *env)
if (env->idle == CPU_NEWLY_IDLE)
return 1;
- sg_mask = sched_group_mask(sg);
/* Try to find first idle cpu */
- for_each_cpu_and(cpu, sg_mask, env->cpus) {
+ for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
if (!idle_cpu(cpu))
continue;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4312b2adfb02..f7c70575ae34 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1027,7 +1027,7 @@ struct sched_group_capacity {
int id;
#endif
- unsigned long cpumask[0]; /* iteration mask */
+ unsigned long cpumask[0]; /* balance mask */
};
struct sched_group {
@@ -1054,10 +1054,9 @@ static inline struct cpumask *sched_group_cpus(struct sched_group *sg)
}
/*
- * cpumask masking which cpus in the group are allowed to iterate up the domain
- * tree.
+ * See build_balance_mask().
*/
-static inline struct cpumask *sched_group_mask(struct sched_group *sg)
+static inline struct cpumask *group_balance_mask(struct sched_group *sg)
{
return to_cpumask(sg->sgc->cpumask);
}
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bf53a99eb511..070191f02035 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -86,9 +86,9 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
cpumask_pr_args(sched_group_cpus(group)));
if ((sd->flags & SD_OVERLAP) &&
- !cpumask_equal(sched_group_mask(group), sched_group_cpus(group))) {
+ !cpumask_equal(group_balance_mask(group), sched_group_cpus(group))) {
printk(KERN_CONT " mask=%*pbl",
- cpumask_pr_args(sched_group_mask(group)));
+ cpumask_pr_args(group_balance_mask(group)));
}
if (group->sgc->capacity != SCHED_CAPACITY_SCALE)
@@ -497,16 +497,16 @@ enum s_alloc {
/*
* Return the canonical balance CPU for this group, this is the first CPU
- * of this group that's also in the iteration mask.
+ * of this group that's also in the balance mask.
*
- * The iteration mask are all those CPUs that could actually end up at this
- * group. See build_group_mask().
+ * The balance mask are all those CPUs that could actually end up at this
+ * group. See build_balance_mask().
*
* Also see should_we_balance().
*/
int group_balance_cpu(struct sched_group *sg)
{
- return cpumask_first(sched_group_mask(sg));
+ return cpumask_first(group_balance_mask(sg));
}
@@ -563,7 +563,7 @@ int group_balance_cpu(struct sched_group *sg)
* groups include the CPUs of Node-0, while those CPUs would not in fact ever
* end up at those groups (they would end up in group: 0-1,3).
*
- * To correct this we have to introduce the group iteration mask. This mask
+ * To correct this we have to introduce the group balance mask. This mask
* will contain those CPUs in the group that can reach this group given the
* (child) domain tree.
*
@@ -607,11 +607,8 @@ int group_balance_cpu(struct sched_group *sg)
/*
- * Build an iteration mask that can exclude certain CPUs from the upwards
- * domain traversal.
- *
- * Only CPUs that can arrive at this group should be considered to continue
- * balancing.
+ * Build the balance mask; it contains only those CPUs that can arrive at this
+ * group and should be considered to continue balancing.
*
* We do this during the group creation pass, therefore the group information
* isn't complete yet, however since each group represents a (child) domain we
@@ -619,7 +616,7 @@ int group_balance_cpu(struct sched_group *sg)
* complete).
*/
static void
-build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
+build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
{
const struct cpumask *sg_span = sched_group_cpus(sg);
struct sd_data *sdd = sd->private;
@@ -684,14 +681,14 @@ static void init_overlap_sched_group(struct sched_domain *sd,
struct cpumask *sg_span;
int cpu;
- build_group_mask(sd, sg, mask);
+ build_balance_mask(sd, sg, mask);
cpu = cpumask_first_and(sched_group_cpus(sg), mask);
sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
if (atomic_inc_return(&sg->sgc->ref) == 1)
- cpumask_copy(sched_group_mask(sg), mask);
+ cpumask_copy(group_balance_mask(sg), mask);
else
- WARN_ON_ONCE(!cpumask_equal(sched_group_mask(sg), mask));
+ WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));
/*
* Initialize sgc->capacity such that even if we mess up the
@@ -852,10 +849,10 @@ static struct sched_group *get_group(int cpu, struct sd_data *sdd)
if (child) {
cpumask_copy(sched_group_cpus(sg), sched_domain_span(child));
- cpumask_copy(sched_group_mask(sg), sched_group_cpus(sg));
+ cpumask_copy(group_balance_mask(sg), sched_group_cpus(sg));
} else {
cpumask_set_cpu(cpu, sched_group_cpus(sg));
- cpumask_set_cpu(cpu, sched_group_mask(sg));
+ cpumask_set_cpu(cpu, group_balance_mask(sg));
}
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_cpus(sg));
--
2.13.1
* [PATCH 17/26] sched/topology: Rename sched_group_cpus()
From: Mel Gorman @ 2017-07-20 21:21 UTC (permalink / raw)
To: Linux-Stable; +Cc: Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
commit ae4df9d6c935105857d9d166b615e3f17531ce6b upstream.
There's a discrepancy in naming between the sched_domain and
sched_group cpumask accessors. Since we're doing changes, fix it.
$ git grep sched_group_cpus | wc -l
28
$ git grep sched_domain_span | wc -l
38
Suggests changing sched_group_cpus() into sched_group_span():
for i in `git grep -l sched_group_cpus`
do
sed -ie 's/sched_group_cpus/sched_group_span/g' $i
done
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 20 ++++++++++----------
kernel/sched/sched.h | 4 ++--
kernel/sched/topology.c | 38 +++++++++++++++++++-------------------
3 files changed, 31 insertions(+), 31 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2472fb846c2..0e8dacd08f2f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5384,12 +5384,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
int i;
/* Skip over this group if it has no CPUs allowed */
- if (!cpumask_intersects(sched_group_cpus(group),
+ if (!cpumask_intersects(sched_group_span(group),
&p->cpus_allowed))
continue;
local_group = cpumask_test_cpu(this_cpu,
- sched_group_cpus(group));
+ sched_group_span(group));
/*
* Tally up the load of all CPUs in the group and find
@@ -5399,7 +5399,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
runnable_load = 0;
max_spare_cap = 0;
- for_each_cpu(i, sched_group_cpus(group)) {
+ for_each_cpu(i, sched_group_span(group)) {
/* Bias balancing toward cpus of our domain */
if (local_group)
load = source_load(i, load_idx);
@@ -5502,10 +5502,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
/* Check if we have any choice: */
if (group->group_weight == 1)
- return cpumask_first(sched_group_cpus(group));
+ return cpumask_first(sched_group_span(group));
/* Traverse only the allowed CPUs */
- for_each_cpu_and(i, sched_group_cpus(group), &p->cpus_allowed) {
+ for_each_cpu_and(i, sched_group_span(group), &p->cpus_allowed) {
if (idle_cpu(i)) {
struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
@@ -7111,7 +7111,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
* span the current group.
*/
- for_each_cpu(cpu, sched_group_cpus(sdg)) {
+ for_each_cpu(cpu, sched_group_span(sdg)) {
struct sched_group_capacity *sgc;
struct rq *rq = cpu_rq(cpu);
@@ -7290,7 +7290,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
memset(sgs, 0, sizeof(*sgs));
- for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
+ for_each_cpu_and(i, sched_group_span(group), env->cpus) {
struct rq *rq = cpu_rq(i);
/* Bias balancing toward cpus of our domain */
@@ -7454,7 +7454,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sg_lb_stats *sgs = &tmp_sgs;
int local_group;
- local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
+ local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(sg));
if (local_group) {
sds->local = sg;
sgs = local;
@@ -7809,7 +7809,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
unsigned long busiest_load = 0, busiest_capacity = 1;
int i;
- for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
+ for_each_cpu_and(i, sched_group_span(group), env->cpus) {
unsigned long capacity, wl;
enum fbq_type rt;
@@ -7962,7 +7962,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.sd = sd,
.dst_cpu = this_cpu,
.dst_rq = this_rq,
- .dst_grpmask = sched_group_cpus(sd->groups),
+ .dst_grpmask = sched_group_span(sd->groups),
.idle = idle,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f7c70575ae34..f8cf1d87f065 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1048,7 +1048,7 @@ struct sched_group {
unsigned long cpumask[0];
};
-static inline struct cpumask *sched_group_cpus(struct sched_group *sg)
+static inline struct cpumask *sched_group_span(struct sched_group *sg)
{
return to_cpumask(sg->cpumask);
}
@@ -1067,7 +1067,7 @@ static inline struct cpumask *group_balance_mask(struct sched_group *sg)
*/
static inline unsigned int group_first_cpu(struct sched_group *group)
{
- return cpumask_first(sched_group_cpus(group));
+ return cpumask_first(sched_group_span(group));
}
extern int group_balance_cpu(struct sched_group *sg);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 070191f02035..79895aec281e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -53,7 +53,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
printk(KERN_ERR "ERROR: domain->span does not contain "
"CPU%d\n", cpu);
}
- if (!cpumask_test_cpu(cpu, sched_group_cpus(group))) {
+ if (!cpumask_test_cpu(cpu, sched_group_span(group))) {
printk(KERN_ERR "ERROR: domain->groups does not contain"
" CPU%d\n", cpu);
}
@@ -66,27 +66,27 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
break;
}
- if (!cpumask_weight(sched_group_cpus(group))) {
+ if (!cpumask_weight(sched_group_span(group))) {
printk(KERN_CONT "\n");
printk(KERN_ERR "ERROR: empty group\n");
break;
}
if (!(sd->flags & SD_OVERLAP) &&
- cpumask_intersects(groupmask, sched_group_cpus(group))) {
+ cpumask_intersects(groupmask, sched_group_span(group))) {
printk(KERN_CONT "\n");
printk(KERN_ERR "ERROR: repeated CPUs\n");
break;
}
- cpumask_or(groupmask, groupmask, sched_group_cpus(group));
+ cpumask_or(groupmask, groupmask, sched_group_span(group));
printk(KERN_CONT " %d:{ span=%*pbl",
group->sgc->id,
- cpumask_pr_args(sched_group_cpus(group)));
+ cpumask_pr_args(sched_group_span(group)));
if ((sd->flags & SD_OVERLAP) &&
- !cpumask_equal(group_balance_mask(group), sched_group_cpus(group))) {
+ !cpumask_equal(group_balance_mask(group), sched_group_span(group))) {
printk(KERN_CONT " mask=%*pbl",
cpumask_pr_args(group_balance_mask(group)));
}
@@ -96,7 +96,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
if (group == sd->groups && sd->child &&
!cpumask_equal(sched_domain_span(sd->child),
- sched_group_cpus(group))) {
+ sched_group_span(group))) {
printk(KERN_ERR "ERROR: domain->groups does not match domain->child\n");
}
@@ -618,7 +618,7 @@ int group_balance_cpu(struct sched_group *sg)
static void
build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
{
- const struct cpumask *sg_span = sched_group_cpus(sg);
+ const struct cpumask *sg_span = sched_group_span(sg);
struct sd_data *sdd = sd->private;
struct sched_domain *sibling;
int i;
@@ -664,7 +664,7 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
if (!sg)
return NULL;
- sg_span = sched_group_cpus(sg);
+ sg_span = sched_group_span(sg);
if (sd->child)
cpumask_copy(sg_span, sched_domain_span(sd->child));
else
@@ -682,7 +682,7 @@ static void init_overlap_sched_group(struct sched_domain *sd,
int cpu;
build_balance_mask(sd, sg, mask);
- cpu = cpumask_first_and(sched_group_cpus(sg), mask);
+ cpu = cpumask_first_and(sched_group_span(sg), mask);
sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
if (atomic_inc_return(&sg->sgc->ref) == 1)
@@ -695,7 +695,7 @@ static void init_overlap_sched_group(struct sched_domain *sd,
* domains and no possible iteration will get us here, we won't
* die on a /0 trap.
*/
- sg_span = sched_group_cpus(sg);
+ sg_span = sched_group_span(sg);
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
}
@@ -737,7 +737,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
if (!sg)
goto fail;
- sg_span = sched_group_cpus(sg);
+ sg_span = sched_group_span(sg);
cpumask_or(covered, covered, sg_span);
init_overlap_sched_group(sd, sg);
@@ -848,14 +848,14 @@ static struct sched_group *get_group(int cpu, struct sd_data *sdd)
atomic_inc(&sg->sgc->ref);
if (child) {
- cpumask_copy(sched_group_cpus(sg), sched_domain_span(child));
- cpumask_copy(group_balance_mask(sg), sched_group_cpus(sg));
+ cpumask_copy(sched_group_span(sg), sched_domain_span(child));
+ cpumask_copy(group_balance_mask(sg), sched_group_span(sg));
} else {
- cpumask_set_cpu(cpu, sched_group_cpus(sg));
+ cpumask_set_cpu(cpu, sched_group_span(sg));
cpumask_set_cpu(cpu, group_balance_mask(sg));
}
- sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_cpus(sg));
+ sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
return sg;
@@ -890,7 +890,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
sg = get_group(i, sdd);
- cpumask_or(covered, covered, sched_group_cpus(sg));
+ cpumask_or(covered, covered, sched_group_span(sg));
if (!first)
first = sg;
@@ -923,12 +923,12 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
do {
int cpu, max_cpu = -1;
- sg->group_weight = cpumask_weight(sched_group_cpus(sg));
+ sg->group_weight = cpumask_weight(sched_group_span(sg));
if (!(sd->flags & SD_ASYM_PACKING))
goto next;
- for_each_cpu(cpu, sched_group_cpus(sg)) {
+ for_each_cpu(cpu, sched_group_span(sg)) {
if (max_cpu < 0)
max_cpu = cpu;
else if (sched_asym_prefer(cpu, max_cpu))
--
2.13.1
* [PATCH 18/26] vtime, sched/cputime: Remove vtime_account_user()
From: Mel Gorman @ 2017-07-20 21:21 UTC (permalink / raw)
To: Linux-Stable; +Cc: Mel Gorman
From: Frederic Weisbecker <fweisbec@gmail.com>
commit 1c3eda01a79b8e9237d91c52c5a75b20983f47c6 upstream.
It's an unnecessary function between vtime_user_exit() and
account_user_time().
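In other words, the patch collapses a pure pass-through; a sketch of the call chain (not code from the patch):
	before: vtime_user_exit() -> vtime_account_user() -> account_user_time()
	after:  vtime_user_exit() -> account_user_time()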
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
include/linux/vtime.h | 9 +--------
kernel/sched/cputime.c | 12 ++++++------
2 files changed, 7 insertions(+), 14 deletions(-)
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 0681fe25abeb..18b405e3cd93 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -67,19 +67,12 @@ static inline void vtime_account_system(struct task_struct *tsk) { }
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
extern void arch_vtime_task_switch(struct task_struct *tsk);
-extern void vtime_account_user(struct task_struct *tsk);
extern void vtime_user_enter(struct task_struct *tsk);
-
-static inline void vtime_user_exit(struct task_struct *tsk)
-{
- vtime_account_user(tsk);
-}
-
+extern void vtime_user_exit(struct task_struct *tsk);
extern void vtime_guest_enter(struct task_struct *tsk);
extern void vtime_guest_exit(struct task_struct *tsk);
extern void vtime_init_idle(struct task_struct *tsk, int cpu);
#else /* !CONFIG_VIRT_CPU_ACCOUNTING_GEN */
-static inline void vtime_account_user(struct task_struct *tsk) { }
static inline void vtime_user_enter(struct task_struct *tsk) { }
static inline void vtime_user_exit(struct task_struct *tsk) { }
static inline void vtime_guest_enter(struct task_struct *tsk) { }
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index aea3135c5d90..5e080caf125e 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -724,21 +724,21 @@ void vtime_account_system(struct task_struct *tsk)
write_seqcount_end(&tsk->vtime_seqcount);
}
-void vtime_account_user(struct task_struct *tsk)
+void vtime_user_enter(struct task_struct *tsk)
{
write_seqcount_begin(&tsk->vtime_seqcount);
- tsk->vtime_snap_whence = VTIME_SYS;
if (vtime_delta(tsk))
- account_user_time(tsk, get_vtime_delta(tsk));
+ __vtime_account_system(tsk);
+ tsk->vtime_snap_whence = VTIME_USER;
write_seqcount_end(&tsk->vtime_seqcount);
}
-void vtime_user_enter(struct task_struct *tsk)
+void vtime_user_exit(struct task_struct *tsk)
{
write_seqcount_begin(&tsk->vtime_seqcount);
+ tsk->vtime_snap_whence = VTIME_SYS;
if (vtime_delta(tsk))
- __vtime_account_system(tsk);
- tsk->vtime_snap_whence = VTIME_USER;
+ account_user_time(tsk, get_vtime_delta(tsk));
write_seqcount_end(&tsk->vtime_seqcount);
}
--
2.13.1
* [PATCH 19/26] sched/cputime: Always set tsk->vtime_snap_whence after accounting vtime
From: Mel Gorman @ 2017-07-20 21:21 UTC (permalink / raw)
To: Linux-Stable; +Cc: Mel Gorman
From: Frederic Weisbecker <fweisbec@gmail.com>
commit 9fa57cf5a5c4aed1e45879b335fe433048709327 upstream.
Even though it doesn't have functional consequences, setting
the task's new context state after we actually accounted the pending
vtime from the old context state makes more sense from a review
perspective.
vtime_user_exit() is the only function that doesn't follow that rule,
and it can puzzle the reviewer for a little while until they realize
there is no reason for this special case.
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/cputime.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5e080caf125e..db7ef1037f39 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -736,9 +736,9 @@ void vtime_user_enter(struct task_struct *tsk)
void vtime_user_exit(struct task_struct *tsk)
{
write_seqcount_begin(&tsk->vtime_seqcount);
- tsk->vtime_snap_whence = VTIME_SYS;
if (vtime_delta(tsk))
account_user_time(tsk, get_vtime_delta(tsk));
+ tsk->vtime_snap_whence = VTIME_SYS;
write_seqcount_end(&tsk->vtime_seqcount);
}
--
2.13.1
* [PATCH 20/26] sched/cputime: Rename vtime fields
From: Mel Gorman @ 2017-07-20 21:21 UTC (permalink / raw)
To: Linux-Stable; +Cc: Mel Gorman
From: Frederic Weisbecker <fweisbec@gmail.com>
commit 60a9ce57e7c5ac1df3a39fb941022bbfa40c0862 upstream.
The current "snapshot" based naming on vtime fields suggests we record
some past event but that's a low level picture of their actual purpose
which comes out blurry. The real point of these fields is to run a basic
state machine that tracks down cputime entry while switching between
contexts.
So lets reflect that with more meaningful names.
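For reference, the state machine these fields drive looks roughly like this (my reading of the code being touched, not part of the patch):
	VTIME_INACTIVE --(task switch in)----> VTIME_SYS
	VTIME_SYS      --(vtime_user_enter)--> VTIME_USER
	VTIME_USER     --(vtime_user_exit)---> VTIME_SYS
	any state      --(task switch out)---> VTIME_INACTIVE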
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
include/linux/init_task.h | 4 ++--
include/linux/sched.h | 4 ++--
kernel/fork.c | 4 ++--
kernel/sched/cputime.c | 30 +++++++++++++++---------------
4 files changed, 21 insertions(+), 21 deletions(-)
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index e049526bc188..3d537331cd4e 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -171,8 +171,8 @@ extern struct cred init_cred;
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
# define INIT_VTIME(tsk) \
.vtime_seqcount = SEQCNT_ZERO(tsk.vtime_seqcount), \
- .vtime_snap = 0, \
- .vtime_snap_whence = VTIME_SYS,
+ .vtime_starttime = 0, \
+ .vtime_state = VTIME_SYS,
#else
# define INIT_VTIME(tsk)
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2b69fc650201..8555d7b8f93d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -671,7 +671,7 @@ struct task_struct {
struct prev_cputime prev_cputime;
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
seqcount_t vtime_seqcount;
- unsigned long long vtime_snap;
+ unsigned long long vtime_starttime;
enum {
/* Task is sleeping or running in a CPU with VTIME inactive: */
VTIME_INACTIVE = 0,
@@ -679,7 +679,7 @@ struct task_struct {
VTIME_USER,
/* Task runs in kernelspace in a CPU with VTIME active: */
VTIME_SYS,
- } vtime_snap_whence;
+ } vtime_state;
#endif
#ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/fork.c b/kernel/fork.c
index e53770d2bf95..83c4f9bf3e14 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1638,8 +1638,8 @@ static __latent_entropy struct task_struct *copy_process(
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
seqcount_init(&p->vtime_seqcount);
- p->vtime_snap = 0;
- p->vtime_snap_whence = VTIME_INACTIVE;
+ p->vtime_starttime = 0;
+ p->vtime_state = VTIME_INACTIVE;
#endif
#if defined(SPLIT_RSS_COUNTING)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index db7ef1037f39..6b152c243079 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -683,10 +683,10 @@ static u64 vtime_delta(struct task_struct *tsk)
{
unsigned long now = READ_ONCE(jiffies);
- if (time_before(now, (unsigned long)tsk->vtime_snap))
+ if (time_before(now, (unsigned long)tsk->vtime_starttime))
return 0;
- return jiffies_to_nsecs(now - tsk->vtime_snap);
+ return jiffies_to_nsecs(now - tsk->vtime_starttime);
}
static u64 get_vtime_delta(struct task_struct *tsk)
@@ -701,10 +701,10 @@ static u64 get_vtime_delta(struct task_struct *tsk)
* elapsed time. Limit account_other_time to prevent rounding
* errors from causing elapsed vtime to go negative.
*/
- delta = jiffies_to_nsecs(now - tsk->vtime_snap);
+ delta = jiffies_to_nsecs(now - tsk->vtime_starttime);
other = account_other_time(delta);
- WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
- tsk->vtime_snap = now;
+ WARN_ON_ONCE(tsk->vtime_state == VTIME_INACTIVE);
+ tsk->vtime_starttime = now;
return delta - other;
}
@@ -746,7 +746,7 @@ void vtime_guest_enter(struct task_struct *tsk)
{
/*
* The flags must be updated under the lock with
- * the vtime_snap flush and update.
+ * the vtime_starttime flush and update.
* That enforces a right ordering and update sequence
* synchronization against the reader (task_gtime())
* that can thus safely catch up with a tickless delta.
@@ -776,12 +776,12 @@ void vtime_account_idle(struct task_struct *tsk)
void arch_vtime_task_switch(struct task_struct *prev)
{
write_seqcount_begin(&prev->vtime_seqcount);
- prev->vtime_snap_whence = VTIME_INACTIVE;
+ prev->vtime_state = VTIME_INACTIVE;
write_seqcount_end(&prev->vtime_seqcount);
write_seqcount_begin(¤t->vtime_seqcount);
- current->vtime_snap_whence = VTIME_SYS;
- current->vtime_snap = jiffies;
+ current->vtime_state = VTIME_SYS;
+ current->vtime_starttime = jiffies;
write_seqcount_end(¤t->vtime_seqcount);
}
@@ -791,8 +791,8 @@ void vtime_init_idle(struct task_struct *t, int cpu)
local_irq_save(flags);
write_seqcount_begin(&t->vtime_seqcount);
- t->vtime_snap_whence = VTIME_SYS;
- t->vtime_snap = jiffies;
+ t->vtime_state = VTIME_SYS;
+ t->vtime_starttime = jiffies;
write_seqcount_end(&t->vtime_seqcount);
local_irq_restore(flags);
}
@@ -809,7 +809,7 @@ u64 task_gtime(struct task_struct *t)
seq = read_seqcount_begin(&t->vtime_seqcount);
gtime = t->gtime;
- if (t->vtime_snap_whence == VTIME_SYS && t->flags & PF_VCPU)
+ if (t->vtime_state == VTIME_SYS && t->flags & PF_VCPU)
gtime += vtime_delta(t);
} while (read_seqcount_retry(&t->vtime_seqcount, seq));
@@ -840,7 +840,7 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
*stime = t->stime;
/* Task is sleeping, nothing to add */
- if (t->vtime_snap_whence == VTIME_INACTIVE || is_idle_task(t))
+ if (t->vtime_state == VTIME_INACTIVE || is_idle_task(t))
continue;
delta = vtime_delta(t);
@@ -849,9 +849,9 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
* Task runs either in user or kernel space, add pending nohz time to
* the right place.
*/
- if (t->vtime_snap_whence == VTIME_USER || t->flags & PF_VCPU)
+ if (t->vtime_state == VTIME_USER || t->flags & PF_VCPU)
*utime += delta;
- else if (t->vtime_snap_whence == VTIME_SYS)
+ else if (t->vtime_state == VTIME_SYS)
*stime += delta;
} while (read_seqcount_retry(&t->vtime_seqcount, seq));
}
--
2.13.1
* [PATCH 21/26] sched/cputime: Move the vtime task fields to their own struct
From: Mel Gorman @ 2017-07-20 21:21 UTC (permalink / raw)
To: Linux-Stable; +Cc: Mel Gorman
From: Frederic Weisbecker <fweisbec@gmail.com>
commit bac5b6b6b11560f323e71d0ebac4061cfe5f56c0 upstream.
We are about to add vtime accumulation fields to the task struct. Let's
avoid further bloat and gather the vtime information into its own
struct.
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
include/linux/init_task.h | 6 +--
include/linux/sched.h | 26 ++++++-----
kernel/fork.c | 6 +--
kernel/sched/cputime.c | 112 ++++++++++++++++++++++++++--------------------
4 files changed, 86 insertions(+), 64 deletions(-)
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 3d537331cd4e..a2f6707e9fc0 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -170,9 +170,9 @@ extern struct cred init_cred;
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
# define INIT_VTIME(tsk) \
- .vtime_seqcount = SEQCNT_ZERO(tsk.vtime_seqcount), \
- .vtime_starttime = 0, \
- .vtime_state = VTIME_SYS,
+ .vtime.seqcount = SEQCNT_ZERO(tsk.vtime.seqcount), \
+ .vtime.starttime = 0, \
+ .vtime.state = VTIME_SYS,
#else
# define INIT_VTIME(tsk)
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8555d7b8f93d..21f44ee3f8c6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -223,6 +223,21 @@ struct task_cputime {
#define prof_exp stime
#define sched_exp sum_exec_runtime
+enum vtime_state {
+ /* Task is sleeping or running in a CPU with VTIME inactive: */
+ VTIME_INACTIVE = 0,
+ /* Task runs in userspace in a CPU with VTIME active: */
+ VTIME_USER,
+ /* Task runs in kernelspace in a CPU with VTIME active: */
+ VTIME_SYS,
+};
+
+struct vtime {
+ seqcount_t seqcount;
+ unsigned long long starttime;
+ enum vtime_state state;
+};
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
@@ -670,16 +685,7 @@ struct task_struct {
u64 gtime;
struct prev_cputime prev_cputime;
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
- seqcount_t vtime_seqcount;
- unsigned long long vtime_starttime;
- enum {
- /* Task is sleeping or running in a CPU with VTIME inactive: */
- VTIME_INACTIVE = 0,
- /* Task runs in userspace in a CPU with VTIME active: */
- VTIME_USER,
- /* Task runs in kernelspace in a CPU with VTIME active: */
- VTIME_SYS,
- } vtime_state;
+ struct vtime vtime;
#endif
#ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/fork.c b/kernel/fork.c
index 83c4f9bf3e14..d927ec11aa7a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1637,9 +1637,9 @@ static __latent_entropy struct task_struct *copy_process(
prev_cputime_init(&p->prev_cputime);
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
- seqcount_init(&p->vtime_seqcount);
- p->vtime_starttime = 0;
- p->vtime_state = VTIME_INACTIVE;
+ seqcount_init(&p->vtime.seqcount);
+ p->vtime.starttime = 0;
+ p->vtime.state = VTIME_INACTIVE;
#endif
#if defined(SPLIT_RSS_COUNTING)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 6b152c243079..b28d3122465f 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -679,17 +679,17 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
#endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-static u64 vtime_delta(struct task_struct *tsk)
+static u64 vtime_delta(struct vtime *vtime)
{
unsigned long now = READ_ONCE(jiffies);
- if (time_before(now, (unsigned long)tsk->vtime_starttime))
+ if (time_before(now, (unsigned long)vtime->starttime))
return 0;
- return jiffies_to_nsecs(now - tsk->vtime_starttime);
+ return jiffies_to_nsecs(now - vtime->starttime);
}
-static u64 get_vtime_delta(struct task_struct *tsk)
+static u64 get_vtime_delta(struct vtime *vtime)
{
unsigned long now = READ_ONCE(jiffies);
u64 delta, other;
@@ -701,49 +701,56 @@ static u64 get_vtime_delta(struct task_struct *tsk)
* elapsed time. Limit account_other_time to prevent rounding
* errors from causing elapsed vtime to go negative.
*/
- delta = jiffies_to_nsecs(now - tsk->vtime_starttime);
+ delta = jiffies_to_nsecs(now - vtime->starttime);
other = account_other_time(delta);
- WARN_ON_ONCE(tsk->vtime_state == VTIME_INACTIVE);
- tsk->vtime_starttime = now;
+ WARN_ON_ONCE(vtime->state == VTIME_INACTIVE);
+ vtime->starttime = now;
return delta - other;
}
static void __vtime_account_system(struct task_struct *tsk)
{
- account_system_time(tsk, irq_count(), get_vtime_delta(tsk));
+ account_system_time(tsk, irq_count(), get_vtime_delta(&tsk->vtime));
}
void vtime_account_system(struct task_struct *tsk)
{
- if (!vtime_delta(tsk))
+ struct vtime *vtime = &tsk->vtime;
+
+ if (!vtime_delta(vtime))
return;
- write_seqcount_begin(&tsk->vtime_seqcount);
+ write_seqcount_begin(&vtime->seqcount);
__vtime_account_system(tsk);
- write_seqcount_end(&tsk->vtime_seqcount);
+ write_seqcount_end(&vtime->seqcount);
}
void vtime_user_enter(struct task_struct *tsk)
{
- write_seqcount_begin(&tsk->vtime_seqcount);
- if (vtime_delta(tsk))
+ struct vtime *vtime = &tsk->vtime;
+
+ write_seqcount_begin(&vtime->seqcount);
+ if (vtime_delta(vtime))
__vtime_account_system(tsk);
- tsk->vtime_snap_whence = VTIME_USER;
- write_seqcount_end(&tsk->vtime_seqcount);
+ vtime->state = VTIME_USER;
+ write_seqcount_end(&vtime->seqcount);
}
void vtime_user_exit(struct task_struct *tsk)
{
- write_seqcount_begin(&tsk->vtime_seqcount);
- if (vtime_delta(tsk))
- account_user_time(tsk, get_vtime_delta(tsk));
- tsk->vtime_snap_whence = VTIME_SYS;
- write_seqcount_end(&tsk->vtime_seqcount);
+ struct vtime *vtime = &tsk->vtime;
+
+ write_seqcount_begin(&vtime->seqcount);
+ if (vtime_delta(vtime))
+ account_user_time(tsk, get_vtime_delta(vtime));
+ vtime->state = VTIME_SYS;
+ write_seqcount_end(&vtime->seqcount);
}
void vtime_guest_enter(struct task_struct *tsk)
{
+ struct vtime *vtime = &tsk->vtime;
/*
* The flags must be updated under the lock with
* the vtime_starttime flush and update.
@@ -751,54 +758,62 @@ void vtime_guest_enter(struct task_struct *tsk)
* synchronization against the reader (task_gtime())
* that can thus safely catch up with a tickless delta.
*/
- write_seqcount_begin(&tsk->vtime_seqcount);
- if (vtime_delta(tsk))
+ write_seqcount_begin(&vtime->seqcount);
+ if (vtime_delta(vtime))
__vtime_account_system(tsk);
current->flags |= PF_VCPU;
- write_seqcount_end(&tsk->vtime_seqcount);
+ write_seqcount_end(&vtime->seqcount);
}
EXPORT_SYMBOL_GPL(vtime_guest_enter);
void vtime_guest_exit(struct task_struct *tsk)
{
- write_seqcount_begin(&tsk->vtime_seqcount);
+ struct vtime *vtime = &tsk->vtime;
+
+ write_seqcount_begin(&vtime->seqcount);
__vtime_account_system(tsk);
current->flags &= ~PF_VCPU;
- write_seqcount_end(&tsk->vtime_seqcount);
+ write_seqcount_end(&vtime->seqcount);
}
EXPORT_SYMBOL_GPL(vtime_guest_exit);
void vtime_account_idle(struct task_struct *tsk)
{
- account_idle_time(get_vtime_delta(tsk));
+ account_idle_time(get_vtime_delta(&tsk->vtime));
}
void arch_vtime_task_switch(struct task_struct *prev)
{
- write_seqcount_begin(&prev->vtime_seqcount);
- prev->vtime_state = VTIME_INACTIVE;
- write_seqcount_end(&prev->vtime_seqcount);
+ struct vtime *vtime = &prev->vtime;
- write_seqcount_begin(¤t->vtime_seqcount);
- current->vtime_state = VTIME_SYS;
- current->vtime_starttime = jiffies;
- write_seqcount_end(¤t->vtime_seqcount);
+ write_seqcount_begin(&vtime->seqcount);
+ vtime->state = VTIME_INACTIVE;
+ write_seqcount_end(&vtime->seqcount);
+
+ vtime = ¤t->vtime;
+
+ write_seqcount_begin(&vtime->seqcount);
+ vtime->state = VTIME_SYS;
+ vtime->starttime = jiffies;
+ write_seqcount_end(&vtime->seqcount);
}
void vtime_init_idle(struct task_struct *t, int cpu)
{
+ struct vtime *vtime = &t->vtime;
unsigned long flags;
local_irq_save(flags);
- write_seqcount_begin(&t->vtime_seqcount);
- t->vtime_state = VTIME_SYS;
- t->vtime_starttime = jiffies;
- write_seqcount_end(&t->vtime_seqcount);
+ write_seqcount_begin(&vtime->seqcount);
+ vtime->state = VTIME_SYS;
+ vtime->starttime = jiffies;
+ write_seqcount_end(&vtime->seqcount);
local_irq_restore(flags);
}
u64 task_gtime(struct task_struct *t)
{
+ struct vtime *vtime = &t->vtime;
unsigned int seq;
u64 gtime;
@@ -806,13 +821,13 @@ u64 task_gtime(struct task_struct *t)
return t->gtime;
do {
- seq = read_seqcount_begin(&t->vtime_seqcount);
+ seq = read_seqcount_begin(&vtime->seqcount);
gtime = t->gtime;
- if (t->vtime_state == VTIME_SYS && t->flags & PF_VCPU)
- gtime += vtime_delta(t);
+ if (vtime->state == VTIME_SYS && t->flags & PF_VCPU)
+ gtime += vtime_delta(vtime);
- } while (read_seqcount_retry(&t->vtime_seqcount, seq));
+ } while (read_seqcount_retry(&vtime->seqcount, seq));
return gtime;
}
@@ -824,8 +839,9 @@ u64 task_gtime(struct task_struct *t)
*/
void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
{
- u64 delta;
+ struct vtime *vtime = &t->vtime;
unsigned int seq;
+ u64 delta;
if (!vtime_accounting_enabled()) {
*utime = t->utime;
@@ -834,25 +850,25 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
}
do {
- seq = read_seqcount_begin(&t->vtime_seqcount);
+ seq = read_seqcount_begin(&vtime->seqcount);
*utime = t->utime;
*stime = t->stime;
/* Task is sleeping, nothing to add */
- if (t->vtime_state == VTIME_INACTIVE || is_idle_task(t))
+ if (vtime->state == VTIME_INACTIVE || is_idle_task(t))
continue;
- delta = vtime_delta(t);
+ delta = vtime_delta(vtime);
/*
* Task runs either in user or kernel space, add pending nohz time to
* the right place.
*/
- if (t->vtime_state == VTIME_USER || t->flags & PF_VCPU)
+ if (vtime->state == VTIME_USER || t->flags & PF_VCPU)
*utime += delta;
- else if (t->vtime_state == VTIME_SYS)
+ else if (vtime->state == VTIME_SYS)
*stime += delta;
- } while (read_seqcount_retry(&t->vtime_seqcount, seq));
+ } while (read_seqcount_retry(&vtime->seqcount, seq));
}
#endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
--
2.13.1
* [PATCH 22/26] sched/cputime: Accumulate vtime on top of nsec clocksource
From: Mel Gorman @ 2017-07-20 21:21 UTC (permalink / raw)
To: Linux-Stable; +Cc: Mel Gorman
From: Wanpeng Li <wanpeng.li@hotmail.com>
commit 2a42eb9594a1480b4ead9e036e06ee1290e5fa6d upstream.
Currently the cputime source used by vtime is jiffies. When we cross
a context boundary and jiffies have changed since the last snapshot, the
pending cputime is accounted to the switching out context.
This system works fine if the ticks are not aligned across CPUs. If they
instead are aligned (i.e. they all fire at the same time) and the CPUs run in
userspace, the jiffies change is only observed on tick exit and therefore
the user cputime is accounted as system cputime. This is because the
CPU that maintains timekeeping fires its tick at the same time as the
others. It updates jiffies in the middle of the tick and the other CPUs
see that update on IRQ exit:
    CPU 0 (timekeeper)                 CPU 1
    -------------------                -------------
                       jiffies = N
    ...                                run in userspace for a jiffy
    tick entry                         tick entry (sees jiffies = N)
    set jiffies = N + 1
    tick exit                          tick exit (sees jiffies = N + 1)
                                       account 1 jiffy as stime
Fix this by using a nanosecond clock source instead of jiffies. The
cputime is then accumulated and flushed every time the pending delta
reaches a jiffy, in order to mitigate the accounting overhead.
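The accumulate-and-flush idea reduces to a few lines; a minimal sketch of the system-time case, mirroring the hunk below (the same pattern is applied to utime and gtime):
	vtime->stime += get_vtime_delta(vtime);	/* nsec elapsed since starttime */
	if (vtime->stime >= TICK_NSEC) {	/* a full jiffy's worth is pending */
		account_system_time(tsk, irq_count(), vtime->stime);
		vtime->stime = 0;		/* flushed; restart accumulation */
	}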
[ fweisbec: changelog, rebase on struct vtime, field renames, add delta
on cputime readers, keep idle vtime as-is (low overhead accounting),
harmonize clock sources. ]
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
include/linux/sched.h | 3 +++
kernel/sched/cputime.c | 64 +++++++++++++++++++++++++++++++++-----------------
2 files changed, 45 insertions(+), 22 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 21f44ee3f8c6..7f2a1eff2997 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -236,6 +236,9 @@ struct vtime {
seqcount_t seqcount;
unsigned long long starttime;
enum vtime_state state;
+ u64 utime;
+ u64 stime;
+ u64 gtime;
};
struct sched_info {
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b28d3122465f..3dafea579263 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -681,18 +681,19 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
static u64 vtime_delta(struct vtime *vtime)
{
- unsigned long now = READ_ONCE(jiffies);
+ unsigned long long clock;
- if (time_before(now, (unsigned long)vtime->starttime))
+ clock = sched_clock_cpu(smp_processor_id());
+ if (clock < vtime->starttime)
return 0;
- return jiffies_to_nsecs(now - vtime->starttime);
+ return clock - vtime->starttime;
}
static u64 get_vtime_delta(struct vtime *vtime)
{
- unsigned long now = READ_ONCE(jiffies);
- u64 delta, other;
+ u64 delta = vtime_delta(vtime);
+ u64 other;
/*
* Unlike tick based timing, vtime based timing never has lost
@@ -701,17 +702,31 @@ static u64 get_vtime_delta(struct vtime *vtime)
* elapsed time. Limit account_other_time to prevent rounding
* errors from causing elapsed vtime to go negative.
*/
- delta = jiffies_to_nsecs(now - vtime->starttime);
other = account_other_time(delta);
WARN_ON_ONCE(vtime->state == VTIME_INACTIVE);
- vtime->starttime = now;
+ vtime->starttime += delta;
return delta - other;
}
-static void __vtime_account_system(struct task_struct *tsk)
+static void __vtime_account_system(struct task_struct *tsk,
+ struct vtime *vtime)
{
- account_system_time(tsk, irq_count(), get_vtime_delta(&tsk->vtime));
+ vtime->stime += get_vtime_delta(vtime);
+ if (vtime->stime >= TICK_NSEC) {
+ account_system_time(tsk, irq_count(), vtime->stime);
+ vtime->stime = 0;
+ }
+}
+
+static void vtime_account_guest(struct task_struct *tsk,
+ struct vtime *vtime)
+{
+ vtime->gtime += get_vtime_delta(vtime);
+ if (vtime->gtime >= TICK_NSEC) {
+ account_guest_time(tsk, vtime->gtime);
+ vtime->gtime = 0;
+ }
}
void vtime_account_system(struct task_struct *tsk)
@@ -722,7 +737,11 @@ void vtime_account_system(struct task_struct *tsk)
return;
write_seqcount_begin(&vtime->seqcount);
- __vtime_account_system(tsk);
+ /* We might have scheduled out from guest path */
+ if (current->flags & PF_VCPU)
+ vtime_account_guest(tsk, vtime);
+ else
+ __vtime_account_system(tsk, vtime);
write_seqcount_end(&vtime->seqcount);
}
@@ -731,8 +750,7 @@ void vtime_user_enter(struct task_struct *tsk)
struct vtime *vtime = &tsk->vtime;
write_seqcount_begin(&vtime->seqcount);
- if (vtime_delta(vtime))
- __vtime_account_system(tsk);
+ __vtime_account_system(tsk, vtime);
vtime->state = VTIME_USER;
write_seqcount_end(&vtime->seqcount);
}
@@ -742,8 +760,11 @@ void vtime_user_exit(struct task_struct *tsk)
struct vtime *vtime = &tsk->vtime;
write_seqcount_begin(&vtime->seqcount);
- if (vtime_delta(vtime))
- account_user_time(tsk, get_vtime_delta(vtime));
+ vtime->utime += get_vtime_delta(vtime);
+ if (vtime->utime >= TICK_NSEC) {
+ account_user_time(tsk, vtime->utime);
+ vtime->utime = 0;
+ }
vtime->state = VTIME_SYS;
write_seqcount_end(&vtime->seqcount);
}
@@ -759,8 +780,7 @@ void vtime_guest_enter(struct task_struct *tsk)
* that can thus safely catch up with a tickless delta.
*/
write_seqcount_begin(&vtime->seqcount);
- if (vtime_delta(vtime))
- __vtime_account_system(tsk);
+ __vtime_account_system(tsk, vtime);
current->flags |= PF_VCPU;
write_seqcount_end(&vtime->seqcount);
}
@@ -771,7 +791,7 @@ void vtime_guest_exit(struct task_struct *tsk)
struct vtime *vtime = &tsk->vtime;
write_seqcount_begin(&vtime->seqcount);
- __vtime_account_system(tsk);
+ vtime_account_guest(tsk, vtime);
current->flags &= ~PF_VCPU;
write_seqcount_end(&vtime->seqcount);
}
@@ -794,7 +814,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
write_seqcount_begin(&vtime->seqcount);
vtime->state = VTIME_SYS;
- vtime->starttime = jiffies;
+ vtime->starttime = sched_clock_cpu(smp_processor_id());
write_seqcount_end(&vtime->seqcount);
}
@@ -806,7 +826,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
local_irq_save(flags);
write_seqcount_begin(&vtime->seqcount);
vtime->state = VTIME_SYS;
- vtime->starttime = jiffies;
+ vtime->starttime = sched_clock_cpu(cpu);
write_seqcount_end(&vtime->seqcount);
local_irq_restore(flags);
}
@@ -825,7 +845,7 @@ u64 task_gtime(struct task_struct *t)
gtime = t->gtime;
if (vtime->state == VTIME_SYS && t->flags & PF_VCPU)
- gtime += vtime_delta(vtime);
+ gtime += vtime->gtime + vtime_delta(vtime);
} while (read_seqcount_retry(&vtime->seqcount, seq));
@@ -866,9 +886,9 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
* the right place.
*/
if (vtime->state == VTIME_USER || t->flags & PF_VCPU)
- *utime += delta;
+ *utime += vtime->utime + delta;
else if (vtime->state == VTIME_SYS)
- *stime += delta;
+ *stime += vtime->stime + delta;
} while (read_seqcount_retry(&vtime->seqcount, seq));
}
#endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
--
2.13.1
* [PATCH 23/26] sched/fair: Fix load_balance() affinity redo path
From: Mel Gorman @ 2017-07-20 21:21 UTC (permalink / raw)
To: Linux-Stable; +Cc: Mel Gorman
From: Jeffrey Hugo <jhugo@codeaurora.org>
commit 65a4433aebe36c8c6abeb69b99ef00274b971c6c upstream.
If load_balance() fails to migrate any tasks because all tasks were
affined, load_balance() removes the source CPU from consideration and
attempts to redo and balance among the new subset of CPUs.
There is a bug in this code path where the algorithm considers all active
CPUs in the system (minus the source that was just masked out). This is
not valid for two reasons: some active CPUs may not be in the current
scheduling domain and one of the active CPUs is dst_cpu. These CPUs should
not be considered, as we cannot pull load from them.
Instead of failing out of load_balance(), we may end up redoing the search
with no valid CPUs and incorrectly concluding the domain is balanced.
Additionally, if the group_imbalance flag was just set, it may also be
incorrectly unset, thus the flag will not be seen by other CPUs in future
load_balance() runs as that algorithm intends.
Fix the check by removing CPUs not in the current domain and the dst_cpu
from consideration, thus limiting the evaluation to the valid remaining
CPUs from which load might be migrated.
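As a worked example (hypothetical CPU numbering): if the domain spans CPUs 0-3, dst_cpu's group is {0,1} and busiest CPU 3 had all of its tasks pinned, then after CPU 3 is cleared the candidate mask {0,1,2} is not a subset of dst_grpmask {0,1}, so CPU 2 remains a valid pull source and the redo proceeds. Were CPU 2 masked out as well, {0,1} would be a subset of dst_grpmask and the redo correctly terminates.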
Co-authored-by: Austin Christ <austinwc@codeaurora.org>
Co-authored-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Austin Christ <austinwc@codeaurora.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Timur Tabi <timur@codeaurora.org>
Link: http://lkml.kernel.org/r/1496863138-11322-2-git-send-email-jhugo@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 32 ++++++++++++++++++++------------
1 file changed, 20 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bafa4e04b850..5f8c91bdac3c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6619,10 +6619,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* our sched_group. We may want to revisit it if we couldn't
* meet load balance goals by pulling other tasks on src_cpu.
*
- * Also avoid computing new_dst_cpu if we have already computed
- * one in current iteration.
+ * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
+ * already computed one in current iteration.
*/
- if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
+ if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
return 0;
/* Prevent to re-select dst_cpu via env's cpus */
@@ -7973,14 +7973,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.tasks = LIST_HEAD_INIT(env.tasks),
};
- /*
- * For NEWLY_IDLE load_balancing, we don't need to consider
- * other cpus in our group
- */
- if (idle == CPU_NEWLY_IDLE)
- env.dst_grpmask = NULL;
-
- cpumask_copy(cpus, cpu_active_mask);
+ cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
schedstat_inc(sd->lb_count[idle]);
@@ -8102,7 +8095,15 @@ static int load_balance(int this_cpu, struct rq *this_rq,
/* All tasks on this runqueue were pinned by CPU affinity */
if (unlikely(env.flags & LBF_ALL_PINNED)) {
cpumask_clear_cpu(cpu_of(busiest), cpus);
- if (!cpumask_empty(cpus)) {
+ /*
+ * Attempting to continue load balancing at the current
+ * sched_domain level only makes sense if there are
+ * active CPUs remaining as possible busiest CPUs to
+ * pull load from which are not contained within the
+ * destination group that is receiving any migrated
+ * load.
+ */
+ if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = sched_nr_migrate_break;
goto redo;
@@ -8398,6 +8399,13 @@ static int active_load_balance_cpu_stop(void *data)
.src_cpu = busiest_rq->cpu,
.src_rq = busiest_rq,
.idle = CPU_IDLE,
+ /*
+ * can_migrate_task() doesn't need to compute new_dst_cpu
+ * for active balancing. Since we have CPU_IDLE, but no
+ * @dst_grpmask we need to make that test go away with lying
+ * about DST_PINNED.
+ */
+ .flags = LBF_DST_PINNED,
};
schedstat_inc(sd->alb_count);
--
2.13.1
* [PATCH 24/26] percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
From: Mel Gorman @ 2017-07-20 21:21 UTC (permalink / raw)
To: Linux-Stable; +Cc: Mel Gorman
From: Nikolay Borisov <nborisov@suse.com>
commit 104b4e5139fe384431ac11c3b8a6cf4a529edf4a upstream.
Currently, percpu_counter_add is a wrapper around __percpu_counter_add
which is preempt-safe due to explicit calls to preempt_disable. Given
how the __ prefix is used in percpu-related interfaces, the naming
unfortunately creates the false sense that __percpu_counter_add is
less safe than percpu_counter_add. In terms of context-safety,
they're equivalent. The only difference is that the __ version takes
a batch parameter.
Make this a bit more explicit by just renaming __percpu_counter_add to
percpu_counter_add_batch.
This patch doesn't cause any functional changes.
tj: Minor updates to patch description for clarity. Cosmetic
indentation updates.
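For callers the rename reads as follows; a minimal sketch (assumes the counter was set up with percpu_counter_init(), and 64 is an arbitrary batch size):
	struct percpu_counter counter;
	percpu_counter_add(&counter, 1);		/* implicit default batch */
	percpu_counter_add_batch(&counter, 1, 64);	/* formerly __percpu_counter_add() */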
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: linux-mm@kvack.org
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
fs/btrfs/disk-io.c | 12 ++++++------
fs/btrfs/extent_io.c | 6 +++---
fs/btrfs/inode.c | 8 ++++----
fs/xfs/xfs_mount.c | 4 ++--
include/linux/backing-dev.h | 2 +-
include/linux/blk-cgroup.h | 6 +++---
include/linux/mman.h | 2 +-
include/linux/percpu_counter.h | 7 ++++---
include/net/inet_frag.h | 4 ++--
lib/flex_proportions.c | 6 +++---
lib/percpu_counter.c | 4 ++--
11 files changed, 31 insertions(+), 30 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5f678dcb20e6..db58d06c4e75 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1255,9 +1255,9 @@ void clean_tree_block(struct btrfs_fs_info *fs_info,
btrfs_assert_tree_locked(buf);
if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)) {
- __percpu_counter_add(&fs_info->dirty_metadata_bytes,
- -buf->len,
- fs_info->dirty_metadata_batch);
+ percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
+ -buf->len,
+ fs_info->dirty_metadata_batch);
/* ugh, clear_extent_buffer_dirty needs to lock the page */
btrfs_set_lock_blocking(buf);
clear_extent_buffer_dirty(buf);
@@ -4049,9 +4049,9 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
buf->start, transid, fs_info->generation);
was_dirty = set_extent_buffer_dirty(buf);
if (!was_dirty)
- __percpu_counter_add(&fs_info->dirty_metadata_bytes,
- buf->len,
- fs_info->dirty_metadata_batch);
+ percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
+ buf->len,
+ fs_info->dirty_metadata_batch);
#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
if (btrfs_header_level(buf) == 0 && check_leaf(root, buf)) {
btrfs_print_leaf(fs_info, buf);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d3619e010005..3ef90e91d8be 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3597,9 +3597,9 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
spin_unlock(&eb->refs_lock);
btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
- __percpu_counter_add(&fs_info->dirty_metadata_bytes,
- -eb->len,
- fs_info->dirty_metadata_batch);
+ percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
+ -eb->len,
+ fs_info->dirty_metadata_batch);
ret = 1;
} else {
spin_unlock(&eb->refs_lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ef3c98c527c1..a760b8d8e39c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1766,8 +1766,8 @@ static void btrfs_set_bit_hook(struct inode *inode,
if (btrfs_is_testing(fs_info))
return;
- __percpu_counter_add(&fs_info->delalloc_bytes, len,
- fs_info->delalloc_batch);
+ percpu_counter_add_batch(&fs_info->delalloc_bytes, len,
+ fs_info->delalloc_batch);
spin_lock(&BTRFS_I(inode)->lock);
BTRFS_I(inode)->delalloc_bytes += len;
if (*bits & EXTENT_DEFRAG)
@@ -1840,8 +1840,8 @@ static void btrfs_clear_bit_hook(struct btrfs_inode *inode,
&inode->vfs_inode,
state->start, len);
- __percpu_counter_add(&fs_info->delalloc_bytes, -len,
- fs_info->delalloc_batch);
+ percpu_counter_add_batch(&fs_info->delalloc_bytes, -len,
+ fs_info->delalloc_batch);
spin_lock(&inode->lock);
inode->delalloc_bytes -= len;
if (do_list && inode->delalloc_bytes == 0 &&
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 2eaf81859166..7147d4a8d207 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1209,7 +1209,7 @@ xfs_mod_icount(
struct xfs_mount *mp,
int64_t delta)
{
- __percpu_counter_add(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
+ percpu_counter_add_batch(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
if (__percpu_counter_compare(&mp->m_icount, 0, XFS_ICOUNT_BATCH) < 0) {
ASSERT(0);
percpu_counter_add(&mp->m_icount, -delta);
@@ -1288,7 +1288,7 @@ xfs_mod_fdblocks(
else
batch = XFS_FDBLOCKS_BATCH;
- __percpu_counter_add(&mp->m_fdblocks, delta, batch);
+ percpu_counter_add_batch(&mp->m_fdblocks, delta, batch);
if (__percpu_counter_compare(&mp->m_fdblocks, mp->m_alloc_set_aside,
XFS_FDBLOCKS_BATCH) >= 0) {
/* we had space! */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 557d84063934..ace73f96eb1e 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -66,7 +66,7 @@ static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi)
static inline void __add_wb_stat(struct bdi_writeback *wb,
enum wb_stat_item item, s64 amount)
{
- __percpu_counter_add(&wb->stat[item], amount, WB_STAT_BATCH);
+ percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
}
static inline void __inc_wb_stat(struct bdi_writeback *wb,
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 01b62e7bac74..7104bea8dab1 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -518,7 +518,7 @@ static inline void blkg_stat_exit(struct blkg_stat *stat)
*/
static inline void blkg_stat_add(struct blkg_stat *stat, uint64_t val)
{
- __percpu_counter_add(&stat->cpu_cnt, val, BLKG_STAT_CPU_BATCH);
+ percpu_counter_add_batch(&stat->cpu_cnt, val, BLKG_STAT_CPU_BATCH);
}
/**
@@ -597,14 +597,14 @@ static inline void blkg_rwstat_add(struct blkg_rwstat *rwstat,
else
cnt = &rwstat->cpu_cnt[BLKG_RWSTAT_READ];
- __percpu_counter_add(cnt, val, BLKG_STAT_CPU_BATCH);
+ percpu_counter_add_batch(cnt, val, BLKG_STAT_CPU_BATCH);
if (op_is_sync(op))
cnt = &rwstat->cpu_cnt[BLKG_RWSTAT_SYNC];
else
cnt = &rwstat->cpu_cnt[BLKG_RWSTAT_ASYNC];
- __percpu_counter_add(cnt, val, BLKG_STAT_CPU_BATCH);
+ percpu_counter_add_batch(cnt, val, BLKG_STAT_CPU_BATCH);
}
/**
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 634c4c51fe3a..c8367041fafd 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -22,7 +22,7 @@ unsigned long vm_memory_committed(void);
static inline void vm_acct_memory(long pages)
{
- __percpu_counter_add(&vm_committed_as, pages, vm_committed_as_batch);
+ percpu_counter_add_batch(&vm_committed_as, pages, vm_committed_as_batch);
}
static inline void vm_unacct_memory(long pages)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 84a109449610..ec065387f443 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -39,7 +39,8 @@ int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
void percpu_counter_destroy(struct percpu_counter *fbc);
void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
-void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
+void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
+ s32 batch);
s64 __percpu_counter_sum(struct percpu_counter *fbc);
int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
@@ -50,7 +51,7 @@ static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
{
- __percpu_counter_add(fbc, amount, percpu_counter_batch);
+ percpu_counter_add_batch(fbc, amount, percpu_counter_batch);
}
static inline s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
@@ -136,7 +137,7 @@ percpu_counter_add(struct percpu_counter *fbc, s64 amount)
}
static inline void
-__percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
+percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
{
percpu_counter_add(fbc, amount);
}
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 5894730ec82a..5932e6de8fc0 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -154,12 +154,12 @@ static inline int frag_mem_limit(struct netns_frags *nf)
static inline void sub_frag_mem_limit(struct netns_frags *nf, int i)
{
- __percpu_counter_add(&nf->mem, -i, frag_percpu_counter_batch);
+ percpu_counter_add_batch(&nf->mem, -i, frag_percpu_counter_batch);
}
static inline void add_frag_mem_limit(struct netns_frags *nf, int i)
{
- __percpu_counter_add(&nf->mem, i, frag_percpu_counter_batch);
+ percpu_counter_add_batch(&nf->mem, i, frag_percpu_counter_batch);
}
static inline unsigned int sum_frag_mem_limit(struct netns_frags *nf)
diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
index a71cf1bdd4c9..2cc1f94e03a1 100644
--- a/lib/flex_proportions.c
+++ b/lib/flex_proportions.c
@@ -207,7 +207,7 @@ static void fprop_reflect_period_percpu(struct fprop_global *p,
if (val < (nr_cpu_ids * PROP_BATCH))
val = percpu_counter_sum(&pl->events);
- __percpu_counter_add(&pl->events,
+ percpu_counter_add_batch(&pl->events,
-val + (val >> (period-pl->period)), PROP_BATCH);
} else
percpu_counter_set(&pl->events, 0);
@@ -219,7 +219,7 @@ static void fprop_reflect_period_percpu(struct fprop_global *p,
void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl)
{
fprop_reflect_period_percpu(p, pl);
- __percpu_counter_add(&pl->events, 1, PROP_BATCH);
+ percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
percpu_counter_add(&p->events, 1);
}
@@ -267,6 +267,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p,
return;
} else
fprop_reflect_period_percpu(p, pl);
- __percpu_counter_add(&pl->events, 1, PROP_BATCH);
+ percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
percpu_counter_add(&p->events, 1);
}
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 9c21000df0b5..8ee7e5ec21be 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -72,7 +72,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
}
EXPORT_SYMBOL(percpu_counter_set);
-void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
+void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
{
s64 count;
@@ -89,7 +89,7 @@ void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
}
preempt_enable();
}
-EXPORT_SYMBOL(__percpu_counter_add);
+EXPORT_SYMBOL(percpu_counter_add_batch);
/*
* Add up all the per-cpu counts, return the result. This is a more accurate
--
2.13.1
* [PATCH 25/26] writeback: rework wb_[dec|inc]_stat family of functions
2017-07-20 21:21 [PATCH 00/26] Performance-related backports for 4.12.2 Mel Gorman
` (23 preceding siblings ...)
2017-07-20 21:21 ` [PATCH 24/26] percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch Mel Gorman
@ 2017-07-20 21:21 ` Mel Gorman
2017-07-20 21:21 ` [PATCH 26/26] kernel/fork.c: virtually mapped stacks: do not disable interrupts Mel Gorman
2017-07-24 16:44 ` [PATCH 00/26] Performance-related backports for 4.12.2 Mel Gorman
26 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2017-07-20 21:21 UTC (permalink / raw)
To: Linux-Stable; +Cc: Mel Gorman
From: Nikolay Borisov <nborisov@suse.com>
commit 3e8f399da490e6ac20a3cfd6aa404c9aa961a9a2 upstream.
Currently the writeback statistics code uses percpu counters to hold
various statistics. Furthermore, we have two families of functions: those
which disable local irqs, and those which don't and whose names begin
with a double underscore. However, both families end up calling
__add_wb_stat, which in turn calls percpu_counter_add_batch, which is
already irq-safe.
Exploiting this fact allows us to eliminate the __wb_* functions, since
they add no protection beyond what we already have.
Furthermore, refactor the wb_* functions to call __add_wb_stat directly,
without the irq-disabling dance. This will likely improve the runtime of
code which modifies the stat counters.
While at it, also document why percpu_counter_add_batch is in fact
preempt- and irq-safe, since at least three people have been confused by it.
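For reference, a simplified sketch of percpu_counter_add_batch() -- close
to, but not verbatim from, lib/percpu_counter.c -- shows where that
safety comes from:

	void percpu_counter_add_batch(struct percpu_counter *fbc,
				      s64 amount, s32 batch)
	{
		s64 count;

		preempt_disable();
		count = __this_cpu_read(*fbc->counters) + amount;
		if (count >= batch || count <= -batch) {
			unsigned long flags;

			/* Slow path: fold the local delta into the
			 * global count under an irq-safe spinlock. */
			raw_spin_lock_irqsave(&fbc->lock, flags);
			fbc->count += count;
			__this_cpu_sub(*fbc->counters, count - amount);
			raw_spin_unlock_irqrestore(&fbc->lock, flags);
		} else {
			/* Fast path: this_cpu_add() is a single
			 * irq-safe per-cpu operation. */
			this_cpu_add(*fbc->counters, amount);
		}
		preempt_enable();
	}

Neither path needs the caller to disable interrupts, which is what makes
the irq-disabling wb_stat wrappers removed below redundant.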
Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
fs/fs-writeback.c | 8 ++++----
include/linux/backing-dev.h | 24 ++----------------------
lib/percpu_counter.c | 7 +++++++
mm/page-writeback.c | 10 +++++-----
4 files changed, 18 insertions(+), 31 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 63ee2940775c..309364aab2a5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -380,8 +380,8 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
struct page *page = radix_tree_deref_slot_protected(slot,
&mapping->tree_lock);
if (likely(page) && PageDirty(page)) {
- __dec_wb_stat(old_wb, WB_RECLAIMABLE);
- __inc_wb_stat(new_wb, WB_RECLAIMABLE);
+ dec_wb_stat(old_wb, WB_RECLAIMABLE);
+ inc_wb_stat(new_wb, WB_RECLAIMABLE);
}
}
@@ -391,8 +391,8 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
&mapping->tree_lock);
if (likely(page)) {
WARN_ON_ONCE(!PageWriteback(page));
- __dec_wb_stat(old_wb, WB_WRITEBACK);
- __inc_wb_stat(new_wb, WB_WRITEBACK);
+ dec_wb_stat(old_wb, WB_WRITEBACK);
+ inc_wb_stat(new_wb, WB_WRITEBACK);
}
}
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ace73f96eb1e..e9c967b86054 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -69,34 +69,14 @@ static inline void __add_wb_stat(struct bdi_writeback *wb,
percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
}
-static inline void __inc_wb_stat(struct bdi_writeback *wb,
- enum wb_stat_item item)
-{
- __add_wb_stat(wb, item, 1);
-}
-
static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
- unsigned long flags;
-
- local_irq_save(flags);
- __inc_wb_stat(wb, item);
- local_irq_restore(flags);
-}
-
-static inline void __dec_wb_stat(struct bdi_writeback *wb,
- enum wb_stat_item item)
-{
- __add_wb_stat(wb, item, -1);
+ __add_wb_stat(wb, item, 1);
}
static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
- unsigned long flags;
-
- local_irq_save(flags);
- __dec_wb_stat(wb, item);
- local_irq_restore(flags);
+ __add_wb_stat(wb, item, -1);
}
static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 8ee7e5ec21be..3bf4a9984f4c 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -72,6 +72,13 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
}
EXPORT_SYMBOL(percpu_counter_set);
+/**
+ * This function is both preempt and irq safe. The former is due to explicit
+ * preemption disable. The latter is guaranteed by the fact that the slow path
+ * is explicitly protected by an irq-safe spinlock whereas the fast path uses
+ * this_cpu_add which is irq-safe by definition. Hence there is no need to
+ * muck with irq state before calling this one.
+ */
void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
{
s64 count;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 143c1c25d680..b7451891959a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -601,7 +601,7 @@ static inline void __wb_writeout_inc(struct bdi_writeback *wb)
{
struct wb_domain *cgdom;
- __inc_wb_stat(wb, WB_WRITTEN);
+ inc_wb_stat(wb, WB_WRITTEN);
wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
wb->bdi->max_prop_frac);
@@ -2437,8 +2437,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
__inc_node_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
__inc_node_page_state(page, NR_DIRTIED);
- __inc_wb_stat(wb, WB_RECLAIMABLE);
- __inc_wb_stat(wb, WB_DIRTIED);
+ inc_wb_stat(wb, WB_RECLAIMABLE);
+ inc_wb_stat(wb, WB_DIRTIED);
task_io_account_write(PAGE_SIZE);
current->nr_dirtied++;
this_cpu_inc(bdp_ratelimits);
@@ -2745,7 +2745,7 @@ int test_clear_page_writeback(struct page *page)
if (bdi_cap_account_writeback(bdi)) {
struct bdi_writeback *wb = inode_to_wb(inode);
- __dec_wb_stat(wb, WB_WRITEBACK);
+ dec_wb_stat(wb, WB_WRITEBACK);
__wb_writeout_inc(wb);
}
}
@@ -2791,7 +2791,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi))
- __inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
+ inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
/*
* We can come through here when swapping anonymous
--
2.13.1
* [PATCH 26/26] kernel/fork.c: virtually mapped stacks: do not disable interrupts
2017-07-20 21:21 [PATCH 00/26] Performance-related backports for 4.12.2 Mel Gorman
` (24 preceding siblings ...)
2017-07-20 21:21 ` [PATCH 25/26] writeback: rework wb_[dec|inc]_stat family of functions Mel Gorman
@ 2017-07-20 21:21 ` Mel Gorman
2017-07-24 16:44 ` [PATCH 00/26] Performance-related backports for 4.12.2 Mel Gorman
26 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2017-07-20 21:21 UTC (permalink / raw)
To: Linux-Stable; +Cc: Mel Gorman
From: Christoph Lameter <cl@linux.com>
commit 112166f88cf83dd11486cf1818672d42b540865b upstream.
The reason for disabling interrupts seems to be to avoid switching to a
different processor while handling per-cpu data using individual loads and
stores. If we use per-cpu RMW (read-modify-write) primitives instead, we
will not have to disable interrupts.
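A sketch of the difference, using a hypothetical per-cpu pointer slot
rather than the exact code below:

	struct vm_struct *s;	/* object being taken from the cache */

	/*
	 * Old pattern: separate read and write. An interrupt or
	 * preemption between the two could migrate us to another CPU
	 * or race with another context, hence the irq-off section.
	 */
	local_irq_disable();
	s = this_cpu_read(slot);
	if (s)
		this_cpu_write(slot, NULL);
	local_irq_enable();

	/*
	 * New pattern: one per-cpu read-modify-write primitive performs
	 * the read and the write atomically with respect to the local
	 * CPU, so no irq fiddling is needed.
	 */
	s = this_cpu_xchg(slot, NULL);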
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705171055130.5898@east.gentwo.org
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/fork.c | 16 +++++-----------
1 file changed, 5 insertions(+), 11 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index e53770d2bf95..d07d1562f314 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -205,19 +205,17 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
void *stack;
int i;
- local_irq_disable();
for (i = 0; i < NR_CACHED_STACKS; i++) {
- struct vm_struct *s = this_cpu_read(cached_stacks[i]);
+ struct vm_struct *s;
+
+ s = this_cpu_xchg(cached_stacks[i], NULL);
if (!s)
continue;
- this_cpu_write(cached_stacks[i], NULL);
tsk->stack_vm_area = s;
- local_irq_enable();
return s->addr;
}
- local_irq_enable();
stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
VMALLOC_START, VMALLOC_END,
@@ -245,19 +243,15 @@ static inline void free_thread_stack(struct task_struct *tsk)
{
#ifdef CONFIG_VMAP_STACK
if (task_stack_vm_area(tsk)) {
- unsigned long flags;
int i;
- local_irq_save(flags);
for (i = 0; i < NR_CACHED_STACKS; i++) {
- if (this_cpu_read(cached_stacks[i]))
+ if (this_cpu_cmpxchg(cached_stacks[i],
+ NULL, tsk->stack_vm_area) != NULL)
continue;
- this_cpu_write(cached_stacks[i], tsk->stack_vm_area);
- local_irq_restore(flags);
return;
}
- local_irq_restore(flags);
vfree_atomic(tsk->stack);
return;
--
2.13.1
* Re: [PATCH 00/26] Performance-related backports for 4.12.2
2017-07-20 21:21 [PATCH 00/26] Performance-related backports for 4.12.2 Mel Gorman
` (25 preceding siblings ...)
2017-07-20 21:21 ` [PATCH 26/26] kernel/fork.c: virtually mapped stacks: do not disable interrupts Mel Gorman
@ 2017-07-24 16:44 ` Mel Gorman
2017-07-24 23:29 ` Greg KH
26 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2017-07-24 16:44 UTC (permalink / raw)
To: Linux-Stable
On Thu, Jul 20, 2017 at 10:21:18PM +0100, Mel Gorman wrote:
> This is a second round of performance-related backports based on low-hanging
> fruit in the 4.13 merge window based on 4.12.2.
>
> As before, these have only been tested on 4.12-stable. While they may
> merge against older kernels, I have no data on how it behaves and cannot
> guarantee it's a good idea so I don't recommend it. There will also be
> some major conflicts that are not trivial to resolve.
>
> For most of the tests I conducted, the impact is marginal but patches the
> first two sets of patches are important for large machines and for uses
> of nohz_full. The load balancing patch is fairly specific but measurable.
> The removal of unnecessary IRQ disabling/enabling is borderline in terms of
> performance but they are trivial patches and avoiding unnecessary expensive
> operations is always a plus.
>
With 4.12.3, patches 1-17 can be dropped. The main commit from this
series that is still missing is "sched/topology: Fix overlapping
sched_group_capacity", which is relatively minor in impact. The rest were
there to bring the schedulers more or less in line so that problems in
4.12-stable would be easier to compare against mainline and the 4.12
scheduler would be easier to understand.
The rest of the series should apply cleanly on top of 4.12.3 if you'd
like to pick it up.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 00/26] Performance-related backports for 4.12.2
2017-07-24 16:44 ` [PATCH 00/26] Performance-related backports for 4.12.2 Mel Gorman
@ 2017-07-24 23:29 ` Greg KH
2017-07-25 8:14 ` Mel Gorman
0 siblings, 1 reply; 31+ messages in thread
From: Greg KH @ 2017-07-24 23:29 UTC (permalink / raw)
To: Mel Gorman; +Cc: Linux-Stable
On Mon, Jul 24, 2017 at 05:44:01PM +0100, Mel Gorman wrote:
> On Thu, Jul 20, 2017 at 10:21:18PM +0100, Mel Gorman wrote:
> > This is a second round of performance-related backports based on low-hanging
> > fruit in the 4.13 merge window based on 4.12.2.
> >
> > As before, these have only been tested on 4.12-stable. While they may
> > merge against older kernels, I have no data on how it behaves and cannot
> > guarantee it's a good idea so I don't recommend it. There will also be
> > some major conflicts that are not trivial to resolve.
> >
> > For most of the tests I conducted, the impact is marginal but patches the
> > first two sets of patches are important for large machines and for uses
> > of nohz_full. The load balancing patch is fairly specific but measurable.
> > The removal of unnecessary IRQ disabling/enabling is borderline in terms of
> > performance but they are trivial patches and avoiding unnecessary expensive
> > operations is always a plus.
> >
>
> With 4.12.3, the patches 1-17 can be dropped.
Really? Why, what happened in .3 that made the need for those 17
patches just "go away"?
> The main commit left over
> that is missing from this series was "sched/topology: Fix overlapping
> sched_group_capacity" which is relatively minor in impact. The rest were
> to bring the schedulers more or less in line so debugging problems in
> 4.12-stable would be easier to compare with mainline and to make the 4.12
> scheduler easier to understand.
>
> The rest of the series should apply ok on top of 4.12.3 if you'd like to
> pick it up.
So just patches 18-26?
thanks,
greg k-h
* Re: [PATCH 00/26] Performance-related backports for 4.12.2
2017-07-24 23:29 ` Greg KH
@ 2017-07-25 8:14 ` Mel Gorman
2017-07-25 15:21 ` Greg KH
0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2017-07-25 8:14 UTC (permalink / raw)
To: Greg KH; +Cc: Linux-Stable
On Mon, Jul 24, 2017 at 04:29:05PM -0700, Greg KH wrote:
> On Mon, Jul 24, 2017 at 05:44:01PM +0100, Mel Gorman wrote:
> > On Thu, Jul 20, 2017 at 10:21:18PM +0100, Mel Gorman wrote:
> > > This is a second round of performance-related backports based on low-hanging
> > > fruit in the 4.13 merge window based on 4.12.2.
> > >
> > > As before, these have only been tested on 4.12-stable. While they may
> > > merge against older kernels, I have no data on how it behaves and cannot
> > > guarantee it's a good idea so I don't recommend it. There will also be
> > > some major conflicts that are not trivial to resolve.
> > >
> > > For most of the tests I conducted, the impact is marginal but patches the
> > > first two sets of patches are important for large machines and for uses
> > > of nohz_full. The load balancing patch is fairly specific but measurable.
> > > The removal of unnecessary IRQ disabling/enabling is borderline in terms of
> > > performance but they are trivial patches and avoiding unnecessary expensive
> > > operations is always a plus.
> > >
> >
> > With 4.12.3, the patches 1-17 can be dropped.
>
> Really? Why, what happened in .3 that make the need for those 17
> patches just "go away"?
>
One of the biggest motivations for the series was to bring in these
commits, which are now in -stable:
d5ba2df79bc6ae02927d5465e93eb635809e6db5 sched/topology: Fix overlapping sched_group_mask
3397c1f0371a49f52635c9e60e552ef3c91fa10f sched/topology: Optimize build_group_mask()
e64918b4b7c3bac472198d7b59b6092a78092169 sched/topology: Fix building of overlapping sched-groups
The series now needs to be rebased on top of them to deal with minor
collisions but, once that is done, only one notable scheduler topology
fix remains -- "sched/topology: Fix overlapping sched_group_capacity" --
and it is borderline. The bulk of the rest are cleanups,
micro-optimisations, simplifications and documentation so the whole
thing is understandable. The rebase is trivial (collisions with
comments) but we can also live without those patches.
> > The main commit left over
> > that is missing from this series was "sched/topology: Fix overlapping
> > sched_group_capacity" which is relatively minor in impact. The rest were
> > to bring the schedulers more or less in line so debugging problems in
> > 4.12-stable would be easier to compare with mainline and to make the 4.12
> > scheduler easier to understand.
> >
> > The rest of the series should apply ok on top of 4.12.3 if you'd like to
> > pick it up.
>
> So just patches 18-26?
>
Yes, that will then be the accounting fixes for nohz_full users, a load
balancer fix for when some processes are pinned to CPUs, and a few minor
optimisations that avoid unnecessarily disabling IRQs.
Thanks.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 00/26] Performance-related backports for 4.12.2
2017-07-25 8:14 ` Mel Gorman
@ 2017-07-25 15:21 ` Greg KH
0 siblings, 0 replies; 31+ messages in thread
From: Greg KH @ 2017-07-25 15:21 UTC (permalink / raw)
To: Mel Gorman; +Cc: Linux-Stable
On Tue, Jul 25, 2017 at 09:14:35AM +0100, Mel Gorman wrote:
> On Mon, Jul 24, 2017 at 04:29:05PM -0700, Greg KH wrote:
> > On Mon, Jul 24, 2017 at 05:44:01PM +0100, Mel Gorman wrote:
> > > On Thu, Jul 20, 2017 at 10:21:18PM +0100, Mel Gorman wrote:
> > > > This is a second round of performance-related backports based on low-hanging
> > > > fruit in the 4.13 merge window based on 4.12.2.
> > > >
> > > > As before, these have only been tested on 4.12-stable. While they may
> > > > merge against older kernels, I have no data on how it behaves and cannot
> > > > guarantee it's a good idea so I don't recommend it. There will also be
> > > > some major conflicts that are not trivial to resolve.
> > > >
> > > > For most of the tests I conducted, the impact is marginal but patches the
> > > > first two sets of patches are important for large machines and for uses
> > > > of nohz_full. The load balancing patch is fairly specific but measurable.
> > > > The removal of unnecessary IRQ disabling/enabling is borderline in terms of
> > > > performance but they are trivial patches and avoiding unnecessary expensive
> > > > operations is always a plus.
> > > >
> > >
> > > With 4.12.3, the patches 1-17 can be dropped.
> >
> > Really? Why, what happened in .3 that make the need for those 17
> > patches just "go away"?
> >
>
> One of the biggest motivation of the series was to bring in these
> commits that are now in -stable
>
> d5ba2df79bc6ae02927d5465e93eb635809e6db5 sched/topology: Fix overlapping sched_group_mask
> 3397c1f0371a49f52635c9e60e552ef3c91fa10f sched/topology: Optimize build_group_mask()
> e64918b4b7c3bac472198d7b59b6092a78092169 sched/topology: Fix building of overlapping sched-groups
>
> The series now needs to be rebased on top to deal with minor collisions
> but once it is, there is only one notable fix left related to scheduler
> topology -- "sched/topology: Fix overlapping sched_group_capacity" and it
> is borderline. The bulk of the rest are cleanups, micro-optimisations,
> simplifications and then documentation so the whole thing is
> understandable. The rebase is trivial (collisions with comments) but we
> can also live without them.
>
> > > The main commit left over
> > > that is missing from this series was "sched/topology: Fix overlapping
> > > sched_group_capacity" which is relatively minor in impact. The rest were
> > > to bring the schedulers more or less in line so debugging problems in
> > > 4.12-stable would be easier to compare with mainline and to make the 4.12
> > > scheduler easier to understand.
> > >
> > > The rest of the series should apply ok on top of 4.12.3 if you'd like to
> > > pick it up.
> >
> > So just patches 18-26?
> >
>
> Yes, that will then be the accounting fixes for nohz_full users, a load
> balancer fix when some processes are pinned to CPUs and a few minor
> optimisations that avoid unnecessary disabling of IRQs.
Ok, I've queued patches 18-26 up now, thanks!
greg k-h
Thread overview:
2017-07-20 21:21 [PATCH 00/26] Performance-related backports for 4.12.2 Mel Gorman
2017-07-20 21:21 ` [PATCH 01/26] sched/topology: Refactor function build_overlap_sched_groups() Mel Gorman
2017-07-20 21:21 ` [PATCH 02/26] sched/topology: Fix building of overlapping sched-groups Mel Gorman
2017-07-20 21:21 ` [PATCH 03/26] sched/topology: Simplify build_overlap_sched_groups() Mel Gorman
2017-07-20 21:21 ` [PATCH 04/26] sched/debug: Print the scheduler topology group mask Mel Gorman
2017-07-20 21:21 ` [PATCH 05/26] sched/topology: Verify the first group matches the child domain Mel Gorman
2017-07-20 21:21 ` [PATCH 06/26] sched/topology: Optimize build_group_mask() Mel Gorman
2017-07-20 21:21 ` [PATCH 07/26] sched/topology: Move comment about asymmetric node setups Mel Gorman
2017-07-20 21:21 ` [PATCH 08/26] sched/topology: Remove FORCE_SD_OVERLAP Mel Gorman
2017-07-20 21:21 ` [PATCH 09/26] sched/topology: Fix overlapping sched_group_mask Mel Gorman
2017-07-20 21:21 ` [PATCH 10/26] sched/topology: Small cleanup Mel Gorman
2017-07-20 21:21 ` [PATCH 11/26] sched/topology: Add sched_group_capacity debugging Mel Gorman
2017-07-20 21:21 ` [PATCH 12/26] sched/topology: Fix overlapping sched_group_capacity Mel Gorman
2017-07-20 21:21 ` [PATCH 13/26] sched/topology: Add a few comments Mel Gorman
2017-07-20 21:21 ` [PATCH 14/26] sched/topology: Rewrite get_group() Mel Gorman
2017-07-20 21:21 ` [PATCH 15/26] sched/topology: Simplify sched_group_mask() usage Mel Gorman
2017-07-20 21:21 ` [PATCH 16/26] sched/topology: Rename sched_group_mask() Mel Gorman
2017-07-20 21:21 ` [PATCH 17/26] sched/topology: Rename sched_group_cpus() Mel Gorman
2017-07-20 21:21 ` [PATCH 18/26] vtime, sched/cputime: Remove vtime_account_user() Mel Gorman
2017-07-20 21:21 ` [PATCH 19/26] sched/cputime: Always set tsk->vtime_snap_whence after accounting vtime Mel Gorman
2017-07-20 21:21 ` [PATCH 20/26] sched/cputime: Rename vtime fields Mel Gorman
2017-07-20 21:21 ` [PATCH 21/26] sched/cputime: Move the vtime task fields to their own struct Mel Gorman
2017-07-20 21:21 ` [PATCH 22/26] sched/cputime: Accumulate vtime on top of nsec clocksource Mel Gorman
2017-07-20 21:21 ` [PATCH 23/26] sched/fair: Fix load_balance() affinity redo path Mel Gorman
2017-07-20 21:21 ` [PATCH 24/26] percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch Mel Gorman
2017-07-20 21:21 ` [PATCH 25/26] writeback: rework wb_[dec|inc]_stat family of functions Mel Gorman
2017-07-20 21:21 ` [PATCH 26/26] kernel/fork.c: virtually mapped stacks: do not disable interrupts Mel Gorman
2017-07-24 16:44 ` [PATCH 00/26] Performance-related backports for 4.12.2 Mel Gorman
2017-07-24 23:29 ` Greg KH
2017-07-25 8:14 ` Mel Gorman
2017-07-25 15:21 ` Greg KH