* [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier
@ 2018-09-26  3:58 Rik van Riel
  2018-09-26  3:58 ` [PATCH 1/7] x86/mm/tlb: Always use lazy TLB mode Rik van Riel
                   ` (8 more replies)
  0 siblings, 9 replies; 27+ messages in thread
From: Rik van Riel @ 2018-09-26  3:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, kernel-team, will.deacon, songliubraving, mingo, luto,
	hpa, npiggin

Linus asked me to come up with a smaller patch set to get the benefits
of lazy TLB mode, so I spent some time trying out various permutations
of the code with a few workloads that do lots of context switches and
also happen to trigger a fair number of TLB flushes a second.

Both of the workloads tested are memcache-style workloads, running
on two-socket systems. One of the workloads has around 300,000
context switches a second, and around 19,000 TLB flushes a second.

The first patch in the series, which switches the kernel to always
use lazy TLB mode, reduces CPU use by around 1% on both Haswell and
Broadwell systems.

The rest of the series reduces the number of TLB flush IPIs by
about 1,500 a second, resulting in a 0.2% reduction in CPU use,
on top of the 1% from just enabling lazy TLB mode.

These patches are the low-hanging fruit in the context switch code.

The big thing remaining is the reference count overhead of
the lazy TLB mm_struct, but getting rid of that is rather a
lot of code for a small performance gain. Not quite what
Linus asked for :)

This v2 is "identical" to the version I posted yesterday,
except this one is actually against current -tip (not sure
what went wrong before), with a number of relevant patches
on top:
- tip x86/core
	012e77a903d ("x86/nmi: Fix NMI uaccess race against CR3 switching")
- arm64 tlb/asm-generic (entire branch)
- peterz queue mm/tlb
	12b2b80ec6f4 ("x86/mm: Page size aware flush_tlb_mm_range()")



* [PATCH 4/7] x86,tlb: make lazy TLB mode lazier
@ 2018-07-16 19:03 Rik van Riel
  2018-07-17  9:35 ` [tip:x86/mm] x86/mm/tlb: Make " tip-bot for Rik van Riel
  0 siblings, 1 reply; 27+ messages in thread
From: Rik van Riel @ 2018-07-16 19:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, luto, efault, kernel-team, mingo, dave.hansen, Rik van Riel

Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
when all it really needs to do is reload %CR3 at the next context switch,
assuming no page table pages got freed.
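
For illustration, entering lazy TLB mode after this series boils down
to setting a per-CPU flag instead of switching to init_mm; a minimal
sketch along the lines of patch 1 of the series (assuming the
cpu_tlbstate layout used here):

void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
	/* If init_mm is loaded there are no user TLB entries to track. */
	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
		return;

	/* Keep the user mm loaded; just mark this CPU as lazy. */
	this_cpu_write(cpu_tlbstate.is_lazy, true);
}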

Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
which checks whether .tlb_gen changed, and the TLB invalidation code, which
increments .tlb_gen whenever page table entries get invalidated.

The atomic increment in inc_mm_tlb_gen is its own barrier; the context
switch code adds an explicit barrier between reading tlbstate.is_lazy and
next->context.tlb_gen.
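
For reference, the invalidation side is roughly the following sketch
of inc_mm_tlb_gen (from arch/x86/include/asm/tlbflush.h); the atomic
read-modify-write implies a full barrier on x86, which is what pairs
with the explicit barrier in the context switch code:

static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
{
	/*
	 * Bump the mm's TLB generation; switch_mm_irqs_off() compares
	 * this against the per-ASID tlb_gen to decide whether the TLB
	 * needs flushing before being reused.
	 */
	return atomic64_inc_return(&mm->context.tlb_gen);
}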

Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set
are not removed from the mm_cpumask(mm), since that would prevent the TLB
flush IPIs at page table free time from being sent to all the CPUs
that need them.

This patch reduces total CPU use in the system by about 1-2% for a
memcache workload on two socket systems, and by about 1% for a heavily
multi-process netperf between two systems.

Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/mm/tlb.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 59 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4b73fe835c95..26542cc17043 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -7,6 +7,7 @@
 #include <linux/export.h>
 #include <linux/cpu.h>
 #include <linux/debugfs.h>
+#include <linux/gfp.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -185,6 +186,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 {
 	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
 	unsigned cpu = smp_processor_id();
 	u64 next_tlb_gen;
 	bool need_flush;
@@ -242,17 +244,40 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			   next->context.ctx_id);
 
 		/*
-		 * We don't currently support having a real mm loaded without
-		 * our cpu set in mm_cpumask().  We have all the bookkeeping
-		 * in place to figure out whether we would need to flush
-		 * if our cpu were cleared in mm_cpumask(), but we don't
-		 * currently use it.
+		 * Even in lazy TLB mode, the CPU should stay set in the
+		 * mm_cpumask. The TLB shootdown code can figure out from
+		 * cpu_tlbstate.is_lazy whether or not to send an IPI.
 		 */
 		if (WARN_ON_ONCE(real_prev != &init_mm &&
 				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
 			cpumask_set_cpu(cpu, mm_cpumask(next));
 
-		return;
+		/*
+		 * If the CPU is not in lazy TLB mode, we are just switching
+		 * from one thread in a process to another thread in the same
+		 * process. No TLB flush required.
+		 */
+		if (!was_lazy)
+			return;
+
+		/*
+		 * Read the tlb_gen to check whether a flush is needed.
+		 * If the TLB is up to date, just use it.
+		 * The barrier synchronizes with the tlb_gen increment in
+		 * the TLB shootdown code.
+		 */
+		smp_mb();
+		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
+				next_tlb_gen)
+			return;
+
+		/*
+		 * TLB contents went out of date while we were in lazy
+		 * mode. Fall through to the TLB switching code below.
+		 */
+		new_asid = prev_asid;
+		need_flush = true;
 	} else {
 		u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id);
 
@@ -454,6 +479,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * paging-structure cache to avoid speculatively reading
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
+		 *
+		 * This should be rare, with native_flush_tlb_others skipping
+		 * IPIs to lazy TLB mode CPUs.
 		 */
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
 		return;
@@ -560,6 +588,9 @@ static void flush_tlb_func_remote(void *info)
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info)
 {
+	cpumask_var_t lazymask;
+	unsigned int cpu;
+
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -583,8 +614,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		 * that UV should be updated so that smp_call_function_many(),
 		 * etc, are optimal on UV.
 		 */
-		unsigned int cpu;
-
 		cpu = smp_processor_id();
 		cpumask = uv_flush_tlb_others(cpumask, info);
 		if (cpumask)
@@ -592,8 +621,29 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 					       (void *)info, 1);
 		return;
 	}
-	smp_call_function_many(cpumask, flush_tlb_func_remote,
+
+	/*
+	 * A temporary cpumask is used in order to skip sending IPIs
+	 * to CPUs in lazy TLB state, while keeping them in mm_cpumask(mm).
+	 * If the allocation fails, simply IPI every CPU in mm_cpumask.
+	 */
+	if (!alloc_cpumask_var(&lazymask, GFP_ATOMIC)) {
+		smp_call_function_many(cpumask, flush_tlb_func_remote,
+			       (void *)info, 1);
+		return;
+	}
+
+	cpumask_copy(lazymask, cpumask);
+
+	for_each_cpu(cpu, lazymask) {
+		if (per_cpu(cpu_tlbstate.is_lazy, cpu))
+			cpumask_clear_cpu(cpu, lazymask);
+	}
+
+	smp_call_function_many(lazymask, flush_tlb_func_remote,
 			       (void *)info, 1);
+
+	free_cpumask_var(lazymask);
 }
 
 /*
-- 
2.14.4



Thread overview: 27+ messages
2018-09-26  3:58 [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Rik van Riel
2018-09-26  3:58 ` [PATCH 1/7] x86/mm/tlb: Always use lazy TLB mode Rik van Riel
2018-10-01 15:58   ` Peter Zijlstra
2018-10-01 16:07     ` Rik van Riel
2018-09-26  3:58 ` [PATCH 2/7] x86/mm/tlb: Restructure switch_mm_irqs_off() Rik van Riel
2018-10-02  7:32   ` Peter Zijlstra
2018-09-26  3:58 ` [PATCH 3/7] smp: use __cpumask_set_cpu in on_each_cpu_cond Rik van Riel
2018-10-09 14:59   ` [tip:x86/mm] " tip-bot for Rik van Riel
2018-09-26  3:58 ` [PATCH 4/7] smp,cpumask: introduce on_each_cpu_cond_mask Rik van Riel
2018-10-09 14:59   ` [tip:x86/mm] " tip-bot for Rik van Riel
2018-09-26  3:58 ` [PATCH 5/7] Add freed_tables argument to flush_tlb_mm_range Rik van Riel
2018-10-09 15:00   ` [tip:x86/mm] x86/mm/tlb: " tip-bot for Rik van Riel
2018-09-26  3:58 ` [PATCH 6/7] Add freed_tables element to flush_tlb_info Rik van Riel
2018-10-09 15:00   ` [tip:x86/mm] x86/mm/tlb: " tip-bot for Rik van Riel
2018-09-26  3:58 ` [PATCH 7/7] x86/mm/tlb: Make lazy TLB mode lazier Rik van Riel
2018-10-01 16:07   ` Peter Zijlstra
2018-10-09 15:01   ` [tip:x86/mm] " tip-bot for Rik van Riel
2018-10-01 16:09 ` [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Peter Zijlstra
2018-10-02  7:44 ` Peter Zijlstra
2018-10-02 13:41   ` Rik van Riel
  -- strict thread matches above, loose matches on Subject: below --
2018-07-16 19:03 [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel
2018-07-17  9:35 ` [tip:x86/mm] x86/mm/tlb: Make " tip-bot for Rik van Riel
2018-07-17 11:33   ` Peter Zijlstra
2018-07-18 15:33     ` Rik van Riel
2018-07-18 16:00       ` Peter Zijlstra
     [not found]         ` <081E558D-DB34-4A18-A35C-896BC47F6EBA@surriel.com>
2018-07-18 18:23           ` Peter Zijlstra
2018-07-18 18:51             ` Rik van Riel
2018-07-19  9:13               ` Peter Zijlstra
